Parser - pdf
Introduction
The PDF Document Parser is an implementation of the Document Parser interface that parses the contents of PDF files into plain text. It follows the Eino: Document Loader guide and is mainly used in the following scenarios:
- When you need to convert PDF documents into a processable plain text format
- When you need to split the contents of a PDF document by page
Features
The PDF parser has the following features:
- Supports basic PDF text extraction
- Optionally splits documents by page
- Automatically handles PDF fonts and encoding
- Supports multi-page PDF documents
Notes:
- Not all PDF formats are fully supported at present
- Formatting such as spacing and line breaks is not preserved
- Complex PDF layouts may degrade extraction results
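Because spacing and line breaks are not preserved, downstream code may want to normalize whitespace in the extracted text before using it. The following is a minimal sketch using only the Go standard library; normalizeWhitespace is a hypothetical helper and is not part of the PDF parser.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWhitespace collapses every run of whitespace (spaces, tabs,
// line breaks) into a single space and trims both ends.
// It is a hypothetical post-processing helper, not a parser API.
func normalizeWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	raw := "Title\n\n  first   line\tof text \n"
	fmt.Printf("%q\n", normalizeWhitespace(raw))
}
```

The helper could be applied to doc.Content of each parsed document before indexing or splitting.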
Usage
Component Initialization
The PDF parser is initialized using the NewPDFParser function, with the main configuration parameters as follows:
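import (
    "context"
    "github.com/cloudwego/eino-ext/components/document/parser/pdf"
)
func main() {
    ctx := context.Background()
    parser, err := pdf.NewPDFParser(ctx, &pdf.Config{
        ToPages: true, // Whether to split the document by page
    })
    if err != nil {
        panic(err)
    }
    _ = parser // call parser.Parse to parse documents
}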
Configuration parameters:
- ToPages: Whether to split the PDF into multiple documents, one per page (default: false)
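The effect of ToPages on the shape of the result can be modeled with a small standalone sketch. Doc and buildDocs are hypothetical stand-ins written for illustration; this is not the library's implementation, and joining pages with a newline in the non-split case is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Doc is a hypothetical stand-in for the parser's output document type.
type Doc struct {
	Content string
}

// buildDocs models the ToPages switch: one Doc per page when toPages is
// true, otherwise a single Doc holding the text of every page.
// Joining pages with "\n" is an assumption made for this sketch.
func buildDocs(pages []string, toPages bool) []Doc {
	if toPages {
		docs := make([]Doc, 0, len(pages))
		for _, text := range pages {
			docs = append(docs, Doc{Content: text})
		}
		return docs
	}
	return []Doc{{Content: strings.Join(pages, "\n")}}
}

func main() {
	pages := []string{"page one text", "page two text"}
	fmt.Println(len(buildDocs(pages, true)))  // 2
	fmt.Println(len(buildDocs(pages, false))) // 1
}
```

In practice this means that code consuming the result should iterate over the returned slice rather than assume a single document.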
Parsing Documents
Document parsing is done using the Parse method:
docs, err := parser.Parse(ctx, reader, opts...)
Parsing options (from the github.com/cloudwego/eino/components/document/parser package):
- parser.WithURI sets the document URI
- parser.WithExtraMeta adds extra metadata
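Options of this kind are typically built with Go's functional options pattern. The sketch below shows the pattern in isolation, assuming nothing about the library's internals: options, Option, withURI, withExtraMeta, and applyOptions are all local, hypothetical names that only mimic the role of parser.WithURI and parser.WithExtraMeta.

```go
package main

import "fmt"

// options collects the values set by each Option.
// All names here are local stand-ins for illustration only.
type options struct {
	uri       string
	extraMeta map[string]any
}

// Option mutates the options before parsing starts.
type Option func(*options)

// withURI mimics the role of parser.WithURI.
func withURI(uri string) Option {
	return func(o *options) { o.uri = uri }
}

// withExtraMeta mimics the role of parser.WithExtraMeta.
func withExtraMeta(meta map[string]any) Option {
	return func(o *options) { o.extraMeta = meta }
}

// applyOptions applies each Option, in order, to a zero-value options struct.
func applyOptions(opts ...Option) options {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := applyOptions(
		withURI("document.pdf"),
		withExtraMeta(map[string]any{"source": "./document.pdf"}),
	)
	fmt.Println(o.uri, o.extraMeta["source"])
}
```

The pattern keeps the Parse signature stable: callers pass only the options they need, and omitted options keep their zero values.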
Complete Usage Example
Basic Usage
package main
import (
    "context"
    "os"
    
    "github.com/cloudwego/eino-ext/components/document/parser/pdf"
    "github.com/cloudwego/eino/components/document/parser"
)
func main() {
    ctx := context.Background()
    
    // Initialize the parser
    p, err := pdf.NewPDFParser(ctx, &pdf.Config{
        ToPages: false, // Do not split by page
    })
    if err != nil {
        panic(err)
    }
    
    // Open the PDF file
    file, err := os.Open("document.pdf")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    
    // Parse the document
    docs, err := p.Parse(ctx, file, 
        parser.WithURI("document.pdf"),
        parser.WithExtraMeta(map[string]any{
            "source": "./document.pdf",
        }),
    )
    if err != nil {
        panic(err)
    }
    
    // Use the parsed results
    for _, doc := range docs {
        println(doc.Content)
    }
}
Using loader
Refer to the example in the Eino: Document Loader guide
Last modified: October 28, 2025