Eino: Document Transformer guide
Introduction
Document Transformer is a component for converting and processing documents. It applies transformations such as splitting, filtering, and merging to input documents so that the output meets a specific need. Typical scenarios include:
- Splitting long documents into smaller paragraphs for easier processing
- Filtering document content based on specific rules
- Performing structured transformations on document content
- Extracting specific parts from documents
Component Definition
Interface Definition
Code Location: eino/components/document/interface.go
Transform Method
- Function: Performs transformation processing on the input documents
- Parameters:
  - ctx: Context object that carries request-level information and the Callback Manager
  - src: List of documents to be processed
  - opts: Optional parameters that configure transformation behavior
- Return Values:
  - []*schema.Document: List of transformed documents
  - error: Error encountered during the transformation process
Document Structure
The Document structure is the standard document format. Its main fields are:
- ID: Unique identifier of the document within the system
- Content: The actual content of the document
- MetaData: Metadata of the document, which can store information such as:
  - The source of the document
  - Vector representation of the document (for vector retrieval)
  - Score of the document (for sorting)
  - Sub-index of the document (for hierarchical retrieval)
  - Other custom metadata
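As a rough illustration of the three fields listed above, the sketch below defines a local Document type and stores a source and a score in its metadata. The map[string]any metadata type, the key names "source" and "score", and the score helper are assumptions made for this example; the library may expose its own accessors for these values.

```go
package main

import "fmt"

// Document is a local stand-in with the three fields described above.
// Using map[string]any for MetaData is an assumption of this sketch.
type Document struct {
	ID       string
	Content  string
	MetaData map[string]any
}

// Hypothetical metadata keys, chosen only for illustration.
const (
	metaSource = "source"
	metaScore  = "score"
)

// score reads the score from the metadata and reports whether it was
// present and stored as a float64. It is safe to call on a nil map.
func score(d *Document) (float64, bool) {
	v, ok := d.MetaData[metaScore].(float64)
	return v, ok
}

func main() {
	doc := &Document{
		ID:      "doc-1",
		Content: "# Title\nBody text",
		MetaData: map[string]any{
			metaSource: "notes/intro.md",
			metaScore:  0.82,
		},
	}
	if s, ok := score(doc); ok {
		fmt.Printf("%s from %v scored %.2f\n", doc.ID, doc.MetaData[metaSource], s)
	}
}
```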
Common Option
The Transformer component uses TransformerOption to define optional parameters. There are currently no common options. Each implementation can define its own options and wrap them into the unified TransformerOption type with the WrapTransformerImplSpecificOptFn function.
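One way such wrapping could work is sketched below with Go generics. Everything here is a local stand-in: wrapImplSpecificOptFn plays the role the text assigns to WrapTransformerImplSpecificOptFn, while implSpecificOptions, splitterOptions, filterOptions, WithChunkSize, and WithMinLength are hypothetical names introduced for the example. The library's actual internals may differ.

```go
package main

import "fmt"

// TransformerOption is a stand-in for the unified option type. It carries an
// implementation-specific setter behind an untyped field.
type TransformerOption struct {
	implSpecificOptFn any
}

// wrapImplSpecificOptFn wraps a setter for implementation type T into the
// unified option type (assumed behaviour, for illustration only).
func wrapImplSpecificOptFn[T any](fn func(*T)) TransformerOption {
	return TransformerOption{implSpecificOptFn: fn}
}

// implSpecificOptions applies every option whose setter targets T to base and
// silently skips options that belong to other implementations.
func implSpecificOptions[T any](base *T, opts ...TransformerOption) *T {
	if base == nil {
		base = new(T)
	}
	for _, opt := range opts {
		if fn, ok := opt.implSpecificOptFn.(func(*T)); ok {
			fn(base)
		}
	}
	return base
}

// splitterOptions and filterOptions are two hypothetical option structs.
type splitterOptions struct {
	ChunkSize int
	Separator string
}

type filterOptions struct {
	MinLength int
}

func WithChunkSize(n int) TransformerOption {
	return wrapImplSpecificOptFn(func(o *splitterOptions) { o.ChunkSize = n })
}

func WithMinLength(n int) TransformerOption {
	return wrapImplSpecificOptFn(func(o *filterOptions) { o.MinLength = n })
}

func main() {
	// Defaults come from base; WithMinLength is ignored by the splitter.
	opts := implSpecificOptions(
		&splitterOptions{ChunkSize: 512, Separator: "\n"},
		WithChunkSize(128), WithMinLength(10),
	)
	fmt.Println(opts.ChunkSize, opts.Separator == "\n") // prints "128 true"
}
```

Keeping the setter behind an untyped field lets one option slice be passed through generic call sites while each implementation picks out only the options meant for it.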
Usage
Use Individually
Code Location: eino-ext/components/document/transformer/splitter/markdown/examples/headersplitter
Use in Orchestration
Usage of Option and Callback
Callback Usage Example
Code Location: eino-ext/components/document/transformer/splitter/markdown/examples/headersplitter
Existing Implementations
- Markdown Header Splitter: Splits documents based on Markdown headers (see: Splitter - markdown)
- Text Splitter: Splits documents based on text length or delimiters (see: Splitter - semantic)
- Document Filter: Filters document content based on rules (see: Splitter - recursive)
Reference Implementation
When implementing a custom Transformer component, please pay attention to the following points:
- Handling of options
- Handling of callbacks
Option Mechanism
A custom Transformer needs to implement its own option mechanism:
Handling Callbacks
The Transformer implementation needs to trigger callbacks at appropriate times:
Complete Implementation Example
Notes
- Manage the metadata of transformed documents carefully: retain the original metadata and add any custom metadata explicitly.
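The metadata note can be illustrated with a small paragraph splitter that gives every chunk its own copy of the source metadata and then adds one custom key. This is a sketch under assumptions: Document is a local stand-in, and splitParagraphs and the "chunk_index" key are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a local stand-in for the document type (assumption).
type Document struct {
	ID       string
	Content  string
	MetaData map[string]any
}

// metaChunkIndex is a hypothetical custom metadata key.
const metaChunkIndex = "chunk_index"

// splitParagraphs splits a document on blank lines. Each chunk receives a
// shallow copy of the source metadata plus its own chunk index, so writing to
// one chunk's map does not affect the source or sibling chunks. Nested values
// such as slices are still shared.
func splitParagraphs(doc *Document) []*Document {
	var out []*Document
	for _, part := range strings.Split(doc.Content, "\n\n") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		meta := make(map[string]any, len(doc.MetaData)+1)
		for k, v := range doc.MetaData {
			meta[k] = v // retain original metadata
		}
		meta[metaChunkIndex] = len(out) // add custom metadata
		out = append(out, &Document{
			ID:       fmt.Sprintf("%s#%d", doc.ID, len(out)),
			Content:  part,
			MetaData: meta,
		})
	}
	return out
}

func main() {
	src := &Document{
		ID:       "guide",
		Content:  "First paragraph.\n\nSecond paragraph.",
		MetaData: map[string]any{"source": "guide.md"},
	}
	for _, c := range splitParagraphs(src) {
		fmt.Println(c.ID, c.MetaData["source"], c.MetaData[metaChunkIndex])
	}
}
```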