DomTree Definition
In RAG (Retrieval-Augmented Generation) systems, high-quality document parsing is a key foundation for the accuracy and efficiency of downstream tasks. As a core component of the document parsing module, the DomTree protocol represents a document's hierarchical and semantic relationships explicitly, transforming complex, heterogeneous raw documents into a tree-structured logical representation that programs can traverse and reason over.
Structure Definition
| Field Name | Field Description | Data Type |
|---|---|---|
| root | Root node | Node |
| source_file | Document source | object |
| id | File ID | string |
| name | File name | string |
| type | File type, e.g. pdf | string |
| mime_type | MIME type, e.g. application/pdf | string |
| version | File version number | number |
| summary | Summary | string |
| tokens | Estimated token count | number |
| path | Hierarchical numbering of the node, e.g. [1,2,1] | array[number] |
| element | Element information | Element |
| type | Element type, one of: "Text", "Title", "List", "Catalog", "Table", "Figure", "Formula", "Code", "ListItem" | string |
| positions | Position information; an array because an element may span multiple pages | array[Position] |
| bbox | Bounding-box coordinates in the document, e.g. [90.1, 263.8, 101.8, 274.3] | array[double] |
| page | Page number | integer |
| name | Element name, present when type is Table or Figure | string |
| description | Element description, present when type is Table or Figure | string |
| text | Text content; for images, the text extracted by OCR | string |
| image | Image information | Image |
| type | Image source type, one of: image_url, image_base64, image_file | string |
| url | Image link (URL) | string |
| base64 | Base64-encoded image | string |
| file_id | ID of the file uploaded to file-api | string |
| rows | Table rows; applies only to Table elements | array[Cell] |
| cells | Cell attributes | Cell |
| path | Cell position in the table: [start row, end row, start column, end column] | array[number] |
| text | Cell text | string |
| nodes | Nested nodes for cells with complex content; path numbering restarts from the beginning within each node | array[Node] |
| children | Child node information | array[Node] |
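
To make the field table easier to read, here is a minimal Python sketch of a DomTree instance with a few traversal helpers. The nesting is inferred from the field descriptions and is an assumption, not a confirmed schema: `id`, `name`, `type`, and `mime_type` are placed under `source_file`; `summary`, `tokens`, `path`, `element`, and `children` under each node; `bbox` and `page` under each position; and `cells` under each entry of `rows`. The `version` field is left out because its parent object is not evident. All values, the empty root `path`, the 1-based cell indices, and the helper names `iter_nodes`, `table_texts`, and `pages_of` are hypothetical.

```python
from typing import Any, Dict, Iterator, List

# Hypothetical DomTree instance; nesting and all values are illustrative assumptions.
doc: Dict[str, Any] = {
    "source_file": {
        "id": "file-001",  # made-up file ID
        "name": "report.pdf",
        "type": "pdf",
        "mime_type": "application/pdf",
    },
    "root": {
        "summary": "Quarterly report",
        "tokens": 30,
        "path": [],  # assumption: the root carries an empty path
        "element": {"type": "Title", "text": "Quarterly report"},
        "children": [
            {
                "summary": "Introduction",
                "tokens": 12,
                "path": [1],
                "element": {
                    "type": "Text",
                    "text": "Revenue grew this quarter.",
                    "positions": [
                        {"bbox": [90.1, 263.8, 101.8, 274.3], "page": 1}
                    ],
                },
                "children": [],
            },
            {
                "summary": "Revenue table",
                "tokens": 18,
                "path": [2],
                "element": {
                    "type": "Table",
                    "name": "Table 1",
                    "description": "Revenue by region",
                    # assumption: each row is an object holding a "cells" list;
                    # cell path = [start row, end row, start column, end column]
                    "rows": [
                        {"cells": [
                            {"path": [1, 1, 1, 1], "text": "Region"},
                            {"path": [1, 1, 2, 2], "text": "Revenue"},
                        ]},
                        {"cells": [
                            {"path": [2, 2, 1, 1], "text": "EMEA"},
                            {"path": [2, 2, 2, 2], "text": "1.2M"},
                        ]},
                    ],
                },
                "children": [],
            },
        ],
    },
}


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a node and all of its descendants, depth-first in pre-order."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def table_texts(node: Dict[str, Any]) -> List[List[str]]:
    """Return the cell texts of a node's element row by row (empty if no rows)."""
    rows = node.get("element", {}).get("rows", [])
    return [[cell.get("text", "") for cell in row.get("cells", [])] for row in rows]


def pages_of(node: Dict[str, Any]) -> List[int]:
    """Return the sorted, de-duplicated page numbers the node's element appears on."""
    positions = node.get("element", {}).get("positions", [])
    return sorted({pos["page"] for pos in positions})


if __name__ == "__main__":
    # Print an outline: path, element type, and summary of every node.
    for n in iter_nodes(doc["root"]):
        print(n["path"], n["element"]["type"], n["summary"])
```

Because every node exposes its `children` under the same key, a single recursive generator is enough to drive chunking, filtering by element type, or token budgeting over `tokens`; the helpers read optional fields with `.get` so nodes without `positions` or `rows` do not raise errors.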