Using as a Library

System Requirements

Python >= 3.9
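
The version requirement can be checked at startup before anything else is imported. This is a minimal sketch that uses only the standard library; the helper name `python_is_supported` is made up for illustration and is not part of document_parser.

```python
import sys

# Minimum interpreter version stated in the requirements above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_supported():
    # Only report the problem here; callers decide whether to abort.
    print("Python %d.%d or newer is required" % MIN_PYTHON, file=sys.stderr)
```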

Quick Start

Install dependencies
```
pip install document_parser
```

Configuration

parser_config = ParserConfig(
    image_provider=ImageStorageProvider(),
    ocr_model_name="gpt-4o",
    # Whether to enable OCR. When disabled, vision_model_provider and
    # vision_model_list do not need to be implemented or configured.
    ocr_enable=True,
    vision_model_provider=OpenAIVisionModelProvider(),
)
parser_context.register_all_config(parser_config)
# User ID sent with model requests. Leaving it unset affects OCR usage.
parser_context.register_user("userId")

Execute parsing

converter = Converter(stream=stream)  # Pass the file in as a stream
dom_tree = converter.dom_tree_parse(
    remove_watermark=True,    # Whether to enable watermark removal
    parse_stream_table=False  # Whether to parse streaming tables
)
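
The `stream` argument above is described only as a file stream. One way to produce one, assuming `Converter` accepts any seekable binary file-like object (not confirmed here), is to read the file into an in-memory buffer. The helper name `load_stream` is hypothetical.

```python
import io
from pathlib import Path


def load_stream(path):
    """Read a file fully into memory and return a seekable binary stream.

    Assumption: the parser accepts any binary file-like object, so an
    io.BytesIO buffer can stand in for an open file handle.
    """
    data = Path(path).read_bytes()
    return io.BytesIO(data)
```

Under that assumption, the result would be passed as `Converter(stream=load_stream(path))`. Buffering the whole file keeps the stream usable after the file handle is closed, at the cost of holding the document in memory.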

Use standard domtree [Recommended]

import json

from fastapi.encoders import jsonable_encoder  # Assumption: FastAPI's encoder is the one intended

dom_tree_json = jsonable_encoder(dom_tree)
standard_dom_tree = StandardDomTree.from_domtree_dict(dom_tree_json, file_info=file_info)
json_compatible_data = jsonable_encoder(standard_dom_tree.root)
print(json.dumps(json_compatible_data, ensure_ascii=False))

The standard domtree is a structurally and semantically more complete form of the domtree. Services such as bella-rag build their processing on it. In future iterations, the "Execute parsing" step will also output a standard domtree directly.
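
Once the standard domtree has been serialized to JSON-compatible data, downstream code usually needs to traverse it. The sketch below walks a nested dictionary in pre-order. The node schema is not documented here, so the `children` and `name` keys and the sample data are assumptions for illustration only; adjust `children_key` to the real field name.

```python
def walk(node, depth=0, children_key="children"):
    """Yield (depth, node) pairs in pre-order over a nested dict tree.

    Assumption: each node is a dict whose child nodes live in a list
    under `children_key`; a missing or empty list means a leaf.
    """
    yield depth, node
    for child in node.get(children_key) or []:
        yield from walk(child, depth + 1, children_key)


# Hypothetical sample data, not real parser output.
sample = {
    "name": "root",
    "children": [
        {"name": "section", "children": [{"name": "paragraph", "children": []}]},
        {"name": "table"},
    ],
}

for depth, node in walk(sample):
    print("  " * depth + node["name"])
```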
