IBM Releases Granite-Docling-258M, a Compact Vision-Language Model for Precise Document Conversion

"IBM Research has recently introduced Granite-Docling-258M, a new open-source vision-language model (VLM) designed for high-fidelity document-to-text conversion while preserving complex layouts, tables, equations, and lists. Unlike typical OCR systems that rely on large, general-purpose models, Granite-Docling is purpose-built for document parsing. With only 258 million parameters, it delivers accuracy on par with models several times its size, offering a major cost and efficiency advantage."

"Granite-Docling builds on the earlier SmolDocling-256M-preview, replacing the SmolLM-2 backbone with a Granite 3-based architecture and upgrading the visual encoder from SigLIP to SigLIP2. The new version addresses previous stability issues such as token repetition and incomplete parses, thanks to improved dataset filtering and annotation cleanup. Early community reactions have highlighted the model's potential for on-device use. On Reddit, one commenter noted: '0.3B? Impressive. Almost like even low-end phones will have solid local LLM inferencing in the future.'"

Granite-Docling-258M is a purpose-built, open-source vision-language model for document-to-text conversion that preserves document structure, including math notation, table layouts, and code blocks. With only 258 million parameters, it delivers accuracy on par with models several times its size, which makes deployment cheaper and less memory-intensive. Building on SmolDocling-256M-preview, it replaces the SmolLM-2 backbone with a Granite 3-based architecture and upgrades the visual encoder from SigLIP to SigLIP2. Improved dataset filtering and annotation cleanup address earlier stability issues such as token repetition and incomplete parses. The model performs well on standard document understanding benchmarks and is suitable for RAG pipelines, dataset preparation, and on-device inference.
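To put the memory-efficiency claim in rough numbers, the sketch below estimates how much space the weights alone would occupy at a few common numeric precisions. Only the 258 million parameter count comes from the article. The list of formats and the helper name `weight_memory_mb` are illustrative assumptions, and the estimate ignores activations, caches, and runtime overhead, so real memory use will be higher.

```python
# Rough weight-memory estimate for a 258M-parameter model.
# Assumption: weights only; activations, caches, and runtime overhead are excluded.

PARAMS = 258_000_000  # parameter count stated in the article

# Bytes per parameter for common numeric formats (illustrative choices,
# not a statement about which formats the released model ships in).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_mb(params: int, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in megabytes (10**6 bytes)."""
    return params * bytes_per_param / 1e6


if __name__ == "__main__":
    for fmt, width in BYTES_PER_PARAM.items():
        print(f"{fmt:>9}: ~{weight_memory_mb(PARAMS, width):,.0f} MB")
```

At 16-bit precision this works out to roughly 516 MB of weights, which is consistent with the article's framing of the model as a candidate for on-device inference.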

#vision-language-model #document-understanding #layout-preservation #on-device-inference

Read at InfoQ

