ParseKit brings native document parsing capabilities to Ruby through high-performance Rust bindings. Extract text from PDFs, Office documents, images, and more, all without external dependencies or shelling out to Python.

What makes it special:

- Zero runtime dependencies (MuPDF and Tesseract statically linked)
- Native Rust performance via Magnus FFI
- Unified API for multiple formats (PDF, DOCX, XLSX, PPTX, images)
- OCR support for images (PNG, JPEG, TIFF, BMP)
- Comprehensive format detection and error handling
- PDF text extraction with MuPDF (handles complex layouts, tables, forms)
- Office document parsing (Word, Excel, PowerPoint)
- Image OCR with Tesseract (multiple languages supported)
- Smart format detection from content or file extension
- Production-ready with comprehensive test suite (334 specs)

Perfect for content management systems, document indexing, data extraction pipelines, and any application that needs to process documents at scale, all within your Ruby process.

Recent improvements in v0.1.1:

- Streamlined error handling with better context
- Enhanced validation helpers
- Improved format detection reliability
- Cleaner API surface
ParseKit provides native document parsing for Ruby by exposing high-performance Rust bindings via Magnus FFI. It extracts text from PDFs, Office documents, and images without external runtime dependencies by statically linking MuPDF and Tesseract. The library offers a unified API across PDF, DOCX, XLSX, PPTX, and common image formats with OCR support in multiple languages. Built-in format detection and robust error handling improve reliability. Recent v0.1.1 improvements include streamlined error context, enhanced validation helpers, and improved format detection. A comprehensive test suite (334 specs) supports production readiness for content management, indexing, and data extraction pipelines.
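To make "format detection from content or file extension" concrete, here is a minimal plain-Ruby sketch of the general technique: check the leading magic bytes first, then fall back to the file extension. This is not ParseKit's implementation or API. The method `detect_format` and the constants below are hypothetical names chosen for illustration, and the signature table covers only the formats named in the description.

```ruby
# Hypothetical sketch of content-first format detection.
# NOT ParseKit's code: all names here are assumptions for illustration.

# Well-known leading byte signatures ("magic numbers") per format.
MAGIC_SIGNATURES = {
  pdf:  ["%PDF-".b],
  png:  ["\x89PNG\r\n\x1A\n".b],
  jpeg: ["\xFF\xD8\xFF".b],
  tiff: ["II*\x00".b, "MM\x00*".b], # little-endian and big-endian TIFF
  bmp:  ["BM".b],
  zip:  ["PK\x03\x04".b]            # DOCX, XLSX, and PPTX are ZIP containers
}.freeze

# Extension fallback, also used to tell the ZIP-based Office formats apart.
EXTENSION_FORMATS = {
  ".pdf" => :pdf, ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx,
  ".png" => :png, ".jpg" => :jpeg, ".jpeg" => :jpeg,
  ".tif" => :tiff, ".tiff" => :tiff, ".bmp" => :bmp
}.freeze

OOXML_FORMATS = %i[docx xlsx pptx].freeze

# Returns a format symbol, or :unknown when neither content nor name helps.
def detect_format(bytes, filename = nil)
  header = bytes.to_s.b # compare as raw bytes, independent of encoding
  by_content = MAGIC_SIGNATURES.find do |_format, signatures|
    signatures.any? { |sig| header.start_with?(sig) }
  end&.first
  by_extension = filename && EXTENSION_FORMATS[File.extname(filename).downcase]

  if by_content == :zip
    # The ZIP signature alone cannot distinguish Office formats,
    # so this sketch trusts the extension for that step only.
    return OOXML_FORMATS.include?(by_extension) ? by_extension : :zip
  end

  by_content || by_extension || :unknown
end
```

Content takes priority over the extension in this sketch, so a PDF saved with a misleading name is still reported as `:pdf`. A more thorough detector would inspect the ZIP directory entries instead of relying on the extension for Office files.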
Read at Rubyflow