Our primary goal is to convert diverse file formats such as Word, Excel, and PDF into a universal, machine-readable JSON format, referred to as VUD.
Libraries such as Apache Tika and pdfplumber assist with data extraction, but not every source format retains its formatting features, so we focus on core text and minimal structure.
In our approach, VUD represents structured content as pages, paragraphs, and tables, preserving continuity across page boundaries and recognizing tables that span multiple pages.
The process also handles basic tabular structures: consecutive tables are merged when they share the same indices and no non-tabular content appears between them.
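The merge rule above can be illustrated with a minimal sketch. The block layout used here (dictionaries with `type`, `pages`, `index`, `rows`, and `text` keys) and the function name `merge_continued_tables` are assumptions made for illustration only; they are not the actual VUD schema or the authors' implementation.

```python
import copy
import json
from typing import Any


def merge_continued_tables(blocks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge adjacent table blocks that share the same index list.

    Hypothetical block layout (an assumption, not the real VUD schema):
      {"type": "paragraph", "pages": [int], "text": str}
      {"type": "table", "pages": [int], "index": [str], "rows": [[...]]}
    """
    merged: list[dict[str, Any]] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        is_continuation = (
            prev is not None
            and prev["type"] == "table"
            and block["type"] == "table"
            and prev["index"] == block["index"]
        )
        if is_continuation:
            # Same indices and nothing in between: append the rows to the
            # previous table and record the extra pages it now covers.
            prev["rows"].extend(copy.deepcopy(block["rows"]))
            for page in block["pages"]:
                if page not in prev["pages"]:
                    prev["pages"].append(page)
        else:
            # Any other block (including a paragraph) breaks the run.
            merged.append(copy.deepcopy(block))
    return merged


example_blocks = [
    {"type": "paragraph", "pages": [1], "text": "Intro"},
    {"type": "table", "pages": [1], "index": ["id", "name"], "rows": [[1, "a"]]},
    {"type": "table", "pages": [2], "index": ["id", "name"], "rows": [[2, "b"]]},
    {"type": "paragraph", "pages": [2], "text": "Note"},
    {"type": "table", "pages": [3], "index": ["id", "name"], "rows": [[3, "c"]]},
]

print(json.dumps(merge_continued_tables(example_blocks), indent=2))
```

In this example the two tables on pages 1 and 2 collapse into one table covering both pages, while the table on page 3 stays separate because a paragraph comes between it and the earlier table.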
Collection [ ... ]