Building OCR Systems for Tamizhi and Kurdish Historical Documents

The Tamizhi script, also known as Tamil-Brahmi, is one of the oldest scripts in India. Digitizing ancient documents written in it relies on Optical Character Recognition (OCR), which is difficult for Tamizhi because many characters have similar shapes that differ only in subtle structural details, and because combined characters are common. Despite these challenges, recent research reports an accuracy of 91.12 percent in recognizing printed Tamizhi characters. For historical Kurdish documents, by contrast, no OCR system has been developed, and no accessible dataset exists for training one.

"Digitizing documents from ancient history typically involves OCR, which poses challenges for Tamizhi documents due to similar shapes and subtle variations among characters."

"Tamizhi script is one of the oldest scripts in India and developing an OCR system for it is difficult due to the abundance of combined characters."

"The authors report that their Tamizhi OCR achieves an accuracy rate of 91.12 percent for printed text, demonstrating promising results in recognizing characters."

"Currently, no accessible dataset exists to train OCR systems specifically designed for extracting text from historical Kurdish documents."

#ocr #tamizhi-script #kurdish-documents #digitization #historical-texts

Read at Hackernoon


