A large number of historical Kurdish documents remain unprocessed in libraries. This lack of processing capability motivated the exploration of OCR technology for Kurdish, a low-resource language. Tesseract was selected after an assessment of existing OCR systems and technologies. Efforts focused on collecting digital copies of documents printed before 1950, with sourcing and digitization presenting major challenges. The Zheen Center for Documentation and Research supplied several digitized books. Text lines were extracted, transcribed individually, and preprocessed to create a dataset. The resulting 1233-line dataset was used to train a model initialized from the Arabic Tesseract model. Model performance was evaluated using multiple methods, including Tesseract's lstmeval.
The primary motivation for this study is the large number of historical documents held in libraries that remain unprocessed. This lack of processing capability led to the exploration of OCR technology for Kurdish, a low-resource language. Applying OCR to extract text from historical Kurdish documents would greatly expand the resources available for the language. Existing OCR systems for Kurdish and for other languages worldwide were surveyed extensively to identify a suitable technology.
Once Tesseract had been identified as the technology to use, efforts were made to collect digital copies of historical documents printed before 1950. This task proved challenging: both locating the documents and converting them into digital format were difficult. Fortunately, the Zheen Center for Documentation and Research in Sulaymaniyah, which specializes in archiving historical documents, provided digital copies of several books. Once the digitized copies were received, a dataset was created to train the Tesseract model.
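The dataset consists of individual text-line images, each paired with its own transcription. A minimal sketch of how such pairs might be checked before training is given below. It assumes the file layout used by the tesstrain tooling, where every line image (`.png` or `.tif`) sits next to a ground-truth file with the same base name and a `.gt.txt` extension. The text above does not state that this layout was used, and the function name `pair_lines` and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: line images are stored as .png or .tif files.
IMAGE_SUFFIXES = {".png", ".tif"}


def pair_lines(gt_dir):
    """Pair each line image with its transcription.

    Returns (pairs, orphans): pairs is a list of (image name, text) tuples,
    orphans lists images whose .gt.txt file is missing or empty.
    """
    gt_dir = Path(gt_dir)
    pairs, orphans = [], []
    for image in sorted(gt_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip the .gt.txt files and anything else
        gt_file = image.parent / (image.stem + ".gt.txt")
        text = ""
        if gt_file.exists():
            # Kurdish text is written in an Arabic-based script, so read as UTF-8.
            text = gt_file.read_text(encoding="utf-8").strip()
        if text:
            pairs.append((image.name, text))
        else:
            orphans.append(image.name)
    return pairs, orphans


if __name__ == "__main__":
    # Tiny self-contained demo with placeholder files (not real scans).
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "line_0001.png").write_bytes(b"")
        (tmp / "line_0001.gt.txt").write_text("کوردستان\n", encoding="utf-8")
        (tmp / "line_0002.png").write_bytes(b"")  # transcription missing
        pairs, orphans = pair_lines(tmp)
        print(len(pairs), "usable lines;", len(orphans), "without transcription")
```

A check of this kind is useful because each line was transcribed individually: an image with a missing or empty transcription would otherwise reach training unnoticed.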