Over 32,000 medieval manuscripts transcribed in four months using AI - Medievalists.net
Briefly

"Medievalists can now access automated transcriptions of 32,763 digitised medieval manuscripts, produced in just four months as part of a project called CoMMA-a large-scale corpus designed to make manuscript texts searchable and analysable at a scale that would be impossible to tackle by hand. The work was carried out by researchers in computational humanities at Inria (Institut national de recherche en sciences et technologies du numérique), working with partners in France and Switzerland. Inria is France's national institute for research in digital science and technology, supporting research in areas such as computer science and applied mathematics through centres and teams across the country."
"Transcribing medieval handwriting is hard enough at the level of a single codex. At scale, the problem becomes even more complex because medieval writing does not conform to modern expectations about spelling, punctuation, or consistent letterforms. As the researchers note, in the Middle Ages most European vernaculars were still evolving, spelling was not standardised, new letterforms emerged, and manuscript pages were filled with abbreviations, symbols, and annotations. "When it came to transcribing medieval manuscripts, individual specialists went about it in their own way," explains Thibault Clérice, a computational humanities researcher within the ALMANACH project team at Inria. "But automating manuscript transcription requires machine learning, and for this you need standards.""
Automated transcriptions of 32,763 digitised medieval manuscripts were produced in four months as part of CoMMA, a large-scale corpus that makes manuscript texts searchable and analysable at scale. The work was carried out by computational humanities researchers at Inria with partners in France and Switzerland. Medieval handwriting resists automation because spelling, punctuation, and letterforms were inconsistent, vernaculars were still evolving, and pages were filled with abbreviations, symbols, and annotations. Individual specialists each transcribed in their own way, whereas machine learning requires standards. A first initiative, CATMuS, launched in 2022, aimed to build a large, uniform training dataset; researchers began by collecting 300 medieval manuscripts.
Read at Medievalists.net