
Enhancing Usability of Digital Collections: Accuracy Assessment and OCR Post-Correction of the Digital Museum of the Romanian Novel

Summary

Abstract: This paper presents a methodology for assessing the accuracy of large collections of digital documents produced by applying Optical Character Recognition (OCR) to scanned print editions. We applied this methodology to the Digital Museum of the Romanian Novel, a digital collection of Romanian literary texts from the 19th and 20th centuries. With minimal intervention in the texts, we compared each word token in the collection against custom-made lexicons to establish an OCR accuracy rate for each document. The methodology also points to common OCR mistakes in the collection that can be safely corrected, and it identifies potential improvements to the custom lexicons by listing and storing candidate additions. We believe that, given appropriate lexicons, this method can be applied to large corpora of OCRed texts in any language.

Keywords: OCR accuracy, post-OCR correction, OCR evaluation, digital libraries, digital archives
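
The lexicon comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer pattern, the function name `assess_document`, and the toy lexicon are all assumptions made for illustration. The sketch computes the share of word tokens found in a lexicon and counts the tokens that are not, which could serve either as OCR error candidates or as candidate lexicon additions.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode letters, optionally joined by an
# internal hyphen or apostrophe. The paper's own tokenization may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def assess_document(text, lexicon):
    """Return (accuracy_rate, unknown_tokens) for one document.

    accuracy_rate is the fraction of lowercased word tokens present in
    the lexicon; unknown_tokens counts every token that is absent.
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    unknown = Counter(t for t in tokens if t not in lexicon)
    known = len(tokens) - sum(unknown.values())
    accuracy = known / len(tokens) if tokens else 0.0
    return accuracy, unknown


# Toy lexicon and text (hypothetical data, not taken from the collection).
lexicon = {"romanul", "este", "o", "carte", "veche"}
accuracy, unknown = assess_document("Romanul este o carte vcche", lexicon)
print(accuracy, unknown.most_common())
```

Sorting the unknown tokens by frequency is one plausible way to separate recurring OCR confusions from legitimate words that are simply missing from the lexicon, though that decision would still need manual review.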

