The digitization of historical documents is crucial for preserving and providing access to our shared humanity: our history, our culture, and our lessons. Yet the process continues to face unresolved challenges, especially around the accuracy of Optical Character Recognition (OCR). OCR, the technology that converts images of text into machine-readable text, often performs poorly on older, historical texts, which frequently exhibit unique typography, non-uniform layouts, and degraded print quality.
This project aims to enhance the legibility and usability of OCR text from the Princeton Prosody Archive (PPA), a full-text searchable database specializing in English prosody. We explore the potential of modern GPT models (GPT-3.5-turbo, GPT-4, and GPT-4-turbo) to perform post-OCR correction, experimenting with a variety of prompts. Our methodology involved a comparative analysis across model configurations and prompt strategies to identify the most effective combinations.
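As a rough illustration of this setup, a single correction call might look like the sketch below. The prompt text, model choice, and parameters here are assumptions for illustration, not the exact configurations evaluated in the project.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical system prompt; the project's actual prompts differed and
# were themselves one of the variables being compared.
SYSTEM_PROMPT = (
    "You are an assistant that corrects OCR errors in historical English "
    "texts on prosody. Fix misrecognized characters and spacing, but do "
    "not modernize spelling or change the wording."
)

def correct_ocr(page_text: str, model: str = "gpt-4-turbo") -> str:
    """Send one page of noisy OCR text to the model and return its correction."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": page_text},
        ],
    )
    return response.choices[0].message.content
```

Swapping the `model` argument and the system prompt is enough to reproduce the kind of model-by-prompt grid the comparison describes.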
Our best-performing configuration achieved, on average, a 38% improvement in Character Error Rate (CER), a substantial gain. Improvements of this magnitude make the PPA more searchable, thereby increasing the overall accessibility and analytical utility of its collections.
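For reference, CER is the character-level edit distance between a text and its ground-truth transcription, normalized by the length of the ground truth. The following minimal sketch (not the project's evaluation code) shows how the metric and a relative improvement over it can be computed:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Toy example: relative improvement from raw OCR to corrected output.
ocr_cer = cer("Tbe qualitv of mercv", "The quality of mercy")       # 0.15
fixed_cer = cer("The quality of mercy", "The quality of mercy")     # 0.0
improvement = (ocr_cer - fixed_cer) / ocr_cer  # fraction of errors removed
```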
This project highlights the considerable potential of applying large language models (LLMs) to digitization efforts and invites further research into developing even better post-OCR correction strategies. This is just the tip of the iceberg. We hope this work inspires further advances in the accessibility, preservation, and utilization of digitized works, both within the PPA and in broader archival contexts.