The Old Bailey, U.S. Reports, and OCR: Benchmarking AWS, Azure, and GCP on 360,000 Page Images, William Ughetta, UG '21 (2304637)
Court records spanning the entire eighteenth and nineteenth centuries present a compelling benchmark for leading cloud Optical Character Recognition (OCR) providers on historical documents. The Proceedings of the Old Bailey is a corpus of over 180,000 pages of court records and last words published in England from 1674 to 1913. The U.S. Reports is a collection of over 180,000 pages from the United States Supreme Court and its predecessor courts, ranging from 1754 to 1915. The Old Bailey is uniquely suited for benchmarking OCR because all 180,000 of its page images have been transcribed by humans. The U.S. Reports serves as a relative measure of similarity between the providers rather than an absolute comparison to human performance. Although the two datasets largely span the same period, they differ significantly in layout, printing, preservation, scanning, and even digital format. Our goal is to benchmark three leading cloud OCR services on the 360,000 page images from the Old Bailey and U.S. Reports datasets: Amazon Web Services' Textract (AWS), Microsoft Azure's Cognitive Services OCR (Azure), and Google Cloud Platform's Vision (GCP). This is the second time the Old Bailey has been benchmarked on AWS, Azure, and GCP, approximately nine months after the first, and the first time for the U.S. Reports. We found that AWS had the lowest median Character Error Rate (CER) on both the Old Bailey and the U.S. Reports, and that GCP had the lowest median round-trip time, at less than one second.
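For readers unfamiliar with the headline metric, the following is a minimal sketch of how a median CER could be computed. It assumes the common definition of CER as the Levenshtein edit distance between the OCR output and the reference transcription, divided by the reference length; the abstract does not state the exact normalization used, so this is an assumption. The function names, the treatment of empty references, and the sample strings are hypothetical and for illustration only.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        # Assumption: an empty reference scores 0 only if the output is also empty.
        return 0.0 if not hypothesis else float("inf")
    return levenshtein(reference, hypothesis) / len(reference)


def median_cer(pairs) -> float:
    """Median CER over (reference, hypothesis) pairs, one pair per page."""
    return median(character_error_rate(ref, hyp) for ref, hyp in pairs)


# Hypothetical sample pages: (human transcription, OCR output).
sample_pages = [
    ("abcd", "abcd"),  # perfect page
    ("abcd", "abxd"),  # one substitution in four characters
    ("ab", "xy"),      # every character wrong
]
page_median = median_cer(sample_pages)
```

For a corpus without human transcriptions, the same function could in principle be applied to the outputs of two providers on the same page to obtain a relative similarity score, though the abstract does not specify how that comparison was carried out.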