Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning
From Princeton University Library on May 21st, 2020
Presented by Ben Lee, Library of Congress Innovator-in-Residence
The 16 million digitized historic newspaper pages within Chronicling America, a joint initiative by the Library of Congress and the National Endowment for the Humanities (NEH), are a rich resource for a wide range of users. Historians, journalists, genealogists, students, and members of the American public regularly explore the collection via keyword search. But how do we navigate the abundant visual content? Newspaper Navigator is a project that Ben is currently carrying out as an Innovator-in-Residence at the Library of Congress, in collaboration with Library of Congress Labs, the National Digital Newspaper Program, and Ben's Ph.D. advisor, Professor Daniel Weld, at the University of Washington. Newspaper Navigator consists of two parts. The first part extracts headlines, images, illustrations, maps, comics, and editorial cartoons from millions of newspaper pages by training an image recognition model on thousands of crowdsourced annotations collected by the Library of Congress’s Beyond Words initiative. The second part reimagines how we can navigate this wealth of visual content through an exploratory search interface that lets users define queries for concepts of their own choosing (referred to as “open faceted search”).
In this talk, Ben will share current progress with Newspaper Navigator, including running the visual content recognition pipeline at scale. Ben will also discuss how this project, including the resulting datasets and search interface, can contribute to both computer science research and research within digital humanities.