OCR Dreams

By Hannah Alpert-Abrams

OCR (Optical Character Recognition) is the name for the software used to convert a picture of words into words that can be read by a machine. It’s used by everything from Google Books to HathiTrust to the Internet Archive to Adobe Acrobat. If you are a humanist and you want to know why OCR matters to you, you should read Ryan Cordell’s super clear and helpful discussion on OCR for humanists. Ryan’s post is an introduction to the more in-depth report that he published with David Smith, “A Research Agenda for Historical and Multilingual Optical Character Recognition,” a report funded by the Andrew W. Mellon Foundation and conducted in consultation with the NEH’s Office of Digital Humanities and the Library of Congress.

In response to the OCR report, I wrote a Twitter thread about OCR and how I see a world of opportunities for students in the humanities to get involved in OCR development. This is something I’ve been thinking about for a long time, and I wanted to expand on it here.

So, I present some questions about OCR and the machine reading of digitized historical documents that you can help answer:

  • What are neural nets and what are they doing to historical documents?
  • How does OCR quality vary across languages and dialects?
  • How do we measure OCR quality for multilingual corpora (or collections of texts that are otherwise non-standard)?
  • How do we evaluate the accuracy of a multilingual OCR corpus? How do we know when a multilingual corpus is accurate enough to use (and for what)?
  • How do we automatically improve / normalize / lemmatize multilingual corpora?
  • Have OCR tools and workflows been developed with environmental impact in mind? What can we do to improve this?
  • Can we make OCR technology more accessible for ordinary users without institutional affiliations or expensive hardware?
  • Can we center the goal of making our digital collections accessible — for blind and visually impaired users, for users without consistent access to high-speed internet, for users who don’t speak English, for users using smartphones instead of computers — in the design of OCR software?
  • How does OCR impact research methods for humanists who don’t do digital humanities? What should a non-expert know about OCR before doing their work? How should they write about this in their published work? How should reviewers think about this when reviewing non-DH research in the humanities?
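
For anyone wondering where to start on the measurement questions above: one widely used baseline is character error rate (CER), the number of character edits needed to turn the OCR output into a trusted transcription, divided by the length of that transcription. The sketch below is only an illustration, not a method from the report; the function names are made up for this example, and it assumes you already have a hand-corrected transcription to compare against.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the trusted transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: an OCR engine reads a long s as "f".
cer = character_error_rate("Congress", "Congrefs")
print(f"CER: {cer:.3f}")
```

Note what this leaves out: every character counts the same, so a dropped diacritic or an unfamiliar orthography is scored like any other typo, and the number says nothing about whether the text is good enough for a particular kind of research. That gap is part of why the multilingual questions are still open.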

Go forth and make better software, friends!
