Unlocking and Analyzing Historical Texts

2024 ~ Present | No partners or sponsors as of now, but I am working on a grant proposal for a project analyzing ancient Greek and Latin texts and their historical translations. That project would rely on the ecosystem and tools described here as the starting point for acquiring and processing these texts.

Goals

The human record is enormous, ranging from the text we produce on the internet today to ancient writings on clay, such as cuneiform tablets. Unlike text that is "born digital" today, many historical texts and their metadata remain locked up in inscrutable archival states, which makes computational analysis of these texts impractical. The goal of this project is to build an ecosystem of computational (AI) tools to identify, obtain, scrape, and process these ancient documents, which exist in varying states of digitization, from a multitude of libraries and publicly available archival sources.

Issues Involved or Addressed

Historical texts are publicly available across museums and libraries, but they exist in many states that are not machine readable:

  • Completely undigitized: only the physical relics are to be found in museums.
  • Archival images only: images exist in varying formats, resolutions, and quality across libraries with different conventions for storing, cataloging, and archiving (for example, the cuneiform tablets have been 3D-imaged, and ancient Greek and Latin texts are available as microfilm images).
  • Partially machine readable: machine-readable versions exist for certain editions of archived documents, but the conventions around storing metadata, treating paratext like footnotes and marginalia, and handling digitized outputs vary greatly among libraries.

Getting the text into a machine-readable state for computational analysis involves a lot of manual effort that we aim to alleviate:

  • navigating non-standardized and varying conventions for cataloging and archiving information
  • requesting and scraping archived data from libraries and public archives
  • developing and deploying image analysis algorithms to recognize the layout, format, and other visual characteristics of documents
  • developing and deploying optical character recognition (OCR) systems
  • working with the diverse structure of the printed text and the metadata

Many of these steps, such as OCR, tend to be imperfect and noisy, so developing evaluation schemes for these systems is also necessary.
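As one illustration of what an evaluation scheme for noisy OCR output might look like, the sketch below computes a character error rate (CER): the edit distance between an OCR output and a hand-checked reference transcription, divided by the length of the reference. This is a minimal sketch and not the project's actual evaluation method. The function names `edit_distance` and `character_error_rate` are hypothetical, and it assumes that a trusted reference transcription is available for the passage being scored.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # distance from i reference characters to an empty hypothesis
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (assumed non-empty)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the OCR system reads "v" as "u" in one word.
    reference = "arma virumque cano"
    ocr_output = "arma uirumque cano"
    print(f"CER: {character_error_rate(reference, ocr_output):.3f}")
```

A character-level metric like this is only a starting point. Historical texts raise further questions, such as whether spelling variants, ligatures, or marginalia should count as errors, and those choices would have to be settled per collection.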

Methods and Technologies

  • Computer vision
  • Optical character recognition
  • Natural language processing
  • Information retrieval
  • Catalog management
  • Bibliographic methods
  • Dataset curation
  • Unit testing
  • Qualitative and quantitative evaluation of AI systems
  • Data science

Academic Majors of Interest

  • Computing
  • Ivan Allen
  • Other
  • Sciences

Preferred Interests and Preparation

Basic familiarity with shell scripting, Python, and general programming will be advantageous. Some knowledge of catalogs, library practices, archives, or historical texts would also be a plus.

Meeting Schedule & Location

Time: 5:00-5:50
Meeting Location: Klaus 2446
Meeting Day: Wednesday

Team Advisors

Kartik Goyal
  • College of Computing

Partner(s) and Sponsor(s)

No partners or sponsors as of now, but I am working on a grant proposal for a project analyzing ancient Greek and Latin texts and their historical translations. That project would rely on the ecosystem and tools described above as the starting point for acquiring and processing these texts.