Text Extraction for Complex Historical Documents: A Modular Approach to Layout Detection and OCR

David Fleischhacker*, Wolfgang Thomas Göderle, Roman Kern

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

We present a modular approach for high-precision extraction of data from retro-digitized historical texts with complex layouts. Our two-stage process combines AI-driven layout recognition using YOLOv9 with a fine-tuned Kraken OCR engine. By leveraging synthetic training data and custom fonts, we achieve low single-digit Character Error Rates (CER) for 19th-century documents like the Schematismus. Our approach is particularly effective for processing large-scale historical collections with intricate layouts and nested structures, demonstrating significant improvements over existing solutions in both accuracy and processing efficiency. The system's modular design allows for easy adaptation to different historical document types while maintaining high performance levels.
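The abstract reports results as Character Error Rate (CER) but does not spell out how it is computed. A minimal sketch of the commonly used definition follows: the edit distance between the OCR output and the ground-truth text, divided by the length of the ground truth. The function names `levenshtein` and `cer` are illustrative and not taken from the paper, and the authors' exact evaluation procedure (for example, any text normalization) may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Assumption: the standard definition, not confirmed by the paper.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, an OCR output that reads "rn" where the ground truth has "m" counts as two character errors (one substitution and one insertion), so a single misread glyph in a twelve-character word already yields a CER of about 17%.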
Original language: English
Title of host publication: JCDL 2024 - Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries
Editors: Jian Wu, Xiao Hu, Terhi Nurmikko-Fuller, Sam Chu, Ruixian Yang, J. Stephen Downie
Publisher: IEEE
ISBN (Electronic): 979-840071093-3
Publication status: Published - 13 Mar 2025
Event: 24th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2024 - Hong Kong, China
Duration: 16 Dec 2024 to 20 Dec 2024

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ISSN (Print): 1552-5996

Conference

Conference: 24th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2024
Abbreviated title: JCDL '24
Country/Territory: China
City: Hong Kong
Period: 16/12/24 to 20/12/24

Keywords

  • Historical documents
  • Information extraction
  • Layout detection
  • OCR
  • Synthetic training data
  • YOLOv9

ASJC Scopus subject areas

  • General Engineering
