
Transforming Classical Chinese Texts into Searchable Databases with AI
Discover how artificial intelligence is reshaping the landscape of classical Chinese studies in this insightful talk by Guenther Lomas, founder of Sigtica and an expert in document intelligence. As AI continues to play a crucial role in the digital humanities, Lomas will explore innovative methods that enhance research capabilities and unveil new insights into historical texts and cultural narratives.
- Date: Wednesday, Nov 6, 2024
- Time: 3:00 p.m. - 5:00 p.m. (EST)
- Location: The Cheng Yu Tung East Asian Library, 8th Floor, Robarts Library, 130 St. George Street
- Registration Link: https://forms.gle/oaaMubG9JP6aLpL26
Event Description: In this presentation, Lomas will demonstrate how AI technologies can convert large volumes of unstructured classical Chinese texts—such as genealogies and Qing dynasty government employee records—into organized, searchable databases. This groundbreaking approach addresses the longstanding challenges of manual data entry in classical Chinese studies.
Attendees will gain a deep understanding of a comprehensive workflow designed to process millions of pages of historical texts, focusing on the complexities of layout identification and the precision required for effective text extraction. Key technologies, including customized Optical Character Recognition (OCR) and Named Entity Recognition (NER) models, will be discussed, showcasing how they significantly improve data extraction accuracy and enhance accessibility. This talk offers an exciting opportunity for those interested in digital humanities, cultural heritage preservation, and the intersection of AI and historical scholarship.
Abstract: As AI becomes an integral part of digital humanities, it offers innovative methods that enhance research capabilities and reveal new insights into historical texts and cultural narratives. In this talk, the speaker will demonstrate how AI can convert large volumes of unstructured classical Chinese texts, including genealogies and Qing dynasty government employee records, into organized, searchable databases.
This transformation addresses a longstanding challenge in classical Chinese studies: the labour-intensive process of manual data entry. The presentation outlines a comprehensive pipeline designed to process millions of pages from historical Chinese texts, highlighting the complexities of layout identification and the precision needed for effective text extraction. Central to this effort is customized Optical Character Recognition (OCR), which significantly improves data extraction accuracy and enables the identification of key fields through Named Entity Recognition (NER) models.
The outcome is the creation of clean, tabular databases that enhance accessibility, allowing researchers to locate and analyze Chinese historical content with unprecedented efficiency. By exploring these methodologies and their implications, this presentation aims to demonstrate how integrating advanced technological tools can not only enrich scholarly inquiry in the digital humanities but also provide deeper insights into the patterns and individual narratives within Chinese history. This approach promises to revolutionize data collection in the field, paving the way for more efficient research practices.
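For readers who want a concrete picture of the final stages described above, here is a minimal Python sketch of turning recognized text into a searchable table. It is an illustration only and not the speaker's pipeline: the `OfficialRecord` schema, the comma-separated line layout, and the sample lines are all invented for this example, and a simple rule-based splitter stands in for a trained NER model. The layout identification and OCR stages are assumed to have already produced the text lines.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class OfficialRecord:
    """Hypothetical schema for one personnel entry (assumed for illustration)."""
    name: str
    native_place: str
    office: str


def extract_record(line: str) -> OfficialRecord:
    """Rule-based stand-in for an NER model.

    Assumes each recognized line has exactly three fields separated by
    the full-width comma; a real model would label spans instead.
    """
    parts = [p.strip() for p in line.split("，")]
    if len(parts) != 3:
        raise ValueError(f"unexpected layout: {line!r}")
    return OfficialRecord(*parts)


def to_csv(records) -> str:
    """Serialize the records as a clean tabular file with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(OfficialRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]


# Invented sample lines standing in for OCR output (not real archival data).
ocr_lines = ["張三，直隸，知縣", "李四，江蘇，知府"]
records = [extract_record(line) for line in ocr_lines]
table = to_csv(records)
matches = search(records, native_place="江蘇")
```

Once the fields sit in a table, a question such as "which officials came from a given province" becomes a simple filter rather than a manual read through page images.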
Guest Speaker
Guenther Lomas is a graduate of the University of Toronto and the founder of Sigtica, which specializes in document intelligence and the use of AI to transform unstructured data into structured databases. His work applies artificial intelligence to the digital humanities, particularly quantitative history and cultural heritage preservation. In his latest project, a collaboration with the Lee-Campbell Research Group at the Hong Kong University of Science and Technology, he is using Qing dynasty employee records to develop tools that facilitate the digitization of classical Chinese texts.