Soh extends research project to Library of Congress

Oct 09, 2019 By Victoria Grdina

Elizabeth Lorang and Leen-Kiat Soh. Photo courtesy of Nebraska Today.

University of Nebraska–Lincoln researchers are teaming up with members of the Library of Congress to expand and enhance online access to the library’s electronic catalog.

Computer Science and Engineering professor Leen-Kiat Soh is continuing the research he began in 2013 on the “Image Analysis for Archival Discovery (AIDA)” project. He’ll resume his collaboration with University Libraries associate professor Elizabeth Lorang, as well as CSE doctoral students Yi Liu and Chulwoo (Mike) Pack.

AIDA’s original goal was to develop software that could scan newspapers for poetry, but the project evolved when members of the Library of Congress recognized its potential for improving access to its archives. The team’s research activities now involve using machine learning and deep learning techniques to analyze the library’s vast collection of historical documents.

“Each year the repository gets bigger and bigger. There are tens of millions of documents that get added, but they cannot process them fast enough,” Soh said. “They’ve realized they need to have machine learning components to increase accessibility and discoverability.”

To develop those components, Liu and Pack spent six weeks in Washington, D.C., working with the library’s digital collections, including Chronicling America, Beyond Words, and By the People.

“The main objective of this project is applying various machine learning technologies to determine how the library can utilize them for the purpose of creating new metadata,” Pack said. “This will ultimately link researchers and the public to relevant content in an online searching, discovery, and computational context.”

Over the summer, the team conducted several “mini explorative projects” to identify possibilities and issues as they built their solution. They tested algorithms and experimented with a variety of content types and artifacts, such as newspapers, handwritten letters, and even photos. One main focus area was improving the analysis of images, particularly images that also included text.

“The challenge is that the document images contain different types of content. One newspaper page could have an illustration, table, and textual content at the same time,” Liu said. “A good segmentation and metadata generation process can help with the machine-readable portion of the collection, or even may be able to make the process fully automated.”

The team also focused on eliminating roadblocks that prevented software from detecting or accurately analyzing content, such as a folded page, bleed-through text, or an uneven document scan.

“The human eye can detect these things very quickly, but for the machine it’s difficult,” Soh said.

Drawing on the feedback and discoveries from the summer, the AIDA team has laid the groundwork for a robust, scalable approach for the library. They’ll present it to the Library of Congress this semester.

According to Pack, the collaboration with the Library of Congress has been an experience that will benefit him and the field as a whole.

“The collaboration has allowed me to understand the gap between practice and research,” Pack said. “Working with the practical dataset from the library’s collection is a valuable opportunity to learn the advantages and improvable aspects of the state-of-art techniques.”

Soh said that after the project is completed this fall, he hopes the team can remain involved in its future growth in advisory roles. He also hopes to participate in the development of similar projects with the Library of Congress that will be equally powerful and fulfilling.

“This project is very hands-on and very real-world,” Soh said. “From innovation to algorithm development to applications to providing actual service to users, we’re able to see and make a direct impact.”