One of Canada’s richest historical documents was becoming less useable every year – until a group of U of T political scientists, computer scientists and historians decided to intervene.
“You have this really unrivaled historical resource that is accumulated over time, and by virtue of its size and magnitude, is impenetrable,” says Christopher Cochrane, associate professor of political science at U of T Scarborough. “That is the status of Hansard prior to digitization.”
Since 1880, every word spoken in Canada’s Parliamentary debates has been transcribed and recorded in a massive document called the Hansard. To put its size in perspective, reading the entire thing at a pace of about a novel a day would take 66 years. It would then take another 28 years to read everything that was added during those 66 years.
In 2013, Cochrane teamed up with two then-postdoctoral fellows, two PhD students and Graeme Hirst, professor of computer science at U of T Scarborough, to create LiPaD: The Linked Parliamentary Data Project. Over the years the project received funding from the Social Sciences and Humanities Research Council, the Natural Sciences and Engineering Research Council and the Digging into Data initiative.
After years of work, LiPaD has digitized and made searchable Canada’s Parliamentary debates dating back to 1901. The team also designed and built a website to make the documents more accessible to the public, an effort headed by PhD student Tanya Whyte.
“Making these data very clearly accessible, very clearly searchable and opening it to everybody basically takes something that was becoming of little use because of its size and makes its use as enormous,” Cochrane says.
With a click, users can also find more information on parliamentarians, like their party and gender. The site is continually adding more information on members, including demographic profiles and election outcomes.
The process began with Canadiana, a non-profit heritage coalition, which scanned every page of the Hansard and posted the scans online. But because the pages were stored as pictures instead of text, they could not be searched with keywords. While this meant the LiPaD team did not have to scan the documents themselves, it presented a new challenge.
Many of the documents, some more than a century old, were physically damaged with specks, bits of dirt or smudges from printing. This made it challenging for optical character recognition (OCR) programs, which convert written or printed words into text a computer can read, to correctly register the contents of the pages.
The quality of the documents, particularly the stray specks, made it difficult to read French words. OCR settings that allowed French accent marks would mistake specks for accents, while OCR settings that read only English had trouble with genuine French accent marks. LiPaD is currently only available for the English proceedings, but Cochrane says the team is interested in eventually offering the French proceedings as well.
Even when reading only English, the OCR would often err. Hirst says a common stumbling point was the standard Parliamentary phrase “Hon. member” (short for “Honourable member”). If the “H” was even slightly obscured or broken, the computer would misread the term as “lion member.”
“That’s an easy one to fix, because obviously we would expect there to be zero occurrences of ‘lion member,’” Hirst says. “But it illustrates the kind of low quality that we were up against all over the place, including ones that weren’t so easy to fix.”
To remedy this, Kaspar Beelen, now an assistant professor at the University of Amsterdam, created a set of rules that let the computer recognize common mistakes and told it how to fix them.
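As a rough illustration of what a rule of this kind might look like, here is a minimal Python sketch. It is not the LiPaD team's actual code: the rule table `CORRECTION_RULES` and the function `apply_rules` are hypothetical names, and only the “lion member” misreading comes from the example Hirst describes.

```python
import re

# Hypothetical rule table: each entry pairs a pattern for a known OCR
# misreading with the text it should be replaced by. Only the
# "lion member" case is taken from the article; a real rule set
# would contain many more entries.
CORRECTION_RULES = [
    # "Hon. member" misread as "lion member" (optionally "lion. member(s)").
    (re.compile(r"\blion\.?\s+member(s?)\b"), r"Hon. member\1"),
]


def apply_rules(text, rules=CORRECTION_RULES):
    """Apply every correction rule, in order, to a piece of OCR output."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


sample = "I thank the lion member for the question."
print(apply_rules(sample))
```

Because the pattern requires the word “member” right after “lion,” an ordinary sentence about lions is left untouched. Mistakes without such a telltale context are the ones Hirst describes as harder to fix.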
These rules, Hirst says, can be used by other researchers and machines to understand text. The large volume of publicly accessible data, which can be downloaded in multiple formats, is also a powerful tool for future work.
“If you present the world with an interesting data set, people will find ways to use it that you yourself never thought of,” Hirst says. “I hope that there are people out there doing that with LiPaD right now.”
Ludovic Rheault, now an assistant professor at U of T, joined the project in 2014 and began conducting applied research projects using the data from LiPaD. In one paper, published in 2016, he used the data to study how the language parliamentarians use in debates can indicate anxiety levels.
Rheault says what appealed to him most about LiPaD was the opportunity to work with an interdisciplinary team at the intersection of computer science, political science and language.
“To grow as a citizen and a researcher, having the ability to look at what people do in other disciplines often times makes you realize that, ‘Oh, I was completely blind or oblivious to this solution or a particular problem,’” he says. “It helps you change the way you see problem-solving in general.”