Defining Privacy
A critical investigation of Canadian political discourse


3.1 Corpora

The use of a corpus (or the plural corpora) for the investigation of language existed before the advent of computerized analysis, although, as we will briefly explore here, the computer had a profound effect on expanding the range of methodologies and theories involved in the practice of text analysis. But first, it is important to operationalize the term ‘corpus’ so that it can be better understood in the context of electronic text analysis.


Put simply, a corpus is a collection of texts assembled with intention. There is a general consensus that the texts should be an authentic and balanced representation of a language within the context of its use (Adolphs 2; Baker 2; McEnery, Xiao, and Tono 21; Sinclair, Trust the Text 13; Tognini-Bonelli 57). Corpora may consist of spoken or written language and thus are a sample of human behaviour, but it is important to note that a sample of behaviour does not constitute the behaviour itself (Stubbs, Text and Corpus Analysis 233). The corpus is, according to Hunston, “neither good or bad in itself” (26); it is merely a constructed record arranged for a purpose (Hunston, Corpora in Applied Linguistics 26; Stubbs, Text and Corpus Analysis 233).

A corpus is authentic if it consists of material, as Tognini-Bonelli describes, that has been “taken from genuine communications of people going about their normal business” (55). It must be noted that it is an oversimplification to assume that every single word, phrase or sentence that occurs in the corpus is completely indicative of the language under study, because it is difficult to fully articulate what should count as ‘language use’ (Tognini-Bonelli 55). For this reason, the authenticity of a corpus is closely related to its representativeness. A balance exists if the language use in the corpus is generalizable to the variety of language it is meant to represent within its greater context (Baker 26; McEnery, Xiao, and Tono 21; Tognini-Bonelli 57; Yates 103).

Corpora come in many shapes and sizes and can include whole texts or samples of texts (Baker 30; Hunston, Corpora in Applied Linguistics 32; Sinclair, Corpus, Concordance, Collocation 24). Monitor corpora grow continually, with new texts being added that are representative of the language-in-use, like newspaper archives (Baker 30; McEnery and Hardie 246; Sinclair, Corpus, Concordance, Collocation 26). Static or reference corpora are more generalized and discontinuous (Sinclair, Corpus, Concordance, Collocation 24) and they try to represent a particular type of language over a specific period of time (McEnery and Hardie 8). An example of a static corpus is the Lancaster-Oslo/Bergen (LOB) corpus, which represents a ‘snapshot’ of modern British English in the 1960s (McEnery and Hardie 9). Other categories of corpora include: comparable corpora, which enable researchers to compare similar content in different languages; parallel corpora, which involve translated texts (the Canadian Hansard, published in both English and French, is a good example); learner corpora, which focus on texts produced by learners of a different language; and historical or diachronic corpora, containing texts from different periods of time that help trace the development of language or language use (Hunston, Corpora in Applied Linguistics 15-16).

Advances in computer processing technology have made it increasingly possible to work with larger and larger corpora, and in the words of Sinclair, this access allows for a “quality of evidence that has not been available before” (Corpus, Concordance, Collocation 4). But the purpose of a corpus as a research tool must always be clearly articulated. There is a distinction between what is known as ‘corpus-based’ research and ‘corpus-driven’ research. The former is concerned with deductive research, where theories about language and language use are tested and confirmed based on corpus data (Tognini-Bonelli 65). The latter has a focus on inductive research, where a hypothesis is formed based on observations made about the corpus data, and the subsequent generalizations and theories must be “fully consistent with, and reflect directly, the evidence provided by the corpus” (Tognini-Bonelli 85). The difference, in essence, is temporal. In corpus-based research, the theories come before the corpus, and in corpus-driven research the theories come after the corpus.


While corpus-driven research is essentially a product of the increasing availability of computer-assisted text processing, methods of text analysis pre-date the computer. The manual analysis of corpora has its origins in the 12th and 13th centuries (Adolphs 5). The earliest texts were primarily religious manuscripts, and corpus research was concerned with the production of concordances, a text analysis method that organizes the data so that a particular key word or phrase appears with a sample of its accompanying text to provide context (Adolphs 5). In the 1950s, an Italian Jesuit priest named Father Roberto Busa began a project that would ultimately give birth to the field of Humanities Computing by producing the very first computerized concordance, placing the works of St. Thomas Aquinas and others on punch cards to undergo mechanical processing (Hockey). By the 1960s, other scholars were beginning to see the potential of computer-assisted text processing for tasks such as authorship determination and ancient language research (Hockey).


Technological advances in machine-readable corpora have increased the interest in the study of language and corpus-driven studies, highlighting the distinction between the empiricist and rationalist approaches to research. Empirical methods of electronic text analysis allow for the observation of trends and the creation of statistics based on naturally occurring language. The frequency of words can be counted, and concordance lists can be created and analyzed. Texts can undergo quantitative analyses that can be reproduced by other researchers. This is in contrast to the rationalist tradition of linguistics, popularized by scholars like Noam Chomsky. He argued that the focus of language study should be on ‘competence’ (the internalized knowledge of a language) rather than its external use, known as its ‘performance’ (Adolphs 6). His argument was that performance data does little to inform or reveal knowledge about competence (Adolphs 6). He also argued that there could never be a sample of naturally occurring language large enough to be a true representation of language use (Adolphs 6; McEnery, Xiao, and Tono 3).
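The two operations named above, counting word frequencies and building a concordance, can be illustrated with a minimal Python sketch. This is not the software used in this study; the tokenization rule (lowercased runs of letters, with internal apostrophes or hyphens allowed) and the function names are assumptions made for illustration only. The sample text is taken from the Hansard excerpt reproduced in Figure 3-2.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters, optionally joined by
# internal apostrophes or hyphens (e.g. "government's").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['’-][a-z]+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of tokens kept on each side of the keyword.
    """
    tokens = tokenize(text)
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = (
    "However, I am very disappointed to have to rise once again to protest "
    "this government's extremely reprehensible actions. I would have thought "
    "that, after three years, it would have finally understood. However, once "
    "again, the government has been caught spying on its own people."
)

frequencies = word_frequencies(sample)
hits = concordance(sample, "government", width=3)
```

Because the procedure is fully specified, another researcher running the same code over the same text obtains the same counts and the same concordance lines, which is the sense in which such quantitative analyses are reproducible.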

While the availability of large computerized corpora has steadily increased with the power of the computers available to process them, in practice, the distinction between corpus-based and corpus-driven research, along with their rationalist and empiricist approaches, may not be so clearly defined. While it is claimed that corpus-driven research begins free from theory, researchers must choose in advance the corpus they want to work with, along with the types of questions they may want to ask. In linguistics, this type of preliminary, instinctual knowledge about language is known as ‘intuition’ (Adolphs 6; McEnery, Xiao, and Tono 6), and it affects every stage of the research, from questions, to analysis, to interpretation. It is for this reason that researchers involved in electronic text analysis must make explicit the design of their corpora, their research methodology, and the identifiable theories underlying their work.

Here we return to the concept of text analysis, or ‘sense making’, in the words of McKee (16). If the explication of theory is key in all stages of electronic text analysis, but the task of the research is seen as fundamentally empiricist, we must ask the question: “Can we evaluate corpus evidence in the same way as we evaluate a text” (Tognini-Bonelli 2)? To rephrase Tognini-Bonelli’s question, is there a difference between electronic text analysis and electronic text description?

In short, the answer is yes. Texts, electronic or otherwise, do not exist in a vacuum. It is their context that gives them meaning. As McKee explains, a text cannot even be described without some implicit placement within a greater context (65). Despite the empiricist claims of the corpus-driven research methodology, data does not simply exist; it is always undergoing some form of interpretation. This begins with the questions that are asked of the corpus, continues in the reporting of the quantitative results of its analysis, and ends in the discussion that follows (Adolphs 6; McEnery, Xiao, and Tono 9). As the philosopher of science Kuhn wrote, “more than one theoretical construction can always be placed upon a given collection of data” (Kuhn 76).

Corpus-driven research should be treated not as an endpoint in linguistic research but as a means of entry, or as Adolphs describes, “a way into the data that is informed by the data itself” (19).

A corpus is a tool for researchers that consists of a balanced, authentic and representative sample of language, situated within an identifiable context of use. It consists of text, yet it is not a ‘text’ so much as an object of research that can be manipulated and reorganized electronically. While corpora primarily allow for the production of quantitative data, the transmission of that data is never free from interpretation, however explicit or implicit the researcher may be about theory. It is important, as a researcher who uses corpora as a tool, to be clear about the origins, design, and use of those corpora, so that the data may be replicable and open to scrutiny. Corpora may be used for quantitative or qualitative research, often as an entry point into a larger investigation of language in use. For a mixed methods research project, the methodological path is clear: observations made as a result of the data lead to hypotheses, which lead to generalizations about phenomena, which ultimately lead to a unified theoretical statement (Tognini-Bonelli 85). The results of any research in text analysis are only as good as the corpus itself (Sinclair, Corpus, Concordance, Collocation 13).

The Hansard Corpus

The House of Commons debates, commonly known as Hansard, are the official edited verbatim account of the proceedings of the House of Commons (Parliament of Canada, “Debates”). The debates are published in both English and French after every sitting day, and are publicly available (Parliament of Canada, “Debates”). Transcripts have been available for download in XML format starting with the 39th Parliament on April 3, 2006. Hansard is a complete record of the proceedings of the House of Commons, recording the speeches made by MPs in debate (Parliament of Canada, “Debates”). The transcripts also contain voting lists, written answers to questions, the text from the Speech from the Throne, as well as texts of addresses from foreign dignitaries (Parliament of Canada, “Debates”).

Table 3-1: Sessions of Parliament by Date
Parliament	Session	Dates
41st	2nd	October 16, 2013 - August 2, 2015
41st	1st	June 2, 2011 - September 13, 2013
40th	3rd	March 3, 2010 - March 26, 2011
40th	2nd	January 26, 2009 - December 30, 2009
40th	1st	November 18, 2008 - December 4, 2008
39th	2nd	October 16, 2007 - September 7, 2008
39th	1st	April 3, 2006 - September 14, 2007

The Hansard corpus used in this study has a total of 68,194,945 words. This includes all of the transcripts that comprise the 39th to the 41st Parliaments, which cover the period between April 3, 2006 and August 2, 2015.

The corpus has been split into two distinct types of groupings. The first grouping divides the corpus based on year, beginning with the year 2006 up to and including 2015. The second divides the corpus by Sessions of Parliament spanning the 39th to the 41st, as shown in Table 3-1 (Library of Parliament, “Parliaments”). This makes Hansard a diachronic monitor corpus; the corpus grows in size every time a new debate is held, and changes in the language used in the corpus can be observed over time.
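As an illustration of the two groupings, the following Python sketch assigns a sitting date to a calendar year and to a Session of Parliament. The session boundaries are transcribed from Table 3-1; the labels (such as "39-1" for the 1st Session of the 39th Parliament) and the function names are hypothetical conventions introduced here, not part of the Hansard data or of the procedure actually used in this study.

```python
from datetime import date

# Session boundaries transcribed from Table 3-1.
# Labels such as "39-1" are an assumed shorthand, not an official identifier.
SESSIONS = [
    ("39-1", date(2006, 4, 3), date(2007, 9, 14)),
    ("39-2", date(2007, 10, 16), date(2008, 9, 7)),
    ("40-1", date(2008, 11, 18), date(2008, 12, 4)),
    ("40-2", date(2009, 1, 26), date(2009, 12, 30)),
    ("40-3", date(2010, 3, 3), date(2011, 3, 26)),
    ("41-1", date(2011, 6, 2), date(2013, 9, 13)),
    ("41-2", date(2013, 10, 16), date(2015, 8, 2)),
]


def session_for(sitting_date):
    """Return the session label covering the date, or None between sessions."""
    for label, start, end in SESSIONS:
        if start <= sitting_date <= end:
            return label
    return None


def groupings_for(sitting_date):
    """Return both groupings (year and session) for one sitting date."""
    return {"year": sitting_date.year, "session": session_for(sitting_date)}


example = groupings_for(date(2009, 6, 1))
```

Dates that fall between sessions return None for the session, since no debates are recorded in those intervals.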

Figure 3-1: Sample of the Hansard corpus with XML markup
	<ParaText id="3694786">Mr. Speaker, I would like to begin by stating that I will be sharing my time with my colleague from <Affiliation DbId="170184" Type="2">Timmins—James Bay</Affiliation>. </ParaText>
	<ParaText id="3694787">I am very pleased today to move this motion to ensure that justice is served for Canadians. However, I am very disappointed to have to rise once again to protest this government’s extremely reprehensible actions.</ParaText>
	<ParaText id="3694788">I would have thought that, after three years, it would have finally understood. However, once again, the government has been caught spying on its own people.</ParaText>

The only pre-processing applied to the Hansard corpus involved the removal of the XML markup to create a raw text file. Figure 3-1 shows a sample of the Hansard corpus with XML markup and Figure 3-2 shows the same sample of the Hansard corpus with the XML markup removed (Hansard Vol. 147 No. 80, 4899). The XML tags were removed in order to facilitate an accurate calculation of word frequencies and to provide readable concordances. Were a different kind of analysis required, the XML markup could easily be accessed. The raw corpus data is stored in the original XML format; the processed text is merely a copy of the data. This is truly the benefit of the electronic processing of text: large amounts of data can be manipulated while maintaining the structure and format of the original copies.

Figure 3-2: Sample of the Hansard corpus with XML markup removed
Mr. Speaker, I would like to begin by stating that I will be sharing my time with my colleague from Timmins—James Bay. 
	I am very pleased today to move this motion to ensure that justice is served for Canadians. However, I am very disappointed to have to rise once again to protest this government’s extremely reprehensible actions.
	I would have thought that, after three years, it would have finally understood. However, once again, the government has been caught spying on its own people.
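The transformation from Figure 3-1 to Figure 3-2 can be sketched in Python with the standard library's xml.etree.ElementTree module. This is a minimal illustration under stated assumptions, not the tool used to prepare the corpus: the <Sample> wrapper element is hypothetical and is added only so that the fragment parses as a well-formed document, and the function name strip_markup is introduced here for illustration.

```python
import xml.etree.ElementTree as ET


def strip_markup(xml_string):
    """Return the text of every ParaText element, one paragraph per line.

    Nested tags such as Affiliation are dropped but their text is kept.
    """
    root = ET.fromstring(xml_string)
    paragraphs = []
    for para in root.iter("ParaText"):
        text = "".join(para.itertext())
        # Collapse runs of whitespace and trim the ends.
        paragraphs.append(" ".join(text.split()))
    return "\n".join(paragraphs)


# Assumption: <Sample> is a hypothetical root element wrapping the excerpt.
fragment = """<Sample>
<ParaText id="3694786">Mr. Speaker, I would like to begin by stating that I will be sharing my time with my colleague from <Affiliation DbId="170184" Type="2">Timmins—James Bay</Affiliation>. </ParaText>
<ParaText id="3694788">I would have thought that, after three years, it would have finally understood. However, once again, the government has been caught spying on its own people.</ParaText>
</Sample>"""

plain_text = strip_markup(fragment)
```

The function reads the XML and returns a new string, so the original markup is left untouched and remains available for other kinds of analysis.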

While the Hansard corpus technically consists of spoken rather than written language, there is some uncertainty surrounding this distinction. As Mollin highlights in her study of the British Hansard, these kinds of Parliamentary debates are more of a hybrid of spoken and written text (189). British Hansard transcripts are highly edited: repetitive speech, incomplete utterances, pauses, false starts and reformulations are omitted by the transcribers or editing staff (Mollin 189; Slembrouck 104). While no comprehensive study of the editing practices of the Canadian Hansard has been conducted, a skim of the text suggests that these characteristics have similarly been removed. For these reasons, the Hansard corpus may not be an appropriate choice for the study of spoken language, though, as Mollin concedes, analyses of content words, such as the one undertaken in this study, are likely not affected by the editorial changes (189).

The Hansard corpus was chosen for its availability, consistency, size, and scope. The transcripts are posted in a timely manner and share a consistent formatting scheme. The added benefit of the Hansard corpus is that it is readily available for download on the House of Commons website. Not only can other researchers access the data and replicate the text analyses, but the corpus is never in great danger of being lost due to its constant and public availability. The regularity of sittings allows for a sufficiently large corpus, which is of the utmost importance for this type of text analysis research, especially in the development and testing of context-specific theories of language use (Bayley 34). The corpus only contains transcripts of the proceedings of the debates in the House of Commons. The Hansard corpus represents the language used by the MPs and provides a complete record of the political history in the context of the issues debated in the House of Commons. For this reason, it is an appropriate corpus with which to study the frequency and meaning of the term ‘privacy’ within Canadian political discourse.
