Defining Privacy
A critical investigation of Canadian political discourse

Text-Analysis

3.2 Word Frequencies

Word frequency is a method of text analysis that refers to the numeric count of the words that are present in a corpus. Word frequencies are perhaps the most direct statistical data a corpus can provide (McEnery, Xiao, and Tono 52). The frequency count of every word (or selection of words) present in a corpus is known as the ‘observed absolute’ or the raw frequency; it is a whole number that is greater than or equal to zero (Gries, “Statistics” 269). While this data does not provide a lot of information in terms of proving the validity of a hypothesis or claim (McEnery, Xiao, and Tono 52), it does help to frame a corpus in terms of the context of language use (Adolphs 40), especially when examining the occurrence of a specific word or words. The production of these statistics is generally the first stage in any research project involving electronic text analysis (Sinclair, Trust the Text 28).

Once the observed absolute (raw) frequency of the words in a corpus has been generated, other types of related statistics can be produced. One method involves determining the observed relative frequency, which is the raw number of one word in the corpus divided by the total number of words in the corpus (Gries, “Statistics” 270). For example, if a corpus has 100 total words, and there are 10 occurrences of the word ‘book’, the relative frequency of ‘book’ would be 10/100 = 0.1, or 10%.
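
The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in this study: the function names and the 100-word toy corpus (10 occurrences of ‘book’ plus 90 placeholder words) are assumptions made for the example.

```python
from collections import Counter


def raw_frequencies(tokens):
    """Return the observed absolute (raw) frequency of every word."""
    return Counter(tokens)


def relative_frequency(word, tokens):
    """Raw frequency of `word` divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return raw_frequencies(tokens)[word] / len(tokens)


# Hypothetical toy corpus: 100 tokens, 10 of which are 'book'.
tokens = ["book"] * 10 + ["filler"] * 90

print(raw_frequencies(tokens)["book"])     # raw frequency: 10
print(relative_frequency("book", tokens))  # relative frequency: 0.1
```

Multiplying the result by 100 gives the percentage form used in the tables later in this section.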

Another statistical measure that can be determined using word frequency is a type-token ratio. The analysis of this ratio can be useful when determining the level of complexity in a corpus, especially when used to compare corpora against each other (Adolphs 39). Each individual occurrence of a word is known as a token, while each unique word is called a type (Adolphs 39). Consider the following sentences:

“There are many books in the library. Some of the books are for children, but most of the books are for adults.”

Processing the above sentences in terms of their frequency would result in the following lists:

(‘there’, ‘are’, ‘many’, ‘books’, ‘in’, ‘the’, ‘library’, ‘some’, ‘of’, ‘the’, ‘books’, ‘are’, ‘for’, ‘children’, ‘but’, ‘most’, ‘of’, ‘the’, ‘books’, ‘are’, ‘for’, ‘adults’) = 22 tokens

(‘there’, ‘are’, ‘many’, ‘books’, ‘in’, ‘the’, ‘library’, ‘some’, ‘of’, ‘for’, ‘children’, ‘but’, ‘most’, ‘adults’) = 14 types

Dividing the number of types by the number of tokens results in the type-token ratio. In this example the calculation is 14/22 = 0.64, or 64%. Higher ratios mean more variability in terms of language use (McEnery and Hardie 50). Large corpora tend to have very low type-token ratios, not because of the simplicity of the language, but because of the preponderance of high-frequency grammatical words like ‘the’ and ‘to’ (see Figure 3-3). In this case it is more accurate to use a sampling method to calculate the ratio, such as measuring the ratio for the first 2000 words and for each 2000 words thereafter, and then calculating the mean (Baker 52).
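
As a rough sketch, both the plain type-token ratio and the sampled (mean) version can be computed as follows. The tokenizing rule (lowercase, letters only) and the function names are assumptions for illustration; the sampled version here also discards any final chunk shorter than the chunk size, which is one possible reading of the sampling method rather than a detail taken from Baker.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens (a simple assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens):
    """Number of unique words (types) divided by number of words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, chunk_size=2000):
    """Mean type-token ratio over consecutive full chunks of `chunk_size` tokens."""
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:  # corpus shorter than one chunk: fall back to the plain ratio
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


sentence = (
    "There are many books in the library. Some of the books are for "
    "children, but most of the books are for adults."
)
tokens = tokenize(sentence)
print(len(tokens), len(set(tokens)), round(type_token_ratio(tokens), 2))
# prints: 22 14 0.64
```

The sampled version matters only for long texts; on the two example sentences it simply returns the plain ratio.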

While word frequency statistics can provide valuable insights about the nature of a corpus, when it comes to making comparisons between corpora, the validity of the data is dependent on the overall size of what is being compared (Adolphs 40). In other words, it is advisable to only make comparisons between corpora of similar lengths. If corpora of uneven lengths must be compared, the resulting data should be normalized, which means it must be adjusted to account for the size difference (McEnery, Xiao, and Tono 52).
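
One common way to normalize is to express each raw count as a rate per fixed number of words. The sketch below assumes a base of one million words and uses two hypothetical corpora; neither the base nor the figures come from this study.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Scale a raw count to occurrences per `base` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size


# Hypothetical corpora of uneven length.
small = normalized_frequency(50, 200_000)     # 50 hits in 200,000 words
large = normalized_frequency(120, 1_000_000)  # 120 hits in 1,000,000 words

print(small, large)  # 250.0 120.0
```

Although the larger corpus contains more occurrences in raw terms, the normalized figures show that the word is used more than twice as often, proportionally, in the smaller one.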

Another important consideration is the distribution of the words under investigation in the corpus (Baker 49; Gries, “Statistics” 272). While relative frequency statistics are valuable indicators of word use, it is important to determine the distribution of relative frequencies in order to see if the words under investigation are frequent simply because they are concentrated in one area, or if they occur across the corpus as a whole (Gries, “Statistics” 272).
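
A minimal sketch of how dispersion might be checked is given below. It splits a corpus into parts, reports the relative frequency in each part, and then summarizes the spread with a single score known as the deviation of proportions (DP), where 0 means the word is spread exactly in proportion to the size of each part and values near 1 mean it is concentrated in one place. The choice of DP, the function names, and the toy data are assumptions for illustration; this study reports per-year and per-session frequencies rather than a DP score.

```python
def per_part_relative_frequency(word, parts):
    """Relative frequency of `word` within each corpus part."""
    return [part.count(word) / len(part) if part else 0.0 for part in parts]


def deviation_of_proportions(word, parts):
    """Half the summed gap between observed and expected share per part."""
    total_tokens = sum(len(part) for part in parts)
    total_hits = sum(part.count(word) for part in parts)
    if total_tokens == 0 or total_hits == 0:
        raise ValueError("word must occur at least once in a non-empty corpus")
    return 0.5 * sum(
        abs(part.count(word) / total_hits - len(part) / total_tokens)
        for part in parts
    )


# Hypothetical corpora: four parts of ten tokens, four hits in total.
even = [["privacy"] + ["x"] * 9 for _ in range(4)]
clumped = [["privacy"] * 4 + ["x"] * 6] + [["x"] * 10 for _ in range(3)]

print(deviation_of_proportions("privacy", even))     # 0.0
print(deviation_of_proportions("privacy", clumped))  # 0.75
```

Both toy corpora have the same overall relative frequency (4 in 40), yet the second concentrates every occurrence in one part, which is exactly the situation a frequency figure alone cannot reveal.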

The investigation of the frequencies of word use in a corpus, at a deeper level, is an investigation of language in context (Baker 49; Tognini-Bonelli 87). The observation of cumulative patterns of repeated word use allows for the interpretation of a body of text in a way that is not possible by reading or listening alone. While the output of a word frequency analysis may be purely statistical, according to Burrows, the study of the words themselves uncovers the “underlying fabric of a text, a barely visible web that gives shape to whatever is being said” (“Textual Analysis”).

The Hansard Frequencies

The distribution and frequency of words in texts is incredibly uneven due to two broad categories of words found in the English language: content words and function words (Stubbs, Words and Phrases 39). Content words describe what the text is about, while function words help tie the content words together (Stubbs, Words and Phrases 39). The Hansard corpus is no different. Figure 3-3 shows the frequency of the top 50 words in the entire corpus.

Figure 3-3: Frequency of the top 50 words in the Hansard corpus

The word ‘the’ appears over 400,000 times in the Hansard corpus, which is almost twice as many times as the second most frequent word: ‘to’. The word under investigation in this study, ‘privacy’, has a total raw frequency of 6,478, while the relative frequency or ratio is 0.011%. While this is a fairly insignificant number in itself, the dispersion statistics provide more context.

Table 3-2 shows the raw and relative frequency of the word ‘privacy’ in the Hansard corpus. Between the years 2006 and 2015, as well as between the 39th and 41st Parliaments, there is an observable trend of increased usage of the word. This is especially apparent between the years 2013 and 2014, and between the first and second sessions of the 41st Parliament. The data from 2014 is especially compelling because its relative frequency is higher than that of any other year or session, which makes 2014 worthy of further investigation. Despite a few anomalies, there is an overall upward trend in the use of the word ‘privacy’ between 2006 and 2015, and between the 39th and 41st Parliaments.

Table 3-2: Raw and relative frequency of the word ‘privacy’

Year / Session   Raw    Ratio (%)
2006             356    0.007
2007             258    0.004
2008             252    0.005
2009             612    0.009
2010             533    0.009
2011             624    0.012
2012             552    0.008
2013             918    0.015
2014             1567   0.022
2015             806    0.020
39-1             538    0.006
39-2             308    0.005
40-1             20     0.004
40-2             612    0.009
40-3             1011   0.013
41-1             1287   0.008
41-2             2702   0.021

Examining the raw and relative frequencies together also illuminates how misleading it can be to rely on raw frequency alone. While the raw frequencies in 2009 and 2011 are within 12 occurrences of each other (612 and 624), the relative frequency for 2011 is higher (0.012% compared to 0.009%). This is also clearly illustrated by the frequencies in 39-2 and 40-1. While there seems to be a drastic reduction in raw frequency between the two sessions (from 308 to 20), the relative frequency differs only slightly (0.005% compared to 0.004%). This is because significantly fewer words were spoken in 40-1 overall.
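
The point can be made concrete by working backwards from Table 3-2: dividing a raw count by its ratio gives an estimate of how many words were spoken in that year or session. The sketch below does this for the four rows discussed above. Because the published ratios are rounded to three decimal places, the results are rough estimates only, and the function name is an assumption for the example.

```python
# (raw frequency, ratio in percent) for selected rows of Table 3-2
rows = {
    "2009": (612, 0.009),
    "2011": (624, 0.012),
    "39-2": (308, 0.005),
    "40-1": (20, 0.004),
}


def implied_total_words(raw, ratio_percent):
    """Estimate total words from a raw count and a rounded percentage ratio."""
    return raw / (ratio_percent / 100)


for label, (raw, ratio) in rows.items():
    estimate = implied_total_words(raw, ratio)
    print(f"{label}: about {estimate / 1_000_000:.1f} million words")
```

On these estimates, 2009 contains on the order of 6.8 million words against roughly 5.2 million for 2011, and session 39-2 contains about 6.2 million words against only about half a million for 40-1, which is why such different raw counts produce similar ratios.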

In the last chapter, the concept of privacy rights and the meaning of ‘reasonable expectation of privacy’ were discussed at length. The concordance analysis described in the upcoming section revealed a trend connecting the word ‘privacy’ with ‘rights’, both as ‘privacy rights’ and ‘right to privacy’.

The frequency of these phrases, as well as the phrase ‘reasonable expectation of privacy’, is of interest to this investigation. Table 3-3, Table 3-4, and Table 3-5 show the raw and relative frequencies for each of these phrases.

Tables 3-3, 3-4, and 3-5: Raw and relative frequencies of the phrases

                 privacy rights    right to privacy  reasonable expectation of privacy
Year / Session   Raw   Ratio (%)   Raw   Ratio (%)   Raw   Ratio (%)
2006             22    0.00042     5     0.00010     0     0.0
2007             9     0.00015     6     0.00010     0     0.0
2008             0     0.0         2     0.00004     0     0.0
2009             50    0.00074     11    0.00016     3     0.000044
2010             29    0.00046     19    0.00030     2     0.000032
2011             72    0.00133     44    0.00081     0     0.0
2012             47    0.00066     20    0.00028     1     0.000014
2013             46    0.00074     42    0.00068     6     0.000097
2014             101   0.00144     30    0.00043     13    0.000185
2015             35    0.00088     15    0.00038     10    0.000251
39-1             30    0.00032     9     0.00010     0     0.0
39-2             1     0.00002     4     0.00006     0     0.0
40-1             0     0.0         0     0.0         0     0.0
40-2             50    0.00074     11    0.00016     3     0.000044
40-3             98    0.00124     60    0.00076     2     0.000025
41-1             90    0.00059     59    0.00038     3     0.000020
41-2             142   0.00111     51    0.00040     27    0.000211

Both ‘right to privacy’ and ‘privacy rights’ are represented with a significant relative frequency in 2011. This same pattern exists in the 3rd session of the 40th Parliament. Since that session took place during the first three months of 2011, it can be deduced that there was a substantial amount of discourse concerning privacy rights in the House of Commons between January and March of that year.

The phrase ‘privacy rights’ shows another significant increase in frequency in 2014, though ‘right to privacy’ does not. This is echoed in the high frequency for ‘privacy rights’ in the 2nd session of the 41st Parliament.

The phrase ‘reasonable expectation of privacy’ shows a substantial increase in relative frequency starting in 2013 and continuing through 2015. A dramatic increase in relative frequency is also shown in the 2nd session of the 41st Parliament, which began in October of 2013. Again, it is clear that a specific discourse concerning the topic of a ‘reasonable expectation of privacy’ was conducted in the House of Commons during this time frame.

Clearly, the raw and relative frequencies for ‘privacy’ and its related phrases are all quite low, given that the total number of words in the corpus is almost 69 million. But regardless of the numbers, an examination of raw and relative frequencies shows areas of the corpus that merit additional analysis. This is a clear example of the way in which corpus data can pinpoint areas of interest for further research.

The manual examination of 69 million words, comprising tens of thousands of pages of text, would be an incredibly time-consuming task. The conclusive discovery of trends in the specific use of a word or phrase over the period from 2006 to 2015 would be practically impossible to achieve by reading the text alone.

The results of these frequency statistics will be further supported by the next section on concordances, though these frequency results stand on their own in terms of narrowing the field of focus for a closer and more detailed analysis of the Hansard corpus.
