A concordance is another method of electronic text analysis. Concordances serve the purpose of bringing together, or concording, passages of text that help to show how a word is used in context (Howard-Hill 4). Concordance outputs are not limited to whole words; they can also be tailored to show lists of letters, phrases, suffixes, and parts of speech (nouns, verbs, etc.) (Adolphs 5; McEnery and Hardie 35).
The most common format for a concordance is known as a Key Word in Context, or KWIC, and it is arranged so that all instances of a search item are in the middle of the page (Adolphs 52; Baker 71; Tognini-Bonelli 13). This search item is often referred to as a ‘node’, and all of the words on the left and right of the node are called the ‘span’. Descriptions of concordance data label the node as N, and the items on the sides as N-1, N-2, N+1, N+2, etc. (Adolphs 52), depending on their distance and position in relation to the node. Figure 3-4 is an example of a KWIC generated from the Hansard corpus where N = privacy.
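The KWIC layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular concordance program; the function name, tokenization, and column widths are all assumptions made for the example.

```python
def kwic(tokens, node, span=4):
    """Return Key Word in Context lines: each match of the node word is
    centred, with `span` tokens of context on either side (N-span .. N+span)."""
    node = node.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            # Right-pad the left context so every node aligns in one column.
            lines.append(f"{left:>40} | {tokens[i]} | {right}")
    return lines

# Illustrative input only; a real corpus would supply the token list.
tokens = ("the right to privacy is fundamental and privacy "
          "protections matter to Canadians").split()
for line in kwic(tokens, "privacy"):
    print(line)
```

Sorting such lines on the word at N+1 or N-1 then reduces to an ordinary string sort on the context fields.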
Figure 3-4: Selection of 25 random concordance lines
Just above the KWIC is a line stating that this particular list contains 25 instances out of a total of 918 matches, meaning that the concordance program found 918 occurrences of ‘privacy’ in this search. The node word, privacy, is found in the centre of the page, and the total span of each line is equal to 79 characters (including letters, punctuation and spaces).
The potential that concordance outputs have for the generation of hypotheses about corpora is immediately apparent (Adolphs 51). The nature of the concordance format provides a convenient layout for examining word or phrase use in context, along with the identification of trends or patterns in language use (Stubbs, Text and Corpus Analysis xviii). The example in Figure 3-4 shows 16 occurrences of the word ‘privacy’ in relation to the word ‘Commissioner’, one instance of the phrase ‘Privacy Act’, and one instance of the phrase ‘Access to Information, Privacy and Ethics Chair’. Of the remaining seven instances, when the words to the right of the node are examined, four include the phrase ‘privacy of Canadians’, two include the lemma ‘protect’, and the remaining instance contains the word ‘concern’. A lemma is a base word from which other words can be constructed, even though they may differ in form or spelling (Baker, Hardie and McEnery 104; Sinclair, Corpus, Concordance, Collocation 41). ‘Protection’ and ‘protected’ are both variations on the lemma ‘protect’.
Figure 3-5: Selection of 25 concordance lines sorted alphabetically at N+1
While concordances can be investigated manually in this manner, they can also be rearranged alphabetically on either side of the node. Figure 3-5 shows a sample of right node alphabetization. The concordance can be further sorted based on a selective number of objective criteria (Tognini-Bonelli 13). Using Figure 3-4 as an example, all of the lines containing the phrase ‘Privacy Commissioner’ have been filtered out as they were deemed unnecessary to this particular analysis. Alternatively, adding a second word to the concordance search (within a span of one or two words) can help identify particular themes of usage (Adolphs 55).
While computers make the production of concordances much easier, their history pre-dates the electronic age. Early concordance work was produced with the intention of studying quotations, allusions and figures of speech in literature, not everyday language (Sinclair, Corpus, Concordance, Collocation 42). What is considered to be the first concordance was hand-compiled for the Latin Vulgate Bible by Hugh of St Cher with the assistance of over five hundred monks in 1230 (McEnery and Hardie 37). Father Roberto Busa compiled the first automated concordance, a project which began in 1951 (Hockey; McEnery and Hardie 37), and by the 1960s scholars were beginning to see the value of concordances for the purpose of textual and literary analysis. The first generation of concordancers were held on large mainframe computers and used at a single site (McEnery and Hardie 37). They were generally only able to process non-accented characters from the Roman alphabet; accented characters would be replaced by a pre-determined sequence of characters, although these sequences were not standardized and differed from site to site (Hockey; McEnery and Hardie 38). Early concordancers also had difficulty pinpointing the exact location of citations in the text, as the raw textual information was stored on punch cards or tape. Variant spellings of words and the production of lemmatized lists were also problematic (Hockey).
The nature of the programming involved in creating concordance outputs at this time required the assistance of a computer programmer or engineer, something that was not accessible to all scholars (McEnery and Hardie 38). The second generation of concordancers solved this issue, as they were available as software packages on IBM-compatible PCs (McEnery and Hardie 39). While these concordance programs suffered from many of the same limitations as earlier concordancers, they made electronic text analysis more accessible (McEnery and Hardie 39). Since the inception of automated concordancing in the 1960s, the methods, accessibility and scope of the technique have drastically improved. Currently, concordance programs exist as downloadable software, web-based applications, and packages of pre-made code for those interested in computer programming.
While the production of concordance outputs is essentially another method in the practice of electronic text analysis, this does not mean the technique is one of complete objectivity. Corpus data is not an ontological reality; it is constructed and delimited by the researcher in an attempt to gather meanings about the discourse under study (Teubert 4). In other words, although the corpus exists and is tangible in many ways, it is not a stand-in for the reality of Parliament. It is a representation of reality that takes its own form and becomes an object in and of itself. Concordances provide the opportunity to examine language in context, and the structured nature of the output helps to ensure that analysts do more than pick examples that meet their preconceptions of the data (Stubbs, Text and Corpus Analysis 154). Yet the theoretical intention of the researcher is still present at every stage, from search choice to interpretation (Stubbs, Text and Corpus Analysis 154). What concordance outputs provide is the ability to present quantitative evidence of electronic text analysis that can be examined by all readers (Stubbs, Text and Corpus Analysis 154).
Concordances are what Stubbs refers to as “second-order data” (Words and Phrases 66). First-order data is the corpus, or what can be called the ‘raw data’; this data is too large for accurate observation and analysis, leading to the creation of second-order data, which is comprised of the word frequencies and concordance output (Stubbs, Words and Phrases 66). A large corpus generates a large amount of concordance lines, and although these can be managed through sampling, further statistical processing can be done to create what Stubbs calls third-order data, which are known as collocates (Stubbs, Words and Phrases 67).
Words in the English language have a tendency to appear with other words (Stubbs, Words and Phrases 17), giving phrases or groups of words a meaning that transcends the value of each individual word if considered separately (Sinclair, Corpus, Concordance, Collocation 104). Collocates are words that co-occur with other words, and lists of these words can be generated algorithmically, accompanied by statistics that determine their significance (Stubbs, Words and Phrases 29).
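As a rough illustration of how collocate lists can be generated algorithmically, the sketch below counts words co-occurring within a fixed span of the node and scores each with pointwise mutual information, one commonly used association measure. The function name, span size, and choice of statistic are assumptions for the example, not the procedure used in this research.

```python
import math
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words co-occurring within `span` positions of the node word,
    and score each with pointwise mutual information (PMI):
    log2( P(node, word) / (P(node) * P(word)) )."""
    tokens = [t.lower() for t in tokens]
    total = len(tokens)
    freq = Counter(tokens)            # word frequencies across the whole text
    co = Counter()                    # co-occurrence counts within the span
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            co.update(window)
    scores = {}
    for word, joint in co.items():
        pmi = math.log2((joint / total) /
                        ((freq[node] / total) * (freq[word] / total)))
        scores[word] = (joint, round(pmi, 2))
    return scores
```

Real concordance software typically offers several such measures (t-score, log-likelihood, MI) and normalizes counts differently, but the underlying logic of comparing observed co-occurrence against chance expectation is the same.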
In terms of this research, collocational statistics were generated but not used, simply because they did not provide any compelling or new evidence to support what had already been discovered through the frequency and concordance analysis. Notably, both Danielsson (112) and Wermter and Hahn (791) have come to the same conclusion regarding the usefulness of collocational data, arguing that frequency statistics alone provide strong enough evidence to support claims about language use.
The Hansard Concordances
A corpus as large as Hansard does not allow for the inspection of every concordance line, and there are many instances that are not worthy of inspection, such as the multiple instances of “Privacy Commissioner” in Figure 3-4. Sampling and alphabetical sorting make the manual inspection of concordance outputs easier and more efficient. That being said, Sinclair makes a valid point in saying that regardless of the thoroughness of the study, there will always be data left over to perform an even more comprehensive study (Corpus, Concordance, Collocation 65). Concordance analysis, much like word frequency calculation, has the purpose of identifying patterns of interest in the corpus that can be highlighted for further study.
A preliminary method of reviewing concordance output consists of simply scanning down the list and noting any observable patterns. The concordances are produced in order, which in a sense, becomes a timeline of the node word as it has been used in the corpus from the beginning to the end of the measurement period.
When faced with a large corpus such as Hansard, Sinclair suggests a methodical sampling method to make the analysis more manageable. This involves dividing the number of instances of the word by the number of concordance lines desired, using 25 concordance lines as a general standard (Sinclair, Reading Concordances xviii). For example, if there are 5000 instances of a word and 25 concordance lines are required, then 5000 is divided by 25 for a total of 200. This total is the gap between selections, meaning that 25 lines from every 200 lines should be sampled. Starting at concordance line no. 1, the first 25 concordance lines are selected, then lines 201 through 225, then 401 through 425 and so on until the last selection, which in this example begins at line no. 4801 (Sinclair, Reading Concordances xviii). The Hansard corpus was sampled in this manner, both by year and by Session of Parliament. This resulted in groups of seven to 18 concordance samples for each year, and 14 to 21 samples for each Parliament (depending, of course, on the frequency of ‘privacy’ in each section). Each concordance sample contained 25 lines.
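Sinclair's sampling procedure reduces to simple arithmetic, sketched below; the function name is illustrative.

```python
def sinclair_sample_blocks(n_instances, lines_per_sample=25):
    """Sinclair's systematic sampling: divide the total number of hits by
    the desired number of lines per sample to get the gap between
    selections, then take `lines_per_sample` consecutive lines at each
    gap boundary. Returns (start, end) line numbers, 1-indexed inclusive."""
    gap = n_instances // lines_per_sample   # e.g. 5000 // 25 = 200
    blocks = []
    start = 1
    while start <= n_instances - lines_per_sample + 1:
        blocks.append((start, start + lines_per_sample - 1))
        start += gap
    return blocks
```

With 5000 instances this yields blocks beginning at lines 1, 201, 401, and so on, ending with the block starting at line 4801, exactly as in the worked example above.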
Once the samples were generated, the resulting concordance lines were sorted alphabetically on the right of the node at position N+1, the first word to the right of ‘privacy’ (see Figure 3-5 for an example of this type of sorting). This position yielded the highest number of duplicate lines for omission, those lines being: ‘Privacy Act’; ‘Privacy Commissioner’; and ‘Access to Information, Privacy and Ethics’. The concordance lines containing those phrases were omitted because they did not accurately represent the pattern of use of the word ‘privacy’ as a means of determining its meaning. Each sample was then examined to determine any thematic patterns of word use.
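The sorting and filtering steps described above can be sketched as follows, assuming each concordance line is stored as a (left context, node, right context) tuple; the representation, function name, and phrase list are hypothetical.

```python
def sort_and_filter(kwic_lines, drop_phrases):
    """Sort concordance lines alphabetically on the word at N+1 (the first
    word to the right of the node), after dropping lines whose right
    context begins with any unwanted phrase (e.g. 'Commissioner')."""
    kept = [
        line for line in kwic_lines
        if not any(line[2].lower().startswith(p.lower()) for p in drop_phrases)
    ]
    # Sort key: first word of the right context, i.e. position N+1.
    return sorted(kept,
                  key=lambda line: line[2].lower().split()[0] if line[2] else "")

# Illustrative lines only; a real analysis would use sampled corpus output.
lines = [
    ("the", "privacy", "Commissioner said yesterday"),
    ("of", "privacy", "rights for Canadians"),
    ("a", "privacy", "breach occurred when"),
]
filtered = sort_and_filter(lines, ["Commissioner"])
```

Filtering on the N+1 word before sorting is what makes the duplicate institutional phrases easy to remove in bulk rather than line by line.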
Figure 3-6: Selection of concordance lines with a ‘personal’ context
Figure 3-7: Selection of concordance lines about ‘privacy and people’
Figure 3-8: Selection of concordance lines about ‘privacy and rights’
Figure 3-9: Selection of concordance lines with a ‘positive’ or ‘negative’ context
In answer to the research question posed in this section, the concordance output from the Hansard corpus identified the following patterns regarding the use of the word ‘privacy’: privacy is something personal and can imply ownership, information, or space (Figure 3-6); privacy affects certain groups of people, including Canadians, veterans, taxpayers, children, travelers, women, hunters, and law-abiding citizens (Figure 3-7); and privacy has something to do with rights, in the context of human rights, civil rights, constitutional rights, the Charter, and freedom of speech (Figure 3-8).
Grammatically, privacy is something that can be referenced in a negative or a positive light, and these phrases consist most commonly of verbs like breach and violate, or protect and strengthen (Figure 3-9); and privacy is often used as the first word in a phrase with nouns, such as ‘privacy interests’ or ‘privacy obligations’ (Figure 3-10).
Figure 3-10: Selection of concordance lines with ‘privacy’ as a phrase
While there were certainly outliers in the samples collected, including phrases like “privacy on the other hand” or “privacy screen”, the overwhelming majority of examples fell into one or more of the previous categories.
In terms of the specific phrases identified in the previous section on frequency calculations, a closer look at the phrase ‘privacy rights’ shows that it is often used in conjunction with the word ‘Canadians’ or, more interestingly, ‘law-abiding Canadians’ (shown in Figure 3-11). As discussed in Chapter 2, ‘privacy rights’ is not necessarily an accurate term, as there is no specific right to privacy in Canada. The connection between ‘privacy rights’ and ‘law-abiding Canadians’ is especially interesting, given that the judgment in R. v. Spencer ruled that privacy protections apply to all Canadians, even when they have clearly broken the law.
Figure 3-11: Selection of concordance lines with the phrase ‘law-abiding Canadians’
Again, while it is difficult to determine the specific reasons for these trends without investigating the corpus more thoroughly, the concordance data provides yet another layer upon which to focus the investigation in the next chapter.