Blog post written by Lucile Berset
In my mémoire (MA thesis), Systematic Investigation of Linguistic Variation in Historical Data: Normalising and Data Mining Late Modern English in the Letters of Artisans and the Labouring Poor, submitted in May 2021, I explored and tested some of the possibilities that digital tools offer for locating and studying orthographic variation in the language of the labouring poor in Late Modern England. To this end, a sample of pauper letters was selected and specifically prepared for quantitative analysis. My results were interesting, although not always reliable because of the quality of the sources. It is rather the method I developed that is likely to be helpful for future investigations of the language of the poor and/or other historical data. This blog post presents my work and highlights what my mémoire brings to the field of socio-historical linguistics. It also stresses the importance of interdisciplinarity and of finding the right balance between computer tools and human intervention when preparing and analysing the data.
Sampling
If you have been following our activities, you will already know that the collection of pauper petitions created by Tony Fairman contains over 2000 letters, written between 1795 and 1834. The work of corpus creation is ongoing, and each handwritten document is being transcribed into digital format. In addition, each transcription includes a meta-data header that provides specific information related to the letter (date, year, author, etc.). Transcription and meta-data preparation are done manually and are therefore highly time-consuming. Because each transcription included in my study had to be carefully checked, coded, and recoded, not all letters in the collection could be retained. I sampled 100 petitions, representing nearly 24,000 orthographic units, and created what can be considered a “mini-corpus”. Each letter was systematically prepared so that it could be processed with Textable, an add-on to the open-source text-mining software package Orange Canvas.
From Handwritten Letters to Processable Data
Before I was able to process the pauper letters with Textable, several steps of preparation were necessary. Using existing and partly coded transcriptions saved me a considerable amount of time, although the remaining work of data preparation was still significant. To compare the highly variable writing of the poor with contemporary uses of English, the original language of the letters had to be “normalised”, which means that each non-standard occurrence had to be located and standardised. This task was done semi-automatically with the help of the spelling variation detection software VARD 2. Although the software automatically highlighted most variant occurrences correctly, some non-standard forms were not recognised. In my mémoire I gave the example of “ham”, commonly encountered in the language of some applicants. This orthographic unit does not refer to pig’s meat but is rather an alternative form of the first person singular of the verb to be, which illustrates h-insertion. VARD 2 did not mark the orthographic unit “ham” as a variant since the word exists as such and is found in the dictionary used by the software. In total, 53 occurrences of “ham” were found in my mini-corpus, all of them examples of h-insertion before “am”. Similar issues were encountered with other forms and required a close check of the automatic marking.
Once every occurrence of spelling variation was highlighted thanks to a mix of automatic marking and labour-intensive manual checking, a first relevant result could be determined. In the whole sample, 17.2% of all orthographic units contain spelling variation. Since the 100 letters in my sample represent less than 5% of the collection, it would be interesting to compare this result with more data once the work of corpus creation is further advanced. The variation percentage only reflects the direct proportion of spelling variation at the level of orthographic units. As some authors have produced more writing than others and the length of each letter varies, further statistical analyses would be required to obtain the relative percentage of spelling variation.
Variant words were then normalised into standard English. Every change affecting the original text was coded and retained in a new version of the letters.
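The caveat about letter length can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the two letters and their (original, normalised) word pairs are invented, the function names are hypothetical, and nothing here reproduces the output of VARD 2 or the figures of the mémoire. A unit counts as a variant when its original form differs from its normalised form.

```python
from statistics import mean

# Invented (original, normalised) pairs; NOT taken from the corpus.
letters = {
    "letter_a": [("ham", "am"), ("very", "very"), ("hill", "ill"),
                 ("porely", "poorly")],
    "letter_b": [("sir", "sir"), ("hope", "hope"), ("you", "you"),
                 ("are", "are"), ("well", "well"), ("and", "and"),
                 ("have", "have"), ("recieved", "received")],
}


def variant_count(tokens):
    """Count units whose original form differs from the normalised form."""
    return sum(original != normalised for original, normalised in tokens)


def pooled_rate(letters):
    """Share of variant units over all units, ignoring letter boundaries."""
    total_variants = sum(variant_count(t) for t in letters.values())
    total_tokens = sum(len(t) for t in letters.values())
    return total_variants / total_tokens


def per_letter_rates(letters):
    """Share of variant units computed separately for each letter."""
    return {name: variant_count(t) / len(t) for name, t in letters.items()}


rates = per_letter_rates(letters)
print(f"pooled: {pooled_rate(letters):.1%}")
print(f"mean of per-letter rates: {mean(rates.values()):.1%}")
```

In this toy data the pooled figure is 4 variants out of 12 units, whereas the two letters taken separately show 75% and 12.5%. The longer letter dominates the pooled figure, which is why a direct proportion at the level of orthographic units should not be read as a typical rate per letter or per writer.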
The next key step was the recoding of the meta-data, initially written in COCOA format. A quick test showed that Textable could not exploit the COCOA format and that the meta-data had to be formatted in XML. The recoding was done within Textable with two regular expressions (regex) that locate the original COCOA code and turn it into XML. Additional coding-related changes were made at that stage. Other historical linguists had already been confronted with the same issue: the meta-data of the Helsinki Corpus and of the Tagged Corpus of Early English Correspondence Extension (TCEECE) both had to be recoded from COCOA to XML.
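To give an idea of what such a recoding involves, here is a minimal Python sketch. It is not the pair of expressions used in the mémoire, and it runs outside Textable. It assumes that a COCOA reference has the shape `<CODE value>`, that the target XML is a simple `<CODE>value</CODE>` element, and that the field codes `A`, `D` and `P` and the header content are purely hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Assumed COCOA shape: "<CODE value>", e.g. "<D 1821>".
# Group 1 is the field code, group 2 the value up to the closing bracket.
COCOA_REFERENCE = re.compile(r"<([A-Za-z]+) ([^<>]*)>")


def cocoa_to_xml(text):
    """Rewrite every COCOA reference in text as an XML element."""
    def to_element(match):
        tag = match.group(1)
        value = match.group(2).strip()
        # Escape &, < and > so that the result stays well-formed XML.
        return f"<{tag}>{escape(value)}</{tag}>"
    return COCOA_REFERENCE.sub(to_element, text)


# Hypothetical header; codes and values are invented for illustration.
header = "<A Jane Doe>\n<D 1821>\n<P Essex & Kent>"
print(cocoa_to_xml(header))
```

The design choice worth noting is the escaping step: values copied from a header may contain characters such as `&`, which are harmless in COCOA but would make the XML invalid if copied over unchanged.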
Systematic Sample Search
The case study part of my mémoire showed some of the possibilities offered by Textable to systematically explore the language of the poor. To illustrate how historical phonological variation can be studied with the help of technology, I picked the h-insertion variable, commonly encountered in the pauper petitions. A workflow created with Textable generated a list of all the words affected by the feature and showed the distribution of h-insertion throughout the mini-corpus. To test the external variable sex, I compared the distribution of the phenomenon between male and female authors. The use of this variable is, however, problematic in the case of pauper applications, as it is impossible to be sure of the sex of the author. When preparing my meta-data, I assumed that the sex of the author corresponded to the name of the person who signed the letter. The question of authenticity is unfortunately more complex, and frequently the applicant was not the writer of the letter they signed. This specific issue is currently being worked on with an authenticity flow-chart created by Anne-Christine Gardner. It allows us to determine a degree of certainty for what is called the “autographicality” of a letter, i.e. how likely it is that the name at the bottom of the letter corresponds to the hand that wrote it. This issue makes the sex-related results obtained in my case study unreliable.
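The logic of such a search can be sketched in a few lines of Python. This is not the Textable workflow itself, only an approximation under loudly stated assumptions: the letters, identifiers and word pairs are invented, the `sex` field stands for the signatory recorded in the meta-data (with all the reservations expressed above), and h-insertion is detected with a simple heuristic, namely an original form beginning with "h" whose normalised form begins with a vowel.

```python
from collections import Counter

VOWELS = "aeiou"


def is_h_insertion(original, normalised):
    """Heuristic: original starts with 'h', normalised starts with a vowel."""
    o, n = original.lower(), normalised.lower()
    return o.startswith("h") and bool(n) and n[0] in VOWELS


def h_insertion_by(letters, variable):
    """Return affected forms, hits per group and rate per group."""
    forms, hits, totals = Counter(), Counter(), Counter()
    for letter in letters:
        group = letter[variable]
        for original, normalised in letter["tokens"]:
            totals[group] += 1
            if is_h_insertion(original, normalised):
                hits[group] += 1
                forms[original.lower()] += 1
    rates = {group: hits[group] / totals[group] for group in totals}
    return forms, hits, rates


# Invented letters; "sex" reflects the signatory, not necessarily the writer.
letters = [
    {"id": "L001", "sex": "F",
     "tokens": [("ham", "am"), ("sorry", "sorry"), ("hill", "ill"),
                ("house", "house")]},
    {"id": "L002", "sex": "M",
     "tokens": [("i", "i"), ("ham", "am"), ("here", "here")]},
]

forms, hits, rates = h_insertion_by(letters, "sex")
print(forms.most_common())  # list of affected forms with their frequency
print(dict(hits), rates)    # distribution across the chosen variable
```

Because the function takes the name of the meta-data field as a parameter, the same search could be run on another external variable, or restricted to letters whose autographicality has been assessed, once that information is available in the headers.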
What to Keep
Although some of my results cannot be regarded as significant, the methodology I developed remains relevant. It can be reused and/or adapted for further research, with more data and different variables. All in all, my experience writing this mémoire highlighted three main points. The first is the considerable potential of digital tools for historical data. The combination of VARD 2 and Textable was successful and allowed me to systematically locate and analyse language variation. The amount of manual work was nonetheless considerable and has to be acknowledged. Digital tools are no magic wands, but they do offer innovative possibilities to study diachronic language change if used appropriately. This leads me to the second point, which is the interdisciplinary nature of socio-historical linguistic investigations. The use of technology can only be successful if researchers are able to analyse their results in the right context. In addition to a good command of the software, a strong knowledge of the socio-historical background in which the pauper petitions were produced is necessary to interpret the results. Finally, it is clear that such a project can only benefit from teamwork. To some extent, my mémoire did benefit from the work of others, since I used existing transcriptions, but it remained mostly an individual project, written with limited contact with my supervisors because of the global pandemic. I am convinced that collaboration between individuals from various fields (history, historical linguistics, corpus linguistics, computer science, etc.) is the key to success in socio-historical linguistics, especially for a project such as The Language of the Labouring Poor in Late Modern England. I am very much looking forward to following how the project carries on and to seeing whether my mémoire can contribute to the research.