NCRM videos

Suzanne McClure: A Methodological Approach to Creating a Heritage Library Corpus


Suzanne McClure presents at the methods@manchester Methods Fair 2018. Abstract: Tens of thousands of literary works are available in digital formats on public domain websites such as Project Gutenberg. Using linguistic tagging software, every word of a text can be marked for features such as part of speech and semantic meaning. This metadata can be useful in the quantitative analysis of literary texts, for example to identify genre and register, frequent keywords and sentiment. The systematic identification of language and stylistic features in such a collection can support a richer analysis of linguistic and stylistic patterns in heritage works. Additionally, tagged literary texts can be used in various disciplines such as historical linguistics and second-language learning. This paper will explore a methodological approach to creating a literary heritage corpus of prose fiction. To illustrate the practical application of corpus-based analysis, a 9.7-million-word corpus of 95 English novels published between 1911 and 1928 will be presented. Three linguistic tagging software packages will be discussed to explain how their descriptive metadata output can be used to investigate variability amongst a collection of heritage literary works.
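As a rough illustration of what word-level tagging metadata looks like and how it can feed a quantitative comparison, here is a minimal Python sketch. It is not the method or software presented in the talk: the lexicon `TOY_LEXICON`, the tag names, the helper functions and the sample sentence are all hypothetical, and a real tagging package would use a trained model and a much larger tagset.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; real taggers rely on
# trained models and much richer part-of-speech and semantic tagsets.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "old": "ADJ", "quiet": "ADJ",
    "house": "NOUN", "garden": "NOUN", "door": "NOUN",
    "stood": "VERB", "opened": "VERB",
    "in": "PREP", "onto": "PREP",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag(tokens):
    """Pair each token with a tag; words missing from the lexicon get 'UNK'."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]


def tag_profile(tagged):
    """Return the relative frequency of each tag in a tagged text."""
    counts = Counter(t for _, t in tagged)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


# Made-up sample text, not drawn from the corpus described in the abstract.
sample = "The old house stood in a quiet garden. The door opened onto the garden."
tagged = tag(tokenize(sample))
profile = tag_profile(tagged)

# Frequent keywords, here restricted to tokens tagged as nouns.
keywords = Counter(tok for tok, t in tagged if t == "NOUN").most_common(2)

print(tagged[:4])
print(profile)
print(keywords)
```

Computing a profile like this for each novel gives one row of descriptive metadata per text, which is the kind of table that could then be compared across a collection.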