Corpus-Assisted Discourse Studies

Presenter(s): Maria Leedham

Corpus-Assisted Discourse Studies (CADS) is a mixed-methods approach that integrates corpus linguistics (the computational analysis of large, organized text collections, or corpora) with discourse analysis (the study of how language constructs meaning beyond individual sentences). It’s now widely used by applied linguists and across the humanities and social sciences.

Unlike standalone corpus linguistics, which excels at quantifying language patterns, or discourse analysis, which offers rich qualitative insights, CADS harnesses both approaches to explore social phenomena through texts. CADS begins with a social question, using corpora—digital text collections—to analyse language systematically. Corpus linguistics uses software to process vast datasets, revealing patterns such as frequent words or collocates (words which co-occur). Discourse analysis, meanwhile, examines how these patterns build meaning in context, often through close reading or participant perspectives. By combining these, CADS offers a dual lens: quantitative breadth (e.g., how often fail appears in press coverage of social workers) and qualitative depth (e.g., why social work is often framed negatively). This synergy makes CADS versatile, applicable to fields from history to health communication.

By combining computational tools with qualitative reading, CADS helps uncover hidden patterns in language use. For example, it can show whether newspapers use more negative words when talking about social workers compared to other professions, or whether male and female authors describe emotions differently in literature. This method is particularly useful for studying how language shapes attitudes and social perceptions.

Core Methods and Tools

1. Corpus Linguistics Techniques:

Frequency Lists: Rank words or phrases by occurrence (e.g., the most frequent words in undergraduate writing in Engineering).
Concordance Lines: Display a word in context, sortable for patterns (e.g., every occurrence of social worker in a corpus of newspaper texts, shown with its surrounding words).
Collocates: Identify words frequently appearing together, offering semantic insights (e.g., exchange, barely and sideways as collocates of glances).
Keywords: Highlight words or phrases which are unusually frequent in one corpus when compared to a reference corpus (e.g., shrugs in female-authored novels and crouched in male-authored works).
Plot Dispersions: Show visually where features occur across texts (e.g., where the first-person pronoun I occurs in student essays).
Semantic Tagging: Using software to automatically categorize words into meaning domains (e.g., the words fast, slow, kilometres are tagged as Measurement: Speed and grouped together).
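As a rough illustration of the first and fifth techniques, a frequency list and a simple dispersion count can be sketched in a few lines of Python. This is only a minimal sketch: the tokenising rule, the function names (tokenize, frequency_list, dispersion) and the two toy "essays" are assumptions invented for the example, and dedicated corpus software handles tokenisation and display far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(texts, top_n=5):
    """Rank word types by how often they occur across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(top_n)

def dispersion(text, word, segments=4):
    """Count occurrences of `word` in each of `segments` equal slices of one text."""
    tokens = tokenize(text)
    hits = [0] * segments
    for i, tok in enumerate(tokens):
        if tok == word:
            # Map the token position onto one of the equal-sized slices
            hits[i * segments // len(tokens)] += 1
    return hits

# Invented toy "essays", for illustration only (not real corpus data)
essays = [
    "I think the bridge design is sound. I tested the bridge twice.",
    "The results show that the load was within limits.",
]

print(frequency_list(essays, top_n=3))
print(dispersion(essays[0], "i"))  # [1, 0, 1, 0]
```

The dispersion output is a crude numeric stand-in for a plot: each number is how many times the word falls in that quarter of the text.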

How Do These Techniques Work in Practice?

Suppose we want to analyse how politicians talk about the climate crisis. We could:

Use frequency lists to see which words appear most often (e.g., emissions, policy, crisis).
Examine concordance lines to check the context (e.g., is crisis used with urgent or exaggerated?).
Look at collocates to find hidden patterns (e.g., climate might appear with denial in some sources and action in others).
Compare with keywords from another dataset (e.g., how climate-related words differ between political speeches and scientific reports).
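The keyword comparison in the last step can be sketched as follows. The sketch scores each word with the log-likelihood statistic, which is one commonly used keyness measure; the source does not say which measure any particular tool applies, so treat that choice as an assumption. The two one-line "corpora" are invented for the example and both are assumed to be non-empty.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def log_likelihood(a, b, c, d):
    """Keyness score. a, b: word frequency in the study and reference corpus;
    c, d: total token counts of the study and reference corpus."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_text, reference_text, top_n=5):
    """Words relatively more frequent in the study corpus, highest score first."""
    study = Counter(tokenize(study_text))
    ref = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(ref.values())
    scored = []
    for word, a in study.items():
        b = ref[word]
        # Keep only words that are over-represented in the study corpus
        if a / c > b / d:
            scored.append((word, round(log_likelihood(a, b, c, d), 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Invented toy data, for illustration only
speeches = "we must act now on the climate crisis the crisis is urgent and we must act"
reports = "the data show that emissions rose and the models project that emissions will rise"

for word, score in keywords(speeches, reports, top_n=4):
    print(word, score)
```

With corpora this small the scores mean very little; the point is only the shape of the comparison: the same word is counted in both datasets and scored by how far its frequency departs from what the corpus sizes would predict.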

2. Discourse Analysis Techniques:

Critical Reading: Hypothesize meanings from reading individual texts.
Manual Coding: Classify examples (e.g., press mentions of social worker within context could be classified as ‘Over-zealous’ or ‘Failure to act’ depending on the perceived meaning).
Thematic Analysis: Identify recurring themes across a dataset (e.g. that female authors focus on emotions more than male authors).
Participant Data: Integrate interviews, questionnaires or observation (e.g., students’ views on Young Adult fiction).

While corpus linguistics focuses on identifying patterns, discourse analysis seeks to interpret those patterns. For example, if newspapers use crisis often when discussing immigration, discourse analysis asks: does this create fear? Does it influence public opinion? By combining both methods, CADS helps researchers move beyond counting words to understanding meaning.

These tools slice through data in varied ways, providing entry points for analysis. For instance, a keyword comparison might flag glances and stares as unusually frequent in female-authored fiction relative to male-authored fiction. Qualitative coding of these words in context could then indicate a greater focus on using the eyes to convey emotions.

How to Read a Concordance Table

A concordance table shows how a word appears in context. Here’s an example.

Figure x: Concordance lines for glances in Female-authored YA fiction:

The filename appears on the left, followed by items to the left of the search term (glances), then items to the right. Note that concordance lines are read from top to bottom, rather than from left to right. In this example, the concordance lines are sorted to the right, so at appears before between.

What this tells us:

The word glances often appears with words like exchange and barely (in the full concordance list), suggesting its use in describing subtle interactions.
This pattern might indicate that female-authored YA fiction places more emphasis on eye contact in social interactions when compared with the reference corpus of male-authored fiction.
A researcher could look at whether female or male characters are more likely to glance at others.
Comparison could be made with male-authored fiction to see how subtleties of communication are conveyed differently.
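The concordance layout described above (filename, left context, search term, right context, sorted on the words to the right) can be sketched like this. The function name concordance, the filenames and the two sentences are all hypothetical and invented for the example; they are not taken from the YA fiction corpus.

```python
import re

def concordance(texts, node, width=3):
    """Return (filename, left context, node, right context) for every hit of `node`."""
    lines = []
    for filename, text in texts.items():
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((filename, left, tok, right))
    # Sort on the right-hand context, so "at ..." comes before "between ..."
    return sorted(lines, key=lambda line: line[3])

# Hypothetical filenames and invented sentences, for illustration only
texts = {
    "novel_a.txt": "They exchange glances between the shelves.",
    "novel_b.txt": "She barely glances at him before leaving.",
}

for filename, left, node, right in concordance(texts, "glances"):
    # Right-align the left context so the search term lines up in a column
    print(f"{filename:<12}{left:>20}  {node}  {right}")
```

Aligning the search term in a single column is what makes it possible to read the lines from top to bottom and spot repeated words on either side.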

Why Are Collocates Important?

In Table x, the 12 most frequent collocates of glances from the same corpus are given. The table shows the collocate itself, its ordering or ‘rank’, then the total number of times the collocate appears alongside glances in the corpus (‘FreqLR’), followed by its frequency to the left and to the right of the search term.

Table x: Collocates of glances in Female-authored YA fiction:

A table showing linguistic data for twelve collocates with associated frequency and likelihood metrics.

Collocates help us understand the subtle meanings of words. If glances often appears with sideways, it might suggest secrecy or hesitation. If it appears with exchanged, it might indicate communication. By looking at collocates, we can better interpret how language is used in different contexts.
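To make the left and right counts concrete, here is a minimal sketch of a collocate count within a fixed window around the search term. Several things are assumptions made for the example: the window of two words either side, the function name collocates, and the invented sample text. The sketch also ranks by raw frequency only and lets the window cross sentence boundaries, whereas corpus tools typically add a statistical measure such as a likelihood score.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count words within `span` tokens to the left and right of `node`.
    Returns rows of (collocate, total, left count, right count)."""
    tokens = re.findall(r"\w+", text.lower())
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - span):i])
        right.update(tokens[i + 1:i + 1 + span])
    total = left + right
    rows = [(word, total[word], left[word], right[word]) for word in total]
    # Highest total first; alphabetical order breaks ties
    return sorted(rows, key=lambda row: (-row[1], row[0]))

# Invented sample text, for illustration only
sample = (
    "They exchange glances. She barely glances sideways. "
    "He glances sideways at her and they exchange nervous glances."
)

print("collocate  total  left  right")
for word, total, n_left, n_right in collocates(sample, "glances"):
    print(f"{word:<10} {total:>5} {n_left:>5} {n_right:>6}")
```

In this toy text sideways falls mostly to the right of glances while exchange falls only to the left, which is the kind of positional pattern the left and right columns are meant to expose.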

How to Carry Out CADS Research

Here’s a common procedure followed in CADS research:

Select a social issue to investigate and design research questions
Find – or build – a corpus of appropriate texts
Investigate the corpus through keywords, frequency lists, or other corpus techniques
Draw on techniques from discourse analysis to qualitatively sift and code the extracted data. This stage will involve extensive reading of extracts of texts or even whole texts.
Carry out iterative cycles of corpus investigation followed by close reading and thematic coding or other analysis.
(Optionally) collect participant data through interviews or other means.
Integrate all sources of data. Follow up on any areas where the findings diverge.

The first video illustrates some common CADS techniques, drawing on examples from several studies.

As you saw earlier, CADS research generally begins with a social question. The next three videos cover the following three research areas and questions:

In student writing - What is the impact of different linguistic, cultural and educational backgrounds on undergraduate student writing?
In social work - How do UK newspapers portray social workers?
In literature - How do female and male authors of Young Adult books present their fictional worlds? And what do the readers of YA fiction think about these differences?

These research areas and questions will be explored in the video examples within this tutorial.

About the author

Maria Leedham is a Senior Lecturer in Applied Linguistics and English Language at The Open University. Her main research expertise is in text analysis, using the methodology of Corpus-Assisted Discourse Studies (CADS). Previous projects include research into student assignments, social workers’ writing, newspaper texts and TV transcripts; she is currently leading research projects on tutor marking practices and Young Adult literature.

Published on: 22 April 2025
Event hosted by: The Open University
Keywords: Mixed Method | Textual Analysis | Critical Discourse Analysis | Digital Methods | Thematic Analysis
To cite this resource:
Maria Leedham. (2025). Corpus-Assisted Discourse Studies. National Centre for Research Methods online learning resource. Available at https://www.ncrm.ac.uk/resources/online/all/?id=20855 [accessed: 29 April 2025]
