Meaning extraction from large text data: Thematic analysis via corpus linguistics
Date:
09/06/2025
Organised by:
University of Southampton
Presenter:
Dr Justyna Robinson and Dr Rhys Sandow
Level:
Entry (no or almost no prior knowledge)
Contact:
Penny White
NCRM Centre Manager
p.c.white@southampton.ac.uk

Venue: Online
Description:
The problem: Your team has collected thousands of words of data. You try a traditional thematic analysis of the text. Soon, colour coding, close reading, and writing ad hoc reflections about the text become too onerous a task. You doubt the validity of your observations. You wish there were a way to streamline the process and extract key themes from the data in a faster and empirically valid way.
Solution: Join us for a session in which we showcase empirical methods for the extraction and analysis of meaning, concepts, and themes in texts. The session will provide training in corpus linguistics and mixed-method tools that enable the analysis of texts in an empirical, bottom-up fashion. Through a range of case studies, you will be guided to extract meaning and other thematic patterns from texts to gain insight into the thoughts and behaviours of the authors of those texts. We will share best practices on the thematic analysis of various data types, such as diaries, interview transcripts, data scraped from the web, and outputs of both new and traditional media. We will also demonstrate ways of using the results of such analyses to answer research questions, develop business strategy, or inform public policy.
This session will be run by researchers from the University of Sussex’s Concept Analytics Lab (https://conceptanalytics.org.uk/). We will demonstrate solutions developed for a variety of problems and text types coming from our work with medical sciences, psychology, economics, and the energy industry. We will also show how linguistic patterns within or between texts (e.g. those that differ demographically or diachronically) can be explored, particularly through the use of new visualisation techniques. The workshop will conclude with a showcase of next-generation textual analysis tools that have been developed at Concept Analytics Lab.
This will be a practical session, giving attendees hands-on experience with corpus analysis tools. The course will consist of six hours of training in one day [9.30am - 5pm] and will be delivered online.
The course covers:
- How to extract meaning from large textual data
- How to build a corpus using textual data
- How to engage with existing corpora, such as multi-billion word corpora scraped from the web
- How to use corpus methods for bottom-up and top-down research
- Techniques for the visualisation of unstructured language data
- An introduction to discourse analysis and its application to corpora (corpus-assisted discourse analysis)
By the end of the course participants will:
- Know how to engage a suite of mixed-method corpus linguistic tools to extract meaning from a corpus
- Be able to use corpora to answer a variety of research questions
- Be able to build their own corpora
- Be able to conduct comparative corpus analysis (e.g. between texts that differ demographically or diachronically)
Programme:
9:30: Welcome and introduction to corpus linguistics
10:00: Interrogating existing corpora - quantitative analysis
12:00: Lunch
13:00: Interrogating existing corpora - qualitative analysis
15:00: Break
15:15: Building your own corpus
16:15: The Concept Cruncher: The next generation of text analysis
16:45: Final remarks
Speakers:
Dr Justyna Robinson is a Director of Concept Analytics Lab at the University of Sussex. She researches meaning in language and is interested in methods of analysing meaning empirically. Her publications focus on ways of researching meaning from historical perspectives (2012), from cognitive angles (2014), using socio-demographic information and other text metadata (2012, 2022), and using corpus and statistical methods (2014, 2022). She researches meaning represented by words (2010), concepts, and themes (2017, 2023). With the research team at Concept Analytics Lab, she has delivered a range of projects investigating current meanings of loneliness, ageing, UK trade deals post-Brexit, political manifestos, recycling practices, and post-Covid behaviour changes.
Dr Rhys Sandow is a Senior Research Associate at Concept Analytics Lab, University of Sussex. He specialises in applying corpus methods to answer applied research questions, including in collaborative work with economists, psychologists, historians, and medical humanities researchers, as well as organisations in the private sector. He also specialises in sociolinguistic variation and change, including its intersection with corpus linguistics, where he has worked as an expert witness in a legal context. He has published academic articles and book chapters on corpus linguistics and sociolinguistics and has a forthcoming co-edited book, Sociolinguistic Approaches to Lexical Variation in English, to be published by Routledge.
Cost:
The fee per teaching day is: £60 for students registered at a university; £150 for staff at academic institutions, Research Councils researchers, public sector staff, and staff at registered charity organisations and recognised research institutions; and £350 for all other participants.
In the event of cancellation by the delegate a full refund of the course fee is available up to two weeks prior to the course. NO refunds are available after this date.
If it is no longer possible to run a course due to circumstances beyond its control, NCRM reserves the right to cancel the course at its sole discretion at any time prior to the event. In this event every effort will be made to reschedule the course. If this is not possible or the new date is inconvenient a full refund of the course fee will be given. NCRM shall not be liable for any costs, losses or expenses that may be incurred as a result of its cancellation of a course, including but not limited to any travel or accommodation costs.
The University of Southampton’s Online Store T&Cs also continue to apply.
Website and registration:
Region:
South East
Keywords:
Digital Social Research, Mixed Methods, Discourse Analysis, Corpus Analysis, Thematic Analysis, Data Visualisation, Corpus linguistics, Text/language as data, Digital humanities, Discourse analysis