Sequence Analysis - Introduction

Presenter(s): Ian Thomas

This online resource takes you through the process of using Sequence Analysis (SA) in R to identify patterns in people’s longitudinal trajectories. It covers four steps: creating sequences, comparing sequences, cluster analysis, and describing clusters.

The TraMineR package in R is used to conduct SA, though the underlying concepts generally apply to other computer software such as Stata.

What are sequences?

Sequences are a way of representing and exploring longitudinal trajectories in social science research. Each sequence consists of a series of states representing a characteristic of interest measured at chronologically ordered time-points. For instance, in research exploring housing instability, states could reflect whether a person was housed or homeless measured at annual survey waves.

A state can take a limited range of values (e.g., housed or homeless), referred to as the alphabet of possible states. Spells represent consecutive time-points which share the same state. Where there is a change in state between two time-points, this is referred to as a transition, e.g., from being housed to experiencing homelessness.
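
As a minimal sketch of these terms, the base R code below takes one made-up sequence (six annual waves, invented purely for illustration) and derives its alphabet, spells, and number of transitions. The object names are hypothetical and this is not part of TraMineR.

```r
# Hypothetical sequence for one person across six annual survey waves
states <- c("housed", "housed", "homeless", "homeless", "housed", "housed")

# The alphabet is the set of distinct states that can occur
alphabet <- sort(unique(states))

# rle() collapses runs of identical consecutive values into spells
runs <- rle(states)
spells <- data.frame(state = runs$values, duration = runs$lengths)

# A transition occurs wherever a state differs from the one before it
n_transitions <- sum(states[-1] != states[-length(states)])

print(alphabet)
print(spells)
print(n_transitions)
```

Here the sequence contains three spells of two waves each, separated by two transitions.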

Sequences can differ in several ways which may be of sociological interest, such as:

States experienced and their duration, e.g., do certain groups of people never experience homelessness?
Timing of states, particularly when transitions between states occur, e.g., do certain groups of people transition from being housed to experiencing homelessness earlier than others?
Order in which states occur, e.g., do people experience homelessness as a one-off event, or do they cycle into and out of homelessness?

Creating a typology using sequence data can help shed light on shared social processes, inequalities, and structural patterns.

Figure 1: Example of a sequence of states measuring housing instability at 6 time-points, with basic structural elements of sequences highlighted.

Step 1: Creating sequences

TraMineR can accept longitudinal data in several formats. State sequence (STS) format is the most intuitive format. Each row in a data set reflects a unique person’s responses, with states measured at different time-points appearing as columns in chronological order.

Alternatively, rather than relating to individual time points, columns can relate to spells and include information on the state experienced and the length of the spell—also known as state permanence sequence (SPS). Data may also be split over multiple rows, with a single person having multiple entries that relate to the start and end time of spells, and the state experienced during the spell—known as spell format (SPELL).
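
To make the difference between formats concrete, the sketch below converts a small, made-up STS data set into one row per spell using base R only. The data, the column names, and the helper `to_spells()` are hypothetical; TraMineR has its own conversion utilities, which this sketch does not use.

```r
# Hypothetical STS data: one row per person, one column per annual wave
sts <- data.frame(
  id = c(1, 2),
  t1 = c("housed", "homeless"),
  t2 = c("housed", "homeless"),
  t3 = c("homeless", "housed"),
  t4 = c("housed", "housed"),
  stringsAsFactors = FALSE
)

# Convert one person's states into spell rows (start, end, state)
to_spells <- function(id, states) {
  runs <- rle(states)
  end <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1
  data.frame(id = id, start = start, end = end, state = runs$values)
}

# Apply the helper to each person and stack the results
spell <- do.call(rbind, lapply(seq_len(nrow(sts)), function(i) {
  to_spells(sts$id[i], unlist(sts[i, -1], use.names = FALSE))
}))
print(spell)
```

Person 1 ends up with three rows (housed, homeless, housed) and person 2 with two (homeless, housed).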

Figure 2: Examples of different formats of longitudinal data accepted by TraMineR.

Many of the SA functions in TraMineR work with a sequence object, which contains sequences for each case/person stored in STS format. When creating a sequence object, the user can attribute specific colours and labels to the different states, e.g., homeless = red, housed = green. Attributing colours and labels when creating a sequence object ensures that all subsequent analyses and visualisations are consistent.
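
A sequence object can be created along the following lines. This is a sketch using TraMineR's `seqdef()` on a small, made-up data set; the column names, state codes, and colour choices are assumptions for illustration.

```r
library(TraMineR)

# Hypothetical STS data: one row per person, one column per annual wave
housing <- data.frame(
  wave1 = c("housed", "housed", "homeless"),
  wave2 = c("housed", "homeless", "homeless"),
  wave3 = c("housed", "homeless", "housed"),
  wave4 = c("housed", "housed", "housed")
)

# alphabet, labels and cpal are given in the same order,
# so "homeless" is drawn in red and "housed" in green
housing_seq <- seqdef(
  housing,
  alphabet = c("homeless", "housed"),
  labels = c("Homeless", "Housed"),
  cpal = c("red", "green")
)

print(housing_seq)
```

Plots produced later from `housing_seq` then reuse these colours and labels.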

Step 2: Comparing sequences

Typology creation relies on identifying patterns of difference between sequences, so we need a numeric way of expressing difference. To this end, sequences are compared with one another to quantify the number of states they differ on, or their ‘distance’. This works by transforming each sequence into every other sequence in the sequence object. Three operations can be used to change elements of one sequence so that it matches, or aligns with, another.

Substitutions can be used to replace the values of states in one sequence to make it look the same as another sequence. Insertions add a state to a sequence, whilst deletions remove a state. Insertions and deletions, sometimes known as ‘indels’, can be used to align sequences of different lengths: inserting a state shifts all other states in the sequence up one place, whilst deleting a state shifts everything down one place.

Performing these operations incurs a ‘cost’, this being a value attributed to the use of that specific operation. Costs can be set by the researcher based on a priori knowledge of the topic area, or they can be data driven. Substitution costs are often determined based on the transition rate between states. Transitions that are less common have a higher cost associated with them, because in effect you are trying to change a state in a way that is unlikely to occur.
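
The sketch below shows one way data-driven substitution costs can be derived from transition rates in base R. The sequences are made up, and the formula 2 - p(i to j) - p(j to i) is the one I understand TraMineR's transition-rate option to use; treat it as an assumption and check the package documentation before relying on it.

```r
# Hypothetical sequences: rows are people, columns are time-points
# (H = housed, N = homeless)
seqs <- rbind(
  c("H", "H", "H", "N"),
  c("H", "N", "N", "N"),
  c("H", "H", "N", "H")
)
states <- c("H", "N")

# Count transitions from the state at t (rows) to the state at t + 1 (columns)
from <- factor(seqs[, -ncol(seqs)], levels = states)
to <- factor(seqs[, -1], levels = states)
counts <- unclass(table(from, to))

# Transition rates: each row sums to 1
trate <- prop.table(counts, margin = 1)

# Substitution cost: rarer transitions between two states give a higher cost
sub_cost <- 2 - trate - t(trate)
diag(sub_cost) <- 0

print(round(trate, 3))
print(round(sub_cost, 3))
```

With these data, substituting H for N (or N for H) costs 2 - 0.5 - 0.333, or about 1.167.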

The total distance between two sequences is the sum of the costs incurred in trying to change one sequence into another. The distance between sequences is calculated between all pairs of sequences, resulting in a distance matrix—a symmetrical grid where each cell represents the computed cost/distance between two sequences.
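
To illustrate how costs add up to a distance, here is a small base R sketch of an optimal-matching style calculation using dynamic programming. The function `om_distance()`, the constant costs (indel = 1, substitution = 2), and the sequences are all hypothetical choices for illustration; this is not TraMineR's implementation.

```r
# Cheapest total cost of turning sequence a into sequence b
om_distance <- function(a, b, indel = 1, sub = 2) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel  # delete every state of a
  d[1, ] <- (0:m) * indel  # insert every state of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost_sub <- if (a[i] == b[j]) 0 else sub
      d[i + 1, j + 1] <- min(
        d[i, j] + cost_sub,   # substitute (free if states already match)
        d[i, j + 1] + indel,  # delete a state from a
        d[i + 1, j] + indel   # insert a state into a
      )
    }
  }
  d[n + 1, m + 1]
}

# Hypothetical sequences (H = housed, N = homeless)
seqs <- list(
  c("H", "H", "N", "N"),
  c("H", "N", "N", "H"),
  c("H", "H", "H", "H")
)

# Distance matrix: one cell per pair of sequences
idx <- seq_along(seqs)
dist_mat <- outer(idx, idx, Vectorize(function(i, j) {
  om_distance(seqs[[i]], seqs[[j]])
}))
print(dist_mat)
```

The first two sequences are 2 apart: deleting the leading H and inserting an H at the end (two indels) is cheaper than making two substitutions at a cost of 2 each.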

Within the SA literature, Optimal Matching is frequently used to calculate distances, which combines both indels and substitutions. However, researchers should weigh up the impacts of using substitutions (which alter states) and/or indels (which alter time by shifting sequences up/down) in relation to their research question.
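
In TraMineR, distance matrices are computed with `seqdist()`. The sketch below, on made-up data, contrasts Optimal Matching with Hamming distance, which uses substitutions only. The cost settings (`indel = 1`, constant substitution costs) are illustrative assumptions rather than recommendations.

```r
library(TraMineR)

# Hypothetical STS data for three people over four waves
housing <- data.frame(
  wave1 = c("housed", "housed", "homeless"),
  wave2 = c("housed", "homeless", "homeless"),
  wave3 = c("housed", "homeless", "housed"),
  wave4 = c("housed", "housed", "housed")
)
housing_seq <- seqdef(housing, alphabet = c("homeless", "housed"))

# Optimal Matching: combines indels and substitutions
om_dist <- seqdist(housing_seq, method = "OM", indel = 1, sm = "CONSTANT")

# Hamming distance: substitutions only, so the timing of states is preserved
ham_dist <- seqdist(housing_seq, method = "HAM")

print(om_dist)
print(ham_dist)
```

Comparing the two matrices on your own data is one way to see how much the choice of operations affects which sequences are treated as similar.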

Step 3: Cluster analysis

Cluster analysis of the dissimilarity (distance) matrix is the stage at which the typology is actually generated. The aim of cluster analysis is to create groups whose members are similar to one another and as different as possible from the members of other groups. A range of clustering techniques can be used. However, the most widely adopted clustering method when creating typologies of sequence data is hierarchical clustering.

Hierarchical clustering builds a tree-like structure (also known as a dendrogram) by iteratively merging the most similar sequences, providing a flexible way to explore potential groupings of sequences at different levels.

Choosing the number of clusters to keep is a critical step. Researchers can draw on fit indices, such as the silhouette width, to assess the coherence of clusters, or theoretical considerations based on the research context. Often, researchers will extract several cluster solutions and choose the most ‘appropriate’ number of clusters to retain based on their interpretability.
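
The sketch below runs hierarchical clustering on a small, made-up distance matrix and compares average silhouette widths for two to four clusters. The Ward linkage (`"ward.D2"`) is an assumption on my part, as the text does not name a linkage method; `silhouette()` comes from the `cluster` package.

```r
library(cluster)  # provides silhouette()

# Hypothetical distance matrix for six sequences
d <- matrix(c(
  0, 1, 2, 6, 7, 6,
  1, 0, 1, 7, 6, 7,
  2, 1, 0, 6, 7, 6,
  6, 7, 6, 0, 1, 2,
  7, 6, 7, 1, 0, 1,
  6, 7, 6, 2, 1, 0
), nrow = 6, byrow = TRUE)
dist_obj <- as.dist(d)

# Build the tree; plot(tree) would draw the dendrogram
tree <- hclust(dist_obj, method = "ward.D2")

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(2:4, function(k) {
  membership <- cutree(tree, k = k)
  mean(silhouette(membership, dist_obj)[, "sil_width"])
})
names(avg_sil) <- paste0("k=", 2:4)
print(round(avg_sil, 3))

# Extract the two-cluster solution
clusters <- cutree(tree, k = 2)
print(clusters)
```

Fit indices such as these are a guide only; the interpretability of each solution still needs to be judged by the researcher.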

Step 4: Describing clusters

This step involves summarising and interpreting the key characteristics of each cluster to highlight their distinct patterns or trajectories. One common approach to interpreting clusters is to use data visualisations that summarise aggregate information about sequences. For example, mean-time plots provide a general overview of the states occupied by people in each cluster and for how long. Mean-time plots show, on average, how long each state was occupied as a proportion of overall study time.
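
The quantity behind a mean-time plot can be computed by hand, as in this base R sketch. The sequences and cluster labels are made up for illustration.

```r
# Hypothetical sequences (H = housed, N = homeless) and cluster labels
seqs <- rbind(
  c("H", "H", "H", "H"),
  c("H", "H", "H", "N"),
  c("N", "N", "N", "H"),
  c("N", "N", "N", "N")
)
cluster <- c(1, 1, 2, 2)
states <- c("H", "N")

# Mean number of time-points spent in each state, by cluster
mean_time <- t(sapply(split(seq_len(nrow(seqs)), cluster), function(rows) {
  sapply(states, function(s) {
    mean(rowSums(seqs[rows, , drop = FALSE] == s))
  })
}))

# The same, expressed as a proportion of overall study time
prop_time <- mean_time / ncol(seqs)

print(mean_time)
print(prop_time)
```

Members of cluster 1 spend on average 3.5 of the 4 time-points housed (87.5% of study time), whereas members of cluster 2 spend the same share homeless.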

To capture general shifts in states for cluster members over time, state distribution plots visualise the breakdown of states at each of the time-points making up sequences. Transition rate plots can be helpful visualisations to interpret movement between states. These plots give a breakdown of the proportion of transitions between states at one point-in-time (t) and the subsequent point-in-time (t + 1).
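
A sketch of how these summaries might be produced with TraMineR follows. The data and cluster labels are made up. `seqstatd()` and `seqtrate()` return the underlying numbers, while `seqdplot()` and `seqmtplot()` draw state distribution and mean-time plots with one panel per group.

```r
library(TraMineR)

# Hypothetical sequences and cluster labels, for illustration only
housing <- data.frame(
  wave1 = c("housed", "housed", "homeless", "homeless"),
  wave2 = c("housed", "housed", "homeless", "homeless"),
  wave3 = c("housed", "homeless", "homeless", "housed"),
  wave4 = c("housed", "housed", "homeless", "housed")
)
housing_seq <- seqdef(housing, alphabet = c("homeless", "housed"),
                      cpal = c("red", "green"))
cluster <- factor(c("A", "A", "B", "B"))

# Proportion of people in each state at each time-point
state_dist <- seqstatd(housing_seq)$Frequencies
print(state_dist)

# Transition rates between the state at t and the state at t + 1
trans_rates <- seqtrate(housing_seq)
print(round(trans_rates, 3))

# Plots, one panel per cluster
seqdplot(housing_seq, group = cluster)   # state distribution plot
seqmtplot(housing_seq, group = cluster)  # mean-time plot
```

The transition rates here are computed across all sequences; to describe a single cluster, the same call could be applied to the rows belonging to that cluster.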

An alternative to aggregating patterns in sequences is to visualise sequences in their entirety. For example, representative sequence plots present the most ‘central’ sequence(s) within each cluster, using the distance matrix to calculate centrality. However, visualising actual sequences of data may pose a risk of disclosing people’s identities. Depending on the context of the research, this disclosure risk needs to be factored into using methods of data visualisation and analysis that present actual sequences.
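
One simple way to define centrality is the medoid: the sequence with the smallest total distance to the other members of its cluster. The base R sketch below finds medoids from a made-up distance matrix. This is only one possible criterion, and TraMineR's representative sequence plots may select sequences differently.

```r
# Hypothetical distance matrix for six sequences and their cluster labels
d <- matrix(c(
  0, 1, 2, 6, 7, 6,
  1, 0, 1, 7, 6, 7,
  2, 1, 0, 6, 7, 6,
  6, 7, 6, 0, 1, 2,
  7, 6, 7, 1, 0, 1,
  6, 7, 6, 2, 1, 0
), nrow = 6, byrow = TRUE)
cluster <- c(1, 1, 1, 2, 2, 2)

# For each cluster, pick the member with the smallest summed distance
# to the other members of the same cluster
medoids <- sapply(split(seq_len(nrow(d)), cluster), function(members) {
  within <- d[members, members, drop = FALSE]
  members[which.min(rowSums(within))]
})
print(medoids)
```

Sequences 2 and 5 are the medoids of clusters 1 and 2 respectively.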

Advanced analyses

When describing and interpreting clusters it is useful to explore their association with characteristics not used to create sequences. Bivariate analyses can be conducted, such as cross-tabulation, where additional characteristics are categorical in nature (e.g., gender), or Analysis of Variance for numeric characteristics (e.g., single year of age). Multinomial logistic regression can also be used to predict cluster membership, whilst controlling for other characteristics simultaneously, e.g., age, gender, and ethnicity.
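
These analyses might look as follows in R. The data are simulated at random purely so the code runs, so no association should be expected; the cluster names and variables are hypothetical. `multinom()` comes from the `nnet` package.

```r
library(nnet)  # provides multinom()

# Simulated data: cluster membership plus two characteristics
set.seed(1)
n <- 150
people <- data.frame(
  cluster = factor(sample(c("stable", "one-off", "cycling"), n, replace = TRUE)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = round(runif(n, 18, 65))
)

# Categorical characteristic: cross-tabulation with a chi-squared test
tab <- table(people$cluster, people$gender)
print(tab)
print(chisq.test(tab))

# Numeric characteristic: one-way Analysis of Variance
print(summary(aov(age ~ cluster, data = people)))

# Multinomial logistic regression predicting cluster membership
fit <- multinom(cluster ~ age + gender, data = people, trace = FALSE)
print(coef(fit))
```

The coefficients are log-odds of belonging to each cluster relative to the reference cluster, which is the first factor level.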

Though primarily aimed at identifying and describing patterns in trajectories, typologies can be used as the basis for predictive analysis. For example, in the United States, SA has been used to identify patterns in the experience of homelessness and incarceration, followed by regression analysis to explore variation in HIV care outcomes among people following different trajectories.

Extensions to deal with complex processes

Social science researchers are often interested in changes in multiple characteristics over time. One way of dealing with this social complexity is to combine different characteristics into a single state alphabet. Alternatively, an extension of SA implementable in TraMineR allows you to simultaneously analyse multiple sequences reflecting different characteristics, also known as multi-channel sequence analysis or MCSA.

Typology creation using MCSA is much the same as the single sequence approach. Separate sequence objects are created for each characteristic or channel, e.g., housing tenure and household type. However, the distance matrix is based on the cost of transforming sequences in each channel separately, which are either summed together or averaged to give the total difference.
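
The idea of combining channels can be sketched in base R as below. To keep the example short it uses a simplified, substitution-only (Hamming-style) distance with a unit cost per mismatch, which is my simplification rather than what TraMineR's multi-channel functions do. The data and names are made up.

```r
# Two channels for the same three people (hypothetical data)
tenure <- rbind(
  c("rent", "rent", "own"),
  c("rent", "own", "own"),
  c("rent", "rent", "rent")
)
household <- rbind(
  c("alone", "couple", "couple"),
  c("alone", "alone", "couple"),
  c("alone", "alone", "alone")
)

# Distance within one channel: number of time-points at which states differ
channel_dist <- function(ch) {
  idx <- seq_len(nrow(ch))
  outer(idx, idx, Vectorize(function(p, q) sum(ch[p, ] != ch[q, ])))
}

per_channel <- lapply(list(tenure = tenure, household = household),
                      channel_dist)

# Combine channels by summing, or by averaging
d_sum <- Reduce(`+`, per_channel)
d_mean <- d_sum / length(per_channel)

print(d_sum)
print(d_mean)
```

Persons 1 and 3 differ at one time-point on tenure and two on household type, giving a summed distance of 3 and an averaged distance of 1.5.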

As an example of its application, MCSA was used to create a typology of young people’s ‘housing pathways’, defined as transitions in multiple life-domains over time. Housing pathways were conceptualised as sequences of housing tenure, household type, marital status, and employment status, over a ten-year period.

> Download Worksheet.

> Download R code used in the worksheet.

Supporting materials

About the author

Ian Thomas is a Research Fellow at Cardiff University. His main research areas of interest are homelessness and data science. He has over 10 years of experience in the use of linked administrative data and is currently leading on the Administrative Data Research Wales (ES/W012227/1) housing and homelessness thematic research area.

Published on: 24 February 2025
Event hosted by: Cardiff University
Keywords: Sequence Analysis (SA) | Typology Creation | Clustering | Data Collection | Data Quality and Data Management | Quantitative Data Handling and Data Analysis | Mixed Methods Data Handling and Data Analysis | Longitudinal Data | Data Visualization | R Programming
To cite this resource:
Ian Thomas. (2025). Sequence Analysis - Introduction. National Centre for Research Methods online learning resource. Available at https://www.ncrm.ac.uk/resources/online/all/?id=20853 [accessed: 24 April 2025]