Crime sensing with social media

Date
Category
NCRM news
Author(s)
Matt Williams and Pete Burnap, Cardiff University

The majority of individuals aged under twenty in the Western world were ‘born digital’ and will not recall a time without access to the Internet. Combined with the migration of the ‘born analogue’ generation onto the Internet, fuelled by the rise of social media, this has produced exponential growth in online spaces for the mass sharing of opinions and sentiments. These online spaces represent a socio-technical assemblage that creates a new public sphere, enabling a digital citizenship through which aspects of civil society are played out. No study of contemporary society can ignore this dimension of social life. New forms of digital online social data, handled by computational methods, allow social scientists to gain meaningful insights into contemporary social processes at unprecedented scale and speed. How we marshal these new forms of data presents a key challenge for the social sciences. Our NCRM Methodological Innovation Project sought to contribute to the step-change in methods and capacity needed to marshal and extract value from them.

In the project ‘Social media and prediction: crime sensing, data integration and statistical modeling’ we explored data fusion techniques, building a series of statistical models on heterogeneous datasets to test whether social media data can be used to ‘sense’ offline crime patterns in London. We collected a corpus of 180 million UK geo-coded tweets covering a 12-month period using the COSMOS platform [1]. To reduce this dataset to a sub-corpus of tweets related to crime and disorder, we first developed a coding frame using existing interviews with victims of crime from the ESRC UK Data Archive. Extracts of the interview data were prepared for crowd-sourced human coding via the CrowdFlower platform, and gold standard annotations were derived. The human-verified annotations were used to derive a lexicon that allowed us to classify the whole Twitter dataset. The resulting crime and disorder sub-corpus was then subjected to further human annotation for validation. This verified corpus was used to develop a social media ‘crime sensing’ algorithm that automatically identifies mundane references to crime and disorder in social media communications, using terms and phrases that were statistically likely to appear in content classified by human annotators. The algorithm was supplemented with emotive and affective terms from the WordNet Affect online lexical resource to identify content suggestive of fear, distress and anxiety.
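The lexicon step described above can be illustrated with a minimal sketch. It is not the project's algorithm: the scoring rule (a smoothed log ratio of document frequencies), the threshold, the toy annotations and the short affect word list standing in for WordNet Affect are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


def derive_lexicon(annotated, min_log_ratio=1.0):
    """Keep terms that appear far more often in crime-labelled texts.

    `annotated` is a list of (text, is_crime) pairs from human coders.
    The score is a smoothed log ratio of document frequencies; this
    scoring rule is an assumption, not the project's published method.
    """
    crime_df, other_df = Counter(), Counter()
    n_crime = n_other = 0
    for text, is_crime in annotated:
        terms = set(tokenize(text))  # count each term once per text
        if is_crime:
            crime_df.update(terms)
            n_crime += 1
        else:
            other_df.update(terms)
            n_other += 1
    lexicon = set()
    for term in crime_df:
        p_crime = (crime_df[term] + 1) / (n_crime + 2)
        p_other = (other_df[term] + 1) / (n_other + 2)
        if math.log(p_crime / p_other) >= min_log_ratio:
            lexicon.add(term)
    return lexicon


def sense(tweet, crime_lexicon, affect_terms):
    """Flag a tweet for crime references and for affective language."""
    terms = set(tokenize(tweet))
    return {
        "crime": bool(terms & crime_lexicon),
        "affect": bool(terms & affect_terms),
    }


# Hypothetical toy annotations; real input would be the coded extracts.
ANNOTATED = [
    ("someone smashed my car window last night", True),
    ("my bike was stolen outside the station", True),
    ("phone stolen on the bus again", True),
    ("lovely walk by the station last night", False),
    ("stuck on the bus again", False),
    ("my car needs a wash", False),
]

# Hypothetical stand-in for terms drawn from WordNet Affect.
AFFECT = {"scared", "afraid", "anxious"}

LEXICON = derive_lexicon(ANNOTATED)
```

On this toy data only "stolen" clears the threshold, so `sense("My laptop got stolen and I'm scared", LEXICON, AFFECT)` flags both crime and affect. A real lexicon would need far more annotated text, and phrases as well as single terms.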

We explored a range of statistical methods for combining social media data with administrative data (recorded crime) and curated data (the census). Our dependent variables were measures of police recorded crime collected over a 12-month period and provided by the Metropolitan Police Service. Because we wanted to combine the temporal and spatial variability of police recorded crime and Twitter data with the static variables from the census, we used random-effects models, which accommodate both time-variant and time-invariant regressors. This allowed us to relate police recorded crime to independent variables with very different temporal granularity: tweets, which vary second by second, and census variables, which are updated once a decade. The models included information from tweets relating to mentions of crime and disorder (such as criminal damage) and emotive states (such as anxiety).
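The article does not give the model specification, but a generic random-effects panel model of the kind described can be written as follows. All symbols are assumptions introduced here: $i$ indexes areas, $t$ indexes time periods, $\mathbf{x}_{it}$ holds the time-variant regressors (for example, counts of crime-related and affective tweets), and $\mathbf{z}_i$ holds the time-invariant census variables.

```latex
\begin{equation}
  y_{it} = \alpha
         + \boldsymbol{\beta}^{\top}\mathbf{x}_{it}
         + \boldsymbol{\gamma}^{\top}\mathbf{z}_{i}
         + u_i + \varepsilon_{it},
  \qquad
  u_i \sim \mathcal{N}(0, \sigma_u^2),
  \quad
  \varepsilon_{it} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
\end{equation}
```

Here $y_{it}$ is police recorded crime in area $i$ during period $t$, $u_i$ is an area-level random effect assumed to be uncorrelated with the regressors, and $\varepsilon_{it}$ is the idiosyncratic error. Because $u_i$ is treated as random rather than estimated as a fixed effect, the coefficients $\boldsymbol{\gamma}$ on the time-invariant census variables remain identifiable, which is why this family of models suits a mix of fast-changing and static predictors. A count-data variant would replace the linear form with a suitable link function.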

Preliminary results indicate that models including social media data explain more of the variance in police recorded crime patterns than models that include conventional crime predictors alone. This modelling technique may prove effective in sensing crime patterns ahead of conventional means. The project extends our previous ESRC-funded work on modelling the classification of racial tension [2,3] and the propagation of cyberhate in social media [4,5,6]. We are now working with the Metropolitan Police Service, via an ESRC Impact Acceleration Award, to embed our computational and statistical models into their operational processes.

As part of the drive to up-skill social scientists in big data analytics, we have conducted a series of advanced workshops at the Web Science Trust Summer Schools at the National University of Singapore and in Southampton. We have provided training to Wales DTC doctoral students, and we are currently developing a master’s degree in Social Data Science at Cardiff as part of the University’s new Social Data Science Lab.

References

[1] Burnap, P. et al. 2014. COSMOS: Towards an integrated and scalable service for analysing social media on demand. International Journal of Parallel, Emergent and Distributed Systems 30:2, 80-100.

[2] Williams, M. L. et al. 2013. Policing cyber-neighbourhoods: Tension monitoring and social media networks. Policing and Society 23, 461-481.

[3] Burnap, P. et al. 2015. Detecting Tension in Online Communities with Computational Twitter Analysis. Technological Forecasting and Social Change 95.

[4] Burnap, P. et al. 2014. Tweeting the terror: modelling the social media reaction to the Woolwich terrorist attack. Social Network Analysis and Mining 4:206, 1-14.

[5] Burnap, P. and Williams, M. L. 2015. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet 7, 223-42.

[6] Williams, M. L. and Burnap, P. 2015. Cyberhate on social media in the aftermath of Woolwich: A case study in computational criminology and big data. British Journal of Criminology, 1-28.