Generating Synthetic Data for Statistical Disclosure Control (few places remaining)
Date:
16/10/2017 - 17/10/2017
Organised by:
University of Southampton/ADRC-E
Presenter:
Dr Jörg Drechsler
Level:
Intermediate (some prior knowledge)
Contact:
Description:
Course No. ADRCE-Training037 Drechsler
Course places are limited and registration by 9 October 2017 is strongly recommended.
Short Summary of Course
This short course provides a detailed overview of the synthetic data approach and covers all of its important aspects. After a short introduction to data confidentiality in general and synthetic data in particular, the workshop discusses the different approaches to generating synthetic datasets in detail. It assesses possible modelling strategies and ways of evaluating analytical validity, and presents measures for quantifying the remaining risk of disclosure. To give participants hands-on experience, the course includes practical sessions in R, in which they generate and evaluate synthetic data based on real data examples.
Course Contents
The course covers:
• the fully synthetic data approach
• the partially synthetic data approach
• modelling strategies for generating synthetic data
• data utility evaluations
• disclosure risk assessment
Learning Outcomes
By the end of the course participants will:
• have a practical understanding of the concept of synthetic data
• be able to judge in which situations the approach could be useful
• know how to generate synthetic data from their own data
• have a number of tools available to evaluate the analytical validity of the synthetic datasets
• know how to assess the disclosure risk of the generated data
Computer Software and Computer workshops
Delegates will need to bring their own laptops with the latest version of R installed. It would be helpful to install the most recent version of the synthpop package before the course, either from https://CRAN.R-project.org/package=synthpop or by opening an R session and typing install.packages("synthpop").
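A quick way to check the installation before the course is sketched below. It is only an illustration and not part of the official course material: it assumes an internet connection for the install step and uses the SD2011 example dataset that ships with synthpop. The choice of variables and the seed are arbitrary.

```r
# Install synthpop only if it is not already available.
if (!requireNamespace("synthpop", quietly = TRUE)) {
  install.packages("synthpop")
}
library(synthpop)

# Report the installed version.
print(packageVersion("synthpop"))

# Smoke test: synthesise a few variables of the bundled SD2011 data.
# The variable selection and seed are arbitrary choices for illustration.
ods <- SD2011[, c("sex", "age", "edu", "marital")]
sds <- syn(ods, seed = 2017)

# The synthetic data frame is stored in the `syn` element of the result.
head(sds$syn)
```

If the last line prints a few rows of synthetic records, R and synthpop are set up for the practical sessions.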
The Presenter
Dr Jörg Drechsler
Jörg is a distinguished researcher at the Department for Statistical Methods at the Institute for Employment Research in Nürnberg, Germany. He received his PhD in Social Science from the University of Bamberg in 2009 and his Habilitation in Statistics from the Ludwig-Maximilians-Universität in Munich in 2015. He is also an adjunct assistant professor in the Joint Program in Survey Methodology at the University of Maryland. His main research interests are data confidentiality and nonresponse in surveys. He has received several awards for his research on synthetic data and recently published a book on this topic.
Target Audience
The course aims to summarize the state of the art in synthetic data. The main focus will be on practical implementation rather than on the motivation of the underlying statistical theory. Participants may be academic researchers or practitioners from statistical agencies working in the area of data confidentiality and data access. Basic knowledge of R is expected. Some background in Bayesian statistics is helpful but not obligatory.
Duration
This is a two-day course. On day one, registration opens at 9.30; formal teaching begins at 10.00 and finishes at around 17.00. On day two, teaching starts at 9.00 and finishes at around 16.00.
Event Outline (Programme)
1. A Brief History of Data Confidentiality
a. Information Reduction vs. Data Perturbation
b. The Computer Science Approach vs. the SDC Approach to Confidentiality
2. Some Basics Regarding Multiply Imputed Synthetic Datasets
a. Fully Synthetic Datasets
b. Partially Synthetic Datasets
c. Applications in Practice
3. Analyzing Synthetic Datasets
a. Fully Synthetic Data Combining Rules
b. Partially Synthetic Data Combining Rules
c. Extensions to Missing Data
4. Generating Synthetic Datasets
a. Two Approaches for Multiple Imputation (joint modeling vs. sequential regression)
b. Imputation Models and Modeling Strategies ((generalized) linear models and machine learning approaches)
c. Evaluating the Analytical Validity
d. Evaluating the Risk of Disclosure
5. Recent Extensions of the Synthetic Data Approach
a. A Synthesis Approach for Census Data
b. A Two Stage Approach to Balance Analytical Validity and Disclosure Risk
6. Chances and Obstacles of the Approach
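As a rough illustration of the combining rules listed under item 3, the sketch below pools point estimates and variance estimates from m synthetic datasets. It follows the variance formulas in Raghunathan et al. (2003) for fully synthetic data and Reiter (2003) for partially synthetic data, both on the reading list. The function name combine_synthetic and its arguments are hypothetical and not taken from the course notes or from any package. Degrees of freedom for the reference t-distribution are left out.

```r
# Hypothetical helper: pool results from m synthetic datasets.
# q: point estimates, one per synthetic dataset
# u: estimated variances of those point estimates
combine_synthetic <- function(q, u, type = c("partial", "full")) {
  type <- match.arg(type)
  m <- length(q)
  stopifnot(m >= 2, length(u) == m)

  q_bar <- mean(q)  # pooled point estimate
  b     <- var(q)   # between-dataset variance (divisor m - 1)
  u_bar <- mean(u)  # average within-dataset variance

  total_var <- if (type == "partial") {
    u_bar + b / m              # partially synthetic data
  } else {
    (1 + 1 / m) * b - u_bar    # fully synthetic data
  }

  list(estimate = q_bar, between = b, within = u_bar, variance = total_var)
}

# Example with made-up numbers for m = 3 synthetic datasets.
combine_synthetic(q = c(1.0, 1.2, 0.8), u = rep(0.05, 3), type = "partial")
```

Note that the fully synthetic variance estimate can turn out negative when m is small, because the average within-dataset variance is subtracted. Adjusted estimators are used in practice to handle this.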
Pre-requisites
Some background regarding general linear modelling is expected. Familiarity with the concept of Bayesian statistics is helpful but not required. The statistical software R will be used to illustrate the implementation of the approach.
Familiarity with the basics of R would be useful. Participants who are not familiar with the software can team up with experienced R users during the practical sessions.
Preparatory Reading
The course is based on the following book:
- Drechsler, J. (2011). Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation. Lecture Notes in Statistics 201. New York: Springer.
Some useful papers are:
- Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P., and Sanil, A. P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician 60, 224–232.
- Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011). Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database. International Statistical Review 79, 363–384.
- Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics 19, 1–16.
- Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology 29, 181–189.
- Reiter, J. P. (2012). Statistical approaches to protecting confidentiality for microdata and their effects on the quality of statistical inferences. Public Opinion Quarterly 76, 163–181.
- Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics 9, 462–468.
- Woo, M. J., Reiter, J. P., Oganian, A., and Karr, A. F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality 1, 111–124.
Course Materials
Participants will receive written course notes.
Cost:
Thanks to ESRC funding we are able to offer this course at reduced rates as follows:
1) £30 per day for UK registered students
2) £60 per day for staff at UK academic institutions, RCUK funded researchers, UK public sector staff and staff in UK registered charity organisations
3) £220 per day for all other participants
4) Free places for ADRC-E and ADRN/ADS staff
The course fee includes course materials, lunches and morning and afternoon refreshments. Travel and accommodation are to be arranged and paid for by the participant.
Website and registration:
Region:
Greater London
Keywords:
Survey Research, Analysis of official statistics, Analysis of administrative data, Statistical Disclosure Control, Statistical Theory and Methods of Inference, Microdata Methods, R, Confidentiality and Anonymity, Synthetic Data, Synthetic Datasets