Experiences from extracting large data sets from Swedish public offices
Fredrik Liljeros


Outline
Why use data sets from public offices?
Three examples of available Swedish datasets
  Workplace and household data
  In-patient data
  Data on suspected criminals
Problems with Swedish public office data

Sociological data
Expensive to collect
Time consuming (especially time series)
Low response rate
Network data are associated with special problems

Sampling of Network Data

We can’t use a random sample

Extracting data from existing databases!

Sweden may be seen as an outlier when it comes to available public data
1686: All priests were ordered to keep track of all people living in their parishes (Sweden had a state church until 2000)
1749: First census
1756: Foundation of the governmental office “Tabell kommisionen” (Sweden and Finland)
1858: Foundation of Statistics Sweden (SCB)

All individuals officially living in Sweden have a unique identifier, the “personnummer”

Example 1 The Sweden database

The network
Individuals: 8,861,392
Families: 4,641,829
Workplaces: 437,936

Giant component
Average path distance: 8.5
Diameter: 22
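The quantities on this slide can be illustrated on a toy network. The sketch below is not the code used for the Sweden database. It assumes people are linked whenever they share a family or a workplace, and the function names and sample data are hypothetical. It uses exact breadth-first search from every node, which would not scale to millions of individuals without sampling.

```python
from collections import defaultdict, deque

def build_person_graph(memberships):
    """Link two people whenever they share a group (family or workplace).
    memberships: iterable of (person, group) pairs (hypothetical input format)."""
    members = defaultdict(set)
    for person, group in memberships:
        members[group].add(person)
    graph = defaultdict(set)
    for people in members.values():
        for p in people:
            graph[p] |= people - {p}  # also registers people with no neighbours
    return dict(graph)

def bfs_distances(graph, source):
    """Shortest path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def giant_component(graph):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for node in graph:
        if node not in seen:
            comp = set(bfs_distances(graph, node))
            seen |= comp
            if len(comp) > len(best):
                best = comp
    return best

def path_stats(graph, component):
    """Average path distance and diameter over all pairs in one component."""
    total = pairs = diameter = 0
    for node in component:
        for other, d in bfs_distances(graph, node).items():
            if other != node:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    if pairs == 0:
        return 0.0, 0
    return total / pairs, diameter

# Toy data: a and b share a family, b and c share a workplace, d lives alone.
memberships = [("a", "fam1"), ("b", "fam1"), ("b", "work1"),
               ("c", "work1"), ("d", "fam2")]
graph = build_person_graph(memberships)
giant = giant_component(graph)
avg_distance, diameter = path_stats(graph, giant)
```

In the toy data the giant component is {a, b, c}: a reaches c only through b, so the longest shortest path has two links.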

Send home (or vaccinate) everyone except max size of workplace

Send home people randomly

Average path distance

Example 2 Data about suspected criminals

The data
All individuals registered as suspected of having committed a criminal act, for every year between 1997 and 2005
Total number of suspected individuals:
Types of crimes: 144
Total number of reported individual crimes:
Average number of suspected crime types per individual: 2.65
Standard deviation of number of suspected crime types per individual: 3.3

Purpose
Can social network visualization tools give us a better sense of how different crimes are related to each other?

Basic concepts
Node: A specific type of crime (for example, “Assault, outdoors, against child 0-6 years of age, unacquainted with the victim” or “Trafficking for sexual purposes”)
Link: Exists between two types of crimes if at least one individual has been suspected of both crimes in different years
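The node and link definitions can be sketched in a few lines. This is a minimal illustration, not the original SQL. It assumes the register can be read as (person, year, crime type) rows, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def crime_links(records):
    """Count, for each pair of crime types, how many individuals were
    suspected of both crimes in different years.
    records: iterable of (person_id, year, crime_type) rows (assumed layout)."""
    years = defaultdict(lambda: defaultdict(set))  # person -> crime -> years
    for person, year, crime in records:
        years[person][crime].add(year)
    links = defaultdict(int)
    for crimes in years.values():
        for a, b in combinations(sorted(crimes), 2):
            # Two or more distinct years across both crimes means the two
            # suspicions can be matched to different years.
            if len(crimes[a] | crimes[b]) > 1:
                links[(a, b)] += 1
    return dict(links)

# Toy rows: person 1 links the two robbery types (2002 and 2005);
# person 2 is suspected of two crimes in the same year, so no link.
records = [
    (1, 2002, "Robbery, with firearm, (Bank)"),
    (1, 2005, "Robbery, with firearm, (Post)"),
    (2, 2003, "Robbery, with firearm, (Bank)"),
    (2, 2003, "Assault"),
]
links = crime_links(records)
```

The value stored for each pair is the number of individuals behind the link, which can later serve as an edge weight.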

Example
2002: “Robbery, with firearm, (Bank)”
2005: “Robbery, with firearm, (Post)”

The mess of all violent crimes

A minimum spanning tree

What is a minimum spanning tree?
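A minimum spanning tree keeps just enough links to connect all nodes while making the total link weight as small as possible. The sketch below uses Kruskal's algorithm with a union-find structure. It is a generic illustration with hypothetical names and weights, not the procedure used to draw the slides.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm.
    edges: list of (weight, u, v) tuples.
    Returns the chosen edges (a spanning forest if the graph is disconnected)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

# Toy graph: the A-C edge is dropped because A and C are already
# connected more cheaply through B.
nodes = ["A", "B", "C", "D"]
edges = [(1, "A", "B"), (2, "B", "C"), (3, "A", "C"), (4, "C", "D")]
tree = minimum_spanning_tree(nodes, edges)
total_weight = sum(w for w, _, _ in tree)
```

When links carry a similarity score rather than a distance, one common choice is to use a weight such as one minus the similarity, so that the strongest associations survive in the tree.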

Number of mutual links between A and B

Number of mutual links may not be a good measure

A and B: highly correlated

A and B: weak correlation

A simple measure of correlation between crimes
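The transcript does not preserve the formula behind this slide. As an assumption only, one simple candidate is the Jaccard index: the number of individuals suspected of both crimes divided by the number suspected of at least one. The sketch below shows that stand-in, with hypothetical names and data, and is not necessarily the measure used in the talk.

```python
def jaccard(suspects_a, suspects_b):
    """Share of individuals suspected of both crimes among those
    suspected of at least one (Jaccard index, an assumed stand-in)."""
    a, b = set(suspects_a), set(suspects_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy data: both pairs share exactly two suspects, but the overlap is a
# large part of the small pair and a tiny part of the large pair.
small_a, small_b = {1, 2, 3, 4}, {3, 4, 5}
large_a, large_b = set(range(1000)), set(range(998, 2000))
high = jaccard(small_a, small_b)
low = jaccard(large_a, large_b)
```

Both toy pairs have the same raw count of mutual links, yet the normalized score separates them, which is the problem with raw counts that the preceding slides point to.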

A simple example with A and B

A minimum spanning tree based on crime correlation

A minimum spanning tree based on crime correlation with a lower threshold of 0.01

The “mess” of sex buyers

A minimum spanning tree of suspected crimes of suspected sex buyers based on crime correlation

Conclusion
Playing with different graphs may give a good first picture of how different crimes are associated with each other
We still need traditional statistical techniques to test hypotheses
Existing software packages are not very user friendly (three different programs were needed to produce these pictures: Microsoft SQL Server, Mathcad and Pajek)

Example 3 Data about inpatients in a hospital system

The hospital network

The network
All hospitalizations of individuals in Stockholm
,108 individuals
570,382 institutional healthcare occasions
702 wards located at different hospitals
The mean number of patients admitted to the wards per day varied between one and 69 (mean and standard deviation 9.44)

Degree distributions

Duration of hospital stays

Problems with Swedish public office data
You usually have to pay for the data
You are only allowed to use the data for the purpose you bought it for
You can’t share the data for free
Swedish data may not be of general interest

A last animation

Relevant publications