Arkansas Research Center Neal Gibson, Ph.D. Greg Holland, Ph.D.

You are the identity manager…
Record 1: MARIA CASTILLO-DELGADO, WILSON HIGH SCHOOL
Record 2: MARIA CASTILLO-DELG, WILSON HIGH SCHOOL

You are the identity manager…
Record 1: MARIA D CASTILLO-DELGADO, WILSON HS
Record 2: MARIA C CASTILLO-DELG, WILSON HS

You are the identity manager…
Record 1: MARIA D CASTILLO-DELGADO, WILSON HS, DOB: 11/05/1995
Record 2: MARIA C CASTILLO-DELG, WILSON HS, DOB: 9/24/1994

Identity Resolution Problems (K12)
There are ~55,000 unique first names among students in Arkansas and ~40,000 last names. Approximately 20% of Arkansas students share both the same first and last name with another student.

Student Count   First Name   Last Name
64              JOSHUA       SMITH
56              ASHLEY       SMITH
52              JESSICA      SMITH
48              JUSTIN       SMITH
37              ASHLEY       JONES
31              JUSTIN       WILLIAMS
30              JESSICA      JOHNSON
27              JOSHUA       BROWN

Identity Resolution (K12)
There are 4,026 students in Arkansas who share an SSN with at least one other student in the state.
Between August and January, 874 student transfers to other schools resulted in an SSN change.
Between August and January, an additional 1,018 students changed their SSN; we have records for only 300 of these changes.
There are ~17,000 students in Arkansas with a “900” SSN.

Identity Resolution (Workforce)
~55,000,000 records for 10 years, 2,938,718 unique SSNs, no DOBs, inconsistent naming standards.
7,865 SSNs are used by two or more people, for a total of 18,278 different individuals. If SSN were the primary key, their incomes would be combined and they would be treated as the same person.
The same person has two or more SSNs (because of a typo/transposition) 13,373 times. If SSN were the primary key, there would be 13,373 additional (non-existent) people with separate incomes.
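
The two failure modes described on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the records, names, SSNs, and incomes below are invented for illustration and do not come from the workforce data, and `income_by_ssn` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical wage records (ssn, name, income); every value is invented.
records = [
    ("111-22-3333", "ANN LEE", 20000),
    ("111-22-3333", "CARL DIAZ", 30000),  # a different person using the same SSN
    ("444-55-6666", "DANA FOX", 25000),
    ("444-55-6696", "DANA FOX", 15000),   # same person, one digit mistyped
]

def income_by_ssn(rows):
    """Total income per SSN, i.e. what you get when SSN is the primary key."""
    totals = defaultdict(int)
    for ssn, _name, income in rows:
        totals[ssn] += income
    return dict(totals)

totals = income_by_ssn(records)
for ssn, total in sorted(totals.items()):
    print(ssn, total)
```

Keyed on SSN, Ann and Carl collapse into one "person" earning 50,000 (over-consolidation), while Dana is split into two people earning 25,000 and 15,000 (under-consolidation).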

Problem Statement
“There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”
D. Rumsfeld, 2/12/2002

Record Linking
File A, File B
Your knowledge is limited to what’s in these two files ONLY

Knowledge Base Approach
All known representations are stored to facilitate matching in the future and possibly resolve past matching errors.
Bob Smith, Conway High School
Robert Smith, Acxiom
Bob Smith, UCA

Knowledge Base
Cluster   Representation
KB5765    Bob Smith, CHS
KB5765    Robert Smith, Acxiom
KB5765    Bob Smith, UCA
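
One way to picture the knowledge base is as a two-way mapping between cluster identifiers and every representation seen so far. The sketch below uses the slide's example rows; the `KnowledgeBase` class and its method names are hypothetical and are not taken from any ARC system.

```python
class KnowledgeBase:
    """Stores every known representation of an identity under a cluster ID."""

    def __init__(self):
        self._cluster_of = {}  # representation -> cluster ID
        self._members = {}     # cluster ID -> list of representations

    def add(self, cluster_id, representation):
        """Record a representation under a cluster (no duplicates)."""
        self._cluster_of[representation] = cluster_id
        members = self._members.setdefault(cluster_id, [])
        if representation not in members:
            members.append(representation)

    def lookup(self, representation):
        """Return the cluster ID for a known representation, else None."""
        return self._cluster_of.get(representation)

    def representations(self, cluster_id):
        """Return all representations stored for a cluster."""
        return list(self._members.get(cluster_id, []))


kb = KnowledgeBase()
for rep in ("Bob Smith, CHS", "Robert Smith, Acxiom", "Bob Smith, UCA"):
    kb.add("KB5765", rep)

print(kb.lookup("Robert Smith, Acxiom"))
print(kb.representations("KB5765"))
```

Because old representations are never discarded, a later record that matches any one of them resolves to the same cluster.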

Possible Matching Errors
False Positives (over-consolidation)
False Negatives (under-consolidation)

Identity Management
Over-consolidation – split the records apart and update all affected systems
Under-consolidation – bring the records together and update all affected systems

Knowledge Base Steps
Record 1: MARIA D CASTILLO-DELGADO, WILSON HS, DOB: 11/05/1995
Record 2: MARIA C CASTILLO-DELG, WILSON HS, DOB: 9/24/1994
Do all 5 values match exactly (E5)? No.
Do 4 values match (E4)? No.
Do 3 values match (E3)? No/*.
Do 2 values match (E2)? Yes. Are they enough for confidence? No.
CONCLUSION: NO MATCH. THEY ARE KEPT SEPARATE IN KNOWLEDGE BASE.
* Last name is a special case
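
The E5 to E2 questions amount to counting how many values agree exactly. The sketch below assumes the five values are first name, last name, SSN, date of birth, and LEA (inferred from the type names on the results slide, not stated here), ignores the middle initial, and treats the SSN as missing because the slide does not show one. The function names are hypothetical.

```python
# Assumed set of five compared values; the slide does not list them.
FIELDS = ("first", "last", "ssn", "dob", "lea")

def matched_fields(a, b):
    """Fields that are present in record a and exactly equal in record b."""
    return [f for f in FIELDS if a.get(f) is not None and a.get(f) == b.get(f)]

def match_level(a, b):
    """Return 'E5', 'E4', ... according to the number of exact matches."""
    return f"E{len(matched_fields(a, b))}"

rec1 = {"first": "MARIA", "last": "CASTILLO-DELGADO", "ssn": None,
        "dob": "11/05/1995", "lea": "WILSON HS"}
rec2 = {"first": "MARIA", "last": "CASTILLO-DELG", "ssn": None,
        "dob": "9/24/1994", "lea": "WILSON HS"}

print(match_level(rec1, rec2), matched_fields(rec1, rec2))  # E2 ['first', 'lea']
```

The level only says how many values agree. Whether that agreement is enough for confidence (here it is not) is a separate rule, as is the special handling of the truncated last name.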

Exact v. Fuzzy (Deterministic v. Probabilistic)
Exact matching drives the majority of identity resolution (Pareto Rule: 80% is easy)
Probabilistic algorithms: Soundex, QTR, Edit Distance, Neural Networks (Pareto Rule: 20% require 80% of effort)
You want a system that does what YOU, a human, would do
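
Of the fuzzy techniques named on this slide, edit distance is the simplest to show. The sketch below is a standard Levenshtein implementation (insertions, deletions, and substitutions each cost 1), applied to the truncated last name from the earlier example. It illustrates the technique only and is not the scoring used in the systems described here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(edit_distance("CASTILLO-DELGADO", "CASTILLO-DELG"))  # 3
print(edit_distance("CASTILLO-DELGADO", "CASTILLO-DELGADO"))  # 0
```

An exact comparison treats the two last names as simply different. The distance of 3 quantifies what a human sees at a glance: one is a truncation of the other.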

Identity Resolution: Actual Results

Type                        Records   Percent
E5                          52,       %
E3notLEAnotS                32,       %
E3notLEAnotF                3,742     3.7%
E2notLEAnotSghF             2,100     2.1%
E2notLEAnotSghD             2,083     2.1%
E3notLEAnotL                1,826     1.8%
E3notLEAnotD                1,731     1.7%
E2notLEAnotSghL                       %
E2notDghFnonvalSSN                    %
E3uniqueFLnotSyrDornull               %
E4notLEA                              %
E2notLEAnotFL                         %
E2notSuniqLLEA4yrDOB                  %
E2notLEAnotDF                         %
5 additional types                    %
TOTAL identified students   100,      %
unknown identities          1,007     1.0%
TOTAL                       101,      %

100,000+ records from Explore and Plan exams, 2008 and
Match rate: 99%.

Examples: 1% Not Matched
Columns: First name, Last name, SSN, Date of Birth
A H LEA ON TH P M AVER YAAAAAAACLA AAAAAAAAAAAA AAAAAAAA YUMM ON UECA R TWRIGHT Nov. 5, 1959 XYLONSILVER
100% is not realistic. 99% is realistic, but what’s important is the ability to manage problems as they arise.

Oyster Development
1st Generation: built in Access, automation of queries/functions creating Knowledge Base (started in 2009); shared with W. Virginia
Data was longitudinal, but sourced from K-12 exclusively
2009 IES Grant included funding for research with UALR; this work became “Oyster”
Oyster also funded with 2009 ARRA Grant

What is Oyster?
Open-System Entity Resolution
Not database-driven, pure XML
Java source code (Unicode support)
Matching by either Fellegi-Sunter or R-Swoosh methodologies

Timeline visual: 1st Gen IDs (Access), Oyster (Java/XML), K.I.M. (SQL/PHP); labels: x, 2.x, soon, GUI

What’s Next?
Oyster: the memory leak has been fixed and there is now a GUI
K.I.M. (Knowledge-base Identity Management): replicates what we currently have in Access, eliminating the size limitations, but does not have a GUI at this time
Oyster will be adding “assertions”
K.I.M. will add full audit capability
Both deal with over- and under-consolidations

Oyster XML Run Script

Oyster Input GUI

Oyster Run Script GUI

TrustEd: Identity Information Knowledge Base Identity Resolution & Management TrustEd De-identified Research Databases

Trusted Broker
A trusted broker maintains a cross reference table, encoding the identifiers for various agencies and for various representations of the entities.
Agency 1: Bob Smith, AC0236
Agency 2: Robert Smith, ED4297

Internal Identifier   Identity Information    Identifier Agency1   Identifier Agency2
KB5765                Bob Smith, Barton       AC0236               ED4297
KB5765                Robert Smith, Barton    AC0236               ED4297
KB5765                Bob Smith, Wilson       AC0236               ED4297

TrustEd
ADHE (requesting agency) to DWS: What are the salaries for these individuals? HE0236, HE0651, HE1327
TrustEd validates the request based on sharing rules and translates the requesting agency’s local IDs to that of the other agency. The results are then returned to the requesting agency without the use of personally identifiable information.
ID translation: HE0236 ↔ WF4297, HE0651 ↔ WF8516, HE1327 ↔ WF3508
DWS data: WF4297 Salary: $36,000; WF8516 Salary: $28,000; WF3508 Salary: $41,000

TrustEd Results
Brokered Result 1: HE0236 Salary: $36,000; HE0651 Salary: $28,000; HE1327 Salary: $41,000
Brokered Result 2: Salary: $41,000; Salary: $36,000; Salary: $28,000
Brokered Result 3: Average Salary: $35,000
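
The brokered exchange on this slide can be sketched as a lookup through a cross-reference table. The IDs and salaries below are the ones shown on the slide. The `SHARING_RULES` set, the `broker_salaries` function, and the choice to sort the unlinked salaries are assumptions made for illustration, not a description of how TrustEd is implemented.

```python
# Cross-reference held only by the broker: ADHE local ID -> DWS local ID.
XREF = {"HE0236": "WF4297", "HE0651": "WF8516", "HE1327": "WF3508"}

# Data held by DWS, keyed by its own local IDs.
DWS_SALARIES = {"WF4297": 36000, "WF8516": 28000, "WF3508": 41000}

# Hypothetical sharing rules: (requesting agency, providing agency) pairs.
SHARING_RULES = {("ADHE", "DWS")}

def broker_salaries(requester, provider, local_ids):
    """Return the three result shapes from the slide, with no names or SSNs."""
    if (requester, provider) not in SHARING_RULES:
        raise PermissionError(f"{requester} may not query {provider}")
    # Result 1: salaries keyed by the requester's own IDs.
    by_id = {he_id: DWS_SALARIES[XREF[he_id]] for he_id in local_ids}
    # Result 2: salaries only, sorted so they no longer line up with the IDs.
    unlinked = sorted(by_id.values(), reverse=True)
    # Result 3: a single aggregate.
    average = sum(unlinked) / len(unlinked)
    return by_id, unlinked, average

by_id, unlinked, average = broker_salaries("ADHE", "DWS", ["HE0236", "HE0651", "HE1327"])
print(by_id)
print(unlinked)
print(average)
```

The requesting agency only ever sees its own identifiers. The translation table stays with the broker, and the three results expose progressively less detail.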

Examples of Multi-agency Research
UAMS nICU – 1998 births to 2011 K12 assessments
Pre-K programs to K12 preparedness/assessments
K12 indicators for Higher Ed on-time graduation
Employment outcomes – Higher Ed to Workforce
Special Ed outcomes – K12, Higher Ed, Workforce, and Dept. of Corrections

Questions? Oyster information – UALR

Fellegi-Sunter v. R-Swoosh
Roberto Neill, 5/6/1948
Bobby O’Neill, 6/5/1948
What about: Roberto O’Neill, 5/5/1948
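
The slide does not say what point the three records make, so the following is only one possible reading: under a merge-based method such as R-Swoosh, a third record can bridge two records that do not match each other directly. The sketch follows the general R-Swoosh loop (compare a record against the resolved set, merge on a match, then re-compare the merged record). The match rule is a hypothetical one chosen for this example: share a first or last name, and have dates of birth that differ in at most one of month, day, or year. It is not Oyster's rule set.

```python
def make_record(first, last, dob):
    """Build a record whose fields are sets, so merges can accumulate values.
    dob is a (month, day, year) tuple."""
    return {"first": {first}, "last": {last}, "dob": {dob}}

def dob_close(d1, d2):
    """True if two dates differ in at most one of month, day, year."""
    return sum(a != b for a, b in zip(d1, d2)) <= 1

def match(r, s):
    """Hypothetical rule: a shared name value plus a close date of birth."""
    shares_name = bool(r["first"] & s["first"]) or bool(r["last"] & s["last"])
    close_dob = any(dob_close(a, b) for a in r["dob"] for b in s["dob"])
    return shares_name and close_dob

def merge(r, s):
    """Union the values of every field."""
    return {key: r[key] | s[key] for key in r}

def r_swoosh(records):
    """Resolve records into clusters by repeated match-and-merge."""
    pending, resolved = list(records), []
    while pending:
        r = pending.pop(0)
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            resolved.remove(buddy)
            pending.append(merge(r, buddy))  # merged record is compared again
    return resolved

r1 = make_record("Roberto", "Neill", (5, 6, 1948))
r2 = make_record("Bobby", "O'Neill", (6, 5, 1948))
r3 = make_record("Roberto", "O'Neill", (5, 5, 1948))

print(len(r_swoosh([r1, r2])))      # 2
print(len(r_swoosh([r1, r2, r3])))  # 1
```

With only the first two records nothing merges. Adding the third links all three, because the merged record carries both last names and both dates forward into later comparisons.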

TrustEd: Identity Information Knowledge Base Identity Resolution TrustEd Identity Management

Dual Database: Regulatory Compliance and Privacy
Identity Information: Local Agency ID Generated
Information of Interest: Stored With Local Agency ID Only
Knowledge Base: Identity Resolution
Edge Servers: Shareable Data

Regulations: FERPA/HIPAA
Both are similar in that they require permission before private information can be disclosed, but both also allow for research without disclosure if the data is “de-identified.”