Arkansas Research Center Neal Gibson, Ph.D. Greg Holland, Ph.D.


1 Arkansas Research Center Neal Gibson, Ph.D. Greg Holland, Ph.D.

2 You are the identity manager…
Record A: MARIA, WILSON HIGH SCHOOL, CASTILLO-DELGADO
Record B: MARIA, WILSON HIGH SCHOOL, CASTILLO-DELG

3 You are the identity manager…
Record A: MARIA D, WILSON HS, CASTILLO-DELGADO
Record B: MARIA C, WILSON HS, CASTILLO-DELG

4 You are the identity manager…
Record A: MARIA D, WILSON HS, CASTILLO-DELGADO, DOB: 11/05/1995
Record B: MARIA C, WILSON HS, CASTILLO-DELG, DOB: 9/24/1994

5 Identity Resolution Problems (K12)
There are ~55,000 unique first names among students in Arkansas and ~40,000 last names. Approximately 20% of Arkansas students share both the same first and last name with another student.

Student Count   First Name   Last Name
64              JOSHUA       SMITH
56              ASHLEY       SMITH
52              JESSICA      SMITH
48              JUSTIN       SMITH
37              ASHLEY       JONES
31              JUSTIN       WILLIAMS
30              JESSICA      JOHNSON
27              JOSHUA       BROWN

6 Identity Resolution (K12)
There are 4,026 students in Arkansas who share an SSN with at least one other student in the state.
Between August and January, 874 student transfers to other schools resulted in an SSN change.
Between August and January, an additional 1,018 students changed their SSN; we have records for only 300 of these changes.
There are ~17,000 students in Arkansas with a “900” SSN.

7 Identity Resolution (Workforce)
~55,000,000 records for 10 years, 2,938,718 unique SSNs, no DOBs, inconsistent naming standards.
7,865 SSNs are used by two or more people, for a total of 18,278 different individuals. If SSN were the primary key, their incomes would be combined and they would be treated as the same person.
The same person has two or more SSNs (because of a typo/transposition) 13,373 times. If SSN were the primary key, there would be 13,373 additional (non-existent) people with separate incomes.
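
The two failure modes on this slide can be sketched with a few rows. This is a minimal illustration only: the records, the SSNs, and the resolved person IDs (P1, P2, P3) are invented, and the sketch assumes identities have already been resolved by some other means.

```python
from collections import defaultdict

# Hypothetical wage records: (resolved_person_id, ssn, income).
# None of these values come from the Arkansas data.
records = [
    ("P1", "111-11-1111", 20000),
    ("P2", "111-11-1111", 30000),  # a different person using the same SSN
    ("P3", "222-22-2222", 15000),
    ("P3", "222-22-2223", 10000),  # the same person, SSN mistyped
]

def total_income(rows, key_index):
    """Sum income per key, where the key is column 0 (person) or 1 (SSN)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_ssn = total_income(records, 1)     # SSN treated as the primary key
by_person = total_income(records, 0)  # resolved identity as the key

print(by_ssn)
print(by_person)
```

Keyed by SSN, P1 and P2 collapse into one earner with a combined income, while P3 is split into two earners. Keyed by resolved identity, each person appears once.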

8 Problem Statement There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. D. Rumsfeld 2/12/2002

9 Record Linking
File A, File B
Your knowledge is limited to what’s in these two files ONLY

10 Knowledge Base Approach
All known representations are stored to facilitate matching in the future and possibly resolve past matching errors.

Incoming representations: Bob Smith, Conway High School; Robert Smith, Acxiom; Bob Smith, UCA

Knowledge Base
Cluster   Representation
KB5765    Bob Smith, CHS
KB5765    Robert Smith, Acxiom
KB5765    Bob Smith, UCA
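
A knowledge base of this kind can be sketched as a two-way index between cluster IDs and representations. The class and method names below are hypothetical and are not taken from Oyster or K.I.M.; only the cluster ID and the three representations come from the slide.

```python
class KnowledgeBase:
    """Stores every known representation of an entity under one cluster ID."""

    def __init__(self):
        self._cluster_of = {}  # representation -> cluster ID
        self._members = {}     # cluster ID -> list of representations

    def add(self, cluster_id, representation):
        """Record a representation as belonging to a cluster."""
        self._cluster_of[representation] = cluster_id
        self._members.setdefault(cluster_id, []).append(representation)

    def lookup(self, representation):
        """Return the cluster ID for a known representation, else None."""
        return self._cluster_of.get(representation)

    def representations(self, cluster_id):
        """Return all representations stored for a cluster."""
        return list(self._members.get(cluster_id, []))


kb = KnowledgeBase()
kb.add("KB5765", ("Bob Smith", "CHS"))
kb.add("KB5765", ("Robert Smith", "Acxiom"))
kb.add("KB5765", ("Bob Smith", "UCA"))

print(kb.lookup(("Robert Smith", "Acxiom")))
print(kb.lookup(("Bob Smith", "Unknown Employer")))
```

Because earlier representations are kept rather than overwritten, a later record that exactly repeats any one of them resolves to the same cluster without new matching work.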

11 Possible Matching Errors False Positives (over-consolidation) False Negatives (under-consolidation)

12 Identity Management Over-consolidation – split the records apart and update all affected systems Under-consolidation – bring the records together and update all affected systems

13 Knowledge Base Steps
Record A: MARIA D, WILSON HS, CASTILLO-DELGADO, DOB: 11/05/1995
Record B: MARIA C, WILSON HS, CASTILLO-DELG, DOB: 9/24/1994

Do all 5 values match exactly (E5)? No.
Do 4 values match (E4)? No.
Do 3 values match (E3)? No/*.
Do 2 values match (E2)? Yes.
Are they enough for confidence? No.
CONCLUSION: NO MATCH. THEY ARE KEPT SEPARATE IN KNOWLEDGE BASE.
* Last name is a special case
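
The cascade above can be sketched as a count of exactly matching values followed by a confidence check. This is a sketch under assumptions: the five field names are a reading of the slide's example, and the `is_confident` hook is hypothetical, since the slides do not spell out the rules that make an E2, E3, or E4 match acceptable.

```python
FIELDS = ("first", "middle", "last", "school", "dob")  # assumed five values

record_a = {"first": "MARIA", "middle": "D", "last": "CASTILLO-DELGADO",
            "school": "WILSON HS", "dob": "11/05/1995"}
record_b = {"first": "MARIA", "middle": "C", "last": "CASTILLO-DELG",
            "school": "WILSON HS", "dob": "9/24/1994"}

def exact_match_count(a, b):
    """Number of fields whose values are exactly equal (the n in 'En')."""
    return sum(1 for field in FIELDS if a[field] == b[field])

def resolve(a, b, is_confident=lambda n, a, b: False):
    """Return 'MATCH' or 'NO MATCH'.

    E5 is accepted outright. Lower levels are accepted only when the
    caller-supplied rule says the partial match is trustworthy; the
    default rule rejects every partial match.
    """
    n = exact_match_count(a, b)
    if n == len(FIELDS):
        return "MATCH"
    if n >= 2 and is_confident(n, a, b):
        return "MATCH"
    return "NO MATCH"

print(exact_match_count(record_a, record_b))
print(resolve(record_a, record_b))
```

For the two MARIA records only first name and school agree, so the pair reaches E2 and, without a rule vouching for it, stays separate.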

14 Exact v. Fuzzy (Deterministic v. Probabilistic)
Exact matching drives the majority of identity resolution (Pareto Rule: 80% is easy).
Probabilistic algorithms: Soundex, QTR, Edit Distance, Neural Networks (Pareto Rule: 20% require 80% of effort).
You want a system that does what YOU, a human, would do.
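
As one example of the fuzzy techniques listed, edit distance can be sketched with the standard Levenshtein dynamic program. The `similarity` helper and its normalization are illustrative assumptions, not a scoring rule from the presentation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

print(edit_distance("CASTILLO-DELGADO", "CASTILLO-DELG"))  # 3
print(similarity("CASTILLO-DELGADO", "CASTILLO-DELG"))     # 0.8125
```

The truncated last name is three deletions away from the full one, which is the kind of near-match an exact comparison reports as a plain mismatch.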

15 Identity Resolution: Actual Results
100,000+ records from Explore and Plan exams, 2008 and 2009. Match rate, 99%.

Type                        Records   Percent
E5                          52,287    51.8%
E3notLEAnotS                32,370    32.0%
E3notLEAnotF                3,742     3.7%
E2notLEAnotSghF             2,100     2.1%
E2notLEAnotSghD             2,083     2.1%
E3notLEAnotL                1,826     1.8%
E3notLEAnotD                1,731     1.7%
E2notLEAnotSghL             948       0.9%
E2notDghFnonvalSSN          760       0.8%
E3uniqueFLnotSyrDornull     611       0.6%
E4notLEA                    539       0.5%
E2notLEAnotFL               289       0.3%
E2notSuniqLLEA4yrDOB        199       0.2%
E2notLEAnotDF               174       0.2%
5 additional types          358       0.4%
TOTAL identified students   100,017   99.0%
unknown identities          1,007     1.0%
TOTAL                       101,024   100.0%

16 Examples: 1% Not Matched
Columns: First name, Last name, SSN, Date of Birth
A H LEA ON TH P M AVER YAAAAAAACLA AAAAAAAAAAAA AAAAAAAA YUMM ON UECA R TWRIGHT Nov. 5, 1959 XYLONSILVER
100% is not realistic. 99% is realistic, but what’s important is the ability to manage problems as they arise.

17 Oyster Development
1st Generation: built in Access, with automation of the queries/functions creating the Knowledge Base (started in 2009); shared with W. Virginia.
Data was longitudinal, but sourced from K-12 exclusively.
The 2009 IES Grant included funding for research with UALR; this work became “Oyster”.
Oyster was also funded with a 2009 ARRA Grant.

18 What is Oyster?
Open-System Entity Resolution
Not database-driven, pure XML
Java source code (Unicode support)
Matching by either Fellegi-Sunter or R-Swoosh methodologies

19 Timeline visual (2009, 2010, 2011, soon): 1stGenIDs (Access), Oyster (Java/XML), K.I.M. (SQL/PHP). Version labels shown: 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 1.x, 2.x, 3.0, 3.1, 3.2, 2.0; GUI.

20 What’s Next?
Oyster: the memory leak has been fixed and there is now a GUI.
K.I.M. (Knowledge-base Identity Management): replicates what we currently have in Access, eliminating the size limitations, but does not have a GUI at this time.
Oyster will be adding “assertions”.
KIM will add full audit capability.
Both deal with over- and under-consolidations.

21 Oyster XML Run Script

22

23 Oyster Input GUI

24 Oyster Run Script GUI

25 TrustEd: Identity Information Knowledge Base Identity Resolution & Management TrustEd De-identified Research Databases

26 Trusted Broker
A trusted broker maintains a cross-reference table, encoding the identifiers for various agencies and for various representations of the entities.

Agency 1: Bob Smith, AC0236
Agency 2: Robert Smith, ED4297

Internal Identifier   Identity Information    Identifier Agency1   Identifier Agency2
KB5765                Bob Smith, Barton       AC0236               ED4297
KB5765                Robert Smith, Barton    AC0236               ED4297
KB5765                Bob Smith, Wilson       AC0236               ED4297

27 TrustEd Results
TrustEd validates the request based on sharing rules and translates the requesting agency’s local IDs to those of the other agency. The results are then returned to the requesting agency without the use of personally identifiable information.

Request (ADHE to DWS): What are the salaries for these individuals? HE0236, HE0651, HE1327
ID translation in TrustEd: HE0236 ↔ WF4297, HE0651 ↔ WF8516, HE1327 ↔ WF3508
DWS records: WF4297 Salary: $36,000; WF8516 Salary: $28,000; WF3508 Salary: $41,000

Brokered Result 1: HE0236 Salary: $36,000; HE0651 Salary: $28,000; HE1327 Salary: $41,000
Brokered Result 2: Salary: $41,000; Salary: $36,000; Salary: $28,000
Brokered Result 3: Average Salary: $35,000
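
The brokered exchange on this slide can be sketched as a crosswalk lookup guarded by a sharing rule. The IDs and salaries are the ones shown on the slide; the function names and the shape of `SHARING_RULES` are hypothetical, not TrustEd's actual interface.

```python
# Crosswalk held only by the broker: requesting-agency ID -> provider ID.
CROSSWALK = {"HE0236": "WF4297", "HE0651": "WF8516", "HE1327": "WF3508"}

# Data held by the providing agency (DWS), keyed by its own local IDs.
DWS_SALARIES = {"WF4297": 36000, "WF8516": 28000, "WF3508": 41000}

# Hypothetical sharing rules: (requester, provider) -> fields allowed.
SHARING_RULES = {("ADHE", "DWS"): {"salary"}}

def brokered_salaries(requester, provider, local_ids):
    """Return salaries keyed by the requester's own IDs (Brokered Result 1)."""
    if "salary" not in SHARING_RULES.get((requester, provider), set()):
        raise PermissionError(f"{requester} may not request salary from {provider}")
    return {local_id: DWS_SALARIES[CROSSWALK[local_id]] for local_id in local_ids}

def brokered_average(requester, provider, local_ids):
    """Return only an aggregate, with no per-person rows (Brokered Result 3)."""
    salaries = brokered_salaries(requester, provider, local_ids)
    return sum(salaries.values()) / len(salaries)

ids = ["HE0236", "HE0651", "HE1327"]
print(brokered_salaries("ADHE", "DWS", ids))
print(brokered_average("ADHE", "DWS", ids))  # 35000.0
```

Neither agency sees the other's identifiers or any names: the requester gets values back under its own HE IDs, and the aggregate form returns even less.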

28 Examples of Multi-agency Research
UAMS nICU: 1998 births to 2011 K12 assessments
Pre-K programs to K12 preparedness/assessments
K12 indicators for Higher Ed on-time graduation
Employment outcomes: Higher Ed to Workforce
Special Ed outcomes: K12, Higher Ed, Workforce, and Dept. of Corrections

29 Questions? Oyster information – UALR http://www.ualr.edu/eriq Neal.Gibson@arkansas.gov Greg.Holland@arkansas.gov

30 Fellegi-Sunter v. R-Swoosh Roberto Neill, 5/6/1948 Bobby O’Neill, 6/5/1948 What about: Roberto O’Neill, 5/5/1948
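
One way to read this example is through merge-based resolution in the style of R-Swoosh, where matched records are merged and the merged record is compared again. The sketch below is illustrative only: the match rule (at least three of five attributes share a value), the M/D/YYYY reading of the dates, and all function names are assumptions, and this is not Oyster's implementation.

```python
def make_record(first, last, month, day, year):
    """Each attribute holds a set so merged records can keep all values."""
    return {"first": {first}, "last": {last},
            "month": {month}, "day": {day}, "year": {year}}

def matches(a, b, threshold=3):
    """Assumed rule: match when enough attributes share at least one value."""
    return sum(1 for key in a if a[key] & b[key]) >= threshold

def merge(a, b):
    """Union the value sets attribute by attribute."""
    return {key: a[key] | b[key] for key in a}

def r_swoosh(records, threshold=3):
    pending = list(records)
    resolved = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if matches(current, r, threshold)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Merge and re-queue so the richer record is compared again.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved

neill = make_record("Roberto", "Neill", 5, 6, 1948)
bobby = make_record("Bobby", "O'Neill", 6, 5, 1948)
oneill = make_record("Roberto", "O'Neill", 5, 5, 1948)

print(len(r_swoosh([neill, bobby])))          # 2
print(len(r_swoosh([neill, bobby, oneill])))  # 1
```

Under these assumptions the first two records agree only on the year and stay apart. The third record agrees with each of them on three attributes, so once it arrives all three end up in a single merged entity.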

31 TrustEd: Identity Information Knowledge Base Identity Resolution TrustEd Identity Management

32 Dual Database: Regulatory Compliance and Privacy
Identity Information: Local Agency ID Generated
Information of Interest: Stored With Local Agency ID Only
Knowledge Base: Identity Resolution
Edge Servers: Shareable Data

33 Regulations: FERPA/HIPAA
Both are similar in that they require permission before private information can be disclosed, but both also allow for research without disclosure if the data is “de-identified.”

