Arkansas Research Center

Arkansas Research Center Neal Gibson, Ph.D. Greg Holland, Ph.D.

Arkansas Research Center It’s about capacity and creating links. While this report demonstrates capacity at the state level, institutions have their own needs as well. With changes to FERPA, we are able to provide more feedback to institutions. We could provide an outcomes report to institutions. Our only request is that it be standardized for all institutions. It would also be nice to provide a detailed feedback report for high schools. What would be helpful for high schools to know about the college experience? arc.arkansas.gov

You are the identity manager…
MARIA, WILSON HIGH SCHOOL, CASTILLO-DELGADO
MARIA, WILSON HIGH SCHOOL, CASTILLO-DELG

You are the identity manager…
MARIA D, WILSON HS, CASTILLO-DELGADO
MARIA C, WILSON HS, CASTILLO-DELG

You are the identity manager…
MARIA D, WILSON HS, CASTILLO-DELGADO, DOB: 11/05/1995
MARIA C, WILSON HS, CASTILLO-DELG, DOB: 9/24/1994

Identity Resolution Problems (K12) There are ~55,000 unique first names among students in Arkansas and ~40,000 last names. Approximately 20% of Arkansas students share both first and last name with another student.
Student Count   First Name   Last Name
64              JOSHUA       SMITH
56              ASHLEY
52              JESSICA
48              JUSTIN
37                           JONES
31                           WILLIAMS
30                           JOHNSON
27                           BROWN

Identity Resolution (K12) There are 4,026 students in Arkansas who share an SSN with at least one other student in the state. Between August and January, 874 student transfers to other schools resulted in an SSN change. Between August and January, an additional 1,018 students changed their SSN; we have records for only 300 of these changes. There are ~17,000 students in Arkansas with a “900” SSN.

Identity Resolution (Workforce) ~55,000,000 records for 10 years, 2,938,718 unique SSNs, no DOBs, inconsistent naming standards. 7,865 SSNs are used by two or more people, for a total of 18,278 different individuals. Their incomes would be combined and treated as belonging to the same person if SSN were the primary key. The same person has two or more SSNs (because of a typo/transposition) 13,373 times. There would be 13,373 additional (non-existent) people with separate incomes if SSN were the primary key.
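
The two failure modes of using SSN as a primary key can be sketched on toy data. This is a minimal illustration, not ARC's method: the records, the `true_person` labels, and the function name `ssn_key_problems` are all hypothetical, and a real system would not know the true person in advance.

```python
from collections import defaultdict

# Hypothetical toy data: (true_person, reported_ssn). In real wage records the
# "true_person" value is exactly what identity resolution has to infer.
RECORDS = [
    ("alice", "111111111"),
    ("bob",   "111111111"),  # two different people reporting one SSN
    ("carol", "222222222"),
    ("carol", "222222223"),  # one person, second SSN caused by a typo
    ("dave",  "333333333"),
]

def ssn_key_problems(records):
    """Return (SSNs used by 2+ people, people holding 2+ SSNs)."""
    people_by_ssn = defaultdict(set)
    ssns_by_person = defaultdict(set)
    for person, ssn in records:
        people_by_ssn[ssn].add(person)
        ssns_by_person[person].add(ssn)
    # Keying on SSN would merge these people into one (over-consolidation).
    over = {ssn: people for ssn, people in people_by_ssn.items() if len(people) > 1}
    # Keying on SSN would split these people into several (under-consolidation).
    under = {person: ssns for person, ssns in ssns_by_person.items() if len(ssns) > 1}
    return over, under

over_consolidated, under_consolidated = ssn_key_problems(RECORDS)
```

With SSN as the key, alice and bob would have their incomes combined, and carol would appear as two people with separate incomes.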

Problem Statement There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. D. Rumsfeld 2/12/2002

Record Linking: Merge/Purge File A File B Your knowledge is limited to what’s in these two files ONLY

Knowledge Base Approach All known representations are stored to facilitate matching in the future and possibly resolve past matching errors. Records: Bob Smith, Conway High School; Robert Smith, Acxiom; Bob Smith, UCA. Knowledge Base Cluster Representation KB5765: Bob Smith, CHS; Robert Smith, Acxiom; Bob Smith, UCA.

Knowledge Base Steps
MARIA D, WILSON HS, CASTILLO-DELGADO, DOB: 11/05/1995
MARIA C, WILSON HS, CASTILLO-DELG, DOB: 9/24/1994
Do all 5 values match exactly (E5)? No.
Do 4 values match (E4)? No.
Do 3 values match (E3)? No*.
Do 2 values match (E2)? Yes. Are they enough for confidence? No.
CONCLUSION: NO MATCH. THEY ARE KEPT SEPARATE IN THE KNOWLEDGE BASE.
* Last name is a special case
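
The stepwise check above can be sketched as a count of exact agreements. Several things here are assumptions rather than details from the slides: the five compared values (first name, last name, SSN, date of birth, LEA/school) are inferred from the match-type names used later in the deck, the prefix test for the last-name special case is a guess, and the `min_exact=3` confidence rule is hypothetical, since the slides do not state when an E2 or E3 agreement is trusted.

```python
FIELDS = ("first", "last", "ssn", "dob", "lea")  # assumed set of five values

def match_level(a, b):
    """Count exact agreements; a missing value never counts as a match."""
    agreed = [f for f in FIELDS if a.get(f) is not None and a.get(f) == b.get(f)]
    return len(agreed), agreed

def last_name_truncated(a, b):
    """Assumed special case: one last name is a proper prefix of the other."""
    short, long_ = sorted((a["last"], b["last"]), key=len)
    return bool(short) and short != long_ and long_.startswith(short)

def resolve(a, b, min_exact=3):
    """Hypothetical rule: match on 3+ exact agreements, otherwise keep apart."""
    level, _ = match_level(a, b)
    return "MATCH" if level >= min_exact else "NO MATCH"

# The two records from the slide; SSN is not shown there, so it is left as None.
rec_a = {"first": "MARIA", "last": "CASTILLO-DELGADO", "ssn": None,
         "dob": "1995-11-05", "lea": "WILSON HS"}
rec_b = {"first": "MARIA", "last": "CASTILLO-DELG", "ssn": None,
         "dob": "1994-09-24", "lea": "WILSON HS"}
```

Here only first name and school agree exactly, so the pair lands at E2. The truncated last name is flagged by `last_name_truncated` but is not counted as an exact agreement, and `resolve(rec_a, rec_b)` keeps the records separate.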

Exact v. Fuzzy (Deterministic v. Probabilistic) Exact matching drives the majority of identity resolution (Pareto rule: 80% is easy). Probabilistic algorithms (Soundex, QTR, edit distance, neural networks) handle the remainder (Pareto rule: 20% requires 80% of the effort). You want a system that does what YOU, a human, would do.
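
As one concrete example of the fuzzy side, edit distance can be computed with the standard Levenshtein dynamic program. This is a generic sketch, not code from Oyster or KIM; the function names and the `max_distance=1` threshold are assumptions chosen for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def is_fuzzy_match(a, b, max_distance=1):
    """Try the cheap exact comparison first, then fall back to edit distance."""
    a, b = a.strip().upper(), b.strip().upper()
    return a == b or edit_distance(a, b) <= max_distance
```

Under this threshold, "Neil" and "Neal" would be treated as a fuzzy match (one substitution), while "Gibson" and "Gibbs" would not.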

Possible Matching Errors False Positives (Over-consolidation) False Negatives (Under-consolidation)

Identity Management Over-consolidation – split the records apart and update all affected systems Under-consolidation – bring the records together and update all affected systems

Identity Resolution Type Actual Results
Identity Resolution Type     Records    Percent
E5                            52,287      51.8%
E3notLEAnotS                  32,370      32.0%
E3notLEAnotF                   3,742       3.7%
E2notLEAnotSghF                2,100       2.1%
E2notLEAnotSghD                2,083       2.1%
E3notLEAnotL                   1,826       1.8%
E3notLEAnotD                   1,731       1.7%
E2notLEAnotSghL                  948       0.9%
E2notDghFnonvalSSN               760       0.8%
E3uniqueFLnotSyrDornull          611       0.6%
E4notLEA                         539       0.5%
E2notLEAnotFL                    289       0.3%
E2notSuniqLLEA4yrDOB             199       0.2%
E2notLEAnotDF                    174       0.2%
5 additional types               358       0.4%
TOTAL identified students    100,017      99.0%
unknown identities             1,007       1.0%
TOTAL                        101,024     100.0%
100,000+ records from Explore and Plan exams, 2008 and 2009. Match rate: 99%.

Examples: 1% Not Matched
First name Last name SSN Date of Birth A H LE A ON TH P M <provided> AVER YAAAAAAA CLA AAAAAAAAAAAAAAAAAAAA <none> YUMM ON UE CA R TWRIGHT Nov. 5, 1959 XYLON SILVER
100% is not realistic. 99% is realistic, but what’s important is the ability to manage problems as they arise.

Oyster Development 1st Generation: built in Access, automating the queries/functions that create the Knowledge Base (started in 2009); shared with West Virginia. Data was longitudinal, but sourced from K-12 exclusively. The 2009 IES Grant included funding for research with UALR; this work became “Oyster”. Oyster was also funded with a 2009 ARRA Grant.

What is Oyster?
Open-System Entity Resolution
Not database-driven, pure XML
Java source code (Unicode support)
Matching by R-Swoosh methodologies, but could be adapted to Fellegi-Sunter

Timeline (figure): releases of 1stGenIDs (Access), K.I.M. (SQL/PHP), K.I.M. GUI, and Oyster (Java/XML), 2009 to 2012.

Oyster and KIM
Oyster: Thorough documentation and GUI. KIM: Little documentation and no GUI.
Oyster: Has not been benchmarked since memory fix. KIM: Throughput is 1 to 5 million records an hour, depending on the data and use.
Oyster: R-Swoosh. KIM: Fellegi-Sunter.
Both deal with over- and under-consolidations.

Fellegi-Sunter: Record-based matching. R-Swoosh: Attribute-based matching.
Already determined to be the same individual: Neil Gibson, 987654321; Neal Gibson, 222222222; Neal Gibbs, 987654321.
What about: Neal Gibson, 987654321 (all correct); Neil Gibbs, 222222222 (none correct)?
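
The difference between the two styles can be illustrated with the records on this slide. This sketch is a deliberate simplification: it replaces both Fellegi-Sunter (which weighs field agreements probabilistically) and R-Swoosh (which merges matched records and compares against the merged result) with plain agreement counts, and the `min_agree=2` threshold and function names are assumptions.

```python
FIELDS = ("first", "last", "ssn")

# Records already determined to be the same individual (from the slide).
CLUSTER = [
    {"first": "Neil", "last": "Gibson", "ssn": "987654321"},
    {"first": "Neal", "last": "Gibson", "ssn": "222222222"},
    {"first": "Neal", "last": "Gibbs",  "ssn": "987654321"},
]

def record_based_match(candidate, cluster, min_agree=2):
    """Match if the candidate agrees with one single record on enough fields."""
    return any(
        sum(candidate[f] == rec[f] for f in FIELDS) >= min_agree
        for rec in cluster
    )

def attribute_based_match(candidate, cluster, min_agree=2):
    """Match if enough candidate values appear anywhere in the cluster."""
    pooled = {f: {rec[f] for rec in cluster} for f in FIELDS}
    return sum(candidate[f] in pooled[f] for f in FIELDS) >= min_agree

all_correct = {"first": "Neal", "last": "Gibson", "ssn": "987654321"}
none_correct = {"first": "Neil", "last": "Gibbs", "ssn": "222222222"}
```

Both approaches accept the all-correct record. The none-correct record agrees with each stored record on only one field, so the record-based check rejects it, while the attribute-based check accepts it because every one of its values appears somewhere in the cluster.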

Oyster XML Run Script

Oyster XML Index

Oyster Input GUI

Oyster Run Script GUI

TrustEd: Knowledgebase Identity Management (KIM) and TrustEd Identifier Management (TIM). Diagram components: TIM (identifier management), KIM (identity resolution), TrustEd, and de-identified research databases.

TrustEd: KIM & TIM Diagram labels: Research Data, RecID, PII, SourceID, RecID.

TrustEd: KIM & TIM Diagram labels: PII, KBID, KIMID; KIMID, RecID.

TrustEd: KIM & TIM Diagram labels: KIMID, SourceID, RecID, TIMID, Research Data, AgencyID.

TrustEd: KIM & TIM Diagram labels: RecID, SourceID, TIMID: Management, Agency Crosswalks, Research Data, PII.

TrustEd Results TrustEd validates the request based on sharing rules and translates the requesting agency’s local IDs to those of the other agency. The results are then returned to the requesting agency without the use of personally identifiable information.
ADHE to DWS: What are the salaries for these individuals? HE0236, HE0651, HE1327
DWS records: WF4297 Salary: $36,000; WF8516 Salary: $28,000; WF3508 Salary: $41,000
TIM crosswalk: HE0236 ↔ WF4297; HE0651 ↔ WF8516; HE1327 ↔ WF3508
Brokered Result 1: HE0236 Salary: $36,000; HE0651 Salary: $28,000; HE1327 Salary: $41,000
Brokered Result 2: Salary: $41,000; Salary: $36,000; Salary: $28,000
Brokered Result 3: Average Salary: $35,000
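
The three brokered results can be reproduced from the IDs and salaries on this slide. This is a minimal sketch under stated assumptions, not TrustEd's implementation: the function names are hypothetical, the sharing-rule validation step is omitted, and sorting Result 2 in descending order is only a guess at how the link to the request order is broken.

```python
from statistics import mean

# Crosswalk held by TIM: higher-ed local ID -> workforce local ID (from the slide).
CROSSWALK = {"HE0236": "WF4297", "HE0651": "WF8516", "HE1327": "WF3508"}
# Salaries held by the workforce agency, keyed by its own local IDs.
DWS_SALARIES = {"WF4297": 36000, "WF8516": 28000, "WF3508": 41000}

def result_linked(he_ids):
    """Brokered Result 1: salaries keyed by the requester's own IDs."""
    return {he: DWS_SALARIES[CROSSWALK[he]] for he in he_ids}

def result_unlinked(he_ids):
    """Brokered Result 2: salaries only, reordered so rows cannot be tied to IDs."""
    return sorted((DWS_SALARIES[CROSSWALK[he]] for he in he_ids), reverse=True)

def result_aggregate(he_ids):
    """Brokered Result 3: a single aggregate value."""
    return mean(DWS_SALARIES[CROSSWALK[he]] for he in he_ids)

REQUEST = ["HE0236", "HE0651", "HE1327"]
```

In every case the requesting agency sees only its own identifiers or none at all; the workforce IDs and any personally identifiable information stay behind the broker.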

Examples of Multi-agency Research
UAMS nICU: 1998 births to 2011 K12 assessments
Pre-K programs to K12 preparedness/assessments
K12 indicators for Higher Ed on-time graduation
Employment outcomes: Higher Ed to Workforce
Special Ed outcomes: K12, Higher Ed, Workforce, and Dept. of Corrections

Questions?
Oyster Information (UALR): http://sourceforge.net/projects/oysterer/ jrtalburt@ualr.edu
KIM Information (ARC): http://arc.arkansas.gov Neal.Gibson@arkansas.gov Greg.Holland@arkansas.gov