Use of Matching Algorithms to Determine Unit Eligibility

1 Use of Matching Algorithms to Determine Unit Eligibility
Brandon Hopkins, Statistician, RTI
[Diagram labels: education survey; imputation method to complete missing; redesign; matching prior to current]

2 Statement of Problem
Develop a study population for an education survey by matching lists.
Challenge 1: Match instructional programs within each school between the prior-year and current-year studies using only the “department name”.
Challenge 2: The survey was redesigned between the prior and current year, and the current year used more refined definitions of “instructional programs”.
[Diagram labels: matching prior to current; redesign from broad to refined]

3 Challenge 1: Matching Algorithms
Types of Text Matching
Exact: Changing all names to uppercase letters, removing extra spaces and symbols, and replacing “&” with “and”
Fuzzy: Removing text within parentheses, the parentheses themselves, and all numbers
Phonetic: SAS SOUNDEX function

4 Round 1 (Exact) and Round 2 (Fuzzy)
Round 1 example
Year | Text | Cleaned text
2015 | Physics and Astronomy | PHYSICSANDASTRONOMY
2016 | Physics & Astronomy | PHYSICSANDASTRONOMY

Round 2 example
Year | Text | Cleaned text
2015 | Mathematical Sciences (Statistics) | MATHEMATICALSCIENCES
2016 | Mathematical Sciences | MATHEMATICALSCIENCES
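The Round 1 and Round 2 cleaning rules can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's SAS code: the function names `clean_exact` and `clean_fuzzy` are hypothetical, and the sketch assumes, as the cleaned examples suggest, that all spaces and symbols are dropped rather than only the extra ones.

```python
import re


def clean_exact(name: str) -> str:
    """Round 1 (exact): uppercase, replace '&' with 'AND', drop spaces and symbols."""
    text = name.upper().replace("&", "AND")
    # Assumption: keep only letters and digits, as in the slide's cleaned examples.
    return re.sub(r"[^A-Z0-9]", "", text)


def clean_fuzzy(name: str) -> str:
    """Round 2 (fuzzy): also drop parenthesized text, stray parentheses, and digits."""
    text = re.sub(r"\([^)]*\)", "", name)  # remove "(...)" together with its contents
    text = re.sub(r"[()0-9]", "", text)    # remove unmatched parentheses and all numbers
    return clean_exact(text)


if __name__ == "__main__":
    print(clean_exact("Physics & Astronomy"))
    print(clean_fuzzy("Mathematical Sciences (Statistics)"))
```

Two names are treated as a Round 1 match when their `clean_exact` results are equal; names left unmatched would then be compared on their `clean_fuzzy` results.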

5 Round 3 Soundex Matching
Step 1: Remove common extraneous words from the text, such as: DEPARTMENTOF, DEPTOF, DEPARTMENT, DEPT, DIVISIONOF, DIVOF, DIVISION, DIV, PROGRAM, PROG, CENTERFOR, CTRFOR, CENTER, CTR, INSTITUTEOF, INSTOF, INSTITUTE, INST, GENERAL
Step 2: Apply the SOUNDEX function to the text

Example (the two names differ by 1 letter)
Year | Text | Step 2
2015 | Comparitive Biosciences | C
2016 | Comparative Biosciences |
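The two Round 3 steps can be sketched in Python as below. This is only an approximation: it is modeled on the documented description of the SAS SOUNDEX function (keep the first letter, discard vowels and H, W, Y, map consonants to digit classes, collapse adjacent letters of the same class, no truncation to four characters) and is not guaranteed to reproduce SAS output in every case. The helper names are hypothetical, and treating the first letter's class as part of the adjacency check is an assumption.

```python
# Extraneous words from Step 1, in the order listed on the slide
# (longer forms first, so "DEPARTMENTOF" is removed before "DEPARTMENT").
EXTRANEOUS = [
    "DEPARTMENTOF", "DEPTOF", "DEPARTMENT", "DEPT",
    "DIVISIONOF", "DIVOF", "DIVISION", "DIV",
    "PROGRAM", "PROG",
    "CENTERFOR", "CTRFOR", "CENTER", "CTR",
    "INSTITUTEOF", "INSTOF", "INSTITUTE", "INST",
    "GENERAL",
]

# Consonant classes used by Soundex; every other letter is discarded.
SOUNDEX_CLASSES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def strip_extraneous(cleaned: str) -> str:
    """Step 1: remove extraneous words from already-cleaned (uppercase, no spaces) text."""
    for word in EXTRANEOUS:
        cleaned = cleaned.replace(word, "")
    return cleaned


def soundex(text: str) -> str:
    """Step 2: Soundex-style code without padding or truncation."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = [letters[0]]
    # Assumption: the first letter's class counts when collapsing adjacent duplicates.
    previous = SOUNDEX_CLASSES.get(letters[0])
    for ch in letters[1:]:
        digit = SOUNDEX_CLASSES.get(ch)
        if digit is not None and digit != previous:
            code.append(digit)
        previous = digit  # a discarded letter separates two letters of the same class
    return "".join(code)


if __name__ == "__main__":
    a = soundex(strip_extraneous("COMPARITIVEBIOSCIENCES"))
    b = soundex(strip_extraneous("COMPARATIVEBIOSCIENCES"))
    print(a == b)
```

Because the misspelling in the example differs only in a vowel, and vowels are discarded, both names receive the same code under this sketch. Note that plain substring removal can also strip these letter sequences from inside longer words, so the word list would need review against real data.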

6 Quality of Match - Distance Function
The quality of a match was evaluated by examining the magnitude of the differences between the data sources.
The SAS SPEDIS (spelling distance) function computes an asymmetric spelling distance between two words: the normalized cost of converting the keyword to the query word through a sequence of operations.
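The idea of a normalized, asymmetric conversion cost can be illustrated in Python with a stand-in measure. The sketch below uses a plain Levenshtein edit distance with unit costs, which is not the SPEDIS cost table, so its values will not reproduce SPEDIS scores. The function names are hypothetical, and scaling by the query length to a 0-100 style score is an assumption made for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_distance(query: str, keyword: str) -> float:
    """Cost of converting keyword to query, scaled by the query length.

    Stand-in for a spelling distance; asymmetric because only the
    query length is used for normalization.
    """
    if not query:
        raise ValueError("query must not be empty")
    return 100 * edit_distance(keyword, query) / len(query)


if __name__ == "__main__":
    print(normalized_distance("Laboratory Animal Science", "Laboratory Animal Sciences"))
```

A lower score indicates a closer match. Swapping the two arguments can change the score, which is the sense in which such a measure is asymmetric.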

7 Quality of Match - Distance Function
Unit Name 2015 | Unit Name 2016 | SPEDIS
Laboratory Animal Science | Laboratory Animal Sciences | 2
Civil & Environmental Engineering | Civil & Environmental Engineer | 3
Computer and Information Sciences | Computer and Information Scien |
Chemical & Biomolecular Engineering | Chemical & Biomolecular Engr | 9
Agricultural & Biological Engineering | Agricultural & Biological Engr |
Animal & Food Sci's (Animal Sci Major) | Animal and Food Sciences | 36
Agriculture & Resource Economics | Agricultural Economics | 37
Information Trust Institute | Information Sciences Informatics | 40
Industrial Engineering | Industrial&Enterprise Sys Eng | 43
Agricultural Production | Agricultural and Consumer Economics | 45
Operations | Operations Research |
Schl of Medicine - Dept of Obstetrics & Gynecology | Schl of Medicine - Dept of Medicine - Health Care Policy and Research | 51
Communications and Media | Inst of Communications Rsch | 62
Marine Policy | Marine Biology and Biological | 76

8 Challenge 2: Match Types
Match types, considered on top of matching by name: Split (One-to-Many) and Merger (Many-to-One)

Split (One-to-Many) example: Current Year, Prior Year
Student Type | Program 1 | Program 2 | Program 3 | Total
Full-time 134 4 11 149 154
Part-time 3 5 7
Program Code A B C

Merger (Many-to-One) example: Prior Year, Current Year
Student Type | Program 1 | Program 2 | Program 3 | Total
Full-time 15 13 30 58 64
Part-time 1 2 5
Program Code D1 D2 D3 D
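Once prior-year and current-year program codes have been linked, splits and mergers can be flagged by counting how many partners each code has. The Python sketch below is a hypothetical illustration of that bookkeeping, not the method used in the study: the function name, the link format, and the prior-year code `P` are invented for the example, while `D1`, `D2`, `D3`, `D` and `A`, `B`, `C` echo the program codes on the slide.

```python
from collections import defaultdict


def classify_links(links):
    """Label each (prior_code, current_code) link by its match type."""
    current_by_prior = defaultdict(set)
    prior_by_current = defaultdict(set)
    for prior, current in links:
        current_by_prior[prior].add(current)
        prior_by_current[current].add(prior)

    labels = {}
    for prior, current in links:
        one_to_many = len(current_by_prior[prior]) > 1   # prior unit maps to several current units
        many_to_one = len(prior_by_current[current]) > 1  # several prior units map to one current unit
        if one_to_many and many_to_one:
            labels[(prior, current)] = "many-to-many"
        elif one_to_many:
            labels[(prior, current)] = "split"
        elif many_to_one:
            labels[(prior, current)] = "merger"
        else:
            labels[(prior, current)] = "one-to-one"
    return labels


if __name__ == "__main__":
    links = [
        ("D1", "D"), ("D2", "D"), ("D3", "D"),  # three prior programs, one current program
        ("P", "A"), ("P", "B"), ("P", "C"),     # hypothetical prior program "P" split three ways
    ]
    for link, label in classify_links(links).items():
        print(link, label)
```

Flagged splits and mergers could then be reviewed alongside the enrollment totals, since a large change in the total between years may indicate that the link is not a clean split or merger.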

9 Challenges & Future Investigation
Challenges:
Performing data cleaning can be difficult.
Determining the eligibility of programs when they were merged or split.
Improving data quality by exploring different matching techniques.
Future Investigation:
What determines a good match when using a distance function?
What are the key indicators for identifying splits and mergers?
How can matching be improved by using SOUNDEX and SPEDIS?

10 RTI International
Brandon Hopkins, Statistician, RTI International
Thank you for listening. I look forward to seeing some of you later today.

