Matching Students to School Districts


Origin of the Problem
https://goo.gl/YShsf1
- In December 2016, we needed information about graduates of Detroit Public Schools (DPS) for a grant that Wayne State University (WSU) was pursuing.
- The first problem was figuring out which schools were Detroit Public Schools, an extremely tedious task given closures, conversions to charter schools, and state takeovers.
- The second problem was the lack of a crosswalk between CEEB codes and federal and state IDs.
- The third problem was that WSU had switched high school coding systems at least three times in the last fifteen years.

Depth of the Problem
- We had a list, obtained from the Admissions Office many years earlier, that contained 49 names and codes used to identify DPS schools.
- A few schools on that list were instant red flags: they are all private, religious institutions, not public schools!

Initial Solution: Research and Tedious Work
- A great deal of time was devoted to figuring out which schools were actually part of the Detroit Public Schools system.
- A year-by-year list of DPS high schools was created from archived DPS brochures and extensive use of archive.org.
- For schools that entered or exited DPS, those transitions and their dates were noted.
- All other high schools physically located in the City of Detroit were also added to the list.
- Schools were matched to names in our database through SQL string matching and manual comparison of lists.

Completing the List
- The initial process took about a month to complete, but there were still gaps in the information.
- While we were confident about the DPS schools, we had little information about charter schools.
- The solution to this problem was the public datasets available through CEPI at https://cepi.state.mi.us/eem/PublicDatasets.aspx
- These helped to build a comprehensive list of all the schools physically located within the City of Detroit, even those that are no longer open.

Initial Results
- Schools in Detroit were categorized in five ways:
  - Detroit Catholic Schools
  - Detroit Charter Schools
  - Detroit Religious / Private Schools
  - EAA (Education Achievement Authority) Schools
  - Detroit Public Schools
- For schools that were not consistently part of DPS, students were included or excluded based on their high school graduation date.
- The code to categorize high schools was written in SQL.
- Schools were included even if they did not appear in our census files.
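The date-based rule above can be sketched as follows. This is a minimal Python illustration, not the authors' SQL: the school codes, the membership dates, and the names `DPS_MEMBERSHIP` and `is_dps_graduate` are all hypothetical.

```python
from datetime import date

# Hypothetical membership windows: school code -> list of (start, end) dates
# during which the school was part of DPS. end=None means "still a member".
DPS_MEMBERSHIP = {
    "C001": [(date(2000, 7, 1), None)],
    "C002": [(date(2000, 7, 1), date(2012, 6, 30))],
}

def is_dps_graduate(school_code, graduation):
    """Return True if the graduation date falls inside a DPS membership window."""
    for start, end in DPS_MEMBERSHIP.get(school_code, []):
        if start <= graduation and (end is None or graduation <= end):
            return True
    return False
```

A student from the made-up school `C002` would count as a DPS graduate for a 2011 graduation date but not for a 2013 one, because the school's window closed in 2012.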

Isn’t it pretty? [Slide image not included in the transcript.]

Initial Results, continued
- Among all high schools in our census file, 84 school codes should be flagged as part of DPS.
- Only 28 of these school codes were on the old list provided by Admissions.
- The high number of codes is caused by the use of multiple coding schemas over time.

Benefits Reaped!
- Since the code was finalized, we have been asked multiple times for information on DPS graduates with very little (if any) lead time.
- The most notable request came directly from the President, who needed data on DPS students going back to 2006 before the end of the day.
- This was in response to a September 5, 2017 Detroit News article; the work done on DPS coding let us provide him with detailed information that helped form his response to the article.
- The coding schema has been used to provide data for at least five different grants.

What about everyone else in Michigan?
- Given the value added by the DPS coding schema, we began working on a way to better identify all of the schools in our database.
- It was clear that the methodology used to identify DPS students was simply not feasible for the rest of the state.
- The next best solution? Fuzzy matching!

Is that like fuzzy math?
- Fuzzy matching is a method of finding matches between items based on similarity rather than exact correspondence.
- It builds on the work in formal logic of the Polish philosopher Jan Łukasiewicz.
- There are multiple methods of fuzzy matching, and it can be done in SAS, SQL, Python, R, etc.
- Soundex: converts a word into an alphanumeric code based on how it sounds; this tolerates variation in both spelling and length as long as the words sound similar.
- Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions required to change one string into another.
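To make the Levenshtein definition concrete, here is a minimal Python sketch of the standard dynamic-programming calculation. It is an illustration only; the function name `levenshtein` is an assumption, and SAS's COMPLEV may differ in details such as the handling of trailing blanks or its optional arguments.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

print(levenshtein("STEVENSON", "STEPHENSON"))  # 2: substitute V with P, insert H
```

A small distance relative to the length of the names suggests a likely match, while exact matches score 0.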

Implementation in SAS
- A few major SAS functions and operators were used in this process:
  - PROC SQL: SOUNDEX and EQT
  - DATA step: SOUNDEX, COMPLEV, and COMPGED
- A simple example comparing two sets of names: [slide image not included in the transcript]

Implementation in SAS
Just SOUNDEX: [slide image not included in the transcript]
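Since the SAS example from this slide is not preserved, the following Python sketch shows the idea behind Soundex matching. It implements the classic four-character American Soundex; SAS's SOUNDEX function may differ in details (for example, in the length of the code it returns), so treat this as an illustration rather than a reproduction of the SAS results.

```python
# Letter groups of the classic American Soundex scheme.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the four-character Soundex code for a word ("" if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts again
    return (code + "000")[:4]  # pad with zeros or truncate to four characters

print(soundex("Stevenson"), soundex("Stephenson"))  # S315 S315
print(soundex("Father"), soundex("Fr."))            # F360 F600
```

The two spellings of Stevenson share a code and would match, while an abbreviation such as "Fr." for "Father" does not, which is why phonetic matching alone is not enough for abbreviated names.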

Implementation in SAS
SOUNDEX and EQT: [slide image not included in the transcript]
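In PROC SQL, EQT is a truncated string comparison: the longer operand is shortened to the length of the shorter one before the two are tested for equality. The Python sketch below mimics that behavior under the assumption that trailing blanks are ignored; the function name `eqt` is hypothetical and the slide's actual query is not preserved.

```python
def eqt(left, right):
    """Truncated equality: compare only as many characters as the shorter string has."""
    left, right = left.rstrip(), right.rstrip()  # assumption: trailing blanks are ignored
    n = min(len(left), len(right))
    return left[:n] == right[:n]

print(eqt("CENTRAL", "CENTRAL HIGH SCHOOL"))     # True: the shorter name is a prefix
print(eqt("CENTRAL HS", "CENTRAL HIGH SCHOOL"))  # False: "HS" differs from "HI"
```

A prefix test like this catches contracted names, but it treats every name beginning with the same characters as equal, so it is best combined with another condition.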

Implementation in SAS
SOUNDEX, COMPLEV, and COMPGED in a DATA step: [slide image not included in the transcript]

Challenges With High Schools
- At first glance, this should be an extremely simple project.
- The names of high schools are often not unique:
  - 7 Central High Schools
  - 3 Northern High Schools
  - 3 Lakeview High Schools
- Naming conventions vary a great deal, which poses challenges for most fuzzy matching methods, e.g.:
  - Father Gabriel Richard vs. Fr. Gabriel Richard
  - Use of HS rather than High School
  - Phonetically similar spellings: Stevenson vs. Stephenson
- Admissions can contract names in multiple ways that are not always consistent.

Applying this to Schools
- We used a multi-step, iterative process to match our high school names and codes to school districts.
- First, the CEPI dataset and multiple CEEB datasets are imported into SAS.
- The CEPI dataset is filtered to remove irrelevant schools.

Applying this to Schools
- We then match CEEB codes to the CEPI dataset.
- It is necessary to match on both city name and school name.
- The CEEB datasets can either be appended together before they are matched to the CEPI data, or matched one at a time with the resulting tables appended afterward. Either way, each row should have three items: school_name, city, and ceeb_code.
- Multiple PROC SQL statements were used to make the matches.
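The need to match on both city and school name can be illustrated with a small Python sketch. It uses an exact match on a normalized (city, name) key, which is a simplification of the multiple PROC SQL statements the authors describe; all rows, codes, and function names here are made up for illustration.

```python
def normalize(text):
    """Trim and upper-case a value so case and stray spaces do not block a match."""
    return text.strip().upper()

def match_on_city_and_name(ceeb_rows, cepi_rows):
    """Pair each CEEB row with the CEPI rows that share its (city, school name) key."""
    cepi_index = {}
    for cepi in cepi_rows:
        key = (normalize(cepi["city"]), normalize(cepi["school_name"]))
        cepi_index.setdefault(key, []).append(cepi)
    matches = []
    for ceeb in ceeb_rows:
        key = (normalize(ceeb["city"]), normalize(ceeb["school_name"]))
        for cepi in cepi_index.get(key, []):
            matches.append({"ceeb_code": ceeb["ceeb_code"],
                            "entity_code": cepi["entity_code"],
                            "school_name": cepi["school_name"],
                            "city": cepi["city"]})
    return matches

# Made-up rows: two schools share a name, so only the city tells them apart.
cepi_rows = [
    {"entity_code": "E0001", "school_name": "Central High School", "city": "Anytown"},
    {"entity_code": "E0002", "school_name": "Central High School", "city": "Othertown"},
]
ceeb_rows = [
    {"ceeb_code": "999001", "school_name": "central high school", "city": "Othertown"},
]
matches = match_on_city_and_name(ceeb_rows, cepi_rows)
print(matches)
```

Matching on school name alone would pair the CEEB row with both CEPI entities; adding the city narrows it to the single correct one.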

Applying this to Schools
- Results from all CEPI/CEEB joins were appended together, and PROC SORT was used to remove duplicates.

Applying this to Schools
- Results from each matching process were then appended together.
- After UPCASE was used to standardize names, the resulting table was sorted via PROC SORT with NODUP to remove duplicates.
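The standardize-then-deduplicate step can be sketched in Python as follows. This mirrors the idea of UPCASE followed by a duplicate-removing sort, not the authors' exact SAS code; the function name and the sample rows are hypothetical.

```python
def standardize_and_dedupe(rows):
    """Upper-case names and cities, then drop rows that are identical afterward."""
    seen = set()
    unique = []
    for school_name, city, ceeb_code in rows:
        record = (school_name.strip().upper(), city.strip().upper(), ceeb_code)
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return sorted(unique)

# Made-up rows: the first two differ only in letter case.
rows = [
    ("Central High School", "Anytown", "999001"),
    ("CENTRAL HIGH SCHOOL", "anytown", "999001"),
    ("Northern High School", "Anytown", "999002"),
]
print(standardize_and_dedupe(rows))
```

Without the upper-casing step, the first two rows would survive as separate records even though they describe the same school.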

Applying this to Schools
- The pared list from the previous step is then rejoined to the CEPI list on ENTITY_CODE.
- In a DATA step, the names from the CEPI and CEEB lists are compared with COMPLEV or COMPGED.

Applying this to Schools
- Next, schools from the CEPI list are sorted into three lists: clean matches, problematic matches, and blanks.
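One way to picture the three-way sort is the Python sketch below. The source does not say what threshold separates a clean match from a problematic one, so the `clean_max` cutoff, the row layout, and the sample values are all assumptions made for illustration.

```python
def sort_matches(rows, clean_max=0):
    """Split rows into clean matches, problematic matches, and blanks.

    Each row is a dict with a CEPI "entity_code", a "ceeb_code" (None when no
    CEEB candidate was found), and an edit "distance" between the two names.
    """
    clean, problematic, blanks = [], [], []
    for row in rows:
        if row["ceeb_code"] is None:
            blanks.append(row)        # no CEEB candidate at all
        elif row["distance"] <= clean_max:
            clean.append(row)         # names agree within the assumed cutoff
        else:
            problematic.append(row)   # a candidate exists but needs review
    return clean, problematic, blanks

# Made-up rows for illustration.
rows = [
    {"entity_code": "E0001", "ceeb_code": "999001", "distance": 0},
    {"entity_code": "E0002", "ceeb_code": "999002", "distance": 7},
    {"entity_code": "E0003", "ceeb_code": None, "distance": None},
]
clean, problematic, blanks = sort_matches(rows)
print(len(clean), len(problematic), len(blanks))  # 1 1 1
```

Raising `clean_max` trades manual review effort for a higher risk of accepting wrong matches.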

Applying this to Schools
- While further passes could widen the range of matches, at this point the clean matches can be joined to our census table.

Applying this to Schools
- When the ‘high reliability’ or perfect matches were matched against WSU’s census file, we ended up matching about 67% of Michigan high schools.
- DATA steps with COMPLEV and COMPGED can produce moderate- to high-reliability matches that can be quickly checked for accuracy.

What Steps are Next?
- While the matching program is largely complete, the data from further matching steps using COMPGED have yet to be analyzed; that process should be relatively simple.
- Once analyzed, the data will be imported back into SAS and merged with the other datasets.
- One goal is to have a CEEB Code to CEEB Entity ID crosswalk.
- The main aim is to have a table in our ODS system that matches our codes to the relevant details (district, school type, etc.).
- For WSU, we also need to finish synchronizing our three high school coding systems so that we have a match between C###, MI###, and CEEB.

Data Sources
- The most important data source for the project is the list of schools available at: https://cepi.state.mi.us/eem/PublicDatasets.aspx
- CEEB lists were obtained from:
  - https://collegereadiness.collegeboard.org/k-12-school-code-search
  - https://admissions.vanderbilt.edu/apply/highschoolcode.php
  - https://www.ugadmissions.rutgers.edu/reenrollment/ceeblookup.aspx
  - https://ire.uncg.edu/research/NCES_CEEB_Table/
  - https://surds.colorado.gov/Documentation/hscodes.xls (access via Google cache)
- Files are available here: https://goo.gl/YShsf1