Matching Students to School Districts
Origin of the Problem
https://goo.gl/YShsf1
In December of 2016, we needed information about graduates of Detroit Public Schools (DPS) for a grant Wayne State University (WSU) was pursuing.
The first problem was figuring out which schools were Detroit Public Schools, an extremely tedious task given closures, schools becoming charter schools, and schools being taken over by the state.
The second problem was the lack of a crosswalk between CEEB codes and federal/state IDs.
The third problem was that WSU had switched high school coding systems at least three times in the last fifteen years.
Depth of the Problem
We had a list, obtained from the Admissions Office many years before, that contained 49 names and codes used to identify DPS schools. A few of the schools on it were instant red flags: they are private, religious institutions, not public schools!
Initial Solution: Research and Tedious Work
A great deal of time was devoted to figuring out which schools were actually part of the Detroit Public School System.
A year-by-year list of DPS high schools was created using archived DPS brochures and extensive use of archive.org.
Schools that entered or exited DPS had those transitions, and the dates of the transitions, noted.
All other high schools physically located in the City of Detroit were also added to the list.
Schools were matched to names in our database through SQL string matching and manual comparison of lists.
Completing the List
The initial process took about a month to complete, but there were still gaps in information. While there was confidence about DPS schools, there was not a lot of information about charter schools.
The solution to this problem was the public datasets available through CEPI at https://cepi.state.mi.us/eem/PublicDatasets.aspx
These helped to build a comprehensive list of all the schools physically located within the City of Detroit, even those that were no longer open.
Initial Results
Schools in Detroit were categorized in five ways:
Detroit Catholic Schools
Detroit Charter Schools
Detroit Religious / Private Schools
EAA (Education Achievement Authority) Schools
Detroit Public Schools
For schools that were not consistently part of DPS, students were included or excluded by their high school graduation date.
The code to categorize high schools was written in SQL.
Schools were included even if they didn’t exist in our census files.
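The date-based rule above can be sketched in a few lines. The original code was written in SQL and is not reproduced in these slides, so the following is only a minimal Python illustration of the idea. The school codes, the membership dates, and the function name are all hypothetical.

```python
from datetime import date

# Hypothetical membership windows: school code -> list of (start, end) dates
# during which the school belonged to DPS. An end of None means "still in DPS".
DPS_MEMBERSHIP = {
    "SCH001": [(date(1990, 7, 1), None)],               # in DPS throughout
    "SCH002": [(date(1990, 7, 1), date(2012, 6, 30))],  # left DPS in 2012
}


def is_dps_graduate(school_code, graduation_date):
    """Return True if the school was part of DPS on the graduation date."""
    for start, end in DPS_MEMBERSHIP.get(school_code, []):
        if start <= graduation_date and (end is None or graduation_date <= end):
            return True
    return False
```

Under these made-up windows, a 2011 graduate of SCH002 counts as a DPS graduate and a 2013 graduate of the same school does not. A school code that is absent from the table is never counted.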
Isn’t it pretty?
Initial Results, continued
Among all high schools in our census file, 84 school codes should be flagged as being part of DPS.
Only 28 of these school codes were on the old list provided by Admissions.
The number of codes is high because multiple coding schemas were used over time.
Benefits Reaped!
Since the code was finalized, there have been multiple instances where we have been asked for information on DPS graduates with very little (if any) lead time.
The most notable request came directly from the President, who needed data on DPS students back to 2006 before the end of the day.
This was in response to a Sept 5, 2017 Detroit News article; the work done on DPS coding helped us provide him with detailed information that helped form his response to the article.
The coding schema has been used in providing data for at least five different grants.
What about everyone else in Michigan?
Given the value added by the DPS coding schema, we began working on a way to better identify all of the schools in our database.
It was clear that the methodology used to identify DPS students was simply not feasible for the rest of the state.
The next best solution? Fuzzy Matching!
Is that like fuzzy math?
Fuzzy matching is a method of finding matches between items based upon similarity rather than exact correspondence.
It is built upon the work in formal logic of the Polish philosopher Jan Łukasiewicz.
There are multiple methods of performing fuzzy matching, and it’s possible to do fuzzy matching in SAS, SQL, Python, R, etc.
Soundex: Converts words into an alphanumeric string based upon phonetic sounds. This allows for variation in both spelling and length as long as the words sound similar.
Levenshtein Distance: The number of character insertions, deletions, or substitutions required to change one string into another.
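To make the two methods concrete, here is a minimal Python sketch of each. It is not the SAS code used in the project. The Soundex below is the classic four-character American variant, so its output may not agree character for character with the SAS SOUNDEX function. The Levenshtein function uses the standard dynamic-programming definition, which is the quantity SAS COMPLEV reports.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no digit.
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word):
    """Classic American Soundex: first letter plus three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution
        prev_row = row
    return prev_row[-1]
```

With these definitions, "Stevenson" and "Stephenson" receive the same Soundex code (S315) and are two edits apart: one substitution (v to p) and one insertion (h).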
Implementation in SAS
A few major SAS functions and operators were used in this process:
PROC SQL: SOUNDEX and EQT
DATA step: SOUNDEX, COMPLEV, and COMPGED
A simple example compares two sets of names.
Implementation in SAS: Just Soundex
Implementation in SAS: Soundex and EQT
Implementation in SAS: Soundex, Complev, and Compged in a Data Step
Challenges With High Schools
At first glance, this should be an extremely simple project. It is not.
The names of high schools are often not unique: 7 Central High Schools, 3 Northern High Schools, 3 Lakeview High Schools.
There can be a great deal of variation in naming convention that poses challenges to most fuzzy matching methods, e.g. Father Gabriel Richard vs Fr. Gabriel Richard.
Use of HS rather than High School.
Phonetically similar spellings: Stevenson vs Stephenson.
Admissions can contract names in multiple ways that are not always consistent.
Applying this to Schools
We used a multi-step, iterative process to match our high school names and codes to school districts.
First, the CEPI dataset and multiple CEEB datasets are imported into SAS.
The CEPI dataset is filtered to remove irrelevant schools.
Applying this to Schools
We then match CEEB codes to a dataset from CEPI.
Because school names are often not unique, it’s necessary to match on both city name and school name.
CEEB datasets can either be appended together before matching them to CEPI data, or be matched one at a time with the resulting tables appended afterward. Either way, each row should have three items: school_name, city, and ceeb_code.
Multiple PROC SQL statements were used to make the matches.
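The project did this join with PROC SQL. As a rough illustration of why both city and school name belong in the join key, here is a small Python sketch. The row layout follows the three CEEB items named above plus an entity_code column on the CEPI side. The normalize helper, the function name, and every code value in the sample rows are made up for the example, and the sketch uses exact matching on normalized text rather than the fuzzy operators.

```python
def normalize(text):
    """Upper-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.upper().split())


def match_ceeb_to_cepi(ceeb_rows, cepi_rows):
    """Join CEEB rows to CEPI rows on the (city, school_name) pair."""
    cepi_index = {}
    for row in cepi_rows:
        key = (normalize(row["city"]), normalize(row["school_name"]))
        cepi_index.setdefault(key, []).append(row["entity_code"])

    matches = []
    for row in ceeb_rows:
        key = (normalize(row["city"]), normalize(row["school_name"]))
        for entity_code in cepi_index.get(key, []):
            matches.append({
                "ceeb_code": row["ceeb_code"],
                "entity_code": entity_code,
                "school_name": row["school_name"],
                "city": row["city"],
            })
    return matches


# Illustrative rows only; the codes are placeholders, not real identifiers.
cepi_rows = [
    {"entity_code": "E1", "school_name": "Central High School", "city": "Detroit"},
    {"entity_code": "E2", "school_name": "Central High School", "city": "Flint"},
]
ceeb_rows = [
    {"ceeb_code": "C1", "school_name": "CENTRAL  HIGH SCHOOL", "city": "flint"},
]
matches = match_ceeb_to_cepi(ceeb_rows, cepi_rows)
```

Joining on name alone would pair the single CEEB row with both CEPI schools. Adding the city to the key leaves only the E2 match.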
Applying this to Schools
Results from all CEPI/CEEB joins were appended together and PROC SORT was used to remove duplicates.
Applying this to Schools
Results from each matching process were then appended together.
Names were standardized with UPCASE, and the resulting table was then sorted via PROC SORT with NODUP to remove duplicates.
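The append, standardize, and de-duplicate sequence can be sketched as follows. This is a Python stand-in for the UPCASE and PROC SORT NODUP steps, not the project's SAS code. The function name and the tuple layout (school_name, city, ceeb_code) are assumptions made for the example.

```python
def append_and_dedupe(*tables):
    """Stack several result tables, upper-case the text fields, and drop exact duplicate rows."""
    seen = set()
    result = []
    for table in tables:
        for school_name, city, ceeb_code in table:
            row = (school_name.upper().strip(), city.upper().strip(), ceeb_code)
            if row not in seen:
                seen.add(row)
                result.append(row)
    return sorted(result)
```

Upper-casing before de-duplicating matters because rows that differ only in letter case would otherwise survive as separate records.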
Applying this to Schools
The pared list from the previous step is then rejoined to the CEPI list based on ENTITY_CODE.
In a DATA step, the names from the CEPI and CEEB lists should be compared with COMPLEV or COMPGED.
Applying this to Schools
Next, schools from the CEPI list are sorted into three lists: clean matches, problematic matches, and blanks.
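The slides do not state the rule used to separate clean matches from problematic ones, so the sketch below is only one plausible reading: score each CEPI/CEEB name pair by Levenshtein distance (the quantity COMPLEV reports) and bucket by a cutoff. The cutoff of 2, the function names, and the treatment of a missing CEEB name as a "blank" are all assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, prev_row[j - 1] + cost))
        prev_row = row
    return prev_row[-1]


def classify(cepi_name, ceeb_name, clean_cutoff=2):
    """Bucket a name pair as 'clean', 'problematic', or 'blank' (no CEEB candidate).

    clean_cutoff is an assumed threshold, not a value taken from the project.
    """
    if not ceeb_name:
        return "blank"
    distance = levenshtein(cepi_name.upper(), ceeb_name.upper())
    return "clean" if distance <= clean_cutoff else "problematic"
```

Under this assumed cutoff, names that differ only in case are clean, while "Father Gabriel Richard" vs "Fr. Gabriel Richard" lands in the problematic list for manual review. A small cutoff is not proof of a correct match: "Stevenson" and "Stephenson" are only two edits apart, which is one reason the city is part of the earlier join.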
Applying this to Schools
While it’s possible to do further passes to increase the range of matches, at this point the clean matches can be joined to our census table.
Applying this to Schools
When the ‘high reliability’ or perfect matches were matched against WSU’s census file, we ended up matching about 67% of Michigan high schools.
Using DATA steps with COMPLEV and COMPGED can produce moderate-to-high-reliability matches that can be quickly analyzed for accuracy.
What Steps are Next?
While the matching program is largely complete, the data from further matching steps using COMPGED have yet to be analyzed. That process should be relatively simple.
Once the data have been analyzed, they will be imported back into SAS and merged with the other datasets.
One goal is to have a CEEB Code to CEEB Entity ID crosswalk.
The main aim is to have a table in our ODS system that matches our codes to the relevant details (district, school type, etc.).
For WSU, we also need to finish synchronizing our three high school coding systems, so that we have a match between C###, MI###, and CEEB.
Data Sources
The most important data source for the project is the list of schools available at:
https://cepi.state.mi.us/eem/PublicDatasets.aspx
CEEB lists were obtained from:
https://collegereadiness.collegeboard.org/k-12-school-code-search
https://admissions.vanderbilt.edu/apply/highschoolcode.php
https://www.ugadmissions.rutgers.edu/reenrollment/ceeblookup.aspx
https://ire.uncg.edu/research/NCES_CEEB_Table/
https://surds.colorado.gov/Documentation/hscodes.xls (access via Google cache)
Files are available here: https://goo.gl/YShsf1