Published by Eileen Russon. Modified over 9 years ago.
Slide 1
Application of Regular Expressions in the German Business Register
Session 5: Projects on Improvements for Business Registers
Wiesbaden Group on Business Registers, Paris, November 26th 2007
Patrizia Moedinger, Federal Statistical Office Germany, IV A2
© Federal Statistical Office Germany, IV A2
Slide 2
Example 1: Improving legal form coding by using regular expressions
Slide 3
Background
- information on legal forms comes mainly from VAT records
- not all administrative sources provide information on legal forms
- sources use different, incompatible legal form codings or different aggregation levels
- special requirements for other purposes, such as the coding of institutional sectors
Slide 4
Background
- enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name:
  - incorporated firms
  - non-incorporated firms
  - cooperatives
  - merchants that are registered in the German Commercial Register
- enterprise names can be used for legal form coding
Slide 5
Definition of search patterns
- patterns from nomenclature, abbreviations and notations (tax authorities): GmbH, AG & Co.KG, Limited, Ltd.
- patterns from real BR data: spelling mistakes, missing blanks, ...
- construction of regular expressions
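
The pattern construction described on this slide could look roughly like the following sketch. The pattern table, the labels and the function name `code_legal_form` are assumptions made for illustration; the slides do not show the expressions actually used in the German BR.

```python
import re

# Hypothetical pattern table: (legal form label, regular expression).
# Order matters: combined forms are tried before their components,
# so "AG & Co.KG" is not coded as a plain "AG".
LEGAL_FORM_PATTERNS = [
    ("AG & Co. KG", re.compile(r"\bAG\s*&\s*Co\.?\s*KG\b", re.IGNORECASE)),
    # No leading word boundary, so a missing blank ("MusterGmbH") is tolerated;
    # optional dots and blanks cover notations such as "G.m.b.H.".
    ("GmbH", re.compile(r"G\.?\s?m\.?\s?b\.?\s?H\b", re.IGNORECASE)),
    ("Limited", re.compile(r"\b(?:Limited|Ltd)\b", re.IGNORECASE)),
    # Case-sensitive on purpose: only the upper-case token "AG" counts.
    ("AG", re.compile(r"\bAG\b")),
]


def code_legal_form(enterprise_name):
    """Return the first legal form whose pattern matches, or None."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(enterprise_name):
            return label
    return None


if __name__ == "__main__":
    for name in ["Muster GmbH", "MusterGmbH", "Beispiel AG & Co.KG",
                 "Example Ltd.", "Hans Meier"]:
        print(name, "->", code_legal_form(name))
```

Keeping the patterns in an ordered list makes it easy to add notations found in real register data without touching the matching logic.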
Slide 6
Evaluation of search patterns
- completeness of coding
  - legal obligation: high level of legal forms found in enterprise names
- degree of reliability: evaluation of coding results
  - drawing a sample after legal form coding
  - classification of the coding results
  - calculation of sensitivity, specificity, positive predictive value, negative predictive value
Slide 7
Completeness of coding
Slide 8
Evaluation of Type I and II errors (N = 4,000)

                                   Enterprise name contains
Regular expression detects     legal form    no or wrong legal form
legal form                          1,009                         4
no or wrong legal form                 26                     2,961

PPV (positive predictive value) = 1,009 / (1,009 + 4) = 99.6 %
NPV (negative predictive value) = 2,961 / (2,961 + 26) = 99.1 %
Sensitivity = 1,009 / (1,009 + 26) = 97.5 %
Specificity = 2,961 / (4 + 2,961) = 99.9 %
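
The four quality measures on this slide follow directly from the cell counts of the 2x2 table. A minimal sketch of the arithmetic, using the counts reported on the slide; the function name `classification_metrics` and its argument names are assumptions for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four measures from a 2x2 table.

    tp: pattern detects a legal form and the name contains it
    fp: pattern detects a legal form but the name has none or a different one
    fn: pattern detects nothing although the name contains a legal form
    tn: pattern detects nothing and the name contains no legal form
    """
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Cell counts of the sample of N = 4,000 enterprise names on the slide.
metrics = classification_metrics(tp=1009, fp=4, fn=26, tn=2961)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```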
Slide 9
Example 2: Data pre-processing as a preliminary for record linkage
Slide 10
Background
- no common unique identifiers available
- data from different sources are initially linked by names and addresses
- different or no address standards
- different notations: “BMW” or “Bayerische Motorenwerke” or “Bay. Motorenwerke”
- German BR is technically limited in storing several addresses (only dispatch and domicile)
Slide 11
Problem of non-standardized notations
- matching by administrative identifiers
- dependent variable = match by administrative identifiers + no change in the postal code
- independent variables = differences between enterprise names, street names and town names (Levenshtein edit distance)
- same (administrative) source
- different sources (administrative source – BR)
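
The Levenshtein edit distance named on this slide counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a textbook dynamic-programming sketch, not the implementation used by the authors; the function name `levenshtein` is chosen for illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Bayerische Motorenwerke", "Bay. Motorenwerke"))
    print(levenshtein("Bayerische Motorenwerke", "BMW"))
```

Abbreviations such as “BMW” produce a large distance even though the unit is the same, which is the problem of non-standardized notations the slide describes.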
Slide 12
Matching probability against string similarity within an administrative source (Employment Agency)
(Model: Logistic regression)
Slide 13
Matching probability against string similarity between an administrative source (Employment Agency) and BR
(Model: Logistic regression)
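
The slides name logistic regression as the model but do not give its specification. One plausible form, assuming the three edit distances enter additively, is sketched below; the symbols $d_{\text{name}}$, $d_{\text{street}}$, $d_{\text{town}}$ and the coefficients $\beta_0, \dots, \beta_3$ are notation introduced here, not taken from the presentation.

```latex
P(\text{match} = 1 \mid d_{\text{name}}, d_{\text{street}}, d_{\text{town}})
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 d_{\text{name}}
      + \beta_2 d_{\text{street}} + \beta_3 d_{\text{town}})\bigr)}
```

Under this form, negative coefficients on the distances would mean that the matching probability falls as the strings become less similar.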
Slide 14
Pre-processing of administrative data for record linkage
- high level of similarity between two strings → identical units
- high level of disparity between two strings → different units
[Figure: scale of differences in name or address, from low (identical unit) to high (different unit)]
Slide 15
Pre-processing of administrative data for record linkage
- conversion into specific variables for string matching
  - input: BMW AG Branch Munich Mr Mueller
  - enterprise name: BMW
  - legal form: AG
  - other elements: Branch Munich Mr Mueller
- enterprise address
- simplify comparison strings
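
The conversion into separate variables shown on this slide can be sketched with a single regular expression that locates the legal form and splits the string around it. The pattern, the field names and the helpers `split_name` and `simplify` are assumptions for illustration; the actual pre-processing rules depend on the data source and are not given in the slides.

```python
import re

# Hypothetical, deliberately short list of legal forms. The lookahead
# requires the legal form to be followed by a blank or the end of the string.
LEGAL_FORM = re.compile(r"\b(GmbH|AG|KG|Ltd\.?|Limited)(?=\s|$)")


def split_name(raw):
    """Split a raw name field into enterprise name, legal form and the rest."""
    match = LEGAL_FORM.search(raw)
    if match is None:
        return {"enterprise_name": raw.strip(),
                "legal_form": None,
                "other_elements": ""}
    return {
        "enterprise_name": raw[:match.start()].strip(),
        "legal_form": match.group(1),
        "other_elements": raw[match.end():].strip(),
    }


def simplify(text):
    """Lower-case, replace punctuation by blanks and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


if __name__ == "__main__":
    print(split_name("BMW AG Branch Munich Mr Mueller"))
    print(simplify("Bay. Motorenwerke"))
```

Comparing only the `enterprise_name` field, after `simplify`, keeps elements such as branch or contact names from inflating the edit distance between records of the same unit.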
Slide 16
Methods for evaluation
- evaluate the link between string similarity and match before and after pre-processing the data
- evaluation of matching results (drawing a sample after the matching process)
  - classification of the matching results
  - calculation of sensitivity, specificity, positive predictive value, negative predictive value
- controlling for effects caused by the matching program used
Slide 17
Synopsis
- BR text data needs special treatment in data processing
- applications for regular expressions
  - simple application: legal form coding (limited set of search patterns)
  - more complex application: pre-processing (set of patterns depends on data source and later use)
- application of regular expressions should always be evaluated
Slide 18
Thank you for your attention.