
1 Application of Regular Expressions in the German Business Register
Session 5: Projects on Improvements for Business Registers
Wiesbaden Group on Business Registers, Paris, November 26th, 2007
Patrizia Moedinger
© Federal Statistical Office Germany, IV A2

2 Example 1: Improving legal form coding by using regular expressions
© Federal Statistical Office Germany, IV A2 – Patrizia Moedinger, Federal Statistical Office Germany, 17.04.2015

3 Background
- information on legal forms comes mainly from VAT records
- not all administrative sources provide information on legal forms
- sources use incompatible legal form codings or different aggregation levels
- special requirements for other purposes, such as the coding of institutional sectors

4 Background
- enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name:
  - incorporated firms
  - non-incorporated firms
  - cooperatives
  - merchants registered in the German Commercial Register
=> enterprise names can be used for legal form coding

5 Definition of search patterns
- patterns from nomenclature, abbreviations and notations (tax authorities): GmbH, AG & Co.KG, Limited, Ltd.
- patterns from real BR data: spelling mistakes, missing blanks, ...
=> construction of regular expressions
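The construction of search patterns can be sketched in Python as follows. This is a minimal illustration, not the patterns actually used in the German BR: the table `LEGAL_FORM_PATTERNS`, the function `code_legal_form` and the tolerated spelling variants are assumptions made for this example, covering only the four notations named on the slide.

```python
import re

# Hypothetical pattern table (legal form code, compiled pattern).
# Compound forms are listed first so that "AG & Co.KG" is not coded as "AG".
LEGAL_FORM_PATTERNS = [
    # tolerates missing blanks around "&" and between "Co." and "KG"
    ("AG & Co. KG", re.compile(r"\bAG\s*&\s*Co\.?\s*KG\b", re.IGNORECASE)),
    # no leading word boundary, so a missing blank ("MusterGmbH") is still found;
    # optional dots and single blanks cover "G.m.b.H." and "Gmb H"
    ("GmbH", re.compile(r"G\.?\s?m\.?\s?b\.?\s?H\b\.?", re.IGNORECASE)),
    ("Ltd.", re.compile(r"\b(?:Limited|Ltd)\b\.?", re.IGNORECASE)),
    # case-sensitive, so ordinary words containing "ag" are not matched
    ("AG", re.compile(r"\bAG\b")),
]


def code_legal_form(name):
    """Return the first legal form code whose pattern occurs in the name, else None."""
    for code, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(name):
            return code
    return None


for example in ["Muster GmbH", "MusterGmbH", "Muster AG & Co.KG", "Example Ltd."]:
    print(example, "->", code_legal_form(example))
```

The order of the table matters: a real pattern set would need many more forms and variants, and every relaxation (such as dropping the word boundary) trades missed legal forms against false detections, which is why the coding result has to be evaluated.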

6 Evaluation of search patterns
- completeness of coding
  - legal obligation: high share of legal forms found in enterprise names
- degree of reliability: evaluation of coding results
  - draw a sample after legal form coding
  - classify the coding results
  - calculate sensitivity, specificity, positive predictive value and negative predictive value

7 Completeness of coding

8 Evaluation of Type I and II errors

                                          Enterprise name contains
Regular expression detects     legal form     no or wrong legal form
legal form                          1,009                          4
no or wrong legal form                 26                      2,961

N = 4,000

PPV (positive predictive value) = 1,009 / (1,009 + 4) = 99.6 %
NPV (negative predictive value) = 2,961 / (2,961 + 26) = 99.1 %
Sensitivity = 1,009 / (1,009 + 26) = 97.5 %
Specificity = 2,961 / (4 + 2,961) = 99.9 %
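The four measures follow directly from the cell counts of the 2x2 table. The sketch below recomputes them from the sample counts on this slide; the function name `diagnostic_measures` and the argument names are chosen for this example only.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Compute the four measures from the cells of a 2x2 classification table.

    tp: regex detects a legal form, name contains one
    fp: regex detects a legal form, name contains none or a wrong one
    fn: regex detects none or a wrong one, name contains a legal form
    tn: regex detects none or a wrong one, name contains none or a wrong one
    """
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Cell counts from the evaluation sample (N = 4,000)
measures = diagnostic_measures(tp=1009, fp=4, fn=26, tn=2961)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```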

9 Example 2: Data pre-processing as a preliminary step for record linkage

10 Background
- no common unique identifiers available
- data from different sources are initially linked by names and addresses
- different or no address standards
- different notations: "BMW", "Bayerische Motorenwerke" or "Bay. Motorenwerke"
- the German BR can technically store only two addresses per unit (dispatch and domicile)

11 Problem of non-standardized notations
- matching by administrative identifiers
  - dependent variable = match by administrative identifiers + no change in the postal code
  - independent variable = differences between enterprise names, street names and town names (Levenshtein edit distance)
- same (administrative) source
- different sources (administrative source – BR)
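The Levenshtein edit distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the standard dynamic-programming computation is given below; the function name `levenshtein` is chosen for this example, and the presentation does not state which implementation was actually used.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + substitution_cost,   # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("Bayerische Motorenwerke", "Bay. Motorenwerke"))
```

Two notations of the same enterprise can thus lie several edits apart even though they refer to an identical unit, which is the problem the subsequent slides examine.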

12 Matching probability against string similarity within an administrative source (Employment Agency) (Model: Logistic regression)

13 Matching probability against string similarity between an administrative source (Employment Agency) and BR (Model: Logistic regression)
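A logistic regression of this kind expresses the matching probability as a logistic function of the string difference. The sketch below shows only the functional form; the coefficients `INTERCEPT` and `SLOPE` are invented for illustration and are not the estimates behind the curves on slides 12 and 13.

```python
import math


def match_probability(distance, intercept, slope):
    """Logistic model: P(match) = 1 / (1 + exp(-(intercept + slope * distance)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance)))


# Hypothetical coefficients, chosen only to show the shape of the curve:
# a negative slope means the matching probability falls as the edit distance grows.
INTERCEPT = 3.0
SLOPE = -0.5

for distance in (0, 3, 6, 12):
    print(distance, round(match_probability(distance, INTERCEPT, SLOPE), 3))
```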

14 Pre-processing of administrative data for record linkage
- high level of similarity between two strings => identical units
- high level of disparity between two strings => different units
[Figure: scale of differences in name or address, from low (identical unit) to high (different unit)]

15 Pre-processing of administrative data for record linkage
- conversion into specific variables for string matching
  [Figure: the string "BMW AG Branch Munich Mr Mueller" is split into
    enterprise name: BMW
    legal form: AG
    other elements: Branch Munich Mr Mueller
   alongside the enterprise address]
- simplify comparison strings
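The conversion of "BMW AG Branch Munich Mr Mueller" into separate variables can be sketched as below. The pattern `LEGAL_FORM`, the functions `split_name` and `simplify`, and the rule "everything before the legal form is the enterprise name, everything after it is other elements" are assumptions for this example; the slide does not specify the actual pre-processing rules.

```python
import re

# Hypothetical, deliberately short list of legal forms; the legal form must be
# followed by whitespace or the end of the string.
LEGAL_FORM = re.compile(r"\b(GmbH|AG|KG|Ltd\.?|Limited)(?=\s|$)")


def split_name(raw):
    """Split a raw name field into enterprise name, legal form and other elements."""
    match = LEGAL_FORM.search(raw)
    if match is None:
        return {"enterprise_name": raw.strip(), "legal_form": "", "other_elements": ""}
    return {
        "enterprise_name": raw[:match.start()].strip(),
        "legal_form": match.group(1),
        "other_elements": raw[match.end():].strip(),
    }


def simplify(text):
    """Build a comparison string: upper case, punctuation removed, blanks collapsed."""
    text = re.sub(r"[^\w\s]", " ", text.upper())
    return " ".join(text.split())


parts = split_name("BMW AG Branch Munich Mr Mueller")
print(parts)
print(simplify("Bay. Motorenwerke"))
```

Comparing only the `enterprise_name` parts keeps contact persons and branch descriptions from inflating the string distance between records of the same unit.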

16 Methods for evaluation
- evaluate the link between string similarity and match before and after pre-processing the data
- evaluation of matching results
  - (draw a sample after the matching process)
  - classify the matching results
  - calculate sensitivity, specificity, positive predictive value and negative predictive value
  - control for effects caused by the matching program used

17 Synopsis
- BR text data need special treatment in data processing
- applications for regular expressions
  - simple application: legal form coding (limited set of search patterns)
  - more complex application: pre-processing (set of patterns depends on data source and later use)
- the application of regular expressions should always be evaluated

18 Thank you for your attention.

