HOW TO KILL INVENTORS: TESTING THE MASSACRATOR © 2.0 ALGORITHM FOR INVENTOR IDENTIFICATION
Francesco Lissoni (Francesco.Lissoni@unibocconi.it)
Michele Pezzoni (Michele.Pezzoni@unibocconi.it)
DIMI - Università di Brescia
KITES - Università Bocconi, Milano
Identifying inventors with an algorithm: what does it mean in practice?
Identifying inventors within a patent database consists in assigning unique codes to inventors listed on different patents who are believed to be the same person, insofar as they are homonyms or quasi-homonyms and possibly share similar characteristics.
To identify inventors we use an algorithm that follows three steps (as described by Raffo and Lhuillery, 2009):
1. Cleaning & Parsing
2. Matching
3. Filtering
Databases & Benchmarks (produced by APE-INV)
- PatStat-Kites database, which contains patent applications filed at the EPO since 1978. PatStat-Kites results from cleaning and parsing the original PatStat data by means of the Massacrator © 1.0 algorithm (by Gianluca Tarasconi).
- French Academic Benchmark
- EPFL Benchmark [Federal Polytechnic of Lausanne, Switzerland]
The French and EPFL benchmarks are used to test and calibrate the Massacrator © algorithm.
Massacrator © 2.0: Cleaning & Parsing
1. Cleaning: punctuation characters are removed and text strings are converted to ASCII.
2. Parsing: separate fields for the inventor's name, address, city, region and state are created. The inventor's name includes the surname; second, third or fourth names; and suffixes such as "junior", "senior", "III". Personal titles ("Prof.", "Professor") are discarded.
3. Further parsing: a separate field is created for each token (word) in the inventor's name, as resulting from step 2.
Massacrator © then proceeds to matching based on the name tokens produced in step 3; a sketch of these steps follows.
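A minimal sketch of these three steps in Python. The title and suffix lists, function names and output layout are illustrative assumptions, not the actual Massacrator © code:

```python
import re
import unicodedata

TITLES = {"PROF", "PROFESSOR", "DR", "ING"}           # assumed title list
SUFFIXES = {"JUNIOR", "SENIOR", "JR", "SR", "II", "III"}  # assumed suffix list

def clean(text: str) -> str:
    """Step 1: convert to ASCII and strip punctuation."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^\w\s]", " ", ascii_text).upper()

def parse_name(raw_name: str) -> dict:
    """Steps 2-3: split the inventor's name into tokens, dropping titles
    and separating suffixes from the name tokens used for matching."""
    tokens = [t for t in clean(raw_name).split() if t not in TITLES]
    suffixes = [t for t in tokens if t in SUFFIXES]
    name_tokens = [t for t in tokens if t not in SUFFIXES]
    return {"tokens": name_tokens, "suffixes": suffixes}

print(parse_name("Prof. Pezzoni, Michele Jr."))
# {'tokens': ['PEZZONI', 'MICHELE'], 'suffixes': ['JR']}
```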
Massacrator © 2.0: Matching
1. List and sort all the tokens from the inventors' names in alphabetical order.
2. Compute the 2-GRAM distance between subsequent tokens (that is, the token in row n and the token in row n+1).
3. Define groups of tokens as follows:
- starting from the top of the sorted list, assign the token in row 1 to group 1;
- move to the token in row 2: if its 2-GRAM distance from the token in row 1 is less than or equal to a pre-determined value, assign it to group 1; otherwise create a separate group, in this case group 2;
- ... and so on for rows n and n+1. A sketch of this grouping pass follows.
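A sketch of the grouping pass. The exact 2-GRAM metric (and hence the 0.15 threshold used on the next slide) is that of the paper; here we assume a Dice-style bigram dissimilarity for illustration, so the threshold value differs:

```python
def bigrams(token: str) -> set:
    """Character 2-grams of a token."""
    return {token[i:i + 2] for i in range(len(token) - 1)}

def two_gram_distance(a: str, b: str) -> float:
    """Assumed metric: Dice dissimilarity over bigram sets (the paper's
    exact 2-GRAM distance may be normalized differently)."""
    ga, gb = bigrams(a), bigrams(b)
    return 1.0 - 2.0 * len(ga & gb) / (len(ga) + len(gb))

def group_tokens(tokens: list, threshold: float) -> dict:
    """Sort tokens alphabetically; open a new group whenever the distance
    between consecutive tokens exceeds the threshold (groups start at 0)."""
    ordered = sorted(set(tokens))
    groups, current = {}, 0
    for i, tok in enumerate(ordered):
        if i > 0 and two_gram_distance(ordered[i - 1], tok) > threshold:
            current += 1          # distance too large: start a new group
        groups[tok] = current
    return groups

print(group_tokens(["PEZZONI", "PEZZOPANE", "PEZZOTTA", "PEZZOTTI"], 0.35))
# {'PEZZONI': 0, 'PEZZOPANE': 1, 'PEZZOTTA': 2, 'PEZZOTTI': 2}
```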
Massacrator © 2.0: Matching
Example:
- the tokens PEZZONI and PEZZOPANE belong to different groups (2, 3)
- PEZZOTI, PEZZOTTA and PEZZOTTI belong to the same group (4)

STRING       2G score   Group (thr. 0.15)   Freq. in PatStat
PEZZOLA      -          1                   3
PEZZOLATO    0.10       1                   1
PEZZOLI      0.14       1                   20
PEZZOLO      0.12       1                   1
PEZZONI      0.17       2                   6
PEZZOPANE    0.17       3                   1
PEZZOTI      0.17       4                   1
PEZZOTTA     0.13       4                   3
PEZZOTTI     0.10       4                   5
PEZZULLI     0.20       5                   1
Massacrator © 2.0: Matching
We compute the number of tokens for each inventor, say n_1 and n_2, and take the minimum min(n_1, n_2).
Example: KNIGHT DAVID JOHN (n_1 = 3) and KNIGHT JOHN (n_2 = 2).
We then match all pairs of inventors who have min(n_1, n_2) tokens belonging to the same groups, whatever the order of the tokens (see the sketch below).
This approach results in about 10 million matched inventor pairs; depending on the number of token groups, the number of matches can grow very fast.
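A sketch of the order-free matching rule, assuming each name token has already been mapped to its group id (the ids below are made up):

```python
def same_person_candidate(groups_i: list, groups_j: list) -> bool:
    """Order-free matching rule: every token group of the shorter name
    must appear among the token groups of the longer one."""
    shorter, longer = sorted((groups_i, groups_j), key=len)
    remaining = list(longer)
    for g in shorter:                 # multiset containment check
        if g not in remaining:
            return False
        remaining.remove(g)
    return True

# KNIGHT DAVID JOHN vs KNIGHT JOHN: min(n1, n2) = 2 tokens must agree.
# Hypothetical group ids: KNIGHT -> 7, DAVID -> 2, JOHN -> 5
print(same_person_candidate([7, 2, 5], [7, 5]))   # True
```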
Massacrator © 2.0: Filtering
For any pair m of matched inventors i and j, we consider the following criteria in order to compute the similarity scores:
1. Network criteria
a. Common coinventor [Coinventor]
b. Three degrees of separation [Three.Degrees]
c. ASE
2. Geographical criteria
a. [City]
b. [Province]
c. [Region]
d. [State]
e. Street name and civic number [Street]
3. Applicant-related criteria
a. [Applicant]
b. Applicant with fewer than 50 inventors [Small.Applicant]
4. IPC class criteria
a. 4 digits in common [IPC.4]
b. 6 digits in common [IPC.6]
c. 12 digits in common [IPC.12]
5. Other criteria
a. Priority dates differing by less than 5 years [Five.Years]
b. Citations [Citation]
c. Rare surname [Rare.Surname]
NB: each criterion is represented by a dummy variable. E.g. the common-coinventor dummy equals 1 if the matched inventors share at least one coinventor (see the sketch below).
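A sketch of how a few of these dummies could be computed for a matched pair. The record fields (coinventors, city, applicants, ipc, prio_year) are hypothetical names, not actual PatStat columns:

```python
def criteria_dummies(inv_i: dict, inv_j: dict) -> dict:
    """A subset of the filtering dummies for one matched pair."""
    return {
        "Coinventor": int(bool(inv_i["coinventors"] & inv_j["coinventors"])),
        "City":       int(inv_i["city"] == inv_j["city"]),
        "Applicant":  int(bool(inv_i["applicants"] & inv_j["applicants"])),
        "IPC.4":      int(bool({c[:4] for c in inv_i["ipc"]} &
                               {c[:4] for c in inv_j["ipc"]})),
        "Five.Years": int(abs(inv_i["prio_year"] - inv_j["prio_year"]) < 5),
    }

a = {"coinventors": {"ROSSI"}, "city": "BRESCIA", "applicants": {"X"},
     "ipc": {"H01L21/00"}, "prio_year": 1999}
b = {"coinventors": {"ROSSI"}, "city": "MILANO", "applicants": {"Y"},
     "ipc": {"H01L33/02"}, "prio_year": 2002}
print(criteria_dummies(a, b))
# {'Coinventor': 1, 'City': 0, 'Applicant': 0, 'IPC.4': 1, 'Five.Years': 1}
```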
ASE (by Hsini Huang, Li Tang, John Walsh)
Testing methodology
What do we want to test?
- What is the impact of the criteria on Precision and Recall?
Setting the algorithm:
- Do we have to consider all the criteria, or is it better to select a subset?
- Which criteria are the most appropriate to maximize Precision? And Recall?
Testing methodology: Measures of Performance
Massacrator output: [inventor i, patent p_i, inventor j, patent p_j, D_α,m], where D_α,m (referring to the pair m comparing inventors i and j) is a binary variable that takes value 1 if matched inventors i and j are believed to be the same person (positive match) and 0 otherwise (negative match).
Benchmark output: [inventor i, patent p_i, inventor j, patent p_j, D_γ,m]
False/true positives/negatives are calculated by comparing Massacrator © 's results (D_α,m) to the information in the benchmark databases (D_γ,m).
Testing methodology: How do we get D_α,m? 5 steps:
1. Matching: inventors from the PatStat-Kites database are matched one to another. A set of dummy variables x_k is associated to each pair of inventors (a, b, c in the example below), where each variable (x_city, x_IPC.4, ...) corresponds to one of the filtering criteria.
2. Simulation_1: we randomly draw W (= 3 in the example) weight vectors ω_w from a multivariate Bernoulli distribution (success probability 0.5) in K dimensions. [ω_city,w = 1 (0) means that the criterion is (is not) selected.]
3. For each matched pair (a, b, c) we compute a similarity score α_m,w, i.e. the number of selected criteria the pair has in common:

X [3x2]        city   IPC.4
a               1      1
b               1      0
c               0      1

Ω [2x3]        ω1     ω2     ω3
city            0      1      1
IPC.4           1      1      0

A = X × Ω [3x3]  ω1     ω2     ω3
a                 1      2      1
b                 0      1      1
c                 1      1      0
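The same toy example in code: with NumPy, the whole matrix of similarity scores is a single matrix product:

```python
import numpy as np

# X: matched pairs (a, b, c) x criteria dummies (city, IPC.4)
X = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Omega: criteria x runs; column w is the Bernoulli(0.5) draw for run w
Omega = np.array([[0, 1, 1],    # omega_city
                  [1, 1, 0]])   # omega_IPC.4

A = X @ Omega                   # similarity scores alpha_{m,w}
print(A)
# [[1 2 1]
#  [0 1 1]
#  [1 1 0]]
```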
Testing methodology: Exercises
4. Simulation_2: in each simulation run w we set a threshold value for the similarity score α_m,w, above which the two inventors in match m are considered the same person. The threshold is drawn randomly from a uniform distribution U(0, 4).
5. For each run ω_w, we identify the positive and negative matches and compare them with the information contained in the benchmark databases -> true/false positives/negatives -> Precision/Recall. A sketch of one batch of simulation runs follows.
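A sketch putting steps 2-5 together (toy data; the benchmark labels are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(X, truth, n_runs=1000):
    """One Precision/Recall point per run: draw Bernoulli(0.5) criteria
    weights and a U(0, 4) threshold, classify each pair (D_alpha), and
    score against the benchmark labels `truth` (D_gamma, boolean).
    X is the pairs x criteria dummy matrix."""
    points = []
    for _ in range(n_runs):
        omega = rng.integers(0, 2, size=X.shape[1])  # criteria selection
        theta = rng.uniform(0.0, 4.0)                # similarity threshold
        pred = (X @ omega) > theta                   # positive matches
        tp = int(np.sum(pred & truth))
        precision = tp / max(int(pred.sum()), 1)
        recall = tp / max(int(truth.sum()), 1)
        points.append((precision, recall))
    return points

X = np.array([[1, 1], [1, 0], [0, 1]])    # toy pairs x criteria
truth = np.array([True, False, True])      # hypothetical D_gamma labels
print(simulate(X, truth, n_runs=3))
```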
Precision and Recall vs. threshold
[Figure: each point corresponds to a value of Precision and Recall obtained with a specific set of weights ω_w]
Precision and Recall vs. organization
[Figure: each point corresponds to a value of Precision and Recall obtained with a specific set of weights ω_w]
Regression (1/2)
Precision = β_0 + βΩ + ε
Recall = β_0 + βΩ + ε
That is, we regress Precision and Recall on the matrix of weights Ω: each simulation run ω_w is an observation, and each criterion's weight is a regressor. The weights are independent and identically distributed by definition.

Ω [2x3]         Obs.: ω1   ω2   ω3
Var.: city            0    1    1
Var.: IPC.4           1    1    0
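A sketch of this regression on synthetic data, using plain least squares; the slopes recover each criterion's marginal effect on Precision. The effect sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Omega transposed: one row per simulation run, one column per criterion
runs = rng.integers(0, 2, size=(1000, 2)).astype(float)

# Synthetic outcome: criterion 1 helps Precision, criterion 2 hurts it
precision = (0.6 + 0.1 * runs[:, 0] - 0.05 * runs[:, 1]
             + rng.normal(0, 0.02, size=1000))

design = np.column_stack([np.ones(len(precision)), runs])
beta, *_ = np.linalg.lstsq(design, precision, rcond=None)
print(beta)   # approx [0.6, 0.1, -0.05]: intercept and marginal effects
```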
Regression (2/2)
All criteria show a trade-off between Precision and Recall, except COINVENTOR, which always increases both Precision and Recall.
The interaction with the EPFL dummy measures the different impact of each variable in the two benchmarks (French academics vs. EPFL scientists).
Setting the algorithm: finding dominant solutions (the frontier)
[Figure: the frontier of dominant solutions - Balanced: 9 obs.; High precision: 7 obs.; High recall: 5 obs.]
We test for the over-(under-)representation of each criterion among the dominant solutions
Each weight (ω_city, ω_IPC.4, ...) is a random variable with a Bernoulli distribution (p = 0.5), so we expect an average value of 0.5 (avg[ω_city] = 0.5 = p).
We test for the over/under-representation of each criterion among the subsets of dominant solutions (i.e. balanced, high precision, high recall, and all dominant solutions). That means testing whether a criterion is selected (ω_city,w = 1) more (or less) frequently within a subset of solutions, as in the sketch below.
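A sketch of this representation test as an exact two-sided binomial test of H_0: p = 0.5 (the counts are made up):

```python
from scipy.stats import binomtest

def representation_pvalue(times_selected: int, n_solutions: int) -> float:
    """Two-sided exact test of H0: selection probability p = 0.5."""
    return binomtest(times_selected, n_solutions, p=0.5).pvalue

# E.g. a criterion selected in 8 of the 9 "balanced" dominant solutions:
print(representation_pvalue(8, 9))   # ~0.04 -> over-represented
```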
Average p-values [H_0: p = 0.5 vs. H_a: p ≠ 0.5]
[Table of average p-values by criterion and subset of solutions]
Conclusion
To get the balanced result, according to the benchmarks, we include the following criteria in the filtering step: IPC.4, Citation, City, Street, IPC.12, Applicant, Small.Applicant, Coinventor, Three.Degrees, ASE... and set a minimum threshold of 2.54 (i.e. at least 3 criteria in common).
We computed two alternative versions of the cleaned data: one that maximises Precision [High Precision] and another that maximises Recall [High Recall].
Results
Balanced: 2,806,516 inventors -> 2,197,767 inventors (-22%)
High Precision: 2,806,516 inventors -> 2,481,582 inventors (-12%)
High Recall: 2,806,516 inventors -> 2,032,701 inventors (-28%)