1
Creating patent indicators with multiple information sources
Grid Thoma
IPTS-Patent Data Meeting, May 14-15, Seville
IPTS, Seville, Grid Thoma, May 14-15, 2009
2
Questions?
Where do innovators (and innovations) come from?
– Location, industry, cohort, size, …
How to correctly count patents and citations at the applicant's portfolio level?
– EPO, USPTO, WIPO, equivalents, family, triadic and biadic family, etc.
– Citations vs self-citations
3
The problem
4
Goals
A methodological contribution to harmonizing data and information at the firm level using multiple sources: business directories, patent statistics, etc.
Generate harmonized patent and citation portfolios across patent offices
5
Paper
G. Thoma, S. Torrisi, A. Gambardella, D. Guellec, B. H. Hall, D. Harhoff (2008), “Methods and software for the harmonization and integration of datasets: A test based on IP-related data and accounting databases with a large panel of companies at the worldwide level,” first presented on September 3-4, 2008.
6
Agenda
Literature background
Software creation
Dataset creation
Preliminary results
8
Insights from other scientific contexts
Cross-pollination from bioinformatics
– Biology has increasingly become a science relying on the analysis of large amounts of information
Named Entity Recognition (NER)
– Dictionary-based
– Rule-based
9
Dictionary-based approach
Large collections of names, serving as examples for a specific entity class
Exact matching of dictionary entries, OR …
“fuzzify” the dictionary by automatically generating typical spelling variants for every entry
The problem of recall rate
10
Problem setting
Every known variation of a given applicant name
Harmonized to one agreed standard name
11
Examples of patenter name dictionaries
USPTO & EPO standard assignee names file
Derwent Patentee Codes
Building one's own dictionary with a harmonization procedure (Magerman, Van Looy and Song, 2006)
12
Name standardization by MVS (2006)
1. Character cleaning
2. Punctuation cleaning
3. Legal form indication treatment
4. Spelling variation harmonization
5. Umlaut harmonization
6. Common company name removal
7. Creation of a unified list of patenters
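A few of these steps can be sketched in Python. This is only an illustration under stated assumptions, not the MVS procedure itself: the legal-form list and the umlaut mapping are hypothetical, and steps 4 (spelling variation) and 6 (common company name removal) are left out because the slide does not say how they are defined.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical legal-form tokens; the slide does not enumerate them.
LEGAL_FORMS = {"GMBH", "AG", "KG", "CO", "INC", "LTD", "CORP", "LLC", "SA", "SPA", "BV", "NV"}
# Assumed umlaut transliteration (step 5).
UMLAUTS = {"Ä": "AE", "Ö": "OE", "Ü": "UE"}


def standardize(name):
    """Return a cleaned, upper-case key for one raw applicant name."""
    # Step 1: character cleaning (compose characters, upper-case).
    s = unicodedata.normalize("NFC", name).upper()
    # Step 5: umlaut harmonization, done before accents are stripped.
    for umlaut, ascii_form in UMLAUTS.items():
        s = s.replace(umlaut, ascii_form)
    # Step 1 (continued): drop any remaining diacritics.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Step 2: punctuation cleaning.
    s = re.sub(r"[^A-Z0-9 ]+", " ", s)
    # Step 3: legal form treatment (here: simply removed).
    tokens = [t for t in s.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def unify(names):
    """Step 7: group raw names that share the same standardized key."""
    groups = defaultdict(list)
    for raw in names:
        groups[standardize(raw)].append(raw)
    return dict(groups)
```

With these assumptions, `HILLE & MÜLLER GMBH & CO KG` and `HILLE & MUELLER GMBH & CO.` fall under the same key, while `HILLE & MULLER GMBH & CO KG` does not; that residual variation is what approximate matching is meant to catch.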
13
Dictionary uses
One-to-one or fuzzified entry
Manual or automatic
PCT designation links across the various patent offices
Priority links across the various patent offices
Potential limitations
– The problem of “synonymy”
– Multi-applicant applications
14
Rule-based approach
Definition of rules to compare the similarity of names (Thoma and Torrisi 2007)
Initially, hand-crafted rules to describe the composition of named entities and their context
Some core words and components of words might be used to extract candidates for more complex names, OR vice versa
15
Approximate string matching algorithms (1)
Edit distance: number of operations needed to turn one word into the other
– Extended to take spelling variations into account
– The similarity of two strings $x$ and $y$ of lengths $n_x$ and $n_y$ can be calculated as $1 - d/N$, where 1 is the maximum similarity, $d$ is the edit distance between $x$ and $y$, and $N = \max\{n_x, n_y\}$.
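As a minimal sketch of the normalized similarity, assuming plain Levenshtein distance (insertions, deletions, substitutions, each at cost 1); the extension for spelling variations mentioned on the slide is not reproduced here, and the function names are illustrative.

```python
def edit_distance(x, y):
    """Levenshtein distance between strings x and y (unit costs)."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(x, y):
    """1 - d/N with N = max(len(x), len(y)); 1.0 means identical strings."""
    n = max(len(x), len(y))
    if n == 0:
        return 1.0
    return 1 - edit_distance(x, y) / n
```

For `AB ELECTRONIK GMBH` against `AB ELEKTRONIK GMBH` (after upper-casing) there is one substitution over 18 characters, so the similarity is 1 - 1/18, about 0.94. For `BHLER AG` against `BAYER AG` it is 1 - 2/8 = 0.75, which shows how two different firms can still score high.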
16
Edit distance: examples
1. HILLE & MUELLER GMBH & CO. / HILLE & MULLER GMBH & CO KG / HILLE & MÜLLER GMBH & CO KG
2. AB ELECTRONIK GMBH / AB Elektronik GmbH
3. BHLER AG / BAYER AG
18
Approximate string matching algorithms (2)
Jaccard similarity measure: token-based, so it is unaffected by differences in the position of the same tokens between otherwise identical strings
Computationally easy similarity measure
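Assuming the standard definition, the Jaccard similarity of two token sets A and B is |A ∩ B| / |A ∪ B|. The sketch below tokenizes on whitespace after upper-casing, which is an assumption; the slide does not specify the tokenizer.

```python
def tokens(name):
    """Split a name into a set of upper-case, whitespace-separated tokens."""
    return set(name.upper().split())


def jaccard(a, b):
    """Jaccard similarity of the token sets of two names."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty names are treated as identical
    return len(ta & tb) / len(ta | tb)
```

Because sets ignore order, `AB GMBH` and `GMBH AB` score 1.0. On the other hand, `VBH DEUTSCHLAND GMBH` and `IBM DEUTSCHLAND GMBH` share two of four distinct tokens and score 0.5, although they are different firms.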
19
Jaccard distance: examples
1. AAE HOLDING / AAE TECHNOLOGY INTERNATIONAL
2. Japan as represented by the president of the university of Tokyo / President of Tokyo University
3. AAE HOLDING / AGRIPA HOLDING
4. VBH DEUTSCHLAND GMBH / IBM DEUTSCHLAND GMBH
20
Approximate matching algorithms (cont.)
Weighted Jaccard measure
– Tokens are weighted by the inverse frequency of a given token among different companies

token           frequency   weight
INTERNATIONAL   2183        0.12
HOLDING         1628        0.12
TECHNOLOGY      1207        0.12
AGRIPA          1           1
AAE             11          1
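One way the weights could enter the measure is as a ratio of summed token weights, sketched below. This is an assumption: the slide gives the weights but not the formula. The weights are copied from the table (reading the AAE row as weight 1), and tokens missing from the table default to weight 1.0, which is also an assumption.

```python
# Token weights taken from the slide's table; AAE is read as weight 1.
WEIGHTS = {
    "INTERNATIONAL": 0.12,
    "HOLDING": 0.12,
    "TECHNOLOGY": 0.12,
    "AGRIPA": 1.0,
    "AAE": 1.0,
}


def weighted_jaccard(a, b, weights=WEIGHTS, default=1.0):
    """Sum of weights of shared tokens over sum of weights of all tokens."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    union = ta | tb
    if not union:
        return 1.0  # two empty names are treated as identical
    shared = sum(weights.get(t, default) for t in ta & tb)
    total = sum(weights.get(t, default) for t in union)
    return shared / total
```

Under these assumptions `AAE HOLDING` against `AGRIPA HOLDING` drops to 0.12 / 2.12, about 0.06 (unweighted: 0.33), because the only shared token is generic. `AAE HOLDING` against `AAE TECHNOLOGY INTERNATIONAL` rises to 1 / 1.36, about 0.74 (unweighted: 0.25), because the shared token is rare.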
21
Approximate matching algorithms (cont.)
Weighted Jaccard measure
– Scalability to the use of external knowledge
– E.g. isolation of generic and non-discriminating tokens
22
Agenda
Literature background
Software creation
Dataset creation
Preliminary results
23
Software development
Combination of different NER approaches
Firstly, the dictionary approach
Secondly, rule-based post-processing for refinements, abbreviations, etc.
Thirdly, decision making in case of conflicting matches
Software meta-learning and the creation of a cumulative register of synonyms and conflicting matches
Partition of the input data into smaller subsets
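The three-stage combination might look roughly like the following sketch. Everything here is hypothetical and not the software presented in the talk: the function names, the 0.5 threshold, the use of unweighted token Jaccard as the rule-based step, and the policy of sending ties to a conflict register are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Unweighted token Jaccard, used here as a stand-in for the rule-based step."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


def match(name, dictionary, threshold=0.5, conflicts=None):
    """Return the standard name for `name`, or None if unmatched or ambiguous.

    `dictionary` maps known name variants (upper-case) to standard names.
    """
    key = name.upper().strip()
    # Stage 1: exact dictionary lookup.
    if key in dictionary:
        return dictionary[key]
    # Stage 2: approximate comparison against every dictionary variant.
    scored = [(jaccard(key, variant), std) for variant, std in dictionary.items()]
    best = max((score for score, _ in scored), default=0.0)
    if best < threshold:
        return None
    candidates = {std for score, std in scored if score == best}
    if len(candidates) == 1:
        return candidates.pop()
    # Stage 3: conflicting matches are recorded for later review.
    if conflicts is not None:
        conflicts.append((name, sorted(candidates)))
    return None
```

A usage example: with the dictionary `{"IBM DEUTSCHLAND GMBH": "IBM", "VBH DEUTSCHLAND GMBH": "VBH"}`, the input `IBM DEUTSCHLAND` resolves to `IBM`, while `DEUTSCHLAND GMBH` ties between both entries and lands in the conflict register.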
24
Agenda
Literature background
Software creation
Dataset creation
Preliminary results
25
Patent data retrieval
EPO/PCT patent database from EPOLINE
EPO applicant codes up to 07/2008
USPTO assignee codes up to 03/2007
Fully interfaced with PATSTAT using the publication number
ANSI vs UNICODE
About 2.2 million records in EPO and 7 million in USPTO
26
Amadeus business directory
A directory with demographic & financial information about firms
A proprietary commercial product, but many research institutions are adopting it
Coverage: 1993-2006, with a significant extension of the data only after 2000
Detailed ownership information
A unique BVD ID code that identifies a given legal entity
27
EPO/PCT applicant names
610K+ different applicant names
Cleaned, harmonized and converted to ANSI
Institutional categories identified
– Business (380K), Individuals (180K), Non-profit (60K)
– Query-based
Addresses retrieved from EPOLINE
28
Matching score
For each matched name a goodness score from 0 to 9 is given:

Similarity     Same location   Different location   Unknown location
50% or more    1               4                    3
30-50%         2               7                    6
10-30%         5               9                    8
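The table can be read as a lookup from a similarity band and a location status to a score. The sketch below encodes it that way; the boundary handling (0.5 goes to the top band, 0.3 to the middle band) and the `None` result below 10% are assumptions, since the slide defines neither the band edges nor the meaning of score 0.

```python
# Scores copied from the slide's table: band -> location status -> score.
SCORE_TABLE = {
    "50% or more": {"same": 1, "different": 4, "unknown": 3},
    "30-50%": {"same": 2, "different": 7, "unknown": 6},
    "10-30%": {"same": 5, "different": 9, "unknown": 8},
}


def similarity_band(similarity):
    """Map a similarity in [0, 1] to one of the table's row labels."""
    if similarity >= 0.5:
        return "50% or more"
    if similarity >= 0.3:
        return "30-50%"
    if similarity >= 0.1:
        return "10-30%"
    return None  # below the lowest band shown on the slide


def goodness_score(similarity, location):
    """Look up the score; `location` is 'same', 'different' or 'unknown'."""
    band = similarity_band(similarity)
    if band is None:
        return None
    return SCORE_TABLE[band][location]
```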
29
Matching
EPO applicant names matched to Amadeus based on name and location information
– About 130k names matched to 80k bvdid codes
Re-assignments, updatability and scalability over time
Extension with the USPTO data
– About 50k names matched to 31k bvdid codes
SIC/NACE sector allocation and size and age class
36
Distribution issues
Complete panel at the OECD
www.epip.eu
www.reserachoninnovation.org
37
Further activities
Limitations
Propagation of the EPO/PCT matching to the overall PATSTAT database
The proper consolidation level
– Identification of cross-holdings and joint ventures