Creating patent indicators with multiple information sources
Grid Thoma, IPTS-Patent Data Meeting, May 14-15, Seville

1 Creating patent indicators with multiple information sources
Grid Thoma
IPTS-Patent Data Meeting, May 14-15, Seville
IPTS, Seville, Grid Thoma, May 14-15, 2009

2 Questions?
Where do innovators (and innovations) come from?
– Location, industry, cohort, size, …
How to correctly count patents and citations at the applicant's portfolio level?
– EPO, USPTO, WIPO, equivalents, family, triadic and biadic family, etc.
– Citations vs self-citations

3 The problem

4 Goals
A methodological contribution to harmonizing data and information at the firm level using multiple sources: business directories, patent statistics, etc.
Generate harmonized patent and citation portfolios across patent offices

5 Paper
G. Thoma, S. Torrisi, A. Gambardella, D. Guellec, B. H. Hall, D. Harhoff (2008), “Methods and software for the harmonization and integration of datasets: A test based on IP-related data and accounting databases with a large panel of companies at the worldwide level,” first presented on September 3-4, 2008.

6 Agenda
Literature background
Software creation
Dataset creation
Preliminary results

8 Insights from other scientific contexts
Cross-pollination from bioinformatics
– biology has increasingly become a science relying on the analysis of large amounts of information
Named Entity Recognition (NER)
– Dictionary-based
– Rule-based

9 Dictionary-based approach
Large collections of names, serving as examples for a specific entity class
Exact matching of dictionary entries OR …
“fuzzify” the dictionary by automatically generating typical spelling variants for every entry
The problem of recall rate

10 Problem setting
Every known variation of a given applicant name
Harmonized to one agreed standard name

11 Examples of patenter name dictionaries
USPTO & EPO standard assignee names file
Derwent Patenter Codes
Building one's own dictionary with a harmonization procedure (Magerman, Van Looy and Song, 2006)

12 Name standardization by MVS (2006)
1. Character cleaning
2. Punctuation cleaning
3. Legal form indication treatment
4. Spelling variation harmonization
5. Umlaut harmonization
6. Common company name removal
7. Creation of a unified list of patenters
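The standardization steps on this slide can be sketched roughly as follows. This is a minimal Python illustration, not the MVS (2006) procedure itself: the function name `standardize`, the short `LEGAL_FORMS` set and the umlaut transliterations are assumptions made for the example, and steps 4 and 6 (spelling variation harmonization and common company name removal) are left out.

```python
import re
import unicodedata

# Illustrative legal-form tokens only (assumption); a real list would be much longer.
LEGAL_FORMS = {"GMBH", "AG", "KG", "CO", "LTD", "INC", "SA", "SPA", "BV"}
# Hypothetical umlaut transliterations for step 5.
UMLAUTS = {"Ä": "AE", "Ö": "OE", "Ü": "UE"}


def standardize(name: str) -> str:
    """Return a simplified standard form of an applicant name."""
    # Step 1: character cleaning (uppercase, composed Unicode form).
    name = unicodedata.normalize("NFC", name.upper())
    # Step 5: umlaut harmonization, done before accents are stripped.
    for umlaut, plain in UMLAUTS.items():
        name = name.replace(umlaut, plain)
    # Drop any remaining accents and non-ASCII characters.
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Step 2: punctuation cleaning (everything except letters, digits, spaces).
    name = re.sub(r"[^A-Z0-9 ]", " ", name)
    # Step 3: legal form treatment, here simply removing the tokens.
    tokens = [tok for tok in name.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


if __name__ == "__main__":
    for raw in ("Hille & Müller GmbH & Co. KG", "HILLE & MUELLER GMBH & CO."):
        print(raw, "->", standardize(raw))
```

Under these assumptions both spellings above collapse to the same string, while a variant such as HILLE & MULLER would still differ and would need approximate matching.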

13 Dictionary uses
One-to-one or fuzzified entry
Manual or automatic
PCT designation links across the various PO
Priority links across the various PO
Potential limitations
– The problem of “synonymy”
– Multi-applicant applications

14 Rule-based approach
Definition of rules to compare the similarity of names (Thoma and Torrisi 2007)
Initially, hand-crafted rules to describe the composition of named entities and their context
Some core words and components of words might be used to extract candidates for more complex names, OR vice versa

15 Approximate string matching algorithms (1)
Edit distance: number of operations needed to turn one word into the other
– extended to take spelling variations into account
– the similarity of two strings x and y of lengths n_x and n_y can be calculated as 1 - d/N, where 1 is the maximum similarity, d is the edit distance between x and y, and N = max{n_x, n_y}.

16 Edit distance: examples
1. HILLE & MUELLER GMBH & CO. / HILLE & MULLER GMBH & CO KG / HILLE & MÜLLER GMBH & CO KG
2. AB ELECTRONIK GMBH / AB Elektronik GmbH
3. BHLER AG / BAYER AG
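The normalized edit similarity 1 - d/N can be sketched in a few lines of Python. The code below uses the plain Levenshtein distance (insertions, deletions, substitutions); the extension for spelling variations mentioned in the slides is not reproduced, and the function names and the case-folding step are assumptions made for this example.

```python
def levenshtein(x: str, y: str) -> int:
    """Plain edit distance between x and y (unit cost per operation)."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(x: str, y: str) -> float:
    """Similarity 1 - d/N with N = max(len(x), len(y)); 1.0 means identical."""
    x, y = x.upper(), y.upper()  # case-folding is an assumption of this sketch
    n = max(len(x), len(y))
    if n == 0:
        return 1.0
    return 1 - levenshtein(x, y) / n


if __name__ == "__main__":
    print(edit_similarity("AB ELECTRONIK GMBH", "AB Elektronik GmbH"))
    print(edit_similarity("BHLER AG", "BAYER AG"))
```

On the third example pair the two names differ by only two substitutions over eight characters, so the measure is high even though the names look unrelated, which is why such a score is usually combined with other evidence.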

18 Approximate string matching algorithms (2)
Jaccard similarity measure: token based and accounts for differences due to the position of the same tokens between otherwise identical strings.
Computationally easy
Similarity measure:

19 Jaccard distance: examples
1. AAE HOLDING / AAE TECHNOLOGY INTERNATIONAL
2. Japan as represented by the president of the university of Tokyo / President of Tokyo University
3. AAE HOLDING / AGRIPA HOLDING
4. VBH DEUTSCHLAND GMBH / IBM DEUTSCHLAND GMBH
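A token-based Jaccard similarity can be sketched as below. The formula shown on the original slide did not survive in this transcript, so the code uses the standard definition (size of the token intersection divided by size of the token union); whitespace tokenization and upper-casing are assumptions of the example.

```python
def tokens(name: str) -> set:
    """Split a name into a set of upper-cased whitespace tokens."""
    return set(name.upper().split())


def jaccard(a: str, b: str) -> float:
    """Standard Jaccard similarity on token sets; token order is ignored."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 1.0  # two empty names are treated as identical
    return len(ta & tb) / len(union)


if __name__ == "__main__":
    pairs = [
        ("AAE HOLDING", "AAE TECHNOLOGY INTERNATIONAL"),
        ("AAE HOLDING", "AGRIPA HOLDING"),
        ("VBH DEUTSCHLAND GMBH", "IBM DEUTSCHLAND GMBH"),
    ]
    for a, b in pairs:
        print(a, "/", b, "->", round(jaccard(a, b), 3))
```

Because sets are unordered, a reordered name scores 1.0 against the original. The unweighted measure also rates the pair sharing only HOLDING above the pair sharing only AAE, which is the weakness that token weighting addresses.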

20 Approximate matching algorithms (cont)
Weighted Jaccard measure
– by the inverse frequency of a given token among different companies

token           frequency   weight
INTERNATIONAL   2183        0.12
HOLDING         1628        0.12
TECHNOLOGY      1207        0.12
AGRIPA          1           1
AAE             111

21 Approximate matching algorithms (cont)
Weighted Jaccard measure
– scalability to the use of external knowledge
E.g. isolation of generic and non-discriminating tokens
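The effect of weighting tokens by inverse frequency can be illustrated with a small Python sketch. The exact weighting function used in the presentation is not given in this transcript; here the weight is assumed to be 1 divided by the number of names containing the token, and the five-name corpus is invented for the example.

```python
from collections import Counter


def tokens(name: str) -> set:
    """Split a name into a set of upper-cased whitespace tokens."""
    return set(name.upper().split())


def token_weights(names) -> dict:
    """Weight each token by 1 / (number of names containing it). Assumed formula."""
    doc_freq = Counter(tok for name in names for tok in tokens(name))
    return {tok: 1.0 / count for tok, count in doc_freq.items()}


def weighted_jaccard(a: str, b: str, weights: dict, default: float = 1.0) -> float:
    """Sum of weights of shared tokens over sum of weights of all tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 1.0
    weight = lambda tok: weights.get(tok, default)  # unseen tokens count as rare
    return sum(weight(t) for t in ta & tb) / sum(weight(t) for t in union)


# Invented toy corpus: HOLDING is frequent, AGRIPA and TECHNOLOGY are rare.
CORPUS = [
    "AAE HOLDING",
    "AAE TECHNOLOGY INTERNATIONAL",
    "AGRIPA HOLDING",
    "VBH HOLDING",
    "IBM INTERNATIONAL HOLDING",
]

if __name__ == "__main__":
    w = token_weights(CORPUS)
    print(weighted_jaccard("AAE HOLDING", "AGRIPA HOLDING", w))
    print(weighted_jaccard("AAE HOLDING", "AAE TECHNOLOGY INTERNATIONAL", w))
```

With these toy weights, sharing the generic token HOLDING contributes less than sharing the rarer token AAE, so the ranking of the two example pairs is reversed relative to the unweighted measure.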

22 Agenda
Literature background
Software creation
Dataset creation
Preliminary results

23 Software development
Combination of different NER approaches
First, the dictionary approach
Second, rule-based post-processing for refinements, abbreviations, etc.
Third, decision making in case of conflicting matches
Software meta-learning and the creation of a cumulative register of synonyms and conflicting matches
Partition of the input data into smaller subsets
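The three-stage combination on this slide (dictionary lookup, rule-based matching, handling of conflicting matches) could look roughly like the sketch below. It illustrates the general idea only and is not the software described in the presentation: the function names, the returned labels, the use of token Jaccard as the rule, and the 0.5 threshold are all assumptions.

```python
def token_jaccard(a: str, b: str) -> float:
    """Unweighted Jaccard similarity on whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def match_name(raw, dictionary, standard_names, threshold=0.5):
    """Return (stage, matches) for one raw applicant name.

    dictionary     -- known variant -> standard name (stage 1)
    standard_names -- candidate standard names for similarity rules (stage 2)
    threshold      -- hypothetical minimum similarity for a rule-based match
    """
    key = " ".join(raw.upper().split())
    # Stage 1: exact dictionary lookup.
    if key in dictionary:
        return "dictionary", [dictionary[key]]
    # Stage 2: rule-based similarity against all candidates.
    scored = [(token_jaccard(key, std), std) for std in standard_names]
    best = max((score for score, _ in scored), default=0.0)
    if best < threshold:
        return "unmatched", []
    hits = sorted(std for score, std in scored if score == best)
    # Stage 3: several equally good candidates are flagged as a conflict.
    return ("rule" if len(hits) == 1 else "conflict"), hits


if __name__ == "__main__":
    known = {"AB ELEKTRONIK GMBH": "AB ELEKTRONIK"}
    standards = ["AB ELEKTRONIK", "AAE HOLDING", "AGRIPA HOLDING"]
    for raw in ("AB Elektronik GmbH", "AAE Holding Ltd", "Holding", "Siemens"):
        print(raw, "->", match_name(raw, known, standards))
```

In a setup like this, names returned as "conflict" would be the ones sent to manual review and then added to the register of known variants.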

24 Agenda
Literature background
Software creation
Dataset creation
Preliminary results

25 Patent data retrieval
EPO/PCT patent database from EPOLINE
EPO applicant codes up to 07/2008
USPTO assignee codes up to 03/2007
Fully interfaced with PATSTAT using the publication number
ANSI vs UNICODE
About 2.2 million records in EPO and 7 million in USPTO

26 Amadeus business directory
A directory with demographic & financial information about firms
A proprietary commercial product, but many research institutions are adopting it
Coverage: 1993-2006, with a significant extension of the data only after 2000
Detailed ownership information
A unique BVD ID code that identifies a given legal entity

27 EPO/PCT applicant names
610+ K different applicant names
Cleaned, harmonized and converted to ANSI
Institutional categories identified
– Business (380k), Individuals (180k), Non-profit (60k)
– Query based
Addresses retrieved from EPOLINE

28 Matching Score
For each matched name a goodness score is given from 0-9:

Similarity     Same location   Different location   Unknown location
50% or more    1               4                    3
30-50%         2               7                    6
10-30%         5               9                    8
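The goodness score on this slide can be expressed as a simple lookup. The sketch below reads the flattened figures as a 3x3 grid (similarity band by location status), which is an interpretation of the transcript. The function names, the treatment of band boundaries (for example, exactly 30% falling in the 30-50% band) and returning `None` below 10% are assumptions.

```python
# Scores read from the slide as: similarity band x location status.
SCORES = {
    "50% or more": {"same": 1, "different": 4, "unknown": 3},
    "30-50%":      {"same": 2, "different": 7, "unknown": 6},
    "10-30%":      {"same": 5, "different": 9, "unknown": 8},
}


def similarity_band(similarity: float):
    """Map a similarity in [0, 1] to a band label, or None below 10%."""
    if similarity >= 0.5:
        return "50% or more"
    if similarity >= 0.3:
        return "30-50%"
    if similarity >= 0.1:
        return "10-30%"
    return None


def goodness_score(similarity: float, location: str):
    """Return the goodness score for a match, or None if the similarity is too low.

    location must be one of "same", "different", "unknown".
    """
    band = similarity_band(similarity)
    if band is None:
        return None
    return SCORES[band][location]


if __name__ == "__main__":
    print(goodness_score(0.8, "same"))       # high similarity, same location
    print(goodness_score(0.4, "different"))  # medium similarity, other location
```

In this reading the best combination (high similarity, same location) receives the score 1, so lower values appear to indicate more reliable matches.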

29 Matching
EPO applicant names matched to Amadeus based on the names and location information
– About 130k names matched to 80k BVD ID codes
Re-assignments, updatability and scalability over time
Extension with the USPTO data
– About 50k names matched to 31k BVD ID codes
SIC/NACE sector allocation and size and age class

36 Distribution issues
Complete panel at the OECD
www.epip.eu
www.reserachoninnovation.org

37 Further activities
Limitations
Propagation of the EPO/PCT matching to the overall PATSTAT database
The proper consolidation level
– Identification of cross-holdings and joint ventures

