Creating patent indicators with multiple information sources
Grid Thoma
IPTS Patent Data Meeting, May 14-15, 2009, Seville

Questions?
Where do innovators (and innovations) come from?
– Location, industry, cohort, size, …
How to correctly count patents and citations at the applicant's portfolio level?
– EPO, USPTO, WIPO, equivalents, family, triadic and biadic family, etc.
– Citations vs self-citations

The problem

Goals
A methodological contribution to harmonizing data and information at the firm level using multiple sources: business directories, patent statistics, etc.
Generate harmonized patent and citation portfolios across patent offices

Paper
G. Thoma, S. Torrisi, A. Gambardella, D. Guellec, B. H. Hall, D. Harhoff (2008), “Methods and software for the harmonization and integration of datasets: A test based on IP-related data and accounting databases with a large panel of companies at the worldwide level,” first presented on September 3-4, 2008.

Agenda
Literature background
Software creation
Dataset creation
Preliminary results

Insights from other scientific contexts
Cross-pollination from bioinformatics
– biology has increasingly become a science relying on the analysis of large amounts of information
Named Entity Recognition (NER)
– Dictionary-based
– Rule-based

Dictionary-based approach
Large collections of names, serving as examples for a specific entity class
Exact matching of dictionary entries
OR … “fuzzify” the dictionary by automatically generating typical spelling variants for every entry
The problem of recall rate
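A minimal sketch of a "fuzzified" dictionary lookup, assuming a toy dictionary and a handful of hand-picked variant rules. The names `DICTIONARY`, `VARIANT_RULES`, `spelling_variants`, `build_lookup` and `match` are hypothetical and are not taken from the presentation or the underlying software.

```python
# Hypothetical dictionary: standard name -> known spellings (illustrative entries only).
DICTIONARY = {
    "HILLE & MUELLER": ["HILLE & MUELLER GMBH & CO KG"],
    "AB ELEKTRONIK": ["AB ELEKTRONIK GMBH"],
}

# Assumed variant rules: each (old, new) pair yields one more spelling per entry.
VARIANT_RULES = [
    ("UE", "U"),                      # MUELLER -> MULLER
    ("UE", "Ü"),                      # MUELLER -> MÜLLER
    (" & ", " AND "),                 # ampersand written out
    ("ELEKTRONIK", "ELECTRONIK"),     # k/c spelling variation
]


def spelling_variants(name):
    """Return the name plus automatically generated spelling variants."""
    variants = {name}
    for old, new in VARIANT_RULES:
        variants |= {v.replace(old, new) for v in variants}
    return variants


def build_lookup(dictionary):
    """Map every generated variant to its standard name."""
    lookup = {}
    for standard, spellings in dictionary.items():
        for spelling in [standard, *spellings]:
            for variant in spelling_variants(spelling.upper()):
                lookup[variant] = standard
    return lookup


def match(name, lookup):
    """Exact match against the fuzzified dictionary; None when no variant fits."""
    return lookup.get(name.upper().strip())


LOOKUP = build_lookup(DICTIONARY)
print(match("Hille & Müller GmbH & Co KG", LOOKUP))
print(match("AB Electronik GmbH", LOOKUP))
print(match("BAYER AG", LOOKUP))
```

Any spelling that none of the rules anticipates returns `None`, which is one way to picture the recall problem named on the slide.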

Problem setting
Every known variation of a given applicant name
Harmonized to one agreed standard name

Examples of patentee name dictionaries
USPTO & EPO standard assignee names file
Derwent Patentee Codes
Building one's own dictionary with a harmonization procedure (Magerman, Van Looy and Song, 2006)

Names standardization by MVS (2006)
1. Character cleaning
2. Punctuation cleaning
3. Legal form indication treatment
4. Spelling variation harmonization
5. Umlaut harmonization
6. Common company name removal
7. Creation of a unified list of patentees
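The sequence of steps above can be sketched roughly as follows. This is only an illustration of the idea: the rule tables `LEGAL_FORMS`, `SPELLING_VARIANTS` and `UMLAUT_DIGRAPHS` are small assumed stand-ins, not the rules of Magerman, Van Looy and Song (2006), and step 6 (common company name removal) is left out.

```python
import re
import unicodedata
from collections import defaultdict

# Illustrative rule tables (assumptions, far smaller than any real rule set).
LEGAL_FORMS = {"GMBH", "AG", "KG", "CO", "INC", "LTD", "CORP"}
SPELLING_VARIANTS = {"ELECTRONIK": "ELEKTRONIK"}
UMLAUT_DIGRAPHS = {"AE": "A", "OE": "O", "UE": "U"}


def standardize(name: str) -> str:
    # 1. Character cleaning: uppercase and strip diacritics (Ü -> U).
    s = unicodedata.normalize("NFKD", name.upper())
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # 2. Punctuation cleaning: keep letters, digits, "&" and spaces.
    s = re.sub(r"[^A-Z0-9& ]", " ", s)
    tokens = s.split()
    # 3. Legal form treatment: drop legal-form tokens.
    tokens = [t for t in tokens if t not in LEGAL_FORMS]
    # 4. Spelling variation harmonization.
    tokens = [SPELLING_VARIANTS.get(t, t) for t in tokens]
    # 5. Umlaut harmonization: crude digraph folding (UE -> U);
    #    this over-merges in general and is only meant as a demonstration.
    folded = []
    for t in tokens:
        for digraph, base in UMLAUT_DIGRAPHS.items():
            t = t.replace(digraph, base)
        folded.append(t)
    # Drop a connector left dangling once legal forms are removed.
    while folded and folded[-1] == "&":
        folded.pop()
    return " ".join(folded)


def unify(names):
    """7. Unified list: group raw names under their standardized form."""
    groups = defaultdict(set)
    for name in names:
        groups[standardize(name)].add(name)
    return dict(groups)


RAW = [
    "HILLE & MUELLER GMBH & CO.",
    "HILLE & MULLER GMBH & CO KG",
    "HILLE & MÜLLER GMBH & CO KG",
    "AB ELECTRONIK GMBH",
    "AB Elektronik GmbH",
]
UNIFIED = unify(RAW)
print(sorted(UNIFIED))
```

With these toy rules the five raw spellings collapse into two standardized names.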

Dictionary uses
One-to-one or fuzzified entry
Manual or automatic
PCT designation links across the various patent offices (PO)
Priority links across the various PO
Potential limitations
– The problem of “synonymy”
– Multi-applicant applications

Rule-based approach
Definition of rules to compare the similarity of names (Thoma and Torrisi 2007)
Initially, hand-crafted rules to describe the composition of named entities and their context
Some core words and components of words might be used to extract candidates for more complex names, OR vice versa

Approximate string matching algorithms (1)
Edit distance: number of operations needed to turn one word into the other
– extended to take spelling variations into account
– the similarity of two strings $x$ and $y$ of lengths $n_x$ and $n_y$ can be calculated as $1 - d/N$, where 1 is the maximum similarity, $d$ is the edit distance between $x$ and $y$, and $N = \max\{n_x, n_y\}$.
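As a rough sketch of the similarity 1 - d/N defined above, the following uses the plain Levenshtein distance (insertions, deletions, substitutions, each with cost 1). The extension for spelling variations mentioned on the slide is not modelled here, and the function names are illustrative.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def edit_similarity(x: str, y: str) -> float:
    """Similarity 1 - d/N with N = max(len(x), len(y))."""
    n = max(len(x), len(y))
    return 1.0 if n == 0 else 1 - edit_distance(x, y) / n


print(edit_similarity("AB ELECTRONIK GMBH", "AB ELEKTRONIK GMBH"))
print(edit_similarity("BHLER AG", "BAYER AG"))
```

The second pair differs in only two of eight characters, so it scores 0.75 even though the names need not refer to the same firm, which is why a character-level measure alone can produce false matches.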

Edit distance: examples
1. HILLE & MUELLER GMBH & CO. / HILLE & MULLER GMBH & CO KG / HILLE & MÜLLER GMBH & CO KG
2. AB ELECTRONIK GMBH / AB Elektronik GmbH
3. BHLER AG / BAYER AG

Approximate string matching algorithms (2)
Jaccard similarity measure: token based, and accounts for differences due to the position of the same tokens between otherwise identical strings.
Computationally easy
Similarity measure (standard Jaccard definition over the token sets $A$ and $B$ of two names): $J(A, B) = |A \cap B| / |A \cup B|$

Jaccard distance: examples
1. AAE HOLDING / AAE TECHNOLOGY INTERNATIONAL
2. Japan as represented by the president of the university of Tokyo / President of Tokyo University
3. AAE HOLDING / AGRIPA HOLDING
4. VBH DEUTSCHLAND GMBH / IBM DEUTSCHLAND GMBH
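The token-based comparison behind these examples can be sketched with the standard Jaccard similarity (shared tokens divided by all distinct tokens). Tokenizing on whitespace after uppercasing is an assumption made for this illustration.

```python
def tokens(name: str) -> set:
    """Split a name into a set of uppercase, whitespace-separated tokens."""
    return set(name.upper().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets; token order does not matter."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


PAIRS = [
    ("AAE HOLDING", "AAE TECHNOLOGY INTERNATIONAL"),
    ("AAE HOLDING", "AGRIPA HOLDING"),
    ("VBH DEUTSCHLAND GMBH", "IBM DEUTSCHLAND GMBH"),
]
for left, right in PAIRS:
    print(left, "/", right, round(jaccard(left, right), 2))
```

In this unweighted form the pair sharing only the generic token HOLDING scores higher than the pair sharing the distinctive token AAE, and two different firms sharing DEUTSCHLAND GMBH score 0.5.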

Approximate matching algorithms (cont.)
Weighted Jaccard measure
– tokens weighted by their inverse frequency among different companies
token          frequency  weight
INTERNATIONAL  2183       0.12
HOLDING        1628       0.12
TECHNOLOGY     1207       0.12
AGRIPA         1          1
AAE            11         1
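A minimal sketch of a weighted Jaccard measure, using the token weights listed on the slide as given. How those weights are derived from the frequencies is not stated, so the code takes them as input; the default weight of 1.0 for tokens missing from the table is an assumption.

```python
# Token weights as listed on the slide.
WEIGHTS = {
    "INTERNATIONAL": 0.12,
    "HOLDING": 0.12,
    "TECHNOLOGY": 0.12,
    "AGRIPA": 1.0,
    "AAE": 1.0,
}


def weighted_jaccard(a: str, b: str, weights=WEIGHTS, default: float = 1.0) -> float:
    """Weight of shared tokens divided by weight of all distinct tokens."""
    ta, tb = set(a.upper().split()), set(b.upper().split())

    def total(token_set):
        # Tokens absent from the table get the assumed default weight.
        return sum(weights.get(t, default) for t in token_set)

    union_weight = total(ta | tb)
    return total(ta & tb) / union_weight if union_weight else 1.0


print(round(weighted_jaccard("AAE HOLDING", "AAE TECHNOLOGY INTERNATIONAL"), 3))
print(round(weighted_jaccard("AAE HOLDING", "AGRIPA HOLDING"), 3))
```

With these weights the pair sharing the rare token AAE scores about 0.74, while the pair sharing only HOLDING drops to about 0.06, reversing the ranking of the unweighted measure.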

Approximate matching algorithms (cont.)
Weighted Jaccard measure
– scalability to the use of external knowledge
E.g. isolation of generic and non-discriminating tokens

Software development
Combination of different NER approaches
Firstly, the dictionary approach
Secondly, rule-based post-processing regarding the refinements, abbreviations, etc.
Thirdly, decision making in case of conflicting matches
Software meta-learning and the creation of a cumulative register of synonyms and conflicting matches
Partition of the input data into smaller subsets
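How these stages might fit together can be sketched as below. This is an assumption-laden illustration, not the software described in the presentation: the partition rule (`block_key`, first character of the name), the plain Jaccard scoring, the 0.5 threshold and the function `match_names` are all hypothetical.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.upper().split()), set(b.upper().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def block_key(name: str) -> str:
    # Hypothetical partition rule: first character of the name.
    return name.strip()[:1].upper()


def match_names(queries, dictionary, references, threshold=0.5):
    """Return (matches, conflicts); `dictionary` maps known variants to standard names."""
    # Partition the reference names into smaller subsets.
    blocks = defaultdict(list)
    for ref in references:
        blocks[block_key(ref)].append(ref)

    matches, conflicts = {}, {}
    for query in queries:
        key = query.upper().strip()
        # Firstly: dictionary lookup.
        if key in dictionary:
            matches[query] = dictionary[key]
            continue
        # Secondly: rule-based comparison, restricted to the query's block.
        scored = [(jaccard(query, ref), ref) for ref in blocks.get(block_key(query), [])]
        scored = sorted((s, r) for s, r in scored if s >= threshold)
        if not scored:
            continue
        top = scored[-1][0]
        best = sorted(r for s, r in scored if s == top)
        # Thirdly: decide, or record the conflict for later review.
        if len(best) == 1:
            matches[query] = best[0]
            dictionary[key] = best[0]  # cumulative register of synonyms
        else:
            conflicts[query] = best
    return matches, conflicts


dictionary = {"AB ELECTRONIK GMBH": "AB ELEKTRONIK GMBH"}
references = ["AB ELEKTRONIK GMBH", "AAE HOLDING", "ACME HOLDING", "ACME TRADING"]
matches, conflicts = match_names(
    ["AB Electronik GmbH", "AAE Holding Ltd", "ACME"], dictionary, references
)
print(matches)
print(conflicts)
```

Partitioning on the first character keeps each comparison set small, at the cost of missing pairs whose first characters differ; a real partition rule would need to be chosen with that trade-off in mind.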

Patent data retrieval
EPO/PCT patent database from EPOLINE
EPO applicant codes up to 07/2008
USPTO assignee codes up to 03/2007
Fully interfaced with PATSTAT using the publication number
ANSI vs UNICODE
About 2.2 million records in EPO and 7 million in USPTO

Amadeus business directory
A directory with demographic and financial information about firms
A proprietary commercial product, but many research institutions are adopting it
Coverage: 1993-2006, with a significant extension of the data only after 2000
Detailed ownership information
A unique BVD ID code that identifies a given legal entity

EPO/PCT applicant names
More than 610k different applicant names
Cleaned, harmonized and transformed into ANSI
Institutional categories identified
– Business (380k), Individuals (180k), Non-profit (60k)
– Query based
Addresses retrieved from EPOLINE

Matching score
For each matched name a goodness score from 0 to 9 is given:
Similarity    Same location  Different location  Unknown location
50% or more   1              4                   3
30-50%        2              7                   6
10-30%        5              9                   8
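The score table on this slide can be read as a lookup from similarity band and location agreement to a goodness score. A small sketch under stated assumptions: the band boundaries are treated as half-open intervals, similarities below 10% return `None`, and the names `SCORES` and `goodness_score` are hypothetical. The slide does not say when a score of 0 is assigned, so that case is not modelled.

```python
# Goodness scores from the slide, keyed by similarity band and location status.
SCORES = {
    "50+":   {"same": 1, "different": 4, "unknown": 3},
    "30-50": {"same": 2, "different": 7, "unknown": 6},
    "10-30": {"same": 5, "different": 9, "unknown": 8},
}


def goodness_score(similarity: float, location: str):
    """Return the goodness score for a match, or None below the 10% band."""
    # Assumed boundaries: [0.5, 1], [0.3, 0.5), [0.1, 0.3).
    if similarity >= 0.5:
        band = "50+"
    elif similarity >= 0.3:
        band = "30-50"
    elif similarity >= 0.1:
        band = "10-30"
    else:
        return None
    return SCORES[band][location]


print(goodness_score(0.75, "same"))
print(goodness_score(0.35, "unknown"))
print(goodness_score(0.12, "different"))
```

Lower scores correspond to higher similarity and agreeing locations, so ranking candidate matches by this score puts the most reliable ones first.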

Matching
EPO applicant names matched to Amadeus based on names and location information
– About 130k names matched to 80k bvdid codes
Re-assignments, updatability and scalability over time
Extension with the USPTO data
– About 50k names matched to 31k bvdid codes
SIC/NACE sector allocation, and size and age class


Distribution issues
Complete panel at the OECD
www.epip.eu
www.reserachoninnovation.org

Further activities
Limitations
Propagation of the EPO/PCT matching to the overall Patstat database
The proper consolidation level
– Identification of cross-holdings and joint ventures
