Mr. JOTL: A User Friendly Matching Software
Stéphane Lhuillery, Julio Raffo & Fernando Lladós
December nd "NameGame" APE-INV workshop
Outline
Background
Objectives & Rationale
Results
User Friendly Software
–Concept
–Alpha test
Further steps
Background
Automatic patent retrieval is becoming necessary given the size of the data sets.
A growing literature addresses this "NameGame":
–On firms’ names: Derwent, 2002; Magerman et al., 2006; Hall, 2006; Thoma et al
–On inventors’ names: Trajtenberg et al., 2006; Hoisl, 2006; Lissoni et al., 2006; Mariani et al., 2007; Raffo & Lhuillery, 2009; etc.
Outcomes of our ESF project:
–New matching best practices
–APE-INV database
Objectives of the NameGame
Minimize false positives (= higher precision)
Minimize false negatives (= higher recall)
Maximizing true positives?
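The two objectives above can be made concrete with a small calculation. The following is a minimal sketch, not part of Mr. JOTL: it assumes that a matching run produces a set of record pairs and that a hand-checked set of true pairs is available. The function name `precision_recall` and the record identifiers are hypothetical.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for a set of proposed matches.

    predicted_pairs: pairs the matching procedure proposed (assumed input).
    true_pairs: pairs known to be correct, e.g. from a hand-checked sample.
    """
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    true_pos = len(predicted & truth)    # proposed and correct
    false_pos = len(predicted - truth)   # proposed but wrong
    false_neg = len(truth - predicted)   # correct but missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, recall


# Hypothetical identifiers: A* from one data set, B* from another.
predicted = {("A1", "B1"), ("A2", "B2"), ("A3", "B9")}
truth = {("A1", "B1"), ("A2", "B2"), ("A4", "B4")}
precision, recall = precision_recall(predicted, truth)
```

In this toy case one proposed pair is wrong and one true pair is missed, so both measures come out at two thirds. Fewer false positives raise precision; fewer false negatives raise recall.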
Rationale behind: A three-step game
Examples on matching (EPFL)
Examples on filtering (EPFL)
What have we learned so far?
General
–Matching algorithms are not perfect, but they considerably improve the results.
Cleaning step
–The origin of the data substantially changes the data preparation process.
Matching step
–There is a hierarchy among algorithms, although it is specific to each particular case.
Filtering step
–The availability of supplementary data enhances or constrains the disambiguation process.
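The three steps named above (cleaning, matching, filtering) can be sketched as a small pipeline. This is an illustrative sketch only, not the Mr. JOTL implementation: the similarity measure (Python's standard `difflib.SequenceMatcher`), the 0.85 threshold, the `city` field used as supplementary data, and all record values are assumptions chosen for the example.

```python
import unicodedata
from difflib import SequenceMatcher


def clean(name):
    """Cleaning step: strip accents, lowercase, drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " "
                   for c in no_accents.lower())
    return " ".join(kept.split())  # collapse repeated whitespace


def match(records_a, records_b, threshold=0.85):
    """Matching step: keep pairs whose cleaned names are similar enough."""
    candidates = []
    for a in records_a:
        for b in records_b:
            score = SequenceMatcher(None, clean(a["name"]),
                                    clean(b["name"])).ratio()
            if score >= threshold:  # threshold is an assumed value
                candidates.append((a, b, score))
    return candidates


def filter_candidates(candidates):
    """Filtering step: disambiguate with supplementary data (here, city)."""
    return [(a, b, s) for a, b, s in candidates if a["city"] == b["city"]]


# Hypothetical records from two sources.
inventors = [{"id": "A1", "name": "Müller, Hans", "city": "Lausanne"}]
authors = [
    {"id": "B1", "name": "MULLER Hans", "city": "Lausanne"},
    {"id": "B2", "name": "Muller, Hans.", "city": "Zurich"},
    {"id": "B3", "name": "Dupont, Marie", "city": "Lausanne"},
]

candidates = match(inventors, authors)
kept = filter_candidates(candidates)
```

Here the matching step proposes B1 and B2 as candidates for A1, and the filtering step keeps only B1 because the city agrees. With no supplementary field available, the B2 homonym could not be ruled out, which is the constraint noted under the filtering step.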
Why create user-friendly software?
[Diagram: data sources to be matched: PATSTAT / APE-INV Database, Survey, PATVAL, EU FW Program, SCOPUS, ISI Thomson]
Concept behind Mr. JOTL
Intuitive for beginner users
Flexible on inputs and their preparation
Fair variety of standard matching processes
Adaptable on the disambiguation filters
But soundly customizable for advanced users
Conceived and coded to be expanded in the future by multiple developers
From concept to reality (OK, for the moment just an alpha!)
Inputs IPTS, Sevilla May
Parsing
Matching
Disambiguation
SSM
LET’S TEST IT!
Technical notes
OS supported (so far):
–Windows XP, Vista, 7 (Server & x64)
Coded in C#
–Pros: free development environment; low cost of entry; large developer community
–Cons: proprietary language and libraries; less efficient memory management
Libraries needed:
–Scintilla: open-source lexer and syntax highlighter
Customizable code:
–C# & VBA
Suggested environment for future development:
–Visual Studio (Express version is free to use)
–Mono on Linux
Further developments
Fully coding the existing algorithms.
Testing performance against large data sets (> 1 million records).
Pre-setting standard routines (as XML).
Drafting documentation (+ video).
Testing with first-time users (at EPFL).
Openness and its governance
How to share it?
–GitHub?
–Forums
How to develop a dynamic sharing community?
Thank you!