Bio-REGNET Retrieval of Patent Documents from Heterogeneous Sources using Ontologies and Similarity Analysis Siddharth Taduri, Gloria T. Lau, Kincho H.

Presentation transcript:

Bio-REGNET Retrieval of Patent Documents from Heterogeneous Sources using Ontologies and Similarity Analysis Siddharth Taduri, Gloria T. Lau, Kincho H. Law Engineering Informatics Lab, Stanford University Jay P. Kesan, School of Law, University of Illinois Urbana-Champaign 09/21/2011 International Conference on Semantic Computing

Problem Statement
• Patent validity and enforcement questions involve the analysis of documents in various domains: world-wide patents, PTO file wrappers, scientific publications and court documents
• The information is siloed in several diverse information sources
[Figure labels: Issued Patents and Applications; Court Cases; File Wrappers; Technical Publications; Regulations and Laws]

Problem Statement
• The sources are diverse in structure, format, semantics and syntax
How can comprehensive knowledge of patents in a particular technological space be developed and retrieved?
[Figure labels: Issued Patents and Applications; Court Cases; File Wrappers; Technical Publications; Regulations and Laws; Specific Technical Domain; Knowledge Source 1: Patent System Ontology; Knowledge Source 2: Bio Ontology; Integration]

Patent System Ontology
• Established semantics allow us to reason over the classes, properties and instances to infer new facts
• Documents can be connected to form a network similar to a citation network, except that it carries not just citations but also other metadata such as co-inventorship, technological classification and cross-domain relevancy metrics between documents (e.g., patents occurring in court cases)
• Rules can be developed to perform additional inferences over the knowledge

Example Query
• Return all the patent documents which contain the keyword “erythropoietin” in the Claims and are assigned to “Amgen_Inc”. What technology classes do these patent documents belong to?
• SPARQL Query:
    SELECT DISTINCT ?patent ?inventor
    FROM
    WHERE {
      ?patent a ont:Patent .
      ?patent ont:hasAbstract ?abs .
      ?abs ont:resourceVal ?val .
      ?val bif:contains "erythropoietin" .
      ?patent ont:hasAssignee ont:Amgen_Inc .
      ?patent ont:hasInventor ?inventor
    }
    LIMIT 10
Result (Inventor column): Strickland_Thomas_W, Elliott_Steven_G, Egrie_Joan_C, Elliott_Steven_G, Browne_Jeffrey_K, Sitney_Karen_C, Elliott_Steven_G, Byrne_Thomas_E, Elliott_Steven_G, Lin_Fu-Kuen
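A query of this shape could be assembled programmatically before being sent to an endpoint. The sketch below only builds the query text; the function name `build_patent_query` and its parameters are hypothetical, the `ont:` and `bif:` prefixes are taken from the slide, and the required PREFIX declarations and the graph IRI (not preserved in the transcript) would have to be supplied by the caller.

```python
def build_patent_query(keyword, assignee, graph_iri=None, limit=10):
    """Compose the SPARQL text for a keyword + assignee patent search.

    Hypothetical helper: it mirrors the triple patterns on the slide and
    does not contact any endpoint. `bif:contains` is a Virtuoso full-text
    extension, so the result is only meaningful on a Virtuoso store.
    """
    if '"' in keyword:
        raise ValueError("keyword must not contain double quotes")
    # The slide's FROM clause lost its graph IRI; it is optional here.
    from_clause = f"FROM <{graph_iri}>\n" if graph_iri else ""
    return (
        "SELECT DISTINCT ?patent ?inventor\n"
        f"{from_clause}"
        "WHERE {\n"
        "  ?patent a ont:Patent .\n"
        "  ?patent ont:hasAbstract ?abs .\n"
        "  ?abs ont:resourceVal ?val .\n"
        f'  ?val bif:contains "{keyword}" .\n'
        f"  ?patent ont:hasAssignee ont:{assignee} .\n"
        "  ?patent ont:hasInventor ?inventor .\n"
        "}\n"
        f"LIMIT {limit}"
    )


query_text = build_patent_query("erythropoietin", "Amgen_Inc")
```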

Domain (Bio) Ontologies
• Bio-ontologies serve as terminological standards in the domain

Expanded Query
Original Term: Erythropoietin
Synonyms: Erythropoietin, Recombinant Erythropoietin, erythropoietin receptor binding, Hematopoietin, Recombinant EPO, Erythrocyte Colony Stimulating Factor, Epoetin, EPO …
Children: Darbepoetin Alfa, Epoetin Alfa, Epoetin Beta …
Parents: Colony Stimulating Factors, cytokine receptor binding, recombinant hematopoietic growth factors …
Grand-Parents: hematopoietic growth factor, receptor binding, recombinant growth factor …
• An appropriate ranking function must be applied to balance the more general terms. Heuristically, we assign a higher weight to synonyms and a progressively lower weight as we traverse away from the concept node
• Resulting Query: “original term” OR [synonyms]^weight OR [children]^weight OR …
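The expansion pattern above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the in-memory `ONTOLOGY` dictionary stands in for a BioPortal lookup, the `WEIGHTS` values are assumptions (the slides give no numbers), and the `^weight` boost syntax follows the slide's query template.

```python
# Tiny stand-in for an ontology lookup; the terms are taken from the slide.
ONTOLOGY = {
    "erythropoietin": {
        "synonyms": ["Hematopoietin", "Epoetin", "EPO"],
        "children": ["Epoetin Alfa", "Epoetin Beta"],
        "parents": ["Colony Stimulating Factors"],
        "grandparents": ["hematopoietic growth factor"],
    }
}

# Assumed weights: highest for synonyms, lower further from the concept node.
WEIGHTS = {"synonyms": 0.9, "children": 0.6, "parents": 0.5, "grandparents": 0.25}


def expand_query(term, ontology=ONTOLOGY, weights=WEIGHTS):
    """Build '"term" OR (synonyms)^w OR (children)^w OR ...' for one term."""
    entry = ontology.get(term.lower(), {})
    clauses = [f'"{term}"']
    for relation, weight in weights.items():
        related = entry.get(relation, [])
        if related:
            group = " OR ".join(f'"{t}"' for t in related)
            clauses.append(f"({group})^{weight}")
    return " OR ".join(clauses)


expanded = expand_query("Erythropoietin")
```

A term that is missing from the ontology falls back to the unexpanded keyword, so the search degrades to a plain keyword query rather than failing.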

Current Corpus: experimental platform to test the overall effectiveness of the framework
Use-Case: Erythropoietin
• 5 core patents: U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698
• 135 directly related patents (through citations) form our gold standard for computing formal measures such as precision and recall
• Total patent corpus of 1150 patents
• Over 3000 related publications identified through citations. These are available on PubMed and can be accessed through Entrez, a tool that provides a search interface to the PubMed database
• Around 30 court cases (patent litigation) involving major companies including Amgen, Hoechst Marion Roussel, Inc., and Transkaryotic Therapies, Inc.
• BioPortal (http://bioportal.bioontology.org) is a comprehensive source of domain knowledge
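Precision and recall against a gold standard such as the 135 citation-linked patents can be computed as follows. This is a generic sketch of the standard set-based definitions; the function name and the document identifiers in the example are hypothetical.

```python
def precision_recall(retrieved, gold):
    """Return (precision, recall) of a retrieved set against a gold standard.

    precision = |retrieved ∩ gold| / |retrieved|
    recall    = |retrieved ∩ gold| / |gold|
    Empty inputs yield 0.0 instead of raising a division error.
    """
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```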

Patent Ontology Stats
• 54 classes, 40 properties and over 15,000 individuals from 1150 patents, 30 court cases and one partially instantiated file wrapper
• Protégé-OWL is used to edit the ontology, and the Protégé-OWL/Jena API to programmatically instantiate physical documents
• The ontology can be queried through any SPARQL endpoint, such as Protégé or Virtuoso’s triple store
• SWRL is used to declare rules; the Jess rule engine executes them

Methodology
• The cross-references between document types and the metadata of documents in the patent system are utilized through a rule-based system
• Structural dependencies between types of documents must be considered
• Bio-ontologies are applied differently to each type of document because the depth of technical terminology varies. This is controlled through the weighting vector
• Based upon an initial selection of documents by the user, we perform a similarity analysis between documents [User Relevancy Feedback]
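The relevance-feedback step can be illustrated with a simple similarity ranking. The slides do not name the similarity measure, so cosine similarity over raw term counts is used here purely as a stand-in; the function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text (lowercased)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_feedback(selected_texts, candidates):
    """Rank candidate documents by similarity to the user-selected ones.

    selected_texts: texts the user marked as relevant.
    candidates: mapping of document id -> text.
    """
    profile = Counter()
    for text in selected_texts:
        profile.update(term_vector(text))
    scored = [(cosine(profile, term_vector(text)), doc_id)
              for doc_id, text in candidates.items()]
    # Highest score first; ties broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


ranking = rank_by_feedback(
    ["recombinant erythropoietin glycoprotein"],
    {"p1": "purification of erythropoietin", "p2": "semiconductor memory device"},
)
```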

Rules
• The declarative representation of the patent system ontology can facilitate reasoning through rules
• Different users may be interested in different aspects of the document (users can apply their own heuristics)
• The methodology allows users to select which rules apply during search

Rules
• Two patents share the same inventor: IF hasInventor(?pat1, ?inv1) ^ hasInventor(?pat2, ?inv1) ^ owlDifferentFrom(?pat1, ?pat2) → hasSimilarDocument(?pat1, ?pat2)
• The same court case cites two different patents: IF patentsInvolved(?case, ?pat1) ^ patentsInvolved(?case, ?pat2) ^ owlDifferentFrom(?pat1, ?pat2) → hasSimilarDocument(?pat1, ?pat2)
• Rules are combined by using:
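The effect of these two rules can be reproduced outside a rule engine with a small forward-chaining sketch. This is an illustration only: the slides use SWRL with Jess, whereas the code below works on plain `(predicate, subject, object)` tuples, and the patent, inventor and case identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def infer_similar(facts):
    """Derive hasSimilarDocument pairs from the two rules on the slide.

    facts: iterable of (predicate, subject, object) tuples, with
      hasInventor(patent, inventor) and patentsInvolved(case, patent).
    Returns a set of ordered (pat1, pat2) pairs with pat1 != pat2.
    """
    groups = defaultdict(set)
    for predicate, subject, obj in facts:
        if predicate == "hasInventor":
            groups[("inventor", obj)].add(subject)   # patents per inventor
        elif predicate == "patentsInvolved":
            groups[("case", subject)].add(obj)       # patents per court case
    similar = set()
    for patents in groups.values():
        # combinations() never pairs a patent with itself, which plays
        # the role of the differentFrom condition in the rules.
        for a, b in combinations(sorted(patents), 2):
            similar.add((a, b))
            similar.add((b, a))
    return similar


# Hypothetical facts, for illustration only.
facts = [
    ("hasInventor", "patA", "inventor1"),
    ("hasInventor", "patB", "inventor1"),
    ("patentsInvolved", "case1", "patC"),
    ("patentsInvolved", "case1", "patD"),
    ("hasInventor", "patE", "inventor2"),
]
pairs = infer_similar(facts)
```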

[Figure: generating the query from patent text and the NCI Thesaurus]
NCI Thesaurus hierarchy (subClassOf property):
  Recombinant Growth Factor
    Recombinant Hematopoietic Growth Factor
      Recombinant Erythropoietin
        Darbepoetin Alfa, Epoetin Beta, Epoetin Alfa, Pegzerepoetin Alfa
Properties for Recombinant Erythropoietin:
  ID: Recombinant Erythropoietin
  rdfs:label: Recombinant Erythropoietin
  Synonym: Erythrocyte Colony Stimulating Factor
  Synonym: Erythropoietin
  Synonym: Hematopoietin
  Synonym: Recombinant EPO
  Synonym: EPO
Text (Patent 5,547,933, Claims): “1. A non-naturally occurring erythropoietin glycoprotein product having the in vivo biological … to increase production of reticulocytes and red blood cells and having glycosylation which differs from that of human urinary erythropoietin. …”
Expanded terms $M$:
  {non-naturally, erythropoietin, glycoprotein, biological, …, reticulocytes, glycosylation …}
  {recombinant erythropoietin, epo, recombinant epo, hematopoietin, erythrocyte colony stimulating factor}
  {recombinant hematopoietic growth factor}
  {recombinant growth factor}
  {darbepoetin alfa, epoetin beta, epoetin alfa …}
$W_{Pat}$: weight vector for patents; $W_{Case}$: weight vector for cases
Generated query: $Q_{Patent} = W_{Pat}^{T} M$, $Q_{Case} = W_{Case}^{T} M$
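One way to read the generated-query formula is that each row of M is a group of terms and the weight vector assigns one weight per group, so the same M yields a different query for each document type. The sketch below follows that reading; the numeric weights are assumptions (the values on the slide did not survive transcription) and `weighted_query` is a hypothetical name.

```python
# Rows of M, in the order shown on the slide: text terms, synonyms,
# parent, grandparent, children.
M = [
    ["erythropoietin", "glycoprotein", "reticulocytes", "glycosylation"],
    ["recombinant erythropoietin", "epo", "recombinant epo", "hematopoietin"],
    ["recombinant hematopoietic growth factor"],
    ["recombinant growth factor"],
    ["darbepoetin alfa", "epoetin beta", "epoetin alfa"],
]

# Assumed weight vectors: patents use deeper technical terminology than
# court cases, so ontology-derived groups are weighted higher for patents.
W_PAT = [1.0, 0.9, 0.5, 0.25, 0.6]
W_CASE = [1.0, 0.7, 0.2, 0.1, 0.3]


def weighted_query(term_groups, weights):
    """Return {term: weight}, applying one weight per group of terms."""
    if len(term_groups) != len(weights):
        raise ValueError("one weight is required per term group")
    query = {}
    for terms, weight in zip(term_groups, weights):
        for term in terms:
            # A term that appears in several groups keeps its highest weight.
            query[term] = max(weight, query.get(term, 0.0))
    return query


q_patent = weighted_query(M, W_PAT)
q_case = weighted_query(M, W_CASE)
```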

Implementation

Result – Structural Dependency

Result – Combining Rules and Bio-Ontology

Future Work
• Formal evaluation is necessary but hard, owing to the lack of well-defined ground truths
• Include other information sources: publications, regulations, laws
• Experiment with more use cases outside the biomedical domain

Tool Snapshot

Acknowledgement
This research is partially supported by NSF Grant Number IIS awarded to the University of Illinois at Urbana-Champaign and NSF Grant Number IIS to Stanford University. Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation.

Thank You: Questions? Engineering Informatics Lab: Contact Siddharth Taduri: Gloria T. Lau: Kincho H. Law: Jay P. Kesan:

BACK UP SLIDES

Patent Ontology, Document Diversity etc.

Patent Documents
• Over 7 million U.S. patents
• In 2009, 485,312 patent applications were filed
• Information is contained in various sections of the documents; a full-text search alone is not sufficient, and other metrics such as classification and citations need to be considered
• Documents are available in HTML format and can be easily parsed

Court Cases
• Court cases are not very well structured!
• Information is comparatively more difficult to parse
• PACER, an electronic system for accessing U.S. court databases, requires one to know the party/assignee name, case number/type, etc., which may not be known
Example excerpt:
927 F.2d 1200 (1991)
AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., LTD., and Genetics Institute, Inc., Defendants-Appellants.
Nos , United States Court of Appeals, Federal Circuit. March 5, Suggestion for Rehearing Declined May 20, …
Before MARKEY, LOURIE and CLEVENGER, Circuit Judges. …
THE PATENTS
On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method for the Purification of Erythropoietin and Erythropoietin Compositions" (the '195 patent). The patent claims both homogeneous EPO and compositions thereof and a method for purifying human EPO using reverse phase high performance liquid chromatography. The method claims are not before us. The relevant claims of the '195 patent are:
1. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least 160,000 IU per absorbance unit at 280 nanometers.
* * * * * *
3. A pharmaceutical composition for the treatment of anemia comprising a therapeutically effective amount of the homogeneous erythropoietin of claim 1 in a pharmaceutically acceptable vehicle.
4. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least about 160,000 IU per absorbance unit at 280 nanometers.

Patent File Wrappers
• File wrappers are folders which contain all documents exchanged between a patent applicant and the patent office
• Every file wrapper is different! There is no standardized ordering of events
• The relevant information is embedded within large amounts of irrelevant text
• File wrappers are available as images, which require additional processing in order to extract text
[Figure labels: Events; Text]

Cross-Referencing
• There are many aspects of these documents which can be utilized, especially the cross-referencing between the documents
PATENT: United States Patent 5,955,422, September 21, 1999. Production of erythropoietin. Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") … Inventors: Lin; Fu-Kuen (Thousand Oaks, CA). Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA). Appl. No.: 08/100,197. Filed: August 2,
COURT CASE: 314 F.3d 1313 (2003). AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants. … Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), … alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.
FILE WRAPPER: U.S. Patent 5,955,422 … Claims are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R) … In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of …
PUBLICATION DATABASE
REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P. …
BIOPORTAL: DOMAIN KNOWLEDGE

Current prototype framework
1. Use bio-ontologies to expand the user’s query, covering broader terms and concepts
2. Search the document domain using the expanded query
3. Use the patent system ontology’s properties to relate documents (from all document domains)
4. Support user feedback to ensure the search progresses in the right direction
[Figure label: Patent System Ontology]

Class Hierarchy - I

Class Hierarchy - II

Class Hierarchy - III

Parsing the document to instantiate the Ontology
[Figure: Case 1 hasPlaintiff Amgen..; Case 1 hasDefendant Chugai..]
• Documents are automatically parsed using a regular-expression-based script
• Separate scripts are needed for each document domain
• The ontology is automatically instantiated using the Protégé-OWL API
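A regular-expression parse of a case caption might look like the sketch below. It is an assumption-laden illustration rather than the authors' script: the pattern only covers captions shaped like the Amgen v. Chugai example, the case identifier is hypothetical, and the output is a list of plain triples instead of calls to the Protégé-OWL API.

```python
import re

# Matches captions of the form
# "<plaintiff>, Plaintiff..., v. <defendants>, Defendants..."
CAPTION = re.compile(
    r"(?P<plaintiff>.+?),\s*Plaintiff\S*,?\s+v\.\s+"
    r"(?P<defendants>.+?),\s*Defendants?\b"
)


def parse_caption(case_id, caption):
    """Extract hasPlaintiff / hasDefendant triples from a case caption."""
    match = CAPTION.search(caption)
    if match is None:
        return []  # a differently formatted caption needs its own pattern
    triples = [(case_id, "hasPlaintiff", match.group("plaintiff").strip())]
    # Multiple defendants are joined by ", and" in this caption style.
    for defendant in re.split(r",\s+and\s+", match.group("defendants")):
        triples.append((case_id, "hasDefendant", defendant.strip()))
    return triples


caption = (
    "AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., "
    "LTD., and Genetics Institute, Inc., Defendants-Appellants."
)
triples = parse_caption("case1", caption)
```

Returning an empty list for unmatched captions keeps the failure visible to the caller, which fits the observation that each document format needs its own script.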