Patent Processing with GATE Kalina Bontcheva, Valentin Tablan University of Sheffield.

Presentation transcript:


University of Sheffield NLP, GATE Summer School - July 27-31, 2009

Outline
- Why patent annotation?
- The data model
- The annotation guidelines
- Building the IE pipeline
- Evaluation
- Scaling up and optimisation
- Find the needle in the annotation (hay)stack

What is Semantic Annotation?
Semantic Annotation:
- Is about attaching tags and/or ontology classes to text segments;
- Creates a richer data space and can allow conceptual search;
- Suitable for high-value content.
Can be:
- Fully automatic, semi-automatic, manual
- Social
- Learned

Semantic Annotation

Why annotate patents?
Simple text search works well for the Web, but:
- patent searchers require high recall (web search requires high precision);
- patents don't contain hyperlinks;
- patent searchers need richer semantics than offered by simple text search;
- patent text is amenable to HLT due to regularities and sub-language effects.

How can annotation help?
Format irregularities
- “Fig. 3”, “FIG 3”, “Figure 3”, etc.
Data normalisation
- “Figures. 3 to 5” -> FIG 3, FIG 4, FIG 5.
- “23rd Oct 1998” ->
Text mining – discovery of:
- product names and materials;
- references to other patents, publications and prior art;
- measurements;
- etc.
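The figure-reference normalisation described on this slide can be illustrated with a small Python sketch. It is not the project's implementation (the pipeline itself uses GATE and JAPE); the regular expression and the function name `normalise_figure_refs` are assumptions made for illustration only.

```python
import re

# Illustrative pattern (an assumption, not the project's grammar). It accepts
# "Fig. 3", "FIG 3", "Figure 3" and ranges such as "Figures. 3 to 5".
FIG_REF = re.compile(
    r"\b(?:figures?|figs?)\.?\s*(\d+)(?:\s*(?:to|-|–)\s*(\d+))?",
    re.IGNORECASE,
)


def normalise_figure_refs(text):
    """Return one canonical 'FIG n' string per figure mentioned in text."""
    refs = []
    for match in FIG_REF.finditer(text):
        first = int(match.group(1))
        # A single reference is treated as a range of length one.
        last = int(match.group(2)) if match.group(2) else first
        refs.extend(f"FIG {n}" for n in range(first, last + 1))
    return refs
```

Calling `normalise_figure_refs("Figures. 3 to 5")` expands the range into one entry per figure, so all surface variants map onto the same canonical form.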

Manual vs. Automatic
Manual SA
- high quality
- very expensive
- requires small data or many users (e.g. flickr, del.icio.us).
Automatic SA
- inexpensive
- medium quality
- can only do simple tasks
Patent data
- too large to annotate manually
- too difficult to annotate fully automatically

The SAM Projects
Collaboration between Matrixware, Sheffield GATE team, and Ontotext
Started in 2007 and ongoing
- Pilot study for applicability of Semantic Annotation to patents
- GATE Teamware: Infrastructure for collaborative semantic annotation
- Large scale experiments
- Mimir: Large scale indexing infrastructure supporting hybrid search (text, annotations, meaning)

Technologies
- Teamware (GATE, OWLIM, TRREE, JBPM, etc.): Data Enrichment (Semantic Annotation)
- KIM Knowledge Management (GATE, OWLIM, TRREE, Lucene, etc.): Data Access (Search/Browsing)
- Large Scale Hybrid Index (GATE, ORDI, TRREE, MG4J, etc.)
Diagram legend: Sheffield, Ontotext, Other

Teamware revisited: A Key SAM Infrastructure
Collaborative Semantic Annotation Environment
- Tools for semi-automatic annotation;
- Scalable distributed text analytics processing;
- Data curation;
- User/role management;
- Web-based user interface.

Semantic Annotation Experiments
Wide Annotation
- Cover a range of generally useful concepts: documents, document parts, references
- High level detail.
Deep Annotation
- Cover a narrow range of concepts: measurements
- As much detail as possible.

Data Model

Example Bibliographic Data

Example measurements

Example References

The Patent Annotation Guidelines
- 11 pages (10 point font), with concrete examples, general rules, specific guidelines per type, lists of exceptions, etc.
- The section on annotating measurements is 2 pages long!
- The clearer the guidelines – the better Inter-Annotator Agreement you’re likely to achieve
- The higher the IAA – the better automatic results can be obtained (less noise!)
- The lengthier the annotations – the more scope for error there is, e.g., references to other papers had the lowest IAA

Annotating Scalar Measurements
- numeric value including formulae
- always related to a unit
- more than one value can be related to the same unit
Examples:
... [80]% of them measure less than [6] um [2] ...
[2x10^-7] Torr
[29G×½]” needle
[3], [5], [6] cm
turbulence intensity may be greater than [0.055], [0.06] ...

Annotating Measurement Units
- including compound unit
- always related to at least one scalarValue
- do not include a final dot
- %, :, / should be annotated as unit
Examples:
deposition rates up to 20 [nm/sec]
a fatigue life of 400 MM [cycles]
ratio is approximately 9[:]7

Annotation Schemas: Measurements Example

The IE Pipeline
JAPE Rules vs Machine Learning
- Moving the goal posts: dealing with unstable annotation guidelines
  - JAPE – just change a few rules, hopefully
  - ML – could require significant manual re-annotation effort of the training data
- Bootstrapping training data creation with JAPE patterns – significantly reduces the manual effort
- For ML to be successful, we need IAA to be as high as possible – noisy data problem otherwise
- Insufficient training data initially, so chose JAPE approach

Example JAPEs for References

    Macro: FIGNUMBER
    // Numbers 3, 45, also 3a, 3b
    (
      {Token.kind == "number"}
      ({Token.length == "1", Token.kind == "word"})?
    )

    Rule: IgnoreFigRefsIfThere
    Priority: 1000
    (
      {Reference.type == "Figure"}
    )
    --> {}

    Rule: FindFigRefs
    Priority: 50
    (
      ({Token.root == "figure"} | {Token.root == "fig"})
      ({Token.string == "."})?
      ((FIGNUMBER) | (FIGNUMBERBRACKETS)):number
    ):figref
    -->
    :figref.Reference = {type = "Figure", id = :number.Token.string}

Example Rule for Measurements

    Rule: SimpleMeasure
    /*
     * Number followed by a unit.
     */
    (
      ({Token.kind == "number"})
    ):amount
    ({Lookup.majorType == "unit"}):unit
    -->
    :amount.Measurement = {type = scalarValue, rule = "measurement.SimpleMeasure"},
    :unit.Measurement = {type = unit, rule = "measurement.SimpleMeasure"}
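For readers without a GATE installation, the idea behind the SimpleMeasure rule (a number immediately followed by a known unit) can be approximated in plain Python. This is only a sketch: the `UNITS` list is a hypothetical stand-in for the gazetteer behind `Lookup.majorType == "unit"`, and a regular expression over raw text is much cruder than JAPE matching over Token and Lookup annotations.

```python
import re

# Hypothetical mini-gazetteer; the real unit list is not given in the slides.
UNITS = ["nm/sec", "Torr", "cycles", "cm", "um", "%"]

# Try longer units first so a compound unit is not cut short.
_unit_alternatives = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

# A number, optional whitespace, then a unit not followed by another letter.
SIMPLE_MEASURE = re.compile(
    rf"(\d+(?:\.\d+)?)\s*({_unit_alternatives})(?![A-Za-z])"
)


def find_measurements(text):
    """Return (scalarValue, unit) pairs for each number followed by a unit."""
    return [(m.group(1), m.group(2)) for m in SIMPLE_MEASURE.finditer(text)]
```

Like the JAPE rule, this sketch only covers the adjacent case: in "a fatigue life of 400 MM cycles" the number and the unit are separated by "MM", so nothing is returned and a further rule would be needed.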

The IE Annotation Pipeline

Hands-on: Identify More Patterns
- Open Teamware and log in
- Find corpus patents-sample
- Run ANNIC to identify some patterns for references to tables and figures and measurements
  - There are already POS tags, Lookup annotations, morphological ones
  - Units for measurements are Lookup.majorType == “unit”

The Teamware Annotation Project
- Iterated between JAPE grammar development, manual annotation for gold-standard creation, measuring IAA and precision/recall for JAPE improvements
- Initially the gold standard was doubly annotated until good IAA was obtained, then moved to 1 annotator per document
- Had 15 annotators working at the same time
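The precision/recall measurement mentioned here can be sketched as follows. The slides do not say how matches were scored, so this example assumes strict matching: an automatic annotation counts as correct only if its start offset, end offset and type all equal a gold-standard annotation. The function name and the tuple layout are illustrative.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones using strict matching.

    Each annotation is a (start, end, type) tuple.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A lenient variant that also credits overlapping spans would give higher scores; which variant is appropriate depends on the annotation guidelines.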

Measuring IAA with Teamware
- Open Teamware
- Find corpus patents-double-annotation
- Measure IAA with the respective tool
- Analyse the disagreements with the AnnDiff tool
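The slides do not state which agreement statistic the Teamware tool reports. As one common option, the sketch below computes Cohen's kappa for two annotators who each assigned one label per item (for example, per token); the function name and input layout are assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (such as "no annotation") dominates.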

Producing the Gold Standard
- Selected patents from two very different fields: mechanical engineering and biomedical technology
- 51 patents, 2.5 million characters
- 15 annotators, 1 curator reconciling the differences

The Evaluation Gold Standard

Preliminary Results

Running GATE Apps on Millions of Documents
Processed 1.3 million patents in 6 days with 12 parallel processes.
Data sets from Matrixware:
- American patents (USPTO): 1.3 million, 108 GB, average file size 85 KB.
- European patents (EPO): 27 thousand, 780 MB, average file size 29 KB.
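The headline figures translate into a per-document cost with some simple arithmetic. The sketch below uses only the numbers quoted on the slide and assumes the 12 processes ran continuously for the full 6 days; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def throughput(documents, days, processes):
    """Return (documents per second overall, seconds per document per process)."""
    total_seconds = days * SECONDS_PER_DAY
    docs_per_second = documents / total_seconds
    seconds_per_doc_per_process = processes * total_seconds / documents
    return docs_per_second, seconds_per_doc_per_process


overall_rate, per_process_cost = throughput(1_300_000, 6, 12)
```

Under that assumption the run averages roughly 2.5 documents per second overall, or just under 5 seconds of processing per document in each process.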

Large-scale Parallel IE
- Our experiments were carried out on the IRF’s supercomputer with Java (jrockit-R jdk ) with up to 12 processes
- SGI Altix 4700 system comprising 20 nodes each with four 1.4GHz Itanium cores and 18GB RAM
- In comparison, we found it 4x faster on Intel Core 2 2.4GHz

Large-Scale, Parallel IE (2)
GATE Cloud (A3): dispatches documents to process in parallel; does not stop on error
- Ongoing project, moving towards Hadoop
- Contact Hamish for further details
Benchmarking facilities: generate time stamps for each resource and display charts from them
- Help optimise the IE pipelines, esp. JAPE rules
- Doubled the speed of the patent processing pipeline
- For a similar third-party GATE-based application we achieved a 10-fold improvement
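The "dispatch in parallel, do not stop on error" behaviour can be illustrated with a generic Python sketch. It is not GATE Cloud code: `process_all` and its parameters are hypothetical, and it uses threads for brevity where the slides describe separate processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(documents, process, workers=12):
    """Run process(doc) for every document, collecting failures instead of aborting.

    documents must be hashable identifiers (for example, file names).
    Returns (results, failures), both keyed by document.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, doc): doc for doc in documents}
        for future in as_completed(futures):
            doc = futures[future]
            try:
                results[doc] = future.result()
            except Exception as error:
                # Record the failure and carry on with the remaining documents.
                failures[doc] = repr(error)
    return results, failures
```

Keeping failures in a separate mapping means a single malformed document cannot halt a multi-day run, and the failed documents can be inspected or retried afterwards.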

Optimisation Results

MIMIR: Accessing the Text and the Semantic Annotations
- Documents: 981,315
- Tokens: 7,228,889,715 (> 7 billion)
- Distinct tokens: 18,539,315 (> 18m)
- Annotation occurrences: 151,775,533 (> 151m)
