HLT, Data Sparsity and Semantic Tagging


1(21) HLT, Data Sparsity and Semantic Tagging
Louise Guthrie (University of Sheffield)
Roberto Basili (University of Tor Vergata, Rome)
Hamish Cunningham (University of Sheffield)

2(21) Outline
– A ubiquitous problem: data sparsity
– The approach: coarse-grained semantic tagging, learning by combining multiple evidence
– The evaluation: intrinsic and extrinsic measures
– The expected outcomes: architectures, tools, development support

3(21) Applications
Present
– We’ve seen growing interest in a range of HLT tasks, e.g. IE, MT
Trends
– Fully portable IE, unsupervised learning
– Content Extraction vs. IE

4(21) Data Sparsity
Language processing depends on a model of the features important to an application.
– MT: trigrams and frequencies
– Extraction: word patterns
New texts always seem to have lots of phenomena we haven’t seen before.

5(21) Different kinds of patterns
– Person was appointed as post of company
– Company named person to post
Almost all extraction systems tried to find patterns of mixed words and entities.
– People, locations, organizations, dates, times, currencies
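
The two patterns above can be sketched as regular expressions over text whose entities have already been tagged. This is only an illustration: the inline `<PERSON>`, `<POST>` and `<COMPANY>` markup, the `PATTERNS` list and the `extract` helper are assumptions made for the example, not a format described in the slides.

```python
import re

# Hypothetical inline entity markup; a real system would use its own annotation format.
PATTERNS = [
    # "Person was appointed as post of company"
    re.compile(
        r"<PERSON>(?P<person>.+?)</PERSON> was appointed as "
        r"<POST>(?P<post>.+?)</POST> of <COMPANY>(?P<company>.+?)</COMPANY>"
    ),
    # "Company named person to post"
    re.compile(
        r"<COMPANY>(?P<company>.+?)</COMPANY> named "
        r"<PERSON>(?P<person>.+?)</PERSON> to <POST>(?P<post>.+?)</POST>"
    ),
]


def extract(text):
    """Return the slots of the first matching pattern, or None if nothing matches."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.groupdict()
    return None


example = "<COMPANY>Acme</COMPANY> named <PERSON>Jane Doe</PERSON> to <POST>chief executive</POST>"
slots = extract(example)
```

Any wording that falls outside the listed patterns yields `None`, which is the sparsity problem the deck goes on to address.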

6(21) Can we do more?
Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday.
Humans aboard space_vehicle dodge satellite timeref.

7(21) Could we know these are the same?
– The IRA bombed a family-owned shop in Belfast yesterday.
– FMLN set off a series of explosions in central Bogota today.
ORGANIZATION ATTACKED LOCATION DATE
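
A minimal sketch of how coarse semantic tags could make the two sentences comparable. The toy `LEXICON`, the greedy longest-match lookup and the `semantic_pattern` function are all assumptions for illustration; the slides do not specify a tagging procedure at this point.

```python
# Hypothetical toy lexicon: lowercased token tuple -> coarse semantic class.
LEXICON = {
    ("the", "ira"): "ORGANIZATION",
    ("fmln",): "ORGANIZATION",
    ("bombed",): "ATTACKED",
    ("set", "off", "a", "series", "of", "explosions"): "ATTACKED",
    ("belfast",): "LOCATION",
    ("bogota",): "LOCATION",
    ("yesterday",): "DATE",
    ("today",): "DATE",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in LEXICON)


def semantic_pattern(sentence):
    """Map a sentence to its sequence of semantic classes, skipping untagged words."""
    tokens = sentence.lower().rstrip(".").split()
    tags = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so multi-word entries win over shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            sem_class = LEXICON.get(tuple(tokens[i:i + n]))
            if sem_class:
                tags.append(sem_class)
                i += n
                break
        else:
            i += 1  # no lexicon entry starts here; skip this token
    return tags


pattern_a = semantic_pattern("The IRA bombed a family owned shop in Belfast yesterday.")
pattern_b = semantic_pattern("FMLN set off a series of explosions in central Bogota today.")
```

Both sentences reduce to the same class sequence even though they share almost no words, which is the generalisation the slide asks about.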

8(21) Machine translation
Ambiguity of words often means that a word can translate several ways.
Would knowing the semantic class of a word help us to know the translation?

9(21) Sometimes...
– Crane the bird vs. crane the machine
– Bat the animal vs. bat for cricket and baseball
– Seal on a letter vs. the animal

10(21) So...
P(translation(crane) = grulla | animal) > P(translation(crane) = grulla)
P(translation(crane) = grua | machine) > P(translation(crane) = grua)
Can we show the overall effect lowers entropy?
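
The entropy question can be made concrete with a small calculation comparing H(T), the entropy of the translation choice, with H(T | C), its entropy once the semantic class is known. The joint counts below are invented purely for illustration and are not results from the project.

```python
import math
from collections import defaultdict

# Hypothetical joint counts of (semantic class, Spanish translation) for "crane".
joint = {
    ("animal", "grulla"): 45,
    ("animal", "grua"): 5,
    ("machine", "grulla"): 5,
    ("machine", "grua"): 45,
}


def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


total = sum(joint.values())
class_counts = defaultdict(int)
trans_counts = defaultdict(int)
for (sem_class, translation), n in joint.items():
    class_counts[sem_class] += n
    trans_counts[translation] += n

# H(T): uncertainty about the translation with no class information.
h_t = entropy(n / total for n in trans_counts.values())

# H(T | C): expected uncertainty once the semantic class is known.
h_t_given_c = sum(
    (class_counts[c] / total)
    * entropy(n / class_counts[c] for (c2, _), n in joint.items() if c2 == c)
    for c in class_counts
)

p_grulla = trans_counts["grulla"] / total
p_grulla_given_animal = joint[("animal", "grulla")] / class_counts["animal"]
```

With these made-up counts the translation choice costs 1 bit without the class and roughly 0.47 bits with it. Conditioning can never raise entropy on average, so the open question on the slide is how large the reduction is on real data.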

11(21) Language Modeling: Data Sparseness again
We need to estimate Pr(w_3 | w_1 w_2).
If we have never seen w_1 w_2 w_3 before, can we instead develop a model and estimate Pr(w_3 | C_1 C_2) or Pr(C_3 | C_1 C_2)?
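
A rough sketch of the class-conditioned estimate Pr(w_3 | C_1 C_2), using maximum-likelihood counts over a three-sentence toy corpus. The corpus, the `WORD_CLASS` mapping and both estimator functions are hypothetical and serve only to show why class contexts suffer less from unseen events.

```python
from collections import Counter

# Hypothetical word -> semantic class mapping and toy training corpus.
WORD_CLASS = {
    "ira": "ORG", "fmln": "ORG",
    "bombed": "ATTACK", "attacked": "ATTACK",
    "belfast": "LOC", "bogota": "LOC",
}
corpus = [
    ["ira", "bombed", "belfast"],
    ["fmln", "attacked", "bogota"],
    ["ira", "attacked", "belfast"],
]

word_trigrams = Counter()
word_contexts = Counter()
class_context_word = Counter()
class_contexts = Counter()
for sentence in corpus:
    for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
        word_trigrams[(w1, w2, w3)] += 1
        word_contexts[(w1, w2)] += 1
        ctx = (WORD_CLASS[w1], WORD_CLASS[w2])
        class_context_word[(ctx, w3)] += 1
        class_contexts[ctx] += 1


def p_word_context(w3, w1, w2):
    """MLE of Pr(w3 | w1 w2); zero when the word context was never seen."""
    seen = word_contexts[(w1, w2)]
    return word_trigrams[(w1, w2, w3)] / seen if seen else 0.0


def p_class_context(w3, w1, w2):
    """MLE of Pr(w3 | C1 C2), where C1 and C2 are the classes of w1 and w2."""
    ctx = (WORD_CLASS[w1], WORD_CLASS[w2])
    seen = class_contexts[ctx]
    return class_context_word[(ctx, w3)] / seen if seen else 0.0


# "fmln bombed bogota" never occurs as a word trigram in the toy corpus.
p_words = p_word_context("bogota", "fmln", "bombed")
p_classes = p_class_context("bogota", "fmln", "bombed")
```

The word-level estimate is zero because "fmln bombed" was never observed, while the class-level estimate is 1/3 because the context ORG ATTACK was seen three times, once followed by "bogota".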

12(21) A Semantic Tagging technology. How?
We will exploit similarity with NE tagging, ...
– Development of pattern matching rules as incremental wrapper induction
... with semantic (sense) disambiguation
– Use as much evidence as possible
– Exploit existing resources like MRDs or LKBs
... and with machine learning tasks
– Generalize from positive examples in training data

13(21) Multiple Sources of Evidence
Lexical information (priming effects)
Distributional information from general and training texts
Syntactic features
– SVO patterns or adjectival modifiers
Semantic features
– Structural information in LKBs
– (LKB-based) similarity measures

14(21) Machine Learning for ST
Similarity estimation
– among contexts (text overlaps, …)
– among lexical items wrt MRDs/LKBs
We will experiment with
– Decision tree learning (e.g. C4.5)
– Support Vector Machines (e.g. SVMlight)
– Memory-based learning (TiMBL)
– Bayesian learning
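
To illustrate similarity estimation among contexts, here is a very small memory-based classifier: it stores labelled contexts and tags a new one with the class of its most similar stored example, using Jaccard word overlap. It is a hand-rolled sketch with invented examples. It does not use TiMBL or reproduce its algorithms, and `MEMORY`, `jaccard` and `classify` are names made up for this example.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets of context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical stored training contexts for the ambiguous word "crane".
MEMORY = [
    ({"bird", "nest", "flew", "wetland"}, "animal"),
    ({"lifted", "steel", "construction", "site"}, "machine"),
]


def classify(context):
    """Return the class of the stored example most similar to the context."""
    _, label = max(MEMORY, key=lambda example: jaccard(example[0], context))
    return label


tag_a = classify({"crane", "flew", "over", "wetland"})
tag_b = classify({"crane", "lifted", "steel", "beams"})
```

The same similarity function could be swapped for an LKB-based measure between lexical items without changing the classifier loop.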

15(21) What’s New?
Granularity
– Semantic categories are coarser than word senses (cf. homograph level in MRD)
Integration of existing ML methods
– Pattern induction is combined with probabilistic description of word semantic classes
Co-training
– Annotated data are used to drive the sampling of further evidence from unannotated material (active learning)

16(21) How we know what we’ve done: measurement, the corpus
Hand-annotated corpus
- from the BNC, a 100-million-word balanced corpus
- 1 million words annotated
- a little under ½ million categorised noun phrases
Extrinsic evaluation
- Perplexity of lexical choice in Machine Translation
Intrinsic evaluation
- Standard measures of precision, recall, false positives (baseline: tag with most common category = 33%)
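
The intrinsic measures can be sketched as follows: a most-common-category baseline plus per-category precision and recall. The tag names and the tiny gold and predicted lists are invented for illustration and do not come from the annotated corpus.

```python
from collections import Counter

# Hypothetical gold tags for training, and gold/predicted tags for a test set.
train_gold = ["HUMAN", "HUMAN", "LOCATION"]
gold = ["HUMAN", "HUMAN", "LOCATION", "ORGANIZATION", "HUMAN", "LOCATION"]
pred = ["HUMAN", "LOCATION", "LOCATION", "ORGANIZATION", "HUMAN", "HUMAN"]

# Baseline: tag every item with the most common category seen in training.
majority = Counter(train_gold).most_common(1)[0][0]
baseline_accuracy = sum(g == majority for g in gold) / len(gold)


def precision_recall(category):
    """Precision and recall of the predictions for a single category."""
    tp = sum(g == category and p == category for g, p in zip(gold, pred))
    fp = sum(g != category and p == category for g, p in zip(gold, pred))
    fn = sum(g == category and p != category for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


human_precision, human_recall = precision_recall("HUMAN")
```

A tagger is only interesting if it beats the baseline accuracy, which the slide reports as 33% on the real corpus.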

17(21) Ambiguity levels in the training data
[Table: NPs by semantic category, as percentages; the values are not preserved in this transcript]
Total NPs (interim): 453,360

18(21) Maximising project outputs: software infrastructure for HLT
Three outputs from the project:
1. A new resource: automatic annotation of the whole corpus
2. Experimental evidence on
- how accurate the final results are
- how accurate the various methods employed are
3. Component tools for doing 1., based on GATE (a General Architecture for Text Engineering)

19(21) What is GATE?
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
Some free components... and wrappers for other people's components
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at

20(21) Where did GATE come from?
A number of researchers realised in the early to mid-1990s (e.g. in TIPSTER):
- Increasing trend towards multi-site collaborative projects
- Role of engineering in scalable, reusable, and portable HLT solutions
- Support for large data, in multiple media, languages, formats, and locations
- Lower the cost of creation of new language processing components
- Promote quantitative evaluation metrics via tools and a level playing field
History:
- 1996 to 2002: GATE version 1, proof of concept
- March 2002: version 2, rewritten in Java, component based, LGPL, more users
- Fall 2003: new development cycle

21(21) Role of GATE in the project
Productivity
- reuse some baseline components for simple tasks
- development environment support for implementors (MATLAB for HLT?)
- reduce integration overhead (standard interfaces between components)
- system takes care of persistency, visualisation, multilingual edit, ...
Quantification
- tool support for metrics generation
- visualisation of key/response differences
- regression test tool for nightly progress verification
Repeatability
- open-source, supported, maintained, documented software
- cross-platform (Linux, Windows, Solaris, others)
- easy install and proven usability (thousands of people, hundreds of sites)
- mobile code if you write in Java; web services otherwise