An Attack on Data Sparseness: JHU Tutorial, June 11, 2003.



OVERVIEW: What is this project about? What is GATE? Lab assignment.

Basic Approach (from RG talk): Build linguistic patterns, e.g. "person was appointed as post of company" and "company named person to post". Apply the patterns to text and fill a database.
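A minimal sketch of applying the two patterns above, assuming the person, post, and company slots can be approximated with simple regular expressions (in a real system they would come from a named-entity recognizer). The slot definitions, the `extract` function, and the example sentences are illustrative assumptions, not part of the talk.

```python
import re

# Hypothetical slot approximations: capitalized word sequences for names,
# lowercase word sequences for posts.
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"
POST = r"[a-z]+(?: [a-z]+)*"

PATTERNS = [
    # "person was appointed as post of company"
    re.compile(rf"(?P<person>{NAME}) was appointed as (?P<post>{POST}) of (?P<company>{NAME})"),
    # "company named person to post"
    re.compile(rf"(?P<company>{NAME}) named (?P<person>{NAME}) to (?P<post>{POST})"),
]


def extract(text):
    """Apply every pattern to the text and return one record per match."""
    records = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            records.append(match.groupdict())
    return records


text = ("Jane Smith was appointed as chief executive of Acme Corp. "
        "Later, Globex named John Doe to chairman.")
for record in extract(text):
    print(record)
```

Each record is a dictionary with `person`, `post`, and `company` keys, which is the shape a database row would take.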

Getting these patterns … Use training data to gather information about the contexts of the important bits of text. Write an algorithm that automatically makes use of the contextual information to identify further important bits and label them.

It is a difficult task. We are already pretty good at identifying and locating people, locations, organizations, dates, and times. What if we could do more?

Would it help to tag/replace noun phrases? Original: "Astronauts aboard the space shuttle Endeavour were forced to dodge a derelict Air Force satellite Friday." Tagged: "HUMANS aboard SPACE_VEHICLE dodge SATELLITE TIMEREF".

We could transform the training data and get more HUMANS DODGE SATELLITE After parsing: HUMANS aboard SPACE_VEHICLE dodge SATELLITE TIMEREF
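The noun-phrase replacement can be sketched as follows, assuming the text has already been chunked into noun phrases with their head nouns identified. The `SEMANTIC_CLASS` lexicon and the chunk format are hypothetical; only the class names come from the slide's example.

```python
# Hypothetical head-noun lexicon; class names follow the slide's example.
SEMANTIC_CLASS = {
    "astronauts": "HUMANS",
    "endeavour": "SPACE_VEHICLE",
    "satellite": "SATELLITE",
    "friday": "TIMEREF",
}


def replace_noun_phrases(chunks):
    """chunks: list of (text, head) pairs; head is None for non-NP material.

    Each noun phrase whose head is in the lexicon is replaced by its
    semantic class; everything else is kept unchanged.
    """
    out = []
    for text, head in chunks:
        if head is None:
            out.append(text)
        else:
            out.append(SEMANTIC_CLASS.get(head.lower(), text))
    return " ".join(out)


sentence = [
    ("Astronauts", "Astronauts"),
    ("aboard", None),
    ("the space shuttle Endeavour", "Endeavour"),
    ("were forced to dodge", None),
    ("a derelict Air Force satellite", "satellite"),
    ("Friday", "Friday"),
]
print(replace_noun_phrases(sentence))
```

Many different surface sentences collapse onto the same tagged form, which is how the transformation attacks data sparseness.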

Could we know these are the same? "The IRA bombed a family-owned shop in Belfast yesterday." "FMLN set off a series of explosions in central Bogota today." Both map to: ORGANIZATION ATTACKED LOCATION DATE.

Lexicography: data sparseness again. Sever BODYPART: sever an arm, sever a finger. Sever FASTENER: sever the bond, sever the links …

Machine translation: Ambiguity of words often means that a word can translate several ways. Would knowing the semantic class of a word help us to know the translation?

Sometimes... Crane the bird vs. crane the machine; bat the animal vs. bat for cricket and baseball; seal on a letter vs. seal the animal.

SO.. P(translation(crane) = grulla | animal) > P(translation(crane) = grulla) and P(translation(crane) = grua | machine) > P(translation(crane) = grua). Can we show the overall effect lowers entropy?
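A small worked example of the entropy question, using invented toy counts for the translations of "crane" (the numbers are assumptions for illustration only, not data from the project). It compares the entropy of the translation distribution with the conditional entropy once the semantic class is known.

```python
import math
from collections import Counter, defaultdict

# Invented toy counts of (semantic class, Spanish translation) for "crane".
counts = {
    ("animal", "grulla"): 40,
    ("animal", "grua"): 2,
    ("machine", "grua"): 55,
    ("machine", "grulla"): 3,
}


def entropy(dist):
    """Shannon entropy in bits of a count distribution."""
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values() if c)


def translation_entropy(counts):
    """H(T): entropy of the translation, ignoring the semantic class."""
    marginal = Counter()
    for (_, translation), c in counts.items():
        marginal[translation] += c
    return entropy(marginal)


def conditional_entropy(counts):
    """H(T | C): expected entropy of the translation once the class is known."""
    by_class = defaultdict(Counter)
    for (cls, translation), c in counts.items():
        by_class[cls][translation] += c
    total = sum(counts.values())
    return sum(sum(d.values()) / total * entropy(d) for d in by_class.values())


print(f"H(T)     = {translation_entropy(counts):.3f} bits")
print(f"H(T | C) = {conditional_entropy(counts):.3f} bits")
```

Conditioning can never raise entropy on average, so the interesting empirical question is how large the drop is on real parallel text.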

Language Modeling: data sparseness again. We need to estimate Pr(w_3 | w_1 w_2). If we have never seen w_1 w_2 w_3 before, can we instead develop a model and estimate Pr(w_3 | C_1 C_2) or Pr(C_3 | C_1 C_2), where C_i is the semantic class of w_i?
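One way to picture the fallback from word contexts to semantic-class contexts is the sketch below. It uses unsmoothed relative frequencies and a crude "use the class context only when the word context is unseen" rule, so it is not a properly normalized back-off model. The `WORD_CLASS` map and the training sentences are hypothetical.

```python
from collections import Counter

# Hypothetical word-to-class map; unknown words act as their own class.
WORD_CLASS = {
    "astronauts": "HUMANS", "pilots": "HUMANS",
    "dodge": "AVOID", "avoid": "AVOID",
}


def train(sentences):
    """Count word trigrams and class-context trigrams over tokenized sentences."""
    word_tri, word_bi, class_tri, class_bi = Counter(), Counter(), Counter(), Counter()
    for words in sentences:
        classes = [WORD_CLASS.get(w, w) for w in words]
        for i in range(len(words) - 2):
            w1, w2, w3 = words[i:i + 3]
            c1, c2 = classes[i], classes[i + 1]
            word_tri[(w1, w2, w3)] += 1
            word_bi[(w1, w2)] += 1
            class_tri[(c1, c2, w3)] += 1
            class_bi[(c1, c2)] += 1
    return word_tri, word_bi, class_tri, class_bi


def prob(model, w1, w2, w3):
    """Estimate Pr(w3 | w1 w2), falling back to Pr(w3 | C1 C2) for unseen contexts."""
    word_tri, word_bi, class_tri, class_bi = model
    if word_bi[(w1, w2)]:
        return word_tri[(w1, w2, w3)] / word_bi[(w1, w2)]
    c1, c2 = WORD_CLASS.get(w1, w1), WORD_CLASS.get(w2, w2)
    if class_bi[(c1, c2)]:
        return class_tri[(c1, c2, w3)] / class_bi[(c1, c2)]
    return 0.0


model = train([
    ["astronauts", "dodge", "satellite"],
    ["astronauts", "dodge", "debris"],
    ["pilots", "dodge", "satellite"],
])
# "pilots avoid" never occurs in training, but its class context HUMANS AVOID does.
print(prob(model, "pilots", "avoid", "satellite"))
```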

Overview (CORPUS): Noun phrases identified; head nouns identified; people marked; locations, dates, currencies, and organizations also marked.

Overview: Human-annotated with semantic tags (noun phrases only).

Overview: Test portion; training portion; machine learning to improve this.

The Environment: GATE is an environment which conforms to the TIPSTER architecture. It provides many tools for processing language and a standard method for managing documents and any new information associated with them.

GATE: Documents have annotations. [Figure: a document with the span GEORGE BUSH highlighted.] GEORGE BUSH at offset is a person.

There may be more than one annotation. [Figure: a document with the span "The ruthless criminal" highlighted.] criminal at offset is a human, is a noun, and is the head of a noun phrase.
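The idea of several annotations over the same span can be sketched with a small stand-off data structure. This is a hypothetical Python illustration, not GATE's actual API: the `Annotation` and `Document` classes, their methods, and the example text are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int            # character offset where the span begins
    end: int              # character offset just past the span
    kind: str             # annotation type, e.g. "Human" or "NounPhrase"
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, span, kind, **features):
        """Annotate the first occurrence of `span`; the text itself is never modified."""
        start = self.text.index(span)
        annotation = Annotation(start, start + len(span), kind, features)
        self.annotations.append(annotation)
        return annotation

    def covering(self, offset):
        """Return every annotation whose span covers the given offset."""
        return [a for a in self.annotations if a.start <= offset < a.end]


doc = Document("The ruthless criminal escaped.")
doc.annotate("criminal", "Human")
doc.annotate("criminal", "Token", category="noun")
doc.annotate("The ruthless criminal", "NounPhrase", head="criminal")

for annotation in doc.covering(doc.text.index("criminal")):
    print(annotation.kind, annotation.start, annotation.end)
```

Keeping annotations separate from the text is what lets any number of tools add information about the same span without interfering with each other.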

Documents belong to collections (a corpus in GATE). Collections can be loaded into GATE. New collections can be created. Documents can be added or removed. Applications can run over whole collections.

Applications and processing resources: Programs (tools) can be loaded into GATE. An application is a pipeline formed from some of these tools. In the demo, you will see two applications.

ANNIE, with defaults: sentence splitter, POS tagger, NE recognizer, tokenizer, plus more.
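The notion of an application as a pipeline of tools run over a corpus can be sketched in a few lines. This is a toy Python illustration and does not use GATE or ANNIE: the two stand-in resources, the dictionary-based document format, and the ordering (tokenizer before sentence splitter) are assumptions made for the example.

```python
import re


def tokenize(doc):
    """Stand-in tokenizer: words and single punctuation marks."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def split_sentences(doc):
    """Stand-in sentence splitter: close a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for token in doc["tokens"]:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences
    return doc


def run(application, corpus):
    """Run every processing resource, in order, over every document."""
    for doc in corpus:
        for resource in application:
            resource(doc)
    return corpus


application = [tokenize, split_sentences]   # the pipeline
corpus = [{"text": "Bats fly at night. He swung the bat!"}]
run(application, corpus)
print(corpus[0]["sentences"])
```

Each resource reads what earlier resources added to the document and adds its own results, so the order of the pipeline matters.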

Using GATE in today’s lab: to view already processed documents; to process new documents. To process documents, you must have both an application and a corpus.

To learn more.. Tutorials, slides, downloadable versions for PC, Linux, Solaris, etc.

The lab: Follow the directions in /export/ws03sem/lab/gate.lab. Use the internet or Grolier to find paragraphs or documents about bats that fly and bats that hit a ball (cricket bat or baseball bat).

Which bat is it? Use the web texts as training data for the context; you can load them into GATE or use them as is. Try a bag-of-words approach.

The idea: texts about flying bats; texts about movable solid ones; NEW: "The pitcher held the bat firmly".
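One simple bag-of-words approach to this exercise is sketched below. The talk does not prescribe a scoring method; this sketch uses a Naive Bayes style log-likelihood with add-one smoothing and no class prior. The training sentences are invented stand-ins for the web texts, and the labels `animal` and `sport` are hypothetical.

```python
import math
import re
from collections import Counter


def bag(text):
    """Lowercased bag of words for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train(labelled_texts):
    """Build one word-count bag per label from the training texts."""
    model = {}
    for label, texts in labelled_texts.items():
        counts = Counter()
        for text in texts:
            counts.update(bag(text))
        model[label] = counts
    return model


def classify(model, text):
    """Pick the label whose bag gives the text the highest smoothed log-likelihood."""
    vocabulary = set().union(*model.values())
    words = bag(text)
    best_label, best_score = None, -math.inf
    for label, counts in model.items():
        total = sum(counts.values())
        score = sum(
            n * math.log((counts[w] + 1) / (total + len(vocabulary)))
            for w, n in words.items()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train({
    "animal": [
        "The bat is a flying mammal that hunts insects at night",
        "Bats roost in caves and use echolocation to fly in the dark",
    ],
    "sport": [
        "The batter swung the bat and hit the ball over the fence",
        "The pitcher threw the ball and the cricket bat struck it",
    ],
})
print(classify(model, "The pitcher held the bat firmly"))
```

With real web texts the training sets would be far larger, and stemming or stop-word removal would likely change which context words carry the decision.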

Resources: Porter Stemmer; GATE. Can collect trigrams or bigrams from the training data.
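Collecting bigrams or trigrams from training text takes only a few lines. The sketch below assumes whitespace tokenization and lowercasing, with no stemming; the function names and the two example sentences are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield successive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(sentences, n):
    """Count n-grams within each sentence (never across sentence boundaries)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


training = ["the bat flew", "the bat hit the ball"]
print(count_ngrams(training, 2).most_common(3))
print(count_ngrams(training, 3).most_common(3))
```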

Comments: A very primitive approach to the problem. Use your work to say which kind of ‘bat’ is used in the text bat.txt. Try your same technique for ‘seal’; there is a file called seal.txt to test on.

Finally: If you are very brave, can you find the semantic classes for ‘chicken’ in the chicken.txt file? Careful: this one has a lot of metaphorical use. Have fun!

Tag Set: Longman’s Dictionary (LDOCE). 2000-word defining vocabulary. 34 semantic categories over subject codes. Over 5000 combination markings. Gives us 85% coverage of NPs but only contains 35% of the vocabulary.

WordNet: Developed at Princeton (George Miller). About the same coverage on a sample. Defines synsets instead of senses. Arranged with ‘IS A’ relations, which can serve as semantic categories. The English acts as an interlingua to EuroWordNet.

Corpus: BNC, 100 million words, mostly written, POS tagged with CLAWS. English side of parallel texts, possibly 80 million words, aligned: some French, some Chinese, some Arabic. Or possibly UN data supplied by the MT team.

Evaluation: This must be decided before July. Baselines should be presented for the opening talk. The closing talk should include the baseline plus as many measures of improvement as we can come up with.

Closing presentation: One half day for each of the three projects. Each person should plan to talk. One part of the team should be devoted to this aspect of the project.

Evaluation, suggested focus: We focus on showing that we can lower the entropy for MT.

Techniques: Basically two possibilities. (1) Extend techniques from disambiguation for assigning semantic category and then subject area (word focused). (2) Use machine learning to learn about the contexts and features of a particular semantic category, then tag those (semantic category focused).

Today: 12-1 Roberto and Fabio (machine learning; WordNet and conceptual density; LDOCE-WordNet correspondence). 1-2 Lunch. 2-3:30 Tagging texts and discussion. 3:30-5:30 GATE tutorial.

Tomorrow: Annotation tool. Division of labor. Plan Rome meeting. End at 1:00.

Why do it? Text extraction, lexicography, summarization, machine translation, language modeling.