Presentation transcript:

Slide 1: COMP527: Data Mining
M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool, 2009
Text Mining: Text-as-Language
March 26, 2009

Slide 2: COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Slide 3: Today's Topics
Part of Speech Tagging
Deeper Parsing
Ontologies, Entity Extraction

Slide 4: Part of Speech Tagging
The first step is to tag each word with additional information about it. The first piece of information is the word's part of speech. This is often represented as:

word/XYZ

where XYZ is a short uppercase code standing for the part of speech of that particular word. There is no single standard for these tags, but the Penn Treebank set is commonly used, sometimes with extensions.
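As a minimal sketch of how the word/XYZ notation can be read back into a program, the following splits a tagged string into (word, tag) pairs. The function name `parse_tagged` is an assumption made for illustration and is not part of the course material.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


print(parse_tagged("The/DT cat/NN sat/VBD on/IN my/PRP$ mat/NN"))
```

Splitting on the last slash rather than the first is a small design choice: it keeps tokens such as a fraction written with a slash attached to their single trailing tag.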

Slide 5: Part of Speech Tagging
Part of speech tagging aims to find the part of speech of a word, e.g. noun, verb, adjective, adverb, conjunction, determiner, pronoun, preposition.

Noun: naming word (cat, dog, Rob, television)
Verb: action word (jump, fish, sell, watch, say)
Adjective: qualifier for a noun (big, red, expensive, hungry)
Adverb: qualifier for a verb (quickly, noisily, efficiently)

These are 'content' words (and hence ones to keep!)

Slide 6: Part of Speech Tagging
The other four main types of part of speech:

Conjunction: joining words (and, hence, then)
Determiner: articles (the, a, an)
Pronoun: word that stands for a noun (he, it, them, her)
Preposition: 'direction' words (to, with, in, of)

These are 'function' words (and hence ones to throw away!)

But PoS taggers have a much larger set of classes.
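The keep/throw-away split between content and function words can be sketched as a filter over tagged tokens. This assumes Penn Treebank tags, where noun, verb, adjective and adverb tags begin with NN, VB, JJ and RB respectively; the names `CONTENT_PREFIXES` and `content_words` are hypothetical.

```python
# Penn Treebank tag prefixes treated as 'content' classes in this sketch:
# NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def content_words(tagged):
    """Keep only the (word, tag) pairs whose tag marks a content word."""
    return [(w, t) for w, t in tagged if t.startswith(CONTENT_PREFIXES)]


tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
          ("my", "PRP$"), ("welcome", "JJ"), ("mat", "NN")]
print(content_words(tagged))
```

A prefix test is only a rough approximation: it also keeps auxiliary verbs such as "is", which a real pipeline might prefer to drop with a stop-word list.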

Slide 7: Part of Speech Tagging
Common Penn Treebank Tags: (table not preserved in this transcript)

Slide 8: Examples
“This is a crazy sentence written by Rob”
This/DT is/VBZ a/DT crazy/JJ sentence/NN written/VBN by/IN Rob/NNP

“Where have all the flowers gone?”
Where/WRB have/VBP all/PDT the/DT flowers/NNS gone/VBN ?/.

“The cat sat on my house's welcome mat”
The/DT cat/NN sat/VBD on/IN my/PRP$ house/NN 's/POS welcome/JJ mat/NN

Slide 9: Taggers
About 90% of words have only one part of speech, so a simple dictionary lookup will get ~90% accuracy; a model is then needed to predict the remaining words. Of the words that have more than one PoS, about half are noun/verb, and more than 80% are noun/verb or noun/adjective.

One could build a decision tree tagger using this information, with some simple additional rules for non-dictionary words and words with multiple PoS. Brill did this, with some additional training methods to correct errors. His tagger gets 95% accuracy using about 70 rules.
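The dictionary-lookup baseline can be sketched as follows. This is only the lookup step, not Brill's tagger: it assigns each known word its most frequent tag from a tiny hand-made corpus and falls back to a default tag for unknown words. The corpus, the function names and the choice of NN as the default are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_lookup(tagged_corpus):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def tag(words, lexicon, default="NN"):
    """Tag by dictionary lookup; words not in the lexicon get the default tag."""
    return [(w, lexicon.get(w.lower(), default)) for w in words]


# Toy training data, invented for this example.
corpus = [("the", "DT"), ("dog", "NN"), ("bites", "VBZ"), ("the", "DT"),
          ("man", "NN"), ("state", "NN"), ("state", "NN"), ("state", "VB")]
lexicon = train_lookup(corpus)
print(tag(["The", "state", "bites", "Rob"], lexicon))
```

The ambiguous word "state" always receives its majority tag here, which is exactly the kind of error that the additional rules described on the slide are meant to correct.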

Slide 10: Word Sense Disambiguation
Just because we know the PoS doesn't mean we know what the word means:

The state has lowered its tax rate.
The company is in a weak financial state.
The solid state of water is ice.
Can you state your opinion? (yes, this use is a verb)

E.g. the senses here are: region, condition, state of matter, make a statement.

Latent Semantic Indexing or clustering techniques can be used to implement WSD (e.g. financial + state vs solid/gas/liquid + state).

Slide 11: Word Sense Disambiguation
WordNet is a very commonly used 'dictionary' from Princeton. It includes links to synonyms (grouped into sets called synsets):

[dog, canine, puppy, hound, ...]

and to higher/lower semantic categories:

dog – canine – carnivore – mammal – animal – living thing

It can be used to help disambiguate word senses and link words together.
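To illustrate how a chain of higher categories can link words together, here is a small sketch over a hand-built table. It does not use WordNet itself: the `HYPERNYM` dictionary copies the dog chain from the slide and adds a hypothetical cat/feline branch purely so that two words have something to share.

```python
# Toy 'is-a' links. The dog chain follows the slide; the cat/feline
# entries are an assumption added for this example.
HYPERNYM = {
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "mammal": "animal", "animal": "living thing",
    "cat": "feline", "feline": "carnivore",
}


def ancestors(word):
    """Return the chain of ever more general categories above a word."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def lowest_common_category(a, b):
    """First category shared by both words, or None if there is none."""
    seen = [a] + ancestors(a)
    for category in [b] + ancestors(b):
        if category in seen:
            return category
    return None


print(ancestors("dog"))
print(lowest_common_category("dog", "cat"))
```

The more specific the shared category, the more closely related two words can be taken to be, which is one way such a hierarchy helps in choosing between senses.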

Slide 12: Deep Parsing
That's all very nice, but ... so what?

For Data Mining, knowing the part of speech of a word could be useful, but it is just the first step in Text Mining. Step 2 is to extend the parsing to find phrases and how the verbs and nouns match up. These phrases are often called 'chunks'; chunking systems find sequences of words based on their parts of speech.
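A chunker that finds word sequences from their parts of speech can be sketched with one hand-written pattern. The pattern used here (an optional determiner, any number of adjectives, then one or more nouns) is an assumption chosen for illustration, not a rule taken from the lecture, and real chunkers use richer grammars or trained models.

```python
def np_chunks(tagged):
    """Find noun-phrase chunks matching: optional DT, any JJ, one or more NN*."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":            # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                           # at least one noun was found
            chunks.append([w for w, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks


sentence = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
            ("the", "DT"), ("welcome", "JJ"), ("mat", "NN")]
print(np_chunks(sentence))
```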

Slide 13: Deep Parsing
(TOP
  (S
    (NP (DT the) (NN cat))                    // noun phrase
    (VP (VBD sat)                             // verb phrase start
      (PP (IN on)
        (NP                                   // second noun phrase
          (NP (PRP$ my) (NN house) (POS 's))  // sub phrase
          (JJ welcome) (NN mat))))))

Slide 14: Deep Parsing
We can also find out which verbs and which nouns correlate, and in which order. E.g., does the man bite the dog, or the dog bite the man?

“The dog bites the man”

ROOT   ROOT  ROOT  ROOT  ROOT  bites
bites  bite  VBZ   VB    ARG1  dog
bites  bite  VBZ   VB    ARG2  man
The    the   DT    DT    ARG1  dog
the    the   DT    DT    ARG1  man

So arg1 of bites is dog, and arg2 is man: Dog Bites Man.

What about if we mix it up a little...

Slide 15: Deep Parsing
“The man was bitten by the dog”

ROOT    ROOT  ROOT  ROOT  ROOT  bitten
bitten  bite  VBN   VB    ARG1  dog
bitten  bite  VBN   VB    ARG2  man
The     the   DT    DT    ARG1  man
the     the   DT    DT    ARG1  dog
by      by    IN    IN    ARG1  dog
was     be    VBD   VB    ARG1  man
was     be    VBD   VB    ARG2  bitten

So arg1 of bite is still dog, and arg2 is still man. Dog still Bites Man, even with the language mixed up a bit.
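The point that active and passive sentences yield the same arguments can be checked with a short sketch over the parser rows. It assumes the six columns are word, lemma, tag, base tag, relation and argument; the transcript does not name the columns, so that reading and the function name `predicate_arguments` are assumptions.

```python
def predicate_arguments(rows, lemma):
    """Collect the ARG1/ARG2 fillers of every row whose lemma matches."""
    return {rel: arg for _, row_lemma, _, _, rel, arg in rows
            if row_lemma == lemma}


# Rows for "The dog bites the man" (ROOT row omitted).
active = [
    ("bites", "bite", "VBZ", "VB", "ARG1", "dog"),
    ("bites", "bite", "VBZ", "VB", "ARG2", "man"),
    ("The", "the", "DT", "DT", "ARG1", "dog"),
    ("the", "the", "DT", "DT", "ARG1", "man"),
]

# Rows for "The man was bitten by the dog" (ROOT row omitted).
passive = [
    ("bitten", "bite", "VBN", "VB", "ARG1", "dog"),
    ("bitten", "bite", "VBN", "VB", "ARG2", "man"),
    ("The", "the", "DT", "DT", "ARG1", "man"),
    ("the", "the", "DT", "DT", "ARG1", "dog"),
    ("by", "by", "IN", "IN", "ARG1", "dog"),
    ("was", "be", "VBD", "VB", "ARG1", "man"),
    ("was", "be", "VBD", "VB", "ARG2", "bitten"),
]

print(predicate_arguments(active, "bite"))
print(predicate_arguments(passive, "bite"))
```

Matching on the lemma rather than the surface word is what lets "bites" and "bitten" be treated as the same predicate.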

Slide 16: Ontologies
We have a proper noun indicator (NNP). Proper nouns name a specific person, place or thing. These may have a unique identifier or other associated information, and the entity may exist in an ontology (or a database with a similar purpose).

Ontologies are classification systems that describe objects and the relationships between them. For example, a directory of people could be an ontology.

With some hints as to the type of object, we can look it up in these databases to find extra information about it in an easily machine-understandable format.

Slide 17: Ontologies
At this point we have most of the picture in place. We can identify entities and find out additional information about them, and we know which entities act in which way on other entities.

A common ontology meta-language is OWL (Web Ontology Language). It in turn uses RDF (Resource Description Framework).

RDF's main advantage is that it allows inference engines to traverse a large graph of objects and relations, which is the next step in generating correlations between objects and relationships extracted from the text. (See also Semantic Web etc.)
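The idea of traversing a graph of objects and relations can be sketched with plain (subject, predicate, object) tuples. This does not use an RDF library or real RDF syntax, and the triples and relation names below are invented for illustration.

```python
# Hypothetical triples; none of these facts come from the lecture.
TRIPLES = [
    ("Rob", "worksFor", "University of Liverpool"),
    ("University of Liverpool", "locatedIn", "Liverpool"),
    ("Liverpool", "locatedIn", "England"),
]


def reachable(start, triples):
    """Return every object reachable from start by following any relation."""
    found, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for subject, _, obj in triples:
            if subject == node and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


print(sorted(reachable("Rob", TRIPLES)))
```

A real inference engine would also look at which relation is being followed, for example treating only some relations as transitive, rather than following every edge as this sketch does.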

Slide 18: Named Entity Recognition
Find 'entities' to extract from text, typically to then look up in an ontology.

E.g. "City University of New York" vs "John Smith of New York". We can list countries, even major cities, but how do we determine which 'Springfield' is being discussed? This needs context, etc.

The problems are similar to those of PoS tagging in terms of sequence analysis, and the same sorts of solutions apply. Classes might be: person, organisation, place, time, date, not-an-entity. HMMs, rule-based methods, etc. can be used.
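One simple rule-based approach is a list lookup that prefers the longest match, which is enough to separate the two New York examples. The sketch below assumes a tiny hand-made list (`GAZETTEER`) and is not an HMM or any method prescribed by the lecture. It cannot resolve cases like 'Springfield', which need context.

```python
# Hypothetical entity list keyed by token tuples.
GAZETTEER = {
    ("City", "University", "of", "New", "York"): "organisation",
    ("New", "York"): "place",
    ("John", "Smith"): "person",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(tokens):
    """Scan left to right, taking the longest listed span at each position."""
    entities, i = [], 0
    while i < len(tokens):
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + size])
            if span in GAZETTEER:
                entities.append((" ".join(span), GAZETTEER[span]))
                i += size
                break
        else:
            i += 1   # no entity starts here; move on one token
    return entities


print(find_entities("City University of New York".split()))
print(find_entities("John Smith of New York".split()))
```

Trying the longest span first is what stops "New York" from being extracted as a place inside the name of the university.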

Slide 19: Events
Temporal events described in text are also important to recognise. For example, 'the dog bites the man' is a one-off event that happens at a particular time. It might be useful to be able to see that the man then went to hospital after being bitten.

Event extraction can be done via rules, which will be specific to a domain: templates of different noun types and verbs in given orders that are important to recognise. For example, a person getting a new job title at a company is indicative of a promotion or new acquisition.
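A template rule of this kind can be sketched as a fixed sequence of labels matched against already-labelled text. The label names, the template and the example sentence are all hypothetical; a real system would derive the labels from the tagging and entity steps and would need many such templates per domain.

```python
# Hypothetical template: person, a 'become' verb, a job title, 'at', an organisation.
TEMPLATE = ["person", "verb:become", "title", "prep:at", "organisation"]


def match_job_change(items):
    """items is a list of (text, label) pairs produced by earlier steps."""
    labels = [label for _, label in items]
    size = len(TEMPLATE)
    for i in range(len(items) - size + 1):
        if labels[i:i + size] == TEMPLATE:
            person, _, title, _, org = (text for text, _ in items[i:i + size])
            return {"event": "job_change", "person": person,
                    "title": title, "organisation": org}
    return None


sentence = [("Yesterday", "other"), ("John Smith", "person"),
            ("became", "verb:become"), ("Senior Consultant", "title"),
            ("at", "prep:at"), ("Acme", "organisation")]
print(match_job_change(sentence))
```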

Slide 20: Summarisation
If the system can extract events, entities, etc., then the 'next' step is to summarise the findings from across the various pieces of information extracted.

(How nice would that be!)
(Though it is being worked on)

Slide 21: Further Reading
Konchady
Weiss
Anything on Natural Language Processing