Dept. of Computer Science University of Liverpool


1 COMP527: Data Mining
M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool
Text Mining: Text-as-Language, March 26, 2009, Slide 1

2 COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

3 Today's Topics
Part of Speech Tagging
Deeper Parsing
Ontologies, Entity Extraction

4 Part of Speech Tagging
The first step is to tag each word with additional information about it, starting with the word's part of speech. This is often represented as:
word/XYZ
where XYZ is a short uppercase code standing for the part of speech of that particular word.
There isn't a single standard for these tags, but the Penn Treebank set is commonly used, sometimes with extensions.
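A minimal sketch of reading the word/XYZ notation back into (word, tag) pairs. The function name `parse_tagged` and the sample sentence are made up for illustration; the only assumption about the format is that the tag follows the last slash.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word that itself contains '/' survives.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("The/DT cat/NN sat/VBD on/IN the/DT mat/NN")
print(tagged)
```

Splitting on the last slash rather than the first keeps tokens such as 1/2/CD intact as the word "1/2" with tag CD.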

5 Part of Speech Tagging
Part of speech tagging aims to find the part of speech of a word, e.g. noun, verb, adjective, adverb, conjunction, determiner, pronoun, preposition.
Noun: Naming word (cat, dog, Rob, television)
Verb: Action word (jump, fish, sell, watch, say)
Adjective: Qualifier for noun (big, red, expensive, hungry)
Adverb: Qualifier for verb (quickly, noisily, efficiently)
These are 'content' words (and hence ones to keep!)

6 Part of Speech Tagging
The other four main types of part of speech:
Conjunction: Joining words (and, hence, then)
Determiner: Articles (the, a, an)
Pronoun: Word that stands for a noun (he, it, them, her)
Preposition: 'Direction' words (to, with, in, of)
These are 'function' words (and hence ones to throw away!)
But PoS taggers have a much larger set of classes.
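The keep/throw-away split can be sketched as a filter on tag prefixes. This assumes Penn Treebank tags, where noun tags start with NN, verb tags with VB, adjective tags with JJ and adverb tags with RB; the names `CONTENT_PREFIXES` and `content_words` are hypothetical.

```python
# Penn Treebank tag prefixes for the four 'content' classes.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def content_words(tagged):
    """Keep only words whose tag marks a noun, verb, adjective or adverb."""
    return [word for word, tag in tagged if tag.startswith(CONTENT_PREFIXES)]


tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("my", "PRP$"), ("mat", "NN")]
print(content_words(tagged))  # ['cat', 'sat', 'mat']
```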

7 Part of Speech Tagging
Common Penn Treebank Tags: (table not reproduced in this transcript)

8 Examples
“This is a crazy sentence written by Rob”
this/DT is/VBZ a/DT crazy/JJ sentence/NN written/VBN by/IN Rob/NNP
“Where have all the flowers gone?”
Where/WRB have/VBP all/PDT the/DT flowers/NNS gone/VBN ?/.
“The cat sat on my house's welcome mat”
The/DT cat/NN sat/VBD on/IN my/PRP$ house/NN 's/POS welcome/JJ mat/NN
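For reference, the tags used in these examples can be glossed as below. The meanings are the standard Penn Treebank ones; the dictionary covers only the tags that appear on this slide, and the helper name `gloss` is hypothetical.

```python
# Meanings of the Penn Treebank tags used in the three example sentences.
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PDT": "predeterminer",
    "POS": "possessive ending",
    "PRP$": "possessive pronoun",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WRB": "wh-adverb",
    ".": "sentence-final punctuation",
}


def gloss(tagged_sentence):
    """Return (word, tag, meaning) rows for a 'word/TAG ...' string."""
    rows = []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        rows.append((word, tag, PENN_TAGS.get(tag, "unknown tag")))
    return rows


for word, tag, meaning in gloss("Where/WRB have/VBP all/PDT the/DT flowers/NNS gone/VBN ?/."):
    print(f"{word:8} {tag:5} {meaning}")
```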

9 Taggers
About 90% of words have only one part of speech, so a simple dictionary lookup will get ~90% accuracy; a model is then needed to predict the other words.
Of the words that have more than one PoS, about ½ are noun/verb, and >80% are noun/verb or noun/adjective.
One could build a decision tree tagger using this information, with some simple additional rules for non-dictionary words and words with multiple PoS.
Brill built a rule-based tagger along these lines, with some additional training methods to correct errors. It gets 95% accuracy using about 70 rules.
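The idea of dictionary lookup, fallback rules for unknown words, and a correction pass can be sketched as follows. This is a toy illustration, not Brill's tagger: the lexicon, the suffix rules and the single hand-written correction rule (NN becomes VB after TO) are all assumptions, whereas a real system learns its rules from a tagged corpus.

```python
# Hypothetical toy dictionary: word -> its most common tag.
LEXICON = {"the": "DT", "dog": "NN", "fish": "NN", "wants": "VBZ", "to": "TO"}

# Correction rules: (previous tag, current tag) -> corrected tag.
PATCHES = {("TO", "NN"): "VB"}


def guess_unknown(word):
    """Fallback rules for words missing from the dictionary."""
    if word[0].isupper():
        return "NNP"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    if word.endswith("s"):
        return "NNS"
    return "NN"


def tag(words):
    # Pass 1: dictionary lookup, falling back to the suffix rules.
    tags = [LEXICON.get(w.lower()) or guess_unknown(w) for w in words]
    # Pass 2: patch tags using the tag of the preceding word.
    for i in range(1, len(tags)):
        tags[i] = PATCHES.get((tags[i - 1], tags[i]), tags[i])
    return list(zip(words, tags))


print(tag(["Rob", "wants", "to", "fish"]))
```

Here "fish" is first tagged NN from the dictionary and then corrected to VB because it follows "to". The capitalisation rule would mis-tag an unknown sentence-initial word as NNP; a fuller sketch would special-case position 0.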

10 Word Sense Disambiguation
Just because we know the PoS doesn't mean we know what the word means:
The state has lowered its tax rate.
The company is in a weak financial state.
The solid state of water is ice.
Can you state your opinion? (yes, this use is a verb)
E.g. region, condition, state of matter, make a statement.
Latent Semantic Indexing or clustering techniques can be used to implement WSD (e.g. financial + state vs solid/gas/liquid + state).
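The intuition that neighbouring words pick out the sense can be sketched with a much simpler stand-in than LSI or clustering: count how many hand-written cue words for each sense appear in the sentence. The cue lists and the names `SENSE_CUES` and `disambiguate` are hypothetical; a real system would derive such associations from a corpus.

```python
# Hypothetical cue words for three noun senses of "state".
SENSE_CUES = {
    "region": {"tax", "government", "law", "governor"},
    "condition": {"financial", "weak", "poor", "company"},
    "state of matter": {"solid", "liquid", "gas", "water", "ice"},
}


def disambiguate(sentence, target="state"):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = {w.strip(".,?!").lower() for w in sentence.split()}
    words.discard(target)
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # No overlapping cue words at all: decline to guess.
    return best if scores[best] > 0 else None


print(disambiguate("The company is in a weak financial state."))  # condition
```

The verb use ("Can you state your opinion?") matches no cue words and returns None; in practice the PoS tag would already separate that case.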

11 Word Sense Disambiguation
WordNet is a very commonly used 'dictionary' from Princeton.
It includes links to synonyms (grouped into sets called synsets):
[dog, canine, puppy, hound, ...]
and to higher/lower semantic categories:
dog – canine – carnivore – mammal – animal – living thing
It can be used to help disambiguate word senses and link words together.
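Linking words through higher categories can be sketched with a hand-built parent table. This does not use WordNet itself: the `HYPERNYM` table is a toy assumption that copies the dog chain from the slide and adds a hypothetical cat/feline branch for comparison.

```python
# Toy hierarchy: term -> its more general parent category.
HYPERNYM = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",  # hypothetical extra branch
    "carnivore": "mammal", "mammal": "animal", "animal": "living thing",
}


def chain(term):
    """Walk upwards from a term to the most general category."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def common_ancestor(a, b):
    """Return the most specific category shared by both terms, if any."""
    ancestors_of_a = chain(a)
    for term in chain(b):
        if term in ancestors_of_a:
            return term
    return None


print(" – ".join(chain("dog")))
print(common_ancestor("dog", "cat"))  # carnivore
```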

12 Deep Parsing
That's all very nice, but ... so what?
For Data Mining, knowing the part of speech for a word could be useful, but it's just the first step in Text Mining.
Step 2 is to extend the parsing to find phrases and how the verbs and nouns match up.
These systems find sequences of words, often called 'chunks', based on the parts of speech.
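Finding chunks from parts of speech can be sketched as a scan for one simple pattern: an optional determiner or possessive pronoun, any adjectives, then one or more nouns. The pattern and the name `np_chunks` are assumptions for illustration; real chunkers use richer grammars or trained models.

```python
def np_chunks(tagged):
    """Group [DT|PRP$]? JJ* NN+ runs of (word, tag) pairs into noun-phrase chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] in ("DT", "PRP$"):
            j += 1  # optional determiner or possessive pronoun
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1  # any number of adjectives
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):
            k += 1  # one or more nouns
        if k > j:
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1  # no noun found here, move on one token
    return chunks


sentence = [("The", "DT"), ("big", "JJ"), ("dog", "NN"),
            ("bites", "VBZ"), ("the", "DT"), ("man", "NN")]
print(np_chunks(sentence))  # ['The big dog', 'the man']
```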

13 Deep Parsing
(TOP
  (S
    (NP (DT the) (NN cat))                    // noun phrase
    (VP (VBD sat)                             // verb phrase start
      (PP (IN on)
        (NP                                   // second noun phrase
          (NP (PRP$ my) (NN house) (POS 's))  // sub phrase
          (JJ welcome) (NN mat))))))
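Bracketed trees like this one can be read into nested lists with a small recursive parser. This is a sketch that assumes balanced brackets and no comments in the input; the names `parse_tree` and `leaves` are hypothetical.

```python
import re


def parse_tree(text):
    """Turn '(S (NP (DT the)))' into nested lists: ['S', ['NP', ['DT', 'the']]]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing bracket
        return node

    return read()


def leaves(node):
    """Collect (tag, word) pairs from the lowest-level nodes, left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        return [(node[0], node[1])]
    pairs = []
    for child in node[1:]:
        pairs.extend(leaves(child))
    return pairs


tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
print(leaves(tree))  # [('DT', 'the'), ('NN', 'cat'), ('VBD', 'sat')]
```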

14 Deep Parsing
We can also find out which verbs and which nouns go together, and in which roles. E.g. does the man bite the dog, or the dog bite the man?
“The dog bites the man”
ROOT  ROOT  ROOT  ROOT  ROOT  bites
bites bite  VBZ   VB    ARG1  dog
bites bite  VBZ   VB    ARG2  man
The   the   DT    DT    ARG1  dog
the   the   DT    DT    ARG1  man
So arg1 of bites is dog, and arg2 is man. Dog Bites Man.
What if we mix it up a little...

15 Deep Parsing
“The man was bitten by the dog”
ROOT   ROOT  ROOT  ROOT  ROOT  bitten
bitten bite  VBN   VB    ARG1  dog
bitten bite  VBN   VB    ARG2  man
The    the   DT    DT    ARG1  man
the    the   DT    DT    ARG1  dog
by     by    IN    IN    ARG1  dog
was    be    VBD   VB    ARG1  man
was    be    VBD   VB    ARG2  bitten
So arg1 of bites is still dog, and arg2 is still man. Dog still Bites Man, even with the language mixed up a bit.
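Reading the arguments of a verb out of such parser output can be sketched as below. The column meanings (word, base form, PoS, base PoS, relation, argument) are an assumption inferred from the rows on the slides, and the name `arguments` is hypothetical.

```python
# Rows for the verb and auxiliary, copied from the active and passive sentences.
ACTIVE = """\
bites bite VBZ VB ARG1 dog
bites bite VBZ VB ARG2 man"""

PASSIVE = """\
bitten bite VBN VB ARG1 dog
bitten bite VBN VB ARG2 man
was be VBD VB ARG1 man
was be VBD VB ARG2 bitten"""


def arguments(rows, base):
    """Map relation -> argument for every row whose base form matches."""
    args = {}
    for line in rows.splitlines():
        word, lemma, pos, base_pos, relation, argument = line.split()
        if lemma == base:
            args[relation] = argument
    return args


print(arguments(ACTIVE, "bite"))
print(arguments(PASSIVE, "bite"))
```

Keying on the base form "bite" rather than the surface word is what makes the active and passive sentences yield the same arguments.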

16 Ontologies
We have a Proper Noun indicator (NNP). Proper nouns name a specific person, place or thing. These may have some unique identifier, or other associated information. It may exist in an ontology (or database with a similar purpose).
Ontologies are classification systems that describe objects and the relationships between them. For example, a directory of people could be an ontology.
With some hints as to the type of object, we can look it up in these databases to find out extra information about it in an easily machine-understandable format.

17 Ontologies
At this point we have most of the picture in place. We can identify entities and find out additional information, and know which entities act in which way on other entities.
A common ontology meta-language is OWL (Web Ontology Language). It is in turn built on RDF (Resource Description Framework).
RDF's main advantage is that it allows inference engines to traverse a large graph of objects and relations, which is the next step in generating correlations between objects and relationships extracted from the text.
(See also Semantic Web etc.)
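Traversing a graph of objects and relations can be sketched with plain (subject, predicate, object) tuples. This is not RDF or OWL syntax and uses no RDF library; the facts, the predicate names `is_a` and `subclass_of`, and the function `types_of` are all hypothetical.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = {
    ("Rex", "is_a", "dog"),
    ("dog", "subclass_of", "mammal"),
    ("mammal", "subclass_of", "animal"),
    ("Rex", "bites", "Rob"),
    ("Rob", "is_a", "person"),
}


def types_of(entity, triples):
    """Follow is_a, then subclass_of edges, to find every class of the entity."""
    found = set()
    frontier = [o for s, p, o in triples if s == entity and p == "is_a"]
    while frontier:
        cls = frontier.pop()
        if cls not in found:
            found.add(cls)
            frontier.extend(o for s, p, o in triples
                            if s == cls and p == "subclass_of")
    return found


print(sorted(types_of("Rex", TRIPLES)))  # ['animal', 'dog', 'mammal']
```

The stored facts only say Rex is a dog; that Rex is also an animal is inferred by walking the graph.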

18 Named Entity Recognition
Find 'entities' to extract from text, typically to then look up in an ontology.
E.g. "City University of New York" vs "John Smith of New York".
We can list countries, even major cities, but how do we determine which 'Springfield' is being discussed? We need context, etc.
The problems are similar to PoS tagging in terms of sequence analysis, and so are the solutions.
Classes might be: person, organisation, place, time, date, not-an-entity.
Can use HMMs, rule-based systems, etc.
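One rule-based approach, sketched here as an assumption rather than a recommended method, is a list lookup that always prefers the longest match, so that "New York" inside "City University of New York" is not reported separately. The `GAZETTEER` entries and the name `find_entities` are hypothetical, and punctuation handling is omitted.

```python
# Hypothetical entity list: tuple of tokens -> entity class.
GAZETTEER = {
    ("City", "University", "of", "New", "York"): "organisation",
    ("New", "York"): "place",
    ("John", "Smith"): "person",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(text):
    """Scan left to right, taking the longest listed span at each position."""
    tokens = text.split()
    entities, i = [], 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + length])
            if span in GAZETTEER:
                entities.append((" ".join(span), GAZETTEER[span]))
                i += length
                break
        else:
            i += 1  # no entity starts here
    return entities


print(find_entities("City University of New York"))
print(find_entities("John Smith of New York"))
```

A lookup like this cannot tell which 'Springfield' is meant; that still needs context, which is where sequence models such as HMMs come in.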

19 Events
Temporal events described in text are also important to recognise. For example, 'the dog bites the man' is a one-off event that happens at a particular time. It might be useful to see that the man then went to hospital after being bitten.
Event extraction can be done via rules, which will be specific to a domain: templates of different noun types and verbs in given orders that are important to recognise.
For example, a person getting a new job title at a company is indicative of a promotion or new acquisition.
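A template of this kind can be sketched as an ordered list of labels matched against already-labelled tokens. The label scheme (entity classes plus `verb:<base form>`), the `JOB_CHANGE` template and the sample sentence are all hypothetical.

```python
# Hypothetical template: these labels must occur in this order.
JOB_CHANGE = ["person", "verb:join", "organisation", "title"]


def match_event(tokens, template):
    """Return slot fillers if the labels occur in template order, else None."""
    fillers, want = [], 0
    for text, label in tokens:
        if want < len(template) and label == template[want]:
            fillers.append(text)
            want += 1
    if want < len(template):
        return None  # some slot was never filled
    return dict(zip(template, fillers))


tokens = [("John Smith", "person"), ("joined", "verb:join"),
          ("Acme Ltd", "organisation"), ("as", "other"),
          ("Chief Engineer", "title")]
print(match_event(tokens, JOB_CHANGE))
```

Tokens with other labels are skipped, so the template tolerates filler words between slots but still requires the slots to appear in order.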

20 Summarisation
If the system can extract events, entities, etc., then the 'next' step is to summarise the findings from across the various bits of information extracted.
(How nice would that be!)
(Though it is being worked on)

21 Further Reading
Konchady
Weiss
Anything on Natural Language Processing

