M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool. COMP527: Data Mining. Text Mining: Challenges, Basics. March 24, 2009.

Presentation transcript:

M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Text Mining: Challenges, Basics March 24, 2009 Slide 1 COMP527: Data Mining

COMP527: Data Mining

- Introduction to the Course
- Introduction to Data Mining
- Introduction to Text Mining
- General Data Mining Issues
- Data Warehousing
- Classification: Challenges, Basics
- Classification: Rules
- Classification: Trees
- Classification: Trees 2
- Classification: Bayes
- Classification: Neural Networks
- Classification: SVM
- Classification: Evaluation
- Classification: Evaluation 2
- Regression, Prediction
- Input Preprocessing
- Attribute Selection
- Association Rule Mining
- ARM: A Priori and Data Structures
- ARM: Improvements
- ARM: Advanced Techniques
- Clustering: Challenges, Basics
- Clustering: Improvements
- Clustering: Advanced Algorithms
- Hybrid Approaches
- Graph Mining, Web Mining
- Text Mining: Challenges, Basics
- Text Mining: Text-as-Data
- Text Mining: Text-as-Language
- Revision for Exam

Today's Topics

- Text Representation
- What is a word?
- Dimensionality Reduction
- Text Mining vs Data Mining on Text

Representation of Documents

Basic goal: data mining on documents. Each document must be an instance.

First problem: what are the attributes of a document?
- Easy attributes: format, length in bytes, and potentially some metadata extractable from headers or file properties (author, date, etc.).
- Harder attributes: how to usefully represent the text? The basic idea is to treat each word as an attribute, either boolean (is present / is not present) or numeric (number of times the word occurs).
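
The word-as-attribute idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: words are taken to be whitespace-separated and case is ignored, and the function names and the two sample documents are hypothetical, not part of the lecture.

```python
from collections import Counter

def tokenize(text):
    # Naive assumption: words are whitespace-separated and case is ignored.
    return text.lower().split()

def build_vocabulary(documents):
    # Give every distinct word an attribute index, in alphabetical order.
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def to_vector(text, vocabulary, boolean=False):
    # One attribute per vocabulary word: presence (0/1) or occurrence count.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = 1 if boolean else count
    return vector

# Hypothetical two-document corpus.
docs = ["the dog bites the man", "the man bites back"]
vocab = build_vocabulary(docs)
numeric = [to_vector(d, vocab) for d in docs]
boolean = [to_vector(d, vocab, boolean=True) for d in docs]
print(vocab)
print(numeric)
print(boolean)
```

Each document becomes one instance whose length equals the vocabulary size, which is exactly why the vectors grow so quickly on a real corpus.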

Representation of Documents

Second problem: we will have a LOT of false/0 attribute values in our instances. The result is a very sparse matrix that requires a LOT of storage space.

1,000,000 documents x 500,000 different words = 500,000,000,000 entries
- ~= 60 gigabytes if each entry is stored as a single bit
- ~= 1 terabyte (about 930 GiB) if each entry is stored as a short integer (2 bytes)

Google's dictionary has 5 million words, times 18 billion web pages... (Process that, WEKA!)
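
The storage estimate is easy to check with a short calculation. The sketch below uses only the document and word counts given on the slide; the variable names and the helper `to_gib` are illustrative.

```python
# Size of a dense document-term matrix, using the slide's own counts.
documents = 1_000_000
vocabulary = 500_000
entries = documents * vocabulary

bytes_as_bits = entries // 8     # one bit per entry
bytes_as_shorts = entries * 2    # two bytes (short integer) per entry

def to_gib(n_bytes):
    """Convert a byte count to binary gigabytes (GiB, 2**30 bytes)."""
    return n_bytes / 2**30

print(f"{entries:,} entries")
print(f"bits:   {bytes_as_bits / 1e9:.1f} GB ({to_gib(bytes_as_bits):.1f} GiB)")
print(f"shorts: {bytes_as_shorts / 1e9:.1f} GB ({to_gib(bytes_as_shorts):.1f} GiB)")
```

One bit per entry comes to 62.5 GB (58.2 GiB), and two bytes per entry to 1,000 GB (931.3 GiB), so the exact figure depends on whether decimal or binary gigabytes are meant.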

Representation of Documents

Store only the true values: (1, 4, 100, 212, 13948)
Or the true values with their frequency: (1:3, 4:1, 100:1, 212:3, 13948:4)

For compressed storage, we can store the differences between consecutive attribute indices:
Attribs in order: 1,2,4,5,7,10,15,18, ,...
Intervals: 1,1,2,1,2,3,5,3,... 6,...
With frequency: 1,4,1,3,2,5,1,3,2,6,3,3,...

- The intervals are small, so they can (almost) always be stored as short integers.
- Regular compression algorithms will be efficient on this sequence.
- Reorder the attributes by frequency rather than alphabetically.
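
The interval (gap) encoding can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names are hypothetical, the first interval is assumed to be measured from zero, and the frequencies are read off the slide's interleaved example sequence.

```python
def to_intervals(indices):
    # Replace sorted attribute indices by the gap to the previous index.
    # Assumption: the first gap is measured from 0.
    previous = 0
    intervals = []
    for index in indices:
        intervals.append(index - previous)
        previous = index
    return intervals

def from_intervals(intervals):
    # Recover the original indices with a running sum.
    total = 0
    indices = []
    for gap in intervals:
        total += gap
        indices.append(total)
    return indices

def interleave(intervals, frequencies):
    # Flatten (interval, frequency) pairs into a single sequence.
    flat = []
    for gap, frequency in zip(intervals, frequencies):
        flat.extend((gap, frequency))
    return flat

attribs = [1, 2, 4, 5, 7, 10, 15, 18]
intervals = to_intervals(attribs)
print(intervals)                    # [1, 1, 2, 1, 2, 3, 5, 3]
print(from_intervals(intervals))    # back to the original indices
print(interleave(intervals[:6], [4, 3, 5, 3, 6, 3]))
```

Because the gaps stay small even when the indices themselves grow large, the encoded sequence is both compact per entry and repetitive, which is what makes general-purpose compression effective on it.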

Input?

That's nice... but WEKA needs ARFF!

Problems with toolkits: they won't accept a sparse input format.

Classification algorithms:
- Rules: possible, but unlikely
- Trees: less likely than unlikely
- Bayes: fine, especially Multinomial Bayes
- Bayesian Networks: maybe... but too many possible networks
- SVM: fine
- NN: not fine! Tooooo many nodes
- Perceptron/Winnow: see NN, but more feasible as there is no hidden layer
- KNN: very slow without data structures, due to the number of comparisons

Input?

Overall problem for text classification: accurate models with high dimensionality are impossible for humans to understand (eg SVM, Multinomial Naive Bayes).

Association Rule Mining: fine for the presence of a word, but how to represent word frequency? Is Classification Association Rule Mining possibly a good solution for understandability?

Clustering: very high dimensionality is a problem for many algorithms, especially those with lots of comparisons (eg partitioning algorithms).

Document Types

First problem: we need to be able to extract data from the file. Processing is very different for different file types, eg: XML, HTML, RSS, Word, Open Document, PDF, RTF, LaTeX, ...

We may want to treat different parts of the document separately, eg: title vs authors vs abstract vs text vs references.

We want to normalise texts into semantic areas across different formats, eg an abstract in PDF is several lines, but in ODF it is an XML element, and in LaTeX it is surrounded by one or more \verb{} commands...

Term Extraction

Requirement: Extract words from text. What is a 'word'?
- Obvious 'words' (eg consecutive non-space characters)
- Number (1, $ vs 64.0)
- Hyphenated (book case vs book-case vs bookcase) but also for ranges: or "New York-New Jersey"
- URI and more complicated
- Punctuation (Rob's vs 'Robs' vs Robs' vs Robs)
- Dates as single token?
- Non-alphanumeric characters: AT&T, Yahoo! ...

Term Extraction

Requirement: Extract 'words' from text.
- The period character is problematic: end of sentence, end of abbreviation, internal to acronyms (but not always present), internal to numbers (with 2 different meanings), dotted quad notation (eg: )
- Emoticons :( :) >:( =>
- Need extra processing for diacritics? eg: é ë ç etc.
- Might want to use phrases as attributes ('with respect to', 'data mining' etc.), but it is complicated to determine appropriate phrases.
- Expand abbreviations? Expand acronyms?
- Expand ranges? ( means all years, not just end points)
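
A regular-expression tokenizer shows how a few of these cases can be handled, and how many remain open. The sketch is an assumption-laden illustration: the pattern, the name `tokenize` and the sample sentences (including the example.com address) are hypothetical, and dates, emoticons, ranges and abbreviations are deliberately not covered.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN = re.compile(r"""
    https?://\S+            # URIs kept as one token (may swallow trailing punctuation)
  | \d+(?:\.\d+)+           # numbers with internal periods, eg 64.0
  | \w+(?:[-'&]\w+)*        # words, allowing internal hyphen, apostrophe or ampersand
""", re.VERBOSE)

def tokenize(text):
    # Return every non-overlapping match, left to right.
    return TOKEN.findall(text)

print(tokenize("AT&T sold a book-case for 64.0 dollars. Rob's idea!"))
print(tokenize("see http://example.com/a now"))
```

Even this small pattern already encodes decisions (keep "book-case" whole, keep the possessive in "Rob's", drop sentence-final periods) that a different application might reasonably make the other way.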

Dimensionality Reduction

Requirement: Reduce number of words. (Dimensionality reduction)

Many words are useless for distinguishing a document. We don't want to store non-useful words...
- a, an, the, these, those, them, they...
- of, with, to, in, towards, on...
- while, however, because, also, who, when, where...

This gives a long list of words to ignore, called 'stopwords'.

BUT... “The Who” -- Band? Stopwords?

Part-of-speech filtering is more accurate but more expensive.
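
Stopword removal is simple to sketch, and the sketch also reproduces the “The Who” problem. The stopword set below contains only the words listed on the slide; the function name and the sample sentence are hypothetical.

```python
# Stopword list taken from the examples on the slide (a real list is much longer).
STOPWORDS = {
    "a", "an", "the", "these", "those", "them", "they",
    "of", "with", "to", "in", "towards", "on",
    "while", "however", "because", "also", "who", "when", "where",
}

def remove_stopwords(tokens):
    # Compare case-insensitively, but keep the surviving tokens unchanged.
    return [token for token in tokens if token.lower() not in STOPWORDS]

tokens = "The Who played in Liverpool".split()
print(remove_stopwords(tokens))   # the band name disappears entirely
```

A purely list-based filter cannot tell the band from the two function words, which is the motivation for the more expensive part-of-speech filtering.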

Dimensionality Reduction

Requirement: Normalise terms for consistency and dimensionality reduction.
- Normally we want to ignore case, eg 'computer' and 'Computer' should be the same attribute. But acronyms are different: ram vs RAM, us vs US.
- Normally we want to use word stems, eg 'computer' and 'computers' should be the same attribute. The Porter algorithm relies on suffix stripping, but note that 'ram' could be a noun or a verb... Also, stems can be meaningless: "datum mine"
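
Both normalisation steps can be illustrated with a deliberately crude sketch. This is not the Porter algorithm: the suffix list, the minimum stem length of three characters and the all-uppercase test for acronyms are assumptions made purely for illustration.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ers", "er", "es", "s")

def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalise(token, keep_acronyms=True):
    # Assumption: an all-uppercase token of two or more letters is an acronym.
    if keep_acronyms and len(token) > 1 and token.isupper():
        return token
    return naive_stem(token.lower())

for token in ["Computer", "computers", "RAM", "ram", "US", "us", "mining"]:
    print(token, "->", normalise(token))
```

'Computer' and 'computers' both collapse to the same attribute, 'comput', which also shows that a stem need not be a real word; 'RAM' and 'ram' remain distinct.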

Dimensionality Reduction

We can use simple statistics to reduce dimensionality:
- If a word appears evenly in all classes, then it doesn't distinguish any particular class, is not useful for classification, and can be ignored, eg 'the'.
- Equally, a word that appears in only one document will be perfectly discriminating, but is also probably over-fitting.
- Words that appear in most documents (regardless of class distribution) are also unlikely to be useful.
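
The last two rules amount to thresholds on document frequency, which can be sketched as below. The thresholds `min_df` and `max_df_ratio`, the function names and the toy corpus are all hypothetical; the first rule (even spread across classes) needs class labels and is not covered here.

```python
from collections import Counter

def document_frequency(docs):
    # Count, for every term, the number of documents it occurs in.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def select_terms(docs, min_df=2, max_df_ratio=0.8):
    # Keep terms that are neither too rare nor present in most documents.
    df = document_frequency(docs)
    n_docs = len(docs)
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

# Hypothetical, already tokenized corpus.
docs = [
    ["the", "dog", "bites"],
    ["the", "cat", "bites"],
    ["the", "dog", "sleeps"],
    ["the", "bird", "sings"],
]
print(select_terms(docs))
```

Here 'the' is dropped for appearing in every document and the single-document words are dropped as likely over-fitting, leaving only 'bites' and 'dog'.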

Text Mining vs Data Mining

Data Mining: Discover hidden models to describe the data.
Text Mining: Discover hidden facts within bodies of text.

Completely different approaches:
- DM tries to generalise all of the data into a single model without getting caught up in over-fitting.
- TM tries to understand the details, and to cross-reference between individual instances.

Text Mining vs Data Mining

Text Mining uses Natural Language Processing techniques to 'understand' the data. It tries to understand the semantics of the text (information) rather than treating it as a big bag of sequences of characters.

Major processes:
- Part of Speech Tagging
- Phrase Chunking
- Deep Parsing
- Named Entity Recognition
- Information Extraction

Text Mining vs Data Mining

Part of Speech Tagging: Tag each word with its part of speech (noun, verb, adjective etc.). This is a classification problem, but it is essential for understanding the text, especially the verbs.

Phrase Chunking: Discover sequences of words that constitute phrases, eg noun phrase, verb phrase, prepositional phrase. Also essential, in order to discover clauses rather than working with individual words.

Text Mining vs Data Mining

Deep Parsing: Discover the structure of the clauses and the participants in verbs etc., eg Dog bites man, not man bites dog. Essential as the first step where the semantics are really used.

Named Entity Recognition: Discover 'entities' within the text and tag them with the same identifier, eg Magnesium and Mg are the same; Bush, President Bush, G.W. Bush, Dubya, the President, are all the same. Essential for correlation of entities.

Text Mining vs Data Mining

Information Extraction: With all the previous information, find all of the information about each entity from all occurrences within all clauses. Remove duplicates and find correlations. Look for interesting correlations, perhaps according to some set of rules for what is interesting.

Actually, this is an impossibly large task given a reasonable set of text, and the interestingness of 'new' facts is often very low.

Text Mining vs Data Mining

DM is crucial for TM, eg for correct classification of part of speech. But TM processes are also important for accurate dimensionality reduction in DM on texts. Eg:
- Every word: average of 100 attributes per vector, 85.7% accuracy over 10 classes with SVM.
- Same data, with linguistic stems and filtered for noun, verb and adjective: average of 64 attributes per vector, 87.2% accuracy.

Further Reading

- Baeza-Yates, Modern Information Retrieval
- Weiss, Chapters 2, 4
- Berry, Survey of Text Mining, Chapter 5 (He gets around, doesn't he?!)
- Konchady
- Witten, Managing Gigabytes