Data Mining: Text Mining


Mining Text Data: An Introduction (Data Mining / Knowledge Discovery)

The same underlying information can be represented as structured data, multimedia, free text, or hypertext:

Structured data: HomeLoan ( Loanee: Frank Rizzo; Lender: MWF; Agency: Lake View; Amount: $200,000; Term: 15 years )
Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...
Multimedia: Loans($200K, [map], ...)

Throughout this course we have been discussing data mining over a variety of data types. Two types we covered earlier were structured (relational) data and multimedia data. Today and in the last class we have been discussing data mining over free text, and our next section will cover hypertext, such as web pages. Text mining is well motivated because much of the world's data exists as free text (newspaper articles, e-mails, literature, etc.), so there is a great deal of information available to mine. While mining free text has the same goals as data mining in general (extracting useful knowledge, statistics, and trends), it must overcome a major difficulty: free text has no explicit structure. Machines can reason well with relational data because schemas are explicitly available; free text, by contrast, encodes all semantic information within natural language. Text mining algorithms must therefore make sense of this natural-language representation. Humans are good at this, but it has proved to be a hard problem for machines.

Bag-of-Tokens Approaches

Feature extraction turns documents into token sets. For example, the opening of the Gettysburg Address ("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or ...") reduces to token counts such as: nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1, ...

The previous text mining presentations "made sense" of free text by viewing it as a bag of tokens (words, n-grams), which is the same approach used in IR. Under that model we can already summarize, classify, cluster, and compute co-occurrence statistics over free text, and these operations are quite useful for mining and managing large volumes of documents. However, there is potential to do much more. The bag-of-tokens approach loses a great deal of the information contained in text, such as word order, sentence structure, and context: it discards all order-specific information and severely limits context. These are precisely the features humans use to interpret text, so the natural question is whether we can do better.
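
As a minimal illustration of bag-of-tokens feature extraction (not part of the original slides; the stop-word list here is an arbitrary placeholder), a few lines of Python suffice:

    import re
    from collections import Counter

    STOP_WORDS = {"a", "and", "are", "in", "of", "on", "or", "that", "the", "this", "to", "we"}  # placeholder list

    def bag_of_tokens(text):
        """Lowercase, split on non-letter characters, drop stop words, count the rest."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return Counter(t for t in tokens if t not in STOP_WORDS)

    counts = bag_of_tokens("Four score and seven years ago our fathers brought forth on this continent ...")
    print(counts.most_common(5))

Note that all counting is order-free, which is exactly the information the slide warns is lost.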

Natural Language Processing

Example sentence: "A dog is chasing a boy on the playground."

Lexical analysis (part-of-speech tagging): label each word with its part of speech (Det, Noun, Aux, Verb, Prep, ...).
Syntactic analysis (parsing): group the tagged words into a noun phrase, a complex verb, a prepositional phrase, a verb phrase, and finally a sentence.
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Scared(x) if Chasing(_,x,_), hence Scared(b1).
Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back.

NLP, also called computational linguistics, is an entire field dedicated to the study of automatically understanding free text; it has been active since the 1950s. General NLP attempts to understand a document completely, at the level of a human reader, and involves the several steps listed above.

(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
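
A small illustrative sketch of the lexical-analysis step, assuming the NLTK library and its tokenizer/tagger resources are installed (not part of the original slides):

    import nltk

    nltk.download("punkt")                        # tokenizer model (one-time download)
    nltk.download("averaged_perceptron_tagger")   # POS tagger model (one-time download)

    sentence = "A dog is chasing a boy on the playground"
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # output along the lines of [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'), ...]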

Parsing: choose the most likely parse tree

A probabilistic context-free grammar (PCFG) pairs a grammar with a lexicon and attaches a probability to every rule (the slide shows values such as 1.0, 0.3, 0.4, ..., 0.01, 0.003):

Grammar: S → NP VP; NP → Det BNP; NP → BNP; NP → NP PP; BNP → N; VP → V; VP → Aux V NP; VP → VP PP; PP → P NP; ...
Lexicon: V → chasing; Aux → is; N → dog; N → boy; N → playground; Det → the; Det → a; P → on

For "A dog is chasing a boy on the playground", the parser enumerates candidate trees and picks the most probable one (on the slide, one tree has probability 0.000015 and a competing tree 0.000011).

Parsing attempts to infer the precise grammatical relationships between the words in a given sentence: parts of speech are grouped into phrases, and phrases are combined into sentences. Approaches include parsing with probabilistic CFGs, "link dictionaries" (link grammars), and tree-adjoining techniques (supertagging). Current techniques can only parse at the sentence level, in some cases reporting accuracy in the 90% range; performance depends heavily on the grammatical correctness and the degree of ambiguity of the text.

(Adapted from ChengXiang Zhai, CS 397cxz – Fall 2003)
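
A runnable sketch of PCFG parsing using NLTK's Viterbi parser; the grammar below is a toy version of the slide's grammar, and the rule probabilities are made up for illustration:

    import nltk

    toy_grammar = nltk.PCFG.fromstring("""
        S   -> NP VP      [1.0]
        NP  -> Det N      [0.8]
        NP  -> NP PP      [0.2]
        VP  -> Aux V NP   [0.6]
        VP  -> VP PP      [0.4]
        PP  -> P NP       [1.0]
        Det -> 'a' [0.6] | 'the' [0.4]
        N   -> 'dog' [0.4] | 'boy' [0.4] | 'playground' [0.2]
        Aux -> 'is' [1.0]
        V   -> 'chasing' [1.0]
        P   -> 'on' [1.0]
    """)

    parser = nltk.ViterbiParser(toy_grammar)
    sentence = "a dog is chasing a boy on the playground".split()
    for tree in parser.parse(sentence):   # yields the single most probable tree
        print(tree.prob())
        tree.pretty_print()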

Obstacles

Ambiguity: "A man saw a boy with a telescope." (Who has the telescope?) The Turkish sentence "Oku baban gibi cahil olma" is a similar example: depending on where the pause is placed, it reads roughly as "Study; don't be ignorant like your father" or as "Study like your father; don't be ignorant."

Computational intensity: NLP is expensive, which imposes a context horizon.

Text mining NLP approach: locate promising fragments using fast IR methods (bag-of-tokens), and apply the slow NLP techniques only to those promising fragments.

The biggest obstacle to sophisticated NLP is ambiguity; humans are quite skilled at inferring context and meaning, but machines are not. NLP is also expensive and can currently be performed only at small scale (per sentence, or on selected sentences), which further limits our ability to derive context from across the document. The current approach is therefore to use fast IR techniques (bag-of-tokens) to identify promising text fragments and then apply the more expensive NLP techniques only to those fragments (the same idea is used in multimedia mining).
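
A minimal sketch of this two-stage filter-then-analyze idea (the helper names and threshold are hypothetical; expensive_nlp stands in for whatever parsing or semantic analysis is applied):

    def promising(document, query_terms, min_overlap=2):
        """Cheap bag-of-tokens filter: keep documents sharing enough terms with the query."""
        return len(set(document.lower().split()) & query_terms) >= min_overlap

    def mine(documents, query_terms, expensive_nlp):
        # Stage 1: fast IR-style filtering over the whole collection.
        candidates = [d for d in documents if promising(d, query_terms)]
        # Stage 2: slow NLP applied only to the promising fragments.
        return [expensive_nlp(d) for d in candidates]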

Text Databases and IR

Text databases (document databases): large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, Web pages, and library databases. The stored data is usually semi-structured, and traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data.

Information retrieval: a field developed in parallel with database systems, in which information is organized into (a large number of) documents. The information retrieval problem is locating relevant documents based on user input, such as keywords or example documents.

Information Retrieval

Typical IR systems: online library catalogs and online document management systems.

Information retrieval vs. database systems: some DB problems are not present in IR (e.g., updates, transaction management, complex objects), while some IR problems are not addressed well in a DBMS (e.g., unstructured documents, approximate search using keywords and relevance).

Basic Measures for Text Retrieval

(Venn diagram on the slide: within the set of all documents, the relevant documents and the retrieved documents overlap in the region "relevant & retrieved".)

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
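
In set notation, with Relevant the set of relevant documents and Retrieved the set of retrieved documents:

    Precision = |Relevant ∩ Retrieved| / |Retrieved|
    Recall    = |Relevant ∩ Retrieved| / |Relevant|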

Measures for Text Retrieval (continued)

There is a trade-off between precision and recall: the two are inversely related. One commonly used metric that combines them is the F-score, the harmonic mean of precision and recall.
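
For precision P and recall R, the balanced F-score (F1) is:

    F = 2 * P * R / (P + R)

Because the harmonic mean is dominated by the smaller of the two values, a high F-score requires both precision and recall to be high.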

Information Retrieval Techniques

Basic concepts: a document can be described by a set of representative keywords called index terms. Different index terms have varying relevance when used to describe document contents; this effect is captured by assigning a numerical weight to each index term of a document (e.g., term frequency, tf-idf).

DBMS analogy: index terms correspond to attributes, and weights correspond to attribute values.
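
A minimal sketch of tf-idf weighting (one common variant among several; not part of the original slides):

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: list of token lists. Returns a {term: weight} dict per document,
        using raw term frequency and idf = log(N / document_frequency)."""
        N = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()} for doc in docs]

    docs = [["dog", "chases", "boy"], ["dog", "bites", "man"], ["man", "bites", "dog"]]
    print(tf_idf(docs))   # "dog" appears in every document, so its idf (and weight) is 0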

Information Retrieval Techniques (continued)

Index term (attribute) selection: stop lists, word stemming, index-term weighting methods, and term-by-document frequency matrices.

Information retrieval models: the Boolean model and the vector model.
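
An illustrative sketch of the vector model (not from the slides): documents and queries are represented as term-weight vectors, for example the hypothetical tf-idf weights above, and ranked by cosine similarity:

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse {term: weight} vectors."""
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def rank(query_vec, doc_vecs):
        """Return document indices sorted by decreasing similarity to the query."""
        return sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)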