Introduction to Text Mining


Introduction to Text Mining
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Graduate School of Library & Information Science, Statistics, and Institute for Genomic Biology
University of Illinois, Urbana-Champaign

Outline
- Overview of Text Mining
- IR-Style Text Mining Techniques
- NLP-Style Text Mining Techniques
- ML-Style Text Mining Techniques

Two Definitions of “Mining”
- Goal-oriented (effectiveness driven; NLP, AI)
  - Any process that generates useful results that are non-obvious is called “mining”
  - Keywords: “useful” + “non-obvious”
  - Data isn’t necessarily massive
- Method-oriented (efficiency driven; DB, IR)
  - Any process that involves extracting information from massive data is called “mining”
  - Keywords: “massive” + “pattern”
  - Patterns aren’t necessarily useful

What is Text Mining?
- Data Mining View: Explore patterns in textual data
  - Find latent topics
  - Find topical trends
  - Find outliers and other hidden patterns
- Natural Language Processing View: Make inferences based on partial understanding of natural language text
  - Information extraction
  - Question answering

Applications of Text Mining
- Direct applications
  - Discovery-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer the questions?
  - Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?
- Indirect applications
  - Assist information access (e.g., discover latent topics to better summarize search results)
  - Assist information organization (e.g., discover hidden structures)

Text Mining Methods
- Data Mining Style: View text as high dimensional data
  - Frequent pattern finding
  - Association analysis
  - Outlier detection
- Information Retrieval Style: Fine granularity topical analysis
  - Topic extraction
  - Exploit term weighting and text similarity measures
  - Question answering
- Natural Language Processing Style: Information Extraction
  - Entity extraction
  - Relation extraction
  - Sentiment analysis
- Machine Learning Style: Unsupervised or semi-supervised learning
  - Generative models
  - Dimension reduction
  - Classification & prediction

IR-Style Techniques for Text Mining

Some “Basic” IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/Unigram representation of text
- Text similarity (e.g., cosine, KL-div)
- Relevance/pseudo feedback (e.g., Rocchio)
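
Several of the basic techniques listed on this slide (stop words, TF-IDF weighting, vector representation, cosine similarity) fit in a few lines. The sketch below is illustrative only: the stop list, the smoothed IDF variant, and all function names are assumptions rather than anything the slides specify, and stemming is left out.

```python
import math
from collections import Counter

# Toy stop list (an assumption; real systems use much longer lists).
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF, one common variant among several.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "Text mining explores patterns in text.",
    "Mining patterns in massive text data.",
    "The weather is sunny today.",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

The first two documents share terms and so score above zero, while the third shares none with the first and scores exactly zero.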

Generality of Basic Techniques
[Figure: raw text passes through stemming & stop words to give tokenized text; term weighting then gives a weight matrix with rows d1 … dm (documents), columns t1 … tn (terms), and entries w11 … wmn. Other labels in the figure: term similarity, doc, CLUSTERING, sentence selection, SUMMARIZATION, vector centroid, CATEGORIZATION, META-DATA/ANNOTATION.]

Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization

Information Filtering
- Stable & long term interest, dynamic info source
- System must make a delivery decision immediately as a document “arrives”
- Two Methods: Content-based vs. Collaborative
[Figure: a Filtering System holding “my interest: …”]
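
As a rough illustration of the content-based method, the sketch below keeps a fixed interest profile and makes a deliver/skip decision for each document as it arrives. The class name, the raw term-count vectors, and the threshold value are all assumptions chosen for brevity. Collaborative filtering, which relies on other users' judgments, is not shown.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (lowercased, surrounding punctuation stripped)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class ContentFilter:
    """Hypothetical content-based filter with a stable interest profile."""

    def __init__(self, interest_text, threshold=0.2):
        self.profile = vectorize(interest_text)
        self.threshold = threshold  # assumed value; tuned in practice

    def decide(self, document):
        """Decide immediately whether to deliver the arriving document."""
        return cosine(self.profile, vectorize(document)) >= self.threshold

news_filter = ContentFilter("text mining information retrieval")
stream = [
    "New text mining methods for information retrieval",
    "Local team wins the football match",
]
decisions = [news_filter.decide(doc) for doc in stream]
```

A feedback step such as Rocchio could update `profile` after each delivery; that is omitted here.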

Examples of Information Filtering
- News filtering
- Email filtering
- Recommender systems
- Literature alert
- And many others

Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
[Figure: labeled examples for Sports, Business, Education, Science, … feed a Categorization System, which assigns new documents to Sports, Business, Education, …]
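
One simple way to treat categorization as supervised learning over term vectors is a nearest-centroid classifier: build one vector per category from the labeled examples and assign a new document to the most similar one. This is a minimal sketch under assumptions (raw term counts, cosine similarity, invented training sentences), not necessarily the method the slides have in mind.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term counts (lowercased, surrounding punctuation stripped)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centroids(labeled_docs):
    """Sum the term vectors per category.

    Cosine ignores vector length, so the sum points in the same
    direction as the mean (centroid) and can be used in its place.
    """
    sums = defaultdict(Counter)
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
    return dict(sums)

def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Invented toy training set, for illustration only.
training = [
    ("The team won the match in overtime", "Sports"),
    ("The striker scored two goals", "Sports"),
    ("Shares fell as the market closed", "Business"),
    ("The company reported strong quarterly profit", "Business"),
]
centroids = train_centroids(training)
label_a = classify("The market rallied and company shares rose", centroids)
label_b = classify("The team scored in the match", centroids)
```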

Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic Email sorting
- Web page classification

The Clustering Problem
- Discover “natural structure”
- Group similar objects together
- Objects can be documents, terms, or passages
- Example

Similarity-induced Structure

Examples of Doc/Term Clustering
- Clustering of retrieval results
- Clustering of documents in the whole collection
- Term clustering to define “concept” or “theme”
- Automatic construction of hyperlinks
- In general, very useful for text mining
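
To make "group similar objects together" concrete, here is a sketch of single-pass (leader) clustering over term vectors. It is only one simple option; the slides do not name an algorithm, and the threshold, function names, and sample documents are assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (lowercased)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def leader_cluster(texts, threshold=0.3):
    """Assign each text to the most similar existing cluster, or start a new one.

    Returns a list of clusters, each a list of indices into `texts`.
    The result depends on input order, a known limitation of this scheme.
    """
    clusters = []  # each: {"sum": summed term vector, "members": [indices]}
    for i, text in enumerate(texts):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["sum"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"sum": Counter(vec), "members": [i]})
        else:
            best["sum"].update(vec)
            best["members"].append(i)
    return [c["members"] for c in clusters]

docs = [
    "stock market prices fall",
    "football match results",
    "stock prices rise on market news",
    "football team wins match",
]
groups = leader_cluster(docs)
```

The same routine would cluster terms instead of documents if each term were represented by the vector of documents it occurs in.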

“Retrieval-based” Summarization
- Observation: term vector → summary?
- Basic approach
  - Rank “sentences”, and select top N as a summary
- Methods for ranking sentences
  - Based on term weights
  - Based on position of sentences
  - Based on the similarity of sentence and document vector
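
The third ranking method, similarity between the sentence vector and the document vector, can be sketched as follows. The sentence splitter, the stop list, and the sample document are simplifying assumptions. The selected sentences are returned in their original order.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"in", "was", "the", "a", "of", "and"}  # toy stop list (assumption)

def vectorize(text):
    """Term counts over lowercase alphabetic tokens, minus stop words."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower())
                   if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def summarize(document, n=1):
    """Rank sentences by similarity to the whole document; keep the top n."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document)
                 if s.strip()]
    doc_vec = vectorize(document)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectorize(sentences[i]), doc_vec),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

document = ("Text mining finds patterns in text. "
            "Clustering groups similar text documents. "
            "Lunch was late today.")
summary_1 = summarize(document, n=1)
summary_2 = summarize(document, n=2)
```

Position-based ranking could be combined with this by adding a bonus for early sentences; that is not shown.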

Examples of Summarization
- News summary
- Summarize retrieval results
  - Single doc summary
  - Multi-doc summary
- Summarize a cluster of documents (automatic label creation for clusters)

NLP-Style Text Mining Techniques
(Most of the following slides are from William Cohen’s IE tutorial)

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering

Example text:
  October 14, 2002, 4:00 a.m. PT
  For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
  Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
  "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
  Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

Landscape of IE Tasks: Complexity
How complicated a modeling technique will you have to use? E.g. word patterns:
- Closed set: U.S. states
  - He was born in Alabama…
  - The big Wyoming sky…
- Regular set: U.S. phone numbers
  - Phone: (413) 545-1323
  - The CALD main office can be reached at 412-268-1299
- Complex pattern: U.S. postal addresses
  - University of Arkansas P.O. Box 140 Hope, AR 71802
  - Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
- Ambiguous patterns, needing context and many sources of evidence: Person names
  - …was among the six houses sold by Hope Feldman that year.
  - Pawel Opalinski, Software Engineer at WhizBang Labs.
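
The two simplest cases on this slide need very little machinery: a closed set can be handled by membership lookup and a regular set by a regular expression. The sketch below is illustrative; the lexicon is truncated to four states, the phone pattern covers only the two formats shown in the slide's examples, and the function names are assumptions.

```python
import re

# Truncated lexicon for illustration; a full list has 50 entries, and
# multi-word names such as "New York" would need phrase matching.
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Matches "(413) 545-1323" and "412-268-1299" style numbers only.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.])\d{3}[-.]\d{4}\b")

def find_states(text):
    """Closed set: return the tokens that are members of the lexicon."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]

def find_phones(text):
    """Regular set: return the substrings matching the phone pattern."""
    return PHONE_RE.findall(text)

states_1 = find_states("He was born in Alabama")
states_2 = find_states("The big Wyoming sky")
phones_1 = find_phones("Phone: (413) 545-1323")
phones_2 = find_phones("The CALD main office can be reached at 412-268-1299")
```

Postal addresses and person names resist this treatment, which is the slide's point: "Hope" is a town in one example and a first name in the other, so context is needed to tell them apart.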

Landscape of IE Techniques
Example sentence used for each technique: Abraham Lincoln was born in Kentucky.
- Classify Pre-segmented Candidates (Classifier: which class?)
- Lexicons (member? Alabama, Alaska, …, Wisconsin, Wyoming)
- Sliding Window (Classifier: which class? Try alternate window sizes)
- Boundary Models (Classifier: which class? BEGIN / END)
- Finite State Machines (Most likely state sequence?)
- Context Free Grammars (Most likely parse? Labels: NNP, V, P, NP, PP, VP, S)
Any of these models can be used to capture words, formatting or both.
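
The sliding-window idea can be sketched on the slide's own example sentence: enumerate token windows of several sizes and ask a classifier which class each one belongs to. In this sketch a toy lexicon lookup stands in for a trained classifier, so it is closer to the "Lexicons" technique in what it decides; the lexicon contents, labels, and function names are hypothetical.

```python
# Hypothetical toy lexicon standing in for a trained classifier.
LEXICON = {"abraham lincoln": "PERSON", "kentucky": "LOCATION"}

def classify_window(tokens):
    """Return a class label for the window, or None if it is not an entity."""
    return LEXICON.get(" ".join(tokens).lower())

def sliding_window_extract(sentence, max_window=3):
    """Try every window of 1..max_window tokens and keep the labeled ones.

    Returns (start_index, window_text, label) tuples sorted by position.
    """
    tokens = sentence.rstrip(".").split()
    found = []
    for size in range(max_window, 0, -1):  # alternate window sizes
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            label = classify_window(window)
            if label:
                found.append((start, " ".join(window), label))
    return sorted(found)

entities = sliding_window_extract("Abraham Lincoln was born in Kentucky.")
```

A learned classifier would typically also look at the tokens just before and after the window, which a plain lexicon lookup cannot do.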

Statistical Learning Style Techniques for Text Mining

Many Techniques are Available
- Supervised learning
  - Classification
  - Regression
- Unsupervised learning
  - Topic models
  - Dimension reduction
- Most relevant methods
  - Generative models
  - Matrix decomposition

Topics for Discussion
- Social Science research questions
  - Mining bias: selection bias, framing bias
- Text Mining techniques
  - Sentiment analysis
  - Topic discovery and evolution graph
  - Joint text-image analysis