1
Introduction to Text Mining
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Graduate School of Library & Information Science, Statistics, and Institute for Genomic Biology
University of Illinois, Urbana-Champaign
2
Outline
  Overview of Text Mining
  IR-Style Text Mining Techniques
  NLP-Style Text Mining Techniques
  ML-Style Text Mining Techniques
3
Two Definitions of “Mining”
Goal-oriented (effectiveness driven, NLP, AI)
  Any process that generates useful results that are non-obvious is called “mining”.
  Keywords: “useful” + “non-obvious”
  Data isn’t necessarily massive
Method-oriented (efficiency driven, DB, IR)
  Any process that involves extracting information from massive data is called “mining”.
  Keywords: “massive” + “pattern”
  Patterns aren’t necessarily useful
4
What is Text Mining?
Data Mining View: Explore patterns in textual data
  Find latent topics
  Find topical trends
  Find outliers and other hidden patterns
Natural Language Processing View: Make inferences based on partial understanding of natural language text
  Information extraction
  Question answering
5
Applications of Text Mining
Direct applications
  Discovery-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer the questions?
  Data-driven (WWW, literature, customer reviews, etc.): We have a lot of data; what can we do with it?
Indirect applications
  Assist information access (e.g., discover latent topics to better summarize search results)
  Assist information organization (e.g., discover hidden structures)
6
Text Mining Methods
Data Mining Style: View text as high dimensional data
  Frequent pattern finding
  Association analysis
  Outlier detection
Information Retrieval Style: Fine granularity topical analysis
  Topic extraction
  Exploit term weighting and text similarity measures
  Question answering
Natural Language Processing Style: Information Extraction
  Entity extraction
  Relation extraction
  Sentiment analysis
Machine Learning Style: Unsupervised or semi-supervised learning
  Generative models
  Dimension reduction
  Classification & prediction
7
IR-Style Techniques for Text Mining
8
Some “Basic” IR Techniques
Stemming
Stop words
Weighting of terms (e.g., TF-IDF)
Vector/Unigram representation of text
Text similarity (e.g., cosine, KL-div)
Relevance/pseudo feedback (e.g., Rocchio)
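The techniques listed above can be combined into a small pipeline. The following is a minimal sketch, not a reference implementation: it removes stop words, weights terms by TF-IDF, and compares documents by cosine similarity. The stop-word list, the smoothed IDF formula, and the sample documents are illustrative assumptions; stemming is left out to keep the sketch short.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # One common smoothed IDF variant; other formulas are equally valid.
    idf = {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "Text mining finds patterns in text.",
    "Mining text data reveals latent topics.",
    "The stock market fell sharply today.",
]
vecs = tfidf_vectors(docs)
```

With these sample documents, the first two share the terms "text" and "mining", so their cosine similarity is positive, while the first and third share no terms and score zero.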
9
Generality of Basic Techniques
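[Figure: the basic techniques as shared building blocks. Recoverable labels: Raw text; Stemming & Stop words; Tokenized text; Term Weighting; Term similarity; Doc; Sentence selection; Vector centroid; CLUSTERING; SUMMARIZATION; CATEGORIZATION; META-DATA/ANNOTATION]

Term weighting produces a document-term weight matrix:

        t1    t2    …    tn
  d1    w11   w12   …    w1n
  d2    w21   w22   …    w2n
  …     …     …
  dm    wm1   wm2   …    wmn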
10
Sample Applications
  Information Filtering
  Text Categorization
  Document/Term Clustering
  Text Summarization
11
Information Filtering
Stable & long term interest, dynamic info source
System must make a delivery decision immediately as a document “arrives”
Two Methods: Content-based vs. Collaborative
[Figure labels: my interest; Filtering System]
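The content-based method can be sketched as follows, under some simplifying assumptions: the user's stable interest is a short text, each arriving document is scored against it by cosine similarity over raw term counts, and the delivery decision is a fixed threshold. The class name `ContentFilter`, the threshold value, and the sample stream are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-count vector after lowercasing and stripping punctuation."""
    words = (t.strip(".,;:!?").lower() for t in text.split())
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

class ContentFilter:
    """Hypothetical content-based filter with a fixed delivery threshold."""

    def __init__(self, interest, threshold=0.2):
        self.profile = tf_vector(interest)  # stable, long-term interest
        self.threshold = threshold          # assumed value, would be tuned

    def decide(self, document):
        """Make the deliver/skip decision as soon as a document arrives."""
        return cosine(self.profile, tf_vector(document)) >= self.threshold

stream = [
    "New text mining methods for biomedical literature",
    "Local team wins the championship game",
]
news_filter = ContentFilter("text mining and information retrieval")
delivered = [doc for doc in stream if news_filter.decide(doc)]
```

Only the first document in the sample stream shares terms with the profile, so it is the only one delivered. A collaborative filter would instead rely on the decisions of users with similar interests, which this sketch does not cover.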
12
Examples of Information Filtering
News filtering
filtering
Recommender systems
Literature alert
And many others
13
Sample Applications
  Information Filtering
  Text Categorization
  Document/Term Clustering
  Text Summarization
14
Text Categorization
Pre-given categories and labeled document examples (categories may form a hierarchy)
Classify new documents
A standard supervised learning problem
[Figure labels: Categorization System; categories Sports, Business, Education, Science]
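As one possible instance of this supervised setup, the sketch below uses a nearest-centroid classifier: the term vectors of each category's labeled examples are summed into a centroid, and a new document is assigned to the category whose centroid is most similar. This is only one of many applicable classifiers, and the training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tf_vector(text):
    """Raw term-count vector after lowercasing and stripping punctuation."""
    words = (t.strip(".,;:!?").lower() for t in text.split())
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def train_centroids(labeled_docs):
    """Sum the term vectors of each category's labeled examples.

    Cosine similarity ignores vector length, so the sum works as a centroid.
    """
    centroids = defaultdict(Counter)
    for text, label in labeled_docs:
        centroids[label].update(tf_vector(text))
    return centroids

def classify(text, centroids):
    """Assign the category whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Invented training examples for two of the categories named on the slide.
training = [
    ("the team won the football match", "Sports"),
    ("the striker scored a late goal", "Sports"),
    ("the company reported quarterly profit", "Business"),
    ("shares rose after the merger deal", "Business"),
]
centroids = train_centroids(training)
label = classify("the team scored a goal in the match", centroids)
```

The new document shares "team", "scored", "goal" and "match" with the Sports examples and only the word "the" with the Business examples, so it is assigned to Sports.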
15
Examples of Text Categorization
News article classification
Meta-data annotation
Automatic sorting
Web page classification
16
Sample Applications
  Information Filtering
  Text Categorization
  Document/Term Clustering
  Text Summarization
17
The Clustering Problem
Discover “natural structure”
Group similar objects together
Objects can be documents, terms, or passages
[Figure: example]
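One simple way to group similar objects is to link every pair whose similarity reaches a threshold and treat each connected group as a cluster. The sketch below does this for documents using cosine similarity over raw term counts. The threshold and the sample texts are assumptions, and other algorithms (for example k-means or hierarchical clustering) would serve equally well.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-count vector after lowercasing and stripping punctuation."""
    words = (t.strip(".,;:!?").lower() for t in text.split())
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cluster(texts, threshold=0.3):
    """Group texts into connected components of the similarity graph."""
    vectors = [tf_vector(t) for t in texts]
    parent = list(range(len(texts)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link every pair of texts that is similar enough.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

texts = [
    "text mining finds topics in text",
    "topic models for text mining",
    "football match ends in draw",
    "late goal wins football match",
]
clusters = cluster(texts)  # lists of indices into texts
```

On the sample texts this yields two clusters, one for the text mining sentences and one for the football sentences. The same function applies to terms or passages once they are represented as vectors.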
18
Similarity-induced Structure
19
Examples of Doc/Term Clustering
Clustering of retrieval results
Clustering of documents in the whole collection
Term clustering to define “concept” or “theme”
Automatic construction of hyperlinks
In general, very useful for text mining
20
Sample Applications
  Information Filtering
  Text Categorization
  Document/Term Clustering
  Text Summarization
21
“Retrieval-based” Summarization
Observation: term vector summary?
Basic approach
  Rank “sentences”, and select top N as a summary
Methods for ranking sentences
  Based on term weights
  Based on position of sentences
  Based on the similarity of sentence and document vector
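The third ranking method can be sketched as follows: each sentence is scored by the cosine similarity between its term vector and the vector of the whole document, and the top N sentences are returned in their original order. Splitting sentences on periods and using raw term counts are simplifying assumptions.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-count vector after lowercasing and stripping punctuation."""
    words = (t.strip(".,;:!?").lower() for t in text.split())
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def summarize(text, n=2):
    """Select the n sentences most similar to the document vector."""
    # Naive sentence splitting on periods (an assumption for brevity).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    doc_vec = tf_vector(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(tf_vector(sentences[i]), doc_vec),
        reverse=True,
    )
    # Keep the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:n])]

document = (
    "Text mining extracts patterns from text. "
    "Text mining uses text similarity. "
    "The weather was nice."
)
summary = summarize(document, n=2)
```

Here the off-topic weather sentence shares no terms with the rest of the document and is left out of the two-sentence summary. Term weights or sentence position could be added to the score in the same way.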
22
Examples of Summarization
News summary
Summarize retrieval results
  Single doc summary
  Multi-doc summary
Summarize a cluster of documents (automatic label creation for clusters)
23
NLP-Style Text Mining Techniques
Most of the following slides are from William Cohen’s IE tutorial
24
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

Example text:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
25
Landscape of IE Tasks: Complexity
How complicated a modeling technique will you have to use? E.g. word patterns:
Closed set: U.S. states
  He was born in Alabama…
  The big Wyoming sky…
Regular set: U.S. phone numbers
  Phone: (413)
  The CALD main office can be reached at
Complex pattern: U.S. postal addresses
  University of Arkansas P.O. Box 140 Hope, AR
  Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence: Person names
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
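The two simplest cases need very little modeling, as the sketch below suggests: a closed set can be handled by lexicon lookup and a regular set by a regular expression. The lexicon is deliberately partial, the phone pattern covers only two common formats, and the phone numbers in the usage example are invented placeholders, since the numbers on the slide were not preserved.

```python
import re

# Partial lexicon for illustration; a real one would list every state
# and handle multi-word names such as "New York".
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Matches "(ddd) ddd-dddd" or "ddd-ddd-dddd"; deliberately simple.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}")

def find_states(text):
    """Closed set: keep the words that are members of the lexicon."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]

def find_phones(text):
    """Regular set: return every substring matching the phone pattern."""
    return PHONE_RE.findall(text)

states = find_states("He was born in Alabama. The big Wyoming sky.")
# The numbers below are invented placeholders, not taken from the slides.
phones = find_phones("Phone: (413) 555-0100 or 413-555-0199")
```

Postal addresses and person names do not fit either approach well. Note that "Hope" appears on the slide both as a town and as a first name, which is why those cases need context and more evidence.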
26
Landscape of IE Techniques
Classify Pre-segmented Candidates
  Abraham Lincoln was born in Kentucky.
  Classifier: which class?
Lexicons
  Abraham Lincoln was born in Kentucky.
  member? Alabama, Alaska, …, Wisconsin, Wyoming
Sliding Window
  Abraham Lincoln was born in Kentucky.
  Classifier: which class? Try alternate window sizes.
Boundary Models
  Abraham Lincoln was born in Kentucky.
  Classifier: which class? BEGIN / END
Finite State Machines
  Abraham Lincoln was born in Kentucky.
  Most likely state sequence?
Context Free Grammars
  Abraham Lincoln was born in Kentucky.
  NNP V P NP PP VP S
  Most likely parse?
Any of these models can be used to capture words, formatting or both.
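The sliding window idea can be sketched as follows: every window of up to a few tokens is passed, together with its neighboring tokens, to a classifier that either assigns a class or rejects the window. The function `toy_classifier` is a hand-written stand-in for a trained classifier, and its rules and the small location lexicon are assumptions made only to keep the example runnable.

```python
import re

# Small illustrative lexicon (an assumption).
LOCATIONS = {"Kentucky", "Alabama", "Alaska", "Wisconsin", "Wyoming"}

def toy_classifier(window, before, after):
    """Hand-written stand-in for a learned classifier (hypothetical rules)."""
    if len(window) == 1 and window[0] in LOCATIONS:
        return "LOCATION"
    all_capitalized = all(tok[0].isupper() for tok in window)
    starts_name = before is None or not before[0].isupper()
    if all_capitalized and starts_name and after == "was":
        return "PERSON"
    return None  # window rejected

def sliding_window_extract(sentence, max_size=3):
    """Classify every window of 1..max_size tokens, using its context."""
    tokens = re.findall(r"\w+", sentence)
    found = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):  # try alternate window sizes
            end = start + size
            if end > len(tokens):
                break
            before = tokens[start - 1] if start > 0 else None
            after = tokens[end] if end < len(tokens) else None
            label = toy_classifier(tokens[start:end], before, after)
            if label:
                found.append((" ".join(tokens[start:end]), label))
    return found

entities = sliding_window_extract("Abraham Lincoln was born in Kentucky.")
```

On the slide's example sentence this returns "Abraham Lincoln" as a PERSON and "Kentucky" as a LOCATION. A boundary model would instead classify the BEGIN and END positions separately, and a finite state machine would label the whole token sequence at once.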
27
Statistical Learning Style Techniques for Text Mining
28
Many Techniques are Available
Supervised learning
  Classification
  Regression
Unsupervised learning
  Topic models
  Dimension reduction
Most relevant methods
  Generative models
  Matrix decomposition
29
Topics for Discussion
Social Science research questions:
  Mining bias: selection bias, framing bias
Text Mining techniques
  Sentiment analysis
  Topic discovery and evolution graph
  Joint text-image analysis