Untangling Text Data Mining Marti Hearst UC Berkeley SIMS ACL’99 Plenary Talk June 23, 1999.

Outline
- Untangling several different fields
  – DM, CL, IA, TDM
- TDM examples
- TDM as Exploratory Data Analysis
  – New problems for computational linguistics
  – Our current efforts

Classifying Application Types

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
- Fitting models to or determining patterns from very large datasets.
- A “regime” which enables people to interact effectively with massive data stores.
- Deriving new information from data.

Why Data Mining?
- Because the data is there.
- Because
  – larger disks
  – faster CPUs
  – high-powered visualization
  – networked information
  are becoming widely available.

The Knowledge Discovery from Data Process (KDD)

KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Shapiro, & Smyth, CACM 96)

Note: data mining is just one step in the process.

DM Touchstone Applications (CACM 39(11) Special Issue)
- Finding patterns across data sets:
  – Reports on changes in retail sales
    » to improve sales
  – Patterns of sizes of TV audiences
    » for marketing
  – Patterns in NBA play
    » to alter, and so improve, performance
  – Deviations in standard phone calling behavior
    » to detect fraud
    » for marketing

What is Data Mining?
- Potential point of confusion:
  – The extracting-ore-from-rock metaphor does not really apply to the practice of data mining.
  – If it did, then standard database queries would fit under the rubric of data mining.
  – In practice, DM refers to:
    » finding patterns across large datasets
    » discovering heretofore unknown information

What is Text Data Mining?
- Many people’s first thought:
  – Make it easier to find things on the Web.
  – But this is information retrieval!

Needles in Haystacks

The emphasis in IR is on finding documents that already contain answers to questions.

Information Retrieval: A Restricted Form of Information Access
- The system has available only pre-existing, “canned” text passages.
- Its response is limited to selecting from these passages and presenting them to the user.
- It must select, say, 10 or 20 passages out of millions.

What is Text Data Mining?
- The metaphor of extracting ore from rock:
  – Does make sense for extracting documents of interest from a huge pile.
  – But does not reflect notions of DM in practice:
    » finding patterns across large collections
    » discovering heretofore unknown information

Real Text DM

What would finding a pattern across a large text collection really look like?

From: “The Internet Diary of the Man Who Cracked the Bible Code,” Brendan McKay, Yahoo Internet Life

[Figure: “Bill Gates + MS-DOS in the Bible!” (William Gates, agitator, leader)]


Real Text DM
- The point:
  – Discovering heretofore unknown information is not what we usually do with text.
  – (If it weren’t known, it could not have been written by someone!)
- However:
  – There is a field whose goal is to learn about patterns in text for their own sake...

Computational Linguistics!
- Goal: automated language understanding
  – this isn’t possible
  – instead, go for subgoals, e.g.,
    » word sense disambiguation
    » phrase recognition
    » semantic associations
- Common current approach:
  – statistical analyses over very large text collections
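The kind of corpus statistic meant here can be sketched in miniature. Below is a toy pointwise mutual information (PMI) computation over sentence-level co-occurrence; the three-sentence “corpus” and the cloying/Jar Jar pairing are invented for illustration, echoing the example on the next slide.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Pointwise mutual information for word pairs that co-occur in a sentence."""
    word_counts = Counter()   # number of sentences containing each word
    pair_counts = Counter()   # number of sentences containing each unordered pair
    n = len(sentences)
    for sent in sentences:
        words = set(sent.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    return {
        pair: math.log2((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for pair, c in pair_counts.items()
        for a, b in [tuple(pair)]
    }

corpus = [
    "the cloying scenes featured jar jar",
    "critics called jar jar cloying",
    "the film was long",
]
scores = pmi_scores(corpus)
print(round(scores[frozenset({"cloying", "jar"})], 3))  # -> 0.585
```

A positive score means the pair co-occurs more often than their independent frequencies predict, which is exactly the sort of pattern-in-text result that, on its own, says nothing about the world outside the text.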

Why CL Isn’t TDM
- A linguist finds it interesting that “cloying” co-occurs significantly with “Jar Jar Binks”...
- ... but this doesn’t really answer a question relevant to the world outside the text itself.

Why CL Isn’t TDM
- We need to use the text indirectly to answer questions about the world.
- Direct:
  – Analyze patent text; determine which word patterns indicate various subject categories.
- Indirect:
  – Analyze patent text; find out whether private or public funding leads to more inventions.

Why CL Isn’t TDM
- Direct:
  – Cluster newswire text; determine which terms are predominant.
- Indirect:
  – Analyze newswire text; gather evidence about which countries/alliances are dominating which financial sectors.

Nuggets vs. Patterns
- TDM: we want to discover new information...
- ... as opposed to discovering which statistical patterns characterize occurrences of known information.
- Example: WSD
  – not TDM: computing statistics over a corpus to determine what patterns characterize Sense S.
  – TDM: discovering the meaning of a new sense of a word.

Nuggets vs. Patterns
- Nugget: a new, heretofore unknown item of information.
- Pattern: distributions or rules that characterize the occurrence (or non-occurrence) of a known item of information.
- Application of rules can create nuggets in some circumstances.

Example: Lexicon Augmentation
- Application of a lexico-syntactic pattern:
  NP0 such as {NP1, NP2 ..., (and | or) NPi}, i >= 1, implies that for all NPi, i >= 1, hyponym(NPi, NP0)
- Extracts a new hyponym relation:
  – “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
  – implies hyponym(“Gelidium”, “red algae”)
- However, this fact was already known to the author of the text.
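A minimal sketch of applying this “such as” pattern with a regular expression. Real systems chunk noun phrases with a parser; here the hypernym NP is crudely approximated as the last two words before “such as,” and each hyponym as a capitalized word, which is just enough for the Gelidium example and hypothetical beyond it.

```python
import re

# NP0 such as NP1, NP2, ..., (and|or) NPi  =>  hyponym(NPi, NP0)
PATTERN = re.compile(
    r"(\w+\s+\w+),?\s+such as\s+"          # crude NP0: the two preceding words
    r"([A-Z]\w*(?:,\s*[A-Z]\w*)*"          # NP1, NP2, ... (capitalized words)
    r"(?:,?\s*(?:and|or)\s+[A-Z]\w*)?)"    # optional final (and|or) NPi
)

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs implied by each 'such as' match."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        for hyponym in re.split(r",\s*|\s+(?:and|or)\s+", m.group(2)):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use.")
print(extract_hyponyms(sentence))  # -> [('Gelidium', 'red algae')]
```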

The Quandary
- How do we use text to both
  – find new information not known to the author of the text, and
  – find information that is not about the text itself?

Idea: Exploratory Data Analysis
- Use large text collections to gather evidence to support (or refute) hypotheses.
  – Not known to author: links across many texts
  – Not self-referential: work within the domain of discourse

Example: Etiology
- Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
- find causal links among titles
  – symptoms
  – drugs
  – results

Swanson Example (1991)
- Problem: migraine headaches (M)
  – stress associated with M
  – stress leads to loss of magnesium
  – calcium channel blockers prevent some M
  – magnesium is a natural calcium channel blocker
  – spreading cortical depression (SCD) implicated in M
  – high levels of magnesium inhibit SCD
  – M patients have high platelet aggregability
  – magnesium can suppress platelet aggregability
- All extracted from medical journal titles
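The linking step can be sketched as a toy computation: find bridge terms that co-occur with “migraine” in some titles and with “magnesium” in others, even though the two endpoints never share a title. The mini-corpus of titles below is invented to mimic the slide’s evidence chains.

```python
# Toy sketch of Swanson-style literature linking over title word sets.
def bridge_terms(titles, a, c):
    """Words that co-occur with term a in some titles and term c in others."""
    stopwords = {"and", "in", "of", "the", a, c}
    near_a, near_c = set(), set()
    for title in titles:
        words = set(title.lower().split())
        if a in words:
            near_a |= words     # vocabulary of titles mentioning a
        if c in words:
            near_c |= words     # vocabulary of titles mentioning c
    return (near_a & near_c) - stopwords

titles = [
    "stress and migraine",
    "stress causes loss of magnesium",
    "scd implicated in migraine",
    "high magnesium levels inhibit scd",
]
print(sorted(bridge_terms(titles, "migraine", "magnesium")))  # -> ['scd', 'stress']
```

Note that no title links migraine to magnesium directly; the hypothesis emerges only from combining evidence across documents, which is the “not known to the author” property the previous slide asks for.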

Gathering Evidence stress migraine CCB magnesium PA magnesium SCD magnesium

Gathering Evidence migraine magnesium stress CCB PA SCD

Swanson’s TDM
- Two of his hypotheses have received some experimental verification.
- His technique
  – only partially automated
  – required medical expertise
- Few people are working on this.

How to Automate This?
- Idea: mixed-initiative interaction
  – User applies tools to help explore the hypothesis space.
  – System runs suites of algorithms to help explore the space and suggest directions.

Our Proposed Approach
- Three main parts:
  – UI for building/using strategies
  – Backend for interfacing with various databases and translating different formats
  – Content analysis/machine learning for figuring out good hypotheses and throwing out bad ones

How to Find Functions of Genes?
- Important problem in molecular biology
  – Have the genetic sequence
  – Don’t know what it does
  – But...
    » know which genes it coexpresses with
    » some of these have known function
  – So... infer function based on function of co-expressed genes
    » This is new work by Michael Walker and others at Incyte Pharmaceuticals
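The inference step above is a form of “guilt by association,” which can be sketched as a majority vote over the annotated functions of co-expressed genes. The gene names echo the slides; the co-expression list and function annotations are invented placeholders, not real Incyte data.

```python
from collections import Counter

def infer_function(gene, coexpressed, annotations):
    """Predict a gene's function by majority vote over its co-expressed genes."""
    votes = Counter(annotations[g] for g in coexpressed.get(gene, ())
                    if g in annotations)           # skip unannotated neighbors
    return votes.most_common(1)[0][0] if votes else None

coexpressed = {"g?": ["PSA", "Kallikrein", "PAP"]}
annotations = {
    "PSA": "prostate cancer marker",
    "Kallikrein": "prostate cancer marker",
    "PAP": "phosphatase",
}
print(infer_function("g?", coexpressed, annotations))  # -> prostate cancer marker
```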

Gene Co-expression: Role in the Genetic Pathway

[Diagram: alternative pathway hypotheses placing the mystery gene (g? or h?) relative to PSA, Kallikrein, and PAP; other possibilities as well]

Make Use of the Literature
- Look up what is known about the other genes.
- Different articles in different collections
- Look for commonalities
  – Similar topics indicated by Subject Descriptors
  – Similar words in titles and abstracts: adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies...

Developing Strategies
- Different strategies seem needed for different situations.
  – First: see what is known about Kallikrein.
  – 7341 documents. Too many.
  – AND the result with the “disease” category.
    » If the result is non-empty, this might be an interesting gene.
  – Now get 803 documents.
  – AND the result with PSA.
    » Get 11 documents. Better!
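This narrowing strategy amounts to successive set intersections over an inverted index. The sketch below uses an invented index (term to document ids) with illustrative counts, not the slide’s actual hit counts.

```python
# Hypothetical inverted index: term -> set of matching document ids.
index = {
    "kallikrein": set(range(7341)),        # far too many documents
    "disease":    set(range(0, 7341, 9)),  # hypothetical category filter
    "psa":        {0, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90},
}

step1 = index["kallikrein"]
step2 = step1 & index["disease"]   # AND with the "disease" category
step3 = step2 & index["psa"]       # AND with PSA
print(len(step1), len(step2), len(step3))  # each AND shrinks the result set
```

The point of the UI work discussed later is to make sequences of refinements like this cheap to build, vary, and reuse.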

Developing Strategies
- Look for commonalities among these documents
  – Manual scan through ~100 category labels
  – Would have been better if
    » automatically organized
    » intersections of “important” categories scanned for first

Try a New Tack
- Researcher uses knowledge of the field to realize these are related to prostate cancer and diagnostic tests.
- New tack: intersect search on all three known genes.
  – Hope they all talk about diagnostics and prostate cancer.
  – Fortunately, 7 documents returned.
  – Bingo! A relation to regulation of this cancer.

Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer.
- New tack: do some lab tests.
  – See if mystery gene is similar in molecular structure to the others.
  – If so, it might do some of the same things they do.

Strategies Again
- In hindsight, combining all three genes was a good strategy.
  – Store this for later.
- It might not have worked.
  – Need a suite of strategies
  – Build them up via experience and a good UI

The System
- Doing the same query with slightly different values each time is time-consuming and tedious.
- Same goes for cutting and pasting results.
  – IR systems don’t support varying queries like this very well.
  – Each situation is a bit different.
- Some automatic processing is needed in the background to eliminate/suggest hypotheses.

The UI Part
- Need support for building strategies
- Mixed-initiative system
  – Trade-off between user-initiated hypothesis exploration and system-initiated suggestions
- Information visualization
  – Another way to show lots of choices

[Mockup: interface panels for Candidate Associations, Current Retrieval Results, and Suggested Strategies]

LINDI: Linking Information for Novel Discovery and Insight
- Just starting up now (fall ’98)
- Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman

Summary
- The future: analyzing what the text is about
  – We don’t know how; text is tough!
  – Idea: bring the user into the loop.
  – Build up piecewise evidence to support hypotheses.
  – Make use of partial domain models.
- The Truth Is Out There!

Summary
- Text Data Mining: extracting heretofore undiscovered information from large text collections
- Information Access ≠ TDM
  – IA: locating already-known information that is currently of interest
- Finding patterns across text is already done in CL
  – Tells us about the behavior of language
  – Helps build very useful tools!

Text Merging Example: Discovering Hypocritical Congresspersons

Discovering Hypocritical Congresspersons
- Feb 1, 1996
  – US House of Reps votes to pass the Telecommunications Reform Act.
  – This contains the CDA (Communications Decency Act).
  – Violators subject to fines of $250,000 and 5 years in prison.
  – Eventually struck down by court.

Discovering Hypocritical Congresspersons
- Sept 11, 1998
  – US House of Reps votes to place the Starr report online.
  – The content would (most likely) have violated the CDA.
- 365 people were members for both votes
  – 284 members voted aye both times
    » 185 (94%) Republicans voted aye both times
    » 96 (57%) Democrats voted aye both times

How to Find Hypocritical Congresspersons?
- This must have taken a lot of work
  – Hand cutting and pasting
  – Lots of picky details
    » Some people voted on one but not the other bill.
    » Some people share the same name.
      – Check for different county/state
      – Still messed up on “Bono”
  – Taking stats at the end on various attributes
    » Which state
    » Which party
- Tools should help streamline and reuse results.
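The hand work described above is essentially a record-linkage join: match the two roll-call tables on a (name, state) key to dodge same-name collisions, then tally the both-aye members by party. All members and votes in this sketch are invented, not the actual 1996/1998 roll calls.

```python
from collections import Counter

# Two roll-call tables keyed on (name, state) to disambiguate shared names.
vote_cda   = {("Smith", "CA"): "aye", ("Jones", "TX"): "aye", ("Bono", "CA"): "aye"}
vote_starr = {("Smith", "CA"): "aye", ("Jones", "TX"): "nay", ("Bono", "CA"): "aye"}
party      = {("Smith", "CA"): "R",   ("Jones", "TX"): "D",   ("Bono", "CA"): "R"}

both = vote_cda.keys() & vote_starr.keys()        # members present for both votes
both_aye = [m for m in both
            if vote_cda[m] == "aye" and vote_starr[m] == "aye"]
tally = Counter(party[m] for m in both_aye)       # stats on an attribute (party)
print(sorted(tally.items()))  # -> [('R', 2)]
```

Swapping the attribute (state instead of party) or the join key is a one-line change, which is the kind of strategy reuse the slide argues tools should support.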

How to Find Hypocritical Congresspersons?
- The hard part?
  – Knowing to compare these two sets of voting records.