Text Tango: A New Text Data Mining Project
Marti A. Hearst
GUIR Meeting, Sept 17, 1998



Marti A. Hearst, UC Berkeley SIMS, 1998

Talk Outline
- What is Data Mining?
- What isn’t Text Data Mining?
- What is Text Data Mining?
  - Examples
- A proposal for a system for Text Data Mining

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
- Fitting models to or determining patterns from very large datasets.
- A “regime” which enables people to interact effectively with massive data stores.
- Deriving new information from data.
  - finding patterns across large datasets
  - discovering heretofore unknown information

What is Data Mining?
- Potential point of confusion:
  - The extracting ore from rock metaphor does not really apply to the practice of data mining
  - If it did, then standard database queries would fit under the rubric of data mining
    - Find all employee records in which employee earns $300/month less than their managers
  - In practice, DM refers to:
    - finding patterns across large datasets
    - discovering heretofore unknown information
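
To make the contrast concrete, the employee example on this slide can be written as an ordinary self-join. The sketch below is only an illustration: the table name, columns, and rows are invented, and it reads "earns $300/month less" as "at least $300 less", which is an assumption. Nothing is discovered here; the query returns exactly what was asked for.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
    "monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Avery", 5000, None),  # top-level manager, has no manager
        (2, "Blake", 4700, 1),     # exactly $300 less than Avery
        (3, "Casey", 4900, 1),     # only $100 less than Avery
        (4, "Devon", 4000, 2),     # $700 less than Blake
    ],
)


def underpaid_vs_manager(connection, gap=300):
    """Names of employees earning at least `gap` less per month than their manager."""
    rows = connection.execute(
        "SELECT e.name FROM employee AS e "
        "JOIN employee AS m ON e.manager_id = m.id "
        "WHERE m.monthly_salary - e.monthly_salary >= ? "
        "ORDER BY e.name",
        (gap,),
    )
    return [name for (name,) in rows]


query_result = underpaid_vs_manager(conn)
```

A query like this retrieves known facts that match a stated condition, which is why it would not count as data mining in the sense used on this slide.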

DM Touchstone Applications (CACM 39 (11) Special Issue)
- Finding patterns across data sets:
  - Reports on changes in retail sales
    - to improve sales
  - Patterns of sizes of TV audiences
    - for marketing
  - Patterns in NBA play
    - to alter, and so improve, performance
  - Deviations in standard phone calling behavior
    - to detect fraud
    - for marketing

What is Text Data Mining?
- People’s first thought:
  - Make it easier to find things on the Web.
  - This is information retrieval!
- The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
- But it does not reflect notions of DM in practice:
  - finding patterns across large collections
  - discovering heretofore unknown information

Text DM != IR
- Data Mining:
  - Patterns, Nuggets, Exploratory Analysis
- Information Retrieval:
  - Finding and ranking documents that match users’ information need
    - ad hoc query
    - filtering/standing query

Real Text DM
- What would finding a pattern across a large text collection really look like?

From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life,
(William Gates, agitator, leader)
Bill Gates + MS-DOS in the Bible!

From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life,

Real Text DM
- The point:
  - Discovering heretofore unknown information is not what we usually do with text.
  - (If it weren’t known, it could not have been written by someone.)
- However:
  - There are some interesting problems of this type!

Combining Data Types for Novel Tasks
- Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
- Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
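
As a rough illustration of the link side of "Text + Links", here is a minimal sketch of a hubs-and-authorities style iteration of the kind associated with Kleinberg's work. It is a simplification under stated assumptions: the toy link graph is invented, the text step (choosing which pages to score for a query) is left out entirely, and the function name and iteration count are arbitrary choices for this example.

```python
def hubs_and_authorities(links, iterations=50):
    """Score pages in a link graph.

    links: dict mapping a page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dicts keyed by page.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy web: three pages point at "c", one of them also at "e".
toy_links = {"a": ["c"], "b": ["c"], "d": ["c", "e"]}
hub_scores, auth_scores = hubs_and_authorities(toy_links)
```

In this toy graph "c" collects the most inbound links from good hubs and so ends up as the top authority, while "d" is the strongest hub because it points at both scored pages.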

Ore-Filled Text Collections
- Congressional Voting Records
  - Answer questions like:
    - Who are the most hypocritical congresspeople?
- Medical Articles
  - Create hypotheses about causes of rare diseases
  - Create hypotheses about gene function
- Patent Law
  - Answer questions like:
    - Is government funding of research worthwhile?

How to find Hypocritical Congresspersons?
- This must have taken a lot of work
  - Hand cutting and pasting
  - Lots of picky details
    - Some people voted on one but not the other bill
    - Some people share the same name
      - Check for different county/state
      - Still messed up on “Bono”
  - Taking stats at the end on various attributes
    - Which state
    - Which party

How to find causes of disease? Don Swanson’s Medical Work
- Given
  - medical titles and abstracts
  - a problem (incurable rare disease)
  - some medical expertise
- find causal links among titles
  - symptoms
  - drugs
  - results

Swanson Example (1991)
- Problem: Migraine headaches (M)
  - stress associated with M
  - stress leads to loss of magnesium
  - calcium channel blockers prevent some M
  - magnesium is a natural calcium channel blocker
  - spreading cortical depression (SCD) implicated in M
  - high levels of magnesium inhibit SCD
  - M patients have high platelet aggregability
  - magnesium can suppress platelet aggregability
- All extracted from medical journal titles
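
The pattern in this example is that migraine and magnesium are each tied to the same intermediate concepts without being tied directly to each other. The sketch below shows one way that pattern could be searched for mechanically. It is not Swanson's procedure, which the talk describes as only partially automated and dependent on medical expertise; the concept sets are hand-written paraphrases of the slide's bullet points, and the function name is made up for this example.

```python
from collections import defaultdict

# Each set stands for the concepts mentioned together in one title.
# Hand-coded from the slide's bullets; real titles would need text processing.
titles = [
    {"migraine", "stress"},
    {"stress", "magnesium"},
    {"migraine", "calcium channel blockers"},
    {"magnesium", "calcium channel blockers"},
    {"migraine", "spreading cortical depression"},
    {"magnesium", "spreading cortical depression"},
    {"migraine", "platelet aggregability"},
    {"magnesium", "platelet aggregability"},
]


def indirect_links(concept_sets, start):
    """Find concepts that never co-occur with `start` but share intermediates with it.

    Returns a dict: candidate concept -> set of intermediate concepts linking it to `start`.
    """
    cooccur = defaultdict(set)
    for concepts in concept_sets:
        for concept in concepts:
            cooccur[concept] |= concepts - {concept}
    direct = set(cooccur[start])
    candidates = defaultdict(set)
    for intermediate in direct:
        for other in cooccur[intermediate]:
            if other != start and other not in direct:
                candidates[other].add(intermediate)
    return dict(candidates)


hypotheses = indirect_links(titles, "migraine")
```

On this toy input the only candidate is magnesium, reached through all four intermediate concepts. With real literature the candidate list would be long and noisy, which is where the expertise mentioned in the talk comes in.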

Swanson’s TDM
- Two of his hypotheses have received some experimental verification.
- His technique
  - Only partially automated
  - Required medical expertise
- Few people are working on this.

How to find functions of genes?
- Important problem in molecular biology
  - Have the genetic sequence
  - Don’t know what it does
  - But …
    - Know which genes it coexpresses with
    - Some of these have known function
  - So … Infer function based on function of co-expressed genes
  - This is new work by Michael Walker and others at Incyte Pharmaceuticals
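
The inference step on this slide can be pictured as a simple vote among co-expressed genes with known function. The sketch below is an assumption about how such a vote might look, not the method used at Incyte; gene names and function labels are placeholders invented for the example.

```python
from collections import Counter


def infer_function(unknown_gene, coexpressed, known_functions):
    """Rank candidate functions for a gene by how many co-expressed genes have them.

    coexpressed: dict gene -> list of genes it is co-expressed with.
    known_functions: dict gene -> list of known function labels.
    Returns a list of (function, vote_count), most common first.
    """
    votes = Counter()
    for partner in coexpressed.get(unknown_gene, []):
        for function in known_functions.get(partner, []):
            votes[function] += 1
    return votes.most_common()


# Placeholder data: "mystery" is co-expressed with three genes of known function.
coexpressed = {"mystery": ["geneA", "geneB", "geneC"]}
known_functions = {
    "geneA": ["f1", "f2"],
    "geneB": ["f1"],
    "geneC": ["f1", "f3"],
}
ranked_functions = infer_function("mystery", coexpressed, known_functions)
```

A ranking like this only suggests a hypothesis; the later slides describe following it up through the literature and with lab tests.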

Gene Co-expression: Role in the genetic pathway
[Diagram labels: g?, PSA, Kall., PAP; h?, PSA, Kall., PAP, g?]
Other possibilities as well

Make use of the literature
- Look up what is known about the other genes.
- Different articles in different collections
- Look for commonalities
  - Similar topics indicated by Subject Descriptors
  - Similar words in titles and abstracts
  - adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies...

Developing Strategies
- Different strategies seem needed for different situations
  - First: see what is known about Kallikrein.
    - 7341 documents. Too many
  - AND the result with “disease” category
    - If result is non-empty, this might be an interesting gene
    - Now get 803 documents
  - AND the result with PSA
    - Get 11 documents. Better!
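
The narrowing on this slide is a chain of Boolean ANDs, with the result size checked after each step. A minimal sketch of that loop follows, assuming a toy inverted index of term to document IDs; the index contents and the function name are invented, so the counts do not match the 7341, 803, and 11 reported in the talk.

```python
def narrow(index, terms):
    """Intersect posting sets term by term, recording the result size after each step.

    index: dict term -> set of document IDs containing that term.
    Returns (final_document_set, [(term, size_after_this_step), ...]).
    """
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace


# Hypothetical index with made-up document IDs.
toy_index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "PSA": {3, 8},
}
final_docs, narrowing_trace = narrow(toy_index, ["kallikrein", "disease", "PSA"])
```

Keeping the per-step sizes makes it easy to see where a query went from "too many" to a readable handful, or where it collapsed to nothing.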

Developing Strategies
- Look for commonalities among these documents
  - Manual scan through ~100 category labels
  - Would have been better if
    - Automatically organized
    - Intersections of “important” categories scanned for first
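
One simple way to organize the category labels automatically is to count how many retrieved documents carry each label and each pair of labels, then show the most frequent first. The sketch below assumes that approach; the document IDs and label assignments are invented, with label names borrowed from the earlier "Make use of the literature" slide.

```python
from collections import Counter
from itertools import combinations


def rank_categories(doc_labels):
    """Count single labels and label pairs across a set of retrieved documents.

    doc_labels: dict document ID -> list of category labels.
    Returns (single_label_counts, label_pair_counts), each most common first.
    """
    singles = Counter()
    pairs = Counter()
    for labels in doc_labels.values():
        unique = sorted(set(labels))  # sort so each pair has one canonical order
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return singles.most_common(), pairs.most_common()


# Hypothetical label assignments for three retrieved documents.
retrieved = {
    "d1": ["prostatic neoplasms", "tumor markers"],
    "d2": ["prostatic neoplasms", "tumor markers", "antibodies"],
    "d3": ["adenocarcinoma", "prostatic neoplasms"],
}
top_labels, top_pairs = rank_categories(retrieved)
```

Plain frequency is a stand-in for "important" here; a real system would likely need a better notion of importance than raw counts.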

Try a new tack
- Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
- New tack: intersect search on all three known genes
  - Hope they all talk about diagnostics and prostate cancer
  - Fortunately, 7 documents returned
  - Bingo! A relation to regulation of this cancer

Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
- New tack: do some lab tests
  - See if mystery gene is similar in molecular structure to the others
  - If so, it might do some of the same things they do

Strategies again
- In hindsight, combining all three genes was a good strategy.
  - Store this for later
- Might not have worked
  - Need a suite of strategies
  - Build them up via experience and a good UI
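
A stored suite of strategies could be as simple as a named list of query-combining functions that are tried in order until one returns something. The sketch below is one hypothetical shape for that idea, not a description of the proposed system; the strategy names, the fallback order, and the toy index are all assumptions.

```python
from itertools import combinations


def intersect_all(index, terms):
    """Documents that mention every term (the strategy that worked in the talk)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def intersect_pairs(index, terms):
    """Fallback: documents that mention at least two of the terms."""
    found = set()
    for a, b in combinations(terms, 2):
        found |= index.get(a, set()) & index.get(b, set())
    return found


# The stored suite: strictest strategy first, looser ones after it.
STRATEGIES = [("all terms", intersect_all), ("any pair", intersect_pairs)]


def first_productive(index, terms, strategies=STRATEGIES):
    """Try each stored strategy in order; return the first non-empty result."""
    for name, strategy in strategies:
        documents = strategy(index, terms)
        if documents:
            return name, documents
    return None, set()


# Hypothetical index in which no single document mentions all three genes.
gene_index = {"g1": {1, 2}, "g2": {2, 3}, "g3": {3, 4}}
chosen_strategy, chosen_docs = first_productive(gene_index, ["g1", "g2", "g3"])
```

Here the strict intersection comes back empty, so the suite falls through to the pairwise strategy, which mirrors the slide's point that a single strategy might not have worked.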

The System
- Doing the same query with slightly different values each time is time-consuming and tedious
- Same goes for cutting and pasting results
  - IR systems don’t support varying queries like this very well.
  - Each situation is a bit different
- Some automatic processing is needed in the background to eliminate/suggest hypotheses

The System
- Three main parts
  - UI for building/using strategies
  - Backend for interfacing with various databases and translating different formats
  - Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

The UI part
- Need support for building strategies
- Lots of info lying around, so a nice option is...
  - Two-handed interface
  - Big table display
- Mixed-initiative system
  - Trade off between user-initiated hypotheses exploration and system-initiated suggestions
- Information visualization
  - Another way to show lots of choices

[Figure panels: Candidate Associations, Current Retrieval Results, Suggested Strategies]

Other applications
- Patent example
- Political example
- The truth’s out there!

Text Tango
- Just starting up now.
- Let me know if you’d like to work on it!