Text Tango: A New Text Data Mining Project

Text Tango: A New Text Data Mining Project
Marti A. Hearst
GUIR Meeting, Sept 17, 1998

Talk Outline
- What is Data Mining?
- What isn’t Text Data Mining?
- What is Text Data Mining?
- Examples
- A proposal for a system for Text Data Mining
Marti A. Hearst UC Berkeley SIMS 1998

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
- Fitting models to or determining patterns from very large datasets.
- A “regime” which enables people to interact effectively with massive data stores.
- Deriving new information from data:
  - finding patterns across large datasets
  - discovering heretofore unknown information

What is Data Mining?
- Potential point of confusion: the metaphor of extracting ore from rock does not really apply to the practice of data mining.
- If it did, then standard database queries would fit under the rubric of data mining.
  - Example: find all employee records in which the employee earns $300/month less than their manager.
- In practice, DM refers to:
  - finding patterns across large datasets
  - discovering heretofore unknown information
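To make the contrast concrete, the employee example is an ordinary lookup that a relational database answers directly, with no pattern discovery involved. The sketch below is illustrative only: the table name, columns, and rows are invented, and "earns $300/month less" is read as "at least $300 less".

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", 5000, None),  # top-level manager, has no manager
        (2, "Ben", 4700, 1),     # earns 300 less than Ada
        (3, "Cy", 4900, 1),      # earns 100 less than Ada
        (4, "Di", 4000, 2),      # earns 700 less than Ben
    ],
)

# Self-join: pair each employee with their manager, then filter on the gap.
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.name
    """
).fetchall()
underpaid = [name for (name,) in rows]
print(underpaid)
```

The query retrieves records that match a stated condition; it does not surface anything the person asking did not already know how to ask for.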

DM Touchstone Applications (CACM 39 (11) Special Issue)
Finding patterns across data sets:
- Reports on changes in retail sales to improve sales
- Patterns of sizes of TV audiences for marketing
- Patterns in NBA play to alter, and so improve, performance
- Deviations in standard phone calling behavior to detect fraud

What is Text Data Mining?
- People’s first thought: make it easier to find things on the Web.
  - This is information retrieval!
- The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
- But it does not reflect notions of DM in practice:
  - finding patterns across large collections
  - discovering heretofore unknown information

Text DM != IR
- Data Mining: patterns, nuggets, exploratory analysis
- Information Retrieval: finding and ranking documents that match users’ information need
  - ad hoc query
  - filtering/standing query

Real Text DM
What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)

From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

Real Text DM
- The point: discovering heretofore unknown information is not what we usually do with text.
  - (If it weren’t known, it could not have been written by someone.)
- However: there are some interesting problems of this type!

Combining Data Types for Novel Tasks
- Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
- Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
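As a rough illustration of how link structure can point to "authority pages", here is a minimal sketch of a hubs-and-authorities iteration in the style of Kleinberg's HITS algorithm. It is a simplification, not the method as deployed anywhere: the text-search step that selects the pages is omitted, and the toy link graph and the function name `hits` are invented for the example.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: pages a, b, c link out; x and y only receive links.
toy_links = {"a": ["x", "y"], "b": ["x", "y"], "c": ["x"]}
hub_scores, auth_scores = hits(toy_links)
best_authority = max(auth_scores, key=auth_scores.get)
print(best_authority)
```

In this toy graph the page with the most incoming links from good hubs ends up with the highest authority score.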

Ore-Filled Text Collections
- Congressional Voting Records
  - Answer questions like: Who are the most hypocritical congresspeople?
- Medical Articles
  - Create hypotheses about causes of rare diseases
  - Create hypotheses about gene function
- Patent Law
  - Is government funding of research worthwhile?

How to find Hypocritical Congresspersons?
- This must have taken a lot of work
  - Hand cutting and pasting
  - Lots of picky details
    - Some people voted on one but not the other bill
    - Some people share the same name
      - Check for different county/state
      - Still messed up on “Bono”
  - Taking stats at the end on various attributes
    - Which state
    - Which party
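The bookkeeping described above can be sketched as a join over two roll-call lists. Everything here is hypothetical: the records, the party labels "X" and "Y", the choice of which vote pair counts as inconsistent, and the use of (name, state) as the key. As the slide notes, even a state check can still go wrong on real data.

```python
from collections import Counter

# Hypothetical roll-call records: (name, state, party, vote).
bill_a = [
    ("Smith", "CA", "X", "yea"),
    ("Smith", "TX", "Y", "nay"),  # same surname, different state
    ("Jones", "NY", "Y", "yea"),
    ("Lee", "OH", "X", "yea"),
]
bill_b = [
    ("Smith", "CA", "X", "yea"),
    ("Smith", "TX", "Y", "yea"),
    ("Jones", "NY", "Y", "nay"),
    # Lee did not vote on this bill.
]


def index_votes(records):
    """Key each vote by (name, state) so shared names do not collide."""
    return {(name, state): (party, vote) for name, state, party, vote in records}


def voted_pair(records_a, records_b, vote_a="yea", vote_b="yea"):
    """List members who cast vote_a on the first bill and vote_b on the second."""
    a, b = index_votes(records_a), index_votes(records_b)
    matches = []
    # Only members who voted on both bills can be compared.
    for key in sorted(a.keys() & b.keys()):
        if a[key][1] == vote_a and b[key][1] == vote_b:
            matches.append((key[0], key[1], a[key][0]))
    return matches


flagged = voted_pair(bill_a, bill_b)
by_party = Counter(party for _, _, party in flagged)
by_state = Counter(state for _, state, _ in flagged)
print(flagged, dict(by_party), dict(by_state))
```

The final counters correspond to the "taking stats at the end" step: once the matching members are identified, tallying by state or party is a one-line aggregation.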

How to find functions of genes?
- Important problem in molecular biology
  - Have the genetic sequence
  - Don’t know what it does
  - But ... know which genes it co-expresses with
    - Some of these have known function
  - So ... infer function based on function of co-expressed genes
- This is new work by Michael Walker and others at Incyte Pharmaceuticals
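One simple way to picture "infer function based on function of co-expressed genes" is a vote over the known annotations of the co-expressed genes. This is only a sketch of that idea, not the Incyte method; the gene names and function labels below are placeholders, not real annotations.

```python
from collections import Counter


def infer_function(coexpressed, known_functions):
    """Rank candidate functions by how many co-expressed genes carry them."""
    votes = Counter()
    for gene in coexpressed:
        # Genes with no known function contribute no votes.
        for function in known_functions.get(gene, ()):
            votes[function] += 1
    return votes.most_common()


# Placeholder annotations, invented for illustration.
known = {
    "geneA": ["function_1", "function_2"],
    "geneB": ["function_1"],
    "geneC": [],
}
ranked = infer_function(["geneA", "geneB", "geneC"], known)
print(ranked)
```

The top-ranked label is only a candidate hypothesis; it still needs to be checked against the literature and in the lab.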

Gene Co-expression: Role in the genetic pathway
[Figure: candidate pathway diagrams with nodes Kall., PSA, PAP and unknown genes g? and h?]
Other possibilities as well

Make use of the literature
- Look up what is known about the other genes.
  - Different articles in different collections
- Look for commonalities
  - Similar topics indicated by Subject Descriptors
  - Similar words in titles and abstracts: adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...

Developing Strategies
- Different strategies seem needed for different situations
- First: see what is known about Kallikrein.
  - 7341 documents. Too many.
- AND the result with “disease” category
  - If result is non-empty, this might be an interesting gene
  - Now get 803 documents
- AND the result with PSA
  - Get 11 documents. Better!
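The narrowing steps above amount to intersecting result sets one after another while watching the size shrink. A minimal sketch follows, with a made-up index of document IDs standing in for a real bibliographic search; the counts in the example are not the ones from the slide.

```python
def narrow(result_sets):
    """Intersect labeled result sets in order, recording the size after each step."""
    current = None
    trace = []
    for label, docs in result_sets:
        current = set(docs) if current is None else current & set(docs)
        trace.append((label, len(current)))
    return current, trace


# Hypothetical index: term -> IDs of documents mentioning it.
toy_index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "PSA": {3, 4, 10},
}
final_docs, trace = narrow(
    [(term, toy_index[term]) for term in ("kallikrein", "disease", "PSA")]
)
print(trace)
print(sorted(final_docs))
```

Keeping the trace makes it easy to see at which step a result set becomes small enough to read, or becomes empty.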

Developing Strategies
- Look for commonalities among these documents
  - Manual scan through ~100 category labels
- Would have been better if
  - Automatically organized
  - Intersections of “important” categories scanned for first
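One possible way to organize the category labels automatically is to count how many retrieved documents carry each label and each pair of labels, then show the most frequent first. The sketch below assumes this counting approach; the document IDs and their label assignments are invented, with label names borrowed from the earlier slide.

```python
from collections import Counter
from itertools import combinations


def rank_categories(doc_categories):
    """Rank single labels and label pairs by the number of documents sharing them."""
    singles = Counter()
    pairs = Counter()
    for categories in doc_categories.values():
        unique = sorted(set(categories))
        singles.update(unique)
        # Each unordered pair of labels on one document is one co-occurrence.
        pairs.update(combinations(unique, 2))
    return singles.most_common(), pairs.most_common()


# Hypothetical label assignments for three retrieved documents.
docs = {
    "d1": ["prostatic neoplasms", "tumor markers"],
    "d2": ["prostatic neoplasms", "tumor markers", "antibodies"],
    "d3": ["prostatic neoplasms"],
}
top_singles, top_pairs = rank_categories(docs)
print(top_singles[0])
print(top_pairs[0])
```

Scanning the top pairs first corresponds to looking at intersections of the "important" categories before the long tail of rare labels.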

Try a new tack
- Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
- New tack: intersect search on all three known genes
  - Hope they all talk about diagnostics and prostate cancer
  - Fortunately, 7 documents returned
  - Bingo! A relation to regulation of this cancer

Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
- New tack: do some lab tests
  - See if mystery gene is similar in molecular structure to the others
  - If so, it might do some of the same things they do

Strategies again
- In hindsight, combining all three genes was a good strategy.
  - Store this for later
- Might not have worked
  - Need a suite of strategies
  - Build them up via experience and a good UI

The System
- Doing the same query with slightly different values each time is time-consuming and tedious
  - Same goes for cutting and pasting results
- IR systems don’t support varying queries like this very well.
  - Each situation is a bit different
- Some automatic processing is needed in the background to eliminate/suggest hypotheses
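The tedium of re-running one query with slightly different values suggests treating a strategy as a reusable, parameterized routine. The following is a hedged sketch of that idea rather than a description of the proposed system: `run_strategy`, `toy_search`, and the toy index are all hypothetical names and data.

```python
def run_strategy(search, fixed_terms, varying_terms):
    """Run the same conjunctive query once per varying term.

    search is any callable mapping a term to a set of document IDs.
    Only terms whose combined result is non-empty are kept.
    """
    results = {}
    for term in varying_terms:
        hits = None
        for t in [*fixed_terms, term]:
            docs = set(search(t))
            hits = docs if hits is None else hits & docs
        if hits:
            results[term] = hits
    return results


# Hypothetical index standing in for a real retrieval backend.
toy_index = {
    "kallikrein": {1, 2, 3},
    "PSA": {2, 3, 4},
    "PAP": {3, 5},
    "disease": {1, 2, 3, 4},
}


def toy_search(term):
    """Look a term up in the toy index; unknown terms match nothing."""
    return toy_index.get(term, set())


outcome = run_strategy(toy_search, ["kallikrein", "disease"], ["PSA", "PAP", "unknown"])
print(outcome)
```

Dropping the empty results is a crude stand-in for the background processing that eliminates unpromising hypotheses; a stored strategy like this can then be replayed on a new gene without cutting and pasting.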

The System
Three main parts:
- UI for building/using strategies
- Backend for interfacing with various databases and translating different formats
- Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

The UI part
- Need support for building strategies
- Lots of info lying around, so a nice option is ...
  - Two-handed interface
  - Big table display
- Mixed-initiative system
  - Trade off between user-initiated hypothesis exploration and system-initiated suggestions
- Information visualization
  - Another way to show lots of choices

[Figure: UI mockup with panels labeled Candidate Associations, Suggested Strategies, and Current Retrieval Results]

Other applications
- Patent example
- Political example
- The truth’s out there!

Text Tango
- Just starting up now.
- Let me know if you’d like to work on it!