Text Mining Tools: Instruments for Scientific Discovery
Marti Hearst, UC Berkeley SIMS
Advanced Technologies Seminar, June 15, 2000

Outline
• What knowledge can we discover from text?
• How is knowledge discovered from other kinds of data?
• A proposal: let's make a new kind of scientific instrument/tool.
Note: this talk contains some common materials and themes from another one of my talks, entitled "Untangling Text Data Mining".

What is Knowledge Discovery from Text?

• Finding a document?
• Finding a person's name in a document?
• This information is already known, at least to the author.
• Needles in haystacks... needlestacks.

What to Discover from Text?
• What news events happened last year?
• Which researchers most influenced a field?
• Which inventions led to other inventions?
These are historical, retrospective questions.

What to Discover from Text?
• What are the most common topics discussed in this set of documents?
• How connected is the Web?
• What words best characterize this set of documents' topics?
• Which words are good triggers for a topic classifier/filter?
These are summaries of the data itself, and features used in algorithms.
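As a hedged illustration of the trigger-word question above, here is a minimal Python sketch that ranks candidate trigger words by how strongly their presence is associated with a topic label. The toy corpus and the simple PMI-style score are assumptions for illustration, not a prescribed method.

```python
from collections import Counter
from math import log2

def trigger_words(docs, labels, topic, top_k=5):
    """Rank words by a PMI-style score, log2( P(topic | word) / P(topic) ),
    computed from document frequencies. Higher = better trigger for the topic."""
    n = len(docs)
    p_topic = sum(1 for lab in labels if lab == topic) / n
    df = Counter()        # number of docs containing the word
    df_topic = Counter()  # number of topic docs containing the word
    for doc, lab in zip(docs, labels):
        for w in set(doc.lower().split()):
            df[w] += 1
            if lab == topic:
                df_topic[w] += 1
    scores = {w: log2((df_topic[w] / df[w]) / p_topic)
              for w in df if df_topic[w] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical toy corpus; real use would draw on a large labeled collection.
docs = ["magnesium and migraine", "stock market report", "magnesium deficiency and stress"]
labels = ["medical", "finance", "medical"]
print(trigger_words(docs, labels, "medical"))
```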

Classifying Application Types

The Quandary
• How do we use text to both
  – Find new information not known to the author of the text
  – Find information that is not about the text itself?

Idea: Exploratory Data Analysis
• Use large text collections to gather evidence to support (or refute) hypotheses
  – Not known to author: make links across many texts
  – Not self-referential: work within the text domain

The Process of Scientific Discovery
• Four main steps (Langley et al. 87):
  – Gathering data
  – Finding good descriptions of the data
  – Formulating explanatory hypotheses
  – Testing the hypotheses
• My claim: we can do this with text as the data!

Scientific Breakthroughs
• New scientific instruments lead to revolutions in discovery
  – CAT scans, fMRI
  – Scanning tunneling microscope
  – Hubble telescope
• Idea: make a new scientific instrument!

How Has Knowledge Been Discovered in Non-Textual Data?
• Discovery from databases involves finding patterns across the data in the records
  – Classification
    » Fraud vs. non-fraud
  – Conditional dependencies
    » People who buy X are likely to also buy Y with probability P
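A minimal sketch of the second kind of pattern, assuming nothing more than a list of purchase records: estimate P(buys Y | buys X) directly from the transactions. The items and records below are made up.

```python
def conditional_support(transactions, x, y):
    """Estimate P(customer buys y | customer buys x) from transaction records."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

# Hypothetical purchase records (each transaction is a set of items).
transactions = [{"bread", "butter"}, {"bread", "jam"},
                {"bread", "butter", "milk"}, {"milk"}]
print(conditional_support(transactions, "bread", "butter"))  # 2/3
```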

How Has Knowledge Been Discovered in Non-Textual Data?
• Old AI work (early 80's):
  – AM/Eurisko (Lenat)
  – BACON, STAHL, etc. (Langley et al.)
  – Expert systems
• A commonality:
  – Start with propositions
  – Try to make inferences from these
• Problem:
  – Where do the propositions come from?

Intensional vs. Extensional
• Database structure:
  – Intensional: the schema
  – Extensional: the records that instantiate the schema
• Current data mining efforts make inferences from the records
• Old AI work made inferences from what would have been the schemata
  – Employees have salaries and addresses
  – Products have prices and part numbers
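To make the distinction concrete, a small hypothetical illustration: the schema describes what an employee record can say (the intensional level), while the rows are the facts that current data mining actually operates over (the extensional level).

```python
# Intensional level: the schema -- what kinds of facts may be stated at all.
employee_schema = {"name": str, "salary": float, "address": str}

# Extensional level: the records that instantiate the schema.
employees = [
    {"name": "Ada", "salary": 95000.0, "address": "12 Elm St"},
    {"name": "Bo", "salary": 72000.0, "address": "9 Oak Ave"},
]

# Record-level mining works over the extension, e.g. an average salary:
print(sum(e["salary"] for e in employees) / len(employees))

# Schema-level (propositional) knowledge is about the intension,
# e.g. "employees have salaries":
print("salary" in employee_schema)
```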

Goal: Extract Propositions from Text and Make Inferences

Why Extract Propositions from Text?
• Text is how knowledge at the propositional level is communicated
• Text is continually being created and updated by the outside world
  – So the knowledge base won't get stale

Example: Etiology
• Given
  – medical titles and abstracts
  – a problem (an incurable rare disease)
  – some medical expertise
• Find causal links among titles
  – symptoms
  – drugs
  – results

Swanson Example (1991)
• Problem: migraine headaches (M)
  – Stress is associated with M
  – Stress leads to loss of magnesium
  – Calcium channel blockers (CCB) prevent some M
  – Magnesium is a natural calcium channel blocker
  – Spreading cortical depression (SCD) is implicated in M
  – High levels of magnesium inhibit SCD
  – M patients have high platelet aggregability (PA)
  – Magnesium can suppress platelet aggregability
• All extracted from medical journal titles
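A minimal sketch of the pattern behind this chain (Swanson-style literature-based discovery): terms that co-occur with migraine in one set of titles and with magnesium in another suggest an indirect link worth investigating. The titles below paraphrase the bullets above and are not real MEDLINE records.

```python
def shared_intermediates(titles_a, titles_c, term_a, term_c):
    """Swanson-style linking: terms that co-occur with A in one literature and
    with C in another suggest an indirect A-to-C connection."""
    def co_terms(titles, anchor):
        found = set()
        for title in titles:
            words = set(title.lower().split())
            if anchor in words:
                found |= words - {anchor}
        return found
    return co_terms(titles_a, term_a) & co_terms(titles_c, term_c)

# Hypothetical titles paraphrasing the chain above (not real MEDLINE records).
migraine_titles = ["stress associated with migraine",
                   "scd implicated in migraine"]
magnesium_titles = ["stress leads to loss of magnesium",
                    "magnesium inhibits scd"]
print(shared_intermediates(migraine_titles, magnesium_titles,
                           "migraine", "magnesium"))   # {'stress', 'scd'}
```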

Gathering Evidence
[diagram: pairwise links drawn from the titles: stress and migraine; CCB and magnesium; PA and magnesium; SCD and magnesium]

Gathering Evidence
[diagram: migraine and magnesium connected through the intermediate terms stress, CCB, PA, and SCD]

Swanson's TDM
• Two of his hypotheses have received some experimental verification.
• His technique
  – Only partially automated
  – Required medical expertise
• Few people are working on this.

One Approach: The LINDI Project
Linking Information for New Discoveries
• Three main components:
  – Search UI for building and reusing hypothesis-seeking strategies
  – Statistical language analysis techniques for extracting propositions from text
  – Probabilistic ontological representation and reasoning techniques

LINDI
• First use category labels to retrieve candidate documents,
• then use language analysis to detect causal relationships between concepts,
• then represent the relationships probabilistically, within a known ontology.
• The (expert) user
  – builds up representations
  – formulates hypotheses
  – tests hypotheses outside of the text system
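A hedged sketch of how these steps might be wired together; the function names, the single cue-phrase pattern, and the belief-update rule are placeholder assumptions for illustration, not the actual LINDI implementation.

```python
import re

def retrieve_by_category(docs, category_label):
    """Step 1: use category labels (here, plain tags) to select candidate documents."""
    return [d for d in docs if category_label in d["categories"]]

def extract_causal_relations(text):
    """Step 2: stand-in for statistical language analysis -- a single cue pattern."""
    pattern = re.compile(r"([\w ]+?) (?:leads to|causes|inhibits) ([\w ]+)", re.I)
    return [(a.strip(), b.strip()) for a, b in pattern.findall(text)]

def update_belief(beliefs, relation, weight=0.1):
    """Step 3: attach a degree of belief to each extracted relation,
    to be inspected and refined by the expert user."""
    beliefs[relation] = min(1.0, beliefs.get(relation, 0.0) + weight)

# Hypothetical document collection.
docs = [{"categories": {"migraine"}, "text": "Stress leads to loss of magnesium."}]
beliefs = {}
for doc in retrieve_by_category(docs, "migraine"):
    for rel in extract_causal_relations(doc["text"]):
        update_belief(beliefs, rel)
print(beliefs)   # {('Stress', 'loss of magnesium'): 0.1}
```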

Objections
• Objection:
  – This is GOF NLP, which doesn't work
• Response:
  – GOF NLP required hand-entering of knowledge
  – Now we have statistical techniques and very large corpora

Objections
• Objection:
  – Reasoning with propositions is brittle
• Response:
  – Yes, but now we have mature probabilistic reasoning tools, which support
    » representation of uncertainty and degrees of belief
    » simultaneously conflicting information
    » different levels of granularity of information

Objections
• Objection:
  – Automated reasoning doesn't work
• Response:
  – We are not trying to automate all reasoning; rather, we are building new, powerful tools for
    » gathering data
    » formulating hypotheses

Objections
• Objection:
  – Isn't this just information extraction?
• Response:
  – IE is a useful tool that can be used in this endeavor; however,
    » it is currently used to instantiate pre-specified templates
    » I am advocating coming up with entirely new, unforeseen "templates"

Traditional Semantic Grammars
• Reshape syntactic grammars to serve the needs of semantic processing.
• Example (Burton & Brown 79): interpreting "What is the current thru the CC when the VC is 1.0?"
  – Grammar rules map phrases such as "what is", "when ... is", and the component names (CC, VC) directly into semantic constituents
  – Resulting semantic form: (RESETCONTROL (STQ VC 1.0) (MEASURE CURRENT CC))
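To give the flavor of such a grammar in runnable form, here is a small hypothetical reconstruction; the rule patterns and the way the semantic form is assembled are my assumptions for illustration, not the actual SOPHIE grammar of Burton & Brown.

```python
import re

# A toy semantic grammar: each rule maps a surface pattern straight to a piece
# of the semantic form, with no separate syntactic parse in between.
RULES = [
    (re.compile(r"what is the (\w+) thru the (\w+)", re.I),
     lambda m: ("MEASURE", m.group(1).upper(), m.group(2).upper())),
    (re.compile(r"when the (\w+) is ([\d.]+)", re.I),
     lambda m: ("STQ", m.group(1).upper(), float(m.group(2)))),
]

def interpret(question):
    """Return a SOPHIE-style semantic form for a measurement question."""
    parts = [build(m) for pattern, build in RULES if (m := pattern.search(question))]
    measure = next(p for p in parts if p[0] == "MEASURE")
    setting = next((p for p in parts if p[0] == "STQ"), None)
    return ("RESETCONTROL", setting, measure) if setting else measure

print(interpret("What is the current thru the CC when the VC is 1.0?"))
# ('RESETCONTROL', ('STQ', 'VC', 1.0), ('MEASURE', 'CURRENT', 'CC'))
```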

Statistical Semantic Grammars
• Empirical NLP has made great strides
  – But mainly applied to syntactic structure
• Semantic grammars are powerful, but
  – Brittle
  – Time-consuming to construct
• Idea:
  – Use what we now know about statistical NLP to build up a probabilistic grammar

Example: Statistical Semantic Grammar
• To detect causal relationships between medical concepts
  – Title: "Magnesium deficiency implicated in increased stress levels."
  – Interpretation: magnesium related-to stress
  – Inference:
    » Increase(stress, decrease(mg))
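One way such a statistical interpretation step could look, sketched with made-up numbers: score candidate relation labels for the concepts in a title from lexical cues, naive-Bayes style, and keep the most probable label. The cue probabilities and the label set are illustrative assumptions, not the project's actual model.

```python
from math import log

# Made-up cue-word likelihoods P(word | relation) and label priors.
PRIORS = {"causes-increase": 0.3, "causes-decrease": 0.3, "unrelated": 0.4}
CUES = {
    "causes-increase": {"implicated": 0.4, "increased": 0.5, "deficiency": 0.2},
    "causes-decrease": {"inhibits": 0.5, "suppresses": 0.4, "deficiency": 0.3},
    "unrelated":       {"implicated": 0.05, "increased": 0.1, "deficiency": 0.05},
}

def classify_relation(title, smoothing=0.01):
    """Naive-Bayes-style scoring of relation labels from the words of a title."""
    words = title.lower().rstrip(".").split()
    scores = {rel: log(prior) + sum(log(CUES[rel].get(w, smoothing)) for w in words)
              for rel, prior in PRIORS.items()}
    return max(scores, key=scores.get)

title = "Magnesium deficiency implicated in increased stress levels."
print(classify_relation(title))   # 'causes-increase'
```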

Example: Using Semantics + Ontologies
• acute migraine treatment
• intra-nasal migraine treatment

Example: Using Semantics + Ontologies
• [acute migraine] treatment
• intra-nasal [migraine treatment]
• We also want to know the meaning of the attachments, not just which way the attachments go.

Example: Using Semantics + Ontologies
• acute migraine treatment
• intra-nasal migraine treatment

Example: Using Semantics + Ontologies
• acute migraine treatment
• intra-nasal migraine treatment
• Problem: which level(s) of the ontology should be used?
• We are taking an information-theoretic approach.
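A hedged sketch of one information-theoretic criterion (in the spirit of Resnik-style information content, not necessarily the criterion used here): generalize a concept up the ontology only while the more general node still carries enough corpus-derived information. The toy ontology, counts, and threshold are assumptions for illustration.

```python
from math import log2

# Hypothetical fragment of a medical ontology (child -> parent) with corpus
# frequency counts per node (a node's count includes all of its descendants).
PARENT = {"intra-nasal": "topical", "topical": "route",
          "migraine": "headache", "headache": "disorder"}
COUNTS = {"intra-nasal": 20, "topical": 80, "route": 400,
          "migraine": 50, "headache": 150, "disorder": 600}
TOTAL = 1000

def information_content(node):
    """IC(node) = -log2 P(node): rarer, more specific nodes carry more information."""
    return -log2(COUNTS[node] / TOTAL)

def choose_level(node, min_ic=2.5):
    """Generalize up the ontology only while the parent still carries enough
    information; stop before dropping below the threshold."""
    while node in PARENT and information_content(PARENT[node]) >= min_ic:
        node = PARENT[node]
    return node

print(choose_level("intra-nasal"))  # 'topical'
print(choose_level("migraine"))     # 'headache'
```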

The User Interface
• A general search interface should support
  – History
  – Context
  – Comparison
  – Operator reuse
  – Intersection, union, slicing
  – Visualization (where appropriate)
• We are developing such an interface as part of a general search UI project.

Summary
• Let's get serious about discovering new knowledge from text
• This will build on existing technologies
• This also requires new technologies

Summary
• Let's get serious about discovering new knowledge from text
  – We can build a new kind of scientific instrument to facilitate a whole new set of scientific discoveries
  – Technique: linking propositions across texts (Jensen, Harabagiu)

Summary
• This will build on existing technologies
  – Information extraction (Riloff et al., Hobbs et al.)
  – Bootstrapping training examples (Riloff et al.)
  – Probabilistic reasoning

Summary
• This also requires new technologies
  – Statistical semantic grammars
  – Dynamic ontology adjustment
  – Flexible search UIs