Automating Discovery from Biomedical Texts Marti Hearst & Barbara Rosario UC Berkeley Agyinc Visit August 16, 2000.



The LINDI Project: Linking Information for New Discoveries
Two Main Thrusts:
• UIs for building and reusing hypothesis-seeking strategies
• Statistical language analysis techniques for extracting propositions

Scenario: Explore Functions of a Gene
• Objective
  – Determine the functions of a newly sequenced Gene X.
• Known facts
  – Gene X co-expresses (is activated in the same cell) with Genes A, B, and C
  – The relationships of Genes A, B, and C with certain types of diseases (from the medical literature)
• Question
  – What types of diseases is Gene X related to?

Gene Co-expression: Role in the genetic pathway
[Diagram: alternative pathway arrangements of unknown genes (g?, h?) with PSA, Kall., and PAP]
Other possibilities as well

Make use of the literature
• Look up what is known about the other genes.
• Different articles in different collections
• Look for commonalities
  – Similar topics indicated by Subject Descriptors
  – Similar words in titles and abstracts
    adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies...

Developing Strategies
• Different strategies seem needed for different situations
  – First: see what is known about Kallikrein.
  – 7341 documents. Too many
  – AND the result with “disease” category
    » If result is non-empty, this might be an interesting gene
  – Now get 803 documents

Explore Functions of New Gene X
[Diagram labels: Medical Literature; Query; Gene-A; Projection; Keywords; Mapping]
Slide adapted from K. Patel

Developing Strategies
• Different strategies seem needed for different situations
  – First: see what is known about Kallikrein.
  – 7341 documents. Too many
  – AND the result with “disease” category
    » If result is non-empty, this might be an interesting gene
  – Now get 803 documents
  – AND the result with PSA
    » Get 11 documents. Better!
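The narrowing strategy on this slide amounts to successive Boolean ANDs over result sets. The following is a minimal sketch of that idea, assuming a toy in-memory inverted index. The `index` contents, the document IDs, and the `narrow` helper are hypothetical and do not reproduce the counts quoted on the slide.

```python
# Toy inverted index: term -> set of document IDs (hypothetical data).
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7, 8},
    "psa": {3, 4, 9},
}


def narrow(index, terms):
    """AND the posting sets of `terms` in order.

    Returns the final result set and the result size after each step,
    so the researcher can see how much each added term narrows the search.
    """
    result = None
    sizes = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        sizes.append((term, len(result)))
    return (result if result is not None else set()), sizes


docs, sizes = narrow(index, ["kallikrein", "disease", "psa"])
print(sizes)
print(sorted(docs))
```

Recording the size after each step mirrors the slide's "too many, now fewer, better" progression and makes it easy to see when a step empties the result.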

Explore Functions of New Gene X
[Diagram labels: Medical Literature; Query; Gene-A, Gene-B, Gene-C; Projection; Keywords; Intersection]

Developing Strategies
• Look for commonalities among these documents
  – Manual scan through ~100 category labels
  – Would have been better if
    » Automatically organized
    » Intersections of “important” categories scanned for first

Explore Functions of New Gene X
[Diagram labels: Medical Literature; Query; Gene-A, Gene-B, Gene-C; Projection; Keywords; Intersection; Mapping; Slicing]
Slide adapted from K. Patel

Try a new tack
• Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
• New tack: intersect search on all three known genes
  – Hope they all talk about diagnostics and prostate cancer
  – Fortunately, 7 documents returned
  – Bingo! A relation to regulation of this cancer

Explore Functions of New Gene X
[Diagram labels: Medical Literature; Query; Gene-A, Gene-B, Gene-C; Projection; Keywords; Intersection; Mapping; Slicing; Possible Function For Gene-X]
Slide adapted from K. Patel

Formulate a Hypothesis
• Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
• New tack: do some lab tests
  – See if mystery gene is similar in molecular structure to the others
  – If so, it might do some of the same things they do

Strategies again
• In hindsight, combining all three genes was a good strategy.
  – Store this for later
• Might not have worked
  – Need a suite of strategies
  – Build them up via experience and a good UI

The System
• Doing the same query with slightly different values each time is time-consuming and tedious
• Same goes for cutting and pasting results
  – IR systems don’t support varying queries like this very well.
  – Each situation is a bit different
• Some automatic processing is needed in the background to eliminate/suggest hypotheses

The User Interface
• A general search interface should support
  – History
  – Context
  – Comparison
  – Operators: Intersection, Union, Slicing
  – Operator Reuse
  – Visualization (where appropriate)
• We have an initial implementation
• It needs lots of work

Architecture of LINDI UI
• Data Layer
• Annotation Layer
• User Interface Layer

Data Layer
• Purpose
  – Hide different formats of text collections
• Components
  – Data: Abstractions representing records of a text collection
  – Operations: performed on the data
• Data
  – A set of records
  – Each record is a set of tuples with types
• Operations
  – union, intersection, projection, mapping
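The data model described on this slide (a data set is a set of records, and each record is a set of typed tuples) can be sketched as below. This is an illustration under stated assumptions, not the LINDI implementation: the names `Field`, `Record`, `DataSet`, and `record` are hypothetical, and the semantics chosen for `projection` (keep tuples by type) and `mapping` (apply a function to each record) are one plausible reading of the slide.

```python
from typing import Callable, FrozenSet, List, Set, Tuple, TypeVar

Field = Tuple[str, str]      # (type, value), e.g. ("mesh", "tumor markers")
Record = FrozenSet[Field]    # a record is a set of typed tuples
DataSet = Set[Record]        # a data set is a set of records
T = TypeVar("T")


def record(*fields: Field) -> Record:
    """Build an immutable record so it can be stored inside a set."""
    return frozenset(fields)


def union(a: DataSet, b: DataSet) -> DataSet:
    return a | b


def intersection(a: DataSet, b: DataSet) -> DataSet:
    return a & b


def projection(data: DataSet, types: Set[str]) -> DataSet:
    """Keep only tuples whose type is in `types`; drop records left empty."""
    out: DataSet = set()
    for rec in data:
        kept = frozenset(f for f in rec if f[0] in types)
        if kept:
            out.add(kept)
    return out


def mapping(data: DataSet, fn: Callable[[Record], T]) -> List[T]:
    """Apply `fn` to every record (assumed meaning of 'mapping')."""
    return [fn(rec) for rec in data]


# Hypothetical records for two genes.
gene_a = {
    record(("title", "doc 1"), ("mesh", "prostatic neoplasms")),
    record(("title", "doc 2"), ("mesh", "tumor markers")),
}
gene_b = {record(("title", "doc 3"), ("mesh", "prostatic neoplasms"))}

shared = intersection(projection(gene_a, {"mesh"}), projection(gene_b, {"mesh"}))
print(shared)
```

Using frozen sets for records lets the standard set operators implement union and intersection directly, and projecting before intersecting is what allows records from different documents to match on a shared attribute.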

Annotation Layer
• Purpose
  – Associate data sets with the operations that produced them (history)
  – History is a first-class object
• Advantage
  – Streamline a sequence of operations
  – Reuse operations
  – Parameterize operations
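One way to picture history as a first-class object is a value that stores the sequence of operations and can be replayed with a different starting parameter. The sketch below is an assumption-laden illustration: the `History` class, its `then`, `replay`, and `names` methods, and the `COLLECTION` data are all hypothetical and are not taken from the LINDI prototype.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class History:
    """An ordered list of named operations that can be stored and replayed."""
    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)

    def then(self, name: str, fn: Callable[[Any], Any]) -> "History":
        """Return a new History with one more step; the original is unchanged."""
        return History(self.steps + [(name, fn)])

    def replay(self, seed: Any) -> Any:
        """Run every step in order, starting from `seed` (the parameter)."""
        value = seed
        for _, fn in self.steps:
            value = fn(value)
        return value

    def names(self) -> List[str]:
        return [name for name, _ in self.steps]


# Hypothetical collection: gene name -> subject headings of matching documents.
COLLECTION = {
    "GA": ["prostatic neoplasms", "tumor markers"],
    "GB": ["prostatic neoplasms", "antibodies"],
}

strategy = (
    History()
    .then("query", lambda gene: COLLECTION.get(gene, []))
    .then("project", lambda headings: set(headings))
)

# Reuse the same stored sequence with a different parameter each time.
results = {gene: strategy.replay(gene) for gene in ("GA", "GB")}
print(strategy.names())
print(results["GA"] & results["GB"])
```

Because `then` returns a new object, a stored strategy can be extended for one situation without altering the version saved for later reuse.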

User Interface
• Direct manipulation of information objects and access operations
  – Query
  – Intersection
  – Union
  – Mapping
  – Slicing
• Record and reuse of past operations
• Parameterization of operations
• Streamlining of operations

Initial Palette

Query Structure Determined by Collection Type

Query Operation Results

Projection Operation and Subsequent Results

Parameterized Query: Repeat operations with different values (GA, GB, GC)

Intersection over Projected Attribute

Example Interaction with UI Prototype
1. Query on Gene names
2. Project out only MeSH headings
3. Intersect the results
4. Map to create a ranking
5. Slice out the top-ranked.
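The five steps of this interaction can be sketched end to end as follows. This is a rough illustration only: the `RECORDS` data, the function names, and in particular the ranking rule (count how many records in the collection carry each heading) are assumptions, since the slide does not say how the ranking is computed.

```python
# Hypothetical toy collection: each record lists the genes it mentions
# and its MeSH headings.
RECORDS = [
    {"genes": {"GA"}, "mesh": {"prostatic neoplasms", "tumor markers"}},
    {"genes": {"GB"}, "mesh": {"prostatic neoplasms", "tumor markers", "antibodies"}},
    {"genes": {"GC"}, "mesh": {"prostatic neoplasms", "tumor markers", "adenocarcinoma"}},
    {"genes": {"GA", "GB"}, "mesh": {"prostatic neoplasms"}},
]


def query(records, gene):
    """Step 1: records that mention the gene."""
    return [r for r in records if gene in r["genes"]]


def project(records, attribute):
    """Step 2: collect the values of a single attribute across the records."""
    values = set()
    for r in records:
        values |= r[attribute]
    return values


def intersect(value_sets):
    """Step 3: values common to every projected result."""
    value_sets = list(value_sets)
    return set.intersection(*value_sets) if value_sets else set()


def rank(headings, records):
    """Step 4: map each heading to a score (assumed: number of records carrying it)."""
    scored = [(h, sum(1 for r in records if h in r["mesh"])) for h in headings]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def slice_top(ranked, k):
    """Step 5: keep the k top-ranked entries."""
    return ranked[:k]


projected = [project(query(RECORDS, g), "mesh") for g in ("GA", "GB", "GC")]
common = intersect(projected)
top = slice_top(rank(common, RECORDS), 1)
print(top)
```

Running the same query and projection once per gene name is the repeated step that the parameterized query in the prototype is meant to automate.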

Future Work on UI
• As currently designed
  – Better labeling
  – Better layout
    » Intuitive
    » Scalable
  – Connection to real backend
  – User Testing
    » Does direct manipulation work?
    » What operator sequences help?
    » How to improve parameterization?
• More advanced
  – Support for strategies
  – Incorporation of NLP

Language Analysis Component
Goals:
  – Extract Propositions from Text
  – Make Inferences

Language Analysis Component
Why Extract Propositions from Text?
  – Text is how knowledge at the propositional level is communicated
  – Text is continually being created and updated by the outside world

Example: Statistical Semantic Grammar
To detect causal relationships between medical concepts
  – Title: Magnesium deficiency implicated in increased stress levels.
  – Interpretation: related-to
  – Inference:
    » Increase(stress, decrease(mg))

Statistical Semantic Grammars
• Empirical NLP has made great strides
  – But mainly applied to syntactic structure
• Semantic grammars are powerful, but
  – Brittle
  – Time-consuming to construct
• Idea:
  – Use what we now know about statistical NLP to build up a probabilistic grammar
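The slide does not say how the probabilistic grammar would be built. One standard approach from statistical NLP is to estimate each rule's probability by relative frequency over annotated examples, which the sketch below illustrates. The semantic categories (`RELATION`, `CONCEPT`, `SUBSTANCE`), the observed rules, and the function name `estimate_rule_probs` are all hypothetical and are not the authors' grammar.

```python
from collections import Counter


def estimate_rule_probs(observed_rules):
    """Relative-frequency estimate of P(rhs | lhs).

    `observed_rules` is a list of (lhs, rhs) pairs, one per rule occurrence
    in hand-annotated examples; rhs is a tuple of symbols.
    """
    counts = Counter(observed_rules)
    lhs_totals = Counter()
    for (lhs, _), count in counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in counts.items()}


# Hypothetical rule occurrences from annotated titles.
observed = [
    ("RELATION", ("CONCEPT", "implicated_in", "CONCEPT")),
    ("RELATION", ("CONCEPT", "implicated_in", "CONCEPT")),
    ("RELATION", ("CONCEPT", "causes", "CONCEPT")),
    ("CONCEPT", ("deficiency", "SUBSTANCE")),
]

probs = estimate_rule_probs(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.3f}")
```

Learning the weights from annotated text instead of hand-tuning them is one way to address the brittleness and construction cost the slide attributes to hand-built semantic grammars.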

LINDI: Target Components
1. Special UI for retrieving appropriate docs
2. Language analysis on docs to detect causal relationships between concepts
3. Probabilistic representation of concepts and relationships
4. UI + User: Hypothesis creation