The BioText Project. SIMS Affiliates Meeting, Nov 14, 2003. Marti Hearst, Associate Professor, SIMS, UC Berkeley. Project sponsored by NSF DBI-0317510, ARDA.

Presentation transcript:

1 The BioText Project SIMS Affiliates Meeting Nov 14, 2003 Marti Hearst Associate Professor SIMS, UC Berkeley Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech

2 BioText Project Goals Provide fast, flexible, intelligent access to information for use in biosciences applications. –Better search results –Text mining Focus on –Textual Information –Tightly integrated with other resources Ontologies Record-based databases

3 People Project Leaders: –PI: Marti Hearst –Co-PI: Adam Arkin Computational Linguistics –Barbara Rosario –Preslav Nakov Database Research –Ariel Schwartz –Gaurav Bhalotia (graduated) User Interface / Information Retrieval –Kevin Li –Dr. Emilia Stoica Bioscience –Dr. TingTing Zhang

4 Outline Main Goals –Text Mining Examples –System Architecture –Apoptosis problem statement Recent results in –Abbreviation definition recognition –Semantic relation recognition (from text) –Search User Interfaces –Hierarchical grouping of journals

5 Text Mining Example 1 How to discover new information … … As opposed to discovering which statistical patterns characterize occurrence of known information. Method: –Use large text collections to gather evidence to support (or refute) hypotheses –Make Connections –Gather Evidence

6 Etiology Example Don Swanson example, 1991 Goal: find cause of disease –Magnesium-migraine connection Given –medical titles and abstracts –a problem (incurable rare disease) –some medical expertise find causal links among titles –symptoms –drugs –results

7 Gathering Evidence stress migraine CCB magnesium PA magnesium SCD magnesium

8 Gathering Evidence migraine magnesium stress CCB PA SCD

9 Swanson’s Linking Approach Two of his hypotheses have received some experimental verification. His technique –Only partially automated –Required medical expertise
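Note: the linking idea on slides 6 to 9 can be sketched in a few lines, assuming each title has already been reduced to a set of terms. This is only an illustration; the function name `linking_terms` and the toy term sets are assumptions made here, not Swanson's data or procedure, which was only partially automated and relied on medical expertise.

```python
from collections import defaultdict

def linking_terms(docs, source, target):
    """Return terms that co-occur with `source` in some documents and with
    `target` in others. Documents mentioning both are ignored, since the
    goal is to surface indirect links only."""
    with_source = defaultdict(int)
    with_target = defaultdict(int)
    for terms in docs:
        terms = set(terms)
        if source in terms and target in terms:
            continue  # direct link already present in the literature
        if source in terms:
            for t in terms - {source}:
                with_source[t] += 1
        elif target in terms:
            for t in terms - {target}:
                with_target[t] += 1
    shared = set(with_source) & set(with_target)
    # Most frequently supported links first, ties broken alphabetically.
    return sorted(shared, key=lambda t: (-(with_source[t] + with_target[t]), t))

# Hypothetical term sets, reusing the abbreviations shown on the slides.
titles = [
    {"migraine", "stress"},
    {"stress", "magnesium"},
    {"migraine", "CCB"},
    {"CCB", "magnesium"},
    {"migraine", "PA"},
    {"PA", "magnesium"},
    {"migraine", "aspirin"},  # no path to magnesium, so not a link
]
links = linking_terms(titles, "migraine", "magnesium")
print(links)
```

A human expert would still need to judge which of the returned intermediate terms are medically plausible.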

10 Text Mining Example 2: How to find functions of genes? –Have the genetic sequence –Don’t know what it does –But … Know which genes it coexpresses with Some of these have known function –So … infer function based on function of co-expressed genes This is a problem suggested by Michael Walker and others at Incyte Pharmaceuticals

11 Gene Co-expression: Role in the genetic pathway g? PSA Kall. PAP h? PSA Kall. PAP g? Other possibilities as well

12 Make use of the literature Look up what is known about the other genes. Different articles in different collections Look for commonalities –Similar topics indicated by Subject Descriptors –Similar words in titles and abstracts adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies...
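Note: one way to picture the "look for commonalities" step is to count which subject descriptors recur across the literature of the co-expressed genes. The sketch below assumes the descriptors have already been collected per gene; the function name `common_descriptors`, the `min_genes` threshold, and the descriptor sets are illustrative assumptions, not project data.

```python
from collections import Counter

def common_descriptors(descriptors_by_gene, min_genes=2):
    """List descriptors attached to the literature of at least `min_genes`
    genes, most widely shared first (ties broken alphabetically)."""
    counts = Counter()
    for descriptors in descriptors_by_gene.values():
        counts.update(set(descriptors))  # count each gene at most once
    return sorted(
        (d for d, n in counts.items() if n >= min_genes),
        key=lambda d: (-counts[d], d),
    )

# Hypothetical descriptor sets for the known genes named on slide 11.
literature = {
    "PSA": {"prostatic neoplasms", "tumor markers", "antibodies"},
    "Kall.": {"prostatic neoplasms", "tumor markers", "adenocarcinoma"},
    "PAP": {"prostatic neoplasms", "adenocarcinoma", "prostate"},
}
shared = common_descriptors(literature)
print(shared)
```

Descriptors shared by most of the known genes are the candidates for a hypothesis about the unknown gene.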

13

14 Formulate a Hypothesis Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer New tack: do some lab tests –See if mystery gene is similar in molecular structure to the others –If so, it might do some of the same things they do

15 Outline Main Goals –Text Mining Examples –System Architecture –Apoptosis problem statement Recent results in –Abbreviation definition recognition –Semantic relation recognition (from text) –Search User Interfaces –Hierarchical grouping of journals

16 BioText: Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

17 Recent Result (Schwartz & Hearst 03) Fast, simple algorithm for recognizing abbreviation definitions. –Simpler and faster than the rest –Higher precision and recall –Idea: Work backwards from the end Examples: –In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). –Gcn5-related N-acetyltransferase (GNAT) Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
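Note: the "work backwards from the end" idea can be sketched as follows. This is a minimal Python sketch of right-to-left character matching, written for this transcript; the function name `find_best_long_form` and its exact rules are assumptions and should not be read as the authors' published implementation. It takes the abbreviation and the text that precedes the parenthesis.

```python
def find_best_long_form(short_form, long_form):
    """Match the characters of `short_form` right to left against
    `long_form`. The first character of the short form must start a word.
    Returns the shortest matching suffix of `long_form`, or None."""
    l = len(long_form) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # skip punctuation inside the abbreviation
        # Move left until the character matches; for the first character of
        # the short form, also require that it sits at the start of a word.
        while (l >= 0 and long_form[l].lower() != ch) or (
            s == 0 and l > 0 and long_form[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None  # ran out of text before matching everything
        l -= 1
    # Extend left to the beginning of the word holding the first match.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]

sentence_prefix = (
    "the key to transcriptional regulation of the Heat Shock Response "
    "is the Heat Shock Transcription Factor"
)
print(find_best_long_form("HSF", sentence_prefix))
print(find_best_long_form("GNAT", "Gcn5-related N-acetyltransferase"))
```

Matching from the right lets the search stop at the nearest plausible definition, so the earlier "Heat Shock Response" in the same sentence is not picked up.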

18 BioText: A Two-Sided Approach SwissProt BLAST MeSH GO WordNet Medline Journal Full Text Sophisticated Database Design & Algorithms Empirical Computational Linguistics Algorithms

19 Apoptosis Network (diagram; slide courtesy TingTing Zhang). Node labels: Death Receptors Signaling, Survival Factors Signaling, Ca++ Signaling, P53 pathway, ER Stress, Genotoxic Stress, Loss of Attachment, Cell Cycle stress, etc.; Initiator Caspases (8, 10), Caspase 12, Caspase 9, Effector Caspases (3, 6, 7); Apaf 1, IAPs, NFkB, Smac, AIF; Mitochondria, Cytochrome c; Bax, Bak; Bcl-2 like; BH3 only; Apoptosis.

20 The issues (courtesy TingTing Zhang): The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published. The supporting experimental data are gathered in different organs, tissues, cells using various techniques. There are various levels of uncertainty associated with different techniques used to answer certain questions. Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts. We need to keep track of ALL the information in order to understand the system better.

21 Simple cases:
–Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-lymphoma cell line KO52DA20).
–Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library).
–Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231 WEHI415, B BW2 cell extracts).
–BimL protein interacts with Bcl-2 OR Bcl-XL, or Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids).
–BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL, or Bcl-w proteins (under the experimental conditions mentioned above).

22 Computational Language Goals Recognizing and annotating entities within textual documents Identifying semantic relations among entities To (eventually) be used in tandem with semi-automated reasoning systems.

23 Main Ideas for NLP Approach Assign Semantics using –Statistics –Hierarchical Lexical Ontologies to generalize –Redundancy in the data Build up Layers of Representation –Syntactic and Semantic –Use these in a feedback loop

24 Computational Linguistics Goals Mark up text with semantic relations

25 Recent Result: Descent of Hierarchy Idea: –Use the top levels of a lexical hierarchy to identify semantic relations Hypothesis: –A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.

26 Definition NC: Any sequence of nouns that itself functions as a noun –asthma hospitalizations –health care personnel hand wash Technical text is rich with NCs Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.

27 NCs: Three tasks Identification Syntactic analysis (attachments) –[Baseline [headache frequency]] –[[Tension headache] patient] Our Goal: Semantic analysis –Headache treatment → treatment for headache –Corticosteroid treatment → treatment that uses corticosteroid

28 Main Idea: Top-level MESH categories can be used to indicate which relations hold between noun compounds headache recurrence –C C headache pain –C G breast cancer cells –A C04 A11

29 Linguistic Motivation Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure. –(used-in): kitchen knife –(made-of): steel knife –(instrument-for): carving knife –(used-on): putty knife –(used-by): butcher’s knife

30 Distribution of Frequent Category Pairs

31 How Far to Descend? Anatomy: 250 CPs –187 (75%) remain first level –56 (22%) descend one level –7 (3%) descend two levels Natural Science (H01): 21 CPs –1 (4%) remain first level –8 (39%) descend one level –12 (57%) descend two levels Neoplasm (C04) 3 CPs: –3 (100%) descend one level
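Note: the descent idea from slides 25 to 31 can be pictured as a lookup that starts with coarse category pairs and only moves to finer ones when no rule is found. The sketch below makes several assumptions that are not in the slides: both nouns descend to the same depth, depth 0 is the category letter, the tree numbers below the top category are placeholders rather than real MeSH entries, and the relation labels are either taken from the glosses on slide 27 or are dummy names.

```python
def truncate(tree_number, level):
    """Cut a dotted MeSH-style tree number to a given depth:
    level 0 -> "C", level 1 -> "C04", level 2 -> "C04.1"."""
    if level == 0:
        return tree_number[0]
    return ".".join(tree_number.split(".")[:level])

def relation_for(modifier_code, head_code, rules, max_level=2):
    """Try category pairs from coarse to fine and return the first
    (pair, relation) for which a rule exists, else None."""
    for level in range(max_level + 1):
        pair = (truncate(modifier_code, level), truncate(head_code, level))
        if pair in rules:
            return pair, rules[pair]
    return None

# Hypothetical rule table keyed by (modifier category, head category).
rules = {
    ("C", "E"): "treatment for",          # e.g. headache treatment
    ("D", "E"): "treatment that uses",    # e.g. corticosteroid treatment
    ("A01", "A07"): "placeholder-relation",  # a pair needing one descent
}

# Placeholder tree numbers (digits after the first dot are invented).
print(relation_for("C10.1", "E02.1", rules))
print(relation_for("A01.5", "A07.3", rules))
print(relation_for("B01.1", "B02.1", rules))
```

A rule stored at a coarse level covers every noun compound whose nouns fall under those categories, which is what keeps the rule table small.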

32 Evaluation Apply the rules to a test set Accuracy: –Anatomy: 91% accurate –Natural Science: 79% –Diseases: 100% Total: –89.6% via intra-category averaging –90.8% via extra-category averaging

33 Summary of NC Work Lexical hierarchy useful for inferring semantic relations Works because semantics are constrained and word sense ambiguity is not too much of a problem Can it be extended to other types of relations? –Preliminary results on one set of relations are promising.

34 Database Research Issues Efficiently and effectively combining –Relational databases & Text –Hierarchical Ontologies –Layers of Annotations
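Note: one common way to combine text with layers of annotations in a relational database is stand-off annotation, where each annotation row points at a character span of a stored document. The sketch below uses Python's built-in `sqlite3` module; the table and column names are assumptions chosen for illustration and do not describe the BioText schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY,
    doc_id INTEGER NOT NULL REFERENCES document(id),
    layer TEXT NOT NULL,         -- which analysis produced the annotation
    start_pos INTEGER NOT NULL,  -- 0-based, inclusive
    end_pos INTEGER NOT NULL,    -- 0-based, exclusive
    label TEXT NOT NULL
);
""")

text = "Gcn5-related N-acetyltransferase (GNAT)"
long_form, short_form = "Gcn5-related N-acetyltransferase", "GNAT"
conn.execute("INSERT INTO document (id, text) VALUES (1, ?)", (text,))

# Store each span by offset instead of copying the text into the row.
spans = []
for layer, surface in (("long_form", long_form), ("short_form", short_form)):
    start = text.index(surface)
    spans.append((1, layer, start, start + len(surface), "GNAT"))
conn.executemany(
    "INSERT INTO annotation (doc_id, layer, start_pos, end_pos, label) "
    "VALUES (?, ?, ?, ?, ?)",
    spans,
)

# Recover the annotated text by joining back to the document (substr is 1-based).
rows = conn.execute(
    "SELECT a.layer, substr(d.text, a.start_pos + 1, a.end_pos - a.start_pos) "
    "FROM annotation a JOIN document d ON d.id = a.doc_id "
    "ORDER BY a.start_pos"
).fetchall()
print(rows)
```

Keeping each analysis in its own layer means a later layer (for example, semantic relations) can be added or recomputed without touching the stored text.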

35 Interface Issues Create intuitive, appealing interfaces that are better than what’s currently out there. Start with existing assigned metadata As text analysis improves, incorporate the results into the interface.

36

37

38

39

40 Some Recent Work Organizing BioScience Journal Names –Currently there are > 3500

41

42

43 Some Recent Work Organizing BioScience Journal Names –Currently there are > 3500 Idea: –Group them into faceted hierarchies semi-automatically –Using clustering of title terms, synonym similarity via WordNet, and other techniques
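Note: the simplest ingredient of the grouping idea, shared title terms, can be sketched as below. This covers only term overlap; the WordNet synonym similarity and the other techniques mentioned on the slide are not modeled. The stopword list, the function name `group_by_title_terms`, and the sample titles are assumptions for illustration and are not drawn from the project's journal list.

```python
from collections import defaultdict

# Small illustrative stopword list (an assumption, not the project's).
STOPWORDS = {"of", "the", "and", "in", "for", "journal"}

def group_by_title_terms(journal_names, min_size=2):
    """Map each content word to the journals whose titles contain it,
    keeping only words shared by at least `min_size` journals."""
    groups = defaultdict(set)
    for name in journal_names:
        for term in name.lower().split():
            if term not in STOPWORDS:
                groups[term].add(name)
    return {
        term: sorted(names)
        for term, names in groups.items()
        if len(names) >= min_size
    }

journals = [
    "Journal of Clinical Oncology",
    "Oncology Reports",
    "Clinical Genetics",
    "Annals of Genetics",
]
facets = group_by_title_terms(journals)
for term in sorted(facets):
    print(term, facets[term])
```

Because a title can land in more than one group, the result resembles facets rather than a single strict hierarchy; a person would still review and name the groups, in line with the semi-automatic goal.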

44

45

46 Summary BioText aims to improve access to bioscience information via –Sophisticated language analysis –Integration of results into an annotated database and a flexible user interface Eventual goal –Semi-automated mining and discovery

47 There’s lots to do! For more information: biotext.berkeley.edu