Marti Hearst Associate Professor SIMS, UC Berkeley

Presentation transcript:

The BioText Project
Marti Hearst, Associate Professor, SIMS, UC Berkeley

BioText Project Goals
Provide fast, flexible, intelligent access to information for use in biosciences applications.
Focus on textual information
Tightly integrated with other resources:
  Ontologies
  Record-based databases

BioText: A Two-Sided Approach
Sophisticated Database Design & Algorithms
Empirical Computational Linguistics Algorithms
Resources: SwissProt, BLAST, Medline, Journal Full Text, MeSH, GO, WordNet

People
Computational Linguistics: Barbara Rosario, Barbara Engelhardt, Preslav Nakov
Database Research: Ariel Schwartz, Gaurav Bhalotia
Bioscience: Ting Ting Zhang, Anita Wilhelm
Biosciences Collaborators:
  Hsueh lab at Stanford Medical
  Altman lab at Stanford SMI
  Arkin lab at Berkeley
  Others …?

Database Research Issues
Efficient querying and updating:
  Semi-structured information
  Fuzzy synonyms
  Collection subsets
Efficiently and effectively combining:
  Relational databases
  Text databases
  Layers of processing
  Hierarchical ontologies

Computational Linguistics Goals
Recognizing and annotating entities within textual documents
Identifying semantic relations among entities
To (eventually) be used in tandem with semi-automated reasoning systems.

Computational Linguistics Goals
Mark up text with semantic relations:
  <protein1><inhibits><protein2>
  <protein><binds-with><receptor>
  <chemical><increases><level-of><chemical>

Recent Results
Fast, simple algorithm for recognizing abbreviation definitions.
  Simpler than the rest
  Higher precision and recall
Idea: Work backwards from the end
Examples:
  International Business Machines (IBM)
  transcription (TSP): transcriPtion, tranScription, Transcription
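The "work backwards from the end" idea can be illustrated with a short sketch. This is only one plausible reading of the slide, not the project's actual code: the function name `find_best_long_form` is hypothetical, and the rule that the first abbreviation character must match at the start of a word is an assumption that the slide does not state.

```python
def find_best_long_form(short_form, candidate_text):
    """Match abbreviation characters right to left against the text before it.

    Returns the shortest trailing span of candidate_text that contains the
    abbreviation's characters in order, or None if no such span exists.
    """
    s = len(short_form) - 1      # current position in the abbreviation
    l = len(candidate_text) - 1  # current position in the candidate text
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():     # skip punctuation inside the abbreviation
            s -= 1
            continue
        # Move left until the character matches. Assumption: the first
        # character of the abbreviation must sit at the start of a word.
        while (l >= 0 and candidate_text[l].lower() != ch) or (
            s == 0 and l > 0 and candidate_text[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None          # ran out of text before matching everything
        l -= 1
        s -= 1
    # Start the long form at the beginning of the word that was matched last.
    start = candidate_text.rfind(" ", 0, l + 1) + 1
    return candidate_text[start:]


# Examples taken from the slide.
ibm = find_best_long_form("IBM", "International Business Machines")
tsp = find_best_long_form("TSP", "transcription")
```

For "TSP" the scan matches the P, then the S, then the initial T of "transcription", which mirrors the three capitalised variants on the slide.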

Recent Result: Descent of Hierarchy
Idea: Use the top levels of a lexical hierarchy to identify semantic relations
Hypothesis: A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.

Recent Result
Top-level MeSH categories can be used to indicate which relations hold between noun compounds
  headache recurrence: C23.888.592.612.441 C23.550.291.937
  headache pain: C23.888.592.612.441 G11.561.796.444
  breast cancer cells: A01.236 C04 A11
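As a rough illustration of how a noun compound's MeSH tree numbers reduce to a category pair, the sketch below truncates each tree number to a chosen depth. The helper names `truncate` and `category_pair` are hypothetical, and the depth numbering (0 for the letter alone, 1 for the first dotted component, and so on) is an assumption made for this example.

```python
def truncate(tree_number, depth):
    """Cut a MeSH tree number such as 'C23.888.592' down to `depth` levels.

    Assumed convention: depth 0 keeps only the letter ('C'), depth 1 keeps
    the first component ('C23'), depth 2 keeps 'C23.888', and so on.
    """
    if depth == 0:
        return tree_number[0]
    return ".".join(tree_number.split(".")[:depth])


def category_pair(first, second, depth_first=1, depth_second=1):
    """Return the category pair (CP) for a two-word noun compound."""
    return truncate(first, depth_first), truncate(second, depth_second)


# Tree numbers copied from the slide.
recurrence_cp = category_pair("C23.888.592.612.441", "C23.550.291.937")
pain_cp = category_pair("C23.888.592.612.441", "G11.561.796.444")
# Descending one more level for the second noun only.
pain_cp_deeper = category_pair(
    "C23.888.592.612.441", "G11.561.796.444", depth_second=2
)
```

Under these assumptions "headache recurrence" falls in the pair (C23, C23) and "headache pain" in (C23, G11); raising `depth_second` yields a more specific pair for the second noun.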

Linguistic Motivation
Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure.
  (used-in): kitchen knife
  (made-of): steel knife
  (instrument-for): carving knife
  (used-on): putty knife
  (used-by): butcher’s knife

Distribution of Frequent Category Pairs
We placed these CPs into a two-dimensional table, with the MeSH category for the first noun on the X axis and the MeSH category for the second noun on the Y axis. Each intersection indicates the number of NCs that are classified under the corresponding two MeSH categories. A visualization tool (Spotfire) allowed us to explore the dataset to see which areas of the category space are most heavily populated, and to get a feeling for whether the distribution is uniform.
If our hypothesis holds (that NCs that fall within the same category pair are assigned the same relation), and most of the NCs fall within only a few category pairs, then we only need to determine which relations hold for a subset of the possible pairs. Thus, the more clumped the distribution, the easier our task potentially is. The figure shows that some areas of the CP space have a higher concentration of unique NCs (the Anatomy sub-hierarchy, and the E (Therapeutic Techniques) through N (Health Care) sub-hierarchies, for example).
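The two-dimensional table described here can be approximated with a simple counter keyed by category pair. The sketch below is illustrative only: the noun compounds and category letters are placeholder toy data, not values from the project, and the function names are hypothetical.

```python
from collections import Counter


def category_pair_counts(labelled_ncs):
    """Count unique noun compounds per (first category, second category).

    labelled_ncs holds (noun_compound, first_category, second_category)
    tuples; repeated entries are counted once.
    """
    unique = set(labelled_ncs)
    return Counter((c1, c2) for _, c1, c2 in unique)


def coverage_of_top_pairs(counts, k):
    """Fraction of all NCs that fall within the k most populated pairs."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Toy data (hypothetical): placeholder compounds and category letters.
toy = [
    ("nc1", "A", "C"),
    ("nc2", "A", "C"),
    ("nc3", "A", "C"),
    ("nc4", "E", "N"),
    ("nc1", "A", "C"),  # duplicate, counted once
]
counts = category_pair_counts(toy)
share = coverage_of_top_pairs(counts, 1)
```

A high share for small `k` corresponds to the "clumped" distribution the notes describe, where labelling a few category pairs covers most of the noun compounds.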

How Far to Descend?
Anatomy: 250 CPs
  187 (75%) remain first level
  56 (22%) descend one level
  7 (3%) descend two levels
Natural Science (H01): 21 CPs
  1 (4%) remain first level
  8 (39%) descend one level
  12 (57%) descend two levels
Neoplasm (C04): 3 CPs
  3 (100%) descend one level
A: We descended one level most of the time for the sub-hierarchies E (Analytical, Diagnostic and Therapeutic Techniques), G (Biological Sciences) and N (Health Care) (around 50% of the time for these categories combined). We never descended for B (Organisms) and did so only for A13 (Animal Structures) in A. In all but three cases, the descending was done for the second noun only. This may be because the second noun usually plays the role of the head noun in two-word noun compounds in English, thus requiring more specificity. Alternatively, it may reflect the fact that for the examples we have examined so far, the more heterogeneous terms dominate the second noun. Further examination is needed to answer this decisively.
DON’T SAY THIS
Although we began with 250 CPs in the A category, when a descend operation is performed, the CP is split into two or more CPs at the level below. Thus the total number of CPs after all assignments were made was 416.

Evaluation
Apply the rules to a test set
Accuracy:
  Anatomy: 91% accurate
  Natural Science: 79%
  Diseases: 100%
Total:
  89.6% via intra-category averaging
  90.8% via extra-category averaging
We tested the resulting classifications on a randomly chosen test set (20% of the NCs for each CP), entirely distinct from the labeled set, and used the classifications found above to automatically predict which relations should be assigned to the member NCs. The testing was done by an independent evaluator with biomedical training, who found the accuracies listed above. The lower accuracy for the Natural Sciences category indicates how our results depend on the properties of the lexical hierarchy. We can generalize well if the sub-hierarchies are in a well-defined semantic relation with their ancestors. If they are a list of "unrelated" topics, we cannot use the generalization of the higher levels; most of the mistakes for the Natural Sciences CPs in fact occurred when we failed to descend for broad terms such as Physics.
DON’T SAY THIS
Performing this evaluation allowed us to find such problems and update the rules; the resulting categorization should now be more accurate.
INTRA: average within each class, then average those averages, so all classes get the same weight
EXTRA: average across all data points, so classes with more data points count more
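The difference between the two averaging schemes is easiest to see on a small example. The sketch below uses hypothetical counts, since the slide does not give the per-category test set sizes; the function names are also invented for illustration.

```python
def intra_category_average(results):
    """Average the per-class accuracies, giving every class equal weight."""
    per_class = [correct / total for correct, total in results.values()]
    return sum(per_class) / len(per_class)


def extra_category_average(results):
    """Pool all data points, so larger classes count for more."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


# Hypothetical (correct, total) counts per class; not the project's data.
toy_results = {"X": (9, 10), "Y": (1, 2)}
intra = intra_category_average(toy_results)  # (0.9 + 0.5) / 2
extra = extra_category_average(toy_results)  # 10 / 12
```

With these toy counts the small class Y pulls the intra-category figure down to 0.70, while the pooled figure stays near 0.83, which is the same kind of gap the slide reports between its two totals.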

Sweeping Application
In conjunction with Hsueh lab at Stanford
Problem: orphan receptors
  Recently used text to help identify the ligands that react with them
Idea: better search to look at related chemicals
  Sophisticated text search to find a subset of articles
  Apply NLP to extract relations and narrow the subset
  Cross-link with various databases and ontologies to help formulate hypotheses.

Thank you! For more information: bailando.sims.berkeley.edu