1 The BioText Project
Marti Hearst, Associate Professor, SIMS, UC Berkeley

2 BioText Project Goals
Provide fast, flexible, intelligent access to information for use in biosciences applications.
Focus on textual information, tightly integrated with other resources: ontologies and record-based databases.

3 BioText: A Two-Sided Approach
Sophisticated database design and algorithms, combined with empirical computational linguistics algorithms, operating over resources such as SwissProt, BLAST, Medline, journal full text, MeSH, GO, and WordNet.

4 People
Computational linguistics: Barbara Rosario, Barbara Engelhardt, Preslav Nakov
Database research: Ariel Schwartz, Gaurav Bhalotia
Bioscience: Ting Ting Zhang, Anita Wilhelm
Biosciences collaborators: Hsueh lab at Stanford Medical, Altman lab at Stanford SMI, Arkin lab at Berkeley, others?

5 Database Research Issues
Efficient querying and updating of semi-structured information, fuzzy synonyms, and collection subsets.
Efficiently and effectively combining relational databases, text databases, layers of processing, and hierarchical ontologies.

6 Computational Language Goals
Recognizing and annotating entities within textual documents.
Identifying semantic relations among entities.
To (eventually) be used in tandem with semi-automated reasoning systems.

7 Computational Linguistics Goals
Mark up text with semantic relations:
<protein1><inhibits><protein2>
<protein><binds-with><receptor>
<chemical><increases><level-of><chemical>
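As a rough illustration only (not the project's actual annotation format), relation triples like the ones above could be represented with a simple data structure; the class and field names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    """One extracted semantic relation between two text entities (illustrative only)."""
    arg1: str       # e.g. a protein or chemical mention
    relation: str   # e.g. "inhibits", "binds-with", "increases-level-of"
    arg2: str

    def as_markup(self) -> str:
        # Render in the tag style shown on the slide.
        return f"<{self.arg1}><{self.relation}><{self.arg2}>"

examples = [
    Relation("protein1", "inhibits", "protein2"),
    Relation("protein", "binds-with", "receptor"),
    Relation("chemical", "increases-level-of", "chemical"),
]
for r in examples:
    print(r.as_markup())
```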

8 Recent Results
Fast, simple algorithm for recognizing abbreviation definitions.
Simpler than existing approaches, with higher precision and recall.
Idea: work backwards from the end of the abbreviation (a sketch follows below).
Examples: International Business Machines (IBM); transcription (TSP), matching the P, S, and T in turn: transcriPtion, tranScription, Transcription.
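A minimal sketch of the backward-matching idea, assuming the candidate long form is the text immediately preceding the parenthesized short form. This is a simplified illustration, not the exact published algorithm.

```python
from typing import Optional

def find_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Scan the short form right-to-left, matching each character against the
    nearest occurrence (also right-to-left) in the candidate long form.
    The first character of the short form must match at the start of a word."""
    s_index = len(short_form) - 1          # position in the short form
    l_index = len(candidate) - 1           # position in the candidate text

    while s_index >= 0:
        c = short_form[s_index].lower()
        if not c.isalnum():                # skip punctuation in the short form
            s_index -= 1
            continue
        # Move left through the candidate until the character matches; the
        # match for the first short-form character must also begin a word.
        while l_index >= 0 and (
            candidate[l_index].lower() != c
            or (s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None                    # no match for this character
        l_index -= 1
        s_index -= 1

    return candidate[l_index + 1:]

# Examples from the slide:
print(find_long_form("TSP", "transcription"))                      # transcription
print(find_long_form("IBM", "International Business Machines"))    # International Business Machines
```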

9 Recent Result: Descent of Hierarchy
Idea: Use the top levels of a lexical hierarchy to identify semantic relations.
Hypothesis: A particular semantic relation holds between all 2-word noun compounds that can be categorized by a MeSH category pair.

10 Recent Result
Top-level MeSH categories can be used to indicate which relations hold between noun compounds. Examples (noun compound, followed by the MeSH category of each noun):
headache recurrence: C, C
headache pain: C, G
breast cancer cells: A, C, A11

11 Linguistic Motivation
Noun compounds can be cast into a head-modifier relation, assuming the head noun has an argument and qualia structure:
(used-in): kitchen knife
(made-of): steel knife
(instrument-for): carving knife
(used-on): putty knife
(used-by): butcher’s knife

12 Distribution of Frequent Category Pairs
We placed these CPs into a two-dimensional table, with the MeSH category for the first noun on the X axis and the MeSH category for the second noun on the Y axis. Each cell holds the number of NCs classified under the corresponding pair of MeSH categories.
A visualization tool (Spotfire) allowed us to explore the dataset, see which areas of the category space are most heavily populated, and get a feel for whether or not the distribution is uniform.
If our hypothesis holds (that NCs falling within the same category pair are assigned the same relation), and if most of the NCs fall within only a few category pairs, then we only need to determine which relations hold for a subset of the possible pairs. Thus, the more clumped the distribution, the easier our task potentially becomes.
The figure shows that some areas of the CP space have a higher concentration of unique NCs (for example, Anatomy, and the E (Therapeutic Techniques) through N (Health Care) sub-hierarchies). A small tallying sketch follows below.
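A minimal sketch of how such a two-dimensional CP table could be tallied, assuming each noun of an NC has already been mapped to a MeSH tree code; the input codes below are illustrative only, not taken from the dataset.

```python
from collections import Counter

def truncate(tree_code: str, depth: int = 1) -> str:
    """Keep only the first `depth` dot-separated segments of a MeSH tree code,
    e.g. truncate('A01.456.505', 1) -> 'A01'."""
    return ".".join(tree_code.split(".")[:depth])

def cp_distribution(noun_compounds, depth: int = 1) -> Counter:
    """Count noun compounds per category pair (CP).

    `noun_compounds` is an iterable of (mesh_code_noun1, mesh_code_noun2) pairs,
    one per NC; the result maps each truncated CP to its NC count, i.e. the cell
    values of the two-dimensional table described above."""
    return Counter(
        (truncate(c1, depth), truncate(c2, depth)) for c1, c2 in noun_compounds
    )

# Illustrative input only.
ncs = [("C10.597", "C23.550"), ("C10.597", "G11.561"), ("A01.456", "C04.588")]
print(cp_distribution(ncs))
```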

13 How Far to Descend?
Anatomy: 250 CPs
187 (75%) remain at the first level
56 (22%) descend one level
7 (3%) descend two levels
Natural Science (H01): 21 CPs
1 (4%) remains at the first level
8 (39%) descend one level
12 (57%) descend two levels
Neoplasm (C04): 3 CPs
3 (100%) descend one level
Answer: We descended one level most of the time for the sub-hierarchies E (Analytical, Diagnostic and Therapeutic Techniques), G (Biological Sciences), and N (Health Care) (around 50% of the time for these categories combined). We never descended for B (Organisms), and did so only for A13 (Animal Structures) in A. In all but three cases, descending was done for the second noun only. This may be because the second noun usually plays the role of the head noun in two-word noun compounds in English, thus requiring more specificity. Alternatively, it may reflect the fact that, for the examples we have examined so far, the more heterogeneous terms dominate the second noun. Further examination is needed to answer this decisively. (A sketch of the descent procedure follows below.)
DON’T SAY THIS: Although we began with 250 CPs in the A category, when a descend operation is performed the CP is split into two or more CPs at the level below. Thus the total number of CPs after all assignments were made was 416.
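A hedged sketch of the descent procedure described above (not the actual rule-construction code), assuming a labeled set of (MeSH code for noun 1, MeSH code for noun 2, relation) triples. It descends on the second noun only, per the note above; the depth parameters are illustrative.

```python
def assign_rules(labeled_ncs, depth1=1, start_depth2=1, max_depth2=3):
    """labeled_ncs: list of (mesh_code_noun1, mesh_code_noun2, relation) triples,
    with dotted MeSH tree codes such as "C04.588". If all NCs under a category
    pair share one relation, that CP becomes a rule at this level; otherwise the
    CP is split by descending one level on the second noun and trying again."""
    def truncate(code, depth):
        # Keep the first `depth` dot-separated segments, e.g. ("C04.588", 1) -> "C04".
        return ".".join(code.split(".")[:depth])

    rules = {}

    def recurse(ncs, depth2):
        groups = {}
        for c1, c2, rel in ncs:
            cp = (truncate(c1, depth1), truncate(c2, depth2))
            groups.setdefault(cp, []).append((c1, c2, rel))
        for cp, group in groups.items():
            relations = {rel for _, _, rel in group}
            if len(relations) == 1:
                rules[cp] = relations.pop()     # one relation: stop descending
            elif depth2 >= max_depth2:
                rules[cp] = None                # still ambiguous at maximum depth
            else:
                recurse(group, depth2 + 1)      # split this CP one level down

    recurse(labeled_ncs, start_depth2)
    return rules
```

Applying a rule at test time would then amount to truncating a new NC's codes at increasing depths until a CP present in `rules` is found.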

14 Evaluation
Apply the rules to a test set.
Accuracy:
Anatomy: 91%
Natural Science: 79%
Diseases: 100%
Total: 89.6% via intra-category averaging, 90.8% via extra-category averaging
We tested the resulting classifications on a randomly chosen test set (20% of the NCs for each CP), entirely distinct from the labeled set, and used the classifications found above to automatically predict which relations should be assigned to the member NCs. The testing was done by an independent evaluator with biomedical training, who found the accuracies listed above.
The lower accuracy for the Natural Science category indicates how our results depend on the properties of the lexical hierarchy. We can generalize well if the sub-hierarchies are in a well-defined semantic relation with their ancestors. If they are a list of "unrelated" topics, we cannot use the generalization of the higher levels; most of the mistakes for the Natural Science CPs occurred, in fact, when we failed to descend for broad terms such as Physics.
DON’T SAY THIS: Performing this evaluation allowed us to find such problems and update the rules; the resulting categorization should now be more accurate.
INTRA: average within each class, then take the average of those averages (the same weight for all classes). EXTRA: average across all data points (classes with more data points count more). A small worked example of the two averaging schemes follows below.
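A small sketch showing the difference between the two averaging schemes; the per-class counts below are made up for illustration and are not the actual test-set sizes.

```python
def intra_extra_accuracy(per_class_results):
    """`per_class_results` maps each class to (num_correct, num_total).
    Intra-category: average the per-class accuracies (each class weighted equally).
    Extra-category: pool all data points (larger classes weigh more)."""
    per_class_acc = [c / t for c, t in per_class_results.values()]
    intra = sum(per_class_acc) / len(per_class_acc)
    total_correct = sum(c for c, _ in per_class_results.values())
    total = sum(t for _, t in per_class_results.values())
    extra = total_correct / total
    return intra, extra

# Illustrative counts only.
results = {"Anatomy": (91, 100), "Natural Science": (79, 100), "Diseases": (30, 30)}
print(intra_extra_accuracy(results))   # -> (0.90, ~0.87) with these made-up counts
```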

15 Sweeping Application
In conjunction with the Hsueh lab at Stanford.
Problem: orphan receptors; text has recently been used to help identify the ligands that react with them.
Idea: better search to look at related chemicals (a sketch of the workflow follows below):
Sophisticated text search to find a subset of articles.
Apply NLP to extract relations and narrow the subset.
Cross-link with various databases and ontologies to help formulate hypotheses.
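A rough sketch of how the three steps might compose; every function argument here is a hypothetical placeholder, not an existing BioText component.

```python
def candidate_ligand_leads(receptor, search_fn, extract_relations_fn, crosslink_fn):
    """search_fn(receptor)       -> list of candidate articles (focused text search)
    extract_relations_fn(doc)    -> list of (arg1, relation, arg2) triples found by NLP
    crosslink_fn(triple)         -> triple enriched with links to databases/ontologies"""
    articles = search_fn(receptor)
    # Keep only relations that mention the receptor, narrowing the subset.
    hits = [t for doc in articles for t in extract_relations_fn(doc)
            if receptor in (t[0], t[2])]
    # Cross-link the surviving relations to help formulate hypotheses.
    return [crosslink_fn(t) for t in hits]
```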

16 Thank you! For more information: bailando.sims.berkeley.edu

