Text Mining for Bioscience Applications: The State of the Art Marti Hearst University of California, Berkeley.


2 Text Mining for Bioscience Applications: The State of the Art Marti Hearst University of California, Berkeley

3 Outline Search vs. Discovery Why is text analysis difficult? Some current approaches Future directions

4 My Background Computer Scientist by training –NOT a biologist Professor in an interdisciplinary program –School of Information Management & Systems (SIMS) –Affiliated with the UCSF Bioinformatics Grad Group Research fields are –Computational Linguistics –Search (Information Retrieval) –User Interfaces and Information Visualization Have focused for a while on bioscience text Have received research support from Genentech

5 Search vs. Discovery Search: Finding hay in a haystack Discovery: Creating a new kind of hay (Image: Monet, Haystack with Snow, Morning)

6 Search Goals More accurate results More comprehensive results –Thesaurus expansion Intelligent summaries of results Organize results along biologically relevant lines Better user interfaces

7 Knowledge Discovery from Text How to discover new information … … As opposed to looking up what’s already known. Method: –Create hypotheses –Use large text collections to gather evidence to refute or support hypotheses –Do lab tests to verify promising results

8 Discovery Goals Genomics –Automatically build gene networks –Discover gene functions Pharmacology –Help determine which drugs can help cure a disease –Help determine which genetic traits will lead to a reaction to a drug Etiology –Discover underlying causes of disease

9 Why is Automated Text Analysis Difficult?

10 USA Today, 2/26/04, Szabo & Appleby Why is automated text analysis difficult? “Avastin, developed by South San Francisco-based Genentech (DNA), was approved for advanced colorectal cancer and for patients who haven't received other chemotherapy, according to the Food and Drug Administration.” –What is “approved” doing in this sentence? John was approved for advancement -> gets a promotion. Avastin was approved for cancer -> to fight cancer. Avastin was approved for patients -> to consume to fight cancer. –What kind of patients is it approved for? Ambiguous. Could be for anyone who hasn’t received chemotherapy, or only those patients with advanced colorectal cancer who haven’t received chemotherapy.

11 USA Today, 2/26/04, Szabo & Appleby Why is automated text analysis difficult? “This could easily be a multibillion-dollar drug,” McCamant says. –“This” refers to concepts mentioned in earlier sentences.

12 USA Today, 2/26/04, Szabo & Appleby Why is automated text analysis difficult? “Avastin opens up this new gateway for cancer care,” says William Li, president of the Angiogenesis Foundation in Massachusetts. “It's the first in a fleet of other drugs.” –Is Avastin a vehicle? It opens gateways and travels in a fleet!

13 Why is automated text analysis difficult? There are many indirect ways to say things: –A two-dose combined hepatitis A and B vaccine would facilitate immunization programs. The vaccine helps prevent hep B. –These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135. The treatment TJ-135 helps cure hep. –Effect of interferon on hepatitis B. There is an unspecified effect of interferon on hep B.

14 What do we do? Solve sub-problems –Extract certain types of entities Gene/protein names Abbreviation definitions –Classify the noun phrases using ontologies MeSH, LocusLink, GO, etc. –Define relationship types; try to recognize them. –Many other subproblems are actively being worked on Word sense disambiguation Co-reference resolution

15 Two Main Approaches Hand-built Rules Machine Learning

16 Two Main Approaches Hand-built rules –Can be very accurate –Are also very “brittle” –Don’t scale Machine learning –Usually requires labeled training data Unsupervised methods under development –Can be made to scale –Is the way of the future

17 Abbreviation Definition Recognition A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, PSB 2003 Kauai, Jan 2003 Fast, simple algorithm for recognizing abbreviation definitions. –Simpler and faster than the rest Other approaches are cubic or quadratic in time –Higher precision and recall –Idea: Work backwards from the end Examples: –In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). –Gcn5-related N-acetyltransferase (GNAT) In future: –Use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
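The "work backwards from the end" idea can be sketched in Python as follows. This is a minimal illustration, not the authors' released code: the function name `find_best_long_form` is an assumption, the candidate text before the parenthesis is passed in by hand, and any further filtering of candidates is omitted. Each character of the short form is matched right to left against the text, and the first character of the short form must begin a word.

```python
from typing import Optional


def find_best_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Return the shortest suffix of `candidate` that defines `short_form`.

    Characters of the short form are matched right to left. The first
    character of the short form must match the first character of a word.
    Returns None when no consistent match exists.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue  # skip punctuation inside the short form
        # Move left until the character matches; for the first short-form
        # character, also require that the match starts a word.
        while (l_index >= 0 and candidate[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend left to the start of the word containing the last match.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


if __name__ == "__main__":
    text = ("In eukaryotes, the key to transcriptional regulation of the "
            "Heat Shock Response is the Heat Shock Transcription Factor")
    print(find_best_long_form("HSF", text))
    print(find_best_long_form("GNAT", "Gcn5-related N-acetyltransferase"))
```

The scan touches each character of the candidate text at most once, which is consistent with the slide's claim that the method is simpler and faster than cubic or quadratic alternatives.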

18 Gene name co-occurrence A literature network of human genes for high-throughput analysis of gene expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet. 2001 May;28(1):21-8. PubGene Assumption: If two genes are co-mentioned in a MEDLINE record, there is an underlying biological relationship. Example: Genes highly upregulated at time point 6 h (6H) in the fibroblast serum response. Green: upregulation Red: downregulation

19 Gene name co-occurrence A literature network of human genes for high-throughput analysis of gene expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet. 2001 May;28(1):21-8. Evaluation: 29-40% of the pairs were incorrect 45% of OMIM pairs found 51% of DIP pairs found (DB of Interacting Proteins)
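The co-mention assumption can be illustrated with a short Python sketch. It is only a toy: the function name `cooccurrence_network` and the input format (one list of already-recognized gene symbols per record) are assumptions, and the gene-name recognition step that a real system needs is not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records):
    """Count, for every pair of genes, how many records mention both.

    `records` is an iterable of lists of gene symbols, one list per
    literature record. The result maps (gene_a, gene_b) pairs, with the
    two symbols in alphabetical order, to the number of shared records.
    """
    edges = Counter()
    for genes in records:
        # De-duplicate within a record so each record counts once per pair.
        for pair in combinations(sorted(set(genes)), 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    # Toy records with placeholder gene symbols, for illustration only.
    toy_records = [["FOS", "JUN"], ["JUN", "FOS", "MYC"], ["MYC"]]
    for pair, weight in cooccurrence_network(toy_records).items():
        print(pair, weight)
```

Edge weights give a natural threshold: pairs co-mentioned in only one record are the most likely to be among the incorrect pairs reported in the evaluation.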

20 How to find functions of genes? Have the genetic sequence Don’t know what it does But … –Know which genes it coexpresses with –Some of these have known function So … infer function based on function of co-expressed genes –This problem was suggested by Michael Walker and others at Incyte Pharmaceuticals

21 Gene Co-expression: Role in the genetic pathway [Diagram labels: g?, h?, PSA, Kall., PAP] Other possibilities as well

22 Make use of the literature Look up what is known about the other genes. Different articles in different collections Look for commonalities –Similar topics indicated by Subject Descriptors –Similar words in titles and abstracts adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies...
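Looking for commonalities can be sketched as a simple count of terms shared across the literature of the co-expressed genes. The function name `shared_terms`, the `min_genes` threshold, and the toy term assignments are all assumptions made for illustration; the gene keys are placeholders, and only the term vocabulary comes from the slide.

```python
from collections import Counter


def shared_terms(terms_by_gene, min_genes=2):
    """Return terms attached to at least `min_genes` genes.

    `terms_by_gene` maps a gene name to the subject descriptors or
    title/abstract words found in articles about that gene. Results are
    ordered by how many genes share the term, then alphabetically.
    """
    counts = Counter()
    for terms in terms_by_gene.values():
        counts.update(set(terms))  # count each term once per gene
    common = [t for t, n in counts.items() if n >= min_genes]
    return sorted(common, key=lambda t: (-counts[t], t))


if __name__ == "__main__":
    # Invented toy assignments; not real annotation data.
    toy = {
        "gene_a": ["prostate", "tumor markers", "antibodies"],
        "gene_b": ["prostate", "adenocarcinoma", "tumor markers"],
        "gene_c": ["prostate", "neoplasm"],
    }
    print(shared_terms(toy))
```

Terms shared by most of the known genes are the ones that suggest a hypothesis about the unknown gene.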

23 Formulate a Hypothesis Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer New tack: do some lab tests –See if mystery gene is similar in molecular structure to the others –If so, it might do some of the same things they do

24 Etiology Example Complementary structures in disjoint science literatures. Don R. Swanson. In Proceedings of SIGIR ’91 Goal: find cause of disease –Magnesium-migraine connection Given –medical titles and abstracts –a problem (incurable rare disease) –some medical expertise Find causal links among titles –symptoms –drugs –results

25 Gathering Evidence [Diagram: migraine; separate links stress–magnesium, CCB–magnesium, SCD–magnesium, PA–magnesium]

26 Gathering Evidence [Diagram: migraine and magnesium connected through stress, CCB, PA, SCD]

27 Swanson’s Linking Approach Two of his hypotheses have received some experimental verification. His technique –Only partially automated –Required medical expertise Recently others have made progress automating it.
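The linking pattern in the migraine example can be sketched as a search for terms reachable from the starting term only through intermediates. This is a toy illustration, not Swanson's procedure: the function name `indirect_links` and the representation of the literature as a list of directly linked term pairs are assumptions, and the expert judgment the slide mentions is not modeled.

```python
def indirect_links(source, links):
    """Find terms connected to `source` only through intermediate terms.

    `links` is an iterable of (term, term) pairs that are directly
    connected in the literature. The result maps each indirectly linked
    term to the set of intermediates that connect it to `source`.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    direct = neighbors.get(source, set())
    candidates = {}
    for middle in direct:
        for target in neighbors[middle]:
            # Keep only terms never linked to the source directly.
            if target != source and target not in direct:
                candidates.setdefault(target, set()).add(middle)
    return candidates


if __name__ == "__main__":
    # Links taken from the Gathering Evidence slides.
    evidence = [
        ("migraine", "stress"), ("stress", "magnesium"),
        ("migraine", "CCB"), ("CCB", "magnesium"),
        ("migraine", "SCD"), ("SCD", "magnesium"),
        ("migraine", "PA"), ("PA", "magnesium"),
    ]
    print(indirect_links("migraine", evidence))
```

The number of distinct intermediates supporting a candidate gives a rough way to rank hypotheses before any lab test.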

28 Automating Swanson-style Discovery Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST. UMLS defines Semantic Types Every MeSH term is assigned one or more Semantic Types –Interferon type II falls within both: Immunologic Factor and Pharmacologic Substance Each PubMed article is assigned a set of MeSH terms The idea is to characterize a set of articles according to which semantic types their MeSH terms fall into.

29 Automating Swanson-style Discovery Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST. Approach: –User inputs topic T of interest –User selects 2 sets from a small number of sets of UMLS semantic types –System Searches PubMed for articles about T Selects the important MeSH terms as determined by the user-chosen semantic type categories Searches PubMed for articles that contain these MeSH terms Combines the MeSH terms that result from these retrieved documents; call this result C If a PubMed search on words from T and c from C is empty, place c as a candidate in a final result set R Report those terms in R that fall into the second user-selected semantic type set.
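The steps on this slide can be sketched against a toy in-memory collection. Everything here is a simplification made for illustration: the function name `open_discovery`, the representation of each article as a set of terms, and the semantic type labels `"intermediate"` and `"target"` are assumptions, no PubMed access is involved, and the selection of "important" terms is reduced to a semantic type filter.

```python
def open_discovery(topic, articles, semantic_types, first_types, second_types):
    """Toy version of the slide's procedure over a list of term sets.

    `articles` is a list of sets of terms, `semantic_types` maps a term to
    its set of type labels, and `first_types` / `second_types` are the two
    user-selected sets of type labels.
    """
    def search(*terms):
        # Stand-in for a literature search: articles containing all terms.
        return [a for a in articles if all(t in a for t in terms)]

    # Terms from articles about the topic that fall in the first type set.
    middle = {
        term for article in search(topic) for term in article
        if term != topic and semantic_types.get(term, set()) & first_types
    }
    # Combine the terms of articles retrieved for those terms: result C.
    combined = {t for m in middle for article in search(m) for t in article}
    # Keep c only if no article mentions both the topic and c: result R.
    result = {c for c in combined if c != topic and not search(topic, c)}
    # Report the members of R that fall in the second type set.
    return {c for c in result if semantic_types.get(c, set()) & second_types}


if __name__ == "__main__":
    toy_articles = [
        {"migraine", "stress"}, {"stress", "magnesium"},
        {"migraine", "CCB"}, {"CCB", "magnesium"},
    ]
    toy_types = {  # hypothetical labels, not real UMLS semantic types
        "stress": {"intermediate"}, "CCB": {"intermediate"},
        "magnesium": {"target"},
    }
    print(open_discovery("migraine", toy_articles, toy_types,
                         {"intermediate"}, {"target"}))
```

The emptiness check is what makes a candidate a hypothesis rather than a known fact: a term already co-mentioned with the topic is dropped.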

30 Automating Swanson-style Discovery Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST. Results: have successfully reproduced the 7 examples they tried, with very little manual intervention Example: input topic is Raynaud’s disease

31 Main Ideas for NLP Approach Assign Semantics using –Statistics –Hierarchical Lexical Ontologies to generalize –Redundancy in the data Build up Layers of Representation –Syntactic and Semantic –Use these in a feedback loop

32 Automated Relation Assignment Recall the problem: –A two-dose combined hepatitis A and B vaccine would facilitate immunization programs. The vaccine helps prevent hep B. Identified 7 relations that can hold between Treatments and Diseases Used Machine Learning to address this –Graphical models –Neural nets Marked up the text with syntactic and semantic information –MeSH labels turn out to be very important

33 Automated Relation Assignment Use Machine Learning to address this –Graphical models –Neural nets Mark up the text with syntactic and semantic information –MeSH labels turn out to be very important

34 Automated Relation Assignment Results

35 Future Directions In text analysis: –Move away from hand-built rules –More focus on labeling with semantics In problems tackled –There are so many possibilities! –Help with automated curation

36 Thank you! Visit our site: biotext.berkeley.edu

