Mining the Medical Literature. Chirag Bhatt, October 14th, 2004.


1 Mining the Medical Literature. Chirag Bhatt, October 14th, 2004

2 Why mine data? Medical, genomics, and proteomics research. Find causal links between symptoms or diseases and drugs or chemicals. Gene comparison.

3 An example. Problem: What is causing an uncharacteristic behavior in protein production? Solution: Find which genes play a role in amino acid synthesis. How? Search through the online literature for genes that play a role in amino acid synthesis.

4 Search vs. Discover

                              Search (goal oriented)    Discover (opportunistic)
Structured data (database)    Data retrieval            Data mining
Unstructured data (text)      Information retrieval     Text mining

5 Data Retrieval. Company database, e.g. customer records, product inventory. Search entity: (structured) records. Query (goal-driven): What is the address of our client? How many widgets are in stock? SQL, Oracle, DB2, etc.

6 Information Retrieval. Google, A9, AltaVista. Query (goal-driven). Search entity: (unstructured) documents in variable formats: HTML, PDF, etc.

7 Data Mining. Structured data set. Generally a large amount of (historical) data. Find relations, patterns, or trends in a database (opportunistic). E.g. “beer and diapers”.
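
The “beer and diapers” pattern is usually described in terms of support and confidence over purchase records. The sketch below computes both on a small made-up set of baskets; the data and the function names `pair_support` and `confidence` are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data (one set of items per transaction).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk", "beer"},
]


def pair_support(baskets):
    """Fraction of baskets that contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
print(support[("beer", "diapers")])
print(confidence(transactions, "diapers", "beer"))
```

Pairs with high support and high confidence are the opportunistic “patterns” the slide refers to; nobody asked for them with a query in advance.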

8 Text Mining. Unstructured data set: documents, publications, abstracts, web pages. Discover useful and previously unknown “gems” of information in large text collections using patterns, trends, and domain knowledge.

9 Need for mining text. Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation).

10 Why Text Mining in Medical Literature? Many multi-functional genes: screen for the functionally interesting ones. Complexity of needs is increasing: individual genes -> families of genes. Manual text mining? Not really! Availability of published literature online.

11 Functionally Coherent Genes. A group of genes that exhibit similar experimental features. Examples: amino acid metabolism, electron transport, stress response.

12 Difficulties faced in finding functionally coherent genes. Most genes express multi-functionality. Some genes have been studied extensively and some have only just been discovered.

13 Semantic neighbor. Two articles are semantic neighbors if they have similar word usage. Use statistical natural language processing to access and interpret online text.
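
One simple way to make “similar word usage” concrete is cosine similarity between word-count vectors. This is a minimal sketch under that assumption; the slides do not say which similarity measure or term weighting was used, and the article texts and function names here are invented for illustration.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words counts for a lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_neighbors(docs, doc_id, k=2):
    """Return the ids of the k articles whose word usage is closest to docs[doc_id]."""
    target = word_counts(docs[doc_id])
    scored = [
        (cosine(target, word_counts(text)), other)
        for other, text in docs.items()
        if other != doc_id
    ]
    # Highest similarity first; ties broken by article id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


# Hypothetical abstracts.
docs = {
    "a1": "arginine biosynthesis gene regulates amino acid metabolism",
    "a2": "amino acid biosynthesis pathway gene in yeast",
    "a3": "electron transport chain in mitochondria",
}
print(semantic_neighbors(docs, "a1", k=1))
```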

14 Methodology

15 Find semantic neighbors in the document set. If any article about a common function references at least one gene in the group, then the group is functionally coherent.

16 Neighbor divergence: scoring method. Each article's relevance to the gene group is scored by the number of its semantic neighbors that reference the group.
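
The scoring rule can be sketched as a count over precomputed neighbor lists. The data structures (`neighbors`, `article_genes`) and the gene names are hypothetical; the sketch only assumes that each article's semantic neighbors and the genes each article references are already known.

```python
def neighbor_scores(neighbors, article_genes, gene_group):
    """Score each article by how many of its semantic neighbors reference a gene in the group."""
    group = set(gene_group)
    scores = {}
    for article, neighbor_ids in neighbors.items():
        scores[article] = sum(
            1 for n in neighbor_ids if article_genes.get(n, set()) & group
        )
    return scores


# Hypothetical inputs: semantic neighbors per article, genes referenced per article.
neighbors = {"a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2"]}
article_genes = {"a1": {"ARG1"}, "a2": {"ARG3", "HIS4"}, "a3": {"CYC1"}}

scores = neighbor_scores(neighbors, article_genes, {"ARG1", "ARG3"})
print(scores)
```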

17 Neighbor divergence scores. If the score distribution differs from a Poisson distribution, then the gene group represents a biological function. The log ratio for a Poisson distribution should be flat along the horizontal axis.
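
A rough way to quantify “different from Poisson” is to compare the observed score histogram with a Poisson distribution of the same mean. The sketch below uses the Kullback-Leibler divergence for that comparison; the choice of KL divergence and of the sample mean as the Poisson rate are assumptions made here, since the slide does not give the formula.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing k under a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def divergence_from_poisson(scores):
    """KL divergence between the empirical score distribution and a same-mean Poisson."""
    n = len(scores)
    lam = sum(scores) / n
    if lam == 0:
        return 0.0  # all scores are zero, which matches a Poisson with rate 0
    total = 0.0
    for k, count in Counter(scores).items():
        p = count / n             # observed fraction of articles with score k
        q = poisson_pmf(k, lam)   # fraction expected under the Poisson model
        total += p * math.log(p / q)  # p * (log ratio) for this score
    return total


# Hypothetical article scores for two gene groups with the same mean score.
random_like = [0, 1, 1, 2, 1, 0, 1, 2, 1, 1]
coherent_like = [0] * 9 + [10]
print(divergence_from_poisson(random_like), divergence_from_poisson(coherent_like))
```

A group whose articles mostly score near the mean stays close to the Poisson model, while a group with a few very high-scoring articles diverges strongly.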

18 Need to filter results. Well-studied genes generally tend to have semantic neighbors that refer to the same gene. Such a neighbor may not be relevant to the group function but still increases the score: a false positive. So only articles that refer to different genes are considered.

19 Evaluation. Report the percentile of a functional group of genes. Calculate precision and recall at different cutoff levels (next slide). Replace legitimate genes with irrelevant genes in the group.

20 Precision and Recall
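
For reference, precision and recall at a score cutoff can be computed as in the sketch below. The group names, scores, and the convention of reporting precision as 0.0 when nothing passes the cutoff are illustrative assumptions, not values from the presentation.

```python
def precision_recall(group_scores, true_groups, cutoff):
    """Precision and recall when every group scoring at or above `cutoff` is called functional."""
    predicted = {name for name, score in group_scores.items() if score >= cutoff}
    true_positives = len(predicted & true_groups)
    # Convention assumed here: precision is 0.0 when no group passes the cutoff.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_groups) if true_groups else 0.0
    return precision, recall


# Hypothetical scores: g* are known functional groups, r* are random groups.
group_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "r1": 0.7, "r2": 0.1}
true_groups = {"g1", "g2", "g3"}

for cutoff in (0.75, 0.5, 0.05):
    print(cutoff, precision_recall(group_scores, true_groups, cutoff))
```

Lowering the cutoff trades precision for recall, which is why the evaluation reports both at several cutoff levels.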

21 Results. Sample space: 19 known yeast groups and 1900 random groups.

22 Results

23 Replacing functional genes

24 Limitations of neighbor divergence. Neighbor divergence helps group genes but does not tell us their function. Work is based on abstracts only. Searching the entire literature may prove challenging: break it into smaller components.

25 Another mining approach: extracting synonymous gene and protein terms.

26 Why find synonyms? Genes and proteins are often associated with multiple names across articles and subdomains. More names keep getting added as new functional or structural information is discovered. Finding synonyms improves search and analysis.

27 Current work. Biological databases such as GenBank and SWISSPROT include synonyms, but: not up to date; disagreement on some synonyms; laborious manual curation and review. Need for automation.

28 Two-step problem. Step 1: identifying gene and protein names, done by state-of-the-art taggers. Step 2: determining whether these names are synonymous. We’ll discuss this further…

29 Current synonym approaches. Synonymous gene and protein names represent the same biological substance: they exhibit identical biological functions and have the same gene or amino acid sequences. Other approaches: string matching; matching abbreviations to full forms.

30 Gene and Protein Tagging. Identification step. Uses BLAST techniques and domain knowledge to pick out gene and protein terms. Heuristics: synonyms usually occur within the same sentence; synonyms are mentioned in the first few pages of an article.

31 Synonym detection approaches. Unsupervised: ‘Similarity’, based on contextual similarity. Semi-supervised: ‘Snowball’, extracts structured relations using patterns. Supervised: text classification/SVM. Hand-crafted extraction: GPE. Combined system.

32 Combined Approach. Combine the output of SnowBall, SVM, and GPE. Each system gives a confidence score for each synonym pair, where s is a synonym pair and Conf_E(s) is the confidence assigned to s by the individual extraction system E.
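
The combination formula itself did not survive in this transcript. One common way to merge per-system confidences is a noisy-OR, Conf(s) = 1 - prod_E (1 - Conf_E(s)), which treats the systems as independent; the sketch below uses that rule as an assumption and should not be read as the formula from the slide. The confidence values are made up.

```python
def combined_confidence(confidences):
    """Noisy-OR combination (assumed rule): 1 minus the chance that every system misses the pair."""
    miss = 1.0
    for conf in confidences.values():
        miss *= 1.0 - conf
    return 1.0 - miss


# Hypothetical per-system confidences Conf_E(s) for one synonym pair s.
pair_confidences = {"Snowball": 0.5, "SVM": 0.5, "GPE": 0.0}
print(combined_confidence(pair_confidences))
```

Under this rule a pair found confidently by any single system keeps a high combined score, and agreement between systems pushes the score higher still.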

33 Unsupervised: Similarity. Context based: all words occurring within an x-word window. False positives are very common. Run time: O(|lexicon|^3).
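
The context-window idea can be sketched by collecting, for each candidate term, the words that occur within x tokens of it and then comparing those context vectors. The sentence, the term list, and the use of cosine similarity are assumptions for illustration; the slide does not specify the comparison measure.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, terms, window=2):
    """Count the words appearing within `window` tokens of each occurrence of each term."""
    vectors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        if token in terms:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical text and candidate gene/protein terms.
text = ("the kinase cdc28 phosphorylates cyclin while the kinase cdk1 "
        "phosphorylates cyclin and sic1 inhibits growth")
vectors = context_vectors(text.split(), {"cdc28", "cdk1", "sic1"}, window=2)

print(cosine(vectors["cdc28"], vectors["cdk1"]))
print(cosine(vectors["cdc28"], vectors["sic1"]))
```

Terms that merely appear in similar sentences also score highly under this scheme, which is consistent with the slide's remark that false positives are very common.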

34 Semi-supervised - Snowball Manual feedback mechanism

35 Supervised: Text Classification. Input: known synonym pairs. Automatically find contexts and assign weights. Train a classifier to distinguish between ‘positive’ and ‘negative’ contexts, e.g. ‘A also known as B’ (positive) and ‘A regulates B’ (negative).
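
To illustrate how contexts can be weighted and classified, the sketch below learns a smoothed log-odds weight per context word and sums the weights to score a new context. This is a deliberately simple stand-in, not the SVM named in the slides, and the training contexts are invented.

```python
import math
from collections import Counter


def train(positive, negative):
    """Learn a log-odds weight per word from positive and negative contexts (add-one smoothing)."""
    pos = Counter(w for context in positive for w in context.lower().split())
    neg = Counter(w for context in negative for w in context.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(weights, context):
    """Sum of word weights; a positive total suggests a synonym context. Unknown words count as 0."""
    return sum(weights.get(w, 0.0) for w in context.lower().split())


# Hypothetical training contexts around known synonym and non-synonym pairs.
positive = ["A also known as B", "A also called B", "A aka B"]
negative = ["A regulates B", "A binds B", "A inhibits B"]
weights = train(positive, negative)

print(score(weights, "A also known as B"))
print(score(weights, "A regulates B"))
```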

36 Why Combined Approach? SnowBall and SVM are machine-learning based: they capture synonyms that may be missed by GPE, but have many false positives. GPE is knowledge-based. Combine the advantages of both.

37 Results

38 Summary. Text mining. Semantic neighbor. Neighbor divergence. Precision and recall. Synonym detection approaches. Comments / Questions?

