1 Richard Tzong-Han Tsai, Po-Ting Lai, Hong-Jie Dai, Chi-Hsin Huang,Yue-Yang Bow Yen-Ching Chang,Wen-Harn Pan, Wen-Lian Hsu HypertenGene: Extracting key.

Presentation transcript:

1 Richard Tzong-Han Tsai, Po-Ting Lai, Hong-Jie Dai, Chi-Hsin Huang, Yue-Yang Bow, Yen-Ching Chang, Wen-Harn Pan, Wen-Lian Hsu HypertenGene: Extracting key hypertension genes from biomedical literature …

2 Where are we from? Institute of Information Science Academia Sinica Taiwan

3 InCoB 2009 Institute of Information Science Academia Sinica Taiwan

4 HypertenGene: Extracting key hypertension genes from biomedical literature with position and automatically-generated template features Richard Tzong-Han Tsai, Po-Ting Lai, Hong-Jie Dai, Chi-Hsin Huang, Yue-Yang Bow, Yen-Ching Chang, Wen-Harn Pan, Wen-Lian Hsu

5 Outline Motivation Major tasks Dataset Evaluation Conclusion

6 What Causes Hypertension

7 GAD Database Disease View Search for All Records found: 930 * About 930 PubMed IDs for genes associated with hypertension are recorded in GAD * Database updated to 2008

8 Articles about Hypertension Over three hundred thousand abstracts about hypertension in PubMed

9 Key Hypertension Genes Genes which cause hypertension genetically Example The GNB3 may be considered a genetic marker for hypertension. [PMID: ]

10 HG Pair in a Sentence The GNB3 may be considered a genetic marker for hypertension. S: the sentence; G: the gene mention (GNB3); H: the hypertension mention; G and H together form the HG pair

11 Outline Motivation Major tasks Dataset Evaluation Conclusion

12 Major Task 1. Gene named entity recognition (NER) and gene normalization (GN) 2. Hypertension named entity recognition 3. Gene-hypertension relation extraction

13 Gene Named Entity Recognition Example The GNB3 may be considered a genetic marker for hypertension. [PMID: ]

14 Gene Named Entity Recognition Example The GNB3 may be considered a genetic marker for hypertension. [PMID: ]

15 Gene Normalization Example The GNB3 may be considered a genetic marker for hypertension. [PMID: ] Gene ID: 2784 Names: guanine nucleotide binding protein (G protein), beta polypeptide 3; guanine nucleotide-binding protein, beta-3 subunit; transducin beta chain 3; G protein, beta-3 subunit; GTP-binding regulatory protein beta-3 chain; GNB3
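
The slide shows the outcome of gene normalization (a mention mapped to Gene ID 2784) but not how the mapping is done. A minimal sketch, assuming a plain dictionary lookup over the names listed on the slide; `GENE_SYNONYMS`, `NAME_TO_ID`, `normalize_gene`, and the case-insensitive matching are illustrative assumptions, not the authors' method:

```python
# Hypothetical synonym dictionary built from the names listed on the slide.
GENE_SYNONYMS = {
    "2784": [
        "guanine nucleotide binding protein (G protein), beta polypeptide 3",
        "guanine nucleotide-binding protein, beta-3 subunit",
        "transducin beta chain 3",
        "G protein, beta-3 subunit",
        "GTP-binding regulatory protein beta-3 chain",
        "GNB3",
    ],
}

# Invert the table once: lower-cased name -> gene ID.
NAME_TO_ID = {
    name.lower(): gene_id
    for gene_id, names in GENE_SYNONYMS.items()
    for name in names
}


def normalize_gene(mention):
    """Return the gene ID for a mention, or None if it is not in the dictionary."""
    return NAME_TO_ID.get(mention.strip().lower())


print(normalize_gene("GNB3"))
```

An exact-match lookup like this cannot resolve ambiguous or misspelled names, so it only illustrates what normalization produces.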

16 Major Task 1. Gene named entity recognition (NER) and gene normalization (GN) 2. Hypertension named entity recognition 3. Gene-hypertension relation extraction

17 Disease NER Example The GNB3 may be considered a genetic marker for hypertension. [PMID: ] In conclusion, REN 10631A alleles are significantly associated with EHT in the Emirati population. [PMID: ] EHT : Essential HyperTension

18 Disease NEs in Evident Sentences OBJECTIVE: We sought to determine whether polymorphisms in the transforming growth factor (TGF)-beta3 gene are associated with risk of pregnancy-induced hypertension (PIH) in case-control mother-baby dyads. ... CONCLUSION: A fetal TGF-beta3 polymorphism (rs ) is associated with PIH in a predominantly Hispanic population. PMID:

19 List of Hypertension Acronyms (Original Name: Acronym) pregnancy-induced hypertension: PIH; Primary pulmonary hypertension: PPH; Family history of hypertension: FH; Pulmonary hypertension: PH. More than 30 pairs were collected by the acronym recognition component
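
The slide does not describe how the acronym recognition component works. As a rough sketch of one common approach, the code below looks for a parenthesized upper-case short form and accepts it when its letters match the initials of the words just before it. The function name `find_acronym_pairs` and the matching rule are assumptions made for illustration:

```python
import re


def find_acronym_pairs(text):
    """Find 'long form (SHORT)' pairs where each letter of SHORT starts a word."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,6})\)", text):
        short = match.group(1)
        # Word tokens before the parenthesis; hyphenated words split into parts.
        tokens = list(re.finditer(r"[A-Za-z0-9]+", text[:match.start()]))
        candidate = tokens[-len(short):]
        initials = "".join(tok.group()[0].upper() for tok in candidate)
        if len(candidate) == len(short) and initials == short:
            long_form = text[candidate[0].start():candidate[-1].end()]
            pairs.append((long_form, short))
    return pairs


sentence = (
    "polymorphisms are associated with risk of pregnancy-induced "
    "hypertension (PIH) in case-control mother-baby dyads"
)
print(find_acronym_pairs(sentence))
```

A strict initials rule like this would miss pairs such as "Family history of hypertension" / FH, where some words contribute no letter, so a real component needs a looser alignment.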

20 Major Task 1. Gene named entity recognition (NER) and gene normalization (GN) 2. Hypertension named entity recognition 3. Gene-hypertension relation extraction

21 Formulation HG pair 1 in sentence S1; HG pair 2 in sentence S1; HG pair 3 in sentence S2. Binary classification: does the target HG pair express a key relation or not? Classes: Key Relation, Not a Key Relation (the figure sorts HG pairs 1 to 3 into these two classes)
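
To make the formulation concrete, here is a small sketch of how candidate HG pairs could be enumerated before classification: every gene mention in a sentence is paired with every hypertension mention in the same sentence. The function `hg_pairs`, the input layout, and the placeholder mentions (`GENE_A` and so on) are hypothetical:

```python
from itertools import product


def hg_pairs(sentences):
    """Return (sentence_id, gene, hypertension_mention) for every co-occurring pair."""
    pairs = []
    for sent_id, (genes, diseases) in sentences.items():
        pairs.extend((sent_id, g, h) for g, h in product(genes, diseases))
    return pairs


# Hypothetical input: two genes and one disease mention in S1, one of each in S2.
sentences = {
    "S1": (["GENE_A", "GENE_B"], ["hypertension"]),
    "S2": (["GENE_C"], ["EHT"]),
}

for pair in hg_pairs(sentences):
    print(pair)
```

This yields two pairs from S1 and one from S2, mirroring the slide's layout; each pair then becomes one instance for the binary classifier.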

22 Outline Motivation Major tasks Dataset Evaluation Conclusion

23 Datasets Our dataset consists of 939 sentences from 195 abstracts selected from the GAD. 1395 HG pairs can be extracted from these 939 sentences. Table: number of HG pairs, split into positive and negative HG pairs

24 Training & Testing Randomly select 90% of the HG pairs as the training set and the remaining 10% as the test set. Repeat 30 times. Average the results to compare performance
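
The protocol on this slide (random 90/10 split, 30 repetitions, averaged scores) can be sketched as follows. The `evaluate` callback stands in for training and scoring a classifier and is hypothetical, as are the function name `repeated_holdout` and the fixed seed:

```python
import random
from statistics import mean


def repeated_holdout(pairs, evaluate, runs=30, train_fraction=0.9, seed=0):
    """Average evaluate(train, test) over repeated random 90/10 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = pairs[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, test))
    return mean(scores)


# Toy usage: the "score" is just the share of items that ended up in the test set.
items = list(range(100))
print(repeated_holdout(items, lambda train, test: len(test) / len(items)))
```

Because each run draws a fresh split, the per-run scores also give the spread needed for the standard deviation and significance test reported later in the deck.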

25 Outline Motivation Major tasks Dataset Evaluation Conclusion

26 Scoring Method: F-score The weighted harmonic mean of precision and recall. Dataset: HG1~HG10 (HG1 HG2 HG3 HG4 HG5 HG6 HG7 HG8 HG9 HG10). Key Gene Prediction: HG1, HG4, HG5, HG6, HG7. precision: 1/5 = 0.2, recall: 1/3 = 0.33, F-score: (2*0.2*0.33)/(0.2+0.33) = 0.25
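
The arithmetic on this slide can be checked with a few lines of code. The transcript does not preserve which three pairs are the true key genes, so the `gold` set below is an assumption chosen only to reproduce the slide's 1/5 precision and 1/3 recall; `precision_recall_f` is an illustrative helper name:

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-score for two sets of HG pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f_score


predicted = {"HG1", "HG4", "HG5", "HG6", "HG7"}  # predictions from the slide
gold = {"HG1", "HG2", "HG3"}  # assumed key genes, not stated in the transcript

p, r, f = precision_recall_f(predicted, gold)
print(round(p, 2), round(r, 2), round(f, 2))
```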

27 AUC of the iP/R curve * n is the total number of correct HG pairs * p_i is the highest interpolated precision for the correct HG pair j at r_j * r_j is the recall at that HG pair * The interpolated precision p_i is calculated for each recall r by taking the highest precision at r or any r' > r.
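
One definition consistent with these bullets, written out as a sketch. The summation form and the convention r_0 = 0 are assumptions, since the slide's own equation is not part of the transcript:

```latex
A_{iP/R} = \sum_{j=1}^{n} p_i(r_j)\,\bigl(r_j - r_{j-1}\bigr),
\qquad r_0 = 0,
\qquad p_i(r) = \max_{r' \ge r} p(r')
```

Here p(r') is the precision measured at recall r', so each correct HG pair contributes its interpolated precision weighted by the recall gained at that pair.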

28 Scoring Method: AUC Dataset: HG1~HG10. Key Gene Prediction (first ranking): 1st HG1, 2nd HG6, 3rd HG7, 4th HG2, 5th HG3. Precision: 0.6, Recall: 1, F-score: 0.75, AUC: Key Gene Prediction (second ranking): 1st HG1, 2nd HG6, 3rd HG2, 4th HG3, 5th HG7. Precision: 0.6, Recall: 1, F-score: 0.75, AUC:
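
The point of this slide is that two rankings with identical precision, recall and F-score can still differ in AUC. The sketch below computes the area under the interpolated precision/recall curve for both rankings. The `gold` set (HG1, HG6, HG7) is an assumption, since the transcript does not say which pairs are correct, and `auc_ipr` follows the usual interpolated-precision definition rather than a formula confirmed by the slides:

```python
def auc_ipr(ranking, gold):
    """Area under the interpolated precision/recall curve for a ranked list."""
    # (recall, precision) at each rank where a correct pair is retrieved.
    points = []
    hits = 0
    for rank, item in enumerate(ranking, start=1):
        if item in gold:
            hits += 1
            points.append((hits / len(gold), hits / rank))
    area, prev_recall = 0.0, 0.0
    for j, (recall, _) in enumerate(points):
        # Interpolated precision: best precision at this recall or any higher one.
        interpolated = max(p for _, p in points[j:])
        area += interpolated * (recall - prev_recall)
        prev_recall = recall
    return area


gold = {"HG1", "HG6", "HG7"}  # assumed key genes, not stated in the transcript
first = ["HG1", "HG6", "HG7", "HG2", "HG3"]
second = ["HG1", "HG6", "HG2", "HG3", "HG7"]
print(auc_ipr(first, gold), auc_ipr(second, gold))
```

Under this assumption the first ranking scores higher because all correct pairs sit at the top, which is the behaviour a rank-sensitive measure is meant to reward.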

29 Select Features for Classification Binary Classification Features

30 Select Features for Classification The GNB3 may be considered a genetic marker for hypertension. Features Binary Classification: key HG pair or not

31 Features: Basic Word Features, Chunk Features, Parse Tree Path Features, Template Features, Position Features

32 Basic Word Features The GNB3 may be considered a genetic marker of predisposition for hypertension. Words between: may, be, considered, a, genetic, marker, of, predisposition, for. Words between (bigram): may_be, be_considered, considered_a, a_genetic, genetic_marker, marker_of, of_predisposition, predisposition_for
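
The two feature lists on this slide can be reproduced with a short sketch. It assumes single-token mentions with the gene appearing before the disease, and it uses the longer form of the example sentence (with "of predisposition") that the slide's feature lists imply; `between_word_features` is an illustrative name:

```python
import re


def between_word_features(sentence, gene, disease):
    """Return the words between the two mentions and their bigrams."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    start, end = tokens.index(gene), tokens.index(disease)
    between = tokens[start + 1:end]
    bigrams = [f"{a}_{b}" for a, b in zip(between, between[1:])]
    return between, bigrams


sentence = (
    "The GNB3 may be considered a genetic marker of predisposition "
    "for hypertension."
)
words, bigrams = between_word_features(sentence, "GNB3", "hypertension")
print(words)
print(bigrams)
```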

33 Parse Tree Path Features: NP_S_VP_NP_PP_NP
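
A parse tree path such as NP_S_VP_NP_PP_NP is read by climbing from the gene's phrase to the lowest common ancestor and then descending to the disease's phrase. The sketch below does this on a hand-written, simplified tree; the tree (no part-of-speech level, a single flat VP) is an assumption chosen to match the slide's path, not the output of any particular parser:

```python
def nodes_to_leaf(tree, leaf, above=()):
    """Return the chain of nodes from the root down to the phrase holding `leaf`."""
    chain = list(above) + [tree]
    for child in tree[1:]:
        if isinstance(child, str):
            if child == leaf:
                return chain
        else:
            found = nodes_to_leaf(child, leaf, chain)
            if found:
                return found
    return None


def parse_tree_path(tree, gene, disease):
    """Labels from the gene's phrase up to the common ancestor, then down."""
    up = nodes_to_leaf(tree, gene)
    down = nodes_to_leaf(tree, disease)
    shared = 0
    while shared < min(len(up), len(down)) and up[shared] is down[shared]:
        shared += 1
    climb = [node[0] for node in reversed(up[shared - 1:])]
    descend = [node[0] for node in down[shared:]]
    return "_".join(climb + descend)


# Hypothetical simplified parse: each node is [label, child, child, ...].
TREE = ["S",
        ["NP", "The", "GNB3"],
        ["VP", "may", "be", "considered",
         ["NP",
          ["NP", "a", "genetic", "marker"],
          ["PP", "for", ["NP", "hypertension"]]]]]

print(parse_tree_path(TREE, "GNB3", "hypertension"))
```

Nodes are compared by identity rather than by label, so two different phrases that share a label are never mistaken for a common ancestor.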

34 Chunk Features The GNB3 may be considered a genetic marker for hypertension. Inter-HG chunk types: VP_NP_PP Inter-HG chunk head words: consider_marker_hypertension Word/Chunk: The/B-NP GNB3/I-NP may/B-VP be/I-VP considered/I-VP a/B-NP genetic/I-NP marker/I-NP for/B-PP hypertension/B-NP
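
The inter-HG chunk types can be derived directly from BIO chunk tags by collecting the type of every chunk that begins between the two mentions. In this sketch the ten-tag sequence is reconstructed under the usual BIO convention, which is an assumption, and `inter_hg_chunk_types` is an illustrative name:

```python
def inter_hg_chunk_types(tags, gene_index, disease_index):
    """Join the types of the chunks that start strictly between the two mentions."""
    types = [
        tag[2:]
        for tag in tags[gene_index + 1:disease_index]
        if tag.startswith("B-")
    ]
    return "_".join(types)


words = ["The", "GNB3", "may", "be", "considered",
         "a", "genetic", "marker", "for", "hypertension"]
# Assumed BIO tags, one per word.
tags = ["B-NP", "I-NP", "B-VP", "I-VP", "I-VP",
        "B-NP", "I-NP", "I-NP", "B-PP", "B-NP"]

print(inter_hg_chunk_types(tags, words.index("GNB3"), words.index("hypertension")))
```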

35 Result of Baseline Features Table columns: Config, Precision, Recall, F-score, AUC, S_AUC. Row: Baseline. Baseline: Basic word + Chunk + Parse Tree. S_AUC: standard deviation of AUC

36 Template Features Example sentences: (1) Especially, a polymorphism in SLC12A was significantly associated with hypertension in women even after correction by the Bonferroni method. (2) The leptin gene polymorphism was associated with hypertension independent of obesity. (3) On analysis of covariance, the interaction between ND Leu / Met polymorphism and habitual drinking was significantly associated with both systolic blood pressure and diastolic blood pressure. Template: … gene … associated with … hypertension
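
To illustrate how a template like "… gene … associated with … hypertension" could fire as a feature, the sketch below masks the two mentions with placeholders and tests a regular expression. The slides say the templates are generated automatically; this sketch hard-codes a single template and does not show that generation step. `TEMPLATE`, `matches_template`, and the placeholder scheme are assumptions:

```python
import re

# One hand-written template: GENE ... associated with ... HYPERTENSION.
TEMPLATE = re.compile(r"\bGENE\b.*\bassociated with\b.*\bHYPERTENSION\b")


def matches_template(sentence, gene, disease):
    """Replace the two mentions with placeholders, then test the template."""
    masked = sentence.replace(gene, "GENE", 1).replace(disease, "HYPERTENSION", 1)
    return bool(TEMPLATE.search(masked))


example = "The leptin gene polymorphism was associated with hypertension independent of obesity."
print(matches_template(example, "leptin", "hypertension"))
```

A template feature would then be a binary indicator per template, set when the masked sentence matches.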

37 Result of B+T Features Table columns: Config, P, R, F-score, AUC, S_AUC, ∆AUC, t, AUC > AUC_B? (t > 1.67?). Rows: Baseline (last column: N/A); B+T (last column: No). B: Baseline features (word features, chunk features, parse tree) T: Template features t: t test
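
The slide compares each configuration's AUC against the baseline with a t test and a threshold of 1.67, which is close to the one-tailed 5% critical value for roughly 58 degrees of freedom (two sets of 30 runs). The exact test variant is not stated, so the sketch below assumes a two-sample t statistic computed from each configuration's mean AUC and standard deviation; the numbers in the usage lines are made up for illustration and are not results from the paper:

```python
from math import sqrt


def t_statistic(mean_a, sd_a, mean_b, sd_b, runs=30):
    """Two-sample t statistic from the mean and standard deviation of each config."""
    return (mean_a - mean_b) / sqrt(sd_a ** 2 / runs + sd_b ** 2 / runs)


# Hypothetical AUC summaries for a new configuration and the baseline.
t = t_statistic(mean_a=0.80, sd_a=0.05, mean_b=0.75, sd_b=0.05)
print(round(t, 3), t > 1.67)
```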

38 Position Features: Relative position features; Section features. Divide an abstract into four sections: Objective, Methods, Result, Conclusions. Value = 0~10
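
The slide names two position features but does not say how they are computed or which of them the 0~10 range belongs to. The sketch below assumes the range applies to the relative position of the sentence within the abstract, scaled to integers from 0 to 10, and treats the section as a categorical value; `relative_position`, `position_features`, and the scaling rule are all hypothetical:

```python
SECTIONS = ("Objective", "Methods", "Result", "Conclusions")


def relative_position(sentence_index, sentence_count):
    """Scale a zero-based sentence index to an integer from 0 to 10 (assumed rule)."""
    if sentence_count < 2:
        return 0
    return (10 * sentence_index) // (sentence_count - 1)


def position_features(sentence_index, sentence_count, section):
    """Bundle the two position features for one sentence."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    return {
        "relative_position": relative_position(sentence_index, sentence_count),
        "section": section,
    }


print(position_features(10, 11, "Conclusions"))
```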

39 Before Section Categorization

40 After Section Categorization

41 PubMed EX

42 Result Table columns: Config, P, R, F-score, AUC, S_AUC, ∆AUC, t, AUC > AUC_B? (t > 1.67?). Rows: Baseline (last column: N/A); B+T (No); B+P (Yes); B+P+T (Yes). B: Baseline features (word features, chunk features, parse tree) P: Position features T: Template features

43 Outline Motivation Major tasks Dataset Evaluation Conclusion

44 Conclusions-1 The first systematic study of extracting hypertension-related genes.

45 Conclusions-2 The first attempt to create a hypertension-gene relation corpus based on the GAD database.

46 Conclusions-3 Propose a supervised learning approach for extracting key hypertension-related genes.

47 Thanks for your attention