
Gene Expression Database (GXD)



Presentation on theme: "Gene Expression Database (GXD)" — Presentation transcript:

0 Effective Biomedical Document Classification for Identifying Publications Relevant to the Mouse Gene Expression Database (GXD)
Xiangying Jiang1, Martin Ringwald2, Judith Blake2, Hagit Shatkay1
1Dept. of Computer and Information Science, University of Delaware, Newark, DE, USA
2The Jackson Laboratory, Bar Harbor, ME, USA
Good morning. I am Xiangying Jiang, a third-year PhD student in the Department of Computer and Information Sciences at the University of Delaware, working in the Computational Biomedicine Lab. I thank the organizers for the opportunity to be here and present our work on effective biomedical document classification for identifying publications relevant to the Mouse Gene Expression Database. This is collaborative work with Dr. Ringwald and Dr. Blake of the Jackson Laboratory.

1 Gene Expression Database (GXD)
GXD: >24,000 publications; >15,000 genes. Information is manually curated from the literature.
Curation workflow: identify publications relevant to GXD within MGI → annotate data within the selected publications.
Biomedical document classification aims to identify publications relevant to a specific research field. In our work, the goal is to automatically partition publications in the Mouse Genome Informatics (MGI) database into those that are relevant to the Gene Expression Database (GXD) and those that are not. GXD is an extensive resource of mouse developmental expression information. It includes expression literature content records for more than 24,000 publications and over 15,000 genes. Much of the detailed information provided by GXD is manually curated from the literature. The first step is to identify publications within MGI that are relevant to GXD; once the publications are identified, the curators annotate the data within them. Here, we concentrate on this first step of the GXD curation workflow: we use a large and well-curated dataset to train and test a classifier that partitions publications in MGI into those that are relevant to GXD and those that are not.

2 Datasets
Title + Abstract (from PubMed): 12,966 relevant documents; 12,354 irrelevant documents → 25,320 total
Title + Abstract + Image Captions (harvested from full-text articles): 1,630 relevant documents; 1,696 irrelevant documents → 3,326 total
We use two datasets to train and test our classifiers. All documents in each dataset were published between 2004 and 2014 and were downloaded from freely available online sources. We first built a dataset of 25,320 documents in total, where each document contains the title and abstract of a publication; 12,966 documents are labeled as relevant to GXD and 12,354 as irrelevant. We also built a second dataset of 3,326 documents in total, where each document contains not only the title and abstract but also the image captions of the publication; 1,630 of these documents are labeled as relevant to GXD, and the rest as irrelevant.

3 Information Source
Title and abstract of the publication; image captions
Prior work on biomedical document classification typically uses information obtained from the titles and abstracts of publications. However, image captions in the biomedical literature often contain significant and useful information for determining the topics discussed in a publication. As such, we consider the use of text obtained from image captions as part of the GXD classification process. Therefore, in addition to the dataset in which documents consist only of title and abstract, we build another dataset in which documents also contain the image captions from the publications.
*Isono, K., Mizutani-Koseki, Y., Komori, T., Schmidt-Zachmann, M. S., & Koseki, H. (2005). Mammalian polycomb-mediated repression of Hox genes requires the essential spliceosomal protein Sf3b1. Genes & development, 19(5),

4 Document Representation
Polycomb group (PcG) proteins are responsible for the stable repression of homeotic (Hox) genes by forming multimeric protein complexes. We show (1) physical interaction between components of the U2 small nuclear ribonucleoprotein particle (U2 snRNP), including Sf3b1 and PcG proteins Zfp144 and Rnf2; and (2) that Sf3b1-heterozygous mice exhibit skeletal transformations concomitant with ectopic Hox expressions…
The first step is document representation. Here we take part of an abstract as an example. We use both unigrams (single words) and bigrams (pairs of two consecutive words) to represent documents.
*Isono, K., Mizutani-Koseki, Y., Komori, T., Schmidt-Zachmann, M. S., & Koseki, H. (2005). Mammalian polycomb-mediated repression of Hox genes requires the essential spliceosomal protein Sf3b1. Genes & development, 19(5),
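The unigram/bigram representation described above can be sketched as follows. This is a minimal illustration; the actual tokenizer and term normalization used in the system are assumptions here.

```python
def extract_terms(text):
    """Tokenize into lowercase unigrams plus bigrams (adjacent word pairs)."""
    tokens = [w.strip(".,;:()").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

# Applied to the start of the example abstract:
terms = extract_terms("Polycomb group (PcG) proteins are responsible")
```

Each document thus contributes both its single words ("polycomb", "proteins", …) and its consecutive word pairs ("polycomb group", "pcg proteins", …) as candidate features.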

5 Document Representation
Polycomb group (PcG) proteins are responsible for the stable repression of homeotic (Hox) genes by forming multimeric protein complexes. We show (1) physical interaction between components of the U2 small nuclear ribonucleoprotein particle (U2 snRNP), including Sf3b1 and PcG proteins Zfp144 and Rnf2; and (2) that Sf3b1-heterozygous mice exhibit skeletal transformations concomitant with ectopic Hox expressions...
Step 1: Removing stop words
Step 2: Removing rare terms as well as overly frequent ones
Step 3: Conducting a Z-score test to select distinguishing features
We use both unigrams (single words) and bigrams (pairs of two consecutive words) to represent documents. To generate an effective and efficient document representation, we select a limited number of meaningful terms, rather than all terms, as features. We first remove standard stop words, for example, 'are' and 'for' here. We also remove rare terms that appear in only a single publication within the dataset, as well as overly frequent terms that appear in over 60% of the publications in the dataset.
*Isono, K., Mizutani-Koseki, Y., Komori, T., Schmidt-Zachmann, M. S., & Koseki, H. (2005). Mammalian polycomb-mediated repression of Hox genes requires the essential spliceosomal protein Sf3b1. Genes & development, 19(5),
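Steps 1 and 2 amount to filtering the vocabulary by document frequency. A minimal sketch, assuming an illustrative stop-word subset and the thresholds stated above (terms must occur in more than one document and in no more than 60% of documents):

```python
from collections import Counter

# Illustrative stop-word subset; a real system would use a standard list.
STOP_WORDS = {"are", "for", "the", "of", "and", "by", "we", "that"}

def select_vocabulary(tokenized_docs, max_df_ratio=0.6):
    """Keep terms that are not stop words, appear in more than one document,
    and appear in no more than max_df_ratio of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        for term in set(doc):       # count each document at most once per term
            doc_freq[term] += 1
    return {t for t, df in doc_freq.items()
            if t not in STOP_WORDS and df > 1 and df / n_docs <= max_df_ratio}

# Toy corpus: "hox" and "mouse" occur in every document (too frequent),
# "expression" in only one (too rare), "are" is a stop word.
docs = [
    ["hox", "genes", "are", "mouse"],
    ["hox", "expression", "mouse"],
    ["hox", "genes", "mouse"],
    ["hox", "mouse"],
    ["hox", "mouse"],
]
vocab = select_vocabulary(docs)
```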

6 Document Representation
Polycomb group (PcG) proteins are responsible for the stable repression of homeotic (Hox) genes by forming multimeric protein complexes. We show (1) physical interaction between components of the U2 small nuclear ribonucleoprotein particle (U2 snRNP), including Sf3b1 and PcG proteins Zfp144 and Rnf2; and (2) that Sf3b1-heterozygous mice exhibit skeletal transformations concomitant with ectopic Hox expressions...
For each term t, compare Pr(t | relevant doc) with Pr(t | irrelevant doc).
The last dimensionality-reduction step is the Z-score test. We select features whose probability of occurring in the positive (relevant) set is statistically significantly different from their probability of occurring in the negative (irrelevant) set. We consider these features distinguishing terms.
*Isono, K., Mizutani-Koseki, Y., Komori, T., Schmidt-Zachmann, M. S., & Koseki, H. (2005). Mammalian polycomb-mediated repression of Hox genes requires the essential spliceosomal protein Sf3b1. Genes & development, 19(5),
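One standard way to carry out such a test is a two-proportion z-test on a term's document frequency in the relevant vs. irrelevant sets; the exact statistic used in the talk is an assumption, so this sketch shows the common pooled-proportion form:

```python
import math

def term_z_score(term, relevant_docs, irrelevant_docs):
    """Two-proportion z-statistic comparing a term's document frequency in
    the relevant vs. irrelevant sets (docs given as sets of terms)."""
    n1, n2 = len(relevant_docs), len(irrelevant_docs)
    x1 = sum(term in d for d in relevant_docs)
    x2 = sum(term in d for d in irrelevant_docs)
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

# Toy example: "expression" occurs only in relevant docs, "tumor" only in
# irrelevant ones, so both should be flagged as distinguishing (|z| > 1.96).
relevant = [{"expression", "embryo"} for _ in range(4)]
irrelevant = [{"tumor"} for _ in range(4)]
z = term_z_score("expression", relevant, irrelevant)
```

Terms whose |z| exceeds the critical value for the chosen significance level are kept as distinguishing features.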

7 Document Representation
Polycomb group (PcG) proteins are responsible for the stable repression of homeotic (Hox) genes by forming multimeric protein complexes. We show (1) physical interaction between components of the U2 small nuclear ribonucleoprotein particle (U2 snRNP), including Sf3b1 and PcG proteins Zfp144 and Rnf2; and (2) that Sf3b1-heterozygous mice exhibit skeletal transformations concomitant with ectopic Hox expressions...
<0,1,1,0,…,0,1,1> (length = number of distinguishing terms)
We use these distinguishing terms to generate a binary vector representing each document: if a term appears in the document, the corresponding value is set to 1; otherwise it is 0.
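The binary encoding is straightforward; a minimal sketch (the feature terms below are hypothetical placeholders, not the actual selected features):

```python
def to_binary_vector(doc_terms, feature_terms):
    """Map a document to a 0/1 vector over the selected distinguishing terms."""
    present = set(doc_terms)
    return [1 if t in present else 0 for t in feature_terms]

# Hypothetical distinguishing terms and document:
features = ["hox", "pcg proteins", "stable repression", "kidney"]
vec = to_binary_vector(["hox", "stable repression", "mice"], features)
```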

8 Experiments Using Title-and-Abstract
Three sets of cross-validation runs with different settings:
- Five distinct complete 5-fold cross-validation runs.
- Ten distinct complete 10-fold cross-validation runs.
- Half of the data treated as the training set, the other half as the test set.
Results for the 5-fold cross-validation runs (standard deviations in parentheses):
Naïve Bayes: Precision 0.892 (.005), Recall 0.957 (.003), Accuracy 0.917 (.004), Utility10 0.876 (.006)
Random Forest: Precision 0.908 (.006), Recall 0.921 (.005), Accuracy 0.912 (.005), Utility10 0.895 (.007)
We trained and tested two types of classifiers, Naïve Bayes and Random Forest, over the two datasets, using the document representation described above. We first conducted experiments using only the titles and abstracts of the documents. To ensure our results are statistically significant, we trained and tested the classifiers using three sets of cross-validation runs with different settings. First, we executed five distinct complete 5-fold cross-validation runs. To validate that the classification results remain steady even when the size of the training set varies, we also employed ten complete sets of stratified 10-fold cross-validation, and a third set of experiments in which half of the data serves as the training set and the other half as the test set. We use standard measures widely employed for document classification evaluation, namely precision, recall, F-measure, and accuracy. For biomedical document curation, recall is often viewed as more important than precision, because missing relevant documents may compromise the integrity of the database; we therefore also include the utility measure introduced by the TREC Genomics track, which biases the evaluation in favor of high recall. Our method attains a high level of performance on GXD according to every evaluation measure, indicating that the proposed document classification method is effective and can indeed be useful in practice.
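The Utility10 measure can be sketched as the normalized linear utility used in the TREC Genomics track; the exact normalization used in this work is an assumption, but a common form assigns a false negative ten times the cost of a false positive:

```python
def utility10(tp, fp, fn, u_r=10):
    """Normalized linear utility (TREC Genomics style): raw utility
    u_r*TP - FP, divided by its maximum achievable value u_r*(TP + FN).
    With u_r = 10, missing a relevant document (FN) is penalized ten
    times as heavily as retrieving an irrelevant one (FP)."""
    return (u_r * tp - fp) / (u_r * (tp + fn))

# A high-recall classifier scores well even with a few false positives:
score = utility10(tp=9, fp=5, fn=1)
```

A perfect classifier (no false positives or false negatives) scores 1.0, and the measure degrades much faster with missed relevant documents than with spurious ones, matching the curation priorities described above.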

9 Using Text from Titles, Abstracts and Captions
Results (Random Forest, 5-fold cross-validation; standard deviations in parentheses):
Using Titles and Abstracts Only: Precision 0.779 (.016), Recall 0.768 (.024), Accuracy 0.802 (.015), Utility10 0.765 (.018)
Using Captions Only: Precision 0.855 (.019), Recall 0.758 (.017), Accuracy 0.817 (.014), Utility10 0.846 (.021)
Using Titles, Abstracts & Captions: Precision 0.876 (.019), Recall 0.829 (.015), Accuracy 0.858 (.012), Utility10 0.869 (.020)
In this group of experiments we include titles, abstracts, and captions in the representation. We executed three sets of experiments over the dataset in which each document consists not only of the title and abstract but also of image captions, to compare the impact of using image captions vs. title-and-abstract only. Since this dataset is small, we used only five distinct complete 5-fold cross-validation runs for each set of experiments, each run using a different 5-way split of the dataset. In the first set of experiments over the GXD-caption set, we used only the titles and abstracts of the publications as training/test data. In the second set, we used only the image captions of the publications. In the third set, we used the titles and abstracts as well as the image captions. The classifiers generated in the last set of experiments, which rely on features obtained from titles, abstracts, and image captions, show the best performance. This indicates that image captions indeed provide valuable information supporting the GXD document classification task.

10 Conclusion
We proposed a biomedical document classification framework that effectively identifies publications relevant to GXD. We demonstrated that features selected from titles, abstracts, and image captions improve classification performance.

11 Future Work
Utilize more of the irrelevant documents: further improve the classifier by developing strategies that make better use of the irrelevant documents.
Harvest figure captions directly from PDFs to increase the size of the captions dataset.
Combine information obtained directly from the text as well as from the images themselves for biomedical document classification.

12 Acknowledgments
Martin Ringwald, Ph.D.; Judith Blake, Ph.D.
U.S. National Library of Medicine
We appreciate the help of our collaborators, Dr. Ringwald and Dr. Blake. This work was supported by the NIH U.S. National Library of Medicine.

13 Thank you!

