Information Extraction as Link Prediction: Using Curated Citation Networks to Improve Gene Detection
Andrew Arnold and William W. Cohen
Machine Learning Department, Carnegie Mellon University

Introduction

Social networks, in the form of bibliographies and citations, have long been an integral part of the scientific process. We examine how to leverage the information contained within these publication networks, along with information concerning the individual publications themselves and a user’s history, to help predict which entities the user might be most interested in and thus intelligently guide his search. Our application domain is the task of predicting which genes and proteins a biologist is likely to write about in the future. We represent this as a link prediction problem wherein we predict which nodes in a graph, currently unlinked, “should” be linked to each other, where “should” is defined in some application-specific way. In our setting, we seek to discover edges between authors and genes, indicating genes about which an author has yet to write, but which he may be interested in.

Data

We are able to extract the nodes and edges that make up our annotated citation network from the following data sources:

PubMed and PubMed Central (PMC): PubMed is a free, open-access on-line archive of over 18 million biological abstracts and bibliographies, including citations, for papers published since 1948 [1]. PubMed Central contains full-text copies of over one million of these papers for which open access has been granted [2].

The Saccharomyces Genome Database (SGD): A database of various types of information concerning the yeast organism Saccharomyces cerevisiae, including descriptions of its genes along with over 40,000 papers manually tagged with the genes they mention [3].

The Gene Ontology (GO): A large ontology describing the properties of and relationships between various biological entities across numerous organisms [4].
Nodes

The nodes of our network represent the entities we are interested in:

44,012 Papers contained in SGD for which PMC bibliographic data is available.
66,977 Authors of those papers, parsed from the PMC citation data. Each author’s position in the paper’s citation (i.e. first author, last author, etc.) is also recorded.
5,816 Genes of yeast, mentioned in those papers.

Edges

We likewise use the edges of our network to represent the relationships between and among the nodes, or entities.

Authorship: 178,233 bi-directional edges linking author nodes and the nodes of the papers they authored.
Mention: 160,621 bi-directional edges linking paper nodes and the genes they discuss.
Cites: 42,958 uni-directional edges linking nodes of citing papers to the nodes of the papers they cite.
Cited: 42,958 uni-directional edges linking nodes of cited papers to the nodes of the papers that cite them.
RelatesTo: 1,604 uni-directional edges linking gene nodes to the nodes of other genes appearing in their GO description.
RelatedTo: 1,604 uni-directional edges linking gene nodes to the nodes of other genes in whose GO description they appear.

Bibliography

1. U.S. National Library of Medicine, National Institutes of Health.
3. Dwight et al. Saccharomyces genome database: underlying principles and organisation. Brief Bioinform. 5(1):9–22. ftp://ftp.yeastgenome.org/yeast.
4. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. In Nature Genet, volume 25, 25–
5. Cohen, W. W., and Minkov, E. A graph-search framework for associating gene identifiers with documents. BMC Bioinformatics 7(440).

Model

Given our graph representation, the first step is to pick a set of query nodes to which our predicted links will connect. We then perform a random walk out from the query node(s), simultaneously following each edge to the adjacent nodes with a probability proportional to the inverse of the total number of adjacent nodes [5].
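The typed edges and a single walk step described above can be sketched as follows. This is only an illustration under stated assumptions: the node names (author:A, paper:P1, and so on) are hypothetical, bi-directional edge types are stored as two directed edges, and leaving probability mass in place at a node with no outgoing edges is an assumption the source does not state.

```python
from collections import defaultdict

# Toy annotated citation network; all node names are hypothetical.
# Each entry is (source, edge_type, target). Bi-directional edge types
# (Authorship, Mention) appear once in each direction.
EDGES = [
    ("author:A", "Authorship", "paper:P1"),
    ("paper:P1", "Authorship", "author:A"),
    ("paper:P1", "Mention", "gene:G1"),
    ("gene:G1", "Mention", "paper:P1"),
    ("paper:P1", "Cites", "paper:P2"),
    ("paper:P2", "Cited", "paper:P1"),
    ("paper:P2", "Mention", "gene:G2"),
    ("gene:G2", "Mention", "paper:P2"),
    ("gene:G1", "RelatesTo", "gene:G2"),
    ("gene:G2", "RelatedTo", "gene:G1"),
]


def build_adjacency(edges):
    """Map each node to the list of nodes reachable by one outgoing edge."""
    adj = defaultdict(list)
    for src, _edge_type, dst in edges:
        adj[src].append(dst)
    return adj


def walk_step(dist, adj):
    """Spread each node's probability mass equally over its neighbours."""
    nxt = defaultdict(float)
    for node, mass in dist.items():
        neighbours = adj.get(node, [])
        if not neighbours:
            # Assumption: a walker at a dead end stays where it is.
            nxt[node] += mass
            continue
        share = mass / len(neighbours)  # 1 / (number of adjacent nodes)
        for nb in neighbours:
            nxt[nb] += share
    return dict(nxt)


adj = build_adjacency(EDGES)
one_step = walk_step({"author:A": 1.0}, adj)  # all mass moves to paper:P1
two_step = walk_step(one_step, adj)           # split over P1's three neighbours
print(one_step)
print(two_step)
```

Storing the edge type alongside each edge keeps the option of restricting the walk to particular edge types later, even though this step ignores the type.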
We repeat this process a number of times, each time spreading our probability of being on any particular node, given we began on the query node(s). After each step in our walk we have a probability distribution over all the nodes of the graph, representing the likelihood that a walker, beginning at the query node(s) and randomly following outbound edges in the way described, is on that particular node. We can then use this distribution to rank all the nodes, predicting that the nodes most likely to appear in the walk are also the nodes to which the query node(s) should most likely connect.

Experiment

In order to evaluate our predicted edges, we can hide certain instances of edges, perform a walk, and compare the predicted edges to the actual withheld ones. We use ablation studies to assess the specific contribution of particular edge types.

Curated citation networks

We construct a citation network as a graph in which publications and authors are represented as nodes, with bidirectional authorship edges linking authors and papers, and uni-directional citation edges linking papers to other papers (the direction of the edge denoting which paper is doing the citing and which is being cited). We use curated literature databases for biology in which publications are tagged, or manually labeled, with the genes with which they are concerned. This allows us to introduce gene nodes to our enhanced citation network, which are bidirectionally linked to the papers in which they are tagged. Finally, we exploit a third source of data, namely biological domain expertise in the form of ontologies and databases of facts concerning these genes, to create association edges between genes which have been shown to relate to each other in various ways. We call the entire structure an annotated citation network.

Topology of the full annotated citation network. Node names are in bold, while edge names are in italics.
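The repeated walk, the ranking of nodes by walk probability, and the withheld-edge evaluation described above can be sketched together. This is a rough illustration, not the authors' implementation: the toy graph, the node names, the choice of four steps, the dead-end handling, and the hit-at-rank-1 check are all assumptions made for the example.

```python
from collections import defaultdict

# Toy undirected graph; all names are hypothetical.
ALL_EDGES = [
    ("author:A", "paper:P1"), ("paper:P1", "gene:G1"),
    ("author:B", "paper:P1"), ("author:B", "paper:P3"),
    ("paper:P3", "gene:G2"),
    ("author:C", "paper:P4"), ("paper:P4", "gene:G3"),
    ("author:A", "paper:P5"), ("paper:P5", "gene:G2"),  # A's "future" paper
]
# Edges hidden from the walk and used only for evaluation.
HIDDEN = {("author:A", "paper:P5"), ("paper:P5", "gene:G2")}
VISIBLE = [e for e in ALL_EDGES if e not in HIDDEN]


def build_adjacency(pairs):
    """Adjacency lists for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in pairs:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def random_walk(query_nodes, adj, steps):
    """Distribution over nodes after `steps` uniform random-walk steps."""
    dist = {q: 1.0 / len(query_nodes) for q in query_nodes}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in dist.items():
            nbrs = adj.get(node, [])
            if not nbrs:
                nxt[node] += mass  # assumption: stay put at dead ends
                continue
            for nb in nbrs:
                nxt[nb] += mass / len(nbrs)
        dist = dict(nxt)
    return dist


def rank_candidates(dist, candidates):
    """Sort candidates by descending walk probability, ties by name."""
    return sorted(candidates, key=lambda n: (-dist.get(n, 0.0), n))


adj = build_adjacency(VISIBLE)
dist = random_walk(["author:A"], adj, steps=4)

known = {"gene:G1"}     # genes author:A is already linked to via visible papers
withheld = {"gene:G2"}  # gene reachable only through the hidden edges
candidates = {n for n in adj if n.startswith("gene:")} - known
ranking = rank_candidates(dist, candidates)
hit_at_1 = ranking[0] in withheld
print(ranking, hit_at_1)
```

In this toy graph the withheld gene is still reachable through a co-author's paper, which is the kind of social signal the walk is meant to exploit.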
Subgraphs queried in the ablation experiment, grouped by type: B for baselines, S for social networks, C for networks conveying biological content, and S+C for networks making use of both social and biological information. Shaded nodes represent the node type(s) used as a query.

Demo

An on-line demo of our work, including a link to the curated citation network data used for the experiments, can be found at

Results

Mean percent of queries across graph types, broken down by author position, shown with error bars marking the 95% confidence interval, along with baselines UNIFORM and ALL_PAPERS.
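An ablation over edge types, as used in the experiment, can be sketched by restricting the walk to a subset of edge types and comparing the probability that reaches a target node. The subset names and their contents below are hypothetical and are not the B, S, C, and S+C subgraphs of the poster, whose exact composition the text does not give; the toy graph and the three-step walk are likewise assumptions.

```python
from collections import defaultdict

# (source, edge_type, target); all node names are hypothetical.
EDGES = [
    ("author:A", "Authorship", "paper:P1"),
    ("paper:P1", "Authorship", "author:A"),
    ("author:B", "Authorship", "paper:P1"),
    ("paper:P1", "Authorship", "author:B"),
    ("author:B", "Authorship", "paper:P2"),
    ("paper:P2", "Authorship", "author:B"),
    ("paper:P1", "Mention", "gene:G1"),
    ("gene:G1", "Mention", "paper:P1"),
    ("paper:P2", "Mention", "gene:G2"),
    ("gene:G2", "Mention", "paper:P2"),
    ("gene:G1", "RelatesTo", "gene:G3"),
    ("gene:G3", "RelatedTo", "gene:G1"),
]

# Hypothetical ablation settings: the edge types each run may follow.
ABLATIONS = {
    "no_gene_relations": {"Authorship", "Mention"},
    "full": {"Authorship", "Mention", "RelatesTo", "RelatedTo"},
}


def walk(query, edges, allowed_types, steps):
    """Uniform random walk that only follows edges of the allowed types."""
    adj = defaultdict(list)
    for src, edge_type, dst in edges:
        if edge_type in allowed_types:
            adj[src].append(dst)
    dist = {query: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in dist.items():
            # Assumption: a node with no allowed outgoing edge keeps its mass.
            nbrs = adj.get(node) or [node]
            for nb in nbrs:
                nxt[nb] += mass / len(nbrs)
        dist = dict(nxt)
    return dist


# Probability of reaching gene:G3 from author:A under each setting.
results = {
    name: walk("author:A", EDGES, allowed, steps=3).get("gene:G3", 0.0)
    for name, allowed in ABLATIONS.items()
}
print(results)
```

Here gene:G3 is reachable only through a gene-to-gene relation, so removing those edge types drops its probability to zero, which is the kind of contribution an ablation is meant to expose.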