Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators DPB.



Plan
- What does a curator do?
- What do we ALL (researchers and curators) want from the papers we read?
- What problems do we encounter when reading papers?
  - Identifying items
  - Choosing annotations
- How can we work together to improve these processes?
- Why does this matter to YOU?
- Discussion

What does a curator do?
It depends on the type of curator!
Functional genomics curator / Metabolic pathway curator:
- Read LOTS of papers
- Help to maintain the TAIR and Plant Metabolic Network / AraCyc websites
- Answer questions from users
- Give presentations and workshops at conferences and universities
- Interact with curators at other institutions to develop better curation practices and tools

What do we all want from papers?

What do we all want from papers?
It depends on the type of paper! I focus on papers that describe:
- genes/proteins (TAIR and PMN)
- metabolic pathways (PMN)
We all want the important information! Curators also want to be able to capture that information and display it for users on the TAIR and AraCyc/PMN websites.

What do we all want from papers?
What gene / protein are they talking about?
- AGI locus code (TAIR / PMN): At2g46990
- Gene symbol and FULL names (TAIR / PMN):
  - BSK3 = Brassinosteroid (BR)-signaling kinase 3
  - GGT2 = Glutamate:Glyoxylate aminotransferase 2
- Gene model (TAIR): At2g
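A small sketch of how identifiers in this style could be picked out of a sentence automatically. The pattern is an assumption based on the examples on the slides ("At", a chromosome label, "g", five digits, and an optional ".N" suffix for a gene model). `find_agi_codes` is a hypothetical helper, not a TAIR tool, and At1g01010.1 appears only to illustrate the gene-model format.

```python
import re

# Assumed pattern: "At" + chromosome (1-5, C or M) + "g" + five digits,
# optionally followed by ".N" to name a specific gene model.
AGI_PATTERN = re.compile(r"\bAt([1-5CM])g(\d{5})(?:\.(\d+))?\b", re.IGNORECASE)


def find_agi_codes(text):
    """Return (locus, gene_model_or_None) pairs in order of appearance."""
    results = []
    for match in AGI_PATTERN.finditer(text):
        chromosome, number, model = match.groups()
        locus = f"AT{chromosome.upper()}G{number}"
        gene_model = f"{locus}.{model}" if model else None
        results.append((locus, gene_model))
    return results


# Illustrative sentence; the gene-model identifier is a format example only.
sentence = "We studied locus At2g46990 and gene model At1g01010.1."
print(find_agi_codes(sentence))
```

A paper whose text yields no match for a gene it discusses is the situation described under Case 1 of the challenges: a curator has nothing unambiguous to attach an annotation to.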

What do we all want from papers?
What does this gene do?
- Molecular Function GO terms (TAIR)
  - has protein kinase activity - GO:
  - functions in histone binding - GO:
  - has L-glutamine transmembrane transporter activity - GO:
- Phenotype description (TAIR)
  - The ppc4-2 mutant has reduced PEP carboxylase activity
- Reactions catalyzed (PMN)
  - indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate (IAA)
- Information for gene summaries (TAIR)
- Information for enzyme summaries (PMN)

What do we all want from papers?
Where is this protein found?
- Cellular Component GO terms (TAIR)
  - located in nucleolus - GO:
  - located in TOC complex - GO:
- Cellular Ontology (PMN)
  - chloroplast
- Information for gene summaries (TAIR)
- Information for enzyme summaries (PMN)

What do we all want from papers?
When and where is this gene / protein expressed?
- Plant Structure PO terms (TAIR)
  - expressed in anther - PO:
- Plant Growth Stages PO terms (TAIR)
  - expressed during expanded cotyledon stage - PO:
- Information for gene summaries (TAIR)
- Information for enzyme summaries (PMN)

What do we all want from papers?
What biological processes does this protein participate in?
- Biological Process GO terms (TAIR)
  - involved in petal development - GO:
  - involved in L-glutamate import - GO:
  - involved in brassinosteroid biosynthetic process - GO:
- Metabolic Pathways (PMN)
  - put enzyme in alanine degradation pathway
- Phenotype descriptions
  - The phot1-4 mutant shows reduced responses to blue light
- Information for gene summaries (TAIR)
- Information for enzyme summaries (PMN)

What do we all want from papers?
What mutant(s) did they describe? (TAIR)
- Mutant ID
  - SALK_nnnnnn
  - SAIL_21_A07
- Mutant name and unique symbol
  - rte1-2 (reversion-to-ethylene-sensitivity 1-2)
- Ecotype
- Ploidy level (e.g. heterozygous, homozygous)
- Phenotype description

What do we all want from papers?
What experiments did they do?
Assay conditions and reagents help curators:
- make GO and PO annotations (TAIR)
- identify enzymatic reactions (PMN)
  - specific substrates, e.g. L-glutamate
  - necessary co-factors, e.g. Mg2+
- capture pH and temperature optimums (PMN)
We don't capture:
- PCR primers
- good antibody sources
- etc.
... but you are welcome to submit this information using Comments

What do we all want from papers?
A lot of important information...
- Gene identity
- Gene function
- Gene expression patterns
- and much more!
Have you ever read a paper that's missing important information?
- How did that make you feel?
- Did it interfere with your ability to do your work?

Challenges: Identifying Objects
Case 1: Paper describes a gene or genes using a symbol
- Authors never provide AGI code, sequence information, or other unique ID
- Different genes can have the same symbols in TAIR
  - ASA: Attenuated shade avoidance? Anthranilate Synthase Alpha Subunit?
  - ARF1: Auxin Response Factor 1? ADP-Ribosylation Factor 1?
- Not all symbols are in TAIR
  - Authors describe a new mutant or name a new gene family and never give IDs
- Impossible for us to annotate / Impossible for you to do related experiments
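The symbol problem in Case 1 can be sketched as a lookup that reports how many genes a symbol could refer to. The table below holds only the two examples from the slide. `SYMBOL_TABLE` and `resolve_symbol` are hypothetical names for illustration, not part of any TAIR interface.

```python
# Toy symbol table built only from the two examples on the slide.
SYMBOL_TABLE = {
    "ASA": ["Attenuated shade avoidance", "Anthranilate Synthase Alpha Subunit"],
    "ARF1": ["Auxin Response Factor 1", "ADP-Ribosylation Factor 1"],
}


def resolve_symbol(symbol, table=SYMBOL_TABLE):
    """Classify a gene symbol as 'unknown', 'unique', or 'ambiguous'."""
    candidates = table.get(symbol.upper(), [])
    if not candidates:
        status = "unknown"      # symbol is not in the table at all
    elif len(candidates) == 1:
        status = "unique"       # exactly one gene uses this symbol
    else:
        status = "ambiguous"    # several genes share this symbol
    return status, candidates


for symbol in ("ARF1", "ASA", "XYZ1"):
    print(symbol, resolve_symbol(symbol))
```

Both "ambiguous" and "unknown" leave the reader unable to say which gene was studied, which is why a unique identifier alongside the symbol matters.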

Challenges: Identifying Objects
Case 2: Paper does not specify gene model when appropriate
a. "The T-DNA insertion is in the third exon of TPK1" - Which third exon?
b. "We expressed TPK1 in E. coli and saw activity" - Which TPK1?
c. "A TPK1:GFP fusion protein localizes to the nucleus" - Which TPK1?

Challenges: Identifying Objects
Case 3: Not enough information is given about a mutant
"The phyb mutant had a longer hypocotyl than the wild type plant"
- 30 alleles / germplasms associated with phyB in TAIR
- Which phyb?
- What ecotype?

Challenges: Identifying Objects
Case 4: Not enough information is given about enzymatic reactions
"In vitro, AR dehydrogenase catalyzed the formation of tyrosine from arogenate"
Diagram in paper shows: arogenate -> tyrosine
- D- or L-form of amino acid?
- What oxidizing agent is involved?
- What other substrates or products are involved?
"We detected the formation of arabidiol"
- What is the chemical structure of arabidiol?

Opportunities: Identifying Objects
You can help each other and curators to identify all the important items in the manuscripts you write or review
- AGI locus code for all genes in paper (At2g46990)
- Gene model information when relevant (At2g )
- Specific mutant names (abc1-7), IDs (SALK_nnnnn) and ecotype
- Complete and balanced biochemical reactions
- Chemical structures or chemical database IDs for compounds
You are the next generation of:
- Authors
- Reviewers
- Journal Editors
But, for curators, identifying objects is only one of the challenges...
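What "complete and balanced" means for a reaction can be illustrated with a simple atom count. This is a minimal sketch, not how the PMN checks reactions. The molecular formulas are assumptions supplied here for the neutral forms of the compounds named on the slides (they do not appear in the presentation), and the parser handles only flat formulas without parentheses or charges.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a flat formula such as 'C10H8N2' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side):
    """Sum atoms over one side, given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(substrates, products):
    return side_counts(substrates) == side_counts(products)


# indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate (IAA)
# Formulas are assumed neutral forms, not taken from the slides.
nitrilase = is_balanced(
    [(1, "C10H8N2"), (2, "H2O")],
    [(1, "NH3"), (1, "C10H9NO2")],
)

# arogenate -> tyrosine, exactly as drawn in the Case 4 diagram
# (assumed formulas; no oxidizing agent or other products listed).
as_drawn = is_balanced([(1, "C10H13NO5")], [(1, "C9H11NO3")])

print(nitrilase, as_drawn)
```

The first reaction balances as written. The second does not, which mirrors the Case 4 questions about the missing oxidizing agent and the other substrates or products.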

Challenges: Choosing annotations
Curators have to make decisions...
- When should we make annotations?
- What specific annotations should we make?
You should be concerned about how we choose annotations
- You are data providers
  - We're capturing the data from your papers
  - How would you like to see it presented?
- You are data users
  - You use our annotations of individual genes
  - You analyze your microarray data using our GO and PO annotations
  - You view your transcript and metabolomic data using the OMICs viewer
  - How would you like to see it presented?

Challenges : Choosing annotations – YOU make the call! When and what should we annotate using GO terms?

Challenges: Choosing annotations – YOU make the call!
Case 1: When is something involved in a biological process?
- Molecular Function and Cellular Component annotations – pretty clear
- Biological Process can be pretty ambiguous!
Example: Glycine metabolic process
- 6 mutants are uncovered that have altered levels of glycine
  - lgl1-1, lgl2-1, lgl3-1 make Less GLycine than wild-type plants
  - mgl1-1, mgl2-1, mgl3-1 make More GLycine than wild-type plants
- Annotate all 6 genes: involved in glycine metabolic process
- Use evidence code: IMP = inferred from mutant phenotype
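The Case 1 decision can be sketched as data: one annotation record per gene, each carrying the same term and the IMP evidence code. The `Annotation` class and its field names are hypothetical and are not the TAIR or GO annotation file format. Deriving the gene symbol by upper-casing the allele prefix is also an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene: str           # gene symbol, e.g. "LGL1"
    relationship: str   # e.g. "involved in"
    term: str           # ontology term name
    evidence_code: str  # e.g. "IMP" = inferred from mutant phenotype


def gene_from_allele(allele):
    """Assumed convention: allele 'lgl1-1' belongs to gene 'LGL1'."""
    return allele.split("-")[0].upper()


# The six mutant alleles named on the slide.
mutants = ["lgl1-1", "lgl2-1", "lgl3-1", "mgl1-1", "mgl2-1", "mgl3-1"]

annotations = [
    Annotation(
        gene=gene_from_allele(allele),
        relationship="involved in",
        term="glycine metabolic process",
        evidence_code="IMP",
    )
    for allele in mutants
]

for annotation in annotations:
    print(annotation)
```

All six records look identical apart from the gene name, so the evidence code is the only signal a data user has that the annotation rests on a mutant phenotype and not on a demonstrated biochemical role.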

Challenges: Choosing annotations – YOU make the call!
Which genes are involved in glycine metabolic process?
- LGL1 = threonine aldolase ?
- LGL2 = transcription factor: up-regulates enzyme ?
- LGL3 = tyrosine kinase: turns on TF ?
- MGL1 = F-box protein (E3 ligase subunit): degrades kinase ?
- MGL2 = phosphatase: promotes E3 ligase activity ?
- MGL3 = nucleoporin: allows phosphatase to enter nucleus ?
Where do we stop?
Should we change old annotations? (***Evidence code is important – be aware of IMP!)
What belongs in a GO annotation versus a phenotype description?

Challenges: Choosing annotations – YOU make the call!
Case 2: How do we deal with over-expressers? RNAi? etc.?
What biological process is XYZ1 involved in?
- 35S:XYZ1: more petals than wild type plants
- xyz1 KO mutants: normal number of petals
Is XYZ1 involved in petal development?
- XYZ1 is only expressed in roots ?
- XYZ1 is expressed at very low levels in flowers ?
- XYZ1 – no expression data mentioned ?
- What if XYZ1 is part of a large gene family ?
- What if XYZ1 is unique (not related to other genes) ?

Challenges: Choosing annotations – YOU make the call!
Case 3: When is it enough to make an annotation?
Expression statements:
- JKL is expressed in rosette leaves ?
- RT-PCR analyses show expression of JKL in rosette leaves ?
- JKL is expressed at low levels in rosette leaves ?
- JKL expression is barely detectable in rosette leaves ?
GHI has enzymatic activity with the following substrates in vitro:
- IAA + isoleucine -> IAA-Ile (90%) ?
- IAA + leucine -> IAA-Leu (50%) ?
- IAA + histidine -> IAA-His (20%) ?
- IAA + cysteine -> IAA-Cys (5%) ?
- IAA + proline -> IAA-Pro (1%) ?
Which Molecular Functions do we annotate with GO in TAIR?
Which reactions do we add to AraCyc?
What if the reactions are characterized in vivo?
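The GHI question is effectively a question about where to set a cutoff. The sketch below shows how the set of annotated reactions changes as the cutoff moves. The percentages are copied from the slide and read here as relative activities, which is an assumption. The cutoff values are arbitrary illustrations and not a TAIR or PMN policy, and `reactions_to_annotate` is a hypothetical helper.

```python
# (amino acid, conjugate, percent) as listed on the slide.
GHI_ACTIVITIES = [
    ("isoleucine", "IAA-Ile", 90),
    ("leucine", "IAA-Leu", 50),
    ("histidine", "IAA-His", 20),
    ("cysteine", "IAA-Cys", 5),
    ("proline", "IAA-Pro", 1),
]


def reactions_to_annotate(cutoff_percent, activities=GHI_ACTIVITIES):
    """Return the reactions whose listed percentage meets the cutoff."""
    return [
        f"IAA + {amino_acid} -> {conjugate}"
        for amino_acid, conjugate, percent in activities
        if percent >= cutoff_percent
    ]


# Arbitrary example cutoffs: each one yields a different answer.
for cutoff in (50, 10, 1):
    print(cutoff, reactions_to_annotate(cutoff))
```

No cutoff is obviously right, which is the point of the slide: the same paper supports two, three, or five reactions depending on a judgment call.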

Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support Which genes are expressed in these tissues?

Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support The expression of 11 genes was detected in leaves. ? ? ?

Challenges: Choosing annotations – YOU make the call!
Case 5: Which term is most appropriate?
GRI (Grim Reaper) is involved in the regulation of extracellular ROS-induced cell death
"gri plants show increased ROS-induced cell death and reduced seed content. The seed content in siliques was reduced in gri and GRI overexpressors compared with Col-0 and vector control." (Wrzaczek et al 2009)
- involved in fruit development ?
- involved in seed development ?
Are the siliques shorter?
Are there empty spaces in normal siliques?

Opportunities: Choosing annotations – YOU make the call!
You can be the annotators of the future!
- informally: contact us or drop by and say hello!
- use TAIR or PMN submission forms
- during journal publication process
  - Plant Physiology (now)
  - more journals in the future!

Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators
- We all read papers
- We all want to extract important and useful information from papers
- We all want reliable annotations in our databases
Challenges:
- Sometimes it is difficult to find the information we need in papers
- Sometimes it is hard to judge how to curate data in papers
Opportunities:
- Authors, reviewers, and editors can make sure that papers have adequate information
- Curators can help researchers to directly submit annotations to TAIR or the PMN
- Curators and researchers can communicate about the curation process
You know what we want. We know what you want!
We all work together to advance scientific research!

Thank you!
Current Curators:
- Tanya Berardini (lead curator – functional annotation)
- David Swarbreck (lead curator – structural annotation)
- Peifen Zhang (Director and lead curator – metabolism)
- A. S. Karthikeyan (curator)
- Philippe Lamesch (curator)
- Donghui Li (curator)
- Rajkumar Sasidharan (curator)
Recent Past Contributors:
- Debbie Alexander (curator)
- Christophe Tissier (curator)
- Hartmut Foerster (curator)
NSF Tech Team Members:
- Bob Muller (Manager)
- Larry Ploetz (Sys. Administrator)
- Raymond Chetty
- Anjo Chi
- Vanessa Kirkup
- Cynthia Lee
- Tom Meyer
- Shanker Singh
- Chris Wilks
Metabolic Pathway Software:
- Peter Karp and SRI group
TAIR, AraCyc, and the PMN:
- Eva Huala (Director and Co-PI)
- Sue Rhee (PI and Co-PI)