1
DATA MANAGEMENT AND CURATION AT TAIR
Margarita Garcia-Hernandez
2
The ‘systems biology’ paradigm
FACT: huge amounts of data
NEED: systematic harvesting & easy accessibility (store, sort, interlink)
PROBLEM: complexity & heterogeneity of data
CHALLENGE: to describe complete biological systems in an integrated way (organizing, defining relationships, defining metadata standards, interpreting, quality control assessment – DATA MANAGEMENT)
With the advent of new techniques in biology we are experiencing an explosion of data similar to that seen in other disciplines, such as astronomy. Researchers need not only easy access to the data but also the ability to use it comprehensively, for example to form hypotheses. The problem is that the data is not only complex but also comes from diverse sources. The most widely accepted solution is to use relational databases, in which the data is stored so that relationships between different data classes can be established and maintained. The challenge is how to model and manage the data: how to organize it, define the relationships within it, standardize it, and provide quality control assessments. My talk will focus on how we address this at TAIR.
3
Data Management Flow Chart
Data generation
Collection
Selection
Organization of similar data types
Remove redundancy
Correct errors
Data curation
Association of different data types
Establish unambiguous identifiers
Define and validate relationships
Resolve data heterogeneity - standardization
Annotation - add descriptions
Define standard vocabulary
Data modeling
Database population
Data dissemination
This cartoon gives an overview of the steps involved in managing data, from data generation to dissemination to the public. After the data is collected from its source or sources, a selection step may follow, for example to exclude low-quality or irrelevant data.
4
Quality Control Issues
Issues: Accuracy of information; Consistency in format and content; Up-to-datedness; Conflicting data
TAIR’s approaches: Personnel training (Ph.D.-level biologists); User input; Source attribution; Checking curation consistency (computationally and manually); Adopting Standard Operating Procedures; Defining and using controlled vocabularies
An important issue in data curation is quality control, and above all how quality is recognized: how good is the data, and how do we tell good data from bad? There are several issues related to quality control: first, how accurate the information is; second, how consistent it is, both in format and in content; third, how up to date it is (data changes continuously); and last, but not least, how to deal with conflicting data.
5
Many Data Types with Many Sources
Sources: Literature; Public Databases; Community submissions; Computational analysis; Functional Genomic projects
Data Types: Genes/Gene Products; Mutant Phenotypes; Expression; Metabolism; Stocks
6
Two Examples of Data Curation
Sources: Literature; Public Databases (SMD); Community submissions; Functional Genomic projects (AFGC); Computational analysis
Data Types: Genes/Gene Products; Mutant Phenotypes; Microarray Expression; Metabolism; Stocks
7
Literature Curation PubSearch
PubSearch is a literature curation management system designed to store and manage the available literature for an organism of interest. The software is freely available from the Generic Model Organism Database (GMOD) project, a joint effort by several model organism databases to develop reusable components for creating new biological databases. PubSearch is one of the GMOD toolkits.
8
Literature Curation Step 1: Collection of References
Sources: Meeting abstracts; Dissertations; Textbooks; Full-text papers (scanning, online); Biosis, PubMed, Agricola (‘Arabidopsis’ in title or abstract)
Arabidopsis References
Remove redundancy
Journal names standardization
PubSearch DB (21,527) (curation tool)
TAIR DB (public db)
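The redundancy-removal and journal-name standardization steps above could be sketched roughly as follows. This is only an illustration, not the PubSearch implementation: the record fields (`title`, `journal`, `year`, `source`), the tiny `JOURNAL_ALIASES` table, and the choice of (title, journal, year) as the duplicate key are all assumptions made for the example.

```python
import re

# Hypothetical alias table (assumption); a real one would be far larger.
JOURNAL_ALIASES = {
    "plant j": "The Plant Journal",
    "the plant journal": "The Plant Journal",
    "plant physiol": "Plant Physiology",
    "plant physiology": "Plant Physiology",
}

def _key(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

def standardize_journal(name):
    """Return the standard journal name if the alias is known, else the input."""
    return JOURNAL_ALIASES.get(_key(name), name.strip())

def merge_references(records):
    """Keep the first record seen for each (title, journal, year) key."""
    seen = {}
    for rec in records:
        journal = standardize_journal(rec["journal"])
        k = (_key(rec["title"]), journal, rec["year"])
        if k not in seen:
            seen[k] = {**rec, "journal": journal}
    return list(seen.values())

records = [
    {"title": "A gene for X.", "journal": "Plant J.", "year": 2001, "source": "PubMed"},
    {"title": "A Gene for X", "journal": "The Plant Journal", "year": 2001, "source": "Biosis"},
    {"title": "Another study", "journal": "Plant Physiol.", "year": 2002, "source": "Agricola"},
]
merged = merge_references(records)
```

Here the first two records collapse into one because they differ only in punctuation, capitalization, and journal abbreviation. Near-duplicates with genuinely different titles would still need manual review.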
9
Literature Curation Step 2: Assigning References to Genes
Arabidopsis References (PubSearch)
Term List (17,470): known gene names, candidate gene names
Scanning references for terms in list (programmatically)
Reference hit: Gene X: Ref 1, Ref 2, .. Ref n
Validation (by curators using PubSearch)
Validated list of references for each gene
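A minimal sketch of the programmatic scanning step, assuming references are available as plain text keyed by an ID and that a term counts as a hit only when it is not embedded in a longer alphanumeric token. The function name, the matching rule, and the example texts are assumptions for illustration; the slides do not describe how PubSearch matches terms.

```python
import re

def find_gene_hits(references, terms):
    """Map each term to the IDs of references whose text mentions it."""
    hits = {}
    for term in terms:
        # Reject matches that sit inside a longer token, e.g. "AG" in "AGL20".
        pattern = re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
        matched = [ref_id for ref_id, text in references.items() if pattern.search(text)]
        if matched:
            hits[term] = matched
    return hits

references = {
    "ref1": "The AG gene controls floral organ identity.",
    "ref2": "AGL20 promotes flowering.",
    "ref3": "Seed storage proteins",
}
hits = find_gene_hits(references, ["AG", "AGL20", "CLV1"])
```

Case-insensitive matching will still produce false hits for short gene names that coincide with ordinary words, which is one reason the hits go to curators for validation rather than straight into the database.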
10
Literature Curation Step 3: Extracting information
Gene-centric curation approach
Each curator is assigned 2 genes per day
Papers are read and information extracted (following SOPs and using the PubSearch curation tool):
Name validation & add aliases
Add sequence info
Assign locus (mapping to the genome by BLAST)
Merge/split genes
Write summary sentence
Correct errors
Annotation using controlled vocabularies (GO, POC)
11
Controlled Vocabularies
A collection of defined terms (organized in a hierarchy) intended to serve as a standard nomenclature
Provide a common set of terms that users of a single system (or across multiple systems) can share
Allow retrieval of ALL relevant information
Example: Find all the genes that have transporter activity (regardless of how they are named, or what type they are)
It is very important to ensure consistency.
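The transporter-activity example can be sketched as a hierarchy lookup: a query for a general term also returns genes annotated with any of its more specific descendants. The miniature vocabulary, the gene names, and the single-parent structure below are assumptions for illustration; real vocabularies such as GO allow a term to have several parents.

```python
# Hypothetical miniature vocabulary (assumption): child term -> parent term.
PARENT = {
    "sugar transporter activity": "transporter activity",
    "ion transporter activity": "transporter activity",
    "potassium ion transporter activity": "ion transporter activity",
    "kinase activity": "catalytic activity",
}

# Hypothetical annotations (assumption): gene -> directly assigned term.
ANNOTATIONS = {
    "GENE_A": "sugar transporter activity",
    "GENE_B": "potassium ion transporter activity",
    "GENE_C": "kinase activity",
}

def is_a(term, ancestor):
    """Walk up the hierarchy from term; True if ancestor is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False

def genes_with(term):
    """All genes annotated with term or with any of its descendants."""
    return sorted(g for g, t in ANNOTATIONS.items() if is_a(t, term))

transporters = genes_with("transporter activity")
```

Neither gene is annotated with the literal string "transporter activity", yet both are retrieved, which is the point of organizing the terms in a hierarchy.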
12
Controlled Vocabularies used at TAIR
Gene Ontology (GO)
Goal: to produce a controlled vocabulary for describing genes and proteins that can be applied to all organisms
Molecular Function
Cellular Component
Biological Process
Plant Ontology Consortium (POC)
Gramene, TAIR, Univ Missouri St Louis, MaizeDB, IRIS, MIPS, Oryzabase, with Monsanto & Pioneer as collaborators
Goal: to develop structured controlled vocabularies for plant-specific knowledge domains:
Plant Anatomy (morphology, organs, tissue and cell types)
Temporal stages (plant growth and developmental stages)
Phenotype Ontology (in the works)
13
Qualifying Annotations with supporting evidence
References
Evidence code usage: a controlled vocabulary of codes indicating the type of evidence that supports the association between a gene product and an annotation
IDA: Inferred from Direct Assay
IMP: Inferred from Mutant Phenotype
ISS: Inferred from Sequence Similarity
IEA: Inferred from Electronic Annotation
IEP: Inferred from Expression Pattern
……
Evidence code description, e.g., for IPI (Inferred from Physical Interaction): Co-immunoprecipitation; Co-purification; Co-sedimentation; ….
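One way to picture how evidence codes and references qualify an annotation is a small record check like the one below. The record layout (`gene`, `term`, `evidence_code`, `reference`) and the function name are assumptions for illustration; only the six codes named on the slide are included.

```python
# The six evidence codes named on the slide; the full set is longer.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence Similarity",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IPI": "Inferred from Physical Interaction",
}

def validate_annotation(annotation):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if annotation.get("evidence_code") not in EVIDENCE_CODES:
        problems.append("unknown evidence code")
    if not annotation.get("reference"):
        problems.append("missing reference")
    return problems

good = {"gene": "GENE_A", "term": "transporter activity",
        "evidence_code": "IDA", "reference": "REF_1"}
bad = {"gene": "GENE_B", "term": "kinase activity",
       "evidence_code": "XYZ", "reference": ""}
```

A check of this kind only confirms that the supporting fields are present and well formed; whether the cited paper actually supports the annotation remains a curator's judgment.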
14
Gene Annotation Display in TAIR
15
QC of Literature Curation
Weekly annotation meeting
Quality control manager
Use of standardized vocabularies
Random checks of annotations
Annotations are tagged by date and curator
Automatic checks in software
Use SOPs - curation guidelines
16
Curation of Microarray Data
Sources: Literature; Public Databases (SMD); Community submissions; Functional Genomic projects (AFGC); Computational analysis
Data Types: Genes/Gene Products; Mutant Phenotypes; Microarray Data; Metabolism; Stocks
17
Curation of AFGC Microarray Data Data Collection and Selection
Sources: Arabidopsis Functional Genomics Consortium (AFGC); Stanford Microarray Database (SMD)
Collected: sample info; proposal abstracts; protocols; results; array design; minimal descriptions of individual arrays
Selection: all Arabidopsis public arrays; exclude QC arrays (45)
Selected Arrays (516)
Metadata: Array Elements; Protocols; Samples; Experiments
Numeric Results Data (raw and normalized)
18
Curation of Metadata
Array elements
1. Classify, organize, add missing sequences, correct errors
2. Mapping to the Arabidopsis genome & association to genes (pipeline)
Samples & Experiments
1. Data extraction from flat files (abstracts, RNA forms) and database (SMD), e.g., tissue type, treatments, experimental design
2. Organization of data & parsing into tables
3. Develop controlled vocabularies for experiment categorization & treatments
4. Standardization using those vocabularies
5. Data association: grouping arrays into replicate sets and experiments; merging replicate samples to minimize redundancy; linking to other related data (germplasm, clones, publications, people)
6. Annotation. Experiments: GO process, category, experimental variables. Samples: tissues (POC anatomy & temporal) & treatment
Data Submission
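Once sample descriptions have been standardized against the controlled vocabularies, grouping arrays into replicate sets can be as simple as keying on the standardized fields. The sketch below assumes a hypothetical record layout (`array_id`, `experiment`, `tissue`, `treatment`) and assumes that arrays sharing all three descriptors are replicates; the slides do not state the actual grouping rule.

```python
from collections import defaultdict

def group_replicates(arrays):
    """Group array IDs that share experiment, tissue, and treatment."""
    sets = defaultdict(list)
    for a in arrays:
        key = (a["experiment"], a["tissue"], a["treatment"])
        sets[key].append(a["array_id"])
    return dict(sets)

arrays = [
    {"array_id": "A1", "experiment": "E1", "tissue": "leaf", "treatment": "cold"},
    {"array_id": "A2", "experiment": "E1", "tissue": "leaf", "treatment": "cold"},
    {"array_id": "A3", "experiment": "E1", "tissue": "root", "treatment": "cold"},
]
replicate_sets = group_replicates(arrays)
```

The grouping only works because the descriptors were standardized first: free-text values such as "leaf" and "leaves" would otherwise land in different sets.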
19
Curation of Microarray Results Data
Numeric Results Data
-Quality control
Remove poor quality arrays (2)
Exclude spots flagged as bad
Re-normalize using lowess method (minimize spatial bias)
Remove arrays with strong spatial/plate bias (72) (ANOVA)
Exclude array elements with intensity < 350 in both channels
Exclude array elements with null values in 80% of arrays
-Analysis
Calculate log2 ratio [ch2N/ch1N]
Calculate fold change [ch2N/ch1N]
Calculate averages for each array element (array & replicates)
Element fold change/log2 ratio std error per array
Element fold change/log2 ratio std error per replicate arrays
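Several of the filtering and analysis steps above reduce to simple arithmetic, sketched below. The function names are hypothetical, `None` stands in for a null value, and treating "null values in 80% of arrays" as "at least 80%" is an assumption. The lowess re-normalization and the ANOVA-based bias check are not shown.

```python
import math
from statistics import mean, stdev

MIN_INTENSITY = 350       # threshold quoted on the slide
MAX_NULL_FRACTION = 0.8   # "80% of arrays"; the >= boundary is an assumption

def passes_intensity(ch1, ch2):
    """Keep a spot unless BOTH channel intensities are below the threshold."""
    return not (ch1 < MIN_INTENSITY and ch2 < MIN_INTENSITY)

def too_sparse(values):
    """True when an element is null (None) in at least 80% of arrays."""
    nulls = sum(v is None for v in values)
    return nulls / len(values) >= MAX_NULL_FRACTION

def fold_change(ch1n, ch2n):
    """Ratio of normalized channel 2 to normalized channel 1."""
    return ch2n / ch1n

def log2_ratio(ch1n, ch2n):
    """Base-2 logarithm of the normalized channel 2 / channel 1 ratio."""
    return math.log2(ch2n / ch1n)

def mean_and_std_error(ratios):
    """Mean and standard error of the mean; the error is None for one value."""
    n = len(ratios)
    se = stdev(ratios) / math.sqrt(n) if n > 1 else None
    return mean(ratios), se
```

For instance, normalized intensities of 500 (channel 1) and 2000 (channel 2) give a fold change of 4 and a log2 ratio of 2. Averaging is usually done on the log2 scale so that up- and down-regulation are treated symmetrically.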
20
Conclusions
Requires trained biologists familiar with the data
Can be facilitated computationally (repetitive tasks), but is mainly a knowledge-based task that can only be done by humans
Essential for assuring data quality
Adds value to data
Slow process
Can be inconsistent