Published by Claire Reed. Modified over 9 years ago.
1
Curator Meeting: 10/27/03
Integration of New Data into RGD: Quality Control and Data Submission Tools
Rat Genome Database, http://rgd.mcw.edu
Bioinformatics Research Center, Medical College of Wisconsin, Milwaukee, USA
2
RGD Pipeline Background
- RGD is a relatively new MOD (model organism database)
  - Needed to integrate large amounts of historic data
  - Curation staff is limited
- Pipeline developed near the beginning of the RGD Project
  - Efficient methods to evaluate and integrate data
- Informatic methods were chosen to address the problems
  - Catch up with historic data
  - Achieve good productivity with limited staff
- Modular design
  - New types of data
  - New QC checks and methods
3
Data Sources (RGD 2.0)
- RHdb, EBI (UK): markers, primers
- RatMap, Goteborg (Sweden): genes, markers, QTLs
- WI/MIT: markers, genetic map
- MGD, Jackson Labs: markers, strains, genes
- NCBI: LocusLink, RefSeq, UniGene, etc.
- Otsuka (Japan): SSLPs
- U. Iowa: ESTs, RH map
- NIAMS ARB: maps and SSLPs
- MCW: all objects
- MCO: SSLPs
- Baylor (HGSC): sequences
- RGD: literature
4
Bulk Data Pipeline in the Curation Process
[Diagram. Labels: Data Sources (Databases, Literature, Websites); External Data; Internal Data; Informatic data mining; Data Pipeline; Regular Journal Screening and Curation; Ongoing Data Curation; RGD Database]
5
RGD Objects
RGD stores information about 11 fundamental data types (Objects):
1. Genes
2. Strains
3. QTLs
4. Traits
5. Sequences
6. ESTs
7. Maps
8. SSLPs
9. References
10. Homologs
11. Phenotypes
6
Relationships between RGD Objects
- Genes -> Genes, ESTs, SSLPs, and QTLs
- ESTs -> Genes
- SSLPs -> Genes and Strains
- QTLs -> Genes, Traits, Strains
- Traits -> QTLs
- Maps -> Maps Data
- Maps Data -> any RGD object
- References -> any RGD object
- Homologs -> any RGD object
- Strains -> any RGD object
- Sequences -> any RGD object
- Phenotypes -> any RGD object
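The relationship list above can be held as a simple lookup table. The following is a minimal sketch, not RGD's actual schema: the names `RELATIONSHIPS`, `ANY`, and `can_link` are hypothetical, and only the pairs listed on the slide are encoded.

```python
from typing import Dict, List

# Marker for "any RGD object", as written on the slide (hypothetical constant).
ANY = "any RGD object"

# Source object type -> target object types, copied from the slide.
RELATIONSHIPS: Dict[str, List[str]] = {
    "Genes": ["Genes", "ESTs", "SSLPs", "QTLs"],
    "ESTs": ["Genes"],
    "SSLPs": ["Genes", "Strains"],
    "QTLs": ["Genes", "Traits", "Strains"],
    "Traits": ["QTLs"],
    "Maps": ["Maps Data"],
    "Maps Data": [ANY],
    "References": [ANY],
    "Homologs": [ANY],
    "Strains": [ANY],
    "Sequences": [ANY],
    "Phenotypes": [ANY],
}


def can_link(source: str, target: str) -> bool:
    """Return True if the slide lists a relationship from source to target."""
    targets = RELATIONSHIPS.get(source, [])
    return ANY in targets or target in targets


print(can_link("Genes", "QTLs"))    # listed directly
print(can_link("ESTs", "Strains"))  # not listed
```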
7
RGD object Templates
8
Internal Data Sources QC functionality on data entry forms
9
Curation Annotations Notes Editor
10
Data Entry Summary Page
11
Edit Record in Submission Database
12
RGD Data Flow
[Diagram. Labels: Internal Systems; Public System; owner_1; Production; Cur_1; Curation; Owner_2; rgd.mcw.edu; dss; Curation data; Bulkdata; All objects; Online; Genes; QTLs; Strains; Bulk Data (Production-load); QC]
13
QC Checks in the Data Flow
Pipeline steps: Input raw data -> Check data -> Preload data -> Load data
BD Pipeline Database:
- keeps all raw data
- formats the data
- tracks all checking flags
- tracks all loading status
Web-based interface to view all processing status
Other diagram labels: RGD Database; Blasting results
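One way to picture what the BD Pipeline Database tracks for each record is sketched below. This is an illustration under assumptions, not the RGD implementation (which the deck says is written in Perl): the class `TrackedRecord`, its fields, and the stage names in `STAGES` are hypothetical, chosen to mirror the four steps named on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stage names mirror the slide's steps (hypothetical identifiers).
STAGES = ["INPUT", "CHECK", "PRELOAD", "LOAD"]


@dataclass
class TrackedRecord:
    raw: str                                   # raw data is kept unchanged
    status: str = "INPUT"                      # current loading status
    flags: Dict[str, str] = field(default_factory=dict)   # checking flags
    history: List[str] = field(default_factory=lambda: ["INPUT"])

    def set_flag(self, check: str, value: str) -> None:
        """Record the outcome of one QC check."""
        self.flags[check] = value

    def advance(self) -> str:
        """Move the record to the next stage and log the transition."""
        position = STAGES.index(self.status)
        if position == len(STAGES) - 1:
            raise ValueError("record is already loaded")
        self.status = STAGES[position + 1]
        self.history.append(self.status)
        return self.status


record = TrackedRecord(raw="Abc1\t1")  # hypothetical tab-delimited input line
record.advance()                       # INPUT -> CHECK
record.set_flag("symbol", "NEW")
print(record.status, record.flags)
```

Keeping the raw text alongside flags and status means a web view of processing status only needs to read these fields, which matches the tracking role the slide describes.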
14
QC Process Overview
Incoming Dataset
Level One: Integrity Checking
- Internal checking (blast/symbol)
Level Two: Identity Checking
- Blast against RGD database
- Check for identity conflicts: check symbol, check sequence via GB ID, check sequence via BLAST, check alias
Level Three: Attribute Checking
- Preload: check for any attribute conflicts
- Load: values without conflicts go into the RGD database
Conflict data files go to curation review; curators review the flags.
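The three levels can be sketched as an ordered list of checks. The code below is a hedged illustration only: `run_qc`, the toy checks, the `RGD` dictionary, and the gene symbol `Abc1` are all hypothetical, and stopping at the first level that reports a conflict is an assumption, since the slide does not say whether later levels still run.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Check = Callable[[Record], List[str]]  # returns a list of conflict messages

# Toy stand-in for the RGD database (hypothetical content).
RGD: Dict[str, Dict[str, str]] = {"Abc1": {"chromosome": "1"}}


def integrity_check(rec: Record) -> List[str]:
    """Level One: the record itself must be well formed."""
    return [] if rec.get("symbol") else ["missing symbol"]


def identity_check(rec: Record) -> List[str]:
    """Level Two: placeholder; identity conflicts are not modelled here."""
    return []


def attribute_check(rec: Record) -> List[str]:
    """Level Three: compare attribute values with the values in RGD."""
    known = RGD.get(rec["symbol"])
    if known and "chromosome" in rec and rec["chromosome"] != known["chromosome"]:
        return ["chromosome differs from RGD"]
    return []


LEVELS: List[Tuple[str, Check]] = [
    ("Level One", integrity_check),
    ("Level Two", identity_check),
    ("Level Three", attribute_check),
]


def run_qc(rec: Record, levels: List[Tuple[str, Check]]) -> Tuple[str, List[str]]:
    """Run levels in order; the first level with conflicts sends the record to review."""
    for name, check in levels:
        conflicts = check(rec)
        if conflicts:
            return "REVIEW", [f"{name}: {c}" for c in conflicts]
    return "LOAD", []


print(run_qc({"symbol": "Abc1", "chromosome": "2"}, LEVELS))
```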
15
Examples of Checks
- New symbol matches an RGD symbol
- New symbol matches an alias in RGD
- New record has a GBID
- New GBID matches the RGD record
- New GBID matches GBID of alias gene
- New GBID matches any other RGD record
- New sequence matches any RGD sequence
- Every attribute value compared to RGD values
16
Review Pipeline’s QC Checks
17
Review Conflicts
18
Excel Summary Report
The Conflict Data Report lists the bin ID for data that requires further curation (BLAST/BLAT analysis).
19
Conflict Data Discovered by the Bulk Data Pipeline
- Nomenclature conflicts
  - Symbols were incorrect
- Sequence conflicts
  - Sequence reads were unacceptable due to poor quality (many N's)
  - Primers were switched
  - Sequences in the dataset were associated with different objects in RGD
- Alias conflicts
  - Dataset aliases were RGD objects
  - Dataset symbols were in RGD aliases
- Attribute conflicts
  - Chromosomes were different in RGD
  - Cytological positions were different in RGD
  - Expected sizes of PCR products were different in RGD
- Redundant data conflicts
  - Datasets had duplicate entries
20
Curation of Conflicting Data
- Checking processes find conflicting data
- Manual curation to resolve conflicts: nomenclature, sequence, alias symbols, attributes, redundant records
- Resolvable: curated data, load into RGD (overwrite current data)
- Irresolvable: removed data, store data in file (notify source)
21
After Load
22
Acknowledgements
- Principal Investigators: Howard Jacob, Peter Tonellato, Simon Twigger
- RGD Bioinformatics: Dean Pasko, Jiali Chen, Lan Zhao, Henry Fan, Wenhua Wu, Jian Lu, Hanping Long
- RGD Curation: Mary Shimoyama, Susan Bromberg, Rajni Nigam, Chin-fu Chen, Gopal Gopinathrao, Charles Wang, Victoria Petri, Dorothy Reilly, Cindy Foote, Angela Zuniga-Meyer, Nataliya Nenasheva
26
Model Organism Bulk Data Processing Work Flow
27
Case 1
  Description: New record has an RGD_ID, and both symbol and RGD_ID of new record match an active RGD record
  Expected result: Set symbol flag to "IN_RGD_1"
  Note: The process will continue through the GenBank ID check
Case 2
  Description: New record has an RGD_ID, and new symbol matches an active RGD symbol, but new RGD_ID does not match the RGD_ID of matching symbol
  Expected result: Set symbol flag to "DIF_RGD_ID"
  Note: The process will continue through the GenBank ID check
Case 3
  Description: New record has an RGD_ID, but new symbol does not match an active RGD symbol
  Expected result: Set symbol flag to "DIF_SYMBOL"
  Note: The process will continue through the GenBank ID check
Case 4
  Description: New record does not have an RGD_ID, but new symbol matches an active RGD symbol
  Expected result: Set symbol flag to "IN_RGD_2"
  Note: The process will continue through the GenBank ID check
Case 5
  Description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol
  Expected result: Set symbol flag to "NEW"
  Note: The process will continue through the GenBank ID check
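Cases 1 to 5 amount to a small decision function on two inputs: whether the new record carries an RGD_ID, and whether its symbol matches an active RGD symbol. The sketch below is an illustration, not the pipeline's Perl code: `symbol_flag` and the `active_symbols` mapping are hypothetical names, the RGD_ID values are made up, and exact, case-sensitive symbol matching is an assumption.

```python
from typing import Dict, Optional


def symbol_flag(symbol: str, rgd_id: Optional[int],
                active_symbols: Dict[str, int]) -> str:
    """Return the symbol flag for cases 1 to 5.

    active_symbols maps each active RGD symbol to its RGD_ID.
    """
    matched_id = active_symbols.get(symbol)
    if rgd_id is not None:
        if matched_id is None:
            return "DIF_SYMBOL"   # case 3: RGD_ID given, symbol not active in RGD
        if matched_id == rgd_id:
            return "IN_RGD_1"     # case 1: symbol and RGD_ID match one record
        return "DIF_RGD_ID"       # case 2: symbol matches, RGD_ID differs
    if matched_id is not None:
        return "IN_RGD_2"         # case 4: no RGD_ID, symbol matches
    return "NEW"                  # case 5: no RGD_ID, symbol unknown


active = {"Abc1": 1001}  # hypothetical symbol and RGD_ID
for sym, rid in [("Abc1", 1001), ("Abc1", 2002), ("Xyz9", 1001),
                 ("Abc1", None), ("Xyz9", None)]:
    print(sym, rid, symbol_flag(sym, rid, active))
```

In every case the record then continues to the GenBank ID check, so the function only sets a flag and never rejects a record.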
28
Case 6
  Description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene
  Expected result: Change symbol flag to "IN_RGD_UPDATED"
  Note: Change current symbol flag after changing symbol (GBID check) and continue through GenBank ID check
Case 7
  Description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; new GBID matches a record in RGD and there is no alias matching that symbol; new symbol matches a retired or withdrawn symbol
  Expected result: Change symbol flag to "DIF_NON_ACTIVE_1"
  Note: Change current symbol flag and continue through GenBank ID check
Case 8
  Description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol; new record does not have a GBID; new symbol matches a retired or withdrawn symbol
  Expected result: Change symbol flag to "DIF_NON_ACTIVE_2"
  Note: Change current symbol flag and continue through GenBank ID check
29
Case 9
  Description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does match another GBID/seq in RGD
  Expected result: Set flag to "DIF_9:RGD_ID"
  Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review. Data without GBID match will be compared by BLAST after PRELOAD step in the pipeline, but NOT loaded until after curation review. RGD_ID is the value that is associated with the GBID/seq in RGD
Case 10
  Description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does not match any GBID/seq in RGD
  Expected result: Set flag to "DIF_10"
  Note: This case will be compared by BLAST after PRELOAD step in pipeline, but NOT loaded until after curation review.
Case 11
  Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD but new GBID doesn't match the GBID of the gene associated with the matching alias
  Expected result: Set flag to "DIF_11"
  Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review
30
New Check Aliases Use Case Diagram C. Fan
31
New Check Gene Symbol Use Case Diagram
35
Case 12
  Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene
  Expected result: Set flag to "IN_RGD_2"
  Note: The new symbol is changed to the RGD gene symbol of the gene associated with the matching alias and the data is loaded
Case 13
  Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, but alias is associated with more than one gene
  Expected result: Set flag to "DIF_13"
  Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review
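Cases 11 to 13 share the same preconditions (the new symbol is not an RGD symbol, it matches an RGD alias, and its GBID is known to RGD) and differ only in how the alias and GBID line up. A minimal sketch of that branch follows, assuming hypothetical names (`resolve_alias`, `alias_table`, `gene_gbids`) and using the flags from the case descriptions on these slides.

```python
from typing import Dict, List, Optional, Set, Tuple


def resolve_alias(new_symbol: str, new_gbid: str,
                  alias_table: Dict[str, List[str]],
                  gene_gbids: Dict[str, Set[str]]) -> Tuple[str, Optional[str]]:
    """Return (flag, symbol_to_load); None means hold the record for curation.

    alias_table maps an alias to the RGD gene symbols that carry it.
    gene_gbids maps an RGD gene symbol to its GenBank IDs.
    """
    genes = alias_table.get(new_symbol)
    if not genes:
        raise ValueError("precondition not met: symbol does not match an alias")
    matching = [g for g in genes if new_gbid in gene_gbids.get(g, set())]
    if not matching:
        return "DIF_11", None         # case 11: GBID belongs to a different gene
    if len(genes) > 1:
        return "DIF_13", None         # case 13: alias shared by several genes
    return "IN_RGD_2", matching[0]    # case 12: rename to the RGD gene symbol


# New record A/1 against RGD record B/1 with alias A.
print(resolve_alias("A", "1", {"A": ["B"]}, {"B": {"1"}}))
```

Only the single-gene match is loaded automatically; the other two outcomes finish PRELOAD and wait for curation review, as the notes state.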
36
Cases for GenBank ID check
Columns: Bin # | Symb. match | GBID match specific RGD record | GBID in new file | GBID in RGD | GBID match any RGD | Seq match any RGD (BLAST) | Symb. match alias | GBID match GBID of alias gene | Alias of more than one gene | Flag | Symbol/GBID/Alias
Condition values are listed left to right as they appear on the slide; merged or empty cells are not repeated.
Bin 1: yes, --, no, yes, -- | Flag: DIF_1 | New: A/-; RGD: A/1
Bin 2: yes, -- | Flag: IN_RGD_1 | New: A/1; RGD: A/1
Bin 3: yes, no, yes, no, -- | Flag: DIF_3 | New: A/1; RGD: A/2 or -; RGD: B/1
Bin 4: yes, no, yes, yes or no, yes, -- | Flag: DIF_4:RGD_ID | New: A/1; RGD: A/2; RGD: B/1
Bin 5: no, --, no, -- | Flag: DIF_5 | New: A/-; RGD: B/2 or -
Bin 6: no, --, yes, --, no, -- | Flag: NEW | New: A/1; RGD: B/2 or -
Bin 7: no, --, yes, yes or no, yes, no, -- | Flag: DIF_7:RGD_ID | New: A/1/C; RGD: B/1
Bin 8: yes, --, no, -- | Flag: DIF_8 | New: A/-; RGD: A/-
Bin 9: yes, --, yes, no, yes or no, yes, -- | Flag: DIF_9:RGD_ID | New: A/1; RGD: A/-
Bin 10: yes, --, yes, no, -- | Flag: DIF_10 | New: A/1; RGD: A/-
Bin 11: no, --, yes, --, yes, --, yes, no, -- | Flag: DIF_11 | New: A/1; RGD: B/2/A
Bin 12: no, --, yes, --, yes, --, yes, no | Flag: DIF_12 | New: A/1; RGD: B/1/A
Bin 13: no, --, yes, --, yes, --, yes | Flag: DIF_13 | New: A/1; RGD: B/1/A; RGD: C/2/A
37
Complete Bulkdata Pipeline Process Diagram C. Fan
Input data -> Check data -> Preload data -> Load data
38
Database Object Relationships
[Diagram. Objects: sslps, genes, sequences, references, maps, qtls, traits, homologs, strains, phenotypes, diseases, ESTs]
39
RGD Schema Diagram: 54 Tables, 10 Views
40
RGD Schema Word Document
41
RGD Database Technologies
Platforms
- Database server: Oracle 8.1.6
- Sun Solaris 2.8 Unix operating system
- Sun Enterprise 450's
Programming Language
- Perl 5
Object-oriented Methodology
- Database: object-based schema
- Perl modules: object based and globally used across systems
  - DB.pm module
  - PRELOAD.pm module
  - LOAD.pm module
Schema Documentation
- Rational Rose 2000 Enterprise
42
Bulk Data Database Schema
43
Review Quality Control Reports
44
Review
45
Review
46
Review
47
Validation
48
RGD Data Flow
[Diagram. Labels: Internal Systems; Public System; owner_1; Production; Cur_1; Curation; Owner_2; dev_1; Development; dorado; fuxi; alps; rgd.mcw.edu; dss; Curation data; Online (Strains, References, Nomenclature, Gene editing, Ontologies (rgdtogo.txt), Notes); Bulkdata; All objects; Bulk Data (Test-data); Online (Genes, QTLs, Strains); Object Templates; Text, tab delimited; Modify flags; Bulk Data (Production-load); 1st; 2nd; Modify flags; Templates (Homologs, Strains, Genes, QTLs, SSLPs, ESTs, Map Data)]
49
Blast Result Scenarios
50
Check for LocusLink, Swiss-Prot, RatMap IDs
Columns: Bin Number | New symbol matches RGD symbol | LL/SP/RM_ID in new file | LL/SP/RM_ID in specific RGD record | LL/SP/RM_ID matches specific RGD record | LL/SP/RM_ID matches any RGD
51
Checks for Sequence
- Sequence must be over 95% aligned
- Forward and reverse primers must be over 95% aligned
- Ratio of aligned bp / length of query sequence >= 95%
- Ratio of length1 (shorter seq) / length2 (longer seq) >= 90%
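The two ratio thresholds above translate directly into a pass/fail test on an alignment result. The following is a minimal sketch under assumptions: the function name `passes_sequence_check` and its parameters are hypothetical, the aligned base-pair count is taken as already extracted from a BLAST report, and the boundary is treated as inclusive (>=) to follow the ratio lines on the slide.

```python
def passes_sequence_check(aligned_bp: int, query_len: int, subject_len: int,
                          min_aligned: float = 0.95,
                          min_length_ratio: float = 0.90) -> bool:
    """Apply the two ratio thresholds from the slide to one alignment.

    aligned_bp  -- number of aligned base pairs
    query_len   -- length of the query (new) sequence
    subject_len -- length of the matched RGD sequence
    """
    if query_len <= 0 or subject_len <= 0:
        raise ValueError("sequence lengths must be positive")
    aligned_ratio = aligned_bp / query_len
    shorter, longer = sorted((query_len, subject_len))
    length_ratio = shorter / longer
    return aligned_ratio >= min_aligned and length_ratio >= min_length_ratio


# 96 of 100 bp aligned against a 105 bp sequence: 0.96 and about 0.95.
print(passes_sequence_check(96, 100, 105))
# Fully aligned, but the matched sequence is twice as long: length ratio 0.5.
print(passes_sequence_check(100, 100, 200))
```

The length ratio guards against a short query that aligns perfectly inside a much longer sequence, a case the aligned-fraction test alone would accept.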
52
Review
53
Review Pipeline’s QC Checks
54
Before Load
55
Check Alias
- New alias type matches RGD alias types
- New alias is same as new symbol
- New alias matches any alias in RGD and same alias type
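The three alias checks can be sketched as one function that reports which conditions hold for a new alias. Everything named here is hypothetical: `check_alias`, the finding labels, and the example alias types are illustrative and are not taken from RGD.

```python
from typing import List, Set, Tuple


def check_alias(alias: str, alias_type: str, new_symbol: str,
                rgd_alias_types: Set[str],
                rgd_aliases: Set[Tuple[str, str]]) -> List[str]:
    """Return the findings for one new alias.

    rgd_alias_types -- alias types known to RGD
    rgd_aliases     -- (alias, alias type) pairs already stored in RGD
    """
    findings = []
    if alias_type not in rgd_alias_types:
        findings.append("UNKNOWN_ALIAS_TYPE")    # type is not an RGD alias type
    if alias == new_symbol:
        findings.append("ALIAS_EQUALS_SYMBOL")   # alias repeats the new symbol
    if (alias, alias_type) in rgd_aliases:
        findings.append("ALIAS_IN_RGD")          # same alias and type already in RGD
    return findings


types = {"old_symbol", "old_name"}   # hypothetical alias types
stored = {("Abc", "old_symbol")}     # hypothetical stored alias
print(check_alias("Abc", "old_symbol", "Abc1", types, stored))
```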