Curator Meeting: 10/27/03 Integration of New Data into RGD: Quality Control and Data Submission Tools Rat Genome Database Bioinformatics.


1 Curator Meeting: 10/27/03 Integration of New Data into RGD: Quality Control and Data Submission Tools Rat Genome Database http://rgd.mcw.edu Bioinformatics Research Center Medical College of Wisconsin, Milwaukee, USA

2 Curator Meeting: 10/27/03 RGD Pipeline Background
RGD is a relatively new MOD (model organism database)
  Needed to integrate large amounts of historic data
  Curation staff is limited
Developed near the beginning of the RGD Project
  Efficient methods to evaluate and integrate data
Informatic methods were chosen to address the problems
  Catch up with historic data
  Achieve good productivity with limited staff
Modular design
  New types of data
  New QC checks and methods

3 Curator Meeting: 10/27/03 Data Sources
Sources feeding RGD 2.0:
  RatMap, Goteborg (Sweden): Genes, markers, QTLs
  MGD, Jackson Labs: Markers, Strains, Genes
  RHdb, EBI (UK): Markers, Primers
  WI/MIT: Markers, Genetic Map
  NCBI: LocusLink, RefSeq, UniGene, etc.
  Otsuka (Japan): SSLPs
  U. Iowa: ESTs, RH Map
  NIAMS ARB: Maps and SSLPs
  MCW: All Objects
  MCO: SSLPs
  Baylor (HGSC): Sequences
  RGD: Literature

4 Curator Meeting: 10/27/03 Bulk Data Pipeline in the Curation Process
[Diagram labels]
Data Sources: Databases, Literature, Websites
External Data: Informatic data mining, Data Pipeline
Internal Data: Regular Journal Screening and Curation
RGD Database: Ongoing Data Curation

5 Curator Meeting: 10/27/03 RGD Objects
RGD stores information about 11 fundamental data types (Objects)
1. Genes
2. Strains
3. QTLs
4. Traits
5. Sequences
6. ESTs
7. Maps
8. SSLPs
9. References
10. Homologs
11. Phenotypes

6 Curator Meeting: 10/27/03 Relationships between RGD Objects
Genes -> Genes, ESTs, SSLPs, and QTLs
ESTs -> Genes
SSLPs -> Genes and Strains
QTLs -> Genes, Traits, Strains
Traits -> QTLs
Maps -> Maps Data
Maps Data -> any RGD object
References -> any RGD object
Homologs -> any RGD object
Strains -> any RGD object
Sequences -> any RGD object
Phenotypes -> any RGD object

7 Curator Meeting: 10/27/03 RGD object Templates

8 Curator Meeting: 10/27/03 Internal Data Sources QC functionality on data entry forms

9 Curator Meeting: 10/27/03 Curation Annotations Notes Editor

10 Curator Meeting: 10/27/03 Data Entry Summary Page

11 Curator Meeting: 10/27/03 Edit Record in Submission Database

12 Curator Meeting: 10/27/03 RGD Data Flow
[Diagram labels]
Systems: Internal Systems; Public System (rgd.mcw.edu)
Names: Production (owner_1); Curation (Cur_1); Owner_2; dss
Curation data: Online (Genes, QTLs, Strains)
Bulkdata: All objects; Bulk Data (Production-load)
QC

13 Curator Meeting: 10/27/03 QC checks in the Data Flow
BD Pipeline Database
  keep all raw data
  format the data
  track all checking flags
  track all loading status
Steps: Input raw data -> Check data -> Preload data -> Load data
Other diagram labels: RGD Database; Blasting results
Web-based interface to view all processing status

14 Curator Meeting: 10/27/03 QC Process Overview
Incoming Dataset
Level One: Integrity Checking
  Internal checking (blast/symbol)
Level Two: Identity Checking
  Blast against RGD database
  Check for identity conflicts
    Check symbol
    Check sequence via GB ID
    Check sequence via BLAST
    Check alias
  Conflict data files for curation review
Level Three: Attribute Checking
  Preload: check for any attribute conflicts
  Load: values without conflicts -> RGD database
  Conflict data files for curation review
Curators to review flags

15 Curator Meeting: 10/27/03 Examples of Checks
New symbol matches an RGD symbol
New symbol matches an alias in RGD
New record has a GBID
New GBID matches the RGD record
New GBID matches GBID of alias gene
New GBID matches any other RGD record
New Sequence matches any RGD
Every attribute value compared to RGD values

16 Curator Meeting: 10/27/03 Review Pipeline’s QC Checks

17 Curator Meeting: 10/27/03 Review Conflicts

18 Curator Meeting: 10/27/03 Excel Summary Report
Conflict Data Report lists the bin ID for data that requires further curation (BLAST/BLAT analysis)

19 Curator Meeting: 10/27/03 Conflict Data Discovered by the Bulk Data Pipeline
Nomenclature conflicts
  Symbols were incorrect
Sequence conflicts
  Sequence reads were unacceptable due to poor quality (many N's)
  Primers were switched
  Sequences in the dataset were associated with different objects in RGD
Alias conflicts
  Dataset aliases were RGD objects
  Dataset symbols were in RGD aliases
Attribute conflicts
  Chromosomes were different in RGD
  Cytological positions were different in RGD
  Expected sizes of PCR products were different in RGD
Redundant data conflicts
  Datasets had duplicate entries

20 Curator Meeting: 10/27/03 Curation of Conflicting Data
Checking processes find conflicting data
Manual curation to resolve conflicts
  Nomenclature, Sequence, Alias symbols, Attributes, Redundant records
Resolvable -> Curated data -> Load into RGD (Over-write current data)
Irresolvable -> Removed data -> Store data in file (Notify source)

21 Curator Meeting: 10/27/03 After Load

22 Curator Meeting: 10/27/03 Acknowledgements
Principal Investigators
  Howard Jacob, Peter Tonellato, Simon Twigger
RGD Bioinformatics
  Dean Pasko, Jiali Chen, Lan Zhao, Henry Fan, Wenhua Wu, Jian Lu, Hanping Long
RGD Curation
  Mary Shimoyama, Susan Bromberg, Rajni Nigam, Chin-fu Chen, Gopal Gopinathrao, Charles Wang, Victoria Petri, Dorothy Reilly, Cindy Foote, Angela Zuniga-Meyer, Nataliya Nenasheva

26 Model Organism Bulk Data Processing Work Flow

27 Curator Meeting: 10/27/03
| Case Number | Case Description | Expected Result | Note |
| 1 | New record has an RGD_ID, and both symbol and RGD_ID of new record match an active RGD record | Set symbol flag to "IN_RGD_1" | The process will continue through the GenBank ID check |
| 2 | New record has an RGD_ID, and new symbol matches an active RGD symbol, but new RGD_ID does not match the RGD_ID of matching symbol | Set symbol flag to "DIF_RGD_ID" | The process will continue through the GenBank ID check |
| 3 | New record has an RGD_ID, but new symbol does not match an active RGD symbol | Set symbol flag to "DIF_SYMBOL" | The process will continue through the GenBank ID check |
| 4 | New record does not have an RGD_ID, but new symbol matches an active RGD symbol | Set symbol flag to "IN_RGD_2" | The process will continue through the GenBank ID check |
| 5 | New record does not have an RGD_ID and new symbol does not match an active RGD symbol | Set symbol flag to "NEW" | The process will continue through the GenBank ID check |
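The five symbol-check cases above can be sketched as a small decision function. This is only an illustration in Python, not the pipeline's actual Perl code: the function name, the dictionary of active symbols, and exact case-sensitive symbol matching are all assumptions. Only the flag names come from the slide.

```python
from typing import Dict, Optional


def symbol_flag(new_symbol: str,
                new_rgd_id: Optional[int],
                active_symbols: Dict[str, int]) -> str:
    """Return the symbol flag for cases 1-5 on the slide.

    active_symbols is a hypothetical lookup of active RGD symbol -> RGD_ID.
    Matching is exact and case-sensitive, which is an assumption.
    """
    symbol_in_rgd = new_symbol in active_symbols
    if new_rgd_id is not None:
        if not symbol_in_rgd:
            return "DIF_SYMBOL"            # case 3
        if active_symbols[new_symbol] == new_rgd_id:
            return "IN_RGD_1"              # case 1
        return "DIF_RGD_ID"                # case 2
    # The new record carries no RGD_ID.
    return "IN_RGD_2" if symbol_in_rgd else "NEW"   # cases 4 and 5


if __name__ == "__main__":
    rgd = {"Abc1": 1001, "Xyz2": 1002}     # made-up symbols and IDs
    print(symbol_flag("Abc1", 1001, rgd))  # case 1
    print(symbol_flag("Abc1", None, rgd))  # case 4
```

In every one of these cases the record would then continue to the GenBank ID check, so the function returns only the flag and does not stop processing.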

28 Curator Meeting: 10/27/03
| Case Number | Case Description | Expected Result | Note |
| 6 | New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene | Change symbol flag to "IN_RGD_UPDATED" | Change current symbol flag after changing symbol (GBID check) and continue through GenBank ID check |
| 7 | New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; new GBID matches a record in RGD and there is no alias matching that symbol; new symbol matches a retired or withdrawn symbol | Change symbol flag to "DIF_NON_ACTIVE_1" | Change current symbol flag and continue through GenBank ID check |
| 8 | New record does not have an RGD_ID and new symbol does not match an active RGD symbol; new record does not have a GBID; new symbol matches a retired or withdrawn symbol | Change symbol flag to "DIF_NON_ACTIVE_2" | Change current symbol flag and continue through GenBank ID check |

29 Curator Meeting: 10/27/03
| Case Number | Case Description | Expected Result | Note |
| 9 | New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does match another GBID/seq in RGD | Set flag to "DIF_9:RGD_ID" | This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review. Data without GBID match will be compared by BLAST after PRELOAD step in the pipeline, but NOT loaded until after curation review. RGD_ID is the value that is associated with the GBID/seq in RGD |
| 10 | New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does not match any GBID/seq in RGD | Set flag to "DIF_10" | This case will be compared by BLAST after PRELOAD step in pipeline, but NOT loaded until after curation review. |
| 11 | New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD but new GBID doesn't match the GBID of the gene associated with the matching alias | Set flag to "DIF_11" | This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review |

30 Curator Meeting: 10/27/03 New Check Aliases Use Case Diagram C. Fan

31 Curator Meeting: 10/27/03 New Check Gene Symbol Use Case Diagram

35 Curator Meeting: 10/27/03
| Case Number | Case Description | Expected Result | Note |
| 12 | New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene | Set flag to "IN_RGD_2" | The new symbol is changed to the RGD gene symbol of the gene associated with the matching alias and the data is loaded |
| 13 | New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, but alias is associated with more than one gene | Set flag to "DIF_13" | This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review |
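The alias branch of the check (cases 11 to 13) can be illustrated with a short Python sketch. The `Gene` class, the `alias_index` lookup, and the function name are hypothetical, and the sketch assumes the caller has already established that the new symbol matches no active RGD symbol and that the new GBID exists somewhere in RGD. Flag names follow the case descriptions on these slides.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical stand-in for an RGD gene record."""
    symbol: str
    gbids: FrozenSet[str]


def alias_case(new_symbol: str,
               new_gbid: str,
               alias_index: Dict[str, List[Gene]]) -> Tuple[Optional[str], str]:
    """Return (flag, symbol to use) for a symbol that matched no active symbol.

    alias_index maps an alias to the genes that carry it (assumed structure).
    """
    genes = alias_index.get(new_symbol, [])
    if not genes:
        return None, new_symbol            # no alias match: outside cases 11-13
    if not any(new_gbid in g.gbids for g in genes):
        return "DIF_11", new_symbol        # alias found, GBID belongs elsewhere
    if len(genes) > 1:
        return "DIF_13", new_symbol        # alias is shared by several genes
    # Case 12: one gene, GBID agrees; adopt that gene's RGD symbol.
    return "IN_RGD_2", genes[0].symbol


if __name__ == "__main__":
    index = {"A": [Gene("B", frozenset({"1"}))]}
    print(alias_case("A", "1", index))     # new symbol A is replaced by B
```

Only the case 12 outcome would be loaded directly; the DIF_11 and DIF_13 outcomes would wait for curation review, as the slides describe.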

36 Curator Meeting: 10/27/03 Cases for GenBank ID check
Check columns: Symb. match; GBID match specific RGD record; GBID in new file; GBID in RGD; GBID match any RGD; Seq match any RGD (BLAST); Symb. match alias; GBID match GBID of alias gene; Alias of more than one gene
| Bin # | Check results | Flag | Symbol/GBID/Alias |
| 1 | yes, --, no, yes, -- | DIF_1 | New: A/-; RGD: A/1 |
| 2 | yes, -- | IN_RGD_1 | New: A/1; RGD: A/1 |
| 3 | yes, no, yes, no, -- | DIF_3 | New: A/1; RGD: A/2 or –; RGD: B/1 |
| 4 | yes, no, yes, yes or no, yes, -- | DIF_4:RGD_ID | New: A/1; RGD: A/2; RGD: B/1 |
| 5 | no, --, no, -- | DIF_5 | New: A/-; RGD: B/2 or - |
| 6 | no, --, yes, --, no, -- | NEW | New: A/1; RGD: B/2 or - |
| 7 | no, --, yes, yes or no, yes, no, -- | DIF_7:RGD_ID | New: A/1/C; RGD: B/1 |
| 8 | yes, --, no, -- | DIF_8 | New: A/-; RGD: A/- |
| 9 | yes, --, yes, no, yes or no, yes, -- | DIF_9:RGD_ID | New: A/1; RGD: A/- |
| 10 | yes, --, yes, no, -- | DIF_10 | New: A/1; RGD: A/- |
| 11 | no, --, yes, --, yes, --, yes, no, -- | DIF_11 | New: A/1; RGD: B/2/A |
| 12 | no, --, yes, --, yes, --, yes, no | DIF_12 | New: A/1; RGD: B/1/A |
| 13 | no, --, yes, --, yes, --, yes | DIF_13 | New: A/1; RGD: B/1/A; RGD: C/2/A |

37 Curator Meeting: 10/27/03 Complete Bulkdata Pipeline Process Diagram (C. Fan)
Input data -> Check data -> Preload data -> Load data

38 Curator Meeting: 10/27/03 Database Object Relationships
[Diagram labels] sslps, genes, sequences, references, maps, qtls, traits, homologs, strains, phenotypes, diseases, ESTs

39 Curator Meeting: 10/27/03 RGD Schema Diagram 54 Tables 10 Views

40 Curator Meeting: 10/27/03 RGD Schema Word Document

41 Curator Meeting: 10/27/03 RGD Database Technologies
Platforms
  Database server: Oracle 8.1.6
  Sun Solaris 2.8 Unix operating system
  Sun Enterprise 450's
Programming Language
  Perl 5
Object-oriented Methodology
  Database - object based schema
  Perl modules - object based and globally used across systems
    DB.pm module
    PRELOAD.pm module
    LOAD.pm module
Schema Documentation
  Rational Rose 2000 Enterprise

42 Curator Meeting: 10/27/03 Bulk Data Database Schema

43 Curator Meeting: 10/27/03 Review Quality Control Reports

44 Curator Meeting: 10/27/03 Review

45 Curator Meeting: 10/27/03 Review

46 Curator Meeting: 10/27/03 Review

47 Curator Meeting: 10/27/03 Validation

48 Curator Meeting: 10/27/03 RGD Data Flow
[Diagram labels]
Systems: Internal Systems; Public System (rgd.mcw.edu)
Names: Production (owner_1); Curation (Cur_1); Owner_2; Development (dev_1); dorado; fuxi; alps; dss
Curation data: Online (Strains, References, Nomenclature, Gene editing, Ontologies (rgdtogo.txt), Notes); Online (Genes, QTLs, Strains)
Bulkdata: All objects; Bulk Data (Test-data); Bulk Data (Production-load); 1st; 2nd; Modify flags
Object Templates (text, tab delimited): Homologs, Strains, Genes, QTLs, SSLPs, ESTs, Map Data

49 Curator Meeting: 10/27/03 Blast Result Scenarios

50 Curator Meeting: 10/27/03 Check for LocusLink, Swiss-Prot, RatMap IDs
| Bin Number | New symbol matches RGD symbol | LL/SP/RM_ID in new file | LL/SP/RM_ID in specific RGD record | LL/SP/RM_ID matches specific RGD record | LL/SP/RM_ID matches any RGD |

51 Curator Meeting: 10/27/03 Checks for Sequence
Sequence must be over 95% aligned
Forward and Reverse primer must be over 95% aligned
Ratio of the aligned bp / length of query sequence >= 95%
Ratio of the length1 (short seq) / length2 (longer seq) >= 90%
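The two ratio thresholds on this slide can be written out as a small check. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and it treats both limits as inclusive (at least 95% and at least 90%), which is an assumption because the slide also says "over 95%".

```python
def passes_sequence_checks(aligned_bp: int,
                           query_len: int,
                           other_len: int,
                           min_aligned: float = 0.95,
                           min_length_ratio: float = 0.90) -> bool:
    """Apply the aligned-fraction and length-ratio thresholds from the slide.

    aligned_bp: number of aligned base pairs reported for the alignment
    query_len:  length of the query sequence
    other_len:  length of the RGD sequence it was compared with
    """
    if query_len <= 0 or other_len <= 0:
        return False                       # nothing meaningful to compare
    aligned_ratio = aligned_bp / query_len
    shorter, longer = sorted((query_len, other_len))
    length_ratio = shorter / longer
    return aligned_ratio >= min_aligned and length_ratio >= min_length_ratio


if __name__ == "__main__":
    print(passes_sequence_checks(96, 100, 105))   # both ratios clear the limits
    print(passes_sequence_checks(100, 100, 120))  # lengths differ too much
```

The same function could be called once for the sequence and once for each primer, since the slide applies the 95% alignment requirement to the forward and reverse primers as well.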

52 Curator Meeting: 10/27/03 Review

53 Curator Meeting: 10/27/03 Review Pipeline’s QC Checks

54 Curator Meeting: 10/27/03 Before Load

55 Curator Meeting: 10/27/03 Check Alias
New alias type matches RGD alias types
New alias is same as new symbol
New alias matches any alias in RGD and same alias type
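A minimal Python sketch of these three alias checks follows. The finding names (`UNKNOWN_ALIAS_TYPE`, `ALIAS_SAME_AS_SYMBOL`, `ALIAS_IN_RGD`), the function name, and the in-memory sets standing in for RGD tables are all hypothetical; the slide does not say which flags the pipeline sets for these checks.

```python
from typing import List, Set, Tuple


def check_alias(new_alias: str,
                new_alias_type: str,
                new_symbol: str,
                rgd_alias_types: Set[str],
                rgd_aliases: Set[Tuple[str, str]]) -> List[str]:
    """Run the three alias checks and return the findings that apply.

    rgd_alias_types: alias types known to RGD (assumed lookup)
    rgd_aliases:     (alias, alias type) pairs already stored in RGD (assumed)
    """
    findings = []
    if new_alias_type not in rgd_alias_types:
        findings.append("UNKNOWN_ALIAS_TYPE")      # type is not an RGD alias type
    if new_alias == new_symbol:
        findings.append("ALIAS_SAME_AS_SYMBOL")    # alias repeats the symbol
    if (new_alias, new_alias_type) in rgd_aliases:
        findings.append("ALIAS_IN_RGD")            # same alias and type exist
    return findings


if __name__ == "__main__":
    types = {"old_gene_symbol", "old_gene_name"}   # made-up alias types
    known = {("Abc", "old_gene_symbol")}
    print(check_alias("Abc", "old_gene_symbol", "Abc1", types, known))
```

Returning a list rather than a single flag keeps the checks independent, so a curator reviewing the record could see every finding at once.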

