A Study of Computational Methods for Storing and Sequencing Genetic Databases CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03
Abstract
- Scope of study (i.e. the aspects of genetic databases covered)
- Types of genetic databases
- Storage, organization, access, and manipulation techniques
- Sequencing (querying) of data in genetic databases
- Logical layout of genetic databases
Brief Introduction
- The Human Genome Project (and others) has produced a vast amount of biological data
- Joint venture of computer science and biology (BCB) -> genetic databases (map, genomic, proteomic)
- Expected date of the completed map of the human genome: end of 2003
- Next stage: sequence comparison and sequence-to-protein function
- Useful to pharmaceutical companies for computer-aided drug design (CADD), e.g. GlaxoSmithKline's Relenza
Results - Sequence
- Current sequence generation technologies
  - Maxam-Gilbert: uses chemicals to cleave DNA at specific bases, so that fragment lengths mark the base positions
  - Sanger: uses enzymatic synthesis that terminates at specific bases, so that fragment lengths again mark the base positions
Derivation of nucleotide sequence from human chromosome
Results - Sequence
- Types of sequence comparisons/alignments
  - Global (“How similar are these two sequences?”)
    - Finds the best overall alignment between two sequences
    - 1970: Needleman and Wunsch (global, dynamic programming)
    - Shortcoming: can miss small regions of similarity within the two sequences
  - Local (“What sequences in a database are most similar to this sequence?”)
    - Finds the best subsequence match between two sequences
    - 1981: Smith and Waterman (local, dynamic programming)
    - Shortcoming: computationally expensive, hence slow
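To make the global/local contrast concrete, here is a minimal sketch of local alignment scoring in the style of Smith and Waterman. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not values from the slides. The key difference from a global alignment is that every cell is floored at 0, so an alignment may start and end anywhere.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best
```

Because the whole (len(a)+1) x (len(b)+1) table is filled, the cost grows with the product of the two lengths, which is the inefficiency noted in the slide when one sequence is an entire database.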
Results - Sequence
- Heuristic search (quick, approximate)
  - Quickly searches for “words” that match the sequence, then performs a local search around each matched word, repeating until no further matches are found
  - FASTA (1988), BLAST (1990)
  - Shortcomings: approximate, not exact; significance is judged by the E-value (significant if < 0.05)
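The "word" search step can be sketched as follows. This is a simplified illustration that looks only for exact word matches. The function names and the word length k=3 are assumptions made for the example, and real tools use more elaborate word-matching rules.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Map every length-k word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = build_word_index(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits
```

Each hit is only a candidate: the speed comes from running the more expensive local search around these hits alone rather than over the full comparison matrix.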
Results – Sequence (CSC Implementation)
- Sequence alignment can be represented as matrices and graphs (using rules and costs)
- When the problem is converted into a directed acyclic graph, the solution of the sequence alignment is the longest path (maximum-path problem)
Results – Sequence (CSC Implementation)
Needleman-Wunsch Dynamic Program
- Diagonal edge = the two characters are aligned (match or mismatch); down edge = gap in string 2; across edge = gap in string 1
- Can be solved dynamically as a ‘running max score’ (RMS)
- For each cell D(i,j), best RMS = max(west + gap1, north + gap2, NW + current_score)
- Replace D(i,j) with this maximum
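The recurrence above can be sketched in a few lines. This is a minimal illustration, assuming a single gap penalty for both strings (the slide distinguishes gap1 and gap2) and simple match/mismatch scores. All names and scoring values are assumptions made for the example.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Globally align s1 and s2; return (score, aligned_s1, aligned_s2).

    Rows follow s1 and columns follow s2. Scoring values are illustrative.
    """
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap              # s1 prefix aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * gap              # s2 prefix aligned against gaps only

    def score(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i][j - 1] + gap,              # west: gap in string 1
                D[i - 1][j] + gap,              # north: gap in string 2
                D[i - 1][j - 1] + score(i, j),  # NW: align the two characters
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + score(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(a1)), "".join(reversed(a2))
```

For example, `needleman_wunsch("ACGT", "AGT")` returns `(2, "ACGT", "A-GT")`: three matches and one gap in string 2.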
Results – Sequence (CSC Implementation)
BLAST Implementation (
- Similar to Smith-Waterman
- Differences:
  - Restricts the RMS: extension is discontinued if the score falls below 0 after several iterations
  - At each iteration, saves the maximum for each cell separately rather than replacing it, then traces back through the maximum scores to obtain the best local alignment
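One way to read the "discontinue if < 0" rule is as an ungapped extension of a word hit in both directions, keeping the best score seen and stopping once the running score drops below zero. The sketch below follows that reading. It is an interpretation of the slide, not the actual BLAST code, and the function name, parameters, and scoring values are assumptions.

```python
def extend_hit(query, subject, q_pos, s_pos, k, match=1, mismatch=-1):
    """Extend an exact length-k word hit without gaps, in both directions.

    Returns (score, q_start, q_end), where query[q_start:q_end] is the
    extended match. Scoring values are illustrative assumptions.
    """
    def extend(q, s, step, start_score):
        running = best = start_score
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += match if query[q] == subject[s] else mismatch
            length += 1
            if running < 0:          # running max score went negative: stop
                break
            if running > best:       # remember the best extension so far
                best, best_len = running, length
            q += step
            s += step
        return best, best_len

    seed_score = k * match
    # Extend to the right of the seed, then to the left, carrying the score.
    score, right_len = extend(q_pos + k, s_pos + k, +1, seed_score)
    score, left_len = extend(q_pos - 1, s_pos - 1, -1, score)
    return score, q_pos - left_len, q_pos + k + right_len
```

Keeping the best score separately from the running score mirrors the slide's point about saving the maximum rather than replacing it: the extension may wander through mismatches, but only the highest-scoring span is reported.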
Results - Storage
- EMBL Nucleotide Sequence Database (on Oracle)
  - Scale: over 130 tables and 140 relationships (80 GB of data)
  - Object-oriented organization with 5 related packages
  - Operations return attribute types, which supports on-demand object creation
  - ‘Live object cache’: the most frequently accessed instances in the DB are copied into a cache keyed by primary key, and queries are performed against this cache
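The idea of a live object cache with on-demand object creation can be sketched as below. This is a hypothetical illustration, not the EMBL implementation: the class name, the `loader` callback standing in for a database query, and the least-recently-used eviction policy are all assumptions.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keep recently used objects in memory, keyed by primary key.

    `loader` is a hypothetical callback that builds an object from the
    database on demand; LRU eviction is an assumed policy.
    """

    def __init__(self, loader, capacity=1000):
        self._loader = loader
        self._capacity = capacity
        self._objects = OrderedDict()

    def get(self, primary_key):
        if primary_key in self._objects:
            # Cache hit: mark as most recently used and skip the database.
            self._objects.move_to_end(primary_key)
            return self._objects[primary_key]
        # Cache miss: create the object on demand and remember it.
        obj = self._loader(primary_key)
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            self._objects.popitem(last=False)  # evict least recently used
        return obj
```

Usage would look like `cache = LiveObjectCache(load_entry)` followed by `cache.get(key)`, where `load_entry` is whatever function runs the real query. Repeated lookups of the same key then never reach the database.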
Results - Storage
- 5 EMBL packages:
  - Sequence Info – general information on the biological sequence
  - Feature Info – sequence annotations/comments
  - Reference Info – bibliographic references on the sequence
  - Taxonomy Info – taxonomy of the organism the sequence comes from (i.e. kingdom, phylum, family, genus, species, etc.)
  - Location Info – location of the sequence on the DNA/RNA
Results – Storage (General Relation Between the 5 Packages)
Results – Storage (Sequence Info)
Results – Storage (Feature Info)
Results – Storage (Reference Info)
Results – Storage (Taxonomy Info)
Results – Storage (Location Info)
Conclusion
- Genetic databases (3 main types) are essential to store, manage, and query the massive biological data from studies like the HGP
- Object-oriented design and data organization
- Sequence analysis: global (N-W), local (S-W), heuristic (FASTA, BLAST)
Conclusion - Future Enhancements
- Storage/management: highly dependent on progress in the hardware industry
- Sequence analysis:
  - Use of parallel programming for faster analysis of 2 sequences (BLAZE-Stanford)
  - Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a family)
Any Questions?