IITB - Bioinformatics Workshop
Indexing Genome Sequences
Srikanta B. J.
Database Systems Lab (DSL), Indian Institute of Science

Background
- Sequences
  - DNA (Deoxyribonucleic Acid)
  - Proteins
- Similarity of sequences
  - The extent to which nucleotide or protein sequences are related
  - Percent sequence identity and/or conservation

Genome Sequence Analysis
- Hypothesize
  - Function of Proteins
  - Phylogenetic trees
  - Causes of Diseases
- First step in unraveling the mystery of Life!
- Sequence Similarity -> Structural Similarity -> Functional Similarity

Sequence Similarity
- Alignment
  - Between two sequences, S1 & S2 (perhaps of unequal length)
  - Insert spaces into, or at the ends of, S1 (S2)
  - Place them so that every character or space in either string is opposite a unique character/space in the other
  - E.g.,
      q a c - d b d
      q a w x - b -
- Global & Local Alignments

Alignment
- Global
  - Given two sequences, find the best alignment over their full length
  - E.g., between (agtcacaaaact, actcgga)
    [Figure: global alignment pairing all 12 columns of the two sequences]
- Local
  - Look for islands of high similarity
  - E.g., between (agtcacaaaact, actcgga)
    [Figure: local alignment pairing a 3-column island of the two sequences]
- O(mn) with Dynamic Programming, for sequences of lengths m and n
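
The O(mn) dynamic programs behind global and local alignment can be sketched as below. This is a minimal illustration, not the workshop's code: the function names and the +1/-1/-1 match, mismatch and gap values are assumptions chosen for readability, and only the best score is returned (no traceback).

```python
def global_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best global alignment score (Needleman-Wunsch style), O(mn) time.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    # Row 0: aligning the empty prefix of s against t[:j] costs j gaps.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # s[:i] against the empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman style), O(mn) time."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start anywhere ("islands").
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only structural difference between the two is the clamp at zero and taking the maximum over all cells, which is what allows the local variant to ignore poorly matching flanks.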

Scoring the Alignments
- Scoring Schemes
  - Value for aligning character x against character y
  - Provided as a scoring matrix over the alphabet
  - E.g., BLOSUM, PAM, DNA-BLAST (+5 for match, -4 for mismatch)
- Optimizing alignments
  - E.g., Edit Distance Scoring Scheme: cost 1 per insert, cost 1 per delete, 0 otherwise
    => edit_distance(surgery, surgeon) = 4
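
The edit-distance example can be checked with a short dynamic program. The sketch below assumes the scheme is read as costs (1 per insert, 1 per delete, 0 for a match) with no substitution operation; `indel_distance` is a hypothetical name introduced here.

```python
def indel_distance(s, t, insert_cost=1, delete_cost=1):
    """Minimum total cost to turn s into t using only inserts and deletes.

    Assumes matches cost 0 and substitutions are not allowed, so a
    mismatched pair must be handled as one delete plus one insert.
    """
    prev = [j * insert_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * delete_cost]
        for j in range(1, len(t) + 1):
            options = [prev[j] + delete_cost, curr[j - 1] + insert_cost]
            if s[i - 1] == t[j - 1]:
                options.append(prev[j - 1])  # match, cost 0
            curr.append(min(options))
        prev = curr
    return prev[-1]


print(indel_distance("surgery", "surgeon"))  # 4
```

The result is 4 because the shared prefix "surge" is free, "r" and "y" are deleted, and "o" and "n" are inserted. Under a scheme that also allows substitution at cost 1 (Levenshtein distance), the same pair would be at distance 2.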

Search Process
- Given a sequence to be studied
  - Want all similar (global/local) known sequences
- Collections of sequences
  - NCBI-GenBank, SwissProt etc.
  - Contain millions of sequences

State of the art
- Dynamic Programming
  - Slow but accurate
  - Never misses a significant alignment
- FastA
  - Faster than Dynamic Programming
  - Uses statistical heuristics
  - Reduced sensitivity: false dismissals
- BLAST
  - Fastest and popular
  - Lower sensitivity than FastA
  - Requires whole database in memory!

BLAST - on $1,000 Budget!
- BODHI experience [DSL, 2001]
  - ~51,000 DNA sequences in database
- CAFÉ experience [Williams and Zobel, 2001]
  - ~120,000 DNA sequences in memory
- Time (seconds/BLAST): 10.6 seconds / BLAST

NCBI GenBank Growth
- Doubles every 13 months
- In 1998, an estimated 40,000 sequence similarity queries per day
  - That was 3 years ago!!
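
To get a feel for what a 13-month doubling time means over the three years mentioned above, the growth factor can be computed directly. This is only back-of-the-envelope arithmetic that assumes steady exponential growth; `growth_factor` is a hypothetical helper.

```python
def growth_factor(months, doubling_months=13):
    """Multiplicative growth after `months`, assuming steady doubling."""
    return 2 ** (months / doubling_months)


# Three years at a 13-month doubling time: roughly 6.8 times larger.
print(round(growth_factor(36), 1))
```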

We Need Indexes for Sequence Similarity Searching NOW!!

Indexed Searching
- Inverted Indexes
  - RAMdb [Fondrat and Dessen, 1995]
  - CAFÉ [Williams and Zobel, 2001]
  - FLASH [Califano and Rigoutsos, 1993]
- Multi-Dimensional Indexes
  - MRS-indexing [Kahveci and Singh, 2001]
- Persistent Prefix Tree [Hunt et al., 2001]

RAMdb (Rapid Access Motif db)
- Each sequence in the repository is indexed by its constituent overlapping subsequences
    ACTC -> Seq1, seq2,…
    CTCG -> Seq1, seq4,…
- Pro: fold speedup over Dynamic Programming
- Con: Prohibitive index size
- Con: No ranking (goodness) of alignments
- Con: False dismissals
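
The idea of indexing each sequence by its overlapping subsequences can be sketched as an inverted index from fixed-length substrings to sequence identifiers. This is a generic illustration, not RAMdb's actual data structure; the window length k=4, the function names and the toy sequences are assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequences, k=4):
    """Map every overlapping length-k substring to the ids containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def candidates(index, query, k=4):
    """Count, per sequence id, how many query k-mers hit the index."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for seq_id in index.get(query[i:i + k], ()):
            hits[seq_id] += 1
    return dict(hits)


# Toy data (hypothetical), chosen to mirror the ACTC / CTCG postings above.
db = {"seq1": "ACTCG", "seq2": "GACTC", "seq4": "CTCGA"}
index = build_kmer_index(db)
print(sorted(index["ACTC"]))  # ['seq1', 'seq2']
print(sorted(index["CTCG"]))  # ['seq1', 'seq4']
```

The sketch also makes two of the listed drawbacks visible: the index stores one posting per distinct k-mer per sequence, so it can grow well beyond the data itself, and a sequence that shares no exact k-mer with the query is never returned (a false dismissal).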

CAFÉ
- Partitioned Search
  - Coarse searching with compressed inverted index
  - Fine searching in small fraction of database, with ranking
- Pro: 14-fold speedup over BLAST
- Pro: Compression reduces the index size
- Con: Distant sequence relationships are lost
- Con: Lower retrieval effectiveness
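
The coarse-then-fine pattern can be illustrated with a small sketch. It is not CAFÉ's implementation: index compression is not modelled, the coarse rank is simply the number of shared k-mers, the fine step is a plain local alignment score, and every name and parameter below is an assumption.

```python
from collections import Counter, defaultdict


def build_index(db, k=3):
    """Uncompressed inverted index: k-mer -> set of sequence ids."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def coarse_search(index, query, k=3, top_n=2):
    """Coarse phase: shortlist the ids sharing the most k-mers with the query."""
    shared = Counter()
    for i in range(len(query) - k + 1):
        for seq_id in index.get(query[i:i + k], ()):
            shared[seq_id] += 1
    return [seq_id for seq_id, _ in shared.most_common(top_n)]


def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Fine phase scorer: best local alignment score by dynamic programming."""
    best, prev = 0, [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def partitioned_search(db, index, query, k=3, top_n=2):
    """Align only the shortlisted sequences and rank them by score."""
    shortlist = coarse_search(index, query, k, top_n)
    return sorted(((local_score(query, db[sid]), sid) for sid in shortlist),
                  reverse=True)


# Toy database (hypothetical).
db = {"a": "ACGTACGT", "b": "TTTTTTTT", "c": "ACGTTTTT"}
print(partitioned_search(db, build_index(db), "ACGTAC"))
```

Because the expensive alignment runs only on the shortlist, the cost is dominated by index lookups. Any sequence the coarse phase fails to shortlist is never aligned, which is one way distant relationships can be lost.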

MRS-Indexing
- Uses progressive wavelet coefficients to represent a sequence

MRS-Indexing (contd.)
- Builds a hierarchy of multi-dimensional indexes
- Con: Only for edit distances, no general scoring schemes
- Con: Not suited for average DNA/protein query lengths

Summary
- Rapid growth in sequence databases
- Existing algorithms do not scale
- Indexed approach to sequence similarity is necessary
- Improvements needed in indexed searching methods