A Study of GeneWise with the Drosophila Adh Region
Asta Gindulyte
CMSC 838 Presentation
Authors: Yi Mo, Moira Regelson, and Mike Sievers, Paracel Inc., Pasadena, CA

CMSC 838T – Presentation

Motivation
• Genome annotation
  - Extraction of biologically relevant knowledge from raw genomic sequence data
• Need faster genome annotation methods
  - DNA sequences are very long (millions of nucleotides)
  - Current methods are computationally too expensive
• Approach/Solution
  - GeneMatcher2 hardware acceleration of GeneWise

Outline
• Motivation
  - Genome annotation
• GeneMatcher2
  - Design
  - ASIC hardware
• Comparison
  - GeneWise algorithm
  - HalfWise algorithm
  - Performance (time, precision)
• Observations
  - Performance improvement
  - Cost effectiveness

Approach
• Problem: make GeneWise run faster
  - “Embarrassingly parallel” algorithm
  - Computationally too expensive when run in parallel on PCs
• Paracel’s solution: hardware acceleration
  - Don’t change the algorithm
  - Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible
  - 6LITE algorithm, now also in Wise2

GeneMatcher Architecture

ASIC Hardware
• ASIC – application-specific integrated circuit
  - Designed to speed up dynamic programming algorithms
    - (could be used for Smith-Waterman)
  - Each ASIC board has 3072 processors
  - System has up to 9 boards
  - Cost per board around $40K

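The slide names Smith-Waterman as a dynamic programming algorithm the ASIC could accelerate. The sketch below is a generic illustration of why such algorithms suit many small processors: every cell on one anti-diagonal of the score matrix depends only on earlier anti-diagonals, so those cells could in principle be computed at the same time. The slides do not say how the GeneMatcher2 ASIC schedules its work, so the traversal order, the function name `smith_waterman_score`, and the scoring parameters are assumptions made for illustration only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix one anti-diagonal at a time."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at 0
    best = 0
    # d = i + j indexes an anti-diagonal. Cells sharing the same d read only
    # from anti-diagonals d - 1 and d - 2, so they are mutually independent.
    for d in range(2, rows + cols - 1):
        first_i = max(1, d - cols + 1)
        last_i = min(rows - 1, d - 1)
        for i in range(first_i, last_i + 1):  # this loop is the parallel part
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                       # start a new local alignment
                H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,       # gap in b
                H[i][j - 1] + gap,       # gap in a
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACGT"))
```

The sequential version above still visits every cell once; the point is only that the inner loop has no dependencies between its iterations.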
GeneWise Algorithm
• Perform a search of genomic DNA sequence data using a protein HMM
  - Build HMMs from protein families
  - Scan genome using HMM
    - Look for start codon
    - “GT” sequence signals possible 5’ splice site
    - “AG” sequence signals possible 3’ splice site
  - Dynamic programming used in the scanning process
    - Obtain probability of the most likely path in HMM generating the sequence
    - Obtain alignment by backtracking

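The last two bullets (probability of the most likely path, then alignment by backtracking) describe what is commonly called the Viterbi algorithm. A minimal sketch follows, assuming a toy two-state model rather than the actual GeneWise HMM: the state names, the probabilities, and the function name `viterbi` are all hypothetical and chosen only to keep the example small.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (log-probability of the most likely state path, that path)."""
    # score[t][s]: log-probability of the best path ending in state s at position t
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + log_trans[p][s])
            score[t][s] = score[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    # Backtracking: start from the best final state and follow the pointers.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Toy model (assumption): a GC-rich "exon" state and an AT-rich "intron" state.
STATES = ("exon", "intron")
LOG_START = logs({"exon": 0.5, "intron": 0.5})
LOG_TRANS = {
    "exon": logs({"exon": 0.8, "intron": 0.2}),
    "intron": logs({"exon": 0.2, "intron": 0.8}),
}
LOG_EMIT = {
    "exon": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
    "intron": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
}

if __name__ == "__main__":
    log_p, path = viterbi("GGCCAATT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(log_p, path)
```

Working in log space avoids numerical underflow on long sequences, which matters for genomic inputs of the size mentioned in the Motivation slide.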
GeneWise model on GeneMatcher2

HalfWise Algorithm
• Reduce cost by running BLAST to select HMMs with possible hits
• Use these HMMs with GeneWise database search and sequence alignment algorithm
• May miss some genes due to BLAST misses

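The trade-off described above (a cheap filter first, the expensive search only on what survives) can be sketched with a toy two-stage pipeline. This is not BLAST or GeneWise: the k-mer filter, the ungapped scoring function, the family names, and the sequences are all hypothetical stand-ins, used only to show how a prefilter can discard a profile that the full search would have scored well.

```python
def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(profiles, target, k=4):
    """Cheap stage: keep profiles sharing at least one exact k-mer with the target."""
    target_kmers = kmers(target, k)
    return {name: cons for name, cons in profiles.items()
            if kmers(cons, k) & target_kmers}


def best_ungapped_matches(consensus, target):
    """Expensive stage (stand-in): most matching positions over all ungapped offsets."""
    best = 0
    for offset in range(len(target) - len(consensus) + 1):
        window = target[offset:offset + len(consensus)]
        best = max(best, sum(x == y for x, y in zip(consensus, window)))
    return best


# Hypothetical data: each "profile" is reduced to a single consensus string.
target = "MKTAYIAKQRQISFVKSHFSRQ"
profiles = {
    "famA": "AYIAKQ",   # exact substring of the target
    "famB": "SFVRSHF",  # one substitution away from a target substring
    "famC": "WWWWWW",   # unrelated
}

if __name__ == "__main__":
    survivors = prefilter(profiles, target)
    for name, cons in profiles.items():
        score = best_ungapped_matches(cons, target)
        print(name, score, "kept" if name in survivors else "filtered out")
```

In this toy setup `famB` matches the target at 6 of 7 positions, yet it shares no exact 4-mer with it and so never reaches the expensive stage. That is the kind of miss the third bullet warns about.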
Evaluation
• Test data set
  - A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region
  - Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence

Evaluation: Speed

Evaluation: Score

Evaluation: Sensitivity and Specificity

Observations
• Performance improvement
  - The speedup is several orders of magnitude.
    - Makes real target applications possible
  - Accuracy might be improved over HalfWise algorithm
• Cost effectiveness
  - System used costs around $500K
  - $500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower
• Weaknesses
  - Cannot modify the algorithm
  - Not enough data to assess scalability
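
The cost-effectiveness claim can be restated as a short calculation. The three input figures come from the slide. The linear-scaling step is an assumption added here, and the Weaknesses bullet on scalability suggests it should be treated with caution.

```python
# Figures quoted on the slide
system_cost_usd = 500_000   # approximate cost of the system used
pc_cost_usd = 1_000         # one Linux PC processor
pc_slowdown = 10            # an equal-cost PC cluster runs about 10 times slower

# Number of PC processors the same budget buys
pcs_for_same_cost = system_cost_usd // pc_cost_usd

# Assumption (not from the slides): throughput scales linearly with PC count,
# so matching the system's speed needs 10 times as many processors.
pcs_for_same_speed = pcs_for_same_cost * pc_slowdown
cluster_cost_for_same_speed_usd = pcs_for_same_speed * pc_cost_usd

print(pcs_for_same_cost, pcs_for_same_speed, cluster_cost_for_same_speed_usd)
```

Under that assumption, matching the accelerated system with commodity PCs would take about 5,000 processors, or roughly $5M.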