BNFO 615 Usman Roshan
Short read alignment Input: – Reads: short DNA sequences (up to a few hundred base pairs (bp)) produced by a sequencing machine. Reads are fragments of a longer DNA sequence present in the sample given as input to the machine, and there are usually millions of them. – Genome sequence: a reference DNA sequence much longer than the read length
Short read alignment [Figure: short reads from a single individual aligned to the human genome reference. Labeled examples: heterozygous SNP (ACCAG, ACCCG); heterozygous indel (ATT--A, ATTGA); homozygous SNP (ATTGA, ATTAA), encoded as 2; no variant reported (ACCAG). Example encoding: (2, 1, 0, 1).]
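The encoding on this slide (2 for a homozygous SNP, with 1 and 0 also appearing in the example vector) can be read as the number of copies that differ from the reference. A minimal sketch under that assumption follows; the function name `encode_genotype` and the pairing of sequences into two copies per site are illustrative choices, not taken from the slides.

```python
def encode_genotype(reference, copy1, copy2):
    """Count how many of the individual's two copies differ from the reference.

    Assumed interpretation: 0 = no variant, 1 = heterozygous, 2 = homozygous.
    Sequences are aligned strings; '-' marks a gap (indel).
    """
    return int(copy1 != reference) + int(copy2 != reference)


# Hypothetical sites modelled on the slide's examples.
sites = [
    ("ATTGA", "ATTAA", "ATTAA"),   # homozygous SNP (both copies assumed ATTAA)
    ("ACCAG", "ACCAG", "ACCCG"),   # heterozygous SNP
    ("ACCAG", "ACCAG", "ACCAG"),   # no variant
    ("ATTGA", "ATT--A", "ATTGA"),  # heterozygous indel
]
encoding = tuple(encode_genotype(ref, a, b) for ref, a, b in sites)
print(encoding)
```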
Applications – Genome assembly – RNA splicing studies – Gene expression studies – Discovery of new genes – Discovery of cancer-causing mutations
Short read alignment Two approaches: – Hashing-based algorithms: BFAST, SHRiMP, MAQ, STAMPY (statistical alignment) – Burrows-Wheeler transform: Bowtie, BWA
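To make the second approach concrete, here is a minimal sketch of the Burrows-Wheeler transform and the backward search used to count exact matches of a read. It is a toy illustration only: it sorts all rotations (quadratic memory) and recounts characters at every step, whereas production tools such as Bowtie and BWA use compressed indexes. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    text += "$"  # unique end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text by backward search."""
    # first_row[c] = number of characters in the text strictly smaller than c
    first_row = {}
    total = 0
    for c in sorted(set(bwt_str)):
        first_row[c] = total
        total += bwt_str.count(c)

    lo, hi = 0, len(bwt_str)  # current range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + bwt_str[:lo].count(c)
        hi = first_row[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


reference = "ACGTACGTTACG"
index = bwt(reference)
print(index, count_occurrences(index, "ACG"))
```

The search processes the read from its last base to its first, narrowing a range of rows in the sorted rotation matrix; the width of the final range is the number of exact matches.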
BFAST overview PLoS ONE 4(11): e7767.
BFAST algorithm PLoS ONE 4(11): e7767.
BFAST masked keys
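A masked key can be thought of as a spaced seed: a 0/1 mask selects which positions of a window contribute to the hash key, so mismatches at masked-out positions do not prevent a lookup hit. The sketch below illustrates that idea with a plain Python dictionary; it is not BFAST's actual index layout, and the mask, sequences, and function names are hypothetical.

```python
from collections import defaultdict


def masked_key(window, mask):
    """Keep only the bases at positions where the mask has a '1'."""
    return "".join(base for base, bit in zip(window, mask) if bit == "1")


def build_index(reference, mask):
    """Map each masked key to the reference positions where it occurs."""
    index = defaultdict(list)
    width = len(mask)
    for pos in range(len(reference) - width + 1):
        index[masked_key(reference[pos:pos + width], mask)].append(pos)
    return index


def candidate_positions(read, index, mask):
    """Return candidate read start positions implied by masked-key hits."""
    width = len(mask)
    hits = set()
    for offset in range(len(read) - width + 1):
        key = masked_key(read[offset:offset + width], mask)
        for pos in index.get(key, []):
            hits.add(pos - offset)
    return sorted(hits)


reference = "TTGACCAGTAAC"
mask = "11011"                 # the third position is ignored
index = build_index(reference, mask)
read = "ACGAG"                 # reference[3:8] is ACCAG; mismatch at the masked position
print(candidate_positions(read, index, mask))
```

Candidate positions found this way would still need a full alignment step to be confirmed.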
Short read alignment Empirical performance: Simulated data: – Extract random substrings of fixed length and add random mutations and gaps – Realign them to the reference genome Real data: – Paired reads: the two ends of the same sequence – Count the number of paired reads within 500 to bases of each other
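The simulated-data procedure can be sketched as follows: pick a random fixed-length substring of the reference, then apply substitutions, deletions, and insertions at given rates, keeping the true start position so the realignment can be scored. The rates, the seed, and the function name `simulate_read` are assumptions for illustration, not values from the cited study.

```python
import random


def simulate_read(reference, length, sub_rate, indel_rate, rng):
    """Return (true_start, read) for one simulated read.

    Each base is substituted with probability sub_rate, deleted with
    probability indel_rate / 2, or followed by a random inserted base
    with probability indel_rate / 2.
    """
    start = rng.randrange(len(reference) - length + 1)
    read = []
    for base in reference[start:start + length]:
        r = rng.random()
        if r < sub_rate:
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate / 2:
            continue                          # deletion: skip this base
        elif r < sub_rate + indel_rate:
            read.append(base)
            read.append(rng.choice("ACGT"))   # insertion after this base
        else:
            read.append(base)
    return start, "".join(read)


rng = random.Random(0)                        # fixed seed for reproducibility
reference = "".join(rng.choice("ACGT") for _ in range(1000))
true_start, read = simulate_read(reference, 50, 0.05, 0.02, rng)
print(true_start, read)
```

A read counts as correctly realigned when the aligner reports a position at or near `true_start`.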
Short read alignment Courtesy of Genome Res. June : ;
Metagenomics Study of DNA sequences (fragments) from environmental samples Bioinformatics problem: classify the DNA sequence of an organism Methods: – Sequence similarity – Machine learning Key papers
Metagenomics Project: use an advanced GPU sequence similarity program to classify simulated reads Data: available from published study Steps: – Align simulated metagenomic reads to a set of genomes with the programs MaxSSmap and NextGenMap – Since the reads are simulated, the true genome of each read is known – Determine the number of correctly aligned pairs
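The last step can be sketched as a comparison between the known source genome of each simulated read (or read pair) and the genome the aligner assigned it to. The dictionaries and names below are hypothetical stand-ins for whatever the simulation and the aligner output actually provide.

```python
def classification_accuracy(true_genome, predicted_genome):
    """Return (number correct, fraction correct) over all simulated reads.

    true_genome:      read id -> name of the genome it was simulated from
    predicted_genome: read id -> name of the genome it aligned to
                      (reads that did not align are simply absent)
    """
    if not true_genome:
        raise ValueError("no simulated reads given")
    correct = sum(
        1
        for read_id, genome in true_genome.items()
        if predicted_genome.get(read_id) == genome
    )
    return correct, correct / len(true_genome)


# Hypothetical example: three simulated pairs, one unaligned, one misassigned.
truth = {"pair1": "genomeA", "pair2": "genomeB", "pair3": "genomeA"}
predicted = {"pair1": "genomeA", "pair2": "genomeA"}
print(classification_accuracy(truth, predicted))
```

Unaligned reads count as incorrect here; reporting them separately would be an equally reasonable design.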
Genome alignment Comparison of two large genome sequences Bioinformatics problem: comparison of distantly related genome sequences like chicken and mouse Methods: – Mainly hash table based methods Key papers
Genome alignment Project: develop a GPU program for accurate genome alignment Data: simulated genomes available from the published Alignathon study Steps: – Make fragments of one genome – Align each fragment to the other genome (with short read mapping) – Determine accuracy with the mafComparator program
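The first step, cutting one genome into fragments that a short read mapper can handle, might look like the sketch below. The fragment length and step size are hypothetical parameters; a step smaller than the fragment length yields overlapping fragments.

```python
def make_fragments(genome, fragment_length, step):
    """Return (start, fragment) tuples covering the genome left to right.

    A trailing piece shorter than fragment_length is dropped in this sketch.
    """
    return [
        (start, genome[start:start + fragment_length])
        for start in range(0, len(genome) - fragment_length + 1, step)
    ]


for start, fragment in make_fragments("ACGTACGTAC", 4, 3):
    print(start, fragment)
```

Keeping the start coordinate with each fragment allows the mapped fragments to be translated back into genome-to-genome alignment coordinates.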
Whole genome phylogeny Whole genome phylogeny to understand evolution at the genome level Bioinformatics problem: compare phylogeny from whole genome data vs concatenated data vs multiple gene trees Methods: – Concatenate alignment – Multiple gene trees Key papers
Whole genome phylogeny Project: determine accuracy of phylogenies from simulated whole genome data with three different methods – Gene concatenation – Tree reconciliation – Whole genome distance based methods (new) Data – Simulated genome sequences from Evolver program Steps – Construct trees with three methods – Determine their Robinson-Foulds distance
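The Robinson-Foulds distance between two trees on the same taxa is the number of bipartitions (splits of the taxa induced by internal edges) found in one tree but not the other. The sketch below assumes trees are given as nested tuples of leaf names, which is an illustrative representation; in practice the trees would be parsed from Newick files by a phylogenetics library.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of the tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # each split is stored as the side without the anchor
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Count bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaves")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


tree1 = (("A", "B"), ("C", ("D", "E")))
tree2 = (("A", "C"), ("B", ("D", "E")))
print(robinson_foulds(tree1, tree2))
```

A distance of 0 means the two trees have the same unrooted topology; larger values mean more disagreeing splits.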
Variant detection Variants in unmapped reads could contain Bioinformatics problem: are there variants in unmapped reads? Methods: – GATK for variant detection Key papers
Variant detection Project: determine variants in unmapped reads Data: reads from a 1000 Genomes human sample (NA12878) that are left unmapped by BWA Steps: – Map the reads with the MaxSSmap and NextGenMap programs – Run GATK to determine variants
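One way to collect the input for this project, assuming the BWA output is available as SAM text, is to select records whose FLAG field has the 0x4 bit set, which the SAM format defines as "segment unmapped". The sketch below parses SAM lines directly; the example records are made up, and in practice a tool such as samtools would do this filtering.

```python
def unmapped_read_names(sam_lines):
    """Return the names of reads whose SAM FLAG has the unmapped bit (0x4) set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):      # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])         # column 2 of a SAM record is FLAG
        if flag & 0x4:
            names.append(fields[0])   # column 1 is the read name (QNAME)
    return names


# Hypothetical SAM records: one mapped read and one unmapped read.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
]
print(unmapped_read_names(example))
```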
Projects – Metagenomics: Clarissa and Komal – Genome alignment: Kayla and Kathryn – Risk prediction webserver: Edward and Ittehad – Variant detection: Shijie and Sruti – Whole genome phylogeny: Catherine and Samantha