Published by Regina Armstrong. Modified over 8 years ago.
1
BNFO 615 Usman Roshan
2
Short read alignment
Input:
– Reads: short DNA sequences (up to a few hundred base pairs (bp)) produced by a sequencing machine. Reads are fragments of a longer DNA sequence present in the sample given as input to the machine, and there are usually millions of them.
– Genome sequence: a reference DNA sequence much longer than the read length
3
Short read alignment
[Figure: short reads from a single individual aligned to the human genome reference. Labeled examples: heterozygous SNP (ACCAG, ACCCG), heterozygous indel (ATT--A, ATTGA), homozygous SNP (ATTGA, ATTAA; encoded as 2), and ACCAG, where no variant is reported. Example encoding shown: (2, 1, 0, 1).]
4
Applications
– Genome assembly
– RNA splicing studies
– Gene expression studies
– Discovery of new genes
– Discovery of cancer-causing mutations
5
Short read alignment
Two approaches:
– Hashing-based algorithms: BFAST, SHRIMP, MAQ, STAMPY (statistical alignment)
– Burrows-Wheeler transform: Bowtie, BWA
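To make the second approach concrete, here is a minimal sketch of exact matching by backward search over a Burrows-Wheeler transform. It is a toy illustration, not how Bowtie or BWA are implemented: the index is built by naively sorting all suffixes, there is no compression, and mismatches and gaps are not handled. The function names `bwt_index` and `find_exact` are made up for this example.

```python
def bwt_index(text):
    """Build a toy index: suffix array, C table, and Occ table of the BWT."""
    text += "$"  # sentinel; assumed absent from the text and smaller than A/C/G/T
    # Naive suffix array: fine for a demo, far too slow for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c] = number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c_table, occ


def find_exact(read, index):
    """Return sorted start positions where read occurs exactly in the text."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(read):  # extend the match one character to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("ACCAGATTGAACCAG")
print(find_exact("ACCAG", index))  # [0, 10]
print(find_exact("ATTGA", index))  # [5]
```

Each read character narrows a suffix-array interval, so the lookup cost depends on the read length rather than on the genome length, which is the main appeal of this family of aligners.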
6
BFAST overview PLoS ONE 4(11): e7767.
7
BFAST algorithm PLoS ONE 4(11): e7767.
8
BFAST masked keys
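The idea of a masked key can be sketched as follows, assuming a mask is a string of 1s and 0s in which only the bases under a 1 contribute to the hash key, so a mismatch under a 0 does not prevent a lookup hit. This is an illustrative sketch and not BFAST's actual index layout; the helper names and the mask `11011` are chosen for the example.

```python
from collections import defaultdict


def masked_key(window, mask):
    """Keep only the bases whose mask position is '1'."""
    return "".join(base for base, keep in zip(window, mask) if keep == "1")


def build_masked_index(reference, mask):
    """Map each masked key to the reference positions where it occurs."""
    index = defaultdict(list)
    width = len(mask)
    for pos in range(len(reference) - width + 1):
        index[masked_key(reference[pos:pos + width], mask)].append(pos)
    return index


def candidate_positions(read, index, mask):
    """Return candidate start positions of the read on the reference."""
    width = len(mask)
    hits = set()
    for offset in range(len(read) - width + 1):
        key = masked_key(read[offset:offset + width], mask)
        for pos in index.get(key, []):
            if pos - offset >= 0:  # skip placements that overhang the start
                hits.add(pos - offset)
    return sorted(hits)


reference = "GATTACAGGCTTAACG"
read = "TAGAG"  # reference[3:8] is TACAG; the read has one mismatch in the middle

spaced = build_masked_index(reference, "11011")
contiguous = build_masked_index(reference, "11111")
print(candidate_positions(read, spaced, "11011"))      # [3]
print(candidate_positions(read, contiguous, "11111"))  # []
```

The candidates are only seeds: a full aligner would follow each one with a local alignment step to confirm or reject the placement.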
9
Short read alignment
Empirical performance:
Simulated data:
– Extract random substrings of fixed length with random mutations and gaps
– Realign back to reference genome
Real data:
– Paired reads: two ends of the same sequence
– Count number of paired reads within 500 to 10000 bases of each other
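Both evaluations can be sketched in a few lines. The simulation below draws a fixed-length substring and applies substitutions and gaps at rates chosen by the caller; for brevity, gaps are modeled as deletions only, which is a simplifying assumption. The pair counter assumes each pair is given as the two mapped positions on the same reference sequence. The function names and parameters are hypothetical, not taken from any of the cited tools.

```python
import random


def simulate_read(reference, length, sub_rate, gap_rate, rng):
    """Return (true_start, read) for one simulated read."""
    start = rng.randrange(len(reference) - length + 1)
    read = []
    for base in reference[start:start + length]:
        r = rng.random()
        if r < gap_rate:
            continue  # gap: drop this base (deletion in the read)
        if r < gap_rate + sub_rate:
            # substitution: replace with a different base
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
    return start, "".join(read)


def count_concordant_pairs(pairs, min_dist=500, max_dist=10000):
    """Count pairs whose two ends map within [min_dist, max_dist] bases."""
    return sum(1 for left, right in pairs
               if min_dist <= abs(right - left) <= max_dist)


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice("ACGT") for _ in range(1000))
true_start, read = simulate_read(reference, 100, 0.05, 0.02, rng)

# Mapped positions of the two ends of four hypothetical read pairs.
pairs = [(100, 700), (100, 200), (0, 20000), (5000, 4000)]
print(count_concordant_pairs(pairs))  # 2
```

Because the true start of each simulated read is recorded, an aligner's output can be scored directly against it; for real data the concordant-pair count serves as a proxy, since the true positions are unknown.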
10
Short read alignment
Courtesy of Genome Res. June 2011 21: 936-939
11
Short read alignment
Courtesy of Genome Res. June 2011 21: 936-939
12
Short read alignment
13
Metagenomics
Study of DNA sequences (fragments) from environmental samples
Bioinformatics problem: classify the DNA sequence of an organism
Methods:
– Sequence similarity
– Machine learning
Key papers
14
Metagenomics
Project: use an advanced GPU sequence similarity program to classify simulated reads
Data: available from published study
Steps:
– Align simulated metagenomic reads to a set of genomes with programs called MaxSSmap and NextGenMap
– Since these are simulated we know the true genome names
– Determine number of correctly aligned pairs
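The last step can be sketched as a simple comparison between the known source genome of each simulated read pair and the genome the aligner placed it on. The dictionaries below are made-up stand-ins for parsed aligner output; the actual output formats of MaxSSmap and NextGenMap are not modeled here.

```python
def score_alignments(true_genome, aligned_genome):
    """Return (correct, aligned, total) counts for simulated read pairs.

    true_genome:    pair id -> name of the genome the pair was simulated from
    aligned_genome: pair id -> name of the genome the aligner mapped it to
                    (pairs the aligner left unmapped are simply absent)
    """
    correct = sum(1 for pair_id, genome in aligned_genome.items()
                  if true_genome.get(pair_id) == genome)
    return correct, len(aligned_genome), len(true_genome)


# Hypothetical example data.
truth = {"p1": "E_coli", "p2": "E_coli", "p3": "B_subtilis", "p4": "B_subtilis"}
mapped = {"p1": "E_coli", "p2": "B_subtilis", "p3": "B_subtilis"}

correct, aligned, total = score_alignments(truth, mapped)
print(correct, aligned, total)  # 2 3 4
```

Reporting the aligned and total counts alongside the correct count keeps the two failure modes apart: pairs mapped to the wrong genome and pairs not mapped at all.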
15
Genome alignment
Comparison of two large genome sequences
Bioinformatics problem: comparison of distantly related genome sequences like chicken and mouse
Methods:
– Mainly hash table based methods
Key papers
16
Genome alignment
Project: develop a GPU program for accurate genome alignment
Data: simulated genomes available from the published Alignathon study
Steps:
– Make fragments of one genome
– Align each fragment to the other genome (with short read mapping)
– Determine accuracy with the mafcomparator program
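The first step can be sketched as a sliding window over one genome. The fragment length and step size are free parameters that the slide does not specify, so the values below are placeholders; each fragment keeps its start coordinate so that alignments can later be mapped back to genome positions.

```python
def make_fragments(genome, length, step):
    """Cut a genome string into fixed-length fragments.

    Returns a list of (start, fragment) tuples. With step < length the
    fragments overlap; a trailing piece shorter than length is dropped.
    """
    return [(start, genome[start:start + length])
            for start in range(0, len(genome) - length + 1, step)]


print(make_fragments("ACGTACGTAC", 4, 3))
# [(0, 'ACGT'), (3, 'TACG'), (6, 'GTAC')]
```

Each fragment can then be treated like a long read and passed to a short read mapper, with the recorded start used to translate hits back into a genome-to-genome alignment.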
17
Whole genome phylogeny
Whole genome phylogeny to understand evolution at the genome level
Bioinformatics problem: compare phylogeny from whole genome data vs concatenated data vs multiple gene trees
Methods:
– Concatenate alignment
– Multiple gene trees
Key papers
18
Whole genome phylogeny
Project: determine accuracy of phylogenies from simulated whole genome data with three different methods
– Gene concatenation
– Tree reconciliation
– Whole genome distance based methods (new)
Data:
– Simulated genome sequences from Evolver program
Steps:
– Construct trees with three methods
– Determine their Robinson-Foulds distance
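The Robinson-Foulds distance counts the bipartitions (splits of the leaf set induced by internal edges) that appear in one tree but not the other. Below is a minimal sketch for unrooted topologies, assuming trees are written as nested tuples of leaf names and share the same leaf set; branch lengths and Newick parsing are left out, and the function names are chosen for this example.

```python
def collect_clades(tree, out):
    """Return the leaf set under tree, appending every internal clade to out."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Return the set of non-trivial bipartitions of an unrooted tree."""
    clades = []
    all_leaves = collect_clades(tree, clades)
    anchor = min(all_leaves)  # each split is stored as the side without anchor
    result = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        # keep only splits with at least two leaves on each side
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return all_leaves, result


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf set")
    return len(splits_a ^ splits_b)


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t1))  # 0
print(robinson_foulds(t1, t2))  # 2
```

Storing each split as the side that excludes a fixed anchor leaf makes the comparison independent of where a tree happens to be rooted, so two rootings of the same topology have distance 0.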
19
Variant detection
Variants in unmapped reads could contain
Bioinformatics problem: are there variants in unmapped reads?
Methods:
– GATK for variant detection
Key papers
20
Variant detection
Project: determine variants in unmapped reads
Data: reads from 1000 genomes human sample (NA12878) that are unmapped by BWA
Steps:
– Map reads with MaxSSmap and NextGenMap programs
– Run GATK to determine variants
21
Projects
– Metagenomics: Clarissa and Komal
– Genome alignment: Kayla and Kathryn
– Risk prediction webserver: Edward and Ittehad
– Variant detection: Shijie and Sruti
– Whole genome phylogeny: Catherine and Samantha