BNFO 615 Usman Roshan. Short read alignment

1 BNFO 615 Usman Roshan

2 Short read alignment
Input:
– Reads: short DNA sequences (up to a few hundred base pairs (bp)) produced by a sequencing machine
  Reads are fragments of a longer DNA sequence present in the sample given as input to the machine
  Usually in the millions
– Genome sequence: a reference DNA sequence much longer than the read length

3 Short read alignment
[Figure: short reads from a single individual aligned to the human genome reference. Labels in the figure: heterozygous SNP (ACCAG, ACCCG); heterozygous indel (ATT--A, ATTGA); homozygous SNP, encoded as 2 (ATTGA, ATTAA); ACCAG, where no variant is reported. Encoding shown: (2, 1, 0, 1).]

4 Applications
– Genome assembly
– RNA splicing studies
– Gene expression studies
– Discovery of new genes
– Discovery of cancer-causing mutations

5 Short read alignment
Two approaches:
– Hashing-based algorithms
  BFAST
  SHRIMP
  MAQ
  STAMPY (statistical alignment)
– Burrows-Wheeler transform
  Bowtie
  BWA
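To make the second approach concrete, here is a minimal sketch of exact matching with a Burrows-Wheeler transform index (backward search). It is a toy illustration, not how Bowtie or BWA are implemented: the function names are made up for this example, the suffix array is built naively, and real aligners add compression, sampling, and inexact matching on top.

```python
from collections import Counter


def suffix_array(text):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_index(reference):
    """Build a suffix array, C table, and Occ table for backward search."""
    text = reference + "$"  # "$" sorts before A, C, G, T
    sa = suffix_array(text)
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to "$"
    counts = Counter(text)
    # c_table[c]: number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def exact_match_positions(read, index):
    """Return sorted reference positions where read occurs exactly."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(read):  # extend the match one base to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

For example, `exact_match_positions("ACCAG", bwt_index("ACCAGATTGAACCAG"))` returns `[0, 10]`. The search cost depends on the read length rather than the reference length, which is the main appeal of this family of methods.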

6 BFAST overview PLoS ONE 4(11): e7767.

7 BFAST algorithm PLoS ONE 4(11): e7767.

8 BFAST masked keys
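The idea behind masked keys can be sketched as a spaced-seed hash index: a mask of 1s and 0s selects which bases of a window form the lookup key, so a mismatch under a 0 does not prevent a hit. The code below is an illustrative sketch only. The mask, function names, and data layout are assumptions made for this example and are not taken from BFAST, which uses several masks and follows the lookup with gapped local alignment.

```python
from collections import defaultdict


def masked_key(seq, mask):
    # Keep only the bases that sit under a "1" in the mask.
    return "".join(base for base, m in zip(seq, mask) if m == "1")


def build_index(reference, mask):
    """Map each masked key to the reference positions where it occurs."""
    index = defaultdict(list)
    width = len(mask)
    for pos in range(len(reference) - width + 1):
        index[masked_key(reference[pos:pos + width], mask)].append(pos)
    return index


def candidate_locations(read, index, mask):
    """Return candidate start positions of the read in the reference."""
    width = len(mask)
    candidates = set()
    for offset in range(len(read) - width + 1):
        key = masked_key(read[offset:offset + width], mask)
        for pos in index.get(key, []):
            start = pos - offset  # where the read would begin
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)
```

With the reference `GATTACAGATTGACC` and the mask `11011`, the read `ATCGA` (one mismatch against `ATTGA` at the masked position) still yields the candidate location 8, whereas the contiguous mask `11111` yields none. Candidates found this way would still need to be verified by a full alignment.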

9 Short read alignment
Empirical performance:
Simulated data:
– Extract random substrings of fixed length with random mutations and gaps
– Realign back to reference genome
Real data:
– Paired reads: two ends of the same sequence
– Count number of paired reads within 500 to 10000 bases of each other
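Both evaluation schemes above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the error model (independent per-base substitutions, deletions, and insertions), and the `(chromosome, position)` representation of a mapped read are choices made for this example, not a description of any published benchmark.

```python
import random


def simulate_read(reference, length, sub_rate, indel_rate, rng):
    """Extract a random substring and add random mutations and gaps.

    Returns (true_start, read) so the read can later be realigned and
    the reported position compared with the true one.
    """
    start = rng.randrange(len(reference) - length + 1)
    read = []
    for base in reference[start:start + length]:
        r = rng.random()
        if r < sub_rate:
            # Substitution: pick a different base.
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate / 2:
            continue  # Deletion: skip this base.
        elif r < sub_rate + indel_rate:
            read.append(base)
            read.append(rng.choice("ACGT"))  # Insertion after this base.
        else:
            read.append(base)
    return start, "".join(read)


def count_concordant_pairs(pairs, min_dist=500, max_dist=10000):
    """Count read pairs mapped to the same chromosome within the range."""
    count = 0
    for (chrom1, pos1), (chrom2, pos2) in pairs:
        if chrom1 == chrom2 and min_dist <= abs(pos1 - pos2) <= max_dist:
            count += 1
    return count
```

For simulated data the true start is known, so accuracy is the fraction of reads realigned to their origin. For real data no truth is available, so the number of pairs landing within the expected distance serves as a proxy.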

10 Short read alignment Courtesy of Genome Res. June 2011 21: 936-939

11 Short read alignment Courtesy of Genome Res. June 2011 21: 936-939

12 Short read alignment

13 Metagenomics
Study of DNA sequences (fragments) from environmental samples
Bioinformatics problem: classify the DNA sequence of an organism
Methods:
– Sequence similarity
– Machine learning
Key papers

14 Metagenomics
Project: use an advanced GPU sequence similarity program to classify simulated reads
Data: available from published study
Steps:
– Align simulated metagenomic reads to a set of genomes with programs called MaxSSmap and NextGenMap
– Since the reads are simulated, we know the true genome names
– Determine number of correctly aligned pairs
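The final step can be sketched as a comparison between the known source genome of each simulated read and the genome it was aligned to. The dictionaries and read identifiers below are hypothetical; in practice they would be parsed from the simulator's output and from the aligner's SAM output.

```python
def score_classification(true_genome, aligned_genome):
    """Compare aligned genome names with the known truth.

    true_genome: dict mapping read id -> genome the read was simulated from
    aligned_genome: dict mapping read id -> genome the read was aligned to
                    (reads that failed to align are simply absent)
    Returns (correct, incorrect, unaligned).
    """
    correct = incorrect = unaligned = 0
    for read_id, truth in true_genome.items():
        predicted = aligned_genome.get(read_id)
        if predicted is None:
            unaligned += 1
        elif predicted == truth:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, unaligned
```

Reporting unaligned reads separately from incorrectly aligned ones makes it easier to compare programs that trade sensitivity against precision.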

15 Genome alignment
Comparison of two large genome sequences
Bioinformatics problem: comparison of distantly related genome sequences like chicken and mouse
Methods:
– Mainly hash table based methods
Key papers

16 Genome alignment
Project: develop a GPU program for accurate genome alignment
Data: simulated genomes available from the published Alignathon study
Steps:
– Make fragments of one genome
– Align each fragment to the other genome (with short read mapping)
– Determine accuracy with the mafComparator program
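The first step, cutting one genome into fragments that a short read mapper can handle, might look like the sketch below. The fragment length, step size, and the naming scheme that records each fragment's offset are assumptions chosen for illustration.

```python
def make_fragments(genome, fragment_length, step):
    """Cut a genome into fixed-length fragments taken every `step` bases.

    Returns a list of (start, sequence). A step smaller than the fragment
    length produces overlapping fragments. Bases after the last full
    window are not covered.
    """
    last_start = max(len(genome) - fragment_length, 0)
    return [
        (start, genome[start:start + fragment_length])
        for start in range(0, last_start + 1, step)
    ]


def to_fasta(fragments, name="genome1"):
    """Format fragments as FASTA, keeping the offset in each record name."""
    return "".join(f">{name}_{start}\n{seq}\n" for start, seq in fragments)
```

Keeping the start offset in the record name lets the mapped position of each fragment be translated back into coordinates on the original genome when the pairwise alignment is assembled.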

17 Whole genome phylogeny
Whole genome phylogeny to understand evolution at the genome level
Bioinformatics problem: compare phylogeny from whole genome data vs concatenated data vs multiple gene trees
Methods:
– Concatenate alignment
– Multiple gene trees
Key papers

18 Whole genome phylogeny
Project: determine accuracy of phylogenies from simulated whole genome data with three different methods
– Gene concatenation
– Tree reconciliation
– Whole genome distance based methods (new)
Data:
– Simulated genome sequences from Evolver program
Steps:
– Construct trees with three methods
– Determine their Robinson-Foulds distance
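The Robinson-Foulds distance counts the bipartitions (splits of the leaf set induced by internal edges) that appear in one tree but not the other. Below is a minimal sketch for trees written as nested tuples of leaf names. That representation and the function names are assumptions for this example; trees are treated as unrooted and the distance is not normalized.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial splits of the tree, each in canonical form."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # fixed leaf used to pick one side of each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - leaves
        # Skip trivial splits that separate a single leaf or nothing.
        if len(leaves) > 1 and len(other) > 1:
            splits.add(leaves if ref in leaves else other)
        return leaves

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

For example, `((("A", "B"), "C"), ("D", "E"))` and `((("A", "C"), "B"), ("D", "E"))` share the split ABC|DE but differ in AB|CDE versus AC|BDE, giving a distance of 2. Comparing each inferred tree with the true simulated tree in this way gives an accuracy measure for each method.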

19 Variant detection
Variants in unmapped reads could contain
Bioinformatics problem: are there variants in unmapped reads?
Methods:
– GATK for variant detection
Key papers

20 Variant detection
Project: determine variants in unmapped reads
Data: reads from 1000 genomes human sample (NA12878) that are unmapped by BWA
Steps:
– Map reads with MaxSSmap and NextGenMap programs
– Run GATK to determine variants
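Before the steps above can run, the reads left unmapped by BWA have to be collected. One way to sketch this, assuming the BWA output is available as SAM text, is to keep records whose FLAG field has the 0x4 (segment unmapped) bit set and write them back out as FASTQ for remapping. The function names are illustrative, and a real pipeline would more likely use an existing tool for this.

```python
def unmapped_reads(sam_lines):
    """Yield (name, sequence, quality) for unmapped records in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & 0x4:  # SAM flag bit 0x4: segment unmapped
            yield name, seq, qual


def to_fastq(records):
    """Format (name, sequence, quality) records as FASTQ text."""
    return "".join(f"@{n}\n{s}\n+\n{q}\n" for n, s, q in records)
```

The resulting FASTQ can then be passed to the other mapping programs, and any newly mapped reads forwarded to variant calling.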

21 Projects
Metagenomics: Clarissa and Komal
Genome alignment: Kayla and Kathryn
Risk prediction webserver: Edward and Ittehad
Variant detection: Shijie and Sruti
Whole genome phylogeny: Catherine and Samantha

