Mon 12-14 C222 lecture by Veli Mäkinen Thu 10-12 C222 study group by VM  Mon 10-12 C222 exercises by Anna Kuosmanen Algorithms in Molecular Biology, 5.


1 Mon 12-14 C222: lecture by Veli Mäkinen. Thu 10-12 C222: study group by VM. Mon 10-12 C222: exercises by Anna Kuosmanen. Algorithms in Molecular Biology, 5 cr, Spring 2015. http://www.cs.helsinki.fi/courses/582715/2015/K/K/1

2 Course content. Part I: Algorithmics around molecular biology beyond sequence alignment & analysis. Starting from inputs generated by techniques from the period III course Biological Sequence Analysis. Mostly reductions to well-known network optimization problems, taking network flows, maximum matchings, etc. as black boxes (see the Combinatorial optimization course…). Part II: Case studies on tailored algorithm techniques that have evolved directly around molecular biology problems.

3 Topics. Week 1: Fragment assembly. Weeks 2 and 3: Transcriptomics. Week 4: Haplotyping and metagenomics. Week 5: Perfect phylogeny. Week 6: Sequence motifs. Week 7: Permutation patterns. The slide marks which weeks belong to Part I and which to Part II. We will review the required molecular biology knowledge / motivation on each topic. 2 cr project in period V.

4 Course assessment. There will be an exam giving at most 48 points. Active participation in the exercises gives at most 12 points (30% -> 1p, 85% -> 12p, linear scale). You must attend 5/7 study group sessions and get at least 1p from the exercise sessions to enter the exam. The grade is then based on the maximum of a) the sum of points from exercises (max 12) and exam (max 48) and b) the exam points scaled to 60: ~30 points -> 1 and ~50 points -> 5, depending on the difficulty of the exam.
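
The point arithmetic above can be sketched as follows. This is only an illustration: the function names are made up here, and the slide does not say what happens below 30% participation or how points are rounded, so the sketch assumes 0 points below 30% and no rounding.

```python
def exercise_points(participation_pct: float) -> float:
    """Linear scale from the slide: 30% -> 1 point, 85% -> 12 points.

    Assumption (not stated on the slide): below 30% gives 0 points,
    above 85% is capped at 12, and no rounding is applied.
    """
    if participation_pct < 30:
        return 0.0
    if participation_pct >= 85:
        return 12.0
    return 1.0 + (participation_pct - 30.0) * 11.0 / 55.0


def grading_points(exercise_pts: float, exam_pts: float) -> float:
    """Maximum of (a) exercises + exam and (b) the exam scaled from 48 to 60."""
    return max(exercise_pts + exam_pts, exam_pts * 60.0 / 48.0)
```

For example, with 40 exam points, full exercise points give max(52, 50) = 52, while a single exercise point gives max(41, 50) = 50, so option b) protects students with few exercise points.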

5 Part I FRAGMENT ASSEMBLY

6 DNA sequencing & assembly. Shotgun sequencing. Fragment assembly is the problem of assembling the short DNA fragments (reads) in order to reverse engineer the complete DNA sequence content. [Figure: the sequence ACGATCGACGTCAGCAGCGACACTACGAGCATCAGCGAGCAGCGACTACGAGCGATGAGCTAGCGACATCGAGCATCAGCGATCGA and a read CGATGAGCTAGCGAC taken from it]

7 Shortest superstring problem. Find a/the shortest DNA sequence that contains all the given short DNA sequences as substrings. E.g. ACATAC is the shortest superstring of ATAC, ACAT, CATA. NP-hard: approximation algorithms are studied in the period I course Algorithms for Bioinformatics. Not a realistic problem anyhow, because of sequencing errors, multiple chromosomes, repeats, and ploidy level.
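
The ACATAC example can be checked with a brute-force sketch that tries every ordering of the fragments and merges neighbours with their maximal overlap. This is exponential and only meant for tiny inputs like the one on the slide; it is not one of the approximation algorithms mentioned above, and the function names are illustrative.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def shortest_superstring(fragments: list[str]) -> str:
    """Brute force: try every fragment order, merging with maximal overlaps.

    Assumption: the input is small (the loop runs over all n! orders).
    """
    if not fragments:
        return ""
    best = None
    for order in permutations(fragments):
        merged = order[0]
        for frag in order[1:]:
            if frag in merged:
                continue  # already covered, nothing to append
            merged += frag[overlap(merged, frag):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_superstring(["ATAC", "ACAT", "CATA"]))  # ACATAC
```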

8 Sequencing by hybridization I. Microarray technique to measure the k-mer spectrum of a DNA sequence. E.g. the 2-mer spectrum of ACATAC is AC=2, CA=1, AT=1, TA=1. An Eulerian path on the expanded de Bruijn graph of k-mers reveals a sequence whose k-mer spectrum is the one given. [Figure: expanded de Bruijn graph with arcs AC, CA, AT, TA]
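
A minimal sketch of this idea, assuming k >= 2 and a non-empty spectrum: nodes are (k-1)-mers, each k-mer contributes as many parallel arcs as its count (the "expanded" graph), and an Eulerian path is found with Hierholzer's algorithm. The function names are illustrative. The sequence returned has the given spectrum but need not be the original one; the 2-mer spectrum of ACATAC is also the spectrum of ATACAC.

```python
from collections import Counter


def kmer_spectrum(seq: str, k: int) -> Counter:
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sequence_from_spectrum(spectrum: dict[str, int]) -> str:
    """Spell a sequence along an Eulerian path of the expanded de Bruijn graph.

    Assumptions: k >= 2, the spectrum is non-empty, all k-mers have equal length.
    """
    k = len(next(iter(spectrum)))
    out_arcs: dict[str, list[str]] = {}
    balance = Counter()  # out-degree minus in-degree per node
    for kmer, count in spectrum.items():
        u, v = kmer[:-1], kmer[1:]
        out_arcs.setdefault(u, []).extend([v] * count)
        out_arcs.setdefault(v, [])
        balance[u] += count
        balance[v] -= count
    # Start at the node with one surplus outgoing arc, if there is one.
    starts = [node for node, b in balance.items() if b > 0]
    start = starts[0] if starts else next(iter(out_arcs))
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_arcs[node]:
            stack.append(out_arcs[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    result = path[0] + "".join(node[-1] for node in path[1:])
    if kmer_spectrum(result, k) != Counter(spectrum):
        raise ValueError("spectrum does not admit an Eulerian path")
    return result


spectrum = kmer_spectrum("ACATAC", 2)
reconstructed = sequence_from_spectrum(spectrum)
print(reconstructed, kmer_spectrum(reconstructed, 2) == spectrum)
```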

9 Sequencing by hybridization II. Microarrays can be replaced by a k-mer spectrum estimated from sequencing. Not a realistic problem anyhow, because of sequencing errors, multiple chromosomes, repeats, and ploidy level. A better formulation becomes NP-hard: find a minimum editing of the graph that makes it Eulerian. Due to repeats there are many equally good solutions. See the Algorithms for Bioinformatics course.

10 Fragment assembly in practice. Error correction: try to correct errors in the reads. Contig assembly: extract contigs (contiguous sequences) from the assembly graph (de Bruijn, overlap, …) of the reads. Scaffolding: use mate pair reads to order contigs into scaffolds. Gap filling: fill the gaps left after scaffolding.

11 Error correction. Remove anomalies in the de Bruijn graph. "Correct" the set of reads accordingly. [Figure labels: bubble, tip, spurious arc; C/A: measurement error or heterozygous SNP]
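
One simple way to picture the removal of spurious arcs is a k-mer abundance filter. This is a sketch under an assumption the slide does not state: that erroneous k-mers show up as arcs seen fewer than `min_count` times in the reads. The name `solid_kmers` and the threshold are illustrative, and bubble and tip removal are not covered.

```python
from collections import Counter


def solid_kmers(reads: list[str], k: int, min_count: int = 2) -> set[str]:
    """Keep only k-mers (de Bruijn graph arcs) seen at least min_count times.

    Assumption: rare k-mers are treated as sequencing errors; the
    threshold is a hypothetical parameter, not a value from the lecture.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


# The third read differs from the other two in one base (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(solid_kmers(reads, 3)))  # ['ACG', 'CGT', 'GTA', 'TAC']
```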

12 Contig assembly. Assume a set of corrected reads. Build a de Bruijn graph. Extract unary paths as contigs; these are called unitigs, for unambiguous contigs. [Figure: example de Bruijn graph and its unitigs]
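
The unitig extraction can be sketched as follows, assuming nodes are (k-1)-mers, arcs are the distinct k-mers of the reads, and a "unary" node has exactly one incoming and one outgoing arc. Function names are illustrative, and isolated cycles made only of unary nodes are skipped for brevity.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int):
    """Successor and predecessor sets of the de Bruijn graph on (k-1)-mers."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            u, v = read[i:i + k - 1], read[i + 1:i + k]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred


def unitigs(reads: list[str], k: int) -> list[str]:
    """Spell out maximal paths whose internal nodes are all unary."""
    succ, pred = de_bruijn(reads, k)
    nodes = set(succ) | set(pred)

    def is_unary(node: str) -> bool:
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    result = []
    for node in sorted(nodes):
        if is_unary(node):
            continue  # paths start only at branching or end nodes
        for nxt in sorted(succ.get(node, ())):
            path = [node, nxt]
            while is_unary(path[-1]):
                path.append(next(iter(succ[path[-1]])))
            result.append(path[0] + "".join(p[-1] for p in path[1:]))
    return result


# Two reads that agree on ACGT and then diverge.
print(unitigs(["ACGTT", "ACGTA"], 3))  # ['ACGT', 'GTA', 'GTT']
```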

13 Overlap/prefix/string graphs. Vertices are reads, arcs the overlaps. Alternative to de Bruijn graphs. [Figure: (irreducible) prefix graph of the reads ACGACTACGACATG, ACATGGTATGATTAG and ACGACATGGTATG, with arc labels 6, 3 and 9]
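
The arc labels 6, 3 and 9 in the figure match the prefix lengths (read length minus overlap length) between the three reads, which the quadratic sketch below recomputes. It treats an arc as reducible when a two-arc detour has the same total prefix length; that criterion, the function names and the `min_overlap` parameter are assumptions made for illustration, not the lecture's construction.

```python
def suffix_prefix_overlap(a: str, b: str, min_overlap: int = 1) -> int:
    """Longest proper suffix of a that is a prefix of b (0 if shorter than min_overlap)."""
    for length in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def prefix_graph(reads: list[str], min_overlap: int = 1) -> dict:
    """Arcs (a, b) weighted by the length of the prefix of a not covered by b."""
    arcs = {}
    for a in reads:
        for b in reads:
            if a != b:
                ov = suffix_prefix_overlap(a, b, min_overlap)
                if ov:
                    arcs[(a, b)] = len(a) - ov
    return arcs


def irreducible(arcs: dict) -> dict:
    """Drop arcs (a, c) explained by a detour a -> b -> c of equal prefix length."""
    middles = {a for a, _ in arcs} | {b for _, b in arcs}
    kept = {}
    for (a, c), weight in arcs.items():
        reducible = any(
            (a, b) in arcs and (b, c) in arcs
            and arcs[(a, b)] + arcs[(b, c)] == weight
            for b in middles
        )
        if not reducible:
            kept[(a, c)] = weight
    return kept


r1, r2, r3 = "ACGACTACGACATG", "ACATGGTATGATTAG", "ACGACATGGTATG"
arcs = prefix_graph([r1, r2, r3])
print(arcs[(r1, r3)], arcs[(r3, r2)], arcs[(r1, r2)])  # 6 3 9
print(sorted(irreducible(arcs).values()))  # [3, 6]
```

The arc of weight 9 is dropped because 6 + 3 = 9: going through ACGACATGGTATG spells the same string.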

14 Project course, period V. Implementing some part of assembly. Technical details on (space-)efficient assembly graph constructions: most use hashing or BWT indexing (Biological Sequence Analysis course). E.g. the irreducible prefix/string/overlap graph can be built using a bidirectional BWT index.

15 Scaffolding. Given contigs and a set of mate pair reads, order the contigs into as few scaffolds as possible such that only a given fixed proportion of mate pair mappings are not satisfied.

16 Gap filling. Given scaffolds and reads that do not map inside their contigs: build an assembly graph on the unmapped reads, then find paths that fill the gaps in the scaffolds, satisfying the estimated distances between consecutive contigs.
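
One way to picture the path search is a dynamic program over (vertex, length) pairs. The sketch below is generic and not necessarily the formulation in the lecture script. It assumes the assembly graph is given as arcs with positive integer lengths (for instance prefix lengths), that the gap estimate is an interval `[min_len, max_len]`, and that walks may revisit vertices; all names are illustrative.

```python
def fill_gap(arcs: dict, source: str, target: str, min_len: int, max_len: int):
    """Return a walk from source to target with total length in [min_len, max_len].

    arcs maps (u, v) to a positive integer length. Returns None if no such
    walk exists. Assumption: positive lengths, so states can be processed
    in order of increasing total length.
    """
    adjacency: dict = {}
    for (u, v), weight in arcs.items():
        adjacency.setdefault(u, []).append((v, weight))
    parent = {(source, 0): None}  # state (vertex, length) -> previous state
    by_length = {0: [source]}     # vertices reachable with exactly this length
    for length in range(max_len + 1):
        for u in by_length.get(length, []):
            if u == target and length >= min_len:
                walk, state = [], (u, length)
                while state is not None:  # follow back-pointers to the source
                    walk.append(state[0])
                    state = parent[state]
                return walk[::-1]
            for v, weight in adjacency.get(u, []):
                nxt = (v, length + weight)
                if nxt[1] <= max_len and nxt not in parent:
                    parent[nxt] = (u, length)
                    by_length.setdefault(nxt[1], []).append(v)
    return None


# Toy graph: three s -> t routes of total length 1, 5 and 7.
arcs = {("s", "a"): 2, ("a", "t"): 3, ("s", "t"): 1, ("a", "b"): 4, ("b", "t"): 1}
print(fill_gap(arcs, "s", "t", 4, 6))   # ['s', 'a', 't']
print(fill_gap(arcs, "s", "t", 8, 10))  # None
```

The table has at most (number of vertices) x (max_len + 1) states, so the search stays polynomial in the gap length even though the number of distinct walks can be exponential.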

17 Study group this Thursday. Read scaffolding and gap filling from the lecture script. We'll simulate the NP-hardness reduction of scaffolding, simulate the heuristic solution to scaffolding based on maximum matching, and explain the gap filling dynamic programming solution.

