CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments.

Presentation transcript:

Saving cells in DP
1. Find local alignments
2. Chain: O(N log N), L.I.S.
3. Restricted DP

Methods to CHAIN Local Alignments
Sparse Dynamic Programming, O(N log N)

The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight

Sparse Dynamic Programming
Back to the LCS problem:
Given two sequences
  x = x_1, …, x_m
  y = y_1, …, y_n
Find the longest common subsequence
  Quadratic solution with DP
How about when “hits” x_i = y_j are sparse?
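For reference, the quadratic DP mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from the lecture; the function name lcs_length is an assumption. It fills the full (m+1) x (n+1) table, which is exactly the cost that sparse DP tries to avoid.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y, in O(mn) time."""
    m, n = len(x), len(y)
    # dp[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # A "hit": extend the best alignment of the shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

Only the cells where x_i = y_j ever increase the score, which is why the number of hits, rather than nm, is the natural measure of work.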

Sparse Dynamic Programming
Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead

Sparse Dynamic Programming: L.I.S.
Longest Increasing Subsequence
Given a sequence over an ordered alphabet
  x = x_1, …, x_m
Find a longest subsequence
  s = s_1, …, s_k
  with s_1 < s_2 < … < s_k

Sparse Dynamic Programming: L.I.S.
Let input be w: w_1, …, w_n

INITIALIZATION:
  L: last LIS elt. array; L[0] = -inf, L[1] = w_1, L[2…n] = +inf
  B: array holding the indices of the LIS elts; B[0] = 0, B[1] = 1
  P: array of backpointers
  // L[j]: smallest last element w_i of any j-long increasing subsequence seen so far

ALGORITHM:
  for i = 2 to n {
    Find j such that L[j – 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j – 1]
  }

That’s it!!! Running time?
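A minimal Python sketch of this procedure, assuming strictly increasing subsequences as defined earlier. The names tails, tail_idx and prev are illustrative stand-ins for the slide's L, B and P (0-based here, and without the -inf/+inf sentinels). Because the array of tails stays sorted, the "find j" step can be a binary search, which gives O(n log n) overall.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []             # tails[j]: smallest last value of an increasing run of length j+1
    tail_idx = []          # tail_idx[j]: index in w of tails[j] (the slide's B)
    prev = [-1] * len(w)   # back-pointers (the slide's P)
    for i, value in enumerate(w):
        # First j with tails[j] >= value, i.e. tails[j-1] < value <= tails[j].
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)
            tail_idx.append(i)
        else:
            tails[j] = value
            tail_idx[j] = i
        if j > 0:
            prev[i] = tail_idx[j - 1]
    # Walk the back-pointers from the end of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]
```

For example, longest_increasing_subsequence([4, 3, 10, 2, 8, 1, 3, 4, 7, 5, 9]) returns [1, 3, 4, 5, 9].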

Sparse LCS expressed as LIS
Create a sequence w
  Every matching point (i, j) is inserted into w as follows:
  For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
  The 11 example points are inserted in the order given
a = (y, x), b = (y’, x’) can be chained iff
  a is before b in w, and
  y < y’

Sparse LCS expressed as LIS
Create a sequence w
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
Why don’t we insert elements (i, j) in w in increasing row i order?

Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Contents of L after each step:
       [L1]    [L2]    [L3]    [L4]    [L5]
  1.   (4,2)
  2.   (3,3)
  3.   (3,3)   (10,5)
  4.   (2,5)   (10,5)
  5.   (2,5)   (8,6)
  6.   (1,6)   (8,6)
  7.   (1,6)   (3,7)
  8.   (1,6)   (3,7)   (4,8)
  9.   (1,6)   (3,7)   (4,8)   (7,9)
 10.   (1,6)   (3,7)   (4,8)   (5,9)
 11.   (1,6)   (3,7)   (4,8)   (5,9)   (9,10)

Longest common subsequence: s = 4, 24, 3, 11,
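The reduction from sparse LCS to LIS can be sketched in a few lines. This is an illustrative sketch only: the function name sparse_lcs_length and the 0-based indexing are assumptions, rows index x and columns index y. Hits are listed column by column with rows in decreasing order, and the LIS is then taken over the row indices alone.

```python
from bisect import bisect_left
from collections import defaultdict

def sparse_lcs_length(x: str, y: str) -> int:
    """LCS length computed as an LIS over the matching points ("hits")."""
    # rows[c] lists, in increasing order, every position i with x[i] == c.
    rows = defaultdict(list)
    for i, c in enumerate(x):
        rows[c].append(i)

    # Build w: for each column j (a position in y), append the matching rows
    # in DECREASING order, so that two hits in the same column can never
    # both appear in one strictly increasing subsequence.
    w = []
    for c in y:
        w.extend(reversed(rows.get(c, [])))

    # Standard LIS on the row indices, O(h log h) for h hits.
    tails = []
    for r in w:
        k = bisect_left(tails, r)
        if k == len(tails):
            tails.append(r)
        else:
            tails[k] = r
    return len(tails)
```

The decreasing row order matters: with increasing order, x = "AAAA" and y = "AA" would yield the increasing run 0, 1, 2, 3 inside a single column and report 4 instead of 2.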

Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of chain ending in j
L: list of triplets (l_j, V(j), j)
  L is sorted by l_j: smallest (North) to largest (South) value
  L is implemented as a balanced binary tree

Sparse DP for rectangle chaining
Main idea:
  Sweep through x-coordinates
  To the right of b, anything chainable to a is chainable to b
  Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
  In L, keep rectangles j sorted with increasing l_j-coordinates ⇒ sorted with increasing V(j) score

Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L, with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i.  INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) & l_j ≥ l_i
Is k ever removed?
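The sweep can be sketched in Python as below. Several things here are assumptions made for illustration and are not from the lecture: the function name chain_rectangles, the tuple layout (x_left, x_right, h, l, weight), positive weights, and strict inequalities in both x and y. L is held in plain sorted Python lists rather than a balanced binary tree, so insertions and deletions cost O(N) each here; a balanced tree would be needed for the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight) with h <= l.
    Returns (best_score, indices of the rectangles on the best chain)."""
    events = []
    for i, (xl, xr, _h, _l, _w) in enumerate(rects):
        events.append((xl, 0, i))  # 0 = left end; sorts before right ends at equal x
        events.append((xr, 1, i))  # 1 = right end
    events.sort()

    V = [0] * len(rects)       # V[i]: best score of a chain ending in i
    back = [-1] * len(rects)   # predecessor of i on that chain
    ls, vs, ids = [], [], []   # the list L, sorted by l (and therefore by V)

    for _x, kind, i in events:
        _xl, _xr, h, l, w = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)           # entries before p have l_j < h_i
            if p > 0:
                V[i] = w + vs[p - 1]
                back[i] = ids[p - 1]
            else:
                V[i] = w
        else:
            p = bisect_right(ls, l)          # entry p-1 is k: largest l_k <= l_i
            if p > 0 and vs[p - 1] >= V[i]:
                continue                     # i is "useless": k is at least as good
            q = bisect_left(ls, l)           # first entry with l_j >= l_i
            end = q
            while end < len(ls) and vs[end] <= V[i]:
                end += 1                     # dominated entries are consecutive
            del ls[q:end], vs[q:end], ids[q:end]
            ls.insert(q, l)
            vs.insert(q, V[i])
            ids.insert(q, i)

    if not rects:
        return 0, []
    i = max(range(len(rects)), key=lambda r: V[r])
    best = V[i]
    chain = []
    while i != -1:
        chain.append(i)
        i = back[i]
    return best, chain[::-1]
```

In this sketch k itself is removed only when l_k = l_i, since the REMOVE step covers entries with l_j ≥ l_i.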

Example
[Figure: rectangles a to e in the x-y plane, with weights a: 5, b: 6, c: 3, d: 4, e: (value lost in extraction). The two-step procedure of the previous slide is traced on them, filling in a table of V for a, b, c, d, e and the list L with columns l_i, V(i), i.]

Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   Searching L takes log N
   Inserting to L takes log N
   All deletions are consecutive, so log N per deletion
   Each element is deleted at most once: < N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

Whole-genome Alignment Pipelines
Given N species, phylogenetic tree:
1. Local alignment between all pairs (BLAST)
2. In the order of the tree:
   1. Synteny mapping: find long regions with lots of collinear alignments
   2. In each synteny region:
      1. Chaining
      2. Global alignment
Alternatively, all species are mapped to one reference (e.g., human)
Then, in each unbroken synteny region between multiple species, perform chaining & progressive multiple alignment

Examples
  Human Genome Browser
  ABC

Whole-genome alignment: Rat-Mouse-Human

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned

Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

The Central Dogma
DNA      CCTGAGCCAACTATTGATGAA
           | transcription
RNA      CCUGAGCCAACUAUUGAUGAA
           | translation
Protein  PEPTIDE
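The slide's example can be reproduced with a short sketch. The helper names transcribe and translate are illustrative assumptions; the codon table is the standard genetic code, written compactly in T, C, A, G order, with * marking stop codons. Transcription is simplified here to replacing T with U on the coding strand.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: same sequence with U in place of T."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """Translate an mRNA (or DNA) string codon by codon, starting in frame 0."""
    seq = rna.upper().replace("U", "T")
    # Any trailing partial codon is ignored.
    return "".join(CODON_TABLE[seq[p:p + 3]] for p in range(0, len(seq) - 2, 3))

rna = transcribe("CCTGAGCCAACTATTGATGAA")
protein = translate(rna)
```

Here rna is "CCUGAGCCAACUAUUGAUGAA" and protein is "PEPTIDE", matching the slide.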

Gene structure
exon1 - intron1 - exon2 - intron2 - exon3
transcription → splicing → translation
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid

Finding Genes in Yeast
5’  Intergenic | Coding | Intergenic  3’
Start codon: ATG
Stop codon: TAG / TGA / TAA
Transcript
Mean coding length about 1500 bp (500 codons)
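ORF scanning for an intron-poor genome such as yeast can be sketched as below. This is a simplified illustration under stated assumptions: the function name find_orfs and the min_codons parameter are hypothetical, and only the forward strand is scanned, whereas a real scanner would also examine the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 0):
    """Return sorted (start, end) pairs of open reading frames on the forward strand.

    Each ORF runs from an ATG to the first in-frame stop codon; `end` is
    exclusive and includes the stop. ORFs with fewer than `min_codons`
    codons before the stop are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a new ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # close it and keep scanning
    return sorted(orfs)
```

With a mean coding length of about 500 codons, a threshold such as min_codons=100 would discard most short ORFs that arise by chance; that particular value is only an example.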

Finding Genes in Yeast
[Figure: yeast ORF distribution]

Introns: The Bane of ORF Scanning
[Figure: a transcript running 5’ to 3’ between intergenic regions, made of exons and introns, marked with the start codon ATG, the stop codon TAG/TGA/TAA, and the splice sites]

Introns: The Bane of ORF Scanning
Drosophila: 3.4 introns per gene on average; mean intron length 475, mean exon length 397
Human: 8.8 introns per gene on average; mean intron length 4400, mean exon length 165
ORF scanning is defeated

Where are the genes?

Needles in a Haystack

Signals for Gene Finding
We need to use more information to help recognize genes
1. Regular gene structure
2. Exon/intron lengths
3. Nucleotide composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation

Regular Gene Structure
Start, Stop of translation region:
  Protein-coding starts with ATG
  ends with TAA / TAG / TGA
Exon – Intron – Exon – Intron … – Exon
  g[ GT/GC ]gag – Intron – cAGt
Exon reading frame:
  NNN – NNN – NNN – NNN – NN…
  NN – NNN – NNN – NNN – NN…
  N – NNN – NNN – NNN – NNN…

[Figure: exon reading frames. Next Exon: Frame 0; Next Exon: Frame 1]

Exon/Intron Lengths

Nucleotide Composition
Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG

Biological Signals
How does the cell recognize start/stop codons and splice sites?
  In part, from characteristic base composition
  Donor site (start of intron) is recognized by a section of U1 snRNA
U1 snRNA:             GUCCAUUCA
Donor site consensus: MAGGTRAGT
M means “A or C”, R means “A or G”
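A small sketch of matching against the donor consensus, using only the two ambiguity codes defined on the slide (M and R). The function names and the max_mismatches threshold are assumptions for illustration; counting mismatches is a crude stand-in for the probabilistic models discussed later.

```python
# Ambiguity codes used by the consensus: M = A or C, R = A or G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG"}
DONOR_CONSENSUS = "MAGGTRAGT"

def mismatches(site: str) -> int:
    """Number of positions where a 9-mer disagrees with the donor consensus."""
    return sum(base not in IUPAC[code] for base, code in zip(site, DONOR_CONSENSUS))

def candidate_donors(dna: str, max_mismatches: int = 2):
    """Return (offset, 9-mer) for every window within max_mismatches of the consensus."""
    dna = dna.upper()
    width = len(DONOR_CONSENSUS)
    hits = []
    for i in range(len(dna) - width + 1):
        site = dna[i:i + width]
        if mismatches(site) <= max_mismatches:
            hits.append((i, site))
    return hits

def rna_complement(dna: str) -> str:
    """Base-by-base RNA complement, without reversing the strand."""
    return dna.upper().translate(str.maketrans("ACGT", "UGCA"))
```

The pairing with U1 can be checked directly: rna_complement("CAGGTAAGT") gives "GUCCAUUCA", the U1 snRNA segment shown on the slide written position by position under the donor site.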

[Figure: sequence signals along a gene: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag]

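Splice Sites: donor site (5’ to 3’)
[Table: base composition (%) of A, C, G, T at positions -8 … 17 around the donor site. Only the outermost columns survive intact: position -8: A 26, C 26, G 25, T 23; position 17: A 21, C 27, G 27, T 25.]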

Splice Sites (

Splice Sites
WMM: weight matrix model = PSSM (Staden 1984)
WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
MDD: maximal dependence decomposition (Burge & Karlin 1997)
  Decision-tree algorithm to take pairwise dependencies into account
  Starting with a training set of known splice sites:
    For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
    Choose i* such that S_{i*} is maximal and partition into two subsets, until
      No significant dependencies left, or
      Not enough sequences in subset
  Train separate WMM models for each subset

Decision tree:
All donor splice sites
  not G_5
  G_5
    G_5, not G_-1
    G_5, G_-1
      G_5, G_-1, not A_2
      G_5, G_-1, A_2
        G_5, G_-1, A_2, not U_6
        G_5, G_-1, A_2, U_6
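The simplest of these models, the WMM, treats positions as independent. A minimal sketch follows, assuming aligned equal-length training sites, a pseudocount of 1 and a uniform background; the function names and these defaults are illustrative choices rather than anything specified in the lecture. MDD would train one such model per leaf of the decision tree.

```python
import math

def train_wmm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sites."""
    length = len(sites[0])
    model = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({base: counts[base] / total for base in "ACGT"})
    return model

def wmm_score(model, site, background=0.25):
    """Log-odds score (bits) of a candidate site against a uniform background."""
    return sum(math.log2(model[i][base] / background) for i, base in enumerate(site))

# Toy training set of donor-like 9-mers (made-up examples, not real data).
training = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGT", "AAGGTAAGG"]
model = train_wmm(training)
```

A site resembling the training set, such as "CAGGTAAGT", scores well above an unrelated 9-mer such as "TTTTTTTTT". The model cannot express that the base at one position changes what is likely at another, which is the gap WAM and MDD address.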