Bioinformatics: Buzzword or Discipline (???)

Outline of the course
Analysis of one DNA sequence: shotgun sequencing, Markov-chain modeling, patterns and repeats.
Analysis of multiple DNA or protein sequences: dynamic programming alignments, substitution matrices.
BLAST: algorithm for sequence retrieval and comparison.
Refresher on Markov chains: capsule theory, Markov-chain Monte Carlo algorithms.
Hidden Markov models: Viterbi algorithm and its applications.
Evolutionary models: models of nucleotide mutation and substitution, recombination and genetic drift, with applications to genome evolution and gene mapping.
Molecular phylogenetics (tree making): distance matrix, maximum likelihood and parsimony.
Special topics: gene and protein networks, analysis of DNA-microarray data, …

30,000 genes make up only 3% of the genome (BCM-HGSC)

Genome Sizes (base pairs)
Human          3.0 x 10^9
Mouse          3.0 x 10^9
Drosophila     1.1 x 10^8
Worm           1.0 x 10^8
Dictyostelium  3.4 x 10^7
Yeast          1.2 x 10^7
Bacteria       1.0 - 5.0 x 10^6

Shotgun Sequencing
High-accuracy sequence: < 1 error per 10,000 bases

The Human Genome: 3 Billion Base Pairs
Whole-genome shotgun strategy:
3 billion bases
Libraries of clones: 3 kb, 10 kb, 50 kb
DNA sequence reads: 500 bases each (e.g. AGGCTCACTG)
(BCM-HGSC)

Statistical issues in shotgun strategy
Model for the random fragments: binomial/Poisson process
Coverage of sequence by random fragments
Mean number of contigs
Mean size of contigs
Coverage by anchored contigs

Binomial/Poisson Process
N fragments, each of length L, are scattered at random in an interval of length G.
Coverage: a = NL/G
Contig: a union of overlapping fragments. We want the contigs to cover as much of G as possible.
Pr[#frags with left end in (x, x+h) = k] “is” binomial(N, h/G), or approximately Poisson(Nh/G) (when?).
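The "(when?)" above can be explored numerically. The sketch below compares the binomial(N, h/G) and Poisson(Nh/G) probabilities for the number of left ends falling in a window of length h. The function names and the choice N = 250, h = L = 800, G = 100,000 are illustrative assumptions, not part of the slides.

```python
import math

def binomial_pmf(k, n, p):
    """Pr[X = k] for X ~ binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Pr[X = k] for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, p, k_max=20):
    """Largest pointwise gap between binomial(n, p) and Poisson(n*p) for k = 0..k_max."""
    lam = n * p
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )

# Assumed illustration values: many fragments, small window relative to G.
G, h = 100_000, 800
print("N = 250, h/G = 0.008:", max_pmf_gap(250, h / G))
# Same mean (N*p = 2) but few fragments and a large p: the approximation is worse.
print("N = 10,  p = 0.2:    ", max_pmf_gap(10, 0.2))
```

Both cases have the same mean Nh/G = 2, so the comparison suggests that the approximation is good when N is large and h/G is small, not merely when the mean is moderate.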

Mean number of contigs
E[#contigs] = N × Pr[a frag is rightmost in a contig]
            = N × Pr[frag does not include the left end of any other frag]
            = N × exp(-NL/G)
            = (aG/L) × exp(-a)
L = 800, G = 100,000
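As a quick numerical check of the formula, the sketch below evaluates (aG/L) × exp(-a) for the slide's values L = 800 and G = 100,000 over a few coverage levels. The function name and the particular grid of a values are assumptions made for illustration.

```python
import math

def expected_contigs(a, G, L):
    """E[#contigs] = (a*G/L) * exp(-a), ignoring boundary effects."""
    return (a * G / L) * math.exp(-a)

G, L = 100_000, 800
for a in (0.25, 0.5, 1, 2, 4, 8):
    print(f"a = {a:<4}  E[#contigs] = {expected_contigs(a, G, L):.2f}")
```

Since a × exp(-a) has its maximum at a = 1, the expected number of contigs rises up to coverage 1 and falls afterwards.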

Mean contig size
E[S] = E[#frags - 1] × E[inter-epoch distance] + L
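The decomposition above can be checked by simulation. The sketch below drops N fragments at uniform left ends, merges overlapping ones into contigs, and compares the average contig size with the closed form L(e^a - 1)/a. That closed form is not stated on the slides; it is the standard value of this expression under the Poisson model and is used here as an assumption. Function names, the seed and the number of runs are also illustrative choices.

```python
import math
import random

def simulate_contig_sizes(N, L, G, rng):
    """Place N fragments of length L at uniform left ends in [0, G) and
    return the sizes of the resulting contigs (boundary effects ignored)."""
    starts = sorted(rng.uniform(0, G) for _ in range(N))
    sizes = []
    contig_start = starts[0]
    contig_end = starts[0] + L
    for s in starts[1:]:
        if s > contig_end:            # gap found: close the current contig
            sizes.append(contig_end - contig_start)
            contig_start = s
        contig_end = s + L            # starts are sorted and lengths are equal
    sizes.append(contig_end - contig_start)
    return sizes

def simulated_mean_contig_size(a, L, G, runs=200, seed=0):
    """Average contig size pooled over several independent runs."""
    rng = random.Random(seed)
    N = round(a * G / L)
    pooled = []
    for _ in range(runs):
        pooled.extend(simulate_contig_sizes(N, L, G, rng))
    return sum(pooled) / len(pooled)

def closed_form_mean_contig_size(a, L):
    """Assumed closed form under the Poisson model: L * (e^a - 1) / a."""
    return L * math.expm1(a) / a

for a in (0.5, 1.0, 2.0, 3.0):
    sim = simulated_mean_contig_size(a, 800, 100_000)
    exact = closed_form_mean_contig_size(a, 800)
    print(f"a = {a}: simulated {sim:.0f}, closed form {exact:.0f}")
```

Small discrepancies between the two columns are expected, both from sampling noise and because the simulation runs on a finite interval while the formula ignores boundary effects.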

Mean contig size
[Figure: E(S) plotted against coverage a]

Number of anchored contigs
#anchors = M
#frags = N
a = NL/G
b = ML/G
E[#anchored contigs] = Nb [exp(-a) - exp(-b)] / (b - a)
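The anchored-contig formula is easy to evaluate directly. In the sketch below, the function name and the example values of N and M are assumptions. The branch for a = b is also an addition: it uses the limit of the expression as b approaches a, namely N × a × exp(-a), to avoid dividing by zero.

```python
import math

def expected_anchored_contigs(N, M, L, G):
    """E[#anchored contigs] = N*b*(exp(-a) - exp(-b)) / (b - a),
    with a = N*L/G and b = M*L/G (boundary effects ignored)."""
    a = N * L / G
    b = M * L / G
    if math.isclose(a, b):
        # Limit of the formula as b -> a (added here, not on the slide).
        return N * a * math.exp(-a)
    return N * b * (math.exp(-a) - math.exp(-b)) / (b - a)

L, G, N = 800, 100_000, 250   # N = 250 gives coverage a = 2 (assumed example)
for M in (25, 125, 250, 1_000, 10_000):
    value = expected_anchored_contigs(N, M, L, G)
    print(f"M = {M:<6}  E[#anchored contigs] = {value:.2f}")
```

With very many anchors nearly every contig contains one, so the value approaches the unanchored count N × exp(-a).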

Conclusions
Expected number of contigs first increases, then decreases with coverage.
Expected size of contig increases with coverage.
Expected number of anchored contigs first increases, then decreases with anchor density.
Attention: Computations do not involve boundary effects.