Biological Sequence Analysis
The materials used in this class are made possible by: Zhiping Weng; Wenyi Wang; Zhijin Wu; Garland Publishing, Alberts’s The Cell; and the wealth of internet resources
Who are we? Sining Chen, Carlo Colantuoni, Giovanni Parmigiani
Who are you? Field of research; stats & computing background; register or audit; why you are taking this course; specific topics you are interested in
Administrative Details
The MHS program in Bioinformatics: jointly offered by the Dept. of Biostatistics and the Dept. of Molecular Microbiology and Immunology; an intensive one-year program that emphasizes biology, statistical methods, and computing
Goal of the class: learn to look at biological sequences from a probabilistic point of view; understand the algorithms behind routine operations, e.g. BLAST; be able to build statistical models to solve problems involving sequences
Biological Sequence Analysis: Basic Biological Concepts. Carlo Colantuoni; Clinical Brain Disorders Branch, NIMH, NIH; Dept. of Biostatistics, JHSPH
Molecular Cell Biology: Central Dogma. DNA (replication) -> transcription -> RNA -> translation -> Protein. Sequence analysis is important at all 3 levels
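The transcription step of the central dogma can be sketched in a few lines. This is only an illustration: the function name `transcribe` is made up for this example, and it assumes the input is the coding (sense) strand written 5' -> 3', so that the mRNA is the same sequence with T replaced by U.

```python
def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a DNA coding strand (assumed 5' -> 3')."""
    dna = coding_dna.upper()
    # Reject anything that is not one of the four DNA bases.
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"not a DNA sequence, found: {sorted(unknown)}")
    # RNA uses uracil (U) where DNA uses thymine (T).
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```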
The Human Genome (figure labels: Dad, Mom, You): 2 copies in every cell (46 chr); one copy from each parent; each parent passes on a “mixed copy”. Genomic content: 3.3 billion bases; ~30K genes; 23 chromosomes (22 + X/Y); millions of variants
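A quick back-of-the-envelope calculation with the numbers on this slide gives a feel for the scale. The variable names are made up for this sketch, and the per-gene figure is only an average spacing, not a gene length.

```python
GENOME_BASES = 3_300_000_000  # bases in one copy of the genome (from the slide)
GENES = 30_000                # approximate gene count (from the slide)
COPIES_PER_CELL = 2           # one copy from each parent

bases_per_cell = GENOME_BASES * COPIES_PER_CELL
bases_per_gene = GENOME_BASES // GENES  # average spacing between genes

print(f"{bases_per_cell:,} bases per cell")         # 6,600,000,000 bases per cell
print(f"one gene per ~{bases_per_gene:,} bases")    # one gene per ~110,000 bases
```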
Nucleotides are the chemical building blocks of nucleic acids: DNA and RNA
From Genomic DNA to mRNA Transcripts. Figure labels: exons, introns, promoters, poly-adenylation. Alternative splicing: ~30K genes give >30K transcripts. Protein-coding genes are not easy to find: gene density is low, and exons are interrupted by introns.
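Splicing and alternative splicing can be sketched as joining selected exon segments of a transcript. Everything here is hypothetical: the toy sequence, the exon coordinates, and the function name `splice` are invented for illustration, and coordinates are assumed to be 0-based and end-exclusive.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript: exons in upper case, introns in lower case (made-up sequence).
pre_mrna = "AUGGCC" + "guaaguuuag" + "GAUUAC" + "guccccacag" + "AAAUAA"
exons = [(0, 6), (16, 22), (32, 38)]

full = splice(pre_mrna, exons)                    # all three exons kept
skipped = splice(pre_mrna, [exons[0], exons[2]])  # middle exon skipped

print(full)     # AUGGCCGAUUACAAAUAA
print(skipped)  # AUGGCCAAAUAA
```

One gene yielding two different transcripts in this way is why the transcript count can exceed the gene count.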
Molecular Cell Biology: Components of the Central Dogma. Genomic DNA (3.3 Gb) -> transcription -> mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA) -> translation -> Protein
DNA: A T G C. RNA: A U G C. Protein: 20 amino acids. Replication: DNA to DNA. Transcription: DNA to RNA, 1:1. Translation: RNA to protein, 3:1. Translation - Protein Synthesis: every 3 nucleotides (a codon) are translated into one amino acid
Translation - Protein Synthesis: RNA 5’ -> 3’ corresponds to Protein N-term -> C-term
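The 3:1 mapping from nucleotides to amino acids can be sketched with the standard genetic code. This is a minimal illustration, not a full gene-finding routine: the names `CODON_TABLE` and `translate` are made up here, and the function assumes the reading frame starts at the first base and stops at the first stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored.
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))  # MA
```

The table has 64 codons but only 20 amino acids plus stop, so several codons map to the same amino acid.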
Nucleotide sequence determines the amino acid sequence
The Human Genome (figure labels: Dad, Mom, You): 2 copies in every cell; one copy from each parent; each parent passes on a “mixed copy”. Genomic content: 3.3 billion bases; ~30K genes; 23 chromosomes (22 + X/Y). Evolutionary scale: deletions, insertions, mutations
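The three kinds of change listed on this slide can be sketched as simple string edits. The function names and the reference sequence are hypothetical, positions are 0-based, and "mutation" is treated here as a single-base substitution, which is an assumption about what the slide means.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Point mutation: replace the base at pos with another base."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq: str, pos: int, bases: str) -> str:
    """Insertion: add bases before position pos."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Deletion: remove length bases starting at pos."""
    return seq[:pos] + seq[pos + length:]


ref = "ATGGCCTAA"  # made-up reference sequence
print(substitute(ref, 4, "T"))  # ATGGTCTAA
print(insert(ref, 3, "A"))      # ATGAGCCTAA
print(delete(ref, 3))           # ATGCCTAA
```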
Biological Sequence Analysis: Primary Concepts. Homolog; Paralog; Ortholog; Identity & Similarity
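Identity and similarity can be made concrete for two already-aligned protein sequences. This sketch rests on two assumptions that vary between tools: the amino acid groups in `SIMILAR_GROUPS` are a toy grouping chosen for illustration (real tools typically use a substitution matrix), and both fractions are taken over alignment columns without gaps.

```python
# Toy grouping of amino acids by broad chemical class (an assumption for this sketch).
SIMILAR_GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(a: str, b: str) -> tuple[float, float]:
    """Return (identity, similarity) as fractions of gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    identical = sum(x == y for x, y in pairs)
    similar = sum(
        x == y
        or (x in GROUP_OF and y in GROUP_OF and GROUP_OF[x] == GROUP_OF[y])
        for x, y in pairs
    )
    return identical / len(pairs), similar / len(pairs)


ident, sim = identity_and_similarity("MKVLDE-F", "MRILEEGP")
print(f"identity {ident:.2f}, similarity {sim:.2f}")  # identity 0.43, similarity 0.86
```

Identical residues always count as similar, so similarity is never lower than identity.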