Bioinformatics An introduction



DNA - the basics


Drew Berry: DNA animations
http://www.youtube.com/watch?v=WFCvkkDSfIU&index=4&list=PL9CBBEA5A85DBCDEF

Organisation of DNA
DNA is packed in chromosomes
Karyotype: chromosome set of a species
Chromosomes are dynamic structures
The human karyotype:
  23 pairs of chromosomes
  46 DNA molecules

The Genetic Code
In general:
  Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons
  Amino acids with similar physical properties have similar codons, so mutations or mistranslation tend to cause conservative substitutions

DNA replication
The ability of DNA to replicate itself is a fundamental driver of life
DNA copying is catalysed by enzymes (DNA polymerases)
The complementary strand is synthesised from a template strand, using deoxyribonucleotides (dNTPs) and a primer
Synthesis is directional (5’->3’)
[Figure: DNA polymerase extending a primer along a template strand (5’ TCAG 3’) with dNTPs to make the reverse-complement copy]
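The "reverse complement copy" in the figure can be sketched in a few lines of Python. This is only an illustration of base pairing and strand direction; the function name reverse_complement is made up for this example, and the template TCAG is the one shown on the slide.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    # Complement every base, then reverse so the result also reads 5'->3'
    return seq.upper().translate(COMPLEMENT)[::-1]

template = "TCAG"                      # template strand from the slide, 5'->3'
new_strand = reverse_complement(template)
print(new_strand)                      # CTGA
```

Applying the function twice returns the original sequence, which is a quick sanity check that both strands carry the same information.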

Genetic mutation
The genetic sequence can be changed by a variety of processes
Small scale:
  Damage to DNA (radiation or chemical damage)
  Replication errors
Large scale:
  Duplication of sections of DNA
  Deletion of sections of DNA
  Transposition of sections of DNA
These errors cause, in the DNA:
  Base substitutions
  Insertions
  Deletions
  Frameshifts
Look at the web site http://evolution.berkeley.edu/evolibrary/article/0_0_0/mutations_01
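A small sketch may help show why insertions and deletions are more disruptive than substitutions. The 12-base sequence below is made up for illustration, and the helper codons is a hypothetical name; the point is only that a single-base insertion or deletion shifts every downstream triplet (a frameshift), while a substitution changes one codon.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

original = "ATGGCCTTAGGA"                         # made-up sequence, 4 codons
substituted = original[:4] + "T" + original[5:]   # base at index 4 changed C -> T
inserted = original[:4] + "T" + original[4:]      # one extra base inserted
deleted = original[:4] + original[5:]             # one base removed

print(codons(original))     # ['ATG', 'GCC', 'TTA', 'GGA']
print(codons(substituted))  # ['ATG', 'GTC', 'TTA', 'GGA']  one codon differs
print(codons(inserted))     # ['ATG', 'GTC', 'CTT', 'AGG']  frameshift
print(codons(deleted))      # ['ATG', 'GCT', 'TAG']         frameshift
```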

Single nucleotide polymorphisms (SNPs)
Defined as cases where at least 1% of the population has a variation in a single nucleotide
Are mostly unique
Can occur in coding or non-coding regions of DNA
Can result in a change in the translated amino acid sequence or be silent (synonymous)
Why are SNPs important?
  Abundant: the most common form of genetic variation
  When comparing two human DNA sequences there is a SNP every 1,000-2,000 nucleotides
  2-3 million SNPs per genome
  Cause genetic variation
  Inherited and can be used to trace ancestry
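As a rough illustration, single-base differences between two already aligned sequences can be listed as below. The sequences and the function name find_single_base_differences are made up; a real SNP call would also need population frequencies (the 1% criterion), which this sketch ignores. The last two lines check the slide's density figure against a haploid genome size of about 3 billion bases, which is an assumption not stated on the slide.

```python
def find_single_base_differences(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (already aligned)")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

person_1 = "ATGGCCTTAGGACCA"   # made-up sequences for illustration
person_2 = "ATGGCTTTAGGACCA"
print(find_single_base_differences(person_1, person_2))   # [(6, 'C', 'T')]

genome_size = 3_000_000_000    # assumed haploid human genome size in bases
print(genome_size // 2000, genome_size // 1000)           # 1500000 3000000
```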

The rate of genetic mutation
The mutation rate (per year or per generation) differs between species and even between different sections of the genome
Different types of mutations occur with different frequencies
The average human mutation rate is estimated to be ~2.5 × 10^-8 mutations per nucleotide site per generation, or about 175 mutations per diploid genome per generation
Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans. Genetics 2000, 156, 297.
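The two figures on this slide are linked by simple arithmetic. The sketch below assumes a diploid genome of about 7 billion base pairs; that size is not given on the slide, but it is the value implied by the two quoted numbers.

```python
rate_per_site = 2.5e-8     # mutations per nucleotide site per generation (from the slide)
diploid_genome_bp = 7e9    # assumption: about 7 billion base pairs per diploid genome

mutations_per_generation = rate_per_site * diploid_genome_bp
print(round(mutations_per_generation))   # 175
```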

Amino acid substitution matrices
       Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
Interconversion between amino acids is not equally likely: it is governed by the DNA code itself and the physicochemical properties of the encoded amino acids (polar, nonpolar, large, small, etc.)
Substitution matrices describe the probability that one amino acid is converted to another and ‘accepted’ (after some period of time)
Above is the PAM1 matrix, with entries given per 10,000 (PAM1 corresponds to a period of time in which 1% of amino acids have changed)
Using such matrices allows us to estimate the probability that two sequences have a common ancestor
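To read probabilities off the PAM1 table, divide an entry by 10,000. The sketch below uses only the Ala column (the first number in each row of the table). It assumes the usual Dayhoff convention that a column is the original amino acid and a row is its replacement; the fact that this column sums to exactly 10,000 is consistent with that reading.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # row order used in the table

# Ala column of the PAM1 table: one entry per replacement amino acid
ALA_COLUMN = [9867, 1, 4, 6, 1, 3, 10, 21, 1, 2, 3, 2, 1, 1, 13, 28, 22, 0, 1, 13]

SCALE = 10_000
# Probability that an Ala is found as each amino acid after 1 PAM of change
p_from_ala = {aa: count / SCALE for aa, count in zip(AMINO_ACIDS, ALA_COLUMN)}

print(sum(ALA_COLUMN))    # 10000
print(p_from_ala["A"])    # 0.9867  Ala unchanged
print(p_from_ala["S"])    # 0.0028  Ala replaced by Ser
```

Under the same assumption, an Ala is far more likely to be replaced by Ser (28 per 10,000) than by Trp (0 per 10,000), which is the "not equally likely" point made on the slide.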

PAM and BLOSUM matrices
Scoring matrices are used to:
  produce sequence alignments and score similarity between two or more proteins
  search a database to find sequences similar to a test sequence
Commonly used families of matrices:
  PAM (Point Accepted Mutation) matrices (Dayhoff)
    Derived from global alignments of entire proteins
    Better for closely related proteins
  BLOSUM (BLOcks SUbstitution Matrix) matrices (Henikoff and Henikoff)
    Derived from local alignments of blocks of sequences
    Better for evolutionarily divergent sequences

The polymerase chain reaction
Replication requires a DNA polymerase
Thermostable DNA polymerase, e.g. Taq polymerase from Thermus aquaticus, a thermophilic bacterium that lives in hot springs
  Efficient DNA amplification
  No error correction
Kary Mullis, Nobel Prize in Chemistry, 1993
Each cycle:
  Melt DNA (94-98 °C)
  Anneal primers (50-65 °C)
  Elongation (72 °C)
Exponential replication
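"Exponential replication" can be made concrete with a short calculation. This is an idealised sketch: the function pcr_copies and its efficiency parameter are introduced here for illustration, and perfect doubling in every cycle is an assumption that real reactions only approximate.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Copies after a number of cycles; efficiency=1.0 means perfect doubling."""
    return initial_copies * (1 + efficiency) ** cycles

print(pcr_copies(1, 30))   # 1073741824.0
```

Starting from a single template molecule, 30 ideal cycles give about a billion copies, which is why PCR needs so little starting material.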

DNA Sequencing (Sanger)
The PCR reaction is terminated using randomly incorporated dideoxynucleotides (ddNTPs)
Older methods use radiolabelled phosphate
Newer methods use ddNTPs incorporating dyes
Truncated DNA strands are separated on a gel or by capillary electrophoresis
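The logic of reading a sequence from truncated strands can be sketched as follows. This is a toy model, not a simulation of the chemistry: it assumes one terminated fragment for every position, and the names sanger_fragments and read_ladder are made up for this example.

```python
def sanger_fragments(synthesised: str) -> list[tuple[int, str]]:
    """One terminated fragment per position: (fragment length, terminating base)."""
    return [(i + 1, base) for i, base in enumerate(synthesised)]

def read_ladder(fragments: list[tuple[int, str]]) -> str:
    """Sort fragments by length, as a gel or capillary does, and read the end bases."""
    return "".join(base for _, base in sorted(fragments))

# Fragments come out of the reaction in no particular order; reversing stands in for that
fragments = sanger_fragments("GATTACA")[::-1]
print(read_ladder(fragments))   # GATTACA
```

Because each fragment ends in a labelled terminator, ordering the fragments by size is enough to recover the order of the bases.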

Next Generation Sequencing
Next generation sequencing refers to methods newer than the Sanger approach
A variety of techniques developed by different companies
DNA is generally immobilized on a solid support
Very large numbers of small reads
Multiple reads of each section of genomic DNA (e.g. 30x)
Assembling the genome becomes a significant computational problem
Some ‘single molecule’ methods do not require PCR (reduces errors)
Cost has fallen substantially: the $1000 genome!
Ref: Metzker, M. L. Sequencing Technologies - the Next Generation. Nat. Rev. Genet. 2010, 11, 31-46.
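The "30x" figure is a coverage depth: total bases sequenced divided by genome size. The sketch below estimates how many reads that implies. The genome size of 3 billion bases and the 150-base read length are assumptions chosen for illustration, and reads_needed is a made-up helper name.

```python
import math

def reads_needed(genome_bp: int, read_length_bp: int, coverage: float) -> int:
    """Number of reads so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_bp / read_length_bp)

# Assumed values: 3 billion base genome, 150-base reads, 30x coverage
print(reads_needed(3_000_000_000, 150, 30))   # 600000000
```

Hundreds of millions of short reads then have to be placed in the right order, which is why assembly is described as a significant computational problem.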

Next-gen Sequencing Overview
Ref: http://res.illumina.com/documents/products/illumina_sequencing_introduction.pdf

The Human Genome Project
Funded by the US government
A draft of the human genome was published in February 2001
Project completed in 2003
Cost US$2.7 billion in 1991 dollars
Hierarchical shotgun sequencing (genome is broken down into many smaller fragments)
Automated Sanger-type sequencing
Ref: http://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-to-the-human-828

Human gene function
The human genome contains about 21,000 genes (about 100,000 were expected!)
98% of the human genome is noncoding DNA
Noncoding DNA can code for regulatory RNAs or otherwise regulate transcription
Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI:10.15347/wjm/2014.008. ISSN 20018762

Human genome resources
Three useful sites providing a huge number of resources such as genome browsers
NCBI: National Center for Biotechnology Information
  http://www.ncbi.nlm.nih.gov/
  http://www.ncbi.nlm.nih.gov/genome/guide/human/
UCSC genome browser
  http://genome.ucsc.edu/
Ensembl: European site at the Sanger Centre
  http://www.ensembl.org

Multiple Genomes
Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491, 56-65.

Genomic data
Sequencing technologies produce enormous amounts of sequence data. What do we want to do with this?
  Identify genes
  Identify functions of gene products (proteins)
  Compare genes between species
  Identify relationships (similarities) between species
  Identify relationships between changes in sequences and disease/disorders
  Pharmacogenomics: find relationships between drug behaviour/metabolism and the genome
  Cancer: identify relationships between sequence and disease, e.g. mutations in the BRCA1 and BRCA2 genes greatly increase a woman’s risk of breast cancer (see NIH BRCA fact sheet)

BLAST - Searching genomes
BLAST is a rapid method for searching protein or DNA sequences in large databases
Can search on nucleotide or protein sequences
Sequences are divided into groups of k amino acids or bases
  PGFHJIQMQVVS -> PGF, GFH, FHJ, HJI, etc. (k=3)
Common or repeated sequences are discarded
Sections of exact sequence match are searched for
The sequence alignment is expanded from sections that are exact matches
BLAST can miss difficult matches
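The word-splitting and exact-match steps described above can be sketched as follows. This is a simplified illustration of the seeding idea only, not the BLAST program itself: it skips the filtering of common words, the extension of hits and the scoring. The function names kmers and seed_hits and the subject sequence are made up for this example.

```python
def kmers(seq: str, k: int = 3) -> list[str]:
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def seed_hits(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """(query position, subject position) pairs, 0-based, where a word matches exactly."""
    index: dict[str, list[int]] = {}
    for j, word in enumerate(kmers(subject, k)):
        index.setdefault(word, []).append(j)   # look-up table of subject words
    return [(i, j) for i, word in enumerate(kmers(query, k)) for j in index.get(word, [])]

print(kmers("PGFHJIQMQVVS")[:4])               # ['PGF', 'GFH', 'FHJ', 'HJI']
print(seed_hits("PGFHJIQMQVVS", "AAQMQVVKK"))  # [(6, 2), (7, 3), (8, 4)]
```

The three hits lie on one diagonal (query position minus subject position is constant), which is the kind of region an alignment would then be extended from.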

BLAST at NIH NCBI
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch



Sequence alignment
Protein or DNA sequences can be aligned
Differences between sequences are interpreted as mutations, insertions or deletions
Substitution matrices are used to score the likelihood of a match
Alignment scores are calculated between pairs of sequences
Multiple alignments can be performed
Many alignment programs exist, e.g. Clustal, T-Coffee
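How a pairwise alignment score is calculated can be sketched with the classic Needleman-Wunsch dynamic programming scheme for global alignment. This is not necessarily the method used by the programs named above, and it swaps in simple match, mismatch and gap scores (assumed values) where a real protein alignment would look up a substitution matrix such as PAM or BLOSUM.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align the two residues
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
    return score[-1][-1]

print(global_alignment_score("GATTACA", "GATTACA"))   # 7
print(global_alignment_score("GATTACA", "GATACA"))    # 4  (six matches, one gap)
```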

Clustal

Sequence alignments and protein structural similarity
Sequence alignments are based on protein/DNA sequence similarity and not on structural similarity
High sequence similarity implies (but does not guarantee) structural similarity
High sequence similarity implies (but does not guarantee) similar protein function
[Figure: comparison of RMSD when pairs of similar proteins are superimposed using the sequence alignment (X axis) and the protein 3D structures (Y axis)]
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891

Key learning questions
DNA
  Organisation
  Function
  Replication
  Mutations and inheritance
DNA sequencing
  The polymerase chain reaction (how does it work, benefits, limitations)
  Sanger sequencing (how? limitations)
  Next gen sequencing (in general, how does it differ from older methods, why is it better?)
The human genome
  What’s in it?
  Why sequence it?
Genomic data
  Sequence alignments (what are they estimating, what are the limitations?)
  BLAST searching (what can it do for us, what are the limitations here?)

Good resources
The NIH provides a genetics primer which is available online https://ghr.nlm.nih.gov/primer - hgp or as a pdf https://ghr.nlm.nih.gov/primer.pdf
The NIH BRCA fact sheet: https://www.cancer.gov/about-cancer/causes-prevention/genetics/brca-fact-sheet