Bioinformatics An introduction


1 Bioinformatics An introduction

2 DNA - the basics

3 3

4 Drew Berry – DNA animations

5 Organisation of DNA
DNA is packed in chromosomes
Karyotype: chromosome set of a species
Chromosomes are dynamic structures
The human karyotype:
  23 pairs of chromosomes
  46 DNA molecules

6 The Genetic Code
In general:
  Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons
  Amino acids with similar physical properties tend to have similar codons, so mutations or mistranslation often cause conservative substitutions

7 DNA replication
The ability of DNA to replicate itself is a fundamental driver of life
DNA copying is catalysed by enzymes (DNA polymerases)
The complementary strand is synthesised from a template strand, using deoxyribonucleotides (dNTPs) and a primer
Synthesis is directional (5’->3’)
[Diagram: DNA polymerase extends a primer along the template DNA strand 5’ TCAG 3’, adding dNTPs (A, C, T, G) to build the reverse complement copy]
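The template-and-copy relationship above can be sketched in a few lines of Python. This is only an illustration of base pairing and strand direction, not of the enzymology; the function name reverse_complement is chosen here and does not come from the slides.

```python
# Watson-Crick pairing rules for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(template: str) -> str:
    """Return the newly synthesised strand, written 5'->3'."""
    # The copy pairs base by base with the template but runs antiparallel,
    # so each base is complemented and the order is reversed.
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


print(reverse_complement("TCAG"))  # CTGA
```

Applying the function twice returns the original template, which is a quick sanity check on the pairing table.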

8 Genetic mutation
The DNA sequence can be changed by a variety of processes
Small scale:
  Damage to DNA (radiation or chemical damage)
  Replication errors
Large scale:
  Duplication of sections of DNA
  Deletion of sections of DNA
  Transposition of sections of DNA
These errors cause:
  Base substitutions
  Insertions
  Deletions
  Frameshifts
Look at the web site
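To see why substitutions and frameshifts have such different consequences, it helps to translate a short coding sequence before and after each change. The sketch below builds the standard genetic code (NCBI translation table 1) and applies it to made-up example sequences; the sequences and the function name translate are illustrative assumptions, not taken from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed for codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical 15-base coding sequence and three mutated versions of it.
original = "ATGGCTGAAAAGTTT"    # Met Ala Glu Lys Phe
silent = "ATGGCCGAAAAGTTT"      # GCT -> GCC, still Ala (synonymous substitution)
missense = "ATGGCTGACAAGTTT"    # GAA -> GAC, Glu -> Asp (conservative substitution)
frameshift = "ATGGCGAAAAGTTT"   # one base deleted, every later codon is shifted

for name, seq in [("original", original), ("silent", silent),
                  ("missense", missense), ("frameshift", frameshift)]:
    print(name, translate(seq))
```

A single substitution changes at most one amino acid, while the one-base deletion alters everything downstream of it.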

9 Single nucleotide polymorphisms (SNPs)
Defined as cases where at least 1% of the population has a variation in a single nucleotide
Are mostly unique
Can occur in coding or non-coding regions of DNA
Can result in a change in the translated amino acid sequence or be silent (synonymous)
Why are SNPs important?
  Abundant: the most common form of genetic variation
  When comparing two human DNA sequences there is a SNP every 1,000–2,000 nucleotides
  2–3 million SNPs per genome
  Cause genetic variation
  Inherited and can be used to trace ancestry

10 The rate of genetic mutation
The mutation rate (per year or per generation) differs between species and even between different sections of the genome
Different types of mutations occur with different frequencies
The average mutation rate is estimated to be ~2.5 × 10^-8 mutations per nucleotide site, or 175 mutations per diploid genome, per generation
Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans. Genetics, 156, 297 (2000).
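The two figures quoted above are linked by simple multiplication. The sketch below assumes a diploid genome of 7 × 10^9 nucleotide sites, which is the size implied by the two numbers on the slide; the constant and function names are chosen here for illustration.

```python
MUTATION_RATE = 2.5e-8      # mutations per nucleotide site per generation (from the slide)
DIPLOID_GENOME_SITES = 7e9  # assumed diploid genome size implied by the slide's figures


def mutations_per_generation(rate: float, sites: float) -> float:
    """Expected number of new mutations in one diploid genome per generation."""
    return rate * sites


print(round(mutations_per_generation(MUTATION_RATE, DIPLOID_GENOME_SITES)))
```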

11 Amino acid substitution matrices
[Figure: PAM1 matrix; rows and columns are the 20 amino acids: Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), Glu (E), Gly (G), His (H), Ile (I), Leu (L), Lys (K), Met (M), Phe (F), Pro (P), Ser (S), Thr (T), Trp (W), Tyr (Y), Val (V)]
Interconversion between amino acids is not equally likely: it is governed by the DNA code itself and by the physicochemical properties of the encoded amino acids (polar, nonpolar, large, small, etc.)
Substitution matrices describe the probability that one amino acid is converted to another and ‘accepted’ (after some period of time)
Above is the PAM1 matrix for comparison of 10,000 codons (corresponds to a period of time in which 1% of amino acids have changed)
Using such matrices allows us to estimate the probability that two sequences have a common ancestor
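PAM matrices for longer evolutionary distances are obtained by multiplying PAM1 by itself (PAM250 is PAM1 raised to the 250th power). The sketch below shows that extrapolation on an invented three-letter alphabet, because the real 20 × 20 values are not reproduced in this transcript; TOY_PAM1 and the helper names are assumptions made for illustration only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy mutation matrix: each row gives the probabilities that one residue
# stays the same (0.99) or changes to each other residue (0.005) in one PAM unit.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(p, 3) for p in toy_pam250[0]])
```

Each row still sums to 1 after the extrapolation, but far fewer residues remain unchanged, which is why high-numbered PAM matrices suit more distant comparisons.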

12 PAM and BLOSUM matrices
Scoring matrices are used to:
  produce sequence alignments and score similarity between two or more proteins
  search a database to find sequences similar to a test sequence
Commonly used families of matrices:
  PAM (Point Accepted Mutation) matrices (Dayhoff)
    Derived from global alignments of entire proteins
    Better for closely related proteins
  BLOSUM (BLOcks SUbstitution Matrix) matrices (Henikoff and Henikoff)
    Derived from local alignments of blocks of sequences
    Better for evolutionarily divergent sequences

13 The polymerase chain reaction
Replication requires a DNA polymerase
Thermostable DNA polymerase, e.g. Taq polymerase from Thermus aquaticus, a thermophilic bacterium that lives in hot springs
  Efficient DNA amplification
  No error correction
Kary Mullis, Nobel Prize in Chemistry, 1993
Each cycle:
  Melt DNA (94–98 °C)
  Anneal primers (50–65 °C)
  Elongation (72 °C)
Exponential replication
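"Exponential replication" means the number of copies can at most double with every melt, anneal and elongate cycle. A minimal sketch of that growth follows; the efficiency parameter is an assumption added here to model cycles that fall short of perfect doubling.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Copies of the target after `cycles` rounds of PCR.

    efficiency=1.0 means every template is copied in every cycle (perfect
    doubling); lower values model a less efficient reaction.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


print(pcr_copies(1, 30))        # ideal reaction starting from a single molecule
print(pcr_copies(1, 30, 0.9))   # the same run at 90% efficiency
```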

14 DNA Sequencing (Sanger)
The PCR reaction is terminated using randomly incorporated dideoxynucleotides (ddNTPs)
Older methods use radiolabelled phosphate
Newer methods use ddNTPs incorporating dyes
Truncated DNA strands are separated on a gel or by capillary electrophoresis
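The logic of reading a sequence from chain-terminated fragments can be sketched as follows. This is an idealised model that assumes exactly one fragment is recovered for every possible length and that the terminal (labelled) base of each fragment can be identified; both function names are invented for the example.

```python
def chain_terminated_fragments(new_strand: str) -> list:
    """Every prefix of the growing strand, each one ending in a labelled ddNTP."""
    return [new_strand[:length] for length in range(1, len(new_strand) + 1)]


def read_sequence(fragments: list) -> str:
    """Order fragments by length, as electrophoresis does, and read the final base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


fragments = chain_terminated_fragments("GATTACA")
print(fragments)
print(read_sequence(fragments))
```

Because only the length and the terminal label of each fragment matter, the order in which fragments come out of the reaction is irrelevant.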

15 Next Generation Sequencing
Next generation sequencing refers to methods newer than the Sanger approach
A variety of techniques developed by different companies
DNA is generally immobilized on a solid support
Very large numbers of small reads
Multiple reads of each section of genomic DNA (e.g. 30x)
Assembling the genome becomes a significant computational problem
Some ‘single molecule’ methods do not require PCR (reduces errors)
Cost has reduced substantially -> the $1000 genome!
Ref: Metzker, M. L. Sequencing Technologies - the Next Generation. Nat. Rev. Genet. 2010, 11, 31–46.
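The "30x" figure is the mean coverage: total bases sequenced divided by genome size. The sketch below works through that arithmetic; the genome size of 3.2 billion bases and the read length of 150 bases are assumed example values, not figures from the slides.

```python
import math


def mean_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base of the genome."""
    return read_count * read_length / genome_size


def reads_needed(coverage: float, genome_size: int, read_length: int) -> int:
    """Smallest number of reads that reaches the requested mean coverage."""
    return math.ceil(coverage * genome_size / read_length)


GENOME_SIZE = 3_200_000_000  # assumed size of a human genome in bases
READ_LENGTH = 150            # assumed short-read length in bases

print(reads_needed(30, GENOME_SIZE, READ_LENGTH))
```

With these assumed values, hundreds of millions of reads are needed for one genome, which is why assembly becomes a significant computational problem.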

16 Next-gen Sequencing Overview
Ref: 16

17 The Human Genome Project
Funded by US government
A draft of the human genome was published in February 2001
Project completed in 2003
Cost $US 2.7 billion in 1991 dollars
Hierarchical shotgun sequencing (genome is broken down into many smaller fragments)
Automated Sanger type sequencing
Ref:

18 Human gene function
The human genome contains about 21K genes (about 100K were expected!)
98% of the human genome is noncoding DNA
Noncoding DNA can code for regulatory RNAs or otherwise regulate transcription
Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI: /wjm/ ISSN

19 Human genome resources
Three useful sites providing a huge number of resources such as genome browsers
  NCBI: National Center for Biotechnology Information
  UCSC genome browser
  Ensembl: European site at the Sanger centre

20 Multiple Genomes
Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491,

21 Genomic data
Sequencing technologies produce enormous amounts of sequence data. What do we want to do with this?
  Identify genes
  Identify functions of gene products (proteins)
  Compare genes between species
  Identify relationships (similarities) between species
  Identify relationships between changes in sequences and disease/disorders
  Pharmacogenomics – find relationships between drug behaviour/metabolism and the genome
  Cancer – identify relationships between sequence and disease. E.g. mutations in the BRCA1 and BRCA2 genes greatly increase a woman’s risk of breast cancer (see NIH BRCA fact sheet)

22 BLAST - Searching genomes
BLAST is a rapid method for searching protein or DNA sequences in large databases
Can search on nucleotide or protein sequences
Sequences are divided into groups of k amino acids or bases
  PGFHJIQMQVVS -> PGF, GFH, FHJ, HJI, etc. (k=3)
Common or repeated sequences are discarded
Sections of exact sequence match are searched for
The sequence alignment is expanded from sections that are exact matches
BLAST can miss difficult matches
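The word-splitting, seeding and extension steps listed above can be sketched as a toy seed-and-extend search. This is a simplified illustration, not BLAST itself: it extends only while letters match exactly, whereas the real program scores extensions with a substitution matrix. The subject sequence and all function names are made up for the example.

```python
def words(sequence: str, k: int = 3) -> list:
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def exact_seeds(query: str, subject: str, k: int = 3) -> list:
    """(query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for pos, word in enumerate(words(subject, k)):
        index.setdefault(word, []).append(pos)
    return [(q, s) for q, word in enumerate(words(query, k))
            for s in index.get(word, [])]


def extend(query: str, subject: str, q: int, s: int, k: int = 3) -> str:
    """Grow a seed left and right, without gaps, while the letters keep matching."""
    start_q, start_s = q, s
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = q + k, s + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


query = "PGFHJIQMQVVS"    # example sequence from the slide
subject = "AAIQMQVAA"     # hypothetical database sequence
for q_pos, s_pos in exact_seeds(query, subject):
    print(q_pos, s_pos, extend(query, subject, q_pos, s_pos))
```

Indexing the subject once and looking up query words in a dictionary is what makes the seeding step fast; it also shows why a match with no exact k-letter word in common is missed entirely.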

23 BLAST at NIH NCBI

24 24

25 25

26 Sequence alignment
Protein or DNA sequences can be aligned
Differences between sequences are interpreted as mutations, insertions or deletions
Substitution matrices are used to score the likelihood of a match
Alignment scores are calculated between pairs of sequences
Multiple alignments can be performed
Many alignment programs, e.g. Clustal, T-Coffee
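A pairwise alignment score can be computed with dynamic programming. The sketch below is a minimal Needleman-Wunsch global alignment that returns only the score; it uses a flat match/mismatch/gap scheme instead of a substitution matrix, and the default scoring values are assumptions chosen for the example.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two letters, gap in b, gap in a.
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GATTAA"))
```

Replacing the match/mismatch test with a lookup in a PAM or BLOSUM matrix turns this into the protein scoring described on the slide.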

27 Clustal

28 Sequence alignments and protein structural similarity
Sequence alignments are based on protein/DNA sequence similarity and not on structural similarity
High sequence similarity implies (but does not guarantee) structural similarity
High sequence similarity implies (but does not guarantee) similar protein function
[Figure: comparison of RMSD when pairs of similar proteins are superimposed using the sequence alignment (X axis) and the protein 3D structures (Y axis)]
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
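The two quantities being compared here, sequence similarity and structural deviation, can each be reduced to a single number. The sketch below assumes the two sequences are already aligned (gaps written as "-") and the two coordinate sets are already superimposed; it does not perform the superposition itself, and definitions of percent identity vary between tools.

```python
import math


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical positions as a percentage of aligned (gap-free) columns."""
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def rmsd(coords_a: list, coords_b: list) -> float:
    """Root-mean-square deviation between paired (x, y, z) atom positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - q) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for p, q in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))


print(percent_identity("ACD-F", "ACE-F"))
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```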

29 Key learning questions
DNA
  Organisation
  Function
  Replication
  Mutations and inheritance
DNA sequencing
  The polymerase chain reaction (how does it work, benefits, limitations)
  Sanger sequencing (how? limitations)
  Next gen sequencing (in general, how does it differ from older methods, why is it better?)
The human genome
  What’s in it?
  Why sequence it?
Genomic data
  Sequence alignments (what are they estimating, what are the limitations?)
  BLAST searching (what can it do for us, what are the limitations here?)

30 Good resources
The NIH provides a genetics primer which is available online - hgp or as a pdf
The NIH BRCA fact sheet:

