Bioinformatics Lecture 1.

1 Bioinformatics Lecture 1

2 DNA - the basics

3 Drew Berry – DNA animations

4 Organisation of DNA
- DNA is packed into chromosomes
- Karyotype: the chromosome set of a species
- Chromosomes are dynamic structures
- The human karyotype: 23 pairs of chromosomes (46 DNA molecules)

5 DNA replication
- The ability of DNA to replicate itself is a fundamental driver of life
- DNA copying is catalysed by enzymes (DNA polymerases)
- The complementary strand is synthesised from a template strand, using deoxyribonucleotides (dNTPs) and a primer
- Synthesis is directional (5’->3’)
[Figure labels: deoxyribonucleotides (dNTPs: A, C, T, G), template DNA strand (5’ TCAG 3’), primer, DNA polymerase, reverse complement copy]
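The reverse complement relationship between the template and the copied strand can be sketched in a few lines of Python. This is only an illustration of the base-pairing rule; the function name `reverse_complement` is chosen here for the example and is not part of the lecture.

```python
# Watson-Crick pairing rules for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(template: str) -> str:
    """Return the strand complementary to `template`, written 5'->3'."""
    # Reading the template backwards and pairing each base gives the
    # new strand in its own 5'->3' direction.
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


# The template from the slide, 5' TCAG 3'.
print(reverse_complement("TCAG"))  # CTGA
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.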

6 The polymerase chain reaction
- Replication requires a DNA polymerase
- PCR uses a thermostable DNA polymerase (e.g. Taq polymerase): efficient DNA amplification, but no error correction (Taq lacks proofreading activity)
- Kary Mullis: Nobel Prize in Chemistry, 1993
- Cycle steps: melt DNA (94-98 °C), anneal primers (50-65 °C), elongation (72 °C)
- Exponential replication
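To make "exponential replication" concrete, the sketch below computes the idealised number of copies after a given number of cycles. The `efficiency` parameter is an assumption added for illustration (1.0 means every strand is copied in every cycle); real reactions plateau and fall short of this.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised copy number after `cycles` rounds of PCR.

    efficiency=1.0 corresponds to perfect doubling in each cycle.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


# One template molecule, 30 perfect cycles: 2**30 copies.
print(pcr_copies(1, 30))
# The same run at an assumed 90% per-cycle efficiency.
print(pcr_copies(1, 30, efficiency=0.9))
```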

7 DNA Sequencing (Sanger)
- Strand synthesis is terminated by randomly incorporated dideoxynucleotides (ddNTPs)
- Older methods use radiolabelled phosphate
- Newer methods use dye-labelled ddNTPs
- Truncated DNA strands are separated on a gel or by capillary electrophoresis

8 Next Generation Sequencing
- Next generation sequencing refers to methods newer than the Sanger approach
- A variety of techniques have been developed by different companies
- DNA is generally immobilized on a solid support
- Very large numbers of small reads
- Multiple reads of each section of genomic DNA (e.g. 30x)
- Assembling the genome becomes a significant computational problem
- Some ‘single molecule’ methods do not require PCR (reduces errors)
- Cost has reduced substantially: the $1000 genome!
Ref: Metzker, M. L. Sequencing Technologies — the Next Generation. Nat. Rev. Genet. 2010, 11, 31–46.
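The "30x" figure is a mean depth of coverage: total sequenced bases divided by genome size. The sketch below shows the arithmetic. The genome size (3.1 billion bases) and the read length (150 bases) are assumptions picked for the example, not values from the slide.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean depth of coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


GENOME_SIZE_BP = 3_100_000_000  # assumed haploid human genome size
READ_LENGTH_BP = 150            # assumed short-read length

n = reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP, coverage=30)
print(n)                                                  # 620000000
print(mean_coverage(n, READ_LENGTH_BP, GENOME_SIZE_BP))   # 30.0
```

Hundreds of millions of short reads per genome is why assembly and alignment become computational problems in their own right.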

9 The Human Genome Project
- Funded by the US government
- A draft of the human genome was published in February 2001
- Project completed in 2003
- Cost US$2.7 billion in 1991 dollars
- Hierarchical shotgun sequencing (the genome is broken down into many smaller fragments)
- Automated Sanger-type sequencing
Ref:

10 Human genome by function
- The human genome contains about 21,000 genes (about 100,000 were expected!)
- 98% of the human genome is noncoding DNA
- Noncoding DNA can code for regulatory RNAs or otherwise regulate transcription
Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI: /wjm/ ISSN

11 The druggable genome – Current drug targets
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.

12 The druggable genome – Human genes
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.

13 Human genome resources
Three useful sites providing a huge number of resources, such as genome browsers:
- NCBI: National Center for Biotechnology Information
- UCSC Genome Browser
- Ensembl: European site (EMBL-EBI, originally a joint project with the Sanger Institute)

14 Next-gen Sequencing Overview
Ref:

15 Multiple Genomes
Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491,

16 Bioinformatics
Sequencing technologies produce enormous amounts of sequence data. What do we want to do with this?
- Identify genes
- Identify functions of gene products (proteins)
- Compare genes between species
- Identify relationships (similarities) between species

17 The Genetic Code
In general:
- Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons
- Amino acids with similar physical properties tend to have similar codons, so mutations or mistranslation often cause conservative substitutions
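The point about similar codons can be explored with the standard genetic code. The sketch below builds the standard codon table (DNA alphabet, bases ordered T, C, A, G) and lists every codon that is one base substitution away from a given codon. The helper name `single_base_neighbours` and the choice of GAT (aspartate) as the example are made up for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_neighbours(codon: str) -> dict:
    """Map each codon one substitution away from `codon` to its amino acid."""
    neighbours = {}
    for pos, base in product(range(3), BASES):
        if base != codon[pos]:
            mutant = codon[:pos] + base + codon[pos + 1:]
            neighbours[mutant] = CODON_TABLE[mutant]
    return neighbours


# GAT codes for aspartate (D). Third-position changes give D or
# glutamate (E), which is also acidic: a conservative substitution.
print(CODON_TABLE["GAT"])
print(single_base_neighbours("GAT"))
```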

18 Genetic mutation
The DNA sequence can be changed by a variety of processes
Small scale:
- Damage to DNA (radiation or chemical damage)
- Replication errors
Large scale:
- Duplication of sections of DNA
- Deletion of sections of DNA
- Transposition of sections of DNA

19 The rate of genetic mutation
- The mutation rate (per year or per generation) differs between species and even between different sections of the genome
- Different types of mutations occur with different frequencies
- The average mutation rate is estimated to be ~2.5 × 10^-8 mutations per nucleotide site, or 175 mutations per diploid genome per generation
Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans. Genetics, 156, 297 (2000).
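The two figures on this slide are linked by a single multiplication: per-site rate times the number of sites in a diploid genome. The check below assumes a diploid genome of 7 × 10^9 nucleotides, which is the size that makes the two numbers agree; that value is an assumption of this sketch rather than something stated on the slide.

```python
MUTATION_RATE_PER_SITE = 2.5e-8   # per nucleotide site per generation (from the slide)
DIPLOID_GENOME_SIZE_BP = 7e9      # assumed diploid genome size in nucleotides


def mutations_per_generation(rate_per_site: float, n_sites: float) -> float:
    """Expected number of new mutations per genome per generation."""
    return rate_per_site * n_sites


expected = mutations_per_generation(MUTATION_RATE_PER_SITE, DIPLOID_GENOME_SIZE_BP)
print(round(expected))  # 175
```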

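20 Amino acid substitution matrices
- Substitution matrices describe how likely it is that one amino acid is converted to another and the change is ‘accepted’
- The matrix is a ‘log-odds’ matrix: each entry is the logarithm of the ratio between the observed frequency of a substitution and the frequency expected by chance, so an entry is a score rather than a probability (the slide’s example entry is Ala to Arg, with the value 30)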
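A log-odds score compares how often a pair of amino acids is seen aligned with how often it would be expected by chance. The sketch below uses base-2 logarithms and invented frequencies purely to show the sign convention; the numbers are not taken from any published PAM or BLOSUM table, and published matrices additionally scale and round the values.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """log2 of observed pair frequency over the frequency expected by chance."""
    expected = freq_a * freq_b
    return math.log2(pair_freq / expected)


# Hypothetical frequencies, chosen only to illustrate the sign of the score.
print(log_odds_score(0.125, 0.25, 0.25))    # 1.0  (seen twice as often as chance)
print(log_odds_score(0.03125, 0.25, 0.25))  # -1.0 (seen half as often as chance)
```

Positive scores mark substitutions that are accepted more often than chance would predict, negative scores mark those that are accepted less often.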

21 PAM and BLOSUM matrices
Scoring matrices are used to:
- produce sequence alignments and score similarity between two or more proteins
- search a database to find sequences similar to a test sequence
Commonly used families of matrices:
- PAM (Point Accepted Mutation) matrices (Dayhoff)
  - Derived from global alignments of entire proteins
  - Better for closely related proteins
- BLOSUM (BLOcks SUbstitution Matrix) matrices (Steven and Jorja Henikoff)
  - Derived from local alignments of blocks of sequences
  - Better for evolutionarily divergent sequences

22 BLAST - Searching genomes
- BLAST is a rapid method for searching protein or DNA sequences in large databases
- Sequences are divided into overlapping words of k amino acids or bases, e.g. PGFHJIQMQVVS -> PGF, GFH, FHJ, HJI, etc. (k=3)
- Common or repeated sequences are discarded
- Sections of exact sequence match are searched for
- The sequence alignment is expanded from sections that are exact matches
- BLAST can miss difficult matches
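The word-splitting and exact-match steps can be sketched as follows. This is a toy illustration of the seeding idea only, not the BLAST implementation: it skips the filtering of common words, the scoring of near-matching words, and the extension step. The function names are invented for this example.

```python
def kmers(sequence: str, k: int = 3) -> list:
    """Split a sequence into its overlapping words of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def find_seed_matches(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    # Index every word of the subject by its positions.
    index = {}
    for pos, word in enumerate(kmers(subject, k)):
        index.setdefault(word, []).append(pos)
    # Look up each query word in the index.
    return [
        (q_pos, s_pos)
        for q_pos, word in enumerate(kmers(query, k))
        for s_pos in index.get(word, [])
    ]


print(kmers("PGFHJIQMQVVS")[:4])                 # ['PGF', 'GFH', 'FHJ', 'HJI']
print(find_seed_matches("GFHJ", "PGFHJIQMQVVS")) # [(0, 1), (1, 2)]
```

Each seed pair is a candidate starting point from which an alignment would then be extended in both directions.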

23

24

25

26 Sequence alignment
- Protein or DNA sequences can be aligned
- Differences between sequences are interpreted as mutations, insertions or deletions
- Substitution matrices are used to score the likelihood of a match
- Alignment scores are calculated between pairs of sequences
- Multiple alignments can be performed
- Many alignment programs exist, e.g. Clustal, T-Coffee
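A pairwise alignment score can be computed by dynamic programming. The sketch below is a minimal global (Needleman-Wunsch style) scorer with a toy scoring scheme: +1 for a match, -1 for a mismatch, -1 per gap position. Those values are assumptions for illustration; real tools use substitution matrices and more elaborate gap penalties, and this is not how Clustal or T-Coffee are implemented.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of sequences a and b (toy scoring scheme)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            deletion = prev[j] + gap       # gap in b
            insertion = curr[j - 1] + gap  # gap in a
            curr.append(max(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "AGT"))         # 2
```

In the second example the best alignment places one gap opposite the C, giving three matches and one gap.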

27 Clustal

28 Sequence alignments and protein structural similarity
- Sequence alignments are based on protein/DNA sequence similarity and not on structural similarity
- High sequence similarity implies (but does not guarantee) structural similarity
- High sequence similarity implies (but does not guarantee) similar protein function
[Figure: comparison of RMSD when pairs of similar proteins are superimposed using the sequence alignment (X axis) and the protein 3D structures (Y axis)]
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
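RMSD (root-mean-square deviation) measures how far apart paired atoms are between two structures. The sketch below computes it for two equal-length lists of 3D coordinates that are assumed to be already superimposed and already paired; it does not perform the optimal superposition that structural comparison tools carry out first, and the coordinates are invented.

```python
import math


def rmsd(coords_a: list, coords_b: list) -> float:
    """Root-mean-square deviation between paired 3D points (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Two hypothetical atom pairs, each displaced by 1 unit along z.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))  # 1.0
```

Which atoms get paired depends on the alignment used, which is why pairing by sequence and pairing by structure can give different RMSD values for the same two proteins.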

29 Differences between sequence and structural alignment
Chain A versus chain D from PDB ID 1vr4. The two chains are 100% identical in sequence.
[Figure panels: A: alignment by sequence; B: alignment by structure; C: overlaid structures]
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891

30 Improving sequence alignments
Adding structural information to sequence alignments can improve their quality

31 Summary
This lecture should provide an overview of:
- DNA sequencing and the polymerase chain reaction
- Genome sequencing
- BLAST searching
- Sequence alignments and their limitations

