Introduction to NCBI & Ensembl tools including BLAST and database searching Incorporating Bioinformatics into the High School Biology Curriculum Fran Lewitter,

Slides:

Advertisements

Similar presentations

1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.

Advertisements

Bioinformatics Tutorial I BLAST and Sequence Alignment.

BLAST Sequence alignment, E-value & Extreme value distribution.

Last lecture summary.

Measuring the degree of similarity: PAM and blosum Matrix

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

Lecture 8 Alignment of pairs of sequence Local and global alignment

Heuristic alignment algorithms and cost matrices

Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.

Introduction to Bioinformatics Algorithms Sequence Alignment.

Introduction to bioinformatics

Sequence similarity.

Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.

Similar Sequence Similar Function Charles Yan Spring 2006.

Sequence Alignment III CIS 667 February 10, 2004.

BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.

Sequence alignment, E-value & Extreme value distribution

Roadmap The topics:  basic concepts of molecular biology  more on Perl  overview of the field  biological databases and database searching  sequence.

TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.

Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.

1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.

An Introduction to Bioinformatics

Protein Evolution and Sequence Analysis Protein Evolution and Sequence Analysis.

CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)

Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.

Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.

Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.

Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering Vol. 14, No. 1, January/February 2002 Presented by Jitimon Keinduangjun.

DNA alphabet DNA is the principal constituent of the genome. It may be regarded as a complex set of instructions for creating an organism. Four different.

Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?

Biology 4900 Biocomputing.

Comp. Genomics Recitation 3 The statistics of database searching.

Construction of Substitution Matrices

Sequence Alignment Csc 487/687 Computing for bioinformatics.

Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.

Introduction to NCBI & Ensembl tools including BLAST and database searching Incorporating Bioinformatics into the High School Biology Curriculum Fran Lewitter,

BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.

Basic Local Alignment Search Tool BLAST Why Use BLAST?

COT 6930 HPC and Bioinformatics Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.

Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.

Sequence Alignment.

Construction of Substitution matrices

Bioinformatics Computing 1 CMP 807 – Day 2 Kevin Galens.

Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University

Step 3: Tools Database Searching

The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.

Copyright OpenHelix. No use or reproduction without express written consent1.

Summer Bioinformatics Workshop 2008 BLAST Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State University – Rochester Center

BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.

Sequence Similarity The bioinformatics for molecular biologists lecture series.

What is sequencing? Video: WlxM (Illumina video) WlxM.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

Sequence similarity, BLAST alignments & multiple sequence alignments

BLAST Anders Gorm Pedersen & Rasmus Wernersson.

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool (BLAST)

Alignment IV BLOSUM Matrices

Basic Local Alignment Search Tool

Sequence alignment, E-value & Extreme value distribution

Presentation transcript:

Introduction to NCBI & Ensembl tools including BLAST and database searching Incorporating Bioinformatics into the High School Biology Curriculum Fran Lewitter, Ph.D. Director, Bioinformatics & Research Computing Whitehead Institute for Biomedical Research Lewitter AT wi.mit.edu

Lewitter-Whitehead Institute – August 15, 2012 What I hope you’ll learn What do we learn from database searching and sequence alignments What tools are available at NCBI What tools are available through Ensembl 2

First some background info Lewitter-Whitehead Institute – August 15, 20123

4

Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221: , Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor. 5

Lewitter-Whitehead Institute – August 15, 2012 Why do alignments Use sequence similarity to infer homology and/or structural similarity between 2 or more genes/proteins Identify more conserved regions of a protein, potentially identifying regions of most functional importance Compare and contrast homologs (perhaps into groups) based on shared positions or regions Infer evolutionary distance from sequence dissimilarity 6

Lewitter-Whitehead Institute – August 15, 2012 Evolutionary Basis of Sequence Alignment Similarity - observable quantity, such as percent identity Homology - conclusion drawn from data that two genes share a common evolutionary history; no metric is associated with this –Paralog – genes related by duplication –Ortholog – genes related by speciation 7

Lewitter-Whitehead Institute – August 15, 2012 Local vs Global Alignment From Mount, Bioinformatics, 2004, pg 71 GLOBAL LOCAL 8

Nucleotide vs Protein If comparing protein coding genes, use protein sequences because of less noise If protein sequences are very similar, it might be more instructive to use DNA sequences Lewitter-Whitehead Institute – August 15, 20129

Example of simple scoring system for nucleic acids Match = +1 (ex. A-A, T-T, C-C, G-G) Mismatch = -1 (ex. A-T, A-C, etc) Gap opening = - 5 Gap extension = -2 T C A G A C G A G T G T C G G A - - G C T G = -4 10

11WIBR Sequence Analysis Course, © Whitehead Institute, February 2005 Possible Alignments A:T C A G A C G A G T G B:T C G G A G C T G I.T C A G A C G A G T G T C G G A - - G C T G II.T C A G A C G A G T G T C G G A - G C - T G III.T C A G A C G A G T G T C G G A - G - C T G score=-4 score=-5

Lewitter-Whitehead Institute – August 15, 2012 Scoring for Protein Alignments - Amino Acid Substitution Matrices PAM - point accepted mutation based on global alignment [evolutionary model] BLOSUM - block substitutions based on local alignments [similarity among conserved sequences] 12

Lewitter-Whitehead Institute – August 15, 2012 AA Scoring Matrices PAM - point accepted mutation based on global alignment [evolutionary model] BLOSUM - block substitutions based on local alignments [similarity among conserved sequences] Log-odds= pair in homologous proteins pair in unrelated proteins by chance Log-odds= obs freq of aa substitutions freq expected by chance 13

Lewitter-Whitehead Institute – August 15, 2012 Substitution Matrices BLOSUM 30 BLOSUM 62 BLOSUM 80 % identity PAM 250 (80) PAM 120 (66) PAM 90 (50) % change Increasing similarity 14

Lewitter-Whitehead Institute – August 15, 2012 Scoring for BLAST Alignments Score = 94.0 bits (230), Expect = 6e-19 Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%) Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256 Y+ FC + + CY G G +YRG T SGA C PW S V Q A+ Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257 Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297 GLG H +CRNPD D +PWC VL RL+WEYCD+ C T Sbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298 Position 1: Y - Y = 7 Position 2: T - S = 1 Position 3: G - S = 0 Position 4: P - E = Position 9: - - P = -11 Position 10: - - A = Sum 230 Based on BLOSUM62 15

Lewitter-Whitehead Institute – August 15, 2012 Statistical Significance Raw Scores - score of an alignment equal to the sum of substitution and gap scores. Bit scores - scaled version of an alignment ’ s raw score that accounts for the statistical properties of the scoring system used. E-value - expected number of distinct alignments that would achieve a given score by chance. Lower E-value => more significant. 16

Lewitter-Whitehead Institute – August 15, 2012 A formula E = Kmn e - S This is the Expected number of high-scoring segment pairs (HSPs) with score at least S for sequences of length m and n. This is the E value for the score S. The parameters K and can be thought of simply as natural scales for the search space size and the scoring system respectively. 17

What’s significant? High confidence - >40% identity for long alignments (Rost, 1999 found that sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity >40%) “Twilight zone” – blurry % identity “Midnight zone” - <20% identity Lewitter-Whitehead Institute – August 15,

Tools of Interest NCBI (US) - –BLAST –Entrez EBI/EMBL/WTSI (Europe) –Ensembl ( Lewitter-Whitehead Institute – August 15,

Lewitter-Whitehead Institute – August 15, 2012 NCBI Home Page 20

Lewitter-Whitehead Institute – August 15, 2012 Ensembl Home Page 21

Hands-On Entrez – finding sequences BLAST – look for similar sequences Ensembl – finding sequences Lewitter-Whitehead Institute – August 15,

How is BLAST WIBR? Looking for homologs Identifying domains in proteins Peptide identification Genome annotation Lewitter-Whitehead Institute – August 15,

Lewitter-Whitehead Institute – August 15, 2012 Remember 1.Don’t be afraid to look at help pages for each website Click, click, click 2.Any analysis tool will give you results 3.Interpret results you find 24

Lewitter-Whitehead Institute – August 15, 2012 Sequence of Note 1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC 31 CCCCCTGACG AGCATCACAA AAATCGACGC 61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT ………… GTAAAGTCTG GAAACGCGGA AGTCAGCGCC “ Here you see the actual structure of a small fragment of dinosaur DNA, ” Wu said. “ Notice the sequence is made up of four compounds - adenine, guanine, thymine and cytosine. This amount of DNA probably contains instructions to make a single protein - say, a hormone or an enzyme. The full DNA molecule contains three billion of these bases. If we looked at a screen like this once a second, for eight hours a day, it ’ d still take more than two years to look at the entire DNA strand. It ’ s that big. ” (page 103) (Published in 1990) Biotechniques,

Lewitter-Whitehead Institute – August 15, 2012 DinoDNA "Dinosaur DNA" from Crichton's THE LOST WORLD p. 135 GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGC ATCAGATACAGTTGGAGATAAGGACGACGTGTGG CAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCT GGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTC CCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGG GGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGC CTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTG CCGTGGCAGACACGGGTACTTTGGGGACCCCCCA GTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCC CACTACCTGGAGCTGCTGCAACCCCCCCGGGGCA GCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCC ACTCAGCAGCGCCTGCGGCCTCTACTACAAAC…… Published in

Lewitter-Whitehead Institute – August 15, 2012 >Erythroid transcription factor (NF-E1 DNA-binding protein) Query: 121 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG Sbjct: 1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60 Query: 301 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV NCGAT Sbjct: 61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116 Query: 481 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660 ATPLWRRDGTGHYLCN ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS NCQT Sbjct: 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169 Query: 661 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840 STTTLWRRSPMGDPVCN ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG Sbjct: 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226 Query: 841 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF Sbjct: 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286 Query: 1021 FGGGAGGYTAPPGLSPQI 1074 FGGGAGGYTAPPGLSPQI Sbjct: 287 FGGGAGGYTAPPGLSPQI

Lewitter-Whitehead Institute – August 15, 2012 >Erythroid transcription factor (NF-E1 DNA-binding protein) Query: 121 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG Sbjct: 1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60 Query: 301 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV NCGAT Sbjct: 61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116 Query: 481 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660 ATPLWRRDGTGHYLCN ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS NCQT Sbjct: 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169 Query: 661 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840 STTTLWRRSPMGDPVCN ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG Sbjct: 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226 Query: 841 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF Sbjct: 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286 Query: 1021 FGGGAGGYTAPPGLSPQI 1074 FGGGAGGYTAPPGLSPQI Sbjct: 287 FGGGAGGYTAPPGLSPQI