Computational Biology or Bioinformatics
- ability to rapidly sequence DNA has led to:
  - large databases
  - development of new algorithms
  - data analysis and interpretation
- similar concepts also applied to:
  - epidemiological databases
  - genetic epidemiology
  - evolutionary genetics
- align related sequences and search databases

Sequence Alignments
- easy to obtain DNA sequence data
- difficult to predict protein structure and function
- structure/function can be inferred from sequence similarities
- similarities identified by aligning DNA or protein sequences
- alignments can be global or local

- homolog (common ancestor)
- ortholog (between species)
- paralog (within species)
- analog (no common ancestor)

Pair-wise Sequence Alignments
- scoring matrix calculates an alignment score
  - e.g., match = 0.9 and mismatch = -0.1 for DNA
  - amino acids have different weights (abundance and chemical or structural similarities)
  - BLOSUM and PAM + variants

Amino Acid Similarities
  Chemical: A, G; D, E; F, Y; K, R; I, L, M, V; Q, N; S, T
  Physical: C, S; D, L, N; E, Q; F, H, W, Y; I, T, V; K, M, R

  GCGCCTC
  |||  ||
  GCGGGTC
  (5 x 0.9) + (2 x -0.1) = 4.3
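The DNA scoring example above can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ungapped alignment of two equal-length sequences; the function name `ungapped_score` is made up for illustration, and the match and mismatch values are the ones quoted on the slide.

```python
def ungapped_score(seq_a, seq_b, match=0.9, mismatch=-0.1):
    """Score two equal-length sequences column by column (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    # Add the match value for identical columns, the mismatch value otherwise.
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Slide example: 5 matches and 2 mismatches.
print(round(ungapped_score("GCGCCTC", "GCGGGTC"), 1))  # 4.3
```

For proteins the two constants would be replaced by a lookup in a substitution matrix such as BLOSUM or PAM, which is not shown here.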

Pair-wise Sequence Alignments
- multiple sequence alignments
  - gives 1st approximation of best score
  - human eye + biological insight better at refining the alignments
- gap penalties (opening and extending)
  - optimal penalties depend on relatedness of sequences
  - 'trial and error' approach
- alignment with maximum score is returned for prescribed gap penalties and scoring matrix
  - not necessarily the most biologically significant
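To illustrate how a maximum-scoring alignment depends on the prescribed gap penalty and scoring values, here is a hedged sketch of global alignment scoring by dynamic programming (Needleman-Wunsch style, a standard technique that the slides do not name). It assumes a single linear gap penalty rather than the separate opening and extending penalties mentioned above, and the default gap value of -1.0 is an arbitrary choice for illustration.

```python
def global_alignment_score(a, b, match=0.9, mismatch=-0.1, gap=-1.0):
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(round(global_alignment_score("GCGCCTC", "GCGGGTC"), 1))           # 4.3
print(round(global_alignment_score("GATTACA", "GATACA", gap=-0.5), 1))  # 4.9
```

Changing `gap` changes which alignment wins, which is why the slide describes the choice of penalties as trial and error.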

Databases
- two types:
  - 1° (original biological data)
  - 2° (value added)
- three 1° DNA databases
  - GenBank
  - EMBL
  - DDBJ
- subdivisions (taxonomic groups, genome projects, ESTs, etc.)
- annotated to include ancillary information (author, publications, etc.)

Searching Databases
- text-based (annotations)
  - gene name, authors, species, etc.
  - information retrieval systems
  - Entrez can access all databases + MEDLINE
- sequence comparisons
  - submit query sequence
  - compare to all sequences in database(s)
  - pairwise is too time consuming
  - heuristic programs (e.g., FASTA and BLAST)
    - match short sequence fragments
    - alignments of sequence regions showing promise
    - scores and statistics
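The "match short sequence fragments" step can be sketched as an exact word lookup. This is a simplified illustration, not the actual FASTA or BLAST implementation (BLAST, for instance, also accepts similar-scoring neighbor words for proteins); the function name `word_hits` and the word length `k=3` are assumptions made for the example.

```python
from collections import defaultdict


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    # Index every k-letter word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # Look up each query word in the index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


print(word_hits("LKRPKA", "KLKRPVA"))  # [(0, 1), (1, 2)]
```

Both hits lie on the same diagonal (subject position minus query position equals 1), which is the kind of region a heuristic program would then extend into a full local alignment.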

Basic Local Alignment Search Tool

Doing a BLAST Search
- BLAST/
- choose BLAST program
- paste in query sequence or acc. no.
- BLAST!
- change default options:
  - database (nr = non-redundant)
  - scoring matrix and gap penalties
  - filtering
  - E-value cutoff (i.e., Expect)
  - limit subset of database (organism, keyword, etc.)
  - display options (e.g., # of descriptions, alignments, etc.)

Blast Search Results

Query= Pbpp58b (423 letters)
Database: nr (493,611 sequences; 154,780,071 total letters)

                                                                    Score    E
Sequences producing significant alignments:                         (bits)  Value

sp|Q08168|HRP_PLABE 58 KD PHOSPHOPROTEIN (HEAT SHOCK-RELATED PRO e-90
gb|AAC | (L21710) 58 kDa phosphoprotein [Plasmodium berghei] 329 3e-89
pir||T10455 heat shock related protein - Plasmodium berghei >gi| e-65
sp|P50503|HIP_RAT HSC70-INTERACTING PROTEIN >gi| |emb|CAA e-22
sp|P50502|HIP_HUMAN HSC70-INTERACTING PROTEIN (PROGESTERONE RECE e-16
gb|AAF | (AE003429) CG2947 gene product [Drosophila melano e-16
pir||T24865 hypothetical protein T12D8.8 - Caenorhabditis elegan e-16
pir||T04562 hypothetical protein T12H Arabidopsis thalian e-14
...
emb|CAA | (X89416) protein phosphatase 5 [Homo sapiens]
pdb|1A17| Tetratricopeptide Repeats Of Protein Phosphatase
ref|NP_ || protein phosphatase 5, catalytic subunit >gi|
pir||S52570 phosphoprotein phosphatase (EC ) 5, catalyti

Slide labels: Probability (E Value column); database | accession # | entry name or locus (sequence identifier fields)
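The relationship between the Score (bits) and E Value columns can be checked roughly with the standard relation E = m x n x 2^(-S), where S is the bit score, m the query length and n the database length. That formula is not stated on the slides and is brought in here as background. The sketch below uses the raw lengths from the search header, whereas BLAST itself uses shorter effective lengths, so the result is only an order-of-magnitude check. The function name is made up for illustration.

```python
def expect_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value from a bit score, using raw (not effective) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Second hit in the list above: 329 bits, reported E value 3e-89.
# Query: 423 letters; database: 154,780,071 total letters.
estimate = expect_from_bit_score(329, 423, 154_780_071)
print(f"{estimate:.1e}")
```

The estimate comes out at roughly twice the reported 3e-89, which is the direction expected when raw lengths are used in place of effective lengths.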

Example of Blast Alignment

>pir||T24865 hypothetical protein T12D8.8 - Caenorhabditis elegans
 (Length = 422)

 Score = 86.2 bits (210), Expect = 5e-16
 Identities = 44/101 (43%), Positives = 60/101 (58%), Gaps = 2/101 (1%)

Query: 119 EAVDLVENKKYEEALEKYNKIISFGNPSAMIYTKRASILLNLKRPKACIRDCTEALNLNV 178
           +A +   N  ++ AL  +   I     SAM++ KRA++LL LKRP A I DC +A+++N
Sbjct: 121 KAQEAFSNGDFDTALTHFTAAIEANPGSAMLHAKRANVLLKLKRPVAAIADCDKAISINP 180

Query: 179 DSANAYKIRAKAYRYLGKWEFAHADMEQGQKIDYDE--NLW 217
           DSA  YK R +A R LGKW  A  D+    K+DYDE  N W
Sbjct: 181 DSAQGYKFRGRANRLLGKWVEAKTDLATACKLDYDEAANEW 221

 Score = 41.4 bits (95), Expect =
 Identities = 16/34 (47%), Positives = 23/34 (67%)

Query: 9 LKKFVASCEENPSILLKPELSFFKDFIESFGGKI 42
         LK+FV  C+ NP++L  PE  FFKD++ S G  +
Sbjct: 7 LKQFVGMCQANPAVLHAPEFGFFKDYLVSLGATL 40

Slide labels: Gap; Matches; A 2nd high scoring segment
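The Identities, Positives and Gaps figures of the first segment can be recounted from the alignment rows themselves. The sketch below assumes the usual BLAST display convention, in which the middle line repeats identical residues and marks conservative substitutions with `+`. The helper name `hsp_counts` is hypothetical, and the sequences are the two blocks of the 86.2-bit segment joined together.

```python
def hsp_counts(query, midline, sbjct):
    """Return (identities, positives, gaps, alignment_length) for one segment."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    # Identities: columns where both rows carry the same residue.
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    # Positives: identities plus the substitutions flagged with '+'.
    positives = identities + midline.count("+")
    # Gaps: '-' characters in either row.
    gaps = query.count("-") + sbjct.count("-")
    return identities, positives, gaps, len(query)


query = ("EAVDLVENKKYEEALEKYNKIISFGNPSAMIYTKRASILLNLKRPKACIRDCTEALNLNV"
         "DSANAYKIRAKAYRYLGKWEFAHADMEQGQKIDYDE--NLW")
midline = ("+A +   N  ++ AL  +   I     SAM++ KRA++LL LKRP A I DC +A+++N "
           "DSA  YK R +A R LGKW  A  D+    K+DYDE  N W")
sbjct = ("KAQEAFSNGDFDTALTHFTAAIEANPGSAMLHAKRANVLLKLKRPVAAIADCDKAISINP"
         "DSAQGYKFRGRANRLLGKWVEAKTDLATACKLDYDEAANEW")

print(hsp_counts(query, midline, sbjct))  # (44, 60, 2, 101)
```

The four numbers agree with the 44/101, 60/101 and 2/101 fractions in the segment header.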

Example of Blast Alignment

>sp|P50503|HIP_RAT HSC70-INTERACTING PROTEIN >gi| |emb|CAA | (X82021)
 Hsc70-interacting protein [Rattus norvegicus]
 (Length = 368)

 Score = 106 bits (261), Expect = 5e-22
 Identities = 60/224 (26%), Positives = 97/224 (42%)

Query: 1 MDIEKIEDLKKFVASCEENPSILLKPELSFFKDFIESFGGKIKKDKMGYXXXXXXXXXXX 60
         MD  K+ +L+ FV  C ++PS+L   E+ F ++++ES GGK+
Sbjct: 1 MDPRKVSELRAFVKMCRQDPSVLHTEEMRFLREWVESMGGKVPPATHKAKSEENTKEEKR 60

Query: 61 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAVECPPLAPXXXXXXXXXXXXXXCKLKEEA
                                            P                   + K  A
Sbjct: 61 DKTTEDNIKTEEPSSEESDLEIDNEGVIEADTDAPQEMGDENAEITEAMMDEANEKKGAA 120

Query: 121 VDLVENKKYEEALEKYNKIISFGNPSAMIYTKRASILLNLKRPKACIRDCTEALNLNVDS 180
           +D         A++ +   I      A++Y KRAS+ + L++P A IRDC  A+ +N DS
Sbjct: 121 IDALNDGELQKAIDLFTDAIKLNPRLAILYAKRASVFVKLQKPNAAIRDCDRAIEINPDS 180

Query: 181 ANAYKIRAKAYRYLGKWEFAHADMEQGQKIDYDENLWDMQKLIQ 224
           A  YK R KA+R LG WE A  D+    K+DYDE+   M + +Q
Sbjct: 181 AQPYKWRGKAHRLLGHWEEAARDLALACKLDYDEDASAMLREVQ 224

Slide label: Filtering (masked query residues appear as X, e.g. SDEEEEDEEEEEEEEEDDDPEKLE)

LOCUS       RNHSRP 1694 bp mRNA ROD 14-JAN-1996
DEFINITION  R.norvegicus mRNA for heat shock related protein.
ACCESSION   X82021
REFERENCE   2 (bases 1 to 1694)
  AUTHORS   Hohfeld,J., Minami,Y. and Hartl,F.U.
  JOURNAL   Cell 83 (4), (1995)
  MEDLINE
FEATURES             Location/Qualifiers
     source
                     /organism="Rattus norvegicus"
     gene
                     /gene="hip"
     CDS
                     /product="Hsc70-interacting protein"
                     /protein_id="CAA "
                     /db_xref="SWISS-PROT:P50503"
                     /translation="MDPRKVSELRAFVKMCRQDPSVLHTEEMRFLREWVESMGGK QDVAQNPSNMSKYQNNPKVMNLISKLSAKFGGHS"
BASE COUNT  542 a 342 c 423 g 387 t
        1 gcgtcgacgg gcttggcatc gggcctccgc agccgcccac cgccagaagc ttccagcctc aaaaaaaaaa aaaa
//
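As a small consistency check on the record above, the BASE COUNT line should add up to the sequence length given in the LOCUS line. The sketch below also derives the GC fraction, which is not part of the record and is included only as an illustration; the function name `summarize_base_count` is hypothetical.

```python
def summarize_base_count(counts):
    """Return (total bases, GC fraction) from a dict of per-base counts."""
    total = sum(counts.values())
    gc_fraction = (counts["g"] + counts["c"]) / total
    return total, gc_fraction


# Values copied from the BASE COUNT line of accession X82021.
total, gc_fraction = summarize_base_count({"a": 542, "c": 342, "g": 423, "t": 387})
print(total)                 # 1694
print(f"{gc_fraction:.3f}")  # 0.452
```

The total matches the 1694 bp stated in the LOCUS line.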

Domains marked on the alignment: Acidic Domain; TPR (x3); Basic Domain; GGMP Repeat Domain
(conservation line under each block omitted: * = identical, . = similar)

Pbe MDIEKIEDLKKFVASCEENPSILLKPELSFFKDFIESFGGKIKKD-KMGYEKMKSEDSTEEKSDEEEEDEEEEEEEEEDD 79
Rat MDPRKVSELRAFVKMCRQDPSVLHTEEMRFLREWVESMGGKVPPATHKAKSEENTKEEKRDKT-TEDNIKTEEPSSEESD 79

Pbe DPEKLELIKEEAVECPPLAPIIEGELSEEQIEEICKLKEEAVDLVENKKYEEALEKYNKIISFGNPSAMIYTKRASILLN 159
Rat LEIDNEGVIEADTDAPQEMGDENAETTEAMMDEANEKKGAAIDALNDGELQKAIDLFTDAIKLNPRLAILYAKRASVFVK 159

Pbe LKRPKACIRDCTEALNLNVDSANAYKIRAKAYRYLGKWEFAHADMEQGQKIDYDENLWDMQKLIQEKYKKIYEKRRYKIN 239
Rat LQKPNAAIRDCDRAIEINPDSAQPYKWRGKAHRLLGHWEEAARDLALACKLDYDEDASAMLREVQPRAQKIAEHRRKYER 239

Pbe KEEEKQRLKREKELKKKLAAKKKAEKMYKENNKRENYDSDSSDSSYSEPDFSGDFPGGMPGGMPGMPGGMGGMGGMPGMP 319
Rat KREEREIKERIERVKKAREEHEKAQRE------------------EEARRQSGSQFGSFPGGFPGGMPGNFPGGMPGMGG 301

Pbe GGFPGMPGGMPGGMPGGMGGMPGMPGGMPGGMGGMPGMPGGMPDLNSPEMKELFNNPQFFQMMQNMMSNPDLINKYASDP 399
Rat AMP------------------------------GMAGMPGLNEILSDPEVLAAMQDPEVMVAFQDVAQNPSNMSKYQNNP 351

Pbe KYKNIFENLKNSDLGGMMGEKPKP 423
Rat KVMNLISKLSAKFGG-------HS 368