Prosite UCSC Genome Browser MSAs and Phylogeny Exercise 2.


Turning information into knowledge
- The outcome of a sequencing project is masses of raw data
- The challenge is to turn this raw data into biological knowledge
- A valuable tool for this challenge is an automated diagnostic pipe through which newly determined sequences can be streamlined

From sequence to function
- Nature tends to innovate rather than invent
- Proteins are composed of functional elements: domains and motifs
  - Domains are structural units that carry out a certain function
  - The same domains are shared between different proteins
  - Motifs are shorter sequences with certain biological activity

InterPro
- An integrated documentation resource for protein families, domains and sites
- Groups signatures describing the same protein family or domain
- Combines a number of databases that use different methodologies to derive protein signatures:
  - UniProt: UniProtKB Swiss-Prot, TrEMBL, UniRef, UniParc
  - prosite: documented DB on domains, families and functional sites
  - Pfam: a DB of protein families represented by MSAs

InterPro search

prosite
- A method for determining the function of uncharacterized translated protein sequences
- Consists of a DB of annotated biologically important sites/patterns/motifs/signatures/fingerprints

prosite
- Entries are represented with patterns or profiles
  - pattern: [AC]-A-[GC]-T-[TC]-[GC]
  - profile: a table of per-position residue scores (shown as a figure on the original slide)
- Profiles are used in prosite when the motif is relatively divergent and is difficult to represent as a pattern
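The pattern notation above can be read as a regular expression. The following is a minimal sketch, assuming the commonly documented pattern elements (a residue, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends). The function names `prosite_to_regex` and `scan` are illustrative and are not part of any prosite tool.

```python
import re

# One pattern element: optional '<', a core, an optional repeat, optional '>'.
_ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {P} = anything except P
            core = "[^" + core[1:-1] + "]"
        if lo is not None:                   # (2) or (2,4) = repeat count
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)

def scan(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

print(prosite_to_regex("[AC]-A-[GC]-T-[TC]-[GC]"))
print(scan("[AC]-A-[GC]-T-[TC]-[GC]", "TTAAGTTCGG"))
```

The lookahead in `scan` is a design choice that lets overlapping hits be reported; a plain `finditer` would skip them.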

Scanning prosite
- Query: sequence. Result: all patterns found in the sequence
- Query: pattern. Result: all sequences which adhere to this pattern

Patterns with a high probability of occurrence
- Entries describing commonly found post-translational modifications or compositionally biased regions
- Found in the majority of known protein sequences
- High probability of occurrence

prosite sequence query

prosite pattern query

UCSC Genome Browser

UCSC Genome Browser - Gateway: reset all settings of previous user

UCSC Genome Browser query results

UCSC Genome Browser annotation tracks: Base position, UCSC Genes, RefSeq, mRNA (GenBank), Vertebrate conservation, Single species compared, SNPs, Repeats. Gene display: direction, exon, intron, UTR.

UCSC Gene

UCSC Genome Browser - movement Zoom x3 + Center

UCSC Genome Browser – Base view

Annotation track options: dense, squish, full, pack

Annotation track options: another way to toggle between ‘pack’ and ‘dense’ view is to click on the track title. [Example tracks on the slide: sickle-cell anemia distribution, malaria distribution]

BLAT
- BLAT = Blast-Like Alignment Tool
- BLAT is designed to find similarity of >95% on DNA, >80% for protein
- Rapid search by indexing the entire genome
Good for:
1. Finding genomic coordinates of cDNA
2. Determining exons/introns
3. Finding human (or chimp, dog, cow…) homologs of another vertebrate sequence
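To illustrate why indexing makes the search rapid, here is a toy sketch of a k-mer index: every length-k word of the "genome" is stored with its positions, so each word of the query can be looked up directly instead of scanning the whole sequence. This shows the general idea only and is not BLAT's actual implementation. The function names, the toy sequences and k = 4 are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def seed_hits(index, query, k):
    """Return (query_pos, genome_pos) for every exact k-mer match."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((j, i))
    return hits

genome = "ACGTACGTTTGACC"   # toy sequence, not real data
index = build_index(genome, 4)
for q_pos, g_pos in seed_hits(index, "GTTTGA", 4):
    # Hits sharing the same (g_pos - q_pos) lie on one diagonal,
    # which is where an alignment would then be extended.
    print(q_pos, g_pos, g_pos - q_pos)
```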

BLAT on UCSC Genome Browser

BLAT Results

Match Non-Match (mismatch/indel) Indel boundaries

BLAT Results

BLAT Results on the browser

Getting DNA sequence of region

Clustal X – A Multiple Alignment Tool

Input: multiple sequence Fasta file
>gi| |ref|NP_ | mesotrypsin preproprotein [Homo sapiens]
MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
>gi| |ref|NP_ | protease, serine, 2 [Macaca mulatta]
MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
>gi| |ref|NP_ | mesotrypsin [Mus musculus]
MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi| |ref|NP_ | protease, serine, 2 [Rattus norvegicus]
MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi| |ref|NP_ | pancreatic anionic trypsinogen [Bos taurus]
MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL...
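A multiple sequence Fasta file is simply a series of records, each a header line starting with ">" followed by one or more sequence lines. As a minimal sketch of how such a file can be read, assuming plain text input, the hypothetical helper `parse_fasta` below collects (header, sequence) pairs. The two short records in the example are made up for illustration.

```python
def parse_fasta(text):
    """Split Fasta-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                          # ignore blank lines
        if line.startswith(">"):
            if header is not None:            # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)               # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up toy records, not taken from any database.
example = """>seq1 toy protein [species A]
MNPFLILAFV
GAAVAVPF
>seq2 toy protein [species B]
MNPLLILAFVGVAVAAPF
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```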

One of the options to get multiple sequence Fasta file

Step 1: Load the sequences

Sequences and conservation view

Step 2: Perform Alignment

Sequences and conservation view

Step 3: Create tree

Step 4: NJPlot

The Newick tree format is used to represent trees as strings. For the tree on the slide, in which A groups with C and B groups with D, the Newick format is: ((A,C),(B,D)); Each pair of parentheses () encloses a clade in the tree, and the comma separates the members of the corresponding clade. “;” is always the last character.
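To make the nesting rule concrete, here is a minimal sketch of a Newick reader that turns a string such as ((A,C),(B,D)); into nested Python tuples. It assumes simple input: unquoted leaf names, with any branch lengths after ":" and any internal node labels ignored. The names `parse_newick` and `leaves` are illustrative only.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def label():
        # Read a name up to the next structural character; drop ':length'.
        nonlocal pos
        start = pos
        while text[pos] not in "(),;":
            pos += 1
        return text[start:pos].split(":")[0].strip()

    def node():
        nonlocal pos
        if text[pos] != "(":
            return label()                    # a leaf
        pos += 1                              # skip '('
        children = [node()]
        while text[pos] == ",":               # comma separates clade members
            pos += 1
            children.append(node())
        if text[pos] != ")":
            raise ValueError("unbalanced parentheses")
        pos += 1                              # skip ')'
        label()                               # optional internal label, ignored
        return tuple(children)

    tree = node()
    if text[pos] != ";":
        raise ValueError("unexpected text before ';'")
    return tree

def leaves(tree):
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

tree = parse_newick("((A,C),(B,D));")
print(tree)
print(leaves(tree))
```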

How robust is our tree?

- We need some statistical way to estimate the confidence in the tree topology
- But we don’t know anything about the tree topology distribution or parameters
- The only data source we have is our data (MSA)
- So, we must rely on our own resources: “pull up by your own bootstraps”

Bootstrap (and jackknife)

Jackknife
1. We create n (typically ) new MSAs (pseudo-data sets) by randomly sampling half of the characters (random sampling without replacement). We do not change the number of sequences, just the number of positions!
[Figure: three example pseudo-data sets, each listing the sampled positions (POS) for sequences 1, 2, 3, ..., N]
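The sampling step can be sketched as follows. This is a minimal illustration, assuming the MSA is held as a dictionary from sequence name to aligned string (all of equal length). The function name `jackknife_msa` and the toy alignment are made up for the example.

```python
import random

def jackknife_msa(msa, rng):
    """Keep a random half of the alignment columns, without replacement."""
    n_cols = len(next(iter(msa.values())))
    # sample() never picks the same column twice; sorting keeps column order.
    keep = sorted(rng.sample(range(n_cols), n_cols // 2))
    return {name: "".join(seq[i] for i in keep) for name, seq in msa.items()}

# Toy alignment (made up): 4 sequences, 10 positions.
msa = {
    "Sp1": "TATTTACGTA",
    "Sp2": "CATTTACGTC",
    "Sp3": "CACTTAGGTC",
    "Sp4": "AACTTAGGTT",
}
rng = random.Random(1)          # fixed seed only to make the run repeatable
for _ in range(3):              # three pseudo-data sets
    print(jackknife_msa(msa, rng))
```

Every sequence keeps the same sampled columns, so the number of sequences is unchanged and only the number of positions is halved.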

Jackknife
2. We reconstruct a tree from each data set, using the same method used for reconstructing the original tree
[Figure: each of the three pseudo-data sets yields its own tree over Sp1, Sp2, Sp3, Sp4]

Jackknife
3. For each node in our original tree, we count the number of times it appeared in the jackknife analysis
[Figure: the three jackknife trees and the original tree over Sp1, Sp2, Sp3, Sp4, with node values 67% and 100%]
In 67% of the data sets, the node Sp1+Sp2 was found

Bootstrap The same as jackknife, but instead of sampling K/2 positions, we sample K positions with replacement

Bootstrap
1. Resample K positions n times
[Figure: the original MSA (positions 1…K, sequences 1, 2, 3, ..., N) next to three resampled data sets; column labels such as 47789…K and 15578…K show that the same position can be drawn more than once]
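The resampling step differs from the jackknife only in how columns are drawn. A minimal sketch, again assuming the MSA is a dictionary from sequence name to aligned string; `bootstrap_msa` and the toy alignment are illustrative names and data.

```python
import random

def bootstrap_msa(msa, rng):
    """Draw K columns with replacement from an alignment of K columns."""
    n_cols = len(next(iter(msa.values())))
    # Each draw is independent, so a column may appear several times or not at all.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in cols) for name, seq in msa.items()}

# Toy alignment (made up): 4 sequences, 8 positions.
msa = {
    "Sp1": "ATCTGACA",
    "Sp2": "ATCTGACC",
    "Sp3": "ACTTAGCC",
    "Sp4": "ACCTAGCT",
}
rng = random.Random(7)          # fixed seed only to make the run repeatable
replicates = [bootstrap_msa(msa, rng) for _ in range(3)]
for rep in replicates:
    print(rep)
```

Each replicate has the same number of sequences and the same length K as the original alignment, which is what lets the same tree-building method be applied to it unchanged.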

Bootstrap
2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree
[Figure: each of the three resampled data sets yields its own tree over Sp1, Sp2, Sp3, Sp4]

Bootstrap
3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis
[Figure: the three bootstrap trees and the original tree over Sp1, Sp2, Sp3, Sp4, with node values 67% and 100%]
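The counting step can be sketched as below. This is a simplified illustration under stated assumptions: trees are written as nested tuples of leaf names and treated as rooted, so a "node" is identified by the set of leaves beneath it. For unrooted trees, such as neighbor-joining output, splits of the leaf set would be compared instead. The function names and the three replicate trees are made up for the example.

```python
def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of leaf names."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found

def support(original, replicates):
    """Percentage of replicate trees that contain each clade of the original."""
    counts = {clade: 0 for clade in clades(original)}
    for rep in replicates:
        rep_clades = clades(rep)
        for clade in counts:
            if clade in rep_clades:
                counts[clade] += 1
    return {clade: 100.0 * n / len(replicates) for clade, n in counts.items()}

original = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
replicates = [                                   # made-up replicate trees
    (("Sp1", "Sp2"), ("Sp3", "Sp4")),
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp3"), "Sp2"), "Sp4"),
]
for clade, pct in support(original, replicates).items():
    print(sorted(clade), round(pct))
```

In this made-up example Sp1+Sp2 appears in two of the three replicate trees, which gives the kind of 67% value shown on the slide.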

Step Bootstrap

Bootstrap values on NJPlot. Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb files. You might have to reopen the tree…