MCB 3421 class 25. Student evaluations: please go to HuskyCT and complete the student evaluations!

PSI (position-specific iterated) BLAST

The NCBI page described PSI-BLAST as follows: “Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.”

The Psi-Blast Approach

1. Use the results of a BlastP query to construct a multiple sequence alignment
2. Construct a position-specific scoring matrix (PSSM) from the alignment
3. Search the database with the scoring matrix derived from the alignment instead of the query sequence
4. Add matches to the alignment and repeat

Psi-Blast can use an existing multiple alignment, or use RPS-Blast to search a database of PSSMs
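Step 2 above can be illustrated with a minimal sketch. It assumes a toy, ungapped alignment, a uniform background frequency, and add-one pseudocounts. All three are simplifications chosen for illustration: PSI-BLAST itself weights sequences and derives its pseudocounts from a substitution matrix. The names `ALIGNMENT`, `build_pssm`, and `score` are hypothetical.

```python
import math
from collections import Counter

# Toy alignment (hypothetical sequences, equal length, no gaps).
ALIGNMENT = ["MATWL", "MATPL", "MSTWL", "MATWI"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: add-one smoothing


def build_pssm(alignment):
    """Return one {residue: log-odds score} dict per alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for aa in ALPHABET:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores for an ungapped segment."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


pssm = build_pssm(ALIGNMENT)
print(round(score(pssm, "MATWL"), 2), round(score(pssm, "GGGGG"), 2))
```

A segment that matches the conserved columns scores higher than an unrelated one, which is what lets the matrix replace the query sequence in the next round of searching.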

PSI BLAST scheme

Position-specific Matrix

M. Gribskov, A. D. McLachlan, and D. Eisenberg (1987) Profile analysis: detection of distantly related proteins. PNAS 84:

by Bob Friedman

PSI BLAST and E-values!

Psi-Blast is for finding matches among divergent sequences (position-specific information).

WARNING: For the nth iteration of a PSI-BLAST search, the E-value gives the expected number of chance matches to the profile, NOT to the initial query sequence! The danger is that the profile was corrupted in an earlier iteration.

PSI Blast from the command line

Often you want to run a PSI-BLAST search with two different databanks: one to create the PSSM, the other to get sequences.

To create the PSSM:

    blastpgp -d nr -i subI -j 5 -C subI.ckp -a 2 -o subI.out -h 0.00001 -F f
    blastpgp -d swissprot -i gamma -j 5 -C gamma.ckp -a 2 -o gamma.out -h 0.00001 -F f

Runs 4 iterations of a PSI-BLAST search (-j sets the maximum number of passes).

-h    tells the program to use matches with E < 10^-5 for the next iteration. Full description: e-value threshold for inclusion in multipass model [Real]. THE DEFAULT IS A RATHER HIGH NUMBER!!!
-C    creates a checkpoint (called subI.ckp)
-o    writes the output to subI.out
-i    specifies the input, here subI (a FASTA-formatted amino acid sequence)
-a 2  use two processors

The nr databank used is stored in /common/data/

It might help to use the node with more memory (017); the command is ssh node017.

Use of a PSSM:

    blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i subI -a 2 -R subI.ckp -o subI.out3 -F f
    blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i gamma -a 2 -R gamma.ckp -o gamma.out3 -F f

Runs another iteration of the same BLAST search, but uses the databank /Users/jpgogarten/genomes/msb8.faa.

-R    tells the program where to resume
-d    specifies a different databank
-i    input file, the same sequence as before
-o    output_filename
-a 2  use two processors
-h    e-value threshold for inclusion in multipass model [Real]. The default is a rather high number, but might be ok for the last iteration.
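The two-databank workflow can be scripted. The sketch below only assembles the two argument lists, using the flags that appear in the commands above, and does not run anything. The function names and defaults are hypothetical. It also assumes the checkpoint and output files are named after the query, as in the examples.

```python
import shlex


def pssm_build_cmd(query, db="nr", passes=5, inclusion_evalue="0.00001", cpus=2):
    """Argument list for the run that builds the PSSM and writes a checkpoint."""
    return [
        "blastpgp", "-d", db, "-i", query,
        "-j", str(passes),          # maximum number of passes
        "-C", f"{query}.ckp",       # checkpoint file holding the PSSM
        "-a", str(cpus),
        "-o", f"{query}.out",
        "-h", inclusion_evalue,     # inclusion threshold for the next iteration
        "-F", "f",
    ]


def pssm_search_cmd(query, target_db, cpus=2):
    """Argument list for one more pass that reuses the checkpoint on another databank."""
    return [
        "blastpgp", "-d", target_db, "-i", query,
        "-a", str(cpus),
        "-R", f"{query}.ckp",       # resume from the saved checkpoint
        "-o", f"{query}.out3",
        "-F", "f",
    ]


for cmd in (pssm_build_cmd("subI"), pssm_search_cmd("subI", "target_genome.faa")):
    print(shlex.join(cmd))
```

If the legacy `blastpgp` binary is installed, each list could be passed to `subprocess.run(cmd, check=True)`. Keeping the commands as lists avoids shell quoting problems with file paths.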

PSI Blast and finding gene families within genomes

Build the PSSM from the query sequence and a large database (nr is a good choice; if you know the annotation of the query sequences, you don’t need to worry about the annotations in the database). Then use the PSSM to search a genome:

A) Use the protein sequences encoded in the genome as target:

    blastpgp -d target_genome.faa -i query.name -a 2 -R query.ckp -o query.out3 -F f

B) Use the nucleotide sequence and tblastn. This is an advantage if you are also interested in pseudogenes, and/or if you don’t trust the genome annotation:

    blastall -i query.name -d target_genome_nucl.ffn -p psitblastn -R query.ckp

man wc

Comparison of blastp, PSIblastP, and psitblastn

    >wc -l blastp.out PSIblastP.out psitblastn.out
    34 blastp.out
    44 PSIblastP.out
    56 psitblastn.out
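The same line-count comparison can be done from a script. This is a minimal sketch that mimics `wc -l` by counting newline-terminated lines. It assumes that each line of an output file corresponds to one hit, which depends on the output format used. The demo files are throwaway stand-ins, not the real search results.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Count newline-terminated lines in a file, as `wc -l` does."""
    with open(path, "rb") as handle:
        return sum(line.endswith(b"\n") for line in handle)


# Demo with throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "blastp.out")
    a.write_text("hit1\nhit2\n")
    b = Path(tmp, "PSIblastP.out")
    b.write_text("hit1\nhit2\nhit3\n")
    counts = {p.name: count_lines(p) for p in (a, b)}

print(counts)
```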

ori

Finding transferred genes: screening in the wet lab and in the computer

Finding transferred genes

Taxplot at NCBI

Other approaches to find transferred genes

- Gene presence/absence data for closely related genomes (for additional genes)
- Phylogenetic conflict (for homologous replacement), e.g. quartet decomposition spectra, see Figs. 1 and 2
- Composition-based analyses (for very recent transfers)
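A composition-based screen can be sketched as follows. The example flags genes whose GC content deviates strongly from the genome-wide mean. Using GC content alone, the z-score cutoff of 2, and the toy sequences are all assumptions made for illustration. Real analyses typically use richer signals such as codon usage or oligonucleotide frequencies.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def atypical_genes(genes, z_cutoff=2.0):
    """Return names of genes whose GC content is an outlier (|z| >= z_cutoff)."""
    gcs = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gcs.values())
    sigma = pstdev(gcs.values())
    if sigma == 0:
        return []
    return [name for name, gc in gcs.items() if abs(gc - mu) / sigma >= z_cutoff]


# Toy genome (hypothetical): five genes at 50% GC and one at 90% GC.
genes = {f"gene{i}": "ATGCATGC" for i in range(1, 6)}
genes["alien"] = "GCGCGCGCGA"
print(atypical_genes(genes))
```

Because transferred genes gradually take on the composition of the recipient genome, a screen of this kind is only expected to pick up very recent transfers.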

Discussion of HGT from Bacteria to Tardigrades

We estimate that approximately one-sixth of tardigrade genes entered by HGT, nearly double the fraction found in the most extreme cases of HGT into animals known to date. Foreign genes have supplemented, expanded, and even replaced some metazoan gene families within the tardigrade genome. Our results demonstrate that an unexpectedly large fraction of an animal genome can be derived from foreign sources.

Source of genes in the H. dujardini genome as determined by HGT index calculations
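The slide does not spell out how the HGT index is calculated. One commonly used definition, assumed here, is the bit score of a gene's best non-metazoan BLAST hit minus the bit score of its best metazoan hit, with genes above a cutoff treated as candidates. The cutoff of 30 and the toy scores below are illustrative assumptions, not values taken from the slides.

```python
def hgt_index(best_nonmetazoan_bits, best_metazoan_bits):
    """Assumed definition: best non-metazoan bit score minus best metazoan bit score."""
    return best_nonmetazoan_bits - best_metazoan_bits


def hgt_candidates(best_hits, cutoff=30.0):
    """Return gene names whose HGT index is at least the (assumed) cutoff."""
    return [
        gene
        for gene, (nonmetazoan, metazoan) in best_hits.items()
        if hgt_index(nonmetazoan, metazoan) >= cutoff
    ]


# Hypothetical best bit scores: (best non-metazoan hit, best metazoan hit).
best_hits = {
    "geneA": (250.0, 40.0),   # much better bacterial hit
    "geneB": (120.0, 300.0),  # clearly metazoan
    "geneC": (80.0, 60.0),    # slightly better non-metazoan hit, below cutoff
}
print(hgt_candidates(best_hits))
```

An index of this kind depends only on the best database hits, so contaminating bacterial sequence in an assembly produces high values just as genuine transfers do.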

Discussion of HGT from Bacteria to Tardigrades

BIOARCHIVES doi:

“While the raw data indicated extensive contamination with bacteria, presumably from the gut or surface of the animals, careful cleaning generated a clean tardigrade dataset for assembly.”

Our assembly, and inferences from it, conflict with a recently published draft genome (UNC) 6 for what is essentially the same strain of H. dujardini. Our assembly, despite having superior assembly statistics, is ~120 Mb shorter than the UNC assembly. Our genome size estimate from sequence assembly is congruent with the values we obtained by direct measurement. We find 15,000 fewer protein-coding genes, and a hugely reduced impact of predicted HGT on gene content in H. dujardini. These HGT candidates await detailed validation. While resolution of the conflict between these assemblies awaits detailed examination based on close scrutiny of the raw UNC data, our analyses suggest that the UNC assembly is compromised by sequences that derive from bacterial contaminants, and that the expanded genome span, additional genes, and HGT candidates are likely to be artefactual.

Figure 4: Mapping of read data to UNC assembly identifies non-shared contaminants and no expression from bacterial scaffolds

(A) Blobplot showing the UNC assembly contigs distributed by GC proportion and coverage derived from the UNC raw genomic sequence data (data file TG-300). Scaffold points are scaled by length, and coloured based on taxonomic assignment of the sum of the best BLAST and Diamond matches for all the genes on the scaffold. Taxonomic assignments are summed by phylum.

(B) Blobplot showing the UNC assembly contigs distributed by GC proportion and coverage derived from the Edinburgh raw genomic sequence data. Scaffold points are scaled by length, and coloured based on taxonomic assignment of the sum of the best BLAST and Diamond matches for all the genes on the scaffold. Taxonomic assignments are summed by phylum.

UNC reads and Edinburgh reads, both mapped on the UNC assembly
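The logic of comparing the two read sets can be sketched as follows: a scaffold that is well covered by one read set but essentially absent from the other is a candidate non-shared contaminant. The coverage thresholds, scaffold names, and values below are hypothetical and serve only to illustrate the comparison.

```python
def non_shared_scaffolds(coverage, min_cov=10.0, max_other=1.0):
    """Flag scaffolds covered by one read set but (nearly) absent from the other.

    coverage maps scaffold name -> (coverage from UNC reads, coverage from Edinburgh reads).
    """
    flagged = {}
    for name, (unc, edinburgh) in coverage.items():
        if unc >= min_cov and edinburgh <= max_other:
            flagged[name] = "UNC reads only"
        elif edinburgh >= min_cov and unc <= max_other:
            flagged[name] = "Edinburgh reads only"
    return flagged


# Hypothetical per-scaffold mean coverage values.
coverage = {
    "scaf1": (45.0, 50.0),  # covered by both read sets
    "scaf2": (30.0, 0.2),   # present only in the UNC reads
    "scaf3": (0.0, 25.0),   # present only in the Edinburgh reads
}
print(non_shared_scaffolds(coverage))
```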