Presentation is loading. Please wait.

Presentation is loading. Please wait.

MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Similar presentations


Presentation on theme: "MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !"— Presentation transcript:

1 MCB 3421 class 25

2 student evaluations Please go to husky CT and complete student evaluations !

3 PSI (position-specific iterated) BLAST The NCBI page described PSI blast as follows: “Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.”

4 The Psi-Blast Approach 1. Use results of BlastP query to construct a multiple sequence alignment 2. Construct a position-specific scoring matrix from the alignment 3. Search database with alignment instead of query sequence 4. Add matches to alignment and repeat Psi-Blast can use existing multiple alignment, or use RPS-Blast to search a database of PSSMs

5 PSI BLAST scheme

6 Position-specific Matrix M Gribskov, A D McLachlan, and D Eisenberg (1987) Profile analysis: detection of distantly related proteins. PNAS 84:4355-8. by Bob Friedman

7 Psi-Blast is for finding matches among divergent sequences (position- specific information) WARNING: For the nth iteration of a PSI BLAST search, the E-value gives the number of matches to the profile NOT to the initial query sequence! The danger is that the profile was corrupted in an earlier iteration. PSI BLAST and E-values!

8 Often you want to run a PSIBLAST search with two different databanks - one to create the PSSM, the other to get sequences: To create the PSSM: blastpgp -d nr -i subI -j 5 -C subI.ckp -a 2 -o subI.out -h 0.00001 -F f blastpgp -d swissprot -i gamma -j 5 -C gamma.ckp -a 2 -o gamma.out -h 0.00001 -F f Runs 4 iterations of a PSIblast the -h option tells the program to use matches with E <10^-5 for the next iteration, (the default is 10 -3 ) -C creates a checkpoint (called subI.ckp), -o writes the output to subI.out, -i option specifies input as using subI as input (a fasta formated aa sequence). The nr databank used is stored in /common/data/ -a 2 use two processors -h e-value threshold for inclusion in multipass model [Real] default = 0.002 THIS IS A RATHER HIGH NUMBER!!! (It might help to use the node with more memory (017) (command is ssh node017) PSI Blast from the command line

9 Use of a PSSM: blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i subI -a 2 -R subI.ckp -o subI.out3 -F f blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i gamma -a 2 -R gamma.ckp -o gamma.out3 -F f Runs another iteration of the same blast search, but uses the databank /Users/jpgogarten/genomes/msb8.faa -R tells the program where to resume -d specifies a different databank -i input file - same sequence as before -o output_filename -a 2 use two processors -h e-value threshold for inclusion in multipass model [Real] default = 0.002. This is a rather high number, but might be ok for the last iteration.

10 PSI Blast and finding gene families within genomes use PSSM to search a genome: A)Use protein sequences encoded in genome as target: blastpgp -d target_genome.faa -i query.name -a 2 -R query.ckp -o query.out3 -F f B) Use nucleotide sequence and tblastn. This is an advantage if you are also interested in pseudogenes, and/or if you don’t trust the genome annotation: blastall -i query.name -d target_genome_nucl.ffn -p psitblastn -R query.ckp Build PSSM from query sequence and a large database (nr is a good choice – if you know the annotation of the query sequences, you don’t need to worry about the annotations in the database)

11 man wc

12 >wc -l blastp.out PSIblastP.out psitblastn.out 34 blastp.out 44 PSIblastP.out 56 psitblastn.out Comparison of blastp, PSIblastP, and psitblastn

13 1 2 3 4 5678 1 8 3 6 5 7 2 4

14 ori 1 2 3 5 6 7 8 4

15 Finding transferred genes Screening in the wet-lab and in the computer

16 Finding transferred genes

17 Taxplot at NCBI

18

19 Other approaches to find transferred genes Gene presence absence data for closely related genomes (for additional genes) Phylogenetic conflict (for homologous replacement (e.g. quartet decompositon spectra see Figs. 1 and 2 ) quartet decompositon spectra Composition based analyses (for very recent transfers).

20 Discussion of HGT from Bacteria to TardigradesTardigrades We estimate that approximately one-sixth of tardigrade genes entered by HGT, nearly double the fraction found in the most extreme cases of HGT into animals known to date. Foreign genes have supplemented, expanded, and even replaced some metazoan gene families within the tardigrade genome. Our results demonstrate that an unexpectedly large fraction of an animal genome can be derived from foreign sources.

21 Source of genes in the H. dujardini genome as determined by HGT index calculations

22 Discussion of HGT from Bacteria to Tardigrades

23 BIOARCHIVES doi: http://dx.doi.org/10.1101/033464 http://biorxiv.org/content/early/2015/12/01/033464

24 “While the raw data indicated extensive contamination with bacteria, presumably from the gut or surface of the animals, careful cleaning generated a clean tardigrade dataset for assembly.”

25 Our assembly, and inferences from it, conflict with a recently published draft genome (UNC) 6 for what is essentially the same strain of H. dujardini. Our assembly, despite having superior assembly statistics, is ~120 Mb shorter than the UNC assembly. Our genome size estimate from sequence assembly is congruent with the values we obtained by direct measurement. We find 15,000 fewer protein-coding genes, and a hugely reduced impact of predicted HGT on gene content in H. dujardini. These HGT candidates await detailed validation. While resolution of the conflict between these assemblies awaits detailed examination based on close scrutiny of the raw UNC data, our analyses suggest that the UNC assembly is compromised by sequences that derive from bacterial contaminants, and that the expanded genome span, additional genes, and HGT candidates are likely to be artefactual.

26 Figure 4: Mapping of read data to UNC assembly identifies non-shared contaminants and no expression from bacterial scaffolds A Blobplot showing the UNC assembly contigs distributed by GC proportion and coverage derived from the UNC raw genomic sequence data (data file TG-300). Scaffold points are scaled by length, and coloured based on taxonomic assignment of the sum of the best BLAST and Diamond matches for all the genes on the scaffold. Taxonomic assignments are summed by phylum. B Blobplot showing the UNC assembly contigs distributed by GC proportion and coverage derived from the Edinburgh raw genomic sequence data. Scaffold points are scaled by length, and coloured based on taxonomic assignment of the sum of the best BLAST and Diamond matches for all the genes on the scaffold. Taxonomic assignments are summed by phylum.

27 UNC reads Edinburgh reads both mapped on the UNC assembly


Download ppt "MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !"

Similar presentations


Ads by Google