Finding new nirK genes in metagenomic data


What is nirK? nirK encodes one kind of nitrite reductase (the copper-containing type), an enzyme involved in denitrification. Denitrification is an essential part of nitrogen cycling. The main processes of nitrogen cycling are described next.

Nitrogen Cycling. This picture shows the processes in nitrogen cycling: nitrogen fixation, nitrification, and denitrification. They are important for the global nitrogen equilibrium. In general, denitrification occurs where oxygen, a more energetically favorable electron acceptor than the nitrogen oxides, is depleted, and bacteria respire nitrate as a substitute terminal electron acceptor. Because of the high concentration of oxygen in our atmosphere, denitrification only takes place in environments where oxygen consumption exceeds the rate of oxygen supply, such as some soils and groundwater, wetlands, poorly ventilated corners of the ocean, and seafloor sediments. NirK is the nitrite reductase that reduces nitrite to nitric oxide.

In denitrification, nitrate (+5) is reduced to nitrite (+3), then to nitric oxide (+2), then to nitrous oxide (+1), and finally to dinitrogen (0), with each step catalyzed by a different reductase. Nitrite reductase turns nitrite into nitric oxide, the first gaseous product in denitrification, so it is the key enzyme and has numerous sequences available now. Nitrous oxide is an important factor in global warming and ozone depletion: over a 100-year period, its global warming potential is 298 times that of carbon dioxide per unit weight. We collected soil samples from the KBS LTER (Long Term Ecological Research) site, where nirS is not detected, so nirK was selected as our target gene.
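The oxidation-state bookkeeping on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the names `CHAIN`, `electrons_per_step` and `co2_equivalent` are made up for this example, and the numbers are the ones quoted on the slide.

```python
# Oxidation state of nitrogen in each intermediate, as listed on the slide.
CHAIN = [
    ("nitrate", 5),
    ("nitrite", 3),
    ("nitric oxide", 2),
    ("nitrous oxide", 1),
    ("dinitrogen", 0),
]


def electrons_per_step(chain):
    """Electrons gained per nitrogen atom at each reduction step."""
    return [(src, dst, s - d) for (src, s), (dst, d) in zip(chain, chain[1:])]


def co2_equivalent(mass_n2o, gwp=298):
    """CO2-equivalent mass for a mass of N2O, using the 100-year GWP from the slide."""
    return mass_n2o * gwp


if __name__ == "__main__":
    for src, dst, n in electrons_per_step(CHAIN):
        print(f"{src} -> {dst}: {n} electron(s) per N atom")
    print(co2_equivalent(1.0), "kg CO2-equivalent per kg N2O")
```

The nitrite to nitric oxide step, the one carried out by nirK, is a one-electron reduction per nitrogen atom.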

Metagenomic Datasets. 2 samples from agricultural soil, 2 sequencing runs per sample (Roche 454 pyrosequencing). 2 samples from forest soil, 2 sequencing runs per sample (Roche 454 pyrosequencing). Data are from the Tom Schmidt Lab. Agricultural soil and forest soil were chosen because fertilizer (nitrate) added to the soil might decrease or increase the number of denitrifiers.

Methods. Start with sequence similarity search software: HMMER. HMMER is an implementation of profile hidden Markov models (profile HMMs) for biological sequence analysis. Profile HMMs are built from a multiple sequence alignment of known members of a given protein family, produced by an alignment tool. Profile HMMs have a global and a local mode. Local mode is used in my research.

Advantages over BLAST. HMMs have a formal probabilistic basis: probability theory guides how all the scoring parameters should be set. HMMs have a consistent theory behind gap and insertion scores. But they are much slower than BLAST. They are useful for searching or annotating the domain structure of proteins and for finding the sequences that belong to a protein family.

HMMER components. HMMER has components to build a profile HMM (hmmbuild), to search a profile against a sequence database (hmmsearch), and to align sequences to an existing profile (hmmalign). hmmbuild, hmmcalibrate, hmmsearch, and hmmalign are the ones mainly used.

Pipeline. Six different, well-characterized nirK genes are made into a profile HMM, which is searched against the soil data. BLAST is the most popular sequence similarity search tool, so I am interested in how the search results of the two tools differ.
HMMER branch: FunGene pipeline download -> 6 good known nirKs -> clustalw -> multiple alignment -> hmmbuild -> profile HMM -> hmmcalibrate -> hmmsearch against soil data -> potential nirKs.
BLAST branch: blast against soil data -> BLAST nirK result.
Final step: compare the BLAST nirK result with the potential nirKs from hmmsearch.
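As a rough sketch of the HMMER branch of this pipeline, the snippet below assembles the three commands in order. It assumes a HMMER 2.x installation (hmmcalibrate exists only in the 2.x series) and uses only the bare positional form of each program. The file names are hypothetical, and the option that selects local mode is deliberately left out: check the user guide of the installed version for it. The function only builds the argument lists and does not run anything.

```python
import shutil


def hmmer2_commands(alignment, hmm_file, sequence_db):
    """Return the HMMER 2.x command lines for build, calibrate and search.

    All three file names are placeholders supplied by the caller.
    """
    return [
        ["hmmbuild", hmm_file, alignment],      # alignment -> profile HMM
        ["hmmcalibrate", hmm_file],             # calibrate E-value statistics
        ["hmmsearch", hmm_file, sequence_db],   # profile vs. sequence database
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in hmmer2_commands("nirk_seeds.aln", "nirk.hmm", "soil_reads.fasta"):
        status = "found" if shutil.which(cmd[0]) else "not on PATH"
        print(" ".join(cmd), f"[{status}]")
```

Each list can be passed to `subprocess.run` once the programs are installed and the file names point at real data.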

BLAST and HMMER results (6-seed profile). Input files for each run: /u/gjr/nirk2/<run>_dereplicated_blastp.txt <==========> /u/gjr/nirk2/<run>_dereplicated_localhmm.txt

run          blastOnly  shared       hmmOnly
ma1w2_run1   23         6            2
ma1w2_run2   28         8            4
ma1w4_run1   24         (not given)  5
ma1w4_run2   34         16           5

Interesting.
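The blastOnly / shared / hmmOnly counts are plain set operations on the hit identifiers from the two tools. The sketch below shows one way to compute them. It assumes each result file has been reduced to one hit per line with the read identifier in the first column, which may not match the actual file layout, and the read names in the demo are invented.

```python
def read_ids(path):
    """Collect the first column of every non-empty, non-comment line."""
    with open(path) as handle:
        return {
            line.split()[0]
            for line in handle
            if line.strip() and not line.startswith("#")
        }


def compare_hits(blast_ids, hmm_ids):
    """Count hits found only by BLAST, by both tools, and only by HMMER."""
    blast_ids, hmm_ids = set(blast_ids), set(hmm_ids)
    return {
        "blastOnly": len(blast_ids - hmm_ids),
        "shared": len(blast_ids & hmm_ids),
        "hmmOnly": len(hmm_ids - blast_ids),
    }


if __name__ == "__main__":
    # Invented read identifiers, for illustration only.
    blast = {"read1", "read2", "read3", "read4"}
    hmm = {"read3", "read4", "read5"}
    print(compare_hits(blast, hmm))
```

With real data, `compare_hits(read_ids(blast_file), read_ids(hmm_file))` would give the three counts for one sequencing run.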

Profile matters! Run hmmsearch with the 6-seed profile HMM against all 3055 FunGene nirKs (some may not be real nirKs…) and look at the E-value distribution.

6Seed profile e-value distribution. The sequences on the left of the distribution (124 of them) are made into a new profile.

124Seq e-value distribution

Cumulative curve. The green line (124Seq) is above the blue line (6Seq) near -50. This means the whole distribution of E-values moves a little to the left. The smaller the E-value, the more likely it is that the sequence is a nirK. Is the 124-sequence profile relatively better? At least from this perspective, it is.
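A cumulative curve like this can be computed from the list of E-values alone. The sketch below assumes that the x-axis of the plot is log10 of the E-value (so that -50 means 1e-50), which the slide does not state, and the E-values in the demo are invented. E-values reported as 0 are floored at a tiny positive number before taking the logarithm.

```python
import math
from bisect import bisect_right


def cumulative_fraction(evalues, thresholds, floor=1e-300):
    """Fraction of hits whose log10(E-value) is at or below each threshold."""
    if not evalues:
        raise ValueError("need at least one E-value")
    logs = sorted(math.log10(max(e, floor)) for e in evalues)
    return [bisect_right(logs, t) / len(logs) for t in thresholds]


if __name__ == "__main__":
    # Invented E-values for two profiles searched against the same database.
    profile_a = [1e-60, 1e-55, 1e-40, 1e-5]
    profile_b = [1e-70, 1e-65, 1e-52, 1e-5]
    thresholds = [-50, -10, 0]
    print("A:", cumulative_fraction(profile_a, thresholds))
    print("B:", cumulative_fraction(profile_b, thresholds))
```

If one profile's curve is higher at a given threshold, a larger share of its hits have E-values at or below that threshold, which is the sense in which the distribution "moves left".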

124Seq profile: HMMER and BLAST results. Input files for each run: /u/gjr/nirk3/<run>_dereplicated.blastp.txt <==========> /u/gjr/nirk3/<run>_dereplicated.localhmm.txt

run          blastOnly  shared  hmmOnly
ma1w2_run1   112        7       0
ma1w2_run2   129        8       0
ma1w4_run1   109        10      (not given)
ma1w4_run2   120        18      0

The HMMER results are completely covered by the BLAST results. I think the BLAST result contains a lot of bad nirKs, but we still cannot tell which hits are real nirKs. So we try other methods first, then come back to the BLAST problem.

Then the tree method. Just to show the idea. [Figure: two example trees, one with leaves nirK1, Seq1 (good), nirK2 and one with leaves nirK1, Seq2 (bad).]

Tree pipeline: NCBI nirK (cultured) + soil BLAST result + soil HMMER result -> hmmalign with the 6-seq profile -> quicktree -> tree.

The tree is so big that it is hard to draw any conclusion from it. I am considering writing a program to parse this tree.
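One possible shape for such a program is sketched below. It assumes the tree is available as a Newick (New Hampshire) string with branch lengths and no quoted labels, and it reports, for every non-reference leaf, the closest reference nirK by path length. The function names, the example tree and its branch lengths are all invented for illustration, and this is not the method used in the presentation.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split())
    if not text.endswith(";"):
        text += ";"
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()


def leaf_paths(tree):
    """Map each leaf name to its root-to-leaf path of (node id, depth)."""
    paths = {}

    def walk(current, path, depth):
        name, length, children = current
        depth += length
        path = path + [(id(current), depth)]
        if not children:
            paths[name] = path
        for child in children:
            walk(child, path, depth)

    walk(tree, [], 0.0)
    return paths


def distance(path_a, path_b):
    """Path length between two leaves through their last common ancestor."""
    lca_depth = 0.0
    for (id_a, depth_a), (id_b, _) in zip(path_a, path_b):
        if id_a != id_b:
            break
        lca_depth = depth_a
    return path_a[-1][1] + path_b[-1][1] - 2 * lca_depth


def nearest_reference(newick, reference_names):
    """For each non-reference leaf, return (closest reference, distance)."""
    tree = parse_newick(newick)           # keep the tree alive while ids are used
    paths = leaf_paths(tree)
    refs = [name for name in paths if name in reference_names]
    result = {}
    for name, path in paths.items():
        if name in reference_names:
            continue
        best = min(refs, key=lambda ref: distance(path, paths[ref]))
        result[name] = (best, distance(path, paths[best]))
    return result


# Invented example tree: Seq1 sits close to nirK1, Seq2 on a long branch.
EXAMPLE = "((nirK1:0.1,Seq1:0.2):0.05,(nirK2:0.1,Seq2:0.9):0.3);"

if __name__ == "__main__":
    for leaf, (ref, dist) in nearest_reference(EXAMPLE, {"nirK1", "nirK2"}).items():
        print(f"{leaf}: nearest reference {ref}, distance {dist:.2f}")
```

A distance cutoff could then separate candidates that sit among known nirKs from those on long branches. Choosing that cutoff is a separate question this sketch does not answer.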

Questions to answer: (1) What is the best definition of nirK according to current information? (2) What criteria should be used to choose seeds for the profile HMM? (3) How should the BLAST false-positive problem be handled?

Thanks