Finding new nirK genes in metagenomic data
What is nirK?
nirK encodes one kind of nitrite reductase, an enzyme involved in denitrification. Denitrification is an essential part of nitrogen cycling. The following are important members of nitrogen cycling.
Nitrogen Cycling
This picture shows the processes in nitrogen cycling: nitrogen fixation, nitrification, and denitrification. They are important for the global nitrogen equilibrium. In general, denitrification occurs where oxygen, a more energetically favorable electron acceptor than nitrate, is depleted, and bacteria respire nitrate as a substitute terminal electron acceptor. Because of the high concentration of oxygen in our atmosphere, denitrification only takes place in environments where oxygen consumption exceeds the rate of oxygen supply, such as some soils and groundwater, wetlands, poorly ventilated corners of the ocean, and seafloor sediments. NirK is the nitrite reductase that reduces nitrite to nitric oxide.
In denitrification, nitrate (+5) is reduced to nitrite (+3), to nitric oxide (+2), to nitrous oxide (+1), and finally to nitrogen gas (0) by different denitrifiers. Nitrite reductase converts nitrite into nitric oxide, the first gaseous product in denitrification, so it is the key enzyme, and numerous sequences are available now. Nitrous oxide is an important factor in global warming and ozone depletion: over a 100-year period, its global warming potential is 298 times that of carbon dioxide per unit weight. We collected soil samples from KBS LTER (Long Term Ecological Research), where nirS is not detected, so nirK was selected as our target gene.
Metagenomic Datasets
2 samples from agricultural soil, 2 sequencing runs per sample (Roche 454 pyrosequencing)
2 samples from forest soil, 2 sequencing runs per sample (Roche 454 pyrosequencing)
Data are from the Tom Schmidt Lab.
Agricultural soil and forest soil were chosen because fertilizer (nitrate) added to the soil might decrease or increase the number of denitrifiers.
Methods
Start with sequence similarity search software: HMMER
HMMER is an implementation of profile hidden Markov models (profile HMMs) for biological sequence analysis.
Profile HMMs are built from a multiple sequence alignment of known members of a given protein family, produced by an alignment tool.
Profile HMMs have a global mode and a local mode. Local mode is used in my research.
Advantages over BLAST
HMMs have a formal probabilistic basis: probability theory guides how all the scoring parameters should be set.
HMMs have a consistent theory behind gap and insertion scores.
But HMMER is much slower than BLAST.
Useful for searching or annotating protein domain structures and for finding sequences that belong to a protein family.
HMMER components
HMMER has components:
to build a profile HMM: hmmbuild
to search a profile against a sequence database: hmmsearch
to align sequences to an existing profile: hmmalign
hmmbuild, hmmcalibrate, hmmsearch, and hmmalign are the ones mainly used.
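The components listed above are run in a fixed order. The following is a minimal sketch in Python that only assembles the command lines and does not execute them. The file names are hypothetical placeholders, hmmcalibrate exists only in HMMER 2.x, and the option that selects local mode is left out on purpose because it differs between HMMER versions.

```python
def hmmer_commands(alignment, hmm_file, sequence_db, hits_fasta):
    """Return HMMER command lines in the order build, calibrate, search, align.

    All four file names are hypothetical placeholders supplied by the caller.
    No options are added; check the user guide of the installed HMMER
    version for the option that selects local mode.
    """
    return [
        ["hmmbuild", hmm_file, alignment],      # profile HMM from the seed alignment
        ["hmmcalibrate", hmm_file],             # HMMER 2.x only: calibrate E-value statistics
        ["hmmsearch", hmm_file, sequence_db],   # search the profile against the reads
        ["hmmalign", hmm_file, hits_fasta],     # align retrieved hits to the profile
    ]


if __name__ == "__main__":
    for cmd in hmmer_commands("nirk_seeds.aln", "nirk.hmm",
                              "soil_reads.fasta", "nirk_hits.fasta"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run once the file names and options have been checked against the local installation.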
Fungene pipeline: download 6 good known nirKs -> clustalw -> multiple alignment format
blast against soil data -> BLAST nirK result
hmmbuild -> hmmcalibrate -> profile HMM -> hmmsearch against soil data -> potential nirKs
Compare the BLAST nirK result with the potential nirKs.
6 different and well-characterized nirK genes are made into a profile HMM, which is searched against the soil data. BLAST is the most popular sequence similarity search tool, so I am interested in the difference between the search results of the two tools.
BLAST and HMMER results
input files: /u/gjr/nirk2/ma1w2_run1_dereplicated_blastp.txt <==========> /u/gjr/nirk2/ma1w2_run1_dereplicated_localhmm.txt
blastOnly: 23 shared : 6 hmmOnly : 2
input files: /u/gjr/nirk2/ma1w2_run2_dereplicated_blastp.txt <==========> /u/gjr/nirk2/ma1w2_run2_dereplicated_localhmm.txt
blastOnly: 28 shared : 8 hmmOnly : 4
input files: /u/gjr/nirk2/ma1w4_run1_dereplicated_blastp.txt <==========> /u/gjr/nirk2/ma1w4_run1_dereplicated_localhmm.txt
blastOnly: 24 hmmOnly : 5
input files: /u/gjr/nirk2/ma1w4_run2_dereplicated_blastp.txt <==========> /u/gjr/nirk2/ma1w4_run2_dereplicated_localhmm.txt
blastOnly: 34 shared : 16 hmmOnly : 5
Interesting
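The blastOnly / shared / hmmOnly counts are set differences and an intersection over the read IDs reported by each tool. A minimal sketch in Python follows. It is not the script that produced the numbers above, and the input layout (one hit per line, read ID in the first whitespace-separated field, lines starting with "#" ignored) is an assumption about the result files.

```python
def read_ids(path):
    """Collect read IDs from a result file.

    Assumed layout (hypothetical): one hit per line, read ID in the first
    whitespace-separated field, blank lines and '#' comment lines skipped.
    """
    ids = set()
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                ids.add(line.split()[0])
    return ids


def compare_hits(blast_ids, hmm_ids):
    """Count reads found only by BLAST, by both tools, and only by HMMER."""
    blast_set, hmm_set = set(blast_ids), set(hmm_ids)
    return {
        "blastOnly": len(blast_set - hmm_set),
        "shared": len(blast_set & hmm_set),
        "hmmOnly": len(hmm_set - blast_set),
    }


if __name__ == "__main__":
    # Toy IDs, not taken from the soil data.
    print(compare_hits(["r1", "r2", "r3"], ["r2", "r3", "r4"]))
```

Working on sets also removes duplicate IDs, so a read that is hit several times by one tool is counted once.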
Profile matters!
hmmsearch with the 6-seed profile HMM against all 3055 FunGene nirKs (some may not be real nirKs…)
See the E-value distribution.
6-seed profile e-value distribution
Make the 124 sequences on the left into a profile.
124Seq e-value distribution
Cumulative curve
The green line (124Seq) is above the blue line (6Seq) near -50. This means the whole distribution of e-values moves left a little. For e-values, the smaller the value, the more likely the sequence is a nirK. Is the 124-sequence profile relatively better? At least from this perspective, it is.
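A cumulative curve like the one described here can be computed as the fraction of hits whose e-value falls at or below each threshold. The sketch below is a minimal Python version under one assumption: that the x-axis of the plot is log10 of the e-value, which is how a value such as -50 would arise. The e-values in the usage example are made up for illustration.

```python
import math


def cumulative_fraction(evalues, log10_thresholds):
    """Fraction of hits with log10(e-value) <= each threshold.

    An e-value reported as 0 is treated as minus infinity on the log scale.
    """
    logs = [math.log10(e) if e > 0 else float("-inf") for e in evalues]
    total = len(logs)
    return [sum(1 for x in logs if x <= t) / total for t in log10_thresholds]


if __name__ == "__main__":
    # Made-up e-values for two profiles searched against the same sequences.
    profile_a = [1e-60, 1e-50, 1e-10, 1.0]
    profile_b = [1e-70, 1e-65, 1e-58, 1e-5]
    for t in (-55, -5, 1):
        print(t, cumulative_fraction(profile_a, [t]), cumulative_fraction(profile_b, [t]))
```

If one curve lies above the other at a given threshold, a larger share of that profile's hits have e-values at or below it, which is the leftward shift described on the slide.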
124Seq profile HMMER and BLAST result
input files: /u/gjr/nirk3/ma1w2_run1_dereplicated.blastp.txt <==========> /u/gjr/nirk3/ma1w2_run1_dereplicated.localhmm.txt
blastOnly: 112 shared : 7 hmmOnly : 0
input files: /u/gjr/nirk3/ma1w2_run2_dereplicated.blastp.txt <==========> /u/gjr/nirk3/ma1w2_run2_dereplicated.localhmm.txt
blastOnly: 129 shared : 8 hmmOnly : 0
input files: /u/gjr/nirk3/ma1w4_run1_dereplicated.blastp.txt <==========> /u/gjr/nirk3/ma1w4_run1_dereplicated.localhmm.txt
blastOnly: 109 shared : 10
input files: /u/gjr/nirk3/ma1w4_run2_dereplicated.blastp.txt <==========> /u/gjr/nirk3/ma1w4_run2_dereplicated.localhmm.txt
blastOnly: 120 shared : 18 hmmOnly : 0
The HMMER results are completely covered by the BLAST results. I think the BLAST result contains a lot of bad nirKs. We still need to tell which hits are real nirKs, so try other methods first, then come back to the BLAST problem.
Then the tree method
Just to show the idea:
[Figure: example tree with leaves nirK1, Seq1 (good), nirK2, nirK1, Seq2 (bad)]
NCBI nirK (cultured) + soil BLAST result + soil HMMER result -> hmmalign with the 6-seq profile -> quicktree -> tree
The tree is too big to draw any conclusion from it. Considering writing a program to parse this tree.
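One possible shape for such a program is sketched below in Python, assuming the tree is available as a Newick string. It reads the tree into nested tuples and reports, for each soil sequence, the reference nirKs in the smallest clade that contains both. The rule that reference names start with "nirK" is hypothetical, and quoted labels, comments, and branch lengths are not handled or used.

```python
def parse_newick(text):
    """Parse a Newick string into nested tuples; leaves are name strings.

    Branch lengths and internal node labels are skipped.
    """
    text = "".join(text.split())  # drop whitespace and line breaks
    pos = 0

    def skip_label():
        nonlocal pos
        while pos < len(text) and text[pos] not in ",();":
            pos += 1

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # consume ")"
            skip_label()                  # internal label and branch length
            return tuple(children)
        start = pos
        skip_label()
        return text[start:pos].split(":")[0]

    return node()


def leaves(tree):
    """Return all leaf names below a node."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def nearest_references(tree, is_reference):
    """Map each non-reference leaf to the references in its smallest shared clade."""
    result = {}

    def visit(subtree):
        if isinstance(subtree, str):
            return
        for child in subtree:             # deeper clades are handled first
            visit(child)
        names = leaves(subtree)
        refs = [n for n in names if is_reference(n)]
        if refs:
            for n in names:
                if not is_reference(n) and n not in result:
                    result[n] = refs

    visit(tree)
    return result


if __name__ == "__main__":
    toy = "((nirK1:0.1,Seq1:0.2):0.05,(nirK2:0.1,(Seq2:0.9,Seq3:0.8):0.1):0.3);"
    # Hypothetical naming rule: reference sequences start with "nirK".
    print(nearest_references(parse_newick(toy), lambda n: n.startswith("nirK")))
```

The traversal is recursive, so a very deep tree may need a raised recursion limit or an iterative rewrite. Taking branch lengths into account would be a natural next step for separating good hits from bad ones.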
Questions to answer
Best definition of nirK according to the current information
Criteria for choosing seeds for the profile HMM
BLAST false positive problem
Thanks