Published by Ambrose Barber. Modified over 6 years ago.
1
Estimation of human endogenous retrovirus activities from expressed sequence databases
Merja Oja, Jaakko Peltonen, Sami Kaski
University of Helsinki and Helsinki University of Technology
2
Human endogenous retroviruses
Human endogenous retroviruses, or HERVs for short, are DNA sequences in the human genome.
A retrovirus (HIV, for example, is a retrovirus) becomes endogenous when it infects a germ-line cell and is then inherited by the children of the host.
HERVs are remains of ancient infections that happened tens of millions of years ago.
About 8% of the human genome is retroviral.
3
Proliferation of endogenous retroviruses
HERVs belong to the transposable elements; in other words, they are sequences that have been able to move or copy themselves within the genome.
The present-day HERVs are mutated and mainly unable to transpose.
HERVs have moved independently in our genome, so it now contains several mutated versions of the originals.
Some terminology: elements with a common ancestor are grouped into families.
4
Why are HERVs interesting?
HERVs can express viral genes in human tissues. Is this connected to disease?
Some normal human genes are derived from HERVs. A few such instances are known, which means that some HERVs have adopted functions beneficial to us.
HERVs may regulate nearby human genes: their presence in the genome may affect the function of those genes, for example by providing promoters or extra exons.
HERVs offer recombination sites for genome reorganizations.
5
Why measure HERV activities?
HERVs are expressed in diseased tissues. What is their role? The connection between HERVs and disease is unknown. Is there a difference between healthy and diseased cases?
HERVs are expressed in healthy tissues. Are there more HERV-derived genes?
What controls HERV activation? To study this we need the genomic neighborhood and the activity level of each HERV. With the DNA surrounding the HERV we can search for transcription factor binding sites.
We want to measure the activities of all HERVs to get a large data set for further analysis.
6
Current methods and problems
HERV expression can be measured with:
RT-PCR (real-time polymerase chain reaction): an accurate technique, but it measures the activity of one target at a time.
A retrovirus microarray (retrovirus chip): it measures several HERVs at the same time, but it is expensive and noisy.
Problem: all these methods give activities for families. There are hundreds of sequences in a family, and we do not know which of them are active. It is difficult to design probes that target only individual sequences within a family.
We want activities for individual HERVs. This is needed in order to analyze the control mechanisms and to find potential disease-associated HERVs.
7
Our approach
We introduce a model for estimating HERV activities from sequence databases.
The data are expressed sequence tags (ESTs). ESTs are short and noisy versions of the gene transcripts present in cells: a snapshot of gene activity (of all mRNA) in a cell.
dbEST contains millions of human ESTs.
Principle: the activity of a HERV is estimated from the number of ESTs that come from it.
8
Cross-talk problem
Counting the ESTs coming from one particular HERV is not easy.
ESTs are noisy and HERVs are very similar to each other: the noise level in ESTs can be larger than the sequence differences within a family.
An EST could therefore be coming from a number of HERVs, and it may be hard to determine exactly which HERV it stems from. This is cross-talk between HERVs.
We model the uncertainty in the matching statistically.
Our model gives relative activity levels for the HERVs based on EST data.
9
Generative model for ESTs
We introduce a generative mixture model for the ESTs.
The model mimics the actual biological generation process.
Each mixture component is a hidden Markov model.
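As a sketch, a mixture of this kind can be written as follows. The symbols are notation assumed here, not taken from the slides: x is one EST, N is the number of HERVs, \theta_h denotes the parameters of the hidden Markov model for HERV h, and a_h is the mixture weight of that component.

```latex
p(x) = \sum_{h=1}^{N} a_h \, p(x \mid \theta_h),
\qquad a_h \ge 0, \qquad \sum_{h=1}^{N} a_h = 1
```

Read generatively: a HERV h is first chosen with probability a_h, and the EST is then emitted by the HMM of that HERV. Under this reading the weights a_h play the role of relative activities.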
10
Hidden Markov Model Mixture
The mixture corresponds to one large HMM where the first transition chooses one of the N HERV-specific sub-HMMs.
The model includes a low-quality end part.
We use the Baum-Welch algorithm to learn the whole mixture.
Probabilities of the first transition = HERV activities: the learned first-transition probabilities are estimates of the HERV activities.
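The link between first-transition probabilities and activities can be illustrated with a much simplified sketch. It assumes that the likelihood of every EST under every HERV is already given as a number, so only the mixture weights are learned with EM; the real model computes these likelihoods with sub-HMMs and learns all parameters with Baum-Welch. The function name and the toy numbers are hypothetical.

```python
def estimate_activities(likelihoods, n_iter=200):
    """EM for mixture weights with fixed component likelihoods.

    likelihoods[m][h] is p(EST m | HERV h). These are given numbers here
    (an assumption of this sketch); every EST must have at least one
    non-zero likelihood.
    """
    n_hervs = len(likelihoods[0])
    activities = [1.0 / n_hervs] * n_hervs  # start from uniform activities
    for _ in range(n_iter):
        expected_counts = [0.0] * n_hervs
        for row in likelihoods:
            # E-step: posterior probability that this EST came from each HERV
            joint = [a * p for a, p in zip(activities, row)]
            total = sum(joint)
            for h, j in enumerate(joint):
                expected_counts[h] += j / total
        # M-step: activities are the normalised expected EST counts
        activities = [c / len(likelihoods) for c in expected_counts]
    return activities


# Toy data: ESTs 0-2 match only HERV 0, EST 3 matches only HERV 1,
# and EST 4 is ambiguous between HERV 0 and HERV 1 (cross-talk).
toy_likelihoods = [
    [0.9, 0.0, 0.0],
    [0.9, 0.0, 0.0],
    [0.9, 0.0, 0.0],
    [0.0, 0.9, 0.0],
    [0.5, 0.5, 0.0],
]
activities = estimate_activities(toy_likelihoods)
```

On this toy input the estimates converge to 0.75, 0.25 and 0.0: the ambiguous EST is shared between HERV 0 and HERV 1 in proportion to their current activities instead of being forced onto one of them.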
11
HMM mixture
We constrain the model complexity by sharing parameters: all match states share their parameters (a match state emits the correct nucleotide).
We use two heuristics to reduce computation time:
Limit the computation to only the relevant portion of the HERV sequence (a HERV is much longer than an EST).
Use only some probable EST-HERV pairs and assume that all other pairs have zero probability.
Blocks reduce the time by a factor of 1000.
Days instead of years! The heuristics reduce the computation time from years to days per iteration; without them it would take 45 years to compute one iteration.
These heuristics do not seem to have an effect on the results.
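The pair-restriction heuristic can be sketched as follows. The slides do not say how the probable pairs were chosen, so the rule used here (keep the best-scoring candidates per EST according to some precomputed similarity score) is an assumption, and the function name, identifiers and scores are hypothetical.

```python
def keep_probable_pairs(scores, max_candidates=2):
    """Restrict every EST to its best-scoring candidate HERVs.

    scores maps est_id -> {herv_id: similarity score}. Pairs that are
    dropped are treated as having zero probability later on.
    """
    pruned = {}
    for est_id, herv_scores in scores.items():
        ranked = sorted(herv_scores.items(), key=lambda kv: kv[1], reverse=True)
        pruned[est_id] = dict(ranked[:max_candidates])
    return pruned


toy_scores = {
    "est1": {"HERV_A": 95.0, "HERV_B": 93.0, "HERV_C": 40.0},
    "est2": {"HERV_B": 88.0},
}
pruned = keep_probable_pairs(toy_scores, max_candidates=2)
n_pairs_before = sum(len(v) for v in toy_scores.values())
n_pairs_after = sum(len(v) for v in pruned.values())
```

Only the remaining pairs need an expensive likelihood computation, so the saving grows with the number of pairs that can be ruled out in advance.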
12
Experiments with generated data
Can we reconstruct the activity profile from generated data?
Data: a small set of 91 real HERVs. ESTs were generated from these using our HMM model.
We selected the HMM parameters and the activity distribution (and thus the level of cross-talk) to be similar to those in real data.
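A toy version of such a data generator is sketched below. It is not the HMM used in the talk: it only picks a source HERV according to the chosen activities, cuts a random fragment and adds substitution noise. All names, lengths and rates are assumptions made for illustration.

```python
import random


def generate_ests(herv_seqs, activities, n_ests, est_len=30,
                  error_rate=0.05, seed=0):
    """Sample noisy fragments of the HERVs in proportion to their activities."""
    rng = random.Random(seed)
    ests = []
    for _ in range(n_ests):
        # Choose the source HERV with probability given by its activity
        source = rng.choices(range(len(herv_seqs)), weights=activities)[0]
        seq = herv_seqs[source]
        start = rng.randrange(len(seq) - est_len + 1)
        fragment = seq[start:start + est_len]
        # With probability error_rate, replace a base by a uniformly random
        # nucleotide (which may happen to equal the original base)
        noisy = "".join(
            rng.choice("ACGT") if rng.random() < error_rate else base
            for base in fragment
        )
        ests.append((source, noisy))
    return ests


# Three random toy "HERVs"; the third one is inactive.
toy_rng = random.Random(1)
toy_hervs = ["".join(toy_rng.choice("ACGT") for _ in range(200))
             for _ in range(3)]
ests = generate_ests(toy_hervs, [0.7, 0.3, 0.0], n_ests=200)
```

Because the true source of every generated EST is stored, an estimated activity profile can be compared directly with the profile used for generation.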
13
No cross-talk approach
A simple heuristic BLAST-based approach assumes no cross-talk.
BLAST is a sequence similarity search algorithm that is generally used to make database queries.
BLAST is used to retrieve ESTs from dbEST, with the HERVs as query sequences.
The BLAST approach: each EST is simply assigned to its best-matching HERV.
HERV activity: the count of ESTs assigned to the HERV.
Our method, in contrast, is designed to tackle the cross-talk problem.
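The best-match counting step can be sketched as below. The sketch assumes that the BLAST results have already been parsed into (EST, HERV, score) tuples; the parsing step, the identifiers and the scores are hypothetical.

```python
from collections import Counter


def best_hit_counts(hits):
    """Assign each EST to its best-scoring HERV and count ESTs per HERV.

    hits is an iterable of (est_id, herv_id, score) tuples.
    """
    best = {}
    for est_id, herv_id, score in hits:
        # Keep only the highest-scoring HERV seen so far for this EST
        if est_id not in best or score > best[est_id][1]:
            best[est_id] = (herv_id, score)
    return Counter(herv_id for herv_id, _ in best.values())


toy_hits = [
    ("est1", "HERV_A", 95.0),
    ("est1", "HERV_B", 93.0),
    ("est2", "HERV_B", 88.0),
    ("est3", "HERV_A", 70.0),
    ("est3", "HERV_C", 69.5),
]
counts = best_hit_counts(toy_hits)
```

In the toy input, est3 matches HERV_A and HERV_C almost equally well, yet the whole EST is credited to HERV_A. This hard assignment is exactly where cross-talk is ignored.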
14
Results with generated data
Our method is able to learn the activities fairly well.
More importantly, inactive HERVs are not mistaken for active ones, and vice versa.
Our method and the BLAST approach perform about equally.
15
Results from generated data experiment
Both our method and the heuristic BLAST-based approach work well; they perform about equally.
The BLAST approach works surprisingly well.
The data contain less cross-talk than anticipated.
This suggests that the fast BLAST approach can be used in large experiments where HMM training would be computationally too costly.
16
Experiment with real data
What are the real activities of HERVs?
Data: HERVs from three HERV families. The HERVs were obtained by RetroTector, and the ESTs were retrieved from dbEST by BLAST.
RetroTector is a program designed for finding retroviral elements in genomes. It was implemented by our collaborators in Sweden and contains their expert knowledge on retroviruses.
17
Results with real data
The method finds relative activities of the HERVs.
Each family has only a few active HERVs and many that are only slightly active.
The activity of most of them was earlier unknown.
Results seem reliable according to a bootstrap check: some parameters were reoptimized for each bootstrap replicate, and the other parameters were kept fixed.
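A bootstrap check of this kind can be sketched as follows. The sketch assumes that the activities are the quantities re-estimated for each replicate, and it replaces the model fit with a simple stand-in (the fraction of ESTs assigned to each HERV). The function name, the percentile-interval rule and the toy data are assumptions.

```python
import random


def bootstrap_intervals(est_labels, n_hervs, n_replicates=1000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap intervals for per-HERV activity estimates.

    est_labels[i] is the index of the HERV that EST i is assigned to.
    The activity estimate is the fraction of ESTs per HERV, a stand-in
    for re-fitting the model on every replicate.
    """
    rng = random.Random(seed)
    n = len(est_labels)
    samples = [[] for _ in range(n_hervs)]
    for _ in range(n_replicates):
        # Resample the ESTs with replacement and re-estimate the activities
        replicate = [est_labels[rng.randrange(n)] for _ in range(n)]
        for h in range(n_hervs):
            samples[h].append(replicate.count(h) / n)
    intervals = []
    for values in samples:
        values.sort()
        lower = values[int((alpha / 2) * (n_replicates - 1))]
        upper = values[int((1 - alpha / 2) * (n_replicates - 1))]
        intervals.append((lower, upper))
    return intervals


# Toy assignment: 70 ESTs from HERV 0, 30 from HERV 1, none from HERV 2.
toy_labels = [0] * 70 + [1] * 30
intervals = bootstrap_intervals(toy_labels, n_hervs=3)
```

Narrow intervals indicate that an activity estimate changes little when the EST set is perturbed, which is one way to read "results seem reliable".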
18
Biological results
The activity levels of most of the HERVs were previously unknown. This is truly new information.
The age of the element does not explain activity.
The intactness does not explain activity.
Model sequences are among the active elements, but they are not the most active.
19
Conclusions
We have introduced a generative model for estimating the activity levels of individual HERVs.
The model is able to learn true activities: results with generated data show that we can reliably learn the activities.
A heuristic BLAST approach works surprisingly well.
Our method is general; it can be applied to study activities in different conditions, to compare different conditions, and to study HERVs in other organisms or other elements.