Statistical Detection of Co-occurring Transcription Factor Binding Sites
Armand Halbert
Background
We examined whether transcription factor binding sites (TFBS) occupy fixed positions relative to transcription start sites (TSS).
We created protein clusters from TFBS within 300 nucleotides of the TSS.
A 300-nucleotide window was chosen because this is where most cis-regulatory factors are concentrated.
False discovery rate of protein clusters
We ran the DAVID web service on the protein clusters to obtain functional annotations of the proteins in each cluster.
The false discovery rate was used to estimate the probability that clusters appeared “interesting” by chance.
We confirmed consequences of our fundamental hypothesis: that TFs occupy fixed positions relative to one another in the cis-regulatory modules (CRMs) that co-regulate genes.
Jaccard distance
Jaccard distance is a measure of the dissimilarity of two sets.
The Jaccard distance between each pair of clusters was compared with the distance between the positions of the clusters.
We expected that as the positional distance increased, the functions of the proteins would diverge.
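The Jaccard distance can be sketched in a few lines of Python. This is a minimal illustration, assuming each cluster is represented by a set of functional annotation terms; the slides do not state exactly which sets were compared, and the example terms below are hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical annotation term sets for two clusters (not from the slides).
cluster_a = {"transcription", "DNA binding", "nucleus"}
cluster_b = {"transcription", "nucleus", "kinase activity"}

# Two shared terms out of four distinct terms gives a distance of 0.5.
print(jaccard_distance(cluster_a, cluster_b))
```

A distance of 0 means the two sets are identical, and a distance of 1 means they share nothing.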
Results
The distance between cluster starting sites was weakly correlated with the Jaccard distance between pairs of clusters.
Non-pathogenicity in natural SIV hosts
Background
Natural hosts (for example, African green monkeys, AGM) rarely get sick from SIV, despite a high prevalence rate and high viral loads.
By contrast, non-natural hosts (for example, humans) do develop AIDS.
Natural hosts are believed to have co-evolved with SIV.
Searching human proteins vs AGM proteins
PSIBLAST was used to compare each protein of one species against the proteome of the other species and to gather the top hits under an e-value threshold.
CD-HIT was used to group similar proteins into clusters for finding reciprocal best hit proteins.
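Gathering the top hits under an e-value threshold could look roughly like the sketch below. It assumes a simplified report with one whitespace-separated row per hit (query, subject, e-value), like the rows shown on the later report slides; real PSIBLAST output has more columns. The function name, the threshold, and the hit limit are hypothetical, since the slides do not give the values used.

```python
def top_hits(lines, evalue_threshold=1e-5, max_hits=5):
    """Group hits by query, keeping the best `max_hits` under the threshold.

    Assumes simplified rows of the form: query subject evalue.
    """
    hits = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        query, subject = parts[0], parts[1]
        try:
            evalue = float(parts[2])
        except ValueError:
            continue  # skip rows whose third column is not a number
        if evalue <= evalue_threshold:
            hits.setdefault(query, []).append((subject, evalue))
    # A lower e-value is a better hit, so sort ascending.
    return {q: sorted(h, key=lambda hit: hit[1])[:max_hits]
            for q, h in hits.items()}


rows = ["H1 M2 0.0", "H1 M3 2e-2", "H1 M4 1e-8"]
print(top_hits(rows))
```

With the hypothetical threshold of 1e-5, the `H1 M3 2e-2` row is discarded and the remaining hits are returned best first.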
PSIBLAST Reports Creation Process
Searching PSIBLAST reports
Searching PSIBLAST reports: Reciprocal Best Hit
H1vsAgm.br
  H1 M2 0.0
  H1 M3 2e-2
  ….
M2vsHuman.br
  M2 H1 0.0
  M2 H6 4e-6
  ….
reciprocalBestHits.out
  H1 M2 0.0
  …
Searching PSIBLAST reports: Best Hit
H2vsAgm.br
  H2 M2 0.0
  H2 M3 2e-2
  ….
M2vsHuman.br
  M2 H3 0.0
  M2 H6 4e-6
  ….
bestHits.out
  H2 M2 0.0
  …
Searching PSIBLAST reports: No Hit
H2vsAgm.br
noHits.out
  H2
Goal: to find proteins that have no homologue in other species
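The three outcomes (reciprocal best hit, best hit, no hit) can be sketched as one classification pass. This is an illustration only, not the original implementation: it assumes the reports have already been parsed into dictionaries mapping each query protein to its hits, best first, and the function name and the protein `H7` are hypothetical.

```python
def classify(human_vs_agm, agm_vs_human):
    """Sort human queries into reciprocal best hits, best hits, and no hits.

    Both arguments map a query protein to a list of (subject, evalue)
    tuples ordered best first; an empty list means the search found nothing.
    """
    reciprocal, best_only, no_hit = [], [], []
    for query, hits in human_vs_agm.items():
        if not hits:
            no_hit.append(query)
            continue
        subject, evalue = hits[0]
        back = agm_vs_human.get(subject, [])
        # Reciprocal: the subject's own best hit points back at the query.
        if back and back[0][0] == query:
            reciprocal.append((query, subject, evalue))
        else:
            best_only.append((query, subject, evalue))
    return reciprocal, best_only, no_hit


human_vs_agm = {
    "H1": [("M2", 0.0), ("M3", 2e-2)],
    "H2": [("M2", 0.0), ("M3", 2e-2)],
    "H7": [],  # hypothetical protein with no hits
}
agm_vs_human = {"M2": [("H1", 0.0), ("H6", 4e-6)]}

reciprocal, best_only, no_hit = classify(human_vs_agm, agm_vs_human)
print(reciprocal, best_only, no_hit)
```

Here `H1` and `M2` pick each other, so they form a reciprocal best hit; `H2` also picks `M2`, but `M2` prefers `H1`, so `H2` is only a one-way best hit.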
Results
Searches:
  AGM: 61,804 proteins
  Human: 71,340 proteins
Expansion
Eventually, this process will be adapted to multiple species.
The main challenge is the performance of a large number of PSIBLAST searches. For example, running each human protein against the African green monkey database took 4 days.
Running the searches as a farm job will allow the application to scale.
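One way to prepare the searches for a farm is to split the query proteins into independent batches, each of which could be submitted as its own job. The sketch below shows only the batching step; the batch size and protein identifiers are hypothetical, and how jobs are actually submitted depends on the scheduler, which the slides do not specify.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical protein identifiers; a real run would read them from the proteome file.
proteins = ["H1", "H2", "H3", "H4", "H5"]
jobs = list(batches(proteins, 2))
print(jobs)
```

Because each batch is searched independently, the total wall-clock time shrinks roughly with the number of batches that can run at once.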