Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss.


1 Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss

2 Overview
1. Problem Statement
2. Motivation
3. History
4. Our Approach
5. Evaluation
6. Results
7. Discussion
8. References

3 1. The Problem
- Find regulatory sequences in the upstream region of yeast DNA.
- Regulatory sequences are segments of DNA where proteins can bind to enhance transcription of a gene.

4 The Problem
- We are given:
  - Upstream genome, which consists of:
    - Gene families, each of which consists of:
      - Individual genes, each of which consists of:
        - Strings like ATGC
- We had to find substrings that are unusually frequent in a gene family, given their distribution in the whole upstream genome.

5 The Problem
- We emulated techniques devised by van Helden.
- We worked on a similar data set and tried to reproduce, and even improve on, his findings.

6 2. Motivation
- Organisms like yeast share many genes with humans.
- As a result, they share diseases too.
- Finding regulatory sequences in yeast might lead to medical advances.
- Might lead to therapies for diseases such as cystic fibrosis.

7 3. History
- The previous century saw rapid advances in genetics.
- The scientific community is trying to get a better understanding of various genomes.
- This particular technique was developed by Jacques van Helden.

8 4. Our Approach
- Extract all substrings of lengths 6-8 in the upstream genome.
- Calculate the frequency of occurrence of each substring.
- Put this data in a table.
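The counting step above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the function name `count_kmers` and the toy sequences are assumptions made for the example.

```python
from collections import Counter

def count_kmers(sequences, k_min=6, k_max=8):
    """Count every substring of length k_min..k_max across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for k in range(k_min, k_max + 1):
            # Slide a window of width k over the sequence.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts

# Toy stand-in for the upstream genome (hypothetical data).
upstream = ["ATGCACGTGA", "CACGTGTT"]
table = count_kmers(upstream)
```

The resulting `table` maps each substring to its number of occurrences, which is the frequency table the slide describes.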

9 Our Approach
- Consider a gene family.
- Find all substrings in it and their frequencies, and build a table.
- For each entry, add the probability of occurrence (taken from the upstream-genome table).
- Use the above data to calculate three scores.
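One way to build the per-family table is sketched below. It assumes that the "probability of occurrence" of a substring is its count in the upstream genome divided by the total number of substrings of that length in the genome; the slides do not spell this out, so treat it as an assumption. The names `count_kmers` and `family_table` are hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count all substrings of length k across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def family_table(family_seqs, genome_seqs, k=6):
    """Map each substring in the family to (observed count, background probability)."""
    genome = count_kmers(genome_seqs, k)
    total = sum(genome.values())
    family = count_kmers(family_seqs, k)
    # Assumed definition: background probability = genome count / total genome k-mers.
    return {kmer: (obs, genome[kmer] / total) for kmer, obs in family.items()}

# Toy data: the family is a subset of the genome sequences.
genome_seqs = ["ACGTACGT", "TTTTTTTT"]
family_seqs = ["ACGTACGT"]
table = family_table(family_seqs, genome_seqs, k=6)
```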

10 Our Approach
- Score 1: Expected Occurrence / Actual Occurrence
- Use the probability of occurrence and the size of the gene family to calculate the expected occurrence.
- Divide by the actual occurrence.
- Low score -> Unusually frequent substring.
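A minimal sketch of Score 1 follows. It assumes that the "size of the gene family" means the number of positions at which a substring of the given length could start; the function name and the numbers in the example are illustrative only.

```python
def expected_over_actual(p, positions, observed):
    """Score 1: expected count divided by observed count (lower = more over-represented)."""
    expected = p * positions  # assumed: expected count = probability * number of positions
    return expected / observed

# Hypothetical numbers: background probability 0.001, 5000 positions, 20 observed hits.
score = expected_over_actual(0.001, 5000, 20)
```

Here the expected count is about 5 while 20 occurrences were observed, so the score is about 0.25, well below 1.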

11 Our Approach
- Score 2: Poisson Distribution
- Use the expected and actual number of occurrences.
- If a substring occurs ‘n’ times, calculate the probability of ‘n’ occurrences using the Poisson distribution.
- Lower probability -> Unusually frequent
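The Poisson score can be sketched as below, with the expected count as the rate parameter. `poisson_pmf` follows the slide (probability of exactly n occurrences). `poisson_tail`, the probability of n or more occurrences, is an addition not described on the slide; it is included because an exact-n probability is also small when a substring is unusually rare. Both function names are hypothetical.

```python
import math

def poisson_pmf(n, lam):
    """P(X = n) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def poisson_tail(n, lam):
    """P(X >= n): one minus the probability of fewer than n occurrences."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(n))

# Hypothetical example: 5 occurrences expected, 20 observed.
p_exact = poisson_pmf(20, 5.0)
p_at_least = poisson_tail(20, 5.0)
```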

12 Our Approach
- Score 3: Binomial Distribution
- Use the probability of occurrence, the sizes of the genome and gene family, and the actual occurrences.
- If a substring occurs ‘n’ times, calculate the probability of ‘n’ occurrences using the binomial distribution.
- Lower probability -> Unusually frequent
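A sketch of the binomial score, under the assumption that each of the N candidate positions in the gene family is treated as an independent trial with success probability p (the background probability of the substring). As with the Poisson sketch, the tail function is an optional addition not stated on the slide, and the names are hypothetical.

```python
import math

def binomial_pmf(n, trials, p):
    """P(X = n) for a binomial distribution with the given trials and probability p."""
    return math.comb(trials, n) * p ** n * (1 - p) ** (trials - n)

def binomial_tail(n, trials, p):
    """P(X >= n): one minus the probability of fewer than n successes."""
    return 1.0 - sum(binomial_pmf(i, trials, p) for i in range(n))

# Hypothetical example: 5000 positions, background probability 0.001, 20 observed.
p_exact = binomial_pmf(20, 5000, 0.001)
```

`math.comb` requires Python 3.8 or later.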

13 Our Approach
- Sort substrings by a score.
- Take the top sequences and create a probability matrix.
- Iterate the probability matrix to get a probabilistic model of the regulatory sequence.
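The first two steps above might look like the sketch below. It assumes the top substrings all have the same length and are already aligned, which the slides do not state, and it does not attempt the iteration step. The function names and the scores are made up for illustration.

```python
def top_kmers(scores, n):
    """Return the n substrings with the lowest score (lower = more unusual)."""
    return sorted(scores, key=scores.get)[:n]

def probability_matrix(kmers):
    """Per-position base frequencies for equal-length, pre-aligned substrings."""
    length = len(kmers[0])
    matrix = []
    for pos in range(length):
        column = {base: 0 for base in "ACGT"}
        for kmer in kmers:
            column[kmer[pos]] += 1
        total = sum(column.values())
        matrix.append({base: count / total for base, count in column.items()})
    return matrix

# Hypothetical scores for four hexamers.
scores = {"CACGTG": 0.01, "TACGTG": 0.02, "AAAAAA": 0.90, "CACGTT": 0.03}
top = top_kmers(scores, 3)
matrix = probability_matrix(top)
```

Each entry of `matrix` is one column of the motif: a mapping from base to its frequency at that position among the selected substrings.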

14 5. Evaluation Metrics
- Van Helden’s results in the ’98 paper and on his website.
- The ’98 paper used old data, so it is not very reliable for evaluation.
- The website is very useful since it works on current data and calculates results dynamically.
- Compared our output to his.

15 Evaluation Metrics
- Also compared the three score types to find the best method.

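16 6. Results
Comparison of Results for MET FAMILY

Gene | Van Helden’s site | Binomial Dist | Poisson Dist | Expected / Actual | Old Paper
CACGTG | 1 | 1 | 3 | 4 | 1
ACGTGA | 2 | 2 | 1 | 2 | 3
TCACGT | 3 | 3 | 2 | 1 | 2
ATATAT | 4 | 4 | N/A | 5
TATATA | 5 | 5 | N/A | 10
AACTGT | 6 | 7 | 4 | 28 | 4
ACAGTT | 7 | 6 | N/A | 29 | N/A
ACACAC | 8 | 9 | 7 | N/A
GTGTGT | 9 | 8 | 6 | N/A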

17 Results  Probability matrices generated successfully!

18 7. Discussion
- Paper results are clearly outdated.
- Close correlation with van Helden’s site.
- Binomial distribution performed best, followed by Poisson and Expected/Actual.

19 Discussion
- Why don’t the binomial results perfectly match van Helden’s site?
  - The van Helden paper only outlines the general method.
  - He uses many filters and adjustments.
  - Limited info about them on the site.
  - We used similar, but not the same, filters.
  - Example: Purge sequences that appear twice in a row.
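The slide does not define the example filter precisely. One possible reading, shown here purely as an assumption, is that self-overlapping repeats of a substring (for example ATATAT inside a long AT run) are counted only once per non-overlapping occurrence. The function name is hypothetical.

```python
def count_non_overlapping(seq, kmer):
    """Count occurrences of kmer in seq, skipping matches that overlap a previous one."""
    count = 0
    i = seq.find(kmer)
    while i != -1:
        count += 1
        # Resume the search after the end of the current match.
        i = seq.find(kmer, i + len(kmer))
    return count

# A 12-base AT run contains ATATAT at offsets 0, 2, 4 and 6,
# but only the matches at 0 and 6 do not overlap.
hits = count_non_overlapping("ATATATATATAT", "ATATAT")
```

Counting this way keeps low-complexity repeats from inflating the observed count of a self-overlapping substring.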

20 Discussion
- Future work
  - Find more filters.
  - Try other similar organisms’ genomes.
  - Biologically verify results!

21 Discussion
- What we learnt
  - Biology!
    - First-hand look at genetic data
    - Became more familiar with genes
    - Clearly understood what the fuss about genetics is about
  - Computer Science
    - Teamwork
    - Interfacing CS with other scientific disciplines

22 References
- van Helden, J., André, B. & Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281(5), 827-42.
- van Helden, J., Rios, A. F. & Collado-Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 28(8), 1808-18.

