MF workshop 08 © Ron Shamir 1 Motif Finding Workshop Project Chaim Linhart January 2008.

1 MF workshop 08 © Ron Shamir 1 Motif Finding Workshop Project Chaim Linhart January 2008

2 Outline
1. Some background again…
2. The project

3 1. Background
Slides with Ron Shamir and Adi Akavia

4 Gene: from DNA to protein
DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein

5 DNA
DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T }
Resides in chromosomes
Complementary strands: A-T ; C-G
Forward/sense strand: AACTTGCG
Reverse-complement/anti-sense strand: TTGAACGC (written aligned under the forward strand, i.e. 3’ to 5’; read 5’ to 3’ it is CGCAAGTT)
Directional: from 5’ to 3’: 5’ end (upstream) AACTTGCGATACTCCTA (downstream) 3’ end
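A two-strand search needs the reverse complement of each sequence or motif. The following is a minimal Java sketch of that operation; the class and method names are illustrative and not part of the workshop material, and mapping any non-ACGT character to N is an assumption.

```java
final class ReverseComplementSketch {

    // Complement of a single base; anything other than A/C/G/T maps to 'N' (assumed convention).
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    // Reverse complement: walk the sequence backwards and complement each base,
    // so the result is again written 5' to 3'.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACTTGCG")); // prints CGCAAGTT
    }
}
```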

6 Gene structure (eukaryotes)
[Figure: DNA with the promoter and transcription start site (TSS); transcription (RNA polymerase) yields pre-mRNA made of exons and introns; splicing (spliceosome) yields mature mRNA with 5’ UTR, coding region between start codon and stop codon, and 3’ UTR; translation (ribosome) yields the protein. The coding strand is marked.]

7 Translation
Codon: a triplet of bases that codes for a specific amino acid (except the stop codons); many-to-1 relation
Stop codons: signal termination of the protein synthesis process
http://ntri.tamuk.edu/cell/ribosomes.html

8 Genome sequences
Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
Human:
– 23 pairs of chromosomes
– 3+ Gbps (bps = base pairs), only ~3% are genes
– ~25,000 genes
Yeast:
– 16 chromosomes
– 20 Mbps
– 6,500 genes

9 Regulation of Expression
Each cell contains an identical copy of the whole genome, but utilizes only a subset of the genes to perform diverse, unique tasks
Most genes are highly regulated: their expression is limited to specific tissues, developmental stages, physiological conditions
Main regulatory mechanism: transcriptional regulation

10 Transcriptional regulation
Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
TFBSs are located mainly (not always!) in the gene’s promoter: the DNA sequence upstream of the gene’s transcription start site (TSS)
BSs of a particular TF share a common pattern, or motif
Some TFs operate together: TF modules
[Figure: a TF bound to a BS upstream of the TSS of a gene, with the 5’ and 3’ ends of the strands marked]

11 TFBS motif models
[Figure: promoters of gene 1 to gene 10 with binding sites marked]
Binding sites: AACTGT CACTGT CACTCT CACTGT AACTGT
Consensus (“degenerate”) string: [AC]ACT[CG]T
Statistical models…
Motif logo representation
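One way to get from aligned binding sites to a consensus string is to count the bases seen in each column. The Java sketch below does this for the five site strings on this slide; the bracket notation, the class name, and the rule "list every base that occurs at least once" are assumptions made for illustration. A real motif model would usually keep the counts themselves (the "statistical models" bullet) rather than only the set of observed bases.

```java
import java.util.Arrays;
import java.util.List;

final class ConsensusSketch {
    static final String BASES = "ACGT";

    // counts[i][b] = number of sites with base BASES.charAt(b) at position i.
    // Assumes all sites have the same length; characters other than A/C/G/T are ignored.
    static int[][] countMatrix(List<String> sites) {
        int len = sites.get(0).length();
        int[][] counts = new int[len][4];
        for (String site : sites) {
            for (int i = 0; i < len; i++) {
                int b = BASES.indexOf(Character.toUpperCase(site.charAt(i)));
                if (b >= 0) {
                    counts[i][b]++;
                }
            }
        }
        return counts;
    }

    // A column with one observed base is written as that base; several observed
    // bases are written in brackets; a column with no observed base becomes N.
    static String consensus(int[][] counts) {
        StringBuilder sb = new StringBuilder();
        for (int[] column : counts) {
            StringBuilder seen = new StringBuilder();
            for (int b = 0; b < 4; b++) {
                if (column[b] > 0) {
                    seen.append(BASES.charAt(b));
                }
            }
            if (seen.length() == 0) {
                sb.append('N');
            } else if (seen.length() == 1) {
                sb.append(seen);
            } else {
                sb.append('[').append(seen).append(']');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sites = Arrays.asList("AACTGT", "CACTGT", "CACTCT", "CACTGT", "AACTGT");
        System.out.println(consensus(countMatrix(sites))); // prints [AC]ACT[CG]T
    }
}
```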

12 Human G2+M cell-cycle genes: The CHR – NF-Y module
TFs: NF-Y, CHR
CDCA3 (trigger of mitotic entry 1): CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8): TTGTGATTGGATGTTGTGGGA … [25bp] … TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog): CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4): GCTTTCAGTTTGAACCGAGGA … [25bp] … CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1): AGCCGCCAATGGGAAGGGAG … [30bp] … AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2): TTCAGCCAATGAGAGT … [15bp] … GTGTTGGCCAATGAGAAC … [15bp] … GGGC CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
BSs are short, non-specific, hiding in both strands and at various locations along the promoters

13 The computational challenge
Given a set of co-regulated genes (e.g., from gene expression chips)
Find a motif that is over-represented (occurs unusually often) in their promoters
This may be the TF binding site motif
Find TF modules: over-represented motifs that tend to co-occur

14 The computational challenge (II)
Motifs can also be found without a given target-set (“genome-wide”)
Find a motif that is localized: occurs more often near the TSS of genes
Find a motif with a strand bias: occurs more often on the genes’ coding strand
Find TF modules with biases in their order / orientation / distance

15 Motif finding algorithms
>100 motif finding algorithms
Main differences between them:
– Type of analysis & input:
    Target-set vs. genome-wide
    Single vs. multi-species (conservation)
    Single motifs vs. modules
– Motif model
– Score for evaluating motif
– Motif search technique: combinatorial (enumeration) vs. statistical optimization

16 Example - Amadeus
Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle: CHR, NF-Y

17 2. The project

18 General goals
Develop software from A-Z:
– Design
– Implementation
– (Optimization)
– Execution & analysis of real data
A taste of bioinformatics
Have fun
Get credit…

19 The computational task
Given a set of DNA sequences
Find “interesting” pairs of motifs:
– Order bias
– Other scores…
Main challenges:
– Performance (time, memory)
– Output redundancy

20 Input
File with DNA sequences in “fasta” format:
>sequence-name1 [header1]
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 [header3]
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag

21 Input (II)
Ignore the header lines
Sequence may span multiple lines or one long line
Sequence contains the characters A,C,G,T,N in upper or lower case
“N” means unknown or masked base
Sample input files will be supplied
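The input rules above can be sketched in Java as follows. This is an illustrative reader, not the workshop's supplied code: the class and method names are made up, it takes the file content as a list of lines (for example from `java.nio.file.Files.readAllLines`), and normalizing everything to upper case is an assumption made so that later string comparisons are case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class FastaSketch {

    // Returns one upper-case string per FASTA record. Header lines (starting with '>')
    // only mark record boundaries; their text is ignored. Sequence lines of a record
    // are concatenated, so a record may span one line or many.
    static List<String> parseFasta(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            if (line.startsWith(">")) {
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line.toUpperCase(Locale.ROOT));
            }
            // Sequence text before the first header is skipped (assumed malformed input).
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                ">sequence-name1 [header1]",
                "ACCCGNNNNTCGGAAATGANN",
                "CGGAGTAAAATATGCGAGCGT",
                ">sequence-name2 [header2]",
                "cggattnnnaccgcannnnnnnnaccgtga");
        for (String seq : parseFasta(lines)) {
            System.out.println(seq.length() + " " + seq);
        }
    }
}
```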

22 Input (III)
Search parameters:
– Length of motifs (between 5 and 10)
– Min. + Max. distance between the motifs: ACGGATTGATNNNTGGATGCCAT distance=9
– Single vs. two strands search
– Min. number of occurrences (hits) of pair (don’t count overlaps, e.g. AAAAAA): GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA (three hits marked on the slide)
– Max. p-value
– Additional parameters…
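The "don't count overlaps" rule can be illustrated with a small Java sketch. It counts occurrences of a single string and, after each hit, resumes the scan just past that hit; this greedy left-to-right policy and the names used are assumptions, since the slide does not say which of several overlapping hits should be kept.

```java
final class HitCountSketch {

    // Counts occurrences of motif in seq, skipping any occurrence that overlaps
    // a previously counted one (greedy scan from left to right).
    static int countNonOverlapping(String seq, String motif) {
        if (motif.isEmpty()) {
            throw new IllegalArgumentException("motif must not be empty");
        }
        int count = 0;
        int from = 0;
        while (true) {
            int pos = seq.indexOf(motif, from);
            if (pos < 0) {
                return count;
            }
            count++;
            from = pos + motif.length(); // jump past this hit so overlaps are not counted
        }
    }

    public static void main(String[] args) {
        // AAA occurs at 4 start positions in AAAAAA, but only 2 of them are non-overlapping.
        System.out.println(countNonOverlapping("AAAAAA", "AAA")); // prints 2
    }
}
```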

23 Output
A. A list of the string pairs with the best order-bias score (smallest p-values):
Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT     97    17    4.3E-15
ACGTT     GATTC     87    16    2.7E-13
TTAAC     CAGCC     31    114   1.2E-12
B. A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores

24 Part A: String pairs with order bias
n_A = # of A → B ; n_B = # of B → A
WLOG, n_A > n_B
n = n_A + n_B
H_0 = random order: n_A ~ B(n, 0.5)
p-value = prob. for at least n_A occurrences of A → B = tail of B(n, 0.5)
Normal approximation (central limit thm.)
Fix for multiple testing: x2
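The order-bias score can be sketched in Java as below. The exact tail of B(n, 0.5) is summed in log space so that large n does not underflow, and the z-score is the quantity a normal approximation would use. All names are illustrative. Reading "x2" as doubling the one-sided tail because either order could have been the larger one is an assumption; the undoubled tail for 97 vs. 17 hits comes out close to 4.3E-15.

```java
final class OrderBiasSketch {

    // P(X >= k) for X ~ Binomial(n, 0.5), summed from the far end of the tail inward.
    static double binomialUpperTail(int n, int k) {
        double logTerm = n * Math.log(0.5); // log P(X = n)
        double sum = 0.0;
        for (int i = n; i >= k; i--) {
            sum += Math.exp(logTerm);
            // P(X = i - 1) = P(X = i) * i / (n - i + 1)
            logTerm += Math.log(i) - Math.log(n - i + 1);
        }
        return Math.min(1.0, sum);
    }

    // One-sided p-value for the more frequent order (the WLOG step: n_A = max of the two counts).
    static double orderBiasTail(int hitsAB, int hitsBA) {
        int n = hitsAB + hitsBA;
        return binomialUpperTail(n, Math.max(hitsAB, hitsBA));
    }

    // Assumed reading of the "x2" fix: double the tail and cap at 1.
    static double orderBiasPValueDoubled(int hitsAB, int hitsBA) {
        return Math.min(1.0, 2.0 * orderBiasTail(hitsAB, hitsBA));
    }

    // z-score under H_0 (mean n/2, variance n/4), for use with a normal approximation.
    static double zScore(int hitsAB, int hitsBA) {
        int n = hitsAB + hitsBA;
        return (Math.max(hitsAB, hitsBA) - n / 2.0) / Math.sqrt(n / 4.0);
    }

    public static void main(String[] args) {
        System.out.println(orderBiasTail(97, 17));
        System.out.println(orderBiasPValueDoubled(97, 17));
        System.out.println(zScore(97, 17));
    }
}
```

For small n the exact sum is cheap; the normal approximation mainly matters when many pairs with large counts have to be scored quickly.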

25 Part B: Non-redundant list of motif pairs
Collect similar strings to motif with better score (motif = consensus):
String pair (p-value):
ACGTT, GGATT (4.3E-15)
ACGAT, GGATT (2.4E-11)
AGGAT, GGTTT (1.7E-5)
AGGTT, GGTTT (5.9E-5)
Motif pair (shown as logos on the slide): p-value 8.1E-31
Don’t report similar motif pairs:
– Motifs that consist of similar strings
– Motif pairs that are small shifts of one another
– Palindromes
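Two building blocks for the redundancy checks are a mismatch count between equal-length strings and a test for small shifts. The Java sketch below shows one possible version of each. The slide does not define "similar" or "small shift", so the Hamming distance, the overlap-based shift test, and all names here are assumptions.

```java
final class SimilaritySketch {

    // Number of mismatching positions between two strings of equal length.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // True if one string equals the other shifted by 1..maxShift positions,
    // judged on the overlapping part only (e.g. GGATT and GATTC with shift 1).
    static boolean isShiftOf(String a, String b, int maxShift) {
        for (int s = 1; s <= maxShift && s < a.length() && s < b.length(); s++) {
            if (a.regionMatches(s, b, 0, a.length() - s)) {
                return true; // suffix of a matches prefix of b
            }
            if (b.regionMatches(s, a, 0, b.length() - s)) {
                return true; // suffix of b matches prefix of a
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hamming("ACGTT", "ACGAT"));      // prints 1
        System.out.println(isShiftOf("GGATT", "GATTC", 1)); // prints true
    }
}
```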

26 Part B (cont.): Additional score
Option I: Co-occurrence rate
N = total # of sequences
s_A = # of sequences that contain motif A (s_B likewise for motif B)
s_AB = # of sequences that contain motifs A and B
H_0 = motifs occur independently and randomly
p-value = prob. for at least s_AB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution
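The hypergeometric tail for the co-occurrence score can be sketched in Java as follows. Here sB, the number of sequences containing motif B, is the counterpart of s_A; the class and method names are illustrative, and log-factorials are used so that large N does not overflow.

```java
final class CoOccurrenceSketch {

    // logFactorials(n)[i] = log(i!) for i = 0..n.
    static double[] logFactorials(int n) {
        double[] lf = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    static double logChoose(double[] lf, int n, int k) {
        return lf[n] - lf[k] - lf[n - k];
    }

    // P(X >= sAB) where X is the number of sequences containing both motifs when
    // sB sequences are drawn at random from N, of which sA contain motif A.
    static double hypergeometricUpperTail(int N, int sA, int sB, int sAB) {
        if (sA < 0 || sB < 0 || sA > N || sB > N) {
            throw new IllegalArgumentException("need 0 <= sA, sB <= N");
        }
        double[] lf = logFactorials(N);
        double logDenominator = logChoose(lf, N, sB);
        int lowest = Math.max(sAB, Math.max(0, sA + sB - N)); // smallest feasible overlap
        int highest = Math.min(sA, sB);                        // largest feasible overlap
        double sum = 0.0;
        for (int x = lowest; x <= highest; x++) {
            sum += Math.exp(logChoose(lf, sA, x) + logChoose(lf, N - sA, sB - x) - logDenominator);
        }
        return Math.min(1.0, sum);
    }

    public static void main(String[] args) {
        // 10 sequences, each motif in 5 of them, both together in all 5: 1 / C(10,5) = 1/252.
        System.out.println(hypergeometricUpperTail(10, 5, 5, 5));
    }
}
```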

27 Part B (cont.): Additional score
Option II: Distance bias
Is the distance between the two motifs uniform (H_0), or are there specific distances that are very common?
Option III: Gap variability
Are the sequences between the motifs conserved (H_0), or are they highly variable?
Other options??
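For Option II, one candidate statistic is Pearson's chi-square of the observed distance histogram against a uniform distribution over the allowed distance range. The slide does not prescribe a test, so this choice, the class name, and the parameter names are all assumptions; turning the statistic into a p-value would need a chi-square distribution function, which is left out of this Java sketch.

```java
final class DistanceBiasSketch {

    // Chi-square statistic of the observed distances against a uniform
    // distribution over minDist..maxDist (inclusive). Larger values mean
    // the distances are concentrated on a few specific values.
    static double chiSquareVsUniform(int[] distances, int minDist, int maxDist) {
        int bins = maxDist - minDist + 1;
        if (bins <= 0) {
            throw new IllegalArgumentException("need minDist <= maxDist");
        }
        int[] counts = new int[bins];
        int total = 0;
        for (int d : distances) {
            if (d >= minDist && d <= maxDist) { // distances outside the range are ignored
                counts[d - minDist]++;
                total++;
            }
        }
        if (total == 0) {
            return 0.0;
        }
        double expected = (double) total / bins; // same expected count in every bin under H_0
        double chi2 = 0.0;
        for (int c : counts) {
            double diff = c - expected;
            chi2 += diff * diff / expected;
        }
        return chi2;
    }

    public static void main(String[] args) {
        System.out.println(chiSquareVsUniform(new int[] {5, 5, 5, 5}, 5, 6)); // prints 4.0
        System.out.println(chiSquareVsUniform(new int[] {5, 6, 5, 6}, 5, 6)); // prints 0.0
    }
}
```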

28 Implementation
Java (Eclipse) ; Linux
GUI: Simple graphical user interface for supplying the input parameters and reporting the results
Packages for motif logo and statistical scores will be supplied
Time performance will be measured only for part A
Reasonable documentation
Separate packages for data-structures, scores, GUI, I/O, etc.

29 Design document
Due in 3 weeks (Feb 24)
3-5 pages (Word), Hebrew/English
Briefly describe main goal, input and output of program
Describe main data structures, algorithms, and scores for parts A+B
Meet with me before submission

30 Fin

