Published by Diana Taylor. Modified over 9 years ago.
1
Algorithms in Bioinformatics: A Practical Introduction
Project: Motif finding using ChIP-seq peak data
2
Transcriptional Control (I)
3
Transcriptional Control (II)
TATAAT is the motif!
4
Motif model
A motif can be described in two ways based on the binding sites discovered:
Binding sites: TTGACA TCGACA TTGAAA ATGACA GTGACA TTGACT TTGACC
Consensus Pattern: TTGACA
Positional Weight Matrix (PWM)
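As a rough illustration of the two descriptions, the sketch below derives both from the seven binding sites listed on this slide. It builds a per-position frequency matrix, which is one common form of a PWM (log-odds matrices are another); the function names and the optional pseudocount are assumptions made for this example, not part of the original slides.

```python
from collections import Counter

# The seven binding sites from the slide.
SITES = ["TTGACA", "TCGACA", "TTGAAA", "ATGACA", "GTGACA", "TTGACT", "TTGACC"]
ALPHABET = "ACGT"


def count_matrix(sites):
    """Return one Counter of base counts per motif position."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def frequency_matrix(sites, pseudocount=0.0):
    """Per-position base frequencies; pseudocount is a hypothetical smoothing knob."""
    total = len(sites) + pseudocount * len(ALPHABET)
    return [
        {b: (col[b] + pseudocount) / total for b in ALPHABET}
        for col in count_matrix(sites)
    ]


def consensus(sites):
    """Most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in count_matrix(sites))


if __name__ == "__main__":
    print(consensus(SITES))
    for position, column in enumerate(frequency_matrix(SITES), start=1):
        print(position, {b: round(f, 2) for b, f in column.items()})
```

The consensus keeps only the most frequent base per position, while the matrix also records how much each position varies across the sites.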
5
ChIP experiment
A chromatin immunoprecipitation (ChIP) experiment detects the interaction between a protein (transcription factor) and DNA.
6
Peak data
Peak data represents the locations where a particular TF binds. The data gives us the locations and their intensities. (Note that, due to experimental error, peaks of low intensity may be noise.)
ChIP-seq data for Human (MCF7) E2 treatment at 45min chr1:883, ,485
7
Our aim
Given the DNA sequences of those peaks, find the motifs that occur in the peak regions. In the example below, there are two motifs: TTGACA and GCATC. Note that each instance differs from its motif by at most 1 mutation.
GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT
GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG
CCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATG
GTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAG
CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC
GTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGT
GAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT
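The "at most 1 mutation" condition can be read as a Hamming-distance threshold between a motif and each window of a sequence. The following is a minimal sketch under that reading; `hamming`, `find_instances`, and the `max_mismatches` parameter are names chosen for this illustration, and substitutions are the only kind of mutation it considers.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, max_mismatches=1):
    """Return (start, window) for every window within max_mismatches of motif."""
    k = len(motif)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if hamming(window, motif) <= max_mismatches:
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    # First example sequence from the slide.
    seq = "GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT"
    for start, window in find_instances(seq, "TTGACA"):
        print(start, window)
```

This only locates instances of a motif that is already known; discovering the motif itself is the harder problem the project asks for.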
8
Input (I)
For every peak, we take the DNA sequence within approximately ±200 bp of the peak.
>cmyc_1_chr1_ _ _range_chr1_ _ _intensity_20
CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAACACAGCCTTTATATTTTGATATGCCTAAAACTGCTCAATGGCTGGGCCACTTCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAGTCATTACTTACTAAAATCTGTAATTACATCTTTGCTGCCCTAGGCCCAATCTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCTAAGCTCTCTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCAAACATGGTGGCTCTCCTTCCTCCTTCTTCCTGTCTGTCCCCAGCCTGGGAATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGCATCTTTATTTACGAG
>cmyc_2_chr1_ _ _range_chr1_ _ _intensity_15
GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAGTGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGACTGGTTTCCCTGCCG
>cmyc_3_chr1_ _ _range_chr1_ _ _intensity_36
GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCGGGCCAACATCCGGTGACGAATCCAAGTCCCGCCTCTAAGCCCATCTGCTGTCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCGCTTCCAGACAACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGGACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGTAGTGGCGATACGTCCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGAGGGCCAGCGCGTGGAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGGAGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAGCTGGAGTTCGACCTGGGCAACC
……………
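The header lines in the sample above end in an `intensity_N` field, so a reader for this input might look like the sketch below. It assumes a FASTA-style layout (a `>` header followed by one or more sequence lines) and that the intensity is the trailing integer of the header; the function names and the `min_intensity` cutoff are hypothetical, and no particular threshold is implied by the slides.

```python
import re


def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Sequence lines that appear before the first header are ignored.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def intensity(header):
    """Extract the trailing 'intensity_N' value, or None if it is absent."""
    match = re.search(r"intensity_(\d+)\s*$", header)
    return int(match.group(1)) if match else None


def strong_peaks(records, min_intensity):
    """Keep records whose intensity is at least min_intensity (assumed cutoff)."""
    return [(h, s) for h, s in records if (intensity(h) or 0) >= min_intensity]
```

A call such as `strong_peaks(list(parse_fasta(open(path))), 18)`, with `path` pointing at the sample file, would drop the lower-intensity peaks that the slides warn may be noise; whether to filter, and at what value, is left to the implementer.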
9
Input (II)
A set of background sequences that are unlikely to contain the motif.
AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAGTTGCTGTTAGCTAAGACAGTCAGGACTGAGAAGGGGGGGGGGGGTTTAACTCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAGCCGAGAACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCCCGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTTCACACAAGGCGTGAACACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAGAGCCAAGATCGATGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGAGGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGTTTTAAGGAAAA
>SEQ_2
AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGGAGATTAGGGCTGGATGACAAGTTTTTAATTGTCAAGGACTCAATTCTGTTTATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGAGCAGTGTAGAGAGTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAGAACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACCCTGGGCTAGCATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAAGAAGGACCTGCGTTCCCTCCCTCCTTGTGATCAGCTGATTGCTAGAGCATAACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGGGC
>SEQ_3
CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGTACCCTTTCAGGAGCCTAAAACAGTGCTTTCAATACTTGTGTCTATGTCTGTTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGTATTTGCTCAAATATATATTCATATGGTGAGGTGCACATTTTTTATATTATATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGAATAAAGCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCAGAGGCAGACAAATCTCTGCGTTCCAGGACAGCCTGGTCTATAGAGCAAGGTCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA
……………
10
Output
You need to output a ranked list of candidate motifs.
You can model each motif as a PWM or as a consensus sequence. If you model the motif as a PWM, one of the answers for the previous dataset is the PWM shown on the original slide (figure not reproduced here). You may also return other significant motifs.
11
Aim of the project
Given a sample file and a background file, you need to implement a method that outputs a list of motifs.
You need to take advantage of the fact that this is a ChIP-seq dataset.
Hint: Read papers on ChIP-seq and understand its properties.
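One very simple baseline for the task, offered only as a starting point and not as the method the project expects, is to rank fixed-length k-mers by how much more often they occur in the sample sequences than in the background sequences. In the sketch below the scoring formula (a log2 ratio of smoothed per-sequence presence fractions), the function names, and the defaults `k=6`, `pseudocount=1.0`, and `top=10` are all assumptions. It looks for exact k-mers only and ignores mutations, peak intensity, and position within the peak, which are exactly the ChIP-seq properties the hint suggests exploiting.

```python
import math
from collections import Counter


def kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def rank_kmers(sample, background, k=6, pseudocount=1.0, top=10):
    """Rank k-mers by log2 enrichment of sample presence over background presence."""
    fg = kmer_presence(sample, k)
    bg = kmer_presence(background, k)
    scored = []
    for kmer, n_fg in fg.items():
        p_fg = (n_fg + pseudocount) / (len(sample) + 2 * pseudocount)
        p_bg = (bg[kmer] + pseudocount) / (len(background) + 2 * pseudocount)
        scored.append((math.log2(p_fg / p_bg), kmer))
    # Highest score first; ties broken alphabetically for reproducible output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(kmer, score) for score, kmer in scored[:top]]


if __name__ == "__main__":
    # Toy data for illustration only; real input would come from the two files.
    sample = ["AATTGACAGG", "CCTTGACATT", "GGTTGACACC"]
    background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
    for kmer, score in rank_kmers(sample, background, k=6, top=3):
        print(kmer, round(score, 2))
```

Counting presence per sequence rather than total occurrences keeps a single repetitive sequence from dominating the ranking.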