Expectation Maximization Meets Sampling in Motif Finding. Zhizhuo Zhang.

1 Expectation Maximization Meets Sampling in Motif Finding. Zhizhuo Zhang

2 Outline. Review of Mixture Model and EM Algorithm; Importance Sampling; Re-sampling EM; Extending EM; Integrate Other Features; Result

3 Review Motif Finding: Mixture Modeling. Given a dataset X, a motif model Θ, and a background model θ0, the likelihood of the observed X is defined as a two-component mixture with a motif component and a background component. Optimizing this likelihood is NP-hard; the EM algorithm approaches the problem through the concept of missing data. Assume the missing data Z_i is a Boolean flag indicating whether site i is a binding site.
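
A standard two-component mixture likelihood that matches this description can be sketched as follows. The mixing weight λ and the site index i = 1..n are assumptions introduced here for illustration; they are not named in the slide.

```latex
% Observed-data likelihood: each site X_i is drawn from the motif or the background
L(X \mid \Theta, \theta_0, \lambda)
  = \prod_{i=1}^{n} \Big[ \lambda \, P(X_i \mid \Theta) + (1 - \lambda) \, P(X_i \mid \theta_0) \Big]

% Complete-data likelihood once the missing flags Z_i \in \{0, 1\} are known
L(X, Z \mid \Theta, \theta_0, \lambda)
  = \prod_{i=1}^{n} \big[ \lambda \, P(X_i \mid \Theta) \big]^{Z_i}
                    \big[ (1 - \lambda) \, P(X_i \mid \theta_0) \big]^{1 - Z_i}
```

The first factor inside the brackets is the motif component and the second is the background component.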

4 Review Motif Finding: EM. E-step: M-step:
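
One EM iteration for such a mixture can be sketched in Python as below. This is a minimal illustration, not the presenter's implementation: it assumes a fixed motif width, a fixed background distribution, a mixing weight `lam`, and a pseudocount, all of which are hypothetical choices made for the example.

```python
ALPHABET = "ACGT"

def site_prob_motif(site, pwm):
    # pwm[j][b] is the probability of base b at motif column j
    p = 1.0
    for j, b in enumerate(site):
        p *= pwm[j][b]
    return p

def site_prob_background(site, bg):
    # Zero-order background: bases are independent with probabilities bg[b]
    p = 1.0
    for b in site:
        p *= bg[b]
    return p

def em_step(sites, pwm, bg, lam, pseudo=0.1):
    # E-step: posterior probability z_i that site i is a binding site
    z = []
    for s in sites:
        pm = lam * site_prob_motif(s, pwm)
        pb = (1.0 - lam) * site_prob_background(s, bg)
        z.append(pm / (pm + pb))
    # M-step: re-estimate each PWM column from z-weighted base counts
    new_pwm = []
    for j in range(len(pwm)):
        counts = {b: pseudo for b in ALPHABET}
        for s, zi in zip(sites, z):
            counts[s[j]] += zi
        total = sum(counts.values())
        new_pwm.append({b: counts[b] / total for b in ALPHABET})
    # M-step: the mixing weight is the average posterior
    new_lam = sum(z) / len(sites)
    return new_pwm, new_lam, z
```

Both steps loop over every site, which is why each iteration is linear in the number of sites.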

5 Pros and Cons. Pros: pure probabilistic modeling; EM is a well-known method; the complexity of each iteration is linear. Cons: each iteration examines all the sites (most of which are background sites); EM is sensitive to its starting condition; the length of the motif is assumed to be given.

6 Sampling Idea (1). Simple example: 20 As and 10 Bs: AAAAAAAAAAAAAAAAAAAABBBBBBBBBB. Define a sampling function Q(x), where Q(x)=1 when x is sampled, e.g., P(Q(A)=1)=0.1 and P(Q(B)=1)=0.2. The sampled data may be: AABB. We can recover the original counts from “AABB”: 2 As in the sample / 0.1 = 20 As in the original; 2 Bs in the sample / 0.2 = 10 Bs in the original.
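
The recovery arithmetic above amounts to weighting each sampled item by the inverse of its sampling probability. A minimal Python sketch, where the function name `recover_counts` is introduced here only for illustration:

```python
def recover_counts(sample, sampling_prob):
    """Estimate original counts by weighting each sampled item by 1 / P(Q(x)=1)."""
    estimate = {}
    for x in sample:
        estimate[x] = estimate.get(x, 0.0) + 1.0 / sampling_prob[x]
    return estimate

# Sampling probabilities from the slide's example
sampling_prob = {"A": 0.1, "B": 0.2}
estimate = recover_counts("AABB", sampling_prob)
print(estimate)  # estimated counts for A and B in the original data
```

With a real random sample the recovered counts are only correct in expectation; the slide's "AABB" is the case where the sample happens to match the expected proportions exactly.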

7 Sampling Idea (2). Almost any sampling function with known sampling probabilities allows the statistics of the original data to be recovered; this is known as “importance sampling”. We can define a good sampling function on the sequence data that prefers to sample binding sites over background sites. Because the motif model has more parameters, it needs more samples than the background model to achieve the same level of accuracy.

8 Re-sampling EM. Sampling function Q(.) and sampled data X_Q. E-step: the same as in the original EM. M-step:
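
One plausible form of the M-step on sampled data, following the inverse-probability idea from the sampling slides, is to let each sampled site contribute its posterior divided by its sampling probability. The sketch below is an assumption made for illustration: the names `resampled_m_step`, `z`, `q`, and `pseudo` are hypothetical, and the exact update used in the talk is not shown in this transcript.

```python
ALPHABET = "ACGT"

def resampled_m_step(sampled_sites, z, q, width, pseudo=0.1):
    """Weighted PWM update: site i contributes z[i] / q[i] to its base counts.

    z[i] is the E-step posterior that site i is a binding site and
    q[i] is the probability with which site i was sampled.
    """
    pwm = []
    for j in range(width):
        counts = {b: pseudo for b in ALPHABET}
        for site, zi, qi in zip(sampled_sites, z, q):
            counts[site[j]] += zi / qi  # importance weight 1 / q[i]
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in ALPHABET})
    return pwm
```

Sites that were unlikely to be sampled receive larger weights, so the weighted counts estimate the counts that full-data EM would have accumulated.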

9 Re-sampling EM

10 How to Find a Good Sampling Function. Intuitively, the motif PWM is the natural choice for a good sampling function, but we cannot know the motif PWM beforehand. Fortunately, an approximate PWM model already does a good job in practice.

11 How to Find a Good Approximating PWM? Unknown length; unknown distribution.

12 Extending EM. Start from all over-represented 5-mers. For each 5-mer, we find a motif model (PWM) containing that 5-mer which maximizes the likelihood of the observed data. We define an extending EM process that optimizes which flanking columns are included in the final PWM.

13 Extending EM. Imagine we have a length-25 PWM Θ with the 5-mer q “ACTTG” in the middle, which is wide enough for us to target any motif shorter than 15 bp (W_max).

Pos  1     2     …  10    11  12  13  14  15  16    …  24    25
A    0.25  0.25  …  0.25  1   0   0   0   0   0.25  …  0.25  0.25
C    0.25  0.25  …  0.25  0   1   0   0   0   0.25  …  0.25  0.25
G    0.25  0.25  …  0.25  0   0   0   0   1   0.25  …  0.25  0.25
T    0.25  0.25  …  0.25  0   0   1   1   0   0.25  …  0.25  0.25
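
Building such a seed PWM can be sketched in Python as follows. The function name `seed_pwm` and the 0-based index convention are assumptions made for this example.

```python
ALPHABET = "ACGT"

def seed_pwm(kmer, total_width=25):
    """Uniform PWM of total_width columns with `kmer` fixed in the middle.

    Returns the PWM plus the 0-based start and end columns of the k-mer.
    """
    start = (total_width - len(kmer)) // 2
    pwm = [{b: 0.25 for b in ALPHABET} for _ in range(total_width)]
    for offset, base in enumerate(kmer):
        # Core columns put all probability mass on the seed base
        pwm[start + offset] = {b: (1.0 if b == base else 0.0) for b in ALPHABET}
    return pwm, start, start + len(kmer) - 1

pwm, start, end = seed_pwm("ACTTG")
print(start + 1, end + 1)  # 1-based positions of the core columns
```

In practice the hard 0/1 core columns would probably be softened with pseudocounts before running EM, so that sites with a single mismatch do not get zero probability; that detail is an assumption here, not something stated in the slides.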

14 Extending EM. We use two indices to track the start and end of the real motif within the PWM.

15 Extending EM. The M-step is the same as in the original EM, but we need to determine which columns should be included. The increase in log-likelihood from including column j:
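
One common way to score such an increase is to compare the posterior-weighted base counts at column j under a motif column against the background. The Python sketch below uses that form as an assumption; the names `column_gain`, `z`, `bg`, and `pseudo` are hypothetical, and the talk's exact formula is not shown in this transcript.

```python
import math

def column_gain(sites, z, j, bg, pseudo=0.1):
    """Log-likelihood gain from modeling column j with a motif column
    (estimated from z-weighted counts) instead of the background."""
    counts = {b: pseudo for b in "ACGT"}
    for s, zi in zip(sites, z):
        counts[s[j]] += zi
    total = sum(counts.values())
    gain = 0.0
    for b, c in counts.items():
        # c * log(theta_jb / theta_0b), with theta_jb = c / total
        gain += c * math.log((c / total) / bg[b])
    return gain
```

Because this gain is never negative, a practical inclusion rule would compare it against a threshold or penalty; the choice of threshold is left open here.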

16 Consider Other Features in EM. Other features: positional bias, strand bias, sequence rank bias. We integrate them into the mixture model. New likelihood ratio, with a Boolean variable that determines whether each feature is included or not.

17 Consider Other Features in EM. If the feature data is modeled as multinomial, a chi-square test is used to decide whether a feature should be included. The multinomial parameters φ can also be learned in the M-step:
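
The two ideas on this slide can be sketched in Python as below. This is an illustration under stated assumptions: the feature is binned into a fixed number of categories, the null hypothesis is a given expected distribution (for example uniform), and φ is estimated from posterior-weighted bin counts. The function names and the pseudocount are hypothetical.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic for binned counts against expected probabilities."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))

def estimate_phi(bins, z, n_bins, pseudo=0.1):
    """M-step for multinomial feature parameters: z-weighted bin frequencies.

    bins[i] is the feature bin of site i; z[i] is its posterior of being a binding site.
    """
    counts = [pseudo] * n_bins
    for b, zi in zip(bins, z):
        counts[b] += zi
    total = sum(counts)
    return [c / total for c in counts]

# Chi-square critical value for 3 degrees of freedom at alpha = 0.05
CRITICAL_DF3_ALPHA05 = 7.815

# Example: 4 positional bins with a strong preference for the first bin
stat = chi_square_statistic([70, 10, 10, 10], [0.25] * 4)
include_feature = stat > CRITICAL_DF3_ALPHA05
```

When the observed counts are fractional posterior weights rather than integer counts, the chi-square test is only approximate.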

18 All Together

19 PWM Model; Position Prior Model; Peak Rank Prior Model

20 Simulation Result

22 Real Data Result. 163 ChIP-seq datasets. Comparison with 6 popular motif finders. Half of the data for training, half for testing.

23 Real Data Result. De novo AP1 Model; De novo FOXA1 Model; De novo ER Model

24 Conclusion. SEME can: perform EM on biased sampled data while still estimating parameters without bias; vary the PWM size during the EM procedure by starting from a short 5-mer; automatically learn and select other feature information during EM iterations.
