1
Parameter Tuning for Differential Mining of String Patterns
J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut
DDDM'08, Pisa - 15/12/2008
…@liris.cnrs.fr - http://liris.cnrs.fr/...
LIRIS (Laboratoire d'InfoRmatique en Image et Systèmes d'information), UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon
Université Claude Bernard Lyon 1, bâtiment Nautibus, 43 boulevard du 11 novembre 1918, F-69622 Villeurbanne cedex
2
Tuning extraction parameters
- Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings, …
- Mining under constraints (monotonic, anti-monotonic, or neither; pattern shapes; occurrence properties; measures; …) can select/focus …
- … but where to look in the parameter space?
- Often easy with a single threshold … but what about multiple constraints and multiple thresholds?
3
Two different kinds of tuning
1) Exploratory stage: find promising areas in the parameter space
2) Fine-grained tuning: a kind of greedy strategy based on small local explorations of the parameter space
4
Tools?
What is the best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining? …
5
Tools
- GREP + word count
- Method: a manual mix of the following:
  - count the extracted patterns
  - choose points in the parameter space
  - random walk
  - try a local greedy strategy
  - keep in mind the known properties of the constraints (when applicable) and domain knowledge
6
Tools
- … when there are several parameters or several thresholds, e.g., a minimal support on one dataset and a maximal support on another dataset …
- perform a more exhaustive exploration of the pattern space
- draw curves depicting the extraction landscape
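One possible way to tabulate such a landscape, once the supports of the candidate patterns are known, is sketched below. The function name `landscape` and its input layout (one pair of supports per pattern) are assumptions made for illustration; the slides do not describe the script that was actually used.

```python
def landscape(supports, min_values, max_values):
    """Count, for every (min, max) threshold pair, how many patterns qualify.

    supports: list of (support_in_R1, support_in_R2) pairs, one per pattern
              (hypothetical input layout, chosen for this sketch).
    Returns a dict mapping (min_threshold, max_threshold) to a pattern count.
    """
    return {
        (mn, mx): sum(1 for s1, s2 in supports if s1 >= mn and s2 <= mx)
        for mn in min_values
        for mx in max_values
    }


# Toy supports for three patterns, not taken from the slides.
toy_supports = [(5, 0), (3, 2), (1, 1)]
grid = landscape(toy_supports, min_values=[1, 3], max_values=[0, 2])
for thresholds in sorted(grid):
    print(thresholds, grid[thresholds])
```

Each entry of the returned dictionary is one point of the landscape; plotting the counts against the two thresholds gives curves of the kind mentioned on the slide.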
7
Tools / landscape: examples
8
Obtaining extraction landscapes
- Use a script
  - can need a lot of resources to execute
  - too much time needed to explore a large parameter space (several parameters)
- Use a global model of the presence of the local patterns to estimate the number of patterns
  - reuse/adapt an existing model: not many exist
  - develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
  - incorporate domain knowledge? A global analytical model is then even more complex to exhibit …
9
What about sampling the pattern space?
- Sounds too naive, or in need of complicated frameworks
- How to sample?
- Size of the sample?
- Number of patterns in the sample that satisfy the constraints?
- Using domain knowledge?
- How to estimate the value for the whole pattern space?
10
What about simple choices?
- Sampling: with replacement, among the patterns that satisfy the syntactic constraints
- Number of patterns in the sample that satisfy the constraints (a conjunction of constraints): for each pattern in the sample, compute the probability that it satisfies the constraints (incorporating domain knowledge); the sum of these probabilities approximates the number of patterns in the sample that satisfy the constraints
- Sample size: grow the sample until the percentage of patterns satisfying the constraints converges
- Estimate of the number of patterns in the pattern space that satisfy the constraints: apply this percentage to the number of patterns that satisfy the syntactic constraints
11
Whole process
1) build an initial sample of Psynt (the patterns that satisfy the syntactic constraints)
2) compute an estimate of E(N), the expected number of patterns satisfying the constraints, from the sample
3) add more patterns to the sample
4) compute an estimate of E(N) from the sample
5) if the estimate changes a lot, go to 3)
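The loop above can be sketched as follows. This is a minimal illustration, not the authors' software: the function names, the default sample sizes, and the 5% relative-change stopping rule (a value the slides only raise as a question) are assumptions, and the caller is expected to supply the per-pattern probability function.

```python
import random


def draw_uniform_pattern(alphabet, size, rng):
    """Draw one pattern of the given size uniformly, with replacement."""
    return "".join(rng.choice(alphabet) for _ in range(size))


def estimate_pattern_count(draw_pattern, prob_satisfies, psynt_size,
                           initial=1000, batch=500, tol=0.05, max_rounds=100):
    """Estimate E(N) by growing a sample of Psynt until the estimate is stable.

    draw_pattern:   zero-argument callable returning one sampled pattern
    prob_satisfies: callable giving the probability that a pattern satisfies C
    psynt_size:     number of patterns satisfying the syntactic constraints
    tol:            assumed relative-change threshold used as stopping rule
    """
    # Steps 1 and 2: initial sample and first estimate.
    total = sum(prob_satisfies(draw_pattern()) for _ in range(initial))
    n = initial
    estimate = total / n * psynt_size
    for _ in range(max_rounds):
        # Steps 3 and 4: add more patterns, then recompute the estimate.
        total += sum(prob_satisfies(draw_pattern()) for _ in range(batch))
        n += batch
        new_estimate = total / n * psynt_size
        # Step 5: stop once the estimate no longer changes a lot.
        stable = abs(new_estimate - estimate) <= tol * abs(estimate)
        estimate = new_estimate
        if stable:
            break
    return estimate, n
```

For strings of size Z over an alphabet of 4 symbols, `psynt_size` would be `4 ** Z` and `draw_pattern` could be `lambda: draw_uniform_pattern("ACGT", Z, rng)` with `rng = random.Random(seed)`.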
12
Using it in frequent substring mining
- Two datasets: R1 and R2 (two sets of strings)
- Constraints on a pattern:
  - having size Z
  - appearing at least min times in R1
  - appearing no more than max times in R2
- Consider exact and approximate matching
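For the exact-matching case, the three constraints can be checked directly, as in the sketch below. It assumes that "appearing k times" means "occurring in k strings of the dataset", which is consistent with the per-string occurrence probability used later in the talk but is not stated explicitly here; the function names are illustrative only.

```python
def support(pattern, strings):
    """Number of strings containing the pattern at least once (exact match)."""
    return sum(1 for s in strings if pattern in s)


def satisfies(pattern, r1, r2, size, min_sup, max_sup):
    """Check the size, minimal-support (R1) and maximal-support (R2) constraints."""
    return (len(pattern) == size
            and support(pattern, r1) >= min_sup
            and support(pattern, r2) <= max_sup)


def extract(r1, r2, size, min_sup, max_sup):
    """Naive extraction: test every substring of the given size found in R1."""
    candidates = {s[i:i + size] for s in r1 for i in range(len(s) - size + 1)}
    return sorted(p for p in candidates
                  if satisfies(p, r1, r2, size, min_sup, max_sup))


# Toy data, not taken from the slides.
r1 = ["ACGTAC", "TTACGA", "GGGGGG"]
r2 = ["ACACAC", "TTTTTT"]
print(extract(r1, r2, size=2, min_sup=2, max_sup=0))  # ['CG', 'TA']
```

This brute-force version is only meant to make the constraints concrete; running it for every threshold setting is exactly the cost that the sampling approach tries to avoid.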
13
Pattern space and domain knowledge
- Strings over an alphabet of 4 or 8 symbols
- Domain knowledge as three models of symbol distribution:
  - Me: independent symbols with equal frequencies
  - Md: independent symbols with different frequencies
  - Mm: first-order Markov model
- For a given pattern p and a given model (Me, Md or Mm), we have the probability that there exists at least one occurrence of p in a string
- From the binomial distribution, we have the probability that p satisfies the min and max support constraints
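A rough sketch of this computation for the Me and Md models (independent symbols) is given below. The per-string occurrence probability uses an independent-positions approximation, which ignores the overlap between neighbouring positions; the slides do not say which formula the authors used, so this is an assumption, as are the function names and the independence of R1 and R2.

```python
from math import comb, prod


def occurrence_prob(pattern, freqs, length):
    """Approximate probability that pattern occurs at least once in one string.

    ASSUMPTION: start positions are treated as independent trials.
    freqs maps each symbol to its frequency (Me or Md model).
    """
    positions = length - len(pattern) + 1
    if positions <= 0:
        return 0.0
    q = prod(freqs[c] for c in pattern)  # probability of a match at one position
    return 1.0 - (1.0 - q) ** positions


def binom_pmf(k, n, p):
    """Probability of exactly k successes among n independent trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def prob_at_least(k, n, p):
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def prob_at_most(k, n, p):
    return sum(binom_pmf(i, n, p) for i in range(0, min(k, n) + 1))


def prob_satisfies(pattern, freqs, n1, len1, n2, len2, min_sup, max_sup):
    """P(support in R1 >= min_sup) * P(support in R2 <= max_sup).

    ASSUMPTION: R1 and R2 are independent, so the two probabilities multiply.
    """
    p1 = occurrence_prob(pattern, freqs, len1)
    p2 = occurrence_prob(pattern, freqs, len2)
    return prob_at_least(min_sup, n1, p1) * prob_at_most(max_sup, n2, p2)


# Md frequencies as on the random-data example slide; symbol names are illustrative.
md = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
print(prob_satisfies("ACGTACGT", md, 100, 1000, 100, 1000, min_sup=5, max_sup=20))
```

The value returned by `prob_satisfies` plays the role of E(Xp) for one pattern; the Mm model would need the pattern probability computed from transition probabilities instead of a plain product.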
14
Example / random data: 4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact match
15
Example / random data: 4 symbols, Mm, 100 strings of length 1000 in R1 and R2, exact and approximate match
16
Example / gene promoter sequences: 4 symbols (A, C, G, T), Md, strings of 4000 symbols, 29 in R1 and 21 in R2, approximate match
17
Example / gene promoter sequences: estimate vs. extraction
18
Conclusion
- Drawing the extraction landscape for parameter tuning in local pattern extraction, using pattern space sampling …
- … seems possible …
- … at least in some cases …
- … using a simple framework …
  - incorporating domain knowledge (to some extent: there are many works on the probability that a given pattern satisfies constraints)
  - simpler than building a global analytical model
  - faster than running real extractions
- … sufficient in the exploratory stage?
- … companion software?
19
Example / random data: 8 symbols, Me, 100 strings of length 30000 in R1 and R2, approximate match
20
Problems: sampling / estimate
- Kind of sampling (with replacement?)
- Specific sampling (a kind of stratified sampling) for some constraints? For some kinds of patterns?
- Quality of the estimates … occurrences of different patterns are not independent
21
Problems: other parameters added
- size of the starting set
- convergence criterion? 5%?
- size of the additional subsets
- … not so hard to tune?
22
Number of patterns
- C: a conjunction of constraints; PS: the pattern space
- For each pattern p, let the variable Xp = 1 if p satisfies C and Xp = 0 otherwise
- N = number of patterns that satisfy C = sum of Xp over PS
- E(N) = sum of E(Xp) over PS
- E(Xp) = probability that p satisfies C
- Psynt = patterns in PS that satisfy the syntactic constraints in C
- E(N) = sum of E(Xp) over Psynt
23
Number of patterns
- compute NS = sum of E(Xp) over a sample of Psynt
- compute the ratio NR = NS / sample size
- use NR * size of Psynt as an estimate of E(N)
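Written out, and assuming the sample consists of n patterns p_1, ..., p_n drawn uniformly and with replacement from Psynt (the index notation and the hat symbol are introduced here for illustration), the estimate and the reason it targets E(N) can be sketched as:

```latex
\begin{align*}
N &= \sum_{p \in PS} X_p, \qquad
E(N) = \sum_{p \in P_{synt}} E(X_p), \\
N_S &= \sum_{i=1}^{n} E(X_{p_i}), \qquad
N_R = \frac{N_S}{n}, \qquad
\widehat{E(N)} = N_R \cdot |P_{synt}|, \\
E_{\text{sample}}\!\left[\widehat{E(N)}\right]
  &= |P_{synt}| \cdot \frac{1}{n} \sum_{i=1}^{n}
     \frac{1}{|P_{synt}|} \sum_{p \in P_{synt}} E(X_p)
   = E(N).
\end{align*}
```

The last line only says that the estimate is correct on average over the random choice of the sample; it says nothing about its variance, which is what the convergence check on the sample size is meant to control.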
24
Example / gene promoter sequences: estimate vs. extraction
25
Example / gene promoter sequences: estimate vs. extraction
26
Often repeat the exploratory stage
Redo the exploratory stage after important changes such as:
- data selection (e.g., part of the sequences)
- encoding (e.g., mapping onto event types)
- discretization (e.g., binarization threshold)
- …