pTree-k-means-classification-sequential (pkmc-s)


1 pTree-k-means-classification-sequential (pkmc-s)
The mask of the rows with A_j > c is computed from the bit-slice pTrees of attribute j:

  P_{A_j>c} = P_{j,m} o_m ... o_{k+1} P_{j,k},   c = b_m ... b_k ... b_0

where o_i is AND iff b_i = 1 (otherwise OR), k is the rightmost bit position of c with bit value "0", and the operations are right binding.

Initially, let PREMAINING be pure1. From the TrainingSet:

1. For each attribute, calculate the mean for each class and sort ascending on mean. Calculate all mean_gaps = differences_of_consecutive_means. Create MeanTable(attribute, class, mean, gapL, gapH, gapRELATIVE), sorted descending on gapRELATIVE = (gapL + gapH)/(2*mean). gapL is the gap on the low side of the mean; gapH is the gap on the high side.
2. Choose and remove the MT record with max gapRELATIVE. Use the formula above with cL = mean - gapL/2 and cH = mean + gapH/2 to produce PL = P_{A>cL} and PH = P'_{A>cH}. The class mask is PCLASS = PL & PH & PREMAINING, and we update PREMAINING = PREMAINING & P'CLASS.
3. Repeat 2 until all classes have a pTree mask (or until PREMAINING is pure0, but that is a count operation).
4. Repeat 1, 2, 3 until the means stop changing (much).

The next two slides contain a (partial) walk-through of this algorithm for a subset of the IRIS dataset. The clusters are color coded throughout with R, G, B for setosa, versicolor, virginica, and the features are color coded as well.

Classes: se = setosa, ve = versicolor, vi = virginica.
Features: sLN = sepal LeNgth, sWD = sepal WiDth, pLN = petal LeNgth, pWD = petal WiDth.

Initial means (m) and mean gaps (mg), sorted ascending on mean (step 1):
  sLN: se 51 (mg 12), vi 63 (mg 7), ve 70
  sWD: ve 32 (mg 1), vi 33 (mg 2), se 35
  pLN: se 14 (mg 33), ve 47 (mg 13), vi 60
  pWD: se 2 (mg 12), ve 14 (mg 11), vi 25
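The P_{A_j>c} formula on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the bit slices are stored as plain uncompressed Python lists rather than compressed pTrees, and the helper names `bit_slices` and `greater_than_mask` are hypothetical, not taken from the presentation.

```python
def bit_slices(values, nbits):
    """Return slices[i] = column of bit i (0 = least significant) over all values."""
    return [[(v >> i) & 1 for v in values] for i in range(nbits)]


def greater_than_mask(slices, c):
    """Mask of rows whose value is > c, built only from AND/OR on bit slices."""
    nbits = len(slices)
    n = len(slices[0])
    if c < 0:
        return [1] * n          # every non-negative value exceeds c: pure1
    if c >= (1 << nbits) - 1:
        return [0] * n          # no nbits-wide value exceeds c: pure0
    # k = rightmost bit position of c that holds a 0
    k = next(i for i in range(nbits) if not (c >> i) & 1)
    mask = slices[k][:]
    # Work outward from bit k to the top bit (right-binding evaluation):
    # AND where the bit of c is 1, OR where it is 0.
    for i in range(k + 1, nbits):
        if (c >> i) & 1:
            mask = [a & b for a, b in zip(slices[i], mask)]
        else:
            mask = [a | b for a, b in zip(slices[i], mask)]
    return mask
```

For example, `greater_than_mask(bit_slices(values, 5), 8)` evaluates the expression P4|(P3&(P2|(P1|P0))) over the five slices of `values`.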

2 pkmc-s: separating out the setosa class
PREM = pure1.
1. Per attribute and class, calculate means and gaps. MT(attr, class, mean, gapL, gapH, gapREL), sorted descending on gapREL = (gapL + gapH)/(2*mean). gapL = low gap, gapH = high gap.
2. Take the MT record with max gapREL: cL = mean - gapL/2, cH = mean + gapH/2, PCLASS = P_{A>cL} & P'_{A>cH} & PREM, PREM = PREM & P'CLASS.
3. Repeat 2 until all classes have a pTree.
4. Repeat 1, 2, 3 until convergence.

[Figure: bit-slice columns of the training samples (10 each of se, ve, vi) for Sepal Length, Sepal Width, Petal Length, Petal Width, with the masks PREM, P_{A>cH}, and Pse = P'_{A>cH}.]

Means (m) and mean gaps (mg):
  sLN: se 51 (mg 12), vi 63 (mg 7), ve 70
  sWD: ve 32 (mg 1), vi 33 (mg 2), se 35
  pLN: se 14 (mg 33), ve 47 (mg 13), vi 60
  pWD: se 2 (mg 12), ve 14 (mg 11), vi 25

MT (cl, at, gR = (gL + gH)/(2*mn)), not yet sorted on gR. The x's are fill-ins.
  se sLN (12+12)/(2*51)
  se sWD ( 2+ 2)/(2*35)
  se pLN (33+33)/(2*14)
  se pWD (12+12)/(2* 2)
  vi sLN ( 8+ 7)/(2*63)
  vi sWD ( 1+ 2)/(2*33)
  vi pLN (13+13)/(2*60)
  vi pWD (11+11)/(2*25)
  ve sLN ( 7+ 7)/(2*70)
  ve sWD ( 1+ 1)/(2*32)
  ve pLN (33+13)/(2*47)
  ve pWD (12+11)/(2*14)

MTs (cl, at), sorted descending on gR: se pWD, se pLN, vi pLN, ve pLN, ve pWD, vi pWD, se sLN, vi sLN, ve sLN, se sWD, vi sWD, ve sWD.

We're separating out the setosa class.
2. MT record with max gapREL: se pWD.
  cL = mean - gapL/2 = 2 - 12/2 = -4, so P_{A>cL} = Ppure1.
  cH = mean + gapH/2 = 2 + 12/2 = 8, so P_{A>cH} = P4,4|(P4,3&(P4,2|(P4,1|P4,0))).
  Psetosa = P_{A>cL} & P'_{A>cH} & PREM = Ppure1 & P'_{A>cH} & Ppure1 = P'_{A>cH}.
  PREM = PREM & P'CLASS = Ppure1 & P'setosa = P'setosa.

3 pkmc-s (continued)
PREM = pure1.
1. Per attribute and class, calculate means and gaps. MT(attr, class, mean, gapL, gapH, gapREL), sorted descending on gapREL = (gapL + gapH)/(2*mean). gapL = low gap, gapH = high gap.
2. Get the MT record with max gapREL: cL = mean - gapL/2, cH = mean + gapH/2, PCLASS = P_{A>cL} & P'_{A>cH} & PREM, PREM = PREM & P'CLASS.
3. Repeat 2 until all classes have a pTree.
4. Repeat 1, 2, 3 until convergence.

[Figure: bit-slice columns of the training samples (10 each of se, ve, vi) for Sepal Length, Sepal Width, Petal Length, Petal Width, with the resulting mask bits.]

Means (m) and mean gaps (mg):
  sLN: se 51 (mg 8), vi 63 (mg 7), ve 70
  sWD: ve 32 (mg 1), vi 33 (mg 2), se 35
  pLN: se 14 (mg 33), ve 47 (mg 13), vi 60
  pWD: se 2 (mg 12), ve 14 (mg 11), vi 25

MTs (cl, at), sorted descending on gR: se pWD, se pLN, ve pLN, ve pWD, vi pWD, se sLN, vi pLN, vi sLN, ve sLN, se sWD, vi sWD, ve sWD.

2. MT record with max gapREL: ve pLN.
  c = mean + gapH/2 = 47 + 13/2 = 53.5
  P_{A>c} = P3,6|(P3,5&(P3,4&(P3,3|(P3,2&(P3,1&P3,0)))))

ONLY 2 MISTAKES.

4 pTree-k-means-classification-divisive (pkmc-d)
The mask of the rows with A_j > c is computed from the bit-slice pTrees of attribute j:

  P_{A_j>c} = P_{j,m} o_m ... o_{k+1} P_{j,k},   c = b_m ... b_k ... b_0

where o_i is AND iff b_i = 1 (otherwise OR), k is the rightmost bit position of c with bit value "0", and the operations are right binding.

The Current Cluster, CC = {Class1, ..., Classm} (all classes), is represented by a pTree mask, PCC (pure1 initially). From the TrainingSet:

1. For each attribute, calculate the mean for each class in CC and sort ascending on mean. Calculate all mean_gaps = difference_of_consecutive_means. Create MeanTable(attribute, class, mean, gap), sorted descending on gap.
2. Choose and remove the MT record with maximum gap. Use P_{A>c} (c = mean + gap/2) to separate the current cluster into two clusters. The cluster masks are
     PNEWCLUSTER1 = P_{A>c} & PCC
     PNEWCLUSTER2 = P'_{A>c} & PCC
   and the new clusters are
     NEWCLUSTER1 = {all classes in CC whose mean lies above the max gap}
     NEWCLUSTER2 = {all other classes in CC}, also definable as {the class whose record had the max gap and all classes below it in CC}.
3. Repeat 2 with CC = NEWCLUSTERi (i = 1, 2) until all clusters are singleton sets of classes.
4. Repeat 1, 2, 3 until the means stop changing (much).

On the next two slides you will find a (partial) walk-through of this algorithm for a subset of the IRIS dataset. The clusters are color coded throughout with R, G, B for setosa, versicolor, virginica. I also color code the features (sepal Length, sepal Width, petal Length, petal Width). Then I take 10 samples from each class for the example.

Classes: se = setosa, ve = versicolor, vi = virginica.
Features: sLN = sepal LeNgth, sWD = sepal WiDth, pLN = petal LeNgth, pWD = petal WiDth.

Initial means (m) and mean gaps (mg), sorted ascending on mean (step 1):
  sLN: se 51 (mg 12), vi 63 (mg 7), ve 70
  sWD: ve 32 (mg 1), vi 33 (mg 2), se 35
  pLN: se 14 (mg 33), ve 47 (mg 13), vi 60
  pWD: se 2 (mg 12), ve 14 (mg 11), vi 25
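The divisive procedure on this slide can be sketched at the level of class means only. This is a simplified illustration, not the presenter's code: it works on the table of class means instead of pTree masks, it recomputes the gaps among the classes still in CC, and the names `MEANS` and `divisive_cuts` are hypothetical.

```python
# Class means per attribute, copied from the initial-means table.
MEANS = {
    "sLN": {"se": 51, "vi": 63, "ve": 70},
    "sWD": {"ve": 32, "vi": 33, "se": 35},
    "pLN": {"se": 14, "ve": 47, "vi": 60},
    "pWD": {"se": 2, "ve": 14, "vi": 25},
}


def divisive_cuts(means, classes):
    """Recursively split `classes` at the largest mean gap.

    Returns a list of (attribute, cutpoint, low_classes, high_classes),
    where cutpoint = mean + gap/2 for the record with the maximum gap.
    """
    if len(classes) < 2:
        return []                      # singleton cluster: nothing to split
    best = None
    for attr, by_class in means.items():
        ordered = sorted((by_class[c], c) for c in classes)
        for i in range(len(ordered) - 1):
            gap = ordered[i + 1][0] - ordered[i][0]
            if best is None or gap > best[0]:   # ties keep the first record found
                low = [c for _, c in ordered[: i + 1]]
                high = [c for _, c in ordered[i + 1:]]
                best = (gap, attr, ordered[i][0] + gap / 2, low, high)
    _, attr, cut, low, high = best
    # Split the part above the cut first, then the part below it.
    return ([(attr, cut, low, high)]
            + divisive_cuts(means, high)
            + divisive_cuts(means, low))
```

On the means above, the first cut falls on pLN at 30.5 ({se} versus {ve, vi}) and the second on pLN at 53.5 ({ve} versus {vi}), before any rounding of the cut points.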

5 pkmc-d: first split (CC = all 3 classes, mask PCC = pure1)
1. MT(attr, class, mean, gap), sorted descending on gap.
2. Use P_{A>c} (c = mean + gap/2) to separate CC into 2: PNEWCLUSTER1 = P_{A>c} & PCC, PNEWCLUSTER2 = P'_{A>c} & PCC.
3. Repeat 2 with CC = NEWCLUSTERi (i = 1, 2) until all are singletons.
4. Repeat 1, 2, 3 until the means stop changing.

[Figure: bit-slice columns of the training samples for Sepal Length, Sepal Width, Petal Length, Petal Width, with the masks PNEW1 = P_{A>31} and PNEW2 = P'_{A>31}.]

Means (m) and gaps:
  sLN: se 51 (gap 12), vi 63 (gap 7), ve 70
  sWD: ve 32 (gap 1), vi 33 (gap 2), se 35
  pLN: se 14 (gap 33), ve 47 (gap 13), vi 60
  pWD: se 2 (gap 12), ve 14 (gap 11), vi 25

MT (cl, at, mn, gap). I'm not using relative gaps this time. Note also that there is only 1 entry for each gap, not 2 for each mean.
  se sLN 51 12
  se pLN 14 33
  se pWD  2 12
  vi sLN 63  7
  vi sWD 33  2
  ve sWD 32  1
  ve pLN 47 13
  ve pWD 14 11

2. MT record with max gap: se pLN 14 33. It separates {ve, vi} from {setosa} using petal Length.
  c = mean + gap/2 = 14 + 33/2 = 30.5, rounded up to 31 (applied roof)
  P_{A>31} = P3,6 | P3,5

PNEW2 is done (the cluster is the singleton set {setosa}). We need to further partition PNEW1 (the cluster is {versicolor, virginica}).

6 pkmc-d: second split (CC = {versicolor, virginica})
1. MT(attr, class, mean, gap), sorted descending on gap.
2. Use P_{A>c} (c = mean + gap/2) to separate CC into 2: PNEWCLUSTER1 = P_{A>c} & PCC, PNEWCLUSTER2 = P'_{A>c} & PCC.
3. Repeat 2 with CC = NEWCLUSTERi (i = 1, 2) until all are singletons.
4. Repeat 1, 2, 3 until the means stop changing.

[Figure: bit-slice columns of the 10 ve and 10 vi samples for Sepal Length, Sepal Width, Petal Length, Petal Width, with the resulting mask bits.]

Means (m) and gaps:
  sLN: se 51 (gap 12), vi 63 (gap 7), ve 70
  sWD: ve 32 (gap 1), vi 33 (gap 2), se 35
  pLN: se 14 (gap 33), ve 47 (gap 13), vi 60
  pWD: se 2 (gap 12), ve 14 (gap 11), vi 25

MT (cl, at, mn, gap): se sLN 51 8; se pLN 14 33; se pWD 2 12; vi sLN 63 7; vi sWD 33 2; ve sWD 32 1; ve pLN 47 13; ve pWD 14 11.

2. MT record with max gap: ve pLN 47 13. It separates {ve} from {vi} using petal Length.
  c = mean + gap/2 = 47 + 13/2 = 53.5, rounded up to 54 (applied roof)
  P_{A>c} = P3,6|(P3,5&(P3,4&(P3,3|(P3,2&(P3,1&P3,0)))))

This is the virginica mask. There are no mistakes on versicolor and 3 mistakes on virginica (#'s 1, 6, 10). With one epoch, overall accuracy is 90%.

Ways to improve accuracy (at a slight cost in speed) include:
1. Use more than one attribute cutpoint each time.
2. Use standard deviation calculations to optimize cutpoints.
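The cut-point expressions used in these walk-throughs can be regenerated mechanically. The sketch below is an illustration only: `gt_expression` is a hypothetical helper, it assumes an integer c with 0 <= c < 2^nbits - 1 (so that c has a 0 bit), and it assumes 7-bit slices for petal Length and 5-bit slices for petal Width, as the slide expressions suggest.

```python
import math


def gt_expression(attr, c, nbits):
    """Return the right-binding AND/OR expression for P_{A>c} as a string.

    attr is the attribute index used in the slide notation P{attr},{bit}.
    """
    # k = rightmost bit position of c that holds a 0
    k = next(i for i in range(nbits) if not (c >> i) & 1)
    expr = f"P{attr},{k}"
    for i in range(k + 1, nbits):
        op = "&" if (c >> i) & 1 else "|"       # AND iff bit i of c is 1
        inner = expr if i == k + 1 else f"({expr})"
        expr = f"P{attr},{i}{op}{inner}"
    return expr


# Cut points from the walk-through, with the roof applied.
first_cut = math.ceil(14 + 33 / 2)    # se pLN record
second_cut = math.ceil(47 + 13 / 2)   # ve pLN record
```

Calling `gt_expression(3, first_cut, 7)` and `gt_expression(3, second_cut, 7)` reproduces the two petal Length expressions, and `gt_expression(4, 8, 5)` reproduces the petal Width expression used to separate setosa in pkmc-s.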

7 pkmc-d: adding ve pWD as a second cutpoint
1. MT(attr, class, mean, gap), sorted descending on gap.
2. Use P_{A>c} (c = mean + gap/2) to separate CC into 2: PNEWCLUSTER1 = P_{A>c} & PCC, PNEWCLUSTER2 = P'_{A>c} & PCC.
3. Repeat 2 with CC = NEWCLUSTERi (i = 1, 2) until all are singletons.
4. Repeat 1, 2, 3 until the means stop changing.

[Figure: bit-slice columns of the 10 ve and 10 vi samples for Sepal Length, Sepal Width, Petal Length, Petal Width, with the resulting mask bits.]

Means (m) and gaps:
  sLN: se 51 (gap 8), vi 63 (gap 7), ve 70
  sWD: ve 32 (gap 1), vi 33 (gap 2), se 35
  pLN: se 14 (gap 33), ve 47 (gap 13), vi 60
  pWD: se 2 (gap 12), ve 14 (gap 11), vi 25

MT (cl, at, mn, gap): se sLN 51 8; se pLN 14 33; se pWD 2 12; vi sLN 63 7; vi sWD 33 2; ve sWD 32 1; ve pLN 47 13; ve pWD 14 11.

To improve accuracy (at a slight cost in speed): 1. Use more than one attribute cutpoint. Using pWD as the 2nd attribute for separating ve and vi:
  MT record: ve pWD 14 11
  c = mean + gap/2 = 14 + 11/2 = 19.5, rounded up to 20 (applied roof)
  P_{A>c} = P4,4&(P4,3|(P4,2&(P4,1|P4,0)))

No improvement by including ve pWD. Next we'll try vi sLN.

8 pkmc-d: adding vi sLN as a second cutpoint
1. MT(attr, class, mean, gap), sorted descending on gap.
2. Use P_{A>c} (c = mean + gap/2) to separate CC into 2: PNEWCLUSTER1 = P_{A>c} & PCC, PNEWCLUSTER2 = P'_{A>c} & PCC.
3. Repeat 2 with CC = NEWCLUSTERi (i = 1, 2) until all are singletons.
4. Repeat 1, 2, 3 until the means stop changing.

[Figure: bit-slice columns of the 10 ve and 10 vi samples for Sepal Length, Sepal Width, Petal Length, Petal Width, with the resulting mask bits.]

Means (m) and gaps:
  sLN: se 51 (gap 8), vi 63 (gap 7), ve 70
  sWD: ve 32 (gap 1), vi 33 (gap 2), se 35
  pLN: se 14 (gap 33), ve 47 (gap 13), vi 60
  pWD: se 2 (gap 12), ve 14 (gap 11), vi 25

MT (cl, at, mn, gap): se sLN 51 8; se pLN 14 33; se pWD 2 12; vi sLN 63 7; vi sWD 33 2; ve sWD 32 1; ve pLN 47 13; ve pWD 14 11.

Using sLN as the 2nd attribute for separating ve and vi:
  MT record: vi sLN 63 7
  c = mean + gap/2 = 63 + 7/2 = 66.5, rounded up to 67 (applied roof)
  P_{A>c} = P1,6&(P1,5|(P1,4|(P1,3|P1,2)))

No improvement by including vi sLN.

9 pkmc-d: adding vi sWD as a second cutpoint
1. MT(attr, class, mean, gap), sorted descending on gap.
2. Use P_{A>c} (c = mean + gap/2) to separate CC into 2: PNEWCLUSTER1 = P_{A>c} & PCC, PNEWCLUSTER2 = P'_{A>c} & PCC.
3. Repeat 2 with CC = NEWCLUSTERi (i = 1, 2) until all are singletons.
4. Repeat 1, 2, 3 until the means stop changing.

[Figure: bit-slice columns of the 10 ve and 10 vi samples for Sepal Length, Sepal Width, Petal Length, Petal Width, with the resulting mask bits.]

Means (m) and gaps:
  sLN: se 51 (gap 8), vi 63 (gap 7), ve 70
  sWD: ve 32 (gap 1), vi 33 (gap 2), se 35
  pLN: se 14 (gap 33), ve 47 (gap 13), vi 60
  pWD: se 2 (gap 12), ve 14 (gap 11), vi 25

MT (cl, at, mn, gap): se sLN 51 8; se pLN 14 33; se pWD 2 12; vi sLN 63 7; vi sWD 33 2; ve sWD 32 1; ve pLN 47 13; ve pWD 14 11.

Using sWD as the 2nd attribute for separating ve and vi:
  MT record: vi sWD 33 2
  c = mean + gap/2 = 33 + 2/2 = 34
  P_{A>c} = P2,6|(P2,5&(P2,4|(P2,3|(P2,2|(P2,1&P2,0)))))

Improvement by including vi sWD: it captures #10 as virginica, while also missing on #10 in versicolor.

10 The above methods are all pkmc methods that use the Lp distance in one dimension (the most relevant dimension, chosen by mean gaps or otherwise). I say Lp because all of these distances are identical in one dimension (just the absolute value of the value difference). To improve accuracy, we could use std-based gap measurements and pick the maximum number of gap stds each time (using Mohammad's formula for variance) rather than the gap distance, and/or we could maximize the relative gap = gap/mean measure, or #gap_stds/mean. We could use the L∞ distance on all relevant dimensions instead of just one dimension. We could use the Lp distance on all relevant dimensions (L1 and L2 using Mohammad's formulas).
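The remark that every Lp distance collapses to the absolute difference in one dimension can be checked with a few lines of Python. This is a generic sketch; `lp_distance` is a hypothetical helper and has no connection to Mohammad's formulas, which are not given in this transcript.

```python
import math


def lp_distance(x, y, p):
    """Lp distance between two equal-length sequences; p may be math.inf."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


# One-dimensional case: the ve and vi petal Length means from the slides.
one_dim = [lp_distance([47], [60], p) for p in (1, 2, 3, math.inf)]

# Two-dimensional case (pLN, pWD): the distances now depend on p.
two_dim = [lp_distance([47, 14], [60, 25], p) for p in (1, 2, math.inf)]
```

Every entry of `one_dim` equals |47 - 60| up to floating-point rounding, whereas the entries of `two_dim` differ, which is why the choice of p only matters once more than one dimension is used.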

11 Here I (Mohammad) am giving another algorithm to calculate the summation of squared values using p-trees.
If we only need the summation of the squared values of all the data, and do not need the p-trees of the individual squared values, then this algorithm is really easy (I believe). Suppose we have 4 values represented by 3 p-trees, and we need to calculate the squared sum:

  A   a2 a1 a0      A^2   s5 s4 s3 s2 s1 s0
  5    1  0  1       25    0  1  1  0  0  1
  6    1  1  0       36    1  0  0  1  0  0
  3    0  1  1        9    0  0  1  0  0  1
  4    1  0  0       16    0  1  0  0  0  0
                    Sum = 86

Writing av,i for bit i of the value v:

  5*5 = (a5,2*2^2 + a5,1*2^1 + a5,0*2^0) * (a5,2*2^2 + a5,1*2^1 + a5,0*2^0)
      =   (a5,2*a5,2*2^(2+2) + a5,2*a5,1*2^(2+1) + a5,2*a5,0*2^(2+0))
        + (a5,1*a5,2*2^(1+2) + a5,1*a5,1*2^(1+1) + a5,1*a5,0*2^(1+0))
        + (a5,0*a5,2*2^(0+2) + a5,0*a5,1*2^(0+1) + a5,0*a5,0*2^(0+0))
      =   (a5,2*a5,2*2^(2+2) + a5,1*a5,1*2^(1+1) + a5,0*a5,0*2^(0+0))
        + 2*(a5,2*a5,1*2^(2+1) + a5,2*a5,0*2^(2+0) + a5,1*a5,0*2^(1+0))

The other values expand in the same way, for v = 6, 3, 4:

  v*v =   (av,2*av,2*2^(2+2) + av,1*av,1*2^(1+1) + av,0*av,0*2^(0+0))
        + 2*(av,2*av,1*2^(2+1) + av,2*av,0*2^(2+0) + av,1*av,0*2^(1+0))

The four expansions are then summed column by column, e.g., the first column:

  (a5,2*a5,2 + a6,2*a6,2 + a3,2*a3,2 + a4,2*a4,2) * 2^(2+2)

All of these products are binary (1*1 = 1, 1*0 = 0*1 = 0*0 = 0) and are therefore accomplished by ANDing.
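The column-wise idea on this slide amounts to summing 2^(i+j) times the 1-count of (slice i AND slice j) over all pairs of bit positions. Here is a minimal sketch, assuming uncompressed bit columns stored as Python lists (real p-trees would supply the counts from compressed trees); the names `bit_slices` and `sum_of_squares` are hypothetical.

```python
def bit_slices(values, nbits):
    """Return slices[i] = column of bit i (0 = least significant) over all values."""
    return [[(v >> i) & 1 for v in values] for i in range(nbits)]


def sum_of_squares(slices):
    """Sum of squared values, using only ANDs of bit columns and 1-counts."""
    total = 0
    for i, col_i in enumerate(slices):
        for j, col_j in enumerate(slices):
            # Number of rows where bit i and bit j are both 1.
            count = sum(a & b for a, b in zip(col_i, col_j))
            total += (1 << (i + j)) * count
    return total


# The four example values, represented by 3 bit slices.
example_total = sum_of_squares(bit_slices([5, 6, 3, 4], 3))
```

Looping over all ordered pairs (i, j) counts each cross term twice, which supplies the factor 2 in front of the mixed products; for the example values the result is 25 + 36 + 9 + 16.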

