Slide 1: Multi-level pTrees for data tables
Given a cardinality-n table, a row predicate (e.g., a bit-slice predicate or a category map) and a row ordering (e.g., ascending on the key or, for spatial data, column/row raster, Z=Peano, or Hilbert), the sequence of predicate truth values (1/0) is the raw, or level-0, predicate map (pMap) for that table, predicate, and row order.

[Slide figure: a 15-row IRIS table (Name, SL, SW, PL, PW, Color; five rows each of setosa, versicolor, virginica) with example raw pMaps and their level-1 pMaps: pM_SL,1 for pred rem(div(SL/2)/2)=1 in the given order, with its pure1, gte50%, and gte20% stride=5 pMaps; pM_SL,0 for pred remainder(SL/2)=1 in the given table order; pM_Color=red with its gte50% stride=5 pMap; and pM_PW<7 with its gte50% stride=5 pMap. Note the potential classification power of the gte50% stride=5 pM_PW<7: it perfectly predicts setosa.]

Given a raw pMap, pM, a decomposition of it into mutually exclusive, collectively exhaustive bit intervals, and a bit-interval predicate, bip (e.g., pure1, pure0, gte50%Ones), define a level-1 pMap as the string of bip truth values generated by applying bip to the consecutive intervals of the decomposition. (When the decomposition is equiwidth, the interval sequence is fully determined by the width m>1, AKA the stride m.) We call it the bip, stride=m [level-1] pMap of pM. The combination of a raw pMap and a collection of its level-1 pMaps forms a pTree.
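A minimal Python sketch of these definitions (the helper names and the toy SL column are illustrative, not from the slides): a level-0 pMap is one predicate truth bit per row, and a level-1 pMap applies a bit-interval predicate to consecutive strides of it.

```python
def level0_pmap(column, predicate):
    """Raw (level-0) pMap: one predicate truth bit per row, in the given row order."""
    return [1 if predicate(v) else 0 for v in column]

def level1_pmap(pmap, stride, bip):
    """Level-1 pMap: apply a bit-interval predicate to consecutive equiwidth intervals."""
    return [1 if bip(pmap[i:i + stride]) else 0
            for i in range(0, len(pmap), stride)]

# Bit-interval predicates named on the slide.
pure1 = lambda bits: all(bits)
gte50 = lambda bits: 2 * sum(bits) >= len(bits)   # at least 50% ones
gte20 = lambda bits: 5 * sum(bits) >= len(bits)   # at least 20% ones

# Example: predicate remainder(SL/2)=1 on a hypothetical 15-row SL column.
SL = [5, 4, 5, 5, 4, 7, 6, 6, 7, 6, 7, 7, 6, 7, 6]
pm_SL0 = level0_pmap(SL, lambda v: v % 2 == 1)
print(level1_pmap(pm_SL0, 5, gte50))   # one bit per 5-row stride
```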
Slide 2: Level-2 pMaps
Level-2 pMaps? All non-raw pMaps so far are level-1 pMaps. If the intervals of one set of level-1 pMaps are subsets of the intervals of another set's, then the raw pMap together with all the level-1 pMaps can be viewed as a multi-level pTree.

[Slide figure: IRIS Table 2 (16 rows: Name, SL, SW, PL, PW, Color), pred rem(SL/2)=1 in the given order, showing the raw pM_SL,0 with its gte50% stride=4, stride=8, and stride=16 pMaps.]

We can define a level-2 pMap as a level-1 pMap built on another level-1 pMap (considering that level-1 pMap to be a one-column table), e.g., the level-2 gte50% stride=2 pMap of pM_gte50%,s=4,SL,0 ≡ the gte50% stride=4 pMap of pM_SL,0. Note that the resulting bit strings are different from those of level-1 pMaps built directly on the raw pMap; level-2 pMaps appear to have less utility than level-1's. The upper collection of level-1s will be called the gte50%; strides=4,8,16; SL,0 pTree, denoted pT_gte50%_s=4,8,16_SL,0.
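Continuing the sketch above (same hypothetical helpers), a level-2 pMap is just level1_pmap applied to a level-1 pMap, and a multi-level pTree is the raw pMap plus a nest of level-1s:

```python
lvl1_s4 = level1_pmap(pm_SL0, 4, gte50)    # level-1: gte50%, stride=4
lvl2_s2 = level1_pmap(lvl1_s4, 2, gte50)   # level-2: gte50%, stride=2, built on the level-1

# Multi-level pTree: the raw pMap plus level-1 pMaps whose intervals nest
# (strides 4, 8, 16), i.e., pT_gte50%_s=4,8,16_SL,0.
pTree = {0: pm_SL0}
pTree.update({s: level1_pmap(pm_SL0, s, gte50) for s in (4, 8, 16)})
```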
Slide 3: Construction [revisited with the new definitions]
[Slide figure: the gte50_pTrees11 construction example revisited with the new definitions: a raw level-0 pMap together with its level-1 gte50 stride=4 pMap and its level-1 gte50 stride=2 pMap.]
Slide 4: gte50 Satlog-Landsat, stride=64; classes: redsoil, cotton, greysoil, dampgreysoil, stubble, verydampgreysoil
[Slide figure: band-vs-band and band-vs-class rolodex plots on 0..255 axes (R vs G, R vs ir1, R vs ir2, R vs class; G vs ir1, G vs ir2, G vs class; ir1 vs ir2, ir1 vs class; ir2 vs class) with the classes marked r, c, g, d, s, v, plus per-class means and stds of R, G, ir1, ir2. For gte50 Satlog-Landsat stride=320, the start/end positions of the 320-bit strides per class are also shown.]

Note that for stride=320 the means are way off, so it will probably produce very inaccurate classification.

A level-0 pVector is a bit string with 1 bit per record. A level-1 pVector is a bit string with 1 bit per record stride, giving the truth of a predicate applied to that stride. An n-level pTree consists of level-k pVectors (k = 0, ..., n-1), all with the same predicate, such that each level-k stride is contained within one level-(k-1) stride.
Slide 5: Strong rules on the Satlog-Landsat bands
[Slide figure: per-value class-membership tables for the R, G, ir1, and ir2 bands (values 14..126) under gte50, stride=64.]

Classes: 1. redsoil, 2. cotton, 3. greysoil, 4. dampgreysoil, 5. stubble, 7. verydampgreysoil. gte50, stride=64; FT=1, CT=.95. Strong rules A→C exist (C = class set, A = interval). Confidence condition: |P_A & P_C| / |P_A| > CT. Frequency condition: |P_A| > FT. (Is frequency unimportant?)

There is no closure to help mine confident rules, but we can see there are strong rules: G[0,46]→2, G[47,64]→5, G[65,81]→{1,7}, G[81,94]→4, G[94,255]→{1,3}; R[0,48]→{1,2}, R[49,62]→{1,5}, R[63,81]→{1,4,7}, R[82,255]→3; ir1[0,88]→{5,7}, ir1[89,255]→{1,2,3,4}; ir2[0,52]→5, ir2[53,255]→{1,2,3,4,7}.

Although no closure exists, we can mine confident rules by scanning the values (one pass over 0-255) for each band. This is not too expensive. For an unclassified sample, let the rules vote (weight inversely by consequent size and directly by std % in gap, etc.). Is there any new information in 2-hop rules, e.g., R→G and G→cls? Can cls=1,3 be separated (the only classes not separated by G)? We will also try using all the strong one-class or two-class rules above.
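A sketch of that one-pass scan, assuming parallel arrays of band values and class labels (the function and variable names are illustrative): one pass tallies the class counts per band value, then each sufficiently frequent value gets the smallest class set whose coverage exceeds CT.

```python
from collections import Counter, defaultdict

def confident_rules(values, classes, CT=0.95, FT=1):
    """One pass over the data: for each band value v with |P_A| > FT,
    find the smallest class set C with |P_A & P_C| / |P_A| > CT."""
    by_value = defaultdict(Counter)
    for v, c in zip(values, classes):
        by_value[v][c] += 1
    rules = {}
    for v, counts in sorted(by_value.items()):
        n = sum(counts.values())
        if n <= FT:
            continue                     # fails the frequency condition
        chosen, covered = [], 0
        for c, k in counts.most_common():
            chosen.append(c)
            covered += k
            if covered / n > CT:
                rules[v] = set(chosen)   # smallest covering class set
                break
    return rules
```

Adjacent values carrying the same class set can then be merged into interval rules like G[0,46]→2.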
Slide 6: No association rule separates C={1} from C={3}
[Slide figure: the R-value distribution per G-value (G = 32..115) for C={1}, with class counts.]

For C={1}, note that the only difference for C={3} is at G=98, where the R-values are R=89 and R=92. Those R-values also occur in C={1}. Therefore no association rule is going to separate C={1} from C={3}.
Slide 7: FAUST on Satlog: evaluation
Satlog evaluation, using the per-class means and stds of the R, G, ir1, ir2 projections. The cut-point with pstd_r doubled:

a = pm_r + (pm_v − pm_r) · 2pstd_r / (pstd_v + 2pstd_r) = (pm_r·pstd_v + pm_v·2pstd_r) / (pstd_v + 2pstd_r)

[Slide tables: true-positive/false-positive counts per class for each method: non-oblique level-0 pure1; non-oblique level-1 gte50; oblique level-0 with the means midpoint; oblique level-0 with s1/(s1+s2); oblique level-0 with 2s1/(2s1+s2), with no elimination and with class elimination in orders 2,3,4,5,7,1 and 3,4,7,5,1,2; and band-class rule mining (below).]

Doubling pstd_r reduced the number of FPs and somewhat reduced the TPs. Better? Parameterize the 2 to maximize TPs and minimize FPs. What is the best parameter? Conclusion? MeansMidPoint and Oblique std1/(std1+std2) are best, with the Oblique version slightly better. I wonder how these two methods would work on Netflix?

For the elimination order, two ways suggest themselves: above = (std + std_up)/gap_up and below = (std + std_dn)/gap_dn, computed per band; averaging them suggests the class order 4 (2.12), 2 (2.36), 5 (4.03), 7 (4.12), 1 (4.71), 3 (5.27).

UTbl(User, M1, ..., M17770), keyed (u,m): umTrainingTbl = SubUTbl(Support(m), Support(u), m). MTbl(Movie, U1, ..., U480189), keyed (m,u): muTrainingTbl = SubMTbl(Support(u), Support(m), u).
Slide 8: Netflix data: {m_k} k=1..17770, UserTable(uID, m_1, ..., m_17770)
Netflix data structures:
UserTable(uID, m1, ..., m17770): one row per user u1, ..., u480189, holding ratings r_mh,uk; about 47B cells. UPTreeSet: its vertical form, 3*17770 bit slices wide (3 rating bits per movie).
Main file: (mID, uID, rating, date), i.e., one record (m, u, r_m,u, d_m,u) per rating, up to (m17770, u480189, r, d); averages 5,655 users per movie and 209 movies per user.
MTbl(mID, u1, ..., u480189): one row per movie m1, ..., m17770; about 47B cells. MPTreeSet: 3*480189 bit slices wide.
For a (u,m) to be predicted, form umTrainingTbl = SubUTbl(Support(m), Support(u), m). Of course, the two supports won't be tight together like that, but they are drawn that way for clarity.
Slide 9: Choosing the training subtable: SubUTbl(∩_{n∈Sup(u)} Sup(n) ∩ Sup(m), Sup(u), m)?
There are lots of 0s in the vector space umTrainingTbl. We want the largest subtable without zeros. How? SubUTbl(∩_{n∈Sup(u)} Sup(n) ∩ Sup(m), Sup(u), m)?

Using coordinate-wise FAUST (not Oblique): in each coordinate n ∈ Sup(u), divide all users v ∈ Sup(n) ∩ Sup(m) into their rating classes by rating(m,v); then 1. calculate the class means and stds and sort the means; 2. calculate the gaps; 3. choose the best gap and define a cut-point using the stds. Symmetrically: in each coordinate v ∈ Sup(m), divide all movies n ∈ Sup(v) ∩ Sup(u) into their rating classes by rating(n,u), and proceed the same way. This of course may be slow. How can we speed it up?

These gaps alone might not be the best guide (especially since the sum of the gaps is no more than 4 and there are 4 gaps). Weighting (correlation(m,n)-based) might be useful: the higher the correlation, the more significant the gap? The other issue is that these cut-points would be constructed for just this one prediction, rating(u,m); it makes little sense to find all of them. Maybe we should just find, e.g., which class mean(s) rating(u,n) is closest to and make those the votes?
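A sketch of steps 1-3 for one coordinate, assuming class_values maps each rating class k to the array of values observed in that coordinate (NumPy; names are illustrative):

```python
import numpy as np

def best_cut(class_values):
    """Sort class means, find the largest gap, and place the cut-point in it
    proportionally to the neighboring stds (coordinate-wise FAUST)."""
    stats = sorted((v.mean(), v.std(), k)
                   for k, v in class_values.items() if len(v))
    best = None
    for (m1, s1, k1), (m2, s2, k2) in zip(stats, stats[1:]):
        gap = m2 - m1
        cut = m1 + gap * s1 / (s1 + s2) if s1 + s2 > 0 else m1 + gap / 2
        if best is None or gap > best[0]:
            best = (gap, cut, k1, k2)
    return best   # (gap, cut-point, class below, class above)

# Example with made-up rating classes:
classes = {k: np.random.normal(k, 0.3, 20) for k in (1, 2, 4, 5)}
print(best_cut(classes))
```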
Slide 10: APPENDIX: FAUST Oblique formula
APPENDIX: FAUST Oblique formula: P_{X∘d < a}, where X is any set of vectors (e.g., a training class), D ≡ m_v − m_r (the vector from m_r to m_v), and d = D/|D|.

To separate the r's from the v's using the means midpoint as the cut-point, calculate a as follows. Viewing m_r and m_v as vectors (e.g., m_r ≡ the vector from the origin to the point m_r):

a = ( m_r + (m_v − m_r)/2 ) ∘ d = ( (m_r + m_v)/2 ) ∘ d

What if d points away from the intersection of the cut-hyperplane (a cut-line in this 2-D case) and the d-line, as it does for class v, where d = (m_v − m_r)/|m_v − m_r|? Then a is the negative of the distance shown (the angle is obtuse, so its cosine is negative). But each v∘d is a larger negative number than a = ((m_r + m_v)/2)∘d, so we still want v∘d < ((m_v + m_r)/2)∘d.

[Slide figure: a 2-D scatter of r's and v's with m_r, m_v, the d-line, and the cut-line through the midpoint at distance a.]
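A minimal NumPy sketch of the midpoint form (illustrative names; X holds one vector per row):

```python
import numpy as np

def oblique_midpoint_mask(X, m_r, m_v):
    """P_(X o d < a) with d = (m_v - m_r)/|m_v - m_r| and
    a = ((m_r + m_v)/2) o d (means-midpoint cut-point)."""
    d = (m_v - m_r) / np.linalg.norm(m_v - m_r)
    a = (m_r + m_v) / 2 @ d
    return X @ d < a   # 1-bits mark the r side of the cut-hyperplane
```

The comparison X @ d < a is exactly the predicate mask P_{X∘d < a}; the sign discussion above is handled automatically, since both sides of the inequality go negative together.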
Slide 11: FAUST Oblique with a vector of stds: D ≡ m_v − m_r, d = D/|D|, P_{X∘d < a}
To separate r from v using the vector-of-stds cut-point, calculate a as follows. Viewing m_r and m_v as vectors:

a = ( (m_r·std_v + m_v·std_r) / (std_r + std_v) ) ∘ d

What are the stds (shown in purple on the slide)? Approach 1: for each coordinate (dimension), calculate the std of the coordinate values, and use the vector of those stds. Let's remind ourselves that the formula (Md's formula) does not require looping through the X-values; it requires only one AND program across the pTrees:

P_{X∘d < a} = P_{Σ_i d_i·X_i < a}

[Slide figure: the same 2-D r/v scatter with the std-weighted cut-line along d.]
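To see why one pass suffices: X∘d is a linear combination of the attribute columns, and each column is itself a weighted sum of its vertical bit slices. A sketch under an assumed layout (slices[i][j] is the level-0 bit vector for bit j, low to high, of attribute i):

```python
import numpy as np

def projection_from_slices(slices, d):
    """Assemble sum_i d_i * X_i from bit slices (X_i = sum_j 2^j * P_ij),
    so P_(X o d < a) is computed without any horizontal record scan."""
    n = len(slices[0][0])
    proj = np.zeros(n)
    for d_i, attr_slices in zip(d, slices):
        for j, bits in enumerate(attr_slices):   # j = bit position, low to high
            proj += d_i * (1 << j) * np.asarray(bits, dtype=float)
    return proj

# mask = projection_from_slices(slices, d) < a   gives P_(X o d < a)
```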
Slide 12: FAUST Oblique, stds of projections: D ≡ m_v − m_r, d = D/|D|, P_{X∘d<a} = P_{Σ_i d_i X_i < a}
Approach 2. To separate r from v using the stds of the projections, calculate a as follows:

a = pm_r + (pm_v − pm_r) · pstd_r / (pstd_r + pstd_v) = (pm_r·pstd_v + pm_v·pstd_r) / (pstd_r + pstd_v)

Next? Double pstd_r:

a = pm_r + (pm_v − pm_r) · 2pstd_r / (2pstd_r + pstd_v) = (pm_r·pstd_v + pm_v·2pstd_r) / (2pstd_r + pstd_v)

By pm_r we mean the distance m_r∘d, which is also mean{r∘d | r∈R}; by pstd_r, std{r∘d | r∈R}. In this case the predicted classes will overlap (i.e., a given sample point may be assigned multiple classes), so we will have to order the class predictions.

[Slide figure: the 2-D r/v scatter projected onto the d-line, marking pm_r, pm_v, and pstd_r.]
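A sketch of the Approach-2 cut-point, assuming the training classes are given as row matrices (w=1 gives the plain stds-of-projections cut; w=2 doubles pstd_r as above):

```python
import numpy as np

def faust_projection_cut(Xr, Xv, w=1.0):
    """Cut-point on the d-line between classes r and v:
    a = (pm_r*pstd_v + pm_v*w*pstd_r) / (w*pstd_r + pstd_v)."""
    m_r, m_v = Xr.mean(axis=0), Xv.mean(axis=0)
    d = (m_v - m_r) / np.linalg.norm(m_v - m_r)
    r_proj, v_proj = Xr @ d, Xv @ d
    pm_r, pm_v = r_proj.mean(), v_proj.mean()
    pstd_r, pstd_v = r_proj.std(), v_proj.std()
    a = (pm_r * pstd_v + pm_v * w * pstd_r) / (w * pstd_r + pstd_v)
    return d, a   # classify x as r when x @ d < a
```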
Slide 13: Can MYRRH classify? (4-hop)
Can MYRRH classify? (Pixel classification?) Try 4-hop using attributes of IRIS(Cls, SL, SW, PL, PW), stride=10 level-1.

[Slide figure: the 15-row stride=10 level-1 value table for SL, SW, PL, rnd(PW/10) over setosa/versicolor/virginica, with rolodex relationships R(SW, PW), S(PW, PL), T(SL, PL), and U(CLS, SL).]

Is A→C confident, for A = {3,4} (SW values) and C = {se}?

ct( (&_{pw ∈ &_{sw∈A} R_sw} S_pw) & (&_{sl ∈ &_{cls∈C} U_cls} T_sl) ) / ct( &_{pw ∈ &_{sw∈A} R_sw} S_pw ) = 1/2

(the antecedent side reaches pl = {1,2}, the class side pl = {1}).
Slide 14: 1-hop
1-hop: IRIS(Cls, SL, SW, PL, PW), stride=10 level-1.

[Slide figure: the stride=10 level-1 value table and the SW-CLS rolodex relationship R.]

For C = {se}, the 1-hop A→C with A = {sw=3,4} is more confident:

ct( R_A & (&_{cls∈{se}} R_cls) ) / ct(R_A) = 1

But what about just taking R_{class}? SW gives {3,4}→se, {2,3}→ve, {3}→vi, which is not very differentiating of class. Include the other three attributes? SL: {4,5}→se, {5,6}→ve, {6,7}→vi. PL: {1,2}→se, {3,4,5}→ve, {5,6}→vi. PW: {0}→se, {1,2}→ve, {1,2}→vi.

These rules were derived from the binary relationships only. A minimal decision tree classifier suggested by the rules: if PW=0 then se; else if PL∈{3,4} and SW=2 and SL=5 then ve; else if 2 of 3 of (PL∈{3,4,5}, SW∈{2,3}, SL∈{5,6}) then ve; else vi. I was hoping for a "Look at that!" but it didn't happen ;-)
Slide 15: 2-hop
2-hop, stride=10 level-1.

[Slide figure: the stride=10 level-1 value table with rolodex relationships T(SL, PL) and U(CLS, SL).]

For C = {se} (sl = {4,5}):

ct( (OR_{pl∈A} T_pl) & (&_{cls∈C} U_cls) ) / ct(OR_{pl∈A} T_pl) = 1

Mine out all confident se-rules with minsup = 3/4. Closure: if A→{se} is non-confident and P_A ⊄ U_se, then B→{se} is non-confident for all B ⊇ A. So, starting with singleton A's:
ct(T_pl=1 & U_se) / ct(T_pl=1) = 2/2: yes. A = {1,3}, {1,4}, {1,5}, or {1,6} will yield non-confidence with P_A ⊄ U_se, so all their supersets will yield non-confidence.
ct(T_pl=2 & U_se) / ct(T_pl=2) = 1/1: yes. A = {1,2} will yield confidence.
ct(T_pl=3 & U_se) / ct(T_pl=3) = 0/1: no. A = {2,3}, {2,4}, {2,5}, or {2,6} will yield non-confidence, but the closure property does not apply.
ct(T_pl=4 & U_se) / ct(T_pl=4) = 0/1: no.
ct(T_pl=5 & U_se) / ct(T_pl=5) = 1/2: no.
ct(T_pl=6 & U_se) / ct(T_pl=6) = 0/1: no. Etc.

I conclude that this closure property is just too weak to be useful. It also appears from this example that trying to use MYRRH to do classification (at least in this way) is not productive.
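A sketch of the singleton scan with OR-antecedents and the superset prune, assuming T maps each pl value to its level-1 bit vector and U_se is the class bit vector (illustrative names):

```python
import numpy as np

def or_support(T, A):
    """P_A: OR of the T_pl bit vectors over pl in A."""
    mask = np.zeros_like(next(iter(T.values())), dtype=bool)
    for pl in A:
        mask |= T[pl]
    return mask

def confident(T, A, U_se, CT=1.0):
    """ct(P_A & U_se) / ct(P_A) >= CT (false on empty support)."""
    sup = or_support(T, A)
    n = int(sup.sum())
    return n > 0 and int((sup & U_se).sum()) / n >= CT

def prunable(T, pl, U_se):
    """The closure: prune all supersets of {pl} when {pl} is non-confident
    and T_pl is not contained in U_se."""
    return not confident(T, [pl], U_se) and bool((T[pl] & ~U_se).any())
```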
Slide 16: Collaborative filtering
Collaborative filtering, AKA customer preference prediction, AKA Business Intelligence, is critical for on-line retailers (Netflix, Amazon, Yahoo, ...). It's just classical classification: based on a rating-history training set, predict how customer c would rate item i.

[Slide figure: the TrainingSet(C, I, Rating) drawn three ways: as a multi-hop relationship model, as a rolodex relationship model, and as a binary relationship model, with one 0/1 relationship k(C, I) per rating value k = 1, ..., 5.]

Use relationships to find "neighbors" to predict rating(c=3, i=5)? Find all customers whose rating history is similar to that of c=3: for each rating k = 1, 2, 3, 4, 5, find all other customers who give rating k to the movies that c=3 gives rating k to, using the customer pTrees of the relationship k(C, I). Then take the intersection of those k-customer-sets over all k, and let the resulting customers vote on, or predict, rating(c=3, i=5).
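A sketch of that neighbor search over a dense ratings matrix (illustrative; R is customers x items with 0 meaning unrated, standing in for the five k(C, I) relationships):

```python
import numpy as np

def neighbors(R, c):
    """For each rating k, keep customers who rate k on every item that
    customer c rates k; intersect those sets over k = 1..5."""
    keep = np.ones(R.shape[0], dtype=bool)
    for k in range(1, 6):
        items_k = np.where(R[c] == k)[0]              # items c rated k
        if items_k.size:
            keep &= (R[:, items_k] == k).all(axis=1)  # match c on those items
    keep[c] = False                                   # exclude c itself
    return np.where(keep)[0]

# Predicted rating(c, i): let the neighbors vote, e.g. the mean of
# R[neighbors(R, c), i] over those neighbors who actually rated i.
```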
Slide 17: Level-2 50%, stride=640; classes: redsoil, cotton, greysoil, dampgreysoil, stubble, verydampgreysoil
[Slide figure: the level-2 50%, stride=640 rolodex plots on 1..7 axes: R vs G, R vs ir1, R vs ir2, R vs class; G vs ir1, G vs ir2, G vs class; ir1 vs ir2, ir1 vs class; ir2 vs class, with the classes marked r, cl, c, g, d, s, v.]
Slide 18: pTrees: predicate Tree technology
pTrees (predicate Tree) technologies provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures.

1st, Vertical Processing of Horizontal Data (VPHD): e.g., find the number of occurrences of (7,0,1,4) in R(A1, A2, A3, A4). For horizontally structured, record-oriented data, one must scan vertically.

[Slide figure: the table R(A1, A2, A3, A4) in base 10 and base 2, projected into 4 attribute files and then into 12 bit-slice files R11, R12, R13, R21, ..., R43.]

predicate Trees (pTrees): project each attribute (now 4 files), then vertically slice off each bit position (now 12 files), then compress each bit slice into a tree using the predicate. E.g., the compression of R11 into P11 goes as follows. Record the truth of the predicate "pure1" = "all 1s" in a tree, recursively on halves, until the half is pure:
1. Whole slice pure1? false = 0.
2. Left half pure1? false = 0. But it's pure0, so this branch ends.
3. Right half pure1? false = 0.
4. Left half of the right half pure1? false = 0.
5. Right half of the right half pure1? true = 1.

2nd, using pTrees, find the number of occurrences of (7,0,1,4) = (111, 000, 001, 100): count the 1-bits of P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43, weighting each result node by its stride (*2^3, *2^2, *2^1, *2^0); the count is 2.
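A minimal sketch of that top-down construction (illustrative nested-tuple representation; the example slice is hypothetical): a node records 1 only when its whole half is 1s, a pure0 half ends its branch, and only mixed halves recurse.

```python
def build_pure1(bits):
    """Top-down pure1 pTree over halves, as in the P11 example."""
    if all(bits):
        return 1                       # pure1 half: record 1, stop
    if not any(bits):
        return 0                       # pure0 half: record 0, branch ends
    mid = len(bits) // 2               # mixed: record 0 and recurse on halves
    return (0, build_pure1(bits[:mid]), build_pure1(bits[mid:]))

R11 = [0, 0, 0, 0, 1, 0, 1, 1]         # hypothetical 8-bit slice
print(build_pure1(R11))                 # (0, 0, (0, (0, 1, 0), 1))
```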
Slide 19: Counting with pTrees; bottom-up construction
[Slide figure: the 12 bit slices R11, ..., R43 and their pTrees P11, ..., P43, with one value changed in R11.]

In the AND computation P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43 (to count occurrences of (7,0,1,4)): certain 1s, together with 0s that become 1s when complemented, make a node 1; a terminal 0 makes an entire left branch 0, so there is no need to look at the other operands there; other 0s make their node 0. The result has its only 1-bit at the 2^1 level, so the 1-count = 1 * 2^1 = 2.

Top-down construction of basic pTrees is best for understanding, but bottom-up is much faster (one pass across the slice). Bottom-up construction of the 1-dimensional P11 is done using an in-order traversal, collapsing pure siblings as we go: when both siblings are pure0 (or both pure1), collapse them into the parent.
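A sketch of that bottom-up pass (illustrative representation, assuming a power-of-2 slice length): pure sibling pairs collapse to a single parent bit, mixed pairs stay expanded as tuples.

```python
def collapse(level):
    """One upward step: collapse pure sibling pairs, keep mixed pairs."""
    out = []
    for i in range(0, len(level), 2):
        a, b = level[i], level[i + 1]
        out.append(a if a == b and a in (0, 1) else (a, b))
    return out

def build_bottom_up(bits):
    """Bottom-up pTree: repeatedly collapse until one root node remains."""
    level = list(bits)
    while len(level) > 1:
        level = collapse(level)
    return level[0]

print(build_bottom_up([0, 0, 0, 0, 1, 0, 1, 1]))   # (0, ((1, 0), 1))
```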
Slide 20: Multi-level strides, complements, and the (7,0,1,4) count
[Slide figure: the full AND P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43 for the (7,0,1,4) count, shown with strides 8, 4, 2, and 1 (raw), and again with separate Pure1, Pure0, and Mixed stride=4 vectors per operand.]

A Mixed pTree is the AND of the complements of the pure1 and the pure0 pTrees. Derive it or store it? Any one of Pure1, Pure0, Mixed is derivable from the other two. Store two? Store all three? If there is a single stride, it is level-1. Store complements, or derive them when needed, or process the complement set with separate code? Deriving complements is easy: the Mixed vector is unchanged under complement; just swap Pure1 and Pure0.

The count contribution of level 1 is computable directly: the contribution to the result's 1-bit count is stride * Count(& Pure1_lev1) = 2 * Count(0 1) = 2 * 1 = 2. Retrieve a level-0 (or finer-stride) segment only where the ANDed Pure1 vector has a 0-bit, and only where OR(Pure0_lev1) also has a 0-bit in that position (a 1 there means some operand is pure0, so the AND is 0 across the whole stride); and for each individual pTree, retrieve its level-0 segment only if its own Pure1_lev1 has a 0-bit there. Here, OR(Pure0_lev1) = (1 1) has no 0-bits, so no level-0 vectors need to be retrieved. The answer, then, is 2.

Binary pTrees are used here for better understanding. In reality, the stride of each level would be tuned to the processor architecture. E.g., on 64-bit processors it would make sense to have only two levels, where each level-1 bit "covers" (is the predicate truth of) a string of 64 consecutive raw bits. An alternative is to tune coverage to cache strides. In any case, we suggest storing level-0 (raw) in its entirety and building level-1 pTrees using various predicates (e.g., pure1, pure0, mixed, gt50, gt10); then, only if the tables are very, very deep, build level-2s on top of that. Last point: level-1s can be used independently of level-0s to do approximate but very fast data mining (especially using the gt50 predicate).
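A sketch of that two-level AND count (illustrative names; one level-1 Pure1/Pure0 pair and one raw vector per operand): pure1 strides are settled at level 1, pure0 strides are skipped, and only mixed strides touch the raw bits.

```python
import numpy as np

def and_count(pure1s, pure0s, raws, stride):
    """1-count of the AND of several pTrees using level-1 shortcuts."""
    p1 = np.logical_and.reduce(pure1s)   # stride is all-1s in every operand
    p0 = np.logical_or.reduce(pure0s)    # some operand is all-0s there
    total = stride * int(p1.sum())       # each pure1 stride contributes `stride` 1s
    for j in np.where(~p1 & ~p0)[0]:     # mixed strides: retrieve raw segments
        seg = np.logical_and.reduce(
            [r[j * stride:(j + 1) * stride] for r in raws])
        total += int(seg.sum())
    return total
```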
Slide 21: pTrees construction [one-time]
Construction can be done as one pass through each bit slice, as required for bottom-up construction of pure1 pTrees (likewise the gte50_pTree11 and gte100_pTree11 shown on the slide).

For gte50% trees we must record the 1-count of the stride of each inode. E.g., in binary trees, if one child = 1 and the other = 0, it could be that the 1-child is pure1 and the 0-child is just below 50% (so the parent node = 1), or that the 1-child is just above 50% and the 0-child has almost no 1-bits (so the parent node = 0). (R11 was changed on this slide, so this issue of recording 1-counts as you go is pertinent; example on the next slide.)

In the 8_4_2_1_gte50%_pTree11: is a given node 0 or 1? We need to know that the left branch's OneCount = 1 and the right branch's OneCount = 3, so this stride=8 subtree's OneCount = 4 (≥ 50%). Again: the OneCount of the left branch = 1 and of the right branch = 0, so that stride=4 subtree's OneCount = 1 (< 50%). The OneCount of a pure0 right branch is 0, but the OneCount of the left branch must still be known. Finally, recording the OneCounts as we build the tree upwards is a near-zero-extra-cost step.
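A one-pass bottom-up sketch that records those 1-counts (illustrative names; assumes each stride divides the slice length and is a multiple of the previous one): the counts, not the child bits, determine each gte50 inode, exactly because two gte50 children do not determine the parent.

```python
def gte50_levels(bits, strides=(2, 4, 8)):
    """Bottom-up gte50 pMaps at each stride, carrying 1-counts upward."""
    counts = list(bits)                   # level-0 1-counts are the raw bits
    out, width = {}, 1
    for s in strides:
        step = s // width                 # children merged per new inode
        counts = [sum(counts[i:i + step]) for i in range(0, len(counts), step)]
        out[s] = [1 if 2 * c >= s else 0 for c in counts]   # gte50 on the count
        width = s
    return out

print(gte50_levels([1, 0, 0, 1, 1, 1, 0, 1]))
# {2: [1, 1, 1, 1], 4: [1, 1], 8: [1]}
```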