FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information): presentation transcript

Slide 1: FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information).

FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters). Start with C = X (e.g., X ≡ {p1,...,pf}, the 15-pixel dataset below).
1. While an incomplete cluster C remains, find M ≡ Medoid(C) (mean, or vector of medians, or...?).
2. Pick f in C furthest from M, using S ≡ SPTreeSet(D(x,M)) (e.g., the HOBbit-furthest f: take any point from the highest-order S-slice).
3. If ct(C)/dis^2(f,M) > DT (the density threshold), C is complete; else split C wherever P ≡ PTreeSet(x o fM/|fM|) has a gap > GT (the gap threshold).
4. End while.
Notes: a. Euclidean vs. HOBbit "furthest". b. fM/|fM| vs. just fM in P. c. Find gaps by sorting P, or by the O(log n) pTree method?

Dataset X (x1, x2): p1=(1,1), p2=(3,1), p3=(2,2), p4=(3,3), p5=(6,2), p6=(9,3), p7=(15,1), p8=(14,2), p9=(15,3), pa=(13,4), pb=(10,9), pc=(11,10), pd=(9,11), pe=(11,11), pf=(7,8). (Scatter plots omitted; the second plot shows interlocking horseshoes with an outlier.)

Walkthrough: C2 = {p5} is complete (a singleton, hence an outlier). C3 = {p6,pf} will split (details omitted) into {p6} and {pf}, both complete (outliers). That leaves C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1) ≈ .5 > DT = .3?), thus C1 is complete. Applying the algorithm to C4 (means M0 = (8.3, 4.2) and M1 = (6.3, 3.5); distance tables omitted): {pa} is an outlier, and the remaining subcluster splits into {p9} and {pb,pc,pd}, complete. With f1 = p3, C1 does not split (complete).

In both cases these are probably the best "round" clusters, so the accuracy seems high. The speed will be very high!
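A minimal Python sketch of the fmg loop above, assuming Euclidean distance, the mean as the medoid, and a plain sort to find projection gaps (the pTree machinery is replaced by arrays; DT, GT, and all names are illustrative, not the authors' code):

```python
import numpy as np

def faust_cluster_fmg(X, DT=0.3, GT=4.0):
    """Furthest-to-mean-gaps clustering (sketch). X: (n, 2) array of distinct points."""
    complete, work = [], [np.arange(len(X))]
    while work:
        C = work.pop()
        if len(C) == 1:                       # singleton: complete (outlier)
            complete.append(C); continue
        M = X[C].mean(axis=0)                 # medoid = mean here
        d = np.linalg.norm(X[C] - M, axis=1)
        f = X[C][d.argmax()]                  # furthest point from M
        if len(C) / d.max() ** 2 > DT:        # dense enough: complete
            complete.append(C); continue
        u = (f - M) / np.linalg.norm(f - M)   # unit vector fM/|fM|
        proj = (X[C] - M) @ u                 # projections onto the fM line
        order = np.argsort(proj)
        cuts = np.flatnonzero(np.diff(proj[order]) > GT)
        if len(cuts) == 0:                    # no gap: declare complete (terminates)
            complete.append(C); continue
        for part in np.split(order, cuts + 1):
            work.append(C[part])              # one new cluster per gap interval
    return complete

clusters = faust_cluster_fmg(np.array(
    [[1,1],[3,1],[2,2],[3,3],[6,2],[9,3],[15,1],[14,2],
     [15,3],[13,4],[10,9],[11,10],[9,11],[11,11],[7,8]]))
```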

Slide 2: FAUST CLUSTER-fmg: the O(log n) pTree method for finding P-gaps, with P ≡ ScalarPTreeSet(x o fM/|fM|).

The HOBbit furthest-point list is {p1}; pick f = p1. dens(C) = 16/8^2 = 16/64 = .25. The projection values x o Up1M lie in [0,15] and are held as 4 bit-slice pTrees P3, P2, P1, P0 (bit columns omitted). Counting points per dyadic interval by ANDing slices:

Width 8: P3 = [8,15], ct = 10; P3' = [0,7], ct = 5.
Width 4: P3'&P2' = [0,3], ct = 3; P3'&P2 = [4,7], ct = 2; P3&P2' = [8,11], ct = 2; P3&P2 = [12,15], ct = 8.
Width 2: [0,1] ct = 1; [2,3] ct = 2; [4,5] ct = 1; [6,7] ct = 1; [8,9] ct = 1; [10,11] ct = 1; [12,13] ct = 3; [14,15] ct = 4.
Width 1: 0:0, 1:1, 2:0, 3:2, 4:0, 5:0, 6:1, 7:0, 8:0, 9:1, 10:1, 11:0, 12:0, 13:4, 14:2, 15:2.

Gaps occur at each zero-count value. Get a mask pTree for each cluster by ORing the pTrees between pairs of gaps. Next slide: use x o fM instead of x o UfM. If GT = 2^k, then add 0, 1, ..., 2^k - 1 and check all k of these down to level 2^k.
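A sketch of this half-interval recursion with boolean numpy columns standing in for the bit-slice pTrees: each dyadic interval's count is the population of an AND of slices, and an empty interval is a gap (names and the 4-bit width are illustrative):

```python
import numpy as np

def bit_slices(vals, nbits=4):
    """One boolean column per bit, high bit first (the pTree analogue)."""
    v = np.asarray(vals)
    return [(v >> b) & 1 == 1 for b in range(nbits - 1, -1, -1)]

def gap_intervals(vals, nbits=4):
    """Zero-count dyadic intervals found by the half-interval recursion."""
    slices, gaps = bit_slices(vals, nbits), []
    def recurse(mask, level, lo):
        width = 1 << (nbits - level)
        if not mask.any():                    # empty interval: record a gap
            gaps.append((lo, lo + width - 1)); return
        if level == nbits:                    # nonempty unit interval: a value
            return
        s = slices[level]
        recurse(mask & ~s, level + 1, lo)                # lower half
        recurse(mask & s, level + 1, lo + width // 2)    # upper half
    recurse(np.ones(len(np.asarray(vals)), dtype=bool), 0, 0)
    return gaps
```

On the projection values above this reports each zero-count interval at the coarsest level where it appears; note that a leading interval at the front of the range shows up as a "gap" too and, as slide 15 observes, is not an actual gap.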

Slide 3: FAUST CLUSTER-fmg: the O(log n) pTree method for finding P-gaps, with P ≡ ScalarPTreeSet(x o fM) (the unnormalized projection).

With f = p1, the values x o fM over X (same dataset as slide 1) are 11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83, held as bit-slice pTrees p6..p0 and their complements p6'..p0' (bit columns omitted). At the top levels there are no zero counts yet (no gaps).

First round of finding P-gaps, with GT = 2^3:
width 2^3 = 8: [000 0000, 000 0111] = [0,8), at the front of the range (not an actual gap; see slide 15);
width 2^3 = 8: [010 1000, 010 1111] = [40,48);
width 2^3 = 8: [011 1000, 011 1111] = [56,64);
width 2^4 = 16: [100 0000, 100 1111] = [64,80);
width 2^4 = 16: [101 1000, 110 0111] = [88,104).

OR the pTrees between gaps 1 and 2 for cluster C1 = {p1,p3,p2,p4}; between gaps 2 and 3 for C2 = {p5}; between gaps 3 and 4 for C3 = {p6,pf}; and the OR beyond for C4 = {p7,p8,p9,pa,pb,pc,pd,pe}.
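For concreteness, the projection column and its raw gaps can be recomputed directly (a sketch: f = p1 and M = mean(X), which reproduces the slide's values to rounding; the sort stands in for the pTree level-walk):

```python
import numpy as np

X = np.array([[1,1],[3,1],[2,2],[3,3],[6,2],[9,3],[15,1],[14,2],
              [15,3],[13,4],[10,9],[11,10],[9,11],[11,11],[7,8]])
M = X.mean(axis=0)
f = X[0]                      # f = p1
vals = X @ (M - f)            # x o fM: approx. 11, 27, 23, 34, 53, 80, ...
sv = np.sort(vals)
for a, b in zip(sv, sv[1:]):  # report raw gaps wider than GT = 8
    if b - a > 8:
        print(f"gap ({a:.0f}, {b:.0f}), width {b - a:.0f}")
```

One caveat: a raw sorted gap can exceed GT where the dyadic method sees none (here, 11 to 23), because pTree gaps must be empty intervals aligned to dyadic boundaries; the two notions agree only up to that alignment.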

Slide 4: FAUST CLUSTER-fmg: the HOBbit-distance-based pTree method for finding P-gaps, P ≡ ScalarPTreeSet(x o fM). Recursing on the slide-3 clusters (intermediate D(x,M), x o fM, and bit-column tables omitted):

C11 = {p1,p2,p3,p4} is dense, so complete. C12 = {p5} is a singleton, so complete (an outlier).
C211 = {p6} is complete, so an outlier. C212 = {p7,p8} is dense, so complete.
C222 = {pc,pe} is dense, so complete. C2212 = {pf} is complete, so an outlier.

We note, however, that this run yields p9, pb, and pd as outliers when they are not.

Slide 5: FAUST CLUSTER-ffd (furthest-to-furthest, divisive version; this and the next two slides). Initially there is one incomplete cluster, C = X.

0. While there remains an incomplete cluster C:
1. Pick f, a point in C furthest from the medoid M.
2. If density(C) ≡ count(C)/d(M,f)^2 > DT, declare C complete; else pick g, a point in C furthest from f, and do step 3.
3. (Fission step) Replace M with two new medoids: the points on the fg line segment the fraction h in from the endpoints f and g. Divide C into two new clusters by assigning each point of C to the closer of the two new medoids.

Here medoid = mean, h = 1, DT = 1.5. (Scatter plot of X and the distance tables omitted; M0 = (8.6, 4.7), M1 = (3, 1.8), M2 = (11, 6.2).)

density(C0) = 16/8.4^2 = .23 < DT = 1.5, so C0 is incomplete (further fission required). With f = p1 and g = pe, C1 = the points closer to f than to g, C2 = the rest. density(C1) = 5/3^2 = .55 < DT, so C1 is incomplete. Splitting C1 on f = p1 and g = p5: C12 = {p5} is a singleton, so it is complete (an outlier or anomaly); density(C11) = 4/1.4^2 = 4/2 = 2 > DT, so C11 is complete. Analysis of C2 (and its subclusters, if any) is on the next slide.

Aside: we probably would not declare the 4 points in C11 outliers, due to C11's relatively large size (4 out of 16). Reminder: we assume round clusters, and anomalousness depends on both small size and large separation distance.
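A compact Python sketch of ffd under the slide's choices (medoid = mean; h = 1, so the two fission medoids are f and g themselves; DT = 1.5; assumes distinct points, and all names are illustrative):

```python
import numpy as np

def faust_cluster_ffd(X, DT=1.5):
    """Furthest-to-furthest divisive clustering (sketch). X: (n, 2) array."""
    complete, work = [], [np.arange(len(X))]
    while work:
        C = work.pop()
        if len(C) == 1:                         # singleton: complete (outlier)
            complete.append(C); continue
        M = X[C].mean(axis=0)                   # medoid = mean
        dM = np.linalg.norm(X[C] - M, axis=1)
        f = X[C][dM.argmax()]                   # furthest from the medoid
        if len(C) / dM.max() ** 2 > DT:         # dense: declare complete
            complete.append(C); continue
        df = np.linalg.norm(X[C] - f, axis=1)
        g = X[C][df.argmax()]                   # furthest from f
        closer_f = df <= np.linalg.norm(X[C] - g, axis=1)
        work.append(C[closer_f])                # fission: two new clusters
        work.append(C[~closer_f])
    return complete
```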

Slide 6: FAUST CLUSTER-ffd applied to C2 = {p6,p7,p8,p9,pa,pb,pc,pd,pe,pf} (scatter plot and per-point distance tables omitted; M2 = (11, 6.2)).

density(C2) ≡ count/d(M2,f2)^2 = 10/6.3^2 = .25 < DT, incomplete. Split on f = p7, g = pd: C21 = the points closer to p7 than to pd, C22 = the rest. density(C21) = 5/4.2^2 = .28 < DT, incomplete; density(C22) = 5/3.1^2 = .52 < DT, incomplete.

C22 = {pb,pc,pd,pe,pf}, M22 = (9.6, 9.8). Split on f = pf, g = pe: C221 = the points closer to pe than to pf; density(C221) = 4/1.4^2 = 2.04 > DT, complete (M221 = (10.2, 10.2)). C222 = {pf} is a singleton, so complete (an outlier).

C21 = {p6,p7,p8,p9,pa}, M21 = (13.2, 2.6). Split on f = p6, g = p7: C211 = {p6}, closer to p6 than to p7, complete (an outlier); C212 = {p7,p8,p9,pa}, density = 4/1.9^2 = 1.11 < DT, incomplete (M212 = (14.2, 2.5)). Split C212 on p7 vs. pa: C2121 = {p7,p8,p9}, the points closer to p7 than to pa, density = 3/1^2 = 3 > DT, complete (M2121 = (14.6, 2)); C2122 = {pa}, complete (an outlier).

Slide 7: FAUST CLUSTER-ffd summary. With centroid = mean, h = 1, and DT = 1.5, we get 4 outliers and 3 non-outlier clusters. If DT = 1.1, then {pa} joins {p7,p8,p9}. If DT = 0.5, then {pf} also joins {pb,pc,pd,pe} and {p5} joins {p1,p2,p3,p4}. We call the overall method FAUST CLUSTER because it resembles FAUST CLASSIFY algorithmically, and k (the number of clusters) is determined dynamically.

Improvements? A better stopping condition? Is fmg better than ffd? In ffd, what if k overshoots its optimal value: add a fusion step each round? As Mark points out, having k too large can be problematic. The proper definition of outlier or anomaly is a huge question: an outlier or anomaly should be a cluster that is both small and remote. How small? How remote? What combination? Should the definition be global or local? We need to research this (and give users options and advice for their use).

Md: create f = the furthest point from M, and d(f,M), while creating D = SPTreeSet(d(x,M))? Or, as a separate procedure (sketched below), start with P = D_h (h = the high bit position), then recursively P_k <- P & D_(h-k) until P_(k+1) = 0; back up to P_k and take any of those points as f, that bit pattern being d(f,M). Note that this does not necessarily give the furthest point from M, but it gives a point sufficiently far from M (D_h alone gives a decent f, at furthest HOBbit distance). Or use HOBbit distance? Modify to get the absolute furthest point by jumping (when the AND gives zero) to P_(k+2) and continuing the AND from there.
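A sketch of that bit-slice far-point search with boolean numpy columns in place of the pTrees; this version includes the "jump past an empty AND and continue" modification, so it returns a point attaining the true maximum distance (names illustrative):

```python
import numpy as np

def far_point(dist, nbits=8):
    """Index of a point with maximal distance, found one bit slice at a time.

    dist: integer distances d(x, M); slice k (points with bit k set)
    plays the role of D_k.
    """
    d = np.asarray(dist)
    cand = np.ones(len(d), dtype=bool)      # P: start with all points
    for k in range(nbits - 1, -1, -1):      # from the high bit position down
        slice_k = (d >> k) & 1 == 1         # D_k
        if (cand & slice_k).any():          # keep the bit if the AND is nonempty,
            cand &= slice_k                 # otherwise jump to the next bit
    return int(np.flatnonzero(cand)[0])     # any survivor attains the max
```

Stopping instead at the first empty AND (and backing up one level) reproduces the cruder "sufficiently far" variant described on the slide.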

Slide 8: II. PROGRAM DESCRIPTION of the NSF Big Data RFP (NSF-12-499). Pervasive sensing and computing across natural, built, and social environments is generating heterogeneous data at unprecedented scale and complexity. Today, scientists, biomedical researchers, engineers, educators, citizens, and decision-makers live in an era of observation. Data come from disparate sources: sensor networks; scientific instruments such as medical equipment, telescopes, colliders, satellites, environmental networks, and scanners; video, audio, and click streams; financial transaction data; email, weblogs, twitter feeds, and picture archives; spatial graphs and maps; and scientific simulations and models. This plethora of data sources has given rise to a diversity of data types (temporal, spatial, or dynamic), which can be derived from structured or unstructured sources. Data may have different representation types, media formats, and levels of granularity, and may be used across multiple scientific disciplines. These new sources of data and their increasing complexity contribute to an explosion of information.

A. Broader Research Goals of BIGDATA. The potential for transformational science and engineering for all disciplines is enormous, but realizing the next frontier depends on effectively managing, using, and exploiting these heterogeneous data sources. It is now possible to extract knowledge and useful information in ways that were previously impossible, and to gain new insights in a timely manner. To understand the full spectrum of what advances in big data might mean, imagine a world where:
1. Responses to disaster recovery empower rescue workers and individuals to make timely and effective decisions and provide resources where they are most needed;
2. Complete health/disease/genome/environmental knowledge bases enable biomedical discovery and patient-centered therapy;
3. The full complement of health and medical information is available at the point of care for clinical decision-making;
4. Accurate high-resolution models support forecasting and management of increasingly stressed watersheds and ecosystems;
5. Access to data and software in an easy-to-use format is available to everyone around the globe;
6. Consumers can purchase wearable products using materials with novel and unique properties that prevent injuries;
7. The transition to sustainable chemistry and manufacturing materials has been accelerated to the point that the US leads in advanced manufacturing;
8. Consumers have the information they need to make optimal energy consumption decisions in their homes and cars;
9. Civil engineers can continuously monitor and identify at-risk man-made structures like bridges, moderate the impact of failures, and avoid disaster;
10. Students and researchers have intuitive real-time tools to view, understand, and learn from publicly available large scientific data sets on everything from genome sequences to astronomical star surveys, from public health databases to particle accelerator simulations, and their teachers and professors use student performance analytics to improve that learning; and
11. Accurate predictions of natural disasters, such as earthquakes, hurricanes, and tornadoes, enable life-saving and cost-saving preventative actions.

Opportunities abound for learning from large-scale data sets, which can provide researchers and decision-makers with information of enhanced range, quality, and depth.

Slide 9: To jump-start a national initiative in big data discovery, this solicitation focuses on the shared research interests across NIH and NSF and has four related objectives:
1. Promote new science, address key science questions, and accelerate the progress of discovery by harnessing the value of large, heterogeneous data.
2. Exploit the unique value of big data to address areas of national need, agency missions, and societal and economic challenges in all parts of society.
3. Support responsible stewardship and sustainability of data resulting from federally funded research.
4. Develop and sustain educational resources, a competent and knowledgeable workforce, and the infrastructure needed to advance data-enabled sciences and broaden participation in data-enabled inquiry and action.

BIGDATA seeks proposals that develop and evaluate core technologies and tools that take advantage of available collections of large data sets to accelerate progress in science, biomedical research, and engineering. Each proposal should include an evaluation plan (see details in the Proposal Preparation and Submission Instructions section). Proposals can focus on one or more of the following three perspectives:

1. Data collection and management (DCM). Dealing with massive amounts of often heterogeneous and complex data coming from multiple sources, such as those generated by observational systems across many scientific fields, as well as those created in transactional and longitudinal data systems across social and commercial domains, will require the development of new approaches and tools. Potential research areas include, but are not limited to:
1. New data storage, I/O systems, and architectures for continuously generated data, as well as shared and widely distributed static and real-time data;
2. Effective utilization and optimization of computing, storage, and communications resources;
3. Streaming, filtering, compressed sensing, and sufficient statistics, potentially in real time and allowing reduction of data sizes as data are generated;
4. Fault-tolerant systems that continuously aggregate and process data accurately and reliably, while ensuring integrity;
5. Novel means of automatically annotating data with semantic and contextual information (that both machines and humans can read), including curation;
6. Model discovery techniques that can summarize and annotate data as they are generated;
7. Tracking how, when, and where data are created and modified, including provenance, allowing long-lived data to provide insights in the future;
8. New designs for advanced data architectures, including clouds, addressing extreme capacity, power management, and real-time control while providing for extensibility and accessibility;
9. New architectures that reflect the structure and hierarchy of data, as well as access techniques enabling efficient parallelism in operations across a data structure or database schema;
10. Next-generation multi-core processor architectures and the next-generation software libraries that take maximum advantage of such architectures;
11. Tools for efficient archiving, querying, retrieval, and data recovery of richly structured, semi-structured, and unstructured data sets, in particular those for large transaction-intensive databases;
12. Research in software development to enable correct and effective programming of big data applications, including new programming languages, methodologies, and environments; and
13. New approaches to improve data quality, validity, integrity, and consistency, as well as methods to account for and quantify uncertainty in large data sets, including the development of data assurance processes, formal methods, and algorithms.

Slide 10: 2. Data analytics (DA). Significant impacts will result from advances in analysis, simulation, modeling, and interpretation to facilitate the discovery of phenomena, to realize the causality of events, to enable prediction, and to recommend action. Advances will allow, for example:
1. modeling of social networks and learning communities;
2. reliable prediction of consumer behaviors and preferences;
3. the surfacing of communication patterns among unknown groups at a larger, global scale;
4. extraction of meaning from textual data;
5. more effective correlation of events;
6. enhanced ability to extract knowledge from large-scale experimental and observational datasets; and
7. extraction of useful information from incomplete data.

Potential research areas include, but are not limited to:
1. Development of new algorithms, programming languages, data structures, and data prediction tools;
2. Computational models and the underlying mathematical and statistical theory needed to capture important performance characteristics of computing over massive data sets;
3. Data-driven high-fidelity modeling and simulations and/or reduced-order models enabling improved designs and/or processes for engineering industries, and direct interfacing with measurements and equipment;
4. Novel algorithmic techniques with the capability to scale to handle the largest, most complex data sets being created now and in the future;
5. Real-time processing techniques addressing the scale of continuously generated data sets, as well as real-time visualization and analysis tools that allow for more responsive and intuitive study of data;
6. Computational, mathematical, and statistical techniques for modeling physical, engineering, social, or other processes that produce massive data sets;
7. Novel applications of inverse methods to big data problems;
8. Mining techniques that involve novelty and anomaly detection, trend detection, and/or taxonomy creation, as well as predictive models, hypothesis generation, and automated discovery, including fundamentally new statistical, mathematical, and computational methods for identifying changes in massive datasets;
9. Development of data extraction techniques (e.g., natural language processing) to unlock vast amounts of information currently stored as unstructured data (e.g., text);
10. New scalable data visualization techniques and tools, which are able to illustrate the correlation of events in multidimensional data, synthesize information to provide new insights, and allow users to drill down for more refined information;
11. Techniques to integrate disparate data and translate data into knowledge to enable on-the-fly decision-making;
12. Development of usable state-of-the-art tools and theory in statistical inference and statistical learning for knowledge discovery from massive, complex, and dynamic data sets; and
13. Consideration of potential limitations, e.g., the number of possible passes over the data, energy conservation, new communication architectures, and their implications for solution accuracy.

Slide 11: 3. E-science collaboration environments (ESCE). A comprehensive "big data" cyberinfrastructure is necessary to allow broad communities of scientists and engineers access to diverse data and to the best and most usable inferential and visualization tools. Potential research areas include, but are not limited to:
- Novel collaboration environments for diverse and distant groups of researchers and students to coordinate their work (e.g., through data and model sharing, software reuse, tele-presence capability, crowdsourcing, and social networking capabilities) with greatly enhanced efficiency and effectiveness of scientific collaboration;
- Automation of the discovery process (e.g., through machine learning, data mining, and automated inference);
- Automated modeling tools to provide multiple views of massive data sets that are useful to diverse disciplines;
- New data curation techniques for managing the complex and large flow of scientific output in a multitude of disciplines;
- Development of systems and processes that efficiently incorporate autonomous anomaly and trend detection with human interaction, response, and reaction;
- End-to-end systems that facilitate the development and use of scientific workflows and new applications;
- New approaches to the development of research questions that might be pursued in light of access to heterogeneous, diverse big data;
- New models for cross-disciplinary information fusion and knowledge sharing;
- New approaches for effective data, knowledge, and model sharing and collaboration across multiple domains and disciplines;
- Securing access to data using innovative techniques to prevent excessive replication of data to external entities;
- Providing secure and controlled role-based access to centrally managed data environments;
- Remote operation, scheduling, and real-time access to distant instruments and data resources;
- Protection of privacy and maintenance of security in aggregated personal and proprietary data (e.g., de-identification);
- Generation of aggregated or summarized data sets for sharing and analysis across jurisdictional and other end-user boundaries; and
- E-publishing tools that provide unique access, learning, and development opportunities.

Slide 12: In addition to the three science and engineering perspectives on big data described above, all proposals must also include a description of how the project will build capacity. Capacity-building activities are critical to the growth and health of this emerging area of research and education. There are three broad types of capacity-building activities:
1. appropriate models, policies, and technologies to support responsible and sustainable big data stewardship;
2. training and communication strategies, targeted to the various research communities and/or the public; and
3. sustainable, cost-effective infrastructure for data storage, access, and shared services.

To develop a coherent set of stewardship, outreach, and education activities in big data discovery, each research proposal must focus on at least one capacity-building activity. Examples include, but are not limited to:
1. Novel, effective frameworks of roles and responsibilities for big data stakeholders (i.e., researchers, collaborators, research communities and institutions, funding agencies);
2. Efficient and effective data management models, considering the structure and formatting of data, terminology standards, metadata and provenance, persistent identifiers, and data quality;
3. Development of accurate cost models and structures;
4. Establishment of appropriate cyberinfrastructure models, prototypes, and facilities for long-term sustainable data;
5. Policies and processes for evaluating data value and balancing cost with value in an environment of limited resources;
6. Policies and procedures to ensure appropriate access and use of data resources;
7. Economic sustainability models;
8. Community standards, provenance tracking, privacy, and security;
9. Communication strategies for public outreach and engagement;
10. Education and workforce development; and
11. Broadening participation in big data activities.

It is expected that at least one PI from each funded project will attend a BIGDATA PI meeting in year two of the initiative to present project research findings and capacity-building or community outreach activities. Requested budgets should include funds for travel to this event. An overarching goal is to leverage all the BIGDATA investments to build a successful science and engineering community that is well trained in dealing with and analyzing big data from various sources.

Finally, a project may choose to focus its science and engineering big data project in an area of national priority, but this is optional. National Priority Domain Area Option: in addition to the research areas described above, to fully exploit the value of the investments made in large-scale data collection, BIGDATA would also like to support research in particular domain areas, especially areas of national priority, including health IT, emergency response and preparedness, clean energy, cyberlearning, material genome, national security, and advanced manufacturing. Research projects may focus on the science and engineering of big data in one or more of these domain areas while simultaneously engaging in the foundational research necessary to make general advances in "big data."

B. Sponsoring Agency Mission-Specific Research Goals. NATIONAL SCIENCE FOUNDATION: NSF intends to support excellent research in the three areas mentioned above in this solicitation. It is important to note that this solicitation represents the start of a multi-year, multi-agency initiative, which at NSF is part of the Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21).
Innovative information technologies are transforming the fabric of society and data is the new currency for science, education, government and commerce. High performance computing (HPC) has played a central role in establishing the importance of simulation and modeling as the third pillar of science (theory and experiment being the first two), and the growing importance of data is creating the fourth pillar. Science and engineering researchers are pushing beyond the current boundaries of knowledge by addressing increasingly complex questions, which often require sophisticated integration of massive amounts of highly heterogeneous data from theoretical, experimental, observational, and simulation and modeling research programs. These efforts, which rely heavily on teams of researchers, observing and sensor platforms and other data collection efforts, computing facilities, software, advanced networking, analytics, visualization, and models, lead to critical breakthroughs in all areas of science and engineering and lay the foundation for a comprehensive, research requirements-based approach to the development of NSF's Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21). Finally, NSF is interested in integrating foundational computing research with the domain sciences to make advances in big data challenges related to national priorities, such as health IT, emergency response and preparedness, clean energy, cyberlearning, material genome, national security, and advanced manufacturing.

Slide 13 (Appendix): Relative gap size on the f-g line for the fission point.

On the linked-horseshoe example (scatter plot omitted): declare 2 gaps (3 clusters), C1 = {p1,p2,p3,p4,p5,p6,p7,p8,pe,pf}, C2 = {p9,pb,pd}, and C3 = {pa} (an outlier). On C1 there are no gaps, so C1 has converged and is declared complete. On C2 there is one (relative) gap, and the two subclusters are uniform, so both are complete (skipping that analysis).

On the original example (scatter plot omitted): declare 2 gaps (3 clusters), C1 = {p1,p2,p3,p4,p5}, C2 = {p6} (an outlier), and C3 = {p7,p8,p9,pa,pb,pc,pd,pe,pf}. On C1, one gap, so declare the complete clusters C11 = {p1,p2,p3,p4} and C12 = {p5}. On C3, one gap, so declare the clusters C31 = {p7,p8,p9,pa} and C32 = {pb,pc,pd,pe,pf}. On C31, one gap, so declare the complete clusters C311 = {p7,p8,p9} and C312 = {pa}. On C32, one gap, so declare the complete clusters C321 = {pf} and C322 = {pb,pc,pd,pe}. Does this method also work on the first example? Yes.

Slide 14: FAUST CLUSTER-ffd on the "linked horseshoe" type example.

Dataset X (x1, x2): p1=(8,2), p2=(5,2), p3=(2,4), p4=(3,3), p5=(6,2), p6=(9,3), p7=(9,4), p8=(6,4), p9=(13,3), pa=(13,7), pb=(12,5), pc=(11,6), pd=(10,7), pe=(8,6), pf=(7,5). (Scatter plot and per-step distance tables omitted.)

The run: M0 = (8.1, 4.2), with max distance to M0 = 6.13, so dens(C0) = 15/6.13^2 < DT, incomplete; split on f = p3, g = pa. dens(C1) = 7/3.39^2 < DT (M1 = (5.3, 3.1)), incomplete; dens(C2) = 8/4.24^2 < DT (M2 = (12.1, 5.9)), incomplete. Split C2 on f2 = p6, g2 = pa: dens(C21) = 3/4.14^2 < DT (M21 = (10.6, 5.1)), incomplete; splitting C21 on f21 = pe, g21 = p6 gives dens(C212) = 2/.5^2 = 8 > DT, complete. Split C22 (M22 = (11.8, 5.6)) on f22 = p9, g22 = pd: dens(C221) = 2/5 < DT, incomplete, and dens(C222) = 1.04 < DT (M222 = (11.3, 6.7)), incomplete. Splitting C1 on f = p3 gives C11 and C12 (M12 = (6.4, 3), f12 = pf), and so on.

Discussion: here DT = .99 (with DT = 1.5, would everything be a singleton?). We expected FAUST to fail to find the interlocked horseshoes, but hoped that, e.g., pa and p9 would be the only singletons! Can we modify it so that it does not make almost everything an outlier (singletons, doubletons)? a. Look at the upper cluster boundary (margin width)? b. Use std-ratio boundaries? c. Other? d. Use a fusion step to weld the horseshoes back together. Next slide: gaps on the f-g line for the fission point.

Slide 15: FAUST CLUSTER-fMg (pTree walkthrough). Initially C = X.
0. While an incomplete cluster C remains: M ≡ Mean(C).
1. f = the point of C HOBbit-furthest from M.
2. If dens(C) > .3, C is complete; else split C at gaps > GT.
3. Go to 0.

With f = p1 and GT = 2^3, the projections x o fM are as on slide 3 (11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83; the bit-slice pTrees p6..p0, their complements, and the interval ANDs are omitted). First round of finding gaps:
2^3 gap [000 0000, 000 0111] = [0,7], but since it is at the front it is not actually a gap;
2^3 gap [010 1000, 010 1111] = [40,47];
2^3 gap [011 1000, 011 1111] = [56,63];
2^4 gap [100 0000, 100 1111] = [64,79];
2^4 gap [101 1000, 110 0111] = [88,103].

OR the pTrees between gaps 1 and 2 for cluster C1 = {p1,p3,p2,p4}; between gaps 2 and 3 for C2 = {p5}; between gaps 3 and 4 for C3 = {p6,pf}; and the OR beyond for C4 = {p7,p8,p9,pa,pb,pc,pd,pe}.

Slide 16: K-means: assign each point to the closest mean and increment a sum and count for mean recalculation (one scan); iterate until the stopping condition holds.

pK-means: the same, but both the assignment and the mean recalculation are done without scanning:
1. Pick K centroids {C_i}, i = 1..K.
2. Calculate the SPTreeSets D_i = D(X, C_i) (the column of distances from every x to C_i) to get P(D_i ≤ D_j) for i < j (the predicate is dis(x,C_i) ≤ dis(x,C_j)).
4. Calculate the cluster mask pTrees:
PC_1 = P(D_1≤D_2) & P(D_1≤D_3) & P(D_1≤D_4) & ... & P(D_1≤D_K)
PC_2 = P(D_2≤D_3) & P(D_2≤D_4) & ... & P(D_2≤D_K) & ~PC_1
PC_3 = P(D_3≤D_4) & ... & P(D_3≤D_K) & ~PC_1 & ~PC_2
...
PC_K = ~PC_1 & ~PC_2 & ... & ~PC_(K-1)
5. Calculate the new centroids C_i = Sum(X & PC_i)/count(PC_i).
6. If the stopping condition is false, start the next iteration with the new centroids.
Note: in step 2, Md's two's-complement formulas can be used to get the mask pTrees P(D_i ≤ D_j), or FAUST (using Md's dot-product formula) can be used. Is one faster than the other?

pKl-means ("pK-less means", pronounced "pickle means"): compute the masks for all K at once.
4'. For K = 2..n:
PC_1K = P(D_1≤D_2) & P(D_1≤D_3) & P(D_1≤D_4) & ... & P(D_1≤D_K)
PC_2K = P(D_2≤D_3) & P(D_2≤D_4) & ... & P(D_2≤D_K) & ~PC_1
...
PC_K = P(X) & ~PC_1 & ... & ~PC_(K-1)
6'. If there is a k such that the stopping condition holds, stop and choose that k; else start the next iteration with the new centroids.
3.5'. Continue with certain k's only (e.g., the top t, ranked by: a. the sum of cluster diameters (use the max/min of D(Cluster_i, C_j), or D(Cluster_i, Cluster_j)); b. the sum of the diameters of the cluster gaps (use D(listPC_i, C_j) or D(listPC_i, listPC_j)); c. other?).

Fusion: check for clusters that should be fused; fusing decreases k. 1. Fuse an empty cluster with any other and reduce k (this is probably assumed in all k-means methods, since it has no mean). 2. For some a > 1, if max(D(CLUST_i, C_j)) < a*D(C_i, C_j) and max(D(CLUST_j, C_i)) < a*D(C_i, C_j), fuse CLUST_i and CLUST_j. Is the average better than the max?

Fission: split a cluster (increase k) if a. its mean and vom are quite far apart, or b. it is sparse, i.e., max(D(CLUS, C))/count(CLUS) < T. (Pick the fission centroid y at max distance from C, then z at max distance from y: diametric opposites in C.)

Outliers: sort PTreeSet(dis(x, X-x)) descending; this ranks singleton-outlier-ness. Or take the global medoid C and increase r until ct(dis(x, Disk(C,r))) > ct(X) - n, then declare the complement outliers. Or loop over x once; that algorithm is O(n), versus O(n^2) horizontally (for each x, find dis(x,y) for every y ≠ x: O(n(n-1)/2) = O(n^2)). Or predict C so that it is not X-x but a fixed subset? Or create a three-column "distance table" DIS(x, y, d(x,y)) (limited to distances below a threshold?), where dis(x,y) is a PTreeSet of those distances. If we have DIS as a PTreeSet both ways (one set of "y-pTrees" and another of "x-pTrees"), then the y's close to x are in its cluster; if that set is small and the next larger d(x,y) is large, the x-cluster members are outliers.
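A sketch of one pK-means round with boolean numpy columns standing in for the mask pTrees (the distance columns D_i are computed directly here rather than by pTree arithmetic; all names are illustrative):

```python
import numpy as np

def pk_means_step(X, centroids):
    """One pK-means iteration: mask-based assignment, then mean update."""
    D = np.stack([np.linalg.norm(X - c, axis=1) for c in centroids])
    K, n = D.shape
    masks, taken = [], np.zeros(n, dtype=bool)
    for i in range(K):
        # PC_i = AND over j > i of P(D_i <= D_j), minus points already claimed
        pc = np.all(D[i] <= D[i + 1:], axis=0) & ~taken
        masks.append(pc)
        taken |= pc
    new_centroids = np.array([X[m].mean(axis=0) if m.any() else c
                              for m, c in zip(masks, centroids)])
    return masks, new_centroids
```

Iterating pk_means_step until the centroids stop moving reproduces K-means; in the pTree version the comparisons P(D_i ≤ D_j) and the counts come from one horizontal AND/OR program per cluster instead of per-row scans.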

Slide 17: Mark Silverman: I start randomly; it converges in 3 cycles. Here I increase k from 3 to 5. The 5th centroid could not find a member (at (0,0)); the 4th centroid picks up 2 points that look remarkably anomalous. (Treeminer, Inc., (240) 389-0750.)

WP: Start with a large k? Each round, "tidy up" by fusing pairs of clusters where max(P(dis(CLUS_i, C_j))) < dis(C_i, C_j) and max(P(dis(CLUS_j, C_i))) < dis(C_i, C_j)? Eliminate empty clusters and reduce k. (Is the average better than the max here?)

Mark: Curious about one other state it converges to. It seems that when we exceed the optimal k, there is some instability.

WP: Tidying up would fuse Series4 and Series3 into Series34, then calculate centroid34; next, fuse Series34 and Series1 into Series134 and calculate its centroid. Also: each round, split a cluster (create a second centroid) if its mean and vector_of_medians are far apart (a second go at this mitosis, based on the density of the cluster: if a cluster is too sparse, split it). A pTree (no-looping) sparsity measure: max(dis(CLUSTER, CENTROID))/count(CLUSTER).

Slide 18: FAUST CLASSIFY, d versions (dimensional versions: mm, dens, mmd, ...). (Projection plots omitted.)

mm: Project onto one dimension at a time. Initially no clusters are determined; choosing dim1 gives 3 clusters: {r1,r2,r3,O}, {v1,v2,v3,v4}, {0}.
1.a: When d(mean, median) > c*width, declare a cluster.
1.b: Run the same algorithm on the subclusters. Declare {r1,r2,r3,O}. Next, declare {0,v1} or {v1,v2}? Take {v1,v2} (on the median side of the mean); that makes {0} a cluster (an outlier, since it is a singleton). Continuing with {v1,v2}: declare {v1,v2,v3,v4}. We have to loop, but perhaps not on the next m projections if they are close? Doubletons can be skipped, since their mean always equals their median.

dens: 2.a: Declare a cluster when density > Density_Threshold (density ≡ count/size).

mmd: Use whichever of criteria 1.a and 2.a triggers first to declare clusters.

Alg4: Calculate the mean and the vom; do 1.a or 1.b on the line connecting them; repeat on each cluster. Use another line? How to adjust the projection lines, and what stopping condition?

Alg5: Project onto the mean-vom line; e.g., mean = (6.3, 5.9), vom = (6, 5.5), with (11,10) an outlier. 4.b: a perpendicular line?

Oblique: a grid of oblique direction vectors; e.g., for 3D, a direction vector from each PTM triangle. With projections onto those lines, do 1 or 2 above. The order can be any sphere grid S_n ≡ {x = (x_1,...,x_n) in R^n | sum of x_i^2 = 1}, in polar coordinates. Lexicographic polar coordinates? 180^n is too many; use, e.g., 30-degree units, giving 6^n vectors for dimension n. Attribute relevance is important!

A 3-D example (points omitted): mean = (8.18, 3.27, 3.73), vom = (7,4,3). 2. (9,2,4) is determined to be an outlier cluster. 3. Using the red dim line, (7,5,2) is an outlier cluster; the maroon points are determined to be a cluster, and the purple points too. 3.a: Would using mean-vom again determine the same? Another option: use a p-Kmeans approach, perhaps with K = 2 and divisive (using a GA mutation at various times to get off a non-convergent track)?

Notes: Each round, reduce the dimension by one (a lower bound on the loop). Each round we just need a good line (in the remaining hyperplane) on which to project the cluster so far: 1. pick the line through the projected mean and vom (the vom depends on the basis used; is there a better way?); 2. pick the line through the longest diameter (or a diameter ≥ half the previous diameter?); 3. try a direction vector, then hill-climb it in the direction of increasing diameter of the projected set.
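A sketch of the 1.a trigger on a one-dimensional projection (plain numpy; the constant c and the "peel on the median side of the mean" policy are my reading of the slide, so treat them as assumptions):

```python
import numpy as np

def mm_trigger(proj, c=0.1):
    """1.a: flag a projected set whose mean and median separate by more
    than c times its width, i.e., it is skewed enough to peel a cluster."""
    proj = np.asarray(proj, dtype=float)
    width = proj.max() - proj.min()
    return width > 0 and abs(proj.mean() - np.median(proj)) > c * width
```

In the mm loop this test drives the recursion: when it fires, the points on the median side of the mean are peeled off as a candidate cluster and both parts are re-tested; doubletons are skipped since their mean equals their median.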

Slide 19: FAUST CLASSIFY, oblique version (our best classifier?).

Training ≡ placing cut-hyperplanes (CHPs), each an (n-1)-dimensional hyperplane cutting the space in two. Classification is then one horizontal program (AND/OR) across the pTrees, giving a mask pTree for each entire predicted class (all unclassified samples at a time).

Separate class R from class V using the midpoint-of-means method: let D ≡ m_R - m_V and d = D/|D|. Calculate a at the midpoint of the means: a = (m_R + (m_V - m_R)/2) o d = ((m_R + m_V)/2) o d (this works also if D = m_V - m_R). Then P_R = P(X o d < a): one pass gives the class-R pTree.

Accuracy improvement? Consider the dispersion within the classes when placing the CHP:
1. vectors_of_medians: represent each class by its vom rather than its mean m_V, where vom_V ≡ (median{v_1 | v in V}, median{v_2 | v in V}, ...).
2. midpt_std and vom_std methods: project each class onto the d-line; calculate the std of the distances v o d from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between m_R and m_V).

Note: training (finding a and d) is a one-time process. If we do not have training pTrees, we can use horizontal data for a and d (one time), then apply the formula to the test data (as pTrees).
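A numpy sketch of the oblique classifier: the midpoint-of-means cut, plus a std-ratio placement as one plausible reading of the midpt_std idea (shifting the cut toward the tighter class); the function names and the blob data are illustrative:

```python
import numpy as np

def train_oblique(R, V):
    """One-time training on horizontal data: cut direction d and threshold a."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    d = (mR - mV) / np.linalg.norm(mR - mV)      # d = D/|D|, D = mR - mV
    sR, sV = (R @ d).std(), (V @ d).std()        # per-class spread along d
    # std-ratio placement; sR == sV gives the plain midpoint ((mR + mV)/2) o d
    a = mV @ d + (mR @ d - mV @ d) * sV / (sR + sV)
    return d, a

def classify_R(X, d, a):
    """One pass over the samples: mask of points on class R's side of the cut."""
    return X @ d > a              # mR projects above a since d points toward R

# Illustrative use on two Gaussian blobs standing in for classes R and V:
rng = np.random.default_rng(0)
R = rng.normal([5, 5], 0.5, (50, 2))
V = rng.normal([1, 1], 1.5, (50, 2))
d, a = train_oblique(R, V)
mask_R = classify_R(np.vstack([R, V]), d, a)
```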

Slide 20: What ordering is best for spherical data (e.g., data sets of astronomical bodies on the celestial sphere, which shares its origin and equatorial plane with the Earth but has no radius)?

The Hierarchical Triangle Mesh (HTM) orders its recursive equilateral triangulations as 1; 1,0; 1,1; 1,2; 1,3; 1,1,0; 1,1,1; 1,1,2; 1,1,3; ... (the HTM sub-triangle ordering).

The PTree Triangular Mesh (PTM) ordering instead peels from the south pole to the north pole along the quadrant great circles and the equator: PTM_LLRR_LLRR_LR... Level 2 follows the level-1 LLRR pattern with another LLRR pattern, and level 3 follows level 2 with LR where the level-2 pattern is L and RL where it is R.

Theorem: for every n, does there exist an n-sphere-filling (n-1)-sphere? Corollary: there is a sphere-filling circle (a 2-sphere-filling 1-sphere).

Proof of the corollary: let C_n be the level-n circle and C ≡ lim (as n goes to infinity) of C_n; then C is a circle that fills the 2-sphere. Let x be any point on the 2-sphere. Then distance(x, C_n) ≤ the sidelength (= diameter) of the level-n triangles, and sidelength_(n+1) = (1/2)*sidelength_n, so d(x,C) ≡ lim d(x,C_n) ≤ lim sidelength_n = 0.
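The chain of inequalities in that proof, restated as display math (no new content, just the slide's argument made explicit):

```latex
\[
d(x, C) \;=\; \lim_{n\to\infty} d(x, C_n)
\;\le\; \lim_{n\to\infty} \mathrm{sidelength}_n
\;=\; \mathrm{sidelength}_1 \cdot \lim_{n\to\infty} \left(\tfrac{1}{2}\right)^{n-1}
\;=\; 0.
\]
```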

