UpS = The Universe of all pTree Sets = {all vectors of SpSs (formerly known as SPTSs)}. V = n-dimensional vector space. Code all operations as n-ary operations.

Presentation transcript:

UpS = The Universe of all pTree Sets = {all vectors of SpSs (formerly known as SPTSs)}. V = n-dimensional vector space. Code all operations as n-ary operations on UpS (1-level or multi-level).

n-ary operations on UpS: $\mathrm{UpS} \times \dots \times \mathrm{UpS} \to \mathrm{UpS}$

DP_v := Dot Product with a fixed real vector, $v \in V$. (Use v's bit pattern?)
SP_c = Scalar Product (multiplying by a constant, c). Usually unary. Even more efficient is to use c's bit pattern!
SD_v = Square Distance from a fixed real vector, $v \in V$. (Use v's bit pattern only?)
ER_a = FP's EinRings. Result = pTree mask of rows < a. (Apply < as above? Better, use a's bit pattern only?)
AG_a = YC's Aggregates: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries. (Here the result is a number, but a number is a depth=1, width=1 SpS.)

Note: UpS includes SpSs of all cardinalities (= depths = # of rows). It seems best to code on UpS rather than on SpSn (card(SpS)=n). Of course, it is very important to know what the rows represent so as to avoid nonsense results; however, why restrict the operations themselves? When SpS operands are of different depths, the result SpS's depth = depth of the shallowest operand (operating from the top of each).

Addition is row-wise addition in each column:
$(SpS_{i,1}, SpS_{i,2}, \dots, SpS_{i,n_i}) + \dots + (SpS_{z,1}, SpS_{z,2}, \dots, SpS_{z,n_z}) \to (SpS_{\Sigma,1}, SpS_{\Sigma,2}, \dots, SpS_{\Sigma,n_\Sigma})$, where $SpS_{\Sigma,h} = SpS_{i,h} + \dots + SpS_{z,h}$.
Dimension_of_result = $n_\Sigma = \min\{n_i, \dots, n_z\}$, and $|SpS| \equiv$ depth_of_SpS = cardinality_of_SpS = #_of_rows_SpS, and $|SpS_{\Sigma,h}| = \min\{|SpS_{i,h}|, \dots, |SpS_{z,h}|\}$.

-, /, * are binary on SpSs, also row-wise. If op is any of =, >, <, $\le$, $\ge$, then op produces 1 mask pTree of the truth of $(SpS_{i,1} \; op \; SpS_{z,1})$ AND ... AND $(SpS_{i,n_\Sigma} \; op \; SpS_{z,n_\Sigma})$.

Note: in HORIZONTAL structuring, a dataset=file=table is one column of rows. In VERTICAL structuring, a dataset is one row of columns. Our advantage: the column count is typically small and a fixed number, whereas the row count can be very large and variable.

The advantage of defining all operations as n-ary operations on UpS is that we can then code various implementations independently and test them one against the other for speed. I.e., we can do the engineering on the operations.
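The row-wise rules above (result width = smallest operand width, result depth = shallowest column, comparisons ANDed into one mask) can be illustrated with a minimal Python sketch. It assumes plain Python lists stand in for SpSs; real pTree bit-slice storage is not modeled, and the names `nary_rowwise` and `compare_mask` are hypothetical, not from the slides. Taking the mask depth as the shallowest column over all compared positions is also an assumption.

```python
import operator

# Assumption: a plain list of numbers stands in for one SpS (a column),
# and a list of such columns stands in for one element of UpS.

def nary_rowwise(op, *operands):
    """Fold a binary `op` row-wise over any number of column vectors.

    Result width = min operand width; each result column is truncated to
    the shallowest of the operand columns in that position, operating
    from the top of each column.
    """
    width = min(len(vec) for vec in operands)
    result = []
    for h in range(width):
        cols = [vec[h] for vec in operands]
        depth = min(len(col) for col in cols)
        out = list(cols[0][:depth])
        for col in cols[1:]:
            out = [op(a, b) for a, b in zip(out, col[:depth])]
        result.append(out)
    return result

def compare_mask(op, left, right):
    """One mask (list of booleans) that is True on a row only when
    `op` holds in every compared column of that row."""
    width = min(len(left), len(right))
    depth = min(min(len(left[h]), len(right[h])) for h in range(width))
    return [all(op(left[h][r], right[h][r]) for h in range(width))
            for r in range(depth)]

a = [[1, 2, 3], [10, 20, 30]]       # 2 columns, depth 3
b = [[4, 5], [40, 50], [7, 8]]      # 3 columns, depth 2
total = nary_rowwise(operator.add, a, b)   # 2 columns, depth 2
mask = compare_mask(operator.lt, a, b)     # one mask of depth 2
```

Because every operation goes through the same n-ary entry point, an alternative implementation (for example one working on bit slices) could be swapped in and timed against this one on identical inputs.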

Oblique FAUST (OF) Clustering: Linear (default) OFL, Spherical OFS, Barrel OFB, Conical OFC.

Oblique FAUST Linear (OFL) clustering (enclose clusters between (n-1)-dimensional hyperplanar gaps): $L_{p,d}: X \to \mathbb{R}$, $L_{p,d}(x) = (x-p) \circ d$. Find $a_1 < a_2$ such that $\emptyset = \mathrm{Gap}_{Lower} = \{x \mid a_1 < L_{p,d}(x) < a_1+T\}$ and $\emptyset = \mathrm{Gap}_{Upper} = \{x \mid a_2 < L_{p,d}(x) < a_2+T\}$ and $C = \{x \mid a_1+T < L_{p,d}(x) < a_2\}$.

Oblique FAUST Spherical (OFS) (enclose clusters with spherical gaps): $S_p(x) = (x-p) \circ (x-p)$. Search $S_p$ for a spherical gap, $\{x \mid r^2 \le S_p(x) < (r+T)^2\} = \emptyset$, so that the interior of the r-sphere about p encloses a sub-cluster.

Oblique FAUST Barrel (OFB) (enclose clusters with barrel gaps): $B_{p,d}(x) = (x-p) \circ (x-p) - ((x-p) \circ d)^2$. Search for $\mathrm{Gap}_{Lower} > T$, $\mathrm{Gap}_{Upper} > T$ and $\mathrm{Gap}_{Barrel} > T^2$ (BR ≡ Barrel_Radius).

Oblique FAUST Cone (OFC) (enclose clusters with cone gaps): $C_{p,d}(x) = (x-p) \circ d \,/\, \sqrt{(x-p) \circ (x-p)}$.

Note: $B_{p,d}(x) = S_p(x) - L_{p,d}^2(x)$. Note: $C_{p,d}^2(x) = L_{p,d}^2(x) / S_p(x)$.

[Figure: Gap Lower at $a_1$ and Gap Upper at $a_2$ along the d-line through p. No gaps show on the red, blue or green projection lines.]

Assume a real number table, TBL(C_1..C_n) (= n-dim vector space; for categorical columns, either code to real numbers or bitmap, e.g., a Month column can be coded as {1,...,12} and a Color column can be bitmapped by Red(yes=1|no=0)...Violet(yes=1|no=0)). TBL is converted to a PTreeSet. Define the distance function $ds: TBL \times TBL \to \mathbb{R}$,
$ds(x,y) = \sqrt{\sum_{k \in CR} r_k |x_k - y_k|^2} + \sum_{k \in CC} c_k |x_k - y_k|$,
where CR is the set of real columns, CC is the set of categorical columns (consider coded columns as real) and $r_k$, $c_k$ are real coefficients.

Each method uses a real-valued functional from X to R, and all methods are completely data parallel (data can be distributed over a cluster and processed in parallel (dot product), then the partial results are sent home to be added).
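As a hedged illustration of the four functionals and of a gap search over their values, here is a small pure-Python sketch. It assumes d is a unit vector (so that B is the squared distance from x to the d-line through p) and uses ordinary lists rather than pTrees; the function names `linear`, `spherical`, `barrel`, `cone` and `find_gaps` are illustrative only.

```python
import math

def _sub(x, p):
    return [a - b for a, b in zip(x, p)]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(x, p, d):
    """L_{p,d}(x) = (x-p) o d"""
    return _dot(_sub(x, p), d)

def spherical(x, p):
    """S_p(x) = (x-p) o (x-p)"""
    w = _sub(x, p)
    return _dot(w, w)

def barrel(x, p, d):
    """B_{p,d}(x) = S_p(x) - L_{p,d}(x)^2 (assumes |d| = 1)"""
    return spherical(x, p) - linear(x, p, d) ** 2

def cone(x, p, d):
    """C_{p,d}(x) = L_{p,d}(x) / sqrt(S_p(x)); undefined (division by zero) at x == p"""
    return linear(x, p, d) / math.sqrt(spherical(x, p))

def find_gaps(values, T):
    """Return (lower, upper) pairs of consecutive sorted functional values
    that are more than T apart. Per the slide, pass T**2 instead of T
    when the values come from S or B."""
    vs = sorted(values)
    return [(a, b) for a, b in zip(vs, vs[1:]) if b - a > T]

p, d = [0.0, 0.0], [1.0, 0.0]
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.5], [10.0, 0.0], [11.0, 1.0]]
linear_gaps = find_gaps([linear(x, p, d) for x in X], T=3)
```

Each functional touches one row at a time, so the values can be computed on separate partitions of X and only the sorted values (or counts per interval) need to be combined before the gap search.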

OFLB...LB clustering on Concrete(STrength, ConcreteMix, WAter, FineAggregate, AGgregate), with T=MGW=12. STrength error is assessed with classes L<40 ≤ M<60 ≤ H. If the 1st B radius >> 0, use p=min_radius_pt.

[Figure: cluster dendrogram (w/o purity) built by alternating (x-p)∘d/4 and Br/4 count/gap tables with gap ≥ 3. Nodes: c0, c1, c2, c3, c4; c11, c12; c21, c22, c23, c24, c25, c26, c27; c31, c32, c33; c211, c231, c241, c251; c2411. Each node is annotated with its L/M/H counts.]

OFLB...LB clustering on SEEDS(cls, area, lnker, acoef, lnkrgv). If the 1st B radius >> 0, use p=min_radius_pt.

[Figure: alternating (x-p)∘d*10 and Br*10 count/gap tables with gap ≥ 3, and the resulting clusters.]

Oblique FAUST Pipe Clustering on SEEDS(cls, area, lnker, acoef, lnkrgv), with p=avg, q=vom, pipe radius = 1.

0. Always start with linear analysis, then:
1. Project the inside of a pipe (small radius) on the d-line.
2. For each linear gapped region, increase the radius until a radial gap appears.
3. Increase the linear region width until cap gaps appear.
4. Mask off that cluster.
6. GOTO 1 (and revise p, d) here, or if either 2 or 3 fails to materialize gaps.

The fact that there are no good pipe gaps may mean exit. Start over w/ 1st last.

[Figure: x∘d count/gap tables (Find LinGaps) over the columns cls, area, lnker, acoef, lnkrgv, with R, Ct and gp columns.]

This is unfinished (ran out of time). I also tried Spherical when it appeared from the pipe analysis that we were at the center of a cluster. So far this didn't work out.
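Step 1 and the linear gap search of the pipe procedure can be sketched as follows. This is only an illustration under stated assumptions: d is a unit vector, points are plain Python lists, and the helper name `pipe_segments` is hypothetical. Steps 2 and 3 (growing the radius and the region width) are not implemented here.

```python
def pipe_segments(points, p, d, radius, T):
    """Keep the points inside the pipe of the given radius around the
    d-line through p, project them on the d-line, and split them into
    runs separated by linear gaps wider than T.

    Assumes |d| = 1, so that B = (x-p)o(x-p) - ((x-p)o d)^2 is the
    squared distance from x to the d-line.
    """
    inside = []
    for x in points:
        w = [a - b for a, b in zip(x, p)]
        proj = sum(a * b for a, b in zip(w, d))          # L(x) = (x-p) o d
        b2 = sum(a * a for a in w) - proj * proj         # B(x)
        if b2 <= radius * radius:
            inside.append((proj, x))
    inside.sort(key=lambda t: t[0])

    segments = []
    for proj, x in inside:
        if segments and proj - segments[-1][-1][0] <= T:
            segments[-1].append((proj, x))   # continues the current run
        else:
            segments.append([(proj, x)])     # a gap > T starts a new run
    return [[x for _, x in seg] for seg in segments]

pts = [[0, 0], [1, 0.1], [2, 0], [10, 0], [11, 0.2], [5, 5]]
runs = pipe_segments(pts, p=[0, 0], d=[1, 0], radius=1, T=3)
```

Each returned run is a candidate piece of a dense cluster; its centroid could then serve as the sphere or barrel center mentioned in the Q&A notes.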

Q&A

f = distance-dominated functional. avgGap = $(f_{max} - f_{min}) / |f(X)|$ may be a good measurement for setting thresholds, e.g., x is an outlier (anomaly) if the gap around {x} > 3*avgGap?

If d and t are trained over DocumentTerm (DT), Gradient(F) = G = $(G_d, G_t)$. Instead of a LineSearch using $F(s) = F(f + sG)$, always use a 2D-RectangleSearch, $F(s_d, s_t) = F(f + s_d G_d + s_t G_t)$. Set $\partial F / \partial s_d = 0$ and $\partial F / \partial s_t = 0$.

It may be a better approach to find dense cells (sphere, barrel, cone) and then fuse them, because it is difficult to position them around clusters (due to bumps, protrusions, etc.). (Not true for outlier clusters (singleton\doubleton).)

An algorithm: Start with a line and a small-radius barrel around it. Find dense regions between 2 consecutive gaps in this pipe. This should identify a portion of a dense cluster. Lots of ways to go from there:
a. Use the centroid of the dense pipe piece as the sphere|barrel center.
b. Move to a better centroid for that cluster by a gradient ascent/descent process.
c. In a "GA mutation" fashion, jump to a nearby centroid, governed by some fitness function (e.g., count in dense pipe piece).

If the minimum barrel radii >> 0, we have chosen a d-line far from the data. It may be advisable to pick p to be an actual data point.

Here are the formulas from the spreadsheet:
G=(B12-B$6)*B$9+(C12-C$6)*C$9+(D12-D$6)*D$9+(E12-E$6)*E$9
H=G12-$G$9
L=(x-p)∘d-min
I=(B12-B$6)^2+(C12-C$6)^2+(D12-D$6)^2+(E12-E$6)^2
B=SQRT[(x-p)∘(x-p)-((x-p)∘d)^2]

Note we don't round, so we are calculating pTree bitslices by truncating. We don't even need to do that! For fixed point, here are the bitslice formulas: keep going (take bitslices to the right of the decimal pt). Floating point? Bitslice the mantissa. The exponent shifts the slice name. E.g., .1011

Gap analytic tools: $L(x) = x \circ d$, $S(x) = (x-p) \circ (x-p)$ and then, from those, $B(x) = S(x) - L^2(x)$. (If T is the minimum gap threshold, use $T^2$ for S and B.)

Oblique FAUST, Barrel (OFLB): Alternate $L_{pq}x$, $B_{pq}x$ to get a cluster dendrogram (top-down). Take p=1st_TR pt? Take d along the line through vom and avg.

Defining Avg Density? AvD = count / $\prod_{k=1..dim} (max_k - min_k)$? This is for choosing good thresholds. MinGapThres = $T_{b,AvD} \equiv b \cdot (1/AvD)^{1/dim}$, where b is an adjustable parameter.

If we're given a TrainingSet, TR, with K classes, is $avg_{k=1..K}\, vom_k$ a better medoid than VoM? Is taking p=MinCorner, q=MaxCorner of the box circumscribing $\{VoM_k\}_{k=1..K}$ better than using the circumscribing box of TR?

SSPTS = set of all SPTSs (columns of reals); V = n-dim vector space. Code operations on SSPTS (both 1-level or multi-level):
$SSPTS \times SSPTS \to SSPTS$ (Binary Algebraic Operations), including: +, -, /, RWP = Row_Wise_Product.
$SSPTS \to SSPTS$ (Unary Operations), including: SP_c = Scalar_Product (multiply each SPTS row by the same constant, c). Use a const SPTS (all rows = c), then RWP? More efficient w/o forming the const SPTS? Use c's bit pattern only?
$\{SPTS_k\}_{k=1..n} \to SSPTS$ (Unary ops. Typically $SPTS_k = V_k$; a subset of the previous with n = |SSPTS|?), incl: DP_v (Dot Product with a fixed vector, $v \in V$) and SD_v (Square Distance from a fixed vector, $v \in V$).
ER_a = FP's EinRings (n=1, $r \in \mathbb{R}$): the result masks rows s.t. row < a.
$SPTS \to \mathbb{R}$ includes AG_a = YC's Aggregates and iceberg queries: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries.

Note: SSPTS includes SPTSs of all cardinalities (= depths = # of rows). It seems best to code on SSPTS rather than on SSPTSn (card(SPTS)=n). Of course, it is very important to know what the rows represent so as to avoid nonsense results; however, why restrict the operations themselves? When SPTS operands are of different depths, the result SPTS's depth = depth of the shallowest operand (operate from the top of each).
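Two of the ideas in these notes lend themselves to a short sketch: the density-based minimum gap threshold, $T_{b,AvD} = b \cdot (1/AvD)^{1/dim}$, and bit-slicing a fixed-point column by truncation. The Python below is a minimal illustration, assuming non-negative values, a table given as a list of rows, and no constant columns (a constant column would make the box volume zero). The names `avg_density`, `min_gap_threshold` and `bitslices` are hypothetical.

```python
def avg_density(rows):
    """AvD = count / product over columns of (max_k - min_k).
    Assumes every column has max_k > min_k."""
    dim = len(rows[0])
    volume = 1.0
    for k in range(dim):
        col = [r[k] for r in rows]
        volume *= max(col) - min(col)
    return len(rows) / volume

def min_gap_threshold(rows, b):
    """T_{b,AvD} = b * (1/AvD)^(1/dim), with b an adjustable parameter."""
    dim = len(rows[0])
    return b * (1.0 / avg_density(rows)) ** (1.0 / dim)

def bitslices(column, int_bits, frac_bits):
    """Bit slices of a non-negative fixed-point column, most significant
    slice first. Values are truncated (not rounded) to `frac_bits`
    binary places; bits above `int_bits` are ignored."""
    scaled = [int(v * (1 << frac_bits)) for v in column]   # int() truncates
    total = int_bits + frac_bits
    return [[(s >> k) & 1 for s in scaled]
            for k in range(total - 1, -1, -1)]

rows = [[0, 0], [2, 0], [0, 2], [2, 2]]
T = min_gap_threshold(rows, b=3)
slices = bitslices([5.75, 2.5], int_bits=3, frac_bits=2)
```

Taking more fractional slices simply extends the list to the right of the binary point, which matches the "keep going" remark in the notes.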