Published by Arron Garrett. Modified over 8 years ago.
1
1 Algebraic Topology in Data Science: GrC in Big Data. Tsau Young ('T. Y.') Lin, Institute of Data Science and Computing, GrC Society, and Computer Science Department, San Jose State University. Ty.lin@sjsu.edu; prof.tylin@gmail.com
2
2 Outline
1. What is Data Science?
2. What is Data Mining?
3. What is GrC? Simplicial Complexes
4. Information Table (IT); Bit IT theory (BIT)
5. Trace the Geometric Data Mining algorithm
3
3 Data Science: Vasant Dhar (New York University, editor-in-chief of Big Data) defined: "Data Science is a study of generalizable extraction of knowledge from data."
4
4 Data Science: I do not fully agree with his definition of Data Science, but I will adopt his idea in this talk.
6
6 Data Mining: Data Mining is the study of HOW TO EXTRACT PATTERNS from data.
7
7 Data Mining: Core methods of Data Mining: 1. Classification 2. Clustering 3. Association (rules)
8
8 Data Mining: Methods for mining frequent patterns: Apriori (Rakesh Agrawal); FP-growth, i.e. frequent pattern growth (Jia-Wei Han)
9
(skip) FP-growth stores the database in an extended prefix-tree structure (FP-tree) and adopts a divide-and-conquer strategy.
10
(skip) 1. First, compress the database into a frequent-pattern tree (FP-tree). 2. Then divide the FP-tree into a set of conditional FP-trees. 3. Next, mine each conditional FP-tree separately to get the complete set of frequent patterns of the database.
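As a rough illustration of step 1 only, the sketch below compresses a small transaction database into a prefix tree whose paths are ordered by descending item frequency. The `Node` class, the `build_fp_tree` function, the `min_support` parameter, and the three sample transactions are assumptions made for this example, not taken from any FP-growth implementation. Steps 2 and 3 (conditional trees and recursive mining) are omitted.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support=1):
    # First pass: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    root = Node()
    # Second pass: insert each transaction, keeping only frequent items and
    # sorting them by descending frequency (ties broken alphabetically) so
    # that common prefixes share nodes.
    for t in transactions:
        items = [i for i in set(t) if freq[i] >= min_support]
        items.sort(key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


# Hypothetical sample database with three transactions.
transactions = [
    ["Diaper", "Beer", "Milk", "Pen"],
    ["Beer", "Milk", "Pen", "E"],
    ["Diaper", "Milk", "F", "G"],
]
tree = build_fp_tree(transactions, min_support=2)
```

With `min_support=2`, every path starts at the most frequent item, so the three transactions share a single root branch.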
12
12 GrC (Granular Computing): Is it a partition/granulation?
13
13 Partition (Rough Sets) (usual pictures are vague)
14
14 Open Simplexes: 0-simplex (1 point)
15
15 Open Simplexes: 1-simplex (no end points)
16
16 Open Simplexes: 2-simplex (no boundary)
17
17 Closed 3-simplex: it summarizes all information of the first tuple. [Figure: tetrahedron with vertices A, B, C, D]
18
18 Open Simplexes: 3-simplex (no boundary), the open tetrahedron
19
19 A New Partition (algebraic topology): 12 open triangles (2-simplexes); 23 open segments (1-simplexes); 12 vertices (0-simplexes). Simplicial complex.
20
20 Geometric Closed Tetrahedron. U: the closed geometric tetrahedron. Simplexes = the set of all open simplexes = { ABCD; BCD, ACD, ABD, ABC; AB, AC, AD, BC, BD, CD; A, B, C, D }. Is it a rough set? Yes.
21
21 Abstract Closed Tetrahedron. U = {A, B, C, D}. Simplexes = { ABCD = {A, B, C, D}; BCD, ACD, ABD, ABC = {A, B, C}; AB, AC, AD, BC, BD, CD; A, B, C, D }. Is it a rough set? Yes.
22
22 Geometric Simplicial Complex: (U, S) is a simplicial complex if (1) U is decomposed into a set S of simplexes, and (2) all faces of any simplex are also simplexes. (U is called a polyhedron.)
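Condition (2) is easy to check mechanically for an abstract complex, where a simplex is just a finite vertex set. The sketch below is one possible check; the function name `is_simplicial_complex` and the two sample inputs are illustrative assumptions, and condition (1), the geometric decomposition of U, is not modeled.

```python
from itertools import combinations


def is_simplicial_complex(simplexes):
    """Return True if every non-empty proper face of each simplex is in the set."""
    s = {frozenset(x) for x in simplexes}
    for simplex in s:
        for k in range(1, len(simplex)):
            for face in combinations(simplex, k):
                if frozenset(face) not in s:
                    return False
    return True


# Hypothetical input: edges BC, BD, CD, AB together with the four vertices.
closed_under_faces = ["BC", "BD", "CD", "AB", "A", "B", "C", "D"]
# Same edges, but vertex A is missing, so edge AB has a face outside the set.
missing_vertex = ["BC", "BD", "CD", "AB", "B", "C", "D"]

print(is_simplicial_complex(closed_under_faces))
print(is_simplicial_complex(missing_vertex))
```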
23
23 A Rough Set? [Figure: vertices A, B, C, D; red lines only]
24
24 Geometric Simplicial Complex. U = the picture of red lines. Simplexes = { BC, BD, CD, AB; A, B, C, D }. It is a simplicial complex. Is (U, S) a RS? An infinite RS.
25
25 Abstract Simplicial Complex. U = {A, B, C, D}. Simplexes = { {B, C}, {B, D}, {C, D}, {A, B}; {A}, {B}, {C}, {D} }. It is a simplicial complex. Is (U, S) a RS? No.
26
26 A Rough Set? [Figure: vertices A, B, C, D; red zone only]
27
27 Geometric Simplicial Complex. U = the picture of the red zone. S = { BCD, AB, and all their faces (descendants) }. Is (U, S) a RS? Yes.
28
28 Abstract Simplicial Complex. U = {A, B, C, D}. S = { {B, C, D}, {A, B}, and all their descendants (we will skip them): 1) {B, C}, {B, D}, {C, D}, {A, B}; 2) {A}, {B}, {C}, {D} }. Is (U, S) a RS? No.
29
29 Geometric Simplicial Complex [Figure: vertices A, B, C, D, E; open tetrahedron]
30
30 Simplicial Complex. U = {A, B, C, D, E}. S = { {A, B, C, D}, {A, E}, and all their descendants: {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}; {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}, {A, E}; {A}, {B}, {C}, {D}, {E} }.
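The list of descendants can be generated from the two maximal simplexes alone. The following is a minimal sketch; the helper name `closure` and the grouping dictionary `by_dim` are illustrative, not names from the talk.

```python
from itertools import combinations


def closure(maximal_simplexes):
    """All non-empty subsets (faces) of the given maximal simplexes."""
    faces = set()
    for simplex in maximal_simplexes:
        for k in range(1, len(simplex) + 1):
            faces.update(frozenset(c) for c in combinations(simplex, k))
    return faces


S = closure(["ABCD", "AE"])

# Group the simplexes by dimension (number of vertices minus one).
by_dim = {}
for face in S:
    by_dim.setdefault(len(face) - 1, []).append("".join(sorted(face)))

for dim in sorted(by_dim, reverse=True):
    print(dim, sorted(by_dim[dim]))
```

The result has 17 simplexes: one tetrahedron, four triangles, seven edges, and five vertices, which agrees with the enumeration on the slide.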
31
31 Abstract Simplicial Complex. U = {A, B, C, D, E}. S = { {A, B, C, D}, {A, E}, and all their descendants: {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}; {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}, {A, E}; {A}, {B}, {C}, {D}, {E} }.
32
32 Outline
1. What is GrC? Simplicial Complexes
2. Bit Information Table (IT)
3. Trace the Geometric Data Mining algorithm
34
34 Bit IT Theory: The next table is a Bit Information Table of a department store. Column names are item names. If a customer has purchased an item, the column of that item is marked with bit 1, otherwise with 0.
35
35 Bit IT (IT = a relation in a relational database)
Diaper  Beer  Milk  Pen  ….
1       1     1     1    0  0
36
36 Bit IT (Geometric View): The first row can be visualized GEOMETRICALLY as A CLOSED SIMPLEX.
37
37 A closed 3-simplex is a pair (U, S), where U = {Diaper, Beer, Milk, Pen} and S is a set of open simplexes: the 3-simplex {D, B, M, P} and all its descendants.
38
38 1st generation (2-simplexes): {B, M, P}, {D, M, P}, {D, B, P}, {D, B, M}. 2nd generation (1-simplexes): {D, B}, {D, M}, {D, P}, {B, M}, {B, P}, {M, P}. 3rd generation (0-simplexes): {D}, {B}, {M}, {P}. Total: 2^4 - 1 = 15 OPEN simplexes.
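The total can be checked by counting the faces of each size. The binomial coefficients below are standard notation, not notation used on the slides.

```latex
\binom{4}{3} + \binom{4}{2} + \binom{4}{1} = 4 + 6 + 4 = 14,
\qquad
1 + 14 = 2^{4} - 1 = 15
```

The leading 1 is the 3-simplex {D, B, M, P} itself, and 14 counts its three generations of descendants.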
39
39 Bit Geometric IT: Instead of item names, we may use the names of the unit vectors. The first item is A = (1, 0, 0, …); the second item is B = (0, 1, 0, 0, …).
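A small sketch of this renaming: each column becomes a unit vector, and a row of the table picks out the vertices whose unit vectors it contains. The row length of 6, the names `A` to `F`, and the helper `unit_vector` are assumptions for illustration only.

```python
def unit_vector(index, n):
    """The index-th standard unit vector of length n, as a tuple of bits."""
    return tuple(1 if i == index else 0 for i in range(n))


n = 6             # assumed number of columns
names = "ABCDEF"  # hypothetical vertex names, one per column
units = {name: unit_vector(i, n) for i, name in enumerate(names)}

row = (1, 1, 1, 1, 0, 0)  # first tuple of the Bit IT

# A vertex belongs to the row's simplex when the dot product with its unit vector is 1.
simplex = {
    name for name, e in units.items()
    if sum(a * b for a, b in zip(row, e)) == 1
}
print(sorted(simplex))
```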
40
40 Closed 3-simplex: it summarizes all information of the first tuple. [Figure: tetrahedron with vertices A, B, C, D]
41
41 Visit order: C is the 1st visit, A is the 2nd visit, AC is the 3rd visit, D is the 4th visit, followed by DC, DA, and DAC.
43
43 A Bit IT
Diaper  Beer  Milk  Pen  E   F   G   Tuple
1       1     1     1    0   0   0   #1
0       1     1     1    1   0   0   #2
1       0     1     0    0   1   1   #3
#4      #2    #1    #3   #5  #6  #7  (column labels)
44
44 Apriori (1-itemset)
Diaper  Beer  Milk  Pen  E   F   G   Tuple
1       1     1     1    0   0   0   #1
0       1     1     1    1   0   0   #2
1       0     1     0    0   1   1   #3
#4      #2    #1    #3   #5  #6  #7  (column labels)
2       2     3     2    1   1   1   (1-itemset counts)
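The 1-itemset counts are simply column sums of the Bit IT. A minimal sketch follows; the variable names and the `min_support` threshold of 2 are assumptions made for this example.

```python
columns = ["Diaper", "Beer", "Milk", "Pen", "E", "F", "G"]
rows = [
    (1, 1, 1, 1, 0, 0, 0),  # tuple #1
    (0, 1, 1, 1, 1, 0, 0),  # tuple #2
    (1, 0, 1, 0, 0, 1, 1),  # tuple #3
]

# Support of each single item = number of rows with bit 1 in its column.
support = {name: sum(r[i] for r in rows) for i, name in enumerate(columns)}

min_support = 2  # assumed threshold
frequent_items = [name for name, s in support.items() if s >= min_support]

print(support)
print(frequent_items)
```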
45
45 Apriori (2-itemset)
Diaper  Beer  Milk  Pen  E   F   G   Tuple
1       1     1     1    0   0   0   #1
0       1     1     1    1   0   0   #2
1       0     1     0    0   1   1   #3
#4      #2    #1    #3   #5  #6  #7  (column labels)
….
46
46 Its Simplicial Complex View [Figure: vertices a (#4), b (#2), c (#1), d (#3), e (#5), f (#6), g (#7); open tetrahedron 1, open tetrahedron 2, open tetrahedron 3]
47
47 Main Idea: Our main approach is to use a Weighted Simplicial Complex to find frequent itemsets. The results are very impressive: it is 200 times faster than FP-growth on a real-world database (1257 columns and 65,536 rows).
48
48 Traversal of Weighted Simplicial Complex: [1] 3, [4] 2, [4 1] 2, [3] 2, [3 1] 2, [3 4] 1, [3 4 1] 1, [2] 2, [2 1] 2, [2 4] 1, [2 4 1] 1, [2 3] 2, [2 3 1] 2, [2 3 4] 1, [2 3 4 1] 1
49
49 [7] 1, [7 1] 1, [7 4] 1, [7 4 1] 1, [6] 1, [6 1] 1, [6 4] 1, [6 4 1] 1, [6 7] 1, [6 7 1] 1, [6 7 4] 1, [6 7 4 1] 1
50
50 [5] 1, [5 1] 1, [5 3] 1, [5 3 1] 1, [5 2] 1, [5 2 1] 1, [5 2 3] 1, [5 2 3 1] 1
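The weights in this traversal can be reproduced by brute force: every row contributes weight 1 to each of its non-empty faces. This sketch is not the traversal algorithm of the talk, and its cost grows exponentially with row width. The row encoding and variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Rows of the Bit IT, written with the column labels #1..#7 from the slides.
rows = [
    {4, 2, 1, 3},  # tuple #1: Diaper, Beer, Milk, Pen
    {2, 1, 3, 5},  # tuple #2: Beer, Milk, Pen, E
    {4, 1, 6, 7},  # tuple #3: Diaper, Milk, F, G
]

# weights[simplex] = number of rows that contain the simplex.
weights = Counter()
for row in rows:
    for k in range(1, len(row) + 1):
        for face in combinations(sorted(row), k):
            weights[frozenset(face)] += 1

print(len(weights))
print(weights[frozenset({2, 3, 1})])
```

The 35 weighted simplexes correspond to the 15 + 12 + 8 entries listed on the three traversal slides.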
51
51 Data Science: Here we have used a very simple example to illustrate the idea of data science. We extract not only the frequent itemsets but also the structure of their interactions. For example, we have the homology groups of the output.
52
52 Knowledge Complex: [1] 3, [4] 2, [4 1] 2, [3] 2, [3 1] 2, [2] 2, [2 1] 2, [2 3] 2, [2 3 1] 2
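The knowledge complex keeps only the simplexes whose weight reaches a threshold. The sketch below assumes a threshold of 2, inferred from the weights that survive on this slide, and counts weights by checking containment in each row. The names `weight`, `min_weight`, and `knowledge_complex` are illustrative.

```python
from itertools import combinations

# Rows of the Bit IT, written with the column labels #1..#7.
rows = [{4, 2, 1, 3}, {2, 1, 3, 5}, {4, 1, 6, 7}]
min_weight = 2  # assumed threshold


def weight(simplex):
    """Number of rows that contain every item of the simplex."""
    return sum(1 for row in rows if simplex <= row)


items = sorted(set().union(*rows))
knowledge_complex = {
    frozenset(c): weight(set(c))
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if weight(set(c)) >= min_weight
}

for simplex, w in sorted(knowledge_complex.items(), key=lambda p: sorted(p[0])):
    print(sorted(simplex), w)
```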
53
53 Knowledge Complex [Figure: vertices A, B, C, D; red zone only]
54
54 Knowledge Complex: $H_0(K) = \mathbb{Z}$; $H_i(K) = 0$ for $i > 0$
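These groups can be checked numerically through Betti numbers, which are the ranks of the homology groups. The sketch below builds boundary matrices and computes their ranks over the rationals, so it cannot detect torsion. It assumes the input is a valid simplicial complex, and the function names `rank` and `betti_numbers` are illustrative.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix over the rationals by Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    cols = len(m[0]) if m else 0
    for c in range(cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def betti_numbers(simplexes):
    """Betti numbers b_0, b_1, ... of a finite simplicial complex."""
    by_dim = {}
    for s in simplexes:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    # The boundary maps out of dimension 0 and into dimension top are zero.
    ranks = {0: 0, top + 1: 0}
    for d in range(1, top + 1):
        faces = {f: i for i, f in enumerate(by_dim[d - 1])}
        matrix = [[0] * len(by_dim[d]) for _ in faces]
        for j, s in enumerate(by_dim[d]):
            for k in range(len(s)):
                # Dropping the k-th vertex gives a face with sign (-1)^k.
                matrix[faces[s[:k] + s[k + 1:]]][j] = (-1) ** k
        ranks[d] = rank(matrix)
    return [len(by_dim[d]) - ranks[d] - ranks[d + 1] for d in range(top + 1)]


# Knowledge complex K: four vertices, four edges, one triangle (labels #1..#4).
K = [(1,), (2,), (3,), (4,), (4, 1), (3, 1), (2, 1), (2, 3), (2, 3, 1)]
print(betti_numbers(K))
```

A result of 1 for dimension 0 and 0 for every higher dimension is consistent with $H_0(K) = \mathbb{Z}$ and vanishing higher homology.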
55
55 A Bit IT
A  B  C  D  E  F  G  Tuple
1  1  1  1  0  0  0  #1
0  1  1  0  1  0  0  #21
0  1  0  1  1  0  0  #22
0  0  1  1  1  0  0  #23
1  0  1  0  0  1  1  #3
56
56 Its Simplicial Complex View [Figure: vertices a (#4), b (#2), c (#1), d (#3), e (#5), f (#6), g (#7); open tetrahedron 1, open tetrahedron 2; {b, c, d} in tetrahedron 2 is removed]
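One way to see the effect of replacing tuple #2 by tuples #21, #22, and #23 is to compare the weight of {B, C, D} before and after. The sketch below uses the column names A to G from the table; the helper `weight` and the two row lists are illustrative assumptions.

```python
def weight(simplex, rows):
    """Number of rows (tuples) that contain every item of the simplex."""
    return sum(1 for row in rows if set(simplex) <= row)


# Original Bit IT: tuples #1, #2, #3.
original = [set("ABCD"), set("BCDE"), set("ACFG")]
# Modified Bit IT: tuple #2 replaced by #21, #22, #23.
split = [set("ABCD"), set("BCE"), set("BDE"), set("CDE"), set("ACFG")]

print(weight("BCD", original), weight("BCD", split))
print(weight("BCDE", original), weight("BCDE", split))
```

In the modified table, {B, C, D} is supported only by tuple #1, and {B, C, D, E} is no longer supported by any tuple.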
57
57 Traversal of Weighted Simplicial Complex: [1] 4, [5] 3, [5 1] 2, [3] 3, [3 1] 2, [3 5] 2, [3 5 1] 1
58
58 [2] 3, [2 1] 2, [2 5] 2, [2 5 1] 1, [2 3] 2, [2 3 1] 1, [2 3 5] 1
59
59 [4] 2, [4 1] 2, [4 3] 1, [4 3 1] 1, [4 2] 1, [4 2 1] 1, [4 2 3] 1, [4 2 3 1] 1
60
60 [7] 1, [7 1] 1, [7 4] 1, [7 4 1] 1
61
61 [6] 1, [6 1] 1, [6 4] 1, [6 4 1] 1, [6 7] 1, [6 7 1] 1, [6 7 4] 1, [6 7 4 1] 1
62
62 Knowledge Complex: [1] 4, [5] 3, [5 1] 2, [3] 3, [3 1] 2, [3 5] 2, [2] 3, [2 1] 2, [2 5] 2, [2 3] 2, [4] 2, [4 1] 2
63
63 Knowledge Complex (segments only): [5 1] 2, [3 1] 2, [3 5] 2, [2 1] 2, [2 5] 2, [2 3] 2, [4 1] 2
64
64 Knowledge Complex [Figure: vertices a, b, c, d, e (#5); 7 segments and all the points]
65
65 For a real-world database (1256 columns; 65,536 rows), the new algorithm (5.07 secs) runs nearly 200-300 times faster than the FP-growth of Professor Jia-Wei Han (1283.337036).
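The quoted speedup follows from the two timings, assuming the FP-growth figure is also in seconds (the slide gives no unit for it). The variable names are illustrative.

```python
new_algorithm_secs = 5.07
fp_growth_secs = 1283.337036  # unit assumed to be seconds

speedup = fp_growth_secs / new_algorithm_secs
print(round(speedup, 1))
```

Under that assumption the ratio is roughly 253, which lies inside the quoted 200-300 range.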
66
66 Thanks !
67
67 Simplicial Complex in Web (21): A Web page is 1. a linearly ordered text; 2. a knowledge representation of human thoughts.
68
68 1. Wall Street is a symbol for American financial industry. Most of the computer systems for those financial institute have employed information flow security policy. 2. Wall Street is a shorthand for US financial industry. Its E-security has applied security policy that was based on the ancient intent of Chinese wall. 3. Wall Street represents an abstract concept of financial industry. Its information security policy is Chinese wall.
69
69 2-ary Relation: (Wall, Street), (Information, Security), (Finance, Industry)
70
70 1. Wall Street is a symbol for American finance industry. Most of the computer systems for those financial institute have employed information flow security policy. 2. Wall Street is a shorthand for US finance industry. Its E-security has applied security policy that was based on the ancient intent of Chinese wall. 3. Wall Street represents an abstract concept of finance industry. Its information security policy is Chinese wall.
71
71 4-ary Relation: (security, policy, China, wall)
72
72 Concept Mining: Here we use the same idea to do concept mining in documents.
73
73 Concept Analysis: A simplex, as an ordered keyword set, represents a concept in the web. The simplicial complex is the knowledge structure of the web.
74
74 Knowledge Structure. Concept: 1-simplex. Wall Street is a 1-simplex that represents the concept of the financial industry.
75
75 Knowledge Structure. Concept: 1-simplex. Finance Industry (stemming)
76
76 Knowledge Structure: Indexing the Concepts. By indexing the concepts in the simplicial complex, ... a knowledge-based search engine can be built.
77
77 Concepts will be clustered by Homology Theory T. Y. LIN – Tung Yen Lin –Tsau Young Lin...
78
78 Conclusions
79
79 Key Components 1. GrC Model (U, β). 2. Two Operations (skip): Granulation and Integration. 3. Three Semantic Views on β: Knowledge Engineering (considering), Uncertainty Theory, How-to-solve/compute-it.
80
80 Key Components 4. Four Structures: Granular structure/variable (Zadeh); Quotient Structure (QS, Zhang); Knowledge Structure (KS, Pawlak); Linguistic structure/variable (Zadeh). http://xanadu.cs.sjsu.edu/~grc/grcinfo_center/1Linabs_william.pdf (From T. Y. Lin's home page, granular computing conference 2009, GrC Information Center; a formal theory is given in the first paragraph.)
81
81 Other Applications 2. Information Flow Security (3rd GrC model): solves a 30-year outstanding problem; IEEE SMC 2009
82
82 Other Applications 3. Approximation Theory in the category of Turing machines (7th GrC model): expressing DNA sequences by finite automata (2014)
83
83 Other Applications: Approximation Theory in the category of functions (6th GrC model): patterns in numerical sequences (1999)
84
84 Other "Applications": Interpreting uncertainty in quantum mechanics as GrC (3rd GrC model); interpreting approximations in Big Data (1st GrC model)
85
85 Thanks !