
1 Project Overview: Discovering Concepts Hidden in the Web. This is an incomplete set of slides that explain the idea underlying the project. For now, the important task is to download the project; I will come back with the details.

2 Important Concepts. A set of documents is associated with a matrix, called the Latent Semantic Index (LSI). Then, by treating the row vectors as points in Euclidean space (one point of TFIDF values per document), the documents are clustered. (The terms marked red will be explained in lectures.)

3 Important Concepts. A set of documents will be associated with a polyhedron; this association is believed to be nearly one-to-one. Corollary: a set of English documents and their Chinese translations can be identified (nearly) via this association.

4 Important Concepts. 1. Introduction. Domain: Information Ocean. Methodology: Granular Computing. Results: ? 2. Intuitive View of Data Science and Computing.

5 Current State. Current search engines are syntactically based systems; they often return many meaningless web pages. Cause: inadequate semantic analysis, and lack of a semantics-based organization of the information ocean.

6 Information Ocean. The Internet is an information ocean. It needs a methodology to navigate it. A new methodology: Granular Computing.

7 Granular Computing, a methodology. The term granular computing was first used to label a subset of Zadeh's granular mathematics as my research area in BISC, 1996-97. (Zadeh, L.A. (1998) Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, Soft Computing, 2, 23-25.)

8 Granular Computing. Since then, it has grown into an active research area: books, sessions, and workshops (Zhong and Lin held the first independent conference using the name GrC; there have been several in JCIS), and an IEEE task force.

9 Granular Computing. Granulation seems to be a natural problem-solving methodology deeply rooted in human thinking. The human body, for example, has been granulated into head, neck, etc.

10 Granulating the Information Ocean. In this talk, we will explain how we granulate the semantic space of the information ocean, which consists of millions of web pages.

11 Organizing the Information Ocean. How do we organize the information ocean? By considering the semantic space.

12 Latent Semantic Space. A set of documents/web pages carries certain human thoughts. We will call the totality of these thoughts the latent semantic space (LSS); (recall the Latent Semantic Index (LSI)).

13 Classification & Clustering. In data mining, classification means identifying an unseen object with one of the known classes in a partition. Clustering means classifying a set of objects into disjoint classes based on similarity, distance, etc.; the key ingredient here is that the classes are not known a priori.

14 Categorizing Information. Multiple concepts can simultaneously exist in a single web page, so to organize web pages a powerful clustering method is needed. (The number of concepts cannot be known a priori.)

15 Latent Semantic Space (LSS). The simplest representations of LSS? A set of keywords; the LSI.

16 Latent Semantic Index

        Key1      Key2      ...  KeyN
Doc1    TFIDF11   TFIDF12   ...  TFIDF1N
Doc2    TFIDF21   ...
...
DocM    TFIDFM1   ...

17 TFIDF. Definition 1. Let Tr denote a collection of documents. The significance of a term t_i in a document d_j in Tr is its TFIDF value, calculated by the function tfidf(t_i, d_j) = tf(t_i, d_j) · idf(t_i, d_j). It can be calculated as TFIDF(t_i, d_j) = tf(t_i, d_j) · log(|Tr| / |Tr(t_i)|)

18 TFIDF. where Tr(t_i) denotes the set of documents in Tr in which t_i occurs at least once, and tf(t_i, d_j) = 1 + log(N(t_i, d_j)) if N(t_i, d_j) > 0, and 0 otherwise, where N(t_i, d_j) denotes the frequency with which the term t_i occurs in document d_j, counting all its non-stop words.

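As a small sketch, the definition above can be computed directly; the three-document corpus and its keywords below are made up for illustration:

```python
import math

def tf(n):
    # Sublinear term frequency: 1 + log N(t, d) if the term occurs, else 0.
    return 1 + math.log(n) if n > 0 else 0.0

def tfidf(term, doc, docs):
    # TFIDF(t, d) = tf(t, d) * log(|Tr| / |Tr(t)|)
    n = doc.count(term)                      # N(t, d): occurrences of t in d
    df = sum(1 for d in docs if term in d)   # |Tr(t)|: documents containing t
    return tf(n) * math.log(len(docs) / df) if df else 0.0

# Hypothetical corpus: each document is a list of keywords.
docs = [["wall", "street", "wall"], ["white", "house"], ["wall", "door"]]
print(tfidf("wall", docs[0], docs))  # (1 + log 2) * log(3/2)
```

A term that occurs in every document gets idf = log(1) = 0, so its TFIDF is 0 regardless of how often it appears.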
20 Latent Semantic Index. Treat each row as a point in Euclidean space. Clustering such a set of points is a common approach (using SVD). Note that the points have very little to do with the semantics of the documents.
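A minimal sketch of this step on a toy matrix; the TFIDF values are invented, and NumPy's truncated SVD stands in for a full LSI pipeline:

```python
import numpy as np

# Hypothetical 4x3 TFIDF matrix: rows = documents, columns = keywords.
A = np.array([[2.0, 0.0, 1.0],
              [1.9, 0.1, 1.1],
              [0.0, 2.0, 0.1],
              [0.1, 2.1, 0.0]])

# Truncated SVD projects each document row onto the top-k singular directions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
points = U[:, :k] * s[:k]  # document coordinates in the reduced space

# Documents whose reduced points are close are clustered together:
d01 = np.linalg.norm(points[0] - points[1])  # similar rows 0 and 1
d02 = np.linalg.norm(points[0] - points[2])  # dissimilar rows 0 and 2
print(d01 < d02)
```

Any standard clustering algorithm (e.g. k-means) can then be run on `points` instead of the raw rows.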

21 Topological Space of LSS. Euclidean space has many metrics but only one topology; we will use this one.

22 Keywords (0-Associations). 1. Given by experts. 2. A term with high TFIDF is a keyword: "Wall", "Door", ..., "Street", "Ave".

23 Keyword Pairs (1-Associations). The 1-association ("Wall", "Street") → a financial notion that has nothing to do with the two vertices, "Wall" and "Street".

24 Keyword Pairs (1-Associations). The 1-association ("White", "House") → a notion that has nothing to do with the two vertices, "White" and "House".

25 Keyword Pairs (1-Associations). The 1-association ("Neural", "Network") → a notion that has nothing to do with the two vertices, "Neural" and "Network".

26 Geometric Analogy: 1-Simplex (open). 1-simplex: (v_0, v_1) → open segment; ("Wall", "Street") → financial notion. End points (boundaries) are not included.

27 Keywords Are Abstract Vertices. LSS of documents/web pages → simplicial complex, a special hypergraph. Polyhedron → simplicial complex.

28 r-Association. Similarly, an r-association represents some semantics generated by a set of r keywords; moreover, the semantics may have nothing to do with the individual keywords. There are mathematical structures that reflect such properties; see next.

29 Topology: (Open) Simplex. 1-simplex: open segment (v_0, v_1); 2-simplex: open triangle (v_0, v_1, v_2); 3-simplex: open tetrahedron (v_0, v_1, v_2, v_3). Boundaries are not included.

30 Topology: (Open) Simplex. An (open) r-simplex is the generalization of the low-dimensional simplexes (segment, triangle, and tetrahedron) to the high-dimensional analogue in r-space (Euclidean space of dimension r). Theorem: an r-simplex uniquely determines its r+1 linearly independent vertices, and vice versa.

31 Face. The convex hull of any m+1 vertices of the r-simplex is called an m-face. The 0-faces are the vertices, the 1-faces are the edges, the 2-faces are triangles, and the single r-face is the whole r-simplex itself.

32 Edge. A line segment where two faces of a polyhedron meet, also called a side.

33 n-Complex. A simplicial complex C is a finite set of simplices such that: (1) any face of a simplex from C is also in C; (2) the intersection of any two simplices from C is either empty or a face of both of them. If the maximal dimension of the constituent simplices is n, then the complex is called an n-complex.
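For abstract complexes (simplices as vertex sets) the defining conditions can be checked mechanically; a small sketch with illustrative vertex names:

```python
from itertools import combinations

def is_complex(simplices):
    # Condition 1: every nonempty proper face of a member is a member.
    S = {frozenset(s) for s in simplices}
    for s in S:
        for r in range(1, len(s)):
            for face in combinations(s, r):
                if frozenset(face) not in S:
                    return False
    # Condition 2 then holds automatically for vertex sets: any
    # intersection of two members is a common subset, hence a face of both.
    return True

# The 1-simplex ("wall", "street") together with its two 0-faces.
print(is_complex([("wall", "street"), ("wall",), ("street",)]))  # True
print(is_complex([("wall", "street")]))                          # False: vertices missing
```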

34 Upper/Closure Approximations. Let B(p), p ∈ V, be an elementary granule. U(X) = ∪ {B(p) | B(p) ∩ X ≠ ∅} (Pawlak); C(X) = {p | B(p) ∩ X ≠ ∅} (Lin topology).

35 Upper/Closure Approximations. Cl(X) = ∪_i C^i(X) (Sierpiński topology), where C^i(X) = C(...(C(X))...) (possibly transfinitely many steps). Cl(X) is closed.
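A small sketch of both operators on a finite universe; the granulation B below is hypothetical, and for a finite universe iterating C reaches the fixpoint Cl(X) in finitely many steps:

```python
def upper(X, B):
    # Pawlak upper approximation U(X): union of granules that meet X.
    return {q for p in B if B[p] & X for q in B[p]}

def C(X, B):
    # Lin's operator C(X): points whose granule meets X.
    return {p for p in B if B[p] & X}

# Hypothetical granulation on V = {1, 2, 3, 4}.
B = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {4}}

X = {2}
print(upper(X, B))  # union of B(1) and B(2) = {1, 2, 3}

# Iterate C to a fixpoint: this is Cl(X) on a finite universe.
Y = X
while C(Y, B) != Y:
    Y = C(Y, B)
print(Y)  # Cl({2}) = {1, 2}
```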

36 New View. Divide (and Conquer). Partition of a set → (generalized) partition of a B-space (topological partition).

37 New View: B-space. The pair (V, B) is the universe; namely, an object is a pair (p, B(p)), where B: V → 2^V; p → B(p) is a granulation.

38 Derived Partitions. The inverse images under B form a partition (an equivalence relation): C = {C_p | C_p = B^(-1)(B_p), p ∈ V}.

39 Derived Partitions. C_p is called the center class of B_p. A member of C_p is called a center.

40 Derived Partitions. The center class C_p consists of all the points that have the same granule: C_p = {q | B_q = B_p}.

41 C-Quotient Set. The set of center classes C_p is a quotient set, e.g. {Iran, Iraq}, {US, UK, ...}, {Russia, Korea}.
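A sketch of the center-class construction on the countries example; the granules B(p) here are hypothetical, chosen only so that the three classes above come out:

```python
from collections import defaultdict

# Hypothetical granulation: each point maps to its granule B(p).
B = {"US": {"Iran", "Iraq"}, "UK": {"Iran", "Iraq"},
     "Iran": {"US", "UK"}, "Iraq": {"US", "UK"},
     "Russia": {"US"}, "Korea": {"US"}}

# Center class of p: all points q with B(q) == B(p),
# i.e. the inverse image of each granule under B.
classes = defaultdict(set)
for p, granule in B.items():
    classes[frozenset(granule)].add(p)

quotient = {frozenset(c) for c in classes.values()}
print(quotient == {frozenset({"US", "UK"}), frozenset({"Iran", "Iraq"}),
                   frozenset({"Russia", "Korea"})})  # True
```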

42 New Problem Solving Paradigm. (Divide and) Conquer: quotient set → topological quotient space.

43 Neighborhood of a Center Class (in the case B is not reflexive) [figure: B-granules/neighborhoods and C-classes]

44 Neighborhood of a Center Class [figure: B-granules and C-classes]

45 Topological Partition [figure: C_p-classes and B-granules/neighborhoods]

46 New Problem Solving Paradigm. (Divide and) Conquer: quotient set → topological quotient space.

47 Topological Partition [figure: C_p-classes and B-granules/neighborhoods]

48 Topological Partition [figure: C_p-classes and B-granules/neighborhoods]

49 Topological Partition [figure: C_p-classes and B-granules/neighborhoods]

50 Topological Table (2 columns). Binary relation for Column I:

US     → C_X   West    C_X → C_Y (⊆ B_X)
UK     → C_X   West    C_X → C_Z (⊆ B_X)
Iran   → C_Y   M-East  C_Y → C_X (⊆ B_Y)
Iraq   → C_Y   M-East  C_Y → C_Z (⊆ B_Y)
Russia → C_Z   East    C_Z → C_X (⊆ B_Z)
Korea  → C_Z   East    C_Z → C_Y (⊆ B_Z)

51 Future Directions. Topological reducts; topological table processing.

52 Application 1: CWSP. In the UK, a financial service company may be consulted by competing companies. It is therefore vital to have a lawfully enforceable security policy.

53 Background. Brewer and Nash (BN) proposed the Chinese Wall Security Policy model (CWSP) in 1989 for this purpose.

54 Policy: Simple CWSP (SCWSP). "Simple Security": BN asserted that "people (agents) are only allowed access to information which is not held to conflict with any other information that they (agents) already possess."

55 A Little Formal. Simple CWSP (SCWSP): no single agent can read data X and Y that are in conflict.

56 Formal SCWSP. SCWSP says that a system is secure if "(X, Y) ∈ CIR ⇒ X NDIF Y". CIR = conflict-of-interest binary relation; NDIF = no direct information flow.

57 Formal Simple CWSP. SCWSP says that a system is secure if "(X, Y) ∈ CIR ⇒ X NDIF Y", i.e. "(X, Y) ∈ CIR ⇒ ¬(X DIF Y)". CIR = conflict-of-interest binary relation.

58 More Analysis. SCWSP requires that no single agent can read X and Y, but it does not exclude the possibility that a sequence of agents may read them. Is it secure?

59 Aggressive CWSP (ACWSP). The intuitive wall model implicitly requires that no sequence of agents can read X and Y: A_0 reads X = X_0 and X_1, A_1 reads X_1 and X_2, ..., A_n reads X_n = Y.

60 Composite Information Flow. A composite information flow (CIF) is a sequence of DIFs, denoted by →, such that X = X_0 → X_1 → ... → X_n = Y, and we write X CIF Y. NCIF: no CIF.

61 Composite Information Flow. Aggressive CWSP says that a system is secure if "(X, Y) ∈ CIR ⇒ X NCIF Y", i.e. "(X, Y) ∈ CIR ⇒ ¬(X CIF Y)".

62 The Problem. Simple CWSP ⇒? Aggressive CWSP. This is a malicious Trojan horse problem.

63 The ACWSP Theorem. Theorem: if CIR is anti-reflexive, symmetric, and anti-transitive, then Simple CWSP ⇒ Aggressive CWSP.
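The theorem's three hypotheses on CIR can be checked directly. A sketch with made-up conflict relations, using the usual definitions of the three properties (anti-transitive here means that (x, y) and (y, z) in the relation imply (x, z) is not):

```python
def anti_reflexive(R):
    return all(x != y for (x, y) in R)

def symmetric(R):
    return all((y, x) in R for (x, y) in R)

def anti_transitive(R):
    # (x, y) in R and (y, z) in R must imply (x, z) not in R.
    return all((x, z) not in R
               for (x, y) in R for (y2, z) in R if y == y2)

# Two conflicting companies: satisfies all three hypotheses.
cir = {("X", "Y"), ("Y", "X")}
print(anti_reflexive(cir) and symmetric(cir) and anti_transitive(cir))  # True

# A three-way all-pairs conflict is symmetric but not anti-transitive.
cir3 = {(a, b) for a in "XYZ" for b in "XYZ" if a != b}
print(anti_transitive(cir3))  # False
```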

64 C and CIR Classes. CIR: anti-reflexive, symmetric, anti-transitive. [figure: CIR-classes and C_p-classes]

65 Application 2. Association mining by granular/bitmap computing.

66 Fundamental Theorem. Theorem 1: all isomorphic relations have isomorphic patterns.

67 Illustration: Table K

v1 → (TWENTY, MAR, NY)
v2 → (TEN, MAR, SJ)
v3 → (TEN, FEB, NY)
v4 → (TEN, FEB, LA)
v5 → (TWENTY, MAR, SJ)
v6 → (TWENTY, MAR, SJ)
v7 → (TWENTY, APR, SJ)
v8 → (THIRTY, JAN, LA)
v9 → (THIRTY, JAN, LA)

68 Illustration: Table K'

v1 → (20, 3rd, New York)
v2 → (10, 3rd, San Jose)
v3 → (10, 2nd, New York)
v4 → (10, 2nd, Los Angeles)
v5 → (20, 3rd, San Jose)
v6 → (20, 3rd, San Jose)
v7 → (20, 4th, San Jose)
v8 → (30, 1st, Los Angeles)
v9 → (30, 1st, Los Angeles)

69 Illustration: Patterns in K (the repeated values form the patterns)

v1 → (TWENTY, MAR, NY)
v2 → (TEN, MAR, SJ)
v3 → (TEN, FEB, NY)
v4 → (TEN, FEB, LA)
v5 → (TWENTY, MAR, SJ)
v6 → (TWENTY, MAR, SJ)
v7 → (TWENTY, APR, SJ)
v8 → (THIRTY, JAN, LA)
v9 → (THIRTY, JAN, LA)

70 Isomorphic 2-Associations

K               Count   K'
(TWENTY, MAR)   3       (20, 3rd)
(MAR, SJ)       3       (3rd, San Jose)
(TWENTY, SJ)    3       (20, San Jose)

71 Canonical Model. Bitmaps in granular forms; patterns in granular forms.

72 Table K'

v1 → (20, 3rd)
v2 → (10, 3rd)
v3 → (10, 2nd)
v4 → (10, 2nd)
v5 → (20, 3rd)
v6 → (20, 3rd)
v7 → (20, 4th)
v8 → (30, 1st)
v9 → (30, 1st)

73 Illustration: K → GDM

     K           GDM
v1 → (20, 3rd)   {v1 v5 v6 v7}  {v1 v2 v5 v6}
v2 → (10, 3rd)   {v2 v3 v4}     {v1 v2 v5 v6}
v3 → (10, 2nd)   {v2 v3 v4}     {v3 v4}
v4 → (10, 2nd)   {v2 v3 v4}     {v3 v4}
v5 → (20, 3rd)   {v1 v5 v6 v7}  {v1 v2 v5 v6}
v6 → (20, 3rd)   {v1 v5 v6 v7}  {v1 v2 v5 v6}
v7 → (20, 4th)   {v1 v5 v6 v7}  {v7}
v8 → (30, 1st)   {v8 v9}        {v8 v9}
v9 → (30, 1st)   {v8 v9}        {v8 v9}

74 Illustration: K → GDM (as bitmaps)

     K           GDM
v1 → (20, 3rd)   (100011100)  (110011000)
v2 → (10, 3rd)   (011100000)  (110011000)
v3 → (10, 2nd)   (011100000)  (001100000)
v4 → (10, 2nd)   (011100000)  (001100000)
v5 → (20, 3rd)   (100011100)  (110011000)
v6 → (20, 3rd)   (100011100)  (110011000)
v7 → (20, 4th)   (100011100)  (000000100)
v8 → (30, 1st)   (000000011)  (000000011)
v9 → (30, 1st)   (000000011)  (000000011)

75 Granular Data Model (of K')

NAME   Elementary granule
10     (011100000) = {v2 v3 v4}
20     (100011100) = {v1 v5 v6 v7}
30     (000000011) = {v8 v9}
1st    (000000011) = {v8 v9}
2nd    (001100000) = {v3 v4}
3rd    (110011000) = {v1 v2 v5 v6}
4th    (000000100) = {v7}
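The elementary granules above can be computed mechanically from table K': scan the rows and record, for each attribute value, the set of row ids in which it appears (a sketch):

```python
# Rows of table K' (the two-column version used for the GDM).
rows = [("20", "3rd"), ("10", "3rd"), ("10", "2nd"), ("10", "2nd"),
        ("20", "3rd"), ("20", "3rd"), ("20", "4th"), ("30", "1st"), ("30", "1st")]

# Elementary granule of a value = the set of row ids where it appears.
granules = {}
for i, row in enumerate(rows, start=1):
    for value in row:
        granules.setdefault(value, set()).add(i)

print(granules["20"])   # {v1, v5, v6, v7}
print(granules["3rd"])  # {v1, v2, v5, v6}
print(granules["4th"])  # {v7}
```

A set can equally be stored as a bitmap (one bit per row), which is what makes the granular representation cheap to intersect.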

76 Associations in Granular Forms

K           Cardinality of granules
(20, 3rd)   |{v1 v5 v6 v7} ∩ {v1 v2 v5 v6}| = |{v1 v5 v6}| = 3
(10, 2nd)   |{v2 v3 v4} ∩ {v3 v4}| = |{v3 v4}| = 2
(30, 1st)   |{v8 v9} ∩ {v8 v9}| = |{v8 v9}| = 2

77 Associations in Granular Forms

K           Cardinality of granules
(20, 3rd)   |{v1 v5 v6 v7} ∩ {v1 v2 v5 v6}| = |{v1 v5 v6}| = 3
(3rd, SJ)   |{v1 v2 v5 v6} ∩ {v2 v5 v6 v7}| = |{v2 v5 v6}| = 3
(20, SJ)    |{v1 v5 v6 v7} ∩ {v2 v5 v6 v7}| = |{v5 v6 v7}| = 3
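The intersection counts above can be reproduced directly; the granules below are copied from the GDM of K', with SJ standing for the San Jose granule of the three-column table:

```python
# Elementary granules (row-id sets) from the GDM of K'.
g = {"20": {1, 5, 6, 7}, "3rd": {1, 2, 5, 6}, "SJ": {2, 5, 6, 7},
     "10": {2, 3, 4}, "2nd": {3, 4}, "30": {8, 9}, "1st": {8, 9}}

def support(a, b):
    # Support of the 2-association (a, b) = |granule(a) ∩ granule(b)|.
    return len(g[a] & g[b])

print(support("20", "3rd"))  # 3
print(support("3rd", "SJ"))  # 3
print(support("20", "SJ"))   # 3
print(support("10", "2nd"))  # 2
```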

78 Fundamental Theorems. 1. All isomorphic relations are isomorphic to the canonical model (GDM). 2. A granule of the GDM is a high-frequency pattern if it has high support.

79 Relation Lattice Theorems. 1. The granules of the GDM generate a lattice of granules with join = ∪ and meet = ∩. This lattice was called the relation lattice by Tony Lee (1983). 2. All elements of the lattice can be written as joins of primes (join-irreducible elements) (Birkhoff & MacLane, 1977, Chapter 11).

80 Finding Associations by Linear Inequalities. Theorem: let P_1, P_2, ... be the primes (join-irreducible elements) in the canonical model. Then G = x_1·P_1 ∪ x_2·P_2 ∪ ... is a high-frequency pattern if |G| = x_1·|P_1| + x_2·|P_2| + ... ≥ th (each x_j a binary number).

81 Join-Irreducible Elements

10 ∩ 1st: {v2 v3 v4} ∩ {v8 v9} = ∅
20 ∩ 1st: {v1 v5 v6 v7} ∩ {v8 v9} = ∅
30 ∩ 1st: {v8 v9} ∩ {v8 v9} = {v8 v9}
10 ∩ 2nd: {v2 v3 v4} ∩ {v3 v4} = {v3 v4}
20 ∩ 2nd: {v1 v5 v6 v7} ∩ {v3 v4} = ∅
30 ∩ 2nd: {v8 v9} ∩ {v3 v4} = ∅
10 ∩ 3rd: {v2 v3 v4} ∩ {v1 v2 v5 v6} = {v2}
20 ∩ 3rd: {v1 v5 v6 v7} ∩ {v1 v2 v5 v6} = {v1 v5 v6}
30 ∩ 3rd: {v8 v9} ∩ {v1 v2 v5 v6} = ∅
10 ∩ 4th: {v2 v3 v4} ∩ {v7} = ∅
20 ∩ 4th: {v1 v5 v6 v7} ∩ {v7} = {v7}
30 ∩ 4th: {v8 v9} ∩ {v7} = ∅

82 AM by Linear Inequalities

|x_1·{v1 v5 v6} + x_2·{v2} + x_3·{v3 v4} + x_4·{v7} + x_5·{v8 v9}|
= x_1·3 + x_2·1 + x_3·2 + x_4·1 + x_5·2,
where {v1 v5 v6} = (20, 3rd), {v2} = (10, 3rd), {v3 v4} = (10, 2nd), {v7} = (20, 4th), {v8 v9} = (30, 1st).

83 AM by Linear Inequalities

|x_1·{v1 v5 v6} + x_2·{v2} + x_3·{v3 v4} + x_4·{v7} + x_5·{v8 v9}| = x_1·3 + x_2·1 + x_3·2 + x_4·1 + x_5·2. The sum reaches the threshold (3) in the following cases:
1. x_1 = 1
2. x_2 = 1, x_3 = 1, or x_2 = 1, x_5 = 1
3. x_3 = 1, x_4 = 1, or x_3 = 1, x_5 = 1
4. x_4 = 1, x_5 = 1

84 AM by Linear Inequalities

Case x_1 = 1: |1·{v1 v5 v6}| = 1·3 = 3; this is the association (20, 3rd): |{v1 v5 v6 v7} ∩ {v1 v2 v5 v6}| = |{v1 v5 v6}| = 3.

85 AM by Linear Inequalities

Case x_2 = 1, x_3 = 1: |x_2·{v2} + x_3·{v3 v4}| corresponds to (10, 3rd) ∪ (10, 2nd).
Case x_2 = 1, x_5 = 1: |x_2·{v2} + x_5·{v8 v9}| corresponds to (10, 3rd) ∪ (30, 1st).

86 AM by Linear Inequalities

Case x_3 = 1, x_4 = 1: |x_3·{v3 v4} + x_4·{v7}| corresponds to (10, 2nd) ∪ (20, 4th).
Case x_3 = 1, x_5 = 1: |x_3·{v3 v4} + x_5·{v8 v9}| corresponds to (10, 2nd) ∪ (30, 1st).

87 AM by Linear Inequalities

Case x_4 = 1, x_5 = 1: |x_4·{v7} + x_5·{v8 v9}| corresponds to (20, 4th) ∪ (30, 1st).

