
1 Data Processing with Missing Information. Haixuan Yang. Supervisors: Prof. Irwin King & Prof. Michael R. Lyu

2 Outline
Introduction
Link Analysis
–Preliminaries
–Related Work
–Predictive Ranking Model
–Block Predictive Ranking Model
–Experiment Setup
–Conclusions
Information Systems
–Preliminaries
–Related Work
–Generalized Dependency Degree Γ′
–Extend Γ′ to Incomplete Information Systems
–Experiments
–Conclusions
Future Work

3 Introduction
We handle missing information in two areas:
–Link analysis on the web.
–Information systems with missing values.
In both areas, there is a common simple technique: using the sample to estimate the density. If the total size of the sample is n, and the number of occurrences of a phenomenon in the sample is m, then the probability of this phenomenon is estimated as m/n.
In both areas, the difficulty is how to use this simple technique in the right places.

4 Link Analysis

5 Preliminaries
PageRank (1998)
–It uses link information to rank web pages;
–The importance of a page depends on the number of pages that point to it;
–The importance of a page also depends on the importance of the pages that point to it.
–If x is the rank vector, then the rank x_i can be expressed as
  x_i = α Σ_{j→i} x_j / d_j + (1 − α) f_i,
where d_j is the out-degree of node j, f_i is the probability that the user randomly jumps to node i, and α is the probability that the user follows the actual link structure.
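As an illustration of the iteration just defined, here is a minimal sketch (ours, not from the slides); the toy graph, tolerance, and α = 0.85 are illustrative assumptions:

```python
import numpy as np

def pagerank(out_links, alpha=0.85, tol=1e-10):
    # Power iteration for x_i = alpha * sum_{j -> i} x_j / d_j + (1 - alpha) * f_i,
    # with a uniform jump vector f. Assumes every node has at least one
    # out-link (dangling nodes are treated later in the talk).
    n = len(out_links)
    f = np.full(n, 1.0 / n)              # uniform random-jump vector f
    x = f.copy()                         # initial rank vector
    while True:
        x_new = (1 - alpha) * f
        for j, targets in enumerate(out_links):
            share = alpha * x[j] / len(targets)   # d_j is the out-degree of j
            for i in targets:
                x_new[i] += share
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

# Toy graph with no dangling nodes: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
print(pagerank([[1, 2], [2], [0]]))
```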

6 Preliminaries
Problems
–Manipulation
–The “richer-get-richer” phenomenon
–Computation efficiency
–The dangling nodes problem

7 Preliminaries
Nodes that either have no out-link or for which no out-link is known are called dangling nodes.
The dangling nodes problem:
–It is hard to sample the entire web.
Page et al. (1998) reported 51 million URLs not yet downloaded when 24 million pages had been downloaded.
Eiron et al. (2004) reported that the number of uncrawled pages (3.75 billion) still far exceeded the number of crawled pages (1.1 billion).
–Including dangling nodes in the overall ranking may not only change the rank values of non-dangling nodes but also change their order.

8 An example
If we ignore the dangling node 3, the ranks of nodes 1 and 2 take one pair of values; if we include node 3, the revised PageRank algorithm (Kamvar 2003) gives different ranks. [The numeric rank values on the original slide are not preserved in this transcript.]

9 Related work
Page (1998): simply remove the dangling nodes; after ranking, they can be added back in. The details are missing.
Amati (2003): handle dangling nodes robustly based on a modified graph.
Kamvar (2003): add a uniform random jump from dangling nodes to all nodes.
Eiron (2004): speed up the model in Kamvar (2003) at the cost of some accuracy; furthermore, suggest an algorithm that penalizes nodes linking to dangling nodes of class 2, which we define later.

10 Related work - Amati (2003)

11 Related work - Kamvar (2003)
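The modified matrix on this slide is not preserved in the transcript. A rough sketch of the idea as described above (a dangling node's column is replaced by a uniform jump to all nodes), with our own names and a dense-matrix representation assumed:

```python
import numpy as np

def transition_with_dangling(adj):
    # Column-stochastic transition matrix in which dangling columns
    # (zero out-degree) are replaced by a uniform jump to every node.
    A = np.asarray(adj, dtype=float)     # A[i, j] = 1 if j links to i
    n = A.shape[0]
    d = A.sum(axis=0)                    # out-degrees (column sums)
    safe = np.where(d > 0, d, 1)         # avoid division by zero
    return np.where(d > 0, A / safe, 1.0 / n)

# Node 2 is dangling: its column becomes (1/3, 1/3, 1/3).
adj = [[0, 0, 0],
       [1, 0, 0],
       [1, 1, 0]]
print(transition_with_dangling(adj))
```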

12 Related work - Eiron (2004)

13 Predictive Ranking Model
Classes of dangling nodes:
–Nodes that are found but not yet visited at the current time are called dangling nodes of class 1.
–Nodes that have been tried but not visited successfully are called dangling nodes of class 2.
–Nodes that have been visited successfully, but from which no out-link is found, are called dangling nodes of class 3.
We handle different kinds of dangling nodes in different ways. Our work focuses on dangling nodes of class 1, which cause missing information.

14 Illustration of dangling nodes
At time 1: visited node: 1; dangling nodes of class 1: 2, 4, 5, 7.
At time 2: visited nodes: 1, 7, 2; dangling nodes of class 1: 3, 4, 5, 6; dangling node of class 3: 7.
Known information at time 2: red links. Missing information at time 2: white links.

15 Predictive Ranking Model
For dangling nodes of class 3, we use the same technique as Kamvar (2003).
For dangling nodes of class 2, we ignore them in the current model, although it is possible to combine the push-back algorithm (Eiron 2004) with our model. (Penalizing nodes is a subjective matter.)
For dangling nodes of class 1, we try to predict the missing information about the link structure.

16 Predictive Ranking Model (Cont.)
Suppose that all the nodes V can be partitioned into three subsets: V = C ∪ D1 ∪ D2.
–C denotes the set of all non-dangling nodes (those that have been crawled successfully and have at least one out-link);
–D1 denotes the set of all dangling nodes of class 3;
–D2 denotes the set of all dangling nodes of class 1.
For each node v in V, the real in-degree of v is not known.
For each node v in C, the real out-degree of v is known.
For each node v in D1, the real out-degree of v is known to be zero.
For each node v in D2, the real out-degree of v is unknown.

17 Predictive Ranking Model (Cont.)
We predict the real in-degree of v by the number of found links from C to v.
–Assumption: the number of found links from C to v is proportional to the real number of links from V to v.
For example, if C and D1 together contain 100 nodes, V has 1000 nodes, and the number of links from C to v is 5, then we estimate that the number of links from V to v is 50.
The difference between these two numbers is distributed uniformly over the nodes in D2.
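A sketch of this estimation step under the stated proportionality assumption (all names are ours; the 900 below is just the 1000 found nodes minus the 100 crawled ones):

```python
def predict_inlinks(found_links_to_v, n_crawled, n_total, n_class1):
    # Estimate the total in-degree of one node v, assuming links found from
    # the crawled part (C and D1, n_crawled nodes) are proportional to links
    # from all found nodes (V, n_total nodes); the unseen remainder is spread
    # uniformly over the n_class1 unvisited nodes in D2.
    estimated_total = found_links_to_v * n_total / n_crawled
    hidden = estimated_total - found_links_to_v
    return estimated_total, hidden / n_class1

# The slide's example: 5 found links, 100 crawled nodes, 1000 found nodes.
total, per_unvisited_node = predict_inlinks(5, 100, 1000, 900)
print(total, per_unvisited_node)   # 50.0 estimated in-links; 45 spread over D2
```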

18 Predictive Ranking Model
Model the missing information from unvisited nodes to nodes in V: from D2 to V.
Model the known link information as in Page (1998): from C to V.
Model the user's behavior as in Kamvar (2003) when facing dangling nodes of class 3: from D1 to V.
n: the number of nodes in V; m: the number of nodes in C; m1: the number of nodes in D1.

19 Predictive Ranking Model
Model the users' behavior (called "teleportation") as in Page (1998) and Kamvar (2003): when users get bored with following actual links, they may jump to some node at random. x is the rank vector.

20 Block Predictive Ranking Model
Predict the in-degree of v more accurately.
Divide all nodes into p blocks (V[1], V[2], …, V[p]) according to their top-level domains (for example, edu), domains (for example, stanford.edu), or countries (for example, cn).
Assumption: the number of found links from C[i] (the intersection of C and V[i]) to v is proportional to the real number of links from V[i] to v. Consequently, the matrix A is changed. A per-block sketch follows below.
The other parts are the same as in the Predictive Ranking Model.
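A block-wise variant of the earlier sketch (ours), applying the proportionality assumption inside each block and summing; the block sizes are invented for illustration:

```python
def predict_inlinks_blockwise(found_per_block, crawled_per_block, total_per_block):
    # Per-block proportional estimate of the in-degree of one node v:
    # links found from C[i] scale with |V[i]| / |C[i]| for each block i.
    return sum(f * n / m
               for f, m, n in zip(found_per_block, crawled_per_block, total_per_block)
               if m > 0)

# Two blocks: 5 links found from 100 of 1000 nodes, 2 from 50 of 200 nodes.
print(predict_inlinks_blockwise([5, 2], [100, 50], [1000, 200]))   # 58.0
```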

21 Block Predictive Ranking Model
Model the missing information from unvisited nodes in the 1st block to nodes in V.

22 Experiment Setup
Get two datasets (crawled by Patrick Lau): one within the domain cuhk.edu.hk, the other outside this domain. For the first dataset we took 11 snapshots of the link matrix during crawling; for the second, 9 snapshots.
Apply both the Predictive Ranking Model and the revised PageRank model (Kamvar 2003) to these snapshots.
Compare the results of both models at time t with the future results of both models.
–The future results rank more nodes than the current results, so a direct comparison is difficult.

23 Illustration of the comparison: the future result is cut down to the nodes present in the current result, both vectors are normalized, and the difference is computed by the 1-norm.
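A sketch of this cut-normalize-difference step (ours), assuming the first len(current) entries of the future vector correspond to the nodes ranked at time t:

```python
import numpy as np

def rank_difference(current, future):
    # Cut the future rank vector down to the nodes present at time t,
    # renormalize both vectors to sum to 1, and take the 1-norm difference.
    cut = future[:len(current)]          # assumes a shared node ordering
    return np.abs(current / current.sum() - cut / cut.sum()).sum()

print(rank_difference(np.array([0.5, 0.3, 0.2]),
                      np.array([0.4, 0.3, 0.2, 0.1])))
```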

24 Within domain cuhk.edu.hk: Data Description

Time t        1      2      3      4      5      6      7      8      9     10     11
Vnum[t]    7712  78662 109383 160019 252522 301707 373579 411724 444974 471684 502610
Tnum[t]   18542 120970 157196 234701 355720 404728 476961 515534 549162 576139 607170

Number of iterations:
PreRank       5      3      2      2      2      2      2      2      2      2      2
PageRank     12      4      3      2      2      2      2      2      2      2      2

25 Within domain cuhk.edu.hk: Comparison based on the future PageRank result at time 11. [The plot compares the PageRank and PreRank results at time t against the PageRank result at time 11; the curves show the difference.]

26 Within domain cuhk.edu.hk: Comparison based on the future PageRank result at time 11.
Visited nodes at time 11: 502610. Found nodes at time 11: 607170.

27 Within domain cuhk.edu.hk: Comparison based on the future PreRank result at time 11. [The plot compares the PageRank and PreRank results at time t against the PreRank result at time 11; the curves show the difference.]

28 Within domain cuhk.edu.hk: Comparison based on the future PreRank result at time 11.
Visited nodes at time 11: 502610. Found nodes at time 11: 607170.

29 Outside cuhk.edu.hk: Data Description

Time t        1      2      3      4      5      6      7      8      9
Vnum[t]    4611   6785  10310  16690  20318  23453  25417  28847  39824
Tnum[t]   87930 121961 164701 227682 290731 322774 362440 413053 882254

Number of iterations:
PreRank       2      2      2      1      1      1      1      1      1
PageRank      3      3      3      2      2      2      2      2      1

30 Outside cuhk.edu.hk: Comparison based on the future PageRank result at time 9.
Visited nodes at time 9: 39824. Found nodes at time 9: 882254. Actual nodes: more than 4 billion.

31 Outside cuhk.edu.hk: Comparison based on the future PreRank result at time 9.
Visited nodes at time 9: 39824. Found nodes at time 9: 882254. Actual nodes: more than 4 billion.

32 Conclusions on PreRank
PreRank is more accurate than PageRank on a closed web.
PreRank needs fewer iterations than PageRank (contrary to our expectation!).

33 Information Systems (IS)

34 Preliminaries
In Rough Set Theory, the dependency degree γ is a traditional measure that expresses the percentage of objects that can be correctly classified into D-classes by employing the attributes C. However, γ does not accurately express the dependency between C and D:
–γ = 0 when there is no deterministic rule.
The table on the next slide is an information system.

35 An example
γ = 0 when C = {a}, D = {d}. But in fact D depends on C to some degree.

Object  Headache (a)  Muscle Pain (b)  Body Temperature (c)  Influenza (d)
e1      Y             Y                Normal (0)            N
e2      Y             Y                High (1)              Y
e3      Y             Y                Very high (2)         Y
e4      N             Y                Normal (0)            N
e5      N             N                High (1)              N
e6      N             Y                Very high (2)         Y
e7      Y             N                High (1)              Y

36 Original definition of γ
γ(C, D) = ( Σ_{X ∈ U/D} |C̲(X)| ) / |U|,
where U is the set of all objects, X is one D-class, C̲(X) is the lower approximation of X, and each block of the partition U/C is a C-class.
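A sketch (ours) computing γ on the influenza table above: an object counts toward γ when its whole C-class lies inside a single D-class (the lower approximation):

```python
from collections import defaultdict

def classes(rows, attrs):
    # Partition object indices into equivalence classes on the given attributes.
    part = defaultdict(set)
    for i, row in enumerate(rows):
        part[tuple(row[a] for a in attrs)].add(i)
    return list(part.values())

def gamma(rows, C, D):
    # gamma(C, D): fraction of objects in the union of the lower
    # approximations of the D-classes.
    c_cls, d_cls = classes(rows, C), classes(rows, D)
    positive = sum(len(X) for X in c_cls if any(X <= Y for Y in d_cls))
    return positive / len(rows)

# The influenza table; columns 0-3 are attributes a, b, c, d
# (temperature coded as 0 = Normal, 1 = High, 2 = Very high).
table = [('Y','Y',0,'N'), ('Y','Y',1,'Y'), ('Y','Y',2,'Y'), ('N','Y',0,'N'),
         ('N','N',1,'N'), ('N','Y',2,'Y'), ('Y','N',1,'Y')]
print(gamma(table, [0], [3]))   # 0.0, as the slide claims for C={a}, D={d}
```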

37 Preliminaries
An incomplete IS has missing values (denoted *).

Object  Headache (a)  Muscle Pain (b)  Body Temperature (c)  Influenza (d)
e1      Y             Y                Normal (0)            N
e2      Y             *                High (1)              Y
e3      Y             Y                *                     Y
e4      N             *                Normal (0)            N
e5      N             N                High (1)              N
e6      N             Y                Very high (2)         Y
e7      Y             N                High (1)              Y

38 Related work
Lingras (1998): represents missing values by the set of all possible values, by which a rule extraction method is proposed.
Kryszkiewicz (1999): establishes a similarity relation based on the missing values, upon which a method for computing optimal certain rules is proposed.
Leung (2003): introduces the maximal consistent block technique for rule acquisition in incomplete information systems.

39 Related work
The conditional entropy
H(D|C) = −Σ_c p(c) Σ_d p(d|c) log p(d|c)
is used in C4.5 (Quinlan, 1993).

40 Definition of γ
We find the following variation of γ:
γ(C, D) = |{x ∈ U : C(x) ⊆ D(x)}| / |U|.
U: universe of objects; C, D: sets of attributes; C(x): the C-class containing x; D(x): the D-class containing x.

41 Generalized Dependency Degree Γ′
The first form of Γ′ is defined as
Γ′(C, D) = (1/|U|) Σ_{x ∈ U} |C(x) ∩ D(x)| / |C(x)|.
U: universe of objects; C, D: sets of attributes; C(x): the C-class containing x; D(x): the D-class containing x.
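A sketch of the first form on the same example (ours), reusing classes() and table from the γ sketch above; the printed value is our computation, not from the slides:

```python
def gamma_prime(rows, C, D):
    # First form: (1/|U|) * sum over x of |C(x) ∩ D(x)| / |C(x)|.
    c_cls, d_cls = classes(rows, C), classes(rows, D)
    cls_of = lambda x, part: next(B for B in part if x in B)
    return sum(len(cls_of(x, c_cls) & cls_of(x, d_cls)) / len(cls_of(x, c_cls))
               for x in range(len(rows))) / len(rows)

print(gamma_prime(table, [0], [3]))   # about 0.595: D depends on C to some degree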

42 Variations of Γ′ and γ
The second form of Γ′:
Γ′(C, D) = Σ_{r ∈ MinR(C,D)} strength(r) · confidence(r),
where MinR(C, D) denotes the set of all minimal rules. A minimal rule takes the form des(X) → des(Y), mapping the description of a C-class X to the description of a D-class Y.

43 Properties of Γ′
Γ′ can be extended to equivalence relations R1 and R2.
Properties 1–4. [The four property equations on the original slide are not preserved in this transcript.]

44 Extend Γ′ to incomplete IS
Replace each missing value by a probability distribution over the possible values, estimated from the observed frequencies.

Object  Headache (a)  Muscle Pain (b)  Body Temperature (c)  Influenza (d)
e1      Y             Y                Normal (0)            N
e2      Y             p(b)             High (1)              Y
e3      Y             Y                p(c)                  Y
e4      N             p(b)             Normal (0)            N
e5      N             N                High (1)              N
e6      N             Y                Very high (2)         Y
e7      Y             N                High (1)              Y

where p(b) and p(c) denote the distributions of the observed values of b and c.
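A sketch of the replacement step (ours), using the m/n frequency estimate from the introduction on one column of the table:

```python
from collections import Counter

def fill_missing(column):
    # Replace each '*' by the frequency distribution of the observed values
    # in the same column (the m/n estimate from the introduction).
    known = [v for v in column if v != '*']
    dist = {v: c / len(known) for v, c in Counter(known).items()}
    return [dist if v == '*' else v for v in column]

# Muscle-pain column (b) of the incomplete table: e2 and e4 are missing.
print(fill_missing(['Y', '*', 'Y', '*', 'N', 'Y', 'N']))
# '*' becomes {'Y': 0.6, 'N': 0.4}
```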

45 Extend to incomplete IS (Cont.)
Then use the second form of the generalized dependency degree as our definition of Γ′ in an incomplete IS.
We need to define the strength and confidence of a rule.

46 Extend to incomplete IS (Cont.)
We define the confidence of a rule, the strength of a rule, and an inductively constructed (fuzzy) class set. [The three defining equations on the original slide are not preserved in this transcript.]

47 47 Extend to incomplete IS (Cont.) where is the algebraic sum of the fuzzy sets is the algebraic product of the fuzzy sets is the complement of the fuzzy sets

48 Comparison with conditional entropy
The conditional entropy is
H(D|C) = −Σ_c p(c) Σ_d p(d|c) log p(d|c).
The third form of Γ′: Γ′(C, D) can be proved to be
Γ′(C, D) = Σ_c p(c) Σ_d p(d|c)².
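A sketch (ours) of the third form next to the conditional entropy, both computed from a joint count table; note that Γ′ needs only squares where H(D|C) needs logarithms:

```python
import numpy as np

def gamma_prime_and_entropy(joint_counts):
    # joint_counts[c, d]: number of objects in C-class c and D-class d.
    # Returns (Gamma'(C, D), H(D|C)); assumes no zero counts for the log.
    p = np.asarray(joint_counts, float)
    p /= p.sum()
    pc = p.sum(axis=1, keepdims=True)    # p(c)
    pdc = p / pc                         # p(d|c)
    gp = float((pc[:, 0] * (pdc ** 2).sum(axis=1)).sum())
    h = float(-(p * np.log2(pdc)).sum())
    return gp, h

# Headache (rows Y, N) against influenza (columns N, Y) from the full table;
# Gamma' agrees with the first-form value computed earlier (about 0.595).
print(gamma_prime_and_entropy([[1, 3], [2, 1]]))
```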

49 Comparison by experiments
The conditional entropy H(D|C) is used in C4.5. We replace H(D|C) with Γ′(C, D) in the C4.5 algorithm, so that a new C4.5 algorithm is formed.
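A sketch of the criterion swap (ours): pick the attribute whose value-by-class contingency table maximizes Γ′, reusing gamma_prime_and_entropy from the previous sketch; the candidate tables come from the influenza example:

```python
def split_score(joint_counts):
    # Larger Gamma'(C, D) means a better split; no logarithm is needed.
    return gamma_prime_and_entropy(joint_counts)[0]

# Contingency tables (attribute value x class) for two candidate attributes.
candidates = {'headache': [[1, 3], [2, 1]],      # a against d
              'muscle_pain': [[2, 3], [1, 1]]}   # b against d
print(max(candidates, key=lambda a: split_score(candidates[a])))  # 'headache'
```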

50 Datasets we use
We use the same eleven datasets with missing values from the UCI Repository that Quinlan uses in his paper “Improved use of continuous attributes in C4.5” (Journal of Artificial Intelligence Research, 4:77–90, 1996).
Both the old C4.5 and the new C4.5 use ten-fold cross-validation on each task.

51 Description of the Datasets

Dataset    Cases  Classes  Cont  Discr
Anneal       898        6     6     32
Auto         205        6    15     10
Breast-w     699        2     9      0
Colic        368        2     7     15
Credit-a     690        2     6      9
Heart-c      303        2     6      7
Heart-h      294        2     8      5
Hepatitis    155        2     6     13
Allhyper    3772        5     7     22
Labor         57        2     8      8
Sick        3772        2     7     22

52 Mean error rates of both algorithms (%)

            Old C4.5            New C4.5
Dataset     unpruned  pruned    unpruned  pruned
Anneal          3.9     4.6        6.1      7.9
Auto           20.5    22         22       22.5
Breast-w        5.7     4.3        4.2      4.5
Colic          19.8    16         16.3     15.4
Credit-a       19.7    17.1       15.2     15.6
Heart-c        22.4    21.4       23.4     23.1
Heart-h        24.2    22.8       20.7     21.1
Hepatitis      20      19.9       19.3     19.3
Allhyper        1.4     1.4        1.1      1.2
Labor          24.7    26.3       15.7     19.3
Sick            1.2     1.1        1        1

53 Average run time of both algorithms (time unit: 0.01 s)

Dataset     Old C4.5  New unpruned  Reduced rate (%)  New pruned  Reduced rate (%)
Anneal        6.800       4.600          32.4            5.100         25.0
Auto          9.700       2.600          73.2            2.600         73.2
Breast-w      1.600       1.000          37.5            1.000         37.5
Colic         4.100       1.500          63.4            1.500         63.4
Credit-a      8.300       2.400          71.1            2.500         69.9
Heart-c       2.000       0.700          65.0            0.900         55.0
Hepatitis     0.900       0.600          33.3            0.700         22.2
Allhyper     45.000      18.500          58.9           18.500         58.9
Labor         0.400       0.100          75.0            0.400          0.0
Sick         38.100      17.100          55.1           20.800         45.4

54 Conclusions on experiments
1. Theoretical complexity: the new C4.5 does not need the MDL (Minimum Description Length) principle to correct the bias toward continuous attributes, and it does not need the pruning procedure.
2. Speed: the new C4.5 greatly outperforms the original C4.5 in run time, because
a) to compute Γ′, we only need the square of the frequency, while the commonly used conditional entropy requires the time-consuming logarithm of the frequency;
b) the tree-building procedure in the new C4.5 algorithm stops earlier;
c) omitting the pruning procedure also saves a considerable amount of time.
3. Prediction accuracy: the new C4.5 is slightly better than the original C4.5 on these 11 datasets.

55 Conclusions on Γ′
The first form, in terms of equivalence relations, is the most important.
–It is simple.
–It is flexible, and so can be extended to an arbitrary relation.
–It bridges the gap between the γ used in Rough Set Theory and the conditional entropy used in Machine Learning.
The second form, in terms of minimal rules, shares with the first form the advantage of being easily understood.
The third form, in terms of probability, is computationally efficient.

56 Conclusions on Γ′ (Cont.)
The generalized dependency degree has good properties, such as the Partial Order Preserving Property and the Anti-Partial Order Preserving Property.
Its value is between 0 and 1.

57 Future Work
Deepen
–In the area of link analysis:
Estimate the web structure more accurately.
Conduct experiments on large real datasets to support the Block Predictive Ranking Model.
Speed up the predictive ranking algorithm.
–In the area of decision trees:
Instead of branching on one attribute, branch on more than one attribute.
Handle missing values more accurately.

58 Future Work (Cont.)
Widen
–In the area of link analysis:
The push-back algorithm can be combined with our PreRank algorithm.
–In the area of decision trees:
Create decision forests using the new C4.5 algorithm.
–In other areas, such as SVM (Support Vector Machines).

59 Q & A

