Presentation is loading. Please wait.

Presentation is loading. Please wait.

Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining 2014-06-19.

Similar presentations


Presentation on theme: "Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining 2014-06-19."— Presentation transcript:

1 Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining 2014-06-19

2 Contents Research purpose Introduction What is Web Structure Mining Algorithms in Web Structure Mining Comparison table of Web structure Mining Algorithms Implementation results Conclusion

3 Research purpose Study about Web Structure Mining and its techniques Try to make a systematically comparison of some important Web Structure Mining algorithms through literature analysis Implement in practice some Web Structure Mining techniques in order to get the insights of those techniques.

4 Introduction What is Web Structure Mining? Web Mining Web Content Mining Web Structure Mining Web Usage Mining Web Structure Mining (WSM): A process by which the model of link structures and web pages are discovered Purpose of WSM: generate structural summary about the Web site and Web page

5 Introduction (cont’d) Web Structure Mining Link Mining Document Structure Mining Extracting patterns from hyperlinks in the web Mining the document structure (tree-like structure of documents)

6 Introduction (cont’d) Four important WSM algorithms: Pagerank algorithm Weighted pagerank algorithm Weighted content pagerank algorithm (WCPR) Hyperlink-Induced Topic Search (HITS)

7 1. PageRank Developed by L.Page and S.Brin. A page has high rank when the sum of the ranks of its backlinks is high Utilized by Google: 1. User request a search query 2. Google combines pre-computed static PageRank scores with content matching score to obtains an overall ranking score for each web page.

8 2. Weighted PageRank Proposed in order to improve pageRank Is an extended algorithm of PageRank by Wenpu Xing and Ali Ghorbani Method: assigns larger rank values to more popular pages instead of dividing the rank value of a page among its outlink pages Popular page: is the more linkages that other web page tend to have to them or are linked to by them

9 3. Weighted Content PageRank Based on WST and WCM Return the relevant and important pages in a list to a given query WSM is used to calculate the important page WCM is used to find how much relevant a page is Popularity of a page = number of inlinks and outlinks of the page A page is maximally matched to the query, it becomes more relevant.

10 3. Weighted Content PageRank (cont’d) Algorithm summary: Input for the algorithm: Page P, inlink and outlink. Weights of all backlinks of P, Query Q, d (damping factor). Output of the algorithm: Rank score Step 1: Relevance calculation: Find all meaningful word strings of Q (say N) Find whether the N strings are occurring in P or not? Z = Sum of frequencies of all N strings. S = Set of the maximum possible strings occurring in P. X = Sum of frequencies of strings in S. Content Weight (CW) = X/Z C = No. of query terms in P D = No. of all query terms of Q while ignoring stop words. Probability Weight (PW) = C/D Step 2: Rank calculation: Find all backlinks of P (say set B) Calculate Rank score Output PR(P) as the Rank score

11 4. Hyperlink-Induced Topic Search (HITS) HITS is a link algorithm Two types of webpages: hubs and authorities Hub: Resource lists A good hub: pointing to many authoritative pages on content that is being queried Authority: Pages having important contents A good authority: pointed by many good hub pages on the same content

12 4. Hyperlink-Induced Topic Search (HITS)(cont’d) Algorithm summary: Input: search topic, specified by one or more query terms. Step 1 - Sampling: A sampling component, which constructs a focused collection of several thousand Web pages likely to be rich in relevant authorities Step 2 - Weight propagation: A weight-propagation component, which determines numerical estimates of hub and authority weights by an iterative procedure. Output: hubs and authorities for the search.

13 Comparison table of WST Algorithms AlgorithmPageRankWeighted PageRank Weighted Page Content Rank HITS Author/YearS. Brin et al., 1998Wenpu Xing et al, 2004 P. Sharmar et al., 2010 Jon Kleinberg, 1998 Mining Technique Used WSM WSM and WCM DescriptionComputes scores at indexing time, not query time. Results are sorted according to importance of pages. Assigns large value to more important pages instead of dividing the rank value of a page evenly among its outlink pages Gives sorted order to the web pages returned by a search engine as a numerical value in response to a user query Computes hub and authority scores of n highly relevant pages on the fly. Relevant as well as important pages are returned.

14 Comparison table of WST Algorithms (cont’d) AlgorithmPageRankWeighted PageRank Weighted Page Content Rank HITS Input / Output Parameters Backlinks Backlinks, Forward links Backlinks, Forward links, Contents Backlinks, Forward links, Contents ComplexityO(logn)<O(logn) Advantages - Providing important pages according to given query. - Providing important pages according to given query. - Assigning importance in terms of weight values to incoming and outgoing links - Providing important pages and relevant pages according to query by using web structure and web content mining - Providing more relevant authority and hub pages according to query LimitationQuery independent Importance of page is ignored - Topic drift (topic unrelated to the original query) - Cannot detect advertisements Search EngineGoogleGoolgeResearch modelClever

15 Implementation results Table: Comparing top 10 pages which have high rank score from implementation result of PageRank and Weighted PageRank algorithm

16 Implementation results Graph: Comparing results from PageRank and Weighted PageRank algorithm

17 Implementation results Comparing convergence time and iteration number between PageRank and Weighted PageRank algorithm when threshold is variable.

18 Implementation results Comparing convergence time and iteration number between PageRank and Weighted PageRank algorithm when d is variable.

19 Conclusion Contributions of this study: A tabular comparison for important WSM algorithms A practical implementation with some result to get the insights of some WSM techniques Limitations Not enough time to implement more techniques in practice Still not contribute any novelty to existing techniques

20 Preferences [1] T. Bhatia, “Link Analysis Algorithms For Web Mining,” IJCST, vol. 2, no. 2, 2011. [2] M. da Costa Jr, and Z. Gong, "Web structure mining: an introduction." p. 6 pp. [3] M. A. Preeti Chopra, “A Survey on Improving the Efficiency of Different Web Structure Mining Algorithms,” IJEAT, vol. 2, no. 3, 2013. [4] S. K. Madria, S. S. Bhowmick, W.-K. Ng, and E.-P. Lim, "Research issues in web data mining," DataWarehousing and Knowledge Discovery, pp. 303-312: Springer, 1999. [5] P. Sharma, and P. Bhadana, “Weighted page content rank for ordering web search result,” International Journal of Engineering Science and Technology, vol. 2, no. 12, pp. 7301-7310, 2010. [6] J. M. Kleinberg, “Hubs, authorities, and communities,” ACM Computing Surveys (CSUR), vol. 31, no. 4es, pp. 5, 1999. [7] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, “Mining the Web's link structure,” Computer, vol. 32, no. 8, pp. 60-67, 1999. [8] R. Kosala and H. Blockeel, “Web mining research: A survey,” SIGKDD Explor. Newsl., vol. 2, no. 1, pp. 1–15, Jun. 2000. [Online]. Available: http://doi.acm.org/10.1145/360402.360406 [9] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” in Proceedings of the 7 International World Wide Web Conference, Brisbane, Australia, 1998, pp. 161–172. [Online]. Available: citeseer.nj.nec.com/page98pagerank.html [10] “Weighted pagerank algorithm,” in Proceedings of the Second Annual Conference on Communication Networks and Services Research, ser. CNSR ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 305–314. [Online]. Available: http://dl.acm.org/citation.cfm?id=998669.998911 [11] H. Dubey and P. B. N. Roy, “An improved page rank algorithm based on optimized normalization technique,” pp. 2183– 2188, 2011.

21 Thank you


Download ppt "Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining 2014-06-19."

Similar presentations


Ads by Google