Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining 2014-06-19.

Presentation transcript:

Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining

Contents
- Research purpose
- Introduction
- What is Web Structure Mining
- Algorithms in Web Structure Mining
- Comparison table of Web Structure Mining algorithms
- Implementation results
- Conclusion

Research purpose
- Study Web Structure Mining and its techniques.
- Make a systematic comparison of some important Web Structure Mining algorithms through literature analysis.
- Implement some Web Structure Mining techniques in practice to gain insight into them.

Introduction: What is Web Structure Mining?
- Web Mining comprises Web Content Mining, Web Structure Mining, and Web Usage Mining.
- Web Structure Mining (WSM): the process of discovering the model underlying link structures and web pages.
- Purpose of WSM: generate a structural summary of Web sites and Web pages.

Introduction (cont’d)
Web Structure Mining divides into:
- Link Mining: extracting patterns from hyperlinks in the web
- Document Structure Mining: mining the document structure (the tree-like structure of documents)

Introduction (cont’d)
Four important WSM algorithms:
- PageRank algorithm
- Weighted PageRank algorithm
- Weighted Content PageRank algorithm (WCPR)
- Hyperlink-Induced Topic Search (HITS)

1. PageRank
- Developed by L. Page and S. Brin.
- A page has a high rank when the sum of the ranks of its backlinks is high.
- Utilized by Google:
  1. The user submits a search query.
  2. Google combines pre-computed static PageRank scores with a content-matching score to obtain an overall ranking score for each web page.
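The recursive definition above (a page ranks high when its backlinks rank high) can be sketched as a power iteration. This is a minimal illustration, not the presenters' implementation: it assumes the normalized form PR(u) = (1 - d)/N + d * sum of PR(v)/outdegree(v) over the backlinks v of u, and the function name `pagerank`, the toy graph `toy_web`, the damping factor d = 0.85 and the tolerance are all chosen here for illustration.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank; links maps a page to the pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Assumption: rank held by pages without outlinks is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # Each page divides its rank evenly among its outlinks.
                new[q] += d * rank[p] / len(outs)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page web used only for illustration.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
```

On this toy graph, page C, which has three backlinks, ends up with the highest score, and page D, which has no backlinks, keeps only the baseline (1 - d)/N.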

2. Weighted PageRank
- An extension of PageRank, proposed by Wenpu Xing and Ali Ghorbani to improve on it.
- Method: assigns larger rank values to more popular pages instead of dividing the rank value of a page evenly among its outlink pages.
- Popular page: the more popular a page is, the more links other pages tend to have to it or receive from it.
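The slide does not give the weighting formula, so the sketch below relies on the formulation commonly attributed to Xing and Ghorbani's paper, which should be checked against the original: W_in(v,u) is the inlink count of u divided by the total inlink count of all pages v links to, W_out(v,u) is the same ratio for outlink counts, and PR(u) = (1 - d) + d * sum of PR(v) * W_in(v,u) * W_out(v,u) over the backlinks v of u. The function name, the toy graph, the fixed iteration count and the handling of a zero denominator are assumptions made here.

```python
def weighted_pagerank(links, d=0.85, iterations=100):
    """Weighted PageRank sketch; links maps a page to the pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    out_links = {p: set(links.get(p, ())) for p in pages}  # duplicates ignored
    in_count = {p: 0 for p in pages}
    for p in pages:
        for q in out_links[p]:
            in_count[q] += 1
    out_count = {p: len(out_links[p]) for p in pages}

    def share(counts, v, u):
        # Share of u among all pages that v links to, by the given count.
        total = sum(counts[p] for p in out_links[v])
        return counts[u] / total if total else 0.0  # assumption: 0 if undefined

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1 - d for p in pages}
        for v in pages:
            for u in out_links[v]:
                w_in = share(in_count, v, u)
                w_out = share(out_count, v, u)
                new[u] += d * rank[v] * w_in * w_out
        rank = new
    return rank


# Hypothetical four-page web used only for illustration.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
wpr = weighted_pagerank(toy_web)
```

Here page A passes most of its rank to C rather than splitting it evenly with B, because C has three inlinks against B's one.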

3. Weighted Content PageRank
- Based on WSM and WCM (Web Content Mining).
- Returns a list of pages that are relevant and important for a given query.
- WSM is used to calculate the importance of a page.
- WCM is used to find how relevant a page is.
- Popularity of a page = number of inlinks and outlinks of the page.
- The more closely a page matches the query, the more relevant it is.

3. Weighted Content PageRank (cont’d)
Algorithm summary:
Input: page P, its inlinks and outlinks, the weights of all backlinks of P, query Q, d (damping factor).
Output: rank score.
Step 1: Relevance calculation
- Find all meaningful word strings of Q (say N).
- Check whether the N strings occur in P.
- Z = sum of frequencies of all N strings.
- S = set of the maximum possible strings occurring in P.
- X = sum of frequencies of strings in S.
- Content Weight (CW) = X/Z
- C = number of query terms in P.
- D = number of all query terms of Q, ignoring stop words.
- Probability Weight (PW) = C/D
Step 2: Rank calculation
- Find all backlinks of P (say set B).
- Calculate the rank score.
- Output PR(P) as the rank score.
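As a small illustration of Step 1, the probability weight PW = C/D can be sketched as follows. Only PW is covered, since the slide defines it completely. The stop-word list, the whitespace tokenization and the function name `probability_weight` are assumptions made here, not part of the published algorithm.

```python
# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def probability_weight(query, page_text):
    """PW = C / D, with D counted after dropping stop words from the query."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not terms:
        return 0.0  # assumption: a query with no usable terms scores zero
    page_words = set(page_text.lower().split())
    matched = sum(1 for t in terms if t in page_words)  # C
    return matched / len(terms)  # C / D


pw = probability_weight(
    "mining the web structure",
    "Web structure analysis studies the links between pages",
)
```

In this example the query keeps three terms after stop-word removal (mining, web, structure), two of which occur in the page text, so PW is 2/3.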

4. Hyperlink-Induced Topic Search (HITS)
- HITS is a link analysis algorithm.
- Two types of web pages: hubs and authorities.
- Hub: a resource list. A good hub points to many authoritative pages on the content being queried.
- Authority: a page with important content. A good authority is pointed to by many good hub pages on the same content.

4. Hyperlink-Induced Topic Search (HITS) (cont’d)
Algorithm summary:
- Input: a search topic, specified by one or more query terms.
- Step 1, sampling: constructs a focused collection of several thousand Web pages likely to be rich in relevant authorities.
- Step 2, weight propagation: determines numerical estimates of hub and authority weights by an iterative procedure.
- Output: hubs and authorities for the search topic.
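The weight-propagation step can be sketched as the usual mutually reinforcing update: a page's authority weight is the sum of the hub weights of the pages pointing to it, and its hub weight is the sum of the authority weights of the pages it points to, with both vectors normalized after each round. The sampling step is not modeled; the graph `focused` stands in for an already sampled collection, and the function name and iteration count are assumptions.

```python
import math


def hits(links, iterations=50):
    """Iterative hub/authority weights; links maps a page to pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of pages linking to each page.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in links.get(p, ()):
                auth[q] += hub[p]
        # Hub update: sum the authority weights of the pages each page links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors to unit Euclidean length.
        for weights in (auth, hub):
            norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical focused subgraph: three hub-like pages, two authority-like pages.
focused = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
hub, auth = hits(focused)
```

In this graph a2 receives links from all three hubs and gets the largest authority weight, while h1 and h2, which point to both authorities, score higher as hubs than h3.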

Comparison table of WSM Algorithms
Author/Year
- PageRank: S. Brin et al., 1998
- Weighted PageRank: Wenpu Xing et al., 2004
- Weighted Page Content Rank: P. Sharma et al., 2010
- HITS: Jon Kleinberg, 1998
Mining Technique Used
- PageRank: WSM
- Weighted PageRank: WSM
- Weighted Page Content Rank: WSM and WCM
- HITS: WSM and WCM
Description
- PageRank: Computes scores at indexing time, not query time. Results are sorted according to importance of pages.
- Weighted PageRank: Assigns larger values to more important pages instead of dividing the rank value of a page evenly among its outlink pages.
- Weighted Page Content Rank: Gives a sorted order to the web pages returned by a search engine, as a numerical value in response to a user query.
- HITS: Computes hub and authority scores of n highly relevant pages on the fly. Relevant as well as important pages are returned.

Comparison table of WSM Algorithms (cont’d)
Input / Output Parameters
- PageRank: Backlinks
- Weighted PageRank: Backlinks, Forward links
- Weighted Page Content Rank: Backlinks, Forward links, Contents
- HITS: Backlinks, Forward links, Contents
Complexity
- O(log n) and <O(log n); only these two values survive in the transcript
Advantages
- PageRank: Provides important pages according to the given query.
- Weighted PageRank: Provides important pages according to the given query. Assigns importance in terms of weight values to incoming and outgoing links.
- Weighted Page Content Rank: Provides important and relevant pages according to the query by using web structure and web content mining.
- HITS: Provides more relevant authority and hub pages according to the query.
Limitation
- PageRank: Query independent
- Weighted PageRank: Query independent
- Weighted Page Content Rank: Importance of page is ignored
- HITS: Topic drift (topic unrelated to the original query). Cannot detect advertisements.
Search Engine
- PageRank: Google
- Weighted PageRank: Google
- Weighted Page Content Rank: Research model
- HITS: Clever

Implementation results
Table: Top 10 pages by rank score from the implementations of the PageRank and Weighted PageRank algorithms.

Implementation results
Graph: Comparison of the results from the PageRank and Weighted PageRank algorithms.

Implementation results
Comparison of convergence time and number of iterations between the PageRank and Weighted PageRank algorithms as the convergence threshold varies.

Implementation results
Comparison of convergence time and number of iterations between the PageRank and Weighted PageRank algorithms as the damping factor d varies.
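A measurement of this kind could be set up along the following lines. The sketch covers plain PageRank only and counts iterations rather than wall-clock time; the toy graph, the L1 stopping rule and the function name `iterations_to_converge` are assumptions, and the code does not reproduce the presenters' figures.

```python
def iterations_to_converge(links, d, threshold, max_iter=1000):
    """Count PageRank iterations until the L1 change drops below threshold."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for i in range(1, max_iter + 1):
        # Assumption: rank of pages without outlinks is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - d + d * dangling) / n for p in pages}
        for p in pages:
            for q in links.get(p, ()):
                new[q] += d * rank[p] / len(links[p])
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < threshold:
            return i
    return max_iter


# Hypothetical four-page web used only for illustration.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

# Vary the threshold at fixed d, then vary d at a fixed threshold.
by_threshold = {t: iterations_to_converge(toy_web, 0.85, t)
                for t in (1e-2, 1e-4, 1e-6)}
by_damping = {d: iterations_to_converge(toy_web, d, 1e-6)
              for d in (0.5, 0.7, 0.85)}
```

With this stopping rule, a tighter threshold can never need fewer iterations than a looser one on the same graph and damping factor, which gives a simple sanity check for such an experiment.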

Conclusion
Contributions of this study:
- A tabular comparison of important WSM algorithms
- A practical implementation, with some results, to gain insight into some WSM techniques
Limitations:
- Not enough time to implement more techniques in practice
- No novel contribution to the existing techniques

References
[1] T. Bhatia, “Link Analysis Algorithms For Web Mining,” IJCST, vol. 2, no. 2.
[2] M. da Costa Jr and Z. Gong, “Web structure mining: an introduction,” 6 pp.
[3] M. A. Preeti Chopra, “A Survey on Improving the Efficiency of Different Web Structure Mining Algorithms,” IJEAT, vol. 2, no. 3.
[4] S. K. Madria, S. S. Bhowmick, W.-K. Ng, and E.-P. Lim, “Research issues in web data mining,” DataWarehousing and Knowledge Discovery, Springer.
[5] P. Sharma and P. Bhadana, “Weighted page content rank for ordering web search result,” International Journal of Engineering Science and Technology, vol. 2, no. 12.
[6] J. M. Kleinberg, “Hubs, authorities, and communities,” ACM Computing Surveys (CSUR), vol. 31, no. 4es, p. 5.
[7] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, “Mining the Web's link structure,” Computer, vol. 32, no. 8.
[8] R. Kosala and H. Blockeel, “Web mining research: A survey,” SIGKDD Explor. Newsl., vol. 2, no. 1, pp. 1–15, Jun.
[9] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” in Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998, pp. 161–172. [Online]. Available: citeseer.nj.nec.com/page98pagerank.html
[10] W. Xing and A. Ghorbani, “Weighted pagerank algorithm,” in Proceedings of the Second Annual Conference on Communication Networks and Services Research, ser. CNSR ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 305–314.
[11] H. Dubey and P. B. N. Roy, “An improved page rank algorithm based on optimized normalization technique,” pp. 2183–2188, 2011.

Thank you