Fig. 1 (a) The PageRank algorithm (b) The web link structure


Fig. 1 shows the PageRank algorithm with random teleports and the web link structure.

Construct the column-stochastic matrices M and A.
Calculate the PageRank with random teleports (β = 0.8) for three iterations.
Which of the three nodes {Yahoo, Amazon, M'soft} is the most important node?

Fig. 1 (a) The PageRank algorithm (b) The web link structure

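Ans:

With rows and columns ordered y (Yahoo), a (Amazon), m (M'soft):

$$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix}$$

$$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$$

$$r_{k+1} = A r_k$$

$$\begin{bmatrix} y \\ a \\ m \end{bmatrix}: \quad r_0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad r_1 = \begin{bmatrix} 1.00 \\ 0.60 \\ 1.40 \end{bmatrix}, \quad r_2 = \begin{bmatrix} 0.84 \\ 0.60 \\ 1.56 \end{bmatrix}, \quad r_3 = \begin{bmatrix} 0.776 \\ 0.536 \\ 1.688 \end{bmatrix}$$

The most important node is M'soft (m), which has the largest rank (1.688) after three iterations.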
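The iteration above can be checked with a few lines of Python. This is a minimal sketch, not the course's reference code: the function names `teleport_matrix` and `power_iterate` are made up for illustration, and the node order y, a, m and the starting vector (1, 1, 1) are taken from the answer.

```python
def teleport_matrix(M, beta):
    """Return A = beta * M + (1 - beta) * (1/n) for a column-stochastic M."""
    n = len(M)
    return [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)]
            for i in range(n)]


def power_iterate(A, r0, steps):
    """Apply r_{k+1} = A r_k `steps` times and return [r_0, r_1, ..., r_steps]."""
    n = len(A)
    history = [list(r0)]
    r = list(r0)
    for _ in range(steps):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        history.append(r)
    return history


# Link matrix from the answer; rows and columns are ordered y, a, m.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
A = teleport_matrix(M, beta=0.8)
history = power_iterate(A, r0=[1.0, 1.0, 1.0], steps=3)

nodes = ["Yahoo", "Amazon", "M'soft"]
r3 = history[3]
most_important = nodes[r3.index(max(r3))]
print([round(x, 3) for x in r3], most_important)
```

Because every column of A sums to 1, the entries of each r_k keep summing to 3, which is a quick sanity check on hand calculations.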

As shown in Fig. 2, document frequency thresholding is an important step toward feature extraction in text mining.

a) Given the content of documents D1 and D2 shown in Fig. 3, fill in the document-feature matrix.
b) Given N = 10 and a threshold of 1.5, which feature terms are extracted from D1 using inverse document frequency weighting?
c) Given N = 10 and a threshold of 1.0, which feature terms are extracted from D2 using entropy weighting?

P.S. Part c) is computationally heavy; because exam time is limited, this type of question is unlikely to appear on the final exam.

Fig. 2 Flowchart for feature extraction in text mining.
Fig. 3

Feature Extraction: Weighting Model (3)

Entropy weighting, where:

gf_j ::= the number of times the j-th term occurs in the whole training document collection

The average entropy of the j-th term is
-1: if the word occurs exactly once in every document
0: if the word occurs in only one document
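The two boundary values above can be reproduced with one common formulation of average entropy, (1 / log N) · Σ_i (tf_ij / gf_j) · log(tf_ij / gf_j), summed over the N documents. The slide's own formula is not present in the transcript, so treat this as an assumption that happens to match the stated limits; the function name `average_entropy` is illustrative.

```python
import math


def average_entropy(term_freqs):
    """Average entropy of one term, given its frequency in each of N documents.

    Assumed formulation (the slide's formula is not available):
        (1 / log N) * sum_i (tf_i / gf) * log(tf_i / gf)
    where gf is the term's total count over the whole collection.
    Documents where the term does not occur contribute nothing to the sum.
    """
    n_docs = len(term_freqs)
    gf = sum(term_freqs)
    if n_docs < 2 or gf == 0:
        raise ValueError("need at least two documents and one occurrence")
    total = sum((tf / gf) * math.log(tf / gf) for tf in term_freqs if tf > 0)
    return total / math.log(n_docs)


# Word occurs exactly once in every one of 10 documents: average entropy is -1.
once_everywhere = average_entropy([1] * 10)

# Word occurs (three times) in only one of 10 documents: average entropy is 0.
only_one_doc = average_entropy([3] + [0] * 9)

print(round(once_everywhere, 6), round(only_one_doc, 6))
```

Under this convention a term spread evenly over all documents gets the lowest value and a term concentrated in a single document gets the highest, so concentrated terms make better features.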

Ans:

a) Document-feature matrix:

Terms: K O Q R S T W X
D1: 4 1
D2: 2

b) $w_{ij} = \mathrm{Freq}_{ij} \times \log_2(N / \mathrm{DocFreq}_j)$

Terms: A B K O Q R S T W X

DocFreq_j            10     5      3      2      1
N/DocFreq_j          1.00   2.00   3.33   5.00   10.00
log2(N/DocFreq_j)    0.00   1.00   1.74   2.32   3.32

Freq_ij for D1: 4

tf×idf => feature terms: O, R, S, W

c) $w_{ij} = \log_2(\mathrm{Freq}_{ij} + 1) \times (1 - \mathrm{Entropy}(w_j))$

Terms: A B K O Q R S T W X
Freq_ij for D2: 4 2 1
Entropy(w_j): 0.4 0.1 0.3
w_ij: 1.39 1.43 0.90 0.00 0.70 0.60

=> feature terms: A, B
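The two weighting formulas and the thresholding step can be sketched in Python as follows. Several things here are assumptions rather than facts from the slides: the helper names are illustrative, a term is kept when its weight is strictly greater than the threshold, the (frequency, entropy) pairs for A and B are inferred from the surviving numbers (they reproduce the weights 1.39 and 1.43), and the entry named `other` is a hypothetical term added only to show one that falls below the threshold.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, log2(N / DocFreq_j)."""
    return math.log2(n_docs / doc_freq)


def tfidf_weight(freq, n_docs, doc_freq):
    """w_ij = Freq_ij * log2(N / DocFreq_j)."""
    return freq * idf(n_docs, doc_freq)


def entropy_weight(freq, entropy):
    """w_ij = log2(Freq_ij + 1) * (1 - Entropy(w_j))."""
    return math.log2(freq + 1) * (1.0 - entropy)


def select_features(weights, threshold):
    """Keep terms whose weight exceeds the threshold (strict '>' is an assumption)."""
    return [term for term, w in weights.items() if w > threshold]


# IDF values for the document frequencies listed in part b), with N = 10.
idf_row = [round(idf(10, df), 2) for df in (10, 5, 3, 2, 1)]

# Part c): (Freq, Entropy) pairs; A and B are inferred, 'other' is hypothetical.
d2_weights = {
    "A": entropy_weight(4, 0.4),
    "B": entropy_weight(2, 0.1),
    "other": entropy_weight(1, 0.1),
}
d2_features = select_features(d2_weights, threshold=1.0)

print(idf_row)
print({t: round(w, 2) for t, w in d2_weights.items()}, d2_features)
```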

Data preprocessing is essential for web usage mining.

a) Explain the four steps of data preprocessing.
b) Given the web page linkage shown in Fig. 4 (c), refine the user sessions shown in Fig. 4 (a).
c) Given the web page linkage shown in Fig. 4 (c), complete the paths in Fig. 4 (b).

Fig. 4

Ans:

a)

b) Three sessions:
A-B-F-O-G-A-D
L-R
A-B-C-J

or four sessions:
A-B-F-O-G
A-D
L-R
A-B-C-J

c) Four sessions:
A-B-F-O-F-B-G
A-D
L-R
A-B-A-C-J
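Path completion in part c) can be sketched as a back-stack walk: when the next logged page is not linked from the current page, the user is assumed to have pressed Back (cached views leave no log entry) until reaching a page that does link to it. Fig. 4 (c) is not reproduced in the transcript, so the `links` table below is an assumption inferred from the answers, and `complete_path` is an illustrative name.

```python
def complete_path(session, links):
    """Insert the backtracked pages that are missing from a logged session.

    session: list of page names in logged order.
    links:   dict mapping a page to the set of pages it links to.
    """
    if not session:
        return []
    path = [session[0]]
    stack = [session[0]]  # pages reachable again via the Back button
    for page in session[1:]:
        # Most recent page on the back-stack that links to `page`, if any.
        idx = next(
            (i for i in range(len(stack) - 1, -1, -1)
             if page in links.get(stack[i], ())),
            None,
        )
        if idx is not None:
            # Pages revisited while going back, nearest first.
            path.extend(reversed(stack[idx:-1]))
            del stack[idx + 1:]
        stack.append(page)
        path.append(page)
    return path


# Assumed linkage, inferred from the answers (Fig. 4 (c) is not available).
links = {
    "A": {"B", "C", "D"},
    "B": {"F", "G"},
    "F": {"O"},
    "C": {"J"},
    "L": {"R"},
}

sessions = ["A-B-F-O-G", "A-D", "L-R", "A-B-C-J"]
completed = ["-".join(complete_path(s.split("-"), links)) for s in sessions]
print(completed)
```

If no earlier page links to the next one, the sketch simply appends it unchanged; a fuller treatment would start a new session at that point, which is how the three sessions in part b) become four.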