Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining Summarized by Dongmin Shin Presented by Dongmin Shin User Log Analysis.

Presentation transcript:

Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining
Yusuke Ohura, Katsumi Takahashi, Iko Pramudiono, Masaru Kitsuregawa
Institute of Industrial Science, University of Tokyo; NTT Information Sharing Platform Laboratories
VLDB Conference (2002)
Summarized and presented by Dongmin Shin, User Log Analysis Team, IDS Lab., SNU

Copyright © 2008 by CEBT
Index
- Introduction
- Internet Yellow Page Service and its Problems
- Log Analysis of iTOWNPAGE
- Query Expansion Using Web Log Mining
- Implementation and Evaluation
- Conclusion
Center for E-Business Technology

Introduction
- Rapid progress in storage capacity and processor performance
  - Makes it feasible to analyze the huge log data left on Web servers
  - Still, no technical report on mining huge log data is publicly available
- This paper reports
  - Results of log data mining and query expansion experiments on a huge commercial Web service
  - iTOWNPAGE: an online Japanese telephone directory system (Yellow Page service)

Internet Yellow Page Service and its Problems
- iTOWNPAGE
  - Internet version of TOWNPAGE

Internet Yellow Page Service and its Problems
- Problems found through statistical analysis
  - Log data
    - Access log of iTOWNPAGE from 1 February to 30 June 2000
    - 450 million lines, 200 GB
  - 1st issue: sessions with multiple categories
    - 27.2% of the search sessions that take a category as their variable input are multiple-category sessions
    - 75.2% of them used non-sibling categories, i.e. categories that do not share a parent in the category hierarchy iTOWNPAGE provides
  - 2nd issue: cases where users cannot get any results for their search requests
- [Figure: Overview of Search Requests on iTOWNPAGE]

Log Analysis of iTOWNPAGE
- Session
  - The sequence of requests from a user
  - Two consecutive requests within a 30-minute interval are regarded as part of the same session
- Session vector
  - [Formula: i-th session vector]
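
The session rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(user_id, timestamp, category)` request tuple, the inclusive 30-minute comparison, and the binary form of the session vector are all assumptions, since the transcript does not preserve the log format or the vector formula.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # interval stated on the slide


def split_sessions(requests):
    """Group (user_id, timestamp, category) tuples into per-user sessions.

    The tuple layout is hypothetical; a new session starts whenever the gap
    to the user's previous request exceeds SESSION_GAP.
    """
    sessions = []
    last_seen = {}  # user_id -> (time of last request, index into sessions)
    for user, ts, category in sorted(requests, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is not None and ts - prev[0] <= SESSION_GAP:
            index = prev[1]
            sessions[index].append(category)
        else:
            sessions.append([category])
            index = len(sessions) - 1
        last_seen[user] = (ts, index)
    return sessions


def session_vector(session, categories):
    """Assumed binary encoding: 1 if the category occurs in the session."""
    requested = set(session)
    return [1 if c in requested else 0 for c in categories]
```

With this encoding every session becomes a fixed-length vector over the category list, which is the input a vector-space clustering step expects.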

Log Analysis of iTOWNPAGE
- K-means clustering algorithm for clustering sessions
  - The number of clusters cannot be predicted in advance
  - The algorithm is therefore improved so that it decides the number of clusters dynamically
- Improved K-means algorithm
  - The 1st input vector becomes the centroid vector of the first cluster C_1 and a member of C_1
  - For each successive input vector, the similarity with the existing clusters C_1 ... C_k is calculated [Formula: similarity]
    - If the similarity to every existing cluster is below the threshold, a new cluster is generated
    - If not, the input vector becomes a member of the cluster with the highest similarity, and that cluster's centroid vector is recalculated [Formula: centroid]
  - The process is executed iteratively until it converges
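
A rough sketch of the threshold-based variant described above. The similarity and centroid formulas are not preserved in the transcript, so cosine similarity and the arithmetic mean are used here as assumptions; the function names, the `max_iter` cap, and the choice to hold centroids fixed during later passes are likewise illustrative rather than the paper's design.

```python
import math


def cosine(a, b):
    """Cosine similarity (assumed; the slide's formula is not available)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean, used as the assumed centroid formula."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def threshold_kmeans(vectors, threshold, max_iter=20):
    """Cluster vectors without fixing the number of clusters in advance."""
    centroids, labels = [], None
    for iteration in range(max_iter):
        members = [[] for _ in centroids]
        new_labels = []
        for v in vectors:
            sims = [cosine(v, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__, default=None)
            if best is None or sims[best] < threshold:
                centroids.append(list(v))  # open a new cluster seeded by v
                members.append([])
                best = len(centroids) - 1
            members[best].append(v)
            if iteration == 0:
                # First pass: update the centroid as each member arrives.
                centroids[best] = mean_vector(members[best])
            new_labels.append(best)
        # Drop clusters left empty by this pass and renumber the rest.
        keep = [i for i, m in enumerate(members) if m]
        remap = {old: new for new, old in enumerate(keep)}
        centroids = [mean_vector(members[i]) for i in keep]
        new_labels = [remap[label] for label in new_labels]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids
```

A higher threshold yields more, tighter clusters; a lower one merges more sessions into each cluster.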

K-means Clustering Algorithm
- Overview
  - An algorithm to cluster n objects into k partitions based on their attributes
  - It assumes that the object attributes form a vector space
  - Its objective is to minimize the total intra-cluster variance
- Algorithm
  - Partition the input points into k initial sets, either at random or using some heuristic
  - Calculate the centroid (mean point) of each set
  - Construct a new partition by associating each point with the closest centroid
  - Recalculate the centroids for the new clusters, and repeat these two steps alternately until convergence
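
For comparison, the standard procedure above can be sketched in a few lines of plain Python. This is a generic illustration, not code from the paper: it seeds the centroids with k randomly chosen input points (one common heuristic, used here instead of a random initial partition), and the function name, `seed`, and `max_iter` parameters are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until labels stop changing."""
    rng = random.Random(seed)
    # Heuristic initialisation: pick k distinct input points as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            cluster = [p for p, label in zip(points, labels) if label == j]
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return labels, centroids
```

Unlike the threshold-based variant, k must be supplied by the caller, which is the limitation the presentation points out for session clustering.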

Log Analysis of iTOWNPAGE
- Only categories whose number of sessions exceeds TH_cat of the total sessions in the cluster are displayed
- Many categories that are non-siblings in the category hierarchy appear in the same clusters
- This suggests that search sessions with the same input, such as "Hotels", are performed under various demands and contexts
- Clustering web access logs is effective for understanding user behavior

Query Expansion Using Web Log Mining
- Motivation
  - Many requests end with no result
    - Best solution: recommend another address
      - Possible only when coordinate information for addresses is available
    - Another solution: recommend categories
      - Requires similarity between categories, which can be extracted by clustering the user access log
  - Many sessions consist of non-sibling categories
    - Propose another expansion method that recommends categories that are not similar to the input category but are related to it

Query Expansion Using Web Log Mining
- Strategies for query expansion
  - Intra-category recommendation: selects sibling categories that appear in major clusters of CAT_input
    1. Find the clusters that have CAT_input as a member, in order of the appearance ratio of CAT_input
    2. Choose the sibling category with the highest count from each cluster until the number of sibling categories reaches MAX_sibl
  - Inter-category recommendation: selects non-sibling categories that appear in major clusters of CAT_input
    1. Choose the non-sibling category of CAT_input with the highest count from each cluster, up to MAX_non-sibl, in the same way as in the intra-category steps
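
The two strategies above can be sketched as one function. Everything about the data layout is an assumption made for illustration: each cluster is a dict with a hypothetical `size` (number of sessions) and `counts` (sessions per category), the appearance ratio is taken to be `counts[CAT_input] / size`, and `parent` is a hypothetical map from category to its parent in the hierarchy. Category names other than "Hotels" in the usage example are invented.

```python
def is_sibling(a, b, parent):
    """Two distinct categories are siblings if they share a parent."""
    return a != b and parent.get(a) is not None and parent.get(a) == parent.get(b)


def recommend(cat_input, clusters, parent, max_sibl=2, max_non_sibl=2):
    """Return (intra-category, inter-category) recommendations for cat_input."""
    # Step 1: clusters containing cat_input, highest appearance ratio first.
    ranked = sorted(
        (c for c in clusters if c["counts"].get(cat_input, 0) > 0 and c["size"] > 0),
        key=lambda c: c["counts"][cat_input] / c["size"],
        reverse=True,
    )

    def pick(want_sibling, limit):
        # Step 2: take the highest-count candidate from each cluster in turn.
        chosen = []
        for cluster in ranked:
            if len(chosen) >= limit:
                break
            options = [
                (count, cat)
                for cat, count in cluster["counts"].items()
                if cat != cat_input
                and cat not in chosen
                and is_sibling(cat, cat_input, parent) == want_sibling
            ]
            if options:
                chosen.append(max(options, key=lambda o: o[0])[1])
        return chosen

    return pick(True, max_sibl), pick(False, max_non_sibl)


# Usage with invented data:
parent = {
    "Hotels": "Lodging", "Ryokan": "Lodging", "Business Hotels": "Lodging",
    "Rent-a-car": "Transport", "Golf Courses": "Leisure",
}
clusters = [
    {"size": 100, "counts": {"Hotels": 80, "Ryokan": 30, "Rent-a-car": 25}},
    {"size": 50, "counts": {"Hotels": 20, "Business Hotels": 15,
                            "Golf Courses": 10, "Ryokan": 5}},
    {"size": 40, "counts": {"Rent-a-car": 30}},
]
intra, inter = recommend("Hotels", clusters, parent)
```

Taking at most one category per cluster spreads the recommendations across the different contexts in which the input category was searched.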

Implementation and Evaluation

Implementation and Evaluation
- Another set of log data is used to test the expansion method
  - 1 July to 20 July 2000
- First, the test data is converted into sessions
  - "Category A -> Category B -> Category C"
- Transition relations are extracted from the sessions
  - "Category A -> Category B", "Category B -> Category C"
- Features
  - N: number of test relations
  - S: number of successful expansions after the expansion test
  - C_i: number of expanded categories displayed for the i-th test request
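
The test procedure above can be sketched as follows. The feature formulas did not survive in the transcript, so the two ratios computed here (S / N and the mean of C_i) are assumed summaries rather than the paper's definitions; `expand` is a hypothetical callable that returns the expanded categories for an input category, and an expansion is counted as successful when it contains the category the user moved to next.

```python
def transitions(session):
    """Split "A -> B -> C" into the relations (A, B) and (B, C)."""
    return list(zip(session, session[1:]))


def evaluate(test_sessions, expand):
    """Count N, S and the displayed categories over all test relations."""
    n = s = shown = 0
    for session in test_sessions:
        for src, dst in transitions(session):
            expanded = expand(src)
            n += 1                 # N: one more test relation
            shown += len(expanded)  # accumulates C_i
            if dst in expanded:
                s += 1             # S: expansion contained the next category
    return {
        "N": n,
        "S": s,
        "hit_ratio": s / n if n else 0.0,          # assumed metric: S / N
        "avg_displayed": shown / n if n else 0.0,  # assumed metric: mean C_i
    }
```

Reporting the average number of displayed categories next to the hit ratio guards against an expansion that succeeds only by recommending a very long list.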

Conclusion
- Reports experimental results of mining the access log of a huge commercial site
- Proposes a query expansion method based on clustering of user requests
- Enhances the K-means clustering algorithm
- Two-step expansion method
  - Recommendation of similar categories
  - Recommendation of related categories, even though they are not similar in the category hierarchy

Summary
- Pros
  - It uses real data from a commercial website
  - Simple and useful
- Cons
  - Nothing special
    - Clustering user sessions
  - Two-step expansion method?