Website Clustering Combining Website Lexical Data and Query Semantic Data
Nana Huang, Ray Li



Traditional Lexical Features
Traditional website clustering uses lexical data parsed from each webpage to classify websites into categories. Typical sources are regular text, HTML tags, and meta tags (description, keywords, author).
What if the webpage consists mainly of content generated automatically by scripts?
What if the webpage is an empty frame page with two or more frames?
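As a rough illustration of the lexical features described above, the following sketch pulls the title, selected meta tags, and visible text out of a page using Python's standard `html.parser` module. The class name `LexicalFeatureParser`, the function `extract_lexical_features`, and the sample page are all hypothetical; the slides do not say how the authors parsed their pages.

```python
from html.parser import HTMLParser


class LexicalFeatureParser(HTMLParser):
    """Collects the title, selected meta tags, and visible text of a page."""

    META_NAMES = {"description", "keywords", "author"}
    SKIPPED = {"script", "style"}  # content here is not visible text

    def __init__(self):
        super().__init__()
        self.title = []
        self.text = []
        self.meta = {}
        self._stack = []  # currently open <title>, <script>, or <style> tags

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in self.META_NAMES:
                self.meta[name] = attrs.get("content") or ""
        elif tag == "title" or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current in self.SKIPPED:
            return  # ignore script and style bodies
        if current == "title":
            self.title.append(data)
        elif data.strip():
            self.text.append(data.strip())


def extract_lexical_features(html):
    parser = LexicalFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title).strip(),
        "meta": parser.meta,
        "text": " ".join(parser.text),
    }


# Hypothetical page: most of its content would come from a script.
page = """<html><head><title>Business Brochures</title>
<meta name="keywords" content="brochures, printing">
<script>document.write("generated");</script></head>
<body><p>Order brochures online.</p></body></html>"""
features = extract_lexical_features(page)
```

A page built mostly by scripts, or an empty frame page, would leave `features["text"]` nearly empty, which is the weakness the two questions above point at.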

AOL Clickthrough Data
In August 2006, AOL released 2.2 GB of search logs, which include queries, clicked websites, and the rank of each clicked website in the result list.

Query                    Rank   Clicked URL
brochures for business   5      http://
brochures for business   6      http://
brochures for business   8      http://
brochures for business   10     http://
brochures for business   9      http://
brochures for business   7      http://

Query-Website Graph
We parsed a subset of this data to generate a query-document bipartite graph, in which each edge is weighted by the number of times a query led to a click on a website.
[Figure: bipartite graph connecting queries Q1 to Q5 with documents D1 to D5]
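The graph construction can be sketched as a nested counter keyed by query and then by clicked URL. This is only an assumed representation; the function name `build_click_graph`, the lowercasing of queries, and the example URLs are hypothetical, since the slides show neither the parsing code nor complete URLs.

```python
from collections import Counter, defaultdict


def build_click_graph(rows):
    """Build {query: Counter({url: clicks})} from (query, clicked_url) rows."""
    graph = defaultdict(Counter)
    for query, url in rows:
        if url:  # rows without a click contribute no edge
            graph[query.strip().lower()][url] += 1
    return graph


# Hypothetical log rows; real rows would come from the AOL log subset.
rows = [
    ("brochures for business", "http://d1.example"),
    ("brochures for business", "http://d1.example"),
    ("brochures for business", "http://d2.example"),
    ("business flyers", "http://d2.example"),
    ("business flyers", ""),
]
graph = build_click_graph(rows)
```

Each inner count is the edge weight between a query node and a document node.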

Query-Website Graph
A graph like this is most likely too sparse to be useful: many ‘clicks’ between queries and other related webpages are never observed. We use an iterative process to ‘smooth’ the bipartite relationship between queries and websites, based on two observations:
Documents are ‘similar’ to some extent if they were clicked for the same query.
Queries are ‘similar’ to some extent if they lead to the same document.
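The slides do not give the update rule for the smoothing step, so the following is only one possible reading of the two observations above, not the authors' algorithm. It assumes that query similarity and document similarity are both derived from shared clicks, and that each iteration mixes the current weights with weights borrowed through similar queries and similar documents. The mixing weight `alpha`, the iteration count, and all function names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def row_normalize(m):
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def smooth(clicks, alpha=0.5, iterations=2):
    """clicks[q][d] is a click count; returns smoothed query-document weights."""
    w = row_normalize(clicks)
    for _ in range(iterations):
        # Queries are similar if they share clicked documents.
        query_sim = row_normalize(matmul(w, transpose(w)))
        # Documents are similar if they share queries.
        doc_sim = row_normalize(matmul(transpose(w), w))
        via_queries = matmul(query_sim, w)  # borrow clicks from similar queries
        via_docs = matmul(w, doc_sim)       # spread clicks to similar documents
        w = row_normalize([
            [(1 - alpha) * w[i][j] + alpha * 0.5 * (via_queries[i][j] + via_docs[i][j])
             for j in range(len(w[0]))]
            for i in range(len(w))
        ])
    return w


# Q1 clicked D1 and D2; Q2 clicked D2 and D3. Q1-D3 and Q2-D1 are unobserved.
clicks = [[2, 1, 0],
          [0, 1, 1]]
smoothed = smooth(clicks)
```

After smoothing, the previously empty Q1-D3 and Q2-D1 cells receive a small positive weight because the two queries share D2. Column j of the smoothed matrix can then serve as a query-based feature vector for document j.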

Query-Website Graph
This produces a more realistic query-website bipartite relationship. We can then use the list of queries associated with each website as a semantic feature vector.
[Figure: two versions of the bipartite graph over queries Q1, Q2 and documents D1, D2, D3]

Combined Feature Vectors
We have three sets of feature vectors for each document:
Lexical features (text and different HTML tags from the webpage itself)
Semantic features (query information related to each webpage)
A combination of both
There are words and 2000 queries – too many features.
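The slides do not state how the two feature sets are combined. One simple option, shown here purely as an assumption, is to scale each vector to unit length and concatenate them so that neither feature set dominates by magnitude alone. The names `l2_normalize` and `combine` and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def combine(lexical, semantic):
    """Concatenate unit-length lexical and semantic vectors for one website."""
    return l2_normalize(lexical) + l2_normalize(semantic)


lexical = [3.0, 4.0]    # toy word weights
semantic = [0.0, 2.0]   # toy query weights
combined = combine(lexical, semantic)
```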

Latent Semantic Analysis
We then apply Latent Semantic Analysis to reduce the features to a lower-rank approximation with 30 ‘virtual concepts’, for example {Chicken, Beef, Apple, Oranges} -> {Meat, Fruits}. Each website is transformed from its original feature vector into a new vector of ‘virtual concepts’.
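Latent Semantic Analysis is commonly computed with a truncated singular value decomposition, and the sketch below follows that approach using NumPy. The toy matrix mirrors the Chicken/Beef/Apple/Oranges example with k = 2 rather than the 30 concepts used in the slides; the function name `lsa` and the counts are hypothetical.

```python
import numpy as np


def lsa(features, k):
    """Map an (n_sites, n_features) matrix to (n_sites, k) concept coordinates."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    # Keep the k largest singular values; rows are sites in concept space.
    return u[:, :k] * s[:k]


# Columns: chicken, beef, apple, oranges (toy counts per website).
sites = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
concepts = lsa(sites, k=2)
```

In this toy case the two meat-oriented sites land on the same point in concept space, and the fruit-oriented sites land on a point orthogonal to it, even though no two sites have identical word counts.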

K-Means + Results
We then apply K-means in this new vector space to group websites into categories. Using only the semantic query vector performs worse than using the lexical feature vector, but combining both feature sets gives slightly better clustering performance.

Features                    F1
Lexical + Semantic Query    0.50
Lexical only                0.47
Queries only                0.30
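A minimal sketch of the clustering and scoring step follows. It uses standard Lloyd-style K-means with a deterministic farthest-first initialization, which is an assumption made to keep the example reproducible. The slides also do not say how F1 was computed for clusters, so the pairwise F1 below is one common choice rather than the authors' metric. All names and data points are hypothetical.

```python
from itertools import combinations


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Pick the first point, then repeatedly the point farthest from all picks."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iterations=20):
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pairwise_f1(predicted, truth):
    """F1 over pairs of items: a pair is positive if both share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy concept-space coordinates for four websites with known categories.
points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
truth = ["meat", "meat", "fruit", "fruit"]
labels = kmeans(points, k=2)
score = pairwise_f1(labels, truth)
```

K-means can settle in a poor local optimum under an unlucky initialization, which is why this sketch avoids random seeding; a practical run would usually try several initializations and keep the best.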