© 2006 KDnuggets 152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"

Slides:



Advertisements
Similar presentations
Web Development & Design Foundations with XHTML
Advertisements

SEO IS NOT WHAT IS WAS Search Engine Optimization = WEB OPTIMIZATION
Search Engine Marketing 101 June 20, 2006 Presented By:
Fatma Y. ELDRESI Fatma Y. ELDRESI ( MPhil ) Systems Analysis / Programming Specialist, AGOCO Part time lecturer in University of Garyounis,
The World Wide Web and the Internet MIS XLM.B Jack G. Zheng June 20 th 2005.
The World Wide Web and the Internet MIS XLM.B Jack G. Zheng May 13 th 2008.
8 th of February 2006WPG1 Stakeholders Usage Reporting - STUR aka NetTracker Upgrade.
The Internet and the Web
Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 4.1 Chapter 4 : Searching the Web The mechanics.
The Structure of the Web Mark Levene (Follow the links to learn more!)
Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 1.1 Chapter 1 : Introduction The World-Wide-Web.
Acroterion Search Engine Solutions Presentation The next mousetrap Mom, What's a Library? Search Engine Marketing is Necessity Billions of dollars poured.
Search Engine Marketing Free Traffic for Your Web Site Paul Allen, CEO
Search Engines & Search Engine Optimization (SEO) Presentation by Saeed El-Darahali 7 th World Congress on the Management of e-Business.
CS 345A Data Mining Lecture 1
CS 345A Data Mining Lecture 1 Introduction to Web Mining.
Sigir’99 Inside Internet Search Engines: Fundamentals Jan Pedersen and William Chang.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
CS 345 Data Mining Lecture 1 Introduction to Web Mining.
Introduction Web Development II 5 th February. Introduction to Web Development Search engines Discussion boards, bulletin boards, other online collaboration.
Network Science and the Web: A Case Study Networked Life CIS 112 Spring 2009 Prof. Michael Kearns.
Search Engine Optimization (SEO)
SEARCH ENGINES By, CH.KRISHNA MANOJ(Y5CS021), 3/4 B.TECH, VRSEC. 8/7/20151.
Search Engine Marketing Pay Per Click. Distinctions: SEM vs. SEO Search Engine Marketing (SEM), aka –“paid search” –“pay per click” (PPC) Search Engine.
The Technical SEO Audit Rick Ramos | seOveflow. Introduction  SEO is search engine usability.  Why do you need an audit?  How nimble are your development.
By Raza / Faisal By: Raza Usmani Faisal Khan. What is SEO? It is the process of affecting the visibility of a website or a web page in a search engine's.
Web Crawling David Kauchak cs160 Fall 2009 adapted from:
Introductions Search Engine Development COMP 475 Spring 2009 Dr. Frank McCown.
Browser Wars and the Politics of Search Engines
Web Search Created by Ejaj Ahamed. What is web?  The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain.
Strategies for improving Web site performance Google Webmaster Tools + Google Analytics Marshall Breeding Director for Innovative Technologies and Research.
© 2006 KDnuggets [16/Nov/2005:16:32: ] "GET /jobs/ HTTP/1.1" "
Search Engines & Search Engine Optimization (SEO).
Lecture 7 World Wide Web CSCS100 – Fall 2009 – Forman Christian College Asher Imtiaz *Several of these slides have been adapted and modified from VU CS101.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Search Engine Optimization & Pay Per Click Advertising
1 Search Engine Optimization An introduction to optimizing your web site for best possible search engine results.
Online Advertising Core Concepts are Identical to traditional advertising: –Building Brand Awareness –Creating Consumer Demand –Informing Consumers of.
Search Engine Optimization 101 What is SEM? SEO? How can I use SEO on my blogs and/or my personal web space?
Improving Cloaking Detection Using Search Query Popularity and Monetizability Kumar Chellapilla and David M Chickering Live Labs, Microsoft.
Search Engines By: Faruq Hasan.
Internet Network of networks Mother of all networks
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Week 1 Introduction to Search Engine Optimization.
Microsoft Office 2008 for Mac – Illustrated Unit D: Getting Started with Safari.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Week-6 (Lecture-1) Publishing and Browsing the Web: Publishing: 1. upload the following items on the web Google documents Spreadsheets Presentations drawings.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Data mining in web applications
22C:145 Artificial Intelligence
Search Engine Optimization
Marking the Most of the Web’s Resources
Dr. Frank McCown Comp 250 – Web Development Harding University
Web Mining Ref:
Introduction to Web Mining
Microsoft Office Illustrated Introductory, Premium Edition
Objective % Explain concepts used to create websites.
Search Engine Optimization (SEO)
Agenda What is SEO ? How Do Search Engines Work? Measuring SEO success ? On Page SEO – Basic Practices? Technical SEO - Source Code. Off Page SEO – Social.
All About the Internet.
CS 345A Data Mining Lecture 1
CS 345A Data Mining Lecture 1
Introduction to Web Mining
CS 345A Data Mining Lecture 1
Presentation transcript:

© 2006 KDnuggets [16/Nov/2005:16:32: ] "GET /jobs/ HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR )“ [16/Feb/2006:00:06: ] "GET / HTTP/1.1" " 740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)" [16/Feb/2006:00:06: ] "GET /kdr.css HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)" [16/Feb/2006:00:06: ] "GET /images/KDnuggets_logo.gif HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)" Web Mining: An Introduction Gregory Piatetsky-Shapiro KDnuggets An extract from KDnuggets web log [16/Nov/2005:16:32: ] "GET /jobs/ HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR )“ [16/Feb/2006:00:06: ] "GET / HTTP/1.1" " 740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)" [16/Feb/2006:00:06: ] "GET /kdr.css HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)" [16/Feb/2006:00:06: ] "GET /images/KDnuggets_logo.gif HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

© 2006 KDnuggets World Wide Web – a brief history  Who invented the wheel is unknown  Who invented the World-Wide Web ?  (Sir) Tim Berners-Lee  in 1989, while working at CERN, invented the World Wide Web, including URL scheme, HTML, and in 1990 wrote the first server and the first browser  Mosaic browser developed by Marc Andreessen and Eric Bina at NCSA (National Center for Supercomputing Applications) in 1993; helped rapid web spread  Mosaic was basis for Netscape …

© 2006 KDnuggets What is Web Mining? Examples:  Web search, e.g. Google, Yahoo, MSN, Ask, …  Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog)  eCommerce :  Recommendations: e.g. Netflix, Amazon  improving conversion rate: next best product to offer  Advertising, e.g. Google Adsense  Fraud detection: click fraud detection, …  Improving Web site design and performance Discovering interesting and useful information from Web content and usage

© 2006 KDnuggets How does it differ from “classical” Data Mining?  The web is not a relation  Textual information and linkage structure  Usage data is huge and growing rapidly  Google’s usage logs are bigger than their web crawl  Data generated per day is comparable to largest conventional data warehouses  Ability to react in real-time to usage patterns  No human in the loop Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets How big is the Web ?  Number of pages  Technically, infinite  Because of dynamically generated content  Lots of duplication (30-40%)  Best estimate of “unique” static HTML pages comes from search engine claims  Google = 8 billion, Yahoo = 20 billion  Lots of marketing hype Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets 76,184,000 web sites (Feb 2006) Netcraft survey

© 2006 KDnuggets The web as a graph  Pages = nodes, hyperlinks = edges  Ignore content  Directed graph  High linkage  8-10 links/page on average  Power-law degree distribution Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Power-law degree distribution Source: Broder et al, 2000 Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Power-laws galore  In-degrees  Out-degrees  Number of pages per site  Number of visitors  Let’s take a closer look at structure  Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls  Not a “small world” Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Bow-tie Structure Source: Broder et al, 2000 Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Searching the Web Content aggregators The Web Content consumers Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Ads vs. search results Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Ads vs. search results  Search advertising is the revenue model  Multi-billion-dollar industry  Advertisers pay for clicks on their ads  Interesting problems  How to pick the top 10 results for a search from 2,230,000 matching pages?  What ads to show for a search?  If I’m an advertiser, which search terms should I bid on and how much to bid? Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Sidebar: What’s in a name?  Geico sued Google, contending that it owned the trademark “Geico”  Thus, ads for the keyword geico couldn’t be sold to others  Court Ruling: search engines can sell keywords including trademarks  No court ruling yet: whether the ad itself can use the trademarked word(s) Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Extracting Structured Data Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Extracting structured data Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets The Long Tail Source: Chris Anderson (2004) Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets The Long Tail  Shelf space is a scarce commodity for traditional retailers  Also: TV networks, movie theaters,…  The web enables near-zero-cost dissemination of information about products  More choices necessitate better filters  Recommendation engines (e.g., Amazon)  How Into Thin Air made Touching the Void a bestseller Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Web Mining topics  Crawling the web  Web graph analysis  Structured data extraction  Classification and vertical search  Collaborative filtering  Web advertising and optimization  Mining web logs  Systems Issues Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Web search basics The Web Ad indexes Web crawler Indexer Indexes Search User Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets Search engine components  Spider (a.k.a. crawler/robot) – builds corpus  Collects web pages recursively  For each known URL, fetch the page, parse it, and extract new URLs  Repeat  Additional pages from direct submissions & other sources  The indexer – creates inverted indexes  Various policies wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.  Query processor – serves query results  Front end – query reformulation, word stemming, capitalization, optimization of Booleans, etc.  Back end – finds matching documents and ranks them Reproduced from Ullman & Rajaraman with permission

© 2006 KDnuggets New Web Professions  SEM - Search Engine Marketing  SEO – Search Engine Optimization  Chief Data Officer (at Yahoo)

© 2006 KDnuggets Web Mining  Web content (and structure) mining  so far  Web usage mining  next

© 2006 KDnuggets Web Usage Mining Understanding is a pre-requisite to improvement 1 Google, but 70,000,000+ web sites Applications:  Simple and Basic:  Monitor performance, bandwidth usage  Catch errors (404 errors- pages not found)  Improve web site design  (shortcuts for frequent paths, remove links not used, etc)  …  Advanced and Business Critical :  eCommerce: improve conversion, sales, profit  Fraud detection: click stream fraud, …  …