1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4.

Presentation transcript:

1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4

2 Course Administration

3 Search Engine Spam: Objective
Success of commercial Web sites depends on the number of visitors that find the site while searching for a particular product.
85% of searchers look at only the first page of results.
A new business sector: search engine optimization.
M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. International Joint Conference on Artificial Intelligence.
Drost, I. and Scheffer, T., Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. 16th European Conference on Machine Learning, Porto, 2005.

4 Spam: Meta Tags
Meta tags give the creator of a Web page a place for cataloguing data that describes the page, but they can also be used for advertising, misleading, or otherwise mischievous text.
Example: (October 2000)

5 Search Engine Spam: Techniques
Invisible text: Add keywords to a page in the hope that search engines will index them, but arrange them so that they are not visible to a user, e.g., through special formatting, text in the background color, etc.
Cloaking: Return a different page to Web crawlers than to ordinary downloads. (Can also be used to help Web search, e.g., by providing a text version of a highly visual page.)

6 Search Engine Spam: Anchor Text Search engines assume that anchor text provides helpful terms to index the page that is linked to. But anchor text can be deliberately misleading. Consider the impact if a million pages each contained the anchor text: Cornell University
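A minimal sketch of why anchor text is so easy to abuse, assuming an indexer that simply credits every anchor term to the page being linked to. The class `AnchorCollector`, the function `anchor_index`, and the `*.example` URLs are hypothetical names used only for illustration; real search engines weight and filter anchor text in ways this sketch does not attempt.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def anchor_index(pages):
    """Map each link target to a count of the anchor terms pointing at it."""
    index = defaultdict(Counter)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        for target, text in collector.pairs:
            index[target].update(text.lower().split())
    return index


# Two unrelated pages use the same anchor text for the same target.
pages = {
    "http://a.example/": '<p>Visit <a href="http://target.example/">Cornell University</a></p>',
    "http://b.example/": '<a href="http://target.example/">Cornell University</a> '
                         'and <a href="http://other.example/">news</a>',
}
index = anchor_index(pages)
```

Every linking page adds to the target's term counts regardless of what the target actually contains, which is the property that a large number of coordinated pages can exploit.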

7 Search Engine Spam: Anchor Text
Google Bomb: a collective hyperlinking strategy intended to change the search results of a specific term or phrase.
Examples:
- The "miserable failure" Google bomb promoted George W. Bush's page on whitehouse.gov to the number one rank in a search of the phrase "miserable failure."
- The "Jew" Google bomb demoted an anti-Semitic Web site from the number one rank in a search of "Jew," and promoted the wikipedia.org definition of "Jew" to number one.
See: Clifford Tatum, 2005,

8 Link Spamming: Techniques
Link exchange services: Listings of (often unrelated) hyperlinks. To be listed, businesses have to provide a back link that enhances the PageRank of the exchange service.
Guestbooks, discussion boards, and weblogs: Automatic tools post large numbers of messages to many sites; each message contains a hyperlink to the target website.
Link farms: Densely connected arrays of pages. Farm pages propagate their PageRank to the target, e.g., by a funnel-shaped architecture that points directly or indirectly towards the target page. To camouflage link farms, tools fill in inconspicuous content, e.g., by copying news bulletins.

9 Search Engine Spam: Link Farms
The regular Web, W, with $n_w$ pages.
A link farm, F, with $n_f$ pages.
Link from W to F for crawler to find F.

10 Search Engine Spam: Link Farms
Consider the PageRank iteration formula
$$w_k = (1-d)\,w_0 + d\,B\,w_{k-1}$$
Assuming that all pages are crawled, the effect of the factor $(1-d)\,w_0$ is that the random jumps go to W and F in the ratio $n_w : n_f$. Since there are few links between W and F, the effect of $B$ is to assign PageRank within W and F respectively. Therefore the total PageRank is divided between W and F in the ratio $n_w : n_f$.
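The division of PageRank between W and F can be checked on a toy graph. The sketch below is an illustration under stated assumptions, not the lecture's implementation: the starting vector is taken to be uniform, B spreads each page's rank evenly over its out-links, every page has at least one out-link, and W and F are idealized as two rings with no links between them at all. The helper names `pagerank` and `ring` are hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate w_k = (1 - d) * w_0 + d * B * w_{k-1}.

    links maps each page to the list of pages it links to. w_0 is the
    uniform vector, and B spreads a page's rank evenly over its out-links.
    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    w0 = {p: 1.0 / n for p in pages}
    w = dict(w0)
    for _ in range(iterations):
        nxt = {p: (1 - d) * w0[p] for p in pages}   # random-jump term
        for p, outs in links.items():
            share = d * w[p] / len(outs)            # link-following term
            for q in outs:
                nxt[q] += share
        w = nxt
    return w


def ring(prefix, size):
    """Pages prefix0 .. prefix(size-1), each linking to the next one."""
    return {f"{prefix}{i}": [f"{prefix}{(i + 1) % size}"] for i in range(size)}


n_w, n_f = 30, 10
web = {**ring("w", n_w), **ring("f", n_f)}   # regular Web W plus link farm F
ranks = pagerank(web)
total_w = sum(r for p, r in ranks.items() if p.startswith("w"))
total_f = sum(r for p, r in ranks.items() if p.startswith("f"))
```

With no cross links, `total_w` and `total_f` come out as `n_w / (n_w + n_f)` and `n_f / (n_w + n_f)` up to floating-point rounding, so the farm's share of total rank depends only on how many pages it contains. Adding links from W to F would shift some rank toward the farm.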

11 Search Engine Spam: Link Farms
The manager of the link farm, F, can organize the links within the farm so that certain pages within the farm, $h_1, h_2, \ldots, h_k$, are highly ranked.
A manager who wants to give high rank to a page $w_0$ in W places links to $w_0$ from several of the pages $h_1, h_2, \ldots, h_k$. As a result, $w_0$ is linked to from several highly ranked pages and hence becomes highly ranked.
(In addition, $w_0$ could link back to F, thus returning rank to the farm.)
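A small sketch of this boosting effect, under assumptions that are not in the slides: W is a ring of 20 pages, the farm has 30 filler pages that all link to 3 hub pages, each hub links back to one filler, and the uniform random jump is used. The function `build_web` and the page names (`w0`, `h0`, `f0`, ...) are hypothetical; the target page is called `w0` to match the slide.

```python
def pagerank(links, d=0.85, iterations=100):
    """PageRank with a uniform random jump; every page needs an out-link."""
    pages = list(links)
    n = len(pages)
    w = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * w[p] / len(outs)
        w = nxt
    return w


def build_web(boost_target, n_w=20, n_fill=30, k=3):
    """Regular Web (a ring) plus a farm whose filler pages feed k hubs."""
    links = {f"w{i}": [f"w{(i + 1) % n_w}"] for i in range(n_w)}
    hubs = [f"h{i}" for i in range(k)]
    for i in range(n_fill):
        links[f"f{i}"] = list(hubs)      # fillers funnel rank into the hubs
    for i, hub in enumerate(hubs):
        links[hub] = [f"f{i}"]           # hubs return rank to the farm
        if boost_target:
            links[hub].append("w0")      # ...and also point at the target
    return links


before = pagerank(build_web(boost_target=False))
after = pagerank(build_web(boost_target=True))
```

In this toy graph the hubs rank well above the filler pages, and `after["w0"]` is larger than `before["w0"]` because the target now receives a share of each hub's rank. The rest of W is unchanged apart from the extra rank that flows onward from `w0`.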

12 Link Spamming: Defenses
Manual identification of spam pages and farms to create a blacklist.
Automatic classification of pages using machine learning techniques.
BadRank algorithm: The "bad rank" is initialized to a high value for blacklisted pages. It propagates bad rank to all referring pages (with a damping factor), thus penalizing pages that refer to spam.
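The slide gives only the idea behind BadRank, so the following is a sketch under an assumed formulation: bad rank flows backwards along links, from a page to the pages that refer to it, with each page's bad rank split evenly among its referrers and scaled by a damping factor `d`. The function name `bad_rank`, the seed value of 1.0, and the example page names are all hypothetical.

```python
def bad_rank(links, blacklist, d=0.85, iterations=50):
    """Propagate bad rank from blacklisted pages to the pages that link to them.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    in_degree = {p: 0 for p in pages}
    for outs in links.values():
        for q in outs:
            in_degree[q] += 1

    # Blacklisted pages start with a high value, all others with zero.
    seed = {p: (1.0 if p in blacklist else 0.0) for p in pages}
    bad = dict(seed)
    for _ in range(iterations):
        nxt = {}
        for p in pages:
            # A page inherits bad rank from every page it refers to.
            inherited = sum(bad[q] / in_degree[q] for q in links.get(p, []))
            nxt[p] = (1 - d) * seed[p] + d * inherited
        bad = nxt
    return bad


links = {
    "spam": ["spam2"],
    "spam2": ["spam"],
    "shill": ["spam", "news"],   # refers to a blacklisted page
    "news": ["blog"],
    "blog": ["news"],
}
bad = bad_rank(links, blacklist={"spam"})
```

Here `shill` picks up a positive bad rank because it links to the blacklisted page, while `news` and `blog` stay at zero because nothing they link to leads to spam.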

13 Search Engine Friendly Pages
Good ways to get your page indexed and ranked highly:
Use straightforward URLs, with simple structure, that do not change with time.
Submit your site to be crawled.
Provide a site map of the pages that you wish to be crawled.
Have the words that you would expect to see in queries:
- in the content of your pages.
- in and tags.
Attempt to have links to your page from appropriate authorities.
Avoid suspicious behavior.

14 Legal Issues in Web Searching
Copyright
In US law, the creator of a Web page (or the employer) owns the copyright, with a few exceptions.
Copyright gives the owner exclusive right to: reproduce, distribute, perform, display, or license others to reproduce, distribute, perform, or display.
Search engines operate under an untested legal concept of an implied license. The concept is to assume that somebody who puts a Web page online expects users to download it, read it, index it, etc., unless the copyright owner explicitly states otherwise.
Historically, Web companies have been cautious, but recently Google has been pushing the legal limits.

15 Economic Models for Content and Services on the Web
Mounting information on the Web or supplying services costs money. Who pays?
Open access
- Externally funded from other funds (standard model).
- Advertising (e.g., Web search).
Restricted access
- Subscription (e.g., journal publishers).
- Pay by use (rare).
Note that these same four models are used for television.

16

17 Information about Individuals
Advertising is most effective if it is tailored to the individual.
Portals, such as Yahoo or Google, have many ways of gaining information about users:
- identity tracked by cookie or login
- search terms used, pages retrieved, advertisements clicked
- use of other services, e.g., travel, shopping, maps
Data mining such information can provide valuable services, but raises serious concerns about privacy.

18 How many of these services collect information about the user?

19 Adding Audience Information to Ranking
Conventional information retrieval: A given query returns the same set of hits, ranked in the same sequence, irrespective of who submitted the query.
If the search service has information about the user: The results set and/or the ranking can be varied to match the user's profile.
Example: In an educational digital library, the order of search results can be varied for:
- instructor v. student
- grade level of course

20 Adding Audience Information to Ranking
Metadata-based methods:
- Label documents with controlled vocabulary to define intended audience.
- Provide users with means to specify their needs, through a profile (preferences), or by a query parameter.
Automatic methods:
- Capture persistent information about user behavior by data mining.
- Adjust tf.idf rankings using terms derived from terms previously used by the user.
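One way the last automatic method might look in practice is sketched below. It is an assumption-laden illustration, not the approach of any particular search service: terms from the user's earlier activity are added to the query at a reduced weight, and documents are scored with a plain tf.idf sum. The function names `tfidf_score` and `rank`, the `profile_weight` of 0.5, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_score(term_weights, doc_tokens, doc_freq, n_docs):
    """Sum of weight * tf * idf over the weighted query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term, weight in term_weights.items():
        if tf[term] and doc_freq.get(term):
            score += weight * tf[term] * math.log(n_docs / doc_freq[term])
    return score


def rank(query, docs, profile_terms=(), profile_weight=0.5):
    """Rank document names for a query, nudged by terms from the user's profile."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))

    # Profile terms count for less than the terms the user actually typed.
    weights = {t: profile_weight for t in profile_terms}
    weights.update({t: 1.0 for t in query.lower().split()})

    scores = {
        name: tfidf_score(weights, toks, doc_freq, len(docs))
        for name, toks in tokenized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "zoo": "jaguar cat habitat and jaguar diet",
    "cars": "jaguar car dealer and jaguar engine",
    "misc": "weather report for today",
}
for_driver = rank("jaguar", docs, profile_terms=["engine", "dealer"])
for_biologist = rank("jaguar", docs, profile_terms=["habitat"])
```

The same query, "jaguar", puts `cars` first for a user whose history mentions engines and dealers and `zoo` first for a user whose history mentions habitats, which is the kind of audience-dependent ordering the slide describes.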