Web Information Retrieval (Web IR)
Handout #1: Web Characteristics
Ali Mohammad Zareh Bidoki
ECE Department, Yazd University
Autumn 2011



Outline
- Web challenges
- SE & Web IR challenges
- Web structure (graph)
- Web characteristics
- Zipf's law

Web Challenges
- Huge size of information
  – 11.5 billion pages (2005)
  – 64 billion pages (5 June 2008)
- Proliferation and dynamic nature
  – New pages are created at a rate of 8% per week
  – Only 20% of the current pages will be accessible after one year
  – New links are created at a rate of 25% per week
- Heterogeneous content
  – HTML/Text/Audio/…
- The number of Web users is growing exponentially

Why has the Web succeeded?
- A distributed system
- A simple protocol
- Producing and generating content is very simple

Information Retrieval Definition
- IR deals with the representation, storage, organization of, and access to information items (relevant to a user query)
- Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents
- An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines.
- In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.

Web Retrieval
[Diagram labels: User Space, Information Space, Matching, Retrieval, Browsing, Index terms, Full text, Full text + Structure (e.g. hypertext), Search Engine]
- A search engine is an IR system!

IR vs Data Retrieval
- Data retrieval (DR) aims at retrieving all objects that satisfy clearly defined conditions, such as those in a regular expression
- DR does not solve the problem of retrieving information about a subject or object

Comparing IR to databases (vs data retrieval)

                      Databases                              IR
Data                  Structured                             Unstructured
Fields                Clear semantics (SSN, age)             No fields (other than text)
Queries               Defined (relational algebra, SQL)      Free text ("natural language"), Boolean
Query specification   Complete                               Incomplete
Matching              Exact (results are always "correct")   Imprecise (need to measure effectiveness)
Error response        Sensitive                              Insensitive
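The contrast between exact and imprecise matching can be illustrated with a small sketch. The toy records, the toy documents, and the term-overlap score are illustrative assumptions, not part of the handout; the point is only that a data-retrieval condition either holds or does not, while an IR query produces a ranked list of partial matches.

```python
import re

# Hypothetical structured records (data retrieval side).
RECORDS = [
    {"name": "Sara", "age": 31},
    {"name": "Omid", "age": 24},
    {"name": "Samira", "age": 45},
]

# Hypothetical unstructured documents (information retrieval side).
DOCS = {
    "d1": "web search engines rank pages by relevance",
    "d2": "python is a programming language",
    "d3": "search engines crawl and index the web",
}

def data_retrieval(pattern):
    """Return the names of all records that match the regular expression exactly."""
    rx = re.compile(pattern)
    return [r["name"] for r in RECORDS if rx.fullmatch(r["name"])]

def information_retrieval(query):
    """Rank documents by how many query terms they contain (a toy score)."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in DOCS.items():
        overlap = len(terms & set(text.split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Calling `data_retrieval("Sa.*")` returns exactly the records that satisfy the condition, whereas `information_retrieval("python web")` returns every document sharing at least one term with the query, ordered by the toy score.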

Main points in IR
- What is the definition of relevancy?
- Evaluation!
  – Subjective (unlike hardware or network evaluation)

Web IR (SE) Challenges (1)
- The definition of relevancy
- Connectivity alongside content on the Web
  – A huge graph
- Different types of queries
  – Narrow: needle in a haystack
  – Wide: overlapping with many areas
- Users have little patience: they commonly browse only the first ten results (i.e. one screen), hoping to find there the "right" document for their query

Web IR (SE) Challenges (2)
- Spamming phenomenon
  – It is crucial for business sites to be ranked highly by the major search engines.
  – Quite a few companies sell this kind of expertise (also known as "search engine optimization"); they actively research the ranking algorithms and heuristics of search engines and know how many keywords to place in a Web page, and where, to improve the page's ranking.
  – SEO books
- Content & connectivity spamming
- Anti-spamming solutions

Web IR (SE) Challenges (3)
- Rich-get-richer problem
  – It takes a long time for a young, high-quality web page to receive a ranking that reflects its quality
  – Unfairness
  – Web content grows in bad directions

Web IR (SE) Challenges (4)
- Crawling challenges
  – Huge size of information with dynamic nature
  – Freshness & coverage
    Google covers only 70% of the Web
  – A suitable scheduling policy
  – Hidden web (600 times bigger)
- Using meta search engines to increase coverage
  – Merging and ranking problem

Web IR (SE) Challenges (5)
- User evaluation is subjective and changes over time
  – Relevancy between a query and a document depends on the user and the time
  – Two users with the same query expect different results

Web IR (SE) Challenges (6)
- Query ambiguity
  – Python (one word, several meanings)
  – Car & automobile (different words, same meaning)

Web Dynamics
- For each page p and each visit, the following information is available:
  – The access time-stamp of the page: visit_p.
  – The last-modified time-stamp (given by most Web servers; about 80%-90% of the requests in practice): modified_p.
  – The text of the page, which can be compared to an older copy to detect changes, especially if modified_p is not provided.
- The following information can be estimated if the re-visiting period is short:
  – The time at which the page first appeared: created_p.
  – The time at which the page was no longer reachable: deleted_p.
- In all cases, the results are only an estimate of the actual values.
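Comparing the page text against an older copy can be sketched with a content fingerprint. The function names, the whitespace normalization, and the choice of SHA-256 are assumptions made for illustration; the handout does not say how the comparison is carried out.

```python
import hashlib

def fingerprint(text):
    """Return a SHA-256 digest of the page text with whitespace normalized."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(old_text, new_text, old_modified=None, new_modified=None):
    """Decide whether a page changed between two visits.

    Uses the last-modified time-stamps when both are available,
    and falls back to comparing content fingerprints otherwise.
    """
    if old_modified is not None and new_modified is not None:
        return new_modified > old_modified
    return fingerprint(old_text) != fingerprint(new_text)
```

Storing only the fingerprint of the previous copy, rather than the full text, keeps the per-page state small; the trade-off is that the crawler learns that a page changed but not what changed.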

Estimating freshness and age
- The probability u_p(t) that a copy of p is up-to-date at time t decreases with time if the page is not re-visited.
- When page changes are modeled as a Poisson process, if t units of time have passed since the last visit, then:
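For a Poisson process with rate lambda_p (changes per unit of time; the symbol is an assumption introduced here), the probability of no change during an interval of length t is exp(-lambda_p * t). This standard result gives the shape of the decay of u_p(t). A minimal sketch, with the function name chosen for illustration:

```python
import math

def prob_up_to_date(change_rate, elapsed):
    """Probability that a Poisson process with the given rate
    produced no event during `elapsed` units of time."""
    if change_rate < 0 or elapsed < 0:
        raise ValueError("change_rate and elapsed must be non-negative")
    return math.exp(-change_rate * elapsed)
```

A page that never changes (rate 0) stays up-to-date with probability 1, and for any positive rate the probability falls towards 0 as the time since the last visit grows.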

Characterization of Web page changes
- Age: visit_p - modified_p.
- Lifespan: deleted_p - created_p.
- Number of changes during the lifespan: changes_p.
- Average change interval: lifespan_p / changes_p.
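These quantities follow directly from the time-stamps a crawler records. A minimal sketch; the class name, the field layout, and the use of days as the time unit are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class PageHistory:
    """Time-stamps (in days, an assumed unit) estimated for a page p."""
    visit: float     # visit_p
    modified: float  # modified_p
    created: float   # created_p
    deleted: float   # deleted_p
    changes: int     # changes_p

    def age(self):
        """Age: visit_p - modified_p."""
        return self.visit - self.modified

    def lifespan(self):
        """Lifespan: deleted_p - created_p."""
        return self.deleted - self.created

    def average_change_interval(self):
        """lifespan_p / changes_p, or None if the page never changed."""
        if self.changes == 0:
            return None
        return self.lifespan() / self.changes
```

For example, a page created on day 10, deleted on day 210, and changed 20 times has a lifespan of 200 days and an average change interval of 10 days.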

Freshness & Age


The Web as a Scale-Free Network
- A scale-free network is characterized by a few highly linked nodes that act as "hubs" connecting several nodes to the network.
- Its degree distribution follows a power law.
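One common way to produce such hubs is preferential attachment, in which each new node links to existing nodes with probability proportional to their current degree. The sketch below uses that model purely as an illustration; the handout does not name a generative model, and the function names, parameters, and seed are assumptions.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, links_per_node=2, seed=0):
    """Grow a graph in which each new node links to existing nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a small fully connected core.
    core = links_per_node + 1
    edges = [(i, j) for i in range(core) for j in range(i)]
    # Each node appears in `endpoints` once per incident edge, so
    # uniform sampling from it is degree-proportional sampling.
    endpoints = [node for edge in edges for node in edge]
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges

def degree_counts(edges):
    """Map each node to its degree."""
    return Counter(node for edge in edges for node in edge)
```

Plotting the number of nodes per degree from `degree_counts` on log-log axes is the usual way to look for the straight line that a power law predicts.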

Random vs Scale-Free

Distribution of the Web Graph: Power Law

Power Law and Zipf's Law

Zipf's Law for Content
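Zipf's law says that the frequency of the r-th most common term is roughly proportional to 1/r. A minimal sketch for checking this on a piece of text; the tokenizer and the function name are assumptions, and a toy string is far too short for the pattern to show clearly.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, term, frequency, rank * frequency) rows,
    most frequent term first."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ordered, start=1)]
```

If the text follows Zipf's law, the last column (rank times frequency) stays roughly constant across ranks.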

Macroscopic Structure of the Web

User Sessions
- User sessions on the Web are usually characterized through models of random surfers
- The most widely used sources of data about users' browsing activities are the access log files of Web servers, proxies, and search engines
  – Caching
- Modeling user behavior
- Eye tracking
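Access logs record individual requests, so sessions usually have to be reconstructed from them. The sketch below starts a new session whenever the gap between a user's consecutive requests exceeds a timeout; the 30-minute value, the tuple layout of a log entry, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def split_sessions(log, timeout=1800):
    """Group (user, timestamp_seconds, url) log entries into sessions.

    A new session starts when a user's next request arrives more
    than `timeout` seconds after the previous one.
    """
    by_user = defaultdict(list)
    for user, ts, url in log:
        by_user[user].append((ts, url))

    sessions = []
    for user in sorted(by_user):
        current = []
        last_ts = None
        for ts, url in sorted(by_user[user]):
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions
```

Requests served from a browser or proxy cache never reach the server log, so sessions rebuilt this way can miss pages the user actually viewed.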

Next Lecture
- Information Retrieval Models
  – Boolean
  – Vector Space
  – Realistic