Published by Rodger Payne. Modified over 9 years ago.
1
Autumn 2011
Web Information Retrieval (Web IR)
Handout #1: Web Characteristics
Ali Mohammad Zareh Bidoki
ECE Department, Yazd University
alizareh@yaduni.ac.ir
2
Outline
– Web challenges
– SE & Web IR challenges
– Web structure (graph)
– Web characteristics
– Zipf's law
3
Web Challenges
– Huge size of information
  – 11.5 billion pages (2005)
  – 64 billion pages (5 June 2008)
– Proliferation and dynamic nature
  – New pages are created at a rate of 8% per week
  – Only 20% of the current pages will still be accessible after one year
  – New links are created at a rate of 25% per week
– Heterogeneous content
  – HTML, text, audio, …
– The number of Web users is growing exponentially
4
Why has the Web succeeded?
– It is a distributed system
– It uses a simple protocol
– Producing and publishing content is very simple
5
Information Retrieval: Definition
– IR deals with the representation, storage, organization of, and access to information items (relevant to a user query).
– Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents.
– An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines.
– In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevance.
6
Web Retrieval
[Figure labels: User Space, Information Space, Matching, Retrieval, Browsing, Index terms, Full text, Full text + Structure (e.g. hypertext), Search Engine]
A search engine is an IR system!
7
IR vs Data Retrieval
– Data retrieval (DR) aims to retrieve all objects that satisfy clearly defined conditions, such as those expressed by a regular expression.
– DR does not solve the problem of retrieving information about a subject or topic.
8
Comparing IR to databases (IR vs data retrieval)

                    | Databases                             | IR
Data                | Structured                            | Unstructured
Fields              | Clear semantics (SSN, age)            | No fields (other than text)
Queries             | Defined (relational algebra, SQL)     | Free text ("natural language"), Boolean
Query specification | Complete                              | Incomplete
Matching            | Exact (results are always "correct")  | Imprecise (need to measure effectiveness)
Error response      | Sensitive                             | Insensitive
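The contrast between exact matching and imprecise matching can be sketched in Python. This is only an illustration and not part of the handout: the records, the documents, and the term-overlap score are all hypothetical, and real IR systems use much richer ranking functions.

```python
# Data retrieval: an exact condition over structured fields.
records = [
    {"name": "Ali", "age": 31},
    {"name": "Sara", "age": 25},
    {"name": "Reza", "age": 42},
]


def data_retrieval(rows, min_age):
    """Return every row that satisfies the condition; each result is simply correct."""
    return [row for row in rows if row["age"] >= min_age]


# Information retrieval: a free-text query over unstructured text, ranked by a score.
documents = {
    "d1": "python is a programming language",
    "d2": "the python is a large snake",
    "d3": "web search engines rank pages",
}


def information_retrieval(docs, query):
    """Rank documents by the number of query terms they contain (a toy relevance score)."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((doc_id, overlap))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

The database-style call returns a set of rows that are all equally "correct", while the IR-style call returns a ranked list whose quality still has to be evaluated against what the user actually wanted.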
9
Main points in IR
– What is the definition of relevance?
– Evaluation!
  – Subjective (unlike the evaluation of hardware or networks)
10
Web IR (SE) Challenges (1)
– The definition of relevance
– Connectivity alongside content on the Web
  – A huge graph
– Different types of queries
  – Narrow: a needle in a haystack
  – Wide: overlapping with many areas
– Users have little patience: they commonly browse through only the first ten results (i.e. one screen), hoping to find there the "right" document for their query
11
Web IR (SE) Challenges (2)
– Spamming phenomenon
  – It is crucial for business sites to be ranked highly by the major search engines.
  – Quite a few companies sell this kind of expertise (also known as "search engine optimization"). They actively research the ranking algorithms and heuristics of search engines, and know how many keywords to place in a Web page (and where) to improve the page's ranking.
  – SEO books
– Content & connectivity spamming
– Anti-spamming solutions
12
Web IR (SE) Challenges (3)
– Rich-get-richer problem
  – It takes a long time for a young, high-quality web page to receive a ranking that reflects its quality
  – Unfairness
  – Pushes the growth of Web content in bad directions
13
Web IR (SE) Challenges (4)
– Crawling challenges
  – Huge size of information with dynamic nature
  – Freshness & coverage
    – Google covers only 70% of the Web
  – A suitable scheduling policy
  – Hidden web (600 times bigger)
– Using meta search engines to increase coverage
  – Merging and ranking problem
14
Web IR (SE) Challenges (5)
– User evaluation is subjective and changes over time
  – The relevance of a document to a query depends on the user and the time
  – Two users with the same query expect different results
15
Web IR (SE) Challenges (6)
– Query ambiguity
  – One word, several meanings: Python
  – Several words, one meaning: car & automobile
16
Web Dynamics
For each page p and each visit, the following information is available:
– The access time-stamp of the page: visit_p.
– The last-modified time-stamp: modified_p (given by most Web servers; about 80%-90% of the requests in practice).
– The text of the page, which can be compared to an older copy to detect changes, especially if modified_p is not provided.
The following information can be estimated if the re-visiting period is short:
– The time at which the page first appeared: created_p.
– The time at which the page was no longer reachable: deleted_p.
In all cases, the results are only an estimate of the actual values.
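The change-detection idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the `Visit` record, its field names, and the choice of SHA-256 for comparing copies are hypothetical and not taken from the handout.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Visit:
    visit_time: float               # access time-stamp of this visit
    modified_time: Optional[float]  # last-modified time-stamp, None if the server gave none
    text: str                       # page text fetched on this visit


def digest(text: str) -> str:
    """Fingerprint the page text so that two copies can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(previous: Visit, current: Visit) -> bool:
    """Decide whether the page changed between two visits."""
    if previous.modified_time is not None and current.modified_time is not None:
        # Both visits carry a last-modified time-stamp: trust the server.
        return current.modified_time > previous.modified_time
    # Otherwise fall back to comparing the stored copy with the new one.
    return digest(previous.text) != digest(current.text)
```

Either way the answer is only an estimate: a page may have changed several times between two visits, and a server-supplied time-stamp can be wrong.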
17
Estimating freshness and age
The probability that a copy of p is up-to-date at time t, u_p(t), decreases with time if the page is not re-visited. When page changes are modeled as a Poisson process with change rate $\lambda_p$, if t units of time have passed since the last visit, then:
$u_p(t) = e^{-\lambda_p t}$
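A short numeric sketch of this estimate: for a Poisson process with rate λ, the probability of no event in an interval of length t is e^(−λt), so a copy that was fresh at the last visit is still up-to-date with that probability. The function and parameter names are hypothetical, and the change rate would have to be estimated from earlier visits.

```python
import math


def prob_up_to_date(change_rate: float, elapsed: float) -> float:
    """Probability that a copy is still up-to-date `elapsed` time units after the last visit.

    Assumes page changes follow a Poisson process with `change_rate` changes
    per time unit and that the copy was fresh at the last visit.
    """
    if change_rate < 0 or elapsed < 0:
        raise ValueError("change_rate and elapsed must be non-negative")
    return math.exp(-change_rate * elapsed)


# Example: a page that changes on average once every 10 days (rate 0.1 per day).
for days in (0, 1, 7, 30):
    print(days, round(prob_up_to_date(0.1, days), 3))
```

The probability starts at 1 immediately after a visit and decays towards 0, which is why pages that change quickly need to be re-visited more often.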
18
Characterization of Web page changes
– Age: visit_p − modified_p.
– Lifespan: deleted_p − created_p.
– Number of changes during the lifespan: changes_p.
– Average change interval: lifespan_p / changes_p.
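These four quantities translate directly into code. The sketch below is illustrative only: the `PageHistory` class and its field names are hypothetical, and all time-stamps are assumed to be in the same unit (for example, days).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageHistory:
    created: float    # time the page first appeared
    deleted: float    # time the page was no longer reachable
    modified: float   # last-modified time-stamp seen at the latest visit
    visit: float      # time-stamp of the latest visit
    changes: int      # number of changes observed during the lifespan

    @property
    def age(self) -> float:
        """Age: time between the last modification and the visit."""
        return self.visit - self.modified

    @property
    def lifespan(self) -> float:
        """Lifespan: time between creation and deletion."""
        return self.deleted - self.created

    @property
    def average_change_interval(self) -> Optional[float]:
        """Lifespan divided by the number of changes; None if no change was observed."""
        if self.changes == 0:
            return None
        return self.lifespan / self.changes
```

For example, a page created at day 0, deleted at day 100, last modified at day 60, visited at day 70, and changed 4 times has an age of 10 days, a lifespan of 100 days, and an average change interval of 25 days.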
19
Freshness & Age
21
The Web: a Scale-Free Network
– A scale-free network is characterized by a few highly linked nodes that act as "hubs" connecting several nodes to the network.
– Its degree distribution follows a power law.
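The slide does not say how such hubs arise. One commonly cited mechanism is preferential attachment (the Barabási-Albert model), swapped in here purely for illustration: each new node links to an existing node chosen with probability proportional to that node's current degree. The function name, the single link per new node, and the seed are assumptions of this sketch.

```python
import random
from collections import Counter


def preferential_attachment(num_nodes: int, seed: int = 0) -> Counter:
    """Grow a graph one node at a time and return a node -> degree mapping.

    `endpoints` lists every node once per incident edge, so a uniform draw
    from it picks a node with probability proportional to its degree.
    """
    rng = random.Random(seed)
    endpoints = [0, 1]  # start from a single edge between nodes 0 and 1
    for new_node in range(2, num_nodes):
        target = rng.choice(endpoints)
        endpoints.extend([new_node, target])  # record the new edge
    return Counter(endpoints)


degrees = preferential_attachment(2000)
mean_degree = sum(degrees.values()) / len(degrees)
print("mean degree:", round(mean_degree, 2))
print("largest degrees:", [d for _, d in degrees.most_common(5)])
```

Most nodes end up with one or two links while a handful of early nodes collect many, which is the "few hubs" picture described above.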
22
Random vs Scale-Free
23
Distribution of Web Graph: Power Law
24
Power Law and Zipf's Law
25
Zipf's Law for Content
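Zipf's law says that the frequency of a word is roughly inversely proportional to its rank in the frequency table, so rank × frequency stays roughly constant on a large corpus. The sketch below builds that table for any text; the function name and the simple tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def rank_frequency(text: str):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


for row in rank_frequency("the cat and the dog and the bird"):
    print(row)
```

A toy sentence like this one is far too short to show the law; the last column only becomes roughly constant when the function is run on a large collection of pages.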
26
Macroscopic Structure of the Web
27
User Sessions
– User sessions on the Web are usually characterized through models of random surfers
– The most commonly used sources of data about users' browsing activity are the access log files of Web servers, proxies, and search engines
  – Caching
– Modeling user behavior
– Eye tracking
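A random-surfer model can be sketched as a simple simulation. This is one common formulation and not necessarily the one used in the lecture: at each step the surfer follows a random out-link with probability `damping`, and otherwise jumps to a random page. The graph, the function name, and the value 0.85 are assumptions of this sketch.

```python
import random
from collections import Counter


def random_surfer(links, steps, damping=0.85, seed=0):
    """Simulate a random surfer and return the fraction of steps spent on each page.

    `links` maps every page to the list of pages it links to; every link
    target must itself be a key of `links`.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        out_links = links[current]
        if out_links and rng.random() < damping:
            current = rng.choice(out_links)  # follow a link on the current page
        else:
            current = rng.choice(pages)      # get bored and jump to a random page
    return {page: visits[page] / steps for page in pages}


# Hypothetical four-page site: nothing links to "d", so it is reached only by random jumps.
site = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(random_surfer(site, steps=20000))
```

Pages with more incoming paths are visited more often, which is the intuition behind using such models to characterize sessions and to rank pages.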
28
Next Lecture
– Information Retrieval Models
  – Boolean
  – Vector Space
  – Realistic