
Algorithms for Information Retrieval Prologue. References Managing gigabytes A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999. A bunch of scientific.


1 Algorithms for Information Retrieval Prologue

2 References: Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999. Mining the Web: Discovering Knowledge from... S. Chakrabarti, Morgan-Kaufmann Publishers, 2003. A bunch of scientific papers available on the course site!

3 More than 85% of users arrive at a site from a search engine (SE). Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,... Toolbar searches: 49.6% Google, 46.1% Yahoo,... SEs have an impact on Web structure, knowledge and understanding, and social behavior... and on the market: 33% of users believe that “the results of a query are the best place to buy things”! Ads: 4B $ in USA, 2B€ in Europe, 180M€ in Italy. Paid search: 65% Google, 25% Yahoo, 8% MSN,... Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,... Much interest...

4 Goal of a Search Engine: retrieve docs that are “relevant” to the user query. Doc: Word or PDF file, web page, email, blog, e-book,... Query: “bag of words” paradigm. Relevant?!? We face many difficulties, especially on the Web!
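To make the “bag of words” query paradigm concrete, here is a minimal Python sketch. The tokenizer (lowercase, split on whitespace), the toy documents, and the scoring rule (count occurrences of query terms) are all illustrative assumptions, not the course's definition of relevance.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms (word order is discarded)."""
    return Counter(text.lower().split())

def score(query, doc):
    """Toy relevance score (an assumption): occurrences of query terms in the doc."""
    q, d = bag_of_words(query), bag_of_words(doc)
    return sum(d[term] for term in q)

# Hypothetical mini-collection, for illustration only.
docs = {
    "d1": "nikon coolpix camera review",
    "d2": "mars surface images from the rover",
    "d3": "cheap camera shop nikon and canon camera deals",
}

ranked = sorted(docs, key=lambda name: score("nikon camera", docs[name]), reverse=True)
print(ranked)  # ['d3', 'd1', 'd2']
```

Note that d3 outranks d1 only because it repeats “camera”, which hints at why pure term counting is easy to spam and why “relevant” is hard to define.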

5 Languages/Encodings. Hundreds of languages: 55 (Jul01). Home pages: in 1997, English 82%, the next 15 take 13%; in 2001, English 53%, the next 9 take 30%. Distributed authorship: millions of people creating pages with their own style… Not all have the purest motives in providing high-quality information; commercial motives drive “spamming”. Web is huge and heterogeneous. Extracting “significant data” is difficult!

6 Web is highly dynamic [154 sites, 2004]. A “good” coverage of the indexed Web is difficult! [Figure: values normalized with respect to the first week]

7 Web structure

8 User Queries. Query composition: short (2001: 2.54 terms on average; 80% have fewer than 3 terms); imprecise terms; 78% of the queries are not modified. Query results: 85% of users look at just one result page.

9 User Needs. Informational: want to learn about something (~40%), e.g. Asthma. Navigational: want to go to a page (~25%), e.g. Alitalia. Transactional: want to do something (~35%): access a service (NY weather), downloads (Mars surface images), shop (Nikon CoolPix).

10 Evolution of Search Engines. First generation (1995-1997: AltaVista, Excite, Lycos, etc.): use only on-page, web-text data; word frequency and language. Second generation (1998: Google, now everyone): use off-page, web-graph data; link (or connectivity) analysis; anchor text (how people refer to a page). Third generation (no winner yet!): answer “the need behind the query”; focus on the “user need” rather than on the query; integrate multiple data sources; click-through data; query mining. Fourth generation: Information Supply. [Andrei Broder, VP emerging search tech, Yahoo! Research]

11 What is a search engine, nowadays?

12 Size of search engines [2005] Google vs Yahoo: 20-30% sharing of results

13 Ranking: Google vs Yahoo!

14 Ranking: Google vs Google.cn

15 Not only Web searches... Clustering engines (Vivisimo, Snaket,...), suggestions, products, local searches, news, blogs,...

16 Directories. Deep web: Invisible-web.net, Completeplanet, Resource Discovery Network

17 “Vertical” search engines

18 About this course. This course is a mix of: smart algorithms & data structures; data compression; IR tools (data projection, clustering,...)

19 Massive Data. Nature 2/06 issue highlights trends in the sciences: “2020 – Future of computing”. Exponential growth of scientific data, due to e.g. large experiments, sensor networks, etc. Nano-tech provides further opportunities. Paradigm shift: science will be about mining data; computer science is paramount in all sciences. March 2006

20 Algorithm Inadequacy. Importance of scalability/efficiency → algorithmics is a core computer science area. Traditional algorithmics: transform input to output using a simple machine model. Communities addressing its inadequacies have emerged. You should be space/IO-aware programmers.

21 I/O-conscious Algorithms. Disk access is 10^6 times slower than main memory access. Store/access data taking advantage of blocks. I/O-efficient algorithms: move as few disk blocks as possible to solve the given problem; access close blocks to reduce the seek time. “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
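The advice to “move as few disk blocks as possible” can be illustrated by counting block transfers instead of CPU steps. The sketch below is a simplified model under stated assumptions: `BlockedDisk` is a hypothetical class, memory holds only one block at a time, and seek time is ignored. It compares reading the same items in sequential order and in shuffled order.

```python
import random

class BlockedDisk:
    """Hypothetical disk model: items live in blocks of `block_size`;
    only the most recently read block is kept in memory."""

    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size
        self.cached_block = None
        self.block_reads = 0

    def read(self, i):
        block = i // self.block_size
        if block != self.cached_block:  # block not in memory: one more I/O
            self.cached_block = block
            self.block_reads += 1
        return self.data[i]

def count_ios(order, n, block_size):
    """Read every index in `order`; return (block transfers, sum of items)."""
    disk = BlockedDisk(list(range(n)), block_size)
    total = sum(disk.read(i) for i in order)
    return disk.block_reads, total

n, B = 10_000, 100
sequential = list(range(n))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

seq_ios, seq_sum = count_ios(sequential, n, B)
rnd_ios, rnd_sum = count_ios(shuffled, n, B)
print(seq_ios, rnd_ios)  # the sequential scan needs exactly n / B = 100 block reads
```

Both orders compute the same sum, but the shuffled order triggers a block read on almost every access, which is the gap I/O-efficient algorithms try to close.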

22 Streaming Algorithms. Data arrive continuously, or we wish to make FEW scans. Streaming algorithms: use few scans; handle each element fast; use small space.
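As one possible instance of “one scan, small space”, here is a sketch of the Misra-Gries frequent-items algorithm. The slide does not name it, so this is an illustrative choice. It keeps at most k-1 counters and returns candidates for items occurring more than n/k times in a stream of n items. The toy query stream is made up.

```python
def misra_gries(stream, k):
    """Single pass with at most k-1 counters. Every item occurring more than
    len(stream)/k times survives as a key; other keys may be false positives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of search-engine names; "google" is a strict majority (4 of 7).
queries = ["google", "yahoo", "google", "msn", "google", "ask", "google"]
candidates = misra_gries(queries, k=2)
print(candidates)  # {'google': 1}
```

The counter values are lower bounds rather than exact frequencies; a second scan would be needed to verify the candidates.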

23 Cache-Oblivious Algorithms. Unknown and/or changing devices. Block access is important on all levels of the memory hierarchy, but memory hierarchies are very diverse. Cache-oblivious algorithms: explicitly, the algorithms do not assume any model parameters; implicitly, they use blocks efficiently on all memory levels.
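A classic example of this idea is recursive matrix transposition: the code never mentions a block or cache size, yet the recursion eventually reaches sub-matrices that fit in whatever block size each memory level uses. The sketch below shows only the recursive structure. Python lists of lists are not laid out contiguously, so it will not exhibit real cache behavior, and the 2x2 base case is an arbitrary choice.

```python
def transpose(a, b, r0, r1, c0, c1):
    """Write the transpose of the sub-matrix a[r0:r1][c0:c1] into b,
    halving the longer side until the piece is tiny."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 2 and cols <= 2:  # small base case, not tuned to any cache
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose(a, b, r0, mid, c0, c1)
        transpose(a, b, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        transpose(a, b, r0, r1, c0, mid)
        transpose(a, b, r0, r1, mid, c1)

a = [[i * 4 + j for j in range(4)] for i in range(3)]  # 3 x 4 input
b = [[None] * 3 for _ in range(4)]                     # 4 x 3 output
transpose(a, b, 0, 3, 0, 4)
print(b)
```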

