Exercise 1: Bayes Theorem (a). Exercise 1: Bayes Theorem (b) P(b_1 | c_plain) = P(c_plain | b_1) * P(b_1) / P(c_plain)


1 Exercise 1: Bayes Theorem (a)

2 Exercise 1: Bayes Theorem (b) P(b_1 | c_plain) = P(c_plain | b_1) * P(b_1) / P(c_plain)

3 Exercise 1: Bayes Theorem P(b_1 | c_plain) = P(c_plain | b_1) * P(b_1) / P(c_plain)
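
The rule on this slide can be checked numerically with a small sketch. The numbers below are hypothetical and are not the values of the exercise: it is assumed that there are exactly two hypotheses, b_1 and b_2, so that P(c_plain) follows from the law of total probability. The function name `posterior_b1` is made up for illustration.

```python
def posterior_b1(prior_b1, likelihood_b1, likelihood_b2):
    """Return P(b_1 | c_plain) using Bayes' theorem.

    Assumes exactly two hypotheses b_1 and b_2 (an assumption of this
    sketch, not stated on the slides), so P(b_2) = 1 - P(b_1).
    """
    prior_b2 = 1.0 - prior_b1
    # Law of total probability:
    # P(c_plain) = P(c_plain | b_1) P(b_1) + P(c_plain | b_2) P(b_2)
    evidence = likelihood_b1 * prior_b1 + likelihood_b2 * prior_b2
    if evidence == 0:
        raise ValueError("P(c_plain) is zero; the posterior is undefined")
    # Bayes' theorem: P(b_1 | c_plain) = P(c_plain | b_1) P(b_1) / P(c_plain)
    return likelihood_b1 * prior_b1 / evidence


if __name__ == "__main__":
    # Hypothetical inputs: P(b_1) = 0.5, P(c_plain | b_1) = 0.75,
    # P(c_plain | b_2) = 0.5
    print(posterior_b1(0.5, 0.75, 0.5))
```

With these made-up inputs the evidence term is 0.625, and the posterior comes out at about 0.6, i.e. observing c_plain makes b_1 more likely than its prior of 0.5.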

4 Exercise 2: Bayes Theorem
d1: (Germany, win, soccer, worldcup, final, Brazil)
d2: (Germany, win, soccer, worldcup, final, Brazil, champion, defeat)
d3: (Germany, win, soccer, worldcup, final, Brazil, champion, defeat, Plasma, TV, sale, increase)

5 Exercise 2: Bayes Theorem

6 Exercise 3: Query expansion


8 Web Search – Summer Term 2006 III. Web Search - Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University

9 Recap: IR System & Tasks Involved
[Diagram components: information need, documents, user interface, query, query processing (parsing & term processing), logical view of the information need, select data for indexing, parsing & term processing, index, searching, ranking, results, docs., result representation, performance evaluation]

10 Information Retrieval (IR)
Main problem: Unstructured, imprecise, and imperfectly defined data
But also: The whole search process can be characterized as uncertain and vague
Hence: Information is often returned in the form of a sorted list (docs ranked by relevance).
[Diagram labels: information need, query, data / documents, information]
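
To make "a sorted list of docs ranked by relevance" concrete, here is a minimal sketch. The scoring function (number of query terms that occur in a document) is only a stand-in chosen for illustration, not a ranking method described on the slides, and the function name `rank_documents` and the sample documents are hypothetical.

```python
def rank_documents(query_terms, docs):
    """Return (doc_id, score) pairs sorted by descending score.

    The score is simply the number of distinct query terms found in the
    document: a toy relevance measure assumed for this sketch.
    """
    query = {term.lower() for term in query_terms}
    scored = []
    for doc_id, terms in docs.items():
        doc_terms = {term.lower() for term in terms}
        scored.append((doc_id, len(query & doc_terms)))
    # Highest score first; ties are broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    # Hypothetical mini collection of documents given as term lists.
    docs = {
        "a": ["soccer", "final", "win"],
        "b": ["soccer", "final", "champion", "defeat"],
        "c": ["TV", "sale", "increase"],
    }
    for doc_id, score in rank_documents(["soccer", "champion"], docs):
        print(doc_id, score)
```

Even this toy version shows why the result is a ranking and not a yes/no answer: documents match the query to different degrees, and the system cannot be certain which one the user actually needs.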

11 Classic IR vs. Web Search: Documents
Huge amount of data, continuous growth, high rate of change
Huge variability and heterogeneity
- Quality, credibility and reputation of the source
- Static vs. dynamic docs
- Different media types (text, pics, audio, video)
- Different formats (HTML, Flash, PDF,...)
- Miscellaneous topics
- Continuous text vs. note form / keywords
- Different languages, encodings
Spam and advertisements
Web-specific characteristics
- Hypertext, linking
- Broken links
- Unstructured, not always conforming to standards
Redundancy (syntactic and semantic)
Distributed (need to collect them automatically)
Different popularity and access frequency

12 Classic IR vs. Web Search: Users
Different needs and aims, e.g. users might want
- to learn something ("informational")
- to go to a particular site ("navigational")
- to do something, e.g. shopping, download,... ("transactional")
- to do other, miscellaneous things, e.g. finding hubs, "exploratory search",...
Different premises, qualifications, languages,...
Different network connections / bandwidths
Imprecise, unspecific queries
- Short, ambiguous, inexact, incorrect, no usage of operators or special syntax

13 IR vs. Web Search
Note: Most of this is true for IR as well, but...
- The web is huge. Very huge. / The no. of users is huge. Very huge.
- Big variety in data / Big variety in users
- Doc. authors don't cooperate (spam,...) / Users don't cooperate (short queries,...)
[Diagram labels: information need, query, data / documents, information]

14 How does web search work?
Problem: High commercial interest -> Commercial search engines don't tell exactly how they work
Nevertheless, much information is available:
- High scientific interest -> Publications
- Basic research is done (and published) by some companies (e.g. Google: labs.google.com/papers/)
- Hard-fought market -> well observed and documented (e.g. www.searchenginewatch.com)
- Many fan pages, "anti" fan pages, critical observers, web blogs, etc.

15 Schedule
Web Search:
- Introduction
- Crawling
- Page Repository
- Indexing
- Ranking (PageRank, HITS)
- Exercises for web search basics
- Advanced / additional web search topics
In parallel:
- Programming project (Lucene)

