1 Searching through the Internet Dr. Eslam Al Maghayreh Computer Science Department Yarmouk University
2 Outline: Introduction; Information Retrieval; Indexing; Smarter Internet Searching; Examples
3 Introduction. The Internet holds an enormous quantity of information: billions of web pages and thousands of newsgroups. Two questions face any information seeker: (1) How can I find what I want? (2) How can I know that what I find is any good?
4 Information Retrieval. Goal: find the documents relevant to an information need within a large document set. Diagram: an information need is expressed as a query; the IR system runs retrieval over the document collection and returns an answer list.
5 Example: Google Web search
6 Search Engine. Consists of: (1) the interface you use to type in a query; (2) an index of Web sites that the query is matched against; and (3) a software program (called a spider or bot) that goes out on the Web and fetches new sites for the index.
7 IR problem. First applications: in libraries (1950s). Example library record with the fields ISBN, Author, Title, Publisher, Date and Content: Author: Salton, Gerard; Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer; Publisher: Addison-Wesley; Date: 1989. A document has external attributes and an internal attribute (its content). Search by external attributes = search in a database. IR = search by content.
8 Possible approaches: 1. String matching (linear search in documents): slow. 2. Indexing: fast, and flexible for further improvement.
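The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document collection, the whitespace tokenisation and the function names are all made up for the example, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy collection; IDs and texts are invented for illustration.
docs = {
    "d1": "computer architecture and computer networks",
    "d2": "network security basics",
    "d3": "modern architecture in amman",
}

def linear_search(term, docs):
    """Approach 1: scan every document for the term at query time."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

def build_index(docs):
    """Approach 2: build an inverted index (term -> set of document IDs) once."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def indexed_search(term, index):
    """Look the term up directly instead of rescanning the documents."""
    return sorted(index.get(term, set()))

index = build_index(docs)
print(linear_search("architecture", docs))    # ['d1', 'd3']
print(indexed_search("architecture", index))  # ['d1', 'd3']
```

Both functions return the same answer; the difference is that the linear scan reads the whole collection for every query, while the index pays that cost once when it is built.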
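9 Diagram of the retrieval process: documents are indexed into a document representation (the index); the query is turned into a query representation; a comparison function matches the two representations and produces the results.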
10 Main problems in IR. Query evaluation (the retrieval process): to what extent does a document correspond to a query? System evaluation: how good is a system? Are the retrieved documents relevant (precision)? Are all the relevant documents retrieved (recall)?
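The two evaluation questions correspond to two ratios: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, assuming the retrieved and relevant documents are given as lists of IDs (the IDs below are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p)  # 0.5 (2 of the 4 retrieved documents are relevant)
print(r)  # about 0.667 (2 of the 3 relevant documents were retrieved)
```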
11 Document indexing. Goal: find the important meanings and create an internal representation. Factors to consider: accuracy in representing meanings (semantics); exhaustiveness (covering all the contents); ease of manipulation by computer. What is the best representation of contents? Word: good coverage, not precise. Phrase: poor coverage, more precise. Concept: poor coverage, precise. Figure: word, phrase and concept placed on axes of coverage (recall) and accuracy (precision).
12 Keyword selection and weighting. How to select important keywords? A simple method: keep the middle-frequency words. Search engines usually disregard very common words (stop words) such as "the", "and" and "to".
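The simple method above might look like the following sketch. The stop-word list and the frequency thresholds `low` and `high` are assumptions chosen for the example; the slides do not specify either.

```python
from collections import Counter

# Assumed, deliberately tiny stop-word list; real systems use longer ones.
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "is"}

def select_keywords(text, low=2, high=3):
    """Keep words whose frequency lies in the middle band [low, high]."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return {w: c for w, c in counts.items() if low <= c <= high}

text = "the search the search the search the search engine engine index index index rare"
print(select_keywords(text))  # {'engine': 2, 'index': 3}
```

Here "the" is dropped as a stop word, "search" (4 occurrences) is too frequent, and "rare" (1 occurrence) is too infrequent, so only the middle-frequency words remain.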
13 Result of indexing. Each document is represented by a set of weighted keywords (terms): $D_1 \rightarrow \{(t_1, w_1), (t_2, w_2), \dots\}$, e.g. $D_1 \rightarrow \{(\text{comput}, 0.2), (\text{architect}, 0.3), \dots\}$ and $D_2 \rightarrow \{(\text{comput}, 0.1), (\text{network}, 0.5), \dots\}$.
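One way to produce such a set of weighted terms is sketched below. The weighting scheme is an assumption: the slide does not say how the weights are computed, so relative term frequency is used here purely for illustration, and the input is assumed to be already stemmed (as "comput" and "architect" suggest).

```python
from collections import Counter

def weighted_terms(tokens):
    """Map each term to its relative frequency in the document (assumed scheme)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

# Hypothetical, already-stemmed token list for one document.
d1 = weighted_terms(["comput", "architect", "comput", "design"])
print(d1)  # {'comput': 0.5, 'architect': 0.25, 'design': 0.25}
```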
14 Retrieval. The problems underlying retrieval (the retrieval model): How is a document represented with the selected keywords? How are document and query representations compared to calculate a score?
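15 Vector space model. Vector space = all the keywords encountered: $\langle t_1, t_2, \dots, t_n \rangle$. Document $D = \langle a_1, a_2, \dots, a_n \rangle$, where $a_i$ is the weight of $t_i$ in $D$. Query $Q = \langle b_1, b_2, \dots, b_n \rangle$, where $b_i$ is the weight of $t_i$ in $Q$. The retrieval score is $R(D,Q) = \mathrm{Sim}(D,Q)$.
16 Matrix representation. Rows are documents (document space) and columns are terms (term vector space); the query is one more row:
$$
\begin{array}{c|ccccc}
      & t_1    & t_2    & t_3    & \dots & t_n    \\ \hline
D_1   & a_{11} & a_{12} & a_{13} & \dots & a_{1n} \\
D_2   & a_{21} & a_{22} & a_{23} & \dots & a_{2n} \\
D_3   & a_{31} & a_{32} & a_{33} & \dots & a_{3n} \\
\dots &        &        &        &       &        \\
D_m   & a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn} \\ \hline
Q     & b_1    & b_2    & b_3    & \dots & b_n
\end{array}
$$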
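17 Some formulas for Sim, with $a_i$ the weight of $t_i$ in $D$ and $b_i$ the weight of $t_i$ in $Q$:
Dot product: $\mathrm{Sim}(D,Q) = \sum_i a_i b_i$
Cosine: $\mathrm{Sim}(D,Q) = \dfrac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$
Dice: $\mathrm{Sim}(D,Q) = \dfrac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}$
Jaccard: $\mathrm{Sim}(D,Q) = \dfrac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$
Figure: $D$ and $Q$ drawn as vectors in the plane of terms $t_1$ and $t_2$.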
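The four measures can be sketched in Python using their usual definitions for weighted vectors: the dot product is the sum of the products of matching weights, and the other three normalise it in different ways. The example vectors are hypothetical, and both vectors are assumed to be lists of equal length over the same term order.

```python
import math

def dot(d, q):
    """Sum of a_i * b_i over all terms."""
    return sum(a * b for a, b in zip(d, q))

def cosine(d, q):
    """Dot product divided by the product of the two vector lengths."""
    norm = math.sqrt(dot(d, d)) * math.sqrt(dot(q, q))
    return dot(d, q) / norm if norm else 0.0

def dice(d, q):
    """Twice the dot product over the sum of squared weights."""
    denom = dot(d, d) + dot(q, q)
    return 2 * dot(d, q) / denom if denom else 0.0

def jaccard(d, q):
    """Dot product over the squared weights minus the dot product."""
    denom = dot(d, d) + dot(q, q) - dot(d, q)
    return dot(d, q) / denom if denom else 0.0

# Hypothetical vectors over three terms t1, t2, t3.
D = [1.0, 0.0, 1.0]
Q = [1.0, 1.0, 0.0]
for sim in (dot, cosine, dice, jaccard):
    print(sim.__name__, round(sim(D, Q), 3))
```

For these vectors the dot product is 1, cosine and Dice both come out at 0.5, and Jaccard at about 0.333, which shows how the normalised measures keep the score between 0 and 1 for non-negative weights.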
18 (Classic) Presentation of results. The query evaluation result is a list of documents, sorted by their similarity to the query. E.g. doc1 0.67, doc2 0.65, doc3 0.54, …
19 IR on the Web. No stable document collection (spider, crawler); duplication; huge number of documents; multimedia documents; multilingual problem; …
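Producing such a ranked list amounts to scoring every document against the query and sorting by score. A minimal sketch, assuming documents are stored as term-to-weight dictionaries and using the dot product as the score; the query weights are hypothetical, and the two documents reuse only the weights quoted in the indexing example.

```python
def rank(docs, query):
    """Score each document by its dot product with the query, highest first."""
    scores = {
        doc_id: sum(w * query.get(term, 0.0) for term, w in terms.items())
        for doc_id, terms in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "D1": {"comput": 0.2, "architect": 0.3},
    "D2": {"comput": 0.1, "network": 0.5},
}
query = {"comput": 1.0, "network": 1.0}  # assumed query weights

for doc_id, score in rank(docs, query):
    print(doc_id, round(score, 2))
# D2 0.6
# D1 0.2
```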
20 Tips for smarter Internet searching. Use unique, specific terms. Use the minus operator (-) to narrow the search, e.g. yarmouk -university. Use quotation marks to match the consecutive words of a phrase, such as "flower arrangement". Enter a short question, such as "what time is it in amman?", "3.55* =", "who is the king of england?" or "what is the distance between the sun and earth".
21 Smarter Internet Searching. inurl: in the query inurl:test results, only test must be found in the web address (URL). allinurl: in the query allinurl:test results, both test AND results must be found in the web address. define: provides definitions of the words, gathered from various online sources, e.g. define: search engine.
22 Smarter Internet Searching. allintext: Sometimes you get pages that do not contain your search term or phrase. Why? Because Google also matches a page against the text of links that point to it. Use allintext to get only those pages that have your search terms in their own text.
23 Smarter Internet Searching. allinanchor: returns only pages whose incoming links contain your search terms in the link text, whether or not the terms appear in the pages themselves. This is the opposite of allintext. site: limits your search to a specific web site. Examples: students site:yu.edu.jo and students site:yu.edu.jo filetype:pdf.
24 Smarter Internet Searching. Don't use common words and punctuation in ordinary queries; include them only when searching for a specific phrase inside quotes. Most search engines do not distinguish between uppercase and lowercase. Make full use of AutoComplete suggestions.
25 Smarter Internet Searching. The wildcard operator (*): Google calls it the fill-in-the-blank operator. For example, amusement * will return pages with amusement and any other term(s) the Google search engine deems relevant. Using a wildcard (*) in place of characters within a word does not work in Google: cat* returns the same results as cat.
26 Smarter Internet Searching. Related sites: for example, related: can be used to find sites similar to the Yarmouk University site. Specific file type: for example, information retrieval filetype:ppt.
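The operators from these slides are plain text inside the query string, so a query can be assembled programmatically. The helper below is a hypothetical sketch (the function name and parameters are assumptions, not part of any search engine API); it only builds the string and a search URL, and does not send any request.

```python
from urllib.parse import quote_plus

def build_query(terms, site=None, filetype=None, exclude=()):
    """Combine search terms with the -, site: and filetype: operators."""
    parts = [terms]
    parts += [f"-{word}" for word in exclude]  # minus operator
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

q = build_query("students", site="yu.edu.jo", filetype="pdf")
print(q)  # students site:yu.edu.jo filetype:pdf
print("https://www.google.com/search?q=" + quote_plus(q))
print(build_query("yarmouk", exclude=["university"]))  # yarmouk -university
```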
27 Examples. Searching for papers: YU library, Google Scholar. Searching for instructor resources: Morgan Kaufmann, Pearson.
28 Examples. Searching for books to buy: Amazon.com, Ebay.com. Searching for items to buy: electronics at bestbuy.com. Searching for hotels: Expedia.com, Priceline.com, Booking.com.
29 Examples. Regional search: Google Jordan. Searching for images: Google Images. Searching for a job: Jobsinacademia.net, Academickeys.com.
The End. 30