Search Engine and Optimization
Agenda
– Indexing Algorithms
– Latent Semantic Indexing
PROBLEM
Conventional IR methods depend on
– Boolean
– vector space
– probabilistic models
Handicap
– dependent on term matching
– not efficient for IR
Need to capture the concepts instead of only the words
– multiple terms can contribute to similar semantic meanings: synonymy, e.g. car and automobile
– one term may have various meanings depending on its context: polysemy, e.g. apple and computer
IR MODELS
THE IDEAL SEARCH ENGINE
Scope
– able to search every document on the Internet
Speed
– better hardware and programming
Currency
– frequent updates
Recall:
– always find every document relevant to our query
Precision:
– no irrelevant documents in our result set
Ranking:
– most relevant results come first
What Can a Simple Search Engine Not Differentiate?
Polysemy
– to monitor workflow vs. a monitor (screen)
Synonymy
– car, automobile
Singular and plural forms
– tree, trees
Words with similar roots
– different, differs, differed
HISTORY
Mathematical technique for information filtering
Developed at Bellcore Labs (Telcordia) at the end of the 1980s
30% more effective at filtering relevant documents than word-matching methods
A solution to “polysemy” and “synonymy”
Semantics
Syntax – the structure of words, phrases, and sentences
Semantics – the meaning of, and relationships among, words in a sentence
Extracting the important meaning from a given text document
Contextual meaning
LSI?
Concepts instead of words
Mathematical model
– relates documents and concepts
Looks for concepts in the documents
Stores them in a concept space
– related documents are connected to form a concept space
Does not need an exact match for the query
CONCEPTS
HOW DOES LSI WORK?
Given a set of documents, how do we determine the similar ones?
– examine the documents
– try to find concepts in common
– classify the documents
This is how LSI works as well: LSI represents terms and documents in a high-dimensional space, allowing relationships between terms and documents to be exploited during searching.
HOW TO OBTAIN A CONCEPT SPACE?
One possible way would be to find canonical representations of natural language
– a difficult task to achieve.
Much simpler: use the mathematical properties of the term-document matrix, i.e. determine the concepts by matrix computation.
TERM-DOCUMENT MATRIX
Query: Human-Computer Interaction
Dataset:
c1: Human machine interface for Lab ABC computer application
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relations of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
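As a rough sketch (not from the slides), a term-document count matrix for these nine titles could be built as follows in Python. The stopword list and the rule of keeping only terms that occur in at least two titles are illustrative assumptions; the deck's own matrix may use a different vocabulary.

# Sketch: build a term-document count matrix for the nine example titles.
# Assumption: keep lower-cased terms occurring in at least two titles,
# after removing a small stopword list; the slides' own matrix may differ.
from collections import Counter
import numpy as np

docs = {
    "c1": "Human machine interface for Lab ABC computer application",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relations of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    words = [w.strip(".,:;-").lower() for w in text.split()]
    return [w for w in words if w and w not in STOPWORDS]

doc_tokens = {d: tokenize(t) for d, t in docs.items()}

# document frequency of each term; keep terms appearing in >= 2 documents
df = Counter(term for toks in doc_tokens.values() for term in set(toks))
vocab = sorted(term for term, n in df.items() if n >= 2)

# term-document matrix W: rows = terms, columns = documents
W = np.array([[doc_tokens[d].count(term) for d in docs] for term in vocab])
print(vocab)
print(W)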
CONCEPT SPACE
Makes semantic comparison possible
Created by using matrices
Tries to detect the hidden similarities between documents
Ignores small similarities
Words with similar meanings
– occur close to each other
Dimensions: terms
Plotted points: documents
Each document is a vector
Reduce the dimensions
– Singular Value Decomposition
LSI Procedure
Obtain the term-document matrix.
Compute the SVD.
Truncate the SVD into the reduced k-dimensional LSI space.
– k-dimensional semantic structure
– similarity in the reduced space: term-term, term-document, document-document
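A minimal sketch of these three steps (not from the slides), using NumPy; the helper name lsi and the choice k = 2 are illustrative assumptions, and the random matrix stands in for the term-document matrix built earlier.

# Minimal LSI pipeline sketch: term-document matrix -> SVD -> rank-k truncation.
import numpy as np

def lsi(W, k):
    # SVD: W (M x N) = U (M x r) * diag(s) (r x r) * Vt (r x N), r = min(M, N)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the k largest singular values and the matching singular vectors.
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    term_coords = Uk @ Sk        # each row: a term in the k-dim concept space
    doc_coords = (Sk @ Vtk).T    # each row: a document in the k-dim concept space
    return Uk, Sk, Vtk, term_coords, doc_coords

# Example usage with a random stand-in so the snippet runs on its own
# (the W built from the nine example titles could be passed instead):
W = np.random.poisson(0.5, size=(12, 9)).astype(float)
Uk, Sk, Vtk, terms_2d, docs_2d = lsi(W, k=2)
print(docs_2d.round(2))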
Approaches to semantic analysis
Compositional semantics
– uses a parse tree to derive a hierarchical structure
– informational and intentional meaning
– rule based
Classification
– Bayesian approach
Statistical-algebraic approach (LSA)
Latent Semantic Analysis
LSA is a fully automatic statistical-algebraic technique for extracting and inferring relations of expected contextual usage of words in documents.
It uses no human-constructed dictionaries, knowledge bases, semantic networks, or grammars.
Takes raw text as input.
Building the Latent Semantic Space
Training corpus in the domain of interest
document
– a sentence, paragraph, or chapter
vocabulary size
– remove stopwords
Word-document co-occurrence
Given N documents and a vocabulary of size M,
generate a word-document co-occurrence matrix W of size M x N:
rows are indexed by the words w1, ..., wM and columns by the documents d1, ..., dN.
Discriminating words
Normalized entropy of each word
– close to 0: very important (concentrated in a few documents)
– close to 1: less important (spread evenly across documents)
Scaling and normalization
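A sketch of one common way to compute the normalized entropy and use it to weight W; the log(1 + count) local weight is a conventional choice and an assumption here, since the slides do not specify the exact scaling.

# Sketch: normalized word entropy and an entropy-based weighting of W (M x N).
import numpy as np

def normalized_entropy(W):
    counts = W.astype(float)
    totals = counts.sum(axis=1, keepdims=True)            # total occurrences of each word
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    return -plogp.sum(axis=1) / np.log(W.shape[1])         # in [0, 1]; 0 = very important

def entropy_weighted(W):
    H = normalized_entropy(W)
    # global weight (1 - H) per word, times a local log(1 + count) weight
    return (1.0 - H)[:, None] * np.log1p(W)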
Singular Value Decomposition
W = U S V^T
– rows of U correspond to the words u1, ..., uM
– columns of V^T correspond to the documents v1^T, ..., vN^T
– S is the diagonal matrix of singular values
SVD Approximation
Dimensionality reduction
– best rank-R approximation
– optimal energy preservation
– captures the major structural associations between words and documents
– removes ‘noisy’ observations
Computing SVD
Truncated SVD
Using the truncated SVD, the underlying latent structure is represented in a reduced k-dimensional space and noise in word usage is eliminated.
Create a rank-k approximation to A with k <= r_A (the rank of A):
A_k = U_k S_k V_k^T
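As a quick numerical illustration (assumptions: NumPy, a random stand-in matrix A, and a few trial values of k), the rank-k approximation retains the energy of the k largest singular values:

# Sketch: rank-k approximation A_k = U_k S_k V_k^T and how much "energy"
# (squared Frobenius norm) it retains; A and the values of k are illustrative.
import numpy as np

A = np.random.rand(12, 9)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (2, 4, 6):
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    energy = (s[:k] ** 2).sum() / (s ** 2).sum()
    err = np.linalg.norm(A - Ak, "fro")
    print(f"k={k}: retained energy {energy:.2%}, Frobenius error {err:.3f}")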
Query processing
Map the query into the reduced k-space:
q' = q^T U_k S_k^-1
Retrieve documents or terms within a proximity threshold
– cosine similarity
– best m matches
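A minimal sketch of the query projection and cosine ranking; the names Uk, Sk, and doc_coords refer back to the earlier pipeline sketch and are assumptions, not notation from the slides.

# Sketch: fold a query vector into the k-dim space and rank documents by cosine.
# Assumes Uk (M x k), Sk (k x k), doc_coords (N x k) from the earlier sketch,
# and q as an M-dim vector of query term counts over the same vocabulary.
import numpy as np

def project_query(q, Uk, Sk):
    return q @ Uk @ np.linalg.inv(Sk)          # q' = q^T U_k S_k^-1

def rank_documents(q, Uk, Sk, doc_coords, top_m=3):
    qk = project_query(q, Uk, Sk)
    sims = doc_coords @ qk / (
        np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(qk) + 1e-12
    )
    order = np.argsort(-sims)[:top_m]           # best m documents by cosine similarity
    return order, sims[order]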
Updating
Folding-in
– d' = d^T U_k S_k^-1
– similar to query projection
SVD re-computation
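A brief sketch of folding a new document into the existing space, under the same assumed names as above; note that folding-in leaves U_k and S_k unchanged, which is why periodic SVD re-computation remains necessary.

# Sketch: fold a new document vector d (M-dim term counts) into the k-dim space
# without recomputing the SVD; Uk and Sk are assumed from the earlier sketch.
import numpy as np

def fold_in_document(d, Uk, Sk):
    return d @ Uk @ np.linalg.inv(Sk)           # d' = d^T U_k S_k^-1

# doc_coords = np.vstack([doc_coords, fold_in_document(d_new, Uk, Sk)])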
APPLICATION AREAS
Information Retrieval
Information Filtering
Relevance Feedback
Cross-language Retrieval
Dynamic advertisements placed on pages (Google’s AdSense)
Improving the performance of search engines
– in ranking pages
Spam filtering for e-mails
Optimizing the link profile of your web page
Cross-language retrieval
Foreign-language translation
Automated essay grading
GOOGLE USES LSI
Increasing its weight in ranking pages
– a ~ sign before the search term stands for semantic search
– for “~phone”, the first link that appears is the page for “Nokia”, although the page does not contain the word “phone”
– for “~humor”, the retrieved pages contain its synonyms: comedy, jokes, funny
Google AdSense sandbox
– check which advertisements Google would put on your page
ANOTHER USAGE
Tried on the TOEFL exam
– a word is given; the most similar in meaning should be selected from four choices
– scored 65% correct
+ / -
Improves the efficiency of the retrieval process
Decreases the dimensionality of the vectors
Good for machine learning algorithms in which high dimensionality is a problem
Dimensions are more semantic
The newly reduced vectors are dense vectors
Saving memory is not guaranteed
Some words have several meanings, which makes retrieval confusing
Expensive to compute; complexity is high
How to measure success?
Assume there is a set of ‘correct answers’ to the query; the documents in this set are called relevant to the query.
The set of documents returned by the system is called the retrieved documents.
Precision: what percentage of the retrieved documents are relevant
Recall: what percentage of all relevant documents are retrieved
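A small sketch of the two measures as set operations; the document IDs below are made up for illustration.

# Sketch: precision and recall over sets of document IDs (illustrative values).
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"c1", "c3", "m2"}             # returned by the system
relevant = {"c1", "c2", "c3", "c5"}        # the 'correct answers'
print(precision(retrieved, relevant))       # 2/3 of retrieved documents are relevant
print(recall(retrieved, relevant))          # 2/4 of relevant documents were retrieved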