Published by Maria Fisher. Modified over 9 years ago.
1
Smart Seek
CS 8803 AIAD (Spring 2008) Project, Group #22
Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain
2
Outline
Problem background and motivation
Project goals
System architecture
The social network tool
  Facebook app
  Gathering user data
The dictionary
  Gathering documents and data
  Building the lexicon
  LSI
The indexer and search
  Building the index
  Servicing search requests
3
Background and Motivation
Traditional search
  Keyword based: not optimal
  Low recall, high precision
  The burden is on the user to formulate a query effectively
Enhancements
  Automate query reformulation using:
    relevance feedback from previous searches
    semantic meaning extraction to aid search
4
Goals
Demonstrate the usability of semantic search concepts
Use social networking data to develop a prototype implementation
Make the search framework generic
Explore what makes a good lexicon/dictionary for focused search requirements
5
Architecture
6
Detailed Architecture
[Diagram: Facebook Server; Facebook Application (PHP); XML; Parser; Dictionary lookup (WordNet + LSI); Lucene Search & Index (Java). Arrows are labeled "Query terms" and "Query results".]
7
Search Front End
[Screenshot]
8
Gathering Facebook User Data
Users who add the application allow storage of their profile information
Use ‘profile_update_time’ to limit updates
For search over friends’ profiles, the data is cached temporarily if the friend is not already a registered user
Facebook places privacy restrictions on the storage of private data
Workaround: server-side cron jobs periodically update the database if a profile has been updated
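The update-limiting idea on this slide can be sketched as follows. This is a minimal illustration, not the project's code: the `Profile` record, the in-memory `stored` dict standing in for the database, and the two callables standing in for Facebook API calls are all hypothetical. Only the `profile_update_time` field name comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical stored profile record; field layout is illustrative."""
    user_id: int
    profile_update_time: int  # timestamp of the last profile change
    data: dict


def refresh_profiles(stored, fetch_update_time, fetch_profile):
    """Re-fetch only the profiles whose update time has advanced.

    stored: dict mapping user_id -> Profile (stands in for the database).
    fetch_update_time, fetch_profile: callables standing in for API calls.
    Returns the list of user ids that were refreshed.
    """
    refreshed = []
    for user_id, profile in stored.items():
        latest = fetch_update_time(user_id)
        if latest > profile.profile_update_time:
            # Profile changed since we last stored it: pull the new data.
            profile.data = fetch_profile(user_id)
            profile.profile_update_time = latest
            refreshed.append(user_id)
    return refreshed
```

A periodic job (cron, in the slide's workaround) would call `refresh_profiles` so that unchanged profiles cost one cheap timestamp lookup instead of a full fetch.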
9
Dictionary
Rationale
  Goal: find the web pages that are semantically related to a given query.
  Solution: add semantically related keywords to the query.
  The dictionary serves as the pool of words from which we extract the semantically related words.
Approach
  To determine the semantic relation between pairs of terms, we need to analyze a very large number of documents.
  Given a collection of documents, we first preprocess the documents to remove noise.
  We then parse the documents and extract those keywords whose occurrence count is greater than some threshold.
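The threshold step in the approach can be illustrated with a short sketch. The tokenization rule (lowercased alphabetic runs) and the function name `build_dictionary` are assumptions for illustration; the slides do not say how documents were parsed or which threshold was used.

```python
import re
from collections import Counter


def build_dictionary(documents, min_count):
    """Return the set of words that occur more than min_count times in total."""
    counts = Counter()
    for text in documents:
        # Lowercase and keep alphabetic tokens only; a crude stand-in for parsing.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, n in counts.items() if n > min_count}
```

For example, over the three toy documents "Music and sports", "music in Atlanta", and "Sports music" with `min_count=1`, only "music" and "sports" survive.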
10
LSI
Latent Semantic Indexing (LSI) is a method for calculating a relatedness score for each pair of terms.
Each document can be parsed into a vector.
LSI determines an orthonormal basis for the document space.
Let U denote this orthonormal basis.
The relatedness score can then be calculated as U*U’.
The semantic relatedness is in effect derived from the co-occurrence of pairs of terms.
Our dictionary contains 10,775 terms, and we crawled 7,142 documents.
The entire term-document matrix is computed using sparse matrix operations.
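One common way to read the U*U’ score is via a truncated singular value decomposition of the term-document matrix. The formulation below is a sketch under that assumption; the rank k and the choice not to weight by singular values are not stated on the slide. Here A is the m-by-n term-document matrix, with m = 10,775 terms and n = 7,142 documents.

```latex
A \approx U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},
\qquad U_k^{\top} U_k = I_k

R = U_k U_k^{\top} \in \mathbb{R}^{m \times m},
\qquad R_{ij} = \sum_{l=1}^{k} (U_k)_{il} \, (U_k)_{jl}
```

Under this reading, R_{ij} is the relatedness score between terms i and j: two terms score highly when their rows of U_k point in similar directions, which happens when they tend to occur in the same documents.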
11
Gathering Data
Crawling was done using WebSphinx, an open source crawler (http://www.cs.cmu.edu/~rcm/websphinx)
We crawled around 10,000 pages from blogs and other social media to build the dictionary
Pages were crawled mainly from these sites:
  http://en.wikipedia.org
  http://directory.yahoo.com
  http://www.blogspot.com
Crawled data was filtered to remove noise such as Unicode characters, tags, and other non-text material.
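A noise filter of the kind described might look like the sketch below. It is an assumption-laden illustration rather than the project's filter: it treats "Unicode characters" as meaning non-ASCII characters, and it uses regular expressions instead of a full HTML parser. The function name `clean_page` is hypothetical.

```python
import html
import re


def clean_page(raw_html):
    """Reduce a crawled HTML page to plain ASCII text."""
    # Drop script and style blocks entirely, including their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Remove remaining tags but keep the text between them.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; before filtering characters.
    text = html.unescape(text)
    # Discard non-ASCII characters (one reading of "Unicode characters" as noise).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Regex-based tag stripping is fragile on malformed markup; it is used here only to keep the sketch dependency-free.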
12
WordNet
WordNet is a lexical database of the English language developed at Princeton University
WordNet provides “synsets”, which are sets of semantically related words (synonyms)
We used WordNet to derive semantically related words
We aggregate the related words obtained from WordNet and from our dictionary
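The aggregation step could be sketched as below. The two lookup tables are hypothetical stand-ins: in the project they would be backed by WordNet and by the LSI-derived dictionary, and the per-source `limit` is an assumption, not something the slides specify.

```python
def expand_query(terms, wordnet_related, lsi_related, limit=3):
    """Append up to `limit` related words per source to each query term.

    wordnet_related, lsi_related: dicts mapping a term to a ranked list of
    related words; hypothetical stand-ins for the two lookups.
    """
    expanded = []
    seen = set()
    for term in terms:
        candidates = [term]
        candidates += wordnet_related.get(term, [])[:limit]
        candidates += lsi_related.get(term, [])[:limit]
        for word in candidates:
            if word not in seen:  # keep first occurrence, preserve order
                seen.add(word)
                expanded.append(word)
    return expanded
```

The original terms always come first, so a downstream search can still give them more weight than the added words.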
13
Indexing & Search
Lucene API:
  Lucene is a software library concerned with text indexing and searching.
  It is NOT a ready-to-use application like a file-search program, a web crawler, or a web site search engine.
14
Indexing & Search
Indexing breaks down into three main operations:
  Converting data to text
  Analyzing/stemming
  Saving it to the index (an inverted index)
Searching:
  Parsing the query
  Analyzing the query
  Searching the inverted index
Updating the indexes on a regular basis: a Document must first be deleted from an index and then re-added to it.
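The operations on this slide can be illustrated with a toy inverted index. This sketch does not use Lucene and its class and method names are made up for illustration; it only shows the data structure and the delete-then-re-add update pattern the slide describes. Analysis here is just lowercasing and tokenizing, with no stemming.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_terms = {}  # doc_id -> set of terms, needed for deletion

    @staticmethod
    def analyze(text):
        # Minimal analysis: lowercase and split on non-alphanumerics.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        terms = self.analyze(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        # Mirrors the slide: delete the document, then re-add it.
        self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, query):
        # Return ids of documents containing every analyzed query term.
        terms = self.analyze(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

Keeping `doc_terms` alongside the postings is what makes deletion cheap; without it, removing a document would require scanning every posting list.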
15
Indexing & Search
Some properties of Lucene utilised for semantic search:
  Analyzer: eliminates stop words and stores words in their base form.
  Keyword: a field that is not analyzed (stemmed) but is indexed.
  Updating the indexes on a regular basis (a Document is deleted from an index and then re-added to it).
Search facility extended:
  * Keyword1 AND/OR Keyword2
  * +Keyword1 -Keyword2
(Extended) ranking formula for results:
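Separately from the ranking formula, the analyzer behaviour and the +/- query operators listed on this slide can be sketched in plain Python. This is an illustration of the concepts only and does not call Lucene: the stop-word list and the suffix-stripping `stem` function are crude, hypothetical stand-ins for a real analyzer, and `matches` is an assumed helper name.

```python
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative list


def stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stop words, and reduce words to a base form."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def matches(query, doc_text):
    """Evaluate a query of +required, -prohibited, and optional terms."""
    doc_terms = set(analyze(doc_text))
    required, prohibited, optional = [], [], []
    for token in query.split():
        if token.startswith("+"):
            bucket = required
        elif token.startswith("-"):
            bucket = prohibited
        else:
            bucket = optional
        bucket.extend(analyze(token.lstrip("+-")))
    if any(t not in doc_terms for t in required):
        return False
    if any(t in doc_terms for t in prohibited):
        return False
    # With no required terms, at least one optional term must match.
    return bool(required) or any(t in doc_terms for t in optional)
```

Because the same `analyze` function is applied to both the document and the query, a query for "sports" matches a document containing "sport", which is the point of storing words in their base form.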
16
Indexing & Searching: Implementation
[Diagram labels. Indexing path: Corpus, Picks Document, Add Fields, Adds to Index, Indexed Files. Search path: Query/Search word, Analyzer/parser, Index Search, Document Ids.]
17
Search & Tradeoffs
Search by field name (space tradeoff)
Referencing the words before and after the search keyword (speed tradeoff)
18
[Diagram: Facebook Server; Facebook Application (PHP); XML File; XML Parser; Text File; Lucene Search & Index (Java)]
Sample XML file record: Abc xyz; 24; Sports, Music; Atlanta; Moderate (a second record begins: Hello; 27)
Sample text file record: Abc xyz; 24; Sports, Music; Atlanta; Moderate
19
Extensions and Future Work
Coupling semantic search with traditional search techniques to achieve a ‘whole’ solution
  Relevance feedback from previous searches, for instance
Performance testing of search results
Relevance sorting of results (partially)
20
Questions?
Thank you