Published by Terence Nelson. Modified over 9 years ago.
1
Search
Dr Ian Boston, University of Cambridge
6 December 2006, 10:30, INTL 6
Image © University of Cambridge 2006
2
Search: Problem Area
Stovepipe Applications
  –All wanted search
    Can't search each tool
Unified Search of all content
  –1 text box + a button
  –Just like Google
To start with, slightly less content
3
Possible Solutions Image © University of Cambridge 2006
4
Public/Private Search Engine
  –Register your site with Google
    What about the content/permissions?
    Non-starter, content missing.
  –Google Scholar
    e.g. DSpace
  –Google Researcher? Google Learner?
    Sakai is not OpenAccess
    Why would they?
5
Private Search Application
  –Intranet solution
    Install Apache Nutch
      ›Add AuthZ code
    Buy a Google Appliance
      ›Configure to do some AuthZ
      ›~£40K, 0.5M pages
  –Rendered content is only a view
    Misses properties
    Approximates linkage
      ›Doesn’t know about Sakai
  –Nutch Prototype in 1.5.1
6
Entity Search
  –Write a search engine!
    Full time job.
  –Reuse Lucene
    Scalability
      ›Most have < 5M active documents
      ›Nutch benchmarked
        »5 boxes, 2TB == 100M+ docs
        »http://wiki.apache.org/nutch/HardwareRequirements
    Plumb in Lucene
      ›Connect to Sakai Entity Bus
      ›Connect to Entity Producers at the object level.
    Learn from Nutch
      ›Index Storage and Management
      ›Scalability
Reliability
  –MUST Cluster OOTB
7
Search Tool Image © University of Cambridge 2006
8
Search Tool
9
Permissions
  –Owning Entity checks permission on each result
Rendering
Highlighting
  –Matching terms highlighted
RSS Feed of search results
OpenSearch (FF2.0, IE7) and Sherlock/Mycroft (FF1.5) integration
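The per-result permission check on this slide can be pictured as a filter applied to raw hits before they are rendered. What follows is a minimal Python sketch under stated assumptions, not Sakai's API: `Hit`, `visible_hits` and the per-tool callbacks are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    """One raw search hit. Field names are illustrative, not Sakai's."""
    reference: str  # entity reference, e.g. "/wiki/site-a/home"
    tool: str       # tool that owns the entity
    score: float


# Hypothetical callback type: (user_id, reference) -> is access allowed?
PermissionCheck = Callable[[str, str], bool]


def visible_hits(hits: Iterable[Hit], user_id: str,
                 checks: Dict[str, PermissionCheck]) -> List[Hit]:
    """Keep only hits whose owning tool grants `user_id` access.

    Hits from tools with no registered check are dropped (fail closed).
    """
    allowed = []
    for hit in hits:
        check = checks.get(hit.tool)
        if check is not None and check(user_id, hit.reference):
            allowed.append(hit)
    return allowed
```

Filtering after the query runs keeps the index free of permission data, at the cost of one check per candidate result.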
10
Admin Tool
11
Monitor Indexing progress
Monitor Segments
Request Worksite Index Rebuilds
Request Complete Index Rebuilds
  –Expensive!
12
Tag Tool
13
Search for a term
Discover other terms
  –Size indicates relevance within result set
Needs some windowing on the word vectors
  –High frequency words not significant
  –Short words not significant
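One way to read the windowing idea above is as a pair of cut-offs applied before sizing the tags. The sketch below is an assumption-laden illustration in Python: the function name, the thresholds and the use of document frequency within the result set as the size measure are all choices made here, not details taken from the Tag Tool.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_weights(result_texts: Iterable[str], min_length: int = 4,
                max_doc_fraction: float = 0.8, top_n: int = 20) -> Dict[str, float]:
    """Return {term: relative size in (0, 1]} for a tag cloud.

    Terms shorter than `min_length`, or present in more than
    `max_doc_fraction` of the results, are treated as not significant.
    """
    docs = [set(text.lower().split()) for text in result_texts]
    if not docs:
        return {}
    # Number of results each term appears in.
    doc_freq = Counter(term for doc in docs for term in doc)
    kept = {term: n for term, n in doc_freq.items()
            if len(term) >= min_length and n / len(docs) <= max_doc_fraction}
    if not kept:
        return {}
    largest = max(kept.values())
    # Most frequent first, ties broken alphabetically.
    top = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {term: n / largest for term, n in top}
```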
14
Search API
Simple API, one method: Search()
Results paged at lowest level
Access to secondary indexes
  –“+Tool:wiki +Site: +cowslips +bluebell”
Content terms use Porter Stemmer and stop words
  –Stop words “and” “the” “a” ignored
  –Stemmer: looks == look, try == trying
May be some i18n issues
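To make the stop-word and stemming behaviour concrete, here is a small Python sketch. It is deliberately not the Porter algorithm: `crude_stem` is a hypothetical stand-in that strips a few English suffixes, which is enough to reproduce the two equivalences quoted on the slide. The stop-word list is the one from the slide.

```python
import re
from typing import List

STOP_WORDS = frozenset({"and", "the", "a"})  # the examples given on the slide
SUFFIXES = ("ing", "es", "s")                # toy rules, not Porter's


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> List[str]:
    """Lower-case, tokenise, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

The same function has to be applied to both the indexed content and the query terms, otherwise a query for "looks" would not match a document indexed under "look".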
15
Internal Architecture Image © Wikipedia Commons 2006
16
Architecture (diagram labels): Search Service, Lucene, Sakai Entity Bus; Wiki Service, Content Service, Message Service; Event Listener, Index Queue, Index Builder, Entity Content Producer; Local Segment Store, Clustered Index Store, Shared Segment Store; Search API; Search Tool, Tag Tool, RWiki Search, Resources Tool, OSP Tools, Chat Tool, Email Tool, Announcements, Wiki Tool
17
Indexer
  –Indexing Queue
    Events arrive on the Bus
    Added to the Queue transactionally
  –Indexing
    Index workers run concurrently (2 per Sakai node)
    Take Events from the queue
    Open an Abstract Lucene segment
    Distributed lock manager
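The queue-and-workers flow above can be sketched on a single machine with Python's standard library. This is an illustration only: the in-process `queue.Queue` stands in for the index queue, the `threading.Lock` stands in for the distributed lock manager, and a plain list stands in for a Lucene segment. All names are hypothetical.

```python
import queue
import threading
from typing import Iterable, List

WORKERS_PER_NODE = 2  # figure taken from the slide


def run_indexers(events: Iterable[str], workers: int = WORKERS_PER_NODE) -> List[str]:
    """Drain `events` with concurrent workers and return the indexed items."""
    pending: "queue.Queue[str]" = queue.Queue()
    for event in events:
        pending.put(event)

    segment: List[str] = []           # stand-in for a Lucene segment
    segment_lock = threading.Lock()   # stand-in for the distributed lock manager

    def worker() -> None:
        while True:
            try:
                event = pending.get_nowait()
            except queue.Empty:
                return                # nothing left to index
            with segment_lock:        # one writer on the segment at a time
                segment.append(event)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return segment
```

The order of items in the returned list depends on thread scheduling, so callers should not rely on it.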
18
Content
  –Entity Content Producer
    Digests a Token Stream
      ›On Content
      ›Using Stemmer and Stop Words
    Provides index terms
      ›Site ID
      ›User info
      ›Properties
      ›Tool
      ›Custom
    RDF Structure
      ›Requires a triple store
      ›Sesame in Contrib
      ›Mulgara/Kowari needs work.
19
Cluster Index Storage
  –Not Distributed
    Mirrored for Central Deposit
    Not as scalable as Nutch with Google MapReduce
    BUT no setup required
  –Local Segments
    Opened by IndexReaders, IndexWriters, IndexSearchers
    High performance seek
  –Shared Segments
    Central deposit of search segments
    Synchronized with local copies
  –Periodic Merging
    Reduces open files
    Eliminates deleted items
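The relationship between local copies, the shared deposit and periodic merging can be sketched as two small functions. This is a Python illustration under assumptions made here: segments are identified by name with an integer version, and a segment is modelled as a dictionary from document id to content. Neither is claimed to be how the Sakai search service stores segments.

```python
from typing import Dict, Iterable, List, Set


def segments_to_fetch(local: Dict[str, int], shared: Dict[str, int]) -> List[str]:
    """Names of shared segments that are missing or stale in the local store."""
    return sorted(name for name, version in shared.items()
                  if local.get(name, -1) < version)


def merge_segments(segments: Iterable[Dict[str, str]],
                   deleted: Set[str]) -> Dict[str, str]:
    """Fold segments (oldest first) into one and drop deleted documents.

    Later segments overwrite earlier entries for the same document id.
    """
    merged: Dict[str, str] = {}
    for segment in segments:
        merged.update(segment)
    for doc_id in deleted:
        merged.pop(doc_id, None)
    return merged
```

Merging many small segments into one is what reduces the number of open files, and it is also the point at which deleted documents stop taking up space.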
20
Production Deployment Image © University of Cardiff 2006
21
Sites
In production
  –Cambridge
    73K documents, 6GB index, content in index.
    Rebuild time = 45 minutes
  –Cape Town
    93K documents, 200MB index, content not in index.
    Rebuild time = ?
  –Others?
Considering
  –Michigan
    1.7M documents
    Rebuild time… weeks?
    Should not put the content in the index
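The case against storing content in the index can be checked with simple arithmetic on the two production figures above. The Python snippet below assumes GB and MB are decimal units, which the slide does not specify; the variable and function names are introduced here.

```python
def bytes_per_document(index_bytes: int, documents: int) -> float:
    """Average index size per document."""
    return index_bytes / documents


# Figures from the slide; decimal units are an assumption.
cambridge = bytes_per_document(6 * 10**9, 73_000)    # content in index
cape_town = bytes_per_document(200 * 10**6, 93_000)  # content not in index
ratio = cambridge / cape_town

print(f"Cambridge: {cambridge / 1000:.0f} KB per document")
print(f"Cape Town: {cape_town / 1000:.1f} KB per document")
print(f"Ratio:     {ratio:.0f}x")
```

On these numbers, keeping content in the index costs roughly 82 KB per document against roughly 2 KB without it, a difference of about 38 times.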
22
Deployment Issues
Indexing Times
  –Acceptable for smaller sites, a few hours
  –Pain at larger sites
    Rolling per worksite index build
    Dedicated indexing cluster (not serving pages)
Storage strategies
  –First Attempts - Cambridge - Cape Town
    Cape Town identified many problems - Thank you!
    MySQL - Don’t put segments in DB! - Extremely slow tables.
  –Node Layout
    All nodes are indexers
  –Content in the Index or Out of the index
    No content in index now
    Results re-digested on search
23
Roadmap Image from: http://marlin.sourceforge.net A Gnome2 media editor Image © Marlin Project 2006
24
New Features
Tagged Search Discovery
  –Based on word vectors
  –In trunk
  –Needs a lens - focus on distribution segment
RDF Faceted Discovery
  –Merged word vectors and triples
  –Needs per worksite ontology tools
  –Needs triple store
    Should be a Sakai-wide store.
      ›Kowari - issues with community
      ›Mulgara
25
Roadmap
Parallel Indexing
  –Implemented, needs heavy testing
  –Learn from Nutch
  –Multiple active indexes
  –Big sites in production
  –Better merge algorithm
Other tools using search
  –Use indexes for PK search
  –Issues over Queue delays
Text Mining - Sydney - Rafael Calvo
26
Questions Image © University of Cambridge 2006