Search
Dr Ian Boston, University of Cambridge
6 December 2006, 10:30, INTL 6
Image © University of Cambridge 2006


Search: Problem Area
Stovepipe Applications
  – All wanted search
      Can't search each tool
Unified Search of all content
  – 1 text box + a button
  – Just like Google
      To start with, slightly less content

Possible Solutions Image © University of Cambridge 2006

Public/Private Search Engine
  – Register your site with Google
      What about the content/permissions?
      Non-starter, content missing.
  – Google Scholar
      E.g. DSpace
  – Google Researcher? Google Learner?
      Sakai is not OpenAccess
      Why would they?

Private Search Application
  – Intranet solution
      Install Apache Nutch
        › Add AuthZ code
      Buy a Google Appliance
        › Configure to do some AuthZ
        › ~£40K, 0.5M pages
  – Rendered content is only a view
      Misses properties
      Approximates linkage
        › Doesn’t know about Sakai
  – Nutch prototype in 1.5.1

Entity Search
  – Write a search engine!
      Full-time job.
  – Reuse Lucene
      Scalability
        › Most have < 5M active documents
        › Nutch benchmarked
            » 5 boxes, 2TB == 100M+ docs
      Plumb in Lucene
        › Connect to Sakai Entity Bus
        › Connect to Entity Producers at the object level.
      Learn from Nutch
        › Index storage and management
        › Scalability
Reliability
  – MUST cluster OOTB

Search Tool Image © University of Cambridge 2006

Search Tool

Permissions
  – Owning Entity checks permission on each result
Rendering
Highlighting
  – Matching terms highlighted
RSS feed of search results
OpenSearch (FF2.0, IE7) and Sherlock/Mycroft (FF1.5) integration
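The per-result permission check above can be pictured as a post-filter over raw index hits. What follows is only a minimal sketch: `Hit`, `PermissionCheck` and `filter_hits` are hypothetical names chosen for illustration and are not Sakai's API, and the choice to fail closed for an unknown owner is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    """One raw index hit (hypothetical shape, not Sakai's API)."""
    reference: str   # entity reference, e.g. "/wiki/site1/home"
    tool: str        # name of the tool that owns the entity
    score: float


# Maps a tool name to a permission check supplied by the owning entity.
# The (user_id, reference) -> bool signature is an assumption.
PermissionCheck = Callable[[str, str], bool]


def filter_hits(hits: Iterable[Hit], user_id: str,
                checks: Dict[str, PermissionCheck]) -> List[Hit]:
    """Keep only the hits that the owning entity says the user may read."""
    visible = []
    for hit in hits:
        check = checks.get(hit.tool)
        # Unknown owner: fail closed rather than leak a result.
        if check is not None and check(user_id, hit.reference):
            visible.append(hit)
    return visible
```

Filtering after the index lookup keeps the permission logic with the tool that owns the content, at the cost of one check per result.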

Admin Tool

Monitor indexing progress
Monitor segments
Request worksite index rebuilds
Request complete index rebuilds
  – Expensive!

Tag Tool

Search for a term
Discover other terms
  – Size indicates relevance within result set
Needs some windowing on the word vectors
  – High-frequency words not significant
  – Short words not significant
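One way to read the windowing idea: count terms across the result set, discard short words and words that appear in nearly every document, then scale the rest to a display size. This is a sketch under assumptions; the thresholds (`min_length`, `max_doc_fraction`) and the linear size scaling are illustrative choices, not values from the talk.

```python
from collections import Counter
from typing import Dict, List


def tag_weights(result_docs: List[List[str]],
                min_length: int = 4,
                max_doc_fraction: float = 0.8,
                min_size: int = 10,
                max_size: int = 30) -> Dict[str, int]:
    """Map each significant term in a result set to a display size."""
    n_docs = len(result_docs)
    if n_docs == 0:
        return {}
    term_freq = Counter()   # total occurrences across the result set
    doc_freq = Counter()    # number of documents containing the term
    for terms in result_docs:
        term_freq.update(terms)
        doc_freq.update(set(terms))
    # Window: drop short words and words present in almost every document.
    kept = {t: c for t, c in term_freq.items()
            if len(t) >= min_length
            and doc_freq[t] / n_docs <= max_doc_fraction}
    if not kept:
        return {}
    top = max(kept.values())
    # Linear scaling between min_size and max_size (an assumed choice).
    return {t: min_size + round((max_size - min_size) * c / top)
            for t, c in kept.items()}
```

Each inner list stands for the terms of one matching document, so the sizes are relative to the current result set rather than to the whole index.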

Search API
Simple API, one method: Search()
Results paged at lowest level
Access to secondary indexes
  – “+Tool:wiki +Site: +cowslips +bluebell”
Content terms use Porter Stemmer and stop words
  – Stop words “and”, “the”, “a” ignored
  – Stemmer: looks == look, try == trying
May be some i18n issues
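The effect of stop words and stemming on content terms can be illustrated with a toy normaliser. Note the swap: the suffix stripper below is a deliberately crude stand-in, not the Porter algorithm, and the function names are hypothetical. It only reproduces the examples given (looks/look, try/trying).

```python
import re
from typing import List

STOP_WORDS = {"and", "the", "a"}   # the three examples named on the slide
SUFFIXES = ("ing", "s")            # toy rules, far cruder than Porter


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> List[str]:
    """Lowercase, tokenise, drop stop words, then stem what is left."""
    # ASCII-only tokenising: non-English text would be mangled here.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

The same normalisation has to be applied to the query as to the indexed content, otherwise a search for "trying" would not match a document indexed under "try". English-specific rules such as these are one place where i18n problems could surface.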

Internal Architecture Image © Wikipedia Commons 2006

Architecture
[Diagram. Labelled components: Sakai Entity Bus; Wiki Service, Content Service, Message Service; Event Listener; Index Queue; Index Builder; Entity Content Producer; Local Segment Store; Clustered Index Store; Shared Segment Store; Lucene; Search Service; Search API; Search Tool, Tag Tool, RWiki Search, Resources Tool, OSP Tools, Chat Tool, Announcements, Wiki Tool]

Indexer
  – Indexing Queue
      Events arrive on the Bus
      Added to the Queue transactionally
  – Indexing
      Index workers run concurrently (2 per Sakai node)
      Take Events from the queue
      Open an abstract Lucene segment
      Distributed lock manager
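The queue-and-workers arrangement can be sketched in a single process. This is an illustration only: the dictionary stands in for a Lucene segment, the in-process `threading.Lock` stands in for the distributed lock manager (a real cluster needs a lock shared across nodes), and the event shape with `ref` and `content` keys is hypothetical.

```python
import queue
import threading
from typing import Dict, List

WORKERS_PER_NODE = 2  # figure taken from the slide


def run_indexers(events: List[dict]) -> Dict[str, str]:
    """Drain a queue of events with concurrent workers into one index."""
    pending: "queue.Queue[dict]" = queue.Queue()
    for event in events:
        pending.put(event)

    index: Dict[str, str] = {}        # stand-in for a Lucene segment
    segment_lock = threading.Lock()   # stand-in for the distributed lock

    def worker() -> None:
        while True:
            try:
                event = pending.get_nowait()
            except queue.Empty:
                return                # queue drained, worker exits
            with segment_lock:        # one writer on the segment at a time
                index[event["ref"]] = event["content"]
            pending.task_done()

    threads = [threading.Thread(target=worker)
               for _ in range(WORKERS_PER_NODE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return index
```

Decoupling event arrival from indexing means a burst of edits only lengthens the queue; it does not slow the request that produced the event.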

Content
  – Entity Content Producer
      Digests a token stream
        › On content
        › Using stemmer and stop words
      Provides index terms
        › Site ID
        › User info
        › Properties
        › Tool
        › Custom RDF structure
        › Requires a triple store
        › Sesame in contrib
        › Mulgara/Kowari needs work.

Cluster Index Storage
  – Not distributed
      Mirrored for central deposit
      Not as scalable as Nutch with Google MapReduce
      BUT no setup required
  – Local segments
      Opened by IndexReaders, IndexWriters, IndexSearchers
      High-performance seek
  – Shared segments
      Central deposit of search segments
      Synchronized with local copies
  – Periodic merging
      Reduces open files
      Eliminates deleted items
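The interplay of shared segments, local copies and periodic merging might look roughly like the sketch below. It is a guess at the shape of the mechanism, not the implementation: segments are modelled as plain dictionaries, versions as integers, and `sync_local` and `merge_segments` are hypothetical names.

```python
from typing import Dict, Set

Segment = Dict[str, str]   # doc reference -> indexed text (stand-in)


def sync_local(shared: Dict[str, int], local: Dict[str, int]) -> Set[str]:
    """Bring local segment versions up to date; return the names fetched."""
    fetched = {name for name, version in shared.items()
               if local.get(name, -1) < version}
    for name in fetched:
        local[name] = shared[name]   # a real node would copy files here
    # Segments gone from the shared store (e.g. after a merge) are dropped.
    for name in set(local) - set(shared):
        del local[name]
    return fetched


def merge_segments(segments: Dict[str, Segment],
                   deleted: Set[str]) -> Segment:
    """Fold several segments into one, leaving out deleted documents."""
    merged: Segment = {}
    for name in sorted(segments):    # later segment names win on conflict
        for ref, text in segments[name].items():
            if ref not in deleted:
                merged[ref] = text
    return merged
```

Searching always reads the local copy, so query speed does not depend on the central store; the central store only has to be reachable when a node synchronises.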

Production Deployment Image © University of Cardiff 2006

Sites
In production
  – Cambridge
      73K documents, 6GB index, content in index.
      Rebuild time = 45 minutes
  – Cape Town
      93K documents, 200MB index, content not in index.
      Rebuild time = ?
  – Others?
Considering
  – Michigan
      1.7M documents
      Rebuild time… weeks?
      Should not put the content in the index

Deployment Issues
Indexing times
  – Acceptable for smaller sites, a few hours
  – Pain at larger sites
      Rolling per-worksite index build
      Dedicated indexing cluster (not serving pages)
Storage strategies
  – First attempts - Cambridge - Cape Town
      Cape Town identified many problems - Thank you!
      MySQL - Don’t put segments in DB! - Extremely slow tables.
  – Node layout
      All nodes are indexers
  – Content in the index or out of the index
      No content in index now
      Results re-digested on search

Roadmap Image from: A Gnome2 media editor Image © Marlin Project 2006

New Features
Tagged Search Discovery
  – Based on word vectors
  – In trunk
  – Needs a lens - focus on distribution segment
RDF Faceted Discovery
  – Merged word vectors and triples
  – Needs per-worksite ontology tools
  – Needs triple store
      Should be a Sakai-wide store.
        › Kowari - issues with community
        › Mulgara

Roadmap
Parallel indexing
  – Implemented, needs heavy testing
  – Learn from Nutch
  – Multiple active indexes
  – Big sites in production
  – Better merge algorithm
Other tools using search
  – Use indexes for PK search
  – Issues over queue delays
Text mining - Sydney - Rafael Calvo

Questions Image © University of Cambridge 2006