CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it. CERN IT-OIS: Tim Bell, Eduardo Alvarez Fernandez, Andreas Wagner. HEPiX Fall 2010 Workshop.

Presentation transcript:

CERN Search Engine Status
Tim Bell, Eduardo Alvarez Fernandez, Andreas Wagner (CERN IT-OIS)
HEPiX Fall 2010 Workshop, 3rd November 2010, Cornell University
CERN IT Department, CH-1211 Geneva 23, Switzerland

Outline
Enterprise Search
– What is Enterprise Search?
– Requirements for protected search
– Enterprise Search solution providers
CERN Search
– Background & Objectives
– Architecture, Document Workflow
– Search Relevancy, Ranking algorithms
Improving TWiki Search
– Indexing TWiki Topics
Google Comparison
– What about Google Search Appliance?
– Comparison with FAST
Future Steps
– FAST Search Server 2010

Enterprise Search
Components of Enterprise Search:
– Document retrieval
    – Not only web pages
    – Database/XML data (CDS, Indico, Phone data)
– Search Engine with ranking
– Integration within existing infrastructure
    – Authentication
    – Authorization
– Protected documents
    – Getting access to document data
    – Recording ACLs as well
Enterprise Search is not only a question of the search technology used!

Protection Requirements
Protected information must not ‘leak’ from search
– Search engine only presents data you can read
– To obtain full results, authentication is required
– Results filtered by your access rights
Authentication models can be based on
– Document ACL at time of indexing
– Callback to the application
Dependent on role-based model for the site
– Ideally only one role model

Enterprise Search Providers
Gartner Report: “Magic Quadrant for Information Access Technology, ”
[Figure: Magic Quadrant chart; label: FAST]

CERN Search
A CERN Search page for the whole site
– Search for public data
– www.cern.ch
– Central IT services
– Experiment web sites
– Infrastructure / HR / Administrative workflow sites
Start of project in February 2006
– Based on FAST as one of the market leaders
– Present resources: 1 Project Associate and a small share of an engineer
In production since 2007

FAST ESP Architecture
[Diagram: Document Content Flow. Labels: Content API, Query API, Filter API, Connectors (Push & Pull); stages: Document retrieval, Document indexing, Document processing]

Indexing Protected Content
Document Processing
– Resolve ACLs to text strings
– Sent to Indexer with document
Security Access Module of FAST
– Active Directory integration based on CERN accounts and e-groups
[Diagram labels: Document Repository, Document Processing, Active Directory Users & Groups, Search Index, CERN Search; flows: Document, ACL, Doc + ACL]
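The idea of resolving ACLs to text strings and sending them to the indexer with the document can be sketched roughly as follows. This is a minimal illustration, not the FAST Security Access Module: the token format ("user:…", "group:…", "public"), the IndexedDocument type and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    """What a toy indexer would receive: content plus flat ACL strings."""
    url: str
    text: str
    acl_tokens: frozenset


def resolve_acl(acl_entries):
    """Turn (kind, name) ACL entries into plain text tokens.

    The 'user:alice' / 'group:it-dep' / 'public' format is hypothetical.
    """
    tokens = set()
    for kind, name in acl_entries:
        if kind not in ("user", "group", "public"):
            raise ValueError(f"unknown ACL entry kind: {kind!r}")
        # Names are lower-cased so that matching at query time is case-insensitive.
        tokens.add("public" if kind == "public" else f"{kind}:{name.lower()}")
    return frozenset(tokens)


def process_document(url, text, acl_entries):
    """Document-processing step: attach the resolved ACL to the document."""
    return IndexedDocument(url=url, text=text, acl_tokens=resolve_acl(acl_entries))


doc = process_document(
    "https://example.org/protected/page",  # placeholder URL
    "Minutes of the weekly meeting",
    [("group", "ATLAS-Members"), ("user", "alice")],
)
print(sorted(doc.acl_tokens))
```

Storing the ACL as ordinary strings means the index can treat access rights like any other searchable field, at the cost of needing a re-index when permissions on a document change.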

Authentication / Authorisation
[Diagram labels: CERN Search, Active Directory Users & Groups, Search Index, Search Front End, Query Processing; flows: Authentication (SSO) & Search, Query & Identity, Group Membership]
Authentication by Front-End
User identity and e-group membership are passed along with the query
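A rough sketch of what passing the user identity and e-group membership along with the query could look like at the index side is given below. The in-memory index, the token format and the function names are assumptions for illustration only; the real system delegates this to FAST and Active Directory.

```python
def principal_tokens(user, egroups):
    """Build the set of ACL tokens an authenticated user may match (hypothetical format)."""
    tokens = {"public", f"user:{user.lower()}"}
    tokens.update(f"group:{g.lower()}" for g in egroups)
    return tokens


def search(index, term, user=None, egroups=()):
    """Return URLs of documents that contain `term` and are readable by the caller.

    `index` is a list of dicts with 'url', 'text' and 'acl' (a set of tokens).
    Anonymous callers (user=None) only match documents marked 'public'.
    """
    allowed = {"public"} if user is None else principal_tokens(user, egroups)
    term = term.lower()
    return [
        doc["url"]
        for doc in index
        # A document is shown only if at least one of its ACL tokens is allowed.
        if term in doc["text"].lower() and allowed & doc["acl"]
    ]


index = [
    {"url": "https://example.org/public", "text": "LHC status", "acl": {"public"}},
    {"url": "https://example.org/internal", "text": "LHC shift plan", "acl": {"group:atlas-members"}},
]
print(search(index, "lhc"))                                    # anonymous query
print(search(index, "lhc", "alice", ["ATLAS-Members"]))        # authenticated query
```

Filtering inside the query, rather than after the result list is built, keeps hit counts and snippets from revealing that a protected document exists.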

CERN Search - System Layout
[Diagram]
Production System: Document Processing & Frontend, Search Index (search21, search23, search22, search20, search24)
Development System: Document Processing & Frontend, Search Index (search10, search11, search02, search06)
Frontend: search06, websvc08

Indexed Documents
Currently >3 million documents
Estimated 10 million in total if all sites indexed

Result Ranking – Relevancy
Order search results with the most interesting document first in the list
Ranking Metrics:
– Search Terms:
    – Occurrence in URL, page title and page contents
    – Proximity of terms in document
– Quality of a page:
    – Relevance of page in the Web space of all indexed pages (how many other pages link to the page)
    – How deep inside a Website a page is located
– Freshness of document
    – Generally the newer the document, the more interesting
– Anchortext
    – Text of a link pointing to a page
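To make the ranking metrics above concrete, here is a small sketch of a linear score that combines them. The weights, field names and formula are invented for illustration and are not FAST's ranking algorithm; term proximity is left out to keep the example short.

```python
import math
from datetime import date

# Hypothetical weights; a real engine would tune these per collection.
WEIGHTS = {"url": 3.0, "title": 2.0, "body": 1.0, "anchor": 2.0,
           "links": 1.0, "depth": 0.5, "fresh": 1.0}


def score(page, terms, today):
    """Combine term occurrence, link count, site depth, freshness and anchor text."""
    terms = [t.lower() for t in terms]

    def hits(field):
        text = page[field].lower()
        return sum(text.count(t) for t in terms)

    s = (WEIGHTS["url"] * hits("url")
         + WEIGHTS["title"] * hits("title")
         + WEIGHTS["body"] * hits("body")
         + WEIGHTS["anchor"] * hits("anchor_text"))
    # Quality: more inbound links help (log-damped), deeper pages are penalised.
    s += WEIGHTS["links"] * math.log1p(page["inlinks"])
    s -= WEIGHTS["depth"] * page["depth"]
    # Freshness: the contribution decays as the document ages.
    age_days = max((today - page["modified"]).days, 0)
    s += WEIGHTS["fresh"] / (1 + age_days / 365)
    return s


page = {
    "url": "http://example.org/lhc", "title": "LHC status",
    "body": "The LHC is running", "anchor_text": "lhc status page",
    "inlinks": 10, "depth": 1, "modified": date(2010, 10, 1),
}
print(score(page, ["LHC"], date(2010, 11, 3)))
```

On a flat web space where most sites sit one level below the home page, the depth and link terms carry little signal, which is why the term-based and anchor-text components tend to dominate in such a setting.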

Ranking Issues at CERN
Flat Web space
– ~10,000 Web sites just one level down
– No consistent structure and navigation (apart from back-links to CERN home page)
Keyword distribution
– Small number of significant words in large number of pages
[Chart: Page Score against Hit number]

Result Ranking – Improvements
How to improve ranking?
– Manual tuning of results to assure expected results during important events
    – LHC first physics; Angels & Demons
– Usage analysis
    – e.g. review of “zero result” queries
    – user tracking: “what links users follow”
Best results obtained with hints to search engine and effort by content authors
– Add keyword and author meta data tags at minimum
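The review of “zero result” queries mentioned above could be automated along these lines. The log format, a sequence of (query, number_of_hits) pairs, is an assumption; an actual query log would need its own parser.

```python
from collections import Counter


def zero_result_queries(log, top=10):
    """Return the most frequent queries that produced no hits.

    `log` is an iterable of (query, n_hits) pairs (hypothetical format).
    Queries are lower-cased and whitespace-normalised before counting.
    """
    counts = Counter(
        " ".join(query.lower().split())
        for query, n_hits in log
        if n_hits == 0
    )
    return counts.most_common(top)


log = [
    ("LHC  schedule", 0),
    ("lhc schedule", 0),
    ("twiki", 12),
    ("Canteen menu", 0),
]
print(zero_result_queries(log))
```

The resulting list is a natural to-do list for manual tuning: each frequent zero-result query points either to missing content or to content that lacks the keywords users actually type.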

TWiki Search
Request from experiments to index protected TWiki content and to improve ranking
– Built-in TWiki search functionality was weak
Pages are protected so access requires CERN SSO step
– Not natural for web crawlers
URLs are not words, so breaking up the topic name improved ranking
– ‘Example Topic Template’ from
Get changed pages only
– TWiki ‘find’ for modified documents to be re-indexed
– Could increase frequency to hourly
In production since June 3rd 2010
– Users reporting substantial improvements compared to built-in TWiki search
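Breaking a TWiki topic name into words, as described above, might be done with a small regular expression. This is a sketch under assumptions: the sample URL and the topic name ExampleTopicTemplate are placeholders, and the splitting rule is a generic CamelCase heuristic rather than the rule used in production.

```python
import re
from urllib.parse import urlparse

# Zero-width boundaries: lower-case/digit followed by upper-case ("eT"),
# or the end of an upper-case run followed by a capitalised word ("SC" in "ATLASComputing").
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_topic_name(topic):
    """Insert spaces at CamelCase boundaries so the indexer sees separate words."""
    return _BOUNDARY.sub(" ", topic)


def searchable_title(url):
    """Take the last path segment of a topic URL and split it into words."""
    topic = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return split_topic_name(topic)


# Hypothetical topic URL, for illustration only.
print(searchable_title("https://twiki.example.org/bin/view/Main/ExampleTopicTemplate"))
```

Feeding the split form into the title field gives the engine individual terms to match, whereas the unsplit topic name would only match a query for the exact concatenated string.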

What about Google?
What makes Google Web search work well
– The whole web for analysis
    – who links to your site
– Huge usage data used for “voting” for results
    – most popular results swim up
– Substantial resources to tune and correct results
    – usage data analysis
    – taking into account popular events
    – hand-edited results for popular single keyword searches
Above is valid for all public search engines
– Yahoo!, Bing, …

Google Search Appliance
Google make a packaged offering
– Hardware
– Software
– 2 year license and then need to replace
Priced by number of documents
– CERN has around 10 million documents
Black box solution
– Management GUI
– Alerting
– Does retrieval, analysis and indexing
– Single sign-on support (but see later…)

ATLAS Comparison
Test
– BNL have a Google Search Appliance which they use to index ATLAS public pages at CERN
– Performed sampling comparisons with CERN FAST Search for sample common terms
Results
– Google Search Appliance did a better job at ranking according to content owners
– Indexing of protected pages did not work
    – Issues with Single Sign-On JavaScript
    – Google engineers could not find a solution
– GSA cost would have been substantially higher

Looking Ahead
Include additional protected content, e.g. Indico, EDMS, SharePoint, Drupal, …
Migrate to FAST Search 2010
Improved web selection filtering
– Show documents from past X months
– Show documents written by author Y
Partition web space
– Official content
– Personal sites
Feedback based on previous user choices
– Put higher if often selected
Allow content managers to adjust rankings themselves
Repeat comparisons with other solutions in 2011 such as GSA
Interested to see what other sites are doing
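The planned feedback based on previous user choices ("put higher if often selected") could, for instance, be a click-count boost applied on top of the engine's own score. The sketch below is one possible reading of that bullet: the log-damped boost, the weight value and the data shapes are assumptions, not a description of FAST Search 2010.

```python
import math


def rerank_with_clicks(results, clicks, weight=0.5):
    """Re-order results so that frequently selected pages move up.

    `results` is a list of (url, base_score) pairs from the search engine;
    `clicks` maps url -> number of times users selected it (hypothetical data).
    The boost grows with log(1 + clicks) so a few very popular pages
    do not swamp the engine's own relevance score.
    """
    boosted = [
        (url, base + weight * math.log1p(clicks.get(url, 0)))
        for url, base in results
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


results = [("https://example.org/a", 1.0), ("https://example.org/b", 0.9)]
clicks = {"https://example.org/b": 20}
for url, s in rerank_with_clicks(results, clicks):
    print(f"{s:.2f}  {url}")
```

A boost of this kind reinforces whatever already ranks high, so in practice it would likely be combined with the manual tuning and zero-result analysis described earlier.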


CERN Search
CERN Search: and also via:
– CERN Intranet & Public Pages
– TWiki
– IT, HR, PH Websites
– JACOW

Enterprise Search
Wide range of document sources:
– Web Pages
– File systems
– Databases
– Directories (People and Places)
– Document repositories (CDS, EDMS, Indico, …)
Variety of meta data
Different access protection schemes
Different retrieval methods and frequencies