CP3024 Lecture 12 Search Engines
What is the main WWW problem?
With an estimated 800 million web pages, finding the one you want is difficult!
What is a Search Engine?
A page on the web connected to a backend program
Allows a user to enter words which characterise a required page
Returns links to pages which match the query
A Typical Search Engine
Types of Search Engine
Automatic search engine e.g. Altavista, Lycos
Classified Directory e.g. Yahoo!
Meta-Search Engine e.g. Dogpile
Components of a Search Engine
Robot (or Worm or Spider)
–collects pages
–checks for page changes
Indexer
–constructs a sophisticated file structure to enable fast page retrieval
Searcher
–satisfies user queries
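The indexer and searcher roles above can be illustrated with a small inverted index. This is only a sketch: the slides do not say which file structure a real engine uses, and the page contents, URLs and the function names `tokenize`, `build_index` and `search` are assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages standing in for what a robot would have collected.
pages = {
    "http://example.org/a": "Search engines index web pages",
    "http://example.org/b": "Robots collect pages by following links",
}
index = build_index(pages)
print(search(index, "robots pages"))
```

Looking a word up in the index avoids scanning every page at query time, which is the point of building the structure in advance.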
Query Interface
Usually a boolean interface
–(Fred and Jean) or (Bill and Sam)
Normally allows phrase searches
–"Fred Smith"
Also proximity searches
Not generally understood by users
May have extra 'friendlier' features?
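A boolean query maps naturally onto set operations, and a phrase query needs word positions as well. The sketch below assumes a small positional index; the three sample pages and all function names are hypothetical and are not taken from any particular engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(pages):
    """Map word -> {page id -> list of positions of that word in the page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(page_id, []).append(pos)
    return index

def pages_with(index, word):
    """Return the set of page ids that contain the word."""
    return set(index.get(word.lower(), {}))

def phrase_search(index, phrase):
    """Return page ids where the words of the phrase occur consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    hits = set()
    for page_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(page_id, [])
                   for i, w in enumerate(words)):
                hits.add(page_id)
                break
    return hits

# Hypothetical pages, keyed by a made-up page id.
pages = {
    1: "Fred and Jean met Fred Smith",
    2: "Bill wrote to Sam",
    3: "Smith and Fred and Bill",
}
index = build_positional_index(pages)

# (Fred and Jean) or (Bill and Sam): "and" is intersection, "or" is union.
boolean_hits = ((pages_with(index, "Fred") & pages_with(index, "Jean"))
                | (pages_with(index, "Bill") & pages_with(index, "Sam")))

# "Fred Smith" as a phrase: page 3 has both words but not next to each other.
phrase_hits = phrase_search(index, "Fred Smith")
print(boolean_hits, phrase_hits)
```

A proximity search could be sketched the same way by accepting positions that differ by up to some chosen distance instead of exactly one.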
Search Results
Presented as links
Supposedly ordered in terms of relevancy to the query
Some Search Engines score results
Normally organised in groups of ten per page
Problems
Links are often out of date
Usually too many links are returned
Returned links are not very relevant
The Engines don't know about enough pages
Different engines return different results
U.S. bias
Improving query results
To look for a particular page use an unusual phrase you know is on that page
Use phrase queries where possible
Check your spelling!
Progressively use more terms
If you don't find what you want, use another Search Engine!
Who operates Search Engines?
People who can get money from venture capitalists!
Many search engines originate from U.S. universities
Often paid for by advertisements
Engines monitor carefully what else interests you (paid by the click)
How do pages get into a Search Engine?
Robot discovery (following links)
Self submission
Payments
Robot Discovery
Robots visit sites while following links
The more links the more visits
Make sure you don't exclude Robots from visiting public pages
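One common way a site excludes robots is a robots.txt file, so it is worth checking that its rules do not cover public pages. The sketch below uses Python's standard `urllib.robotparser` module; the rules, the URLs and the agent name `ExampleBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every robot is kept out of /private/ only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def robot_may_visit(url, agent="ExampleBot"):
    """Return True if the parsed rules allow the named robot to fetch the URL."""
    return parser.can_fetch(agent, url)

print(robot_may_visit("http://example.org/index.html"))
print(robot_may_visit("http://example.org/private/notes.html"))
```

Running a check like this against your own public URLs is a quick way to confirm that a rule written for one directory has not accidentally hidden the rest of the site.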
Payments
Some search engines only index paying customers
The more you pay the higher you appear on answers to queries
Self submission
Register your page with a search engine
Pay for a company to register you with many search engines
Get registration with many search engines for free!
Getting to the top
Only relevant queries should be ranked highly
Search engines only look at text
Search engine operators try to stop "search engine spamming"
Some queries are pre-answered
Get where you should be!
Put more than graphics on a page
Don't use frames
Use the tag
Make good use of and
Consider using the tag
Get people to link to your page
Summary
Search Engines are vital to the Web user
Search Engines are not perfect by a long way
There are tactics for better searching
Page design can bring more visitors via Search Engines
The more links the better!
WWLib-TNG A Next Generation Search Engine
In the beginning
WWLib-TOS
–Manually constructed directory
–Classified on Dewey Decimal
–Simple data structure
–Proof of concept
The New Architecture
The Classifier
Motive - Why Generate Metadata Automatically?
Meta tags are not compulsory
Old pages are less likely to have meta tags
Available data can be unreliable
The Web of Trust requires comprehensive resource description
An essential prerequisite for widespread deployment of RDF applications
Method - How can Metadata be Generated Automatically?
Using an automatic classifier
The classifier classifies Web Pages according to Dewey Decimal Classification
Other useful metadata can be extracted during the process of automatic classification
Automatic Classification
Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines
DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature
Automatic Classifier - How does it work?
Firstly, the page is retrieved from a URL or local file and parsed to produce a document object.
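The parsing step might look something like the following. This is a sketch and not the WWLib-TNG code: the slides do not describe the document object, so the dictionary layout, the use of raw term frequency as the word weight, and the names `PageParser` and `make_document` are all assumptions. Retrieval is left out; the HTML is supplied as a string.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect the title text and all text content of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        self.text_parts.append(data)

def make_document(html, url):
    """Build a hypothetical document object: title, word count, word weights."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    return {
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        "word_count": len(words),
        "weights": Counter(words),  # assumption: weight = term frequency
    }

html = ("<html><head><title>Bird Watching</title></head>"
        "<body><p>Birds and more birds.</p></body></html>")
doc = make_document(html, "http://example.org/birds.html")
print(doc["title"], doc["word_count"], doc["weights"]["birds"])
```

A fuller version would also skip script and style content and remove very common words before weighting.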
Automatic Classifier - How does it work?
The document object is then compared with DDC objects representing the ten top-level DDC classes.
Automatic Classifier - How does it work?
Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score.
A measure of similarity is then calculated using a similarity coefficient.
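The scoring step can be sketched as below. The slide does not name the similarity coefficient, so a Dice-style ratio (matched weight over total weight) is used here purely as an assumption; the word weights and the names `match_score` and `dice_similarity` are also made up.

```python
def match_score(doc_weights, ddc_weights):
    """Add both weights to the total for every word found in both objects."""
    return sum(doc_weights[w] + ddc_weights[w]
               for w in doc_weights if w in ddc_weights)

def dice_similarity(doc_weights, ddc_weights):
    """Dice-style coefficient (an assumption): matched weight / all weight."""
    total = sum(doc_weights.values()) + sum(ddc_weights.values())
    if total == 0:
        return 0.0
    return match_score(doc_weights, ddc_weights) / total

# Hypothetical word weights for a document and for one DDC object.
doc = {"bird": 3, "nest": 1, "car": 2}
ddc_zoology = {"bird": 2, "animal": 4, "nest": 2}

# "bird" contributes 3 + 2 and "nest" contributes 1 + 2 to the total score.
print(match_score(doc, ddc_zoology))
print(dice_similarity(doc, ddc_zoology))
```

Dividing by the total weight keeps the result between 0 and 1, so documents and DDC objects of different sizes can be compared against a single cut-off.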
Automatic Classifier - How does it work?
If there is a significant measure of similarity, the document will be compared with any subclasses of that DDC class.
If there are no subclasses (i.e. the DDC class is a leaf node), the document is assigned the classmark.
If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy.
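The descent through the hierarchy is a pruned tree walk, sketched below. The threshold of 0.3 standing in for "significant", the similarity function, the `DDCNode` structure and all word weights are assumptions; only a tiny fragment of DDC is modelled (500, 510, 590, 598 and 600).

```python
from dataclasses import dataclass, field

@dataclass
class DDCNode:
    """Hypothetical DDC object: a classmark, word weights and subclasses."""
    classmark: str
    weights: dict
    subclasses: list = field(default_factory=list)

def similarity(doc_weights, ddc_weights):
    """Assumed coefficient: weight of matching words over total weight."""
    total = sum(doc_weights.values()) + sum(ddc_weights.values())
    if total == 0:
        return 0.0
    matched = sum(doc_weights[w] + ddc_weights[w]
                  for w in doc_weights if w in ddc_weights)
    return matched / total

def classify(doc_weights, node, threshold=0.3):
    """Return classmarks of leaf nodes reached through significant matches."""
    if similarity(doc_weights, node.weights) < threshold:
        return []                    # not significant: stop on this branch
    if not node.subclasses:
        return [node.classmark]      # leaf node: assign the classmark
    marks = []
    for child in node.subclasses:    # significant: compare with subclasses
        marks.extend(classify(doc_weights, child, threshold))
    return marks

def classify_document(doc_weights, top_classes, threshold=0.3):
    """Compare the document with each top-level class in turn."""
    marks = []
    for node in top_classes:
        marks.extend(classify(doc_weights, node, threshold))
    return marks

# A tiny, made-up fragment of the hierarchy with invented weights.
birds = DDCNode("598", {"bird": 2, "nest": 1})
maths = DDCNode("510", {"number": 2, "algebra": 2})
animals = DDCNode("590", {"animal": 2, "bird": 1}, [birds])
science = DDCNode("500", {"science": 2, "bird": 1, "number": 1}, [maths, animals])
technology = DDCNode("600", {"machine": 3, "engine": 2})

doc = {"bird": 3, "nest": 1, "science": 1}
print(classify_document(doc, [science, technology]))
```

Because a branch is abandoned as soon as one comparison fails, most of the hierarchy is never visited for any given page.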
Metadata elements
The automatic classification process can be used to extract useful metadata elements in addition to the classification classmarks:
–Keywords
–Classmarks
–Word count
–Title
–URL
–Abstract
A unique accession number and associated dates can be obtained and supplied by the system
Metadata elements - Wolverhampton Core
   Wolverhampton Core         Dublin Core
1  Unique Accession number    Identifier
2  Title
3  URL                        Identifier
4  Abstract                   Description
5  Keywords                   Subject
6  Classmarks                 Subject
7  Word count
8  Classification date
9  Last modified date         Date
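The correspondence in the table can be expressed as a simple lookup. This sketch includes only the pairings that are visible in the table; the field names, the sample record and the function `to_dublin_core` are hypothetical and do not reflect the actual WWLib-TNG record format.

```python
# Wolverhampton Core field -> Dublin Core element, as listed in the table.
WOLVERHAMPTON_TO_DUBLIN = {
    "accession_number": "Identifier",
    "url": "Identifier",
    "abstract": "Description",
    "keywords": "Subject",
    "classmarks": "Subject",
    "last_modified_date": "Date",
}

def to_dublin_core(record):
    """Group a record's values under their Dublin Core equivalents.

    Fields with no listed equivalent are left out of the result.
    """
    dc = {}
    for name, value in record.items():
        target = WOLVERHAMPTON_TO_DUBLIN.get(name)
        if target is None:
            continue
        values = value if isinstance(value, list) else [value]
        dc.setdefault(target, []).extend(values)
    return dc

# Hypothetical record; all values are invented for the example.
record = {
    "accession_number": "acc-0001",
    "title": "Bird Watching",
    "url": "http://example.org/birds.html",
    "keywords": ["bird", "nest"],
    "classmarks": ["598"],
    "word_count": 6,
}
print(to_dublin_core(record))
```

Several Wolverhampton Core elements share one Dublin Core element, so the conversion loses detail: a Dublin Core consumer sees keywords and classmarks alike as Subject values.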
RDF Data Model
RDF Schema
There is a significant overlap with the Dublin Core element set
Requirement for implementation clarity
Elements that have Dublin Core equivalents are declared as sub-properties of those equivalents
Maintain interoperability with Dublin Core applications
RDF Schema Keyword Classmark
Classifier Evaluation
Automatic metadata generation will become important for the widespread deployment of RDF-based applications
Documents created before the invention of RDF-generating authoring tools also need to be described
RDF utilised in this manner may encourage interoperability between search engines
More info:
Current Status of WWLib-TNG
New results interface proposed
–R-wheel (CirSA)
Builder and searcher constructed, now being tested
Classifier constructed
Test Dispatcher/Analyser/Archiver in place