Why is Computational Linguistics Not More Used in Search Engine Technology? John Tait University of Sunderland, UK.

Contents of Talk
Introduction
  – Search Engines
  – Computational Linguistics
Three Questions
Research Agenda
Conclusions

Introduction

Origins
TREC has been running since 1992
  – Only two systems using CL techniques (Strzalkowski et al. in the 1990s and, more recently, Stokoe, Oakes and Tait) have ever shown an improvement on the standard search engine task
Performance on most tasks is improved by using more information
  – Surely dictionaries, grammars, semantics should help?
ACL/COLING CLIIR Workshop in Sydney

Search Engine Process
[Diagram: Web, Crawler, Index (Signature, Data), Query Engine, Searcher]

Index
Compressed and abstracted form of data (web pages) allowing rapid access to some part of that data
  – Simple version
      Maps key words (signature) to URLs (data)
  – Real systems
      Compressed vector of weighted terms (signature) to URLs, plus snippet generation support (+ ads etc.)
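The "simple version" of the index can be sketched as an inverted index. This is a minimal illustration, not how any real search engine is built: the function names `build_index` and `lookup`, the whitespace tokenisation and the example URLs are all assumptions made for the sketch.

```python
from collections import defaultdict


def build_index(pages):
    """Map each key word (signature) to the set of URLs (data) containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every key word of the query (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "http://example.org/a": "yacht sail design",
    "http://example.org/b": "head injury treatment",
    "http://example.org/c": "main sail head design",
}
index = build_index(pages)
print(lookup(index, "head design"))
```

A real system would replace the plain sets with compressed, weighted postings, as the "real systems" bullet says.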

Crawler
Continuously in the background
  – Moves over web pages accumulating signature data
      – term data
      – metadata
      – links (URLs, URIs etc.)
  – Updates the index

Query Engine
User types in a query
  – Usually a short list of key words
Organizes results
  – “Best” pages first
  – Summary to judge page relevance
  – Clickable links
  – Relevant ads
  – Etc.

IR
It’s all about matching documents and queries
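One common way to make "matching documents and queries" concrete is to compare term-count vectors by cosine similarity. The sketch below assumes whitespace tokenisation and raw counts; the function name `cosine` is hypothetical and the talk does not commit to this particular measure.

```python
import math
from collections import Counter


def cosine(query, document):
    """Cosine similarity between the term-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Only terms present in the query can contribute to the dot product.
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


print(cosine("head design", "main sail head design"))
print(cosine("head design", "yacht racing news"))
```

The score depends only on which terms co-occur and how often, with no dictionary, grammar or semantics involved.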

Myths about Statistical IR (i.e. Search Engines)

Myths
1. It doesn’t work
     Google?
2. It’s a dead subject
     40% improvement since TREC began in 1992
     Recent progress with e.g. Language Modeling and Continuous Relevance Models
3. IR people don’t know/care about language
     Karen Sparck Jones, much early work
     Strzalkowski et al. …

Computational Linguistics
What is it?

Characteristics of CL
Assumes existence of
  – Dictionary
  – Grammar
  – Semantics
      Independent of task
      Dependent on word meanings
      Arrived at through composition of words in sentences

Statistical CL
Often
  – Aims to make immediate progress with practical tasks
  – Minimizes assumptions about language
But still shares the common assumptions of CL

IR/Search Engines
Only care about the task

Characteristics of CL
Assumes existence of
  – Dictionary
  – Grammar
  – Semantics
      Independent of task
      Dependent on word meanings
      Arrived at through composition of words in sentences

Three Questions about Why Search Engines don’t use Computational Linguistics

Disclaimer
Question Answering
  – QA systems use CL
  – Do search engines use QA now?
      askJeeves ???
  – Will they in the future?
      Will we ever get general/casual users to type long questions?
      It has long been known that long queries are good, but they are rarely used

Three Questions
Are Computational Linguistic techniques too inaccurate to improve Search Engines?
Is the Search Engine task formed in some way which makes CL techniques ineffective?
Does statistical information retrieval in fact capture the relevant properties of language, but in a form which is inaccessible or hidden?

CL too inaccurate?
Long version
  – Is the problem that computational linguistic techniques are too unreliable or narrowly applicable, so improved performance on some documents or queries is masked by worse performance on others?

Example of Problem
Query “wants” an unusual word sense
  – “main head design”
      Topic “yachting”: “head” → “head of sail”
Irrelevant retrieved document 1 had a signature generated off an inaccurate word sense (“body part”)
  – CL eliminates
Irrelevant not retrieved document 2
  – Word Sense Disambiguation inaccurate
      Added to relevant set

CL too inaccurate?
But best systems do >97% on test data, like Penn Treebank?
  – Overfitted on very sparse data?
  – Don’t do anything like as well on unseen data
  – Especially bad at unseen noun phrases, which are very common search terms

What CL should do
Stop working on pitifully small samples:
  – IR researchers consider 18 gigabytes too small for real statistical significance
Ensure you include overfitting protection in your methodology
  – Always test against genuinely unseen data
Don’t simplify the data
  – But do use “hacks” to make it tractable

Search Task not match CL?
Is the conventional information retrieval task formulated in a way which prevents or obstructs computational linguistics contributing?
  – Short queries
      Not sentences or running text
  – Short ranked lists of highly relevant documents
  – Predetermined document signatures

Search Task not match CL?
CL allows the extraction of structural signatures
1. Bracketing is combinatoric
     – Effect on index size
2. Most queries too short to get structure
     – Remember it’s matching queries and documents (signatures)
3. Many queries too short to disambiguate
     – Really ??? Co-occurrence
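To give a feel for point 1, the sketch below counts the distinct binary bracketings of a phrase of n words, which is the Catalan number C(n-1). Reading "bracketing is combinatoric" as this count is an assumption made for illustration, and the function name `bracketings` is hypothetical.

```python
from math import comb


def bracketings(n_words):
    """Number of distinct binary bracketings of a sequence of n_words words.

    This equals the Catalan number C(n_words - 1).
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)


for length in (2, 3, 4, 5, 10):
    print(length, bracketings(length))
```

The count grows roughly exponentially with phrase length, which suggests why indexing every possible structure could inflate the index.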

What CL should do
Focus on Word Sense Disambiguation
  – Accept the dictionary is more important than grammar
  – Accept proper names/named entities are at least as important as common words
Focus on chunking/triple/phrase extraction
  – Full parsing will only ever help as an intermediate step

IR Captures Relevant Properties
Long version
  – Does statistical information retrieval in fact capture the relevant properties of language, but in a form which is inaccessible or hidden?
  – Just like many machine learning techniques

IR captures relevant properties?
Could be?
  – Success of corpus linguistics
  – Success of data-driven and Machine Learning approaches
      E.g. Statistical MT
      E.g. Textual Entailment

What CL should do
Treat as a legitimate research topic whether, and what, IR term weighting algorithms like BM25 are capturing about language
  – Observation: BM25 looks very like some Machine Learning generated formulae
      Hardly surprising, as BM25 was derived by optimisation over a very large corpus
      Like the Porter Stemmer before it
Consider whether and to what extent division into dictionary, syntax, semantics is “real”
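For readers who have not seen BM25 written out, here is a minimal sketch of one commonly used form of the weighting. The parameter defaults `k1=1.2` and `b=0.75`, the "+1" smoothed IDF variant, the whitespace tokenisation and the function name `bm25_scores` are all assumptions for illustration and are not taken from the talk.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency (kept non-negative by the +1).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalised by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["main sail head design", "head injury treatment", "yacht racing news"]
print(bm25_scores("head design", docs))
```

The formula uses only term counts, document lengths and document frequencies, so any linguistic regularities it captures are implicit in those statistics.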

Some more questions
Are assumptions made in computational linguistics about the nature of lexical semantics and the structural properties of well formed running text in some way ill founded, at least for the information retrieval task?
Is there some specific property of language (for example semantic redundancy or one topic per document) which means that the relatively crude statistical techniques capture enough information to obtain the available improvements in performance?

Lessons
CL has much to learn from IR
  – Having a task changes the game
      Allows the development of effective experimental methodology
      Effective solutions to task problems become the focus
  – Which might in turn stimulate non-task based research

Lessons 2
CL for IR
  – Needs to work on better document signatures
      Small, compressible, characteristics of documents
  – Word sense identifiers
  – Triples
      Noun verb/prep Noun
      Chunks
  – Accept probability

Lesson 3
Show document structure is useful for determining relevance
  – Are sentences useful?
      So can parse trees be useful
  – Human centred evaluation
  – Paragraphs ??
  – Whole Documents ???

Conclusions
IR can benefit from Computational Linguistics techniques
  – But CL research needs to focus on the relevant problems
CL can benefit greatly from trying to get acceptance in IR
  – Focussed task
  – Think of statistical MT

Job Ad
Postdoc positions in multimedia retrieval available in Sunderland
Search for Sunderland IR Group on the Web
See:
  – http://my.sunderland.ac.uk/web/services/hr/recruitment/
  – Search for VITALAS
Email me:
  – John.Tait@sunderland.ac.uk
