Published by Ashlee Hill. Modified over 9 years ago.
1
Why is Computational Linguistics Not More Used in Search Engine Technology? John Tait University of Sunderland, UK
2
Contents of Talk Introduction –Search Engines –Computational Linguistics Three Questions Research Agenda Conclusions
3
Introduction
4
Origins TREC has been running since 1992 –Only two systems using CL techniques (Strzalkowski et al. in the 1990s and, more recently, Stokoe, Oakes and Tait) have ever shown an improvement on the standard search engine task Performance on most tasks is improved by using more information –Surely dictionaries, grammars, semantics should help? ACL/COLING CLIIR Workshop in Sydney
5
Search Engine Process [Diagram: Web, Crawler, Index (Signature, Data), Query Engine, Searcher]
6
Index Compressed and abstracted form of data (web pages) allowing rapid access to some part of that data –Simple version Maps key words (signature) to URLs (data) –Real systems Map a compressed vector of weighted terms (signature) to URLs plus snippet generation support (+ ads etc.)
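The "simple version" of the index can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real search engine is built: the function names `build_index` and `lookup`, the example URLs, and the page texts are all hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict


def build_index(pages):
    """Map each key word (signature) to the set of URLs (data) containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every key word in the query (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, for illustration only.
pages = {
    "http://example.org/sails": "main head design for racing yachts",
    "http://example.org/anatomy": "head and neck anatomy",
}
index = build_index(pages)
```

Calling `lookup(index, "head design")` returns only the sails page, while `lookup(index, "head")` returns both pages, which is the ambiguity the later slides discuss.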
7
Crawler Continuously in the background –Moves over web pages accumulating Signature data –term data –metadata –links (URLs, URIs etc.) –Updates the index
8
Query Engine User types in a query –Usually short list of key words Organizes results –“Best” pages first –Summary to judge page relevance –Clickable links –Relevant ads –Etc.
9
IR It’s all about matching documents and queries
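One common way to make "matching documents and queries" concrete is to turn both into term-count signatures and rank documents by cosine similarity. The sketch below assumes raw term counts as weights and whitespace tokenization; the names `signature`, `cosine`, and `rank` are hypothetical, and real systems use more refined weighting.

```python
import math
from collections import Counter


def signature(text):
    """Bag-of-words signature: term -> raw count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count signatures."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document names ordered from best to worst match."""
    q = signature(query)
    scored = [(cosine(q, signature(text)), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Note that the same signature function is applied to the query and to every document, so anything the signature cannot represent (word sense, structure) is invisible to the match.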
10
Myths about Statistical IR I.e. Search Engines
11
Myths 1.It doesn’t work Google? 2.It’s a dead subject 40% improvement since TREC began in 1992 Recent progress with e.g. Language Modeling and Continuous Relevance Models 3.IR people don’t know/care about language Karen Sparck Jones: much early work Strzalkowski et al. …
12
Computational Linguistics What is it ?
13
Characteristics of CL Assumes Existence of –Dictionary –Grammar –Semantics Independent of task Dependent on word meanings Arrived at through composition of words in sentences
14
Statistical CL Often –Aims to make immediate progress with practical tasks –Minimizes assumptions about language But still shares the common assumptions of CL
15
IR/Search Engines Only care about the task
16
Characteristics of CL Assumes Existence of –Dictionary –Grammar –Semantics Independent of task Dependent on word meanings Arrived at through composition of words in sentences
17
Three Questions about Why Search Engines don’t use Computational Linguistics
18
Disclaimer Question Answering –QA systems use CL –Do search engines use QA now? askJeeves ??? –Will they in the future? Will we ever get general/casual users to type long questions? We have known for a long time that long queries are good, but they are rarely used
19
Three Questions Are Computational Linguistic Techniques too inaccurate to improve Search Engines? Is the Search Engine Task formed in some way which makes CL techniques ineffective? Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?
20
CL too inaccurate ? Long Version –Is the problem that computational linguistic techniques are too unreliable or narrowly applicable, so improved performance on some documents or queries is masked by worse performance on others?
21
Example of Problem Query “wants” an unusual word sense –“main head design” Topic “yachting”: “head” means “head of sail” Irrelevant retrieved document 1 had a signature generated from an inaccurate word sense (“body part”) –CL eliminates it Irrelevant not-retrieved document 2 –Word Sense Disambiguation inaccurate, so added to relevant set
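The "head" example can be illustrated with a toy overlap-based disambiguator in the spirit of the simplified Lesk algorithm. This is a sketch only: the two-sense inventory and its glosses are invented for illustration, `disambiguate` is a hypothetical name, and no stop-word handling is attempted.

```python
# Hypothetical two-sense inventory for "head"; the glosses are invented.
SENSES = {
    "body part": "upper part of the body containing the brain eyes and mouth",
    "head of sail": "top corner of a sail attached to the halyard on a yacht mast",
}


def disambiguate(context, senses):
    """Pick the sense whose gloss shares the most words with the context.

    Returns None when no sense overlaps or when the best score is tied,
    i.e. when the context gives no basis for a decision.
    """
    context_words = set(context.lower().split())
    overlaps = {
        sense: len(context_words & set(gloss.lower().split()))
        for sense, gloss in senses.items()
    }
    best = max(overlaps.values())
    winners = [sense for sense, n in overlaps.items() if n == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]
```

With only the three query words, `disambiguate("main head design", SENSES)` has nothing to go on and returns `None`; adding topical context such as "yacht sail" selects the "head of sail" sense.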
22
CL too inaccurate ? But best systems do >97% on test data - like Penn Treebank ? –Overfitted on very sparse data? –Don’t do anything like as well on unseen data –Especially bad at unseen noun phrases - very common search terms
23
What CL should do Stop working on pitifully small samples: –IR researchers consider 18 gigabytes too small for real statistical significance Ensure you include overfitting protection in your methodology –Always test against genuinely unseen data Don’t simplify the data –But do use “hacks” to make it tractable
24
Search task not match CL Is the conventional information retrieval task formulated in a way which prevents or obstructs computational linguistics contributing? –Short queries Not sentences, running text –Short Ranked Lists of highly relevant documents –Predetermined document signatures
25
Search Task not match CL ? CL allows the extraction of structural signatures 1.Bracketing is combinatoric –Effect on index size 2.Most queries too short to get structure –Remember it’s matching queries and documents (signatures) 3.Many queries too short to disambiguate –Really ??? Co-occurrence
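To give a feel for why "bracketing is combinatoric", assume the slide refers to the number of distinct binary bracketings of a phrase. For a sequence of n words that number is the Catalan number C(n-1). The function name `bracketings` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def bracketings(n_words):
    """Number of distinct binary bracketings of a sequence of n_words words.

    This is the Catalan number C(k) = comb(2k, k) / (k + 1) with k = n_words - 1.
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)
```

Three words already admit two bracketings, five words fourteen, and ten words several thousand, so storing every structural reading of every phrase would inflate an index quickly.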
26
What CL should do Focus on Word Sense Disambiguation –Accept the dictionary is more important than grammar –Accept proper names/named entities are at least as important as common words Focus on chunking/triple/phrase extraction –Full parsing will only ever help as an intermediate step
27
IR Captures Relevant Properties Long Version –Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden? –Just like many machine learning techniques
28
IR captures relevant properties? Could be ? –Success of corpus linguistics –Success of data driven and Machine Learning approaches E.g. Statistical MT E.g. Textual Entailment
29
What CL should do Treat as a legitimate research topic whether, and what, IR term weighting algorithms like BM25 are capturing about language –Observation: BM25 looks very like some Machine Learning generated formulae Hardly surprising as BM25 was derived by optimisation over a very large corpus Like the Porter Stemmer before it Consider whether and to what extent the division into dictionary, syntax, semantics is “real”
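For readers who have not seen BM25 written out, here is a minimal sketch of one widely used textbook form of the scoring function. The parameter defaults `k1=1.2` and `b=0.75` are conventional choices rather than values from the talk, the "+ 1" inside the logarithm is one common variant that keeps the IDF non-negative, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalised by document length.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[name] = score
    return scores
```

The formula uses only counts: term frequency, document frequency, and document length. Whatever it captures about language, it does so without a dictionary, a grammar, or semantics, which is the point of the question on this slide.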
30
Some more questions Are assumptions made in computational linguistics about the nature of lexical semantics and the structural properties of well formed running text in some way ill founded, at least for the information retrieval task? Is there some specific property of language (for example semantic redundancy or one topic per document) which means that the relatively crude statistical techniques capture enough information to obtain the available improvements in performance?
31
Lessons CL has much to learn from IR –Having a task changes the game Allows the development of effective experimental methodology Effective solutions to task problems become the focus –Which might in turn stimulate non-task based research
32
Lessons 2 CL for IR –Needs to work on better document signatures Small, compressible characteristics of documents –Word sense identifiers –Triples Noun verb/prep Noun Chunks –Accept probability
33
Lesson 3 Show document structure is useful for determining relevance –Are sentences useful? If so, can parse trees be useful? –Human centred evaluation –Paragraphs ?? –Whole Documents ???
34
Conclusions IR can benefit from Computational Linguistics Techniques –But CL research needs to focus on the relevant problems CL can benefit greatly from trying to get acceptance in IR –Focussed task –Think of statistical MT
35
Job Ad Postdoc positions in multimedia retrieval available in Sunderland Search for Sunderland IR Group on the Web See: –http://my.sunderland.ac.uk/web/services/hr/recruitment/ –Search for VITALAS Email me: –John.Tait@sunderland.ac.uk