
1 Extracting Lexical Features
Development of software tools for a search engine:
1. Convert an arbitrary pile of textual objects into a well-defined corpus of documents, each containing a string of terms to be indexed.
2. Invert the index, so rather than seeing all the words contained in a particular document, we can find all documents containing particular keywords.
3. (Later chapters) Match queries to indices to retrieve the documents which are most similar.

2 Interdocument Parsing The first step is to break the corpus – an arbitrary “pile of text” – into individually retrievable documents. Our two example text corpora are AIT (AI theses) and email; the documents are the abstracts (AIT) or the entire messages (email). Filters such as DeTex can strip LaTeX markup from a document, and similar filters remove HTML tags.
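As a concrete illustration of interdocument parsing, here is a minimal sketch (assuming a Unix mbox-style mail file, with the conventional “From ” line separating messages; this is an assumption for illustration, not the course's actual tooling) of splitting one file into individually retrievable documents:

```python
import re

def split_mbox(path):
    """Split a Unix mbox file into individual messages (documents).
    Assumes each message begins with the conventional "From " separator line."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # Split at the start of every line that begins with "From ",
    # keeping the separator line with the message that follows it.
    messages = re.split(r"(?m)^(?=From )", text)
    return [m for m in messages if m.strip()]
```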

3 Intradocument Parsing Reading each character of each document, deciding whether it is part of a meaningful token, and deciding whether those tokens are worth indexing is the most computationally intensive aspect of indexing: it must be efficient. Deal with the text in situ rather than making a second copy for the indexing and retrieval system, by creating a system of pointers to locations within the corpus. A lexical analyser tokenises the stream of characters into a sequence of word-like elements using a finite state machine (e.g. the UNIX lex tool is a lexical-analyser generator; PERL can also be used – see the next slide). Fold case: treating upper and lower case interchangeably saves space.
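A minimal sketch of such a tokeniser, written in Python rather than lex or PERL; the token definition (a letter followed by letters, digits or apostrophes) is an assumption for illustration, not the course's actual rule:

```python
import re
from typing import Iterator, Tuple

# Assumed token definition: a letter followed by letters, digits or apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9']*")

def tokenise(text: str) -> Iterator[Tuple[str, int]]:
    """Yield (token, offset) pairs. The offset points back into the original
    text, so the corpus itself never needs to be copied."""
    for match in TOKEN_RE.finditer(text):
        yield match.group(0).lower(), match.start()   # fold case while emitting

# e.g. list(tokenise("The CAR and the cars."))
#   -> [('the', 0), ('car', 4), ('and', 8), ('the', 12), ('cars', 16)]
```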

4

5 Stemming Stemming aims to remove surface markings (such as number) to reveal a token’s root form. Using a token’s root form as an index term can give robust retrieval even when the query contains the plural CARS while the document contains the singular CAR. Linguists distinguish inflectional morphology (plurals, third-person singular, past tense, -ing) from derivational morphology (e.g. teach (verb) vs. teacher (noun)). Weak stemming handles only inflectional endings; strong stemming also strips derivational suffixes.

6 Plural to singular The most common rule: remove a terminal –s. But we can’t simply remove the last –s of –ss or similar endings – that would give crisis → crisi and chess → ches – and irregular forms need their own treatment: woman / women, leaf / leaves, ferry / ferries, fox / foxes, alumnus / alumni. We need a context-sensitive transformational grammar which works reliably over groups of words (e.g. all words ending in –ch). See the next slide.

7 Example stemming rules (.*)SSES → \1SS: PERL-like syntax saying that strings ending in –SSES should be transformed by taking the stem (the characters before –SSES) and adding only the two characters –SS. Similarly, (.*)IES → \1Y. A complete stemmer contains many such rules (60 in Lovins’ set), plus a regime for handling conflicts when multiple rules match the same token, e.g. longest match or rule order.
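A sketch of how such rules can be applied, assuming a simple rule-order regime (the first matching rule wins); the rule set here is only the two rules above plus a generic –S rule, not Lovins’ full set of 60:

```python
import re

# (pattern, replacement) pairs; \1 is the captured stem before the suffix.
RULES = [
    (re.compile(r"^(.*)SSES$"),  r"\1SS"),   # CARESSES -> CARESS
    (re.compile(r"^(.*)IES$"),   r"\1Y"),    # FERRIES  -> FERRY
    (re.compile(r"^(.*[^S])S$"), r"\1"),     # CARS -> CAR; leaves CHESS alone, but
                                             # still gives CRISIS -> CRISI (slide 6)
]

def stem(token: str) -> str:
    """Apply the first rule whose pattern matches (a rule-order conflict regime)."""
    token = token.upper()
    for pattern, replacement in RULES:
        if pattern.match(token):
            return pattern.sub(replacement, token)
    return token   # no rule applies; return the token unchanged
```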

8 Pros and Cons of Stemming Reduces the size of the keyword vocabulary, allowing the index files to be compressed by 10–50%. Increases recall – a query on FOOTBALL now also finds documents on FOOTBALLER(S) and FOOTBALLING. Reduces precision – stripping away morphological features may obscure differences in word meanings. For example, GRAVITY has two senses (the earth’s pull, and seriousness); GRAVITATION can only refer to the earth’s pull – but if we stem it to GRAVITY, it could mean either.

9 Noise words A relatively small number of words account for a very significant fraction of the bulk of all text. Words like IT, AND and TO can be found in virtually every sentence. These noise words make very poor index terms: users are unlikely to ask for documents about TO, and it is hard to imagine a document about BE. Noise words should be ignored by our lexical analyser, e.g. by storing them in a negative dictionary or stop list. In general, noise words are the most frequent words in the corpus, but frequency alone is not a reliable criterion: TIME, WAR, HOME, LIFE, WATER and WORLD are among the 200 most common words in English literature, yet they are perfectly good index terms. Note that the same tokens that are thrown away in IR are precisely those function words that are most important to the syntactic analysis of a well-formed sentence, and are indicators of an author’s individual writing style.
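A minimal sketch of the negative-dictionary check; the handful of stop words shown is an illustrative sample, not the actual stop list used for AIT or email:

```python
# A tiny illustrative negative dictionary; a real stop list holds a few hundred
# of the corpus's most frequent function words.
STOP_LIST = {"it", "and", "to", "the", "of", "a", "in", "be"}

def non_noise_tokens(tokens):
    """Filter out tokens that appear in the negative dictionary (stop list)."""
    return (t for t in tokens if t not in STOP_LIST)

# e.g. list(non_noise_tokens(["the", "war", "and", "peace"])) -> ["war", "peace"]
```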

10 Example Corpus 1: AIT AIT, the Artificial Intelligence Thesis corpus – about 5000 dissertations (mostly Ph.D. and Master’s) in AI from 1987–1997. Structured attributes are ones about which we can reason more formally, using database and AI techniques (thesis number, author, year, university, supervisor, language, degree). Textual fields (IR): the abstract is the primary textual element associated with each thesis, while the title (also a textual field) will be used as its proxy, conveying much of the material in the abstract in a highly abbreviated form. Proxies are important document surrogates, e.g. when users are presented with hit lists of retrieved documents.

11 Example Corpus 2: Your Email Email has structured attributes associated with it, in its header. These include: From, To, Cc, Subject (the proxy text) and Date. Other features we may associate with an email message are whether it was incoming or outgoing, and the folder in which it was stored. The parallels between the two example corpora: both have well-defined authors, time-stamps, and obvious candidates for proxy text.
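As a sketch of how those structured attributes can be separated from the textual field to be indexed, Python's standard email module will parse a raw message; the sample message below is invented for illustration:

```python
from email import message_from_string

raw_message = """From: alice@example.org
To: bob@example.org
Subject: stemming question
Date: Tue, 04 Mar 1997 10:15:00 +0000

Does the indexer fold case before stemming?
"""

msg = message_from_string(raw_message)

# Structured attributes, usable with database-style reasoning.
structured = {field: msg[field] for field in ("From", "To", "Cc", "Subject", "Date")}

proxy_text = msg["Subject"]        # proxy text, analogous to a thesis title
body_text = msg.get_payload()      # the textual field handed to the indexer
```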

12

13 Basic Algorithm for an IR system We now assume that: Prior technology has successfully broken our large corpus into a set of documents; Within each document we have identified individual tokens; Noise words have been identified. Then our basic algorithm proceeds as follows:

14 Algorithm 2.1
for every doc in corpus
    while (token = getNonNoiseToken)
        token = stem(token)
        save Posting(token, doc) in tree
A posting is simply a correspondence between a particular word and a particular document, representing the occurrence of that word in that document.
for every token in tree
    accumulate totdoc(token), totfreq(token)
    sort postings data in descending order of docfreq
    write token, totdoc, totfreq, postings
Also store a file of document lengths for normalisation purposes.
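A sketch of Algorithm 2.1 in Python, using an ordinary dictionary in place of the tree and the tokenise, stem and STOP_LIST helpers sketched on the earlier slides; this illustrates the algorithm's shape, not the course's actual implementation:

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: dict mapping doc id -> text.  Returns (index, doclen)."""
    postings = defaultdict(lambda: defaultdict(int))   # token -> {doc: frequency}
    doclen = defaultdict(int)                          # document lengths, for normalisation
    for doc, text in corpus.items():
        for token, _offset in tokenise(text):          # getNonNoiseToken, in two steps
            if token in STOP_LIST:
                continue
            token = stem(token)
            postings[token][doc] += 1                  # save Posting(token, doc)
            doclen[doc] += 1

    index = {}
    for token, docs in postings.items():
        totdoc = len(docs)                             # number of documents containing token
        totfreq = sum(docs.values())                   # total occurrences across the corpus
        # sort postings in descending order of within-document frequency
        plist = sorted(docs.items(), key=lambda p: p[1], reverse=True)
        index[token] = (totdoc, totfreq, plist)
    return index, doclen
```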

15

16 Refinements to the postings data structure Once the documents’ postings have been sorted into descending order of frequency, it is likely that several of the documents in this list have the same frequency, and we can exploit this fact to compress their representation. We will also consider various keyword weighting schemes.
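One plausible form of that compression (an assumption about the refinement, sketched for illustration) is to store each distinct frequency once, followed by the documents that share it:

```python
from itertools import groupby

def group_by_freq(plist):
    """plist: postings already sorted in descending frequency, e.g. [(doc, freq), ...].
    Returns [(freq, [doc, ...]), ...] so each repeated frequency is stored only once."""
    return [(freq, [doc for doc, _ in run])
            for freq, run in groupby(plist, key=lambda posting: posting[1])]

# e.g. group_by_freq([("d7", 3), ("d2", 1), ("d5", 1), ("d9", 1)])
#   -> [(3, ["d7"]), (1, ["d2", "d5", "d9"])]
```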

17

18 Splay Trees Splay trees are an appropriate data structure for these keywords and their postings. A splay tree is a self-balancing binary search tree with the additional, unusual property that recently accessed elements are quick to access again. Splaying the tree for a certain element rearranges the tree so that the element is placed at the root of the tree (Wikipedia).
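A compact sketch of splaying and insertion, written recursively for readability; a production index would also need deletion and in-order traversal, which are omitted here:

```python
class Node:
    """One keyword with its postings, stored as a splay-tree node."""
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value=None):
        self.key, self.value = key, value
        self.left = self.right = None

def _rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y

def _rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def splay(root, key):
    """Rearrange the tree so that key (or the last node reached while looking
    for it) becomes the root; recently accessed keywords stay near the top."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                        # zig-zig
            root.left.left = splay(root.left.left, key)
            root = _rotate_right(root)
        elif key > root.left.key:                      # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = _rotate_left(root.left)
        return root if root.left is None else _rotate_right(root)
    else:
        if root.right is None:
            return root
        if key > root.right.key:                       # zag-zag
            root.right.right = splay(root.right.right, key)
            root = _rotate_left(root)
        elif key < root.right.key:                     # zag-zig
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = _rotate_right(root.right)
        return root if root.right is None else _rotate_left(root)

def insert(root, key, value):
    """Insert (or update) a keyword, leaving it at the root of the tree."""
    if root is None:
        return Node(key, value)
    root = splay(root, key)
    if root.key == key:
        root.value = value
        return root
    node = Node(key, value)
    if key < root.key:
        node.right, node.left, root.left = root, root.left, None
    else:
        node.left, node.right, root.right = root, root.right, None
    return node
```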

19 Fine points Posting resolution: some query languages provide proximity operators which let users specify how close two keywords must be (e.g. adjacent, in the same sentence, or within a k-word window) – this requires us to retain the exact position of each keyword, not just which document it is in. Emphasising words in proxy text over those used in the rest of the corpus, e.g. by tripling the keyword counters for title text. Quoted email text marked by >>: we only want to index each piece of text once.
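A sketch of the higher-resolution posting this calls for, recording (document, word position) pairs rather than per-document counts; the within-k helper is an assumed illustration of how a proximity operator could then be evaluated:

```python
from collections import defaultdict

def build_positional_index(corpus):
    """corpus: dict doc id -> text.  Returns token -> [(doc, word_position), ...],
    so adjacency and k-word-window operators can be evaluated at query time."""
    index = defaultdict(list)
    for doc, text in corpus.items():
        for position, (token, _offset) in enumerate(tokenise(text)):
            index[token].append((doc, position))
    return index

def within_k(index, a, b, k):
    """True if tokens a and b occur within k words of each other in some document."""
    b_positions = defaultdict(list)
    for doc, pos in index.get(b, []):
        b_positions[doc].append(pos)
    return any(abs(pa - pb) <= k
               for doc, pa in index.get(a, [])
               for pb in b_positions[doc])
```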

