Document Indexing and Scoring in Solr
IST 441, Spring 2010. Instructor: Dr. C. Lee Giles. Presented by: Sujatha Das. Slides courtesy of Saurabh Kataria and Madian Khabsa.
Outline
Brief overview of Solr architecture
Indexing
Scoring/searching
Features of Solr
Built on top of Lucene
XML/HTTP interfaces
Schema to define types and fields
Web administration interface
Caching/replication/extensibility
YouSeer Architecture
Documents flow from crawling through parsing to indexing and searching. A crawler (Heritrix for the web, a file-system crawler for local files) fetches documents (PDF, HTML, DOC, TXT, ...); format-specific parsers (PDF parser, HTML parser, TXT parser) extract their text; an analyzer (Stop Analyzer, Standard Analyzer, or your own) tokenizes it; the indexer submits the resulting Solr documents to the index; and the YouSeer searcher answers queries against that index.
Lucene's index (conceptual)
An index contains Documents; each Document contains Fields; each Field is a name/value pair.
Solr documents
Solr accepts well-formed XML documents:
<add>
  <doc>
    <field name="URL"></field>
    <field name="title">CNN Breaking News – Obama wins</field>
    <field name="content">Barack Obama is the 44th president of the USA</field>
    <field name="pubDate">T23:59:59.999Z</field>
  </doc>
</add>
Adding documents via YouSeer
To index documents crawled by Heritrix, navigate to ~/youseer/release and run:
java -jar YouSeer.jar /absolute/path/to/arc/files /cachingDirectory 1 0 <Solr URL>
The arguments are: the full path to the ARC files; the virtual directory which maps to the cached files; the number of threads (keep it at 1); the waiting time between retries; and the Solr URL.
YouSeer workflow
YouSeer iterates over the compressed ARC files and processes each document: it extracts the document's textual content, parses the metadata (from Heritrix), and generates an XML file as output. This XML file is submitted to the index via the Solr update URL.
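The final submission step can be sketched in Python. The update URL is an assumption (a stock Solr install exposes an update handler at /solr/update), field values are assumed already XML-escaped, and the helper names are illustrative:

```python
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed default Solr URL

def build_add_doc(fields):
    # Build an <add><doc>...</doc></add> payload from a dict of field names
    # to values. Values are assumed to be already XML-escaped.
    field_xml = "".join(
        '<field name="%s">%s</field>' % (name, value)
        for name, value in fields.items()
    )
    return "<add><doc>%s</doc></add>" % field_xml

def post_to_solr(xml, url=SOLR_UPDATE_URL):
    # Submit the XML to Solr's update handler over HTTP POST.
    req = urllib.request.Request(url, data=xml.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    return urllib.request.urlopen(req)

xml = build_add_doc({"title": "CNN Breaking News",
                     "content": "Barack Obama is the 44th president of the USA"})
# post_to_solr(xml)  # requires a running Solr instance
```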
Analyzers specify how the text in a field is to be indexed
Options in Lucene:
WhitespaceAnalyzer divides text at whitespace.
SimpleAnalyzer divides text at non-letters and converts it to lower case.
StopAnalyzer removes stop words.
StandardAnalyzer is good for most European languages.
You can also create your own analyzers.
An example of analyzing a document
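As a rough Python sketch of what SimpleAnalyzer- and StopAnalyzer-style analysis do to a piece of text (the stop-word list here is an illustrative assumption, not Lucene's actual list):

```python
import re

def simple_analyze(text):
    # SimpleAnalyzer-style: split at non-letters, lowercase, drop empty tokens.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def stop_analyze(text, stopwords=frozenset({"the", "of", "is", "a", "to"})):
    # StopAnalyzer-style: SimpleAnalyzer output minus stop words
    # (the stop-word set above is assumed for illustration).
    return [t for t in simple_analyze(text) if t not in stopwords]
```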
Analyzers are specified in schema.xml
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
What is an index comprised of?
Inverted Index (Inverted File)
Doc 1: Penn State Football … football
Doc 2: Football players … State
The posting table maps each word to a posting id and its occurrences: football (posting id 1) occurs in both Doc 1 and Doc 2; penn (2), players (3), and state (4) likewise point to the documents and offsets where they occur.
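A minimal Python sketch of building such an inverted index over the two example documents (word offsets within each document stand in for the slide's offsets; posting ids and other satellite data are omitted):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each word to the list of (doc id, offset) postings where it occurs.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].append((doc_id, offset))
    return dict(index)

docs = {"Doc 1": "penn state football football",
        "Doc 2": "football players state"}
index = build_inverted_index(docs)
```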
Postings table is a fast look-up mechanism
Key: word. Value: posting id and satellite data (df, offset, …).
Lucene implements the posting table with Java's hash table (java.util.Hashtable). The hash function depends on the JVM: hc2 = hc1 * 31 + nextChar.
Postings table usage:
Indexing: insertion (new terms) and update (existing terms).
Searching: lookup, and construction of the document vector.
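The recurrence hc2 = hc1 * 31 + nextChar is Java's classic String.hashCode. A Python re-implementation, for illustration:

```python
def java_string_hash(s):
    # Java's String.hashCode recurrence: h = h * 31 + char,
    # computed in 32-bit arithmetic.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    # Convert to a signed 32-bit result, as Java would return.
    return h - 0x100000000 if h >= 0x80000000 else h
```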
Lucene Index Files: Field infos file (.fnm)
Format: FieldsCount, <FieldName, FieldBits>
FieldsCount: the number of fields in the index.
FieldName: the name of the field, as a string.
FieldBits: a byte and an int, where the lowest bit of the byte shows whether the field is indexed, and the int is the id of the term.
Example: 1, <content, 0x01>
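A sketch of the .fnm logical layout in Python. Only the "indexed" bit of FieldBits is modeled, and the output is a Python list rather than the on-disk byte format, so this is purely illustrative:

```python
def encode_field_infos(fields):
    # Sketch of the .fnm logical layout: FieldsCount, then
    # <FieldName, FieldBits> pairs. Bit 0 of FieldBits = field is indexed
    # (all other bits are simplified away in this sketch).
    return [len(fields)] + [(name, 0x01 if indexed else 0x00)
                            for name, indexed in fields]
```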
Lucene Index Files: Term Dictionary file (.tis)
Format: TermCount, TermInfos
TermInfos: <Term, DocFreq>
Term: <PrefixLength, Suffix, FieldNum>
This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
TermCount: the number of terms in the documents.
Term text prefixes are shared: PrefixLength is the number of initial characters from the previous term which must be prepended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and this term is "boy", the PrefixLength is two and the suffix is "y".
FieldNum: the term's field, whose name is stored in the .fnm file.
Example: 4, <<0,football,1>,2> <<0,penn,1>,1> <<1,layers,1>,1> <<0,state,1>,2>
Document frequency can be obtained from this file.
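The prefix-sharing scheme can be sketched as front coding in Python; applied to the example terms it reproduces the <PrefixLength, Suffix> pairs shown above (e.g. "players" after "penn" becomes <1, layers>):

```python
def shared_prefix_len(a, b):
    # Length of the common initial character run of two strings.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def compress_terms(sorted_terms):
    # .tis-style front coding: store <PrefixLength, Suffix> against the
    # previous term in sorted order.
    out, prev = [], ""
    for term in sorted_terms:
        p = shared_prefix_len(prev, term)
        out.append((p, term[p:]))
        prev = term
    return out

def decompress_terms(entries):
    # Rebuild the full terms from <PrefixLength, Suffix> pairs.
    terms, prev = [], ""
    for p, suffix in entries:
        prev = prev[:p] + suffix
        terms.append(prev)
    return terms
```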
Lucene Index Files: Term Info index (.tii)
Format: IndexTermCount, IndexInterval, TermIndices
TermIndices: <TermInfo, IndexDelta>
This file contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.
IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
Example: 4, <football,1> <penn,3> <layers,2> <state,1>
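A sketch of selecting every IndexInterval-th term for the in-memory index. List positions stand in for file offsets, and the small interval in the example is chosen for readability, so this is an illustration of the idea rather than the on-disk format:

```python
def build_term_index(terms, index_interval=128):
    # .tii sketch: keep every IndexInterval-th term together with its position
    # in the (sorted) term list, so the full dictionary in the .tis file can be
    # reached by a small in-memory search plus a short sequential scan.
    return [(pos, term) for pos, term in enumerate(terms)
            if pos % index_interval == 0]
```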
Lucene Index Files: Frequency file (.frq)
Format: <TermFreqs>
TermFreqs: <TermFreq>; TermFreq: DocDelta, Freq?
TermFreqs are ordered by term (the term is implicit, from the .tis file). TermFreq entries are ordered by increasing document number.
DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.
For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3.
Example: <<2, 2, 3> <3> <5> <3, 3>>
Term frequency can be obtained from this file.
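The DocDelta encoding can be sketched in Python; it reproduces the 15, 8, 3 sequence from the example above:

```python
def encode_term_freqs(postings):
    # postings: list of (doc_number, frequency) pairs, sorted by doc number.
    # Encodes each as DocDelta (odd => freq 1; even => freq follows as next int).
    out, prev_doc = [], 0
    for doc, freq in postings:
        delta = doc - prev_doc
        if freq == 1:
            out.append(delta * 2 + 1)   # odd DocDelta encodes freq == 1
        else:
            out.append(delta * 2)       # even DocDelta: freq is the next int
            out.append(freq)
        prev_doc = doc
    return out
```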
Lucene Index Files: Position file (.prx)
Format: <TermPositions>
TermPositions: <Positions>; Positions: <PositionDelta>
TermPositions are ordered by term (the term is implicit, from the .tis file). Positions entries are ordered by increasing document number (the document number is implicit, from the .frq file).
PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4.
Example: <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>
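Position deltas can be sketched the same way. For a term at position 4 in one document and positions 5 and 9 in the next, the per-document deltas come out as 4, then 5 and 4, matching the sequence in the example above:

```python
def encode_positions(docs_positions):
    # docs_positions: one sorted list of term positions per document.
    # Within each document, each position is stored as the delta from the
    # previous occurrence (the first occurrence is a delta from zero).
    out = []
    for positions in docs_positions:
        prev, deltas = 0, []
        for p in positions:
            deltas.append(p - prev)
            prev = p
        out.append(deltas)
    return out
```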
Query Processing
Field info (in memory) and the Term Info Index (in memory) are consulted first, each in constant time. From the Term Info Index, constant-time steps lead to the Term Dictionary, the Frequency File, and the Position File, each read via random file access.
Searching a Solr index
Types of query:
Boolean: [IST441 Giles], [IST441 OR Giles], [java AND NOT SUN]
Wildcard: [nu?ch], [nutc*]
Phrase: ["JAVA TOMCAT"]
Proximity: ["lucene nutch"~10]
Fuzzy: [roam~] matches roams and foam
Field search: field:value
Scoring function is specified in schema.xml
Similarity: score(Q,D) = coord(Q,D) · queryNorm(Q) · ∑_{t in Q} ( tf(t in D) · idf(t)² · t.getBoost() · norm(D) )
Question: what type of IR model does Lucene use?
Term-based factors (default implementation):
tf(t in D): term frequency of term t in document D.
idf(t): inverse document frequency of term t in the entire corpus.
Default scoring function
coord(Q,D) = (overlap between Q and D) / (maximum overlap), where maximum overlap is the maximum possible length of overlap between Q and D.
queryNorm(Q) = 1 / (sum of squared weights)^½, where sum of squared weights = q.getBoost()² · ∑_{t in Q} ( idf(t) · t.getBoost() )².
If t.getBoost() = 1 and q.getBoost() = 1, then sum of squared weights = ∑_{t in Q} idf(t)², and thus queryNorm(Q) = 1 / (∑_{t in Q} idf(t)²)^½.
norm(D) = 1 / (number of terms)^½. This normalizes by the total number of terms appearing in document D.
Example: D1: hello, please say hello to him. D2: say goodbye.
Q: you say hello
coord(Q, D) = overlap between Q and D / maximum overlap: coord(Q, D1) = 2/3, coord(Q, D2) = 1/2.
queryNorm(Q) = 1/(sum of squared weights)^½, with sum of squared weights = q.getBoost()² · ∑_{t in Q} ( idf(t) · t.getBoost() )². With t.getBoost() = 1 and q.getBoost() = 1, sum of squared weights = ∑_{t in Q} idf(t)², so queryNorm(Q) = 1/(0.5945² + 1²)^½ = 1/(1.3534)^½ = 0.8596.
tf(t in d) = frequency^½: tf(you,D1) = 0, tf(say,D1) = 1, tf(hello,D1) = 2^½ = 1.4142; tf(you,D2) = 0, tf(say,D2) = 1, tf(hello,D2) = 0.
idf(t) = ln(N/(nj+1)) + 1: idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) + 1 = 1.
norm(D) = 1/(number of terms)^½: norm(D1) = 1/6^½ = 0.4082, norm(D2) = 1/2^½ = 0.7071.
Score(Q, D1) = 2/3 · 0.8596 · (1 · 0.5945² + 1.4142 · 1²) · 0.4082 = 0.4135
Score(Q, D2) = 1/2 · 0.8596 · (1 · 0.5945²) · 0.7071 = 0.1074
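The worked example can be checked with a short Python script (all boosts taken as 1, and idf(you) taken as 0, following the slide):

```python
import math

def idf(N, df):
    # Classic Lucene idf as given on the slide: ln(N / (df + 1)) + 1.
    return math.log(N / (df + 1)) + 1

def score(coord, query_norm, tf_idf2_terms, doc_norm):
    # score(Q,D) = coord * queryNorm * sum(tf * idf^2) * norm(D), boosts = 1.
    return coord * query_norm * sum(tf_idf2_terms) * doc_norm

N = 2                                    # two documents in the corpus
idf_say, idf_hello = idf(N, 2), idf(N, 1)
# idf(you) is taken as 0 per the slide, so it drops out of queryNorm.
query_norm = 1 / math.sqrt(idf_say**2 + idf_hello**2)
tf = math.sqrt                           # tf(t in d) = frequency ** 0.5

score_d1 = score(2/3, query_norm,
                 [tf(1) * idf_say**2, tf(2) * idf_hello**2],
                 1 / math.sqrt(6))       # D1 has 6 terms
score_d2 = score(1/2, query_norm,
                 [tf(1) * idf_say**2],
                 1 / math.sqrt(2))       # D2 has 2 terms
```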
References
http://youseer.sourceforge.net/doc/Tutorial.pdf