Indexing CSCI 572: Information Retrieval and Search Engines Summer 2010.

May-20-10CS572-Summer2010CAM-2
Outline
Building your search corpus
Differences from RDBMS
The Document/Field Model
  – The Flattening Process
  – Understanding Field Types
Challenges

Building an index
Once you have content in the form of metadata and extracted text, you need to persist that content
  – For querying
  – For retrieval and display
How should we persist the content?

Some considerations
Extracted metadata is typically unstructured
  – It does not necessarily map to a set of Entities (Tables) with rows and consistent columns
  – Documents have different, sometimes non-overlapping, metadata models
      Dublin Core
      Word
      Climate Forecast
The write/access patterns are a bit different
  – Think crawling strategies…

Databases versus Search Indices
Databases are optimized for
  – Write often
  – Read often
  – Transactional properties in the face of the above
      Atomic – operations should occur atomically, or be rolled back
      Consistent – writes, etc., should be propagated in a consistent fashion
      Isolated – a transaction’s modifications are limited to the entities it modifies and do not interfere with concurrent transactions
      Durable – committed writes are expected to persist, and thus be resilient in the face of catastrophic failure

Databases versus Search Indices
Search Indices are optimized for
  – Write infrequently
  – Read very frequently
  – Based on a loose, unstructured Document model
  – Limited transactional properties
      ACID not necessary
      Onus to produce results quickly
      Rollback most often not supported
      Subject to corruption
  – Extremely efficient in terms of query read times by exploiting the above

The Document Field Model
A method of dealing with unstructured data and its persistence to an index
Treat each indexable content item as a “Document”
  – Each Document has 1…N named Fields
  – Each Field has 1…N values
Values can be:
  – Text
  – Numerical
  – Hierarchical (made up of other fields)
  – Complex (Geospatial, etc.)
[Diagram: a Document containing Field1: v…vn and Field2: v…vn]
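
One way to picture the model above is a small Python sketch. The class name `Document` and its `add`/`get` methods are hypothetical names chosen for illustration, not the API of any particular search library, and treating the author field as two separate values is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """One indexable content item: named fields, each holding 1..N values.

    A value may be text, a number, or (for hierarchical data) another Document.
    """
    fields: Dict[str, List[Any]] = field(default_factory=dict)

    def add(self, name: str, *values: Any) -> "Document":
        # Every field must carry at least one value.
        if not values:
            raise ValueError("a field needs at least one value")
        self.fields.setdefault(name, []).extend(values)
        return self

    def get(self, name: str) -> List[Any]:
        # Unknown fields simply have no values; there is no fixed schema.
        return self.fields.get(name, [])


page = (
    Document()
    .add("title", "CS572 Web Page")
    .add("length", 10000)
    .add("author", "Chris Mattmann", "Univ. of Southern California")
)
print(page.get("author"))
```

Because a Document is just a bag of named, multi-valued fields, two Documents in the same index do not need to share any field names.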

Example: two web pages
Document 1
  – Field [title], Value(s): “Chris Mattmann’s Web Page”
      Type: string (text)
  – Field [length], Value(s): 3026
      Type: int (assumed to be bytes)
  – Field [author], Value(s): Chris Mattmann
Document 2
  – Field [title], Value(s): “CS572 Web Page”
      Type: string (text)
  – Field [length], Value(s): 10000
      Type: int (assumed to be bytes)
  – Field [author], Value(s): Chris Mattmann, Univ. of Southern California

Example: a word document
Document 3
  – Field [title], Value(s): “My CS572 Final Project”
      Type: string (text)
  – Field [length], Value(s): 30012
      Type: int (assumed to be bytes)
  – Field [wordcount], Value(s): 2912
      Type: int
  – Field [mswordversion], Value(s): 2008, Mac
      Type: string (text)

Apples to Oranges
Whether it’s an HTML page, a Word document, a PDF file, etc.
  – We can still use the Document/Field model to represent the content as it is indexed
The Document Field model works for Metadata, but also for extracted text
  – Define a custom text field containing all extracted, searchable text
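
The custom text field could be built roughly as follows. This is a minimal sketch, assuming documents are plain dicts of field name to value list; the function name `add_catch_all` and the field name `text` are hypothetical.

```python
from typing import Any, Dict, List


def add_catch_all(doc: Dict[str, List[Any]], extracted_text: str = "",
                  target: str = "text") -> Dict[str, List[Any]]:
    """Return a copy of doc with one extra field holding all searchable text."""
    # Collect every string value from the metadata fields, in field order.
    parts = [v for values in doc.values() for v in values if isinstance(v, str)]
    # Append the body text pulled out of the file, if any was extracted.
    if extracted_text:
        parts.append(extracted_text)
    out = dict(doc)
    out[target] = [" ".join(parts)]
    return out


# Two documents with different, only partly overlapping metadata.
web_page = {"title": ["CS572 Web Page"], "length": [10000],
            "author": ["Chris Mattmann"]}
word_doc = {"title": ["My CS572 Final Project"], "wordcount": [2912],
            "mswordversion": ["2008, Mac"]}

for doc in (web_page, word_doc):
    print(add_catch_all(doc)["text"])
```

Both documents end up with the same `text` field even though their other fields differ, so a single query field can search across file types.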

What about structure?
For example, let’s say we are extracting Person records from an RDBMS to index
We’ve got 2 tables
  – Person
      Attribute: id, int PK UNIQUE AUTO INCREMENT
      Attribute: first_name, VARCHAR(255)
      Attribute: last_name, VARCHAR(255)
  – PersonAddress
      Attribute: person_id, FK to Person.id
      Attribute: address_txt, VARCHAR(255)
      Attribute: zipcode, int

What about structure?
Example records
  – Person: id, first_name, last_name
      1, Chris, Mattmann
      2, Homer, Simpson
  – PersonAddress: person_id, address_txt, zipcode
      1, 1234 Joe Lane, 91354
      2, 6344 Evergreen Terrace, Springfield, IL, 60999

What about structure?
How to get the aforementioned rows into the Document Field model?
  – Flatten the structure
Document 1
  – Field [first_name], Value(s): Chris
      Type: string (text)
  – Field [last_name], Value(s): Mattmann
      Type: string (text)
  – Field [id], Value(s): 1
      Type: int
  – Field [addresstxt], Value(s): 1234 Joe Lane
      Type: string (text)
  – Field [zipcode], Value(s): 91354
      Type: int
Document 2
  – Field [first_name], Value(s): Homer
      Type: string (text)
  – Field [last_name], Value(s): Simpson
      Type: string (text)
  – Field [id], Value(s): 2
      Type: int
  – Field [addresstxt], Value(s): 6344 Evergreen Terrace, Springfield, IL
      Type: string (text)
  – Field [zipcode], Value(s): 60999
      Type: int
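
The flattening step might look like the sketch below. It uses Python’s built-in `sqlite3` as a stand-in for the RDBMS; the function name `flatten_people` and the dict-of-lists document shape are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the RDBMS holding the two example tables.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE Person (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name VARCHAR(255),
        last_name VARCHAR(255)
    );
    CREATE TABLE PersonAddress (
        person_id INTEGER REFERENCES Person(id),
        address_txt VARCHAR(255),
        zipcode INTEGER
    );
    INSERT INTO Person VALUES (1, 'Chris', 'Mattmann'), (2, 'Homer', 'Simpson');
    INSERT INTO PersonAddress VALUES
        (1, '1234 Joe Lane', 91354),
        (2, '6344 Evergreen Terrace, Springfield, IL', 60999);
""")


def flatten_people(conn):
    """Join Person with PersonAddress and emit one flat document per person."""
    rows = conn.execute("""
        SELECT p.id AS id, p.first_name AS first_name, p.last_name AS last_name,
               a.address_txt AS address_txt, a.zipcode AS zipcode
        FROM Person p LEFT JOIN PersonAddress a ON a.person_id = p.id
        ORDER BY p.id
    """)
    docs = {}
    for row in rows:
        # The first row seen for a person creates the document.
        doc = docs.setdefault(row["id"], {
            "id": [row["id"]],
            "first_name": [row["first_name"]],
            "last_name": [row["last_name"]],
            "addresstxt": [],
            "zipcode": [],
        })
        # Each joined address row adds one more value to the multi-valued fields.
        if row["address_txt"] is not None:
            doc["addresstxt"].append(row["address_txt"])
            doc["zipcode"].append(row["zipcode"])
    return list(docs.values())


for doc in flatten_people(conn):
    print(doc)
```

A person with several addresses would get several values in `addresstxt` and `zipcode`, kept in matching positions. The foreign-key relationship itself is not carried into the index.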

Benefits of the Document Field model
Documents are independent, wholly contained entities
  – Reduces ACID dependencies
  – Increases the ability to become eventually consistent
Fields can be indexed and stored in different ways
  – Reformatted on entry into the index, and reformatted on the way out
      Geohash is a great example of this
      Analyzers – implications on query model
      Tokenizers – implications on query model
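
To make the Geohash remark concrete, here is a minimal sketch of the standard geohash encoding. A latitude/longitude pair is reformatted into a short base-32 string on its way into the index. The function name `geohash_encode` is hypothetical and the sample coordinates are arbitrary; how a given search engine stores geospatial fields internally may differ.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a point as a geohash string with `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid   # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744))  # prints u4pru
```

Nearby points tend to share a leading prefix, so a plain string field can support coarse "near here" lookups through prefix matching. Two close points on either side of a cell boundary can still have different prefixes.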

Challenges
Reducing structured data to unstructured, flattened data isn’t as easy as the cooked-up example
  – Imagine having to encode values to preserve ordering in some fashion
Requires deep understanding of the data, and methodologies for naming fields and ordering values
Loss of ACID properties makes it difficult to leverage index structure for search directly in transactional systems
  – Have to stand up search as a separate service outside of the data management system
Determining the right tuning parameters for indexing
  – Max Buffer Size, When to Optimize, When to Merge, etc.
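
As a small illustration of encoding values to preserve ordering, the sketch below assumes an index that compares field values as strings. Zero-padding keeps string order in line with numeric order. The function name `encode_sortable_int` and the width of 10 are hypothetical choices; the numbers are the byte lengths from the earlier example documents.

```python
def encode_sortable_int(value: int, width: int = 10) -> str:
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    text = str(value)
    if len(text) > width:
        raise ValueError("value does not fit in the chosen width")
    return text.zfill(width)


lengths = [3026, 10000, 30012]

# Compared as plain strings, the values sort in the wrong order.
naive = sorted(str(n) for n in lengths)
# Padded to a fixed width, string order agrees with numeric order.
padded = sorted(encode_sortable_int(n) for n in lengths)

print(naive)
print(padded)
```

Negative numbers, floats, and dates each need their own encoding, which is part of why this step requires knowing the data well.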

Wrapup
Introduction to the Document Field indexing model
Differences between traditional RDBMS models and Search indices
Know when and where to use each
Search is optimized for frequent reads and infrequent writes
