1
Documents
Thanks to Bill Arms, Marti Hearst
2
Last time
• Size of information: continues to grow
• IR an old field, goes back to the ’40s
• IR iterative process
• Search engine most popular information retrieval model
• Still new ones being built
3
Focus on documents
Document will be what we:
• Crawl (harvest)
• Index
• Retrieve with query
• Evaluate
• Rank
IR iterative process
4
IR is an Iterative Process
Figure labels: Repositories, Workspace, Goals
5
Figure labels: User’s Information Need, text input, Query, Parse
6
Figure labels: Collections, Pre-process, Index
7
Figure labels: User’s Information Need, Collections, Pre-process, text input, Query, Index, Parse, Rank or Match
8
Evaluation
Figure labels: User’s Information Need, Collections, Pre-process, text input, Query, Index, Parse, Rank or Match, Evaluation, Query Reformulation
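The stages named on this slide (pre-process, index, parse, rank or match, query reformulation) can be sketched in a few lines of Python. This is only an illustrative toy, not the method used in the lecture: the function names, the three-document collection, and the scoring rule (count of distinct query terms a document contains) are all assumptions made for the example.

```python
import re
from collections import defaultdict

def preprocess(text):
    """Lowercase and split into alphanumeric tokens (an assumed, very simple scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(collection):
    """Pre-process every document and map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def rank_or_match(index, query):
    """Parse the query, then score each document by how many distinct query terms it contains."""
    scores = defaultdict(int)
    for term in set(preprocess(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical collection used only for this example.
collection = {
    "d1": "Search engines index web pages",
    "d2": "Information retrieval is an old field",
    "d3": "Web search is an information retrieval task",
}

index = build_index(collection)
results = rank_or_match(index, "web information retrieval")

# Query reformulation: after evaluating the results, the user tries a new query.
reformulated = rank_or_match(index, "search engines")
```

The loop from results back to a new query is what makes the process iterative: the index is built once, while the query side is repeated until the information need is met.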
9
Definitions
Collections consist of Documents
Document
• The basic unit which we will automatically index
• Usually a body of text which is a sequence of terms
• Has to be digital
Tokens or terms
• Basic units of a document, usually consisting of text
• Semantic word or phrase, numbers, dates, etc.
Collections or repositories or corpus
• Particular collections of documents
• Sometimes called a database
Query
• Request for documents on a topic
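One way to picture how these definitions nest is as a small set of data types: a collection holds documents, a document yields terms, and a query is itself a short piece of text. The sketch below is an assumption-laden illustration; the class names, the `terms` and `vocabulary` methods, and the whitespace tokenization are invented for the example and are not from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Document:
    """The basic unit that gets indexed: a body of text, i.e. a sequence of terms."""
    doc_id: str
    text: str

    def terms(self) -> List[str]:
        # Assumption: lowercasing plus whitespace splitting stands in for real tokenization.
        return self.text.lower().split()

@dataclass
class Collection:
    """A particular collection (repository, corpus) of documents."""
    name: str
    documents: List[Document] = field(default_factory=list)

    def vocabulary(self) -> Set[str]:
        # All distinct terms that occur anywhere in the collection.
        return {term for doc in self.documents for term in doc.terms()}

@dataclass
class Query:
    """A request for documents on a topic, expressed here as text."""
    text: str

    def terms(self) -> List[str]:
        return self.text.lower().split()

# Hypothetical two-document corpus.
corpus = Collection("toy corpus", [
    Document("d1", "documents are the atoms of IR"),
    Document("d2", "a collection holds documents"),
])
query = Query("Documents collection")

# Query terms that actually occur somewhere in the collection.
known_terms = [t for t in query.terms() if t in corpus.vocabulary()]
```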
10
Document Collections
Many on the web
• From the text Search Engines: IR in Practice
• Document collections
• Collections
• Corpus collections at UW
• Some searchable but cost to download
11
Collection vs documents vs terms
Figure labels: Terms or tokens, Document
12
What is a Document?
A document is a digital object with an operational definition
• Indexable (usually digital)
• Can be queried and retrieved
Many types of documents
• Text or part of text
• Web page
• Image
• Audio
• Video
• Data
• Etc.
13
Text Documents
A digital text document consists of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
Example?
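As one possible answer to "Example?", the sketch below tokenizes a free-text string into words and punctuation, and splits a fielded string into tagged sections before tokenizing each one. The helper names, the sample strings, and the `<title>`/`<body>` tags are assumptions for illustration; the regular expression handles only simple, non-nested tags and is not a general markup parser.

```python
import re

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def parse_fields(marked_up):
    """Return {tag: tokens} for simple, non-nested <tag>...</tag> sections."""
    sections = re.findall(r"<(\w+)>(.*?)</\1>", marked_up, flags=re.DOTALL)
    return {tag: tokenize(body) for tag, body in sections}

# Free (unstructured) text: one continuous sequence of tokens.
free_text = "IR is an old field, but it continues to grow."

# Fielded (structured) text: sections distinguished by markup (hypothetical tags).
fielded_text = (
    "<title>What is a document?</title>"
    "<body>A document is a digital object.</body>"
)

free_tokens = tokenize(free_text)
fields = parse_fields(fielded_text)
```

With fielded text, the same term can be indexed per section, so a query could be restricted to, say, the title field only.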
14
Why the focus on text?
• Language is the most powerful query model
• Language can be treated as text
• Text has many interesting properties
• Others?
15
What we covered
• Documents are the atoms of IR
• Index terms or tokens in documents
• Terms or tokens will be text
• Interested in collections of documents: repository, corpus, document collection