Documents
Thanks to Bill Arms, Marti Hearst
Last time
• Size of information: continues to grow
• IR is an old field, going back to the ’40s
• IR is an iterative process
• Search engines are the most popular information retrieval model
• New ones are still being built
Focus on documents
A document will be what we:
• Crawl (harvest)
• Index
• Retrieve with query
• Evaluate
• Rank
IR is an Iterative Process
[Figure labels: Repositories, Workspace, Goals]
[Figure, built up over four slides: the IR process]
• User side: User’s Information Need, text input, Parse, Query
• Collection side: Collections, Pre-process, Index
• Query and Index meet at: Rank or Match
• Then: Evaluation, Query Reformulation
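The stages named in the diagram can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular search engine works: the function names (`tokenize`, `build_index`, `search`), the three sample documents, and the ranking rule (sum of raw term counts) are all hypothetical choices made for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Pre-process / parse: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(collection):
    """Index: map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Rank or match: score each document by the summed counts of the query terms."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
collection = {
    "d1": "Information retrieval is an old field.",
    "d2": "Search engines index documents and retrieve documents for a query.",
    "d3": "Text has many interesting properties.",
}
index = build_index(collection)

first_try = search(index, "retrieve documents")
# Query reformulation: the user inspects the results and tries a different wording.
second_try = search(index, "retrieval documents")
print(first_try, second_try)
```

The second call stands in for the query reformulation step: because this sketch matches exact tokens only, "retrieve" and "retrieval" are different terms, so rewording the query changes which documents match.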
Definitions
Collections consist of documents.
• Document: the basic unit which we will automatically index; usually a body of text, which is a sequence of terms; has to be digital
• Tokens or terms: basic units of a document, usually consisting of text: a semantic word or phrase, numbers, dates, etc.
• Collections (or repositories, or corpus): particular collections of documents; sometimes called a database
• Query: a request for documents on a topic
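As a rough illustration of how these definitions relate, the sketch below models a document, its tokens, a collection, and a query as simple Python objects. The class names, the token pattern (ISO-style dates, then numbers, then alphabetic words), and the sample sentences are assumptions made for this example only; real tokenizers handle far more cases.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token pattern: ISO-style dates first, then numbers, then words.
TOKEN_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?|[A-Za-z]+")

@dataclass
class Document:
    """The basic unit to be indexed: an id plus a body of text."""
    doc_id: str
    text: str

    def tokens(self):
        """Return the document's tokens (words, numbers, dates) in order."""
        return TOKEN_PATTERN.findall(self.text)

@dataclass
class Collection:
    """A named set of documents (a repository or corpus)."""
    name: str
    documents: list = field(default_factory=list)

    def vocabulary(self):
        """Return the set of distinct lowercased terms across all documents."""
        return {tok.lower() for doc in self.documents for tok in doc.tokens()}

doc = Document("d1", "The meeting on 2024-03-15 covered 3 topics.")
corpus = Collection("demo", [doc, Document("d2", "Topics include indexing.")])

# A query is a request for documents on a topic; here it is plain text.
query = "documents about indexing"
known_terms = [t for t in TOKEN_PATTERN.findall(query) if t.lower() in corpus.vocabulary()]

print(doc.tokens())
print(known_terms)
```

Note that the date survives as a single token only because the pattern tries the date alternative before the plain-number alternative.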
Document Collections
• Many on the web
• From the text Search Engines: IR in Practice: document collections
• Corpus collections at UW
• Some are searchable but cost to download
Collection vs. documents vs. terms
[Figure labels: Terms or tokens, Document]
What is a Document?
A document is a digital object with an operational definition:
• Indexable (usually digital)
• Can be queried and retrieved
Many types of documents:
• Text or part of text
• Web page
• Image
• Audio
• Video
• Data
• Email
• Etc.
Text Documents
A digital text document consists of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
Example?
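One possible answer to the slide's question, as a hedged sketch: the same content is shown once as free text and once as fielded text. The XML tag names (`doc`, `title`, `author`, `body`), the sample content, and the helper functions are invented for illustration; fielded text in practice may use any markup.

```python
import re
import xml.etree.ElementTree as ET

# Free (unstructured) text: one continuous sequence of tokens.
free_text = "Meeting notes: the index was rebuilt and queries ran faster."

# Fielded (structured) text: sections are distinguished by tags.
fielded_text = """<doc>
  <title>Meeting notes</title>
  <author>A. Student</author>
  <body>The index was rebuilt and queries ran faster.</body>
</doc>"""

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def tokenize_fields(xml_string):
    """Tokenize each top-level field separately, keyed by its tag name."""
    root = ET.fromstring(xml_string)
    return {child.tag: tokenize(child.text or "") for child in root}

free_tokens = tokenize(free_text)
field_tokens = tokenize_fields(fielded_text)
print(free_tokens)
print(field_tokens)
```

With fielded text, each token can be tied to the field it came from, so a query could be restricted to one field, such as the title. With free text, every token belongs to the same undivided sequence.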
Why the focus on text?
• Language is the most powerful query model
• Language can be treated as text
• Text has many interesting properties
• Others?
What we covered
• Documents are the atoms of IR
• Index terms or tokens in documents
• Terms or tokens will be text
• Interested in collections of documents: repository, corpus, document collection