Thanks to Bill Arms, Marti Hearst

Documents

Last time
• Size of information: continues to grow
• IR is an old field; it goes back to the 1940s
• IR is an iterative process
• Search engines are the most popular information retrieval model
• Still new ones being built

Focus on documents
Documents will be what we:
• Crawl (harvest)
• Index
• Retrieve with query
• Evaluate
• Rank

IR is an Iterative Process
[Diagram: Repositories, Workspace, Goals]

[Diagram, built up over several slides. Components: User’s Information Need, text input, Parse, Query; Collections, Pre-process, Index; Rank or Match; Evaluation; Query Reformulation]
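The components named in the diagram can be sketched as one small loop. This is a minimal illustration, not the method used in the lecture: the helper names (`preprocess`, `build_index`, `rank`, `recall`), the three toy documents, and the choice of recall as the evaluation measure are all assumptions made for the example.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?\"'()"

def preprocess(text):
    """Naive pre-processing: split on whitespace, strip punctuation, lowercase."""
    stripped = (word.strip(PUNCTUATION) for word in text.split())
    return [word.lower() for word in stripped if word]

def build_index(collection):
    """Index: map each term to the documents containing it, with counts."""
    index = defaultdict(Counter)
    for doc_id, text in collection.items():
        for term in preprocess(text):
            index[term][doc_id] += 1
    return index

def rank(query, index):
    """Parse the query, then rank documents by total query-term count."""
    scores = Counter()
    for term in preprocess(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def recall(ranked, relevant):
    """Evaluation (assumed measure): fraction of relevant documents retrieved."""
    return len(set(ranked) & relevant) / len(relevant)

# Hypothetical three-document collection.
collection = {
    "d1": "Information retrieval is an old field.",
    "d2": "Search engines index documents and rank documents.",
    "d3": "Users reformulate a query after evaluation.",
}
index = build_index(collection)
relevant = {"d2", "d3"}  # assumed relevance judgments

first_results = rank("documents", index)
first_recall = recall(first_results, relevant)

# Query reformulation: the user adds a term and searches again.
second_results = rank("documents query", index)
second_recall = recall(second_results, relevant)
```

In this toy run the first query retrieves only one of the two relevant documents, and the reformulated query retrieves both, which is the iteration the diagram describes.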

Definitions
Collections consist of documents; documents consist of tokens or terms.
• Document: the basic unit which we will automatically index; usually a body of text, which is a sequence of terms; has to be digital
• Tokens or terms: basic units of a document, usually consisting of text: a semantic word or phrase, numbers, dates, etc.
• Collections or repositories or corpus: particular collections of documents; sometimes called a database
• Query: request for documents on a topic
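One way to make these definitions concrete is as a small data model. The sketch below is illustrative only: the class names mirror the slide's vocabulary, but the fields, the whitespace tokenization, and the sample documents are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """The basic unit we index: a body of text, i.e. a sequence of terms."""
    doc_id: str
    text: str

    def terms(self):
        # Assumed tokenization: lowercase and split on whitespace.
        return self.text.lower().split()

@dataclass
class Collection:
    """A particular collection of documents (repository, corpus)."""
    name: str
    documents: list = field(default_factory=list)

    def vocabulary(self):
        return {term for doc in self.documents for term in doc.terms()}

@dataclass
class Query:
    """A request for documents on a topic."""
    text: str

    def terms(self):
        return self.text.lower().split()

# Hypothetical two-document corpus.
corpus = Collection("demo", [
    Document("d1", "IR goes back to the 1940s"),
    Document("d2", "Search engines index documents"),
])
query = Query("search documents")

# A document matches if it shares at least one term with the query.
matching = [doc.doc_id for doc in corpus.documents
            if set(query.terms()) & set(doc.terms())]
```

Here `matching` holds the ids of documents that share a term with the query; a real system would replace the linear scan with an index.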

Document Collections
• Many on the web
• From the text Search Engines: IR in Practice
• Document collections
• Collections
• Corpus collections at UW
• Some searchable but cost to download

Collection vs documents vs terms
[Diagram labels: Document; Terms or tokens]

What is a Document?
A document is a digital object with an operational definition:
• Indexable (usually digital)
• Can be queried and retrieved
Many types of documents:
• Text or part of text
• Web page
• Image
• Audio
• Video
• Data
• Etc.

Text Documents
A digital text document consists of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms. A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
Example?
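As one possible example of the two kinds of text, the sketch below tokenizes a free-text sentence and a small tagged document. The regular expression, the tag names (`doc`, `title`, `body`), and the sample strings are assumptions chosen for illustration; real collections use many different markup schemes.

```python
import re
import xml.etree.ElementTree as ET

# A token is either a run of word characters or a single punctuation symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into words and other symbols (e.g. punctuation)."""
    return TOKEN.findall(text)

# Free (unstructured) text: one continuous sequence of tokens.
free_text = "IR is an old field, going back to the 1940s."
free_tokens = tokenize(free_text)

# Fielded (structured) text: sections distinguished by tags.
fielded_text = """<doc>
  <title>What is a document?</title>
  <body>A document is a digital object.</body>
</doc>"""

def tokenize_fields(markup):
    """Return a separate token sequence for each tagged section."""
    root = ET.fromstring(markup)
    return {child.tag: tokenize(child.text or "") for child in root}

field_tokens = tokenize_fields(fielded_text)
```

Keeping one token sequence per field lets a query be restricted to a single section, for instance matching only against titles.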

Why the focus on text?
• Language is the most powerful query model
• Language can be treated as text
• Text has many interesting properties
• Others?

What we covered
• Documents are the atoms of IR
• Index terms or tokens in documents
• Terms or tokens will be text
• Interested in collections of documents: repository, corpus, document collection
