Information Retrieval and Web Search Text processing Instructor: Rada Mihalcea Class web page:




Presentation transcript:

Information Retrieval and Web Search Text processing Instructor: Rada Mihalcea Class web page: (Note: This slide set was adapted from an IR course taught by Prof. Ray Mooney at UT Austin)

Slide 1 Last time
Architecture of a classic IR system
–Including main IR components
Main IR models
–Boolean
–Vectorial
–Probabilistic

Slide 2 IR System Architecture
[Diagram. Component labels: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index, Inverted file, Text Database. Flow labels: User Need, Text, Query, Logical View, User Feedback, Retrieved Docs, Ranked Docs.]

Slide 3 IR System Components
Text Operations forms index words (tokens).
–Tokenization
–Stopword removal
–Stemming
Indexing constructs an inverted index of word to document pointers.
–Mapping from keywords to document ids
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
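To make the "mapping from keywords to document ids" concrete, here is a minimal sketch that builds an inverted index over the two example documents. The function name `build_inverted_index` and the choice of lowercased alphabetic tokens are assumptions made for illustration, not part of the original slides.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased alphabetic token to the sorted ids of the docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
print(index["killed"])  # [1]
print(index["noble"])   # [2]
```

A real index would usually also store term frequencies or positions in each posting; this sketch keeps only document ids.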

Slide 4 IR System Components
Searching retrieves documents that contain a given query token from the inverted index.
Ranking scores all retrieved documents according to a relevance metric.
User Interface manages interaction with the user:
–Query input and document output.
–Relevance feedback.
–Visualization of results.
Query Operations transform the query to improve retrieval:
–Query expansion using a thesaurus.
–Query transformation using relevance feedback.
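The searching and ranking components can be sketched together as follows. The relevance metric used here (summed term frequency of the query tokens) is only an assumption chosen for simplicity, and the names `build_index` and `search_and_rank` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Inverted index with term frequencies: term -> {doc_id: count}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index


def search_and_rank(query, index):
    """Retrieve docs containing any query token, then rank by summed term frequency."""
    scores = Counter()
    for token in tokenize(query):
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by doc id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(search_and_rank("Caesar", index))  # [(2, 2), (1, 1)]
```

Doc 2 ranks first for "Caesar" because the token occurs twice in it. Other IR models would swap in a different scoring function.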

Slide 5 Today’s topics
Text operations in IR systems
–Tokenization
–Stopword removal
–Lemmatization
–Stemming
–In an IR system, text operations are applied on ???
On metadata and markup languages
–(if time permits)

Slide 6 Simple Tokenization
Analyze text into a sequence of discrete tokens (words).
Sometimes punctuation ( ), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
–However, frequently they are not.
Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
More careful approach:
–Separate ? ! ; : “ ‘ [ ] ( )
–Care with . - why? when?
–Care with … ??
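The "simplest approach" above can be sketched in a few lines. The function name `simple_tokenize` is an assumption, and the pattern deliberately covers ASCII letters only.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and keep only unbroken runs of alphabetic characters."""
    return re.findall(r"[a-z]+", text.lower())


print(simple_tokenize("The Republican candidate won in 1999!"))
# ['the', 'republican', 'candidate', 'won', 'in']
print(simple_tokenize("U.S.A. vs. USA"))
# ['u', 's', 'a', 'vs', 'usa']
```

The second call shows why the more careful approach treats periods with caution: the abbreviation breaks into single-letter tokens, and the number 1999 in the first call is dropped entirely.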

Slide 7 Punctuation
Ne’er: use language-specific, handcrafted “locale” to normalize.
State-of-the-art: break up hyphenated sequence.
U.S.A. vs. USA - use locale.
a.out

Slide 8 Numbers
3/12/91 Mar. 12, B.C. B
–Generally, don’t index as text
–Creation dates for docs

Slide 9 Case folding
Reduce all letters to lower case
–exception: upper case in mid-sentence
e.g., General Motors
Fed vs. fed
SAIL vs. sail

Slide 10 Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
–Words appearing in URLs.
–Words appearing in “meta text” of images.
Simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization.
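One way to exclude tag information is to let an HTML parser hand over only the text between tags. The sketch below uses Python's standard `html.parser` module. The class name `TextExtractor`, the function `tokenize_html`, and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only character data, ignoring tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z]+", text.lower())


page = (
    '<p>Visit the <a href="http://example.com/retrieval">course page</a> '
    '<img src="logo.png" alt="department logo"></p>'
)
print(tokenize_html(page))  # ['visit', 'the', 'course', 'page']
```

The word "retrieval" in the URL and the image's alt text are both dropped, which is exactly the trade-off the slide asks about. Note that this sketch would still keep the contents of script and style elements, since the parser reports them as character data.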

Slide 11 Stopwords
It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
Stopwords are language dependent.
For efficiency, store strings for stopwords in a hashtable to recognize them in constant time.
–Simple Perl hashtable for Perl-based implementations
How to determine a list of stopwords?
–For English? – may use existing lists of stopwords
E.g. SMART’s commonword list (~ 400)
WordNet stopword list
–For Spanish? Bulgarian?
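The slide suggests a Perl hashtable. As a sketch of the same idea in Python, a `frozenset` is hash-based and gives the same constant-time membership test. The tiny stopword list below contains only the examples from the slide and is not a real list such as SMART's.

```python
import re

# Hash-based set: membership tests take constant time on average.
STOPWORDS = frozenset({"a", "the", "in", "to", "i", "he", "she", "it"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = re.findall(r"[a-z]+", "She gave it to the noble Brutus in Rome".lower())
print(remove_stopwords(tokens))  # ['gave', 'noble', 'brutus', 'rome']
```

Because stopwords are language dependent, a multilingual system would typically keep one such set per language.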

Slide 12 Lemmatization
Reduce inflectional/variant forms to base form
Direct impact on VOCABULARY size
E.g.,
–am, are, is → be
–car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
How to do this?
–Need a list of grammatical rules + a list of irregular words
–Children → child, spoken → speak …
–Practical implementation: use WordNet’s morphstr function
Perl: WordNet::QueryData
–[ Digression: See “Words and Rules” by Steven Pinker
A theory on how the human mind combines rules for regular words with memorization of irregular forms ]
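The "rules + irregular words" recipe can be illustrated with a toy lemmatizer. This is only a sketch: the `IRREGULAR` table holds just the slide's examples, the two rules (strip possessive marker, strip plural "s") are assumptions chosen to reproduce the slide's example sentence, and it is not how WordNet's morphology functions work.

```python
# Irregular forms are memorized; everything else goes through the rules.
IRREGULAR = {
    "am": "be",
    "are": "be",
    "is": "be",
    "children": "child",
    "spoken": "speak",
}


def lemmatize(token):
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    # Rule 1: strip a possessive marker ("boy's" -> "boy", "cars'" -> "cars").
    if token.endswith("'s"):
        token = token[:-2]
    elif token.endswith("'"):
        token = token[:-1]
    # Rule 2: strip a plural "s", but leave short words and "-ss" words alone.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(word) for word in sentence.split()))
# the boy car be different color
```

The rules are far too crude for real text (for example, they would turn "this" into "thi"), which is why a practical system relies on a full rule set and exception list.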

Slide 13 Stemming
Reduce tokens to “root” form of words to recognize morphological variation.
–“computer”, “computational”, “computation” all reduced to same token “compute”
Correct morphological analysis is language specific and can be complex.
Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.
Before stemming: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compres and compres are both accept as equival to compres.

Slide 14 Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary.
Can produce unusual stems that are not English words:
–“computer”, “computational”, “computation” all reduced to same token “comput”
May conflate (reduce to the same token) words that are actually distinct.
May not recognize all morphological derivations.

Slide 15 Typical rules in Porter
sses → ss
ies → i
ational → ate
tional → tion
See class website for link to “official” Porter stemmer site
–Provides Perl, C ready to use implementations
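The four rules listed above can be applied with a simple longest-suffix match. This is a sketch of those rules only, not the Porter stemmer: the real algorithm runs several ordered steps and guards many rules with conditions on the remaining stem, which are omitted here. The name `apply_suffix_rules` is hypothetical.

```python
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
]


def apply_suffix_rules(word, rules=RULES):
    """Rewrite the longest matching suffix; return the word unchanged if none match."""
    for suffix, replacement in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "relational", "conditional", "caesar"]:
    print(word, "->", apply_suffix_rules(word))
# caresses -> caress
# ponies -> poni
# relational -> relate
# conditional -> condition
# caesar -> caesar
```

Trying longer suffixes first matters: "relational" must match "ational" before "tional", otherwise it would become "relation" instead of "relate". Without the stem conditions this sketch also rewrites "rational" to "rate", a reminder of why the full algorithm includes them.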

Slide 16 Porter Stemmer Errors
Errors of “commission”:
–organization, organ → organ
–police, policy → polic
–arm, army → arm
Errors of “omission”:
–cylinder, cylindrical
–create, creation
–Europe, European

Slide 17 Other stemmers
Other stemmers exist, e.g., Lovins stemmer
Single-pass, longest suffix removal (about 250 rules)
Motivated by Linguistics as well as IR
Full morphological analysis - modest benefits for retrieval

Slide 18 Stemming exercise Stemming procedure?

Slide 19 Remainder of today’s lecture
On Metadata
–Often included in Web pages
–Hidden from the browser, but useful for indexing
Information about a document that may not be a part of the document itself (data about data).
Descriptive metadata is external to the meaning of the document:
–Author
–Title
–Source (book, magazine, newspaper, journal)
–Date
–ISBN
–Publisher
–Length

Slide 20 Web Metadata META tag in HTML – META “HTTP-EQUIV” attribute allows server or browser to access information: –
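Since the slide's own META examples did not survive, the following sketch uses a hypothetical page to show how an indexer might read META tags with Python's standard `html.parser`. The class `MetaExtractor`, the function `extract_meta`, and the sample markup are all assumptions for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect content values of META tags, keyed by their name or http-equiv."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        if key and "content" in attributes:
            self.meta[key.lower()] = attributes["content"]


def extract_meta(html):
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical page, not taken from the slides.
page = """<html><head>
<META NAME="keywords" CONTENT="information retrieval, tokenization">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>Hello</body></html>"""

print(extract_meta(page))
```

The extracted keyword string could then be tokenized and indexed like body text, possibly with a different weight.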

Slide 21 RDF
Resource Description Framework.
XML-compatible metadata format.
New standard for web metadata.
–Content description
–Collection description
–Privacy information
–Intellectual property rights (e.g. copyright)
–Content ratings
–Digital signatures for authority

Slide 22 Markup Languages
Language used to annotate documents with “tags” that indicate layout or semantic information.
Most document languages (Word, RTF, LaTeX, HTML) primarily define layout.
History of Generalized Markup Languages:
GML (1969) → SGML (1985, Standard) → HTML (1993, HyperText) → XML (1998, eXtensible)

Slide 23 Basic SGML Document Syntax Blocks of text surrounded by start and end tags. – Tagged blocks can be nested. In HTML, the end tag is not always necessary, but in XML it is.

Slide 24 HTML Developed for hypertext on the web. – May include code such as JavaScript in Dynamic HTML (DHTML). Separates layout somewhat by using style sheets (Cascading Style Sheets, CSS). However, primarily defines layout and formatting.

Slide 25 XML
Like SGML, a metalanguage for defining specific document languages.
Simplification of original SGML for the web promoted by WWW Consortium (W3C).
Fully separates semantic information and layout.
Provides structured data (such as a relational DB) in a document format.
Replacement for an explicit database schema.

Slide 26 XML (cont’d)
Allows programs to easily interpret information in a document, as opposed to HTML intended as layout language for formatting docs for human consumption.
New tags are defined as needed.
Structures can be nested arbitrarily deep.
Separate (optional) Document Type Definition (DTD) defines tags and document grammar.

Slide 27 XML Example
John Doe 38
is shorthand for empty tag
Tag names are case-sensitive (unlike HTML)
A tagged piece of text is called an element.
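The markup of the slide's example was lost in extraction, so the sketch below uses a hypothetical document with invented tag names (`person`, `first`, `last`, `age`, `email`) to illustrate the three points: nested elements, the empty-tag shorthand, and case-sensitive tag names. It relies on Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tag names are assumptions, not the slide's originals.
doc = """<person>
  <name><first>John</first><last>Doe</last></name>
  <age>38</age>
  <email/>
</person>"""

root = ET.fromstring(doc)
print(root.find("name/first").text)  # John
print(root.find("age").text)         # 38

# <email/> is shorthand for <email></email>: both parse to an element with no text.
print(root.find("email").text)                  # None
print(ET.fromstring("<email></email>").text)    # None

# Tag names are case-sensitive, so "Age" does not match the <age> element.
print(root.find("Age"))                         # None
```

Each tagged piece of text (`first`, `last`, `age`) comes back as an element object, which matches the slide's terminology.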

Slide 28 XML Example with Attributes
arroz con pollo 2.30
Attribute values must be strings enclosed in quotes.
For a given tag, an attribute name can only appear once.
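Both attribute rules can be checked with a standard XML parser, which rejects documents that break them. The markup below is hypothetical (the slide's original tags were lost; only the text "arroz con pollo" and "2.30" survive), and `is_well_formed` is a helper name introduced for this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element with two quoted attributes.
item = ET.fromstring('<item language="Spanish" price="2.30">arroz con pollo</item>')
print(item.attrib["language"], item.attrib["price"], item.text)

# Unquoted attribute value: rejected.
print(is_well_formed("<item price=2.30>arroz con pollo</item>"))  # False

# The same attribute name twice on one tag: rejected.
print(is_well_formed('<item price="2.30" price="2.50"/>'))        # False
```

Note that the parser returns "2.30" as a string. Interpreting it as a number is left to the application.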

Slide 29 Document Type Definition (DTD)
Grammar or schema for defining the tags and structure of a particular document type.
Allows defining structure of a document element using a regular expression.
Expression defining an element can be recursive, allowing the expressive power of a context-free grammar.

Slide 30 DTD Example
<!DOCTYPE db [ ]>
*: 0 or more repetitions
?: 0 or 1 (optional)
| : alternation (or)
PCDATA: Parsed Character Data (may contain tags)
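The DTD operators behave like their regular-expression counterparts. As a rough illustration, the sketch below checks an element's children against a hypothetical content model, `<!ELEMENT person (name, age?, email*)>`, by translating it by hand into a Python regular expression over the sequence of child tag names. This is an analogy only: it is not DTD validation, and the element names are invented because the slide's declarations were lost.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the hypothetical declaration
#   <!ELEMENT person (name, age?, email*)>
# Each child tag is written as its name followed by a space.
PERSON_MODEL = re.compile(r"name (age )?(email )*")


def matches_model(element, model):
    """Check whether the element's child tags, in order, satisfy the content model."""
    child_tags = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_tags) is not None


ok = ET.fromstring("<person><name/><email/><email/></person>")
wrong_order = ET.fromstring("<person><age/><name/></person>")
two_ages = ET.fromstring("<person><name/><age/><age/></person>")

print(matches_model(ok, PERSON_MODEL))           # True
print(matches_model(wrong_order, PERSON_MODEL))  # False
print(matches_model(two_ages, PERSON_MODEL))     # False
```

A validating parser performs this kind of check for every element, using the declarations in the DTD instead of a hand-written pattern.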

Slide 31 DTD (cont’d)
Tag attributes are also defined:
CDATA: Character data (string)
IMPLIED: Optional
Can define DTD in a separate file: