1 Text Properties and Languages. 2 Statistical Properties of Text How is the frequency of different words distributed? How fast does vocabulary size grow.

Slides:



Advertisements
Similar presentations
Simplifications of Context-Free Grammars
Advertisements

1 eXtensible Markup Language. XML is based on SGML: Standard Generalized Markup Language HTML and XML are both based on SGML 2 SGML HTMLXML.
1 Copyright © 2002 Pearson Education, Inc.. 2 Chapter 2 Getting Started.
Chapter 7 System Models.
Copyright © 2003 Pearson Education, Inc. Slide 7-1 Created by Cheryl M. Hughes, Harvard University Extension School Cambridge, MA The Web Wizards Guide.
Copyright © 2003 Pearson Education, Inc. Slide 8-1 Created by Cheryl M. Hughes, Harvard University Extension School Cambridge, MA The Web Wizards Guide.
Copyright © 2003 Pearson Education, Inc. Slide 3-1 Created by Cheryl M. Hughes The Web Wizards Guide to XML by Cheryl M. Hughes.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Copyright © 2003 Pearson Education, Inc. Slide 7-1 Created by Cheryl M. Hughes The Web Wizards Guide to XML by Cheryl M. Hughes.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
Introduction to HTML, XHTML, and CSS
Chapter 6 File Systems 6.1 Files 6.2 Directories
Dr. Alexandra I. Cristea CS 253: Topics in Database Systems: C3.
10. Juni 1998reto ambühler ( WELCOME TO THE GATHERING PLACE.
Programming Language Concepts
LIBRARY WEBSITE, CATALOG, DATABASES AND FREE WEB RESOURCES.
Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,
Web Service Testing RESTful Web Services Snejina Lazarova Dimo Mitev
PP Test Review Sections 6-1 to 6-6
XML and Databases Exercise Session 3 (courtesy of Ghislain Fourny/ETH)
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.
XHTML Week Two Web Design. 2 What is XHTML? XHTML is the current standard for HTML Newest generation of HTML (post-HTML 4) but has many new features which.
What is XML? a meta language that allows you to create and format your own document markups a method for putting structured data into a text file; these.
XML INTRODUCTION Prepared by Hongming Yu Modified by Fernando Farfán.
CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions.
Dr. Alexandra I. Cristea XHTML.
1 An inference engine for the semantic web Naudts Guido Student at the Open University Netherlands.
Chapter 2 Entity-Relationship Data Modeling: Tools and Techniques
Chapter 12 Working with Forms Principles of Web Design, 4 th Edition.
Essential Cell Biology
PSSA Preparation.
Chapter 11 Creating Framed Layouts Principles of Web Design, 4 th Edition.
Simple Linear Regression Analysis
Chapter 13 Web Page Design Studio
Energy Generation in Mitochondria and Chlorplasts
Text Processing. Slide 1 Simple Tokenization Analyze text into a sequence of discrete tokens (words). Sometimes punctuation ( ), numbers (1999),
Information Retrieval and Web Search Text processing Instructor: Rada Mihalcea Class web page:
RefWorks: The Basics October 12, What is RefWorks? A personal bibliographic software manager –Manages citations –Creates bibliogaphies Accessible.
Steffen Staab 1WeST Web Science & Technologies University of Koblenz ▪ Landau, Germany Structured Data on the Web Introduction to.
Introduction Peter Dolog dolog [at] cs [dot] aau [dot] dk Intelligent Web and Information Systems September 9, 2010.
South Dakota Library Network MetaLib User Interface South Dakota Library Network 1200 University, Unit 9672 Spearfish, SD © South Dakota.
XML: text format Dr Andy Evans. Text-based data formats As data space has become cheaper, people have moved away from binary data formats. Text easier.
XML and Enterprise Computing. What is XML? Stands for “Extensible Markup Language” –similar to SGML and HTML –document “tags” are used to define content.
CS 898N – Advanced World Wide Web Technologies Lecture 21: XML Chin-Chih Chang
1 Text Properties and Mark-up Languages. 2 Statistical Properties of Text How is the frequency of different words distributed? How fast does vocabulary.
IN350: Text properties, Zipf’s Law,and Heap’s Law. Judith A. Molka-Danielsen September 12, 2002 Notes are based on Chapter 6 of the Article Collection.
XP 1 CREATING AN XML DOCUMENT. XP 2 INTRODUCING XML XML stands for Extensible Markup Language. A markup language specifies the structure and content of.
XML 1 Enterprise Applications CE00465-M XML. 2 Enterprise Applications CE00465-M XML Overview Extensible Mark-up Language (XML) is a meta-language that.
Information Retrieval and Web Search Text properties (Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan.
Electronic Commerce COMP3210 Session 4: Designing, Building and Evaluating e-Commerce Initiatives – Part II Dr. Paul Walcott Department of Computer Science,
1 Chapter 10: XML What is XML What is XML Basic Components of XML Basic Components of XML XPath XPath XQuery XQuery.
XML 2nd EDITION Tutorial 1 Creating An Xml Document.
1 Statistical Properties for Text Rong Jin. 2 Statistical Properties of Text  How is the frequency of different words distributed?  How fast does vocabulary.
Exploring Text: Zipf’s Law and Heaps’ Law. (a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.
Introduction to XML This presentation covers introductory features of XML. What XML is and what it is not? What does it do? Put different related technologies.
Web Technologies COMP6115 Session 4: Adding a Database to a Web Site Dr. Paul Walcott Department of Computer Science, Mathematics and Physics University.
1 Introduction to XML XML stands for Extensible Markup Language. Because it is extensible, XML has been used to create a wide variety of different markup.
An Introduction to XML Paul Donohue May 8th 2002 Hotel Senator Zürich.
XML Design Goals 1.XML must be easily usable over the Internet 2.XML must support a wide variety of applications 3.XML must be compatible with SGML 4.It.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
Exploring Text: Zipf’s Law and Heaps’ Law. (a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.
XML CSC1310 Fall HTML (TIM BERNERS-LEE) HyperText Markup Language  HTML (HyperText Markup Language): December  Markup  Markup is a symbol.
Statistical Properties of Text
XML in Web Technologies
Text Properties and Languages
Presentation transcript:

1 Text Properties and Languages

2 Statistical Properties of Text How is the frequency of different words distributed? How fast does vocabulary size grow with the size of a corpus? Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.

3 Word Frequency A few words are very common. –2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences. Most words are very rare. –Half the words in a corpus appear only once, called hapax legomena (Greek for “read only once”) Called a “heavy tailed” or “long tailed” distribution, since most of the probability mass is in the “tail” compared to an exponential distribution.

4 Sample Word Frequency Data (from B. Croft, UMass)

5 Zipf’s Law Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ). Zipf (1949) “discovered” that: If probability of word of rank r is p r and N is the total number of word occurrences:

6 Zipf and Term Weighting Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.

Prevalence of Zipfian Laws Many items exhibit a Zipfian distribution. –Population of cities –Wealth of individuals Discovered by sociologist/economist Pareto in 1909 –Popularity of books, movies, music, web-pages, etc. –Popularity of consumer products Chris Anderson’s “long tail” 7

8 Predicting Occurrence Frequencies By Zipf, a word appearing n times has rank r n =AN/n Several words may occur n times, assume rank r n applies to the last of these. Therefore, r n words occur n or more times and r n+1 words occur n+1 or more times. So, the number of words appearing exactly n times is:

9 Predicting Word Frequencies (cont) Assume highest ranking term occurs once and therefore has rank D = AN/1 Fraction of words with frequency n is: Fraction of words appearing only once is therefore ½.

10 Occurrence Frequency Data (from B. Croft, UMass)

11 Does Real Data Fit Zipf’s Law? A law of the form y = kx c is called a power law. Zipf’s law is a power law with c = –1 On a log-log plot, power laws give a straight line with slope c. Zipf is quite accurate except for very high and low rank.

12 Fit to Zipf for Brown Corpus k = 100,000

13 Mandelbrot (1954) Correction The following more general form gives a bit better fit:

14 Mandelbrot Fit P = , B = 1.15,  = 100

15 Explanations for Zipf’s Law Zipf’s explanation was his “principle of least effort.” Balance between speaker’s desire for a small vocabulary and hearer’s desire for a large one. Debate ( ) between Mandelbrot and H. Simon over explanation. Simon explanation is “rich get richer.” Li (1992) shows that just random typing of letters including a space will generate “words” with a Zipfian distribution. –

16 Zipf’s Law Impact on IR Good News: –Stopwords will account for a large fraction of text so eliminating them greatly reduces inverted-index storage costs. –Postings list for most remaining words in the inverted index will be short since they are rare, making retrieval fast. Bad News: –For most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult since they are extremely rare.

17 Vocabulary Growth How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. Vocabulary not really upper-bounded due to proper names, typos, etc.

18 Heaps’ Law If V is the size of the vocabulary and the n is the length of the corpus in words: Typical constants: –K  10  100 –   0.4  0.6 (approx. square-root)

19 Heaps’ Law Data

20 Explanation for Heaps’ Law Can be derived from Zipf’s law by assuming documents are generated by randomly sampling words from a Zipfian distribution.

21 Metadata Information about a document that may not be a part of the document itself (data about data). Descriptive metadata is external to the meaning of the document: –Author –Title –Source (book, magazine, newspaper, journal) –Date –ISBN –Publisher –Length

22 Metadata (cont) Semantic metadata concerns the content: –Abstract –Keywords –Subject Codes Library of Congress Dewey Decimal UMLS (Unified Medical Language System) Subject terms may come from specific ontologies (hierarchical taxonomies of standardized semantic terms).

23 Web Metadata META tag in HTML – META “HTTP-EQUIV” attribute allows server or browser to access information: –

24 Markup Languages Language used to annotate documents with “tags” that indicate layout or semantic information. Most document languages (Word, RTF, Latex, HTML) primarily define layout. History of Generalized Markup Languages: GML(1969)SGML (1985) HTML (1993) XML (1998) Standard HyperText eXtensible

25 Basic SGML Document Syntax Blocks of text surrounded by start and end tags. – Tagged blocks can be nested. In HTML end tag is not always necessary, but in XML it is.

26 HTML Developed for hypertext on the web. – May include code such as Javascript in Dynamic HTML (DHTML). Separates layout somewhat by using style sheets (Cascade Style Sheets, CSS). However, primarily defines layout and formatting.

27 XML Like SGML, a metalanguage for defining specific document languages. Simplification of original SGML for the web promoted by WWW Consortium (W3C). Fully separates semantic information and layout. Provides structured data (such as a relational DB) in a document format. Replacement for an explicit database schema.

28 XML (cont) Allows programs to easily interpret information in a document, as opposed to HTML intended as layout language for formatting docs for human consumption. New tags are defined as needed. Structures can be nested arbitrarily deep. Separate (optional) Document Type Definition (DTD) defines tags and document grammar.

29 XML Example John Doe 38 is shorthand for empty tag Tag names are case-sensitive (unlike HTML) A tagged piece of text is called an element.

30 XML Example with Attributes arroz con pollo 2.30 Attribute values must be strings enclosed in quotes. For a given tag, an attribute name can only appear once.

31 XML Miscellaneous XML Document must start with a special tag. – Tag “id” and “idref” attributes allows specifying graph- structured data as well as tree-structured data. TX Texas AUS Austin

32 Document Type Definition (DTD) Grammar or schema for defining the tags and structure of a particular document type. Allows defining structure of a document element using a regular expression. Expression defining an element can be recursive, allowing the expressive power of a context-free grammar.

33 DTD Example <!DOCTYPE db [ ]> *: 0 or more repetitions ?: 0 or 1 (optional) | : alternation (or) PCDATA: Parsed Character Data (may contain tags)

34 Sample Valid Document for DTD John Doe 26 Robert Doe 55

35 DTD (cont) Tag attributes are also defined: CDATA: Character data (string) IMPLIED: Optional Can define DTD in a separate file:

36 XSL (Extensible Style-sheet Language) Defines layout for XML documents. Defines how to translate XML into HTML. Define style sheet in document: –

37 XML Standardized DTD’s MathML: For mathematical formulae. SMIL (Synchronized Multimedia Integration Language): Scheduling language for web-based multi-media presentations. RDF TEI (Text Encoding Initiative): For literary works. NITF: For news articles. CML: For chemicals. AIML: For astronomical instruments.

38 Parsing XML Process XML file into an internal data format for further processing. SAX (Simple API for XML): Reads the flow of XML text, detecting events (e.g. tag start and end) that are sent back to the application for processing. DOM (Document Object Model): Parses XML text into a tree-structured object- oriented data structure.

39 DOM XML document represented as a tree of Node objects (e.g. Java objects). Node class has subclasses: –Element –Attribute –CharacterData Node has methods: –getParentNode() –getChildNodes()

40 Sample DOM Tree 55 person name ageparent person lastnamefirstname name firstname age lastname JohnDoe 26 DoeRobert Element Character-Data

41 More Node Methods Element node –getTagName() –getAttributes() –getAttribute(String name) CharacterData node –getData() Also methods for adding and deleting nodes and text in the DOM tree, setting attributes, etc.

42 Apache Xerces XML Parser Parser for creating DOM trees for XML documents. Java version 2 available at: – Full Javadoc available at: –

43 Java Specific Document Models Java specific models that exploit Java collections and other classes. JDOM: dom4j: JAXP (part of Sun Java 1.5):