LIN3098 Corpus Linguistics, Lecture 3 Albert Gatt.

Slides:



Advertisements
Similar presentations
CSCI N241: Fundamentals of Web Design Copyright ©2004 Department of Computer & Information Science Introducing XHTML: Module B: HTML to XHTML.
Advertisements

LIS650lecture 1 XHTML 1.0 strict Thomas Krichel
Corpus Linguistics Richard Xiao
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
XHTML Week Two Web Design. 2 What is XHTML? XHTML is the current standard for HTML Newest generation of HTML (post-HTML 4) but has many new features which.
What is XML? a meta language that allows you to create and format your own document markups a method for putting structured data into a text file; these.
Dr. Alexandra I. Cristea XHTML.
Internet Services and Web Authoring (CSET 226) Lecture # 5 HyperText Markup Language (HTML) 1.
Website Design.
1. Content – Collective term for all text, images, videos, etc. that you want to deliver to your audience. 2. Structure – How the content is placed on.
XHTML 16-Apr-17.
IS 373—Web Standards Todd Will
17-Jun-15 XHTML 2 What is XHTML? XHTML stands for Extensible Hypertext Markup Language XHTML is aimed to replace HTML.
Sistemi basati su conoscenza XML Prof. M.T. PAZIENZA a.a
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Sistemi basati su conoscenza XML Prof. M.T. PAZIENZA a.a
XML Primer. 2 History: SGML vs. HTML vs. XML SGML (1960) XML(1996) HTML(1990) XHTML(2000)
Introduction to XML: Yong Choi CSU Bakersfield.
Introducing XHTML: Module B: HTML to XHTML. Goals Understand how XHTML evolved as a language for Web delivery Understand the importance of DTDs Understand.
Introduction to XML This material is based heavily on the tutorial by the same name at
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
Introducing HTML & XHTML:. Goals  Understand hyperlinking  Understand how tags are formed and used.  Understand HTML as a markup language  Understand.
ECA 228 Internet/Intranet Design I Intro to XML. ECA 228 Internet/Intranet Design I HTML markup language very loose standards browsers adjust for non-standard.
1 Statistical NLP: Lecture 6 Corpus-Based Work. 2 4 Text Corpora are usually big. They also need to be representative samples of the population of interest.
Creating a Simple Page: HTML Overview
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
XML introduction to Ahmed I. Deeb Dr. Anwar Mousa  presenter  instructor University Of Palestine-2009.
DAT602 Database Application Development Lecture 14 HTML.
ULI101 – XHTML Basics (Part II) What is Markup Language? XHTML vs. HTML General XHTML Rules Block Level XHTML Tags XHTML Validation.
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
Introduction to XML. XML - Connectivity is Key Need for customized page layout – e.g. filter to display only recent data Downloadable product comparisons.
Learning Web Design: Chapter 4. HTML  Hypertext Markup Language (HTML)  Uses tags to tell the browser the start and end of a certain kind of formatting.
CP2022 Multimedia Internet Communication1 HTML and Hypertext The workings of the web Lecture 7.
XHTML. Introduction to XHTML What Is XHTML? – XHTML stands for EXtensible HyperText Markup Language – XHTML is almost identical to HTML 4.01 – XHTML is.
TEXT ENCODING INITIATIVE (TEI) Inf 384C Block II, Module C.
 2008 Pearson Education, Inc. All rights reserved Introduction to XHTML.
1 XML: an introduction David Nathan. 2 XML  an in-line markup system  single sequence of plain text only (but can be unicode)  equivalent to a tree.
EXtensible Markup Language (XML) and Documentation --ManojBokil -- Manoj Bokil.
IS1811 Multimedia Development for Internet Applications Lecture 4: Introduction to HTML Rob Gleasure
1 What is HTML? Standardized codes Web pages SGML Descriptive markup Tags.
How do I use HTML and XML to present information?.
XML eXtensible Markup Language. Topics  What is XML  An XML example  Why is XML important  XML introduction  XML applications  XML support CSEB.
XP Tutorial 9 1 Working with XHTML. XP SGML 2 Standard Generalized Markup Language (SGML) A standard for specifying markup languages. Large, complex standard.
XML 2nd EDITION Tutorial 1 Creating An Xml Document.
XP 2 HTML Tutorial 1: Developing a Basic Web Page.
HTML: Hyptertext Markup Language Doman’s Sections.
WEB APPLICATION DEVELOPMENT For More visit:
Introduction to XML This presentation covers introductory features of XML. What XML is and what it is not? What does it do? Put different related technologies.
XP 1 Creating an XML Document Developing an XML Document for the Jazz Warehouse XML Tutorial.
XML Introduction. What is XML? XML stands for eXtensible Markup Language XML stands for eXtensible Markup Language XML is a markup language much like.
What it is and how it works
XML for Text Markup An introduction to XML markup.
XML Design Goals 1.XML must be easily usable over the Internet 2.XML must support a wide variety of applications 3.XML must be compatible with SGML 4.It.
Lecture: Web Design Assis. Prof. Freshta Hanif Ehsan Faculty of Computer Science Kabul Polytechnic University Spring Semester
HTML Basics Computers. What is an HTML file? *HTML is a format that tells a computer how to display a web page. The documents themselves are plain text.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
Introduction to DTD A Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list.
HTML Basics. HTML Coding HTML Hypertext markup language The code used to create web pages.
XML CSC1310 Fall HTML (TIM BERNERS-LEE) HyperText Markup Language  HTML (HyperText Markup Language): December  Markup  Markup is a symbol.
XML. HTML Before you continue you should have a basic understanding of the following: HTML HTML was designed to display data and to focus on how data.
XP 1 HTML Tutorial 1: Developing a Basic Web Page.
Creating Your 1 st Web Page. Tags Refers to anything between on a webpage Most appear in pairs surrounding content Some appear as empty tags (no closing.
Tutorial #1 Using HTML to Create Web Pages. HTML, XHTML, and CSS HTML – HyperText Markup Language The tags the browser uses to define the content of the.
Web Design Principles 5 th Edition Chapter 3 Writing HTML for the Modern Web.
XML QUESTIONS AND ANSWERS
Introduction to XHTML.
Presentation transcript:

LIN3098 Corpus Linguistics, Lecture 3 Albert Gatt

Goals of this lecture  Focus on annotation: 1.what makes a good annotation scheme; 2.what standards exist; 3.what markup languages exist.

Corpora and annotation  Unannotated corpora: simple plain text the linguistic information is implicit e.g. no explicit representation of man as a noun  Annotated corpora: no longer just text real repositories of linguistic information  the relevant linguistic information is now explicit

Types of corpora  Corpora are often defined according to what kind of annotation they contain. part-of-speech annotation (tagging)  annotation of morphosyntactic categories (BNC) parsed corpora (treebanks)  annotation of syntactic structure (Penn Treebank, SMULTRON) anaphora  annotation of pronominal coreferents in context (GNOME corpus)

How is it done?  Depends on the type of annotation being carried out.  Many kinds of annotation are done manually.  Some kinds of annotation, especially POS tagging can be done semi-automatically: many available POS taggers start with a manually tagged sample of text train the tagger on the sample tagger is then applied to new data, and tries to “guess” the POS of new words this is not an error-free process! Current state of the art achieves about 96-7% accuracy

BNC example from Lecture 1   Explosives  found  on  Hampstead  Heath . Explosives found on Hampstead Heath new sentence plural noun past tense verb preposition proper noun punctuation

The Penn Treebank parsed corpus (S (NPSBJ1 Chris) (VP wants (S (NPSBJ *1) (VP to (VP throw (NP the ball))))))  Predicate Argument Structure: wants(Chris, throw(Chris, ball)) Empty embedded subject linked to NP subject no. 1

The GNOME anaphora corpus <ne cat="pn" per="per3" num="sing" gen="neut" ani="inanimate" disc="disc-old"> Dermovate Cream is <ne cat="a-np" per="per3" num="sing" gen="neut" ani="inanimate" disc="disc-new"> a strong and rapidly effective treatment

Part 1 Annotation principles, standards and guidelines

Annotation Principles (Leech 1993) 1.Recoverability: it should be possible to remove the annotation and extract the raw text 2.Extractability: it should be possible to extract the annotation itself to store it separately 3.Transparency of guidelines: the annotation should be based on explicit guidelines which are available to the end user

Annotation Principles (II) 4.Transparency of method It should be clear who annotated what (often many people are involved in the project) Typically, projects will also report some statistical measure of inter-annotator agreement The extent to which different annotators agree will reflect on:  how good the guidelines are  how theory-neutral the annotation is

Annotation principles (III) 5.Fallibility The annotation scheme is not infallible; the user should be made aware of this. E.g. the BNC documentation actually reports on errors in the POS tagging 6. Theory-neutrality As far as possible, the annotation should not be based on narrow theoretical principles. E.g. A treebank with syntactic info is usually parsed with a simple, context-free grammar. Using something more specific, like Chomsky’s Principles and Parameters Framework, would mean it’s useful to a narrower community.

Annotation principles (IV) 7. Standards: no single annotation scheme has the right to be considered an a priori standard e.g. there are many different formats for annotating part of speech info, or syntactic structure However, there are published standards which provide a minimum for format and amount of information to include.

Comments on Leech (1993)  Rather than standards, these are “desiderata” for annotation schemes.  They don’t really specify the form or content of an annotation scheme.  However, there have been concerted efforts to define real standards to which corpora should conform.

The concept of a markup language  A markup language provides a way of specifying meta-data about a document.  Why “language”? it specifies a basic “vocabulary” of elements; it specifies a syntax for well-formed expressions.

The “SGML” family of markup languages  SGML (Standard Generalised Markup Language): one of the first truly standardised formalisms  Basic idea: create a tag which has some “meaning”  e.g. means “word”, means “paragraph” wrap portions of a document with start/end tags  e.g. chair  end tags can often be omitted: chair the “meaning” of the tag must be specified tag can have attributes:  e.g. tags can be nested inside eachother

Descendants of SGML: HTML  HTML: “Hypertext Markup Language” developed by the World-Wide Web Consortium (W3C) based on the SGML tagging principle defines a basic representation language for document layout used by web browsers: when you visit a page, your browser “interprets” the html and renders the layout visually. fixed set of tags such as:  : paragraph  : image  etc

Descendants of SGML: XML  XML: Extensible Markup Language developed by the World-Wide Web Consortium (W3C) nowadays, this is ubiquitous, and has largely replaced SGML as the markup language of choice stricter syntax than SGML: end-tags can’t be omitted less complex than SGML in other ways unlike HTML, specifies only a syntax; the actual tags can be anything depending on the application.

XML documents are trees DOCUMENT PARAGRAPH SENTENCE WORD SENTENCE PARAGRAPH

XML Documents are trees … … DOCUMENT PARAGRAPH SENTENCE WORD SENTENCE PARAGRAPH

Meta-data in XML  What properties does a book have? author, ISBN, publisher, number of pages, genre: fiction, etc John Smith CUP Lost in translation …  This contains “data” such as John SMith, CUP, Lost in Translation… tags have attributes (e.g. gender for author, type for book)  It contains meta-data (data about the data) in the form of tags  Easy for a machine to know which pieces of information are about what.

The Text Encoding Initiative (TEI)  Sponsored by the main academic bodies with an interest in machine-readable textual markup.  Aims: provide standardised formats for annotation allow interchange of data: If corpus X is annotated according to TEI standards, then it is easy to:  develop tools to “read” the annotation  make the annotation comprehensible to others  NB: The TEI does not specify the content, i.e. what the annotation should contain. It does specify how it should be done, i.e. the form.

The “document” according to TEI  A document (e.g. a corpus text) consists of: a header  information about the text such as author, date, source, etc. the text itself  including annotation of textual elements, such as paragraphs, words, etc Encoded using tags and entity references a Document Type Declaration (DTD)  a formal representation which tells a computer program what elements the text contains, and what they mean

In graphics… HEADER TEXT element … DTD Usually, the DTD is a separate document, explaining what each annotated element means

Example: Structure of a BNC document (fragment) (description of the file) (source of the text, including publisher) (the actual text + annotation)

Markup language  The TEI uses SGML  Tags in SGML (and TEI): Always use angle brackets Indicate start and end  text  end-tag often omitted if not required Used for text elements:  paragraph, word, sentence…

Markup language (cont/d)  TEI also specifes a format for entity references: an entity reference is a kind of abbreviation for some detailed formatting or linguistic information  Format: enclosed using & and ;  Example: é  represents the letter e with an acute accent, i.e. é man&nn1;  represents the information that man is a noun in the singular  Interpretation of entity references: each different entity reference used in the text is defined in detail in the document header

Example: tags and references in a BNC document (fragment) there are between 40–60,000 people … Sentence element with number Word element with Part of Speech Word element + entity reference – = a dash

Beyond format: Content guidelines  EAGLES “Expert Advisory Groups on Language Engineering Standards” EU-sponsored teams of experts who drew up guidelines on many aspects of language engineering, including corpus annotation.  Aim: “best-practice” recommendations on what to annotate, at all levels (textual, part-of-speech, etc) cover a wide variety of languages guidelines on corpora are TEI-conformant.  Main document: Corpus Encoding Standard (CES). Assumes SGML as the markup language.  Later development: XCES: The CES using XML as the markup language.

Part 2 Levels of corpus annotation

LIN Corpus Linguistics Textual/Extra-textual level  Information about the text, origins etc. cf the earlier example of the BNC header cf. McEnery & Wilson’s examples from other corpora  Extra-textual information can be very detailed, e.g. include gender of author.  Textual information can include things like questions, abbreviations and their expansions, etc.

LIN Corpus Linguistics Orthographic level  Problems with different alphabets, accents etc. Maltese: ħ, ġ, ż, ċ; German: umlaut etc; Russian: cyrillic alphabet  TEI recommends use of entity references: ù  ù ġ  &gdot; also, recommends sticking to the basic (“English”) ISO-646 character set  More recently, the UNICODE standard provides for a single, unified representation of all characters in (hopefully) all alphabets and writing systems as they are, without needing any special graphics capabilities. every character is mapped to a unique numeric code all codes are readable by a computer  TEI also recommends representing changes of typography etc (boldface, italic...) using start/end tags.

LIN Corpus Linguistics The challenges of spoken data  Speech does not contain “sentences” but “utterances”.  Transcription of spoken data entails decisions about: whether to assume sentence-based transcription or intonation units what to do about pauses, false starts, coughing... what to do about interruptions and overlapping speech whether to add punctuation  Example: London-Lund corpus uses intonation units for speech, with no punctuation

LIN Corpus Linguistics Spoken data in the BNC You got ta Radio Two with that.  Many other tags to mark non-linguistic phenomena... Utterance tag + speaker ID attribute Sentence tag within utterance Non-verbal action during speech Pauses marked with duration Unclear, non- transcribed speech

LIN Corpus Linguistics Levels of linguistic annotation  part-of-speech (word-level)  lemmatisation (word-level)  parsing (phrase & sentence-level)  semantics (multi-level) semantic relationships between words and phrases semantic features of words  discourse features (supra-sentence level)  phonetic transcription  prosody

Part of speech tagging  Purpose: Label every token with information about its part of speech.  Requirements: A tagset which lists all the relevant labels.

LIN Corpus Linguistics Part of speech tagsets  Tagging schemes can be very granular. Maltese example: VV1SR: verb, main, 1st pers, sing, perf  imxejt – “I walked” VA1SP: verb, aux, 1st pers, sing, past  kont miexi – “I was walking” NNSM-PS1S: noun, common, sing, masc + poss. pronoun, sing, 1st pers  missier-i – “my father”

How POS Taggers tend to work 1.Start with a manually annotated portion of text (usually several thousand words). the/DET man/NN1 walked/VV 2.Extract a lexicon and some probabilities from it. E.g. Probability that a word is NN given that the previous word is DET. Used for tagging new (previously unseen) words. 3.Run the tagger on new data.

LIN Corpus Linguistics Challenges in POS tagging  Recall that the process is usually semi-automatic.  Granularity vs. correctness the finer the distinctions, the greater the likelihood of error manual correction is extremely time- consuming

LIN Corpus Linguistics EAGLES recommendations on POS tagging  Set of obligatory features for all languages Noun, verb, interjection, unique, residual, etc  Set of recommended features: Noun: number, gender, case, type  Set of optional features: generic: apply to “all” languages (e.g. noun=count or mass) language-specific: e.g. Danish has a suffixed definite article, so has a “definiteness” feature for Nouns