TEI Workshop Digitization of Text 文字數位化 Reasons, Methods, Stages.

Humanities Computing 人文資訊學 - Digital Humanities 1
Main applications (so far):
● Digitization of known tools:
  – encyclopedias 百科全書
  – dictionaries & glossaries 辭典
  – bibliographies, indices 參考書目, 索引
● But also new types of knowledge bases (datasets, interfaces, etc.)

Humanities Computing 人文資訊學 - Digital Humanities 2
● New forms of information production & dissemination: wikis, blogs, social network applications, Twitter...
● New methods & research questions:
  – authorship attribution & stylistic analysis
  – literary analysis
  – linguistic analysis, corpus linguistics

Example: authorship attribution 1
● Mosteller and Wallace (1964): Inference and Disputed Authorship: The Federalist
● 85 papers by Hamilton, Madison, and Jay
● 12 of disputed authorship: either Hamilton or Madison

Example: authorship attribution 2
● Count sentence length 句子長度 ☹ (does not separate the two authors)
● Vocabulary usage 詞彙使用量化性分析 ☺ (does separate them):
  – compare the frequency of 30 marker words
  – e.g. “upon”: Hamilton (2.93 per 1000 words), Madison (0.16 per 1000 words)
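The marker-word comparison above boils down to a rate per 1000 words. The following is a minimal sketch of that calculation, not Mosteller and Wallace's actual procedure: the function name `rate_per_thousand`, the simple regex tokenizer, and the two toy sentences are all assumptions made for illustration.

```python
import re


def rate_per_thousand(text: str, marker: str) -> float:
    """Return occurrences of `marker` per 1000 word tokens (case-insensitive)."""
    # Crude tokenizer (an assumption): runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(marker.lower()) / len(tokens)


# Invented toy sentences, NOT text from The Federalist.
sample_a = "Upon reflection the plan rests upon the consent of the people"
sample_b = "On reflection the plan rests on the consent of the people"

rate_a = rate_per_thousand(sample_a, "upon")
rate_b = rate_per_thousand(sample_b, "upon")
print(f"sample A: {rate_a:.2f} per 1000, sample B: {rate_b:.2f} per 1000")
```

On real data the same function would be run over each author's undisputed papers and over the disputed ones, for each of the marker words, and the resulting rates compared.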

Example: analysing literary texts 分析文學
● Estelle Irizarry (1992) compares two Mexican writers (Octavio Paz ♂ and Rosario Castellanos ♀) on language use & gender
● ♀ uses more and longer questions
● ♂ uses more words like ‘always’ and ‘absolutely’, expressions of certitude
● Words of compassion (taken from a thesaurus) appear only in ♀'s work

Example: corpus linguistics 語言資料庫語言學 1
● British National Corpus (BNC)
● 100 million words (一億詞), in samples of 45,000 words
● Markup with TEI (P3)
● Automated Part of Speech (PoS) tagging

Example: corpus linguistics 語言資料庫語言學 2
● The BNC is:
  – balanced 平衡的: written and spoken material from diverse sources
  – monolingual 單語的: only English
  – synchronic 同時的/同步的: 20th century

Core Technologies
● Relational databases
● For the Humanities: XML technologies (XSLT, XQuery, SVG...) & markup (TEI)
● Web interfaces (HTML, JS, CSS, AJAX, RSS...)

4 stages in the production of high-quality digital texts
1. Input 輸入
2. Add Value 加值:
   – Basic Markup 基本標記
   – Deep Markup 詳細標記
3. Content Delivery 內容發行
4. Archiving 保存

1. Input 輸入
● Basic data input
● Texts:
  – Keyboarding (double keying, possibly with Access TEI)
  – Scanning 掃描 (OCR: Optical Character Recognition 光學字元辨識機)
⇨ a file (perhaps a .txt file)
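Double keying means two people type the same text independently and the two versions are then compared, so that only the disagreements need checking against the source. A minimal sketch of that comparison, assuming word-level comparison with Python's standard `difflib`; the function name `keying_discrepancies` and the sample strings are invented for illustration, and production keying services will use their own tooling.

```python
import difflib


def keying_discrepancies(first: str, second: str) -> list[tuple[str, str]]:
    """Return (first_version, second_version) pairs wherever two keyings differ."""
    a = first.split()
    b = second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, deletions
    ]


# Invented sample: the second typist mis-keyed one word.
typist_1 = "The quick brown fox jumps over the lazy dog"
typist_2 = "The quick brovvn fox jumps over the lazy dog"

diffs = keying_discrepancies(typist_1, typist_2)
for left, right in diffs:
    print(f"check against source: {left!r} vs {right!r}")
```

Identical keyings yield an empty list; every non-empty pair is a place where a human has to look at the original.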

2a. Basic Markup 基本標記
● File management (formats, file names, etc.) 檔案處理系統 (格式, 檔名 etc.)
● Metadata management 關於數位化過程的 Metadata (e.g. teiHeader)
● Basic structural content markup 基本架構的內容標記 (e.g. with TEI)
⇨ probably an .xml file
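To make "basic markup" concrete, here is a hedged sketch that wraps plain paragraphs in a minimal TEI skeleton (a teiHeader with the three required parts of fileDesc, then text/body/p). The helper names `tei` and `build_minimal_tei` and all sample strings are assumptions; a real project would record far richer metadata in the header.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace (ElementTree's {uri}tag form)."""
    return f"{{{TEI_NS}}}{tag}"


def build_minimal_tei(title: str, paragraphs: list[str]) -> ET.Element:
    """Build a skeleton TEI document: metadata header plus one <p> per paragraph."""
    root = ET.Element(tei("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    # Placeholder statements (assumptions); replace with real project metadata.
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Unpublished working draft."
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = "Keyed from a print source."
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for para in paragraphs:
        ET.SubElement(body, tei("p")).text = para
    return root


root = build_minimal_tei("Sample text", ["First paragraph.", "Second paragraph."])
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

The output is only a starting point for stage 2b: it records structure and minimal metadata, not scholarly interpretation.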

2b. Scholarly in-depth markup & Documentation 學術標準標記與使用說明
● Value adding through encoding 以標記加值
● Encode (with TEI) what you wish to say about the text
⇨ valid TEI .xml file
● Document your markup procedures
● Documentation is indispensable
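A quick pre-check can catch gross problems before a file goes to a real validator. The sketch below only tests well-formedness and the presence of the TEI root, teiHeader, and text elements; it is NOT validation against a TEI schema, which needs a schema-aware validator and the project's own schema. The function name `precheck_tei` and the requirement of a text element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def precheck_tei(xml_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the pre-check passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append("root element is not TEI in the TEI namespace")
    # Assumption: this project expects both a header and a transcribed text.
    for child in ("teiHeader", "text"):
        if root.find(f"{{{TEI_NS}}}{child}") is None:
            problems.append(f"missing <{child}>")
    return problems


good = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text/></TEI>'
broken = "<TEI><teiHeader></TEI>"  # mismatched tags

good_problems = precheck_tei(good)
broken_problems = precheck_tei(broken)
print(good_problems, broken_problems)
```

A file that passes this check can still be invalid TEI; the check merely filters out files that could never validate.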

3. Content Delivery 內容發行
● Making the content available, e.g.:
  – online interface
  – web service
  – on DVD or flash drive
  – source data as an archive
● This needs skills beyond markup.

4. Archiving 保存
● Stability = Referenceability
● Make sure your edition finds its way into larger collections, repositories or archives
● E.g.:
  – Internet Archive
  – GRETIL, Project Gutenberg
● Be happy if other projects transform and reuse your content.

© marcus bingenheimer