The electronic corpus of 17th and 18th century Polish texts (up to 1772) – aims, methods, current state, problems and prospects for development Włodzimierz.

Slides:



Advertisements
Similar presentations
A Common Standard for Data and Metadata: The ESDS Qualidata XML Schema Libby Bishop ESDS Qualidata – UK Data Archive E-Research Workshop Melbourne 27 April.
Advertisements

DOCUMENT TYPES. Digital Documents Converting documents to an electronic format will preserve those documents, but how would such a process be organized?
Teppo Räisänen LIIKE/OAMK 2010
XHTML Basics.
the Classical Text Editor
Possibilities of searching the corpus of baroque texts Goals and objectives Renata Bronikowska Institute of Polish Language Polish Academy of Sciences.
Project 1 Introduction to HTML.
What is a Web Page? Web pages are a combination of text and graphics, wrapped in a special “markup” language. The markup language (Hypertext Markup Language.
New Slovene corpora within the »Communication in Slovene« project Nataša Logar BergincSimon Krek University of LjubljanaAmebis, Kamnik Faculty of Social.
1 Computing for Todays Lecture 22 Yumei Huo Fall 2006.
Tutorial 8 Sharing, Integrating and Analyzing Data
Developing a Basic Web Page with HTML
1st Project Introduction to HTML.
COMPUTERS AND INFORMATION SYSTEMS HTML. How the Web Works To access a web site  Enter its address (URL) in the address box of your browser 
 Definition of HTML Definition of HTML  Tags in HTML Tags in HTML  Creation of HTML document Creation of HTML document  Structure of HTML Structure.
(C) 2013 Logrus International Practical Visualization of ITS 2.0 Categories for Real World Localization Process Part of the Multilingual Web-LT Program.
TEKS & Activities Desktop Publishing.
HTML 1 Introduction to HTML. 2 Objectives Describe the Internet and its associated key terms Describe the World Wide Web and its associated key terms.
Chapter ONE Introduction to HTML.
Understanding HTML Style Sheets. What is a style?  A style is a rule that defines the appearance and position of text and graphics. It may define the.
Creating a Web Page HTML, FrontPage, Word, Composer.
Concordia University Department of Computer Science and Software Engineering Click to edit Master title style ADVANCED PROGRAMING PRACTICES API documentation.
Chapter 12 Creating and Using XML Documents HTML5 AND CSS Seventh Edition.
Overview of Mini-Edit and other Tools Access DB Oracle DB You Need to Send Entries From Your Std To the Registry You Need to Get Back Updated Entries From.
European Metadata Initiatives: The METAe Metadata Engine Simon Tanner Higher Education Digitisation Service
Data Exchange Tools (DExT) DExT PROJECTAN OPEN EXCHANGE FORMAT FOR DATA enables long-term preservation and re-use of metadata,
First things, First Do you belong in here? – 10 – 12 – Comp. Discovery or Keyboard/Comp Apps – Do you have any experience with Web Page Design?????
GEH M IT! Web Page Editor Introduction. Basic types of web editors Web page editors can be divided into two basic groups: WYSWIG HTML Editors WYSWIG HTML.
Introduction to World Wide Web Authoring © Directorate of Information Systems and Services University of Aberdeen, 1999 Part II.
The Internet and the World Wide Web. The Internet A Network is a collection of computers and devices that are connected together. The Internet is a worldwide.
Tutorial 1 Developing a Basic Web Page. New Perspectives on HTML, XHTML, and XML, Comprehensive, 3rd Edition Objectives – Lesson 1 Introduction to the.
Introduction to XML. XML - Connectivity is Key Need for customized page layout – e.g. filter to display only recent data Downloadable product comparisons.
CIS 451: Introduction to XML Dr. Ralph D. Westfall October, 2011.
HTML, XHTML, and CSS Sixth Edition Chapter 1 Introduction to HTML, XHTML, and CSS.
Chapter 13-Tools for the World Wide Web. Overview Web servers. Web browsers. Web page makers and site builders. Plug-ins and delivery vehicles. Beyond.
Internet Web Publishing III. Intro to Cascading Style Sheets Patricia Roberts.
Welcome! The Topic For Today Is Word Processing and Desktop Publishing.
Smart Qualitative Data: Methods and Community Tools for Data Mark-Up SQUAD Libby Bishop Language and Computation Day University of Essex 4 October 2005.
Tools of the Trade: Construction CECS 5030: Introduction to the Internet Dr. Cathleen Norris & Jennifer Smolka.
2XML Marko Tadić Department of linguistics, Faculty of philosophy, University of Zagreb ( Tübingen,
INTELLECTUAL RIGHTS AND HISTORIC CORPORA Mark Sandler University of Michigan ICOLC, March, 2003.
An exercise in preservation and applied technology Making an Electronic Text.
The Web Wizard’s Guide to HTML Chapter One World Wide Web Basics.
Document Computing Technologies for Managing Electronic Document Collections Ross Wilkinson... [et al.] Circulation Counter [RES3H] ZA4080.D
HTML Concepts and Techniques Fifth Edition Chapter 1 Introduction to HTML.
Chapter 1 Introduction to HTML, XHTML, and CSS HTML5 & CSS 7 th Edition.
HTML HYPER TEXT MARKUP LANGUAGE. INTRODUCTION Normal text” surrounded by bracketed tags that tell browsers how to display web pages Pages end with “.htm”
CSS THE MISSING MANUAL Introduction. Benefits of CSS Style sheets offer more formatting choices than are offered in straight HTML  EXAMPLE: When you.
Introduction to HTML Simple facts yet crucial to beginning of study in fundamentals of web page design!
HTML HyperText Markup Language Victoria E. Kozlek.
PYP002 Intro.to Computer Science Brwosing the Web1 Browsing the Web Chapter 19.
BRAT: a web based tool for manual annotation Hans Paulussen ITEC, KU Leuven KULAK.
What is CSS? A set of style rules that tell the web browser how to present a web page or document. – In earlier versions of HTML, style characteristics,
1 Cascading Style Sheet (CSS). 2 Cascading Style Sheets (CSS)  a style defines the appearance of a document element. o E.g., font size, font color etc…
XP Creating Web Pages with Microsoft Office
HTML PROJECT #1 Project 1 Introduction to HTML. HTML Project 1: Introduction to HTML 2 Project Objectives 1.Describe the Internet and its associated key.
Project 1 Introduction to HTML.
Chapter 1 Introduction to HTML.
XHTML Basics.
Project 1 Introduction to HTML.
Microsoft Office Illustrated
XHTML Basics.
XHTML Basics.
Part of the Multilingual Web-LT Program
DIGITAL LIBRARY.
Introduction to HTML Simple facts yet crucial to beginning of study in fundamentals of web page design!
What is HTML?.
XHTML Basics.
XHTML Basics.
Browsing the Web Chapter 19 PYP002 Intro.to Computer Science
Presentation transcript:

The electronic corpus of 17th and 18th century Polish texts (up to 1772) – aims, methods, current state, problems and prospects for development Włodzimierz Gruszczyński*, Maciej Ogrodniczuk** * Instytut Języka Polskiego PAN **Instytut Podstaw Informatyki PAN

Project factsheet Funding: Polish Ministry of Science and Higher Education Grant (contract nr 0036/NPRH2/H11/81/2012) Duration: Coordinating body: Institute of Polish Language PAS Cooperation: Institute of Computer Science PAS Acronym: KORBA (crank) = KORpus BArokowy

Corpus summary a considerably large ( ≈ 12 million tokens) and fairly balanced collection of 17th and 18th century Polish texts (up to 1772); featuring annotation of text structure and language; intended to supplement the National Corpus of Polish (NKJP) with a diachronic aspect; can come in useful for all forms of historical linguistic studies (e.g. evolution of Polish grammar, regional and social variety of historical Polish), as well as literary studies and editorial work.

Text acquisition First of all manual transliteration! Texts offered by IMPACT project. Texts acquired from publishers (very little). Texts acquired from various Internet sources in XIX-, XX- or XXI-century transcription. Texts from XIX- and XX-century editions (scan + ocr).

Transcription setting A familiar word-processing environment: MS Word-based styles used to produce pseudo-XML macros implemented to apply basic processing keyboard shortcuts one template (to rule them all :)

Transcription process Step by step: 1.editor: create a minimal metadata description of a new text; 2.editor: assign it to a transcriber; 3.transcriber: prepare the transcription using Microsoft Word; 4.transcriber: save resulting document in RTF; 5.transcriber: upload the text via Web browser to a dedicated site; 6.editor: approve the text, forward it to proofreading or send it back to transcription if errors are found.

Conversion On upload texts are converted to TEI P5 XML. HTML presentation is also available in the system for easier review.

XML representation format: National Corpus of Polish-based: stand-off annotation XML TEI P5 editor: assign it to a transcriber; text_structure.xml:...

Other structural elements: Marginal notes preserved inline: Wielkosc zámyka w sobie/ wielkie ábo nie wielkie iedzenie/ y czeste iedzenie. Co sie tknie pierwszego. Iák wiele iesc Iák wiele mamy iesc/ naznácza nam vsmierzenie głodu/ to iest/ gdy sie zniesie przez wziety pokarm przyrodzona chciwosc iadłá. Other (examples): or placeholders preserved in place foreign interjections embedded in tag with language codes preserved in xml:lang.

The current state of project Over 300 texts; Over 6.2 M tokens; All texts tagged structurally; In all the texts are marked all foreign tokens’; All texts with rich metadata; Morphosyntactic tag set ready to use for middle Polish.

Tools we apply: Morphological analyzer Morfeusz (adapted to old Polish grammar): Tagger PoliTa: New version of Poliqarp as a search engine (not ready); old version see: