Minority Language Engineering Professor Tony McEnery, Dept. Linguistics and Modern English Language, Lancaster University

Introduction §The TEI and corpus building at Lancaster §MILLE & BIMLLER §EMILLE - its outline §Progress on EMILLE §TEI, NIMLS & BIMLS §Conclusion

The TEI and Corpus Building at Lancaster §The use of the TEI on past corpus building projects has shown the scheme to be: comprehensive; flexible; well suited to linguistic annotation. §In using the TEI we have been able to approach data of many types.

§Hand-written: The Lancaster/Leverhulme Corpus of Children's Writing §Transcription of hand-written material §TEI of use in: normalising spelling; annotating features lost in the transcription; adding visual annotation; articulating a multimodal corpus. See Smith, McEnery & Ivanic, Literary & Linguistic Computing, 1998 (4).

§Speech: The encoding of speech and thought presentation in spoken language §Transcription and annotation of oral history archives §TEI of use in: encoding linguistic annotations; helping to track changes and evolving analyses through responsibility statements; preparing the corpus for presentation as a time-aligned multimodal corpus.

§Historical: The creation of machine readable versions of Early Modern English newsbooks §Transcription of newsbooks from the Civil War/Commonwealth/Restoration period §TEI of use in: normalising spelling; tracking editorial decisions; tracking text reuse across a number of newsbooks.

Minority Language Engineering §The focus - non-indigenous UK minority languages (NIMLS - McEnery) and British indigenous minority languages (BIMLS - Wilson). Part of Lancaster’s focus on widening the range of corpus data available (see McEnery & Ostler, 2000). §NIMLS - mainly Indic languages and varieties of Chinese, but also covering languages such as Arabic and Somali §BIMLS - Celtic languages (Cornish, Erse, Manx, Scots Gaelic, Welsh). We are not covering BIMLS based on English such as Scots and Ullans.

The MILLE Project §MILLE (Minority Language Engineering) §Partners: Lancaster University, Oxford University Computing Service §Steering Group: (Universities) Edinburgh, Sussex and UMIST. (Industry) Canon, Linguacubun, Routledge and Sharp. (Public sector) BBC, ELRA, Dept. Health. §Funded by the EPSRC ( ) §Pilot project examining the feasibility of constructing NIML corpora

Why? §Most UK domestic translation tasks are focused on NIMLS and BIMLS §We are liaising with nations where these are indigenous/major languages §Yet even where such nations do produce resources, they may not be relevant to the UK context

BIMLLER §Starting February 2002 §Repeating the MILLE exercise for BIMLS §Some issues will be similar (code switching), some different (reviving languages, language endangerment), some irrelevant (character encoding). §Considering the role of such data in preserving dying languages, the use of TEI is crucial. We must get the markup right.

Enabling Minority Language Engineering (EMILLE) §40 month project funded by the UK EPSRC (grant no. GR/N 19106). Began September §Main partners: Lancaster University (McEnery) and Sheffield University (Gaizauskas). §Others helping (e.g. Oxford) §Languages initially covered: Bengali, Gujarati, Hindi, Panjabi, Urdu (200,000 word parallel, 500,000 word spoken and 9,000,000 word written corpora each) plus Singhalese and Tamil (9,000,000 word written corpora each)

§Aims: 1.) To generate corpus data for Indic languages 2.) To adapt an existing language engineering architecture (GATE) for NIMLs

Progress report 1 - data §24,000,000 words of written data collected to date. We are focusing on news material. §Collection and orthographic transcription of spoken material on-going. Radio broadcasts main source of data. Around 1,000,000 words transcribed to date. All TEI compliant. §Parallel corpus material being collected (50,000 words of multiple translations to date) §Agreement with Central Institute of Indian Languages, Mysore

Progress report 2 - GATE §Alignment software being embedded within GATE. Part-of-speech tagging for Urdu under development. §Becoming Unicode compliant in a new Java based version of GATE. Using JMUT from NMCL (cross platform delivery).

Progress report 3 - the need for UNICODE §The main issues we have encountered have related to character interchange §The writing systems used by Indic languages can be represented in an 8 bit format, but lack of appropriate word processing software has led to a number of conflicting font-led solutions built on English-language software, so the key a may map to a given character with one font, while the key m may map to that same character with another

Unicode §The obvious standard - though harmonising to one 8/16 bit representation per writing system is a possibility §For languages with an 8 bit standard which is widely adhered to this may not seem so necessary §But for a wide range of languages where 8 bit standardization has not been established/successful it is much more useful
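
As a small illustration of the 8 bit versus Unicode point, the sketch below (Python, standard library only, not part of the original slides) shows that a single Gurmukhi letter has one fixed Unicode code point, cannot be stored in a Western 8 bit code page, and takes two or three bytes in the common Unicode encodings. The choice of letter and of Latin-1 as the 8 bit comparison is an assumption made for the example.

```python
import unicodedata

# U+0A15 is the Unicode code point for GURMUKHI LETTER KA.
ka = "\u0A15"
name = unicodedata.name(ka)

# The same character in two Unicode encoding forms.
utf8_bytes = ka.encode("utf-8")       # three bytes
utf16_bytes = ka.encode("utf-16-be")  # two bytes (one 16 bit unit)

# A Western 8 bit code page (Latin-1, chosen only as an example)
# has no slot for the character at all.
try:
    ka.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(name, utf8_bytes.hex(), utf16_bytes.hex(), fits_in_latin1)
```

Whatever font is used for display, the stored code point stays the same, which is the property the font-led 8 bit solutions lack.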

What happens when standardisation fails? §South Asian languages are good examples of the failure of standardization §There ARE standards- they are simply not adhered to §The standards came too late, and now compete with well established rival commercial/shareware standards §These standards and rivals are mutually incompatible to varying degrees

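For example, Panjabi: the same keystrokes (k, g, t) as rendered by four different fonts (the glyphs themselves are not reproduced in this transcript) §Anandpur Sahib (Maboli Systems Inc.) §Gurbani (Gurbani Foundation) §Panjabi (Hardip Singh Pannu) §WCGurumukhi (Duke University)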

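[Diagram: three forms of data (graphics, 8 bit solutions, Unicode). Labels mark where “legacy” material exists and where there is no material, and whether software exists to convert between forms: some software (SIL, UniEdit, NCST/Lancs) for certain conversions, no software for others.]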

Solutions? §TEI WSDs (writing system declarations)? For example, an entry for Gurmukhi letter KA §UTR 22 §Simple minded ‘bespoke’ programs §LDC developing ‘best practice’ guidelines in this area
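
A ‘bespoke’ conversion program of the kind listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the software used on the project: the two font names and their byte tables are hypothetical, invented to show how the same bytes mean different characters under different fonts. Only the Unicode code points for the Gurmukhi letters KA, GA and TA are real.

```python
# Hypothetical byte-to-Unicode tables for two imaginary legacy fonts.
# Real fonts each need their own table, which is not reproduced here.
FONT_TABLES = {
    "font_a": {0x6B: "\u0A15", 0x67: "\u0A17", 0x74: "\u0A24"},  # k, g, t
    "font_b": {0x61: "\u0A15", 0x62: "\u0A17", 0x63: "\u0A24"},  # a, b, c
}

REPLACEMENT = "\uFFFD"  # Unicode replacement character for unmapped bytes


def legacy_to_unicode(data: bytes, font: str) -> str:
    """Convert font-encoded 8 bit text to Unicode using that font's table."""
    table = FONT_TABLES[font]
    return "".join(table.get(byte, REPLACEMENT) for byte in data)


# The same three letters are stored as different bytes under each font.
from_a = legacy_to_unicode(b"kgt", "font_a")
from_b = legacy_to_unicode(b"abc", "font_b")
# Decoding with the wrong font's table loses the text entirely.
wrong = legacy_to_unicode(b"kgt", "font_b")
print(from_a == from_b, wrong == REPLACEMENT * 3)
```

The design point is that conversion needs to know which font produced the data; without that information the bytes cannot be interpreted reliably.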

TEI, NIMLS and BIMLS §Application of the TEI to NIML/BIML data fairly straightforward (Singh, McEnery, and Baker, 2000, ‘Building a parallel corpus of English/Panjabi’ in Véronis, J. (ed.), Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer) §The degree of code switching in some spoken material has led us to use the TEI <distinct> element to mark this up.

§The borrowing noted may simply be of whole words, or of whole words with distinct pronunciations (sap -> shop). However, morphology may be mixed below the word level: §daktor-e (object) §daktor-o (locative) §Using <distinct> we have worked on a simple scheme to mark up both distinctive pronunciation and morphology in code switching (see Baker, Lie, McEnery & Sebba, 2000). §Ongoing effort to engage South Asian corpus linguists with the TEI (Burnard, McEnery)
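
To make the idea of marking code switching below the word level concrete, here is a minimal sketch in Python that wraps a borrowed stem in a <distinct> element and leaves the native suffix outside it. It is an illustration only, not the scheme described in Baker, Lie, McEnery & Sebba: the enclosing <w> element, the `type` value "english-stem" and the helper name `mark_switch` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def mark_switch(stem: str, suffix: str, kind: str) -> ET.Element:
    """Build <w><distinct type=kind>stem</distinct>suffix</w>.

    The borrowed stem goes inside <distinct>; the native suffix
    follows it as ordinary text within the same word element.
    """
    word = ET.Element("w")
    distinct = ET.SubElement(word, "distinct", {"type": kind})
    distinct.text = stem
    distinct.tail = suffix
    return word


# 'daktor-e' from the slide: borrowed stem plus native suffix.
element = mark_switch("daktor", "-e", "english-stem")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Keeping the suffix outside <distinct> lets a later query separate borrowed stems from native morphology without re-parsing the word.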

Conclusion §Use of TEI on-going - indeed just beginning for some languages. §Work of utility beyond the UK - how well are the NIMLS and IMLS of Europe provided with LE resources? How will TEI be able to help? §TEI standards applied to data being produced in a wide range of corpus building projects at Lancaster.