Language and the Internet Assessing Linguistic Bias Measuring the Information Society WSIS, Tunis, November 15, 2005 John C. Paolillo, Indiana University.

Presentation transcript:


Overview
• Sources of Linguistic Bias
• Linguistic Bias: examples
  – Text Communication
  – Internet Host Names
  – Web Programming
• Global Linguistic Diversity
  – Who bears the costs?
• Conclusions

Sources of Linguistic Bias (Friedman and Nissenbaum 1997)
• Pre-existing
  – originate from outside the technical system
    • National, trans-national and institutional policies
    • Technology companies
• Technical
  – are built into the technical system itself
    • Developers' language backgrounds, national origins
    • Legacy standards, backward compatibility
• Emergent
  – arise in specific contexts of use of a technical system
    • Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
    • Rapid technologization

Text Communication
• Requires an encoding and its support
  – Assign code numbers to script characters
    • ASCII (American English)
    • ISO (European Languages)
    • Unicode (most languages, but support is uneven)
  – Support means many things
    • Fonts, rendering, sorting, spell-checking, etc.
• Computer-Mediated Communication
  – Web pages, email, chat, etc.
  – Language use is not uniform in these modes
    • Multilinguals tend to favor different languages for specific purposes
• Represents both technical and emergent biases
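The idea of "assigning code numbers to script characters" can be illustrated with a minimal Python sketch. The sample words and the helper names `describe` and `fits` are illustrative assumptions, not part of the talk; `latin-1` stands in here for one ISO encoding aimed at Western European languages.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, UTF-8 byte count) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))

def fits(text, encoding):
    """True if every character of text has a code number in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Hypothetical sample words, one per language, chosen only for illustration.
samples = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

for name, word in samples.items():
    # Show which of three encodings can represent the word at all.
    coverage = {enc: fits(word, enc) for enc in ("ascii", "latin-1", "utf-8")}
    print(name, coverage)
```

Having a code number is only the first step: a text that encodes cleanly can still lack fonts, correct rendering, sorting rules, or spell-checking, which is the "support" the slide refers to.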

Unicode Status: Examples
Language        Script    Pop.
Chinese         Chinese   1,240M
English         Roman     400M
French          Roman     81M
German          Roman     82M
Spanish         Roman     358M
Finnish         Roman     5M
Russian         Cyrillic  132M
Arabic          Arabic    247M
Hindi           Indic     213M
Sinhala         Indic     15M
S. Azerbaijani  Arabic    26M
Unicode (values on slide): yes, no
Browser (values on slide): good, good (late), poor, none
Legend: Good support, Poor support, No support

Internet Host Names
• The Domain Name System
  – Uses a 30-year-old 7-bit ASCII standard
    • Now supports Punycode (an ASCII-compatible encoding of Unicode)
    • Imposes a maximum name length
  – Run by ICANN under US Dept of Commerce contract
    • More concerned with trademark protection
    • Host/domain naming is widely abused (e.g. the .tv domain)
• Names provided by the DNS are not that useful
• An example of emergent bias
  – Technical origin
  – Economic and political forces amplify and sustain it
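How a non-ASCII host name is squeezed into the ASCII-only DNS can be sketched in Python. This is a rough illustration, not the procedure used by any particular registry: it relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, and the host name `bücher.example` and the helper names are assumptions chosen for the example. The 63-octet per-label limit comes from RFC 1035.

```python
MAX_LABEL_OCTETS = 63  # per-label limit from RFC 1035

def to_ascii_hostname(name):
    """Convert a Unicode host name to the ASCII form that is stored in the DNS.

    Uses Python's built-in "idna" codec (IDNA 2003), which Punycode-encodes
    each non-ASCII label and adds the "xn--" prefix.
    """
    return name.encode("idna").decode("ascii")

def label_lengths(name):
    """Length in octets of each dot-separated label after ASCII conversion."""
    return [len(label) for label in to_ascii_hostname(name).split(".")]

if __name__ == "__main__":
    host = "bücher.example"  # hypothetical host name, for illustration only
    print(to_ascii_hostname(host))
    print(label_lengths(host), "limit per label:", MAX_LABEL_OCTETS)
```

Because the encoded label is longer than the name a user types, scripts outside ASCII reach the length limit sooner, which is one concrete way the technical legacy turns into a bias.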

Web Programming and Unicode
• Markup & web scripting languages
  – Unicode is standard
  – Browser support, fonts, etc. lag behind
  – Databases and development environments tend to lack proper Unicode support
  – End-user oriented, not programmer oriented
• All of the most important technologies are Open-Source software (FLOSS)
  – User extensible/modifiable
  – Language localization of these is possible but rare

Linguistic Bias in Web Programming
• English is the source language for most programming & markup languages
  – Keywords
  – Operator-argument order
  – Programming constructs, etc.
• Programming as a linguistic act
  – Complex concepts are rendered into text
  – Different languages have different ways of doing this
• Emergent language biases

Linguistic Properties of Programming
• LISP
  – Predicates precede their arguments
    • Like Arabic, Celtic, Hebrew, etc.
    (defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1)))))
• PostScript
  – Predicates follow their arguments
    • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.
    /factorial { dup 1 gt { dup 1 sub factorial mul } if } def
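The contrast between the two orderings can be made concrete with a small sketch in Python. It is not taken from the talk: the evaluators `eval_prefix` and `eval_postfix`, the operator table, and the sample expression are assumptions for illustration, and only binary integer operators are handled.

```python
import operator

# Binary operators shared by both notations.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_prefix(tokens):
    """Evaluate operator-first tokens (LISP-like order), e.g. * 4 - 5 1."""
    def parse(pos):
        tok = tokens[pos]
        if tok in OPS:
            # The operator comes first; its two arguments follow it.
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return OPS[tok](left, right), pos
        return int(tok), pos + 1

    value, _ = parse(0)
    return value

def eval_postfix(tokens):
    """Evaluate operator-last tokens (PostScript-like order), e.g. 4 5 1 - *."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            # The arguments are already on the stack; the operator consumes them.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()

if __name__ == "__main__":
    print(eval_prefix("* 4 - 5 1".split()))
    print(eval_postfix("4 5 1 - *".split()))
```

Both calls compute 4 × (5 − 1); only the position of the operator relative to its arguments differs, which is the property the slide compares with word order in natural languages.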

The Linguistic Digital Divide
• Language issues go beyond content
  – WSIS repeatedly re-affirms principles of
    • Transparency
    • Self-determination
    • Open access to participation for all parties
  These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like
• The linguistic divide has broader consequences
  – Costs are borne in
    • Education: great for non-English-speaking people
    • Technical development: small, in comparison
    (there is a trade-off)

Language Diversity: Who bears the costs?

(source data:
A typical language group has around thousand people
80% of language groups have fewer than 100 thousand members

(source data:
90% of the world's population belongs to a language group with at least 1 million people (416 groups)
Many languages with hundreds of millions of speakers lack adequate support

(source data:

Conclusions
• Linguistic Bias is manifest in many ways
  – Technical biases are sometimes overt
  – Emergent biases can be subtle
• All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
• Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power

Language Diversity On The Internet

Global Reach

Linguistic Diversity
Based on Entropy: Diversity = $-2 \sum_i p_i \ln p_i$
Diversity is the long-run per-individual average variance in language category (similar to log-likelihood)
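An entropy-based diversity score of this kind can be computed in a few lines of Python. This is a sketch under stated assumptions: it reads the slide's formula as the Shannon entropy sum over language categories scaled by the factor 2 shown on the slide, and it takes p_i to be the share of individuals in category i, which the slide does not define. The function name `diversity` and the sample counts are hypothetical.

```python
import math

def diversity(counts, scale=2.0):
    """Entropy-based diversity for a list of per-language speaker counts.

    Assumes p_i = counts[i] / sum(counts) and computes
    -scale * sum(p_i * ln p_i); the default scale of 2 follows the
    factor shown on the slide. Empty categories contribute nothing.
    """
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]
    return -scale * sum(p * math.log(p) for p in shares)

if __name__ == "__main__":
    # Hypothetical populations: one dominant language vs. an even split.
    print(diversity([97, 1, 1, 1]))
    print(diversity([25, 25, 25, 25]))
```

Under this reading, a population with a single language scores 0 and an even split across n languages scores 2 ln n, so the measure rises both with the number of languages and with how evenly speakers are spread across them.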

O'Neill, Lavoie and Bennett, 2003

ITU

ITU

UNPD

ITU, UNPD

ITU

ITU