Mark Davis President, Unicode Consortium

Slides:



Advertisements
Similar presentations
Unicode from a distance…
Advertisements

Unicode Security Mark Davis. The Unicode Consortium Software globalization standards: define properties and behavior for every character in every script.
New in Unicode Mark Davis, John Jenkins. Agenda Unicode UCA Regular Expressions Security Considerations Character Mapping Common Locale Data.
Unicode/IDN Security Mark Davis President, Unicode Consortium Chief SW Globalization Arch., IBM.
Globalization Gotchas
Whats New in Globalization Mark Davis. Unicode Character Database: UCD 5.0 Schedule Currently in β2 Due June, 2006 Major part of the Unicode Standard.
Unicode Mark Davis Unicode Consortium President IBM Chief SW Globalization Architect
Unicode Normalization Mark Davis
Unicode Mark Davis Unicode Consortium President IBM Chief SW Globalization Architect.
CSCI N241: Fundamentals of Web Design Copyright ©2004 Department of Computer & Information Science Introducing XHTML: Module B: HTML to XHTML.
International Domain Name Committee Proceedings Report Shanghai, China October, 2002.
Internationalizing WHOIS Preliminary Approaches for Discussion Internationalized Registration Data Working Group ICANN Meeting, Brussels, Belgium Jeremy.
ICANN Rio Meeting IDN Authorization for TLDs with ICANN agreements 26 March, 2003 Andrew McLaughlin.
Internationalized Domain Names Introduction & Update MENOG 1 Bahrain April 3-5, 2007 By: Baher Esmat Middle East Liaison.
IDN Variant Issues Project (VIP) Project Update and Next Steps 11 April 2012.
DC Architecture WG meeting Monday Sept 12 Slot 1: Slot 2: Location: Seminar Room 4.1.E01.
Worldwide typography (and how to apply JIS-X to Unicode) Michel Suignard Microsoft Corporation.
Global Registry Services 1 INTERNATIONALIZED Domain Names Testbed presented to ITU/WIPO Joint Symposium Geneva 6-7 Dec An Overview On VeriSign Global.
1 Character Conversions and Mapping Tables Presented By: Markus Scherer George Rhoten Raghuram (Ram) Viswanadha.
Office Links - Sharing Data in Microsoft Office A Mixed Bag of Treasures Chester N. Barkan Registrar Long Island University, C.W.Post Campus.
XML: Extensible Markup Language
Internationalization Status and Directions: IETF, JET, and ICANN John C Klensin October 2002 © 2002 John C Klensin.
Text #ICANN50. Text #ICANN50 IDN Variant TLD Program GNSO Update Saturday 21 June 2014.
Teppo Räisänen LIIKE/OAMK 2010
RDF Schemata (with apologies to the W3C, the plural is not ‘schemas’) CSCI 7818 – Web Technologies 14 November 2001 Van Lepthien.
Internationalized Domain Names Status Report Prepared for: ICANN Meeting, Lisbon 29 March, 2007 Tina Dam IDN Program Director ICANN
The Assembly Language Level
2440: 211 Interactive Web Programming JavaScript Fundamentals.
International Domain Name TWNIC Nai-Wen Hsu
Layer 7- Application Layer
Thayer School of Engineering Dartmouth Lecture 2 Overview Web Services concept XML introduction Visual Studio.net.
1 HTML’s Transition to XHTML. 2 XHTML is the next evolution of HTML Extensible HTML eXtensible based on XML (extensible markup language) XML like HTML.
Introducing XHTML: Module B: HTML to XHTML. Goals Understand how XHTML evolved as a language for Web delivery Understand the importance of DTDs Understand.
Introducing HTML & XHTML:. Goals  Understand hyperlinking  Understand how tags are formed and used.  Understand HTML as a markup language  Understand.
Introduction to Chinese Domain Name ZHANG Hong Aug 24, 2003.
Chapter 9 Collecting Data with Forms. A form on a web page consists of form objects such as text boxes or radio buttons into which users type information.
1 © 2000, Cisco Systems, Inc. DNSSEC IDN Patrik Fältström
IDN over EPP (IDNPROV) IETF BOF, Washington DC November 2004.
Introduction to Human Language Technologies Tomaž Erjavec Karl-Franzens-Universität Graz Tomaž Erjavec Lecture: Character sets
IDN Standards and Implications Kenny Huang Board, PIR
Creating Interfaces: Localization Language & other issues character codes Homework: preparation for future topics.
CcTLD IDN TF Report ccTLD Meeting, Rio de Janero Mar. 25, 2003 Young-Eum Chair, ccTLD IDN TF.
Globalisation & Computer Systems week 5 1. Localisation presentations 2.Character representation and UNICODE UNICODE design principles UNICODE character.
Encoding and fonts Edward Garrett Software Developer, ELAR.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
NASK: at the cutting edge of technology Andrzej Bartosiewicz ITU-T SG17 meeting Moscow, 2005.
1 CS 502: Computing Methods for Digital Libraries Lecture 4 Text.
Chapter 1 Internet & Web Basics Key Concepts Copyright © 2013 Terry Ann Morris, Ed.D. Revised 1/12/2015 by William Pegram 1.
Internationalized Domain Names (IDN) APAN Busan James Seng former co-chair, IDN Working Group.
XML The Overview. Three Key Questions What is XML? What Problems does it solve? Where and how is it used?
JavaScript, Fourth Edition
Internationalized Domain Names Dr. Cary Karp MUSENIC Project Manager Second MUSENIC Project Workshop Stockholm, March 2004 MUSENIC – The Museum Network.
ccTLD IDN Report ccTLD Meeting, Montreol June 24, 2003 Young-Eum
Globalisation & Computer systems Week 5/6 Character representation ACII and code pages UNICODE.
SSAC Report on Domain Name Registration Data Model Jim Galvin.
XP Tutorial 8 Adding Interactivity with ActionScript.
Analyzing Systems Using Data Dictionaries Systems Analysis and Design, 8e Kendall & Kendall 8.
Forms Collecting Data CSS Class 5. Forms Create a form Add text box Add labels Add check boxes and radio buttons Build a drop-down list Group drop-down.
IDNAbis and Security Protocols or Internationalization Issues with Short Strings John C Klensin SAAG – 26 July 2007.
The original Internationalized Domain Name (IDN) WG set the requirements for international characters in domain names in RFC 3454, RFC3490, RFC3491 and.
© 2001, Penn State University Encoding on the Internet Elizabeth J. Pyatt CETS.
Tutorial 9 Working with XHTML. New Perspectives on HTML, XHTML, and XML, Comprehensive, 3rd Edition 2 Objectives Describe the history and theory of XHTML.
1 Lesson 14 Sharing Documents Computer Literacy BASICS: A Comprehensive Guide to IC 3, 4 th Edition Morrison / Wells.
IDN issues ITU-T SG16-Q7 26 July Objective Provide a set of examples why IDN study at ITU-T could be considered Demonstrate the need of ITU recommendation.
1 Internationalized Domain Names Paul Twomey 7 April 2008.
Director of Data Communications
IDN Variant TLDs Program Update
Lesson 14 Sharing Documents
Developing a Data Model
John C Klensin APNIC Beijing, 25 August 2009
Presentation transcript:

Mark Davis President, Unicode Consortium Unicode Security Mark Davis President, Unicode Consortium

The Unicode Consortium Software globalization standards: define properties and behavior for every character in every script Unicode Standard: a unique code for every character in the world Common Locale Data Repository: LDML format plus repository for required locale data Collation, line breaking, regular expressions, character mapping, …

Where in every major operating system in all major office software in all modern browsers, email systems as the core definition of text in XML, HTML, … as the core of Java, C#, C (with ICU), Javascript, …

Security ~ Identity System A X = x System B X ≠ x

Examples: Letters p in Latin vs p in Cyrillic Cross-Script In-Script Sequences rn may appear at display sizes like m अ + ा typically looks identical to आ Rendering Support ä with two umlauts may look the same as ä with one eḷ actually has the dot modifying the l

০ ১ ২ ৩ ৫ ৬ ৮ ৯ ৪ ৭ Examples: Numbers Western Bengali Oriya ୦ ୧ ୨ ୩ ୪ 1 2 3 4 5 6 7 8 9 Bengali ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ Oriya ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ Thus ৪୨ = 42

Syntax Spoofing http://example.org/1234/not.mydomain.com / = fraction-slash Also possible without Unicode: http://example.org--long-and-obscure-list-of-characters.mydomain.com

UTR #36: Security Recommendations General Security Issues (not just IDN) V1 approved mid-2005; V2 in progress http://unicode.org/draft/reports/tr36/tr36.html Describes the problems, recommends best practices Users Programmers User-Agents (browsers, email, office apps) Registries Registrars

UTS #39: Security Mechanisms Supplies data /algorithms for implementing recommendations Restricted character repertoire: Based on Unicode Identifier Profile Intersect with current NamePrep Characters → scripts, confusable characters Originally in UTR #36 Version 1; split out for clarity http://www.unicode.org/draft/reports/tr39/tr39.html

Current NamePrep ≠ Unicode Identifiers Alphanumerics* U3.2 (87,068) Symbols U3.2 (2,974) Alphanum. U5.0 (+2,810) a œ и ש س ௫ ཀ ᚥ ሑ ฎ अ ꀅ あ タ 入 ৫ 2 … ℞ § ♔ / ∞ ☃ ➥ √ … ਁ ჹ ਃ ჺ ሇ ऄ ঽ … http://unicode.org/reports/tr36/idn-chars.html

Restriction Levels* * Subject also to restrictions on confusables 2. Highly Restrictive All characters from a single script, or from limited combinations: Han + Hiragana + Katakana; Han + Bopomofo; or Han + Hangul No characters in the identifier can be outside of the Identifier Profile includes Letters, Numbers; excludes Symbols, Punctuation,… 3. Moderately Restrictive Allow Latin with other scripts except Cyrillic, Greek, Cherokee: ip-アドレス.co.jp خدمة-rss.eg 4. Minimally Restrictive Allow arbitrary mixtures of scripts: sony-βίντεο.gr xml-документы.ru игро-shop.com … * Subject also to restrictions on confusables

ICANN Guidelines v2 http://icann. org/general/idn-guidelines-14nov05 Improvement on v1,… but needs new revision: Procedurally Insufficient time for through review The disposition (with rationale) of comments not available Only single cycle of public review Technically Any specification needs a much clearer structure – the exact implications of a claim to adhere to the guidelines are currently impossible to measure, and useless for security #3 (script/language limitations) has far too many loopholes. #4 (symbols) is too permissive, and not well-defined #5 (registration) should use the post-nameprep’ed form

Guideline 3 (lang./script limitations) Associate with script except with language and script, or except with set of languages, or except with “more than one designator” Publish set of code points, define variant code points; indicate script/language. Why language? (too fuzzy to be testable) Why script? (derivable from characters) Single script in label, except when language requires, except with mixed-script confusables, except with “policy & table” defined. Who decides when required? Allows single-script confusables. All registry policies documented and publicly available, with table for each set of code points Machine readable? Discursive description?

Guideline 4 (disallowed symbols) Line symbol-drawing characters (as those in the Unicode Box Drawing block) One small set of the many symbols Symbols and icons that are neither alphanumeric nor ideographic language characters, such as typographic and pictographic dingbats Numbers? Combining Marks? Letter modifiers? Kana length mark? Ill-defined, untestable. Characters with well-established functions as protocol elements / is confusable with a “protocol element” but isn’t one. Ill-defined, untestable. Punctuation marks used solely to indicate the structure of sentences Em-dash? Who decides? Ill-defined, untestable. Punctuation marks that are used within words except “essential to the language” & “associated with explicit prescriptive rules” Ill-defined, untestable. Except “under corresponding conditions, a single specified character may be used as a separator within a label, either by allowing the hyphen-minus to appear together with non-Latin scripts, or by designating a functionally equivalent punctuation mark from within the script.”

Guideline 5 (registration) A registry will define an IDN registration in terms of both its Unicode and ASCII-encoded representations. Should use output Unicode representation (after mapping and normalization): otherwise many more visually confusable characters are present Should say “ACE”, not ASCII.

Unicode Recommendations Precise Specifications, Mechanically Testable: Guideline 3 (script/language limitations) → Publicly document the Restriction Level being enforced (≤ Level 4) Publicly document the enforcement policy on confusables: whether any two domain names are allowed to be whole-script or mixed script confusables according to [UTR39]. Guideline 4 (symbols) → Only characters in Security Profiles for Identifiers [UTR39]. Guideline 5 (registration) → Define an IDN registration in terms of its: Nameprep-Normalized Unicode representation (output format) ACE representation

Backup Slides

Agenda Unicode Background Security Issues

IDN You get an email about your paypal.com account, click on the link… You carefully examine your browser's address box to make sure that it is actually going to http://paypal.com/ … But actually it is going to a spoof site: “paypal.com” with the Cyrillic letter “p”. You (System A) think that they are the same DNS (System B) thinks they are different

Domain Names ät.com ät.com tοp.com so̷s.com søs.com String UTF-16   String UTF-16 Internal - IDNA 1a ät.com 0061 0308 0074 002E 0063 006F 006D xn--t-zfa.com 1b ät.com 00E4 0074 002E 0063 006F 006D 2a tοp.com 0074 03BF 0070 002E 0063 006F 006D xn--tp-jbc.com 2b 0074 006F 0070 002E 0063 006F 006D top.com 4a so̷s.com 0073 006F 0337 0073 002E 0063 006F 006D xn--sos-rjc.com 4b søs.com 0073 00F8 0073 002E 0063 006F 006D xn--ss-lka.com

Non-Visual Attacks Exploiting Expectations Casing: Encoding: Collation: X < Y, so X + H < Y + H wrong Casing: len(X) = len(toUpper(X)) wrong Encoding: ‘/’ is always represented by 2F16 wrong

UAX #31: Identifier & Pattern Syntax For identification of entities (programming variables, resources, domain names, ... Appropriate characters -- stable across versions Not all natural language words: can’t U.S.A. Provides Foundation: specifications can “tailor” it for different environments: adding or removing characters.

“StringPrep” Processing Map A → a Normalize c + ¸ → ç ㄱ +ㅏ → 가 カ → カ fi → f + i Prohibit & / . , …

UAX #15: Unicode Normalization Forms Normalizes most visually confusable sequences to unique form c + ¸ → ç ㄱ +ㅏ → 가 カ → カ fi → f + i Core part of StringPrep, other Identifier Profiles