UNICODE DAY 9/22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
Course organization
The syllabus is under construction.
22-Sept-2014 NLP, Prof. Howard, Tulane University
Review of Lists
The quiz was the review.
Open Spyder
6. Non-English characters: one code to rule them all
Did you know …
>>> unsorted =
>>> sorted(unsorted)
['*', '6', 'A', 'a']
Introduction
So your program is humming along, and it hits the string 'cañón' and chokes. For instance, it may try to find out the length of cañón:
>>> S = 'cañón'
>>> len(S)
>>> from re import findall
>>> findall(r'\w{5}',S)
>>> T = findall(r'.{5}',S)
>>> T
['ca\xc3\xb1\xc3']
>>> U = ''.join(T)
>>> print U
>>> findall(r'.{7}',S)
['ca\xc3\xb1\xc3\xb3n']
>>> T = findall(r'.{7}',S)
>>> U = ''.join(T)
>>> print U
cañón
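The byte strings in the session above suggest that the word is being handled as UTF-8 bytes rather than as characters. The following is a minimal sketch of that difference, assuming Python 3 (the session above is Python 2, where a plain string is a byte string) and assuming UTF-8 as the encoding. The names `text` and `raw` are illustrative only.

```python
import re

text = 'ca\u00f1\u00f3n'       # 'cañón' as a Python 3 str: a sequence of characters
raw = text.encode('utf-8')     # the same word as UTF-8 bytes (assumed encoding)

char_count = len(text)         # counts characters
byte_count = len(raw)          # counts bytes; ñ and ó take two bytes each

# On a Python 3 str, \w matches Unicode letters, so all five characters match.
matches = re.findall(r'\w{5}', text)

# Decoding the bytes restores the original characters.
round_trip = raw.decode('utf-8')

print(char_count, byte_count, matches, round_trip)
```

Under these assumptions, a pattern such as `.{5}` applied to the bytes would cut the two-byte ó in half, which is one way to end up with a fragment like the one shown in the session.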
6.1. English characters and ASCII
Computers were originally designed to use the English alphabet, and in particular, an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee”; see ASCII in Wikipedia. ASCII is ultimately based on telegraph codes and represents the digits 0-9, the English letters a-z and A-Z, and the English punctuation symbols plus a blank space, along with control codes that originated with Teletype machines, some of which are now obsolete.
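ASCII characters
(rows: first hex digit of the code; columns: second hex digit; – marks a control code)
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0  – – – – – – – – – – – – – – – –
1  – – – – – – – – – – – – – – – –
2    ! " # $ % & ' ( ) * + , - . /
3  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4  @ A B C D E F G H I J K L M N O
5  P Q R S T U V W X Y Z [ \ ] ^ _
6  ` a b c d e f g h i j k l m n o
7  p q r s t u v w x y z { | } ~ –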
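A table like this one can be regenerated from the character codes themselves, since each cell is the character whose code is 16 times the row digit plus the column digit. The following is a small sketch, assuming Python 3. The helper `ascii_row` and the constant `CONTROL` are hypothetical names introduced for illustration.

```python
CONTROL = '\u2013'  # en dash, used here as the placeholder for control codes

def ascii_row(high):
    """Return the 16 characters with codes 16*high + low, for low = 0..15."""
    cells = []
    for low in range(16):
        code = 16 * high + low
        # Codes 0-31 and 127 are control codes, so they get the placeholder.
        cells.append(chr(code) if 32 <= code <= 126 else CONTROL)
    return ' '.join(cells)

header = '  ' + ' '.join('0123456789ABCDEF')
rows = [header] + ['%d %s' % (high, ascii_row(high)) for high in range(8)]
print('\n'.join(rows))
```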
So now you know …
>>> unsorted =
>>> sorted(unsorted)
['*', '6', 'A', 'a']
>>> ord(' ')
>>> ord('!')
>>> ord('~')
>>> chr(32)
>>> chr(33)
>>> chr(126)
>>> chr(127)
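The sorted order follows from the character codes: `sorted()` compares one-character strings by code, and `ord()` and `chr()` convert between a character and its code. The following is a brief sketch, assuming Python 3. The input list `unsorted_chars` is a hypothetical ordering of the four characters, since the slide's original list is not shown.

```python
# Hypothetical input: some ordering of the four characters in the sorted result.
unsorted_chars = ['a', '6', '*', 'A']

# ord() maps each character to its code.
codes = {c: ord(c) for c in unsorted_chars}

# Default sorting of single characters agrees with sorting by code.
in_order = sorted(unsorted_chars)
by_code = sorted(unsorted_chars, key=ord)

# chr() goes the other way: 32 and 126 bound the printable ASCII range,
# and 127 is the DEL control code.
printable = (chr(32), chr(33), chr(126))
delete = chr(127)

print(codes, in_order, printable, repr(delete))
```

Punctuation and digits sort before uppercase letters, and uppercase before lowercase, because that is the order of their codes in the ASCII table.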
6.2. Unicode and UTF-8
Background
Character encoding in Python
Next time
7. NLTK and Internet corpora, but I am going to fold this chapter into §1 & §2, so the chapter numbering will change.