Published by Edward Moore. Modified over 9 years ago.
1
UNICODE
DAY 12 - 9/22/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
2
Course organization
22-Sept-2014 NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
3
Review of Lists
The quiz was the review.
4
Open Spyder
5
6. Non-English characters: one code to rule them all
6
Did you know …
>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']
7
Introduction
So your program is humming along, and it hits the string 'cañón' and chokes. For instance, it may try to find out the length of cañón:
>>> S = 'cañón'
>>> len(S)
>>> from re import findall
>>> findall(r'\w{5}',S)
>>> T = findall(r'.{5}',S)
>>> T
['ca\xc3\xb1\xc3']
>>> U = ''.join(T)
>>> print U
>>> findall(r'.{7}',S)
['ca\xc3\xb1\xc3\xb3n']
>>> T = findall(r'.{7}',S)
>>> U = ''.join(T)
>>> print U
cañón
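The session above looks like Python 2 (note the print statement), where S is a string of UTF-8 bytes, so ñ and ó take two bytes each and the word spans seven bytes. As a rough sketch of the same experiment in Python 3, assuming a str for the text and an explicitly encoded bytes object (the names B, n_chars, n_bytes and cut_ok are only illustrative), the split between characters and bytes might be made visible like this:

```python
import re

# Assumption: Python 3, where str holds code points and bytes holds raw bytes.
S = 'ca\u00f1\u00f3n'        # 'cañón', written with escapes to avoid source-encoding issues
B = S.encode('utf-8')        # the same word as UTF-8 bytes

n_chars = len(S)             # number of characters (code points)
n_bytes = len(B)             # number of bytes; ñ and ó need two bytes each

five_chars = re.findall(r'\w{5}', S)     # \w matches non-ASCII letters in a str pattern
five_bytes = re.findall(rb'.{5}', B)     # five bytes cut ó in half
seven_bytes = re.findall(rb'.{7}', B)    # seven bytes cover the whole word

# Decoding the five-byte slice fails because its last byte starts an unfinished character.
try:
    five_bytes[0].decode('utf-8')
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False

U = seven_bytes[0].decode('utf-8')       # the full seven bytes decode back to the word
print(n_chars, n_bytes, cut_ok)
```

The point of the sketch is only that a pattern counting five units matches the word when the units are characters, but stops in the middle of ó when the units are bytes.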
8
6.1. English characters and ASCII
Computers were originally designed to use the English alphabet, and in particular an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee”; see ASCII in Wikipedia. ASCII is ultimately based on telegraph codes and represents the digits 0-9, the English letters a-z and A-Z, and the English punctuation symbols plus a blank space, along with control codes that originated with Teletype machines, some of which are now obsolete.
9
ASCII characters
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0  – – – – – – – – – – – – – – – –
1  – – – – – – – – – – – – – – – –
2    ! " # $ % & ' ( ) * + , - . /
3  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4  @ A B C D E F G H I J K L M N O
5  P Q R S T U V W X Y Z [ \ ] ^ _
6  ` a b c d e f g h i j k l m n o
7  p q r s t u v w x y z { | } ~ –
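In the chart, the row label is the first hexadecimal digit of a character's code and the column label is the second, so 'A' in row 4, column 1 has code 0x41 = 65. A minimal sketch of how such a chart could be rebuilt in Python 3 follows; the function name ascii_table is hypothetical, and treating codes 32 through 126 as the printable range (everything else shown as a dash) is an assumption made to match the chart:

```python
def ascii_table():
    """Return the 128 ASCII codes as lines of a 16-column chart (illustrative helper)."""
    lines = ['   ' + ' '.join('0123456789ABCDEF')]     # column header: low hex digit
    for row in range(8):                               # row label: high hex digit
        cells = []
        for col in range(16):
            code = row * 16 + col
            # Codes 32..126 are printable; the rest are control codes, shown as an en dash.
            cells.append(chr(code) if 32 <= code <= 126 else '\u2013')
        lines.append(f'{row:X}  ' + ' '.join(cells))
    return lines

table = ascii_table()
print(len(table))
```

Printing the lines of table one per row should reproduce the chart, with the blank space sitting in row 2, column 0.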
10
So now you know …
>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']
>>> ord(' ')
>>> ord('!')
>>> ord('~')
>>> chr(32)
>>> chr(33)
>>> chr(126)
>>> chr(127)
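The sort order follows from the numeric codes: sorted() compares characters by code, and punctuation, digits, uppercase and lowercase letters occupy successive stretches of the ASCII chart. A small Python 3 sketch that pairs each character with its code (the variable names codes, by_code and round_trip are illustrative, not from the slides):

```python
unsorted = 'a*@A6'

# ord() gives the numeric code of each character.
codes = {ch: ord(ch) for ch in unsorted}

# sorted() orders characters by those codes, which explains the result on the slide.
by_code = sorted(unsorted)
same_order = by_code == sorted(unsorted, key=ord)

# chr() is the inverse of ord(): code -> character.
round_trip = all(chr(ord(ch)) == ch for ch in unsorted)

print(codes['*'], codes['6'], codes['@'], codes['A'], codes['a'])
```

Under this view, '*' (42) precedes '6' (54), which precedes '@' (64), 'A' (65) and 'a' (97); the space is code 32, '~' is code 126, and chr(127) is the non-printing DEL control code.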
11
6.2. Unicode and UTF-8
Background
12
6.2.1. Character encoding in Python
13
Next time
7. NLTK and Internet corpora, but I am going to fold this chapter into §1 & §2, so the chapter numbering will change.