Text Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See



How to represent characters?

American English in the 1960s:
26 characters × {upper, lower}
+ 10 digits
+ punctuation
+ special characters for controlling teletypes (new line, carriage return, form feed, bell, …)
= 7 bits per character (ASCII standard)
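As a rough illustration of the 7-bit scheme, the sketch below uses Python's built-in `ord` to look up a character's numeric code and prints it as seven binary digits. The helper name `ascii_bits` is made up for this example and is not part of the original slides.

```python
def ascii_bits(ch):
    """Return the ASCII code of a single character as a 7-digit bit string."""
    code = ord(ch)
    if code > 127:
        # Anything above 127 does not fit in 7 bits, so it is not ASCII.
        raise ValueError("not an ASCII character: %r" % ch)
    return format(code, "07b")


# A letter in both cases, a digit, and two teletype control characters
# (new line and bell).
for ch in ["A", "a", "0", "\n", "\a"]:
    print(repr(ch), ord(ch), ascii_bits(ch))
```

Upper and lower case differ in a single bit, which is one reason the table was laid out this way.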

How to represent text?

1. Fixed-width records

A crash reduces
your expensive computer
to a simple stone.

A crash reduces········
your expensive computer
to a simple stone.·····

Easy to get to line N
But may waste space
What if lines are longer than the record length?
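A minimal sketch of fixed-width records, assuming a record length of 23 characters (the longest line of the example) and the padding dot drawn on the slide. The names `pack_records` and `get_line` are hypothetical. It shows why jumping to line N is simple arithmetic, and what happens when a line does not fit.

```python
RECORD_LENGTH = 23  # assumed width: the longest line of the example
PAD = "·"           # padding character, as drawn on the slide


def pack_records(lines, width=RECORD_LENGTH):
    """Pad every line to the same width and join them into one block."""
    records = []
    for line in lines:
        if len(line) > width:
            # The open question from the slide: this sketch simply refuses.
            raise ValueError("line longer than record length: %r" % line)
        records.append(line.ljust(width, PAD))
    return "".join(records)


def get_line(data, n, width=RECORD_LENGTH):
    """Fetch line n (counting from 0) directly, without scanning."""
    start = n * width
    return data[start:start + width].rstrip(PAD)


haiku = ["A crash reduces", "your expensive computer", "to a simple stone."]
data = pack_records(haiku)
print(len(data))          # three records of 23 characters each
print(get_line(data, 2))
```

Note that stripping the padding would also strip a genuine trailing `·` from a line, so a real format would need a padding character that cannot appear in the text.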

How to represent text?

1. Fixed-width records
2. Stream with embedded end-of-line markers

A crash reduces⏎your expensive computer⏎to a simple stone.⏎

More flexible
Wastes less space
Skipping ahead is harder
What to use for end of line?
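To see why skipping ahead is harder in a stream, here is a small sketch, assuming `'\n'` as the end-of-line marker. The function name `get_line` is invented for illustration. Reaching line N means scanning past every marker before it.

```python
def get_line(stream, n, eol="\n"):
    """Return line n (counting from 0) by scanning for end-of-line markers."""
    start = 0
    for _ in range(n):
        # Each earlier line has to be skipped one marker at a time.
        end = stream.find(eol, start)
        if end == -1:
            raise IndexError("stream has fewer than %d lines" % (n + 1))
        start = end + len(eol)
    if start >= len(stream):
        raise IndexError("stream has fewer than %d lines" % (n + 1))
    end = stream.find(eol, start)
    if end == -1:
        end = len(stream)  # last line without a trailing marker
    return stream[start:end]


stream = "A crash reduces\nyour expensive computer\nto a simple stone.\n"
print(get_line(stream, 0))
print(get_line(stream, 2))
```

The cost grows with N, whereas the fixed-width layout needs only a multiplication.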

Unix: newline ('\n')
Windows: carriage return + newline ('\r\n')
Oh dear…
Python converts '\r\n' to '\n' and back on Windows
To prevent this (e.g., when reading image files) open the file in binary mode

reader = open('mydata.dat', 'rb')
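The difference between the two modes can be sketched as follows. This assumes Python 3, where text mode translates `'\r\n'` to `'\n'` on reading on every platform by default, not only on Windows. The file name is reused from the example above, and the file is created in a temporary directory so the sketch cleans up after itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "mydata.dat")

    # Write Windows-style line endings as raw bytes.
    with open(path, "wb") as writer:
        writer.write(b"first\r\nsecond\r\n")

    # Binary mode: bytes come back exactly as stored.
    with open(path, "rb") as reader:
        raw = reader.read()

    # Text mode: line endings are normalised to '\n' while reading.
    with open(path, "r", encoding="ascii") as reader:
        text = reader.read()

print(repr(raw))
print(repr(text))
```

For an image file, a silently dropped `'\r'` byte would corrupt the data, which is why binary mode matters there.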

Back to characters…

How to represent ĕ, β, Я, …?
7 bits = 0…127
8 bits (a byte) = 0…255
Different companies/countries defined different meanings for 128…255
Did not play nicely together
And East Asian "characters" won't fit in 8 bits
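The clash between 8-bit character sets can be shown with one byte and a few of the codecs that ship with Python. The choice of byte (0xE9) and of these three codecs is only an example. The same value decodes to a different character under each.

```python
def decode_with(byte_value, codecs):
    """Decode one byte under several 8-bit encodings."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec) for codec in codecs}


# Western European, Cyrillic (Windows), and Greek code pages.
results = decode_with(0xE9, ["latin-1", "cp1251", "iso8859_7"])
for codec, ch in results.items():
    print(codec, ch, hex(ord(ch)))
```

A file written under one of these encodings and read under another shows the wrong letters without any error being raised.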

1990s: Unicode standard
Defines mapping from characters to integers
Does not specify how to store those integers
32 bits per character will do it...
...but wastes a lot of space in common cases
Use in memory (for speed)
Use something else on disk and over the wire
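A short sketch of both points, assuming Python 3, where `ord` returns a character's Unicode integer. The first loop shows the mapping from characters to integers. The second part shows the space cost of a fixed 32 bits per character for plain ASCII text, using the `utf-32-le` codec (chosen here because it adds no byte-order mark).

```python
# The mapping: each character has one integer (its code point).
code_points = {ch: ord(ch) for ch in "ĕβЯ"}
for ch, number in code_points.items():
    print(ch, number, hex(number))

# One possible storage scheme: four bytes for every character.
text = "hello"
utf32 = text.encode("utf-32-le")
print(len(text), "characters ->", len(utf32), "bytes")
```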

(Almost) everyone uses a variable-length encoding called UTF-8 instead
First 128 characters (old ASCII) stored in 1 byte each
Next 1920 stored in 2 bytes, etc.

0xxxxxxx                               7 bits
110yyyyy 10xxxxxx                     11 bits
1110zzzz 10yyyyyy 10xxxxxx            16 bits
11110www 10zzzzzz 10yyyyyy 10xxxxxx   21 bits

The good news is, you don't need to know
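For the curious, the bit patterns above can be written out by hand. The sketch below is illustrative only: `utf8_encode` is a made-up name, and it skips validity checks (such as rejecting surrogate code points) that a real encoder performs. It is compared against Python's built-in `str.encode`, which is what should be used in practice.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point following the four bit patterns."""
    if code_point < 0x80:        # 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110www 10zzzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One code point from each of the four ranges.
for cp in [0x41, 0x3B2, 0x20AC, 0x1F600]:
    encoded = utf8_encode(cp)
    print(hex(cp), len(encoded), encoded == chr(cp).encode("utf-8"))
```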

Python 2.* provides two kinds of string
Classic: one byte per character
Unicode: "big enough" per character
Write u'the string' for Unicode
Must specify encoding when converting from Unicode to bytes
Use UTF-8
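A minimal round trip between Unicode text and bytes. The sketch is written to run under Python 3, which is an assumption beyond the slides: there every `str` is already Unicode, the `u'...'` prefix is still accepted, and the one-byte-per-item type is called `bytes`. The sample string is invented for illustration.

```python
text = u"Я is one character"

# Converting Unicode to bytes: the encoding is named explicitly.
data = text.encode("utf-8")
print(len(text), "characters,", len(data), "bytes")

# Converting back requires the same encoding.
restored = data.decode("utf-8")
print(restored == text)

# Decoding with the wrong encoding fails for the non-ASCII bytes.
try:
    data.decode("ascii")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```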

Created by Greg Wilson, October 2010.