Dale & Lewis, Chapter 3: Data Representation
Analog and digital information
The real world is continuous and infinite, whereas data on computers are finite, so we need to approximate real-world data for our computational needs.
Analog data: information represented in a continuous form.
Digital data: information represented in a discrete form.
Analog and digital information
Noise in signals
Digitizing a signal
Sample the signal in time and assign each sample to one of a set of discrete levels.
The pieces (discrete levels) are numbered.
The binary number system is used to represent the numbers.
n bits can represent 2^n numbers.
Q: how many bits are needed to represent m numbers?
The number of bits that can be easily addressed in a computer sets some constraints.
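The relationship between bits and levels can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the function names `bits_needed` and `quantize` and the signal range passed to `quantize` are assumptions made for the example. It uses the fact that n bits give 2^n numbers, so m numbers need the smallest n with 2^n >= m.

```python
def bits_needed(m: int) -> int:
    """Smallest n such that 2**n >= m (hypothetical helper name)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # (m - 1).bit_length() equals ceil(log2(m)) for m >= 1, without floats.
    return (m - 1).bit_length()


def quantize(x: float, lo: float, hi: float, n_bits: int) -> int:
    """Map a sample x in [lo, hi] to one of 2**n_bits numbered levels."""
    levels = 2 ** n_bits
    x = min(max(x, lo), hi)  # clamp out-of-range samples
    index = int((x - lo) / (hi - lo) * levels)
    return min(index, levels - 1)  # x == hi belongs to the top level


if __name__ == "__main__":
    for m in (2, 3, 26, 128, 129):
        print(m, "numbers need", bits_needed(m), "bits")
    # A sample of 0.0 in the range [-1, 1] with 3 bits (8 levels):
    print(format(quantize(0.0, -1.0, 1.0, 3), "03b"))
```

With 3 bits there are 8 levels, numbered 000 to 111 in binary; more bits give finer levels at the cost of more storage per sample.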
Representing text
English language character set: 26 letters (both upper and lower case), punctuation, numeric digits, etc.
How many bits can we use?
What about other languages?
ASCII character set
American Standard Code for Information Interchange.
Each character is coded as a byte (8 bits): a 7-bit code plus 1 check bit.
Later, all 8 bits were used in the “extended character set”.
128 characters are encoded (2^7): 95 visible characters and 33 invisible (control) characters.
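The slide does not say how the check bit is computed. One common scheme is a parity bit, and the sketch below assumes even parity stored in the high-order bit; both choices, and the function names, are assumptions for illustration only.

```python
def add_even_parity(code7: int) -> int:
    """Return an 8-bit value: 7-bit code plus an even-parity bit on top.

    Assumption: even parity in the most significant bit.
    """
    if not 0 <= code7 < 128:
        raise ValueError("expected a 7-bit code")
    ones = bin(code7).count("1")
    parity = ones % 2  # 1 if the count of 1 bits is odd, else 0
    return (parity << 7) | code7


def check_even_parity(byte: int) -> bool:
    """True if the byte has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


if __name__ == "__main__":
    stored = add_even_parity(ord("J"))
    print(format(stored, "08b"), check_even_parity(stored))
    corrupted = stored ^ 0b00000001  # flip one bit to simulate an error
    print(format(corrupted, "08b"), check_even_parity(corrupted))
```

A single flipped bit changes the parity and is detected; two flipped bits cancel out and go unnoticed, which is a known limit of a one-bit check.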
7-bit ASCII character set
ASCII Table
The table above is sorted by decimal value.
These decimal values really represent binary sequences.
The character J is in position 74, which is 01001010 in binary or 4A in hexadecimal.
The character j is in position 106, which is 01101010 in binary or 6A in hexadecimal.
Notice anything? The two codes differ in a single bit (value 32). There is a purpose for that!
The Unicode character set
Originally a 16-bit standard with 65,536 possible codes.
Enough to cover the principal languages of the world.
Superset of ASCII: the first 128 codes of Unicode are the same as ASCII, and the first 256 match the Latin-1 (ISO 8859-1) extended ASCII set.
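The J and j values can be checked directly with Python's built-in `ord`, `chr`, and `format`. The helper names `describe` and `toggle_case` are made up for this sketch; `toggle_case` relies on upper and lower case ASCII letters differing only in the bit with value 32 (hex 20).

```python
def describe(ch: str) -> tuple[int, str, str]:
    """Return (decimal code, 8-bit binary string, hexadecimal string)."""
    code = ord(ch)
    return code, format(code, "08b"), format(code, "02X")


def toggle_case(ch: str) -> str:
    """Flip the case of an ASCII letter by flipping one bit (value 32)."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # leave digits, punctuation, and non-ASCII alone
    return chr(ord(ch) ^ 0x20)


if __name__ == "__main__":
    for ch in "Jj":
        print(ch, *describe(ch))
    print(toggle_case("J"), toggle_case("j"))
```

Because `ord` returns Unicode code points, the same calls give the ASCII values for these characters, consistent with Unicode being a superset of ASCII.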
Text compression
Keyword encoding: substitute frequently used words with single characters, e.g. “as” → ^, “the” → ~, “and” → +, “that” → $, etc.
Problems:
These characters can’t be part of the text.
Frequently used words tend to be short, so there is not much gain.
Word variations are not handled, e.g. “The” vs. “the”.
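A minimal sketch of keyword encoding, using the four substitutions listed on the slide. The function names and the decision to raise an error when a reserved symbol already appears in the text are assumptions for this example; the slide only notes that such characters cannot be part of the text.

```python
import re

# Substitution table taken from the slide's examples.
CODES = {"as": "^", "the": "~", "and": "+", "that": "$"}
DECODE = {symbol: word for word, symbol in CODES.items()}

# Match whole words only, so "then" or "sand" are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CODES)) + r")\b")


def keyword_encode(text: str) -> str:
    """Replace each listed word with its single-character code."""
    if any(symbol in text for symbol in DECODE):
        raise ValueError("text already contains a reserved symbol")
    return _PATTERN.sub(lambda m: CODES[m.group(1)], text)


def keyword_decode(text: str) -> str:
    """Expand each reserved symbol back into its word."""
    return "".join(DECODE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    original = "The cat and the dog"
    encoded = keyword_encode(original)
    print(encoded, len(original), len(encoded))
    print(keyword_decode(encoded) == original)
```

The capitalized "The" is left unencoded because matching is case-sensitive, which is exactly the word-variation problem listed above, and the saving is only a few characters because the replaced words are short.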
Run-length encoding
Replace a long series of a repeated character with a special short code, e.g. replace “AAAAAAA” with *A7.
This is equivalent to replacing 01000001 01000001 01000001 01000001 01000001 01000001 01000001 with 00101010 01000001 00000111.
Note that repetitions shorter than 4 characters are not worth encoding.
Also note that the repetition count is encoded in binary, not ASCII, so that repetitions longer than 9 can be captured.
Used in limited-palette image compression and fax machines.
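The scheme above can be sketched over bytes as follows. The flag character `*`, the three-byte form (flag, character, binary count), and the minimum run of 4 come from the slide. Capping a run at 255 so the count fits in one byte, and writing a literal `*` in the input as a flagged run, are assumptions added here so that decoding is unambiguous.

```python
FLAG = ord("*")  # 00101010, the marker byte used on the slide


def rle_encode(data: bytes, min_run: int = 4) -> bytes:
    """Replace runs of min_run or more equal bytes with FLAG, byte, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        ch = data[i]
        run = 1
        # Count equal bytes, capped at 255 so the count fits in one byte.
        while i + run < len(data) and data[i + run] == ch and run < 255:
            run += 1
        if run >= min_run or ch == FLAG:
            # Assumption: a literal '*' is always written as a flagged run.
            out += bytes([FLAG, ch, run])
        else:
            out += bytes([ch]) * run  # short runs are copied unchanged
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Invert rle_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == FLAG:
            out += bytes([data[i + 1]]) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    packed = rle_encode(b"AAAAAAA")
    print(" ".join(format(b, "08b") for b in packed))
    print(rle_decode(packed))
```

A run of three characters would also take three bytes when flagged, so nothing is gained below a run length of 4.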
Huffman encoding
Generalization of Morse code.
Morse code (dots and dashes) is based on the distribution of letters in general English usage.
Huffman encoding is based on the distribution in a given message.
Algorithm:
Encoding: build a frequency table of letter usage, then build the code and encode the message.
Decoding: a Huffman code has the prefix property, so decoding processes the bit stream until a match is found.
Prefix property: no code is the front part of another code.
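The encoding steps (frequency table, then code construction) can be sketched with the standard greedy construction that repeatedly merges the two least frequent groups. This is one common way to build such a code, not necessarily the textbook's procedure; the function names are made up for the example, and the exact bit patterns depend on the message's frequencies and on how ties are broken, so they need not match any particular published table.

```python
import heapq
from collections import Counter


def huffman_code(message: str) -> dict[str, str]:
    """Build a prefix code from the letter frequencies of message."""
    freq = Counter(message)  # step 1: frequency table
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a lone symbol still needs one bit
    # Heap entries: (weight, unique tiebreak, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # step 2: merge the two lightest groups, prefixing 0 and 1.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def is_prefix_free(code: dict[str, str]) -> bool:
    """Check the prefix property: no code word starts another code word."""
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def encode(message: str, code: dict[str, str]) -> str:
    return "".join(code[ch] for ch in message)


if __name__ == "__main__":
    code = huffman_code("DOORBELL")
    print(code, is_prefix_free(code))
    print(encode("DOORBELL", code))
```

Frequent letters end up near the top of the merge tree and so receive shorter codes, which is where the compression comes from.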
Example of Huffman encoding/decoding
Message: DOORBELL
Encoding: 1011110110111101001100100
Compression ratio (vs ASCII): 25/64 = 0.39 (25 bits versus 8 characters × 8 bits = 64 bits)
Decode:
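Decoding can be sketched as reading bits one at a time until the bits collected so far match a code, which works because of the prefix property. The code table for this example is not reproduced in the text, so the table below is an assumption: it was chosen because it is prefix-free and reproduces the 25-bit string shown for DOORBELL. The function names are also made up for the sketch.

```python
# Assumed code table (not given in the text); it reproduces the bit
# string shown for DOORBELL and satisfies the prefix property.
CODE = {
    "A": "00", "E": "01", "L": "100", "O": "110",
    "R": "111", "B": "1010", "D": "1011",
}


def encode(message: str, code: dict[str, str]) -> str:
    """Concatenate the code of each character."""
    return "".join(code[ch] for ch in message)


def decode(bits: str, code: dict[str, str]) -> str:
    """Read bits left to right, emitting a character on each match."""
    lookup = {word: ch for ch, word in code.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in lookup:  # prefix property: first match is the only one
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("bit stream ends in the middle of a code")
    return "".join(out)


if __name__ == "__main__":
    bits = encode("DOORBELL", CODE)
    print(bits, len(bits))
    print(decode(bits, CODE))
    print(round(len(bits) / (8 * len("DOORBELL")), 2))
```

Under this table the stream splits as 1011 110 110 111 1010 01 100 100, giving D O O R B E L L.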