
1 Data Representation Andrew Csizmadia

2 Only Connect Challenge
What is the connection in the following Only Connect Challenge? 3 Points 2 Points 1 Point 4 Points

3 Data REPRESENTATION

4 Translate …. translate What does the following say?
48 65 6C 6C 6F 20 57 6F 72 6C 64
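A quick way to check an answer like this in Python (a minimal sketch; the hex string here is the slide's example sequence):

```python
# Decode a space-separated string of hexadecimal byte values as ASCII text.
hex_string = "48 65 6C 6C 6F 20 57 6F 72 6C 64"   # the byte values from the slide

text = bytes.fromhex(hex_string).decode("ascii")
print(text)   # -> Hello World
```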

5 Historical Character Usage
Early general purpose computers (dominated by IBM) supported limited usage of non-numeric characters: identification/headings on printed reports and program source code text. Outside of the general computer area, some character encoding was used for text transmission (telegrams); initially Morse code, but this was replaced with fixed-length pattern codes when automated equipment began to be used. Early general computer systems were viewed as “number crunchers”; their primary strength lay in their ability to perform arithmetic very quickly and accurately. Numeric values were what was of primary importance. Non-numeric characters (when used at all) were mainly used to “label” or identify the meaning of specific numeric values. The only other significant use of non-numeric characters was in “coding” source programs in high-level languages (FORTRAN, COBOL, and ASSEMBLER). There was no perceived need to be able to encode both upper and lower case alphabetic characters. If you needed “fancy” reports or letters, you used a typewriter… computers were much too expensive to be wasted on those kinds of activities. Of course, encoding of character information was also required outside of the computer industry. Major users of encoded character information were the telegraph companies. As the telegraph companies replaced skilled human operators using Morse code with telegraph terminals which could be operated by any typist, it was discovered that it was easier to build these terminals so they worked with “fixed-length” bit patterns to represent characters. To keep transmission costs to a minimum, and because there was no significant advantage in providing both upper and lower case alphabetics, the telegraph industry also restricted itself to only upper case letters.

6 Historical Requirements
Text was not intended for the general public. Alphabetic characters were only in upper case. Relatively few “special characters” (periods, parentheses, dollar signs, arithmetic operators) were supplied. 10 digits, 26 letters, (less than) 20 “special” character symbols → (less than) 56 code patterns were required. In general, then, throughout the early days of the computer era, the number of characters which were considered necessary to encode was minimal: codes for each of the 10 digits, 26 upper case letters, and a few “special” characters (including the space, period, dollar sign, etc.). Certainly fewer than 60 code patterns were all that was necessary to provide encoding of the “essential” character set.

7 6-Bit Codes A 6-bit encoding system permits 64 symbols to be encoded. This was enough provided only upper case alphabetic symbols and relatively few “special” symbols were required. IBM and Western Union (the telegraph company) stayed with 6-bit encoding systems after most of the rest of the computer and data transmission companies moved to a system which permitted both upper and lower case alphabetics, and more special symbols. A 5-bit code would only have provided 32 different patterns, so this was obviously not enough. 6 bits, however, permits 64 patterns, which was quite enough for the “essential” character set. Tied to the old view of computers as “number processors” with no economically valid role in general text processing, IBM (which controlled over 90% of the general computer market) stayed with the 6-bit/upper case only/limited special character encoding for quite a while after the rest of the computer industry moved to a more extensive character set. The telegraph companies, especially Western Union, stayed with their 6-bit coding system, called 6-bit Transcode, even longer, so that even today, telegrams are usually thought of as being in all upper case.

8 Formation of ASCII General user demand for more character symbols (including lower case alphabetics). IBM did not believe that the market demand was sufficient to move from a 6-bit code. No other single company controlled a large enough market share to be able to create a viable system on its own. A group of computer, peripheral, and data transmission companies joined to establish a standard. Outside of IBM, a collection of marginal computer and computer equipment manufacturers recognized a growing demand to be able to process a more inclusive character set. However, none of these companies (or users) was large enough to make any private encoding system practical. If, for example, you manufactured a minicomputer system, but not the peripherals for it, you needed to know that the coding system you used would work with the equipment supplied by the different peripheral manufacturers. Therefore, a group of interested manufacturers and users, outside of IBM, got together to develop an industry standard encoding system for character encoding which would include a wider range of characters than available with the 6-bit systems.

9 ASCII Basics American Standard Code for Information Interchange
7-bit code provided unique codes for up to 128 different characters. Some terminal equipment: when idle, the power was off (which would look like 0000000); other terminal equipment: when idle, the power was on (which would look like 1111111). Therefore both the all-zeros and the all-ones patterns were eliminated from the encoding (“null” patterns). This new encoding system was called the American Standard Code for Information Interchange, or ASCII. ASCII was developed as a 7-bit code, which means that 128 different code patterns should be possible. In practice certain code patterns were eliminated as not being “valid” characters; specifically the patterns with all bits “off” and with all bits “on” were eliminated for reasons having to do with physical implementation. The first 32 patterns, from 0000000 to 0011111 (binary), were reserved and assigned to “non-display” or control codes. The next 16 patterns, starting with 0100000, which was assigned to the “blank” character, were assigned to “special character” (non-digit and non-alphabetic) codes. These were followed by the digit codes, then the upper case letters and then the lower case letters. (Some otherwise “unused” code patterns in the gaps between the digits and the upper case letters, between the upper case letters and the lower case letters, and following the lower case letters were assigned to additional “special” characters.)

10 Extended ASCII Byte = Collection of bits used to encode a character
ASCII is almost always implemented using an 8-bit byte (character). Only the 7-bit patterns were standardized under ASCII. Standard 8-bit ASCII codes start with a zero-valued bit (followed by the 7-bit ASCII code). “Extended ASCII” codes start with a one-valued bit; these codes are not standard and vary in meaning among different manufacturers and equipment. Technically, the term “byte” means the collection of bits, used by some specific computer system, to encode its standard character form. Often, a “byte” is thought of as being the same thing as 8 bits, since this is the most common sized collection used for character encoding; however, in some computer systems 6, 7, 9, 12 and 16-bit bytes are used. Although designed as a 7-bit system, ASCII is almost always implemented using an 8-bit byte. For “standard” codes, that is those defined by the original ASCII group, the highest (or left-most) bit has a zero value and the remaining 7 bits contain the original ASCII code. No standard meaning was ever developed for 8-bit patterns with the highest bit set to a one value; many computer manufacturers developed their own extensions to ASCII with the highest bit turned “on”… but there are no official standards for these codes and the same pattern (with the highest bit “on”) will likely generate different symbols using different pieces of equipment.

11 Major ASCII Coding Patterns
First 32 patterns (when written in hexadecimal, any patterns starting with 0 or 1): control codes; the most common of these are 0Ah (Line Feed) and 0Dh (Carriage Return). 20(hex): blank; remainder of codes starting with 2(hex) are “special” characters. 30(hex): “0”; 31(hex): “1”; etc. 41(hex): “A”; 42(hex): “B”; etc. 61(hex): “a”; 62(hex): “b”; etc. If you can remember a few basic ASCII codes, it is possible to decode 95% of almost any ASCII text file from its internal patterns. The first 32 patterns, those which, in hexadecimal, start with 0 or 1, are “non-display” control codes. Of these, the “line feed” with code 0A(hex) and the “carriage return” with code 0D(hex) are, by far, the most essential. The blank (or “space”) character has a code of 20(hex), and all codes starting with 2 (in hex) are “special characters”… for example, 24(hex) is the “dollar sign” symbol. The digit characters, ‘0’ to ‘9’, are encoded using the hex patterns 30(hex) to 39(hex) inclusive. (Codes from 3A(hex) to 40(hex) are more “special characters”.) The upper case letters, ‘A’ to ‘Z’, are encoded using the hexadecimal values 41(hex) to 5A(hex) inclusive. If you can remember that upper case ‘A’ has the code 41(hex) and can recite the alphabet, you should be able to figure out the value of any upper case letter’s ASCII code. The lower case letters, ‘a’ to ‘z’, have ASCII codes in the range 61(hex) to 7A(hex). A lower case letter has an ASCII code which is equal to its upper case ASCII code plus the ASCII code for a blank. For example, since the ASCII code for an upper case ‘E’ is 45(hex) and a blank has an ASCII code of 20(hex), the ASCII code for a lower case ‘e’ is 45(hex) + 20(hex) = 65(hex).
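These relationships are easy to check in Python, where ord() and chr() convert between a character and its code value (a small sketch):

```python
# Check a few of the ASCII relationships described above using ord() and chr().
print(hex(ord("A")), hex(ord("a")))   # 0x41 0x61 - lower case = upper case + 20h
print(chr(0x45 + 0x20))               # 'e' - 45h ('E') plus the code for a blank (20h)
print(hex(ord("0")), hex(ord("9")))   # 0x30 0x39 - the digit characters
print(hex(ord(" ")))                  # 0x20 - the blank itself
```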

12 Sample ASCII Decoding - 1
Suppose we have a bit stream. Our first task would normally be to rewrite this as a series of pairs of hexadecimal digits (in actual practice it would be more common for the bit stream to be presented already in pairs of hexadecimal digits). For example, if we encountered a bit stream… that is, a sequence of 0’s and 1’s… which we knew represented a collection of ASCII encoded text, the first step would be to convert this stream into bytes written in hexadecimal notation. Note that this stage is normally handled automatically, unless the stream is coming from a source which is using a 7-bit (or other non-8-bit) ASCII transmission.
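As an illustration of this first step (the bit stream below is an example, not the slide's own), a short Python sketch that groups a run of 0s and 1s into 8-bit bytes and writes each byte as a pair of hexadecimal digits:

```python
# Group a string of bits into 8-bit bytes and show each byte as two hex digits.
bit_stream = "0101010001101000011001010010000000110011"   # example bits, spelling "The 3"

hex_pairs = [
    format(int(bit_stream[i:i + 8], 2), "02X")   # each 8-bit slice -> 2 hex digits
    for i in range(0, len(bit_stream), 8)
]
print(" ".join(hex_pairs))   # -> 54 68 65 20 33
```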

13 Sample ASCII Decoding - 2
Write down the alphabet and beside each letter write its ASCII code: A: 41h, B: 42h, C: 43h, … I: 49h, J: 4Ah, K: 4Bh, … Z: 5Ah (for lower case add 20h). Remember: digits are 3?h, blank is 20h, LF is 0Ah, CR is 0Dh. Write a table with all the letters of the alphabet in sequence, from A to Z; beside ‘A’ write 41h, beside ‘B’ write 42h, and so on until you have written 5Ah beside ‘Z’ (note that the number after 49h is 4Ah… not 50h… this is a common place to make an error). If you find it more convenient, you might like to re-write this table with the lower case letters ‘a’ to ‘z’ with their corresponding codes 61h to 7Ah… or you can simply remember to subtract 20h from any ASCII code value between 61h and 7Ah to find its upper case equivalent code. You will also need to remember the digit codes, 30h to 39h… perhaps you might want to write these values down as a table too… but with very little practice, they become trivially obvious… for example, 37h is the ASCII code for the digit character ‘7’. Finally, remember 20h is a blank, 0Ah is a line feed, and 0Dh is a carriage return, and you will be able to translate a large portion of the encoded text.

14 Sample ASCII Decoding - 3
Given the ASCII hexadecimal pattern (as an example): 54 68 65 20 33 0A 0D 47 6F 61 74 73. Matching these codes to the table we created, we should have no trouble converting this into the text: The 3 Goats. Consider the hexadecimal ASCII code sequence supplied on this slide as a final example. The code 54 (in hex) represents the letter ‘T’ from the table we should have written out by this point. 68 (hex) is in the range 61h to 7Ah, so we subtract 20(hex) from it and look up 48(hex); 48(hex) is the code for an upper case ‘H’… therefore the character represented by 68(hex) is a lower case ‘h’. Similarly, 65(hex) is reduced to 45(hex), the code for ‘E’; so the next character is a lower case ‘e’. 20(hex) is a blank (we just need to remember this one). 33(hex) is an encoded value in the range 30h to 39h and therefore represents a digit character, namely (in this case) ‘3’. 0Ah is a line feed (remember it), so we drop down one line; and 0Dh is a carriage return (remember it, too), so we move back to the left side of our working space. 47(hex) is the ASCII code for a ‘G’ (from our table), and so on…

15 Note on End-Of-Line Codes
Different operating systems use different standards for indicating an end of line. Microsoft uses a two-character sequence: 0Dh 0Ah (carriage return and line feed). Unix uses only 0Ah (the line feed). Classic Macintosh used only 0Dh (the carriage return). This can cause some problems when moving text files from one system to another. A problem often arises when copying ASCII text files from one type of computer or operating system to another. This problem is the result of there not being an agreed upon standard for how an end-of-line is indicated. Microsoft used a two character code, the carriage return followed by the line feed; this permits (at least in theory) dropping down one line without going back to the left edge or going back to the left edge without dropping down to a new line… perhaps to overtype on existing text. Unix uses only the line feed, corresponding to the single key entry of an “Enter” key. Classic Macintosh computers used only the carriage return. As a typical problem which might occur, consider the situation where an ASCII text file has been uploaded to a UNIX web server from a Microsoft Windows operating system, as a binary file, and then downloaded to (another) computer running Microsoft Windows, but as a text file. When uploaded as a binary file, no consideration is given to the meaning of the bytes as characters (this means that the carriage return code, 0Dh, might have unexpected results if someone on the UNIX computer tried to look at this file as a text document, since UNIX assigns no line-ending meaning to the 0Dh code value). When this file is downloaded as an ASCII text file, the download recognizes that it is coming from a UNIX computer (which uses only line feed codes as line terminators) to a Microsoft Windows computer (which expects both a carriage return and a line feed code). Therefore, the download process replaces each line feed code from the UNIX computer with a carriage return and a line feed… but since a carriage return code was already there, each line of the downloaded text now carries an extra carriage return, which many programs display as an extra line break, so the text appears to be double spaced.
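A minimal Python sketch of one common fix: normalising whatever line endings a file arrives with to a single convention (here the Unix line feed):

```python
# Normalise mixed line endings to a single convention (LF, 0Ah).
def to_unix_line_endings(text: str) -> str:
    # Replace Windows CR LF pairs first, then any remaining bare CRs (classic Mac files).
    return text.replace("\r\n", "\n").replace("\r", "\n")

sample = "line one\r\nline two\rline three\n"
print(to_unix_line_endings(sample).encode())   # b'line one\nline two\nline three\n'
```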

16 Activity – Text to ASCII Conversion
Complete the following sections of the Data Representation and Data Compression worksheet: Hex Binary

17 Data Compression

18 Paper fold Challenge How many times can you fold a piece of paper in half?

19 Why Data Compression? Make optimal use of limited storage space
Save time and help to optimize resources. If compression and decompression are done in the I/O processor, less time is required to move data to or from the storage subsystem, freeing the I/O bus for other work. In sending data over a communication line: less time to transmit and less storage needed at the host.

20 Data Compression- Entropy
Entropy is the measure of information content in a message. Messages with higher entropy carry more information than messages with lower entropy. To determine the entropy: find the probability p(x) of symbol x in the message; the entropy H(x) of the symbol x is H(x) = -p(x) · log2 p(x). The average entropy over the entire message is the sum of the entropy of all n symbols in the message.
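A short Python sketch of this calculation, treating the message as a simple string of symbols:

```python
import math
from collections import Counter

def average_entropy(message: str) -> float:
    """Average information content per symbol of the message, in bits (Shannon entropy)."""
    counts = Counter(message)
    total = len(message)
    entropy = 0.0
    for symbol, count in counts.items():
        p = count / total                 # probability p(x) of this symbol
        entropy += -p * math.log2(p)      # H(x) = -p(x) * log2 p(x), summed over the symbols
    return entropy

print(average_entropy("AAAAAAAB"))   # low entropy: one symbol dominates
print(average_entropy("ABCDEFGH"))   # 3.0 bits: eight equally likely symbols
```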

21 Data Compression Methods
Data compression is about storing and sending a smaller number of bits. There are two major categories of methods to compress data: lossless and lossy methods.

22 Lossless Compression Methods
In lossless methods, original data and the data after compression and decompression are exactly the same. Redundant data is removed in compression and added during decompression. Lossless methods are used when we can’t afford to lose any data: legal and medical documents, computer programs.

23 Run-length encoding Simplest method of compression.
How: replace consecutive repeating occurrences of a symbol by one occurrence of the symbol followed by the number of occurrences. The method can be more efficient if the data uses only two symbols (0s and 1s) in its bit patterns and one symbol is more frequent than the other.
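A minimal run-length encoder along these lines (a sketch; practical schemes differ in how the counts are stored, especially for binary data):

```python
def run_length_encode(data: str) -> str:
    """Replace each run of a repeated symbol with the symbol followed by its count."""
    if not data:
        return ""
    encoded = []
    current, count = data[0], 1
    for symbol in data[1:]:
        if symbol == current:
            count += 1                     # still inside the same run
        else:
            encoded.append(f"{current}{count}")
            current, count = symbol, 1     # start a new run
    encoded.append(f"{current}{count}")
    return "".join(encoded)

print(run_length_encode("AAAABBBCCDAA"))   # -> A4B3C2D1A2
```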

24 Activity – Kid Fax Using the Kid Fax worksheet, what four images do you generate? Hint: you will only use two colours, white and black. The first number is the number of white blocks, the second number is the number of black blocks, and the remaining numbers continue to alternate between white and black blocks.

25 Activity – Run Length Encoding
Complete the Run Length Encoding section on the Data Representation and Data Compression Worksheet Extension How many bits are now needed for storing/transmitting? What is the compression ratio in this case?

26 Huffman Coding Assign fewer bits to symbols that occur more frequently and more bits to symbols that appear less often. There is no unique Huffman code, and every Huffman code has the same average code length. Algorithm: 1. Make a leaf node for each code symbol. 2. Add the generation probability of each symbol to its leaf node. 3. Take the two nodes with the smallest probability and connect them into a new node. 4. Add 1 or 0 to each of the two branches. 5. The probability of the new node is the sum of the probabilities of the two connecting nodes. 6. If there is only one node left, the code construction is completed; if not, go back to step 3.
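A compact Python sketch of this construction using a priority queue; as noted above there is no unique Huffman code, so this is just one valid assignment:

```python
import heapq
from collections import Counter

def huffman_codes(message: str) -> dict:
    """Build one valid Huffman code table for the symbols in a message."""
    counts = Counter(message)
    # Each heap entry: (probability, tie-breaker, {symbol: code-so-far}).
    heap = [(count / len(message), i, {symbol: ""})
            for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)        # the two nodes with the smallest probability
        p1, tie, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}        # prefix one branch with 0 ...
        merged.update({s: "1" + c for s, c in codes1.items()})  # ... and the other with 1
        heapq.heappush(heap, (p0 + p1, tie, merged))            # new node's probability is the sum
    return heap[0][2]

message = "BCAADDDCCACACAC"
codes = huffman_codes(message)
print(codes)                                  # the more frequent symbols get the shorter codes
print("".join(codes[s] for s in message))     # the Huffman-encoded bit string
```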

27 Huffman Coding Example

28 Huffman Coding Encoding Decoding

29 Activity – Huffman Coding
Complete the following worksheets: You can say that again Extra for experts Short and sweet Extra for real experts

30 Lempel Ziv Encoding It is a dictionary-based encoding. Basic idea:
Create a dictionary (a table) of strings used during communication. If both sender and receiver have a copy of the dictionary, then previously-encountered strings can be substituted by their index in the dictionary.

31 Lempel Ziv Compression
It has two phases: building an indexed dictionary and compressing a string of symbols. Algorithm: Extract, from the remaining uncompressed string, the smallest substring that cannot be found in the dictionary. Store that substring in the dictionary as a new entry and assign it an index value. The substring is replaced with the index found in the dictionary. Insert the index and the last character of the substring into the compressed string.
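A Python sketch of this compression phase (a simplified LZ78-style encoder; practical variants such as LZW differ in the details of what is stored):

```python
def lz78_compress(text: str):
    """Compress text into (dictionary index, character) pairs, building the dictionary as it goes."""
    dictionary = {"": 0}              # index 0 stands for the empty prefix
    output = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch             # keep growing the substring while it is already known
        else:
            output.append((dictionary[current], ch))    # index of the known prefix + new character
            dictionary[current + ch] = len(dictionary)  # store the new substring as a new entry
            current = ""
    if current:                       # leftover prefix that was already in the dictionary
        output.append((dictionary[current[:-1]], current[-1]))
    return output

print(lz78_compress("BAABABBBAABBBBAA"))
# -> [(0, 'B'), (0, 'A'), (2, 'B'), (3, 'B'), (1, 'A'), (4, 'B'), (5, 'A')]
```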

32 Lempel Ziv Compression
Compression process

33 Lempel Ziv Decompression
It is just the inverse of the compression process.
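Correspondingly, a decompressor only has to rebuild the same dictionary from the (index, character) pairs; a sketch matching the encoder above:

```python
def lz78_decompress(pairs):
    """Rebuild the original text from (dictionary index, character) pairs."""
    dictionary = [""]                    # entry 0 is the empty prefix, as in the compressor
    pieces = []
    for index, ch in pairs:
        entry = dictionary[index] + ch   # known prefix + the newly transmitted character
        dictionary.append(entry)         # the decompressor rebuilds exactly the same dictionary
        pieces.append(entry)
    return "".join(pieces)

pairs = [(0, "B"), (0, "A"), (2, "B"), (3, "B"), (1, "A"), (4, "B"), (5, "A")]  # output of the encoder sketch
print(lz78_decompress(pairs))            # -> BAABABBBAABBBBAA
```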

34 Activity – Hour of Code Use the Hour of Code activity: Text Compression to learn how to apply Lempel Ziv Compression. Try this activity with the following song lyrics: Video: Activity:

35 Lossy Compression Methods
Used for compressing images and video files (our eyes cannot distinguish subtle changes, so lossy data is acceptable). These methods are cheaper and need less time and space. Several methods: JPEG: compresses pictures and graphics; MPEG: compresses video; MP3: compresses audio.

36 JPEG Encoding Used to compress pictures and graphics.
In JPEG, a grayscale picture is divided into 8x8 pixel blocks to decrease the number of calculations. Basic idea: change the picture into a linear (vector) set of numbers that reveals the redundancies. The redundancies are then removed by one of the lossless compression methods.

37 JPEG Encoding Discrete Cosine Transform (DCT)
DCT transforms the 64 values in an 8x8 pixel block in such a way that the relative relationships between pixels are kept but the redundancies are revealed. Example: a gradient grayscale.
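A sketch of the 2D DCT on one 8x8 block in plain Python, written as a direct (and deliberately slow) transcription of the usual DCT-II formula; real encoders use fast algorithms and also level-shift the pixel values first:

```python
import math

def dct_8x8(block):
    """Plain 2D DCT-II of an 8x8 block of values."""
    def c(k):                                     # normalising factor for row/column 0
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out

# A uniform grey block: all of the energy ends up in the single top-left ("DC") coefficient.
flat = [[128] * 8 for _ in range(8)]
transformed = dct_8x8(flat)
print(round(transformed[0][0]), round(transformed[3][5]))   # -> 1024 0
```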

38 Quantization & Compression
After the T table is created, the values are quantized to reduce the number of bits needed for encoding. Quantization divides each value by a constant and then drops the fraction. This is done to optimize the number of bits and the number of 0s for each particular application. Compression: quantized values are read from the table and redundant 0s are removed. To cluster the 0s together, the table is read diagonally in a zigzag fashion. The reason is that if the table doesn't have fine changes, the bottom right corner of the table is all 0s. JPEG usually uses lossless run-length encoding at the compression phase.
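A sketch of the quantise-then-zigzag step on an 8x8 table, with the simplifying assumption of a single constant divisor rather than a full JPEG quantisation matrix:

```python
def quantize(table, divisor=16):
    """Divide every value by a constant and drop the fraction (crude uniform quantisation)."""
    return [[int(value / divisor) for value in row] for row in table]

def zigzag(table):
    """Read an 8x8 table along anti-diagonals so that trailing zeros cluster at the end."""
    order = sorted(
        ((x, y) for x in range(8) for y in range(8)),
        key=lambda p: (p[0] + p[1],                          # which anti-diagonal
                       p[0] if (p[0] + p[1]) % 2 else p[1])  # alternate the direction on each one
    )
    return [table[x][y] for x, y in order]

# A table with no fine detail: everything outside the top-left corner quantises to 0,
# so the zigzag read-out ends in one long run of zeros ready for run-length encoding.
example = [[100 - 10 * (x + y) if x + y < 4 else 0 for y in range(8)] for x in range(8)]
print(zigzag(quantize(example)))
```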

39 JPEG Encoding

40 MPEG Encoding Used to compress video. Basic idea:
Each video is a rapid sequence of frames. Each frame is a spatial combination of pixels, or a picture. Compressing video = spatially compressing each frame + temporally compressing a set of frames.

41 MPEG Encoding Spatial Compression Temporal Compression
Spatial compression: each frame is spatially compressed by JPEG. Temporal compression: redundant frames are removed. For example, in a static scene in which someone is talking, most frames are the same except for the segment around the speaker's lips, which changes from one frame to the next.
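A toy Python sketch of the temporal idea, keeping only the pixels that changed between two frames (real MPEG works with motion-compensated macroblocks and I/P/B frames, so this only illustrates the principle):

```python
def frame_delta(previous, current):
    """Keep only the pixels that changed from the previous frame, as (index, value) pairs."""
    return [(i, new) for i, (old, new) in enumerate(zip(previous, current)) if new != old]

def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the stored changes."""
    frame = list(previous)
    for i, value in delta:
        frame[i] = value
    return frame

frame1 = [10, 10, 10, 10, 200, 200, 10, 10]   # a "static scene"
frame2 = [10, 10, 10, 10, 190, 210, 10, 10]   # only the region around indices 4-5 changed
delta = frame_delta(frame1, frame2)
print(delta)                                   # [(4, 190), (5, 210)] - far smaller than a full frame
print(apply_delta(frame1, delta) == frame2)    # True
```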

42 Audio Compression Used for speech or music
Speech: compress a 64 kHz digitized signal. Music: compress a 1.411 MHz signal. Two categories of techniques: predictive encoding and perceptual encoding.

43 Audio Encoding Predictive Encoding Perceptual Encoding: MP3
Predictive encoding: only the differences between samples are encoded, not the whole sample values. Several standards: GSM (13 kbps), G.729 (8 kbps), and G.723.1 (6.4 or 5.3 kbps). Perceptual encoding (MP3): CD-quality audio needs at least 1.411 Mbps and cannot be sent over the Internet without compression. MP3 (MPEG audio layer 3) uses a perceptual encoding technique to compress audio.
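A toy sketch of the predictive idea on a list of samples, storing each sample as its difference from the previous one (real codecs such as GSM or G.729 use much more elaborate predictors):

```python
def predictive_encode(samples):
    """Store the first sample, then only the difference between each sample and the previous one."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def predictive_decode(deltas):
    """Rebuild the samples by accumulating the stored differences."""
    samples = [deltas[0]]
    for d in deltas[1:]:
        samples.append(samples[-1] + d)
    return samples

samples = [1000, 1004, 1010, 1012, 1011, 1009]   # a slowly varying signal
encoded = predictive_encode(samples)
print(encoded)                                    # [1000, 4, 6, 2, -1, -2] - small numbers need fewer bits
print(predictive_decode(encoded) == samples)      # True
```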

44 Background Reading ASCII
ASCII text to Binary Converter. Compression Coding. Run-length Encoding: Image representation, Colour by Numbers, Interactive Run Length Encoding. Huffman Coding: Text Compression.

45 Any questions? Refreshments in diner 4pm

46

