1
Varying Character Lengths
Addendum on Strings
Copyright © Curt Hill
2
How Long is a Character?
It depends on whom you ask:
  5 bit – telegraph
  6 bit – Control Data Display Code
  7 bit – ASCII
  8 bit – EBCDIC
  8 or 16 bit – Multi Byte Character Set
  16 bit – Unicode
Let us take another shot at this
We will ignore the 5 bit telegraph code
3
ASCII
American Standard Code for Information Interchange
A 7 bit code:
  control characters (0-31)
  blank (32)
  punctuation and math (33-47)
  digits (48-57)
  more punctuation (58-64)
  upper case (65-90)
  more punctuation (91-96)
  lower case (97-122)
  more punctuation (123-126), with the delete control character at 127
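The ASCII ranges can be turned directly into a small classifier. This is a minimal sketch: the function name `ascii_class` and the returned labels are made up for illustration, and the upper case range (65-90) and final punctuation range (123-126) are the standard ASCII assignments.

```cpp
#include <string>

// Classify a 7 bit ASCII code by the range it falls in.
// The name ascii_class and the label strings are illustrative only.
std::string ascii_class(int code) {
    if (code < 0 || code > 127) return "not ASCII";
    if (code <= 31)  return "control";
    if (code == 32)  return "blank";
    if (code <= 47)  return "punctuation";
    if (code <= 57)  return "digit";
    if (code <= 64)  return "punctuation";
    if (code <= 90)  return "upper case";
    if (code <= 96)  return "punctuation";
    if (code <= 122) return "lower case";
    if (code <= 126) return "punctuation";
    return "control";  // 127 is the delete character
}
```

Because the ranges are contiguous, each test only needs to check the upper bound of its range.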
4
ASCII Again
Usually stored in a signed 8 bit byte
  The negative values were often used for graphic characters
European countries often used a variant
  $ removed in favor of the local currency symbol
  Accented or other characters might also be added
Preferred by some manufacturers
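Storing characters in a signed byte means that codes above 127 show up as negative numbers. A minimal sketch of recovering the 0-255 code is below; the function name `code_of` is an assumption made for this example, and the value 0xE9 is just an arbitrary code above 127.

```cpp
// Recover the 0-255 character code from a signed 8 bit byte.
// Converting through unsigned char maps the negative values back
// to the range 128-255. The name code_of is illustrative only.
int code_of(signed char c) {
    return static_cast<unsigned char>(c);
}
```

This is why table lookups indexed by a character usually convert to unsigned char first.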
5
EBCDIC
Extended Binary Coded Decimal Interchange Code
An 8 bit code:
  control characters (0-63)
  blank (64)
  punctuation and math (65-127)
  lower case (129-169, with gaps)
  upper case (193-233, with gaps)
  digits (240-249)
The gaps are sometimes unassigned, sometimes occupied
6
ASCII & EBCDIC
Different but essentially equivalent character sets
Converting one to the other was a routine task
  Sometimes assisted by machine language instructions
Either works pretty well, with small variations, for European languages with a small alphabet
Not so well for pictographic languages
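Conversion between the two codes is typically a table lookup. The sketch below builds only a partial ASCII to EBCDIC table (blank, digits and upper case), using the standard EBCDIC positions for those characters; the function name is an assumption, and a real table would cover all 256 byte values.

```cpp
#include <array>
#include <cstdint>

// Build a partial ASCII -> EBCDIC table: blank, digits, upper case only.
// Entries that are not filled in stay 0. The name is illustrative only.
std::array<std::uint8_t, 128> make_ascii_to_ebcdic() {
    std::array<std::uint8_t, 128> t{};
    t[32] = 0x40;  // blank
    for (int i = 0; i < 10; ++i)  // '0'..'9' are contiguous in both codes
        t[48 + i] = static_cast<std::uint8_t>(0xF0 + i);
    for (int i = 0; i < 9; ++i)   // 'A'..'I'
        t[65 + i] = static_cast<std::uint8_t>(0xC1 + i);
    for (int i = 0; i < 9; ++i)   // 'J'..'R', after a gap in EBCDIC
        t[74 + i] = static_cast<std::uint8_t>(0xD1 + i);
    for (int i = 0; i < 8; ++i)   // 'S'..'Z', after another gap
        t[83 + i] = static_cast<std::uint8_t>(0xE2 + i);
    return t;
}
```

The three separate loops for the letters reflect the gaps in the EBCDIC alphabet; in ASCII the same letters are one contiguous run.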
7
CDC Display Code
The odd one among the single byte codes
CDC made number crunchers
  Character processing was an afterthought
Six bits is not enough
  Upper and lower case plus digits is 62 of the possible 64
  Well, almost
They needed to cheat to get lower case letters
8
The Cheat
They had a special character, the escape, that indicated a second character was to be used
Thus a string would be composed of characters that were either 6 or 12 bits long
A 12 bit character had the escape as its first 6 bit code and another code as its second
This allows it to be as capable as the 7 bit code of ASCII
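The escape scheme can be sketched as a loop that counts logical characters in a sequence of 6 bit codes. The escape value 62 used here is purely hypothetical (the actual Display Code assignment is not given in these slides), as are the names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical escape value, chosen only for this illustration.
constexpr unsigned kEscape = 62;

// Count logical characters in a sequence of 6 bit codes, where an escape
// code and the code that follows it together form one 12 bit character.
std::size_t count_characters(const std::vector<unsigned>& codes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        if (codes[i] == kEscape && i + 1 < codes.size()) {
            ++i;  // skip the second half of the 12 bit character
        }
        ++count;
    }
    return count;
}
```

The number of characters is no longer the number of codes, which is the same complication that returns later with multi byte character sets.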
9
Not enough money
After the industry had matured somewhat, it was realized that certain countries would never convert to a western-style alphabet
  Large market countries like China, India and Japan
This gave rise to several attempts to achieve much larger character sets
  Multi Byte Character Set (MBCS)
  Unicode
10
MBCS
Similar to the CDC scheme except each unit is 8 bits
Values outside of ASCII may be escape (lead) bytes
  0x81-0x9F and 0xE0-0xFC
This allows for access to foreign markets
Subscripting into such a string becomes impossible
  The characters are variable length
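A short sketch shows why subscripting breaks down: finding the n-th character requires walking the string from the start. It assumes the lead byte ranges quoted above (0x81-0x9F and 0xE0-0xFC) and that every lead byte is followed by exactly one more byte; the function names are made up for the example.

```cpp
#include <cstddef>

// True if the byte falls in the lead byte ranges 0x81-0x9F or 0xE0-0xFC.
bool is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters (not bytes) in a null terminated multi byte string.
std::size_t mb_char_count(const char* s) {
    std::size_t count = 0;
    while (*s != '\0') {
        unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte consumes the following byte too, if there is one.
        s += (is_lead_byte(b) && s[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}
```

For a string holding 'A', one two byte character and 'B', the byte length is 4 but the character count is 3.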
11
Unicode
A constant size 16 bit code
First byte selects a ‘language’
  ASCII is 0
  Italian is 10
  Some languages need multiple codes
Second byte is the ‘character’
Takes more space than MBCS but is easier in most respects
On Intel machines the two bytes are stored in reversed order
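The two byte layout and the Intel byte order can be sketched as follows. The code treats a character as a plain 16 bit value, as the slide does; the function names are assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

// Split a 16 bit code into its first (high) and second (low) byte.
std::uint8_t high_byte(char16_t c) { return static_cast<std::uint8_t>(c >> 8); }
std::uint8_t low_byte(char16_t c)  { return static_cast<std::uint8_t>(c & 0xFF); }

// Lay the two bytes out the way a little-endian (Intel) machine stores
// them in memory: low byte first, high byte second.
std::array<std::uint8_t, 2> little_endian_bytes(char16_t c) {
    return {{low_byte(c), high_byte(c)}};
}
```

For the letter A (0x0041) the first byte is 0 and the second is 0x41, so the ASCII characters keep their familiar values in the second byte.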
12
Language Support
Java cheated and adopted Unicode from the start
  Easy
C++ is more complicated
  Since it is more capable
  It also maintains compatibility with older versions of C
  It needs new character types
13
Single Byte
Regardless of the single byte character set chosen, C++ uses the type char for one byte characters
You should be familiar with single byte strings
Strings are terminated by a null or zero byte
A variety of string functions exist, such as strlen, strcpy, strcat, etc.
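As a reminder of the single byte functions, here is a small sketch that joins two C strings with strlen, strcpy and strcat. The helper name `join_narrow` and the 64 character buffer are arbitrary choices for the example.

```cpp
#include <cstring>
#include <string>

// Join two null terminated strings in a fixed buffer using the classic
// single byte functions. Returns an empty string if they do not fit.
std::string join_narrow(const char* a, const char* b) {
    char buffer[64];
    if (std::strlen(a) + std::strlen(b) >= sizeof buffer) return "";
    std::strcpy(buffer, a);  // copies a, including its terminating null
    std::strcat(buffer, b);  // appends b at the position of that null
    return buffer;
}
```

The length check leaves room for the terminating null, which strlen does not count.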
14
MBCS
Sometimes referred to as double byte (DBCS)
Strings are still arrays of char
  Terminated by a single byte null or zero
There are a variety of string functions just for multi byte strings
  Where the single byte function was named strXXX we now have a function _mbsXXX
  Thus strcat is single byte, while _mbscat is a multi byte concatenation function
15
Unicode
Cannot use the char type since a char is one byte and Unicode uses two
Instead we use wchar_t
  Wide Char Type
As with MBCS, we terminate with a null character
  Which is now two bytes
16
Unicode String Functions
As with MBCS, we have a variety of string functions with a different prefix
These start with wcs instead of str
So the copy is now wcscpy
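The wide functions mirror the single byte ones almost name for name. The sketch below joins two wide strings with wcslen, wcscpy and wcscat from the standard header; the helper name `join_wide` and the buffer size are arbitrary. Note that wchar_t is two bytes on Windows but is often four bytes on other platforms, so the code counts elements rather than bytes.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Join two null terminated wide strings in a fixed buffer using the
// wcs functions. Returns an empty string if they do not fit.
std::wstring join_wide(const wchar_t* a, const wchar_t* b) {
    wchar_t buffer[64];
    const std::size_t capacity = sizeof buffer / sizeof buffer[0];
    if (std::wcslen(a) + std::wcslen(b) >= capacity) return L"";
    std::wcscpy(buffer, a);  // wide counterpart of strcpy
    std::wcscat(buffer, b);  // wide counterpart of strcat
    return buffer;
}
```

Wide string literals take the L prefix, as in L"Hello".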
17
Windows
Windows has maintained these parallel string libraries for some time
  Single byte
  Multiple byte
  Unicode
In Windows 10 the default internal type becomes Unicode
Thus Common Dialog Boxes return a Unicode file name
  Even if the file contents are still in ASCII
18
Windows APIs
Many text functions have two versions
Consider SetWindowText, which sets the title bar text of a window
It comes in two flavors:
  SetWindowTextA – ASCII/MBCS
  SetWindowTextW – Unicode
MessageBox similarly comes in both flavors
19
Defines
To simplify things there are some defines that give common types
TCHAR is defined as the standard character
  If UNICODE is defined then TCHAR is made into a wchar_t
  Otherwise a char
Similarly LPTSTR is defined to be a pointer to a TCHAR string
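The mechanism behind these defines can be imitated in portable code. In the sketch below every name beginning with demo_ or DEMO_ is a hypothetical stand-in invented for this example; the real TCHAR and LPTSTR come from the Windows headers and are keyed on UNICODE.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the Windows names. Defining DEMO_UNICODE
// before this point switches every name to its wide version.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_TEXT(s) L##s
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::wcslen(s); }
#else
typedef char demo_tchar;
#define DEMO_TEXT(s) s
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::strlen(s); }
#endif

typedef demo_tchar* demo_lptstr;  // plays the role of LPTSTR
```

Code written against the demo_ names, for example `demo_tcslen(DEMO_TEXT("Title"))`, compiles unchanged in either mode, which is the point of the real defines as well.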
20
Finally
Now we should be ready to handle any kind of C style string