LING 408/508: Computational Techniques for Linguists

1 LING 408/508: Computational Techniques for Linguists
Lecture 3

2 Last time
We talked about the underlying computer representation of numbers (2's complement for integers; 32/64 bit floating point)
Quick Homework 1 on floating point representation

3 Last time
Recall the speed of light: c = 2.99792458 x 10^8 (m/s)
The 32 bit floating point representation (float), sometimes called single precision, is composed of 1 bit sign, 8 bits exponent (unsigned with bias 2^(8-1)-1 = 127), and 23 bits coefficient (24 bits effective).
Can it represent c without loss of precision? The largest 24 bit integer is 2^24-1 = 16,777,215, and c = 299,792,458 is larger than that and is not a multiple of the float spacing at that magnitude (32). Nope
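The "Nope" can be checked directly. The following is a minimal Python sketch, not part of the original slides; the helper name `to_float32` is made up for illustration, and it relies on the standard `struct` module to round a value to the nearest 32 bit float.

```python
import struct

C = 299_792_458  # speed of light in m/s

def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float (hypothetical helper name)."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

c32 = to_float32(C)

print(C > 2**24 - 1)   # True: c does not fit in 24 significant bits
print(int(c32))        # the value a float actually stores
print(int(c32) - C)    # rounding error in m/s
```

Between 2^28 and 2^29 consecutive floats are 32 apart, so c is rounded to the nearest multiple of 32.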

4 hw1.xls

5 Representing zero
Let's not worry about representing zero in the spreadsheet…

6 Homework 1: Representing e
Wikipedia article on e:
Example application (Bernoulli; 1683): compound interest
An account starts with $1.00 and pays 100 percent interest per year. If the interest is credited once, at the end of the year, the value of the account at year-end will be $2.00. What happens if the interest is computed and credited more frequently during the year?
n=2, i.e. every 6 months: $2.25
n=4, i.e. quarterly: $2.4414
n=12, i.e. monthly: $2.6130
n=52, i.e. weekly: $2.6926
n=365, i.e. daily: $2.7146
n→∞, i.e. continuous: $2.7183… (e), an irrational number
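The year-end values follow from the formula (1 + 1/n)^n, where n is the number of times interest is credited. A short Python sketch of that calculation (the function name `year_end_value` is an assumption for illustration, not something from the slides):

```python
import math

def year_end_value(n: int) -> float:
    """Value of $1.00 after one year at 100% interest, credited n times."""
    return (1 + 1 / n) ** n

schedule = [("yearly", 1), ("every 6 months", 2), ("quarterly", 4),
            ("monthly", 12), ("weekly", 52), ("daily", 365)]

for label, n in schedule:
    print(f"n={n:<4} {label:<15} ${year_end_value(n):.4f}")

# The limit as n grows without bound is e.
print(f"continuous           ${math.e:.4f}")
```

The values creep up toward e but never reach it for any finite n.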

7 Quick Homework 1: Representing e
Using the spreadsheet, compute the best 32 bit approximation to e that you can come up with.
Submit a snapshot of the spreadsheet to me by Wednesday midnight
Subject: 408/508 Your Name Homework 1

8 Introduction: data types
char
How about letters, punctuation, etc.?
ASCII: American Standard Code for Information Interchange
Based on English alphabet (upper and lower case) + space + digits + punctuation + control (Teletype Model 33)
Question: how many bits do we need? 7 bits + 1 bit parity
Remember everything is in binary …
Teletype Model 33 ASR Teleprinter (Wikipedia)

9 Introduction: data types
order is important in sorting! 0-9: there’s a connection with BCD. Notice: code 30 (hex) through 39 (hex)
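Both points can be seen from Python's built-in `ord`. This is an illustrative sketch, not from the slides: the low four bits of the codes 30 (hex) through 39 (hex) are exactly the digit's BCD value, and plain string sorting follows ASCII code order (digits, then upper case, then lower case).

```python
# Digits: the low nibble of the ASCII code is the BCD value of the digit.
for ch in "059":
    code = ord(ch)
    print(ch, hex(code), format(code, "07b"), code & 0x0F)

bcd_values = [ord(d) & 0x0F for d in "0123456789"]

# Sorting compares character codes, so 'Z' (0x5A) comes before 'a' (0x61).
words = ["banana", "Zebra", "apple", "42"]
in_ascii_order = sorted(words)
print(in_ascii_order)
```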

10 Introduction: data types
Parity bit: transmission can be noisy
a parity bit can be added to the ASCII code
can spot single bit transmission errors
even/odd parity: receiver understands each byte should be even/odd
Example: 0 (zero) is ASCII 30 (hex) = 0110000; even parity: 00110000, odd parity: 10110000 (parity bit shown as the leading bit)
Checking parity: Exclusive or (XOR): basic machine instruction
A xor B is true if either A or B is true but not both
xor bit by bit over 00110000, carrying each result forward (a final 0 means even parity):
0 xor 0 = 0, 0 xor 1 = 1, 1 xor 1 = 0, 0 xor 0 = 0, 0 xor 0 = 0, 0 xor 0 = 0, 0 xor 0 = 0
x86 assembly language:
PF: even parity flag set by arithmetic ops.
TEST: AND (don't store result), sets PF
JP: jump if PF set
Example:
  MOV al,<char>
  TEST al, al
  JP <location if even>
  <go here if odd>
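The same check can be sketched in Python by XOR-ing the bits of a byte together. The function names below are hypothetical, and placing the parity bit in the top (eighth) bit is an assumption made for this illustration.

```python
from functools import reduce
from operator import xor

def parity(byte: int) -> int:
    """XOR all 8 bits together: 0 = even number of 1 bits, 1 = odd."""
    bits = [(byte >> i) & 1 for i in range(8)]
    return reduce(xor, bits)

def add_even_parity(code7: int) -> int:
    """Set the top bit so the whole byte has an even number of 1 bits."""
    return code7 | (parity(code7) << 7)

def add_odd_parity(code7: int) -> int:
    """Set the top bit so the whole byte has an odd number of 1 bits."""
    return code7 | ((parity(code7) ^ 1) << 7)

zero = ord("0")                                  # 0x30
print(format(add_even_parity(zero), "08b"))      # even-parity byte
print(format(add_odd_parity(zero), "08b"))       # odd-parity byte

# A receiver expecting even parity flags any byte whose parity is 1.
corrupted = add_even_parity(zero) ^ 0b00000100   # one bit flipped in transit
print(parity(corrupted))
```

A single flipped bit always changes the parity, which is why one-bit errors are detected; two flipped bits cancel out and go unnoticed.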

11 Introduction: data types
UTF-8 standard in the post-ASCII world backwards compatible with ASCII (previously, different languages had multi-byte character sets that clashed) Universal Character Set (UCS) Transformation Format 8-bits (Wikipedia)

12 Introduction: data types
Example:
あ Hiragana letter A: UTF-8: E38182
Byte 1: E = 1110, 3 = 0011
Byte 2: 8 = 1000, 1 = 0001
Byte 3: 8 = 1000, 2 = 0010
い Hiragana letter I: UTF-8: E38184
Shift-JIS (Hex): あ: 82A0, い: 82A2
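These byte values can be reproduced with Python's built-in codecs. The sketch below is illustrative only; `decode_3byte_utf8` is a made-up helper that strips the UTF-8 marker bits (1110xxxx 10xxxxxx 10xxxxxx) to recover the Unicode code point.

```python
def decode_3byte_utf8(data: bytes) -> int:
    """Recover the code point from a 3-byte UTF-8 sequence."""
    b1, b2, b3 = data
    # Keep 4 payload bits from byte 1 and 6 from each continuation byte.
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

for ch in "あい":
    utf8 = ch.encode("utf-8")
    sjis = ch.encode("shift_jis")
    print(ch, utf8.hex().upper(), sjis.hex().upper(),
          hex(decode_3byte_utf8(utf8)))
```

The same character gets unrelated byte sequences in the two encodings, which is why a file must be read with the encoding it was written in.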

13 Introduction: data types
How can you tell what encoding your file is using?
Detecting UTF-8
Microsoft: the first three bytes in the file are EF BB BF (not all software understands this; not everybody uses it)
HTML: <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> (not always present)
Analyze the file: look for invalid UTF-8 sequences: if any are found, the file is not UTF-8…
Interesting paper:
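Two of these checks are easy to sketch in Python. The function names are assumptions for illustration; the approach simply tries a strict decode and treats a failure as "not UTF-8". Passing the test does not prove the file is UTF-8 (pure ASCII, for example, is also valid UTF-8).

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def looks_like_utf8(data: bytes) -> bool:
    """True if the data contains no invalid UTF-8 sequences."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(has_utf8_bom(b"\xef\xbb\xbfhello"))
print(looks_like_utf8("あ".encode("utf-8")))
print(looks_like_utf8("あ".encode("shift_jis")))  # 82 A0 is not valid UTF-8
```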

14 Introduction: data types
Filesystem: different on different computers: sometimes a problem if you mount filesystems across different systems
Examples:
FAT32 (File Allocation Table): DOS, Windows, memory cards; limited to 4GB max file size
ExFAT (Extended FAT): SD cards (> 4GB files)
NTFS (New Technology File System): Windows
ext4 (Fourth Extended Filesystem): Linux
HFS+ (Hierarchical File System Plus): Macs

15 Introduction: data types
Filesystem: different on different computers: sometimes a problem if you mount filesystems across different systems
Files:
Name (Path from / root)
Type (e.g. .docx, .pptx, .pdf, .html, .txt)
Owner (usually the Creator)
Permissions (for the Owner, Group, or Everyone)
need to be opened (to read from or write to)
Mode: read/write/append, Binary/Text
in all programming languages: open command
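As one example of the open command and its modes, here is a small Python sketch. The scratch file name `demo.txt` and the temporary directory are assumptions for illustration; `newline="\n"` is passed on writing only so the bytes are the same on every platform.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8", newline="\n") as f:  # write: create/truncate
    f.write("あ\n")

with open(path, "a", encoding="utf-8", newline="\n") as f:  # append: add to the end
    f.write("い\n")

with open(path, "r", encoding="utf-8") as f:                # read in text mode
    text = f.read()

with open(path, "rb") as f:                                 # read in binary mode
    raw = f.read()

print(repr(text))   # decoded characters
print(raw.hex())    # the bytes actually stored on disk
```

Text mode decodes bytes into characters using the given encoding; binary mode returns the stored bytes untouched.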

16 Introduction: data types
Text files: text files have lines: how do we mark the end of a line?
End of line (EOL) control character(s): LF 0x0A (Mac/Linux), CR 0x0D (Old Macs), CR+LF 0x0D0A (Windows)
End of file (EOF) control character: EOT 0x04 (aka Control-D)
programming languages: NUL used to mark the end of a string
binaryvision.nl
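The three EOL conventions can be told apart by looking at the raw bytes. A minimal Python sketch, with hypothetical function names, that reports the convention in use and normalizes everything to LF:

```python
def detect_eol(data: bytes) -> str:
    """Report which end-of-line convention appears in the data."""
    if b"\r\n" in data:          # check CR+LF first: it contains both bytes
        return "CR+LF"
    if b"\r" in data:
        return "CR"
    if b"\n" in data:
        return "LF"
    return "none"

def to_lf(data: bytes) -> bytes:
    """Convert CR+LF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

windows = b"one\r\ntwo\r\n"
old_mac = b"one\rtwo\r"
unix = b"one\ntwo\n"

for sample in (windows, old_mac, unix):
    print(detect_eol(sample), to_lf(sample))
```

The order of the replacements matters: converting bare CR first would turn each CR+LF into two line breaks.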

