Strings.

Slides:



Advertisements
Similar presentations
C Characters & Strings Character Review Character Handling Library Initialization String Conversion Functions String Handling Library Standard Input/Output.
Advertisements

 2003 Prentice Hall, Inc. All rights reserved Fundamentals of Characters and Strings Character constant –Integer value represented as character.
Strings.
Current Assignments Homework 5 will be available tomorrow and is due on Sunday. Arrays and Pointers Project 2 due tonight by midnight. Exam 2 on Monday.
Primitive Data Types There are a number of common objects we encounter and are treated specially by almost any programming language These are called basic.

 2000 Prentice Hall, Inc. All rights reserved. Chapter 8 - Characters and Strings Outline 8.1Introduction 8.2Fundamentals of Strings and Characters 8.3Character.
© Copyright 1992–2004 by Deitel & Associates, Inc. and Pearson Education Inc. All Rights Reserved Fundamentals of Strings and Characters Characters.
Introduction to Computers and Programming Lecture 7:
Hash Tables1 Part E Hash Tables  
Working with the data type: char  2000 Prentice Hall, Inc. All rights reserved. Modified for use with this course. Introduction to Computers and Programming.
Hash Tables1 Part E Hash Tables  
Characters and Strings. Characters In Java, a char is a primitive type that can hold one single character A character can be: –A letter or digit –A punctuation.
C Programming Lecture 3. The Three Stages of Compiling a Program b The preprocessor is invoked The source code is modified b The compiler itself is invoked.
Input & Output: Console
Chars and strings Jordi Cortadella Department of Computer Science.
Number Systems Spring Semester 2013Programming and Data Structure1.
Why does it matter how data is stored on a computer? Example: Perform each of the following calculations in your head. a = 4/3 b = a – 1 c = 3*b e = 1.
Week 1 Algorithmization and Programming Languages.
Character Arrays Based on the original work by Dr. Roger deBry Version 1.0.
Java Overview. Comments in a Java Program Comments can be single line comments like C++ Example: //This is a Java Comment Comments can be spread over.
Introducing Python CS 4320, SPRING Lexical Structure Two aspects of Python syntax may be challenging to Java programmers Indenting ◦Indenting is.
Dale Roberts Department of Computer and Information Science, School of Science, IUPUI C-Style Strings Strings and String Functions Dale Roberts, Lecturer.
Vladimir Misic: Characters and Strings1Tuesday, 9:39 AM Characters and Strings.
Chapter 4 Literals, Variables and Constants. #Page2 4.1 Literals Any numeric literal starting with 0x specifies that the following is a hexadecimal value.
School of Computer Science & Information Technology G6DICP - Lecture 4 Variables, data types & decision making.
Strings. Our civilization has been built on them but we are moving towards to digital media Anyway, handling digital media is similar to.. A string is.
Chapter 8 Characters and Strings. Objectives In this chapter, you will learn: –To be able to use the functions of the character handling library ( ctype).
Characters and Strings
 Variables are nothing but reserved memory locations to store values. This means that when you create a variable you reserve some space in memory. 
Character representation in the computers Home Assignment 1 Assigned. Deadline 2016 January 24th, Sunday.
Java Basics. Tokens: 1.Keywords int test12 = 10, i; int TEst12 = 20; Int keyword is used to declare integer variables All Key words are lower case java.
17-Mar-16 Characters and Strings. 2 Characters In Java, a char is a primitive type that can hold one single character A character can be: A letter or.
Principles of Programming - NI Chapter 10: Character & String : In this chapter, you’ll learn about; Fundamentals of Strings and Characters The difference.
Basic Data Types อ. ยืนยง กันทะเนตร คณะเทคโนโลยีสารสนเทศและการสื่อสาร มหาวิทยาลัยพะเยา Chapter 4.
Chapter 2 Variables and Constants. Objectives Explain the different integer variable types used in C++. Declare, name, and initialize variables. Use character.
1 ENERGY 211 / CME 211 Lecture 3 September 26, 2008.
EC-111 Algorithms & Computing Lecture #10 Instructor: Jahan Zeb Department of Computer Engineering (DCE) College of E&ME NUST.
Week 2 - Wednesday CS 121.
Machine level representation of data Character representation
Lec 3: Data Representation
Computer Architecture & Operations I
Data Representation.
INC 161 , CPE 100 Computer Programming
Pointers & Arrays 1-d arrays & pointers 2-d arrays & pointers.
Fundamentals of Characters and Strings
Chapter 6: Data Types Lectures # 10.
EPSII 59:006 Spring 2004.
Binary, Decimal and Hexadecimal Numbers
13 Text Processing Hongfei Yan June 1, 2016.
Chapter 8 - Characters and Strings
COMP3221: Microprocessors and Embedded Systems
Strings, Line-by-line I/O, Functions, Call-by-Reference, Call-by-Value
Lecture 8b: Strings BJ Furman 15OCT2012.
Lecture 2 Data representation
Review for Final Exam.
Introduction to Primitive Data types
Strings.
Variables Title slide variables.
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
Digital Encodings.
Strings.
B065: PROGRAMMING Variables 1.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Lecture 19: Working with strings
Pattern Matching 4/27/2019 1:16 AM Pattern Matching Pattern Matching
B065: PROGRAMMING Variables 2.
Variables and Constants
Introduction to Primitive Data types
Section 6 Primitive Data Types
Presentation transcript:

Strings

Strings Our civilization has been built on them but we are moving towards to digital media Anyway, handling digital media is similar to .. A string is a list of characters 성균관대학교정보통신학부 Am I a string? 一片花飛減却春

Character Code - ASCII Define 2**7 = 128 characters why 7 bits? char in C is similar to an integer ‘D’+ ‘a’ – ‘A’

more about ASCII What about other languages? Unicode there are more than 100 scripts some scripts have thousands characters Unicode several standards UTF16 UTF8 –variable length, leading zero means ASCII set. 1~6 bytes are used Programming Languages C, C++ treat char as a byte Java treat char as two bytes

UTF-8 Number of bytes Bits for code point First code point Last code point Byte 1 Byte 2 Byte 3 Byte 4 1 7 U+0000 U+007F 0xxxxxxx 2 11 U+0080 U+07FF 110xxxxx 10xxxxxx 3 16 U+0800 U+FFFF 1110xxxx 4 21 U+10000 U+10FFFF 11110xxx

String Representations null-terminated: C, C++ array with length: Java linked list of characters used in some special areas Your Choice should consider space occupied constraints on the contents O(1) access to i-th char (search, insert, delete) what if the length can be limited (file names)

null-terminated char str[6] = “Hello”; char arr1[] = “abc”; need for the termination printf(“format string”, str); /* str can be any length str H e l l \0

String Match Exact matching Inexact matching a BIG problem for Google, twitter, Naver, bioinfo.... native method preprocessing methods Boyer-Moore Algorithm Utf-8 raises a limit Inexact matching this problem itself can be a single whole course for a semester utf-8 accounts for 91% of all Web pages

String match(2) Problem Your solution find P in T how many comparisons? better way? T x a b y z P a b x y z

The naïve string search 1 2 3 4 5 6 7 8 9 10 11 12 x a b y z a b x y z * a b x y z ^ ^ ^ ^ ^ ^ ^ * a b x y z * a b x y z * a b x y z Matched! * a b x y z ^ ^ ^ ^ ^ ^ ^ ^

The naïve string search Worst case Find “aaaaab” in “aaaaaaaaaaaaaaaab” O(nm) Bring a better one next week !!

Problem Example There are millions documents that contain company names – Anderson, Enron, Lehman.. M&A, bankrupt force them to be changed Your mission change all the company names if they are changed

Input/Output Example 4 “Anderson Consulting” to “Accenture” “Enron” to “Dynegy” “DEC” to “Compaq” “TWA” to “American” 5 Anderson Accounting begat Anderson Consulting, which offered advice to Enron before it DECLARED bankruptcy, which made Anderson Consulting quite happy it changed its name in the first place!

Your Plan read the M&A list into a DB main loop Now, define define DB structure how to handle the double quote (“) sign? main loop compare each line of a doc with each DB entry if there is a match, replace it with a new name Now, define functions global variables

Your naive solution read data for changed company names to build a DB for each entry look for the whole documents if there is a match, change it old name new name Anderson Consulting Accenture Enron Dynegy DEC Compaq .... .....

Solution in Detail mergers[101][2] 1 Anderson Consulting\0 Accenture\0 1 Anderson Consulting\0 Accenture\0 Enron\0 Dynegy\0 2 DEC\0 Compaq\0 3 TWA\0 American\0 mergers[101][2]

top down approach

String y C o m p a q String s D E C L A R

Example Review Is it safe? Is it efficient? What if a quote sign is missing? what if there is no room for a new long company name? .... Is it efficient? space time

Character Library

String Library Why size_t instead of int? According to the 1999 ISO C standard (C99), size_t is an unsigned integer type of at least 16 bit. It is usually the largest integer representable while the size of int is fixed to 32 bits or 16 bits.

Doublets a dictionary up to 25,143 words, not exceeding 16 letters