Chapter 23 Text Processing Bjarne Stroustrup www.stroustrup.com/Programming.

Slides:



Advertisements
Similar presentations
“Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom.
Advertisements

National Constitution Day 1 st Amendment White Out! Activity.
Introduction to First Amendment Law. The First Amendment “Congress shall make no law respecting an establishment of religion, or prohibiting the free.
“TIPS” PRESENTATION BY BILL MULQUEEN MAY 16 & 17, 2000.
Civil Liberties: The First Amendment. Bill of Rights First 10 Amendments to Constitution Part of the “Deal” to Obtain State Ratification of Constitution.
First Amendment of the United States Constitution (1791) “Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise.
The First Amendment. Actual Text Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging.
Constitution Sydney Werlein, Ali Voss, Brian Jones.
What you will learn today: 1 What is the Bill of Rights? 2 What does the 1 st Amendment to the Constitution say about Freedom of Speech? 3 What are Civil.
The First Amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom.
Chapter 4 section 1 The First Amendment. The First Amendment “ Congress shall make no law respecting an establishment of religion, or prohibiting the.
SECTION 1 Freedom of Assembly and Petition Standard Discuss the meaning and importance of each of the rights guaranteed under the Bill of Rights.
The First Amendment By: Subhi, Brittany, and Deanna EDU 2022 Dr. Fero.
The First Amendment.
MODULE 3: RESPONSIBILITY. As responsible journalists, staffs have obligations. Legal decisions have affected students’ rights. Statement of policy can.
CSCE 121: Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.
American Government and Politics (POLS 122) Professor Jonathan Day.
American Government Fall 2007 Civil Liberties. Freedoms from arbitrary government interference Found in Bill of Rights (first 10 amendments) –Speech –Press.
2.6 Protecting Individual Citizens 1 st & 4 th Amendments In Depth Government & Citizenship Timpanogos High School.
Project 1: Creating Newsletters Module 1: Censoring Freedom of Expression.
SIXTH GRADE WRITING CLASS “FREEDOM OF SPEECH” IN THE.
Journal #4 Imagine you are the shark and describe your thoughts. Use complete sentences and write ½ page minimum! 1 You have 10mins to write a CREATIVE,
Freedoms of Expression. What is an Amendment?  Amend: to change  Bill of Rights: first ten amendments to the Constitution  The Anti-Federalists wanted.
Our Banned Book Project By, Jaymee Gourley and Danielle Adams.
BANNED BOOKS. #1! 2CvlU.
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of.
The Bill of Rights. Congress shall make no law The Bill of Rights Congress shall make no law a) respecting an establishment of religion,
The Constitution and your First Amendment Rights.
The First Amendment Congress shall make no law … abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and.
Basics of Religious Rights. 1 st Amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof;
Daily Agenda Eng IV 6/19/13. Bellwork Take 5 minutes to respond to the quote. We will share after that. For John…Solve World Hunger in 3 steps or less.
The first amendment What it is and how it affects American media today.
“The day you teach the child the name of the bird, the child will never see that bird again” – Krishnamurti.
Amendment a·mend·ment P Pronunciation Key ( -m nd m nt) n. Pronunciation Key 1. The act of changing for the better; improvement:
Project 1: Creating Newsletters Module 1: Censoring Freedom of Expression.
MODULE 3: RESPONSIBILITY Responsibility Student journalists on the yearbook staff should follow important legal and ethical GUIDELINES. AS RESPONSIBLE.
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of.
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense,
The first amendment What it is and how it affects American journalism.
The First Amendment Guy Ciancia Charles Thweatt. Who & What  Juan Williams, analyst of National Public Radio.  Juan Williams was fired from NPR after.
The 1 st Amendment. Brainstorm… Imagine you are in a club or a group and you have a super important message. You need as many people as possible to hear.
Amendment I Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech,
Name the five freedoms of the First Amendment. The First Amendment “Congress shall make no law respecting an establishment of religion, or prohibiting.
Chapter 23 Text Processing Bjarne Stroustrup
Civics. 1 st amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the.
What does it mean to be an American? In one word … From Sam Chaltain’s First Amendment 101 Part 1 80:Video:1236.
LEA 2 Cours de civilisation américaine J. Kempf Americans and religion 1.Centrality in American life 2.An ambiguous separation of churches and State 3.The.
How does the 1 st Amendment apply today? SS8. Warm-Up: Re-write the following text in your own words… (pg. 34, left of your notebook) Congress shall make.
THE FIRST AMENDMENT EXPLAINED.
The First Amendment Journalism I Mr. Bruno. First Amendment to the Constitution Congress shall make no law respecting an establishment of religion, or.
What is it and how does it affect American journalism?
The First Amendment.
1st Amendment Court Cases
Chapter 23 Text Processing
Chapter 23 Text Processing
By: Dr. Seuss Capstone Project Example Ms. Payne, English 12
Personal protections and liberties added to the Constitution for you!
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of.
Free Speech Thanks for coming in..
Using Church Records for Historical and Genealogical Research
Limiting Constitutional Rights: A Balancing Act
Americans and religion
*Breakdown the fundamental ideas of the 1st amendment.
The First Amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom.
The First Amendment!.
Banned Books.
Warm Up: Religion ( WRITE STATEMENTS then write yes or no by each skip a line between each one) 1. Animal sacrifice as part of church services 2. Amish.
Newspaper bhspioneerspirit.
RIGHTS GIVEN TO THE PEOPLE
Presentation transcript:

Chapter 23 Text Processing Bjarne Stroustrup

Overview Application domains Application domains Strings Strings I/O I/O Maps Maps Regular expressions Regular expressions Stroustrup/PPP - Oct'112

Now you know the basics Really! Congratulations! Really! Congratulations! Dont get stuck with a sterile focus on programming language features Dont get stuck with a sterile focus on programming language features What matters are programs, applications, what good can you do with programming What matters are programs, applications, what good can you do with programming Text processing Text processing Numeric processing Numeric processing Embedded systems programming Embedded systems programming Banking Banking Medical applications Medical applications Scientific visualization Scientific visualization Animation Animation Route planning Route planning Physical design Physical design Stroustrup/PPP - Oct'113

Text processing all we know can be represented as text all we know can be represented as text And often is And often is Books, articles Books, articles Transaction logs ( , phone, bank, sales, …) Transaction logs ( , phone, bank, sales, …) Web pages (even the layout instructions) Web pages (even the layout instructions) Tables of figures (numbers) Tables of figures (numbers) Mail Mail Programs Programs Measurements Measurements Historical data Historical data Medical records Medical records … Stroustrup/PPP - Oct'114 Amendment I Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.

String overview Strings Strings std::string std::string s.size() s.size() s1==s2 s1==s2 C-style string (zero-terminated array of char) C-style string (zero-terminated array of char) or or strlen(s) strlen(s) strcmp(s1,s2)==0 strcmp(s1,s2)==0 std::basic_string, e.g. unicode strings std::basic_string, e.g. unicode strings typedef std::basic_string string; typedef std::basic_string string; Proprietary string classes Proprietary string classes Stroustrup/PPP - Oct'115

String conversion Simple to_string Simple to_string template string to_string(const T& t) { ostringstream os; os << t; return os.str(); } For example: For example: string s1 = to_string(12.333); string s2 = to_string(1+5*6-99/7); Stroustrup/PPP - Oct'116

String conversion Simple extract from string Simple extract from string template T from_string(const string& s) { istringstream is(s); T t; if (!(is >> t)) throw bad_from_string(); return t; } For example: For example: double d = from_string ("12.333"); Matrix m = from_string >("{ {1,2}, {3,4} }"); Stroustrup/PPP - Oct'117

General stream conversion template template Target lexical_cast(Source arg) { std::stringstream ss; Target result; if (!(ss << arg)// read arg into stream || !(ss >> result)// read result from stream || !(ss >> std::ws).eof())// stuff left in stream? || !(ss >> std::ws).eof())// stuff left in stream? throw bad_lexical_cast(); return result; } string s = lexical cast (lexical_cast (" 12.7 "));// ok // works for any type that can be streamed into and/or out of a string: XX xx = lexical_cast (lexical_cast (XX(whatever)));// !!! Stroustrup/PPP - Oct'118

I/O overview Stroustrup/PPP - Oct'119 istreamostream ifstreamiostreamofstreamostringstreamistringstream fstreamstringstream Stream I/O in >> xRead from in into x according to xs format out << x Write x to out according to xs format in.get(c)Read a character from in into c getline(in,s)Read a line from in into the string s

Map overview Associative containers Associative containers,,,,,, map map multimap multimap set set multiset multiset unordered_map unordered_map unordered_multimap unordered_multimap unordered_set unordered_set unordered_multiset unordered_multiset The backbone of text manipulation The backbone of text manipulation Find a word Find a word See if you have already seen a word See if you have already seen a word Find information that correspond to a word Find information that correspond to a word See example in Chapter 23 See example in Chapter 23 Stroustrup/PPP - Oct'1110

Map overview Stroustrup/PPP - Oct'1111 vector multimap John Doe John Q. Public Mail_file:

A problem: Read a ZIP code U.S. state abbreviation and ZIP code U.S. state abbreviation and ZIP code two letters followed by five digits two letters followed by five digits string s; while (cin>>s) { if (s.size()==7 && isletter(s[0]) && isletter(s[1]) && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4]) && isdigit(s[5]) && isdigit(s[6])) cout << "found " << s << '\n'; } Brittle, messy, unique code Brittle, messy, unique code Stroustrup/PPP - Oct'1112

A problem: Read a ZIP code Problems with simple solution Problems with simple solution Its verbose (4 lines, 8 function calls) Its verbose (4 lines, 8 function calls) We miss (intentionally?) every ZIP code number not separated from its context by whitespace We miss (intentionally?) every ZIP code number not separated from its context by whitespace "TX77845", TX , and ATM77845 "TX77845", TX , and ATM77845 We miss (intentionally?) every ZIP code number with a space between the letters and the digits We miss (intentionally?) every ZIP code number with a space between the letters and the digits TX TX We accept (intentionally?) every ZIP code number with the letters in lower case We accept (intentionally?) every ZIP code number with the letters in lower case tx77845 tx77845 If we decided to look for a postal code in a different format we have to completely rewrite the code If we decided to look for a postal code in a different format we have to completely rewrite the code CB3 0DS, DK-8000 Arhus CB3 0DS, DK-8000 Arhus Stroustrup/PPP - Oct'1113

TX st try:wwddddd 1 st try:wwddddd 2 nd (remember ):wwddddd-dddd 2 nd (remember ):wwddddd-dddd Whats special? Whats special? 3 rd :\w\w\d\d\d\d\d-\d\d\d\d 3 rd :\w\w\d\d\d\d\d-\d\d\d\d 4 th (make counts explicit):\w2\d5-\d4 4 th (make counts explicit):\w2\d5-\d4 5 th (and special): \w{2}\d{5}-\d{4} 5 th (and special): \w{2}\d{5}-\d{4} But was optional? But was optional? 6 th : \w{2}\d{5}(-\d{4})? 6 th : \w{2}\d{5}(-\d{4})? We wanted an optional space after TX We wanted an optional space after TX 7 th (invisible space): \w{2} ?\d{5}(-\d{4})? 7 th (invisible space): \w{2} ?\d{5}(-\d{4})? 8 th (make space visible): \w{2}\s?\d{5}(-\d{4})? 8 th (make space visible): \w{2}\s?\d{5}(-\d{4})? 9 th (lots of space – or none):\w{2}\s*\d{5}(-\d{4})? 9 th (lots of space – or none):\w{2}\s*\d{5}(-\d{4})? Stroustrup/PPP - Oct'1114

Regex library – availability Not part of C++98 standard Not part of C++98 standard Part of Technical Report Part of Technical Report Part of C++0x Part of C++0x Ships with Ships with VS 9.0 C++, use, std::tr1::regex VS 9.0 C++, use, std::tr1::regex GCC 4.3.0, use, std::tr1::regex GCC 4.3.0, use, std::tr1::regex use, std::boost::regex use, std::boost::regex Stroustrup/PPP - Oct'1115

#include #include using namespace std; int main() { ifstream in("file.txt");// input file if (!in) cerr << "no file\n"; regex pat ("\\w{2}\\s*\\d{5}(-\\d{4})?"); // ZIP code pattern cout << "pattern: " << pat << '\n'; // … } Stroustrup/PPP - Oct'1116

int lineno = 0; string line;// input buffer while (getline(in,line)) { ++lineno; smatch matches;// matched strings go here if (regex_search(line, matches, pat)) { cout << lineno << ": " << matches[0] << '\n';// whole match if (1<matches.size() && matches[1].matched) cout << "\t: " << matches[1] << '\n;// sub-match }} Stroustrup/PPP - Oct'1117

Results Input:address TX77845 ffff tx asasasaa ggg TX howdy zzz TX sss ggg TX cvzcv TX sdsas xxxTx77845xxxTX Output:pattern: "\w{2}\s*\d{5}(-\d{4})?" 1: TX : tx : TX : : TX : : Tx : TX : Stroustrup/PPP - Oct'1118

Regular expression syntax Regular expressions have a thorough theoretical foundation based on state machines Regular expressions have a thorough theoretical foundation based on state machines You can mess with the syntax, but not much with the semantics You can mess with the syntax, but not much with the semantics The syntax is terse, cryptic, boring, useful The syntax is terse, cryptic, boring, useful Go learn it Go learn it Examples Examples Xa{2,3}// Xaa Xaaa Xa{2,3}// Xaa Xaaa Xb{2}// Xbb Xb{2}// Xbb Xc{2,}// Xcc Xccc Xcccc Xccccc … Xc{2,}// Xcc Xccc Xcccc Xccccc … \w{2}-\d{4,5}// \w is letter \d is digit \w{2}-\d{4,5}// \w is letter \d is digit (\d*:)?(\d+) // 124: : (\d*:)?(\d+) // 124: : Subject: (FW:|Re:)?(.*)//. (dot) matches any character Subject: (FW:|Re:)?(.*)//. (dot) matches any character [a-zA-Z] [a-zA-Z_0-9]*// identifier [a-zA-Z] [a-zA-Z_0-9]*// identifier [^aeiouy] // not an English vowel [^aeiouy] // not an English vowel Stroustrup/PPP - Oct'1119

Searching vs. matching Searching for a string that matches a regular expression in an (arbitrarily long) stream of data Searching for a string that matches a regular expression in an (arbitrarily long) stream of data regex_search() looks for its pattern as a substring in the stream regex_search() looks for its pattern as a substring in the stream Matching a regular expression against a string (of known size) Matching a regular expression against a string (of known size) regex_match() looks for a complete match of its pattern and the string regex_match() looks for a complete match of its pattern and the string Stroustrup/PPP - Oct'1120

Table grabbed from the web KLASSE ANTAL DRENGE ANTAL PIGER ELEVER IALT 0A A7815 1B A A A7714 4B A A B A G358 7I7310 8A A MO325 0P1112 0P B448 10CE011 1MO8513 2CE8513 3DCE336 4MO415 6CE347 8CE448 9CE4913 REST5611 Alle klasser Stroustrup/PPP - Oct'1121 Numeric fields Text fields Invisible field separators Semantic dependencies i.e. the numbers actually mean something first row + second row == third row Last line are column sums

Describe rows Header line Header line Regular expression:^[\w ]+([\w ]+)*$ Regular expression:^[\w ]+([\w ]+)*$ As string literal:"^[\\w ]+([\\w ]+)*$" As string literal:"^[\\w ]+([\\w ]+)*$" Other lines Other lines Regular expression: ^([\w ]+)(\d+)(\d+)(\d+)$ Regular expression: ^([\w ]+)(\d+)(\d+)(\d+)$ As string literal: "^([\\w ]+)(\\d+)(\\d+)(\\d+)$" As string literal: "^([\\w ]+)(\\d+)(\\d+)(\\d+)$" Arent those invisible tab characters annoying? Arent those invisible tab characters annoying? Define a tab character class Define a tab character class Arent those invisible space characters annoying? Arent those invisible space characters annoying? Use \s Use \s Stroustrup/PPP - Oct'1122

Simple layout check int main() { ifstream in("table.txt");// input file if (!in) error("no input file\n"); string line;// input buffer string line;// input buffer int lineno = 0; regex header( "^[\\w ]+([\\w ]+)*$"); // header line regex row( "^([\\w ]+)(\\d+)(\\d+)(\\d+)$"); // data line // … check layout … // … check layout …} Stroustrup/PPP - Oct'1123

Simple layout check int main() { // … open files, define patterns … if (getline(in,line)) {// check header line smatch matches; if (!regex_match(line, matches, header)) error("no header"); } while (getline(in,line)) {// check data line ++lineno; smatch matches; if (!regex_match(line, matches, row)) error("bad line", to_string(lineno)); }} Stroustrup/PPP - Oct'1124

Validate table int boys = 0;// column totals int girls = 0; while (getline(in,line)) {// extract and check data smatch matches; if (!regex_match(line, matches, row)) error("bad line"); int curr_boy = from_string (matches[2]);// check row int curr_girl = from_string (matches[3]); int curr_total = from_string (matches[4]); if (curr_boy+curr_girl != curr_total) error("bad row sum"); if (matches[1]==Alle klasser) {// last line; check columns: if (curr_boy != boys) error("boys dont add up"); if (curr_girl != girls) error("girls dont add up"); return 0; } boys += curr_boy; girls += curr_girl; } Stroustrup/PPP - Oct'1125

Application domains Text processing is just one domain among many Text processing is just one domain among many Or even several domains (depending how you count) Or even several domains (depending how you count) Browsers, Word, Acrobat, Visual Studio, … Browsers, Word, Acrobat, Visual Studio, … Image processing Image processing Sound processing Sound processing Data bases Data bases Medical Medical Scientific Scientific Commercial Commercial … Numerics Numerics Financial Financial … Stroustrup/PPP - Oct'1126