Searching, Modifying, and Encoding Text. Parts: 1) Forming Regular Expressions 2) Encoding and Decoding.



FORMING REGULAR EXPRESSIONS

([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

    using System;
    using System.Text.RegularExpressions;

    namespace TestRegExp
    {
        class Class1
        {
            [STAThread]
            static void Main(string[] args)
            {
                // args[0] is the regular expression, args[1] is the input to test.
                if (Regex.IsMatch(args[1], args[0]))
                    Console.WriteLine("Input matches regular expression.");
                else
                    Console.WriteLine("Input DOES NOT match regular expression.");
            }
        }
    }

    C:\>TestRegExp ^\d{5}$ 1234
    Input DOES NOT match regular expression.

    C:\>TestRegExp ^\d{5}$ 12345
    Input matches regular expression.

Regular Expression Language Elements

Examples:

\d          Matches a digit character. Equivalent to "[0-9]" when RegexOptions.ECMAScript is specified; by default it also matches other Unicode decimal digits.
\w          Matches any word character, including underscore. Equivalent to "[A-Za-z0-9_]" when RegexOptions.ECMAScript is specified; by default it also matches Unicode letters and digits.
[^char]     Matches any character except the specified one.
\b          Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters, that is, at the first or last character of a word.
\number     Backreference. For example, (\w)\1 finds doubled word characters.
\k<name>    Named backreference. For example, (?<char>\w)\k<char> finds doubled word characters. The expression (?<43>\w)\43 does the same. You can use single quotes instead of angle brackets; for example, \k'char'.
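A minimal sketch of three of these elements in use: a numbered backreference, a named backreference, and the \b word boundary. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LanguageElementsDemo
{
    // True when the input contains the same word character twice in a row.
    public static bool HasDoubledChar(string input)
    {
        return Regex.IsMatch(input, @"(\w)\1");
    }

    // The same check written with a named group and a named backreference.
    public static bool HasDoubledCharNamed(string input)
    {
        return Regex.IsMatch(input, @"(?<char>\w)\k<char>");
    }

    // True when "cat" appears as a whole word rather than inside another word.
    public static bool ContainsWholeWordCat(string input)
    {
        return Regex.IsMatch(input, @"\bcat\b");
    }

    static void Main()
    {
        Console.WriteLine(HasDoubledChar("letter"));            // True
        Console.WriteLine(HasDoubledCharNamed("later"));        // False
        Console.WriteLine(ContainsWholeWordCat("a cat sat"));   // True
        Console.WriteLine(ContainsWholeWordCat("concatenate")); // False
    }
}
```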

String Start and End Reference

Input    ^abc     abc$
abc      true     true
yzabc    false    true
abcde    true     false

Wildcards

Input     to*n     to+n
ton       true     true
tooon     true     true
tn        true     false

Input     to{3}n   to{1,3}n   to{3,}n
tn        false    false      false
ton       false    true       false
tooon     true     true       true
toooon    false    false      true

Input     to?n     to.n
tn        true     false
ton       true     false
tooon     false    false
totn      false    true
tojn      false    true

Input     to[ro]n
toon      true
torn      true
ton       false
toron     false

Groups

Input              foo(loo){1,3}hoo    foo(loo|roo)hoo
fooloohoo          true                true
fooloolooloohoo    true                false
foohoo             false               false
foololohoo         false               false
fooroohoo          false               true
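Tables like these can be checked mechanically. The sketch below runs a few quantifier and character-class patterns against sample inputs with Regex.IsMatch and prints one row per combination. The class name, the helper Matches, and the particular selection of patterns and inputs are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Thin wrapper so that the pattern comes first, as in the tables.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        string[] patterns = { "to*n", "to+n", "to?n", "to{1,3}n", "to[ro]n" };
        string[] inputs = { "tn", "ton", "tooon", "torn" };

        // Print one line per pattern/input pair: pattern, input, result.
        foreach (string pattern in patterns)
            foreach (string input in inputs)
                Console.WriteLine("{0,-10} {1,-6} {2}", pattern, input, Matches(pattern, input));
    }
}
```

Regex.IsMatch looks for the pattern anywhere in the input, so add ^ and $ when the whole string must match.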

How to Extract Matched Data

    string input = "Company Name: Contoso, Inc.";
    // Group 1 captures everything after "Name: " up to the end of the string.
    Match m = Regex.Match(input, @"Name: (.*$)");
    Console.WriteLine(m.Groups[1]);
    // Display: "Contoso, Inc."
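The same extraction can be written with named groups and a check of Match.Success, so that a line that does not fit the pattern is reported rather than silently yielding empty groups. This is only a sketch: FieldParser, TryParseField, and the "Key: Value" line format are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class FieldParser
{
    // Splits a "Key: Value" line into its two parts using named groups.
    // Returns false when the line does not contain a colon-separated pair.
    public static bool TryParseField(string line, out string key, out string value)
    {
        Match m = Regex.Match(line, @"^(?<key>[^:]+):\s*(?<value>.*)$");
        key = m.Success ? m.Groups["key"].Value : null;
        value = m.Success ? m.Groups["value"].Value : null;
        return m.Success;
    }

    static void Main()
    {
        string key, value;
        if (TryParseField("Company Name: Contoso, Inc.", out key, out value))
            Console.WriteLine("{0} => {1}", key, value); // Company Name => Contoso, Inc.
    }
}
```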

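    String Extension(String url)
    {
        // Capture the protocol and the optional port (including its colon) as named groups.
        Regex r = new Regex(@"^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/", RegexOptions.Compiled);
        return r.Match(url).Result("${proto}${port}");
    }
    // For a URL that uses the http scheme on port 8080, this returns "http:8080".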

How to Replace Substrings Using Regular Expressions

    String MDYToDMY(String input)
    {
        // Capture month, day, and year by name, then emit them in day-month-year order.
        return Regex.Replace(input,
            "\\b(?<month>\\d{1,2})/(?<day>\\d{1,2})/(?<year>\\d{2,4})\\b",
            "${day}-${month}-${year}");
    }
    // From: "Today is 03/06/09"
    // To:   "Today is 06-03-09"
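The same technique applies to other date layouts. As a hedged sketch, the hypothetical helper below rewrites ISO-style dates (year-month-day) in day/month/year order; it uses a verbatim string (@"...") so the backslashes do not need doubling.

```csharp
using System;
using System.Text.RegularExpressions;

static class DateReformatDemo
{
    // Rewrites every yyyy-MM-dd date in the input as dd/MM/yyyy.
    public static string IsoToDmy(string input)
    {
        return Regex.Replace(input,
            @"\b(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\b",
            "${day}/${month}/${year}");
    }

    static void Main()
    {
        Console.WriteLine(IsoToDmy("Due 2009-03-06")); // Due 06/03/2009
    }
}
```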

Summary

■ Regular expressions enable you to determine whether text matches almost any type of format. Regular expressions support dozens of special characters and operators. The most commonly used are "^" to match the beginning of a string, "$" to match the end of a string, "?" to make the preceding character optional, "." to match any single character, and "*" to match the preceding character zero or more times.
■ To match data using a regular expression, create a pattern using groups to specify the data you need to extract, call Regex.Match to create a Match object, and then examine each of the items in the Match.Groups collection.
■ To reformat text data using a regular expression, call the static Regex.Replace method.

ENCODING AND DECODING

Content-Type headers and web pages name the character set in use:

    Content-Type: text/plain; charset=ISO-8859-1
    Content-Type: text/plain; charset="Windows-1251"

■ "ISO-8859-1" corresponds to code page 28591, "Western European (ISO)".
■ "Windows-1251" corresponds to code page 1251, which covers languages that use the Cyrillic alphabet, such as Russian and Bulgarian.

ASCII (American Standard Code for Information Interchange): codes 0 – 127, sufficient for English-language communication.
ANSI/ISO (American National Standards Institute / International Organization for Standardization): codes 128 – 255, assigned differently by each national code page.

Unicode encodings:
■ Unicode UTF-32 encoding
■ Unicode UTF-16 encoding
■ Unicode UTF-8 encoding

Unicode is a single large character set intended to cover every language.
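The three encodings represent the same characters with different numbers of bytes. A minimal sketch, assuming a hypothetical helper ByteCount and a sample string that contains one non-ASCII character:

```csharp
using System;
using System.Text;

static class EncodingSizeDemo
{
    // Number of bytes the text occupies in the given encoding (no byte order mark).
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    static void Main()
    {
        string text = "h\u00E9llo"; // five characters, one of them (U+00E9) outside ASCII

        Console.WriteLine("UTF-8:  {0} bytes", ByteCount(Encoding.UTF8, text));    // 6
        Console.WriteLine("UTF-16: {0} bytes", ByteCount(Encoding.Unicode, text)); // 10
        Console.WriteLine("UTF-32: {0} bytes", ByteCount(Encoding.UTF32, text));   // 20
    }
}
```

UTF-8 spends one byte on each ASCII character and two on U+00E9, UTF-16 uses two bytes for each of these characters, and UTF-32 always uses four.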

Using the Encoding Class

    // Get Korean encoding
    Encoding e = Encoding.GetEncoding("Korean");

    // Encode a string as bytes in the Korean code page
    byte[] encoded;
    encoded = e.GetBytes("Hello, world!");

    // Display the byte codes
    for (int i = 0; i < encoded.Length; i++)
        Console.WriteLine("Byte {0}: {1}", i, encoded[i]);
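Decoding is the reverse step: Encoding.GetString turns bytes back into a string. The sketch below, with a hypothetical RoundTrip helper, shows that a round trip through UTF-8 preserves the text, while a round trip through ASCII replaces a character it cannot represent with "?".

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Encodes the text to bytes and immediately decodes it with the same encoding.
    public static string RoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        string text = "h\u00E9llo"; // contains U+00E9, which ASCII cannot represent

        Console.WriteLine(RoundTrip(Encoding.UTF8, text) == text); // True
        Console.WriteLine(RoundTrip(Encoding.ASCII, text));        // h?llo
    }
}
```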

How to Examine Supported Code Pages

    EncodingInfo[] ei = Encoding.GetEncodings();
    foreach (EncodingInfo e in ei)
        Console.WriteLine("{0}: {1}, {2}", e.CodePage, e.Name, e.DisplayName);

How to Specify the Encoding Type when Writing a File

    StreamWriter swUtf7 = new StreamWriter("utf7.txt", false, Encoding.UTF7);
    swUtf7.WriteLine("Hello, World!");
    swUtf7.Close();

    StreamWriter swUtf8 = new StreamWriter("utf8.txt", false, Encoding.UTF8);
    swUtf8.WriteLine("Hello, World!");
    swUtf8.Close();

    StreamWriter swUtf16 = new StreamWriter("utf16.txt", false, Encoding.Unicode);
    swUtf16.WriteLine("Hello, World!");
    swUtf16.Close();

    StreamWriter swUtf32 = new StreamWriter("utf32.txt", false, Encoding.UTF32);
    swUtf32.WriteLine("Hello, World!");
    swUtf32.Close();

How to Specify the Encoding Type when Reading a File

    string fn = "file.txt";

    // Write the file as UTF-7
    StreamWriter sw = new StreamWriter(fn, false, Encoding.UTF7);
    sw.WriteLine("Hello, World!");
    sw.Close();

    // Read it back, naming the same encoding explicitly
    StreamReader sr = new StreamReader(fn, Encoding.UTF7);
    Console.WriteLine(sr.ReadToEnd());
    sr.Close();

Summary

■ Encoding standards map byte values to characters. ASCII is one of the oldest, most widespread encoding standards; however, it provides very limited support for non-English languages. Today, various Unicode encoding standards provide multilingual support.
■ The System.Text.Encoding class provides static methods for encoding and decoding text.
■ Call Encoding.GetEncodings to retrieve a list of supported code pages.
■ To specify the encoding type when writing a file, use an overloaded StreamWriter constructor that accepts an Encoding object.
■ You do not typically need to specify an encoding type when reading a file. However, you can specify an encoding type by using an overloaded StreamReader constructor that accepts an Encoding object.
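One reason an explicit encoding is often unnecessary when reading is that StreamReader can detect a byte order mark at the start of the data. The sketch below writes UTF-16 text, then reads it with a reader whose fallback encoding is UTF-8 and whose byte-order-mark detection is switched on. It uses an in-memory stream instead of a file to stay self-contained; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    public static string WriteUtf16ThenReadWithDetection(string text)
    {
        byte[] data;
        using (var buffer = new MemoryStream())
        {
            // Encoding.Unicode is UTF-16; the writer emits its byte order mark first.
            using (var writer = new StreamWriter(buffer, Encoding.Unicode))
            {
                writer.Write(text);
            }
            data = buffer.ToArray(); // ToArray still works after the stream is closed
        }

        // UTF-8 is only the fallback; the third argument enables byte order mark detection.
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Console.WriteLine(WriteUtf16ThenReadWithDetection("Hello, World!")); // Hello, World!
    }
}
```

Detection relies on the byte order mark, so data written without one (UTF-7 text, for example) still needs the encoding named explicitly.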

Your Key Competences

■ Use regular expressions to determine whether a string matches a specific pattern.
■ Use regular expressions to extract data from a text file.
■ Use regular expressions to reformat text data.
■ Describe the importance of encoding, and list common encoding standards.
■ Use the Encoding class to specify encoding formats, and convert between encoding standards.
■ Programmatically determine which code pages the .NET Framework supports.
■ Create files using a specific encoding format.
■ Read files using unusual encoding formats.

Key Terms

■ code page
■ regular expression
■ Unicode