Topic 6 Lesson 1 – Text Processing

Slides:



Advertisements
Similar presentations
Computer Science & Engineering 2111 Text Functions 1CSE 2111 Lecture-Text Functions.
Advertisements

MIS: Chapter 14 Cumulative concepts, features and functions, plus new functions COUNTIFS, SUMIFS, AVERAGEIFS (Separate ppt on REACH.louisville.edu) All.
CS1100: Computer Science and Its Applications Text Processing Created By Martin Schedlbauer
1 A Balanced Introduction to Computer Science, 2/E David Reed, Creighton University ©2008 Pearson Prentice Hall ISBN Chapter 17 JavaScript.
Multiplication of Fractions
Chapter 8: Manipulating Strings
CS 255: Database System Principles slides: Variable length data and record By:- Arunesh Joshi( 107) Id: Cs257_107_ch13_13.7.
Computer Science 1000 Spreadsheets II Permission to redistribute these slides is strictly prohibited without permission.
Importing Data Text Data Parsing Scrubbing Data June 21, 2012.
Copyright © 2012 Pearson Education, Inc. Publishing as Pearson Addison-Wesley C H A P T E R 9 More About Strings.
Miscellaneous Excel Combining Excel and Access. – Importing, exporting and linking Parsing and manipulating data. 1.
Visual Basic.NET BASICS Lesson 4 Mathematical Operators.
Chapter 1 Working with strings. Objectives Understand simple programs using character strings and the string library. Get acquainted with declarations,
Strings The Basics. Strings can refer to a string variable as one variable or as many different components (characters) string values are delimited by.
 Agenda: 4/24/13 o External Data o Discuss data manipulation tools and functions o Discuss data import and linking in Excel o Sorting Data o Date and.
Fundamentals of Python: First Programs
Oracle 11g: SQL Chapter 10 Selected Single-Row Functions.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
A Balanced Introduction to Computer Science, 3/E David Reed, Creighton University ©2011 Pearson Prentice Hall ISBN Chapter 17 JavaScript.
Chapter 3: Classes and Objects Java Programming FROM THE BEGINNING Copyright © 2000 W. W. Norton & Company. All rights reserved Java’s String Class.
More Strings CS303E: Elements of Computers and Programming.
CS1100: Computer Science and Its Applications Text Parsing in Excel Martin Schedlbauer, Ph.D.
Processing Text Excel can not only be used to process numbers, but also text. This often involves taking apart (parsing) or putting together text values.
Excel Text Functions 1. LEFT(text, [num_chars])) Returns the number of characters specified starting from the beginning of the text string Syntax Text:
1 CSE 2337 Chapter 7 Organizing Data. 2 Overview Import unstructured data Concatenation Parse Create Excel Lists.
C language + The Preprocessor. + Introduction The preprocessor is a program that processes that source code before it passes through the compiler. It.
Exporting & Formatting Budgets from FlexGen, NextGen & Zortec into Excel.
Winter 2016CISC101 - Prof. McLeod1 CISC101 Reminders Quiz 3 this week – last section on Friday. Assignment 4 is posted. Data mining: –Designing functions.
Emdeon Office Batch Management Services This document provides detailed information on Batch Import Services and other Batch features.
CS 257: Database System Principles Variable length data and record BY Govind Kalyankar Class Id: 107.
JavaScript: Conditionals contd.
Advanced Excel Helen Mills OME-RESA.
Sorts, CompareTo Method and Strings
Lesson 3: Using Formulas
Microsoft Visual Basic 2008: Reloaded Third Edition
Lesson 5: Viewing and Printing Workbooks
CMPT 120 Topic: Python’s building blocks -> More Statements
Logical Functions Excel Lesson 10.
NUMBER SYSTEMS.
Topics Designing a Program Input, Processing, and Output
Excel IF Function.
Miscellaneous Excel Combining Excel and Access.
Examining Two Pieces of Data
Introduction To Repetition The for loop
Dividing Polynomials.
CS1100: Computer Science and Its Applications
Exporting & Formatting Budgets from NextGen o Excel
Topics Introduction to File Input and Output
Introduction to C++ Programming
Data types Numeric types Sequence types float int bool list str
Manipulating Text In today’s lesson we will look at:
Functions In Matlab.
CS 106 Computing Fundamentals II Chapter 66 “Working With Strings”
Click ‘browse’ to search your device for
Topics Designing a Program Input, Processing, and Output
Topics Basic String Operations String Slicing
Topics Designing a Program Input, Processing, and Output
Introduction to Computer Science
Topic 3 Lesson 2 – Flexible Models
MapInfo SQL String Functions
OPERATORS AND EXPRESSIONS IN C++
Chapter 17 JavaScript Arrays
Topics Basic String Operations String Slicing
MS Excel – Analyzing Data
Topics Introduction to File Input and Output
CMSC201 Computer Science I for Majors Lecture 12 – Program Design
Excel Tips & Tricks July 18, 2019.
Programming Techniques :: String Manipulation
Topics Basic String Operations String Slicing
Presentation transcript:

Topic 6 Lesson 1 – Text Processing

Processing Text Text processing is often necessary when files are imported from other programs. Excel can be used not only to process numbers, but also text. This often involves taking apart (parsing) or putting together text values (strings). The parts into which we split a string are generally called fields or components. Fields may be separated by delimiting text and/or fields may have a fixed width which permits them to be identified. CS1100

Example We’d like to extract the customer name and the payment terms from the text in column A and place them into columns B and C. CS1100

Terminology The process of taking text values apart is called parsing. A text value is also called a string The parts of a text value are called a substring or a field or a component CS1100

Properties of Text values or Strings Each string has a length A string’s length is = the number of characters (either visible or invisible) that compose it Each character in a string has a position within the string The first character is at position 1, the second character is at position 2, etc. CS1100

Text Processing Functions Excel provides a number of functions for parsing: Text Processing Function Description RIGHT returns a specified numbers of characters from the right of the string LEFT returns a specified numbers of characters from the left of the string MID returns a specified numbers of characters from within the the string LEN returns the number of characters (including spaces) in a string FIND & SEARCH finds the start of a specified substring within a string TRIM remove leading and trailing spaces from a string CS1100

LEN Function The LEN function returns the total number of characters in a text, i.e., the “length” of the text value: A is the first character N is the 14th character =LEN(A9) CS1100

LEFT Function The LEFT function extracts a specific number of characters from the left side of a text value: The extracted characters are a prefix of the string Formula for B1: =LEFT(A1,4) =LEFT(string or cell reference, number of characters) CS1100

RIGHT Function The RIGHT function extracts a specific number of characters from the right side (end) of a text value (string): The extracted characters are a suffix of the string SPECIFY THE NUMBER OF CHARACTERS TO EXTRACT, NOT WHERE TO START! =RIGHT(A1,4) CS1100

MID Function The MID function extracts some number of characters starting at some position within a text value: The extracted characters are a substring of the string Number of Characters =MID(A1,5,4) Where to start CS1100

FIND Function FIND returns the position where a substring starts within a string. Finds the first occurrence only. Returns a #VALUE! error if the substring cannot be found. =FIND("DEF",A1) =FIND(" ",A2) =FIND(",",A3) CS1100

Case Sensitivity Note that FIND is case sensitive. As an alternative, Excel has a SEARCH function which is not case sensitive but otherwise works the same way as FIND. =FIND("cde",A16) =SEARCH("cde",A17) CS1100

IFERROR and FIND Since FIND returns an error when a substring cannot be found, it is best practice to catch the error with IFERROR and assign the cell to a sentinel value. Sentinel value – a distinct value that represents that an error occurred =FIND("[",A5) Sentinel value the empty string “” =IFERROR(FIND("[",A5),"") CS1100

TRIM Function The TRIM function removes all spaces before and after a piece of text. Spaces between words are not removed. This is useful if the text you are trying to parse has trailing spaces which may result in errors later For example, if you need to use a result later in a VLOOKUP function. CS1100

Example 1 – Delimited Text You are given a list of usernames, each followed by a comma, then a space, then the user’s full name A comma followed by a space only appears between the username and full name Everything following the username, the comma and the space is the user’s full name CS1100

Locating the Delimiter (where to split) The first step is to identify the location where the split will be made The split location may be identified by Delimiting text A fixed width field CS1100

Delimited Text A delimiter is any sequence of characters that can reliably be used to separate one field (part of the text) from another field. In this example, a comma followed by a space can serve as the delimiter (delimiting text). On the other hand, the width of each field may vary, so we cannot identify the splitting location by field widths CS1100

Finding the Delimiting Text Since the width of each field may vary, and we cannot identify the splitting location by field widths, we need to find the location of the comma and space Use FIND to return the location of the delimiter. =FIND(“, ”,A2) CS1100

Left portion of the Text LEFT(start position, end position, number of characters) Start position = 1 End Position = Find(delimiter, cell) – 1 Number of characters = = End position – Start Position + 1 = End position -1 + 1 = End position CS1100

Splitting the Text Once we have found the delimiting text, we can split the original text using functions like LEFT, RIGHT and MID Note that we must adjust the length in our function to omit the delimiting text. =LEFT(A2, B2 – 1) CS1100

Right portion of the Text RIGHT(Start Position, End Position, Number of Characters) Start position = FIND(delimiter, cell) + LEN(delimiter) End Position = LEN(cell) Number of characters = End position – Start Position + 1 CS1100

Splitting the Text Using the RIGHT function to find the full name, we need to find the number of characters from the right Subtract the length of the whole text by the location of the delimiter and adjust to omit the delimiter =RIGHT(A2, E2 – (B2+2) + 1) CS1100

Any substring of the Text MID(Start Position, End position, Number of characters to read) Start position = FIND(first delimiter, cell) + LEN(first delimiter) End Position = FIND(second delimiter, cell)-1 Number of characters = End position – Start Position + 1 CS1100

Splitting the Text We could also use the MID function … =MID(A2, B2+2, E2-(B2+2)-1) CS1100

Splitting the Text We could also use the MID function … =MID(A2, B2+2, E2 - B2 + 1) CS1100

Divide and Conquer CS Principle Divide and Conquer is a strategy for solving problems by breaking up a big problem into similar smaller problems Example: suppose we are given a username, followed by a comma and a space, followed by a real name, followed by another comma and a space, followed by a job title. CS1100

Divide and Conquer: Split Once Our first step will be to split the original text into two parts A username Everything else CS1100

Divide and Conquer: Split Again Repeat the splitting process by splitting the remainder into the full name and the job title Using this strategy, we could repeat the splitting process into smaller and smaller pieces until we have solved the problem. In the above example, we are done.

FIND Function FIND returns the position where a substring starts within a string. Optional Value: position to start search To find second comma: find a comma starting after the first comma. CS1100

FIND Function FIND returns the position where a substring starts within a string. Optional Value: position to start search CS1100

Parsing Optional Data Sometimes we need to split some text into parts, but one of the parts may be missing. A reasonable first step is to determine whether or not the data is present. CS1100

Parsing Optional Data: Example Suppose we are given a list of usernames optionally followed by commas and a full name Use IFERROR and FIND to see if there is a comma and return the position if so. CS1100

Parsing Optional Data: Example Now use an IF statement to extract the username CS1100

Parsing Text To extract parts of a text value (parsing) requires thoughtful analysis and often a divide-and-conquer approach. CS1100

Strategy You need think about your strategy: How do I detect where the first name starts? Are there some delimiters? What is the delimiter? Does the delimiter always work? Is there always a first or last name or is the field optional? Break the problem into several problems and create auxiliary or helper columns. CS1100

Hidden Columns Solving complex parsing problems often requires the use of intermediate values: Solve the problem in pieces, don’t do it all in a single formula So, place intermediate values into temporary columns and then hide the column to make the model less confusing to read. CS1100

Let’s Put This Together… Let’s see if we can parse the text into its name and terms components… Before starting with formulas, think about your strategy. How can you recognize the beginning and end of the name component? How about the beginning and end of the terms component? Do we need intermediate values? Which text processing functions could help? CS1100

Questions?