Unix RE’s: Text Processing and Lexical Analysis

Some Applications

RE’s appear in many systems, often in private software that needs a simple language to describe sequences of events.

Junglee

- Started in the mid-90’s by three of Jeff Ullman’s students: Ashish Gupta, Anand Rajaraman, and Venky Harinarayan.
- Goal was to integrate information from Web pages.
- Bought by Amazon when Yahoo! hired them to build a comparison shopper for books.

Integrating Want Ads

- Junglee’s first contract was to integrate on-line want ads into a queryable table.
- Each company organized its employment pages differently.
- Worse: the organization typically changed weekly.

Junglee’s Solution

- They developed a regular-expression language for navigating within a page and among pages.
- Input symbols were:
  - Letters, for forming words like “salary”.
  - HTML tags, for following the structure of a page.
  - Links, to jump between pages.

Junglee’s Solution – (2)

- Engineers could then write RE’s to describe how to find key information at a Web site, e.g., position title, salary, requirements,…
- Because they had a little language, they could incorporate new sites quickly, and they could modify their strategy when a site changed.

RE-Based Software Architecture

- Junglee used a common form of architecture:
  - Use RE’s plus actions (arbitrary code) as your input language.
  - Compile into a DFA or simulated NFA.
  - Each accepting state is associated with an action, which is executed when that state is entered.
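This RE-plus-action pattern is easy to sketch. The rules below are purely illustrative (hypothetical patterns and token names, not Junglee’s actual language), and Python’s backtracking `re` engine stands in for a compiled DFA:

```python
import re

# Illustrative RE-plus-action rules (hypothetical, not Junglee's).
# A production system would compile these into a single DFA or simulated
# NFA; here each rule is matched separately for clarity.
rules = [
    (re.compile(r"salary:\s*\$?(\d+)"), lambda m: ("SALARY", int(m.group(1)))),
    (re.compile(r"title:\s*(\w[\w ]*)"), lambda m: ("TITLE", m.group(1).strip())),
]

def scan(text):
    """Fire the action associated with each rule wherever its RE matches."""
    results = []
    for pattern, action in rules:      # results come out in rule order
        for m in pattern.finditer(text):
            results.append(action(m))
    return results

print(scan("title: Compiler Engineer\nsalary: $90000"))
# -> [('SALARY', 90000), ('TITLE', 'Compiler Engineer')]
```

Attaching the action to the pattern, rather than burying it in ad-hoc parsing code, is what made it cheap to add or repair a site: only the rule table changes.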

UNIX Regular Expressions

- UNIX, from the beginning, used regular expressions in many places, including the “grep” command.
- Grep = “Global (search for a) Regular Expression and Print.”
- Most UNIX commands use an extended RE notation that still defines only regular languages.

UNIX RE Notation

- [a1a2…an] is shorthand for a1 + a2 + … + an.
- Ranges are indicated by first-dash-last inside brackets; the order is ASCII.
- Examples: [a-z] = “any lower-case letter,” [a-zA-Z] = “any letter.”
- Dot (.) = “any character.”
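These shorthands carry over to most modern RE engines; a quick sanity check in Python, whose `re` module accepts the same class and dot notation:

```python
import re

# [a-z] = any lower-case letter; [a-zA-Z] = any letter; . = any character.
assert re.fullmatch(r"[a-z]", "q")
assert not re.fullmatch(r"[a-z]", "Q")   # 'Q' is outside the ASCII range a-z
assert re.fullmatch(r"[a-zA-Z]", "Q")
assert re.fullmatch(r".", "7")           # dot matches any single character
print("class and dot shorthands behave as described")
```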

UNIX RE Notation – (2)

- | is used for union instead of +.
- But + has a meaning: “one or more of.” E+ = EE*. Example: [a-z]+ = “one or more lower-case letters.”
- ? = “zero or one of.” E? = E + ε. Example: [ab]? = “an optional a or b.”
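The + and ? operators can be checked the same way, since Python’s `re` gives them the same meanings:

```python
import re

# E+ = EE* ("one or more"); E? = E + epsilon ("zero or one").
assert re.fullmatch(r"[a-z]+", "abc")    # one or more lower-case letters
assert not re.fullmatch(r"[a-z]+", "")   # + demands at least one occurrence
assert re.fullmatch(r"[ab]?", "b")       # an optional a or b...
assert re.fullmatch(r"[ab]?", "")        # ...including none at all
print("+ and ? behave as described")
```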

Example: Text Processing

- Remember our DFA for recognizing strings that end in “ing”? It was rather tricky.
- But the RE for such strings is easy: .*ing, where the dot is the UNIX “any.”
- Even an NFA is easy (next slide).

NFA for “Ends in ing”

[Diagram: a start state with an “any character” self-loop, followed by transitions on i, n, g to an accepting state.]
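That NFA is small enough to simulate directly. A minimal sketch: track the set of live states, where state 0 loops on any character and “guesses” where the final ing begins:

```python
def accepts_ing(s):
    """Simulate the 4-state NFA for '.*ing' by tracking the set of live states."""
    states = {0}                    # state 0 is the start state
    for ch in s:
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)          # "any" self-loop
                if ch == "i":
                    nxt.add(1)      # guess: the final "ing" starts here
            elif q == 1 and ch == "n":
                nxt.add(2)
            elif q == 2 and ch == "g":
                nxt.add(3)          # state 3 is accepting
        states = nxt
    return 3 in states

assert accepts_ing("singing") and accepts_ing("ing")
assert not accepts_ing("singer")
```

Note how the simulation never backtracks: keeping a set of states is exactly what the subset construction turns into a DFA.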

Lexical Analysis

- The first thing a compiler does is break a program into tokens = substrings that together represent a unit.
- Examples: identifiers, reserved words like “if,” meaningful single characters like “;” or “+”, multicharacter operators like “<=”.

Lexical Analysis – (2)

- Using a tool like Lex or Flex, one can write a regular expression for each different kind of token.
- Example: in UNIX notation, identifiers are something like [A-Za-z][A-Za-z0-9]*.
- Each RE has an associated action. Example: return a code for the token found.
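A hand-written sketch of what such a Lex/Flex specification compiles to: try each token’s RE at the current position and run its action (here, just recording a code for the token). The token set and names are invented for illustration:

```python
import re

# Illustrative token specification; a real Lex file would list RE/action
# pairs that get compiled into one DFA.
TOKEN_RES = [
    ("ID",  re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS",  re.compile(r"[ \t\n]+")),
]

def tokens(src):
    pos, out = 0, []
    while pos < len(src):
        for name, pat in TOKEN_RES:
            m = pat.match(src, pos)     # anchored match at current position
            if m:
                if name != "WS":        # skipping whitespace is a side task
                    out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError(f"bad character at position {pos}")
    return out

print(tokens("x1 42 y"))
# -> [('ID', 'x1'), ('NUM', '42'), ('ID', 'y')]
```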

Tricks for Combining Tokens

- There are some ambiguities that need to be resolved as we convert RE’s to a DFA. Examples:
  1. “if” looks like an identifier, but it is a reserved word.
  2. < might be a comparison operator, but if followed by =, then the token is <=.

Tricks – (2)

- Convert the RE for each token to an ε-NFA; each has its own final state.
- Combine these all by introducing a new start state with ε-transitions to the start states of each ε-NFA.
- Then convert to a DFA.

Tricks – (3)

- If a DFA state has several final states among its members, give them priority.
- Example: give all reserved words priority over identifiers, so if the DFA arrives at a state that contains final states for both the “if” ε-NFA and the identifier ε-NFA, it declares “if,” not an identifier.
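Priority is easy to mimic without building the DFA: check the reserved-word pattern before the identifier pattern and let the first full match win. A minimal sketch (token names illustrative):

```python
import re

# Reserved words are listed before identifiers, mirroring the priority
# given to their final states in the combined DFA.
RESERVED = re.compile(r"if|while")
IDENT    = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def classify(lexeme):
    """Both REs may accept the whole lexeme; reserved words win."""
    if RESERVED.fullmatch(lexeme):
        return "RESERVED"
    if IDENT.fullmatch(lexeme):
        return "ID"
    return "ERROR"

assert classify("if") == "RESERVED"
assert classify("iffy") == "ID"     # only a full match counts as reserved
```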

Tricks – (4)

- It’s a bit more complicated, because the DFA has to have an additional power: it must be able to read an input symbol and then, when it accepts, put that symbol back on the input to be read later.

Example: Put-Back

- Suppose “<” is the first input symbol. Read the next input symbol.
- If it is “=”, accept and declare the token is <=.
- If it is anything else, put it back and declare the token is <.
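A sketch of that rule in code, where “putting the symbol back” amounts to not advancing the input position past it:

```python
def scan_less(src, pos):
    """Scan '<' or '<=' starting at pos; return (token, new position)."""
    assert src[pos] == "<"
    if pos + 1 < len(src) and src[pos + 1] == "=":
        return "<=", pos + 2
    return "<", pos + 1            # the peeked symbol stays on the input

assert scan_less("a<=b", 1) == ("<=", 3)
assert scan_less("a<b", 1) == ("<", 2)
```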

Example: Put-Back – (2)

- Suppose “if” has been read from the input. Read the next input symbol.
- If it is a letter or digit, continue processing: you did not have the reserved word “if”; you are working on an identifier.
- Otherwise, put it back and declare the token is “if”.
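The same put-back idea handles “if” versus identifiers: keep consuming letters and digits, and only classify the word once a non-alphanumeric symbol (which is put back) ends it. A minimal sketch:

```python
def scan_word(src, pos):
    """Maximal munch with put-back: read a whole word, then classify it."""
    start = pos
    while pos < len(src) and src[pos].isalnum():
        pos += 1                   # the terminating symbol is never consumed
    lexeme = src[start:pos]
    kind = "IF" if lexeme == "if" else "ID"
    return kind, lexeme, pos

assert scan_word("if (", 0) == ("IF", "if", 2)    # '(' ends the word
assert scan_word("ifx(", 0) == ("ID", "ifx", 3)   # 'x' kept us in identifier
```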