Regular expressions and the Corpus Query Language


Regular expressions and the Corpus Query Language Albert Gatt

Corpus search These notes introduce some practical tools to find patterns: regular expressions, and the corpus query language (CQL), developed by the Corpora and Lexicons Group, University of Stuttgart. CQL is a language for building complex queries using attributes and values.

A typographical note In the following, regular expressions are written between forward slashes (/.../) to distinguish them from normal text. You do not typically need to enclose them in slashes when using them.

Practice Log in to the sketchengine at http://the.sketchengine.co.uk and choose the BNC.

Practice In the concordance window, click Query type

Practice Then choose Phrase as your query type

Practice In what follows, we’ll be trying out some pattern searches. This will help you grasp the idea of regular expressions better.

Part 1 Regular expressions

Regular expressions A regular expression is a pattern that matches some sequence in a text. It is a mixture of: characters or strings of text; special characters; groups or ranges. E.g. “match a string starting with the letter S and ending in ane”.

The simplest regex The simplest regex is simply a string which specifies exactly which tokens or phrases you want. These are all regexes: /the tall dark lady/, /dog/, /the/.

Beyond that But the whole point of regexes is that we can make much more general searches, specifying patterns.

Delimiting regexes Special characters for start and end: /^man/ => any sequence which begins with “man”: man, manned, manning...; /man$/ => any sequence ending with “man”: doberman, policeman...; /^man$/ => any sequence consisting of “man” only.
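The start and end anchors can be tried out on a small word list. This is a minimal sketch that uses Python's `re` module as a stand-in; the word list is made up for illustration, and the sketchengine's own regex dialect may differ in details.

```python
import re

# Illustrative word list (not taken from the BNC).
words = ["man", "manned", "manning", "doberman", "policeman", "humanity"]

# ^ anchors the pattern at the start of the string, $ at the end.
starts_with_man = [w for w in words if re.search(r"^man", w)]
ends_with_man = [w for w in words if re.search(r"man$", w)]
only_man = [w for w in words if re.search(r"^man$", w)]

print(starts_with_man)  # ['man', 'manned', 'manning']
print(ends_with_man)    # ['man', 'doberman', 'policeman']
print(only_man)         # ['man']
```

Note that "humanity" contains "man" but is matched by none of the three patterns, because the anchors tie the match to the edges of the string.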

Groups of characters and choices /[wh]ood/ matches wood or hood; […] signifies a choice of characters. /[^wh]ood/ matches mood, food, but not wood or hood; /[^…]/ signifies any character except what’s in the brackets.
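A character class and its negation can be compared side by side. The sketch below assumes Python's `re` module and an invented word list; `re.fullmatch` is used so that the pattern has to cover the whole word.

```python
import re

words = ["wood", "hood", "mood", "food", "good"]

# [wh] accepts exactly one character, which must be w or h.
in_class = [w for w in words if re.fullmatch(r"[wh]ood", w)]

# [^wh] accepts exactly one character, which must be anything except w or h.
not_in_class = [w for w in words if re.fullmatch(r"[^wh]ood", w)]

print(in_class)      # ['wood', 'hood']
print(not_in_class)  # ['mood', 'food', 'good']
```

A negated class is not limited to letters: `[^wh]ood` would also accept a string such as "1ood".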

Practice Type a regular expression to match: (1) a word beginning with l or m followed by aid (this should match maid or laid): [lm]aid; (2) a word beginning with r or s or b or t followed by at (this should match rat, bat, tat or sat): [rbst]at

Ranges Some sets of characters can be expressed as ranges: /[a-z]/ any alphabetic, lower-case character; /[0-9]/ any digit between 0 and 9; /[a-zA-Z]/ any alphabetic, upper- or lower-case character.
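Ranges combine naturally into longer patterns. As a rough illustration, assuming Python's `re` module and a made-up token list, the pattern below picks out four-digit strings from 1800 to 1899.

```python
import re

# 18 followed by any two digits: 1800 through 1899.
year_1800s = re.compile(r"18[0-9][0-9]")

tokens = ["1800", "1857", "1899", "1900", "18x7"]
matched = [t for t in tokens if year_1800s.fullmatch(t)]

print(matched)  # ['1800', '1857', '1899']
```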

Practice Type a regular expression to match: (1) a date between 1800 and 1899: 18[0-9][0-9]; (2) the number 2 followed by x or y: 2[xy]; (3) a four-letter word beginning with i in lowercase: i[a-z][a-z][a-z]

Disjunction and wildcards /ba./ matches bat, bad, …; /./ means “any single character” (not only letters and digits). /gupp(y|ies)/ matches guppy OR guppies; /(x|y)/ means “either x or y”. It is important to use parentheses!

Practice Rewrite this regex using the (.) wildcard: a four-letter word beginning with i in lowercase, i[a-z][a-z][a-z], becomes i... Does it match exactly the same things? Why?
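One way to explore the question is to run both patterns over the same tokens. The sketch assumes Python's `re` module and an invented token list; the point is only that the dot accepts characters the range `[a-z]` rejects.

```python
import re

tokens = ["into", "idea", "i-95", "i'll", "Iowa", "in", "items"]

# The range version demands three lower-case letters after the i.
range_hits = [t for t in tokens if re.fullmatch(r"i[a-z][a-z][a-z]", t)]

# The dot version demands any three characters after the i.
dot_hits = [t for t in tokens if re.fullmatch(r"i...", t)]

print(range_hits)  # ['into', 'idea']
print(dot_hits)    # ['into', 'idea', 'i-95', "i'll"]
```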

Quantifiers (I) /colou?r/ matches color or colour; /govern(ment)?/ matches govern or government; /?/ means “zero or one of the preceding character or group”.
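The optional quantifier can apply to a single character or to a parenthesised group. A small sketch, assuming Python's `re` module and made-up tokens:

```python
import re

tokens = ["color", "colour", "colouur", "govern", "government", "governor"]

# u? makes the single character u optional.
colour_hits = [t for t in tokens if re.fullmatch(r"colou?r", t)]

# (ment)? makes the whole group "ment" optional.
govern_hits = [t for t in tokens if re.fullmatch(r"govern(ment)?", t)]

print(colour_hits)  # ['color', 'colour']
print(govern_hits)  # ['govern', 'government']
```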

Practice Write a regex to match: (1) color or colour: colou?r; (2) sand or sandy: sandy?

Quantifiers (II) /ba+/ matches ba, baa, baaa…; /(inkiss )+/ matches inkiss, inkiss inkiss (note the whitespace in the regex); /+/ means “one or more of the preceding character or group”.
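The effect of `+`, and of the whitespace inside the group, can be checked directly. This is a sketch assuming Python's `re` module; with `re.fullmatch` the trailing space inside the group has to be present in the text as well.

```python
import re

one_or_more_a = re.compile(r"ba+")
tokens = ["b", "ba", "baa", "baaa", "bab"]
ba_hits = [t for t in tokens if one_or_more_a.fullmatch(t)]
print(ba_hits)  # ['ba', 'baa', 'baaa']

# The group contains a space, so every repetition must end in a space.
repeated = re.compile(r"(inkiss )+")
with_space = bool(repeated.fullmatch("inkiss inkiss "))
without_space = bool(repeated.fullmatch("inkiss inkiss"))
print(with_space, without_space)  # True False
```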

Practice Write a regex to match: a word starting with ba followed by one or more characters: ba.+

Quantifiers (III) /ba*/ matches b, ba, baa, baaa…; /*/ means “zero or more of the preceding character or group”. /(ba ){1,3}/ matches ba, ba ba or ba ba ba; {n,m} means “between n and m of the preceding character or group”. /(ba ){2}/ matches ba ba; {n} means “exactly n of the preceding character or group”.
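The star and the counted quantifiers can be compared on a few strings. A minimal sketch, assuming Python's `re` module and invented test strings (the phrases end in a space because the group `(ba )` does):

```python
import re

tokens = ["b", "ba", "baaa", "bb"]
star_hits = [t for t in tokens if re.fullmatch(r"ba*", t)]
print(star_hits)  # ['b', 'ba', 'baaa']

a_counts = ["ba", "baa", "baaaa", "baaaaa"]
range_hits = [t for t in a_counts if re.fullmatch(r"ba{2,4}", t)]
print(range_hits)  # ['baa', 'baaaa']

phrases = ["ba ", "ba ba ", "ba ba ba ", "ba ba ba ba "]
group_hits = [p for p in phrases if re.fullmatch(r"(ba ){1,3}", p)]
exact_hits = [p for p in phrases if re.fullmatch(r"(ba ){2}", p)]
print(group_hits)  # ['ba ', 'ba ba ', 'ba ba ba ']
print(exact_hits)  # ['ba ba ']
```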

Practice Write a regex to match: a word starting with ba followed by one or more characters: ba.+ Now rewrite this to match ba followed by exactly one character: ba.{1} Rewrite again, to match b followed by between two and four a’s (e.g. baa, baaa etc.): ba{2,4}

Part 2 The corpus query language

Switch the sketchengine interface Under Query type, select CQL

CQL syntax So far, we’ve used regexes to match strings (words, phrases). We often want to combine searches for words and grammatical patterns. CQL queries are built from regular expressions, but they let us specify patterns over words, lemmas and tags.

Structure of a CQL query [attribute="regex"] attribute: what we want to search for; can be word, lemma or tag. regex: the actual pattern it should match.

Structure of a CQL query Examples: [word="it.+"] matches a single word beginning with it followed by one or more characters; [tag="V.*"] matches any word that is tagged with a label beginning with “V” (so any verb); [lemma="man.+"] matches all tokens that belong to a lemma that begins with “man” followed by one or more characters.

Structure of a CQL query [attribute="regex"] attribute: what we want to search for; can be word, lemma or tag. regex: the actual pattern it should match. Each expression in square brackets matches one word. We can have multiple expressions in square brackets to match a sequence.

CQL Syntax (I) Regex over word: [word="it"] [word="resulted"] [word="that"] matches only it resulted that. Regex over word with special characters: [word="it"] [word="result.*"] [word="that"] matches it resulted that, it results that, etc. Regex over lemma: [word="it"] [lemma="result"] [word="that"] matches it, followed by any form of the lemma result, followed by that.
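To make the "one bracket, one token" idea concrete, here is a toy matcher in Python. It is only a sketch of the idea, not how the sketchengine is implemented: the corpus, the tag labels and the helper names `token_matches` and `find_sequences` are all invented for illustration, and each regex is assumed to have to match the whole attribute value.

```python
import re

# Hypothetical mini-corpus: each token has a word form, a lemma and a tag.
# The tag labels are illustrative only.
corpus = [
    {"word": "it", "lemma": "it", "tag": "PNP"},
    {"word": "resulted", "lemma": "result", "tag": "VVD"},
    {"word": "that", "lemma": "that", "tag": "CJT"},
    {"word": "it", "lemma": "it", "tag": "PNP"},
    {"word": "results", "lemma": "result", "tag": "VVZ"},
    {"word": "that", "lemma": "that", "tag": "CJT"},
]

def token_matches(token, constraints):
    """True if every attribute regex matches the whole attribute value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraints.items())

def find_sequences(tokens, query):
    """Return the word sequences where query[i] matches tokens[start + i]."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(t, c) for t, c in zip(window, query)):
            hits.append(" ".join(t["word"] for t in window))
    return hits

# Roughly [word="it"] [word="resulted"] [word="that"]
exact_hits = find_sequences(corpus, [{"word": "it"}, {"word": "resulted"}, {"word": "that"}])

# Roughly [word="it"] [lemma="result"] [word="that"]
lemma_hits = find_sequences(corpus, [{"word": "it"}, {"lemma": "result"}, {"word": "that"}])

print(exact_hits)  # ['it resulted that']
print(lemma_hits)  # ['it resulted that', 'it results that']
```

Each dictionary in the query plays the role of one pair of square brackets, so a three-element query can only ever match three consecutive tokens.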

Practice Write a CQL query to match: (1) any word beginning with lad: [word="lad.*"]; (2) the word strong followed by any noun (NB: remember that noun tags start with “N”): [word="strong"] [tag="N.+"]

CQL Syntax II We can combine word, lemma and tag queries for any single word. Lemma and tag constraints: [word="it"] [lemma="result" & tag="V.*"] matches only it followed by a morphological variant of the lemma result whose tag begins with V (i.e. a verb).

Practice The word strong followed by any noun: [word="strong"] [tag="N.+"] Rewrite this to search for the lemma strong tagged as adjective (NB: adjective tags in the BNC start with AJ): [lemma="strong" & tag="AJ.*"] [tag="N.+"] The lemma eat in its verb (V) forms: [lemma="eat" & tag="V.*"]

CQL syntax III The empty square brackets signify “any match”. Using complex quantifiers to match things over a span: [word="confus.*" & tag="V.*"] []{0,2} [word="by"] means “a word beginning with confus and tagged as a verb, followed by the word by, with between zero and two intervening words”. Examples: confused by (the problem); confused John by (saying that); confused John Smith by (saying that).
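The behaviour of the empty brackets with a {0,2} quantifier can be imitated with a small recursive matcher. This is a hedged sketch rather than the real query engine: the sentences, the tag labels and the helpers `tok`, `token_matches`, `match_from` and `find` are hypothetical, and each query element is written as (constraints, minimum, maximum).

```python
import re

def tok(word, tag):
    return {"word": word, "tag": tag}

def token_matches(token, constraints):
    # An empty constraint dict plays the role of [] and matches any token.
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraints.items())

def match_from(tokens, pos, query):
    """Return every end position at which query, started at pos, can finish."""
    if not query:
        return [pos]
    constraints, lo, hi = query[0]
    ends = []
    if lo == 0:  # the element may be skipped entirely
        ends.extend(match_from(tokens, pos, query[1:]))
    count, cur = 0, pos
    while count < hi and cur < len(tokens) and token_matches(tokens[cur], constraints):
        cur += 1
        count += 1
        if count >= lo:
            ends.extend(match_from(tokens, cur, query[1:]))
    return ends

def find(tokens, query):
    hits = []
    for start in range(len(tokens)):
        for end in match_from(tokens, start, query):
            hits.append(" ".join(t["word"] for t in tokens[start:end]))
    return hits

# Illustrative sentences; the last one has three intervening words.
sentences = [
    [tok("confused", "VVN"), tok("by", "PRP")],
    [tok("confused", "VVD"), tok("John", "NP0"), tok("by", "PRP")],
    [tok("confused", "VVD"), tok("John", "NP0"), tok("Smith", "NP0"), tok("by", "PRP")],
    [tok("confused", "VVD"), tok("John", "NP0"), tok("Smith", "NP0"),
     tok("again", "AV0"), tok("by", "PRP")],
]

# Roughly [word="confus.*" & tag="V.*"] []{0,2} [word="by"]
query = [({"word": "confus.*", "tag": "V.*"}, 1, 1), ({}, 0, 2), ({"word": "by"}, 1, 1)]
hits = [h for s in sentences for h in find(s, query)]

print(hits)  # ['confused by', 'confused John by', 'confused John Smith by']
```

The fourth sentence is not returned because three words stand between the verb and by, one more than the quantifier allows.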

Practice Search for the verb knock (in any of its forms), followed by the noun door, with between zero and three intervening words: [lemma="knock" & tag="V.*"] []{0,3} [word="door" & tag="N.*"]

We can count occurrences of these complex phrases

Node forms = the actual phrases

Node tags = the tag sequences
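Counting by node form and by node tag amounts to tallying two different views of the same hits. The sketch below uses Python's `collections.Counter` on a handful of invented hits with illustrative tag labels; it imitates the idea of a frequency query, not the sketchengine's actual output.

```python
from collections import Counter

# Hypothetical query hits, each a list of (word, tag) pairs.
hits = [
    [("knocked", "VVD"), ("on", "PRP"), ("the", "AT0"), ("door", "NN1")],
    [("knocked", "VVD"), ("on", "PRP"), ("the", "AT0"), ("door", "NN1")],
    [("knocking", "VVG"), ("at", "PRP"), ("the", "AT0"), ("door", "NN1")],
]

# Node forms: the actual phrases. Node tags: the tag sequences.
node_forms = Counter(" ".join(word for word, _ in hit) for hit in hits)
node_tags = Counter(" ".join(tag for _, tag in hit) for hit in hits)

print(node_forms.most_common())
# [('knocked on the door', 2), ('knocking at the door', 1)]
print(node_tags.most_common())
# [('VVD PRP AT0 NN1', 2), ('VVG PRP AT0 NN1', 1)]
```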

CQL summary A very powerful query language BNC SARA client uses CQL online SketchEngine uses it too Ideal for finding complex grammatical patterns.

A final task Choose two adjectives which are semantically similar. Search for them in the corpus, looking for occurrences where they’re followed by a noun. Run a frequency query on the results.