More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Slides:



Advertisements
Similar presentations
HTML Basics 1450 Technology Seminar Copyright 2003, Matthew Hottell.
Advertisements

Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Slicing Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
What is MLA and why do we use it?
Word 2007 ® Business and Personal Communication How can Word 2007 help you create and manage lengthy documents?
MLA BASICS AWC Writing Center What is MLA?  MLA (Modern Language Association) style is most commonly used to write papers and cite sources within.
Lesson 4: Manage Lengthy Documents When writing a research paper, give proper credit to the sources of your ideas in a bibliography or Works Cited page.
1 Javascrbipt Intro Javascript (or js) is a programming language. Interpreted, not compiled. Not the same as java, but, similar. Use tags to use. Object-oriented.
XHTML1 Building Document Structure. XHTML2 Objectives In this chapter, you will: Learn how to create Extensible Hypertext Markup Language (XHTML) documents.
Introduction Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See
1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009.
Mechanics Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
In-text Citations. In-text citations (APA style) are called “parenthetical citations” in MLA style. APA MLA In-text = parenthetical.
W. Torres What is plagiarism?.
Programming for Linguists An Introduction to Python 24/11/2011.
Functions Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Fixtures Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Nanotech Example Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
(Stream Editor) By: Ross Mills.  Sed is an acronym for stream editor  Instead of altering the original file, sed is used to scan the input file line.
6-1:Adding and Subtracting Rational Expressions Unit 6: Rational and Radical Equations.
Citations and Works Cited Page Research Essentials.
Regular Expressions Chapter 11 Python for Informatics: Exploring Information
Invasion Percolation: Assembly Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Pipes and Filters Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Introduction to Unix – CS 21 Lecture 12. Lecture Overview A few more bash programming tricks The here document Trapping signals in bash cut and tr sed.
Aim: What should the final research paper look like? Note: Your final research paper is due Monday, April 27 th. If you will be absent on that day, you.
Plagiarism 1.Failing to cite quotes and borrowed ideas 2.Failing to enclose borrowed text in quotation marks 3.Failing to put summaries and paraphrases.
Files and Directories Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
 2002 Prentice Hall. All rights reserved. 1 Chapter 13 – String Manipulation and Regular Expressions Outline 13.1 Introduction 13.2 Fundamentals of Characters.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
Regular Expressions This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this.
CSC 110 Using Python [Reading: chapter 1] CSC 110 B 1.
Introducing Python CS 4320, SPRING Lexical Structure Two aspects of Python syntax may be challenging to Java programmers Indenting ◦Indenting is.
First-Class Functions Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Working With Objects Tonga Institute of Higher Education.
More Patterns Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
MLA Formatting and Style Format your writing according to the Modern Language Association In accordance with The Online Writing Lab: Purdue University.
Finding Things Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Microsoft Word 2010 Chapter 2 Creating a Research Paper with Citations and References.
Operators Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Dictionaries Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Help! Not bib & note cards… Oh, the horror of it all!
Normalizing Data for Migration Kyle Banerjee
Formatting and Editing Skills Keyboarding Objective 4.01 – Apply formatting and editing features and operational keys appropriately.
Check with your teacher to find out what they want and what they want it called!
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
Java Basics Regular Expressions.  A regular expression (RE) is a pattern used to search through text.  It either matches the.
PB173 - Tématický vývoj aplikací v C/C++ (jaro 2016)
HTML TEXT.
Writing Technical Reports
CS170 – Week 1 Lecture 3: Foundation Ismail abumuhfouz.
Regular Expressions Chapter 11 Python for Everybody
Program Design Invasion Percolation: Testing
Avoiding Plagiarism, Using Citations and Quotations
Mr. Stadnycki Honors 10 English
Notes Over 2.1 Graph the numbers on a number line. Then write two inequalities that compare the two numbers and and 9 l l l.
Graphing Linear Inequalities
* Lecture # 7 Instructor: Rida Noor Department of Computer Science
Miscellaneous Items Loop control, block labels, unless/until, backwards syntax for “if” statements, split, join, substring, length, logical operators,
Procedures for the “bibFile”
Functions BIS1523 – Lecture 17.
Number and String Operations
Writing Papers in the Biological Sciences
Version Control Basic Operation Copyright © Software Carpentry 2010
HyperText Markup Language
Presentation transcript:

More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See for more information. Regular Expressions

More Tools Thousands of papers and theses written in LaTeX

Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography

Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography Want to see how often citations appear together

Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography Want to see how often citations appear together First step: extract citation sets from documents

Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas

Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space

Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space (including line breaks)

Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space (including line breaks) And multiple citations per line

Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group

Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations?

Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations?

Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations? Matching is greedy

Regular ExpressionsMore Tools Idea #2: match everything inside '{}' except '}'

Regular ExpressionsMore Tools Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) What about multiple citations? Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X',) What about multiple citations? Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X',) What about multiple citations? Need to extract all matches, not just the first Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

Regular ExpressionsMore Tools Idea #3: use re.findall instead of re.search

Regular ExpressionsMore Tools Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."

Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."

Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."

Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries." What about spaces?

Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries." print re.search('cite{([^}]+)}', 'a \\cite{ X} b \\cite{Y } c').groups() [' X', 'Y '] What about spaces?

Regular ExpressionsMore Tools Could tidy this up after matching using string.strip()

Regular ExpressionsMore Tools Could tidy this up after matching using string.strip() Let's modify the pattern instead

Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead

Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead

Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead Still capturing the space after 'Y'

Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead Still capturing the space after 'Y' Match the word-to-nonword transition as well

Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c') [' X', 'Y'] Still capturing the space after 'Y' Match the word-to-nonword transition as well

Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c') [' X', 'Y'] Still capturing the space after 'Y' Match the word-to-nonword transition as well

Regular ExpressionsMore Tools What about multiple labels in a single citation?

Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation?

Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation? Actually can be done, but it's very complex

Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation? Actually can be done, but it's very complex Use re.split() to break matches on '\\s*,\\s*'

Regular ExpressionsMore Tools # Start with a working skeleton. def get_citations(text): '''Return the set of all citation tags found in a block of text.''' return set() if __name__ == '__main__': test = '''\ Granger's work on graphs \cite{dd-gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.''' print get_citations(test) set([])

Regular ExpressionsMore Tools import re CITE = 'cite{\\s*\\b([^}]+)\\b\\s*}' SPLIT = '\\s*,\\s*' def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = re.findall(CITE, text) if match: for citation in match: cites = re.split(SPLIT, citation) for c in cites: result.add(c) return result

Regular ExpressionsMore Tools import re CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}') SPLIT = re.compile('\\s*,\\s*') def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label) return result

Regular ExpressionsMore Tools import re CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}') SPLIT = re.compile('\\s*,\\s*') def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label) return result

Regular ExpressionsMore Tools # Now test it all out. if __name__ == '__main__': test = '''\ Granger's work on graphs \cite{dd-gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.''' print get_citations(test) set(['gr2009', 'stibbons2002', 'dd-gr2007', 'stibbons2008', 'snape87', 'quirrell89'])

June 2010 created by Greg Wilson Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See for more information.