Download presentation
Presentation is loading. Please wait.
Published byJonas Harrell Modified over 9 years ago
1
More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. Regular Expressions
2
More Tools Thousands of papers and theses written in LaTeX
3
Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
4
Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography
5
Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography Want to see how often citations appear together
6
Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography Want to see how often citations appear together First step: extract citation sets from documents
7
Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
8
Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas
9
Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space
10
Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space (including line breaks)
11
Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space (including line breaks) And multiple citations per line
12
Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group
13
Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations?
14
Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations?
15
Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations? Matching is greedy
16
Regular ExpressionsMore Tools Idea #2: match everything inside '{}' except '}'
17
Regular ExpressionsMore Tools Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'
18
Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'
19
Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'
20
Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) What about multiple citations? Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'
21
Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X',) What about multiple citations? Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'
22
Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X',) What about multiple citations? Need to extract all matches, not just the first Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'
23
Regular ExpressionsMore Tools Idea #3: use re.findall instead of re.search
24
Regular ExpressionsMore Tools Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."
25
Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."
26
Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."
27
Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries." What about spaces?
28
Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries." print re.search('cite{([^}]+)}', 'a \\cite{ X} b \\cite{Y } c').groups() [' X', 'Y '] What about spaces?
29
Regular ExpressionsMore Tools Could tidy this up after matching using string.strip()
30
Regular ExpressionsMore Tools Could tidy this up after matching using string.strip() Let's modify the pattern instead
31
Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead
32
Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead
33
Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead Still capturing the space after 'Y'
34
Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead Still capturing the space after 'Y' Match the word-to-nonword transition as well
35
Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c') [' X', 'Y'] Still capturing the space after 'Y' Match the word-to-nonword transition as well
36
Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c') [' X', 'Y'] Still capturing the space after 'Y' Match the word-to-nonword transition as well
37
Regular ExpressionsMore Tools What about multiple labels in a single citation?
38
Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation?
39
Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation? Actually can be done, but it's very complex
40
Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation? Actually can be done, but it's very complex Use re.split() to break matches on '\\s*,\\s*'
41
Regular ExpressionsMore Tools # Start with a working skeleton. def get_citations(text): '''Return the set of all citation tags found in a block of text.''' return set() if __name__ == '__main__': test = '''\ Granger's work on graphs \cite{dd-gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.''' print get_citations(test) set([])
42
Regular ExpressionsMore Tools import re CITE = 'cite{\\s*\\b([^}]+)\\b\\s*}' SPLIT = '\\s*,\\s*' def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = re.findall(CITE, text) if match: for citation in match: cites = re.split(SPLIT, citation) for c in cites: result.add(c) return result
43
Regular ExpressionsMore Tools import re CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}') SPLIT = re.compile('\\s*,\\s*') def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label) return result
44
Regular ExpressionsMore Tools import re CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}') SPLIT = re.compile('\\s*,\\s*') def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label) return result
45
Regular ExpressionsMore Tools # Now test it all out. if __name__ == '__main__': test = '''\ Granger's work on graphs \cite{dd-gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.''' print get_citations(test) set(['gr2009', 'stibbons2002', 'dd-gr2007', 'stibbons2008', 'snape87', 'quirrell89'])
46
June 2010 created by Greg Wilson Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.