Presentation is loading. Please wait.

Presentation is loading. Please wait.

More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Similar presentations


Presentation on theme: "More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See"— Presentation transcript:

1 More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. Regular Expressions

2 More Tools Thousands of papers and theses written in LaTeX

3 Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

4 Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography

5 Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography Want to see how often citations appear together

6 Regular ExpressionsMore Tools Thousands of papers and theses written in LaTeX Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ All share a common bibliography Want to see how often citations appear together First step: extract citation sets from documents

7 Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

8 Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas

9 Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space

10 Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space (including line breaks)

11 Regular ExpressionsMore Tools Citations enclosed in \cite{…} Granger's work on graphs \cite{dd- gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ Multiple labels separated by commas May be white space (including line breaks) And multiple citations per line

12 Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group

13 Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations?

14 Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations?

15 Regular ExpressionsMore Tools print re.search('cite{(.+)}', 'a \\cite{X} b').groups() ('X',) Idea #1: capture everything in cite{…} in a group print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X} b \\cite{Y',) What about multiple citations? Matching is greedy

16 Regular ExpressionsMore Tools Idea #2: match everything inside '{}' except '}'

17 Regular ExpressionsMore Tools Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

18 Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

19 Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

20 Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) What about multiple citations? Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

21 Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X',) What about multiple citations? Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

22 Regular ExpressionsMore Tools print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups() ('X',) print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups() ('X',) What about multiple citations? Need to extract all matches, not just the first Idea #2: match everything inside '{}' except '}' Use '[^}]' to negate the set containing only '}'

23 Regular ExpressionsMore Tools Idea #3: use re.findall instead of re.search

24 Regular ExpressionsMore Tools Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."

25 Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."

26 Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries."

27 Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries." What about spaces?

28 Regular ExpressionsMore Tools print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c') ['X', 'Y'] Idea #3: use re.findall instead of re.search "A programmer is only as good as her knowledge of her language's libraries." print re.search('cite{([^}]+)}', 'a \\cite{ X} b \\cite{Y } c').groups() [' X', 'Y '] What about spaces?

29 Regular ExpressionsMore Tools Could tidy this up after matching using string.strip()

30 Regular ExpressionsMore Tools Could tidy this up after matching using string.strip() Let's modify the pattern instead

31 Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead

32 Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead

33 Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead Still capturing the space after 'Y'

34 Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead Still capturing the space after 'Y' Match the word-to-nonword transition as well

35 Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c') [' X', 'Y'] Still capturing the space after 'Y' Match the word-to-nonword transition as well

36 Regular ExpressionsMore Tools print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c') ['X', 'Y '] Could tidy this up after matching using string.strip() Let's modify the pattern instead print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c') [' X', 'Y'] Still capturing the space after 'Y' Match the word-to-nonword transition as well

37 Regular ExpressionsMore Tools What about multiple labels in a single citation?

38 Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation?

39 Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation? Actually can be done, but it's very complex

40 Regular ExpressionsMore Tools print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ') ['X,Y'] print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ') ['X, Y, Z'] What about multiple labels in a single citation? Actually can be done, but it's very complex Use re.split() to break matches on '\\s*,\\s*'

41 Regular ExpressionsMore Tools # Start with a working skeleton. def get_citations(text): '''Return the set of all citation tags found in a block of text.''' return set() if __name__ == '__main__': test = '''\ Granger's work on graphs \cite{dd-gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.''' print get_citations(test) set([])

42 Regular ExpressionsMore Tools import re CITE = 'cite{\\s*\\b([^}]+)\\b\\s*}' SPLIT = '\\s*,\\s*' def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = re.findall(CITE, text) if match: for citation in match: cites = re.split(SPLIT, citation) for c in cites: result.add(c) return result

43 Regular ExpressionsMore Tools import re CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}') SPLIT = re.compile('\\s*,\\s*') def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label) return result

44 Regular ExpressionsMore Tools import re CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}') SPLIT = re.compile('\\s*,\\s*') def get_citations(text): '''Return the set of all citation tags found in a block of text.''' result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label) return result

45 Regular ExpressionsMore Tools # Now test it all out. if __name__ == '__main__': test = '''\ Granger's work on graphs \cite{dd-gr2007,gr2009}, particularly ones obeying Snape's Inequality \cite{ snape87 } (but see \cite{quirrell89}), has opened up new lines of research. However, studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.''' print get_citations(test) set(['gr2009', 'stibbons2002', 'dd-gr2007', 'stibbons2008', 'snape87', 'quirrell89'])

46 June 2010 created by Greg Wilson Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information.


Download ppt "More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See"

Similar presentations


Ads by Google