Published by Melvin Lloyd. Modified over 9 years ago.
2
Regular expressions are:
› A language or syntax that lets you specify patterns for matching, e.g., filenames or strings
› Used to identify the files or lines you want to work with
› Used inside substitution functions to change the contents of a string
3
ls 14*
› * is a shell wildcard here, not a regex quantifier
› Matches 14 followed by zero or more of any character
ls 14[0-1][0-9]*
› [0-1] and [0-9] are character classes (bracket expressions, which shell wildcards share with regex), each matching a single character in the range 0 to 1 and 0 to 9, respectively
ls 14[0-1][0-9][0-3][0-9]*
› 6 digits that look like a date YYMMDD, mostly: the classes still allow impossible dates such as 141939
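The same character classes carry over to a real regular expression. A minimal Perl sketch, assuming a handful of invented filenames (none of them come from the slides):

```perl
use strict;
use warnings;

# Hypothetical filenames, invented for this sketch.
my @files = ('140312.log', '141231_export.seq', '142001.log', '141939.log', 'notes.txt');

# Same classes as the wildcard 14[0-1][0-9][0-3][0-9]*, anchored with ^.
# No trailing * is needed: a regex without an end anchor allows anything after.
my @date_like = grep { /^14[0-1][0-9][0-3][0-9]/ } @files;

print "$_\n" for @date_like;
```

142001.log is rejected because 2 is outside [0-1], while 141939.log gets through even though it is not a real date.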
4
mv [b-z]* $data_scratch
› An alphabetical class which, depending on your system's locale (collation order), might match only the lowercase letters from b through z, OR a mix of upper and lower case: b C c D d ... Z z
grep 'MIT01$' sysnos.txt
› Find lines that end ($) with MIT01
› ^ can be used to match at the beginning of a line
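The two anchors behave the same way in Perl. A small sketch, assuming three invented lines standing in for sysnos.txt (MIT50 is a made-up code used only as a non-match):

```perl
use strict;
use warnings;

# Hypothetical lines standing in for the contents of sysnos.txt.
my @lines = ('001234567MIT01', '001234568MIT50', 'MIT01 is the library code');

my @ends_with   = grep { /MIT01$/ } @lines;   # same test as grep 'MIT01$'
my @starts_with = grep { /^MIT01/ } @lines;   # ^ anchors at the start instead

print "ends with MIT01:   @ends_with\n";
print "starts with MIT01: @starts_with\n";
```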
5
In vi, you can use regular expressions with the s/// substitution operator
With emacs, use M-x query-replace-regexp
› Replace $ (the end of the line) with MIT01
› Take a list of system numbers and make it valid input to an Aleph service by adding the library code to the end of each line
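The same replacement can be sketched in Perl rather than in an editor. The system numbers below are invented; only the s/$/MIT01/ substitution mirrors the slide:

```perl
use strict;
use warnings;

# Hypothetical system numbers, one per element (already chomped).
my @sysnos = ('001234567', '001234568');

# Copy the list, then "replace" the end of each string with MIT01.
# $ matches a position, not a character, so the effect is to append.
my @aleph_input = @sysnos;
s/$/MIT01/ for @aleph_input;

print "$_\n" for @aleph_input;
```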
6
Look through a MARC file in Aleph sequential format for lines with tag 260
› 001234567 260 L $$aCambridge$$bMIT Press
if ($matched =~ m/^\d{9}\s260.+/) {... }
› $matched is the while loop variable representing the line we're working on
› =~ is the pattern binding operator, used with the matching (m), substitution (s), and translation (tr) operators
› m// is the pattern matching operator
7
^     start at the beginning of the line
\d    Perl-speak for the digits character class
{9}   a quantifier. Find exactly 9 of \d
\s    Perl-speak for the whitespace character class
260   the MARC tag I'm looking for
.     any character
+     a quantifier. Find 1 or more of the preceding item (here, .)
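To see the pieces working together, here is a minimal sketch that runs the pattern over two lines. The first line is the sample from the slide; the second (tag 245) is invented as a non-match, and an array stands in for the while loop over a file:

```perl
use strict;
use warnings;

# Single quotes keep $$a and $$b literal instead of interpolating them.
my @lines = (
    '001234567 260 L $$aCambridge$$bMIT Press',   # sample from the slide
    '001234567 245 L $$aAn invented title',       # hypothetical non-match
);

my @tag_260;
for my $matched (@lines) {
    # 9 digits, one whitespace character, the tag 260, then at least one more character
    if ($matched =~ m/^\d{9}\s260.+/) {
        push @tag_260, $matched;
    }
}

print "$_\n" for @tag_260;
```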
9
Look for deleted records
› LDR position 05 is d
› $my_LDR =~ /LDR L.....d/
Look for e-resource records
› $my_245 =~ /\$\$h\[electronic resource\]/
Look for OCLC numbers
› $my_035 =~ /(\(OCoLC\)\d{8,10})/
› Note the double use of () here: the escaped \( and \) match literal parentheses, while the unescaped outer pair captures the match
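A short sketch of the e-resource and OCLC patterns, assuming invented field contents (the title and the number 12345678 are placeholders, not real data):

```perl
use strict;
use warnings;

# Invented field contents; single quotes keep the $$ subfield markers literal.
my $my_245 = '$$aAn invented title$$h[electronic resource]';
my $my_035 = '$$a(OCoLC)12345678';

# \$ and \[ \] are escaped so they match literally.
my $is_eresource = $my_245 =~ /\$\$h\[electronic resource\]/ ? 1 : 0;

# Escaped \( \) match the literal parentheses around OCoLC;
# the unescaped outer ( ) capture the prefix plus digits into $1.
my $oclc = '';
if ($my_035 =~ /(\(OCoLC\)\d{8,10})/) {
    $oclc = $1;
}

print "e-resource: $is_eresource, OCLC: $oclc\n";
```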
10
if ($hash{$tmp} =~ m/SKIP/ || $hash{$tmp} =~ m/NEW/) {
    $new_count++  if (m/ FMT L /);
    $skip_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP/);
    $bre_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Brief/);
    $bks_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Books24x7/);
    $eebo_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EEBO/);
    $epda_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EPDA/);
    $sta_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP STA/);
}
11
We have a browse index of URLs
An Aleph browse index only sorts the first 69 characters of the field
When we have many URLs from the same site, we need to get the unique part closer to the beginning
Following is an SFX OpenURL from the MARCit! service
12
http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&
13
http://owens.mit.edu/sfx_local?rft.object_id=3710000000092335&url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&
14
$my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;
s is the substitution operator
› s/find this/replace it with this/
Parentheses are used here to group different sections of the pattern, and then rearrange them
15
$1               The first matched parenthetical section
^.*sfx_local\?   From the beginning, anything up to and including sfx_local?
                 ? is a special character and is escaped here to get a literal question mark
$2               The 2nd matched parenthetical section
.*               Any number of any character, until it reaches the next match string
16
Now change the order from $1$2$3$4 to $1$3$2$4
$3                         The 3rd parenthetical section
rft\.object_id\=\d{1,}\&   rft.object_id= followed by one or more digits and an ampersand
                           = and & are escaped with \ here, but neither is a special character in a Perl regex, so those backslashes are harmless and optional
                           {1,} is, like +, a quantifier meaning one or more
$4                         The 4th and final parenthetical section
.*$                        Any number of any character to the end
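Running the substitution on the OpenURL shown earlier reproduces the reordered URL. A minimal sketch; the URL is the one from the slide, split across several strings only for readability:

```perl
use strict;
use warnings;

# The OpenURL from the slide, concatenated from shorter pieces.
my $my_856 = 'http://owens.mit.edu/sfx_local?'
    . 'url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8'
    . '&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx'
    . '&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335'
    . '&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&';

# $1 = up to sfx_local?   $2 = parameters before the object id
# $3 = the object id      $4 = whatever follows it
$my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;

print "$my_856\n";
```

After the swap, rft.object_id sits immediately after sfx_local?, well inside the first 69 characters that the browse index sorts on.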
17
Thesis degree, year, and department are stored in a single free text MARC field 502
We have applied some structure to this, but it has varied over time
In DSpace, we want to get these 3 bits into separate fields, so the note is parsed on the way from MARC to Dublin Core
18
$MIT = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
› ? is the zero or one quantifier
› | match the pattern alternative before or after this
$Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';
› A few small character classes, to allow for case variation, and Department vs Dept.
19
$Month = 'January|February|March|April|May|June|July|August|September|October|November|December';
› match any one month name when $Month is used inside a pattern
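A sketch of $Month inside a larger pattern. The surrounding pattern and the variable names $note, $month, and $year are assumptions made for illustration; the sample text is taken from one of the thesis notes later in the deck:

```perl
use strict;
use warnings;

my $Month = 'January|February|March|April|May|June|July|August|September|October|November|December';

my $note = 'Dept. of Linguistics and Philosophy, February 2004.';

# The parentheses matter: they confine the alternation to the month names
# (otherwise | would split the whole pattern) and capture the month found.
my ($month, $year) = $note =~ /($Month)\s+(\d{4})/;

print "$month $year\n";
```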
20
/^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
/^Thesis\.    Begin with Thesis.
\s+           1 or more whitespace characters
(\d+)         1 or more digits = $1
\.?           0 or 1 period
\s+           1 or more whitespace characters
([\w\.\s]+)   1 or more word chars, periods, spaces = $2
--            two literal hyphens
($MIT)        something matching $MIT = $3
21
/^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
\.?        0 or 1 period
\s+        1 or more whitespace characters
($Dept)?   0 or 1 strings matching $Dept = $4
\s*        0 or more whitespace characters
(.+)$      anything left to the end = $5
/o         An option. Compile the expression only once. The variables $MIT and $Dept are not going to change
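To check the breakdown, here is a sketch that applies the pattern to one note. The note is invented to fit this format, the capture variable names are hypothetical, and the pattern is written as --($MIT) on the assumption that any space after the hyphens on the slide is a line-wrap artifact:

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# An invented note in the layout this pattern expects.
my $note = 'Thesis. 1968. Ph.D.--Massachusetts Institute of Technology. Dept. of Economics.';

my ($year, $degree, $school, $dept_label, $dept_name);
if ($note =~ /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o) {
    # Copy the captures right away; the next successful match would overwrite $1..$5.
    ($year, $degree, $school, $dept_label, $dept_name) = ($1, $2, $3, $4, $5);
    print "year=$year degree=$degree department=$dept_name\n";
}
```

The trailing periods stay inside $3 and $5, so a real MARC to Dublin Core conversion would probably trim them afterwards.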
22
Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D.
Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D.
/^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o
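A sketch that runs this pattern over the two sample notes. The hash keys dept, year, and degree are hypothetical names chosen here, and the pattern is written without a space before the closing parenthesis of the fourth group, on the assumption that the space on the slide is a line-wrap artifact:

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# The two sample notes from the slide.
my @notes = (
    'Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D.',
    'Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D.',
);

my @parsed;
for my $note (@notes) {
    if ($note =~ /^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o) {
        # $4 = department name, $5 = year, $6 = degree
        push @parsed, { dept => $4, year => $5, degree => $6 };
    }
}

printf "%s | %s | %s\n", @{$_}{qw(dept year degree)} for @parsed;
```

The department capture keeps its trailing punctuation (a period in the first note, a comma in the second), which shows how much cleanup the free-text field still needs after matching.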
23
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1973.
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics an Astronautics.
Thesis. (M.S.)--Sloan School of Management, 1983.
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 1951.
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, February 2004.
/^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/o
24
Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; and the Woods Hole Oceanographic Institution), 2013.
/^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ([\w\.\s]+)\((($MIT)[\.,]?)?\s*($Dept)?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$/o
25
orbitee@mit.edu