1 Introduction to Perl Part III: Biological Data Manipulation
2 Column data
- Column delimited data
- Often CSV, comma delimited
- Tab, space, or other character delimited
- Process data in scripts instead of using Excel
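As a minimal sketch of processing delimited data in a script, the following splits one line on a chosen delimiter. The helper name split_line and the sample lines are made up for illustration. A plain split like this does not handle quoted CSV fields that themselves contain commas.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): split one line of delimited
# text into fields. $delim is a regular expression, e.g. qr/,/ or qr/\t/.
sub split_line {
    my ($line, $delim) = @_;
    chomp $line;
    # The -1 limit keeps trailing empty fields instead of dropping them.
    return split($delim, $line, -1);
}

# Made-up sample lines, one comma delimited and one tab delimited.
my @csv = split_line("geneA,10,2.5\n", qr/,/);
my @tab = split_line("chr1\tBLAST\tHSP\n", qr/\t/);

print scalar(@csv), " fields, second is $csv[1]\n";
print scalar(@tab), " fields, third is $tab[2]\n";
```

The same helper covers tab, space, or any other single-character delimiter by changing the pattern passed in.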
3 GFF formats
- 9 columns, tab delimited: seq_id, source, feature, start, end, score, strand, frame, group
- mats/GFF/GFF_Spec.shtml
- Actually 3 or 4 different versions; GFF3 will hopefully be the new emerging standard
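To make the column layout concrete, here is a small sketch that turns one GFF line into a hash keyed by the column names listed above. The helper name parse_gff_line and the sample record are assumptions made up for this example; real GFF files differ between versions in how the last column is written.

```perl
use strict;
use warnings;

# Column names as listed on the slide.
my @GFF_COLS = qw(seq_id source feature start end score strand frame group);

# Hypothetical helper: turn one tab-delimited GFF line into a hash
# reference keyed by column name.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    # Limit of 9 so any tabs inside the last column stay in that column.
    my @vals = split(/\t/, $line, 9);
    my %rec;
    @rec{@GFF_COLS} = @vals;   # hash slice: assign all columns at once
    return \%rec;
}

# Made-up record for illustration.
my $line = join("\t", 'chr1', 'BLAST', 'HSP', 100, 250, '1e-20', '+', '.',
                'Target=geneA') . "\n";
my $rec = parse_gff_line($line);
print "$rec->{feature} on $rec->{seq_id}: $rec->{start}..$rec->{end} ($rec->{strand})\n";
```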
4 BLAST -m9 output
- The columns are: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
- Lines starting with '#' are comments
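One way to work with these twelve columns by name rather than by number is sketched below. The short key names and the helper parse_m9_line are invented for this example, and the input lines are made up.

```perl
use strict;
use warnings;

# Short, made-up keys for the twelve -m9 columns, in the order given above.
my @M9_COLS = qw(query subject pid aln_len mismatches gaps
                 q_start q_end s_start s_end evalue bits);

# Hypothetical helper: returns a hash reference for a data line, or
# nothing for a comment line or a blank line.
sub parse_m9_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;
    my %hit;
    @hit{@M9_COLS} = split(/\t/, $line);
    return \%hit;
}

# Made-up input: one comment line and one hit line.
my @lines = (
    "# Fields: Query id, Subject id, ...\n",
    join("\t", qw(q1 s1 85.50 120 15 2 1 120 300 659 1e-30 150)) . "\n",
);
for my $line (@lines) {
    my $hit = parse_m9_line($line) or next;   # skips the comment line
    print "$hit->{query} vs $hit->{subject}: $hit->{pid}% identity\n";
}
```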
# TBLASTN [May ]
# Query: GLEAN_08256_1 pchr_1:join(complement( ),complement( ),complement( ),complement( ),complement( ),complement( ),complement( ),complement( ),comp
# Database: /data/blast/cryptococcus_neoformans_JEC fa
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr
GLEAN_08256_1 cn-jec21_chr
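Process data, filter by a percent id, print GFF
  # $filename and $min_pid (the percent identity cutoff) are set earlier;
  # the cutoff value itself is not given here
  open(IN, $filename) || die $!;
  while (<IN>) {
    next if /^#/;                      # skip comment lines
    chomp;
    my @cols = split(/\t/, $_);
    next if $cols[2] < $min_pid;       # filter by percent identity
    my ($start, $end, $strand) = ($cols[6], $cols[7], '+');
    if ( $start > $end ) {
      ($start, $end, $strand) = ($end, $start, '-');
    }
    print join("\t", $cols[0], 'BLAST', 'HSP', $start, $end, $cols[10],
               $strand, '.',
               "Target=$cols[1]+$cols[8]+$cols[9]"), "\n"; # Target=subject+start+end
  }
BLAST cols: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GFF cols: seq_id, source, feature, start, end, score, strand, frame, group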
7 Microarray data
- Lots of columns, R/G channels
- Want to: add a log transform, sort the data, get a subset of the data
8 Filter rows
ORF Name G1 G1.Bkg R1 R1.Bkg F1 G2 G2.Bkg R2 R2.Bkg F2 G3 G3.Bkg R3 R3.Bkg F3 G4 G4.Bkg R4 R4.Bkg F4 G5 G5.Bkg R5 R5.Bkg F5 G6 G6.Bkg R6 R6.Bkg F6 G7 G7.Bkg R7 R7.Bkg F7 G1.Ratio G1.Ratio G2.Ratio G3.Ratio G4.Ratio G5.Ratio G6.Ratio G7.Ratio R1.Ratio R2.Ratio R3.Ratio R4.Ratio R5.Ratio R6.Ratio R7.Ratio G1-Bkg G2-Bkg G3-Bkg G4-Bkg G5-Bkg G6-Bkg G7-Bkg R1-Bkg R2-Bkg R3-Bkg R4-Bkg R5-Bkg R6-Bkg R7-Bkg
YHR007C ERG
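9 Filter rows
  # <> reads from the files named on the command line, or from STDIN
  my $header = <>;
  my @header = split(/\s+/, $header);
  my $i = 0;
  my %header_col_num = map { $_ => $i++ } @header;   # column name => column number
  my $index = $header_col_num{'G2.Ratio'};
  my @rows;
  while( <> ) {
    my @col = split;
    if( $col[$index] > 2 ) {
      push @rows, [@col];     # keep rows whose G2.Ratio is above 2
    }
  }
  for my $row ( sort { $a->[$index] <=> $b->[$index] } @rows ) {
    print $row->[$header_col_num{'ORF'}], " ", $row->[$index], "\n";
  }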
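Add a column
  sub log_2 { return log(shift) / log(2); }
  my $header = <>;
  my @header = split(/\s+/, $header);
  my $i = 0;
  my %header_col_num = map { $_ => $i++ } @header;
  my $index = $header_col_num{'G2.Ratio'};
  my @rows;
  while( <> ) {
    my @col = split;
    my $extra_col = log_2($col[$index]);
    push @rows, [$col[0], $col[$index], $extra_col];
  }
  # each stored row is [ORF, ratio, log2 ratio]; sort on the new log2 column
  for my $row ( sort { $a->[2] <=> $b->[2] } @rows ) {
    print join("\t", @$row), "\n";
  }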
11 Motif finding with regexps
- Want to find a binding site motif in DNA sequence
- Find motif in protein sequence
12 Let's find SBF binding site
- SBF binding site in yeast: CACGAAA and CGCGAAA
- Combine these into C[AG]CGAAA
- Search DNA sequence for these sites
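A quick sketch of why the character class works: [AG] accepts either base at the second position, so one pattern covers both sites and rejects anything else there. The test strings and the helper count_sites are made up for this example.

```perl
use strict;
use warnings;

# The combined pattern from the slide, precompiled with qr//.
my $sbf = qr/C[AG]CGAAA/;

# Two real sites and one made-up non-site (T in the second position).
for my $site (qw(CACGAAA CGCGAAA CTCGAAA)) {
    my $verdict = $site =~ /^$sbf$/ ? "matches" : "does not match";
    print "$site $verdict\n";
}

# Hypothetical helper: count non-overlapping sites in a sequence.
sub count_sites {
    my ($seq) = @_;
    # Assigning the match list to () and then to a scalar gives the count.
    my $n = () = $seq =~ /$sbf/g;
    return $n;
}

print count_sites("TTCACGAAATTCGCGAAATT"), " sites found\n";
```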
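13 Find one motif
  my $dna;
  my $seen = 0;
  while( <> ) {
    if( /^>/ ) {
      last if ( $seen );     # stop at the second FASTA header
      $seen = 1;
      next;                  # do not add the header line to the sequence
    }
    chomp;
    $dna .= $_;
  }
  if( $dna =~ /(C[AG]CGAAA)/ ) {
    # found the site but how to
    # say where it is in the sequence?
  }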
14 More special variables
- ` is the back quote (same key as ~)
- ' is the single quote (same key as ")
- $` is the stuff before the match
- $' is the stuff after the match
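The sketch below prints these variables for one match, together with $& (the matched text itself). It also shows the built-in @- and @+ arrays, which give the start and end offsets of the last match directly; those are not mentioned on the slide and are included here only as an alternative. The sequence and the helper match_span are made up.

```perl
use strict;
use warnings;

my $dna = "TTTCACGAAAGG";   # made-up sequence

if ($dna =~ /C[AG]CGAAA/) {
    print "before: ", $`, "\n";   # text before the match
    print "match:  ", $&, "\n";   # the matched text
    print "after:  ", $', "\n";   # text after the match
    print "offset from prefix length: ", length($`), "\n";
}

# Hypothetical helper: return the 0-based start offset and the offset
# just past the end of the first site, using @- and @+.
sub match_span {
    my ($seq) = @_;
    return unless $seq =~ /C[AG]CGAAA/;
    return ($-[0], $+[0]);
}

my ($start, $end) = match_span($dna);
print "span from \@- and \@+: $start..$end\n";
```

Both approaches give the same start offset; the length of the prefix is simply the number of characters before the match.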
15 Find one motif
  if( $dna =~ /(C[AG]CGAAA)/ ) {
    my $location = length($`);
    printf "$1 found at %d..%d\n", $location, $location+length($1);
  }
16 Find multiple instances
  while( $dna =~ /(C[AG]CGAAA)/ig ) {
    my $location = length($`);
    print "$1 found at $location\n";
  }
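17 What about reverse strand?
  my $len = length($dna);
  $dna = reverse($dna);
  $dna =~ tr/CAGT/GTCA/;
  if( $dna =~ /(C[AG]CGAAA)/ ) {
    my $location = length($`);
    # convert back to forward-strand coordinates, printed high..low
    printf "$1 found at %d..%d\n", $len - $location, $len - $location - length($1);
  }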
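Putting the forward and reverse searches together, a sketch along these lines reports every site on both strands in forward-strand coordinates. The helper names revcomp and find_sites and the test sequence are made up for this example, and offsets are 0-based with the end offset pointing just past the site.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (upper-case A, C, G, T only).
sub revcomp {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Hypothetical helper: find every site on both strands. Each hit is
# [start, end, strand] in forward-strand offsets.
sub find_sites {
    my ($seq, $motif) = @_;
    my @hits;
    while ($seq =~ /$motif/g) {
        push @hits, [$-[0], $+[0], '+'];
    }
    my $len = length $seq;
    my $rc  = revcomp($seq);
    while ($rc =~ /$motif/g) {
        # Offset i on the reverse complement is offset $len - i on the
        # forward strand, so start and end swap roles.
        push @hits, [$len - $+[0], $len - $-[0], '-'];
    }
    return @hits;
}

# Made-up sequence: CACGAAA on the forward strand, and TTTCGCG, which is
# the reverse complement of CGCGAAA.
my $dna = "AACACGAAAGGTTTCGCGAA";
for my $hit (find_sites($dna, qr/C[AG]CGAAA/)) {
    my ($start, $end, $strand) = @$hit;
    print "site at $start..$end on $strand strand\n";
}
```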
18 Making reports
- Text reports are great for summarizing output
- HTML is an easy and excellent way to summarize output and make it pretty
- Allows for linking to other resources
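For the plain-text case, a report can be as simple as aligned columns built with sprintf. This is only a sketch: the helper format_report, the column widths, and the locations are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format rows of [motif, chromosome, location] as an
# aligned text table and return it as one string.
sub format_report {
    my (@rows) = @_;
    # %-10s left-justifies in 10 characters; %8d right-justifies a number.
    my $out = sprintf("%-10s %-6s %8s\n", 'Motif', 'Chrom', 'Location');
    for my $row (@rows) {
        $out .= sprintf("%-10s %-6s %8d\n", @$row);
    }
    return $out;
}

# Made-up hits, not real genome locations.
print format_report(['CACGAAA', 'I', 1234], ['CGCGAAA', 'IV', 56789]);
```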
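19 HTML with CGI.pm
  use CGI qw/:standard/;   # imports the standard functions; loosely like "using namespace std;" in C++
  open(OUT, ">report.html") || die $!;
  # header() is left out here: it emits an HTTP header, which belongs in
  # CGI output, not in a saved file
  print OUT start_html('Motifs found'),
            h1('Motifs found'),
            table(Tr(th(["Motif", "Chrom", "Location"])),
                  Tr(td(["CACGAAA", "I", " "]))),
            hr,
            end_html;
  close(OUT);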