1
Short introduction to Perl & GFF
Marcus Ronninger
The Linnaeus Centre for Bioinformatics
2
Motivation
- Bioinformatics yields lots of information
- The information has to be mined
- Build or modify text files
- Small changes can take a long time with lots of data
- Example: change every letter to lower case
- With script programming this can be done in less than a second
3
Perl
- Practical Extraction and Report Language
- Scripts
- Object-oriented programming
- Graphical web interface, CGI
- Possibilities
- BioPerl
4
Example
A very simple Perl script, to_lower_case.pl:

    #!/usr/bin/perl -w
    use strict;

    # Usage: to_lower_case.pl <input file> <output file>
    my $seqfile = $ARGV[0];
    my $outfile = $ARGV[1];

    open (SEQ, $seqfile) || die "Can't open file: $seqfile";
    open (OUTFILE, "> $outfile") || die "Can't write file: $outfile";

    while (<SEQ>) {
        if ($_ =~ /^>/) {
            # header line (starts with '>'): copy unchanged
            print OUTFILE $_;
        }
        else {
            # sequence line: convert to lower case
            print OUTFILE lc($_);
        }
    }

    close(SEQ);
    close(OUTFILE);
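The per-line decision in the script can be tried without any files. The following is a minimal sketch, not part of the original slides: the helper name `lower_case_fasta_line` and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): a header line
# (starting with '>') is returned unchanged, any other line is lower-cased.
sub lower_case_fasta_line {
    my ($line) = @_;
    return $line =~ /^>/ ? $line : lc $line;
}

# In-memory sample lines instead of a real sequence file.
my @lines = (">Seq1 Homo sapiens\n", "ACGTTGCA\n", "GGCCAATT\n");
print lower_case_fasta_line($_) for @lines;
```

Keeping the header test in one small subroutine makes it easy to check the rule on its own before running the script on a large file.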
5
Useful tools for parsing files
- Scalar: $
- Array: @
- Regular expression: /\.fasta/
- Split: @chars = split //, $word
- Substitute: s/old-regex/new-string/
- Upper and lower case: uc, lc
- Escape characters: \n \t \s etc.
- sub
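The tools listed above can be combined in a few lines. This is an illustrative sketch only: the subroutine `to_rna`, the file name `genes.fasta` and the sample sequence are assumptions made for the example, not taken from the slides.

```perl
use strict;
use warnings;

# sub: a reusable subroutine; this one uses a substitution to turn DNA into RNA.
sub to_rna {
    my ($dna) = @_;          # scalar holding the sequence
    $dna =~ s/T/U/g;         # substitute every T with U
    return $dna;
}

my $word  = "ACGT";                           # scalar
my @chars = split //, $word;                  # array of single characters

my $name = "genes.fasta";                     # hypothetical file name
print "fasta file\n" if $name =~ /\.fasta$/;  # regular expression match

print join("\t", @chars), "\n";               # \t and \n escape characters
print lc($word), " ", uc("acgt"), "\n";       # lower and upper case
print to_rna($word), "\n";
```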
6
General feature format, GFF
- Also known as "gene finding format"
- A format for handling output from different feature-finding programs
- Processes can be decoupled, but the results can still be put together
- Makes it easy to include external algorithms
7
General feature format
The construction of the format is very simple. The values are tab-delimited.

SEQ1   EMBL   atg    103   105   .    +    0
SEQ1   EMBL   exon   103   172   .    +    0
1.     2.     3.     4.    5.    6.   7.   8.

1. Sequence name
2. Source of the feature
3. Feature type
4. Start
5. End
6. Score: most feature-finding programs have some kind of score for the found motif
7. Strand: can be either + or -
8. Frame: 0, 1, 2 or .
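Because the values are tab-delimited, one line can be split into the eight fields described above with a single `split`. The following is a minimal sketch; the helper name `parse_gff_line` and the hash keys are chosen for this example and are not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GFF line into its eight named fields.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my %feature;
    @feature{qw(seqname source feature start end score strand frame)}
        = split /\t/, $line;
    return \%feature;
}

# The exon line from the slide, written with explicit tabs.
my $f = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\n");

# Start and end are both part of the feature (the atg at 103..105 is 3 bases).
my $length = $f->{end} - $f->{start} + 1;
print "$f->{feature} on $f->{seqname}, strand $f->{strand}, length $length\n";
```

For the exon line this prints a length of 70.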
8
Small example
A small script that transforms known transcription factor binding sites (TFBS) into a .gff file

TFBS   Position   Motif
AP-2   -101       ccccaccccc
NF-1   -116       tgggctgcggccca
Hgcs   -117       ctgggctgcggc

Input file:

#Gfap
#Known TFBS (Besnard et al 1991)
#count backwards from the TSS
#start -14
AP-2: ccccaccccc -101
NF-1: tgggctgcggccca -116
Hgcs: ctgggctgcggc -117
9
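Example
Basically the same procedure as the Perl example above

    $seqlength = 5000;
    $gff = "";

    # The input handle is missing on the slide; <> is assumed here
    # (it reads the files named on the command line).
    while (<>) {
        if ($_ =~ /^#start/) {
            $rel_start = $';    # text after the match, e.g. " -14"
        }
        elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {
            # any non-comment, non-empty line describes one binding site
            make_gff($_, $rel_start, "Literature");
        }
    }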
10
Example

    # The input handle is missing on the slide; <> is assumed here
    # (it reads the files named on the command line).
    while (<>) {
        if ($_ =~ /^#start/) {
            $rel_start = $';    # text after the match, e.g. " -14"
        }
        elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {
            make_gff($_, $rel_start, "Literature");
        }
    }

    sub make_gff {
        my $start;
        my $stop;
        (my $seq, my $rs, my $type) = @_;
        my @feature = split(/\s+/, $seq);
        # now the array has the feature information:
        # $feature[0] = name, $feature[1] = motif, $feature[2] = position
        if ($type eq "Literature") {
            $start = $seqlength + $rs + $feature[2];
            $stop = $start + length($feature[1]) - 1;
            $sign = '.';
            $gff .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
        }
        # etc.
    }
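The coordinate arithmetic in make_gff can be checked in isolation. The sketch below is a self-contained variant under stated assumptions: it returns the GFF line instead of appending to the global $gff, the name `make_gff_line` is made up, and the input line is assumed to be whitespace-separated as "name motif position". With the formula exactly as printed on the slide, AP-2 comes out as 4885 to 4894, one position lower than the 4886 to 4895 shown in the slide's output, so the original script presumably applies an offset that the slide does not show.

```perl
use strict;
use warnings;

my $seqlength = 5000;   # value taken from the slide

# Hypothetical variant of make_gff: builds and returns one GFF line
# for a literature-derived binding site.
sub make_gff_line {
    my ($line, $rel_start, $type) = @_;
    my ($name, $motif, $pos) = split /\s+/, $line;

    # Formula as printed on the slide (no extra offset applied).
    my $start = $seqlength + $rel_start + $pos;
    my $stop  = $start + length($motif) - 1;

    return join("\t", $name, $type, $name, $start, $stop, "undef", ".", ".") . "\n";
}

print make_gff_line("AP-2: ccccaccccc -101", -14, "Literature");
```

Returning the line rather than modifying a global string is only a design choice of this sketch; it makes the subroutine easier to test on a single input.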
11
Example
Output: a file named lit.gff with the following contents

AP-2:   Literature   AP-2:   4886   4895   undef   .   .
NF-1:   Literature   NF-1:   4871   4884   undef   .   .
Hgcs:   Literature   Hgcs:   4870   4881   undef   .   .

This can now be loaded into programs that support the GFF format, e.g. Apollo
12
Apollo
- GFF files are tedious to read as plain text
- Use graphical software
- Apollo, a sequence annotation editor
- Great for viewing GFF files together with the sequence
13
References
- Tisdall J.D., "Beginning Perl for Bioinformatics", 2001, O'Reilly
- http://www.sanger.ac.uk/Software/formats/GFF/
- http://www.fruitfly.org/annot/apollo/