Introduction to Bioinformatic Computation. Lecture #


1 Introduction to Bioinformatic Computation. Lecture #5 02-15-10
Objectives for today: regular expressions; practice in Perl programming.

2 Compression and decompression
Downloading BLAST
tar (tape archive)
Command line for packing files: tar -cvf archive.tar dir_names
Command line for unpacking files: tar -xvf archive.tar
Compression and decompression: gzip and gunzip

3 UNIX commands
find / -name your_filename -print
sort -nr -k4 -k7 filename

4 Regular expressions
A variable can be used inside a regular expression:
$c_box = 'tgatga';
if ($sequence =~ /$c_box/) { print "C-box inside \n"; }
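The interpolation shown on this slide can be tried with a short script. This is a minimal sketch: the sub name contains_motif and the test sequence are made up for illustration, and \Q...\E is added (it is not on the slide) so that any regex metacharacters in the variable are treated as literal text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if $motif occurs in $sequence, 0 otherwise.
# \Q...\E escapes regex metacharacters, so $motif is matched literally.
sub contains_motif {
    my ($sequence, $motif) = @_;
    return $sequence =~ /\Q$motif\E/ ? 1 : 0;
}

my $c_box    = 'tgatga';
my $sequence = 'ccatgatgattgc';   # made-up test sequence

if (contains_motif($sequence, $c_box)) {
    print "C-box inside\n";
}
```

Without \Q...\E a variable such as '.' would act as the "any character" metacharacter instead of a literal dot.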

5 Quiz on Regular Expressions
^ $ \s \w \d . + * ? {3,5}

6 Regular expression modifiers
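/pattern/g;
g - find all occurrences
i - ignore case
s - let . match a newline (\n) inside the string
m - let ^ and $ match at the start and end of each line
x - allow whitespace and comments inside the pattern
e - evaluate the replacement part of s/// as Perl code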
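A short sketch of the g, i, s, m and x modifiers in action. The sub names (count_motif, motif_starts) and the test strings are made up for illustration; only standard Perl behaviour is assumed.

```perl
use strict;
use warnings;

# g in list context returns every match; i ignores case.
sub count_motif {
    my ($seq, $motif) = @_;
    my @hits = $seq =~ /\Q$motif\E/gi;
    return scalar @hits;
}

# g in scalar context steps through the matches one at a time;
# pos() gives the offset where the last match ended.
sub motif_starts {
    my ($seq, $motif) = @_;
    my @starts;
    while ($seq =~ /\Q$motif\E/gi) {
        push @starts, pos($seq) - length($motif);   # 0-based start
    }
    return @starts;
}

my $seq = 'acGTacgtACGT';          # made-up sequence
print count_motif($seq, 'cgt'), "\n";
print join(',', motif_starts($seq, 'cgt')), "\n";

my $two_lines = "acgt\nttaa";      # made-up two-line string
print "s: . matched the newline\n"      if $two_lines =~ /acgt.ttaa/s;
print "m: ^ matched after the newline\n" if $two_lines =~ /^ttaa$/m;
print "x: spaces in the pattern were ignored\n" if $seq =~ / ac \s* gt /xi;
```

Matches found with g do not overlap: after each match the search resumes where the previous match ended.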

7 reg_expr1.pl (in LECTURE5)
#!/usr/local/perl
$seq = 'agtctcaggatacaaagtccctacatccggat';
# skip the first 6 characters, then capture the next 4 bases (prints agga)
if ($seq =~ /^.{6}([agtc]{4})/ ) { print "$1 \n"; }

8 reg_expr2.pl (in LECTURE5)
#!/usr/local/perl
# greedy example: .+ takes as much as it can, so the match runs to the last a
$seq = 'agtctcaggatacaaagtccctacatccggat';
if ($seq =~ /^(.+a)/ ) { print "$1 \n"; }
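To see what "greedy" means, the same pattern can be compared with its non-greedy form. This is a small sketch using the sequence from the slide; the sub names are made up, and the ? after + (non-greedy quantifier) is not on the slide.

```perl
use strict;
use warnings;

# Greedy: .+ grabs as many characters as possible before the final 'a'.
sub greedy_prefix {
    my ($seq) = @_;
    my ($hit) = $seq =~ /^(.+a)/;
    return $hit;
}

# Non-greedy: .+? grabs as few characters as possible (but at least one).
sub lazy_prefix {
    my ($seq) = @_;
    my ($hit) = $seq =~ /^(.+?a)/;
    return $hit;
}

my $seq = 'agtctcaggatacaaagtccctacatccggat';
print greedy_prefix($seq), "\n";   # agtctcaggatacaaagtccctacatccgga
print lazy_prefix($seq), "\n";     # agtctca
```

The greedy version stops at the last a in the string; the non-greedy version stops at the first a that follows at least one character.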

9 The match variables for regular expressions (page 121, Learning Perl, 3rd edition)
$& - the part of the string that matched the pattern
$` - the part of the string before the match
$' - the part of the string after the match
$+ - the text matched by the last pair of parentheses in the pattern
Example: preparation of an intron database from the Exon-Intron Database (EID): prog_introns_exons_IBC2 (/home/afedorov/SHEPELEV)
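The four match variables can be inspected on a toy sequence. This is a sketch only: the sub name match_parts and the test sequence are made up, and the pattern simply finds the first lowercase run.

```perl
use strict;
use warnings;

# Return (before, match, after, last-paren) for the first lowercase run,
# or an empty list if there is none. The values are copied immediately,
# because the next successful match would overwrite them.
sub match_parts {
    my ($seq) = @_;
    if ($seq =~ /([a-z]+)/) {
        return ($`, $&, $', $+);
    }
    return ();
}

my ($before, $match, $after, $last_paren) = match_parts('ATGGCCgtaagtcagTTTAA');
print "before:     $before\n";       # ATGGCC
print "match:      $match\n";        # gtaagtcag
print "after:      $after\n";        # TTTAA
print "last paren: $last_paren\n";   # gtaagtcag
```

Here $& and $+ are the same because the parentheses enclose the whole pattern.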

10 Example of EID entry (DNA form)
> 2_AAABGLOB protein_id:AAA ; Anadara trapezia beta globin gene, complete cds.; intron(phase:20,size:760,690,intr_sum:1450); exon(size:125,220,114,ex_sum:459); {splice:gtag,gtag}
ATGAGTACTGTGGCCGAGTTGGCGAATGCTGTTGTCAGCAATGCCGATCAAAAAGATTTGCTGAGACTTAGCTGGGGAGT
ATTATCCGTTGATATGGAAGGTACAGGATTGATGCTTATGGCGAAgtaagaacacttaagaatatatgttttagcaattt
ttatttcaattcatgaaatgacattcttatcatgttatttcaagggttaatgcaaaattgcgtgtcaaatgaaatacaat
gacagaaaggatatttgtttcaaacaaatttagccaatgttcccgtgtttcatcagaattatccagttacaagtttttac
ttatgtttaggaagttagttagtatgttttgatttctttcaaaaattattattattctatgagtgattgtacctggtaaa
tctaagtgaaacggtaactatattcaatatttgtttttaaagatgtttcttcaattataaacggctccttttgatatatt
ttcagTTTGTTTAAAACAAGTTCAGCAGCCAGGACAAAATTCGCTCGTCTTGGAGACGTATCAGCTGGTAAAGATAACAG
CAAGCTGAGAGGTCATTCTATCACCTTGATGTACGCCCTCCAGAACTTCATCGATGCTCTCGACAATGTAGACAGATTAA
AGTGTGTTGTAGAAAAATTTGCTGTAAACCACATCAACAGACAAATATCTGCTGACGAATTTGGGgtaagctctttcaaa
gattatgtcttcactttcctcgtgagagcgcacgaagttaatctgatttgtaattttcaagtttttactatgctttggga
tttgaagaagatcgggatagaaaattgacttggtcgggaacacgacttgaatatagataatcacgcagtattttctgttt
taaaaaggccaaatattctagtgaagaaacttaaaaatcgttttcctctgttaggattaggaaccttttatgcatattgt
cttctgttaacatttctgtccattcaactgtaaatgcaagtaaattattttacagtggggagaacaatccctatcaacca
cccaatcaatcaatcaatatttatttacattacagGAAATAGTTGGCCCCTTAAGGCAAACATTAAAGGCTAGGATGGGA
AGTTATTTCGATGAAGATACTGTTTCTGCATGGGCTTCACTTGTTGCTGTTGTCCAGGCTGCATTATAA

11 > EXON_1 1_NT_077402 protein_id:XP_498727
> EXON_1 1_NT_ protein_id:XP_ ;
ATGGAGGAGTTCAGAGAAGGTGCAACATTTCTGACCCCCTACAAG
> EXON_2 1_NT_ protein_id:XP_ ;
GAAAATGCAGACACAGCACGCCTCTTTGGGACCGCGGTTTATACTTTCGAAGTGCTCGGAGCCCTTCCTCCAGACCGTTCTCCCACACCCCGCTCCAGGGTCTCTCCCGGAGTTACAAGCCTCGCTGTAGGCCCCGGGAACCCAACGCGGTGTCAGAGAAGTGGGGTCCCCTACGAGGGACCAGGAGCTCCGGGCGGGCAGCAGCTGCGGAAGAGCCGCGCGAGGCTTCCCAGAACCCGGCAGGGGCGGGAAGACGCAGGAGTGGGGAGGCGGAACCGGGACCCCGCAGAGCCCGGGTCCCTGCGCCCCACAAGCCTTGGCTTCCCTGCTAGGGCCGGGCAAGGCCGGGTGCAGGGCGCGGCTCCAGGGAGGAAGCTCCGGGGCGAGCCCAAGACGCCTCCCGGGCGGTCGGGGCCCAGCGGCGGCGTTCGCAGTGGAGCCGGGCACCGGGCAGCGGCCGCGGAACACCAGCTTGGCGCAGGCTTCTCGGTCAGGAACG

12 prog_introns_exons_IBC2 Part I: preparing gene sequence
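while (<CDS>) {
    $c++;
    $sign = chop($_);                 # remove the last character of the record
    undef($CDS); undef($EXONS); undef($INTRONS); undef($sequence);
    @lines = split("\n", $_);
    $id = $lines[0];                  # the first line is the header
    if ($c == 1) { $id = substr($lines[0], 1); }   # first record: drop the first character of the header
    for $n (1..$#lines) {             # the remaining lines hold the sequence
        chomp($lines[$n]);
        $lines[$n] =~ s/\s//g;        # strip all whitespace
        $sequence .= $lines[$n];
    }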

13 prog_introns_exons_IBC2 Part II: obtaining exons and introns
$count_ex = 0;
while ($sequence) {
    $count_ex++;
    if ($sequence =~ /(^[A-Z]+)/) {       # leading uppercase run = exon
        $curr_ex = $+;
        $sequence = $';                   # continue with what follows the exon
        $CDS .= $curr_ex;
        $EXONS .= '> EXON_' . $count_ex . $id . "\n" . $curr_ex . "\n\n";
    }
    if ($sequence =~ /(^[a-z\.]+)/) {     # leading lowercase run = intron
        $curr_intr = $+;
        $INTRONS .= '> INTRON_' . $count_ex . $id . "\n" . $curr_intr . "\n\n";
        unless ($curr_ex) {               # an intron with no exon before it
            print 'CHECK YOUR CURRENT SEQUENCE', $id, "\n";
            print OUTPUT $sequence, "\n";
            die;
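The slide shows only part of the loop, so here is a self-contained sketch of the same idea: uppercase runs are taken as exons and lowercase runs as introns, each one peeled off the front of the sequence with $'. This is not the lecture's program; the sub name split_exons_introns and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Split an EID-style sequence (exons uppercase, introns lowercase)
# into two lists, returned as array references.
sub split_exons_introns {
    my ($sequence) = @_;
    my (@exons, @introns);
    while (length $sequence) {
        if ($sequence =~ /^([A-Z]+)/) {        # leading exon
            push @exons, $1;
            $sequence = $';                    # keep the rest of the string
        }
        elsif ($sequence =~ /^([a-z.]+)/) {    # leading intron
            push @introns, $1;
            $sequence = $';
        }
        else {                                 # anything else is an error
            die 'Unexpected character: ' . substr($sequence, 0, 1) . "\n";
        }
    }
    return (\@exons, \@introns);
}

my ($exons, $introns) = split_exons_introns('ATGGCCgtaagtcagTTTAAgtccagGGA');
for my $n (1 .. @$exons) {
    print '> EXON_' . $n . "\n" . $exons->[$n - 1] . "\n";
}
for my $n (1 .. @$introns) {
    print '> INTRON_' . $n . "\n" . $introns->[$n - 1] . "\n";
}
```

Each branch shortens $sequence, so the loop is guaranteed to end; the else branch stops the program on any character that is neither a letter nor a dot.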

14 PRACTICE (homework!)
Write a Perl script that detects and reports all AGG triplets in the miRNA sequences in the file miRNAmature.fa

15 Regular expressions (chapters 7, 8, 9)
^ beginning of the string
$ end of the string
\s whitespace character (space, \t, \n, \r, \f)
\w word character [a-zA-Z_0-9]
\d any digit [0-9]
. any character except newline
+ one or more of the preceding item
* zero or more of the preceding item
? zero or one of the preceding item
{3,5} from 3 to 5 repetitions of the preceding item
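Several of these symbols can be combined to pull numbers out of an EID header like the one on slide 10. This is a sketch under assumptions: the sub names exon_sizes and is_3_to_5_digits are made up, and the header fragment is copied from the example entry.

```perl
use strict;
use warnings;

# Extract the exon sizes from an EID-style header.
# [\d,]+ = one or more digits or commas; \( and \) are literal parentheses.
sub exon_sizes {
    my ($header) = @_;
    return () unless $header =~ /exon\(size:([\d,]+),ex_sum:(\d+)\)/;
    return split /,/, $1;
}

# True if the whole string is a number of 3 to 5 digits.
sub is_3_to_5_digits {
    my ($text) = @_;
    return $text =~ /^\d{3,5}$/ ? 1 : 0;
}

my $header = 'intron(phase:20,size:760,690,intr_sum:1450); '
           . 'exon(size:125,220,114,ex_sum:459); {splice:gtag,gtag}';

my @sizes = exon_sizes($header);
print "exon sizes: @sizes\n";             # exon sizes: 125 220 114
print is_3_to_5_digits('459'), "\n";      # 1
print is_3_to_5_digits('45'), "\n";       # 0
```

The anchors ^ and $ in the second sub make the pattern cover the entire string; without them '123456' would also pass, because it contains a run of 3 to 5 digits.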

