Biology is the science of reverse-engineering life
Living organisms are molecular machines capable of replicating themselves.
The functional unit of life is the “cell”: 1-100 µm in diameter, containing a primary information store, the genome.

The Structure of a Genome
Generally one or several strands of a polymer, called DNA, packaged into “chromosomes”.
Information is encoded as the order of the monomer sub-units (of types “A”, “C”, “G”, “T”) in the linear polymer.
Each cell carries the entire genome of the organism.

The Nature of DNA
Linear, water-soluble, molecular data storage.
The polymer is actually “double-stranded”: each strand is the “reverse-complement” of the other.
The double strand is 2 nm in diameter.
Each monomer unit (“base”) added lengthens the strand by 0.34 nm.

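The “reverse-complement” relationship can be illustrated with a few lines of Perl. This is a minimal sketch, not code from the talk: the helper name `reverse_complement` is hypothetical, and it assumes the usual base pairing (A with T, C with G) and leaves any other character unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: return the reverse complement of a DNA string.
# Assumes standard pairing A<->T and C<->G; other characters pass through.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;      # read the strand in the opposite direction
    $rc =~ tr/ACGTacgt/TGCAtgca/;      # swap each base for its partner
    return $rc;
}

print reverse_complement("AACG"), "\n";    # prints CGTT
```

Applying the function twice returns the original strand, which is one quick way to sanity-check it.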
DNA Storage Density
Genome length of an average bacterium is 2 megabases (Mb).
Human genome: 3 gigabases (Gb).
A typical DNA “prep” solution contains about 25 petabytes/ml. (A ml is about 20 drops of liquid.)

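To put the genome sizes in perspective, the sketch below converts a base count into a stretched-out physical length (using the 0.34 nm per base figure given earlier) and into a raw information content. The 2 bits per base figure is an assumption of this sketch: it follows from having four base types and ignores any compression or redundancy. The function names are hypothetical.

```perl
use strict;
use warnings;

use constant NM_PER_BASE   => 0.34;   # strand length added per base, in nm
use constant BITS_PER_BASE => 2;      # assumption: 4 base types = 2 bits each

# Length in metres of a linear strand with the given number of bases.
sub genome_length_m {
    my ($bases) = @_;
    return $bases * NM_PER_BASE * 1e-9;
}

# Raw information content in bytes at 2 bits per base.
sub genome_bytes {
    my ($bases) = @_;
    return $bases * BITS_PER_BASE / 8;
}

for my $genome ( [ 'bacterium', 2e6 ], [ 'human', 3e9 ] ) {
    my ( $name, $bases ) = @$genome;
    printf "%-10s %.5f m  %d bytes\n",
        $name, genome_length_m($bases), genome_bytes($bases);
}
```

Under these assumptions a 2 Mb bacterial genome is about 0.68 mm long and holds roughly 0.5 megabytes, while the 3 Gb human genome is about 1 m long and holds roughly 750 megabytes.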
Perl and Genomics
Good: Perl is quick to write; excellent for parsing; DWIM (“do what I mean”) is good for the typical biologist.
Bad: not as fast running.
Result: frequently middleware.

My Perl scripts
167 in my /bin.
Most are for either dealing with system stuff or parsing output from other programs.
A few are meant to directly analyze “sequence data”.

Example Sequence Analysis Program
ssr3.pl
“ssr” is “simple sequence repeat”, aka “microsatellite”.
E.g.:
>Echinomicrosat_01_B04_T7
XXXXXCAGAAGCGCTTCACAATTAAAAGCAAATCATACAAATATGATCAT
CAGGCAGGCTATTTGAACACACTGTTTCGCACTGAACTCATAGTCACATT
TCAGTCGTTCAGTGAGATGATTCATATGGCATAATTTGAACTGACGTTCG
CTCTGACTATCGTTCAGCTCGTTGTGGGCACAATCGTTAGTCAGTTCGTT
CACTCAACCACACACACACACACACACACGGAAACATCAGATTCGAGCTA
AGCTCTTATTACAGCTGATCAGTAGGAGCACTGTTAGACAGTCTACTAAA
TCAATATCAATTATCCCCCCCACACAACCATGGCTTCTGXXXXX

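The example above is in FASTA format: a header line starting with ">" followed by lines of sequence. A script like ssr3.pl needs each record as one contiguous string before it can search it. The reader below is a minimal sketch of one way to do that, not the code from ssr3.pl; the function name `read_fasta` is hypothetical, and it assumes the first word after ">" is the sequence name.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle into a hash
# of name => sequence, joining the wrapped sequence lines of each record.
sub read_fasta {
    my ($fh) = @_;
    my ( %sequences, $name );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>(\S+)/ ) {       # header line: start a new record
            $name = $1;
            $sequences{$name} = '';
        }
        elsif ( defined $name ) {         # sequence line: append to current record
            $line =~ s/\s+//g;            # drop stray whitespace
            $sequences{$name} .= $line;
        }
    }
    return %sequences;
}

# Demo on an in-memory file so the sketch runs without any input files.
my $fasta = ">demo_seq example record\nACGTAC\nGTTT\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my %sequences = read_fasta($fh);
close $fh;

for my $id ( sort keys %sequences ) {
    printf "%s\t%d\n", $id, length( $sequences{$id} );   # prints demo_seq, tab, 10
}
```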
Example run of ssr3.pl
>Echinomicrosat_01_B04_T7
% ssr3.pl Echinomicrosat_01_B04_T7.fasta
Name	Seq Len	Range	# of repetitions of sub unit	Sub unit
Echinomicrosat_01_B04_T7	344	209-228	10 of repeat "CA"
-----------------------------
>Echinomicrosat_01_B04_T7
XXXXXCAGAAGCGCTTCACAATTAAAAGCAAATCATACAAATATGATCAT
CAGGCAGGCTATTTGAACACACTGTTTCGCACTGAACTCATAGTCACATT
TCAGTCGTTCAGTGAGATGATTCATATGGCATAATTTGAACTGACGTTCG
CTCTGACTATCGTTCAGCTCGTTGTGGGCACAATCGTTAGTCAGTTCGTT
CACTCAACCACACACACACACACACACACGGAAACATCAGATTCGAGCTA
AGCTCTTATTACAGCTGATCAGTAGGAGCACTGTTAGACAGTCTACTAAA
TCAATATCAATTATCCCCCCCACACAACCATGGCTTCTGXXXXX

ssr3.pl core routine:
while ( $sequences{$x} =~ m/
        # Capture each ssr sub-unit within tolerance.
        # Note "?" for lazy capture. Ensures "AC" is
        # the repeat unit instead of "ACAC", for example.
        ([ACGT]{$min_repeat_unit_len,$max_repeat_unit_len}?)
        \1{$min_repeat_num,}
    /gix )
{
    my $repeat_unit  = $1;
    my $start_of_ssr = $-[0] + 1;    # 1-based start position of the match
    my $end_of_ssr   = $+[0];        # 1-based end position of the match
    my $ssr          = $&;           # the whole matched repeat
    my $ssr_length   = length($ssr) / length($repeat_unit);    # number of repeat units
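The routine above is shown only in part, so here is a self-contained sketch built around the same regular expression. It is an illustration under stated assumptions rather than the actual ssr3.pl: the wrapper name `find_ssrs` and the default values (unit length 2 to 6, at least 4 further copies of the unit) are hypothetical, and the match is extracted with `substr` and the `@-`/`@+` arrays instead of `$&`. The `//` operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the slide's regex. Parameter names mirror
# the slide; the default values are assumptions made for this sketch.
sub find_ssrs {
    my ( $seq, %opt ) = @_;
    my $min_repeat_unit_len = $opt{min_repeat_unit_len} // 2;
    my $max_repeat_unit_len = $opt{max_repeat_unit_len} // 6;
    my $min_repeat_num      = $opt{min_repeat_num}      // 4;

    my @hits;
    while ( $seq =~ m/
            ([ACGT]{$min_repeat_unit_len,$max_repeat_unit_len}?)  # lazy: shortest unit first
            \1{$min_repeat_num,}                                  # further copies of that unit
        /gix )
    {
        my $ssr = substr( $seq, $-[0], $+[0] - $-[0] );   # whole matched repeat
        push @hits, {
            unit    => $1,
            start   => $-[0] + 1,                         # 1-based start
            end     => $+[0],                             # 1-based end
            repeats => length($ssr) / length($1),         # number of repeat units
        };
    }
    return @hits;
}

for my $hit ( find_ssrs("TTGACACACACACACGT") ) {
    printf "%d-%d %d of repeat \"%s\"\n", @{$hit}{qw(start end repeats unit)};
}
# prints: 4-15 6 of repeat "AC"
```

Because `\1{$min_repeat_num,}` counts the copies that follow the captured unit, a hit always contains at least `$min_repeat_num + 1` units in total. Because the capture is lazy, the shortest unit that satisfies the pattern is the one reported.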