Published by Willa Bell. Modified over 9 years ago.
1
Computational Theory
MAT542 (Computational Methods in Genomics)
Part 2 & 3
Benjamin King
Mount Desert Island Biological Laboratory
bking@mdibl.org
2
Overview of 4 Lectures
Mon. Sept. 8: Introduction to Computation and Programming
Wed. Sept. 10 & Mon. Sept. 15: Programming (Text File Processing)
Wed. Sept. 17: Genome Sequencing and Informatics
Homework due on Wed., Oct. 1st by 2pm
3
Example Scripts and Input File
go_bears.pl
array_example.pl
associative_array_example.pl
if_loop_example.pl
regular_expression_example.pl
for_and_while_loop_examples.pl
subsitution_and_translation_examples.pl
read_file.pl
ben_input_file.txt
reading_and_writing_files_example.pl
4
Perl
http://www.cpan.org – Comprehensive Perl Archive Network
http://www.activestate.com – Active Perl (for PC, Mac OS X, Linux)
5
Homework
Due by 2pm on Wednesday, Oct. 1st
Email scripts as text file attachments, along with input data files, to bking@mdibl.org
6
Homework Assignment
Write a Perl script for each of the following:

1. (10 points) Using an iterative loop and a formula, print out the following two-column array:

    0   1
    1   2
    2   4
    3   8
    4   16
    5   32
    6   64

2. (20 points) Print out the transcribed RNA sequence for a DNA sequence in FASTA format. The script shall read in a text file containing the input DNA sequence from a FASTA formatted sequence file. Use the GenBank record, M15131, as the input sequence.

3. (30 points) Read in a tab-delimited text file downloaded using Ensembl’s BioMart that contains a listing of all transcription factors in the mouse genome, store the genome coordinates in associative arrays (using gene symbol as the key), and write an output file that contains the coordinates for all members of the HOX gene family. The list of all transcription factors can be retrieved by filtering by genes with proteins that have been annotated with the Gene Ontology molecular function term “sequence-specific DNA binding transcription factor activity” (GO:0003700).

4. (40 points) Calculate the percent GC content for each of the 36 positions in a subset of 100,000 RNAseq reads that you can download here as a FASTQ-formatted text file: https://gillnet.mdibl.org/~bking/MAT500/mini.fastq.gz
7
Programming Concepts

Variables (used to store values):
    $a = "Go Bears";    # character string
    $b = 25;            # integer
    $c = 3.1415;        # real number
    $d = 0;             # Boolean value (True or False)

Data Structures
    Store "collections" of data in an organized fashion

Common Operations
    Mathematical operations
    Testing for specific values (if / then statements)
    Iteration (for, while loops)
    Translation operations
    Printing messages
    Reading in files
    Writing output
8
To Run Perl Using Interactive Console
1. Type (in Command Prompt or Terminal window):
    perl
2. Type statements:
    $a = 1;
    $b = 2;
    $c = $a + $b;
    print $c;
3. Enter CTRL-D to execute the commands (in the Windows Command Prompt, CTRL-Z followed by Enter)
9
go_bears.pl

    #!/usr/bin/perl
    # Header
    # Example script

    # Variable declarations
    $a = "Go ";
    $b = "Black ";
    $c = "Bears";

    # Main
    print $a,$b,$c,"\n";

Running the script:
    perl go_bears.pl
    Go Black Bears
10
Variables
Scalar types: character string, integer, real number, Boolean value (True or False)

    $a = "TAATAA";
    print $a;

    $n = 25;
    $m = 100;
    $sum = $n + $m;
11
Data Structures
Store "collections" of data in an organized fashion
Arrays – ordered list of items (usually of the same type: character, integer, etc.), indexed by position starting at 0

    @sequences = ("TAATAA", "TCATAA", "GAATAA");
    print $sequences[0];
    print $sequences[1];
    print $sequences[2];

    @numbers = (18,25,78);
    print $numbers[0];
    print $numbers[1];
    print $numbers[2];
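The array indexing above extends naturally to loops. The following is a minimal sketch, not one of the course scripts: the helper name sequence_lengths and the fourth sequence are made up for illustration. It shows push, scalar(@array), negative indexing, and iterating over every element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the length of each sequence in a list, in the same order.
# (sequence_lengths is a hypothetical helper, not part of the lecture.)
sub sequence_lengths {
    my @seqs = @_;
    my @lengths;
    foreach my $seq (@seqs) {
        push @lengths, length($seq);
    }
    return @lengths;
}

my @sequences = ("TAATAA", "TCATAA", "GAATAA");
push @sequences, "TTATAAGG";    # append a fourth, made-up element

print "Number of sequences: ", scalar(@sequences), "\n";
print "Last sequence: ", $sequences[-1], "\n";

# Print index, sequence, and length as tab-separated columns.
my @lengths = sequence_lengths(@sequences);
for (my $i = 0; $i < @sequences; $i++) {
    print $i, "\t", $sequences[$i], "\t", $lengths[$i], "\n";
}
```

In a numeric comparison such as $i < @sequences, the array evaluates to its element count, which is why the loop stops after the last element.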
12
Data Structures
Associative Arrays – collection of items (character, integer, etc.) indexed by a key (a particular character string, integer, etc.) rather than by position

    %genbankids = ();
    $genbankids{"Il1b"} = "M15131";
    $genbankids{"Hoxc8"} = "AF198989";
    print $genbankids{"Il1b"};

Output:
    M15131

Also called "hash tables"
Called dictionaries in Python
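A short sketch of two common things to do with an associative array: list every key/value pair, and check whether a key is present before using it. The helper name lookup_id is hypothetical; the two gene symbols and IDs are the ones from the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %genbankids = (
    "Il1b"  => "M15131",
    "Hoxc8" => "AF198989",
);

# Return the ID stored for a gene symbol, or "not found" if the key is absent.
# (lookup_id is a hypothetical helper; it takes a reference to the hash.)
sub lookup_id {
    my ($table, $symbol) = @_;
    return exists $table->{$symbol} ? $table->{$symbol} : "not found";
}

# Hash keys have no fixed order, so sort them for repeatable output.
foreach my $symbol (sort keys %genbankids) {
    print $symbol, "\t", $genbankids{$symbol}, "\n";
}

print lookup_id(\%genbankids, "Il1b"), "\n";   # M15131
print lookup_id(\%genbankids, "Il10"), "\n";   # not found
```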
14
Sequence Alignment Program, BLAT
Steps for cDNA alignment:
1 – Break cDNA into non-overlapping n-base chunks (k-mers)
2 – Use index to find regions in genome similar to each k-mer
3 – Find exons by looking for k-mers that align to same genome region and cDNA
4 – Stitch together exons
15
Sequence Alignment Program, BLAT (example from Jim Kent)

genome: cacaattatcacgaccgc   (K = 8-13 for a real genome)
K-mers:             cac  aat  tat  cac  gac  cgc
genome position:    0    3    6    9    12   15

cDNA: aattctcac
3-mers:             aat  att  ttc  tct  ctc  tca  cac
cDNA position:      0    1    2    3    4    5    6

hits (k-mer, cDNA position, genome position, cDNA position minus genome position):
    aat   0,3   -3
    cac   6,0    6
    cac   6,9   -3

clump: cacAATtatCACgaccgc
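The hit-finding step of the worked example can be sketched in a few lines of Perl. This is an illustration of the idea only, not BLAT's actual implementation: the function names index_genome and find_hits are made up, and the sketch follows the example as drawn, cutting the genome into non-overlapping 3-mers and looking up every 3-mer of the cDNA.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an index of the non-overlapping k-mers in the genome:
# k-mer => list of genome start positions.
sub index_genome {
    my ($genome, $k) = @_;
    my %index;
    for (my $pos = 0; $pos + $k <= length($genome); $pos += $k) {
        push @{ $index{ substr($genome, $pos, $k) } }, $pos;
    }
    return \%index;
}

# Look up every (overlapping) k-mer of the cDNA in the index.
# Each hit is [k-mer, cDNA position, genome position, difference].
sub find_hits {
    my ($cdna, $index, $k) = @_;
    my @hits;
    for (my $pos = 0; $pos + $k <= length($cdna); $pos++) {
        my $kmer = substr($cdna, $pos, $k);
        next unless exists $index->{$kmer};
        foreach my $gpos (@{ $index->{$kmer} }) {
            push @hits, [$kmer, $pos, $gpos, $pos - $gpos];
        }
    }
    return @hits;
}

my $index = index_genome("cacaattatcacgaccgc", 3);
foreach my $hit (find_hits("aattctcac", $index, 3)) {
    print join("\t", @$hit), "\n";
}
```

Run on the sequences from the slide, this prints the same three hits: aat (0, 3, -3), cac (6, 0, 6), and cac (6, 9, -3). The two hits that share the difference -3 lie on the same diagonal, which is what groups them into the clump.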
16
Common Operations
Mathematical operations
Testing for specific values (if / then statements)
    - Regular expressions
Iteration (for, while loops)
Translation operations
Printing messages
Reading in files
Writing output
17
Mathematical Operations

Syntax      Description
5 + 10      Addition
10 - 5      Subtraction
10 * 5      Multiplication
5 / 10      Division
10**2       Exponent
exp(2)      Exponential function
log(256)    Natural log

Built-in functions: abs, atan2, cos, exp, hex, int, log, oct, rand, sin, sqrt, srand
Examples: abs(-1), sqrt(256)
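A few of these operations in use, as a minimal sketch. The helper log_base is made up for illustration; it relies only on the fact that Perl's log is the natural log, so a logarithm in any other base is a ratio of two natural logs. The modulus operator % is not in the table above but is a standard Perl operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Logarithm of $x in an arbitrary base, built from Perl's natural log.
# (log_base is a hypothetical helper, not a Perl built-in.)
sub log_base {
    my ($x, $base) = @_;
    return log($x) / log($base);
}

print 5 / 10, "\n";                  # division is floating point: 0.5
print int(7 / 2), "\n";              # int() truncates toward zero: 3
print 7 % 2, "\n";                   # modulus (remainder): 1
print 10**2, "\n";                   # exponent: 100
print sqrt(256), "\n";               # 16
print abs(-1), "\n";                 # 1
printf "%.3f\n", log_base(256, 2);   # log base 2 of 256: 8.000
```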
18
if / then statements (testing for values)

    $a = 1;
    if ($a == 1) {
        print "Value is 1";
        print "hello";
    }
    else {
        print "Value is not 1";
    }

Comparison operators:
    numeric: == != > < >= <=
    string:  eq ne

Example tests:
    $a != -1
    $a > 0
    $a >= 0
    $b eq "okay"
    $b ne "okay"
19
if / then statements (testing for values)

    $a = 1;
    if ($a == 1) {
        print "Value is 1\n";
    }
    elsif ($a == 2) {
        print "Value is 2\n";
    }
    else {
        print "Value is not 1 or 2\n";
    }
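Choosing between the numeric operators (==, !=) and the string operators (eq, ne) matters in practice. The sketch below uses a made-up helper, classify_base, to show an if / elsif / else chain on text values, followed by a case where the two kinds of comparison disagree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify a nucleotide using string comparison (eq),
# because bases are text, not numbers.
# (classify_base is a hypothetical helper for illustration.)
sub classify_base {
    my ($base) = @_;
    if ($base eq "A" || $base eq "G") {
        return "purine";
    }
    elsif ($base eq "C" || $base eq "T") {
        return "pyrimidine";
    }
    else {
        return "unknown";
    }
}

foreach my $base ("A", "C", "N") {
    print $base, " is ", classify_base($base), "\n";
}

# "10" and "10.0" are the same number but different strings.
print "numerically equal\n" if "10" == "10.0";
print "different as strings\n" if "10" ne "10.0";
```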
20
Regular Expressions
Used to match a pattern of characters
Often applied in if/then statements

    $a = "Today is Sept 11, 2013";
    if ($a =~ /, \d+/) {
        print "Found year\n";
    }
    if ($a =~ /, (\d+)/) {
        print "Year=",$1;
    }

Output:
    Found year
    Year=2013
21
Regular Expressions

^       Match at beginning of string
$       Match at end of string
.       Match any character (except newline)
\w      Match "word" character (alphanumeric plus "_")
\W      Match non-word character
\s      Match whitespace character
\S      Match non-whitespace character
\d      Match digit character
\D      Match non-digit character
\t      Match tab
\n      Match newline
*       Match 0 or more times
+       Match 1 or more times
?       Match 1 or 0 times
{n}     Match exactly n times
{n,}    Match at least n times
{n,m}   Match at least n but not more than m times
[ ]     Match any one character from a set or range (e.g., [ATGC], [0-9], [a-zA-Z])
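Several of these metacharacters combined, as a sketch. The helper names parse_header and is_dna are made up; the header string is one of the NCBI-style FASTA headers used later in these slides, and the pattern assumes that exact layout (gi number, database tag, accession, description, species in square brackets).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the GI number, accession, description, and species out of a FASTA
# header of the form ">gi|NUMBER|DB|ACCESSION| DESCRIPTION [SPECIES]".
# Returns an empty list if the header does not have that layout.
sub parse_header {
    my ($header) = @_;
    if ($header =~ /^>gi\|(\d+)\|\w+\|([\w.]+)\|\s+(.+)\s+\[(.+)\]$/) {
        return ($1, $2, $3, $4);
    }
    return ();
}

# Return 1 if the string consists only of the characters A, T, G, C.
sub is_dna {
    my ($seq) = @_;
    return $seq =~ /^[ATGC]+$/ ? 1 : 0;
}

my ($gi, $acc, $desc, $species) =
    parse_header(">gi|47115295|emb|CAG28607.1| IL1B [Homo sapiens]");
print "gi=$gi accession=$acc description=$desc species=$species\n";

print is_dna("TAATAA"), "\n";   # 1
print is_dna("TAAUAA"), "\n";   # 0 (U is not in the set)
```

The | characters are escaped as \| because an unescaped | means "or" in a pattern. Inside square brackets no separator is needed: [ATGC] already means any one of those four letters.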
22
Iteration (for, while loops)

    for ($i = 0; $i <= 5; $i++) {
        print "i=",$i, " i**2=",$i**2, "\n";
    }

Output:
    i=0 i**2=0
    i=1 i**2=1
    i=2 i**2=4
    i=3 i**2=9
    i=4 i**2=16
    i=5 i**2=25
23
Iteration (for, while loops)

    $i = 0;
    while ($i <= 5) {
        print "i=",$i, " i**2=",$i**2, "\n";
        $i = $i + 1;
        #$i++;
    }

Output:
    i=0 i**2=0
    i=1 i**2=1
    i=2 i**2=4
    i=3 i**2=9
    i=4 i**2=16
    i=5 i**2=25
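Loops do not have to step by 1 or run to completion. The sketch below, with made-up helper names codons and first_stop and a made-up input sequence, uses a for loop that advances three bases at a time and a while loop that exits early with last.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a sequence into consecutive 3-base codons; leftover bases are dropped.
sub codons {
    my ($seq) = @_;
    my @codons;
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        push @codons, substr($seq, $i, 3);
    }
    return @codons;
}

# Index of the first stop codon in a list of codons, or -1 if there is none.
sub first_stop {
    my @codons = @_;
    my $i = 0;
    while ($i < @codons) {
        # "last" leaves the loop as soon as a stop codon is found
        last if $codons[$i] eq "TAA" || $codons[$i] eq "TAG" || $codons[$i] eq "TGA";
        $i++;
    }
    return $i < @codons ? $i : -1;
}

my @c = codons("ATGGCCTAAGG");
print join(" ", @c), "\n";     # ATG GCC TAA
print first_stop(@c), "\n";    # 2
```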
24
Substitution and Translation Operations

    $sentence = "I flew to london yesterday";
    $sentence =~ s/london/London/;
    #$sentence =~ s/london/London/g;    # with g, replaces every occurrence
    print $sentence,"\n";

    $sentence = "abcdefghijklmnopqrstuvwxyz";
    print $sentence,"\n";
    $sentence =~ tr/abc/edf/;
    print $sentence,"\n";
    $sentence =~ tr/a-z/A-Z/;
    print $sentence,"\n";

Output:
    I flew to London yesterday
    abcdefghijklmnopqrstuvwxyz
    edfdefghijklmnopqrstuvwxyz
    EDFDEFGHIJKLMNOPQRSTUVWXYZ
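Two sequence-oriented uses of tr, as a sketch with made-up helper names (reverse_complement, count_gc) and a made-up input string. The first swaps each base for its complement; the second uses the fact that tr returns the number of characters it matched, so with an empty replacement list it counts without changing the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse complement of a DNA string: reverse it, then swap A<->T and G<->C.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Count G and C characters; tr with an empty replacement leaves $dna unchanged.
sub count_gc {
    my ($dna) = @_;
    my $count = ($dna =~ tr/GCgc//);
    return $count;
}

print reverse_complement("TAATAAGC"), "\n";   # GCTTATTA
print count_gc("TAATAAGC"), "\n";             # 2
```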
25
Printing Messages

    $a = 24.56;
    print "Value of a=",$a, "\n";
    print "Value of a=$a\n";

\n = new line character
\t = tab character
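When output needs fixed decimal places or aligned columns, printf and sprintf are the usual tools. A minimal sketch follows; format_row is a made-up helper and the numbers passed to it are made-up values, not real measurements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one table row: left-justified symbol (8 wide), integer (6 wide),
# and a percentage with two decimal places (8 wide) followed by "%".
# (format_row is a hypothetical helper for illustration.)
sub format_row {
    my ($symbol, $length, $gc) = @_;
    return sprintf("%-8s%6d%8.2f%%", $symbol, $length, $gc);
}

my $value = 24.56;
print "Value of a=$value\n";            # double quotes interpolate $value and \n
print 'Value of a=$value\n', "\n";      # single quotes print the text literally
printf "Value of a=%.1f\n", $value;     # rounded to one decimal place: 24.6

print format_row("Il1b", 269, 47.5), "\n";
```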
26
Reading Input Files

read_file.pl

    #!/usr/bin/perl
    # Header
    # Example script that reads in an input file
    # and prints it out

    # File handling
    open(INPUT,"<ben_input_file.txt") or die "Cannot open ben_input_file.txt: $!";

    # Main
    while (<INPUT>) {
        $line = $_;
        chomp($line);
        print $line,"\n";
        if ($_ =~ /Il12/) {
            print "Found Il12\n";
        }
    }
    close(INPUT);

ben_input_file.txt (tab-delimited)

    1    3500050     Il1b
    2    56049490    Il12
    2    78079000    Il10
27
Reading Input Files

    #!/usr/bin/perl

    # File handling
    open(INPUT,"<ben_input_file.txt") or die "Cannot open ben_input_file.txt: $!";
    open(OUTPUT,">ben_output_file.txt") or die "Cannot open ben_output_file.txt: $!";

    # Main
    while (<INPUT>) {
        $line = $_;
        chomp($line);
        @fields = split("\t",$line);    # splits current line by tab characters
        $chr = $fields[0];
        $start = $fields[1];
        $symbol = $fields[2];
        print "chr=",$chr," symbol=",$symbol,"\n";
    }
    close(INPUT);
    close(OUTPUT);

ben_input_file.txt (tab-delimited)

    1    3500050     Il1b
    2    56049490    Il12
    2    78079000    Il10
28
Writing Output

ben_input_file.txt (tab-delimited)

    1    3500050     Il1b
    2    56049490    Il12
    2    78079000    Il10

    #!/usr/bin/perl

    # File handling
    open(INPUT,"<ben_input_file.txt") or die "Cannot open ben_input_file.txt: $!";
    open(OUTPUT,">ben_output_file.txt") or die "Cannot open ben_output_file.txt: $!";

    # Main
    while (<INPUT>) {
        $line = $_;
        chomp($line);
        @fields = split("\t",$line);    # splits current line by tab characters
        $chr = $fields[0];
        $start = $fields[1];
        $symbol = $fields[2];
        print OUTPUT "chr=",$chr," symbol=",$symbol,"\n";
    }
    close(INPUT);
    close(OUTPUT);
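The same read, split, and write pattern can also be written with lexical filehandles and the three-argument form of open, which many Perl programmers prefer. This is a sketch under stated assumptions: reformat_coordinates is a made-up helper, the gene names and coordinates are made-up data, and the input and output are in-memory strings so the example runs without any files on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read tab-delimited lines (chromosome, start, symbol) from one filehandle
# and write "symbol<TAB>chr:start" lines to another.
# Returns the number of records written.
sub reformat_coordinates {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        chomp($line);
        next if $line eq "";    # skip blank lines
        my ($chr, $start, $symbol) = split(/\t/, $line);
        print {$out} $symbol, "\t", $chr, ":", $start, "\n";
        $count++;
    }
    return $count;
}

# Made-up example data. Passing a reference to a string (\$input) makes
# open() treat the string as if it were a file.
my $input  = "2\t1000\tGeneA\n2\t2500\tGeneB\n";
my $output = "";
open(my $in,  "<", \$input)  or die "Cannot open input: $!";
open(my $out, ">", \$output) or die "Cannot open output: $!";
my $n = reformat_coordinates($in, $out);
close($in);
close($out);

print "Wrote $n records:\n", $output;
```

To work on real files, replace \$input and \$output with file names such as "ben_input_file.txt" and "ben_output_file.txt". The "or die" clause stops the script with the system error message if a file cannot be opened.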
29
Using Modules
30
BioPerl
31
Install BioPerl using Active Perl’s “Perl Package Manager”
33
Using Modules
getlengths.pl ben_sequences.fa fasta

ben_sequences.fa:

>gi|47115295|emb|CAG28607.1| IL1B [Homo sapiens]
MAEVPKLASEMMAYYSGNEDDLFFEADGPKQMKCSFQDLDLCPLDGGIQLRISDHHYSKGFRQAASVVVA
MDKLRKMLVPCPQTFQENDLSTFFPFIFEEEPIFFDTWDNEAYVHDAPVRSLNCTLRDSQQKSLVMSGPY
ELKALHLQGQDMEQQVVFSMSFVQGEESNDKIPVALGLKEKNLYLSCVLKDDKPTLQLESVDPKNYPKKK
MEKRFVFNKIEINNKLEFESAQFPNWYISTSQAENMPVFLGGTKGGQDITDFTMQFVSS
>gi|68534031|gb|AAH98597.1| Il1b protein [Danio rerio]
MACGQYEVTIAPKNLWETDSAVYSDSDEMDCSDPLAMSYRCDMHEGIRLEMWTSQHKMKQLVNVIIALNR
MKHIKPQSTEFGEKEVLDMLMANVIQEREVNVVDSVPSYTKTKNVLQCTICDQYKKSLVRSGGSPHLQAV
TLRAGSSDLKVRFSMSTYASPSAPATSAQPVCLGISKSNLYLACSPAEGSAPHLVLKEISGSLETIKAGD
PNGYDQLLFFRKETGSSINTFESVKCPGWFISTAYEDSQMVEMDRKDTERIINFELQDKVRI
>gi|13928692|ref|NP_113700.1| interleukin 1, beta [Rattus norvegicus]
MATVPELNCEIAAFDSEENDLFFEADRPQKIKDCFQALDLGCPDESIQLQISQQHLDKSFRKAVSLIVAV
EKLWQLPMSCPWSFQDEDPSTFFSFIFEEEPVLCDSWDDDDLLVCDVPIRQLHCRLRDEQQKCLVLSDPC
ELKALHLNGQNISQQVVFSMSFVQGETSNDKIPVALGLKGLNLYLSCVMKDGTPTLQLESVDPKQYPKKK
MEKRFVFNKIEVKTKVEFESAQFPNWYISTSQAEHRPVFLGNSNGRDIVDFTMEPVSS
34
Using Modules

getlengths.pl

    # first, bring in the SeqIO module
    use Bio::SeqIO;

    # usage statement if one or both arguments are missing.
    my $usage = "getlengths.pl file format\n";
    my $file = shift or die $usage;
    my $format = shift or die $usage;

    # create a SeqIO object that will bring in the contents of the input file
    my $inseq = Bio::SeqIO->new(-file => "<$file", -format => $format);

    while (my $seq = $inseq->next_seq) {
        print $seq->length,"\n";
    }
    exit;

Usage:
    getlengths.pl ben_sequences.fa fasta
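For comparison, here is a rough sketch of what the module is doing on your behalf, written in plain Perl with no BioPerl installed. It is not BioPerl's implementation: fasta_lengths is a made-up helper, it handles only simple FASTA text, and the two short records are made-up data held in a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text from a filehandle.
# Returns a list of [header, sequence length] pairs, one per record.
sub fasta_lengths {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp($line);
        if ($line =~ /^>(.*)/) {
            push @records, [$1, 0];     # each ">" line starts a new record
        }
        elsif (@records) {
            $line =~ s/\s+//g;          # ignore stray whitespace
            $records[-1][1] += length($line);
        }
    }
    return @records;
}

# Made-up two-record FASTA text, read through an in-memory filehandle.
my $fasta = ">seq1 made-up example\nMAEVPKLASE\nMMAYY\n>seq2\nMACGQ\n";
open(my $fh, "<", \$fasta) or die "Cannot open in-memory FASTA: $!";
foreach my $rec (fasta_lengths($fh)) {
    print $rec->[0], "\t", $rec->[1], "\n";
}
close($fh);
```

The hand-written version needs its own logic for headers and multi-line sequences, and it understands one format only. With the module, the format is an argument, so the same few lines can read other sequence formats the module supports.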
37
Retrieve a Sequence in FASTA Format
41
Gene Ontology
Uses terms to describe gene products:
    Biological Process
    Molecular Function
    Cellular Component
A given term may have multiple parent nodes (DAG = directed acyclic graph)
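One way to picture a term with multiple parents is as an associative array that maps each term to a list of its parents. The sketch below is illustrative only: the term names (term_A to term_D) are placeholders rather than real GO terms, and ancestors is a made-up helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mini-ontology: each term maps to a list of its parent terms.
# term_D has two parents, which makes this a DAG rather than a tree.
my %parents = (
    "term_A" => [],
    "term_B" => ["term_A"],
    "term_C" => ["term_A"],
    "term_D" => ["term_B", "term_C"],
);

# Collect every ancestor of a term by following parent links.
# %seen ensures a term reached along two paths is reported only once.
sub ancestors {
    my ($graph, $term) = @_;
    my %seen;
    my @queue = @{ $graph->{$term} || [] };
    while (@queue) {
        my $current = shift @queue;
        next if $seen{$current}++;
        push @queue, @{ $graph->{$current} || [] };
    }
    return sort keys %seen;
}

print join(", ", ancestors(\%parents, "term_D")), "\n";   # term_A, term_B, term_C
```

term_A is reachable from term_D through both term_B and term_C, yet it appears once in the result. This is why a gene annotated to a specific term is also implicitly annotated to all of that term's ancestors.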
42
Obtain List of All Human Genes Annotated To Be Involved in Signal Transduction Using Ensembl’s BioMart
47
Gene Ontology Biological Process Term Name “signal transduction” GO:0007165