Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 - Benjamin King Mount Desert Island Biological Laboratory

1 Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 - Benjamin King Mount Desert Island Biological Laboratory bking@mdibl.org

2 Overview of 4 Lectures
Mon. Sept. 8: Introduction to Computation and Programming
Wed. Sept. 10 and Mon. Sept. 15: Programming (Text File Processing)
Wed. Sept. 17: Genome Sequencing and Informatics
Homework due Wed. Oct. 1st by 2pm

3 Example Scripts and Input File go_bears.pl array_example.pl associative_array_example.pl if_loop_example.pl regular_expression_example.pl for_and_while_loop_examples.pl subsitution_and_translation_examples.pl read_file.pl ben_input_file.txt reading_and_writing_files_example.pl

4 Perl http://www.cpan.org – Comprehensive Perl Archive Network http://www.activestate.com – Active Perl (for PC, Mac OS X, Linux)

5 Homework Due by 2pm on Wednesday, Oct. 1st. Email scripts as text file attachments, along with input data files, to bking@mdibl.org

6 Homework Assignment
Write a Perl script for each of the following:
1. (10 points) Using an iterative loop and a formula, print out the following two-column array:
    0 1
    1 2
    2 4
    3 8
    4 16
    5 32
    6 64
2. (20 points) Print out the transcribed RNA sequence for a DNA sequence in FASTA format. The script shall read in a text file containing the input DNA sequence from a FASTA formatted sequence file. Use the GenBank record, M15131, as the input sequence.
3. (30 points) Read in a tab-delimited text file downloaded using Ensembl’s BioMart that contains a listing of all transcription factors in the mouse genome, store the genome coordinates in associative arrays (using gene symbol as the key), and write an output file that contains the coordinates for all members of the HOX gene family. The list of all transcription factors can be retrieved by filtering by genes with proteins that have been annotated with the Gene Ontology molecular function term “sequence-specific DNA binding transcription factor activity” (GO:0003700).
4. (40 points) Calculate the percent GC content for each of the 36 positions in a subset of 100,000 RNAseq reads that you can download here as a FASTQ-formatted text file: https://gillnet.mdibl.org/~bking/MAT500/mini.fastq.gz

7 Programming Concepts
Variables: used to store a character string, integer, real number, or Boolean value (True or False)
    $a = "Go Bears";
    $b = 25;
    $c = 3.1415;
    $d = 0;
Data Structures: store “collections” of data in an organized fashion
Common Operations:
    Mathematical operations
    Testing for specific values (if / then statements)
    Iteration (for, while loops)
    Translation operations
    Printing messages
    Reading in files
    Writing output

8 To Run Perl Using Interactive Console
1. Type (in Command Prompt or Terminal window): perl
2. Type statements:
    $a = 1;
    $b = 2;
    $c = $a + $b;
    print $c;
3. Enter CTRL-D to execute commands (in the Windows Command Prompt, CTRL-Z followed by Enter)

9 go_bears.pl
    #!/usr/bin/perl
    # Header
    # Example script
    # Variable declarations
    $a = "Go ";
    $b = "Black ";
    $c = "Bears";
    # Main
    print $a,$b,$c,"\n";
To run: perl go_bears.pl
Output: Go Black Bears

10 Variables
Scalar types: character string, integer, real number, Boolean value (True or False)
    $a = "TAATAA";
    print $a;
    $n = 25;
    $m = 100;
    $sum = $n + $m;

11 Data Structures
Store “collections” of data in an organized fashion
Arrays: ordered list of items, usually of the same type (character, integer, etc.); Perl itself allows mixed types. Indexing starts at 0.
    @sequences = ("TAATAA", "TCATAA", "GAATAA");
    print $sequences[0];
    print $sequences[1];
    print $sequences[2];
    @numbers = (18,25,78);
    print $numbers[0];
    print $numbers[1];
    print $numbers[2];

12 Data Structures
Associative Arrays: list of items, usually of the same type (character, integer, etc.), but indexed by a key (a character string, integer, etc.) rather than by position
Also called “hash tables”; called dictionaries in Python
    %genbankids;
    $genbankids{"Il1b"} = "M15131";
    $genbankids{"Hoxc8"} = "AF198989";
    print $genbankids{"Il1b"};
Output: M15131
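
The array and associative-array slides above can be combined into one short script. This is a minimal sketch, not one of the course scripts: the fourth sequence ("TTATAA") is a made-up value added only to show push, and the use of strict, warnings, and my declarations is a stylistic assumption that the slides do not use.

```perl
use strict;
use warnings;

# Array: ordered list, indexed from 0
my @sequences = ("TAATAA", "TCATAA", "GAATAA");
push(@sequences, "TTATAA");          # hypothetical extra item, appended at the end
my $count = scalar(@sequences);      # number of items in the array

for my $i (0 .. $#sequences) {       # $#sequences is the last valid index
    print "$i\t$sequences[$i]\n";
}

# Associative array (hash): indexed by key instead of position
my %genbankids = (
    "Il1b"  => "M15131",
    "Hoxc8" => "AF198989",
);

# keys returns the keys in no particular order, so sort them for stable output
for my $symbol (sort keys %genbankids) {
    print "$symbol\t$genbankids{$symbol}\n";
}

# exists tests whether a key is present without creating it
if (exists $genbankids{"Il10"}) {
    print "Il10 has a GenBank ID\n";
}
else {
    print "No GenBank ID stored for Il10\n";
}
```

Sorting the keys is a design choice for reproducible output; a hash does not preserve insertion order.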

14 Sequence Alignment Program, BLAT
Steps for cDNA alignment:
1. Break cDNA into non-overlapping n base chunks (k-mers)
2. Use index to find regions in genome similar to each k-mer
3. Find exons by looking for k-mers that align to same genome region and cDNA
4. Stitch together exons

15 Sequence Alignment Program, BLAT (example from Jim Kent)
genome: cacaattatcacgaccgc (K = 8-13 for a real genome)
K-mers (genome position): cac 0, aat 3, tat 6, cac 9, gac 12, cgc 15
cDNA: aattctcac
3-mers (cDNA position): aat 0, att 1, ttc 2, tct 3, ctc 4, tca 5, cac 6
hits (3-mer: cDNA position, genome position, difference):
    aat: 0, 3, -3
    cac: 6, 0, 6
    cac: 6, 9, -3
clump: cacAATtatCACgaccgc
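
The toy example on this slide can be reproduced in a few lines of Perl. This is only a sketch of the indexing idea under stated assumptions, not BLAT's actual implementation: the function names build_index and find_hits are hypothetical, the genome is indexed with non-overlapping 3-mers and the cDNA is scanned with overlapping 3-mers (as the positions on the slide suggest), and no clumping or exon stitching is attempted.

```perl
use strict;
use warnings;

# Index the genome: map each non-overlapping k-mer to its start positions.
sub build_index {
    my ($genome, $k) = @_;
    my %index;
    for (my $pos = 0; $pos + $k <= length($genome); $pos += $k) {
        push @{ $index{ substr($genome, $pos, $k) } }, $pos;
    }
    return \%index;
}

# Look up every overlapping k-mer of the cDNA in the index.
# Each hit is [k-mer, cDNA position, genome position, difference].
sub find_hits {
    my ($cdna, $index, $k) = @_;
    my @hits;
    for (my $q = 0; $q + $k <= length($cdna); $q++) {
        my $kmer = substr($cdna, $q, $k);
        next unless exists $index->{$kmer};
        for my $g (@{ $index->{$kmer} }) {
            push @hits, [$kmer, $q, $g, $q - $g];
        }
    }
    return @hits;
}

my $k     = 3;
my $index = build_index("cacaattatcacgaccgc", $k);
my @hits  = find_hits("aattctcac", $index, $k);

print "kmer\tcDNA\tgenome\tdiff\n";
for my $hit (@hits) {
    print join("\t", @$hit), "\n";
}
```

Hits that share the same difference (here -3) lie on the same diagonal, which is what the slide groups into a clump.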

16 Common Operations
    Mathematical operations
    Testing for specific values (if / then statements), including regular expressions
    Iteration (for, while loops)
    Translation operations
    Printing messages
    Reading in files
    Writing output

17 Mathematical Operations
    Syntax      Description
    5 + 10      Addition
    10 - 5      Subtraction
    10 * 5      Multiplication
    5 / 10      Division
    10**2       Exponent
    exp(2)      Exponential function
    log(256)    Natural log
Built-in functions: abs, atan2, cos, exp, hex, int, log, oct, rand, sin, sqrt, srand
Examples: abs(-1), sqrt(256)
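
The operations in the table can be tried directly. The sketch below is illustrative only; log_base is a hypothetical helper (not a Perl built-in) added to show how a base-2 logarithm can be derived from the natural log that log() returns.

```perl
use strict;
use warnings;

# Perl's log() is the natural log; other bases need a change of base.
sub log_base {
    my ($x, $base) = @_;
    return log($x) / log($base);
}

printf("5 + 10    = %d\n",   5 + 10);
printf("10 - 5    = %d\n",   10 - 5);
printf("10 * 5    = %d\n",   10 * 5);
printf("5 / 10    = %.1f\n", 5 / 10);      # division is floating point, not integer
printf("10**2     = %d\n",   10**2);
printf("exp(2)    = %.4f\n", exp(2));
printf("log(256)  = %.4f\n", log(256));    # natural log
printf("log2(256) = %.1f\n", log_base(256, 2));
printf("abs(-1)   = %d\n",   abs(-1));
printf("sqrt(256) = %d\n",   sqrt(256));
printf("int(7.9)  = %d\n",   int(7.9));    # int truncates toward zero
```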

18 if / then statements (testing for values)
    $a = 1;
    if ($a == 1) {
        print "Value is 1";
        print "hello";
    }
    else {
        print "Value is not 1";
    }
Numeric comparison operators: == != > < >= <=
    $a != -1
    $a > 0
    $a >= 0
String comparison operators: eq ne
    $b eq "okay"
    $b ne "okay"

19 if / then statements (testing for values)
    $a = 1;
    if ($a == 1) {
        print "Value is 1\n";
    }
    elsif ($a == 2) {
        print "Value is 2\n";
    }
    else {
        print "Value is not 1 or 2\n";
    }

20 Regular Expressions
Used to match a pattern of characters
Often applied in if / then statements
    $a = "Today is Sept 11, 2013";
    if ($a =~ /, \d+/) {
        print "Found year\n";
    }
    if ($a =~ /, (\d+)/) {
        print "Year=",$1;
    }
Output:
    Found year
    Year=2013

21 Regular Expressions
    ^       Match at beginning of string
    $       Match at end of string
    .       Match any character
    \w      Match "word" character (alphanumeric plus "_")
    \W      Match non-word character
    \s      Match whitespace character
    \S      Match non-whitespace character
    \d      Match digit character
    \D      Match non-digit character
    \t      Match tab
    \n      Match newline
    *       Match 0 or more times
    +       Match 1 or more times
    ?       Match 1 or 0 times
    {n}     Match exactly n times
    {n,}    Match at least n times
    {n,m}   Match at least n but not more than m times
    [ ]     Match one character from a set or range (e.g., [ATGC], [0-9], [a-zA-Z]); no "|" is needed between the characters
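
A few of these symbols combined on sequence data. This is a sketch only: the helper names gi_number, species, and is_dna are hypothetical, and the header string is the first FASTA header from ben_sequences.fa shown later in the deck.

```perl
use strict;
use warnings;

# Capture the digits between ">gi|" and the next "|" at the start of a header.
sub gi_number {
    my ($header) = @_;
    return $header =~ /^>gi\|(\d+)\|/ ? $1 : undef;
}

# Capture the text inside the square brackets at the end of a header.
sub species {
    my ($header) = @_;
    return $header =~ /\[([^\]]+)\]\s*$/ ? $1 : undef;
}

# True if the string consists only of A, C, G, and T (one or more).
sub is_dna {
    my ($seq) = @_;
    return $seq =~ /^[ACGT]+$/ ? 1 : 0;
}

my $header = ">gi|47115295|emb|CAG28607.1| IL1B [Homo sapiens]";
print "gi number: ", gi_number($header), "\n";
print "species:   ", species($header), "\n";
print "TAATAA is DNA: ", is_dna("TAATAA"), "\n";
print "TAAXAA is DNA: ", is_dna("TAAXAA"), "\n";
```

The "|" characters in the header are escaped as \| because an unescaped | means alternation in a pattern.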

22 Iteration (for, while loops)
    for ($i = 0; $i <= 5; $i++) {
        print "i=",$i, " i**2=",$i**2, "\n";
    }
Output:
    i=0 i**2=0
    i=1 i**2=1
    i=2 i**2=4
    i=3 i**2=9
    i=4 i**2=16
    i=5 i**2=25

23 Iteration (for, while loops)
    $i = 0;
    while ($i <= 5) {
        print "i=",$i, " i**2=",$i**2, "\n";
        $i = $i + 1;
        #$i++;
    }
Output:
    i=0 i**2=0
    i=1 i**2=1
    i=2 i**2=4
    i=3 i**2=9
    i=4 i**2=16
    i=5 i**2=25

24 Substitution and Translation Operations
    $sentence = "I flew to london yesterday";
    $sentence =~ s/london/London/;       # replaces the first match only
    #$sentence =~ s/london/London/g;     # g replaces every match
    print $sentence,"\n";
    $sentence = "abcdefghijklmnopqrstuvwxyz";
    print $sentence,"\n";
    $sentence =~ tr/abc/edf/;            # a->e, b->d, c->f
    print $sentence,"\n";
    $sentence =~ tr/[a-z]/[A-Z]/;        # the brackets are optional: tr/a-z/A-Z/ is equivalent here
    print $sentence,"\n";
Output:
    I flew to London yesterday
    abcdefghijklmnopqrstuvwxyz
    edfdefghijklmnopqrstuvwxyz
    EDFDEFGHIJKLMNOPQRSTUVWXYZ
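
The tr operator is a natural fit for nucleotide sequences. The sketch below is an illustration that is not taken from the course scripts; reverse_complement and gc_count are hypothetical helper names, and the input sequences are the ones from the arrays slide.

```perl
use strict;
use warnings;

# Reverse the sequence, then swap each base for its complement.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse($seq);          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# With an empty replacement list, tr leaves the string unchanged
# and returns the number of characters that matched.
sub gc_count {
    my ($seq) = @_;
    my $count = ($seq =~ tr/GCgc//);
    return $count;
}

my $seq = "TAATAA";
print "sequence:           $seq\n";
print "reverse complement: ", reverse_complement($seq), "\n";
print "G+C bases in GAATAA: ", gc_count("GAATAA"), "\n";
```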

25 Printing Messages
    $a = 24.56;
    print "Value of a=",$a, "\n";
    print "Value of a=$a\n";
\n = new line character
\t = tab character

26 Reading Input Files
read_file.pl:
    #!/usr/bin/perl
    # Header
    # Example script that reads in an input file
    # and prints it out
    # File handling
    $input_fh = open(INPUT,"<ben_input_file.txt");
    # Main
    while (<INPUT>) {
        $line = $_;
        chomp($line);
        print $line,"\n";
        if ($_ =~ /Il12/) {
            print "Found Il12\n";
        }
    }
ben_input_file.txt (tab-delimited: chromosome, start, gene symbol):
    1	3500050	Il1b
    2	56049490	Il12
    2	78079000	Il10

27 Reading Input Files
    #!/usr/bin/perl
    # File handling
    $input_fd = open(INPUT,"<ben_input_file.txt");
    $output_fd = open(OUTPUT,">ben_output_file.txt");
    # Main
    while (<INPUT>) {
        $line = $_;
        chomp($line);
        @fields = split("\t",$line);   # splits current line by tab characters
        $chr = $fields[0];
        $start = $fields[1];
        $symbol = $fields[2];
        print "chr=",$chr," symbol=",$symbol,"\n";
    }
ben_input_file.txt (tab-delimited: chromosome, start, gene symbol):
    1	3500050	Il1b
    2	56049490	Il12
    2	78079000	Il10

28 Writing Output
ben_input_file.txt (tab-delimited: chromosome, start, gene symbol):
    1	3500050	Il1b
    2	56049490	Il12
    2	78079000	Il10
    #!/usr/bin/perl
    # File handling
    $input_fd = open(INPUT,"<ben_input_file.txt");
    $output_fd = open(OUTPUT,">ben_output_file.txt");
    # Main
    while (<INPUT>) {
        $line = $_;
        chomp($line);
        @fields = split("\t",$line);   # splits current line by tab characters
        $chr = $fields[0];
        $start = $fields[1];
        $symbol = $fields[2];
        print OUTPUT "chr=",$chr," symbol=",$symbol,"\n";
    }
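
The same read, split, and write pattern can also be written with lexical filehandles and error checks, storing each record in an associative array keyed by gene symbol. This is a sketch under assumptions, not the course's script: read_coordinates and write_report are hypothetical names, the three sample rows assume the column split chromosome / start / symbol for the input file shown on the slides, and a temporary directory stands in for ben_input_file.txt so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Read a tab-delimited file (chromosome, start, gene symbol) into a
# hash reference keyed by gene symbol.
sub read_coordinates {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my %coords;
    while (my $line = <$in>) {
        chomp($line);
        next if $line eq '';
        my ($chr, $start, $symbol) = split(/\t/, $line);
        $coords{$symbol} = { chr => $chr, start => $start };
    }
    close($in);
    return \%coords;
}

# Write one "chr=... symbol=..." line per gene, sorted by symbol.
sub write_report {
    my ($coords, $path) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    for my $symbol (sort keys %$coords) {
        print $out "chr=", $coords->{$symbol}{chr}, " symbol=", $symbol, "\n";
    }
    close($out);
}

# Demo with assumed sample data written to a temporary directory.
my $dir    = tempdir(CLEANUP => 1);
my $input  = "$dir/input.txt";
my $output = "$dir/output.txt";
open(my $fh, '>', $input) or die "Cannot write $input: $!";
print $fh "1\t3500050\tIl1b\n2\t56049490\tIl12\n2\t78079000\tIl10\n";
close($fh);

my $coords = read_coordinates($input);
write_report($coords, $output);
print "Il12 starts at $coords->{Il12}{start} on chr $coords->{Il12}{chr}\n";
```

Checking the return value of open with "or die" reports a missing or unreadable file immediately instead of silently reading nothing.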

29 Using Modules

30 BioPerl

31 Install BioPerl using Active Perl’s “Perl Package Manager”

33 Using Modules
Command: getlengths.pl ben_sequences.fa fasta
ben_sequences.fa:
>gi|47115295|emb|CAG28607.1| IL1B [Homo sapiens]
MAEVPKLASEMMAYYSGNEDDLFFEADGPKQMKCSFQDLDLCPLDGGIQLRISDHHYSKGFRQAASVVVA
MDKLRKMLVPCPQTFQENDLSTFFPFIFEEEPIFFDTWDNEAYVHDAPVRSLNCTLRDSQQKSLVMSGPY
ELKALHLQGQDMEQQVVFSMSFVQGEESNDKIPVALGLKEKNLYLSCVLKDDKPTLQLESVDPKNYPKKK
MEKRFVFNKIEINNKLEFESAQFPNWYISTSQAENMPVFLGGTKGGQDITDFTMQFVSS
>gi|68534031|gb|AAH98597.1| Il1b protein [Danio rerio]
MACGQYEVTIAPKNLWETDSAVYSDSDEMDCSDPLAMSYRCDMHEGIRLEMWTSQHKMKQLVNVIIALNR
MKHIKPQSTEFGEKEVLDMLMANVIQEREVNVVDSVPSYTKTKNVLQCTICDQYKKSLVRSGGSPHLQAV
TLRAGSSDLKVRFSMSTYASPSAPATSAQPVCLGISKSNLYLACSPAEGSAPHLVLKEISGSLETIKAGD
PNGYDQLLFFRKETGSSINTFESVKCPGWFISTAYEDSQMVEMDRKDTERIINFELQDKVRI
>gi|13928692|ref|NP_113700.1| interleukin 1, beta [Rattus norvegicus]
MATVPELNCEIAAFDSEENDLFFEADRPQKIKDCFQALDLGCPDESIQLQISQQHLDKSFRKAVSLIVAV
EKLWQLPMSCPWSFQDEDPSTFFSFIFEEEPVLCDSWDDDDLLVCDVPIRQLHCRLRDEQQKCLVLSDPC
ELKALHLNGQNISQQVVFSMSFVQGETSNDKIPVALGLKGLNLYLSCVMKDGTPTLQLESVDPKQYPKKK
MEKRFVFNKIEVKTKVEFESAQFPNWYISTSQAEHRPVFLGNSNGRDIVDFTMEPVSS

34 Using Modules
getlengths.pl:
    # first, bring in the SeqIO module
    use Bio::SeqIO;
    # usage statement if one or both arguments are missing.
    my $usage = "getlengths.pl file format\n";
    my $file = shift or die $usage;
    my $format = shift or die $usage;
    # create a SeqIO object that will bring in the contents of the input file
    my $inseq = Bio::SeqIO->new(-file => "<$file", -format => $format );
    # print the length of each sequence in the file
    while (my $seq = $inseq->next_seq) {
        print $seq->length,"\n";
    }
    exit;
To run: perl getlengths.pl ben_sequences.fa fasta
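
For comparison, sequence lengths can also be computed without BioPerl. The sketch below is not a replacement for Bio::SeqIO, which handles many formats and edge cases; it only illustrates what the module does for simple FASTA input. fasta_lengths is a hypothetical helper, and the two short records are made-up test data read from an in-memory string so the example runs without an input file.

```perl
use strict;
use warnings;

# Return a list of [id, length] pairs, one per FASTA record, in file order.
sub fasta_lengths {
    my ($fh) = @_;
    my (@ids, %length);
    my $id;
    while (my $line = <$fh>) {
        chomp($line);
        if ($line =~ /^>(\S+)/) {        # header line: ID is the first word after ">"
            $id = $1;
            push @ids, $id;
            $length{$id} = 0;
        }
        elsif (defined $id) {            # sequence line: may span several lines
            $line =~ s/\s+//g;
            $length{$id} += length($line);
        }
    }
    return map { [ $_, $length{$_} ] } @ids;
}

# Hypothetical two-record FASTA text, read through an in-memory filehandle.
my $fasta = ">seq1 hypothetical record\nMAEVPKLASE\nMMAYY\n"
          . ">seq2 hypothetical record\nMACGQYEV\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";
for my $record (fasta_lengths($fh)) {
    print "$record->[0]\t$record->[1]\n";
}
close($fh);
```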

37 Retrieve a Sequence in FASTA Format

41 Gene Ontology
Uses terms to describe gene products:
    Biological Process
    Molecular Function
    Cellular Component
A given term may have multiple parent nodes (DAG = directed acyclic graph)
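
The "multiple parent nodes" point can be illustrated with an associative array that maps each term to a list of its parents. This is a sketch with placeholder labels ("term A" through "term D"), not real Gene Ontology terms or relationships, and ancestors is a hypothetical helper name.

```perl
use strict;
use warnings;

# Placeholder DAG: "term D" has two parents, which share one parent.
my %parents = (
    "term A" => [],
    "term B" => ["term A"],
    "term C" => ["term A"],
    "term D" => ["term B", "term C"],
);

# Collect every ancestor of a term by following parent links.
# The %$seen hash ensures a shared ancestor is visited only once.
sub ancestors {
    my ($term, $parents, $seen) = @_;
    $seen ||= {};
    for my $parent (@{ $parents->{$term} || [] }) {
        next if $seen->{$parent}++;
        ancestors($parent, $parents, $seen);
    }
    my @found = sort keys %$seen;
    return @found;
}

my @found = ancestors("term D", \%parents);
print "Ancestors of term D: ", join(", ", @found), "\n";
```

In a tree each node would have exactly one parent; here "term A" is reached along two paths but reported once.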

42 Obtain List of All Human Genes Annotated To Be Involved in Signal Transduction Using Ensembl’s BioMart

47 Gene Ontology Biological Process Term Name “signal transduction” GO:0007165

48 Obtain List of All Human Genes Annotated To Be Involved in Signal Transduction Using Ensembl’s BioMart
