Bioinformatics: Theory and Practice (生物信息学理论和实践). Jijun Tang (唐继军) 13928761660.

#!/usr/bin/perl -w
use Bio;
use strict;
use warnings;

# Translate the three forward reading frames
my $DNA = fasta_read();
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";

# Reverse-complement the sequence, then translate the three reverse frames
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;
print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

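my $x = 10;
for (my $x = 0; $x < 5; $x++) {
    Scope();
    print $x, "\n";    # prints 0..4: Scope() only touches its own $x
}
print $x, "\n";        # prints 10
sub Scope {
    my $x = 0;         # a new variable, private to the subroutine
}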

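my $x = 10;
for (my $x = 0; $x < 5; $x++) {
    Scope();
    print $x, "\n";    # still prints 0..4: the loop has its own $x
}
print $x, "\n";        # prints 0: Scope() reset the file-level $x
sub Scope {
    $x = 0;            # no "my": this is the file-level $x
}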

sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        if ($line =~ /^\s*$/) {          # discard blank line
            next;
        } elsif ($line =~ /^\s*#/) {     # discard comment line
            next;
        } elsif ($line =~ /^>/) {        # discard FASTA header line
            next;
        } else {                         # keep line, add to sequence string
            $sequence .= $line;
        }
    }
    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}

Molecular Scissors
Figure source: Molecular Cell Biology, 4th edition

R = G or A
Y = C or T
M = A or C
K = G or T
S = G or C
W = A or T
B = not A (C or G or T)
D = not C (A or G or T)
H = not G (A or C or T)
V = not T (A or C or G)
N = A or C or G or T

sub IUB_to_regexp {
    my($iub) = @_;
    my $regular_expression = '';
    # Map each IUB ambiguity code to a regular-expression character class
    my %iub2character_class = (
        A => 'A', C => 'C', G => 'G', T => 'T',
        R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]',
        S => '[GC]', W => '[AT]',
        B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
        N => '[ACGT]',
    );
    # Remove the ^ signs (cut-site markers) from the recognition site
    $iub =~ s/\^//g;
    # Translate each character in the IUB sequence
    for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
        $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
    }
    return $regular_expression;
}
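To see the translation at work, here is a minimal sketch that applies the same lookup idea to a degenerate site and searches a short sequence. The helper name iub_to_regexp_demo, the site G^GWCC, and the sequence are made up for illustration; they are not part of the course code.

```perl
use strict;
use warnings;

# Hypothetical compact variant of IUB_to_regexp, for illustration only.
sub iub_to_regexp_demo {
    my ($iub) = @_;
    my %class = (
        A => 'A', C => 'C', G => 'G', T => 'T',
        R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]',
        S => '[GC]', W => '[AT]',
        B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
        N => '[ACGT]',
    );
    $iub =~ s/\^//g;                          # drop the cut-site marker
    return join '', map { $class{$_} } split //, $iub;
}

my $regexp = iub_to_regexp_demo('G^GWCC');    # example pattern only
my $dna    = 'ttggaccaaggtccaa';              # made-up sequence

my @hits;
while ($dna =~ /$regexp/ig) {
    push @hits, $-[0] + 1;                    # 1-based start of each match
}
print "$regexp matches at: @hits\n";
```

The W in the site becomes the character class [AT], so both ggacc and ggtcc are found.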

Hash
- Initialize: my %hash = ();
- Add key/value pair: $hash{$key} = $value;
- Add more keys:
    %hash = ( 'key1', 'value1', 'key2', 'value2' );
    %hash = ( key1 => 'value1', key2 => 'value2', );
- Delete: delete $hash{$key};

while ( my ($key, $value) = each(%hash) ) {
    print "$key => $value\n";
}

foreach my $key ( keys %hash ) {
    my $value = $hash{$key};
    print "$key => $value\n";
}

sub parseREBASE {
    my($rebasefile) = @_;
    my %rebase_hash = ( );
    my $name;
    my $site;
    my $regexp;

    open(my $rebase_filehandle, '<', $rebasefile) or die "Cannot open file\n";

    while (<$rebase_filehandle>) {
        # Discard header lines
        ( 1 .. /Rich Roberts/ ) and next;
        # Discard blank lines
        /^\s*$/ and next;
        # Split the two (or three if includes parenthesized name) fields
        my @fields = split( " ", $_);
        # Keep the first field (name) and the last field (recognition site)
        $name = shift @fields;
        $site = pop @fields;
        # Translate the recognition sites to regular expressions
        $regexp = IUB_to_regexp($site);
        # Store the data into the hash
        $rebase_hash{$name} = "$site $regexp";
    }
    close $rebase_filehandle;

    # Return the hash containing the reformatted REBASE data
    return %rebase_hash;
}

Range
( 1 .. /Rich Roberts/ ) and next
- The range is true from the first line up to and including the first line containing "Rich Roberts"
- If the range is true, Perl evaluates the statement after "and" (here, next)
- If the range is false, the statement after "and" is skipped
open(...) or die
- If the file can be opened, the expression is already true, so the statement after "or" is not evaluated
- If the file cannot be opened, the left side is false, so Perl evaluates the statement after "or" (here, die)
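The header-skipping idiom can be tried without a real REBASE file. The sketch below reads from an in-memory string instead; the text in $text is a made-up stand-in for the file, not actual REBASE content.

```perl
use strict;
use warnings;

# Made-up stand-in for a REBASE file: the header ends at the "Rich Roberts" line.
my $text = "REBASE header\nRich Roberts\n\nEcoRI G^AATTC\nBamHI G^GATCC\n";

# "or die" only runs when open() fails (short-circuit evaluation)
open(my $fh, '<', \$text) or die "Cannot open in-memory file\n";

my @kept;
while (<$fh>) {
    ( 1 .. /Rich Roberts/ ) and next;   # true from line 1 through the header's last line
    /^\s*$/ and next;                   # "next" runs only when the line is blank
    chomp;
    push @kept, $_;
}
close $fh;

print "$_\n" for @kept;
```

Only the two enzyme lines survive: the first two lines fall inside the range, and the third is blank.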

Array operators
push and pop (right-most element)
    @array = (1,2,3);
    $oldvalue = pop(@array);
shift and unshift (left-most element)
    @array = (5,6,7);
    $x = shift(@array);
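A minimal sketch of the four operators follows. The array contents are made up; the second half mirrors how a name and a site could be taken from the two ends of a split line.

```perl
use strict;
use warnings;

my @stack = (1, 2, 3);
push @stack, 4;                 # add on the right:      (1, 2, 3, 4)
my $last = pop @stack;          # remove from the right: $last is 4
unshift @stack, 0;              # add on the left:       (0, 1, 2, 3)
my $first = shift @stack;       # remove from the left:  $first is 0

# Made-up fields, as if produced by split(" ", $line)
my @fields = ('EcoRI', '(alias)', 'G^AATTC');
my $name = shift @fields;       # first field
my $site = pop @fields;         # last field; the middle field is ignored

print "stack: @stack, last: $last, first: $first\n";
print "name: $name, site: $site\n";
```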

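sub match_positions {
    my($regexp, $sequence) = @_;
    use BeginPerlBioinfo;
    my @positions = ( );
    # Record the 1-based start position of every match
    while ( $sequence =~ /$regexp/ig ) {
        push( @positions, pos($sequence) - length($&) + 1 );
    }
    return @positions;
}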

use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

# Read the DNA from the file "sample.dna"
@file_data = get_file_data("sample.dna");
$dna = extract_sequence_from_fasta_data(@file_data);

# Read the REBASE data from the file "bionet"
%rebase_hash = parseREBASE('bionet');

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );
exit;

Print to file
- Open a file to print:
    open FILE, ">filename.txt";
    open (FILE, ">filename.txt");
- Print to the file:
    print FILE $str;

#write new file
open(FILE, ">out") or die "Cannot open file to write";
print FILE "Test\n";
close FILE;
exit;

#Append
open(FILE, ">>out") or die "Cannot open file to write";
print FILE "Test\n";
close FILE;
exit;

#!/usr/bin/perl
print "My name is $0 \n";
print "First arg is: $ARGV[0] \n";
print "Second arg is: $ARGV[1] \n";
print "Third arg is: $ARGV[2] \n";
$num = $#ARGV + 1;    # $#ARGV is the index of the last argument
print "How many args? $num \n";
print "The full argument string was @ARGV \n";

use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

# Read the DNA from the file named by the first command-line argument
@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);

# Read the REBASE data from the file "bionet"
%rebase_hash = parseREBASE('bionet');

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );
exit;

use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

# Read the DNA from the file named by the first command-line argument
@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);

# Read the REBASE data from the file named by the second command-line argument
%rebase_hash = parseREBASE($ARGV[1]);

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );
exit;

Regular Expression
^      beginning of string
$      end of string
.      any character except newline
*      match 0 or more times
+      match 1 or more times
?      match 0 or 1 times
|      alternative
( )    grouping; "storing"
[ ]    set of characters
{ }    repetition modifier
\      quote or special

Repeats
a*        zero or more a's
a+        one or more a's
a?        zero or one a's (i.e., optional a)
a{m}      exactly m a's
a{m,}     at least m a's
a{m,n}    at least m but at most n a's
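The quantifiers above can be checked against a few strings of a's. This is a small sketch; the anchors ^ and $ are added so that each pattern must cover the whole string, and the test strings are made up.

```perl
use strict;
use warnings;

my @patterns = ('^a*$', '^a+$', '^a?$', '^a{2}$', '^a{2,}$', '^a{2,3}$');
my @strings  = ('', 'a', 'aa', 'aaa', 'aaaa');

my %matches;    # pattern => list of strings it matches
for my $p (@patterns) {
    my @ok = grep { /$p/ } @strings;
    $matches{$p} = [@ok];
    printf "%-9s matches: %s\n", $p, join(', ', map { "'$_'" } @ok);
}
```

For example, ^a?$ accepts only the empty string and a single a, while ^a{2,3}$ accepts only aa and aaa.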

Perl tr/// function
- tr means transliterate: it replaces a character with another character
- $dna =~ tr/a/c/ replaces all "a" with "c" in $dna
- It also works on a range: $dna =~ tr/a-z/A-Z/ replaces all lower case letters with upper case
- tr also counts: $count = ($string =~ tr/A//) (you might think this also deletes all "A" from the string, but it doesn't)
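A short sketch of the three uses of tr (counting, changing case, complementing) on a made-up sequence:

```perl
use strict;
use warnings;

my $dna = 'ATGCatgcAAN';                 # made-up sequence

# Counting: an empty replacement list leaves the string unchanged
my $count_A = ($dna =~ tr/A//);

# Range: copy the string, then upper-case the copy
(my $upper = $dna) =~ tr/a-z/A-Z/;

# Reverse complement: reverse first, then swap the bases
my $revcom = reverse $upper;
$revcom =~ tr/ACGT/TGCA/;

print "A count: $count_A\n";
print "upper:   $upper\n";
print "revcom:  $revcom\n";
```

Characters not listed in tr, such as N here, pass through untouched.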

Wildcards
Perl has a set of wildcard characters for regular expressions that are completely different from the ones used by Unix:
- the dot (.) matches any character except newline
- \d matches any digit (0-9)
- \w matches any word character (a letter, digit, or underscore; not punctuation or space)
- \s matches a whitespace character (space, tab, or newline); use \s+ for any amount
- ^ matches the beginning of a line
- $ matches the end of a line (Yes, this is very confusing!)

Repeat for a count
- Use curly brackets to show that a character repeats a specific number (or range) of times
- Find an EcoRI fragment of 100-500 bp length (two EcoRI sites with any other sequence between):
    if ($ecofrag =~ /GAATTC[GATC]{100,500}GAATTC/)
- The + sign is used to indicate an unlimited number of repeats (occurs 1 or more times)
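The same idea can be tried at a smaller scale. In this sketch the fragment and the ranges {10,20} and {1,5} are made up so that the result is easy to check by eye.

```perl
use strict;
use warnings;

# Made-up fragment: two EcoRI sites separated by 12 bases
my $ecofrag = 'GAATTC' . ('ACGT' x 3) . 'GAATTC';

# 12 falls inside 10..20 but outside 1..5
my $in_range  = ($ecofrag =~ /GAATTC[GATC]{10,20}GAATTC/) ? 1 : 0;
my $too_short = ($ecofrag =~ /GAATTC[GATC]{1,5}GAATTC/)   ? 1 : 0;

# With + (one or more), capture the spacer to measure it
my $spacer_len = 0;
if ($ecofrag =~ /GAATTC([GATC]+)GAATTC/) {
    $spacer_len = length $1;
}

print "matches {10,20}: $in_range\n";
print "matches {1,5}:   $too_short\n";
print "spacer length:   $spacer_len\n";
```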

my $mystring;
$mystring = "Hello world!";
if ($mystring =~ m/World/) { print "Yes"; }     # no match: matching is case-sensitive
if ($mystring =~ m/World/i) { print "Yes"; }    # matches: /i ignores case

Grabbing parts of a string
- Regular expressions can do more than just ask "if" questions
- They can be used to extract parts of a line of text into variables
- Check this out: /^>(\w+)\s(.+)$/;
- Complete gibberish, right? It means:
  - look for the > sign at the beginning of a FASTA formatted sequence file
  - dump the first word (\w+) into variable $1 (the sequence ID)
  - after a space, dump the rest of the line (.+), until you reach the end of line $, into variable $2 (the description)
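Applied to a header line, the pattern behaves as described. The header text below is made up for illustration.

```perl
use strict;
use warnings;

my $header = '>seq1 Homo sapiens hypothetical example';   # made-up FASTA header

my ($id, $description) = ('', '');
if ($header =~ /^>(\w+)\s(.+)$/) {
    $id          = $1;    # first word after the > sign
    $description = $2;    # everything after the first space
}

print "ID:          $id\n";
print "Description: $description\n";
```

Copying $1 and $2 into named variables right away is a common habit, because the next successful match overwrites them.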

$mystring = "[2004/04/13] The date of this article.";
if ($mystring =~ m/(\d)/) { print "The first digit is $1."; }
if ($mystring =~ m/(\d+)/) { print "The first number is $1."; }
if ($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; }
while ($mystring =~ m/(\d+)/g) { print "Found number $1."; }
# In list context, m//g returns all the captured matches at once
@myarray = ($mystring =~ m/(\d+)/g);
print join(",", @myarray);

Working with Single DNA Sequences

Learning Objectives
- Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR
- Find out about gene-prediction methods, their potential, and their limitations
- Understand how genomes and sequences are assembled

Outline
1. Cleaning your DNA of contaminants
2. Digesting your DNA in the computer
3. Finding protein-coding genes in your DNA sequence
4. Assembling a genome

Cleaning DNA Sequences
- In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmid)
- Sequences of the vector can be mixed with your DNA sequence
- Before working with your DNA sequence, you should always clean it with VecScreen

VecScreen
/VecScreen.html
- Runs a special version of BLAST
- A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin

What to do if hits are found
- If the hits are at the extremities, you can just remove them
- If they are in the middle, or the vectors are not the ones you are using, the safest thing is to throw the sequence away

Computing a Restriction Map
- It is possible to cut DNA sequences using restriction enzymes
- Each type of restriction enzyme recognizes and cuts a different sequence:
    EcoRI: GAATTC
    BamHI: GGATCC
- There are more than 900 different restriction enzymes, each with a different specificity
- The restriction map is the list of all potential cleavage sites in a DNA molecule
- You can compile a restriction map with

Cannot get it to work!

Making PCR with a Computer
- Polymerase Chain Reaction (PCR) is a method for amplifying DNA
- PCR is used for many applications, including:
  - Gene cloning
  - Forensic analysis
  - Paternity tests
- PCR amplifies the DNA between two anchors
- These anchors are called the PCR primers

Designing PCR Primers
- PCR primers are typically 20 nucleotides long
- The primers must hybridize well with the DNA
- On biotools.umassmed.edu, find the best location for the primers:
  - Most stable
  - Longest extension

Analyzing DNA Composition
- DNA composition varies a lot
- Stability of a DNA sequence depends on its G+C content (total guanine and cytosine)
- High G+C makes very stable DNA molecules
- Online resources are available to measure the GC content of your DNA sequence
- Also for counting words and internal repeats
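G+C content is also easy to compute locally. The sketch below is one possible way to do it with tr; the subroutine name gc_content is hypothetical, and the choice to ignore characters other than A, C, G, and T in the denominator is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of G and C among the A/C/G/T bases of a sequence.
sub gc_content {
    my ($seq) = @_;
    my $gc    = ($seq =~ tr/GCgc//);        # count G and C, either case
    my $total = ($seq =~ tr/ACGTacgt//);    # count unambiguous bases only
    return $total ? $gc / $total : 0;       # avoid dividing by zero
}

my $gc = gc_content('ATGGCTGACT');
printf "G+C content: %.1f%%\n", 100 * $gc;
```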

Counting words
ATGGCTGACT
- 1-letter words: A, T, G, G, C, T, G, A, C, T
- 2-letter words: AT, TG, GG, GC, CT, TG, GA, AC, CT
- 3-letter words: ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
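Word counting amounts to sliding a window of fixed length along the sequence and tallying each word in a hash. A minimal sketch, with the hypothetical subroutine name count_words:

```perl
use strict;
use warnings;

# Hypothetical helper: count every overlapping word of length $k in $seq.
sub count_words {
    my ($seq, $k) = @_;
    my %count;
    for my $i (0 .. length($seq) - $k) {
        $count{ substr($seq, $i, $k) }++;
    }
    return %count;
}

my %di  = count_words('ATGGCTGACT', 2);
my %tri = count_words('ATGGCTGACT', 3);

for my $word (sort keys %di) {
    print "$word: $di{$word}\n";
}
```

For this sequence, TG and CT each occur twice among the 2-letter words, and every 3-letter word occurs once.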

EMBOSS servers European Molecular Biology Open Software Suite

ORF EMBOSS NCBI

ncbi.nlm.nih.gov/gorf/gorf.html

Internal repeats
- A word repeated in the sequence, long enough to not occur by chance
- Can be imperfect (regular expression)
- Dot plot is the best way to spot it

arbl.cvmbs.colostate.edu/molkit

Predicting Genes
- The most important analysis carried out on DNA sequences is gene prediction
- Gene prediction requires different methods for eukaryotes and prokaryotes
- Most gene-prediction methods use hidden Markov models

Predicting Genes in Prokaryotic Genomes
- In prokaryotes, protein-coding genes are uninterrupted: no introns
- Predicting protein-coding genes in prokaryotes is considered a solved problem
- You can expect 99% accuracy

Finding Prokaryotic Genes with GeneMark
- GeneMark is the state of the art for microbial genomes
- GeneMark can:
  - Find short proteins
  - Resolve overlapping genes
  - Identify the best start codon
- Use exon.gatech.edu/GeneMark
- Click the "heuristic models"

Predicting Eukaryotic Genes
- Eukaryotic genes (human, for example) are very hard to predict
- Precise and accurate eukaryotic gene prediction is still an open problem
- ENSEMBL contains 21,662 genes for the human genome
- There may well be more genes than that in the genome, as yet unpredicted
- You can expect 70% accuracy on the human genome with automatic methods
- Experimental information is still needed to predict eukaryotic genes

Finding Eukaryotic Genes with GenomeScan
- GenomeScan is the state of the art for eukaryotic genes
- GenomeScan works best with:
  - Long exons
  - Genes with a low GC content
- It can incorporate experimental information
- Use genes.mit.edu/genomescan

Producing Genomic Data
- Until recently, sequencing an entire genome was very expensive and difficult
- Only major institutes could do it
- Today, scientists estimate that in 10 years, it will cost about $1000 to sequence a human genome
- With sequencing so cheap, assembling your own genomes is becoming an option
- How could you do it?

Sequencing and Assembling a Genome (I)
- To sequence a genome, the first task is to cut it into many small, overlapping pieces
- Then clone each piece

Sequencing and Assembling a Genome (II)
- Each piece must be sequenced
- Sequencing machines cannot do an entire sequence at once
- They can only produce short sequences smaller than 1 Kb
- These pieces are called reads
- It is necessary to assemble the reads into contigs

Sequencing and Assembling a Genome (III)
- The most popular program for assembling reads is PHRAP
- Available at
- Other programs exist for joining smaller datasets
- For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php