Bioinformatics: Theory and Practice. Jijun Tang (唐继军), Center for Computational Biology, Beijing Forestry University

Bioinformatics: Theory and Practice. Jijun Tang (唐继军), jtang@cse.sc.edu. Center for Computational Biology, Beijing Forestry University, www.bjfuccb.edu

Hash
Initialize: my %hash = ();
Add a key/value pair: $hash{$key} = $value;
Assign several pairs at once (this replaces the hash's contents): %hash = ( 'key1', 'value1', 'key2', 'value2' ); or %hash = ( key1 => 'value1', key2 => 'value2', );
Delete a key: delete $hash{$key};

Print to file
Open a file for writing: open FILE, ">filename.txt"; or open (FILE, ">filename.txt");
Print to the file: print FILE $str;

#Append
open(FILE, ">>out") or die "Cannot open file to write";
print FILE "Test\n";
close FILE;
exit;

#!/usr/bin/perl
print "My name is $0 \n";
print "First arg is: $ARGV[0] \n";
print "Second arg is: $ARGV[1] \n";
print "Third arg is: $ARGV[2] \n";
$num = $#ARGV + 1;
print "How many args? $num \n";
print "The full argument string was: @ARGV \n";

use BeginPerlBioinfo;
my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );
# Read the DNA from the FASTA file named in the first argument
@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);
# Load the REBASE data file named in the second argument
%rebase_hash = parseREBASE($ARGV[1]);
do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    # Exit on an empty query
    if ($query =~ /^\s*$/ ) { exit; }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction site for $query is not in the DNA\n";
        }
    }
} until ( $query =~ /quit/ );
exit;

Regular Expression
^ beginning of string
$ end of string
. any character except newline
* match 0 or more times
+ match 1 or more times
? match 0 or 1 times
| alternative
( ) grouping; "storing"
[ ] set of characters
{ } repetition modifier
\ quote or special

$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/(\d)/) { print "The first digit is $1."; }
if($mystring =~ m/(\d+)/) { print "The first number is $1."; }
if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; }
while($mystring =~ m/(\d+)/g) { print "Found number $1."; }
@myarray = ($mystring =~ m/(\d+)/g);
print join(",", @myarray);

$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; }

Download and install programs
Unzip or untar the archive: use unzip for a zip file; for file.tar.gz, use tar xvfz file.tar.gz
Go to the directory and run "./configure"
Then run "make"

Exercise
Download clustalw
Try to install it

System subroutine
system ("ls -ltr");
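A minimal sketch of how the return value of system can be checked. Perl's system returns the wait status of the command, and the exit code is that status shifted right by 8 bits. The sub name run_command is hypothetical and used only for illustration; the list form of system is used so that no shell quoting is involved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run an external command (list form, no shell) and return its exit code.
# run_command is a made-up helper name for this illustration.
sub run_command {
    my @command = @_;
    my $status = system(@command);

    # -1 means the program could not be started at all.
    if ($status == -1) {
        die "Could not start '@command': $!\n";
    }
    # The low 7 bits hold the signal number if the program was killed.
    if ($status & 127) {
        die "'@command' was killed by signal " . ($status & 127) . "\n";
    }
    # The exit code is stored in the high byte of the wait status.
    return $status >> 8;
}

my $exit_code = run_command('ls', '-ltr');
print "ls finished with exit code $exit_code\n";
```

An exit code of 0 conventionally means success, so a script that calls an aligner in a loop can stop as soon as one call returns something else.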

Exercise 2
Use pro.fasta
Find an alignment for each triple of proteins
Let's design the program together
Use "system" in Perl: system ("command parameters");

sub ReadFasta {
    my ($fname) = @_;
    open(FILE, $fname) or die "Cannot open $fname\n";
    my $data = "";
    my @dnas = ();
    while(my $line = <FILE>) {
        # A header line starts a new record, so store the previous one
        if ($line =~ /^>/) {
            if ($data ne "") { push(@dnas, $data); }
            $data = "";
        }
        $data .= $line;
    }
    # Store the last record
    if ($data ne "") { push(@dnas, $data); }
    close FILE;
    return @dnas;
}

print "Please input file name:\n";
my $fname = <STDIN>;
chomp $fname;
my @dnas = ReadFasta($fname);
my $len = $#dnas + 1;
# Write every triple of sequences to its own file and align it
for (my $i = 0; $i < $len; $i++) {
    for (my $j = $i+1; $j < $len; $j++) {
        for (my $k = $j+1; $k < $len; $k++) {
            $fname = "$i\_$j\_$k";
            print "$fname\n";
            open(OUT, ">$fname") or die "Cannot open $fname\n";
            print OUT $dnas[$i];
            print OUT $dnas[$j];
            print OUT $dnas[$k];
            close OUT;
            system ("./clustalw2 $i\_$j\_$k");
        }
    }
}

Working with Single DNA Sequences

Learning Objectives
Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR
Find out about gene-prediction methods, their potential, and their limitations
Understand how genomes are sequenced and assembled

Outline
1. Cleaning your DNA of contaminants
2. Digesting your DNA in the computer
3. Finding protein-coding genes in your DNA sequence
4. Assembling a genome

Cleaning DNA Sequences
In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmid)
Sequences of the vector can be mixed with your DNA sequence
Before working with your DNA sequence, you should always clean it with VecScreen

VecScreen
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
Runs a special version of BLAST
A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin

What to do if hits are found
If the hits are at the ends of the sequence, you can simply remove them
If they are in the middle, or the vectors are not the ones you are using, the safest thing is to throw the sequence away

Computing a Restriction Map
It is possible to cut DNA sequences using restriction enzymes
Each type of restriction enzyme recognizes and cuts a different sequence:
  EcoRI: GAATTC
  BamHI: GGATCC
There are more than 900 different restriction enzymes, each with a different specificity
The restriction map is the list of all potential cleavage sites in a DNA molecule
You can compile a restriction map with www.firstmarket.com/cutter

Cannot get it to work!

http://biotools.umassmed.edu/tacg4
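If the web tools are unavailable, a restriction map for simple recognition sequences can be approximated locally. The following is a minimal sketch, assuming the two recognition sequences from the restriction-map slide, uppercase input, and sites without ambiguity codes. The sub name find_sites and the example DNA are made up for illustration; this is a stand-in, not the match_positions routine from BeginPerlBioinfo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Recognition sequences taken from the restriction-map slide.
my %enzymes = (
    EcoRI => 'GAATTC',
    BamHI => 'GGATCC',
);

# Return the 1-based start positions of every occurrence of $site in $dna.
# A lookahead is used so that overlapping occurrences are reported too.
sub find_sites {
    my ($site, $dna) = @_;
    my @positions;
    while ($dna =~ /(?=\Q$site\E)/g) {
        push @positions, pos($dna) + 1;
    }
    return @positions;
}

# Made-up example sequence, for illustration only.
my $dna = 'TTGAATTCAAGGATCCTTGAATTC';

for my $name (sort keys %enzymes) {
    my @hits = find_sites($enzymes{$name}, $dna);
    print "$name ($enzymes{$name}): ", (@hits ? "@hits" : 'no site'), "\n";
}
```

For this example the script reports BamHI at position 11 and EcoRI at positions 3 and 19. Real enzyme databases also contain ambiguous sites (for example N or R), which would first have to be translated into regular expressions.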

Making PCR with a Computer
Polymerase Chain Reaction (PCR) is a method for amplifying DNA
PCR is used for many applications, including
  Gene cloning
  Forensic analysis
  Paternity tests
PCR amplifies the DNA between two anchors
These anchors are called the PCR primers

Designing PCR Primers
PCR primers are typically 20 nucleotides long
The primers must hybridize well with the DNA
On biotools.umassmed.edu, find the best location for the primers:
  Most stable
  Longest extension

Analyzing DNA Composition
DNA composition varies a lot
Stability of a DNA sequence depends on its G+C content (total guanine and cytosine)
High G+C makes very stable DNA molecules
Online resources are available to measure the GC content of your DNA sequence
Also for counting words and internal repeats
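The G+C content can also be computed in a few lines of Perl. This is a minimal sketch, assuming the sequence contains only nucleotide letters; the sub name gc_content is hypothetical, and the example sequence is the one used on the word-counting slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the fraction of G and C letters in a DNA string (0 for an empty string).
sub gc_content {
    my ($dna) = @_;
    my $length = length $dna;
    return 0 if $length == 0;

    # tr/// with an empty replacement list counts matching characters
    # without changing the string.
    my $gc = ($dna =~ tr/GCgc//);
    return $gc / $length;
}

my $dna = 'ATGGCTGACT';
printf "G+C content of %s: %.1f%%\n", $dna, 100 * gc_content($dna);
```

The example sequence has three G and two C among ten letters, so the script prints a G+C content of 50.0%.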

http://helixweb.nih.gov/emboss/html/

Counting words
ATGGCTGACT
A, T, G, G, C, T, G, A, C, T
AT, TG, GG, GC, CT, TG, GA, AC, CT
ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
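The word lists above come from sliding a window of fixed size along the sequence one letter at a time. A minimal sketch of that counting, using a hash as introduced earlier in the course; the sub name count_words is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping word of length $k in $dna.
# Returns a hash mapping each word to its number of occurrences.
sub count_words {
    my ($dna, $k) = @_;
    my %count;
    # The last window starts at length - k; the range is empty if k is too large.
    for my $i (0 .. length($dna) - $k) {
        $count{ substr($dna, $i, $k) }++;
    }
    return %count;
}

my $dna = 'ATGGCTGACT';
for my $k (1 .. 3) {
    my %count = count_words($dna, $k);
    print "Words of length $k: ",
        join(', ', map { "$_=$count{$_}" } sort keys %count), "\n";
}
```

For words of length 2, for instance, TG and CT are each counted twice and the other five words once, matching the list on the slide.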

www.genomatix.de/cgi-bin/tools/tools.pl

EMBOSS servers
European Molecular Biology Open Software Suite
http://pro.genomics.purdue.edu/emboss/

ORF EMBOSS NCBI

ncbi.nlm.nih.gov/gorf/gorf.html

Internal repeats
A word repeated in the sequence, long enough to not occur by chance
Can be imperfect (regular expression)
Dot plot is the best way to spot it
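A dot plot compares every word of one sequence with every word of another (or of the same sequence) and marks the matching pairs; repeats then show up as marks away from the main diagonal. The following is a minimal text-only sketch, assuming exact word matches. The sub name dot_plot, the word size of 3 and the example sequence are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a text dot plot: row i, column j is '*' when the word of length
# $word starting at position i of $seq_a equals the word at position j of $seq_b.
sub dot_plot {
    my ($seq_a, $seq_b, $word) = @_;
    my @rows;
    for my $i (0 .. length($seq_a) - $word) {
        my $row = '';
        for my $j (0 .. length($seq_b) - $word) {
            $row .= substr($seq_a, $i, $word) eq substr($seq_b, $j, $word) ? '*' : '.';
        }
        push @rows, $row;
    }
    return @rows;
}

# Made-up sequence in which ACGT occurs twice.
my $dna = 'ACGTTACGT';
print "$_\n" for dot_plot($dna, $dna, 3);
```

Comparing the sequence with itself always fills the main diagonal. The extra marks in the first two and last two rows are the repeated ACGT.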

arbl.cvmbs.colostate.edu/molkit

Predicting Genes
The most important analysis carried out on DNA sequences is gene prediction
Gene prediction requires different methods for eukaryotes and prokaryotes
Most gene-prediction methods use hidden Markov models

Predicting Genes in Prokaryotic Genomes
In prokaryotes, protein-coding genes are uninterrupted
  No introns
Predicting protein-coding genes in prokaryotes is considered a solved problem
You can expect 99% accuracy
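Because prokaryotic genes are uninterrupted, a naive scan for open reading frames (ORFs) already gives a rough first look at a sequence. The sketch below is only that: it is not how dedicated gene predictors work, it reads the forward strand only, and it assumes uppercase A, C, G, T input. The sub name find_orfs, the minimum-length parameter and the example sequence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report every stretch that starts with ATG and runs, codon by codon,
# to the first in-frame stop codon (TAA, TAG or TGA).
# $min_codons is the minimum length in codons, start and stop included.
sub find_orfs {
    my ($dna, $min_codons) = @_;
    my @orfs;
    # The lookahead lets the scan restart one base later, so ORFs in all
    # three forward frames (and nested start codons) are found.
    while ($dna =~ /(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))/g) {
        my $orf = $1;
        next if length($orf) / 3 < $min_codons;
        push @orfs, { start => pos($dna) + 1, sequence => $orf };
    }
    return @orfs;
}

# Made-up example sequence containing two short ORFs.
my $dna = 'CCATGAAATTTTAGGGATGCCCTGA';
for my $orf (find_orfs($dna, 3)) {
    print "ORF at $orf->{start}: $orf->{sequence}\n";
}
```

On the example this reports ATGAAATTTTAG at position 3 and ATGCCCTGA at position 17. A real analysis would also scan the reverse complement and use a much larger minimum length.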

Finding Prokaryotic Genes with GeneMark
GeneMark is the state of the art for microbial genomes
GeneMark can
  Find short proteins
  Resolve overlapping genes
  Identify the best start codon
Use exon.gatech.edu/GeneMark
Click the "heuristic models"

Predicting Eukaryotic Genes
Eukaryotic genes (human, for example) are very hard to predict
Precise and accurate eukaryotic gene prediction is still an open problem
  ENSEMBL contains 21,662 genes for the human genome
  There may well be more genes than that in the genome, as yet unpredicted
You can expect 70% accuracy on the human genome with automatic methods
Experimental information is still needed to predict eukaryotic genes

Finding Eukaryotic Genes with GenomeScan
GenomeScan is the state of the art for eukaryotic genes
GenomeScan works best with
  Long exons
  Genes with a low GC content
It can incorporate experimental information
Use genes.mit.edu/genomescan

Producing Genomic Data
Until recently, sequencing an entire genome was very expensive and difficult
Only major institutes could do it
Today, scientists estimate that in 10 years, it will cost about $1000 to sequence a human genome
With sequencing so cheap, assembling your own genomes is becoming an option
How could you do it?

Sequencing and Assembling a Genome (I)
To sequence a genome, the first task is to cut it into many small, overlapping pieces
Then clone each piece

Sequencing and Assembling a Genome (II)
Each piece must be sequenced
Sequencing machines cannot do an entire sequence at once
They can only produce short sequences smaller than 1 Kb
These pieces are called reads
It is necessary to assemble the reads into contigs
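The idea behind joining reads is that the end of one read overlaps the start of another. A toy sketch of that single step follows: it merges two reads on their longest exact overlap. Real assemblers handle sequencing errors, both strands and many thousands of reads, none of which is attempted here. The sub names overlap_length and merge_reads, the minimum overlap of 3 and the two example reads are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Length of the longest suffix of $left that is also a prefix of $right,
# or 0 if no overlap of at least $min letters exists ($min must be >= 1).
sub overlap_length {
    my ($left, $right, $min) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    for (my $len = $max; $len >= $min; $len--) {
        return $len if substr($left, -$len) eq substr($right, 0, $len);
    }
    return 0;
}

# Merge two reads into one contig; returns undef when they do not overlap.
sub merge_reads {
    my ($left, $right, $min) = @_;
    my $len = overlap_length($left, $right, $min);
    return undef if $len == 0;
    # Keep the left read whole and append the non-overlapping tail of the right read.
    return $left . substr($right, $len);
}

# Two made-up reads that share the five letters CTGAC.
my $contig = merge_reads('ATGGCTGAC', 'CTGACTTAG', 3);
print defined $contig ? "Contig: $contig\n" : "No overlap found\n";
```

The two example reads overlap by five letters, so the merged contig is ATGGCTGACTTAG.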

Sequencing and Assembling a Genome (III)
The most popular program for assembling reads is PHRAP
Available at www.phrap.org
Other programs exist for joining smaller datasets
For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php
