1 Introduction to Perl Part III: Biological Data Manipulation.

Presentation transcript:

1 Introduction to Perl Part III: Biological Data Manipulation

2 Column data
Column-delimited data:
– Often CSV (comma-delimited)
– Or tab-, space-, or other character-delimited
Process the data in scripts instead of using Excel.
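As a minimal sketch of what "process data in scripts" can look like, the snippet below splits one delimited line into fields. The helper name `split_fields` and the sample rows are made up for illustration and are not from the slides. A plain `split` like this does not understand quoted CSV fields; for those, a CSV module such as Text::CSV from CPAN is the usual choice.

```perl
use strict;
use warnings;

# Split one line of delimited text into fields.
# The delimiter is passed as a plain string ("\t", ",", ...).
sub split_fields {
    my ($line, $delim) = @_;
    chomp $line;
    # \Q...\E makes the delimiter literal, so '|' or '.' are safe to use;
    # the -1 limit keeps trailing empty fields instead of dropping them.
    return split /\Q$delim\E/, $line, -1;
}

# Hypothetical sample rows, not taken from the slides.
my @tab   = split_fields("geneA\t12\t0.5\n", "\t");
my @comma = split_fields("geneB,7,1.25\n", ",");

print join(" | ", @tab), "\n";
print join(" | ", @comma), "\n";
```

Passing the delimiter as an argument keeps the same code usable for tab, comma, or any other single-string separator.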

3 GFF formats
9 columns, tab delimited: seq_id, source, feature, start, end, score, strand, frame, group
Specification: mats/GFF/GFF_Spec.shtml
There are actually 3 or 4 different versions; GFF3 will hopefully become the new standard.
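A rough sketch of reading one GFF line into named fields, using the column names listed on the slide. The function name `parse_gff_line` and the sample line are hypothetical, and the sketch assumes all nine columns are present; real files in the different GFF versions may need more lenient handling.

```perl
use strict;
use warnings;

# Column names as listed on the slide.
my @GFF_FIELDS = qw(seq_id source feature start end score strand frame group);

# Parse one tab-delimited GFF line into a hash reference.
# Returns undef for comments, blank lines, and lines without nine columns.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;
    my @cols = split /\t/, $line, 9;    # at most nine fields
    return unless @cols == 9;
    my %rec;
    @rec{@GFF_FIELDS} = @cols;          # hash slice: name each column
    return \%rec;
}

# Hypothetical feature line with invented values.
my $rec = parse_gff_line("chrI\tBLAST\tHSP\t100\t250\t1e-20\t+\t.\tTarget=subj1+5+155\n");
print "$rec->{seq_id} $rec->{start}..$rec->{end} ($rec->{strand})\n";
```

Naming the columns once means later code can say `$rec->{strand}` instead of remembering that strand is column 6.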

4 BLAST -m9 output
The columns are:
– Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
Lines starting with '#' are comments.

# TBLASTN [May ]
# Query: GLEAN_08256_1 pchr_1:join(complement( ),complement( ),complement( ),complement( ),complement( ),complement( ),complement( ),complement( ),comp
# Database: /data/blast/cryptococcus_neoformans_JEC fa
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr e
GLEAN_08256_1 cn-jec21_chr
GLEAN_08256_1 cn-jec21_chr
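The "# Fields:" comment line in this kind of output can itself supply the column names, so a script need not hard-code column numbers. The sketch below is one way to do that; `field_index` is a hypothetical helper, and the hit values are invented for illustration.

```perl
use strict;
use warnings;

# Turn a "# Fields: a, b, c" comment line into a name => column-index map.
sub field_index {
    my ($fields_line) = @_;
    chomp $fields_line;
    $fields_line =~ s/^#\s*Fields:\s*//;        # drop the leading label
    my @names = split /\s*,\s*/, $fields_line;  # names are comma separated
    my %index;
    $index{ $names[$_] } = $_ for 0 .. $#names;
    return \%index;
}

my $fields = "# Fields: Query id, Subject id, % identity, alignment length, "
           . "mismatches, gap openings, q. start, q. end, s. start, s. end, "
           . "e-value, bit score";
my $idx = field_index($fields);

# Hypothetical hit line with invented values, for illustration only.
my $line = join("\t", qw(q1 s1 87.5 120 15 0 1 120 300 419 1e-40 160));
my @cols = split /\t/, $line;
my $pid  = $cols[ $idx->{'% identity'} ];
print "percent identity: $pid\n";
```

Looking columns up by name costs one hash access per field and keeps the script readable if the column order ever differs.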

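6 Process data, filter by a percent id, print GFF
    open(IN, '<', $filename) || die "$filename: $!";
    while ( <IN> ) {
        next if /^#/;                   # skip comment lines
        chomp;
        my @cols = split(/\t/, $_);
        # $min_percent_id is a placeholder name; the cutoff is not legible on the slide
        next if $cols[2] < $min_percent_id;
        # start/end are assumed to come from the query columns; this step is not legible on the slide
        my ($start, $end, $strand) = ($cols[6], $cols[7], '+');
        if ( $start > $end ) {
            ($start, $end, $strand) = ($end, $start, '-');
        }
        print join("\t", $cols[0], 'BLAST', 'HSP', $start, $end, $cols[10],
                   $strand, '.',        # strand and frame, so all nine GFF columns are present
                   "Target=$cols[1]+$cols[8]+$cols[9]"), "\n";   # Target=subject+start+end
    }
BLAST cols: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GFF cols: seq_id, source, feature, start, end, score, strand, frame, group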

7 Microarray data
Lots of columns, R/G channels
Want to:
– add a log transform
– sort the data
– get a subset of the data

8 Filter rows
ORF Name G1 G1.Bkg R1 R1.Bkg F1 G2 G2.Bkg R2 R2.Bkg F2 G3 G3.Bkg R3 R3.Bkg F3 G4 G4.Bkg R4 R4.Bkg F4 G5 G5.Bkg R5 R5.Bkg F5 G6 G6.Bkg R6 R6.Bkg F6 G7 G7.Bkg R7 R7.Bkg F7 G1.Ratio G1.Ratio G2.Ratio G3.Ratio G4.Ratio G5.Ratio G6.Ratio G7.Ratio R1.Ratio R2.Ratio R3.Ratio R4.Ratio R5.Ratio R6.Ratio R7.Ratio G1-Bkg G2-Bkg G3-Bkg G4-Bkg G5-Bkg G6-Bkg G7-Bkg R1-Bkg R2-Bkg R3-Bkg R4-Bkg R5-Bkg R6-Bkg R7-Bkg
YHR007C ERG

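9 Filter rows
    # <> reads the file named on the command line (an assumption; the filehandle is not legible on the slide)
    my $header = <>;
    my @header = split(/\s+/, $header);
    my $i = 0;
    my %header_col_num = map { $_ => $i++ } @header;   # column name => column number
    my $index = $header_col_num{'G2.Ratio'};
    my @rows;
    while( <> ) {
        my @col = split;
        if( $col[$index] > 2 ) {
            push @rows, [@col];                        # keep rows whose ratio exceeds 2
        }
    }
    # sort the kept rows numerically by the ratio column
    for my $row ( sort { $a->[$index] <=> $b->[$index] } @rows ) {
        print $row->[$header_col_num{'ORF'}], " ", $row->[$index], "\n";
    }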

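10 Add a column
    sub log_2 { return log($_[0]) / log(2); }   # log base 2 from the natural log
    # <> reads the file named on the command line (an assumption; the filehandle is not legible on the slide)
    my $header = <>;
    my @header = split(/\s+/, $header);
    my $i = 0;
    my %header_col_num = map { $_ => $i++ } @header;
    my $index = $header_col_num{'G2.Ratio'};
    my @rows;
    while( <> ) {
        my @col = split;
        my $extra_col = log_2($col[$index]);    # the ratio must be > 0 for log()
        push @rows, [$col[0], $col[$index], $extra_col];
    }
    # each stored row has only three columns, so sort on the ratio at position 1
    for my $row ( sort { $a->[1] <=> $b->[1] } @rows ) {
        print join("\t", @$row), "\n";
    }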

11 Motif finding with regexps
Want to find a binding site motif in a DNA sequence
Find a motif in a protein sequence

12 Let's find an SBF binding site
SBF binding site in yeast:
– CACGAAA and CGCGAAA
– Combine these into C[AG]CGAAA
Search the DNA sequence for these sites
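To illustrate how the character class folds the two site variants into one pattern, here is a small sketch. The helper `has_sbf_site` and the three test sequences are made up for this example.

```perl
use strict;
use warnings;

# [AG] matches either A or G at the second position,
# so this one pattern covers both CACGAAA and CGCGAAA.
my $sbf = qr/C[AG]CGAAA/;

# Return 1 if the sequence contains an SBF site, 0 otherwise.
sub has_sbf_site {
    my ($seq) = @_;
    return $seq =~ $sbf ? 1 : 0;
}

# Hypothetical sequences: the first two contain a site, the third does not.
for my $seq (qw(TTCACGAAATT TTCGCGAAATT TTCTCGAAATT)) {
    printf "%s %s\n", $seq, has_sbf_site($seq) ? "match" : "no match";
}
```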

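13 Find one motif
    my $dna;
    my $seen;
    # <> reads the FASTA file named on the command line (an assumption; the filehandle is not legible on the slide)
    while( <> ) {
        if( /^>/ ) {
            last if ( $seen );      # stop at the second record
            $seen = 1;
            next;                   # do not append the header line to the sequence
        }
        chomp;
        $dna .= $_;
    }
    if( $dna =~ /(C[AG]CGAAA)/ ) {
        # found the site but how to
        # say where it is in the sequence?
    }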

14 More special variables
` - back quote (same key as ~)
' - single quote (same key as ")
$` - the stuff before the match
$' - the stuff after the match
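A short sketch of these variables in action, together with `$&` (the matched text itself) and the built-in offset arrays `@-` and `@+`, which give the start and end offsets of the last match without copying any text. The helper names `match_parts` and `match_span` and the sample sequence are hypothetical.

```perl
use strict;
use warnings;

# Return the text before, inside, and after the first match of $re in $text.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($`, $&, $');
}

# Return the 0-based start offset and the end offset (one past the last
# matched character) of the first match, using the @- and @+ arrays.
sub match_span {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($-[0], $+[0]);
}

# Hypothetical sequence for illustration.
my ($pre, $match, $post) = match_parts("AAACACGAAATTT", qr/C[AG]CGAAA/);
print "before: $pre\nmatch:  $match\nafter:  $post\n";

my ($from, $to) = match_span("AAACACGAAATTT", qr/C[AG]CGAAA/);
print "offsets: $from..$to\n";
```

Note that `length($`)` and `$-[0]` are the same number: both are the 0-based offset where the match begins.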

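15 Find one motif
    if( $dna =~ /(C[AG]CGAAA)/ ) {
        my $location = length($`);      # 0-based offset where the match starts
        # report 1-based, inclusive start..end coordinates
        printf "%s found at %d..%d\n", $1, $location + 1, $location + length($1);
    }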

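16 Find multiple instances
    # /g continues from the end of the previous match; /i ignores case
    while( $dna =~ /(C[AG]CGAAA)/ig ) {
        my $location = length($`);      # 0-based offset where this match starts
        print "$1 found at $location\n";
    }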

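17 What about reverse strand?
    my $len = length($dna);
    $dna = reverse($dna);               # reverse ...
    $dna =~ tr/CAGT/GTCA/;              # ... and complement
    if( $dna =~ /(C[AG]CGAAA)/ ) {
        my $location = length($`);      # 0-based offset on the reverse complement
        # convert to 1-based forward-strand coordinates, printed high..low
        printf "%s found at %d..%d\n", $1,
               $len - $location, $len - $location - length($1) + 1;
    }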
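Putting the two strands together, the sketch below searches a sequence and its reverse complement and reports every site in 1-based forward-strand coordinates. The names `revcomp` and `find_sites` and the sample sequence are hypothetical; the coordinate convention (1-based, inclusive) is an assumption, since the slides do not state one. Only non-overlapping matches are found.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (upper or lower case).
sub revcomp {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Find SBF sites on both strands.
# Returns a list of [start, end, strand] with 1-based forward-strand coordinates.
sub find_sites {
    my ($dna) = @_;
    my $len = length $dna;
    my @hits;
    while ( $dna =~ /C[AG]CGAAA/g ) {
        push @hits, [ $-[0] + 1, $+[0], '+' ];
    }
    my $rc = revcomp($dna);
    while ( $rc =~ /C[AG]CGAAA/g ) {
        # offsets in $rc count from the far end of the forward strand,
        # so subtract them from the sequence length
        push @hits, [ $len - $+[0] + 1, $len - $-[0], '-' ];
    }
    return @hits;
}

# Hypothetical sequence: the site sits on the reverse strand.
for my $hit ( find_sites("AATTTCGTGCC") ) {
    printf "%d..%d (%s)\n", @$hit;
}
```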

18 Making reports
Text reports are great for summarizing output
HTML is an easy and excellent way to summarize output and make it pretty
HTML also allows for linking to other resources

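19 HTML with CGI.pm
    # CGI.pm left the Perl core in 5.22; on newer Perls install it from CPAN
    use CGI qw/:standard/;   # imports the standard functions, much like "using namespace std;" in C++
    open(OUT, '>', 'report.html') || die $!;
    # no header() call here: the HTTP header belongs in a CGI response, not in a saved file
    print OUT start_html('Motifs found'),
              h1('Motifs found'),
              table(Tr(th(["Motif", "Chrom", "Location"])),
                    Tr(td(["CACGAAA", "I", " "]))),
              hr,
              end_html;
    close(OUT);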