Pyrosequencing for Metagenomics: accessing and organizing raw data Giuseppe D’Auria FISABIO, Valencia Norwich 08-12 September 2014.

Presentation transcript:

Practice workflow
We will start from a single SFF (Standard Flowgram Format) file containing a metagenome and a metatranscriptome experiment, labelled by two MIDs (Multiplex Identifiers).
1) Organize data and folders
2) Extract fasta and quality files belonging to each dataset
3) Recruitment protocol by MUMmer
4) Assembly protocol via MIRA
5) Searching for rRNAs
6) Cluster 16S rRNA
7) Annotate 16S rRNA
8) Search for tRNA

Practice workflow, current steps: organize data and folders; extract fasta and quality files belonging to each dataset.

Exercise 2: extract fasta and quality files belonging to each dataset
Extracting MIDs: SFF → FASTA file, FASTA qual → mid_fasta_file
Identify MIDs and separate FASTA and FASTA quality files:
SFF → sff_extract → bin_fasta_on_mid_primers.pl → FASTA-Mid1, QUALITY-Mid1; FASTA-Mid2, QUALITY-Mid2; ... FASTA-MidX, QUALITY-MidX
1) Use sff_extract to extract the sequences from the SFF file. The -c parameter removes the adaptor sequences so that the MIDs can be identified.
2) Use bin_fasta_on_mid_primers.pl to separate the reads by MID.

Open the terminal
Extract fasta and quality files belonging to each dataset
# Go to data folder
cd data
# Create project2 folder
mkdir project2
# Go to project2 folder
cd project2
# Link SFF file
ln -s ~/data/Sequences/dataset2.sff ~/data/project2/dataset2.sff
# Extract FASTA and QUALITY from sff
sff_extract -c -A dataset2.sff
# Sort reads by MIDs
bin_fasta_on_mid_primers.pl -r dataset2.fasta -q dataset2.fasta.qual -m ../Sequences/mids.fas -b out
Resulting files:
out_midi_CCAACC → Metagenome
out_midi_CGCCAT → Metatranscriptome
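
To make the binning step concrete, here is a minimal sketch of what sorting reads by MID amounts to. It is not bin_fasta_on_mid_primers.pl: the function name bin_by_mid is hypothetical, and the sketch assumes one sequence line per read, an exact MID match at the very start of the read, and no trimming of the MID.

```shell
# Hypothetical helper (not the workshop script): split a FASTA file into
# one file per MID. Usage: bin_by_mid reads.fasta out_prefix MID1 MID2 ...
# Assumes each record is a header line followed by ONE sequence line and
# that the MID is the first bases of the read (exact match, MID kept).
bin_by_mid() {
    fasta=$1; prefix=$2; shift 2
    awk -v prefix="$prefix" -v mids="$*" '
        BEGIN { n = split(mids, mid, " ") }
        /^>/  { header = $0; next }
        {
            for (i = 1; i <= n; i++)
                if (index($0, mid[i]) == 1) {
                    out = prefix "_" mid[i] ".fasta"
                    print header > out
                    print $0 > out
                    break
                }
        }' "$fasta"
}
```

Called as bin_by_mid dataset2.fasta out_midi CCAACC CGCCAT, it would write out_midi_CCAACC.fasta and out_midi_CGCCAT.fasta; reads matching neither MID are silently dropped, and quality files are not handled.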

Open the terminal
Extract fasta and quality files belonging to each dataset
# Create Metagenome folder
mkdir metage
# Create Metatranscriptome folder
mkdir metatra
# Move project files in folders
mv out_midi_CCAACC.fasta* metage/
mv out_midi_CGCCAT.fasta* metatra/
# Go to Metagenome folder
cd metage
# Take a look at the folder
ls -ltr
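
Before moving on, it can be worth confirming that each FASTA file still has a matching quality file. The sketch below assumes the naming used in this exercise (the quality file is the FASTA name plus .qual); check_fasta_qual is a hypothetical helper, not one of the workshop tools.

```shell
# Hypothetical helper: succeed only if FILE and FILE.qual contain the same
# number of records (lines starting with ">").
check_fasta_qual() {
    n_fasta=$(grep -c '^>' "$1")
    n_qual=$(grep -c '^>' "$1.qual")
    echo "$1: $n_fasta reads, $n_qual quality records"
    [ "$n_fasta" -eq "$n_qual" ]
}
```

For example, check_fasta_qual out_midi_CCAACC.fasta inside the metage folder prints both counts and returns a non-zero status if they differ.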

Practice workflow, current step: recruitment protocol by MUMmer.

Open the terminal
Mapping and recruitment graph
# Link file to simpler name
ln -s out_midi_CCAACC.fasta metage.fas
# Mapping of reads on reference genome
# Obtaining mapping coordinates
nucmer --prefix=recruit ../../References/reference.fasta metage.fas --coords
# Obtaining mapping image (postscript)
mummerplot recruit.delta -R ../../References/reference.fasta -Q metage.fas --coverage --postscript -p recruit
# Visualizing mapping
evince recruit.ps &
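
A quick way to put a number on the recruitment is to count how many distinct reads appear in the coordinates file. This is a rough sketch, assuming the file written by --coords is recruit.coords and follows the usual show-coords layout (a few header lines, a row of "=" characters, then one alignment per line with the read name as the last field). The function name is hypothetical.

```shell
# Hypothetical helper: count distinct query (read) names in a coords file.
# Assumes every line after the "=====" rule is one alignment whose last
# whitespace-separated field is the read name.
count_recruited_reads() {
    awk 'seen_rule && NF { print $NF }
         /^=+$/          { seen_rule = 1 }' "$1" |
    sort -u | wc -l | tr -d ' '
}
```

Running count_recruited_reads recruit.coords and comparing the result with grep -c '^>' metage.fas gives the fraction of reads recruited by the reference.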

Practice workflow, current step: assembly protocol via MIRA.

# Linking metagenome file for assembly
ln -s out_midi_CCAACC.fasta metage_in.454.fasta
ln -s out_midi_CCAACC.fasta.qual metage_in.454.fasta.qual
ln -s ../dataset2.xml metage_traceinfo_in.454.xml
# Start de novo assembly
mira --project=metage --job=denovo,genome,draft, _SETTINGS -LR:ft=fasta
# Go to results folder
cd metage_assembly
cd metage_d_results
# Take a look at the results with the Tablet assembly viewer
tablet metage_out.ace &
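
Besides browsing the assembly, a few summary numbers help to judge it. The sketch below computes contig count, total length and N50 from any contigs FASTA file; assembly_stats is a hypothetical helper, and the name of the contigs file in your results folder is an assumption to check with ls.

```shell
# Hypothetical helper: print contig count, total length and N50 of a FASTA file.
assembly_stats() {
    # First pass: one length per contig (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second pass: N50 is the length at which the running sum of the
    # descending lengths reaches half of the total.
    awk '{ l[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += l[i]
                 if (run * 2 >= total) { n50 = l[i]; break }
             }
             printf "contigs=%d total=%d N50=%d\n", NR, total, n50
         }'
}
```

The result is a single line such as contigs=4 total=20 N50=6, which is easy to compare between assemblies run with different settings.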

Practice workflow, current step: searching for rRNAs.

Searching for rRNAs
# Go back to the project folder, then to the Metatranscriptome folder
cd ../../../
cd metatra
# Link needed files
ln -s out_midi_CGCCAT.fasta metatra.fas
# Searching for 16S sequences
rna_hmm3.py -i metatra.fas -m ssu -o metatra_16S -L
# Extract 16S sequences from the 16S table
extract_sequences_by_list.pl -f metatra.fas -t metatra_16S -c 0 -o -d 1
extract_sequences_by_list.pl → one of my Perl scripts
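
The extraction step boils down to keeping the FASTA records whose identifiers appear in a table. A minimal sketch of that idea with awk follows; it is not extract_sequences_by_list.pl, the function name extract_by_list is hypothetical, and the column number here is 1-based (awk style), which may differ from the -c convention of the Perl script.

```shell
# Hypothetical helper: print FASTA records whose ID is listed in a table.
# Usage: extract_by_list reads.fasta table [id_column, default 1]
# The ID is the header text after ">" up to the first blank.
extract_by_list() {
    fasta=$1; list=$2; col=${3:-1}
    awk -v col="$col" '
        NR == FNR { if (NF >= col) wanted[$col] = 1; next }
        /^>/      { id = substr($1, 2); keep = (id in wanted) }
        keep' "$list" "$fasta"
}
```

The function writes to standard output, so the selected reads can be redirected to a new FASTA file; it assumes the table is not empty.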

Practice workflow, current step: cluster 16S rRNA.

Clustering
# Filtering out chimeras
#ChimeraSlayer.pl --query_FASTA 16S.list.fasta
# Clustering 16S sequences
cdhit -i 16S.list.fasta -o 16Sc90s90 -c 0.9 -s 0.9 -bak 1
cd-hit_translate.pl 16Sc90s90.bak.clstr > 16S.tab
cd-hit_translate.pl → another of my Perl scripts
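
To see how the reads are distributed over the clusters, the cluster file can be summarized directly. The sketch below assumes the standard CD-HIT .clstr layout (a ">Cluster N" line followed by one line per member), which should be the file 16Sc90s90.clstr here rather than the .bak.clstr backup; cluster_sizes is a hypothetical helper.

```shell
# Hypothetical helper: print "cluster_number<TAB>member_count", largest first.
cluster_sizes() {
    awk '/^>Cluster/ { cluster = $2; size[cluster] = 0; next }
         NF          { size[cluster]++ }
         END         { for (c in size) print c "\t" size[c] }' "$1" |
    sort -t "$(printf '\t')" -k2,2nr -k1,1n
}
```

Piping the output through head shows the most abundant clusters, which are usually the first candidates for annotation.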

Practice workflow, current step: annotate 16S rRNA.

Annotate 16S rRNA
# 16S assignment by RDP classifier
java -jar ~/Software/rdp_classifier_2.2/rdp_classifier-2.2.jar -q 16S.remain.fasta -o 16S_rdp -f fixrank

Practice workflow, current step: search for tRNA.

Searching for tRNAs
# Searching for tRNAs
tRNAscan-SE -B 16S.remain.fasta > tRNAs.tab
# Extract tRNAs sequences from the tRNAs table
extract_sequences_by_list.pl -f 16S.remain.fasta -t tRNAs.tab -c 0 -o tRNAs -d 1
extract_sequences_by_list.pl → another of my Perl scripts
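
A simple follow-up is to tally the predicted tRNAs by type. This sketch assumes the tabular tRNAscan-SE output starts with three header lines and that the tRNA type is the fifth whitespace-separated column; both points are assumptions to verify against your own tRNAs.tab. count_trna_types is a hypothetical helper.

```shell
# Hypothetical helper: count predicted tRNAs per type.
# Assumes 3 header lines and the tRNA type in column 5.
count_trna_types() {
    awk 'NR > 3 && NF >= 5 { n[$5]++ }
         END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

Running count_trna_types tRNAs.tab prints one line per tRNA type with its count, sorted alphabetically.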

Running out of physical limits

For INTREPID and BRAVE people

Perl is a scripting language widely used for system administration and programming on the World Wide Web. It originated in the UNIX community and has a strong UNIX slant, but usage on Windows has grown rapidly. ActivePerl is a quality-assured binary distribution of Perl for popular UNIX platforms and Windows. perl (small 'p') is the program used to interpret the Perl language.

For INTREPID and BRAVE people II
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

Thank you again for your attention