Introduction into the processing of raw data
Giuseppe D'Auria, FISABIO, Valencia
Norwich, 08-12 September 2014



Data Storage: Size ranges
Sanger sequencing: datasets in the order of thousands of sequences
454: datasets in the order of hundreds of thousands of sequences
Illumina: datasets in the order of millions of sequences
SOLiD: datasets in the order of xxx millions of sequences

Data Storage: BackUp
We spend much more money on sequencing than on securing the data we obtain!
Think about backup.
Our PC/server: Time Machine, rsync, cron, etc.
A few euros: daily backup of the PC
A few euros: weekly backup of the PC

Data Storage: Disk structure
Folder tree shown on the slide: tmp, arg1, biblio, 20XX, Data new, Final1, Analysis new, Analysis new2, Analysis new 3, Final2, Final, backup, backup2, data, data2, tmp

Data Storage: Project folder
Avoid copying, copying again, and making security copies of data that is not useful.
It is better to use symbolic links that just point to the big data files you need:
> ln -s TARGET LINK_NAME
Disk structure of the project folder:
References
Original sequence data
Filtered sequences
TXT
Analysis: Analysis 1, Analysis 1.1, Analysis 1.2
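The symbolic-link advice can also be scripted. The following is a minimal Python sketch, not part of the original slides: the function name `link_raw_data`, the `original_data` subfolder name, and the stand-in file are assumptions chosen for illustration. It links every file of a source folder into a project folder instead of copying it.

```python
import tempfile
from pathlib import Path


def link_raw_data(source_dir, project_dir):
    """Symlink every entry of source_dir into project_dir/original_data.

    Hypothetical helper: the folder name 'original_data' is an assumption.
    """
    target_dir = Path(project_dir) / "original_data"
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sorted(Path(source_dir).iterdir()):
        link = target_dir / src.name
        # Only create the link if nothing with that name is there yet.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(src.resolve())
        links.append(link)
    return links


# Demonstration in a throwaway directory with a small stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sequences = root / "Sequences"
    sequences.mkdir()
    (sequences / "dataset1.fasta").write_text(">read1\nACGT\n")
    links = link_raw_data(sequences, root / "project")
    demo_is_link = links[0].is_symlink()
    demo_content = links[0].read_text()
```

The link costs almost no disk space, and reading through it returns the content of the original file, so analysis folders can share one copy of the big data files.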

The system: Windows or Linux?
Both allow good bioinformatics analysis.
Linux is more stable for massive data crunching, and it is FREE.
Most of the software works on both systems, but several tools work exclusively on Linux.
Windows is not FREE.
The best setup for bioinformatics (just my personal advice): a Linux desktop system (Ubuntu or Fedora) + a virtual machine (VirtualBox).

Data Formats: FASTA and QUAL
QUAL
>G12OEMT03CWVU
>G12OEMT03DH3XQ
>G12OEMT03DD28C
>G12OEMT03DGQ
>G12OEMT03C0MSF
FASTA
>G12OEMT03CWVU1
AGAGTTTGATCATGGCTCAGGATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGGAG
CCTTCGGGCTTCGACCGGCGTACGGGTGCGTAACG
>G12OEMT03DH3XQ
AGAGTTTGATCATGGCTCAGTGCCAGCCGCCGCGGGAGCGCATTAG
>G12OEMT03DD28C
AGAGTTTGATCCTGGCTCAGGGTGGTCATATGTTTGGAATTGGTGCCAGCCGCCGCGGGAGCGCATT
AG
>G12OEMT03DGQ48
AGAGTTTGATCATGGCTCAGGAGGTGCCAGCAGCCGCGGAGCGCATTAG
>G12OEMT03C0MSF
AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCTGAA
GCTTGGCGCTTGCACCGAGCGGATG
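To make the relation between the two files concrete, here is a minimal Python sketch (not from the slides) that pairs a FASTA record with its QUAL record and writes a FASTQ record using the Phred+33 encoding. The read name `read1` and its values are made up for illustration, and the helper names are hypothetical; the workshop itself uses PRINSEQ for this conversion.

```python
def read_records(text):
    """Parse FASTA-style text into {read_id: [payload lines]}, keeping file order."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read ID is the first word after '>'.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return records


def fasta_qual_to_fastq(fasta_text, qual_text, offset=33):
    """Combine FASTA sequences and QUAL scores into FASTQ text (Phred+offset)."""
    seqs = read_records(fasta_text)
    quals = read_records(qual_text)
    out = []
    for read_id, seq_lines in seqs.items():
        sequence = "".join(seq_lines)
        # QUAL stores one space-separated integer per base, possibly over several lines.
        scores = [int(q) for q in " ".join(quals[read_id]).split()]
        if len(scores) != len(sequence):
            raise ValueError(
                f"{read_id}: {len(sequence)} bases but {len(scores)} quality values"
            )
        quality = "".join(chr(q + offset) for q in scores)
        out.append(f"@{read_id}\n{sequence}\n+\n{quality}\n")
    return "".join(out)


# Made-up example data, wrapped over two lines like real 454 output.
example_fasta = ">read1\nACGT\nAC\n"
example_qual = ">read1\n40 40 30 20\n10 2\n"
example_fastq = fasta_qual_to_fastq(example_fasta, example_qual)
```

The length check matters in practice: a FASTA and a QUAL file only belong together if every read has exactly one quality value per base.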

Data Formats: SFF - Standard Flowgram Format
>G12OEMT03CWZL8
Run Prefix: R_2011_05_03_06_02_36_
Region #: 3
XY Location: 1078_3006
Run Name: R_2011_05_03_06_02_36_FLX _Administrator_RUN19
Analysis Name: D_2011_05_04_03_47_28_NAVELINA_signalProcessingAmplicons
Full Path: /data/R_2011_05_03_06_02_36_FLX _Administrator_RUN19/D_2011_05_04_03_47_28_NAVELINA_signalProcessingAmplicons/
Read Header Len: 32
Name Length: 14
# of Bases: 518
Clip Qual Left: 16
Clip Qual Right: 397
Clip Adap Left: 0
Clip Adap Right: 0
Flowgram:
Flow Indexes:
Bases:
gactacgagtagactCCATTTGATTCGAATGTCTGTTGGCGTAGGATTTCGGAGAGCACGTTTGCGATACGCGTATCTGCTGCTCCGCGGAAAGAATTTAAAAACCGGTGAAATTACGCAGGATGTGCGTGAAGAGAATCTGAGAAT
TTTCAAAGAATCTTTAGACATGGTAACCAATCTCAATAACTGGCATGCCTTCATGAATCTTTTTGCTTCTGCAGGCTATTTGAAAGGCAGCCTGGTGGCATCATCCAATGCGGTAGTTTTCAGCTATGTTTTATATCTGATCGGAA
AATATGAGTATAAAGTATCGTCTGTTGAACTTCAGAAATTATTCGTAAATGGTATTTTTATGTCTACGTATTACTGGTATTTTATACGGGTATCTACAGAATCAgaggttagaaaactagtttgctgatttgcgagatgtccatcatgcagatgaattcgtatc
atatctgaattctgttatcggcaaccgtatttaacggatgacttactttgtttattcgtcg
Quality Scores:

Output: FASTQ
Lines of a FASTQ record: SequenceID, Sequence, Optional (+), Quality
1:N:0:ATTTCT
ATCTGACCGCCGCATTTGATGCAGTAAATTATTTATATGAGCAAGGGCATA
1:N:0:ATTCCT
1:Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
1:N:0:ATTCCT
TTCAGTTTGTGATGTGCGACGATGGTTCGCTCANGCGNCTNNNGTTCTGCG
+
1:N:0:ATTCCT
CTCCACACTAACAATACCGTTCCCCAGGTGGTATCGCCAGNNCAGTAGAGC
1:N:0:ATTCCT
GCCGCCCAGCTGAAAAACATCATCATGCTGATCNNNANTNNNNNAGGCAGA

Output formats: FASTQ
@EAS139:136:FC706VJ:2:2104:15343: :Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
Fields of the sequence ID line, in order:
EAS139: Unique instrument name
136: Run id
FC706VJ: Flowcell id
2: Flowcell lane
2104: Tile number within the flowcell lane
15343: 'x'-coordinate of the cluster
'y'-coordinate of the cluster
The mate member of a pair
Y: Y if the read fails filter (read is bad), N otherwise
18: Control bits
ATCACG: Index sequence
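The field list above maps directly onto a small parser. This is a hedged Python sketch, not code from the workshop: the class and function names are hypothetical, and the header string is invented for illustration (it is not the record shown on the slide). It assumes the layout described above, with seven colon-separated fields, a space, and four more colon-separated fields.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str       # unique instrument name
    run_id: int
    flowcell_id: str
    lane: int             # flowcell lane
    tile: int             # tile number within the flowcell lane
    x: int                # 'x'-coordinate of the cluster
    y: int                # 'y'-coordinate of the cluster
    mate: int             # the mate member of a pair
    fails_filter: bool    # True when the flag is 'Y' (read is bad)
    control_bits: int
    index_sequence: str


def parse_header(line):
    """Split one FASTQ sequence ID line into its named fields."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ sequence ID line must start with '@'")
    location, description = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    mate, filtered, control_bits, index_sequence = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell_id, int(lane), int(tile),
        int(x), int(y), int(mate), filtered == "Y", int(control_bits),
        index_sequence,
    )


# Invented header, used only to exercise the parser.
hypothetical_header = "@INSTR1:7:FLOWCELLX:2:1101:1000:2000 1:N:0:ATCACG"
parsed = parse_header(hypothetical_header)
```

With the fields named, filtering is a one-liner, for example keeping only reads where `fails_filter` is false.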

Output formats: FASTQ
Quality encodings (diagram of the ASCII character range used by each encoding):
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)
Example record:
1:Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
$Q_{\text{phred}} = -10 \log_{10}(e)$, where $e$ is the estimated probability of a base being wrong.
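As a worked example of the formula, the sketch below (Python, not from the slides) decodes the quality line of the example record, assuming the Phred+33 encoding, and converts a score back into an error probability. The function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(e) to get e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Quality line of the example record, read as Phred+33.
quality_line = "CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH"
scores = phred_scores(quality_line)

lowest = min(scores)    # the '#' characters
highest = max(scores)   # the 'J' characters
```

Under this assumption the `#` characters decode to Q2 and sit under most of the N bases of the read, while `J` decodes to Q41, the top of the range listed for Illumina 1.8+. Reading the same string with the wrong offset (64 instead of 33) would give negative scores, which is a quick way to spot a mismatched encoding.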

Output formats
Illumina (Solexa): FastQ
SOLiD: FastQ
454: Fasta + Qual, FastQ, SFF (Standard Flowgram Format)
Now we can go to our VirtualBox machine:
Quality assessment and sequence filtering
Project definition and folder structuring

Open the Virtual Machine
Double-click the VirtualBox icon.
If the machine is not already imported: follow me.
Turn on your virtual machine embo2013.

Some basic Linux commands
Upper case and lower case are different!

Some basic Linux commands
# Take a look at the sequences
cd data/Sequences
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
# Go back one folder
cd ..
# Create the project folder
mkdir project
# Change directory to "project"
cd project
# Create the original_data directory
mkdir original_data
# Create the filtered data directory
mkdir passed
# Link data from the Sequences folder in /home/embo/Sequences
ln -s /home/embo/Sequences/* original_data/
# Go to the original_data folder
cd original_data
# Take a look at the folder
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual

Quality assessment
# Take a look at the folder
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
# Convert FASTA + QUAL to FASTQ
prinseq-lite.pl -fasta dataset1.fasta -qual dataset1.fasta.qual -out_format 3 -out_good dataset1
# Obtain the graph data file for the reports
prinseq-lite.pl -fastq dataset1.fastq -graph_data dataset1.gd -graph_stats ld,gc,qd,de
ls -ltr
# Obtain the reports
prinseq-graphs-noPCA.pl -i dataset1.gd -o dataset1 -html_all
ls -ltr
firefox dataset1.html &
# Go to the filtered data directory
cd ../passed
# Trim the low-quality end of the reads
prinseq-lite.pl -fastq ../original_data/dataset1.fastq -trim_qual_type mean -trim_qual_step 1 -trim_qual_window 20 -trim_qual_right 30 -out_good passed -out_format 3
# Obtain the graph data file for the reports
prinseq-lite.pl -fastq passed.fastq -graph_data passed.gd -graph_stats ld,gc,qd,de,da,sc
# Obtain the reports
prinseq-graphs-noPCA.pl -i passed.gd -o passed -html_all
firefox passed.html &
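To give an intuition for what the trimming options ask for (mean quality, window of 20, step of 1, threshold of 30 at the right end), here is a simplified Python sketch of sliding-window trimming. It is an illustration of the general idea only and is not PRINSEQ's actual implementation, whose exact window handling may differ; the function name and the example read are made up.

```python
def trim_right_by_mean(sequence, scores, threshold=30, window=20, step=1):
    """Shorten a read from its right end until the mean quality of the
    last `window` bases reaches `threshold`.

    Simplified illustration; PRINSEQ may treat window edges differently.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    end = len(scores)
    while end > 0:
        tail = scores[max(0, end - window):end]
        if sum(tail) / len(tail) >= threshold:
            break
        end -= step
    end = max(end, 0)
    return sequence[:end], scores[:end]


# Made-up read: 30 good bases (Q40) followed by 10 poor ones (Q2).
example_sequence = "A" * 40
example_scores = [40] * 30 + [2] * 10
trimmed_sequence, trimmed_scores = trim_right_by_mean(example_sequence, example_scores)
```

Because the decision is based on a window mean, a few low-quality bases can survive at the end of the trimmed read as long as the window average stays at or above the threshold; a smaller window trims more aggressively.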

For INTREPID and BRAVE people
Perl is a scripting language widely used for system administration and programming on the World Wide Web. It originated in the UNIX community and has a strong UNIX slant, but usage on Windows has grown rapidly.
ActivePerl is a quality-assured binary distribution of Perl for popular UNIX platforms and Windows.
perl (small 'p') is the program used to interpret the Perl language.

For INTREPID and BRAVE people II
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

Thank you again for your attention