Using command line tools to process sequencing data

Tapio Vuorenmaa, Krista Kokki

This hands-on session:
Part 1 – Bedtools
Part 2 – DEMO: Galaxy with the command line
Part 3 – Bedtools exercises

Part 1: Bedtools
Bedtools is a command line tool that makes a wide range of genomics tasks easy.
It is easy to use, which makes it a perfect example: we are focusing on using the command line and its parameters, not on the tool itself.
We will work with one bedtools command, "intersect".

Bedtools - intersect
"Do the features in my two sets overlap with each other?"
Files: "genes.bed" and "markers.bed" (in the exercises), both in .bed format.

.bed format?
A flexible way to define the data lines displayed in an annotation track.
One of the file types the Genome Browser uses.
Three required fields: chromosome, start position, end position.
Additional optional fields: name, score, strand, etc.
Our "data" is simplified.
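To make the required fields concrete, here is a minimal sketch that writes a tiny BED file and checks it with awk. The file name toy.bed, the feature names and all coordinates are made up for illustration. One detail worth knowing: BED start positions are 0-based and end positions are exclusive, so the length of a feature is end minus start.

```sh
# Work in a scratch directory so nothing in the current folder is touched.
workdir=$(mktemp -d)

# Hypothetical features: chrom, start, end, name, score, strand (tab-separated).
printf 'chr1\t100\t200\tgeneA\t0\t+\n'  > "$workdir/toy.bed"
printf 'chr1\t500\t650\tgeneB\t0\t-\n' >> "$workdir/toy.bed"
printf 'chr2\t100\t200\tgeneC\t0\t+\n' >> "$workdir/toy.bed"

# Sanity check: count lines with fewer than three fields or with start >= end.
bad_lines=$(awk -F'\t' 'NF < 3 || $2 >= $3 { n++ } END { print n + 0 }' "$workdir/toy.bed")

# Feature lengths: start is 0-based, end is exclusive, so length = end - start.
lengths=$(awk -F'\t' '{ print $4 "=" ($3 - $2) }' "$workdir/toy.bed" | tr '\n' ' ')

echo "bad lines: $bad_lines"
echo "lengths: $lengths"
```

The first three columns are all that bedtools needs; the name, score and strand columns only matter for options that use them, such as the strand-aware ones.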

Our "data"
[Figure: the simplified example data on Chromosome 1 and Chromosome 2]

Back to intersect
Question: Do my ChIP-seq peaks overlap?
The basic command:
    bedtools intersect -a genes.bed -b markers.bed > result.bed
bedtools  Program
intersect  Command
-a genes.bed  First file
-b markers.bed  Second file
> result.bed  Redirect output to a file (optional)
Optional parameters can be added to this command.

Optional parameters
-wa  Write the original feature from the first file (-a) for each overlap.
-u  Report each feature from the first file only once if it has any overlap, no matter how many overlaps are found.
-s  Only report overlaps found on the same strand.
-c  Count the number of overlaps for each feature in the first file.
-v  Complement: report the features in the first file that do not overlap anything in the second file.
-S  Only report overlaps found on the opposite strand.
And many more. See your bedtools cheat sheet.
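If bedtools is not at hand, the idea behind -c, -u and -v can be sketched with plain awk. This is a stand-in written for illustration, not how bedtools is implemented, and the contents of genes.bed and markers.bed below are hypothetical. Two features overlap when they are on the same chromosome and each one starts before the other ends.

```sh
workdir=$(mktemp -d)

# Hypothetical toy inputs (tab-separated: chrom, start, end, name).
printf 'chr1\t100\t200\tgeneA\nchr1\t500\t600\tgeneB\nchr2\t100\t200\tgeneC\n' > "$workdir/genes.bed"
printf 'chr1\t150\t160\tm1\nchr1\t190\t250\tm2\nchr2\t300\t310\tm3\n' > "$workdir/markers.bed"

# For every feature in the first file, count the overlapping features in the
# second file and append the count as a last column (the idea behind -c).
count_overlaps() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { m_chr[NR] = $1; m_start[NR] = $2; m_end[NR] = $3; n = NR; next }
       { hits = 0
         for (i = 1; i <= n; i++)
           if ($1 == m_chr[i] && $2 < m_end[i] && m_start[i] < $3) hits++
         print $0, hits }' "$2" "$1"
}

count_overlaps "$workdir/genes.bed" "$workdir/markers.bed" > "$workdir/counts.bed"

# Features with at least one overlap (the idea behind -u) ...
awk -F'\t' '$5 > 0' "$workdir/counts.bed" | cut -f1-4 > "$workdir/overlapping.bed"
# ... and features with no overlap at all (the idea behind -v).
awk -F'\t' '$5 == 0' "$workdir/counts.bed" | cut -f1-4 > "$workdir/non_overlapping.bed"

cat "$workdir/counts.bed"
```

With bedtools installed, the corresponding calls would be bedtools intersect -c, -u and -v with -a genes.bed -b markers.bed. For real data bedtools is the better choice, since this sketch compares every pair of features and ignores strand.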

That's not all there is to it
There are a number of other things you can do with bedtools, such as coverage, merge and cluster.

Part 2: DEMO: Galaxy with the command line
Step 1. Convert the .sra file into fastq
    fastq-dump NAME.sra --offset 33
--offset 33  Quality conversion (the offset used to get the quality score from the ASCII code)
Output: .fastq file
Step 2. Quality report of your data
    fastqc NAME.fastq
Output: .html file
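The quality conversion can be tried by hand. With an offset of 33, each character on the quality line of a FASTQ record stands for a score equal to its ASCII code minus 33. The helper name phred33 below is made up for this sketch.

```sh
# Decode one FASTQ quality character into its score, assuming an offset of 33.
phred33() {
  ascii=$(printf '%d' "'$1")   # numeric ASCII code of the character
  echo $((ascii - 33))
}

phred33 'I'   # 'I' is ASCII 73, prints 40
phred33 '+'   # '+' is ASCII 43, prints 10
phred33 '!'   # '!' is ASCII 33, prints 0
```

So a quality line made only of 'I' characters describes a read in which every base has a score of 40.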

DEMO: Galaxy with the command line
Step 3. Trimming (optional)
    fastx_trimmer -i NAME.fastq -o TRIMMED_NAME.fastq -f 1 -l 50
-i  Input file
-o  Output file
-f  First base to keep
-l  Last base to keep
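What -f 1 -l 50 does to a read can be sketched with awk instead of fastx_trimmer. The read below is invented (60 bases, all with the same quality), and the sketch assumes plain four-line FASTQ records in which the sequence and quality lines are not wrapped.

```sh
workdir=$(mktemp -d)

# Build one hypothetical 60-base read with a matching 60-character quality line.
awk 'BEGIN {
       for (i = 0; i < 60; i++) { s = s substr("ACGT", i % 4 + 1, 1); q = q "I" }
       print "@read1"; print s; print "+"; print q
     }' > "$workdir/NAME.fastq"

# Keep bases first..last on the sequence line and on the quality line
# (lines 2 and 4 of every record); header lines pass through unchanged.
awk -v first=1 -v last=50 '
  NR % 2 == 0 { $0 = substr($0, first, last - first + 1) }
  { print }' "$workdir/NAME.fastq" > "$workdir/TRIMMED_NAME.fastq"

# Show the line lengths after trimming.
awk 'NR % 2 == 0 { print length($0) }' "$workdir/TRIMMED_NAME.fastq"
```

The sequence and the quality string have to be cut at the same positions, otherwise the record is no longer valid FASTQ.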

DEMO: Galaxy with the command line
Step 4. Quality filtering
    fastq_quality_filter -i TRIMMED_NAME.fastq -o QFILT_NAME.fastq -q 10 -p 100
-i  Input file
-o  Output file
-q  Minimum quality score to keep
-p  Minimum % of bases that must have at least the -q quality
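The rule behind -q 10 -p 100 is that a read is kept only if 100% of its bases have a quality of at least 10. A rough awk sketch of that rule follows; it is a stand-in for illustration, not the fastq_quality_filter code. It assumes qualities stored with an offset of 33 and four-line records, and both reads are invented.

```sh
workdir=$(mktemp -d)

# read1: every base has quality 'I' (score 40).
# read2: one base has quality '#' (score 2), so it fails -q 10 -p 100.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTACGT\n+\nIIII#III\n' \
  > "$workdir/TRIMMED_NAME.fastq"

awk -v q=10 -v p=100 '
  BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # character -> ASCII code
  { rec[NR % 4] = $0 }                   # remember the four lines of the current record
  NR % 4 == 0 {                          # quality line: the record is complete
    good = 0; n = length($0)
    for (i = 1; i <= n; i++)
      if (ord[substr($0, i, 1)] - 33 >= q) good++
    if (n > 0 && 100 * good / n >= p)    # enough good bases: keep the read
      print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
  }' "$workdir/TRIMMED_NAME.fastq" > "$workdir/QFILT_NAME.fastq"

cat "$workdir/QFILT_NAME.fastq"
```

Lowering p makes the filter more tolerant: with p=80, read2 would also pass, because 7 of its 8 bases (87.5%) reach the minimum quality.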

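DEMO: Galaxy with the command line
Step 5. Collapsing identical reads and removing quality information
    fastx_collapser -i QFILT_NAME.fastq -o COLPS_NAME.fasta
-i  Input file
-o  Output file
Identical sequences are collapsed into a single sequence, and the output is in FASTA format, which carries no quality information.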
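Collapsing can be imitated with standard tools: take the sequence line of every record, count identical sequences, and write one FASTA entry per distinct sequence. The reads below are invented, and the "rank-count" header style is an assumption modeled on what fastx_collapser output usually looks like.

```sh
workdir=$(mktemp -d)

# Three hypothetical reads; r1 and r3 have the same sequence.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' \
  > "$workdir/QFILT_NAME.fastq"

# Line 2 of every four-line record is the sequence. Count identical sequences,
# put the most frequent first, and write each one once with a rank-count header.
awk 'NR % 4 == 2' "$workdir/QFILT_NAME.fastq" |
  sort | uniq -c | sort -k1,1nr -k2,2 |
  awk '{ print ">" NR "-" $1; print $2 }' > "$workdir/COLPS_NAME.fasta"

cat "$workdir/COLPS_NAME.fasta"
```

Here ACGT occurs twice and TTTT once, so the result holds two entries instead of three reads, and the quality lines are gone.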

DEMO: Galaxy with the command line
Step 6. Mapping
    bowtie //hg19 -f COLPS_NAME.fasta --best -v 2 -m 3 -k 1 -S output.sam
//hg19  Genome (bowtie index)
-f COLPS_NAME.fasta  Query input files (-f = FASTA)
--best  Report results in best-to-worst order
-v 2  No more than 2 mismatches
-m 3  Do not report alignments for reads that have more than 3 reportable alignments
-k 1  Report up to 1 valid alignment per read
-S  Output in SAM format
Output: .sam

DEMO: Galaxy with the command line
Finally: Samtools
The Genome browser doesn't use the .sam format (the output from mapping), so .sam must be converted to .bam:
    samtools view -bS NAME.sam > NAME.bam
-bS  SAM to BAM (when the SAM file has a header)
Output: .bam
Now you can visualize the result in the Genome browser.
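Genome browsers such as IGV usually also expect the BAM file to be sorted by coordinate and indexed. The sketch below adds those two steps. It assumes samtools 1.3 or later, and the helper names run and sam_to_browser_bam are made up. By default it only prints the commands (a dry run), so it can be tried without samtools or a real NAME.sam.

```sh
# Print each command; execute it only when DRY_RUN is set to something other than 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" != 1 ]; then "$@"; fi
}

sam_to_browser_bam() {
  name=$1
  run samtools view -b -o "$name.bam" "$name.sam"       # SAM -> BAM
  run samtools sort -o "$name.sorted.bam" "$name.bam"   # sort by coordinate
  run samtools index "$name.sorted.bam"                 # writes the .bai index file
}

# Dry run: shows what would be executed for NAME.sam.
sam_to_browser_bam NAME
```

To actually run the commands, set DRY_RUN=0 first and make sure NAME.sam exists. The browser then needs both NAME.sorted.bam and the index file next to it.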

So, why the command line?
Remote access: You can easily access and operate other computers, such as computing servers, using the command line.
Speed: Programs are (usually) controlled by parameters. Once you learn these commands and parameters, they are very quick to use.
Control: Pipelining and redirection let users perform powerful tasks with a single line of commands.
Automation: With scripting, users can create sequences of program tasks that execute automatically without further user interaction.
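The last two points are easy to try with standard tools only. In this small sketch the file names and coordinates are invented: the first command chains several programs with pipes and redirects the result into a file, and the loop repeats one step for every .bed file in a folder.

```sh
workdir=$(mktemp -d)

# Hypothetical input files.
printf 'chr1\t100\t200\nchr2\t50\t80\nchr1\t500\t600\nchr1\t700\t750\n' > "$workdir/a.bed"
printf 'chr2\t10\t20\n' > "$workdir/b.bed"

# Control: count the features per chromosome in a single line of commands
# and redirect the result into a file.
cut -f1 "$workdir/a.bed" | sort | uniq -c | awk '{ print $2 "\t" $1 }' \
  > "$workdir/per_chrom.txt"

# Automation: run the same step on every .bed file without further interaction.
for bed in "$workdir"/*.bed; do
  printf '%s\t%s\n' "$(basename "$bed")" "$(awk 'END { print NR }' "$bed")"
done > "$workdir/feature_counts.txt"

cat "$workdir/per_chrom.txt" "$workdir/feature_counts.txt"
```

Saved in a file, the same lines become a script that can be rerun on new data, which is what makes the steps of the demo repeatable.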

Don’t be scared of the command line! www.uef.fi