Using command line tools to process sequencing data

Tapio Vuorenmaa, Krista Kokki

This hands-on session:
Part 1 – Bedtools
Part 2 – DEMO: Galaxy with the command line
Part 3 – Bedtools exercises

Part 1: Bedtools
Bedtools is a command line tool that makes a wide range of genomics tasks easy. It is easy to use, which makes it a perfect example for this session. We are focusing on using the command line and its parameters, not on the tool itself. The command we use is "intersect".

Bedtools - intersect
"Do the features in my two sets overlap with each other?"
Files: "genes.bed" and "markers.bed" (in the exercises), both in .bed format.

.bed format?
A flexible way to define the data lines displayed in an annotation track.
One of the file types the Genome Browser uses.
Three required fields: chromosome, start position, end position.
Additional optional fields: name, score, strand etc.
Our "data" is simplified.
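As a rough illustration of the required and optional fields, here is a small Python sketch that parses BED lines. The sample lines and the names BedFeature and parse_bed_line are made up for this example and do not come from the exercise files. BED coordinates are 0-based with an exclusive end, so the length of a feature is simply end minus start.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int                  # 0-based, inclusive
    end: int                    # exclusive
    name: Optional[str] = None  # optional fourth field


def parse_bed_line(line: str) -> BedFeature:
    """Parse one whitespace-separated BED line (first three fields required)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 fields: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


# Hypothetical example lines, not taken from genes.bed or markers.bed.
example_lines = ["chr1\t100\t200\tgeneA", "chr2\t50\t80"]
features = [parse_bed_line(line) for line in example_lines]
for feature in features:
    print(feature.chrom, feature.start, feature.end, feature.name,
          "length:", feature.end - feature.start)
```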

[Figure: our "data", shown on Chromosome 1 and Chromosome 2]

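Back to intersect
Question: Do my ChIP-seq peaks overlap?
The basic command:
    bedtools intersect -a genes.bed -b markers.bed > result.bed
The parts of the command are: the program (bedtools), the command (intersect), the first file (-a genes.bed), the second file (-b markers.bed) and the redirection of the output into a file (> result.bed, optional). Optional parameters can be added to the command.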
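To make the idea of an overlap concrete, the following Python sketch mimics what the basic intersect command reports: for every overlapping pair, the shared part of the feature from the first file. This is a simplified brute-force illustration, not how bedtools is implemented, and the intervals are invented for the example.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based, end exclusive


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion for every overlapping (a, b) pair."""
    result = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # features on different chromosomes never overlap
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # at least one shared base
                result.append((chrom_a, start, end))
    return result


# Hypothetical stand-ins for genes.bed and markers.bed.
genes = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
markers = [("chr1", 150, 160), ("chr1", 190, 310), ("chr2", 90, 95)]
overlaps = intersect(genes, markers)
print(overlaps)
```

Comparing every pair is fine for a handful of features; for real data the dedicated tool is the better choice.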

Optional parameters
-wa  Write the original feature from the first file (-a) for each overlap.
-u  Report each feature of the first file only once if it has any overlap, no matter how many.
-s  Only report overlaps found on the same strand.
-c  For each feature of the first file, count the number of overlaps.
-v  Complement: report the features of the first file that do not overlap anything.
-S  Only report overlaps found on the opposite strand.
And many more. See your bedtools cheat sheet.
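The relationship between -c, -u and -v can be sketched in a few lines of Python: once the overlaps per feature are counted, the other two follow from the count. This is only an illustration under simplifying assumptions (strand is ignored, the data is invented, and count_overlaps is a made-up helper name).

```python
def count_overlaps(a, b):
    """For each feature in a, count the overlapping features in b (like -c)."""
    counts = []
    for chrom_a, start_a, end_a in a:
        n = sum(
            1
            for chrom_b, start_b, end_b in b
            if chrom_a == chrom_b and max(start_a, start_b) < min(end_a, end_b)
        )
        counts.append(((chrom_a, start_a, end_a), n))
    return counts


# Hypothetical intervals: (chrom, start, end), 0-based, end exclusive.
genes = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
markers = [("chr1", 150, 160), ("chr1", 190, 310), ("chr2", 90, 95)]

counts = count_overlaps(genes, markers)
with_overlap = [feature for feature, n in counts if n > 0]      # like -u
without_overlap = [feature for feature, n in counts if n == 0]  # like -v
print(counts)
print(with_overlap)
print(without_overlap)
```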

That's not all there is to it
There are a number of other things you can do with bedtools, such as: coverage, merge and cluster.

Part 2: DEMO: Galaxy with the command line
Step 1. Converting the .sra file into fastq
    fastq-dump NAME.sra --offset 33
--offset 33  Quality conversion (the offset used to get the quality score from the ASCII code)
Output: .fastq file
Step 2. Quality report of your data
    fastqc NAME.fastq
Output: .html file
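The quality conversion behind the offset of 33 can be shown in a short Python sketch: each character of the quality line is turned into a score by subtracting the offset from its ASCII code. The quality string below is invented for the example.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list:
    """Convert a FASTQ quality line into a list of numeric quality scores."""
    return [ord(ch) - offset for ch in quality_line]


# Hypothetical quality line; '!' is the lowest score with offset 33.
scores = phred_scores("!+5?I")
print(scores)  # [0, 10, 20, 30, 40]
```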

DEMO: Galaxy with the command line
Step 3. Trimming (optional)
    fastx_trimmer -i NAME.fastq -o TRIMMED_NAME.fastq -f 1 -l 50
-i  Input file
-o  Output file
-f  First base to keep
-l  Last base to keep
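What the first-base and last-base parameters mean can be sketched in Python. The positions are counted from 1 and both ends are kept, so they translate into the slice [first - 1:last]. The function name trim and the reads are made up for this illustration.

```python
def trim(seq: str, qual: str, first: int = 1, last: int = 50):
    """Keep bases first..last (1-based, inclusive) of a read and its qualities."""
    return seq[first - 1:last], qual[first - 1:last]


# Hypothetical 60-base read: trimmed to the first 50 bases.
read_seq = "ACGT" * 15
read_qual = "I" * 60
trimmed_seq, trimmed_qual = trim(read_seq, read_qual, first=1, last=50)
print(len(trimmed_seq), len(trimmed_qual))

# Keeping only bases 3 to 5 of a short read.
print(trim("ACGTACGT", "IIIIIIII", first=3, last=5))
```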

DEMO: Galaxy with the command line
Step 4. Quality filtering
    fastq_quality_filter -i TRIMMED_NAME.fastq -o QFILT_NAME.fastq -q 10 -p 100
-i  Input file
-o  Output file
-q  Minimum quality score to keep
-p  Minimum % of bases that must have at least the -q quality
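The interplay of the two thresholds can be illustrated with a minimal Python sketch: a read is kept when at least the given percentage of its bases reach the minimum quality. The sketch assumes quality characters with an offset of 33, and passes_filter is a hypothetical helper name.

```python
def passes_filter(qual: str, min_quality: int = 10, min_percent: int = 100,
                  offset: int = 33) -> bool:
    """Return True if enough bases of the read reach the minimum quality."""
    if not qual:
        return False  # an empty read has nothing to keep
    scores = [ord(ch) - offset for ch in qual]
    good = sum(1 for score in scores if score >= min_quality)
    return 100 * good >= min_percent * len(scores)


print(passes_filter("IIII"))                  # all bases have quality 40
print(passes_filter("II!I"))                  # one base has quality 0
print(passes_filter("II!I", min_percent=75))  # 3 of 4 bases are good enough
```

With -q 10 -p 100, as in the demo, a single base below quality 10 is enough to discard the whole read.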

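DEMO: Galaxy with the command line
Step 5. Removing quality information
    fastx_collapser -i QFILT_NAME.fastq -o COLPS_NAME.fasta
-i  Input file
-o  Output file
fastx_collapser collapses identical sequences into a single entry and writes FASTA, so the quality information is dropped.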
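Collapsing can be pictured as counting identical sequences and writing each distinct sequence once. The Python sketch below does this for a few invented reads. The header style (rank, then a dash, then the read count) is modeled on the collapser's output and should be treated as an assumption here.

```python
from collections import Counter


def collapse(reads):
    """Collapse identical reads into FASTA lines, most frequent first."""
    counts = Counter(reads)
    lines = []
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        lines.append(f">{rank}-{n}")  # assumed header style: rank-count
        lines.append(seq)
    return lines


# Hypothetical reads; qualities are not needed any more at this point.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
fasta_lines = collapse(reads)
print("\n".join(fasta_lines))
```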

DEMO: Galaxy with the command line
Step 6. Mapping
    bowtie //hg19 -f COLPS_NAME.fasta --best -v 2 -m 3 -k 1 -S output.sam
//hg19  Genome
-f  Query input files (f = fasta)
--best  Results in best-to-worst order
-v 2  No more than 2 mismatches
-m 3  Do not report alignments for reads having more than 3 reportable alignments
-k 1  Report up to 1 valid alignment
-S  Output in SAM format
Output: .sam

DEMO: Galaxy with the command line
Finally: Samtools
The Genome Browser doesn't use the .sam format (the output from mapping), so the .sam file must be converted to .bam.
    samtools view -bS NAME.sam > NAME.bam
-bS  SAM to BAM (when the SAM file has a header)
Output: .bam
Now you can visualize the result in the Genome Browser.

So, why the command line?
Remote access: You can easily access and operate other computers, such as computing servers, using the command line.
Speed: Programs are (usually) controlled by parameters. Once you learn these commands and parameters, they are very quick to use.
Control: Pipelining and redirection enable users to perform powerful tasks with a single line of commands.
Automation: With scripting, users can create sequences of program tasks that execute automatically without further user interaction.
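The automation point can be sketched with a small Python script that assembles the demo steps into one sequence and, by default, only prints them. Several things here are assumptions for illustration: the function names, the sample name SAMPLE, the placeholder index path, writing the bowtie output to NAME.sam so that the samtools step finds it, bowtie's -S switch for SAM output, and using samtools' -o option in place of the > redirection.

```python
import shlex
import subprocess


def build_pipeline(name: str, index: str) -> list:
    """Return the demo's steps as argument lists; nothing is executed here."""
    return [
        ["fastq-dump", f"{name}.sra", "--offset", "33"],
        ["fastqc", f"{name}.fastq"],
        ["fastx_trimmer", "-i", f"{name}.fastq",
         "-o", f"TRIMMED_{name}.fastq", "-f", "1", "-l", "50"],
        ["fastq_quality_filter", "-i", f"TRIMMED_{name}.fastq",
         "-o", f"QFILT_{name}.fastq", "-q", "10", "-p", "100"],
        ["fastx_collapser", "-i", f"QFILT_{name}.fastq",
         "-o", f"COLPS_{name}.fasta"],
        ["bowtie", index, "-f", f"COLPS_{name}.fasta",
         "--best", "-v", "2", "-m", "3", "-k", "1", "-S", f"{name}.sam"],
        # -o replaces the shell redirection used on the slide (assumption).
        ["samtools", "view", "-bS", "-o", f"{name}.bam", f"{name}.sam"],
    ]


def run_pipeline(commands, dry_run: bool = True) -> None:
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


# "path/to/hg19_index" is a placeholder, not a real index location.
commands = build_pipeline("SAMPLE", "path/to/hg19_index")
run_pipeline(commands)  # dry run: only prints the seven commands
```

Calling run_pipeline(commands, dry_run=False) would execute the steps in order, provided the tools are installed and the input file exists.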

Don’t be scared of the command line! www.uef.fi