NGS Analysis Using Galaxy Galaxy is a genomics analysis platform that allows researchers to obtain data from various databases such as the UCSC Genome Browser, ENCODE data and many other sources, prepare and manipulate the data, and perform various analyses on the data in ways that might not be possible at the original sites. Galaxy also includes a workflow tool that allows the user to save and share various customized and useful analysis steps.
NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Getting Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises
FASTQ Format FASTQ variants The range of quality scores in a FASTQ file depends on the sequencing technology and the base caller used. If the quality scores contain characters in the range ASCII 33-58, the file can only be Sanger. If the FASTQ file is known to come from an Illumina/Solexa platform AND the quality scores contain characters in the range ASCII 59-63, the file can only be Solexa/Illumina 1.0. If ASCII characters 64 or 65 are used in the quality scores, the file cannot be Illumina 1.5+.
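The three rules on this slide can be sketched as a small Python function. This is only an illustration of the decision logic, not a Galaxy tool: the function name `guess_fastq_variant`, its `known_illumina` parameter, and the returned labels are assumptions made for this example.

```python
def guess_fastq_variant(quality_strings, known_illumina=False):
    """Apply the slide's three rules to the observed quality characters.

    quality_strings: iterable of quality lines taken from a FASTQ file.
    known_illumina:  True if the file is known to come from an
                     Illumina/Solexa platform (needed for rule 2).
    """
    codes = {ord(ch) for qual in quality_strings for ch in qual}

    # Rule 1: any character in ASCII 33-58 means the file can only be Sanger.
    if any(33 <= c <= 58 for c in codes):
        return "Sanger"

    # Rule 2: characters in ASCII 59-63 point to Solexa/Illumina 1.0,
    # but only when the platform is already known to be Illumina/Solexa.
    if any(59 <= c <= 63 for c in codes):
        return "Solexa/Illumina 1.0" if known_illumina else "ambiguous"

    # Rule 3: ASCII 64 or 65 rules out Illumina 1.5+ (nothing more is implied).
    if 64 in codes or 65 in codes:
        return "not Illumina 1.5+"

    # None of the rules applies: the characters alone do not decide the variant.
    return "ambiguous"
```

In practice only a sample of reads would be inspected, so an "ambiguous" result means the sample did not contain a deciding character, not that the file has no definite encoding.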
SAM Format There are many short-read aligners, and most of them use their own format to output alignments. As a result, downstream tools cannot be exchanged between aligners. To resolve this issue, Li et al. proposed a standardized file format, the Sequence Alignment/Map (SAM) format, which is flexible enough to store all the alignment information generated by various alignment programs and simple enough to be easily generated by alignment programs or converted from existing alignment formats. For more details: http://samtools.sourceforge.net/SAM1.pdf
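To make the format concrete, here is a minimal Python sketch that splits one SAM alignment line into its eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and keeps any optional tags as raw strings. The names `SamRecord`, `parse_sam_line` and `is_unmapped` are hypothetical helpers for this example, and the sample line in the usage note is invented.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with '@') are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def is_unmapped(record):
    """Bit 0x4 of FLAG marks a read that did not align."""
    return bool(record.flag & 0x4)
```

For example, `parse_sam_line("read1\t0\tchrX\t100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0")` yields a record with `rname` equal to `chrX` and `pos` equal to 100. For real work a dedicated library is a better choice than hand parsing.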
Galaxy Start http://galaxy.psu.edu/ Galaxy Site If you go to the URL shown here, you will find a short introduction to Galaxy. We’ll start at the bottom of the page, since this is where we will learn what Galaxy is. In the red section at the bottom, you will learn the rationale for Galaxy. Galaxy is not a database, but rather an analysis tool. The amount of data and the number of databases are increasing exponentially, but the tools available to analyze this data are few and have a difficult time keeping up with new data types. Galaxy aims to provide an analysis resource that allows the user to pull data from many different databases and analyze that data. It also aims to give the user a way to reproduce and share that analysis. Galaxy integrates many databases and analysis tools and is continually growing and evolving. Galaxy was created both to give bench scientists a relatively simple tool for data analysis and to let bioinformatics developers integrate tools and analyses for the user. This tutorial is aimed at that first group of users, the biologists who wish to perform data analysis, but developers might find it useful too. We also suggest that developers view the FAQs and screencasts for more information on how to integrate their tools into the Galaxy framework. This integration of databases, tools and workflows makes collaboration possible among scientists and between scientists and developers. It also allows researchers to share their analyses and lets other researchers reproduce them where they otherwise could not. The website is available for anyone to use right now if you click the link to go to the public site.
Galaxy Conceptual Framework Obtain data from many data sources including the UCSC Table Browser, BioMart, WormBase, or your own data. Prepare data for further analysis by rearranging or cutting data columns, filtering data and many other actions. Analyze data by finding overlapping regions, determining statistics, phylogenetic analysis and much more. Galaxy is a research tool that is of great use to both the researcher and the developer. Through Galaxy, the researcher can obtain data from many different sources and databases, prepare the data for further analysis, and analyze the data using the many included tools or analysis tools added by developers. The data types you start with will vary based on your research interests. The preparations and manipulations you choose can be customized for your needs, and the range of analyses you can perform is very broad. But the basic conceptual framework is: obtain data, prepare data, and analyze the data. In this tutorial we will examine these basic concepts, but you should not be limited to these examples. You should be able to ask complex questions of the data using the Galaxy framework.
Galaxy Interface Sections User Register The interface to Galaxy is straightforward. There are three sections. The left column contains links to the downloading, preparation and analysis tools. The center column is where the menus and data will appear. The right-hand column shows the history of your analysis steps, allows you to view data and results, and more. Let’s do a really quick sample step of obtaining data to give you an idea of how the interface works. In the next section we’ll explain more about getting data and the types of data you can obtain, but for now this is just an example. Click the link that says “Get Data.”
Getting Data Click Get Data In the Galaxy interface, the left column includes all the links for getting data into Galaxy, preparing the data and analyzing it. “Get Data” offers many ways to add all types of data to Galaxy in order to start your analysis.
Getting Data: Table Browser Get Table Main Right now, let’s upload some data from the UCSC Table Browser database. The UCSC Table Browser includes genomic data from dozens of species. It allows the user to customize a search with powerful filters and intersections of data to obtain exactly the data the user needs. We’ll just go through some basics of getting data, but if you haven’t used the Table Browser before, we suggest you view our tutorial on the UCSC Table Browser to get acquainted with this database search tool.
Getting Data: UCSC Table Browser clade: Mammal; genome: Human; assembly: Mar. 2006; group: Genes and…; track: UCSC Genes; table: knownGene; region: position, chrX; output format: BED, with “Send output to Galaxy” checked. The Table Browser interface will now appear. We are going to get known gene data from the human X chromosome. To do so, we’ll choose the right data set: the human genome assembly “Mar. 2006” and the “UCSC Genes” dataset from the “Genes and Gene Prediction Tracks” group. We’ll enter the position “chrX.” Leaving all the rest as default, including BED as the output format and the “Send output to Galaxy” checkbox checked, we’ll click “Get Output.”
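Since the data arrives in Galaxy as BED, it may help to see what a BED line holds. The Python sketch below reads the first six BED columns (chrom, start, end, name, score, strand), assuming tab-delimited input; BED starts are 0-based and ends are exclusive, so a feature's length is end minus start. The function names and the sample line in the usage note are hypothetical, not part of Galaxy or the Table Browser.

```python
def parse_bed_line(line):
    """Return a dict for one BED feature, or None for header/comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith(("track", "browser", "#")):
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start and end")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),   # 0-based, inclusive
        "end": int(fields[2]),     # exclusive
        "name": fields[3] if len(fields) > 3 else None,
        "score": fields[4] if len(fields) > 4 else None,
        "strand": fields[5] if len(fields) > 5 else None,
    }


def feature_length(feature):
    """Length in bases of a parsed BED feature."""
    return feature["end"] - feature["start"]
```

Calling `parse_bed_line("chrX\t100\t250\tgeneA\t0\t+")` gives a feature on `chrX` whose `feature_length` is 150.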
Getting Data: Upload File Upload or paste file File Format Upload File Species Execute You can also upload your own data. To upload your file, you’ll either paste the contents of your file into the window here, or click the browse button here and find the file on your computer. Here I’ve found and uploaded a file from my desktop. You’ll then choose which file format the file is in. In this case it is in the interval format, but as you see there are many other possible formats. You can also choose “Auto-detect”, which works pretty well to determine the format of a data file if you are not quite sure of the format, or even if you just don’t want to look for the format type on the list. Once you’ve chosen the file format, you choose which species the data has been obtained from. There are many to choose from. Ours is from Homo sapiens, so that is what we will choose. Then click “Execute.”
Getting Data: Upload File To upload from several locations at once, enter multiple URLs in the "URL / Text" box.
Analyzing Data: Next Generation Sequencing The Next Generation Sequencing (or NGS) Toolbox, which is in beta at the time of development of this tutorial, offers lots of new tools. We’ve started from the beginning with a new history here. Let’s open the “QC and manipulation tools.” You’ll notice there are many tools here to choose from. For example, we could choose to draw a nucleotides distribution chart from a statistics file we might have uploaded. If you have NGS data to analyze, you may want to explore these a bit closer.
Analyzing Data: Next Generation Sequencing FASTQ file manipulation, such as format conversion, summary statistics, trimming reads, filtering reads by quality score…
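As a rough illustration of what trimming and filtering by quality score mean, here is a minimal Python sketch. It is not how the Galaxy tools are implemented. It assumes four-line FASTQ records with Sanger (Phred+33) qualities, and the function names and default thresholds are chosen only for this example.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Average Phred score of a quality string (offset 33 = Sanger)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose quality is below min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_mean_q=20, offset=33):
    """Keep only non-empty reads whose mean quality reaches min_mean_q."""
    for name, seq, qual in records:
        if qual and mean_quality(qual, offset) >= min_mean_q:
            yield name, seq, qual
```

The reader loads all lines into memory for simplicity, which is fine for a small example but not for a full sequencing run.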
Analyzing Data: Next Generation Sequencing Input: sanger FASTQ Output: SAM format
Analyzing Data: Next Generation Sequencing After alignment, Galaxy supports many downstream analyses. In this workshop, we cover only how to convert a SAM file to a BAM file. We will introduce more tools in future workshops.
History: History Options List saved histories and shared histories. Work on Current History, create new, clone, share, create workflow, set permissions, show deleted datasets or delete history. List saved histories There are a lot of options that make histories very helpful. List all the analysis histories you’ve done, create a new empty history to start a new analysis, make a history into a workflow, share your history with other users and change the permissions and more. You can show deleted data within the current history or delete the current history. Let’s list all saved histories by clicking the “Saved Histories” link. Copyright OpenHelix. No use or reproduction without express written consent
Workflow Creating a workflow allows the user to repeat an analysis on different datasets.