BigBed/bigWig remote file access Hiram Clawson UCSC Center for Biomolecular Science & Engineering.

Slides:



Advertisements
Similar presentations
Intro to C#. Programming Coverage Methods, Classes, Arrays Iteration, Control Structures Variables, Expressions Data Types.
Advertisements

File Systems.
Communicating Information: Web Design. It’s a big net HTTP FTP TCP/IP SMTP protocols The Internet The Internet is a network of networks… It connects millions.
The RIGHT DATA at the RIGHT TIME in the RIGHT PLACE Data Stream Processor Introduction.
Internet Networking Spring 2006 Tutorial 12 Web Caching Protocols ICP, CARP.
Mi-Joung choi, Hong-Taek Ju, Hyun-Jun Cha, Sook-Hyang Kim and J
Copyright OpenHelix. No use or reproduction without express written consent1.
Web Servers How do our requests for resources on the Internet get handled? Can they be located anywhere? Global?
1 Spring Semester 2007, Dept. of Computer Science, Technion Internet Networking recitation #13 Web Caching Protocols ICP, CARP.
Exploiting SCI in the MultiOS management system Ronan Cunniffe Brian Coghlan SCIEurope’ AUG-2000.
World Wide Web1 Applications World Wide Web. 2 Introduction What is hypertext model? Use of hypertext in World Wide Web (WWW) – HTML. WWW client-server.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Domain Name System: DNS
How Clients and Servers Work Together. Objectives Learn about the interaction of clients and servers Explore the features and functions of Web servers.
Session Management A290/A590, Fall /25/2014.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Application Layer Functionality and Protocols Network Fundamentals – Chapter.
File Transfer Protocol (FTP)
Zach Miller Condor Project Computer Sciences Department University of Wisconsin-Madison Flexible Data Placement Mechanisms in Condor.
Bertrand Bellenot root.cern.ch ROOT I/O in JavaScript Reading ROOT files from any web browser ROOT Users Workshop
QCDgrid Technology James Perry, George Beckett, Lorna Smith EPCC, The University Of Edinburgh.
Hands-On Microsoft Windows Server 2008 Chapter 5 Configuring, Managing, and Troubleshooting Resource Access.
10 May 2007 HTTP - - User data via HTTP(S) Andrew McNab University of Manchester.
Chapter 10 Intro to Routing & Switching.  Upon completion of this chapter, you should be able to:  Explain how the functions of the application layer,
XHTML Introductory1 Linking and Publishing Basic Web Pages Chapter 3.
CMPE 421 Parallel Computer Architecture
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 OSI Transport Layer Network Fundamentals – Chapter 4.
Chapter 9 How Do Users Share Computer Files?. What is a File Server A (central) computer which stores files which can be accessed by network users.
Component 4: Introduction to Information and Computer Science Unit 4: Application and System Software Lecture 3 This material was developed by Oregon Health.
10/13/2015 ©2006 Scott Miller, University of Victoria 1 Content Serving Static vs. Dynamic Content Web Servers Server Flow Control Rev. 2.0.
Wenjing Wu Computer Center, Institute of High Energy Physics Chinese Academy of Sciences, Beijing BOINC workshop 2013.
Internet Protocol B Bhupendra Ratha, Lecturer School of Library and Information Science Devi Ahilya University, Indore
Browsing the Genome Using Genome Browsers to Visualize and Mine Data.
Data Communications and Computer Networks Chapter 2 CS 3830 Lecture 8 Omar Meqdadi Department of Computer Science and Software Engineering University of.
BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.
240-Current Research Easily Extensible Systems, Octave, Input Formats, SOA.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
CHEP 2013, Amsterdam Reading ROOT files in a browser ROOT I/O IN JAVASCRIPT B. Bellenot, CERN, PH-SFT B. Linev, GSI, CS-EE.
The UCSC Table Browser & Custom Tracks Advanced searching and discovery using the UCSC Table Browser and Custom Tracks Osvaldo Graña CNIO Bioinformatics.
1 WWW. 2 World Wide Web Major application protocol used on the Internet Simple interface Two concepts –Point –Click.
CS 346 – Chapter 11 File system –Files –Access –Directories –Mounting –Sharing –Protection.
BASIC NETWORK PROTOCOLS AND THEIR FUNCTIONS Created by: Ghadeer H. Abosaeed June 23,2012.
Worldwide Lexicon Brian McConnell May, WWL – Brian McConnell Worldwide Lexicon Intro Automatic discovery of dictionary, semantic net and translation.
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
Internet Applications (Cont’d) Basic Internet Applications – World Wide Web (WWW) Browser Architecture Static Documents Dynamic Documents Active Documents.
ROOT I/O in JavaScript Browsing ROOT Files on the Web For more information see: For any questions please use following address:
ESG-CET Meeting, Boulder, CO, April 2008 Gateway Implementation 4/30/2008.
TOPIC 7.0 LINUX SERVICES AND CONFIGURATION. ROOT USER Root user is called “super user” because it has power far beyond those of mortal user. As root,
Accessing and visualizing genomics data
Supplementary Figure S1. Supplementary Figure S2.
Reading ROOT files in (almost) any browser.  Use XMLHttpRequest JavaScript class to perform the HTTP HEAD and GET requests  This class is highly browser.
Some of the utilities associated with the development of programs. These program development tools allow users to write and construct programs that the.
Chapter 7: Using Network Clients The Complete Guide To Linux System Administration.
Fundamentals of Web DevelopmentRandy Connolly and Ricardo HoarFundamentals of Web DevelopmentRandy Connolly and Ricardo Hoar Fundamentals of Web DevelopmentRandy.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Petr Škoda, Jakub Koza Astronomical Institute Academy of Sciences
How Do Users Share Computer Files?
File Transfer Protocol (FTP)
WWW and HTTP King Fahd University of Petroleum & Minerals
File System Structure How do I organize a disk into a file system?
Net 323 D: Networks Protocols
Whether you decide to use hidden frames or XMLHttp, there are several things you'll need to consider when building an Ajax application. Expanding the role.
Internet Networking recitation #12
Application layer Lecture 7.
Net 323 D: Networks Protocols
Information Technology Ms. Abeer Helwa
Lesson 3 Bioinformatics Laboratory
FTP AND COMMAND PROCESSING IN FTP
Applications Layer Functionality & Protocols
Presentation transcript:

bigBed/bigWig remote file access Hiram Clawson UCSC Center for Biomolecular Science & Engineering

The problem: Huge amount of data at remote sites UCSC User View e.g. 3x10 9 floating point data values

The Good Ol’ Days Download data from experimentalists to UCSC Process data locally Load results into local MySQL tables Present data via CGI to the WWW Expensive staff time Slow turnaround Can only work on “important” data sets

Version 0.9: Custom Tracks Automatic CGI submission mechanism Users submit file for processing Pipeline downloads entire data set and prepares for presentation Practical for file sizes 50 to 100 Mb gzipped ASCII text files limited lifetime, each user sees a separate copy of the same data

Next Generation Sequencing Lincoln Stein - genomebiology.com/2010/11/5/207

Jim Kent solution: Index the data UCSC User View fetch only that needed for viewpoint bioinformatics.oxfordjournals.org/content/26/17/2204.full

bigBed/bigWig design goals  Allow visualization without long uploads  Transfer only necessary data needed for current view.  Minimize system administrator involvement  Files on http(s) site easier than setting up web services  Leverage existing formats  Input files are same as existing custom tracks: BED, wig.

fixedStep chrom=chr1 start= … etc … bed format: notations at positions on chromosomes wig format: data values graphed at positions on chromosomes chr uc011jzk ,646,80, 0,3924,7727, chr uc011jzl ,646,80, 0,3924,13325, chr uc003syk ,646,84,135, 0,3924,16368,20695,

big* Implementation Layers  Data transfer layer – based on http(s) or ftp  Caching layer – relies on Linux sparse files  Compression layer – uses zlib  Indexing layer – uses B+ trees and R trees  Data summary layer

Data Transfer Layer  Use existing web-based protocols: http/https and ftp already supported on user sites  Unlike a web browser require random access  Use byte range protocols for http/https – requires HTTP/1.1 but this is common.  OpenSSL for HTTPS support via BIO protocol  For FTP use resume command.

Caching Layer  File header, root index block used by almost every query  Popular files may be used by hundreds or thousands of users.  Caching makes sense so only have to transfer a block of data once  Uses bitmap file + sparse data file  One bit for every 8k block  Just seek and write for sparse data. Linux/Unix only allocates data for parts written.

Data Compression Layer  Uses zlib library  Same compression as gzip  C library is widespread, stable, fast  Compress ~500 items at a time, following same pattern as in index  Get nearly random access in spite of compression  Lose a little compression compared to compressing all items at once

Indexing Layer  Most genomic annotations are of form Chromosome, start, end, additional-data  Most common query “give me all things that overlap with chromosome,start,end”  R Tree indexes work well for this  Similar to B+ Tree – used for geographical data  Each node can have hundreds of subnodes, so tree is bushy, not deep, few seeks required.  Each index slot contains range spanned by all children, only need to recurse if range overlaps.  File sorted, only every ~500 th node indexed

Data summary layer  When user is looking at megabases at once, does not need full info on every base  Store a series of summaries  ~100 bases at a time, 200 base at a time …  Summary contains min/max/sum/sumData/N  Can compute mean, std. deviation, coverage  System uses appropriate summary for given resolution, doing _some_ oversampling.

Overall BigBed/BigWig Features  Reduces data sent over the network  Only data needed for a particular view fetched  Data fetched at reduced resolution when appropriate  Data compressed  Data cached so doesn’t have to be resent.  Labs do need to run a program to create the binary indexed compressed files on their own machines  Labs do need to be able to put files (but not programs) on their web sites.

UCSC Genome Browser Staff