On the role of Interactivity and Data Placement in Big Data Analytics Srini Parthasarathy OSU.

Presentation transcript:

On the role of Interactivity and Data Placement in Big Data Analytics Srini Parthasarathy OSU

The Data Deluge: Data, Data Everywhere

Data Storage is Cheap: $600 to buy a disk drive that can store all of the world's music [McKinsey Global Institute Special Report, June 11]

Data does not exist in isolation.

Data almost always exists in connection with other data – integral part of the value proposition.

Social networks, protein interactions, Internet, VLSI networks, data dependencies, neighborhood graphs

Big Data Problem: All this data is only useful if we can scalably extract useful knowledge from such complex data

THIS TALK
 – THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS
 – THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS

GLOBAL GRAPHS

What? – System for deploying applications processing complex data
Why? – Seeks balance between high productivity and high performance
How? – Built on top of PNL's GlobalArrays
 – Trees (GlobalTrees, GlobalForests)
 – Relational Arrays (ArrayDB-GA)
 – Graphs (GlobalGraphs)
Data Placement is key to high performance

Importance of Data Placement
Locality
 – Placing related items close to each other so they may be processed together
Mitigating Impact of Data Skew
 – Reducing load imbalance in a parallel setting
 – Reducing variance in partition samples
Generating Stratified Samples
 – Improving interactive performance
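The stratified-sampling point can be illustrated with a minimal sketch. It assumes each record already carries a stratum label; the function name `stratified_sample`, the `fraction` parameter, and the toy records are hypothetical and not taken from the talk.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=0):
    """Draw roughly `fraction` of the items from every stratum (at least one each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 90 records in stratum "a", 10 in stratum "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.1)
```

Because every stratum contributes in proportion to its size, the small stratum is never missed entirely, which a uniform sample of the same size cannot guarantee.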

Key Ideas
Pivotization
 – Convert data with complex structure into sets
 – Each element of set captures features of local topology
Hashing into Strata: Hash related sets into similar bins
 – Can employ a sketch-clustering algorithm
Partitioning: Place Strata into partitions for
 – Locality
 – Mitigating Data Skew
 – Samples
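A minimal sketch of the first two ideas, under stated assumptions: the pivot form (one set element per labeled edge), the helper names, and the choice of four hash functions are illustrative guesses, not the talk's actual definitions. Hashing a whole sketch to a bin is also a simplification; it only co-locates items with identical sketches, whereas a sketch-clustering step would group near-identical ones too.

```python
import hashlib

def pivot_set(edges):
    # Assumed pivot form: one element per undirected labeled edge.
    return frozenset(tuple(sorted(edge)) for edge in edges)

def stable_hash(seed, value):
    # Deterministic 64-bit hash (Python's built-in hash() is salted per process).
    data = f"{seed}:{value!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_sketch(pset, num_hashes=4):
    # One minimum per seeded hash function; pset must be non-empty.
    return tuple(min(stable_hash(seed, el) for el in pset)
                 for seed in range(num_hashes))

def estimated_similarity(sk_a, sk_b):
    # Fraction of matching positions estimates the Jaccard similarity.
    matches = sum(1 for a, b in zip(sk_a, sk_b) if a == b)
    return matches / len(sk_a)

def stratum_of(sketch, num_strata=128):
    # Simplification: identical sketches land in the same stratum.
    return stable_hash("stratum", sketch) % num_strata

tree_1 = [("A", "B"), ("A", "C"), ("C", "L"), ("C", "E")]
tree_2 = [("C", "E"), ("B", "A"), ("C", "A"), ("L", "C")]  # same edges, reordered
sk_1 = minhash_sketch(pivot_set(tree_1))
sk_2 = minhash_sketch(pivot_set(tree_2))
```

Two structures with the same pivot set always receive the same sketch and therefore the same stratum, regardless of how their edges were listed.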

[Figure: data placement pipeline. Data items (Δ1 … Δ25) undergo pivot transformations into pivot sets (PS-1 … PS-25); minwise hashing on the pivot sets yields sketches (SK), e.g. SK-1 = {1050, 2020, 3130, 1800} and SK-25 = {1050, 2020, 7225, 2020}; SketchSort or SketchCluster groups the (Δ, SK) pairs into strata (S-1 … S-128); partitioning and replication place the strata into partitions (P-1 … P-8).]
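The partitioning stage can be sketched with two contrasting placement policies. Both functions and the toy strata are hypothetical; replication, which the pipeline also performs, is left out for brevity.

```python
def place_for_locality(strata, num_partitions):
    """Keep each stratum whole; largest strata first, onto the lightest partition."""
    partitions = [[] for _ in range(num_partitions)]
    for sid in sorted(strata, key=lambda s: -len(strata[s])):
        target = min(partitions, key=len)
        target.extend(strata[sid])
    return partitions

def place_for_balance(strata, num_partitions):
    """Deal items round-robin so every partition sees a slice of every stratum."""
    partitions = [[] for _ in range(num_partitions)]
    position = 0
    for sid in sorted(strata):
        for item in strata[sid]:
            partitions[position % num_partitions].append(item)
            position += 1
    return partitions

# Hypothetical strata of unequal size; each item remembers its stratum id.
strata = {
    "S-1": [("S-1", j) for j in range(8)],
    "S-2": [("S-2", j) for j in range(4)],
    "S-3": [("S-3", j) for j in range(4)],
}
local = place_for_locality(strata, num_partitions=2)
balanced = place_for_balance(strata, num_partitions=2)
```

The first policy favors locality, since related items stay together. The second makes each partition resemble a stratified sample of the whole data set, which addresses skew and sample variance.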

Frequent Tree Mining: Our proposed approaches show 100X gains

WebGraph Compression: Linear scaleup with no loss in compression ratio

PRISM-HD - PRobing the Intrinsic Structure and Makeup of High-dimensional Data

Visualization and Interactivity are key to discovery

PRISM-HD
What? – A novel mechanism for exploring complex data
Why? – User is often overwhelmed with characteristics of data
 – Befuddled on where to start
How? – Given a similarity measure of interest
 – Compute similarity graph at threshold (t)
   Key: Graphs are dimensionless
 – Provide user graph visualization cues
 – User determines next threshold and repeats
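The threshold loop described here can be sketched as follows. This is an illustration only: Jaccard similarity stands in for the measure of interest, the all-pairs computation is the naive quadratic one, and the cue names are hypothetical rather than what PRISM-HD reports.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity_graph(items, t, sim=jaccard):
    """Edges (i, j), i < j, between all pairs whose similarity is at least t."""
    return {(i, j)
            for (i, a), (j, b) in combinations(enumerate(items), 2)
            if sim(a, b) >= t}

def graph_cues(num_nodes, edges):
    """Simple cues a user might inspect: edge count, density, components."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    components = len({find(x) for x in range(num_nodes)})
    possible = num_nodes * (num_nodes - 1) // 2
    density = len(edges) / possible if possible else 0.0
    return {"edges": len(edges), "density": density, "components": components}

# Hypothetical items represented as feature sets.
items = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}, {8, 9}]
for t in (0.9, 0.5, 0.3):  # high, moderate, low threshold
    cues = graph_cues(len(items), similarity_graph(items, t))
    print(t, cues)
```

Lowering the threshold adds edges and merges components, so the cues at one threshold suggest where the user might probe next.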

[Figure panels: HIGH THRESHOLD, MODERATE THRESHOLD, LOW THRESHOLD]

Benefits of Knowledge Caching

Benefits of Incremental Processing on Twitter: incremental estimates on Twitter, t1 = 0.95

PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response

Concluding Remarks
Data is everywhere
Data is fraught with complexities
 – Dimensionality, dynamics, structure, massive…
Both data placement and data interactivity have an important role to play in big data analytics
 – PRISM-HD and GlobalGraphs can help!

Thanks for your attention
Contact:
[Figure labels: Mining Simulation Data; Medical Image Analysis; Protein Interaction Network (yeast)]
Acknowledgements: Various NSF, NIH, DOE and industry grants