Adopting Big-Data Computing Across the Undergraduate Curriculum
Bina Ramamurthy (Bina)

Adopting Big-Data Computing Across the Undergraduate Curriculum
Bina Ramamurthy (Bina)
This talk is partially funded by an NSF TUES grant and by an AWS in Education Coursework Grant award.
10/19/2012, Symposium on Big Data Science and Engineering

Outline of the talk
A golden era in computing
Big Data computing curriculum
Data-intensive/Big Data Computing Certificate program at the University at Buffalo
Outcome evaluation
Important findings
Recommendations for adoption into the undergraduate curriculum
Demos
Useful links and project web page
Questions and answers

A Golden Era in Computing
Heavy societal involvement
Powerful multi-core processors
Superior software methodologies
Virtualization leveraging the powerful hardware
Wider bandwidth for communication
Proliferation of devices
Explosion of domain applications

Top Ten Largest Databases
Ref:

Top Ten Largest Databases in 2007 vs. Facebook's cluster in 2010
Ref:
Facebook: 21 petabytes in 2010

Data Deluge: smallest to largest
Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviors
The internet: web logs, Facebook, Twitter, maps, blogs, etc., and the analytics on them
Financial applications: analyzing volumes of data for trends and other deeper knowledge
Health care: huge amounts of patient data, drug and treatment data
The universe: the Hubble ultra-deep-field image shows hundreds of galaxies, each with billions of stars

A Different Type of Storage
The internet introduced a new challenge in the form of web logs and web-crawler data: large, "peta-scale" data.
Observe that this type of data has a uniquely different characteristic from transactional data such as "customer order" or "bank account" data: it is "write once, read many" (WORM). Examples: privacy-protected healthcare and patient information; historical financial data; other historical data.
Relational file systems and tables are insufficient. We need large stores (files) and a storage management system with built-in features for fault tolerance, load balancing, data transfer and aggregation, etc.
Clusters of distributed nodes provide both storage and computing; the computing is inherently parallel.

Big-data Concepts
This approach originated with the Google File System (GFS), the special store; the Hadoop Distributed File System (HDFS) is its open-source counterpart (currently an Apache project).
The data is processed in parallel using the MapReduce (MR) programming model.
Challenges: formulation of MR algorithms; proper use of the features of the infrastructure (e.g., sort); best practices in using MR and HDFS.
An extensive ecosystem has grown around these, consisting of other components such as column-based stores (HBase, BigTable), big-data warehousing (Hive), workflow languages, etc.
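The MR model itself fits in a few lines. The following plain-Python sketch of a word count mimics the three framework stages; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative only, not Hadoop API, and a real job would run each stage in parallel across cluster nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key, as the framework
    # does between the map and reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["Big data is big", "data on the cloud"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
```

Here `counts` maps each word to its total across all documents, e.g. `"big"` to 2.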

Data & Analytics
We have witnessed an explosion in algorithmic solutions.
"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper)
What you cannot achieve with a better algorithm can often be achieved with more data.
Big data, if analyzed right, gives you better answers: compare the Centers for Disease Control's flu prediction with prediction of flu through "search" data, two full weeks before the onset of flu season! (See the reference.)

Cloud Computing
The cloud is a facilitator for Big Data computing and is indispensable in this context.
The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters and other requirements as a service.
The cloud offers accessibility to Big Data computing.
Cloud computing models:
o platform as a service (PaaS): Microsoft Azure, Google App Engine (GAE)
o software as a service (SaaS)
o infrastructure as a service (IaaS): Amazon Web Services (AWS)
o services-based application programming interfaces (APIs)

Big-data Courses
We introduce the concepts in a two-course sequence.
Course 1 (has become a core course): CSE 486
o The foundational concepts of MapReduce and the Hadoop Distributed File System are introduced as part of the Distributed Systems course
o The last project/lab in the Distributed Systems course is based on MapReduce (MR) concepts and is implemented on HDFS
Course 2 (has become an elective course): CSE 487
o The second course focuses completely on Big-data issues, mostly MR algorithm formulation and best practices
o Analytics on large clusters and on the cloud
o The textbook we use for this course deals with algorithms and data structures for MR and the best ways to leverage the parallelism in the MR family of operations (map, reduce, combine, partition, etc.)
o Other Big-data stores such as HBase and Hive, and workflows, are also introduced
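One of the "best practices" the second course stresses is the combine step: pre-aggregating on the mapper so far fewer (word, count) pairs cross the network during the shuffle. A minimal Python sketch of the idea (not the course's actual Hadoop code; names are illustrative):

```python
from collections import Counter

def map_with_combiner(split):
    # Map + combine: count words locally within one input split,
    # emitting one (word, count) pair per distinct word instead of
    # one (word, 1) pair per occurrence
    return Counter(split.lower().split())

def reduce_counts(per_mapper):
    # Reduce: merge the partial counts produced by all mappers
    total = Counter()
    for partial in per_mapper:
        total.update(partial)
    return total

splits = ["to be or not to be", "be here now"]
result = reduce_counts(map_with_combiner(s) for s in splits)
```

The first split emits 5 pairs instead of 6 here; on real corpora with heavily repeated words the shuffle-traffic savings are far larger.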

Big-data Certificate Program
The official name is the Data-intensive Computing certificate.
o Initiated with support from the NSF TUES program
o Approved by the SUNY system in Fall 2011
o Offered by the University
o Open to enrolled undergraduates of any major!
Details of the program:
o CS1, CS2
o Distributed Systems (CSE486): prerequisite CS2
o Data-intensive Computing Systems (CSE487): prerequisite CS2
o An elective in the discipline of choice (e.g., BIO4XY or MGS4XY)
o A capstone project applying data-intensive computing (e.g., BIO499 or MGS499)

Evaluation

Findings
There is high demand from students for "data-intensive" and big-data computation related courses.
The certificate program is hard for non-CSE majors.
There is high demand for big-data skills from employers.
Educators and administrators need to be educated about big data (remember the times we educated people on object orientation, Java, web-enabling, etc.).
It is imperative that we improve the preparedness of our workforce at all levels in Big Data skills, for global competitiveness.
Just one course or a single certificate is NOT enough: we need continuous and repeated exposure to Big Data in various contexts.
It is often very hard to create and sustain a new curriculum.
How can we address these challenges?

Recommendations
Introduce big-data concepts as an integral part of the UG curriculum.
o For example, for CS: a simple word count over big data in CS1, the map-reduce algorithm in CS2, cloud storage and big-table in Database Systems, Hadoop in Distributed Systems, and full big-data analytics in other elective courses such as Machine Learning and Data Mining.
Use compelling examples built on real-world datasets.
Train the educators: big-data professional development for the academic core is critical.
Expose the administrators to Big Data applications/tools in all possible areas: in institutional analysis, the data collected at educational institutions is a gold mine for macro-level analytics ("What is the trend?", "Are they learning?").
Train the counselors who advise high school students, and college entry-level counselors.
Include the community colleges and four-year colleges.
We need investment from major industries (mentoring, educator days, etc.).

Demos
Simple word count using the MR model on HDFS on the local machine
o The foundation for many algorithms, such as word clouds
o Simple and easy to understand
o Uses Project Gutenberg texts
Simple co-occurrence analysis of Twitter data
o Twitter has donated its entire collection of tweets to the Library of Congress
Amazon MR workflow and working AWS facilities
Finally, a sample run of a 10-million-node tree on a compute cluster at the Center for Computational Research (CCR) at Buffalo
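The co-occurrence demo above follows the standard "pairs" formulation of MR co-occurrence: each mapper emits a ((w1, w2), 1) pair for every pair of words appearing together in one tweet, and the reducer sums per pair. A self-contained Python sketch of that formulation (the tweet data is invented for illustration):

```python
from collections import Counter
from itertools import combinations

def map_pairs(tweet):
    # "Pairs" approach: emit ((w1, w2), 1) for every unordered pair of
    # distinct words that co-occur in a single tweet
    words = sorted(set(tweet.lower().split()))
    return [((a, b), 1) for a, b in combinations(words, 2)]

def reduce_pairs(all_pairs):
    # Reduce: sum the 1s emitted for each word pair
    counts = Counter()
    for key, one in all_pairs:
        counts[key] += one
    return counts

tweets = ["big data rocks", "big data on hadoop"]
pairs = [p for t in tweets for p in map_pairs(t)]
cooc = reduce_pairs(pairs)
```

Here `cooc[("big", "data")]` is 2, since those words co-occur in both tweets. The alternative "stripes" formulation, also covered in the course textbook, emits one associative map per word instead.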

Summary
We explored the need for data-intensive, or big-data, computing.
We illustrated Big Data concepts and demonstrated cloud capabilities through simple applications.
Data-intensive computing on the cloud is an essential and indispensable skill for the workforce of today and tomorrow.
The University at Buffalo has implemented a SUNY-wide Certificate Program in Data-intensive Computing.
An actionable next step is to form a group of people passionate about Big Data who work at introducing it into their courses and projects.

References & useful links
Flu prediction: J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 457 (19 February 2009).
Twitter and the Library of Congress: A. Watters. How Library of Congress is Building the Twitter Archive. Last viewed July.
Project web page: all of the project material, including course descriptions, course material, project descriptions, several presentations, useful links, and references.