Published by Madeline Wheeler. Modified over 9 years ago.
1
Adopting Big-Data Computing Across the Undergraduate Curriculum
Bina Ramamurthy (Bina)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
This talk is partially funded by NSF grant NSF-TUES-0920335 and by an AWS in Education Coursework Grant award.
10/19/2012, Symposium on Big Data Science and Engineering
2
Outline of the talk
Golden era in computing
Big Data computing curriculum
Data-intensive/Big Data Computing Certificate program at University at Buffalo
Outcome Evaluation
Important Findings
Recommendations for adoption into undergraduate curriculum
Demos
Useful links and project web page
Questions and Answers
3
A Golden Era in Computing
Heavy societal involvement
Powerful multi-core processors
Superior software methodologies
Virtualization leveraging the powerful hardware
Wider bandwidth for communication
Proliferation of devices
Explosion of domain applications
4
Top Ten Largest Databases
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/
5
Top Ten Largest Databases in 2007 vs. Facebook’s cluster in 2010
Facebook: 21 petabytes in 2010
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world
6
Data Deluge: smallest to largest
Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviors
The internet: web logs, Facebook, Twitter, maps, blogs, etc.: analytics …
Financial applications: those that analyze volumes of data for trends and other deeper knowledge
Health care: huge amounts of patient data, drug data, and treatment data
The universe: the Hubble Ultra Deep Field image shows thousands of galaxies, each with billions of stars
7
Different Type of Storage
The Internet introduced a new challenge in the form of web logs and web crawler data: large scale, “peta scale”.
But observe that this type of data has a uniquely different characteristic from your transactional data, “customer order” data, or “bank account” data: the data type is “write once read many (WORM)”. Other examples:
o Privacy-protected healthcare and patient information
o Historical financial data
o Other historical data
Relational file systems and tables are insufficient. What is needed:
o Large stores (files) and a storage management system
o Built-in features for fault tolerance, load balancing, data transfer and aggregation, …
o Clusters of distributed nodes for storage and computing
o Computing that is inherently parallel
8
Big-data Concepts
The concepts originated at Google: the Google File System (GFS) is the special store.
The Hadoop Distributed File System (HDFS) is the open source version of this (currently an Apache project).
Parallel processing of the data uses the MapReduce (MR) programming model.
Challenges:
o Formulation of MR algorithms
o Proper use of the features of the infrastructure (Ex: sort)
o Best practices in using MR and HDFS
An extensive ecosystem consists of other components such as column-based stores (HBase, BigTable), big data warehousing (Hive), workflow languages, etc.
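The MapReduce programming model named on this slide can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration, and the in-memory `shuffle` stands in for the grouping step that a real framework performs across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both phases could in principle run in parallel on different nodes, which is the property the model relies on.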
9
Data & Analytics
We have witnessed an explosion in algorithmic solutions.
“In pioneer days they used oxen for heavy pulling, when one couldn’t budge a log they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” (Grace Hopper)
What you cannot achieve by an algorithm can be achieved by more data.
Big data, if analyzed right, gives you better answers: compare the Centers for Disease Control’s prediction of flu with the prediction of flu through “search” data, 2 full weeks before the onset of flu season! (see the reference)
10
Cloud Computing
The cloud is a facilitator for Big Data computing and is indispensable in this context.
The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters, and other requirements as a service.
The cloud offers accessibility to Big Data computing.
Cloud computing models:
o Platform (PaaS): Microsoft Azure, Google App Engine (GAE)
o Software (SaaS)
o Infrastructure (IaaS): Amazon Web Services (AWS)
o Services-based application programming interface (API)
11
Big-data Courses
We introduced the concepts in a two-course sequence.
Course 1 (has become a core course): CSE 486
o Foundational concepts of MapReduce and the Hadoop Distributed File System are introduced as part of the Distributed Systems course
o The last project/lab in the Distributed Systems course is based on MapReduce (MR) concepts and is implemented on HDFS
Course 2 (has become an elective course): CSE 487
o The second course focuses completely on Big-data issues, mostly MR algorithm formulation and best practices
o Analytics on large clusters and on the cloud
o The textbook we use for this course deals with algorithms and data structures for MR and the best ways to leverage the parallelism in the MR family of operations (map, reduce, combine, partition, etc.)
o Other Big Data stores such as HBase and Hive, and workflows, are also introduced
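Two of the operations listed for the second course, combine and partition, are easy to overlook next to map and reduce. The sketch below is an illustration in plain Python under stated assumptions, not the course's or Hadoop's implementation: `map_with_combiner`, `partition`, and `run` are hypothetical names, each inner list plays the role of one input split, and a CRC32 hash stands in for whatever partitioning function a real framework uses.

```python
import zlib
from collections import Counter


def map_with_combiner(lines):
    """Map plus combine: pre-aggregate word counts locally for one input split,
    so fewer (word, count) pairs have to be sent to the reducers."""
    local = Counter()
    for line in lines:
        local.update(word.lower() for word in line.split())
    return local


def partition(key, num_reducers):
    """Partition: pick the reducer for a key with a stable hash, so every
    occurrence of the same key reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run(splits, num_reducers=2):
    """Route the combined output of every split to its reducer and sum there."""
    reducers = [Counter() for _ in range(num_reducers)]
    for split in splits:
        for word, count in map_with_combiner(split).items():
            reducers[partition(word, num_reducers)][word] += count
    return reducers


if __name__ == "__main__":
    for index, counts in enumerate(run([["a b a"], ["b a"]])):
        print(index, dict(counts))
```

The combiner reduces the volume of intermediate data, while the partitioner guarantees that each key is totaled in exactly one place; both are choices the programmer can tune without changing the map or reduce logic.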
12
Big-data Certificate Program
The official name is the Data-intensive Computing Certificate.
o Initiated with support from the NSF TUES program
o Approved by the SUNY system in Fall 2011
o Offered by the University
o For enrolled undergraduates: any major!
Details of the program
o CS1, CS2
o Distributed Systems (CSE486): Pre-req CS2
o Data-intensive Computing Systems (CSE487): Pre-req CS2
o An elective in the discipline of choice (Ex: BIO4XY or MGS4XY)
o A capstone project applying data-intensive computing (Ex: BIO499 or MGS499)
13
Evaluation
14
Findings
There is high demand from students for “data-intensive” and big data computation related courses.
The certificate program is hard for non-CSE majors.
There is high demand for big-data skills from employers.
Educators and administrators need to be educated about big data. (Remember the times we educated people on object orientation, Java, web-enabling, etc.)
It is imperative that we improve the preparedness of our workforce at all levels for Big Data skills for global competitiveness.
Just one course or a single certificate is NOT enough: we need continuous and repeated exposure to Big Data in various contexts.
It is often very hard to create and sustain a new curriculum.
How can we address these challenges?
15
Recommendations
Introduce big-data concepts as an integral part of the UG curriculum
o For example, for CS: simple word count of big data in CS1, a map-reduce algorithm in CS2, cloud storage and big-table in Database Systems, Hadoop in Distributed Systems, and the entire big-data analytics in other elective courses such as Machine Learning and Data Mining
Use compelling examples with real-world datasets
Train the educators: big-data professional development for the academic core is critical
Expose the administrators to the use of Big Data applications/tools in all possible areas, such as institutional analysis: data collected at various educational institutions is a gold mine for macro-level analytics (“What is the trend?” “Are they learning?”)
Train the counselors who advise high school students, and college entry-level counselors
Include the community colleges and four-year colleges
Seek investment from major industries (mentoring, educator days, etc.)
16
Demos
Simple word count using the MR model on HDFS on the local machine
o Foundation for many algorithms such as word cloud
o Simple and easy to understand
o Project Gutenberg
Simple co-occurrence analysis of Twitter data
o Twitter has donated the entire collection of tweets to the Library of Congress
Amazon MR workflow and working with AWS facilities
Finally, a sample run of a 10-million-node tree on a compute cluster at the Center for Computational Research (CCR) at Buffalo
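The co-occurrence demo is not reproduced in this transcript, so the following is only a hedged sketch of one common way to formulate such an analysis: emit a count of 1 for every unordered pair of distinct words that appear in the same tweet, then sum per pair. The names `map_pairs` and `cooccurrence` are hypothetical, the tweets are made-up strings rather than real Twitter data, and everything runs in a single Python process instead of on HDFS.

```python
from collections import Counter
from itertools import combinations


def map_pairs(tweet):
    """Map: emit ((word_a, word_b), 1) for each unordered pair of distinct
    words in one tweet; sorting makes (a, b) and (b, a) the same key."""
    words = sorted(set(tweet.lower().split()))
    return [(pair, 1) for pair in combinations(words, 2)]


def cooccurrence(tweets):
    """Reduce: sum the emitted counts for every word pair."""
    counts = Counter()
    for tweet in tweets:
        for pair, one in map_pairs(tweet):
            counts[pair] += one
    return counts


if __name__ == "__main__":
    sample = ["big data rocks", "big data"]  # illustrative input only
    for pair, count in sorted(cooccurrence(sample).items()):
        print(pair, count)
```

The structure mirrors word count, with a word pair as the key instead of a single word, which is why word count is described above as a foundation for other algorithms.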
17
Summary
We explored the need for data-intensive or big-data computing.
We illustrated Big Data concepts and demonstrated cloud capabilities through simple applications.
Data-intensive computing on the cloud is an essential and indispensable skill for the workforce of today and tomorrow.
University at Buffalo has implemented a SUNY-wide Certificate Program in Data-intensive Computing.
One actionable step is to form a group of people who are passionate about Big Data and who work at introducing it in their courses/projects.
18
References & useful links
Flu prediction reference: J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (19 February 2009). http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
Twitter and Library of Congress: A. Watters. How Library of Congress is Building the Twitter Archive. http://radar.oreilly.com/2011/06/library-of-congress-twitter-archive.html, last viewed July 2012.
Project web page for all the project material, including course description, course material, project description, several presentations, useful links, and references: http://www.cse.buffalo.edu/faculty/bina/DataIntensive