

Hadoop Technical Workshop Academic Hadoop Usage

Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course Staff Requirements

UW: Course Summary Course title: “Problem Solving on Large Scale Clusters” Primary purpose: developing large-scale problem solving skills Format: 6 weeks of lectures + labs, 4 week project

UW: Course Goals Think creatively about large-scale problems in a parallel fashion; design parallel solutions Manage large data sets under memory and bandwidth limitations Develop a foundation in parallel algorithms for large-scale data Identify and understand engineering trade-offs in real systems

Lectures 2 hours, once per week Half formal lecture, half discussion Mostly covered systems & background Included group activities for reinforcement

Classroom Activities Worksheets included pseudo-code programming, working through examples –Performed in groups of 2 to 3 Small-group discussions about engineering and systems design –Groups of ~10 –Course staff facilitated, but mostly open-ended

Readings No textbook One academic paper per week –E.g., “MapReduce: Simplified Data Processing on Large Clusters” –Short homework covered comprehension Formed basis for discussion

Lecture Schedule Introduction to Distributed Computing MapReduce: Theory and Implementation Networks and Distributed Reliability Real-World Distributed Systems Distributed File Systems Other Distributed Systems

Intro to Distributed Computing What is distributed computing? Flynn’s Taxonomy Brief history of distributed computing Some background on synchronization and memory sharing

MapReduce Brief refresher on functional programming MapReduce slides –More detailed version of module I Discussion on MapReduce

Networking and Reliability Crash course in networking Distributed systems reliability –What is reliability? –How do distributed systems fail? –ACID, other metrics Discussion: Does MapReduce provide reliability?

Real Systems Design and implementation of Nutch Tech talk from Googler on Google Maps

Distributed File Systems Introduced GFS Discussed implementation of NFS and AndrewFS for comparison

Other Distributed Systems BOINC: Another platform Broader definition of distributed systems –DNS –One Laptop per Child project

Labs Also 2 hours, once per week Focused on applications of distributed systems Four lab projects over six weeks

Lab Schedule Introduction to Hadoop, Eclipse Setup, Word Count Inverted Index PageRank on Wikipedia Clustering on Netflix Prize Data
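The inverted index lab can be pictured with a small in-memory sketch. This is not the course's lab code and it does not use the Hadoop API; the function names (`map_doc`, `shuffle`, `reduce_postings`) are hypothetical and only imitate the map, shuffle, and reduce phases in plain Python.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map: emit one (word, doc_id) pair for every word in the document."""
    return [(word.lower(), doc_id) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group all values that share a key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_postings(word, doc_ids):
    """Reduce: collapse duplicates into a sorted posting list for one word."""
    return word, sorted(set(doc_ids))

def inverted_index(docs):
    """Run the three phases over a {doc_id: text} dictionary."""
    pairs = [p for doc_id, text in docs.items() for p in map_doc(doc_id, text)]
    return dict(reduce_postings(w, ids) for w, ids in shuffle(pairs).items())
```

Calling `inverted_index({"d1": "hadoop map", "d2": "map reduce map"})` maps each word to the documents that contain it. On a cluster the shuffle step is done by the framework, not by user code.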

Design Projects Final four weeks of quarter Teams of 1 to 3 students Students proposed topic, gathered data, developed software, and presented solution

Example: Geozette Image © Julia Schwartz

Example: Galaxy Simulation Image © Slava Chernyak, Mike Hoak

Other Projects Bayesian Wikipedia spam filter Unsupervised synonym extraction Video collage rendering

Ongoing research: traceroutes Analyze time-stamped traceroute data to model changes in Internet router topology –4.5 GB of data/day * 1.5 years –12 billion traces from 200 PlanetLab sites Calculates prevalence and persistence of routes between hosts
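To get a feel for the scale quoted above, the figures can be multiplied out. The sketch below assumes 365-day years and decimal gigabytes; the constant and function names are illustrative only.

```python
GB_PER_DAY = 4.5       # from the slide
YEARS = 1.5            # from the slide
TOTAL_TRACES = 12e9    # from the slide
SITES = 200            # from the slide
DAYS_PER_YEAR = 365    # assumption: leap days ignored

def total_gigabytes(gb_per_day=GB_PER_DAY, years=YEARS):
    """Total collected volume in GB over the whole measurement period."""
    return gb_per_day * DAYS_PER_YEAR * years

def traces_per_site(total=TOTAL_TRACES, sites=SITES):
    """Average number of traces contributed by each PlanetLab site."""
    return total / sites
```

Under these assumptions the collection comes to about 2,464 GB (roughly 2.5 TB), with an average of 60 million traces per site.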

Ongoing research: dynamic program traces Dynamic memory trace data from simulators can reach hundreds of GB Existing work focuses on sampling New capability: record all accesses and post-process with Hadoop

Common Features Hadoop! Used publicly-available web APIs for data Many involved reading papers for algorithms and translating into MapReduce framework

Background Topics Programming Languages Systems: –Operating Systems –File Systems –Networking Databases

Programming Languages MapReduce is based on functional programming map and fold FP is taught in one quarter, but not reinforced –“Crash course” necessary –Worksheets to pose short problems in terms of map and fold –Immutable data a key concept
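A worksheet-style problem in terms of map and fold might look like the following word count. This is a minimal sketch, not course material: `functools.reduce` stands in for fold, and the fold step returns a new dictionary each time to illustrate the immutable-data idea (at some cost in efficiency).

```python
from functools import reduce

def map_phase(line):
    """Map: turn one line of text into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def fold_counts(acc, pair):
    """Fold step: build a NEW dict with the pair's count added; acc is not mutated."""
    word, count = pair
    return {**acc, word: acc.get(word, 0) + count}

def word_count(lines):
    """Apply map to every line, then fold all emitted pairs into one result."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce(fold_counts, pairs, {})
```

Because `map_phase` looks at one line at a time and never touches shared state, the lines could be processed in any order or on different machines.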

Multithreaded programming Taught in OS course at Washington –Not a prerequisite! Students need to understand multiple copies of same method running in parallel
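The idea of several copies of one method running at once can be shown with Python's standard `threading` module. This is an illustrative sketch with hypothetical names; each thread runs the same `count_words` function on its own chunk and uses a lock only when touching the shared result list.

```python
import threading

def count_words(lines, totals, lock):
    """Worker: every thread runs this same function on a different chunk."""
    local = sum(len(line.split()) for line in lines)  # private to this thread
    with lock:                                        # guard the shared list
        totals.append(local)

def parallel_word_total(chunks):
    """Start one thread per chunk, wait for all of them, then combine results."""
    totals, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=count_words, args=(chunk, totals, lock))
        for chunk in chunks
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals)
```

The local variable `local` exists once per thread, while `totals` is shared by all of them, which is the distinction students need before moving on to mappers and reducers.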

File Systems Necessary to understand GFS Comparison to NFS, other distributed file systems relevant

Networking TCP/IP Concepts of “connection,” network splits, other failure modes Bandwidth issues

Other Systems Topics Process Scheduling Synchronization Memory coherency

Databases Concept of shared consistency model Consensus ACID characteristics –Journaling –Multi-phase commit processes

Course Staff Instructor (me!) Two undergrad teaching assistants –Helped facilitate discussions, directed labs One student sys admin –Worked only about three hours/week

Preparation Teaching assistants had taken previous iteration of course in winter Lectures retooled based on feedback from that quarter –Added reasonably large amount of background material Ran & solved all labs in advance

The Course: What Worked Discussions –Often covered broad range of subjects Hands-on lab projects “Active learning” in classroom Independent design projects

Things to Improve: Coverage Algorithms were not reinforced during lecture –Students requested much more time be spent on “how to parallelize an iterative algorithm” Background material was very fast-paced
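One way to illustrate "how to parallelize an iterative algorithm" is PageRank, where each iteration is a single map and reduce round and a driver loop repeats it. The sketch below is plain Python, not Hadoop code. It assumes every page has at least one out-link and that all link targets appear as keys; the damping factor of 0.85 is a conventional choice and does not come from the slides.

```python
from collections import defaultdict

DAMPING = 0.85  # assumption: conventional damping factor

def pagerank_step(links, ranks):
    """One iteration: map emits rank shares, reduce sums them per page."""
    n = len(links)
    contributions = defaultdict(float)
    # Map phase: each page sends an equal share of its rank to every out-link.
    for page, outlinks in links.items():
        for target in outlinks:
            contributions[target] += ranks[page] / len(outlinks)
    # Reduce phase: combine the shares each page received into its new rank.
    return {
        page: (1 - DAMPING) / n + DAMPING * contributions[page]
        for page in links
    }

def pagerank(links, iterations=20):
    """Driver loop: the output of one round is the input of the next."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(links, ranks)
    return ranks
```

The work inside each round is independent per page, so it parallelizes well; the dependency between rounds is what forces the outer loop to stay sequential.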

Things to Improve: Projects Labs could have used a moderated/scripted discussion component –Just “jumping in” to the code proved difficult –No time was devoted to Hadoop itself in lecture –Clustering lab should be split in two Design projects could have used more time

Future Course Ideas Overview Systems course Web application design Integration in other applications courses Misc. content ideas Making your own data sets

Systems Course Focused on parallel & distributed systems Hadoop included in comparison to other cluster techniques Emphasis on performance, profiling, and management

Topic Map

Introductory Material Networking basics Multithreading

Distributed Reliability Reliability metrics Methods of failure Techniques to combat failure –Journaling, n-phase commit Techniques to achieve consensus –Leader election, voting
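The voting and leader-election bullets can be made concrete with a deliberately simplified sketch. It ignores message loss, timeouts, and concurrent elections, which real protocols must handle; the "highest live ID wins" rule is one common simplification and is an assumption here, not something the slides specify.

```python
from collections import Counter

def majority_value(votes):
    """Return the value backed by a strict majority of votes, or None."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) // 2 else None

def elect_leader(live_node_ids):
    """Pick the highest ID among nodes believed to be alive, or None."""
    return max(live_node_ids, default=None)
```

A strict majority matters because any two majorities of the same group overlap in at least one node, so two conflicting values cannot both win.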

Parallel Processing How to parallelize algorithms Parallelization in one machine vs. across several machines –Techniques applicable to one vs. other –Cache coherency –Memory distribution

Parallelization Frameworks Multithreading on one machine RPC, MPI, PVM Higher-level scheduling –Condor vs. Hadoop Tradeoffs in design

Algorithm Design Comparison Matrix multiplication, sorting, searching, PageRank, etc. –… For a standard distributed system –… For Hadoop

Distributed Storage NFS, AFS, GFS Database clustering techniques –Distributed SQL databases –HBase –Distributed memory caches, object stores

Lab Focus Implementing parallel and distributed algorithms Experiment with different frameworks Perform measurements –Bandwidth consumption –Latency & performance Code analysis

Final Thoughts Lots of low-level programming involved Appropriate mostly for last-year students Hadoop community would find scholarly benchmarks useful –wiki.apache.org/hadoop/ProjectSuggestions –“JIRA” bug/feature request database

Web Application Design

Basic Web Development Topics

Large-Scale Web Server Technology

Next Steps RPC –Internal RPC; message queues and distributed back-ends –Thrift, Protocol Buffers –SOAP and XMLHttpRequest

Scaling Really Big Nutch/Lucene Hadoop Amazon Web Services

Data Aggregation and Analysis How to crawl and parse web pages Generate link graphs Perform analyses (e.g., PageRank) Semantic analysis
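The "parse web pages, generate link graphs" steps can be sketched with Python's standard `html.parser` module. This assumes the pages have already been fetched into a dictionary of URL to HTML text; crawling itself (for example with Nutch) is out of scope, and relative links are kept as written instead of being resolved.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_link_graph(pages):
    """Map each page URL to the sorted, de-duplicated list of its out-links."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        graph[url] = sorted(set(parser.links))
    return graph
```

The resulting dictionary is an adjacency list, which is the usual input shape for link analyses such as PageRank.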

Web Site Tuning Web page layout optimization –Speed –Accessibility –Ease-of-use Server log analysis –User-targeted site features Service replication –Consistency, latency issues

Security and the Web Data sanitization SQL injection attacks DOS attacks Data collection methods & ethics –User data privacy
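For the SQL injection topic, a short example using Python's built-in `sqlite3` module shows the standard defense of parameterized queries. The table and function names are hypothetical and exist only for the demonstration.

```python
import sqlite3

def make_demo_db():
    """Create an in-memory database with a small users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn

def find_user(conn, name):
    """Look up a user; the ? placeholder makes the driver treat name as data."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

With the placeholder, an input such as `x' OR '1'='1` is compared literally against the name column and matches nothing. Building the same query by string concatenation would instead let that input rewrite the WHERE clause.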

Projects Code labs in Python, PHP, Ruby Simple database design Building a small search engine with Nutch/Lucene Design scalable architecture and run on Amazon EC2 Web site design project –Security/penetration analysis of other teams’ sites

Final Course Thoughts Web-based services are increasingly relevant –Exciting new opportunity for students –Example course in action: cse454/07au/

Using Hadoop in Other Courses Hadoop is a natural component for many existing courses –Artificial intelligence –Web search –Data mining / information retrieval –Databases (HBase) –Networking –Computational biology? Graphics?

Low Level Module “MapReduce in a week:” code.google.com/edu/content/parallel.html 3-lecture series on distributed processing and Hadoop; enough to get students started … more discussion of online resources next

AI/Data Mining Ideas Use Nutch to perform a web crawl and classify pages using Bayesian analysis Hadoop makes processing easy –Data sanitization –Classifier engine (Use WEKA right in Hadoop) –HDFS for document storage/retrieval/search

AI/Data Mining Ideas Extract semantically valuable data from web pages –E.g., match names to phone numbers, –News articles to locations Hadoop allows students to explore a much broader scale than previously possible

Graphics Examples Re-encode a render pipeline as a set of MapReduce tasks Use feature detection + clustering on a corpus of images to find images with similar shapes/features

Student-Generated Ideas Data processing with Yahoo Pig Distributed SQL databases Distributed systems “ground-up” projects: –Sockets, then RPC, then Hadoop Other concepts: BitTorrent, DHTs, P2P Other frameworks: e.g., BOINC projects

Making Datasets Your department is full of data! –Graphics data –Sensor data from RFID, Ubicomp, robotics… –Measurements from networking lab –Ask around: Someone has a few dozen gigs of log files to donate –(What happens if you leave Ethereal in promiscuous mode for a week straight?)

Making Datasets Other departments are full of data! –Biology –Chemistry –Physics (campus particle accelerator?)

Making Datasets The web is full of data! –Use Nutch to crawl web sites –Wikis are especially good (hmm..)

Conclusions Hadoop isn’t a full course in itself –But it combines well with a lot of other ideas Can be used for at least half a course … Or as little as a week or two Look around you – Hadoop can be applied to more areas than you might think

Open Source Tools for Teaching

Overview Slides Lab Materials Readings Video Lectures Datasets

Slides Multiple short course outlines available: “MapReduce in a week” “Introduction to Problem Solving on Large Scale Clusters” “MapReduce Mini Lecture Series”

Labs Lab designs from UW course available –“Introduction to MapReduce” –“A Simple Inverted Index” –“PageRank on the Wikipedia Corpus” –“Clustering the Netflix Movie Data”

Readings Google has several papers available –“Introduction to Distributed Systems” –“MapReduce: Simplified Data Processing on Large Clusters” –“The Google File System” –“Bigtable: A Distributed Storage System for Structured Data”

Lecture Videos MapReduce Mini-series

Datasets: Wikipedia Wikipedia supports free “bulk download” of data –Current site snapshot (big) –Entire revision history (massive) Eliminates need for Nutch crawls Good for indexing, search labs

Datasets: Netflix Netflix’s web site provides recommendations Theory: Other people watched movie X, then Y. You watched X, you might like Y. Open question: Can you provide more useful recommendations than their current system?
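The "watched X, then Y" theory can be sketched as simple co-occurrence counting. This is an illustration of the idea on the slide, not Netflix's method; the input format (a dictionary of user to list of watched movies) and the function names are assumptions, and viewing order is ignored.

```python
from collections import Counter
from itertools import permutations

def co_watch_counts(histories):
    """Count, for each ordered pair (x, y), how many users watched both."""
    counts = Counter()
    for movies in histories.values():
        counts.update(permutations(sorted(set(movies)), 2))
    return counts

def recommend(histories, user, top_n=3):
    """Suggest unseen movies that most often co-occur with the user's movies."""
    seen = set(histories[user])
    scores = Counter()
    for (x, y), n in co_watch_counts(histories).items():
        if x in seen and y not in seen:
            scores[y] += n
    return [movie for movie, _ in scores.most_common(top_n)]
```

Counting pairs per user is a natural map step and summing the counts is a natural reduce step, which is one reason this kind of data suits a Hadoop lab.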

Datasets: Netflix The Netflix Prize: $1,000,000 if you can find a better algorithm, based on their criteria They provide you with a large dataset of existing rental associations to work with

Conclusions Lots of starter materials available on the web –Good for reference –Get teaching assistants up to speed Readings, sample worksheets and other resources are open content & ready to use

Aaron Kimball