Hadoop Technical Workshop
Academic Hadoop Usage

Overview
University of Washington Curriculum
–Teaching Methods
–Reflections
–Student Background
–Course Staff Requirements
UW: Course Summary
Course title: “Problem Solving on Large Scale Clusters”
Primary purpose: developing large-scale problem-solving skills
Format: 6 weeks of lectures + labs, 4-week project

UW: Course Goals
Think creatively about large-scale problems in a parallel fashion; design parallel solutions
Manage large data sets under memory and bandwidth limitations
Develop a foundation in parallel algorithms for large-scale data
Identify and understand engineering trade-offs in real systems

Lectures
2 hours, once per week
Half formal lecture, half discussion
Mostly covered systems & background
Included group activities for reinforcement

Classroom Activities
Worksheets included pseudo-code programming, working through examples
–Performed in groups of 2-3
Small-group discussions about engineering and systems design
–Groups of ~10
–Course staff facilitated, but mostly open-ended

Readings
No textbook
One academic paper per week
–E.g., “MapReduce: Simplified Data Processing on Large Clusters”
–Short homework covered comprehension
Formed the basis for discussion

Lecture Schedule
Introduction to Distributed Computing
MapReduce: Theory and Implementation
Networks and Distributed Reliability
Real-World Distributed Systems
Distributed File Systems
Other Distributed Systems
Intro to Distributed Computing
What is distributed computing?
Flynn’s Taxonomy
Brief history of distributed computing
Some background on synchronization and memory sharing

MapReduce
Brief refresher on functional programming
MapReduce slides
–More detailed version of module I
Discussion on MapReduce

Networking and Reliability
Crash course in networking
Distributed systems reliability
–What is reliability?
–How do distributed systems fail?
–ACID, other metrics
Discussion: does MapReduce provide reliability?

Real Systems
Design and implementation of Nutch
Tech talk from a Googler on Google Maps

Distributed File Systems
Introduced GFS
Discussed implementations of NFS and AndrewFS for comparison

Other Distributed Systems
BOINC: another platform
Broader definition of distributed systems
–DNS
–One Laptop per Child project
Labs
Also 2 hours, once per week
Focused on applications of distributed systems
Four lab projects over six weeks

Lab Schedule
Introduction to Hadoop, Eclipse Setup, Word Count
Inverted Index
PageRank on Wikipedia
Clustering on Netflix Prize Data
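The first lab’s Word Count exercise reduces to a map, a shuffle, and a reduce. A minimal Python sketch of that pipeline (the actual labs were written against the Hadoop Java API; this in-memory version, including the function names, is illustrative only):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word, as a Hadoop Mapper would.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum all counts for one key, as a Hadoop Reducer would.
    return word, sum(counts)

def map_reduce(lines):
    # "Shuffle" phase: group intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

print(map_reduce(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Hadoop performs the shuffle across machines and streams values to reducers, but the data flow students must reason about is exactly this one.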
Design Projects
Final four weeks of the quarter
Teams of 1-3 students
Students proposed a topic, gathered data, developed software, and presented a solution

Example: Geozette
Image © Julia Schwartz

Example: Galaxy Simulation
Image © Slava Chernyak, Mike Hoak

Other Projects
Bayesian Wikipedia spam filter
Unsupervised synonym extraction
Video collage rendering
Ongoing research: traceroutes
Analyze time-stamped traceroute data to model changes in Internet router topology
–4.5 GB of data/day × 1.5 years
–12 billion traces from 200 PlanetLab sites
Calculates prevalence and persistence of routes between hosts

Ongoing research: dynamic program traces
Dynamic memory trace data from simulators can reach hundreds of GB
Existing work focuses on sampling
New capability: record all accesses and post-process with Hadoop

Common Features
Hadoop!
Used publicly available web APIs for data
Many involved reading papers for algorithms and translating them into the MapReduce framework
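At a small scale, the prevalence computation is just counting, per host pair, how often each observed route appears. A toy Python sketch of that per-pair reduce step (the real analysis ran as a Hadoop job over the full trace set; the function name and hop labels here are made up):

```python
from collections import Counter

def route_prevalence(traces):
    """For one (src, dst) pair, return the dominant route and the
    fraction of observations in which it appeared (its prevalence)."""
    counts = Counter(tuple(route) for route in traces)
    route, n = counts.most_common(1)[0]
    return list(route), n / len(traces)

# Three traceroutes between the same pair of hosts (hop names invented).
traces = [["a", "b", "c"], ["a", "b", "c"], ["a", "d", "c"]]
print(route_prevalence(traces))  # dominant route ['a', 'b', 'c'], prevalence 2/3
```

In the MapReduce framing, the map step keys each trace by its (src, dst) pair, and this function is the reducer.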
Background Topics
Programming Languages
Systems:
–Operating Systems
–File Systems
–Networking
Databases

Programming Languages
MapReduce is based on functional programming’s map and fold
FP is taught in one quarter, but not reinforced
–“Crash course” necessary
–Worksheets posed short problems in terms of map and fold
–Immutable data is a key concept
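A worksheet-style example in the spirit of the course (this particular problem is invented, not an actual handout): summing the squares of a list expressed purely as map plus fold, with no mutation:

```python
from functools import reduce

nums = [1, 2, 3, 4]

# map: transform each element independently (trivially parallelizable).
squares = list(map(lambda x: x * x, nums))

# fold: combine results with an associative operator and an identity value.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(total)  # 30
```

The key observation for students is that because neither step mutates shared state, the map can be split across machines and the fold becomes the reduce phase.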
Multithreaded Programming
Taught in the OS course at Washington
–Not a prerequisite!
Students need to understand multiple copies of the same method running in parallel
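The idea of “multiple copies of the same method running in parallel” can be demonstrated in a few lines. A minimal Python sketch (an assumed teaching example; Hadoop tasks are actually separate JVM processes, but the mental model is similar):

```python
import threading

counts = {}
lock = threading.Lock()

def worker(name, items):
    # The same function body runs concurrently in every thread,
    # each with its own local variables but shared global state.
    local_total = len(items)
    with lock:  # shared state must be protected
        counts[name] = local_total

threads = [threading.Thread(target=worker, args=(f"task-{i}", range(i + 1)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(counts.items()))
# [('task-0', 1), ('task-1', 2), ('task-2', 3), ('task-3', 4)]
```

The contrast worth drawing in class: MapReduce tasks share nothing at runtime, so the lock disappears and coordination moves into the shuffle.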
File Systems
Necessary to understand GFS
Comparison to NFS and other distributed file systems is relevant

Networking
TCP/IP
Concepts of “connection,” network splits, other failure modes
Bandwidth issues

Other Systems Topics
Process scheduling
Synchronization
Memory coherency

Databases
Concept of a shared consistency model
Consensus
ACID characteristics
–Journaling
–Multi-phase commit processes
Course Staff
Instructor (me!)
Two undergrad teaching assistants
–Helped facilitate discussions, directed labs
One student sysadmin
–Worked only about three hours/week

Preparation
Teaching assistants had taken the previous iteration of the course in winter
Lectures retooled based on feedback from that quarter
–Added a reasonably large amount of background material
Ran & solved all labs in advance

The Course: What Worked
Discussions
–Often covered a broad range of subjects
Hands-on lab projects
“Active learning” in the classroom
Independent design projects

Things to Improve: Coverage
Algorithms were not reinforced during lecture
–Students requested much more time be spent on “how to parallelize an iterative algorithm”
Background material was very fast-paced

Things to Improve: Projects
Labs could have used a moderated/scripted discussion component
–Just “jumping in” to the code proved difficult
–No lecture time was devoted to Hadoop itself
–The clustering lab should be split in two
Design projects could have used more time

Future Course Ideas
Overview
Systems course
Web application design
Integration into other courses
Misc. content ideas
Making your own data sets
Systems Course
Focused on parallel & distributed systems
Hadoop included in comparison to other cluster techniques
Emphasis on performance, profiling, and management

Topic Map

Introductory Material
Networking basics
Multithreading

Distributed Reliability
Reliability metrics
Modes of failure
Techniques to combat failure
–Journaling, n-phase commit
Techniques to achieve consensus
–Leader election, voting
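A toy sketch of the voting idea above (a deliberately simplified majority vote, not a real consensus protocol like Paxos; the function name and vote values are invented for illustration):

```python
from collections import Counter

def majority_vote(votes):
    """Return the proposed value only if a strict majority of nodes
    agrees on it; otherwise return None (no consensus reached)."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else None

print(majority_vote(["commit", "commit", "abort"]))  # commit
print(majority_vote(["commit", "abort"]))            # None: split vote
```

The strict-majority requirement is the point to dwell on in lecture: it is what prevents two network partitions from each deciding a different value.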
Parallel Processing
How to parallelize algorithms
Parallelization within one machine vs. across several machines
–Techniques applicable to one vs. the other
–Cache coherency
–Memory distribution

Parallelization Frameworks
Multithreading on one machine
RPC, MPI, PVM
Higher-level scheduling
–Condor vs. Hadoop
Tradeoffs in design

Algorithm Design Comparison
Matrix multiplication, sorting, searching, PageRank, etc.
–…for a standard distributed system
–…for Hadoop

Distributed Storage
NFS, AFS, GFS
Database clustering techniques
–Distributed SQL databases
–HBase
–Distributed memory caches, object stores

Lab Focus
Implementing parallel and distributed algorithms
Experimenting with different frameworks
Performing measurements
–Bandwidth consumption
–Latency & performance
Code analysis

Final Thoughts
Lots of low-level programming involved
Appropriate mostly for final-year students
Hadoop community would find scholarly benchmarks useful
–wiki.apache.org/hadoop/ProjectSuggestions
–“JIRA” bug/feature request database
Web Application Design
Basic Web Development Topics
Large-Scale Web Server Technology
Next Steps
RPC
–Internal RPC; message queues and distributed back-ends
–Thrift, Protocol Buffers
–SOAP and XMLHttpRequest

Scaling Really Big
Nutch/Lucene
Hadoop
Amazon Web Services

Data Aggregation and Analysis
How to crawl and parse web pages
Generate link graphs
Perform analyses (e.g., PageRank)
Semantic analysis
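The PageRank analysis above iterates a simple rank-redistribution step that maps naturally onto MapReduce (each iteration is one job). A toy in-memory Python sketch over an invented three-page link graph (the function and parameter names are illustrative, not course code):

```python
def pagerank(links, iterations=20, d=0.85):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        # "Map" step: each page sends rank/out_degree along its links.
        for page, outs in links.items():
            for out in outs:
                incoming[out] += rank[page] / len(outs)
        # "Reduce" step: combine contributions with the damping factor.
        rank = {p: (1 - d) / len(pages) + d * incoming[p] for p in pages}
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c (it receives links from both a and b)
```

In the Hadoop version, the inner loops become the mapper and reducer, and the graph structure must be carried along with the rank values between iterations, which is exactly the wrinkle the Wikipedia lab forces students to confront.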
Web Site Tuning
Web page layout optimization
–Speed
–Accessibility
–Ease of use
Server log analysis
–User-targeted site features
Service replication
–Consistency, latency issues

Security and the Web
Data sanitization
SQL injection attacks
DoS attacks
Data collection methods & ethics
–User data privacy

Projects
Code labs in Python, PHP, or Ruby
Simple database design
Building a small search engine with Nutch/Lucene
Designing a scalable architecture and running it on Amazon EC2
Web site design project
–Security/penetration analysis of other teams’ sites

Final Course Thoughts
Web-based services are increasingly relevant
–Exciting new opportunity for students
–Example course in action: cse454/07au/
Using Hadoop in Other Courses
Hadoop is a natural component for many existing courses
–Artificial intelligence
–Web search
–Data mining / information retrieval
–Databases (HBase)
–Networking
–Computational biology? Graphics?

Low-Level Module
“MapReduce in a week:” code.google.com/edu/content/parallel.html
3-lecture series on distributed processing and Hadoop; enough to get students started
…more discussion of online resources next

AI/Data Mining Ideas
Use Nutch to perform a web crawl and classify pages using Bayesian analysis
Hadoop makes processing easy
–Data sanitization
–Classifier engine (use WEKA right in Hadoop)
–HDFS for document storage/retrieval/search
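The Bayesian classification step is essentially per-class word counting, which parallelizes trivially. A tiny in-memory naive Bayes sketch (the course idea above uses WEKA inside Hadoop; this toy classifier, its class names, and its training sentences are invented for illustration):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.class_counts = Counter()            # class -> document count

    def train(self, label, text):
        self.class_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = len(set().union(*self.word_counts.values()))
        total_docs = sum(self.class_counts.values())
        def log_prob(label):
            words = self.word_counts[label]
            total = sum(words.values())
            score = math.log(self.class_counts[label] / total_docs)
            for w in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a class.
                score += math.log((words[w] + 1) / (total + vocab))
            return score
        return max(self.class_counts, key=log_prob)

nb = NaiveBayes()
nb.train("spam", "buy cheap pills now")
nb.train("ham", "lecture notes for the distributed systems course")
print(nb.classify("cheap pills"))  # spam
```

At web-crawl scale, the `train` counters become the output of a counting MapReduce job over the crawled pages, which is where Hadoop earns its keep.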
AI/Data Mining Ideas
Extract semantically valuable data from web pages
–E.g., match names to phone numbers, news articles to locations
Hadoop allows students to explore at a much broader scale than previously possible

Graphics Examples
Re-encode a render pipeline as a set of MapReduce tasks
Use feature detection + clustering on a corpus of images to find images with similar shapes/features

Student-Generated Ideas
Data processing with Yahoo Pig
Distributed SQL databases
Distributed systems “ground-up” projects:
–Sockets, then RPC, then Hadoop
Other concepts: BitTorrent, DHTs, P2P
Other frameworks: e.g., BOINC projects
Making Datasets
Your department is full of data!
–Graphics data
–Sensor data from RFID, ubicomp, robotics…
–Measurements from the networking lab
–Ask around: someone has a few dozen gigs of log files to donate
–(What happens if you leave Ethereal in promiscuous mode for a week straight?)

Making Datasets
Other departments are full of data!
–Biology
–Chemistry
–Physics (campus particle accelerator?)

Making Datasets
The web is full of data!
–Use Nutch to crawl web sites
–Wikis are especially good (hmm…)

Conclusions
Hadoop isn’t a full course in itself
–But it combines well with many other ideas
Can be used for at least half a course
…or as little as a week or two
Look around you: Hadoop can be applied to more areas than you might think
Open Source Tools for Teaching
Overview
Slides
Lab Materials
Readings
Video Lectures
Datasets

Slides
Multiple short course outlines available:
–“MapReduce in a week”
–“Introduction to Problem Solving on Large Scale Clusters”
–“MapReduce Mini Lecture Series”

Labs
Lab designs from the UW course available:
–“Introduction to MapReduce”
–“A Simple Inverted Index”
–“PageRank on the Wikipedia Corpus”
–“Clustering the Netflix Movie Data”

Readings
Google has several papers available:
–“Introduction to Distributed Systems”
–“MapReduce: Simplified Data Processing on Large Clusters”
–“The Google File System”
–“Bigtable: A Distributed Storage System for Structured Data”

Lecture Videos
MapReduce Mini-series
Datasets: Wikipedia
Wikipedia supports free “bulk download” of its data
–Current site snapshot (big)
–Entire revision history (massive)
Eliminates the need for Nutch crawls
Good for indexing and search labs

Datasets: Netflix
Netflix’s web site provides recommendations
Theory: other people watched movie X, then Y; you watched X, so you might like Y
Open question: can you provide more useful recommendations than their current system?

Datasets: Netflix
The Netflix Prize: $1,000,000 if you can find a better algorithm, based on their criteria
They provide a large dataset of existing rental associations to work with
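A toy version of the “watched X, then Y” co-occurrence idea over a made-up rental log (the clustering lab runs the analogous computation over the full Netflix Prize data in Hadoop; the function name and movie titles here are invented):

```python
from collections import Counter

def recommend(histories, movie):
    """Recommend the title most often rented alongside `movie`."""
    co = Counter()
    for rented in histories:
        if movie in rented:
            co.update(m for m in rented if m != movie)
    return co.most_common(1)[0][0] if co else None

histories = [
    {"Alien", "Aliens", "Blade Runner"},
    {"Alien", "Aliens"},
    {"Alien", "Amelie"},
]
print(recommend(histories, "Alien"))  # Aliens
```

As a MapReduce job, the mapper emits movie pairs from each rental history and the reducer sums pair counts; beating Netflix’s own system requires considerably more than this, which is what makes the Prize a good open-ended lab.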
Conclusions
Lots of starter materials available on the web
–Good for reference
–Get teaching assistants up to speed
Readings, sample worksheets, and other resources are open content & ready to use
Aaron Kimball