Hadoop Technical Workshop: Academic Hadoop Usage

1 Hadoop Technical Workshop Academic Hadoop Usage

2 Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course Staff Requirements

3 UW: Course Summary Course title: “Problem Solving on Large Scale Clusters” Primary purpose: developing large-scale problem solving skills Format: 6 weeks of lectures + labs, 4-week project

4 UW: Course Goals Think creatively about large-scale problems in a parallel fashion; design parallel solutions Manage large data sets under memory and bandwidth limitations Develop a foundation in parallel algorithms for large-scale data Identify and understand engineering trade-offs in real systems

5 Lectures 2 hours, once per week Half formal lecture, half discussion Mostly covered systems & background Included group activities for reinforcement

6 Classroom Activities Worksheets included pseudo-code programming, working through examples –Performed in groups of 2 to 3 Small-group discussions about engineering and systems design –Groups of ~10 –Course staff facilitated, but mostly open-ended

7 Readings No textbook One academic paper per week –E.g., “Simplified Data Processing on Large Clusters” –Short homework covered comprehension Formed basis for discussion

8 Lecture Schedule Introduction to Distributed Computing MapReduce: Theory and Implementation Networks and Distributed Reliability Real-World Distributed Systems Distributed File Systems Other Distributed Systems

9 Intro to Distributed Computing What is distributed computing? Flynn’s Taxonomy Brief history of distributed computing Some background on synchronization and memory sharing

10 MapReduce Brief refresher on functional programming MapReduce slides –More detailed version of module I Discussion on MapReduce

11 Networking and Reliability Crash course in networking Distributed systems reliability –What is reliability? –How do distributed systems fail? –ACID, other metrics Discussion: Does MapReduce provide reliability?

12 Real Systems Design and implementation of Nutch Tech talk from Googler on Google Maps

13 Distributed File Systems Introduced GFS Discussed implementation of NFS and AndrewFS for comparison

14 Other Distributed Systems BOINC: Another platform Broader definition of distributed systems –DNS –One Laptop per Child project

15 Labs Also 2 hours, once per week Focused on applications of distributed systems Four lab projects over six weeks

16 Lab Schedule Introduction to Hadoop, Eclipse Setup, Word Count Inverted Index PageRank on Wikipedia Clustering on Netflix Prize Data
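
The first lab exercise, Word Count, can be sketched in a few lines. This is a single-process illustration of the map, group, and reduce steps, not the course's Hadoop lab code; the function names and the sample documents are assumptions made for this example.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_word_count(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(documents, mapper, reducer):
    """Toy driver: map every document, group values by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            grouped[key].append(value)  # stands in for the shuffle step
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))

# Hypothetical two-document corpus.
docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(word_counts)
```

In a real cluster the grouping step is performed by the framework across machines; here a dictionary plays that role so the data flow is easy to follow.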

17 Design Projects Final four weeks of quarter Teams of 1 to 3 students Students proposed topic, gathered data, developed software, and presented solution

18 Example: Geozette Image © Julia Schwartz

19 Example: Galaxy Simulation Image © Slava Chernyak, Mike Hoak

20 Other Projects Bayesian Wikipedia spam filter Unsupervised synonym extraction Video collage rendering

21 Ongoing research: traceroutes Analyze time-stamped traceroute data to model changes in Internet router topology –4.5 GB of data/day * 1.5 years –12 billion traces from 200 PlanetLab sites Calculates prevalence and persistence of routes between hosts

22 Ongoing research: dynamic program traces Dynamic memory trace data from simulators can reach hundreds of GB Existing work focuses on sampling New capability: record all accesses and post-process with Hadoop

23 Common Features Hadoop! Used publicly-available web APIs for data Many involved reading papers for algorithms and translating into MapReduce framework

24 Background Topics Programming Languages Systems: –Operating Systems –File Systems –Networking Databases

25 Programming Languages MapReduce is based on functional programming map and fold FP is taught in one quarter, but not reinforced –“Crash course” necessary –Worksheets to pose short problems in terms of map and fold –Immutable data a key concept
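
A worksheet-style problem posed in terms of map and fold might look like the following. This is a minimal sketch in Python, where `functools.reduce` plays the role of fold; the variable names and the sample numbers are made up for illustration.

```python
from functools import reduce

# Immutable input: a tuple cannot be modified in place.
line_lengths = (12, 7, 30, 5)

# map: apply a pure function to every element independently.
doubled = tuple(map(lambda n: n * 2, line_lengths))

# fold: combine the elements into one value, starting from an initial accumulator.
total = reduce(lambda acc, n: acc + n, doubled, 0)

def keep_long(acc, n):
    """Fold step that returns a new tuple instead of mutating the accumulator."""
    return acc + (n,) if n > 20 else acc

# A fold can also build a new structure, here the values greater than 20.
long_values = reduce(keep_long, doubled, ())

print(doubled, total, long_values)
```

Because each map call touches only its own element and never changes shared state, the calls could run on different machines in any order, which is the property MapReduce relies on.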

26 Multithreaded programming Taught in OS course at Washington –Not a prerequisite! Students need to understand multiple copies of same method running in parallel
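
The idea of several copies of the same method running at once can be shown with a short sketch. It uses Python's standard `threading` module; the function name, the chunks of text, and the result layout are assumptions for this example.

```python
import threading

def count_words(chunk, results, index):
    """Every thread runs this same function with its own local variables."""
    results[index] = len(chunk.split())  # each thread writes only its own slot

# Hypothetical input, split into one chunk per thread.
chunks = ["a b c", "d e", "f g h i"]
results = [0] * len(chunks)

threads = [
    threading.Thread(target=count_words, args=(chunk, results, i))
    for i, chunk in enumerate(chunks)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait until every copy has finished

total_words = sum(results)
print(results, total_words)
```

Each thread writes to a distinct list slot, so no lock is needed here; as soon as two copies update the same value, synchronization becomes necessary.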

27 File Systems Necessary to understand GFS Comparison to NFS, other distributed file systems relevant

28 Networking TCP/IP Concepts of “connection,” network splits, other failure modes Bandwidth issues

29 Other Systems Topics Process Scheduling Synchronization Memory coherency

30 Databases Concept of shared consistency model Consensus ACID characteristics –Journaling –Multi-phase commit processes

31 Course Staff Instructor (me!) Two undergrad teaching assistants –Helped facilitate discussions, directed labs One student sys admin –Worked only about three hours/week

32 Preparation Teaching assistants had taken previous iteration of course in winter Lectures retooled based on feedback from that quarter –Added reasonably large amount of background material Ran & solved all labs in advance

33 The Course: What Worked Discussions –Often covered broad range of subjects Hands-on lab projects “Active learning” in classroom Independent design projects

34 Things to Improve: Coverage Algorithms were not reinforced during lecture –Students requested much more time be spent on “how to parallelize an iterative algorithm” Background material was very fast-paced

35 Things to Improve: Projects Labs could have used a moderated/scripted discussion component –Just “jumping in” to the code proved difficult –No time was devoted to Hadoop itself in lecture –Clustering lab should be split in two Design projects could have used more time

36 Future Course Ideas Overview Systems course Web application design Integration in other applications courses Misc. content ideas Making your own data sets

37 Systems Course Focused on parallel & distributed systems Hadoop included in comparison to other cluster techniques Emphasis on performance, profiling, and management

38 Topic Map

39 Introductory Material Networking basics Multithreading

40 Distributed Reliability Reliability metrics Methods of failure Techniques to combat failure –Journaling, n-phase commit Techniques to achieve consensus –Leader election, voting
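
As one instance of an n-phase commit, a two-phase commit can be sketched as follows. This is a simplified in-memory model under stated assumptions: the `Participant` class and its methods are hypothetical, and timeouts, crashes, and journaling are not modeled.

```python
class Participant:
    """Hypothetical resource manager that votes on a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):
        for p in participants:                   # phase 2: global commit
            p.commit()
        return "committed"
    for p in participants:                       # phase 2: global abort
        p.abort()
    return "aborted"

healthy = [Participant("db1"), Participant("db2")]
mixed = [Participant("db1"), Participant("db2", can_commit=False)]
print(two_phase_commit(healthy), two_phase_commit(mixed))
```

A single "no" vote aborts the transaction everywhere, which is what gives the protocol its all-or-nothing behavior.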

41 Parallel Processing How to parallelize algorithms Parallelization in one machine vs. across several machines –Techniques applicable to one vs. other –Cache coherency –Memory distribution

42 Parallelization Frameworks Multithreading on one machine RPC, MPI, PVM Higher-level scheduling –Condor vs. Hadoop Tradeoffs in design

43 Algorithm Design Comparison Matrix multiplication, sorting, searching, PageRank, etc –… For a standard distributed system –… For Hadoop

44 Distributed Storage NFS, AFS, GFS Database clustering techniques –Distributed SQL databases –HBase –Distributed memory caches, object stores

45 Lab Focus Implementing parallel and distributed algorithms Experiment with different frameworks Perform measurements –Bandwidth consumption –Latency & performance Code analysis

46 Final Thoughts Lots of low-level programming involved Appropriate mostly for last-year students Hadoop community would find scholarly benchmarks useful –wiki.apache.org/hadoop/ProjectSuggestions –“JIRA” bug/feature request database

47 Web Application Design

48 Basic Web Development Topics

49 Large-Scale Web Server Technology

50 Next Steps RPC –Internal RPC; message queues and distributed back-ends –Thrift, ProtocolBuffers –SOAP and XMLHttpRequest

51 Scaling Really Big Nutch/Lucene Hadoop Amazon Web Services

52 Data Aggregation and Analysis How to crawl and parse web pages Generate link graphs Perform analyses (e.g., PageRank) Semantic analysis
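
The link-graph analysis step can be illustrated with a small PageRank sketch. It is an iterative, single-machine version under stated assumptions: the damping factor of 0.85 is the conventional choice, spreading the rank of pages without outgoing links evenly over all pages is one common way to handle them, and the four-page graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict of page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(ranks)
```

Each iteration has the shape of one MapReduce pass: a page emits rank shares to its targets (map), and the shares arriving at each page are summed (reduce).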

53 Web Site Tuning Web page layout optimization –Speed –Accessibility –Ease-of-use Server log analysis –User-targeted site features Service replication –Consistency, latency issues

54 Security and the Web Data sanitization SQL injection attacks DOS attacks Data collection methods & ethics –User data privacy
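
A SQL injection attack and the usual defense can be demonstrated with Python's built-in `sqlite3` module. This is a minimal sketch; the table, the sample rows, and the function names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

def find_user_unsafe(name):
    """Vulnerable: user input is pasted directly into the SQL text."""
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    """Parameterized query: the input is passed as data, never parsed as SQL."""
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()

# Input crafted to turn the WHERE clause into a condition that is always true.
attack = "nobody' OR '1'='1"
leaked_rows = find_user_unsafe(attack)
safe_rows = find_user_safe(attack)
print(len(leaked_rows), len(safe_rows))
```

The unsafe version returns every row in the table for the crafted input, while the parameterized version treats the whole string as a name and finds no match.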

55 Projects Code labs in Python, PHP, Ruby Simple database design Building a small search engine with Nutch/Lucene Design scalable architecture and run on Amazon EC2 Web site design project –Security/penetration analysis of other teams’ sites

56 Final Course Thoughts Web-based services are increasingly relevant –Exciting new opportunity for students –Example course in action: www.cs.washington.edu/education/courses/cse454/07au/

57 Using Hadoop in Other Courses Hadoop is a natural component for many existing courses –Artificial intelligence –Web search –Data mining / information retrieval –Databases (HBase) –Networking –Computational biology? Graphics?

58 Low Level Module “MapReduce in a week:” code.google.com/edu/content/parallel.html 3-lecture series on distributed processing and Hadoop; enough to get students started … more discussion of online resources next

59 AI/Data Mining Ideas Use Nutch to perform a web crawl and classify pages using Bayesian analysis Hadoop makes processing easy –Data sanitization –Classifier engine (Use WEKA right in Hadoop) –HDFS for document storage/retrieval/search

60 AI/Data Mining Ideas Extract semantically valuable data from web pages –E.g., match names to phone numbers, –News articles to locations Hadoop allows students to explore a much broader scale than previously possible

61 Graphics Examples Re-encode a render pipeline as a set of MapReduce tasks Use feature detection + clustering on a corpus of images to find images with similar shapes/features

62 Student-Generated Ideas Data processing with Yahoo Pig Distributed SQL databases Distributed systems “ground-up” projects: –Sockets, then RPC, then Hadoop Other concepts: BitTorrent, DHTs, P2P Other frameworks: e.g., BOINC projects

63 Making Datasets Your department is full of data! –Graphics data –Sensor data from RFID, Ubicomp, robotics… –Measurements from networking lab –Ask around: Someone has a few dozen gigs of log files to donate –(What happens if you leave Ethereal in promiscuous mode for a week straight?)

64 Making Datasets Other departments are full of data! –Biology –Chemistry –Physics (campus particle accelerator?)

65 Making Datasets The web is full of data! –Use Nutch to crawl web sites –Wikis are especially good (hmm..)

66 Conclusions Hadoop isn’t a full course in itself –But it combines well with a lot of other ideas Can be used for at least half a course … Or as little as a week or two Look around you – Hadoop can be applied to more areas than you might think

67 Open Source Tools for Teaching

68 Overview Slides Lab Materials Readings Video Lectures Datasets http://code.google.com/edu

69 Slides Multiple short course outlines available: “MapReduce in a week” “Introduction to Problem Solving on Large Scale Clusters” “MapReduce Mini Lecture Series”

70 Labs Lab designs from UW course available –“Introduction to MapReduce” –“A Simple Inverted Index” –“PageRank on the Wikipedia Corpus” –“Clustering the Netflix Movie Data”
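
The inverted index lab can be sketched in the same map-and-group style as word count. This is a single-process illustration, not the published lab solution; the function names and the sample documents are assumptions.

```python
from collections import defaultdict

def map_postings(doc_id, text):
    """Map phase: emit (word, doc_id) once for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def build_inverted_index(documents):
    """Group the emitted pairs by word and return sorted posting lists."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word, posting in map_postings(doc_id, text):
            index[word].add(posting)  # reduce step: collect document ids per word
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

# Hypothetical corpus.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog and the fox",
    "d3": "a quick dog",
}
inverted_index = build_inverted_index(docs)
print(inverted_index["fox"], inverted_index["quick"])
```

Looking up a word in the resulting dictionary returns the documents that contain it, which is the core operation behind a search lab.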

71 Readings Google has several papers available –“Introduction to Distributed Systems” –“MapReduce: Simplified Data Processing on Large Clusters” –“The Google File System” –“Bigtable: A Distributed Storage System for Structured Data” http://research.google.com/pubs/papers.html

72 Lecture Videos MapReduce Mini-series

73 Datasets: Wikipedia Wikipedia supports free “bulk download” of data –Current site snapshot (big) –Entire revision history (massive) Eliminates need for Nutch crawls Good for indexing, search labs http://download.wikimedia.org

74 Datasets: Netflix Netflix’s web site provides recommendations Theory: Other people watched movie X, then Y. You watched X, you might like Y. Open question: Can you provide more useful recommendations than their current system?
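
The "watched X, might like Y" theory can be sketched as a co-occurrence count. This is an illustrative simplification, not Netflix's system: it ignores viewing order and ratings, and the viewing histories and function names are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations

def co_occurrence(histories):
    """Count, for each movie X, how many users also watched each other movie Y."""
    counts = defaultdict(Counter)
    for movies in histories.values():
        for x, y in permutations(set(movies), 2):
            counts[x][y] += 1
    return counts

def recommend(user_movies, counts, top_n=3):
    """Score unseen movies by how often they co-occur with the user's movies."""
    seen = set(user_movies)
    scores = Counter()
    for x in seen:
        for y, count in counts.get(x, {}).items():
            if y not in seen:
                scores[y] += count
    return [movie for movie, _ in scores.most_common(top_n)]

# Hypothetical viewing histories: user -> movies watched.
histories = {
    "u1": ["X", "Y"],
    "u2": ["X", "Y", "Z"],
    "u3": ["X", "Z"],
    "u4": ["X", "Y"],
}
counts = co_occurrence(histories)
suggestions = recommend(["X"], counts)
print(suggestions)
```

Counting pairs per user and summing per movie fits the map and reduce pattern directly, which is one reason this dataset suits a Hadoop lab.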

75 Datasets: Netflix The Netflix Prize: $1,000,000 if you can find a better algorithm, based on their criteria They provide you with a large dataset of existing rental associations to work with www.netflixprize.com

76 Conclusions Lots of starter materials available on the web –Good for reference –Get teaching assistants up to speed Readings, sample worksheets and other resources are open content & ready to use

77 Aaron Kimball aaron@cloudera.com
