1
Python In The Cloud
PyHou MeetUp, Dec 17th 2013
Chris McCafferty, SunGard Consulting Services
2
Overview
What is the Cloud?
What is Big Data?
Big Data Sources
Python and Amazon Web Services
Python and Hadoop
Other Pythonic Cloud providers
Wrap-up
3
What Is The Cloud?
I want 40 servers and I want them NOW
I want to store 100TB of data cheaply and reliably
We can do this with Cloud technologies
4
What is Big Data?
The “Three Vs”:
– Volume
– Variety
– Velocity
Genome: sequencing machines throw off several TB per day. Each.
Hard drive performance is often the killer bottleneck, both reading and writing
5
What is NOT Big Data
Anything where the whole data set can be held in memory on a single standard instance
Data that can be held straightforwardly in a traditional relational database
Problems where most of the data can be trivially excluded
There are many challenging problems in the world – but not all need Cloud or Big Data tools to solve them
6
To The Cloud!
Amazon Web Services is the 800lb gorilla in this space
– Start here if in doubt
Other options are RackSpace, Microsoft Azure, (PiCloud/Multyvac?)
You can also spin up some big iron very cheaply
– Current AWS big-memory spec is cr1.8xlarge
– 244GB RAM, 32 Xeon E5 cores, 10 Gigabit network
– $3.50 per hour
7
Geo Big Data Sources
NASA SRTM data is on the large side
NASA recently released a huge set of data directly into the cloud: NEX
– Earth Sciences data sets
Made available on Amazon Web Services public datasets
Available on S3 at:
– s3://nasanex/NEX-DCP30
– s3://nasanex/MODIS
– s3://nasanex/Landsat
There are many, many geo data sets available now (NOAA Lidar, etc)
8
Time for some code
Example - use S3 browser to look at the new NASA NEX data
Let’s download some with the boto package
Quickest to do this from an Amazon data centre
See DemoDownloadNasaNEX.py; a sketch of the idea follows
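The DemoDownloadNasaNEX.py script itself isn't reproduced here, but a minimal sketch of the same idea with the boto 2.x API might look like this. It assumes the nasanex bucket allows anonymous listing, and the object key used for the download is illustrative only:

    from boto.s3.connection import S3Connection

    # Public data sets can be read without AWS credentials
    conn = S3Connection(anon=True)
    bucket = conn.get_bucket('nasanex')

    # Peek at the top-level "folders" under the NEX-DCP30 prefix
    for item in bucket.list(prefix='NEX-DCP30/', delimiter='/'):
        print(item.name)

    # Download one object to local disk (key name is illustrative only)
    key = bucket.get_key('NEX-DCP30/doc/NEX-DCP30_TechNote.pdf')
    if key is not None:
        key.get_contents_to_filename('NEX-DCP30_TechNote.pdf')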
9
Weather & Big Data Sources
Good public weather and energy data
It's hard to move data around for free: just try!
Power grids shed many GB of public data a day
– Historical data sets form many Terabytes
Weather data available from NOAA
– QCLCD: hourly, daily, and monthly summaries for approximately 1,600 U.S. locations
– ASOS data contains sensor data at one-minute intervals. 5 min intervals available too.
900 stations, 3-4MB per day, 12 years of data = 11-15TB data set (rough check below)
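As a quick sanity check of that last estimate, the arithmetic is easy to reproduce (figures taken straight from the slide above):

    # 900 stations, 3-4MB per station per day, 12 years of history
    stations = 900
    days = 365 * 12
    for mb_per_day in (3, 4):
        tb = stations * mb_per_day * days / 1e6   # decimal terabytes
        print("%d MB/day -> %.1f TB" % (mb_per_day, tb))
    # ~11.8 TB to ~15.8 TB, consistent with the 11-15TB quoted above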
10
Why go to the cloud?
Cheap - see AWS pricing
– spot pricing of m1.medium normally ~1c/hr
The cloud is increasingly where the (public) data will reside
Pay as you go, less bureaucracy
Support for Big Data technologies out of the box
– Amazon Elastic Compute Cloud (EC2) gives you a Hadoop cluster with minimal setup
Host a big web server farm or video streaming cluster
11
Python on AWS EC2
AWS = Amazon Web Services. The Big Cloud
EC2 = Elastic Compute Cloud
Let’s run up an instance and see what we have available (a boto sketch follows)
See this script as one way to upgrade to Python 2.7
Note absence of high-level packages like NumPy, matplotlib and Pandas
It would be very useful to have a very high-level Python environment…
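A minimal sketch of "running up an instance" with boto 2.x. It assumes your AWS credentials are already configured, and the AMI ID, key pair and security group names are placeholders, not real resources:

    import time
    import boto.ec2

    # Credentials come from environment variables or ~/.boto
    conn = boto.ec2.connect_to_region('us-east-1')

    reservation = conn.run_instances(
        'ami-xxxxxxxx',              # placeholder AMI ID
        key_name='my-keypair',       # placeholder key pair name
        instance_type='m1.medium',
        security_groups=['default'],
    )
    instance = reservation.instances[0]

    # Poll until the instance is running, then print where to SSH to
    while instance.state != 'running':
        time.sleep(5)
        instance.update()
    print(instance.public_dns_name)

    # Terminate when finished, otherwise the meter keeps running:
    # conn.terminate_instances(instance_ids=[instance.id])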
12
StarCluster
Cluster management in AWS, written by a group at MIT
Convenient package to spin up clusters (Hadoop or other) and copy across files
Machine images (AMIs) for high-level Python environments (NumPy, matplotlib, Pandas, etc)
Not every high-level library is there
– No sklearn (scikit-learn, machine learning)
– But easier to pip-install with most pre-requisites already there
Sun Grid Engine: job management
Hadoop plugin, boto, dumbo… and much more
13
Python's Support for AWS
boto - interface to AWS (Amazon Web Services)
Hadoop Streaming - use Python in MapReduce tasks
mrjob - framework that wraps Hadoop Streaming and uses boto (word-count sketch below)
pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce
Write Python in User-Defined Functions in Pig, Hive
– Essentially wraps MapReduce and Hadoop Streaming
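To make the mrjob bullet concrete, here is the word-count pattern it is usually introduced with. Run it locally against a text file, or add -r emr to run on Elastic MapReduce (assumes mrjob is installed and, for EMR, that AWS credentials are configured):

    import re
    from mrjob.job import MRJob

    WORD_RE = re.compile(r"[\w']+")

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word on the input line
            for word in WORD_RE.findall(line):
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the 1s emitted for each word
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()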
14
Boto - Python Interface to AWS
Support for HDFS
Upload/download from Amazon S3 and Glacier (small sketch below)
Start/stop EC2 instances
Manage users through IAM
Virtually every API available from AWS is supported
django-storages uses boto to present an S3 storage option
See http://docs.pythonboto.org/en/latest/
Make sure you keep your AWS key-pair secure
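For flavour, a small boto 2.x sketch of the S3 bullet above: upload a local file and hand out a time-limited signed URL. The bucket, key and file names are placeholders:

    import boto

    conn = boto.connect_s3()                        # keys from env vars or ~/.boto
    bucket = conn.get_bucket('my-example-bucket')   # placeholder bucket you own

    key = bucket.new_key('reports/outages-2013-12-17.csv')
    key.set_contents_from_filename('outages.csv')   # upload a local file

    # Signed URL, valid for one hour, for sharing without making the object public
    print(key.generate_url(expires_in=3600))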
15
Another Code Example – upload
Example where we merge many files together and upload to S3
Merge files to avoid the Small Files Problem
Note use of retry decorator (exponential backoff)
See CopyToCloud.py and MergeAndUploadTxOutages.py; a sketch of the pattern follows
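The actual CopyToCloud.py / MergeAndUploadTxOutages.py scripts aren't shown here, but the pattern they describe - merge small files, then upload with an exponential-backoff retry decorator - can be sketched roughly like this (bucket, key and path names are illustrative):

    import functools
    import glob
    import time
    import boto

    def retry(tries=5, delay=1, backoff=2):
        """Retry the wrapped call with exponential backoff on any exception."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                wait = delay
                for attempt in range(tries):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        if attempt == tries - 1:
                            raise
                        time.sleep(wait)
                        wait *= backoff
            return wrapper
        return decorator

    @retry()
    def upload(bucket_name, key_name, data):
        bucket = boto.connect_s3().get_bucket(bucket_name)
        bucket.new_key(key_name).set_contents_from_string(data)

    # Merge a day's worth of small outage files into a single S3 object
    merged = "".join(open(path).read() for path in sorted(glob.glob('outages/*.txt')))
    upload('my-example-bucket', 'merged/tx-outages-2013-12-17.txt', merged)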
16
What is Hadoop?
A scalable data and job manager suitable for MapReduce jobs
Core technologies date from the early 2000s at Google
Retries failed tasks, redundant data, good for commodity hardware
Rich ecosystem of tools including NoSQL databases, good Python support
Example: let’s spin up a cluster of 30 machines with StarCluster
17
Hadoop Scales Massively
18
Hadoop Streaming
Hadoop passes incoming data in rows on stdin
Any program (including Python) can process the rows and emit to stdout
Logging and errors go to stderr (minimal mapper sketch below)
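A bare-bones streaming mapper showing the stdin/stdout/stderr contract; a generic sketch, not a specific script from the talk:

    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            # stderr ends up in the Hadoop task logs; never mix it into stdout
            sys.stderr.write("skipping malformed row: %r\n" % line)
            continue
        # Emit key<TAB>value; Hadoop sorts on the key before the reduce phase
        print("%s\t%s" % (fields[0], fields[1]))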
19
Hadoop Streaming - Echo
Useful example that can be used for debugging
Tells you what Hadoop is actually passing your task
See echo.py (a sketch follows)
Similar example firstten.py peeks at the first ten lines then stops
Useful for debugging
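echo.py and firstten.py aren't reproduced here, but a script in that spirit is only a few lines; setting MAX_LINES to None gives the pure echo behaviour:

    import sys

    MAX_LINES = 10   # set to None for a plain echo of everything

    for count, line in enumerate(sys.stdin, start=1):
        sys.stdout.write(line)      # pass the row through untouched
        if MAX_LINES is not None and count >= MAX_LINES:
            break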
20
Hadoop Parsing Example
Python's regex support makes it very good for parsing unstructured data
One of the keys to working with Hadoop and Big Data is getting data into a clean row-based format
Apply 'schema on read'
Transmission data from PJM is updated every 5 mins: https://edart.pjm.com/reports/linesout.txt
Needs cleaning up before we can use it for detailed analysis - note multi-line format
Script split_transmission.py (a hedged sketch of the idea follows)
Watch out for Hadoop splitting input blocks in the middle of a file
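split_transmission.py isn't shown here, and the real linesout.txt layout differs, but the "schema on read" idea - use a regex to collapse a multi-line record into one clean row before handing it to Hadoop - looks roughly like this with an invented record format:

    import re
    import sys

    # Invented multi-line record layout, purely for illustration
    RECORD_RE = re.compile(
        r"Line:\s*(?P<line>\S+)\s+"
        r"Start:\s*(?P<start>\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"
        r"End:\s*(?P<end>\d{2}/\d{2}/\d{4} \d{2}:\d{2})")

    text = sys.stdin.read()          # read the whole report, since records span lines
    for m in RECORD_RE.finditer(text):
        # One clean tab-separated row per outage record
        print("\t".join([m.group('line'), m.group('start'), m.group('end')]))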
21
Alternatives to AWS
PiCloud offers open source software enabling you to run large computational clusters
– Just acquired by Dropbox
– Pay for what you use: 1 core and 300MB of RAM costs $0.05/hr
– Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud
Disco is MapReduce implemented in Python
– Started life at Nokia
– Has its own distributed filesystem (like HDFS)
Or roll your own cluster in-house with pp (parallel python) – see the pp sketch below
StarCluster and Sun Grid Engine on another vendor or in-house
Google App Engine…?
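As a taste of the "roll your own" option, a tiny pp (Parallel Python) example that farms a CPU-bound function out to local cores; pass ppservers=('host:port',) to pp.Server to add remote nodes. This sketch is not from the talk:

    import pp

    def is_prime(n):
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))

    job_server = pp.Server()         # uses all local CPUs by default
    candidates = range(1000000, 1000100)
    jobs = [(n, job_server.submit(is_prime, (n,))) for n in candidates]
    print([n for n, job in jobs if job()])   # calling job() blocks until the result is ready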
22
PiCloud
Acquired by Dropbox Nov 2013
Dropbox will probably come out with its own cloud compute offering in 2014
As of Dec 2013, no new sign-ups
Existing customers encouraged to migrate to Multyvac
PiCloud will switch off on Feb 25th 2014
The underlying PiCloud software is still open source
23
Conclusions
For cheap compute power and cheap storage, look to the cloud
Python is well-supported in this space
Consider being close to your data: in the same cloud
– Moving data is expensive and slow
Leverage AWS with tools like boto and StarCluster
Beware setting up complex environments: installing packages takes time and effort
Ideally, think Pythonically – use the best tools to get the job done
24
Links
Good rundown on the Python ecosystem around Hadoop from Jan 2013:
– http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Early vision for PiCloud (YouTube, Mar 2012):
– http://www.youtube.com/watch?v=47NSfuuuMfs
Disco MapReduce Framework from PyData:
– http://www.youtube.com/watch?v=YuLBsdvCDo8
PuTTY tool for Windows
Some AWS & Python war stories:
– http://nz.pycon.org/schedule/presentation/12
25
Thank you
Chris McCafferty
http://christophermccafferty.com/blog
Slides will be at: http://christophermccafferty.com/slides
Contact me at: public@christophermccafferty.com or Chris.McCafferty@sungard.com