The Big Data Network (phase 2) Cloud Hadoop system
Presentation to the Manchester Data Science club, 14 July 2016
Peter Smyth, UK Data Service
The BDN2 Cloud Hadoop system
The challenges of building and populating a secure but accessible big data environment for researchers in the Social Sciences and related disciplines.
Overview of this presentation
- Aims
- A simple Hadoop system
- Dealing with the data
- Processing
- Users
- Safeguarding the data and its usage
- Appeal for data and use cases
Aims
Aims
- Provide a processing environment for big data
- Targeted at the Social Sciences, but not exclusively so
- Provide easy ingest of datasets
- Provide comprehensive search facilities
- Provide users with easy access for processing or download
Cloud Hadoop System
Cloud Hadoop system
- Start with a minimal configuration
- Cloud based, so we can grow it as needed; adding nodes is what Hadoop is good at
- Need to provide high availability (HA) from the outset
- Resilience and user access are important
- Search facilities will be expected to be available 24/7
Software installed, and how we will use it
- Standard HDP (Hortonworks Data Platform): Spark, Hive, Pig, Ambari, Zeppelin, etc.
- Other Apache software:
  - Ranger: monitors and manages comprehensive data security across the Hadoop platform
  - Knox: REST API gateway providing a single point of entry to the cluster (see the sketch below)
- Other software:
  - Kerberos
  - AD (Active Directory) integration
  - Our own processes for workflows and ingest / metadata production
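As a rough illustration of the single point of entry Knox provides, a user might list an HDFS directory over WebHDFS through the gateway along these lines; the hostname, topology name ("default") and credentials are placeholders, not the production configuration.

```python
# Minimal sketch: listing an HDFS directory over WebHDFS via the Knox gateway.
# Host, port, topology name and credentials are placeholders.
import requests

KNOX_URL = "https://knox.example.ac.uk:8443/gateway/default/webhdfs/v1"

def list_hdfs_dir(path, user, password):
    """Return the FileStatus entries for an HDFS directory, proxied through Knox."""
    resp = requests.get(
        f"{KNOX_URL}{path}",
        params={"op": "LISTSTATUS"},
        auth=(user, password),   # Knox authenticates users at the perimeter
        verify=True,             # the gateway should present a trusted TLS certificate
    )
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

if __name__ == "__main__":
    for entry in list_hdfs_dir("/data/ingest", "researcher1", "secret"):
        print(entry["pathSuffix"], entry["type"], entry["length"])
```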
Fitting the bits together
Components of the overall system:
- Hadoop system
- Job scheduling
- User access control and quotas
- Data access control and SDC
- Data
- Users
- Performance monitoring
- Auditing and logging
Data
Getting the data in
- Large datasets from third parties
- Existing UKDS datasets
  - Not necessarily big data, but likely to be used in conjunction with other data
- BYOD (bring your own data)
- Negotiation, contracts, conditions
How not to do it!
HDF (Hortonworks DataFlow)
- Built on Apache NiFi
- Allows workflows to be built for collecting data from external sources
  - Single-shot datasets
  - Regular updates (monthly, daily)
  - Possibility of streaming data
NiFi workflow
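The workflows themselves are built in the NiFi UI. As a rough illustration of how an ingest flow might be monitored from outside that UI, the sketch below polls the NiFi REST API; the host, credentials, endpoint path (including the use of "root" for the top-level process group) and response field names are assumptions about the deployed NiFi version.

```python
# Hedged sketch: checking overall ingest-flow status via the NiFi REST API.
import requests

NIFI = "https://nifi.example.ac.uk:9443/nifi-api"   # hypothetical host

def flow_status(user, password):
    resp = requests.get(f"{NIFI}/flow/process-groups/root/status",
                        auth=(user, password))
    resp.raise_for_status()
    return resp.json()

status = flow_status("ingest-monitor", "secret")
# Pull out the aggregate snapshot if present; field names vary by version.
snapshot = status.get("processGroupStatus", {}).get("aggregateSnapshot", {})
print("queued:", snapshot.get("queued"), "flowfiles in:", snapshot.get("flowFilesIn"))
```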
Data storage
- Raw data: the files as they come in
- Metadata (semantic data lake): not only the extracted metadata, but the full contents of the semantic data lake
- Dashboards, summaries and samples
- User data: own datasets, work in progress, results
Semantic data lake
- Must contain everything: there will be only one search engine, whether the data is in the cloud or on-premises (secure data)
- The metadata isn't just what is extracted from the datasets and associated documentation
- Appropriate ontologies need to be used: not only terms but the relationships between them
- Expressed using the Resource Description Framework (RDF); see the sketch below
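To make the RDF idea concrete, here is a minimal sketch using rdflib (the library choice, the local namespace, the dataset identifier and the licence URI are all assumptions for illustration) of how dataset metadata might be recorded and then searched with a single SPARQL query.

```python
# Minimal sketch of dataset metadata as RDF triples in a semantic data lake.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
UKDS = Namespace("http://example.org/ukds/")   # hypothetical local namespace

g = Graph()
dataset = UKDS["dataset/energy-smart-meters"]  # hypothetical dataset identifier
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Smart meter readings (sample)")))
g.add((dataset, DCTERMS.license, URIRef("http://example.org/licences/research-only")))
g.add((dataset, DCAT.keyword, Literal("energy")))
g.add((dataset, DCAT.keyword, Literal("households")))

# The single search engine then becomes a SPARQL query over the lake.
results = g.query("""
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?ds ?title WHERE {
        ?ds a dcat:Dataset ;
            dcterms:title ?title ;
            dcat:keyword "energy" .
    }
""")
for ds, title in results:
    print(ds, title)
```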
Processing
Processing
- Ingest and curation processing
  - Extracting and creating metadata
  - Processing for dashboards, summaries and samples
  - Samples: generated in advance or as requested?
- User searches
- User jobs
- Processing systems: Spark, Hive / Pig (see the sketch below)
  - Effect of interactive environments such as Zeppelin
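A minimal sketch of the kind of user job envisaged, written in PySpark; the HDFS paths, Hive table and column names are illustrative only, not part of the actual system.

```python
# Sketch: a PySpark job joining an ingested raw file with a curated Hive table
# and writing a summary suitable for a dashboard or sample extract.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("ukds-example-summary")
         .enableHiveSupport()          # lets the same session query Hive tables
         .getOrCreate())

# Raw data area populated by the ingest workflows (path is illustrative).
readings = spark.read.csv("hdfs:///data/raw/smart_meters/*.csv",
                          header=True, inferSchema=True)

# Curated reference data exposed as a Hive table (name is illustrative).
regions = spark.sql("SELECT household_id, region FROM curated.household_regions")

summary = (readings.join(regions, "household_id")
                   .groupBy("region")
                   .agg(F.avg("kwh").alias("mean_kwh"),
                        F.count("*").alias("n_readings")))

summary.write.mode("overwrite").parquet("hdfs:///data/dashboards/kwh_by_region")
```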
Job scheduling
- Ingest-related jobs
- Metadata maintenance jobs
- User jobs
  - Batch: Hive, Pig
  - (Near) real time: Spark Streaming (see the sketch below)
- What kind of delay is acceptable, for users and for operations?
- Do we need to prioritise?
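For the (near) real-time case, a hedged sketch using Spark Structured Streaming: the Kafka brokers, topic and trigger interval are assumptions, and the Kafka source needs the spark-sql-kafka package available on the cluster. The trigger interval is where "what delay is acceptable" becomes an explicit setting.

```python
# Sketch: near-real-time counts over a Kafka topic with Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ukds-streaming-example").getOrCreate()

events = (spark.readStream
          .format("kafka")                                   # needs spark-sql-kafka
          .option("kafka.bootstrap.servers", "broker1:9092") # placeholder broker
          .option("subscribe", "sensor-readings")            # placeholder topic
          .load())

counts = (events
          .groupBy(F.window(F.col("timestamp"), "5 minutes"))
          .count())

query = (counts.writeStream
               .outputMode("complete")
               .format("memory")                     # in-memory sink, illustration only
               .queryName("recent_counts")
               .trigger(processingTime="1 minute")   # the 'acceptable delay' knob
               .start())
query.awaitTermination()
```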
Users
User types
- Short term (try before you 'buy')
- Long term (researchers, 3-5 years)
- Commercial users? (in exchange for data)
- Everyone is a search user
Safeguarding Data
Security and audit
- Who can access what data
  - Making data available
  - Disk quotas
  - Private areas
- Who has access to resources and can run jobs
  - Sandbox area for authenticated users
  - Providing tools
  - Levels of support
- What audit trails are maintained
  - What is recorded
  - How long do we keep the logs
  - Will they be reviewed?
Data ownership and provenance
- Restrictions on the use of a dataset
  - Licence agreements
  - Types of research permitted
- Complications due to combining datasets
  - Permissions needed
- Carrying the provenance / licence with the data in the semantic data lake
SDC (Statistical Disclosure Control)
- Currently a manual process
- Likely to become more complex as more datasets are combined
- Could just be checked on output
- Automated tools are becoming available, but how good are they? Are they good enough? (A simple illustration follows.)
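As a deliberately simple illustration of the kind of rule such a tool applies (not a production SDC method), primary suppression of small frequency-table cells might look like the sketch below; the threshold of 10 is an assumption, and real disclosure control also needs secondary suppression, dominance rules and so on.

```python
# Sketch: primary suppression of frequency-table cells below a minimum count.
import pandas as pd

MIN_CELL_COUNT = 10   # threshold is an assumption; in practice set by the data owner

def suppress_small_cells(freq_table: pd.DataFrame, count_col: str = "count") -> pd.DataFrame:
    out = freq_table.copy()
    unsafe = out[count_col] < MIN_CELL_COUNT
    out[count_col] = out[count_col].mask(unsafe)   # blank out disclosive cells
    out["suppressed"] = unsafe
    return out

table = pd.DataFrame({"region": ["A", "B", "C"], "count": [250, 7, 43]})
print(suppress_small_cells(table))
```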
Hadoop in the Cloud
Performance monitoring
- Need to understand usage patterns, or try to anticipate them
- Need to be able to detect when the system is under stress (CPU, RAM, HDFS) and to react in a timely manner (see the sketch below)
- Need to provide proper job scheduling for true batch jobs
- Cannot allow the use of Spark to result in a free-for-all
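One hedged option for the monitoring itself is to poll Ambari's REST API for basic host metrics. The cluster name, credentials and exact metric field names depend on the Ambari / Ambari Metrics setup and are assumptions here.

```python
# Sketch: polling Ambari for per-host CPU metrics to spot a cluster under stress.
import requests

AMBARI = "http://ambari.example.ac.uk:8080/api/v1"   # hypothetical Ambari server
CLUSTER = "bdn2"                                     # hypothetical cluster name

def host_metrics(user, password):
    resp = requests.get(
        f"{AMBARI}/clusters/{CLUSTER}/hosts",
        params={"fields": "Hosts/host_name,metrics/cpu/cpu_user,metrics/memory/mem_free"},
        auth=(user, password),
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for host in host_metrics("admin", "secret"):
    name = host["Hosts"]["host_name"]
    cpu = host.get("metrics", {}).get("cpu", {}).get("cpu_user")
    print(name, "cpu_user:", cpu)
```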
Pros and cons of the cloud for Hadoop
- Pros
  - Elasticity: add or remove nodes as required
  - Only pay for what you use
- Cons
  - Hadoop is designed as a share-nothing system
  - Adding, and particularly removing, nodes is not as straightforward as in other types of cloud system
  - Continuously paying for the storage of big datasets
The pros are the standard ones offered for cloud computing in general; the cons explain why they are not necessarily applicable to a Hadoop system.
Appeal for use cases
Why we need data and use cases
- We are building a generalised system
- Many of the processes and procedures have not been tried before
- Need an understanding of 'typical' use needs
- Need to ensure we cater for end-to-end processing of user needs
What is in it for you
- A safe 24/7 repository for your data
- Access to big data processing
- Support and training
... and offers of data
Peter Smyth
Peter.smyth@manchester.ac.uk
ukdataservice.ac.uk/help/
Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE
Follow us on Twitter https://twitter.com/UKDataService or Facebook https://www.facebook.com/UKDataService