Download presentation
Presentation is loading. Please wait.
Published byVivian Nicholson Modified over 6 years ago
1
Deploying Big Data and Development Environments Using Ansible
Hagen Hodgkins, Badi' Abdul-Wahid, Gregor von Laszewski School Of Informatics and Computing, Indiana University Introduction Description of the Population Project Methodology: Population Project This project attempts to uncover some of the key factors behind why population numbers change. We highlight population numbers by county and use statistical data to discover trends. Data sets were made publicly available by the U.S. Census Bureau and the United States Department of Labor: Bureau of Labor Statistics. The first data set is the population change for counties in the United States and Puerto Rico: 2000 to The other two data sets are unemployment rates in counties in the United States and Puerto Rico, one for 2000 and the other for We Plan to derive meaningful insight between population changes in counties and unemployment rate, specifically how unemployment rates appear to affect population changes. The data sets are loaded into MongoDB. The analysis is conducted by a Python script, titled "analysis.py" which pulls data directly from MongoDB, does a short analysis, and exports the data in CSV format. Visualization is done on Tableau. Initially the roles were built into the project repository. These are roles that are required for other projects and as such there would have been redundancy in the roles when attempting to install this project onto a machine which hosted other projects requiring the same roles. To resolve this issue we looked to Ansible. The use of Ansible automation is important to this big data environment because it allows for the installation of the prerequisites for a project to function properly automatically and in bulk. It allowed the team to create and upload a single version of a role that we can use across a variety of projects and update as required without the need to check every project repository for duplicates of a role. This is because the role is not located within any of the projects. The roles are stored within their own repositories then called from Ansible galaxy when needed via terminal or command line onto the machine it is needed on. This removed the need to install dependencies one by one. In the case of the population project the prerequisites are Hadoop, a framework for data storage and cluster management, and mongodb, an open source database system. This project had the ansible roles for installing the dependencies built into the projects repository. This is an issue because if done in other projects, over time it can lead to the oversaturation of redundant or duplicated roles. While initially this may not present issues other than the unnecessary consumption of storage space on the users machine as software gets updated and old syntax's become obsolete or otherwise unusable in the new version all these redundant roles will need to be independently to ensure all the projects they are used in don’t become nonfunctional. The team also took time to investigate the project as a whole to remove documentation that was not required or was otherwise unnecessary. We also ensured what documentation remained was helpful and relevant to any user who is working with the project in the future. This included rewriting the instructions for installation and setup by the user as well as additional documentation pertaining to the what the project is intended to accomplish and the process through which it is achieved. We needed to automate the setup of a python development environment for a Linux machine running Ubuntu Xenial For the purpose of creating an automated deployment of programs and dependencies we chose to use Ansible. The team also evaluated a big data project dealing with the comparison of populations and unemployment rates in counties throughout the United States. The roles for this project needed to be reworked and housed within their own repositories. The readme files for both projects required extensive overhauls to make them useful to potential future users and developers. Methodology: Development Environment This project focuses on giving a developer on a new machine the environment he needs to begin development using python and working with Cloudmesh. For this purpose the team decided to use an Ansible playbook. When deployed the playbook would proceed to install python, its dependencies, a variety of python editors, as well as Cloudmesh client. The majority of the tasks in the role were written into a file called python.yml which was responsible for installing all but the Pycharm application, which in turn was written into a separate .yml file due to more specific instructions required for its automatic installation. Both of these are called from main.yml which the user launches to deploy the playbook on their machine. The main.yml file uses include functionality to call and launch the other .yml files automatically, thus deploying the playbook and installing the programs. A readme file was created in the primary directory of the role which outlines what the role does as well as the instructions on how to deploy the role onto a new machine. The role was then tested piecemeal as it was created to ensure that the components worked properly before rolling them together and testing the project as a whole. Development Environment Visualization Population Big Data Environment Visualization Role uploaded to Ansible Galaxy Hadoop & MongoDB Roles uploaded to Ansible Galaxy GitHub GitHub Role downloaded onto machine Population project downloaded onto platform Hadoop and MongoDB Roles downloaded onto Platform Machine connects to the platform FutureSystems Role deployed Roles deployed & project run & visualized Installed rofile/eden3065#!/ Cloudmesh Client + Dependencies Development Environment Deployment Conclusions & Future Work Acknowledgements The machine you deploy this on may require Ansible to be installed. $ sudo apt-get install ansible To avoid the issue of installing the role to an incorrect directory by not being able to specify one as the install path. $ sudo ansible-galaxy install cloudmesh.ansible-cloudmesh-ubuntu- xenial --roles-path=~/directory/of/your/choosing/ To deploy the role and install your development environment launch the main.yml from its directory. $ sudo ansible-playbook ~/directoryofyourchoosing/ansible- cloudmesh-ubuntu-xenial/tasks/main.yml In regards to the Deployment environment the role, with possible exception of the pycharm.yml playbook should work on Ubuntu Xenial machines with little issue. Future works for this project would include ensuring the pycharm playbook functions correctly. In addition future researchers could work on developing it further to function on additional operating systems. The Population project, while cleaned up and having the roles shifted to individual repositories still requires additional attention. Future researchers should test and ensure functionality on the chosen platform for the project, FutureSystems. Additional work is needed within the documentation of the project to guarantee that it is comprehensive for new users or researchers. Thanks to the Center of Excellence in Remote Sensing Education and Research and Indiana University for this Research Experience for Undergraduates program. Thanks to Dr. Linda Hayden, and Dr. Lamara Warren for help in securing this research opportunity. Our thanks to Gregor von Laszewski, and Badi' Abdul-Wahid for their assistance in completing this research experience. References 1. Eden Barnett, Priyanka Jadhav, and Jeff Sustarsic cloudmesh/ansible-cloudmesh-population. (May 2016). Retrieved June 28, 2016 from population/tree/master/population/class/population1
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.