Distributed Operating Systems Luke Wood
What is a distributed operating system?
Distributed Operating Systems
- Run across multiple physical or virtual machines
- Utilize the processing power of multiple machines
- Huge issues with synchronization in development
- Play a huge role in the world of "big data" (big daters)
What is driving this development? We just have so much data!
90% of the world's data was generated over the past two years
Why Distributed? For real - why didn't we just increase our clock speed? Oh, that's why.
Today's Solution: Hadoop
Hadoop is the most widely used distributed OS in industry. It is made up of:
- Hadoop Common
- Hadoop FS
- MapReduce
- and so much more...
Hadoop History
- Google File System paper published in October 2003
- "MapReduce: Simplified Data Processing on Large Clusters" published in December 2004
- Named after Doug Cutting's son's toy elephant, Hadoop!
Used to Process Data Such As
- Surveillance Data
- Social Media Data
- Stock Exchange Data
- Power Grid Data
- Transport Data
- Search Engine Data
Hadoop Case Study
- Incredibly impressive results
- Insane performance gains using the cluster
Results from "Cloud Hadoop Map Reduce For Remote Sensing Image Analysis" by Mohamed Almeer
The end goal of a distributed OS is to harness the power of multiple machines
What? How!? We utilize the MapReduce paradigm.
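To make the paradigm concrete, here is a minimal single-machine sketch of the three MapReduce phases, using word counting as the example job. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration and are not Hadoop's API; a real cluster would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call only looks at its own document, so the calls
    # are independent and could run on separate machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    # Each key is reduced on its own, so reducers are independent too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data is big", "data is everywhere"]  # made-up sample input
counts = word_count(documents)
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The point of the structure is that no map or reduce call needs to know about any other call, which is what lets the work spread across machines.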
The End.
Just Kidding.
Issues and Solutions: from an OS and application-level perspective
#1: Shared Data When we use a map function - how do we access a shared state? What if our operations are not commutative?
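A small sketch of why operation order matters: when partial results arrive from different machines, the order in which they get combined is not guaranteed. Addition does not care, subtraction does. The values below are made up, and `reordered` simply stands in for "the same results arriving in a different order".

```python
from functools import reduce
from operator import add, sub

values = [8, 3, 2]
reordered = [3, 2, 8]  # same values, different arrival order

# Addition is commutative and associative: order does not change the answer.
sum_a = reduce(add, values)      # (8 + 3) + 2 = 13
sum_b = reduce(add, reordered)   # (3 + 2) + 8 = 13

# Subtraction is neither: the answer depends on arrival order.
diff_a = reduce(sub, values)     # (8 - 3) - 2 = 3
diff_b = reduce(sub, reordered)  # (3 - 2) - 8 = -7

print(sum_a == sum_b)    # True
print(diff_a == diff_b)  # False
```

Strictly speaking a reduce that regroups work across machines also wants the operation to be associative, not only commutative; addition happens to be both.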
Programmer Dependent Solution:
- Just use pure functions
- This can be a challenge
- Not super "general population friendly"
Operating System Solution:
- Operating system provides broadcast functionality
- Can we update the broadcasted data?
- How expensive is this broadcasting?
- Is this a programmer-invoked function?
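The two solutions above can be sketched together: a pure map function that only reads shared data, plus a read-only "broadcast" value. The `broadcast` helper here is hypothetical and is not a real OS or Hadoop call, and the exchange rates are made-up numbers. This sketch picks one possible answer to "can we update the broadcasted data?", namely no.

```python
from types import MappingProxyType


def broadcast(value):
    """Hypothetical helper: give every worker a read-only view of shared data."""
    return MappingProxyType(dict(value))


def to_usd(record, rates):
    """Pure map function: reads the broadcast rates, mutates nothing,
    and returns a new value that depends only on its inputs."""
    currency, amount = record
    return amount * rates[currency]


rates = broadcast({"USD": 1.0, "EUR": 2.0})  # made-up rates
records = [("USD", 5.0), ("EUR", 3.0)]
converted = [to_usd(record, rates) for record in records]
print(converted)  # [5.0, 6.0]

# Writing to the broadcast value fails, so workers cannot drift out of sync.
try:
    rates["EUR"] = 3.0
    update_allowed = True
except TypeError:
    update_allowed = False
print(update_allowed)  # False
```

Making the broadcast read-only sidesteps the synchronization question entirely, at the cost of having to re-broadcast whenever the shared data changes.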
#2: Data distribution How do we distribute data between devices?
Data Distribution Architectures
Master to workers
- only useful in MapReduce
- much simpler than other architectures
Peer to peer file distribution
- much harder to implement
Programmer Dependent Solution:
- Explicitly broadcast data
- Prevents unnecessary data distribution
Operating System Solution:
- Try to intelligently distribute data
- Delegate specific tasks to specific systems
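A rough sketch of master-to-workers distribution, simulated on one machine: the master slices the data into chunks, hands each worker only its own chunk, and combines the partial results. Threads stand in for separate machines here, and `split_into_chunks` and `worker` are illustrative names, not part of any real framework.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_workers):
    """Master side: slice the data into at most num_workers near-equal chunks."""
    size, extra = divmod(len(data), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when data is shorter than num_workers
            chunks.append(data[start:end])
        start = end
    return chunks


def worker(chunk):
    """Worker side: sees only its own chunk and returns a partial result."""
    return sum(x * x for x in chunk)


data = list(range(10))
chunks = split_into_chunks(data, 3)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, chunks))  # results come back in chunk order

total = sum(partials)  # master combines the partial results
print(partials, total)  # [14, 77, 194] 285
```

Each worker receives only the data it needs, which is the "prevents unnecessary data distribution" idea from the programmer dependent column.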
Conclusion
- Distributed operating systems have allowed companies to crunch insane amounts of data in reasonable time frames
- Parallel and distributed computing are made significantly easier through the use of the MapReduce paradigm
- Many of the synchronization problems we have studied in this class are taken care of by the MapReduce implementation
Thank you - check out distributed OS programming - it's a ton of fun