Frontiers in Massive Data Analysis Chapter 3
Difficult to combine data from multiple sources: each organization develops its own way of representing the data. In response, organizations are codeveloping shared metadata structures.
Instead of developing a complicated metadata structure, different organizations share their data with a basic set of operations. More complex tools are developed as they are needed.
Data created from mining confidential sources must meet certain legal and corporate privacy requirements. Private data also has to be protected from malicious users.
Raw processing speed is no longer increasing quickly, so manufacturers are moving toward more processors instead of faster processors. I/O performance has to increase to support multiple cores simultaneously.
Hardware elements that can perform specialized tasks quickly. GPUs are often used to calculate floating-point values rapidly, but are held back by I/O bottlenecks and limited software tools.
CPUs have become more parallel by packing more cores into each socket and executing more operations per clock cycle. More cores at a slower clock speed give superior performance and power efficiency.
The DSMS (data stream management system) runs queries on (typically real-time) input streams. The feeds are analyzed and summarized continuously.
Can use a structured query language similar to SQL that uses windowing to limit how much data is analyzed at once. Can also use a "boxes-and-arrows" system that provides a graphical interface: the user selects what task executes in each box and connects the boxes with arrows to define how data is analyzed.
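A minimal sketch of the windowing idea in Python (the sliding-window size and the sensor feed are invented for illustration; this mimics the window clause of a streaming SQL dialect, not any particular DSMS API):

from collections import deque

def sliding_window_average(stream, window_size=100):
    """Continuously emit the mean of the last `window_size` readings.

    `stream` is any iterable of numeric readings; the window bounds
    how much data is held in memory at once.
    """
    window = deque(maxlen=window_size)  # old readings fall off automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

# Example: summarize a (simulated) real-time feed of sensor values.
feed = iter([12.0, 15.5, 11.2, 14.8, 13.3])
for avg in sliding_window_average(feed, window_size=3):
    print(f"windowed average: {avg:.2f}")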
A clustered system consists of multiple high-performance nodes that execute submitted jobs; think of the HPC systems on campus. A job manager controls load balancing and queue management.
Provides access to distributed file systems stored on different servers. The user is presented with a standard file system that hides the underlying distribution.
POSIX-compliant systems provide the same interface that a standalone file system would. This makes it simple to convert programs to use clustered resources.
Metadata is managed separately by dedicated servers, which forward client requests to the correct file server. Distributed systems run into synchronization issues as the cluster grows large.
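A toy sketch of that split in Python (the MetadataServer class and its path-to-server map are hypothetical; real systems add replication, caching, and locking):

class MetadataServer:
    """Maps file paths to the data server holding each file.

    Toy stand-in for a dedicated metadata service: clients ask it
    where a file lives, then talk to that data server directly.
    """
    def __init__(self, placement):
        self.placement = placement  # path -> data server address

    def locate(self, path):
        return self.placement[path]

meta = MetadataServer({"/logs/day1": "server-a", "/logs/day2": "server-b"})

def read_file(path):
    server = meta.locate(path)                # one small metadata request...
    print(f"fetching {path} from {server}")   # ...then bulk I/O goes straight to the data server

read_file("/logs/day2")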
These systems were designed to solve the issues that POSIX systems encounter in large clusters. Metadata is still handled by dedicated servers.
Designed to handle distributed analysis tasks. Uses a large block size (64 MB) to minimize metadata requests by clients. Clients are expected to handle inconsistencies in the file system by comparing checksums.
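A rough sketch of client-side checksum verification in Python (SHA-256 is an assumption for illustration; production systems usually use cheaper per-block CRCs):

import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, matching the note above

def block_checksums(path):
    """Return one checksum per 64 MB block of a file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            sums.append(hashlib.sha256(block).hexdigest())
    return sums

def find_bad_blocks(path, expected):
    """Client-side check: return indices of blocks whose checksum drifted."""
    return [i for i, (got, want) in enumerate(zip(block_checksums(path), expected))
            if got != want]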
Maps data across a collection of nodes to partition it, then shuffles the hashed records so that records with a common key are passed to the same node. This simplifies analysis on distributed data.
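A single-process sketch of the map-shuffle-reduce flow in Python (the word-count job and the three pretend nodes are illustrative assumptions; real systems use a stable hash across machines):

from collections import defaultdict

NUM_NODES = 3  # pretend cluster size

def map_phase(lines):
    # Map: emit (key, value) pairs from the raw input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: hash each key so identical keys land on the same "node".
    nodes = [defaultdict(list) for _ in range(NUM_NODES)]
    for key, value in pairs:
        nodes[hash(key) % NUM_NODES][key].append(value)
    return nodes

def reduce_phase(nodes):
    # Reduce: each node aggregates only the keys it owns.
    return {k: sum(v) for node in nodes for k, v in node.items()}

counts = reduce_phase(shuffle(map_phase(["big data big clusters", "big analysis"])))
print(counts)  # {'big': 3, 'data': 1, 'clusters': 1, 'analysis': 1}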
Resources in a multi-tenant cluster are dynamically allocated as a user's needs change. Allows users to gain access to large systems without the overhead associated with maintaining a large cluster.
Databases reliably store and retrieve data and can provide querying over the data sets. Large parallel databases are spread over servers without a cluster file system managing the nodes.
Data can be partitioned by spreading it evenly among the nodes or by hashing on some of the fields. The nodes evaluate queries on their local partitions, then the results from each node are combined.
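A sketch of that evaluate-locally-then-combine pattern in Python (the customer_id hash key and the SUM query are invented for illustration):

NUM_NODES = 4

def partition_by_hash(rows, key, num_nodes=NUM_NODES):
    """Spread rows across nodes by hashing one field."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def local_sum(partition):
    # Each node evaluates the aggregate over its own partition only.
    return sum(row["amount"] for row in partition)

rows = [{"customer_id": i, "amount": i} for i in range(100)]
partitions = partition_by_hash(rows, "customer_id")
total = sum(local_sum(p) for p in partitions)  # coordinator combines per-node results
print(total)  # 4950, the same answer a single-node scan would give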
If certain tables are frequently joined together in queries, store them on the same node. When joining tables that live on different nodes, transfer the smaller of the two.
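A sketch of the transfer-the-smaller-table rule in Python (the orders and regions tables are invented; building the in-memory index stands in for the network transfer):

def distributed_join(left, right, key):
    """Join two tables held on different nodes.

    Ship the smaller table to the node holding the larger one,
    so less data crosses the network.
    """
    small, big = (left, right) if len(left) <= len(right) else (right, left)
    # "Transfer" the small table: in a real system this is a network send.
    index = {row[key]: row for row in small}
    # Merge each matching pair of rows (dict union, Python 3.9+).
    return [big_row | index[big_row[key]]
            for big_row in big if big_row[key] in index]

orders = [{"region_id": 1, "amount": 50}, {"region_id": 2, "amount": 75}]
regions = [{"region_id": 1, "name": "east"}, {"region_id": 2, "name": "west"}]
print(distributed_join(orders, regions, "region_id"))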
Parallel databases are very difficult to tune and populate with data. Parallel programs are also very difficult to develop and debug.