Published by Gillian Manning. Modified over 9 years ago.
1
Frontiers in Massive Data Analysis Chapter 3
2
It is difficult to combine data from multiple sources.
Each organization develops its own way of representing the data.
Organizations are co-developing shared metadata structures.
3
Instead of developing a complicated metadata structure, different organizations share their data through a basic set of operations.
More complex tools are developed as they are needed.
4
Data derived from mining confidential data must meet certain legal and corporate privacy requirements.
Private data also has to be protected from malicious users.
5
Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving toward more processors instead of faster processors.
I/O performance has to increase to support multiple cores simultaneously.
6
Some hardware elements can perform specialized tasks quickly.
GPUs are often used for fast floating-point calculations, but they are limited by I/O bottlenecks and limited software tools.
7
CPUs have become more parallel by adding more cores per socket and increasing the number of operations that can be executed per clock cycle.
More cores running at a lower clock speed can offer better performance and power efficiency.
8
A data stream management system (DSMS) runs queries on input streams, typically in real time.
The feeds are analyzed and summarized continuously.
9
A DSMS can use a structured query language similar to SQL, with windowing to limit how much data is analyzed.
It can also use a “boxes-and-arrows” system that provides a graphical interface: the user selects which task executes in each box and connects the boxes with arrows to define how data flows through the analysis.
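To make the windowing idea concrete, here is a minimal Python sketch of a continuous query that keeps only the most recent readings. The function name `windowed_average` and the sample readings are assumptions made for illustration; a real DSMS would express this declaratively in its query language.

```python
from collections import deque


def windowed_average(stream, window_size):
    """Yield the mean of the most recent `window_size` values after each arrival."""
    window = deque(maxlen=window_size)  # oldest value is dropped automatically
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # subtract the value that is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# Hypothetical feed of sensor readings, summarized over a 3-item window.
readings = [10, 20, 30, 40, 50]
averages = list(windowed_average(readings, window_size=3))
```

Because the window is bounded, the memory used stays constant no matter how long the stream runs.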
10
A clustered system consists of multiple high-performance nodes that execute submitted jobs.
Think of the HPC systems on campus.
A job manager controls load balancing and queue management.
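As a rough illustration of what a job manager does, the sketch below takes jobs from a first-in, first-out queue and hands each one to the node with the least work so far. The `schedule` function, the job names, and the cost estimates are all hypothetical; production schedulers also consider priorities, reservations, and resource limits.

```python
import heapq
from collections import deque


def schedule(jobs, num_nodes):
    """Assign (name, estimated_cost) jobs in FIFO order to the least-loaded node."""
    queue = deque(jobs)                          # queue management: first in, first out
    loads = [(0, node) for node in range(num_nodes)]
    heapq.heapify(loads)                         # min-heap keyed on current load
    assignment = {node: [] for node in range(num_nodes)}
    while queue:
        name, cost = queue.popleft()
        load, node = heapq.heappop(loads)        # load balancing: pick the idlest node
        assignment[node].append(name)
        heapq.heappush(loads, (load + cost, node))
    return assignment


jobs = [("a", 4), ("b", 2), ("c", 3), ("d", 1)]
assignment = schedule(jobs, num_nodes=2)
```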
11
A cluster file system provides access to file systems distributed across different servers.
The user is presented with a standard file system that hides the underlying distributed systems.
12
POSIX-compliant systems provide the same interface that a standalone file system would provide.
This makes it simple to convert programs to use clustered resources.
13
Metadata is managed separately by dedicated servers, which forward client requests to the correct file server.
Distributed systems run into synchronization issues as the cluster grows large.
14
These systems were designed to solve the issues that POSIX systems encounter in large clusters.
Metadata is still handled by dedicated servers.
15
Designed to handle distributed analysis tasks.
Uses a large block size (64 MB) to minimize metadata requests from clients.
Clients are expected to handle inconsistencies in the file system by comparing checksums.
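The two points above can be sketched in a few lines of Python: larger blocks mean fewer blocks per file (and so fewer metadata lookups), and a client can detect a damaged replica by comparing per-block checksums. The helper names are hypothetical, CRC-32 is just one possible checksum, and the demo uses a tiny block size so it runs instantly.

```python
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def blocks_needed(file_size, block_size):
    """Number of blocks (and so metadata entries) a file of `file_size` bytes needs."""
    return -(-file_size // block_size)  # integer ceiling division


def split_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def find_corrupt_blocks(blocks, expected_checksums):
    """Return the indexes of blocks whose CRC-32 differs from the recorded value."""
    return [i for i, (block, expected) in enumerate(zip(blocks, expected_checksums))
            if zlib.crc32(block) != expected]


# A 1 GiB file needs far fewer lookups with 64 MB blocks than with 4 KiB blocks.
large_block_count = blocks_needed(2**30, BLOCK_SIZE)
small_block_count = blocks_needed(2**30, 4096)

# Demo with a tiny block size: damage one byte of a replica and detect it.
data = b"abcdefghij" * 10
blocks = split_blocks(data, block_size=32)
checksums = [zlib.crc32(block) for block in blocks]
replica = list(blocks)
replica[2] = b"X" + replica[2][1:]
corrupt = find_corrupt_blocks(replica, checksums)
```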
16
A collection of nodes maps over partitioned data, then the hashed output is shuffled so that records with a common key are passed to the same node.
This simplifies analysis of distributed data.
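The map, shuffle, and reduce steps can be imitated in a single Python process. This is only a sketch of the pattern using a word count; the function names and sample partitions are made up for the example, and CRC-32 stands in for whatever hash a real framework uses.

```python
import zlib
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one node's local partition."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs, num_reducers):
    """Route each key to a reducer chosen by hashing, so equal keys meet on one node."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            index = zlib.crc32(key.encode()) % num_reducers
            buckets[index][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Sum the values collected for each key on one reducer."""
    return {key: sum(values) for key, values in bucket.items()}


# Three hypothetical nodes, each holding one partition of the input.
partitions = [["a b a"], ["b c"], ["a c c"]]
mapped = [map_phase(p) for p in partitions]
results = [reduce_phase(b) for b in shuffle(mapped, num_reducers=2)]

totals = {}
for result in results:
    totals.update(result)
```

Because every occurrence of a key lands on the same reducer, each reducer can finish its keys without talking to the others.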
17
Resources in a multi-tenant cluster are dynamically allocated as a user’s needs change.
This allows users to gain access to large systems without the overhead associated with maintaining a large cluster.
18
Databases reliably store and retrieve data and can provide querying over the data sets.
Large parallel databases are spread over servers without a cluster file system managing nodes.
19
Data can be partitioned by spreading it evenly among the nodes or by spreading it based on hashes of some of the fields.
The nodes evaluate queries on their local partitions, then the results from each node are combined.
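Both partitioning schemes and the "evaluate locally, then combine" step can be sketched as follows. The table, field names, and helper functions are hypothetical, and the query is a simple average so that the partial results (a sum and a count per node) are easy to merge.

```python
import zlib


def round_robin_partition(rows, num_nodes):
    """Spread rows evenly by dealing them out to the nodes in turn."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_partition(rows, field, num_nodes):
    """Place each row on the node selected by a hash of one of its fields."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        index = zlib.crc32(str(row[field]).encode()) % num_nodes
        nodes[index].append(row)
    return nodes


def global_average(nodes, field):
    """Each node computes a local (sum, count); the partials are then combined."""
    partials = [(sum(r[field] for r in rows), len(rows)) for rows in nodes]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


orders = [{"customer": c, "amount": a}
          for c, a in [("ann", 10), ("bob", 20), ("ann", 30), ("cy", 40)]]
even_nodes = round_robin_partition(orders, num_nodes=3)
hashed_nodes = hash_partition(orders, "customer", num_nodes=3)
```

Either layout gives the same answer for this query; hash partitioning additionally guarantees that all rows for one customer sit on the same node.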
20
If certain tables are frequently joined together in queries, store them on the same node.
When joining tables from different nodes, transfer the smaller of the two.
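The "transfer the smaller table" rule can be sketched as a hash join in which the smaller table is the one shipped and indexed. The function `distributed_join` and the sample tables are assumptions for illustration; row count is used here as a stand-in for transfer cost.

```python
def distributed_join(left, right, key):
    """Join two tables held on different nodes, shipping whichever has fewer rows.

    Returns the joined rows and the number of rows sent over the network.
    """
    left_is_shipped = len(left) <= len(right)
    shipped, stationary = (left, right) if left_is_shipped else (right, left)

    # Build a hash index on the shipped (smaller) table at the receiving node.
    index = {}
    for row in shipped:
        index.setdefault(row[key], []).append(row)

    # Probe the index with the rows that never left their node.
    joined = []
    for row in stationary:
        for match in index.get(row[key], []):
            l_row, r_row = (match, row) if left_is_shipped else (row, match)
            joined.append({**l_row, **r_row})
    return joined, len(shipped)


customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [{"order_id": 100, "cust_id": 1},
          {"order_id": 101, "cust_id": 2},
          {"order_id": 102, "cust_id": 1}]
joined, rows_shipped = distributed_join(customers, orders, key="cust_id")
```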
21
Parallel databases are very difficult to tune and populate with data.
Parallel programs are very difficult to develop and debug.