Published by Michael Daley. Modified over 11 years ago.
1
Distributed Tera-Mining. R. L. Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify, Inc.
2
1. Background Three Fundamental Trends.
3
Trend 1. Explosion of Data …
4
… All in the Wrong Format, with no one to analyze it.
5
The Data Gap. [Chart: total new disk (TB) since 1995 vs. new Ph.D.s] Most data comes a GB and a TB at a time.
6
Data Mining is Inevitable. [Chart: log #(amount of networked data) vs. log #(statisticians)] The goal of data mining is to close this gap.
7
Trend 2. SONET is dead. Lambda rules. Gigabytes can be moved in seconds.
8
Gigabytes can be Moved in Minutes: 1 TB in 6 hours, 10 GB in 4 minutes; 1 TB in 1.5 hours, 10 GB in 1 minute.
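The figures on this slide imply sustained throughputs that can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the slide does not state; the function and variable names are illustrative only.

```python
def implied_rate_mbps(size_bytes: float, seconds: float) -> float:
    """Sustained rate in megabits per second needed to move size_bytes in seconds."""
    return size_bytes * 8 / seconds / 1e6


# Assumption: decimal units; the slide does not say which convention it uses.
TB = 1e12
GB = 1e9

# (label, size in bytes, duration in seconds), taken from the slide text.
slide_figures = [
    ("1 TB in 6 hours", TB, 6 * 3600),
    ("10 GB in 4 minutes", 10 * GB, 4 * 60),
    ("1 TB in 1.5 hours", TB, 1.5 * 3600),
    ("10 GB in 1 minute", 10 * GB, 60),
]

rates = {label: implied_rate_mbps(size, secs) for label, size, secs in slide_figures}

for label, rate in rates.items():
    print(f"{label}: about {rate:.0f} Mb/s sustained")
```

Under those assumptions the first pair of figures works out to roughly 330 to 370 Mb/s and the second pair to roughly 1.3 to 1.5 Gb/s, a factor of four apart.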
9
Trend 3: Most Data is Distributed. Bush's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
10
Example 1: ENSO & Cholera. El Niño data at NCAR; cholera data at WHO.
11
Example 2: Voting. [Table 1, Table 2]
12
Correlation: Reform Voters vs. Votes for Buchanan. [Plot label: Palm Beach]
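A comparison of this kind can be sketched as a Pearson correlation between two per-county columns, followed by a least-squares fit to find the county that sits furthest from the trend. The county names and counts below are invented placeholders, not the election data behind the slide, and the function names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def largest_residual(keys, xs, ys):
    """Fit y = a + b*x by least squares; return the key with the largest |residual|."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    residuals = {k: y - (a + b * x) for k, x, y in zip(keys, xs, ys)}
    return max(residuals, key=lambda k: abs(residuals[k]))


# Hypothetical data: six made-up counties, already aligned by county.
counties = ["county_a", "county_b", "county_c", "county_d", "county_e", "county_f"]
reform_voters = [10, 20, 30, 40, 50, 60]
candidate_votes = [100, 200, 900, 400, 500, 600]

r = pearson(reform_voters, candidate_votes)
outlier = largest_residual(counties, reform_voters, candidate_votes)
print(f"r = {r:.2f}, furthest from the fitted line: {outlier}")
```

A single county far off the line both lowers the correlation and shows up as the largest residual, which is one plausible reading of why the plot singles out one point.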
13
2. Internet Infrastructures for Data: Data Webs, Semantic Webs, Data Grids, Distributed Data Mining, Digital Libraries, and all that
14
Data Mining. Data mining is the semi-automatic extraction of patterns, models, changes, associations, and anomalies from large data sets. [Diagram labels: learning set; data mining algorithm; statistical model, e.g. <tree-node node-id=8 threshold = 0.239494 etc. >]
15
Data Mining Process: End-to-End Viewpoint. Phase 1. Exploratory Analysis; Phase 2. Data Analysis & Mining; Phase 3. Deployment & Decision. [Diagram labels: NCAR, WHO, DataSpace; 50%, 0%, 50%]
16
DataSpace – One Approach to Making Data Useful
Today's Multi-media Web: 16 terabytes of documents; 4 billion documents; html; http; search by keyword; workstations, servers.
Tomorrow's Data Web: petabytes of data; tens of billions to trillions of records; pmml & dtml; dstp; correlate & mine; data & compute clusters.
Complementary to the grid, which we view as a distributed computer.
17
View Data as a Collection of Distributed Columns
18
Data Servers and Data Browsers: NCAR data in Boulder; WHO data in Geneva; DataSpace data browser in Chicago.
19
[Diagram: DSTP Server 1 and DSTP Server 2; attributes [aid]; UCK [uckid]; columns k[i], x[i] and k[i], y[j]. Click to obtain graph.]
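The diagram suggests two servers that each hold a column indexed by a shared key k. A minimal sketch of combining them, assuming one server supplies (k, x) pairs and the other (k, y) pairs, is a join on the keys both sides have in common. This uses plain dictionaries and is not the DSTP protocol or any real client API; the function name, keys, and values are hypothetical.

```python
def join_on_key(column_x, column_y):
    """Return (key, x, y) rows for keys present in both columns, sorted by key."""
    shared = sorted(column_x.keys() & column_y.keys())
    return [(k, column_x[k], column_y[k]) for k in shared]


# Hypothetical columns, as if fetched from two separate data servers.
server1_column = {"k1": 1.0, "k2": 2.0, "k3": 3.0}      # k[i] -> x[i]
server2_column = {"k2": 20.0, "k3": 30.0, "k4": 40.0}   # k[j] -> y[j]

rows = join_on_key(server1_column, server2_column)
for key, x, y in rows:
    print(key, x, y)
```

Keys that appear on only one server are dropped, so the joined rows can be passed straight to a plot or a correlation.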
20
3. Summary & Conclusion
21
Terra Mining Testbed. Optical testbed for distributed tera mining of scientific data. The goal is also to be a testbed for broadband-based business services.
22
Lessons Learned
1. It's the data, stupid. Cycles, cylinders & lambdas are all commodities.
2. The fundamental challenge: lower the cost to make data useful.
3. The emergence of internet infrastructure for data is inevitable. Opens up possibilities for new types of scientific discoveries.
23
For More Information
DataSpace: http://www.dataspaceweb.net, http://www.ncdm.uic.edu
DataSpace Standards: http://www.dmg.org
Selected articles: http://www.twocultures.net
Magnify: http://www.magnify.com
24
End of Slides
25
FTP Still Lives
26
Trend 2. Bandwidth is a Commodity: OC-3, OC-12, OC-48
27
El Nina Anomalies
28
Indonesia Cholera Cases
29
Cholera Cases
30
Distributed Exabytes (New Disks). [Chart axis labels: Petabytes; 1 Exabyte] Source: IDC (1999), "1999 Winchester Disk Drive Market Forecast and Review".
31
Trend 3: Most Data is Distributed. W's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
32
Example 2: Voting
33
Database 1: Total Votes for Buchanan by County
34
Database 2: Total Registered Reform Voters by County
35
Correlation: Total Votes vs. Buchanan Votes by County. [Plot label: Palm Beach]