Distributed Tera-Mining R. L. Grossman Laboratory for Advanced Computing University of Illinois & Magnify, Inc.
1. Background Three Fundamental Trends.
Trend 1. Explosion of Data …
… All in the Wrong Format With no one to analyze it.
The Data Gap. (Chart labels: total new disk (TB) since 1995; new Ph.D.s.) Most data comes a GB and a TB at a time.
Data Mining is Inevitable. (Chart labels: log #(amount of networked data); log #(statisticians).) The goal of data mining is to close this gap.
Trend 2. SONET is dead. Lambda rules. Gigabytes can be moved in seconds.
Gigabytes can be Moved in Minutes: 1 TB in 6 hours, 10 GB in 4 minutes; 1 TB in 1.5 hours, 10 GB in 1 minute.
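The transfer times on this slide can be turned into the sustained throughput they imply. The following is a minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and ignoring protocol overhead; the function name is illustrative and not part of the original slides.

```python
def implied_throughput_mbps(num_bytes: float, seconds: float) -> float:
    """Average throughput, in megabits per second, needed to move
    num_bytes in the given number of seconds."""
    return num_bytes * 8 / seconds / 1e6


GB = 1e9   # assumption: decimal gigabyte
TB = 1e12  # assumption: decimal terabyte

# (bytes moved, seconds taken) for each figure quoted on the slide
cases = {
    "1 TB in 6 hours": (1 * TB, 6 * 3600),
    "10 GB in 4 minutes": (10 * GB, 4 * 60),
    "1 TB in 1.5 hours": (1 * TB, 1.5 * 3600),
    "10 GB in 1 minute": (10 * GB, 60),
}

rates = {label: implied_throughput_mbps(b, s) for label, (b, s) in cases.items()}

for label, mbps in rates.items():
    print(f"{label}: about {mbps:.0f} Mb/s sustained")
```

Under these assumptions the first pair of figures corresponds to roughly 0.33 to 0.37 Gb/s sustained, and the second pair to roughly 1.3 to 1.5 Gb/s.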
Trend 3: Most Data is Distributed. Bush's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
Example 1: ENSO & Cholera. El Niño data at NCAR; cholera data at WHO.
Example 2: Voting Table 1 Table 2
Correlation: Reform Voters vs Votes for Buchanan (labeled point: Palm Beach)
2. Internet Infrastructures for Data Data Webs, Semantic Webs, Data Grids, Distributed Data Mining, Digital Libraries and all that
Data Mining. Data mining is the semi-automatic extraction of patterns, models, changes, associations, and anomalies from large data sets. (Diagram: learning set -> data mining algorithm -> statistical model, e.g. <tree-node node-id=8 threshold = etc. >)
Data Mining Process: End-to-End Viewpoint. Phase 1: Exploratory Analysis. Phase 2: Data Analysis & Mining. Phase 3: Deployment & Decision. (Diagram labels: NCAR; WHO; DataSpace; 50%, 0%, 50%.)
DataSpace: One Approach to Making Data Useful. Today's Multi-media Web: 16 terabytes of documents; 4 billion documents; html; http; search by keyword; workstations, servers. Tomorrow's Data Web: petabytes of data; tens of billions to trillions of records; pmml & dtml; dstp; correlate & mine; data & compute clusters. Complementary to the grid, which we view as a distributed computer.
View Data as a Collection of Distributed Columns
Data Servers and Data Browsers. (Diagram labels: NCAR data in Boulder; WHO data in Geneva; DataSpace; data browser in Chicago.)
(Figure: DSTP Server 1 and DSTP Server 2, with labels attributes [aid], UCK [uckid], k[i], y[j] and k[i], x[i].)
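The figure shows two DSTP servers, each serving a column indexed by a key k, which suggests that columns held on different servers are matched on a shared key before they are correlated. The following is a minimal sketch of that idea only. The two dictionaries stand in for columns already fetched from each server, no DSTP client API is used or implied, and the values are made-up placeholders rather than real data.

```python
from math import sqrt


def join_on_key(col_x: dict, col_y: dict) -> list:
    """Inner join of two keyed columns: keep only keys present in both,
    returned as (key, x, y) tuples in sorted key order."""
    shared = sorted(col_x.keys() & col_y.keys())
    return [(k, col_x[k], col_y[k]) for k in shared]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / sqrt(sxx * syy)


# Placeholder columns (hypothetical): key -> value, one dict per server.
server1_x = {1: 2.0, 2: 4.0, 3: 6.0, 5: 10.0}
server2_y = {2: 9.0, 3: 13.0, 4: 1.0, 5: 21.0}

rows = join_on_key(server1_x, server2_y)  # only keys 2, 3 and 5 are shared
r = pearson([x for _, x, _ in rows], [y for _, _, y in rows])
print(rows, r)
```

Keys that appear on only one server are dropped by the inner join, so the correlation is computed only over records that both sources can account for.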
3. Summary & Conclusion
Terra Mining Testbed. Optical testbed for distributed tera mining of scientific data. A further goal is to serve as a testbed for broadband-based business services.
Lessons Learned
1. It's the data, stupid. Cycles, cylinders & lambdas are all commodities.
2. The fundamental challenge: lower the cost to make data useful.
3. The emergence of internet infrastructure for data is inevitable. Opens up possibilities for new types of scientific discoveries.
For More Information DataSpace DataSpace Standards Selected articles Magnify –
End of Slides
FTP Still Lives
Trend 2. Bandwidth is a Commodity OC-3 OC-12 OC-48
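For reference, the SONET optical carrier levels named on this slide have line rates that are integer multiples of the OC-1 rate of 51.84 Mb/s. The following is a minimal sketch that computes those rates and a best-case time to move 1 TB at each one. It assumes a decimal terabyte and the full line rate with no framing or protocol overhead, so real transfers would be slower; the helper names are illustrative.

```python
OC1_MBPS = 51.84  # SONET OC-1 line rate in Mb/s; OC-n runs at n times this


def oc_line_rate_mbps(n: int) -> float:
    """Line rate of an OC-n circuit in megabits per second."""
    return n * OC1_MBPS


def hours_to_move(num_bytes: float, rate_mbps: float) -> float:
    """Best-case hours to move num_bytes at rate_mbps (no overhead assumed)."""
    return num_bytes * 8 / (rate_mbps * 1e6) / 3600


TB = 1e12  # assumption: decimal terabyte

for level in (3, 12, 48):
    rate = oc_line_rate_mbps(level)
    print(f"OC-{level}: {rate:.2f} Mb/s, 1 TB in about "
          f"{hours_to_move(TB, rate):.1f} hours at line rate")
```

Each step up the list (OC-3 to OC-12 to OC-48) quadruples the line rate and so cuts the best-case transfer time to a quarter.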
El Niño Anomalies
Indonesia Cholera Cases
Cholera Cases
Distributed Exabytes (New Disks). (Chart labels: Petabytes; 1 Exabyte.) Source: IDC (1999), "1999 Winchester Disk Drive Market Forecast and Review".
Trend 3: Most Data is Distributed. W's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
Example 2: Voting
Database 1: Total Votes for Buchanan by County
Database 2: Total Registered Reform Voters by County
Correlation: Total Votes vs Buchanan Votes by County (labeled point: Palm Beach)