Distributed Tera-Mining R. L. Grossman Laboratory for Advanced Computing University of Illinois & Magnify, Inc.

1 Distributed Tera-Mining R. L. Grossman Laboratory for Advanced Computing University of Illinois & Magnify, Inc.

2 1. Background Three Fundamental Trends.

3 Trend 1. Explosion of Data …

4 … All in the Wrong Format With no one to analyze it.

5 The Data Gap. [Chart: total new disk (TB) since 1995 vs. new Ph.D.s] Most data comes a GB and a TB at a time.

6 Data Mining is Inevitable. [Chart: log #(amount of networked data) vs. log #(statisticians)] The goal of data mining is to close this gap.

7 Trend 2. SONET is dead. Lambda Rules. Gigabytes can be moved in seconds.

8 Gigabytes can be Moved in Minutes: 1 TB in 6 hours; 10 GB in 4 minutes; 1 TB in 1.5 hours; 10 GB in 1 minute
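The four transfer times on this slide imply a sustained throughput each. The sketch below works out that arithmetic; it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and ignores protocol overhead, neither of which the slide states. The function name `implied_rate_mbps` is made up for illustration.

```python
def implied_rate_mbps(num_bytes: float, seconds: float) -> float:
    """Sustained rate in megabits per second needed to move num_bytes in seconds."""
    return num_bytes * 8 / seconds / 1e6

# Assumption: decimal units, which the slide does not specify.
TB = 1e12
GB = 1e9

# (label, size in bytes, duration in seconds), taken from the slide.
transfers = [
    ("1 TB in 6 hours", TB, 6 * 3600),
    ("10 GB in 4 minutes", 10 * GB, 4 * 60),
    ("1 TB in 1.5 hours", TB, 1.5 * 3600),
    ("10 GB in 1 minute", 10 * GB, 60),
]

rates = {label: implied_rate_mbps(size, secs) for label, size, secs in transfers}
for label, rate in rates.items():
    print(f"{label}: about {rate:.0f} Mbit/s sustained")
```

Under these assumptions the first pair of figures works out to roughly 330 to 370 Mbit/s and the second pair to roughly 1.3 to 1.5 Gbit/s.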

9 Trend 3: Most Data is Distributed. Bush's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

10 Example 1: ENSO & Cholera. El Niño Data at NCAR; Cholera Data at WHO

11 Example 2: Voting Table 1 Table 2

12 Correlation: Reform Voters vs Votes for Buchanan Palm Beach

13 2. Internet Infrastructures for Data: Data Webs, Semantic Webs, Data Grids, Distributed Data Mining, Digital Libraries and all that

14 Data Mining. Data mining is the semi-automatic extraction of patterns, models, changes, associations, and anomalies from large data sets. [Diagram: learning set, data mining algorithm, statistical model, e.g. <tree-node node-id=8 threshold = 0.239494 etc. >]
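As a rough illustration of the flow on this slide (a learning set goes into a data mining algorithm, which emits a statistical model such as a tree node with a threshold), here is a minimal sketch of a one-node decision tree ("decision stump"). It is not the algorithm used in the talk; the function `fit_stump`, the toy learning set, and the `model` dictionary layout are all assumptions made for the example.

```python
from typing import List, Tuple

def fit_stump(values: List[float], labels: List[int]) -> Tuple[float, int]:
    """Return (threshold, errors) for the rule 'predict 1 when value > threshold'.

    Assumes at least two distinct values in the learning set.
    """
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_threshold, best_errors = candidates[0], len(values) + 1
    for t in candidates:
        # Count learning-set examples the rule would misclassify.
        errors = sum(int(v > t) != y for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors

# Toy learning set (made-up values, one numeric attribute, binary label).
learning_values = [0.1, 0.2, 0.3, 0.4]
learning_labels = [0, 0, 1, 1]

threshold, errors = fit_stump(learning_values, learning_labels)
# Hypothetical in-memory stand-in for a serialized tree node.
model = {"node_id": 1, "threshold": threshold}
print(model, "training errors:", errors)
```

The learned threshold plays the role of the `threshold` attribute in the slide's `tree-node` fragment: the model is a small, portable description that can be shipped separately from the learning set.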

15 Data Mining Process - End to End Viewpoint. Phase 1. Exploratory Analysis; Phase 2. Data Analysis & Mining; Phase 3. Deployment & Decision. [Diagram labels: NCAR, WHO, DataSpace, 50%0%50%]

16 DataSpace – One Approach to Making Data Useful. Today's Multi-media Web: 16 terabytes of documents; 4 billion documents; html; http; search by keyword; workstations, servers. Tomorrow's Data Web: petabytes of data; tens of billions to trillions of records; pmml & dtml; dstp; correlate & mine; data & compute clusters. Complementary to the grid, which we view as a distributed computer.

17 View Data as a Collection of Distributed Columns

18 Data Servers and Data Browsers. NCAR data in Boulder; WHO data in Geneva; DataSpace Data browser in Chicago

19 [Diagram: DSTP Server 1 and DSTP Server 2; labels: attributes [aid], UCK [uckid]; columns k[i], y[j] and k[i], x[i]. Click to obtain graph]
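The diagram suggests two servers that each hold one column, both indexed by a shared key k, so that a client can line the columns up. The sketch below shows only that joining step, locally and with no networking. It does not use the DSTP protocol or any real DataSpace API; `join_on_key`, the dictionary layout, and all values are placeholders invented for illustration.

```python
from typing import Dict, List, Tuple

def join_on_key(col_x: Dict[str, float],
                col_y: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Align two keyed columns, keeping only keys present in both."""
    shared = sorted(col_x.keys() & col_y.keys())
    return [(k, col_x[k], col_y[k]) for k in shared]

# Placeholder columns standing in for what each server might return.
# The keys and values are made up; they are not real measurements.
server1_x = {"k1": 0.4, "k2": 0.7, "k3": 1.1}
server2_y = {"k2": 12.0, "k3": 30.0, "k4": 25.0}

rows = join_on_key(server1_x, server2_y)
for key, x, y in rows:
    print(key, x, y)
```

Only keys present on both servers survive the join, which is why a key shared across independently maintained data sets matters in this picture.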

20 3. Summary & Conclusion

21 Terra Mining Testbed. Optical testbed for distributed tera mining of scientific data. A further goal is to serve as a testbed for broadband-based business services.

22 Lessons Learned. 1. It's the data, stupid. Cycles, cylinders & lambdas are all commodities. 2. The fundamental challenge: lower the cost to make data useful. 3. The emergence of internet infrastructure for data is inevitable. It opens up possibilities for new types of scientific discoveries.

23 For More Information. DataSpace: http://www.dataspaceweb.net, http://www.ncdm.uic.edu; DataSpace Standards: http://www.dmg.org; Selected articles: http://www.twocultures.net; Magnify: http://www.magnify.com

24 End of Slides

25 FTP Still Lives

26 Trend 2. Bandwidth is a Commodity OC-3 OC-12 OC-48

27 El Nina Anomalies

28 Indonesia Cholera Cases

29 Cholera Cases

30 Distributed Exabytes (New Disks). [Chart labels: Petabytes; 1 Exabyte] Source: IDC (1999) "1999 Winchester Disk Drive Market Forecast and Review"

31 Trend 3: Most Data is Distributed. W's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

32 Example 2: Voting

33 Database 1: Total Votes for Buchanan by County

34 Database 2: Total Registered Reform Voters by County

35 Correlation: Total Votes vs Buchanan Votes by County Palm Beach
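Slides 33 to 35 describe two separate per-county databases that are joined and then correlated. A minimal sketch of that step follows. The county names and counts are invented placeholders, not election data, and the "outlier" row merely stands in for a county that sits far off the trend, as the slide appears to single out Palm Beach; `pearson` is a plain textbook implementation written for this example.

```python
import math
from typing import Dict, List

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Database 1 and Database 2, keyed by county. All values are made up.
votes_by_county: Dict[str, float] = {
    "county_a": 100, "county_b": 200, "county_c": 300,
    "county_d": 400, "outlier": 2000,
}
registered_by_county: Dict[str, float] = {
    "county_a": 10, "county_b": 20, "county_c": 30,
    "county_d": 40, "outlier": 25,
}

def correlate(counties: List[str]) -> float:
    """Join the two databases on county name, then correlate the columns."""
    xs = [registered_by_county[c] for c in counties]
    ys = [votes_by_county[c] for c in counties]
    return pearson(xs, ys)

all_counties = sorted(votes_by_county.keys() & registered_by_county.keys())
r_all = correlate(all_counties)
r_without = correlate([c for c in all_counties if c != "outlier"])
print(f"with outlier: {r_all:.2f}, without: {r_without:.2f}")
```

In this toy data a single off-trend county is enough to pull the correlation from 1.0 down to about 0.14, which is the kind of effect a scatter plot of the joined columns makes visible.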

