
1 PAKDD Panel: What Next
Ramakrishnan Srikant

2 What Next
Electronic Commerce
–Catalog Integration (WWW 2001, with R. Agrawal)
–Searching with Numbers (WWW 2002, with R. Agrawal)
Security
Privacy

3 Catalog Integration
B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog, New Catalog]

4 Intuition
Use affinity information in new catalog.
–Products in same category are similar.
Bias Naïve Bayes classifier to incorporate this information.
–Accuracy boost depends on match between two categorizations.
–Use tuning set to determine weight given to affinity information.
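
The intuition on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method from the WWW 2001 paper: the function names, the two-pass scheme, the add-one smoothing, and the toy log-likelihoods are all hypothetical. Pass 1 runs plain Naive Bayes on the products of one new-catalog category; pass 2 boosts the prior of master categories that attracted many of those sibling products, with `weight` controlling how strong the boost is.

```python
import math
from collections import Counter


def classify_with_affinity(log_likelihoods, master_sizes, weight):
    """Place the documents of ONE new-catalog category into master categories.

    log_likelihoods: one dict per document, master category -> log Pr(doc | category)
    master_sizes:    master category -> number of documents already filed there
    weight:          affinity weight; 0 gives plain Naive Bayes
    """
    def best(doc_scores, log_prior):
        return max(doc_scores, key=lambda c: doc_scores[c] + log_prior[c])

    # Pass 1: plain Naive Bayes, prior proportional to master category size.
    size_prior = {c: math.log(n) for c, n in master_sizes.items()}
    votes = Counter(best(doc, size_prior) for doc in log_likelihoods)

    # Pass 2: boost master categories that attracted many sibling documents.
    # The +1 smoothing is an assumption of this sketch (it avoids log(0)).
    biased_prior = {
        c: math.log(n) + weight * math.log(votes[c] + 1)
        for c, n in master_sizes.items()
    }
    return [best(doc, biased_prior) for doc in log_likelihoods]


def tune_weight(log_likelihoods, master_sizes, true_labels,
                candidates=(0, 1, 2, 5, 10)):
    """Pick the candidate weight with the most correct labels on a tuning set."""
    def correct(weight):
        predicted = classify_with_affinity(log_likelihoods, master_sizes, weight)
        return sum(p == t for p, t in zip(predicted, true_labels))
    return max(candidates, key=correct)


# Toy data (made up): three products from the same new-catalog category.
docs = [
    {"Laptops": -10.0, "Desktops": -14.0},
    {"Laptops": -11.0, "Desktops": -15.0},
    {"Laptops": -12.0, "Desktops": -11.5},  # ambiguous on text alone
]
sizes = {"Laptops": 100, "Desktops": 100}

print(classify_with_affinity(docs, sizes, 0))
print(classify_with_affinity(docs, sizes, 2))
print(tune_weight(docs, sizes, ["Laptops", "Laptops", "Laptops"]))
```

With `weight=0` the ambiguous third product follows its text score alone; with a larger weight it is pulled toward the category its siblings landed in. How much this helps depends on how well the two categorizations match, which is why the weight is chosen on a tuning set.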

5 Yahoo & Google
5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
–Typical match: 69%, 15%, 3%, 3%, 1%, …
Merging Yahoo into Google
–30% fewer errors (14.1% absolute difference in accuracy)
Merging Google into Yahoo
–26% fewer errors (14.3% absolute difference)
Open Problems: SVM, Decision Tree, ...

6 Data Extraction is hard
Synonyms for attribute names and units.
–"lb" and "pounds", but no "lbs" or "pound".
Attribute names are often missing.
–No "Speed", just "MHz Pentium III"
–No "Memory", just "MB SDRAM"
Example: 850 MHz Intel Pentium III 192 MB RAM 15 GB Hard Disk DVD Recorder: Included; Windows Me 14.1 inch display 8.0 pounds
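
A small Python sketch shows why this is brittle. It assumes a naive "number followed by a word" regular expression and a hand-written unit table; both `UNIT_SYNONYMS` and `extract_numbers` are hypothetical names for illustration. The table deliberately mirrors the slide's example: it knows "lb" and "pounds" but not "lbs".

```python
import re

# Hypothetical, deliberately incomplete synonym table (note: no "lbs").
UNIT_SYNONYMS = {
    "lb": "pound",
    "pounds": "pound",
    "mhz": "mhz",
    "mb": "mb",
    "gb": "gb",
    "inch": "inch",
}

# A number (optionally with a decimal part) followed by one alphabetic word.
NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def extract_numbers(text):
    """Return (value, canonical_unit) pairs; the unit is None when unknown."""
    pairs = []
    for value, unit in NUMBER_UNIT.findall(text):
        pairs.append((float(value), UNIT_SYNONYMS.get(unit.lower())))
    return pairs


datasheet = ("850 MHz Intel Pentium III 192 MB RAM 15 GB Hard Disk "
             "DVD Recorder: Included; Windows Me 14.1 inch display 8.0 pounds")

print(extract_numbers(datasheet))
print(extract_numbers("weighs 8 lbs"))  # the unit is not in the table
```

The numbers and units come out, but nothing in the text says that 850 is a speed or that 192 is memory, and a single unlisted spelling such as "lbs" loses the unit entirely.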

7 Searching with Numbers

8 Why does it work?
Conjecture: If we get a close match on numbers, it is likely that we have correctly matched attribute names.
Non-overlapping attributes:
–Memory: 64 - 512 MB, Disk: 10 - 40 GB
Correlations:
–Memory: 64 - 512 MB, Disk: 10 - 100 GB still fine.
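
To make the conjecture concrete, here is a rough Python sketch that ranks documents using only their numbers and ignores attribute names entirely. The relative-distance measure, the brute-force one-to-one assignment, and the toy documents are assumptions of this sketch, not details taken from the talk.

```python
from itertools import permutations


def relative_distance(q, d):
    """Scale-free distance between two numbers, in [0, 1] for positive inputs."""
    if q == d:
        return 0.0
    return abs(q - d) / max(abs(q), abs(d))


def match_score(query, doc_numbers):
    """Cost of the best one-to-one assignment of query numbers to document
    numbers (lower is better). Brute force, so only suitable for tiny inputs."""
    if len(doc_numbers) < len(query):
        return float("inf")
    return min(
        sum(relative_distance(q, d) for q, d in zip(query, chosen))
        for chosen in permutations(doc_numbers, len(query))
    )


def rank(query, docs):
    """Document names ordered from best to worst match."""
    return sorted(docs, key=lambda name: match_score(query, docs[name]))


# Toy documents (made up): speed, memory, disk, screen size, weight.
docs = {
    "A": [850, 192, 15, 14.1, 8.0],
    "B": [600, 64, 10, 12.1, 5.5],
    "C": [1000, 256, 20, 15.0, 7.0],
}

# The user wants 256 (memory) and 20 (disk) but types only the numbers.
print(rank([256, 20], docs))
```

Because memory and disk values fall in different ranges, the cheapest assignment pairs 256 with the memory figure and 20 with the disk figure without ever seeing the words "Memory" or "Disk".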

9 Empirical Results

10 Incorporating Hints
Use simple data extraction techniques to get hints.
Names/Units in query matched against hints.
Open Problem: Rethink data extraction in this context.
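
One way hints could enter a number-based score is as a soft penalty, sketched below in Python. Everything here is an assumption for illustration: the `(value, unit)` representation, the constant `MISMATCH_PENALTY`, and the rule that a missing unit or hint costs nothing.

```python
from itertools import permutations

MISMATCH_PENALTY = 1.0  # hypothetical constant; would need tuning in practice


def pair_cost(query_item, doc_item):
    """Numeric distance, plus a penalty when both sides name a unit and disagree."""
    (q, q_unit), (d, d_hint) = query_item, doc_item
    cost = 0.0 if q == d else abs(q - d) / max(abs(q), abs(d))
    if q_unit is not None and d_hint is not None and q_unit != d_hint:
        cost += MISMATCH_PENALTY
    return cost


def hinted_score(query, doc_items):
    """Best one-to-one assignment cost (lower is better); brute force."""
    if len(doc_items) < len(query):
        return float("inf")
    return min(
        sum(pair_cost(q, d) for q, d in zip(query, chosen))
        for chosen in permutations(doc_items, len(query))
    )


# Toy documents (made up): numbers with whatever unit hint extraction found.
doc_x = [(15, "inch"), (40, "gb")]
doc_y = [(20, "gb"), (128, "mb")]

print(hinted_score([(15, None)], doc_x))   # no unit in the query
print(hinted_score([(15, "gb")], doc_x))   # unit in the query
print(hinted_score([(15, "gb")], doc_y))
```

Without a unit in the query, the 15-inch screen in `doc_x` is a perfect numeric match. Once the query says "gb", the hint steers the match toward the disk sizes, and `doc_y` scores better. Hints only adjust the score, so imperfect extraction degrades results gradually instead of excluding documents.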

11 Security

12 Some Hard Problems
Past may be a poor predictor of future
–Abrupt changes
Reliability and quality of data
–Wrong training examples
Simultaneous mining over multiple data types
Richer patterns

13 Privacy Preserving Data Mining
Have your cake and mine it too!
–Preserve privacy at the individual level, but still build accurate models.
Challenges
–Privacy Breaches
–Clustering & Associations
–Privacy-sensitive Security Applications
Opportunities
–Web Demographics
–Inter-Enterprise Data Mining
–Privacy-sensitive Security Applications

