PAKDD Panel: What Next
Ramakrishnan Srikant

What Next
Electronic Commerce
–Catalog Integration (WWW 2001, with R. Agrawal)
–Searching with Numbers (WWW 2002, with R. Agrawal)
Security
Privacy

Catalog Integration
B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog, New Catalog]

Intuition
Use affinity information in new catalog.
–Products in same category are similar.
Bias Naïve Bayes classifier to incorporate this information.
–Accuracy boost depends on match between two categorizations.
–Use tuning set to determine weight given to affinity information.

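The intuition above can be sketched in a few lines. This is a minimal illustration, not necessarily the formulation used in the paper: it assumes per-document log-likelihoods from an ordinary Naïve Bayes model are already available, and it assumes the bias takes the form of a prior boosted by `(count + 1) ** weight`, where `count` is how many documents from the same new-catalog category were first predicted into a given master category. The names `plain_predict`, `affinity_predict`, and `weight` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def plain_predict(log_lik, log_prior):
    """Standard Naive Bayes decision: argmax of log prior + log likelihood."""
    return max(log_lik, key=lambda c: log_prior[c] + log_lik[c])

def affinity_predict(docs, log_prior, weight):
    """docs: list of (new_category, {master_category: log P(doc | category)}).
    Returns one predicted master category per document."""
    # Pass 1: classify every document without affinity information.
    first = [plain_predict(ll, log_prior) for _, ll in docs]
    # Count, per new-catalog category, where its documents landed.
    counts = defaultdict(Counter)
    for (new_cat, _), pred in zip(docs, first):
        counts[new_cat][pred] += 1
    # Pass 2: re-score with the prior boosted by (count + 1) ** weight.
    # This functional form is an assumption made for illustration.
    result = []
    for new_cat, ll in docs:
        def score(c):
            boost = weight * math.log(counts[new_cat][c] + 1)
            return log_prior[c] + boost + ll[c]
        result.append(max(ll, key=score))
    return result

# Three documents in new category "X" clearly belong to "A"; the fourth is borderline.
docs = [("X", {"A": -5.0, "B": -9.0})] * 3 + [("X", {"A": -10.0, "B": -9.8})]
log_prior = {"A": math.log(0.5), "B": math.log(0.5)}
print(affinity_predict(docs, log_prior, weight=0.0))
print(affinity_predict(docs, log_prior, weight=1.0))
```

With `weight=0.0` the sketch reduces to the plain classifier. In line with the slide, `weight` would be chosen on a tuning set, since the benefit depends on how well the two categorizations match.
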
Yahoo & Google
5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
–Typical match: 69%, 15%, 3%, 3%, 1%, …
Merging Yahoo into Google
–30% fewer errors (14.1% absolute difference in accuracy)
Merging Google into Yahoo
–26% fewer errors (14.3% absolute difference)
Open Problems: SVM, Decision Tree, ...

Data Extraction is hard
Synonyms for attribute names and units.
–"lb" and "pounds", but no "lbs" or "pound".
Attribute names are often missing.
–No "Speed", just "MHz Pentium III"
–No "Memory", just "MB SDRAM"
Sample datasheet text:
850 MHz Intel Pentium III
192 MB RAM
15 GB Hard Disk
DVD Recorder: Included; Windows Me
14.1 inch display
8.0 pounds

Searching with Numbers

Why does it work?
Conjecture: If we get a close match on numbers, it is likely that we have correctly matched attribute names.
Non-overlapping attributes:
–Memory: Mb, Disk: Gb
Correlations:
–Memory: Mb, Disk: Gb still fine.

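The conjecture can be illustrated with a small ranking sketch that ignores attribute names entirely and compares only numbers. It is a simplification made for illustration, not the algorithm from the paper: each query number is matched to its closest document number by relative distance, and documents are ranked by the summed distance. The function names `numbers_in`, `distance`, and `rank` are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text):
    """Extract every integer or decimal literal from the text as a float."""
    return [float(tok) for tok in NUMBER.findall(text)]

def distance(query_numbers, doc_numbers):
    """Sum, over query numbers, of the relative distance to the closest document number."""
    if not doc_numbers:
        return float("inf")
    total = 0.0
    for q in query_numbers:
        total += min(abs(q - d) / max(abs(q), abs(d), 1e-9) for d in doc_numbers)
    return total

def rank(query, docs):
    """Return docs ordered from best to worst numeric match."""
    q = numbers_in(query)
    return sorted(docs, key=lambda doc: distance(q, numbers_in(doc)))

docs = [
    "500 MHz 64 MB RAM 6 GB Hard Disk",
    "850 MHz Intel Pentium III 192 MB RAM 15 GB Hard Disk",
]
print(rank("800 200", docs)[0])
```

Because clock speeds, memory sizes, and disk sizes occupy mostly non-overlapping numeric ranges in this example, the query "800 200" lands on the speed and memory values without ever naming them.
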
Empirical Results

Incorporating Hints
Use simple data extraction techniques to get hints.
Names/units in query matched against hints.
Open Problem: Rethink data extraction in this context.

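One way hints might enter the numeric match is sketched below. It assumes the simplest possible extractor (the word immediately following a number is taken as its unit hint) and an additive penalty when a query unit and a document hint disagree. Both choices, along with the names `extract`, `hinted_distance`, and `mismatch_penalty`, are assumptions for illustration rather than the method from the talk.

```python
import re

PAIR = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)?")

def extract(text):
    """Return (value, hint) pairs; the hint is the lower-cased word after the number, or None."""
    return [(float(v), u.lower() if u else None) for v, u in PAIR.findall(text)]

def hinted_distance(query, doc, mismatch_penalty=1.0):
    """Relative numeric distance, plus a penalty when both sides carry hints that differ."""
    doc_pairs = extract(doc)
    if not doc_pairs:
        return float("inf")
    total = 0.0
    for q_val, q_hint in extract(query):
        best = float("inf")
        for d_val, d_hint in doc_pairs:
            cost = abs(q_val - d_val) / max(abs(q_val), abs(d_val), 1e-9)
            if q_hint and d_hint and q_hint != d_hint:
                cost += mismatch_penalty  # hints disagree: discourage this pairing
            best = min(best, cost)
        total += best
    return total

doc_a = "200 MHz 800 MB RAM"
doc_b = "800 MHz 192 MB RAM"
# Without units the numbers alone favor doc_a; with units doc_b is the better match.
print(hinted_distance("800 200", doc_a), hinted_distance("800 200", doc_b))
print(hinted_distance("800 mhz 200 mb", doc_a), hinted_distance("800 mhz 200 mb", doc_b))
```

The penalty is applied only when both the query and the document supply a hint, so documents with missing attribute names or units are still matched on numbers alone.
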
Security

Some Hard Problems
Past may be a poor predictor of future
–Abrupt changes
Reliability and quality of data
–Wrong training examples
Simultaneous mining over multiple data types
Richer patterns

Privacy Preserving Data Mining
Have your cake and mine it too!
–Preserve privacy at the individual level, but still build accurate models.
Challenges
–Privacy Breaches
–Clustering & Associations
–Privacy-sensitive Security Applications
Opportunities
–Web Demographics
–Inter-Enterprise Data Mining
–Privacy-sensitive Security Applications