Research Trends in Multimedia Content Services
András A. Benczúr
Data Mining and Web Search Group, Computer and Automation Research Institute, Hungarian Academy of Sciences
A Benczur – Research Trends in Multimedia Content Services – FuturICT 28 April 2008
Web 2.0, 3.0 …?
- Platform convergence (Web, PC, mobile, television): information vs. recreation
- Emphasis on social content (blogs, Wikipedia, photo and video sharing)
- From search towards recommendation (query free, profile based, personalized)
- From text towards multimedia
- Glocalization (language, geography)
- Spam
A sample service
- Small screen browsing
- Recommendation based on user profile (avoid query typing)
- Read blogs, view media, …
Figure labels: RSS; Web 2.0; client software; recommender engine
The user profile
History stored for each user:
- Known ratings, preferences, opinion: scarce!
- Items read, weighted by time spent, details seen, scrolling, back button
- Terms in documents read: tf.idf weighted top list
- User language, region, current location and known sociodemographic data
- Multimedia!
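One way to picture the "tf.idf weighted top list" of profile terms is the minimal sketch below. It assumes documents are already tokenized into word lists and uses the plain tf times log(N/df) weighting; the function name `top_profile_terms` and the toy corpus are hypothetical, and the slides do not say which tf.idf variant the service uses.

```python
import math
from collections import Counter
from typing import List


def top_profile_terms(read_docs: List[List[str]],
                      corpus: List[List[str]],
                      k: int = 3) -> List[str]:
    """Return the k terms with the highest tf.idf weight among the documents a user read.

    Assumption: tf is the raw count over all read documents and
    idf = log(N / df) over the whole corpus (one common variant).
    """
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term occurs.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # Term frequency over everything this user has read.
    tf = Counter()
    for doc in read_docs:
        tf.update(doc)
    # Terms missing from the corpus have no idf and are skipped.
    scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if df[t] > 0}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (hypothetical): "web" occurs in every document, so its idf is zero.
corpus = [["spam", "web"], ["web", "search"], ["web", "video"]]
read_docs = [["web", "video"], ["web", "search"]]
profile = top_profile_terms(read_docs, corpus, k=2)
```

Terms that appear everywhere drop out of the profile, which is the point of the idf factor: the top list keeps what distinguishes this user's reading from the corpus at large.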
Same item, multiple sources
Information vs recreation: Do not mix the two?
Spam is increasingly annoying
Distribution of categories
Category        Share
Reputable       70.0%
Spam            16.5%
Weborg           0.8%
Ad               3.7%
Non-existent     7.9%
Empty            0.4%
Alias            0.3%
Unknown          0.4%
Effect of search result position
- Time spent viewing each result position
- Time of arrival at the result
Multimedia Information Retrieval
Figure labels: Similar objects; Segmentation
ImageCLEF Object Retrieval Task
Figure labels: Class of Query Image; Pre-classified Images; VOC2007 Original Training Set; Query Images
Networked relation: spam, social network analysis, churn
Social networks
Figure labels: home; business; ADSL
Insurance fraud in networks
Stacked Graphical Learning
1. Predict churn p(v) of node v
2. For target node u, aggregate p(v) over the neighbors v of u to form a new feature f(u)
3. Rerun classification with the added feature f(.)
4. Iterate?
Figure labels: u; v1; v2; v7
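The four steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as implemented for the talk: the aggregate is taken to be the plain average of neighbor predictions, the base classifier is passed in as a hypothetical callable `fit_predict`, and the toy "classifier" simply averages the feature vector.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]          # node -> predicted churn probability p(v)
Features = Dict[str, List[float]]  # node -> feature vector
Graph = Dict[str, List[str]]       # node -> list of neighbor nodes


def neighbor_average(graph: Graph, p: Scores) -> Scores:
    """Step 2: f(u) = mean of p(v) over the neighbors v of u (0.0 if u has none)."""
    f = {}
    for u, neighbors in graph.items():
        scores = [p[v] for v in neighbors if v in p]
        f[u] = sum(scores) / len(scores) if scores else 0.0
    return f


def stacked_graphical_learning(features: Features,
                               graph: Graph,
                               fit_predict: Callable[[Features], Scores],
                               rounds: int = 1) -> Scores:
    """Run the base classifier, then re-run it with the stacked neighbor feature."""
    p = fit_predict(features)                      # step 1: initial predictions
    for _ in range(rounds):                        # step 4: optionally iterate
        f = neighbor_average(graph, p)             # step 2: aggregate neighbors
        stacked = {u: x + [f.get(u, 0.0)] for u, x in features.items()}
        p = fit_predict(stacked)                   # step 3: classify with f(.) added
    return p


def toy_fit_predict(feats: Features) -> Scores:
    """Hypothetical stand-in for a real classifier: mean of the feature vector."""
    return {u: sum(x) / len(x) for u, x in feats.items()}


graph = {"u": ["v1", "v2"], "v1": ["u"], "v2": ["u"]}
features = {"u": [0.2], "v1": [0.4], "v2": [0.8]}
result = stacked_graphical_learning(features, graph, toy_fit_predict, rounds=1)
```

In the toy run, node u starts with a low score but has two higher-scoring neighbors, so its stacked prediction moves upward. Other aggregates (weighted sums, maxima) would fit the same loop.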
Why social networks are hard to analyze
- Subgraphs of social networks
- Medium-size dense communities attract much algorithmic work
- Tentacles induce noise
Mapping into the 2D plane: spectral, semidefinite
Research Highlights
- Recommenders: KDD Cup 2007 Task 1 First Prize. Predict the probability that a user rated a movie in 2006, based on training data up to 2005
- Spam filtering: Web Spam Challenge 1 first place
- Churn prediction: method presented at KDD Cup 2009 Workshop Task XXXX
Netflix: lessons and differences learned
- Ratings: 1–5 stars
- Predict an unseen rating
- Evaluation: RMSE
- Prize: $1,000,000
- Current leader: Oct/07:
- KDD Cup 2007: same data set, predict existence of a rating
Results of two separate tasks
- BellKor team report [Bell, Koren 2007]:
  - Low rank approximation
  - Restricted Boltzmann Machine
  - Nearest neighbor
- KDD Cup 2007: predict the probability that a user rated a movie in 2006
  - Given list of 100,000 user–movie pairs
  - Users and movies drawn from the Netflix Prize data set
  - Winner report [K, B, and our colleagues 2007]
Evaluation and Issue 1
- RMSE is computed over the given (user i, movie j) pairs by comparing the predicted value with the actual one
- KDD Cup example, RMSE compared for: our entry, the first runner-up, and the all-zeroes prediction (place 10-13)
- But why do we use RMSE and not precision/recall?
  - RMSE prefers correct probability guesses for the majority of infrequently visited items
  - The presence of the recommender changes usage
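To make the RMSE comparison concrete, here is a small sketch using the standard root-mean-square-error definition. The data are invented for illustration (they are not the KDD Cup figures): one pair out of four is actually rated, and an all-zeroes prediction is compared with a constant base-rate guess.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between actual outcomes and predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical toy sample: 1 means the user rated the movie, 0 means not.
actual = [0, 0, 0, 1]
all_zeroes = [0.0, 0.0, 0.0, 0.0]
base_rate = [0.25, 0.25, 0.25, 0.25]  # constant guess equal to the observed frequency

rmse_zero = rmse(actual, all_zeroes)  # 0.5
rmse_base = rmse(actual, base_rate)   # about 0.433
```

Because most user-movie pairs are not rated, even the all-zeroes guess scores respectably under RMSE, and a well-calibrated probability on the many rare pairs is rewarded more than ranking the few hits correctly.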
Method Overview
- Probability by naive user-movie independence
  - Item frequency estimation (time series)
  - User frequency estimation
  - On its own reaches an RMSE that would still take first place
- Data mining
  - SVD
  - Item-item similarities
  - Association rules
- Combination (we used linear regression)
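The naive independence estimate can be illustrated as below. This is a sketch under an assumption the slide does not spell out: if a user is expected to give n ratings and a movie is expected to receive m of the T ratings in the target period, then under independence the chance of that particular pair is roughly n·m/T. The function name and the numbers are hypothetical.

```python
def independence_probability(user_ratings: float,
                             movie_ratings: float,
                             total_ratings: float) -> float:
    """Estimate P(user rates movie) assuming the user and the movie act independently.

    user_ratings  -- expected number of ratings by this user in the target period
    movie_ratings -- expected number of ratings received by this movie
    total_ratings -- expected total number of ratings in the target period
    """
    if total_ratings <= 0:
        return 0.0
    # Each of the user's ratings lands on this movie with chance movie_ratings / total.
    estimate = user_ratings * movie_ratings / total_ratings
    return min(1.0, max(0.0, estimate))  # clip so the result is a valid probability


# Hypothetical frequency estimates for one user-movie pair.
p = independence_probability(user_ratings=10, movie_ratings=500, total_ratings=100_000)
```

The two frequency inputs are where the slide's item and user frequency estimation would plug in; the outputs of this and the data mining predictors could then be blended, for which the slide names linear regression.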
Time series prediction
- Interest remains over a long time range (several years)
Short lifetime of online items
- Origo: very different behavior in time for news articles
- Publication day
- Next day: usage peak
- Third day and gone …
K-dim SVD: noise filtering, "the essence of the matrix"
- Optimizes RMSE (ℓ2 error)
- SVD explains ratings as the effect of a few linear factors
- dim: 0.93
- Issue: too many news items. 18K Netflix movies vs. a potentially infinite set of items -> may recommend the data source but not the item
Figure labels: SVD; user; movie; news item
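The idea of explaining a ratings matrix by a few linear factors can be sketched in plain Python. The code below builds a rank-k approximation one factor at a time with power iteration and deflation; it assumes a small, fully observed matrix and is only an illustration, since real rating data are sparse and a library SVD routine would normally be used. All names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def top_factor(m: Matrix, iters: int = 200, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return (u, w) with unit vector u and w = m^T u, approximating the leading factor."""
    rows, cols = len(m), len(m[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]  # random start, seeded for repeatability
    u, w = [0.0] * rows, [0.0] * cols
    for _ in range(iters):
        u = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:                 # nothing left to explain
            return [0.0] * rows, [0.0] * cols
        u = [x / norm_u for x in u]
        w = [sum(m[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return u, w
        v = [x / norm_w for x in w]
    return u, w


def low_rank(m: Matrix, k: int) -> Matrix:
    """Sum of k factors u w^T, each extracted from the residual left by the previous ones."""
    rows, cols = len(m), len(m[0])
    residual = [row[:] for row in m]
    approx = [[0.0] * cols for _ in range(rows)]
    for _ in range(k):
        u, w = top_factor(residual)
        for i in range(rows):
            for j in range(cols):
                approx[i][j] += u[i] * w[j]
                residual[i][j] -= u[i] * w[j]
    return approx


def matrix_rmse(a: Matrix, b: Matrix) -> float:
    """RMSE over all cells of two equally shaped matrices."""
    cells = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(cells) / len(cells))


# Hypothetical 3 users x 3 movies ratings matrix.
ratings = [[5.0, 5.0, 1.0], [4.0, 5.0, 1.0], [1.0, 1.0, 5.0]]
err_1 = matrix_rmse(ratings, low_rank(ratings, 1))
err_2 = matrix_rmse(ratings, low_rank(ratings, 2))
```

Each extra factor can only lower the reconstruction error, which is the sense in which truncated SVD filters noise while keeping "the essence of the matrix".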
Lessons learned
- Content similarity might be the key feature
- Relative success of trivial estimates on KDD Cup!
- Data mining techniques overlap, apparently catch similar patterns
- Precision/recall is more important than RMSE
- Solution must make heavy use of time
Future plans and ideas
- New partners and application fields: network infrastructure, new generation services, bioinformatics, …?
- Scaling our solutions to multi-core architectures
- Use our search (cross-lingual, multimedia, etc.) and recommender system capabilities in major solutions: mobile, new generation platforms, etc.
- Expand the means of our European-level collaboration, e.g. KIC participation
Questions?
András A. Benczúr