Series-O-Rama Search & Recommend TV series with SQL Guillaume Cabanac February 15th, 2011

Toulouse: A Picture is Worth a Thousand Words
Toulouse population: students:
Aberdeen population: ?? students: ???
Rides from Toulouse: Capbreton 3h, Collioure 2h30, Ax-les-Thermes 1h40

Telly Addicts Need Help to Find TV Series
- Main topics of Grey’s Anatomy? Text mining, visualization
- Series about ‘plane crash island’? Search engine
- What should I watch next? Recommender system
Image sources: en.wikipedia.org, amazon.com

Text Mining: Let’s Crunch Subtitles
Examples: Cold Case, Grey’s Anatomy

What’s in a Subtitle File?
File name: Title – Season – Episode – Language.srt
1 episode = 1 plain text file
Each entry: synchronization (start --> stop), then dialogue
We can easily extract words:
[a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
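
The word extraction described above can be sketched as follows. This is a minimal sketch, assuming the usual SubRip layout (a counter line, a `start --> stop` line, dialogue lines, a blank separator). The sample excerpt, its timings and the name `extract_words` are hypothetical and not taken from the slides.

```python
import re
from collections import Counter

# Hypothetical two-entry excerpt in SubRip (.srt) layout.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Tonight's the night.

2
00:00:04,000 --> 00:00:06,000
And it's going to happen again.
"""

def extract_words(srt_text):
    """Count lowercase words found in dialogue lines only."""
    words = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank separators, entry counters and synchronization lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        words.update(re.findall(r"[a-z]+", line.lower()))
    return words

counts = extract_words(SAMPLE_SRT)
print(sorted(counts.items()))
```

Splitting on anything that is not a letter turns contractions into separate tokens (`it's` gives `it` and `s`), which is consistent with the `s*2` and `m` entries in the word list above.
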

DB technology at Work! [Home]
files = 337 MB
100% Java and Oracle

DB technology at Work! [Search engine]
Ranked list of results

DB technology at Work! [Infos]
Most popular terms
Most related series

DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?
Ranked list of recommendations

How Does this Work?

Architecture and Data Model
Offline: subtitles, indexing
Online: GUI for searching, browsing, recommending
Both sides go through the DB
Series = {idS, name}
12 | Lost
45 | Dexter
45 | ????
Dict = {idT, term}
8 | plane
27 | killer
29 | crash
Posting = {idT*, idS*, nb}
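
The three relations above can be tried out in a throwaway database. This is a minimal sketch that swaps in SQLite (through Python's `sqlite3`) for the Oracle database used in the talk. Reading the asterisks in Posting as foreign keys that together form the primary key is an assumption, and the `nb` counts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL REFERENCES Series(idS),
    nb  INTEGER NOT NULL,          -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)
);
""")
conn.executemany("INSERT INTO Series VALUES (?, ?)",
                 [(12, "Lost"), (45, "Dexter")])
conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
# Hypothetical counts, for illustration only.
conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                 [(8, 12, 48), (29, 12, 15), (27, 45, 30)])

# Posting links terms to series, so two joins recover readable rows.
rows = conn.execute("""
    SELECT s.name, d.term, p.nb
    FROM Posting p
    JOIN Series s ON s.idS = p.idS
    JOIN Dict d   ON d.idT = p.idT
    ORDER BY s.name, p.nb DESC
""").fetchall()
print(rows)
```
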

Theory: Text Indexing Pipeline
1. Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
2. Stopwords removal: [plane, crashed, ..., planes, ...]
3. Stemming, Porter’s Stemmer (1980): [plane, crash, ..., plane, ...]
4. Counting: {(plane, 48), (crash, 15), ...}
Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys, girls and adults in primary, secondary, mechanical and other subjects …
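
The four pipeline steps can be sketched in a few lines. This is an illustration under stated assumptions: the stopword list is a tiny made-up one, and `naive_stem` is a crude suffix stripper standing in for Porter's stemmer, which it does not implement.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the talk).
STOPWORDS = {"a", "and", "are", "is", "on", "the"}

def naive_stem(word):
    """Strip a trailing 'ed' or 's'. NOT Porter's algorithm, only a stand-in."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def index_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]    # stopwords removal
    stems = [naive_stem(t) for t in kept]               # stemming
    return Counter(stems)                               # counting

index = index_text("The plane crashed. The planes are on the island.")
print(index)
```

A real index would replace `naive_stem` with a proper Porter implementation, since the stand-in mishandles many words (for example `crashes` becomes `crashe`).
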

Theory: Vector Space Model, Term Weighting
Vocabulary term: survive?
Raw TF: dexter > lost
Normalization by the maximum, TF / max(TF): dexter < lost
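
A small sketch of why dividing by the maximum count can reverse a comparison. The two series and all counts are hypothetical, and `normalize_tf` is an illustrative name.

```python
def normalize_tf(counts):
    """Divide every raw count by the largest count of the series."""
    max_nb = max(counts.values())
    return {term: nb / max_nb for term, nb in counts.items()}

# Hypothetical raw counts: 'survive' occurs more often in series_a,
# but series_a also has far more text overall.
series_a = {"survive": 20, "people": 400}
series_b = {"survive": 15, "people": 100}

tf_a = normalize_tf(series_a)
tf_b = normalize_tf(series_b)
print(tf_a["survive"], tf_b["survive"])
```

With raw counts series_a ranks first for `survive`; after normalization series_b does, because the term takes up a larger share of its vocabulary.
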

Theory: Best Match Retrieval
1 TV series = 1 vector (n dimensions)
Now, we know how to:
- Find most popular terms for a TV series
- Compute similarity between TV series
- Find TV series matching a query

Theory: More on Term Weighting
1 TV series = 1 vector (n dimensions)
All terms are supposed to be equally representative …
… but ‘survive’ is way more unusual than ‘people’
‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
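
The slides do not state which IDF variant is used. The sketch below assumes the common textbook form log(N / df), where N is the number of series and df the number of series containing the term. The vocabularies are hypothetical.

```python
import math

def idf(term, series_terms):
    """log(N / df): N series in total, df of them contain the term."""
    n = len(series_terms)
    df = sum(1 for terms in series_terms.values() if term in terms)
    return math.log(n / df)  # raises ZeroDivisionError for an unknown term

# Hypothetical vocabularies, for illustration only.
series_terms = {
    "Lost": {"people", "survive", "plane"},
    "Dexter": {"people", "killer"},
    "House": {"people", "patient"},
}
idf_people = idf("people", series_terms)
idf_survive = idf("survive", series_terms)
print(idf_people, idf_survive)
```

A term present in every series gets a weight of zero, while a term found in a single series gets the highest weight, which matches the ‘survive’ versus ‘people’ intuition.
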

Theory: The Big Picture: TF*IDF
1 TV series = 1 vector
An important term for series S is frequent in S (TF) and globally unusual (IDF).
Some limitations:
- Term positions? e.g., “ice truck killer” in Dexter
- Stemming? e.g., ananas, christmas
- Mixture of languages? e.g., amusant (FR) vs. fun (EN)

Theory … and Practice
Series = {idS, name, maxNb}
12 | Lost | 540
45 | Dexter | 125
Dict = {idT, term, idf}
8 | plane |
 | killer |
 | crash | 3.07
Posting = {idT*, idS*, nb, tf}
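
One way to fill the three derived columns (maxNb, tf, idf) with plain SQL is sketched below. SQLite stands in for Oracle, the `ln` function is registered from Python because not every SQLite build ships one, and idf = ln(N / df) is an assumption. The posting counts are hypothetical, chosen so that maxNb comes out as 540 and 125 as on the slide.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("ln", 1, math.log)  # natural log, made available to SQL
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL,
                      PRIMARY KEY (idT, idS));
INSERT INTO Series (idS, name) VALUES (12, 'Lost'), (45, 'Dexter');
INSERT INTO Dict (idT, term) VALUES (8, 'plane'), (27, 'killer'), (29, 'crash');
-- Hypothetical counts.
INSERT INTO Posting (idT, idS, nb) VALUES
    (8, 12, 540), (29, 12, 270), (27, 45, 125), (8, 45, 25);

-- maxNb: highest term count within each series.
UPDATE Series SET maxNb =
    (SELECT MAX(nb) FROM Posting p WHERE p.idS = Series.idS);
-- tf: count normalized by the maximum of its series.
UPDATE Posting SET tf =
    1.0 * nb / (SELECT maxNb FROM Series s WHERE s.idS = Posting.idS);
-- idf: ln(number of series / number of series containing the term).
UPDATE Dict SET idf = ln(
    1.0 * (SELECT COUNT(*) FROM Series)
        / (SELECT COUNT(*) FROM Posting p WHERE p.idT = Dict.idT));
""")

series = conn.execute("SELECT name, maxNb FROM Series ORDER BY idS").fetchall()
idfs = dict(conn.execute("SELECT term, idf FROM Dict"))
tfs = {(idT, idS): tf
       for idT, idS, tf in conn.execute("SELECT idT, idS, tf FROM Posting")}
print(series, idfs, tfs)
```

Storing tf and idf once, at indexing time, keeps the online queries down to joins and multiplications.
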

Description of a TV Series
Example: Lost, described through a join (⋈)
Many surnames need to be filtered out
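
The query shown on this slide was not preserved. The sketch below is one plausible shape for it, assuming terms are ranked by tf * idf. The terms, weights and the `excluded` parameter (a stand-in for name filtering) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
INSERT INTO Series VALUES (12, 'Lost');
-- Hypothetical terms and weights; 'jack' stands in for a character name.
INSERT INTO Dict VALUES (1, 'island', 2.0), (2, 'people', 0.1), (3, 'jack', 1.5);
INSERT INTO Posting VALUES (1, 12, 0.5), (2, 12, 1.0), (3, 12, 0.9);
""")

def top_terms(conn, series_name, k, excluded=()):
    """Return the k terms with the highest tf * idf for one series."""
    rows = conn.execute("""
        SELECT d.term, p.tf * d.idf AS weight
        FROM Series s
        JOIN Posting p ON p.idS = s.idS
        JOIN Dict d    ON d.idT = p.idT
        WHERE s.name = ?
        ORDER BY weight DESC
    """, (series_name,)).fetchall()
    return [term for term, _ in rows if term not in excluded][:k]

unfiltered = top_terms(conn, "Lost", 2)
filtered = top_terms(conn, "Lost", 2, excluded={"jack"})
print(unfiltered, filtered)
```

Character names are frequent in one series and rare elsewhere, so they score high under tf * idf unless they are removed.
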

Retrieval of TV Series: queries with 1 term
Query term: survive, matched through a join (⋈)
Importance of normalization:
Stargate Atlantis: nb/maxNb = 63/1116 ≈ 0.056
Blade: nb/maxNb = 9/163 ≈ 0.055
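
A single-term query can be sketched as a join that ranks series by nb/maxNb. The nb and maxNb values are the ones on the slide, while the idS and idT values are made up and SQLite stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER);
-- idS and idT are hypothetical; nb and maxNb are taken from the slide.
INSERT INTO Series VALUES (1, 'Stargate Atlantis', 1116), (2, 'Blade', 163);
INSERT INTO Dict VALUES (1, 'survive');
INSERT INTO Posting VALUES (1, 1, 63), (1, 2, 9);
""")

ranking = conn.execute("""
    SELECT s.name, p.nb, 1.0 * p.nb / s.maxNb AS tf
    FROM Dict d
    JOIN Posting p ON p.idT = d.idT
    JOIN Series s  ON s.idS = p.idS
    WHERE d.term = ?
    ORDER BY tf DESC
""", ("survive",)).fetchall()

for name, nb, tf in ranking:
    print(f"{name}: nb={nb}, tf={tf:.4f}")
```

The raw counts differ by a factor of seven (63 versus 9), yet the normalized scores are almost equal, which is the point of the slide.
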

Retrieval of TV Series: queries with n terms
Query: survive mulder, matched through a join (⋈)
idS | series | term | tf | idf | tf * idf
67 | The Vampire Diaries | survive | 0.028 | 0.107 | ≈ 0.003
67 | The Vampire Diaries | mulder | 0.007 | 3.977 | ≈ 0.028
 | X-Files | survive | 0.014 | 0.107 | ≈ 0.0015
 | X-Files | mulder | 1.000 | 3.977 | 3.977
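
For several query terms, a natural scoring is the sum of tf * idf over the matched terms. The slide only shows the per-term products, so the summation is an assumption. The tf and idf values below are the slide's; the idT values and the X-Files idS are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
-- idT values and idS 99 are hypothetical; tf and idf come from the slide.
INSERT INTO Series VALUES (67, 'The Vampire Diaries'), (99, 'X-Files');
INSERT INTO Dict VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 99, 0.014), (2, 99, 1.000);
""")

def search(conn, terms):
    """Rank series by the sum of tf * idf over the query terms."""
    marks = ", ".join("?" for _ in terms)  # one placeholder per term
    return conn.execute(f"""
        SELECT s.name, SUM(p.tf * d.idf) AS score
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series s  ON s.idS = p.idS
        WHERE d.term IN ({marks})
        GROUP BY s.idS, s.name
        ORDER BY score DESC
    """, list(terms)).fetchall()

results = search(conn, ["survive", "mulder"])
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The rare term `mulder` dominates the score, so X-Files ranks far above The Vampire Diaries even though the latter uses `survive` more often.
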

Similar to House? Computing Similarities Among TV Series 1/2
First, let’s compute the numerator (through a join, ⋈), where:
A_i = terms from House
B_i = terms from another TV series
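
The formula itself did not survive on this slide. The mention of a numerator over A_i and B_i suggests the cosine measure, which is assumed here, with A_i and B_i read as the weights of term i in the two series:

```latex
\[
\operatorname{sim}(A, B)
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

Only terms shared by both series contribute to the numerator, since any other product A_i B_i is zero.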

Similar to House? Computing Similarities Among TV Series 2/2
[Query diagram: three joins (⋈)]
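
A sketch of the whole similarity computation, assuming the cosine measure and a single weight column `w` (for example tf * idf) per posting. The numerator comes from a self-join of Posting on shared terms, the norms from a grouped sum. Series names other than House, all ids and all weights are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, w REAL);  -- w: term weight
-- Hypothetical ids and weights.
INSERT INTO Series VALUES (1, 'House'), (2, 'Series A'), (3, 'Series B');
INSERT INTO Posting VALUES
    (1, 1, 1.0), (2, 1, 2.0),
    (1, 2, 2.0), (2, 2, 4.0),
    (2, 3, 1.0), (3, 3, 3.0);
""")

def similar_to(conn, name):
    # Numerator: products of weights over the terms shared with each other series.
    dots = conn.execute("""
        SELECT b.idS, SUM(a.w * b.w)
        FROM Series s
        JOIN Posting a ON a.idS = s.idS
        JOIN Posting b ON b.idT = a.idT AND b.idS <> a.idS
        WHERE s.name = ?
        GROUP BY b.idS
    """, (name,)).fetchall()
    # Denominator: Euclidean norm of every series vector.
    norms = {ids: math.sqrt(sq) for ids, sq in conn.execute(
        "SELECT idS, SUM(w * w) FROM Posting GROUP BY idS")}
    names = dict(conn.execute("SELECT idS, name FROM Series"))
    target = next(ids for ids, n in names.items() if n == name)
    scores = [(names[ids], dot / (norms[target] * norms[ids]))
              for ids, dot in dots]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

similar = similar_to(conn, "House")
print(similar)
```

Series A has the same term proportions as House and scores about 1, while Series B shares a single term and scores much lower. A series with no term in common does not appear in the result at all.
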
Thank you