Published by Lionel McDowell. Modified over 9 years ago.
1
Series-O-Rama Search & Recommend TV series with SQL http://bit.ly/dMh7kb Guillaume Cabanac cabanac@irit.fr February 15th, 2011
2
Toulouse: A Picture is Worth a Thousand Words
Toulouse: population 437 000, students 97 000
Aberdeen: population 210 400, students ?? ???
Map markers: Capbreton (3h ride), Collioure (2h30 ride), Ax-les-Thermes (1h40 ride)
3
Telly Addicts Need Help to Find TV Series
Main topics of Grey’s Anatomy? Text mining, visualization
Series about ‘plane crash island’? Search engine
What should I watch next? Recommender system
Image sources: en.wikipedia.org, amazon.com
4
Text Mining: Let’s Crunch Subtitles
Main topics of Grey’s Anatomy? Text mining, visualization
Series about ‘plane crash island’? Search engine
What should I watch next? Recommender system
Series pictured: Cold Case, Grey’s Anatomy
5
What’s in a Subtitle File?
File name: Title – Season – Episode – Language.srt
1 episode = 1 plain text file
Each entry holds a synchronization line (start --> stop) followed by the dialogue.
We can easily extract words:
[a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
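The word extraction described above can be sketched in a few lines of Python. This is only an illustration, not the talk's Java code: the sample subtitle text, the `TIMING` pattern and the `extract_words` helper are all assumptions made up for this example. Splitting on non-letters reproduces fragments such as `m` and `s` seen in the slide's word list.

```python
import re
from collections import Counter

# Hypothetical subtitle excerpt in SRT layout (not taken from the slides).
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I'm hungry. Is it going to happen tonight?

2
00:00:04,000 --> 00:00:06,000
I love Cuban pork sandwiches.
"""

# Matches the "start --> stop" synchronization line of an SRT entry.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return a Counter of lowercase words found in the dialogue lines."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, entry numbers, and timing lines.
        # (A dialogue line made only of digits would also be skipped.)
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

Running one such pass per file and summing the counters per series would give the raw counts that the later slides store per term and series.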
6
DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle
7
DB technology at Work! [Search engine]
Ranked list of results
8
DB technology at Work! [Infos]
Most popular terms
Most related series
9
DB technology at Work! [Recommendations]
10
DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?
11
DB technology at Work! [Recommendations]
Ranked list of recommendations
12
How Does this Work?
13
Architecture and Data Model
Offline: subtitles, indexing, DB
Online: GUI (searching, browsing, recommending), DB
Series = {idS, name}: (12, Lost), (45, Dexter), (45, ????)
Dict = {idT, term}: (8, plane), (27, killer), (29, crash)
Posting = {idT*, idS*, nb}: (27, 45, 89), (8, 45, 3), (8, 12, 90)
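A minimal sketch of this data model follows, using Python's built-in sqlite3 instead of the Oracle database used in the talk. The column types, the constraints, and the reading of `nb` as "number of occurrences of the term in the series" are assumptions; the sample rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL REFERENCES Series(idS),
    nb  INTEGER NOT NULL,   -- assumed: occurrences of the term in the series
    PRIMARY KEY (idT, idS)
);
""")
conn.executemany("INSERT INTO Series VALUES (?, ?)",
                 [(12, "Lost"), (45, "Dexter")])
conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                 [(27, 45, 89), (8, 45, 3), (8, 12, 90)])

# Join the three relations to read the postings with names instead of ids.
rows = conn.execute("""
    SELECT s.name, d.term, p.nb
    FROM Posting p
    JOIN Dict d   ON d.idT = p.idT
    JOIN Series s ON s.idS = p.idS
    ORDER BY p.nb DESC
""").fetchall()
print(rows)
```

Keeping terms in their own relation means each posting stores two small integers rather than a repeated string, which matters when every series contributes thousands of distinct terms.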
14
Theory: Text Indexing Pipeline
Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming, with Porter’s Stemmer (1980), demo at http://qaa.ath.cx/porter_js_demo.html: [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15), ...}
Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys, girls and adults in primary, secondary, mechanical and other subjects …
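The four steps can be sketched as below. Note the loud assumption: `crude_stem` is a toy suffix stripper that stands in for Porter's stemmer and is not the real algorithm; the stopword list and the sample sentence are also made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def crude_stem(word):
    """Toy suffix stripper standing in for Porter's stemmer (NOT the real one)."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def index_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stems = [crude_stem(t) for t in kept]             # stemming (toy version)
    return Counter(stems)                             # counting

counts = index_text("The plane crashed. The planes crash; the plane is lost.")
print(counts.most_common())
```

Merging "plane" and "planes", or "crashed" and "crash", under one stem is what lets a query for one form match subtitles containing the other.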
15
Theory: Vector Space Model, Term Weighting
[Figure: term frequencies over the vocabulary for Dexter and Lost, with the term ‘survive’ and each series’ max marked]
Raw TF: dexter > lost
Normalization, TF / max(TF): dexter < lost
16
Theory: Best Match Retrieval
1 TV series = 1 vector [figure: a vector with n components]
Now, we know how to:
Find most popular terms for a TV series
Compute similarity between TV series
Find TV series matching a query
17
Theory: More on Term Weighting
1 TV series = 1 vector [figure: a vector with n components]
All terms are supposed to be equally representative …
… but ‘survive’ is way more unusual than ‘people’
‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
18
Theory: The Big Picture: TF*IDF
1 TV series = 1 vector
An important term for series S is frequent in S and globally unusual.
Some limitations:
Term positions? e.g., “ice truck killer” in Dexter
Stemming? e.g., ananas, christmas
Mixture of languages? e.g., amusant (FR) vs. fun (EN)
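Written out, the weighting could look as follows. The TF part matches the nb/maxNb ratio used on the retrieval slides; the IDF part is the common textbook form and is an assumption here, since the transcript does not show the exact formula or log base. N denotes the number of series and df(t) the number of series containing term t (both symbols introduced for this sketch).

```latex
\[
\mathrm{tf}(t, S) = \frac{\mathrm{nb}(t, S)}{\mathrm{maxNb}(S)},
\qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)},
\qquad
w(t, S) = \mathrm{tf}(t, S) \times \mathrm{idf}(t)
\]
```

With this form, a term present in every series gets idf = 0, so it contributes nothing to a series' description, however frequent it is.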
19
Theory … and Practice
Series = {idS, name, maxNb}: (12, Lost, 540), (45, Dexter, 125)
Dict = {idT, term, idf}: (8, plane, 1.25), (27, killer, 2.87), (29, crash, 3.07)
Posting = {idT*, idS*, nb, tf}: (27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16)
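One way the derived columns maxNb, tf and idf might be filled in with plain SQL is sketched below, again with sqlite3 rather than Oracle. The series names, ids and counts are made up, and idf = ln(N / df) is an assumed formula (the slides do not state theirs). `ln` is registered from Python because SQLite builds do not always include math functions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("ln", 1, math.log)  # make ln() available to SQL
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL,
                      PRIMARY KEY (idT, idS));
-- Made-up data: three series, two terms.
INSERT INTO Series (idS, name) VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO Dict (idT, term) VALUES (1, 'people'), (2, 'survive');
INSERT INTO Posting (idT, idS, nb) VALUES
    (1, 1, 40), (1, 2, 10), (1, 3, 20),  -- 'people' occurs in all 3 series
    (2, 1, 8);                           -- 'survive' occurs in 1 series only
""")

# maxNb: count of the most frequent term of each series.
conn.execute("""
    UPDATE Series
    SET maxNb = (SELECT MAX(nb) FROM Posting p WHERE p.idS = Series.idS)
""")
# tf: raw count normalized by the series' maximum.
conn.execute("""
    UPDATE Posting
    SET tf = 1.0 * nb / (SELECT maxNb FROM Series s WHERE s.idS = Posting.idS)
""")
# idf (assumed form): ln(number of series / number of series with the term).
conn.execute("""
    UPDATE Dict
    SET idf = ln(1.0 * (SELECT COUNT(*) FROM Series)
                 / (SELECT COUNT(*) FROM Posting p WHERE p.idT = Dict.idT))
""")

idf = dict(conn.execute("SELECT term, idf FROM Dict"))
tf_survive = conn.execute(
    "SELECT tf FROM Posting WHERE idT = 2 AND idS = 1").fetchone()[0]
print(idf, tf_survive)
```

Storing these values at indexing time keeps the online queries to simple joins and sums.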
20
Description of a TV Series
Example: Lost [query shown as a figure, built on a join (⋈)]
Many surnames need to be filtered out
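The query itself did not survive in this transcript. A plausible sketch, assuming terms are ranked by tf * idf, joins the three relations and sorts by weight. The rows are the sample rows from the "Theory … and Practice" slide; `top_terms` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL);
INSERT INTO Series  VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Dict    VALUES (8, 'plane', 1.25), (27, 'killer', 2.87),
                           (29, 'crash', 3.07);
INSERT INTO Posting VALUES (27, 45, 89, 0.71), (8, 45, 3, 0.02),
                           (8, 12, 90, 0.16);
""")

def top_terms(conn, series_name, k=10):
    """Return the k highest-weighted (term, tf*idf) pairs of one series."""
    return conn.execute("""
        SELECT d.term, p.tf * d.idf AS weight
        FROM Series s
        JOIN Posting p ON p.idS = s.idS
        JOIN Dict d    ON d.idT = p.idT
        WHERE s.name = ?
        ORDER BY weight DESC
        LIMIT ?
    """, (series_name, k)).fetchall()

dexter_terms = top_terms(conn, "Dexter")
print(dexter_terms)
```

Character names tend to score very high under this weighting because they are frequent in one series and rare elsewhere, which is consistent with the slide's remark about filtering names out.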
21
Retrieval of TV Series: queries with 1 term
Query: survive [query shown as a figure, built on a join (⋈)]
Importance of normalization:
Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
Blade: nb/maxNb = 9/163 = 0.05521
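A one-term search might be sketched like this, ranking series by nb/maxNb. The counts are the ones on the slide; the ids and the `search_one` helper are made up, and the query term is assumed to be already stemmed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER);
-- Ids are made up; counts come from the slide.
INSERT INTO Series  VALUES (1, 'Stargate Atlantis', 1116), (2, 'Blade', 163);
INSERT INTO Dict    VALUES (1, 'survive');
INSERT INTO Posting VALUES (1, 1, 63), (1, 2, 9);
""")

def search_one(conn, term):
    """Rank series for a single term by normalized term frequency."""
    return conn.execute("""
        SELECT s.name, p.nb, 1.0 * p.nb / s.maxNb AS tf
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series s  ON s.idS = p.idS
        WHERE d.term = ?
        ORDER BY tf DESC
    """, (term,)).fetchall()

results = search_one(conn, "survive")
for name, nb, tf in results:
    print(f"{name}: nb={nb}, tf={tf:.5f}")
```

The raw counts differ by a factor of seven (63 vs 9), yet after normalization the two series score almost the same, which is the point the slide makes.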
22
Retrieval of TV Series: queries with n terms
Query: survive mulder [query shown as a figure, built on a join (⋈)]
Each row reads term | tf | idf, and a series’ score is the sum of its tf * idf products.
67 | The Vampire Diaries
survive | 0.028 | 0.107 = 0.028 * 0.107 = 0.003
mulder | 0.007 | 3.977 = 0.007 * 3.977 = 0.028
score = 0.003 + 0.028 = 0.031
18 | X-Files
survive | 0.014 | 0.107 = 0.014 * 0.107 = 0.001
mulder | 1.000 | 3.977 = 1.000 * 3.977 = 3.977
score = 0.001 + 3.977 = 3.978
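A multi-term search can be sketched as a grouped sum of tf * idf over the query terms. The tf and idf values and the series ids are those on the slide; the term ids and the `search` helper are assumptions. The f-string only inserts `?` placeholders, never the query text itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
-- Term ids are made up; tf and idf values come from the slide.
INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 18, 0.014), (2, 18, 1.000);
""")

def search(conn, terms):
    """Rank series by the sum of tf*idf over the (already stemmed) terms."""
    marks = ",".join("?" * len(terms))  # one placeholder per query term
    return conn.execute(f"""
        SELECT s.idS, s.name, SUM(p.tf * d.idf) AS score
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series s  ON s.idS = p.idS
        WHERE d.term IN ({marks})
        GROUP BY s.idS, s.name
        ORDER BY score DESC
    """, list(terms)).fetchall()

ranking = search(conn, ["survive", "mulder"])
for ids, name, score in ranking:
    print(f"{ids} | {name}: {score:.3f}")
```

Because 'mulder' has a high idf and 'survive' a low one, the rare term dominates the ranking, as in the slide's X-Files example.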
23
Similar to House? Computing Similarities Among TV Series 1/2
First, let’s compute the numerator [query shown as a figure, built on a join (⋈)], where:
A_i = terms from House
B_i = terms from another TV series
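The formula image is missing from this transcript. The mention of a numerator over A_i and B_i, together with the vector space model introduced earlier, suggests cosine similarity; under that assumption it would read:

```latex
\[
\cos(A, B) =
\frac{\sum_{i=1}^{n} A_i \, B_i}
     {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

Only terms present in both series contribute to the numerator, which is why it maps naturally onto a join of the postings with themselves on the term id.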
24
Similar to House? Computing Similarities Among TV Series 2/2
[Queries shown as figures: three joins (⋈)]
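Assuming the similarity is the cosine between weighted term vectors, the whole computation might be sketched as follows. The `Weight` relation (one precomputed weight per term and series, e.g. tf * idf), all ids and all weights are made up for this example; `sqrt` is registered from Python because SQLite builds do not always include math functions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("sqrt", 1, math.sqrt)  # make sqrt() available to SQL
conn.executescript("""
CREATE TABLE Series (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Weight (idS INTEGER, idT INTEGER, w REAL);  -- hypothetical
INSERT INTO Series VALUES (1, 'House'), (2, 'Grey''s Anatomy'), (3, 'Lost');
INSERT INTO Weight VALUES
    (1, 1, 3.0), (1, 2, 4.0),   -- House
    (2, 1, 4.0), (2, 2, 3.0),   -- Grey's Anatomy
    (3, 2, 1.0), (3, 3, 2.0);   -- Lost
""")

def similar_to(conn, ids):
    """Rank the other series by cosine similarity with series `ids`."""
    return conn.execute("""
        WITH Norm AS (              -- denominator parts: vector lengths
            SELECT idS, sqrt(SUM(w * w)) AS len FROM Weight GROUP BY idS
        )
        SELECT s.name, SUM(a.w * b.w) / (na.len * nb.len) AS cosine
        FROM Weight a
        JOIN Weight b ON b.idT = a.idT AND b.idS <> a.idS  -- shared terms
        JOIN Norm na  ON na.idS = a.idS
        JOIN Norm nb  ON nb.idS = b.idS
        JOIN Series s ON s.idS = b.idS
        WHERE a.idS = ?
        GROUP BY b.idS, s.name, na.len, nb.len
        ORDER BY cosine DESC
    """, (ids,)).fetchall()

neighbours = similar_to(conn, 1)
for name, cosine in neighbours:
    print(f"{name}: {cosine:.4f}")
```

Series that share no term with the reference series do not appear in the result, which is equivalent to a similarity of 0.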
25
Thank you http://www.irit.fr/~Guillaume.Cabanac