Series-O-Rama: Search & Recommend TV Series with SQL
Guillaume Cabanac, February 15th, 2011

Toulouse: A Picture is Worth a Thousand Words
[Map slide comparing Toulouse and Aberdeen (population and student counts), with getaways near Toulouse: Capbreton (3h ride), Collioure (2h30 ride), Ax-les-Thermes (1h40 ride).]

Telly Addicts Need Help to Find TV Series
- Main topics of Grey’s Anatomy? → text mining, visualization
- Series about “plane crash island”? → search engine
- What should I watch next? → recommender system (cf. amazon.com)
(Images: en.wikipedia.org)

Text Mining: Let’s Crunch Subtitles
The same three questions, answered by mining subtitle text (word clouds shown for Cold Case and Grey’s Anatomy):
- Main topics of Grey’s Anatomy? → text mining, visualization
- Series about “plane crash island”? → search engine
- What should I watch next? → recommender system

What’s in a Subtitle File?
- Naming: Title – Season – Episode – Language.srt (1 episode = 1 plain-text file)
- Synchronization: start --> stop timestamps
- Dialogue: we can easily extract words, e.g. [a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
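
The extraction step can be sketched in a few lines of Python (a minimal sketch; the real system is Java and Oracle, and the sample cue below is invented):

```python
import re
from collections import Counter

def extract_words(srt_text):
    """Count lowercase words in SRT subtitle text, skipping cue
    numbers, blank separators, and 'start --> stop' timestamp lines."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue  # not dialogue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

sample = """1
00:00:01,000 --> 00:00:03,500
I love Cuban sandwiches.

2
00:00:04,000 --> 00:00:06,000
My favorite food in Miami!"""

print(extract_words(sample))
```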

DB Technology at Work! [Home]
Subtitle files = 337 MB; implementation 100% Java and Oracle.

DB Technology at Work! [Search Engine]
A query returns a ranked list of results.

DB Technology at Work! [Infos]
Most popular terms and most related series for a given show.

DB Technology at Work! [Recommendations]

DB Technology at Work! [Recommendations]
The user marks series as “I liked” or “I disliked” and asks: what should I watch next?

DB Technology at Work! [Recommendations]
A ranked list of recommendations.

How Does this Work?

Architecture and Data Model
- Offline: subtitles → indexing → DB
- Online: GUI → searching, browsing, recommending
Data model:
- Series = {idS, name}, e.g. (12, Lost), (45, Dexter)
- Dict = {idT, term}, e.g. (8, plane), (27, killer), (29, crash)
- Posting = {idT*, idS*, nb}
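
A minimal SQLite sketch of this three-table model (the talk used Oracle; the posting counts beyond the slide's (plane, 48) and (crash, 15) are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT);
CREATE TABLE Posting (idT INTEGER REFERENCES Dict(idT),
                      idS INTEGER REFERENCES Series(idS),
                      nb  INTEGER,          -- raw term count in the series
                      PRIMARY KEY (idT, idS));
""")
conn.executemany("INSERT INTO Series VALUES (?, ?)",
                 [(12, "Lost"), (45, "Dexter")])
conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                 [(8, 12, 48), (29, 12, 15), (27, 45, 60)])

# Most frequent (series, term) pair across the collection.
row = conn.execute("""SELECT s.name, d.term, p.nb
                      FROM Posting p JOIN Series s ON s.idS = p.idS
                                     JOIN Dict d   ON d.idT = p.idT
                      ORDER BY p.nb DESC LIMIT 1""").fetchone()
print(row)  # → ('Dexter', 'killer', 60)
```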

Theory: Text Indexing Pipeline
1. Tokenization + lowercasing: [the, plane, crashed, ..., planes, ..., is]
2. Stopword removal: [plane, crashed, ..., planes, ...]
3. Stemming, with Porter’s stemmer (1980): [plane, crash, ..., plane, ...]
4. Counting: {(plane, 48), (crash, 15), ...}
Sample text on the slide: “In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys, girls and adults in primary, secondary, mechanical and other subjects …”
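
The four steps can be sketched in Python. This is only a sketch: the system uses Porter's stemmer, for which the crude suffix stripping below stands in, and the stopword list is a tiny illustrative one.

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def stem(word):
    """Crude suffix stripping standing in for Porter's (1980) stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return Counter(stem(t) for t in tokens)             # stemming + counting

# 'plane' and 'planes' collapse to the same stem, as on the slide:
print(index("The plane crashed... planes... is"))
```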

Theory: Vector Space Model, Term Weighting
Raw TF is biased by the size of each series’ vocabulary: by raw counts, ‘survive’ scores Dexter > Lost. Normalizing by the most frequent term of each series, TF = nb / max(nb), reverses this: ‘survive’ scores Dexter < Lost.
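
This normalization can be sketched as follows (all counts are invented to illustrate the effect, not real subtitle data):

```python
# Normalized term frequency: tf(t, s) = nb(t, s) / maxNb(s).
counts = {
    "Lost":   {"survive": 50, "island": 200},   # maxNb = 200
    "Dexter": {"survive": 60, "killer": 600},   # maxNb = 600
}

def tf(term, series):
    nb = counts[series]
    return nb.get(term, 0) / max(nb.values())

# Raw counts say Dexter (60) > Lost (50); normalized TF reverses it:
print(tf("survive", "Lost"), tf("survive", "Dexter"))  # 0.25 0.1
```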

Theory: Best Match Retrieval
1 TV series = 1 vector in an n-dimensional term space. Now we know how to:
- Find the most popular terms of a TV series
- Compute the similarity between TV series
- Find TV series matching a query

Theory: More on Term Weighting
1 TV series = 1 vector. So far all terms are supposed to be equally representative, but ‘survive’ is way more unusual than ‘people’, so ‘survive’ represents Lost better than ‘people’ does. IDF: Inverse Document Frequency.
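
One common formulation of IDF, sketched below; the talk does not spell out its exact variant, and N and the document frequencies are invented:

```python
import math

N = 100                              # hypothetical number of series
df = {"people": 95, "survive": 12}   # number of series containing each term

def idf(term):
    # idf(t) = log(N / df(t)): the rarer the term, the larger the weight.
    return math.log(N / df[term])

print(round(idf("people"), 3), round(idf("survive"), 3))
```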

Theory: The Big Picture: TF*IDF
1 TV series = 1 vector. An important term for series S is frequent in S and globally unusual.
Some limitations:
- Term positions? e.g., “ice truck killer” in Dexter
- Stemming? e.g., ananas, christmas
- Mixture of languages? e.g., amusant (FR) vs. fun (EN)
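
Putting the two factors together, weight(t, S) = tf(t, S) * idf(t). A sketch with invented numbers:

```python
import math

N = 100
df = {"survive": 12, "people": 95}                    # invented document frequencies
lost = {"survive": 60, "people": 200, "island": 540}  # invented nb per term in Lost

def weight(term):
    tf = lost.get(term, 0) / max(lost.values())       # frequent in S
    return tf * math.log(N / df[term])                # ... and globally unusual

# 'people' is more frequent in Lost, but 'survive' wins on tf*idf:
print(round(weight("survive"), 3), round(weight("people"), 3))
```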

Theory … and Practice
Precompute the weights in the schema:
- Series = {idS, name, maxNb}, e.g. (12, Lost, 540), (45, Dexter, 125)
- Dict = {idT, term, idf}, e.g. (8, plane), (27, killer), (29, crash, idf 3.07)
- Posting = {idT*, idS*, nb, tf}
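
The precomputation can be sketched with SQL (SQLite here, Oracle in the talk; posting counts beyond the slides' sample rows are invented):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL);
INSERT INTO Series (idS, name) VALUES (12, 'Lost'), (45, 'Dexter');
INSERT INTO Dict (idT, term) VALUES (8, 'plane'), (29, 'crash');
INSERT INTO Posting (idT, idS, nb) VALUES (8, 12, 48), (29, 12, 15), (8, 45, 3);
""")
# maxNb = count of the most frequent term per series; tf = nb / maxNb.
conn.executescript("""
UPDATE Series  SET maxNb = (SELECT MAX(nb) FROM Posting p WHERE p.idS = Series.idS);
UPDATE Posting SET tf = 1.0 * nb / (SELECT maxNb FROM Series s WHERE s.idS = Posting.idS);
""")
# idf = ln(N / df), computed in Python since stock SQLite lacks ln().
n = conn.execute("SELECT COUNT(*) FROM Series").fetchone()[0]
for idT, dfreq in conn.execute(
        "SELECT idT, COUNT(DISTINCT idS) FROM Posting GROUP BY idT").fetchall():
    conn.execute("UPDATE Dict SET idf = ? WHERE idT = ?", (math.log(n / dfreq), idT))

print(conn.execute("SELECT term, idf FROM Dict ORDER BY idf DESC").fetchall())
```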

Description of a TV Series
Top terms of Lost: join Series ⋈ Posting ⋈ Dict and order by weight. Many surnames need to be filtered out.

Retrieval of TV Series: Queries with 1 Term
Query ‘survive’: join Dict ⋈ Posting ⋈ Series and rank by tf*idf. Importance of normalization: Stargate Atlantis has nb/maxNb = 63/1116 ≈ 0.056, Blade has nb/maxNb = 9/163 ≈ 0.055.
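
A SQLite sketch of the one-term query (the idf of 'survive', 0.107, and both nb/maxNb pairs come from the slides; the table contents are otherwise invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL);
INSERT INTO Series VALUES (1, 'Stargate Atlantis', 1116), (2, 'Blade', 163);
INSERT INTO Dict   VALUES (100, 'survive', 0.107);
INSERT INTO Posting VALUES (100, 1, 63, 63.0/1116), (100, 2, 9, 9.0/163);
""")
hits = conn.execute("""
    SELECT s.name, p.tf * d.idf AS score
    FROM Dict d JOIN Posting p ON p.idT = d.idT
                JOIN Series s  ON s.idS = p.idS
    WHERE d.term = 'survive'
    ORDER BY score DESC
""").fetchall()
for name, score in hits:
    print(name, round(score, 4))
```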

Retrieval of TV Series: Queries with n Terms
Query ‘survive mulder’: sum each matching term’s tf*idf per series.
- 67 | The Vampire Diaries: survive 0.028 * 0.107 ≈ 0.0030; mulder 0.007 * 3.977 ≈ 0.0278; score ≈ 0.031
- X-Files: survive 0.014 * 0.107 ≈ 0.0015; mulder 1.000 * 3.977 = 3.977; score ≈ 3.979

Similar to House? Computing Similarities Among TV Series (1/2)
First, let’s compute the numerator Σ Ai*Bi, where Ai = weight of term i in House and Bi = weight of term i in another TV series.

Similar to House? Computing Similarities Among TV Series (2/2)
Then divide by the vector norms to complete the cosine similarity (expressed in SQL as a chain of joins).
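
The full cosine similarity, sketched in Python over term-weight dictionaries (weights are invented; the system computes the same quantity with SQL joins and aggregates):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())   # numerator: sum of Ai * Bi
    return num / (math.sqrt(sum(w * w for w in a.values()))
                  * math.sqrt(sum(w * w for w in b.values())))

house  = {"doctor": 0.9, "patient": 0.8, "diagnosis": 0.7}
greys  = {"doctor": 0.8, "patient": 0.9, "hospital": 0.6}
dexter = {"killer": 1.0, "blood": 0.7, "doctor": 0.1}

# Grey's Anatomy shares House's medical vocabulary; Dexter barely does:
print(round(cosine(house, greys), 3), round(cosine(house, dexter), 3))
```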

Thank you