MOVIE QUOTES SEARCH ENGINE

Students: Meytal Bialik, Zvi Cahana
Supervisors: Hayim Makabee, Oren Somekh

Technion – Israel Institute of Technology
Computer Science Department

MQSE3 Industrial Project – Final Presentation

Introduction

The Movie Quotes Search Engine project builds a search engine that lets a user search for terms that appear in movie dialogue.

The project consists of two main components:
- A web application that serves as the user interface to the search engine.
- A crawling engine that maintains a searchable index and a content database.

Introduction | Goals | Methodology | System Diagram | Achievements | Testing | Screenshots | Conclusions

Goals

- Relevant search results
- Modern UI design
- Rich search options
- Video play option
- Browser-agnostic website
- Large-scale movie database
- Incremental, priority-based crawling

Methodology

- IMDb & OpenSubtitles.org dump files
- SRT subtitle files
- OpenSubtitles.org XML-RPC API
- SQLite database
- Apache Lucene
- Java Servlets / JSP
- HTML5 / CSS / JavaScript

System Diagram

Achievements: Crawling

- Command-line tool
- Dump file parsing
- Based on the OpenSubtitles.org API
- Subtitle downloading & indexing
- Cover art downloading
- Multithreaded, pipelined execution
- Priority-based
- Index recovery

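The priority-based, multithreaded crawling listed above could be organized around a thread-safe priority queue. The following is a minimal sketch, not the project's actual crawler: the class names `CrawlTask` and `CrawlQueue`, the movie IDs, and the idea of using a single numeric priority (for example, movie popularity) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

/** Hypothetical crawl task: a movie ID plus the priority used to order crawling. */
final class CrawlTask {
    final String movieId;
    final double priority; // assumption: higher means "crawl sooner"

    CrawlTask(String movieId, double priority) {
        this.movieId = movieId;
        this.priority = priority;
    }
}

/** Thread-safe queue that hands out the highest-priority task first. */
final class CrawlQueue {
    private final PriorityBlockingQueue<CrawlTask> queue =
            new PriorityBlockingQueue<>(16,
                    Comparator.comparingDouble((CrawlTask t) -> t.priority).reversed());

    /** Registers a movie for crawling; safe to call from several threads. */
    void submit(String movieId, double priority) {
        queue.add(new CrawlTask(movieId, priority));
    }

    /** Returns the next movie to crawl, or null when nothing is pending. */
    String next() {
        CrawlTask task = queue.poll();
        return task == null ? null : task.movieId;
    }

    /** Drains the queue and returns the movie IDs in crawl order. */
    List<String> drainInOrder() {
        List<String> order = new ArrayList<>();
        for (String id = next(); id != null; id = next()) {
            order.add(id);
        }
        return order;
    }
}
```

In a pipelined setup, downloader threads would call `next()` repeatedly and pass each result on to an indexing stage; because the queue is ordered, a crawl that is interrupted has already covered the highest-priority movies.
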
Achievements: Storage

- SQLite-based database
- Movie metadata (popularity, rating, IMDb link, ...)
- Cover art
- ~20,000 subtitles downloaded & indexed
- Local video repository

Achievements: Indexing

- SRT file parsing & validation
- SRT file filtering
  - Translator comments
  - Hearing-impaired comments
  - Format tags
- Partitioning into overlapping search units
- Indexing using Lucene core
  - Stemming
  - Stop-word removal
  - Actual indexing of the search units
- ~250 ms per average SRT file

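The filtering and partitioning steps above can be illustrated with a short sketch. It is not the project's code: the class name `SearchUnits`, the regular expressions, and the choice to measure unit size and overlap in subtitle cues are assumptions. Translator comments are not handled here, since the slides do not say how they are recognized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: cleans subtitle cues and groups them into overlapping units. */
final class SearchUnits {
    // Format tags such as <i>...</i> or {\an8}.
    private static final Pattern FORMAT_TAGS = Pattern.compile("<[^>]+>|\\{[^}]*\\}");
    // Assumption: hearing-impaired annotations appear in [brackets] or (parentheses).
    private static final Pattern SOUND_CUES = Pattern.compile("\\[[^\\]]*\\]|\\([^)]*\\)");

    /** Removes format tags and bracketed annotations, then normalizes whitespace. */
    static String clean(String cueText) {
        String text = FORMAT_TAGS.matcher(cueText).replaceAll("");
        text = SOUND_CUES.matcher(text).replaceAll("");
        return text.replaceAll("\\s+", " ").trim();
    }

    /**
     * Joins consecutive cues into units of at most `size` cues, where each unit
     * shares `overlap` cues with the previous one.
     */
    static List<String> partition(List<String> cues, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("require size > 0 and 0 <= overlap < size");
        }
        List<String> units = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < cues.size(); start += step) {
            int end = Math.min(start + size, cues.size());
            units.add(String.join(" ", cues.subList(start, end)));
            if (end == cues.size()) {
                break; // the last cue is covered; avoid emitting a redundant tail unit
            }
        }
        return units;
    }
}
```

The overlap means that a quote spanning two adjacent cues still falls entirely inside at least one unit, at the cost of indexing some text more than once.
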
Achievements: Searching

- Searching using Lucene core
  - Query parsing
  - Search operator support
  - Stemming
  - Stop-word removal
- Relevant bucket retrieval & ranking
- Aggregating buckets into movies
- Merging of overlapping buckets
- Highlighting search words using Lucene core
- Trimming buckets to the most relevant text
- Configurable weighted movie ranking
  - Lucene rank
  - Popularity
  - Rating
  - Year

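One plausible reading of the configurable weighted movie ranking is a linear combination of the four signals. The sketch below assumes exactly that; the class names `MovieHit` and `WeightedRanker`, the linear formula, and the assumption that every signal has already been normalized to the range 0 to 1 are illustrative and not taken from the project.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical per-movie search hit; all signals are assumed normalized to [0, 1]. */
final class MovieHit {
    final String title;
    final double luceneScore;
    final double popularity;
    final double rating;
    final double year;

    MovieHit(String title, double luceneScore, double popularity, double rating, double year) {
        this.title = title;
        this.luceneScore = luceneScore;
        this.popularity = popularity;
        this.rating = rating;
        this.year = year;
    }
}

/** Ranks hits by a configurable weighted sum of their signals. */
final class WeightedRanker {
    private final double wLucene;
    private final double wPopularity;
    private final double wRating;
    private final double wYear;

    WeightedRanker(double wLucene, double wPopularity, double wRating, double wYear) {
        this.wLucene = wLucene;
        this.wPopularity = wPopularity;
        this.wRating = wRating;
        this.wYear = wYear;
    }

    /** Weighted sum of the four signals. */
    double score(MovieHit hit) {
        return wLucene * hit.luceneScore
                + wPopularity * hit.popularity
                + wRating * hit.rating
                + wYear * hit.year;
    }

    /** Returns a new list sorted from highest to lowest score; the input is not modified. */
    List<MovieHit> rank(List<MovieHit> hits) {
        List<MovieHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((MovieHit h) -> score(h)).reversed());
        return sorted;
    }
}
```

Changing the weights changes the character of the results: a ranker built with `new WeightedRanker(1, 0, 0, 0)` orders purely by text relevance, while raising the popularity weight favors well-known movies.
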
Achievements: Web Application

- Based on JSP / HTML5 / CSS / JavaScript
- Full support for IE9
- Modern UI design
- Search result snippets
  - Multiple hits per movie
  - Paging
- Video play option
  - Per result snippet
  - Relevant scene
  - Captions

Testing

A testing platform makes it possible to compare the “quality” of search results across different system configurations.

- In each test, the search engine is queried with a famous quote.
- A test passes if the relevant movie is found in the top-K results.

Testing (continued)

We tested the system with a set of ~100 famous movie quotes. With a biased system configuration and K=9, we achieved a pass rate of ~90%.

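The top-K pass criterion and the resulting pass rate can be expressed in a few lines. This is a sketch of the metric only, not the project's testing platform: the class name `TopKEvaluator` and the representation of the search engine as a function from a quote to a ranked list of movie titles are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical evaluator for the top-K pass criterion. */
final class TopKEvaluator {

    /** A test passes if the expected movie is among the first k ranked results. */
    static boolean passes(List<String> rankedMovies, String expectedMovie, int k) {
        int limit = Math.max(0, Math.min(k, rankedMovies.size()));
        return rankedMovies.subList(0, limit).contains(expectedMovie);
    }

    /**
     * Runs every quote through the search function and returns the fraction of
     * quotes whose expected movie appears in the top k results.
     */
    static double passRate(Map<String, String> expectedMovieByQuote,
                           Function<String, List<String>> search,
                           int k) {
        if (expectedMovieByQuote.isEmpty()) {
            return 0.0;
        }
        int passed = 0;
        for (Map.Entry<String, String> test : expectedMovieByQuote.entrySet()) {
            if (passes(search.apply(test.getKey()), test.getValue(), k)) {
                passed++;
            }
        }
        return (double) passed / expectedMovieByQuote.size();
    }
}
```

Running `passRate` once per system configuration with the same quote set gives directly comparable numbers, which is the kind of comparison the testing platform is described as supporting.
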
Screenshots

Screenshots (continued)

Conclusions

- Lucene is a powerful search platform
- Optimal search results are difficult to define
- Subtitle files from public sources should be further validated
- HTML5 video support is still limited & browser-dependent
- Source control systems make life easier