1
MOVIE WEB PORTAL
GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN
2
CONTENT
Web Crawling
Data Preprocessing
Schema Alignment
Entity Resolution
Data Fusion
Web Portal Demo
3
OVERVIEW
Data Sources: IMDB, TMDB, Rotten Tomatoes
Programming Languages: Python, R, PHP
Database: MySQL
4
WEB CRAWLING
5
METHODOLOGY
1000 most popular movies between 2006 and 2016
HTTP request sender: Requests
HTML/XML parser: BeautifulSoup
6
WEB CRAWLER EXAMPLE
Crawler in Python:
Data navigation via traversing the DOM tree top-to-bottom
Nodes are recognized by the type of tag and the name of the class
Data extraction
Page Source:
7
CRAWLING ILLUSTRATION
List of Movies: Movie1, Movie2, Movie3, …
Movie1: Director, List of Actors, List of Genres
List of Actors: Actor1, Actor2, …
List of Genres: Genre1, Genre2, …
Order of Traversing: List of Movies -> Movie1 -> Director -> List of Actors -> Actor1 -> Actor2 -> List of Genres -> Genre1 -> Genre2
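The traversal order on this slide corresponds to a depth-first, pre-order walk of the tree. A minimal sketch, assuming the page structure can be modeled as nested dictionaries; the `tree` layout and the `preorder` helper are illustrative names, not the group's crawler code:

```python
# Hypothetical nested structure mirroring the tree on the slide.
tree = {
    "List of Movies": {
        "Movie1": {
            "Director": {},
            "List of Actors": {"Actor1": {}, "Actor2": {}},
            "List of Genres": {"Genre1": {}, "Genre2": {}},
        },
    },
}


def preorder(node):
    """Yield node names top-to-bottom: each parent before its children."""
    for name, children in node.items():
        yield name
        yield from preorder(children)


order = list(preorder(tree))
print(" -> ".join(order))
```

Only Movie1 is expanded here; further movies would be visited in the same way after Movie1's subtree is finished.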
8
DATA PREPROCESSING
9
EXAMPLE
Date/Time format conflicts:
August 7, 1975 vs …
2 hrs. 18 mins vs 138 mins
Gender naming convention:
“F” vs “Female”
Regional discrepancies:
Release date/country
Currencies
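The format conflicts above can be resolved by mapping each variant to one canonical form. A minimal sketch, assuming ISO dates, runtimes in minutes, and spelled-out gender labels as the targets; the slides do not state which form was chosen, and the function names and the "M"/"Male" entries are illustrative additions:

```python
import re
from datetime import datetime


def normalize_date(text):
    """Convert a date such as 'August 7, 1975' to ISO 'YYYY-MM-DD'.

    Assumes English month names (the default C locale).
    """
    return datetime.strptime(text.strip(), "%B %d, %Y").date().isoformat()


# Optional hours part followed by an optional minutes part.
RUNTIME_RE = re.compile(
    r"(?:(\d+)\s*hrs?\.?)?\s*(?:(\d+)\s*mins?\.?)?", re.IGNORECASE
)


def normalize_runtime(text):
    """Convert '2 hrs. 18 mins' or '138 mins' to a number of minutes."""
    match = RUNTIME_RE.fullmatch(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized runtime: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


# Assumed canonical labels; only "F" and "Female" appear on the slide.
GENDER_LABELS = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}


def normalize_gender(text):
    """Map 'F' / 'Female' style values to one spelling."""
    return GENDER_LABELS[text.strip().lower()]


print(normalize_date("August 7, 1975"))
print(normalize_runtime("2 hrs. 18 mins"), normalize_runtime("138 mins"))
print(normalize_gender("F"))
```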
10
SCHEMA ALIGNMENT
11
METHODOLOGY
Union the attributes among all 3 sources
Example:
S1: {A1,A2,A3,A4}
S2: {A1,A2,A3,A5,A7}
S3: {A1,A3,A6}
Unified S: {A1,A2,A3,A4,A5,A6,A7}
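The union in this example can be reproduced with Python sets. A minimal sketch using the placeholder attribute names from the slide; sorting is only there to make the printed order stable:

```python
# Attribute sets of the three sources, as given in the example.
s1 = {"A1", "A2", "A3", "A4"}
s2 = {"A1", "A2", "A3", "A5", "A7"}
s3 = {"A1", "A3", "A6"}

# The unified schema keeps every attribute that appears in any source.
unified = sorted(s1 | s2 | s3)
print(unified)
```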
12
UNIFIED SCHEMA
Movie: mid, title, year, overview, runtime, film_location, budget, global_revenue, us_revenue, us_release_date, other_release_date, other_release_country, dvd_date, user_rating, votes_num
Actor: aid, name, gender, date_of_birth, place_of_birth
Director: did, name, gender, date_of_birth, place_of_birth
Genre: gid, genre_type
14
ENTITY RESOLUTION
15
METHODOLOGY
Clustering into groups by keys
Movie, Genre: by first character of the title name/type name
Actor, Director: by concatenating the first character of the actor’s/director’s first name and last name
Pairwise Matching
Distance-based approach used on Actors, Directors, Genres
edit distance: Levenshtein, Jaro-Winkler
Rule-based approach used on Movies
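The two steps above, grouping by key and then comparing pairs within a group, can be sketched as follows. This is a minimal illustration, not the group's code: `person_block_key`, `block_by`, and the sample names are hypothetical, names are assumed to be non-empty, and only Levenshtein distance is shown (Jaro-Winkler is omitted).

```python
from collections import defaultdict


def person_block_key(full_name):
    """Key for actors/directors: first letters of first and last name."""
    parts = full_name.split()
    return (parts[0][0] + parts[-1][0]).upper()


def block_by(names, key_fn):
    """Group names so that only names sharing a key are compared."""
    blocks = defaultdict(list)
    for name in names:
        blocks[key_fn(name)].append(name)
    return dict(blocks)


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


# Made-up sample records, for illustration only.
names = ["Tom Hanks", "Tom Hank", "Tom Hardy", "Emma Stone"]
blocks = block_by(names, person_block_key)
print(blocks)
print(levenshtein("Tom Hanks", "Tom Hank"))
```

Pairwise comparison then runs only inside each block, which is what keeps the number of comparisons manageable.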
16
PERFORMANCE COMPARISON
Efficiency evaluation is conducted on group blocking between 2 different solutions
Experiment performed on Actor entities:
Solution | Blocking Strategy | Computation Time / Machine | Number of Computing Machines
1 | key used: first character of actor’s full name | 1 hour | …
2 | key used: first character of actor’s first and last name | 3.5 hours | …
17
BLOCK SIZE DISTRIBUTION
(Solution 2) (Solution 1)
18
RULE-BASED MATCHING
Rule used for deciding whether or not two movie entities match:
Step 1: IF |year1 - year2| > 2 years, declare a non-match; ELSE go to Step 2
Step 2: IF |runtime1 - runtime2| > 15 mins, declare a non-match; ELSE go to Step 3
Step 3: IF edit distance between title_name1 and title_name2 > threshold, declare a non-match; ELSE consider the entities a match
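The three-step rule can be written as a short function. A minimal sketch under stated assumptions: Step 3 is read as rejecting pairs whose title edit distance exceeds the threshold (a larger distance meaning less similar titles), the threshold value of 3 is hypothetical since the slides do not give one, and the record layout and sample movies are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


TITLE_THRESHOLD = 3  # hypothetical value; not stated on the slide


def movies_match(m1, m2, title_threshold=TITLE_THRESHOLD):
    """Apply the three rules in order; any failed rule is a non-match."""
    if abs(m1["year"] - m2["year"]) > 2:          # Step 1
        return False
    if abs(m1["runtime"] - m2["runtime"]) > 15:   # Step 2
        return False
    if levenshtein(m1["title"], m2["title"]) > title_threshold:  # Step 3
        return False
    return True


# Made-up records, for illustration only.
a = {"title": "Example Movie", "year": 2010, "runtime": 120}
b = {"title": "Example Movi", "year": 2011, "runtime": 118}
c = {"title": "Example Movie", "year": 2014, "runtime": 120}
print(movies_match(a, b), movies_match(a, c))
```

Ordering the cheap numeric checks before the edit-distance computation means most non-matching pairs are rejected without comparing titles at all.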
19
EXAMPLE After Record Linkage…
20
DATA FUSION
21
METHODOLOGY
Fusion by voting
Assumption made on trustworthiness of the 3 data sources: IMDB > TMDB > Rotten Tomatoes
Extract the most informative value
Example 1: For actor’s DOB => S1: 1985, S2: …/05, S3: 1983
S2 will be chosen, as S1 & S2 share the same year value, and S2 provides details on month and date over S1
Example 2:
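One way to combine voting with picking the most informative value is sketched below. This is an interpretation, not the group's implementation: it assumes S1, S2, S3 correspond to IMDB, TMDB, and Rotten Tomatoes, that dates are stored as 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' strings, and that ties in the vote fall back to the trust order. The month in the sample S2 value is a hypothetical placeholder.

```python
from collections import Counter

# Trust order assumed on the slide: IMDB > TMDB > Rotten Tomatoes.
TRUST_ORDER = ["IMDB", "TMDB", "Rotten Tomatoes"]


def fuse_date(values):
    """Vote on the year, then keep the most detailed agreeing value."""
    # Keep non-empty values, ordered from most to least trusted source.
    present = {s: values[s] for s in TRUST_ORDER if values.get(s)}
    if not present:
        return None
    year_votes = Counter(v[:4] for v in present.values())
    best_count = max(year_votes.values())
    # Among the years with the most votes, prefer the more trusted source.
    winning_year = next(v[:4] for v in present.values()
                        if year_votes[v[:4]] == best_count)
    candidates = [s for s, v in present.items() if v[:4] == winning_year]
    # More '-' separators means more detail; ties keep the trusted source.
    best_source = max(candidates, key=lambda s: present[s].count("-"))
    return present[best_source]


# Sample values: years follow Example 1; the full S2 date is a placeholder.
dob = fuse_date({"IMDB": "1985", "TMDB": "1985-10-05", "Rotten Tomatoes": "1983"})
print(dob)
```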
22
METHODOLOGY
23
WEB PORTAL
24
PORTAL APPLICATION
Search for movies to view more details.
Rank movies by filter criteria such as rating or box office.
Find the movies related to a celebrity.
25
PORTAL DEMO
26
Q & A