GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN


1 MOVIE WEB PORTAL
GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN

2 CONTENT
Web Crawling
Data Preprocessing
Schema Alignment
Entity Resolution
Data Fusion
Web Portal Demo

3 OVERVIEW
Data Sources: IMDB, TMDB, Rotten Tomatoes
Programming Languages: Python, R, PHP
Database: MySQL

4 WEB CRAWLING

5 METHODOLOGY
1000 most popular movies between 2006-2016
HTTP request sender: Requests
HTML/XML parser: BeautifulSoup

6 WEB CRAWLER EXAMPLE
Crawler in Python:
  Data navigation via traversing the DOM tree top-to-bottom
  Nodes are recognized by the type of tag and the name of the class
  Data extraction
Page Source:
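
The idea of recognizing nodes by tag type and class name can be sketched as follows. This is not the group's crawler: the page source, tag names, and class names below are hypothetical, and the sketch swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs. With BeautifulSoup, the same selection would typically be written as soup.find_all("li", class_="actor").

```python
from html.parser import HTMLParser

# Hypothetical page source; the real sites' markup is not shown in the slides.
PAGE = """
<div class="movie">
  <h2 class="title">Movie1</h2>
  <span class="director">Director A</span>
  <ul class="actors"><li class="actor">Actor1</li><li class="actor">Actor2</li></ul>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text of every node whose (tag, class) pair is a target."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets   # set of (tag, class) pairs to recognize
        self.current = None      # class of the target node being read, if any
        self.found = []          # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if (tag, cls) in self.targets:
            self.current = cls

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        # Target nodes in this sketch hold plain text only, so any end tag closes them.
        self.current = None


def extract(page, targets):
    parser = ClassTextExtractor(targets)
    parser.feed(page)
    return parser.found


if __name__ == "__main__":
    wanted = {("h2", "title"), ("span", "director"), ("li", "actor")}
    for cls, text in extract(PAGE, wanted):
        print(cls, text)
```

Because the parser reads the document from top to bottom, the extracted values come out in page order, which is the property the top-to-bottom navigation relies on.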

7 CRAWLING ILLUSTRATION
List of Movies
  Movie1
    Director
    List of Actors
      Actor1
      Actor2
    List of Genres
      Genre1
      Genre2
  Movie2
  Movie3
Order of Traversing: List of Movies -> Movie1 -> Director -> List of Actors -> Actor1 -> Actor2 -> List of Genres -> Genre1 -> Genre2
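
The order of traversing shown here is a depth-first, pre-order walk: each node is visited before its children, and children are visited left to right. A minimal sketch, assuming the tree is held as nested (label, children) tuples; that representation is an illustration choice, not something the slides specify, and only Movie1 is expanded to mirror the order listed above.

```python
# Hypothetical in-memory tree; each node is (label, list_of_children).
TREE = ("List of Movies", [
    ("Movie1", [
        ("Director", []),
        ("List of Actors", [("Actor1", []), ("Actor2", [])]),
        ("List of Genres", [("Genre1", []), ("Genre2", [])]),
    ]),
])


def preorder(node):
    """Return labels in pre-order: the node first, then each child subtree."""
    label, children = node
    order = [label]
    for child in children:
        order.extend(preorder(child))
    return order


if __name__ == "__main__":
    print(" -> ".join(preorder(TREE)))
```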

8 DATA PREPROCESSING

9 EXAMPLE
Date/Time format conflicts
  August 7, 1975 vs [...] vs [...]
  2 hrs. 18 mins vs 138 mins
Gender naming convention
  “F” vs “Female”
Regional discrepancies
  Release date/country
  Currencies
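
Conflicts like these are usually resolved by mapping every variant to one canonical form. The sketch below assumes ISO dates (YYYY-MM-DD), runtimes in whole minutes, and spelled-out gender labels as the canonical forms; those targets, the function names, and the "M"/"Male" pair are assumptions for illustration, since the slides do not state which forms were chosen.

```python
import re
from datetime import datetime

# Input date formats this sketch understands (an assumed, non-exhaustive list).
DATE_FORMATS = ("%B %d, %Y", "%Y-%m-%d")

# Assumed canonical gender labels.
GENDER_LABELS = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}


def normalize_date(text):
    """Return the date as 'YYYY-MM-DD', or None if the format is unknown."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_runtime(text):
    """Convert '2 hrs. 18 mins' or '138 mins' to minutes; None if unparseable."""
    hours = re.search(r"(\d+)\s*hrs?", text)
    minutes = re.search(r"(\d+)\s*mins?", text)
    if not hours and not minutes:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def normalize_gender(text):
    """Map 'F' / 'Female' style values to one label; None if unrecognized."""
    return GENDER_LABELS.get(text.strip().lower())


if __name__ == "__main__":
    print(normalize_date("August 7, 1975"))
    print(normalize_runtime("2 hrs. 18 mins"), normalize_runtime("138 mins"))
    print(normalize_gender("F"), normalize_gender("Female"))
```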

10 SCHEMA ALIGNMENT

11 METHODOLOGY
Union the attributes among all 3 sources
Example:
  S1: {A1,A2,A3,A4}
  S2: {A1,A2,A3,A5,A7}
  S3: {A1,A3,A6}
  Unified S: {A1,A2,A3,A4,A5,A6,A7}
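
The example above is a plain set union. A minimal sketch, assuming attribute names have already been aligned so that the same name means the same attribute in every source (the function name unify_schemas is illustrative):

```python
def unify_schemas(*schemas):
    """Return the sorted union of the attribute names across all schemas."""
    return sorted(set().union(*schemas))


S1 = {"A1", "A2", "A3", "A4"}
S2 = {"A1", "A2", "A3", "A5", "A7"}
S3 = {"A1", "A3", "A6"}

if __name__ == "__main__":
    print(unify_schemas(S1, S2, S3))
```

A record from any single source then leaves the attributes it does not provide empty, for example A5, A6, and A7 for a record coming only from S1.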

12 UNIFIED SCHEMA
Movie: mid, title, year, overview, runtime, film_location, budget, global_revenue, us_revenue, us_release_date, other_release_date, other_release_country, dvd_date, user_rating, votes_num
Actor: aid, name, gender, date_of_birth, place_of_birth
Director: did, name, gender, date_of_birth, place_of_birth
Genre: gid, genre_type

13

14 ENTITY RESOLUTION

15 METHODOLOGY
Clustering into Groups by keys
  Movie, Genre: by first character of the title name/type name
  Actor, Director: by concatenating the first character of actor’s/director’s first name and last name
Pairwise Matching
  Distance-based approach used on Actors, Directors, Genres
    edit distance: Levenshtein, Jaro-Winkler
  Rule-based approach used on Movies
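
The two stages can be sketched together for actor names: records are first grouped by a key built from the initials, and only records inside the same group are compared pairwise with Levenshtein distance. The names, the distance cutoff of 2, and the function names are hypothetical, and Jaro-Winkler is left out to keep the sketch short.

```python
from collections import defaultdict
from itertools import combinations


def name_key(full_name):
    """Blocking key: first character of the first name plus first character of the last name."""
    parts = full_name.split()   # assumes a non-empty name
    return (parts[0][0] + parts[-1][0]).upper()


def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def candidate_pairs(names):
    """Yield pairs of names that share a blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name_key(name)].append(name)
    for block in blocks.values():
        yield from combinations(block, 2)


def match_names(names, max_distance=2):
    """Return candidate pairs whose edit distance is within max_distance (assumed cutoff)."""
    return [(a, b) for a, b in candidate_pairs(names)
            if levenshtein(a.lower(), b.lower()) <= max_distance]


if __name__ == "__main__":
    people = ["Jane Smith", "Jane Smyth", "John Stone", "Jack Brown"]
    print(match_names(people))
```

Blocking trades recall for speed: "Jack Brown" is never compared with the three "JS" names, so any true match that lands in a different block is missed.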

16 PERFORMANCE COMPARISON
Efficiency evaluation is conducted on group blocking between 2 different solutions
Experiment performed on Actor’s entities:
Solution / Blocking Strategy / Computation Time per Machine / Number of Computing Machines
1 / key used: first character of actor’s full name / 1 hour / [...]
2 / key used: first character of actor’s first and last name / 3.5 hours / [...]

17 BLOCK SIZE DISTRIBUTION
(Solution 2) (Solution 1)

18 RULE-BASED MATCHING
Rule used for deciding whether or not two movie entities are matching
Step 1: IF | year1 – year2 | > 2 years, declare a non-match; ELSE go to step 2
Step 2: IF | runtime1 – runtime2 | > 15 mins, declare a non-match; ELSE go to step 3
Step 3: IF edit-distance between title_name1 and title_name2 > threshold, declare a non-match; ELSE consider the entities a match
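
A minimal sketch of the three-step rule, assuming that a small edit distance means similar titles, so the last step rejects a pair only when the distance exceeds the threshold. The threshold of 3, the dictionary field names, and the sample movies are hypothetical; the slides do not give the threshold used.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_same_movie(m1, m2, title_threshold=3):
    """Apply the cheap numeric checks first, then the title comparison."""
    # Step 1: release years more than 2 apart -> non-match.
    if abs(m1["year"] - m2["year"]) > 2:
        return False
    # Step 2: runtimes more than 15 minutes apart -> non-match.
    if abs(m1["runtime"] - m2["runtime"]) > 15:
        return False
    # Step 3: titles too far apart -> non-match, otherwise a match.
    distance = levenshtein(m1["title"].lower(), m2["title"].lower())
    return distance <= title_threshold


if __name__ == "__main__":
    a = {"title": "Example Movie", "year": 2008, "runtime": 152}
    b = {"title": "Example Movi", "year": 2009, "runtime": 150}
    print(is_same_movie(a, b))
```

Ordering the steps this way means the comparatively expensive edit-distance computation only runs for pairs that already agree on year and runtime.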

19 EXAMPLE After Record Linkage…

20 DATA FUSION

21 METHODOLOGY
Fusion by voting
  Assumption made on trustworthiness of the 3 data sources: IMDB > TMDB > Rotten Tomatoes
Extract the most informative value
  Example 1: For actor’s DOB => S1: 1985, S2: /05, S3: 1983
    S2 will be chosen, as S1 & S2 share the same year value, and S2 provides details on month and date over S1
  Example 2:
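
One way to combine voting, the trust ordering, and the "most informative value" rule for a date of birth is sketched below. It is an interpretation, not the group's implementation: it assumes dates start with a four-digit year, that ties in the vote are broken by the trust order, and that the longest string among the values agreeing with the winning year is the most informative one. The sample dates and the mapping of S1, S2, S3 to the three sites are made up for illustration.

```python
from collections import Counter

# Trust order stated on the slide, most trusted first.
TRUST = ["IMDB", "TMDB", "Rotten Tomatoes"]


def trust_rank(source):
    """Lower rank means more trusted; unknown sources rank last."""
    return TRUST.index(source) if source in TRUST else len(TRUST)


def fuse_dob(values):
    """values maps source name -> date string such as '1985' or '1985-05-12'."""
    present = {s: v for s, v in values.items() if v}
    if not present:
        return None
    ordered = sorted(present, key=trust_rank)

    # Vote on the year; collect every year that reaches the top count.
    votes = Counter(v[:4] for v in present.values())
    top = max(votes.values())
    tied_years = {year for year, count in votes.items() if count == top}

    # Break ties with the most trusted source that backs a top-voted year.
    year = next(present[s][:4] for s in ordered if present[s][:4] in tied_years)

    # Among values agreeing on that year, keep the most detailed one.
    candidates = [present[s] for s in ordered if present[s][:4] == year]
    return max(candidates, key=len)


if __name__ == "__main__":
    # Hypothetical values in the spirit of Example 1.
    print(fuse_dob({"IMDB": "1985", "TMDB": "1985-05-12", "Rotten Tomatoes": "1983"}))
```

Since max returns the first of several equally long candidates and the candidates are listed in trust order, two equally detailed values resolve in favor of the more trusted source.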

22 METHODOLOGY

23 WEB PORTAL

24 PORTAL APPLICATION
Search for a movie to see more details.
Rank and filter movies by criteria such as rating and box office.
Find the movies related to a given celebrity.

25 PORTAL DEMO

26 Q & A

