Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval Walid Magdy, Jinming Min, Johannes Leveling, Gareth.


1 Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval. Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones. School of Computing, Dublin City University, Ireland. 20 May 2010, LREC 2010

2 Outline: CNGL; Objective; Data collection preparation and overview; IR test collection design; Baseline experiments; Summary

3 CNGL: Centre for Next Generation Localisation (CNGL). 4 universities: DCU, TCD, UCD, and UL. Team: 120 PhD students, postdocs, and PIs. Supported by Science Foundation Ireland (SFI). 9 industrial partners: IBM, Microsoft, Symantec, … Objective: automation of the localisation process. Technologies: MT, AH, IR, NLP, Speech, and Dev.

4 Objective: Create a collection of data that is: (1) suitable for IR tasks; (2) suitable for other research fields (AH, NLP); (3) large enough to produce conclusive results; (4) associated with defined evaluation strategies. Prepare the collection from freely available data: YouTube, domain specific (basketball). Build a standard IR test collection (document set + topic set + relevance assessments).

5 YouTube Video Features: Document; Tags; Video URL; Video Title; Posting User; Posting Date; Description; Category; Number of Views; Length; Responded Videos; Related Videos; Comments; Number of Ratings; Number of Favorited

6 Methodology for Crawling Data: 50 NBA-related queries were used to search YouTube. The first 700 results per query were crawled, together with their related videos. Crawled pages were parsed and their metadata extracted. Extracted data was represented in XML format. Results from non-sport categories were filtered out. Queries used: NBA, NBA Highlights, NBA All Stars, NBA fights; the 15 top-ranked NBA players in 2008 + Jordan + Shaq; 29 NBA teams.
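The parse, filter, and XML steps on this slide can be sketched as below. This is only an illustration under stated assumptions: the element names (`video`, `url`, `title`, `category`, `description`, `tags`), the category label `"Sports"`, and the helper names are hypothetical, since the actual schema is on the XML sample slide and is not reproduced in this transcript.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the collection's real XML schema is not shown here.
TEXT_FIELDS = ("url", "title", "category", "description")

def keep_sports(videos, wanted_category="Sports"):
    """Drop crawled videos whose category is not the sports category."""
    return [v for v in videos if v.get("category") == wanted_category]

def video_to_xml(video):
    """Represent one crawled video's metadata as an XML element."""
    doc = ET.Element("video")
    for field in TEXT_FIELDS:
        ET.SubElement(doc, field).text = video.get(field, "")
    tags = ET.SubElement(doc, "tags")
    for tag in video.get("tags", []):
        ET.SubElement(tags, "tag").text = tag
    return doc

# Toy input standing in for parsed crawl results.
crawled = [
    {"url": "http://example.com/v1", "title": "Jordan dunks",
     "category": "Sports", "tags": ["NBA", "dunk"]},
    {"url": "http://example.com/v2", "title": "Cooking show",
     "category": "Howto", "tags": ["food"]},
]

docs = [video_to_xml(v) for v in keep_sports(crawled)]
xml_text = ET.tostring(docs[0], encoding="unicode")
```

Filtering before serialisation keeps off-domain pages out of the XML document set entirely, so later indexing does not need a category check.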

7 Data Collection Overview: Crawled video pages: 61,340. Max crawled related/responded video pages: 20. Max crawled comments for a given video page: 500. Comments are associated with the contributing user's ID. Crawled user profiles: ≈ 250k.

8 XML sample

9 Topics Creation: 40 topics (queries) created; specific topics related to the NBA. TREC topic = query (title) + description + narrative. Example topic: "Michael Jordan best dunks". Find the best dunks through the career of Michael Jordan in the NBA. It can be a collection of dunks in matches, or in dunk contests he participated in. A relevant video should contain at least one dunk by Jordan. Videos of dunks by other players are not relevant, and plays by Jordan other than dunks are not relevant either.

10 Relevance Assessment: 4 indexes created: (1) Title; (2) Title + Tags; (3) Title + Tags + Description; (4) Title + Tags + Description + Related video titles. 5 different retrieval models used, giving 20 different result lists, each containing 60 documents. Result lists merged with random ranking. 122 to 466 documents assessed per topic. 1 to 125 relevant documents per topic (avg. = 23).
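The merging step can be read as a standard pooling procedure: take the top documents from every result list, remove duplicates, and shuffle so that assessors cannot infer the original ranks. A minimal sketch under that assumption follows; the function name, the fixed seed, and the toy document IDs are illustrative and not taken from the slides.

```python
import random

def build_pool(result_lists, depth=60, seed=0):
    """Merge ranked lists into one de-duplicated, randomly ordered pool."""
    pool = set()
    for ranked in result_lists:
        pool.update(ranked[:depth])      # keep only the top `depth` documents
    pooled = sorted(pool)                # deterministic order before shuffling
    random.Random(seed).shuffle(pooled)  # hide the original rank information
    return pooled

# Toy example: three short result lists for one topic, pooled to depth 2.
lists = [["d1", "d2", "d3"], ["d2", "d4"], ["d5", "d1"]]
pool = build_pool(lists, depth=2)
```

With 20 lists of 60 documents, a pool can hold at most 1,200 documents per topic; overlap between the lists would explain why the assessed counts reported on the slide are lower than that.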

11 Baseline Experiments: Search 4 different indexes: (1) Title; (2) Title + Tags; (3) Title + Tags + Description; (4) Title + Tags + Description + Related video titles. Indri retrieval model used to rank results. 1000 results retrieved for each search. Mean average precision (MAP) used to compare the results.
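For reference, MAP is the mean over topics of average precision, where average precision sums the precision at each rank holding a relevant document and divides by the number of relevant documents for that topic. The sketch below uses hypothetical names (`runs`, `qrels`) and toy data; it is not the evaluation script used in the experiments.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant rank
    return precision_sum / len(relevant)   # unretrieved relevant docs count as 0

def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all topics in qrels."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores)

# Toy data: two topics with their ranked results and relevant sets.
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y"]}
qrels = {"t1": {"a", "c"}, "t2": {"y", "z"}}
map_score = mean_average_precision(runs, qrels)
```

In the toy data, topic `t1` scores (1/1 + 2/3) / 2 = 5/6 and topic `t2` scores (1/2) / 2 = 1/4, so the MAP is 13/24.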

12 Results

13 Summary (new language resource): 61,340 XML docs; 40 topics + rel. assess.; 250,000 user profiles; Comments; Ratings; # Views; Metadata; IR test set; AH/Personalisation; Sentiment Analysis; Videos; Multimedia processing; Reranking using ML; Tags; NER. Top bigrams in "Tags" field: Kobe Bryant, NBA Basketball, Lebron James, Michael Jordan, Los Angeles, All Star, Chicago Bulls, Boston Celtics, Allen Iverson, Angeles Lakers, Slam Dunk, Basketball NBA, Dwight Howard, Vince Carter, Dwyane Wade, Kevin Garnett, Toronto Raptors, Houston Rockets, Miami Heat, O’Neal, Phoenix Suns, Detroit Pistons, Tracy Mcgrady, Yao Ming, Chris Paul, Amazing Highlights, New York, Pau Gasol, Cleveland Cavaliers, NBA Amazing
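A list of top bigrams in the "Tags" field like the one on this slide could be produced by counting adjacent token pairs in each video's tag string. Entries such as "Angeles Lakers" and "Basketball NBA" suggest plain adjacent-token counting, but that is an assumption; the function name, the lowercasing, and the toy tag strings below are illustrative only.

```python
from collections import Counter

def top_tag_bigrams(tag_fields, n=3):
    """Count adjacent token pairs within each tag string; return the n most common."""
    counts = Counter()
    for field in tag_fields:
        tokens = field.lower().split()
        counts.update(zip(tokens, tokens[1:]))  # pairs never span two videos
    return counts.most_common(n)

# Toy tag strings standing in for the "Tags" field of three videos.
fields = [
    "Kobe Bryant NBA Basketball",
    "Kobe Bryant Lakers",
    "NBA Basketball Kobe Bryant",
]
top = top_tag_bigrams(fields, n=2)
```

Counting per field rather than over one concatenated string avoids creating spurious bigrams from the last tag of one video and the first tag of the next.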

14 Questions & Answers Q: Is this collection available for free? A: No Q: Nothing could be provided? A: Scripts + Topics + Rel. assess. (needs updating) Q: Any other questions? A: …

15 Thank you

16 YouTube Statistics (1/8): Min 13/09/2005; Max 03/03/2009

17 YouTube Statistics (2/8): Min, Max, Mean, Median, Std Dev: 0841210

18 YouTube Statistics (3/8): Min 0; Max 21,710,757; Mean 35,707; Median 3,329; Std Dev 221,091

19 YouTube Statistics (4/8): Min, Max, Mean, Median, Std Dev: 023,147586328

20 YouTube Statistics (5/8): Min 00:00:00; Max 02:38:20; Mean 00:02:53; Median 00:02:10; Std Dev 00:02:54

21 YouTube Statistics (6/8): Min, Max, Mean, Median, Std Dev: 027,029528303

22 YouTube Statistics (7/8): Min 0; Max 5; Mean 4; Median 5; Std Dev 1

23 YouTube Statistics (8/8): Min, Max, Mean, Median, Std Dev: 072,230947687

24 YouTube Statistics (9/9): Min, Max, Mean, Median, Std Dev: 0232002
