Potential BBC Datasets

In general…
- Not many off-the-shelf datasets
- What can be released (under licence) very much depends on the type of content AND the type of use
- We have precedents for releasing A/V content, text and audience behaviour data, but each use case needs to be scrutinised carefully
- There will be a legal process to go through to get hold of the data, but we hope that, once it has been gone through, the agreements established can serve as templates for other users
- NOTE: the broader the use case, the harder it is to get a licence agreed

Audio/Video content
Possible sources:
- Website A/V content that is publicly accessible
- BBC Redux: last 10 years of broadcast BBC video and audio
  - E.g. MGB-3 subset of several hundred programmes [*]
- Specific programme content
- Archive content
- Jupiter (News CMS)
Issues:
- Copyright, editorial, legal, data protection
- Different types of content have different difficulties
  - E.g. music and films are very difficult
- Limits based on the content type and the context of use
[*] The MGB-3 subset has been through a pretty rigorous sampling of a portion of Redux

Other content
Possible sources:
- Web pages
- Subtitle archive (all pre-recorded subtitles made for the BBC)
- Subtitles from Redux, live and pre-recorded, covering 10 years
- Transcripts of the 1, 6 & 10 o'clock TV News bulletins
- News articles
- Apps?
- Scripts
Issues:
- Copyright, legal

Metadata
Possible sources:
- Programme metadata
  - Internal: PIPs/Nitro
  - Public: /programmes
- LDP: Linked data tags (BBC Things) for news/sport/other articles
- Genome: Radio Times history
- Archive metadata: Infax/Fabric
- Production metadata: P4A/Silvermouse
- Commissioning info: OnAir/What’s On
Issues:
- Confidential internal info, data protection

Audience Activity Data
Possible sources:
- Streaming performance data (rDOT)
- Website and iPlayer click/play data
- Website comments
- User Activity Service: likes, follows, etc
- BBC ID info: age, location, gender, etc
- BBC App data
- Raw BARB data (issues with sharing)
Issues:
- Commercial relationships, data protection, anonymity, BBC performance data

Audience insight data
Possible sources (could be raw or results):
- Complaints/Comments
- Survey/panel data
  - Audience Appreciation scores and comments: Pulse Panel
  - Cross Media Index
  - Other specific surveys/panels
- Official stats: BARB (TV), RAJAR (Radio)
Issues:
- Of commercial value, data protection

UGC (user-generated content)
Possible sources:
- BBC Introducing: music submitted by new bands
- Contributions to shows (phone-ins, letters, emails)
- Competitions?
- Experiments?
Issues:
- Data protection

Internal systems/APIs
Possible sources (to use or test against):
- Juicer: tagged news articles
- Mango: entity extraction algorithm & API
- Starfruit: text content classification algorithm & API
- BBC Kaldi: speech recognition system
- Recommendation systems
Issues:
- Access
Note: for several of these we have test datasets that could be used to compare system performance

Outside/Social Media data
Possible sources:
- Twitter, Facebook, etc
- Our analysis of Twitter, Facebook, etc
- Reviews/comments of the BBC/our content in other media, websites
- Our analysis of external commentary
Issues:
- External data can be harvested separately
- Internal analyses may not be for public consumption