A PowerPoint Presentation


Yahoo’s Webscope™ Data Sharing Program
Ron Brachman, Chief Scientist and Head, Yahoo Labs
April 11, 2014
© Yahoo, Inc.

A reference library of interesting and scientifically useful datasets
Available globally for non-commercial use by academics and scientists at research labs affiliated with an accredited university
Reviewed by the Data Review Committee to conform to Yahoo’s data protection standards
The Data Review Committee ensures strict user privacy controls are in place
Data is anonymized using a permuter code
Statistics:
  Number of datasets: 50
  Number of datasets downloaded (since 2006): 6,368
  Number of academics: 3,962
  Number of universities: 1,411
  Number of countries: 96

Categories of Data
50 Webscope datasets are available in 7 categories:
  Language and Content
  Graph and Social Data
  Ratings, Recommendation and Classification Data
  Advertising and Market Data
  Competition Data
  Computing Systems Data
  Image Data
New datasets are continuously being added by Yahoo Labs scientists
Publications attributed to Webscope datasets are listed on the website

Language and Content Data
Can be used to research information retrieval and natural language processing algorithms.
9 of the 21 datasets were created from Yahoo Answers.
Example: L16 - Yahoo Answers Query to Questions, 1.5MB
This dataset may be used by researchers to validate algorithms that predict searcher satisfaction with existing community-based answers. It may also enable researchers to validate algorithms that predict query clarity and query-question match.

Graph and Social Data
Can be used to research matrix, graph, clustering, and machine learning algorithms.
5 of the 8 datasets were created from Yahoo Instant Messenger.
Example: G5 - Yahoo Messenger User Communication Pattern, 32MB
This dataset may be used by researchers to validate claims about social networking theory and to check their assumptions and analysis against a real-time social network graph consisting of a small subset of Yahoo Messenger users.

Ratings, Recommendation and Classification Data
Can be used to research collaborative filtering, recommender systems and machine learning algorithms.
Example: R1 - Yahoo Music User Ratings of Musical Artists, 423MB
This dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms. The dataset may serve as a test bed for matrix and graph algorithms, including PCA and clustering algorithms.
Publications attributed to this Webscope dataset:
  Visualizing head-to-tail affinities in large networks
  Finding Similar Music Artists for Recommendation
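As a rough illustration of the collaborative filtering use mentioned for R1, the sketch below computes item-item cosine similarity between artists from user ratings. The ratings, user IDs, and artist names are invented toy values, and the layout (user mapped to artist ratings) is an assumption for illustration only; it is not the actual format of the R1 dataset.

```python
from math import sqrt

# Toy ratings invented for illustration (NOT taken from the R1 dataset):
# user -> {artist: rating}
TOY_RATINGS = {
    "u1": {"artist_a": 5, "artist_b": 4},
    "u2": {"artist_a": 4, "artist_b": 5, "artist_c": 1},
    "u3": {"artist_a": 1, "artist_c": 5},
}


def artist_vector(ratings, artist):
    """Collect the ratings one artist received, keyed by user."""
    return {user: r[artist] for user, r in ratings.items() if artist in r}


def cosine_similarity(ratings, a, b):
    """Cosine similarity between two artists' rating vectors.

    Users who did not rate an artist count as 0 for that artist.
    """
    va, vb = artist_vector(ratings, a), artist_vector(ratings, b)
    shared = set(va) & set(vb)
    if not shared:
        return 0.0
    dot = sum(va[u] * vb[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in va.values()))
    norm_b = sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b)


def most_similar(ratings, artist):
    """Return the other artist whose rating vector is closest to `artist`."""
    others = {a for r in ratings.values() for a in r} - {artist}
    # sorted() makes tie-breaking deterministic
    return max(sorted(others), key=lambda o: cosine_similarity(ratings, artist, o))


if __name__ == "__main__":
    print(most_similar(TOY_RATINGS, "artist_a"))
```

Cosine similarity is only one of several possible similarity measures; it is used here because it needs no extra libraries.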

Advertising and Market Data
Can be used to research behavior and incentives in auctions and markets.
Example: A1 - Yahoo Search Marketing Advertiser Bidding Data, 81MB
This dataset may be used by economists or other researchers to investigate the behavior of bidders in this unique real-time auction format.
Publications attributed to this dataset:
  Strategic Bidder Behavior in Sponsored Search Auctions
  Comparing Different Yahoo Sponsored Search Auctions: A Regression Discontinuity Design Approach
  An Empirical Analysis of Return on Investment Maximization in Sponsored Search Auctions
  Equilibrium Bids in Sponsored Search Auctions: Theory and Evidence

Competition Data
These datasets were used in competition events with academics and researchers.
Example: C15 - Yahoo Music user ratings of musical tracks, albums, artists and genres, 1.5GB
The novel features of this dataset will make it a subject of active research and a standard in the field of recommender systems. In particular, the dataset is expected to ignite research into algorithms that utilize hierarchical structure annotating the item set.

Computing Systems Data
This type of data can be used to analyze the behavior and performance of different types of computer system architectures, including distributed systems and networks.
Example: S1 - Yahoo Sherpa database platform system measurement, 33K
This dataset can be used to analyze and simulate the bottlenecks experienced in a real cloud database system under load.

Image Data
This type of data can be used to analyze images and tags and is useful for image processing research.
Example: I2 - Yahoo Shopping Shoes Image Content, 131MB
This dataset helps academic machine learning and computer vision researchers develop more accurate object recognition algorithms.

Big Data: The 3 largest Webscope datasets are
  L19 - Yahoo News extracted metadata: noun phrases and their context, version 1.0 (206 GB) (Hosted on AWS)
  L20 - Yahoo Answers browsing behavior, version 1.0 (166 GB) (Hosted on AWS)
  L11 - HTML Forms Extracted from Publicly Available Webpages, version 1.0 (133 GB) (Hosted on AWS)

Distribution channels:
Webscope datasets range in size from 2.3K to 206 Gbytes
Datasets up to 102MB are hosted on Yahoo servers, and the data is delivered by download link
Datasets larger than 102MB are hosted on the Amazon Web Services (AWS) cloud
Requestors need to provide their AWS canonical ID when requesting the data (prompted on the Webscope request form)
Yahoo provides the instructions to download from AWS
There is no cost to requestors to download datasets from AWS
Requestors can download the data from AWS and/or work with the data in the AWS cloud without downloading to local machines
AWS can host datasets as large as 5 terabytes, for future needs of large Webscope datasets
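The hosting rule above can be written down as a small lookup. This is only a sketch of the rule as stated on the slide: the function name, the channel labels, and the use of decimal megabytes are assumptions made for illustration, not part of any Yahoo or AWS API.

```python
# Threshold stated on the slide: datasets up to 102 MB are served from Yahoo servers.
YAHOO_HOSTING_LIMIT_MB = 102
# Upper bound the slide gives for AWS-hosted datasets (5 terabytes),
# converted to MB assuming decimal units (an assumption).
AWS_MAX_DATASET_MB = 5 * 1000 * 1000


def distribution_channel(size_mb: float) -> str:
    """Return a hypothetical label for where a dataset of this size is hosted."""
    if size_mb <= 0:
        raise ValueError("dataset size must be positive")
    if size_mb <= YAHOO_HOSTING_LIMIT_MB:
        return "yahoo-download-link"
    if size_mb <= AWS_MAX_DATASET_MB:
        return "aws"
    raise ValueError("dataset exceeds the 5 TB limit mentioned for AWS hosting")


if __name__ == "__main__":
    # Sizes taken from the example datasets on the earlier slides.
    for name, size_mb in [("A1", 81), ("R1", 423), ("L19", 206_000)]:
        print(name, distribution_channel(size_mb))
```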

What people are saying about Webscope…..

Here’s how to access the data: http://webscope.sandbox.yahoo.com/
Please attribute back to Webscope, Yahoo Labs.
We hope to see your published paper on our Webscope website.
Enjoy the data!