CS341: Project in Mining Massive Datasets Infosession
Jure Leskovec, Chris Re, Anand Rajaraman, Rok Sosic, Jeff Ullman, Andreas Paepcke

CS341: Project in Data Mining
Data mining research project on real data
Teams of 3 students (use Piazza on CS246 to form teams)
We have room for 10-15 teams
We provide:
  Data
  Computers (Amazon EC2, ~$3k per team)
  Mentoring: each group will have an assigned mentor whom they meet weekly
You provide:
  Project proposals
  Effort
9/18/2018 Stanford CS341: Project in Mining Massive Datasets

CS341: Schedule
Today (3/3): Info session.
Friday 3/18: Project proposals due.
Friday 3/25: Admission results. 10 to 15 projects will be admitted.
Mon 3/28: First class meeting in Herrin 195.
Mon 5/2 and Weds 5/4: Midterm presentations.
Week of May 30: Final presentations.

Projects: Proposal Must Address
(1) What is the problem/question your team is solving?
  Give a brief but precise description or definition of the problem or question.
  Examples:
  (a) Analyze the data to understand why editors are leaving Wikipedia
  (b) Build a social recommender engine for movies
  (c) Design a better MapReduce algorithm for finding clusters in graphs
(2) What data will you use?
  Why is the data you plan to use appropriate? Does it have the right labels/information?
  It is ok to use your own data (give a detailed description)!
  (a) Wikipedia edit history where every action of every user is recorded
  (b) We crawled Yelp and obtained X million reviews from Y million users
  (c) We will use the Altavista web graph on X million nodes

Projects: Proposal Must Address
(3) How will you solve the problem?
  What is your plan of action? Describe and think about your approach!
  What method, algorithm, or technique will you use? How will you scale it up?
  Be as specific as you can!
  Examples:
  (a) We will create edit histories of every article. We will then compare article edit histories and argue that users are leaving because all the "easy/obvious" articles have already been written
  (b) Our hypothesis is that friends have similar tastes. We will add a regularization term to a latent-factor recommender system that encourages neighboring users to have similar parameters
  (c) We will implement a scalable frequent-itemset-based approach to identify cluster seeds (complete bipartite subgraphs). In the second pass we will use a random-walk-based approach to expand around each seed and extract the clusters
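
Example (b) can be made concrete with a small sketch of the objective it describes. The slides do not give a formula, so everything here is an assumption: the function name `social_mf_loss`, the weights `lam` and `beta`, and the choice of a squared-distance penalty between the latent vectors of friends are illustrative only.

```python
def social_mf_loss(ratings, friends, P, Q, lam=0.1, beta=0.1):
    """Loss of a latent-factor recommender with a social regularizer.

    ratings: list of (user, item, rating) triples
    friends: list of (user, user) pairs (edges of the social graph)
    P, Q:    dicts mapping user id / item id to a latent vector (list of floats)
    lam:     weight of the usual L2 penalty (assumed hyperparameter)
    beta:    weight of the social penalty (assumed hyperparameter)
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Squared prediction error over the observed ratings.
    error = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    # Standard L2 regularization on all latent vectors.
    l2 = lam * (sum(dot(p, p) for p in P.values())
                + sum(dot(q, q) for q in Q.values()))
    # Social term: penalize friends whose latent vectors are far apart.
    social = beta * sum(sqdist(P[u], P[v]) for u, v in friends)
    return error + l2 + social
```

Setting `beta=0` recovers the plain latent-factor objective, which is also the natural baseline for this kind of project.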

Projects: Proposal Must Address
(4) How will you evaluate your method?
  How will you measure performance or success of your method?
  What baselines will you use?
  Examples:
  (a) Using insights from our analysis we will build a model that predicts how complete an article is (how much the article will change in the future). We will evaluate the predictive accuracy of the model
  (b) We will measure the RMSE of our system. As a baseline for comparison we will use a traditional latent-factor recommender
  (c) We will measure resource usage and execution time of our algorithm and compare it to the open-source algorithms Metis and Graclus
(5) What do you expect to submit/accomplish by the end of the quarter?
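
For reference, the RMSE mentioned in example (b) is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch follows; the function name and the list-based inputs are illustrative, not something the slides prescribe.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Computing the same quantity on held-out ratings for both the proposed system and the baseline gives the comparison the example asks for.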

Projects: Proposals
Submit to cs341-spr1516-staff@lists.stanford.edu
PDF should include:
  Project title
  Project narrative addressing the 5 questions
  Information about team members: for each team member, a 5-line CV/bio about prior experience and why you are prepared to take this course
No page limit (but we don't promise to read past page 3)
Due Friday 3/18, 11:59pm Pacific time
We will let you know whether you got in by Friday, March 25

SNAP Datasets
Collection of over 70 web and social network datasets: http://snap.stanford.edu/data
Social networks: online social networks, edges represent interactions between people
Twitter and Memetracker: Memetracker phrases, links and 467 million Tweets
Citation networks: nodes represent papers, edges represent citations
Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
Amazon networks: nodes represent products and edges link commonly co-purchased products
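
Many of the network datasets in the collection are distributed as plain-text edge lists. The sketch below assumes that layout (comment lines starting with `#`, then one whitespace-separated pair of node IDs per line); check the page of the specific dataset before relying on it. The function name `read_edge_list` is illustrative.

```python
def read_edge_list(lines):
    """Build an adjacency map from an iterable of edge-list lines.

    Assumed format: lines starting with '#' are comments, and every other
    non-empty line holds a source node ID and a destination node ID.
    Returns a dict mapping each node ID (as a string) to the set of its
    out-neighbors.
    """
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        src, dst = line.split()[:2]
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())  # make sure sink nodes appear too
    return adjacency
```

An open file object can be passed directly as `lines`, so the file is streamed line by line rather than loaded at once.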

News Media
Online media
Data: Collection of over 6B news documents and 300M short textual phrases that appear in them
Think of this as a complete trace of the Internet news media space for the last 6 years!
Goal: Detect trending topics and explore the dynamics of online news
Based on time, named entities, mutation of information

Microsoft Academic Graph
Exhaustive dataset of scientific papers
123M authors, 123M papers, 757M references
Affiliations, keywords, conferences, journals
1.9 billion items, ~100GB
Problem: Many duplicate entities
  "donald knuth" appears 158 times
Goal: Use textual and network structure features to identify duplicate entries
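
As a rough illustration of combining textual and network features for duplicate detection, the sketch below scores a pair of author records by character-trigram similarity of their names and by overlap of their co-author sets. The feature choice, the equal default weighting `w_text=0.5`, and all function names are assumptions made for this example, not the method the slides propose.

```python
import re


def normalize(name):
    """Lowercase a name, drop punctuation and digits, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z ]", " ", name.lower()).split())


def trigrams(text):
    """Set of character trigrams of a padded string."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def duplicate_score(name_a, coauthors_a, name_b, coauthors_b, w_text=0.5):
    """Weighted mix of a textual feature and a network feature, in [0, 1]."""
    text_sim = jaccard(trigrams(normalize(name_a)), trigrams(normalize(name_b)))
    network_sim = jaccard(set(coauthors_a), set(coauthors_b))
    return w_text * text_sim + (1.0 - w_text) * network_sim
```

Comparing all pairs is infeasible at 123M authors, so in practice such a score would only be computed within candidate groups, for example records that share a normalized last name.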

Online Reviews: Amazon
18 years of Amazon reviews up to March 2013
Product and user information, ratings, review text
http://snap.stanford.edu/data/web-Amazon.html

Generic Places to Find Problems
Kaggle (www.kaggle.com) runs competitions. You can get data and ideas, and possibly win.
Yahoo (http://webscope.sandbox.yahoo.com/). Interesting datasets, no problem suggestions, but some ideas should be obvious.
TREC (http://trec.nist.gov/). Current and historical competitions. May take a week or more to get authorization for data.

Send in Your Proposals
For more detail on a dataset or problem, please contact the appropriate instructor:
  Andreas Paepcke (paepcke@cs.stanford.edu)
  Anand Rajaraman (datawocky@gmail.com)
  Chris Re (chrismre@cs.stanford.edu)
  Rok Sosic (rok@cs.stanford.edu)
  Jeff Ullman (ullman@gmail.com)
Emails for outside contacts are provided at i.stanford.edu/~ullman/cs341slides.html