Data Science for Tackling the Challenges of Big Data
Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
November 14, 2014
Overview
Six-Week MIT Online Course: Started November 4th and completed November 12th.
Mined this MIT Online Course for Data Sets and Ideas: Found a subset of the slides that contained data sets and ideas and that were interesting and useful visualizations in themselves.
Professor Karger's Lecture Slides on Visualization User Interfaces Were All About My Heroes: Tukey, Tufte, Shneiderman, and Spotfire. (In fact, it was everything leading up to Spotfire, but not Spotfire itself!)
Preserve My Work & Present Tutorial to the Federal Big Data Working Group Meetup: MindTouch Knowledge Base, Excel Spreadsheet Index, and Spotfire Interactive Visualizations.
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Assessment Web Site (private)
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Progress https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/progress
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Big Data Storage Web Site (private)
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Modern Databases Script Web Site (private) and Script (Public)
Courseware: Big Data Storage
I was especially interested in the following, since both Professors Stonebraker and Madden presented to our Federal Big Data Working Group Meetup:
This module begins with an overview of a number of these technologies by renowned database professor Mike Stonebraker. In his unique and ardent fashion, Mike expresses his skepticism about many new technologies, particularly Hadoop/MapReduce and NoSQL, and voices support for many new relational technologies, including column stores and main memory databases. After that, Professors Matei Zaharia and Samuel Madden provide a more nuanced view of the tradeoffs between the various approaches, discussing Hadoop and its derivatives, as well as NoSQL and its tradeoffs, in more detail.
Professor Stonebraker expresses a number of strong opinions in this module. Which of them do you agree with? Which do you disagree with? Why?
3.0 Introduction to Big Data Storage and Discussion 3
Selected Slides: Professor Sam Madden What Is This Course Going to Cover? Other Techniques We'll Cover
Selected Slides: Professor David Karger Overview Interaction Strategy
Selected Slides: Professor Daniela Rus Case Study: Transportation in Singapore 1.1 Case Study: Transportation - PDF of Presentation slides (Rus)
Google Search: Singapore Taxi Data
Think Business: Why can’t I find a taxi when I really need one? Based on: Labor Supply Decisions of Singaporean Cab Drivers, May 8, 2013 Newer Paper: Labor Supply Decisions of Singaporean Cab Drivers, September 2014 http://thinkbusiness.nus.edu/smart-finance/item/131-why-can%E2%80%99t-i-find-a-taxi-when-i-really-need-one?
Labor Supply Decisions of Singaporean Cab Drivers: Table 1: Summary Statistics by Days http://www.ushakrisna.com/Cabdrivers.pdf
MIT Big Data Knowledge Base: Table 1 Spreadsheet My Note: The PDF is image-only, so I had to build the table by hand! Spreadsheet
Singapore Land Transport Authority: Traffic Info Service Providers http://www.lta.gov.sg/content/ltaweb/en/industry-matters/traffic-info-service-providers.html
Singapore Land Transport Authority: MyTransport.sg Screen Scrape http://www.mytransport.sg/content/mytransport/home/dataMall.html#All_Datasets
Singapore Land Transport Authority: All Datasets Spreadsheet
MIT Big Data Knowledge Base: MindTouch Labor Supply Decisions of Singaporean Cab Drivers, September 2014, as a Data Science Data Publication Data Science for Tackling the Challenges of Big Data
MIT Big Data: Knowledge Base Spreadsheet
MIT Big Data: Course Participant Spreadsheet My Note: This was mapped in Spotfire after data curation (cleaning of the country names). Spotfire has built-in data curation functions. Spreadsheet
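My Note: The country-name cleaning step can be illustrated outside Spotfire with a minimal Python sketch. This is not how Spotfire does it. The alias table and the function names clean_country and count_by_country are hypothetical, since the actual spelling variants in the course participant data are not listed here.

```python
import unicodedata
from collections import Counter

# Hypothetical alias table (an assumption for illustration only);
# the real variants in the participant spreadsheet are not given.
ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
}


def clean_country(raw):
    """Return a normalized country name for one raw spreadsheet value."""
    # Normalize Unicode forms and collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", raw)
    text = " ".join(text.split())
    # Look up known variants case-insensitively.
    key = text.casefold()
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to title case; this also capitalizes small words
    # such as "and", so a real cleanup would need a fuller table.
    return text.title()


def count_by_country(raw_names):
    """Count participants per cleaned country name, skipping blanks."""
    return Counter(clean_country(n) for n in raw_names if n and n.strip())
```

Calling count_by_country on the raw country column would give one count per cleaned name, which is the kind of table a map visualization needs.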
MIT Big Data: Spotfire Cover Page Web Player
MIT Big Data: Student Enrollment Web Player
MIT Big Data: Singaporean Cab Drivers Web Player
New York City Open Data: Socrata https://nycopendata.socrata.com/
New York City Open Data: Search Results My Note: Could Only Find Taxi Drivers Data. Web Site
New York City Open Data: Data Table Download: XLSX Web Site and Medallion_Drivers_-_Active.xlsx
Visualizing NYC’s Open Data: Socrata Beta https://nycopendata.socrata.com/viz
MIT Big Data Assessment: Questions and Answers

Big Data Collection
2) Data science requires:
   Knowledge of statistics
   Knowledge of data management
   Knowledge of curation
   All of the above - correct

Big Data Systems
13) For which of the following tasks is interactive visualization most useful? (choose all that apply)
   Developing a hypothesis about data - correct
   Formally confirming a hypothesis
   Communicating a conclusion about data - correct
   All of the above

Big Data Analytics
13) Big Data means that there's no shortage of useful data.
   True
   False - correct

Story