Data Science Research in Big Data Era


Data Science Research in Big Data Era. Introduction to Research Seminar, 2017. Peixiang Zhao, Department of Computer Science, Florida State University. zhao@cs.fsu.edu

Synopsis: Introduction to Data Sciences; How to prepare yourself for (data) research; My research portfolio; Conclusions.

Who am I? Peixiang Zhao, Assistant Professor at CS @ FSU (homepage: http://www.cs.fsu.edu/~zhao/). Office: 262 Love Building, FSU. Ph.D.: University of Illinois at Urbana-Champaign, Aug. 2012. Research interests: databases, data mining, data-intensive computation and analytics, and information network analysis.

Who am I? Courses I am offering: COP4710, Introductory database systems (every fall semester): what databases are and how to use them, with a programming project on Web-based DB programming. CIS 4930: Data Mining. COP 5725, Advanced database systems (every spring semester): database internals and advanced topics such as MapReduce, data mining and Web search, with a research/implementation project. I am hiring highly motivated Ph.D. students!

Introduction. What are data sciences? The sub-area of computer science dealing with the querying, mining, acquisition, and management of data drawn from real-world applications. It includes, but is not limited to: database systems, data mining, information retrieval, Web technologies, network science, and big data. http://www.youtube.com/watch?v=LrNlZ7-SMPk

Data Sciences. Data: Model: fully structured or relational, semi-structured, unstructured, schema-less, graphical, ... Format: textual, numeric, categorical, sequential, graph-structured, audio/video, time-series, streaming data. Scale: from megabytes to zettabytes. Other aspects: quality, resolution, privacy, usability, ... Common tasks: data acquisition, sanitation, transformation, storage, maintenance and integration; indexing, querying and ranking; knowledge discovery, mining and machine learning.

Data Sciences. Skillsets and requirements: motivation and passion to work on state-of-the-art problems; strong mathematical reasoning and algorithm design abilities; good programming skills. Your bright future: DBA at Goldman Sachs or D. E. Shaw; data scientist at Google, Facebook, Twitter or Foursquare; data engineer at Oracle, IBM or Microsoft; researcher at MSR, IBM Research or Yahoo! Labs; professor publishing in SIGMOD, VLDB, KDD or SIGIR.

How to prepare yourself for (data) research. What is research? Discovering new knowledge and seeking answers to non-trivial questions. Research process: (1) Identify the topic (e.g., Web search). (2) Formulate a hypothesis (e.g., algorithm X is better than Y, the state of the art). (3) Design the experiment: measures, data, etc. (e.g., retrieval accuracy on a sample of Web data). (4) Test the hypothesis (e.g., compare X and Y on the data). (5) Draw conclusions, and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?).
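To make the "test hypothesis" step concrete, here is a minimal sketch of one way to compare X and Y on the same queries: an exact two-sided sign test over per-query scores. The slide does not name a test, so the choice of a sign test, the function name, and the scores below are all illustrative assumptions.

```python
from math import comb


def sign_test_p_value(scores_x, scores_y):
    """Exact two-sided sign test on paired per-query scores.

    Ties are dropped. The null hypothesis is that X and Y are equally
    likely to win on any given query.
    """
    wins = sum(x > y for x, y in zip(scores_x, scores_y))
    losses = sum(x < y for x, y in zip(scores_x, scores_y))
    n = wins + losses
    if n == 0:
        return 1.0  # every query tied: no evidence either way
    k = max(wins, losses)
    # Probability of seeing k or more wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical retrieval accuracy of X and Y on the same 8 queries.
x_scores = [0.62, 0.55, 0.71, 0.40, 0.66, 0.58, 0.49, 0.73]
y_scores = [0.60, 0.50, 0.65, 0.42, 0.61, 0.51, 0.45, 0.70]
p_value = sign_test_p_value(x_scores, y_scores)
```

With these made-up scores X wins on 7 of 8 queries, yet the p-value stays above 0.05, which is the kind of inconclusive outcome that sends you back to the hypothesis and experiment design steps.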

Why Research? [Figure: diagram with the labels Funding, Curiosity, Quality of Life, Utility of Applications, Advancement of Technology and Amount of Knowledge, set against three kinds of work: Application Development, Applied Research and Basic Research.]

What is Good Research? Solid work: a clear hypothesis (research question) with a conclusive result (either positive or negative) that clearly adds to our knowledge base (what can we learn from this work?). Implication: a solid, focused contribution is often better than a non-conclusive broad exploration. High impact = high importance of problem × high quality of solution: opening up an important problem, closing a problem with the best solution, or reaching a major milestone in between. Implication: question the importance of the problem, and don't just be satisfied with a good solution; make it the best.

Challenge-Impact Analysis. [Figure: regions plotted by Level of Challenges against Impact/Usefulness, each axis running from Unknown to Known. High impact, high risk (hard): good long-term research problems. Difficult basic research problems, but questionable impact. High impact, low risk (easy): good short-term research problems. Low impact, low risk: bad research problems (may not be publishable). Good applications: not interesting for research. Also marked: "entry point" problems.]

How to Do Research in Data Sciences? Curiosity allows you to ask questions. Critical thinking allows you to challenge assumptions and make sense of what you have read or heard. Learning takes you to the frontier of knowledge: start with textbooks and courses, read papers in top-notch conferences and journals, and implement your prototype ideas. Persistence keeps you from giving up. Respect for data and truth ensures your research is solid: don't throw away negative results. Communication: publish and present your work.

How to Find Problems? Driven by new data: X is a new type of data emerging (e.g., X = blogs vs. news). How is X different from existing types of data? What new issues or problems are raised by X? Are existing methods sufficient for solving old problems on X? If not, what are the new challenges? Driven by new users: Y is a set of new users (e.g., ordinary people vs. librarians). How are the new users different from the old ones? What new needs do they have? Can existing methods satisfy their needs? If not, what are the new challenges? Driven by new tasks (not necessarily new users or new data): Z is a new task (e.g., social networking, online shopping). What information management functions are needed to better support Z? Can these new functions be reduced to old ones? If not, what are the new challenges?

Tuning the Problem. [Figure: Level of Challenges plotted against Impact/Usefulness, each axis running from Unknown to Known, with three moves marked: make an easy problem harder, increase impact (more general), and make a hard problem easier.]

Where to Publish? Databases: SIGMOD, VLDB, ICDE; ACM TODS, VLDB J., IEEE TKDE. Data Mining: KDD, ICDM, SDM; ACM TKDD. Information Retrieval: SIGIR, CIKM; ACM TOIS. Web & Applications: WWW, WSDM.

My Research Portfolio. What are information networks? A large number of interacting physical, conceptual, and human/societal entities, interconnected by relationships. Information networks are ubiquitous: technological networks; social networks; biomedical, biochemical and ecological networks; the Web; and more. Information networks have become a critical component of modern information infrastructure.

Real-world Information Networks. The network structure of the Internet, Opte Project (http://www.opte.org/maps/): entities are class C subnets; relationships are data packet routes. Citation networks (http://bluwiki.com/go/Citation): entities are 5199 papers from SIGOPS, SIGPLAN, SIGART; relationships are 5343 citations. Yeast protein interaction network (baker's yeast) (http://www.bordalierinstitute.com/). Twitter network (http://yoan.dosimple.ch/blog/).

Information Networks: Model and Characteristics. An information network can be modeled as a graph G = (V, E) comprising vertices V and edges E. A real-world information network is massive (as of Jun. 2012, the Web graph has 8.94 billion pages; Facebook has 901 million active users and 125 billion friendship relations) and dynamic (Facebook U.S. grew 149% in 2009).
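As a small illustration of the G = (V, E) model, the sketch below stores an undirected graph as adjacency sets and computes the average degree 2|E| / |V|. The class name, method names and toy edges are hypothetical; only the Facebook user and friendship counts come from the slide.

```python
from collections import defaultdict


class Graph:
    """Undirected graph G = (V, E) stored as adjacency sets (no self-loops)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        # Store the edge in both directions because it is undirected.
        self.adj[u].add(v)
        self.adj[v].add(u)

    def num_vertices(self):
        return len(self.adj)

    def num_edges(self):
        # Each undirected edge appears in two adjacency sets.
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2

    def average_degree(self):
        return 2 * self.num_edges() / self.num_vertices()


# Toy network with hypothetical vertices.
g = Graph()
for u, v in [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]:
    g.add_edge(u, v)

# Average degree implied by the slide's Facebook figures: 2|E| / |V|.
facebook_avg_degree = 2 * 125e9 / 901e6
```

The slide's figures work out to an average of roughly 277 friends per active user. At that scale a structure like this would not fit in one machine's memory, which is the point the next slide makes.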

Querying Information Networks. Motivation: the most natural and easiest approach to managing and accessing information networks is querying! Query types: neighborhood query, keyword query, reachability query, shortest-path query, graph query, frequency estimation query, and more. Example queries: Who are my friends in Google+? Which university is UIUC? What is the shortest route between UIUC and FSU? What are the largest phenotypic associations between rice and maize? Graph query: find all protein substructures containing an α-β-barrel motif in a protein-to-protein interaction network. Frequency query: find the heavy hitters of IP networks with abnormal frequency behavior. [Figure: "Gene Coexpression Network Alignment and Conservation of Gene Modules between Two Grass Species: Maize and Rice".] Challenges: the massive and dynamic nature of information networks precludes the direct application of most well-studied, memory-resident graph algorithms!
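For reference, this is what the memory-resident versions of three of these queries look like on a small graph: a neighborhood lookup, and breadth-first search for reachability and unweighted shortest paths. It is a sketch of the textbook baseline the slide says does not scale, not of any method from the talk; the function names and the toy adjacency lists (including the intermediate cities) are made up.

```python
from collections import deque


def neighbors(adj, v):
    """Neighborhood query: the vertices adjacent to v."""
    return set(adj.get(v, ()))


def shortest_path(adj, source, target):
    """Shortest-path query by BFS on an unweighted graph.

    Returns the list of vertices from source to target, or None if the
    target is unreachable.
    """
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in parent:
                parent[w] = u
                if w == target:
                    # Walk the parent pointers back to the source.
                    path = [w]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(w)
    return None


def reachable(adj, source, target):
    """Reachability query: is there any path from source to target?"""
    return shortest_path(adj, source, target) is not None


# Hypothetical toy network; the whole graph sits in one dictionary.
adj = {
    "UIUC": ["Chicago"],
    "Chicago": ["UIUC", "Atlanta"],
    "Atlanta": ["Chicago", "FSU"],
    "FSU": ["Atlanta"],
    "Isolated": [],
}
route = shortest_path(adj, "UIUC", "FSU")
```

BFS visits every vertex and edge in the worst case and assumes random access to the full adjacency structure, which is exactly what a massive, constantly changing network makes impractical.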

My Focus and Solutions. Efficient, cost-effective and potentially scalable solutions. [Figure: the systems gSketch, Graph Cube, Tree+δ, SPath, P-Rank and SimQuery, arranged by query type (frequency estimation, OLAP aggregation, subgraph matching, structural similarity) and by kind of information network (unlabeled/labeled, disconnected/connected, unidimensional/multidimensional, static/dynamic).]
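To give a feel for frequency estimation over a graph stream, here is a minimal Count-Min sketch that counts how often each edge appears using a fixed amount of memory. This is the generic textbook structure, shown only as an illustration; it is not the gSketch method named on the slide, and the class name, parameters and sample edges are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row, so rows behave like independent hashes.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical stream of directed edges (source, destination).
sketch = CountMinSketch()
for edge in [("a", "b"), ("a", "b"), ("b", "c"), ("a", "b")]:
    sketch.add(edge)
estimated_ab = sketch.estimate(("a", "b"))
```

Memory use is `width * depth` counters regardless of how many distinct edges arrive, at the cost of possible overestimates when edges collide.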

My Other Work. Location-based mining and ranking. Text mining. Mining large-scale information networks. Mining structural patterns. Industry-strength systems: Hadoop-ML at IBM Research; Trinity at Microsoft Research.

Grand Challenges in Data Science. Models and representations: text, HTML/XML data, relational data, graph/network, image, animation/video; an internet of (homogeneous/heterogeneous) things. Magnitude and complexity: big data is a big deal. NCSA example: first 19 years: 1 PB; year 20 (2007): 2 PB; year 21 (2008): 4 PB; by 2020: ~20 exabytes? Resolution and granularity. Quality and reliability.
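The NCSA numbers can be sanity-checked with a short extrapolation. The sketch below assumes, as the 2007 to 2008 step suggests, that the archive keeps doubling every year, and it uses decimal units (1 EB = 1000 PB); both are assumptions, as is the function name.

```python
def projected_petabytes(base_pb, base_year, target_year, doubling_years=1.0):
    """Extrapolate storage volume under a constant doubling period."""
    return base_pb * 2 ** ((target_year - base_year) / doubling_years)


# 4 PB in 2008, doubling annually: 4 * 2**12 PB by 2020.
pb_2020 = projected_petabytes(4, 2008, 2020)
eb_2020 = pb_2020 / 1000  # decimal units: 1 EB = 1000 PB
```

Under these assumptions the projection is about 16 EB, the same order of magnitude as the slide's "~20 exabytes?" guess.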

Future Research Agenda. Foundations and models of information networks: model, manage and access multi-genre heterogeneous information networks; querying and mining volatile, noisy and uncertain information networks; cyber-physical information networks. Efficient and scalable computation in information networks: a unified declarative language for graph and network data; a distributed graph computational framework for large-scale information networks. Knowledge discovery in large information networks.

Conclusions. We are in an information network era: the Internet, social networks, collaboration and recommender networks, public health-care networks, technological/biological networks, and more. Data are pervasive, big, and of great value. Research in data sciences is interesting and highly rewarding. Follow your heart and don't give up!

Good Luck! Q & A