Published by Julianna Daniels. Modified over 9 years ago.
1
Research in Data Sciences
Peixiang Zhao
Department of Computer Science, Florida State University
zhao@cs.fsu.edu
Introduction to Research Seminar, 2015
Tallahassee, Florida, Sept. 2015
2
Synopsis
1. Introduction to Data Sciences
2. How to prepare yourself for (data) research
3. My research portfolio
4. Conclusions
3
Who am I?
Peixiang Zhao
– Assistant Professor at CS @ FSU
– Homepage: http://www.cs.fsu.edu/~zhao/
– Office: 262 Love Building, FSU
– Ph.D.: University of Illinois at Urbana-Champaign, Aug. 2012
– Research Interest: databases, data mining, data-intensive computation and analytics, and information network analysis!
4
Who am I?
Courses I am offering
– COP4710: Introductory database systems
  Every fall semester
  What databases are and how to use them
– COP4930: Data mining
  Spring 2016
– COP 5725: Advanced database systems
  Every spring semester
  Database internals and advanced topics, such as MapReduce, data mining and Web search
  A research/implementation project
I am hiring highly motivated Ph.D. students!
5
Introduction
What are data sciences?
– The sub-area of computer science dealing with the acquisition, management, querying and mining of data drawn from real-world applications
– Include, but are not limited to
  Database systems
  Data mining
  Information retrieval
  Network science
  Big data
– https://www.youtube.com/watch?v=dKHz9LbgRmo
– http://www.youtube.com/watch?v=LrNlZ7-SMPk
6
Data Sciences
Data:
– Model: fully structured or relational, semi-structured, unstructured, schema-less, graphical, ……
– Format: textual, numeric, categorical, sequential, graph-structured, audio/video, time-series, streaming data
– Scale: from megabytes to zettabytes
– Quality, resolution, privacy, usability ……
Common Tasks:
– Data acquisition, sanitation, transformation, storage, maintenance and integration
– Indexing, querying and ranking
– Knowledge discovery, mining and machine learning
7
Data Sciences
Skillsets and Requirements
– Motivation and passion to work on state-of-the-art problems
– Strong mathematical reasoning and algorithm design abilities
– Good programming skills
Your Bright Future
– DBA at Goldman Sachs or D. E. Shaw
– Data scientist at Google, Facebook, Twitter or Foursquare
– Data engineer at Oracle, IBM or Microsoft
– Researcher at MSR, IBM Research or Yahoo! Labs
– Professor publishing at SIGMOD, KDD or SIGIR
8
How to prepare yourself for (data) research
What is research?
– Discover new knowledge
– Seek answers to non-trivial questions
Research Process
1. Identification of the topic (e.g., Web search)
2. Hypothesis formulation (e.g., algorithm X is better than Y, the state of the art)
3. Experiment design (measures, data, etc.) (e.g., retrieval accuracy on a sample of web data)
4. Test hypothesis (e.g., compare X and Y on the data)
5. Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)
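Steps 2 to 4 of the research process (state a hypothesis, design the experiment, test it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-query accuracy scores are made-up placeholder numbers, and both the function name `paired_sign_flip_test` and the choice of an exact sign-flip (paired permutation) test are assumptions for this sketch, not something the slides prescribe.

```python
from itertools import product


def paired_sign_flip_test(scores_x, scores_y):
    """Two-sided exact sign-flip (paired permutation) test.

    Returns (mean per-query difference X - Y, p-value).
    Enumerates all 2^n sign patterns, so it is only suitable for small n.
    """
    if len(scores_x) != len(scores_y):
        raise ValueError("need one score per query for both methods")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    observed = abs(sum(diffs))
    # Null hypothesis: X and Y are interchangeable, so each per-query
    # difference is equally likely to carry either sign.
    extreme = 0
    for signs in product((1, -1), repeat=n):
        total = abs(sum(s * d for s, d in zip(signs, diffs)))
        if total >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return sum(diffs) / n, extreme / 2 ** n


# Hypothetical per-query retrieval accuracy of X and Y (placeholder numbers).
scores_x = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74, 0.59, 0.68]
scores_y = [0.60, 0.65, 0.57, 0.72, 0.61, 0.70, 0.58, 0.63]

mean_diff, p_value = paired_sign_flip_test(scores_x, scores_y)
print(f"mean difference (X - Y): {mean_diff:.4f}, p-value: {p_value:.4f}")
```

A small p-value would support the hypothesis that X and Y really differ on this sample. Looking at the individual per-query differences, rather than only the mean, is what surfaces the "better only for some queries" situation mentioned in step 5.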
9
Why Research?
[Figure labels: Curiosity; Basic Research, Applied Research, Application Development; Amount of knowledge, Advancement of Technology, Utility of Applications, Quality of Life]
10
What is Good Research?
Solid work:
– A clear hypothesis (research question) with conclusive result (either positive or negative)
– Clearly adds to our knowledge base (what can we learn from this work?)
– Implications: a solid, focused contribution is often better than a non-conclusive broad exploration
High impact = high-importance-of-problem * high-quality-of-solution
– high impact = open up an important problem
– high impact = close a problem with the best solution
– high impact = major milestones in between
– Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best
11
Challenge-Impact Analysis
[Figure: chart with axes “Level of Challenges” and “Impact/Usefulness”, with regions marked Known and Unknown]
– Good applications: not interesting for research
– High impact, low risk (easy): good short-term research problems
– High impact, high risk (hard): good long-term research problems
– Difficult basic research problems, but questionable impact
– Low impact, low risk: bad research problems (may not be publishable)
– “Entry point” problems
12
How to Do Research in Data Sciences?
Curiosity: allows you to ask questions
Critical thinking: allows you to challenge assumptions
– Make sense of what you have read/heard
Learning: takes you to the frontier of knowledge
– Start with textbooks and courses
– Read papers in top-notch conferences/journals
– Implement your prototype ideas
Persistence: so that you don’t give up
Respect data and truth: ensures your research is solid
– Don’t throw away negative results
Communication: publish and present your work
13
Tuning the Problem
[Figure: chart with axes “Level of Challenges” and “Impact/Usefulness”, with regions marked Known and Unknown]
– Make a hard problem easier
– Make an easy problem harder
– Increase impact (more general)
14
Where to Publish?
Databases
– SIGMOD, VLDB, ICDE
– ACM TODS, VLDB J., IEEE TKDE
Data Mining
– KDD, ICDM, SDM
– ACM TKDD
Information Retrieval
– SIGIR, CIKM
– ACM TOIS
Web & Applications
– WWW, WSDM
15
My Research Portfolio
What are information networks?
1. A large number of interacting physical, conceptual, and human/societal entities
2. Entities are interconnected with relationships
Information networks are ubiquitous
– Technological networks
– Social networks
– Biomedical, biochemical and ecological networks
– The Web
– ……
16
Real-world Information Networks
– The network structure of the Internet, Opte Project (http://www.opte.org/maps/)
  Entities: class C subnets
  Relationship: data packet routes
– Twitter network (http://yoan.dosimple.ch/blog/)
– Citation networks (http://bluwiki.com/go/Citation)
  Entities: 5199 papers from SIGOPS, SIGPLAN, SIGART
  Relationship: 5343 citations
– Yeast protein interaction network (baker’s yeast) (http://www.bordalierinstitute.com/)
17
Information Networks: Model and Characteristics
An information network can be modeled as a graph comprising both vertices and edges
– G = (V, E)
A real-world information network is
– massive (Jun. 2012)
  Web graph: 8.94 billion pages
  Facebook: 901 million active users and 125 billion friendship relations
– dynamic
  Facebook U.S. grew 149% in 2009
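To make the G = (V, E) model concrete, here is a minimal Python sketch of an undirected graph stored as adjacency sets. The class name `Graph` and its methods are hypothetical names chosen for this illustration. The last lines work out the average degree implied by the Facebook figures on the slide, assuming each friendship relation is one undirected edge counted once (the slide does not say how the relations were counted).

```python
from collections import defaultdict


class Graph:
    """Undirected simple graph G = (V, E) stored as adjacency sets.

    Only vertices that appear in at least one edge are tracked.
    """

    def __init__(self):
        self.adj = defaultdict(set)  # vertex -> set of neighbouring vertices

    def add_edge(self, u, v):
        if u == v:
            raise ValueError("self-loops are not modelled in this sketch")
        self.adj[u].add(v)
        self.adj[v].add(u)

    def num_vertices(self):
        return len(self.adj)

    def num_edges(self):
        # Every undirected edge shows up in exactly two adjacency sets.
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2

    def average_degree(self):
        if not self.adj:
            return 0.0
        return 2 * self.num_edges() / self.num_vertices()


g = Graph()
g.add_edge("a", "b")
g.add_edge("b", "c")
print(g.num_vertices(), g.num_edges(), g.average_degree())

# Average degree implied by the slide's Facebook numbers (Jun. 2012),
# ASSUMING each friendship is an undirected edge counted once.
facebook_users = 901_000_000
facebook_friendships = 125_000_000_000
facebook_avg_degree = 2 * facebook_friendships / facebook_users
print(round(facebook_avg_degree, 1))
```

Under that assumption the average user has roughly 277 friends, so even a plain adjacency-set representation of this one network would need to hold about 250 billion neighbour entries.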
18
Querying Information Networks
Motivation
– The most natural and easiest approach to managing and accessing information networks is querying!
  Neighborhood query, keyword query, reachability query, shortest-path query, graph query, frequency estimation query, ……
Challenges
– The massive and dynamic nature of information networks precludes the direct application of most well-studied, memory-resident graph algorithms!
Example queries
– Who are my friends in Google+?
– Which university is UIUC?
– What is the shortest route between UIUC and FSU?
– What are the largest phenotypic associations between rice and maize?
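Three of the query types listed above (neighborhood, reachability and shortest path) can be illustrated with textbook breadth-first search on a small in-memory graph. This is exactly the kind of memory-resident algorithm the slide says does not carry over directly to massive, dynamic networks, so treat it as a baseline sketch rather than the author's solution. The function names and the toy graph, including the intermediate vertices A to E, are hypothetical.

```python
from collections import deque


def neighbors(adj, v):
    """Neighborhood query: the vertices directly connected to v."""
    return set(adj.get(v, ()))


def shortest_path(adj, source, target):
    """Shortest-path query on an unweighted graph via BFS.

    Returns the list of vertices from source to target, or None if
    target cannot be reached.
    """
    if source == target:
        return [source]
    parent = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in parent:
                parent[w] = u
                if w == target:
                    # Walk the parent pointers back to the source.
                    path = [w]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(w)
    return None


def reachable(adj, source, target):
    """Reachability query: is there any path from source to target?"""
    return shortest_path(adj, source, target) is not None


# Hypothetical toy network; A..E are placeholder intermediate vertices.
adj = {
    "UIUC": ["A", "C"],
    "A": ["UIUC", "B"],
    "B": ["A", "FSU"],
    "C": ["UIUC", "D"],
    "D": ["C", "E"],
    "E": ["D", "FSU"],
    "FSU": ["B", "E"],
    "X": [],  # isolated vertex
}

print(neighbors(adj, "UIUC"))
print(shortest_path(adj, "UIUC", "FSU"))
print(reachable(adj, "UIUC", "X"))
```

BFS visits every vertex and edge at most once, which is fine here but requires the whole adjacency structure to fit in memory and to stay unchanged while the query runs.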
19
My Focus and Solutions
Information networks
– Unlabeled / Labeled
– Disconnected / Connected
– Unidimensional / Multidimensional
– Static / Dynamic
Tasks
– Structural Similarity
– Subgraph Matching
– OLAP Aggregation
– Frequency Estimation
Efficient, cost-effective and potentially scalable solutions
20
My Other Work
Location-based mining and ranking
– [SIGIR’11], [CIKM’11], [TKDE’15]
Text mining
– [SDM’12], [SIGIR’10], [KAIS’13]
Mining large-scale information networks
– [ICDM’10], [EDBT’09], [SIGMOD’08], [CIKM’15]
Mining structural patterns
– [WWW-J.’08], [DASFAA’07]
Industry-strength systems
– Hadoop-ML at IBM Research
– Trinity at Microsoft Research
21
Future Research Agenda
Foundations and models of Information Networks
– Model, manage and access multi-genre heterogeneous information networks
– Querying and mining volatile, noisy and uncertain information networks
– Cyber-physical information networks
Efficient and scalable computation in Information Networks
– A unified declarative language for graph and network data
– A distributed graph computational framework for large-scale information networks
Knowledge discovery in large Information Networks
22
Conclusions
We are in an information network era!
– Internet, social networks, collaboration and recommender networks, public health-care networks, technological/biological networks ……
Data are pervasive, big, and of great value
Research in data sciences is interesting and highly rewarding
Follow your heart and don’t give up!
23
Good Luck! Q & A