Tallahassee, Florida, Sept., 2015 Research in Data Sciences Peixiang Zhao Department of Computer Science Florida State University Introduction.

Tallahassee, Florida, Sept. 2015
Research in Data Sciences
Peixiang Zhao
Department of Computer Science, Florida State University
zhao@cs.fsu.edu
Introduction to Research Seminar, 2015

Synopsis
1. Introduction to Data Sciences
2. How to prepare yourself for (data) research
3. My research portfolio
4. Conclusions

Who am I?
Peixiang Zhao
– Assistant Professor, CS @ FSU
– Homepage: http://www.cs.fsu.edu/~zhao/
– Office: 262 Love Building, FSU
– Ph.D.: University of Illinois at Urbana-Champaign, Aug. 2012
– Research interests: databases, data mining, data-intensive computation and analytics, and information network analysis

Who am I?
Courses I am offering
– COP 4710: Introductory database systems
    Every fall semester
    What databases are and how to use them
– COP 4930: Data mining
    Spring 2016
– COP 5725: Advanced database systems
    Every spring semester
    Database internals and advanced topics, such as MapReduce, data mining and Web search
    A research/implementation project
I am hiring highly motivated Ph.D. students!

Introduction
What are data sciences?
– The sub-area of computer science dealing with the acquisition, management, querying and mining of data drawn from real-world applications
– Includes, but is not limited to:
    Database systems
    Data mining
    Information retrieval
    Network science
    Big data
– https://www.youtube.com/watch?v=dKHz9LbgRmo
– http://www.youtube.com/watch?v=LrNlZ7-SMPk

Data Sciences
Data:
– Model: fully structured or relational, semi-structured, unstructured, schema-less, graphical, …
– Format: textual, numeric, categorical, sequential, graph-structured, audio/video, time-series, streaming data
– Scale: from megabytes to zettabytes
– Quality, resolution, privacy, usability, …
Common tasks:
– Data acquisition, sanitation, transformation, storage, maintenance and integration
– Indexing, querying and ranking
– Knowledge discovery, mining and machine learning

Data Sciences
Skill Sets and Requirements
– Motivation and passion to work on state-of-the-art problems
– Strong mathematical reasoning and algorithm design abilities
– Good programming skills
Your Bright Future
– DBA at Goldman Sachs or D. E. Shaw
– Data scientist at Google, Facebook, Twitter or Foursquare
– Data engineer at Oracle, IBM or Microsoft
– Researcher at MSR, IBM Research or Yahoo! Labs
– Professor publishing at SIGMOD, KDD or SIGIR

How to prepare yourself for (data) research
What is research?
– Discover new knowledge
– Seek answers to non-trivial questions
Research Process
1. Identification of the topic (e.g., Web search)
2. Hypothesis formulation (e.g., algorithm X is better than Y, the state of the art)
3. Experiment design (measures, data, etc.) (e.g., retrieval accuracy on a sample of Web data)
4. Test the hypothesis (e.g., compare X and Y on the data)
5. Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)
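Steps 3 to 5 of the research process can be sketched in a few lines of Python. This is a minimal illustration that is not taken from the slides: it assumes each method has already produced one accuracy score per query, and it uses an exact two-sided sign test as one simple way to test "X is better than Y". The function names `sign_test_p_value` and `compare` are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_x: int, wins_y: int) -> float:
    """Exact two-sided sign test p-value; tied queries are ignored."""
    n = wins_x + wins_y
    if n == 0:
        return 1.0
    k = min(wins_x, wins_y)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare(scores_x, scores_y):
    """Compare per-query scores of method X against baseline Y."""
    if len(scores_x) != len(scores_y) or not scores_x:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(scores_x, scores_y))
    wins_x = sum(x > y for x, y in pairs)
    wins_y = sum(y > x for x, y in pairs)
    mean_diff = sum(x - y for x, y in pairs) / len(pairs)
    return {
        "wins_x": wins_x,
        "wins_y": wins_y,
        "mean_diff": mean_diff,
        "p_value": sign_test_p_value(wins_x, wins_y),
    }
```

Looking at per-query wins rather than only the mean makes the "Y is better only for some queries" outcome from step 5 visible, which is often where the next hypothesis comes from.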

Why Research?
[Figure: diagram with the labels Curiosity; Basic Research, Applied Research, Application Development; Amount of knowledge, Advancement of Technology, Utility of Applications, Quality of Life]

What is Good Research?
Solid work:
– A clear hypothesis (research question) with a conclusive result (either positive or negative)
– Clearly adds to our knowledge base (what can we learn from this work?)
– Implication: a solid, focused contribution is often better than a non-conclusive broad exploration
High impact = high importance of problem × high quality of solution
– High impact = opening up an important problem
– High impact = closing a problem with the best solution
– High impact = major milestones in between
– Implication: question the importance of the problem, and don't just be satisfied with a good solution; make it the best

Challenge-Impact Analysis
[Figure: chart with axes Level of Challenges (Known to Unknown) and Impact/Usefulness]
– Known: good applications, not interesting for research
– High impact, low risk (easy): good short-term research problems
– High impact, high risk (hard): good long-term research problems
– Difficult basic research problems, but questionable impact
– Low impact, low risk: bad research problems (may not be publishable)
– "Entry point" problems

How to Do Research in Data Sciences?
Curiosity: allows you to ask questions
Critical thinking: allows you to challenge assumptions
– Make sense of what you have read/heard
Learning: takes you to the frontier of knowledge
– Start with textbooks and courses
– Read papers in top-notch conferences/journals
– Implement your prototype ideas
Persistence: so that you don't give up
Respect data and truth: ensures your research is solid
– Don't throw away negative results
Communication: publish and present your work

Tuning the Problem
[Figure: chart with axes Level of Challenges (Known to Unknown) and Impact/Usefulness]
– Make a hard problem easier
– Make an easy problem harder
– Increase impact (more general)

Where to Publish?
Databases
– SIGMOD, VLDB, ICDE
– ACM TODS, VLDB J., IEEE TKDE
Data Mining
– KDD, ICDM, SDM
– ACM TKDD
Information Retrieval
– SIGIR, CIKM
– ACM TOIS
Web & Applications
– WWW, WSDM

My Research Portfolio
What are information networks?
1. A large number of interacting physical, conceptual, and human/societal entities
2. Entities are interconnected by relationships
Information networks are ubiquitous
– Technological networks
– Social networks
– Biomedical, biochemical and ecological networks
– The Web
– …

Real-world Information Networks
– The network structure of the Internet, Opte Project (http://www.opte.org/maps/)
    Entities: class C subnets
    Relationship: data packet routes
– Twitter network (http://yoan.dosimple.ch/blog/)
– Citation networks (http://bluwiki.com/go/Citation)
    Entities: 5199 papers from SIGOPS, SIGPLAN, SIGART
    Relationship: 5343 citations
– Yeast protein interaction network (baker's yeast) (http://www.bordalierinstitute.com/)

Information Networks: Model and Characteristics
An information network can be modeled as a graph comprising both vertices and edges
– G = (V, E)
A real-world information network is
– massive (as of Jun. 2012)
    Web graph: 8.94 billion pages
    Facebook: 901 million active users and 125 billion friendship relations
– dynamic
    Facebook U.S. grew 149% in 2009
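The model G = (V, E) maps directly onto an adjacency-set structure. The sketch below is an illustration only, assuming an undirected, unlabeled graph without self-loops; the class name `Graph` and its methods are hypothetical and not part of the talk.

```python
from collections import defaultdict


class Graph:
    """Undirected graph G = (V, E) stored as adjacency sets."""

    def __init__(self):
        # Maps each vertex to the set of its neighbors.
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        """Add the undirected edge {u, v}; duplicate edges are ignored."""
        if u == v:
            raise ValueError("self-loops are not supported in this sketch")
        self.adj[u].add(v)
        self.adj[v].add(u)

    def vertices(self):
        """Return V, the set of vertices that appear in some edge."""
        return set(self.adj)

    def num_edges(self):
        """Return |E|; each edge is stored twice, once per endpoint."""
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2

    def neighbors(self, u):
        """Return a copy of the neighbor set of u (empty if u is unknown)."""
        return set(self.adj.get(u, ()))
```

An in-memory structure like this is fine for small examples, but at the scale quoted above (billions of vertices and edges) it would not fit on a single machine, which is the point the slide makes about real-world networks being massive.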

Querying Information Networks
Motivation
– The most natural and easiest approach to managing and accessing information networks is querying!
    Neighborhood query, keyword query, reachability query, shortest-path query, graph query, frequency estimation query, …
Challenges
– The massive and dynamic nature of information networks precludes the direct application of most well-studied, memory-resident graph algorithms!
Example queries
– Who are my friends in Google+?
– Which university is UIUC?
– What is the shortest route between UIUC and FSU?
– What are the largest phenotypic associations between rice and maize?
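To make the reachability and shortest-path queries concrete, here is a minimal breadth-first search sketch over an unweighted graph given as a plain adjacency dictionary. It is exactly the kind of memory-resident algorithm the slide says does not scale to massive, dynamic networks, so it illustrates the query semantics only. The function names are hypothetical.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Return one shortest path (fewest edges) as a list, or None if unreachable."""
    if source == target:
        return [source]
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v in parent:
                continue
            parent[v] = u
            if v == target:
                # Walk parent pointers back to the source.
                path = [v]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(v)
    return None


def reachable(adj, source, target):
    """Reachability query: is there any path from source to target?"""
    return shortest_path(adj, source, target) is not None
```

On a toy graph such as `{"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}`, `shortest_path(adj, "a", "c")` walks a, b, c, while `reachable(adj, "a", "d")` is false because `d` is isolated.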

My Focus and Solutions
Information networks
– Unlabeled / Labeled
– Disconnected / Connected
– Unidimensional / Multidimensional
– Static / Dynamic
Tasks
– Structural Similarity
– Subgraph Matching
– OLAP Aggregation
– Frequency Estimation
Efficient, cost-effective and potentially scalable solutions
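As one illustration of the frequency estimation task on a dynamic network, the sketch below counts how often each edge appears in a stream using a Count-Min sketch, a standard streaming summary. This is an assumption chosen for illustration and is not claimed to be the speaker's own solution; the class name, the default `width` and `depth`, and the hashing scheme are all hypothetical choices.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using fixed memory (depth x width counters)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row; repr() keeps the sketch usable for tuples.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item`, e.g. an edge (u, v)."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return an estimate that never undercounts the true frequency."""
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

The estimate can overcount when different edges collide in every row, and a wider table makes that less likely. The memory footprint stays fixed no matter how many edges stream past, which is the trade-off that makes sketches attractive for massive, fast-changing graphs.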

My Other Work
Location-based mining and ranking
– [SIGIR'11], [CIKM'11], [TKDE'15]
Text mining
– [SDM'12], [SIGIR'10], [KAIS'13]
Mining large-scale information networks
– [ICDM'10], [EDBT'09], [SIGMOD'08], [CIKM'15]
Mining structural patterns
– [WWW-J.'08], [DASFAA'07]
Industry-strength systems
– Hadoop-ML at IBM Research
– Trinity at Microsoft Research

Future Research Agenda
Foundations and models of information networks
– Model, manage and access multi-genre heterogeneous information networks
– Querying and mining volatile, noisy and uncertain information networks
– Cyber-physical information networks
Efficient and scalable computation in information networks
– A unified declarative language for graph and network data
– A distributed graph computational framework for large-scale information networks
Knowledge discovery in large information networks

Conclusions
We are in an information network era!
– Internet, social networks, collaboration and recommender networks, public health-care networks, technological/biological networks, …
Data are pervasive, big, and of great value
Research in data sciences is interesting and highly rewarding
Follow your heart and don't give up!

Good Luck!
Q & A
