CAP6778: Advanced Data Mining, Fall 2010

CAP6778: Advanced Data Mining, Fall 2010. Dr. Tao Li, Florida International University

Self-Introduction: Ph.D. in Computer Science from the University of Rochester, 2004. Research interests: data mining, machine learning, information retrieval, bioinformatics. Associate Professor in the School of Computer Science at Florida International University. CAP6778

Student Self-Introduction (get to know your peers!): Name. Major and academic status. Programming skills: Are you familiar with Matlab? Are you familiar with Java/C++? Research interest. Background: Did you take any data mining course before? Purpose: Why do you want to take the course? What do you want to learn? Anything else you want us to know.

Course Overview: Meeting time: T Th 12:30pm – 1:45pm. Office hours: T Th 5:00pm – 6:00pm or by appointment. Course webpage: http://www.cs.fiu.edu/~taoli/class/CAP6778-Fall2010 (lecture notes and assignments; username: CAP6778, password: student).

Course Objectives: This is a seminar course that will focus on recent developments in advanced data mining techniques and their applications to various problems.

Tentative Course Syllabus: Large-scale data mining using Map-Reduce; similarity search (including minwise hashing and locality-sensitive hashing); mining data streams; mining social networks; relational data mining; tree/graph mining; privacy-preserving data mining; high-dimensional data clustering; basics of natural language processing; Web applications (including advertising, recommendation, and summarization).

Course Material: Prerequisites: COP5577 Principles of Data Mining. Textbooks and references: The course materials will mainly consist of research papers closely related to the course topics in data mining. Reading material from top conferences and journals will be made available online or in class as required. In addition, lecture notes will be available online.

Grading: NO EXAMS! Class participation (5%); assignments (25%); a research project (70%). Expectation: a research paper submitted to a major conference?!

Grades: Do your best work and don't worry about it, but pay attention to all the feedback you get, including comments and advice. What really matters: Your GPA will not ultimately be the measure of your success in grad school. What matters most is that you acquire real skills, publish influential papers, write a thesis, and get a job. You are learning to do research in data mining.

No Plagiarism: If you use anyone else's ideas, you must cite the source. If you use anyone else's words, you must indicate that you are quoting and give the source of the quote. Don't do this unless the wording of the quote is as significant as its content. (E.g., the quote is particularly funny or memorable, or you are going to attack it as unclear.)

How to do research in data mining? How to have a good career as a graduate student? How to find research problems? Learn about data mining: master basic techniques and algorithms; understand challenging issues in data mining; discuss emerging research issues in data mining; study important research papers in data mining.

How to have a bad career as a graduate student

How to Find Research Problems: The following slides are copied from Dr. Jason Eisner’s advice for graduate students.

How to Find Research Problems I: About the smallest bone that you can find in Computer Science is a replication or implementation of someone else's work. While this doesn't get you points for originality, it may be useful, both to your education and to the field. If you can make it useful to enough people (say, by making it portable and Web-available), it might even get your name known. A significant small bone to look for is a tweak that improves a well-known technique. (In many subfields, you will be expected to demonstrate objectively that your method is an improvement.) Much research is of this kind. When reading papers, stay on the lookout for such bones. In particular, notice when the author may be making harmful simplifications or arbitrary choices in his/her approach. These are opportunities for you to try something different. Along the same lines, you might make a controlled comparison of two or more algorithms, evaluating them by some objective measure of efficiency or accuracy. Designing a clean comparison does take thought, and carrying it out is often a lot of work.

How to Find Research Problems II: You can thoroughly review the existing research in some area. Note that this takes a good deal of time to do well, and is not likely to do much for your career unless a lot of people read and cite your lit review. On the upside, writing a lit review will make you something of an expert, able to talk confidently with other researchers in the area; it will give you an idea of the shortcomings of past research; and it may suffice for an M.S.E. thesis, or the first part of a Ph.D. thesis. You can make it available to others via your Web page or an online paper archive.

How to Find Research Problems III: Build a large program or device of some kind. This gets you some name recognition, since there aren't that many big systems out there, and it also confirms your ability as a software engineer. However, do consider carefully: Will this system be of direct use to anyone? If not, will it at least beat performance records? If again not, does it have other merits, such as demonstrating how to integrate or scale up existing techniques, or introducing a collection of new techniques or a new perspective? If you are only one of many participants in a lab project, be sure that you make a "separable contribution" -- some piece of the work that is impressive, that stands alone, and that people will associate you with.

How to Find Research Problems IV: Your field identifies various problems or issues as significant. These often represent big bones in the skeleton of the field -- problems that arise often, and whose solution makes a difference. Get to know some of these problems and the work that's been done on them. If you see how to achieve the first-ever solution, or a better solution, or a different style of solution, that's a big deal. Sometimes finding a good solution involves changing the problem slightly. If you are feeling ambitious and have a big-bone temperament, study important papers in your branch of computer science, flip through some conference proceedings to see what people are working on, and ask: What problems (recognized or unrecognized) are obstructing progress in my field? Can I solve them? If not, can I at least formalize them? Can I prove to my colleagues that solving them would make a difference?

How to Find Research Problems V: Talk to your advisor about problems that are ripe for the plucking. Every field has its share of problems that everyone knows are "kinda important," and that may even get mentioned a lot, but on which no one has yet made a serious attempt. If you think you spot such a problem, use your colleagues and the library to make sure it hasn't been plucked yet.

How to Find Research Problems VI: Finally, you can identify new interesting problems. This is often not as hard as it might sound: Study existing (applied) systems and note what they do badly. If your field is interdisciplinary, ask people in the other discipline what they think is interesting. In fact, ask them why they think computer scientists are irrelevant. In many areas, the data have a way of suggesting their own problems. Systems programmers can collect data on actual disk access patterns and study them for regularities to exploit. Theoreticians of programming languages can look at real programming languages, and graphics programmers can look at real photographs and movies, for effects that they don't know how to capture.

Final Advice: Everything you do is open-ended. That means you can easily spend too much time on any task you start, especially if stubborn perfectionism or an inferiority complex leads you to feel that your work is never good enough, or if you're subconsciously trying to put off that scary next phase of your research. Don't spend eternity on background reading. Recognize that you will have to start your work in a state of partial ignorance: you don't have time to learn everything you need to know. In fact it's good, since ignorance leaves your mind free to see new ways of doing things. So start doing your own thinking early. You can alternate that with reading: just show your ideas periodically to someone who can warn you about related work and point you to relevant papers. Don't spend eternity on one problem. No solution is ever complete. Take the time to make your work solid and beautiful and presentable, but recognize when you've hit a point of diminishing returns. Use project #1 to inspire project #2, which stands as research on its own. Don't use it as the core of project #1', #1'', etc. forever.

Basic Techniques and Algorithms of Data Mining: Data pre-processing; association/sequential pattern mining; classification; prediction; clustering; anomaly detection.

10 Well-Known Algorithms in Data Mining

10 Challenging Problems in Data Mining

Paradoxes on Data Mining

Meaningfulness of Answers: A big risk when data mining is that you will “discover” patterns that are meaningless. Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap. The following slides are from Anand Rajaraman at Stanford.

Examples: A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy. The Rhine Paradox: a great example of how not to conduct scientific research.

Rhine Paradox (1): Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception. He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine Paradox (2): He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP. What did he conclude? Answer on next slide.

Rhine Paradox (3): He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
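
The arithmetic behind the paradox can be checked with a short sketch. It assumes each red/blue guess is an independent fair coin flip; the slides do not say how many subjects were tested, so `n_subjects` below is a hypothetical value chosen only for illustration.

```python
from fractions import Fraction

# Probability of guessing all 10 red/blue cards correctly by pure chance.
p_all_right = Fraction(1, 2) ** 10  # 1/1024, i.e. "almost 1 in 1000"

# Hypothetical number of test subjects (not stated in the slides).
n_subjects = 100_000

# Expected number of people who look like they have "ESP" in round one.
expected_round_1 = n_subjects * p_all_right

# Expected number who would also ace an identical second test by chance.
expected_round_2 = expected_round_1 * p_all_right

print(f"P(10 of 10 by chance) = {p_all_right}")
print(f"Expected round-one 'psychics': {float(expected_round_1):.1f}")
print(f"Expected to repeat in round two: {float(expected_round_2):.3f}")
```

Under these assumptions, chance alone produces both observations: roughly 1 in 1000 subjects pass the first test, and almost none of them pass the second.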

Example: Bonferroni’s Principle. This example illustrates a problem with intelligence-gathering. Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil. We want to find people who at least twice have stayed at the same hotel on the same day.

The Details: $10^{9}$ people being tracked. 1000 days. Each person stays in a hotel 1% of the time (10 days out of 1000). Hotels hold 100 people (so $10^{5}$ hotels). If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

A small quiz: What is the expected number of suspicious pairs of people?

Calculations (1): Probability that persons p and q will be at the same hotel on day d: $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$. Probability that p and q will be at the same hotel on two given days: $10^{-9} \times 10^{-9} = 10^{-18}$. Pairs of days: $5 \times 10^{5}$.

Calculations (2): Probability that p and q will be at the same hotel on some two days: $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$. Pairs of people: $5 \times 10^{17}$. Expected number of suspicious pairs of people: $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$.
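
The calculation above can be reproduced with a minimal sketch. The variable names are illustrative, and the code follows the same approximation as the slides (summing the probability over all pairs of days). It uses exact binomial counts rather than the rounded $5 \times 10^{5}$ and $5 \times 10^{17}$, so the result comes out slightly below 250,000.

```python
from fractions import Fraction
from math import comb

# Parameters taken from "The Details" slide.
n_people = 10**9
n_days = 1000
n_hotels = 10**5
p_in_hotel = Fraction(1, 100)  # each person is in some hotel 1% of days

# p and q are both in a hotel on day d, and it is the same hotel.
p_same_hotel_one_day = p_in_hotel * p_in_hotel * Fraction(1, n_hotels)

# The same coincidence happens on two specific days.
p_same_hotel_two_given_days = p_same_hotel_one_day ** 2

# Number of ways to pick the two days and the two people.
pairs_of_days = comb(n_days, 2)
pairs_of_people = comb(n_people, 2)

# Expected number of "suspicious" pairs when everyone behaves randomly.
expected_pairs = pairs_of_people * pairs_of_days * p_same_hotel_two_given_days

print(f"P(same hotel on one day)  = {float(p_same_hotel_one_day):.0e}")
print(f"Pairs of days             = {pairs_of_days:,}")
print(f"Pairs of people           = {pairs_of_people:,}")
print(f"Expected suspicious pairs = {float(expected_pairs):,.0f}")
```

Exact fractions are used here only to avoid floating-point noise in the tiny probabilities; plain floats would give the same order of magnitude.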

Conclusion: Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice. Analysts have to sift through 250,010 candidates to find the 10 real cases. Not gonna happen. But how can we improve the scheme?

Moral: When looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that there are not so many possibilities that random data will surely produce facts “of interest.”