1
CAP6778: Advanced Data Mining, Fall 2010
Dr. Tao Li, Florida International University
2
Self-Introduction
Ph.D. in Computer Science from University of Rochester, 2004
Research interests: data mining, machine learning, information retrieval, bioinformatics
Associate Professor in the School of Computer Science at Florida International University
3
Student Self-Introduction
Get to know your peers!
Name
Major and academic status
Programming skills
- Are you familiar with Matlab?
- Are you familiar with Java/C++?
Research interest
Background
- Did you take any data mining course before?
Purpose
- Why do you want to take the course?
- What do you want to learn?
Anything you want us to know
4
Course Overview
Meeting time: T Th 12:30pm – 1:45pm
Office hours: T Th 5:00pm – 6:00pm or by appointment
Course Webpage:
- Lecture Notes and Assignments
- Username: CAP6778
- Password: student
5
Course Objectives
This is a seminar course that will focus on recent developments in advanced data mining techniques and their applications to various problems.
6
Tentative Course Syllabus
- Large-scale Data Mining using Map-Reduce
- Similarity Search (including minwise hashing and locality sensitive hashing)
- Mining Data Streams
- Mining Social Networks
- Relational Data Mining
- Tree/Graph Mining
- Privacy-preserving Data Mining
- High-Dimensional Data Clustering
- Basics of Natural Language Processing
- Web Applications (including advertising, recommendation, and summarization)
7
Course Material
Prerequisites: COP5577 Principles of Data Mining
Textbooks and References: The course materials will mainly consist of research papers closely related to the topics in data mining. A lot of reading material from top conferences and journals will be made available online or in class as required. In addition, lecture notes will be available online.
8
Grading
NO EXAMS!
- Class Participation (5%)
- Assignments (25%)
- A Research Project (70%)
Expectation: A research paper to a major conference?!
9
Grades
Do your best work and don't worry about it, but pay attention to all the feedback you get, including comments and advice.
What really matters: Your GPA will not ultimately be the measure of your success in grad school. What matters most is that you acquire real skills, publish influential papers, write a thesis, and get a job.
You are learning to do research in Data Mining.
10
No Plagiarism
If you use anyone else's ideas, you must cite the source.
If you use anyone else's words, you must indicate that you are quoting and give the source of the quote. Don't quote unless the wording of the quote is as significant as its content (e.g., the quote is particularly funny or memorable, or you are going to attack it as unclear).
11
How to do research in data mining?
How to have a good career as a graduate student?
How to find research problems?
Learn about data mining
- Master basic techniques and algorithms
- Understand challenging issues in data mining
- Discuss emerging research issues in data mining
- Study important research papers in data mining
12
How to have a bad career as a graduate student
13
How to find Research Problems
The following slides are copied from Dr. Jason Eisner’s advice for graduate students.
14
How to Find Research Problems I
About the smallest bone that you can find in Computer Science is a replication or implementation of someone else's work. While this doesn't get you points for originality, it may be useful, both to your education and to the field. If you can make it useful to enough people (say, by making it portable and Web-available), it might even get your name known.

A significant small bone to look for is a tweak that improves a well-known technique. (In many subfields, you will be expected to demonstrate objectively that your method is an improvement.) Much research is of this kind.

When reading papers, stay on the lookout for such bones. In particular, notice when the author may be making harmful simplifications or arbitrary choices in his/her approach. These are opportunities for you to try something different.

Along the same lines, you might make a controlled comparison of two or more algorithms, evaluating them by some objective measure of efficiency or accuracy. Designing a clean comparison does take thought, and carrying it out is often a lot of work.
15
How to Find Research Problems II
You can thoroughly review the existing research in some area. Note that this takes a good deal of time to do well, and is not likely to do much for your career unless a lot of people read and cite your lit review.

On the upside, writing a lit review will make you something of an expert, able to talk confidently with other researchers in the area; it will give you an idea of the shortcomings of past research; and it may suffice for an M.S.E. thesis, or the first part of a Ph.D. thesis. You can make it available to others via your Web page or an online paper archive.
16
How to Find Research Problems III
Build a large program or device of some kind. This gets you some name recognition, since there aren't that many big systems out there, and it also confirms your ability as a software engineer. However, do consider carefully:
- Will this system be of direct use to anyone?
- If not, will it at least beat performance records?
- If again not, does it have other merits, such as demonstrating how to integrate or scale up existing techniques, or introducing a collection of new techniques or a new perspective?

If you are only one of many participants in a lab project, be sure that you make a "separable contribution" -- some piece of the work that is impressive, that stands alone, and that people will associate you with.
17
How to Find Research Problems IV
Your field identifies various problems or issues as significant. These often represent big bones in the skeleton of the field -- problems that arise often, and whose solution makes a difference. Get to know some of these problems and the work that's been done on them. If you see how to achieve the first-ever solution, or a better solution, or a different style of solution, that's a big deal. Sometimes finding a good solution involves changing the problem slightly.

If you are feeling ambitious and have a big-bone temperament, study important papers in your branch of computer science, flip through some conference proceedings to see what people are working on, and ask:
- What problems (recognized or unrecognized) are obstructing progress in my field?
- Can I solve them?
- If not, can I at least formalize them?
- Can I prove to my colleagues that solving them would make a difference?
18
How to Find Research Problems V
Talk to your advisor about problems that are ripe for the plucking. Every field has its share of problems that everyone knows are "kinda important," and that may even get mentioned a lot, but on which no one has yet made a serious attempt. If you think you spot such a problem, use your colleagues and the library to make sure it hasn't been plucked yet.
19
How to Find Research Problems VI
Finally, you can identify new interesting problems. This is often not as hard as it might sound:
- Study existing (applied) systems and note what they do badly at.
- If your field is interdisciplinary, ask people in the other discipline what they think is interesting. In fact, ask them why they think computer scientists are irrelevant.
- In many areas, the data have a way of suggesting their own problems. Systems programmers can collect data on actual disk access patterns and study it for regularities to exploit. Theoreticians of programming languages can look at real programming languages, and graphics programmers can look at real photographs and movies, for effects that they don't know how to capture.
20
Final Advice
Everything you do is open-ended. That means you can easily spend too much time on any task you start, especially if stubborn perfectionism or an inferiority complex leads you to feel that your work is never good enough, or if you're subconsciously trying to put off that scary next phase of your research.

Don't spend eternity on background reading. Recognize that you will have to start your work in a state of partial ignorance: you don't have time to learn everything you need to know. In fact it's good, since ignorance leaves your mind free to see new ways of doing things. So start doing your own thinking early. You can alternate that with reading: just show your ideas periodically to someone who can warn you about related work and point you to relevant papers.

Don't spend eternity on one problem. No solution is ever complete. Take the time to make your work solid and beautiful and presentable, but recognize when you've hit a point of diminishing returns. Use project #1 to inspire project #2, which stands as research on its own. Don't use it as the core of project #1', #1'', etc. forever.
21
Basic Techniques and Algorithms of Data Mining
- Data pre-processing
- Association/sequential pattern mining
- Classification
- Prediction
- Clustering
- Anomaly detection
22
10 Well-Known Algorithms in Data Mining
23
10 Challenging Problems in Data Mining
24
Paradoxes on Data Mining
25
Meaningfulness of Answers
A big risk when data mining is that you will “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
The following slides are from Anand Rajaraman at Stanford.
26
Examples
A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy.
The Rhine Paradox: a great example of how not to conduct scientific research.
27
Rhine Paradox --- (1)
Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
He devised (something like) an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
28
Rhine Paradox --- (2)
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude? Answer on next slide.
29
Rhine Paradox --- (3)
He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
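The arithmetic behind the Rhine Paradox can be sketched in a few lines. This is only an illustration, assuming each of the 10 red/blue guesses is an independent fair coin flip; the pool of 10,000 subjects is a hypothetical figure chosen for the example, not a number from the slides.

```python
from fractions import Fraction

def p_all_correct(num_cards=10, num_colors=2):
    """Chance of guessing every card right by luck alone (assumes fair, independent guesses)."""
    return Fraction(1, num_colors) ** num_cards

p_lucky = p_all_correct()          # exact probability of a perfect score by chance
subjects = 10_000                  # hypothetical number of test subjects

# Expected number of subjects with a perfect score in the first test.
expected_perfect_first = subjects * p_lucky

# Expected number who would ALSO score perfectly in an independent retest.
expected_perfect_both = expected_perfect_first * p_lucky

print(p_lucky)                         # 1/1024
print(float(expected_perfect_first))   # a handful of "ESP" subjects by luck alone
print(float(expected_perfect_both))    # almost none survive the retest
```

Under these assumptions, a perfect score is what pure chance predicts for roughly 1 subject in 1000, and the "loss" of ESP on the retest is simply the same 1/1024 odds applied a second time.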
30
Example: Bonferroni’s Principle
This example illustrates a problem with intelligence-gathering.
Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
We want to find people who at least twice have stayed at the same hotel on the same day.
31
The Details
$10^{9}$ people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (10 days out of 1000).
Hotels hold 100 people (so $10^{5}$ hotels).
If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
32
A small quiz
What is the expected number of suspicious pairs of people?
33
Calculations --- (1)
Probability that persons p and q will be at the same hotel on day d: $\frac{1}{100} \times \frac{1}{100} \times 10^{-5} = 10^{-9}$.
Probability that p and q will be at the same hotel on two given days: $10^{-9} \times 10^{-9} = 10^{-18}$.
Pairs of days: $5 \times 10^{5}$.
34
Calculations --- (2)
Probability that p and q will be at the same hotel on some two days: $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.
Pairs of people: $5 \times 10^{17}$.
Expected number of suspicious pairs of people: $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$.
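The calculation above can be checked with a short script. This is a sketch that takes the slide's figures as $10^{9}$ people, 1000 days, $10^{5}$ hotels, and a 1% chance of being in a hotel on any day. The constant and variable names are illustrative. It computes both the slide's rounded estimate (approximating the number of pairs of n items as $n^{2}/2$) and the exact binomial count, so the size of the rounding error is visible.

```python
from fractions import Fraction
from math import comb

PEOPLE = 10**9                 # people being tracked
DAYS = 1000                    # days of observation
HOTELS = 10**5                 # number of hotels
P_IN_HOTEL = Fraction(1, 100)  # chance a given person is in some hotel on a given day

# p and q are both in a hotel on day d, and it is the same hotel.
p_same_hotel_one_day = P_IN_HOTEL * P_IN_HOTEL * Fraction(1, HOTELS)

# The same coincidence on two specific days (days assumed independent).
p_same_hotel_two_given_days = p_same_hotel_one_day ** 2

# The slides approximate "pairs of n items" by n^2 / 2.
approx_day_pairs = DAYS**2 // 2
approx_people_pairs = PEOPLE**2 // 2
expected_approx = approx_people_pairs * approx_day_pairs * p_same_hotel_two_given_days

# Same estimate with exact pair counts, C(n, 2) = n(n-1)/2.
expected_exact = comb(PEOPLE, 2) * comb(DAYS, 2) * p_same_hotel_two_given_days

print(expected_approx)        # 250000
print(float(expected_exact))  # slightly below 250000
```

Multiplying the per-pair-of-days probability by the number of day pairs slightly overcounts pairs of people who coincide more than twice, but at these probabilities the effect is negligible.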
35
Conclusion
Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
Analysts have to sift through 250,010 candidates to find the 10 real cases.
Not gonna happen.
But how can we improve the scheme?
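To make the conclusion concrete, the sketch below estimates what fraction of flagged pairs would be real, and what happens if the rule is tightened. Requiring three shared stays instead of two is only one hypothetical tightening, chosen here for illustration; the slides do not prescribe an answer. The function name and parameters are assumptions, and the exact pair counts give a figure slightly below the slides' rounded 250,000.

```python
from math import comb

def expected_chance_pairs(min_shared_stays, people=10**9, days=1000,
                          hotels=10**5, p_in_hotel=0.01):
    """Expected number of pairs of people flagged purely by chance.

    Assumes everyone picks days and hotels independently at random,
    as in the slides' model.
    """
    # Both people are in a hotel on a given day, and it is the same hotel.
    p_one_day = p_in_hotel * p_in_hotel / hotels
    # Pairs of people x choices of days x probability of coinciding on all of them.
    return comb(people, 2) * comb(days, min_shared_stays) * p_one_day ** min_shared_stays

false_pairs = expected_chance_pairs(2)             # the slides' rule: two shared stays
real_pairs = 10                                    # the slides' "say, 10" real cases
precision = real_pairs / (real_pairs + false_pairs)

stricter_false_pairs = expected_chance_pairs(3)    # hypothetical rule: three shared stays

print(f"chance pairs with 2 shared stays: {false_pairs:,.0f}")
print(f"fraction of flagged pairs that are real: {precision:.6f}")
print(f"chance pairs with 3 shared stays: {stricter_false_pairs:.3f}")
```

Under this model, the two-stay rule buries the real cases among hundreds of thousands of coincidences, while a stricter property leaves well under one expected chance match. Whether real evil-doers would still satisfy the stricter property is a separate question.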
36
Moral
When looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that there are not so many possibilities that random data will surely produce facts “of interest.”