Course on Data Mining (581550-4): Seminar Meetings

1 Course on Data Mining (581550-4): Seminar Meetings
Topics: Association Rules, Episodes, Text Mining, Clustering, KDD Process, Home Exam
Meeting dates: 02.11., 09.11., 16.11., 23.11., 30.11.
Legend: M = seminar by Mika, P = seminar by Pirjo

2 Today 16.11.2001
- R. Feldman, M. Fresko, H. Hirsh, et al.: "Knowledge Management: A Text Mining Approach", Proc. of the 2nd Int'l Conf. on Practical Aspects of Knowledge Management (PAKM98), 1998.
- B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1997.

3 Good to Read as Background
Both papers refer to the Agrawal and Srikant paper we had last week:
- Rakesh Agrawal and Ramakrishnan Srikant: "Mining Sequential Patterns", Int'l Conference on Data Engineering, 1995.

4 Knowledge Management: A Text Mining Approach
R. Feldman, M. Fresko, H. Hirsh, et al.
Bar-Ilan University and Instinct Software, Israel; Rutgers University, USA; LIA-EPFL, Switzerland
Published in PAKM'98 (Int'l Conf. on Practical Aspects of Knowledge Management)
Data Mining course, Autumn 2001, University of Helsinki
Summary by Mika Klemettinen

5 KM: A Text Mining Approach
Basic idea (see selected phases on the next slides):
1. Get input data in SGML (or XML) format. Select only the contents of the desired elements (title, abstract, etc.).
2. Do linguistic preprocessing:
   2.1 Term extraction (use linguistic software for this)
   2.2 Term generation (combine adjacent terms that match morpho-syntactic patterns like "noun-noun", "adj.-noun", etc. by calculating association coefficients)
   2.3 Term filtering (select only the top M most frequent ones)
3. Create taxonomies (there is a tool for this)
4. Generate associations (you may constrain the creation)
5. Visualize/explore the results
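
Steps 2.2 and 2.3 can be illustrated with a small sketch. It is not the tool described by Feldman et al.: the slide does not say which association coefficient is used, so the sketch assumes pointwise mutual information (PMI). The function names, the `min_score` threshold and the toy input are all hypothetical.

```python
import math
from collections import Counter


def generate_terms(tagged_docs, patterns, min_score=0.0):
    """Step 2.2 (sketch): combine adjacent words whose tags match a pattern.

    The association coefficient is PMI, an assumption made for illustration.
    Returns a dict mapping each accepted two-word term to its frequency.
    """
    unigrams, bigrams = Counter(), Counter()
    for doc in tagged_docs:
        unigrams.update(word for word, _tag in doc)
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if (t1, t2) in patterns:
                bigrams[(w1, w2)] += 1
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    terms = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        if math.log(p_pair / p_independent) > min_score:
            terms[f"{w1} {w2}"] = count
    return terms


def filter_terms(term_counts, m):
    """Step 2.3 (sketch): keep only the M most frequent terms."""
    return [term for term, _count in Counter(term_counts).most_common(m)]


# Hypothetical toy input: documents as lists of (word, part-of-speech) pairs.
docs = [
    [("data", "noun"), ("mining", "noun"), ("finds", "verb"),
     ("useful", "adj"), ("patterns", "noun")],
    [("data", "noun"), ("mining", "noun"), ("needs", "verb"),
     ("clean", "adj"), ("data", "noun")],
]
patterns = {("noun", "noun"), ("adj", "noun")}

terms = generate_terms(docs, patterns)
top_terms = filter_terms(terms, 1)
```

In this toy input only "data mining" occurs twice, so it is the single term kept when M = 1.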

6 2.1: Term Extraction

7 3: Taxonomy Construction

8 4: Association Rule Generation

9 4: Association Rule Generation

10 5.1: Visualization/Exploration

11 5.2: Visualization/Exploration

12 Discovering Trends in Text Databases
Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, USA
Published in KDD'97
Data Mining course, Autumn 2001, University of Helsinki
Summary by Mika Klemettinen

13 Discovering Trends in Text Databases
Basic ideas:
- Identify frequent phrases using sequential pattern mining (see the slides and summaries from the Agrawal and Srikant paper "Mining Sequential Patterns" (MSP))
- Generate histories of phrases
- Find phrases that satisfy a specified trend
Definitions:
- Phrase: a phrase p is <(w1)(w2) … (wn)>, where each w is a word
- 1-phrase: <<(IBM)> <(data)(mining)>>
- 2-phrase: <<(IBM)> <(data)(mining)>> <<(Anderson)(Consulting)> <(decision)(support)>>
- Itemset, sequence, "is contained", etc.: as in the MSP paper
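
The last two basic ideas (phrase histories and trend selection) can be sketched roughly as below. This is only an illustration under simplifying assumptions: phrases are plain strings, a history is the fraction of documents per period that contain the phrase, and the "trend" is a hard-coded strictly rising check rather than a query in the paper's shape language. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def phrase_histories(partitions):
    """Compute, per phrase, its support in each time period.

    `partitions` maps a period label to a non-empty list of documents,
    each document given as the set of phrases found in it.
    """
    periods = sorted(partitions)
    counts = defaultdict(lambda: [0] * len(periods))
    for i, period in enumerate(periods):
        for doc in partitions[period]:
            for phrase in doc:
                counts[phrase][i] += 1
    histories = {
        phrase: [c / len(partitions[p]) for c, p in zip(per_period, periods)]
        for phrase, per_period in counts.items()
    }
    return periods, histories


def is_upward(history):
    """Stand-in for a trend query: support rises in every period."""
    return all(a < b for a, b in zip(history, history[1:]))


# Hypothetical toy data: four documents per year.
partitions = {
    1995: [{"data mining"}, {"decision support"},
           {"decision support"}, {"decision support"}],
    1996: [{"data mining"}, {"data mining"},
           {"decision support"}, {"decision support"}],
    1997: [{"data mining"}, {"data mining"},
           {"data mining"}, {"decision support"}],
}

periods, histories = phrase_histories(partitions)
trending = sorted(p for p, h in histories.items() if is_upward(h))
```

Here "data mining" has the history 0.25, 0.5, 0.75 and is the only phrase selected by the upward check.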

14 Discovering Trends in Text Databases
Gaps:
- Minimum and maximum gaps between adjacent words identify relations of words/phrases inside sentences/paragraphs, between words/phrases in different paragraphs, between words/phrases in different sections, etc.
- Sentence boundary: 1,000
- Paragraph boundary: 100,000
- Section boundary: 10,000,000
Phases:
- Partition the data/documents based on their time stamps and create phrases for each partition (Lent et al. use patent data)
- Select the frequent phrases and save their frequencies
- Define shape queries using SDL (Shape Definition Language)
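
One possible reading of the gap scheme is sketched below. It assumes that words are numbered consecutively and that the numbering jumps by the boundary value whenever a sentence, paragraph or section boundary is crossed, so that a maximum gap below 1,000 keeps two words inside one sentence. The slide does not spell out this numbering, so treat it as an assumption; the function names and the toy document are hypothetical.

```python
SENTENCE, PARAGRAPH, SECTION = 1_000, 100_000, 10_000_000


def number_words(sections):
    """Assign a position to every word of a nested document.

    `sections` is a list of sections; a section is a list of paragraphs,
    a paragraph a list of sentences, a sentence a list of words.
    All containers are assumed to be non-empty.
    """
    positions = []
    pos = 0
    for s_idx, section in enumerate(sections):
        for p_idx, paragraph in enumerate(section):
            for t_idx, sentence in enumerate(paragraph):
                # Add the jump for the largest boundary just crossed.
                if t_idx > 0:
                    pos += SENTENCE
                elif p_idx > 0:
                    pos += PARAGRAPH
                elif s_idx > 0:
                    pos += SECTION
                for word in sentence:
                    pos += 1
                    positions.append((word, pos))
    return positions


def within_gap(pos_a, pos_b, min_gap, max_gap):
    """Check the min/max gap constraint between two word positions."""
    return min_gap <= pos_b - pos_a <= max_gap


# Hypothetical toy document: one section, two paragraphs.
document = [[
    [["data", "mining", "works"], ["it", "finds", "trends"]],
    [["patents", "change"]],
]]

positions = number_words(document)
lookup = dict(positions)
adjacent = within_gap(lookup["data"], lookup["mining"], 1, 1)
same_sentence = within_gap(lookup["data"], lookup["finds"], 1, SENTENCE - 1)
same_paragraph = within_gap(lookup["data"], lookup["finds"], 1, PARAGRAPH - 1)
across_paragraphs = within_gap(lookup["data"], lookup["patents"], 1, PARAGRAPH - 1)
```

With this numbering, "data" and "finds" fail the same-sentence constraint but satisfy the same-paragraph one, while "data" and "patents" fail both.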

15 Discovering Trends in Text Databases

16 Discovering Trends in Text Databases

17 Discovering Trends in Text Databases

