
1 Text Mining IS698 Min Song

2 Example 1: KM People Finder  The Needs: -Find people as well as documents that can address my information need -Promote collaboration and knowledge sharing -Leverage existing information access system  The Information Sources: -Email, groupware, online reports, …

3 Example 1: Simple KM People Finder  [Diagram: Query → Search or Navigation System → Relevant Docs → Name Extractor (with Authority List) → Ranked People Names]

4 Example 1: KM People Finder

5 Text Mining Definition  Many definitions in the literature: “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.” An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.

6 Text Mining Definition  What is “previously unknown” information? Strict definition: information that not even the writer knows, e.g., discovering a new method for hair growth that is described as a side effect of a different procedure. Lenient definition: rediscovering information that the author encoded in the text, e.g., automatically extracting a product’s name from a web page.

7 Outline  Text mining applications  Text characteristics  Text mining process  Learning methods

8 Text Mining Applications  Marketing: discover distinct groups of potential buyers according to a text-based user profile, e.g., Amazon  Industry: identifying groups of competitors’ web pages, e.g., competing products and their prices  Job seeking: identify parameters in searching for jobs, e.g., www.flipdog.com

9 Text Mining Methods  Information Retrieval: indexing and retrieval of textual documents  Information Extraction: extraction of partial knowledge in the text  Web Mining: indexing and retrieval of textual documents and extraction of partial knowledge using the Web  Clustering: generating collections of similar text documents

10 Information Retrieval  Given: a source of textual documents and a user query (text-based, e.g., “spam”)  Find: a ranked set of documents that are relevant to the query  [Diagram: Query + Documents source → IR System → Ranked Documents]
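The retrieval step on this slide can be sketched with a toy ranking function: score each document by how many query terms it contains and return the best matches first. The documents and query below are made-up illustrations, not from the lecture.

```python
# Toy ranked retrieval: score each document by the number of query
# terms it shares with the document, then sort best-first.

def rank_documents(query, documents):
    """Return (doc_id, score) pairs, best match first."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        doc_terms = set(text.lower().split())
        score = len(query_terms & doc_terms)  # shared-term count
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: -pair[1])

docs = {
    "d1": "spam filtering with naive bayes",
    "d2": "football match report",
    "d3": "filtering spam email with rules",
}
print(rank_documents("spam filtering", docs))
```

Real IR systems weight terms (e.g., TF-IDF) rather than counting them equally; this sketch only shows the query-to-ranked-list shape of the slide's diagram.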

11 Intelligent Information Retrieval  Meaning of words: synonyms (“buy” / “purchase”); ambiguity (“bat”: baseball vs. mammal)  Order of words in the query: “hot dog stand in the amusement park” vs. “hot amusement stand in the dog park”  User dependency for the data: direct feedback; indirect feedback  Authority of the source: IBM is more likely to be an authoritative source than my second cousin

12 What is Information Extraction?  Given: a source of textual documents and a well-defined, limited query (text-based)  Find: sentences with relevant information; extract the relevant information and ignore the non-relevant (important!); link related information and output it in a predetermined format

13 Information Extraction: Example  “Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.”  Extracted template: Incident Date: 19 Apr 89  Incident Type: Bombing  Perpetrator Individual ID: “urban guerillas”  Human Target Name: “Roberto Garcia Alvarado” ...
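The template-filling idea above can be illustrated with hand-written regular expressions over one sentence of the slide's excerpt. The slot names and patterns are my own toy assumptions; real extraction systems use far richer patterns and linking.

```python
# Toy template filling: pull two slots out of the slide's sentence
# with illustrative (hand-crafted) regular expressions.
import re

text = ("Garcia Alvarado, 56, was killed when a bomb placed by urban "
        "guerillas on his vehicle exploded as it came to a halt at an "
        "intersection in downtown San Salvador.")

template = {}
m = re.search(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*), (\d+), was killed", text)
if m:
    template["Human Target Name"] = m.group(1)
    template["Human Target Age"] = int(m.group(2))
m = re.search(r"bomb placed by ([a-z ]+?) on", text)
if m:
    template["Perpetrator"] = m.group(1)
print(template)
```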

14 What is Information Extraction?  [Diagram: Documents source → Extraction System, driven by Query 1 (e.g., job title) and Query 2 (e.g., salary); the extracted Relevant Info 1, 2, 3 are combined into the query results]

15 Why Mine the Web?  Enormous wealth of textual information on the Web: book/CD/video stores (e.g., Amazon); restaurant information (e.g., Zagats); car prices (e.g., Carpoint)  Lots of data on user access patterns: Web logs contain the sequence of URLs accessed by users  Possible to retrieve “previously unknown” information: people who ski also frequently break their legs; restaurants that serve seafood in California are likely to be outside San Francisco

16 Mining the Web  [Diagram: a Web Spider feeds the Documents source; Query → IR / IE System → Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, …)]

17 Unique Features of the Web  The Web is a huge collection of documents, many of which contain hyper-link information and access/usage information  The Web is very dynamic: Web pages are constantly being generated (and removed)  Challenge: develop new Web mining algorithms that exploit hyper-links and access patterns and adapt to their document sources

18 Intelligent Web Search  Combine the intelligent IR tools: meaning of words; order of words in the query; user dependency for the data; authority of the source  With the unique Web features: retrieve hyper-link information; utilize hyper-links as input

19 What is Clustering?  Given: a source of textual documents and a similarity measure, e.g., how many words are common in these documents  Find: several clusters of documents that are relevant to each other  [Diagram: Documents source + Similarity measure → Clustering System → clusters of related documents]
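The slide's "how many words are common" measure can be normalized into the Jaccard coefficient: shared words divided by total distinct words, so longer documents don't automatically look more similar. This is one common choice of similarity measure, not the only one.

```python
# Jaccard similarity between two documents treated as word sets:
# |intersection| / |union|, in [0, 1].

def jaccard(doc_a, doc_b):
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 1.0  # two empty documents: treat as identical
    return len(a & b) / len(a | b)

print(jaccard("the cat sat", "the cat ran"))  # 2 shared / 4 distinct
```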

20 Outline  Text mining applications  Text characteristics  Text mining process  Learning methods

21 Text characteristics: Outline  Large textual data base  High dimensionality  Several input modes  Dependency  Ambiguity  Noisy data  Not well structured text

22 Text characteristics  Large textual databases: efficiency considerations; over 2,000,000,000 web pages; almost all publications are also in electronic form  High dimensionality (sparse input): consider each word/phrase as a dimension  Several input modes: e.g., in Web mining, information about a user is generated by semantics, browse patterns, and outside knowledge bases

23 Text characteristics  Dependency: relevant information is a complex conjunction of words/phrases, e.g., document categorization, pronoun disambiguation  Ambiguity: word ambiguity (pronouns such as “he”, “she”; synonyms such as “buy”, “purchase”); semantic ambiguity (“The king saw the rabbit with his glasses.”)

24 Text characteristics  Noisy data: e.g., spelling mistakes  Not well-structured text: chat rooms (“r u available ?”, “Hey whazzzzzz up”); speech

25 Outline  Text mining applications  Text characteristics  Text mining process  Learning methods

26 Text mining process

27  Text preprocessing: syntactic/semantic text analysis  Feature generation: bag of words  Feature selection: simple counting; statistics  Text/data mining: classification (supervised learning); clustering (unsupervised learning)  Analyzing results
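The feature-generation and feature-selection phases above can be sketched in a few lines: build bag-of-words counts, then keep only words whose total count reaches a threshold (the "simple counting" selection named on the slide). The example documents and the threshold are illustrative assumptions.

```python
# Bag-of-words feature generation plus simple-counting feature
# selection: keep words appearing at least min_count times overall.
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def select_features(bags, min_count=2):
    total = Counter()
    for bag in bags:
        total.update(bag)
    return {word for word, count in total.items() if count >= min_count}

docs = ["spam spam offer", "free offer now", "meeting agenda"]
bags = [bag_of_words(d) for d in docs]
print(select_features(bags))
```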

28 Syntactic / Semantic text analysis  Part-of-speech (POS) tagging: find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun); ~98% accurate  Word sense disambiguation: context-based or proximity-based; very accurate  Parsing: generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph

29 Text Mining: Classification definition  Given: a collection of labeled records (a training set); each record contains a set of features (attributes) and the true class (label)  Find: a model for the class as a function of the values of the features  Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model; usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

30 Text Mining: Clustering definition  Given: a set of documents and a similarity measure among documents, e.g., how many words are common in these documents; Euclidean distance if attributes are continuous; other problem-specific measures  Find: clusters such that documents in one cluster are more similar to one another, and documents in separate clusters are less similar to one another  Goal: finding a correct set of clusters
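The Euclidean distance mentioned on this slide, applied to documents represented as word-count vectors over their shared vocabulary (the two tiny documents are made-up examples):

```python
# Euclidean distance between two word-count dictionaries.
import math

def euclidean(counts_a, counts_b):
    vocab = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a.get(w, 0) - counts_b.get(w, 0)) ** 2
                         for w in vocab))

d1 = {"ball": 2, "game": 1}
d2 = {"ball": 1, "game": 1, "score": 1}
print(euclidean(d1, d2))  # sqrt((2-1)^2 + 0^2 + 1^2) = sqrt(2)
```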

31 Supervised vs. Unsupervised Learning  Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set  Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

32 Evaluation: What Is Good Classification?  Correct classification: the known label of a test sample is identical to the class predicted by the classification model  Accuracy ratio: the percentage of test-set samples that are correctly classified by the model  A distance measure between classes can be used: e.g., classifying a “football” document as “basketball” is not as bad as classifying it as “crime”
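The accuracy ratio defined above is just the fraction of test samples whose predicted class matches the known label; the labels below are made-up examples echoing the slide's football/crime illustration.

```python
# Accuracy ratio: correctly classified test samples / all test samples.

def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

truth = ["football", "crime", "basketball", "football"]
pred  = ["football", "crime", "football",   "crime"]
print(accuracy(truth, pred))  # 2 of 4 correct -> 0.5
```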

33 Evaluation: What Is Good Clustering?  A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity  The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

34 Outline  Text mining applications  Text characteristics  Text mining process  Learning methods Classification Clustering

35 Classification: An Example  [Diagram: a training set with categorical, continuous, and class attributes is fed to a learning algorithm to build a model (classifier), which is then applied to a test set]

36 Text Classification: An Example  [Diagram: a training set of texts with class labels is used to learn a classifier, which is then applied to a test set of texts]

37 Classification Techniques  Instance-Based Methods  Decision trees  Neural networks  Bayesian classification

38 Instance-based Methods  Instance-based (memory-based) learning: store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified  k-nearest-neighbor approach: instances (examples) are represented as points in a Euclidean space

39 Text Examples in Euclidean Space  “The English football fan is a hooligan…”  “Similar to his English equivalent, the Italian football fan is a hooligan…”  Shared words such as “football” place the two documents close together in the space

40 n  All instances correspond to points in the n- D space  The nearest neighbor are defined in terms of Euclidean distance. _ + + ? + _ _ + _ _ + _ + + + + _ _ + _ _ + k-NNkThe k-NN returns the most common value among the k nearest training examples 1-NNVoronoi diagram: the decision surface induced by 1-NN for a typical set of training examples K-Nearest Neighbor Algorithm

41 Classification Techniques  Instance-Based Methods  Decision trees  Neural networks  Bayesian classification

42 Decision Tree: An Example  [Diagram: a decision tree over a data set with categorical, continuous, and class attributes; the root splits on MarSt (Married → NO; Single, Divorced → split on Income: < 80K → NO, > 80K → YES)]  The splitting attribute at a node is determined by a specific attribute selection algorithm

43 Decision Tree: A Text Example  [Diagram: the same tree applied to text records with a class label: MarSt (Married → NO; Single, Divorced → split on Income: < 80K → NO, > 80K → YES)]  The splitting attribute at a node is determined by a specific attribute selection algorithm

44 Classification by DT Induction  Decision tree: a flow-chart-like tree structure; an internal node denotes a test on an attribute; a branch represents an outcome of the test; leaf nodes represent class labels or class distributions  Decision tree generation consists of two phases: tree construction and tree pruning (identify and remove branches that reflect noise or outliers)  Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree
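The tree from the earlier example slide can be written as nested tests, matching the structure described here: an internal node tests an attribute, branches follow outcomes, and leaves carry class labels. The MarSt/Income attributes and the 80K threshold come from the slide's diagram; the records are made-up samples.

```python
# The example decision tree as nested attribute tests.

def classify(record):
    if record["MarSt"] == "Married":      # internal node: test MarSt
        return "NO"                       # leaf
    # Single or Divorced branch: test Income next
    if record["Income"] > 80_000:
        return "YES"
    return "NO"

print(classify({"MarSt": "Single", "Income": 95_000}))   # YES
print(classify({"MarSt": "Married", "Income": 120_000})) # NO
```

In practice the tree is induced from the training set by an attribute selection algorithm rather than written by hand; this sketch only shows how a finished tree classifies a sample.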

45  Partitioning Methods  Hierarchical Methods Clustering Techniques

46 Partitioning Algorithms  Partitioning method: construct a partition of n documents into a set of k clusters  Given: a set of documents and the number k  Find: a partition into k clusters that optimizes the chosen partitioning criterion  Global optimum: exhaustively enumerate all partitions  Heuristic methods: the k-means and k-medoids algorithms  k-means: each cluster is represented by the center of the cluster

47 The K-means Clustering Method  The k-means algorithm is implemented in 4 steps: 1. Partition the objects into k nonempty subsets. 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster). 3. Assign each object to the cluster with the nearest seed point. 4. Go back to step 2; stop when there are no more new assignments.
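The four steps above can be sketched for 1-D points; seeding from the first k points is a simplifying assumption (real implementations usually seed randomly or with k-means++), and the data is made up.

```python
# k-means for 1-D points, following the slide's four steps.

def kmeans_1d(points, k, max_iter=100):
    centroids = points[:k]                       # step 1: initial seeds
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: nearest seed
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]  # step 2: centroids
        if new == centroids:                     # step 4: no change
            break
        centroids = new
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2)
print(centroids)  # two cluster centers
```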

48 The K-means Clustering: Example

49  Partitioning Methods  Hierarchical Methods Clustering Techniques

50 Hierarchical Clustering  Agglomerative: start with each document as a single cluster; eventually all documents belong to the same cluster  Divisive: start with all documents in the same cluster; eventually each document forms a cluster of its own  Does not require the number of clusters k in advance  Needs a termination condition: the final state in both the agglomerative and divisive cases is of no use

51 Hierarchical Clustering: Example  [Diagram: agglomerative, Steps 0-4: {a} {b} {c} {d} {e} → {a,b} → {d,e} → {c,d,e} → {a,b,c,d,e}; divisive runs the same steps in reverse, Steps 4-0]
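The agglomerative side of this example can be sketched directly: start with singletons and repeatedly merge the two closest clusters. Single-link distance and the 1-D coordinates for a..e are my own assumptions, chosen so the merge order mirrors the slide (a+b, d+e, then c joins de, then everything).

```python
# Agglomerative clustering with single-link (minimum pairwise) distance.

def single_link(c1, c2, pos):
    return min(abs(pos[x] - pos[y]) for x in c1 for y in c2)

def agglomerate(pos):
    clusters = [frozenset([name]) for name in sorted(pos)]
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters
        pairs = [(single_link(a, b, pos), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i+1:]]
        _, a, b = min(pairs, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(sorted(a | b))
    return merges

pos = {"a": 0.0, "b": 1.0, "c": 6.0, "d": 10.0, "e": 11.5}
print(agglomerate(pos))
```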

52 A Dendrogram: Hierarchical Clustering  A dendrogram decomposes the data objects into several levels of nested partitioning (a tree of clusters). A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

53 Demo

54 Commercial Tools  IBM Intelligent Miner for Text  Semio Map  InXight LinguistX / ThingFinder  LexiQuest  ClearForest  Teragram  SRA NetOwl Extractor  Autonomy

55 Summary  Text is tricky to process, but “ok” results are easily achieved  There exist several text mining systems, e.g., D2K - Data to Knowledge (http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/)  Additional intelligence can be integrated with text mining: one may play with any phase of the text mining process

56  There are many other scientific and statistical text mining methods developed but not covered in this talk: http://www.cs.utexas.edu/users/pebronia/text-mining/ http://filebox.vt.edu/users/wfan/text_mining.html  Also, it is important to study the theoretical foundations of data mining: Data Mining: Concepts and Techniques / J. Han & M. Kamber; Machine Learning / T. Mitchell

