Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.

Text mining

The Standard Data Mining process

Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks include: – Text categorization (document classification) – Text clustering – Text summarization – Opinion mining – Entity/concept extraction – Information retrieval: search engines – information extraction: Question answering

Supervised learning algorithms – Decision tree learning – Naïve Bayes – K-nearest neighbour – Support Vector Machines – Neural Networks – Genetic algorithms

Supervised Machine learning 1. Build or get a representative corpus 2. Label it 3. Define features 4. Represent documents 5. Learn and analyse 6. Go to 3 until accuracy is acceptable First test features: stemmed words

Unsupervised Learning – Document clustering HAC K-means BIRCH …

Applying machine learning on text Text Representation (Feature Extraction) preprocessing Indexing Weighting Model Dimensionality Reduction Similarity measure: how to compare text

Feature Extraction: Task(1) Task: Extract a good subset of words to represent documents Document collection All unique words/phrases Feature Extraction All good words/phrases Some slides by Huaizhong Kou

Feature Extraction: Task(2) While more and more textual information is available online, effective retrieval is difficult without good indexing of text content. While-more-and-textual-information-is-available-online- effective-retrieval-difficult-without-good-indexing-text-content Feature Extraction Text-information-online-retrieval-index 16 5 2 1 1 1 1

Feature Extraction: preprocessing and Indexing(1) Identification all unique words Removal stop words Removal stop words Word Stemming Training documents Term Weighting Naive terms Importance of term in Doc  Removal of suffix to generate word stem  grouping words  increasing the relevance  ex.{walker,walking}  walk  non-informative word  ex.{the,and,when,more}

Feature Extraction: Indexing(2) Vector Space Model (VSM) is one of the most commonly used Text data models Any text document is represented by a vector of terms Terms are typically words and/or phrases Every term in the vocabulary becomes an independent dimension Each term in the text document would be represented by a non zero value which will be added in the corresponding dimension A document collection is represented as a matrix: Where x ji represents the weight of the ith term in jth document

Feature Extraction:Weighting Model(1) tf - Term Frequency weighting w ij = Freq ij Freq ij : := the number of times jth term occurs in document D i.  Drawback: without reflection of importance factor for document discrimination. Ex. ABRTSAQWA XAO RTABBAXA QSAK D1 D2 A B K O Q R S T W X D1 4 1 0 1 1 1 1 1 1 1 D2 4 2 1 0 1 1 1 1 0 1

Feature Extraction:Weighting Model(2) Tf-idf: simple version w ij = Freq ij * log(N/ DocFreq j ). N : := the number of documents in the training document collection. DocFreq j ::= the number of documents in which the jth term occurs. Advantage: with reflection of importance factor for document discrimination. Assumption:terms with low DocFreq are better discriminator than ones with high DocFreq in document collection A B K O Q R S T W X D1 0 0 0 0.3 0 0 0 0 0.3 0 D2 0 0 0.3 0 0 0 0 0 0 0

Feature Extraction: Weighting Model(3) Tf-IDF weighting = TF * IDF ] ABKOQRSTWX 0 4/12 * (lg(2/2)1/12*(lg(2/1)

Feature Extraction: Weighting Model Tf-IDF weighting Entropy weighting where is average entropy of ith term and -1: if word occurs once time in every document 0: if word occurs in only one document Ref:[13]Ref:[11][22]

Feature Extraction: Dimension Reduction Document Frequency Thresholding X 2 -statistic Latent Semantic Indexing Information Gain Mutual information

Dimension Reduction:DocFreq Thresholding Document Frequency Thresholding Calculates DocFreq(w) Sets threshold  Removes all words: DocFreq <  Naive Terms Training documents D Feature Terms

Similarity measure There are many different ways to measure how similar two documents are, or how similar a document is to a query Highly depending on the choice of terms to represent text documents – Euclidian distance (L2 norm) – L1 norm – Cosine similarity

Document Similarity Measures

Document Similarity measures

Document Clustering: Algorithms k-means Hierarchic Agglomerative Clustering (HAC) …. BIRCH Association Rule Hypergraph Partitioning (ARHP) Categorical clustering (CACTUS, STIRR) …… STC QDC

Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.

Similar presentations

Presentation on theme: "Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.

Similar presentations

Presentation on theme: "Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks."— Presentation transcript:

Similar presentations

About project

Feedback