Published by Sylvia Doyle. Modified over 9 years ago.
1
Defining Text Mining
  Preprocessing
    Transforming unstructured data stored in document collections into a more explicitly structured intermediate format
  Related technologies
    Machine learning
    Information retrieval
    Information extraction
    Corpus-based computational linguistics
2
Document Collection
  Typical real-world document collection
    PubMed: the National Library of Medicine's online repository
    Text-based document abstracts for more than 12 million research papers
  Static vs. dynamic collections
The Document
  Ranges from very formal to very ad hoc
    Business report, legal memorandum
    E-mail, research paper, manuscript, article, press release, news story
  "Weakly structured" vs. "semi-structured"
  Document subcomponents
    Titles, publication dates, author names, headers, footnotes, and so on
  Structural meta-information
    Typographical, layout, and markup indicators
3
Document Features
  Preprocessing operation of text mining
    Identification of a simplified subset of document features
  High feature dimensionality
    Ex) A small collection of 15,000 Reuters news documents contains 25,000 nontrivial word stems
  Feature sparsity for a document
  Commonly used document features
    Characters
      Numerals, special characters, n-grams
    Words
      Each word is a feature
      Feature selection is required
    Terms
      Single words and multiword phrases extracted by term-extraction technologies
      More semantically rich document representation
      Ex) President Abraham Lincoln experienced a career that took him from log cabin to White House.
    Concepts
      Generated by means of statistical, rule-based, or hybrid categorization methods
      Can consist of words not specifically found in the native document
      Handle synonymy and polysemy problems
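The character-level and word-level features listed on this slide, and the sparsity they produce, can be illustrated with a minimal Python sketch. The function names (`char_ngrams`, `word_features`, `bag_of_words`, `sparsity`) and the simple tokenization rule are assumptions made for illustration; they are not taken from the slides. Term and concept extraction are not shown, since they depend on the extraction technology used.

```python
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_features(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Build a shared vocabulary and one count vector per document."""
    vocab = sorted({w for d in docs for w in word_features(d)})
    vectors = []
    for d in docs:
        counts = Counter(word_features(d))
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

def sparsity(vectors):
    """Fraction of zero entries across all document vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total if total else 0.0
```

As the vocabulary grows with the collection, each document uses only a small share of it, so the fraction returned by `sparsity` tends to rise; this is the motivation for feature selection at the word level.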
4
The Search for Patterns and Trends
  Concept co-occurrence patterns
    Consider distributions, frequent sets, and various associations of concepts at an inter-document level
    Ex) Relationship between X and "scandal"
    Ex) Relationship between company Y and product Z
    Ex) Relationship between proteins P1 and P2
  Trend analysis
    Date-and-time stamping of documents within a collection
    Examples of related questions
      "What is the general trend of the news topics between periods?" (page 9)
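Both ideas on this slide reduce to counting over documents that have already been mapped to sets of concepts. The following is a minimal Python sketch under that assumption; the helper names `cooccurrence_counts` and `concept_trend`, and the use of a plain period label in place of a full date-and-time stamp, are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """For each unordered concept pair, count the documents that contain both."""
    counts = Counter()
    for concepts in docs:
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts

def concept_trend(dated_docs, concept):
    """Map each period label to the fraction of its documents mentioning the concept."""
    totals, hits = Counter(), Counter()
    for period, concepts in dated_docs:
        totals[period] += 1
        if concept in concepts:
            hits[period] += 1
    return {p: hits[p] / totals[p] for p in totals}
```

For example, calling `cooccurrence_counts` on a list of concept sets and looking up the pair for X and "scandal" gives the number of documents in which the two appear together, and comparing `concept_trend` values between periods gives a rough answer to the trend question quoted on the slide.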
5
The Importance of the Presentation Layer
  Browsing tools
  Visualization tools
    Ex) Figure 1.1 (page 11)
  Query languages
6
Text Mining System Architecture
  System process: Figures 1.2 and 1.3 (pages 13 and 14)
  Preprocessing tasks
    Convert the information from each original data source into a canonical format
  Core mining operations
    Knowledge distilling
  Presentation layer components and browsing capabilities
    GUI and pattern browsing
    Query languages
  Refinement techniques
    Filter redundant information
    Cluster closely related data
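The four stages named on this slide can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration of that flow: every function name is hypothetical, the "canonical format" is assumed to be a set of lowercase tokens, the mining step is a plain document-frequency count, and the refinement step is a simple support threshold standing in for redundancy filtering and clustering.

```python
from collections import Counter

def preprocess(raw_sources):
    """Convert each raw source into an assumed canonical form: a set of lowercase tokens."""
    return [set(text.lower().split()) for text in raw_sources]

def mine(canonical_docs):
    """Core mining stand-in: count how many documents contain each token."""
    counts = Counter()
    for doc in canonical_docs:
        counts.update(doc)
    return counts

def refine(patterns, min_docs=2):
    """Refinement stand-in: drop patterns seen in fewer than min_docs documents."""
    return {t: c for t, c in patterns.items() if c >= min_docs}

def present(patterns):
    """Presentation stand-in: rows sorted by count (descending), then alphabetically."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))

def run_pipeline(raw_sources, min_docs=2):
    """Chain the stages: preprocessing, core mining, refinement, presentation."""
    return present(refine(mine(preprocess(raw_sources)), min_docs))
```

Keeping each stage behind its own function mirrors the separation shown in the architecture figures: the preprocessing step can change per data source without touching the mining or presentation code.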
7
Text Mining System Architecture
  High-level text mining functional architecture (Figure 1.4, page 15)
  System architecture for a generic text mining system (Figure 1.5, page 15)
  System architecture for an advanced or domain-oriented text mining system (Figures 1.6 and 1.7, pages 16 and 17)
8
Core Text Mining Operations
  Pattern-discovery algorithms
  Incorporation of background knowledge into text mining query operations
  Text mining query languages
9
Common Types of Patterns in Text Mining
  Distributions
  Frequent sets
  Associations
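Each of the three pattern types can be computed by counting over documents represented as sets of concepts. The sketch below is a brute-force illustration under that assumption, not the algorithms the slides refer to; the function names, the `max_size` limit, and the use of a raw document count as support are all hypothetical choices, and the exhaustive candidate enumeration would not scale to a real collection.

```python
from itertools import combinations

def concept_distribution(docs):
    """Distribution: proportion of documents that contain each concept."""
    n = len(docs)
    concepts = set().union(*docs) if docs else set()
    return {c: sum(c in d for d in docs) / n for c in concepts}

def frequent_sets(docs, min_support, max_size=2):
    """Frequent sets: concept sets contained in at least min_support documents."""
    concepts = sorted(set().union(*docs)) if docs else []
    result = {}
    for size in range(1, max_size + 1):
        for cand in combinations(concepts, size):
            support = sum(set(cand) <= d for d in docs)
            if support >= min_support:
                result[cand] = support
    return result

def association_confidence(docs, lhs, rhs):
    """Association: confidence of the rule lhs => rhs over the collection."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_count = sum(lhs <= d for d in docs)
    both = sum((lhs | rhs) <= d for d in docs)
    return both / lhs_count if lhs_count else 0.0
```

The three functions build on one another in the usual way: distributions describe single concepts, frequent sets extend the count to groups of concepts, and associations compare the support of a combined set with the support of its left-hand side.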