Defining Text Mining Preprocessing Transforming unstructured data stored in document collections into a more explicitly structured intermediate format.


1 Defining Text Mining
  Preprocessing
    Transforming unstructured data stored in document collections into a more explicitly structured intermediate format
  Related technologies
    Machine learning
    Information retrieval
    Information extraction
    Corpus-based computational linguistics

2 Document Collection
  Typical real-world document collection
    PubMed: the National Library of Medicine's online repository
    Text-based document abstracts for more than 12 million research papers
  Static vs. dynamic collections
  The Document
    Ranges from very formal to very ad hoc
    Business report, legal memorandum, e-mail, research paper, manuscript, article, press release, news story
    "Weakly structured" vs. "semi-structured"
  Document subcomponents
    Titles, publication dates, author names, headers, footnotes, and so on
  Structure meta-information
    Typographical, layout, and markup indicators

3 Document Features
  Preprocessing operation of text mining
    Identification of a simplified subset of document features
  High feature dimensionality
    Ex) A small collection of 15,000 Reuters news documents yields 25,000 nontrivial word stems
  Feature sparsity for a document
  Commonly used document features
    Characters
      Numerals, special characters, n-grams
    Words
      Each word is a feature
      Feature selection is required
    Terms
      Single words and multiword phrases extracted by term-extraction technologies
      More semantically rich document representation
      Ex) President Abraham Lincoln experienced a career that took him from log cabin to White House.
    Concepts
      Generated by means of statistical, rule-based, or hybrid categorization methods
      Can consist of words not specifically found in the native document
      Handle synonymy and polysemy problems
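The character-level and word-level features listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the method used by any particular text mining system: the function names (char_ngrams, word_features, bag_of_words) and the second sample document are assumptions made up for the example, and the tokenizer is a deliberately crude regular expression.

```python
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_features(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Map each document to a sparse word-count vector (a Counter)."""
    return [Counter(word_features(d)) for d in docs]

# The first sentence comes from the slide; the second is a made-up example.
docs = [
    "President Abraham Lincoln experienced a career that took him "
    "from log cabin to White House.",
    "The White House issued a press release.",
]

vectors = bag_of_words(docs)
vocabulary = sorted(set().union(*vectors))

# Feature sparsity: each document uses only part of the shared vocabulary.
for v in vectors:
    print(len(v), "of", len(vocabulary), "word features are non-zero")
```

Even with two short documents, each vector leaves some vocabulary entries at zero; with thousands of documents the vocabulary grows much faster than any single document's word set, which is why feature selection is needed.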

4 The Search for Patterns and Trends
  Concept co-occurrence patterns
    Consider distributions, frequent sets, and various associations of concepts at an inter-document level
    Ex) Relationship between X and "scandal"
    Ex) Relationship between company Y and product Z
    Ex) Relationship between proteins P1 and P2
  Trend analysis
    Date-and-time stamping of documents within a collection
    Example of a related question: "What is the general trend of the news topics between periods?" (page 9)
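A rough sketch of inter-document co-occurrence counting and a simple per-period trend count is given below. The documents, their date stamps, and the function names are all hypothetical; the sketch assumes each document has already been reduced to a set of concepts plus a period label.

```python
from collections import Counter
from itertools import combinations

# Hypothetical collection: (period stamp, set of concepts in the document).
docs = [
    ("2001-01", {"company Y", "product Z"}),
    ("2001-01", {"company Y", "scandal"}),
    ("2001-02", {"company Y", "product Z", "scandal"}),
    ("2001-02", {"protein P1", "protein P2"}),
]

def cooccurrence_counts(docs):
    """Count, for every pair of concepts, how many documents contain both."""
    counts = Counter()
    for _, concepts in docs:
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts

def trend(docs, concept):
    """Count the documents mentioning a concept in each period."""
    per_period = Counter()
    for period, concepts in docs:
        if concept in concepts:
            per_period[period] += 1
    return dict(per_period)

pairs = cooccurrence_counts(docs)
print(pairs[("company Y", "scandal")])
print(trend(docs, "product Z"))
```

Comparing the per-period counts of a concept is one simple way to approach the trend question quoted on the slide.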

5 The Importance of the Presentation Layer
  Browsing tools
  Visualization tools
    Ex) Figure 1.1 (page 11)
  Query languages

6 Text Mining System Architecture
  System process: Figures 1.2 and 1.3 (pages 13 and 14)
  Preprocessing tasks
    Convert the information from each original data source into a canonical format
  Core mining operations
    Knowledge distillation
  Presentation layer components and browsing capabilities
    GUI and pattern browsing
    Query languages
  Refinement techniques
    Filter redundant information
    Cluster closely related data
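The four stages on this slide can be pictured as a chain of functions. The sketch below is only an illustration of the data flow: every function is a hypothetical stand-in (a whitespace tokenizer for preprocessing, a document-frequency count for the core mining operation, a threshold filter for refinement), and the sample documents are invented.

```python
from collections import Counter

def preprocess(raw_docs):
    """Convert each raw source into a canonical format: a list of lowercase tokens."""
    return [doc.lower().split() for doc in raw_docs]

def mine(canonical_docs):
    """Stand-in core mining operation: document frequency of each token."""
    counts = Counter()
    for tokens in canonical_docs:
        counts.update(set(tokens))  # count each token once per document
    return counts

def refine(patterns, min_docs=2):
    """Stand-in refinement: drop patterns seen in fewer than min_docs documents."""
    return {t: n for t, n in patterns.items() if n >= min_docs}

def present(patterns):
    """Stand-in presentation layer: sorted text lines for browsing."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{token}: {count}" for token, count in ordered]

raw = [
    "Company Y launches product Z",
    "Product Z recalled",
    "Company Y denies scandal",
]
report = present(refine(mine(preprocess(raw))))
print("\n".join(report))
```

Keeping each stage behind its own function mirrors the layered architecture: a stage can be swapped for a more capable one without touching the others.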

7 Text Mining System Architecture
  High-level text mining functional architecture (Figure 1.4, page 15)
  System architecture for a generic text mining system (Figure 1.5, page 15)
  System architecture for an advanced or domain-oriented text mining system (Figures 1.6 and 1.7, pages 16 and 17)

8 Core Text Mining Operations
  Pattern-discovery algorithms
  Incorporation of background knowledge into text mining query operations
  Text mining query languages

9 Common Types of Patterns in Text Mining
  Distributions
  Frequent sets
  Associations
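The three pattern types can be illustrated on a toy collection. The slide does not define them, so this sketch assumes the usual support and confidence definitions from association-rule mining; the concept sets and function names are hypothetical, and the frequent-set search is a brute-force enumeration rather than an optimized algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical collection: each document is reduced to its set of concepts.
docs = [
    {"merger", "scandal"},
    {"merger", "scandal", "lawsuit"},
    {"merger", "lawsuit"},
    {"scandal"},
]

def support(docs, itemset):
    """Number of documents that contain every concept in itemset."""
    return sum(1 for d in docs if itemset <= d)

def distribution(docs):
    """Distribution: proportion of documents labeled with each concept."""
    concepts = set().union(*docs)
    return {c: support(docs, {c}) / len(docs) for c in concepts}

def frequent_sets(docs, min_support, max_size=2):
    """Frequent sets: concept sets co-occurring in at least min_support documents."""
    concepts = sorted(set().union(*docs))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(concepts, size):
            s = support(docs, set(combo))
            if s >= min_support:
                result[combo] = s
    return result

def confidence(docs, lhs, rhs):
    """Association lhs => rhs: share of documents with lhs that also contain rhs."""
    base = support(docs, lhs)
    return support(docs, lhs | rhs) / base if base else 0.0

print(distribution(docs))
print(frequent_sets(docs, min_support=2))
print(confidence(docs, {"lawsuit"}, {"merger"}))
```

Brute-force enumeration is fine for a handful of concepts but grows combinatorially, which is why practical systems prune candidate sets by their support.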

