Defining Text Mining Preprocessing Transforming unstructured data stored in document collections into a more explicitly structured intermediate format.

Defining Text Mining
- Preprocessing: transforming unstructured data stored in document collections into a more explicitly structured intermediate format
- Related technologies
  - Machine learning
  - Information retrieval
  - Information extraction
  - Corpus-based computational linguistics

Document Collection
- Typical real-world document collection
  - PubMed: the National Library of Medicine’s online repository
  - Text-based document abstracts for more than 12 million research papers
- Static vs. dynamic collections

The Document
- Ranges from very formal to very ad hoc
  - Business report, legal memorandum, e-mail, research paper, manuscript, article, press release, news story
- “Weakly structured” vs. “semi-structured” documents
- Document subcomponents
  - Titles, publication dates, author names, headers, footnotes, and so on
- Structural meta-information
  - Typographical, layout, and markup indicators

Document Features
- Preprocessing operation of text mining
  - Identification of a simplified subset of document features
- High feature dimensionality
  - Ex) A small collection of 15,000 Reuters news documents → 25,000 nontrivial word stems
- Feature sparsity for a document
- Commonly used document features
  - Characters
    - Numerals, special characters, n-grams
  - Words
    - A word → a feature
    - Feature selection is required
  - Terms
    - Single words and multiword phrases extracted by term-extraction technologies
    - More semantically rich document representation
    - Ex) President Abraham Lincoln experienced a career that took him from log cabin to White House.
  - Concepts
    - Generated by means of statistical, rule-based, or hybrid categorization methods
    - Can consist of words not specifically found in the native document
    - Help handle synonymy and polysemy problems
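
The character-level and word-level features above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the function names (`char_ngrams`, `word_features`, `bag_of_words`) are hypothetical, the tokenizer is a simple regular expression, and the second sample sentence is made up to give the vocabulary something to compare against.

```python
from collections import Counter
import re

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_features(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Map each document to a sparse {word: count} dictionary."""
    return [Counter(word_features(d)) for d in docs]

docs = [
    # Example sentence from the slide.
    "President Abraham Lincoln experienced a career that took him "
    "from log cabin to White House.",
    # Made-up second document, for illustration only.
    "The White House issued a press release.",
]

bags = bag_of_words(docs)
vocabulary = sorted(set().union(*bags))

# Sparsity: fraction of vocabulary words that do NOT occur in each document.
sparsity = [1 - len(bag) / len(vocabulary) for bag in bags]

print(char_ngrams("cabin"))
print(len(vocabulary), sparsity)
```

Even with two short documents, each one uses only part of the shared vocabulary. On a real collection the vocabulary runs to tens of thousands of features and the per-document share is far smaller, which is why feature selection is needed.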

The Search for Patterns and Trends
- Concept co-occurrence patterns
  - Consider distributions, frequent sets, and various associations of concepts at an inter-document level
  - Ex) Relationship between X and “scandal”
  - Ex) Relationship between company Y and product Z
  - Ex) Relationship between proteins P1 and P2
- Trend analysis
  - Date-and-time stamping of documents within a collection
  - Examples of related questions (page 9)
    - “What is the general trend of the news topics between periods?”
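
A rough sketch of both ideas follows, assuming preprocessing has already labeled each document with a set of concepts and a time period. The documents, period labels, and function names are hypothetical and chosen only to mirror the slide's examples.

```python
from collections import Counter
from itertools import combinations

# Hypothetical collection: (period, concepts assigned during preprocessing).
docs = [
    ("2001", {"company Y", "product Z"}),
    ("2001", {"company Y", "scandal"}),
    ("2002", {"company Y", "product Z", "scandal"}),
    ("2002", {"product Z"}),
]

def cooccurrence_counts(docs):
    """Count how many documents contain each unordered pair of concepts."""
    counts = Counter()
    for _, concepts in docs:
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts

def concept_share_by_period(docs, concept):
    """For each period, the fraction of documents labeled with `concept`."""
    totals, hits = Counter(), Counter()
    for period, concepts in docs:
        totals[period] += 1
        hits[period] += int(concept in concepts)
    return {period: hits[period] / totals[period] for period in totals}

print(cooccurrence_counts(docs).most_common())
print(concept_share_by_period(docs, "product Z"))
```

Comparing the per-period shares of a concept is one simple way to approach a question such as the trend of news topics between periods. It depends on every document carrying a date-and-time stamp.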

The Importance of the Presentation Layer
- Browsing tools
- Visualization tools
  - Ex) Figure 1.1 (page 11)
- Query languages

Text Mining System Architecture
- System process: Figures 1.2 and 1.3 (pages 13 and 14)
- Preprocessing tasks
  - Convert the information from each original data source into a canonical format
- Core mining operations
  - Knowledge distilling
- Presentation layer components and browsing capabilities
  - GUI and pattern browsing
  - Query languages
- Refinement techniques
  - Filter redundant information
  - Cluster closely related data
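
The four stages listed above can be pictured as a chain of functions, each consuming the previous stage's output. The sketch below is a toy under stated assumptions and not the architecture shown in the figures. All function names are hypothetical, the "canonical format" is just a set of lowercase tokens, and the refinement step is a simple frequency threshold standing in for redundancy filtering and clustering.

```python
from collections import Counter

def preprocess(raw_sources):
    """Convert each raw source into a canonical form: a set of lowercase tokens."""
    return [set(text.lower().split()) for text in raw_sources]

def mine(canonical_docs):
    """Core mining operation (here: the document frequency of each token)."""
    counts = Counter()
    for doc in canonical_docs:
        counts.update(doc)
    return counts

def refine(patterns, min_docs=2):
    """Refinement stand-in: drop patterns seen in fewer than `min_docs` documents."""
    return {p: n for p, n in patterns.items() if n >= min_docs}

def present(patterns):
    """Presentation-layer stand-in: text lines, most frequent pattern first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{pattern}: {count}" for pattern, count in ordered]

# Made-up input sources, for illustration only.
raw = ["Text mining systems", "Mining text collections", "Query languages"]

for line in present(refine(mine(preprocess(raw)))):
    print(line)
```

Keeping the stages as separate functions mirrors the point of the canonical format: the mining and presentation code never needs to know what the original data sources looked like.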

Text Mining System Architecture
- High-level text mining functional architecture (Figure 1.4, page 15)
- System architecture for a generic text mining system (Figure 1.5, page 15)
- System architecture for an advanced or domain-oriented text mining system (Figures 1.6 and 1.7, pages 16 and 17)

Core Text Mining Operations
- Pattern-discovery algorithms
- Incorporation of background knowledge into text mining query operations
- Text mining query languages

Common Types of Patterns in Text Mining
- Distributions
- Frequent sets
- Associations
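
The three pattern types can be illustrated on a tiny concept-labeled collection. The sketch below takes one common reading of them: a distribution as the proportion of documents carrying a concept set, a frequent set as a concept set whose proportion (support) meets a threshold, and an association as a rule scored by confidence. The documents, function names, and thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical collection: each document is the set of concepts it is labeled with.
docs = [
    {"protein P1", "protein P2", "binding"},
    {"protein P1", "protein P2"},
    {"protein P1", "binding"},
    {"protein P2"},
]

def proportion(docs, concepts):
    """Fraction of documents that contain every concept in `concepts` (support)."""
    concepts = set(concepts)
    return sum(concepts <= doc for doc in docs) / len(docs)

def frequent_sets(docs, size, min_support):
    """All concept sets of the given size whose support is at least `min_support`."""
    vocabulary = sorted(set().union(*docs))
    return [combo for combo in combinations(vocabulary, size)
            if proportion(docs, combo) >= min_support]

def confidence(docs, antecedent, consequent):
    """Confidence of the rule antecedent => consequent (both are sets of concepts)."""
    both = set(antecedent) | set(consequent)
    return proportion(docs, both) / proportion(docs, antecedent)

print(proportion(docs, {"protein P1"}))
print(frequent_sets(docs, size=2, min_support=0.5))
print(confidence(docs, {"protein P1"}, {"protein P2"}))
```

This brute-force enumeration is only workable for toy vocabularies. Practical systems prune candidate sets, for example Apriori-style, before scoring them.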
