©2012 Paula Matuszek
CSC 9010: Text Mining Applications: Text Features
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu
Paula.Matuszek@gmail.com
(610) 647-9789

Getting Started
- First step in applying text mining techniques: figure out what our input is
- Documents: the generic term for chunks of text
- Features of documents: what can we say about each document that is relevant to what we want to know?
- Possible goals:
  – for statistical methods: end up with each document as a vector of numbers, or a row in a spreadsheet
  – for linguistic methods: end up with each document in a standardized format, still more or less "English"

Finding Documents
- Sometimes you have existing documents and you want to know something about them:
  – Villanova newsletters
  – news articles about the election
  – a social media feed
- Sometimes you have some interesting questions and you have to go looking for a useful corpus:
  – What is the general opinion or sentiment about Microsoft?
  – Can we determine the genre of modern fiction?

Collecting Documents
- If you don't have a handy corpus you will have to collect it
- Web crawler, system logger, etc.
- Issues to be aware of:
  – copyright
  – document "cleansing"
  – size of collection
  – sampling bias

Standardizing Documents
- Typically we will need documents in some kind of standardized format
- Simple: ASCII
- Slightly less simple: markup languages such as RTF (flattened before mining; see the sketch below):

  {\fonttbl\f0\froman\fcharset0 Times-Roman;}
  {\info \f0\fs24 \cf0 \cb2 Text mining a \i corpus \i0 of documents \b requires \b0 having a corpus of \fs28 documents \fs24 to mine. }

- Richer formats introduce some structure, which we may want to use: XML, for instance
- Typically we want tokens rather than characters
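
Markup like the RTF above has to be reduced to plain text (or parsed, if we want its structure). As a rough illustration, here is a minimal sketch of flattening an XML document with Python's standard library; the XML snippet is invented, and real collections usually need a dedicated converter per format.

```python
# Minimal sketch: flattening an invented XML document to plain text
# with Python's standard library.
import xml.etree.ElementTree as ET

raw = ("<doc><title>Text Mining</title>"
       "<body>Text mining a <i>corpus</i> of documents.</body></doc>")

root = ET.fromstring(raw)
# itertext() walks the tree and yields every piece of text, skipping tags
plain = " ".join(t.strip() for t in root.itertext() if t.strip())
print(plain)   # Text Mining Text mining a corpus of documents.
```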

Tokenization
- Tokenization is the process of breaking a string of characters into words and other meaningful components (numbers, punctuation, etc.)
- Typically broken up at white space
- A very standard NLP tool
- Language-dependent, and sometimes also domain-dependent:
  – 3,7-dihydro-1,3,7-trimethyl-1H-purine-2,6-dione (caffeine, to a chemist)
- Tokens can also be larger divisions: sentences, paragraphs, etc.

NLTK Tokenizer
- Natural Language ToolKit
- http://text-processing.com/demo/tokenize/
- Example input:

  Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
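
The same demo can be reproduced in a few lines of NLTK; a minimal sketch, assuming NLTK and its "punkt" tokenizer models are installed:

```python
# Tokenizing the Moby-Dick opening with NLTK.
# Setup: pip install nltk, then nltk.download('punkt') once.
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("Call me Ishmael. Some years ago--never mind how long precisely--"
        "having little or no money in my purse, and nothing particular to "
        "interest me on shore, I thought I would sail about a little and "
        "see the watery part of the world.")

print(sent_tokenize(text))   # two sentence strings
print(word_tokenize(text))   # ['Call', 'me', 'Ishmael', '.', 'Some', ...]
```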

Document Features
- Once we have a corpus of standardized and possibly tokenized documents, what features of text documents might be interesting?
  – word frequencies: Bag of Words (BOW)
  – language
  – document length: characters, words, sentences
  – named entities
  – parts of speech
  – average word length
  – average sentence length
  – domain-specific stuff

Bag of Words
- Specifically not interested in word order
- The frequency of each possible word in this document
- A very sparse vector!
- To assign each count to the correct position, we need to know all the words used in the corpus:
  – two-pass (first collect the vocabulary, then count; see the sketch below)
  – reverse index
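
A minimal two-pass sketch on an invented toy corpus (production code would more likely use a library vectorizer, such as scikit-learn's CountVectorizer):

```python
# Two-pass bag of words: pass 1 builds the shared vocabulary,
# pass 2 turns each document into a count vector over that vocabulary.
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

tokenized = [d.split() for d in docs]          # naive whitespace tokens

# Pass 1: a fixed, sorted vocabulary shared by every document
vocab = sorted({w for toks in tokenized for w in toks})

# Pass 2: one count vector per document, aligned to vocab positions
vectors = []
for toks in tokenized:
    counts = Counter(toks)
    vectors.append([counts[w] for w in vocab])

print(vocab)     # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)   # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Even on this tiny corpus each vector already has zeros in it; over a real vocabulary of tens of thousands of words, almost every position is zero, which is why the vectors are called sparse.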

Simplifying BOW
Do we really want every word?
- Stop words:
  – omit common function words
  – e.g. http://www.ranks.nl/resources/stopwords.html
- Stemming or lemmatization:
  – convert words to a standard form
  – a lemma is a standard word; a stem may not be a word at all
- Synonym matching
- TF*IDF:
  – use only the "most meaningful" words
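
For the stop-word step, a small sketch using NLTK's bundled English list (assumes nltk.download('stopwords'); any published list, such as the one linked above, works the same way):

```python
# Filtering common function words out of a token list.
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
content = [t for t in tokens if t not in stops]
print(content)   # ['cat', 'sat', 'mat']
```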

Stemming
- Inflectional stemming: eliminate morphological variants
  – singular/plural, present/past
  – in English: rules plus a dictionary
  – books -> book, children -> child
  – few errors, but many omissions
- Root stemming: eliminate inflections and derivations
  – invention, inventor -> invent
  – much more aggressive
- Which to use (if either) depends on the problem
- http://text-processing.com/demo/stem/
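
A sketch contrasting the two flavors, using NLTK's Porter stemmer (aggressive, rule-based) against its WordNet lemmatizer (dictionary-based; assumes nltk.download('wordnet')):

```python
# Stemmer vs. lemmatizer on the slide's examples.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["books", "children", "invention", "inventor"]:
    print(f"{w:<10} {stemmer.stem(w):<10} {lemmatizer.lemmatize(w)}")

# Expected output (approximately):
#   books      book       book
#   children   children   child       <- rules miss it; the dictionary catches it
#   invention  invent     invention   <- the stemmer strips the derivation
#   inventor   inventor   inventor
```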

Synonym Matching
- Map some words to their synonyms
- http://thesaurus.com/browse/computer
- Problematic in English:
  – requires a large dictionary
  – many words have multiple meanings
- In specific domains it may be important:
  – biological and chemical domains: http://en.wikipedia.org/wiki/Caffeine
  – any specific domain: Nova, Villanova, V'Nova
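
One programmatic source of synonyms is WordNet, available through NLTK (assumes nltk.download('wordnet')); a quick sketch:

```python
# Looking up synonym sets ("synsets") for a word in WordNet.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("computer"):
    print(synset.name(), synset.lemma_names())

# computer.n.01   ['computer', 'computing_machine', 'computing_device', ...]
# calculator.n.01 ['calculator', 'reckoner', 'figurer', 'estimator', 'computer']
```

The two senses returned (the machine, and a person who computes) illustrate exactly the multiple-meanings problem noted above.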

TF-IDF
- Word frequency is simple, but:
  – it is affected by the length of the document
  – it is not the best indicator of what a document is about
  – very common words don't tell us much about differences between documents
  – very uncommon words are often typos or idiosyncratic
- Term Frequency * Inverse Document Frequency (worked through in the sketch below):
  – tf-idf(j) = tf(j) * idf(j), where idf(j) = log(N / df(j))
  – N is the number of documents; df(j) is the number of documents containing term j
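
Following the slide's formula directly, a from-scratch sketch on an invented three-document corpus:

```python
# tf-idf(j) = tf(j) * log(N / df(j)), computed from scratch.
import math
from collections import Counter

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]

N = len(docs)
# df: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tf_idf(docs[0]))
# {'the': 0.0, 'cat': 1.0986..., 'sat': 0.4054...}
# 'the' appears in every document, so idf = log(3/3) = 0: no signal.
```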

Language
- Relevant if documents are in multiple languages
- You may know the language from the source
- Determining the language can itself be considered a form of text mining, or just NLP; the line is fuzzy :-)
- http://translate.google.com/?hl=en&tab=wT
- http://fr.wikipedia.org/
- http://de.wikipedia.org/
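
As a taste of how fuzzy that line is, here is one naive language-identification heuristic: score a text by how many of its tokens appear in each candidate language's stop-word list. A sketch only, assuming nltk.download('stopwords'); real identifiers typically use character n-gram models.

```python
# Guess a language by stop-word overlap (crude but illustrative).
from nltk.corpus import stopwords

def guess_language(text, candidates=("english", "french", "german")):
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & set(stopwords.words(lang)))
              for lang in candidates}
    return max(scores, key=scores.get)

print(guess_language("the cat sat on the mat"))      # english
print(guess_language("le chat est sur le tapis"))    # french
```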

Document Counts
- Number of characters, words, sentences
- Average length of words, sentences, paragraphs
- E.g., clustering documents to determine how many authors wrote them or how many genres are represented
- NLTK + Python make this easy (see the sketch below)
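
A minimal sketch of such counts for a single document (assumes NLTK's "punkt" tokenizer models are installed):

```python
# Simple count and average features for one document.
from nltk.tokenize import sent_tokenize, word_tokenize

def count_features(text):
    sents = sent_tokenize(text)
    words = [w for w in word_tokenize(text) if w.isalpha()]
    return {
        "chars": len(text),
        "words": len(words),
        "sentences": len(sents),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sents),   # words per sentence
    }

print(count_features("Call me Ishmael. Some years ago I went to sea."))
```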

Named Entities
- Which persons, places, and companies are mentioned in the documents?
- "Proper nouns"
- One of the most common information extraction tasks
- Usually a combination of rules and a dictionary. Example rules:
  – a capitalized word not at the beginning of a sentence
  – two capitalized words in a row
  – one or more capitalized words followed by "Inc"
- Dictionaries of common names, places, and major corporations; such a dictionary is sometimes called a "gazetteer"
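
NLTK also ships an off-the-shelf statistical NE chunker; a sketch, assuming the "punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", and "words" data packages are installed (the sentence is invented):

```python
# Named-entity chunking with NLTK's built-in chunker.
import nltk

sentence = "Paula Matuszek teaches text mining at Villanova University."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)      # the chunker works from POS tags
tree = nltk.ne_chunk(tagged)

# Entities come back as labeled subtrees (PERSON, ORGANIZATION, GPE, ...)
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), [tok for tok, tag in subtree.leaves()])
```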

Parts of Speech
- Part-of-Speech (POS) taggers identify nouns, verbs, adjectives, noun phrases, etc.
- The Brill tagger is the best-known rule-based tagger
- More recent work uses machine learning to create taggers from labeled examples
- http://text-processing.com/demo/tag/
- http://cst.dk/online/pos_tagger/uk/index.html
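
A one-liner with NLTK's default machine-learned tagger (assumes the "averaged_perceptron_tagger" and "punkt" data packages); the tags follow the Penn Treebank set:

```python
# POS tagging with NLTK's default tagger.
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ..., ('.', '.')]
```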

Domain-Specific
- Structure in the document:
  – email: To, From, Subject, Body
  – Villanova Campus Currents: News, Academic Notes, Events, Sports, etc.
- Tags in the document:
  – the Medline corpus is tagged with MeSH terms
  – Twitter feeds may be tagged with #tags
- Intranet documents might have date, source, department, author, etc.
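
For structured formats like email, the parsing is often already in the standard library; a sketch pulling fields from an invented message:

```python
# Extracting structured fields from an email with Python's stdlib.
from email import message_from_string

raw = """From: paula@example.edu
To: class@example.edu
Subject: Text features

Remember to tokenize before you count."""

msg = message_from_string(raw)
print(msg["From"], "|", msg["Subject"])   # header fields by name
print(msg.get_payload())                  # the body text
```

Fields like From and Subject can then become features in their own right, alongside the body's bag of words.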

Choosing Features
- The goal is to turn each document into a standardized format that can be fed to a text mining tool
- The format depends on the tool:
  – statistically based tools such as classifiers and clustering tools typically need a numeric representation
  – information extraction tools need text, but cleaned and standardized text
- Which features to choose depends on what you are trying to do:
  – for classification tasks, the features must be related to the categories you are classifying into
  – for clustering tasks, they must reflect potentially interesting dimensions, and not irrelevant ones
- Text mining, like all data mining, isn't magic. You must understand your domain and your information need.