
1 / 22 Issues in Text Similarity and Categorization
Jordan Smith – MUMT 611 – 27 March 2008

2 / 22 Outline
- Why text?
- Text categorization:
  - Some sample problems
  - Comparison to MIR
  - Document indexing
  - Detailed example

3 / 22 Why text?
- 28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003)
- Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006)
- Accurate ground truth (Logan et al. 2004)
- Information about mood, "content"

4 / 22 Why text? Potential applications:
- Genre and mood categorization (Maxwell 2007)
- Similarity searches (Mahadero et al. 2005)
- Hit-song prediction (Dhanaraj & Logan 2005)
- Musical document retrieval (Google)
- Accompanying query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)

5 / 22 Some text categorization problems
- Indexing
- Document organization
- Filtering
- Web content hierarchy
- Language identification
- etc.

6 / 22 What is text categorization?
"Text categorization may be defined as the task of assigning a Boolean value to each pair ⟨d_j, c_i⟩ ∈ D × C, where D is a domain of documents and C = {c_1, ..., c_|C|} is a set of pre-defined categories." (Sebastiani 2002)
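
To make the definition concrete, here is a minimal sketch of the target function Φ: D × C → {True, False}, approximated by a naive keyword rule; the documents, categories, and keyword lists are invented purely for illustration.

```python
# Minimal sketch: categorization as a Boolean function on D x C.
# The documents, categories, and keyword lists are invented examples.
documents = [
    "love you baby tonight",
    "the revolution will not be televised",
]
categories = {
    "love_song": {"love", "baby", "heart"},
    "protest_song": {"revolution", "fight", "power"},
}

def assign(document: str, category: str) -> bool:
    """Approximate Phi: D x C -> {True, False} with a naive keyword-overlap rule."""
    words = set(document.lower().split())
    return bool(words & categories[category])

for d in documents:
    for c in categories:
        print(f"{c:>14}: {assign(d, c)}  <- {d!r}")
```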

7 / 22 Text vs. music
- Music classification: extract features, train classifiers, evaluate classifier
- Text categorization: extract features, train classifiers, evaluate classifier
The training and evaluation stages are the same; the feature extraction stage is not.

8 / 22 Text feature extraction
- Convert each document d_j into a vector d_j = ⟨w_1j, w_2j, ..., w_|T|j⟩, where T is the set of terms {t_1, t_2, ..., t_|T|} and w_kj is the weight of term t_k in d_j.
- Indexing systems differ in:
  - how the set of terms is defined
  - how the weights are computed
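
As a hedged illustration of this representation (not any of the cited systems), the sketch below builds T from a toy corpus and leaves the weight function pluggable, since that is exactly where the indexing schemes on the next slides differ; all names and documents are illustrative.

```python
# Sketch: represent each document d_j as a vector <w_1j, ..., w_|T|j>.
# The weight function is a parameter; binary and tf-idf weights are
# shown on the following slides. Corpus and names are toy examples.
from typing import Callable, List

corpus = [
    "yesterday all my troubles seemed so far away",
    "all you need is love love is all you need",
]

def build_term_set(docs: List[str]) -> List[str]:
    """T = every word that occurs in the corpus, in a fixed order."""
    return sorted({w for d in docs for w in d.split()})

def vectorize(doc: str, terms: List[str],
              weight: Callable[[str, str], float]) -> List[float]:
    """Map d_j to <weight(t_1, d_j), ..., weight(t_|T|, d_j)>."""
    return [weight(t, doc) for t in terms]

terms = build_term_set(corpus)
term_count = lambda t, d: float(d.split().count(t))   # simple demo weight
print(vectorize(corpus[1], terms, term_count))
```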

9 / 22 Indexing techniques
"Set of words" indexing
- Terms: every word that occurs in the corpus
- Weights: binary
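
A minimal sketch of set-of-words indexing on a toy two-line corpus: every corpus word becomes a term, and the weight is simply presence or absence.

```python
# Sketch of "set of words" indexing: terms are every word in the
# corpus, weights are binary (1 if the term appears, else 0).
corpus = [
    "yesterday all my troubles seemed so far away",
    "all you need is love love is all you need",
]
terms = sorted({w for d in corpus for w in d.split()})

def set_of_words_vector(doc: str) -> list:
    present = set(doc.split())
    return [1 if t in present else 0 for t in terms]

for d in corpus:
    print(set_of_words_vector(d))
```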

10 / 22 Indexing techniques
"Bag of words" indexing
- Terms: every word that occurs in the corpus
- Weights: tf-idf (term frequency × inverse document frequency):
  tf-idf(t_k, d_j) = #(t_k, d_j) · log( |Tr| / #Tr(t_k) )
  where #(t_k, d_j) is the frequency of term t_k in document d_j, |Tr| is the number of documents in the training corpus, and #Tr(t_k) is the number of documents that t_k occurs in.
- Normalization: each weight is divided by the length of the document's weight vector (cosine normalization), w_kj = tf-idf(t_k, d_j) / sqrt( Σ_s tf-idf(t_s, d_j)² ).
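
The following sketch implements this weighting on a toy corpus: raw tf-idf exactly as in the formula above, followed by cosine normalization of each document vector.

```python
# Sketch of "bag of words" indexing with the slide's tf-idf weight:
#   tf-idf(t_k, d_j) = #(t_k, d_j) * log(|Tr| / #Tr(t_k))
# followed by cosine normalization of each document vector.
import math
from collections import Counter

corpus = [
    "yesterday all my troubles seemed so far away",
    "all you need is love love is all you need",
    "love love me do you know i love you",
]
terms = sorted({w for d in corpus for w in d.split()})
doc_freq = {t: sum(1 for d in corpus if t in d.split()) for t in terms}

def tfidf_vector(doc: str) -> list:
    counts = Counter(doc.split())
    raw = [counts[t] * math.log(len(corpus) / doc_freq[t]) for t in terms]
    norm = math.sqrt(sum(w * w for w in raw)) or 1.0
    return [w / norm for w in raw]

for d in corpus:
    print([round(w, 2) for w in tfidf_vector(d)])
```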

11 / 22 Indexing techniques
Phrase indexing
- Terms: all word sequences that occur in the corpus
- Weights: binary or tf-idf
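
"Word sequences" can be read in several ways (syntactic phrases, statistical phrases); the sketch below takes the simplest reading, contiguous word bigrams over a toy corpus, with binary weights.

```python
# Sketch of phrase indexing: terms are word n-grams (here, bigrams)
# that occur in the corpus; weights could equally be tf-idf as above.
corpus = [
    "all you need is love",
    "love is all you need",
]

def ngrams(doc: str, n: int = 2) -> list:
    words = doc.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase_terms = sorted({p for d in corpus for p in ngrams(d)})

def phrase_vector(doc: str) -> list:
    present = set(ngrams(doc))
    return [1 if p in present else 0 for p in phrase_terms]

print(phrase_terms)
print([phrase_vector(d) for d in corpus])
```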

12 / 22 Indexing techniques
"The Darmstadt Indexing Approach"
- Terms: properties of the words, documents, and categories
- Weights: various

13 / 22 Feature reduction techniques
- Remove function words (the, for, in, etc.)
- Remove the least frequent words:
  - in each document
  - in the corpus
- Remainder: low and mid-range frequency words
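
A hedged sketch of both reductions on a toy corpus: the stopword list and the document-frequency threshold are arbitrary illustrative choices, not values from the cited survey.

```python
# Sketch of the two reductions above: drop function words and drop
# the rarest terms (those below a document-frequency threshold).
from collections import Counter

corpus = [
    "the night we met i knew i needed you so",
    "so close the night is calling for the two of us",
    "strangers in the night exchanging glances",
]
function_words = {"the", "a", "an", "in", "for", "of", "we", "i",
                  "is", "so", "us", "you", "two"}
min_doc_freq = 2  # keep only terms that occur in at least 2 documents

doc_freq = Counter(t for d in corpus for t in set(d.split()))
terms = sorted(
    t for t, df in doc_freq.items()
    if t not in function_words and df >= min_doc_freq
)
print(terms)  # remaining content words: ['night']
```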

14 / 22 Feature reduction techniques
[Figure from Sebastiani 2002]

15 / 22 Feature reduction techniques
Latent Semantic Analysis (LSA):
- Search: "Demographic shifts in the U.S. with economic impact."
- Result: "The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West."
(Example from Sebastiani 2002)
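
As a rough illustration of the idea (not the exact system behind the example above), the sketch below projects tf-idf vectors into a two-dimensional latent space with a truncated SVD and compares a query to the documents there; it assumes scikit-learn is available, and the documents are toy stand-ins.

```python
# Sketch of LSA on a toy corpus: project tf-idf vectors onto a small
# number of latent dimensions, then compare a query to the documents
# in that latent space rather than by shared terms alone.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the nation grew to 249.6 million people in the 1980s",
    "americans left the industrial and agricultural heartlands",
    "guitar solos and drum fills dominate the recording",
]
query = ["demographic shifts in the united states with economic impact"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs + query)

svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Similarity of the query (last row) to each document in latent space.
print(cosine_similarity(Z[-1:], Z[:-1]))
```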

16 / 22 A word on speech
"Expert" feature reduction:
- Rhymingness
- Iambicness of meter
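
Neither feature is defined on the slide; as one crude, purely orthographic reading, "rhymingness" could be approximated as the fraction of adjacent lyric lines whose final words share an ending, as in the hypothetical sketch below (a real implementation would use a pronunciation dictionary).

```python
# A naive stand-in for a "rhymingness" feature: the fraction of
# adjacent line pairs whose final words end with the same few letters.
def rhymingness(lyrics: str, suffix_len: int = 2) -> float:
    endings = [line.split()[-1].lower()
               for line in lyrics.splitlines() if line.split()]
    if len(endings) < 2:
        return 0.0
    pairs = list(zip(endings, endings[1:]))
    rhymed = sum(1 for a, b in pairs if a[-suffix_len:] == b[-suffix_len:])
    return rhymed / len(pairs)

print(rhymingness("I saw the light\nIt burned so bright\nThen it was gone\nAnd I moved on"))
```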

17 / 22 Example: Hit song prediction
Goal: measure some unknown, global, intrinsic property
Features:
- Acoustic: Mel-frequency cepstral coefficients
- Lyric: probabilistic latent semantic analysis
Classifiers:
- Support vector machines
- Boosting classifiers
Corpus: 1700 #1 hits from 1956 to 2004
(Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London, UK. 488–91.)
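
This is not Dhanaraj and Logan's implementation; the sketch below only mirrors the general setup (feature vectors in, a binary hit/non-hit label out, an SVM evaluated by cross-validation) with random stand-in features, and assumes scikit-learn and NumPy are available.

```python
# Generic sketch of the experimental setup: features in, binary
# hit/non-hit label out, SVM classifier, cross-validated accuracy.
# The feature values are random stand-ins, not MFCC or PLSA features
# from the actual study.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_songs, n_features = 200, 20                 # stand-in for the real corpus
X = rng.normal(size=(n_songs, n_features))    # stand-in acoustic + lyric features
y = rng.integers(0, 2, size=n_songs)          # 1 = hit, 0 = not a hit

clf = SVC(kernel="rbf")
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```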

18 / 22 Example: Hit song detection
Results of PLSA: the best features are for contraindication, i.e. they are better at signalling that a song will not be a hit.

19 / 22 Example: Genre classification
(Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of the IEEE International Conference on Multimedia and Expo. 1–7.)

20 / 22 References
- Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34 (1): 1–47.
- Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London, UK. 488–91.
- Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of the IEEE International Conference on Multimedia and Expo. 1–7.
- Mahadero, J., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.
- Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. thesis, University of Edinburgh.

21 / 22 Query-by-asking

