Published by Cameron Napier. Modified over 10 years ago.
1
Machine Learning Approaches to the Analysis of Large Corpora: A Survey
Xunlei Rose Hu and Eric Atwell
University of Leeds
2
Introduction
Significant advances in Machine Learning approaches to the automatic analysis of corpora
A range of Machine Learning approaches
Three dimensions of classification
– Levels of linguistic analysis
– Machine Learning techniques
– Current research in Discourse Analysis
A framework for further development
3
Levels of linguistic analysis
Tokenisation
Part-of-Speech tagging
Parsing
Semantic analysis
Discourse analysis
4
Low-level Linguistic Analysis
Tokenisation: breaks up the sequence of characters in a text by locating the word boundaries
Part-of-Speech tagging: assigns the correct Part-of-Speech and additional grammatical features to each word
A forced move from hand-built to Machine Learning approaches
Many systems learn statistical models from a training corpus, e.g. CLAWS
Transformation-Based Learning is the most popular alternative approach
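As a rough illustration of tokenisation as "locating the word boundaries", the following minimal sketch uses a single regular expression. The pattern and the function name `tokenise` are assumptions made for this example; they are not taken from CLAWS or any system named in the survey, and real tokenisers handle many more cases (abbreviations, numbers, URLs).

```python
import re

# Hypothetical pattern: a run of word characters, optionally continued by
# internal apostrophes or hyphens, or else any single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenise(text):
    """Split a text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("The hand-built rules didn't scale.")
print(tokens)
```

A rule like this is exactly the kind of hand-built component that becomes hard to maintain as coverage grows, which may be one reason for the move toward learned models.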
5
Parsing and Semantic Analysis
Parsing: takes a formal grammar and a linguistic input, and applies the grammar to the input to produce a parse-tree
– Top-down and bottom-up parsing reflect contrasting perspectives
Semantic Analysis: augments data to facilitate automatic recognition of the underlying semantic content and structure
– A common practice is to label documents with thesaurus classes for document classification and management
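To make the top-down perspective concrete, here is a minimal sketch of a recursive-descent parser that starts from the sentence symbol and expands grammar rules until they match the input. The toy grammar, the lexicon, and all function names are hypothetical and chosen only for illustration; the grammar deliberately avoids left recursion, which this naive strategy cannot handle.

```python
# Hypothetical toy grammar: each non-terminal maps to its alternative expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
# Hypothetical lexicon: each word maps to a single pre-terminal category.
LEXICON = {"the": "DET", "dog": "N", "cat": "N", "sees": "V", "sleeps": "V"}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` can cover tokens from `pos`."""
    if symbol not in GRAMMAR:
        # Pre-terminal: succeed only if the next word has this category.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for production in GRAMMAR[symbol]:
        for children, end in parse_sequence(production, tokens, pos):
            yield (symbol, *children), end


def parse_sequence(symbols, tokens, pos):
    """Yield (subtrees, next_pos) for each way to match `symbols` in order."""
    if not symbols:
        yield [], pos
        return
    for first_tree, mid in parse(symbols[0], tokens, pos):
        for rest_trees, end in parse_sequence(symbols[1:], tokens, mid):
            yield [first_tree] + rest_trees, end


def parse_sentence(tokens):
    """Return the first parse tree that covers the whole input, or None."""
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("the dog sees the cat".split()))
```

A bottom-up parser would instead start from the words and combine them into larger constituents until it reaches `S`.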
6
Discourse Analysis
Discourse analysis extends beyond sentence boundaries
No universal agreement on discourse analysis categories or labels
A growing range of dialogue transcript corpora have been hand-annotated with dialogue-act or speech-act tags designed for specific applications
7
Machine Learning Techniques for Linguistic Annotation of Corpora
N-gram Markov models, HMMs
Neural Networks, Semantic Networks
Transformation-Based Learning
Decision-Tree classification
Vector-based clustering
8
N-gram and Markov Models
A Markov Model of a sequence of states or symbols (e.g. words or Part-of-Speech tags) is used to estimate the probability or likelihood of a symbol sequence
Hidden Markov Models (HMMs) are a variant with two layers of states:
– a visible layer corresponding to input symbols
– a hidden layer learnt by the system
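The idea of estimating the probability of a symbol sequence can be sketched with a first-order (bigram) Markov model trained by relative-frequency counts. This is a minimal sketch under stated assumptions: the tag sequences are invented, the start marker `<s>` and the function names are hypothetical, and no smoothing or end-of-sequence symbol is used, so unseen transitions receive probability zero.

```python
from collections import Counter


def train_bigram_model(sequences):
    """Count how often each symbol follows each preceding symbol."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # "<s>" marks the start of a sequence
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def sequence_probability(seq, context_counts, bigram_counts):
    """Multiply the estimated transition probabilities along the sequence."""
    prob = 1.0
    padded = ["<s>"] + list(seq)
    for prev, cur in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


# Invented Part-of-Speech tag sequences, for illustration only.
training = [["DET", "NOUN", "VERB"], ["DET", "NOUN"], ["NOUN", "VERB"]]
contexts, bigrams = train_bigram_model(training)
print(sequence_probability(["DET", "NOUN", "VERB"], contexts, bigrams))
```

An HMM adds a hidden state layer on top of this: the transition counts would then be over hidden states, with a separate emission distribution linking each hidden state to the visible symbols.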
9
Neural Networks, Semantic Networks
Neural networks have been developed in many fields in the hope of achieving human-like learning
A related model is the semantic network
– Typically nodes represent concepts
– Connections represent semantically meaningful associations between these concepts
10
Transformation-Based Learning
Brill (1995) developed a symbolic Machine Learning method called Transformation-Based Learning (TBL)
Given a tagged training corpus, TBL produces a sequence of rules that serves as a model of the training data
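The way TBL arrives at "a sequence of rules" can be sketched as a greedy loop: start from an initial tagging, repeatedly pick the rule that most reduces errors against the hand-tagged corpus, apply it, and stop when no rule helps. The sketch below is a simplification, not Brill's implementation: it assumes a single hypothetical rule template ("change tag A to B when the previous tag is C"), tests rule conditions against the tags as they stood before the rule fired, and uses an invented four-word example.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) across a tag sequence."""
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked on the tags as they were before this rule fired.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags


def count_errors(tags, gold):
    """Number of positions where the current tagging disagrees with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial, gold, tagset, max_rules=10):
    """Greedily learn an ordered rule list that moves `initial` toward `gold`."""
    current = list(initial)
    learned = []
    for _ in range(max_rules):
        best_rule, best_err = None, count_errors(current, gold)
        for rule in product(tagset, repeat=3):
            if rule[0] == rule[1]:
                continue  # a rule that rewrites a tag to itself does nothing
            err = count_errors(apply_rule(current, rule), gold)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:
            break  # no candidate rule reduces the error count any further
        current = apply_rule(current, best_rule)
        learned.append(best_rule)
    return learned, current


# Invented example: "to race the race", with "race" initially tagged NN everywhere.
initial = ["TO", "NN", "DT", "NN"]
gold = ["TO", "VB", "DT", "NN"]
rules, final = learn_rules(initial, gold, ["TO", "NN", "VB", "DT"])
print(rules, final)
```

The learned rule list is itself the model: tagging new text means assigning initial tags and then applying the rules in the order they were learned.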
11
Decision Tree Classification and Vector-Based Clustering
A decision tree is constructed by partitioning the training set: at each step, the feature that most reduces the uncertainty about the class in each partition is selected and used as a split
Vector-based clustering uses co-occurrence statistics to construct vectors that represent word classes or meanings by virtue of their direction in multi-dimensional word-collocation space
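The split-selection step of decision tree construction can be sketched with entropy as the measure of uncertainty and information gain as the reduction achieved by a feature. This is one common choice rather than the only one (CART, for instance, typically uses Gini impurity); the feature names and the four training examples below are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in label entropy obtained by partitioning on `feature`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Pick the feature whose split most reduces uncertainty about the class."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))


# Invented utterance features and dialogue-act labels, for illustration only.
rows = [
    {"ends_with_qmark": True, "is_long": True},
    {"ends_with_qmark": True, "is_long": False},
    {"ends_with_qmark": False, "is_long": True},
    {"ends_with_qmark": False, "is_long": False},
]
labels = ["question", "question", "statement", "statement"]
print(best_split(rows, labels))
```

A full tree builder would apply `best_split` recursively to each partition until the partitions are pure or no features remain.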
12
Discourse Analysis 1/2
1994: Woszczyna and Waibel – N-grams, Markov Model
1996: Reithinger, Engel, Kipp and Klesen – N-grams, HMM
1996: Mast et al. – Decision Trees, N-grams
1997: Reithinger and Klesen – N-grams, Bayesian network
13
Discourse Analysis 2/2
1998: Samuel, Carberry, and Vijay-Shanker – Transformation-Based Learning
1998: Wright – N-grams, CART Decision Tree, Neural Networks
1998: Taylor, King, Isard, and Wright – Combined N-grams and HMM
1998: Fukada et al. – Bi-grams, HMM
1998: Stolcke et al. – HMM, Decision Trees
14
Conclusion
This survey has explored algorithms underlying different levels of linguistic analysis, providing a framework for further research
Better to combine 2 or more ML approaches?
Discourse Analysis: HMM/n-grams + ano
Future work
– Explore systems which can be used and re-used
– Integrate such systems and comparatively evaluate Machine Learning techniques for corpus analysis