Published by Adela Wilson. Modified over 9 years ago.
1
807 - TEXT ANALYTICS
Massimo Poesio
Lab 2: (Quick intro to) SOLR
Document clustering with MAHOUT
2
Clustering
Many packages
– CLUTO
– Weka
– MALLET
MAHOUT
– Supported by the Apache Software Foundation
– Industrial strength (builds on top of Hadoop)
– Includes libraries for reading in index files in different formats, including Weka .arff files and Lucene index files
– We’ll use SOLR to produce Lucene index files
3
This Lab
Clustering with Mahout
Clustering with indices produced using Lucene: brief review of SOLR
4
MAHOUT
A machine learning framework
Built to be usable on top of Hadoop – scalability
What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering: Canopy / K-Means / Fuzzy K-Means / Mean-shift / Dirichlet
– Classifiers: Naïve Bayes, Complementary NB
– Evolutionary: integration with Watchmaker for fitness function
5
Basic format
bin/mahout
bin/mahout kmeans
bin/mahout seqdirectory
6
INPUT FORMAT
IMF p. 155: ‘for clustering Mahout relies on data in org.apache.mahout.matrix.Vector format’
– Vector = a tuple of floats
– SparseVector vs DenseVector
Several libraries for creating Vectors from other formats
– Weka
– Apache Lucene
– programmatic
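The difference between a dense and a sparse vector can be sketched in a few lines of Python. This is only an illustration of the idea, not Mahout's own Vector classes: the helper names `to_sparse`, `to_dense` and `dot` are made up for this example, and the sample vector is invented.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the non-zero entries, keyed by their position."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Expand a sparse vector back to a full list of floats."""
    return [sparse.get(i, 0.0) for i in range(size)]


def dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(i, 0.0) for i, v in a.items())


# A document over a 6-term vocabulary in which only two terms occur.
dense_doc = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
sparse_doc = to_sparse(dense_doc)
print(sparse_doc)
print(to_dense(sparse_doc, len(dense_doc)) == dense_doc)
```

Document vectors over a large vocabulary are mostly zeros, which is why a sparse representation is usually the natural choice for text.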
7
K-MEANS CLUSTERING
The Federalist papers example
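As a reminder of what k-means does, here is a minimal pure-Python sketch of the standard assign-then-recompute loop. It is not Mahout's implementation: it runs on one machine, uses squared Euclidean distance, and takes the first k points as initial centroids (an assumption made here only to keep the example deterministic). The toy points are invented and are not the Federalist papers data.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points: List[Point], k: int,
           max_iter: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Return (centroids, assignment) after at most max_iter iterations."""
    # Assumption: seed with the first k points instead of random sampling.
    centroids = [list(p) for p in points[:k]]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment


toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assignment = kmeans(toy_points, k=2)
print(centroids, assignment)
```

In the lab the points would be document vectors rather than 2-D coordinates, but the loop is the same.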
8
CONVERSION
The Reuters example
9
For more sophisticated indexing …
… can use SOLR for preprocessing; Mahout knows how to read in Lucene-style indices
10
What is Solr?
Solr is an open source enterprise search server based on the Lucene Java search library.
Solr runs in a Java servlet container such as Tomcat or Jetty.
Solr is free software and a project of the Apache Software Foundation.
Solr is a sub-project of Lucene and can be found at http://lucene.apache.org/solr/
By Mick England
11
Key Features
Advanced Full-Text search
Optimized for High Volume Web Traffic
Standards Based Open Interfaces – XML and HTTP
Comprehensive HTML Administration Interface
Server statistics exposed over JMX for monitoring
Scalability through efficient replication
Flexibility with XML configuration and Plugins
Push vs Crawl indexing method
12
Solr Clients
Solr can be integrated with, among others…
– Ruby
– PHP
– Java
– Python
– JSON
– Forrest/Cocoon
– C# or Deveel Solr Client or solrnet
– ColdFusion
– Drupal or apacheSolr project for Drupal
13
Why SOLR?
It can be used to preprocess documents and produce an index for them, which can then be used as their representation for clustering
14
Indexing
Push vs Crawl
schema.xml
Add documents
HTML interface
– Update
– Delete
– Commit
DataImportHandler
– For importing and indexing database content so that it can be searched
By Mick England
15
SOLR: what you should do
(Installing SOLR on your laptop: see Section 0 of Lab script)
Posting docs to SOLR
Searching
Getting the indexed docs
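Searching is done over HTTP, so a query is just a URL. The sketch below builds one with the standard `q`, `rows`, `fl` and `wt` request parameters. The host, port and path in `SOLR_SELECT_URL` are assumptions (they depend on how SOLR was installed and on its version), and the field name `text` in the sample query is hypothetical; use a field defined in your own schema.xml.

```python
from urllib.parse import urlencode

# Assumption: a local SOLR instance; adjust host, port and path to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(query: str, rows: int = 10, fields: str = "id,score") -> str:
    """Return a select URL asking for `rows` hits in JSON format."""
    params = {
        "q": query,     # the query string, e.g. field:term
        "rows": rows,   # maximum number of hits to return
        "fl": fields,   # comma-separated list of fields to return
        "wt": "json",   # response writer (output format)
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Hypothetical query against a field called "text".
print(build_query_url("text:hadoop"))
```

The resulting URL can be pasted into a browser or fetched with `urllib.request.urlopen` once the server is running.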
16
Posting documents to SOLR
SOLR documents
– fields
schema.xml
17
SOLR Documents: fields
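A SOLR document is a flat set of named fields, and documents are posted inside an XML `<add>` message. The following sketch builds such a message with Python's standard library. The helper name `build_add_message` and the field names `id` and `title` are illustrative assumptions; the fields you send must be declared in your schema.xml.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_add_message(docs: List[Dict[str, str]]) -> str:
    """Serialize documents as <add><doc><field name="...">...</field></doc></add>."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; field names must match the schema.
message = build_add_message([{"id": "doc1", "title": "Hello"}])
print(message)
```

The message is sent to the server's update handler, followed by a commit so that the new documents become searchable.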
18
Importing Lucene indices into MAHOUT
Use the lucene.vector option
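A call to lucene.vector needs at least the index directory, the field to vectorize, and output locations for the vectors and the dictionary. The sketch below only assembles and prints such a command line; it does not run Mahout. The flag names (`--dir`, `--field`, `--output`, `--dictOut`) are given as recalled from Mahout's documentation and should be checked against your version, and every path and the field name `text` are hypothetical placeholders.

```python
import shlex
from typing import List


def lucene_vector_command(index_dir: str, field: str, output: str,
                          dict_out: str, mahout: str = "bin/mahout") -> List[str]:
    """Assemble (but do not execute) a lucene.vector invocation."""
    return [
        mahout, "lucene.vector",
        "--dir", index_dir,     # directory containing the Lucene index
        "--field", field,       # indexed field to turn into vectors
        "--output", output,     # where to write the vectors
        "--dictOut", dict_out,  # where to write the term dictionary
    ]


# All paths below are placeholders, not locations from the lab script.
cmd = lucene_vector_command(
    index_dir="solr/data/index",
    field="text",
    output="vectors/out.vec",
    dict_out="vectors/dict.txt",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The chosen field generally has to be indexed with term vectors enabled in schema.xml, otherwise there is nothing for Mahout to read from the index.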