Published by Rosamund McCoy. Modified over 8 years ago.
1
Using Sequence Files
2
Mahout Installation
– wget http://apache.osuosl.org/mahout/0.9/mahout-distribution-0.9.tar.gz
– sudo tar zxvf mahout-distribution-0.9.tar.gz
– export MAHOUT_HOME=/opt/mahout-distribution-0.9
– export PATH=$MAHOUT_HOME/bin:$PATH
– export MAHOUT_LOCAL=
3
Mahout Testing
# Create a temporary working directory
mkdir /tmp/canopy
export WORK_DIR=/tmp/canopy
cd $WORK_DIR
# Download the sample data
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
# Upload it to Hadoop
hadoop fs -mkdir /user/hduser/testdata
hadoop fs -put ${WORK_DIR}/synthetic_control.data /user/hduser/testdata
# Run the example program
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
4
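Mahout Testing Result
As before, use a bundled tool to convert the output to a plain text file:
mahout seqdumper -i output/clusteredPoints -o clusteredPoints
The converted file is saved in the current directory; view it with cat or vi:
cat clusteredPoints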
5
HDFS Outline
– Creating sequence files from the command line
– Generating sequence files from code
– Reading sequence files from code
6
Introduction
– Mapping: the master node takes the original computational problem and divides it into smaller pieces. Each piece is then sent to a different worker node, called a mapper.
– Reducing: the outputs of all the mapper nodes are collected and reassembled, grouping together the results that share the same key.
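The two phases can be imitated on a single machine with a plain shell pipeline. This is only an analogy and not Hadoop or Mahout code: the function names `map_words`, `shuffle`, and `reduce_counts`, as well as the sample input, are made up for illustration.

```shell
#!/bin/sh
# Single-machine analogy of map -> shuffle -> reduce (not actual Hadoop code).

# Map: emit one "key<TAB>1" pair for every word of the input.
map_words() {
    tr -s ' ' '\n' | awk 'NF { print $1 "\t" 1 }'
}

# Shuffle: bring identical keys next to each other, which a MapReduce
# framework does between the two phases.
shuffle() {
    LC_ALL=C sort
}

# Reduce: sum the values that share the same key, then print keys in order.
reduce_counts() {
    awk -F '\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' | LC_ALL=C sort
}

result=$(printf 'rock pop rock\njazz rock\n' | map_words | shuffle | reduce_counts)
printf '%s\n' "$result"
```

For the sample input this prints one tab-separated line per key: `jazz 1`, `pop 1`, `rock 3`. In a real cluster the map and reduce steps would run on different worker nodes, which this sketch does not attempt to model.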
7
Getting Ready
Choose a folder (e.g., /mnt/new/)
– mkdir lastfm
– mkdir ./lastfm/original
– mkdir ./lastfm/sequencesfiles
– export WORK_DIR=/mnt/new/lastfm
– cd $WORK_DIR
Download the Lastfm dataset
– cd $WORK_DIR
– wget http://static.echonest.com/Lastfm-ArtistTags2007.tar.gz
– tar -xvzf Lastfm-ArtistTags2007.tar.gz
http://musicmachinery.com/2010/11/10/lastfm-artisttags2007/
8
Getting Ready
cp /mnt/new/lastfm/LastfmArtistTags2007/*.* /mnt/new/lastfm/original/
– Artists.txt: contains the artist registry
– Tags.txt: contains all the tags in the dataset
– ArtistTags.dat: lists all the associations between tags and artists
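Before converting anything, it may be worth confirming that the three files listed above actually landed in the target folder. The following is a minimal sketch; the `check_dataset` function and the `ORIGINAL_DIR` variable are hypothetical helpers, not part of the dataset or of Mahout, and the default path is simply the one used on the slide.

```shell
#!/bin/sh
# Report the line count of each expected dataset file, or flag it as missing.
# ORIGINAL_DIR is a hypothetical variable; it defaults to the slide's path.
ORIGINAL_DIR="${ORIGINAL_DIR:-/mnt/new/lastfm/original}"

check_dataset() {
    dir=$1
    missing=0
    for name in Artists.txt Tags.txt ArtistTags.dat; do
        if [ -f "$dir/$name" ]; then
            # tr strips the padding some wc implementations add
            printf '%s: %s lines\n' "$name" "$(wc -l < "$dir/$name" | tr -d ' ')"
        else
            printf '%s: missing\n' "$name"
            missing=1
        fi
    done
    # Non-zero exit status if at least one file was not found
    return "$missing"
}

check_dataset "$ORIGINAL_DIR" || echo "Some files were not copied into $ORIGINAL_DIR"
```

The function returns a non-zero status when a file is absent, so it can also gate a later step in a script.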
9
Getting Ready
Now it is time to convert our first file from its original format to Mahout's sequence file format
– mahout seqdirectory -i $WORK_DIR/original -o $WORK_DIR/sequencesfiles
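One way to check the conversion is to dump the generated sequence files back to text with the same `seqdumper` tool used in the clustering test earlier. This is a sketch under assumptions: `DUMP_FILE` is a hypothetical output name, the fallback for `WORK_DIR` is the path from the slides, and whether the paths resolve on the local disk or on HDFS depends on how `MAHOUT_LOCAL` is set in your environment.

```shell
#!/bin/sh
# Dump the generated sequence files to plain text for inspection.
# WORK_DIR falls back to the slides' path; DUMP_FILE is a hypothetical name.
WORK_DIR="${WORK_DIR:-/mnt/new/lastfm}"
DUMP_FILE="${DUMP_FILE:-$WORK_DIR/sequencesfiles.txt}"

if command -v mahout >/dev/null 2>&1; then
    # Same -i / -o options as in the clustering test above
    mahout seqdumper -i "$WORK_DIR/sequencesfiles" -o "$DUMP_FILE"
    # Show only the beginning; the full dump can be large
    head -n 20 "$DUMP_FILE"
else
    echo "mahout not found on PATH; skipping dump" >&2
fi
```

If the conversion worked, the dump should contain key/value records rather than an empty file; the exact layout of the records is not shown here because it was not part of the slides.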