High Energy Accelerator Research Organization (KEK)

Learning System to Classify Batch Jobs

Yutaka Kawai (*1), Adil Hasan (*2), Wataru Takase (*1), Takashi Sasaki (*1)
(*1) High Energy Accelerator Research Organization (KEK), Japan
(*2) King's College London, University of London, UK
yutaka.kawai@kek.jp
Background
▸ There are many different types of middleware that can be used to execute computations (Grid systems, Cloud stacks, batch schedulers, and so on).
▸ Each middleware makes use of compute-node resources to optimize the usage of its own nodes.
▸ Very few pieces of middleware optimize usage from the user's point of view across different types of job-processing middleware.
▸ However, implementing a middleware-selection component requires some communication between it and the clients in order to understand the characteristics of each job.
▸ Our goal is to provide an approach that uses resources more efficiently from the user's perspective, with the following policies:
▹ Our approach does not require the servers to install any particular kind of software.
▹ Our approach is a client-based system; there is no impact on, or change to, the server side.
▹ Our approach imposes only a small overhead on the server side.
▸ This study covers only the part about understanding the characteristics of each job, focusing on CPU and memory usage.
Why do we need to classify jobs?
▸ Data analysis usually requires lots of processing that places demands on CPU, storage, and memory.
▸ Users usually want the results now (or yesterday).
▸ Increasingly, more than one type of platform is on offer to users (e.g., grid, cloud).
▸ From the user perspective it is difficult to know beforehand which pattern gives maximum throughput.
▸ We analyze resource usage data while jobs are being processed (or run small test jobs before submitting many jobs).
Subsequent jobs can then be classified as CPU-intensive or memory-intensive.
Concept of the learning system
▸ The Learning System Server (LSS) exports an NFS share.
▸ Each compute node monitors CPU/memory utilization during job execution.
▸ The monitored statistical data are stored on the NFS share.
▸ The LSS database accumulates historical data built from the monitored statistics.
▸ The LSS judges the job type based on the historical data in the DB.
[Diagram: compute nodes in Torque clusters A, B, ... write monitoring data to the NFS export; the LSS stores it in its DB.]
Monitoring tool
▸ A tool for monitoring systems
▹ Dstat:
▪ Can be easily installed with "yum install dstat"
▪ Simply outputs resource utilization as a CSV file
▪ Python-based application (its API might be usable directly in the future)
▹ Options "-cm" (the resulting CSV layout is sketched below):
▪ -c: monitors CPU usage
▪ -m: monitors memory usage

Example of the dstat command:
dstat --output "/data/testdata/${filename}" -cm > /dev/null &
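For reference, with the "-cm" options dstat writes six CPU columns followed by four memory columns for each sample. A minimal sketch of reading such a CSV under that assumption (the file name is hypothetical, and the column order is the one used by dstat 0.7.x; treat it as an assumption for other versions):

import csv

# Assumed data-row layout of "dstat --output <file> -cm" (dstat 0.7.x):
#   columns 0-5: CPU    usr, sys, idl, wai, hiq, siq  (percent)
#   columns 6-9: memory used, buff, cach, free        (bytes)
with open('/data/testdata/sample.csv', 'rb') as f:
    for row in csv.reader(f):
        # Skip the job-name/basename lines and dstat's own header rows
        if len(row) < 10 or not row[0].replace('.', '', 1).isdigit():
            continue
        cpu_usr = float(row[0])
        mem_total = sum(float(v) for v in row[6:10])
        mem_usage = float(row[6]) / mem_total * 100
        print "cpu=%.1f%%  mem=%.1f%%" % (cpu_usr, mem_usage)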
Sample jobs
▸ The "stress" command is used for the sample jobs
▹ CPU-intensive job: "--cpu" option
▹ Memory-intensive job: "-m" option
▸ Specify the job name of each job
▹ Uses the "#PBS -N" option (this job name later serves as the key for classification)

CPU-intensive job (test1):
#!/bin/bash -l
#PBS -N test1
stress --cpu 1 --io 4 --timeout 10s

Memory-intensive job (test2):
#!/bin/bash -l
#PBS -N test2
stress -m 6 --vm-bytes 512M --timeout 10s

Adding the "--io" option makes the CPU utilization fluctuate.
Statistical data collection
▸ Prepare "prologue" and "epilogue" scripts that start and stop the monitoring tool, specifying the log location.

Flow: job submission (i.e., qsub) → prologue → job execution → epilogue

Prologue script:
#!/bin/bash
job_id=$1
job_name=$4
datename=`date +%Y%m%d%H%M%S`
basename="${datename}_${job_id}"
filename="${basename}.csv"
echo "$job_name" > "/data/testdata/${filename}"
echo "$basename" >> "/data/testdata/${filename}"
dstat --output "/data/testdata/${filename}" -cm > /dev/null &
ps_no=$!
echo "$ps_no" > /tmp/dstat_ps

Epilogue script:
#!/bin/bash
ps_no=`cat /tmp/dstat_ps`
echo "kill the process : " $ps_no
kill $ps_no
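In Torque, such scripts are typically installed on each compute node as mom_priv/prologue and mom_priv/epilogue under the pbs_mom home directory (e.g., /var/spool/torque/), owned by root and executable only by root; Torque passes the job id and the job name to the prologue as positional arguments, which is why the script reads $1 and $4.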
DB update (stores historical data)
▸ PostgreSQL is used for the DB, and Python as the language to access it
▸ Several useful Python packages:
▹ "psycopg2" to access PostgreSQL
▹ "csv" to parse the statistical data (CSV files)
▹ "numpy" to compute average values from the statistical data
▸ Each record contains four columns (a table-creation sketch follows below):
▹ job_name: job name
▹ cpu: average CPU utilization of the job
▹ mem: average memory utilization of the job
▹ job_id: identification value of the job (the basename written by the prologue, i.e. a timestamp followed by the Torque job id)
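The table definition itself is not shown in the presentation; a minimal sketch consistent with the four columns above might look like the following (the column types and the use of CREATE TABLE IF NOT EXISTS are assumptions, not taken from the slides):

import psycopg2

conn = psycopg2.connect("dbname=jobtypedb user=ykawai")
cur = conn.cursor()
# Assumed schema for the historical data; types are guesses consistent
# with the values written by update_db.py.
cur.execute("""
    CREATE TABLE IF NOT EXISTS jobtype (
        job_name text,   -- value given with "#PBS -N"
        cpu      real,   -- average CPU utilization (%) of one job run
        mem      real,   -- average memory utilization (%) of one job run
        job_id   text    -- basename written by the prologue (timestamp_jobid)
    )
""")
conn.commit()
cur.close()
conn.close()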
Sample code (update_db.py)

import sys, csv, numpy, glob, psycopg2

conn = psycopg2.connect("dbname=jobtypedb user=ykawai")
cur = conn.cursor()

for file in glob.glob('/data/testdata/*.csv'):
    cpu_dat = []    # per-sample CPU "usr" values for this job
    mem_dat = []    # per-sample memory usage percentages for this job
    f = open(file, 'rb')
    r = csv.reader(f)
    for i, row in enumerate(r):
        if i == 0:
            job_name = row[0]    # first line: job name written by the prologue
        if i == 1:
            job_id = row[0]      # second line: basename (timestamp_jobid)
        if i > 12:               # skip dstat's own CSV header rows
            cpu_dat.append(float(row[0]))
            mem_total = float(row[6]) + float(row[7]) + float(row[8]) + float(row[9])
            mem_usage = float(row[6]) / mem_total * 100    # used / (used+buff+cach+free)
            mem_dat.append(mem_usage)
    f.close()
    cpu_dat_num = numpy.array(cpu_dat)
    cpu_dat_ave = numpy.average(cpu_dat_num)
    mem_dat_num = numpy.array(mem_dat)
    mem_dat_ave = numpy.average(mem_dat_num)
    ### insert a record into the DB unless this job_id is already stored ###
    sql = ("INSERT INTO jobtype (job_name, cpu, mem, job_id) "
           "SELECT '" + job_name + "', " + str(cpu_dat_ave) + ", " + str(mem_dat_ave) + ", '" + job_id + "' "
           "WHERE NOT EXISTS (SELECT job_name FROM jobtype WHERE job_id = '" + job_id + "')")
    cur.execute(sql)
    conn.commit()

cur.close()
conn.close()
Judgment of job types
▸ Select the CPU and memory utilization records whose "job_name" column matches the job in question
▸ Compute the historical averages of CPU and memory utilization
▹ We hope to use a more intelligent calculation, such as a "recent" average (future work; a sketch follows after the sample code below)
▸ Compare CPU vs. memory: which one is used more?
Sample code (which_type.py)

import sys, psycopg2, numpy

argvs = sys.argv
job_name = argvs[1]    # job name given on the command line

conn = psycopg2.connect("dbname=jobtypedb user=ykawai")
cur = conn.cursor()

# Historical CPU utilization averages for this job name
cpu_dat = []
sql = "SELECT cpu FROM jobtype WHERE job_name = '" + job_name + "'"
cur.execute(sql)
for record in cur:
    cpu_dat.append(float(record[0]))
cpu_dat_num = numpy.array(cpu_dat)
cpu_dat_ave = numpy.average(cpu_dat_num)

# Historical memory utilization averages for this job name
mem_dat = []
sql = "SELECT mem FROM jobtype WHERE job_name = '" + job_name + "'"
cur.execute(sql)
for record in cur:
    mem_dat.append(float(record[0]))
mem_dat_num = numpy.array(mem_dat)
mem_dat_ave = numpy.average(mem_dat_num)

cur.close()
conn.close()

# Whichever resource is used more determines the job type
if cpu_dat_ave >= mem_dat_ave:
    job_type = "CPU"
else:
    job_type = "MEM"
print "This job type is " + job_type
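For example, once records for the sample jobs have accumulated, "python which_type.py test1" would typically print "This job type is CPU" for the CPU-intensive sample job shown earlier.

As a sketch of the "recent" average mentioned above: because the stored job_id begins with a YYYYMMDDHHMMSS timestamp written by the prologue, ordering by it approximates chronological order, so the queries could be restricted to the most recent records. A minimal, hypothetical variant (the helper name and the LIMIT of 5 are arbitrary choices, not taken from the presentation):

import numpy

def recent_average(cur, column, job_name, limit=5):
    # Average only the most recent "limit" records for this job name.
    # job_id starts with a timestamp, so descending lexicographic order
    # is approximately newest-first.
    sql = ("SELECT " + column + " FROM jobtype WHERE job_name = '" + job_name +
           "' ORDER BY job_id DESC LIMIT " + str(limit))
    cur.execute(sql)
    values = [float(record[0]) for record in cur]
    return numpy.average(numpy.array(values))

# In which_type.py, the plain averages could then be replaced by:
#   cpu_dat_ave = recent_average(cur, "cpu", job_name)
#   mem_dat_ave = recent_average(cur, "mem", job_name)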
Future Work
▸ Study other resource consumption cases (I/O, network, etc.)
▸ Decide how to make a judgment when resource usages are almost even
▸ Consider selection keys other than the job name (script name, binary name, including numbers or arguments, etc.)
▸ Automatically update the DB, judge the job type, and select an ideal resource target whenever a job is submitted
Acknowledgements
▸ The authors would like to thank Yoshiyuki Watase and Go Iwai at KEK for constructive discussions that improved the slide material.
Thank you for your attention!

Contact
Yutaka Kawai: yutaka.kawai@kek.jp
Mailing List (KEK): renkei@ml.post.kek.jp