From: Niu Yawei
Date: Fri, 9 Nov 2012 09:10:26 +0000 (-0500)
Subject: LUDOC-100 monitoring: jobstats manual
X-Git-Tag: 2.4.0~30
X-Git-Url: https://git.whamcloud.com/gitweb?a=commitdiff_plain;h=e975fd76ef0246cf83669ceabf9e5ae1d2d5cbbc;p=doc%2Fmanual.git

LUDOC-100 monitoring: jobstats manual

Update manual for jobstats.

Signed-off-by: Niu Yawei
Change-Id: Icfd85f788c1657b0601d06f2890db03068ea5c06
Reviewed-on: http://review.whamcloud.com/4500
Tested-by: Hudson
Reviewed-by: Andreas Dilger
Reviewed-by: Richard Henwood
---

diff --git a/LustreMonitoring.xml b/LustreMonitoring.xml
index 912c1c8..54c03b0 100644
--- a/LustreMonitoring.xml
+++ b/LustreMonitoring.xml
@@ -7,6 +7,9 @@ Lustre Changelogs + Lustre Jobstats + + Lustre Monitoring Tool
@@ -322,6 +325,178 @@ $ ln /mnt/lustre/mydir/foo/file /mnt/lustre/mydir/myhardlink
+
+ <indexterm><primary>jobstats</primary><see>monitoring</see></indexterm>
+<indexterm><primary>monitoring</primary></indexterm>
+<indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
+
+Lustre Jobstats
+ The Lustre Jobstats feature is available starting in Lustre version 2.3. It collects filesystem operation statistics for the jobs running on Lustre clients and exposes them via procfs on the servers. Job schedulers known to work with jobstats include SLURM, SGE, LSF, LoadLeveler, PBS, and Maui/MOAB.
+ Because Jobstats is implemented in a scheduler-agnostic manner, it is likely to work with other schedulers as well.
+
+ <indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
+Enable/Disable Jobstats
+ Jobstats are disabled by default. The current state of jobstats can be verified by checking lctl get_param jobid_var on a client:
+
+$ lctl get_param jobid_var
+jobid_var=disable
+
+ The Lustre Jobstats code extracts the job identifier from an environment variable set by the scheduler when the job is started. To enable jobstats, set jobid_var to the name of the environment variable set by the scheduler. For example, SLURM sets the SLURM_JOB_ID environment variable with the unique job ID on each client. To permanently enable Jobstats on the testfs filesystem:
+ $ lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID
+ The value of jobid_var can be:
+
+ Value            Job Scheduler
+ SLURM_JOB_ID     Simple Linux Utility for Resource Management (SLURM)
+ JOB_ID           Sun Grid Engine (SGE)
+ LSB_JOBID        Load Sharing Facility (LSF)
+ LOADL_STEP_ID    LoadLeveler
+ PBS_JOBID        Portable Batch System (PBS)/MAUI
+ procname_uid     process name and user ID (for debugging, or if no job scheduler is in use)
+ disable          disable jobstats
+
+ To disable jobstats, set jobid_var to disable:
+ $ lctl conf_param testfs.sys.jobid_var=disable
+
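The table of jobid_var values above can be captured in a small helper that builds the conf_param command for a given scheduler. This is a minimal Python sketch, not part of Lustre: the names SCHEDULER_JOBID_VARS and jobid_var_command are hypothetical, and the command is only constructed as an argument list, never executed.

```python
# Hypothetical helper, not part of Lustre: maps the schedulers named in the
# jobid_var table to the environment variable each one sets.
SCHEDULER_JOBID_VARS = {
    "SLURM": "SLURM_JOB_ID",
    "SGE": "JOB_ID",
    "LSF": "LSB_JOBID",
    "LoadLeveler": "LOADL_STEP_ID",
    "PBS": "PBS_JOBID",
}


def jobid_var_command(fsname, scheduler=None):
    """Build the lctl conf_param command that sets jobid_var for fsname.

    scheduler=None selects procname_uid, the value meant for debugging or
    for sites that run no job scheduler. The command is returned as an
    argument list and is not executed here.
    """
    if scheduler is None:
        value = "procname_uid"
    else:
        value = SCHEDULER_JOBID_VARS[scheduler]
    return ["lctl", "conf_param", f"{fsname}.sys.jobid_var={value}"]


# Show the command for the testfs filesystem under SLURM.
print(" ".join(jobid_var_command("testfs", "SLURM")))
```

Returning an argument list rather than a shell string keeps the filesystem name from being interpreted by a shell if the command is later handed to a process launcher.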
+
+ <indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
+Check Job Stats
+ Metadata operation statistics are collected on MDTs. These statistics can be accessed for all filesystems and all jobs on the MDT via lctl get_param mdt.*.job_stats. For example, with clients running with jobid_var=procname_uid:
+
+$ lctl get_param mdt.*.job_stats
+job_stats:
+- job_id: bash.0
+  snapshot_time: 1352084992
+  open: { samples: 2, unit: reqs }
+  close: { samples: 2, unit: reqs }
+  mknod: { samples: 0, unit: reqs }
+  link: { samples: 0, unit: reqs }
+  unlink: { samples: 0, unit: reqs }
+  mkdir: { samples: 0, unit: reqs }
+  rmdir: { samples: 0, unit: reqs }
+  rename: { samples: 0, unit: reqs }
+  getattr: { samples: 3, unit: reqs }
+  setattr: { samples: 0, unit: reqs }
+  getxattr: { samples: 0, unit: reqs }
+  setxattr: { samples: 0, unit: reqs }
+  statfs: { samples: 0, unit: reqs }
+  sync: { samples: 0, unit: reqs }
+  samedir_rename: { samples: 0, unit: reqs }
+  crossdir_rename: { samples: 0, unit: reqs }
+- job_id: dd.0
+  snapshot_time: 1352085037
+  open: { samples: 1, unit: reqs }
+  close: { samples: 1, unit: reqs }
+  mknod: { samples: 0, unit: reqs }
+  link: { samples: 0, unit: reqs }
+  unlink: { samples: 0, unit: reqs }
+  mkdir: { samples: 0, unit: reqs }
+  rmdir: { samples: 0, unit: reqs }
+  rename: { samples: 0, unit: reqs }
+  getattr: { samples: 0, unit: reqs }
+  setattr: { samples: 0, unit: reqs }
+  getxattr: { samples: 0, unit: reqs }
+  setxattr: { samples: 0, unit: reqs }
+  statfs: { samples: 0, unit: reqs }
+  sync: { samples: 2, unit: reqs }
+  samedir_rename: { samples: 0, unit: reqs }
+  crossdir_rename: { samples: 0, unit: reqs }
+
+ Data operation statistics are collected on OSTs and can be accessed via lctl get_param obdfilter.*.job_stats, for example:
+
+$ lctl get_param obdfilter.*.job_stats
+job_stats:
+- job_id: bash.0
+  snapshot_time: 1352085025
+  read: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
+  write: { samples: 1, unit: bytes, min: 4, max: 4, sum: 4 }
+  setattr: { samples: 0, unit: reqs }
+  punch: { samples: 0, unit: reqs }
+  sync: { samples: 0, unit: reqs }
+
+
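A monitoring script usually needs these counters as data rather than text. The following is a minimal Python sketch of one way to parse job_stats output of the shape shown above. The function name parse_job_stats is hypothetical, the sample is copied from the OST example, and the parser assumes the output of a single target: if several targets report the same job_id, later entries overwrite earlier ones.

```python
import re

# Matches the line that starts a new job entry, e.g. "- job_id: bash.0".
JOB_RE = re.compile(r"^- job_id:\s*(\S+)\s*$")
# Matches an indented "name: value" line inside a job entry.
FIELD_RE = re.compile(r"^\s+(\w+):\s*(.+?)\s*$")


def _convert(raw):
    """Return an int for purely numeric text, otherwise the text itself."""
    return int(raw) if raw.isdigit() else raw


def parse_job_stats(text):
    """Parse job_stats text into {job_id: {field: value or counter dict}}."""
    jobs = {}
    current = None
    for line in text.splitlines():
        match = JOB_RE.match(line)
        if match:
            current = {}
            jobs[match.group(1)] = current
            continue
        match = FIELD_RE.match(line)
        if match is None or current is None:
            continue  # header lines such as "job_stats:" are skipped
        key, value = match.groups()
        if value.startswith("{") and value.endswith("}"):
            counters = {}
            for pair in value[1:-1].split(","):
                name, _, raw = pair.partition(":")
                counters[name.strip()] = _convert(raw.strip())
            current[key] = counters
        else:
            current[key] = _convert(value)
    return jobs


SAMPLE = """\
job_stats:
- job_id: bash.0
  snapshot_time: 1352085025
  read: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
  write: { samples: 1, unit: bytes, min: 4, max: 4, sum: 4 }
  setattr: { samples: 0, unit: reqs }
  punch: { samples: 0, unit: reqs }
  sync: { samples: 0, unit: reqs }
"""

stats = parse_job_stats(SAMPLE)
print(stats["bash.0"]["write"]["sum"])  # prints 4
```

A regular-expression parser is used here only to keep the sketch free of third-party dependencies. Since the output resembles YAML, a YAML library may also be able to read it, but that is not verified here.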
+
+ <indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
+Clear Job Stats
+ Accumulated job statistics can be reset by writing to the job_stats proc file.
+ Clear statistics for all jobs on the local node:
+ $ lctl set_param obdfilter.*.job_stats=clear
+ Clear statistics for job 'dd.0' on lustre-MDT0000:
+ $ lctl set_param mdt.lustre-MDT0000.job_stats=dd.0
+
+
+ <indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
+Configure Auto-cleanup Interval
+ By default, if a job is inactive for 600 seconds (10 minutes), statistics for this job will be dropped. This expiration value can be changed temporarily via:
+ $ lctl set_param *.*.job_cleanup_interval={max_age}
+ It can also be changed permanently, for example to 700 seconds, via:
+ $ lctl conf_param testfs.mdt.job_cleanup_interval=700
+ The job_cleanup_interval can be set to 0 to disable auto-cleanup. Note that if auto-cleanup of Jobstats is disabled, all statistics will be kept in memory indefinitely, which may eventually consume all memory on the servers. In this case, any monitoring tool should explicitly clear individual job statistics as they are processed, as shown above.
+
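With auto-cleanup disabled, a monitoring tool has to clear each job itself once it has consumed that job's counters. The following Python sketch shows one possible shape for that loop. It assumes that writing a job ID to a job_stats parameter clears only that job. The function names clear_job and process_and_clear are hypothetical, and the command runner is injectable so the logic can be exercised on a machine without lctl.

```python
import subprocess


def clear_job(param, job_id, run=subprocess.run):
    """Clear one job's statistics by writing its job ID to a job_stats file.

    param is a parameter name such as "mdt.lustre-MDT0000.job_stats".
    Assumption: writing a job ID (rather than "clear") drops only that job.
    run defaults to subprocess.run and can be replaced for testing.
    """
    cmd = ["lctl", "set_param", f"{param}={job_id}"]
    run(cmd, check=True)
    return cmd


def process_and_clear(param, job_ids, handle, run=subprocess.run):
    """Call handle(job_id) for each job, then clear that job's statistics.

    A job is cleared only after handle() returns, so a failure in the
    handler leaves the job's counters on the server.
    """
    cleared = []
    for job_id in job_ids:
        handle(job_id)  # e.g. forward the job's counters to a database
        clear_job(param, job_id, run=run)
        cleared.append(job_id)
    return cleared
```

Clearing after handling, rather than before, means a crash in the tool loses no data, at the cost of possibly reporting the same counters twice after a restart.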
+
<indexterm><primary>monitoring</primary><secondary>Lustre Monitoring Tool</secondary></indexterm> Lustre Monitoring Tool