LU-10698 obdclass: allow specifying complex jobids 91/31691/7
author Andreas Dilger <andreas.dilger@intel.com>
Tue, 20 Mar 2018 09:45:36 +0000 (03:45 -0600)
committer Oleg Drokin <oleg.drokin@intel.com>
Mon, 9 Apr 2018 19:46:03 +0000 (19:46 +0000)
commit 6488c0ec57de2d188bd15e502917b762e3a9dd1d
tree 4d49c5f409b7959e1bf0306e36186fbc3100d07a
parent e3bc6e681666aa2c60ada5f997966efa31fae68c
LU-10698 obdclass: allow specifying complex jobids

Allow specifying a format string for the jobid_name variable to create
a jobid for processes on the client.  The jobid_name format is used in
three cases: when jobid_var=nodelocal, when jobid_name contains "%j",
or as a fallback when the variable named by jobid_var cannot be read
from the process environment.

The jobid_name string supports the following escape sequences:

    %e = executable name
    %g = group ID
    %h = hostname (system utsname)
    %j = jobid from jobid_var environment variable
    %p = process ID
    %u = user ID

Any unknown escape sequences are dropped. Other arbitrary characters
pass through unmodified, up to the maximum jobid string size of 32,
though whitespace within the jobid is not copied.
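
The expansion rules above can be illustrated with a small userspace
sketch.  This is not the code in lustre/obdclass/jobid.c: the names
jobid_ctx, append_nows() and jobid_expand() are hypothetical, the
per-process values are passed in by the caller rather than read from
the kernel, and stripping whitespace from expanded values (not only
from literal characters) is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define JOBID_SIZE 32	/* stand-in for LUSTRE_JOBID_SIZE, counts the NUL */

/* Hypothetical container for per-process values; not a Lustre type. */
struct jobid_ctx {
	const char *exe;	/* %e */
	unsigned int gid;	/* %g */
	const char *host;	/* %h */
	const char *env_job;	/* %j, NULL if the variable is unset */
	int pid;		/* %p */
	unsigned int uid;	/* %u */
};

/* Append src to dst without whitespace, keeping room for the NUL. */
static size_t append_nows(char *dst, size_t len, size_t size, const char *src)
{
	for (; *src != '\0' && len + 1 < size; src++)
		if (!isspace((unsigned char)*src))
			dst[len++] = *src;
	return len;
}

/* Expand fmt into buf (size bytes, NUL included); returns the length. */
static size_t jobid_expand(const char *fmt, const struct jobid_ctx *c,
			   char *buf, size_t size)
{
	char tmp[24];
	size_t len = 0;

	assert(size > 0);
	for (; *fmt != '\0' && len + 1 < size; fmt++) {
		if (*fmt != '%') {
			/* literal characters pass through, minus whitespace */
			if (!isspace((unsigned char)*fmt))
				buf[len++] = *fmt;
			continue;
		}
		if (*++fmt == '\0')
			break;	/* a lone trailing '%' is dropped */
		tmp[0] = '\0';
		switch (*fmt) {
		case 'e':
			len = append_nows(buf, len, size, c->exe);
			break;
		case 'h':
			len = append_nows(buf, len, size, c->host);
			break;
		case 'j':
			len = append_nows(buf, len, size,
					  c->env_job ? c->env_job : "");
			break;
		case 'g':
			snprintf(tmp, sizeof(tmp), "%u", c->gid);
			break;
		case 'p':
			snprintf(tmp, sizeof(tmp), "%d", c->pid);
			break;
		case 'u':
			snprintf(tmp, sizeof(tmp), "%u", c->uid);
			break;
		default:
			break;	/* unknown escape sequences are dropped */
		}
		len = append_nows(buf, len, size, tmp);
	}
	buf[len] = '\0';
	return len;
}
```

With exe "dd" and uid 500 in the context, the format "cluster2.%e.%u"
expands to "cluster2.dd.500" in this sketch, and any format longer than
JOBID_SIZE - 1 characters is truncated rather than overflowing buf.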

This allows, for example, specifying an arbitrary prefix, such as the
cluster name, in addition to the traditional "procname.uid" format,
to distinguish between jobs running on clients in different clusters:

    lctl set_param jobid_var=nodelocal jobid_name=cluster2.%e.%u
or
    lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e

To use an environment-specified JobID, if available, but fall back to
a static string for all processes that do not have a valid JobID:

    lctl set_param jobid_var=SLURM_JOB_ID jobid_name=unknown
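
The choice between the environment value and the jobid_name format,
as described in this message, can be sketched as a small decision
function.  The enum and jobid_pick_source() are hypothetical names for
illustration only, env_job stands for whatever was found in the process
environment (NULL if nothing), and any other special jobid_var settings
are outside this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result type; not part of the Lustre sources. */
enum jobid_source {
	JOBID_FROM_ENV,		/* use the environment value directly */
	JOBID_FROM_NAME,	/* build the jobid from jobid_name */
};

/*
 * Decide where the jobid comes from.  env_job is the value of the
 * jobid_var environment variable for the process, or NULL if the
 * lookup failed.
 */
static enum jobid_source jobid_pick_source(const char *jobid_var,
					   const char *jobid_name,
					   const char *env_job)
{
	assert(jobid_var != NULL && jobid_name != NULL);

	/* jobid_var=nodelocal: always use jobid_name */
	if (strcmp(jobid_var, "nodelocal") == 0)
		return JOBID_FROM_NAME;

	/* jobid_name embeds the environment jobid via %j */
	if (strstr(jobid_name, "%j") != NULL)
		return JOBID_FROM_NAME;

	/* environment lookup failed: fall back to jobid_name */
	if (env_job == NULL || env_job[0] == '\0')
		return JOBID_FROM_NAME;

	return JOBID_FROM_ENV;
}
```

For the last example, a process without SLURM_JOB_ID set takes the
fallback branch and is accounted under "unknown", while a process with
the variable set keeps its environment-specified JobID.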

Implementation notes:

LUSTRE_JOBID_SIZE already includes the trailing NUL, so do not use
"LUSTRE_JOBID_SIZE + 1" anywhere, as that is misleading.

Rename the "obd_jobid_node" variable to "obd_jobid_name" so that it
matches the /proc "jobid_name" parameter name and avoids confusion.

Rename "struct jobid_to_pid_map" to "jobid_pid_map" since this is
not actually mapping from a jobid *to* a PID, but the reverse.
Save jobid length, and reorder fields to avoid holes in structure.

Consolidate PID->jobid cache handling in jobid_get_from_cache(),
which only does environment lookups and caches the results.
The fallback to using obd_jobid_name is handled by the caller.

Rename check_job_name() to jobid_name_is_valid(), since that makes it
clear to the reader that a "true" return means the name is valid.

In jobid_cache_init() there is no benefit to locking the jobid_hash
creation, since the spinlock is only initialized in this same function,
so multiple concurrent callers would already be broken.

Pass the buffer size from the callers (who know the buffer size) to
lustre_get_jobid() instead of assuming it is LUSTRE_JOBID_SIZE.
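
A minimal userspace sketch of the two buffer-size notes: the constant
already counts the trailing NUL, and the callee is told the buffer size
by the caller instead of assuming it.  get_jobid() here is a
hypothetical stand-in, not the signature of lustre_get_jobid(), and
JOBID_SIZE only mirrors LUSTRE_JOBID_SIZE.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define JOBID_SIZE 32	/* stand-in for LUSTRE_JOBID_SIZE, counts the NUL */

/*
 * Hypothetical getter: copies value into buf, trusting the size the
 * caller passes in.  snprintf() writes at most size - 1 characters
 * plus the terminating NUL, so no "+ 1" is needed anywhere.
 */
static int get_jobid(char *buf, size_t size, const char *value)
{
	assert(buf != NULL && value != NULL);
	if (size == 0)
		return -1;
	snprintf(buf, size, "%s", value);
	return 0;
}
```

A caller declares "char jobid[JOBID_SIZE];" and passes sizeof(jobid);
a caller with a smaller buffer passes its own size and gets a shorter,
still NUL-terminated, string.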

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Change-Id: Iad350e87b446c7d2356718cf2e5f9563e63ebbe5
Reviewed-on: https://review.whamcloud.com/31691
Tested-by: Jenkins
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
12 files changed:
libcfs/libcfs/linux/linux-curproc.c
lustre/include/obd_class.h
lustre/include/uapi/linux/lustre/lustre_idl.h
lustre/llite/llite_internal.h
lustre/llite/llite_lib.c
lustre/llite/vvp_io.c
lustre/llite/vvp_object.c
lustre/obdclass/jobid.c
lustre/obdclass/linux/linux-module.c
lustre/obdclass/lprocfs_jobstats.c
lustre/ptlrpc/pack_generic.c
lustre/tests/sanity.sh