1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustremonitoring">
2 <title xml:id="lustremonitoring.title">Monitoring a Lustre File System</title>
3 <para>This chapter provides information on monitoring a Lustre file system and includes the
4 following sections:</para>
7 <para><xref linkend="dbdoclet.50438273_18711"/>Lustre Changelogs</para>
10 <para><xref linkend="dbdoclet.jobstats"/>Lustre Jobstats</para>
13 <para><xref linkend="dbdoclet.50438273_81684"/>Lustre Monitoring Tool</para>
16 <para><xref linkend="dbdoclet.50438273_80593"/>CollectL</para>
19 <para><xref linkend="dbdoclet.50438273_44185"/>Other Monitoring Options</para>
22 <section xml:id="dbdoclet.50438273_18711">
23 <title><indexterm><primary>change logs</primary><see>monitoring</see></indexterm>
24 <indexterm><primary>monitoring</primary></indexterm>
25 <indexterm><primary>monitoring</primary><secondary>change logs</secondary></indexterm>
27 Lustre Changelogs</title>
<para>The changelogs feature records events that change the file system namespace or file metadata. Changes such as file creation, deletion, renaming, and attribute modification are recorded with the target and parent file identifiers (FIDs), the name of the target, and a timestamp. These records can be used for a variety of purposes:</para>
31 <para>Capture recent changes to feed into an archiving system.</para>
34 <para>Use changelog entries to exactly replicate changes in a file system mirror.</para>
37 <para>Set up "watch scripts" that take action on certain events or directories.</para>
40 <para>Maintain a rough audit trail (file/directory changes with timestamps, but no user information).</para>
<para>Changelog record types are:</para>
44 <informaltable frame="all">
46 <colspec colname="c1" colwidth="50*"/>
47 <colspec colname="c2" colwidth="50*"/>
51 <para><emphasis role="bold">Value</emphasis></para>
54 <para><emphasis role="bold">Description</emphasis></para>
<para>MARK</para>
<para>Internal recordkeeping</para>
<para>CREAT</para>
<para>Regular file creation</para>
<para>MKDIR</para>
<para>Directory creation</para>
<para>HLINK</para>
<para>Hard link</para>
<para>SLINK</para>
<para>Soft link</para>
<para>MKNOD</para>
<para>Other file creation</para>
<para>UNLNK</para>
<para>Regular file removal</para>
<para>RMDIR</para>
<para>Directory removal</para>
<para>RNMFM</para>
<para>Rename, original</para>
<para>RNMTO</para>
<para>Rename, final</para>
<para>IOCTL</para>
<para>ioctl on file or directory</para>
<para>TRUNC</para>
<para>Regular file truncated</para>
<para>SATTR</para>
<para>Attribute change</para>
<para>XATTR</para>
<para>Extended attribute change</para>
<para>UNKNW</para>
<para>Unknown operation</para>
182 <para>FID-to-full-pathname and pathname-to-FID functions are also included to map target and parent FIDs into the file system namespace.</para>
184 <title><indexterm><primary>monitoring</primary><secondary>change logs</secondary></indexterm>
185 Working with Changelogs</title>
186 <para>Several commands are available to work with changelogs.</para>
<title><literal>lctl changelog_register</literal></title>
<para>Because changelog records take up space on the MDT, the system administrator must register changelog users. Registered users specify which records they are "done with", and the system purges records up to the highest record number that all registered users have cleared.</para>
192 <para>To register a new changelog user, run:</para>
<screen>lctl --device <replaceable>fsname</replaceable>-<replaceable>MDTnumber</replaceable> changelog_register</screen>
195 <para>Changelog entries are not purged beyond a registered user's set point (see <literal>lfs changelog_clear</literal>).</para>
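<para>The purge rule above can be pictured with a small sketch (a hypothetical illustration, not a Lustre API): the MDT may discard records only up to the highest record index that every registered changelog user has cleared.</para>

```python
# Hypothetical sketch (not a Lustre API): the MDT can purge changelog
# records only up to the highest record index that EVERY registered
# changelog user has already cleared.
def purge_point(user_clear_points):
    """user_clear_points maps changelog userid -> highest record index
    that user has cleared (0 = nothing cleared yet)."""
    if not user_clear_points:
        return 0
    return min(user_clear_points.values())

# cl1 has cleared through record 42, but cl2 only through record 17,
# so only records 1..17 may be purged.
print(purge_point({"cl1": 42, "cl2": 17}))  # 17
```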
<title><literal>lfs changelog</literal></title>
201 <para>To display the metadata changes on an MDT (the changelog records), run:</para>
202 <screen>lfs changelog <replaceable>fsname</replaceable>-<replaceable>MDTnumber</replaceable> [startrec [endrec]] </screen>
<para>The start and end record numbers are optional.</para>
204 <para>These are sample changelog records:</para>
205 <screen>2 02MKDIR 4298396676 0x0 t=[0x200000405:0x15f9:0x0] p=[0x13:0x15e5a7a3:0x0]\
207 3 01CREAT 4298402264 0x0 t=[0x200000405:0x15fa:0x0] p=[0x200000405:0x15f9:0\
209 4 06UNLNK 4298404466 0x0 t=[0x200000405:0x15fa:0x0] p=[0x200000405:0x15f9:0\
211 5 07RMDIR 4298405394 0x0 t=[0x200000405:0x15f9:0x0] p=[0x13:0x15e5a7a3:0x0]\
<title><literal>lfs changelog_clear</literal></title>
218 <para>To clear old changelog records for a specific user (records that the user no longer needs), run:</para>
219 <screen>lfs changelog_clear <replaceable>mdt_name</replaceable> <replaceable>userid</replaceable> <replaceable>endrec</replaceable></screen>
220 <para>The <literal>changelog_clear</literal> command indicates that changelog records previous to <replaceable>endrec</replaceable> are no longer of interest to a particular user <replaceable>userid</replaceable>, potentially allowing the MDT to free up disk space. An <literal><replaceable>endrec</replaceable></literal> value of 0 indicates the current last record. To run <literal>changelog_clear</literal>, the changelog user must be registered on the MDT node using <literal>lctl</literal>.</para>
<para>When all registered changelog users are done with records up to record number X, all records below X are deleted.</para>
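<para>The meaning of the <replaceable>endrec</replaceable> argument can be sketched as follows (a hypothetical illustration of the rule above, not Lustre code):</para>

```python
# Hypothetical sketch (not a Lustre API) of how the endrec argument to
# "lfs changelog_clear" is interpreted: 0 means "through the current
# last record"; any other value is used as-is.
def resolve_endrec(endrec, current_last_record):
    return current_last_record if endrec == 0 else endrec

print(resolve_endrec(0, 57))   # 57  (clear everything up to the last record)
print(resolve_endrec(30, 57))  # 30
```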
<title><literal>lctl changelog_deregister</literal></title>
227 <para>To deregister (unregister) a changelog user, run:</para>
228 <screen>lctl --device <replaceable>mdt_device</replaceable> changelog_deregister <replaceable>userid</replaceable> </screen>
229 <para> <literal>changelog_deregister cl1</literal> effectively does a <literal>changelog_clear cl1 0</literal> as it deregisters.</para>
233 <title>Changelog Examples</title>
234 <para>This section provides examples of different changelog commands.</para>
236 <title>Registering a Changelog User</title>
237 <para>To register a new changelog user for a device (<literal>lustre-MDT0000</literal>):</para>
238 <screen># lctl --device lustre-MDT0000 changelog_register
239 lustre-MDT0000: Registered changelog userid 'cl1'</screen>
242 <title>Displaying Changelog Records</title>
243 <para>To display changelog records on an MDT (<literal>lustre-MDT0000</literal>):</para>
244 <screen>$ lfs changelog lustre-MDT0000
245 1 00MARK 19:08:20.890432813 2010.03.24 0x0 t=[0x10001:0x0:0x0] p=[0:0x0:0x\
246 0] mdd_obd-lustre-MDT0000-0
247 2 02MKDIR 19:10:21.509659173 2010.03.24 0x0 t=[0x200000420:0x3:0x0] p=[0x61\
248 b4:0xca2c7dde:0x0] mydir
249 3 14SATTR 19:10:27.329356533 2010.03.24 0x0 t=[0x200000420:0x3:0x0]
250 4 01CREAT 19:10:37.113847713 2010.03.24 0x0 t=[0x200000420:0x4:0x0] p=[0x20\
251 0000420:0x3:0x0] hosts </screen>
<para>Changelog records include this information:</para>
<screen>rec#
operation_type(numerical/text)
timestamp
datestamp
flags
t=target_FID
p=parent_FID
target_name</screen>
261 <para>Displayed in this format:</para>
262 <screen>rec# operation_type(numerical/text) timestamp datestamp flags t=target_FID \
263 p=parent_FID target_name</screen>
264 <para>For example:</para>
265 <screen>4 01CREAT 19:10:37.113847713 2010.03.24 0x0 t=[0x200000420:0x4:0x0] p=[0x20\
266 0000420:0x3:0x0] hosts</screen>
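<para>For post-processing, a record in the format above can be split into its fields with a short script. This is only a sketch; the regular expression and field names below are assumptions for illustration, not a supported Lustre interface:</para>

```python
import re

# Rough parse of one changelog record line, based on the format:
# rec# operation_type(numerical/text) timestamp datestamp flags \
#   t=target_FID [p=parent_FID target_name]
RECORD_RE = re.compile(
    r"(?P<rec>\d+)\s+"
    r"(?P<opcode>\d{2})(?P<optype>\w+)\s+"
    r"(?P<time>[\d:.]+)\s+(?P<date>[\d.]+)\s+"
    r"(?P<flags>0x[0-9a-fA-F]+)\s+"
    r"t=\[(?P<target_fid>[^\]]+)\]"
    r"(?:\s+p=\[(?P<parent_fid>[^\]]+)\]\s+(?P<name>\S+))?"
)

line = ("4 01CREAT 19:10:37.113847713 2010.03.24 0x0 "
        "t=[0x200000420:0x4:0x0] p=[0x200000420:0x3:0x0] hosts")
m = RECORD_RE.match(line)
print(m.group("optype"), m.group("name"))  # CREAT hosts
```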
269 <title>Clearing Changelog Records</title>
270 <para>To notify a device that a specific user (<literal>cl1</literal>) no longer needs records (up to and including 3):</para>
271 <screen>$ lfs changelog_clear lustre-MDT0000 cl1 3</screen>
<para>To confirm that the <literal>changelog_clear</literal> operation was successful, run <literal>lfs changelog</literal>; only records after record 3 are listed:</para>
273 <screen>$ lfs changelog lustre-MDT0000
274 4 01CREAT 19:10:37.113847713 2010.03.24 0x0 t=[0x200000420:0x4:0x0] p=[0x20\
275 0000420:0x3:0x0] hosts</screen>
278 <title>Deregistering a Changelog User</title>
279 <para>To deregister a changelog user (<literal>cl1</literal>) for a specific device (<literal>lustre-MDT0000</literal>):</para>
280 <screen># lctl --device lustre-MDT0000 changelog_deregister cl1
281 lustre-MDT0000: Deregistered changelog user 'cl1'</screen>
282 <para>The deregistration operation clears all changelog records for the specified user (<literal>cl1</literal>).</para>
283 <screen>$ lfs changelog lustre-MDT0000
284 5 00MARK 19:13:40.858292517 2010.03.24 0x0 t=[0x40001:0x0:0x0] p=[0:0x0:0x\
0] mdd_obd-lustre-MDT0000-0</screen>
288 <para>MARK records typically indicate changelog recording status changes.</para>
292 <title>Displaying the Changelog Index and Registered Users</title>
293 <para>To display the current, maximum changelog index and registered changelog users for a specific device (<literal>lustre-MDT0000</literal>):</para>
294 <screen># lctl get_param mdd.lustre-MDT0000.changelog_users
mdd.lustre-MDT0000.changelog_users=current index: 8
ID    index
cl1   8</screen>
301 <title>Displaying the Changelog Mask</title>
302 <para>To show the current changelog mask on a specific device (<literal>lustre-MDT0000</literal>):</para>
303 <screen># lctl get_param mdd.lustre-MDT0000.changelog_mask
305 mdd.lustre-MDT0000.changelog_mask=
306 MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RNMFM RNMTO OPEN CLOSE IOCTL\
TRUNC SATTR XATTR HSM</screen>
311 <title>Setting the Changelog Mask</title>
312 <para>To set the current changelog mask on a specific device (<literal>lustre-MDT0000</literal>):</para>
313 <screen># lctl set_param mdd.lustre-MDT0000.changelog_mask=HLINK
314 mdd.lustre-MDT0000.changelog_mask=HLINK
315 $ lfs changelog_clear lustre-MDT0000 cl1 0
316 $ mkdir /mnt/lustre/mydir/foo
317 $ cp /etc/hosts /mnt/lustre/mydir/foo/file
$ ln /mnt/lustre/mydir/foo/file /mnt/lustre/mydir/myhardlink</screen>
320 <para>Only item types that are in the mask show up in the changelog.</para>
321 <screen>$ lfs changelog lustre-MDT0000
322 9 03HLINK 19:19:35.171867477 2010.03.24 0x0 t=[0x200000420:0x6:0x0] p=[0x20\
0000420:0x3:0x0] myhardlink</screen>
328 <section xml:id="dbdoclet.jobstats">
329 <title><indexterm><primary>jobstats</primary><see>monitoring</see></indexterm>
330 <indexterm><primary>monitoring</primary></indexterm>
331 <indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
333 Lustre Jobstats</title>
334 <para>The Lustre jobstats feature is available starting in Lustre software
335 release 2.3. It collects file system operation statistics for user processes
336 running on Lustre clients, and exposes them via procfs on the server using
337 the unique Job Identifier (JobID) provided by the job scheduler for each
338 job. Job schedulers known to be able to work with jobstats include:
339 SLURM, SGE, LSF, Loadleveler, PBS and Maui/MOAB.</para>
<para>Since jobstats is implemented in a scheduler-agnostic manner, it is
likely to work with other schedulers as well.</para>
343 <title><indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
344 How Jobstats Works</title>
345 <para>The Lustre jobstats code on the client extracts the unique JobID
346 from an environment variable within the user process, and sends this
347 JobID to the server with the I/O operation. The server tracks
statistics for operations whose JobID is given, indexed by that JobID.</para>
351 <para>A Lustre setting on the client, <literal>jobid_var</literal>,
352 specifies which variable to use. Any environment variable can be
353 specified. For example, SLURM sets the
354 <literal>SLURM_JOB_ID</literal> environment variable with the unique
355 job ID on each client when the job is first launched on a node, and
356 the <literal>SLURM_JOB_ID</literal> will be inherited by all child
357 processes started below that process.</para>
359 <para>Lustre can also be configured to generate a synthetic JobID from
360 the user's process name and User ID, by setting
361 <literal>jobid_var</literal> to a special value,
362 <literal>procname_uid</literal>.</para>
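<para>As a sketch of this synthetic JobID (an illustration only; the exact formatting below is an assumption based on sample job IDs such as <literal>mythfrontend.500</literal> shown later in this section):</para>

```python
# Sketch of the synthetic JobID produced by jobid_var=procname_uid:
# assumed "<command name>.<numeric UID>" form, matching job_id values
# such as "mythfrontend.500" in the job_stats examples in this chapter.
def synthetic_jobid(procname: str, uid: int) -> str:
    return f"{procname}.{uid}"

print(synthetic_jobid("mythfrontend", 500))  # mythfrontend.500
```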
364 <para>The setting of <literal>jobid_var</literal> need not be the same
365 on all clients. For example, one could use
366 <literal>SLURM_JOB_ID</literal> on all clients managed by SLURM, and
367 use <literal>procname_uid</literal> on clients not managed by SLURM,
368 such as interactive login nodes.</para>
<para>Only a single <literal>jobid_var</literal> setting is possible
on each node, since it is unlikely that multiple job schedulers are
active on one client. However, the actual JobID value is local to
each process environment, so multiple jobs with different JobIDs can
be active on a single client at one time.</para>
379 <title><indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
380 Enable/Disable Jobstats</title>
<para>Jobstats are disabled by default. The current state of jobstats
can be verified by checking <literal>jobid_var</literal> on a client:</para>
<screen>$ lctl get_param jobid_var
jobid_var=disable</screen>
<para>To enable jobstats on the <literal>testfs</literal> file system with SLURM:</para>
390 <screen># lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID</screen>
<para>The <literal>lctl conf_param</literal> command to enable or disable
jobstats should be run on the MGS as root. The change is persistent and
is propagated automatically to the MDS, OSS, and client nodes when it is
set on the MGS, as well as to each new client when it mounts the file
system.</para>
395 <para>To temporarily enable jobstats on a client, or to use a different
396 jobid_var on a subset of nodes, such as nodes in a remote cluster that
397 use a different job scheduler, or interactive login nodes that do not
398 use a job scheduler at all, run the <literal>lctl set_param</literal>
399 command directly on the client node(s) after the filesystem is mounted.
400 For example, to enable the <literal>procname_uid</literal> synthetic
401 JobID on a login node run:
402 <screen># lctl set_param jobid_var=procname_uid</screen>
403 The <literal>lctl set_param</literal> setting is not persistent, and will
404 be reset if the global <literal>jobid_var</literal> is set on the MGS or
405 if the filesystem is unmounted.</para>
406 <para>The following table shows the environment variables which are set
407 by various job schedulers. Set <literal>jobid_var</literal> to the value
408 for your job scheduler to collect statistics on a per job basis.</para>
409 <informaltable frame="all">
411 <colspec colname="c1" colwidth="50*"/>
412 <colspec colname="c2" colwidth="50*"/>
416 <para><emphasis role="bold">Job Scheduler</emphasis></para>
419 <para><emphasis role="bold">Environment Variable</emphasis></para>
426 <para>Simple Linux Utility for Resource Management (SLURM)</para>
429 <para>SLURM_JOB_ID</para>
<para>Sun Grid Engine (SGE)</para>
<para>JOB_ID</para>
442 <para>Load Sharing Facility (LSF)</para>
445 <para>LSB_JOBID</para>
450 <para>Loadleveler</para>
453 <para>LOADL_STEP_ID</para>
458 <para>Portable Batch Scheduler (PBS)/MAUI</para>
461 <para>PBS_JOBID</para>
466 <para>Cray Application Level Placement Scheduler (ALPS)</para>
469 <para>ALPS_APP_ID</para>
475 <para>There are two special values for <literal>jobid_var</literal>:
476 <literal>disable</literal> and <literal>procname_uid</literal>. To disable
477 jobstats, specify <literal>jobid_var</literal> as <literal>disable</literal>:</para>
478 <screen># lctl conf_param testfs.sys.jobid_var=disable</screen>
479 <para>To track job stats per process name and user ID (for debugging, or
480 if no job scheduler is in use on some nodes such as login nodes), specify
481 <literal>jobid_var</literal> as <literal>procname_uid</literal>:</para>
482 <screen># lctl conf_param testfs.sys.jobid_var=procname_uid</screen>
485 <title><indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
486 Check Job Stats</title>
<para>Metadata operation statistics are collected on MDTs. These statistics
can be accessed for all file systems and all jobs on the MDT via
<literal>lctl get_param mdt.*.job_stats</literal>. For example, with clients
running with <literal>jobid_var=procname_uid</literal>:</para>
<screen># lctl get_param mdt.*.job_stats
mdt.lustre-MDT0000.job_stats=
job_stats:
- job_id: bash.0
  snapshot_time: 1352084992
  open: { samples: 2, unit: reqs }
  close: { samples: 2, unit: reqs }
  mknod: { samples: 0, unit: reqs }
  link: { samples: 0, unit: reqs }
  unlink: { samples: 0, unit: reqs }
  mkdir: { samples: 0, unit: reqs }
  rmdir: { samples: 0, unit: reqs }
  rename: { samples: 0, unit: reqs }
  getattr: { samples: 3, unit: reqs }
  setattr: { samples: 0, unit: reqs }
  getxattr: { samples: 0, unit: reqs }
  setxattr: { samples: 0, unit: reqs }
  statfs: { samples: 0, unit: reqs }
  sync: { samples: 0, unit: reqs }
  samedir_rename: { samples: 0, unit: reqs }
  crossdir_rename: { samples: 0, unit: reqs }
- job_id: mythbackend.0
  snapshot_time: 1352084996
  open: { samples: 72, unit: reqs }
  close: { samples: 73, unit: reqs }
  mknod: { samples: 0, unit: reqs }
  link: { samples: 0, unit: reqs }
  unlink: { samples: 22, unit: reqs }
  mkdir: { samples: 0, unit: reqs }
  rmdir: { samples: 0, unit: reqs }
  rename: { samples: 0, unit: reqs }
  getattr: { samples: 778, unit: reqs }
  setattr: { samples: 22, unit: reqs }
  getxattr: { samples: 0, unit: reqs }
  setxattr: { samples: 0, unit: reqs }
  statfs: { samples: 19840, unit: reqs }
  sync: { samples: 33190, unit: reqs }
  samedir_rename: { samples: 0, unit: reqs }
  crossdir_rename: { samples: 0, unit: reqs }</screen>
<para>Data operation statistics are collected on OSTs and can be accessed via
<literal>lctl get_param obdfilter.*.job_stats</literal>, for example:</para>
<screen>$ lctl get_param obdfilter.*.job_stats
obdfilter.myth-OST0000.job_stats=
job_stats:
- job_id: mythcommflag.0
  snapshot_time: 1429714922
  read: { samples: 974, unit: bytes, min: 4096, max: 1048576, sum: 91530035 }
  write: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
  setattr: { samples: 0, unit: reqs }
  punch: { samples: 0, unit: reqs }
  sync: { samples: 0, unit: reqs }
obdfilter.myth-OST0001.job_stats=
job_stats:
- job_id: mythbackend.0
  snapshot_time: 1429715270
  read: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
  write: { samples: 1, unit: bytes, min: 96899, max: 96899, sum: 96899 }
  setattr: { samples: 0, unit: reqs }
  punch: { samples: 1, unit: reqs }
  sync: { samples: 0, unit: reqs }
obdfilter.myth-OST0002.job_stats=job_stats:
obdfilter.myth-OST0003.job_stats=job_stats:
obdfilter.myth-OST0004.job_stats=
job_stats:
- job_id: mythfrontend.500
  snapshot_time: 1429692083
  read: { samples: 9, unit: bytes, min: 16384, max: 1048576, sum: 4444160 }
  write: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
  setattr: { samples: 0, unit: reqs }
  punch: { samples: 0, unit: reqs }
  sync: { samples: 0, unit: reqs }
- job_id: mythbackend.500
  snapshot_time: 1429692129
  read: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
  write: { samples: 1, unit: bytes, min: 56231, max: 56231, sum: 56231 }
  setattr: { samples: 0, unit: reqs }
  punch: { samples: 1, unit: reqs }
  sync: { samples: 0, unit: reqs }</screen>
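<para>Because each job entry is a small YAML mapping, the output is easy to post-process. The following sketch (a hypothetical helper, not a Lustre utility) tallies the total number of sampled operations per job from output in the format shown above:</para>

```python
import re

# Hypothetical sketch: tally total sampled operations per job from
# `lctl get_param mdt.*.job_stats` output (format as shown above).
sample = """\
- job_id: bash.0
  snapshot_time: 1352084992
  open: { samples: 2, unit: reqs }
  close: { samples: 2, unit: reqs }
- job_id: mythbackend.0
  snapshot_time: 1352084996
  open: { samples: 72, unit: reqs }
  sync: { samples: 33190, unit: reqs }
"""

def rpcs_per_job(text):
    totals, job = {}, None
    for line in text.splitlines():
        m = re.match(r"- job_id:\s+(\S+)", line)
        if m:
            job = m.group(1)
            totals[job] = 0
            continue
        m = re.search(r"samples:\s+(\d+)", line)
        if m and job:
            totals[job] += int(m.group(1))
    return totals

print(rpcs_per_job(sample))  # {'bash.0': 4, 'mythbackend.0': 33262}
```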
575 <title><indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
576 Clear Job Stats</title>
<para>Accumulated job statistics can be reset by writing to the <literal>job_stats</literal> proc file.</para>
578 <para>Clear statistics for all jobs on the local node:</para>
579 <screen># lctl set_param obdfilter.*.job_stats=clear</screen>
<para>Clear statistics only for job <literal>bash.0</literal> on <literal>lustre-MDT0000</literal>:</para>
581 <screen># lctl set_param mdt.lustre-MDT0000.job_stats=bash.0</screen>
584 <title><indexterm><primary>monitoring</primary><secondary>jobstats</secondary></indexterm>
585 Configure Auto-cleanup Interval</title>
<para>By default, if a job is inactive for 600 seconds (10 minutes), the statistics for that job are dropped. This expiration time can be changed temporarily via:</para>
587 <screen># lctl set_param *.*.job_cleanup_interval={max_age}</screen>
588 <para>It can also be changed permanently, for example to 700 seconds via:</para>
589 <screen># lctl conf_param testfs.mdt.job_cleanup_interval=700</screen>
<para>The <literal>job_cleanup_interval</literal> can be set to 0 to disable auto-cleanup. Note that if auto-cleanup of jobstats is disabled, all statistics will be kept in memory indefinitely, which may eventually consume all memory on the servers. In this case, any monitoring tool should explicitly clear individual job statistics as they are processed, as shown above.</para>
593 <section xml:id="dbdoclet.50438273_81684">
<title><indexterm><primary>monitoring</primary>
596 <secondary>Lustre Monitoring Tool</secondary>
597 </indexterm> Lustre Monitoring Tool (LMT)</title>
598 <para>The Lustre Monitoring Tool (LMT) is a Python-based, distributed system that provides a
599 <literal>top</literal>-like display of activity on server-side nodes (MDS, OSS and portals
600 routers) on one or more Lustre file systems. It does not provide support for monitoring
601 clients. For more information on LMT, including the setup procedure, see:</para>
<para><link xl:href="https://github.com/chaos/lmt/wiki"
>https://github.com/chaos/lmt/wiki</link></para>
604 <para>LMT questions can be directed to:</para>
605 <para><link xl:href="mailto:lmt-discuss@googlegroups.com">lmt-discuss@googlegroups.com</link></para>
607 <section xml:id="dbdoclet.50438273_80593">
<title><literal>CollectL</literal></title>
611 <para><literal>CollectL</literal> is another tool that can be used to monitor a Lustre file
612 system. You can run <literal>CollectL</literal> on a Lustre system that has any combination of
613 MDSs, OSTs and clients. The collected data can be written to a file for continuous logging and
played back at a later time. It can also be converted to a format suitable for plotting.</para>
616 <para>For more information about <literal>CollectL</literal>, see:</para>
617 <para><link xl:href="http://collectl.sourceforge.net">http://collectl.sourceforge.net</link></para>
618 <para>Lustre-specific documentation is also available. See:</para>
619 <para><link xl:href="http://collectl.sourceforge.net/Tutorial-Lustre.html">http://collectl.sourceforge.net/Tutorial-Lustre.html</link></para>
621 <section xml:id="dbdoclet.50438273_44185">
622 <title><indexterm><primary>monitoring</primary><secondary>additional tools</secondary></indexterm>
623 Other Monitoring Options</title>
624 <para>A variety of standard tools are available publicly including the following:<itemizedlist>
626 <para><literal>lltop</literal> - Lustre load monitor with batch scheduler integration.
627 <link xmlns:xlink="http://www.w3.org/1999/xlink"
628 xlink:href="https://github.com/jhammond/lltop"
629 >https://github.com/jhammond/lltop</link></para>
<para><literal>tacc_stats</literal> - A job-oriented system monitoring, analysis, and
633 visualization tool that probes Lustre interfaces and collects statistics. <link
634 xmlns:xlink="http://www.w3.org/1999/xlink"
635 xlink:href="https://github.com/jhammond/tacc_stats"/></para>
638 <para><literal>xltop</literal> - A continuous Lustre monitor with batch scheduler
639 integration. <link xmlns:xlink="http://www.w3.org/1999/xlink"
640 xlink:href="https://github.com/jhammond/xltop"/></para>
642 </itemizedlist></para>
<para>Another option is to script a simple monitoring solution that looks at various reports
from <literal>ifconfig</literal>, as well as the <literal>procfs</literal> files generated by
the Lustre software.</para>