<?xml version='1.0' encoding='utf-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
xml:id="backupandrestore">
  <title xml:id="backupandrestore.title">Backing Up and Restoring a File
  System</title>
  <para>This chapter describes how to back up and restore a Lustre file
  system at the file system level, device level and file level. Each
  backup approach is described in the following sections:</para>
  <itemizedlist>
    <listitem>
      <para>
        <xref linkend="dbdoclet.backup_file"/>
      </para>
    </listitem>
    <listitem>
      <para>
        <xref linkend="dbdoclet.backup_device"/>
      </para>
    </listitem>
    <listitem>
      <para>
        <xref linkend="dbdoclet.backup_target_filesystem"/>
      </para>
    </listitem>
    <listitem>
      <para>
        <xref linkend="dbdoclet.restore_target_filesystem"/>
      </para>
    </listitem>
    <listitem>
      <para>
        <xref linkend="dbdoclet.backup_lvm_snapshot"/>
      </para>
    </listitem>
  </itemizedlist>
  <para>It is <emphasis>strongly</emphasis> recommended that sites perform
  periodic device-level backups of the MDT(s)
  (<xref linkend="dbdoclet.backup_device"/>),
  for example twice a week with alternate backups going to a separate
  device, even if there is not enough capacity to do a full backup of all
  of the file system data.  Even if there are separate file-level backups of
  some or all files in the file system, having a device-level backup of the
  MDT can be very useful in case of MDT failure or corruption.  Being able to
  restore a device-level MDT backup can avoid the significantly longer process
  of restoring the entire file system from backup.  Since the MDT is required
  for access to all files, its loss would otherwise force a full restore of
  the file system (if that is even possible) even if the OSTs are still
  intact.</para>
  <para>A periodic device-level MDT backup can be done relatively
  inexpensively because the backup storage only needs to be connected to the
  primary MDS (it can be manually connected to the backup MDS in the rare case
  it is needed), and it only needs good linear read/write performance.  While
  a device-level MDT backup is not useful for restoring individual files,
  it is the most efficient way to handle MDT failure or corruption.</para>
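  <para>As an illustration of the alternating scheme suggested above, the
  following sketch picks the destination device from a running backup
  number, so that consecutive backups never overwrite the most recent good
  copy. The function name and its arguments are hypothetical and are not
  part of the Lustre software; how the backup number is tracked is left to
  the site.</para>
  <screen># Hypothetical helper: print the device that receives backup number N.
# Even-numbered backups go to DEV_A, odd-numbered backups to DEV_B.
pick_backup_dev() {
    n=$1
    dev_a=$2
    dev_b=$3
    if [ $(( n % 2 )) -eq 0 ]; then
        echo "$dev_a"
    else
        echo "$dev_b"
    fi
}

# Example: pick_backup_dev 7 /dev/backup_a /dev/backup_b
# prints /dev/backup_b</screen>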
  <section xml:id="dbdoclet.backup_file">
    <title>
    <indexterm>
      <primary>backup</primary>
    </indexterm>
    <indexterm>
      <primary>restoring</primary>
      <see>backup</see>
    </indexterm>
    <indexterm>
      <primary>LVM</primary>
      <see>backup</see>
    </indexterm>
    <indexterm>
      <primary>rsync</primary>
      <see>backup</see>
    </indexterm>Backing up a File System</title>
    <para>Backing up a complete file system gives you full control over the
    files to back up, and allows restoration of individual files as needed.
    File system-level backups are also the easiest to integrate into existing
    backup solutions.</para>
    <para>File system backups are performed from a Lustre client (or many
    clients working in parallel in different directories) rather than on
    individual server nodes; this is no different than backing up any other
    file system.</para>
    <para>However, due to the large size of most Lustre file systems, it is
    not always possible to get a complete backup. We recommend that you back
    up subsets of a file system, such as subdirectories of the entire
    file system, filesets for a single user, or incremental sets of files
    selected by date, so that restores can be done more efficiently.</para>
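    <para>One way to split such a backup across several clients is to divide
    the top-level directories among them. The following sketch is only an
    illustration: the function name is hypothetical, and assigning entries
    round-robin by position does not balance them by size.</para>
    <screen># Hypothetical helper: print the top-level entries of DIR that client
# number K (counting from 0) out of N clients should back up.
# Entries are assigned round-robin in the sorted order returned by the
# shell glob, so every client must use the same locale. Hidden (dot)
# entries are not matched by the glob.
subset_for_client() {
    dir=$1
    k=$2
    n=$3
    i=0
    for entry in "$dir"/*; do
        # Skip the unexpanded pattern when DIR is empty.
        [ -e "$entry" ] || continue
        if [ $(( i % n )) -eq "$k" ]; then
            echo "$entry"
        fi
        i=$(( i + 1 ))
    done
}

# Example on client 1 of 4 (paths are placeholders):
#   subset_for_client /mnt/lustre 1 4 | tar czf /backup/part1.tgz -T -</screen>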
    <note>
      <para>Lustre internally uses a 128-bit file identifier (FID) for all
      files. To interface with user applications, 64-bit inode numbers
      are returned by the <literal>stat()</literal>,
      <literal>fstat()</literal>, and
      <literal>readdir()</literal> system calls to 64-bit applications, and
      32-bit inode numbers are returned to 32-bit applications.</para>
      <para>Some 32-bit applications accessing Lustre file systems (on both
      32-bit and 64-bit CPUs) may experience problems with the
      <literal>stat()</literal>,
      <literal>fstat()</literal> or
      <literal>readdir()</literal> system calls under certain circumstances,
      even though the Lustre client should return 32-bit inode numbers to these
      applications.</para>
      <para>In particular, if the Lustre file system is exported from a 64-bit
      client via NFS to a 32-bit client, the Linux NFS server will export
      64-bit inode numbers to applications running on the NFS client. If the
      32-bit applications are not compiled with Large File Support (LFS), then
      these system calls return
      <literal>EOVERFLOW</literal> errors when the applications access the
      Lustre files. To avoid this problem, Linux NFS clients can use the kernel
      command-line option "<literal>nfs.enable_ino64=0</literal>", which forces
      the NFS client to return 32-bit inode numbers to applications.</para>
      <para>
      <emphasis role="bold">Workaround</emphasis>: We very strongly recommend
      that backups using
      <literal>tar(1)</literal> and other utilities that depend on the inode
      number to uniquely identify an inode be run on 64-bit clients. The
      128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit
      inode number, and as a result these utilities may operate incorrectly on
      32-bit clients.  While there is still a small chance of inode number
      collisions with 64-bit inodes, the FID allocation pattern is designed
      to avoid collisions for long periods of usage.</para>
    </note>
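    <para>To see whether a directory tree contains inode numbers that do not
    fit in 32 bits, and that could therefore trigger the problems described
    above, something like the following sketch can be run on a 64-bit client.
    It relies on the <literal>-printf</literal> option of GNU
    <literal>find</literal>; the function names are hypothetical.</para>
    <screen># Hypothetical helper: succeed if inode number INO fits in 32 bits.
ino_fits_32bit() {
    [ "$1" -le 4294967295 ]
}

# Hypothetical helper: print every path under DIR (on the same file
# system) whose inode number needs more than 32 bits.
# Paths containing newlines are not handled.
large_inodes() {
    find "$1" -xdev -printf '%i %p\n' | while read -r ino path; do
        if ! ino_fits_32bit "$ino"; then
            echo "$path"
        fi
    done
}

# Example: large_inodes /mnt/lustre/project | head</screen>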
    <section remap="h3">
      <title>
      <indexterm>
        <primary>backup</primary>
        <secondary>rsync</secondary>
      </indexterm>Lustre_rsync</title>
      <para>The
      <literal>lustre_rsync</literal> feature keeps the entire file system in
      sync on a backup by replicating the file system's changes to a second
      file system (the second file system need not be a Lustre file system, but
      it must be sufficiently large).
      <literal>lustre_rsync</literal> uses Lustre changelogs to efficiently
      synchronize the file systems without having to scan (directory walk) the
      Lustre file system. This efficiency is critically important for large
      file systems, and distinguishes the Lustre
      <literal>lustre_rsync</literal> feature from other replication/backup
      solutions.</para>
      <section remap="h4">
        <title>
        <indexterm>
          <primary>backup</primary>
          <secondary>rsync</secondary>
          <tertiary>using</tertiary>
        </indexterm>Using Lustre_rsync</title>
        <para>The
        <literal>lustre_rsync</literal> feature works by periodically running
        <literal>lustre_rsync</literal>, a userspace program used to
        synchronize changes in the Lustre file system onto the target file
        system. The
        <literal>lustre_rsync</literal> utility keeps a status file, which
        enables it to be safely interrupted and restarted without losing
        synchronization between the file systems.</para>
        <para>The first time that
        <literal>lustre_rsync</literal> is run, the user must specify a set of
        parameters for the program to use. These parameters are described in
        the following table and in
        <xref linkend="dbdoclet.50438219_63667" />. On subsequent runs, these
        parameters are stored in the status file, and only the name of the
        status file needs to be passed to
        <literal>lustre_rsync</literal>.</para>
        <para>Before using
        <literal>lustre_rsync</literal>:</para>
        <itemizedlist>
          <listitem>
            <para>Register the changelog user. For details, see the
            <literal>changelog_register</literal> parameter of
            <literal>lctl</literal> in
            <xref linkend="systemconfigurationutilities" />.</para>
          </listitem>
        </itemizedlist>
        <para>- AND -</para>
        <itemizedlist>
          <listitem>
            <para>Verify that the Lustre file system (source) and the replica
            file system (target) are identical
            <emphasis>before</emphasis> registering the changelog user. If the
            file systems differ, use a utility such as regular
            <literal>rsync</literal> (not
            <literal>lustre_rsync</literal>) to make them identical.</para>
          </listitem>
        </itemizedlist>
        <para>The
        <literal>lustre_rsync</literal> utility uses the following
        parameters:</para>
        <informaltable frame="all">
          <tgroup cols="2">
            <colspec colname="c1" colwidth="3*" />
            <colspec colname="c2" colwidth="10*" />
            <thead>
              <row>
                <entry>
                  <para>
                    <emphasis role="bold">Parameter</emphasis>
                  </para>
                </entry>
                <entry>
                  <para>
                    <emphasis role="bold">Description</emphasis>
                  </para>
                </entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry>
                  <para>
                    <literal>--source=<replaceable>src</replaceable></literal>
                  </para>
                </entry>
                <entry>
                  <para>The path to the root of the Lustre file system (source)
                  which will be synchronized. This is a mandatory option if a
                  valid status log created during a previous synchronization
                  operation (<literal>--statuslog</literal>) is not
                  specified.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--target=<replaceable>tgt</replaceable></literal>
                  </para>
                </entry>
                <entry>
                  <para>The path to the root where the source file system will
                  be synchronized (target). This is a mandatory option if the
                  status log created during a previous synchronization
                  operation (<literal>--statuslog</literal>) is not specified.
                  This option can be repeated if multiple synchronization
                  targets are desired.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--mdt=<replaceable>mdt</replaceable></literal>
                  </para>
                </entry>
                <entry>
                  <para>The metadata device to be synchronized. A changelog
                  user must be registered for this device. This is a mandatory
                  option if a valid status log created during a previous
                  synchronization operation (<literal>--statuslog</literal>)
                  is not specified.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--user=<replaceable>userid</replaceable></literal>
                  </para>
                </entry>
                <entry>
                  <para>The changelog user ID for the specified MDT. To use
                  <literal>lustre_rsync</literal>, the changelog user must be
                  registered. For details, see the
                  <literal>changelog_register</literal> parameter of
                  <literal>lctl</literal> in
                  <xref linkend="systemconfigurationutilities" />. This is a
                  mandatory option if a valid status log created during a
                  previous synchronization operation
                  (<literal>--statuslog</literal>) is not specified.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--statuslog=<replaceable>log</replaceable></literal>
                  </para>
                </entry>
                <entry>
                  <para>A log file to which synchronization status is saved.
                  When the
                  <literal>lustre_rsync</literal> utility starts, if the status
                  log from a previous synchronization operation is specified,
                  the state is read from the log, and the otherwise mandatory
                  <literal>--source</literal>,
                  <literal>--target</literal> and
                  <literal>--mdt</literal> options can be omitted. Specifying
                  the
                  <literal>--source</literal>,
                  <literal>--target</literal> and/or
                  <literal>--mdt</literal> options in addition to the
                  <literal>--statuslog</literal> option overrides the
                  corresponding parameters stored in the status log; command
                  line options take precedence over options in the status
                  log.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--xattr <replaceable>yes|no</replaceable></literal>
                  </para>
                </entry>
                <entry>
                  <para>Specifies whether extended attributes
                  (<literal>xattrs</literal>) are synchronized or not. The
                  default is to synchronize extended attributes.</para>
                  <note>
                    <para>Disabling xattrs causes Lustre striping information
                    not to be synchronized.</para>
                  </note>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--verbose</literal>
                  </para>
                </entry>
                <entry>
                  <para>Produces verbose output.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--dry-run</literal>
                  </para>
                </entry>
                <entry>
                  <para>Shows the commands
                  (<literal>copy</literal>,
                  <literal>mkdir</literal>, etc.) that
                  <literal>lustre_rsync</literal> would run on the target file
                  system without actually executing them.</para>
                </entry>
              </row>
              <row>
                <entry>
                  <para>
                    <literal>--abort-on-err</literal>
                  </para>
                </entry>
                <entry>
                  <para>Stops processing the
                  <literal>lustre_rsync</literal> operation if an error occurs.
                  The default is to continue the operation.</para>
                </entry>
              </row>
            </tbody>
          </tgroup>
        </informaltable>
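        <para>Because the parameters are stored in the status log after the
        first run, a wrapper script only has to decide whether to pass the full
        set of options or just <literal>--statuslog</literal>. The following
        sketch prints the resulting command line instead of running it; the
        function name is hypothetical, and it assumes paths without
        spaces.</para>
        <screen># Hypothetical helper: print the lustre_rsync command line for one run.
# Arguments: SOURCE TARGET MDT USER STATUSLOG
lrsync_cmd() {
    src=$1
    tgt=$2
    mdt=$3
    user=$4
    log=$5
    if [ -s "$log" ]; then
        # A non-empty status log already holds the other parameters.
        echo "lustre_rsync --statuslog=$log"
    else
        # First run: all mandatory options must be given.
        echo "lustre_rsync --source=$src --target=$tgt --mdt=$mdt --user=$user --statuslog=$log"
    fi
}

# Example: execute the printed command on a Lustre client.
#   $(lrsync_cmd /mnt/lustre /mnt/target testfs-MDT0000 cl1 sync.log)</screen>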
      </section>
      <section remap="h4">
        <title>
        <indexterm>
          <primary>backup</primary>
          <secondary>rsync</secondary>
          <tertiary>examples</tertiary>
        </indexterm>
        <literal>lustre_rsync</literal> Examples</title>
        <para>Sample
        <literal>lustre_rsync</literal> commands are listed below.</para>
        <para>Register a changelog user for an MDT (e.g.
        <literal>testfs-MDT0000</literal>).</para>
        <screen># lctl --device testfs-MDT0000 changelog_register testfs-MDT0000
Registered changelog userid 'cl1'</screen>
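        <para>In a script, the user ID can be taken from the message printed
        by <literal>changelog_register</literal> and passed on to
        <literal>lustre_rsync --user</literal>. This sketch assumes the message
        contains the text shown in the example above, which may vary between
        Lustre versions; the function name is hypothetical.</para>
        <screen># Hypothetical helper: read changelog_register output on stdin and
# print the registered user ID (for example cl1). Prints nothing if the
# expected message is not found.
changelog_userid() {
    sed -n "s/.*Registered changelog userid '\([^']*\)'.*/\1/p"
}

# Example:
#   user=$(lctl --device testfs-MDT0000 changelog_register testfs-MDT0000 |
#          changelog_userid)</screen>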
        <para>Synchronize a Lustre file system
        (<literal>/mnt/lustre</literal>) to a target file system
        (<literal>/mnt/target</literal>).</para>
        <screen>$ lustre_rsync --source=/mnt/lustre --target=/mnt/target \
           --mdt=testfs-MDT0000 --user=cl1 --statuslog sync.log --verbose
Lustre filesystem: testfs
MDT device: testfs-MDT0000
Source: /mnt/lustre
Target: /mnt/target
Statuslog: sync.log
Changelog registration: cl1
Starting changelog record: 0
Errors: 0
lustre_rsync took 1 seconds
Changelog records consumed: 22</screen>
        <para>After the file system undergoes changes, synchronize the changes
        onto the target file system. Only the status log name needs to be
        specified, as the status log holds all the parameters passed
        earlier.</para>
        <screen>$ lustre_rsync --statuslog sync.log --verbose
Replicating Lustre filesystem: testfs
MDT device: testfs-MDT0000
Source: /mnt/lustre
Target: /mnt/target
Statuslog: sync.log
Changelog registration: cl1
Starting changelog record: 22
Errors: 0
lustre_rsync took 2 seconds
Changelog records consumed: 42</screen>
        <para>Synchronize a Lustre file system
        (<literal>/mnt/lustre</literal>) to two target file systems
        (<literal>/mnt/target1</literal> and
        <literal>/mnt/target2</literal>).</para>
        <screen>$ lustre_rsync --source=/mnt/lustre --target=/mnt/target1 \
           --target=/mnt/target2 --mdt=testfs-MDT0000 --user=cl1 \
           --statuslog sync.log</screen>
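        <para>Since <literal>lustre_rsync</literal> is meant to be run
        periodically and can be safely interrupted and restarted, a simple
        loop (or a cron job) is enough to keep the target up to date. The
        following sketch is one possible loop; the function name and the
        interval are hypothetical.</para>
        <screen># Hypothetical helper: run a command COUNT times (0 means forever),
# sleeping INTERVAL seconds between runs and stopping at the first
# failure, whose exit status is returned.
sync_loop() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        "$@" || return $?
        i=$(( i + 1 ))
        if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: synchronize every hour until an error occurs.
#   sync_loop 3600 0 lustre_rsync --statuslog sync.log</screen>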
      </section>
    </section>
  </section>
  <section xml:id="dbdoclet.backup_device">
    <title>
    <indexterm>
      <primary>backup</primary>
      <secondary>MDT/OST device level</secondary>
    </indexterm>Backing Up and Restoring an MDT or OST (ldiskfs Device Level)</title>
    <para>In some cases, it is useful to do a full device-level backup of an
    individual device (MDT or OST) before replacing hardware, performing
    maintenance, etc. Doing full device-level backups ensures that all of the
    data and configuration files are preserved in their original state, and is
    the easiest method of doing a backup. For the MDT file system, it may also
    be the fastest way to perform the backup and restore, since it can do large
    streaming read and write operations at the maximum bandwidth of the
    underlying devices.</para>
    <note>
      <para>Keeping an updated full backup of the MDT is especially important
      because permanent failure or corruption of the MDT file system renders
      the much larger amount of data in all the OSTs largely inaccessible and
      unusable.  The storage needed for one or two full MDT device backups
      is much smaller than that needed for a full file system backup, and it
      can be less expensive than the actual MDT device(s), since it only needs
      good streaming read/write speed instead of high random IOPS.</para>
    </note>
    <warning condition='l23'>
      <para>In Lustre software release 2.0 through 2.2, the only successful
      way to back up and restore an MDT is to do a device-level backup as
      described in this section. File-level restore of an MDT is not possible
      before Lustre software release 2.3, as the Object Index (OI) file cannot
      be rebuilt after restore without the OI Scrub functionality.
      <emphasis role="bold">Since Lustre software release 2.3</emphasis>,
      Object Index files are automatically rebuilt at first mount after a
      restore is detected (see
      <link xl:href="http://jira.hpdd.intel.com/browse/LU-957">LU-957</link>),
      and file-level backup is supported (see
      <xref linkend="dbdoclet.backup_target_filesystem"/>).</para>
    </warning>
    <para>If hardware replacement is the reason for the backup or if a spare
    storage device is available, it is possible to do a raw copy of the MDT or
    OST from one block device to the other, as long as the new device is at
    least as large as the original device. To do this, run:</para>
    <screen>dd if=/dev/{original} of=/dev/{newdev} bs=4M</screen>
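    <para>Before running the copy, it can be worth confirming the size
    requirement. The sketch below compares two sizes in bytes, which on Linux
    can be obtained with <literal>blockdev --getsize64</literal> from
    util-linux; the function name and the device names are
    placeholders.</para>
    <screen># Hypothetical helper: succeed only if the new device (NEW_BYTES) is
# at least as large as the original device (ORIG_BYTES).
big_enough() {
    orig_bytes=$1
    new_bytes=$2
    [ "$new_bytes" -ge "$orig_bytes" ]
}

# Example (as root; device names are placeholders):
#   if big_enough "$(blockdev --getsize64 /dev/orig)" \
#                 "$(blockdev --getsize64 /dev/newdev)"; then
#       dd if=/dev/orig of=/dev/newdev bs=4M
#   fi</screen>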
    <para>If hardware errors cause read problems on the original device, use
    the command below to allow as much data as possible to be read from the
    original device while skipping sections of the disk with errors:</para>
    <screen>dd if=/dev/{original} of=/dev/{newdev} bs=4k conv=sync,noerror \
      count={original size in 4kB blocks}</screen>
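    <para>The <literal>count</literal> value can be derived from the size of
    the original device in bytes. A minimal sketch, assuming the size is taken
    from <literal>blockdev --getsize64</literal> and rounding up so that a
    trailing partial block is still read; if the size is not a multiple of
    4 KiB, <literal>conv=sync</literal> pads that last block with zeros, so
    the new device needs room for the padding. The function name is
    hypothetical.</para>
    <screen># Hypothetical helper: number of 4 KiB blocks needed to cover SIZE bytes.
blocks_4k() {
    echo $(( ($1 + 4095) / 4096 ))
}

# Example (as root; device names are placeholders):
#   count=$(blocks_4k "$(blockdev --getsize64 /dev/original)")
#   dd if=/dev/original of=/dev/newdev bs=4k conv=sync,noerror count="$count"</screen>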
    <para>Even in the face of hardware errors, the <literal>ldiskfs</literal>
    file system is very robust, and it may be possible
    to recover the file system data after running
    <literal>e2fsck -fy /dev/{newdev}</literal> on the new device, along with
    <literal>ll_recover_lost_found_objs</literal> for OST devices.</para>
    <para condition="l26">With Lustre software version 2.6 and later, there is
    no longer a need to run
    <literal>ll_recover_lost_found_objs</literal> on the OSTs, since the
    <literal>LFSCK</literal> scanning will automatically move objects from
    <literal>lost+found</literal> back into their correct locations on the OST
    after directory corruption.</para>
    <para>In order to ensure that the backup is fully consistent, the MDT or
    OST must be unmounted, so that no changes are made to the
    device while the data is being transferred.  If the reason for the
    backup is preventative (i.e. MDT backup on a running MDS in case of
    future failures), then it is possible to perform a consistent backup from
    an LVM snapshot.  If an LVM snapshot is not available, and taking the
    MDS offline for a backup is unacceptable, it is also possible to perform
    a backup from the raw MDT block device.  While the backup from the raw
    device will not be fully consistent due to ongoing changes, the vast
    majority of ldiskfs metadata is statically allocated, and inconsistencies
    in the backup can be fixed by running <literal>e2fsck</literal> on the
    backup device.  Such a backup is still much better than not having any
    backup at all.
    </para>
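    <para>A backup from the raw MDT block device, followed by a repair of the
    copy, could be scripted along the lines of the sketch below. It copies to
    an image file (an assumption; a spare block device works the same way) and
    treats <literal>e2fsck</literal> exit status 1 or 2 (errors corrected) as
    success, since some repairs are expected for a copy taken while the MDT is
    in use. Device and file names are placeholders, the function names are
    hypothetical, and setting <literal>DRY_RUN=1</literal> only prints the
    commands.</para>
    <screen># Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: copy a (possibly mounted) MDT device DEV to the
# image file IMG, then check and repair the copy, never the live device.
raw_mdt_backup() {
    dev=$1
    img=$2
    run dd if="$dev" of="$img" bs=4M || return 1
    run e2fsck -fy "$img"
    rc=$?
    # e2fsck: 0 = no errors, 1 or 2 = errors corrected, 4 or more = not fixed.
    if [ "$rc" -ge 4 ]; then
        return "$rc"
    fi
    return 0
}

# Example: DRY_RUN=1 raw_mdt_backup /dev/mdtdev /backup/mdt-backup.img</screen>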
  </section>
  <section xml:id="dbdoclet.backup_target_filesystem">
    <title>
    <indexterm>
      <primary>backup</primary>
      <secondary>OST file system</secondary>
    </indexterm>
    <indexterm>
      <primary>backup</primary>
      <secondary>MDT file system</secondary>
    </indexterm>Backing Up an OST or MDT (ldiskfs File System Level)</title>
    <para>This procedure provides an alternative way to back up or migrate the
    data of an OST or MDT at the file level. At the file level, unused space is
    omitted from the backup, and the process may be completed more quickly with
    a smaller total backup size. Backing up a single OST device is not
    necessarily the best way to perform backups of the Lustre file system,
    since the files stored in the backup are not usable without metadata stored
    on the MDT and additional file stripes that may be on other OSTs. However,
    it is the preferred method for migration of OST devices, especially when it
    is desirable to reformat the underlying file system with different
    configuration options or to reduce fragmentation.</para>
    <note>
      <para>Prior to Lustre software release 2.3, the only successful way to
      perform an MDT backup and restore was to do a device-level backup as
      described in
      <xref linkend="dbdoclet.backup_device" />. The ability to do MDT
      file-level backups is not available for Lustre software release 2.0
      through 2.2, because restoration of the Object Index (OI) file does not
      return the MDT to a functioning state.
      <emphasis role="bold">Since Lustre software release 2.3</emphasis>,
      Object Index files are automatically rebuilt at first mount after a
      restore is detected (see
      <link xl:href="http://jira.hpdd.intel.com/browse/LU-957">LU-957</link>),
      so file-level MDT restore is supported.</para>
    </note>
    <para>For Lustre software release 2.3 and newer with MDT file-level backup
    support, substitute
    <literal>mdt</literal> for
    <literal>ost</literal> in the instructions below.</para>
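    <para>Because the tar step below depends on a tar that handles
    <literal>--sparse</literal> correctly, it may help to check the installed
    version first. The sketch below parses a GNU tar version line, as printed
    on the first line of <literal>tar --version</literal>, and compares it with
    1.25, the minimum upstream GNU tar version named in the note for that step.
    It does not recognize the Red Hat or Lustre-enhanced builds also listed
    there, and the function name is hypothetical.</para>
    <screen># Hypothetical helper: succeed if the version line (for example
# "tar (GNU tar) 1.30") reports GNU tar 1.25 or newer.
gnu_tar_sparse_ok() {
    # Extract "MAJOR MINOR" from a GNU tar version line.
    ver=$(echo "$1" | sed -n 's/^tar (GNU tar) \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    if [ -z "$ver" ]; then
        return 1  # not GNU tar, or an unrecognized format
    fi
    major=${ver% *}
    minor=${ver#* }
    [ $(( major * 1000 + minor )) -ge 1025 ]
}

# Example:
#   if gnu_tar_sparse_ok "$(tar --version | head -n 1)"; then echo ok; fi</screen>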
    <orderedlist>
      <listitem>
        <para>
          <emphasis role="bold">Make a mountpoint for the file
          system.</emphasis>
        </para>
        <screen>[oss]# mkdir -p /mnt/ost</screen>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Mount the file system.</emphasis>
        </para>
        <screen>[oss]# mount -t ldiskfs /dev/<emphasis>{ostdev}</emphasis> /mnt/ost</screen>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Change to the mountpoint being backed
          up.</emphasis>
        </para>
        <screen>[oss]# cd /mnt/ost</screen>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Back up the extended attributes.</emphasis>
        </para>
        <screen>[oss]# getfattr -R -d -m '.*' -e hex -P . &gt; ea-$(date +%Y%m%d).bak</screen>
        <note>
          <para>If the
          <literal>tar(1)</literal> command supports the
          <literal>--xattrs</literal> option, the
          <literal>getfattr</literal> step may be unnecessary as long as tar
          backs up the
          <literal>trusted.*</literal> attributes. However, completing this step
          is not harmful and can serve as an added safety measure.</para>
        </note>
        <note>
          <para>In most distributions, the
          <literal>getfattr</literal> command is part of the
          <literal>attr</literal> package. If the
          <literal>getfattr</literal> command returns errors like
          <literal>Operation not supported</literal>, then the kernel does not
          correctly support EAs. Stop and use a different backup method.</para>
        </note>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Verify that the
          <literal>ea-{date}.bak</literal> file has properly backed up the EA
          data on the OST.</emphasis>
        </para>
        <para>Without this attribute data, the restore process may be missing
        extra data that can be very useful in case of later file system
        corruption. Look at this file with <literal>more</literal> or a text
        editor. Each object file should have a corresponding item similar to
        this:</para>
        <screen># file: O/0/d0/100992
trusted.fid=0x0d822200000000004a8a73e500000000808a0100000000000000000000000000</screen>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Back up all file system data.</emphasis>
        </para>
        <screen>[oss]# tar czvf {backup file}.tgz [--xattrs] --sparse .</screen>
        <note>
          <para>The tar
          <literal>--sparse</literal> option is vital for backing up an MDT. In
          order for
          <literal>--sparse</literal> to behave correctly, and for the backup
          of an MDT to complete in a finite time, a suitable version of tar
          must be used. Correctly functioning versions of tar include the
          Lustre software enhanced version of tar at
          <link xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:href="https://wiki.hpdd.intel.com/display/PUB/Lustre+Tools#LustreTools-lustre-tar" />,
          the tar from a Red Hat Enterprise Linux distribution (version 6.3 or
          more recent) and GNU tar version 1.25 or more recent.</para>
        </note>
        <warning>
          <para>Before GNU tar 1.27, the tar
          <literal>--xattrs</literal> option was only available in GNU tar
          distributions from Red Hat or Intel.</para>
        </warning>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Change directory out of the file
          system.</emphasis>
        </para>
        <screen>[oss]# cd -</screen>
      </listitem>
      <listitem>
        <para>
          <emphasis role="bold">Unmount the file system.</emphasis>
        </para>
        <screen>[oss]# umount /mnt/ost</screen>
        <note>
          <para>When restoring an OST backup on a different node as part of an
          OST migration, you also have to change server NIDs and use the
          <literal>--writeconf</literal> option to regenerate the
          configuration logs. See
          <xref linkend="lustremaintenance" /> (Changing a Server NID).</para>
        </note>
      </listitem>
    </orderedlist>
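    <para>The check in the EA verification step above can be partly
    automated. The sketch below lists object files under
    <literal>O/</literal> that have no <literal>trusted.fid</literal> entry
    in the <literal>getfattr</literal> dump. It assumes the dump was produced
    as in the backup step, from the top of the mounted OST, so that paths
    appear as <literal>O/...</literal> (a leading <literal>./</literal> is
    also accepted); the function name is hypothetical. Depending on the
    Lustre version, objects that were precreated but never written to may not
    carry the attribute, so any output is a prompt for investigation rather
    than proof of a bad backup.</para>
    <screen># Hypothetical helper: print object files under ROOT/O that have no
# trusted.fid entry in the getfattr dump file DUMP.
objects_missing_fid() {
    dump=$1
    root=$2
    have=$(mktemp)
    all=$(mktemp)
    # Paths that have a trusted.fid line in the dump.
    awk '/^# file: / { f = substr($0, 9); sub(/^\.\//, "", f) }
         /^trusted\.fid=/ { print f }' "$dump" | LC_ALL=C sort > "$have"
    # All object files actually present on the mounted file system.
    (cd "$root" || exit 1; find O -type f) | LC_ALL=C sort > "$all"
    # Lines only in the second list: objects without trusted.fid.
    LC_ALL=C comm -13 "$have" "$all"
    rm -f "$have" "$all"
}

# Example (after the backup step, with the OST still mounted):
#   objects_missing_fid /mnt/ost/ea-{date}.bak /mnt/ost</screen>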
  </section>
  <section xml:id="dbdoclet.restore_target_filesystem">
    <title>
    <indexterm>
      <primary>backup</primary>
      <secondary>restoring file system backup</secondary>
    </indexterm>Restoring a File-Level Backup</title>
    <para>To restore data from a file-level backup, you need to format the
    device, restore the file data and then restore the EA data.</para>
 627     <orderedlist>
 628       <listitem>
 629         <para>Format the new device.</para>
 630         <screen>[oss]# mkfs.lustre --ost --index {<emphasis>OST index</emphasis>} {<emphasis>other options</emphasis>} /dev/<emphasis>{newdev}</emphasis></screen>
 631       </listitem>
 632       <listitem>
 633         <para>Set the file system label.</para>
        <screen>[oss]# e2label /dev/<emphasis>{newdev}</emphasis> {fsname}-OST{index in hex}</screen>
 635       </listitem>
 636       <listitem>
 637         <para>Mount the file system.</para>
 638         <screen>[oss]# mount -t ldiskfs /dev/<emphasis>{newdev}</emphasis> /mnt/ost</screen>
 639       </listitem>
 640       <listitem>
 641         <para>Change to the new file system mount point.</para>
 642         <screen>[oss]# cd /mnt/ost</screen>
 643       </listitem>
 644       <listitem>
 645         <para>Restore the file system backup.</para>
 646         <screen>[oss]# tar xzvpf <emphasis>{backup file}</emphasis> [--xattrs] --sparse</screen>
 647       </listitem>
 648       <listitem>
 649         <para>Restore the file system extended attributes.</para>
 650         <screen>[oss]# setfattr --restore=ea-${date}.bak</screen>
 651         <note>
          <para>If the
          <literal>--xattrs</literal> option is supported by tar and was
          specified in the previous step, this step is redundant.</para>
 655         </note>
 656       </listitem>
 657       <listitem>
 658         <para>Verify that the extended attributes were restored.</para>
        <screen>[oss]# getfattr -d -m ".*" -e hex O/0/d0/100992
trusted.fid=0x0d822200000000004a8a73e500000000808a0100000000000000000000000000</screen>
 661       </listitem>
 662       <listitem>
 663         <para>Remove old OI files.</para>
 664         <screen>[oss]# rm -f oi.16*</screen>
 665       </listitem>
 666       <listitem>
 667         <para>Remove old CATALOGS.</para>
 668         <screen>[oss]# rm -f CATALOGS</screen>
 669         <note>
        <para>This step is optional and applies only to the MDT. The CATALOGS
        file records the llog file handles that are used to recover
        cross-server updates. Until OI scrub has rebuilt the OI mappings for
        the llog files, the related recovery fails if it runs ahead of the
        background OI scrub, and this causes the whole mount to fail. Because
        OI scrub is an online tool, a mount failure also stops OI scrub.
        Removing the old CATALOGS avoids this problem. The side effect is that
        recovery of the related cross-server updates is aborted; LFSCK can
        handle this once the file system is mounted.</para>
 680         </note>
 681       </listitem>
 682       <listitem>
 683         <para>Change directory out of the file system.</para>
 684         <screen>[oss]# cd -</screen>
 685       </listitem>
 686       <listitem>
 687         <para>Unmount the new file system.</para>
 688         <screen>[oss]# umount /mnt/ost</screen>
 689       </listitem>
 690     </orderedlist>
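    <para>The label in the <literal>e2label</literal> step above uses the OST
    index in hexadecimal. The following minimal sketch assumes the
    zero-padded, four-digit lowercase form used in target names elsewhere in
    this chapter (for example, <literal>main-OST0000</literal>); the function
    name <literal>ost_label</literal> is hypothetical.</para>
    <screen>#!/bin/bash
# Hypothetical helper: print {fsname}-OST{index in hex}, assuming the
# 4-digit zero-padded hex form seen in names such as main-OST0000.
ost_label() {
    local fsname=$1 index=$2
    printf '%s-OST%04x\n' "$fsname" "$index"
}

# Usage (e2label takes the device first and the new label second):
#   e2label /dev/newdev "$(ost_label testfs 10)"   # label: testfs-OST000a
# Running "e2label /dev/newdev" with no label prints the current label.</screen>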
    <para condition='l23'>If the file system was used between the time the backup was made and
    when it was restored, then the online
    <literal>LFSCK</literal> tool (part of the Lustre software since release 2.3)
    will automatically be
    run to ensure the file system is coherent. If all of the device file
    systems were backed up at the same time after the entire Lustre file system
    was stopped, this step is unnecessary. In either case, the file system will
    be immediately usable, although there may be I/O errors when reading
    files that are present on the MDT but not on the OSTs, and files that
    were created after the MDT backup will not be accessible or visible. See
    <xref linkend="dbdoclet.lfsckadmin" /> for details on using LFSCK.</para>
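    <para>The <literal>trusted.fid</literal> value printed in the
    verification step can be split into fields and checked against the
    object path. This is a minimal sketch, assuming the 32-byte little-endian
    <literal>struct filter_fid</literal> layout of older Lustre 2.x OSTs: a
    parent FID (64-bit sequence, 32-bit object ID, 32-bit version) followed
    by a 64-bit object ID and a 64-bit sequence. Newer releases store a
    different layout, and the function names are hypothetical.</para>
    <screen>#!/bin/bash
# Convert a little-endian hex string (for example "808a010000000000")
# to decimal by reversing its byte order.
le_hex_to_dec() {
    local hex=$1 out="" i
    for ((i = ${#hex} - 2; i >= 0; i -= 2)); do
        out+=${hex:i:2}
    done
    echo $((16#$out))
}

# Decode a 32-byte trusted.fid value, assuming the filter_fid layout above.
decode_filter_fid() {
    local v=${1#0x}
    if [ "${#v}" -ne 64 ]; then
        return 1   # not 32 bytes of hex
    fi
    printf 'parent_seq=%#x parent_oid=%#x parent_ver=%d objid=%d seq=%d\n' \
        "$(le_hex_to_dec "${v:0:16}")" \
        "$(le_hex_to_dec "${v:16:8}")" \
        "$(le_hex_to_dec "${v:24:8}")" \
        "$(le_hex_to_dec "${v:32:16}")" \
        "$(le_hex_to_dec "${v:48:16}")"
}</screen>
    <para>For the value shown in the verification step, the sketch reports
    <literal>objid=100992</literal> and <literal>seq=0</literal>, which agree
    with the object path <literal>O/0/d0/100992</literal>. Bash arithmetic is
    signed 64-bit, so a field of 2^63 or more would print as a negative
    number.</para>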
 702   </section>
 703   <section xml:id="dbdoclet.backup_lvm_snapshot">
 704     <title>
 705     <indexterm>
 706       <primary>backup</primary>
 707       <secondary>using LVM</secondary>
 708     </indexterm>Using LVM Snapshots with the Lustre File System</title>
 709     <para>If you want to perform disk-based backups (because, for example,
 710     access to the backup system needs to be as fast as to the primary Lustre
 711     file system), you can use the Linux LVM snapshot tool to maintain multiple,
 712     incremental file system backups.</para>
 713     <para>Because LVM snapshots cost CPU cycles as new files are written,
 714     taking snapshots of the main Lustre file system will probably result in
    unacceptable performance losses. You should create a new backup Lustre
 716     file system and periodically (e.g., nightly) back up new/changed files to
 717     it. Periodic snapshots can be taken of this backup file system to create a
 718     series of "full" backups.</para>
 719     <note>
 720       <para>Creating an LVM snapshot is not as reliable as making a separate
 721       backup, because the LVM snapshot shares the same disks as the primary MDT
 722       device, and depends on the primary MDT device for much of its data. If
 723       the primary MDT device becomes corrupted, this may result in the snapshot
 724       being corrupted.</para>
 725     </note>
 726     <section remap="h3">
 727       <title>
 728       <indexterm>
 729         <primary>backup</primary>
 730         <secondary>using LVM</secondary>
 731         <tertiary>creating</tertiary>
 732       </indexterm>Creating an LVM-based Backup File System</title>
 733       <para>Use this procedure to create a backup Lustre file system for use
 734       with the LVM snapshot mechanism.</para>
 735       <orderedlist>
 736         <listitem>
 737           <para>Create LVM volumes for the MDT and OSTs.</para>
 738           <para>Create LVM devices for your MDT and OST targets. Make sure not
 739           to use the entire disk for the targets; save some room for the
          snapshots. The snapshots start out empty, but grow as you make
 741           changes to the current file system. If you expect to change 20% of
 742           the file system between backups, the most recent snapshot will be 20%
 743           of the target size, the next older one will be 40%, etc. Here is an
 744           example:</para>
 745           <screen>cfs21:~# pvcreate /dev/sda1
 746    Physical volume "/dev/sda1" successfully created
 747 cfs21:~# vgcreate vgmain /dev/sda1
 748    Volume group "vgmain" successfully created
 749 cfs21:~# lvcreate -L200G -nMDT0 vgmain
 750    Logical volume "MDT0" created
 751 cfs21:~# lvcreate -L200G -nOST0 vgmain
 752    Logical volume "OST0" created
 753 cfs21:~# lvscan
 754    ACTIVE                  '/dev/vgmain/MDT0' [200.00 GB] inherit
 755    ACTIVE                  '/dev/vgmain/OST0' [200.00 GB] inherit</screen>
 756         </listitem>
 757         <listitem>
 758           <para>Format the LVM volumes as Lustre targets.</para>
 759           <para>In this example, the backup file system is called
 760           <literal>main</literal> and designates the current, most up-to-date
 761           backup.</para>
 762           <screen>cfs21:~# mkfs.lustre --fsname=main --mdt --index=0 /dev/vgmain/MDT0
 763  No management node specified, adding MGS to this MDT.
 764     Permanent disk data:
 765  Target:     main-MDT0000
 766  Index:      0
 767  Lustre FS:  main
 768  Mount type: ldiskfs
 769  Flags:      0x75
 770                (MDT MGS first_time update )
 771  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
 772  Parameters:
 773 checking for existing Lustre data
 774  device size = 200GB
 775  formatting backing filesystem ldiskfs on /dev/vgmain/MDT0
 776          target name  main-MDT0000
 777          4k blocks     0
 778          options        -i 4096 -I 512 -q -O dir_index -F
 779  mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDT0000  -i 4096 -I 512 -q
 780   -O dir_index -F /dev/vgmain/MDT0
 781  Writing CONFIGS/mountdata
cfs21:~# mkfs.lustre --mgsnode=cfs21 --fsname=main --ost --index=0 /dev/vgmain/OST0
 784     Permanent disk data:
 785  Target:     main-OST0000
 786  Index:      0
 787  Lustre FS:  main
 788  Mount type: ldiskfs
 789  Flags:      0x72
 790                (OST first_time update )
 791  Persistent mount opts: errors=remount-ro,extents,mballoc
 792  Parameters: mgsnode=192.168.0.21@tcp
 793 checking for existing Lustre data
 794  device size = 200GB
 795  formatting backing filesystem ldiskfs on /dev/vgmain/OST0
 796          target name  main-OST0000
 797          4k blocks     0
 798          options        -I 256 -q -O dir_index -F
 799  mkfs_cmd = mkfs.ext2 -j -b 4096 -L lustre-OST0000 -J size=400 -I 256
 800   -i 262144 -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256
 801   -E resize=4290772992,lazy_journal_init, -F /dev/vgmain/OST0
 802  Writing CONFIGS/mountdata
 803 cfs21:~# mount -t lustre /dev/vgmain/MDT0 /mnt/mdt
 804 cfs21:~# mount -t lustre /dev/vgmain/OST0 /mnt/ost
 805 cfs21:~# mount -t lustre cfs21:/main /mnt/main
 806 </screen>
 807         </listitem>
 808       </orderedlist>
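      <para>The sizing rule in the first step above (each older snapshot holds
      one more interval of changes) can be tabulated with a short shell
      function. This is a minimal sketch, assuming that the changes made in
      different intervals do not overlap, so the result is an upper estimate;
      the function name <literal>snapshot_sizes</literal> and its arguments
      are hypothetical.</para>
      <screen>#!/bin/bash
# Hypothetical helper: estimate snapshot sizes for one target.
#   $1 = target size in GB, $2 = percent changed per backup interval,
#   $3 = number of snapshots kept.
# The k-th most recent snapshot is estimated at k * percent of the target.
snapshot_sizes() {
    local target_gb=$1 pct=$2 count=$3
    local k=1 size total=0
    while [ "$k" -le "$count" ]; do
        size=$(( target_gb * pct * k / 100 ))   # integer GB, rounded down
        total=$(( total + size ))
        echo "snapshot $k: ${size}G"
        k=$(( k + 1 ))
    done
    echo "total: ${total}G"
}</screen>
      <para>For the 200 GB targets in this example, 20% change per interval
      and three retained snapshots, <literal>snapshot_sizes 200 20 3</literal>
      prints 40G, 80G and 120G, a total of 240G of free space per target in
      <literal>vgmain</literal>.</para>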
 809     </section>
 810     <section remap="h3">
 811       <title>
 812       <indexterm>
 813         <primary>backup</primary>
 814         <secondary>new/changed files</secondary>
 815       </indexterm>Backing up New/Changed Files to the Backup File
 816       System</title>
      <para>At periodic intervals (for example, nightly), back up new and
      changed files to the LVM-based backup file system.</para>
      <screen>cfs21:~# cp /etc/passwd /mnt/main
cfs21:~# cp /etc/fstab /mnt/main
cfs21:~# ls /mnt/main
fstab  passwd</screen>
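      <para>The <literal>cp</literal> commands above copy individual files.
      For a nightly job, a tool that skips unchanged files is more practical.
      The following minimal sketch swaps in <literal>rsync</literal>, which is
      not part of the procedure above; the primary file system mount point
      <literal>/mnt/lustre</literal> and the function name are hypothetical.
      With <literal>DRY_RUN=1</literal> (the default) it only prints the
      command.</para>
      <screen>#!/bin/bash
# Hypothetical helper: copy new and changed files from the primary file
# system (src) into the backup file system (dst).
sync_to_backup() {
    local src=$1 dst=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "rsync -a $src/ $dst/"
    else
        # -a preserves permissions, ownership and times; files whose size
        # and modification time are unchanged are skipped.
        rsync -a "$src/" "$dst/"
    fi
}

# Example: DRY_RUN=0 sync_to_backup /mnt/lustre /mnt/main</screen>
      <para>Files deleted from the primary file system remain in the backup
      unless <literal>--delete</literal> is added. Do not run this job while
      snapshots of the backup file system are being created.</para>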
 825     </section>
 826     <section remap="h3">
 827       <title>
 828       <indexterm>
 829         <primary>backup</primary>
 830         <secondary>using LVM</secondary>
 831         <tertiary>creating snapshots</tertiary>
 832       </indexterm>Creating Snapshot Volumes</title>
      <para>Whenever you want to make a "checkpoint" of the main Lustre file
      system, create LVM snapshots of the MDT and all OSTs in the LVM-based
      backup file system. You must decide the maximum size of a snapshot ahead
      of time, although you can change it dynamically later. The size of a
      daily snapshot depends on the amount of data changed daily in the
      main Lustre file system. A two-day-old snapshot is likely to be
      twice as big as a one-day-old snapshot.</para>
 840       <para>You can create as many snapshots as you have room for in the volume
 841       group. If necessary, you can dynamically add disks to the volume
 842       group.</para>
 843       <para>The snapshots of the target MDT and OSTs should be taken at the
 844       same point in time. Make sure that the cronjob updating the backup file
 845       system is not running, since that is the only thing writing to the disks.
 846       Here is an example:</para>
 847       <screen>cfs21:~# modprobe dm-snapshot
 848 cfs21:~# lvcreate -L50M -s -n MDT0.b1 /dev/vgmain/MDT0
 849    Rounding up size to full physical extent 52.00 MB
 850    Logical volume "MDT0.b1" created
 851 cfs21:~# lvcreate -L50M -s -n OST0.b1 /dev/vgmain/OST0
 852    Rounding up size to full physical extent 52.00 MB
 853    Logical volume "OST0.b1" created
 854 </screen>
 855       <para>After the snapshots are taken, you can continue to back up
 856       new/changed files to "main". The snapshots will not contain the new
 857       files.</para>
 858       <screen>cfs21:~# cp /etc/termcap /mnt/main
 859 cfs21:~# ls /mnt/main
 860 fstab  passwd  termcap
 861 </screen>
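      <para>When there are several OSTs, the <literal>lvcreate</literal>
      commands above can be generated in a loop so that every target gets a
      snapshot with the same suffix. This is a minimal sketch: the snapshots
      are taken one after another rather than atomically, which is adequate
      only because the backup file system is idle while its update job is
      stopped. The function name, argument order and
      <literal>DRY_RUN</literal> switch are hypothetical.</para>
      <screen>#!/bin/bash
# Hypothetical helper: snapshot_targets VG SIZE SUFFIX TARGET...
# Creates VG/TARGET.SUFFIX for each TARGET with the lvcreate options shown
# above. Run "modprobe dm-snapshot" first. DRY_RUN=1 (default) only prints.
snapshot_targets() {
    local vg=$1 size=$2 suffix=$3
    shift 3
    local t
    for t in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "lvcreate -L$size -s -n $t.$suffix /dev/$vg/$t"
        else
            lvcreate -L"$size" -s -n "$t.$suffix" "/dev/$vg/$t" || return 1
        fi
    done
}

# Example: DRY_RUN=0 snapshot_targets vgmain 50M b1 MDT0 OST0</screen>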
 862     </section>
 863     <section remap="h3">
 864       <title>
 865       <indexterm>
 866         <primary>backup</primary>
 867         <secondary>using LVM</secondary>
 868         <tertiary>restoring</tertiary>
 869       </indexterm>Restoring the File System From a Snapshot</title>
 870       <para>Use this procedure to restore the file system from an LVM
 871       snapshot.</para>
 872       <orderedlist>
 873         <listitem>
 874           <para>Rename the LVM snapshot.</para>
 875           <para>Rename the file system snapshot from "main" to "back" so you
 876           can mount it without unmounting "main". This is recommended, but not
 877           required. Use the
 878           <literal>--reformat</literal> flag to
 879           <literal>tunefs.lustre</literal> to force the name change. For
 880           example:</para>
 881           <screen>cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/MDT0.b1
 882  checking for existing Lustre data
 883  found Lustre data
 884  Reading CONFIGS/mountdata
 885 Read previous values:
 886  Target:     main-MDT0000
 887  Index:      0
 888  Lustre FS:  main
 889  Mount type: ldiskfs
 890  Flags:      0x5
 891               (MDT MGS )
 892  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
 893  Parameters:
 894 Permanent disk data:
 895  Target:     back-MDT0000
 896  Index:      0
 897  Lustre FS:  back
 898  Mount type: ldiskfs
 899  Flags:      0x105
 900               (MDT MGS writeconf )
 901  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
 902  Parameters:
 903 Writing CONFIGS/mountdata
 904 cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/OST0.b1
 905  checking for existing Lustre data
 906  found Lustre data
 907  Reading CONFIGS/mountdata
 908 Read previous values:
 909  Target:     main-OST0000
 910  Index:      0
 911  Lustre FS:  main
 912  Mount type: ldiskfs
 913  Flags:      0x2
 914               (OST )
 915  Persistent mount opts: errors=remount-ro,extents,mballoc
 916  Parameters: mgsnode=192.168.0.21@tcp
 917 Permanent disk data:
 918  Target:     back-OST0000
 919  Index:      0
 920  Lustre FS:  back
 921  Mount type: ldiskfs
 922  Flags:      0x102
 923               (OST writeconf )
 924  Persistent mount opts: errors=remount-ro,extents,mballoc
 925  Parameters: mgsnode=192.168.0.21@tcp
 926 Writing CONFIGS/mountdata
 927 </screen>
          <para>When renaming a file system, you must also erase the
          <literal>last_rcvd</literal> file from the snapshots:</para>
 930           <screen>cfs21:~# mount -t ldiskfs /dev/vgmain/MDT0.b1 /mnt/mdtback
 931 cfs21:~# rm /mnt/mdtback/last_rcvd
 932 cfs21:~# umount /mnt/mdtback
 933 cfs21:~# mount -t ldiskfs /dev/vgmain/OST0.b1 /mnt/ostback
 934 cfs21:~# rm /mnt/ostback/last_rcvd
 935 cfs21:~# umount /mnt/ostback</screen>
 936         </listitem>
 937         <listitem>
 938           <para>Mount the file system from the LVM snapshot. For
 939           example:</para>
 940           <screen>cfs21:~# mount -t lustre /dev/vgmain/MDT0.b1 /mnt/mdtback
 941 cfs21:~# mount -t lustre /dev/vgmain/OST0.b1 /mnt/ostback
 942 cfs21:~# mount -t lustre cfs21:/back /mnt/back</screen>
 943         </listitem>
 944         <listitem>
 945           <para>Note the old directory contents, as of the snapshot time. For
 946           example:</para>
          <screen>cfs21:~# ls /mnt/back
fstab  passwd
 949 </screen>
 950         </listitem>
 951       </orderedlist>
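      <para>The rename in step 1 repeats the same commands for every snapshot
      target. The following minimal sketch loops over the snapshot devices;
      the commands are the ones shown in step 1 (with <literal>rm -f</literal>
      so that a missing file does not stop the loop), while the function
      names, the single scratch mount point and the
      <literal>DRY_RUN</literal> switch are hypothetical.</para>
      <screen>#!/bin/bash
# Print the command when DRY_RUN=1 (the default); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Hypothetical helper: rename_snapshot_targets NEWNAME MOUNTPOINT DEVICE...
# Renames each snapshot target and removes its last_rcvd file.
rename_snapshot_targets() {
    local newname=$1 mnt=$2
    shift 2
    local dev
    for dev in "$@"; do
        run tunefs.lustre --reformat --fsname="$newname" --writeconf "$dev" || return 1
        run mount -t ldiskfs "$dev" "$mnt" || return 1
        run rm -f "$mnt/last_rcvd"
        run umount "$mnt" || return 1
    done
}

# Example:
#   DRY_RUN=0 rename_snapshot_targets back /mnt/snapback \
#       /dev/vgmain/MDT0.b1 /dev/vgmain/OST0.b1</screen>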
 952     </section>
 953     <section remap="h3">
 954       <title>
 955       <indexterm>
 956         <primary>backup</primary>
 957         <secondary>using LVM</secondary>
 958         <tertiary>deleting</tertiary>
 959       </indexterm>Deleting Old Snapshots</title>
 960       <para>To reclaim disk space, you can erase old snapshots as your backup
 961       policy dictates. Run:</para>
 962       <screen>lvremove /dev/vgmain/MDT0.b1</screen>
 963     </section>
 964     <section remap="h3">
 965       <title>
 966       <indexterm>
 967         <primary>backup</primary>
 968         <secondary>using LVM</secondary>
 969         <tertiary>resizing</tertiary>
 970       </indexterm>Changing Snapshot Volume Size</title>
      <para>You can also extend or shrink snapshot volumes if you find your
      daily deltas are larger or smaller than expected. Run:</para>
 973       <screen>lvextend -L10G /dev/vgmain/MDT0.b1</screen>
 974       <note>
        <para>Extending snapshots may not work in older versions of LVM. It
        is known to work in LVM v2.02.01.</para>
 977       </note>
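      <para>A classic (non-thin) LVM snapshot becomes invalid once its
      copy-on-write space is full, so it is worth checking how full each
      snapshot is before deciding whether to extend it. The following minimal
      sketch filters the output of
      <literal>lvs --noheadings -o lv_name,snap_percent</literal>; the
      function name and the 80% threshold are hypothetical, and report field
      names can differ between LVM versions.</para>
      <screen>#!/bin/bash
# Hypothetical helper: read "name percent" lines on standard input and
# print the names whose snapshot usage is at least LIMIT percent.
# Lines without a percentage (for example, origin volumes) are skipped.
snapshots_over_limit() {
    local limit=$1
    awk -v limit="$limit" 'NF == 2 { if ($2 + 0 >= limit + 0) print $1 }'
}

# Example:
#   lvs --noheadings -o lv_name,snap_percent vgmain | snapshots_over_limit 80
# Any snapshot it lists is a candidate for lvextend, as shown above.</screen>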
 978     </section>
 979   </section>
 980 </chapter>