From 158e78ed99ddf84d176d9b4c60a5f8425033bf3a Mon Sep 17 00:00:00 2001 From: Richard Henwood Date: Fri, 20 May 2011 16:24:49 -0500 Subject: [PATCH] FIX: typesetting --- ConfigurationFilesModuleParameters.xml | 1013 +++++++++++++++----------- LustreProgrammingInterfaces.xml | 195 +++-- Preface.xml | 13 +- SettingLustreProperties.xml | 1213 +++++++++++++++++++------------- UnderstandingLustre.xml | 43 +- UnderstandingLustreNetworking.xml | 19 +- 6 files changed, 1448 insertions(+), 1048 deletions(-) diff --git a/ConfigurationFilesModuleParameters.xml b/ConfigurationFilesModuleParameters.xml index 01693ce..7ce5279 100644 --- a/ConfigurationFilesModuleParameters.xml +++ b/ConfigurationFilesModuleParameters.xml @@ -1,450 +1,651 @@ - - + + - Configuration Files and Module Parameters + Configuration Files and Module Parameters This section describes configuration files and module parameters and includes the following sections: - + + - - + - - -
- 35.1 Introduction - LNET network hardware and routing are now configured via module parameters. Parameters should be specified in the /etc/modprobe.conf file, for example: - alias lustre llite -options lnet networks=tcp0,elan0 - - The above option specifies that this node should use all the available TCP and Elan interfaces. - Module parameters are read when the module is first loaded. Type-specific LND modules (for instance, ksocklnd) are loaded automatically by the LNET module when LNET starts (typically upon modprobe ptlrpc). - Under Linux 2.6, LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under LNET, and LND-specific parameters under the name of the corresponding LND. - Under Linux 2.4, sysfs is not available, but the LND-specific parameters are accessible via equivalent paths under /proc. - Important: All old (pre v.1.4.6) Lustre configuration lines should be removed from the module configuration files and replaced with the following. Make sure that CONFIG_KMOD is set in your linux.config so LNET can load the following modules it needs. The basic module files are: - modprobe.conf (for Linux 2.6) - alias lustre llite -options lnet networks=tcp0,elan0 - - modules.conf (for Linux 2.4) - alias lustre llite -options lnet networks=tcp0,elan0 - - For the following parameters, default option settings are shown in parenthesis. Changes to parameters marked with a W affect running systems. (Unmarked parameters can only be set when LNET loads for the first time.) Changes to parameters marked with Wc only have effect when connections are established (existing connections are not affected by these changes.) -
-
- 35.2 Module <anchor xml:id="dbdoclet.50438293_marker-1293311" xreflabel=""/>Options - - With routed or other multi-network configurations, use ip2nets rather than networks, so all nodes can use the same configuration. - - - - For a routed network, use the same 'routes†configuration everywhere. Nodes specified as routers automatically enable forwarding and any routes that are not relevant to a particular node are ignored. Keep a common configuration to guarantee that all nodes have consistent routing tables. - - - - A separate modprobe.conf.lnet included from modprobe.conf makes distributing the configuration much easier. - - - - If you set config_on_load=1, LNET starts at modprobe time rather than waiting for Lustre to start. This ensures routers start working at module load time. - - - - # lctl -# lctl> net down - - - Remember the lctl ping {nid} command - it is a handy way to check your LNET configuration. - - - -
- 35.2.1 <anchor xml:id="dbdoclet.50438293_94707" xreflabel=""/>LNET <anchor xml:id="dbdoclet.50438293_marker-1293320" xreflabel=""/>Options - This section describes LNET options. -
- 35.2.1.1 Network Topology - Network topology module parameters determine which networks a node should join, whether it should route between these networks, and how it communicates with non-local networks. - Here is a list of various networks and the supported software stacks: - - - - - - - Network - Software Stack - - - - - o2ib - OFED Version 2 - - - mx - Myrinet MX - - - gm - Myrinet GM-2 - - - - - Lustre ignores the loopback interface (lo0), but Lustre use any IP addresses aliased to the loopback (by default). When in doubt, explicitly specify networks. - ip2nets ("") is a string that lists globally-available networks, each with a set of IP address ranges. LNET determines the locally-available networks from this list by matching the IP address ranges with the local IPs of a node. The purpose of this option is to be able to use the same modules.conf file across a variety of nodes on different networks. The string has the following syntax. - <ip2nets> :== <net-match> [ <comment> ] { <net-sep> <net-match> } -<net-match> :== [ <w> ] <net-spec> <w> <ip-range> { <w> <ip-range> } -[ <w> ] -<net-spec> :== <network> [ "(" <interface-list> ")" ] -<network> :== <nettype> [ <number> ] -<nettype> :== "tcp" | "elan" | "openib" | ... -<iface-list> :== <interface> [ "," <iface-list> ] -<ip-range> :== <r-expr> "." <r-expr> "." <r-expr> "." <r-expr> -<r-expr> :== <number> | "*" | "[" <r-list> "]" -<r-list> :== <range> [ "," <r-list> ] -<range> :== <number> [ "-" <number> [ "/" <number> ] ] -<comment :== "#" { <non-net-sep-chars> } -<net-sep> :== ";" | "\n" -<w> :== <whitespace-chars> { <whitespace-chars> } - - <net-spec> contains enough information to uniquely identify the network and load an appropriate LND. The LND determines the missing "address-within-network" part of the NID based on the interfaces it can use. - <iface-list> specifies which hardware interface the network can use. If omitted, all interfaces are used. 
LNDs that do not support the <iface-list> syntax cannot be configured to use particular interfaces and just use what is there. Only a single instance of these LNDs can exist on a node at any time, and <iface-list> must be omitted. - <net-match> entries are scanned in the order declared to see if one of the node's IP addresses matches one of the <ip-range> expressions. If there is a match, <net-spec> specifies the network to instantiate. Note that it is the first match for a particular network that counts. This can be used to simplify the match expression for the general case by placing it after the special cases. For example: - ip2nets="tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*" - - 4 nodes on the 134.32.1.* network have 2 interfaces (134.32.1.{4,6,8,10}) but all the rest have 1. - ip2nets="vib 192.168.0.*; tcp(eth2) 192.168.0.[1,7,4,12]" - - This describes an IB cluster on 192.168.0.*. Four of these nodes also have IP interfaces; these four could be used as routers. - Note that match-all expressions (For instance, *.*.*.*) effectively mask all other - <net-match> entries specified after them. They should be used with caution. - Here is a more complicated situation, the route parameter is explained below. We have: - - Two TCP subnets - - - - One Elan subnet - - - - One machine set up as a router, with both TCP and Elan interfaces - - - - IP over Elan configured, but only IP will be used to label the nodes. - - - - options lnet ip2nets=â€tcp 198.129.135.* 192.128.88.98; \ - elan 198.128.88.98 198.129.135.3; \ - routes='cp 1022@elan # Elan NID of router; \ - elan 198.128.88.98@tcp # TCP NID of router ' - -
-
- 35.2.1.2 networks ("tcp") - This is an alternative to "ip2nets" which can be used to specify the networks to be instantiated explicitly. The syntax is a simple comma separated list of <net-spec>s (see above). The default is only used if neither 'ip2nets†nor 'networks†is specified. -
-
- 35.2.1.3 routes ("") - This is a string that lists networks and the NIDs of routers that forward to them. - It has the following syntax (<w> is one or more whitespace characters): - <routes> :== <route>{ ; <route> } -<route> :== [<net>[<w><hopcount>]<w><nid>{<w><nid>} - - So a node on the network tcp1 that needs to go through a router to get to the Elan network: - options lnet networks=tcp1 routes="elan 1 192.168.2.2@tcpA" - - The hopcount is used to help choose the best path between multiply-routed configurations. - A simple but powerful expansion syntax is provided, both for target networks and router NIDs as follows. - <expansion> :== "[" <entry> { "," <entry> } "]" -<entry> :== <numeric range> | <non-numeric item> -<numeric range> :== <number> [ "-" <number> [ "/" <number> ] ] - - The expansion is a list enclosed in square brackets. Numeric items in the list may be a single number, a contiguous range of numbers, or a strided range of numbers. For example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent (hopcount defaults to 1); and is accessible via 3 routers on the tcp0 network (192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp). - routes="[tcp,vib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and vib0) are accessible through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means that traffic to both these networks will be traversed 2 routers - first one of the routers specified in this entry, then one more. - Duplicate entries, entries that route to a local network, and entries that specify routers on a non-local network are ignored. - Equivalent entries are resolved in favor of the route with the shorter hopcount. The hopcount, if omitted, defaults to 1 (the remote network is adjacent). - It is an error to specify routes to the same destination with routers on different local networks. 
- If the target network string contains no expansions, then the hopcount defaults to 1 and may be omitted (that is, the remote network is adjacent). In practice, this is true for most multi-network configurations. It is an error to specify an inconsistent hop count for a given target network. This is why an explicit hopcount is required if the target network string specifies more than one network. -
-
- 35.2.1.4 forwarding ("") - This is a string that can be set either to "enabled" or "disabled" for explicit control of whether this node should act as a router, forwarding communications between all local networks. - A standalone router can be started by simply starting LNET ('modprobe ptlrpcâ€) with appropriate network topology options. - - - - - - - Variable - Description - - - - - acceptor - The acceptor is a TCP/IP service that some LNDs use to establish communications. If a local network requires it and it has not been disabled, the acceptor listens on a single port for connection requests that it redirects to the appropriate local network. The acceptor is part of the LNET module and configured by the following options: - secure - Accept connections only from reserved TCP ports (< 1023). - - - all - Accept connections from any TCP port. NOTE: this is required for liblustre clients to allow connections on non-privileged ports. - - - none - Do not run the acceptor. - - - - - accept_port (988) - Port number on which the acceptor should listen for connection requests. All nodes in a site configuration that require an acceptor must use the same port. - - - accept_backlog (127) - Maximum length that the queue of pending connections may grow to (see listen(2)). - - - accept_timeout (5, W) - Maximum time in seconds the acceptor is allowed to block while communicating with a peer. - - - accept_proto_version - Version of the acceptor protocol that should be used by outgoing connection requests. It defaults to the most recent acceptor protocol version, but it may be set to the previous version to allow the node to initiate connections with nodes that only understand that version of the acceptor protocol. The acceptor can, with some restrictions, handle either version (that is, it can accept connections from both 'old' and 'new' peers). 
For the current version of the acceptor protocol (version 1), the acceptor is compatible with old peers if it is only required by a single local network. - - - - - - -
-
-
- 35.2.2 SOCKLND <anchor xml:id="dbdoclet.50438293_marker-1293448" xreflabel=""/>Kernel TCP/IP LND - The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers. - It supports multiple instances and load balances dynamically over multiple interfaces. If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters. - Consider a node on the 'edge†of an InfiniBand network, with a low-bandwidth management Ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1,eth2) providing off-cluster connectivity. This node should be configured with "networks=vib,tcp(eth1,eth2)†to ensure that the socklnd ignores the management Ethernet and IPoIB. + +
+ 35.1 Introduction + LNET network hardware and routing are now configured via module parameters. Parameters should be specified in the /etc/modprobe.conf file, for example: + alias lustre llite +options lnet networks=tcp0,elan0 + The above option specifies that this node should use all the available TCP and Elan interfaces. + Module parameters are read when the module is first loaded. Type-specific LND modules (for instance, ksocklnd) are loaded automatically by the LNET module when LNET starts (typically upon modprobe ptlrpc). + Under Linux 2.6, LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under LNET, and LND-specific parameters under the name of the corresponding LND. + Under Linux 2.4, sysfs is not available, but the LND-specific parameters are accessible via equivalent paths under /proc. + Important: All old (pre v.1.4.6) Lustre configuration lines should be removed from the module configuration files and replaced with the following. Make sure that CONFIG_KMOD is set in your linux.config so LNET can load the modules it needs. The basic module files are: + modprobe.conf (for Linux 2.6) + alias lustre llite +options lnet networks=tcp0,elan0 + modules.conf (for Linux 2.4) + alias lustre llite +options lnet networks=tcp0,elan0 + For the following parameters, default option settings are shown in parentheses. Changes to parameters marked with a W affect running systems. (Unmarked parameters can only be set when LNET loads for the first time.) Changes to parameters marked with Wc take effect only when connections are established (existing connections are not affected by these changes.) +
+
+ 35.2 Module Options + + + With routed or other multi-network configurations, use ip2nets rather than networks, so all nodes can use the same configuration. + + + For a routed network, use the same 'routes' configuration everywhere. Nodes specified as routers automatically enable forwarding and any routes that are not relevant to a particular node are ignored. Keep a common configuration to guarantee that all nodes have consistent routing tables. + + + A separate modprobe.conf.lnet included from modprobe.conf makes distributing the configuration much easier. + + + If you set config_on_load=1, LNET starts at modprobe time rather than waiting for Lustre to start. This ensures routers start working at module load time. + + + # lctl +# lctl> net down + + + Remember the lctl ping {nid} command - it is a handy way to check your LNET configuration. + + +
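The separate-file tip above can be sketched as follows. This is an illustration, not site-specific advice: the file names follow the convention mentioned in the list, the network values are invented, and the include directive assumes a modprobe implementation that supports it (module-init-tools does).

```
# /etc/modprobe.conf -- per-node stub (identical everywhere)
include /etc/modprobe.conf.lnet

# /etc/modprobe.conf.lnet -- shared LNET configuration, distributed to all nodes
options lnet ip2nets="tcp(eth0) 192.168.0.*; elan 132.6.1.*"
```

Because ip2nets is written in terms of address ranges, every node can carry the identical file and still instantiate only the networks that match its own interfaces.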
+ 35.2.1 LNET Options + This section describes LNET options. +
+ 35.2.1.1 Network Topology + Network topology module parameters determine which networks a node should join, whether it should route between these networks, and how it communicates with non-local networks. + Here is a list of various networks and the supported software stacks: - Variable - Description + + Network + + + Software Stack + - timeout (50,W) - Time (in seconds) that communications may be stalled before the LND completes them with failure. - - - nconnds (4) - Sets the number of connection daemons. + + o2ib + + + OFED Version 2 + - min_reconnectms (1000,W) - Minimum connection retry interval (in milliseconds). After a failed connection attempt, this is the time that must elapse before the first retry. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'. + + mx + + + Myrinet MX + - max_reconnectms (6000,W) - Maximum connection retry interval (in milliseconds). - - - eager_ack (0 on linux, 1 on darwin,W) - Boolean that determines whether the socklnd should attempt to flush sends on message boundaries. - - - typed_conns (1,Wc) - Boolean that determines whether the socklnd should use different sockets for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else. - - - min_bulk (1024,W) - Determines when a message is considered "bulk". - - - tx_buffer_size, rx_buffer_size (8388608,Wc) - Socket buffer sizes. Setting this option to zero (0), allows the system to auto-tune buffer sizes. WARNING: Be very careful changing this value as improper sizing can harm performance. - - - nagle (0,Wc) - Boolean that determines if nagle should be enabled. It should never be set in production systems. - - - keepalive_idle (30,Wc) - Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives. 
- - - keepalive_intvl (2,Wc) - Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives. - - - keepalive_count (10,Wc) - Number of unanswered keepalive probes before pronouncing socket (hence peer) death. - - - enable_irq_affinity (0,Wc) - Boolean that determines whether to enable IRQ affinity. The default is zero (0).When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system to exist and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see increase in the performance with this parameter disabled. - - - zc_min_frag (2048,W) - Determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE disables all zero copy sends. This option is not available on all platforms. - - - - + + gm + + + Myrinet GM-2 + + + Lustre ignores the loopback interface (lo0), but Lustre uses any IP addresses aliased to the loopback (by default). When in doubt, explicitly specify networks. + + ip2nets ("") is a string that lists globally-available networks, each with a set of IP address ranges. LNET determines the locally-available networks from this list by matching the IP address ranges with the local IPs of a node. The purpose of this option is to be able to use the same modules.conf file across a variety of nodes on different networks. The string has the following syntax. + <ip2nets> :== <net-match> [ <comment> ] { <net-sep> <net-match> } +<net-match> :== [ <w> ] <net-spec> <w> <ip-range> { <w> <ip-range> } +[ <w> ] +<net-spec> :== <network> [ "(" <interface-list> ")" ] +<network> :== <nettype> [ <number> ] +<nettype> :== "tcp" | "elan" | "openib" | ... +<iface-list> :== <interface> [ "," <iface-list> ] +<ip-range> :== <r-expr> "." <r-expr> "." <r-expr> "." 
<r-expr> +<r-expr> :== <number> | "*" | "[" <r-list> "]" +<r-list> :== <range> [ "," <r-list> ] +<range> :== <number> [ "-" <number> [ "/" <number> ] ] +<comment :== "#" { <non-net-sep-chars> } +<net-sep> :== ";" | "\n" +<w> :== <whitespace-chars> { <whitespace-chars> } + + <net-spec> contains enough information to uniquely identify the network and load an appropriate LND. The LND determines the missing "address-within-network" part of the NID based on the interfaces it can use. + <iface-list> specifies which hardware interface the network can use. If omitted, all interfaces are used. LNDs that do not support the <iface-list> syntax cannot be configured to use particular interfaces and just use what is there. Only a single instance of these LNDs can exist on a node at any time, and <iface-list> must be omitted. + <net-match> entries are scanned in the order declared to see if one of the node's IP addresses matches one of the <ip-range> expressions. If there is a match, <net-spec> specifies the network to instantiate. Note that it is the first match for a particular network that counts. This can be used to simplify the match expression for the general case by placing it after the special cases. For example: + ip2nets="tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*" + 4 nodes on the 134.32.1.* network have 2 interfaces (134.32.1.{4,6,8,10}) but all the rest have 1. + ip2nets="vib 192.168.0.*; tcp(eth2) 192.168.0.[1,7,4,12]" + This describes an IB cluster on 192.168.0.*. Four of these nodes also have IP interfaces; these four could be used as routers. + Note that match-all expressions (For instance, *.*.*.*) effectively mask all other + <net-match> entries specified after them. They should be used with caution. + Here is a more complicated situation, the route parameter is explained below. 
We have: + + + Two TCP subnets + + + One Elan subnet + + + One machine set up as a router, with both TCP and Elan interfaces + + + IP over Elan configured, but only IP will be used to label the nodes. + + + options lnet ip2nets="tcp 198.129.135.* 192.128.88.98; \ + elan 198.128.88.98 198.129.135.3;" \ + routes="tcp 1022@elan # Elan NID of router; \ + elan 198.128.88.98@tcp # TCP NID of router "
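The <ip-range> matching described above can be illustrated with a short sketch. This is hypothetical helper code, not part of LNET; it handles only the numeric forms shown in the grammar (a plain number, "*", or a bracketed list of single numbers, contiguous ranges, and strided ranges such as "[4-10/2]"):

```python
# Hypothetical illustration (not LNET source) of <ip-range> matching.
def octet_matches(pattern, value):
    """Match one dotted-quad position against one octet pattern."""
    if pattern == "*":
        return True
    if pattern.startswith("[") and pattern.endswith("]"):
        for item in pattern[1:-1].split(","):
            stride = 1
            if "/" in item:                      # strided range, e.g. "4-10/2"
                item, stride_text = item.split("/")
                stride = int(stride_text)
            if "-" in item:                      # contiguous range, e.g. "4-10"
                lo, hi = (int(x) for x in item.split("-"))
            else:                                # single number
                lo = hi = int(item)
            if lo <= value <= hi and (value - lo) % stride == 0:
                return True
        return False
    return int(pattern) == value

def ip_matches(ip_range, ip):
    """True if the dotted-quad ip falls inside the <ip-range> expression."""
    patterns = ip_range.split(".")
    octets = [int(o) for o in ip.split(".")]
    return len(patterns) == len(octets) and all(
        octet_matches(p, o) for p, o in zip(patterns, octets))
```

With the first example above, ip_matches("134.32.1.[4-10/2]", "134.32.1.6") holds, while 134.32.1.5 falls off the stride and does not match.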
-
- 35.2.3 Portals <anchor xml:id="dbdoclet.50438293_marker-1293719" xreflabel=""/>LND (Linux) - The Portals LND Linux (ptllnd) can be used as a interface layer to communicate with Sandia Portals networking devices. This version is intended to work on Cray XT3 Linux nodes that use Cray Portals as a network transport. - Message Buffers - When ptllnd starts up, it allocates and posts sufficient message buffers to allow all expected peers (set by concurrent_peers) to send one unsolicited message. The first message that a peer actually sends is a - (so-called) "HELLO" message, used to negotiate how much additional buffering to setup (typically 8 messages). If 10000 peers actually exist, then enough buffers are posted for 80000 messages. - The maximum message size is set by the max_msg_size module parameter (default value is 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint, payload data is sent in the message itself. Above this breakpoint, a buffer descriptor is sent and the receiver gets the actual payload. - The buffer size is set by the rxb_npages module parameter (default value is 1). The default conservatively avoids allocation problems due to kernel memory fragmentation. However, increasing this value to 2 is probably not risky. - The ptllnd also keeps an additional rxb_nspare buffers (default value is 8) posted to account for full buffers being handled. - Assuming a 4K page size with 10000 peers, 1258 buffers can be expected to be posted at startup, increasing to a maximum of 10008 as peers that are actually connected. By doubling rxb_npages halving max_msg_size, this number can be reduced by a factor of 4. - ME/MD Queue Length - The ptllnd uses a single portal set by the portal module parameter (default value of 9) for both message and bulk buffers. Message buffers are always attached with PTL_INS_AFTER and match anything sent with "message" matchbits. 
Bulk buffers are always attached with PTL_INS_BEFORE and match only specific matchbits for that particular bulk transfer. - This scheme assumes that the majority of ME / MDs posted are for "message" buffers, and that the overhead of searching through the preceding "bulk" buffers is acceptable. Since the number of "bulk" buffers posted at any time is also dependent on the bulk transfer breakpoint set by max_msg_size, this seems like an issue worth measuring at scale. - TX Descriptors - The ptllnd has a pool of so-called "tx descriptors", which it uses not only for outgoing messages, but also to hold state for bulk transfers requested by incoming messages. This pool should scale with the total number of peers. - To enable the building of the Portals LND (ptllnd.ko) configure with this option: - ./configure --with-portals=<path-to-portals-headers> - - - - - - - Variable - Description - - - - - ntx (256) - Total number of messaging descriptors. - - - concurrent_peers (1152) - Maximum number of concurrent peers. Peers that attempt to connect beyond the maximum are not allowed. - - - peer_hash_table_size (101) - Number of hash table slots for the peers. This number should scale with concurrent_peers. The size of the peer hash table is set by the module parameter peer_hash_table_size which defaults to a value of 101. This number should be prime to ensure the peer hash table is populated evenly. It is advisable to increase this value to 1001 for ~10000 peers. - - - cksum (0) - Set to non-zero to enable message (not RDMA) checksums for outgoing packets. Incoming packets are always check-summed if necessary, independent of this value. - - - timeout (50) - Amount of time (in seconds) that a request can linger in a peers-active queue before the peer is considered dead. - - - portal (9) - Portal ID to use for the ptllnd traffic. - - - rxb_npages (64 * #cpus) - Number of pages in an RX buffer. 
- - - credits (128) - Maximum total number of concurrent sends that are outstanding to a single peer at a given time. - - - peercredits (8) - Maximum number of concurrent sends that are outstanding to a single peer at a given time. - - - max_msg_size (512) - Maximum immediate message size. This MUST be the same on all nodes in a cluster. A peer that connects with a different max_msg_size value will be rejected. - - - - +
+ 35.2.1.2 networks ("tcp") + This is an alternative to "ip2nets" that can be used to explicitly specify the networks to instantiate. The syntax is a simple comma-separated list of <net-spec>s (see above). The default is only used if neither 'ip2nets' nor 'networks' is specified.
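As a sketch of the explicit form (interface and network names are illustrative), a node with two Ethernet ports on tcp0 and an Elan rail might use:

```
options lnet networks=tcp0(eth1,eth2),elan0
```

Unlike ip2nets, such a line is node-specific, so it must be tailored per node rather than shared as a single site-wide file.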
-
- 35.2.4 MX <anchor xml:id="dbdoclet.50438293_marker-1295997" xreflabel=""/>LND - MXLND supports a number of load-time parameters using Linux's module parameter system. The following variables are available: +
+ 35.2.1.3 routes ("") + This is a string that lists networks and the NIDs of routers that forward to them. + It has the following syntax (<w> is one or more whitespace characters): + <routes> :== <route>{ ; <route> } +<route> :== <net>[<w><hopcount>]<w><nid>{<w><nid>} + So a node on the network tcp1 that needs to go through a router to get to the Elan network: + options lnet networks=tcp1 routes="elan 1 192.168.2.2@tcpA" + The hopcount is used to help choose the best path in multiply-routed configurations. + A simple but powerful expansion syntax is provided, both for target networks and router NIDs, as follows. + <expansion> :== "[" <entry> { "," <entry> } "]" +<entry> :== <numeric range> | <non-numeric item> +<numeric range> :== <number> [ "-" <number> [ "/" <number> ] ] + The expansion is a list enclosed in square brackets. Numeric items in the list may be a single number, a contiguous range of numbers, or a strided range of numbers. For example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent (hopcount defaults to 1) and is accessible via 3 routers on the tcp0 network (192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp). + routes="[tcp,vib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and vib0) are accessible through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means that traffic to both these networks will traverse 2 routers - first one of the routers specified in this entry, then one more. + Duplicate entries, entries that route to a local network, and entries that specify routers on a non-local network are ignored. + Equivalent entries are resolved in favor of the route with the shorter hopcount. The hopcount, if omitted, defaults to 1 (the remote network is adjacent). + It is an error to specify routes to the same destination with routers on different local networks. 
+ If the target network string contains no expansions, then the hopcount defaults to 1 and may be omitted (that is, the remote network is adjacent). In practice, this is true for most multi-network configurations. It is an error to specify an inconsistent hop count for a given target network. This is why an explicit hopcount is required if the target network string specifies more than one network. +
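The bracketed expansion described above can likewise be sketched with a small, hypothetical parser. This is not the LNET implementation; it covers only the numeric <entry> forms, ignoring non-numeric items:

```python
import re

# Hypothetical illustration (not LNET source) of routes NID expansion,
# e.g. "[8-14/2]@elan" or "192.168.1.[22-24]@tcp".
def expand_nids(expr):
    # Optional prefix, one bracketed list, and the "@<net>" suffix.
    m = re.fullmatch(r"(.*?)\[([^\]]+)\]@(\S+)", expr)
    if m is None:
        return [expr]  # no expansion list: already a literal NID
    prefix, items, net = m.groups()
    nids = []
    for item in items.split(","):
        stride = 1
        if "/" in item:                 # strided range, e.g. "8-14/2"
            item, stride_text = item.split("/")
            stride = int(stride_text)
        if "-" in item:                 # contiguous range, e.g. "22-24"
            lo, hi = (int(x) for x in item.split("-"))
        else:                           # single number
            lo = hi = int(item)
        nids.extend(f"{prefix}{n}@{net}" for n in range(lo, hi + 1, stride))
    return nids
```

For the strided example above, expand_nids("[8-14/2]@elan") yields the four router NIDs 8@elan, 10@elan, 12@elan and 14@elan.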
+
+ 35.2.1.4 forwarding ("") + This is a string that can be set either to "enabled" or "disabled" for explicit control of whether this node should act as a router, forwarding communications between all local networks. + A standalone router can be started by simply starting LNET ('modprobe ptlrpc') with appropriate network topology options. - Variable - Description + + Variable + + + Description + - n_waitd - Number of completion daemons. - - - max_peers - Maximum number of peers that may connect. - - - cksum - Enables small message (< 4 KB) checksums if set to a non-zero value. - - - ntx - Number of total tx message descriptors. - - - credits - Number of concurrent sends to a single peer. - - - board - Index value of the Myrinet board (NIC). - - - ep_id - MX endpoint ID. - - - polling - Use zero (0) to block (wait). A value > 0 will poll that many times before blocking. - - - hosts - IP-to-hostname resolution file. + + acceptor + + + The acceptor is a TCP/IP service that some LNDs use to establish communications. If a local network requires it and it has not been disabled, the acceptor listens on a single port for connection requests that it redirects to the appropriate local network. The acceptor is part of the LNET module and configured by the following options: + + + secure - Accept connections only from reserved TCP ports (< 1023). + + + all - Accept connections from any TCP port. + + This is required for liblustre clients to allow connections on non-privileged ports. + + + + none - Do not run the acceptor. + + + + + + + accept_port + (988) + + + Port number on which the acceptor should listen for connection requests. All nodes in a site configuration that require an acceptor must use the same port. + + + + + accept_backlog + (127) + + + Maximum length that the queue of pending connections may grow to (see listen(2)). + + + + + accept_timeout + (5, W) + + + Maximum time in seconds the acceptor is allowed to block while communicating with a peer. 
+ + + + + accept_proto_version + + + Version of the acceptor protocol that should be used by outgoing connection requests. It defaults to the most recent acceptor protocol version, but it may be set to the previous version to allow the node to initiate connections with nodes that only understand that version of the acceptor protocol. The acceptor can, with some restrictions, handle either version (that is, it can accept connections from both 'old' and 'new' peers). For the current version of the acceptor protocol (version 1), the acceptor is compatible with old peers if it is only required by a single local network. + - Of the described variables, only hosts is required. It must be the absolute path to the MXLND hosts file. - For example: - options kmxlnd hosts=/etc/hosts.mxlnd - - The file format for the hosts file is: - IP HOST BOARD EP_ID - - The values must be space and/or tab separated where: - IP is a valid IPv4 address - HOST is the name returned by `hostname` on that machine - BOARD is the index of the Myricom NIC (0 for the first card, etc.) - EP_ID is the MX endpoint ID - To obtain the optimal performance for your platform, you may want to vary the remaining options. - n_waitd (1) sets the number of threads that process completed MX requests (sends and receives). - max_peers (1024) tells MXLND the upper limit of machines that it will need to communicate with. This affects how many receives it will pre-post and each receive will use one page of memory. Ideally, on clients, this value will be equal to the total number of Lustre servers (MDS and OSS). On servers, it needs to equal the total number of machines in the storage system. cksum (0) turns on small message checksums. It can be used to aid in troubleshooting. MX also provides an optional checksumming feature which can check all messages (large and small). For details, see the MX README. - ntx (256) is the number of total sends in flight from this machine. 
In actuality, MXLND reserves half of them for connect messages so make this value twice as large as you want for the total number of sends in flight. - credits (8) is the number of in-flight messages for a specific peer. This is part of the flow-control system in Lustre. Increasing this value may improve performance but it requires more memory because each message requires at least one page. - board (0) is the index of the Myricom NIC. Hosts can have multiple Myricom NICs and this identifies which one MXLND should use. This value must match the board value in your MXLND hosts file for this host. - ep_id (3) is the MX endpoint ID. Each process that uses MX is required to have at least one MX endpoint to access the MX library and NIC. The ID is a simple index starting at zero (0). This value must match the endpoint ID value in your MXLND hosts file for this host. - polling (0) determines whether this host will poll or block for MX request completions. A value of 0 blocks and any positive value will poll that many times before blocking. Since polling increases CPU usage, we suggest that you set this to zero (0) on the client and experiment with different values for servers.
+
+ 35.2.2 <literal>SOCKLND</literal> Kernel TCP/IP LND
+ The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers.
+ It supports multiple instances and load balances dynamically over multiple interfaces. If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters.
+ Consider a node on the 'edge' of an InfiniBand network, with a low-bandwidth management Ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1,eth2) providing off-cluster connectivity. This node should be configured with 'networks=vib,tcp(eth1,eth2)' to ensure that the socklnd ignores the management Ethernet and IPoIB.
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ Variable
+ 
+ 
+ Description
+ 
+ 
+ 
+ 
+ 
+ 
+ timeout
+ (50,W)
+ 
+ 
+ Time (in seconds) that communications may be stalled before the LND completes them with failure.
+ 
+ 
+ 
+ 
+ nconnds
+ (4)
+ 
+ 
+ Sets the number of connection daemons.
+ 
+ 
+ 
+ 
+ min_reconnectms
+ (1000,W)
+ 
+ 
+ Minimum connection retry interval (in milliseconds). After a failed connection attempt, this is the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'.
+ 
+ 
+ 
+ 
+ max_reconnectms
+ (6000,W)
+ 
+ 
+ Maximum connection retry interval (in milliseconds).
+ 
+ 
+ 
+ 
+ eager_ack
+ (0 on linux,
+ 1 on darwin,W)
+ 
+ 
+ Boolean that determines whether the socklnd should attempt to flush sends on message boundaries.
+ 
+ 
+ 
+ 
+ typed_conns
+ (1,Wc)
+ 
+ 
+ Boolean that determines whether the socklnd should use different sockets for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else. 
+ 
+ 
+ 
+ 
+ min_bulk
+ (1024,W)
+ 
+ 
+ Determines when a message is considered "bulk".
+ 
+ 
+ 
+ 
+ tx_buffer_size, rx_buffer_size
+ (8388608,Wc)
+ 
+ 
+ Socket buffer sizes. Setting this option to zero (0) allows the system to auto-tune buffer sizes.
+ 
+ Be very careful changing this value as improper sizing can harm performance.
+ 
+ 
+ 
+ 
+ 
+ nagle
+ (0,Wc)
+ 
+ 
+ Boolean that determines whether the Nagle algorithm should be enabled. It should never be set in production systems.
+ 
+ 
+ 
+ 
+ keepalive_idle
+ (30,Wc)
+ 
+ 
+ Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives.
+ 
+ 
+ 
+ 
+ keepalive_intvl
+ (2,Wc)
+ 
+ 
+ Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives.
+ 
+ 
+ 
+ 
+ keepalive_count
+ (10,Wc)
+ 
+ 
+ Number of unanswered keepalive probes before pronouncing socket (hence peer) death.
+ 
+ 
+ 
+ 
+ enable_irq_affinity
+ (0,Wc)
+ 
+ 
+ Boolean that determines whether to enable IRQ affinity. The default is zero (0).
+ When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see an increase in performance with this parameter disabled.
+ 
+ 
+ 
+ 
+ zc_min_frag
+ (2048,W)
+ 
+ 
+ Determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE disables all zero copy sends. This option is not available on all platforms.
+ 
+ 
+ 
+ 
+ 
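+ As an example of how these parameters might be combined, the following hypothetical modprobe.conf fragment restricts the socklnd to the two GigE interfaces from the scenario above and adjusts the keepalive settings; the interface names and values are illustrative only, not tuning recommendations:
+ options lnet networks=tcp0(eth1,eth2)
+options ksocklnd keepalive_idle=20 keepalive_intvl=2 keepalive_count=10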
+
+ 35.2.3 Portals LND (Linux)
+ The Portals LND Linux (ptllnd) can be used as an interface layer to communicate with Sandia Portals networking devices. This version is intended to work on Cray XT3 Linux nodes that use Cray Portals as a network transport.
+ Message Buffers
+ When ptllnd starts up, it allocates and posts sufficient message buffers to allow all expected peers (set by concurrent_peers) to send one unsolicited message. The first message that a peer actually sends is a so-called "HELLO" message, used to negotiate how much additional buffering to set up (typically 8 messages). If 10000 peers actually exist, then enough buffers are posted for 80000 messages.
+ The maximum message size is set by the max_msg_size module parameter (default value is 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint, payload data is sent in the message itself. Above this breakpoint, a buffer descriptor is sent and the receiver gets the actual payload.
+ The buffer size is set by the rxb_npages module parameter (default value is 1). The default conservatively avoids allocation problems due to kernel memory fragmentation. However, increasing this value to 2 is probably not risky.
+ The ptllnd also keeps rxb_nspare additional buffers (default value is 8) posted to account for full buffers being handled.
+ Assuming a 4K page size with 10000 peers, 1258 buffers can be expected to be posted at startup, increasing to a maximum of 10008 as peers actually connect. By doubling rxb_npages and halving max_msg_size, this number can be reduced by a factor of 4.
+ ME/MD Queue Length
+ The ptllnd uses a single portal set by the portal module parameter (default value of 9) for both message and bulk buffers. Message buffers are always attached with PTL_INS_AFTER and match anything sent with "message" matchbits. Bulk buffers are always attached with PTL_INS_BEFORE and match only specific matchbits for that particular bulk transfer. 
+ This scheme assumes that the majority of ME/MDs posted are for "message" buffers, and that the overhead of searching through the preceding "bulk" buffers is acceptable. Since the number of "bulk" buffers posted at any time is also dependent on the bulk transfer breakpoint set by max_msg_size, this seems like an issue worth measuring at scale. + TX Descriptors + The ptllnd has a pool of so-called "tx descriptors", which it uses not only for outgoing messages, but also to hold state for bulk transfers requested by incoming messages. This pool should scale with the total number of peers. + To enable the building of the Portals LND (ptllnd.ko) configure with this option: + ./configure --with-portals=<path-to-portals-headers> + + + + + + + + Variable + + + Description + + + + + + + ntx + (256) + + + Total number of messaging descriptors. + + + + + concurrent_peers + (1152) + + + Maximum number of concurrent peers. Peers that attempt to connect beyond the maximum are not allowed. + + + + + peer_hash_table_size + (101) + + + Number of hash table slots for the peers. This number should scale with concurrent_peers. The size of the peer hash table is set by the module parameter peer_hash_table_size which defaults to a value of 101. This number should be prime to ensure the peer hash table is populated evenly. It is advisable to increase this value to 1001 for ~10000 peers. + + + + + cksum + (0) + + + Set to non-zero to enable message (not RDMA) checksums for outgoing packets. Incoming packets are always check-summed if necessary, independent of this value. + + + + + timeout + (50) + + + Amount of time (in seconds) that a request can linger in a peers-active queue before the peer is considered dead. + + + + + portal + (9) + + + Portal ID to use for the ptllnd traffic. + + + + + rxb_npages + (64 * #cpus) + + + Number of pages in an RX buffer. + + + + + credits + (128) + + + Maximum total number of concurrent sends that are outstanding to a single peer at a given time. 
+ + + + + peercredits + (8) + + + Maximum number of concurrent sends that are outstanding to a single peer at a given time. + + + + + max_msg_size + (512) + + + Maximum immediate message size. This MUST be the same on all nodes in a cluster. A peer that connects with a different max_msg_size value will be rejected. + + + + + +
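+ Drawing on the buffer-sizing discussion above, a hypothetical modprobe.conf fragment for a large cluster might double rxb_npages and halve max_msg_size to reduce posted buffers, and size the peer hash table to the prime value suggested for ~10000 peers; the module name and values here are a sketch, not a recommendation:
+ options ptllnd concurrent_peers=10000 peer_hash_table_size=1001 rxb_npages=2 max_msg_size=256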
+
+ 35.2.4 MX LND + MXLND supports a number of load-time parameters using Linux's module parameter system. The following variables are available: + + + + + + + + Variable + + + Description + + + + + + + n_waitd + + + Number of completion daemons. + + + + + max_peers + + + Maximum number of peers that may connect. + + + + + cksum + + + Enables small message (< 4 KB) checksums if set to a non-zero value. + + + + + ntx + + + Number of total tx message descriptors. + + + + + credits + + + Number of concurrent sends to a single peer. + + + + + board + + + Index value of the Myrinet board (NIC). + + + + + ep_id + + + MX endpoint ID. + + + + + polling + + + Use zero (0) to block (wait). A value > 0 will poll that many times before blocking. + + + + + hosts + + + IP-to-hostname resolution file. + + + + + + Of the described variables, only hosts is required. It must be the absolute path to the MXLND hosts file. + For example: + options kmxlnd hosts=/etc/hosts.mxlnd + The file format for the hosts file is: + IP HOST BOARD EP_ID + The values must be space and/or tab separated where: + IP is a valid IPv4 address + HOST is the name returned by `hostname` on that machine + BOARD is the index of the Myricom NIC (0 for the first card, etc.) + EP_ID is the MX endpoint ID + To obtain the optimal performance for your platform, you may want to vary the remaining options. + n_waitd(1) sets the number of threads that process completed MX requests (sends and receives). + max_peers(1024) tells MXLND the upper limit of machines that it will need to communicate with. This affects how many receives it will pre-post and each receive will use one page of memory. Ideally, on clients, this value will be equal to the total number of Lustre servers (MDS and OSS). On servers, it needs to equal the total number of machines in the storage system. cksum (0) turns on small message checksums. It can be used to aid in troubleshooting. 
MX also provides an optional checksumming feature which can check all messages (large and small). For details, see the MX README.
+ ntx (256) is the number of total sends in flight from this machine. In actuality, MXLND reserves half of them for connect messages, so make this value twice as large as the desired total number of sends in flight.
+ credits (8) is the number of in-flight messages for a specific peer. This is part of the flow-control system in Lustre. Increasing this value may improve performance but it requires more memory because each message requires at least one page.
+ board (0) is the index of the Myricom NIC. Hosts can have multiple Myricom NICs and this identifies which one MXLND should use. This value must match the board value in your MXLND hosts file for this host.
+ ep_id (3) is the MX endpoint ID. Each process that uses MX is required to have at least one MX endpoint to access the MX library and NIC. The ID is a simple index starting at zero (0). This value must match the endpoint ID value in your MXLND hosts file for this host.
+ polling (0) determines whether this host will poll or block for MX request completions. A value of 0 blocks and any positive value will poll that many times before blocking. Since polling increases CPU usage, we suggest that you set this to zero (0) on the client and experiment with different values for servers.
+
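+ Putting the MXLND parameters together, the required hosts option from the example above could be combined with optional tuning in modprobe.conf as follows; the parameter values are purely illustrative assumptions for a mid-sized storage system, not recommendations:
+ options kmxlnd hosts=/etc/hosts.mxlnd max_peers=2048 credits=16 polling=0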
+
diff --git a/LustreProgrammingInterfaces.xml b/LustreProgrammingInterfaces.xml index 56521f9..e631bd5 100644 --- a/LustreProgrammingInterfaces.xml +++ b/LustreProgrammingInterfaces.xml @@ -1,90 +1,91 @@ - - + + - Lustre Programming Interfaces + Lustre Programming Interfaces This chapter describes public programming interfaces to control various aspects of Lustre from userspace. These interfaces are generally not guaranteed to remain unchanged over time, although we will make an effort to notify the user community well in advance of major changes. This chapter includes the following section: - - + + - - + - - - - Lustre programming interface man pages are found in the lustre/doc folder. - -
- 33.1 User/Group <anchor xml:id="dbdoclet.50438291_marker-1293215" xreflabel=""/>Cache Upcall - This section describes user and group upcall. - For information on a universal UID/GID, see Environmental Requirements. - -
- 33.1.1 Name - Use /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_upcall to look up a given user's group membership. -
-
- 33.1.2 Description - The group upcall file contains the path to an executable that, when installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (see Data Structures) and write it to the /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_info pseudo-file. - For a sample upcall program, see lustre/utils/l_getgroups.c in the Lustre source distribution. -
- 33.1.2.1 Primary and Secondary Groups - The mechanism for the primary/secondary group is as follows: - - The MDS issues an upcall (set per MDS) to map the numeric UID to the supplementary group(s). - - - - If there is no upcall or if there is an upcall and it fails, supplementary groups will be added as supplied by the client (as they are now). - - - - The default upcall is /usr/sbin/l_getidentity, which can interact with the user/group database to obtain UID/GID/suppgid. The user/group database depends on authentication configuration, and can be local /etc/passwd, NIS, LDAP, etc. If necessary, the administrator can use a parse utility to set /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_upcall. If the upcall interface is set to NONE, then upcall is disabled. The MDS uses the UID/GID/suppgid supplied by the client. - - - - The default group upcall is set by mkfs.lustre. Use tunefs.lustre --param or echo{path}>/proc/fs/lustre/mds/{mdsname}/group_upcall - - - - The Lustre administrator can specify permissions for a specific UID by configuring /etc/lustre/perm.conf on the MDS. As commented in lustre/utils/l_getidentity.c - - - - /** permission file format is like this: * {nid} {uid} {perms} * * '*' nid \ -means any nid* '*' uid means any uid* the valid values for perms are:* setu\ -id/setgid/setgrp/rmtacl -- enable corresponding perm* nosetuid/nosetgid/nos\ -etgrp/normtacl -- disable corresponding perm* they can be listed together, \ -seperated by ',',* when perm and noperm are in the same line (item), noperm\ - is preferential,* when they are in different lines (items), the latter is \ -preferential,* '*' nid is as default perm, and is not preferential.*/ - - Currently, rmtacl/normtacl can be ignored (part of security functionality), and used for remote clients. The /usr/sbin/l_getidentity utility can parse /etc/lustre/perm.conf to obtain permission mask for specified UID. - - To avoid repeated upcalls, the MDS caches supplemental group information. 
Use /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_expire to set the cache time (default is 600 seconds). The kernel waits for the upcall to complete (at most, 5 seconds) and takes the "failure" behavior as described. Set the wait time in /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_acquire_expire (default is 15 seconds). Cached entries are flushed by writing to /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_flush. - - - -
-
-
- 33.1.3 Parameters - - Name of the MDS service + + + Lustre programming interface man pages are found in the lustre/doc folder. + +
+ 33.1 User/Group Cache Upcall + This section describes user and group upcall. + + For information on a universal UID/GID, see . + +
+ 33.1.1 Name + Use /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_upcall to look up a given user's group membership. +
+
+ 33.1.2 Description + The group upcall file contains the path to an executable that, when installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (see ) and write it to the /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_info pseudo-file. + For a sample upcall program, see lustre/utils/l_getgroups.c in the Lustre source distribution. +
+ 33.1.2.1 Primary and Secondary Groups + The mechanism for the primary/secondary group is as follows: + + + The MDS issues an upcall (set per MDS) to map the numeric UID to the supplementary group(s). + + + If there is no upcall or if there is an upcall and it fails, supplementary groups will be added as supplied by the client (as they are now). + + + The default upcall is /usr/sbin/l_getidentity, which can interact with the user/group database to obtain UID/GID/suppgid. The user/group database depends on authentication configuration, and can be local /etc/passwd, NIS, LDAP, etc. If necessary, the administrator can use a parse utility to set /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_upcall. If the upcall interface is set to NONE, then upcall is disabled. The MDS uses the UID/GID/suppgid supplied by the client. - - - Numeric UID + + The default group upcall is set by mkfs.lustre. Use tunefs.lustre --param or echo{path}>/proc/fs/lustre/mds/{mdsname}/group_upcall - - + + The Lustre administrator can specify permissions for a specific UID by configuring /etc/lustre/perm.conf on the MDS. As commented in lustre/utils/l_getidentity.c + + + +/* +* permission file format is like this: +* {nid} {uid} {perms} +* +* '*' nid means any nid +* '*' uid means any uid +* the valid values for perms are: +* setuid/setgid/setgrp/rmtacl -- enable corresponding perm +* nosetuid/nosetgid/nosetgrp/normtacl -- disable corresponding perm +* they can be listed together, seperated by ',', +* when perm and noperm are in the same line (item), noperm is preferential, +* when they are in different lines (items), the latter is preferential, +* '*' nid is as default perm, and is not preferential.*/ + + Currently, rmtacl/normtacl can be ignored (part of security functionality), and used for remote clients. The /usr/sbin/l_getidentity utility can parse /etc/lustre/perm.conf to obtain permission mask for specified UID. 
+ + + To avoid repeated upcalls, the MDS caches supplemental group information. Use /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_expire to set the cache time (default is 600 seconds). The kernel waits for the upcall to complete (at most, 5 seconds) and takes the "failure" behavior as described. Set the wait time in /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_acquire_expire (default is 15 seconds). Cached entries are flushed by writing to /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_flush. + +
-
- 33.1.4 <anchor xml:id="dbdoclet.50438291_33759" xreflabel=""/>Data Structures - struct identity_downcall_data { +
+
+ 33.1.3 Parameters + + + Name of the MDS service + + + Numeric UID + + +
+
+ 33.1.4 <anchor xml:id="dbdoclet.50438291_33759" xreflabel=""/>Data Structures + struct identity_downcall_data { __u32 idd_magic; __u32 idd_err; __u32 idd_uid; @@ -93,27 +94,25 @@ preferential,* '*' nid is as default perm, and is not preferential.*/ struct perm_downcall_data idd_perms[N_PERMS_MAX]; __u32 idd_ngroups; __u32 idd_groups[0]; -}; - -
+}; +
+
+
+ 33.2 <literal>l_getgroups</literal><anchor xml:id="dbdoclet.50438291_marker-1294565" xreflabel=""/> Utility + The l_getgroups utility handles Lustre user/group cache upcall. +
+ Synopsis
+ l_getgroups [-v] [-d|mdsname] uid
+l_getgroups [-v] -s
+
+
+ Description + The group upcall file contains the path to an executable that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (see Data structures) and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file. + l_getgroups is the reference implementation of the user/group cache upcall. +
+
+ Files + /proc/fs/lustre/mds/mds-service/group_upcall
-
- 33.2 l_getgroups<anchor xml:id="dbdoclet.50438291_marker-1294565" xreflabel=""/> Utility - The l_getgroups utility handles Lustre user/group cache upcall. -
- Synopsis - l_getgroups [-v] [-d|mdsname] uid] -l_getgroups [-v] -s - -
-
- Description - The group upcall file contains the path to an executable that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (see Data structures) and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file. - l_getgroups is the reference implementation of the user/group cache upcall. -
-
- Files - /proc/fs/lustre/mds/mds-service/group_upcall -
diff --git a/Preface.xml b/Preface.xml index d92129f..9a6b4d5 100644 --- a/Preface.xml +++ b/Preface.xml @@ -1,12 +1,10 @@ - - + Preface This operations manual provides detailed information and procedures to install, configure and tune the Lustre file system. The manual covers topics such as failover, quotas, striping and bonding. The Lustre manual also contains troubleshooting information and tips to improve Lustre operation and performance.
- - - + About this Document + This document is maintained by Whamcloud, Inc in Docbook format. The canonical version is available at http://wiki.whamcloud.com/.
UNIX Commands This document might not contain information about basic UNIX commands and procedures such as shutting down the system, booting the system, and configuring devices. Refer to the following for this information: @@ -74,7 +72,7 @@
- <anchor xml:id="dbdoclet.50438247_43930" xreflabel=""/>Related Documentation + Related Documentation The documents listed as online are available at: http://wiki.whamcloud.com/display/PUB/Documentation @@ -144,10 +142,9 @@ Support http://www.whamcloud.com/ - Training http://www.whamcloud.com/ + Training http://www.whamcloud.com/ -  
diff --git a/SettingLustreProperties.xml b/SettingLustreProperties.xml index 1d02230..48a7f72 100644 --- a/SettingLustreProperties.xml +++ b/SettingLustreProperties.xml @@ -1,124 +1,138 @@ - - + + - Setting Lustre Properties in a C Program (llapi) + Setting Lustre Properties in a C Program (<literal>llapi</literal>) - This chapter describes the llapi library of commands used for setting Lustre file properties within a C program running in a cluster environment, such as a data processing or MPI application. The commands described in this chapter are: - + This chapter describes the llapi library of commands used for setting Lustre file properties within a C program running in a cluster environment, such as a data processing or MPI application. The commands described in this chapter are: + + - - + - - + - - + - - + + + + Lustre programming interface man pages are found in the lustre/doc folder. + +
+ 34.1 <literal>llapi_file_create</literal> + Use llapi_file_create to set Lustre properties for a new file. +
+ Synopsis + #include <lustre/liblustreapi.h> +#include <lustre/lustre_user.h> - - - Lustre programming interface man pages are found in the lustre/doc folder. - -
- 34.1 llapi_file_create - Use llapi_file_create to set Lustre properties for a new file. -
- Synopsis - #include <lustre/liblustreapi.h>#include <lustre/lustre_user.h> -int llapi_file_create(char *name, long stripe_size, int stripe_offset, int \ -stripe_count, int stripe_pattern); +int llapi_file_create(char *name, long stripe_size, int stripe_offset, int stripe_count, int stripe_pattern); -
-
- Description - The llapi_file_create() function sets a file descriptor's Lustre striping information. The file descriptor is then accessed with open (). - - - - - - - Option - Description - - - - - llapi_file_create() - If the file already exists, this parameter returns to 'EEXIST'. If the stripe parameters are invalid, this parameter returns to 'EINVAL'. - - - stripe_size - This value must be an even multiple of system page size, as shown by getpagesize (). The default Lustre stripe size is 4MB. - - - stripe_offset - Indicates the starting OST for this file. - - - stripe_count - Indicates the number of OSTs that this file will be striped across. - - - stripe_pattern - Indicates the RAID pattern. - - - - - Currently, only RAID 0 is supported. To use the system defaults, set these values: stripe_size = 0, stripe_offset = -1, stripe_count = 0, stripe_pattern = 0 -
-
- Examples - System default size is 4 MB. - char *tfile = TESTFILE; -int stripe_size = 65536 - - To start at default, run: - int stripe_offset = -1 - - To start at the default, run: - int stripe_count = 1 - - To set a single stripe for this example, run: - int stripe_pattern = 0 - - Currently, only RAID 0 is supported. - int stripe_pattern = 0; +
+
+ Description
+ The llapi_file_create() function sets a file descriptor's Lustre striping information. The file descriptor is then accessed with open().
+ 
+ 
+ 
+ 
+ 
+ 
+ Option
+ 
+ 
+ Description
+ 
+ 
+ 
+ 
+ 
+ 
+ llapi_file_create()
+ 
+ 
+ If the file already exists, the function returns 'EEXIST'. If the stripe parameters are invalid, it returns 'EINVAL'.
+ 
+ 
+ 
+ 
+ stripe_size
+ 
+ 
+ This value must be an even multiple of the system page size, as shown by getpagesize(). The default Lustre stripe size is 4 MB.
+ 
+ 
+ 
+ 
+ stripe_offset
+ 
+ 
+ Indicates the starting OST for this file.
+ 
+ 
+ 
+ 
+ stripe_count
+ 
+ 
+ Indicates the number of OSTs that this file will be striped across.
+ 
+ 
+ 
+ 
+ stripe_pattern
+ 
+ 
+ Indicates the RAID pattern.
+ 
+ 
+ 
+ 
+ 
+ 
+ Currently, only RAID 0 is supported. To use the system defaults, set these values: stripe_size = 0, stripe_offset = -1, stripe_count = 0, stripe_pattern = 0
+ 
+
+ Examples
+ System default size is 4 MB.
+ char *tfile = TESTFILE;
+int stripe_size = 65536
+ To use the default stripe offset, run:
+ int stripe_offset = -1
+ To set a single stripe for this example, run:
+ int stripe_count = 1
+ Currently, only RAID 0 is supported, so use pattern 0:
+ int stripe_pattern = 0;
+int rc, fd;
+rc = llapi_file_create(tfile, stripe_size, stripe_offset, stripe_count, stripe_pattern);
+ The result code is inverted; you may return with 'EINVAL' or an ioctl error.
+ if (rc) {
+fprintf(stderr, "llapi_file_create failed: %d (%s)\n", rc, strerror(-rc));
+return -1; }
+ llapi_file_create closes the file descriptor. You must re-open the descriptor. To do this, run:
+ fd = open(tfile, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644);
+if (fd < 0) {
+fprintf(stderr, "Can't open %s file: %s\n", tfile,
+strerror(errno));
+return -1;
+}
+}
-
- 34.2 llapi_file_get_stripe - Use llapi_file_get_stripe to get striping information for a file or directory on a Lustre file system. -
- Synopsis - #include <sys/types.h> +
+
+ 34.2 llapi_file_get_stripe + Use llapi_file_get_stripe to get striping information for a file or directory on a Lustre file system. +
+ Synopsis + #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <liblustre.h> @@ -126,13 +140,12 @@ return -1; #include <lustre/liblustreapi.h> #include <lustre/lustre_user.h> -int llapi_file_get_stripe(const char *path, void *lum); - -
-
- Description - The llapi_file_get_stripe() function returns striping information for a file or directory path in lum (which should point to a large enough memory region) in one of the following formats: - struct lov_user_md_v1 { +int llapi_file_get_stripe(const char *path, void *lum); +
+
+ Description + The llapi_file_get_stripe() function returns striping information for a file or directory path in lum (which should point to a large enough memory region) in one of the following formats: + struct lov_user_md_v1 { __u32 lmm_magic; __u32 lmm_pattern; __u64 lmm_object_id; @@ -152,121 +165,207 @@ __u16 lmm_stripe_count; __u16 lmm_stripe_offset; char lmm_pool_name[LOV_MAXPOOLNAME]; struct lov_user_ost_data_v1 lmm_objects[0]; -} __attribute__((packed)); - - - - - - - - Option - Description - - - - - lmm_magic - Specifies the format of the returned striping information. LOV_MAGIC_V1 isused for lov_user_md_v1. LOV_MAGIC_V3 is used for lov_user_md_v3. - - - lmm_pattern - Holds the striping pattern. Only LOV_PATTERN_RAID0 is possible in this Lustre version. - - - lmm_object_id - Holds the MDS object ID. - - - lmm_object_gr - Holds the MDS object group. - - - lmm_stripe_size - Holds the stripe size in bytes. - - - lmm_stripe_count - Holds the number of OSTs over which the file is striped. - - - lmm_stripe_offset - Holds the OST index from which the file starts. - - - lmm_pool_name - Holds the OST pool name to which the file belongs. - - - lmm_objects - An array of lmm_stripe_count members containing per OST file information inthe following format:struct lov_user_ost_data_v1 {__u64 l_object_id;__u64 l_object_seq;__u32 l_ost_gen;__u32 l_ost_idx;} __attribute__((packed)); - - - l_object_id - Holds the OST's object ID. - - - l_object_seq - Holds the OST's object group. - - - l_ost_gen - Holds the OST's index generation. - - - l_ost_idx - Holds the OST's index in LOV. - - - - -
-
- Return Values - llapi_file_get_stripe() returns: - 0 On success - != 0 On failure, errno is set appropriately -
-
- Errors - - - - - - - Errors - Description - - - - - ENOMEM - Failed to allocate memory - - - ENAMETOOLONG - Path was too long - - - ENOENT - Path does not point to a file or directory - - - ENOTTY - Path does not point to a Lustre file system - - - EFAULT - Memory region pointed by lum is not properly mapped - - - - -
-
- Examples - #include <sys/vfs.h> +} __attribute__((packed)); + + + + + + + + Option + + + Description + + + + + + + lmm_magic + + + Specifies the format of the returned striping information. LOV_MAGIC_V1 isused for lov_user_md_v1. LOV_MAGIC_V3 is used for lov_user_md_v3. + + + + + lmm_pattern + + + Holds the striping pattern. Only LOV_PATTERN_RAID0 is possible in this Lustre version. + + + + + lmm_object_id + + + Holds the MDS object ID. + + + + + lmm_object_gr + + + Holds the MDS object group. + + + + + lmm_stripe_size + + + Holds the stripe size in bytes. + + + + + lmm_stripe_count + + + Holds the number of OSTs over which the file is striped. + + + + + lmm_stripe_offset + + + Holds the OST index from which the file starts. + + + + + lmm_pool_name + + + Holds the OST pool name to which the file belongs. + + + + + lmm_objects + + + An array of lmm_stripe_count members containing per OST file information in + the following format: + struct lov_user_ost_data_v1 { + __u64 l_object_id; + __u64 l_object_seq; + __u32 l_ost_gen; + __u32 l_ost_idx; + } __attribute__((packed)); + + + + + l_object_id + + + Holds the OST's object ID. + + + + + l_object_seq + + + Holds the OST's object group. + + + + + l_ost_gen + + + Holds the OST's index generation. + + + + + l_ost_idx + + + Holds the OST's index in LOV. + + + + + +
+
+ Return Values + llapi_file_get_stripe() returns: + 0 On success + != 0 On failure, errno is set appropriately +
+
+
+ Errors
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ Errors
+ 
+ 
+ Description
+ 
+ 
+ 
+ 
+ 
+ 
+ ENOMEM
+ 
+ 
+ Failed to allocate memory
+ 
+ 
+ 
+ 
+ ENAMETOOLONG
+ 
+ 
+ Path was too long
+ 
+ 
+ 
+ 
+ ENOENT
+ 
+ 
+ Path does not point to a file or directory
+ 
+ 
+ 
+ 
+ ENOTTY
+ 
+ 
+ Path does not point to a Lustre file system
+ 
+ 
+ 
+ 
+ EFAULT
+ 
+ 
+ Memory region pointed to by lum is not properly mapped
+ 
+ 
+ 
+ 
+ 
+
+
+ Examples + #include <sys/vfs.h> #include <liblustre.h> #include <lnet/lnetctl.h> #include <obd.h> @@ -313,16 +412,15 @@ cleanup: if (lum_file != NULL) free(lum_file); return rc; -} - -
+}
-
- 34.3 llapi_file_open - The llapi_file_open command opens (or creates) a file or device on a Lustre filesystem. -
- Synopsis - #include <sys/types.h> +
+
+ 34.3 <literal>llapi_file_open</literal> + The llapi_file_open command opens (or creates) a file or device on a Lustre filesystem. +
+ Synopsis + #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <liblustre.h> @@ -336,92 +434,140 @@ int llapi_file_create(const char *name, unsigned long long int stripe_offset, int stripe_count, int stripe_pattern); -
-
- Description - The llapi_file_create() call is equivalent to the llapi_file_open call with flags equal to O_CREAT|O_WRONLY and mode equal to 0644, followed by file close. - llapi_file_open() opens a file with a given name on a Lustre filesystem. - - - - - - - Option - Description - - - - - flags - Can be a combination of O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC, O_APPEND, O_NONBLOCK, O_SYNC, FASYNC, O_DIRECT, O_LARGEFILE, O_DIRECTORY, O_NOFOLLOW, O_NOATIME. - - - mode - Specifies the permission bits to be used for a new file when O_CREAT is used. - - - stripe_size - Specifies stripe size (in bytes). Should be multiple of 64 KB, not exceeding 4 GB. - - - stripe_offset - Specifies an OST index from which the file should start. The default value is -1. - - - stripe_count - Specifies the number of OSTs to stripe the file across. The default value is -1. - - - stripe_pattern - Specifies the striping pattern. In this version of Lustre, only LOV_PATTERN_RAID0 is available. The default value is 0. - - - - -
-
- Return Values - llapi_file_open() and llapi_file_create() return: - >=0 On success, for llapi_file_open the return value is a file descriptor - <0 On failure, the absolute value is an error code -
-
- Errors - - - - - - - Errors - Description - - - - - EINVAL - stripe_size or stripe_offset or stripe_count or stripe_pattern is invalid. - - - EEXIST - Striping information has already been set and cannot be altered; name already exists. - - - EALREADY - Striping information has already been set and cannot be altered - - - ENOTTY - name may not point to a Lustre filesystem. - - - - -
-
- Example - #include <sys/types.h> +
+
+ Description + The llapi_file_create() call is equivalent to the llapi_file_open call with flags equal to O_CREAT|O_WRONLY and mode equal to 0644, followed by file close. + llapi_file_open() opens a file with a given name on a Lustre filesystem. + + + + + + + + Option + + + Description + + + + + + + flags + + + Can be a combination of O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC, O_APPEND, O_NONBLOCK, O_SYNC, FASYNC, O_DIRECT, O_LARGEFILE, O_DIRECTORY, O_NOFOLLOW, O_NOATIME. + + + + + mode + + + Specifies the permission bits to be used for a new file when O_CREAT is used. + + + + + stripe_size + + + Specifies stripe size (in bytes). Should be multiple of 64 KB, not exceeding 4 GB. + + + + + stripe_offset + + + Specifies an OST index from which the file should start. The default value is -1. + + + + + stripe_count + + + Specifies the number of OSTs to stripe the file across. The default value is -1. + + + + + stripe_pattern + + + Specifies the striping pattern. In this version of Lustre, only LOV_PATTERN_RAID0 is available. The default value is 0. + + + + + +
+
+ Return Values + llapi_file_open() and llapi_file_create() return: + >=0 On success, for llapi_file_open the return value is a file descriptor + <0 On failure, the absolute value is an error code +
+
+
+ Errors
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ Errors
+ 
+ 
+ Description
+ 
+ 
+ 
+ 
+ 
+ 
+ EINVAL
+ 
+ 
+ stripe_size or stripe_offset or stripe_count or stripe_pattern is invalid.
+ 
+ 
+ 
+ 
+ EEXIST
+ 
+ 
+ Striping information has already been set and cannot be altered; name already exists.
+ 
+ 
+ 
+ 
+ EALREADY
+ 
+ 
+ Striping information has already been set and cannot be altered.
+ 
+ 
+ 
+ 
+ ENOTTY
+ 
+ 
+ name may not point to a Lustre filesystem.
+ 
+ 
+ 
+ 
+ 
+
+
+ Example + #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <errno.h> @@ -437,23 +583,21 @@ int main(int argc, char *argv[]) return -1; rc = llapi_file_create(argv[1], 1048576, 0, 2, LOV_PATTERN_RAID0); if (rc < 0) { - fprintf(stderr, "file creation has failed, %s\n", strerror\ -(-rc)); + fprintf(stderr, "file creation has failed, %s\n", strerror(-rc)); return -1; } printf("%s with stripe size 1048576, striped across 2 OSTs," " has been created!\n", argv[1]); return 0; -} - -
+}
-
- 34.4 llapi_quotactl - Use llapi_quotactl to manipulate disk quotas on a Lustre file system. -
- Synopsis - #include <liblustre.h> +
+
+ 34.4 <literal>llapi_quotactl</literal> + Use llapi_quotactl to manipulate disk quotas on a Lustre file system. +
+ Synopsis + #include <liblustre.h> #include <lustre/lustre_idl.h> #include <lustre/liblustreapi.h> #include <lustre/lustre_user.h> @@ -489,137 +633,195 @@ struct obd_dqinfo { }; struct obd_uuid { char uuid[40]; -}; - -
-
- Description - The llapi_quotactl() command manipulates disk quotas on a Lustre file system mount. qc_cmd indicates a command to be applied to UID qc_id or GID qc_id. - - - - - - - Option - Description - - - - - LUSTRE_Q_QUOTAON - Turns on quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). The quota files must exist. They are normally created with the llapi_quotacheck call. This call is restricted to the super user privilege. - - - LUSTRE_Q_QUOTAOFF - Turns off quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). This call is restricted to the super user privilege. - - - LUSTRE_Q_GETQUOTA - Gets disk quota limits and current usage for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. uuid may be filled with OBD UUID string to query quota information from a specific node. dqb_valid may be set nonzero to query information only from MDS. If uuid is an empty string and dqb_valid is zero then cluster-wide limits and usage are returned. On return, obd_dqblk contains the requested information (block limits unit is kilobyte). Quotas must be turned on before using this command. - - - LUSTRE_Q_SETQUOTA - Sets disk quota limits for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. dqb_valid must be set to QIF_ILIMITS, QIF_BLIMITS or QIF_LIMITS (both inode limits and block limits) dependent on updating limits. obd_dqblk must be filled with limits values (as set in dqb_valid, block limits unit is kilobyte). Quotas must be turned on before using this command. - - - LUSTRE_Q_GETINFO - Gets information about quotas. qc_type is either USRQUOTA or GRPQUOTA. On return, dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in seconds), dqi_flags is not used by the current Lustre version. - - - LUSTRE_Q_SETINFO - Sets quota information (like grace times). qc_type is either USRQUOTA or GRPQUOTA. 
dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in seconds), dqi_flags is not used by the current Lustre version and must be zeroed. - - - - -
-
- Return Values - llapi_quotactl() returns: - 0 On success - -1 On failure and sets error number (errno) to indicate the error -
-
- Errors - llapi_quotactl errors are described below. - - - - - - - Errors - Description - - - - - EFAULT - qctl is invalid. - - - ENOSYS - Kernel or Lustre modules have not been compiled with the QUOTA option. - - - ENOMEM - Insufficient memory to complete operation. - - - ENOTTY - qc_cmd is invalid. - - - EBUSY - Cannot process during quotacheck. - - - ENOENT - uuid does not correspond to OBD or mnt does not exist. - - - EPERM - The call is privileged and the caller is not the super user. - - - ESRCH - No disk quota is found for the indicated user. Quotas have not been turned on for this file system. - - - - -
+}; +
+
+ Description + The llapi_quotactl() command manipulates disk quotas on a Lustre file system mount. qc_cmd indicates a command to be applied to UID qc_id or GID qc_id. + + + + + + + + Option + + + Description + + + + + + + LUSTRE_Q_QUOTAON + + + Turns on quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). The quota files must exist. They are normally created with the llapi_quotacheck call. This call is restricted to the super user privilege. + + + + + LUSTRE_Q_QUOTAOFF + + + Turns off quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). This call is restricted to the super user privilege. + + + + + LUSTRE_Q_GETQUOTA + + + Gets disk quota limits and current usage for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. uuid may be filled with OBD UUID string to query quota information from a specific node. dqb_valid may be set nonzero to query information only from MDS. If uuid is an empty string and dqb_valid is zero then cluster-wide limits and usage are returned. On return, obd_dqblk contains the requested information (block limits unit is kilobyte). Quotas must be turned on before using this command. + + + + + LUSTRE_Q_SETQUOTA + + + Sets disk quota limits for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. dqb_valid must be set to QIF_ILIMITS, QIF_BLIMITS or QIF_LIMITS (both inode limits and block limits) dependent on updating limits. obd_dqblk must be filled with limits values (as set in dqb_valid, block limits unit is kilobyte). Quotas must be turned on before using this command. + + + + + LUSTRE_Q_GETINFO + + + Gets information about quotas. qc_type is either USRQUOTA or GRPQUOTA. On return, dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in seconds), dqi_flags is not used by the current Lustre version. + + + + + LUSTRE_Q_SETINFO + + + Sets quota information (like grace times). qc_type is either USRQUOTA or GRPQUOTA. 
dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in seconds), dqi_flags is not used by the current Lustre version and must be zeroed. + + + + + +
+
+
+ Return Values
+ llapi_quotactl() returns:
+ 0 On success
+ -1 On failure and sets error number (errno) to indicate the error
+
+ Errors + llapi_quotactl errors are described below. + + + + + + + + Errors + + + Description + + + + + + + EFAULT + + + qctl is invalid. + + + + + ENOSYS + + + Kernel or Lustre modules have not been compiled with the QUOTA option. + + + + + ENOMEM + + + Insufficient memory to complete operation. + + + + + ENOTTY + + + qc_cmd is invalid. + + + + + EBUSY + + + Cannot process during quotacheck. + + + + + ENOENT + + + uuid does not correspond to OBD or mnt does not exist. + + + + + EPERM + + + The call is privileged and the caller is not the super user. + + + + + ESRCH + + + No disk quota is found for the indicated user. Quotas have not been turned on for this file system. + + + + +
-
- 34.5 llapi_path2fid - Use llapi_path2fid to get the FID from the pathname. -
- Synopsis - #include <lustre/liblustreapi.h> +
+
+ 34.5 <literal>llapi_path2fid</literal> + Use llapi_path2fid to get the FID from the pathname. +
+ Synopsis + #include <lustre/liblustreapi.h> #include <lustre/lustre_user.h> -int llapi_path2fid(const char *path, unsigned long long *seq, unsigned long\ - *oid, unsigned long *ver) - -
-
- Description - The llapi_path2fid function returns the FID (sequence : object ID : version) for the pathname. -
-
- Return Values - llapi_path2fid returns: - 0 On success - non-zero value On failure -
+int llapi_path2fid(const char *path, unsigned long long *seq, unsigned long *oid, unsigned long *ver) +
+
+ Description + The llapi_path2fid function returns the FID (sequence : object ID : version) for the pathname. +
+
+ Return Values + llapi_path2fid returns: + 0 On success + non-zero value On failure
-
- 34.6 Example Using the llapi Library - Use llapi_file_create to set Lustre properties for a new file. For a synopsis and description of llapi_file_create and examples of how to use it, see . - You can set striping from inside programs like ioctl. To compile the sample program, you need to download libtest.c and liblustreapi.c files from the Lustre source tree. - A simple C program to demonstrate striping API - libtest.c - /* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*- +
+
+ 34.6 Example Using the <literal>llapi</literal> Library + Use llapi_file_create to set Lustre properties for a new file. For a synopsis and description of llapi_file_create and examples of how to use it, see . + You can set striping from inside programs like ioctl. To compile the sample program, you need to download libtest.c and liblustreapi.c files from the Lustre source tree. + A simple C program to demonstrate striping API - libtest.c + /* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*- * vim:expandtab:shiftwidth=8:tabstop=8: * * lustredemo - simple code examples of liblustreapi functions @@ -637,34 +839,34 @@ int llapi_path2fid(const char *path, unsigned long long *seq, unsigned long\ #include <lustre/liblustreapi.h> #include <lustre/lustre_user.h> #define MAX_OSTS 1024 -#define LOV_EA_SIZE(lum, num) (sizeof(*lum) + num * sizeof(*lum->lmm_objects\ -)) +#define LOV_EA_SIZE(lum, num) (sizeof(*lum) + num * sizeof(*lum->lmm_objects)) #define LOV_EA_MAX(lum) LOV_EA_SIZE(lum, MAX_OSTS) + + + + /* This program provides crude examples of using the liblustre API functions */ /* Change these definitions to suit */ - -#define TESTDIR "/tmp" /* R\ -esults directory */ -#define TESTFILE "lustre_dummy" \ - /* Name for the file we create/destroy */ -#define FILESIZE 262144 \ -/* Size of the file in words */ -#define DUMWORD "DEADBEEF" \ - /* Dummy word used to fill files */ -#define MY_STRIPE_WIDTH 2 \ -/* Set this to the number of OST required */ + + + + +#define TESTDIR "/tmp" /* Results directory */ +#define TESTFILE "lustre_dummy" /* Name for the file we create/destroy */ +#define FILESIZE 262144 /* Size of the file in words */ +#define DUMWORD "DEADBEEF" /* Dummy word used to fill files */ +#define MY_STRIPE_WIDTH 2 /* Set this to the number of OST required */ #define MY_LUSTRE_DIR "/mnt/lustre/ftest" int close_file(int fd) { if (close(fd) < 0) { - fprintf(stderr, "File close failed: %d (%s)\n", errno, strerror(er\ -rno)); + fprintf(stderr, "File close 
failed: %d (%s)\n", errno, strerror(errno)); return -1; } return 0; @@ -686,14 +888,10 @@ int write_file(int fd) int open_stripe_file() { char *tfile = TESTFILE; - int stripe_size = 65536; \ - /* System default is 4M */ - int stripe_offset = -1; \ - /* Start at default */ - int stripe_count = MY_STRIPE_WIDTH; \ - /*Single stripe for this demo*/ - int stripe_pattern = 0; \ - /* only RAID 0 at this time */ + int stripe_size = 65536; /* System default is 4M */ + int stripe_offset = -1; /* Start at default */ + int stripe_count = MY_STRIPE_WIDTH; /*Single stripe for this demo*/ + int stripe_pattern = 0; /* only RAID 0 at this time */ int rc, fd; /* */ @@ -703,15 +901,13 @@ stripe_size,stripe_offset,stripe_count,stripe_pattern); We borrow an error message from sanity.c */ if (rc) { - fprintf(stderr,"llapi_file_create failed: %d (%s) \n", rc, st\ -rerror(-rc)); + fprintf(stderr,"llapi_file_create failed: %d (%s) \n", rc, strerror(-rc)); return -1; } /* llapi_file_create closes the file descriptor, we must re-open */ fd = open(tfile, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644); if (fd < 0) { - fprintf(stderr, "Can't open %s file: %d (%s)\n", tfile, errno\ -, strerror(errno)); + fprintf(stderr, "Can't open %s file: %d (%s)\n", tfile, errno, strerror(errno)); return -1; } return fd; @@ -720,15 +916,13 @@ rerror(-rc)); /* output a list of uuids for this file */ int get_my_uuids(int fd) { - struct obd_uuid uuids[1024], *uuidp; \ - /* Output var */ + struct obd_uuid uuids[1024], *uuidp; /* Output var */ int obdcount = 1024; int rc,i; rc = llapi_lov_get_uuids(fd, uuids, &obdcount); if (rc != 0) { - fprintf(stderr, "get uuids failed: %d (%s)\n",errno, strerror(errn\ -o)); + fprintf(stderr, "get uuids failed: %d (%s)\n",errno, strerror(errno)); } printf("This file system has %d obds\n", obdcount); for (i = 0, uuidp = uuids; i < obdcount; i++, uuidp++) { @@ -753,8 +947,7 @@ int get_file_info(char *path) rc = llapi_file_get_stripe(path, lump); if (rc != 0) { - fprintf(stderr, 
"get_stripe failed: %d (%s)\n",errno, strerror(err\ -no)); + fprintf(stderr, "get_stripe failed: %d (%s)\n",errno, strerror(errno)); return -1; } @@ -766,8 +959,7 @@ no)); printf("Lov stripe count %hu\n", lump->lmm_stripe_count); printf("Lov stripe offset %u\n", lump->lmm_stripe_offset); for (i = 0; i < lump->lmm_stripe_count; i++) { - printf("Object index %d Objid %llu\n", lump->lmm_objects[i].l_ost_i\ -dx, lump->lmm_objects[i].l_object_id); + printf("Object index %d Objid %llu\n", lump->lmm_objects[i].l_ost_idx, lump->lmm_objects[i].l_object_id); } @@ -845,8 +1037,8 @@ int main() exit(rc); } - Makefile for sample application: - + Makefile for sample application: + gcc -g -O2 -Wall -o lustredemo libtest.c -llustreapi clean: rm -f core lustredemo *.o @@ -856,13 +1048,30 @@ rm -f /mnt/lustre/ftest/lustredemo rm -f /mnt/lustre/ftest/lustre_dummy cp lustredemo /mnt/lustre/ftest/ -
- See Also - - llapi_file_create, - llapi_file_get_stripe, - llapi_file_open, - llapi_quotactl -
+
+ See Also + + + + + + + + + + + + + + + + + + + + + +
+
diff --git a/UnderstandingLustre.xml b/UnderstandingLustre.xml index 8ea7e01..764b25b 100644 --- a/UnderstandingLustre.xml +++ b/UnderstandingLustre.xml @@ -1,8 +1,7 @@ - - Understanding Lustre - + Understanding Lustre + This chapter describes the Lustre architecture and features of Lustre. It includes the following sections: @@ -22,14 +21,14 @@
- 1.1 <anchor xml:id="dbdoclet.50438250_92658" xreflabel=""/>What Lustre Is (and What It Isn't) + What Lustre Is (and What It Isn't) Lustre is a storage architecture for clusters. The central component of the Lustre architecture is the Lustre file system, which is supported on the Linux operating system and provides a POSIX-compliant UNIX file system interface. The Lustre storage architecture is used for many different kinds of clusters. It is best known for powering seven of the ten largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use Lustre as a site-wide global file system, serving dozens of clusters. The ability of a Lustre file system to scale capacity and performance for any need reduces the need to deploy many separate file systems, such as one for each compute cluster. Storage management is simplified by avoiding the need to copy data between compute clusters. In addition to aggregating storage capacity of many servers, the I/O throughput is also aggregated and scales with additional servers. Moreover, throughput and/or capacity can be easily increased by adding servers dynamically. While Lustre can function in many work environments, it is not necessarily the best choice for all applications. It is best suited for uses that exceed the capacity that a single server can provide, though in some use cases Lustre can perform better with a single server than other filesystems due to its strong locking and data coherency. Lustre is currently not particularly well suited for "peer-to-peer" usage models where there are clients and servers running on the same node, each sharing a small amount of storage, due to the lack of Lustre-level data replication. In such uses, if one client/server fails, then the data stored on that node will not be accessible until the node is restarted.
- 1.1.1 Lustre <anchor xml:id="dbdoclet.50438250_marker-1293792" xreflabel=""/>Features
+ Lustre Features
 Lustre runs on a variety of vendors' kernels. For more details, see Lustre Release Information on the Whamcloud wiki.
 A Lustre installation can be scaled up or down with respect to the number of client nodes, disk storage and bandwidth. Scalability and performance are dependent on available disk and network bandwidth and the processing power of the servers in the system. Lustre can be deployed in a wide variety of configurations that can be scaled well beyond the size and performance observed in production systems to date.
 shows the practical range of scalability and performance characteristics of the Lustre file system and some test results in production systems.
@@ -43,9 +42,7 @@
 
 
- 
- 
- 
+ 
 
 Feature
@@ -158,7 +155,9 @@
 
 Other Lustre features are:
 
 
- Performance-enhanced ext4 file system: Lustre uses a modified version of the ext4 journaling file system to store data and metadata. This version, called ldiskfs, has been enhanced to improve performance and provide additional functionality needed by Lustre.
+ Performance-enhanced ext4 file system: Lustre uses a modified version of the ext4 journaling file system to store data and metadata. This version, called
+ ldiskfs
+ , has been enhanced to improve performance and provide additional functionality needed by Lustre.
 
 
 POSIX compliance : The full POSIX test suite passes with limited exceptions on Lustre clients. In a cluster, most operations are atomic so that clients never see stale data or metadata. Lustre supports mmap() file I/O.
@@ -215,7 +214,7 @@
- 1.2 <anchor xml:id="dbdoclet.50438250_17402" xreflabel=""/>Lustre Components + Lustre Components An installation of the Lustre software includes a management server (MGS) and one or more Lustre file systems interconnected with Lustre networking (LNET). A basic configuration of Lustre components is shown in .
@@ -230,12 +229,12 @@
- 1.2.1 Management Server (MGS) + Management Server (MGS) The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components. Each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information. It is preferable that the MGS have its own storage space so that it can be managed independently. However, the MGS can be co-located and share storage space with an MDS as shown in .
- 1.2.2 Lustre File System Components + Lustre File System Components Each Lustre file system consists of the following components: @@ -266,9 +265,7 @@ - - - + Required attached storage @@ -318,11 +315,11 @@ For additional hardware requirements and considerations, see .
- 1.2.3 Lustre Networking (LNET)
+ Lustre Networking (LNET)
 Lustre Networking (LNET) is a custom networking API that provides the communication infrastructure that handles metadata and file I/O data for the Lustre file system servers and clients. For more information about LNET, see .
- 1.2.4 Lustre Cluster + Lustre Cluster At scale, the Lustre cluster can include hundreds of OSSs and thousands of clients (see ). More than one type of network can be used in a Lustre cluster. Shared storage between OSSs enables failover capability. For more details about OSS failover, see .
Lustre cluster at scale @@ -338,7 +335,7 @@
- 1.3 <anchor xml:id="dbdoclet.50438250_38077" xreflabel=""/>Lustre Storage and I/O + Lustre Storage and I/O In a Lustre file system, a file stored on the MDT points to one or more objects associated with a data file, as shown in . Each object contains data and is stored on an OST. If the MDT file points to one object, all the file data is stored in that object. If the file points to more than one object, the file data is 'striped' across the objects (using RAID 0) and each object is stored on a different OST. (For more information about how striping is implemented in Lustre, see ) In , each filename points to an inode. The inode contains all of the file attributes, such as owner, access permissions, Lustre striping layout, access time, and access control. Multiple filenames may point to the same inode.
@@ -368,16 +365,16 @@
 The available bandwidth of a Lustre file system is determined as follows:
 
 
- The network bandwidth equals the aggregated bandwidth of the OSSs to the targets.
+ The network bandwidth equals the aggregated bandwidth of the OSSs to the targets.
 
 
- The disk bandwidth equals the sum of the disk bandwidths of the storage targets (OSTs) up to the limit of the network bandwidth.
+ The disk bandwidth equals the sum of the disk bandwidths of the storage targets (OSTs) up to the limit of the network bandwidth.
 
 
- The aggregate bandwidth equals the minimium of the disk bandwidth and the network bandwidth.
+ The aggregate bandwidth equals the minimum of the disk bandwidth and the network bandwidth.
 
 
- The available file system space equals the sum of the available space of all the OSTs.
+ The available file system space equals the sum of the available space of all the OSTs.
@@ -396,7 +393,7 @@ - File striping pattern across three OSTs for three different data files. The file is sparse and missing chunk 6. + File striping pattern across three OSTs for three different data files. The file is sparse and missing chunk 6.
diff --git a/UnderstandingLustreNetworking.xml b/UnderstandingLustreNetworking.xml index 7671561..b6341e8 100644 --- a/UnderstandingLustreNetworking.xml +++ b/UnderstandingLustreNetworking.xml @@ -1,9 +1,6 @@ - - - - Understanding Lustre Networking (LNET) - + + Understanding Lustre Networking (LNET) This chapter introduces Lustre Networking (LNET) and includes the following sections: @@ -23,7 +20,7 @@
- 2.1 Introducing LNET + Introducing LNET In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. LNET supports many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote Direct Memory Access (RDMA) is permitted when supported by underlying networks using the appropriate Lustre network driver (LND). High availability and recovery features enable transparent recovery in conjunction with failover servers. An LND is a pluggable driver that provides support for a particular network type. LNDs are loaded into the driver stack, with one LND for each network type in use. @@ -31,11 +28,11 @@ For information about administering LNET, see .
- 2.2 Key Features of LNET + Key Features of LNET Key features of LNET include: - RDMA, when supported by underlying networks such as InfiniBand or Myrinet MX + RDMA, when supported by underlying networks such as InfiniBand or Myrinet MX Support for many commonly-used network types such as InfiniBand and TCP/IP @@ -51,7 +48,7 @@ Lustre can use bonded networks, such as bonded Ethernet networks, when the underlying network technology supports bonding. For more information, see .
- 2.3 Supported Network Types + Supported Network Types LNET includes LNDs to support many network types including: @@ -67,10 +64,10 @@ Myrinet: MX - RapidArray: ra + RapidArray: ra - Quadrics: Elan + Quadrics: Elan
-- 1.8.3.1