- <section remap="h3">
- <title><indexterm>
- <primary>configuring</primary>
- <secondary>portals</secondary>
- </indexterm>Portals LND Linux (ptllnd)</title>
- <para>The Portals LND Linux (<literal>ptllnd</literal>) can be used as an interface layer to communicate with Sandia Portals networking devices. This version is intended to work on Cray XT3 Linux nodes that use Cray Portals as a network transport.</para>
- <para><emphasis role="bold">Message Buffers</emphasis></para>
- <para>When <literal>ptllnd</literal> starts up, it allocates and posts sufficient message buffers to allow all expected peers (set by <literal>concurrent_peers</literal>) to send one unsolicited message. The first message that a peer actually sends is a (so-called) "<literal>HELLO</literal>" message, used to negotiate how much additional buffering to set up (typically 8 messages). If 10000 peers actually exist, then enough buffers are posted for 80000 messages.</para>
- <para>The maximum message size is set by the <literal>max_msg_size</literal> module parameter (default value is 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint, payload data is sent in the message itself. Above this breakpoint, a buffer descriptor is sent and the receiver gets the actual payload.</para>
- <para>The buffer size is set by the <literal>rxb_npages</literal> module parameter (default value is <literal>1</literal>). The default conservatively avoids allocation problems due to kernel memory fragmentation. However, increasing this value to 2 is probably not risky.</para>
- <para>The <literal>ptllnd</literal> also keeps an additional <literal>rxb_nspare</literal> buffers (default value is 8) posted to account for full buffers being handled.</para>
- <para>Assuming a 4K page size with 10000 peers, 1258 buffers can be expected to be posted at startup, increasing to a maximum of 10008 as peers connect. By doubling <literal>rxb_npages</literal> and halving <literal>max_msg_size</literal>, this number can be reduced by a factor of 4.</para>
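- <para>To illustrate the arithmetic (using the default values above): each 4 KB buffer holds 4096 / 512 = 8 maximum-sized messages. At startup, 10000 peers × 1 unsolicited message = 10000 messages, requiring 10000 / 8 = 1250 buffers, plus the 8 <literal>rxb_nspare</literal> buffers, for 1258 in total. Once every peer has negotiated 8 messages of buffering, 80000 / 8 = 10000 buffers are needed, plus the 8 spares, for 10008.</para>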
- <para><emphasis role="bold">ME/MD Queue Length</emphasis></para>
- <para>The <literal>ptllnd</literal> uses a single portal set by the portal module parameter (default value of 9) for both message and bulk buffers. Message buffers are always attached with <literal>PTL_INS_AFTER</literal> and match anything sent with "message" matchbits. Bulk buffers are always attached with <literal>PTL_INS_BEFORE</literal> and match only specific matchbits for that particular bulk transfer.</para>
- <para>This scheme assumes that the majority of ME/MDs posted are for "message" buffers, and that the overhead of searching through the preceding "bulk" buffers is acceptable. Since the number of "bulk" buffers posted at any time is also dependent on the bulk transfer breakpoint set by <literal>max_msg_size</literal>, this seems like an issue worth measuring at scale.</para>
- <para><emphasis role="bold">TX Descriptors</emphasis></para>
- <para>The <literal>ptllnd</literal> has a pool of so-called "tx descriptors", which it uses not only for outgoing messages, but also to hold state for bulk transfers requested by incoming messages. This pool should scale with the total number of peers.</para>
- <para>To enable the building of the Portals LND (<literal>ptllnd.ko</literal>) configure with this option:</para>
- <screen>./configure --with-portals=<replaceable>/path/to/portals/headers</replaceable></screen>
- <informaltable frame="all">
- <tgroup cols="2">
- <colspec colname="c1" colwidth="50*"/>
- <colspec colname="c2" colwidth="50*"/>
- <thead>
- <row>
- <entry>
- <para><emphasis role="bold">Variable</emphasis></para>
- </entry>
- <entry>
- <para><emphasis role="bold">Description</emphasis></para>
- </entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>
- <para> <literal>ntx</literal></para>
- <para> <literal>(256)</literal></para>
- </entry>
- <entry>
- <para>Total number of messaging descriptors.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>concurrent_peers</literal></para>
- <para> <literal>(1152)</literal></para>
- </entry>
- <entry>
- <para>Maximum number of concurrent peers. Peers that attempt to connect beyond the maximum are not allowed.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>peer_hash_table_size</literal></para>
- <para> <literal>(101)</literal></para>
- </entry>
- <entry>
- <para>Number of hash table slots for the peers. This number should scale with <literal>concurrent_peers</literal> and should be prime to ensure that the peer hash table is populated evenly. It is advisable to increase this value to 1001 for ~10000 peers.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>cksum</literal></para>
- <para> <literal>(0)</literal></para>
- </entry>
- <entry>
- <para>Set to non-zero to enable message (not RDMA) checksums for outgoing packets. Incoming packets are always checksummed if necessary, independent of this value.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>timeout</literal></para>
- <para> <literal>(50)</literal></para>
- </entry>
- <entry>
- <para>Amount of time (in seconds) that a request can linger in a peer's active queue before the peer is considered dead.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>portal</literal></para>
- <para> <literal>(9)</literal></para>
- </entry>
- <entry>
- <para>Portal ID to use for the <literal>ptllnd</literal> traffic.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>rxb_npages</literal></para>
- <para> <literal>(64 * #cpus)</literal></para>
- </entry>
- <entry>
- <para>Number of pages in an RX buffer.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>credits</literal></para>
- <para> <literal>(128)</literal></para>
- </entry>
- <entry>
- <para>Maximum total number of concurrent sends that are outstanding at a given time.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>peercredits</literal></para>
- <para> <literal>(8)</literal></para>
- </entry>
- <entry>
- <para>Maximum number of concurrent sends that are outstanding to a single peer at a given time.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>max_msg_size</literal></para>
- <para> <literal>(512)</literal></para>
- </entry>
- <entry>
- <para>Maximum immediate message size. This MUST be the same on all nodes in a cluster. A peer that connects with a different <literal>max_msg_size</literal> value will be rejected.</para>
- </entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable>
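- <para>For example, the sizing advice above could be applied with a <literal>modprobe</literal> configuration line such as the following (a hypothetical sketch for a ~10000-peer cluster; the values are illustrative, not recommendations):</para>
- <screen>options ptllnd concurrent_peers=10000 peer_hash_table_size=1001</screen>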
- </section>
- <section remap="h3">
- <title><indexterm><primary>configuring</primary><secondary>MX LND</secondary></indexterm>MX LND</title>
- <para><literal>MXLND</literal> supports a number of load-time parameters using Linux's module parameter system. The following variables are available:</para>
- <informaltable frame="all">
- <tgroup cols="2">
- <colspec colname="c1" colwidth="50*"/>
- <colspec colname="c2" colwidth="50*"/>
- <thead>
- <row>
- <entry>
- <para><emphasis role="bold">Variable</emphasis></para>
- </entry>
- <entry>
- <para><emphasis role="bold">Description</emphasis></para>
- </entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>
- <para> <literal>n_waitd</literal></para>
- </entry>
- <entry>
- <para>Number of completion daemons.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>max_peers</literal></para>
- </entry>
- <entry>
- <para>Maximum number of peers that may connect.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>cksum</literal></para>
- </entry>
- <entry>
- <para>Enables small message (below 4 KB) checksums if set to a non-zero value.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>ntx</literal></para>
- </entry>
- <entry>
- <para>Number of total tx message descriptors.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>credits</literal></para>
- </entry>
- <entry>
- <para>Number of concurrent sends to a single peer.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>board</literal></para>
- </entry>
- <entry>
- <para>Index value of the Myrinet board (NIC).</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>ep_id</literal></para>
- </entry>
- <entry>
- <para>MX endpoint ID.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>polling</literal></para>
- </entry>
- <entry>
- <para>Use zero (0) to block (wait). A value greater than 0 will poll that many times before blocking.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para> <literal>hosts</literal></para>
- </entry>
- <entry>
- <para>IP-to-hostname resolution file.</para>
- </entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable>
- <para>Of the described variables, only <literal>hosts</literal> is required. It must be the absolute path to the MXLND hosts file.</para>
- <para>For example:</para>
- <screen>options kmxlnd hosts=/etc/hosts.mxlnd</screen>
- <para>The file format for the hosts file is:</para>
- <screen>IP HOST BOARD EP_ID</screen>
- <para>The values must be space and/or tab separated where:</para>
- <para><literal>IP</literal> is a valid IPv4 address</para>
- <para><literal>HOST</literal> is the name returned by <literal>`hostname`</literal> on that machine</para>
- <para><literal>BOARD</literal> is the index of the Myricom NIC (0 for the first card, etc.)</para>
- <para><literal>EP_ID</literal> is the MX endpoint ID</para>
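- <para>A minimal example hosts file (the addresses and hostnames are, of course, site-specific) for two machines, each using the first Myricom NIC and MX endpoint ID 3, might look like this:</para>
- <screen>192.168.0.10   oss1      0   3
- 192.168.0.20   client1   0   3</screen>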
- <para>To obtain the optimal performance for your platform, you may want to vary the remaining options.</para>
- <para><literal>n_waitd(1)</literal> sets the number of threads that process completed MX requests (sends and receives).</para>
- <para><literal>max_peers(1024)</literal> tells MXLND the upper limit of machines that it will need to communicate with. This affects how many receives it will pre-post; each receive uses one page of memory. Ideally, on clients, this value will equal the total number of Lustre servers (MDS and OSS). On servers, it needs to equal the total number of machines in the storage system.</para>
- <para><literal>cksum(0)</literal> turns on small message checksums. It can be used to aid in troubleshooting. MX also provides an optional checksumming feature which can check all messages (large and small). For details, see the MX README.</para>
- <para><literal>ntx(256)</literal> is the total number of sends in flight from this machine. In actuality, MXLND reserves half of them for connect messages, so set this value to twice the desired number of concurrent sends.</para>
- <para><literal>credits(8)</literal> is the number of in-flight messages for a specific peer. This is part of the flow-control system provided by the Lustre software. Increasing this value may improve performance, but it requires more memory because each message requires at least one page.</para>
- <para><literal>board(0)</literal> is the index of the Myricom NIC. Hosts can have multiple Myricom NICs and this identifies which one MXLND should use. This value must match the board value in your MXLND hosts file for this host.</para>
- <para><literal>ep_id(3)</literal> is the MX endpoint ID. Each process that uses MX is required to have at least one MX endpoint to access the MX library and NIC. The ID is a simple index starting at zero (0). This value must match the endpoint ID value in your MXLND hosts file for this host.</para>
- <para><literal>polling(0)</literal> determines whether this host will poll or block for MX request completions. A value of 0 blocks and any positive value will poll that many times before blocking. Since polling increases CPU usage, we suggest that you set this to zero (0) on the client and experiment with different values for servers.</para>
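- <para>Putting these together, a hypothetical client-side configuration (the non-default values are illustrative only, and should be tuned for your platform as described above) might be:</para>
- <screen>options kmxlnd hosts=/etc/hosts.mxlnd ntx=512 credits=16 polling=0</screen>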
- </section>