1 tbd Sun Microsystems, Inc.
3 * Support for networks:
4 socklnd - any kernel supported by Lustre,
5 qswlnd - Qsnet kernel modules 5.20 and later,
6 openiblnd - IbGold 1.8.2,
7 o2iblnd - OFED 1.1, 1.2.0, 1.2.5, and 1.3
8 viblnd - Voltaire ibhost 3.4.5 and later,
9 ciblnd - Topspin 3.2.0,
10 iiblnd - Infiniserv 3.3 + PathBits patch,
11 gmlnd - GM 2.1.22 and later,
12 mxlnd - MX 1.2.1 or later,
13 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
22 Description: fix credit flow deadlock in uptllnd
26 Description: finalize network operation in reasonable time
27 Details : conf-sanity test_32a couldn't stop ost and mds because it
28 tried to access non-existent peer and tcp connect took
29 quite long before timing out.
33 Description: Remove portals compatibility
34 Details : Remove portals compatibility, not interoperable with releases
39 Description: Continuous recovery on 33 of 413 nodes after lustre oss failure
40 Details : Lost reference on conn prevents peer from being destroyed, which
41 could prevent new peer creation if peer count has reached upper
46 Description: LNET Selftest results in Soft lockup on OSS CPU
47 Details : only hits when 8 or more o2ib clients are involved and a session is
48 torn down with 'lst end_session' without a preceding 'lst stop'.
52 Description: concurrent_sends in IB LNDs should not be changeable at run time
53 Details : concurrent_sends in IB LNDs should not be changeable at run time
57 Description: ptl_send_rpc hits LASSERT when ptl_send_buf fails
58 Details : only hits under out-of-memory situations
61 -------------------------------------------------------------------------------
64 04-26-2008 Sun Microsystems, Inc.
66 * Support for networks:
67 socklnd - any kernel supported by Lustre,
68 qswlnd - Qsnet kernel modules 5.20 and later,
69 openiblnd - IbGold 1.8.2,
70 o2iblnd - OFED 1.1 and 1.2.0, 1.2.5
71 viblnd - Voltaire ibhost 3.4.5 and later,
72 ciblnd - Topspin 3.2.0,
73 iiblnd - Infiniserv 3.3 + PathBits patch,
74 gmlnd - GM 2.1.22 and later,
75 mxlnd - MX 1.2.1 or later,
76 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
80 Description: excessive debug information removed
81 Details : excessive debug information removed
85 Description: ksocknal_create_conn() hit ASSERTION during connection race
86 Details : ksocknal_create_conn() hit ASSERTION during connection race
90 Description: ksocknal_send_hello() hit ASSERTION while connecting race
91 Details : ksocknal_send_hello() hit ASSERTION while connecting race
95 Description: o2iblnd/ptllnd credit deadlock in a routed config.
96 Details : o2iblnd/ptllnd credit deadlock in a routed config.
100 Description: High load after starting lnet
101 Details : gmlnd should sleep interruptibly in its rx thread. Otherwise,
102 the uptime utility reports a misleadingly high load.
106 Description: ksocklnd fails to establish connection if accept_port is high
107 Details : PID remapping must not be done for active (outgoing) connections
109 --------------------------------------------------------------------------------
111 2008-01-11 Sun Microsystems, Inc.
113 * Support for networks:
114 socklnd - any kernel supported by Lustre,
115 qswlnd - Qsnet kernel modules 5.20 and later,
116 openiblnd - IbGold 1.8.2,
117 o2iblnd - OFED 1.1 and 1.2.0, 1.2.5
118 viblnd - Voltaire ibhost 3.4.5 and later,
119 ciblnd - Topspin 3.2.0,
120 iiblnd - Infiniserv 3.3 + PathBits patch,
121 gmlnd - GM 2.1.22 and later,
122 mxlnd - MX 1.2.1 or later,
123 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
126 Description: liblustre network error
127 Details : liblustre clients should understand LNET_ACCEPT_PORT environment
128 variable even if they don't start lnet acceptor.
132 Description: Strange message from lnet (Ignoring prediction from the future)
133 Details : Incorrect calculation of peer's last_alive value in ksocklnd
135 --------------------------------------------------------------------------------
137 2007-12-07 Cluster File Systems, Inc. <info@clusterfs.com>
139 * Support for networks:
140 socklnd - any kernel supported by Lustre,
141 qswlnd - Qsnet kernel modules 5.20 and later,
142 openiblnd - IbGold 1.8.2,
143 o2iblnd - OFED 1.1 and 1.2.0, 1.2.5.
144 viblnd - Voltaire ibhost 3.4.5 and later,
145 ciblnd - Topspin 3.2.0,
146 iiblnd - Infiniserv 3.3 + PathBits patch,
147 gmlnd - GM 2.1.22 and later,
148 mxlnd - MX 1.2.1 or later,
149 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
153 Description: ASSERTION(me == md->md_me) failed in lnet_match_md()
157 Description: increase send queue size for ciblnd/openiblnd
161 Description: new userspace socklnd
162 Details : Old userspace tcpnal that resided in lnet/ulnds/socklnd replaced
163 with new one - usocklnd.
165 Severity : enhancement
167 Description: Console message flood
168 Details : Make cdls ratelimiting more tunable by adding several tunables in
169 procfs /proc/sys/lnet/console_{min,max}_delay_centisecs and
170 /proc/sys/lnet/console_backoff.
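A minimal sketch of how three such tunables could interact, assuming the delay between repeated console messages starts at the minimum, is multiplied by the backoff factor each time, and is capped at the maximum. The struct and function names are hypothetical, and this is not the actual cdls implementation:

```c
#include <assert.h>

/* Hypothetical rate-limit state; field names only mirror the tunables. */
struct sketch_ratelimit {
    unsigned int min_delay; /* centiseconds, cf. console_min_delay_centisecs */
    unsigned int max_delay; /* centiseconds, cf. console_max_delay_centisecs */
    unsigned int backoff;   /* multiplier, cf. console_backoff */
    unsigned int delay;     /* current delay before the next message */
};

static void sketch_ratelimit_init(struct sketch_ratelimit *rl,
                                  unsigned int min_delay,
                                  unsigned int max_delay,
                                  unsigned int backoff)
{
    assert(min_delay > 0 && min_delay <= max_delay && backoff >= 1);
    rl->min_delay = min_delay;
    rl->max_delay = max_delay;
    rl->backoff = backoff;
    rl->delay = min_delay;
}

/* Return the delay to apply now, then grow it for the next repeat. */
static unsigned int sketch_ratelimit_next(struct sketch_ratelimit *rl)
{
    unsigned int cur = rl->delay;

    /* Compare against max/backoff so the multiplication cannot overflow. */
    if (rl->delay > rl->max_delay / rl->backoff)
        rl->delay = rl->max_delay;
    else
        rl->delay *= rl->backoff;
    return cur;
}

/* Reset to the minimum once the message stops repeating. */
static void sketch_ratelimit_reset(struct sketch_ratelimit *rl)
{
    rl->delay = rl->min_delay;
}
```

With a minimum of 50, a maximum of 600 and a backoff of 2, successive calls to sketch_ratelimit_next() yield 50, 100, 200, 400, 600, 600.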
172 --------------------------------------------------------------------------------
174 2007-09-27 Cluster File Systems, Inc. <info@clusterfs.com>
176 * Support for networks:
177 socklnd - any kernel supported by Lustre,
178 qswlnd - Qsnet kernel modules 5.20 and later,
179 openiblnd - IbGold 1.8.2,
180 o2iblnd - OFED 1.1 and 1.2,
181 viblnd - Voltaire ibhost 3.4.5 and later,
182 ciblnd - Topspin 3.2.0,
183 iiblnd - Infiniserv 3.3 + PathBits patch,
184 gmlnd - GM 2.1.22 and later,
185 mxlnd - MX 1.2.1 or later,
186 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
190 Description: /proc/sys/lnet has non-sysctl entries
191 Details : Updating dump_kernel/daemon_file/debug_mb to use sysctl variables
195 Description: TOE Kernel panic by ksocklnd
196 Details : offloaded sockets provide their own implementation of sendpage,
197 can't call tcp_sendpage() directly
201 Description: kibnal_shutdown() doesn't finish; lconf --cleanup hangs
202 Details : races between lnd_shutdown and peer creation prevent
203 lnd_shutdown from finishing.
207 Description: open files rlimit 1024 reached while liblustre testing
208 Details : ulnds/socklnd must close open socket after unsuccessful
213 Description: build error
214 Details : fix typos in gmlnd, ptllnd and viblnd
216 ------------------------------------------------------------------------------
218 2007-07-30 Cluster File Systems, Inc. <info@clusterfs.com>
220 * Support for networks:
221 socklnd - kernels up to 2.6.16,
222 qswlnd - Qsnet kernel modules 5.20 and later,
223 openiblnd - IbGold 1.8.2,
224 o2iblnd - OFED 1.1 and 1.2
225 viblnd - Voltaire ibhost 3.4.5 and later,
226 ciblnd - Topspin 3.2.0,
227 iiblnd - Infiniserv 3.3 + PathBits patch,
228 gmlnd - GM 2.1.22 and later,
229 mxlnd - MX 1.2.1 or later,
230 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
232 2007-06-21 Cluster File Systems, Inc. <info@clusterfs.com>
234 * Support for networks:
235 socklnd - kernels up to 2.6.16,
236 qswlnd - Qsnet kernel modules 5.20 and later,
237 openiblnd - IbGold 1.8.2,
239 viblnd - Voltaire ibhost 3.4.5 and later,
240 ciblnd - Topspin 3.2.0,
241 iiblnd - Infiniserv 3.3 + PathBits patch,
242 gmlnd - GM 2.1.22 and later,
243 mxlnd - MX 1.2.1 or later,
244 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
248 Description: Initialize cpumask before use
252 Description: ASSERTION failures when upgrading to the patchless zero-copy
254 Details : This bug affects "rolling upgrades", causing an inconsistent
255 protocol version negotiation and subsequent assertion failure
256 during rolling upgrades after the first wave of upgrades.
260 Details : Change "dropped message" CERRORs to D_NETERROR so they are
261 logged instead of creating "console chatter" when a lustre
262 timeout races with normal RPC completion.
265 Details : lnet_clear_peer_table can wait forever if user forgets to
269 Details : libcfs_id2str should check pid against LNET_PID_ANY.
273 Description: added LNET self test
274 Details : landing b_self_test
279 Description: cfs_duration_{u,n}sec() wrongly calculate nanosecond part of
281 Details : do_div() macro is used incorrectly.
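For context, the kernel's do_div(n, base) divides n in place (n becomes the quotient) and evaluates to the remainder. Mixing up the two results is an easy mistake. The sketch below is a userspace stand-in, not the libcfs code: sketch_do_div() and sketch_duration_nsec() are hypothetical names, and the exact nature of the original bug is not stated in this entry.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for do_div(): *n becomes the quotient,
 * the return value is the remainder. */
static inline uint32_t sketch_do_div(uint64_t *n, uint32_t base)
{
    uint32_t rem;

    assert(base != 0);
    rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

struct sketch_timespec {
    uint64_t sec;
    uint32_t nsec;
};

/* Split a duration in nanoseconds into seconds plus a nanosecond part. */
static inline struct sketch_timespec sketch_duration_nsec(uint64_t total_nsec)
{
    struct sketch_timespec ts;
    uint64_t q = total_nsec;

    ts.nsec = sketch_do_div(&q, 1000000000u); /* remainder: nanosecond part */
    ts.sec = q;                               /* quotient was left in q */
    return ts;
}
```

For example, sketch_duration_nsec(2500000123ULL) gives sec == 2 and nsec == 500000123.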
283 2007-04-23 Cluster File Systems, Inc. <info@clusterfs.com>
287 Description: make panic on lbug configurable
291 Description: Add OFED1.2 support to o2iblnd
292 Details : o2iblnd depends on OFED's modules. If out-of-tree OFED modules
293 are installed (rather than the kernel's in-tree InfiniBand), there
294 could be problems when loading o2iblnd with insmod (mismatch CRC of
296 If extra Module.symvers is supported in the kernel (i.e., 2.6.17),
297 this link provides a solution:
298 https://bugs.openfabrics.org/show_bug.cgi?id=355
299 If extra Module.symvers is not supported in the kernel, we will
300 have to run the script in bug 12316 to update
301 $LINUX/module.symvers before building o2iblnd.
302 More details about this are in bug 12316.
304 ------------------------------------------------------------------------------
306 2007-04-01 Cluster File Systems, Inc. <info@clusterfs.com>
307 * version 1.4.10 / 1.6.0
308 * Support for networks:
309 socklnd - kernels up to 2.6.16,
310 qswlnd - Qsnet kernel modules 5.20 and later,
311 openiblnd - IbGold 1.8.2,
313 viblnd - Voltaire ibhost 3.4.5 and later,
314 ciblnd - Topspin 3.2.0,
315 iiblnd - Infiniserv 3.3 + PathBits patch,
316 gmlnd - GM 2.1.22 and later,
317 mxlnd - MX 1.2.1 or later,
318 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
322 Description: Ptllnd didn't init kptllnd_data.kptl_idle_txs before it could be
323 possibly accessed in kptllnd_shutdown. Ptllnd should init
324 kptllnd_data.kptl_ptlid2str_lock before calling kptllnd_ptlid2str.
328 Description: gmlnd ignored some transmit errors when finalizing lnet messages.
332 Description: ptllnd logs a piece of incorrect debug info in kptllnd_peer_handle_hello.
336 Description: the_lnet.ln_finalizing was not set when the current thread is
337 about to complete messages. It only affects multi-threaded
343 Description: Changed the default kqswlnd ntxmsg=512
348 Description: Assertion failure in kernel ptllnd caused by posting passive
349 bulk buffers before connection establishment complete.
354 Description: A race in kernel ptllnd between deleting a peer and posting
355 new communications for it could hang communications -
356 manifesting as "Unexpectedly long timeout" messages.
361 Description: Kernel ptllnd lock ordering issue could hang a node.
366 Description: node crash on socket teardown race
369 Frequency : 'lctl peer_list' issued on an mx net
371 Description: Enable lctl's peer_list for MXLND
374 Frequency : after Ptllnd timeouts and portals congestion
376 Description: Credit overflows
377 Details : This was a bug in ptllnd connection establishment. The fix
378 implements better peer stamps to disambiguate connection
379 establishment and ensure both peers enter the credit flow
380 state machine consistently.
385 Description: kptllnd didn't propagate some network errors up to LNET
386 Details : This bug was spotted while investigating 11394. The fix
387 ensures network errors on sends and bulk transfers are
388 propagated to LNET/lustre correctly.
390 Severity : enhancement
392 Description: Fixed console chatter in case of -ETIMEDOUT.
394 Severity : enhancement
396 Description: Added D_NETTRACE for recording network packet history
397 (initially only for ptllnd). Also a separate userspace
398 ptllnd facility to gather history which should really be
399 covered by D_NETTRACE too, if only CDEBUG recorded history in
405 Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED.
406 Details : If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED
407 callback can occur before a connection has actually been
408 established. This caused an assertion failure previously.
410 Severity : enhancement
412 Description: Multiple instances for o2iblnd
413 Details : Allow multiple instances of o2iblnd to enable networking over
414 multiple HCAs and routing between them.
418 Description: lnet deadlock in router_checker
419 Details : turned ksnd_connd_lock, ksnd_reaper_lock, and ksock_net_t:ksnd_lock
420 into BH locks to eliminate potential deadlock caused by
421 ksocknal_data_ready() preempting code holding these locks.
425 Description: Millions of failed socklnd connection attempts cause a very slow FS
426 Details : added a new route flag ksnr_scheduled to distinguish from
427 ksnr_connecting, so that a peer connection request is only turned
428 down for race concerns when an active connection to the same peer
429 is in progress (instead of just being scheduled).
431 ------------------------------------------------------------------------------
433 2007-02-09 Cluster File Systems, Inc. <info@clusterfs.com>
435 * Support for networks:
436 socklnd - kernels up to 2.6.16
437 qswlnd - Qsnet kernel modules 5.20 and later
438 openiblnd - IbGold 1.8.2
440 viblnd - Voltaire ibhost 3.4.5 and later
441 ciblnd - Topspin 3.2.0
442 iiblnd - Infiniserv 3.3 + PathBits patch
443 gmlnd - GM 2.1.22 and later
444 mxlnd - MX 1.2.1 or later
445 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
448 Severity : major on XT3
450 Description: libcfs overwrites /proc/sys/portals
451 Details : libcfs created a symlink from /proc/sys/portals to
452 /proc/sys/lnet for backwards compatibility. This is no
453 longer required and makes the Cray portals /proc variables
458 Description: OFED FMR API change
459 Details : This changes parameter usage to reflect a change in
460 ib_fmr_pool_map_phys() between OFED 1.0 and OFED 1.1. Note
461 that FMR support is only used in experimental versions of the
462 o2iblnd - this change does not affect standard usage at all.
464 Severity : enhancement
466 Description: new ko2iblnd module parameter: ib_mtu
467 Details : the default IB MTU of 2048 performs badly on 23108 Tavor
468 HCAs. You can avoid this problem by setting the MTU to 1024
469 using this module parameter.
471 Severity : enhancement
472 Bugzilla : 11118/11620
473 Description: ptllnd small request message buffer alignment fix
474 Details : Set the PTL_MD_LOCAL_ALIGN8 option on small message receives.
475 Round up small message size on sends in case this option
476 is not supported. 11620 was a defect in the initial
477 implementation which effectively asserted all peers had to be
478 running the correct protocol version which was fixed by always
479 NAK-ing such requests and handling any misalignments they
484 Description: When kib(nal|lnd)_del_peer() is called upon a peer whose
485 ibp_tx_queue is not empty, kib(nal|lnd)_destroy_peer()'s
486 'LASSERT(list_empty(&peer->ibp_tx_queue))' will fail.
488 Severity : enhancement
490 Description: Patchless ZC(zero copy) socklnd
491 Details : New protocol for socklnd. socklnd can now support zero copy without
492 a kernel patch, and remains compatible with the old socklnd. Checksum is
493 moved from tunables to modparams.
497 Description: When ksocknal_del_peer() is called upon a peer whose
498 ksnp_tx_queue is not empty, ksocknal_destroy_peer()'s
499 'LASSERT(list_empty(&peer->ksnp_tx_queue))' will fail.
502 Frequency : when ptlrpc is under heavy use and runs out of request buffer
504 Description: In lnet_match_blocked_msg(), md can be used without holding a
508 Frequency : very rarely
510 Description: If ksocknal_lib_setup_sock() fails, a ref on peer is lost.
511 If connd connects a route which has been closed by
512 ksocknal_shutdown(), ksocknal_create_routes() may create new
513 routes which hold references on the peer, causing shutdown
514 process to wait for peer to disappear forever.
516 Severity : enhancement
518 Description: Dump XT3 portals traces on kptllnd timeout
519 Details : Set the kptllnd module parameter "ptltrace_on_timeout=1" to
520 dump Cray portals debug traces to a file. The kptllnd module
521 parameter "ptltrace_basename", default "/tmp/lnet-ptltrace",
522 is the basename of the dump file.
525 Frequency : infrequent
527 Description: kernel ptllnd fix bug in connection re-establishment
528 Details : Kernel ptllnd could produce protocol errors e.g. illegal
529 matchbits and/or violate the credit flow protocol when trying
530 to re-establish a connection with a peer after an error or
533 Severity : enhancement
535 Description: Allow /proc/sys/lnet/debug to be set symbolically
536 Details : Allow debug and subsystem debug values to be read/set by name
537 in addition to numerically, for ease of use.
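One way a symbolic setting might be turned into a numeric mask is sketched below. This is an illustration only: the name table, the bit values and sketch_debug_str2mask() are hypothetical and are not taken from libcfs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_debug_name {
    const char *name;
    unsigned int bit;
};

/* Hypothetical name/bit pairs; the real debug mask values may differ. */
static const struct sketch_debug_name sketch_debug_names[] = {
    { "trace",    0x01 },
    { "neterror", 0x02 },
    { "nettrace", 0x04 },
};

/* Parse space-separated names into *mask.
 * Returns 0 on success, or -1 (leaving *mask untouched) on an unknown name. */
static int sketch_debug_str2mask(const char *str, unsigned int *mask)
{
    const size_t n = sizeof(sketch_debug_names) / sizeof(sketch_debug_names[0]);
    unsigned int m = 0;

    while (*str != '\0') {
        size_t len, i;

        str += strspn(str, " ");   /* skip separators */
        len = strcspn(str, " ");   /* length of the next name */
        if (len == 0)
            break;
        for (i = 0; i < n; i++) {
            const char *nm = sketch_debug_names[i].name;

            if (strlen(nm) == len && strncmp(nm, str, len) == 0) {
                m |= sketch_debug_names[i].bit;
                break;
            }
        }
        if (i == n)
            return -1;             /* name not in the table */
        str += len;
    }
    *mask = m;
    return 0;
}
```

Under these assumed values, parsing "neterror nettrace" yields the mask 0x06, and an unrecognized name is rejected without modifying the caller's mask.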
540 Frequency : only in configurations with LNET routers
542 Description: routes automatically marked down and recovered
543 Details : In configurations with LNET routers, if a router fails, routers
544 now actively try to recover routes that are down, unless they
545 were marked down by an administrator.
547 ------------------------------------------------------------------------------
549 2006-12-09 Cluster File Systems, Inc. <info@clusterfs.com>
552 Frequency : very rarely, in configurations with LNET routers and TCP
554 Description: incorrect data written to files on OSTs
555 Details : In certain high-load conditions incorrect data may be written
556 to files on the OST when using TCP networks.
558 ------------------------------------------------------------------------------
560 2006-07-31 Cluster File Systems, Inc. <info@clusterfs.com>
562 - rework CDEBUG messages rate-limiting mechanism b=10375
563 - add per-socket tunables for socklnd if the kernel is patched b=10327
565 ------------------------------------------------------------------------------
567 2006-02-15 Cluster File Systems, Inc. <info@clusterfs.com>
569 - fix use of portals/lnet pid to avoid dropping RPCs b=10074
570 - iiblnd wasn't mapping all memory, resulting in comms errors b=9776
571 - quiet LNET startup LNI message for liblustre b=10128
572 - Better console error messages if 'ip2nets' can't match an IP address
573 - Fixed overflow/use-before-set bugs in linux-time.h
574 - Fixed ptllnd bug that wasn't initialising rx descriptors completely
575 - LNET teardown failed an assertion about the route table being empty
576 - Fixed a crash in LNetEQPoll(<invalid handle>)
577 - Future protocol compatibility work (b_rls146_lnetprotovrsn)
578 - improve debug message for liblustre/Catamount nodes (b=10116)
580 2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
581 * Configuration change for the XT3
582 The PTLLND is now used to run Lustre over Portals on the XT3.
583 The configure option(s) --with-cray-portals are no longer
584 used. Rather --with-portals=<path-to-portals-includes> is
585 used to enable building on the XT3. In addition, to enable
586 XT3-specific features, the option --enable-cray-xt3 must be
589 2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
590 * Portals has been removed, replaced by LNET.
591 LNET is a new networking infrastructure for Lustre. It includes a
592 reorganized network configuration mode (see the user
593 documentation for full details) as well as support for routing
594 between different network fabrics. Lustre Networking Devices
595 (LNDS) for the supported network fabrics have also been created
596 for this new infrastructure.
598 2005-08-08 Cluster File Systems, Inc. <info@clusterfs.com>
603 Frequency : rare (large Voltaire clusters only)
605 Description: the default number of reserved transmit descriptors was too low
606 for some large clusters
607 Details : As a workaround, the number was increased. A proper fix includes
610 2005-06-02 Cluster File Systems, Inc. <info@clusterfs.com>
615 Frequency : occasional (large-scale events, cluster reboot, network failure)
617 Description: too many error messages on console obscure actual problem and
618 can slow down/panic server, or cause recovery to fail repeatedly
619 Details : enable rate-limiting of console error messages, and some messages
620 that were console errors now only go to the kernel log
622 Severity : enhancement
624 Description: add /proc/sys/portals/catastrophe entry which will report if
625 that node has previously LBUGged
627 2005-04-06 Cluster File Systems, Inc. <info@clusterfs.com>
629 - update gmnal to use PTL_MTU, fix module refcounting (b=5786)
631 2005-04-04 Cluster File Systems, Inc. <info@clusterfs.com>
633 - handle error return code in kranal_check_fma_rx() (5915,6054)
635 2005-02-04 Cluster File Systems, Inc. <info@clusterfs.com>
637 - update vibnal (Voltaire IB NAL)
638 - update gmnal (Myrinet NAL), gmnalid
640 2005-02-04 Eric Barton <eeb@bartonsoftware.com>
642 * Landed portals:b_port_step as follows...
644 - removed CFS_DECL_SPIN*
645 just use 'spinlock_t' and initialise with spin_lock_init()
647 - removed CFS_DECL_MUTEX*
648 just use 'struct semaphore' and initialise with init_mutex()
650 - removed CFS_DECL_RWSEM*
651 just use 'struct rw_semaphore' and initialise with init_rwsem()
653 - renamed cfs_sleep_chan -> cfs_waitq
654 cfs_sleep_link -> cfs_waitlink
656 - fixed race in linux version of arch-independent socknal
657 (the ENOMEM/EAGAIN decision).
659 - Didn't fix problems in Darwin version of arch-independent socknal
660 (resetting socket callbacks, eager ack hack, ENOMEM/EAGAIN decision)
662 - removed libcfs types from non-socknal header files (only some types
663 in the header files had been changed; the .c files hadn't been