1 tbd Sun Microsystems, Inc.
3 * Support for networks:
4 socklnd - any kernel supported by Lustre,
5 qswlnd - Qsnet kernel modules 5.20 and later,
6 openiblnd - IbGold 1.8.2,
7 o2iblnd - OFED 1.1, 1.2.0, 1.2.5, and 1.3
8 viblnd - Voltaire ibhost 3.4.5 and later,
9 ciblnd - Topspin 3.2.0,
10 iiblnd - Infiniserv 3.3 + PathBits patch,
11 gmlnd - GM 2.1.22 and later,
12 mxlnd - MX 1.2.1 or later,
13 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
22 Description: Support Zerocopy receive of Chelsio device
23 Details : Chelsio driver can support zerocopy for iov[1] if it's
24 contiguous and large enough.
28 Description: fix credit flow deadlock in uptllnd
32 Description: finalize network operation in reasonable time
33 Details : conf-sanity test_32a couldn't stop ost and mds because it
34 tried to access a non-existent peer and the tcp connect took
35 a long time before timing out.
39 Description: Remove portals compatibility
40 Details : Remove portals compatibility, not interoperable with releases
45 Description: Continuous recovery on 33 of 413 nodes after lustre oss failure
46 Details : Lost reference on conn prevents peer from being destroyed, which
47 could prevent new peer creation if peer count has reached upper
52 Description: LNET Selftest results in Soft lockup on OSS CPU
53 Details : only hits when 8 or more o2ib clients are involved and a session is
54 torn down with 'lst end_session' without a preceding 'lst stop'.
58 Description: concurrent_sends in IB LNDs should not be changeable at run time
59 Details : concurrent_sends in IB LNDs should not be changeable at run time
63 Description: ptl_send_rpc hits LASSERT when ptl_send_buf fails
64 Details : only hits under out-of-memory situations
67 -------------------------------------------------------------------------------
70 2008-04-26 Sun Microsystems, Inc.
72 * Support for networks:
73 socklnd - any kernel supported by Lustre,
74 qswlnd - Qsnet kernel modules 5.20 and later,
75 openiblnd - IbGold 1.8.2,
76 o2iblnd - OFED 1.1, 1.2.0, and 1.2.5,
77 viblnd - Voltaire ibhost 3.4.5 and later,
78 ciblnd - Topspin 3.2.0,
79 iiblnd - Infiniserv 3.3 + PathBits patch,
80 gmlnd - GM 2.1.22 and later,
81 mxlnd - MX 1.2.1 or later,
82 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
86 Description: excessive debug information removed
87 Details : excessive debug information removed
91 Description: ksocknal_create_conn() hit ASSERTION during connection race
92 Details : ksocknal_create_conn() hit ASSERTION during connection race
96 Description: ksocknal_send_hello() hit ASSERTION during connection race
97 Details : ksocknal_send_hello() hit ASSERTION during connection race
101 Description: o2iblnd/ptllnd credit deadlock in a routed config.
102 Details : o2iblnd/ptllnd credit deadlock in a routed config.
106 Description: High load after starting lnet
107 Details : gmlnd should sleep interruptibly in the rx thread. Otherwise,
108 the uptime utility reports a misleadingly high load.
112 Description: ksocklnd fails to establish connection if accept_port is high
113 Details : PID remapping must not be done for active (outgoing) connections
115 --------------------------------------------------------------------------------
117 2008-01-11 Sun Microsystems, Inc.
119 * Support for networks:
120 socklnd - any kernel supported by Lustre,
121 qswlnd - Qsnet kernel modules 5.20 and later,
122 openiblnd - IbGold 1.8.2,
123 o2iblnd - OFED 1.1, 1.2.0, and 1.2.5,
124 viblnd - Voltaire ibhost 3.4.5 and later,
125 ciblnd - Topspin 3.2.0,
126 iiblnd - Infiniserv 3.3 + PathBits patch,
127 gmlnd - GM 2.1.22 and later,
128 mxlnd - MX 1.2.1 or later,
129 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
132 Description: liblustre network error
133 Details : liblustre clients should understand LNET_ACCEPT_PORT environment
134 variable even if they don't start lnet acceptor.
138 Description: Strange message from lnet (Ignoring prediction from the future)
139 Details : Incorrect calculation of peer's last_alive value in ksocklnd
141 --------------------------------------------------------------------------------
143 2007-12-07 Cluster File Systems, Inc. <info@clusterfs.com>
145 * Support for networks:
146 socklnd - any kernel supported by Lustre,
147 qswlnd - Qsnet kernel modules 5.20 and later,
148 openiblnd - IbGold 1.8.2,
149 o2iblnd - OFED 1.1, 1.2.0, and 1.2.5,
150 viblnd - Voltaire ibhost 3.4.5 and later,
151 ciblnd - Topspin 3.2.0,
152 iiblnd - Infiniserv 3.3 + PathBits patch,
153 gmlnd - GM 2.1.22 and later,
154 mxlnd - MX 1.2.1 or later,
155 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
159 Description: ASSERTION(me == md->md_me) failed in lnet_match_md()
163 Description: increase send queue size for ciblnd/openiblnd
167 Description: new userspace socklnd
168 Details : The old userspace tcpnal that resided in lnet/ulnds/socklnd was
169 replaced with a new one, usocklnd.
171 Severity : enhancement
173 Description: Console message flood
174 Details : Make cdls rate limiting more tunable by adding several tunables
175 in procfs: /proc/sys/lnet/console_{min,max}_delay_centisecs and
176 /proc/sys/lnet/console_backoff.
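
As an illustration only, here is one way three such tunables could interact. This is a sketch under stated assumptions, not the libcfs implementation: the struct names, the reset rule, and the choice to multiply the delay by the backoff factor are all hypothetical. Only the tunable names come from the entry above.

```c
#include <stdint.h>

/* Hypothetical tunables, named after the procfs entries; values are examples. */
struct ratelimit_tunables {
    uint32_t min_delay_cs;   /* console_min_delay_centisecs */
    uint32_t max_delay_cs;   /* console_max_delay_centisecs */
    uint32_t backoff;        /* console_backoff */
};

/* Hypothetical per-message-site state; zero-initialise before first use. */
struct ratelimit_state {
    uint64_t next_cs;   /* earliest time the next message may print */
    uint32_t delay_cs;  /* current gap between printed messages */
};

/* Returns 1 if a message arriving at now_cs should reach the console. */
static int ratelimit_allow(struct ratelimit_state *st,
                           const struct ratelimit_tunables *t,
                           uint64_t now_cs)
{
    if (now_cs < st->next_cs)
        return 0;                       /* still inside the quiet period */

    if (st->delay_cs == 0 || now_cs >= st->next_cs + t->max_delay_cs)
        st->delay_cs = t->min_delay_cs; /* first message, or long silence: reset */
    else
        st->delay_cs *= t->backoff;     /* still noisy: lengthen the gap */

    if (st->delay_cs > t->max_delay_cs)
        st->delay_cs = t->max_delay_cs; /* never wait longer than the maximum */

    st->next_cs = now_cs + st->delay_cs;
    return 1;
}
```

With a minimum of 50, a maximum of 600 and a backoff of 2, a steady stream of messages is printed at widening intervals (50, 100, 200, ... centiseconds) until the gap reaches the maximum; a long enough silence resets the gap to the minimum.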
178 --------------------------------------------------------------------------------
180 2007-09-27 Cluster File Systems, Inc. <info@clusterfs.com>
182 * Support for networks:
183 socklnd - any kernel supported by Lustre,
184 qswlnd - Qsnet kernel modules 5.20 and later,
185 openiblnd - IbGold 1.8.2,
186 o2iblnd - OFED 1.1 and 1.2,
187 viblnd - Voltaire ibhost 3.4.5 and later,
188 ciblnd - Topspin 3.2.0,
189 iiblnd - Infiniserv 3.3 + PathBits patch,
190 gmlnd - GM 2.1.22 and later,
191 mxlnd - MX 1.2.1 or later,
192 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
196 Description: /proc/sys/lnet has non-sysctl entries
197 Details : Updating dump_kernel/daemon_file/debug_mb to use sysctl variables
201 Description: TOE Kernel panic by ksocklnd
202 Details : offloaded sockets provide their own implementation of sendpage,
203 so tcp_sendpage() can't be called directly
207 Description: kibnal_shutdown() doesn't finish; lconf --cleanup hangs
208 Details : races between lnd_shutdown and peer creation prevent
209 lnd_shutdown from finishing.
213 Description: open files rlimit 1024 reached while liblustre testing
214 Details : ulnds/socklnd must close open socket after unsuccessful
219 Description: build error
220 Details : fix typos in gmlnd, ptllnd and viblnd
222 ------------------------------------------------------------------------------
224 2007-07-30 Cluster File Systems, Inc. <info@clusterfs.com>
226 * Support for networks:
227 socklnd - kernels up to 2.6.16,
228 qswlnd - Qsnet kernel modules 5.20 and later,
229 openiblnd - IbGold 1.8.2,
230 o2iblnd - OFED 1.1 and 1.2
231 viblnd - Voltaire ibhost 3.4.5 and later,
232 ciblnd - Topspin 3.2.0,
233 iiblnd - Infiniserv 3.3 + PathBits patch,
234 gmlnd - GM 2.1.22 and later,
235 mxlnd - MX 1.2.1 or later,
236 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
238 2007-06-21 Cluster File Systems, Inc. <info@clusterfs.com>
240 * Support for networks:
241 socklnd - kernels up to 2.6.16,
242 qswlnd - Qsnet kernel modules 5.20 and later,
243 openiblnd - IbGold 1.8.2,
245 viblnd - Voltaire ibhost 3.4.5 and later,
246 ciblnd - Topspin 3.2.0,
247 iiblnd - Infiniserv 3.3 + PathBits patch,
248 gmlnd - GM 2.1.22 and later,
249 mxlnd - MX 1.2.1 or later,
250 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
254 Description: Initialize cpumask before use
258 Description: ASSERTION failures when upgrading to the patchless zero-copy
260 Details : This bug affects "rolling upgrades", causing an inconsistent
261 protocol version negotiation and subsequent assertion failure
262 during rolling upgrades after the first wave of upgrades.
266 Details : Change "dropped message" CERRORs to D_NETERROR so they are
267 logged instead of creating "console chatter" when a lustre
268 timeout races with normal RPC completion.
271 Details : lnet_clear_peer_table can wait forever if user forgets to
275 Details : libcfs_id2str should check pid against LNET_PID_ANY.
279 Description: added LNET self test
280 Details : landing b_self_test
285 Description: cfs_duration_{u,n}sec() wrongly calculate nanosecond part of
287 Details : do_div() macro is used incorrectly.
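
For readers unfamiliar with the pitfall: the kernel's do_div(n, base) replaces n with the quotient and returns the remainder, which is easy to get backwards. The sketch below is a userspace stand-in showing the intended pattern. do_div_emul, split_duration and split_ns are hypothetical names used for illustration, and this is not the actual fix applied to cfs_duration_{u,n}sec().

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's do_div(): the real macro divides the
 * 64-bit dividend in place and returns the 32-bit remainder. A pointer is
 * taken here because a plain C function cannot modify its argument.
 */
static uint32_t do_div_emul(uint64_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}

/* Hypothetical result type: whole seconds plus the nanosecond part. */
struct split_duration {
    uint64_t sec;
    uint32_t nsec;
};

/* Hypothetical helper: split a duration given in nanoseconds. */
static struct split_duration split_ns(uint64_t total_ns)
{
    struct split_duration d;
    uint64_t q = total_ns;

    /* The remainder is the return value; q becomes the quotient. */
    d.nsec = do_div_emul(&q, 1000000000u);
    d.sec = q;
    return d;
}
```

A typical misuse treats the return value as the quotient, or reads the dividend after the call expecting it to be unchanged; either mistake corrupts the sub-second part of the result.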
289 2007-04-23 Cluster File Systems, Inc. <info@clusterfs.com>
293 Description: make panic on lbug configurable
297 Description: Add OFED1.2 support to o2iblnd
298 Details : o2iblnd depends on OFED's modules. If out-of-tree OFED modules
299 are installed (rather than the kernel's in-tree infiniband),
300 there could be problems when loading o2iblnd with insmod (mismatched CRC of
302 If extra Module.symvers is supported in the kernel (i.e., 2.6.17),
303 this link provides a solution:
304 https://bugs.openfabrics.org/show_bug.cgi?id=355
305 If extra Module.symvers is not supported in the kernel, we will
306 have to run the script in bug 12316 to update
307 $LINUX/module.symvers before building o2iblnd.
308 More details about this are in bug 12316.
310 ------------------------------------------------------------------------------
312 2007-04-01 Cluster File Systems, Inc. <info@clusterfs.com>
313 * version 1.4.10 / 1.6.0
314 * Support for networks:
315 socklnd - kernels up to 2.6.16,
316 qswlnd - Qsnet kernel modules 5.20 and later,
317 openiblnd - IbGold 1.8.2,
319 viblnd - Voltaire ibhost 3.4.5 and later,
320 ciblnd - Topspin 3.2.0,
321 iiblnd - Infiniserv 3.3 + PathBits patch,
322 gmlnd - GM 2.1.22 and later,
323 mxlnd - MX 1.2.1 or later,
324 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
328 Description: Ptllnd didn't init kptllnd_data.kptl_idle_txs before it could be
329 possibly accessed in kptllnd_shutdown. Ptllnd should init
330 kptllnd_data.kptl_ptlid2str_lock before calling kptllnd_ptlid2str.
334 Description: gmlnd ignored some transmit errors when finalizing lnet messages.
338 Description: ptllnd logs a piece of incorrect debug info in kptllnd_peer_handle_hello.
342 Description: the_lnet.ln_finalizing was not set when the current thread is
343 about to complete messages. It only affects multi-threaded
349 Description: Changed the default kqswlnd ntxmsg=512
354 Description: Assertion failure in kernel ptllnd caused by posting passive
355 bulk buffers before connection establishment complete.
360 Description: A race in kernel ptllnd between deleting a peer and posting
361 new communications for it could hang communications -
362 manifesting as "Unexpectedly long timeout" messages.
367 Description: Kernel ptllnd lock ordering issue could hang a node.
372 Description: node crash on socket teardown race
375 Frequency : 'lctl peer_list' issued on a mx net
377 Description: Enable lctl's peer_list for MXLND
380 Frequency : after Ptllnd timeouts and portals congestion
382 Description: Credit overflows
383 Details : This was a bug in ptllnd connection establishment. The fix
384 implements better peer stamps to disambiguate connection
385 establishment and ensure both peers enter the credit flow
386 state machine consistently.
391 Description: kptllnd didn't propagate some network errors up to LNET
392 Details : This bug was spotted while investigating 11394. The fix
393 ensures network errors on sends and bulk transfers are
394 propagated to LNET/lustre correctly.
396 Severity : enhancement
398 Description: Fixed console chatter in case of -ETIMEDOUT.
400 Severity : enhancement
402 Description: Added D_NETTRACE for recording network packet history
403 (initially only for ptllnd). Also a separate userspace
404 ptllnd facility to gather history which should really be
405 covered by D_NETTRACE too, if only CDEBUG recorded history in
411 Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED.
412 Details : If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED
413 callback can occur before a connection has actually been
414 established. This caused an assertion failure previously.
416 Severity : enhancement
418 Description: Multiple instances for o2iblnd
419 Details : Allow multiple instances of o2iblnd to enable networking over
420 multiple HCAs and routing between them.
424 Description: lnet deadlock in router_checker
425 Details : turned ksnd_connd_lock, ksnd_reaper_lock, and ksock_net_t:ksnd_lock
426 into BH locks to eliminate potential deadlock caused by
427 ksocknal_data_ready() preempting code holding these locks.
431 Description: Millions of failed socklnd connection attempts cause a very slow FS
432 Details : added a new route flag ksnr_scheduled to distinguish from
433 ksnr_connecting, so that a peer connection request is only turned
434 down for race concerns when an active connection to the same peer
435 is in progress (instead of just being scheduled).
437 ------------------------------------------------------------------------------
439 2007-02-09 Cluster File Systems, Inc. <info@clusterfs.com>
441 * Support for networks:
442 socklnd - kernels up to 2.6.16
443 qswlnd - Qsnet kernel modules 5.20 and later
444 openiblnd - IbGold 1.8.2
446 viblnd - Voltaire ibhost 3.4.5 and later
447 ciblnd - Topspin 3.2.0
448 iiblnd - Infiniserv 3.3 + PathBits patch
449 gmlnd - GM 2.1.22 and later
450 mxlnd - MX 1.2.1 or later
451 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
454 Severity : major on XT3
456 Description: libcfs overwrites /proc/sys/portals
457 Details : libcfs created a symlink from /proc/sys/portals to
458 /proc/sys/lnet for backwards compatibility. This is no
459 longer required and makes the Cray portals /proc variables
464 Description: OFED FMR API change
465 Details : This changes parameter usage to reflect a change in
466 ib_fmr_pool_map_phys() between OFED 1.0 and OFED 1.1. Note
467 that FMR support is only used in experimental versions of the
468 o2iblnd - this change does not affect standard usage at all.
470 Severity : enhancement
472 Description: new ko2iblnd module parameter: ib_mtu
473 Details : the default IB MTU of 2048 performs badly on 23108 Tavor
474 HCAs. You can avoid this problem by setting the MTU to 1024
475 using this module parameter.
477 Severity : enhancement
478 Bugzilla : 11118/11620
479 Description: ptllnd small request message buffer alignment fix
480 Details : Set the PTL_MD_LOCAL_ALIGN8 option on small message receives.
481 Round up small message size on sends in case this option
482 is not supported. 11620 was a defect in the initial
483 implementation which effectively asserted all peers had to be
484 running the correct protocol version which was fixed by always
485 NAK-ing such requests and handling any misalignments they
490 Description: When kib(nal|lnd)_del_peer() is called upon a peer whose
491 ibp_tx_queue is not empty, kib(nal|lnd)_destroy_peer()'s
492 'LASSERT(list_empty(&peer->ibp_tx_queue))' will fail.
494 Severity : enhancement
496 Description: Patchless ZC (zero copy) socklnd
497 Details : New protocol for socklnd: socklnd can support zero copy without
498 a kernel patch and is compatible with the old socklnd. Checksum is
499 moved from tunables to modparams.
503 Description: When ksocknal_del_peer() is called upon a peer whose
504 ksnp_tx_queue is not empty, ksocknal_destroy_peer()'s
505 'LASSERT(list_empty(&peer->ksnp_tx_queue))' will fail.
508 Frequency : when ptlrpc is under heavy use and runs out of request buffer
510 Description: In lnet_match_blocked_msg(), md can be used without holding a
514 Frequency : very rarely
516 Description: If ksocknal_lib_setup_sock() fails, a ref on peer is lost.
517 If connd connects a route which has been closed by
518 ksocknal_shutdown(), ksocknal_create_routes() may create new
519 routes which hold references on the peer, causing shutdown
520 process to wait for peer to disappear forever.
522 Severity : enhancement
524 Description: Dump XT3 portals traces on kptllnd timeout
525 Details : Set the kptllnd module parameter "ptltrace_on_timeout=1" to
526 dump Cray portals debug traces to a file. The kptllnd module
527 parameter "ptltrace_basename", default "/tmp/lnet-ptltrace",
528 is the basename of the dump file.
531 Frequency : infrequent
533 Description: kernel ptllnd fix bug in connection re-establishment
534 Details : Kernel ptllnd could produce protocol errors e.g. illegal
535 matchbits and/or violate the credit flow protocol when trying
536 to re-establish a connection with a peer after an error or
539 Severity : enhancement
541 Description: Allow /proc/sys/lnet/debug to be set symbolically
542 Details : Allow debug and subsystem debug values to be read/set by name
543 in addition to numerically, for ease of use.
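
A minimal sketch of what accepting both forms might look like, assuming a space-separated list of names. The name table and bit values below are made up for illustration (only D_NETERROR and D_NETTRACE are mentioned elsewhere in this log, and their real values are not given here); parse_debug_mask is a hypothetical helper, not the libcfs parser.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical name table; real flag names and bit values may differ. */
struct debug_name {
    const char *name;
    unsigned int bit;
};

static const struct debug_name debug_names[] = {
    { "neterror", 0x1 },
    { "nettrace", 0x2 },
    { "trace",    0x4 },
};

/*
 * Parse either a number ("0x3", "5") or a space-separated list of names.
 * Returns 0 and stores the mask on success, -1 on an unknown name.
 */
static int parse_debug_mask(const char *s, unsigned int *mask)
{
    char *end;
    unsigned long v = strtoul(s, &end, 0);
    unsigned int m = 0;

    if (end != s && *end == '\0') {   /* whole string was numeric */
        *mask = (unsigned int)v;
        return 0;
    }

    while (*s != '\0') {
        size_t len, i;
        int found = 0;

        s += strspn(s, " ");          /* skip separators */
        len = strcspn(s, " ");        /* length of the next token */
        if (len == 0)
            break;

        for (i = 0; i < sizeof(debug_names) / sizeof(debug_names[0]); i++) {
            if (strlen(debug_names[i].name) == len &&
                strncmp(s, debug_names[i].name, len) == 0) {
                m |= debug_names[i].bit;
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;                /* reject names not in the table */
        s += len;
    }

    *mask = m;
    return 0;
}
```

Trying the numeric form first keeps existing scripts that write raw masks working, while unknown names are rejected rather than silently ignored.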
546 Frequency : only in configurations with LNET routers
548 Description: routes automatically marked down and recovered
549 Details : In configurations with LNET routers, if a router fails, routers
550 now actively try to recover routes that are down, unless they
551 were marked down by an administrator.
553 ------------------------------------------------------------------------------
555 2006-12-09 Cluster File Systems, Inc. <info@clusterfs.com>
558 Frequency : very rarely, in configurations with LNET routers and TCP
560 Description: incorrect data written to files on OSTs
561 Details : In certain high-load conditions incorrect data may be written
562 to files on the OST when using TCP networks.
564 ------------------------------------------------------------------------------
566 2006-07-31 Cluster File Systems, Inc. <info@clusterfs.com>
568 - rework CDEBUG messages rate-limiting mechanism b=10375
569 - add per-socket tunables for socklnd if the kernel is patched b=10327
571 ------------------------------------------------------------------------------
573 2006-02-15 Cluster File Systems, Inc. <info@clusterfs.com>
575 - fix use of portals/lnet pid to avoid dropping RPCs b=10074
576 - iiblnd wasn't mapping all memory, resulting in comms errors b=9776
577 - quiet LNET startup LNI message for liblustre b=10128
578 - Better console error messages if 'ip2nets' can't match an IP address
579 - Fixed overflow/use-before-set bugs in linux-time.h
580 - Fixed ptllnd bug that wasn't initialising rx descriptors completely
581 - LNET teardown failed an assertion about the route table being empty
582 - Fixed a crash in LNetEQPoll(<invalid handle>)
583 - Future protocol compatibility work (b_rls146_lnetprotovrsn)
584 - improve debug message for liblustre/Catamount nodes (b=10116)
586 2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
587 * Configuration change for the XT3
588 The PTLLND is now used to run Lustre over Portals on the XT3.
589 The configure option(s) --with-cray-portals are no longer
590 used. Rather --with-portals=<path-to-portals-includes> is
591 used to enable building on the XT3. In addition, to enable
592 XT3-specific features, the option --enable-cray-xt3 must be
595 2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
596 * Portals has been removed, replaced by LNET.
597 LNET is a new networking infrastructure for Lustre; it includes a
598 reorganized network configuration mode (see the user
599 documentation for full details) as well as support for routing
600 between different network fabrics. Lustre Networking Devices
601 (LNDs) for the supported network fabrics have also been created
602 for this new infrastructure.
604 2005-08-08 Cluster File Systems, Inc. <info@clusterfs.com>
609 Frequency : rare (large Voltaire clusters only)
611 Description: the default number of reserved transmit descriptors was too low
612 for some large clusters
613 Details : As a workaround, the number was increased. A proper fix includes
616 2005-06-02 Cluster File Systems, Inc. <info@clusterfs.com>
621 Frequency : occasional (large-scale events, cluster reboot, network failure)
623 Description: too many error messages on console obscure actual problem and
624 can slow down/panic server, or cause recovery to fail repeatedly
625 Details : enable rate-limiting of console error messages, and some messages
626 that were console errors now only go to the kernel log
628 Severity : enhancement
630 Description: add /proc/sys/portals/catastrophe entry which will report if
631 that node has previously LBUGged
633 2005-04-06 Cluster File Systems, Inc. <info@clusterfs.com>
635 - update gmnal to use PTL_MTU, fix module refcounting (b=5786)
637 2005-04-04 Cluster File Systems, Inc. <info@clusterfs.com>
639 - handle error return code in kranal_check_fma_rx() (5915,6054)
641 2005-02-04 Cluster File Systems, Inc. <info@clusterfs.com>
643 - update vibnal (Voltaire IB NAL)
644 - update gmnal (Myrinet NAL), gmnalid
646 2005-02-04 Eric Barton <eeb@bartonsoftware.com>
648 * Landed portals:b_port_step as follows...
650 - removed CFS_DECL_SPIN*
651 just use 'spinlock_t' and initialise with spin_lock_init()
653 - removed CFS_DECL_MUTEX*
654 just use 'struct semaphore' and initialise with init_mutex()
656 - removed CFS_DECL_RWSEM*
657 just use 'struct rw_semaphore' and initialise with init_rwsem()
659 - renamed cfs_sleep_chan -> cfs_waitq
660 cfs_sleep_link -> cfs_waitlink
662 - fixed race in linux version of arch-independent socknal
663 (the ENOMEM/EAGAIN decision).
665 - Didn't fix problems in Darwin version of arch-independent socknal
666 (resetting socket callbacks, eager ack hack, ENOMEM/EAGAIN decision)
668 - removed libcfs types from non-socknal header files (only some types
669 in the header files had been changed; the .c files hadn't been