1 tbd Sun Microsystems, Inc.
3 * Support for networks:
4 socklnd - any kernel supported by Lustre,
5 qswlnd - Qsnet kernel modules 5.20 and later,
6 openiblnd - IbGold 1.8.2,
7 o2iblnd - OFED 1.1, 1.2.0, 1.2.5, and 1.3
8 viblnd - Voltaire ibhost 3.4.5 and later,
9 ciblnd - Topspin 3.2.0,
10 iiblnd - Infiniserv 3.3 + PathBits patch,
11 gmlnd - GM 2.1.22 and later,
12 mxlnd - MX 1.2.1 or later,
13 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
21 -------------------------------------------------------------------------------
23 2008-12-31 Sun Microsystems, Inc.
25 * Support for networks:
26 socklnd - any kernel supported by Lustre,
27 qswlnd - Qsnet kernel modules 5.20 and later,
28 openiblnd - IbGold 1.8.2,
29 o2iblnd - OFED 1.1, 1.2.0, 1.2.5, and 1.3
30 viblnd - Voltaire ibhost 3.4.5 and later,
31 ciblnd - Topspin 3.2.0,
32 iiblnd - Infiniserv 3.3 + PathBits patch,
33 gmlnd - GM 2.1.22 and later,
34 mxlnd - MX 1.2.1 or later,
35 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
44 Description: workaround for OOM from o2iblnd
45 Details : OFED needs to allocate a big chunk of memory for the QP while creating
46 a connection for o2iblnd; OOM can happen if no such contiguous chunk is available.
48 QP size is decided by concurrent_sends and max_fragments of
49 o2iblnd. Users may now specify a smaller value for
50 concurrent_sends of o2iblnd (e.g. concurrent_sends=7), which
51 decreases the size of the memory block required to create the QP.
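As a rough illustration of why a smaller concurrent_sends shrinks the allocation, the sketch below assumes the send queue needs one work request per fragment plus one for the message itself, for every concurrent send. The formula, the function name and the example values are assumptions made for illustration; they are not taken from the o2iblnd source.

```python
def send_work_requests(concurrent_sends: int, max_fragments: int) -> int:
    """Estimate the send-queue work requests needed for one connection.

    Assumed model (hypothetical): every in-flight send may need one work
    request per RDMA fragment plus one for the message itself.
    """
    if concurrent_sends < 1 or max_fragments < 1:
        raise ValueError("both parameters must be positive")
    return concurrent_sends * (max_fragments + 1)


# Example values are made up; only the ratio between the results matters.
default_wrs = send_work_requests(concurrent_sends=63, max_fragments=256)
reduced_wrs = send_work_requests(concurrent_sends=7, max_fragments=256)
print(default_wrs, reduced_wrs)
```

Under this model the queue size scales linearly with concurrent_sends, so dropping from 63 to 7 cuts the request count by a factor of nine.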
55 Description: Support Zerocopy receive of Chelsio device
56 Details : Chelsio driver can support zerocopy for iov[1] if it's
57 contiguous and large enough.
61 Description: fix credit flow deadlock in uptllnd
65 Description: finalize network operation in reasonable time
66 Details : conf-sanity test_32a couldn't stop ost and mds because it
67 tried to access non-existent peer and tcp connect took
68 quite long before timing out.
72 Description: Continuous recovery on 33 of 413 nodes after lustre oss failure
73 Details : Lost reference on conn prevents peer from being destroyed, which
74 could prevent new peer creation if peer count has reached upper
79 Description: LNET Selftest results in Soft lockup on OSS CPU
80 Details : only hits when 8 or more o2ib clients are involved and a session is
81 torn down with 'lst end_session' without a preceding 'lst stop'.
85 Description: concurrent_sends in IB LNDs should not be changeable at run time
86 Details : concurrent_sends in IB LNDs should not be changeable at run time
90 Description: ptl_send_rpc hits LASSERT when ptl_send_buf fails
91 Details : only hits under out-of-memory situations
94 -------------------------------------------------------------------------------
97 2008-04-26 Sun Microsystems, Inc.
99 * Support for networks:
100 socklnd - any kernel supported by Lustre,
101 qswlnd - Qsnet kernel modules 5.20 and later,
102 openiblnd - IbGold 1.8.2,
103 o2iblnd - OFED 1.1 and 1.2.0, 1.2.5
104 viblnd - Voltaire ibhost 3.4.5 and later,
105 ciblnd - Topspin 3.2.0,
106 iiblnd - Infiniserv 3.3 + PathBits patch,
107 gmlnd - GM 2.1.22 and later,
108 mxlnd - MX 1.2.1 or later,
109 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
113 Description: excessive debug information removed
114 Details : excessive debug information removed
118 Description: ksocknal_create_conn() hit ASSERTION during connection race
119 Details : ksocknal_create_conn() hit ASSERTION during connection race
123 Description: ksocknal_send_hello() hit ASSERTION while connecting race
124 Details : ksocknal_send_hello() hit ASSERTION while connecting race
128 Description: o2iblnd/ptllnd credit deadlock in a routed config.
129 Details : o2iblnd/ptllnd credit deadlock in a routed config.
133 Description: High load after starting lnet
134 Details : gmlnd should sleep in the rx thread in an interruptible way. Otherwise,
135 the uptime utility reports a confusingly high load.
139 Description: ksocklnd fails to establish connection if accept_port is high
140 Details : PID remapping must not be done for active (outgoing) connections
142 --------------------------------------------------------------------------------
144 2008-01-11 Sun Microsystems, Inc.
146 * Support for networks:
147 socklnd - any kernel supported by Lustre,
148 qswlnd - Qsnet kernel modules 5.20 and later,
149 openiblnd - IbGold 1.8.2,
150 o2iblnd - OFED 1.1 and 1.2.0, 1.2.5
151 viblnd - Voltaire ibhost 3.4.5 and later,
152 ciblnd - Topspin 3.2.0,
153 iiblnd - Infiniserv 3.3 + PathBits patch,
154 gmlnd - GM 2.1.22 and later,
155 mxlnd - MX 1.2.1 or later,
156 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
159 Description: liblustre network error
160 Details : liblustre clients should understand LNET_ACCEPT_PORT environment
161 variable even if they don't start lnet acceptor.
165 Description: Strange message from lnet (Ignoring prediction from the future)
166 Details : Incorrect calculation of peer's last_alive value in ksocklnd
168 --------------------------------------------------------------------------------
170 2007-12-07 Cluster File Systems, Inc. <info@clusterfs.com>
172 * Support for networks:
173 socklnd - any kernel supported by Lustre,
174 qswlnd - Qsnet kernel modules 5.20 and later,
175 openiblnd - IbGold 1.8.2,
176 o2iblnd - OFED 1.1 and 1.2.0, 1.2.5.
177 viblnd - Voltaire ibhost 3.4.5 and later,
178 ciblnd - Topspin 3.2.0,
179 iiblnd - Infiniserv 3.3 + PathBits patch,
180 gmlnd - GM 2.1.22 and later,
181 mxlnd - MX 1.2.1 or later,
182 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
186 Description: ASSERTION(me == md->md_me) failed in lnet_match_md()
190 Description: increase send queue size for ciblnd/openiblnd
194 Description: new userspace socklnd
195 Details : Old userspace tcpnal that resided in lnet/ulnds/socklnd replaced
196 with new one - usocklnd.
198 Severity : enhancement
200 Description: Console message flood
201 Details : Make cdls rate limiting more tunable by adding several tunables in
202 procfs: /proc/sys/lnet/console_{min,max}_delay_centisecs and
203 /proc/sys/lnet/console_backoff.
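A minimal sketch of how a minimum delay, a maximum delay and a backoff factor could interact when rate-limiting a repeated console message. The class name, its fields and the reset rule are assumptions for illustration; the actual cdls logic in libcfs may differ. Times are in centiseconds to match the tunable names.

```python
class ConsoleLimiter:
    """Back off a repeated message between a minimum and a maximum delay."""

    def __init__(self, min_delay_cs: int, max_delay_cs: int, backoff: int):
        self.min_delay = min_delay_cs
        self.max_delay = max_delay_cs
        self.backoff = backoff
        self.delay = min_delay_cs   # gap currently required between prints
        self.next_allowed = None    # earliest time the next print may happen

    def should_print(self, now_cs: int) -> bool:
        if self.next_allowed is not None and now_cs < self.next_allowed:
            return False            # still inside the quiet period
        if (self.next_allowed is None
                or now_cs - self.next_allowed >= self.max_delay):
            # First message, or the source stayed quiet long enough: reset.
            self.delay = self.min_delay
        else:
            # Message keeps recurring: grow the gap, capped at the maximum.
            self.delay = min(self.delay * self.backoff, self.max_delay)
        self.next_allowed = now_cs + self.delay
        return True


limiter = ConsoleLimiter(min_delay_cs=50, max_delay_cs=400, backoff=2)
print([limiter.should_print(t) for t in (0, 10, 50, 100, 150)])
```

With these example settings the required gap doubles each time the message recurs (50, 100, 200, 400) and falls back to 50 after a long enough silence.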
205 --------------------------------------------------------------------------------
207 2007-09-27 Cluster File Systems, Inc. <info@clusterfs.com>
209 * Support for networks:
210 socklnd - any kernel supported by Lustre,
211 qswlnd - Qsnet kernel modules 5.20 and later,
212 openiblnd - IbGold 1.8.2,
213 o2iblnd - OFED 1.1 and 1.2,
214 viblnd - Voltaire ibhost 3.4.5 and later,
215 ciblnd - Topspin 3.2.0,
216 iiblnd - Infiniserv 3.3 + PathBits patch,
217 gmlnd - GM 2.1.22 and later,
218 mxlnd - MX 1.2.1 or later,
219 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
223 Description: /proc/sys/lnet has non-sysctl entries
224 Details : Updating dump_kernel/daemon_file/debug_mb to use sysctl variables
228 Description: TOE Kernel panic by ksocklnd
229 Details : offloaded sockets provide their own implementation of sendpage,
230 so tcp_sendpage() can't be called directly
234 Description: kibnal_shutdown() doesn't finish; lconf --cleanup hangs
235 Details : races between lnd_shutdown and peer creation prevent
236 lnd_shutdown from finishing.
240 Description: open files rlimit 1024 reached while liblustre testing
241 Details : ulnds/socklnd must close open socket after unsuccessful
246 Description: build error
247 Details : fix typos in gmlnd, ptllnd and viblnd
249 ------------------------------------------------------------------------------
251 2007-07-30 Cluster File Systems, Inc. <info@clusterfs.com>
253 * Support for networks:
254 socklnd - kernels up to 2.6.16,
255 qswlnd - Qsnet kernel modules 5.20 and later,
256 openiblnd - IbGold 1.8.2,
257 o2iblnd - OFED 1.1 and 1.2
258 viblnd - Voltaire ibhost 3.4.5 and later,
259 ciblnd - Topspin 3.2.0,
260 iiblnd - Infiniserv 3.3 + PathBits patch,
261 gmlnd - GM 2.1.22 and later,
262 mxlnd - MX 1.2.1 or later,
263 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
265 2007-06-21 Cluster File Systems, Inc. <info@clusterfs.com>
267 * Support for networks:
268 socklnd - kernels up to 2.6.16,
269 qswlnd - Qsnet kernel modules 5.20 and later,
270 openiblnd - IbGold 1.8.2,
272 viblnd - Voltaire ibhost 3.4.5 and later,
273 ciblnd - Topspin 3.2.0,
274 iiblnd - Infiniserv 3.3 + PathBits patch,
275 gmlnd - GM 2.1.22 and later,
276 mxlnd - MX 1.2.1 or later,
277 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
281 Description: Initialize cpumask before use
285 Description: ASSERTION failures when upgrading to the patchless zero-copy
287 Details : This bug affects "rolling upgrades", causing an inconsistent
288 protocol version negotiation and subsequent assertion failure
289 during rolling upgrades after the first wave of upgrades.
293 Details : Change "dropped message" CERRORs to D_NETERROR so they are
294 logged instead of creating "console chatter" when a lustre
295 timeout races with normal RPC completion.
298 Details : lnet_clear_peer_table can wait forever if user forgets to
302 Details : libcfs_id2str should check pid against LNET_PID_ANY.
306 Description: added LNET self test
307 Details : landing b_self_test
312 Description: cfs_duration_{u,n}sec() wrongly calculate nanosecond part of
314 Details : do_div() macro is used incorrectly.
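For context: in the Linux kernel, do_div(n, base) divides the 64-bit value n in place and returns the remainder, so quotient and remainder are easy to confuse. The sketch below models that behaviour in Python and uses it to split a duration into whole seconds and a nanosecond remainder. The names do_div_model and split_duration are hypothetical, and this is not the libcfs code.

```python
NSEC_PER_SEC = 1_000_000_000


def do_div_model(n: int, base: int) -> tuple:
    """Model of do_div: (quotient left in n, remainder returned by the macro)."""
    return n // base, n % base


def split_duration(total_nsec: int) -> tuple:
    """Split a duration in nanoseconds into (seconds, nanosecond part)."""
    seconds, nsec_part = do_div_model(total_nsec, NSEC_PER_SEC)
    return seconds, nsec_part


print(split_duration(2_500_000_000))
```

Treating the macro's return value as the quotient, or reading n as if it were unchanged after the call, gives a wrong sub-second part.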
316 2007-04-23 Cluster File Systems, Inc. <info@clusterfs.com>
320 Description: make panic on lbug configurable
324 Description: Add OFED1.2 support to o2iblnd
325 Details : o2iblnd depends on OFED's modules. If out-of-tree OFED modules
326 are installed (rather than the kernel's in-tree infiniband), there
327 could be problems when loading o2iblnd with insmod (mismatched CRC of
329 If the kernel supports an extra Module.symvers (i.e., 2.6.17),
330 this link provides a solution:
331 https://bugs.openfabrics.org/show_bug.cgi?id=355
332 If the kernel does not support an extra Module.symvers, we
333 have to run the script in bug 12316 to update
334 $LINUX/module.symvers before building o2iblnd.
335 More details about this are in bug 12316.
337 ------------------------------------------------------------------------------
339 2007-04-01 Cluster File Systems, Inc. <info@clusterfs.com>
340 * version 1.4.10 / 1.6.0
341 * Support for networks:
342 socklnd - kernels up to 2.6.16,
343 qswlnd - Qsnet kernel modules 5.20 and later,
344 openiblnd - IbGold 1.8.2,
346 viblnd - Voltaire ibhost 3.4.5 and later,
347 ciblnd - Topspin 3.2.0,
348 iiblnd - Infiniserv 3.3 + PathBits patch,
349 gmlnd - GM 2.1.22 and later,
350 mxlnd - MX 1.2.1 or later,
351 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
355 Description: Ptllnd didn't init kptllnd_data.kptl_idle_txs before it could be
356 possibly accessed in kptllnd_shutdown. Ptllnd should init
357 kptllnd_data.kptl_ptlid2str_lock before calling kptllnd_ptlid2str.
361 Description: gmlnd ignored some transmit errors when finalizing lnet messages.
365 Description: ptllnd logs a piece of incorrect debug info in kptllnd_peer_handle_hello.
369 Description: the_lnet.ln_finalizing was not set when the current thread is
370 about to complete messages. It only affects multi-threaded
376 Description: Changed the default kqswlnd ntxmsg=512
381 Description: Assertion failure in kernel ptllnd caused by posting passive
382 bulk buffers before connection establishment complete.
387 Description: A race in kernel ptllnd between deleting a peer and posting
388 new communications for it could hang communications -
389 manifesting as "Unexpectedly long timeout" messages.
394 Description: Kernel ptllnd lock ordering issue could hang a node.
399 Description: node crash on socket teardown race
402 Frequency : 'lctl peer_list' issued on a mx net
404 Description: Enable lctl's peer_list for MXLND
407 Frequency : after Ptllnd timeouts and portals congestion
409 Description: Credit overflows
410 Details : This was a bug in ptllnd connection establishment. The fix
411 implements better peer stamps to disambiguate connection
412 establishment and ensure both peers enter the credit flow
413 state machine consistently.
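Credit-based flow control in general can be sketched as below: a sender spends one credit per message and regains credits when the peer returns them, and an overflow means more credits came back than were ever outstanding. The class name and the limit are hypothetical; this is a generic illustration, not the ptllnd state machine or its peer-stamp logic.

```python
class CreditCounter:
    """Track send credits for one peer."""

    def __init__(self, max_credits: int):
        self.max_credits = max_credits
        self.credits = max_credits

    def consume(self) -> bool:
        """Spend one credit for a send; return False if none are left."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def give_back(self, n: int) -> None:
        """Accept credits returned by the peer, rejecting an overflow."""
        if self.credits + n > self.max_credits:
            raise OverflowError("peer returned more credits than were outstanding")
        self.credits += n


peer = CreditCounter(max_credits=2)
print(peer.consume(), peer.consume(), peer.consume())
```

If the two sides disagree about which connection instance a returned credit belongs to, the count can exceed the limit, which is the kind of inconsistency an overflow check exposes.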
418 Description: kptllnd didn't propagate some network errors up to LNET
419 Details : This bug was spotted while investigating 11394. The fix
420 ensures network errors on sends and bulk transfers are
421 propagated to LNET/lustre correctly.
423 Severity : enhancement
425 Description: Fixed console chatter in case of -ETIMEDOUT.
427 Severity : enhancement
429 Description: Added D_NETTRACE for recording network packet history
430 (initially only for ptllnd). Also a separate userspace
431 ptllnd facility to gather history which should really be
432 covered by D_NETTRACE too, if only CDEBUG recorded history in
438 Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED.
439 Details : If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED
440 callback can occur before a connection has actually been
441 established. This caused an assertion failure previously.
443 Severity : enhancement
445 Description: Multiple instances for o2iblnd
446 Details : Allow multiple instances of o2iblnd to enable networking over
447 multiple HCAs and routing between them.
451 Description: lnet deadlock in router_checker
452 Details : turned ksnd_connd_lock, ksnd_reaper_lock, and ksock_net_t:ksnd_lock
453 into BH locks to eliminate potential deadlock caused by
454 ksocknal_data_ready() preempting code holding these locks.
458 Description: Millions of failed socklnd connection attempts cause a very slow FS
459 Details : added a new route flag ksnr_scheduled to distinguish from
460 ksnr_connecting, so that a peer connection request is only turned
461 down for race concerns when an active connection to the same peer
462 is in progress (instead of just being scheduled).
464 ------------------------------------------------------------------------------
466 2007-02-09 Cluster File Systems, Inc. <info@clusterfs.com>
468 * Support for networks:
469 socklnd - kernels up to 2.6.16
470 qswlnd - Qsnet kernel modules 5.20 and later
471 openiblnd - IbGold 1.8.2
473 viblnd - Voltaire ibhost 3.4.5 and later
474 ciblnd - Topspin 3.2.0
475 iiblnd - Infiniserv 3.3 + PathBits patch
476 gmlnd - GM 2.1.22 and later
477 mxlnd - MX 1.2.1 or later
478 ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
481 Severity : major on XT3
483 Description: libcfs overwrites /proc/sys/portals
484 Details : libcfs created a symlink from /proc/sys/portals to
485 /proc/sys/lnet for backwards compatibility. This is no
486 longer required and makes the Cray portals /proc variables
491 Description: OFED FMR API change
492 Details : This changes parameter usage to reflect a change in
493 ib_fmr_pool_map_phys() between OFED 1.0 and OFED 1.1. Note
494 that FMR support is only used in experimental versions of the
495 o2iblnd - this change does not affect standard usage at all.
497 Severity : enhancement
499 Description: new ko2iblnd module parameter: ib_mtu
500 Details : the default IB MTU of 2048 performs badly on 23108 Tavor
501 HCAs. You can avoid this problem by setting the MTU to 1024
502 using this module parameter.
504 Severity : enhancement
505 Bugzilla : 11118/11620
506 Description: ptllnd small request message buffer alignment fix
507 Details : Set the PTL_MD_LOCAL_ALIGN8 option on small message receives.
508 Round up small message size on sends in case this option
509 is not supported. 11620 was a defect in the initial
510 implementation which effectively asserted all peers had to be
511 running the correct protocol version which was fixed by always
512 NAK-ing such requests and handling any misalignments they
517 Description: When kib(nal|lnd)_del_peer() is called upon a peer whose
518 ibp_tx_queue is not empty, kib(nal|lnd)_destroy_peer()'s
519 'LASSERT(list_empty(&peer->ibp_tx_queue))' will fail.
521 Severity : enhancement
523 Description: Patchless ZC (zero copy) socklnd
524 Details : New protocol for socklnd. socklnd can now support zero copy without
525 a kernel patch and remains compatible with the old socklnd. Checksum is
526 moved from tunables to modparams.
530 Description: When ksocknal_del_peer() is called upon a peer whose
531 ksnp_tx_queue is not empty, ksocknal_destroy_peer()'s
532 'LASSERT(list_empty(&peer->ksnp_tx_queue))' will fail.
535 Frequency : when ptlrpc is under heavy use and runs out of request buffer
537 Description: In lnet_match_blocked_msg(), md can be used without holding a
541 Frequency : very rarely
543 Description: If ksocknal_lib_setup_sock() fails, a ref on peer is lost.
544 If connd connects a route which has been closed by
545 ksocknal_shutdown(), ksocknal_create_routes() may create new
546 routes which hold references on the peer, causing shutdown
547 process to wait for peer to disappear forever.
549 Severity : enhancement
551 Description: Dump XT3 portals traces on kptllnd timeout
552 Details : Set the kptllnd module parameter "ptltrace_on_timeout=1" to
553 dump Cray portals debug traces to a file. The kptllnd module
554 parameter "ptltrace_basename", default "/tmp/lnet-ptltrace",
555 is the basename of the dump file.
558 Frequency : infrequent
560 Description: kernel ptllnd fix bug in connection re-establishment
561 Details : Kernel ptllnd could produce protocol errors e.g. illegal
562 matchbits and/or violate the credit flow protocol when trying
563 to re-establish a connection with a peer after an error or
566 Severity : enhancement
568 Description: Allow /proc/sys/lnet/debug to be set symbolically
569 Details : Allow debug and subsystem debug values to be read/set by name
570 in addition to numerically, for ease of use.
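The idea of reading and setting a debug mask by name can be sketched as a translation between flag names and bits. The table below is hypothetical: the names and bit values are made up for illustration and are not the real libcfs definitions.

```python
# Hypothetical name-to-bit table; the real flag names and values live in libcfs.
DEBUG_BITS = {"trace": 0x1, "inode": 0x2, "neterror": 0x4, "warning": 0x8}


def names_to_mask(text: str) -> int:
    """Convert a space-separated list of flag names into a numeric mask."""
    mask = 0
    for name in text.split():
        if name not in DEBUG_BITS:
            raise ValueError(f"unknown debug flag: {name}")
        mask |= DEBUG_BITS[name]
    return mask


def mask_to_names(mask: int) -> str:
    """Convert a numeric mask back into space-separated flag names."""
    return " ".join(name for name, bit in DEBUG_BITS.items() if mask & bit)


print(hex(names_to_mask("trace neterror")), mask_to_names(0x5))
```

Accepting both forms keeps existing numeric settings working while letting an administrator write names instead of looking up bit values.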
573 Frequency : only in configurations with LNET routers
575 Description: routes automatically marked down and recovered
576 Details : In configurations with LNET routers, if a router fails, routers
577 now actively try to recover routes that are down, unless they
578 are marked down by an administrator.
580 ------------------------------------------------------------------------------
582 2006-12-09 Cluster File Systems, Inc. <info@clusterfs.com>
585 Frequency : very rarely, in configurations with LNET routers and TCP
587 Description: incorrect data written to files on OSTs
588 Details : In certain high-load conditions incorrect data may be written
589 to files on the OST when using TCP networks.
591 ------------------------------------------------------------------------------
593 2006-07-31 Cluster File Systems, Inc. <info@clusterfs.com>
595 - rework CDEBUG messages rate-limiting mechanism b=10375
596 - add per-socket tunables for socklnd if the kernel is patched b=10327
598 ------------------------------------------------------------------------------
600 2006-02-15 Cluster File Systems, Inc. <info@clusterfs.com>
602 - fix use of portals/lnet pid to avoid dropping RPCs b=10074
603 - iiblnd wasn't mapping all memory, resulting in comms errors b=9776
604 - quiet LNET startup LNI message for liblustre b=10128
605 - Better console error messages if 'ip2nets' can't match an IP address
606 - Fixed overflow/use-before-set bugs in linux-time.h
607 - Fixed ptllnd bug that wasn't initialising rx descriptors completely
608 - LNET teardown failed an assertion about the route table being empty
609 - Fixed a crash in LNetEQPoll(<invalid handle>)
610 - Future protocol compatibility work (b_rls146_lnetprotovrsn)
611 - improve debug message for liblustre/Catamount nodes (b=10116)
613 2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
614 * Configuration change for the XT3
615 The PTLLND is now used to run Lustre over Portals on the XT3.
616 The configure option(s) --with-cray-portals are no longer
617 used. Rather --with-portals=<path-to-portals-includes> is
618 used to enable building on the XT3. In addition, to enable
619 XT3-specific features the option --enable-cray-xt3 must be
622 2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
623 * Portals has been removed, replaced by LNET.
624 LNET is the new networking infrastructure for Lustre. It includes a
625 reorganized network configuration mode (see the user
626 documentation for full details) as well as support for routing
627 between different network fabrics. Lustre Networking Devices
628 (LNDs) for the supported network fabrics have also been created
629 for this new infrastructure.
631 2005-08-08 Cluster File Systems, Inc. <info@clusterfs.com>
636 Frequency : rare (large Voltaire clusters only)
638 Description: the default number of reserved transmit descriptors was too low
639 for some large clusters
640 Details : As a workaround, the number was increased. A proper fix includes
643 2005-06-02 Cluster File Systems, Inc. <info@clusterfs.com>
648 Frequency : occasional (large-scale events, cluster reboot, network failure)
650 Description: too many error messages on console obscure actual problem and
651 can slow down/panic server, or cause recovery to fail repeatedly
652 Details : enable rate-limiting of console error messages, and some messages
653 that were console errors now only go to the kernel log
655 Severity : enhancement
657 Description: add /proc/sys/portals/catastrophe entry which will report if
658 that node has previously LBUGged
660 2005-04-06 Cluster File Systems, Inc. <info@clusterfs.com>
662 - update gmnal to use PTL_MTU, fix module refcounting (b=5786)
664 2005-04-04 Cluster File Systems, Inc. <info@clusterfs.com>
666 - handle error return code in kranal_check_fma_rx() (5915,6054)
668 2005-02-04 Cluster File Systems, Inc. <info@clusterfs.com>
670 - update vibnal (Voltaire IB NAL)
671 - update gmnal (Myrinet NAL), gmnalid
673 2005-02-04 Eric Barton <eeb@bartonsoftware.com>
675 * Landed portals:b_port_step as follows...
677 - removed CFS_DECL_SPIN*
678 just use 'spinlock_t' and initialise with spin_lock_init()
680 - removed CFS_DECL_MUTEX*
681 just use 'struct semaphore' and initialise with init_mutex()
683 - removed CFS_DECL_RWSEM*
684 just use 'struct rw_semaphore' and initialise with init_rwsem()
686 - renamed cfs_sleep_chan -> cfs_waitq
687 cfs_sleep_link -> cfs_waitlink
689 - fixed race in linux version of arch-independent socknal
690 (the ENOMEM/EAGAIN decision).
692 - Didn't fix problems in Darwin version of arch-independent socknal
693 (resetting socket callbacks, eager ack hack, ENOMEM/EAGAIN decision)
695 - removed libcfs types from non-socknal header files (only some types
696 in the header files had been changed; the .c files hadn't been