tbd Cluster File Systems, Inc. <info@clusterfs.com>
* Support for networks:
socklnd - any kernel supported by Lustre,
qswlnd - Qsnet kernel modules 5.20 and later,
openiblnd - IbGold 1.8.2,
o2iblnd - OFED 1.1 and 1.2,
viblnd - Voltaire ibhost 3.4.5 and later,
ciblnd - Topspin 3.2.0,
iiblnd - Infiniserv 3.3 + PathBits patch,
gmlnd - GM 2.1.22 and later,
mxlnd - MX 1.2.1 or later,
ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
--------------------------------------------------------------------------------
2007-09-27 Cluster File Systems, Inc. <info@clusterfs.com>
* Support for networks:
socklnd - any kernel supported by Lustre,
qswlnd - Qsnet kernel modules 5.20 and later,
openiblnd - IbGold 1.8.2,
o2iblnd - OFED 1.1 and 1.2,
viblnd - Voltaire ibhost 3.4.5 and later,
ciblnd - Topspin 3.2.0,
iiblnd - Infiniserv 3.3 + PathBits patch,
gmlnd - GM 2.1.22 and later,
mxlnd - MX 1.2.1 or later,
ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
Description: /proc/sys/lnet has non-sysctl entries
Details : Updating dump_kernel/daemon_file/debug_mb to use sysctl variables
Description: TOE Kernel panic by ksocklnd
Details : offloaded sockets provide their own implementation of sendpage,
can't call tcp_sendpage() directly
Description: kibnal_shutdown() doesn't finish; lconf --cleanup hangs
Details : races between lnd_shutdown and peer creation prevent
lnd_shutdown from finishing.
Description: open files rlimit 1024 reached while liblustre testing
Details : ulnds/socklnd must close open socket after unsuccessful
Description: build error
Details : fix typos in gmlnd, ptllnd and viblnd
------------------------------------------------------------------------------
2007-07-30 Cluster File Systems, Inc. <info@clusterfs.com>
* Support for networks:
socklnd - kernels up to 2.6.16,
qswlnd - Qsnet kernel modules 5.20 and later,
openiblnd - IbGold 1.8.2,
o2iblnd - OFED 1.1 and 1.2,
viblnd - Voltaire ibhost 3.4.5 and later,
ciblnd - Topspin 3.2.0,
iiblnd - Infiniserv 3.3 + PathBits patch,
gmlnd - GM 2.1.22 and later,
mxlnd - MX 1.2.1 or later,
ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
2007-06-21 Cluster File Systems, Inc. <info@clusterfs.com>
* Support for networks:
socklnd - kernels up to 2.6.16,
qswlnd - Qsnet kernel modules 5.20 and later,
openiblnd - IbGold 1.8.2,
viblnd - Voltaire ibhost 3.4.5 and later,
ciblnd - Topspin 3.2.0,
iiblnd - Infiniserv 3.3 + PathBits patch,
gmlnd - GM 2.1.22 and later,
mxlnd - MX 1.2.1 or later,
ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
Description: Initialize cpumask before use
Description: ASSERTION failures when upgrading to the patchless zero-copy
Details : This bug affects "rolling upgrades", causing an inconsistent
protocol version negotiation and subsequent assertion failure
during rolling upgrades after the first wave of upgrades.
Details : Change "dropped message" CERRORs to D_NETERROR so they are
logged instead of creating "console chatter" when a lustre
timeout races with normal RPC completion.
Details : lnet_clear_peer_table can wait forever if user forgets to
Details : libcfs_id2str should check pid against LNET_PID_ANY.
Description: added LNET self test
Details : landing b_self_test
Description: cfs_duration_{u,n}sec() wrongly calculate nanosecond part of
Details : do_div() macro is used incorrectly.
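For context, the kernel's do_div(n, base) macro divides the 64-bit value n in place and returns the remainder, so code that treats its return value as the quotient gets the wrong answer. The entry above does not say which misuse occurred; the following is only a sketch of the macro's contract, modelled in Python. The helper name split_duration is hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def do_div(n, base):
    """Model of the kernel macro: returns (quotient, remainder).

    In C the quotient is stored back into n and the remainder is the
    macro's return value. Python integers are immutable, so this model
    returns both values instead.
    """
    return n // base, n % base


def split_duration(total_nsec):
    """Hypothetical helper: split nanoseconds into (seconds, nanoseconds)."""
    seconds, nsec = do_div(total_nsec, NSEC_PER_SEC)
    return seconds, nsec
```

The remainder is the sub-second part and the quotient is the whole-second part; swapping the two is the mistake this model is meant to make visible.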
2007-04-23 Cluster File Systems, Inc. <info@clusterfs.com>
Description: make panic on lbug configurable
Description: Add OFED1.2 support to o2iblnd
Details : o2iblnd depends on OFED's modules. If out-of-tree OFED modules
are installed (instead of the kernel's in-tree infiniband), there
could be problems when loading o2iblnd with insmod (mismatch CRC of
If the kernel supports an extra Module.symvers (i.e., 2.6.17),
this link provides a solution:
https://bugs.openfabrics.org/show_bug.cgi?id=355
If the kernel does not support an extra Module.symvers, run the
script in bug 12316 to update $LINUX/module.symvers before
building o2iblnd.
More details about this are in bug 12316.
------------------------------------------------------------------------------
2007-04-01 Cluster File Systems, Inc. <info@clusterfs.com>
* version 1.4.10 / 1.6.0
* Support for networks:
socklnd - kernels up to 2.6.16,
qswlnd - Qsnet kernel modules 5.20 and later,
openiblnd - IbGold 1.8.2,
viblnd - Voltaire ibhost 3.4.5 and later,
ciblnd - Topspin 3.2.0,
iiblnd - Infiniserv 3.3 + PathBits patch,
gmlnd - GM 2.1.22 and later,
mxlnd - MX 1.2.1 or later,
ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
Description: Ptllnd didn't init kptllnd_data.kptl_idle_txs before it could be
possibly accessed in kptllnd_shutdown. Ptllnd should init
kptllnd_data.kptl_ptlid2str_lock before calling kptllnd_ptlid2str.
Description: gmlnd ignored some transmit errors when finalizing lnet messages.
Description: ptllnd logs a piece of incorrect debug info in kptllnd_peer_handle_hello.
Description: the_lnet.ln_finalizing was not set when the current thread is
about to complete messages. It only affects multi-threaded
Description: Changed the default kqswlnd ntxmsg=512
Description: Assertion failure in kernel ptllnd caused by posting passive
bulk buffers before connection establishment complete.
Description: A race in kernel ptllnd between deleting a peer and posting
new communications for it could hang communications -
manifesting as "Unexpectedly long timeout" messages.
Description: Kernel ptllnd lock ordering issue could hang a node.
Description: node crash on socket teardown race
Frequency : 'lctl peer_list' issued on a mx net
Description: Enable lctl's peer_list for MXLND
Frequency : after Ptllnd timeouts and portals congestion
Description: Credit overflows
Details : This was a bug in ptllnd connection establishment. The fix
implements better peer stamps to disambiguate connection
establishment and ensure both peers enter the credit flow
state machine consistently.
Description: kptllnd didn't propagate some network errors up to LNET
Details : This bug was spotted while investigating 11394. The fix
ensures network errors on sends and bulk transfers are
propagated to LNET/lustre correctly.
Severity : enhancement
Description: Fixed console chatter in case of -ETIMEDOUT.
Severity : enhancement
Description: Added D_NETTRACE for recording network packet history
(initially only for ptllnd). Also a separate userspace
ptllnd facility to gather history which should really be
covered by D_NETTRACE too, if only CDEBUG recorded history in
Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED.
Details : If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED
callback can occur before a connection has actually been
established. This caused an assertion failure previously.
Severity : enhancement
Description: Multiple instances for o2iblnd
Details : Allow multiple instances of o2iblnd to enable networking over
multiple HCAs and routing between them.
Description: lnet deadlock in router_checker
Details : turned ksnd_connd_lock, ksnd_reaper_lock, and ksock_net_t:ksnd_lock
into BH locks to eliminate potential deadlock caused by
ksocknal_data_ready() preempting code holding these locks.
Description: Millions of failed socklnd connection attempts cause a very slow FS
Details : added a new route flag ksnr_scheduled to distinguish from
ksnr_connecting, so that a peer connection request is only turned
down for race concerns when an active connection to the same peer
is in progress (instead of just being scheduled).
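The distinction between the two flags can be sketched as follows. This is a simplified Python model, not the socklnd code: the Route class and the should_refuse_incoming function are hypothetical, the flags are reduced to booleans, and only the two flag names come from the entry above.

```python
from dataclasses import dataclass


@dataclass
class Route:
    # Queued for a connection daemon, but no attempt has started yet.
    ksnr_scheduled: bool = False
    # An active connection attempt to the peer is in progress.
    ksnr_connecting: bool = False


def should_refuse_incoming(route):
    """Refuse a peer's connection request only on a real connection race.

    A route that is merely scheduled has not started connecting, so an
    incoming request from the same peer is accepted.
    """
    return route.ksnr_connecting
```

Under this model a scheduled-only route no longer causes incoming requests to be turned down, which is the behaviour the fix describes.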
------------------------------------------------------------------------------
2007-02-09 Cluster File Systems, Inc. <info@clusterfs.com>
* Support for networks:
socklnd - kernels up to 2.6.16
qswlnd - Qsnet kernel modules 5.20 and later
openiblnd - IbGold 1.8.2
viblnd - Voltaire ibhost 3.4.5 and later
ciblnd - Topspin 3.2.0
iiblnd - Infiniserv 3.3 + PathBits patch
gmlnd - GM 2.1.22 and later
mxlnd - MX 1.2.1 or later
ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x
Severity : major on XT3
Description: libcfs overwrites /proc/sys/portals
Details : libcfs created a symlink from /proc/sys/portals to
/proc/sys/lnet for backwards compatibility. This is no
longer required and makes the Cray portals /proc variables
Description: OFED FMR API change
Details : This changes parameter usage to reflect a change in
ib_fmr_pool_map_phys() between OFED 1.0 and OFED 1.1. Note
that FMR support is only used in experimental versions of the
o2iblnd - this change does not affect standard usage at all.
Severity : enhancement
Description: new ko2iblnd module parameter: ib_mtu
Details : the default IB MTU of 2048 performs badly on 23108 Tavor
HCAs. You can avoid this problem by setting the MTU to 1024
using this module parameter.
Severity : enhancement
Bugzilla : 11118/11620
Description: ptllnd small request message buffer alignment fix
Details : Set the PTL_MD_LOCAL_ALIGN8 option on small message receives.
Round up small message size on sends in case this option
is not supported. 11620 was a defect in the initial
implementation which effectively asserted all peers had to be
running the correct protocol version which was fixed by always
NAK-ing such requests and handling any misalignments they
Description: When kib(nal|lnd)_del_peer() is called upon a peer whose
ibp_tx_queue is not empty, kib(nal|lnd)_destroy_peer()'s
'LASSERT(list_empty(&peer->ibp_tx_queue))' will fail.
Severity : enhancement
Description: Patchless ZC (zero copy) socklnd
Details : New protocol for socklnd: socklnd can support zero copy without
a kernel patch and remains compatible with old socklnd. Checksum
is moved from tunables to modparams.
Description: When ksocknal_del_peer() is called upon a peer whose
ksnp_tx_queue is not empty, ksocknal_destroy_peer()'s
'LASSERT(list_empty(&peer->ksnp_tx_queue))' will fail.
Frequency : when ptlrpc is under heavy use and runs out of request buffer
Description: In lnet_match_blocked_msg(), md can be used without holding a
Frequency : very rarely
Description: If ksocknal_lib_setup_sock() fails, a ref on peer is lost.
If connd connects a route which has been closed by
ksocknal_shutdown(), ksocknal_create_routes() may create new
routes which hold references on the peer, causing shutdown
process to wait for peer to disappear forever.
Severity : enhancement
Description: Dump XT3 portals traces on kptllnd timeout
Details : Set the kptllnd module parameter "ptltrace_on_timeout=1" to
dump Cray portals debug traces to a file. The kptllnd module
parameter "ptltrace_basename", default "/tmp/lnet-ptltrace",
is the basename of the dump file.
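As an illustration, module parameters like these are commonly set with an "options" line in the modprobe configuration. The sketch below is a hypothetical Python helper (modprobe_options is not part of Lustre) that formats such a line from the two parameter names and the default basename given above.

```python
def modprobe_options(module, **params):
    """Format an 'options' line for a modprobe configuration file."""
    settings = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options {module} {settings}"


# Parameter names and the default basename are taken from the entry above.
line = modprobe_options(
    "kptllnd",
    ptltrace_on_timeout=1,
    ptltrace_basename="/tmp/lnet-ptltrace",
)
print(line)
```

Where the resulting line belongs (for example, which configuration file) depends on the distribution and is not covered here.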
Frequency : infrequent
Description: kernel ptllnd fix bug in connection re-establishment
Details : Kernel ptllnd could produce protocol errors e.g. illegal
matchbits and/or violate the credit flow protocol when trying
to re-establish a connection with a peer after an error or
Severity : enhancement
Description: Allow /proc/sys/lnet/debug to be set symbolically
Details : Allow debug and subsystem debug values to be read/set by name
in addition to numerically, for ease of use.
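A minimal sketch of name-based mask handling follows. It is a Python model, not the libcfs implementation: the flag names and bit values in DEBUG_FLAGS are placeholders chosen for illustration, and the space-separated input syntax is an assumption.

```python
# Placeholder flag table; the real names and bit positions may differ.
DEBUG_FLAGS = {"trace": 0x1, "inode": 0x2, "neterror": 0x100, "net": 0x200}


def parse_debug_mask(text):
    """Accept either a number ("0x300", "768") or space-separated flag names."""
    text = text.strip()
    try:
        return int(text, 0)
    except ValueError:
        pass  # not numeric, fall through to symbolic parsing
    mask = 0
    for name in text.split():
        if name not in DEBUG_FLAGS:
            raise ValueError(f"unknown debug flag: {name}")
        mask |= DEBUG_FLAGS[name]
    return mask


def format_debug_mask(mask):
    """Render a mask as the names of its set bits, lowest bit first."""
    ordered = sorted(DEBUG_FLAGS.items(), key=lambda item: item[1])
    return " ".join(name for name, bit in ordered if mask & bit)
```

Accepting both forms on input keeps numeric settings working while making the symbolic form available for reading and writing.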
Frequency : only in configurations with LNET routers
Description: routes automatically marked down and recovered
Details : In configurations with LNET routers, if a router fails, routers
now actively try to recover routes that are down, unless they
are marked down by an administrator.
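The behaviour described above can be modelled roughly as follows. This is a hypothetical Python sketch, not the LNET router checker: the Route fields, the check_routes function, and the ping callback are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Route:
    gateway: str
    alive: bool = True
    admin_down: bool = False  # set when an administrator marks the route down


def check_routes(routes, ping):
    """Probe each route's gateway and update its state.

    Routes marked down by an administrator are left alone. Every other
    route is marked down when its gateway stops answering and recovered
    when it answers again.
    """
    for route in routes:
        if route.admin_down:
            continue
        route.alive = ping(route.gateway)
    return routes
```

Calling check_routes periodically with a real probe in place of ping would give the automatic down/recover cycle, while administratively disabled routes stay down until the flag is cleared.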
------------------------------------------------------------------------------
2006-12-09 Cluster File Systems, Inc. <info@clusterfs.com>
Frequency : very rarely, in configurations with LNET routers and TCP
Description: incorrect data written to files on OSTs
Details : In certain high-load conditions incorrect data may be written
to files on the OST when using TCP networks.
------------------------------------------------------------------------------
2006-07-31 Cluster File Systems, Inc. <info@clusterfs.com>
- rework CDEBUG messages rate-limiting mechanism b=10375
- add per-socket tunables for socklnd if the kernel is patched b=10327
------------------------------------------------------------------------------
2006-02-15 Cluster File Systems, Inc. <info@clusterfs.com>
- fix use of portals/lnet pid to avoid dropping RPCs b=10074
- iiblnd wasn't mapping all memory, resulting in comms errors b=9776
- quiet LNET startup LNI message for liblustre b=10128
- Better console error messages if 'ip2nets' can't match an IP address
- Fixed overflow/use-before-set bugs in linux-time.h
- Fixed ptllnd bug that wasn't initialising rx descriptors completely
- LNET teardown failed an assertion about the route table being empty
- Fixed a crash in LNetEQPoll(<invalid handle>)
- Future protocol compatibility work (b_rls146_lnetprotovrsn)
- improve debug message for liblustre/Catamount nodes (b=10116)
2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
* Configuration change for the XT3
The PTLLND is now used to run Lustre over Portals on the XT3.
The configure option(s) --with-cray-portals are no longer
used. Rather --with-portals=<path-to-portals-includes> is
used to enable building on the XT3. In addition, to enable
XT3-specific features, the option --enable-cray-xt3 must be
2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
* Portals has been removed, replaced by LNET.
LNET is a new networking infrastructure for Lustre. It includes a
reorganized network configuration mode (see the user
documentation for full details) as well as support for routing
between different network fabrics. Lustre Networking Devices
(LNDs) for the supported network fabrics have also been created
for this new infrastructure.
2005-08-08 Cluster File Systems, Inc. <info@clusterfs.com>
Frequency : rare (large Voltaire clusters only)
Description: the default number of reserved transmit descriptors was too low
for some large clusters
Details : As a workaround, the number was increased. A proper fix includes
2005-06-02 Cluster File Systems, Inc. <info@clusterfs.com>
Frequency : occasional (large-scale events, cluster reboot, network failure)
Description: too many error messages on console obscure actual problem and
can slow down/panic server, or cause recovery to fail repeatedly
Details : enable rate-limiting of console error messages, and some messages
that were console errors now only go to the kernel log
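One common way to rate-limit messages is to allow a single message per interval and count those suppressed in between. The sketch below shows that generic technique in Python. It is not the libcfs mechanism; the RateLimiter class and its one-second default interval are assumptions.

```python
import time


class RateLimiter:
    """Allow one message per interval; count the ones skipped in between."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.next_allowed = None  # time before which messages are suppressed
        self.skipped = 0

    def allow(self):
        """Return (ok, skipped).

        ok says whether to print now; skipped is the number of messages
        suppressed since the last printed one (always 0 when ok is False).
        """
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.skipped += 1
            return False, 0
        skipped, self.skipped = self.skipped, 0
        self.next_allowed = now + self.interval
        return True, skipped
```

A caller would print the message only when ok is true, optionally appending a note such as "skipped N previous similar messages" when the count is non-zero.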
Severity : enhancement
Description: add /proc/sys/portals/catastrophe entry which will report if
that node has previously LBUGged
2005-04-06 Cluster File Systems, Inc. <info@clusterfs.com>
- update gmnal to use PTL_MTU, fix module refcounting (b=5786)
2005-04-04 Cluster File Systems, Inc. <info@clusterfs.com>
- handle error return code in kranal_check_fma_rx() (5915,6054)
2005-02-04 Cluster File Systems, Inc. <info@clusterfs.com>
- update vibnal (Voltaire IB NAL)
- update gmnal (Myrinet NAL), gmnalid
2005-02-04 Eric Barton <eeb@bartonsoftware.com>
* Landed portals:b_port_step as follows...
- removed CFS_DECL_SPIN*
just use 'spinlock_t' and initialise with spin_lock_init()
- removed CFS_DECL_MUTEX*
just use 'struct semaphore' and initialise with init_mutex()
- removed CFS_DECL_RWSEM*
just use 'struct rw_semaphore' and initialise with init_rwsem()
- renamed cfs_sleep_chan -> cfs_waitq
cfs_sleep_link -> cfs_waitlink
- fixed race in linux version of arch-independent socknal
(the ENOMEM/EAGAIN decision).
- Didn't fix problems in Darwin version of arch-independent socknal
(resetting socket callbacks, eager ack hack, ENOMEM/EAGAIN decision)
- removed libcfs types from non-socknal header files (only some types
in the header files had been changed; the .c files hadn't been