2007-04-23 Cluster File Systems, Inc. <info@clusterfs.com>
       * version 1.4.11 / 1.6.1
       * Support for networks:
        socklnd   - kernels up to 2.6.16
        qswlnd    - Qsnet kernel modules 5.20 and later
        openiblnd - IbGold 1.8.2
        viblnd    - Voltaire ibhost 3.4.5 and later
        iiblnd    - Infiniserv 3.3 + PathBits patch
        gmlnd     - GM 2.1.22 and later
        mxlnd     - MX 1.2.1 or later
        ptllnd    - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x

------------------------------------------------------------------------------

2007-04-01 Cluster File Systems, Inc. <info@clusterfs.com>
       * version 1.4.10 / 1.6.0
       * Support for networks:
        socklnd   - kernels up to 2.6.16
        qswlnd    - Qsnet kernel modules 5.20 and later
        openiblnd - IbGold 1.8.2
        viblnd    - Voltaire ibhost 3.4.5 and later
        ciblnd    - Topspin 3.2.0
        iiblnd    - Infiniserv 3.3 + PathBits patch
        gmlnd     - GM 2.1.22 and later
        mxlnd     - MX 1.2.1 or later
        ptllnd    - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x

Description: Added LNetSetAsync() to ensure single-threaded userspace
             clients can be eager LNET receivers even when the application
             is not executing in the filesystem.

Description: node crash on socket teardown race

Frequency  : 'lctl peer_list' issued on an mx net
Description: Enable lctl's peer_list for MXLND

Frequency  : after Ptllnd timeouts and portals congestion
Description: Credit overflows
Details    : This was a bug in ptllnd connection establishment. The fix
             implements better peer stamps to disambiguate connection
             establishment and ensure both peers enter the credit flow
             state machine consistently.

Description: kptllnd didn't propagate some network errors up to LNET
Details    : This bug was spotted while investigating 11394. The fix
             ensures network errors on sends and bulk transfers are
             propagated to LNET/lustre correctly.

Severity   : enhancement
Description: Fixed console chatter in case of -ETIMEDOUT.

Severity   : enhancement
Description: Added D_NETTRACE for recording network packet history
             (initially only for ptllnd). Also a separate userspace
             ptllnd facility to gather history which should really be
             covered by D_NETTRACE too, if only CDEBUG recorded history in

Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED.
Details    : If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED
             callback can occur before a connection has actually been
             established. This caused an assertion failure previously.

Severity   : enhancement
Description: Multiple instances for o2iblnd
Details    : Allow multiple instances of o2iblnd to enable networking over
             multiple HCAs and routing between them.

Description: lnet deadlock in router_checker
Details    : Turned ksnd_connd_lock, ksnd_reaper_lock, and
             ksock_net_t:ksnd_lock into BH locks to eliminate a potential
             deadlock caused by ksocknal_data_ready() preempting code
             holding these locks.

Description: Millions of failed socklnd connection attempts cause a very slow FS
Details    : Added a new route flag, ksnr_scheduled, distinct from
             ksnr_connecting, so that a peer connection request is turned
             down for race concerns only when an active connection to the
             same peer is in progress (rather than merely scheduled).
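
The distinction between the two flags can be sketched as follows. This is a minimal Python illustration of the decision described in the entry above, not the socklnd code: the Route class and should_refuse_incoming() are hypothetical, and only the flag names ksnr_scheduled and ksnr_connecting come from the entry.

```python
from dataclasses import dataclass


@dataclass
class Route:
    # Hypothetical stand-in for a socklnd route; only the two flag
    # names are taken from the changelog entry.
    ksnr_scheduled: bool = False   # queued for a connection attempt
    ksnr_connecting: bool = False  # connection attempt actually in progress


def should_refuse_incoming(route: Route) -> bool:
    """Refuse a peer's connection request only while we are connecting.

    A route that is merely scheduled has not started connecting yet, so
    there is no race to resolve and the incoming request is accepted.
    """
    return route.ksnr_connecting
```

Under this assumption, a route that sits in the queue for a long time no longer causes incoming requests from the same peer to be turned down.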

------------------------------------------------------------------------------

2007-02-09 Cluster File Systems, Inc. <info@clusterfs.com>
       * Support for networks:
        socklnd   - kernels up to 2.6.16
        qswlnd    - Qsnet kernel modules 5.20 and later
        openiblnd - IbGold 1.8.2
        viblnd    - Voltaire ibhost 3.4.5 and later
        ciblnd    - Topspin 3.2.0
        iiblnd    - Infiniserv 3.3 + PathBits patch
        gmlnd     - GM 2.1.22 and later
        mxlnd     - MX 1.2.1 or later
        ptllnd    - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x

Severity   : major on XT3
Description: libcfs overwrites /proc/sys/portals
Details    : libcfs created a symlink from /proc/sys/portals to
             /proc/sys/lnet for backwards compatibility. This is no
             longer required and makes the Cray portals /proc variables

Description: OFED FMR API change
Details    : This changes parameter usage to reflect a change in
             ib_fmr_pool_map_phys() between OFED 1.0 and OFED 1.1. Note
             that FMR support is only used in experimental versions of the
             o2iblnd - this change does not affect standard usage at all.

Severity   : enhancement
Description: new ko2iblnd module parameter: ib_mtu
Details    : The default IB MTU of 2048 performs badly on 23108 Tavor
             HCAs. You can avoid this problem by setting the MTU to 1024
             using this module parameter.

Severity   : enhancement
Bugzilla   : 11118/11620
Description: ptllnd small request message buffer alignment fix
Details    : Set the PTL_MD_LOCAL_ALIGN8 option on small message receives.
             Round up small message size on sends in case this option
             is not supported. 11620 was a defect in the initial
             implementation which effectively asserted all peers had to be
             running the correct protocol version which was fixed by always
             NAK-ing such requests and handling any misalignments they

Description: When kib(nal|lnd)_del_peer() is called upon a peer whose
             ibp_tx_queue is not empty, kib(nal|lnd)_destroy_peer()'s
             'LASSERT(list_empty(&peer->ibp_tx_queue))' will fail.

Severity   : enhancement
Description: Patchless ZC (zero copy) socklnd
Details    : New socklnd protocol: socklnd can now support zero copy
             without a kernel patch, and it remains compatible with the
             old socklnd. Checksum is moved from tunables to modparams.

Description: When ksocknal_del_peer() is called upon a peer whose
             ksnp_tx_queue is not empty, ksocknal_destroy_peer()'s
             'LASSERT(list_empty(&peer->ksnp_tx_queue))' will fail.

Frequency  : when ptlrpc is under heavy use and runs out of request buffer
Description: In lnet_match_blocked_msg(), md can be used without holding a

Frequency  : very rarely
Description: If ksocknal_lib_setup_sock() fails, a ref on peer is lost.
             If connd connects a route which has been closed by
             ksocknal_shutdown(), ksocknal_create_routes() may create new
             routes which hold references on the peer, causing the
             shutdown process to wait forever for the peer to disappear.

Severity   : enhancement
Description: Dump XT3 portals traces on kptllnd timeout
Details    : Set the kptllnd module parameter "ptltrace_on_timeout=1" to
             dump Cray portals debug traces to a file. The kptllnd module
             parameter "ptltrace_basename", default "/tmp/lnet-ptltrace",
             is the basename of the dump file.

Frequency  : infrequent
Description: kernel ptllnd fix bug in connection re-establishment
Details    : Kernel ptllnd could produce protocol errors e.g. illegal
             matchbits and/or violate the credit flow protocol when trying
             to re-establish a connection with a peer after an error or

Severity   : enhancement
Description: Allow /proc/sys/lnet/debug to be set symbolically
Details    : Allow debug and subsystem debug values to be read/set by name
             in addition to numerically, for ease of use.
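
As an illustration of what reading and setting a mask "by name in addition to numerically" can look like, here is a minimal Python sketch. It is not the libcfs implementation: the flag table, its bit values, and both function names are hypothetical and chosen only for the example.

```python
# Hypothetical name-to-bit table; the real flag names and values are
# defined by libcfs and are not reproduced here.
DEBUG_FLAGS = {"trace": 0x1, "net": 0x2, "neterror": 0x4, "warning": 0x8}


def parse_debug_mask(text: str) -> int:
    """Accept either a number ("0x5", "5") or space-separated flag names."""
    text = text.strip()
    try:
        return int(text, 0)  # numeric form, base auto-detected
    except ValueError:
        pass  # not a number, fall through to symbolic parsing
    mask = 0
    for name in text.split():
        if name not in DEBUG_FLAGS:
            raise ValueError(f"unknown debug flag: {name}")
        mask |= DEBUG_FLAGS[name]
    return mask


def format_debug_mask(mask: int) -> str:
    """Render a mask as the names of the bits that are set."""
    return " ".join(name for name, bit in DEBUG_FLAGS.items() if mask & bit)
```

With this table, parse_debug_mask("trace neterror") and parse_debug_mask("0x5") yield the same mask, which is the convenience the entry describes.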

Frequency  : only in configurations with LNET routers
Description: routes automatically marked down and recovered
Details    : In configurations with LNET routers, if a router fails,
             routers now actively try to recover routes that are down,
             unless they are marked down by an administrator.
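
The behaviour described above can be sketched as one pass of a checker loop. This is a hedged Python illustration only: the Route class, the admin_down field, check_routes(), and the ping callback are all hypothetical names, and the actual LNET router checker is not shown in this changelog.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Route:
    gateway: str
    alive: bool = True
    admin_down: bool = False  # set by an administrator; never auto-recovered


def check_routes(routes: Iterable[Route], ping: Callable[[str], bool]) -> None:
    """One pass of a hypothetical router checker."""
    for route in routes:
        if route.admin_down:
            continue  # leave administratively disabled routes alone
        # Mark the route down if its gateway stops answering, and bring
        # it back up once the gateway answers again.
        route.alive = ping(route.gateway)
```

The design point is the early continue: automatic recovery applies only to routes that went down on their own, never to ones an administrator disabled.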

------------------------------------------------------------------------------

2006-12-09 Cluster File Systems, Inc. <info@clusterfs.com>

Frequency  : very rarely, in configurations with LNET routers and TCP
Description: incorrect data written to files on OSTs
Details    : In certain high-load conditions incorrect data may be written
             to files on the OST when using TCP networks.

------------------------------------------------------------------------------

2006-07-31 Cluster File Systems, Inc. <info@clusterfs.com>
       - rework CDEBUG messages rate-limiting mechanism b=10375
       - add per-socket tunables for socklnd if the kernel is patched b=10327

------------------------------------------------------------------------------

2006-02-15 Cluster File Systems, Inc. <info@clusterfs.com>
       - fix use of portals/lnet pid to avoid dropping RPCs b=10074
       - iiblnd wasn't mapping all memory, resulting in comms errors b=9776
       - quiet LNET startup LNI message for liblustre b=10128
       - Better console error messages if 'ip2nets' can't match an IP address
       - Fixed overflow/use-before-set bugs in linux-time.h
       - Fixed ptllnd bug that wasn't initialising rx descriptors completely
       - LNET teardown failed an assertion about the route table being empty
       - Fixed a crash in LNetEQPoll(<invalid handle>)
       - Future protocol compatibility work (b_rls146_lnetprotovrsn)
       - improve debug message for liblustre/Catamount nodes (b=10116)

2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
       * Configuration change for the XT3
         The PTLLND is now used to run Lustre over Portals on the XT3.
         The configure option(s) --with-cray-portals are no longer
         used. Rather --with-portals=<path-to-portals-includes> is
         used to enable building on the XT3. In addition, to enable
         XT3-specific features, the option --enable-cray-xt3 must be

2005-10-10 Cluster File Systems, Inc. <info@clusterfs.com>
       * Portals has been removed, replaced by LNET.
         LNET is the new networking infrastructure for Lustre; it includes
         a reorganized network configuration mode (see the user
         documentation for full details) as well as support for routing
         between different network fabrics. Lustre Networking Devices
         (LNDs) for the supported network fabrics have also been created
         for this new infrastructure.

2005-08-08 Cluster File Systems, Inc. <info@clusterfs.com>

Frequency  : rare (large Voltaire clusters only)
Description: the default number of reserved transmit descriptors was too low
             for some large clusters
Details    : As a workaround, the number was increased. A proper fix includes

2005-06-02 Cluster File Systems, Inc. <info@clusterfs.com>

Frequency  : occasional (large-scale events, cluster reboot, network failure)
Description: too many error messages on console obscure actual problem and
             can slow down/panic server, or cause recovery to fail repeatedly
Details    : enable rate-limiting of console error messages, and some messages
             that were console errors now only go to the kernel log
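
Rate-limiting of this kind can be sketched as follows. The entry does not describe the mechanism actually used, so this Python class is an assumption for illustration only: ConsoleRateLimiter, its fixed minimum interval, and the skipped counter are hypothetical.

```python
import time


class ConsoleRateLimiter:
    """Let one message through per interval and count the ones held back."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the limiter can be tested
        self.next_allowed = None    # no message printed yet
        self.skipped = 0

    def allow(self) -> bool:
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.skipped += 1       # suppressed on the console
            return False
        self.next_allowed = now + self.min_interval
        return True
```

A caller would print to the console only when allow() returns True, and could report the skipped count alongside the next message that does get through.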

Severity   : enhancement
Description: add /proc/sys/portals/catastrophe entry which will report if
             that node has previously LBUGged

2005-04-06 Cluster File Systems, Inc. <info@clusterfs.com>
       - update gmnal to use PTL_MTU, fix module refcounting (b=5786)

2005-04-04 Cluster File Systems, Inc. <info@clusterfs.com>
       - handle error return code in kranal_check_fma_rx() (5915,6054)

2005-02-04 Cluster File Systems, Inc. <info@clusterfs.com>
       - update vibnal (Voltaire IB NAL)
       - update gmnal (Myrinet NAL), gmnalid

2005-02-04 Eric Barton <eeb@bartonsoftware.com>

       * Landed portals:b_port_step as follows...

       - removed CFS_DECL_SPIN*
         just use 'spinlock_t' and initialise with spin_lock_init()

       - removed CFS_DECL_MUTEX*
         just use 'struct semaphore' and initialise with init_mutex()

       - removed CFS_DECL_RWSEM*
         just use 'struct rw_semaphore' and initialise with init_rwsem()

       - renamed cfs_sleep_chan -> cfs_waitq
                 cfs_sleep_link -> cfs_waitlink

       - fixed race in linux version of arch-independent socknal
         (the ENOMEM/EAGAIN decision).

       - Didn't fix problems in Darwin version of arch-independent socknal
         (resetting socket callbacks, eager ack hack, ENOMEM/EAGAIN decision)

       - removed libcfs types from non-socknal header files (only some types
         in the header files had been changed; the .c files hadn't been