"""
This module implements lustre-specific IO- and network-tests.
It is based on the 'obdfilter-survey'-script distributed with lustre-iokit.

To use it as a library, the caller should first create a set of
EchoClient-objects. The EchoClient-class will automatically create the
echo_client-device, and set it up to communicate with the device
given as the target to the EchoClient-constructor. See main() for
an example of how to set up EchoClient-objects and the objects they
depend on.

Next, run ParallelTestBRW to run benchmarks in parallel over all
the EchoClients with a specific number of threads, transfer size and
request size.

ParallelTestBRW returns a list of ParallelTestBRWResult-objects
(one for each type of test ('w' and 'r') performed).
See the documentation for ParallelTestBRWResult for how to extract
the data from these objects.

Some notes about the implementation:
The core-functionality is implemented as python-classes wrapping lustre-devices
such as obdecho-, osc-, and echo_client-devices. The constructors for these
classes automatically create the lustre-device, and the destructors remove the
devices. High-level devices keep references to low-level devices, ensuring that
the low-level devices are not removed as long as they are in use. The
garbage-collector will clean everything up in the right order. However, there
are two corner-cases that users of the library must be aware of:

1. You cannot create two lustre-devices of the same type with the same name on
   the same node at the same time. Replacing one object with a conflicting object
   like this:

       foo = OBDEcho("nodename", "test_obdecho")
       foo = OBDEcho("nodename", "test_obdecho")

   will fail because the second obdecho-object's constructor will run before the old
   object has been removed. To replace an object with a conflicting new object, the
   first one has to be removed explicitly:

       foo = OBDEcho("nodename", "test_obdecho")
       del foo
       foo = OBDEcho("nodename", "test_obdecho")

2. When python exits it will remove all remaining objects without following
   the dependency-rules between objects. This may cause lustre-devices to not
   be removed properly. Make sure to delete all references to the lustre-device
   objects _before_ exiting python to make sure this doesn't happen.

Copyright (c) 2005 Scali AS. All Rights Reserved.
"""
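The first corner-case can be reproduced without any lustre-devices. The sketch below is only an illustration: FakeDevice and its registry are hypothetical stand-ins for the device-classes in this module, and the behaviour relies on CPython's reference counting, which runs __del__ as soon as the last reference goes away.

```python
class FakeDevice:
    """Hypothetical stand-in for a lustre-device wrapper (no lctl involved)."""
    registry = set()  # names of the devices that currently exist

    def __init__(self, name):
        if name in FakeDevice.registry:
            raise RuntimeError("device %r already exists" % name)
        FakeDevice.registry.add(name)
        self.name = name

    def __del__(self):
        # Only remove the device if the constructor actually created it.
        if getattr(self, "name", None) is not None:
            FakeDevice.registry.discard(self.name)


def replace_without_del():
    foo = FakeDevice("test_obdecho")
    try:
        # The new object is constructed while foo still refers to the old one.
        foo = FakeDevice("test_obdecho")
    except RuntimeError:
        return "conflict"
    return "ok"


def replace_with_del():
    foo = FakeDevice("test_obdecho")
    del foo  # the old device is removed before the new one is created
    foo = FakeDevice("test_obdecho")
    return "ok"
```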
import logging
import os
import popen2
import random
import re
import string
import time

# Classes that implement remote execution using different tools/protocols.
# These should subclass popen2.Popen3, and implement the same interface.
class scashPopen3(popen2.Popen3):
    """
    Implement the same functionality as popen2.Popen3, but for a
    command executed on a remote node using scash.
    """
    def __init__(self, node, cmd, *args, **kwargs):
        """
        As popen2.Popen3, except:
        @node - hostname where to execute cmd
        @cmd - the command to execute. (Needs to be a string!)
        """
        cmd = ["scash", "-n", node, cmd]
        popen2.Popen3.__init__(self, cmd, *args, **kwargs)
class sshPopen3(popen2.Popen3):
    """
    Implement the same functionality as popen2.Popen3, but for a
    command executed on a remote node using ssh.
    """
    def __init__(self, node, cmd, *args, **kwargs):
        """
        As popen2.Popen3, except:
        @node - hostname where to execute cmd
        @cmd - the command to execute. (Needs to be a string!)
        """
        cmd = ["ssh", node, cmd]
        popen2.Popen3.__init__(self, cmd, *args, **kwargs)
# Select remote execution tool/protocol based on what is actually available:
if os.path.isfile("/opt/scali/bin/scash"):
    remotePopen3 = scashPopen3
elif os.path.isfile("/usr/bin/ssh"):
    remotePopen3 = sshPopen3
else:
    raise Exception("No remote-execution environment found!")
def remoteCommand(node, command):
    """
    Run an external command, and return the output as a list of strings
    (one string per line). Raise an exception if the command fails
    (returns non-zero exit-code).
    @node - nodename where to run the command
    @command - the command to run
    """
    # The third argument (True) makes Popen3 capture stderr as well:
    remote = remotePopen3(node, command, True)
    exit_code = remote.wait()
    if exit_code != 0:
        raise Exception("Remote command %s failed with exit-code: %d" %
                        (repr(command), exit_code))
    return remote.fromchild.readlines()
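remoteCommand waits for the child before reading its output, which can block if the command writes more than the pipe can buffer. The sketch below shows one way to avoid that. It is an assumption-laden illustration, not the method used in this module: it is written for a current Python with the subprocess-module instead of popen2, and run_command and ssh_argv are hypothetical names.

```python
import subprocess
import sys


def run_command(argv):
    """Run argv, return stdout as a list of lines, raise if it fails."""
    # communicate() drains stdout and stderr while waiting, so a chatty
    # command cannot fill a pipe and block forever.
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("command %r failed with exit-code %d: %s"
                           % (argv, proc.returncode, err.strip()))
    return out.splitlines(True)


def ssh_argv(node, command):
    """Build the argument list used to run command on node via ssh."""
    return ["ssh", node, command]


# Local demonstration that needs no remote node:
lines = run_command([sys.executable, "-c", "print('a'); print('b')"])
```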
def genUUID():
    """
    Generate a random UUID
    """
    r = random.Random(time.time())
    # randint() includes both endpoints, so the upper bound must be 16**4 - 1
    # to keep every group at four hex digits.
    return "%04x%04x-%04x-%04x-%04x-%04x%04x%04x" % tuple(
        [r.randint(0, 16**4 - 1) for i in range(8)])
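A minimal sketch of the same UUID layout, assuming a current Python. gen_uuid_groups and gen_uuid_stdlib are hypothetical names. The first variant uses randrange, which excludes the upper bound, so no group can overflow to five hex digits. The second uses the standard uuid-module, which produces the same 8-4-4-4-12 layout.

```python
import random
import re
import uuid

UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")


def gen_uuid_groups(rng=None):
    """Build the UUID from eight random 16-bit groups."""
    rng = rng or random.Random()
    # randrange(16**4) yields 0 .. 0xffff, never 0x10000.
    groups = [rng.randrange(16**4) for _ in range(8)]
    return "%04x%04x-%04x-%04x-%04x-%04x%04x%04x" % tuple(groups)


def gen_uuid_stdlib():
    """Let the standard library generate a random (version 4) UUID."""
    return str(uuid.uuid4())
```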
class KernelModule:
    """
    Object to keep track of the usage of a kernel-module, and unload it when
    it's no longer needed. The constructor will check if the module is already
    loaded. If it is, the use_count will be preset to 1 and the module will never
    be automatically unloaded. (Assuming no object will call decUse without first
    having called incUse)
    """
    def __init__(self, node, name):
        """
        KernelModule constructor.
        Does _not_ increase the usage-counter or load the module!
        @node - the node where the kernel-module lives
        @name - the name of the kernel-module
        """
        self.node = node
        self.name = name
        # 1 if the module is already loaded, otherwise 0:
        self.use_count = int(self.__isLoaded())

    def __isLoaded(self):
        """
        Check if the module is currently loaded
        """
        for line in remoteCommand(self.node, "/sbin/lsmod"):
            if line.split()[0] == self.name:
                return True
        return False
    def __load(self):
        """
        Load the module now.
        Don't call this directly - call incUse.
        """
        remoteCommand(self.node, "modprobe %s" % self.name)

    def __unload(self):
        """
        Unload the module now.
        Don't call this directly - call decUse.
        """
        remoteCommand(self.node, "rmmod %s" % self.name)

    def incUse(self):
        """
        Call this method before using the module
        """
        self.use_count += 1
        if self.use_count == 1:
            self.__load()

    def decUse(self):
        """
        Call this method when you're done using the module
        """
        self.use_count -= 1
        if self.use_count == 0:
            self.__unload()
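The use-counting described above can be tried out without modprobe or rmmod. This is a minimal sketch under stated assumptions: UseCounter is a hypothetical class, and the load and unload callables are injected so that the example runs anywhere.

```python
class UseCounter:
    """Load on the first use, unload when the last user is done."""

    def __init__(self, load, unload, already_loaded=False):
        self._load = load
        self._unload = unload
        # Starting at 1 means the counter never drops to 0 through balanced
        # inc_use/dec_use calls, so a pre-loaded module is never unloaded.
        self.use_count = 1 if already_loaded else 0

    def inc_use(self):
        self.use_count += 1
        if self.use_count == 1:
            self._load()

    def dec_use(self):
        self.use_count -= 1
        if self.use_count == 0:
            self._unload()


events = []
fresh = UseCounter(lambda: events.append("load"),
                   lambda: events.append("unload"))
fresh.inc_use()
fresh.inc_use()
fresh.dec_use()
fresh.dec_use()

preloaded_events = []
preloaded = UseCounter(lambda: preloaded_events.append("load"),
                       lambda: preloaded_events.append("unload"),
                       already_loaded=True)
preloaded.inc_use()
preloaded.dec_use()
```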
class KernelModules:
    """
    Class to keep track of multiple KernelModule-objects
    for multiple kernel-modules on multiple nodes.
    """
    def __init__(self):
        # The KernelModule-objects are stored in self.data.
        # The key in self.data is the nodename. The value is a
        # new dictionary with module-names as keys and KernelModule
        # objects as values.
        self.data = {}

    def getKernelModule(self, nodename, modulename):
        """
        Lookup (or create) a KernelModule object
        @nodename - the node where the kernel-module should be loaded
        @modulename - the name of the kernel-module
        """
        # Create the object if it's not already in self.data:
        if nodename not in self.data:
            self.data[nodename] = {}
        if modulename not in self.data[nodename]:
            self.data[nodename][modulename] = KernelModule(nodename, modulename)
        # And then return it:
        return self.data[nodename][modulename]
# This global object is used to keep track of all the loaded kernel-modules:
modules = KernelModules()
def lctl(node, commands):
    """
    Run a set of lctl-commands
    @node - node where to run the commands
    @commands - list of commands
    Returns the output from lctl as a list of strings (one string per line)
    """
    # Join the commands, one per line. They are wrapped in quotes below:
    commands = string.join(commands, '\n')
    log = logging.getLogger("lctl")
    log.debug("lctl: %s" % repr(commands))
    return remoteCommand(node, 'echo -e "%s" | lctl' % commands)
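lctl() places the commands inside double quotes, so a command that itself contains double quotes (such as the setup-command used by OSC) depends on how the shell re-joins the pieces. The sketch below shows one possible alternative and is not what this module does: build_lctl_command is a hypothetical helper, written for a current Python, that lets shlex.quote do the quoting and uses printf instead of echo -e.

```python
import shlex


def build_lctl_command(commands):
    """Return a shell command-line that feeds commands to lctl on stdin."""
    script = "\n".join(commands) + "\n"
    # shlex.quote() wraps the script in single quotes, so embedded double
    # quotes and newlines reach lctl unchanged.
    return "printf %%s %s | lctl" % shlex.quote(script)


example = build_lctl_command(
    ['attach osc test_osc some-uuid', 'setup ost-uuid "NID_node1_UUID"'])
```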
def find_device(node, search_type, search_name):
    """
    Find the device-number for a device
    @node - the node where the device lives
    @search_type - the device-type to search for
    @search_name - the device-name to search for
    Returns the device-number (int)
    Raises ValueError if no matching device is found
    """
    for dev in lctl(node, ['device_list']):
        device_id, device_state, device_type, device_name, uuid, refcnt = dev.split()
        if device_type == search_type and device_name == search_name:
            return int(device_id)
    raise ValueError("device not found: %s:%s" % (search_type, search_name))
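The lookup in find_device can be exercised on canned output. In this sketch the sample lines are made up for illustration. They only follow the six-column layout that find_device unpacks, and parse_device_list and find_device_number are hypothetical helpers. Unlike find_device, the sketch skips lines that do not have six columns instead of failing on them.

```python
def parse_device_list(lines):
    """Return (device_id, device_type, device_name) for each device line."""
    devices = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        device_id, state, dev_type, name, uuid, refcnt = fields
        devices.append((int(device_id), dev_type, name))
    return devices


def find_device_number(lines, search_type, search_name):
    """Return the number of the device with the given type and name."""
    for device_id, dev_type, name in parse_device_list(lines):
        if dev_type == search_type and name == search_name:
            return device_id
    raise ValueError("device not found: %s:%s" % (search_type, search_name))


# Made-up sample output (assumed layout: id state type name uuid refcount):
sample = [
    "  0 UP obdecho test_obdecho aaaa-uuid 3\n",
    "  1 UP echo_client test_client bbbb-uuid 2\n",
    "\n",
]
```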
class OBDEcho:
    """
    Create an obdecho-device (a device that can simulate an ost)
    """
    def __init__(self, node, name):
        """
        The constructor will create the device
        @node - the node where to run the obdecho-device
        @name - the name of the new device
        """
        self.node = node
        self.name = name
        self.uuid = genUUID()
        self.module = modules.getKernelModule(self.node, "obdecho")
        self.module.incUse()
        lctl(self.node, ['attach obdecho %s %s' % (self.name, self.uuid), 'setup n'])

    def __del__(self):
        """
        The destructor will remove the device
        """
        lctl(self.node, ['cfg_device %s' % self.name, 'cleanup', 'detach'])
        self.module.decUse()
class ExistingOSC:
    """
    Class to represent an existing osc-device
    The device is not manipulated in any way - this class
    is just used to refer to the device
    """
    def __init__(self, node, name):
        """
        Create a reference to the device
        @node - the node where the device lives
        @name - the name of the device
        """
        self.node = node
        self.name = name
class OSC:
    """
    Create an osc-device (a device that connects to a remote ost/obdecho-device
    and looks like a local obdfilter).
    """
    def __init__(self, node, name, ost):
        """
        Create the device
        @node - the node where to run the OSC
        @name - the name of the new device
        @ost - the object that the osc should be connected to. This should
               be an OBDEcho-object (its node and uuid are used).
        """
        self.node = node
        self.name = name
        self.ost = ost
        self.module = modules.getKernelModule(self.node, "obdecho")
        self.module.incUse()
        self.uuid = genUUID()
        # FIXME: "NID_%s_UUID" should probably not be hardcoded? Retrieve uuid from node-object?
        lctl(self.node, ['attach osc %s %s' % (self.name, self.uuid),
                         'setup %s "NID_%s_UUID"' % (self.ost.uuid, self.ost.node)])

    def __del__(self):
        """
        The destructor will remove the device
        """
        lctl(self.node, ['cfg_device %s' % self.name, 'cleanup', 'detach'])
        self.module.decUse()
class ExistingOBDFilter:
    """
    Class to represent an existing obdfilter-device
    The device is not manipulated in any way - this class
    is just used to refer to the device
    """
    def __init__(self, node, name):
        """
        Create a reference to the device
        @node - the node where the device lives
        @name - the name of the device
        """
        self.node = node
        self.name = name
class EchoClient:
    """
    Class wrapping echo_client functionality
    """
    def __init__(self, node, name, target):
        """
        Create a new echo_client
        @node - the node to run the echo_client on
        @name - the name of the new echo_client
        @target - The obdfilter / osc device to connect to. This should
                  be an OSC, ExistingOSC or ExistingOBDFilter-object on the same node.
        """
        self.node = node
        self.name = name
        self.target = target
        self.objects = [] # List of objects that have been created and not yet destroyed.
        self.log = logging.getLogger("EchoClient")
        self.module = modules.getKernelModule(self.node, "obdecho")
        self.module.incUse()
        self.uuid = genUUID()
        lctl(self.node, ['attach echo_client %s %s' % (self.name, self.uuid), 'setup %s' % self.target.name])
        self.devicenum = find_device(self.node, 'echo_client', self.name)
        self.log.debug("EchoClient created: %s" % self.name)

    def __del__(self):
        """
        Remove the echo_client, and unload the obdecho module if it is no longer in use.
        Destroy all objects that have been created.
        """
        self.log.debug("EchoClient destructor: destroying objects")
        # Pass a copy, because destroyObjects removes entries from self.objects:
        self.destroyObjects(self.objects[:])
        self.log.debug("EchoClient destructor: detach echo_client:")
        lctl(self.node, ['cfg_device %s' % self.name, 'cleanup', 'detach'])
        self.log.debug("EchoClient destructor: Unload modules:")
        self.module.decUse()
        self.log.debug("EchoClient destructor: Done")
    def createObjects(self, num):
        """
        Create new objects on this device
        @num - the number of objects to create
        Returns a list of object-ids.
        """
        oids = []
        lines = lctl(self.node, ['device %d' % self.devicenum, 'create %d' % num])
        if lines[0].strip() != 'create: %d objects' % num:
            raise Exception("Invalid output from lctl(2): %s" % repr(lines[0]))
        pattern = re.compile('create: #(.*) is object id 0x(.*)')
        for line in lines[1:]:
            i, oid = pattern.match(line).groups()
            if int(i) != len(oids)+1:
                raise Exception("Expected to find object nr %d - found object nr %d:" % (len(oids)+1, int(i)))
            oids.append(long(oid, 16))
        self.objects += oids
        return oids

    def destroyObjects(self, objects):
        """
        Destroy a set of objects
        @objects - list of object ids
        """
        for oid in objects:
            lctl(self.node, ['device %d' % self.devicenum, 'destroy %d' % oid])
            self.objects.remove(oid)
    def startTestBRW(self, oid, threads=1, num=1, test='w', pages=1):
        """
        Start a test_brw, and return a remotePopen3-object for the test-process.
        Do <num> bulk read/writes on OST object <oid> (<pages> pages per I/O).
        @oid - objectid for the first object to use.
               (each thread will use one object)
        @threads - number of threads to use
        @num - number of io-operations to perform
        @test - what test to perform ('w' or 'r', for write or read-tests)
        @pages - number of pages to use in each io-request. (4KB on ia32)
        """
        cmd = 'lctl --threads %d q %d test_brw %d %s q %d %d' % \
              (threads, self.devicenum, num, test, pages, oid)
        self.log.debug("startTestBRW: %s:%s" % (self.node, cmd))
        remote = remotePopen3(self.node, cmd, True)
        return remote

    def testBRW(self, oid, threads=1, num=1, test='w', pages=1):
        """
        Do <num> bulk read/writes on OST object <oid> (<pages> pages per I/O),
        and wait for the test to finish. Raise an exception if the test fails.
        @oid - objectid for the first object to use.
               (each thread will use one object)
        @threads - number of threads to use
        @num - number of io-operations to perform
        @test - what test to perform ('w' or 'r', for write or read-tests)
        @pages - number of pages to use in each io-request. (4KB on ia32)
        """
        test = self.startTestBRW(oid, threads, num, test, pages)
        exit_code = test.wait()
        if exit_code != 0:
            raise Exception("test_brw failed with exitcode %d." % exit_code)
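The parsing done in createObjects can be tested on canned lctl-output. This is a sketch, assuming the output format implied by the patterns in createObjects ('create: N objects' followed by one 'create: #i is object id 0x...' line per object). parse_create_output is a hypothetical helper, the sample lines are made up, and the pattern is stricter than the one used above.

```python
import re

CREATE_LINE = re.compile(r"create: #(\d+) is object id 0x([0-9a-fA-F]+)")


def parse_create_output(lines, num):
    """Return the object-ids listed in the output of 'create <num>'."""
    if not lines or lines[0].strip() != "create: %d objects" % num:
        raise ValueError("unexpected header: %r" % lines[:1])
    oids = []
    for line in lines[1:]:
        match = CREATE_LINE.match(line.strip())
        if match is None:
            raise ValueError("unexpected line: %r" % line)
        index, oid = match.groups()
        # The objects are expected to be reported in order: #1, #2, ...
        if int(index) != len(oids) + 1:
            raise ValueError("expected object nr %d, found nr %s"
                             % (len(oids) + 1, index))
        oids.append(int(oid, 16))
    return oids


sample_output = [
    "create: 2 objects\n",
    "create: #1 is object id 0x1a\n",
    "create: #2 is object id 0x1b\n",
]
```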
class ParallelTestBRWResult:
    """
    Class to hold result from ParallelTestBRW
    """
    def __init__(self, threads, num, testtype, pages, pagesize, numclients):
        """
        Prepare the result-object with the constants for the test
        threads -- number of threads (per client)
        num -- number of io-operations for each thread
        testtype -- what kind of test ('w' for write-test or 'r' for read-test)
        pages -- number of pages in each request
        pagesize -- pagesize in KB (Assumes same page-size across all clients)
        numclients -- number of clients used in the tests
        """
        self.threads = threads
        self.num = num
        self.testtype = testtype
        self.pages = pages
        self.pagesize = pagesize
        self.numclients = numclients
        self.starttimes = {} # clientid to starttime mapping
        self.finishtimes = {} # clientid to finishtime mapping
        self.exitcodes = {} # clientid to exit-code mapping
        self.runtimes = {} # clientid to runtime mapping
        self.stdout = {} # clientid to output mapping
        self.stderr = {} # clientid to errors mapping
    def registerStart(self, clientid):
        """
        Register that this client is about to start
        clientid -- the id of the client
        """
        self.starttimes[clientid] = time.time()

    def registerFinish(self, clientid, exitcode, stdout, stderr):
        """
        Register that this client just finished
        clientid -- the id of the client
        exitcode -- the exitcode of this test
        stdout -- the output from the test
        stderr -- the errors from the test
        """
        self.finishtimes[clientid] = time.time()
        self.exitcodes[clientid] = exitcode
        self.stdout[clientid] = stdout
        self.stderr[clientid] = stderr
        self.runtimes[clientid] = self.finishtimes[clientid] - self.starttimes[clientid]

    def getTestType(self):
        """
        Return the name of the test-type ('w' for write-tests and 'r' for read-tests)
        """
        return self.testtype

    def verifyExitCodes(self):
        """
        Verify that all tests finished successfully. Raise exception if they didn't.
        """
        if self.exitcodes.values().count(0) != self.numclients:
            raise Exception("test_brw failed!")
    def getTotalTime(self):
        """
        Return the number of seconds used for the test
        (from the first client starting to the last client finishing)
        """
        return max(self.finishtimes.values()) - min(self.starttimes.values())

    def getTotalSize(self):
        """
        Return total amount of data transferred (in KB)
        """
        return self.numclients * self.num * self.pages * self.threads * self.pagesize

    def getTotalBandwidth(self):
        """
        Return the total bandwidth for the test (in KB/s)
        """
        return self.getTotalSize() / self.getTotalTime()

    def getMaxBandwidth(self):
        """
        Return the bandwidth of the fastest OST (in KB/s)
        """
        time = min(self.runtimes.values())
        return self.num * self.pages * self.threads * self.pagesize / time

    def getMinBandwidth(self):
        """
        Return the bandwidth of the slowest OST (in KB/s)
        """
        time = max(self.runtimes.values())
        return self.num * self.pages * self.threads * self.pagesize / time
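A worked example of the bandwidth figures, using made-up timings for two clients. summarize is a hypothetical helper that repeats the arithmetic of getTotalSize, getTotalBandwidth, getMaxBandwidth and getMinBandwidth. All sizes are in KB and all times in seconds.

```python
def summarize(starttimes, finishtimes, threads, num, pages, pagesize):
    """Compute total size (KB) and bandwidths (KB/s) for a set of clients."""
    per_client_kb = num * pages * threads * pagesize
    runtimes = {c: finishtimes[c] - starttimes[c] for c in starttimes}
    total_kb = per_client_kb * len(starttimes)
    # Wall-clock time from the first start to the last finish:
    total_time = max(finishtimes.values()) - min(starttimes.values())
    return {
        "total_kb": total_kb,
        "total_bw": total_kb / total_time,
        "max_bw": per_client_kb / min(runtimes.values()),  # fastest client
        "min_bw": per_client_kb / max(runtimes.values()),  # slowest client
    }


# Two clients, each writing 100 requests of 256 pages of 4 KB (100 MB):
summary = summarize(starttimes={0: 0.0, 1: 1.0},
                    finishtimes={0: 10.0, 1: 6.0},
                    threads=1, num=100, pages=256, pagesize=4)
```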
def ParallelTestBRW(echo_clients, threads=1, size=100, tests=('w', 'r'), rsz=1024, pagesize=4):
    """
    Run a test_brw in parallel on a set of echo_clients
    @echo_clients -- list of EchoClient-objects to run tests on
    @threads -- number of threads to use per client
    @size -- amount of data to transfer for each client (MB), split between its threads
    @tests -- list of tests to perform ('w' or 'r', for write or read-tests)
    @rsz -- Amount of data (in KB) for each request. Default, 1024.
    @pagesize -- Size of each page (KB)
    Returns a list of ParallelTestBRWResult-objects (one for each test)
    """
    pages = rsz / pagesize
    num = size * 1024 / rsz / threads
    # Create one object per thread on every client:
    objects = {}
    for client in echo_clients:
        objects[client] = client.createObjects(threads)
        # Verify that the objectids are consecutive:
        for i in range(len(objects[client])-1):
            if objects[client][i+1] != objects[client][i] + 1:
                raise Exception("Non-consecutive objectids on client %s: %s" % (client, objects[client]))
    results = []
    for test in tests:
        result = ParallelTestBRWResult(threads, num, test, pages, pagesize, len(echo_clients))
        pids = {} # pid to clientid mapping
        remotes = {} # clientid to RemotePopen3-objects
        # Start the test on every client:
        clientid = 0
        for client in echo_clients:
            first_obj = objects[client][0]
            result.registerStart(clientid)
            remote = client.startTestBRW(first_obj, threads, num, test, pages)
            remotes[clientid] = remote
            pids[remote.pid] = clientid
            clientid += 1
        # Wait for tests to finish:
        while pids:
            pid, status = os.wait()
            clientid = pids[pid]
            remote = remotes[clientid]
            # Workaround for leak in popen2, see patch #816059 at python.sf.net:
            popen2._active.remove(remote)
            result.registerFinish(clientid, status, remote.fromchild.read(), remote.childerr.read())
            del pids[pid]
        results.append(result)
    # Destroy the objects that were created for the tests:
    for client in echo_clients:
        client.destroyObjects(objects[client])
    return results
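The two derived test-parameters and the check for consecutive object-ids can be illustrated in isolation. derive_parameters and are_consecutive are hypothetical helpers that mirror the arithmetic at the top of ParallelTestBRW. The sketch uses explicit integer division (//), which matches what the plain / does for integers in the Python 2 code above.

```python
def derive_parameters(threads=1, size=100, rsz=1024, pagesize=4):
    """Return (pages per request, number of requests per thread)."""
    pages = rsz // pagesize            # KB per request / KB per page
    num = size * 1024 // rsz // threads  # requests, split between threads
    return pages, num


def are_consecutive(oids):
    """Check that each object-id is exactly one higher than the previous."""
    return all(b == a + 1 for a, b in zip(oids, oids[1:]))
```

With the defaults, a 1024 KB request consists of 256 pages of 4 KB, and 100 MB is transferred as 100 requests. With four threads, each thread performs 25 of them.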
def timeit(func, *args, **kwargs):
    """
    Helper-function to easily time the execution of a function.
    @func - the function to run
    @*args - positional arguments
    @**kwargs - keyword arguments
    Returns the number of seconds used executing the function

    Example:
    timeit(max, 1, 2, 5, 2) - will time how long it takes to run max(1,2,5,2)
    """
    start = time.time()
    func(*args, **kwargs)