b=5779
There's one error path in the DLM's enqueue code where, after a timeout and
user abort, the client will keep its local copy of the lock.
This has a few follow-on effects, all of which should be fixed by this patch:
- the reference on the lock is never dropped, so we never try to cancel it, so
we never find out that our view of the lock state differs from the server's.
This could perhaps cause some corruption.
- we try to match this lock on future enqueues; although the lock is marked as
failed, search_queue only checks for the destroyed flag (bug). I don't know
precisely why we need two flags for this, but that's a more subtle change than
I'm willing to make right now.
- once we have a handle on that lock, the completion AST does check that flag,
so it returns an error right away -- but we don't check its return code in the
match path (bug) and plow on.
- the lock enqueue was originally aborted before it got to the part that updates
the KMS and sets the LDLM_FL_CAN_MATCH flag. So each match attempt will wait
100 seconds for that flag to get set, which of course never happens. We should
print a pretty serious warning if that timeout happens, but fixes for the
previous two bugs should prevent us from getting here in the first place.
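
As a rough illustration of the items above, here is a minimal standalone
sketch, not the actual patch. Every sk_* name, the struct layout, and the flag
values are hypothetical stand-ins for the real ldlm code. It models three
things: the abort path dropping its reference, the queue search skipping
failed as well as destroyed locks, and the match path checking the completion
callback's return code. The LDLM_FL_CAN_MATCH wait is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real LDLM flag values differ. */
enum {
    SK_FL_DESTROYED = 1 << 0,
    SK_FL_FAILED    = 1 << 1,
};

/* Hypothetical, heavily reduced stand-in for a client-side lock. */
struct sk_lock {
    unsigned flags;
    int refcount;
};

/* Abort path: mark the lock failed AND drop the enqueue reference,
 * so that the lock can later be cancelled instead of lingering. */
static void sk_abort_enqueue(struct sk_lock *lock)
{
    assert(lock->refcount > 0);
    lock->flags |= SK_FL_FAILED;
    lock->refcount--;
}

/* Queue search: skip locks that are destroyed OR failed, not only
 * destroyed ones. Takes a reference on the lock it returns. */
static struct sk_lock *sk_search_queue(struct sk_lock *locks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (locks[i].flags & (SK_FL_DESTROYED | SK_FL_FAILED))
            continue;
        locks[i].refcount++;
        return &locks[i];
    }
    return NULL;
}

/* Stand-in for the completion callback: nonzero means error. */
static int sk_completion_ast(const struct sk_lock *lock)
{
    return (lock->flags & SK_FL_FAILED) ? -1 : 0;
}

/* Match path: returns 1 and sets *out on a match, 0 otherwise.
 * The callback's return code is checked rather than ignored. */
static int sk_lock_match(struct sk_lock *locks, size_t n,
                         struct sk_lock **out)
{
    struct sk_lock *lock = sk_search_queue(locks, n);

    if (lock == NULL)
        return 0;
    if (sk_completion_ast(lock) != 0) {
        lock->refcount--;   /* give back the search reference */
        return 0;
    }
    *out = lock;
    return 1;
}
```

With both checks in place, a lock whose enqueue was aborted is never handed
back to a later match attempt in this model, which is why the 100-second wait
should not be reachable through this path.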
This has been running at NERSC for the last week, so I think it's ready for
more exposure.