LU-5395 lfsck: deadlock between LFSCK and destroy
There is a potential deadlock between object destroy and
layout LFSCK. Consider the following scenario:
1) The LFSCK thread obtains the parent object first. At
that time, the parent object has not been destroyed yet.
2) An RPC service thread destroys the parent and all of its
child objects. Because the LFSCK thread still references
the parent object, the parent object is marked as dying in
RAM. The parent object in turn references all of its child
objects, so all of the child objects are marked as dying in
RAM as well.
3) The LFSCK thread tries to find a child object while still
holding its reference on the parent object, and finds that
the child object is dying. According to the object
visibility rules, an object with the dying flag set cannot
be returned to other users. So the LFSCK thread has to wait
until the dying object has been purged from RAM before it
can allocate a new object (with the same FID) in RAM.
Unfortunately, the LFSCK thread itself holds the reference
that keeps the parent object from being purged, and the
parent object in turn keeps the child object from being
purged. So the LFSCK thread waits forever: it deadlocks on
a reference that it holds itself.
Introduce a non-blocking version of lu_object_find() that
lets the LFSCK thread return failure immediately (instead of
waiting) when it finds a dying (child) object. The LFSCK
thread can then check whether the parent object is dying,
which avoids the deadlock above.
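The fail-fast lookup described above can be pictured with a
minimal, self-contained sketch. This is not the Lustre code:
the toy_* names, the fixed-size cache, and the choice of
-EAGAIN for "object is dying" are all assumptions made for
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an in-RAM object; NOT the real struct lu_object. */
struct toy_object {
	unsigned long fid;	/* object identifier */
	bool dying;		/* destroyed, but still referenced in RAM */
	int refcount;
};

#define TOY_CACHE_SIZE 4

struct toy_cache {
	struct toy_object slots[TOY_CACHE_SIZE];
	size_t used;
};

/*
 * Hypothetical non-blocking lookup. Returns 0 and takes a reference
 * on success, -EAGAIN (assumed error code) if the cached object is
 * dying, and -ENOENT if the FID is not cached. It never waits.
 */
static int toy_find_nowait(struct toy_cache *c, unsigned long fid,
			   struct toy_object **out)
{
	for (size_t i = 0; i < c->used; i++) {
		struct toy_object *o = &c->slots[i];

		if (o->fid != fid)
			continue;
		if (o->dying)
			return -EAGAIN;	/* fail fast instead of waiting */
		o->refcount++;
		*out = o;
		return 0;
	}
	return -ENOENT;
}

/*
 * Caller pattern: look up a child while holding the parent. If the
 * child is dying and the parent is dying too, the whole file was
 * destroyed, so report -ENOENT and let the caller drop the parent.
 */
static int toy_check_child(struct toy_cache *c, struct toy_object *parent,
			   unsigned long child_fid)
{
	struct toy_object *child = NULL;
	int rc = toy_find_nowait(c, child_fid, &child);

	if (rc == -EAGAIN && parent->dying)
		return -ENOENT;
	if (rc == 0)
		child->refcount--;	/* done with the child */
	return rc;
}

/* Replays steps 1) to 3) of the scenario; returns 0 when no hang occurs. */
int toy_demo(void)
{
	struct toy_cache c = {
		.slots = { { 1, false, 1 }, { 2, false, 0 } },
		.used = 2,
	};
	struct toy_object *parent = &c.slots[0];	/* held by "LFSCK" */

	assert(toy_check_child(&c, parent, 2) == 0);
	/* The "RPC service thread" destroys parent and child. */
	c.slots[0].dying = true;
	c.slots[1].dying = true;
	assert(toy_check_child(&c, parent, 2) == -ENOENT);
	return 0;
}
```

In this sketch the caller, not the lookup, decides what a
dying child means, which is what lets it notice that its own
reference on the parent is the one blocking the purge.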
Signed-off-by: Fan Yong <fan.yong@intel.com>
Change-Id: I7f465259011ad5fb92ef1b4dba0ff9f46d134352
Reviewed-on: http://review.whamcloud.com/11373
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>