[Ocfs2-users] Tracking down hangs

Fri Jun 4 10:11:21 PDT 2010

On 06/04/2010 07:17 AM, Andrew Robert Nicols wrote:
> If the hang is only short, could it be that we're just missing the relevant
> busy locks by running scanlocks too late?
>    

Then it is not a hang. It is just slow. A hang is more permanent
and is typically due to a bug in some component. A busy lock, in itself,
is not a bug. It just indicates that a node has requested an upconvert.
The dlm will see if another node has an incompatible lock level and
ask it to downconvert. The downconverting node will then wait for
the journal to commit+checkpoint (if needed) before requesting the
downconvert. All this involves writes. Any slowdown in the io layer
will affect the overall throughput.
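
To make the upconvert/downconvert step concrete, here is a toy model
of the level compatibility check. It is only a sketch and not the dlm
code. The node names and the function names are made up for
illustration. It assumes the three levels NL, PR and EX, where PR is
compatible with PR, EX is compatible only with NL, and a blocking
holder drops to the highest level that is still compatible with the
request.

```python
from enum import IntEnum


class Level(IntEnum):
    NL = 0  # null: no access, compatible with everything
    PR = 1  # protected read: shared
    EX = 2  # exclusive


def compatible(held, requested):
    """True if a lock held at `held` can coexist with a request at `requested`."""
    if held == Level.NL or requested == Level.NL:
        return True
    return held == Level.PR and requested == Level.PR


def highest_compatible(requested):
    """Highest level a blocking holder can keep while the request is granted."""
    if requested == Level.EX:
        return Level.NL
    if requested == Level.PR:
        return Level.PR
    return Level.EX


def downconverts_needed(holders, requested):
    """Map each blocking node (hypothetical names) to the level it must drop to."""
    target = highest_compatible(requested)
    return {
        node: target
        for node, held in holders.items()
        if not compatible(held, requested)
    }


if __name__ == "__main__":
    # node1 wants EX while node2 holds PR and node3 holds NL.
    print(downconverts_needed({"node2": Level.PR, "node3": Level.NL}, Level.EX))
```

Each node in the result is one that has to flush before it can let go
of its level, which is where the io layer comes into the picture.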

> I've remounted with data=writeback on the nfs server and under normal load,
> we're still seeing hangs fairly frequently. I'm having real difficulty in
> tracking down the cause of the issues.
>
> I've moved away from catting the same file on each server to reading a
> different file on each server. This has reduced the frequency of the issue
> slightly, but not altogether.
>    

Performance issues are never easy. ;)

> We've checked out the drbd link and it appears untaxed when we see these
> glitches.
>    

My concern is latency. Even if the total number of ios is small, if
each one takes a long time, that would explain what you are observing.
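
One rough way to look at latency rather than bandwidth is to time
small synchronous writes. The following is a minimal sketch, not a
proper benchmark. The function names, the 4K write size and the sample
count are assumptions chosen for illustration. Point it at a directory
on the volume you want to test.

```python
import os
import sys
import tempfile
import time


def fsync_latencies(directory, count=50, size=4096):
    """Time `count` small write+fsync pairs in `directory`; returns seconds per op."""
    payload = b"\0" * size
    samples = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write down to stable storage
            samples.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return samples


def summarize(samples):
    """Return the min, median and max of a non-empty list of samples."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Usage: python probe.py /path/to/mounted/volume
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    for name, seconds in summarize(fsync_latencies(target)).items():
        print(f"{name}: {seconds * 1000:.2f} ms")
```

A large gap between the median and the max while the link looks idle
would point at latency rather than throughput.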