[Ocfs2-users] Huge Problem ocfs2

Joel Becker jlbec at evilplan.org
Mon Nov 12 02:37:45 PST 2012


On Mon, Nov 12, 2012 at 01:24:30AM +0200, Laurentiu Gosu wrote:
> We managed to track down the problem: the inodes which hold the
> root directory and system directory (and probably others, like hb)
> were somehow overwritten(!?).
> Using debugfs and a lot of detective work, Marian found the inode
> number of one of the subfolders. We then ran cd .. until we reached
> the topmost reachable folder, and used rdump from there to recover
> the data.

	Nicely done recovery!
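
	For anyone who lands here with the same problem, the rdump step
described above can be sketched roughly as follows.  This is only an
illustration: it assumes debugfs.ocfs2 is run in one-shot mode with -R
and that the directory is addressed by inode number in angle brackets.
The device path, inode number, output directory, and the helper name
build_rdump_command are all hypothetical placeholders, not values from
this thread.

```python
import shlex


def build_rdump_command(device, inode, outdir):
    """Build argv for a one-shot debugfs.ocfs2 rdump of a directory inode.

    Hypothetical helper: it only assembles the command line, it does not
    run anything against the device.
    """
    if inode <= 0:
        raise ValueError("inode number must be positive")
    if any(ch.isspace() for ch in outdir):
        # Keep the request string unambiguous; quoting rules inside
        # a debugfs request are not assumed here.
        raise ValueError("outdir must not contain whitespace")
    # <N> tells debugfs.ocfs2 to treat N as an inode number, not a path.
    request = f"rdump -v <{inode}> {outdir}"
    return ["debugfs.ocfs2", "-R", request, device]


if __name__ == "__main__":
    # Placeholder values; substitute the inode found by walking up with cd ..
    argv = build_rdump_command("/dev/dm-5", 123456, "/mnt/recovery")
    print(shlex.join(argv))
```

	The output directory should live on a different, healthy
filesystem, and the damaged volume should stay unmounted on every node
while the dump runs.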

> Now the question is why the critical blocks were overwritten. Maybe
> you can help to track this down and correct it (if that's the case).
> So, some facts from 2 days ago:
> 1. The ocfs2 cluster started becoming unresponsive (could not ls on some folders).
> 2. We unmounted the device from all nodes and ran fsck -y on
> it (a few months ago we did this successfully).
> 3. After fsck finished successfully, I remounted the device on all 5 nodes.
> 4. After 1 hour, all nodes started reporting in syslog something like:
> Nov  9 15:40:17 ro02xsrv003 kernel:
> (o2hb-B4CF8D4667,6098,9):o2hb_check_last_timestamp:576 ERROR:
> Another node is heartbeating on device (dm-5):
> expected(2:0xdfd1f518e3333501, 0x509d07bf),
> ondisk(1:0xd81cb80a00020069, 0xac1bf00000001db8)
> Nov  9 15:40:17 ro02xsrv003 kernel:
> (o2hb-B4CF8D4667,6098,9):o2hb_check_slot:802 ERROR: Node 0 has
> written a bad crc to dm-5
> Nov  9 15:40:17 ro02xsrv003 kernel:
> (o2hb-B4CF8D4667,6098,9):o2hb_dump_slot:526 ERROR: Dump slot
> information: seq = 0x2c2527fa66646f6d, node = 37, cksum = 0xda52,
> generation 0xf7a004a5a8c00000
> Nov  9 15:40:17 ro02xsrv003 kernel:
> (o2hb-B4CF8D4667,6098,9):o2hb_check_slot:802 ERROR: Node 3 has
> written a bad crc to dm-5
> 
> So I believe fsck somehow marked the metadata blocks as
> writable, and when they were used... kaboom.
> Hope it helps somebody to find the root cause. If additional info
> is needed for debugging, let me know.

	That's really weird.  The fsck code handles the system blocks
first and doesn't have an easy way to clear them.  I think you are
right that something overwrote the front of your disk.  I'm unsure
whether this evidence matches your supposition (metadata blocks were
re-allocated to regular files) or just a straight dd.  If you haven't
overwritten the whole disk yet, can you find a file with the heartbeat
blocks in its metadata chain?
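
	One quick, rough way to tell the two cases apart is to look at
what actually landed in the heartbeat slot.  The sketch below takes the
seq and generation values from the o2hb_dump_slot line quoted above and
shows them as raw bytes.  It assumes the fields are stored little-endian
on disk; the helper names are made up for illustration.

```python
import string

# Printable ASCII, excluding whitespace control characters.
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")


def le64_bytes(value):
    """Return the 8 bytes of a 64-bit field in assumed on-disk (little-endian) order."""
    return value.to_bytes(8, "little")


def printable_ratio(raw):
    """Fraction of bytes that are printable ASCII."""
    return sum(b in PRINTABLE for b in raw) / len(raw)


def render(raw):
    """Show printable bytes as characters and everything else as '.'."""
    return "".join(chr(b) if b in PRINTABLE else "." for b in raw)


# Values copied from the o2hb_dump_slot log line in this thread.
DUMPED = {
    "seq": 0x2C2527FA66646F6D,
    "generation": 0xF7A004A5A8C00000,
}

if __name__ == "__main__":
    for name, value in DUMPED.items():
        raw = le64_bytes(value)
        print(f"{name:10s} {raw.hex()}  {render(raw)}  "
              f"printable={printable_ratio(raw):.2f}")
```

	Under that byte-order assumption, seven of the eight seq bytes
are printable ASCII, which looks more like file contents than a
heartbeat sequence number.  A single slot is thin evidence, though, so
the same check would need to be run over the whole heartbeat region
before drawing any conclusion.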

Joel


-- 

"But then she looks me in the eye
 And says, 'We're going to last forever,'
 And man you know I can't begin to doubt it.
 Cause it just feels so good and so free and so right,
 I know we ain't never going to change our minds about it, Hey!
 Here comes my girl."

			http://www.jlbec.org/
			jlbec at evilplan.org


