[Ocfs2-users] Unreported temporary file corruption

Nuno Tavares nuno.tavares at dri.pt
Sat Feb 13 12:32:29 PST 2010


Greetings all,

I'm wondering if anyone can shed some light here.

Some days ago a user reported problems dealing with a specific
directory. After further investigation, I now suspect that data
corruption occurred during a specific time period.

Addressing the initial issue, I've checked /var/log/messages just in
case and it had lots of messages like this:

Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_check_dir_entry:111 ERROR:
bad entry in directory #8075816: directory entry across blocks -
offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_prepare_dir_for_insert:1734
ERROR: status = -2
Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_mknod:240 ERROR: status = -2
Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_check_dir_entry:111 ERROR:
bad entry in directory #8075816: directory entry across blocks -
offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_prepare_dir_for_insert:1734
ERROR: status = -2
Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_mknod:240 ERROR: status = -2
[...]

Indeed, that inode was bound to the problematic directory:

[root@fsnode01 ~]# debugfs.ocfs2 -R "findpath <8075816>" /dev/sdc1
	8075816	/storage/problematic/directory/
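
As a rough sketch of how the directory numbers can be pulled out of the
log before handing them to findpath (the function name bad_dir_inodes is
made up for illustration, and the pattern assumes each syslog entry sits
on a single line in the real file, unlike the wrapped excerpt above):

```shell
# Hypothetical helper: print each distinct directory inode number that
# appears in a "bad entry in directory #N" message in log file $1.
bad_dir_inodes() {
    sed -n 's/.*bad entry in directory #\([0-9][0-9]*\).*/\1/p' "$1" |
        sort -un
}

# Example use (device and log path as in this report):
#   for ino in $(bad_dir_inodes /var/log/messages); do
#       debugfs.ocfs2 -R "findpath <$ino>" /dev/sdc1
#   done
```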

So I brought the cluster down and requested a filesystem check which
dumped a lot of messages like this:

Cluster 1135086 is claimed by the following inodes:
  /storage/unrelated/file1
  /storage/unrelated/file2
[DUP_CLUSTERS_CLONE] Inode "/storage/unrelated/file1" may be cloned or
deleted to break the claim it has on its clusters. Clone inode
"/storage/unrelated/file1" to break claims on clusters it shares with
other inodes? y
pass1d: Invalid argument passed to OCFS2 library while reading inode to
clone

Note that last (pass1d) message. I've checked my tools and, although
their versions do not match, they are the latest available:
[root@fsnode01 ~]# rpm -qa | grep ocfs
ocfs2-tools-1.4.3-1.el5
ocfs2-2.6.18-164.el5-1.4.4-1.el5
ocfs2console-1.4.3-1.el5

Notice kernel modules are 1.4.4 and tools are 1.4.3. Could this version
mismatch cause the pass1d error? Does it have any consequence? I've
checked again, they were the only ones available...

I should mention that /storage/unrelated/* are all PDF files. Some of
them are damaged, and using 'file -bi' I've traced some of the damaged
ones to the interval between 'Jan 18 09:47' and 'Jan 18 12:24'. I could
only find these files because 'file' reported a damaged PDF header; for
the others I can only say that their header is OK, not that the whole
file is.
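
A minimal sketch of that kind of scan, for anyone wanting to repeat it.
It is not the exact command used here: instead of 'file -bi' it simply
tests for the "%PDF-" magic bytes at the start of each file, and the
function names are invented for illustration. It assumes GNU touch and
find, and file names without embedded newlines.

```shell
# Hypothetical helper: succeed if file $1 starts with the PDF magic "%PDF-".
has_pdf_header() {
    [ "$(head -c 5 -- "$1" 2>/dev/null)" = "%PDF-" ]
}

# Hypothetical helper: list files under directory $1 that were modified
# after timestamp $2 and not after timestamp $3, and whose header is bad.
list_damaged_pdfs() {
    scan_dir=$1
    scan_tmp=$(mktemp -d) || return 1
    # Reference files carrying the two boundary timestamps.
    touch -d "$2" "$scan_tmp/start"
    touch -d "$3" "$scan_tmp/end"
    find "$scan_dir" -type f -newer "$scan_tmp/start" \
        ! -newer "$scan_tmp/end" -print |
    while IFS= read -r f; do
        has_pdf_header "$f" || printf '%s\n' "$f"
    done
    rm -rf "$scan_tmp"
}

# Example use (the year is an assumption based on the date of this post):
#   list_damaged_pdfs /storage/unrelated "2010-01-18 09:47" "2010-01-18 12:24"
```

As with 'file', a clean result only says the header is intact; it says
nothing about the rest of the file.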

Also worth mentioning is that there are other files from that time
interval that seem to be OK (again, I can't be sure). I can't be certain
when this mess started or when the cluster recovered from it.

I'm almost sure the files were OK when they were "about to be" stored on
/storage. This investigation suggested they were damaged *during* their
existence on /storage. I've now taken appropriate measures to prove this
in the future.
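
The post does not say which measures, so the following is only one
possible approach, sketched under that assumption: record a checksum for
every file at the moment it is stored, and compare later. The function
names are hypothetical, sha256sum from GNU coreutils is assumed to be
available, and the manifest should live outside the directory being
checked.

```shell
# Hypothetical helper: write SHA-256 checksums of all files under
# directory $1 to manifest file $2 (paths are stored relative to $1).
record_checksums() {
    (cd "$1" && find . -type f -exec sha256sum {} +) > "$2"
}

# Hypothetical helper: print the files under $1 that are missing or no
# longer match manifest $2. Prints nothing when everything matches.
verify_checksums() {
    (cd "$1" && sha256sum -c --quiet - 2>/dev/null) < "$2" |
        sed -n 's/: FAILED.*$//p'
}

# Example use (the manifest path is a placeholder):
#   record_checksums /storage /root/storage.sha256
#   verify_checksums /storage /root/storage.sha256
```

A mismatch for a file that nobody rewrote would show that the damage
happened while the file was sitting on /storage.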

What is puzzling me is:
* I now know there are corrupted files, but I don't know how many. I
just know there are at least as many as 'file' detected. Some of the
files whose inodes fsck.ocfs2 tried to clone belong to the time period
above, which suggests that something went wrong and the cluster wrote
parts of different files to the same blocks. What could have caused
this, and how do I avoid it happening again?

* Should I turn on tracing for a particular bit? Which one?

* How can I monitor OCFS2 health on a running cluster?
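
On the monitoring question, one low-tech starting point is to count the
OCFS2 ERROR lines in syslog per reporting function, so that a sudden
burst like the one above stands out. This is only a sketch: the function
name is invented, the pattern is derived from the log excerpt in this
post (one entry per line in the real file), and it watches the kernel
log only, which is not a real health check of the filesystem.

```shell
# Hypothetical helper: count OCFS2 kernel ERROR lines in log file $1,
# grouped by the reporting function, most frequent first.
ocfs2_error_counts() {
    sed -n 's/.*kernel: ([0-9]*,[0-9]*):\(ocfs2_[A-Za-z0-9_]*\):[0-9]* ERROR.*/\1/p' "$1" |
        sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example use, e.g. from cron:
#   ocfs2_error_counts /var/log/messages
```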

Regards,
-- 
Nuno Tavares
DRI, Consultoria Informática
Telef: +351 936 184 086
