[Ocfs2-users] Unreported temporary file corruption

Nuno Tavares nuno.tavares at dri.pt
Tue Feb 23 12:14:46 PST 2010


Hello again,

I'm sorry to insist on this matter, but I'm wondering whether this message
went through unnoticed or whether I said something absurd. I find it strange
that nobody had anything to say.

> * I now know there are corrupted files, but I don't even know how many
> there are. I only know there are at least as many as 'file' detected.
> Some of the files whose inodes fsck.ocfs2 tried to clone fall within the
> time period above, which suggests the cluster was somehow writing parts
> of different files to the same blocks. What could have caused this, and
> how do I prevent it from happening again?
>
> * Should I turn on tracing for a particular bit? Which one?
>
> * How can I monitor OCFS2 health on a running cluster?

I also find what mounted.ocfs2 reports strange:
[root at server01 ~]# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdc1             ocfs2  Unknown, server01

The output is consistent with server02's.
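
In case it helps the diagnosis, here is a minimal sketch of how I imagine
that "Unknown" entry could be cross-checked against the on-disk slot map
(assuming this version of debugfs.ocfs2 supports the slotmap command):

[root at server01 ~]# debugfs.ocfs2 -R "slotmap" /dev/sdc1   # slot -> node assignments
[root at server01 ~]# debugfs.ocfs2 -R "stats" /dev/sdc1 | grep -i slot   # configured slots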

I'd really like to hear your thoughts on this.

-- 
Nuno Tavares
DRI, Consultoria Informática
Tel: +351 936 184 086



Nuno Tavares wrote:
> Greetings all,
> 
> I'm wondering if anyone can shed some light here.
> 
> A few days ago a user reported problems with a specific directory. After
> further investigation, I now suspect there was data corruption during a
> specific time period.
> 
> While investigating the initial issue, I checked /var/log/messages just in
> case, and it was full of messages like these:
> 
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_check_dir_entry:111 ERROR:
> bad entry in directory #8075816: directory entry across blocks -
> offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_prepare_dir_for_insert:1734
> ERROR: status = -2
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_mknod:240 ERROR: status = -2
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_check_dir_entry:111 ERROR:
> bad entry in directory #8075816: directory entry across blocks -
> offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_prepare_dir_for_insert:1734
> ERROR: status = -2
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_mknod:240 ERROR: status = -2
> [...]
> 
> Indeed, that inode was bound to the problematic directory:
> 
> [root at fsnode01 ~]# debugfs.ocfs2 -R "findpath <8075816>" /dev/sdc1
> 	8075816	/storage/problematic/directory/
> 
> So I brought the cluster down and ran a filesystem check, which dumped a
> lot of messages like this:
> 
> Cluster 1135086 is claimed by the following inodes:
>   /storage/unrelated/file1
>   /storage/unrelated/file2
> [DUP_CLUSTERS_CLONE] Inode "/storage/unrelated/file1" may be cloned or
> deleted to break the claim it has on its clusters. Clone inode
> "/storage/unrelated/file1" to break claims on clusters it shares with
> other inodes? y
> pass1d: Invalid argument passed to OCFS2 library while reading inode to
> clone
> 
> Note that last pass1d message. I've checked my tools, and although their
> versions don't match, they are the latest versions available:
> [root at fsnode01 ~]# rpm -qa | grep ocfs
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.el5-1.4.4-1.el5
> ocfs2console-1.4.3-1.el5
> 
> Notice that the kernel modules are 1.4.4 while the tools are 1.4.3. Could
> this version mismatch cause the pass1d error? Does it have any other
> consequences? I've checked again; those were the only versions available...
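> 
> (Independently of the clone failure, and only as a sketch: it should be
> possible to double-check which inodes claim a given block with
> debugfs.ocfs2's icheck command, if this tools version supports it. The
> cluster numbers fsck prints would first need converting to block numbers
> using the block and cluster sizes reported by "stats"; the block number
> below is just a placeholder.)
> 
> [root at fsnode01 ~]# debugfs.ocfs2 -R "stats" /dev/sdc1 | grep -i size
> [root at fsnode01 ~]# debugfs.ocfs2 -R "icheck <block number>" /dev/sdc1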
> 
> I should mention that /storage/unrelated/* are all PDF files. Some of them
> are damaged, and using 'file -bi' I've tracked several of them down to a
> time interval between 'Jan 18 09:47' and 'Jan 18 12:24'. I could only track
> those files because 'file' reported a damaged PDF header; I can't be sure
> the remaining ones are all OK, I can only say their headers are OK.
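> 
> A rough sketch of the kind of scan I mean (the paths, the window and the
> exact commands here are illustrative):
> 
> # mark the suspect window with two reference timestamps
> touch -t 201001180947 /tmp/window.start
> touch -t 201001181224 /tmp/window.end
> # flag PDFs modified inside the window whose header 'file' does not recognise
> find /storage -name '*.pdf' -newer /tmp/window.start ! -newer /tmp/window.end \
>   -exec sh -c 'file -bi "$1" | grep -q application/pdf || echo "suspect: $1"' _ {} \;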
> 
> Also worth mentioning: there are other files within that time interval that
> seem to be OK (again, I can't be sure). I can't be certain when this mess
> started or when the cluster recovered from it.
> 
> I'm almost sure the files were OK when they were about to be stored on
> /storage; this investigation suggests they were damaged *during* their time
> on /storage. I've now taken appropriate measures to be able to prove this
> in the future.
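> 
> The kind of measure I mean is along these lines (a minimal sketch; the
> manifest path is illustrative): record a checksum manifest when files are
> stored, and verify it later to prove whether a file changed while sitting
> on /storage.
> 
> # at ingest time, record checksums
> find /storage -type f -name '*.pdf' -exec md5sum {} \; > /root/storage-manifest.md5
> # later, report anything that no longer matches
> md5sum -c /root/storage-manifest.md5 2>/dev/null | grep -v ': OK$'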
> 
> What is puzzling me is:
> * I now know there are corrupted files, but I don't even know how many
> there are. I only know there are at least as many as 'file' detected.
> Some of the files whose inodes fsck.ocfs2 tried to clone fall within the
> time period above, which suggests the cluster was somehow writing parts
> of different files to the same blocks. What could have caused this, and
> how do I prevent it from happening again?
> 
> * Should I turn on tracing for a particular bit? Which one?
> 
> * How can I monitor OCFS2 health on a running cluster?
> 
> Regards,
> 
> 


