[Ocfs2-users] sanity check - Xen+iSCSI+LVM+OCFS2 at dom0/domU

Sunil Mushran Sunil.Mushran at oracle.com
Fri Feb 8 10:43:16 PST 2008


It appears the self-built ocfs2 modules are still in play.

Do:
$ find /lib/modules/`uname -r` -name \*ocfs\* -exec echo -n "{}    " \; \
    -exec rpm -qf {} \;
This will list all the ocfs2 modules and the package that owns each one.
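
If the find shows .ko files that no package owns, one way to clear them out
(a sketch; the extra/ path is only an example, use whatever paths the find
above reports):

$ service o2cb unload                           # take down the running stack first
$ rm /lib/modules/`uname -r`/extra/ocfs2/*.ko   # the unowned, self-built copies
$ depmod -a                                     # rebuild modules.dep
$ service o2cb load                             # reload, now from the rpm modules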

Alok Dhir wrote:
> Great call on the netconsole -- I had no idea it existed until your 
> advice - you learn something new every day :)
>
> Here's my repeatable oops on Centos 5.1 Xen dom0, 2.6.18-53.1.6.el5xen 
> x86_64, using OSS packages 'ocfs2-2.6.18-53.1.6.el5xen-1.2.8-2.el5'.
>
> As soon as I kick off 'iozone -A':
>
> Kernel BUG at fs/inode.c:250
> invalid opcode: 0000 [1] SMP
> last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:04:00.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/irq
>
> CPU 5
> Modules linked in: ocfs2(U) netconsole netloop netbk blktap blkbk 
> ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink 
> ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge ipv6 
> autofs4 hidp rfcomm l2cap bluetooth ocfs2_dlmfs(U) ocfs2_dlm(U) 
> ocfs2_nodemanager(U) configfs sunrpc ib_iser rdma_cm ib_cm iw_cm 
> ib_addr ib_local_sa ib_sa ib_mad ib_core iscsi_tcp libiscsi 
> scsi_transport_iscsi dm_multipath video sbs backlight i2c_ec i2c_core 
> button battery asus_acpi ac parport_pc lp parport joydev sr_mod ide_cd 
> serial_core serio_raw cdrom pcspkr shpchp bnx2 dm_snapshot dm_zero 
> dm_mirror dm_mod mppVhba(U) usb_storage ata_piix libata megaraid_sas 
> mppUpper(U) sg(U) sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 31841, comm: iozone Not tainted 2.6.18-53.1.6.el5xen #1
> RIP: e030:[<ffffffff80222b19>]  [<ffffffff80222b19>] clear_inode+0x1b/0x123
> RSP: e02b:ffff8803c2803e28  EFLAGS: 00010202
> RAX: ffff8803c3f2ff20 RBX: ffff8803c3f2fd98 RCX: 0000000000000000
> RDX: ffffffffff578140 RSI: ffff8803c2803e48 RDI: ffff8803c3f2fd98
> RBP: 0000000000000000 R08: ffff8803dbe1b6c0 R09: 0000000000000002
> R10: 0000000000000001 R11: ffff880002cb8c00 R12: ffff8803c3f2fac0
> R13: ffff8803d6e46000 R14: 0000000000000002 R15: 0000000000000000
> FS:  00002aaaaaac5ee0(0000) GS:ffffffff80599280(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000
> Process iozone (pid: 31841, threadinfo ffff8803c2802000, task ffff8803e345a7a0)
> Stack:  ffff8803c3f2fd98  ffffffff8869ac19  ffff8803c3f2fd98  ffff8803c3857130
>  0000000000000000  0000000000000002  ffffffffffffffff  0000000000000200
>  ffff8803c3f2fd98  ffffffff8869a4f0
> Call Trace:
>  [<ffffffff8869ac19>] :ocfs2:ocfs2_delete_inode+0x729/0x79a
>  [<ffffffff8869a4f0>] :ocfs2:ocfs2_delete_inode+0x0/0x79a
>  [<ffffffff8022f811>] generic_delete_inode+0xc6/0x143
>  [<ffffffff88699df5>] :ocfs2:ocfs2_drop_inode+0x117/0x16e
>  [<ffffffff8023c6ea>] do_unlinkat+0xd5/0x141
>  [<ffffffff8025d291>] tracesys+0x47/0xb2
>  [<ffffffff8025d2f1>] tracesys+0xa7/0xb2
>
>
> Code: 0f 0b 68 e3 7f 47 80 c2 fa 00 48 8b 83 08 02 00 00 a8 10 75
> RIP  [<ffffffff80222b19>] clear_inode+0x1b/0x123
>  RSP <ffff8803c2803e28>
>  <0>Kernel panic - not syncing: Fatal exception
>
>
> On Feb 7, 2008, at 6:56 PM, Sunil Mushran wrote:
>
>> Setup netconsole on the cluster members (domU?) to get a stack trace.
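>>
>> A minimal sketch, with example addresses (any box on the LAN that can
>> run a UDP listener will do as the receiver):
>>
>> # on the crashing node: stream kernel messages out eth0 to 192.168.1.99
>> modprobe netconsole netconsole=@/eth0,6666@192.168.1.99/
>> # on the receiving box (older netcats want -l -p):
>> nc -u -l -p 6666 | tee netconsole.log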
>>
>> Alok Dhir wrote:
>>> Thanks again for your prompt assistance earlier today - we seem to 
>>> have gotten past the fs/inode.c bug at domU by using the OSS-packaged 
>>> ocfs2 kernel modules.  The cluster comes up and mounts on all boxes, 
>>> and appears to work.
>>>
>>> However, we have now run into a more widespread issue - at dom0, any 
>>> of the cluster member servers will spontaneously reboot when I start 
>>> an 'iozone -A' in an ocfs2 filesystem.  I am unable to check the 
>>> kernel panic message because the box reboots immediately, despite 
>>> 'kernel.panic=0' being set via sysctl (which is supposed to mean 'do 
>>> not reboot on panic').  There are also no entries in /var/log/messages 
>>> when this happens.
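>>>
>>> (For reference, a sketch of the relevant knobs; note that on a Xen
>>> dom0 the hypervisor itself can reboot the machine regardless of the
>>> domain's sysctls, so the grub line may be the piece that matters:)
>>>
>>> # do not reboot after a panic, and turn an oops into a panic
>>> sysctl -w kernel.panic=0
>>> sysctl -w kernel.panic_on_oops=1
>>> # on a Xen dom0, also add 'noreboot' to the hypervisor line in grub.conf:
>>> #   kernel /xen.gz noreboot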
>>>
>>> I realize there's not much debugging you can do without the panic 
>>> message, but I'm wondering if perhaps this new version has some bug 
>>> that was not in 1.2.7 (with our self-built 1.2.7, only the domU 
>>> servers rebooted; dom0 was stable).
>>>
>>> Are others running this new version with success?  Under RHEL/Centos 
>>> 5.1 Xen dom0/domU?
>>>
>>> On Feb 7, 2008, at 1:40 PM, Sunil Mushran wrote:
>>>
>>>> Is the IP address correct? If not, correct it.
>>>>
>>>> # netstat -tan
>>>> See if that port is already in use. If so, use another.
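>>>>
>>>> A quick way to check both (a sketch; 7777 and the 192.168 prefix are
>>>> taken from your cluster.conf):
>>>>
>>>> # is anything else already bound to the o2net port?
>>>> netstat -tan | grep 7777
>>>> # is the address in cluster.conf actually assigned on this box?
>>>> # (o2net's ret=-99 in the log is EADDRNOTAVAIL: the address is not local)
>>>> ip addr show | grep 'inet '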
>>>>
>>>> Alok Dhir wrote:
>>>>> Ah - thanks for the clarification.
>>>>>
>>>>> I'm left with one perplexing problem - on one of the hosts, 
>>>>> 'devxen0', o2cb refuses to start.  The box is configured 
>>>>> identically to at least 2 other cluster hosts, and all were imaged 
>>>>> the exact same way, except that devxen0 has 32GB RAM where the 
>>>>> others have 16GB or less.
>>>>>
>>>>> Any clues where to look?
>>>>>
>>>>> --
>>>>> [root@devxen0:~] service o2cb enable
>>>>> Writing O2CB configuration: OK
>>>>> Starting O2CB cluster ocfs2: Failed
>>>>> Cluster ocfs2 created
>>>>> Node beast added
>>>>> o2cb_ctl: Internal logic failure while adding node devxen0
>>>>>
>>>>> Stopping O2CB cluster ocfs2: OK
>>>>> --
>>>>> This is in syslog when this happens:
>>>>>
>>>>> Feb  7 13:26:50 devxen0 kernel: 
>>>>> (17194,6):o2net_open_listening_sock:1867 ERROR: unable to bind 
>>>>> socket at 196.168.1.72:7777, ret=-99
>>>>>
>>>>> --
>>>>> Box config:
>>>>>
>>>>> [root@devxen0:~] uname -a
>>>>> Linux devxen0.symplicity.com 2.6.18-53.1.6.el5xen #1 SMP Wed Jan 
>>>>> 23 11:59:21 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>>>>>
>>>>> --
>>>>> Here is cluster.conf:
>>>>>
>>>>> ---
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 192.168.1.62
>>>>>   number = 0
>>>>>   name = beast
>>>>>   cluster = ocfs2
>>>>>
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 196.168.1.72
>>>>>   number = 1
>>>>>   name = devxen0
>>>>>   cluster = ocfs2
>>>>>
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 192.168.1.73
>>>>>   number = 2
>>>>>   name = devxen1
>>>>>   cluster = ocfs2
>>>>>
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 192.168.1.74
>>>>>   number = 3
>>>>>   name = devxen2
>>>>>   cluster = ocfs2
>>>>>
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 192.168.1.70
>>>>>   number = 4
>>>>>   name = fs1
>>>>>   cluster = ocfs2
>>>>>
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 192.168.1.71
>>>>>   number = 5
>>>>>   name = fs2
>>>>>   cluster = ocfs2
>>>>>
>>>>> node:
>>>>>   ip_port = 7777
>>>>>   ip_address = 192.168.1.80
>>>>>   number = 6
>>>>>   name = vdb1
>>>>>   cluster = ocfs2
>>>>>
>>>>> cluster:
>>>>>   node_count = 7
>>>>>   name = ocfs2
>>>>> ---
>>>>>
>>>>>
>>>>>
>>>>> On Feb 7, 2008, at 1:23 PM, Sunil Mushran wrote:
>>>>>
>>>>>> Yes, but backported into ocfs2 1.4, which is yet to be released.
>>>>>> You are on ocfs2 1.2.
>>>>>>
>>>>>> Alok Dhir wrote:
>>>>>>> I've seen that -- I was under the impression that some of those 
>>>>>>> were being backported into the release kernels.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Alok
>>>>>>>
>>>>>>> On Feb 7, 2008, at 1:15 PM, Sunil Mushran wrote:
>>>>>>>
>>>>>>>> http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2-new-features.html 
>>>>>>>>
>>>>>>>>
>>>>>>>> Alok Dhir wrote:
>>>>>>>>> We were indeed using a self-built module due to the lack of an 
>>>>>>>>> OSS one for the latest kernel.  Thanks for your response, I 
>>>>>>>>> will test with the new version.
>>>>>>>>>
>>>>>>>>> What are we leaving on the table by not using the latest 
>>>>>>>>> mainline kernel?
>>>>>>>>>
>>>>>>>>> On Feb 7, 2008, at 12:56 PM, Sunil Mushran wrote:
>>>>>>>>>
>>>>>>>>>> Are you building ocfs2 with this kernel, or are you using the 
>>>>>>>>>> ones we provide for RHEL5?
>>>>>>>>>>
>>>>>>>>>> I am assuming you have built it yourself as we did not release
>>>>>>>>>> packages for the latest 2.6.18-53.1.6 kernel till last night.
>>>>>>>>>>
>>>>>>>>>> If you are using your own, then use the one from oss.
>>>>>>>>>>
>>>>>>>>>> If you are using the one from oss, then file a bugzilla with the
>>>>>>>>>> full oops trace.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Sunil
>>>>>>>>>>
>>>>>>>>>> Alok K. Dhir wrote:
>>>>>>>>>>> Hello all - we're evaluating OCFS2 in our development 
>>>>>>>>>>> environment to see if it meets our needs.
>>>>>>>>>>>
>>>>>>>>>>> We're testing it with an iSCSI storage array (Dell MD3000i) 
>>>>>>>>>>> and 5 servers running Centos 5.1 (2.6.18-53.1.6.el5xen).
>>>>>>>>>>>
>>>>>>>>>>> 1) Each of the 5 servers is running the Centos 5.1 
>>>>>>>>>>> open-iscsi initiator, and sees the volumes exposed by the 
>>>>>>>>>>> array just fine.  So far so good.
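>>>>>>>>>>>
>>>>>>>>>>> (For reference, a sketch of the initiator side on each box; the
>>>>>>>>>>> portal address and target IQN below are placeholders:)
>>>>>>>>>>>
>>>>>>>>>>> # discover the array's targets, then log in
>>>>>>>>>>> iscsiadm -m discovery -t sendtargets -p 192.168.1.10:3260
>>>>>>>>>>> iscsiadm -m node -T iqn.1984-05.com.dell:md3000i.example -p 192.168.1.10:3260 --login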
>>>>>>>>>>>
>>>>>>>>>>> 2) Created a volume group using the exposed iscsi volumes 
>>>>>>>>>>> and created a few LVM2 logical volumes.
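>>>>>>>>>>>
>>>>>>>>>>> (Roughly, with /dev/sdb standing in for one of the exposed LUNs:)
>>>>>>>>>>>
>>>>>>>>>>> pvcreate /dev/sdb                        # put the LUN under LVM
>>>>>>>>>>> vgcreate md3000vg /dev/sdb               # the shared volume group
>>>>>>>>>>> lvcreate -L 100G -n testvol0 md3000vg    # a test logical volume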
>>>>>>>>>>>
>>>>>>>>>>> 3) vgscan; vgchange -a y; on all the cluster members.  All 
>>>>>>>>>>> see the "md3000vg" volume group.  Looking good.  (We have no 
>>>>>>>>>>> intention of changing the LVM2 configuration much, if at 
>>>>>>>>>>> all, and can make sure any such changes are done while the 
>>>>>>>>>>> volumes are offline on all cluster members, so in theory 
>>>>>>>>>>> this should not be a problem.)
>>>>>>>>>>>
>>>>>>>>>>> 4) mkfs.ocfs2 /dev/md3000vg/testvol0 -- works great
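>>>>>>>>>>>
>>>>>>>>>>> (In full, with a label and node-slot count added; 8 slots is an
>>>>>>>>>>> arbitrary choice that covers our 7 nodes:)
>>>>>>>>>>>
>>>>>>>>>>> mkfs.ocfs2 -N 8 -L testvol0 /dev/md3000vg/testvol0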
>>>>>>>>>>>
>>>>>>>>>>> 5) mount on all Xen dom0 boxes in the cluster, works great.
>>>>>>>>>>>
>>>>>>>>>>> 6) create a VM on one of the cluster members, set up iscsi, 
>>>>>>>>>>> vgscan, md3000vg shows up -- looking good.
>>>>>>>>>>>
>>>>>>>>>>> 7) install ocfs2, 'service o2cb enable', starts up fine.  
>>>>>>>>>>> mount /dev/md3000vg/testvol0, works fine.
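>>>>>>>>>>>
>>>>>>>>>>> (i.e., with /cluster/test as an example mount point:)
>>>>>>>>>>>
>>>>>>>>>>> mkdir -p /cluster/test
>>>>>>>>>>> mount -t ocfs2 /dev/md3000vg/testvol0 /cluster/test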
>>>>>>>>>>>
>>>>>>>>>>> ** Thanks for making it this far -- this is where it gets 
>>>>>>>>>>> interesting.
>>>>>>>>>>>
>>>>>>>>>>> 8) run 'iozone' in domU against ocfs2 share - BANG - 
>>>>>>>>>>> immediate kernel panic, repeatable all day long.
>>>>>>>>>>>
>>>>>>>>>>> "kernel BUG at fs/inode.c"
>>>>>>>>>>>
>>>>>>>>>>> So my questions:
>>>>>>>>>>>
>>>>>>>>>>> 1) should this work?
>>>>>>>>>>>
>>>>>>>>>>> 2) if not, what should we do differently?
>>>>>>>>>>>
>>>>>>>>>>> 3) currently we're tracking the latest RHEL/Centos 5.1 
>>>>>>>>>>> kernels -- would we have better luck using the latest 
>>>>>>>>>>> mainline kernel?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for any assistance.
>>>>>>>>>>>
>>>>>>>>>>> Alok Dhir
>>>>>>>>>>>
>>>>>>>>>>>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users



