[Ocfs2-users] Auto reboot when running fio benchmarking

Eric Ren zren at suse.com
Tue Dec 15 19:26:19 PST 2015


Hi,

On Tue, Dec 15, 2015 at 04:20:46PM +0700, Nguyen Xuan. Hai wrote: 
> Hi Eric,
> 
> 1. I am using o2cb cluster stack.
I'm not familiar with o2cb, so I've CCed Joseph; maybe he can
help.
> 2. The scenarios that led to the reboot: random writes with a fixed
> file size. This is an example of these scenarios:
> 
> [global]
> directory=/mnt/fio4G
> filename=fio_data
> invalidate=1
> ioengine=libaio
> direct=1
> ;ramp_time=30
> iodepth=1
> 
> [RandWR-512-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> new_group
> rw=randwrite
> bs=512
> size=4g
> numjobs=1
> group_reporting
> 
> [RandWR-4k-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> new_group
> rw=randwrite
> bs=4k
> size=4g
> numjobs=1
> group_reporting
> 
> [RandWR-64k-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> new_group
> rw=randwrite
> bs=64k
> size=4g
> numjobs=1
> group_reporting
> 
> [RandWR-1m-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> new_group
> rw=randwrite
> bs=1m
> size=4g
> numjobs=1
> group_reporting
Well, we tested ocfs2 with iozone and fio on the pcmk stack, and random
writes were OK. Actually, I'm interested in what you expect ocfs2 to do for you ;-)
> 
> 3. I've attached the log files (kernel log, system log, message
> log). Please take a look.

Thanks. I only scanned the kernel log, and picked out the unusual messages here:
---
 8470 Dec  9 03:25:06 skerlet kernel: [    3.835795] md: md0 stopped.
 8471 Dec  9 03:25:06 skerlet kernel: [    3.836408] md: bind<sda7>
 8472 Dec  9 03:25:06 skerlet kernel: [    3.837104] md: bind<sdb7>
 8473 Dec  9 03:25:06 skerlet kernel: [    3.837742] md: raid0 personality registered for level 0
...
 8531 Dec  9 03:25:06 skerlet kernel: [   11.507721] OCFS2 Node Manager 1.5.0
 8532 Dec  9 03:25:06 skerlet kernel: [   11.550409] OCFS2 DLM 1.5.0
 8533 Dec  9 03:25:06 skerlet kernel: [   11.554791] ocfs2: Registered cluster interface o2cb
 8534 Dec  9 03:25:06 skerlet kernel: [   11.565655] OCFS2 DLMFS 1.5.0
 8535 Dec  9 03:25:06 skerlet kernel: [   11.565778] OCFS2 User DLM kernel interface loaded
 8536 Dec  9 03:25:06 skerlet kernel: [   12.205284] fuse init (API version 7.18)
 8537 Dec  9 03:25:06 skerlet kernel: [   13.198880] RPC: Registered named UNIX socket transport module.
 8538 Dec  9 03:25:06 skerlet kernel: [   13.198883] RPC: Registered udp transport module.
 8539 Dec  9 03:25:06 skerlet kernel: [   13.198884] RPC: Registered tcp transport module.
 8540 Dec  9 03:25:06 skerlet kernel: [   13.198885] RPC: Registered tcp NFSv4.1 backchannel transport module.
 8541 Dec  9 03:25:06 skerlet kernel: [   13.473735] Installing knfsd (copyright (C) 1996 okir at monad.swb.de).
 8542 Dec  9 03:25:07 skerlet kernel: [   13.626296] svc: failed to register lockdv1 RPC service (errno 97).
 8543 Dec  9 03:25:07 skerlet kernel: [   13.626382] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
 8544 Dec  9 03:25:07 skerlet kernel: [   13.643385] NFSD: starting 90-second grace period
 8545 Dec  9 03:25:08 skerlet kernel: [   14.760868] r8169 0000:02:00.0: eth0: link up
 8546 Dec  9 03:25:08 skerlet kernel: [   14.761229] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
 8547 Dec  9 03:25:10 skerlet kernel: [   17.022745] sshd (1835): /proc/1835/oom_adj is deprecated, please use /proc/1835/oom_score_adj instead.
 8548 Dec  9 03:25:18 skerlet kernel: [   25.293365] eth0: no IPv6 routers present
 8549 Dec  9 03:26:03 skerlet kernel: [   70.179116] OCFS2 1.5.0
 8550 Dec  9 03:26:03 skerlet kernel: [   70.259140] o2dlm: Joining domain 7CA3345642C24049B6CE5DE419528F3C ( 1 ) 1 nodes
 8551 Dec  9 03:26:03 skerlet kernel: [   70.259697] ocfs2: Slot 0 on device (253,0) was already allocated to this node!
 8552 Dec  9 03:26:03 skerlet kernel: [   70.262011] ocfs2: File system on device (253,0) was not unmounted cleanly, recovering it.
 8553 Dec  9 03:26:03 skerlet kernel: [   70.275936] ocfs2: Mounting device (253,0) on (node 1, slot 0) with ordered data mode.
 8554 Dec  9 03:37:27 skerlet kernel: [  752.501739] o2dlm: Leaving domain 7CA3345642C24049B6CE5DE419528F3C
 8555 Dec  9 03:37:28 skerlet kernel: [  753.803611] ocfs2: Unmounting device (253,0) on (node 1)
 8556 Dec  9 03:38:08 skerlet kernel: [  792.852741] (o2hb-1BF749BE01,2185,0):o2hb_check_own_slot:590 ERROR: Another node is heartbeating on device (md0): expected(1:0x0, 0x313045444f4e49), ondisk(59:0x0, 0x313045444f4e49)
 8557 Dec  9 03:38:16 skerlet kernel: [  800.860197] o2dlm: Joining domain 1BF749BE01CC449189BB493D549B8D31 ( 1 ) 1 nodes
 8558 Dec  9 03:38:16 skerlet kernel: [  800.862930] JBD2: no valid journal superblock found
 8559 Dec  9 03:38:16 skerlet kernel: [  800.862936] (mount.ocfs2,2184,0):ocfs2_journal_wipe:1045 ERROR: status = -22
 8560 Dec  9 03:38:16 skerlet kernel: [  800.862940] (mount.ocfs2,2184,0):ocfs2_check_volume:2465 ERROR: status = -22
 8561 Dec  9 03:38:16 skerlet kernel: [  800.862943] (mount.ocfs2,2184,0):ocfs2_check_volume:2527 ERROR: status = -22
 8562 Dec  9 03:38:16 skerlet kernel: [  800.862946] (mount.ocfs2,2184,0):ocfs2_mount_volume:1903 ERROR: status = -22
 8563 Dec  9 03:38:20 skerlet kernel: [  804.894651] o2dlm: Leaving domain 1BF749BE01CC449189BB493D549B8D31
 8564 Dec  9 03:38:20 skerlet kernel: [  804.894865] ocfs2: Unmounting device (9,0) on (node 1)
 8565 Dec  9 03:38:20 skerlet kernel: [  804.894871] (mount.ocfs2,2184,2):ocfs2_fill_super:1230 ERROR: status = -22
 8566 Dec  9 03:38:37 skerlet kernel: [  822.242693] o2dlm: Joining domain 1BF749BE01CC449189BB493D549B8D31 ( 1 ) 1 nodes
 8567 Dec  9 03:38:41 skerlet kernel: [  826.278128] o2dlm: Leaving domain 1BF749BE01CC449189BB493D549B8D31
 8568 Dec  9 03:45:55 skerlet kernel: [ 1259.696919] o2dlm: Joining domain BD60713F60434B938DD07321526907DE ( 1 ) 1 nodes
 8569 Dec  9 03:45:55 skerlet kernel: [ 1259.707796] JBD2: Ignoring recovery information on journal
 8570 Dec  9 03:45:55 skerlet kernel: [ 1259.715268] ocfs2: Mounting device (9,0) on (node 1, slot 0) with ordered data mode.
 8571 Dec  9 03:48:10 skerlet kernel: [ 1394.248484] o2dlm: Joining domain 7CA3345642C24049B6CE5DE419528F3C ( 1 ) 1 nodes
 8572 Dec  9 03:48:10 skerlet kernel: [ 1394.257546] ocfs2: Mounting device (253,0) on (node 1, slot 0) with ordered data mode.
 8573 Dec  9 05:47:53 skerlet kernel: [ 8560.370451] o2dlm: Leaving domain 7CA3345642C24049B6CE5DE419528F3C
 8574 Dec  9 05:47:54 skerlet kernel: [ 8561.615904] ocfs2: Unmounting device (253,0) on (node 1)
 8575 Dec  9 05:58:55 skerlet kernel: [ 9221.153369] INFO: task flush-9:0:2319 blocked for more than 120 seconds.
 8576 Dec  9 05:58:55 skerlet kernel: [ 9221.153373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...backtrace...
---
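
A filter along the following lines can pull messages like the ones above out
of a large kernel log. It is only a minimal sketch: the function name
scan_kernel_log and the pattern list are assumptions based on the messages
quoted in this thread, not a complete set of OCFS2 failure signatures.

```shell
#!/bin/sh
# Print, with line numbers, kernel-log lines that usually deserve a closer look.
# The patterns are an assumption derived from this thread; extend as needed.
scan_kernel_log() {
    # $1: path to a saved kernel log (hypothetical example: kern.log)
    grep -nE 'ERROR:|JBD2:|not unmounted cleanly|blocked for more than|md: .* stopped' "$1"
}
```

Calling "scan_kernel_log kern.log" prints each matching line prefixed with its
line number in the file, which makes it easy to jump to the surrounding
context in an editor.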
1. Is device (253,0) the md0 device? If so, "md: md0 stopped" may be the cause.
2. I don't think md raid0 can be used as a shared disk for ocfs2. If you run mkfs.ocfs2 with the local option, it should be fine.
   IOW, if you want ocfs2 to show its cluster ability, you shouldn't use md raid0 as the shared disk. If you do, it's no
   surprise that the write case fails, because native md has no clustering ability. Have a try, and let us know.
3. AFAIK, a cluster md feature (part of md) is coming soon to make that possible.
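
To settle point 1, the (major,minor) pairs from the log can be resolved to
device names through sysfs. The sketch below assumes a Linux system with
/sys mounted; the function name dev_name and the SYSFS_BLOCK_DIR override are
hypothetical and exist only so the lookup can be tried against a fake
directory tree. Block major 9 is the md driver, so (9,0) should be md0;
(253,0) is probably a device-mapper (LVM) volume, but that is an assumption
until checked on the machine itself.

```shell
#!/bin/sh
# Resolve a (major,minor) pair from the kernel log to a block device name.
# SYSFS_BLOCK_DIR is a hypothetical override for testing; the default is sysfs.
dev_name() {
    # $1: major number, $2: minor number
    uevent="${SYSFS_BLOCK_DIR:-/sys/dev/block}/$1:$2/uevent"
    if [ -r "$uevent" ]; then
        # The uevent file holds a line such as DEVNAME=md0 or DEVNAME=dm-0.
        sed -n 's/^DEVNAME=//p' "$uevent"
    else
        echo "no block device with number $1:$2" >&2
        return 1
    fi
}

# Example calls for the two pairs seen in the log:
#   dev_name 253 0
#   dev_name 9 0
```

For point 2, the local option is, to the best of my knowledge, "-M local" in
mkfs.ocfs2; please check the man page of the installed ocfs2-tools version
before reformatting anything.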

Thanks,
Eric

> 
> Thank you so much,
> Hai Nguyen
> 
> 
> On 12/15/2015 4:09 PM, Eric Ren wrote:
> >Hi,
> >
> >On Thu, Dec 03, 2015 at 03:19:52PM +0700, Nguyen Xuan. Hai wrote:
> >>Hi all,
> >>
> >>I'm benchmarking an OCFS2 file system on LVM with the fio tool.
> >>In some scenarios, the computer reboots automatically after a few
> >>minutes of running. These scenarios affect the OCFS2 file system
> >>only (there is no problem with ext3).
> >>
> >>We tried to fix it by adding the "--debug" option to the fio command
> >>(example: fio RandWR-ASync-IOdepth1-FixFileSize
> >>--output=RandWR-ASync-IOdepth1-FixFileSize.out --debug=io).
> >>Some scenarios then run successfully without rebooting, but others
> >>still fail.
> >Sorry for the late reply. Could you provide more information? Such as:
> >1. Which cluster stack were you using, o2cb or pcmk? If pcmk, an ocfs2 RA monitor timeout
> >    will trigger fencing, i.e. a reboot. I have not experienced reboots when using o2cb, and am
> >    wondering if o2cb has a similar fencing mechanism. A kernel panic may also cause
> >    a reboot sometimes.
> >2. What scenarios led to the reboot?
> >3. All logs: kernel logs, plus pacemaker logs if pcmk.
> >
> >Thanks,
> >Eric
> >>We also tried upgrading the Linux kernel from 3.4.34 to 3.10.65 (after
> >>reading https://oss.oracle.com/pipermail/ocfs2-users/2014-February/006130.html).
> >>Some scenarios then run successfully without rebooting, but others
> >>still fail.
> >>
> >>This is the content of the file "RandWR-ASync-IOdepth1-FixFileSize":
> >>[global]
> >>directory=/mnt/fio4G
> >>filename=fio_data
> >>invalidate=1
> >>ioengine=libaio
> >>direct=1
> >>;ramp_time=30
> >>iodepth=1
> >>
> >>[RandWR-512-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=512
> >>size=4g
> >>numjobs=1
> >>group_reporting
> >>
> >>[RandWR-4k-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=4k
> >>size=4g
> >>numjobs=1
> >>group_reporting
> >>
> >>[RandWR-64k-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=64k
> >>size=4g
> >>numjobs=1
> >>group_reporting
> >>
> >>[RandWR-1m-ASync-Depth1-Thread1-NoGrp-4G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=1m
> >>size=4g
> >>numjobs=1
> >>group_reporting
> >>
> >>[RandWR-512-ASync-Depth1-Thread4-Grp-1G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=512
> >>size=1g
> >>numjobs=4
> >>group_reporting
> >>
> >>[RandWR-4k-ASync-Depth1-Thread4-Grp-1G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=4k
> >>size=1g
> >>numjobs=4
> >>group_reporting
> >>
> >>[RandWR-64k-ASync-Depth1-Thread4-Grp-1G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=64k
> >>size=1g
> >>numjobs=4
> >>group_reporting
> >>
> >>[RandWR-1m-ASync-Depth1-Thread4-Grp-1G-Fix]
> >>new_group
> >>rw=randwrite
> >>bs=1m
> >>size=1g
> >>numjobs=4
> >>group_reporting
> >>
> >>Could you help me find out the reason?
> >>
> >>Thanks and Best regards,
> >>
> >>-- 
> >>=====================================================================
> >>Nguyen Xuan Hai (Mr)
> >>
> >>Toshiba Software Development (Vietnam) Co.,Ltd
> >>
> >>=====================================================================
> >>
> >>-- 
> >>This mail was scanned by BitDefender
> >>For more information please visit http://www.bitdefender.com
> >>_______________________________________________
> >>Ocfs2-users mailing list
> >>Ocfs2-users at oss.oracle.com
> >>https://oss.oracle.com/mailman/listinfo/ocfs2-users
> >
> 
> -- 
> =====================================================================
> Nguyen Xuan Hai (Mr)
> 
> Toshiba Software Development (Vietnam) Co.,Ltd
> 
> =====================================================================
> 

