[Ocfs2-users] input / out error on some nodes

Eric Ren zren at suse.com
Mon Oct 26 19:03:50 PDT 2015


Hi,

Did you test on a pure ocfs2 volume, or with ceph RBD?

I tried your steps on my side, just without ceph RBD, and didn't see your
issue (the n1/n2 prompts below show which node runs each command, in order):
n1:~ # mkdir /mnt/shared/test1
n1:~ # cd /mnt/shared/test1/

n2:~ # mv /mnt/shared/test1/ /mnt/shared/test2

n1:/mnt/shared/test1 # ll /mnt/shared/
drwxr-xr-x 2 root root 3896 Oct 26 21:18 lost+found
drwxr-xr-x 2 root root 3896 Oct 27 09:46 test2

n2:~ # ll /mnt/shared/
drwxr-xr-x 2 root root 3896 Oct 26 21:18 lost+found
drwxr-xr-x 2 root root 3896 Oct 27 09:46 test2

Hope you can further isolate your problem. Again, first make sure your
ocfs2 cluster is in good condition!
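
A quick sanity check could look something like this (just a sketch, assuming
the standard o2cb init script and ocfs2-tools; adjust to your distro):

# /etc/init.d/o2cb status           # is the cluster stack and heartbeat online?
# mounted.ocfs2 -f                  # which nodes have the volume mounted?
# dmesg | grep -iE 'o2net|o2dlm'    # any connection or DLM errors?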
BTW, please respond to what I asked in my previous email if possible ;-)

Thanks,
Eric

On 10/26/15 16:28, gjprabu wrote:
>
> Hi Eric,
>
> We identified the issue. When we access the same directory 
> simultaneously from different nodes, we get an i/o error. Normally a 
> cluster filesystem should handle this, but in our case it's not 
> working. ocfs2 version: ocfs2-tools-1.8.0-16.
>
>
> Example:
>
> Node1 : cd /home/downloads/test
>
> Node2 : mv /home/downloads/test /home/downloads/test1
>
>
> Node1
>
> ls -al /home/downloads/
>
> d?????????   ? ?     ?         ?            ?   test1
>
>
> Node2
>
> ls -al /home/downloads/
>
> drwxr-xr-x    2 root  root  3.9K Oct 26 12:06 test1
>
>
>
> Regards
> Prabu
>
>
>
> ---- On Mon, 26 Oct 2015 08:10:06 +0530 *Eric Ren <zren at suse.com>* 
> wrote ----
>
>     Hi,
>
>     On 10/22/15 21:00, gjprabu wrote:
>
>         Hi Eric,
>
>         Thanks for your reply. We are still facing the same issue. We
>         found the dmesg logs below; these are expected, since we took
>         node1 down and brought it back up ourselves, which is what the
>         logs show. Other than that we found no error messages. We also
>         have a problem while unmounting: the umount process goes into
>         "D" state, and fsck fails with "fsck.ocfs2: I/O error". If you
>         need us to run any other command, please let me know.
>
>     1. System log over boots:
>     # journalctl --list-boots
>     If there is just one boot record, please see "man journald.conf"
>     for how to configure saving system logs across boots. Then you can
>     use "journalctl -b xxx" to see the log of any specific boot.
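>
>     For example, a minimal sketch assuming systemd defaults: set
>     "Storage=persistent" under [Journal] in /etc/systemd/journald.conf,
>     then:
>
>         # mkdir -p /var/log/journal
>         # systemctl restart systemd-journald
>         # journalctl --list-boots
>         # journalctl -b -1        # log of the previous boot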
>
>     I can't tell exactly which steps lead to that error message. It's
>     better to sort out your problems starting from a clean state.
>
>     2. The umount issue may be caused by a cluster in bad condition,
>     e.g. communication between nodes hanging.
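>
>     For instance (just a sketch; the address below is one node from
>     your posted cluster.conf, and 7777 is your configured o2net port):
>
>         # ss -tn | grep 7777              # established o2net connections
>         # nc -zv 192.168.113.42 7777      # is the other node reachable?
>
>     A hung o2net connection would fit your umount stuck in "D" state.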
>
>     3. Please use the device rather than the mount point when running
>     fsck.ocfs2.
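>
>     For example (a sketch; /dev/rbd0 is only a placeholder, use the
>     device your volume actually lives on, and unmount it first):
>
>         # findmnt /home/build/downloads    # shows the backing device
>         # umount /home/build/downloads
>         # fsck.ocfs2 -fy /dev/rbd0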
>
>     4. Did you build this CEPH RBD + ocfs2 setup on a cluster in good
>     condition? It's better to test the cluster more and make sure it is
>     healthy before working on it.
>
>
>     Thanks,
>     Eric
>
>         *ocfs2 version*
>         debugfs.ocfs2 1.8.0
>
>         *# cat /etc/sysconfig/o2cb*
>         #
>         # This is a configuration file for automatic startup of the O2CB
>         # driver.  It is generated by running /etc/init.d/o2cb configure.
>         # On Debian based systems the preferred method is running
>         # 'dpkg-reconfigure ocfs2-tools'.
>         #
>
>         # O2CB_STACK: The name of the cluster stack backing O2CB.
>         O2CB_STACK=o2cb
>
>         # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>         O2CB_BOOTCLUSTER=ocfs2
>
>         # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is
>         considered dead.
>         O2CB_HEARTBEAT_THRESHOLD=31
>
>         # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection
>         is considered dead.
>         O2CB_IDLE_TIMEOUT_MS=30000
>
>         # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive
>         packet is sent
>         O2CB_KEEPALIVE_DELAY_MS=2000
>
>         # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection
>         attempts
>         O2CB_RECONNECT_DELAY_MS=2000
>
>         *# fsck.ocfs2 -fy /home/build/downloads/*
>         fsck.ocfs2 1.8.0
>         fsck.ocfs2: I/O error on channel while opening
>         "/zoho/build/downloads/"
>
>         _*dmesg logs*_
>
>         [ 4229.886284] o2dlm: Joining domain
>         A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
>         [ 4251.437451] o2dlm: Node 3 joins domain
>         A895BC216BE641A8A7E20AA89D57E051 ( 3 5 ) 2 nodes
>         [ 4267.836392] o2dlm: Node 1 joins domain
>         A895BC216BE641A8A7E20AA89D57E051 ( 1 3 5 ) 3 nodes
>         [ 4292.755589] o2dlm: Node 2 joins domain
>         A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 5 ) 4 nodes
>         [ 4306.262165] o2dlm: Node 4 joins domain
>         A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
>         [316476.505401]
>         (kworker/u192:0,95923,0):dlm_do_assert_master:1717 ERROR:
>         Error -112 when sending message 502 (key 0xc3460ae7) to node 1
>         [316476.505470] o2cb: o2dlm has evicted node 1 from domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316480.437231] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316480.442389] o2cb: o2dlm has evicted node 1 from domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316480.442412]
>         (kworker/u192:0,95923,20):dlm_begin_reco_handler:2765
>         A895BC216BE641A8A7E20AA89D57E051: dead_node previously set to
>         1, node 3 changing it to 1
>         [316480.541237] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316480.541241] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316485.542733] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316485.542740] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316485.542742] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316490.544535] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316490.544538] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316490.544539] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316495.546356] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316495.546362] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316495.546364] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316500.548135] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316500.548139] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316500.548140] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316505.549947] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316505.549951] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316505.549952] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316510.551734] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316510.551739] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316510.551740] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316515.553543] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316515.553547] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316515.553548] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316520.555337] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316520.555341] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316520.555343] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316525.557131] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316525.557136] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316525.557153] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316530.558952] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316530.558955] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316530.558957] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [316535.560781] o2dlm: Begin recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051 for node 1
>         [316535.560789] o2dlm: Node 3 (he) is the Recovery Master for
>         the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
>         [316535.560792] o2dlm: End recovery on domain
>         A895BC216BE641A8A7E20AA89D57E051
>         [319419.525609] o2dlm: Node 1 joins domain
>         A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
>
>
>
>         *ps -auxxxxx | grep umount*
>         root     32083 21.8  0.0 125620  2828 pts/14   D+   19:37  
>         0:18 umount /home/build/repository
>         root     32196  0.0  0.0 112652  2264 pts/8    S+   19:38  
>         0:00 grep --color=auto umount
>
>
>         *cat /proc/32083/stack*
>         [<ffffffff8132ad7d>] o2net_send_message_vec+0x71d/0xb00
>         [<ffffffff81352148>]
>         dlm_send_remote_unlock_request.isra.2+0x128/0x410
>         [<ffffffff813527db>] dlmunlock_common+0x3ab/0x9e0
>         [<ffffffff81353088>] dlmunlock+0x278/0x800
>         [<ffffffff8131f765>] o2cb_dlm_unlock+0x35/0x50
>         [<ffffffff8131ecfe>] ocfs2_dlm_unlock+0x1e/0x30
>         [<ffffffff812a8776>] ocfs2_drop_lock.isra.29.part.30+0x1f6/0x700
>         [<ffffffff812ae40d>] ocfs2_simple_drop_lockres+0x2d/0x40
>         [<ffffffff8129b43c>] ocfs2_dentry_lock_put+0x5c/0x80
>         [<ffffffff8129b4a2>] ocfs2_dentry_iput+0x42/0x1d0
>         [<ffffffff81204dc2>] __dentry_kill+0x102/0x1f0
>         [<ffffffff81205294>] shrink_dentry_list+0xe4/0x2a0
>         [<ffffffff81205aa8>] shrink_dcache_parent+0x38/0x90
>         [<ffffffff81205b16>] do_one_tree+0x16/0x50
>         [<ffffffff81206e9f>] shrink_dcache_for_umount+0x2f/0x90
>         [<ffffffff811efb15>] generic_shutdown_super+0x25/0x100
>         [<ffffffff811eff57>] kill_block_super+0x27/0x70
>         [<ffffffff811f02a9>] deactivate_locked_super+0x49/0x60
>         [<ffffffff811f089e>] deactivate_super+0x4e/0x70
>         [<ffffffff8120da83>] cleanup_mnt+0x43/0x90
>         [<ffffffff8120db22>] __cleanup_mnt+0x12/0x20
>         [<ffffffff81093ba4>] task_work_run+0xc4/0xe0
>         [<ffffffff81013c67>] do_notify_resume+0x97/0xb0
>         [<ffffffff817d2ee7>] int_signal+0x12/0x17
>         [<ffffffffffffffff>] 0xffffffffffffffff
>
>         Regards
>         Prabu
>
>
>
>
>         ---- On Wed, 21 Oct 2015 08:32:15 +0530 *Eric Ren
>         <zren at suse.com> <mailto:zren at suse.com>* wrote ----
>
>             Hi Prabu,
>
>             I guess others, like me, are not familiar with this setup
>             combining CEPH RBD and OCFS2.
>
>             We'd really like to help you, but I think ocfs2 developers
>             cannot get any information about what happened to ocfs2
>             from your description.
>
>             So, I'm wondering if you can reproduce it and tell us the
>             steps. Once developers can reproduce it, it's likely to be
>             resolved ;-) BTW, any dmesg log about ocfs2, especially the
>             initial error message and stack back trace, will be
>             helpful!
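>
>             For example, something like this could pull the relevant
>             messages out of dmesg (the grep pattern is just a
>             suggestion):
>
>                 # dmesg | grep -iE 'ocfs2|o2dlm|o2net|o2cb'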
>
>             Thanks,
>             Eric
>
>             On 10/20/15 17:29, gjprabu wrote:
>
>                 Hi
>
>                         We are looking forward to your input on this.
>
>                 Regards
>                 Prabu
>
>                 --- On Fri, 09 Oct 2015 12:08:19 +0530 *gjprabu
>                 <gjprabu at zohocorp.com> <mailto:gjprabu at zohocorp.com>*
>                 wrote ----
>
>
>
>
>
>                         Hi All,
>
>                                  Could anybody please help me with this issue?
>
>                         Regards
>                         Prabu
>
>
>
>
>                         ---- On Thu, 08 Oct 2015 12:33:57 +0530
>                         *gjprabu <gjprabu at zohocorp.com
>                         <mailto:gjprabu at zohocorp.com>>* wrote ----
>
>
>
>                             Hi All,
>
>                                    We have servers with OCFS2 mounted
>                             on CEPH RBD. We are facing i/o errors when
>                             moving data within the same disk (copying
>                             does not have any problem). As a temporary
>                             fix we remount the partition and the issue
>                             is resolved, but after some time the
>                             problem reappears. If anybody has faced the
>                             same issue, please help us.
>
>                             Note: We have 5 nodes in total; two nodes
>                             are working fine, while the other nodes
>                             show input/output errors like below.
>
>                             ls -althr
>                             ls: cannot access LITE_3_0_M4_1_TEST:
>                             Input/output error
>                             ls: cannot access LITE_3_0_M4_1_OLD:
>                             Input/output error
>                             total 0
>                             d????????? ? ? ? ? ? LITE_3_0_M4_1_TEST
>                             d????????? ? ? ? ? ? LITE_3_0_M4_1_OLD
>
>                             cluster:
>                                    node_count=5
>                                    heartbeat_mode = local
>                                    name=ocfs2
>
>                             node:
>                                     ip_port = 7777
>                                     ip_address = 192.168.113.42
>                                     number = 1
>                                     name = integ-hm9
>                                     cluster = ocfs2
>
>                             node:
>                                     ip_port = 7777
>                                     ip_address = 192.168.112.115
>                                     number = 2
>                                     name = integ-hm2
>                                     cluster = ocfs2
>
>                             node:
>                                     ip_port = 7777
>                                     ip_address = 192.168.113.43
>                                     number = 3
>                                     name = integ-ci-1
>                                     cluster = ocfs2
>                             node:
>                                     ip_port = 7777
>                                     ip_address = 192.168.112.217
>                                     number = 4
>                                     name = integ-hm8
>                                     cluster = ocfs2
>                             node:
>                                     ip_port = 7777
>                                     ip_address = 192.168.112.192
>                                     number = 5
>                                     name = integ-hm5
>                                     cluster = ocfs2
>
>
>                             Regards
>                             Prabu
>
>
>
>
>
>
>
>
>
>
