[rds-devel] rds-rdma call trace bug

Chen Nai chennai1982 at 126.com
Thu Oct 29 00:23:19 PDT 2015


Hi all,
I am using RDS 4.1 as integrated in MLNX_OFED (MLNX_OFED_LINUX-2.3-2.0.0 on CentOS 6.5 x64).
If I increase the length of the message buffer passed to sendmsg on the client (to about 10000 bytes), the server blocks in recvmsg(3, 
After that, rds-ping produces no output and rds-stress reports zero traffic.
I then restarted the following service:
# /etc/init.d/openibd restart
The kernel reported the following errors:
Oct 29 15:13:31 dbnode01 kernel: RDS/IB: connection <172.16.10.102,172.16.10.99,0> dropped
Oct 29 15:13:45 dbnode01 kernel: RDS/IB: connection <172.16.10.102,172.16.10.104,0> dropped
Oct 29 15:13:45 dbnode01 kernel: RDS/IB: connection <172.16.10.102,172.16.10.103,0> dropped
Oct 29 15:14:15 dbnode01 kernel: RDS/IB: device cleanup timed out after  30 secs (refcount=3)


Oct 29 15:16:51 dbnode01 kernel: INFO: task krdsd:2129 blocked for more than 120 seconds.
Oct 29 15:16:51 dbnode01 kernel:      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
Oct 29 15:16:51 dbnode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 29 15:16:51 dbnode01 kernel: krdsd         D 0000000000000001     0  2129      2 0x00000000
Oct 29 15:16:51 dbnode01 kernel: ffff880438b8dd20 0000000000000046 0000000000000000 ffff88082579c400
Oct 29 15:16:51 dbnode01 kernel: 0000000000000000 ffff88082579c470 0000000000000207 0000000000000207
Oct 29 15:16:51 dbnode01 kernel: ffff8804384abaf8 ffff880438b8dfd8 000000000000fbc8 ffff8804384abaf8
Oct 29 15:16:51 dbnode01 kernel: Call Trace:
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0c5058d>] rds_ib_conn_shutdown+0x9d/0x5d0 [rds_rdma]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0c2db30>] ? rds_shutdown_worker+0x0/0x20 [rds]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0c28cd6>] rds_conn_shutdown+0x156/0x200 [rds]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0c2db30>] ? rds_shutdown_worker+0x0/0x20 [rds]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0c2db45>] rds_shutdown_worker+0x15/0x20 [rds]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff81094d20>] worker_thread+0x170/0x2a0
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff81094bb0>] ? worker_thread+0x0/0x2a0
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8109aef6>] kthread+0x96/0xa0
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8109ae60>] ? kthread+0x0/0xa0
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Oct 29 15:16:51 dbnode01 kernel: INFO: task modprobe:7170 blocked for more than 120 seconds.
Oct 29 15:16:51 dbnode01 kernel:      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
Oct 29 15:16:51 dbnode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 29 15:16:51 dbnode01 kernel: modprobe      D 0000000000000001     0  7170   7110 0x00000080
Oct 29 15:16:51 dbnode01 kernel: ffff880438f5bbf8 0000000000000082 0000000000000000 0000000000000082
Oct 29 15:16:51 dbnode01 kernel: ffff880438f5bbc8 ffffffff81065c5e ffff880438f5bb88 ffff880400000003
Oct 29 15:16:51 dbnode01 kernel: ffff880437e83af8 ffff880438f5bfd8 000000000000fbc8 ffff880437e83af8
Oct 29 15:16:51 dbnode01 kernel: Call Trace:
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff81065c5e>] ? try_to_wake_up+0x24e/0x3e0
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff815287b5>] schedule_timeout+0x215/0x2e0
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff81058d53>] ? __wake_up+0x53/0x70
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff81528433>] wait_for_common+0x123/0x180
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff810955b2>] ? queue_work_on+0x42/0x60
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8152854d>] wait_for_completion+0x1d/0x20
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0a5b0be>] cma_remove_one+0x18e/0x210 [rdma_cm]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0b3d60f>] ib_unregister_device+0x4f/0x100 [ib_core]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0b75b06>] mlx4_ib_remove+0xc6/0x300 [mlx4_ib]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0a849e1>] mlx4_remove_device+0x71/0x90 [mlx4_core]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0a84b13>] mlx4_unregister_interface+0x43/0x80 [mlx4_core]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffffa0b8dac1>] __exit_compat+0x15/0x69 [mlx4_ib]
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff810b9454>] sys_delete_module+0x194/0x260
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff810e2067>] ? audit_syscall_entry+0x1d7/0x200
Oct 29 15:16:51 dbnode01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
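
For anyone triaging a log like the one above, a small script can pull out which tasks were reported as hung and which functions they were blocked in. This is only a sketch: the function name `summarize_hung_tasks` and the two regular expressions are illustrative assumptions based on the line formats visible in this log, not part of RDS or MLNX_OFED. Frames marked with "?" are skipped, since the kernel prints that marker for stack entries it could not verify.

```python
import re

# Assumed format of the hung-task warning, taken from the log lines above:
#   "INFO: task <name>:<pid> blocked for more than <N> seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Assumed format of a stack frame line:
#   "[<address>] [? ]symbol+0xoffset/0xsize [module]"
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def summarize_hung_tasks(lines):
    """Group reliable stack frames under the hung task they belong to."""
    tasks = []
    for line in lines:
        hung = HUNG_TASK.search(line)
        if hung:
            # A new hung-task report starts; later frames attach to it.
            tasks.append(
                {"name": hung["name"], "pid": int(hung["pid"]), "frames": []}
            )
            continue
        frame = FRAME.search(line)
        # Skip frames seen before any report and frames marked "?".
        if frame and tasks and not frame["unreliable"]:
            tasks[-1]["frames"].append((frame["symbol"], frame["module"]))
    return tasks
```

Fed the lines of this log, the first reliable frame recorded for krdsd is rds_ib_conn_shutdown in rds_rdma, and for modprobe it is schedule_timeout, on the path from cma_remove_one in rdma_cm.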
