[rds-devel] FW: RDS -- hanging kernel
Richard Frank
richard.frank at oracle.com
Tue Apr 27 07:26:52 PDT 2010
Hi Andy,

RDS should not tear down connections that still have sends queued for
delivery, even if the connection has been unconnected for some period.
Today Oracle will wait for a very long time (which is customer tunable)
before declaring that a node is dead. RDS in V1 used to destroy
connections, and it was a pain to keep the "RDS timeout period" set up
correctly so that RDS would not destroy a connection underneath Oracle.

RDS must continue trying - forever - to deliver queued sends. If the
client gets tired of waiting, it can cancel any outstanding sends, which
lets RDS know that it is safe to destroy a connection (no more sends to
deliver) if it chooses to.
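
To make the rule above concrete, here is a minimal Python sketch. It is
not RDS code: the Connection class and may_destroy() are hypothetical
names used only for illustration, and the model assumes the only inputs
that matter are "is it connected" and "are sends still queued".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical stand-in for an RDS connection (illustration only)."""
    peer: str
    connected: bool = False
    send_queue: deque = field(default_factory=deque)

    def queue_send(self, msg: bytes) -> None:
        self.send_queue.append(msg)

    def cancel_sends(self) -> int:
        """Client gave up waiting: drop all queued sends, return the count."""
        dropped = len(self.send_queue)
        self.send_queue.clear()
        return dropped


def may_destroy(conn: Connection) -> bool:
    """Safe to destroy only when unconnected AND nothing is left to deliver.

    Deliberately ignores how long the connection has been down.
    """
    return not conn.connected and not conn.send_queue


conn = Connection(peer="192.0.2.10")  # documentation address, never connects
conn.queue_send(b"hello")
print(may_destroy(conn))  # False: a send is still queued
conn.cancel_sends()
print(may_destroy(conn))  # True: the client cancelled, nothing to deliver
```

Note that elapsed time never enters the decision; only the client's
cancel makes the connection eligible for destruction.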

There is another behavior which overlaps with connection destroying: it
is possible for an RDS client to make up any "IP" and attempt to send to
it. The result is an RDS connection that will never complete (no real
destination). A misbehaving client can create a DoS by spinning, making
up IPs and flooding RDS with bad connections that have sends queued!

I think we have a bug filed for this. I'll go look, and file one if
needed.

RDS should somehow validate the IP before allowing a new connection to
form, and deny the initial send if the IP is not valid?

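
A rough sketch of that idea in Python follows. How an address would
actually be validated is left open above, so the is_valid_peer callback
is a placeholder assumption, and ConnectionTable is a hypothetical name
rather than anything in RDS.

```python
from typing import Callable, Dict, List


class ConnectionTable:
    """Hypothetical table of per-destination send queues (illustration only)."""

    def __init__(self, is_valid_peer: Callable[[str], bool]) -> None:
        # is_valid_peer is an assumed, pluggable check; the real validation
        # mechanism is not specified in this thread.
        self._is_valid_peer = is_valid_peer
        self._queues: Dict[str, List[bytes]] = {}

    def send(self, dest: str, msg: bytes) -> bool:
        """Queue msg for dest; refuse to create state for an invalid peer."""
        if dest not in self._queues:
            if not self._is_valid_peer(dest):
                return False  # deny the initial send, form no connection
            self._queues[dest] = []
        self._queues[dest].append(msg)
        return True

    def connection_count(self) -> int:
        return len(self._queues)


known_peers = {"192.0.2.10"}
table = ConnectionTable(lambda ip: ip in known_peers)
print(table.send("192.0.2.10", b"ok"))       # True
print(table.send("198.51.100.77", b"junk"))  # False
print(table.connection_count())              # 1
```

Because the check runs before any per-destination state is created, a
client spinning through made-up addresses leaves nothing queued behind.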
Andy Grover wrote:
> On 04/26/2010 02:58 PM, Tang, Changqing wrote:
>
>> OK, thanks, Andy. How can we get this latest development code to test?
>>
>
> The current devel branch is here:
>
> http://www.openfabrics.org/git/?p=~agrover/ofed_1_5/linux-2.6.git;a=shortlog;h=devel
>
> This is what we're working on for OFED 1.6. If you're interested in the
> connection-killing stuff, you should be able to pull that patch out and
> apply it to 1.5.1 relatively easily. Look for "RDS: Destroy unconnected
> connections after 5 minutes".
>
> Regards -- Andy
>
>
>> --CQ
>>
>> -----Original Message-----
>> From: Andy Grover [mailto:andy.grover at oracle.com]
>> Sent: Monday, April 26, 2010 4:41 PM
>> To: Tang, Changqing
>> Cc: Tina Yang; RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>
>> On 04/23/2010 03:14 PM, Tang, Changqing wrote:
>>
>>> Tina, if one of the nodes is down and never recovers, the RDS IB
>>> connection to this node is down and never reconnected. What about
>>> the messages on other nodes destined for this down node? Do they
>>> sit in RDS forever or get dropped after some time?
>>>
>> Yes, the current behavior is to retry forever. See this bug:
>>
>> https://bugs.openfabrics.org/show_bug.cgi?id=1355
>>
>> so we would like RDS to get smarter. There's currently a patch in the
>> devel branch to destroy unconnected connections after 5 minutes.
>>
>> Regards -- Andy
>>
>>
>>> Thank you.
>>>
>>> --CQ
>>>
>>> -----Original Message-----
>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>> Sent: Saturday, April 10, 2010 12:03 AM
>>> To: Tang, Changqing
>>> Cc: Andy Grover; RDS Devel
>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>
>>> Can you put the core dump on an FTP site where I can access it?
>>>
>>>
>>>
>>> Tang, Changqing wrote:
>>>
>>>> Tina, I applied the patch in bug 2002 to rdma.c and rebuilt RDS.
>>>> It does NOT work for me; the Pallas MPI test still hangs the
>>>> first node at almost the same test location.
>>>>
>>>> Any idea?
>>>>
>>>> --CQ
>>>>
>>>> -----Original Message-----
>>>> From: Tang, Changqing
>>>> Sent: Friday, April 09, 2010 1:14 PM
>>>> To: 'Tina Yang'
>>>> Cc: Andy Grover; RDS Devel
>>>> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>>>>
>>>> OK, the one in bug 2002 is newer than OFED 1.5.1. Is this the one
>>>> to fix my problem?
>>>>
>>>> --CQ
>>>>
>>>> -----Original Message-----
>>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>>> Sent: Friday, April 09, 2010 12:49 PM
>>>> To: Tang, Changqing
>>>> Cc: Andy Grover; RDS Devel
>>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>>
>>>> Looks like the signature of a known bug. Go fetch the following
>>>> recently submitted patches from the kernel netdev archive:
>>>>
>>>>   [rds-devel][PATCH 06/13] RDS: Fix send locking issue
>>>>   [rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
>>>>   [rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
>>>>
>>>> and https://bugs.openfabrics.org/show_bug.cgi?id=2002
>>>>
>>>>
>>>>
>>>> Tang, Changqing wrote:
>>>>
>>>>
>>>>> Tina, I finally got the vmcore for the hanging kernel. I ran the
>>>>> MPI Pallas test over RDS on two nodes, with 5 ranks on the first
>>>>> node and 6 ranks on the second node.
>>>>>
>>>>> After running for a while, during the MPI_Allreduce() test, the
>>>>> first node hangs. We managed to get the vmcore from this hanging
>>>>> node.
>>>>>
>>>>> It shows the following:
>>>>>
>>>>> crash> bt
>>>>> PID: 9353  TASK: ffff810416f8c7b0  CPU: 1  COMMAND: "pmb.x.32"
>>>>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>>>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>>>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>>>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>>>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>>>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>>>>     [exception RIP: .text.lock.spinlock+17]
>>>>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>>>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>>>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>>>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>>>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>>>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>> --- <exception stack> ---
>>>>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>>>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>>>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>>>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>>>>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>>>>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>>>>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>>>>>
>>>>> What else do I need to provide you?
>>>>>
>>>>> --CQ
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>
>
> _______________________________________________
> rds-devel mailing list
> rds-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/rds-devel
>