[rds-devel] FW: RDS -- how to detect peer is gone ?

Tang, Changqing changquing.tang at hp.com
Wed Mar 31 21:40:17 PDT 2010


Andy,
        Thank you. I will try to open a bug and provide a patch if I can.

        After reading the rds_recvmsg() function in recv.c (RDS source code), I find that the handling of msg.msg_controllen does not follow the Linux recvmsg() man page.

        The Linux recvmsg() man page says that, upon return from recvmsg(), msg.msg_controllen should contain the length of the control message sequence. So if there is no control message, msg_controllen should be set to zero.

        However, in the rds_recvmsg() code, when an RDMA notification control message is received, put_cmsg() is called on the 'msghdr'. In turn, put_cmsg() just advances msg_control to the next control message slot and decreases msg_controllen to the size of the remaining space. Eventually msg_controllen reaches zero (if the input length is a multiple of the control message length). The same happens when receiving an RDS_CMSG_RDMA_DEST control message.
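
        To make that bookkeeping concrete, here is a simplified user-space model of the behaviour described above. It is only a sketch, not the kernel source: 'struct ctl_cursor' and model_put_cmsg() are hypothetical names made up for illustration, and the model refuses a message that does not fit, where the kernel would truncate it and set MSG_CTRUNC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the two msghdr fields that get updated. */
struct ctl_cursor {
    unsigned char *control;   /* next free byte in the caller's buffer */
    size_t         controllen; /* bytes still available after it */
};

/*
 * Simplified model: write one control message at the cursor, then
 * advance the pointer and shrink the remaining length by the padded
 * size of that message. Returns 0 on success, -1 if it does not fit.
 */
static int model_put_cmsg(struct ctl_cursor *cur, int level, int type,
                          const void *data, size_t len)
{
    struct cmsghdr hdr;
    size_t space = CMSG_SPACE(len);   /* header + payload + padding */

    assert(cur != NULL && data != NULL);
    if (cur->control == NULL || cur->controllen < space)
        return -1;

    memset(cur->control, 0, space);
    memset(&hdr, 0, sizeof hdr);
    hdr.cmsg_len   = CMSG_LEN(len);
    hdr.cmsg_level = level;
    hdr.cmsg_type  = type;

    /* memcpy avoids any assumption about buffer alignment */
    memcpy(cur->control, &hdr, sizeof hdr);
    memcpy(cur->control + CMSG_LEN(0), data, len);

    cur->control    += space;   /* pointer moves to the next slot */
    cur->controllen -= space;   /* length becomes "space left" */
    return 0;
}
```

        In this model, after every call the length field holds the space that is left, not the number of bytes written. The number of bytes written is the original buffer size minus the remaining length.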

        Also, if there is no RDMA notification control message or any other control message, msg_controllen is not touched by the RDS code.

        In other words, upon return from recvmsg(), msg_controllen is not the length of the buffer that the RDS code filled in.
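
        For comparison, the man page semantics can be checked from user space without RDS at all. The sketch below is an assumption-laden illustration: it uses an AF_UNIX datagram socketpair rather than an RDS socket, and controllen_after_plain_recv() is a hypothetical helper name. It sends one datagram with no ancillary data, receives it with a control buffer supplied, and reports msg_controllen after the call.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns msg_controllen as seen after recvmsg() for a datagram that
 * carried no control message, or -1 if any socket call failed.
 */
static long controllen_after_plain_recv(void)
{
    int sv[2];
    char payload[] = "ping";
    char rbuf[16];
    /* union keeps the control buffer aligned for struct cmsghdr */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct iovec iov = { rbuf, sizeof rbuf };
    struct msghdr msg;
    long result = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;

    if (send(sv[0], payload, sizeof payload, 0) == (ssize_t)sizeof payload) {
        memset(&msg, 0, sizeof msg);
        msg.msg_iov        = &iov;
        msg.msg_iovlen     = 1;
        msg.msg_control    = ctl.buf;
        msg.msg_controllen = sizeof ctl.buf;  /* input: buffer size */

        if (recvmsg(sv[1], &msg, 0) >= 0) {
            /* output: length of the control data actually returned */
            assert(msg.msg_controllen <= sizeof ctl.buf);
            result = (long)msg.msg_controllen;
        }
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

        Running the same kind of check against an RDS socket, with and without a pending RDMA notification, would show whether the value returned to the application matches what the man page describes.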

        Thanks for your comment.

--CQ



-----Original Message-----
From: Andy Grover [mailto:andy.grover at oracle.com]
Sent: Wednesday, March 31, 2010 6:43 PM
To: Tang, Changqing
Cc: RDS Devel
Subject: Re: [rds-devel] FW: RDS -- how to detect peer is gone ?

Tang, Changqing wrote:
> We strongly request the ability to run both 32bit and 64bit RDS code on a 64bit kernel.
>
> --CQ

Please open a bug at bugs.openfabrics.org.

This is more likely to get fixed faster if you also attach a patch.

Thanks -- Regards -- Andy

>
> -----Original Message-----
> From: Andy Grover [mailto:andy.grover at oracle.com]
> Sent: Wednesday, March 31, 2010 4:40 PM
> To: Tang, Changqing
> Cc: RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- how to detect peer is gone ?
>
> Tang, Changqing wrote:
>> Why not? Even IB verbs supports both 32bit and 64bit apps.
>
> We support 32bit apps on a 32bit kernel and 64bit apps on a 64bit
> kernel. You are talking about some kind of 32bit userspace on a 64bit
> kernel. Nobody does that.
>
> -- Andy
>
>> --CQ
>>
>> -----Original Message-----
>> From: Andy Grover [mailto:andy.grover at oracle.com]
>> Sent: Wednesday, March 31, 2010 1:33 PM
>> To: Tang, Changqing
>> Cc: RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- how to detect peer is gone ?
>>
>> Tang, Changqing wrote:
>>> Andy, Thank you for your confirmation. When will you have a fix for
>>> this 32bit RDS problem on x86_64 systems?
>>>
>>> --CQ
>> Running 32 bit apps on 64bit kernel is not supported.
>>
>> -- Andy
>>
>>> -----Original Message-----
>>> From: Andy Grover [mailto:andy.grover at oracle.com]
>>> Sent: Tuesday, March 30, 2010 8:00 PM
>>> To: Tang, Changqing
>>> Cc: RDS Devel
>>> Subject: Re: [rds-devel] FW: RDS -- how to detect peer is gone ?
>>>
>>> Tang, Changqing wrote:
>>>> Andy, I looked at 'man cmsg'. 'struct rds_get_mr_args' is always 32
>>>> bytes.  Here is my test code:
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <sys/socket.h>
>>>>
>>>> int main(void)
>>>> {
>>>>     struct cmsghdr *cmsg;
>>>>     /* 32 is the size of struct rds_get_mr_args */
>>>>     char cmsgbuf[CMSG_SPACE(32)];
>>>>
>>>>     cmsg = (struct cmsghdr *)cmsgbuf;
>>>>     cmsg->cmsg_len = CMSG_LEN(32);
>>>>     cmsg->cmsg_type = 0;
>>>>     cmsg->cmsg_level = 1;
>>>>
>>>>     /* offset of the payload from the start of the header */
>>>>     fprintf(stderr, "offset %d\n",
>>>>             (int)((char *)CMSG_DATA(cmsg) - (char *)cmsg));
>>>>     return 0;
>>>> }
>>>>
>>>>
>>>> The offset for 64bit is 16 and for 32bit is 12.
>>>>
>>>> So if my code is 32bit, I put 'struct rds_get_mr_args' at a 12-byte
>>>> offset, but the RDS kernel code will read it from a 16-byte offset.
>>>>
>>>> Am I wrong ?  Thank you again.
>>> Hi CQ,
>>>
>>> First, please always CC rds-devel so this discussion may be archived,
>>> and maybe help someone else in the future.
>>>
>>> Regarding your question -- I think you're correct that 32bit userland
>>> will not work with 64bit kernel.
>>>
>>> Regards -- Andy
>>>
>>>> --CQ
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Andy Grover [mailto:andy.grover at oracle.com]
>>>> Sent: Tuesday, March 30, 2010 1:41 PM
>>>> To: Tang, Changqing; RDS Devel
>>>> Subject: Re: [rds-devel] FW: RDS -- how to detect peer is gone ?
>>>>
>>>> Tang, Changqing wrote:
>>>>> Andy, One simple question: does 32bit rds-rdma code work on an x86_64
>>>>> machine? I noticed that the size of 'struct cmsghdr' differs
>>>>> between 32bit and 64bit. If the kernel code is always 64bit, how
>>>>> does the RDS kernel code figure out that the control message buffer
>>>>> is passed in 32bit format?
>>>>>
>>>>> Am I missing something here?
>>>> See "man cmsg", it describes the various macros that resolve 32/64
>>>> differences.
>>>>
>>>> Regards -- Andy
>>>>
>>>>> Thank you. --CQ
>>>>>
>>>>> -----Original Message-----
>>>>> From: Andy Grover [mailto:andy.grover at oracle.com]
>>>>> Sent: Tuesday, March 16, 2010 5:44 PM
>>>>> To: Tang, Changqing
>>>>> Cc: rds-devel at oss.oracle.com
>>>>> Subject: Re: [rds-devel] FW: RDS -- how to detect peer is gone ?
>>>>>
>>>>> Tang, Changqing wrote:
>>>>>>> [CQ] yes, the node is up and the process may be corrupted. If you
>>>>>>> can extend the rds ping message a little bit to process as
>>>>>>> optional, that would be wonderful.
>>>>>> I don't see why rds's ping functionality as-is is insufficient
>>>>>> for what you want to do.
>>>>>>
>>>>>> [CQ] What do you mean? How can I use the rds ping function as-is to
>>>>>> identify that a process is down?
>>>>> Like I said, if the process doesn't respond but the rds ping does,
>>>>> then you know the machine is alive but the process is not.
>>>>>
>>>>> -- Andy
>



