From metze at samba.org  Tue Apr  1 08:19:05 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 10:19:05 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <Z-sDc-0qyfPZz9lv@mini-arch>
References: <cover.1743449872.git.metze@samba.org> <Z-sDc-0qyfPZz9lv@mini-arch>
Message-ID: <39515c76-310d-41af-a8b4-a814841449e3@samba.org>

Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> On 03/31, Stefan Metzmacher wrote:
>> The motivation for this is to remove the SOL_SOCKET limitation
>> from io_uring_cmd_getsockopt().
>>
>> The reason for this limitation is that io_uring_cmd_getsockopt()
>> passes a kernel pointer as optlen to do_sock_getsockopt()
>> and can't reach the ops->getsockopt() path.
>>
>> The first idea would be to change the optval and optlen arguments
>> to the protocol specific hooks also to sockptr_t, as that
>> is already used for setsockopt() and also by do_sock_getsockopt()
>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>
>> But as Linus don't like 'sockptr_t' I used a different approach.
>>
>> @Linus, would that optlen_t approach fit better for you?
> 
> [..]
> 
>> Instead of passing the optlen as user or kernel pointer,
>> we only ever pass a kernel pointer and do the
>> translation from/to userspace in do_sock_getsockopt().
> 
> At this point why not just fully embrace iov_iter? You have the size
> now + the user (or kernel) pointer. Might as well do
> s/sockptr_t/iov_iter/ conversion?

I think that would only be possible if we introduce
proto[_ops].getsockopt_iter() and then convert the implementations
step by step. Doing it all in one go has a lot of potential to break
the uapi. I could try to convert things like socket, ip and tcp myself, but
the rest needs to be converted by the maintainer of the specific protocol,
as it needs to be tested. There are crazy things happening in the existing
implementations, e.g. some getsockopt() implementations use optval as both
an input and an output buffer.

I first tried to convert both optval and optlen of getsockopt to sockptr_t,
and that showed that touching the optval part starts to get complex very soon,
see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
(note it didn't convert everything; I gave up after hitting
sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
more also do both copy_from_user and copy_to_user on optval).
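
To illustrate why an in/out optval does not fit a write-only iterator,
here is a minimal userspace sketch. mock_getsockopt_inout() and
struct mock_assoc_value are hypothetical names, and memcpy() stands in for
copy_from_user()/copy_to_user(); this is not the SCTP code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical option layout: the caller fills in 'id', the handler fills in 'value'. */
struct mock_assoc_value {
	int id;
	int value;
};

/*
 * Hypothetical handler that treats optval as both input and output.
 * memcpy() stands in for copy_from_user()/copy_to_user().
 */
static int mock_getsockopt_inout(void *optval, int *optlen)
{
	struct mock_assoc_value params;

	if (*optlen < (int)sizeof(params))
		return -EINVAL;

	/* Step 1: read the selector from the caller's buffer. */
	memcpy(&params, optval, sizeof(params));
	if (params.id < 0)
		return -EINVAL;

	/* Step 2: write the answer back into the same buffer. */
	params.value = params.id * 10;	/* made-up result */
	memcpy(optval, &params, sizeof(params));
	*optlen = sizeof(params);
	return 0;
}
```

With a destination-only iterator, step 1 would have nothing to read from.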

I also came across one implementation that returned -ERANGE because *optlen was
too short and put the required length into *optlen, which means the returned
*optlen is larger than the optval buffer given from userspace.
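
The -ERANGE behaviour can be sketched in userspace like this.
mock_getsockopt_erange() and the 16-byte value are hypothetical; the point
is only that *optlen ends up larger than the caller's buffer while the
call itself fails.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MOCK_VALUE_LEN 16	/* hypothetical size of the option value */

/*
 * Hypothetical handler: if the buffer is too short it fails with -ERANGE
 * but still reports the required length through *optlen.
 */
static int mock_getsockopt_erange(char *optval, int *optlen)
{
	/* 15 characters plus the terminating NUL make up the 16 bytes. */
	static const char value[MOCK_VALUE_LEN] = "0123456789abcde";

	if (*optlen < MOCK_VALUE_LEN) {
		*optlen = MOCK_VALUE_LEN;	/* larger than the caller's buffer */
		return -ERANGE;
	}

	memcpy(optval, value, MOCK_VALUE_LEN);
	*optlen = MOCK_VALUE_LEN;
	return 0;
}
```

A wrapper that only writes *optlen back on success would silently change
this behaviour.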

Because of all these strange things I tried to do a minimal change
in order to get rid of the io_uring limitation: I only converted
optlen and left optval as is, in order to have a patchset that has
a low risk of causing regressions.

But as an alternative, we could introduce a prototype like this:

         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
                                struct iov_iter *optval_iter);

It would return a non-negative value which can be placed into *optlen,
or a negative value as error, in which case *optlen will not be changed.
optval_iter would get direction ITER_DEST, so it can only be written to.

Implementations could then opt in to the new interface and
allow do_sock_getsockopt() to work for the io_uring case as well,
while all others would still get -EOPNOTSUPP.
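
As a rough userspace sketch of that contract (struct mock_iter,
mock_copy_to_iter(), mock_getsockopt_iter() and mock_wrap_getsockopt_iter()
are all hypothetical stand-ins, not kernel APIs): the handler returns the
number of bytes written or -errno, and the wrapper touches *optlen only on
success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical write-only destination, standing in for an ITER_DEST iov_iter. */
struct mock_iter {
	char *buf;
	size_t count;	/* bytes still available */
};

/* Copy up to 'len' bytes into the iterator and return how many were copied. */
static size_t mock_copy_to_iter(const void *src, size_t len, struct mock_iter *it)
{
	size_t n = len < it->count ? len : it->count;

	memcpy(it->buf, src, n);
	it->buf += n;
	it->count -= n;
	return n;
}

/* Hypothetical handler: returns the number of bytes written or -errno. */
static int mock_getsockopt_iter(int optname, struct mock_iter *it)
{
	int value = 42;	/* made-up option value */

	if (optname != 1)
		return -ENOPROTOOPT;
	if (it->count < sizeof(value))
		return -EINVAL;
	return (int)mock_copy_to_iter(&value, sizeof(value), it);
}

/* Wrapper: *optlen is updated only when the handler succeeds. */
static int mock_wrap_getsockopt_iter(int optname, char *optval, int *optlen)
{
	struct mock_iter it;
	int ret;

	if (*optlen < 0)
		return -EINVAL;

	it.buf = optval;
	it.count = (size_t)*optlen;

	ret = mock_getsockopt_iter(optname, &it);
	if (ret < 0)
		return ret;	/* *optlen left untouched */

	*optlen = ret;
	return 0;
}
```

Note that this shape cannot express "fail, but still report the required
length", nor handlers that read from optval first.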

So what should be the way to go?

metze


From metze at samba.org  Tue Apr  1 08:24:32 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 10:24:32 +0200
Subject: [rds-devel] [RFC PATCH 3/4] net: pass a kernel pointer via
 'optlen_t' to proto[ops].getsockopt() hooks
In-Reply-To: <20250331224946.13899fcf@pumpkin>
References: <cover.1743449872.git.metze@samba.org>
 <d482e207223f434f0d306d3158b2142dceac4631.1743449872.git.metze@samba.org>
 <20250331224946.13899fcf@pumpkin>
Message-ID: <51bb66d4-eaf3-4247-ba11-d793b6f0d56c@samba.org>

Am 31.03.25 um 23:49 schrieb David Laight:
> On Mon, 31 Mar 2025 22:10:55 +0200
> Stefan Metzmacher <metze at samba.org> wrote:
> 
>> The motivation for this is to remove the SOL_SOCKET limitation
>> from io_uring_cmd_getsockopt().
>>
>> The reason for this limitation is that io_uring_cmd_getsockopt()
>> passes a kernel pointer.
>>
>> The first idea would be to change the optval and optlen arguments
>> to the protocol specific hooks also to sockptr_t, as that
>> is already used for setsockopt() and also by do_sock_getsockopt()
>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>
>> But as Linus don't like 'sockptr_t' I used a different approach.
>>
>> Instead of passing the optlen as user or kernel pointer,
>> we only ever pass a kernel pointer and do the
>> translation from/to userspace in do_sock_getsockopt().
>>
>> The simple solution would be to just remove the
>> '__user' from the int *optlen argument, but it
>> seems the compiler doesn't complain about
>> '__user' vs. without it, so instead I used
>> a helper struct in order to make sure everything
>> compiles with a typesafe change.
>>
>> That together with get_optlen() and put_optlen() helper
>> macros make it relatively easy to review and check the
>> behaviour is most likely unchanged.
> 
> I've looked into this before (and fallen down the patch rabbit hole).

Yes, if you want to change the logic at the same time as
changing the kind of argument variable, then it gets messy
quite fast.

> I think the best (final) solution is to pass a validated non-negative
> 'optlen' into all getsockopt() functions and to have them usually return
> either -errno or the modified length.
> This simplifies 99% of the functions.

Yes, maybe not 99%, but a lot.

> The problem case is functions that want to update the length and return
> an error.
> By best solution is to support return values of -errno << 20 | length
> (as well as -errno and length).
> 
> There end up being some slight behaviour changes.
> - Some code tries to 'undo' actions if the length can't be updated.
>    I'm sure this is unnecessary and the recovery path is untested and
>    could be buggy. Provided the kernel data is consistent there is
>    no point trying to get code to recover from EFAULT.
>    The 'length' has been read - so would also need to be readonly
>    or unmapped by a second thread!
> - A lot of getsockopt functions actually treat a negative length as 4.
>    I think this 'bug' needs to preserved to avoid breaking applications.
> 
> The changes are mechanical but very widespread.
> 
> They also give the option of not writing back the length if unchanged.

See my other mail regarding proto[_ops].getsockopt_iter(),
where implementations could be converted step by step.

But we may still need to keep the current proto[_ops].getsockopt()
as proto[_ops].getsockopt_legacy() in order to keep the
insane uapi semantics alive.
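
For reference, one possible reading of the combined "-errno << 20 | length"
return value quoted above, as a userspace sketch. The names
mock_pack_ret()/mock_unpack_ret() and the exact bit layout are my
assumptions; the value is negated after packing to avoid shifting a
negative number, and it assumes err < 2048 and 0 <= len < 2^20 so the
result fits in an int.

```c
#include <assert.h>
#include <errno.h>

#define MOCK_LEN_BITS 20			/* low bits carry the length */
#define MOCK_LEN_MASK ((1 << MOCK_LEN_BITS) - 1)

/* Pack an errno and a length into one negative return value. */
static int mock_pack_ret(int err, int len)
{
	return -((err << MOCK_LEN_BITS) | (len & MOCK_LEN_MASK));
}

/*
 * Decode a return value:
 *   ret >= 0                   -> success, *len = ret
 *   -MOCK_LEN_MASK <= ret < 0  -> plain -errno, no length attached
 *   otherwise                  -> errno in the high bits, length in the low bits
 * Returns 0 on success or the positive errno; *len_valid says whether
 * *len was filled in.
 */
static int mock_unpack_ret(int ret, int *len, int *len_valid)
{
	int v;

	if (ret >= 0) {
		*len = ret;
		*len_valid = 1;
		return 0;
	}

	v = -ret;
	if (v <= MOCK_LEN_MASK) {	/* plain -errno */
		*len_valid = 0;
		return v;
	}

	*len = v & MOCK_LEN_MASK;
	*len_valid = 1;
	return v >> MOCK_LEN_BITS;
}
```

The generic caller would then be the only place that decides whether the
length is written back to userspace.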

metze


From leitao at debian.org  Tue Apr  1 12:17:09 2025
From: leitao at debian.org (Breno Leitao)
Date: Tue, 1 Apr 2025 05:17:09 -0700
Subject: [rds-devel] [RFC PATCH 1/4] net: introduce get_optlen() and
 put_optlen() helpers
In-Reply-To: <156e83128747b2cf7c755bffa68f2519bd255f78.1743449872.git.metze@samba.org>
References: <cover.1743449872.git.metze@samba.org>
 <156e83128747b2cf7c755bffa68f2519bd255f78.1743449872.git.metze@samba.org>
Message-ID: <Z+vZRcbvh6r1fnZL@gmail.com>

Hello Stefan,

On Mon, Mar 31, 2025 at 10:10:53PM +0200, Stefan Metzmacher wrote:
> --- a/include/linux/sockptr.h
> +++ b/include/linux/sockptr.h
> @@ -169,4 +169,26 @@ static inline int check_zeroed_sockptr(sockptr_t src, size_t offset,
>  	return memchr_inv(src.kernel + offset, 0, size) == NULL;
>  }
>  
> +#define __check_optlen_t(__optlen)				\
> +({								\
> +	int __user *__ptr __maybe_unused = __optlen; 		\
> +	BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int));		\
> +})

I am a bit confused about this macro. I understand that its goal
is to check that __optlen is a pointer to an integer, and to fail
the build otherwise.

It is unclear to me if that is what it does. Let's suppose that __optlen
is not an integer pointer. Then:

> int __user *__ptr __maybe_unused = __optlen;

This will generate a compile failure/warning due to the incompatible
pointer types, depending on -Wincompatible-pointer-types.

> BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int));

Then this comparison will always be false, since __ptr is a pointer to int,
and you are comparing the size of what it points to with sizeof(int).
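
A userspace sketch of the two checks, to make the point concrete.
MOCK_USER, MOCK_CHECK_OPTLEN() and mock_use_optlen() are hypothetical
stand-ins: the annotation is assumed to expand to nothing for the compiler,
and _Static_assert (C11) replaces BUILD_BUG_ON().

```c
#include <assert.h>

/* Stand-in for the __user annotation, assumed to expand to nothing. */
#define MOCK_USER

/*
 * Userspace stand-in for __check_optlen_t(); _Static_assert (C11)
 * replaces BUILD_BUG_ON().
 */
#define MOCK_CHECK_OPTLEN(optlen)					\
do {									\
	/* The initialisation is what rejects a non-"int *" argument. */\
	int MOCK_USER *ptr_ = (optlen);					\
	(void)ptr_;							\
	/* *ptr_ has type int, so this condition can never fail. */	\
	_Static_assert(sizeof(*ptr_) == sizeof(int),			\
		       "optlen must point to an int");			\
} while (0)

/* Returns 0 after running the check on a valid "int *". */
static int mock_use_optlen(int *optlen)
{
	MOCK_CHECK_OPTLEN(optlen);
	return 0;
}
```

Passing e.g. a "long *" would already trip the initialisation line, before
the size comparison matters.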


From metze at samba.org  Tue Apr  1 12:22:50 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 14:22:50 +0200
Subject: [rds-devel] [RFC PATCH 1/4] net: introduce get_optlen() and
 put_optlen() helpers
In-Reply-To: <Z+vZRcbvh6r1fnZL@gmail.com>
References: <cover.1743449872.git.metze@samba.org>
 <156e83128747b2cf7c755bffa68f2519bd255f78.1743449872.git.metze@samba.org>
 <Z+vZRcbvh6r1fnZL@gmail.com>
Message-ID: <90334e83-618b-41e0-a35c-9ce8b0d1d990@samba.org>

Hello Breno,

> On Mon, Mar 31, 2025 at 10:10:53PM +0200, Stefan Metzmacher wrote:
>> --- a/include/linux/sockptr.h
>> +++ b/include/linux/sockptr.h
>> @@ -169,4 +169,26 @@ static inline int check_zeroed_sockptr(sockptr_t src, size_t offset,
>>   	return memchr_inv(src.kernel + offset, 0, size) == NULL;
>>   }
>>   
>> +#define __check_optlen_t(__optlen)				\
>> +({								\
>> +	int __user *__ptr __maybe_unused = __optlen; 		\
>> +	BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int));		\
>> +})
> 
> I am a bit confused about this macro. I understand that this macro's
> goal is to check that __optlen is a pointer to an integer, otherwise
> failed to build.
> 
> It is unclear to me if that is what it does. Let's suppose that __optlen
> is not an integer pointer. Then:
> 
>> int __user *__ptr __maybe_unused = __optlen;
> 
> This will generate a compile failure/warning due invalid casting,
> depending on -Wincompatible-pointer-types.
> 
>> BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int));
> 
> Then this comparison will always false, since __ptr is a pointer to int,
> and you are comparing the size of its content with the sizeof(int).

Yes, it is redundant in the first patch; it gets a little more useful in
the 2nd and 3rd patch.

metze


From metze at samba.org  Tue Apr  1 13:37:28 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 15:37:28 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
References: <cover.1743449872.git.metze@samba.org>
 <Z-sDc-0qyfPZz9lv@mini-arch> <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
Message-ID: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>

Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>> On 03/31, Stefan Metzmacher wrote:
>>> The motivation for this is to remove the SOL_SOCKET limitation
>>> from io_uring_cmd_getsockopt().
>>>
>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>> and can't reach the ops->getsockopt() path.
>>>
>>> The first idea would be to change the optval and optlen arguments
>>> to the protocol specific hooks also to sockptr_t, as that
>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>
>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>
>>> @Linus, would that optlen_t approach fit better for you?
>>
>> [..]
>>
>>> Instead of passing the optlen as user or kernel pointer,
>>> we only ever pass a kernel pointer and do the
>>> translation from/to userspace in do_sock_getsockopt().
>>
>> At this point why not just fully embrace iov_iter? You have the size
>> now + the user (or kernel) pointer. Might as well do
>> s/sockptr_t/iov_iter/ conversion?
> 
> I think that would only be possible if we introduce
> proto[_ops].getsockopt_iter() and then convert the implementations
> step by step. Doing it all in one go has a lot of potential to break
> the uapi. I could try to convert things like socket, ip and tcp myself, but
> the rest needs to be converted by the maintainer of the specific protocol,
> as it needs to be tested. As there are crazy things happening in the existing
> implementations, e.g. some getsockopt() implementations use optval as in and out
> buffer.
> 
> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> and that showed that touching the optval part starts to get complex very soon,
> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> (note it didn't converted everything, I gave up after hitting
> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> more are the ones also doing both copy_from_user and copy_to_user on optval)
> 
> I come also across one implementation that returned -ERANGE because *optlen was
> too short and put the required length into *optlen, which means the returned
> *optlen is larger than the optval buffer given from userspace.
> 
> Because of all these strange things I tried to do a minimal change
> in order to get rid of the io_uring limitation and only converted
> optlen and leave optval as is.
> 
> In order to have a patchset that has a low risk to cause regressions.
> 
> But as alternative introducing a prototype like this:
> 
>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>                                 struct iov_iter *optval_iter);
> 
> That returns a non-negative value which can be placed into *optlen
> or negative value as error and *optlen will not be changed on error.
> optval_iter will get direction ITER_DEST, so it can only be written to.
> 
> Implementations could then opt in for the new interface and
> allow do_sock_getsockopt() work also for the io_uring case,
> while all others would still get -EOPNOTSUPP.
> 
> So what should be the way to go?

Ok, I've added the infrastructure for getsockopt_iter, see below,
but the first part I wanted to convert was
tcp_ao_copy_mkts_to_user() and that also reads from userspace before
writing.

So we could go with the optlen_t approach, or we need
logic for ITER_BOTH, or we pass two iov_iters, one with ITER_SRC and one
with ITER_DEST...

So who wants to decide?

Thanks!
metze
---
  include/linux/net.h |  4 +++
  include/net/sock.h  | 64 +++++++++++++++++++++++++++++++++++++++++++++
  net/core/sock.c     | 12 +++++++--
  net/socket.c        | 12 +++++++--
  4 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..ceb9f9ed84b9 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -194,6 +194,10 @@ struct proto_ops {
  				      unsigned int optlen);
  	int		(*getsockopt)(struct socket *sock, int level,
  				      int optname, char __user *optval, int __user *optlen);
+	int		(*getsockopt_iter)(struct socket *sock,
+					   int level,
+					   int optname,
+					   struct iov_iter *optval_iter);
  	void		(*show_fdinfo)(struct seq_file *m, struct socket *sock);
  	int		(*sendmsg)   (struct socket *sock, struct msghdr *m,
  				      size_t total_len);
diff --git a/include/net/sock.h b/include/net/sock.h
index 8daf1b3b12c6..e741b219056e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1249,6 +1249,11 @@ struct proto {
  	int			(*getsockopt)(struct sock *sk, int level,
  					int optname, char __user *optval,
  					int __user *option);
+	int			(*getsockopt_iter)(struct sock *sk,
+						   int level,
+						   int optname,
+						   struct iov_iter *optval_iter);
+
  	void			(*keepalive)(struct sock *sk, int valbool);
  #ifdef CONFIG_COMPAT
  	int			(*compat_ioctl)(struct sock *sk,
@@ -1781,6 +1786,65 @@ int do_sock_setsockopt(struct socket *sock, bool compat, int level,
  int do_sock_getsockopt(struct socket *sock, bool compat, int level,
  		       int optname, sockptr_t optval, sockptr_t optlen);

+#define __generic_wrap_getsockopt_iter(__s, __level,				\
+				       __optname, __optval, __optlen, 		\
+				       __getsockopt_iter) 			\
+do {										\
+	struct iov_iter optval_iter;						\
+	struct kvec optval_kvec;						\
+	int len;								\
+	int err;								\
+										\
+	if (unlikely(__getsockopt_iter == NULL))				\
+		return -EOPNOTSUPP;						\
+										\
+	if (copy_from_sockptr(&len, __optlen, sizeof(len)))			\
+		return -EFAULT;							\
+										\
+	if (len < 0)								\
+		return -EINVAL;							\
+										\
+	if (__optval.is_kernel) {						\
+		if (__optval.kernel == NULL && len != 0)			\
+			return -EFAULT;						\
+										\
+		optval_kvec = (struct kvec) {					\
+			.iov_base = __optval.kernel,				\
+			.iov_len = len,						\
+		};								\
+										\
+		iov_iter_kvec(&optval_iter, ITER_DEST,				\
+			      &optval_kvec, 1, optval_kvec.iov_len);		\
+	} else {								\
+		if (import_ubuf(ITER_DEST, __optval.user, len, &optval_iter))	\
+			return -EFAULT;						\
+	}									\
+										\
+	err = __getsockopt_iter(__s, __level, __optname, &optval_iter);		\
+	if (unlikely(err < 0))							\
+		return err;							\
+										\
+	len = err;								\
+	if (copy_to_sockptr(__optlen, &len, sizeof(len)))			\
+		return -EFAULT;							\
+										\
+	return 0;								\
+} while (0)
+
+static __always_inline
+int sk_wrap_getsockopt_iter(struct sock *sk, int level, int optname, sockptr_t optval, sockptr_t optlen,
+	    int (*getsockopt_iter)(struct sock *sk, int level, int optname, struct iov_iter *optval_iter))
+{
+	__generic_wrap_getsockopt_iter(sk, level, optname, optval, optlen, getsockopt_iter);
+}
+
+static __always_inline
+int sock_wrap_getsockopt_iter(struct socket *sock, int level, int optname, sockptr_t optval, sockptr_t optlen,
+	    int (*getsockopt_iter)(struct socket *sock, int level, int optname, struct iov_iter *optval_iter))
+{
+	__generic_wrap_getsockopt_iter(sock, level, optname, optval, optlen, getsockopt_iter);
+}
+
  int sk_getsockopt(struct sock *sk, int level, int optname,
  		  sockptr_t optval, sockptr_t optlen);
  int sock_gettstamp(struct socket *sock, void __user *userstamp,
diff --git a/net/core/sock.c b/net/core/sock.c
index 323892066def..61625060e724 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3857,9 +3857,17 @@ int sock_common_getsockopt(struct socket *sock, int level, int optname,
  			   char __user *optval, int __user *optlen)
  {
  	struct sock *sk = sock->sk;
-
  	/* IPV6_ADDRFORM can change sk->sk_prot under us. */
-	return READ_ONCE(sk->sk_prot)->getsockopt(sk, level, optname, optval, optlen);
+	struct proto *prot = READ_ONCE(sk->sk_prot);
+
+	if (prot->getsockopt_iter) {
+		return sk_wrap_getsockopt_iter(sk, level, optname,
+					       USER_SOCKPTR(optval),
+					       USER_SOCKPTR(optlen),
+					       prot->getsockopt_iter);
+	}
+
+	return prot->getsockopt(sk, level, optname, optval, optlen);
  }
  EXPORT_SYMBOL(sock_common_getsockopt);

diff --git a/net/socket.c b/net/socket.c
index 9a0e720f0859..792cfd272611 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2335,6 +2335,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
  {
  	int max_optlen __maybe_unused = 0;
  	const struct proto_ops *ops;
+	const struct proto *prot;
  	int err;

  	err = security_socket_getsockopt(sock, level, optname);
@@ -2345,12 +2346,19 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
  		copy_from_sockptr(&max_optlen, optlen, sizeof(int));

  	ops = READ_ONCE(sock->ops);
+	prot = READ_ONCE(sock->sk->sk_prot);
  	if (level == SOL_SOCKET) {
  		err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
-	} else if (unlikely(!ops->getsockopt)) {
+	} else if (ops->getsockopt_iter) {
+		err = sock_wrap_getsockopt_iter(sock, level, optname, optval, optlen,
+					        ops->getsockopt_iter);
+	} else if (ops->getsockopt == sock_common_getsockopt && prot->getsockopt_iter) {
+		err = sk_wrap_getsockopt_iter(sock->sk, level, optname, optval, optlen,
+					      prot->getsockopt_iter);
+	} else if (unlikely(!ops->getsockopt || optlen.is_kernel)) {
  		err = -EOPNOTSUPP;
  	} else {
-		if (WARN_ONCE(optval.is_kernel || optlen.is_kernel,
+		if (WARN_ONCE(optval.is_kernel,
  			      "Invalid argument type"))
  			return -EOPNOTSUPP;

-- 
2.34.1


From metze at samba.org  Tue Apr  1 13:48:58 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 15:48:58 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
References: <cover.1743449872.git.metze@samba.org>
 <Z-sDc-0qyfPZz9lv@mini-arch> <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
Message-ID: <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>

Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>>> On 03/31, Stefan Metzmacher wrote:
>>>> The motivation for this is to remove the SOL_SOCKET limitation
>>>> from io_uring_cmd_getsockopt().
>>>>
>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>>> and can't reach the ops->getsockopt() path.
>>>>
>>>> The first idea would be to change the optval and optlen arguments
>>>> to the protocol specific hooks also to sockptr_t, as that
>>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>>
>>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>>
>>>> @Linus, would that optlen_t approach fit better for you?
>>>
>>> [..]
>>>
>>>> Instead of passing the optlen as user or kernel pointer,
>>>> we only ever pass a kernel pointer and do the
>>>> translation from/to userspace in do_sock_getsockopt().
>>>
>>> At this point why not just fully embrace iov_iter? You have the size
>>> now + the user (or kernel) pointer. Might as well do
>>> s/sockptr_t/iov_iter/ conversion?
>>
>> I think that would only be possible if we introduce
>> proto[_ops].getsockopt_iter() and then convert the implementations
>> step by step. Doing it all in one go has a lot of potential to break
>> the uapi. I could try to convert things like socket, ip and tcp myself, but
>> the rest needs to be converted by the maintainer of the specific protocol,
>> as it needs to be tested. As there are crazy things happening in the existing
>> implementations, e.g. some getsockopt() implementations use optval as in and out
>> buffer.
>>
>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>> and that showed that touching the optval part starts to get complex very soon,
>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>> (note it didn't converted everything, I gave up after hitting
>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>
>> I come also across one implementation that returned -ERANGE because *optlen was
>> too short and put the required length into *optlen, which means the returned
>> *optlen is larger than the optval buffer given from userspace.
>>
>> Because of all these strange things I tried to do a minimal change
>> in order to get rid of the io_uring limitation and only converted
>> optlen and leave optval as is.
>>
>> In order to have a patchset that has a low risk to cause regressions.
>>
>> But as alternative introducing a prototype like this:
>>
>>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>                                 struct iov_iter *optval_iter);
>>
>> That returns a non-negative value which can be placed into *optlen
>> or negative value as error and *optlen will not be changed on error.
>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>
>> Implementations could then opt in for the new interface and
>> allow do_sock_getsockopt() work also for the io_uring case,
>> while all others would still get -EOPNOTSUPP.
>>
>> So what should be the way to go?
> 
> Ok, I've added the infrastructure for getsockopt_iter, see below,
> but the first part I wanted to convert was
> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> writing.
> 
> So we could go with the optlen_t approach, or we need
> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> with ITER_DEST...
> 
> So who wants to decide?

I just noticed that it's even possible in some cases
to pass in a short buffer to optval, but have a longer value in optlen:
hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.

This makes it really hard to believe that trying to use iov_iter for this
is a good idea :-(

Any ideas besides just going with optlen_t?

metze


From leitao at debian.org  Tue Apr  1 15:35:50 2025
From: leitao at debian.org (Breno Leitao)
Date: Tue, 1 Apr 2025 08:35:50 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
References: <cover.1743449872.git.metze@samba.org> <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
Message-ID: <Z+wH1oYOr1dlKeyN@gmail.com>

On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > On 03/31, Stefan Metzmacher wrote:
> > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > from io_uring_cmd_getsockopt().
> > > > > 
> > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > and can't reach the ops->getsockopt() path.
> > > > > 
> > > > > The first idea would be to change the optval and optlen arguments
> > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > 
> > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > 
> > > > > @Linus, would that optlen_t approach fit better for you?
> > > > 
> > > > [..]
> > > > 
> > > > > Instead of passing the optlen as user or kernel pointer,
> > > > > we only ever pass a kernel pointer and do the
> > > > > translation from/to userspace in do_sock_getsockopt().
> > > > 
> > > > At this point why not just fully embrace iov_iter? You have the size
> > > > now + the user (or kernel) pointer. Might as well do
> > > > s/sockptr_t/iov_iter/ conversion?
> > > 
> > > I think that would only be possible if we introduce
> > > proto[_ops].getsockopt_iter() and then convert the implementations
> > > step by step. Doing it all in one go has a lot of potential to break
> > > the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > the rest needs to be converted by the maintainer of the specific protocol,
> > > as it needs to be tested. As there are crazy things happening in the existing
> > > implementations, e.g. some getsockopt() implementations use optval as in and out
> > > buffer.
> > > 
> > > I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > and that showed that touching the optval part starts to get complex very soon,
> > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > (note it didn't converted everything, I gave up after hitting
> > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > 
> > > I come also across one implementation that returned -ERANGE because *optlen was
> > > too short and put the required length into *optlen, which means the returned
> > > *optlen is larger than the optval buffer given from userspace.
> > > 
> > > Because of all these strange things I tried to do a minimal change
> > > in order to get rid of the io_uring limitation and only converted
> > > optlen and leave optval as is.
> > > 
> > > In order to have a patchset that has a low risk to cause regressions.
> > > 
> > > But as alternative introducing a prototype like this:
> > > 
> > >          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > >                                 struct iov_iter *optval_iter);
> > > 
> > > That returns a non-negative value which can be placed into *optlen
> > > or negative value as error and *optlen will not be changed on error.
> > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > > 
> > > Implementations could then opt in for the new interface and
> > > allow do_sock_getsockopt() work also for the io_uring case,
> > > while all others would still get -EOPNOTSUPP.
> > > 
> > > So what should be the way to go?
> > 
> > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > but the first part I wanted to convert was
> > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > writing.
> > 
> > So we could go with the optlen_t approach, or we need
> > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > with ITER_DEST...
> > 
> > So who wants to decide?
> 
> I just noticed that it's even possible in same cases
> to pass in a short buffer to optval, but have a longer value in optlen,
> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> 
> This makes it really hard to believe that trying to use iov_iter for this
> is a good idea :-(

That was my finding as well a while ago, when I was planning to get the
__user pointers converted to iov_iter. There are some weird ways of
using optlen and optval, which make them non-trivial to convert to
iov_iter.


From stfomichev at gmail.com  Tue Apr  1 15:45:39 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Tue, 1 Apr 2025 08:45:39 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <Z+wH1oYOr1dlKeyN@gmail.com>
References: <cover.1743449872.git.metze@samba.org> <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com>
Message-ID: <Z-wKI1rQGSgrsjbl@mini-arch>

On 04/01, Breno Leitao wrote:
> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > > On 03/31, Stefan Metzmacher wrote:
> > > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > > from io_uring_cmd_getsockopt().
> > > > > > 
> > > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > > and can't reach the ops->getsockopt() path.
> > > > > > 
> > > > > > The first idea would be to change the optval and optlen arguments
> > > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > > 
> > > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > > 
> > > > > > @Linus, would that optlen_t approach fit better for you?
> > > > > 
> > > > > [..]
> > > > > 
> > > > > > Instead of passing the optlen as user or kernel pointer,
> > > > > > we only ever pass a kernel pointer and do the
> > > > > > translation from/to userspace in do_sock_getsockopt().
> > > > > 
> > > > > At this point why not just fully embrace iov_iter? You have the size
> > > > > now + the user (or kernel) pointer. Might as well do
> > > > > s/sockptr_t/iov_iter/ conversion?
> > > > 
> > > > I think that would only be possible if we introduce
> > > > proto[_ops].getsockopt_iter() and then convert the implementations
> > > > step by step. Doing it all in one go has a lot of potential to break
> > > > the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > the rest needs to be converted by the maintainer of the specific protocol,
> > > > as it needs to be tested. As there are crazy things happening in the existing
> > > > implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > buffer.
> > > > 
> > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > and that showed that touching the optval part starts to get complex very soon,
> > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > (note it didn't converted everything, I gave up after hitting
> > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > 
> > > > I come also across one implementation that returned -ERANGE because *optlen was
> > > > too short and put the required length into *optlen, which means the returned
> > > > *optlen is larger than the optval buffer given from userspace.
> > > > 
> > > > Because of all these strange things I tried to do a minimal change
> > > > in order to get rid of the io_uring limitation and only converted
> > > > optlen and leave optval as is.
> > > > 
> > > > In order to have a patchset that has a low risk to cause regressions.
> > > > 
> > > > But as alternative introducing a prototype like this:
> > > > 
> > > >          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > >                                 struct iov_iter *optval_iter);
> > > > 
> > > > That returns a non-negative value which can be placed into *optlen
> > > > or negative value as error and *optlen will not be changed on error.
> > > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > 
> > > > Implementations could then opt in for the new interface and
> > > > allow do_sock_getsockopt() work also for the io_uring case,
> > > > while all others would still get -EOPNOTSUPP.
> > > > 
> > > > So what should be the way to go?
> > > 
> > > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > but the first part I wanted to convert was
> > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > writing.
> > > 
> > > So we could go with the optlen_t approach, or we need
> > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > with ITER_DEST...
> > > 
> > > So who wants to decide?
> > 
> > I just noticed that it's even possible in same cases
> > to pass in a short buffer to optval, but have a longer value in optlen,
> > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > 
> > This makes it really hard to believe that trying to use iov_iter for this
> > is a good idea :-(
> 
> That was my finding as well a while ago, when I was planning to get the
> __user pointers converted to iov_iter. There are some weird ways of
> using optlen and optval, which makes them non-trivial to covert to
> iov_iter.

Can we ignore all non-ip/tcp/udp cases for now? This should cover 90%+
of useful socket opts. See if there are any obvious problems with them
and if not, try converting. The rest we can cover separately when/if
needed.


From metze at samba.org  Tue Apr  1 21:20:45 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 23:20:45 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <Z-wKI1rQGSgrsjbl@mini-arch>
References: <cover.1743449872.git.metze@samba.org>
 <Z-sDc-0qyfPZz9lv@mini-arch> <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org> <Z+wH1oYOr1dlKeyN@gmail.com>
 <Z-wKI1rQGSgrsjbl@mini-arch>
Message-ID: <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>

Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
> On 04/01, Breno Leitao wrote:
>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>>>>>> On 03/31, Stefan Metzmacher wrote:
>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
>>>>>>> from io_uring_cmd_getsockopt().
>>>>>>>
>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>>>>>> and can't reach the ops->getsockopt() path.
>>>>>>>
>>>>>>> The first idea would be to change the optval and optlen arguments
>>>>>>> to the protocol specific hooks also to sockptr_t, as that
>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>>>>>
>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>>>>>
>>>>>>> @Linus, would that optlen_t approach fit better for you?
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> Instead of passing the optlen as user or kernel pointer,
>>>>>>> we only ever pass a kernel pointer and do the
>>>>>>> translation from/to userspace in do_sock_getsockopt().
>>>>>>
>>>>>> At this point why not just fully embrace iov_iter? You have the size
>>>>>> now + the user (or kernel) pointer. Might as well do
>>>>>> s/sockptr_t/iov_iter/ conversion?
>>>>>
>>>>> I think that would only be possible if we introduce
>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
>>>>> step by step. Doing it all in one go has a lot of potential to break
>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
>>>>> the rest needs to be converted by the maintainer of the specific protocol,
>>>>> as it needs to be tested. As there are crazy things happening in the existing
>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
>>>>> buffer.
>>>>>
>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>>>>> and that showed that touching the optval part starts to get complex very soon,
>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>>>>> (note it didn't converted everything, I gave up after hitting
>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>>>>
>>>>> I come also across one implementation that returned -ERANGE because *optlen was
>>>>> too short and put the required length into *optlen, which means the returned
>>>>> *optlen is larger than the optval buffer given from userspace.
>>>>>
>>>>> Because of all these strange things I tried to do a minimal change
>>>>> in order to get rid of the io_uring limitation and only converted
>>>>> optlen and leave optval as is.
>>>>>
>>>>> In order to have a patchset that has a low risk to cause regressions.
>>>>>
>>>>> But as alternative introducing a prototype like this:
>>>>>
>>>>>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>>>>                                 struct iov_iter *optval_iter);
>>>>>
>>>>> That returns a non-negative value which can be placed into *optlen
>>>>> or negative value as error and *optlen will not be changed on error.
>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>>>>
>>>>> Implementations could then opt in for the new interface and
>>>>> allow do_sock_getsockopt() work also for the io_uring case,
>>>>> while all others would still get -EOPNOTSUPP.
>>>>>
>>>>> So what should be the way to go?
>>>>
>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
>>>> but the first part I wanted to convert was
>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
>>>> writing.
>>>>
>>>> So we could go with the optlen_t approach, or we need
>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
>>>> with ITER_DEST...
>>>>
>>>> So who wants to decide?
>>>
>>> I just noticed that it's even possible in same cases
>>> to pass in a short buffer to optval, but have a longer value in optlen,
>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
>>>
>>> This makes it really hard to believe that trying to use iov_iter for this
>>> is a good idea :-(
>>
>> That was my finding as well a while ago, when I was planning to get the
>> __user pointers converted to iov_iter. There are some weird ways of
>> using optlen and optval, which makes them non-trivial to covert to
>> iov_iter.
> 
> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> of useful socket opts. See if there are any obvious problems with them
> and if not, try converting. The rest we can cover separately when/if
> needed.

That's what I tried, but it fails with
tcp_getsockopt ->
    do_tcp_getsockopt ->
      tcp_ao_get_mkts ->
         tcp_ao_copy_mkts_to_user ->
            copy_struct_from_sockptr
      tcp_ao_get_sock_info ->
         copy_struct_from_sockptr

That's not possible with an ITER_DEST iov_iter.

metze
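
The failure described above can be reproduced in miniature. What follows
is a minimal userspace sketch, assuming a mock iterator that enforces a
single copy direction; mock_iter, mock_copy_from_iter(),
mock_copy_to_iter(), struct mock_req and the choice of -EINVAL are all
hypothetical stand-ins, not the kernel's iov_iter API. A handler that
reads optval before writing it cannot complete with either direction
alone:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace mock for illustration only; this is NOT the kernel's iov_iter. */
enum mock_dir { MOCK_DEST, MOCK_SOURCE };

struct mock_iter {
	enum mock_dir dir;	/* the only copy direction this iterator allows */
	unsigned char *buf;	/* backing memory */
	size_t len;		/* size of buf in bytes */
	size_t off;		/* bytes consumed so far */
};

/* Read n bytes out of the iterator; refused unless it is a source. */
int mock_copy_from_iter(void *dst, size_t n, struct mock_iter *it)
{
	assert(it != NULL);
	if (it->dir != MOCK_SOURCE)
		return -EINVAL;	/* error code chosen for this sketch */
	if (n > it->len - it->off)
		return -EFAULT;
	memcpy(dst, it->buf + it->off, n);
	it->off += n;
	return 0;
}

/* Write n bytes into the iterator; refused unless it is a destination. */
int mock_copy_to_iter(const void *src, size_t n, struct mock_iter *it)
{
	assert(it != NULL);
	if (it->dir != MOCK_DEST)
		return -EINVAL;
	if (n > it->len - it->off)
		return -EFAULT;
	memcpy(it->buf + it->off, src, n);
	it->off += n;
	return 0;
}

/* Hypothetical request/reply layout: the caller fills in key_id first. */
struct mock_req {
	int key_id;
	int value;
};

/* Handler that, like the TCP-AO path, reads optval before writing it. */
int mock_get_with_input(struct mock_iter *optval)
{
	struct mock_req req;
	int err;

	err = mock_copy_from_iter(&req, sizeof(req), optval);
	if (err)
		return err;	/* a write-only iterator stops here */
	req.value = req.key_id * 2;
	return mock_copy_to_iter(&req, sizeof(req), optval);
}
```

With a destination-only iterator the read step is refused; with a
source-only iterator the write step is refused. That is the gap the
ITER_BOTH and two-iterator ideas in this thread try to close.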


From stfomichev at gmail.com  Tue Apr  1 22:04:29 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Tue, 1 Apr 2025 15:04:29 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
References: <cover.1743449872.git.metze@samba.org> <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
Message-ID: <Z-xi7TH83upf-E3q@mini-arch>

On 04/01, Stefan Metzmacher wrote:
> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
> > On 04/01, Breno Leitao wrote:
> > > On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> > > > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > > > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > > > > On 03/31, Stefan Metzmacher wrote:
> > > > > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > > > > from io_uring_cmd_getsockopt().
> > > > > > > > 
> > > > > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > > > > and can't reach the ops->getsockopt() path.
> > > > > > > > 
> > > > > > > > The first idea would be to change the optval and optlen arguments
> > > > > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > > > > 
> > > > > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > > > > 
> > > > > > > > @Linus, would that optlen_t approach fit better for you?
> > > > > > > 
> > > > > > > [..]
> > > > > > > 
> > > > > > > > Instead of passing the optlen as user or kernel pointer,
> > > > > > > > we only ever pass a kernel pointer and do the
> > > > > > > > translation from/to userspace in do_sock_getsockopt().
> > > > > > > 
> > > > > > > At this point why not just fully embrace iov_iter? You have the size
> > > > > > > now + the user (or kernel) pointer. Might as well do
> > > > > > > s/sockptr_t/iov_iter/ conversion?
> > > > > > 
> > > > > > I think that would only be possible if we introduce
> > > > > > proto[_ops].getsockopt_iter() and then convert the implementations
> > > > > > step by step. Doing it all in one go has a lot of potential to break
> > > > > > the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > > > the rest needs to be converted by the maintainer of the specific protocol,
> > > > > > as it needs to be tested. As there are crazy things happening in the existing
> > > > > > implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > > > buffer.
> > > > > > 
> > > > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > > > and that showed that touching the optval part starts to get complex very soon,
> > > > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > > > (note it didn't converted everything, I gave up after hitting
> > > > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > > > more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > > > 
> > > > > > I come also across one implementation that returned -ERANGE because *optlen was
> > > > > > too short and put the required length into *optlen, which means the returned
> > > > > > *optlen is larger than the optval buffer given from userspace.
> > > > > > 
> > > > > > Because of all these strange things I tried to do a minimal change
> > > > > > in order to get rid of the io_uring limitation and only converted
> > > > > > optlen and leave optval as is.
> > > > > > 
> > > > > > In order to have a patchset that has a low risk to cause regressions.
> > > > > > 
> > > > > > But as alternative introducing a prototype like this:
> > > > > > 
> > > > > >          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > > >                                 struct iov_iter *optval_iter);
> > > > > > 
> > > > > > That returns a non-negative value which can be placed into *optlen
> > > > > > or negative value as error and *optlen will not be changed on error.
> > > > > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > > > 
> > > > > > Implementations could then opt in for the new interface and
> > > > > > allow do_sock_getsockopt() work also for the io_uring case,
> > > > > > while all others would still get -EOPNOTSUPP.
> > > > > > 
> > > > > > So what should be the way to go?
> > > > > 
> > > > > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > > but the first part I wanted to convert was
> > > > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > > writing.
> > > > > 
> > > > > So we could go with the optlen_t approach, or we need
> > > > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > > > with ITER_DEST...
> > > > > 
> > > > > So who wants to decide?
> > > > 
> > > > I just noticed that it's even possible in same cases
> > > > to pass in a short buffer to optval, but have a longer value in optlen,
> > > > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > 
> > > > This makes it really hard to believe that trying to use iov_iter for this
> > > > is a good idea :-(
> > > 
> > > That was my finding as well a while ago, when I was planning to get the
> > > __user pointers converted to iov_iter. There are some weird ways of
> > > using optlen and optval, which makes them non-trivial to covert to
> > > iov_iter.
> > 
> > Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > of useful socket opts. See if there are any obvious problems with them
> > and if not, try converting. The rest we can cover separately when/if
> > needed.
> 
> That's what I tried, but it fails with
> tcp_getsockopt ->
>    do_tcp_getsockopt ->
>      tcp_ao_get_mkts ->
>         tcp_ao_copy_mkts_to_user ->
>            copy_struct_from_sockptr
>      tcp_ao_get_sock_info ->
>         copy_struct_from_sockptr
> 
> That's not possible with a ITER_DEST iov_iter.
> 
> metze

Can we create two iterators over the same memory? One for ITER_SOURCE and
another for ITER_DEST. And then make getsockopt_iter accept optval_in and
optval_out. We can also use optval_out position (iov_offset) as optlen output
value. Don't see why it won't work, but I agree that's gonna be a messy
conversion so let's see if someone else has better suggestions.
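
As a rough illustration of the two-iterator idea, here is a minimal
userspace sketch. It assumes a mock iterator type (mock_iter) and helper
names (mock_iter_init, mock_read, mock_write, mock_getsockopt_iter) that
are invented for this example and are not kernel APIs; the mock does not
enforce direction, it only tracks a position per view. Both views cover
the same buffer, the handler reads its input through one and writes its
reply through the other, and the write position doubles as the output
length:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mock for illustration only; NOT the kernel's iov_iter. */
struct mock_iter {
	unsigned char *buf;	/* backing memory (shared by both views) */
	size_t len;		/* size of buf in bytes */
	size_t off;		/* position of this view */
};

void mock_iter_init(struct mock_iter *it, void *buf, size_t len)
{
	assert(it != NULL);
	it->buf = buf;
	it->len = len;
	it->off = 0;
}

/* Consume n bytes from the input view. */
int mock_read(void *dst, size_t n, struct mock_iter *in)
{
	if (n > in->len - in->off)
		return -EFAULT;
	memcpy(dst, in->buf + in->off, n);
	in->off += n;
	return 0;
}

/* Produce n bytes into the output view. */
int mock_write(const void *src, size_t n, struct mock_iter *out)
{
	if (n > out->len - out->off)
		return -EFAULT;
	memcpy(out->buf + out->off, src, n);
	out->off += n;
	return 0;
}

/* Hypothetical option: read a 32-bit selector, reply with two words. */
int mock_getsockopt_iter(struct mock_iter *optval_in,
			 struct mock_iter *optval_out)
{
	uint32_t selector, reply[2];
	int err;

	err = mock_read(&selector, sizeof(selector), optval_in);
	if (err)
		return err;
	reply[0] = selector;
	reply[1] = selector + 1;
	/* On success optval_out->off is the length to report back. */
	return mock_write(reply, sizeof(reply), optval_out);
}
```

Because each view keeps its own offset, reading the input does not
disturb the position used to compute the returned length.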


From metze at samba.org  Tue Apr  1 22:53:58 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Wed, 2 Apr 2025 00:53:58 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <Z-xi7TH83upf-E3q@mini-arch>
References: <cover.1743449872.git.metze@samba.org>
 <Z-sDc-0qyfPZz9lv@mini-arch> <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org> <Z+wH1oYOr1dlKeyN@gmail.com>
 <Z-wKI1rQGSgrsjbl@mini-arch> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
Message-ID: <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>

Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> On 04/01, Stefan Metzmacher wrote:
>> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
>>> On 04/01, Breno Leitao wrote:
>>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
>>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
>>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
>>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>>>>>>>> On 03/31, Stefan Metzmacher wrote:
>>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
>>>>>>>>> from io_uring_cmd_getsockopt().
>>>>>>>>>
>>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>>>>>>>> and can't reach the ops->getsockopt() path.
>>>>>>>>>
>>>>>>>>> The first idea would be to change the optval and optlen arguments
>>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
>>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>>>>>>>
>>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>>>>>>>
>>>>>>>>> @Linus, would that optlen_t approach fit better for you?
>>>>>>>>
>>>>>>>> [..]
>>>>>>>>
>>>>>>>>> Instead of passing the optlen as user or kernel pointer,
>>>>>>>>> we only ever pass a kernel pointer and do the
>>>>>>>>> translation from/to userspace in do_sock_getsockopt().
>>>>>>>>
>>>>>>>> At this point why not just fully embrace iov_iter? You have the size
>>>>>>>> now + the user (or kernel) pointer. Might as well do
>>>>>>>> s/sockptr_t/iov_iter/ conversion?
>>>>>>>
>>>>>>> I think that would only be possible if we introduce
>>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
>>>>>>> step by step. Doing it all in one go has a lot of potential to break
>>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
>>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
>>>>>>> as it needs to be tested. As there are crazy things happening in the existing
>>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
>>>>>>> buffer.
>>>>>>>
>>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>>>>>>> and that showed that touching the optval part starts to get complex very soon,
>>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>>>>>>> (note it didn't converted everything, I gave up after hitting
>>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>>>>>>
>>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
>>>>>>> too short and put the required length into *optlen, which means the returned
>>>>>>> *optlen is larger than the optval buffer given from userspace.
>>>>>>>
>>>>>>> Because of all these strange things I tried to do a minimal change
>>>>>>> in order to get rid of the io_uring limitation and only converted
>>>>>>> optlen and leave optval as is.
>>>>>>>
>>>>>>> In order to have a patchset that has a low risk to cause regressions.
>>>>>>>
>>>>>>> But as alternative introducing a prototype like this:
>>>>>>>
>>>>>>>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>>>>>>                                 struct iov_iter *optval_iter);
>>>>>>>
>>>>>>> That returns a non-negative value which can be placed into *optlen
>>>>>>> or negative value as error and *optlen will not be changed on error.
>>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>>>>>>
>>>>>>> Implementations could then opt in for the new interface and
>>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
>>>>>>> while all others would still get -EOPNOTSUPP.
>>>>>>>
>>>>>>> So what should be the way to go?
>>>>>>
>>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
>>>>>> but the first part I wanted to convert was
>>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
>>>>>> writing.
>>>>>>
>>>>>> So we could go with the optlen_t approach, or we need
>>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
>>>>>> with ITER_DEST...
>>>>>>
>>>>>> So who wants to decide?
>>>>>
>>>>> I just noticed that it's even possible in same cases
>>>>> to pass in a short buffer to optval, but have a longer value in optlen,
>>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
>>>>>
>>>>> This makes it really hard to believe that trying to use iov_iter for this
>>>>> is a good idea :-(
>>>>
>>>> That was my finding as well a while ago, when I was planning to get the
>>>> __user pointers converted to iov_iter. There are some weird ways of
>>>> using optlen and optval, which makes them non-trivial to covert to
>>>> iov_iter.
>>>
>>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
>>> of useful socket opts. See if there are any obvious problems with them
>>> and if not, try converting. The rest we can cover separately when/if
>>> needed.
>>
>> That's what I tried, but it fails with
>> tcp_getsockopt ->
>>     do_tcp_getsockopt ->
>>       tcp_ao_get_mkts ->
>>          tcp_ao_copy_mkts_to_user ->
>>             copy_struct_from_sockptr
>>       tcp_ao_get_sock_info ->
>>          copy_struct_from_sockptr
>>
>> That's not possible with a ITER_DEST iov_iter.
>>
>> metze
> 
> Can we create two iterators over the same memory? One for ITER_SOURCE and
> another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> optval_out. We can also use optval_out position (iov_offset) as optlen output
> value. Don't see why it won't work, but I agree that's gonna be a messy
> conversion so let's see if someone else has better suggestions.

Yes, that might work, but it would be good to get some feedback
on whether this would be the way to go:

        int (*getsockopt_iter)(struct socket *sock,
                               int level, int optname,
                               struct iov_iter *optval_in,
                               struct iov_iter *optval_out);

And *optlen = optval_out->iov_offset;

Any objections or better ideas? Linus, would that be what you had in mind?

Thanks!
metze
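
To make the proposed calling convention concrete, here is a minimal
userspace sketch of how a dispatcher might wrap such a hook. Everything
in it is an assumption made for illustration: mock_iter, mock_ops,
mock_do_getsockopt() and mock_get_int() are invented names, the hook
drops the struct socket argument, and the real do_sock_getsockopt()
does considerably more. It shows three behaviours discussed in this
thread: -EOPNOTSUPP when a protocol has not opted in, *optlen taken from
the output position on success, and *optlen left untouched on error:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace mock for illustration only; NOT the kernel's iov_iter. */
struct mock_iter {
	unsigned char *buf;
	size_t len;
	size_t off;
};

/* Hypothetical ops table; a NULL hook means "not converted yet". */
struct mock_ops {
	int (*getsockopt_iter)(int level, int optname,
			       struct mock_iter *optval_in,
			       struct mock_iter *optval_out);
};

/* Sketch of the dispatcher: builds both views and derives *optlen. */
int mock_do_getsockopt(const struct mock_ops *ops, int level, int optname,
		       void *optval, int *optlen)
{
	struct mock_iter in, out;
	int err;

	assert(ops != NULL && optlen != NULL);
	if (!ops->getsockopt_iter)
		return -EOPNOTSUPP;
	if (*optlen < 0)
		return -EINVAL;

	in = (struct mock_iter){ optval, (size_t)*optlen, 0 };
	out = in;	/* second view over the same memory */

	err = ops->getsockopt_iter(level, optname, &in, &out);
	if (err < 0)
		return err;	/* *optlen is not changed on error */

	*optlen = (int)out.off;	/* mirrors "*optlen = optval_out->iov_offset" */
	return 0;
}

/* Example hook: returns a fixed int and ignores its input view. */
int mock_get_int(int level, int optname,
		 struct mock_iter *optval_in, struct mock_iter *optval_out)
{
	int value = 1234;

	(void)level;
	(void)optname;
	(void)optval_in;
	if (optval_out->len - optval_out->off < sizeof(value))
		return -EINVAL;
	memcpy(optval_out->buf + optval_out->off, &value, sizeof(value));
	optval_out->off += sizeof(value);
	return 0;
}
```

Keeping the length bookkeeping in the dispatcher means a converted
handler never touches optlen directly, which is also why the odd cases
raised elsewhere in this thread need separate treatment.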


From stfomichev at gmail.com  Wed Apr  2 14:19:46 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Wed, 2 Apr 2025 07:19:46 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402132906.0ceb8985@pumpkin>
References: <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
 <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
 <20250402132906.0ceb8985@pumpkin>
Message-ID: <Z-1Hgv4ImjWOW8X2@mini-arch>

On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 00:53:58 +0200
> Stefan Metzmacher <metze at samba.org> wrote:
> 
> > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> > > On 04/01, Stefan Metzmacher wrote:  
> > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:  
> > >>> On 04/01, Breno Leitao wrote:  
> > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:  
> > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:  
> > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:  
> > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:  
> > >>>>>>>> On 03/31, Stefan Metzmacher wrote:  
> > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > >>>>>>>>> from io_uring_cmd_getsockopt().
> > >>>>>>>>>
> > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > >>>>>>>>>
> > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > >>>>>>>>>
> > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > >>>>>>>>>
> > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?  
> > >>>>>>>>
> > >>>>>>>> [..]
> > >>>>>>>>  
> > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > >>>>>>>>> we only ever pass a kernel pointer and do the
> > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().  
> > >>>>>>>>
> > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > >>>>>>>> s/sockptr_t/iov_iter/ conversion?  
> > >>>>>>>
> > >>>>>>> I think that would only be possible if we introduce
> > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > >>>>>>> buffer.
> > >>>>>>>
> > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > >>>>>>>
> > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > >>>>>>> too short and put the required length into *optlen, which means the returned
> > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > >>>>>>>
> > >>>>>>> Because of all these strange things I tried to do a minimal change
> > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > >>>>>>> optlen and leave optval as is.
> > >>>>>>>
> > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > >>>>>>>
> > >>>>>>> But as alternative introducing a prototype like this:
> > >>>>>>>
> > >>>>>>>   ???????? int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > >>>>>>>   ??????????????????????????????? struct iov_iter *optval_iter);
> > >>>>>>>
> > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > >>>>>>>
> > >>>>>>> Implementations could then opt in for the new interface and
> > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > >>>>>>> while all others would still get -EOPNOTSUPP.
> > >>>>>>>
> > >>>>>>> So what should be the way to go?  
> > >>>>>>
> > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > >>>>>> but the first part I wanted to convert was
> > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > >>>>>> writing.
> > >>>>>>
> > >>>>>> So we could go with the optlen_t approach, or we need
> > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > >>>>>> with ITER_DEST...
> > >>>>>>
> > >>>>>> So who wants to decide?  
> > >>>>>
> > >>>>> I just noticed that it's even possible in same cases
> > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > >>>>>
> > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > >>>>> is a good idea :-(  
> > >>>>
> > >>>> That was my finding as well a while ago, when I was planning to get the
> > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > >>>> using optlen and optval, which makes them non-trivial to covert to
> > >>>> iov_iter.  
> > >>>
> > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > >>> of useful socket opts. See if there are any obvious problems with them
> > >>> and if not, try converting. The rest we can cover separately when/if
> > >>> needed.  
> > >>
> > >> That's what I tried, but it fails with
> > >> tcp_getsockopt ->
> > >>     do_tcp_getsockopt ->
> > >>       tcp_ao_get_mkts ->
> > >>          tcp_ao_copy_mkts_to_user ->
> > >>             copy_struct_from_sockptr
> > >>       tcp_ao_get_sock_info ->
> > >>          copy_struct_from_sockptr
> > >>
> > >> That's not possible with a ITER_DEST iov_iter.
> > >>
> > >> metze  
> > > 
> > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > conversion so let's see if someone else has better suggestions.  
> > 
> > Yes, that might work, but it would be good to get some feedback
> > if this would be the way to go:
> > 
> >            int (*getsockopt_iter)(struct socket *sock,
> > 				 int level, int optname,
> > 				 struct iov_iter *optval_in,
> > 				 struct iov_iter *optval_out);
> > 
> > And *optlen = optval_out->iov_offset;
> > 
> > Any objection or better ideas? Linus would that be what you had in mind?
> 
> I'd worry about performance - yes I know 'iter' are used elsewhere but...
> Also look at the SCTP code.

Performance usually does not matter for set/getsockopt. There are a few
exceptions that I know of (TCP_ZEROCOPY_RECEIVE, and maybe the recent
devmem sockopts); we can special-case these if needed, or keep sockptr_t
for them, idk. I'm skeptical that we can convert everything though, which
is why I suggested starting with sk/ip/tcp/udp.

> How do you handle code that wants to return an updated length (often longer
> than the one provided) and an error code (eg ERRSIZE or similar).
>
> There is also a very strange use (I think it is a sockopt rather than an ioctl)
> where the buffer length the application provides is only that of the header.
> The actual buffer length is contained in the header.
> The return length is the amount written into the full buffer.

Let's discuss these special cases as they come up? Worst case these
places can always re-init iov_iter with a comment on why it is ok.
But I do agree in general that there are a few places that do wild
stuff.


From torvalds at linux-foundation.org  Wed Apr  2 00:40:19 2025
From: torvalds at linux-foundation.org (Linus Torvalds)
Date: Tue, 1 Apr 2025 17:40:19 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <cover.1743449872.git.metze@samba.org>
References: <cover.1743449872.git.metze@samba.org>
Message-ID: <CAHk-=whmzrO-BMU=uSVXbuoLi-3tJsO=0kHj1BCPBE3F2kVhTA@mail.gmail.com>

On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher <metze at samba.org> wrote:
>
> But as Linus doesn't like 'sockptr_t' I used a different approach.

So the sockptr_t thing has already happened. I hate it, and I think
it's ugly as hell, but it is what it is.

I think it's a complete hack and having that "kernel or user" pointer
flag is disgusting.

Making things worse, the naming is disgusting too, talking about some
random "socket pointer", when it has absolutely nothing to do with
sockets, and isn't even a pointer. It's something else.

It's literally called "socket" not because it has anything to do with
sockets, but because it's a socket-specific hack that isn't acceptable
anywhere else in the kernel.

So that "socket" part of the name is literally shorthand for "only
sockets are disgusting enough to use this, and nobody else should ever
touch this crap".

At least so far that part has mostly worked, even if there's some
"sockptr_t" use in the crypto code. I didn't look closer, because I
didn't want to lose my lunch.

I don't understand why the networking code uses that thing.

If you have a "fat pointer", you should damn well make it have the
size of the area too, and do things *right*.

Instead of doing what sockptr_t does, which is a complete hack to just
pass a kernel/user flag, and then passes the length *separately*
because the socket code couldn't be arsed to do the right thing.

So I do still think "sockptr_t" should die.

As Stanislav says, if you actually want that "user or kernel" thing,
just use an "iov_iter".

No, an "iov_iter" isn't exactly a pretty thing either, but at least
it's the standard way to say "this pointer can have multiple different
kinds of sources".

And it keeps the size of the thing it points to around, so it's at
least a fat pointer with proper ranges, even if it isn't exactly "type
safe" (yes, it's type safe in the sense that it stays as a "iov_iter",
but it's still basically a "random pointer").

> @Linus, would that optlen_t approach fit better for you?

The optlen_t thing is slightly better mainly because it's more
type-safe. At least it's not a "random misnamed
user-or-kernel-pointer" thing where the name is about how nothing else
is so broken as to use it.

So it's better because it's more limited, and it's better in that at
least it has a type-safe pointer rather than a "void *" with no size
or type associated with it.

That said, I don't think it's exactly great.

It's just another case of "networking can't just do it right, and uses
a random hack with special flag values".

So I do think that it would be better to actually get rid of
"sockptr_t optval, unsigned int optlen" ENTIRELY, and replace that
with iov_iter and just make networking bite the bullet and do the
RightThing(tm).

In fact, to make it *really* typesafe, it might be a good idea to wrap
the iov_iter in another struct, something like

   typedef struct sockopt {
        struct iov_iter iter;
   } sockopt_t;

and make the networking functions make the typing very clear, and end
up with an interface something like

   int do_tcp_setsockopt(struct sock *sk,
                     int level, int optname,
                     sockopt_t *val);

where that "sockopt_t *val" replaces not just the "sockptr_t optval",
but also the "unsigned int optlen" thing.

And no, I didn't look at how much churn that would be. Probably a lot.
Maybe more than people are willing to do - even if I think some of it
could be automated with coccinelle or whatever.
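
As a rough illustration of the "fat pointer" idea above, here is a
userspace sketch. It is not kernel code and not part of the patch set:
struct sockopt_model, sockopt_model_get_int() and model_setsockopt() are
hypothetical names, and the is_kernel flag merely stands in for what the
iov_iter type would encode. The point is only that the size travels with
the buffer, so the handler takes no separate optlen argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model: one object carries the buffer, its size
 * and its origin, so the length never travels as a separate argument. */
struct sockopt_model {
	void   *buf;        /* start of the option buffer */
	size_t  len;        /* size of the area buf points to */
	bool    is_kernel;  /* stands in for the iov_iter type (kvec vs. ubuf) */
};

/* Bounds-checked read of an int option value. */
static int sockopt_model_get_int(const struct sockopt_model *opt, int *out)
{
	if (opt->len < sizeof(*out))
		return -EINVAL;
	/* In the kernel, a user or kernel copy helper would be chosen here. */
	memcpy(out, opt->buf, sizeof(*out));
	return 0;
}

/* Same shape as the do_tcp_setsockopt() prototype above: no optlen. */
static int model_setsockopt(int *setting, const struct sockopt_model *val)
{
	int v, err;

	err = sockopt_model_get_int(val, &v);
	if (err)
		return err;
	*setting = v;
	return 0;
}
```

A handler written this way cannot read past the caller's buffer by
accident, because every access goes through the length check.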

                Linus


From david.laight.linux at gmail.com  Wed Apr  2 12:29:06 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 13:29:06 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
References: <cover.1743449872.git.metze@samba.org> <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
 <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
Message-ID: <20250402132906.0ceb8985@pumpkin>

On Wed, 2 Apr 2025 00:53:58 +0200
Stefan Metzmacher <metze at samba.org> wrote:

> Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> > On 04/01, Stefan Metzmacher wrote:  
> >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:  
> >>> On 04/01, Breno Leitao wrote:  
> >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:  
> >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:  
> >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:  
> >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:  
> >>>>>>>> On 03/31, Stefan Metzmacher wrote:  
> >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> >>>>>>>>> from io_uring_cmd_getsockopt().
> >>>>>>>>>
> >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> >>>>>>>>> and can't reach the ops->getsockopt() path.
> >>>>>>>>>
> >>>>>>>>> The first idea would be to change the optval and optlen arguments
> >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> >>>>>>>>>
> >>>>>>>>> But as Linus doesn't like 'sockptr_t' I used a different approach.
> >>>>>>>>>
> >>>>>>>>> @Linus, would that optlen_t approach fit better for you?  
> >>>>>>>>
> >>>>>>>> [..]
> >>>>>>>>  
> >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> >>>>>>>>> we only ever pass a kernel pointer and do the
> >>>>>>>>> translation from/to userspace in do_sock_getsockopt().  
> >>>>>>>>
> >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> >>>>>>>> now + the user (or kernel) pointer. Might as well do
> >>>>>>>> s/sockptr_t/iov_iter/ conversion?  
> >>>>>>>
> >>>>>>> I think that would only be possible if we introduce
> >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> >>>>>>> buffer.
> >>>>>>>
> >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> >>>>>>> (note it didn't convert everything; I gave up after hitting
> >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> >>>>>>>
> >>>>>>> I also came across one implementation that returned -ERANGE because *optlen was
> >>>>>>> too short and put the required length into *optlen, which means the returned
> >>>>>>> *optlen is larger than the optval buffer given from userspace.
> >>>>>>>
> >>>>>>> Because of all these strange things I tried to do a minimal change
> >>>>>>> in order to get rid of the io_uring limitation and only converted
> >>>>>>> optlen and leave optval as is.
> >>>>>>>
> >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> >>>>>>>
> >>>>>>> But as alternative introducing a prototype like this:
> >>>>>>>
> >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> >>>>>>>                                   struct iov_iter *optval_iter);
> >>>>>>>
> >>>>>>> That returns a non-negative value which can be placed into *optlen
> >>>>>>> or negative value as error and *optlen will not be changed on error.
> >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> >>>>>>>
> >>>>>>> Implementations could then opt in for the new interface and
> >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> >>>>>>> while all others would still get -EOPNOTSUPP.
> >>>>>>>
> >>>>>>> So what should be the way to go?  
> >>>>>>
> >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> >>>>>> but the first part I wanted to convert was
> >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> >>>>>> writing.
> >>>>>>
> >>>>>> So we could go with the optlen_t approach, or we need
> >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> >>>>>> with ITER_DEST...
> >>>>>>
> >>>>>> So who wants to decide?  
> >>>>>
> >>>>> I just noticed that it's even possible in some cases
> >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> >>>>>
> >>>>> This makes it really hard to believe that trying to use iov_iter for this
> >>>>> is a good idea :-(  
> >>>>
> >>>> That was my finding as well a while ago, when I was planning to get the
> >>>> __user pointers converted to iov_iter. There are some weird ways of
> >>>> using optlen and optval, which makes them non-trivial to convert to
> >>>> iov_iter.  
> >>>
> >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> >>> of useful socket opts. See if there are any obvious problems with them
> >>> and if not, try converting. The rest we can cover separately when/if
> >>> needed.  
> >>
> >> That's what I tried, but it fails with
> >> tcp_getsockopt ->
> >>     do_tcp_getsockopt ->
> >>       tcp_ao_get_mkts ->
> >>          tcp_ao_copy_mkts_to_user ->
> >>             copy_struct_from_sockptr
> >>       tcp_ao_get_sock_info ->
> >>          copy_struct_from_sockptr
> >>
> >> That's not possible with an ITER_DEST iov_iter.
> >>
> >> metze  
> > 
> > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > value. Don't see why it won't work, but I agree that's gonna be a messy
> > conversion so let's see if someone else has better suggestions.  
> 
> Yes, that might work, but it would be good to get some feedback
> if this would be the way to go:
> 
>            int (*getsockopt_iter)(struct socket *sock,
> 				 int level, int optname,
> 				 struct iov_iter *optval_in,
> 				 struct iov_iter *optval_out);
> 
> And *optlen = optval_out->iov_offset;
> 
> Any objection or better ideas? Linus would that be what you had in mind?

I'd worry about performance - yes I know 'iter' are used elsewhere but...
Also look at the SCTP code.

How do you handle code that wants to return an updated length (often longer
than the one provided) and an error code (e.g. ERRSIZE or similar)?

There is also a very strange use (I think it is a sockopt rather than an ioctl)
where the buffer length the application provides is only that of the header.
The actual buffer length is contained in the header.
The return length is the amount written into the full buffer.

	David


From david.laight.linux at gmail.com  Wed Apr  2 12:35:20 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 13:35:20 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <CAHk-=whmzrO-BMU=uSVXbuoLi-3tJsO=0kHj1BCPBE3F2kVhTA@mail.gmail.com>
References: <cover.1743449872.git.metze@samba.org>
 <CAHk-=whmzrO-BMU=uSVXbuoLi-3tJsO=0kHj1BCPBE3F2kVhTA@mail.gmail.com>
Message-ID: <20250402133520.40451468@pumpkin>

On Tue, 1 Apr 2025 17:40:19 -0700
Linus Torvalds <torvalds at linux-foundation.org> wrote:

> On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher <metze at samba.org> wrote:
> >
> > But as Linus doesn't like 'sockptr_t' I used a different approach.
> 
> So the sockptr_t thing has already happened. I hate it, and I think
> it's ugly as hell, but it is what it is.
> 
> I think it's a complete hack and having that "kernel or user" pointer
> flag is disgusting.

I have proposed a patch which replaced it with a structure.
That showed up some really hacky code in, IIRC, io_uring.

Using sockptr_t for the buffer is one thing: the generic code
can't copy the buffer to/from user because the code lies about the length.

But using it for the length is just brain-dead.
The length is fixed size and can be copied from/to user by the wrapper.
The code bloat reduction will be significant.

	David


From david.laight.linux at gmail.com  Wed Apr  2 20:46:38 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 21:46:38 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <Z-1Hgv4ImjWOW8X2@mini-arch>
References: <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
 <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
 <20250402132906.0ceb8985@pumpkin> <Z-1Hgv4ImjWOW8X2@mini-arch>
Message-ID: <20250402214638.0b5eed55@pumpkin>

On Wed, 2 Apr 2025 07:19:46 -0700
Stanislav Fomichev <stfomichev at gmail.com> wrote:

> On 04/02, David Laight wrote:
> > On Wed, 2 Apr 2025 00:53:58 +0200
> > Stefan Metzmacher <metze at samba.org> wrote:
> >   
> > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:  
> > > > On 04/01, Stefan Metzmacher wrote:    
> > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:    
> > > >>> On 04/01, Breno Leitao wrote:    
> > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:    
> > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:    
> > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:    
> > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:    
> > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:    
> > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > >>>>>>>>>
> > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > >>>>>>>>>
> > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > >>>>>>>>>
> > > >>>>>>>>> But as Linus doesn't like 'sockptr_t' I used a different approach.
> > > >>>>>>>>>
> > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?    
> > > >>>>>>>>
> > > >>>>>>>> [..]
> > > >>>>>>>>    
> > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().    
> > > >>>>>>>>
> > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?    
> > > >>>>>>>
> > > >>>>>>> I think that would only be possible if we introduce
> > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > >>>>>>> buffer.
> > > >>>>>>>
> > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > >>>>>>> (note it didn't convert everything; I gave up after hitting
> > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > >>>>>>>
> > > >>>>>>> I also came across one implementation that returned -ERANGE because *optlen was
> > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > >>>>>>>
> > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > >>>>>>> optlen and leave optval as is.
> > > >>>>>>>
> > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > >>>>>>>
> > > >>>>>>> But as alternative introducing a prototype like this:
> > > >>>>>>>
> > > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > >>>>>>>                                   struct iov_iter *optval_iter);
> > > >>>>>>>
> > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > >>>>>>>
> > > >>>>>>> Implementations could then opt in for the new interface and
> > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > >>>>>>>
> > > >>>>>>> So what should be the way to go?    
> > > >>>>>>
> > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > >>>>>> but the first part I wanted to convert was
> > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > >>>>>> writing.
> > > >>>>>>
> > > >>>>>> So we could go with the optlen_t approach, or we need
> > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > >>>>>> with ITER_DEST...
> > > >>>>>>
> > > >>>>>> So who wants to decide?    
> > > >>>>>
> > > >>>>> I just noticed that it's even possible in some cases
> > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > >>>>>
> > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > >>>>> is a good idea :-(    
> > > >>>>
> > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > >>>> using optlen and optval, which makes them non-trivial to convert to
> > > >>>> iov_iter.    
> > > >>>
> > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > >>> of useful socket opts. See if there are any obvious problems with them
> > > >>> and if not, try converting. The rest we can cover separately when/if
> > > >>> needed.    
> > > >>
> > > >> That's what I tried, but it fails with
> > > >> tcp_getsockopt ->
> > > >>     do_tcp_getsockopt ->
> > > >>       tcp_ao_get_mkts ->
> > > >>          tcp_ao_copy_mkts_to_user ->
> > > >>             copy_struct_from_sockptr
> > > >>       tcp_ao_get_sock_info ->
> > > >>          copy_struct_from_sockptr
> > > >>
> > > >> That's not possible with an ITER_DEST iov_iter.
> > > >>
> > > >> metze    
> > > > 
> > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > conversion so let's see if someone else has better suggestions.    
> > > 
> > > Yes, that might work, but it would be good to get some feedback
> > > if this would be the way to go:
> > > 
> > >            int (*getsockopt_iter)(struct socket *sock,
> > > 				 int level, int optname,
> > > 				 struct iov_iter *optval_in,
> > > 				 struct iov_iter *optval_out);
> > > 
> > > And *optlen = optval_out->iov_offset;
> > > 
> > > Any objection or better ideas? Linus would that be what you had in mind?  
> > 
> > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > Also look at the SCTP code.  
> 
> Performance usually does not matter for set/getsockopts, there
> are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)

That might be the one that is really horrid and completely abuses
the 'length' parameter.

> and maybe recent
> devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> idk. I'm skeptical we can convert everything though, that's why the
> suggestion to start with sk/ip/tcp/udp.
> 
> > How do you handle code that wants to return an updated length (often longer
> > than the one provided) and an error code (eg ERRSIZE or similar).
> >
> > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > where the buffer length the application provides is only that of the header.
> > The actual buffer length is contained in the header.
> > The return length is the amount written into the full buffer.  
> 
> Let's discuss these special cases as they come up? Worst case these
> places can always re-init iov_iter with a comment on why it is ok.
> But I do agree in general that there are a few places that do wild
> stuff.

The problem is that the generic code has to deal with all the 'wild stuff'.
It is also common to do non-sequential accesses - so iov_iter doesn't match
at all.
There also isn't a requirement for scatter-gather.

For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
sense for the syscall wrapper to do the user copies.
But it would need to pass the user ptr+len as well as the kernel ptr+len
to give the required flexibility.
Then you have to work out whether the final copy to user is needed or not
(not that hard, but it all adds complication).
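
A rough userspace sketch of that wrapper shape, purely as illustration:
none of these names exist in the kernel. struct optbuf_model,
model_getsockopt(), model_get_answer() and the SHORT_OPT_MAX cutoff are
assumptions, and memcpy() stands in for copy_from_user()/copy_to_user().
The handler gets both the original ptr+len and the bounce buffer, and
tells the wrapper whether the final copy back is needed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SHORT_OPT_MAX 64  /* hypothetical cutoff for "short" options */

/* Hypothetical bundle handed to a protocol handler. */
struct optbuf_model {
	void   *uptr;      /* caller's buffer ("user" memory in this model) */
	size_t  ulen;      /* length the caller claimed */
	void   *kptr;      /* bounce buffer owned by the wrapper */
	size_t  klen;      /* valid bytes in kptr; handler updates it on output */
	int     copy_out;  /* handler sets this if kptr must be copied back */
};

typedef int (*getopt_handler)(struct optbuf_model *ob);

/* Wrapper: copy in, call the handler, copy out only if asked to. */
static int model_getsockopt(getopt_handler fn, void *uptr, size_t *ulen)
{
	unsigned char bounce[SHORT_OPT_MAX];
	struct optbuf_model ob = { uptr, *ulen, bounce, 0, 0 };
	int err;

	if (*ulen > sizeof(bounce))
		return -EINVAL;          /* long options need another path */
	memcpy(bounce, uptr, *ulen);     /* stands in for copy_from_user() */
	ob.klen = *ulen;

	err = fn(&ob);
	if (err)
		return err;
	if (ob.copy_out) {
		if (ob.klen > *ulen)
			return -EINVAL;  /* never write past the caller's buffer */
		memcpy(uptr, bounce, ob.klen);  /* stands in for copy_to_user() */
	}
	*ulen = ob.klen;
	return 0;
}

/* Example handler: returns a fixed int value through the bounce buffer. */
static int model_get_answer(struct optbuf_model *ob)
{
	int v = 42;

	if (ob->klen < sizeof(v))
		return -EINVAL;
	memcpy(ob->kptr, &v, sizeof(v));
	ob->klen = sizeof(v);
	ob->copy_out = 1;
	return 0;
}
```

Handlers that do 'wild stuff' could ignore kptr and use uptr/ulen
directly, leaving copy_out clear so the wrapper skips the final copy.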

	David


From torvalds at linux-foundation.org  Wed Apr  2 21:07:54 2025
From: torvalds at linux-foundation.org (Linus Torvalds)
Date: Wed, 2 Apr 2025 14:07:54 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402214638.0b5eed55@pumpkin>
References: <Z-sDc-0qyfPZz9lv@mini-arch>
 <39515c76-310d-41af-a8b4-a814841449e3@samba.org>
 <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
 <20250402132906.0ceb8985@pumpkin> <Z-1Hgv4ImjWOW8X2@mini-arch>
 <20250402214638.0b5eed55@pumpkin>
Message-ID: <CAHk-=wi7p9bKgZt1E1BWE-NjwSRDBQs=Coviiz0ZTQy9OhHvPg@mail.gmail.com>

On Wed, 2 Apr 2025 at 13:46, David Laight <david.laight.linux at gmail.com> wrote:
>
> The problem is that the generic code has to deal with all the 'wild stuff'.
> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.
> There also isn't a requirement for scatter-gather.

Note that the generic code has special cases for the simple stuff,
which is all that the sockopt code would need.

Now, that's _particularly_ true for the "single user address range"
thing, where there's a special ITER_UBUF thing.

We don't actually have a "single kernel range" version of that, but
ITER_KVEC is simple to use, and the sockopt code could say "I only
ever look at the first buffer".

It's ok to just not handle all the cases, and you don't *have* to use
the generic "copy_from_iter()" routines if you don't want to.

In fact, I would expect that something like sockopt generally wouldn't
want to use the normal iter copying routines, since those are
basically all geared towards "copy and update the iter".
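
To illustrate "I only ever look at the first buffer", here is a small
userspace sketch. struct kvec_model and struct iter_model are simplified,
hypothetical stand-ins for the kernel's struct kvec and struct iov_iter,
and iter_model_peek() is an invented helper, not an existing kernel
function. It reads at an arbitrary offset in the first segment and does
not advance the iterator, so non-sequential access stays possible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's struct kvec / struct iov_iter. */
struct kvec_model {
	void   *base;
	size_t  len;
};

struct iter_model {
	const struct kvec_model *kvec;  /* segment array */
	size_t nr_segs;                 /* number of segments */
	size_t count;                   /* total bytes remaining */
};

/* Read from the first segment at any offset, without advancing the iter. */
static int iter_model_peek(const struct iter_model *it, size_t off,
			   void *dst, size_t len)
{
	const struct kvec_model *first;

	if (it->nr_segs == 0)
		return -EINVAL;
	first = &it->kvec[0];
	/* Reject anything that would run past the first segment. */
	if (off > first->len || len > first->len - off)
		return -EINVAL;
	memcpy(dst, (const char *)first->base + off, len);
	return 0;
}
```

Because the helper takes the iterator as const, repeated or out-of-order
reads see the same data, unlike the "copy and update the iter" routines.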

           Linus


From stfomichev at gmail.com  Wed Apr  2 21:21:35 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Wed, 2 Apr 2025 14:21:35 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402214638.0b5eed55@pumpkin>
References: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
 <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
 <20250402132906.0ceb8985@pumpkin> <Z-1Hgv4ImjWOW8X2@mini-arch>
 <20250402214638.0b5eed55@pumpkin>
Message-ID: <Z-2qX_N2-jpMYSIy@mini-arch>

On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 07:19:46 -0700
> Stanislav Fomichev <stfomichev at gmail.com> wrote:
> 
> > On 04/02, David Laight wrote:
> > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > Stefan Metzmacher <metze at samba.org> wrote:
> > >   
> > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:  
> > > > > On 04/01, Stefan Metzmacher wrote:    
> > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:    
> > > > >>> On 04/01, Breno Leitao wrote:    
> > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:    
> > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:    
> > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:    
> > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:    
> > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:    
> > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > > >>>>>>>>>
> > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > > >>>>>>>>>
> > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > >>>>>>>>>
> > > > >>>>>>>>> But as Linus doesn't like 'sockptr_t' I used a different approach.
> > > > >>>>>>>>>
> > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?    
> > > > >>>>>>>>
> > > > >>>>>>>> [..]
> > > > >>>>>>>>    
> > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().    
> > > > >>>>>>>>
> > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?    
> > > > >>>>>>>
> > > > >>>>>>> I think that would only be possible if we introduce
> > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > >>>>>>> buffer.
> > > > >>>>>>>
> > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > >>>>>>> (note it didn't convert everything; I gave up after hitting
> > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > >>>>>>>
> > > > >>>>>>> I also came across one implementation that returned -ERANGE because *optlen was
> > > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > > >>>>>>>
> > > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > > >>>>>>> optlen and leave optval as is.
> > > > >>>>>>>
> > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > > >>>>>>>
> > > > >>>>>>> But as an alternative, we could introduce a prototype like this:
> > > > >>>>>>>
> > > > >>>>>>>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > >>>>>>>                                 struct iov_iter *optval_iter);
> > > > >>>>>>>
> > > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > >>>>>>>
> > > > >>>>>>> Implementations could then opt in for the new interface and
> > > > >>>>>>> allow do_sock_getsockopt() to work also for the io_uring case,
> > > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > > >>>>>>>
> > > > >>>>>>> So what should be the way to go?    
> > > > >>>>>>
> > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > >>>>>> but the first part I wanted to convert was
> > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > >>>>>> writing.
> > > > >>>>>>
> > > > >>>>>> So we could go with the optlen_t approach, or we need
> > > > >>>>>> logic for ITER_BOTH or pass two iov_iters, one with ITER_SOURCE and one
> > > > >>>>>> with ITER_DEST...
> > > > >>>>>>
> > > > >>>>>> So who wants to decide?    
> > > > >>>>>
> > > > >>>>> I just noticed that it's even possible in some cases
> > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > >>>>>
> > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > > >>>>> is a good idea :-(    
> > > > >>>>
> > > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > > >>>> using optlen and optval, which makes them non-trivial to convert to
> > > > >>>> iov_iter.    
> > > > >>>
> > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > > >>> of useful socket opts. See if there are any obvious problems with them
> > > > >>> and if not, try converting. The rest we can cover separately when/if
> > > > >>> needed.    
> > > > >>
> > > > >> That's what I tried, but it fails with
> > > > >> tcp_getsockopt ->
> > > > >>     do_tcp_getsockopt ->
> > > > >>       tcp_ao_get_mkts ->
> > > > >>          tcp_ao_copy_mkts_to_user ->
> > > > >>             copy_struct_from_sockptr
> > > > >>       tcp_ao_get_sock_info ->
> > > > >>          copy_struct_from_sockptr
> > > > >>
> > > > >> That's not possible with an ITER_DEST iov_iter.
> > > > >>
> > > > >> metze    
> > > > > 
> > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > conversion so let's see if someone else has better suggestions.    
> > > > 
> > > > Yes, that might work, but it would be good to get some feedback
> > > > if this would be the way to go:
> > > > 
> > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > 				 int level, int optname,
> > > > 				 struct iov_iter *optval_in,
> > > > 				 struct iov_iter *optval_out);
> > > > 
> > > > And *optlen = optval_out->iov_offset;
> > > > 
> > > > Any objection or better ideas? Linus would that be what you had in mind?  
> > > 
> > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > Also look at the SCTP code.  
> > 
> > Performance usually does not matter for set/getsockopts, there
> > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)
> 
> That might be the one that is really horrid and completely abuses
> the 'length' parameter.

It is reading and writing, yes, but it's not a huge problem. And it
does enforce the optlen (to copy back the same amount of bytes). It's
not that bad, it's just an example of where we need to be extra
careful.

> > and maybe recent
> > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > idk. I'm skeptical we can convert everything though, that's why the
> > suggestion to start with sk/ip/tcp/udp.
> > 
> > > How do you handle code that wants to return an updated length (often longer
> > > than the one provided) and an error code (eg ERRSIZE or similar).
> > >
> > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > where the buffer length the application provides is only that of the header.
> > > The actual buffer length is contained in the header.
> > > The return length is the amount written into the full buffer.  
> > 
> > Let's discuss these special cases as they come up? Worst case these
> > places can always re-init iov_iter with a comment on why it is ok.
> > But I do agree in general that there are a few places that do wild
> > stuff.
> 
> The problem is that the generic code has to deal with all the 'wild stuff'.

getsockopt_iter will have optval_in for the minority of socket options
(like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
as optval_out. The latter is what the majority of socket options
will use to write their value. That doesn't seem too complicated to
handle?

> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.

I disagree that it's 'common'. Searching for copy_from_sockptr_offset
returns a few cases and they are mostly using read-with-offset because
there is no sequential read (iterator) semantics with sockptr_t.

> There also isn't a requirement for scatter-gather.
> 
> For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> sense for the syscall wrapper to do the user copies.
> But it would need to pass the user ptr+len as well as the kernel ptr+len
> to give the required flexibility.
> Then you have to work out whether the final copy to user is needed or not.
> (not that hard, but it all adds complication).

Not sure I understand what's the problem. The user vs kernel part will
be abstracted by iov_iter. The callers will have to write the optlen
back. And there are two call sites we care about: io_uring and regular
system call. What's your suggestion? Maybe I'm missing something. Do you
prefer get_optlen/put_optlen?
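
To make the optval_in/optval_out idea concrete, here is a minimal
userspace model in C. It is only a sketch under stated assumptions:
struct opt_cursor and the demo_* functions are hypothetical stand-ins
for a single-segment iov_iter and a protocol handler, not kernel APIs,
and memcpy() models the user copies. Both cursors cover the same
memory, and the position of the output cursor becomes the new optlen.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a single-segment iov_iter. */
struct opt_cursor {
	unsigned char *base;	/* start of the option buffer */
	size_t len;		/* total bytes in the buffer */
	size_t off;		/* bytes consumed or produced so far */
};

/* Sequential read from the "in" cursor; never reads past the end. */
static int cursor_read(struct opt_cursor *c, void *dst, size_t n)
{
	if (n > c->len - c->off)
		return -EFAULT;
	memcpy(dst, c->base + c->off, n);
	c->off += n;
	return 0;
}

/* Sequential write to the "out" cursor; never writes past the end. */
static int cursor_write(struct opt_cursor *c, const void *src, size_t n)
{
	if (n > c->len - c->off)
		return -EINVAL;
	memcpy(c->base + c->off, src, n);
	c->off += n;
	return 0;
}

/* Hypothetical in/out option: reads an int request, writes back twice its value. */
static int demo_getsockopt_iter(struct opt_cursor *in, struct opt_cursor *out)
{
	int req, ret;

	ret = cursor_read(in, &req, sizeof(req));
	if (ret)
		return ret;
	req *= 2;
	return cursor_write(out, &req, sizeof(req));
}

/* Generic caller: builds both cursors over the same memory and writes optlen back. */
static int demo_do_getsockopt(void *optval, int *optlen)
{
	struct opt_cursor in, out;
	int ret;

	if (*optlen < 0)
		return -EINVAL;
	in = (struct opt_cursor){ optval, (size_t)*optlen, 0 };
	out = in;	/* same buffer, independent position */

	ret = demo_getsockopt_iter(&in, &out);
	if (ret)
		return ret;	/* *optlen is left unchanged on error */
	*optlen = (int)out.off;
	return 0;
}
```

A handler that only writes would simply ignore the input cursor, which
is the common case described above.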


From david.laight.linux at gmail.com  Wed Apr  2 22:38:05 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 23:38:05 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <Z-2qX_N2-jpMYSIy@mini-arch>
References: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
 <ed2038b1-0331-43d6-ac15-fd7e004ab27e@samba.org>
 <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
 <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
 <20250402132906.0ceb8985@pumpkin> <Z-1Hgv4ImjWOW8X2@mini-arch>
 <20250402214638.0b5eed55@pumpkin> <Z-2qX_N2-jpMYSIy@mini-arch>
Message-ID: <20250402233805.464ed70e@pumpkin>

On Wed, 2 Apr 2025 14:21:35 -0700
Stanislav Fomichev <stfomichev at gmail.com> wrote:

> On 04/02, David Laight wrote:
> > On Wed, 2 Apr 2025 07:19:46 -0700
> > Stanislav Fomichev <stfomichev at gmail.com> wrote:
> >   
> > > On 04/02, David Laight wrote:  
> > > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > > Stefan Metzmacher <metze at samba.org> wrote:
> > > >     
> > > > > [..]
> > > > > 
> > > > > Yes, that might work, but it would be good to get some feedback
> > > > > if this would be the way to go:
> > > > > 
> > > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > > 				 int level, int optname,
> > > > > 				 struct iov_iter *optval_in,
> > > > > 				 struct iov_iter *optval_out);
> > > > > 
> > > > > And *optlen = optval_out->iov_offset;
> > > > > 
> > > > > Any objection or better ideas? Linus would that be what you had in mind?    
> > > > 
> > > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > > Also look at the SCTP code.    
> > > 
> > > Performance usually does not matter for set/getsockopts, there
> > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)  
> > 
> > That might be the one that is really horrid and completely abuses
> > the 'length' parameter.  
> 
> It is reading and writing, yes, but it's not a huge problem. And it
> does enforce the optlen (to copy back the same amount of bytes). It's
> not that bad, it's just an example of where we need to be extra
> careful.
> 
> > > and maybe recent
> > > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > > idk. I'm skeptical we can convert everything though, that's why the
> > > suggestion to start with sk/ip/tcp/udp.
> > >   
> > > > How do you handle code that wants to return an updated length (often longer
> > > > than the one provided) and an error code (eg ERRSIZE or similar).
> > > >
> > > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > > where the buffer length the application provides is only that of the header.
> > > > The actual buffer length is contained in the header.
> > > > The return length is the amount written into the full buffer.    
> > > 
> > > Let's discuss these special cases as they come up? Worst case these
> > > places can always re-init iov_iter with a comment on why it is ok.
> > > But I do agree in general that there are a few places that do wild
> > > stuff.  
> > 
> > The problem is that the generic code has to deal with all the 'wild stuff'.  
> 
> getsockopt_iter will have optval_in for the minority of socket options
> (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
> as optval_out. The latter is what the majority of socket options
> will use to write their value. That doesn't seem too complicated to
> handle?
> 
> > It is also common to do non-sequential accesses - so iov_iter doesn't match
> > at all.  
> 
> I disagree that it's 'common'. Searching for copy_from_sockptr_offset
> returns a few cases and they are mostly using read-with-offset because
> there is no sequential read (iterator) semantics with sockptr_t.
> 
> > There also isn't a requirement for scatter-gather.
> > 
> > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> > sense for the syscall wrapper to do the user copies.
> > But it would need to pass the user ptr+len as well as the kernel ptr+len
> > to give the required flexibility.
> > Then you have to work out whether the final copy to user is needed or not.
> > (not that hard, but it all adds complication).  
> 
> Not sure I understand what's the problem. The user vs kernel part will
> be abstracted by iov_iter. The callers will have to write the optlen
> back. And there are two call sites we care about: io_uring and regular
> system call. What's your suggestion? Maybe I'm missing something. Do you
> prefer get_optlen/put_optlen?

I think the final aim should be to pass the user supplied length to the
per-protocol code and have it return the length/error to be passed back to the
user.

But in a lot of cases the syscall wrapper can do the buffer copies (as well
as the length copies).
That would be restricted to short lengths (an on-stack buffer).
So code that needed a long buffer (like some of the sctp options)
would need to directly access the user buffer (or a long buffer provided
by an in-kernel user).

But you'll find code that reads/writes well beyond the apparent size of
the user buffer.
(And not just code that accesses 4 bytes without checking the length).

	David
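
The split described above (wrapper-side copies for short options,
direct buffer access for long ones) could look roughly like the
following userspace model in C. This is a sketch, not a proposal for
the real interface: OPT_STACK_MAX, opt_handler_t and the function names
are hypothetical, memcpy() stands in for the user copies, and rejecting
a handler that claims more bytes than it was given is an assumption of
the sketch.

```c
#include <errno.h>
#include <string.h>

#define OPT_STACK_MAX 64	/* hypothetical cutoff for "short" options */

/*
 * Hypothetical handler type: receives a buffer plus the caller-supplied
 * length and returns the number of bytes produced or a negative errno.
 */
typedef int (*opt_handler_t)(void *buf, int len);

static int wrapper_getsockopt(opt_handler_t handler,
			      void *user_optval, int *user_optlen)
{
	unsigned char kbuf[OPT_STACK_MAX];
	int len = *user_optlen;		/* models copying optlen in */
	int ret;

	if (len < 0)
		return -EINVAL;

	if (len > OPT_STACK_MAX) {
		/* Long option: the handler works on the caller's buffer. */
		ret = handler(user_optval, len);
	} else {
		/* Short option: the wrapper does both buffer copies. */
		memcpy(kbuf, user_optval, (size_t)len);
		ret = handler(kbuf, len);
		if (ret > len)
			return -EINVAL;	/* sketch only: reject over-long results */
		if (ret > 0)
			memcpy(user_optval, kbuf, (size_t)ret);
	}
	if (ret < 0)
		return ret;		/* length is not updated on error */

	*user_optlen = ret;		/* models copying optlen back out */
	return 0;
}

/* Hypothetical handler that returns a fixed int value. */
static int demo_int_handler(void *buf, int len)
{
	int v = 1234;

	if (len < (int)sizeof(v))
		return -EINVAL;
	memcpy(buf, &v, sizeof(v));
	return (int)sizeof(v);
}
```

Handlers that report a required length larger than the supplied buffer
would need a separate path in such a scheme.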


From stfomichev at gmail.com  Wed Apr  2 23:39:17 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Wed, 2 Apr 2025 16:39:17 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer
 via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402233805.464ed70e@pumpkin>
References: <Z+wH1oYOr1dlKeyN@gmail.com> <Z-wKI1rQGSgrsjbl@mini-arch>
 <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
 <Z-xi7TH83upf-E3q@mini-arch>
 <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
 <20250402132906.0ceb8985@pumpkin> <Z-1Hgv4ImjWOW8X2@mini-arch>
 <20250402214638.0b5eed55@pumpkin> <Z-2qX_N2-jpMYSIy@mini-arch>
 <20250402233805.464ed70e@pumpkin>
Message-ID: <Z-3KpXR_nJQ4X76F@mini-arch>

On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 14:21:35 -0700
> Stanislav Fomichev <stfomichev at gmail.com> wrote:
> 
> > On 04/02, David Laight wrote:
> > > On Wed, 2 Apr 2025 07:19:46 -0700
> > > Stanislav Fomichev <stfomichev at gmail.com> wrote:
> > >   
> > > > On 04/02, David Laight wrote:  
> > > > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > > > Stefan Metzmacher <metze at samba.org> wrote:
> > > > >     
> > > > > > [..]
> > > > > > 
> > > > > > Yes, that might work, but it would be good to get some feedback
> > > > > > if this would be the way to go:
> > > > > > 
> > > > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > > > 				 int level, int optname,
> > > > > > 				 struct iov_iter *optval_in,
> > > > > > 				 struct iov_iter *optval_out);
> > > > > > 
> > > > > > And *optlen = optval_out->iov_offset;
> > > > > > 
> > > > > > Any objection or better ideas? Linus would that be what you had in mind?    
> > > > > 
> > > > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > > > Also look at the SCTP code.    
> > > > 
> > > > Performance usually does not matter for set/getsockopts, there
> > > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)  
> > > 
> > > That might be the one that is really horrid and completely abuses
> > > the 'length' parameter.  
> > 
> > It is reading and writing, yes, but it's not a huge problem. And it
> > does enforce the optlen (to copy back the same amount of bytes). It's
> > not that bad, it's just an example of where we need to be extra
> > careful.
> > 
> > > > and maybe recent
> > > > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > > > idk. I'm skeptical we can convert everything though, that's why the
> > > > suggestion to start with sk/ip/tcp/udp.
> > > >   
> > > > > How do you handle code that wants to return an updated length (often longer
> > > > > than the one provided) and an error code (eg ERRSIZE or similar).
> > > > >
> > > > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > > > where the buffer length the application provides is only that of the header.
> > > > > The actual buffer length is contained in the header.
> > > > > The return length is the amount written into the full buffer.    
> > > > 
> > > > Let's discuss these special cases as they come up? Worst case these
> > > > places can always re-init iov_iter with a comment on why it is ok.
> > > > But I do agree in general that there are a few places that do wild
> > > > stuff.  
> > > 
> > > The problem is that the generic code has to deal with all the 'wild stuff'.  
> > 
> > getsockopt_iter will have optval_in for the minority of socket options
> > (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
> > as optval_out. The latter is what the majority of socket options
> > will use to write their value. That doesn't seem too complicated to
> > handle?
> > 
> > > It is also common to do non-sequential accesses - so iov_iter doesn't match
> > > at all.  
> > 
> > I disagree that it's 'common'. Searching for copy_from_sockptr_offset
> > returns a few cases and they are mostly using read-with-offset because
> > there is no sequential read (iterator) semantics with sockptr_t.
> > 
> > > There also isn't a requirement for scatter-gather.
> > > 
> > > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> > > sense for the syscall wrapper to do the user copies.
> > > But it would need to pass the user ptr+len as well as the kernel ptr+len
> > > to give the required flexibility.
> > > Then you have to work out whether the final copy to user is needed or not.
> > > (not that hard, but it all adds complication).  
> > 
> > Not sure I understand what's the problem. The user vs kernel part will
> > be abstracted by iov_iter. The callers will have to write the optlen
> > back. And there are two call sites we care about: io_uring and regular
> > system call. What's your suggestion? Maybe I'm missing something. Do you
> > prefer get_optlen/put_optlen?
> 
> I think the final aim should be to pass the user supplied length to the
> per-protocol code and have it return the length/error to be passed back to the
> user.

Like what Stefan's patch 3 is doing? Or are you suggesting changing the
getsockopt handlers to handle the length more explicitly? If we were
to proceed with the sockptr to iov_iter conversion, we'd have to do it
anyway (or pass the length as the size of the iov_iter).

> But in a lot of cases the syscall wrapper can do the buffer copies (as well
> as the length copies).
> That would be restricted to short length (on stack).
> So code that needed a long buffer (like some of the sctp options)
> would need to directly access the user buffer (or a long buffer provided
> by an in-kernel user).

This sounds similar to what we did with bpf hooks - copy (head of) the
buffer and run bpf program on top of it. I remember iptables setsockopt
being problematic because of its huge size. It is an option, yes (to
convert protocol handler to kernel memory mostly).

> But you'll find code that reads/writes well beyond the apparent size of
> the user buffer.
> (And not just code that accesses 4 bytes without checking the length).

We can start with getsockopt_iter + sk_getsockopt to see if there are any
issues with that approach. If not, adding ip/tcp/udp to the mix should be doable.
We can explain and comment on special cases if needed. When other protocols
are needed from io_uring, we can convert more. But at least the new code
will use the correct abstractions.
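
To make the length handling discussed above concrete, here is a minimal
userspace sketch (not kernel code, and not taken from any of the patches)
of a caller that reads the supplied length once, hands the handler a short
on-stack buffer, and writes the resulting length back. The names
demo_handler() and demo_getsockopt() are hypothetical, and the plain
memcpy() only stands in for the final copy to the user or kernel buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler: fills 'buf' (capacity 'len') and returns the
 * number of bytes produced, or a negative errno value. */
static int demo_handler(void *buf, size_t len)
{
	int value = 42;

	if (len < sizeof(value))
		return -EINVAL;
	memcpy(buf, &value, sizeof(value));
	return (int)sizeof(value);
}

/* Hypothetical wrapper standing in for the syscall / io_uring caller:
 * it owns the optlen read and write-back, so the handler never sees it. */
static int demo_getsockopt(void *optval, int *optlen)
{
	char tmp[64];		/* short on-stack buffer */
	int len = *optlen;	/* "get" the caller's length once */
	int ret;

	if (len < 0)
		return -EINVAL;
	if ((size_t)len > sizeof(tmp))
		len = (int)sizeof(tmp);

	ret = demo_handler(tmp, (size_t)len);
	if (ret < 0)
		return ret;

	memcpy(optval, tmp, (size_t)ret);	/* stands in for the copy-out */
	*optlen = ret;				/* "put" the length back */
	return 0;
}
```

Options that need a long buffer, or that read optval as input, would not
fit this shape and would still need direct access to the caller's buffer.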


From kangyan91 at outlook.com  Wed Apr  2 16:15:56 2025
From: kangyan91 at outlook.com (YAN KANG)
Date: Wed, 2 Apr 2025 16:15:56 +0000
Subject: [rds-devel] BUG: KASAN: slab-use-after-free in rds_inc_put
Message-ID: <SY8P300MB0421C6B1FD42BA488F04B797A1AF2@SY8P300MB0421.AUSP300.PROD.OUTLOOK.COM>

Dear maintainers,

My fuzzing tool found a new kernel bug titled "BUG: KASAN: slab-use-after-free in rds_inc_put". I tested it on the Linux upstream version (6.14.0-rc6).
Because the target object is freed in a kernel workqueue kthread, I have no reproducer for this bug. But the crash log is sufficient to describe the cause of the bug.

Root cause analysis:
In net/rds/recv.c:
void rds_inc_put(struct rds_incoming *inc)
{
	rdsdebug("put inc %p ref %d\n", inc, refcount_read(&inc->i_refcount));
	if (refcount_dec_and_test(&inc->i_refcount)) {
		BUG_ON(!list_empty(&inc->i_item));

		inc->i_conn->c_trans->inc_free(inc);  // crash, because inc->i_conn is a dangling pointer.
	}
}
The struct rds_connection object is allocated in rds_sendmsg() and added to the loop_conns list.

Two structures then hold a reference to the struct rds_connection object:
1. struct rds_sock has the field (struct rds_connection *rs_conn); rs->rs_conn is initialized in rds_sendmsg().
2. The global list loop_conns: each item (of type struct rds_loop_connection *) has a field (struct rds_connection *conn). In __rds_conn_create(), conn is allocated and added to the global list.

In the workqueue, cleanup_net() calls rds_loop_kill_conns(), which frees all connections. But in another thread, the rds_sock still holds the dangling pointer.

Fix suggestion:
I think there needs to be some synchronization mechanism for rds_connection's lifecycle.
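
One common pattern for this kind of lifetime problem is reference counting,
where each holder takes its own reference and the object is freed only when
the last one is dropped. The following is a minimal userspace sketch of that
pattern using C11 atomics. It is an illustration only, not RDS code and not
a proposed patch; struct demo_conn and the demo_conn_*() helpers are
hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared connection object. */
struct demo_conn {
	atomic_int refs;
	int id;
};

static struct demo_conn *demo_conn_alloc(int id)
{
	struct demo_conn *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	atomic_init(&c->refs, 1);	/* creator holds the first reference */
	c->id = id;
	return c;
}

/* Each additional holder (e.g. a socket or a global list) takes its own
 * reference before storing the pointer. */
static void demo_conn_get(struct demo_conn *c)
{
	atomic_fetch_add(&c->refs, 1);
}

/* Returns true when this call dropped the last reference and freed 'c'. */
static bool demo_conn_put(struct demo_conn *c)
{
	if (atomic_fetch_sub(&c->refs, 1) == 1) {
		free(c);
		return true;
	}
	return false;
}
```

With this shape, a teardown path that drops its reference cannot free the
object while another holder still has one.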

If you fix this issue, please add the following tag to the commit:
Reported-by: yan kang <kangyan91 at outlook.com>
Reported-by: yue sun <samsun1006219 at gmail.com>


I hope it helps.
Best regards
yan kang


Kernel crash log is below.
==================================================================
BUG: KASAN: slab-use-after-free in rds_inc_put+0x210/0x220 net/rds/recv.c:83
Read of size 8 at addr ffff88803d111048 by task syz.0.615/15412

CPU: 0 UID: 0 PID: 15412 Comm: syz.0.615 Not tainted 6.14.0-rc6-00006-g7122647c49bb-dirty #112
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xc0/0x5e0 mm/kasan/report.c:489
 kasan_report+0xbd/0xf0 mm/kasan/report.c:602
 rds_inc_put+0x210/0x220 net/rds/recv.c:83
 rds_clear_recv_queue+0x3e6/0x610 net/rds/recv.c:778
 rds_release+0xdb/0x460 net/rds/af_rds.c:73
 __sock_release+0xb0/0x270 net/socket.c:640
 sock_close+0x1c/0x30 net/socket.c:1408
 __fput+0x3f8/0xb40 fs/file_table.c:450
 task_work_run+0x169/0x260 kernel/task_work.c:239
 exit_task_work include/linux/task_work.h:43 [inline]
 do_exit+0xacc/0x2ce0 kernel/exit.c:938
 do_group_exit+0xd3/0x2a0 kernel/exit.c:1087
 get_signal+0x222c/0x2500 kernel/signal.c:3017
 arch_do_signal_or_restart+0x81/0x7d0 arch/x86/kernel/signal.c:337
 exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218
 do_syscall_64+0xd8/0x250 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff6239e6d48
Code: Unable to access opcode bytes at 0x7ff6239e6d1e.
RSP: 002b:00007ff6214f5e90 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
RAX: fffffffffffffdfc RBX: 00007ff623bb5f01 RCX: 00007ff6239e6d48
RDX: 00007ff6214f5f20 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00007ff623a39f8e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007ff6214f5f20
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff6214d6000
 </TASK>

Allocated by task 16518:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
 kasan_save_track+0x14/0x30 mm/kasan/common.c:68
 unpoison_slab_object mm/kasan/common.c:319 [inline]
 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:345
 kasan_slab_alloc include/linux/kasan.h:250 [inline]
 slab_post_alloc_hook mm/slub.c:4119 [inline]
 slab_alloc_node mm/slub.c:4168 [inline]
 kmem_cache_alloc_noprof+0x167/0x3e0 mm/slub.c:4175
 __rds_conn_create+0x83c/0x2330 net/rds/connection.c:193
 rds_conn_create_outgoing+0x44/0x60 net/rds/connection.c:363
 rds_sendmsg+0x11b2/0x3160 net/rds/send.c:1294
 sock_sendmsg_nosec net/socket.c:711 [inline]
 __sock_sendmsg net/socket.c:726 [inline]
 __sys_sendto+0x4fc/0x570 net/socket.c:2197
 __do_sys_sendto net/socket.c:2204 [inline]
 __se_sys_sendto net/socket.c:2200 [inline]
 __x64_sys_sendto+0xe0/0x1c0 net/socket.c:2200
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 9656:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
 kasan_save_track+0x14/0x30 mm/kasan/common.c:68
 kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:582
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x54/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2353 [inline]
 slab_free mm/slub.c:4613 [inline]
 kmem_cache_free+0x145/0x4b0 mm/slub.c:4715
 rds_conn_destroy+0x61f/0x850 net/rds/connection.c:513
 rds_loop_kill_conns net/rds/loop.c:213 [inline]
 rds_loop_exit_net+0x2cd/0x410 net/rds/loop.c:219
 ops_exit_list+0xb0/0x180 net/core/net_namespace.c:172
 cleanup_net+0x5b3/0xd90 net/core/net_namespace.c:648
 process_one_work+0x966/0x1b90 kernel/workqueue.c:3236
 process_scheduled_works kernel/workqueue.c:3317 [inline]
 worker_thread+0x66e/0xe80 kernel/workqueue.c:3398
 kthread+0x2c7/0x3b0 kernel/kthread.c:389
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

The buggy address belongs to the object at ffff88803d111000
 which belongs to the cache rds_connection of size 240
The buggy address is located 72 bytes inside of
 freed 240-byte region [ffff88803d111000, ffff88803d1110f0)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88803d111000 pfn:0x3d111
flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000000 ffff88802aefb500 dead000000000122 0000000000000000
raw: ffff88803d111000 00000000800d000c 00000001f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 16518, tgid 16516 (syz.2.792), ts 132666258205, free_ts 129603681735
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x2e7/0x350 mm/page_alloc.c:1558
 prep_new_page mm/page_alloc.c:1566 [inline]
 get_page_from_freelist+0xe4e/0x2b20 mm/page_alloc.c:3476
 __alloc_pages_noprof+0x219/0x2190 mm/page_alloc.c:4753
 alloc_pages_mpol_noprof+0x2b6/0x600 mm/mempolicy.c:2269
 alloc_slab_page mm/slub.c:2423 [inline]
 allocate_slab mm/slub.c:2589 [inline]
 new_slab+0x2d5/0x420 mm/slub.c:2642
 ___slab_alloc+0xbb7/0x1850 mm/slub.c:3830
 __slab_alloc.constprop.0+0x56/0xb0 mm/slub.c:3920
 __slab_alloc_node mm/slub.c:3995 [inline]
 slab_alloc_node mm/slub.c:4156 [inline]
 kmem_cache_alloc_noprof+0x264/0x3e0 mm/slub.c:4175
 __rds_conn_create+0x83c/0x2330 net/rds/connection.c:193
 rds_conn_create_outgoing+0x44/0x60 net/rds/connection.c:363
 rds_sendmsg+0x11b2/0x3160 net/rds/send.c:1294
 sock_sendmsg_nosec net/socket.c:711 [inline]
 __sock_sendmsg net/socket.c:726 [inline]
 __sys_sendto+0x4fc/0x570 net/socket.c:2197
 __do_sys_sendto net/socket.c:2204 [inline]
 __se_sys_sendto net/socket.c:2200 [inline]
 __x64_sys_sendto+0xe0/0x1c0 net/socket.c:2200
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 49 tgid 49 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 free_pages_prepare mm/page_alloc.c:1127 [inline]
 free_unref_page+0x700/0x10a0 mm/page_alloc.c:2659
 vfree+0x172/0x940 mm/vmalloc.c:3383
 delayed_vfree_work+0x57/0x70 mm/vmalloc.c:3303
 process_one_work+0x966/0x1b90 kernel/workqueue.c:3236
 process_scheduled_works kernel/workqueue.c:3317 [inline]
 worker_thread+0x66e/0xe80 kernel/workqueue.c:3398
 kthread+0x2c7/0x3b0 kernel/kthread.c:389
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

Memory state around the buggy address:
 ffff88803d110f00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff88803d110f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88803d111000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
 ffff88803d111080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
 ffff88803d111100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

From leon at kernel.org  Tue Apr  8 11:04:55 2025
From: leon at kernel.org (Leon Romanovsky)
Date: Tue,  8 Apr 2025 14:04:55 +0300
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if
 device is ODP capable
Message-ID: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>

From: Leon Romanovsky <leonro at nvidia.com>

There is no need to check whether the IB device is ODP capable, as
ib_reg_user_mr() will check all access flags anyway.

RDS is the only in-kernel ODP user, so change the return value for the
ODP-not-supported case to the value used by RDS.

Signed-off-by: Leon Romanovsky <leonro at nvidia.com>
---
 drivers/infiniband/core/verbs.c | 2 +-
 net/rds/ib.c                    | 8 --------
 net/rds/ib.h                    | 1 -
 net/rds/ib_rdma.c               | 5 -----
 4 files changed, 1 insertion(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index c5e78bbefbd0..61620787ee48 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2218,7 +2218,7 @@ struct ib_mr *ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		if (!(pd->device->attrs.kernel_cap_flags &
 		      IBK_ON_DEMAND_PAGING)) {
 			pr_debug("ODP support not available\n");
-			return ERR_PTR(-EINVAL);
+			return ERR_PTR(-EOPNOTSUPP);
 		}
 	}
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9826fe7f9d00..c62aa2ff4963 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
 	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
 	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
 
-	rds_ibdev->odp_capable =
-		!!(device->attrs.kernel_cap_flags &
-		   IBK_ON_DEMAND_PAGING) &&
-		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
-		   IB_ODP_SUPPORT_WRITE) &&
-		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
-		   IB_ODP_SUPPORT_READ);
-
 	rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
 		min_t(unsigned int, (device->attrs.max_mr / 2),
 		      rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 8ef3178ed4d6..f3ec4ff5951f 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -246,7 +246,6 @@ struct rds_ib_device {
 	struct list_head	conn_list;
 	struct ib_device	*dev;
 	struct ib_pd		*pd;
-	u8			odp_capable:1;
 
 	unsigned int		max_mrs;
 	struct rds_ib_mr_pool	*mr_1m_pool;
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index d1cfceeff133..75ab7b8db864 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -568,11 +568,6 @@ void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 		struct ib_sge sge = {};
 		struct ib_mr *ib_mr;
 
-		if (!rds_ibdev->odp_capable) {
-			ret = -EOPNOTSUPP;
-			goto out;
-		}
-
 		ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr,
 				       access_flags);
 
-- 
2.49.0


From jgg at nvidia.com  Tue Apr  8 12:23:38 2025
From: jgg at nvidia.com (Jason Gunthorpe)
Date: Tue, 8 Apr 2025 09:23:38 -0300
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine
 if device is ODP capable
In-Reply-To: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
References: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
Message-ID: <20250408122338.GA1778492@nvidia.com>

On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote:
> diff --git a/net/rds/ib.c b/net/rds/ib.c
> index 9826fe7f9d00..c62aa2ff4963 100644
> --- a/net/rds/ib.c
> +++ b/net/rds/ib.c
> @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
>  	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
>  	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
>  
> -	rds_ibdev->odp_capable =
> -		!!(device->attrs.kernel_cap_flags &
> -		   IBK_ON_DEMAND_PAGING) &&
> -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> -		   IB_ODP_SUPPORT_WRITE) &&
> -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> -		   IB_ODP_SUPPORT_READ);

This patch seems to drop the check for WRITE and READ support on the
ODP.

Jason


From leon at kernel.org  Tue Apr  8 12:34:13 2025
From: leon at kernel.org (Leon Romanovsky)
Date: Tue, 8 Apr 2025 15:34:13 +0300
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine
 if device is ODP capable
In-Reply-To: <20250408122338.GA1778492@nvidia.com>
References: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
 <20250408122338.GA1778492@nvidia.com>
Message-ID: <20250408123413.GA199604@unreal>

On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote:
> > diff --git a/net/rds/ib.c b/net/rds/ib.c
> > index 9826fe7f9d00..c62aa2ff4963 100644
> > --- a/net/rds/ib.c
> > +++ b/net/rds/ib.c
> > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
> >  	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
> >  	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
> >  
> > -	rds_ibdev->odp_capable =
> > -		!!(device->attrs.kernel_cap_flags &
> > -		   IBK_ON_DEMAND_PAGING) &&
> > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > -		   IB_ODP_SUPPORT_WRITE) &&
> > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > -		   IB_ODP_SUPPORT_READ);
> 
> This patch seems to drop the check for WRITE and READ support on the
> ODP.

Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP
providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ.

RDS doesn't need to check more than general ODP support and can safely
rely on internal driver logic to create the right MR.

Thanks

> 
> Jason


From jgg at nvidia.com  Tue Apr  8 12:38:14 2025
From: jgg at nvidia.com (Jason Gunthorpe)
Date: Tue, 8 Apr 2025 09:38:14 -0300
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine
 if device is ODP capable
In-Reply-To: <20250408123413.GA199604@unreal>
References: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
 <20250408122338.GA1778492@nvidia.com>
 <20250408123413.GA199604@unreal>
Message-ID: <20250408123814.GC1778492@nvidia.com>

On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote:
> On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote:
> > > diff --git a/net/rds/ib.c b/net/rds/ib.c
> > > index 9826fe7f9d00..c62aa2ff4963 100644
> > > --- a/net/rds/ib.c
> > > +++ b/net/rds/ib.c
> > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
> > >  	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
> > >  	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
> > >  
> > > -	rds_ibdev->odp_capable =
> > > -		!!(device->attrs.kernel_cap_flags &
> > > -		   IBK_ON_DEMAND_PAGING) &&
> > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > -		   IB_ODP_SUPPORT_WRITE) &&
> > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > -		   IB_ODP_SUPPORT_READ);
> > 
> > This patch seems to drop the check for WRITE and READ support on the
> > ODP.
> 
> Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP
> providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ.

Where? mlx5 reads this from FW and I don't see anything blocking
IBK_ON_DEMAND_PAGING if the FW is weird.

Jason


From leon at kernel.org  Tue Apr  8 19:11:38 2025
From: leon at kernel.org (Leon Romanovsky)
Date: Tue, 8 Apr 2025 22:11:38 +0300
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine
 if device is ODP capable
In-Reply-To: <20250408123814.GC1778492@nvidia.com>
References: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
 <20250408122338.GA1778492@nvidia.com>
 <20250408123413.GA199604@unreal>
 <20250408123814.GC1778492@nvidia.com>
Message-ID: <20250408191138.GF199604@unreal>

On Tue, Apr 08, 2025 at 09:38:14AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote:
> > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote:
> > > > diff --git a/net/rds/ib.c b/net/rds/ib.c
> > > > index 9826fe7f9d00..c62aa2ff4963 100644
> > > > --- a/net/rds/ib.c
> > > > +++ b/net/rds/ib.c
> > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
> > > >  	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
> > > >  	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
> > > >  
> > > > -	rds_ibdev->odp_capable =
> > > > -		!!(device->attrs.kernel_cap_flags &
> > > > -		   IBK_ON_DEMAND_PAGING) &&
> > > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > > -		   IB_ODP_SUPPORT_WRITE) &&
> > > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > > -		   IB_ODP_SUPPORT_READ);
> > > 
> > > This patch seems to drop the check for WRITE and READ support on the
> > > ODP.
> > 
> > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP
> > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ.
> 
> Where? mlx5 reads this from FW and I don't see anything blocking
> IBK_ON_DEMAND_PAGING if the FW is weird.

As the one who added it, I can assure you that we added these checks not
because of weird FW, but because these caps existed.

RDS calls ib_reg_user_mr() with the following access_flags:

                int access_flags =
                        (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
                         IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC |
                         IB_ACCESS_ON_DEMAND);
  <...>

                ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr,
                                       access_flags);

If for some reason ODP doesn't support WRITE and/or READ, ib_reg_user_mr() will return an error from FW.
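
For reference, the shape of the check being removed can be sketched in
plain C as below. The DEMO_* bit values and demo_odp_capable() are
hypothetical and exist only for illustration; they are not the kernel's
definitions. The case under discussion in this thread is the one where the
general ODP bit is set but one of the per-transport bits is not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define DEMO_CAP_ODP	(1u << 0)	/* general ODP support */
#define DEMO_ODP_WRITE	(1u << 1)	/* per-transport WRITE support */
#define DEMO_ODP_READ	(1u << 2)	/* per-transport READ support */

/* Mirrors the removed expression: general ODP support plus both
 * per-transport bits must be present. */
static bool demo_odp_capable(uint32_t kernel_caps, uint32_t rc_odp_caps)
{
	return (kernel_caps & DEMO_CAP_ODP) &&
	       (rc_odp_caps & DEMO_ODP_WRITE) &&
	       (rc_odp_caps & DEMO_ODP_READ);
}
```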

Thanks


> 
> Jason


From pranav.tyagi03 at gmail.com  Tue Apr  8 19:41:53 2025
From: pranav.tyagi03 at gmail.com (Pranav Tyagi)
Date: Wed,  9 Apr 2025 01:11:53 +0530
Subject: [rds-devel] [PATCH] net: rds: replace strncpy with memcpy
Message-ID: <20250408194153.6570-1-pranav.tyagi03@gmail.com>

Replace the deprecated strncpy() function with memcpy(),
as the destination buffer is length-bounded
and not required to be NUL-terminated.

Signed-off-by: Pranav Tyagi <pranav.tyagi03 at gmail.com>
---
 net/rds/connection.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index c749c5525b40..3718c3edb32e 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -749,8 +749,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
 	cinfo->laddr = conn->c_laddr.s6_addr32[3];
 	cinfo->faddr = conn->c_faddr.s6_addr32[3];
 	cinfo->tos = conn->c_tos;
-	strncpy(cinfo->transport, conn->c_trans->t_name,
-		sizeof(cinfo->transport));
+	memcpy(cinfo->transport, conn->c_trans->t_name, min(sizeof(cinfo->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo->transport))));
 	cinfo->flags = 0;
 
 	rds_conn_info_set(cinfo->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
@@ -775,8 +774,7 @@ static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
 	cinfo6->next_rx_seq = cp->cp_next_rx_seq;
 	cinfo6->laddr = conn->c_laddr;
 	cinfo6->faddr = conn->c_faddr;
-	strncpy(cinfo6->transport, conn->c_trans->t_name,
-		sizeof(cinfo6->transport));
+	memcpy(cinfo6->transport, conn->c_trans->t_name, min(sizeof(cinfo6->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo6->transport))));
 	cinfo6->flags = 0;
 
 	rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
-- 
2.49.0


From shannon.nelson at amd.com  Tue Apr  8 21:18:12 2025
From: shannon.nelson at amd.com (Nelson, Shannon)
Date: Tue, 8 Apr 2025 14:18:12 -0700
Subject: [rds-devel] [PATCH] net: rds: replace strncpy with memcpy
In-Reply-To: <20250408194153.6570-1-pranav.tyagi03@gmail.com>
References: <20250408194153.6570-1-pranav.tyagi03@gmail.com>
Message-ID: <c7dbeaf7-ab93-4b4f-904c-99d42a83a83d@amd.com>

On 4/8/2025 12:41 PM, Pranav Tyagi wrote:
> 
> Replace deprecated strncpy() function with memcpy()

I suspect that strtomem() is a better answer here than a raw memcpy() - 
it already has all the strnlen() and min() stuff baked into it, along 
with some other compile-time checking.

> as the destination buffer is length bounded
> and not required to be NUL-terminated

Are you sure that null-termination is not required?  I'm not familiar 
with this bit of code, but the definitions of both of the .transport[] 
fields do say /* null term ascii */

sln

> 
> Signed-off-by: Pranav Tyagi <pranav.tyagi03 at gmail.com>
> ---
>   net/rds/connection.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index c749c5525b40..3718c3edb32e 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -749,8 +749,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
>          cinfo->laddr = conn->c_laddr.s6_addr32[3];
>          cinfo->faddr = conn->c_faddr.s6_addr32[3];
>          cinfo->tos = conn->c_tos;
> -       strncpy(cinfo->transport, conn->c_trans->t_name,
> -               sizeof(cinfo->transport));
> +       memcpy(cinfo->transport, conn->c_trans->t_name, min(sizeof(cinfo->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo->transport))));
>          cinfo->flags = 0;
> 
>          rds_conn_info_set(cinfo->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
> @@ -775,8 +774,7 @@ static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
>          cinfo6->next_rx_seq = cp->cp_next_rx_seq;
>          cinfo6->laddr = conn->c_laddr;
>          cinfo6->faddr = conn->c_faddr;
> -       strncpy(cinfo6->transport, conn->c_trans->t_name,
> -               sizeof(cinfo6->transport));
> +       memcpy(cinfo6->transport, conn->c_trans->t_name, min(sizeof(cinfo6->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo6->transport))));
>          cinfo6->flags = 0;
> 
>          rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
> --
> 2.49.0
> 
> 


From allison.henderson at oracle.com  Tue Apr  8 22:45:24 2025
From: allison.henderson at oracle.com (Allison Henderson)
Date: Tue, 8 Apr 2025 22:45:24 +0000
Subject: [rds-devel] [PATCH] net: rds: replace strncpy with memcpy
In-Reply-To: <c7dbeaf7-ab93-4b4f-904c-99d42a83a83d@amd.com>
References: <20250408194153.6570-1-pranav.tyagi03@gmail.com>
 <c7dbeaf7-ab93-4b4f-904c-99d42a83a83d@amd.com>
Message-ID: <32b8d4635b8f15cce3ae898cc480616428bc93ba.camel@oracle.com>

On Tue, 2025-04-08 at 14:18 -0700, Nelson, Shannon wrote:
> On 4/8/2025 12:41 PM, Pranav Tyagi wrote:
> > 
> > Replace deprecated strncpy() function with memcpy()
> 
> I suspect that strtomem() is a better answer here than a raw memcpy() - 
> it already has all the strnlen() and min() stuff baked into it, along 
> with some other compile-time checking.
> 
> > as the destination buffer is length bounded
> > and not required to be NUL-terminated
> 
> Are you sure that null-termination is not required?  I'm not familiar 
> with this bit of code, but the definitions of both of the .transport[] 
> fields do say /* null term ascii */
> 
> sln
> 

Hi all,

It appears that the transport names are null-terminated. Looking at rds_ib_transport, rds_tcp_transport, and
rds_loop_transport, the t_name member is initialized to "infiniband", "tcp", or "loop", respectively, which include the
null terminator. Given that, I think strscpy seems to be the appropriate function to use here.
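
To illustrate why termination matters for a fixed-size field, here is a
minimal userspace sketch of a bounded copy that always NUL-terminates the
destination. demo_bounded_copy() is a hypothetical helper written for this
example, only loosely modelled on that behaviour; it is not the kernel's
strscpy() and its return convention is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most size-1 bytes of 'src' into 'dst' and
 * always NUL-terminate. Returns the number of bytes copied (excluding the
 * terminator), or -1 if 'src' had to be truncated or 'size' is zero. */
static long demo_bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len = 0;

	if (size == 0)
		return -1;

	/* Length of 'src', but never scanning past 'size' bytes. */
	while (len < size && src[len] != '\0')
		len++;

	if (len == size) {		/* src does not fit with its NUL */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len + 1);	/* includes the terminator */
	return (long)len;
}
```

By contrast, a memcpy() of only strnlen() bytes copies no terminator at
all, so the result is NUL-terminated only if the destination was already
zeroed.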

However, it looks like Baris already submitted a similar patch yesterday, and unfortunately, we can't accept both.
That said, thank you very much for your contribution; we really appreciate it!

Allison

> > 
> > Signed-off-by: Pranav Tyagi <pranav.tyagi03 at gmail.com>
> > ---
> >   net/rds/connection.c | 6 ++----
> >   1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/rds/connection.c b/net/rds/connection.c
> > index c749c5525b40..3718c3edb32e 100644
> > --- a/net/rds/connection.c
> > +++ b/net/rds/connection.c
> > @@ -749,8 +749,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
> >          cinfo->laddr = conn->c_laddr.s6_addr32[3];
> >          cinfo->faddr = conn->c_faddr.s6_addr32[3];
> >          cinfo->tos = conn->c_tos;
> > -       strncpy(cinfo->transport, conn->c_trans->t_name,
> > -               sizeof(cinfo->transport));
> > +       memcpy(cinfo->transport, conn->c_trans->t_name, min(sizeof(cinfo->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo->transport))));
> >          cinfo->flags = 0;
> > 
> >          rds_conn_info_set(cinfo->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
> > @@ -775,8 +774,7 @@ static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
> >          cinfo6->next_rx_seq = cp->cp_next_rx_seq;
> >          cinfo6->laddr = conn->c_laddr;
> >          cinfo6->faddr = conn->c_faddr;
> > -       strncpy(cinfo6->transport, conn->c_trans->t_name,
> > -               sizeof(cinfo6->transport));
> > +       memcpy(cinfo6->transport, conn->c_trans->t_name, min(sizeof(cinfo6->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo6->transport))));
> >          cinfo6->flags = 0;
> > 
> >          rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
> > --
> > 2.49.0
> > 
> > 
> 
> 


From allison.henderson at oracle.com  Wed Apr  9 00:54:39 2025
From: allison.henderson at oracle.com (Allison Henderson)
Date: Wed, 9 Apr 2025 00:54:39 +0000
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine
 if device is ODP capable
In-Reply-To: <20250408191138.GF199604@unreal>
References: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
 <20250408122338.GA1778492@nvidia.com> <20250408123413.GA199604@unreal>
 <20250408123814.GC1778492@nvidia.com> <20250408191138.GF199604@unreal>
Message-ID: <94c8e113c11ec18c5e9330d7f2175a4469518e44.camel@oracle.com>

On Tue, 2025-04-08 at 22:11 +0300, Leon Romanovsky wrote:
> On Tue, Apr 08, 2025 at 09:38:14AM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote:
> > > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote:
> > > > > diff --git a/net/rds/ib.c b/net/rds/ib.c
> > > > > index 9826fe7f9d00..c62aa2ff4963 100644
> > > > > --- a/net/rds/ib.c
> > > > > +++ b/net/rds/ib.c
> > > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
> > > > >  	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
> > > > >  	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
> > > > >  
> > > > > -	rds_ibdev->odp_capable =
> > > > > -		!!(device->attrs.kernel_cap_flags &
> > > > > -		   IBK_ON_DEMAND_PAGING) &&
> > > > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > > > -		   IB_ODP_SUPPORT_WRITE) &&
> > > > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > > > -		   IB_ODP_SUPPORT_READ);
> > > > 
> > > > This patch seems to drop the check for WRITE and READ support on the
> > > > ODP.
> > > 
> > > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP
> > > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ.
> > 
> > Where? mlx5 reads this from FW and I don't see anything blocking
> > IBK_ON_DEMAND_PAGING if the FW is weird.
> 
> As the one who added it, I can assure you that we added these checks not
> because of weird FW, but because these caps existed.
Hi Leon,

Thanks for the patch.  Is there a commit id for the FW checks we can see?  Maybe we can just add a little more detail to
the commit description to make clear where they are and what they're checking for.  Thank you!

Allison

> 
> RDS calls to ib_reg_user_mr() with the following access_flags.
> 
>   564                 int access_flags =
>   565                         (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
>   566                          IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC |
>   567                          IB_ACCESS_ON_DEMAND);
>   <...>
>   575
>   576                 ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr,
>   577                                        access_flags);
> 
> If for some reason ODP doesn't support WRITE and/or READ, ib_reg_user_mr() will return an error from FW,
> 
> Thanks
> 
> 
> > 
> > Jason


From leon at kernel.org  Thu Apr 10 11:35:05 2025
From: leon at kernel.org (Leon Romanovsky)
Date: Thu, 10 Apr 2025 14:35:05 +0300
Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine
 if device is ODP capable
In-Reply-To: <94c8e113c11ec18c5e9330d7f2175a4469518e44.camel@oracle.com>
References: <bfc8ffb7ea207ed90c777a4f61a8afe1badef212.1744109826.git.leonro@nvidia.com>
 <20250408122338.GA1778492@nvidia.com>
 <20250408123413.GA199604@unreal>
 <20250408123814.GC1778492@nvidia.com>
 <20250408191138.GF199604@unreal>
 <94c8e113c11ec18c5e9330d7f2175a4469518e44.camel@oracle.com>
Message-ID: <20250410113505.GQ199604@unreal>

On Wed, Apr 09, 2025 at 12:54:39AM +0000, Allison Henderson wrote:
> On Tue, 2025-04-08 at 22:11 +0300, Leon Romanovsky wrote:
> > On Tue, Apr 08, 2025 at 09:38:14AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote:
> > > > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote:
> > > > > > diff --git a/net/rds/ib.c b/net/rds/ib.c
> > > > > > index 9826fe7f9d00..c62aa2ff4963 100644
> > > > > > --- a/net/rds/ib.c
> > > > > > +++ b/net/rds/ib.c
> > > > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device)
> > > > > >  	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
> > > > > >  	rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE);
> > > > > >  
> > > > > > -	rds_ibdev->odp_capable =
> > > > > > -		!!(device->attrs.kernel_cap_flags &
> > > > > > -		   IBK_ON_DEMAND_PAGING) &&
> > > > > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > > > > -		   IB_ODP_SUPPORT_WRITE) &&
> > > > > > -		!!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps &
> > > > > > -		   IB_ODP_SUPPORT_READ);
> > > > > 
> > > > > This patch seems to drop the check for WRITE and READ support on the
> > > > > ODP.
> > > > 
> > > > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP
> > > > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ.
> > > 
> > > Where? mlx5 reads this from FW and I don't see anything blocking
> > > IBK_ON_DEMAND_PAGING if the FW is weird.
> > 
> > As the one who added it, I can assure you that we added these checks not
> > because of weird FW, but because these caps existed.
> Hi Leon,
> 
> Thanks for the patch.  Is there a commit id for the FW checks we can see?

It is part of the FW checks on the provided access_flags. In this case,
you are asking for IB_ACCESS_REMOTE_READ and IB_ACCESS_ON_DEMAND.

The IB_ODP_SUPPORT_READ check is only needed when you have to find out
which transport actually supports it.

The thing is that ODP was always supported for RC QPs, from day one.
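
To make the difference under discussion concrete, here is a minimal,
self-contained sketch. It is not kernel code: the sketch_* names and
the bit values are hypothetical stand-ins for IBK_ON_DEMAND_PAGING,
IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ, and the "device-level"
predicate is only an assumption about what a check without the per-RC
caps would look like.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. The real flags are
 * defined by the kernel's RDMA headers and may differ. */
#define SKETCH_IBK_ON_DEMAND_PAGING	(1u << 0)
#define SKETCH_ODP_SUPPORT_WRITE	(1u << 1)
#define SKETCH_ODP_SUPPORT_READ		(1u << 2)

/* Hypothetical stand-in for the device attributes used by the check. */
struct sketch_dev_attrs {
	uint32_t kernel_cap_flags;	/* device-level capability bits */
	uint32_t rc_odp_caps;		/* ODP capability bits for RC QPs */
};

/* Mirrors the shape of the removed check: the device-level flag plus
 * both the WRITE and READ caps for RC must be set. */
bool sketch_per_cap_check(const struct sketch_dev_attrs *attrs)
{
	return (attrs->kernel_cap_flags & SKETCH_IBK_ON_DEMAND_PAGING) &&
	       (attrs->rc_odp_caps & SKETCH_ODP_SUPPORT_WRITE) &&
	       (attrs->rc_odp_caps & SKETCH_ODP_SUPPORT_READ);
}

/* Assumed simplified check: only the device-level flag is consulted. */
bool sketch_device_level_check(const struct sketch_dev_attrs *attrs)
{
	return attrs->kernel_cap_flags & SKETCH_IBK_ON_DEMAND_PAGING;
}
```

The two predicates disagree only for a device that sets the
device-level flag without both RC caps. For such a device the
registration itself is expected to fail, since ib_reg_user_mr() is
called with IB_ACCESS_REMOTE_READ and IB_ACCESS_ON_DEMAND.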

> Maybe we can just add a little more detail to
> the commit description to make clear where they are and what they're checking for.  Thank you!

Sure, will update it.

Thanks

> 
> Allison
> 
> > 
> > RDS calls ib_reg_user_mr() with the following access_flags.
> > 
> >     int access_flags =
> >             (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
> >              IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC |
> >              IB_ACCESS_ON_DEMAND);
> >     <...>
> >
> >     ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr,
> >                            access_flags);
> > 
> > If for some reason ODP doesn't support WRITE and/or READ, ib_reg_user_mr() will return an error from FW.
> > 
> > Thanks
> > 
> > 
> > > 
> > > Jason
>