libbpf-tools: add CO-RE tcpretrans #3539

michaelgugino · 2021-07-15T18:07:13Z

This commit adds a CO-RE port of tcpretrans. The port aims to replicate
the output of the existing BCC/tools tcpretrans.

Signed-off-by: Michael Gugino [email protected]

chenhengqi · 2021-07-16T01:00:03Z

libbpf-tools/tcpretrans.h

+	union {
+		__u32 saddr_v4;
+		__u8 saddr_v6[16];
+	};
+	union {
+		__u32 daddr_v4;
+		__u8 daddr_v6[16];
+	};


Since we use the same struct for v4 and v6 events, this can be a single field like __u8 saddr[16].
In user space, inet_ntop can handle this correctly.

I went ahead and tried this; it seems to work locally. Let me know what you think.
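
For illustration, a minimal user-space sketch of the single-buffer idea. The struct and helper names (addr_event, set_saddr, format_saddr) are hypothetical and not taken from the patch; the only library call relied on is POSIX inet_ntop, which reads 4 bytes for AF_INET and 16 bytes for AF_INET6.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical event layout: one 16-byte buffer serves both families. */
struct addr_event {
	uint32_t af;        /* AF_INET or AF_INET6 */
	uint8_t saddr[16];  /* IPv4 uses only the first 4 bytes */
};

/* Store an address of the given family; unused bytes are zeroed. */
static void set_saddr(struct addr_event *e, uint32_t af, const void *addr)
{
	memset(e->saddr, 0, sizeof(e->saddr));
	memcpy(e->saddr, addr, af == AF_INET ? 4 : 16);
	e->af = af;
}

/*
 * Format the stored address into buf. inet_ntop() picks the number of
 * bytes to read from the address family, so no union is needed.
 * Returns buf on success, NULL on failure.
 */
static const char *format_saddr(const struct addr_event *e, char *buf,
				socklen_t len)
{
	return inet_ntop(e->af, e->saddr, buf, len);
}
```

A caller would pass a buffer of at least INET6_ADDRSTRLEN bytes so that either family fits.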

chenhengqi · 2021-07-16T01:00:43Z

libbpf-tools/tcpretrans.h

+	__u32 af; // AF_INET or AF_INET6
+	__u32 pid;
+	__u16 dport;
+	__u16 sport;
+	__u64 type;
+	int state;


Consider reordering these fields to avoid padding.

I'm not quite sure what you mean; other event structs like tcpconnect do not pad elements.

This should make the struct more compact.

I don't understand. Check out bindsnoop.h; there is no padding in the structs there either.

He means that the compiler generates an extra 4 bytes of padding between the sport and type fields, because type has to be 8-byte aligned. Move type to the top of the struct to avoid it.
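
The effect can be checked with a small sketch that contains only the fields shown in the hunk above. The struct names are hypothetical, and stdint types stand in for the kernel __u32/__u64 typedefs. On ABIs where a 64-bit integer is 8-byte aligned (x86-64, arm64, and the BPF target), the original order should come out at 32 bytes and the reordered one at 24.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order as in the hunk under review. */
struct event_padded {
	uint32_t af;
	uint32_t pid;
	uint16_t dport;
	uint16_t sport;
	/* 4 bytes of padding here where uint64_t is 8-byte aligned */
	uint64_t type;
	int state;
	/* 4 bytes of tail padding so array elements stay aligned */
};

/* Same fields with the widest member first. */
struct event_compact {
	uint64_t type;
	uint32_t af;
	uint32_t pid;
	uint16_t dport;
	uint16_t sport;
	int state;
};

/* Holds on any ABI: the reordered layout is never larger. */
static_assert(sizeof(struct event_compact) <= sizeof(struct event_padded),
	      "widest-first ordering should not grow the struct");
```

Printing sizeof and offsetof for both structs is a quick way to confirm the layout on a given compiler; pahole reports the same holes from debug info.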

chenhengqi · 2021-07-16T01:13:31Z

libbpf-tools/tcpretrans.c

+	"NEW_SYN_RECV"};
+
+
+static volatile sig_atomic_t hang_on = 1;


Existing tools use exiting for this purpose. I think it's good to follow that.

syscount.c and tcpconnect.c also use the 'hang_on' pattern.

Right, let's convert all of them to a consistent "exiting" then?
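
A minimal sketch of the flag-plus-handler pattern being discussed, using the "exiting" name. The handler and installer names (sig_int, install_exit_handler) are assumptions for illustration, not code from the patch.

```c
#include <signal.h>

/* Set by the signal handler, polled by the main loop. */
static volatile sig_atomic_t exiting = 0;

static void sig_int(int signo)
{
	(void)signo;
	exiting = 1;
}

/* Install the SIGINT handler; returns 0 on success, -1 on failure. */
static int install_exit_handler(void)
{
	return signal(SIGINT, sig_int) == SIG_ERR ? -1 : 0;
}
```

The main loop then reads as "while (!exiting) { poll for events }", which is arguably easier to follow than the inverted "while (hang_on)" form.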

chenhengqi · 2021-07-22T05:35:07Z

libbpf-tools/tcpretrans.h

+struct ipv4_flow_key {
+	__u32 saddr;
+	__u32 daddr;
+	__u16 dport;
+	__u16 sport;
+};
+
+struct ipv6_flow_key {
+	__u8 saddr[16];
+	__u8 daddr[16];
+	__u16 dport;
+	__u16 sport;
+};


Consider merging these two structs and adding a type field to distinguish v4/v6.
This way we can save a map on the BPF side.

I considered this but I would prefer not to. The existing tcpretrans from BCC prints all ipv4 then ipv6, having two maps makes that a little easier.

I would still suggest using one map and sorting the results in userspace.
This saves a map on the BPF side for efficiency and merges the two print functions into one in userspace.
WDYT?

Are maps really expensive? We would need to double the ipv6 map's entries, and that would reserve more memory (is memory reserved for map entries?), my math says it will increase total reservation by 458752 bytes.

In any case, I think it's helpful to have the code structure similar to tcpconnect.c and tcpretrans.py. Unless there is a compelling reason to change the structs, I propose we keep them the same.

I agree, those are different (structurally) keys, so it makes sense to keep two maps. Given they are not pre-allocated, it shouldn't waste much.

chenhengqi · 2021-07-22T05:35:42Z

libbpf-tools/tcpretrans.h

+#ifndef __TCPRETRANS_H
+#define __TCPRETRANS_H
+
+#define MAX_ENTRIES 8192


This could be defined in .bpf.c.

MAX_ENTRIES is also used in tcpretrans.c

chenhengqi · 2021-07-22T05:37:27Z

libbpf-tools/tcpretrans.bpf.c

+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, struct ipv4_flow_key);
+	__type(value, u64);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} ipv4_count SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, struct ipv6_flow_key);
+	__type(value, u64);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} ipv6_count SEC(".maps");


We can save a map here.

chenhengqi · 2021-07-22T05:46:54Z

libbpf-tools/tcpretrans.bpf.c

+	pid_tgid = bpf_get_current_pid_tgid();
+	pid = pid_tgid >> 32;
+	e.pid = pid;
+
+	BPF_CORE_READ_INTO(&dport, skp, __sk_common.skc_dport);
+	e.dport = dport;
+	BPF_CORE_READ_INTO(&sport, skp, __sk_common.skc_num);
+	e.sport = sport;
+	state = BPF_CORE_READ(skp, __sk_common.skc_state);
+	e.state = state;


It's no big deal, but I think you can drop many of these local variables.

e.state complains about type casting. I'll remove the port variables; they are unneeded.

anakryiko · 2021-08-13T23:47:05Z

libbpf-tools/tcpretrans.bpf.c

+#define AF_INET		2
+#define AF_INET6	10
+
+const volatile bool do_count = 0;


if this is bool, use false

anakryiko · 2021-08-13T23:51:07Z

libbpf-tools/tcpretrans.bpf.c

+
+static void count_v4(const struct sock *skp)
+{
+	struct ipv4_flow_key key = {};


you are initializing this struct completely below, so there is no need to additionally zero-initialize it here, you can drop = {} part

anakryiko · 2021-08-13T23:51:55Z

libbpf-tools/tcpretrans.bpf.c

+	static __u64 zero;
+	__u64 *val;
+
+	BPF_CORE_READ_INTO(&key.saddr, skp, __sk_common.skc_rcv_saddr);


do you have a particular reason to prefer _INTO variants of BPF_CORE_READ?

I think that key.saddr = BPF_CORE_READ(skp, __sk_common.skc_rcv_saddr); has much better readability

READ_INTO works like memcpy, can't assign with READ: array type '__u8 [16]' is not assignable

ah, makes sense
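
The constraint is a plain C rule rather than anything BPF-specific: a scalar field can take the result of an expression, but an array field has to be filled by copying bytes. A user-space sketch with hypothetical struct names (no BPF helpers involved, memcpy standing in for the _INTO form):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel socket and the two flow keys. */
struct fake_sock {
	uint32_t rcv_saddr;        /* scalar address field */
	uint8_t v6_rcv_saddr[16];  /* array address field */
};

struct v4_key { uint32_t saddr; };
struct v6_key { uint8_t saddr[16]; };

static void fill_keys(const struct fake_sock *sk, struct v4_key *k4,
		      struct v6_key *k6)
{
	/* Scalar destination: plain assignment works. */
	k4->saddr = sk->rcv_saddr;

	/*
	 * Array destination: "k6->saddr = ..." does not compile because
	 * arrays are not assignable, so the bytes are copied instead.
	 */
	memcpy(k6->saddr, sk->v6_rcv_saddr, sizeof(k6->saddr));
}
```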

anakryiko · 2021-08-14T00:11:59Z

libbpf-tools/tcpretrans.c

+		   local,
+		   e->type == RETRANSMIT ? "R" : "L",
+		   remote,
+		   TCPSTATE[e->state - 1]);


if e->state somehow becomes 0 (e.g., failed read or something), you'll be reading way outside of array. Maybe let's add entry 0 in TCPSTATE like ""? And then also check that e->state < ARRAY_SIZE(TCPSTATE)?

I can see 0 becoming a possibility (though, it seems very unlikely as these are retransmits, thus should have a TCP state from the kernel), but I can't imagine the kernel gives us a garbage value that is larger than the array.

tcpretrans.py also seems to make this assumption (regarding larger number).
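
A sketch of the suggested guard, assuming the table gains a placeholder at index 0 so the state value indexes it directly. The names tcp_state_names and tcp_state_name and the "UNKNOWN" fallback string are hypothetical; the state order follows the kernel's TCP state enum, ending at NEW_SYN_RECV as in the patch.

```c
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Index 0 is a placeholder; real states start at 1 (ESTABLISHED). */
static const char *const tcp_state_names[] = {
	"",
	"ESTABLISHED",
	"SYN_SENT",
	"SYN_RECV",
	"FIN_WAIT1",
	"FIN_WAIT2",
	"TIME_WAIT",
	"CLOSE",
	"CLOSE_WAIT",
	"LAST_ACK",
	"LISTEN",
	"CLOSING",
	"NEW_SYN_RECV",
};

/* Map a state value to a name, never indexing outside the table. */
static const char *tcp_state_name(int state)
{
	if (state < 0 || (size_t)state >= ARRAY_SIZE(tcp_state_names))
		return "UNKNOWN";
	return tcp_state_names[state];
}
```

With a helper like this, the print call would pass tcp_state_name(e->state) instead of indexing with e->state - 1, and a failed read or a state added by a newer kernel degrades to a harmless string.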

anakryiko · 2021-08-14T00:12:40Z

libbpf-tools/tcpretrans.c

+	// bpf will load non-existant trace points but fail at the attach stage, so
+	// check to ensure our tp exists before we load it.


please don't use C++-style comments

anakryiko · 2021-08-14T00:13:49Z

libbpf-tools/tcpretrans.c

+			warn("tcp_retransmit_skb tracepoint not found, falling back to kprobe");
+		prog = bpf_object__find_program_by_name(obj->obj, "tracepoint__tcp__tcp_retransmit_skb");
+		err = bpf_program__set_autoload(prog, false);
+		if (err) {
+			warn("Unable to set autoload for tcp_retransmit_skb\n");
+			return err;
+		}
+	} else {
+		prog = bpf_object__find_program_by_name(obj->obj, "tcp_retransmit_skb");
+		err = bpf_program__set_autoload(prog, false);
+		if (err) {
+			warn("Unable to set autoload for tcp_send_loss_probe\n");
+			return err;
+		}
+	}
+
+	if (!env.lossprobe) {
+		prog = bpf_object__find_program_by_name(obj->obj, "tcp_send_loss_probe");
+		err = bpf_program__set_autoload(prog, false);
+		if (err) {
+			warn("Unable to set autoload for tcp_send_loss_probe\n");
+			return err;
+		}


you have BPF skeleton, why are you using generic APIs to look up by string names?...

I tried a variety of things and ended up with this. I now see other examples using bpf_program__set_autoload but I hadn't come across them previously.

anakryiko · 2021-08-14T00:16:51Z

libbpf-tools/tcpretrans.bpf.c

+}
+
+SEC("tp/tcp/tcp_retransmit_skb")
+int tracepoint__tcp__tcp_retransmit_skb(struct trace_event_raw_tcp_event_sk_skb* ctx)


This is a super long and inconvenient name; it will also get truncated to a generic "tracepoint_" or something like that if you dump maps with bpftool. We still have a few original tools using such BCC-style long names (where the name is actually part of defining which tracepoint to connect to), but generally we've moved to shorter names. Just "tcp_retransmit_skb" would be totally appropriate in this case.

Sounds good. I didn't have much to go off here, there are no other tp examples in this directory.

tcp_retransmit_skb is used for the kprobe, how about just tp_tcp_retransmit_skb?

anakryiko · 2021-08-14T00:18:01Z

btw, seems like your PR description is out of date, you do have kprobe fallback, no?

anakryiko

GitHub doesn't allow me to reply to your comment...

"but I can't imagine the kernel gives us a garbage value that is larger than the array."

It's unlikely, but if the kernel ever adds another enum value and someone runs an old version of this tool, they'll get a value bigger than expected. I'm OK if we don't touch it right now, but it's something to always keep in mind with tracing: you can never be 100% sure that the value you are getting is in the expected range.

The rest looks good to me.

michaelgugino · 2021-09-13T14:03:54Z

@yonghong-song @brendangregg PTAL.

libbpf-tools/tcpretrans.c

Co-authored-by: Mauricio Vásquez <[email protected]>

seizethedave · 2025-07-25T01:17:04Z

Is there something we can do to get this tool merged? It seems to be old enough now that it no longer builds.

chenhengqi · 2025-07-28T09:39:54Z

ping @michaelgugino

michaelgugino · 2025-07-28T13:10:11Z

@chenhengqi I will try to make some updates to the patch later this week.

michaelgugino requested review from 4ast, brendangregg, drzaeus77, goldshtn and yonghong-song as code owners July 15, 2021 18:07

michaelgugino force-pushed the libbpf-tcpretrans branch from df363f8 to 351223b Compare July 15, 2021 18:13

michaelgugino force-pushed the libbpf-tcpretrans branch from b709dc4 to 175156d Compare July 21, 2021 13:34