-
Notifications
You must be signed in to change notification settings - Fork 4k
libbpf-tools: add CO-RE tcpretrans #3539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
libbpf-tools: add CO-RE tcpretrans #3539
Conversation
df363f8 to
351223b
Compare
libbpf-tools/tcpretrans.h
Outdated
| union { | ||
| __u32 saddr_v4; | ||
| __u8 saddr_v6[16]; | ||
| }; | ||
| union { | ||
| __u32 daddr_v4; | ||
| __u8 daddr_v6[16]; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we use same struct for v4/v6 event, this can be a single field like __u8 saddr[16].
In user space, inet_ntop can handle this correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and attempted this, seems to work locally, let me know what you think.
libbpf-tools/tcpretrans.h
Outdated
| __u32 af; // AF_INET or AF_INET6 | ||
| __u32 pid; | ||
| __u16 dport; | ||
| __u16 sport; | ||
| __u64 type; | ||
| int state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider reorder these field to avoid padding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what you mean, other event structs like tcpconnect do not pad elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should make the struct more compact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, checkout bindsnoop.h, no padding for structs there either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He means that compiler generates extra 4 bytes of padding between sport and type fields, because type has to be 8-byte aligned. Move type to the top of the struct to avoid it.
libbpf-tools/tcpretrans.c
Outdated
| "NEW_SYN_RECV"}; | ||
|
|
||
|
|
||
| static volatile sig_atomic_t hang_on = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing tools use exiting for this purpose. I think it's good to follow that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syscount.c and tcpconnect. also use the 'hang_on' pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, let's convert all of them to consistent "exiting" then?
b709dc4 to
175156d
Compare
| struct ipv4_flow_key { | ||
| __u32 saddr; | ||
| __u32 daddr; | ||
| __u16 dport; | ||
| __u16 sport; | ||
| }; | ||
|
|
||
| struct ipv6_flow_key { | ||
| __u8 saddr[16]; | ||
| __u8 daddr[16]; | ||
| __u16 dport; | ||
| __u16 sport; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider merge this two struct and adding a type field to distinguish v4/v6.
This way we can save a map in BPF side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this but I would prefer not to. The existing tcpretrans from BCC prints all ipv4 then ipv6, having two maps makes that a little easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still suggest that using one map and sort the result in userspace.
This can save a map in BPF side for efficiency and merge two print functions into one in userspace.
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are maps really expensive? We would need to double the ipv6 map's entries, and that would reserve more memory (is memory reserved for map entries?), my math says it will increase total reservation by 458752 bytes.
In any case, I think it's helpful to have the code structure similar to tcpconnect.c and tcpretrans.py. Unless there is a compelling reason to change the structs, I propose we keep them the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, those are different (structurally) keys, so it makes sense to keep two maps. Given they are not pre-allocated, it shouldn't waste much.
| #ifndef __TCPRETRANS_H | ||
| #define __TCPRETRANS_H | ||
|
|
||
| #define MAX_ENTRIES 8192 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be defined in .bpf.c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX_ENTRIES is also used in tcpretrans.c
| struct { | ||
| __uint(type, BPF_MAP_TYPE_HASH); | ||
| __uint(max_entries, MAX_ENTRIES); | ||
| __type(key, struct ipv4_flow_key); | ||
| __type(value, u64); | ||
| __uint(map_flags, BPF_F_NO_PREALLOC); | ||
| } ipv4_count SEC(".maps"); | ||
|
|
||
| struct { | ||
| __uint(type, BPF_MAP_TYPE_HASH); | ||
| __uint(max_entries, MAX_ENTRIES); | ||
| __type(key, struct ipv6_flow_key); | ||
| __type(value, u64); | ||
| __uint(map_flags, BPF_F_NO_PREALLOC); | ||
| } ipv6_count SEC(".maps"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can save a map here.
libbpf-tools/tcpretrans.bpf.c
Outdated
| pid_tgid = bpf_get_current_pid_tgid(); | ||
| pid = pid_tgid >> 32; | ||
| e.pid = pid; | ||
|
|
||
| BPF_CORE_READ_INTO(&dport, skp, __sk_common.skc_dport); | ||
| e.dport = dport; | ||
| BPF_CORE_READ_INTO(&sport, skp, __sk_common.skc_num); | ||
| e.sport = sport; | ||
| state = BPF_CORE_READ(skp, __sk_common.skc_state); | ||
| e.state = state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though no big deal, I think you can save many of these local variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.state complains about type casting, I'll remove the port variables, they are unneeded.
9dd85d2 to
a18a3bc
Compare
a18a3bc to
9f399be
Compare
9f399be to
63dfe4e
Compare
libbpf-tools/tcpretrans.bpf.c
Outdated
| #define AF_INET 2 | ||
| #define AF_INET6 10 | ||
|
|
||
| const volatile bool do_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is bool, use false
libbpf-tools/tcpretrans.h
Outdated
| __u32 af; // AF_INET or AF_INET6 | ||
| __u32 pid; | ||
| __u16 dport; | ||
| __u16 sport; | ||
| __u64 type; | ||
| int state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He means that compiler generates extra 4 bytes of padding between sport and type fields, because type has to be 8-byte aligned. Move type to the top of the struct to avoid it.
| struct ipv4_flow_key { | ||
| __u32 saddr; | ||
| __u32 daddr; | ||
| __u16 dport; | ||
| __u16 sport; | ||
| }; | ||
|
|
||
| struct ipv6_flow_key { | ||
| __u8 saddr[16]; | ||
| __u8 daddr[16]; | ||
| __u16 dport; | ||
| __u16 sport; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, those are different (structurally) keys, so it makes sense to keep two maps. Given they are not pre-allocated, it shouldn't waste much.
libbpf-tools/tcpretrans.bpf.c
Outdated
|
|
||
| static void count_v4(const struct sock *skp) | ||
| { | ||
| struct ipv4_flow_key key = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are initializing this struct completely below, so there is no need to additionally zero-initialize it here, you can drop = {} part
| static __u64 zero; | ||
| __u64 *val; | ||
|
|
||
| BPF_CORE_READ_INTO(&key.saddr, skp, __sk_common.skc_rcv_saddr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have a particular reason to prefer _INTO variants of BPF_CORE_READ?
I think that key.saddr = BPF_CORE_READ(skp, __sk_common.skc_rcv_saddr); has much better readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
READ_INTO works like memcpy, can't assign with READ: array type '__u8 [16]' is not assignable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, makes sense
libbpf-tools/tcpretrans.c
Outdated
| local, | ||
| e->type == RETRANSMIT ? "R" : "L", | ||
| remote, | ||
| TCPSTATE[e->state - 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if e->state somehow becomes 0 (e.g., failed read or something), you'll be reading way outside of array. Maybe let's add entry 0 in TCPSTATE like ""? And then also check that e->state < ARRAY_SIZE(TCPSTATE)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see 0 becoming a possibility (though, it seems very unlikely as these are retransmits, thus should have a TCP state from the kernel), but I can't imagine the kernel gives us a garbage value that is larger than the array.
tcpretrans.py also seems to make this assumption (regarding larger number).
libbpf-tools/tcpretrans.c
Outdated
| // bpf will load non-existant trace points but fail at the attach stage, so | ||
| // check to ensure our tp exists before we load it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't use C++-style comments
| warn("tcp_retransmit_skb tracepoint not found, falling back to kprobe"); | ||
| prog = bpf_object__find_program_by_name(obj->obj, "tracepoint__tcp__tcp_retransmit_skb"); | ||
| err = bpf_program__set_autoload(prog, false); | ||
| if (err) { | ||
| warn("Unable to set autoload for tcp_retransmit_skb\n"); | ||
| return err; | ||
| } | ||
| } else { | ||
| prog = bpf_object__find_program_by_name(obj->obj, "tcp_retransmit_skb"); | ||
| err = bpf_program__set_autoload(prog, false); | ||
| if (err) { | ||
| warn("Unable to set autoload for tcp_send_loss_probe\n"); | ||
| return err; | ||
| } | ||
| } | ||
|
|
||
| if (!env.lossprobe) { | ||
| prog = bpf_object__find_program_by_name(obj->obj, "tcp_send_loss_probe"); | ||
| err = bpf_program__set_autoload(prog, false); | ||
| if (err) { | ||
| warn("Unable to set autoload for tcp_send_loss_probe\n"); | ||
| return err; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have BPF skeleton, why are you using generic APIs to look up by string names?...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a variety of things and ended up with this. I now see other examples using bpf_program__set_autoload but I hadn't come across them previously.
libbpf-tools/tcpretrans.bpf.c
Outdated
| } | ||
|
|
||
| SEC("tp/tcp/tcp_retransmit_skb") | ||
| int tracepoint__tcp__tcp_retransmit_skb(struct trace_event_raw_tcp_event_sk_skb* ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is super long and inconvenient name, it will also get truncated to generic "tracepoint_" or something like that, if you dump maps with bpftool. We still have few original tools using such BCC-style long names (where it's actually part of defining which tracepoint to connect too), but generally we've moved to shorter names, like just "tcp_retransmit_skb" would be totally appropriate in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I didn't have much to go off here, there are no other tp examples in this directory.
tcp_retransmit_skb is used for the kprobe, how about just tp_tcp_retransmit_skb?
|
btw, seems like your PR description is out of date, you do have kprobe fallback, no? |
84fde47 to
8d82ba6
Compare
anakryiko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github doesn't allow me to reply to your comment...
but I can't imagine the kernel gives us a garbage value that is larger than the array.
It's highly unlikely, but if kernel ever adds another enum and someone runs old version of this tool, they'll have value bigger than they expected. It's probably unlikely to happen, so I'm ok if we don't touch it right now, but it's something to always keep in mind with tracing: you can never be 100% sure that value you are getting is in the expected range.
The rest looks good to me.
| static __u64 zero; | ||
| __u64 *val; | ||
|
|
||
| BPF_CORE_READ_INTO(&key.saddr, skp, __sk_common.skc_rcv_saddr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, makes sense
8d82ba6 to
87417b3
Compare
This commit adds tcpretrans port to CO-RE. This port aims to replicate the output of existing BCC/tools tcpretrans. Signed-off-by: Michael Gugino <[email protected]>
87417b3 to
bd43a2a
Compare
|
@yonghong-song @brendangregg PTAL. |
Co-authored-by: Mauricio Vásquez <[email protected]>
|
Something we can do to get this tool merged? It seems to be old enough now that it no longer builds. |
|
ping @michaelgugino |
|
@chenhengqi I will try to make some updates to the patch later this week. |
This commit adds tcpretrans port to CO-RE. This port aims to replicate
the output of existing BCC/tools tcpretrans.
Signed-off-by: Michael Gugino [email protected]