Skip to content

Conversation

@michaelgugino
Copy link
Contributor

@michaelgugino michaelgugino commented Jul 15, 2021

This commit adds tcpretrans port to CO-RE. This port aims to replicate
the output of existing BCC/tools tcpretrans.

Signed-off-by: Michael Gugino [email protected]

Comment on lines 11 to 18
union {
__u32 saddr_v4;
__u8 saddr_v6[16];
};
union {
__u32 daddr_v4;
__u8 daddr_v6[16];
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we use same struct for v4/v6 event, this can be a single field like __u8 saddr[16].
In user space, inet_ntop can handle this correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and attempted this, seems to work locally, let me know what you think.

Comment on lines 19 to 22
__u32 af; // AF_INET or AF_INET6
__u32 pid;
__u16 dport;
__u16 sport;
__u64 type;
int state;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider reorder these field to avoid padding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean, other event structs like tcpconnect do not pad elements.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should make the struct more compact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, checkout bindsnoop.h, no padding for structs there either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He means that compiler generates extra 4 bytes of padding between sport and type fields, because type has to be 8-byte aligned. Move type to the top of the struct to avoid it.

"NEW_SYN_RECV"};


static volatile sig_atomic_t hang_on = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing tools use exiting for this purpose. I think it's good to follow that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syscount.c and tcpconnect. also use the 'hang_on' pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, let's convert all of them to consistent "exiting" then?

Comment on lines +23 to +38
struct ipv4_flow_key {
__u32 saddr;
__u32 daddr;
__u16 dport;
__u16 sport;
};

struct ipv6_flow_key {
__u8 saddr[16];
__u8 daddr[16];
__u16 dport;
__u16 sport;
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider merge this two struct and adding a type field to distinguish v4/v6.
This way we can save a map in BPF side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this but I would prefer not to. The existing tcpretrans from BCC prints all ipv4 then ipv6, having two maps makes that a little easier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still suggest that using one map and sort the result in userspace.
This can save a map in BPF side for efficiency and merge two print functions into one in userspace.
WDYT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are maps really expensive? We would need to double the ipv6 map's entries, and that would reserve more memory (is memory reserved for map entries?), my math says it will increase total reservation by 458752 bytes.

In any case, I think it's helpful to have the code structure similar to tcpconnect.c and tcpretrans.py. Unless there is a compelling reason to change the structs, I propose we keep them the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, those are different (structurally) keys, so it makes sense to keep two maps. Given they are not pre-allocated, it shouldn't waste much.

#ifndef __TCPRETRANS_H
#define __TCPRETRANS_H

#define MAX_ENTRIES 8192
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be defined in .bpf.c.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_ENTRIES is also used in tcpretrans.c

Comment on lines +28 to +48
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, struct ipv4_flow_key);
__type(value, u64);
__uint(map_flags, BPF_F_NO_PREALLOC);
} ipv4_count SEC(".maps");

struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, struct ipv6_flow_key);
__type(value, u64);
__uint(map_flags, BPF_F_NO_PREALLOC);
} ipv6_count SEC(".maps");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can save a map here.

Comment on lines 102 to 113
pid_tgid = bpf_get_current_pid_tgid();
pid = pid_tgid >> 32;
e.pid = pid;

BPF_CORE_READ_INTO(&dport, skp, __sk_common.skc_dport);
e.dport = dport;
BPF_CORE_READ_INTO(&sport, skp, __sk_common.skc_num);
e.sport = sport;
state = BPF_CORE_READ(skp, __sk_common.skc_state);
e.state = state;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though no big deal, I think you can save many of these local variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.state complains about type casting, I'll remove the port variables, they are unneeded.

@michaelgugino michaelgugino force-pushed the libbpf-tcpretrans branch 3 times, most recently from 9dd85d2 to a18a3bc Compare July 22, 2021 12:56
#define AF_INET 2
#define AF_INET6 10

const volatile bool do_count = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is bool, use false

Comment on lines 19 to 22
__u32 af; // AF_INET or AF_INET6
__u32 pid;
__u16 dport;
__u16 sport;
__u64 type;
int state;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He means that compiler generates extra 4 bytes of padding between sport and type fields, because type has to be 8-byte aligned. Move type to the top of the struct to avoid it.

Comment on lines +23 to +38
struct ipv4_flow_key {
__u32 saddr;
__u32 daddr;
__u16 dport;
__u16 sport;
};

struct ipv6_flow_key {
__u8 saddr[16];
__u8 daddr[16];
__u16 dport;
__u16 sport;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, those are different (structurally) keys, so it makes sense to keep two maps. Given they are not pre-allocated, it shouldn't waste much.


static void count_v4(const struct sock *skp)
{
struct ipv4_flow_key key = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are initializing this struct completely below, so there is no need to additionally zero-initialize it here, you can drop = {} part

static __u64 zero;
__u64 *val;

BPF_CORE_READ_INTO(&key.saddr, skp, __sk_common.skc_rcv_saddr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a particular reason to prefer _INTO variants of BPF_CORE_READ?

I think that key.saddr = BPF_CORE_READ(skp, __sk_common.skc_rcv_saddr); has much better readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

READ_INTO works like memcpy, can't assign with READ: array type '__u8 [16]' is not assignable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, makes sense

local,
e->type == RETRANSMIT ? "R" : "L",
remote,
TCPSTATE[e->state - 1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if e->state somehow becomes 0 (e.g., failed read or something), you'll be reading way outside of array. Maybe let's add entry 0 in TCPSTATE like ""? And then also check that e->state < ARRAY_SIZE(TCPSTATE)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see 0 becoming a possibility (though, it seems very unlikely as these are retransmits, thus should have a TCP state from the kernel), but I can't imagine the kernel gives us a garbage value that is larger than the array.

tcpretrans.py also seems to make this assumption (regarding larger number).

Comment on lines 297 to 298
// bpf will load non-existant trace points but fail at the attach stage, so
// check to ensure our tp exists before we load it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't use C++-style comments

Comment on lines 303 to 329
warn("tcp_retransmit_skb tracepoint not found, falling back to kprobe");
prog = bpf_object__find_program_by_name(obj->obj, "tracepoint__tcp__tcp_retransmit_skb");
err = bpf_program__set_autoload(prog, false);
if (err) {
warn("Unable to set autoload for tcp_retransmit_skb\n");
return err;
}
} else {
prog = bpf_object__find_program_by_name(obj->obj, "tcp_retransmit_skb");
err = bpf_program__set_autoload(prog, false);
if (err) {
warn("Unable to set autoload for tcp_send_loss_probe\n");
return err;
}
}

if (!env.lossprobe) {
prog = bpf_object__find_program_by_name(obj->obj, "tcp_send_loss_probe");
err = bpf_program__set_autoload(prog, false);
if (err) {
warn("Unable to set autoload for tcp_send_loss_probe\n");
return err;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have BPF skeleton, why are you using generic APIs to look up by string names?...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a variety of things and ended up with this. I now see other examples using bpf_program__set_autoload but I hadn't come across them previously.

}

SEC("tp/tcp/tcp_retransmit_skb")
int tracepoint__tcp__tcp_retransmit_skb(struct trace_event_raw_tcp_event_sk_skb* ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is super long and inconvenient name, it will also get truncated to generic "tracepoint_" or something like that, if you dump maps with bpftool. We still have few original tools using such BCC-style long names (where it's actually part of defining which tracepoint to connect too), but generally we've moved to shorter names, like just "tcp_retransmit_skb" would be totally appropriate in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I didn't have much to go off here, there are no other tp examples in this directory.

tcp_retransmit_skb is used for the kprobe, how about just tp_tcp_retransmit_skb?

@anakryiko
Copy link
Contributor

btw, seems like your PR description is out of date, you do have kprobe fallback, no?

Copy link
Contributor

@anakryiko anakryiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github doesn't allow me to reply to your comment...

but I can't imagine the kernel gives us a garbage value that is larger than the array.

It's highly unlikely, but if kernel ever adds another enum and someone runs old version of this tool, they'll have value bigger than they expected. It's probably unlikely to happen, so I'm ok if we don't touch it right now, but it's something to always keep in mind with tracing: you can never be 100% sure that value you are getting is in the expected range.

The rest looks good to me.

static __u64 zero;
__u64 *val;

BPF_CORE_READ_INTO(&key.saddr, skp, __sk_common.skc_rcv_saddr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, makes sense

This commit adds tcpretrans port to CO-RE. This port aims to replicate
the output of existing BCC/tools tcpretrans.

Signed-off-by: Michael Gugino <[email protected]>
@michaelgugino
Copy link
Contributor Author

@yonghong-song @brendangregg PTAL.

Co-authored-by: Mauricio Vásquez <[email protected]>
@seizethedave
Copy link
Contributor

Something we can do to get this tool merged? It seems to be old enough now that it no longer builds.

@chenhengqi
Copy link
Collaborator

ping @michaelgugino

@michaelgugino
Copy link
Contributor Author

@chenhengqi I will try to make some updates to the patch later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants