In Part 1 of this series, we walked through how a call to socket.Write() eventually results in a packet being transmitted: the write (TX) path.
In this part, we'll explore the other side, the read (RX) path: what happens when the system receives a packet. We'll trace a packet's journey from the moment it hits the NIC all the way up to the application layer.
Here's the high-level flow we'll cover:
NIC → DMA → Driver (NAPI) → sk_buff → L2 → ip_rcv → tcp_v4_rcv → Socket → Application
When a packet arrives at the NIC, the hardware performs some quick sanity checks before bothering the CPU:
Ethernet FCS (Frame Check Sequence)
Every Ethernet frame ends with a 32-bit CRC. The NIC recomputes this CRC over the received frame and compares it with the FCS field. If they don't match, the packet is silently dropped; no point wasting CPU cycles on corrupted data.
Frame Length Validation
The NIC verifies the frame isn't too short (minimum 64 bytes) or too long (typically 1518 bytes, or more for jumbo frames).
MAC Address Filtering
The NIC checks if the destination MAC address matches its own (or is a broadcast/multicast address it's listening to). Packets destined for other hosts are dropped.
Checksum Offloading (optional)
Modern NICs can verify IP/TCP/UDP checksums in hardware. When enabled, instead of making the kernel recalculate these checksums, the NIC either:
- marks the packet CHECKSUM_UNNECESSARY (telling the kernel "I've already verified this"), or
- hands the kernel the checksum it computed over the packet (CHECKSUM_COMPLETE), so the stack can finish verification without walking every byte again.

Once these checks pass, the NIC uses DMA to copy the packet data directly into system memory, specifically into buffers that the driver has pre-allocated and registered with the NIC.
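To make the offload concrete, here's a minimal sketch of how a driver might translate the hardware's verdict into the skb->ip_summed field the rest of the stack looks at. The hw_csum_ok flag stands in for a hardware-specific descriptor status bit, and the helper name is made up for illustration.
/* Sketch only: hw_csum_ok stands in for a hardware-specific descriptor
 * status bit; real drivers read their own register/descriptor layout. */
static void rx_set_checksum(struct sk_buff *skb, bool hw_csum_ok)
{
    if (hw_csum_ok)
        skb->ip_summed = CHECKSUM_UNNECESSARY; /* stack skips L4 checksum verification */
    else
        skb->ip_summed = CHECKSUM_NONE;        /* stack verifies the checksum in software */
}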
The classic approach to handling incoming packets would be: NIC receives packet → triggers interrupt → kernel handles packet. Simple, but terrible for performance. At 10 Gbps, you could be processing millions of packets per second. If each packet triggers an interrupt and context switch, the CPU spends more time handling interrupts than actually processing packets.
This is where NAPI (New API) comes in. It's a hybrid interrupt/polling mechanism that gives you the best of both worlds:
- The first packet to arrive raises an interrupt as usual.
- The interrupt handler disables further RX interrupts and schedules a NAPI poll.
- The napi_poll() function runs in softirq context, pulling packets from the NIC's buffer in batches (typically up to 64 packets at a time).
- When the queue is drained, the driver re-enables interrupts and waits for the next packet.

The key insight is that under high load, we stay in polling mode, processing packets as fast as they arrive without interrupt overhead. Under low load, we fall back to interrupt-driven mode to avoid wasting CPU cycles polling an empty queue.
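A rough sketch of that handoff, assuming a fictional my_nic driver (the my_nic_* helpers and my_nic_priv struct are placeholders, but napi_schedule() and napi_complete_done() are the real kernel primitives):
/* Hard IRQ handler: do almost nothing, then defer to NAPI. */
static irqreturn_t my_nic_irq(int irq, void *data)
{
    struct my_nic_priv *priv = data;

    my_nic_disable_rx_irq(priv);   /* placeholder: mask RX interrupts on the device */
    napi_schedule(&priv->napi);    /* ask the kernel to run our poll function in softirq */
    return IRQ_HANDLED;
}

/* NAPI poll: drain up to 'budget' packets per invocation. */
static int my_nic_poll(struct napi_struct *napi, int budget)
{
    struct my_nic_priv *priv = container_of(napi, struct my_nic_priv, napi);
    int work_done = my_nic_process_rx(priv, budget); /* placeholder for the RX loop shown below */

    if (work_done < budget) {
        /* Queue drained: leave polling mode and let interrupts wake us next time. */
        napi_complete_done(napi, work_done);
        my_nic_enable_rx_irq(priv);
    }
    /* If work_done == budget, we stay in polling mode and get called again. */
    return work_done;
}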
Before we look at the driver code, we need to understand where these packets actually end up.
The driver maintains a ring buffer (also called a descriptor ring) shared with the NIC. Think of it as a circular array where:
- each slot is a descriptor pointing at a buffer in memory,
- the driver fills slots with empty buffers and hands ownership to the NIC,
- the NIC writes incoming packets into those buffers and marks the descriptors as done,
- and the driver reclaims completed descriptors, then refills them with fresh buffers.
Here are the key data structures:
struct rx_desc {
dma_addr_t dma_addr; // physical address where NIC writes
u16 length; // bytes written by NIC
u16 status; // completion flags: DD (done), EOP (end of packet), errors
};
struct rx_buffer {
struct page *page; // backing memory page
u16 offset; // offset within the page
};
The driver pre-allocates memory pages and maps them for DMA, then fills the descriptor ring with pointers to these buffers. The NIC knows exactly where to write incoming packet data without any CPU involvement.
Now let's look at what happens when napi_poll() runs. This is where the driver reads completed descriptors and builds sk_buff structures for the kernel.
First, here's how the driver keeps the ring buffer topped up with fresh buffers:
#define RX_BUF_SIZE 2048
#define PAGE_SIZE 4096 /* normally provided by the kernel; defined here for the example */
void rx_refill(struct rx_ring *ring)
{
static struct page *cur_page = NULL;
static u16 cur_offset = 0;
while (ring->free_descs) {
u16 i = ring->next_to_use;
struct rx_buffer *buf = &ring->buf[i];
struct rx_desc *desc = &ring->desc[i];
/* Allocate a new page only when needed */
    if (!cur_page || cur_offset + RX_BUF_SIZE > PAGE_SIZE) {
        cur_page = alloc_page(GFP_ATOMIC);
        if (!cur_page)
            break; /* allocation failed: try again on the next refill */
        cur_offset = 0;
    }
/* Associate this descriptor with a slice of the page */
buf->page = cur_page;
buf->offset = cur_offset;
/* Map that slice for DMA */
desc->dma_addr = dma_map_page(dev, cur_page, cur_offset,
RX_BUF_SIZE, DMA_FROM_DEVICE);
/* Hand descriptor ownership to NIC */
desc->status = 0;
/* Advance to next slice in the page */
cur_offset += RX_BUF_SIZE;
ring->next_to_use = (i + 1) % ring->size;
ring->free_descs--;
}
}
Notice how we pack two 2KB buffers into a single 4KB page.
Here's the heart of packet reception:
int rx_napi_poll(struct napi_struct *napi, int budget)
{
struct rx_ring *ring = container_of(napi, struct rx_ring, napi);
int work_done = 0;
while (work_done < budget) {
struct sk_buff *skb = NULL;
int frag_idx = 0;
/* Inner loop: assemble one complete packet (may span multiple descriptors) */
while (1) {
u16 i = ring->next_to_clean;
struct rx_desc *desc = &ring->desc[i];
struct rx_buffer *buf = &ring->buf[i];
/* Check if NIC has finished writing to this descriptor */
if (!(desc->status & RX_DESC_DONE))
goto out;
/* Unmap the buffer - NIC is done with it */
dma_unmap_page(dev, desc->dma_addr, RX_BUF_SIZE, DMA_FROM_DEVICE);
if (!skb) {
/* First fragment: create the sk_buff */
skb = build_skb(page_address(buf->page), PAGE_SIZE);
skb_reserve(skb, buf->offset);
skb_put(skb, desc->length);
} else {
/* Additional fragment: add to existing sk_buff */
skb_add_rx_frag(skb, frag_idx++, buf->page,
buf->offset, desc->length, PAGE_SIZE);
}
ring->next_to_clean = (i + 1) % ring->size;
ring->free_descs++;
/* Is this the last fragment of the packet? */
if (desc->status & RX_DESC_EOP)
break;
}
/* L2 processing: extract Ethernet header info */
skb->protocol = eth_type_trans(skb, netdev);
/* Pass to GRO for potential coalescing, then up the stack */
napi_gro_receive(napi, skb);
work_done++;
}
out:
if (work_done)
rx_refill(ring);
return work_done;
}
A few things worth noting:
- Zero copy: the sk_buff points directly to the DMA buffer. We're just moving pointers around.
- A packet larger than one buffer spans multiple descriptors; the inner loop stitches those pieces into a single sk_buff.
- The RX_DESC_DONE flag tells us the NIC has finished writing. Until that flag is set, we can't touch the buffer (this prevents concurrent read/write).

Once the driver has an sk_buff with raw packet data, it calls eth_type_trans() to extract Ethernet header information:
__be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
{
struct ethhdr *eth = (struct ethhdr *)skb->data;
/* Record where the MAC header starts */
skb->mac_header = skb->data;
/* Network header starts right after Ethernet header */
skb->network_header = skb->data + sizeof(struct ethhdr);
/* Advance data pointer past the Ethernet header */
skb->data += sizeof(struct ethhdr);
/* Extract the protocol (IPv4, IPv6, ARP, etc.) */
skb->protocol = eth->h_proto;
/* Remember which device received this */
skb->dev = dev;
/* Classify packet type based on destination MAC */
if (is_multicast_ether_addr(eth->h_dest))
skb->pkt_type = PACKET_MULTICAST;
else if (is_broadcast_ether_addr(eth->h_dest))
skb->pkt_type = PACKET_BROADCAST;
else
skb->pkt_type = PACKET_HOST;
return skb->protocol;
}
After this function returns, skb->data points to the IP header (or whatever protocol is encapsulated), and skb->protocol tells us which protocol handler should process it.
Before passing packets up the stack, NAPI gives GRO (Generic Receive Offload) a chance to merge them:
void napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
struct list_head *gro_list = &napi->gro_list;
/* Try to merge with an existing packet in the GRO list */
for_each_entry(prev, gro_list) {
if (can_gro_merge(prev, skb)) {
gro_merge(prev, skb);
return;
}
}
/* Can't merge - add to GRO list for potential future merging */
list_add(&skb->list, gro_list);
/* Flush the list if it's full or we've waited long enough */
if (gro_should_flush(gro_list)) {
for_each_entry_safe(pkt, gro_list) {
netif_receive_skb(pkt);
}
list_init(gro_list);
}
}
Generic Receive Offload (GRO) is the receive-side counterpart to TSO (TCP Segmentation Offload). The idea is simple: if we're receiving a burst of TCP packets from the same flow, merge them into one large packet before handing it to the TCP stack. Processing one 64KB packet is much cheaper than processing forty-four 1.5KB packets.
GRO is very strict about what it will combine. Two packets can only merge if they're consecutive segments of the same logical stream:
bool can_gro_merge(struct sk_buff *prev, struct sk_buff *skb)
{
struct iphdr *iph1 = ip_hdr(prev);
struct iphdr *iph2 = ip_hdr(skb);
struct tcphdr *th1 = tcp_hdr(prev);
struct tcphdr *th2 = tcp_hdr(skb);
// Must be same protocol
if (prev->protocol != skb->protocol)
return false;
// IP headers must match (same flow)
if (iph1->saddr != iph2->saddr ||
iph1->daddr != iph2->daddr ||
iph1->protocol != iph2->protocol)
return false;
// For TCP: ports must match
if (th1->source != th2->source ||
th1->dest != th2->dest)
return false;
    // Sequence numbers must be consecutive: the new segment must start
    // exactly where the previous one's payload ends
    u32 prev_payload = prev->len - skb_transport_offset(prev) - tcp_hdrlen(prev);
    u32 prev_end = ntohl(th1->seq) + prev_payload;
if (ntohl(th2->seq) != prev_end)
return false;
// No special TCP flags (SYN, FIN, RST, URG)
if (th2->syn || th2->fin || th2->rst || th2->urg)
return false;
// ACK numbers should match (same direction of flow)
if (th1->ack_seq != th2->ack_seq)
return false;
// Window size shouldn't change mid-stream
if (th1->window != th2->window)
return false;
// Combined size can't exceed 64KB (or configured limit)
if (prev->len + skb->len > GRO_MAX_SIZE)
return false;
return true;
}
If any of these checks fail, the packets stay separate.
When two packets are combined, GRO doesn't actually merge the headers; it keeps only the first packet's header and appends the second packet's payload:
void gro_merge(struct sk_buff *prev, struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(prev);
struct tcphdr *th = tcp_hdr(prev);
// Append the new packet's data as a fragment
// (the header from skb is discarded)
skb_pull(skb, skb_transport_offset(skb) + tcp_hdrlen(skb));
skb_add_frag(prev, skb->data, skb->len);
// Update the first packet's length fields
prev->len += skb->len;
prev->data_len += skb->len;
// Update IP total length
iph->tot_len = htons(ntohs(iph->tot_len) + skb->len);
// TCP header stays the same (same seq, same flags)
// but the payload is now larger
// Recalculate checksums (or mark for later)
prev->ip_summed = CHECKSUM_PARTIAL;
// The merged packet now carries combined data
// Free the now-empty second sk_buff (header was stripped)
kfree_skb(skb);
}
So if you had two packets:
Packet 1: [Eth][IP len=1500][TCP seq=1000][1460 bytes data]
Packet 2: [Eth][IP len=1500][TCP seq=2460][1460 bytes data]
After GRO merge, you get:
Merged:   [Eth][IP len=2960][TCP seq=1000][2920 bytes data]
                    ↑                          ↑
              length updated          payloads concatenated
The second packet's headers are thrown away; they were redundant since seq/ack/ports were identical (or predictably sequential).
If packets can't be merged (different flows, different protocols, out-of-order, has special flags), they stay in the GRO list as separate entries:
// GRO list might look like this:
gro_list:
[0] TCP flow A (192.168.1.1:443 → 10.0.0.1:52000) - 4380 bytes (3 merged)
[1] TCP flow B (192.168.1.1:80 → 10.0.0.1:52001) - 1460 bytes (1 packet)
[2] UDP packet (192.168.1.1:53 → 10.0.0.1:41234) - 512 bytes
[3] TCP flow A (192.168.1.1:443 → 10.0.0.1:52000) - 1460 bytes (out of order, can't merge with [0])
When the GRO list is flushed (budget exhausted, list full, or timeout), each entry is passed separately to netif_receive_skb():
bool gro_should_flush(struct list_head *gro_list)
{
// Too many distinct flows in the list
if (gro_list->count >= MAX_GRO_SKBS) // typically 8
return true;
// Individual packet has been held too long
if (time_after(jiffies, oldest_entry->gro_time + GRO_FLUSH_TIMEOUT))
return true;
// NAPI poll is ending
if (napi_complete_called)
return true;
return false;
}
if (gro_should_flush(gro_list)) {
for_each_entry_safe(pkt, gro_list) {
// Each packet goes up the stack individually
netif_receive_skb(pkt);
}
list_init(gro_list);
}
So GRO never forces incompatible packets together; it just batches them and sends them up individually when it can't merge. The goal is to hold packets just long enough to catch their siblings, but not so long that we add noticeable latency. In practice, GRO adds microseconds of delay in exchange for dramatically reduced per-packet overhead.
netif_receive_skb() is the gateway into the kernel's protocol processing. It runs any configured ingress hooks and then hands the packet to the right protocol handler:
int netif_receive_skb(struct sk_buff *skb)
{
/* Run ingress traffic control if configured (tc filters, eBPF, etc.) */
if (skb->dev->ingress_qdisc) {
int result = tc_ingress_classify(skb);
if (result == TC_ACT_SHOT) {
kfree_skb(skb);
return NET_RX_DROP;
}
}
/* Validate checksum if the NIC didn't do it for us */
if (skb->ip_summed == CHECKSUM_NONE) {
if (!validate_checksum(skb)) {
kfree_skb(skb);
return NET_RX_DROP;
}
}
/* Dispatch to the appropriate protocol handler */
switch (skb->protocol) {
case ETH_P_IP:
return ip_rcv(skb);
case ETH_P_IPV6:
return ipv6_rcv(skb);
case ETH_P_ARP:
return arp_rcv(skb);
default:
kfree_skb(skb);
return NET_RX_DROP;
}
}
For our TCP/IP packet, this means calling ip_rcv().
Now we're in Layer 3 territory. ip_rcv() validates the IP header and figures out what to do with the packet:
int ip_rcv(struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
/* Basic sanity checks */
if (iph->version != 4)
goto drop;
if (iph->ihl < 5) /* Header length must be at least 20 bytes */
goto drop;
if (skb->len < ntohs(iph->tot_len))
goto drop;
/* Verify IP header checksum */
if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
goto drop;
/* Run through netfilter PREROUTING hooks (iptables, nftables, etc.) */
// For instance, network load balancers use netfilter hooks at the
// PREROUTING stage to perform DNAT.
// However, some high-performance load balancers (like those using
// DPDK, XDP, or eBPF) may bypass the standard netfilter path
// entirely and perform packet modifications at lower layers for
// better performance.
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, ip_rcv_finish);
drop:
kfree_skb(skb);
return NET_RX_DROP;
}
After netfilter processing, ip_rcv_finish() makes the routing decision:
int ip_rcv_finish(struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
/* Look up the route for this destination */
if (!skb_dst(skb)) {
if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev) < 0)
goto drop;
}
/* Follow the routing decision */
return skb_dst(skb)->input(skb); // calls ip_local_deliver or ip_forward
drop:
kfree_skb(skb);
return NET_RX_DROP;
}
What does ip_route_input() actually do?

This function is the kernel's routing lookup for incoming packets. It queries the FIB (Forwarding Information Base) and attaches a routing decision to the sk_buff:
int ip_route_input(skb, daddr, saddr, tos, dev)
{
// Is destination one of our local addresses?
if (inet_addr_is_local(daddr)) {
skb->dst->input = ip_local_deliver;
return 0;
}
// Is it broadcast or multicast?
if (inet_addr_is_broadcast(daddr, dev))
return setup_broadcast_route(skb);
if (ipv4_is_multicast(daddr))
return ip_route_input_mc(skb, ...);
// Look up in routing table (Forward Information Base)
fib_result = fib_lookup(daddr);
if (fib_result.type == RTN_UNICAST) {
skb->dst->input = ip_forward;
skb->dst->next_hop = fib_result.gateway;
return 0;
}
return -ENETUNREACH; // No route to host
}
After this lookup, the sk_buff carries a dst_entry that tells the stack:
- which input function to call next: ip_local_deliver (the packet is for us) or ip_forward (send it elsewhere),
- and, for forwarded packets, which next hop and output device to use.

This is also where policy routing kicks in: the lookup can consider source address, TOS bits, incoming interface, and firewall marks, not just the destination. A stripped-down view of the attached routing result is sketched below.
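For intuition, here's a simplified illustration of what that attached routing result carries. The real struct dst_entry is much larger and its callbacks have different signatures; the names below are deliberately not the kernel's.
/* Simplified illustration only: the key idea is the 'input' function
 * pointer chosen by the routing lookup. */
struct dst_entry_sketch {
    int (*input)(struct sk_buff *skb);   /* ip_local_deliver or ip_forward */
    struct net_device *dev;              /* egress device, if we're forwarding */
    __be32 next_hop;                     /* gateway address for forwarded packets */
};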
If the packet is a fragment of a larger datagram, ip_local_deliver() hands it to the IP fragment reassembly queue:
int ip_local_deliver(struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
/* Check if this is a fragment */
if (ip_is_fragment(iph)) {
skb = ip_defrag(skb);
if (!skb)
return 0; /* Fragment queued, waiting for more pieces */
}
/* Strip the IP header, advance to L4 payload */
skb_pull(skb, iph->ihl * 4);
skb->transport_header = skb->data;
/* Run through netfilter LOCAL_IN hooks */
return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, ip_local_deliver_finish);
}
The IP header has two fields that identify fragments:
struct iphdr {
// ... other fields ...
__be16 frag_off; // fragment offset + flags
__be16 id; // identification (same for all fragments of one datagram)
};
// The frag_off field packs both flags and offset:
// Bits 0-12: Fragment offset (in 8-byte units)
// Bit 13: MF (More Fragments) flag
// Bit 14: DF (Don't Fragment) flag
// Bit 15: Reserved
bool ip_is_fragment(struct iphdr *iph)
{
// It's a fragment if MF flag is set OR offset is non-zero
return (iph->frag_off & htons(IP_MF | IP_OFFSET)) != 0;
}
A packet is a fragment if:
- the MF (More Fragments) flag is set, or
- the fragment offset is non-zero.

The first fragment has offset=0 but MF=1. The last fragment has MF=0 but offset>0. Middle fragments have both MF=1 and offset>0.
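A small worked example of unpacking frag_off, using the kernel's IP_MF and IP_OFFSET masks; the describe_fragment() helper itself is just for illustration.
/* IP_MF (0x2000) selects the More Fragments bit; IP_OFFSET (0x1FFF)
 * selects the 13-bit offset, which is stored in units of 8 bytes. */
static void describe_fragment(const struct iphdr *iph)
{
    u16 raw    = ntohs(iph->frag_off);
    bool more  = raw & IP_MF;
    u32 offset = (raw & IP_OFFSET) * 8;   /* convert 8-byte units to bytes */

    /* First fragment:  offset == 0, more == true
     * Middle fragment: offset  > 0, more == true
     * Last fragment:   offset  > 0, more == false */
    pr_info("fragment at byte offset %u, more fragments: %d\n", offset, more);
}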
The kernel maintains a hash table of incomplete datagrams, keyed by (src_ip, dst_ip, protocol, identification):
struct ipq {
struct iphdr *iph; // copy of IP header from first fragment
struct sk_buff *fragments; // linked list of received fragments
int len; // total length so far
int meat; // bytes of actual data received
__u8 last_in; // have we seen the last fragment?
struct timer_list timer; // reassembly timeout (default: 30 seconds)
};
struct sk_buff *ip_defrag(struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
// Find or create reassembly queue for this datagram
struct ipq *qp = ip_find(iph->id, iph->saddr, iph->daddr, iph->protocol);
if (!qp) {
qp = ip_create_queue(iph);
start_timer(&qp->timer, IP_FRAG_TIMEOUT); // 30 seconds
}
// Insert this fragment in offset order
ip_frag_queue(qp, skb);
// Check if we have all fragments
if (qp->last_in) {
// All pieces received - reassemble into one sk_buff
struct sk_buff *complete = ip_frag_reasm(qp);
ip_destroy_queue(qp);
return complete;
}
// Still waiting for more fragments
return NULL;
}
Key points:
- Fragments are held in a hash table of reassembly queues, keyed by (src_ip, dst_ip, protocol, identification).
- A per-queue timer (30 seconds by default) discards incomplete datagrams so half-arrived fragments can't pile up forever.
- Only once every piece has arrived does ip_defrag() return a single, fully reassembled sk_buff; until then the caller gets NULL and the fragment stays queued.

Netfilter provides several hook points where packets can be inspected, modified, or dropped. The NF_INET_LOCAL_IN hook runs on packets destined for this machine, after routing but before transport layer processing:
                                 ┌─────────┐
                            ┌───▶│ FORWARD │───▶ (to another interface)
                            │    └─────────┘
                            │
   ┌────────────┐     ┌──────────┐     ┌──────────┐
──▶│ PREROUTING │────▶│ Routing  │────▶│ LOCAL_IN │───▶ (to TCP/UDP/ICMP)
   └────────────┘     │ Decision │     └──────────┘
                      └──────────┘
                            (packet is for us)
If any hook returns NF_DROP, the packet is discarded and never reaches TCP/UDP. If all hooks return NF_ACCEPT, processing continues to ip_local_deliver_finish().
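Conceptually, the hook traversal looks something like the sketch below. The hook_entry struct and the hooks list are invented for illustration (real kernels store hooks in per-hook-point arrays); NF_DROP and NF_ACCEPT are the real verdict values.
struct hook_entry {                         /* illustrative only */
    struct list_head list;
    unsigned int (*hook)(struct sk_buff *skb);
};

static int run_local_in_hooks(struct sk_buff *skb, struct list_head *hooks,
                              int (*okfn)(struct sk_buff *))
{
    struct hook_entry *entry;

    list_for_each_entry(entry, hooks, list) {
        if (entry->hook(skb) == NF_DROP) {
            kfree_skb(skb);                 /* a single NF_DROP verdict discards the packet */
            return NET_RX_DROP;
        }
        /* NF_ACCEPT: fall through to the next hook */
    }
    return okfn(skb);                       /* all hooks accepted: e.g. ip_local_deliver_finish */
}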
Finally, ip_local_deliver_finish() dispatches to the transport layer based on the protocol field:
int ip_local_deliver_finish(struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
switch (iph->protocol) {
case IPPROTO_TCP:
return tcp_v4_rcv(skb);
case IPPROTO_UDP:
return udp_rcv(skb);
case IPPROTO_ICMP:
return icmp_rcv(skb);
default:
kfree_skb(skb);
return 0;
}
}
For TCP packets, tcp_v4_rcv() is where the real complexity begins. TCP is a stateful protocol, so the first job is finding which connection this packet belongs to:
int tcp_v4_rcv(struct sk_buff *skb)
{
struct tcphdr *th = tcp_hdr(skb);
struct sock *sk;
/* Validate TCP header */
if (th->doff < 5) /* Header too short */
goto drop;
if (!tcp_checksum_valid(skb))
goto drop;
/* Look up the socket for this 4-tuple (src_ip, src_port, dst_ip, dst_port) */
sk = inet_lookup(skb, th->source, th->dest);
if (!sk)
goto no_socket;
/* Hand off to the appropriate handler based on socket state */
if (sk->sk_state == TCP_LISTEN)
return tcp_v4_do_rcv_listen(sk, skb); /* Incoming connection */
else
return tcp_v4_do_rcv(sk, skb); /* Established connection */
no_socket:
/* No matching socket - send RST if it's not already a RST */
if (!th->rst)
tcp_v4_send_reset(skb);
drop:
kfree_skb(skb);
return 0;
}
The socket lookup uses a hash table keyed by the 4-tuple (source IP, source port, destination IP, destination port). This is O(1) on average, which matters a lot when you're handling millions of packets per second.
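A sketch of that lookup, with the hash function, table, and match helper reduced to hypothetical names (ehash, EHASH_SIZE, hash_4tuple() and sock_matches_4tuple() are stand-ins; the real code lives in the kernel's inet_hashtables):
static struct sock *established_lookup_sketch(struct sk_buff *skb,
                                              __be16 sport, __be16 dport)
{
    struct iphdr *iph = ip_hdr(skb);
    u32 slot = hash_4tuple(iph->saddr, sport, iph->daddr, dport) % EHASH_SIZE;
    struct sock *sk;

    /* Walk the (usually very short) collision chain for this bucket */
    hlist_for_each_entry(sk, &ehash[slot], sk_node) {
        if (sock_matches_4tuple(sk, iph->saddr, sport, iph->daddr, dport))
            return sk;                      /* exact match: an established connection */
    }
    /* No established socket: the real code would now try the listening
     * table, which is keyed by destination port only. */
    return NULL;
}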
For an established connection, tcp_v4_do_rcv() feeds the packet into TCP's state machine:
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcphdr *th = tcp_hdr(skb);
u32 seq = ntohl(th->seq);
/* Fast path: is this the next expected segment? */
if (seq == tp->rcv_nxt && tcp_header_ok(th) && !th->syn && !th->fin) {
/* Process inline without queuing */
return tcp_rcv_established_fastpath(sk, skb);
}
/* Slow path: out-of-order, has special flags, or needs more validation */
return tcp_rcv_established(sk, skb);
}
The fast path handles the common case: an in-order data segment on an established connection. It bypasses a lot of checks and gets data to the socket receive queue as quickly as possible.
tcp_rcv_established_fastpath()

int tcp_rcv_established_fastpath(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcphdr *th = tcp_hdr(skb);
/* We already know from the caller that:
* - seq == tp->rcv_nxt (this is the exact segment we're waiting for)
* - Header is valid (length check passed)
* - No SYN or FIN flags (normal data segment)
*/
/* Calculate the payload length (data only, excluding TCP header) */
u32 len = skb->len - th->doff * 4;
/* Understanding th->doff:
- doff = "Data Offset" (where the actual data begins)
- It's a 4-bit field in the TCP header
- Measured in 32-bit words (4 bytes)
- 1 Word = 4 Bytes
- So doff tells the number of words the header is composed of.
- Range: 5-15 (minimum 20 bytes, maximum 60 bytes)
- Multiply by 4 to convert words to bytes
Example:
    - Minimal TCP header (no options): doff = 5 → 5 * 4 = 20 bytes
    - With options (e.g., timestamps): doff = 8 → 8 * 4 = 32 bytes
So: skb->len = 1500 bytes (total)
th->doff = 5 (header is 20 bytes)
len = 1500 - 20 = 1480 bytes of actual data
*/
/* Update receive sequence number by the amount of data received */
tp->rcv_nxt += len;
/* ACK processing: update what the sender has acknowledged */
u32 ack = ntohl(th->ack_seq);
if (after(ack, tp->snd_una)) {
tcp_ack_update(tp, ack); // advance send window
}
/* Window update: sender is telling us how much buffer space it has */
tp->snd_wnd = ntohs(th->window) << tp->snd_wscale;
/* Add the segment directly to receive queue - no buffering */
skb_queue_tail(&sk->sk_receive_queue, skb);
/* IMPORTANT: Check if this segment fills a gap in the OOO queue */
if (!skb_queue_empty(&tp->out_of_order_queue)) {
tcp_ofo_queue(sk);
}
/* Wake up any process waiting in recv() */
sk->sk_data_ready(sk);
/* Send ACK if needed (delayed ACK logic) */
if (tcp_should_ack(tp))
tcp_send_ack(sk);
return 0;
}
What triggers the slow path?
- An out-of-order segment: seq != tp->rcv_nxt means we got packet N+2 before N+1.
- Special flags: SYN, FIN, RST, or URG force the full state machine.
- A header that fails the quick sanity check and needs more validation.

Here's the flow in the slow path:
int tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcphdr *th = tcp_hdr(skb);
/* Process ACK first (may free up send buffer space) */
if (th->ack) {
tcp_ack(sk, skb); // handles retransmission, congestion control, etc.
}
/* Now handle the data portion */
if (skb->len > 0) {
tcp_data_queue(sk, skb);
}
return 0;
}
The tcp_data_queue() Function

The tcp_data_queue() function is called from the slow path only:
tcp_v4_rcv()
 └── tcp_v4_do_rcv()
      ├── tcp_rcv_established_fastpath()   ← Fast path: does NOT call tcp_data_queue()
      │                                      (queues directly, handles ACK inline)
      │
      └── tcp_rcv_established()            ← Slow path: calls tcp_data_queue()
           └── tcp_data_queue()
TCP maintains an ordered receive queue for each connection. Here's a more detailed view:
int tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
u32 seq = TCP_SKB_CB(skb)->seq;
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
/* Is this the segment we're waiting for? */
if (seq == tp->rcv_nxt) {
/* In-order: add directly to receive queue */
tp->rcv_nxt = end_seq;
skb_queue_tail(&sk->sk_receive_queue, skb);
/* Check if any out-of-order segments can now be processed */
tcp_ofo_queue(sk);
/* Wake up any process blocked on read() */
sk->sk_data_ready(sk);
}
else if (after(seq, tp->rcv_nxt)) {
/* Out-of-order: stash in the out-of-order queue */
tcp_ofo_insert(sk, skb);
}
else {
/* Duplicate or old segment - drop it */
kfree_skb(skb);
}
return 0;
}
What about tcp_ofo_queue()?

The out-of-order (OOO) queue is a holding area for segments that arrive before we're ready to process them. When a segment fills a gap, tcp_ofo_queue() moves any newly-contiguous segments from the OOO queue to the receive queue.
Example scenario:
Let's say we're expecting sequence numbers 1000, 2000, 3000, 4000...
1. Receive seq 1000 (expected)
   → Goes directly to sk_receive_queue
   → tp->rcv_nxt = 2000
2. Receive seq 4000 (out of order!)
   → Goes to OOO queue
   → Still waiting for 2000
3. Receive seq 3000 (still out of order)
   → Goes to OOO queue
   → Still waiting for 2000
   OOO queue now: [3000-4000], [4000-5000]
4. Receive seq 2000 (the missing piece!)
   → Goes to sk_receive_queue
   → tp->rcv_nxt = 3000
   → Call tcp_ofo_queue()  ← this is where the magic happens
What tcp_ofo_queue() does:
void tcp_ofo_queue(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb, *tmp;
/* Walk through the out-of-order queue looking for segments
* that are now contiguous with rcv_nxt */
skb_queue_walk_safe(&tp->out_of_order_queue, skb, tmp) {
u32 seq = TCP_SKB_CB(skb)->seq;
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
/* Is this segment now the next expected one? */
if (seq == tp->rcv_nxt) {
/* Yes! Move it to the receive queue */
__skb_unlink(skb, &tp->out_of_order_queue);
skb_queue_tail(&sk->sk_receive_queue, skb);
/* Advance our expected sequence number */
tp->rcv_nxt = end_seq;
/* Continue checking - maybe more segments are now ready */
continue;
}
/* Is this segment still in the future? */
if (after(seq, tp->rcv_nxt)) {
/* Yes, there's still a gap. Stop here - we maintain order */
break;
}
/* This segment is old/duplicate - shouldn't happen, but be safe */
__skb_unlink(skb, &tp->out_of_order_queue);
kfree_skb(skb);
}
/* If we moved segments, wake up the reader */
if (!skb_queue_empty(&sk->sk_receive_queue))
sk->sk_data_ready(sk);
}
Continuing our example:
After receiving seq 2000:
Before tcp_ofo_queue():
sk_receive_queue: [1000-2000], [2000-3000]
out_of_order_queue: [3000-4000], [4000-5000]
tp->rcv_nxt = 3000
After tcp_ofo_queue():
sk_receive_queue: [1000-2000], [2000-3000], [3000-4000], [4000-5000]
out_of_order_queue: [empty]
tp->rcv_nxt = 5000
Both OOO segments became contiguous once the gap at 2000 was filled, so they moved to the receive queue in one sweep.
For UDP, things are much simpler. There's no connection state, no ordering, no reassembly at the transport layer:
int udp_rcv(struct sk_buff *skb)
{
struct udphdr *uh = udp_hdr(skb);
struct sock *sk;
/* Validate UDP header and checksum */
if (skb->len < sizeof(struct udphdr))
goto drop;
if (udp_checksum_invalid(skb))
goto drop;
/* Look up socket by destination port */
sk = udp_lookup(skb, uh->dest);
if (!sk)
goto no_socket;
/* Queue directly to socket */
return udp_queue_rcv_skb(sk, skb);
no_socket:
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
drop:
kfree_skb(skb);
return 0;
}
int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
/* Check if socket receive buffer is full */
if (sk_rmem_alloc_get(sk) + skb->truesize > sk->sk_rcvbuf) {
kfree_skb(skb);
return -ENOMEM; /* Drop if buffer is full */
}
/* Add to socket's receive queue */
skb_queue_tail(&sk->sk_receive_queue, skb);
/* Wake up any process blocked on read() */
sk->sk_data_ready(sk);
return 0;
}
No fancy state machine, no reordering: just validate, find the socket, and queue it. If the socket's receive buffer is full, the packet is dropped. UDP makes no delivery guarantees, and this is where that shows up.
At this point, the packet data is sitting in a socket's receive queue. But how does the application actually get it?
When your application calls recv() or read() on a socket, here's what happens:
int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags)
{
struct tcp_sock *tp = tcp_sk(sk);
int copied = 0;
lock_sock(sk);
while (copied < len) {
struct sk_buff *skb;
/* Get the next segment from the receive queue */
skb = skb_peek(&sk->sk_receive_queue);
if (!skb) {
/* Queue is empty */
if (copied > 0)
break; /* Return what we have */
if (flags & MSG_DONTWAIT) {
copied = -EAGAIN; /* Non-blocking, nothing available */
break;
}
/* Block until data arrives */
sk_wait_data(sk);
continue;
}
/* Copy data from skb to user buffer */
int chunk = min(skb->len, len - copied);
if (copy_to_user(msg->msg_iov, skb->data, chunk))
return -EFAULT;
copied += chunk;
/* Consume the data from the skb */
skb_pull(skb, chunk);
if (skb->len == 0) {
skb_unlink(skb, &sk->sk_receive_queue);
kfree_skb(skb);
}
/* Update TCP receive window */
tcp_rcv_space_adjust(sk);
}
release_sock(sk);
return copied;
}
This is where the kernel-to-userspace boundary crossing happens. The copy_to_user() call copies data from the kernel's sk_buff into the application's buffer. This is one of the few actual copies in the entire receive path.
Remember that sk->sk_data_ready(sk) call in tcp_data_queue()? That's what wakes up a process blocked in sk_wait_data(). The kernel uses wait queues to efficiently sleep processes until data is available, avoiding busy-waiting.
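Here's a conceptual sketch of that sleep/wake handshake. It uses the kernel's real wait-queue primitives but stands in for the actual sk_wait_data()/sock_def_readable() pair.
/* Reader side: sleep until the receive queue has something in it. */
static void wait_for_data_sketch(struct sock *sk)
{
    DEFINE_WAIT(wait);

    /* Register on the socket's wait queue, then re-check the condition
     * before sleeping so we can't miss a wakeup that races with us. */
    prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
    if (skb_queue_empty(&sk->sk_receive_queue))
        schedule();                          /* sleep until data_ready wakes us */
    finish_wait(sk_sleep(sk), &wait);
}

/* Softirq side: called right after a segment is queued. */
static void data_ready_sketch(struct sock *sk)
{
    wake_up_interruptible(sk_sleep(sk));     /* wake anyone blocked in recv()/read() */
}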
Applications using select(), poll(), or epoll() don't block inside recv(). Instead, they register interest in multiple file descriptors and block waiting for any of them to become readable. When data arrives, the socket's sk_data_ready callback notifies the epoll wait queue, which wakes up the application.
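From the application's point of view, that pattern looks like the following userspace sketch (error handling trimmed for brevity):
#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/socket.h>

static void epoll_read_loop(int sockfd)
{
    char buf[4096];
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };

    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);   /* register interest in readability */

    for (;;) {
        struct epoll_event events[16];
        int n = epoll_wait(epfd, events, 16, -1);  /* sleeps until sk_data_ready fires */

        for (int i = 0; i < n; i++) {
            if (events[i].events & EPOLLIN) {
                /* The data is already sitting in sk_receive_queue;
                 * this recv() just copies it out without blocking. */
                ssize_t len = recv(events[i].data.fd, buf, sizeof(buf), MSG_DONTWAIT);
                if (len <= 0)
                    break;
                /* ... process len bytes ... */
            }
        }
    }
}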
Let's trace a complete packet journey from wire to application:
[NIC Hardware]                                      [Hardware]
  │ (packet arrives)
  ├── CRC validation
  ├── MAC filter check
  └── DMA write to ring buffer
  │
  ▼
IRQ raised
  │
  ▼
IRQ handler                                         [Hard IRQ Context]
  ├── disable NIC IRQ
  ├── schedule NAPI
  └── return
  │
  ▼
┌───────────────────────────────────────────────────┐
│        SOFTIRQ CONTEXT (ksoftirqd / NET_RX)        │
└───────────────────────────────────────────────────┘
  │
  ▼
napi_poll()                                         [Driver - NAPI]
  ├── read RX descriptors
  ├── allocate sk_buff
  ├── DMA sync (zero-copy)
  └── for each packet:
  │
  ▼
eth_type_trans()                                    [L2 - Ethernet]
  ├── parse Ethernet header
  ├── set skb->protocol
  └── skb_pull() to strip Ethernet header
  │
  ▼
GRO (Generic Receive Offload)
  ├── coalesce related packets
  └── batch processing
  │
  ▼
netif_receive_skb()
  │
  ├──────────────────────────────┐
  │                              │
  ▼                              ▼
ip_rcv()                  AF_PACKET socket (tcpdump/Wireshark)
  │                              gets copy here
  │
[IP Layer - L3]
  │
  ├── ip_rcv_core()
  │    ├── validate header length
  │    ├── validate checksum
  │    └── validate TTL
  │
  ├── netfilter PREROUTING (iptables)
  │
  ├── ip_rcv_finish()
  │    ├── routing decision
  │    ├── forward? → ip_forward()
  │    └── local? → continue
  │
  └── ip_local_deliver()
       ├── reassemble fragments
       ├── netfilter LOCAL_IN
       └── dispatch by protocol:
  │
  ▼
tcp_v4_rcv()                                        [TCP Layer - L4]
  │
  ├── checksum verification
  ├── socket lookup (4-tuple hash)
  │    └── (src_ip, src_port, dst_ip, dst_port)
  │
  ├── tcp_v4_do_rcv()
  │    │
  │    ├── TCP state machine
  │    │    ├── process ACKs
  │    │    ├── update congestion window
  │    │    ├── handle retransmits
  │    │    └── process flags (SYN/FIN/RST)
  │    │
  │    └── tcp_rcv_established()
  │         │
  │         ├── validate sequence numbers
  │         ├── update receive window
  │         └── tcp_data_queue()
  │              ├── add to sk_receive_queue
  │              └── trim TCP header (skb_pull)
  │
  └── sk_data_ready()
       └── wake_up_interruptible(sk->sk_wq)
┌───────────────────────────────────────────────────┐
│              END OF SOFTIRQ CONTEXT                │
└───────────────────────────────────────────────────┘
  │
  ▼
[Process wakes up]                                  [Process Context]
  │
  ▼
recv(fd, buf, len)                                  [User Space]
  │ (or read/recvmsg)
  ▼ syscall
sys_recvmsg()                                       [Kernel]
  │
  ▼
sock_recvmsg()                                      [Socket Layer]
  │
  ▼
tcp_recvmsg()                                       [TCP]
  ├── lock_sock()
  ├── check sk_receive_queue
  ├── copy_to_user() from sk_buff
  ├── update rcv_nxt
  ├── send ACK (if needed)
  └── release_sock()
  │
  ▼
return to userspace
  │
  ▼
[Application processes data]
Throughout this journey, we've seen a consistent pattern: the kernel avoids copying data whenever possible. The same memory buffer that the NIC DMA'd into is referenced all the way up to the socket layer. The only copy happens at the very end, when we have to move data from kernel space to user space.
This is why modern network stacks can achieve such high throughput. At 100 Gbps, you simply can't afford to copy every byte multiple times. The entire architecture is designed around passing pointers and metadata, not shuffling bytes around.
If you're debugging network performance, here's where to look:
NAPI budget exhaustion: If your system consistently hits the NAPI budget, you're processing packets faster than softirq time allows. Consider RPS/RFS or multiple queues.
Socket buffer pressure: When sk_rcvbuf fills up, packets get dropped at the socket layer. Your application isn't reading fast enough.
Netfilter overhead: Complex iptables rules run on every packet in the hot path. This adds up.
Lock contention: The socket lock (lock_sock) is held during receive processing. Multiple threads reading from the same socket will serialize.
Copy to userspace: This is often the bottleneck. Technologies like io_uring and AF_XDP try to eliminate or reduce this copy.
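One concrete way to spot NAPI budget exhaustion is /proc/net/softnet_stat: one row of hex counters per CPU, where the third column (time_squeeze) counts how often net_rx_action had to stop early. A small standalone reader:
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    char line[512];
    int cpu = 0;

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        unsigned int processed, dropped, squeezed;
        /* Columns: packets processed, dropped (backlog full), time_squeeze */
        if (sscanf(line, "%x %x %x", &processed, &dropped, &squeezed) == 3)
            printf("cpu%d: processed=%u dropped=%u time_squeeze=%u\n",
                   cpu, processed, dropped, squeezed);
        cpu++;
    }
    fclose(f);
    return 0;
}
A steadily climbing time_squeeze column is a hint that polls are routinely hitting the budget.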
We've traced a packet's complete journey from the network wire to your application's buffer:
- The NIC validated the frame and DMA'd it into a pre-allocated ring buffer.
- NAPI drained the ring in softirq context and wrapped each packet in an sk_buff with zero copying of the payload.
- GRO coalesced segments from the same flow, the IP layer validated and routed the packet, and netfilter had its say.
- TCP matched the segment to a socket, slotted it into the receive queue in order, and woke the reader.
- recv() copied the data into the application's buffer, the only real copy on the whole path.

Understanding this path helps you reason about where bottlenecks might occur, why certain syscalls behave the way they do, and how technologies like DPDK, XDP, and io_uring manage to go even faster by bypassing parts of this stack.