Navigating the Linux Kernel Network Stack: The Packet Read Path

In Part 1 of this series, we walked through how a call to socket.Write() eventually results in a packet being transmitted, the write (TX) path.

In this part, we'll explore the other side: what happens when the system receives a packet, the read (RX) path. We'll trace a packet's journey from the moment it hits the NIC all the way up to the application layer.

Here's the high-level flow we'll cover:

NIC → DMA → Driver (NAPI) → sk_buff → L2 → ip_rcv → tcp_v4_rcv → Socket → Application

Table of Contents

  1. The NIC Receives the Packet
  2. Interrupts and NAPI: How the Kernel Knows There's Work to Do
  3. The Ring Buffer: Where Packets Land
  4. The NAPI Poll Loop: Processing Packets
  5. Layer 2 Processing: Parsing the Ethernet Header
  6. GRO: Coalescing Packets for Efficiency
  7. Protocol Dispatch: Entering the Network Stack
  8. IP Layer: Validation and Routing
  9. TCP Layer: Connection Lookup and State Machine
  10. UDP: The Simple Case
  11. The Socket Layer: Where Kernel Meets Userspace
  12. Putting It All Together

1. The NIC Receives the Packet

When a packet arrives at the NIC, the hardware performs some quick sanity checks before bothering the CPU:

Ethernet FCS (Frame Check Sequence)

Every Ethernet frame ends with a 32-bit CRC. The NIC recomputes this CRC over the received frame and compares it with the FCS field. If they don't match, the packet is silently dropped; no point wasting CPU cycles on corrupted data.

Frame Length Validation

The NIC verifies the frame isn't too short (minimum 64 bytes) or too long (typically 1518 bytes, or more for jumbo frames).

MAC Address Filtering

The NIC checks if the destination MAC address matches its own (or is a broadcast/multicast address it's listening to). Packets destined for other hosts are dropped.

Checksum Offloading (optional)

Modern NICs can verify IP/TCP/UDP checksums in hardware. When enabled, instead of making the kernel recalculate these checksums, the NIC either:

  • Marks the packet as CHECKSUM_UNNECESSARY (telling the kernel "I've already verified this"), or
  • Hands the kernel the raw checksum it computed over the packet (CHECKSUM_COMPLETE), so the stack can finish verification cheaply

Once these checks pass, the NIC uses DMA to copy the packet data directly into system memory, specifically into buffers that the driver has pre-allocated and registered with the NIC.


2. Interrupts and NAPI: How the Kernel Knows There's Work to Do

The classic approach to handling incoming packets would be: NIC receives packet → triggers interrupt → kernel handles packet. Simple, but terrible for performance. At 10 Gbps, you could be processing millions of packets per second. If each packet triggers an interrupt and context switch, the CPU spends more time handling interrupts than actually processing packets.

This is where NAPI (New API) comes in. It's a hybrid interrupt/polling mechanism that gives you the best of both worlds:

  1. Initial interrupt: When the first packet arrives, the NIC triggers a hardware interrupt (IRQ)
  2. Disable further interrupts: The driver's IRQ handler immediately disables interrupts from this NIC and schedules a NAPI poll
  3. Polling mode: The napi_poll() function runs in softirq context, pulling packets from the NIC's buffer in batches (typically up to 64 packets at a time)
  4. Budget-based processing: The poll function processes packets until either the budget is exhausted or no more packets are available. If the budget is exhausted, the kernel reschedules the poll, which may then run in ksoftirqd (a per-CPU kernel thread)
  5. Re-enable interrupts: Only after all the packets in the buffer are processed, NAPI re-enables interrupts and waits for the next batch

The key insight is that under high load, we stay in polling mode, processing packets as fast as they arrive without interrupt overhead. Under low load, we fall back to interrupt-driven mode to avoid wasting CPU cycles polling an empty queue.
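
To make step 2 concrete, here is roughly what a NAPI-style IRQ handler looks like. napi_schedule() and IRQ_HANDLED are real kernel APIs; nic_disable_rx_irq() is a hypothetical stand-in for the device-specific register write that masks the interrupt:

static irqreturn_t nic_rx_irq(int irq, void *data)
{
    struct napi_struct *napi = data;   /* passed to request_irq() at setup */

    /* Mask further RX interrupts from this queue (device-specific write) */
    nic_disable_rx_irq(napi);

    /* Ask the kernel to run our poll function in softirq context */
    napi_schedule(napi);

    return IRQ_HANDLED;
}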


3. The Ring Buffer: Where Packets Land

Before we look at the driver code, we need to understand where these packets actually end up.

The driver maintains a ring buffer (also called a descriptor ring) shared with the NIC. Think of it as a circular array where:

  • Each entry is a descriptor that points to a memory buffer
  • The NIC writes incoming packet data into these buffers via DMA
  • The driver reads from the ring and hands packets up the stack

Here are the key data structures:

struct rx_desc {
    dma_addr_t dma_addr;  // physical address where NIC writes
    u16        length;    // bytes written by NIC 
    u16        status;    // completion flags: DD (done), EOP (end of packet), errors
};

struct rx_buffer {
    struct page *page;     // backing memory page
    u16          offset;   // offset within the page
};

The driver pre-allocates memory pages and maps them for DMA, then fills the descriptor ring with pointers to these buffers. The NIC knows exactly where to write incoming packet data without any CPU involvement.
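
As a rough sketch of that setup, here's how a driver might allocate the descriptor ring and its bookkeeping array. The struct rx_ring fields (including desc_dma) are assumed from how the rest of this article uses them; dma_alloc_coherent() and kcalloc() are the real kernel allocators:

static int rx_ring_alloc(struct device *dev, struct rx_ring *ring, u16 size)
{
    ring->size = size;

    /* Descriptor array lives in DMA-coherent memory visible to both the
     * CPU and the NIC; desc_dma holds the bus address programmed into
     * the NIC's registers */
    ring->desc = dma_alloc_coherent(dev, size * sizeof(struct rx_desc),
                                    &ring->desc_dma, GFP_KERNEL);
    if (!ring->desc)
        return -ENOMEM;

    /* Per-descriptor software bookkeeping stays in ordinary memory */
    ring->buf = kcalloc(size, sizeof(struct rx_buffer), GFP_KERNEL);
    if (!ring->buf) {
        dma_free_coherent(dev, size * sizeof(struct rx_desc),
                          ring->desc, ring->desc_dma);
        return -ENOMEM;
    }

    ring->next_to_use   = 0;
    ring->next_to_clean = 0;
    ring->free_descs    = size;
    return 0;
}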


4. The NAPI Poll Loop: Processing Packets

Now let's look at what happens when napi_poll() runs. This is where the driver reads completed descriptors and builds sk_buff structures for the kernel.

Refilling the ring

First, here's how the driver keeps the ring buffer topped up with fresh buffers:

#define RX_BUF_SIZE 2048
#define PAGE_SIZE   4096

void rx_refill(struct rx_ring *ring)
{
    static struct page *cur_page = NULL;
    static u16 cur_offset = 0;

    while (ring->free_descs) {
        u16 i = ring->next_to_use;
        struct rx_buffer *buf = &ring->buf[i];
        struct rx_desc   *desc = &ring->desc[i];

        /* Allocate a new page only when needed */
        if (!cur_page || cur_offset + RX_BUF_SIZE > PAGE_SIZE) {
            cur_page = alloc_page(GFP_ATOMIC);
            if (!cur_page)
                break;  /* out of memory - retry on the next poll */
            cur_offset = 0;
        }

        /* Associate this descriptor with a slice of the page */
        buf->page   = cur_page;
        buf->offset = cur_offset;

        /* Map that slice for DMA */
        desc->dma_addr = dma_map_page(dev, cur_page, cur_offset,
                                       RX_BUF_SIZE, DMA_FROM_DEVICE);

        /* Hand descriptor ownership to NIC */
        desc->status = 0;

        /* Advance to next slice in the page */
        cur_offset += RX_BUF_SIZE;

        ring->next_to_use = (i + 1) % ring->size;
        ring->free_descs--;
    }
}

Notice how we pack two 2KB buffers into a single 4KB page.

The main poll function

Here's the heart of packet reception:

int rx_napi_poll(struct napi_struct *napi, int budget)
{
    struct rx_ring *ring = container_of(napi, struct rx_ring, napi);
    int work_done = 0;

    while (work_done < budget) {
        struct sk_buff *skb = NULL;
        int frag_idx = 0;

        /* Inner loop: assemble one complete packet (may span multiple descriptors) */
        while (1) {
            u16 i = ring->next_to_clean;
            struct rx_desc   *desc = &ring->desc[i];
            struct rx_buffer *buf  = &ring->buf[i];

            /* Check if NIC has finished writing to this descriptor */
            if (!(desc->status & RX_DESC_DONE))
                goto out;

            /* Unmap the buffer - NIC is done with it */
            dma_unmap_page(dev, desc->dma_addr, RX_BUF_SIZE, DMA_FROM_DEVICE);

            if (!skb) {
                /* First fragment: create the sk_buff */
                skb = build_skb(page_address(buf->page), PAGE_SIZE);
                skb_reserve(skb, buf->offset);
                skb_put(skb, desc->length);
            } else {
                /* Additional fragment: add to existing sk_buff */
                skb_add_rx_frag(skb, frag_idx++, buf->page,
                                buf->offset, desc->length, PAGE_SIZE);
            }

            ring->next_to_clean = (i + 1) % ring->size;
            ring->free_descs++;

            /* Is this the last fragment of the packet? */
            if (desc->status & RX_DESC_EOP)
                break;
        }

        /* L2 processing: extract Ethernet header info */
        skb->protocol = eth_type_trans(skb, netdev);

        /* Pass to GRO for potential coalescing, then up the stack */
        napi_gro_receive(napi, skb);

        work_done++;
    }

out:
    if (work_done)
        rx_refill(ring);

    /* If we drained the ring before the budget ran out, leave polling
     * mode (the driver would then re-enable the NIC's interrupt) */
    if (work_done < budget)
        napi_complete_done(napi, work_done);

    return work_done;
}

A few things worth noting:

  • No data copying: The sk_buff points directly to the DMA buffer. We're just moving pointers around.
  • Jumbo frames: Large packets may span multiple descriptors. The inner loop assembles all fragments into one sk_buff.
  • Descriptor ownership: The RX_DESC_DONE flag tells us the NIC has finished writing. Until that flag is set, we can't touch the buffer(prevents concurrent read/write).

5. Layer 2 Processing: Parsing the Ethernet Header

Once the driver has an sk_buff with raw packet data, it calls eth_type_trans() to extract Ethernet header information:

__be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
{
    struct ethhdr *eth = (struct ethhdr *)skb->data;

    /* Record where the MAC header starts */
    skb->mac_header = skb->data;

    /* Network header starts right after Ethernet header */
    skb->network_header = skb->data + sizeof(struct ethhdr);

    /* Advance data pointer past the Ethernet header */
    skb->data += sizeof(struct ethhdr);

    /* Extract the protocol (IPv4, IPv6, ARP, etc.) */
    skb->protocol = eth->h_proto;

    /* Remember which device received this */
    skb->dev = dev;

    /* Classify packet type based on destination MAC */
    if (is_multicast_ether_addr(eth->h_dest))
        skb->pkt_type = PACKET_MULTICAST;
    else if (is_broadcast_ether_addr(eth->h_dest))
        skb->pkt_type = PACKET_BROADCAST;
    else
        skb->pkt_type = PACKET_HOST;

    return skb->protocol;
}

After this function returns, skb->data points to the IP header (or whatever protocol is encapsulated), and skb->protocol tells us which protocol handler should process it.


6. GRO: Coalescing Packets for Efficiency

Before passing packets up the stack, NAPI gives GRO (Generic Receive Offload) a chance to merge them:

void napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    struct list_head *gro_list = &napi->gro_list;

    /* Try to merge with an existing packet in the GRO list */
    for_each_entry(prev, gro_list) {
        if (can_gro_merge(prev, skb)) {
            gro_merge(prev, skb);
            return;
        }
    }

    /* Can't merge - add to GRO list for potential future merging */
    list_add(&skb->list, gro_list);

    /* Flush the list if it's full or we've waited long enough */
    if (gro_should_flush(gro_list)) {
        for_each_entry_safe(pkt, gro_list) {
            netif_receive_skb(pkt);
        }
        list_init(gro_list);
    }
}

Generic Receive Offload (GRO) is the receive-side counterpart to TSO (TCP Segmentation Offload). The idea is simple: if we're receiving a burst of TCP packets from the same flow, merge them into one large packet before handing it to the TCP stack. Processing one 64KB packet is much cheaper than processing forty-four 1.5KB packets.

When can packets be merged?

GRO is very strict about what it will combine. Two packets can only merge if they're consecutive segments of the same logical stream:

bool can_gro_merge(struct sk_buff *prev, struct sk_buff *skb)
{
    struct iphdr *iph1 = ip_hdr(prev);
    struct iphdr *iph2 = ip_hdr(skb);
    struct tcphdr *th1 = tcp_hdr(prev);
    struct tcphdr *th2 = tcp_hdr(skb);

    // Must be same protocol
    if (prev->protocol != skb->protocol)
        return false;

    // IP headers must match (same flow)
    if (iph1->saddr != iph2->saddr ||
        iph1->daddr != iph2->daddr ||
        iph1->protocol != iph2->protocol)
        return false;

    // For TCP: ports must match
    if (th1->source != th2->source ||
        th1->dest != th2->dest)
        return false;

    // Sequence numbers must be consecutive
    u32 prev_end = ntohl(th1->seq) + prev->len;
    if (ntohl(th2->seq) != prev_end)
        return false;

    // No special TCP flags (SYN, FIN, RST, URG)
    if (th2->syn || th2->fin || th2->rst || th2->urg)
        return false;

    // ACK numbers should match (same direction of flow)
    if (th1->ack_seq != th2->ack_seq)
        return false;

    // Window size shouldn't change mid-stream
    if (th1->window != th2->window)
        return false;

    // Combined size can't exceed 64KB (or configured limit)
    if (prev->len + skb->len > GRO_MAX_SIZE)
        return false;

    return true;
}

If any of these checks fail, the packets stay separate.

How do headers merge?

When two packets are combined, GRO doesn't actually merge the headers; it keeps only the first packet's header and appends the second packet's payload:

void gro_merge(struct sk_buff *prev, struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(prev);
    struct tcphdr *th = tcp_hdr(prev);

    // Append the new packet's data as a fragment
    // (the header from skb is discarded)
    skb_pull(skb, skb_transport_offset(skb) + tcp_hdrlen(skb));
    skb_add_frag(prev, skb->data, skb->len);

    // Update the first packet's length fields
    prev->len += skb->len;
    prev->data_len += skb->len;

    // Update IP total length
    iph->tot_len = htons(ntohs(iph->tot_len) + skb->len);

    // TCP header stays the same (same seq, same flags)
    // but the payload is now larger

    // Recalculate checksums (or mark for later)
    prev->ip_summed = CHECKSUM_PARTIAL;

    // The merged packet now carries the combined data.
    // Free the second sk_buff's shell - its payload lives on as a fragment
    // of prev (the real code keeps a reference to the underlying page)
    kfree_skb(skb);
}

So if you had two packets:

Packet 1: [Eth][IP len=1500][TCP seq=1000][1460 bytes data]
Packet 2: [Eth][IP len=1500][TCP seq=2460][1460 bytes data]

After GRO merge, you get:

Merged:   [Eth][IP len=2960][TCP seq=1000][2920 bytes data]
                    ↑                           ↑
              length updated              payloads concatenated

The second packet's headers are thrown away; they were redundant since seq/ack/ports were identical (or predictably sequential).

What happens to completely different packets?

If packets can't be merged (different flows, different protocols, out-of-order, has special flags), they stay in the GRO list as separate entries:

// GRO list might look like this:
gro_list:
  [0] TCP flow A (192.168.1.1:443 → 10.0.0.1:52000) - 4380 bytes (3 merged)
  [1] TCP flow B (192.168.1.1:80  → 10.0.0.1:52001) - 1460 bytes (1 packet)
  [2] UDP packet (192.168.1.1:53 → 10.0.0.1:41234) - 512 bytes
  [3] TCP flow A (192.168.1.1:443 → 10.0.0.1:52000) - 1460 bytes (out of order, can't merge with [0])

When the GRO list is flushed (budget exhausted, list full, or timeout), each entry is passed separately to netif_receive_skb():

bool gro_should_flush(struct list_head *gro_list)
{
    // Too many distinct flows in the list
    if (gro_list->count >= MAX_GRO_SKBS)  // typically 8
        return true;

    // Individual packet has been held too long
    if (time_after(jiffies, oldest_entry->gro_time + GRO_FLUSH_TIMEOUT))
        return true;

    // NAPI poll is ending
    if (napi_complete_called)
        return true;

    return false;
}

if (gro_should_flush(gro_list)) {
    for_each_entry_safe(pkt, gro_list) {
        // Each packet goes up the stack individually
        netif_receive_skb(pkt);
    }
    list_init(gro_list);
}

So GRO never forces incompatible packets together; it just batches them and sends them up individually when it can't merge. The goal is to hold packets just long enough to catch their siblings, but not so long that we add noticeable latency. In practice, GRO adds microseconds of delay in exchange for dramatically reduced per-packet overhead.


7. Protocol Dispatch: Entering the Network Stack

netif_receive_skb() is the gateway into the kernel's protocol processing. It runs any configured ingress hooks and then hands the packet to the right protocol handler:

int netif_receive_skb(struct sk_buff *skb)
{
    /* Run ingress traffic control if configured (tc filters, eBPF, etc.) */
    if (skb->dev->ingress_qdisc) {
        int result = tc_ingress_classify(skb);
        if (result == TC_ACT_SHOT) {
            kfree_skb(skb);
            return NET_RX_DROP;
        }
    }

    /* Validate checksum if the NIC didn't do it for us */
    if (skb->ip_summed == CHECKSUM_NONE) {
        if (!validate_checksum(skb)) {
            kfree_skb(skb);
            return NET_RX_DROP;
        }
    }

    /* Dispatch to the appropriate protocol handler */
    switch (skb->protocol) {
        case ETH_P_IP:
            return ip_rcv(skb);
        case ETH_P_IPV6:
            return ipv6_rcv(skb);
        case ETH_P_ARP:
            return arp_rcv(skb);
        default:
            kfree_skb(skb);
            return NET_RX_DROP;
    }
}

For our TCP/IP packet, this means calling ip_rcv().
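
The dispatch above is driven by a registration table: each L3 protocol hands the kernel a struct packet_type at init time via dev_add_pack(). In simplified form, this is how IPv4 wires up ip_rcv() (the real initializer lives in net/ipv4/af_inet.c):

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),   /* matched against skb->protocol */
    .func = ip_rcv,                  /* called by the RX dispatch path */
};

static int __init inet_init(void)
{
    /* ... protocol and socket setup ... */
    dev_add_pack(&ip_packet_type);   /* register IPv4 with the dispatcher */
    return 0;
}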


8. IP Layer: Validation and Routing

Now we're in Layer 3 territory. ip_rcv() validates the IP header and figures out what to do with the packet:

int ip_rcv(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);

    /* Basic sanity checks */
    if (iph->version != 4)
        goto drop;

    if (iph->ihl < 5)  /* Header length must be at least 20 bytes */
        goto drop;

    if (skb->len < ntohs(iph->tot_len))
        goto drop;

    /* Verify IP header checksum */
    if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
        goto drop;

    /* Run through netfilter PREROUTING hooks (iptables, nftables, etc.) */
    // For instance, network load balancers commonly use a netfilter hook
    // at the PREROUTING stage to perform DNAT.
    //
    // However, some high-performance load balancers (like those using
    // DPDK, XDP, or eBPF) bypass the standard netfilter path entirely
    // and rewrite packets at lower layers for better performance.
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, ip_rcv_finish);

drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

After netfilter processing, ip_rcv_finish() makes the routing decision:

int ip_rcv_finish(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);

    /* Look up the route for this destination */
    if (!skb_dst(skb)) {
        if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev) < 0)
            goto drop;
    }

    /* Follow the routing decision */
    return skb_dst(skb)->input(skb);  // calls ip_local_deliver or ip_forward

drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

What does ip_route_input() actually do?

This function is the kernel's routing lookup for incoming packets. It queries the FIB (Forwarding Information Base) and attaches a routing decision to the sk_buff:

int ip_route_input(skb, daddr, saddr, tos, dev)
{
    // Is destination one of our local addresses?
    if (inet_addr_is_local(daddr)) {
        skb->dst->input = ip_local_deliver;
        return 0;
    }

    // Is it broadcast or multicast?
    if (inet_addr_is_broadcast(daddr, dev))
        return setup_broadcast_route(skb);

    if (ipv4_is_multicast(daddr))
        return ip_route_input_mc(skb, ...);

    // Look up in the routing table (Forwarding Information Base)
    fib_result = fib_lookup(daddr);
    
    if (fib_result.type == RTN_UNICAST) {
        skb->dst->input = ip_forward;
        skb->dst->next_hop = fib_result.gateway;
        return 0;
    }

    return -ENETUNREACH;  // No route to host
}

After this lookup, the sk_buff carries a dst_entry that tells the stack:

  • Where to go next: ip_local_deliver (for us) or ip_forward (send it elsewhere)
  • Outgoing interface and next hop: If forwarding, which interface and gateway to use
  • Path MTU: For fragmentation decisions later

This is also where policy routing kicks in: the lookup can consider source address, TOS bits, incoming interface, and firewall marks, not just the destination.
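
That wider matching is possible because the FIB is queried with a flow key rather than a bare destination address. Here's a hedged sketch of what that key carries; the struct flowi4 fields and fib_lookup() are real kernel APIs, while the wrapper function is illustrative:

static int route_input_sketch(struct sk_buff *skb, struct iphdr *iph,
                              struct net_device *dev, struct fib_result *res)
{
    struct flowi4 fl4 = {
        .daddr       = iph->daddr,    /* destination address             */
        .saddr       = iph->saddr,    /* source address (policy routing) */
        .flowi4_tos  = iph->tos,      /* TOS / DSCP bits                 */
        .flowi4_iif  = dev->ifindex,  /* incoming interface              */
        .flowi4_mark = skb->mark,     /* firewall mark set by netfilter  */
    };

    /* res is filled with the matching route (or an error is returned) */
    return fib_lookup(dev_net(dev), &fl4, res, 0);
}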

Handling IP Fragmentation

If the packet is a fragment of a larger datagram, ip_local_deliver() hands it to the IP fragment reassembly queue:

int ip_local_deliver(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);

    /* Check if this is a fragment */
    if (ip_is_fragment(iph)) {
        skb = ip_defrag(skb);
        if (!skb)
            return 0;  /* Fragment queued, waiting for more pieces */
    }

    /* Strip the IP header, advance to L4 payload */
    skb_pull(skb, iph->ihl * 4);
    skb->transport_header = skb->data;

    /* Run through netfilter LOCAL_IN hooks */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, ip_local_deliver_finish);
}

How are fragments detected?

The IP header has two fields that identify fragments:

struct iphdr {
    // ... other fields ...
    __be16 frag_off;   // fragment offset + flags
    __be16 id;         // identification (same for all fragments of one datagram)
};

// The frag_off field packs both flags and offset:
//   Bits 0-12:  Fragment offset (in 8-byte units)
//   Bit  13:    MF (More Fragments) flag
//   Bit  14:    DF (Don't Fragment) flag
//   Bit  15:    Reserved

bool ip_is_fragment(struct iphdr *iph)
{
    // It's a fragment if MF flag is set OR offset is non-zero
    return (iph->frag_off & htons(IP_MF | IP_OFFSET)) != 0;
}

A packet is a fragment if:

  • MF (More Fragments) bit is set: More pieces are coming
  • Fragment offset is non-zero: This isn't the first piece

The first fragment has offset=0 but MF=1. The last fragment has MF=0 but offset>0. Middle fragments have both MF=1 and offset>0.
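
For example, a 4000-byte IP payload sent over a 1500-byte-MTU link (1480 payload bytes fit per fragment after the 20-byte IP header) splits like this:

Fragment 1: 1480 bytes, offset = 0     → flags+offset (host order) = 0x2000 (MF=1, offset 0)
Fragment 2: 1480 bytes, offset = 1480  → flags+offset (host order) = 0x20B9 (MF=1, offset 1480/8 = 185)
Fragment 3: 1040 bytes, offset = 2960  → flags+offset (host order) = 0x0172 (MF=0, offset 2960/8 = 370)

All three fragments share the same id, saddr, daddr, and protocol, which is exactly the key the reassembly code below uses to group them.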

How are fragments reassembled?

The kernel maintains a hash table of incomplete datagrams, keyed by (src_ip, dst_ip, protocol, identification):

struct ipq {
    struct iphdr    *iph;           // copy of IP header from first fragment
    struct sk_buff  *fragments;     // linked list of received fragments
    int              len;           // total length so far
    int              meat;          // bytes of actual data received
    __u8             last_in;       // have we seen the last fragment?
    struct timer_list timer;        // reassembly timeout (default: 30 seconds)
};

struct sk_buff *ip_defrag(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    
    // Find or create reassembly queue for this datagram
    struct ipq *qp = ip_find(iph->id, iph->saddr, iph->daddr, iph->protocol);
    
    if (!qp) {
        qp = ip_create_queue(iph);
        start_timer(&qp->timer, IP_FRAG_TIMEOUT);  // 30 seconds
    }
    
    // Insert this fragment in offset order
    ip_frag_queue(qp, skb);
    
    // Check if we have all fragments
    if (qp->last_in) {
        // All pieces received - reassemble into one sk_buff
        struct sk_buff *complete = ip_frag_reasm(qp);
        ip_destroy_queue(qp);
        return complete;
    }
    
    // Still waiting for more fragments
    return NULL;
}

Key points:

  • Fragments can arrive out of order β€” the kernel sorts them by offset
  • A timer prevents memory exhaustion from incomplete datagrams
  • Once all fragments arrive, they're merged into a single sk_buff
  • If the timer expires before all fragments arrive, the whole thing is dropped

What are LOCAL_IN hooks?

Netfilter provides several hook points where packets can be inspected, modified, or dropped. The NF_INET_LOCAL_IN hook runs on packets destined for this machine, after routing but before transport layer processing:

                                    ┌──────────────┐
                                    │   FORWARD    │──→ (to another interface)
                                    └──────────────┘
                                          ↑
    ┌──────────┐    ┌────────────┐    ┌──────────┐
───→│PREROUTING│──→│  Routing   │──→│ LOCAL_IN │──→ (to TCP/UDP/ICMP)
    └──────────┘    │  Decision  │    └──────────┘
                    └────────────┘
                          ↓
                (packet is for us)

If any hook returns NF_DROP, the packet is discarded and never reaches TCP/UDP. If all hooks return NF_ACCEPT, processing continues to ip_local_deliver_finish().
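
These hooks are how iptables, nftables, and custom kernel modules attach themselves. As a hedged sketch using the standard netfilter APIs (the module itself is hypothetical), registering a LOCAL_IN hook looks roughly like this:

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int my_local_in_hook(void *priv, struct sk_buff *skb,
                                     const struct nf_hook_state *state)
{
    /* Inspect or modify skb here; returning NF_DROP would discard it
     * before it ever reaches tcp_v4_rcv()/udp_rcv() */
    return NF_ACCEPT;
}

static const struct nf_hook_ops my_ops = {
    .hook     = my_local_in_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_IN,
    .priority = NF_IP_PRI_FILTER,
};

/* In the module's init function:  nf_register_net_hook(&init_net, &my_ops); */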

Finally, ip_local_deliver_finish() dispatches to the transport layer based on the protocol field:

int ip_local_deliver_finish(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);

    switch (iph->protocol) {
        case IPPROTO_TCP:
            return tcp_v4_rcv(skb);
        case IPPROTO_UDP:
            return udp_rcv(skb);
        case IPPROTO_ICMP:
            return icmp_rcv(skb);
        default:
            kfree_skb(skb);
            return 0;
    }
}

9. TCP Layer: Connection Lookup and State Machine

For TCP packets, tcp_v4_rcv() is where the real complexity begins. TCP is a stateful protocol, so the first job is finding which connection this packet belongs to:

int tcp_v4_rcv(struct sk_buff *skb)
{
    struct tcphdr *th = tcp_hdr(skb);
    struct sock *sk;

    /* Validate TCP header */
    if (th->doff < 5)  /* Header too short */
        goto drop;

    if (!tcp_checksum_valid(skb))
        goto drop;

    /* Look up the socket for this 4-tuple (src_ip, src_port, dst_ip, dst_port) */
    sk = inet_lookup(skb, th->source, th->dest);

    if (!sk)
        goto no_socket;

    /* Hand off to the appropriate handler based on socket state */
    if (sk->sk_state == TCP_LISTEN)
        return tcp_v4_do_rcv_listen(sk, skb);  /* Incoming connection */
    else
        return tcp_v4_do_rcv(sk, skb);         /* Established connection */

no_socket:
    /* No matching socket - send RST if it's not already a RST */
    if (!th->rst)
        tcp_v4_send_reset(skb);
    
drop:
    kfree_skb(skb);
    return 0;
}

The socket lookup uses a hash table keyed by the 4-tuple (source IP, source port, destination IP, destination port). This is O(1) on average, which matters a lot when you're handling millions of packets per second.
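
A simplified sketch of that hash is shown below. The real kernel uses inet_ehashfn(), which additionally mixes in the network namespace and a boot-time random secret so the hash can't be predicted by an attacker; jhash_3words() is a real kernel helper:

#include <linux/jhash.h>

static u32 four_tuple_hash(__be32 saddr, __be16 sport,
                           __be32 daddr, __be16 dport)
{
    /* Fold the two ports into one 32-bit word, then mix all three words */
    u32 ports = ((u32)ntohs(sport) << 16) | ntohs(dport);

    return jhash_3words((__force u32)saddr, (__force u32)daddr, ports,
                        0 /* the real code uses a per-boot secret here */);
}

/* sk = ehash[four_tuple_hash(saddr, sport, daddr, dport) & (ehash_size - 1)]; */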

Processing Data on an Established Connection

For an established connection, tcp_v4_do_rcv() feeds the packet into TCP's state machine:

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);
    u32 seq = ntohl(th->seq);

    /* Fast path: is this the next expected segment? */
    if (seq == tp->rcv_nxt && tcp_header_ok(th) && !th->syn && !th->fin) {
        /* Process inline without queuing */
        return tcp_rcv_established_fastpath(sk, skb);
    }

    /* Slow path: out-of-order, has special flags, or needs more validation */
    return tcp_rcv_established(sk, skb);
}

The fast path handles the common case: an in-order data segment on an established connection. It bypasses a lot of checks and gets data to the socket receive queue as quickly as possible.

The TCP Fast Path: tcp_rcv_established_fastpath()

int tcp_rcv_established_fastpath(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);

    /* We already know from the caller that:
     * - seq == tp->rcv_nxt (this is the exact segment we're waiting for)
     * - Header is valid (length check passed)
     * - No SYN or FIN flags (normal data segment)
     */

    /* Calculate the payload length (data only, excluding TCP header) */
    u32 len = skb->len - th->doff * 4;
    
    /* Understanding th->doff:
     *   - doff = "Data Offset": where the actual data begins
     *   - It's a 4-bit field in the TCP header
     *   - Measured in 32-bit words (1 word = 4 bytes)
     *   - So doff is the number of 32-bit words in the TCP header
     *   - Range: 5-15 (minimum 20 bytes, maximum 60 bytes)
     *   - Multiply by 4 to convert words to bytes
     *
     * Example:
     *   - Minimal TCP header (no options): doff = 5 → 5 * 4 = 20 bytes
     *   - With options (e.g., timestamps): doff = 8 → 8 * 4 = 32 bytes
     *
     * So: skb->len = 1500 bytes (total)
     *     th->doff = 5 (header is 20 bytes)
     *     len = 1500 - 20 = 1480 bytes of actual data
     */
    
    /* Update receive sequence number by the amount of data received */
    tp->rcv_nxt += len;

    /* ACK processing: update what the sender has acknowledged */
    u32 ack = ntohl(th->ack_seq);
    if (after(ack, tp->snd_una)) {
        tcp_ack_update(tp, ack);  // advance send window
    }

    /* Window update: sender is telling us how much buffer space it has */
    tp->snd_wnd = ntohs(th->window) << tp->snd_wscale;

    /* Strip the TCP header, then add the segment directly to the receive queue */
    __skb_pull(skb, th->doff * 4);
    skb_queue_tail(&sk->sk_receive_queue, skb);

    /* IMPORTANT: Check if this segment fills a gap in the OOO queue */
    if (!skb_queue_empty(&tp->out_of_order_queue)) {
        tcp_ofo_queue(sk);
    }

    /* Wake up any process waiting in recv() */
    sk->sk_data_ready(sk);

    /* Send ACK if needed (delayed ACK logic) */
    if (tcp_should_ack(tp))
        tcp_send_ack(sk);

    return 0;
}

What triggers the slow path?

  • Out-of-order segment: seq != tp->rcv_nxt means we got packet N+2 before N+1
  • Special flags: SYN, FIN, RST, URG require state machine handling
  • Zero window probe: Sender testing if our receive window has opened
  • Pure ACK: No data, just acknowledging our sent data
  • Invalid header: Length checks failed, options malformed, etc.

Here's the flow in the slow path:

int tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);

    /* Process ACK first (may free up send buffer space) */
    if (th->ack) {
        tcp_ack(sk, skb);  // handles retransmission, congestion control, etc.
    }

    /* Now handle the data portion */
    if (skb->len > 0) {
        tcp_data_queue(sk, skb);
    }

    return 0;
}

The Receive Queue and Data Queueing - tcp_data_queue() Function

The tcp_data_queue() function is called from the slow path only:

tcp_v4_rcv()
  └─→ tcp_v4_do_rcv()
       ├─→ tcp_rcv_established_fastpath()  ← Fast path: does NOT call tcp_data_queue()
       │                                      (queues directly, handles ACK inline)
       │
       └─→ tcp_rcv_established()            ← Slow path: calls tcp_data_queue()
            └─→ tcp_data_queue()

TCP maintains an ordered receive queue for each connection. Here's a more detailed view:

int tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    u32 seq = TCP_SKB_CB(skb)->seq;
    u32 end_seq = TCP_SKB_CB(skb)->end_seq;

    /* Is this the segment we're waiting for? */
    if (seq == tp->rcv_nxt) {
        /* In-order: add directly to receive queue */
        tp->rcv_nxt = end_seq;
        skb_queue_tail(&sk->sk_receive_queue, skb);

        /* Check if any out-of-order segments can now be processed */
        tcp_ofo_queue(sk);

        /* Wake up any process blocked on read() */
        sk->sk_data_ready(sk);
    }
    else if (after(seq, tp->rcv_nxt)) {
        /* Out-of-order: stash in the out-of-order queue */
        tcp_ofo_insert(sk, skb);
    }
    else {
        /* Duplicate or old segment - drop it */
        kfree_skb(skb);
    }

    return 0;
}

What is tcp_ofo_queue()?

The out-of-order (OOO) queue is a holding area for segments that arrive before we're ready to process them. When a segment fills a gap, tcp_ofo_queue() moves any newly-contiguous segments from the OOO queue to the receive queue.

Example scenario:

Let's say we're expecting sequence numbers 1000, 2000, 3000, 4000...

1. Receive seq 1000 (expected)
   → Goes directly to sk_receive_queue
   → tp->rcv_nxt = 2000

2. Receive seq 4000 (out of order!)
   → Goes to OOO queue
   → Still waiting for 2000

3. Receive seq 3000 (still out of order)
   → Goes to OOO queue
   → Still waiting for 2000

   OOO queue now: [3000-4000], [4000-5000]

4. Receive seq 2000 (the missing piece!)
   → Goes to sk_receive_queue
   → tp->rcv_nxt = 3000
   → Call tcp_ofo_queue() ← This is where the magic happens

What tcp_ofo_queue() does:

void tcp_ofo_queue(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb, *tmp;

    /* Walk through the out-of-order queue looking for segments
     * that are now contiguous with rcv_nxt */
    skb_queue_walk_safe(&tp->out_of_order_queue, skb, tmp) {
        u32 seq = TCP_SKB_CB(skb)->seq;
        u32 end_seq = TCP_SKB_CB(skb)->end_seq;

        /* Is this segment now the next expected one? */
        if (seq == tp->rcv_nxt) {
            /* Yes! Move it to the receive queue */
            __skb_unlink(skb, &tp->out_of_order_queue);
            skb_queue_tail(&sk->sk_receive_queue, skb);

            /* Advance our expected sequence number */
            tp->rcv_nxt = end_seq;

            /* Continue checking - maybe more segments are now ready */
            continue;
        }

        /* Is this segment still in the future? */
        if (after(seq, tp->rcv_nxt)) {
            /* Yes, there's still a gap. Stop here - we maintain order */
            break;
        }

        /* This segment is old/duplicate - shouldn't happen, but be safe */
        __skb_unlink(skb, &tp->out_of_order_queue);
        kfree_skb(skb);
    }

    /* If we moved segments, wake up the reader */
    if (!skb_queue_empty(&sk->sk_receive_queue))
        sk->sk_data_ready(sk);
}

Continuing our example:

After receiving seq 2000:

Before tcp_ofo_queue():
  sk_receive_queue: [1000-2000], [2000-3000]
  out_of_order_queue: [3000-4000], [4000-5000]
  tp->rcv_nxt = 3000

After tcp_ofo_queue():
  sk_receive_queue: [1000-2000], [2000-3000], [3000-4000], [4000-5000]
  out_of_order_queue: [empty]
  tp->rcv_nxt = 5000

Both OOO segments became contiguous once the gap at 2000 was filled, so they moved to the receive queue in one sweep.


10. UDP: The Simple Case

For UDP, things are much simpler. There's no connection state, no ordering, no reassembly at the transport layer:

int udp_rcv(struct sk_buff *skb)
{
    struct udphdr *uh = udp_hdr(skb);
    struct sock *sk;

    /* Validate UDP header and checksum */
    if (skb->len < sizeof(struct udphdr))
        goto drop;

    if (udp_checksum_invalid(skb))
        goto drop;

    /* Look up socket by destination port */
    sk = udp_lookup(skb, uh->dest);

    if (!sk)
        goto no_socket;

    /* Queue directly to socket */
    return udp_queue_rcv_skb(sk, skb);

no_socket:
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
    
drop:
    kfree_skb(skb);
    return 0;
}

int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    /* Check if socket receive buffer is full */
    if (sk_rmem_alloc_get(sk) + skb->truesize > sk->sk_rcvbuf) {
        kfree_skb(skb);
        return -ENOMEM;  /* Drop if buffer is full */
    }

    /* Add to socket's receive queue */
    skb_queue_tail(&sk->sk_receive_queue, skb);

    /* Wake up any process blocked on read() */
    sk->sk_data_ready(sk);

    return 0;
}

No fancy state machine, no reordering: just validate, find the socket, and queue it. If the socket's receive buffer is full, the packet is dropped. UDP makes no delivery guarantees, and this is where that shows up.
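
From userspace, the usual mitigation is simply to give the socket a bigger receive buffer (bounded by the net.core.rmem_max sysctl). A minimal example using the standard sockets API:

#include <sys/socket.h>

/* Ask the kernel for a larger receive buffer on a UDP socket; the kernel
 * doubles the value internally to account for bookkeeping overhead and
 * caps it at net.core.rmem_max */
int enlarge_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}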


11. The Socket Layer: Where Kernel Meets Userspace

At this point, the packet data is sitting in a socket's receive queue. But how does the application actually get it?

When your application calls recv() or read() on a socket, here's what happens:

int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int copied = 0;
    
    lock_sock(sk);

    while (copied < len) {
        struct sk_buff *skb;

        /* Get the next segment from the receive queue */
        skb = skb_peek(&sk->sk_receive_queue);

        if (!skb) {
            /* Queue is empty */
            if (copied > 0)
                break;  /* Return what we have */

            if (flags & MSG_DONTWAIT) {
                copied = -EAGAIN;  /* Non-blocking, nothing available */
                break;
            }

            /* Block until data arrives */
            sk_wait_data(sk);
            continue;
        }

        /* Copy data from skb to user buffer */
        int chunk = min(skb->len, len - copied);
        if (copy_to_user(msg->msg_iov, skb->data, chunk)) {
            if (!copied)
                copied = -EFAULT;
            break;  /* don't return with the socket lock held */
        }

        copied += chunk;

        /* Consume the data from the skb */
        skb_pull(skb, chunk);
        if (skb->len == 0) {
            skb_unlink(skb, &sk->sk_receive_queue);
            kfree_skb(skb);
        }

        /* Update TCP receive window */
        tcp_rcv_space_adjust(sk);
    }

    release_sock(sk);
    return copied;
}

This is where the kernel-to-userspace boundary crossing happens. The copy_to_user() call copies data from the kernel's sk_buff into the application's buffer. This is one of the few actual copies in the entire receive path.

Waking Up the Reader

Remember that sk->sk_data_ready(sk) call in tcp_data_queue()? That's what wakes up a process blocked in sk_wait_data(). The kernel uses wait queues to efficiently sleep processes until data is available, avoiding busy-waiting.

Non-blocking and Multiplexed I/O

Applications using select(), poll(), or epoll() don't block inside recv(). Instead, they register interest in multiple file descriptors and block waiting for any of them to become readable. When data arrives, the socket's sk_data_ready callback notifies the epoll wait queue, which wakes up the application.
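
A minimal userspace sketch of that pattern, using the standard epoll API (error handling omitted):

#include <sys/epoll.h>
#include <sys/socket.h>

void wait_then_read(int sock_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };

    /* Register interest in "socket became readable" */
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

    struct epoll_event ready;
    /* Sleeps until sk_data_ready() notifies the epoll wait queue */
    if (epoll_wait(epfd, &ready, 1, -1) == 1) {
        char buf[4096];
        recv(ready.data.fd, buf, sizeof(buf), MSG_DONTWAIT);
    }
}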


12. Putting It All Together

Let's trace a complete packet journey from wire to application:

[NIC Hardware]                          [Hardware]
    │ (packet arrives)
    ├── CRC validation
    ├── MAC filter check
    └── DMA write to ring buffer
            │
            ▼
        IRQ raised
            │
            ▼
    IRQ handler                         [Hard IRQ Context]
        ├── disable NIC IRQ
        ├── schedule NAPI
        └── return
                │
                ▼
            ═══════════════════════════════════════════════════
            ║      SOFTIRQ CONTEXT (ksoftirqd / NET_RX)        ║
            ═══════════════════════════════════════════════════
                │
                ▼
        napi_poll()                     [Driver - NAPI]
            ├── read RX descriptors
            ├── allocate sk_buff
            ├── DMA sync (zero-copy)
            └── for each packet:
                    │
                    ▼
                eth_type_trans()        [L2 - Ethernet]
                    ├── parse Ethernet header
                    ├── set skb->protocol
                    └── skb_pull() to strip Ethernet header
                            │
                            ▼
                        GRO (Generic Receive Offload)
                            ├── coalesce related packets
                            └── batch processing
                                    │
                                    ▼
                                netif_receive_skb()
                                    │
                                    ├─────────────────────────────┐
                                    │                             │
                                    ▼                             ▼
                                ip_rcv()            AF_PACKET socket (tcpdump/Wireshark)
                                    │                   gets copy here
                                    │
                        [IP Layer - L3]
                                    │
                    ├── ip_rcv_core()
                    │   ├── validate header length
                    │   ├── validate checksum
                    │   └── validate TTL
                    │
                    ├── netfilter PREROUTING (iptables)
                    │
                    ├── ip_rcv_finish()
                    │   └── routing decision
                    │       ├── forward? → ip_forward()
                    │       └── local? → continue
                    │
                    └── ip_local_deliver()
                            ├── reassemble fragments
                            ├── netfilter LOCAL_IN
                            └── dispatch by protocol:
                                    │
                                    ▼
                                tcp_v4_rcv()        [TCP Layer - L4]
                                    │
                                    ├── checksum verification
                                    ├── socket lookup (4-tuple hash)
                                    │   └── (src_ip, src_port, dst_ip, dst_port)
                                    │
                                    ├── tcp_v4_do_rcv()
                                    │   │
                                    │   ├── TCP state machine
                                    │   │   ├── process ACKs
                                    │   │   ├── update congestion window
                                    │   │   ├── handle retransmits
                                    │   │   └── process flags (SYN/FIN/RST)
                                    │   │
                                    │   └── tcp_rcv_established()
                                    │           │
                                    │           ├── validate sequence numbers
                                    │           ├── update receive window
                                    │           └── tcp_data_queue()
                                    │                   ├── add to sk_receive_queue
                                    │                   └── trim TCP header (skb_pull)
                                    │
                                    └── sk_data_ready()
                                            └── wake_up_interruptible(sk->sk_wq)
            ═══════════════════════════════════════════════════
            ║           END OF SOFTIRQ CONTEXT                 ║
            ═══════════════════════════════════════════════════
                                │
                                ▼
                        [Process wakes up]          [Process Context]
                                │
                                ▼
recv(fd, buf, len)              [User Space]
    │ (or read/recvmsg)
    ▼ syscall
sys_recvmsg()                   [Kernel]
    │
    ▼
sock_recvmsg()                  [Socket Layer]
    │
    ▼
tcp_recvmsg()                   [TCP]
    ├── lock_sock()
    ├── check sk_receive_queue
    ├── copy_to_user() from sk_buff
    ├── update rcv_nxt
    ├── send ACK (if needed)
    └── release_sock()
            │
            ▼
        return to userspace
            │
            ▼
    [Application processes data]

The Zero-Copy Theme

Throughout this journey, we've seen a consistent pattern: the kernel avoids copying data whenever possible. The same memory buffer that the NIC DMA'd into is referenced all the way up to the socket layer. The only copy happens at the very end, when we have to move data from kernel space to user space.

This is why modern network stacks can achieve such high throughput. At 100 Gbps, you simply can't afford to copy every byte multiple times. The entire architecture is designed around passing pointers and metadata, not shuffling bytes around.

Where Time Goes

If you're debugging network performance, here's where to look:

  1. NAPI budget exhaustion: If your system consistently hits the NAPI budget, you're processing packets faster than softirq time allows. Consider RPS/RFS or multiple queues (see the sketch after this list for a quick way to spot budget squeezes).

  2. Socket buffer pressure: When sk_rcvbuf fills up, packets get dropped at the socket layer. Your application isn't reading fast enough.

  3. Netfilter overhead: Complex iptables rules run on every packet in the hot path. This adds up.

  4. Lock contention: The socket lock (lock_sock) is held during receive processing. Multiple threads reading from the same socket will serialize.

  5. Copy to userspace: This is often the bottleneck. Technologies like io_uring and AF_XDP try to eliminate or reduce this copy.
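
For point 1, the quickest signal is the time_squeeze counter in /proc/net/softnet_stat (one row per CPU, values in hex; the third column is the squeeze count on current kernels, but double-check against your kernel's documentation). A small userspace reader:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    char line[512];
    int cpu = 0;

    while (f && fgets(line, sizeof(line), f)) {
        unsigned int processed, dropped, squeezed;

        /* Each row is a CPU; fields are hexadecimal counters */
        if (sscanf(line, "%x %x %x", &processed, &dropped, &squeezed) == 3)
            printf("cpu%d: processed=%u dropped=%u time_squeeze=%u\n",
                   cpu++, processed, dropped, squeezed);
    }
    if (f)
        fclose(f);
    return 0;
}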


Summary

We've traced a packet's complete journey from the network wire to your application's buffer:

  1. NIC validates the frame and DMAs it into pre-allocated buffers
  2. NAPI efficiently polls the driver in batches
  3. Driver builds sk_buff structures with zero copying
  4. L2 processing parses the Ethernet header
  5. IP layer validates, routes, and reassembles fragments
  6. TCP/UDP finds the socket and queues the data
  7. Socket layer wakes up the application and copies data to userspace

Understanding this path helps you reason about where bottlenecks might occur, why certain syscalls behave the way they do, and how technologies like DPDK, XDP, and io_uring manage to go even faster by bypassing parts of this stack.