Dissecting a 1-Day Vulnerability in Linux’s XFRM Subsystem

An exploration of the Linux XFRM subsystem, including patch analysis and vulnerability insights for CVE-2025-39965 (recently submitted as a kernelCTF entry).

Posted Oct 16, 2025 Updated Oct 31, 2025

By StreyPaws

In this blog, I’ll present my research and analysis of CVE-2025-39965, a Use-After-Free in the Linux XFRM subsystem, covering the patch analysis, the vulnerability analysis, and technical insights into my process of triggering the bug, along with some PoC code. The bug was reported by syzbot and later appeared as a 1-day kernelCTF entry.

DISCLAIMER: All content provided is for educational and research purposes only. All testing was conducted exclusively on a Linux kernel emulator, in a safe, isolated environment. No production systems or devices owned by others were involved or affected during this research. The author assumes no responsibility for any misuse of the information presented or for any damages resulting from its application.

Overview

CVE-2025-39965 is a Use-After-Free vulnerability in the Linux XFRM subsystem. Specifically, the issue arises in xfrm_alloc_spi, which should never use 0 as an SPI value but does so in the vulnerable version. Because state deletion does not remove an SA with SPI 0 from the byspi list, the freed state is left behind as a dangling pointer, resulting in a UaF. This flaw can lead to kernel instability, crashes, or unpredictable behavior, and in certain scenarios may even be escalated into privilege escalation on the target system (as seen in the kernelCTF submission).

We’ll begin with an exploration of the XFRM Subsystem and some internals around the vulnerable code, followed by a detailed patch and vulnerability analysis. Finally, we’ll walk through how this bug could be safely and reproducibly triggered in a Linux Kernel Emulator for demonstration purposes.

XFRM Internals

The XFRM subsystem is large and complicated enough to deserve its own deep-dive. It’s not possible to cover every detail here, but I’ll highlight the key mechanisms needed to understand the vulnerability. Let’s dive in.

Introduction

The XFRM subsystem is the Linux kernel’s implementation of IPsec (IP Security) and related security transformation protocols. XFRM provides a framework for applying cryptographic transformations to network packets, enforcing security policies, and managing security associations. The subsystem sits between the network layer (IPv4/IPv6) and the device layer, transparently encrypting outbound traffic and decrypting inbound traffic according to configured policies.

Two components are critical here: xfrm_state.c, which manages the Security Association (SA) database, and xfrm_user.c, which provides the Netlink interface for userspace communication. We’ll focus on these two areas in depth since this is where the bug manifests.

SA Database Management

The xfrm_state.c file is the core database layer for managing Security Associations (SAs) in the Linux kernel’s IPsec implementation. Think of SAs as “recipes” that tell the kernel exactly how to encrypt, decrypt, or authenticate network packets. This file handles storing these recipes, finding them quickly when needed, and creating new ones on demand.

  
static struct xfrm_state *__xfrm_state_lookup(const struct xfrm_hash_state_ptrs *state_ptrs,
					      u32 mark,
					      const xfrm_address_t *daddr,
					      __be32 spi, u8 proto,
					      unsigned short family)
{
	...

	hlist_for_each_entry_rcu(x, state_ptrs->byspi + h, byspi) {
		if (x->props.family != family ||
		    x->id.spi       != spi ||
		    x->id.proto     != proto ||
		    !xfrm_addr_equal(&x->id.daddr, daddr, family))
			continue;

		...
	}
}

The code organizes SAs using hash tables, which work like an index in a book - they let you find things quickly without searching through everything. When a packet arrives, the kernel needs to find the right SA to process it. The __xfrm_state_lookup function does this by searching using four key pieces of information: the SPI (a unique identifier), destination address, protocol type (like ESP or AH), and address family (IPv4 or IPv6). The hash table converts these values into an index that points directly to the right location, making lookups very fast even with thousands of SAs.
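To make the bucket lookup concrete, here is a minimal userspace sketch of the same idea. It is not kernel code: struct sa, struct sa_table, toy_spi_hash, sa_insert, and sa_lookup are hypothetical names, the hash is a toy stand-in for the kernel's real hash, and addresses are simplified to a single IPv4 word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8u /* power of two, so (NBUCKETS - 1) works as a mask */

/* Simplified stand-in for struct xfrm_state: only the lookup keys. */
struct sa {
	uint32_t spi;     /* Security Parameter Index */
	uint32_t daddr;   /* IPv4 destination address */
	uint8_t proto;    /* e.g. 50 for ESP, 51 for AH */
	uint16_t family;  /* e.g. 2 for AF_INET */
	struct sa *next;  /* next entry in the same bucket */
};

struct sa_table {
	struct sa *byspi[NBUCKETS];
};

/* Toy hash (assumption): the kernel mixes these inputs differently. */
static unsigned int toy_spi_hash(uint32_t daddr, uint32_t spi, uint8_t proto)
{
	uint32_t h = daddr ^ spi ^ proto;

	h ^= h >> 16;
	h ^= h >> 8;
	return h & (NBUCKETS - 1);
}

/* Link an SA at the head of the bucket chosen by its keys. */
static void sa_insert(struct sa_table *t, struct sa *x)
{
	unsigned int h;

	assert(t != NULL && x != NULL);
	h = toy_spi_hash(x->daddr, x->spi, x->proto);
	x->next = t->byspi[h];
	t->byspi[h] = x;
}

/* Walk one bucket and return the entry whose four keys all match. */
static struct sa *sa_lookup(const struct sa_table *t, uint32_t daddr,
			    uint32_t spi, uint8_t proto, uint16_t family)
{
	unsigned int h = toy_spi_hash(daddr, spi, proto);
	struct sa *x;

	for (x = t->byspi[h]; x != NULL; x = x->next) {
		if (x->family != family || x->spi != spi ||
		    x->proto != proto || x->daddr != daddr)
			continue;
		return x;
	}
	return NULL;
}
```

The point to take away is that a lookup only ever touches the entries chained in one bucket, so whatever is linked into that chain gets dereferenced during the walk.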

To make things even faster for incoming packets, xfrm_input_state_lookup adds a per-CPU cache layer. This means each processor core keeps its own small list of recently-used SAs, so it can find them instantly without even touching the main hash table. This optimization is crucial because packet processing happens at very high speeds and every microsecond counts.

Sometimes the kernel needs an SA that doesn’t exist yet - for example, when you try to send traffic to a new destination that requires IPsec protection. The __find_acq_core function handles this by creating “acquisition state” entries. These are placeholder SAs that signal to userspace key management daemons (like strongSwan or racoon) that they need to negotiate the actual security parameters with the remote peer. The function first searches to see if an acquisition is already in progress, and if not, it creates a new one with the necessary addressing and protocol information.

  
void xfrm_register_km(struct xfrm_mgr *km)
{
	spin_lock_bh(&xfrm_km_lock);
	list_add_tail_rcu(&km->list, &xfrm_km_list);
	spin_unlock_bh(&xfrm_km_lock);
}
EXPORT_SYMBOL(xfrm_register_km);

void xfrm_unregister_km(struct xfrm_mgr *km)
{
	spin_lock_bh(&xfrm_km_lock);
	list_del_rcu(&km->list);
	spin_unlock_bh(&xfrm_km_lock);
	synchronize_rcu();
}
EXPORT_SYMBOL(xfrm_unregister_km);

Finally, the file provides the glue between the kernel and userspace key management software. The xfrm_register_km and xfrm_unregister_km functions let in-kernel key manager interfaces (such as the Netlink interface in xfrm_user.c) register themselves with the XFRM core. Through these interfaces, key management daemons receive notifications about SA events and respond to acquisition requests. This registration mechanism is what allows userspace programs to control IPsec policy and negotiate security parameters while the kernel handles the actual packet processing.

Netlink Interface Layer

The code present in xfrm_user.c is essentially a translator and gatekeeper between userspace programs (like ip xfrm commands or VPN daemons) and the kernel’s IPsec engine. It uses Netlink, which is Linux’s way of letting userspace programs talk to the kernel. When the module loads, xfrm_user_init sets up the Netlink socket that listens for messages from userspace. It then registers itself with the XFRM core via xfrm_register_km as we saw earlier, telling the kernel “I’m ready to handle IPsec configuration requests.”

  
static const struct xfrm_link {
	int (*doit)(struct sk_buff *, struct nlmsghdr *, struct nlattr **,
		    struct netlink_ext_ack *);
	int (*start)(struct netlink_callback *);
	int (*dump)(struct sk_buff *, struct netlink_callback *);
	int (*done)(struct netlink_callback *);
	const struct nla_policy *nla_pol;
	int nla_max;
} xfrm_dispatch[XFRM_NR_MSGTYPES] = { // [1]
	[XFRM_MSG_NEWSA       - XFRM_MSG_BASE] = { .doit = xfrm_add_sa        },
	[XFRM_MSG_DELSA       - XFRM_MSG_BASE] = { .doit = xfrm_del_sa        },
	[XFRM_MSG_GETSA       - XFRM_MSG_BASE] = { .doit = xfrm_get_sa,
						   .dump = xfrm_dump_sa,
						   .done = xfrm_dump_sa_done  },
	[XFRM_MSG_NEWPOLICY   - XFRM_MSG_BASE] = { .doit = xfrm_add_policy    },
	[XFRM_MSG_DELPOLICY   - XFRM_MSG_BASE] = { .doit = xfrm_get_policy    },
	[XFRM_MSG_GETPOLICY   - XFRM_MSG_BASE] = { .doit = xfrm_get_policy,
						   .start = xfrm_dump_policy_start,
						   .dump = xfrm_dump_policy,
						   .done = xfrm_dump_policy_done },
                           ...
                           ...
}

The core of this file is the xfrm_dispatch table [1], which acts like a switchboard: it maps Netlink message types to handler functions. When a Netlink message arrives, xfrm_user_rcv_msg processes it by looking up the appropriate handler in this table.

When a userspace tool sends a message like “create a new Security Association (SA)”, the dispatch table routes it to the appropriate handler function. For example:

XFRM_MSG_NEWSA → xfrm_add_sa (creates a new SA)
XFRM_MSG_DELSA → xfrm_del_sa (deletes an SA)
XFRM_MSG_GETPOLICY → xfrm_get_policy (retrieves policy info)

and others.

Before processing any request, the code validates that the configuration makes sense. For instance, certain flags only make sense for outbound traffic, not inbound. If you try to set XFRM_STATE_NOPMTUDISC (which controls packet fragmentation) on an inbound SA, the code rejects it with a clear error message. This prevents misconfigurations that could break IPsec tunnels.

The code converts between two different data formats. Userspace sends data in xfrm_usersa_info structures, but the kernel needs xfrm_state objects. The copy_from_user_state function copies fields like encryption keys, addresses, and lifetimes from the userspace format into the kernel format. Let’s briefly touch upon struct xfrm_state and the fields it contains.

  
struct xfrm_state {
	possible_net_t		xs_net;
	union {
		struct hlist_node	gclist;
		struct hlist_node	bydst;
	};
	struct hlist_node	bysrc;
	struct hlist_node	byspi;
	struct hlist_node	byseq;

	refcount_t		refcnt;
	spinlock_t		lock;

	struct xfrm_id		id;
	struct xfrm_selector	sel;
	struct xfrm_mark	mark;
	u32			if_id;
	u32			tfcpad;

	u32			genid;
    ...
    ...
}

The struct xfrm_state is basically the central data structure that represents a single IPsec Security Association (SA) in the kernel. It serves as the bridge between the code in xfrm_user.c (which receives SA configuration from userspace) and xfrm_state.c (which stores and manages these SAs).

The structure begins with fields that enable multi-index lookups in the hash tables. The bydst, bysrc, byspi, and byseq fields are hlist_node structures that link this SA into four different hash tables simultaneously. When __xfrm_state_insert adds an SA, it inserts it into all these tables at once for fast lookups by different criteria. The gclist field shares a union with bydst because an SA is either active (in the bydst table) or being garbage collected (in the gclist).
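The "one object, several indexes" layout can be sketched in userspace as follows. This is an illustrative model only: struct node, push, find_by_spi, and find_by_dst are hypothetical names, and the lists are plain singly linked lists instead of the kernel's hlist buckets. The container_of macro is re-implemented here in a simplified form.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive list node, standing in for struct hlist_node. */
struct node {
	struct node *next;
};

/* Recover the enclosing object from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sa {
	struct node bydst;   /* link in the "by destination" index */
	struct node byspi;   /* link in the "by SPI" index */
	unsigned int spi;
	unsigned int daddr;
};

static void push(struct node **head, struct node *n)
{
	assert(head != NULL && n != NULL);
	n->next = *head;
	*head = n;
}

/* Walk the SPI index; each node is embedded in a struct sa. */
static struct sa *find_by_spi(struct node *head, unsigned int spi)
{
	struct node *n;

	for (n = head; n != NULL; n = n->next) {
		struct sa *x = container_of(n, struct sa, byspi);

		if (x->spi == spi)
			return x;
	}
	return NULL;
}

/* Walk the destination index over the very same objects. */
static struct sa *find_by_dst(struct node *head, unsigned int daddr)
{
	struct node *n;

	for (n = head; n != NULL; n = n->next) {
		struct sa *x = container_of(n, struct sa, bydst);

		if (x->daddr == daddr)
			return x;
	}
	return NULL;
}
```

Because each index holds a pointer into the object itself, an object must be unlinked from every index it was added to before its memory is released. That bookkeeping is exactly what goes wrong later in this post.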

The id field (containing SPI, destination address, and protocol) uniquely identifies this SA for inbound packet processing. The sel (selector) field defines which traffic this SA applies to, matching source/destination addresses and ports. There are a lot of other fields present (the structure is quite big), but as you’ll see later, we’ll be focusing a lot on this struct and the fields that we discussed.

Let’s look deeper into 3 of the handler functions which are relevant to the bug and what they do.

Handler Functions

  
static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
			     struct netlink_ext_ack *extack)
{
	struct net *net = sock_net(skb->sk);
	struct nlattr *attrs[XFRMA_MAX+1];
	const struct xfrm_link *link;
	struct nlmsghdr *nlh64 = NULL;
	int type, err;

	type = nlh->nlmsg_type;
	if (type > XFRM_MSG_MAX)
		return -EINVAL;

	type -= XFRM_MSG_BASE;
	link = &xfrm_dispatch[type]; // [2]

	/* All operations require privileges, even GET */
	if (!netlink_net_capable(skb, CAP_NET_ADMIN)) // [3]
		return -EPERM;
    ...
    ...
}

The entry point for the handler functions we discussed is xfrm_user_rcv_msg, which receives the incoming Netlink message. It extracts the message type from the Netlink header, rejects anything above XFRM_MSG_MAX, subtracts XFRM_MSG_BASE to convert the type into an array index, and uses that index to look up the appropriate entry in the xfrm_dispatch table [2]. Before any handler runs, it verifies that the sender holds CAP_NET_ADMIN [3] and returns -EPERM otherwise.
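The dispatch flow can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel's implementation: the message numbers, the handlers, rcv_msg, and the has_net_admin flag are all hypothetical stand-ins, and the lower-bound check is something I added for the model (the kernel snippet above only checks the upper bound).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical message numbering, loosely mirroring XFRM_MSG_BASE/MAX. */
enum {
	MSG_BASE = 0x10,
	MSG_NEWSA = MSG_BASE,
	MSG_DELSA,
	MSG_GETSA,
	MSG_RESERVED,           /* deliberately left without a handler */
	MSG_MAX = MSG_RESERVED,
	NR_MSGTYPES = MSG_MAX + 1 - MSG_BASE
};

typedef int (*doit_fn)(const void *payload);

/* Dummy handlers; the return values only identify which one ran. */
static int add_sa(const void *payload) { (void)payload; return 1; }
static int del_sa(const void *payload) { (void)payload; return 2; }
static int get_sa(const void *payload) { (void)payload; return 3; }

static const struct link {
	doit_fn doit;
} dispatch[NR_MSGTYPES] = {
	[MSG_NEWSA - MSG_BASE] = { .doit = add_sa },
	[MSG_DELSA - MSG_BASE] = { .doit = del_sa },
	[MSG_GETSA - MSG_BASE] = { .doit = get_sa },
};

/* Bounds-check the type, index the table, check privilege, dispatch. */
static int rcv_msg(int type, int has_net_admin, const void *payload)
{
	const struct link *link;

	assert(sizeof(dispatch) / sizeof(dispatch[0]) == NR_MSGTYPES);

	if (type < MSG_BASE || type > MSG_MAX)
		return -EINVAL;
	link = &dispatch[type - MSG_BASE];

	if (!has_net_admin)
		return -EPERM;
	if (link->doit == NULL)
		return -EINVAL;
	return link->doit(payload);
}
```

Calling rcv_msg(MSG_GETSA, 0, NULL) in this model returns -EPERM, which corresponds to the reason the PoC later needs CAP_NET_ADMIN before it can reach any handler.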

The XFRM_MSG_ALLOCSPI Handler

The first handler we’ll discuss serves the XFRM_MSG_ALLOCSPI message type, which allows userspace to request that the kernel automatically assign an SPI value to an existing SA that was created without one, typically an “ACQUIRE” state SA. For XFRM_MSG_ALLOCSPI messages, the dispatch table routes to xfrm_alloc_userspi. This function is the top-level handler for SPI allocation requests.

The xfrm_alloc_userspi function extracts the xfrm_userspi_info structure from the Netlink message payload using nlmsg_data(nlh). This structure contains the SPI range (min and max values) and SA identification information (info field). The function first validates the SPI range by calling verify_spi_info, which checks that the protocol is valid (AH, ESP, or IPCOMP), that min <= max, and that IPCOMP SPIs don’t exceed 65535.
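The range checks described above can be written out as a small userspace model. This follows the prose description only and is not a copy of verify_spi_info: check_spi_range and the PROTO_* constants are names I made up for the sketch (the numeric values are the standard IP protocol numbers).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* IP protocol numbers for the three accepted protocols. */
enum {
	PROTO_ESP = 50,
	PROTO_AH = 51,
	PROTO_COMP = 108
};

/* Returns 0 if the request passes the checks described in the text. */
static int check_spi_range(uint8_t proto, uint32_t min, uint32_t max)
{
	switch (proto) {
	case PROTO_AH:
	case PROTO_ESP:
		break;
	case PROTO_COMP:
		/* IPCOMP identifiers are 16 bits wide. */
		if (max > 65535)
			return -EINVAL;
		break;
	default:
		return -EINVAL;
	}

	if (min > max)
		return -EINVAL;
	return 0;
}

/* Usage examples for the model above. */
static void spi_range_examples(void)
{
	assert(check_spi_range(PROTO_ESP, 256, 4096) == 0);
	assert(check_spi_range(PROTO_ESP, 10, 5) == -EINVAL);
	assert(check_spi_range(PROTO_COMP, 1, 70000) == -EINVAL);
	/* Nothing in these checks rejects the degenerate range [0, 0]. */
	assert(check_spi_range(PROTO_ESP, 0, 0) == 0);
}
```

Under these checks a request with min = 0 and max = 0 is considered valid, which becomes relevant once we look at how the candidate SPI is chosen.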

The function then needs to find the existing SA to which it will assign an SPI. It extracts the family and destination address (daddr) from the message, along with optional attributes like mark, if_id, and pcpu_num. If the message includes a sequence number (p->info.seq), it calls xfrm_find_acq_byseq to find the SA by sequence number. Otherwise, it calls xfrm_find_acq to search for an ACQUIRE state SA matching the protocol, mode, addresses, and other parameters. Once the SA is found, xfrm_alloc_userspi calls xfrm_alloc_spi to actually allocate and assign the SPI. This is where the core allocation logic happens.

  
int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high,
		   struct netlink_ext_ack *extack)
{
	struct net *net = xs_net(x);
	unsigned int h;
	struct xfrm_state *x0;
	int err = -ENOENT;
	u32 range = high - low + 1;
	__be32 newspi = 0;

	spin_lock_bh(&x->lock); // [4]
	if (x->km.state == XFRM_STATE_DEAD) { 
		NL_SET_ERR_MSG(extack, "Target ACQUIRE is in DEAD state"); // [5]
		goto unlock;
	}

	err = 0;
	if (x->id.spi) // [6]
		goto unlock;

	err = -ENOENT;

	for (h = 0; h < range; h++) { 
		u32 spi = (low == high) ? low : get_random_u32_inclusive(low, high); // [7]
		newspi = htonl(spi);
		spin_lock_bh(&net->xfrm.xfrm_state_lock);
		x0 = xfrm_state_lookup_spi_proto(net, newspi, x->id.proto); 
		if (!x0) {
			x->id.spi = newspi;
			h = xfrm_spi_hash(net, &x->id.daddr, newspi, x->id.proto, x->props.family);
			XFRM_STATE_INSERT(byspi, &x->byspi, net->xfrm.state_byspi + h, x->xso.type); // [8]
			spin_unlock_bh(&net->xfrm.xfrm_state_lock);
			err = 0;
			goto unlock;
		}
		xfrm_state_put(x0);
		spin_unlock_bh(&net->xfrm.xfrm_state_lock);

		if (signal_pending(current)) {
			err = -ERESTARTSYS;
			goto unlock;
		}
		if (low == high)
			break;
	}
	if (err)
		NL_SET_ERR_MSG(extack, "No SPI available in the requested range");
unlock:
	spin_unlock_bh(&x->lock);
	return err;
}
EXPORT_SYMBOL(xfrm_alloc_spi);

The function starts by acquiring the SA’s lock (spin_lock_bh(&x->lock)) [4] to prevent concurrent modifications. It first checks if the SA is in XFRM_STATE_DEAD state [5], which would indicate it’s being deleted - if so, it returns an error. Then it checks if the SA already has an SPI assigned (x->id.spi) [6] - if it does, the function returns success immediately without allocating a new one.

The range size (range = high - low + 1) bounds the number of attempts in the allocation loop. On each iteration, the function generates a candidate SPI value: if low == high, it uses that exact value; otherwise, it calls get_random_u32_inclusive(low, high) to generate a random SPI within the range [7]. We’ll see later how this logic has an edge case which leads to the bug. The candidate is then converted to network byte order with htonl(spi), and xfrm_state_lookup_spi_proto checks whether another state already uses the same SPI and protocol. If none does, the function assigns the SPI to the SA (x->id.spi = newspi) and computes the hash bucket index by calling xfrm_spi_hash, which hashes the destination address, SPI, protocol, and address family. Critically, it then directly inserts the SA into the state_byspi hash table using the XFRM_STATE_INSERT macro [8].
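One detail worth spelling out is how the candidate behaves at the boundaries. The sketch below is a userspace model with hypothetical helper names (pick_candidate, spi_is_unset); the random draw is passed in by the caller so the example stays deterministic.

```c
#include <arpa/inet.h>  /* htonl */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Chooses the candidate the way the loop does: a fixed value when the
 * range has a single element, otherwise the supplied random draw. */
static uint32_t pick_candidate(uint32_t low, uint32_t high,
			       uint32_t random_draw)
{
	assert(low <= high);
	return (low == high) ? low : random_draw;
}

/* True when a stored SPI (network byte order) reads as "not assigned". */
static bool spi_is_unset(uint32_t be_spi)
{
	return be_spi == 0;
}
```

Since htonl(0) is 0 on every platform, spi_is_unset(htonl(pick_candidate(0, 0, 12345))) is true: a range of [0, 0] deterministically produces a stored SPI that is indistinguishable from "no SPI assigned", regardless of byte order.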

After successful SPI allocation, xfrm_alloc_userspi optionally sets the SA direction from the XFRMA_SA_DIR attribute. It then creates a response message by calling xfrm_state_netlink() to serialize the SA state into a Netlink message which is subsequently returned back to userspace with nlmsg_unicast.

The XFRM_MSG_GETSA Handler

The XFRM_MSG_GETSA message type allows userspace tools to retrieve detailed information about a specific SA that’s already stored in the kernel’s database. This is a read-only operation used for monitoring and debugging IPsec tunnels. For XFRM_MSG_GETSA, the dispatch table routes to xfrm_get_sa. The xfrm_get_sa function extracts the xfrm_usersa_id structure from the Netlink message payload using nlmsg_data(nlh). This structure contains the SA’s identifying information: SPI, destination address, protocol, and address family. The function then calls xfrm_user_state_lookup to find the SA in the kernel’s database.

The xfrm_user_state_lookup helper function handles two lookup scenarios. If the requested protocol is one of the IPsec protocols (it matches IPSEC_PROTO_ANY), it calls xfrm_state_lookup to search by SPI, destination address, protocol, and family. Otherwise, it falls back to an address-based lookup, which is not relevant to this bug. xfrm_state_lookup internally calls __xfrm_state_lookup, which performs the actual search of the state_byspi hash table.

  
static struct xfrm_state *__xfrm_state_lookup(const struct xfrm_hash_state_ptrs *state_ptrs,
					      u32 mark,
					      const xfrm_address_t *daddr,
					      __be32 spi, u8 proto,
					      unsigned short family)
{
	unsigned int h = __xfrm_spi_hash(daddr, spi, proto, family, state_ptrs->hmask); // [9]
	struct xfrm_state *x;

	hlist_for_each_entry_rcu(x, state_ptrs->byspi + h, byspi) {
		if (x->props.family != family ||
		    x->id.spi       != spi ||
		    x->id.proto     != proto ||
		    !xfrm_addr_equal(&x->id.daddr, daddr, family)) // [10]
			continue;

		if ((mark & x->mark.m) != x->mark.v)
			continue;
		if (!xfrm_state_hold_rcu(x))  // [11]
			continue;
		return x;
	}

	return NULL;
}

The function starts by computing the hash bucket index using __xfrm_spi_hash, which hashes the destination address, SPI, protocol, and address family together with the hash mask to determine which bucket in the state_byspi table to search [9]. It then iterates through all SAs in that bucket using hlist_for_each_entry_rcu, which provides RCU-safe traversal of the hash list. For each SA in the bucket, it compares four fields: the address family (x->props.family), the SPI (x->id.spi), the protocol (x->id.proto), and the destination address via xfrm_addr_equal [10]. If any of these differ, or if the SA’s mark does not match, it continues to the next SA in the bucket.

Once a matching SA is found, the function attempts to increment its reference count using xfrm_state_hold_rcu to prevent the SA from being freed while in use [11]. If the reference count increment succeeds (meaning the SA isn’t being deleted), the function returns the SA pointer. If no matching SA is found after checking all entries in the bucket, it returns NULL.
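The "take a reference only if the object is still alive" pattern can be sketched with C11 atomics. This is a userspace approximation of the idea, not the kernel's refcount_t implementation; struct obj, obj_hold_unless_zero, and obj_put are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_uint refcnt;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool obj_hold_unless_zero(struct obj *o)
{
	unsigned int old;

	assert(o != NULL);
	old = atomic_load(&o->refcnt);
	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool obj_put(struct obj *o)
{
	assert(o != NULL);
	return atomic_fetch_sub(&o->refcnt, 1) == 1;
}
```

Note that this guard only protects against objects that are on their way out but whose memory is still valid. If an entry stays linked in a list after its memory has been released, merely reading the counter is already an access to freed memory.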

Finally, xfrm_get_sa calls xfrm_state_netlink to serialize the SA’s state into a Netlink message and sends the response back to the requesting process using nlmsg_unicast. This particular code path is quite important for the vulnerability as we’ll soon see.

The XFRM_MSG_DELSA Handler

The XFRM_MSG_DELSA message type allows userspace tools to remove an existing SA from the kernel’s database. For XFRM_MSG_DELSA, the dispatch table routes to xfrm_del_sa. The xfrm_del_sa function extracts the xfrm_usersa_id structure from the Netlink message payload using nlmsg_data(nlh). This structure contains the SA’s identifying information: SPI, destination address, protocol, and address family. The function then calls xfrm_user_state_lookup, just like the XFRM_MSG_GETSA handler, to find the SA in the kernel’s database.

Before deletion, xfrm_del_sa performs two critical security checks. First, it calls security_xfrm_state_delete to verify that the current security context (SELinux/LSM) permits deletion of this SA. Second, it calls xfrm_state_kern to check if the SA is kernel-created and used by tunnels - such SAs cannot be deleted by userspace.

Once validation passes, xfrm_del_sa calls xfrm_state_delete to actually remove the SA. The xfrm_state_delete function acquires the SA’s lock and calls __xfrm_state_delete to perform the actual deletion.

  
int __xfrm_state_delete(struct xfrm_state *x)
{
	struct net *net = xs_net(x);
	int err = -ESRCH;

	if (x->km.state != XFRM_STATE_DEAD) {
		x->km.state = XFRM_STATE_DEAD;  // [12]

		spin_lock(&net->xfrm.xfrm_state_lock);   
		list_del(&x->km.all);      // [13]
		hlist_del_rcu(&x->bydst);
		hlist_del_rcu(&x->bysrc);
		if (x->km.seq)
			hlist_del_rcu(&x->byseq);
		if (!hlist_unhashed(&x->state_cache))
			hlist_del_rcu(&x->state_cache);
		if (!hlist_unhashed(&x->state_cache_input))
			hlist_del_rcu(&x->state_cache_input);

		if (x->id.spi)
			hlist_del_rcu(&x->byspi);
		net->xfrm.state_num--;
		xfrm_nat_keepalive_state_updated(x);
		spin_unlock(&net->xfrm.xfrm_state_lock);

		xfrm_dev_state_delete(x);

		xfrm_state_delete_tunnel(x);

		/* All xfrm_state objects are created by xfrm_state_alloc.
		 * The xfrm_state_alloc call gives a reference, and that
		 * is what we are dropping here.
		 */
		xfrm_state_put(x); // [14]
		err = 0;
	}

	return err;
}
EXPORT_SYMBOL(__xfrm_state_delete);

The __xfrm_state_delete function is where the critical deletion logic happens. It first checks whether the SA is already dead; if so, it returns -ESRCH immediately. Otherwise, it marks the SA as XFRM_STATE_DEAD to prevent further use [12]. It then acquires the global state lock and removes the SA from multiple hash tables and lists [13]. Note that removal from byspi is guarded by if (x->id.spi), so it only happens when the SPI is non-zero. Finally, it drops a reference with xfrm_state_put [14], which may trigger garbage collection.

After successful deletion, xfrm_del_sa notifies key management daemons via km_state_notify(XFRM_MSG_DELSA) allowing IKE daemons to track SA lifecycle changes. It then logs the event using xfrm_audit_state_delete and releases the acquired SA reference. When the SA’s reference count drops to zero, __xfrm_state_destroy adds it to the GC list and schedules xfrm_state_gc_work. The worker xfrm_state_gc_task later invokes xfrm_state_gc_destroy to free the SA memory.

Now that we have some context on how a part of this subsystem works, let’s look at the patch now.

Patch Analysis

We can access the patch via this link. Its commit message is quite informative, which makes the fix easier to understand.

This patch fixes a Use-After-Free (UAF) vulnerability in the XFRM (IPsec transformation) subsystem caused by allowing SPI (Security Parameter Index) value 0 to be allocated to XFRM states. In XFRM, the value x->id.spi == 0 has a special semantic meaning: “no SPI assigned”. This is a sentinel value used throughout the codebase to indicate that a state is not yet fully initialized with an SPI. The function in focus is xfrm_alloc_spi which we saw earlier is responsible for allocating a unique SPI value for a new struct xfrm_state. It picks a random SPI between a given range (low to high), and ensures it’s not already used by another SA.

The core issue highlighted in the patch is that xfrm_alloc_spi allowed spi to be equal to 0. Bisection points to this commit as the origin of the bug, where the function started adding xfrm_state objects to the byspi list even if spi == 0. This is wrong, because those “no SPI assigned yet” states don’t belong in byspi (that list is indexed by valid SPIs).

As a result, such states are inserted into a list that will never properly clean them up. When __xfrm_state_delete later removes a state, it does not unlink entries with spi == 0 from the byspi list (because it assumes such states were never added there), so these entries remain in byspi after the state is freed. When the kernel next iterates over byspi, it can access freed memory, triggering a Use-After-Free. The patch fixes this by simply skipping any candidate spi value of 0:

  
for (h = 0; h < range; h++) {
    u32 spi = (low == high) ? low : get_random_u32_inclusive(low, high);
+   if (spi == 0)
+       goto next;
    newspi = htonl(spi);
    ...
+next:
    ...
}

This ensures that no xfrm_state is ever assigned an SPI of 0 by this function, which avoids the invalid insertion into the byspi list. This bug is a logic-level memory corruption which arises from a violation of subsystem invariants: a “special value” (0) meant to represent “uninitialized” got treated as “valid.”

Vulnerable Code Analysis

Let’s run through the entire call chain. The vulnerability begins when a userspace application requests SPI allocation by sending an XFRM_MSG_ALLOCSPI Netlink message. The handler, xfrm_alloc_userspi, receives the request along with the desired range (min/max values) and calls xfrm_alloc_spi, which generates candidate SPI values and checks for collisions. In the vulnerable code, it does NOT check if the generated SPI is 0. The code unconditionally inserts the state into the byspi hash table using XFRM_STATE_INSERT [15], even when spi == 0. This violates the subsystem invariant that states with SPI=0 should never be in the byspi list.

  
int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high,
		   struct netlink_ext_ack *extack)
{
    ...
	for (h = 0; h < range; h++) {
		u32 spi = (low == high) ? low : get_random_u32_inclusive(low, high);
		// No check whether spi is 0
		newspi = htonl(spi);

		spin_lock_bh(&net->xfrm.xfrm_state_lock);
		x0 = xfrm_state_lookup_spi_proto(net, newspi, x->id.proto);
		if (!x0) {
			x->id.spi = newspi;
			h = xfrm_spi_hash(net, &x->id.daddr, newspi, x->id.proto, x->props.family);
			XFRM_STATE_INSERT(byspi, &x->byspi, net->xfrm.state_byspi + h, x->xso.type); // [15]
			spin_unlock_bh(&net->xfrm.xfrm_state_lock);
			err = 0;
			goto unlock;
		}
        ...

When a state is deleted, the cleanup code assumes the invariant holds: states with SPI=0 are never in the byspi list. The __xfrm_state_delete function removes states from various hash tables, but the byspi removal is conditional [16]:

  
int __xfrm_state_delete(struct xfrm_state *x)
{
	struct net *net = xs_net(x);
	int err = -ESRCH;

	if (x->km.state != XFRM_STATE_DEAD) {
		x->km.state = XFRM_STATE_DEAD;

		spin_lock(&net->xfrm.xfrm_state_lock);
		list_del(&x->km.all);
		hlist_del_rcu(&x->bydst);
		hlist_del_rcu(&x->bysrc);
		if (x->km.seq)
			hlist_del_rcu(&x->byseq);
		if (!hlist_unhashed(&x->state_cache))
			hlist_del_rcu(&x->state_cache);
		if (!hlist_unhashed(&x->state_cache_input))
			hlist_del_rcu(&x->state_cache_input);

		if (x->id.spi)
			hlist_del_rcu(&x->byspi); // [16]
        ...

This code only removes from the byspi list if x->id.spi is non-zero. If a state has SPI=0 (which shouldn’t happen but does due to the bug), it stays in the byspi list even after being freed. This creates a dangling pointer.
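The interaction between the two code paths can be reproduced in a small userspace model. It is a sketch under stated assumptions: nothing is actually freed (a dead flag stands in for freed memory so the model stays well-defined), a single bucket represents byspi, and alloc_spi, delete_sa, and count_dangling are hypothetical names. The patched parameter switches between the vulnerable and the fixed behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sa {
	uint32_t spi;
	bool dead;        /* stands in for "memory has been freed" */
	struct sa *next;  /* byspi bucket link */
};

struct table {
	struct sa *byspi; /* a single bucket is enough for the model */
};

static void unlink_byspi(struct table *t, struct sa *x)
{
	struct sa **pp;

	for (pp = &t->byspi; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == x) {
			*pp = x->next;
			x->next = NULL;
			return;
		}
	}
}

/* Allocation path: returns false if the candidate is rejected. */
static bool alloc_spi(struct table *t, struct sa *x, uint32_t candidate,
		      bool patched)
{
	assert(t != NULL && x != NULL);
	if (patched && candidate == 0)
		return false;   /* the fix: never use 0 as an SPI */

	x->spi = candidate;
	x->next = t->byspi;     /* inserted regardless of the SPI value */
	t->byspi = x;
	return true;
}

/* Deletion path: byspi removal is skipped when spi == 0. */
static void delete_sa(struct table *t, struct sa *x)
{
	assert(t != NULL && x != NULL);
	if (x->spi)
		unlink_byspi(t, x);
	x->dead = true;         /* the kernel eventually frees the object */
}

/* Counts entries still reachable from byspi after being "freed". */
static unsigned int count_dangling(const struct table *t)
{
	const struct sa *x;
	unsigned int n = 0;

	for (x = t->byspi; x != NULL; x = x->next)
		if (x->dead)
			n++;
	return n;
}
```

With patched set to false, allocating SPI 0 and then deleting the state leaves one dead entry reachable from byspi. With patched set to true, the same sequence leaves none, because the state was never linked in the first place.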

The UAF manifests when any code iterates through the byspi hash table. For example, __xfrm_state_lookup walks the bucket selected by the hash, while xfrm_state_lookup_spi_proto walks every byspi bucket:

  
static struct xfrm_state *__xfrm_state_lookup(const struct xfrm_hash_state_ptrs *state_ptrs,
					      u32 mark,
					      const xfrm_address_t *daddr,
					      __be32 spi, u8 proto,
					      unsigned short family)
{
	...
	hlist_for_each_entry_rcu(x, state_ptrs->byspi + h, byspi) { // [17]
		if (x->props.family != family ||
		    x->id.spi       != spi ||
		    x->id.proto     != proto ||
		    !xfrm_addr_equal(&x->id.daddr, daddr, family))
			continue;

		if ((mark & x->mark.m) != x->mark.v)
			continue;
		if (!xfrm_state_hold_rcu(x))
			continue;
        ...
}
...
static struct xfrm_state *xfrm_state_lookup_spi_proto(struct net *net, __be32 spi, u8 proto)
{
	...
	for (i = 0; i <= net->xfrm.state_hmask; i++) {
		hlist_for_each_entry_rcu(x, &net->xfrm.state_byspi[i], byspi) { // [17]
			if (x->id.spi == spi && x->id.proto == proto) {
				if (!xfrm_state_hold_rcu(x))
					continue;
				rcu_read_unlock();
				return x;
			}
		}
	}
	...
}

The hlist_for_each_entry_rcu macro traverses the linked list. When it encounters the dangling entry (freed state with SPI=0 still in the list), it dereferences freed memory while checking fields like x->id.spi and x->id.proto, triggering the UAF [17].

Here’s a neat diagram with all the steps involved -

Creating the Test Environment

To develop a testing environment for the vulnerability, I used a Linux Kernel Emulation Setup, similar to the Android Kernel Emulation Setup I created earlier with Debugging Support. We’ll be pulling commit cd8ae32e4e4652db55bce6b9c79267d8946765a9 from the Linux Kernel Repository. The fix needs to be reverted (or commented out) in net/xfrm/xfrm_state.c to reintroduce the vulnerability in the environment for testing.

  
 	for (h = 0; h < range; h++) {
 		u32 spi = (low == high) ? low : get_random_u32_inclusive(low, high);

                // comment out the fix
+		if (spi == 0)
+			goto next;
 		newspi = htonl(spi);
 
 		spin_lock_bh(&net->xfrm.xfrm_state_lock);
@@ -2598,6 +2600,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high,
 		xfrm_state_put(x0);
 		spin_unlock_bh(&net->xfrm.xfrm_state_lock);
 
+next:
 		if (signal_pending(current)) {
 			err = -ERESTARTSYS;
 			goto unlock;

We’ll now run the following commands to set up the Linux Kernel Emulation Environment.

  
mkdir linux-kernel
cd linux-kernel
wget https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-cd8ae32e4e4652db55bce6b9c79267d8946765a9.tar.gz
tar xf linux-cd8ae32e4e4652db55bce6b9c79267d8946765a9.tar.gz

The next steps are similar to the compilation steps from my earlier blog, but with two important settings. The first is the inclusion of the XFRM-related config flags to enable that subsystem. So, in addition to the flags mentioned in the compilation steps, add these too (you can add KASAN as well to detect the UaF):

  
# XFRM Flags
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_INTERFACE=y
CONFIG_XFRM_AH=y
CONFIG_XFRM_ESP=y

# KASAN Flags
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
CONFIG_KASAN=y
CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y
CONFIG_KASAN_STACK=y
CONFIG_KASAN_EXTRA_INFO=y
CONFIG_HAVE_ARCH_KFENCE=y
CONFIG_HAVE_ARCH_KMSAN=y

The second setting is related to the privileges that the PoC needs. An unprivileged user can access the XFRM interface only if the process holds CAP_NET_ADMIN in its network namespace. This is also clearly seen in the kernelCTF entry, which used a user namespace (unprivileged user namespaces effectively let an unprivileged user gain namespaced capabilities such as CAP_NET_ADMIN). For our local testing, we’ll use a statically linked setcap binary to grant the CAP_NET_ADMIN capability to our PoC binary. You can follow the steps given below:

  
# Required for PoC
sudo apt install libnl-nf-3-dev 

# Instructions taken from https://github.com/sjinks/setcap-static/blob/master/README.md
sudo apt install cmake make libcap-dev
git clone https://github.com/sjinks/setcap-static
cd setcap-static
cmake -S . -B build -DCMAKE_BUILD_TYPE=MinSizeRel
cmake --build build --config MinSizeRel

# push the build/setcap-static to Emulator
# Run below command in Emulator

# This will grant the CAP_NET_ADMIN capability to /home/user/poc
# Note that this command needs to be rerun if the poc binary is deleted or replaced
./setcap-static 'cap_net_admin+ep' /home/user/poc 

Triggering the Bug

Now that we’ve understood the vulnerability, the next step is to try crafting a trigger for it. There’s a C repro available on the syzbot report for this bug, but since it looked all Greek to me, I decided to write one on my own. I referred to the PoC for CVE-2025-38500 to get an idea of how to interact with the XFRM subsystem from userspace. Although CVE-2025-38500 lives in a different region of the code than this bug, it still gave me the basic idea required to trigger the issue at hand.

Based on our discussion, let’s follow this call path to trigger the bug:

Allocation with SPI=0: XFRM_MSG_ALLOCSPI → xfrm_alloc_userspi() → xfrm_alloc_spi() generates SPI=0 and unconditionally inserts into byspi list
Deletion: XFRM_MSG_DELSA → xfrm_del_sa() → __xfrm_state_delete() skips removing from byspi because SPI=0
Traversal: XFRM_MSG_GETSA → xfrm_get_sa() → __xfrm_state_lookup() → UAF when dereferencing freed memory in byspi list

So we’ll send XFRM_MSG_ALLOCSPI with an SPI range that contains only 0 (min=0, max=0), so that the allocation always yields SPI=0. Then we’ll delete the state via XFRM_MSG_DELSA. After that, we’ll trigger a lookup via XFRM_MSG_GETSA, or any other operation that traverses the byspi list.
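
The mismatch between insertion and deletion is easier to see in a toy userspace model. The sketch below is not kernel code: struct toy_state, toy_byspi and the toy_* functions are hypothetical stand-ins, a single list replaces the real hash table, and a freed flag replaces the actual free so that the model stays well-defined. It only mirrors the two rules from the call path above: insertion is unconditional, while removal is skipped when the SPI is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an SA; NOT the kernel's struct xfrm_state. */
struct toy_state {
	uint32_t spi;
	int freed;               /* set instead of really freeing memory */
	struct toy_state *next;  /* link in the toy "byspi" list */
};

static struct toy_state *toy_byspi;  /* head of the single toy bucket */

/* Mirrors the vulnerable allocation step: insert whatever the SPI is. */
void toy_alloc_spi(struct toy_state *x, uint32_t spi)
{
	assert(x != NULL);
	x->spi = spi;
	x->freed = 0;
	x->next = toy_byspi;
	toy_byspi = x;
}

/* Mirrors deletion: unlink only when the SPI is non-zero, then "free". */
void toy_delete(struct toy_state *x)
{
	if (x->spi != 0) {
		struct toy_state **pp = &toy_byspi;
		while (*pp != NULL && *pp != x)
			pp = &(*pp)->next;
		if (*pp == x)
			*pp = x->next;
	}
	x->freed = 1;
}

/* Count list entries that still point at "freed" states. */
int toy_count_dangling(void)
{
	int n = 0;
	for (struct toy_state *p = toy_byspi; p != NULL; p = p->next)
		n += p->freed;
	return n;
}
```

In this model, a state allocated with a non-zero SPI leaves no trace in the list after toy_delete(), whereas a state allocated with SPI 0 stays linked after it has been marked freed. That leftover entry is the analogue of the dangling pointer that a later byspi traversal walks into.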

As usual, you can find the trigger PoC for this vulnerability on my Github.

Running this PoC on a KASAN-enabled kernel instantly produces a slab-use-after-free crash trace, indicating that the dangling pointer is accessed in the __xfrm_state_lookup function as expected. Since the vulnerability is non-racy, the bug triggers every single time. The fact that the Use-After-Free can be triggered precisely from userspace with XFRM_MSG_DELSA and XFRM_MSG_GETSA calls makes this a very powerful primitive that could be used for privilege escalation. No wonder it was selected as a good candidate for a KernelCTF entry.

On a kernel without KASAN, the crash is not immediate: after running the PoC a few times, the kernel hits a hard crash reported as a NULL pointer dereference, which suggests that the lookup followed the dangling pointer into freed memory and read a NULL value from it.

What’s Next?

Triggering the bug and getting a crash is cute, but I wanted to learn more about how this primitive could be used to achieve root.

  
int __net_init xfrm_state_init(struct net *net)
{
	unsigned int sz;
	if (net_eq(net, &init_net))
		xfrm_state_cache = KMEM_CACHE(xfrm_state,
					      SLAB_HWCACHE_ALIGN | SLAB_PANIC); // [18]
    ...

When the Use-After-Free triggers, the freed object is a struct xfrm_state, which represents an IPsec Security Association (SA) in the kernel as we saw earlier. This structure is not allocated using regular kmalloc, but instead comes from a dedicated slab cache called xfrm_state_cache. The slab cache is created during XFRM subsystem initialization in xfrm_state_init [18].

The xfrm_state structure is large and contains many pointers to other kernel objects (algorithms, sockets, security contexts). With this bug, we can potentially control what data occupies the freed memory by triggering other allocations. Since a dedicated slab cache is in the picture, a cross-cache attack could be used with a suitable object.

The xfrm_state structure is validated in several places, so forging its fields is likely required to reach interesting code paths. I’m considering msg_msg or setxattr objects (as seen in this PoC) as candidate primitives, but there could be others as well. A preliminary review of the XFRM_MSG_GETSA and XFRM_MSG_UPDSA paths suggests they could be useful for leaking kernel memory or achieving controlled partial writes at chosen offsets; I’ll pursue deeper testing when I get some free time.

A test PoC demonstrating these two calls is available on my GitHub.

UPDATE(31/10/2025): @Nevsor (the KCTF entry author) has released a PoC that shows exactly how the bug was used to get root on the KCTF targets — check it out!

Conclusion

The analysis of CVE-2025-39965 demonstrates how subtle flaws in kernel subsystems such as XFRM can produce both stability and security issues. By dissecting the patch, reproducing the vulnerable behavior, and demonstrating its effects in a controlled crash, I sought to understand the full lifecycle of identifying and analyzing such bugs — including how they might be escalated to root in a responsible research context. The patch not only closes a potential exploitation path but also improves the robustness of the Linux networking stack. It’s encouraging to see bugs like this included in KernelCTF, and the exercise provided valuable lessons for my KCTF research.

Credits

Hey There! If you’ve come across any bugs or have ideas for improvements, feel free to reach out to me on X! If your suggestion proves helpful and gets implemented, I’ll gladly credit you in this dedicated Credits section. Thanks for reading!

This post is licensed under CC BY 4.0 by the author.