Performance modes: which API and Python build should I use?

PSDK exposes several execution modes. The right one depends on (1) whether you can run a free-threaded CPython build (e.g., 3.14t) with the GIL disabled, and (2) what your workload looks like — predominantly single-key reads/writes, or complex queries with builders, batches, and error handlers.

This guide is the short, user-facing decision tree. The full numbers and methodology behind every recommendation are in benchmarking.md.

TL;DR decision tree

- Single-key reads/writes, want max throughput? Use session.get() / session.put(), the fast-path API.
- Complex queries (secondary index, AEL filters, batch ops, error handlers)? Use chained builders: session.query(...).where(...).execute() and friends.
- Sync or async? If you have an existing sync codebase, use SyncClient. For new code or web servers, async is the standard.
- Free-threaded Python (e.g. 3.14t)? Yes if you need high throughput across many threads. No if you depend on C extensions that aren’t FT-safe.
- AsyncPool? Worth it mainly on free-threaded Python. On non-FT it is roughly on par with a single async client.
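The same choices can be written down as a small function, which can be handy for encoding the decision in a deployment script. This is only a sketch of the list above: recommend_mode and its parameters are hypothetical names and are not part of PSDK.

```python
def recommend_mode(free_threaded: bool, single_key: bool, existing_sync_code: bool) -> dict:
    """Hypothetical helper that mirrors the TL;DR list; not a PSDK API."""
    # Single-key work goes to the fast-path; anything richer goes to the builders.
    api = "fast-path (session.get / session.put)" if single_key else "chained builders"
    # Existing sync codebases keep SyncClient; otherwise async is the default choice.
    client = "SyncClient" if existing_sync_code else "Client (async)"
    # AsyncPool only applies to the async client, and its payoff is on free-threaded Python.
    use_async_pool = free_threaded and not existing_sync_code
    return {"api": api, "client": client, "async_pool": use_async_pool}


if __name__ == "__main__":
    print(recommend_mode(free_threaded=True, single_key=True, existing_sync_code=False))
```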

Free-threaded vs regular Python

PSDK works on both standard CPython and a free-threaded build (e.g., 3.14t). The choice matters a lot for high-throughput workloads.

	Regular CPython	Free-threaded CPython (e.g. 3.14t)
GIL	Always on. Threads serialize through one interpreter.	Off when invoked with `PYTHON_GIL=0`. Multiple threads run Python in true parallel.
Single-thread perf	Same	Same (slightly slower for some workloads due to atomic refcounts)
Multi-thread perf	Capped by GIL — usually 1.5-2× single-thread no matter how many threads	Scales near-linearly with cores for I/O-bound work
C extension support	Universal	Limited — extensions must declare `Py_mod_gil = Py_MOD_GIL_NOT_USED`
Recommended for PSDK?	If GIL-on simplicity is fine for your workload	If you want PSDK’s high-TPS modes

Setup for free-threaded mode

# Install the free-threaded build (uv or pyenv)
uv python install 3.14.5+freethreaded

# Always launch with PYTHON_GIL=0
PYTHON_GIL=0 python my_app.py

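Critical gotcha: if any imported C extension hasn’t opted into free-threading, the interpreter re-enables the GIL at import time unless the GIL was explicitly disabled with PYTHON_GIL=0 (or -X gil=0). CPython reports this with a RuntimeWarning, which is easy to miss in logs. Verify that sys._is_gil_enabled() returns False after all imports. PSDK’s dependency PAC (aerospike-async) is FT-safe; many other libraries aren’t yet.

On the free-threaded build the GIL is off by default, but setting PYTHON_GIL=0 explicitly keeps it off even when such an extension is imported. Without it, a single non-FT-safe import turns the GIL back on, which negates the entire point of using the free-threaded build. Only force the GIL off when the extensions you load are known to be safe without it.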
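The sys._is_gil_enabled() check can be turned into a startup guard that runs after all imports. The following is a minimal sketch: gil_status and require_gil_disabled are hypothetical helper names, and the guard assumes you would rather fail fast than run with the GIL on.

```python
import sys
import sysconfig
from typing import Tuple


def gil_status() -> Tuple[bool, bool]:
    """Return (is_free_threaded_build, gil_currently_enabled)."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always have a GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(checker()) if checker is not None else True
    return ft_build, gil_enabled


def require_gil_disabled() -> None:
    """Hypothetical startup guard: raise if the GIL is on. Call after all imports."""
    ft_build, gil_enabled = gil_status()
    if gil_enabled:
        reason = (
            "it was enabled by an import or by PYTHON_GIL=1"
            if ft_build
            else "this is not a free-threaded build"
        )
        raise RuntimeError(f"GIL is enabled: {reason}")


if __name__ == "__main__":
    print(gil_status())
```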

Fast-path: `session.get` / `session.put`

For single-key operations where you don’t need filters, error handlers, projections, batch semantics, secondary indexes, etc., the fast-path methods bypass the builder + stream wrapping and call PAC’s native blocking/async APIs directly with the session-cached policy.

Sync example

from aerospike_sdk import Behavior, SyncClient
from aerospike_async import Key

with SyncClient("localhost:3000") as client:
    session = client.create_session(Behavior.DEFAULT)
    k = Key("test", "users", "alice")
    session.put(k, {"name": "Alice", "age": 28})
    record = session.get(k)
    print(record.bins)

Async example

import asyncio
from aerospike_sdk import Behavior, Client
from aerospike_async import Key

async def main():
    async with Client("localhost:3000") as client:
        session = client.create_session(Behavior.DEFAULT)
        k = Key("test", "users", "alice")
        await session.put(k, {"name": "Alice", "age": 28})
        record = await session.get(k)
        print(record.bins)

asyncio.run(main())

The fast-path APIs accept an optional bins= projection for reads and an arbitrary bins dict for writes. Errors raise directly (no RecordResult wrapping).
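As an illustration of the bins= projection and of errors raising directly, here is a hedged sketch for the sync fast-path. It assumes bins= takes a list of bin names and that record.bins behaves like a dict; read_name is a hypothetical helper, and the concrete exception types are not named in this guide, so the sketch catches Exception broadly.

```python
def read_name(session, key):
    """Hypothetical helper: fetch only the "name" bin via the sync fast-path."""
    try:
        # bins= is the read projection; a list of bin names is an assumption.
        record = session.get(key, bins=["name"])
    except Exception as exc:
        # Fast-path errors raise directly instead of arriving wrapped in a RecordResult.
        print(f"read failed for {key!r}: {exc}")
        return None
    if record is None:
        # Assumption: guard in case a missing record is reported as None.
        return None
    return record.bins.get("name")
```

In real code you would narrow the except clause to the exception classes your PSDK version documents.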

When NOT to use fast-path:

- Anything that needs where(...) filters, expire_record_after_seconds, with_durable_delete, generation checks, or record_exists_action overrides: use the builder.
- Reads from a DataSet with a secondary-index query: use the builder.
- Batch reads/writes across multiple keys: use the builder or the session.batch() chain.
- RecordResult.is_ok / error introspection per record: use the builder, which yields wrapped RecordResult instances.

Chained builder API

The full-featured chainable API that mirrors the Aerospike SDK shape across languages.

import asyncio
from aerospike_sdk import Behavior, Client, DataSet, ErrorStrategy

async def main():
    async with Client("localhost:3000") as client:
        session = client.create_session(Behavior.DEFAULT)
        users = DataSet.of("test", "users")

        # Filtered query: AEL filter expression
        results = await (
            session.query(users)
            .where("$.age > %s and $.country == '%s'", 25, "US")
            .execute()
        )
        # Each item is a RecordResult; check is_ok before reading the record
        async for r in results:
            if r.is_ok:
                print(r.record.bins)

        # Write with TTL + error handler
        stream = await (
            session.upsert(users.id(1))
            .put({"name": "Alice"})
            .expire_record_after_seconds(3600)
            .execute(on_error=ErrorStrategy.IN_STREAM)
        )
        await stream.collect()

# `async with` needs a running event loop, so drive it from asyncio.run
asyncio.run(main())

Use the builder when you need filter expressions, batch operations, secondary-index queries, error handlers, TTL overrides, or generation checks. For plain single-key reads and writes, prefer the fast-path.

AsyncPool: multi-loop async on free-threaded Python only

AsyncPool runs N event loops on N OS threads with one PAC client each, so async work can use multiple CPU cores in parallel. It only helps under free-threaded Python.

import asyncio
from aerospike_sdk import AsyncPool, Behavior
from aerospike_sdk.aio.client import Client

def factory():
    # Builds a Client; the pool holds one per event loop
    return Client("localhost:3000")

async def per_loop(client, loop_idx):
    session = client.create_session(Behavior.DEFAULT)
    # ... do work, e.g. asyncio.gather of session.get/put calls ...

async def main():
    async with AsyncPool(factory, loop_count=4) as pool:
        await pool.map(per_loop, range(4))

# `async with` needs a running event loop, so drive it from asyncio.run
asyncio.run(main())
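The body of per_loop is elided above. One possible shape for it is a bounded fan-out over session.get with asyncio.gather, sketched below. fan_out_gets is a hypothetical helper, and the cap of 64 in-flight operations is borrowed from the "× 64 tasks" benchmark configurations rather than being a PSDK default.

```python
import asyncio


async def fan_out_gets(session, keys, max_in_flight=64):
    """Hypothetical helper: run session.get for every key with bounded concurrency."""
    # The semaphore caps how many gets are in flight on this event loop at once.
    semaphore = asyncio.Semaphore(max_in_flight)

    async def get_one(key):
        async with semaphore:
            return await session.get(key)

    # gather returns results in the same order as the input keys.
    return await asyncio.gather(*(get_one(key) for key in keys))
```

Inside per_loop this would be called as `await fan_out_gets(session, keys)` with whatever slice of keys that loop is responsible for.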

Scaling: at loop_count >= 4, AsyncPool automatically gives each Client its own PAC Tokio runtime (per-Client runtime isolation). This eliminates the cross-loop scheduler contention that previously capped throughput at 4 loops, so TPS scales monotonically. Measured on 8-core hardware, FT Python (with uvloop enabled by default and PAC’s drainer thread serializing call_soon_threadsafe wakeups across all pooled Clients):

Pool size (loops × tasks per loop)	TPS	p99 latency
4 × 64 tasks	~260K	2.5 ms
8 × 64 tasks	~292K	4.1 ms

The ~292K ceiling is now above the PSDK sync ct_runtime ceiling (~266K) and well past the production sync fast-path (~214K), which makes async the highest-throughput single-key Python mode on free-threaded hardware. Past 8 loops, additional loops trade p99 latency for marginal TPS; pick loop_count based on the tail-latency budget your workload tolerates.

You can override the auto-enable threshold via AsyncPool(..., per_client_runtime=True|False). Forcing it on at low loop counts may be useful on smaller hardware; forcing it off reverts to the shared global Tokio runtime path. Worker count is auto-derived as max(2, os.cpu_count() // loop_count).
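To see what the worker-count formula yields on a given machine, it can be evaluated directly. This sketch only reproduces the arithmetic max(2, os.cpu_count() // loop_count) stated above; derived_worker_count is a hypothetical name and does not call into PSDK.

```python
import os
from typing import Optional


def derived_worker_count(loop_count: int, cpu_count: Optional[int] = None) -> int:
    """Evaluate max(2, cpu_count // loop_count); cpu_count defaults to this machine."""
    if loop_count < 1:
        raise ValueError("loop_count must be at least 1")
    # os.cpu_count() can return None on unusual platforms, so fall back to 1.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(2, cores // loop_count)


if __name__ == "__main__":
    for loops in (1, 2, 4, 8):
        print(loops, derived_worker_count(loops, cpu_count=8))
```

On 8 cores the floor of 2 already applies at loop_count=4 (8 // 4 = 2) and keeps applying at loop_count=8 (8 // 8 = 1, raised to 2).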

AsyncPool on regular (GIL-on) Python is now roughly on par with single-client async after the uvloop-in-pool change — measured ~108K (pool 4×64) vs ~106K (single-loop) on FT-Python forced to GIL-on. The GIL still serializes all Python execution across pool threads, so the multi-loop architecture can’t deliver the full FT scaling, but uvloop’s per-op savings inside the pool now roughly cancel the orchestration overhead.

On regular Python it’s a wash — pick AsyncPool if it fits your code shape (you already write fan-out patterns) or a single Client + asyncio.gather if simpler. The real AsyncPool win remains free-threaded Python.

Sync vs async: when to pick which

Sync (SyncClient) is best when:
- You’re integrating into an existing sync codebase (Django views, scripts, etc.)
- Per-op latency matters more than concurrency depth
- You want the absolute lowest per-op overhead: PSDK sync fast-path is roughly at parity with PAC’s direct blocking API

Async (Client) is best when:
- You already have an asyncio event loop (FastAPI, aiohttp, etc.)
- You need to overlap I/O across many concurrent operations
- You’re willing to use uvloop for higher throughput (installed by default under free-threaded Python)

Both modes share the same Session API surface (chained builders + fast-path shortcuts), the same Behavior policy model, and the same error semantics.

Note

When you construct a SyncClient without supplying your own ClientPolicy, PSDK sets conn_pools_per_node = 8 (PAC’s default is 4). The async-tuned PAC default works well for single-loop or per-Client-runtime workloads where the event loop serializes pool access naturally, but sync wrappers drive PAC from many caller threads and see real connection-pool mutex contention at 4 — the p99 tail roughly doubles. Pass your own ClientPolicy if you need a different value (e.g. lower for memory-constrained deployments).
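To make "many caller threads" concrete, here is a minimal sketch of driving one sync session from a thread pool. It assumes a single session can be shared across worker threads, as the ct_runtime usage comment in this guide suggests. parallel_gets is a hypothetical helper, and the default of 32 threads is taken from the benchmark rows, not from a PSDK setting.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_gets(session, keys, threads=32):
    """Hypothetical helper: call session.get for each key from a pool of threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order; an exception raised by session.get
        # is re-raised here when its result is consumed.
        return list(pool.map(session.get, keys))
```

Every thread in the pool goes through the same connection pools, which is where the contention described in the note comes from.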

Performance summary table

Numbers from the Benchmarking Guide — 8-vCPU isolated client VM → 3× 8-vCPU isolated server VMs over a low-latency private network, 100K keys, 50/50 RW, 50-byte payload.

Single-key dispatch (batch size 1)

Mode	Threads / Tasks	Free-threaded TPS	Non-FT TPS
Sync fast-path (`session.get`/`put`)	32	~214K	~51K
Sync builder (`session.query(k).execute()`)	32	~149K	~32K
Async fast-path, AsyncPool 8×64	512 tasks	~292K	(FT only)
Async fast-path, AsyncPool 4×64	256 tasks	~260K	~108K
Async fast-path, single client	32 tasks	~118K	~106K
Async builder, AsyncPool 4×64	256 tasks	~182K	~61K
Async builder, single client	32 tasks	~68K	~64K
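One way to read the table is as a free-threaded speedup per mode. The short script below just divides the two TPS columns for the rows that have both; the figures are copied from the table (in thousands of TPS), and ft_speedup is a hypothetical helper name.

```python
# (free-threaded TPS, non-FT TPS) in thousands, copied from the table above.
single_key_tps = {
    "sync fast-path": (214, 51),
    "sync builder": (149, 32),
    "async fast-path, AsyncPool 4x64": (260, 108),
    "async fast-path, single client": (118, 106),
    "async builder, AsyncPool 4x64": (182, 61),
    "async builder, single client": (68, 64),
}


def ft_speedup(ft_tps: float, non_ft_tps: float) -> float:
    """Ratio of free-threaded to non-FT throughput, rounded to one decimal."""
    return round(ft_tps / non_ft_tps, 1)


speedups = {mode: ft_speedup(*tps) for mode, tps in single_key_tps.items()}

if __name__ == "__main__":
    # Largest gain first.
    for mode, ratio in sorted(speedups.items(), key=lambda item: -item[1]):
        print(f"{mode:34s} {ratio:.1f}x")
```

On these numbers the sync modes gain the most from free-threading (a bit over 4×), while single-client async barely moves, which is consistent with the single event loop being the bottleneck in that mode.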

Experimental: current_thread_runtime (ct_runtime)

SyncClient accepts a current_thread_runtime=True flag that gives each Python thread its own PAC _LocalClient (per-thread Tokio current-thread runtime). It boosts measured TPS to ~265K (sync fp) / ~187K (sync builder) on free-threaded Python — but it comes with non-trivial operational baggage:

- N× cluster-tend threads. Each per-thread runtime owns its own Cluster and runs its own cluster-tend loop. At 32 worker threads that’s 32 tend loops polling the cluster every second.
- N× connection pools. Each thread’s runtime maintains its own pool. PSDK auto-defaults conn_pools_per_node=1 when you opt in (so total per-node connections stay around N threads × 1 pool ≈ the non-ct_runtime default of 8), but if you pass your own ClientPolicy you take responsibility for the connection count.
- Incomplete _with_overrides surface. Not every PAC method routes through the ct_runtime path; some operations still hit the shared multi-thread runtime even when ct_runtime is on.
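The two "N×" items are simple multiplication, and it can help to compute them for your own thread count before opting in. The sketch below only restates that arithmetic; ct_runtime_footprint is a hypothetical helper and does not inspect a live client.

```python
def ct_runtime_footprint(worker_threads: int, conn_pools_per_node: int = 1) -> dict:
    """Hypothetical helper: per-process counts implied by one runtime per thread."""
    return {
        # One Cluster, and therefore one tend loop, per worker thread.
        "tend_loops": worker_threads,
        # Each thread's runtime keeps its own pools for every node.
        "pools_per_node": worker_threads * conn_pools_per_node,
    }


if __name__ == "__main__":
    for threads in (8, 32):
        print(threads, ct_runtime_footprint(threads))
```

With 8 worker threads and the auto-default of 1, the pool count per node equals the non-ct_runtime default of 8; at 32 worker threads it is four times that.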

Usage (opt-in):

from aerospike_sdk import Behavior, SyncClient

# Auto-default `conn_pools_per_node = 1` applies because we didn't pass a policy.
# Cluster-tend multiplication is NOT mitigated by the default — each
# worker thread that calls session.get/put will lazily create its own
# _LocalClient, each with its own tend loop.
with SyncClient("localhost:3000", current_thread_runtime=True) as client:
    session = client.create_session(Behavior.DEFAULT)
    # ... worker threads each call session.get / session.put as normal ...

Treat ct_runtime as an experimental performance lever for benchmarking and tightly-controlled deployments. The default sync path (one shared Tokio multi-thread runtime + one shared connection pool) is the recommended production setup.

With batching (`--batch-size > 1`, free-threaded)

When the workload can group keys per call, the chained-builder API amortizes its per-op overhead and surpasses every single-key number above.

Mode	Batch size	Peak TPS
Sync builder	128	~485K
AsyncPool builder, 4×64	64	~336K
Async single-loop builder, 32 tasks	128	~205K

Practical reading:

- If your workload can batch keys, the sync builder with session.batch() or multi-key session.query([keys]) is the highest-throughput mode, scaling to ~485K TPS at batch=128. Doubling the batch size keeps amortizing the per-call cost.
- For single-key workloads on free-threaded Python, AsyncPool fast-path at 4-8 loops delivers ~260-292K TPS, the highest non-experimental single-key mode, above sync fast-path (~214K). If you prefer sync, the fast-path is still the best non-experimental sync mode.
- On regular Python (GIL on), AsyncPool 4×64 (~108K) edges out single-client async (~106K) and is roughly 2× sync fast-path (~51K). Under non-FT, AsyncPool is now a slight win rather than a loss thanks to uvloop inside the pool.
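Batching starts with grouping keys into fixed-size chunks before each multi-key call. The helper below is a generic sketch of that grouping step only: chunked is a hypothetical name, the default of 128 mirrors the benchmark batch size, and the multi-key call itself is left to the builder because its exact signature is not shown in this guide.

```python
from itertools import islice


def chunked(keys, batch_size=128):
    """Yield successive lists of at most batch_size keys from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in chunked(range(300), batch_size=128):
        # Each batch would be handed to one multi-key builder call.
        print(len(batch))
```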

Why sync and async perform similarly now

The cost stacks for sync and async used to diverge sharply — async historically lost ~50% to the asyncio ↔ Tokio bridge per op. With PAC’s drainer thread (a single persistent waker thread handling all Tokio→asyncio wakeups) plus uvloop installed by default under FT, the async ceiling has closed substantially:

- Sync clients pay only the PyO3 boundary cost plus a per-op thread-handoff between caller and Tokio (~71 µs per op). PSDK fast-path adds ~3-5% on top of PAC direct, so the SDK layer is essentially free.
- Async clients pay PyO3 + asyncio event-loop scheduling. The drainer thread eliminates per-batch Python::attach churn on Tokio workers; uvloop reduces per-op loop-thread cost. With both, single-loop async tops out around 130K TPS (the asyncio loop thread is now the single-threaded bottleneck, doing per-op set_result and task wakeup).
- AsyncPool with N loops breaks past the single-loop ceiling by parallelizing the loop work across N Python threads. 4-8 loops scale to 260-292K TPS, above the production sync ceiling on the same hardware.
- The chained-builder API pays an additional Python-interpreter cost on single-key calls: per-op object allocation, validation, and stream-wrap cost. On batch calls, that cost amortizes across keys; at batch=128 the sync builder reaches ~485K TPS, much higher than any single-key cell. Use the fast-path (session.get/session.put) for single-key dispatch without filters; use the builder with batching for high-throughput bulk workloads.