Performance modes — which API and Python build should I use?

PSDK exposes several execution modes. The right one depends on (1) whether you can run a free-threaded CPython build (e.g., 3.14t) with the GIL disabled, and (2) what your workload looks like — predominantly single-key reads/writes, or complex queries with builders, batches, and error handlers.

This guide is the short, user-facing decision tree. The full numbers and methodology behind every recommendation are in benchmarking.md.

TL;DR decision tree

  1. Single-key reads/writes, want max throughput? Use session.get() / session.put() — the fast-path API.

  2. Complex queries (secondary index, AEL filters, batch ops, error handlers)? Use chained builders: session.query(...).where(...).execute() and friends.

  3. Sync or async? If you have an existing sync codebase, use SyncClient. For new code or web servers, async is the standard.

  4. Free-threaded Python (e.g. 3.14t)? Yes if you need high throughput across many threads. No if you depend on C extensions that aren’t FT-safe.

  5. AsyncPool? Only on free-threaded Python. Slower than single-client on non-FT.
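
The same tree can be written down as a small function, which is sometimes easier to review than prose. This is only a sketch of the five rules above: `Recommendation`, `recommend`, and the three boolean inputs are hypothetical names chosen for illustration and are not part of PSDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    api: str                  # "fast-path" or "builder"
    client: str               # "SyncClient" or "Client" (async)
    async_pool_allowed: bool  # AsyncPool is only worth considering on FT Python


def recommend(single_key_only: bool, existing_sync_codebase: bool, free_threaded: bool) -> Recommendation:
    # Rules 1 and 2: plain single-key reads/writes take the fast path;
    # anything needing filters, batches, or error handlers takes the builder.
    api = "fast-path" if single_key_only else "builder"
    # Rule 3: keep sync codebases on SyncClient, otherwise default to async.
    client = "SyncClient" if existing_sync_codebase else "Client"
    # Rules 4 and 5: AsyncPool only pays off for async code on a free-threaded build.
    async_pool_allowed = free_threaded and client == "Client"
    return Recommendation(api, client, async_pool_allowed)


print(recommend(single_key_only=True, existing_sync_codebase=False, free_threaded=True))
```

The function deliberately ignores the C-extension caveat in rule 4; whether your dependencies are FT-safe still has to be checked separately.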

Free-threaded vs regular Python

PSDK works on both standard CPython and a free-threaded build (e.g., 3.14t). The choice matters a lot for high-throughput workloads.

| | Regular CPython | Free-threaded CPython (e.g. 3.14t) |
|---|---|---|
| GIL | Always on. Threads serialize through one interpreter. | Off when invoked with PYTHON_GIL=0. Multiple threads run Python in true parallel. |
| Single-thread perf | Same | Same (slightly slower for some workloads due to atomic refcounts) |
| Multi-thread perf | Capped by the GIL, usually 1.5-2× single-thread no matter how many threads | Scales near-linearly with cores for I/O-bound work |
| C extension support | Universal | Limited: extensions must declare Py_mod_gil = Py_MOD_GIL_NOT_USED |
| Recommended for PSDK? | If GIL-on simplicity is fine for your workload | If you want PSDK’s high-TPS modes |

Setup for free-threaded mode

# Install the free-threaded build (uv or pyenv)
uv python install 3.14.5+freethreaded

# Always launch with PYTHON_GIL=0
PYTHON_GIL=0 python my_app.py

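Critical gotcha: if any imported C extension hasn’t opted into free-threading, the interpreter re-enables the GIL at import time with nothing more than a RuntimeWarning, unless PYTHON_GIL=0 (or -X gil=0) forces it to stay off. Verify with sys._is_gil_enabled() returning False after all imports. PSDK’s dependency PAC (aerospike-async) is FT-safe; many other libraries aren’t yet.

On the free-threaded build the GIL starts off by default, but without PYTHON_GIL=0 a single import of a non-FT-safe extension turns it back on for the whole process, which negates the entire point of using the build. Forcing the GIL off does not make such an extension thread-safe, so audit your dependencies either way.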
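
A startup check along these lines can make the gotcha fail loudly instead of silently costing throughput. It is a sketch: `gil_status` and `require_gil_disabled` are hypothetical helper names, and only `sys._is_gil_enabled()` (CPython 3.13+) and the `Py_GIL_DISABLED` build variable come from CPython itself.

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this is a free-threaded build and whether the GIL is on right now."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older interpreters always have the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(probe()) if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


def require_gil_disabled() -> None:
    """Raise if the process is not actually running without the GIL."""
    status = gil_status()
    if status["gil_enabled"]:
        raise RuntimeError(f"GIL is enabled: {status}")


# Run this after all application imports, since an import can flip the state.
print(gil_status())
```

Calling `require_gil_disabled()` at the end of your import section turns a misconfigured launch into an immediate error.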

Fast-path: session.get / session.put

For single-key operations where you don’t need filters, error handlers, projections, batch semantics, secondary indexes, etc., the fast-path methods bypass the builder + stream wrapping and call PAC’s native blocking/async APIs directly with the session-cached policy.

Sync example

from aerospike_sdk import Behavior, SyncClient
from aerospike_async import Key

with SyncClient("localhost:3000") as client:
    session = client.create_session(Behavior.DEFAULT)
    k = Key("test", "users", "alice")
    session.put(k, {"name": "Alice", "age": 28})
    record = session.get(k)
    print(record.bins)

Async example

import asyncio
from aerospike_sdk import Behavior, Client
from aerospike_async import Key

async def main():
    async with Client("localhost:3000") as client:
        session = client.create_session(Behavior.DEFAULT)
        k = Key("test", "users", "alice")
        await session.put(k, {"name": "Alice", "age": 28})
        record = await session.get(k)
        print(record.bins)

asyncio.run(main())

The fast-path APIs accept an optional bins= projection for reads and an arbitrary bins dict for writes. Errors raise directly (no RecordResult wrapping).

When NOT to use fast-path:

  • Anything that needs where(...) filters, expire_record_after_seconds, with_durable_delete, generation checks, or record_exists_action overrides — use the builder.

  • Reads from a DataSet with a secondary-index query — use the builder.

  • Batch reads/writes across multiple keys — use the builder or the session.batch() chain.

  • RecordResult.is_ok / error introspection per record — use the builder, which yields wrapped RecordResult instances.

Chained builder API

The full-featured chainable API that mirrors the Aerospike SDK shape across languages.

import asyncio

from aerospike_sdk import Behavior, Client, DataSet, ErrorStrategy

async def main():
    async with Client("localhost:3000") as client:
        session = client.create_session(Behavior.DEFAULT)
        users = DataSet.of("test", "users")

        # Filtered query (AEL filter expression)
        results = await (
            session.query(users)
            .where("$.age > %s and $.country == '%s'", 25, "US")
            .execute()
        )
        # Each item is a RecordResult; check is_ok before reading the record
        async for r in results:
            if r.is_ok:
                print(r.record.bins)

        # Write with TTL + error handler
        stream = await (
            session.upsert(users.id(1))
            .put({"name": "Alice"})
            .expire_record_after_seconds(3600)
            .execute(on_error=ErrorStrategy.IN_STREAM)
        )
        await stream.collect()

asyncio.run(main())

Use the builder when you need filter expressions, batch operations, secondary-index queries, error handlers, TTL overrides, or generation checks. For plain single-key reads and writes, prefer the fast-path.

AsyncPool — multi-loop async on free-threaded Python only

AsyncPool runs N event loops on N OS threads with one PAC client each, so async work can use multiple CPU cores in parallel. It only helps under free-threaded Python.

import asyncio

from aerospike_sdk import AsyncPool, Behavior
from aerospike_sdk.aio.client import Client

def factory():
    # Builds the Client for one event loop (the pool keeps one client per loop)
    return Client("localhost:3000")

async def per_loop(client, loop_idx):
    session = client.create_session(Behavior.DEFAULT)
    # ... do work, e.g. asyncio.gather of session.get/put calls ...

async def main():
    async with AsyncPool(factory, loop_count=4) as pool:
        await pool.map(per_loop, range(4))

asyncio.run(main())
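
To make the "N event loops on N OS threads" idea concrete, here is a stdlib-only sketch of that layout. It is not AsyncPool's implementation: `run_loops` and `per_loop_work` are hypothetical names, and the `asyncio.sleep(0)` call stands in for a real client operation.

```python
import asyncio
import threading


async def per_loop_work(loop_idx: int, n_tasks: int) -> list:
    async def task(i: int):
        await asyncio.sleep(0)  # stand-in for an awaited client call
        return (loop_idx, i)

    # All tasks for this loop run concurrently on this thread's event loop.
    return await asyncio.gather(*(task(i) for i in range(n_tasks)))


def run_loops(loop_count: int, n_tasks: int) -> list:
    results = [None] * loop_count

    def worker(idx: int) -> None:
        # asyncio.run creates a fresh event loop owned by this OS thread.
        results[idx] = asyncio.run(per_loop_work(idx, n_tasks))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(loop_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_loops(loop_count=4, n_tasks=8)
print(len(results), len(results[0]))
```

On regular CPython these threads still take turns under the GIL, which is why the layout only buys parallelism on a free-threaded build.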

Scaling: at loop_count >= 4, AsyncPool automatically gives each Client its own PAC Tokio runtime (per-Client runtime isolation). This eliminates the cross-loop scheduler contention that previously capped throughput at 4 loops, so TPS scales monotonically. Measured on 8-core hardware, FT Python:

| Pool size | TPS | p99 latency |
|---|---|---|
| 2 × 64 tasks | ~139K | 1.7 ms |
| 4 × 64 tasks | ~170K | 4.1 ms |
| 6 × 64 tasks | ~177K | 6.3 ms |
| 8 × 64 tasks | ~178K | 9.3 ms |
| 12 × 64 tasks | ~180K | 15.5 ms |

The ceiling at ~180K is Python interpreter self-time across the loops; adding more loops past 8–12 trades p99 latency for marginal TPS. Pick loop_count based on the tail-latency budget your workload tolerates.
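
One way to turn a tail-latency budget into a loop_count is to pick the measured configuration with the highest TPS whose p99 still fits. The sketch below does that using the figures from the table above; `MEASURED` and `pick_loop_count` are hypothetical names, and the numbers come from one 8-core setup, so treat the result as a starting point for your own measurements.

```python
# (loop_count, approximate TPS, p99 latency in ms), copied from the table above
MEASURED = [
    (2, 139_000, 1.7),
    (4, 170_000, 4.1),
    (6, 177_000, 6.3),
    (8, 178_000, 9.3),
    (12, 180_000, 15.5),
]


def pick_loop_count(p99_budget_ms: float):
    """Return the measured loop_count with the highest TPS whose p99 fits the budget."""
    within_budget = [row for row in MEASURED if row[2] <= p99_budget_ms]
    if not within_budget:
        return None  # no measured configuration meets the budget
    return max(within_budget, key=lambda row: row[1])[0]


print(pick_loop_count(5.0))   # 4
print(pick_loop_count(10.0))  # 8
```

With a 5 ms budget the answer is 4 loops: going to 6 loops would add only about 7K TPS while pushing p99 to 6.3 ms.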

You can override the auto-enable threshold via AsyncPool(..., per_client_runtime=True|False). Forcing it on at low loop counts may be useful on smaller hardware; forcing it off reverts to the shared global Tokio runtime path. Worker count is auto-derived as max(2, os.cpu_count() // loop_count).
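
As a worked example of the worker-count formula, the sketch below evaluates max(2, os.cpu_count() // loop_count) for a few pool sizes. `derived_worker_count` is a hypothetical helper that restates the formula, not a PSDK function, and the fallback to 1 when `os.cpu_count()` returns None is an assumption.

```python
import os


def derived_worker_count(loop_count: int, cpu_count=None) -> int:
    """Evaluate max(2, cpu_count // loop_count) as quoted in the text."""
    if loop_count < 1:
        raise ValueError("loop_count must be at least 1")
    # os.cpu_count() may return None; falling back to 1 is an assumption.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(2, cpus // loop_count)


for loops in (2, 4, 8, 12):
    print(loops, derived_worker_count(loops, cpu_count=8))
```

On an 8-core machine this gives 4 workers at 2 loops and the floor of 2 workers at 4, 8, and 12 loops, so total worker threads keep growing with loop_count once the floor is reached.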

Do not use AsyncPool on regular (GIL-on) Python. Empirically it’s 17-26% slower than a single-client async setup because:

  • The GIL still serializes all Python code across the 4 OS threads

  • The pool’s task orchestration adds Python work that has nowhere to escape to under the GIL

On regular Python, use a single Client + asyncio.gather instead.
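
The single-client pattern looks roughly like the sketch below: one event loop, many overlapping operations, and a semaphore to bound how many are in flight. `FakeSession` is a hypothetical stand-in so the example runs without a server; with PSDK you would pass a real session whose get() is awaitable. `fetch_all` and `max_in_flight` are likewise illustrative names.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session; only mimics an awaitable get(key)."""

    async def get(self, key):
        await asyncio.sleep(0)  # yield to the loop, as a network call would
        return {"key": key}


async def fetch_all(session, keys, max_in_flight: int = 32) -> list:
    gate = asyncio.Semaphore(max_in_flight)

    async def fetch_one(key):
        # At most max_in_flight reads are awaiting the session at once.
        async with gate:
            return await session.get(key)

    # gather returns results in the same order as the input keys.
    return await asyncio.gather(*(fetch_one(k) for k in keys))


keys = [f"user-{i}" for i in range(100)]
results = asyncio.run(fetch_all(FakeSession(), keys))
print(len(results))
```

The default of 32 mirrors the task count used in the single-client benchmark rows; it is a starting point to tune, not a recommendation from the library.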

Sync vs async — when to pick which

  • Sync (SyncClient) is best when:

    • You’re integrating into an existing sync codebase (Django views, scripts, etc.)

    • Per-op latency matters more than concurrency depth

    • You want the absolute lowest per-op overhead — PSDK sync fast-path is roughly at parity with PAC’s direct blocking API

  • Async (Client) is best when:

    • You already have an asyncio event loop (FastAPI, aiohttp, etc.)

    • You need to overlap I/O across many concurrent operations

    • You’re willing to use uvloop for higher throughput (default in modern asyncio + free-threaded Python setups)

Both modes share the same Session API surface (chained builders + fast-path shortcuts), the same Behavior policy model, and the same error semantics.

Note

When you construct a SyncClient without supplying your own ClientPolicy, PSDK sets conn_pools_per_node = 8 (PAC’s default is 4). The async-tuned PAC default works well for single-loop or per-Client-runtime workloads where the event loop serializes pool access naturally, but sync wrappers drive PAC from many caller threads and see real connection-pool mutex contention at 4 — the p99 tail roughly doubles. Pass your own ClientPolicy if you need a different value (e.g. lower for memory-constrained deployments).

Performance summary table

Numbers from the Benchmarking Guide — 8-vCPU isolated client VM → 8-vCPU isolated server VM over a low-latency private network, 100K keys, 50/50 RW, 50-byte payload.

Single-key dispatch (batch size 1)

| Mode | Threads / Tasks | Free-threaded TPS | Non-FT TPS |
|---|---|---|---|
| Sync fast-path (session.get/put) | 32 | ~214K | ~53K |
| Sync builder (session.query(k).execute()) | 32 | ~153K | ~32K |
| Async fast-path, single client | 32 tasks | ~113K | ~76K |
| Async fast-path, AsyncPool 4×64 | 256 tasks | ~173K | ~56K (slower than single-loop) |
| Async fast-path, AsyncPool 8×64 | 512 tasks | ~182K | (FT only) |
| Async fast-path, AsyncPool 12×64 | 768 tasks | ~180K | (FT only) |
| Async builder, single client | 32 tasks | ~63K | ~50K |
| Async builder, AsyncPool 4×64 | 256 tasks | ~147K | ~38K (slower than single-loop) |

With batching (--batch-size > 1, free-threaded)

When the workload can group keys per call, the chained-builder API amortizes its per-op overhead and surpasses every single-key number above.

| Mode | Batch size | Peak TPS |
|---|---|---|
| Sync builder | 128 | ~563K |
| AsyncPool builder, 4×64 | 64 | ~335K |
| Async single-loop builder, 32 tasks | 128 | ~230K |

Practical reading:

  • If your workload can batch keys, the sync builder with session.batch() or multi-key session.query([keys]) is the highest-throughput mode — scales monotonically to ~563K TPS at batch=128, 94% above Rust-core async direct (~290K). Doubling the batch size keeps amortizing the per-call cost.

  • For single-key workloads, the sync fast-path (~214K) is the highest mode. If you need async, AsyncPool fast-path scales monotonically through 4–12 loops to ~180K (closing most of the gap to sync on the same hardware).

  • On regular Python (GIL on), async single-client fast-path (~76K) is the simplest high-throughput mode; sync fast-path (~53K) is about 30% lower because of GIL contention across the 32 worker threads.
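
The relative figures in this section can be re-derived from the rounded table values with a one-line percentage calculation. `pct_change` is a hypothetical helper, and because the inputs are approximate TPS figures in thousands, the outputs are approximate too.

```python
def pct_change(value: float, baseline: float) -> float:
    """Percentage by which value sits above (+) or below (-) baseline."""
    return (value - baseline) / baseline * 100.0


# Sync builder at batch=128 vs Rust-core async direct
print(f"{pct_change(563, 290):+.0f}%")  # +94%
# Non-FT sync fast-path vs non-FT async single-client fast-path
print(f"{pct_change(53, 76):+.0f}%")    # -30%
# FT AsyncPool fast-path plateau vs FT sync fast-path
print(f"{pct_change(180, 214):+.0f}%")  # -16%
```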

Why sync and async perform so differently

The cost stacks for sync and async are not the same. From the benchmarking guide’s stack analysis:

  • Sync clients pay only the PyO3 boundary cost (~11%). The SDK layer on top of PAC adds ~3%. PSDK sync builder routes through PAC’s _blocking entries directly — no asyncio loop in the path.

  • Async clients pay PyO3 + asyncio event-loop scheduling + a Tokio worker bounce on each op — roughly a 59% drop vs Rust async direct. Every async op crosses Tokio ↔ asyncio twice (submit, then complete), which is the fundamental cost of bridging two async runtimes. AsyncPool recovers some of that by running multiple event loops on multiple OS threads in parallel, but only on free-threaded Python.

  • The chained-builder API pays an additional Python-interpreter cost on single-key calls: per-op object allocation, validation, and stream-wrap cost. On batch calls, that cost amortizes across keys; at batch=128 the sync builder reaches ~563K TPS, 94% above Rust-core async direct (~290K) and the highest number in the matrix. Use the fast-path (session.get/session.put) for single-key dispatch without filters; use the builder with batching for high-throughput bulk workloads.