Performance modes: which API and Python build should I use?
PSDK exposes several execution modes. The right one depends on (1) whether you can run a free-threaded CPython build (e.g., 3.14t) with the GIL disabled, and (2) what your workload looks like — predominantly single-key reads/writes, or complex queries with builders, batches, and error handlers.
This guide is the short, user-facing decision tree. The full numbers and methodology behind every recommendation are in benchmarking.md.
TL;DR decision tree
Single-key reads/writes, want max throughput? Use session.get()/session.put(), the fast-path API.
Complex queries (secondary index, AEL filters, batch ops, error handlers)? Use chained builders: session.query(...).where(...).execute() and friends.
Sync or async? If you have an existing sync codebase, use SyncClient. For new code or web servers, async is the standard.
Free-threaded Python (e.g. 3.14t)? Yes if you need high throughput across many threads. No if you depend on C extensions that aren't FT-safe.
AsyncPool? Only on free-threaded Python. Slower than single-client on non-FT.
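The decision tree above can be written down as a small function. This is only an illustrative sketch: `recommend_mode` and its argument names are hypothetical and not part of PSDK; the returned strings simply restate the recommendations in this guide.

```python
def recommend_mode(single_key: bool, existing_sync_codebase: bool, free_threaded: bool) -> dict:
    """Hypothetical helper (not a PSDK API) that restates the TL;DR decision tree."""
    # Single-key reads/writes go to the fast-path; anything richer needs the builder.
    api = "fast-path (session.get/session.put)" if single_key else "chained builder"
    # Existing sync codebases stay on SyncClient; new code defaults to async.
    client = "SyncClient" if existing_sync_codebase else "Client (async)"
    # AsyncPool is only worth considering for async code on a free-threaded build.
    async_pool_viable = free_threaded and not existing_sync_codebase
    return {"api": api, "client": client, "async_pool_viable": async_pool_viable}


if __name__ == "__main__":
    print(recommend_mode(single_key=True, existing_sync_codebase=False, free_threaded=True))
```

Treat the output as a starting point; the benchmark tables later in this guide give the numbers behind each branch.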
Free-threaded vs regular Python
PSDK works on both standard CPython and a free-threaded build (e.g., 3.14t). The choice matters a lot for high-throughput workloads.
| | Regular CPython | Free-threaded CPython (e.g. 3.14t) |
|---|---|---|
| GIL | Always on. Threads serialize through one interpreter. | Off when invoked with PYTHON_GIL=0 |
| Single-thread perf | Same | Same (slightly slower for some workloads due to atomic refcounts) |
| Multi-thread perf | Capped by GIL: usually 1.5-2× single-thread no matter how many threads | Scales near-linearly with cores for I/O-bound work |
| C extension support | Universal | Limited: extensions must declare free-threading support |
| Recommended for PSDK? | If GIL-on simplicity is fine for your workload | If you want PSDK's high-TPS modes |
Setup for free-threaded mode
# Install the free-threaded build (uv or pyenv)
uv python install 3.14.5+freethreaded
# Always launch with PYTHON_GIL=0
PYTHON_GIL=0 python my_app.py
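Critical gotcha: on a free-threaded build, importing a C extension that hasn't opted into free-threading makes the interpreter re-enable the GIL for the rest of the process. CPython prints a warning when this happens, but it is easy to miss in logs. Verify that sys._is_gil_enabled() returns False after all imports. PSDK's dependency PAC (aerospike-async) is FT-safe; many other libraries aren't yet.
Setting PYTHON_GIL=0 (or passing -X gil=0) forces the GIL to stay off even in that case. If it is unset, a single incompatible import can turn the GIL back on, which negates the entire point of using the free-threaded build.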
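The verification step can be scripted. The sketch below uses only the standard library; `gil_report` is a hypothetical helper name, and it assumes you run it after importing everything your application needs, since the GIL state can change at import time.

```python
import sys
import sysconfig


def gil_report() -> dict:
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always have the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    # Import your application's dependencies (including C extensions) before this point.
    report = gil_report()
    print(report)
    if report["free_threaded_build"] and report["gil_enabled"]:
        print("warning: free-threaded build, but the GIL is currently enabled")
```

Running this at startup, or in a CI smoke test, catches a dependency upgrade that quietly brings the GIL back.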
Fast-path: session.get / session.put
For single-key operations where you don’t need filters, error handlers, projections, batch semantics, secondary indexes, etc., the fast-path methods bypass the builder + stream wrapping and call PAC’s native blocking/async APIs directly with the session-cached policy.
Sync example
from aerospike_sdk import Behavior, SyncClient
from aerospike_async import Key

with SyncClient("localhost:3000") as client:
    session = client.create_session(Behavior.DEFAULT)
    k = Key("test", "users", "alice")
    # Fast-path write, then read back the same key.
    session.put(k, {"name": "Alice", "age": 28})
    record = session.get(k)
    print(record.bins)
Async example
import asyncio

from aerospike_sdk import Behavior, Client
from aerospike_async import Key


async def main():
    async with Client("localhost:3000") as client:
        session = client.create_session(Behavior.DEFAULT)
        k = Key("test", "users", "alice")
        # Same fast-path calls as the sync example, awaited on the event loop.
        await session.put(k, {"name": "Alice", "age": 28})
        record = await session.get(k)
        print(record.bins)


asyncio.run(main())
The fast-path APIs accept an optional bins= projection for reads and an arbitrary bins dict for writes. Errors raise directly (no RecordResult wrapping).
When NOT to use fast-path:
Anything that needs where(...) filters, expire_record_after_seconds, with_durable_delete, generation checks, or record_exists_action overrides: use the builder.
Reads from a DataSet with a secondary-index query: use the builder.
Batch reads/writes across multiple keys: use the builder or the session.batch() chain.
RecordResult.is_ok/error introspection per record: use the builder, which yields wrapped RecordResult instances.
Chained builder API
The chained builder is the full-featured, chainable API; it mirrors the shape of the Aerospike SDK across languages.
import asyncio

from aerospike_sdk import Behavior, Client, DataSet, ErrorStrategy


async def main():
    async with Client("localhost:3000") as client:
        session = client.create_session(Behavior.DEFAULT)
        users = DataSet.of("test", "users")

        # Filtered query: AEL filter expression
        results = await (
            session.query(users)
            .where("$.age > %s and $.country == '%s'", 25, "US")
            .execute()
        )
        async for r in results:
            if r.is_ok:
                print(r.record.bins)

        # Write with TTL + error handler
        stream = await (
            session.upsert(users.id(1))
            .put({"name": "Alice"})
            .expire_record_after_seconds(3600)
            .execute(on_error=ErrorStrategy.IN_STREAM)
        )
        await stream.collect()


asyncio.run(main())
Use the builder when you need filter expressions, batch operations, secondary-index queries, error handlers, TTL overrides, or generation checks. For plain single-key reads and writes, prefer the fast-path.
AsyncPool: multi-loop async on free-threaded Python only
AsyncPool runs N event loops on N OS threads with one PAC client each, so async work can use multiple CPU cores in parallel. It only helps under free-threaded Python.
import asyncio

from aerospike_sdk import AsyncPool, Behavior
from aerospike_sdk.aio.client import Client


def factory():
    # Called once per loop, so each event loop gets its own client.
    return Client("localhost:3000")


async def per_loop(client, loop_idx):
    session = client.create_session(Behavior.DEFAULT)
    # ... do work, e.g. asyncio.gather of session.get/put calls ...


async def main():
    async with AsyncPool(factory, loop_count=4) as pool:
        await pool.map(per_loop, range(4))


asyncio.run(main())
Scaling: at loop_count >= 4, AsyncPool automatically gives each Client
its own PAC Tokio runtime (per-Client runtime isolation). This eliminates the
cross-loop scheduler contention that previously capped throughput at 4 loops,
so TPS scales monotonically. Measured on 8-core hardware, FT Python:
| Pool size | TPS | p99 latency |
|---|---|---|
| 2 × 64 tasks | ~139K | 1.7 ms |
| 4 × 64 tasks | ~170K | 4.1 ms |
| 6 × 64 tasks | ~177K | 6.3 ms |
| 8 × 64 tasks | ~178K | 9.3 ms |
| 12 × 64 tasks | ~180K | 15.5 ms |
The ceiling at ~180K is Python interpreter self-time across the loops; adding
more loops past 8–12 trades p99 latency for marginal TPS. Pick loop_count
based on the tail-latency budget your workload tolerates.
You can override the auto-enable threshold via AsyncPool(..., per_client_runtime=True|False).
Forcing it on at low loop counts may be useful on smaller hardware; forcing
it off reverts to the shared global Tokio runtime path. Worker count is
auto-derived as max(2, os.cpu_count() // loop_count).
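To see what the worker-count formula yields on a given machine, it can be evaluated directly. The function below only mirrors the documented expression max(2, os.cpu_count() // loop_count); `derived_worker_count` is a hypothetical name, not a PSDK API, and the `cpu_count` parameter exists only so the example can be run for a fixed core count.

```python
import os
from typing import Optional


def derived_worker_count(loop_count: int, cpu_count: Optional[int] = None) -> int:
    """Mirror the documented formula: max(2, os.cpu_count() // loop_count)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(2, cpu_count // loop_count)


if __name__ == "__main__":
    # Worker counts for an 8-core machine at several pool sizes.
    for loops in (2, 4, 8, 12):
        print(loops, derived_worker_count(loops, cpu_count=8))
```

On 8 cores the floor of 2 applies from loop_count=4 upward, so larger pools add loops without shrinking the per-loop worker count any further.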
Do not use AsyncPool on regular (GIL-on) Python. Empirically it’s 17-26% slower than a single-client async setup because:
The GIL still serializes all Python code across the 4 OS threads.
The pool's task orchestration adds Python work that cannot run in parallel under the GIL.
On regular Python, use a single Client + asyncio.gather instead.
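A minimal sketch of the single-client fan-out pattern follows. It assumes only that the session exposes an awaitable get(key), as in the async example earlier; `read_many`, `max_in_flight`, and `StandInSession` are hypothetical names, and the stand-in class exists solely so the sketch runs without a server.

```python
import asyncio


async def read_many(session, keys, max_in_flight=32):
    """Fan out session.get() calls on one event loop, capping concurrency."""
    gate = asyncio.Semaphore(max_in_flight)

    async def read_one(key):
        async with gate:
            return await session.get(key)

    # gather returns results in the same order as the input keys.
    return await asyncio.gather(*(read_one(k) for k in keys))


class StandInSession:
    """Hypothetical stand-in for a PSDK async session (no server required)."""

    async def get(self, key):
        await asyncio.sleep(0)  # yield to the loop, as a real network read would
        return {"key": key}


if __name__ == "__main__":
    results = asyncio.run(read_many(StandInSession(), ["alice", "bob", "carol"]))
    print(results)
```

With a real client, pass the session created by client.create_session(...) in place of the stand-in. The semaphore keeps the number of in-flight operations bounded; 32 is an arbitrary default here, chosen to match the task count used in the benchmark tables.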
Sync vs async: when to pick which
Sync (SyncClient) is best when:
You're integrating into an existing sync codebase (Django views, scripts, etc.)
Per-op latency matters more than concurrency depth
You want the absolute lowest per-op overhead; PSDK sync fast-path is roughly at parity with PAC's direct blocking API
Async (Client) is best when:
You already have an asyncio event loop (FastAPI, aiohttp, etc.)
You need to overlap I/O across many concurrent operations
You're willing to use uvloop for higher throughput (default in modern asyncio + free-threaded Python setups)
Both modes share the same Session API surface (chained builders + fast-path shortcuts), the same Behavior policy model, and the same error semantics.
Note
When you construct a SyncClient without supplying your own ClientPolicy,
PSDK sets conn_pools_per_node = 8 (PAC’s default is 4). The async-tuned PAC
default works well for single-loop or per-Client-runtime workloads where the
event loop serializes pool access naturally, but sync wrappers drive PAC from
many caller threads and see real connection-pool mutex contention at 4 — the
p99 tail roughly doubles. Pass your own ClientPolicy if you need a different
value (e.g. lower for memory-constrained deployments).
Performance summary table
Numbers from the Benchmarking Guide — 8-vCPU isolated client VM → 8-vCPU isolated server VM over a low-latency private network, 100K keys, 50/50 RW, 50-byte payload.
Single-key dispatch (batch size 1)
| Mode | Threads / Tasks | Free-threaded TPS | Non-FT TPS |
|---|---|---|---|
| Sync fast-path | 32 | ~214K | ~53K |
| Sync builder | 32 | ~153K | ~32K |
| Async fast-path, single client | 32 tasks | ~113K | ~76K |
| Async fast-path, AsyncPool 4×64 | 256 tasks | ~173K | ~56K (slower than single-loop) |
| Async fast-path, AsyncPool 8×64 | 512 tasks | ~182K | (FT only) |
| Async fast-path, AsyncPool 12×64 | 768 tasks | ~180K | (FT only) |
| Async builder, single client | 32 tasks | ~63K | ~50K |
| Async builder, AsyncPool 4×64 | 256 tasks | ~147K | ~38K (slower than single-loop) |
With batching (--batch-size > 1, free-threaded)
When the workload can group keys per call, the chained-builder API amortizes its per-op overhead and surpasses every single-key number above.
| Mode | Batch size | Peak TPS |
|---|---|---|
| Sync builder | 128 | ~563K |
| AsyncPool builder, 4×64 | 64 | ~335K |
| Async single-loop builder, 32 tasks | 128 | ~230K |
Practical reading:
If your workload can batch keys, the sync builder with session.batch() or multi-key session.query([keys]) is the highest-throughput mode: it scales monotonically to ~563K TPS at batch=128, 94% above Rust-core async direct (~290K). Doubling the batch size keeps amortizing the per-call cost.
For single-key workloads, the sync fast-path (~214K) is the fastest mode. If you need async, AsyncPool fast-path scales through 4–12 loops to ~180K, closing most of the gap to sync on the same hardware.
On regular Python (GIL on), async single-client fast-path (~76K) is the simplest high-throughput mode; sync fast-path (~53K) is about 30% lower because of GIL contention across the 32 worker threads.
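The relative figures quoted in this section can be reproduced from the table values with a few lines of arithmetic. The sketch below uses the rounded TPS numbers (in thousands) from this guide; `percent_change` is a hypothetical helper name.

```python
def percent_change(value: float, baseline: float) -> float:
    """Relative difference of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # TPS figures in thousands, copied from the tables in this guide.
    print(round(percent_change(563, 290)))  # sync builder at batch=128 vs Rust-core async direct
    print(round(percent_change(53, 76)))    # non-FT sync fast-path vs async single-client fast-path
    print(round(percent_change(56, 76)))    # non-FT AsyncPool 4x64 vs async single-client fast-path
    print(round(percent_change(180, 214)))  # FT AsyncPool fast-path vs sync fast-path
```

This prints 94, -30, -26, and -16: the batched sync builder is 94% above Rust-core async direct, and on regular Python both the sync fast-path and AsyncPool trail the single-client async fast-path.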
Why sync and async perform so differently
The cost stacks for sync and async are not the same. From the benchmarking guide’s stack analysis:
Sync clients pay only the PyO3 boundary cost (~11%). The SDK layer on top of PAC adds ~3%. PSDK sync builder routes through PAC's _blocking entries directly, with no asyncio loop in the path.
Async clients pay PyO3 + asyncio event-loop scheduling + a Tokio worker bounce on each op, roughly a 59% drop vs Rust async direct. Every async op crosses Tokio ↔ asyncio twice (submit, then complete), which is the fundamental cost of bridging two async runtimes. AsyncPool recovers some of that by running multiple event loops on multiple OS threads in parallel, but only on free-threaded Python.
The chained-builder API pays an additional Python-interpreter cost on single-key calls: per-op object allocation, validation, and stream-wrap cost. On batch calls, that cost amortizes across keys; at batch=128 the sync builder reaches ~563K TPS, 94% above Rust-core async direct (~290K) and the highest single-loop number in the matrix. Use the fast-path (session.get/session.put) for single-key dispatch without filters; use the builder with batching for high-throughput bulk workloads.