Benchmarking Guide¶
This guide documents the architecture, setup, and measured TPS / latency for the Aerospike Python SDK (PSDK) and the Aerospike Python Async Client (PAC). The reference setup uses two isolated VMs on Google Cloud Platform; the same methodology works on any other cloud provider (AWS EC2, Azure VMs, etc.) or on dedicated on-prem hardware — only the VM provisioning steps would change.
Architecture¶
┌─────────────────────────┐ ┌─────────────────────────┐
│ bench-client │◄──────────── TCP :3100 ────────────►│ bench-asd × 3 nodes │
│ c3-standard-8 (8 vCPU)│ │ c3-standard-8 (8 vCPU)│
│ 32 GB RAM, 30 GB disk │ │ 32 GB RAM, 30 GB disk │
│ Ubuntu 24.04 LTS │ │ Ubuntu 24.04 LTS │
│ │ │ │
│ Python 3.14t (free- │ │ Aerospike Enterprise │
│ threaded, no GIL) │ │ 8.1.1.1 │
│ Rust 1.95.0 │ │ in-memory storage │
│ PAC, PSDK from source │ │ (4 GB, namespace test)│
└─────────────────────────┘ └─────────────────────────┘
bench-asd is a 3-node Aerospike cluster (each node a separate c3-standard-8 VM). All four VMs (1 client + 3 server nodes) run within the same VPC/subnet, giving sub-millisecond network RTT. They use c3-standard-8 machine types (Intel Sapphire Rapids, 8 vCPUs, 32 GB RAM each) to provide dedicated, non-shared compute.
Why dedicated, isolated VMs?¶
Local benchmarking on macOS via Podman / Docker Desktop hits several bottlenecks that distort results:
Userspace TCP proxy (Docker Desktop’s
gvproxy) — adds 2-5 ms per hop, capping TPS at ~15K regardless of client capability.CPU contention — co-locating
asdand the Python client on a shared VM creates resource competition that masks true scaling behavior.uvloop + free-threading — multiple uvloop instances on separate OS threads under free-threaded Python can cause silent freezes. PSDK’s
AsyncPoolexplicitly usesasyncio.SelectorEventLoopfor worker threads to avoid this.
Dedicated VMs on isolated CPU cores with direct, low-latency networking between client and server eliminate all of these issues. GCP c3-standard-8 (8 dedicated vCPUs each) on the same VPC is the reference setup. Equivalent isolation on AWS (c7i.2xlarge / dedicated tenancy / placement groups), Azure (Fsv2-series), or on-prem (two adjacent physical hosts on a quiet switch) reproduces the numbers within run-to-run noise.
Environment¶
Component |
Version |
|---|---|
GCP machine type |
|
OS |
Ubuntu 24.04 LTS, kernel 6.17.0-gcp |
Python |
3.14.5 free-threaded build (e.g. 3.14t) |
Rust |
1.95.0 |
PAC |
|
PSDK |
|
Legacy Python client |
|
Aerospike server |
Enterprise 8.1.1.1, 3-node cluster, in-memory, 4 GB per node, RF=1 |
Workload¶
All measurements use the same workload across every client:
100,000 keys seeded into
test.testset with single-bin records50/50 read/write mix (
RU,50)Single-bin payload:
{"b0": <int>}— the int is the key id (no per-op rng for bin values)Shared client across all worker threads / tasks
15 seconds measured + 3 seconds warmup (no separate cooldown)
Sampled latency: 1-in-100 ops timed → p50 / p99 / p99.9 reported
Bench RNG / key construction: as of 2026-05-25, the harness uses PAC’s
FastRng (xoshiro256++) per worker instead of CPython’s random.Random
(Mersenne Twister) — matches the JSDK RandomShift / Rust core SmallRng
methodology and removes a ~5 µs/op Python-stdlib RNG handicap that
otherwise inflated the bench-harness overhead. Keys are constructed per op
via PAC’s Key.from_int_user_key(ns, set, kid) fast-path, which skips
Python str() conversion + PythonValue enum dispatch (~2 µs/op).
Net: the bench’s per-op overhead matches JSDK/Rust core methodology
within a few hundred nanoseconds, so reported TPS reflects client
capability rather than Python stdlib cost.
Free-threaded (FT) runs use PYTHON_GIL=0. Non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1 on the same free-threaded binary — same wheel, same imports, GIL state flipped.
Running the benchmarks¶
The framework bench (python -m benchmarks.benchmark) carries all the modes for the cells in this document. Each invocation prints per-second TPS / error / timeout lines plus a final summary block.
# PSDK sync — fast-path (session.get / session.put) by default
PYTHON_GIL=0 python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode sync --threads 32 --fast-path
# Same harness, builder API (session.query / upsert chained)
PYTHON_GIL=0 python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode sync --threads 32 --no-fast-path
# PSDK async — single client, N concurrent tasks
PYTHON_GIL=0 python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode async -z 32 --fast-path
# PSDK async — AsyncPool (N loops × M tasks per loop), free-threaded only
PYTHON_GIL=0 python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode async --pool-loops 4 -z 64 --fast-path
# PAC sync direct — bypasses PSDK, calls PAC `_blocking` entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode pac-blocking --threads 32
# PAC async direct — bypasses PSDK, calls PAC async entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode pac-async -z 32
# Legacy `aerospike` C client (single-threaded — that client doesn't support
# multi-threaded fan-out; importing it on a free-threaded build also auto-
# re-enables the GIL because the C extension hasn't declared FT-safety).
python -m benchmarks.benchmark \
-H <bench-asd>:3100 --services-alternate \
-n test -s test -k 100000 -o I8 -w RU,50 \
-d 15 --warmup 3 --cooldown 0 \
--mode legacy-sync --threads 1
# Non-FT comparison: same binary, GIL forced on
PYTHON_GIL=1 ALLOW_GIL_ON=1 python -m benchmarks.benchmark ... (same args)
The Rust core (no Python) is benched via a standalone Rust binary that talks to aerospike-core directly — no PyO3, no Python interpreter at all. This gives the language-floor TPS for the same workload:
cargo build --release --manifest-path benchmarks/rust-core/Cargo.toml
MODE=async TASKS=32 DURATION=15 WARMUP=3 \
AEROSPIKE_HOST=<bench-asd>:3100 \
benchmarks/rust-core/target/release/rust-core
Every cell in the matrix below was produced by python -m benchmarks.benchmark --mode ... against bench-asd (<bench-asd>:3100), except the Rust-core rows, which use the dedicated Rust binary at benchmarks/rust-core/.
Cross-client TPS — single-key (batch size 1)¶
50/50 RW, 100K keys, 32 threads / tasks (or 4×64 for AsyncPool), 15 s measured. Free-threaded runs use PYTHON_GIL=0; non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1. The Rust core has no GIL — one number applies, shown in the FT column.
Client / Mode |
Threads / Tasks |
FT TPS |
non-FT TPS |
|---|---|---|---|
PSDK sync, fast-path ( |
32 |
214,093 |
53,203 |
PSDK sync, builder (chained API) |
32 |
153,461 |
31,960 |
PSDK async AsyncPool, fast-path |
4×64 |
172,885 |
56,000 |
PSDK async AsyncPool, fast-path |
6×64 |
177,685 |
(FT only) |
PSDK async AsyncPool, fast-path |
8×64 |
181,851 |
(FT only) |
PSDK async AsyncPool, fast-path |
12×64 |
180,325 |
(FT only) |
PSDK async AsyncPool, builder |
4×64 |
146,638 |
38,208 |
PSDK async single-loop, fast-path |
32 tasks |
112,596 |
75,830 |
PSDK async single-loop, builder |
32 tasks |
63,082 |
50,048 |
PSDK sync, fast-path |
1 |
12,954 |
14,088 |
PSDK sync, builder |
1 |
12,452 |
12,536 |
PAC sync direct ( |
32 |
220,217 |
48,899 |
PAC async direct ( |
32 tasks |
118,665 |
68,284 |
PAC sync |
1 |
12,848 |
13,466 |
PAC async |
1 task |
9,394 |
4,036 |
Rust core, async (Tokio tasks, no Python) |
32 tasks |
289,885 |
n/a (no GIL) |
Rust core, sync (OS threads + |
32 |
246,038 |
n/a (no GIL) |
Rust core, async |
1 task |
16,765 |
n/a (no GIL) |
Rust core, sync |
1 |
14,167 |
n/a (no GIL) |
Python legacy (sync, C client) |
1 |
14,724 |
15,759 |
Cross-client latency¶
p50 / p99 / p99.9 in microseconds, sampled 1-in-100 ops during measurement. Framework rows are rounded to 100 µs precision (the per-second histogram bucket size); Rust-core rows are exact.
Client / Mode |
Threads / Tasks |
FT (µs) |
non-FT (µs) |
|---|---|---|---|
PSDK sync, fast-path |
32 |
100 / 300 / 400 |
700 / 2,600 / 3,900 |
PSDK sync, builder |
32 |
200 / 400 / 700 |
1,500 / 5,200 / 5,800 |
PSDK sync, fast-path |
1 |
100 / 100 / 100 |
100 / 100 / 400 |
PSDK sync, builder |
1 |
100 / 100 / 100 |
100 / 100 / 200 |
PSDK async single-loop, fast-path |
32 tasks |
300 / 400 / 1,000 |
500 / 600 / 900 |
PSDK async single-loop, builder |
32 tasks |
500 / 600 / 900 |
1,000 / 1,200 / 1,300 |
PSDK async AsyncPool, fast-path |
4×64 |
1,400 / 4,700 / 7,400 |
5,100 / 14,100 / 14,200 |
PSDK async AsyncPool, fast-path |
6×64 |
2,000 / 6,500 / 9,900 |
(FT only) |
PSDK async AsyncPool, fast-path |
8×64 |
2,500 / 8,800 / 13,300 |
(FT only) |
PSDK async AsyncPool, fast-path |
12×64 |
3,600 / 15,800 / 27,300 |
(FT only) |
PSDK async AsyncPool, builder |
4×64 |
1,600 / 4,000 / 5,000 |
8,700 / 26,900 / 27,000 |
PAC sync |
32 |
100 / 300 / 400 |
600 / 2,800 / 3,800 |
PAC sync |
1 |
100 / 100 / 100 |
100 / 100 / 300 |
PAC async |
32 tasks |
300 / 400 / 800 |
500 / 600 / 600 |
PAC async |
1 task |
100 / 100 / 200 |
200 / 300 / 300 |
Rust core, async |
32 tasks |
106 / 184 / 223 |
n/a (no GIL) |
Rust core, sync |
32 |
127 / 184 / 273 |
n/a (no GIL) |
Rust core, async |
1 task |
59 / 75 / 119 |
n/a (no GIL) |
Rust core, sync |
1 |
69 / 86 / 116 |
n/a (no GIL) |
Python legacy (sync) |
1 |
100 / 100 / 100 |
100 / 100 / 100 |
Batch sweeps¶
The single-key cells above measure one record per execute(). Real applications often batch multiple keys per call to amortize network and per-op overhead. The sweeps below hold concurrency constant (32 threads / tasks) and vary --batch-size. Free-threaded only.
PSDK sync builder¶
session.query([keys]).execute() and session.batch().upsert(k).put(b).execute(). Routes through PAC’s batch_read_blocking / batch_operate_blocking directly — no asyncio loop in the path.
Batch size |
Total TPS |
× b=1 |
|---|---|---|
1 |
155,758 |
1.00× |
4 |
226,188 |
1.45× |
16 |
425,984 |
2.73× |
32 |
488,432 |
3.14× |
64 |
536,416 |
3.44× |
128 |
562,752 |
3.61× |
PSDK async single-loop builder¶
await session.query([keys]).execute() and friends — one event loop, 32 concurrent tasks.
Batch size |
Total TPS |
× b=1 |
|---|---|---|
1 |
62,842 |
1.00× |
4 |
70,368 |
1.12× |
16 |
156,000 |
2.48× |
32 |
166,080 |
2.64× |
64 |
203,552 |
3.24× |
128 |
230,144 |
3.66× |
PSDK async AsyncPool builder¶
Four event loops × 64 tasks per loop. Free-threaded only.
Batch size |
Total TPS |
× b=1 (pool) |
|---|---|---|
1 |
143,700 |
1.00× |
4 |
156,138 |
1.09× |
16 |
267,808 |
1.86× |
32 |
309,744 |
2.16× |
64 |
335,328 |
2.33× |
Headline: the PSDK sync builder scales monotonically through batch=128 to 563K TPS — the highest number in the entire matrix and 94% above Rust-core async direct (290K). Sync batch routes via PAC’s batch_*_blocking entries with one PyO3 boundary per batch, so doubling the batch size keeps amortizing the per-call Python cost without ceiling out. The b=128 peak is 3.6× the single-key sync builder (which itself moved from 114K → 156K with the bench-RNG / key-construction cleanups landed in 2026-05-25).
The async single-loop sweep tops out around 230K (batch=128) — the asyncio ↔ Tokio bridge cost per execute() doesn’t go away just because each call moves more data. AsyncPool recovers most of that by running 4 loops in parallel, hitting 335K at batch=64.
Stack cost analysis¶
Layering the headline single-key TPS numbers across clients shows where every transition costs:
Layer |
TPS |
Note |
|---|---|---|
Rust core async direct |
289,885 |
|
Rust core sync ( |
246,038 |
|
PAC sync direct |
220,217 |
PyO3 wrapper over |
PSDK sync, fast-path |
214,093 |
SDK |
PSDK async AsyncPool, fast-path (8×64) |
181,851 |
8 event loops × 64 tasks (FT only, with per-Client runtime) |
PSDK async AsyncPool, fast-path (4×64) |
172,885 |
4 event loops × 64 tasks (FT only) |
PSDK sync, builder |
153,461 |
SDK chained builder → execute → stream |
PSDK async AsyncPool, builder (4×64) |
146,638 |
4 loops, full builder path |
PAC async direct, 32 tasks |
118,665 |
PyO3 wrapper, asyncio ↔ Tokio bridge |
PSDK async single-loop, fast-path |
112,596 |
One event loop, |
PSDK async single-loop, builder |
63,082 |
One event loop, full builder path |
Python legacy (sync) |
14,724 |
Single-thread C client baseline |
Sync stack — boundary cost is small¶
Transition |
TPS |
Δ |
|---|---|---|
Rust core async (reference) |
289,885 |
— |
→ Rust core sync ( |
246,038 |
−15% (Rust |
→ PAC sync direct (PyO3 wrap) |
220,217 |
−11% (PyO3 + Python boundary cost) |
→ PSDK sync, fast-path |
214,093 |
−3% (PSDK SDK layer dispatch) |
→ PSDK sync, builder |
153,461 |
−28% (chained builder + stream wrap in Python) |
Sync key insight: PSDK sync fast-path is within 3% of PAC sync direct — the SDK layer is essentially free. The 28% builder tax on single-key calls is the cost of Python interpreter time on a chained-allocation pattern; the fast-path avoids it. With batching (see Batch sweeps), the same builder hits 488K TPS at batch=32 and 563K at batch=128 — higher than the Rust async single-record ceiling.
Async stack — boundary cost is much higher¶
Transition |
TPS |
Δ |
|---|---|---|
Rust core async (reference) |
289,885 |
— |
→ PAC async direct, 32 tasks |
118,665 |
−59% (asyncio ↔ Tokio bridge: every op crosses twice) |
→ PSDK async single-loop, fast-path |
112,596 |
−5% (PSDK SDK layer) |
→ PSDK async AsyncPool, fast-path (4×64) |
172,885 |
+53% vs single-loop (multi-loop + per-Client runtime, FT only) |
→ PSDK async AsyncPool, fast-path (8×64) |
181,851 |
+62% vs single-loop |
→ PSDK async AsyncPool, fast-path (12×64) |
180,325 |
+60% vs single-loop (TPS ceiling on 8-core hw) |
Async key insight: the per-loop ceiling around ~113K is the fundamental cost of the async bridge pattern — every op crosses Tokio ↔ asyncio twice (submit, then complete). AsyncPool recovers most of that by running N loops on N OS threads in parallel, each with its own dedicated PAC Tokio runtime (per-Client runtime isolation, auto-enabled at loop_count >= 4). TPS scales monotonically through 4–12 loops to ~180K — a 1.5× lift over single-loop, closing most of the gap to the sync path. Only useful under free-threaded Python; under regular CPython the GIL serializes the loops and the pool is slower than a single client (see AsyncPool note).
Practical takeaway¶
Sync clients pay only the PyO3 boundary cost (~11%). The SDK layer adds ~3%.
Async clients pay PyO3 + asyncio event-loop scheduling + Tokio worker bounce — much more expensive per op (~59% drop vs Rust async).
AsyncPoolis the way to scale async across cores, but only on free-threaded Python.The chained-builder API pays an additional Python-interpreter cost on single-key calls (~28% on sync, more on async). On batch calls, that cost amortizes across keys; at batch=128 the sync builder exceeds the single-record Rust async ceiling by 94%.
For maximum throughput: use the sync API on free-threaded Python with batches when the workload tolerates batching. Use the fast-path (
session.get/session.put) for single-key reads/writes when you don’t need filters / error handlers / TTL hooks. Reserve the async API for genuinely async workloads (web servers, etc.).
Fast-path vs builder¶
PSDK exposes two API shapes for single-key reads and writes:
Builder (chained):
session.query(key).execute()andsession.upsert(key).put(bins).execute(). Returns aRecordStreamof wrappedRecordResults. Supports filter expressions, error handlers, TTL overrides, generation checks, batch operations, and secondary-index queries.Fast-path (direct):
session.get(key)andsession.put(key, bins). Bypasses the builder + stream wrap and calls PAC’s native_blocking/ async entry points directly with the session-cached policy. Single-key only; no filter / error-handler / TTL hooks. Errors raise directly (cache misses raiseRecordNotFound).
Speedup of fast-path over builder on single-key dispatch at 32 threads / 4×64 tasks, FT:
Config |
Builder TPS |
Fast-path TPS |
Speedup |
|---|---|---|---|
PSDK async, single client |
63,082 |
112,596 |
1.78× |
PSDK async, AsyncPool 4×64 |
146,638 |
172,885 |
1.18× |
PSDK sync |
153,461 |
214,093 |
1.40× |
These speedups are for single-key dispatch. With batching, the builder amortizes its per-op overhead across many keys per call — at batch=32 the sync builder reaches 488K TPS (vs 214K for sync fast-path). The fast-path stays single-key only; for any workload that can batch, the builder eventually wins.
The builder has irreducible Python overhead per op (builder object allocation, _OperationSpec finalization, RecordResult wrapping, generator-based stream iteration). The fast-path skips all of it.
See performance.md for the user-facing decision guide.
AsyncPool is a free-threading feature¶
AsyncPool runs N event loops on N OS threads with one PAC client each. Its value is multi-thread parallelism across CPU cores — which only materializes under free-threaded Python (PYTHON_GIL=0).
Under non-FT Python the GIL still serializes all Python execution. AsyncPool ends up with 256 outstanding tasks across 4 threads competing for one interpreter, plus the per-loop orchestration overhead — net slower than a single-client async setup:
Config |
non-FT TPS |
vs single-loop non-FT |
|---|---|---|
async single-loop, fast-path, 32 tasks |
75,830 |
baseline |
async AsyncPool 4×64, fast-path |
56,000 |
−26% |
async single-loop, builder, 32 tasks |
50,048 |
baseline |
async AsyncPool 4×64, builder |
38,208 |
−24% |
On regular Python or with PYTHON_GIL=1, use a single Client + asyncio.gather. Reserve AsyncPool for free-threaded runs only.
Error classification¶
The framework treats RecordNotFound (cache miss on a point read) as a successful read with no record — not an error. This matches the semantics used by other Aerospike SDKs. Real errors (timeouts, connection failures, server-side errors, etc.) are counted separately as either Errors: or Timeouts: in the per-second ticker and the summary block.
To verify error accounting on a fresh dataset, pass --truncate to the bench command; with the fix in place all modes report Errors: 0 even when half the early reads cache-miss.