Benchmarking Guide

This guide documents the architecture, setup, and measured TPS / latency for the Aerospike Python SDK (PSDK) and the Aerospike Python Async Client (PAC). The reference setup uses four isolated VMs on Google Cloud Platform (one client and a three-node server cluster); the same methodology works on any other cloud provider (AWS EC2, Azure VMs, etc.) or on dedicated on-prem hardware, where only the provisioning steps change.

Architecture

┌─────────────────────────┐                                     ┌─────────────────────────┐
│      bench-client       │◄──────────── TCP :3100 ────────────►│   bench-asd × 3 nodes   │
│   c3-standard-8 (8 vCPU)│                                     │   c3-standard-8 (8 vCPU)│
│   32 GB RAM, 30 GB disk │                                     │   32 GB RAM, 30 GB disk │
│   Ubuntu 24.04 LTS      │                                     │   Ubuntu 24.04 LTS      │
│                         │                                     │                         │
│   Python 3.14t (free-   │                                     │   Aerospike Enterprise  │
│     threaded, no GIL)   │                                     │   8.1.1.1               │
│   Rust 1.95.0           │                                     │   in-memory storage     │
│   PAC, PSDK from source │                                     │   (4 GB, namespace test)│
└─────────────────────────┘                                     └─────────────────────────┘

bench-asd is a 3-node Aerospike cluster (each node a separate c3-standard-8 VM). All four VMs (1 client + 3 server nodes) run within the same VPC/subnet, giving sub-millisecond network RTT. They use c3-standard-8 machine types (Intel Sapphire Rapids, 8 vCPUs, 32 GB RAM each) to provide dedicated, non-shared compute.

Why dedicated, isolated VMs?

Local benchmarking on macOS via Podman / Docker Desktop hits several bottlenecks that distort results:

Userspace TCP proxy (Docker Desktop’s gvproxy) — adds 2-5 ms per hop, capping TPS at ~15K regardless of client capability.
CPU contention — co-locating asd and the Python client on a shared VM creates resource competition that masks true scaling behavior.
uvloop + free-threading — multiple uvloop instances on separate OS threads under free-threaded Python can cause silent freezes. PSDK’s AsyncPool explicitly uses asyncio.SelectorEventLoop for worker threads to avoid this.

Dedicated VMs on isolated CPU cores with direct, low-latency networking between client and server eliminate the first two issues; the third is handled inside PSDK itself, as noted above. GCP c3-standard-8 (8 dedicated vCPUs each) on the same VPC is the reference setup. Equivalent isolation on AWS (c7i.2xlarge / dedicated tenancy / placement groups), Azure (Fsv2-series), or on-prem (two adjacent physical hosts on a quiet switch) reproduces the numbers within run-to-run noise.

Environment

Component	Version
GCP machine type	`c3-standard-8` (8 vCPU, 32 GB)
OS	Ubuntu 24.04 LTS, kernel 6.17.0-gcp
Python	3.14.5 free-threaded build (e.g. 3.14t)
Rust	1.95.0
PAC	`aerospike-async` 0.5.0a1 (built from source with `mimalloc` global allocator)
PSDK	`aerospike-sdk` 0.9.0a2 (built from source)
Legacy Python client	`aerospike` 19.2.1 (single-threaded, sync, C client; published PyPI wheel)
Aerospike server	Enterprise 8.1.1.1, 3-node cluster, in-memory, 4 GB per node, RF=1

Workload

All measurements use the same workload across every client:

100,000 keys seeded into test.test set with single-bin records
50/50 read/write mix (RU,50)
Single-bin payload: {"b0": <int>} — the int is the key id (no per-op rng for bin values)
Shared client across all worker threads / tasks
15 seconds measured + 3 seconds warmup (no separate cooldown)
Sampled latency: 1-in-100 ops timed → p50 / p99 / p99.9 reported
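The 1-in-100 latency sampling in the last bullet can be sketched as follows. This is a minimal illustration, not the harness's actual code: the class name `SampledLatency`, its methods, and the nearest-rank percentile rule are all assumptions made for the example.

```python
import math
import time


class SampledLatency:
    """Times one op out of every `every` and keeps the samples in microseconds."""

    def __init__(self, every=100):
        self.every = every
        self.count = 0
        self.samples_us = []

    def run(self, op):
        """Run op(); only every `every`-th call pays for the two clock reads."""
        self.count += 1
        if self.count % self.every:
            return op()
        start = time.perf_counter_ns()
        result = op()
        self.samples_us.append((time.perf_counter_ns() - start) / 1_000)
        return result

    def percentile(self, p):
        """Nearest-rank percentile (p in 0..100) over the collected samples."""
        ordered = sorted(self.samples_us)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


if __name__ == "__main__":
    sampler = SampledLatency(every=100)
    for _ in range(10_000):
        sampler.run(lambda: None)  # placeholder for a client read or write
    print(len(sampler.samples_us), sampler.percentile(50), sampler.percentile(99))
```

Sampling keeps the timing cost off 99% of operations, at the price of a coarser tail estimate: a 15-second run at 200K TPS would still yield roughly 30,000 samples under this scheme.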

Bench RNG / key construction: as of 2026-05-25, the harness uses PAC’s FastRng (xoshiro256++) per worker instead of CPython’s random.Random (Mersenne Twister). This matches the JSDK RandomShift / Rust core SmallRng methodology and removes a ~5 µs/op Python-stdlib RNG cost that previously inflated the harness overhead. Keys are constructed per op via PAC’s Key.from_int_user_key(ns, set, kid) fast-path, which skips the Python str() conversion and the PythonValue enum dispatch (~2 µs/op). Net: the bench’s per-op overhead is within a few hundred nanoseconds of the JSDK / Rust core harnesses, so reported TPS reflects client capability rather than Python stdlib cost.

Free-threaded (FT) runs use PYTHON_GIL=0. Non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1 on the same free-threaded binary — same wheel, same imports, GIL state flipped.
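Before trusting an FT or non-FT number, it helps to confirm which state the interpreter actually ended up in, since importing an extension that has not declared free-threading support can re-enable the GIL. A minimal check using only the standard library might look like this; the function name `gil_report` is an illustrative choice, not part of the harness.

```python
import sys
import sysconfig


def gil_report():
    """Describe the interpreter build and the GIL state at this moment."""
    # Py_GIL_DISABLED is 1 on free-threaded builds (e.g. python3.14t).
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older builds always have the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    print(gil_report())
```

Calling this after all client imports, rather than at process start, catches the case where an import flipped the GIL back on.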

Running the benchmarks

The framework bench (python -m benchmarks.benchmark) implements every mode behind the cells in this document. Each invocation prints per-second TPS / error / timeout lines plus a final summary block.

# PSDK sync — fast-path (session.get / session.put) by default
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode sync --threads 32 --fast-path

# Same harness, builder API (session.query / upsert chained)
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode sync --threads 32 --no-fast-path

# PSDK async — single client, N concurrent tasks
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode async -z 32 --fast-path

# PSDK async — AsyncPool (N loops × M tasks per loop), free-threaded only
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode async --pool-loops 4 -z 64 --fast-path

# PAC sync direct — bypasses PSDK, calls PAC `_blocking` entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode pac-blocking --threads 32

# PAC async direct — bypasses PSDK, calls PAC async entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode pac-async -z 32

# Legacy `aerospike` C client (single-threaded — that client doesn't support
# multi-threaded fan-out; importing it on a free-threaded build also auto-
# re-enables the GIL because the C extension hasn't declared FT-safety).
python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode legacy-sync --threads 1

# Non-FT comparison: same binary, GIL forced on
PYTHON_GIL=1 ALLOW_GIL_ON=1 python -m benchmarks.benchmark ... (same args)
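The invocations above differ only in their mode flags and GIL environment, so a small helper can assemble them for scripted runs. The sketch below uses only the flags quoted above; the cell labels used as dictionary keys, the function names, and the example host are hypothetical.

```python
import shlex

# Flags shared by every invocation quoted above.
COMMON_FLAGS = [
    "--services-alternate",
    "-n", "test", "-s", "test", "-k", "100000", "-o", "I8", "-w", "RU,50",
    "-d", "15", "--warmup", "3", "--cooldown", "0",
]

# Keys are hypothetical labels; the flag lists are copied from the commands above.
MODE_FLAGS = {
    "psdk-sync-fast": ["--mode", "sync", "--threads", "32", "--fast-path"],
    "psdk-sync-builder": ["--mode", "sync", "--threads", "32", "--no-fast-path"],
    "psdk-async-fast": ["--mode", "async", "-z", "32", "--fast-path"],
    "psdk-asyncpool-fast": ["--mode", "async", "--pool-loops", "4", "-z", "64", "--fast-path"],
    "pac-blocking": ["--mode", "pac-blocking", "--threads", "32"],
    "pac-async": ["--mode", "pac-async", "-z", "32"],
    "legacy-sync": ["--mode", "legacy-sync", "--threads", "1"],
}


def bench_env(free_threaded):
    """Environment variables for an FT run or a forced-GIL run on the same binary."""
    if free_threaded:
        return {"PYTHON_GIL": "0"}
    return {"PYTHON_GIL": "1", "ALLOW_GIL_ON": "1"}


def bench_argv(cell, host):
    """Argument vector for one cell, suitable for subprocess.run."""
    return ["python", "-m", "benchmarks.benchmark", "-H", host,
            *COMMON_FLAGS, *MODE_FLAGS[cell]]


def bench_shell_line(cell, host, free_threaded=True):
    """Single shell line with the environment prefix, for logging or copy-paste."""
    env = " ".join(f"{k}={v}" for k, v in bench_env(free_threaded).items())
    return f"{env} {shlex.join(bench_argv(cell, host))}"


if __name__ == "__main__":
    # 10.0.0.2 is a placeholder address for the bench-asd seed node.
    for cell in MODE_FLAGS:
        print(bench_shell_line(cell, "10.0.0.2:3100"))
```

The legacy-client command quoted above carries no PYTHON_GIL prefix, so for that cell `bench_argv` alone is the closer match.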

The Rust core (no Python) is benched via a standalone Rust binary that talks to aerospike-core directly — no PyO3, no Python interpreter at all. This gives the language-floor TPS for the same workload:

cargo build --release --manifest-path benchmarks/rust-core/Cargo.toml
MODE=async TASKS=32 DURATION=15 WARMUP=3 \
  AEROSPIKE_HOST=<bench-asd>:3100 \
  benchmarks/rust-core/target/release/rust-core

Every cell in the matrix below was produced by python -m benchmarks.benchmark --mode ... against bench-asd (<bench-asd>:3100), except the Rust-core rows, which use the dedicated Rust binary at benchmarks/rust-core/.

Cross-client TPS: single-key (batch size 1)

50/50 RW, 100K keys, 32 threads / tasks (or 4×64 for AsyncPool), 15 s measured. Free-threaded runs use PYTHON_GIL=0; non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1. The Rust core has no GIL — one number applies, shown in the FT column.

Client / Mode	Threads / Tasks	FT TPS	non-FT TPS
PSDK sync, fast-path (`session.get` / `session.put`)	32	214,093	53,203
PSDK sync, builder (chained API)	32	153,461	31,960
PSDK async AsyncPool, fast-path	4×64	172,885	56,000
PSDK async AsyncPool, fast-path	6×64	177,685	(FT only)
PSDK async AsyncPool, fast-path	8×64	181,851	(FT only)
PSDK async AsyncPool, fast-path	12×64	180,325	(FT only)
PSDK async AsyncPool, builder	4×64	146,638	38,208
PSDK async single-loop, fast-path	32 tasks	112,596	75,830
PSDK async single-loop, builder	32 tasks	63,082	50,048
PSDK sync, fast-path	1	12,954	14,088
PSDK sync, builder	1	12,452	12,536
PAC sync direct (`pac-blocking`)	32	220,217	48,899
PAC async direct (`pac-async`)	32 tasks	118,665	68,284
PAC sync	1	12,848	13,466
PAC async	1 task	9,394	4,036
Rust core, async (Tokio tasks, no Python)	32 tasks	289,885	n/a (no GIL)
Rust core, sync (OS threads + `Handle::block_on`)	32	246,038	n/a (no GIL)
Rust core, async	1 task	16,765	n/a (no GIL)
Rust core, sync	1	14,167	n/a (no GIL)
Python legacy (sync, C client)	1	14,724	15,759

Cross-client latency

p50 / p99 / p99.9 in microseconds, sampled 1-in-100 ops during measurement. Framework rows are rounded to 100 µs precision (the per-second histogram bucket size); Rust-core rows are exact.

Client / Mode	Threads / Tasks	FT (µs)	non-FT (µs)
PSDK sync, fast-path	32	100 / 300 / 400	700 / 2,600 / 3,900
PSDK sync, builder	32	200 / 400 / 700	1,500 / 5,200 / 5,800
PSDK sync, fast-path	1	100 / 100 / 100	100 / 100 / 400
PSDK sync, builder	1	100 / 100 / 100	100 / 100 / 200
PSDK async single-loop, fast-path	32 tasks	300 / 400 / 1,000	500 / 600 / 900
PSDK async single-loop, builder	32 tasks	500 / 600 / 900	1,000 / 1,200 / 1,300
PSDK async AsyncPool, fast-path	4×64	1,400 / 4,700 / 7,400	5,100 / 14,100 / 14,200
PSDK async AsyncPool, fast-path	6×64	2,000 / 6,500 / 9,900	(FT only)
PSDK async AsyncPool, fast-path	8×64	2,500 / 8,800 / 13,300	(FT only)
PSDK async AsyncPool, fast-path	12×64	3,600 / 15,800 / 27,300	(FT only)
PSDK async AsyncPool, builder	4×64	1,600 / 4,000 / 5,000	8,700 / 26,900 / 27,000
PAC sync	32	100 / 300 / 400	600 / 2,800 / 3,800
PAC sync	1	100 / 100 / 100	100 / 100 / 300
PAC async	32 tasks	300 / 400 / 800	500 / 600 / 600
PAC async	1 task	100 / 100 / 200	200 / 300 / 300
Rust core, async	32 tasks	106 / 184 / 223	n/a (no GIL)
Rust core, sync	32	127 / 184 / 273	n/a (no GIL)
Rust core, async	1 task	59 / 75 / 119	n/a (no GIL)
Rust core, sync	1	69 / 86 / 116	n/a (no GIL)
Python legacy (sync)	1	100 / 100 / 100	100 / 100 / 100

Batch sweeps

The single-key cells above measure one record per execute(). Real applications often batch multiple keys per call to amortize network and per-op overhead. The sweeps below hold concurrency constant (32 threads / tasks, or 4×64 for AsyncPool) and vary --batch-size. Free-threaded only.

PSDK sync builder

session.query([keys]).execute() and session.batch().upsert(k).put(b).execute(). Routes through PAC’s batch_read_blocking / batch_operate_blocking directly — no asyncio loop in the path.

Batch size	Total TPS	× b=1
1	155,758	1.00×
4	226,188	1.45×
16	425,984	2.73×
32	488,432	3.14×
64	536,416	3.44×
128	562,752	3.61×

PSDK async single-loop builder

await session.query([keys]).execute() and friends — one event loop, 32 concurrent tasks.

Batch size	Total TPS	× b=1
1	62,842	1.00×
4	70,368	1.12×
16	156,000	2.48×
32	166,080	2.64×
64	203,552	3.24×
128	230,144	3.66×

PSDK async AsyncPool builder

Four event loops × 64 tasks per loop. Free-threaded only.

Batch size	Total TPS	× b=1 (pool)
1	143,700	1.00×
4	156,138	1.09×
16	267,808	1.86×
32	309,744	2.16×
64	335,328	2.33×

Headline: the PSDK sync builder scales monotonically through batch=128 to 563K TPS, the highest number in the entire matrix and 94% above Rust-core async direct (290K). Sync batch routes via PAC’s batch_*_blocking entries with one PyO3 boundary crossing per batch, so each doubling of the batch size keeps amortizing the per-call Python cost without hitting a ceiling in the measured range. The b=128 peak is 3.6× the single-key sync builder (which itself moved from 114K → 156K with the bench-RNG / key-construction cleanups that landed on 2026-05-25).

The async single-loop sweep tops out around 230K (batch=128) — the asyncio ↔ Tokio bridge cost per execute() doesn’t go away just because each call moves more data. AsyncPool recovers most of that by running 4 loops in parallel, hitting 335K at batch=64.
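The amortization argument is easier to see when the sweep is expressed as calls per second rather than records per second. The sketch below recomputes the speedup column for the sync builder sweep and derives the implied execute() rate, assuming "Total TPS" counts records rather than calls; that reading is an assumption, as are the function and variable names.

```python
# Sync builder sweep, copied from the table above: batch size -> total TPS.
SYNC_BUILDER_SWEEP = {
    1: 155_758, 4: 226_188, 16: 425_984,
    32: 488_432, 64: 536_416, 128: 562_752,
}


def sweep_rows(sweep):
    """Return (batch, tps, speedup_vs_b1, calls_per_second) for each batch size."""
    base = sweep[1]
    rows = []
    for batch, tps in sorted(sweep.items()):
        speedup = round(tps / base, 2)
        calls_per_second = tps / batch  # assumes TPS counts records, not calls
        rows.append((batch, tps, speedup, calls_per_second))
    return rows


if __name__ == "__main__":
    for batch, tps, speedup, calls in sweep_rows(SYNC_BUILDER_SWEEP):
        print(f"batch={batch:>3}  tps={tps:>7,}  x{speedup:.2f}  calls/s={calls:,.1f}")
```

Under that assumption the number of Python-level calls per second falls steeply as the batch grows while record throughput keeps rising, which is the pattern one would expect if per-call overhead rather than per-record work dominates at small batch sizes.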

Stack cost analysis

Layering the headline single-key TPS numbers across clients shows what each transition costs:

Layer	TPS	Note
Rust core async direct	289,885	`aerospike-core` via Tokio tasks — single-key language floor, no Python
Rust core sync (`block_on`)	246,038	`aerospike-core` via OS threads + `block_on`
PAC sync direct	220,217	PyO3 wrapper over `aerospike-core` blocking, no SDK
PSDK sync, fast-path	214,093	SDK `session.get` / `session.put` → PAC blocking
PSDK async AsyncPool, fast-path (8×64)	181,851	8 event loops × 64 tasks (FT only, with per-Client runtime)
PSDK async AsyncPool, fast-path (4×64)	172,885	4 event loops × 64 tasks (FT only)
PSDK sync, builder	153,461	SDK chained builder → execute → stream
PSDK async AsyncPool, builder (4×64)	146,638	4 loops, full builder path
PAC async direct, 32 tasks	118,665	PyO3 wrapper, asyncio ↔ Tokio bridge
PSDK async single-loop, fast-path	112,596	One event loop, `session.get` / `session.put`
PSDK async single-loop, builder	63,082	One event loop, full builder path
Python legacy (sync)	14,724	Single-thread C client baseline

Sync stack: boundary cost is small

Transition	TPS	Δ
Rust core async (reference)	289,885	—
→ Rust core sync (`block_on`)	246,038	−15% (Rust `block_on` overhead)
→ PAC sync direct (PyO3 wrap)	220,217	−10% (PyO3 + Python boundary cost)
→ PSDK sync, fast-path	214,093	−3% (PSDK SDK layer dispatch)
→ PSDK sync, builder	153,461	−28% (chained builder + stream wrap in Python)

Sync key insight: PSDK sync fast-path is within 3% of PAC sync direct — the SDK layer is essentially free. The 28% builder tax on single-key calls is the cost of Python interpreter time on a chained-allocation pattern; the fast-path avoids it. With batching (see Batch sweeps), the same builder hits 488K TPS at batch=32 and 563K at batch=128 — higher than the Rust async single-record ceiling.
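The Δ column is each layer's TPS relative to the layer directly above it, not relative to the Rust reference. A short sketch of that calculation, using the TPS values from the sync table; the function name `step_deltas` is illustrative.

```python
# Sync stack, top to bottom, with TPS copied from the table above.
SYNC_STACK = [
    ("Rust core async", 289_885),
    ("Rust core sync (block_on)", 246_038),
    ("PAC sync direct", 220_217),
    ("PSDK sync, fast-path", 214_093),
    ("PSDK sync, builder", 153_461),
]


def step_deltas(stack):
    """Percent change of each layer relative to the layer immediately above it."""
    deltas = []
    for (_, prev_tps), (name, tps) in zip(stack, stack[1:]):
        deltas.append((name, (tps - prev_tps) / prev_tps * 100))
    return deltas


if __name__ == "__main__":
    for name, pct in step_deltas(SYNC_STACK):
        print(f"{name:<28} {pct:+.1f}%")
```

Because the steps compound, the cumulative drop from Rust async to the PSDK sync fast-path is the product of the per-step ratios, not the sum of the percentages.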

Async stack: boundary cost is much higher

Transition	TPS	Δ
Rust core async (reference)	289,885	—
→ PAC async direct, 32 tasks	118,665	−59% (asyncio ↔ Tokio bridge: every op crosses twice)
→ PSDK async single-loop, fast-path	112,596	−5% (PSDK SDK layer)
→ PSDK async AsyncPool, fast-path (4×64)	172,885	+54% vs single-loop (multi-loop + per-Client runtime, FT only)
→ PSDK async AsyncPool, fast-path (8×64)	181,851	+62% vs single-loop
→ PSDK async AsyncPool, fast-path (12×64)	180,325	+60% vs single-loop (TPS ceiling on 8-core hw)

Async key insight: the per-loop ceiling of ~113K is the fundamental cost of the async bridge pattern: every op crosses Tokio ↔ asyncio twice (submit, then complete). AsyncPool recovers much of that by running N loops on N OS threads in parallel, each with its own dedicated PAC Tokio runtime (per-Client runtime isolation, auto-enabled at loop_count >= 4). TPS rises from 173K at 4 loops to 182K at 8 loops and then plateaus (180K at 12 loops) on the 8-vCPU client, a 1.5–1.6× lift over single-loop that closes most of the gap to the sync path. AsyncPool is only useful under free-threaded Python; under regular CPython the GIL serializes the loops and the pool is slower than a single client (see AsyncPool note).

Practical takeaway

Sync clients pay only the PyO3 boundary cost (~10%). The SDK layer adds ~3%.
Async clients pay PyO3 + asyncio event-loop scheduling + Tokio worker bounce — much more expensive per op (~59% drop vs Rust async). AsyncPool is the way to scale async across cores, but only on free-threaded Python.
The chained-builder API pays an additional Python-interpreter cost on single-key calls (~28% on sync, more on async). On batch calls, that cost amortizes across keys; at batch=128 the sync builder exceeds the single-record Rust async ceiling by 94%.
For maximum throughput: use the sync API on free-threaded Python with batches when the workload tolerates batching. Use the fast-path (session.get / session.put) for single-key reads/writes when you don’t need filters / error handlers / TTL hooks. Reserve the async API for genuinely async workloads (web servers, etc.).

Fast-path vs builder

PSDK exposes two API shapes for single-key reads and writes:

Builder (chained): session.query(key).execute() and session.upsert(key).put(bins).execute(). Returns a RecordStream of wrapped RecordResults. Supports filter expressions, error handlers, TTL overrides, generation checks, batch operations, and secondary-index queries.
Fast-path (direct): session.get(key) and session.put(key, bins). Bypasses the builder + stream wrap and calls PAC’s native _blocking / async entry points directly with the session-cached policy. Single-key only; no filter / error-handler / TTL hooks. Errors raise directly (cache misses raise RecordNotFound).
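The two call shapes can be compared side by side. The sketch below uses only the method names quoted in the bullets above (`get`, `put`, `query`, `upsert`, `execute`); `StubSession` and `StubBuilder` are hypothetical stand-ins that record calls so the example runs without a cluster, and they say nothing about how PSDK implements a session or what its methods return.

```python
class StubBuilder:
    """Hypothetical stand-in for the chained builder object; not PSDK code."""

    def __init__(self, calls, verb, key):
        self.calls = calls
        self.calls.append(f"{verb}({key!r})")

    def put(self, bins):
        self.calls.append(f"put({bins!r})")
        return self

    def execute(self):
        self.calls.append("execute()")
        return []  # placeholder for the stream the real builder returns


class StubSession:
    """Hypothetical stand-in exposing only the method names quoted above."""

    def __init__(self):
        self.calls = []

    def get(self, key):
        self.calls.append(f"get({key!r})")

    def put(self, key, bins):
        self.calls.append(f"put({key!r}, {bins!r})")

    def query(self, key):
        return StubBuilder(self.calls, "query", key)

    def upsert(self, key):
        return StubBuilder(self.calls, "upsert", key)


def read_fast(session, key):
    return session.get(key)  # one call, no builder or stream


def read_builder(session, key):
    return session.query(key).execute()  # builder object, then execute


def write_fast(session, key, bins):
    session.put(key, bins)


def write_builder(session, key, bins):
    session.upsert(key).put(bins).execute()


if __name__ == "__main__":
    fast, chained = StubSession(), StubSession()
    read_fast(fast, 7)
    write_fast(fast, 7, {"b0": 7})
    read_builder(chained, 7)
    write_builder(chained, 7, {"b0": 7})
    print(len(fast.calls), "calls on the fast-path,", len(chained.calls), "on the builder")
```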

Speedup of fast-path over builder on single-key dispatch, free-threaded, at 32 threads / tasks (4×64 for AsyncPool):

Config	Builder TPS	Fast-path TPS	Speedup
PSDK async, single client	63,082	112,596	1.78×
PSDK async, AsyncPool 4×64	146,638	172,885	1.18×
PSDK sync	153,461	214,093	1.40×

These speedups are for single-key dispatch. With batching, the builder amortizes its per-op overhead across many keys per call — at batch=32 the sync builder reaches 488K TPS (vs 214K for sync fast-path). The fast-path stays single-key only; for any workload that can batch, the builder eventually wins.

The builder has irreducible Python overhead per op (builder object allocation, _OperationSpec finalization, RecordResult wrapping, generator-based stream iteration). The fast-path skips all of it.

See performance.md for the user-facing decision guide.

AsyncPool is a free-threading feature

AsyncPool runs N event loops on N OS threads with one PAC client each. Its value is multi-thread parallelism across CPU cores — which only materializes under free-threaded Python (PYTHON_GIL=0).
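The N-loops-on-N-threads layout can be sketched with the standard library alone. This is not AsyncPool's implementation: the function names are hypothetical, and `asyncio.sleep(0)` stands in for an awaited client operation. It only shows the shape of running one selector-based event loop per OS thread.

```python
import asyncio
import threading


async def _loop_main(loop_id, tasks_per_loop, results):
    """Run `tasks_per_loop` concurrent tasks on the current loop."""

    async def one_task(task_id):
        await asyncio.sleep(0)  # placeholder for an awaited client op
        return (loop_id, task_id)

    results[loop_id] = await asyncio.gather(
        *(one_task(t) for t in range(tasks_per_loop))
    )


def _thread_main(loop_id, tasks_per_loop, results):
    """Each OS thread owns a private selector-based event loop."""
    loop = asyncio.SelectorEventLoop()
    try:
        loop.run_until_complete(_loop_main(loop_id, tasks_per_loop, results))
    finally:
        loop.close()


def run_pool(loop_count=4, tasks_per_loop=64):
    """Start `loop_count` threads, one event loop each, and wait for all of them."""
    results = {}
    threads = [
        threading.Thread(target=_thread_main, args=(i, tasks_per_loop, results))
        for i in range(loop_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    out = run_pool()
    print({loop_id: len(done) for loop_id, done in out.items()})
```

With the GIL enabled, the threads in this sketch still take turns executing Python bytecode, which is the serialization effect described in this section.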

Under non-FT Python the GIL still serializes all Python execution. AsyncPool ends up with 256 outstanding tasks across 4 threads competing for one interpreter, plus the per-loop orchestration overhead — net slower than a single-client async setup:

Config	non-FT TPS	vs single-loop non-FT
async single-loop, fast-path, 32 tasks	75,830	baseline
async AsyncPool 4×64, fast-path	56,000	−26%
async single-loop, builder, 32 tasks	50,048	baseline
async AsyncPool 4×64, builder	38,208	−24%

On regular Python or with PYTHON_GIL=1, use a single Client + asyncio.gather. Reserve AsyncPool for free-threaded runs only.
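The single-client pattern recommended here is ordinary asyncio fan-out. A minimal sketch, with `fake_op` as a hypothetical placeholder for an awaited client call and all function names chosen for illustration:

```python
import asyncio


async def fake_op(key_id):
    """Placeholder for an awaited client call; yields to the loop once."""
    await asyncio.sleep(0)
    return key_id


async def worker(worker_id, ops_per_task):
    """One logical task: issue ops back to back, awaiting each one."""
    done = 0
    for i in range(ops_per_task):
        await fake_op(worker_id * ops_per_task + i)
        done += 1
    return done


async def run_single_loop(task_count=32, ops_per_task=100):
    """Fan out `task_count` workers on one event loop and sum their op counts."""
    counts = await asyncio.gather(
        *(worker(w, ops_per_task) for w in range(task_count))
    )
    return sum(counts)


if __name__ == "__main__":
    print(asyncio.run(run_single_loop()))
```

In a real application the workers would share one client object, matching the "shared client across all worker threads / tasks" setup used for the measurements.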

Error classification

The framework treats RecordNotFound (cache miss on a point read) as a successful read with no record — not an error. This matches the semantics used by other Aerospike SDKs. Real errors (timeouts, connection failures, server-side errors, etc.) are counted separately as either Errors: or Timeouts: in the per-second ticker and the summary block.

To verify error accounting on a fresh dataset, pass --truncate to the bench command; with this classification in place, all modes report Errors: 0 even when half the early reads cache-miss.