Benchmarking Guide

This guide documents the architecture, setup, and measured TPS / latency for the Aerospike Python SDK (PSDK) and the Aerospike Python Async Client (PAC). The reference setup uses two isolated VMs on Google Cloud Platform; the same methodology works on any other cloud provider (AWS EC2, Azure VMs, etc.) or on dedicated on-prem hardware — only the VM provisioning steps would change.

Architecture

┌─────────────────────────┐                                     ┌─────────────────────────┐
│      bench-client       │◄──────────── TCP :3000 ────────────►│   bench-asd × 3 nodes   │
│   n4-standard-8 (8 vCPU)│                                     │   n4-standard-8 (8 vCPU)│
│   32 GB RAM, 30 GB disk │                                     │   32 GB RAM, 30 GB disk │
│   Ubuntu 24.04 LTS      │                                     │   Ubuntu 24.04 LTS      │
│                         │                                     │                         │
│   Python 3.14t (free-   │                                     │   Aerospike Enterprise  │
│     threaded, no GIL)   │                                     │   8.x.x                 │
│   Rust 1.96+            │                                     │   in-memory storage     │
│   PAC, PSDK from source │                                     │   (4 GB, namespace test)│
└─────────────────────────┘                                     └─────────────────────────┘

bench-asd is a 3-node Aerospike cluster (each node a separate n4-standard-8 VM, dedicated 8 vCPU per ASD process — critical for measured server-side ceilings). All four VMs (1 client + 3 server nodes) run within the same VPC/subnet, giving sub-millisecond network RTT.

Why dedicated, isolated VMs?

Local benchmarking on macOS via Podman / Docker Desktop hits several bottlenecks that distort results:

  • Userspace TCP proxy (Docker Desktop’s gvproxy) — adds 2-5 ms per hop, capping TPS at ~15K regardless of client capability.

  • CPU contention — co-locating asd and the Python client on a shared VM creates resource competition that masks true scaling behavior. Server-side: running 3 ASDs as containers on a single 8-vCPU host (vs each on its own 8-vCPU VM) caps aerospike-core direct at ~280K TPS because the 3 server processes share 8 vCPUs (~2.7 vCPU each). On dedicated 8-vCPU-per-ASD VMs, the cluster sustains ≥580K TPS — well above where any default-config Python client lands. (Earlier writeups quoted the 3-VM ceiling as 810K and then 405K, then ~290-300K rust-core direct; all three were client-side artifacts — services-alternate routing errors, then the Tokio timer wheel + the default 256-conn pool — masquerading as the cluster.)

  • uvloop + free-threading — uvloop 0.22.x has a libuv FT race on loop._ready_len (MagicStack/uvloop issues #720, #721) that triggers when many threads concurrently call loop.call_soon_threadsafe(). PSDK / PAC fully mitigates this via a single persistent waker thread inside PAC: all Tokio-side call_soon_threadsafe invocations funnel through one dedicated thread, eliminating the multi-threaded access pattern the race needs. The fix is empirically stable across 20+ minutes of stress (z=128 single-loop + AsyncPool 8×64, 241M ops, zero stalls). uvloop is installed by default on FT and non-FT Linux/macOS builds. (uvloop has no Windows wheel; PAC falls back to the asyncio default selector loop there.) An AEROSPIKE_NO_UVLOOP=1 env-var safety valve is available to opt out without uninstalling the dependency.

Dedicated VMs on isolated CPU cores with direct, low-latency networking between client and server eliminate all of these issues. GCP n4-standard-8 (8 dedicated vCPUs each) on the same VPC is the reference setup. Equivalent isolation on AWS (c7i.2xlarge / dedicated tenancy / placement groups), Azure (Fsv2-series), or on-prem (two adjacent physical hosts on a quiet switch) reproduces the numbers within run-to-run noise.

Environment

Component

Version

GCP machine type

n4-standard-8 (8 vCPU, 32 GB)

OS

Ubuntu 24.04 LTS, kernel 6.17.0-gcp

Python

3.14.6 free-threaded build (e.g. 3.14t)

Rust

1.96.0

PyO3

0.29.0

PAC

aerospike-async 0.6.0-alpha (built from source with mimalloc global allocator; uvloop installed by default)

PSDK

aerospike-sdk 0.9.0-alpha (built from source)

Legacy Python client

aerospike 19.2.1 (single-threaded, sync, C client; published PyPI wheel)

Aerospike server

Enterprise 8.x, 3-node cluster, in-memory, 4 GB per node, RF=1

Workload

All measurements use the same workload across every client:

  • 100,000 keys seeded into test.test set with single-bin records

  • 50/50 read/write mix (RU,50)

  • Single-bin payload: {"b0": <int>} — the int is the key id (no per-op rng for bin values)

  • Shared client across all worker threads / tasks

  • 15 seconds measured + 3 seconds warmup (no separate cooldown)

  • Sampled latency: 1-in-100 ops timed → p50 / p99 / p99.9 reported

Bench RNG / key construction: as of 2026-05-25, the harness uses PAC’s FastRng (xoshiro256++) per worker instead of CPython’s random.Random (Mersenne Twister) — matches the JSDK RandomShift / Rust core SmallRng methodology and removes a ~5 µs/op Python-stdlib RNG handicap that otherwise inflated the bench-harness overhead. Keys are constructed per op via PAC’s Key.from_int_user_key(ns, set, kid) fast-path, which skips Python str() conversion + PythonValue enum dispatch (~2 µs/op). Net: the bench’s per-op overhead matches JSDK/Rust core methodology within a few hundred nanoseconds, so reported TPS reflects client capability rather than Python stdlib cost.

Free-threaded (FT) runs use PYTHON_GIL=0. Non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1 on the same free-threaded binary — same wheel, same imports, GIL state flipped.

Running the benchmarks

The framework bench (python -m benchmarks.benchmark) carries all the modes for the cells in this document. Each invocation prints per-second TPS / error / timeout lines plus a final summary block.

# PSDK sync — fast-path (session.get / session.put) by default
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode sync --threads 32 --fast-path

# Same harness, builder API (session.query / upsert chained)
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode sync --threads 32 --no-fast-path

# PSDK async — single client, N concurrent tasks
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode async -z 32 --fast-path

# PSDK async — AsyncPool (N loops × M tasks per loop), free-threaded only
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode async --pool-loops 4 -z 64 --fast-path

# PAC sync direct — bypasses PSDK, calls PAC `_blocking` entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode pac-blocking --threads 32

# PAC async direct — bypasses PSDK, calls PAC async entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode pac-async -z 32

# Legacy `aerospike` C client (single-threaded — that client doesn't support
# multi-threaded fan-out; importing it on a free-threaded build also auto-
# re-enables the GIL because the C extension hasn't declared FT-safety).
python -m benchmarks.benchmark \
  -H <bench-asd>:3100 --services-alternate \
  -n test -s test -k 100000 -o I8 -w RU,50 \
  -d 15 --warmup 3 --cooldown 0 \
  --mode legacy-sync --threads 1

# Non-FT comparison: same binary, GIL forced on
PYTHON_GIL=1 ALLOW_GIL_ON=1 python -m benchmarks.benchmark ... (same args)

The Rust core (no Python) is benched via a standalone Rust binary that talks to aerospike-core directly — no PyO3, no Python interpreter at all. This gives the language-floor TPS for the same workload:

cargo build --release --manifest-path benchmarks/rust-core/Cargo.toml
MODE=async TASKS=32 DURATION=15 WARMUP=3 \
  AEROSPIKE_HOST=<bench-asd>:3100 \
  benchmarks/rust-core/target/release/rust-core

Every cell in the matrix below was produced by python -m benchmarks.benchmark --mode ... against bench-asd (<bench-asd>:3100), except the Rust-core rows, which use the dedicated Rust binary at benchmarks/rust-core/.

Cross-client TPS — single-key (batch size 1)

50/50 RW, 100K keys, 32 threads / tasks (or 4×64 for AsyncPool), 15 s measured. Free-threaded runs use PYTHON_GIL=0; non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1. The Rust core has no GIL — one number applies, shown in the FT column.

Client / Mode

Threads / Tasks

FT TPS

non-FT TPS

PSDK sync, fast-path (session.get / session.put)

32

214,489

50,857

PSDK sync, fast-path, ct_runtime

32

265,971

57,200

PSDK sync, builder (chained API)

32

149,428

31,564

PSDK sync, builder, ct_runtime

32

187,209

33,444

PSDK async AsyncPool, fast-path

4×64

260,325

108,327

PSDK async AsyncPool, fast-path

8×64

~292,000

(FT only)

PSDK async AsyncPool, builder

4×64

181,838

61,281

PSDK async single-loop, fast-path

32 tasks

118,220

105,959

PSDK async single-loop, builder

32 tasks

68,402

64,496

PSDK sync, fast-path

1

10,946

9,730

PSDK sync, builder

1

10,053

8,826

PAC sync direct (pac-blocking)

32

209,426

50,194

PAC sync direct, ct_runtime

32

271,066

60,730

PAC async direct (pac-async)

32 tasks

124,001

114,020

PAC sync

1

12,221

11,812

PAC async

1 task

7,773

8,304

Rust core, async (default settings)

32 tasks

~290,000

n/a (no GIL)

Rust core, sync (default settings)

32

~246,000

n/a (no GIL)

Rust core, async, with timer fix + pool sized

512 tasks

~580,000

n/a (no GIL)

Rust core, async

1 task

12,627

n/a (no GIL)

Rust core, sync

1

11,960

n/a (no GIL)

Python legacy (sync, C client)

1

(FT N/A, no wheel)

~15,000

The Rust-core rows here are on the 3-VM ASD topology. At default settings, rust-core async hits ~290K at t=32 and scales with concurrency — but the apparent plateau between t=32 and t=512 is client-side, not the cluster. Two aerospike-core defaults stack to cap throughput:

  • Per-op Tokio timer-wheel registration. Every aerospike_rt::timeout(...) insert/remove goes through a shared mutex in Tokio’s global time driver; under contention this serializes per-op work. Bypassing it (A2 — measurement hack) lifts rust-core async at t=256 from ~381K to ~551K.

  • max_conns_per_node = 256 default, fail-fast on exhaustion. With the timer also out of the way, t=512 collapses with ~92% errors as the pool refuses past 256 concurrent ops per node. Sizing the pool to match concurrency (MAX_CONNS_PER_NODE = 512) takes t=512 to 580K @ 0 errors — the real ceiling.

Python clients (PAC, PSDK) hit their own client-side ceilings (PyO3 boundary, asyncio/Tokio bridge, builder allocations) well below 580K, so they don’t see either of these two artifacts. Earlier versions of this doc quoted 810K and 405K as “the cluster ceiling”; both were artifacts of the two issues above plus an older services-alternate routing bug. There is no real cluster constraint visible from any default-config Python client.

ct_runtime is experimental — measurement-only on this table

The ct_runtime rows above use PAC’s --current-thread-runtime mode (sync only): each Python thread gets its own Tokio current-thread runtime via PAC’s _LocalClient proxy. This sidesteps the multi-thread Tokio worker-pool hop and raises the sync ceiling (PAC sync 207K → 277K; PSDK sync fp 210K → 265K).

But ct_runtime is not production-ready. Each per-thread runtime owns its own Cluster, which means:

  • N× cluster-tend threads (32 Python threads = 32 tend loops polling the cluster every second)

  • N× connection pools (~384 connections per process at default settings)

  • Incomplete _with_overrides surface — some PAC methods still hit the shared runtime even when ct_runtime is on

These numbers are included for measurement transparency; treat them as an experimental performance lever, not a recommended deployment.

Cross-client latency

p50 / p99 / p99.9 in microseconds, sampled 1-in-100 ops during measurement. Framework rows are rounded to 100 µs precision (the per-second histogram bucket size); Rust-core rows are exact.

Client / Mode

Threads / Tasks

FT (µs)

non-FT (µs)

PSDK sync, fast-path

32

100 / 300 / 500

500 / 2,500 / 3,200

PSDK sync, fast-path, ct_runtime

32

100 / 200 / 400

500 / 2,400 / 3,400

PSDK sync, builder

32

100 / 900 / 3,700

1,400 / 5,000 / 5,800

PSDK sync, fast-path

1

100 / 100 / 200

100 / 100 / 100

PSDK async single-loop, fast-path

32 tasks

200 / 300 / 600

500 / 600 / 700

PSDK async single-loop, builder

32 tasks

500 / 600 / 800

1,000 / 1,200 / 1,300

PSDK async AsyncPool, fast-path

4×64

900 / 2,500 / 3,500

4,800 / 13,300 / 16,400

PSDK async AsyncPool, fast-path

8×64

1,700 / 4,100 / 5,800

(FT only)

PSDK async AsyncPool, builder

4×64

1,400 / 2,600 / 3,800

7,800 / 25,300 / 26,000

PAC sync

32

100 / 300 / 500

600 / 2,800 / 3,800

PAC sync, ct_runtime

32

100 / 200 / 300

500 / 2,300 / 3,300

PAC sync

1

100 / 100 / 100

100 / 100 / 100

PAC async

32 tasks

200 / 300 / 500

400 / 400 / 500

PAC async

1 task

100 / 200 / 200

100 / 200 / 200

Rust core, async (default)

32 tasks

(sampled) p99 ~190

n/a (no GIL)

Rust core, sync (default)

32

(sampled) p99 ~200

n/a (no GIL)

Rust core, async

1 task

p99 ~140

n/a (no GIL)

Rust core, sync

1

p99 ~115

n/a (no GIL)

Framework latency is histogram-bucketed at 100 µs granularity (--with-telemetry’s sampling resolution); Rust-core latency is sampled exactly. Framework cells with reported p50 under 100 µs round up to the 100 µs bucket boundary.

Batch sweeps

The single-key cells above measure one record per execute(). Real applications often batch multiple keys per call to amortize network and per-op overhead. The sweeps below hold concurrency constant (32 threads / tasks) and vary --batch-size. Free-threaded only.

PSDK sync builder

session.query([keys]).execute() and session.batch().upsert(k).put(b).execute(). Routes through PAC’s batch_read_blocking / batch_operate_blocking directly — no asyncio loop in the path.

Batch size

Total TPS

× b=1

1

145,898

1.00×

4

142,895

0.98×

16

328,253

2.25×

32

401,467

2.75×

64

470,720

3.23×

128

485,056

3.32×

PSDK async single-loop builder

await session.query([keys]).execute() and friends — one event loop, 32 concurrent tasks.

Batch size

Total TPS

× b=1

1

64,569

1.00×

4

59,514

0.92×

16

121,155

1.88×

32

144,885

2.24×

64

174,080

2.70×

128

204,736

3.17×

PSDK async AsyncPool builder

Four event loops × 64 tasks per loop. Free-threaded only.

Batch size

Total TPS

× b=1 (pool)

1

190,278

1.00×

4

156,954

0.83×

16

265,443

1.40×

32

310,901

1.63×

64

336,469

1.77×

Headline: the PSDK sync builder scales through batch=128 to ~485K TPS — the highest framework number in the matrix. Sync batch routes via PAC’s batch_*_blocking entries with one PyO3 boundary per batch, so doubling the batch size keeps amortizing the per-call Python cost. The b=128 peak is 3.3× the single-key sync builder.

The async single-loop sweep tops out around 205K (batch=128) — the asyncio ↔ Tokio bridge cost per execute() doesn’t go away just because each call moves more data. AsyncPool recovers most of that by running 4 loops in parallel, hitting 336K at batch=64.

Stack cost analysis

Layering the headline single-key TPS numbers across clients shows where every transition costs. The Rust-core figures below are at the same default settings as the Python clients; Rust-core’s real cluster-side ceiling is ≥580K (with the per-op Tokio timer wheel bypassed AND max_conns_per_node sized to match concurrency — see “Per-language baselines” above). Python clients hit their own client-side ceilings well below 580K, so they aren’t sensitive to the Rust-core defaults that gate the higher number.

Layer

TPS

Note

Rust core async, default settings

~290,000

aerospike-core via Tokio tasks; at default settings (timer wheel + 256-conn pool both active)

Rust core async, timer + pool sized

~580,000

Real cluster-side ceiling; current aerospike-core defaults stack to cap below this

Rust core sync, default settings

~246,000

aerospike-core via OS threads + block_on

PSDK async AsyncPool, fast-path (8×64)

~292,000

8 event loops × 64 tasks (FT only, uvloop)

PAC sync direct, ct_runtime

271,066

PyO3 wrapper, per-thread Tokio current-thread runtime

PSDK sync, fast-path, ct_runtime

265,971

SDK fast-path + ct_runtime

PSDK async AsyncPool, fast-path (4×64)

260,325

4 event loops × 64 tasks (FT only, uvloop)

PSDK sync, fast-path

214,489

SDK session.get / session.put → PAC blocking

PAC sync direct (multi-thread Tokio)

209,426

PyO3 wrapper, shared Tokio multi-thread runtime

PSDK async AsyncPool, builder (4×64)

181,838

4 loops, full builder path

PSDK sync, builder

149,428

SDK chained builder → execute → stream

PAC async direct, 32 tasks

124,001

PyO3 wrapper, asyncio ↔ Tokio bridge (with drainer + uvloop)

PSDK async single-loop, fast-path

118,220

One event loop, session.get / session.put

PSDK async single-loop, builder

68,402

One event loop, full builder path

Python legacy (sync, non-FT)

~15,000

Single-thread C client baseline

Sync stack — boundary cost is small

Transition

TPS

Δ

Rust core sync (default settings)

~246,000

reference (default aerospike-core)

→ PAC sync direct (multi-thread Tokio)

209,426

−15% (PyO3 + Python boundary; Tokio thread-handoff in the per-op path)

→ PSDK sync, fast-path

214,489

flat — SDK layer is essentially free

→ PSDK sync, builder

149,428

−30% vs fp (chained builder + stream wrap in Python)

The PyO3 + per-op Python ↔ Tokio thread handoff costs ~15% over the equivalent direct rust-core sync number. The PSDK SDK layer is essentially free over PAC direct. (The cluster sustains higher absolute throughput than rust-core sync default — see “Per-language baselines” — but with the default aerospike-core settings active, both Python and Rust-direct paths land in the same band.)

Async stack — closer to sync than it used to be

Transition

TPS

Δ

PSDK sync, fast-path (sync reference)

214,489

→ PAC async direct (single loop, drainer + uvloop)

124,001

−42% (asyncio loop thread is the gating step)

→ PSDK async single-loop, fast-path

118,220

−5% vs PAC async (PSDK SDK layer)

→ PSDK async AsyncPool, fast-path (4×64)

260,325

+120% vs single-loop (parallelism across loops + uvloop inside pool, FT only) — +21% above sync

→ PSDK async AsyncPool, fast-path (8×64)

~292,000

+147% vs single-loop, +36% over sync

Async key insight: post-drainer-thread + uvloop, the single-loop async ceiling sits around 120-130K. The bottleneck is now the asyncio loop thread doing per-op set_result and task wakeup work, single-threaded. AsyncPool (multi-loop) breaks past that ceiling by running 4-8 loops in parallel — at 8×64 it actually exceeds the sync fast-path ceiling. Only useful under free-threaded Python; under regular CPython the GIL serializes the loops and the pool is slower than a single client (see AsyncPool note).

Practical takeaway

  • PSDK SDK layer is essentially free on both sync and async paths — ~3-8% over PAC direct on either side. Most cost is below PSDK in PAC + PyO3.

  • PAC’s drainer thread moves all asyncio-loop wake-ups onto a single persistent waker thread, eliminating per-batch Python::attach churn on Tokio workers. This is what lifted async TPS substantially over earlier reference numbers (e.g., AsyncPool 4×64 went from 173K → 246K).

  • uvloop is installed by default under FT and non-FT Linux/macOS. It lifts single-loop async ~15% on top of the drainer; multi-loop (AsyncPool) sees ~0-3% extra because the per-loop work is already parallelized.

  • The chained-builder API pays a per-op Python tax on single-key calls (~30% vs fast-path on sync). On batch calls, that cost amortizes across keys: at batch=128 the sync builder reaches ~484K TPS — far above any single-key cell.

  • For maximum throughput: use the sync builder with batches (session.batch() or multi-key session.query([keys])) on free-threaded Python when the workload tolerates batching — ~484K TPS at batch=128. For single-key sync workloads, the fast-path (session.get / session.put) gives ~210K TPS. For async workloads, AsyncPool 4-8 loops delivers 246-273K TPS — above the sync fast-path ceiling. Reserve --current-thread-runtime (experimental — see the warning above) for tightly-controlled benchmarking, not production.

Fast-path vs builder

PSDK exposes two API shapes for single-key reads and writes:

  • Builder (chained): session.query(key).execute() and session.upsert(key).put(bins).execute(). Returns a RecordStream of wrapped RecordResults. Supports filter expressions, error handlers, TTL overrides, generation checks, batch operations, and secondary-index queries.

  • Fast-path (direct): session.get(key) and session.put(key, bins). Bypasses the builder + stream wrap and calls PAC’s native _blocking / async entry points directly with the session-cached policy. Single-key only; no filter / error-handler / TTL hooks. Errors raise directly (cache misses raise RecordNotFound).

Speedup of fast-path over builder on single-key dispatch at 32 threads / 4×64 tasks, FT:

Config

Builder TPS

Fast-path TPS

Speedup

PSDK async, single client

68,402

118,220

1.73×

PSDK async, AsyncPool 4×64

181,838

260,325

1.43×

PSDK sync

149,428

214,489

1.44×

These speedups are for single-key dispatch. With batching, the builder amortizes its per-op overhead across many keys per call — at batch=128 the sync builder reaches 484K TPS (vs 210K for sync fast-path). The fast-path stays single-key only; for any workload that can batch, the builder eventually wins.

The builder has irreducible Python overhead per op (builder object allocation, _OperationSpec finalization, RecordResult wrapping, generator-based stream iteration). The fast-path skips all of it.

See performance.md for the user-facing decision guide.

AsyncPool is a free-threading feature

AsyncPool runs N event loops on N OS threads with one PAC client each. Its value is multi-thread parallelism across CPU cores — which only materializes under free-threaded Python (PYTHON_GIL=0).

Under non-FT Python the GIL still serializes all Python execution. AsyncPool ends up with 256 outstanding tasks across 4 threads competing for one interpreter, plus the per-loop orchestration overhead — typically net flat or slightly slower than a single-client async setup on the same Python binary:

Config

non-FT TPS

vs single-loop non-FT

async single-loop, fast-path, 32 tasks

105,959

baseline

async AsyncPool 4×64, fast-path

108,327

+2% (uvloop in pool roughly recovers the overhead)

async single-loop, builder, 32 tasks

64,496

baseline

async AsyncPool 4×64, builder

61,281

−5%

AsyncPool is roughly on par with single-client async under GIL-on Python now that pool loops use uvloop too. Pick the one that fits your code shape; the real AsyncPool win is reserved for free-threaded runs.

Error classification

The framework treats RecordNotFound (cache miss on a point read) as a successful read with no record — not an error. This matches the semantics used by other Aerospike SDKs. Real errors (timeouts, connection failures, server-side errors, etc.) are counted separately as either Errors: or Timeouts: in the per-second ticker and the summary block.

To verify error accounting on a fresh dataset, pass --truncate to the bench command; with the fix in place all modes report Errors: 0 even when half the early reads cache-miss.