Benchmarking Guide¶
This guide documents the architecture, setup, and measured TPS / latency for the Aerospike Python SDK (PSDK) and the Aerospike Python Async Client (PAC). The reference setup uses four isolated VMs on Google Cloud Platform (one client VM plus a 3-node server cluster); the same methodology works on any other cloud provider (AWS EC2, Azure VMs, etc.) or on dedicated on-prem hardware, where only the VM provisioning steps would change.
Architecture¶
┌─────────────────────────┐                                     ┌─────────────────────────┐
│   bench-client          │◄──────────── TCP :3000 ────────────►│   bench-asd × 3 nodes   │
│   n4-standard-8 (8 vCPU)│                                     │   n4-standard-8 (8 vCPU)│
│   32 GB RAM, 30 GB disk │                                     │   32 GB RAM, 30 GB disk │
│   Ubuntu 24.04 LTS      │                                     │   Ubuntu 24.04 LTS      │
│                         │                                     │                         │
│   Python 3.14t (free-   │                                     │   Aerospike Enterprise  │
│   threaded, no GIL)     │                                     │   8.x.x                 │
│   Rust 1.96+            │                                     │   in-memory storage     │
│   PAC, PSDK from source │                                     │   (4 GB, namespace test)│
└─────────────────────────┘                                     └─────────────────────────┘
bench-asd is a 3-node Aerospike cluster (each node a separate n4-standard-8 VM, dedicated 8 vCPU per ASD process — critical for measured server-side ceilings). All four VMs (1 client + 3 server nodes) run within the same VPC/subnet, giving sub-millisecond network RTT.
Why dedicated, isolated VMs?¶
Local benchmarking on macOS via Podman / Docker Desktop hits several bottlenecks that distort results:
- Userspace TCP proxy (Docker Desktop’s gvproxy): adds 2-5 ms per hop, capping TPS at ~15K regardless of client capability.
- CPU contention: co-locating asd and the Python client on a shared VM creates resource competition that masks true scaling behavior. Server-side: running 3 ASDs as containers on a single 8-vCPU host (vs each on its own 8-vCPU VM) caps aerospike-core direct at ~280K TPS because the 3 server processes share 8 vCPUs (~2.7 vCPU each). On dedicated 8-vCPU-per-ASD VMs, the cluster sustains ≥580K TPS, well above where any default-config Python client lands. (Earlier writeups quoted the 3-VM ceiling as 810K, then 405K, then ~290-300K rust-core direct; all three were client-side artifacts masquerading as the cluster: services-alternate routing errors, then the Tokio timer wheel plus the default 256-conn pool.)
- uvloop + free-threading: uvloop 0.22.x has a libuv FT race on loop._ready_len (MagicStack/uvloop issues #720, #721) that triggers when many threads concurrently call loop.call_soon_threadsafe(). PSDK / PAC fully mitigate this via a single persistent waker thread inside PAC: all Tokio-side call_soon_threadsafe invocations funnel through one dedicated thread, eliminating the multi-threaded access pattern the race needs. The fix is empirically stable across 20+ minutes of stress (z=128 single-loop + AsyncPool 8×64, 241M ops, zero stalls). uvloop is installed by default on FT and non-FT Linux/macOS builds. (uvloop has no Windows wheel; PAC falls back to the asyncio default selector loop there.) An AEROSPIKE_NO_UVLOOP=1 env-var safety valve is available to opt out without uninstalling the dependency.
Dedicated VMs on isolated CPU cores with direct, low-latency networking between client and server eliminate all of these issues. GCP n4-standard-8 (8 dedicated vCPUs each) on the same VPC is the reference setup. Equivalent isolation on AWS (c7i.2xlarge / dedicated tenancy / placement groups), Azure (Fsv2-series), or on-prem (two adjacent physical hosts on a quiet switch) reproduces the numbers within run-to-run noise.
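The single-waker pattern described in the uvloop item can be sketched with the standard library alone. This is not PAC's implementation (that lives on the Rust/Tokio side); `run_demo`, `handoff`, `deliver`, and the producer threads are hypothetical stand-ins. The point it illustrates is only that many threads hand results to a queue, and exactly one thread ever calls `loop.call_soon_threadsafe()`.

```python
import asyncio
import queue
import threading


def run_demo(num_producers=4, per_producer=50):
    """Deliver results from many threads to one event loop via a single waker thread."""
    total = num_producers * per_producer
    handoff = queue.Queue()  # thread-safe hand-off from producers to the waker
    delivered = []           # only ever touched on the event-loop thread

    async def main():
        loop = asyncio.get_running_loop()
        done = asyncio.Event()

        def deliver(item):
            # Runs on the loop thread, scheduled by the waker.
            delivered.append(item)
            if len(delivered) == total:
                done.set()

        def waker():
            # The only thread that ever calls loop.call_soon_threadsafe().
            while True:
                item = handoff.get()
                if item is None:  # sentinel: shut down
                    return
                loop.call_soon_threadsafe(deliver, item)

        def producer(pid):
            # Hypothetical stand-in for a runtime worker completing operations.
            for seq in range(per_producer):
                handoff.put((pid, seq))

        waker_thread = threading.Thread(target=waker)
        waker_thread.start()
        producers = [
            threading.Thread(target=producer, args=(pid,))
            for pid in range(num_producers)
        ]
        for t in producers:
            t.start()
        await done.wait()          # all items have reached the loop thread
        for t in producers:
            t.join()
        handoff.put(None)          # stop the waker
        waker_thread.join()

    asyncio.run(main())
    return delivered
```

Because the producers never touch the loop directly, the concurrent `call_soon_threadsafe()` access pattern that the race needs does not occur, whatever the number of producer threads.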
Environment¶
| Component | Version |
|---|---|
| GCP machine type | |
| OS | Ubuntu 24.04 LTS, kernel 6.17.0-gcp |
| Python | 3.14.6 free-threaded build (e.g. 3.14t) |
| Rust | 1.96.0 |
| PyO3 | 0.29.0 |
| PAC | |
| PSDK | |
| Legacy Python client | |
| Aerospike server | Enterprise 8.x, 3-node cluster, in-memory, 4 GB per node, RF=1 |
Workload¶
All measurements use the same workload across every client:
- 100,000 keys seeded into the test.test set with single-bin records
- 50/50 read/write mix (RU,50)
- Single-bin payload: {"b0": <int>}, where the int is the key id (no per-op rng for bin values)
- Shared client across all worker threads / tasks
- 15 seconds measured + 3 seconds warmup (no separate cooldown)
- Sampled latency: 1-in-100 ops timed → p50 / p99 / p99.9 reported
Bench RNG / key construction: as of 2026-05-25, the harness uses PAC’s
FastRng (xoshiro256++) per worker instead of CPython’s random.Random
(Mersenne Twister) — matches the JSDK RandomShift / Rust core SmallRng
methodology and removes a ~5 µs/op Python-stdlib RNG handicap that
otherwise inflated the bench-harness overhead. Keys are constructed per op
via PAC’s Key.from_int_user_key(ns, set, kid) fast-path, which skips
Python str() conversion + PythonValue enum dispatch (~2 µs/op).
Net: the bench’s per-op overhead matches JSDK/Rust core methodology
within a few hundred nanoseconds, so reported TPS reflects client
capability rather than Python stdlib cost.
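The shape of the measured loop (50/50 read/write, key id as bin value, 1-in-100 latency sampling) can be sketched as follows. This is an illustration, not the harness: a plain dict stands in for the client, the stdlib `random.Random` stands in for PAC's FastRng, and `run_workload` and `percentile` are hypothetical names. The nearest-rank percentile is an assumption; the harness may compute percentiles differently.

```python
import math
import random
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile over an already-sorted list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(round(pct * len(sorted_samples) / 100, 9)))
    return sorted_samples[rank - 1]


def run_workload(store, num_keys, ops, read_pct=50, sample_every=100, seed=0):
    """Run `ops` operations against a dict-like store; time 1 in `sample_every`."""
    rng = random.Random(seed)  # stand-in for the per-worker RNG
    latencies_us = []
    reads = writes = 0
    for op in range(ops):
        kid = rng.randrange(num_keys)
        timed = op % sample_every == 0
        start = time.perf_counter_ns() if timed else 0
        if rng.randrange(100) < read_pct:
            store.get(kid)             # point read; a miss is not an error
            reads += 1
        else:
            store[kid] = {"b0": kid}   # bin value is the key id, no extra RNG
            writes += 1
        if timed:
            latencies_us.append((time.perf_counter_ns() - start) / 1000)
    latencies_us.sort()
    return reads, writes, latencies_us


if __name__ == "__main__":
    reads, writes, lat = run_workload({}, num_keys=100_000, ops=200_000)
    print(reads, writes, [percentile(lat, p) for p in (50, 99, 99.9)])
```

Timing only the sampled operations keeps the clock reads off the hot path for the other 99% of operations.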
Free-threaded (FT) runs use PYTHON_GIL=0. Non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1 on the same free-threaded binary — same wheel, same imports, GIL state flipped.
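Before trusting an FT or non-FT number, it can help to confirm which state the interpreter is actually in. The sketch below uses `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()` (available on CPython 3.13+); the helper name `gil_status` is hypothetical, and the fallback for older interpreters assumes the GIL is on.

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled_now) for the running interpreter."""
    ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    probe = getattr(sys, "_is_gil_enabled", None)
    # Interpreters without the probe predate free-threading, so the GIL is on.
    gil_enabled = bool(probe()) if probe is not None else True
    return ft_build, gil_enabled


if __name__ == "__main__":
    ft_build, gil_enabled = gil_status()
    print(f"free-threaded build: {ft_build}, GIL enabled: {gil_enabled}")
```

Running this after importing the client under test is the more useful check, since importing an extension that has not declared FT-safety can re-enable the GIL at runtime.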
Running the benchmarks¶
The framework bench (python -m benchmarks.benchmark) carries all the modes for the cells in this document. Each invocation prints per-second TPS / error / timeout lines plus a final summary block.
# PSDK sync: fast-path (session.get / session.put) by default
PYTHON_GIL=0 python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode sync --threads 32 --fast-path

# Same harness, builder API (session.query / upsert chained)
PYTHON_GIL=0 python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode sync --threads 32 --no-fast-path

# PSDK async: single client, N concurrent tasks
PYTHON_GIL=0 python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode async -z 32 --fast-path

# PSDK async: AsyncPool (N loops × M tasks per loop), free-threaded only
PYTHON_GIL=0 python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode async --pool-loops 4 -z 64 --fast-path

# PAC sync direct: bypasses PSDK, calls PAC `_blocking` entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode pac-blocking --threads 32

# PAC async direct: bypasses PSDK, calls PAC async entries
PYTHON_GIL=0 python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode pac-async -z 32

# Legacy `aerospike` C client (single-threaded: that client doesn't support
# multi-threaded fan-out; importing it on a free-threaded build also
# auto-re-enables the GIL because the C extension hasn't declared FT-safety).
python -m benchmarks.benchmark \
    -H <bench-asd>:3100 --services-alternate \
    -n test -s test -k 100000 -o I8 -w RU,50 \
    -d 15 --warmup 3 --cooldown 0 \
    --mode legacy-sync --threads 1

# Non-FT comparison: same binary, GIL forced on
PYTHON_GIL=1 ALLOW_GIL_ON=1 python -m benchmarks.benchmark ... (same args)
The Rust core (no Python) is benched via a standalone Rust binary that talks to aerospike-core directly — no PyO3, no Python interpreter at all. This gives the language-floor TPS for the same workload:
cargo build --release --manifest-path benchmarks/rust-core/Cargo.toml

MODE=async TASKS=32 DURATION=15 WARMUP=3 \
    AEROSPIKE_HOST=<bench-asd>:3100 \
    benchmarks/rust-core/target/release/rust-core
Every cell in the matrix below was produced by python -m benchmarks.benchmark --mode ... against bench-asd (<bench-asd>:3100), except the Rust-core rows, which use the dedicated Rust binary at benchmarks/rust-core/.
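When repeating the sync invocation across several thread counts, it can be convenient to build the argument list programmatically. The helper below is a hypothetical convenience (`bench_argv` is not part of the harness); it only assembles flags that appear in the commands above and prints the resulting command lines rather than running them. The `<bench-asd>` placeholder is kept as-is.

```python
import shlex


def bench_argv(host, threads, fast_path=True):
    """Assemble the sync-mode benchmark command shown above as an argv list."""
    argv = [
        "python", "-m", "benchmarks.benchmark",
        "-H", host, "--services-alternate",
        "-n", "test", "-s", "test", "-k", "100000", "-o", "I8", "-w", "RU,50",
        "-d", "15", "--warmup", "3", "--cooldown", "0",
        "--mode", "sync", "--threads", str(threads),
    ]
    argv.append("--fast-path" if fast_path else "--no-fast-path")
    return argv


if __name__ == "__main__":
    for threads in (1, 8, 32):
        # Prefix with PYTHON_GIL=0 in the shell for a free-threaded run.
        print(shlex.join(bench_argv("<bench-asd>:3100", threads)))
```

Passing the list to `subprocess.run(argv, env=...)` with `PYTHON_GIL` set in the environment would execute each cell; that step is left out here so the sketch has no side effects.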
Cross-client TPS — single-key (batch size 1)¶
50/50 RW, 100K keys, 32 threads / tasks (or 4×64 for AsyncPool), 15 s measured. Free-threaded runs use PYTHON_GIL=0; non-FT runs use PYTHON_GIL=1 ALLOW_GIL_ON=1. The Rust core has no GIL — one number applies, shown in the FT column.
| Client / Mode | Threads / Tasks | FT TPS | non-FT TPS |
|---|---|---|---|
| PSDK sync, fast-path ( | 32 | 214,489 | 50,857 |
| PSDK sync, fast-path, ct_runtime | 32 | 265,971 | 57,200 |
| PSDK sync, builder (chained API) | 32 | 149,428 | 31,564 |
| PSDK sync, builder, ct_runtime | 32 | 187,209 | 33,444 |
| PSDK async AsyncPool, fast-path | 4×64 | 260,325 | 108,327 |
| PSDK async AsyncPool, fast-path | 8×64 | ~292,000 | (FT only) |
| PSDK async AsyncPool, builder | 4×64 | 181,838 | 61,281 |
| PSDK async single-loop, fast-path | 32 tasks | 118,220 | 105,959 |
| PSDK async single-loop, builder | 32 tasks | 68,402 | 64,496 |
| PSDK sync, fast-path | 1 | 10,946 | 9,730 |
| PSDK sync, builder | 1 | 10,053 | 8,826 |
| PAC sync direct ( | 32 | 209,426 | 50,194 |
| PAC sync direct, ct_runtime | 32 | 271,066 | 60,730 |
| PAC async direct ( | 32 tasks | 124,001 | 114,020 |
| PAC sync | 1 | 12,221 | 11,812 |
| PAC async | 1 task | 7,773 | 8,304 |
| Rust core, async (default settings) | 32 tasks | ~290,000 | n/a (no GIL) |
| Rust core, sync (default settings) | 32 | ~246,000 | n/a (no GIL) |
| Rust core, async, with timer fix + pool sized | 512 tasks | ~580,000 | n/a (no GIL) |
| Rust core, async | 1 task | 12,627 | n/a (no GIL) |
| Rust core, sync | 1 | 11,960 | n/a (no GIL) |
| Python legacy (sync, C client) | 1 | (FT N/A, no wheel) | ~15,000 |
The Rust-core rows here are on the 3-VM ASD topology. At default settings, rust-core async hits ~290K at t=32 and scales with concurrency — but the apparent plateau between t=32 and t=512 is client-side, not the cluster. Two aerospike-core defaults stack to cap throughput:
- Per-op Tokio timer-wheel registration. Every aerospike_rt::timeout(...) insert/remove goes through a shared mutex in Tokio’s global time driver; under contention this serializes per-op work. Bypassing it (A2, a measurement hack) lifts rust-core async at t=256 from ~381K to ~551K.
- max_conns_per_node = 256 default, fail-fast on exhaustion. With the timer also out of the way, t=512 collapses with ~92% errors as the pool refuses past 256 concurrent ops per node. Sizing the pool to match concurrency (MAX_CONNS_PER_NODE = 512) takes t=512 to 580K @ 0 errors, which is the real ceiling.
Python clients (PAC, PSDK) hit their own client-side ceilings (PyO3 boundary, asyncio/Tokio bridge, builder allocations) well below 580K, so they don’t see either of these two artifacts. Earlier versions of this doc quoted 810K and 405K as “the cluster ceiling”; both were artifacts of the two issues above plus an older services-alternate routing bug. There is no real cluster constraint visible from any default-config Python client.
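The fail-fast pool behaviour from the second item can be illustrated with a toy model. This is a sketch under stated assumptions, not aerospike-core's pool: `FailFastPool`, `one_op`, and `run_burst` are hypothetical names, a short `asyncio.sleep` stands in for a server round trip, and the model covers one burst of concurrent operations rather than a sustained run (so it does not try to reproduce the ~92% figure).

```python
import asyncio


class PoolExhausted(Exception):
    """Raised instead of waiting when every connection is in use."""


class FailFastPool:
    def __init__(self, max_conns):
        self.max_conns = max_conns
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.max_conns:
            raise PoolExhausted  # fail fast: no queueing for a free connection
        self.in_use += 1

    def release(self):
        self.in_use -= 1


async def one_op(pool):
    """Return True if the op got a connection, False if the pool refused it."""
    try:
        pool.acquire()
    except PoolExhausted:
        return False
    try:
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        return True
    finally:
        pool.release()


async def _burst(concurrency, max_conns):
    pool = FailFastPool(max_conns)
    results = await asyncio.gather(*(one_op(pool) for _ in range(concurrency)))
    ok = sum(results)
    return ok, concurrency - ok


def run_burst(concurrency, max_conns):
    """Return (succeeded, refused) for one burst of concurrent ops."""
    return asyncio.run(_burst(concurrency, max_conns))
```

With 512 concurrent operations, a 256-connection pool refuses the second half of the burst, while a pool sized to 512 serves all of them: the same qualitative effect as sizing the pool to match concurrency.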
ct_runtime is experimental — measurement-only on this table
The ct_runtime rows above use PAC’s --current-thread-runtime mode (sync only): each Python thread gets its own Tokio current-thread runtime via PAC’s _LocalClient proxy. This sidesteps the multi-thread Tokio worker-pool hop and raises the sync ceiling (in the table above, PAC sync ~209K → ~271K; PSDK sync fast-path ~214K → ~266K).
But ct_runtime is not production-ready. Each per-thread runtime owns its own Cluster, which means:
- N× cluster-tend threads (32 Python threads = 32 tend loops polling the cluster every second)
- N× connection pools (~384 connections per process at default settings)
- Incomplete _with_overrides surface: some PAC methods still hit the shared runtime even when ct_runtime is on
These numbers are included for measurement transparency; treat them as an experimental performance lever, not a recommended deployment.
Cross-client latency¶
p50 / p99 / p99.9 in microseconds, sampled 1-in-100 ops during measurement. Framework rows are rounded to 100 µs precision (the per-second histogram bucket size); Rust-core rows are exact.
| Client / Mode | Threads / Tasks | FT p50 / p99 / p99.9 (µs) | non-FT p50 / p99 / p99.9 (µs) |
|---|---|---|---|
| PSDK sync, fast-path | 32 | 100 / 300 / 500 | 500 / 2,500 / 3,200 |
| PSDK sync, fast-path, ct_runtime | 32 | 100 / 200 / 400 | 500 / 2,400 / 3,400 |
| PSDK sync, builder | 32 | 100 / 900 / 3,700 | 1,400 / 5,000 / 5,800 |
| PSDK sync, fast-path | 1 | 100 / 100 / 200 | 100 / 100 / 100 |
| PSDK async single-loop, fast-path | 32 tasks | 200 / 300 / 600 | 500 / 600 / 700 |
| PSDK async single-loop, builder | 32 tasks | 500 / 600 / 800 | 1,000 / 1,200 / 1,300 |
| PSDK async AsyncPool, fast-path | 4×64 | 900 / 2,500 / 3,500 | 4,800 / 13,300 / 16,400 |
| PSDK async AsyncPool, fast-path | 8×64 | 1,700 / 4,100 / 5,800 | (FT only) |
| PSDK async AsyncPool, builder | 4×64 | 1,400 / 2,600 / 3,800 | 7,800 / 25,300 / 26,000 |
| PAC sync | 32 | 100 / 300 / 500 | 600 / 2,800 / 3,800 |
| PAC sync, ct_runtime | 32 | 100 / 200 / 300 | 500 / 2,300 / 3,300 |
| PAC sync | 1 | 100 / 100 / 100 | 100 / 100 / 100 |
| PAC async | 32 tasks | 200 / 300 / 500 | 400 / 400 / 500 |
| PAC async | 1 task | 100 / 200 / 200 | 100 / 200 / 200 |
| Rust core, async (default) | 32 tasks | (sampled) p99 ~190 | n/a (no GIL) |
| Rust core, sync (default) | 32 | (sampled) p99 ~200 | n/a (no GIL) |
| Rust core, async | 1 task | p99 ~140 | n/a (no GIL) |
| Rust core, sync | 1 | p99 ~115 | n/a (no GIL) |
Framework latency is histogram-bucketed at 100 µs granularity (--with-telemetry’s sampling resolution); Rust-core latency is sampled exactly. Framework cells with reported p50 under 100 µs round up to the 100 µs bucket boundary.
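The bucketing rule can be written down in a couple of lines. The sketch assumes each sample is reported at the upper boundary of its 100 µs bucket (a ceiling); the text only states that sub-100 µs values round up, so the treatment of larger values is an assumption, and `bucket_us` is a hypothetical name.

```python
import math


def bucket_us(latency_us, width_us=100):
    """Report a latency at the upper boundary of its histogram bucket."""
    # Anything at or below one bucket width lands on the first boundary.
    return max(width_us, math.ceil(latency_us / width_us) * width_us)
```

Under this assumption a 42 µs sample and a 100 µs sample both report as 100, which is why the single-thread framework rows cannot be compared against the exact Rust-core values below the 100 µs mark.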
Batch sweeps¶
The single-key cells above measure one record per execute(). Real applications often batch multiple keys per call to amortize network and per-op overhead. The sweeps below hold concurrency constant (32 threads / tasks) and vary --batch-size. Free-threaded only.
PSDK sync builder¶
session.query([keys]).execute() and session.batch().upsert(k).put(b).execute(). Routes through PAC’s batch_read_blocking / batch_operate_blocking directly — no asyncio loop in the path.
| Batch size | Total TPS | × b=1 |
|---|---|---|
| 1 | 145,898 | 1.00× |
| 4 | 142,895 | 0.98× |
| 16 | 328,253 | 2.25× |
| 32 | 401,467 | 2.75× |
| 64 | 470,720 | 3.23× |
| 128 | 485,056 | 3.32× |
PSDK async single-loop builder¶
await session.query([keys]).execute() and friends — one event loop, 32 concurrent tasks.
| Batch size | Total TPS | × b=1 |
|---|---|---|
| 1 | 64,569 | 1.00× |
| 4 | 59,514 | 0.92× |
| 16 | 121,155 | 1.88× |
| 32 | 144,885 | 2.24× |
| 64 | 174,080 | 2.70× |
| 128 | 204,736 | 3.17× |
PSDK async AsyncPool builder¶
Four event loops × 64 tasks per loop. Free-threaded only.
| Batch size | Total TPS | × b=1 (pool) |
|---|---|---|
| 1 | 190,278 | 1.00× |
| 4 | 156,954 | 0.82× |
| 16 | 265,443 | 1.40× |
| 32 | 310,901 | 1.63× |
| 64 | 336,469 | 1.77× |
Headline: the PSDK sync builder scales through batch=128 to ~485K TPS — the highest framework number in the matrix. Sync batch routes via PAC’s batch_*_blocking entries with one PyO3 boundary per batch, so doubling the batch size keeps amortizing the per-call Python cost. The b=128 peak is 3.3× the single-key sync builder.
The async single-loop sweep tops out around 205K (batch=128) — the asyncio ↔ Tokio bridge cost per execute() doesn’t go away just because each call moves more data. AsyncPool recovers most of that by running 4 loops in parallel, hitting 336K at batch=64.
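The "× b=1" column is each row's TPS divided by the batch-size-1 TPS of the same sweep. A short check against the sync-builder figures reported above (the helper name `multipliers` is hypothetical):

```python
# Sync builder sweep: batch size -> total TPS, as reported in this guide.
SYNC_BUILDER_TPS = {
    1: 145_898,
    4: 142_895,
    16: 328_253,
    32: 401_467,
    64: 470_720,
    128: 485_056,
}


def multipliers(tps_by_batch):
    """Scale every entry by the batch-size-1 throughput of the same sweep."""
    base = tps_by_batch[1]
    return {batch: round(tps / base, 2) for batch, tps in tps_by_batch.items()}


if __name__ == "__main__":
    for batch, factor in multipliers(SYNC_BUILDER_TPS).items():
        print(f"batch={batch:<4} {factor:.2f}x")
```

The same helper applies to the async single-loop and AsyncPool sweeps; each sweep is normalised against its own b=1 row, so the multipliers are not comparable across sweeps.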
Stack cost analysis¶
Layering the headline single-key TPS numbers across clients shows where every transition costs. The Rust-core figures below are at the same default settings as the Python clients; Rust-core’s real cluster-side ceiling is ≥580K (with the per-op Tokio timer wheel bypassed AND max_conns_per_node sized to match concurrency; see the Rust-core notes under “Cross-client TPS” above). Python clients hit their own client-side ceilings well below 580K, so they aren’t sensitive to the Rust-core defaults that gate the higher number.
| Layer | TPS | Note |
|---|---|---|
| Rust core async, default settings | ~290,000 | |
| Rust core async, timer + pool sized | ~580,000 | Real cluster-side ceiling; current |
| Rust core sync, default settings | ~246,000 | |
| PSDK async AsyncPool, fast-path (8×64) | ~292,000 | 8 event loops × 64 tasks (FT only, uvloop) |
| PAC sync direct, ct_runtime | 271,066 | PyO3 wrapper, per-thread Tokio current-thread runtime |
| PSDK sync, fast-path, ct_runtime | 265,971 | SDK fast-path + ct_runtime |
| PSDK async AsyncPool, fast-path (4×64) | 260,325 | 4 event loops × 64 tasks (FT only, uvloop) |
| PSDK sync, fast-path | 214,489 | SDK |
| PAC sync direct (multi-thread Tokio) | 209,426 | PyO3 wrapper, shared Tokio multi-thread runtime |
| PSDK async AsyncPool, builder (4×64) | 181,838 | 4 loops, full builder path |
| PSDK sync, builder | 149,428 | SDK chained builder → execute → stream |
| PAC async direct, 32 tasks | 124,001 | PyO3 wrapper, asyncio ↔ Tokio bridge (with drainer + uvloop) |
| PSDK async single-loop, fast-path | 118,220 | One event loop, |
| PSDK async single-loop, builder | 68,402 | One event loop, full builder path |
| Python legacy (sync, non-FT) | ~15,000 | Single-thread C client baseline |
Sync stack — boundary cost is small¶
| Transition | TPS | Δ |
|---|---|---|
| Rust core sync (default settings) | ~246,000 | reference (default |
| → PAC sync direct (multi-thread Tokio) | 209,426 | −15% (PyO3 + Python boundary; Tokio thread-handoff in the per-op path) |
| → PSDK sync, fast-path | 214,489 | flat (SDK layer is essentially free) |
| → PSDK sync, builder | 149,428 | −30% vs fp (chained builder + stream wrap in Python) |
The PyO3 + per-op Python ↔ Tokio thread handoff costs ~15% relative to the equivalent direct rust-core sync number. The PSDK SDK layer is essentially free over PAC direct. (The cluster sustains higher absolute throughput than rust-core sync default, as the Rust-core notes under “Cross-client TPS” show, but with the default aerospike-core settings active, both Python and Rust-direct paths land in the same band.)
Async stack — closer to sync than it used to be¶
| Transition | TPS | Δ |
|---|---|---|
| PSDK sync, fast-path (sync reference) | 214,489 | reference |
| → PAC async direct (single loop, drainer + uvloop) | 124,001 | −42% (asyncio loop thread is the gating step) |
| → PSDK async single-loop, fast-path | 118,220 | −5% vs PAC async (PSDK SDK layer) |
| → PSDK async AsyncPool, fast-path (4×64) | 260,325 | +120% vs single-loop (parallelism across loops + uvloop inside pool, FT only), +21% above sync |
| → PSDK async AsyncPool, fast-path (8×64) | ~292,000 | +147% vs single-loop, +36% over sync |
Async key insight: post-drainer-thread + uvloop, the single-loop async ceiling sits around 120-130K. The bottleneck is now the asyncio loop thread doing per-op set_result and task wakeup work, single-threaded. AsyncPool (multi-loop) breaks past that ceiling by running 4-8 loops in parallel — at 8×64 it actually exceeds the sync fast-path ceiling. Only useful under free-threaded Python; under regular CPython the GIL serializes the loops and the pool is slower than a single client (see AsyncPool note).
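The Δ columns in the two stack tables are plain percentage changes between the TPS figures quoted in this guide. A small helper (hypothetical name `pct_delta`) reproduces them:

```python
def pct_delta(reference, value):
    """Percentage change of `value` relative to `reference`, rounded to a whole percent."""
    return round((value - reference) / reference * 100)


if __name__ == "__main__":
    # Sync stack
    print(pct_delta(246_000, 209_426))   # Rust core sync -> PAC sync direct
    print(pct_delta(214_489, 149_428))   # PSDK fast-path -> PSDK builder
    # Async stack
    print(pct_delta(214_489, 124_001))   # PSDK sync fast-path -> PAC async direct
    print(pct_delta(118_220, 260_325))   # single-loop -> AsyncPool 4x64
    print(pct_delta(118_220, 292_000))   # single-loop -> AsyncPool 8x64
```

Note that each Δ is relative to the row named in its cell (for example "vs single-loop" or "vs fp"), not always to the row directly above it.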
Practical takeaway¶
- PSDK SDK layer is essentially free on both sync and async paths: within about 5% of PAC direct on either side in the tables above (+2% sync, −5% async). Most cost is below PSDK in PAC + PyO3.
- PAC’s drainer thread moves all asyncio-loop wake-ups onto a single persistent waker thread, eliminating per-batch Python::attach churn on Tokio workers. This is what lifted async TPS substantially over earlier reference numbers (e.g., AsyncPool 4×64 went from 173K → 246K).
- uvloop is installed by default under FT and non-FT Linux/macOS. It lifts single-loop async ~15% on top of the drainer; multi-loop (AsyncPool) sees ~0-3% extra because the per-loop work is already parallelized.
- The chained-builder API pays a per-op Python tax on single-key calls (~30% vs fast-path on sync). On batch calls, that cost amortizes across keys: at batch=128 the sync builder reaches ~485K TPS, far above any single-key cell.
- For maximum throughput: use the sync builder with batches (session.batch() or multi-key session.query([keys])) on free-threaded Python when the workload tolerates batching, for ~485K TPS at batch=128. For single-key sync workloads, the fast-path (session.get / session.put) gives ~214K TPS. For async workloads, AsyncPool with 4-8 loops delivers ~260-292K TPS, above the sync fast-path ceiling. Reserve --current-thread-runtime (experimental; see the warning above) for tightly-controlled benchmarking, not production.
Fast-path vs builder¶
PSDK exposes two API shapes for single-key reads and writes:
- Builder (chained): session.query(key).execute() and session.upsert(key).put(bins).execute(). Returns a RecordStream of wrapped RecordResults. Supports filter expressions, error handlers, TTL overrides, generation checks, batch operations, and secondary-index queries.
- Fast-path (direct): session.get(key) and session.put(key, bins). Bypasses the builder + stream wrap and calls PAC’s native _blocking / async entry points directly with the session-cached policy. Single-key only; no filter / error-handler / TTL hooks. Errors raise directly (cache misses raise RecordNotFound).
Speedup of fast-path over builder on single-key dispatch at 32 threads / 4×64 tasks, FT:
| Config | Builder TPS | Fast-path TPS | Speedup |
|---|---|---|---|
| PSDK async, single client | 68,402 | 118,220 | 1.73× |
| PSDK async, AsyncPool 4×64 | 181,838 | 260,325 | 1.43× |
| PSDK sync | 149,428 | 214,489 | 1.44× |
These speedups are for single-key dispatch. With batching, the builder amortizes its per-op overhead across many keys per call: at batch=128 the sync builder reaches ~485K TPS (vs ~214K for sync fast-path). The fast-path stays single-key only; for any workload that can batch, the builder eventually wins.
The builder has irreducible Python overhead per op (builder object allocation, _OperationSpec finalization, RecordResult wrapping, generator-based stream iteration). The fast-path skips all of it.
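The difference between the two call shapes can be made concrete with a toy model. Everything below is a hypothetical in-memory stand-in (`ToySession`, `ToyQueryBuilder`, `ToyRecordStream`), not PSDK code; it only mirrors the shapes named above so the extra per-op allocations on the builder path are visible.

```python
import timeit


class ToyRecordStream:
    """Stand-in for the result stream: wraps results and iterates lazily."""

    def __init__(self, results):
        self._results = results

    def __iter__(self):
        yield from self._results  # generator-based iteration, one frame per stream


class ToyQueryBuilder:
    """Stand-in for the chained builder: one object allocated per operation."""

    def __init__(self, store, key):
        self._store = store
        self._key = key

    def execute(self):
        return ToyRecordStream([self._store[self._key]])


class ToySession:
    def __init__(self):
        self._store = {}

    def put(self, key, bins):          # fast-path write
        self._store[key] = bins

    def get(self, key):                # fast-path read: no builder, no stream
        return self._store[key]

    def query(self, key):              # builder read
        return ToyQueryBuilder(self._store, key)


if __name__ == "__main__":
    session = ToySession()
    session.put(1, {"b0": 1})
    fast = timeit.timeit(lambda: session.get(1), number=200_000)
    built = timeit.timeit(lambda: next(iter(session.query(1).execute())), number=200_000)
    print(f"fast-path: {fast:.3f}s  builder: {built:.3f}s")
```

The toy has no network or PyO3 boundary, so its ratio says nothing about the measured speedups; it only shows that the builder shape does strictly more Python work per call.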
See performance.md for the user-facing decision guide.
AsyncPool is a free-threading feature¶
AsyncPool runs N event loops on N OS threads with one PAC client each. Its value is multi-thread parallelism across CPU cores — which only materializes under free-threaded Python (PYTHON_GIL=0).
Under non-FT Python the GIL still serializes all Python execution. AsyncPool ends up with 256 outstanding tasks across 4 threads competing for one interpreter, plus the per-loop orchestration overhead — typically net flat or slightly slower than a single-client async setup on the same Python binary:
| Config | non-FT TPS | vs single-loop non-FT |
|---|---|---|
| async single-loop, fast-path, 32 tasks | 105,959 | baseline |
| async AsyncPool 4×64, fast-path | 108,327 | +2% (uvloop in pool roughly recovers the overhead) |
| async single-loop, builder, 32 tasks | 64,496 | baseline |
| async AsyncPool 4×64, builder | 61,281 | −5% |
AsyncPool is roughly on par with single-client async under GIL-on Python now that pool loops use uvloop too. Pick the one that fits your code shape; the real AsyncPool win is reserved for free-threaded runs.
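The "N event loops on N OS threads" layout can be sketched with the standard library. This is not PSDK's AsyncPool: `run_pool` and its helpers are hypothetical, and `asyncio.sleep(0)` stands in for an awaited client call. It shows the structure only: one thread per loop, M tasks per loop, and no state shared between loops.

```python
import asyncio
import threading


async def _worker(ops, counts, slot):
    for _ in range(ops):
        await asyncio.sleep(0)  # stand-in for an awaited client operation
        counts[slot] += 1       # each slot is written by exactly one loop thread


async def _loop_main(tasks_per_loop, ops, counts, slot):
    await asyncio.gather(*(_worker(ops, counts, slot) for _ in range(tasks_per_loop)))


def run_pool(num_loops=4, tasks_per_loop=64, ops_per_task=100):
    """Run num_loops event loops, one per OS thread; return total ops completed."""
    counts = [0] * num_loops

    def thread_main(slot):
        # asyncio.run creates and owns a fresh event loop on this thread.
        asyncio.run(_loop_main(tasks_per_loop, ops_per_task, counts, slot))

    threads = [
        threading.Thread(target=thread_main, args=(slot,))
        for slot in range(num_loops)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)


if __name__ == "__main__":
    print(run_pool())
```

On a GIL-enabled interpreter the loop threads take turns executing Python code, so this layout adds orchestration without adding parallelism; with the GIL off, each loop thread can run on its own core.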
Error classification¶
The framework treats RecordNotFound (cache miss on a point read) as a successful read with no record — not an error. This matches the semantics used by other Aerospike SDKs. Real errors (timeouts, connection failures, server-side errors, etc.) are counted separately as either Errors: or Timeouts: in the per-second ticker and the summary block.
To verify error accounting on a fresh dataset, pass --truncate to the bench command; with the fix in place all modes report Errors: 0 even when half the early reads cache-miss.
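The accounting rule can be sketched as a small classifier. The exception class below is a hypothetical stand-in for the client's not-found error, the built-in `TimeoutError` stands in for the client's timeout type, and `classify`, `tally`, and `raiser` are illustrative names rather than harness functions.

```python
from collections import Counter


class RecordNotFound(Exception):
    """Hypothetical stand-in for the client's record-not-found exception."""


def classify(op):
    """Run one operation and bucket the outcome the way the ticker reports it."""
    try:
        op()
    except RecordNotFound:
        return "ok"        # a miss on a point read is a successful read of nothing
    except TimeoutError:
        return "timeout"
    except Exception:
        return "error"     # connection failures, server-side errors, etc.
    return "ok"


def tally(ops):
    return Counter(classify(op) for op in ops)


def raiser(exc):
    """Build an operation that fails with the given exception (for demonstration)."""
    def op():
        raise exc
    return op


if __name__ == "__main__":
    ops = [lambda: None, raiser(RecordNotFound()), raiser(TimeoutError()), raiser(OSError())]
    print(tally(ops))
```

The order of the `except` clauses matters: the not-found case has to be matched before the generic handler, otherwise misses on a freshly truncated set would be counted as errors.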