For Coding Agents¶
Prescriptive reference for AI coding agents. Tables and code over prose. Terminology matches the codebase exactly.
Architecture Map¶
File Layout¶
| Path | Contains |
|---|---|
iscc_search/__init__.py |
Exports SearchOptions, search_opts, PlatformDirs |
iscc_search/options.py |
SearchOptions (Pydantic settings, env vars), get_index() factory |
iscc_search/config.py |
AppConfig, ConfigManager (CLI multi-index JSON config) |
iscc_search/models.py |
IsccBase, IsccID, IsccUnit, IsccCode, IsccItem |
iscc_search/schema.py |
Auto-generated Pydantic models from openapi/openapi.yaml |
iscc_search/processing.py |
Text tokenization utilities |
iscc_search/protocols/index.py |
IsccIndexProtocol (runtime-checkable Protocol) |
iscc_search/indexes/common.py |
Shared: serialization, ID encoding, validation, query normalization |
iscc_search/indexes/memory/index.py |
MemoryIndex (dict-based, no persistence) |
iscc_search/indexes/lmdb/index.py |
LmdbIndex (LMDB storage + inverted prefix search) |
iscc_search/indexes/lmdb/manager.py |
LmdbIndexManager (multi-index over .lmdb files) |
iscc_search/indexes/usearch/index.py |
UsearchIndex (HNSW + LMDB hybrid) |
iscc_search/indexes/usearch/manager.py |
UsearchIndexManager (multi-index over directories) |
iscc_search/indexes/simprint/lmdb_ops.py |
LMDB operations for simprints (pack/unpack, IDF, exact search) |
iscc_search/indexes/simprint/usearch_core.py |
UsearchSimprintIndex (ShardedIndex128 approximate search) |
iscc_search/indexes/simprint/models.py |
msgspec structs: MatchedChunkRaw, SimprintMatchRaw |
iscc_search/cli/ |
Typer CLI: add, datasets, get, hub, search, serve, index subcommands |
iscc_search/server/ |
FastAPI app: endpoints for indexes, assets, search, auth, health |
iscc_search/remote/client.py |
IsccSearchClient (HTTP client for remote servers) |
iscc_search/openapi/ |
Modular OpenAPI 3.0 spec (YAML fragments + bundled JSON) |
Class Hierarchy¶
IsccIndexProtocol (typing.Protocol, runtime_checkable)
├── MemoryIndex (indexes/memory/)
├── LmdbIndexManager (indexes/lmdb/)
│ └── LmdbIndex (per-index LMDB file)
└── UsearchIndexManager (indexes/usearch/)
└── UsearchIndex (per-index directory)
├── ShardedNphdIndex (from iscc-usearch, per unit-type)
├── UsearchSimprintIndex (per simprint-type)
└── LMDB env (assets, metadata, instance, simprints)
Schema Types (auto-generated, do not edit)¶
| Class | Purpose |
|---|---|
IsccIndex |
Index metadata (name, assets count, size) |
IsccEntry |
Asset record (iscc_id, iscc_code, units, simprints, metadata) |
IsccQuery |
Search query (iscc_code or units or simprints) |
IsccSearchResult |
Search response (query + global_matches + chunk_matches) |
IsccGlobalMatch |
Per-asset match (iscc_id, score, unit_scores) |
IsccAddResult |
Add response (iscc_id, status: created/updated) |
IsccSimprint |
Simprint with offset and size |
IsccChunk |
Chunk-level match detail |
TextQuery |
Text-based search query |
Decision Dispatch¶
Which backend?¶
| Scenario | Backend | URI | Why |
|---|---|---|---|
| Unit tests | Memory | memory:// |
No persistence, no deps |
| Exact prefix search | LMDB | lmdb:///path |
Bidirectional prefix matching |
| Production similarity | USearch | usearch:///path |
HNSW + LMDB hybrid |
Which config system?¶
| Context | System | Source |
|---|---|---|
iscc-search serve |
options.py (SearchOptions) |
Env vars ISCC_SEARCH_* |
iscc-search add/get/hub/search |
config.py (ConfigManager) |
JSON ~/.iscc-search/config.json |
iscc-search datasets |
none (HF Hub listing only) | huggingface_hub.list_datasets |
Which method for a task?¶
| Task | Protocol method | HTTP endpoint |
|---|---|---|
| Create index | create_index(IsccIndex) |
POST /indexes |
| List indexes | list_indexes() |
GET /indexes |
| Get index info | get_index(name) |
GET /indexes/{name} |
| Delete index | delete_index(name) |
DELETE /indexes/{name} |
| Add assets | add_assets(name, [IsccEntry]) |
POST /indexes/{name}/assets |
| Get asset | get_asset(name, iscc_id) |
GET /indexes/{name}/assets/{iscc_id} |
| Search | search_assets(name, IsccQuery, limit) |
POST /indexes/{name}/search |
| Health check | N/A | GET /healthz (liveness), GET /readyz (readiness) |
Constraints and Invariants¶
Index Names¶
Pattern: ^[a-z][a-z0-9]*$. No underscores, hyphens, capitals, or special characters. Max
32 characters.
ISCC-ID Format¶
- Total: 10 bytes (2-byte header + 8-byte body)
- Header encodes MainType=ID and realm_id (0 or 1)
- Body: 52-bit timestamp + 12-bit server-id
- Body stored as uint64 big-endian key in LMDB
Realm Consistency¶
All assets in a single index must share the same realm_id (0 or 1). The first asset added
sets the realm. Subsequent assets with a different realm raise ValueError.
LMDB Limits¶
| Backend | max_dbs |
Reason |
|---|---|---|
| LmdbIndex | 16 | Assets + metadata + up to 14 unit-type indexes |
| UsearchIndex | 32 | Assets + metadata + instance + ~14 simprint types (2 DBs each) |
Concurrency¶
| Backend | Readers | Writers | Multi-process |
|---|---|---|---|
| Memory | Unlimited | Unlimited | No |
| LMDB | Multiple | Single (LMDB lock) | Reads only |
| USearch | Multiple | Single (threading.RLock) |
No - corrupts .usearch files |
USearch production: single worker process, use FastAPI async for concurrency.
MapFullError Auto-Retry¶
| Backend | Strategy | Limit |
|---|---|---|
| LmdbIndex | Double map_size |
Unbounded |
| UsearchIndex | Increment by min(old_size, 1 GB) |
10 retries, 1 TB max |
Side Effects Catalog¶
UsearchIndex.add_assets()¶
| Step | Mutation | Atomic with LMDB? |
|---|---|---|
Store asset JSON in LMDB __assets__ |
Disk write | Yes (inside txn) |
| Add INSTANCE units to LMDB dupsort | Disk write | Yes (inside txn) |
| Store simprint data in LMDB | Disk write | Yes (inside txn) |
| LMDB transaction commits | Disk flush | - |
| Update ShardedNphdIndex (remove + add) | Memory + dirty flag | No |
| Update UsearchSimprintIndex | Memory + dirty flag | No |
| Auto-flush if dirty >= flush_interval | Disk write | No |
LMDB is the source of truth. Derived HNSW indexes can be rebuilt from LMDB if corrupted.
UsearchIndex.flush()¶
Saves all dirty ShardedNphdIndex and UsearchSimprintIndex files to disk. Exception-safe: each sub-index saved independently.
UsearchIndex.close()¶
Calls flush(), then closes LMDB environment. Idempotent via _closed flag.
Scoring Pipeline (UsearchIndex.search_assets)¶
Per-unit scores:
INSTANCE → binary 1.0 (any prefix match)
Similarity → 1.0 - NPHD distance
Aggregation:
1. Filter: score >= match_threshold_units (default 0.75)
2. Weight: score^confidence_exponent (default exponent 4)
3. Average: sum(score^exponent) / sum(score)
4. Self-exclude query asset by iscc_id
5. Sort by score descending, take top limit
Simprint Scoring¶
Exact (LMDB): score = coverage × quality
coverage = unique_matched / total_query_simprintsquality = min-max normalized inverse frequency
Approximate (UsearchSimprintIndex): IDF-weighted with oversampling
- Search with
limit × oversampling_factor(default 20x) candidates - Group by asset, IDF-weight per-query matches
- Unmatched query simprints penalize the score
Task Recipes¶
Add a new index backend¶
- Create
iscc_search/indexes/yourbackend/package - Implement all methods of
IsccIndexProtocol - Add URI scheme handling in
options.get_index() - Add tests matching pattern
test_indexes_yourbackend_*.py - Update
docs/howto/index-backends.mdcomparison table
Add a new API endpoint¶
- Add route handler in
iscc_search/server/(new file or existing) - Add schema models in
iscc_search/openapi/YAML fragments - Run
uv run poe buildto regenerateschema.pyandopenapi.json - Add tests in
tests/test_server*.py
Modify OpenAPI schema¶
- Edit YAML fragments in
iscc_search/openapi/ - Run
uv run poe build-schemato regenerateiscc_search/schema.py - Run
uv run poe build-openapito bundleopenapi.json - Run
uv run poe build-validateto check validity - Never hand-edit
schema.py
Add a new CLI command¶
- Create handler in
iscc_search/cli/yourcommand.py - Register with Typer app in
iscc_search/cli/__init__.py - Use
get_active_index()fromcli/common.pyfor index access - Add tests in
tests/test_cli_*.py
Change Playbook¶
If modifying IsccIndexProtocol¶
- Update all three backends:
MemoryIndex,LmdbIndexManager,UsearchIndexManager - Update
IsccSearchClientinremote/client.py - Update server route handlers in
server/ - Update CLI commands if they use the changed method
- Update
docs/reference/api.md(auto-generated, just rebuild)
If modifying SearchOptions¶
- Check
server/__init__.pyfor options usage - Check
UsearchIndex.__init__()andsearch_assets()for threshold/HNSW params - Update
docs/reference/configuration.mdenv var table - Update
docs/howto/deployment.mdif production-relevant
If modifying OpenAPI spec¶
- Edit YAML fragments in
iscc_search/openapi/, not the bundled JSON - Run
uv run poe build(regenerates schema.py + bundles JSON + validates) - Update any code using changed schema classes
- Never edit
schema.pydirectly
If modifying indexes/common.py¶
- Changes to
serialize_asset()/deserialize_asset()affect ALL backends - Changes to
validate_iscc_id()affect all ID validation paths - Changes to
normalize_query()affect all search paths - Run full test suite - these functions are called everywhere
If modifying LMDB database schemas¶
- Changes to key/value format require data migration or re-index
- LMDB database names are string constants - grep for them
__assets__,__metadata__,__instance__are reserved names- Unit-type databases are created dynamically (e.g.,
CONTENT_TEXT_V0)
Common Mistakes¶
NEVER use multiple workers with usearch backend¶
Corrupts .usearch files silently. No file locking, no multi-process coordination.
# WRONG - data corruption
uvicorn iscc_search.server:app --workers 4
# CORRECT
uvicorn iscc_search.server:app
NEVER hand-edit schema.py¶
It is auto-generated from iscc_search/openapi/openapi.yaml. Your changes will be
overwritten by uv run poe build-schema.
NEVER mix realm_ids in one index¶
All assets must have the same realm_id (0 or 1). The first asset sets it. Adding an asset
with a different realm raises ValueError.
ALWAYS call close() on index instances¶
UsearchIndex.close() flushes dirty HNSW indexes to disk. Skipping it loses unsaved data.
LmdbIndex.close() closes the LMDB environment. Leaked handles block other processes.
ALWAYS use absolute imports¶
The codebase enforces absolute imports only. No relative imports (from . import).
NEVER use type annotations in function signatures¶
Use PEP 484 type comments instead. Exception: FastAPI route handlers and Typer command handlers require annotations.
# WRONG
def search(query: IsccQuery, limit: int = 100) -> IsccSearchResult:
# CORRECT
def search(query, limit=100):
# type: (IsccQuery, int) -> IsccSearchResult
ALWAYS preserve Windows URI path handling in get_index()¶
options.get_index() has Windows-specific path normalization (strips leading / from
/C:/path). Do not simplify or remove this logic.
NEVER assume derived indexes are in sync with LMDB¶
After add_assets(), LMDB commits first. HNSW updates happen outside the transaction. If the
process crashes between these steps, HNSW is stale. LMDB is always the source of truth.
Recovery: rebuild HNSW from LMDB data.
ALWAYS run full test suite after changes to common.py¶
Functions in indexes/common.py (serialization, validation, normalization) are used by all
three backends. A change there can break any backend.