Getting started
By the end of this tutorial you will have a working iscc-search installation, an index with ISCC codes, and search results showing similar content.
Prerequisites
- Python 3.10 or later
- uv or pip for package installation
Install iscc-search
Verify the installation:
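The installation can also be checked from Python. The following is a minimal sketch using only the standard library. It assumes the package is distributed under the name iscc-search; the helper installed_version is illustrative and not part of iscc-search.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str) -> str | None:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The distribution name "iscc-search" is an assumption.
print(installed_version("iscc-search"))
```

A printed version string means the package is visible to the current interpreter; None usually points to a different virtual environment being active.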
What is ISCC?
ISCC (International Standard Content Code, ISO 24138) is a content fingerprinting system for digital media. It generates short codes from content - text, images, audio, video - that preserve similarity. Two documents with overlapping content produce ISCC codes that are close in Hamming distance. iscc-search exploits this property to find similar content across large collections.
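To make "close in Hamming distance" concrete, here is a small sketch that counts differing bits between two equal-length fingerprints. The byte values are made up for illustration and are not real ISCC units, and the normalised similarity shown is one common convention, not necessarily the score iscc-search computes.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length fingerprints differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(a: bytes, b: bytes) -> float:
    """Map Hamming distance to a 0.0-1.0 range (illustrative convention)."""
    return 1.0 - hamming_distance(a, b) / (len(a) * 8)


# Made-up 64-bit fingerprints, not real ISCC units.
original = bytes.fromhex("f0f0f0f0f0f0f0f0")
near_copy = bytes.fromhex("f0f0f0f0f0f0f0f3")  # differs in 2 bits
unrelated = bytes.fromhex("0f0f0f0f0f0f0f0f")  # differs in all 64 bits

print(hamming_distance(original, near_copy), similarity(original, near_copy))
print(hamming_distance(original, unrelated), similarity(original, unrelated))
```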
For a deeper explanation, see the ISCC primer.
Create an index
An index stores ISCC codes and enables similarity search. Start with the memory:// backend - it keeps
everything in RAM and requires no setup.
Add ISCC codes
Each asset you add contains an ISCC-CODE - a composite fingerprint that encodes multiple similarity dimensions (content, data, instance). The index decomposes the code into individual units and indexes each one for search.
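The idea of decomposition can be sketched with the standard library alone. The example below decodes the sample ISCC-CODE used later in this tutorial and cuts its body into fixed-size pieces. It assumes a base32 payload with a 2-byte header followed by 64-bit unit bodies. The function names are hypothetical, and the real decomposition in iscc-search may differ, for example by rebuilding a header for each unit.

```python
import base64

# Sample ISCC-CODE taken from the search example in this tutorial.
ISCC_CODE = "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY"


def decode_iscc(code: str) -> bytes:
    """Strip the "ISCC:" prefix and decode the base32 payload to raw bytes."""
    payload = code.removeprefix("ISCC:")
    padding = "=" * (-len(payload) % 8)  # b32decode needs a multiple of 8 chars
    return base64.b32decode(payload + padding)


def split_units(raw: bytes, header_len: int = 2, unit_len: int = 8) -> list[bytes]:
    """Cut the body into unit-sized chunks (header and unit sizes are assumptions)."""
    body = raw[header_len:]
    return [body[i : i + unit_len] for i in range(0, len(body), unit_len)]


raw = decode_iscc(ISCC_CODE)
units = split_units(raw)
for unit in units:
    print(unit.hex())
```

Each printed chunk stands for one similarity dimension that can be compared independently of the others.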
Create a JSON file asset.json:
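The file can also be produced from Python. This is a minimal sketch that assumes the asset schema uses an iscc_code field, mirroring the iscc_code query parameter of the search endpoint; check the schema reference for the authoritative field names. The helper write_asset is hypothetical.

```python
import json
from pathlib import Path


def write_asset(path: Path, iscc_code: str) -> dict:
    """Write a one-asset JSON file and return the dictionary that was written."""
    asset = {"iscc_code": iscc_code}  # field name is an assumption
    path.write_text(json.dumps(asset, indent=2) + "\n", encoding="utf-8")
    return asset


if __name__ == "__main__":
    write_asset(
        Path("asset.json"),
        "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY",
    )
```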
Add it to the active index:
Each asset gets an auto-generated ISCC-ID (a unique identifier) if you do not provide one. The status
field in the result tells you whether the asset was created or updated.
Index a ready-made dataset
To experiment with real data without preparing JSON files, index one of the published ISCC
datasets from the Hugging Face Hub. List them with iscc-search datasets and pull one in:
hub auto-registers a local index named after the dataset.
Search for similar content
Pass an ISCC-CODE as a query. The engine compares it against all indexed codes and returns ranked matches.
The score field ranges from 0.0 to 1.0. A score of 1.0 means the codes are identical. Scores above 0.75
(the default threshold) indicate strong similarity.
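When post-processing results in your own code, the threshold can be applied as a simple filter. The sketch below uses a hypothetical Match record and made-up scores; it does not reproduce the actual result schema of iscc-search, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD = 0.75  # default threshold stated in this tutorial


@dataclass
class Match:
    asset_id: str  # hypothetical identifier field
    score: float   # similarity score between 0.0 and 1.0


def strong_matches(matches: list[Match], threshold: float = DEFAULT_THRESHOLD) -> list[Match]:
    """Keep matches at or above the threshold, best score first."""
    kept = [m for m in matches if m.score >= threshold]  # inclusive bound is an assumption
    return sorted(kept, key=lambda m: m.score, reverse=True)


# Made-up results for illustration.
results = [Match("c", 0.40), Match("b", 0.81), Match("a", 1.0)]
print([m.asset_id for m in strong_matches(results)])
```

Raising the threshold trades recall for precision: fewer matches are returned, but each is more likely to be a near-duplicate.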
Tip
You can also search using a GET request with a query parameter:
GET /indexes/myindex/search?iscc_code=ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY
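The request URL can be assembled with the standard library so that the index name and the code are escaped correctly. This sketch only builds the URL and sends nothing. The base address http://localhost:8000 is a placeholder and search_url is a hypothetical helper; the path and the iscc_code parameter come from the request shown above.

```python
from urllib.parse import quote, urlencode, urlsplit, urlunsplit

BASE_URL = "http://localhost:8000"  # placeholder; use your server's address


def search_url(base_url: str, index_name: str, iscc_code: str) -> str:
    """Build the GET search URL for one index and one ISCC-CODE."""
    scheme, netloc, path, _, _ = urlsplit(base_url)
    full_path = f"{path.rstrip('/')}/indexes/{quote(index_name, safe='')}/search"
    # urlencode percent-encodes the colon in "ISCC:" as %3A; servers decode it back.
    query = urlencode({"iscc_code": iscc_code})
    return urlunsplit((scheme, netloc, full_path, query, ""))


url = search_url(
    BASE_URL,
    "myindex",
    "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY",
)
print(url)
```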
Try a persistent backend
The memory:// backend loses data when the process exits. For persistent storage, use lmdb:// which
stores indexes on disk using LMDB (Lightning Memory-Mapped Database).
import os
os.environ["ISCC_SEARCH_INDEX_URI"] = "lmdb:///tmp/iscc-data"
from iscc_search.options import get_index
from iscc_search.schema import IsccIndex
index = get_index()
index.create_index(IsccIndex(name="persistent"))
# Add assets and search as before...
# Data survives restarts.
index.close()
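The index URI follows the usual scheme-plus-path shape: the scheme names the backend and, for lmdb://, the remainder is a filesystem path. The sketch below shows how such URIs split using the standard library. It illustrates the URI format only and is not how iscc-search parses its configuration; describe_index_uri is a hypothetical helper.

```python
from urllib.parse import urlsplit


def describe_index_uri(uri: str) -> tuple[str, str]:
    """Return (backend scheme, storage path) for an index URI."""
    parts = urlsplit(uri)
    return parts.scheme, parts.path


print(describe_index_uri("memory://"))               # no path: nothing is stored on disk
print(describe_index_uri("lmdb:///tmp/iscc-data"))   # three slashes: empty host, absolute path
```

Note the three slashes in lmdb:///tmp/iscc-data: two belong to the scheme separator and the third starts the absolute path /tmp/iscc-data.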
Note
For production workloads with large collections, use the usearch:// backend. It adds HNSW
(Hierarchical Navigable Small World) graph indexing for fast approximate nearest neighbor search.
See the index backends guide.
Next steps
- Index backends - configure memory, LMDB, and usearch backends
- REST API - run the API server and use all endpoints
- CLI reference - full command-line usage
- ISCC primer - how ISCC content codes work