Refresh and embed
gitcrawl refresh is the one command most users want. It runs sync → embed → cluster in order, with the same flags you would use individually.
#refresh
gitcrawl refresh owner/repo
By default this performs:
- Sync — open + recently closed issues and PRs (see Sync)
- Embed — fill thread_vectors for any thread whose document changed
- Cluster — rebuild durable clusters with the standard thresholds
Disable any stage with --no-sync, --no-embed, --no-cluster. The remaining stages still run; failures are reported per stage in the JSON output.
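The stage ordering and per-stage failure reporting can be pictured with a minimal sketch (the `run_*` functions are hypothetical stand-ins for the real stages; only the control flow mirrors the docs):

```python
def run_sync(repo):    return {"selected": 0, "run_id": 1}       # stand-in stages;
def run_embed(repo):   return {"embedded": 0, "run_id": 2}       # the real ones do
def run_cluster(repo): return {"cluster_count": 0, "run_id": 3}  # the actual work

def refresh(repo, no_sync=False, no_embed=False, no_cluster=False):
    """Run sync -> embed -> cluster in order; a failed stage is reported, not fatal."""
    stages = [("sync", run_sync, no_sync),
              ("embed", run_embed, no_embed),
              ("cluster", run_cluster, no_cluster)]
    report = {"repository": repo}
    for name, fn, disabled in stages:
        if disabled:
            continue  # --no-<stage>: skip it, but the remaining stages still run
        try:
            report[name] = fn(repo)
        except Exception as exc:
            report[name] = {"status": "error", "message": str(exc)}
    return report
```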
#Stage-specific flags
refresh forwards flags through to the underlying stages:
| Forwarded to | Flags |
|---|---|
| sync | --since, --state, --limit, --include-comments |
| embed | --limit |
| cluster | --threshold (0.80), --min-size (1), --max-cluster-size (40), --k (16), --cross-kind-threshold (0.93) |
--include-code is accepted but currently a no-op.
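The forwarding behavior amounts to partitioning the parsed options by stage. A sketch (the mapping mirrors the table above; the parsing itself is assumed to have happened already):

```python
FORWARDING = {
    "sync":    {"--since", "--state", "--limit", "--include-comments"},
    "embed":   {"--limit"},
    "cluster": {"--threshold", "--min-size", "--max-cluster-size", "--k",
                "--cross-kind-threshold"},
}

def split_flags(options):
    """Route each parsed --flag=value pair to every stage that accepts it."""
    per_stage = {stage: {} for stage in FORWARDING}
    for flag, value in options.items():
        for stage, accepted in FORWARDING.items():
            if flag in accepted:
                per_stage[stage][flag] = value
    return per_stage
```

Note that `--limit` is accepted by both sync and embed, so it lands in both stages' option sets.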
#JSON output
gitcrawl refresh owner/repo --json

```json
{
  "repository": "owner/repo",
  "sync": { "selected": 124, "inserted": 12, "updated": 9, "run_id": 42 },
  "embed": { "selected": 21, "embedded": 21, "skipped": 0, "failed": 0, "model": "text-embedding-3-small", "run_id": 43 },
  "cluster": {
    "threshold": 0.8, "cross_kind": 0.93, "min_size": 1, "max_size": 40, "k": 16,
    "vector_count": 312, "edge_count": 1042, "cluster_count": 87, "member_count": 312, "run_id": 44
  }
}
```
Each stage object mirrors the JSON shape of the standalone command. You can read the per-stage run_id later via gitcrawl runs --kind sync|embedding|cluster.
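Because each stage carries its own run_id, a script can pull them out of a single refresh report for later lookup. A small sketch using an abbreviated version of the output above:

```python
import json

refresh_output = """{
  "repository": "owner/repo",
  "sync":    {"selected": 124, "inserted": 12, "updated": 9, "run_id": 42},
  "embed":   {"selected": 21, "embedded": 21, "skipped": 0, "failed": 0, "run_id": 43},
  "cluster": {"cluster_count": 87, "member_count": 312, "run_id": 44}
}"""

report = json.loads(refresh_output)
# Collect per-stage run_ids so they can be looked up later with `gitcrawl runs`.
run_ids = {stage: data["run_id"]
           for stage, data in report.items()
           if isinstance(data, dict) and "run_id" in data}
print(run_ids)  # {'sync': 42, 'embed': 43, 'cluster': 44}
```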
#embed
gitcrawl embed owner/repo
Generates OpenAI embeddings for any thread whose document hash has changed since its last embedding. Works through the database in batches (default size 64) with bounded concurrency (default 2).
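The selection and batching logic can be sketched as follows (the row shape and the `embedded_hash` field name are hypothetical; the hash-comparison idea and the batch size of 64 are from the docs):

```python
import hashlib

BATCH_SIZE = 64  # documented default batch size

def doc_hash(document: str) -> str:
    return hashlib.sha256(document.encode("utf-8")).hexdigest()

def select_stale(threads, force=False):
    """Pick rows whose document hash changed since their last embedding."""
    return [t for t in threads
            if force or t.get("embedded_hash") != doc_hash(t["document"])]

def batches(rows, size=BATCH_SIZE):
    """Yield fixed-size batches, to be embedded with bounded concurrency."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]
```

Threads whose stored hash still matches are skipped entirely, which is what keeps repeat runs cheap.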
#Flags
| Flag | Default | Description |
|---|---|---|
| --number <n> | (any) | Embed a single issue/PR by number |
| --limit <n> | (no limit) | Maximum rows to embed in this run |
| --force | (off) | Re-embed every selected row, ignoring content hash |
| --include-closed | (off) | Include closed threads |
#When to --force
You should rarely need it. The pipeline auto-forces a rebuild when:
- The configured embedding model changes (GITCRAWL_EMBED_MODEL or embed_model in config)
- The embedding input rune cap changes (so older, larger-cap vectors are not silently mixed in)
Use --force yourself only if you have hand-edited vectors, or want to confirm an output is reproducible from scratch.
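The auto-force check boils down to comparing the current configuration against the state recorded at the last embedding run. A sketch (the key names are hypothetical; the two triggers are the documented ones):

```python
def needs_full_rebuild(stored, current):
    """True when the pipeline should auto-force a full re-embed.

    `stored` is state recorded at the last embedding run. The triggers
    mirror the docs: the embedding model changed, or the input rune cap
    changed (so mixed-cap vectors are never silently combined).
    """
    return (stored.get("embed_model") != current["embed_model"]
            or stored.get("rune_cap") != current["rune_cap"])
```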
#Failure handling
OpenAI errors are retried with backoff unless GITCRAWL_OPENAI_RETRY_DISABLED=1. The JSON output includes a failures array with batch-level diagnostics (batch_start, batch_end, attempts, status, code, message) so partial failures do not silently drop rows.
Oversized inputs are capped before being sent upstream so a single huge body cannot exceed the model's input limit.
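Retry-with-backoff and input capping can be sketched together (the backoff schedule, cap value, and function names here are illustrative; the environment variable is the documented one):

```python
import os
import time

def cap_runes(text: str, cap: int = 8000) -> str:
    """Truncate oversized input before it is sent upstream (cap is illustrative)."""
    return text[:cap]

def embed_batch_with_retry(call, batch, max_attempts=3):
    """Call the embedding API for one batch, retrying with backoff on failure.

    Honors the documented GITCRAWL_OPENAI_RETRY_DISABLED=1 kill switch;
    an exhausted batch raises, to be recorded in the failures array rather
    than silently dropping rows.
    """
    retries_enabled = os.environ.get("GITCRAWL_OPENAI_RETRY_DISABLED") != "1"
    attempts = 0
    while True:
        attempts += 1
        try:
            return call(batch), attempts
        except Exception:
            if not retries_enabled or attempts >= max_attempts:
                raise
            time.sleep(0.1 * 2 ** attempts)  # exponential backoff
```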
#JSON output
```json
{
  "repository": "owner/repo",
  "model": "text-embedding-3-small",
  "basis": "title_original",
  "selected": 21,
  "embedded": 20,
  "skipped": 0,
  "failed": 1,
  "retries": 3,
  "status": "ok",
  "failures": [
    { "batch_start": 16, "batch_end": 17, "attempts": 3, "status": 429, "type": "rate_limit", "code": "rate_limit_exceeded", "message": "..." }
  ],
  "run_id": 43
}
```
#runs
Inspect what refresh, sync, embed, or cluster actually did:
gitcrawl runs owner/repo --kind sync # default kind
gitcrawl runs owner/repo --kind embedding
gitcrawl runs owner/repo --kind cluster
Each row carries started_at, finished_at, status, and stats_json — useful when an agent needs to know whether a sync is fresh enough or whether the last cluster pass converged.
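An agent-side freshness check over a run row might look like this (a sketch: the `"ok"` status value and ISO-8601 timestamps are assumptions; the field names are the documented ones):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(run, max_age=timedelta(hours=1)):
    """True when the run finished recently and cleanly.

    `run` holds the started_at / finished_at / status fields from `runs`.
    """
    if run.get("status") != "ok" or not run.get("finished_at"):
        return False
    finished = datetime.fromisoformat(run["finished_at"])
    return datetime.now(timezone.utc) - finished <= max_age
```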
#Cost notes
- GitHub. Sync uses standard REST endpoints; the API quota is the dominant cost on busy repos. Use --include-comments and --with pr-details selectively.
- OpenAI. text-embedding-3-small is inexpensive but not free. embed is bounded by --limit if you want to stay under a budget on initial backfills.
- Disk. Vectors, generated documents, and raw API payloads grow with the repo. The portable-store flow includes gitcrawl portable prune to keep published payloads small while retaining compact comments and PR details — see Portable stores.