Refresh and embed

gitcrawl refresh is the one command most users want. It runs sync → embed → cluster in order, with the same flags you would use individually.

#refresh

gitcrawl refresh owner/repo

By default this performs:

  1. Sync — open + recently closed issues and PRs (see Sync)
  2. Embed — fill thread_vectors for any thread whose document changed
  3. Cluster — rebuild durable clusters with the standard thresholds

Disable any stage with --no-sync, --no-embed, --no-cluster. The remaining stages still run; failures are reported per stage in the JSON output.

#Stage-specific flags

refresh forwards flags through to the underlying stages:

| Forwarded to | Flags |
| --- | --- |
| sync | --since, --state, --limit, --include-comments |
| embed | --limit |
| cluster | --threshold (0.80), --min-size (1), --max-cluster-size (40), --k (16), --cross-kind-threshold (0.93) |

--include-code is accepted but currently a no-op.

#JSON output

gitcrawl refresh owner/repo --json
{
  "repository": "owner/repo",
  "sync": { "selected": 124, "inserted": 12, "updated": 9, "run_id": 42 },
  "embed": { "selected": 21, "embedded": 21, "skipped": 0, "failed": 0, "model": "text-embedding-3-small", "run_id": 43 },
  "cluster": {
    "threshold": 0.8, "cross_kind": 0.93, "min_size": 1, "max_size": 40, "k": 16,
    "vector_count": 312, "edge_count": 1042, "cluster_count": 87, "member_count": 312, "run_id": 44
  }
}

Each stage object mirrors the JSON shape of the standalone command. You can read the per-stage run_id later via gitcrawl runs --kind sync|embedding|cluster.
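For automation, the per-stage run_ids can be pulled straight out of the --json output. A minimal sketch, using an inline payload in the shape shown above (the values are illustrative):

```python
import json

# Sample `gitcrawl refresh --json` payload in the documented shape.
payload = json.loads("""
{
  "repository": "owner/repo",
  "sync": {"selected": 124, "inserted": 12, "updated": 9, "run_id": 42},
  "embed": {"selected": 21, "embedded": 21, "skipped": 0, "failed": 0,
            "model": "text-embedding-3-small", "run_id": 43},
  "cluster": {"threshold": 0.8, "cross_kind": 0.93, "min_size": 1,
              "max_size": 40, "k": 16, "vector_count": 312,
              "edge_count": 1042, "cluster_count": 87,
              "member_count": 312, "run_id": 44}
}
""")

# Map each stage to its run_id. Stages disabled with --no-* are assumed
# to be absent from the payload, so they are simply skipped here.
run_ids = {stage: payload[stage]["run_id"]
           for stage in ("sync", "embed", "cluster") if stage in payload}
print(run_ids)
```

Each collected run_id can then be looked up with the matching `gitcrawl runs --kind` invocation.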

#embed

gitcrawl embed owner/repo

Generates OpenAI embeddings for any thread whose document hash has changed since its last embedding. Works through the database in batches (default size 64) with bounded concurrency (default 2).

#Flags

| Flag | Default | Description |
| --- | --- | --- |
| --number <n> | (any) | Embed a single issue/PR by number |
| --limit <n> | (no limit) | Maximum rows to embed in this run |
| --force | (off) | Re-embed every selected row, ignoring content hash |
| --include-closed | (off) | Include closed threads |
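When driving embed from a script, the flags above compose into an argv list. A hypothetical helper (the flag names come from the table; the helper itself is not part of gitcrawl):

```python
def embed_argv(repo, number=None, limit=None, force=False, include_closed=False):
    """Build a `gitcrawl embed` command line from the documented flags."""
    cmd = ["gitcrawl", "embed", repo]
    if number is not None:
        cmd += ["--number", str(number)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if force:
        cmd.append("--force")
    if include_closed:
        cmd.append("--include-closed")
    return cmd

print(embed_argv("owner/repo", limit=100, include_closed=True))
```

The resulting list can be handed to subprocess.run directly, which avoids shell-quoting issues with repo names.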

#When to --force

You should rarely need it. The pipeline auto-forces a rebuild when:

  • The configured embedding model changes (GITCRAWL_EMBED_MODEL or embed_model in config)
  • The embedding input rune cap changes (so older, larger-cap vectors are not silently mixed in)

Use --force by hand only if you have edited vectors directly, or if you want to confirm an output is reproducible from scratch.

#Failure handling

OpenAI errors are retried with backoff unless GITCRAWL_OPENAI_RETRY_DISABLED=1. The JSON output includes a failures array with batch-level diagnostics (batch_start, batch_end, attempts, status, code, message) so partial failures do not silently drop rows.

Oversized inputs are capped before being sent upstream so a single huge body cannot exceed the model's input limit.

#JSON output

{
  "repository": "owner/repo",
  "model": "text-embedding-3-small",
  "basis": "title_original",
  "selected": 21,
  "embedded": 20,
  "skipped": 0,
  "failed": 1,
  "retries": 3,
  "status": "ok",
  "failures": [
    { "batch_start": 16, "batch_end": 17, "attempts": 3, "status": 429, "type": "rate_limit", "code": "rate_limit_exceeded", "message": "..." }
  ],
  "run_id": 43
}
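When failed is nonzero, the failures array identifies which batches were dropped. A sketch that decides whether a follow-up pass is needed, assuming the shape above and that batch_end is exclusive (which matches the sample's failed: 1):

```python
import json

# Sample embed result in the documented shape (values illustrative).
result = json.loads("""
{
  "selected": 21, "embedded": 20, "skipped": 0, "failed": 1,
  "retries": 3, "status": "ok",
  "failures": [
    {"batch_start": 16, "batch_end": 17, "attempts": 3,
     "status": 429, "type": "rate_limit",
     "code": "rate_limit_exceeded", "message": "..."}
  ]
}
""")

# Count rows covered by failed batches, treating batch_end as exclusive.
failed_rows = sum(f["batch_end"] - f["batch_start"] for f in result["failures"])
rate_limited = any(f["status"] == 429 for f in result["failures"])

# Failed rows keep a stale content hash, so simply re-running
# `gitcrawl embed` should select them again on the next pass.
needs_retry = result["failed"] > 0
print(failed_rows, rate_limited, needs_retry)
```

If rate_limited is set, backing off before the retry pass is likely cheaper than hammering the 429 again.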

#runs

Inspect what refresh, sync, embed, or cluster actually did:

gitcrawl runs owner/repo --kind sync       # default kind
gitcrawl runs owner/repo --kind embedding
gitcrawl runs owner/repo --kind cluster

Each row carries started_at, finished_at, status, and stats_json — useful when an agent needs to know whether a sync is fresh enough or whether the last cluster pass converged.
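An agent checking "fresh enough" can compare finished_at against a staleness budget. A minimal sketch, assuming the runs rows carry ISO-8601 timestamps and an "ok" status string (the field names come from the list above; the exact status values are an assumption):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(run, max_age=timedelta(hours=6), now=None):
    """True if the run finished successfully within the staleness budget."""
    if run["status"] != "ok":  # assumed success value, mirrors the embed JSON
        return False
    finished = datetime.fromisoformat(run["finished_at"])
    now = now or datetime.now(timezone.utc)
    return now - finished <= max_age

run = {"started_at": "2024-05-01T11:55:00+00:00",
       "finished_at": "2024-05-01T12:00:00+00:00",
       "status": "ok"}
print(is_fresh(run, now=datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)))
```

Tighten max_age for fast-moving repos; a stale result is the cue to run `gitcrawl refresh` before querying clusters.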

#Cost notes

  • GitHub. Sync uses standard REST endpoints; the API quota is the dominant cost on busy repos. Use --include-comments and --with pr-details selectively.
  • OpenAI. text-embedding-3-small is inexpensive but not free. embed is bounded by --limit if you want to stay under a budget on initial backfills.
  • Disk. Vectors, generated documents, and raw API payloads grow with the repo. The portable-store flow includes gitcrawl portable prune to keep published payloads small while retaining compact comments and PR details — see Portable stores.