Concepts
The handful of nouns gitcrawl uses, and how they connect.
#Repository mirror
A repository is the owner/repo you sync. Every gitcrawl command takes one, and most state in SQLite is keyed by it. You can mirror as many repos as you like into a single gitcrawl.db; commands always scope to the one you name.
The mirror is metadata-first: titles, bodies, authors, labels, state, timestamps, and IDs land in SQLite immediately. Comments, reviews, review comments, and full PR detail (files, commits, checks, workflow runs) are opt-in on a per-sync basis (see Sync).
#Thread
A thread is a single GitHub issue or pull request, with its body and metadata. The CLI exposes threads via gitcrawl threads and via the gh shim's gh issue/pr view and gh issue/pr list paths.
Threads have two state dimensions:
- GitHub state: open or closed upstream.
- Local close: a maintainer-only override stored locally. gitcrawl close-thread and reopen-thread flip this without touching GitHub. Local closes drive the --hide-closed and --include-closed filters across clusters, cluster-detail, the TUI, and search.
Local close is for triage workflow: "I have handled this duplicate locally, I do not need it shown next time." It does not write back to GitHub.
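The two state dimensions can be pictured with a small sketch. The Thread class, its field names, and visible_threads are hypothetical illustrations, not gitcrawl's schema or API. The sketch also assumes the filter looks only at the local flag; how gitcrawl treats the upstream state when filtering is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    number: int
    github_state: str             # "open" or "closed", as reported upstream
    locally_closed: bool = False  # maintainer-only override, never pushed to GitHub


def visible_threads(threads, include_closed=False):
    """Return threads to show; locally closed ones are hidden unless requested."""
    if include_closed:
        return list(threads)
    return [t for t in threads if not t.locally_closed]


threads = [
    Thread(101, "open"),
    Thread(102, "open", locally_closed=True),  # handled duplicate, still open upstream
    Thread(103, "closed"),
]

print([t.number for t in visible_threads(threads)])                       # [101, 103]
print([t.number for t in visible_threads(threads, include_closed=True)])  # [101, 102, 103]
```

Thread 102 stays open on GitHub in this sketch; only the local flag changes what is shown.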
#Document
A document is the canonical text gitcrawl indexes for a thread — title plus body, with comments folded in when present. Documents back the FTS index used by gitcrawl search and feed the embedding pipeline.
Most users never interact with documents directly; they show up in JSON output as a document field on neighbors and search hits.
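As a rough illustration of "title plus body, with comments folded in when present", a document could be assembled as below. build_document is a hypothetical helper, and the blank-line separator is an assumption; gitcrawl's actual text layout may differ.

```python
def build_document(title, body, comments=()):
    """Join title, body, and any comments into one indexable text."""
    parts = [title.strip(), body.strip()]
    parts.extend(c.strip() for c in comments)
    # Skip empty pieces so a thread with no body or comments still yields clean text.
    return "\n\n".join(p for p in parts if p)


doc = build_document(
    "Crash on start",
    "The app exits immediately after launch.",
    comments=["Same here on Linux."],
)
print(doc)
```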
#Embedding
An embedding is a vector representation of a thread's document, produced by an OpenAI model (default text-embedding-3-small, 1024 dimensions). Vectors live in ~/.config/gitcrawl/vectors and are referenced from the thread_vectors table.
The embedding basis controls what text gets embedded. The default title_original uses title plus an excerpt of the original body. This is configurable via gitcrawl configure --embedding-basis ... but only title_original is currently implemented.
gitcrawl embed is the explicit command that fills the vector table. gitcrawl refresh runs it automatically as part of its sync → embed → cluster pipeline.
When the embedding input rune cap or model changes, vectors are rebuilt to avoid stale comparisons.
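One way to think about the rebuild rule: store the settings that produced a vector, and treat the vector as stale when they no longer match. The sketch below assumes hypothetical names (EmbeddingSettings, embedding_input, is_stale) and a made-up rune cap of 4000; only the model name, dimension count, and basis name come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSettings:
    model: str = "text-embedding-3-small"
    dimensions: int = 1024
    basis: str = "title_original"
    rune_cap: int = 4000  # hypothetical value; the real cap is not stated here


def embedding_input(title, body, settings):
    """Build the text to embed, cut to the rune cap.

    Python slices strings by code point, which matches counting runes.
    """
    text = f"{title}\n\n{body}"
    return text[: settings.rune_cap]


def is_stale(stored, current):
    """A vector is stale when any setting that shaped it has changed."""
    return stored != current


stored = EmbeddingSettings()
current = EmbeddingSettings(rune_cap=2000)
print(is_stale(stored, current))   # True: the cap changed, so re-embed
print(is_stale(stored, stored))    # False
```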
#Cluster
A cluster is a group of related threads inferred from vector similarity, with deterministic GitHub reference evidence (#123, pull/123, issues/123) folded in to harden weak edges.
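The three reference forms named above (#123, pull/123, issues/123) can be picked out of thread text with a regular expression. This is a minimal sketch; REFERENCE_PATTERN and referenced_numbers are hypothetical names, and gitcrawl's own matching rules may be stricter (for example, checking the owner/repo in a URL).

```python
import re

# Matches "#123", "pull/123", and "issues/123" and captures the number.
REFERENCE_PATTERN = re.compile(r"(?:#|\bpull/|\bissues/)(\d+)\b")


def referenced_numbers(text):
    """Return the set of thread numbers mentioned in the text."""
    return {int(n) for n in REFERENCE_PATTERN.findall(text)}


text = "Duplicate of #123, see https://github.com/org/repo/pull/456 and issues/789"
print(sorted(referenced_numbers(text)))  # [123, 456, 789]
```

A thread whose text mentions another thread's number would then count as reference evidence for the edge between them.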
Clustering is run by gitcrawl cluster (or as part of gitcrawl refresh). Defaults are tuned to ghcrawl's profile: --threshold 0.80, --min-size 1, --max-cluster-size 40, --k 16 nearest-neighbor fanout, --cross-kind-threshold 0.93 for issue↔PR edges.
Two safeguards keep mega-clusters from forming:
- Title-token overlap. A weak embedding edge needs concrete shared title tokens unless its similarity is already high or there is direct GitHub reference evidence.
- Cross-kind pruning. Issue↔PR edges need a higher similarity floor (--cross-kind-threshold) than issue↔issue or PR↔PR edges.
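The two safeguards can be combined into a single edge test, sketched below. The two threshold values are the defaults quoted above; STRONG_SIMILARITY (the point at which an edge no longer counts as "weak"), the dictionary shape, and the function names are assumptions for illustration. The sketch also applies the similarity floor before considering reference evidence, which may not match gitcrawl's ordering.

```python
import re

THRESHOLD = 0.80             # --threshold default
CROSS_KIND_THRESHOLD = 0.93  # --cross-kind-threshold default
STRONG_SIMILARITY = 0.90     # hypothetical cutoff for "already high"


def title_tokens(title):
    """Lowercased alphanumeric tokens; a real tokenizer would likely drop stopwords."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def keep_edge(a, b, similarity, has_reference=False):
    """Decide whether a similarity edge between two threads survives pruning."""
    cross_kind = a["kind"] != b["kind"]
    floor = CROSS_KIND_THRESHOLD if cross_kind else THRESHOLD
    if similarity < floor:
        return False
    # Strong edges and edges backed by a GitHub reference skip the title check.
    if has_reference or similarity >= STRONG_SIMILARITY:
        return True
    # Weak edges need at least one shared title token.
    return bool(title_tokens(a["title"]) & title_tokens(b["title"]))


issue_a = {"kind": "issue", "title": "Crash when syncing large repos"}
issue_b = {"kind": "issue", "title": "Sync crash on large repository"}
issue_c = {"kind": "issue", "title": "Add dark mode"}
pr = {"kind": "pr", "title": "Fix crash when syncing"}

print(keep_edge(issue_a, issue_b, 0.82))  # True: weak edge, shared tokens
print(keep_edge(issue_a, issue_c, 0.82))  # False: weak edge, no shared tokens
print(keep_edge(issue_a, pr, 0.90))       # False: below the cross-kind floor
```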
#Cluster kinds
Every cluster ships with a kind that explains its shape:
- singleton_orphan: one member, no neighbors above threshold. Useful for surfacing isolated reports.
- duplicate_candidate: multiple members above the merge threshold. The default duplicate triage row.
#Durable clusters
A durable cluster is a stable, long-lived row in durable_clusters with a stable ID derived from its representative thread. Durable cluster IDs survive re-runs of cluster and refresh, so the local close, exclusion, and canonical-member overrides you apply persist across re-clustering.
gitcrawl clusters and gitcrawl tui show the latest raw run's clusters first, with closed durable rows merged in as historical context. Use gitcrawl durable-clusters for an audit view that stays on the durable rows.
#Cluster overrides (governance)
Per-cluster maintainer overrides let you correct what the algorithm produced without re-tuning thresholds:
- Local close (close-cluster / reopen-cluster): hides a duplicate-candidate from active triage.
- Member exclusion (exclude-cluster-member / include-cluster-member): pulls a specific thread out of a cluster and remembers why.
- Canonical member (set-cluster-canonical): pins which thread represents the cluster.
See Governance for the full workflow.
#Run
Every sync, embed, and cluster operation records a run in run_records with start/finish timestamps, status, and stage-specific stats. gitcrawl runs --kind sync|embedding|cluster lists them, which is useful for debugging or auditing.
#Portable store
A portable store is a Git-backed publish target for a gitcrawl.db plus its derived bodies, designed for sharing a local cache across agents or machines without a hosted service.
gitcrawl init --portable-store https://github.com/org/repo clones a portable store into ~/.config/gitcrawl/portable/ and points the runtime at it. gitcrawl portable prune --body-chars 256 keeps the published payload small while retaining comments, PR details, checks, and workflow runs. Read-only commands run against portable stores refresh the checkout before reading. See Portable stores.
#Cache
The cache/ directory under ~/.config/gitcrawl/ holds:
- cache/gh-shim/: the short-lived fallthrough cache for the gh shim, keyed by config path, CWD, GH_HOST, GH_REPO, and command args. Inspect or clean it with gitcrawl gh xcache stats|keys|gc|flush.
- cache/pr/: hydrated PR detail blobs used to answer gh pr view, gh pr checks, and gh run reads from local SQLite.
See gh shim for the cache key composition and TTL behavior.
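To make the fallthrough cache key concrete, here is a sketch that hashes the five inputs listed above into one key. shim_cache_key, the JSON encoding, and the choice of SHA-256 are assumptions for illustration only; the gh shim page is the authority on the real key format.

```python
import hashlib
import json


def shim_cache_key(config_path, cwd, gh_host, gh_repo, args):
    """Derive a stable key from everything that can change a gh command's result."""
    payload = json.dumps(
        {
            "config": config_path,
            "cwd": cwd,
            "gh_host": gh_host,
            "gh_repo": gh_repo,
            "args": list(args),
        },
        sort_keys=True,  # fixed field order keeps the key deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_view = shim_cache_key("/home/me/.config/gitcrawl/config.toml", "/work/repo",
                          "github.com", "org/repo", ["issue", "view", "123"])
key_list = shim_cache_key("/home/me/.config/gitcrawl/config.toml", "/work/repo",
                          "github.com", "org/repo", ["issue", "list"])
print(key_view != key_list)  # True: different args produce different keys
```

Including the working directory and the GH_HOST/GH_REPO values in the key means the same command run from a different checkout or against a different host does not reuse another context's cached answer.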