Start

Concepts

Concepts

The handful of nouns gitcrawl uses, and how they connect.

#Repository mirror

A repository is the owner/repo you sync. Every gitcrawl command takes one, and most state in SQLite is keyed by it. You can mirror as many repos as you like into a single gitcrawl.db; commands always scope to the one you name.

The mirror is metadata-first: titles, bodies, authors, labels, state, timestamps, and IDs land in SQLite immediately. Comments, reviews, review comments, and full PR detail (files, commits, checks, workflow runs) are opt-in on a per-sync basis (see Sync).

#Thread

A thread is a single GitHub issue or pull request, with its body and metadata. The CLI exposes threads via gitcrawl threads and via the gh shim's gh issue/pr view and gh issue/pr list paths.

Threads have two state dimensions:

  • GitHub stateopen or closed upstream.
  • Local close — a maintainer-only override stored locally. gitcrawl close-thread and reopen-thread flip this without touching GitHub. Local closes drive the --hide-closed and --include-closed filters across clusters, cluster-detail, the TUI, and search.

Local close is for triage workflow: "I have handled this duplicate locally, I do not need it shown next time." It does not write back to GitHub.

#Document

A document is the canonical text gitcrawl indexes for a thread — title plus body, with comments folded in when present. Documents back the FTS index used by gitcrawl search and feed the embedding pipeline.

Most users never interact with documents directly; they show up in JSON output as a document field on neighbors and search hits.

#Embedding

An embedding is a vector representation of a thread's document, produced by an OpenAI model (default text-embedding-3-small, 1024 dimensions). Vectors live in ~/.config/gitcrawl/vectors and are referenced from the thread_vectors table.

The embedding basis controls what text gets embedded. The default title_original uses title plus an excerpt of the original body. This is configurable via gitcrawl configure --embedding-basis ... but only title_original is currently implemented.

gitcrawl embed is the explicit command that fills the vector table. gitcrawl refresh runs it automatically as part of its sync → embed → cluster pipeline.

When the embedding input rune cap or model changes, vectors are rebuilt to avoid stale comparisons.

#Cluster

A cluster is a group of related threads inferred from vector similarity, with deterministic GitHub reference evidence (#123, pull/123, issues/123) folded in to harden weak edges.

Clustering is run by gitcrawl cluster (or as part of gitcrawl refresh). Defaults are tuned to ghcrawl's profile: --threshold 0.80, --min-size 1, --max-cluster-size 40, --k 16 nearest-neighbor fanout, --cross-kind-threshold 0.93 for issue↔PR edges.

Two safeguards keep mega-clusters from forming:

  • Title-token overlap. A weak embedding edge needs concrete shared title tokens unless its similarity is already high or there is direct GitHub reference evidence.
  • Cross-kind pruning. Issue↔PR edges need a higher similarity floor (--cross-kind-threshold) than issue↔issue or PR↔PR.

#Cluster kinds

Every cluster ships with a kind that explains its shape:

  • singleton_orphan — one member, no neighbors above threshold. Useful for surfacing isolated reports.
  • duplicate_candidate — multiple members above the merge threshold. The default duplicate triage row.

#Durable clusters

A durable cluster is a stable, long-lived row in durable_clusters with a stable ID derived from its representative thread. Durable cluster IDs survive re-runs of cluster and refresh, so the local close, exclusion, and canonical-member overrides you apply persist across re-clustering.

gitcrawl clusters and gitcrawl tui show the latest raw run's clusters first, with closed durable rows merged in as historical context. Use gitcrawl durable-clusters for an audit view that stays on the durable rows.

#Cluster overrides (governance)

Per-cluster maintainer overrides let you correct what the algorithm produced without re-tuning thresholds:

  • Local close (close-cluster/reopen-cluster) — hides a duplicate-candidate from active triage.
  • Member exclusion (exclude-cluster-member/include-cluster-member) — pulls a specific thread out of a cluster and remembers why.
  • Canonical member (set-cluster-canonical) — pins which thread represents the cluster.

See Governance for the full workflow.

#Run

Every sync, embed, and cluster operation records a run in run_records with start/finish timestamps, status, and stage-specific stats. gitcrawl runs --kind sync|embedding|cluster lists them, useful for debugging or auditing.

#Portable store

A portable store is a Git-backed publish target for a gitcrawl.db plus its derived bodies, designed for sharing a local cache across agents or machines without a hosted service.

gitcrawl init --portable-store https://github.com/org/repo clones a portable store into ~/.config/gitcrawl/portable/, points the runtime at it, and gitcrawl portable prune --body-chars 256 keeps the published payload small while retaining comments, PR details, checks, and workflow runs. Read-only commands run against portable stores refresh the checkout before reading. See Portable stores.

#Cache

The cache/ directory under ~/.config/gitcrawl/ holds:

  • cache/gh-shim/ — the short-lived fallthrough cache for the gh shim, keyed by config path, CWD, GH_HOST, GH_REPO, and command args. Inspect or clean it with gitcrawl gh xcache stats|keys|gc|flush.
  • cache/pr/ — hydrated PR detail blobs used to answer gh pr view, gh pr checks, and gh run reads from local SQLite.

See gh shim for the cache key composition and TTL behavior.