Clustering
Group related issues and pull requests using vector similarity, hardened with deterministic GitHub reference evidence and cross-kind safeguards.
#How it works
Clustering builds a sparse nearest-neighbor graph over the local vector store. For each thread, gitcrawl picks the top k most similar threads (default 16). Edges below the cosine threshold (default 0.80) are dropped. The remaining graph is split into connected components capped at --max-cluster-size members.
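The pipeline above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not gitcrawl's implementation: the brute-force neighbor search, the function names (`knn_edges`, `capped_components`), and the way the size cap is enforced (refusing any merge that would exceed it, strongest edges first) are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_edges(vectors, k=16, threshold=0.80):
    """Keep each thread's top-k neighbors, then drop edges below the threshold.

    vectors maps thread number -> embedding. Brute force is used here for
    clarity; a real vector store would use an index instead.
    """
    edges = {}
    ids = sorted(vectors)
    for a in ids:
        scored = sorted(
            ((cosine(vectors[a], vectors[b]), b) for b in ids if b != a),
            reverse=True,
        )
        for score, b in scored[:k]:
            if score >= threshold:
                edges[(min(a, b), max(a, b))] = score
    return edges


def capped_components(ids, edges, max_cluster_size=40):
    """Connected components via union-find, skipping merges that exceed the cap.

    Assumption: edges are processed strongest first, so the cap cuts the
    weakest links. The source does not say how the cap is applied.
    """
    parent = {i: i for i in ids}
    size = {i: 1 for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), _score in sorted(edges.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= max_cluster_size:
            parent[rb] = ra
            size[ra] += size[rb]

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values())
```

Threads with no surviving edge come out as one-member components, which is consistent with every thread being represented in the result.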
Two safeguards keep mega-clusters from forming:
- Title-token overlap. A weak embedding edge needs concrete shared title tokens (4+ char alphanumeric tokens) unless its similarity is already high (≥ 0.90) or there is direct GitHub reference evidence (#123, pull/123, issues/123).
- Cross-kind pruning. Edges connecting issues to pull requests need a higher floor (--cross-kind-threshold, default 0.93) than issue↔issue or PR↔PR edges.
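A minimal sketch of how the two safeguards could combine into one edge filter. Several details are assumptions rather than documented behavior: the `keep_edge` name, the thread dictionaries with `kind` and `title` keys, the order of the checks, treating one shared token as sufficient overlap, and applying the cross-kind floor even when reference evidence exists.

```python
import re

# "4+ char alphanumeric tokens", matched on the lowercased title.
TOKEN_RE = re.compile(r"[a-z0-9]{4,}")


def title_tokens(title):
    """Set of alphanumeric title tokens that are at least 4 characters long."""
    return set(TOKEN_RE.findall(title.lower()))


def keep_edge(a, b, score, has_reference=False,
              threshold=0.80, strong=0.90, cross_kind_threshold=0.93):
    """Decide whether a similarity edge between threads a and b survives."""
    if score < threshold:
        return False  # below the base cosine threshold
    if a["kind"] != b["kind"] and score < cross_kind_threshold:
        return False  # issue <-> PR edges need the higher floor
    if score >= strong or has_reference:
        return True  # strong enough to skip the title check
    # Weak edge: require at least one shared title token (assumed minimum).
    return bool(title_tokens(a["title"]) & title_tokens(b["title"]))
```

For example, an issue titled "Crash when parsing config file" and another titled "Docs typo in README" share no 4+ character tokens, so an edge at 0.85 is dropped while an edge at 0.92 is kept.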
GitHub references found in titles or in the first ~240 characters of bodies generate deterministic reference edges with score 0.94. Body-only references later in the document are treated as weak evidence (need title-token overlap or other support). Single-digit numbers in prose are ignored as ambiguous; references must be at least two digits or use a fully qualified form.
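The reference rules can be illustrated with a small extractor. This is a sketch, not gitcrawl's parser: the regular expression, the function names, and the reading of "fully qualified form" as the `pull/N` and `issues/N` path forms are assumptions.

```python
import re

# Bare "#N" needs two or more digits; "pull/N" and "issues/N" are treated as
# qualified forms and accepted with any digit count (an assumption).
REF_RE = re.compile(r"(?:(?<![\w/])#(\d{2,})\b|\b(?:pull|issues)/(\d+)\b)")

LEAD_CHARS = 240        # "first ~240 characters" of the body
REFERENCE_SCORE = 0.94  # score given to deterministic reference edges


def find_refs(text):
    """Return the set of thread numbers referenced in text."""
    return {int(m.group(1) or m.group(2)) for m in REF_RE.finditer(text)}


def classify_refs(title, body):
    """Split references into (strong, weak).

    Strong: found in the title or starting within the first LEAD_CHARS of the
    body. Weak: found only later in the body.
    """
    strong = find_refs(title)
    weak = set()
    for m in REF_RE.finditer(body):
        number = int(m.group(1) or m.group(2))
        (strong if m.start() < LEAD_CHARS else weak).add(number)
    return strong, weak - strong
```

Strong references would then become edges scored at `REFERENCE_SCORE`, while weak ones would still need title-token overlap or other support before linking two threads.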
The result is written to two tables that survive across runs:
- durable_clusters: stable cluster rows with stable IDs derived from the representative thread
- durable_cluster_members: thread-to-cluster mappings with override metadata
#Generate clusters
gitcrawl cluster owner/repo
The defaults match ghcrawl's tuning so the output is comparable across tools:
| Flag | Default | Description |
|---|---|---|
| --threshold <float> | 0.80 | Minimum cosine score for an edge |
| --cross-kind-threshold <float> | 0.93 | Minimum cosine score for issue↔PR edges |
| --min-size <n> | 1 | Minimum members per emitted cluster |
| --max-cluster-size <n> | 40 | Hard cap on cluster size |
| --k <n> | 16 | Nearest-neighbor fanout per thread |
| --limit <n> | (no limit) | Maximum vector rows to consider |
| --model <name> | (config) | Embedding model override |
| --basis <name> | (config) | Embedding basis override |
| --include-closed | (off) | Include closed threads |
Every active vector-backed thread is represented in the result: singleton clusters use kind = singleton_orphan, multi-member clusters use kind = duplicate_candidate.
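As a rough sketch, the kind assignment amounts to a size check on each component. The `label_clusters` helper and its output shape are hypothetical, and dropping clusters smaller than `min_size` is an assumption about how --min-size interacts with singletons.

```python
def label_clusters(components, min_size=1):
    """Attach a kind to each component of thread numbers.

    With the default min_size of 1, every component (and so every thread)
    appears in the output.
    """
    labeled = []
    for members in components:
        if len(members) < min_size:
            continue  # assumed effect of a larger --min-size
        kind = "singleton_orphan" if len(members) == 1 else "duplicate_candidate"
        labeled.append({"kind": kind, "members": sorted(members)})
    return labeled
```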
#List clusters
gitcrawl clusters owner/repo
gitcrawl clusters owner/repo --sort size --min-size 5
gitcrawl clusters owner/repo --sort recent
gitcrawl clusters owner/repo --hide-closed
| Flag | Default | Description |
|---|---|---|
| --sort recent\|oldest\|size | size | Ordering |
| --min-size <n> | (none) | Minimum active member count |
| --limit <n> | (no limit) | Maximum cluster rows |
| --hide-closed | (off) | Hide locally closed clusters |
| --include-closed | (deprecated) | Closed clusters are included by default |
gitcrawl clusters shows the latest raw run's clusters first and merges closed durable rows in as historical context. For a strict durable-only audit view (no merging with the latest run), use:
gitcrawl durable-clusters owner/repo --include-closed
GitHub-closed members are hidden from latest-run cluster summaries by default; pass --include-closed to see the full historical view.
#Inspect a cluster
gitcrawl cluster-detail owner/repo --id 123
gitcrawl cluster-explain owner/repo --id 123 # alias
| Flag | Default | Description |
|---|---|---|
| --id <n> | (required) | Cluster ID |
| --member-limit <n> | (no limit) | Maximum members to return |
| --body-chars <n> | 280 | Body snippet length per member |
| --include-closed | (off) | Include closed members |
cluster-explain is the same command — it exists so the verb reads naturally in agent prompts ("explain why these things ended up together").
#Find similar threads (neighbors)
gitcrawl neighbors owner/repo --number 123 --limit 10
| Flag | Default | Description |
|---|---|---|
| --number <n> | (required) | Source issue/PR |
| --limit <n> | 10 | Maximum neighbors |
| --threshold <float> | 0.2 | Minimum cosine score |
Useful for "what else looks like this?" without committing to a cluster. The TUI's n shortcut and "Enter on a member" both call this path.
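Conceptually, a neighbor lookup is a thresholded top-n search around one thread. The sketch below assumes an in-memory mapping from thread number to embedding and a brute-force scan; the `neighbors` function is illustrative and is not gitcrawl's API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def neighbors(vectors, number, limit=10, threshold=0.2):
    """Return up to `limit` (thread number, score) pairs, most similar first."""
    source = vectors[number]
    scored = [
        (cosine(source, vec), other)
        for other, vec in vectors.items()
        if other != number
    ]
    kept = [(s, o) for s, o in scored if s >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # score desc, then number
    return [(o, round(s, 4)) for s, o in kept[:limit]]
```

Unlike clustering, nothing is merged here: the result is a ranked list for a single source thread, which is why a much lower default threshold (0.2) is workable.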
#Tuning recipes
#My clusters are too greedy
Symptom: unrelated bug reports merged together.
gitcrawl cluster owner/repo --threshold 0.85 --cross-kind-threshold 0.95
Tighter thresholds drop more weak edges. The --cross-kind-threshold raise specifically helps when an issue and a PR keep getting glued together because of shared boilerplate.
#My clusters are too sparse
Symptom: clear duplicates landing in separate clusters.
gitcrawl cluster owner/repo --threshold 0.75 --k 24
A lower threshold keeps more edges, and a higher fanout considers more candidate neighbors per thread. Watch for false merges via cluster-detail.
#Make a single big cluster smaller
Symptom: one cluster has 40 members and is incoherent.
gitcrawl cluster owner/repo --max-cluster-size 20
Or slice it manually:
gitcrawl exclude-cluster-member owner/repo --id 12 --number 456 --reason "different repro"
See Governance for the full override workflow.
#Re-clustering and stable IDs
Durable cluster IDs are derived from the representative thread, so they survive re-runs of cluster and refresh. This means:
- Local closes (close-cluster), exclusions, and canonical member overrides persist across re-clustering
- You can safely re-cluster after every refresh without losing maintainer state
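To see why keying on the representative thread keeps overrides attached, consider this simplified model. The key format, the `apply_exclusions` helper, and the data shapes are hypothetical; the source states only that durable IDs are derived from the representative thread.

```python
def durable_key(repo, representative):
    """Hypothetical stable key: depends only on the repo and representative."""
    return f"{repo}#{representative}"


def apply_exclusions(repo, clusters, exclusions):
    """Re-apply stored exclusions to a fresh clustering run.

    clusters:   list of (representative thread number, member numbers)
    exclusions: durable key -> set of thread numbers excluded by a maintainer
    """
    result = {}
    for representative, members in clusters:
        key = durable_key(repo, representative)
        hidden = exclusions.get(key, set())
        result[key] = [m for m in members if m not in hidden]
    return result
```

Because the key does not depend on the rest of the membership, a later run that adds or removes members around the same representative still maps to the same row, and the stored exclusion keeps applying.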
Cluster runs are recorded in run_records and visible via gitcrawl runs --kind cluster.
#See also
- Governance — close clusters, exclude members, set canonical
- TUI — the interactive cluster browser
- Concepts — durable clusters and cluster kinds