# Automation

Stable JSON contracts, agent recipes, and patterns for keeping the local mirror warm without manual ceremony.

## JSON output is first class

Every command supports `--json` (or the global `--format json`). The resulting payload is pretty-printed with stable field names, so you can pipe it directly into `jq` or feed it to an agent as structured context.

```shell
gitcrawl sync owner/repo --json | jq '{run_id, inserted, updated}'
gitcrawl clusters owner/repo --json --sort size --min-size 5 \
  | jq '.clusters[] | {id, members: .member_count, latest: .latest_thread_number}'
```

For the full per-command JSON shapes, see the individual feature pages and the Commands reference.

## Exit codes

- `0` — success
- non-zero — usage error, command not implemented, or runtime error

Stderr always carries a human-readable error message; stdout is reserved for the requested output (text or JSON), so you can pipe it to `jq` without losing diagnostics.
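That split makes a scripting wrapper trivial. A minimal sketch — `run_json` is a hypothetical helper name, not part of gitcrawl, and `"$@"` is whatever invocation you want to wrap:

```shell
#!/bin/sh
# Keep stdout clean for jq, buffer stderr, and propagate the exit code.
run_json() {
  err=$(mktemp)
  if out=$("$@" 2>"$err"); then
    rm -f "$err"
    printf '%s\n' "$out"        # clean JSON on stdout
  else
    status=$?
    cat "$err" >&2              # diagnostics stay on stderr
    rm -f "$err"
    return "$status"
  fi
}

# Usage (illustrative): run_json gitcrawl sync owner/repo --json | jq .
```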

## Keeping the mirror fresh

Three patterns, in increasing order of automation:

### On-demand staleness check

Use `--sync-if-stale` on `gitcrawl search` (or the gh shim's search):

```shell
gitcrawl search issues "manifest cache" \
  -R owner/repo \
  --sync-if-stale 5m \
  --json number,title,url
```

Best for ad-hoc agent tools that should bound staleness but minimize sync calls.

### Auto-hydration via the gh shim

Symlink the gitcrawl binary as `gh` (or `gitcrawl-gh`) and let the shim pull a single PR's detail when an agent calls `gh pr view` or `gh pr checks` against an unhydrated PR. See gh shim → auto-hydration.

This is the lowest-overhead pattern for fleets of agents — no scheduling required.
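One way to wire the shim up — the binary locations below are illustrative, so adjust them to where gitcrawl and the real GitHub CLI actually live:

```shell
#!/bin/sh
# Put a gh-named symlink to gitcrawl ahead of the real gh on PATH,
# and pin the real binary so the shim never invokes itself.
# /usr/local/bin/gitcrawl and /usr/bin/gh are assumed paths.
mkdir -p "$HOME/bin"
ln -sf /usr/local/bin/gitcrawl "$HOME/bin/gh"
export PATH="$HOME/bin:$PATH"
export GITCRAWL_GH_PATH=/usr/bin/gh   # the real GitHub CLI
```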

### Periodic background refresh

Run `gitcrawl refresh owner/repo` on a cron job, systemd timer, or launchd agent every few minutes per repo. Combine it with the gh shim, and your agents almost never have to wait on GitHub.

```shell
# Every 5 minutes, refresh the active repos.
*/5 * * * * $HOME/bin/gitcrawl refresh openclaw/gitcrawl --json > /tmp/gitcrawl.openclaw.json 2>&1
```

For multiple repos, loop in a small shell script — gitcrawl is happy to run sequentially against a shared SQLite file.

## Agent recipes

### "Look up an issue without burning quota"

```shell
gh issue view 123 -R owner/repo --json number,title,state,body,labels,author
```

With the shim symlinked as `gh`, this answers from local SQLite if the issue is cached. Auto-hydration covers PR-detail fields. The agent prompt does not change.

### "Find candidates, hydrate them, summarize"

```shell
NUMS=$(gh search issues "checksum mismatch" -R owner/repo \
        --json number --limit 30 \
        | jq -r '[.[].number] | join(",")')

gitcrawl sync owner/repo --numbers "$NUMS" --include-comments --with pr-details

gitcrawl cluster-detail owner/repo --id "$(gitcrawl clusters owner/repo --json \
        | jq '.clusters[0].id')"
```

Search is local; the targeted sync brings exactly the rows you need; cluster-detail returns the structured triage view.
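The `jq` step that builds `NUMS` can be sanity-checked against a canned payload — the array-of-`{number}` shape here is an assumption about what `--json number` emits:

```shell
# jq's join() stringifies the numbers before joining.
printf '%s' '[{"number":12},{"number":34},{"number":56}]' \
  | jq -r '[.[].number] | join(",")'
# → 12,34,56
```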

### "Find duplicates around a new bug report"

```shell
NUM=789
gitcrawl sync owner/repo --numbers "$NUM" --include-comments
gitcrawl embed owner/repo --number "$NUM"
gitcrawl neighbors owner/repo --number "$NUM" --limit 10 --json
```
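An agent will usually want only the close matches. A hedged post-processing sketch — the `{neighbors: [{number, score}]}` payload shape and the 0.8 cutoff are assumptions for illustration, not gitcrawl's documented output:

```shell
# Keep only neighbors above a similarity threshold.
printf '%s' '{"neighbors":[{"number":101,"score":0.91},{"number":55,"score":0.42}]}' \
  | jq -c '[.neighbors[] | select(.score >= 0.8) | .number]'
# → [101]
```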

### "Triage a cluster end to end"

```shell
ID=42

# Read.
gitcrawl cluster-detail owner/repo --id "$ID" --body-chars 600 --json

# Decide canonical, then close locally.
gitcrawl set-cluster-canonical owner/repo --id "$ID" --number 123
gitcrawl close-cluster owner/repo --id "$ID" --reason "consolidated under #123"

# Comment upstream via real gh.
gh issue comment 456 -R owner/repo --body "Duplicate of #123"
```

### "Prove the shim is paying off"

```shell
# Periodically log cache stats — watch local_hits climb relative to backend_misses.
gitcrawl gh xcache stats --json \
  | jq '{local: .counters.local_hits, fallback: .counters.fallback_hits, github: .counters.backend_misses}'
```
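A derived hit ratio is a single alert-friendly number. The counters payload below is a made-up sample in the shape the stats command appears to emit:

```shell
# Ratio of local answers to all GitHub-bound lookups.
printf '%s' '{"counters":{"local_hits":90,"fallback_hits":0,"backend_misses":10}}' \
  | jq '.counters.local_hits / (.counters.local_hits + .counters.backend_misses)'
# → 0.9
```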

## Multi-repo automation

A single `gitcrawl.db` can hold many repositories. Loop in shell:

```shell
for repo in openclaw/gitcrawl steipete/repo-a octocat/repo-b; do
  gitcrawl refresh "$repo" --json \
    | jq --arg repo "$repo" '{repo: $repo, sync: .sync, embed: .embed}'
done
```

Or maintain a small script that reads a list of repos from a file and runs them on a schedule.
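A sketch of that script — the one-repo-per-line file format and the `GITCRAWL` override are inventions for this example (the override makes dry runs easy):

```shell
#!/bin/sh
# Refresh every repo listed one per line in a file. Blank lines and
# #-comments are skipped; one failure does not stop the loop.
# GITCRAWL defaults to gitcrawl; set GITCRAWL=echo to dry-run.
refresh_all() {
  while IFS= read -r repo; do
    case "$repo" in ''|'#'*) continue ;; esac
    ${GITCRAWL:-gitcrawl} refresh "$repo" --json \
      || echo "refresh failed for $repo" >&2
  done < "$1"
}

# Usage: refresh_all "$HOME/.config/gitcrawl/repos.txt"
```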

## Output formats

| Format | When to use |
| --- | --- |
| `text` (default) | Humans at a terminal |
| `json` (or `--json`) | Pipelines, scripts, agents |
| `log` | Internal logging output; structured key/value pairs |

Force a format globally with `--format json` or per-command with `--json`. The `log` format is mostly used internally and is subject to change.

## CI integration

Run gitcrawl in CI to validate a portable store's freshness, sanity-check cluster shapes, or produce a triage report:

```yaml
- name: Refresh and snapshot clusters
  run: |
    gitcrawl init --portable-store $PORTABLE_STORE_URL
    gitcrawl refresh openclaw/gitcrawl --json > sync.json
    gitcrawl clusters openclaw/gitcrawl --json --sort size --min-size 5 > clusters.json
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- uses: actions/upload-artifact@v4
  with: { name: triage, path: "*.json" }
```

The artifact gives reviewers a structured view of what changed and how the cluster graph looks today.

## Best practices

- Set both tokens in a single place: either the environment or the `[env]` section of `config.toml`. Mixing sources tends to confuse `doctor` reports.
- Bound the staleness window. `--sync-if-stale` on every agent-driven search is cheaper than a hot cron loop.
- Monitor `xcache stats`. If `backend_misses` dwarfs `local_hits`, you are not yet getting the shim's benefit — usually it means agents are calling `gh` directly instead of going through the symlink.
- Re-cluster after a backfill. A large `--state all` sync should be followed by `gitcrawl refresh --no-sync` (or just `gitcrawl embed && gitcrawl cluster`) so the durable graph reflects the new content.
- Pin the gh binary. Set `GITCRAWL_GH_PATH` explicitly so the shim does not accidentally invoke itself.