Claude Code Plugins

Community-maintained marketplace

github-archive

@gadievron/raptor

Investigate GitHub security incidents using tamper-proof GitHub Archive data via BigQuery. Use when verifying repository activity claims, recovering deleted PRs/branches/tags/repos, attributing actions to actors, or reconstructing attack timelines. Provides immutable forensic evidence of all public GitHub events since 2011.

Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: github-archive
description: Investigate GitHub security incidents using tamper-proof GitHub Archive data via BigQuery. Use when verifying repository activity claims, recovering deleted PRs/branches/tags/repos, attributing actions to actors, or reconstructing attack timelines. Provides immutable forensic evidence of all public GitHub events since 2011.
version: 1
author: mbrg
tags: github, gharchive, security, osint, forensics, git

GitHub Archive

Purpose: Query immutable GitHub event history via BigQuery to obtain tamper-proof forensic evidence for security investigations.

When to Use This Skill

  • Investigating security incidents involving GitHub repositories
  • Building threat actor attribution profiles
  • Verifying claims about repository activity (media reports, incident reports)
  • Reconstructing attack timelines with definitive timestamps
  • Analyzing automation system compromises
  • Detecting supply chain reconnaissance
  • Cross-repository behavioral analysis
  • Workflow execution verification (legitimate vs API abuse)
  • Pattern-based anomaly detection
  • Recovering deleted content: PRs, issues, branches, tags, entire repositories

GitHub Archive analysis should be your FIRST step in any GitHub-related security investigation. Start with the immutable record, then enrich with additional sources.

Core Principles

ALWAYS PREFER GitHub Archive as forensic evidence over:

  • Local git command outputs (git log, git show) - commits can be backdated/forged
  • Unverified claims from articles or reports - require independent confirmation
  • GitHub web interface screenshots - can be manipulated
  • Single-source evidence - always cross-verify

GitHub Archive IS your ground truth for:

  • Actor attribution (who performed actions)
  • Timeline reconstruction (when events occurred)
  • Event verification (what actually happened)
  • Pattern analysis (behavioral fingerprinting)
  • Cross-repository activity tracking
  • Deleted content recovery (issues, PRs, tags, commit references remain in archive)
  • Repository deletion forensics (commit SHAs persist even after repo deletion and history rewrites)

What Persists After Deletion

Deleted Issues & PRs:

  • Issue creation events (IssuesEvent) remain in archive
  • Issue comments (IssueCommentEvent) remain accessible
  • PR open/close/merge events (PullRequestEvent) persist
  • Forensic Value: Recover deleted evidence of social engineering, reconnaissance, or coordination

Deleted Tags & Branches:

  • CreateEvent records for tag/branch creation persist
  • DeleteEvent records document when deletion occurred
  • Forensic Value: Reconstruct attack staging infrastructure (e.g., malicious payload delivery tags)

Deleted Repositories:

  • All PushEvent records to the repository remain queryable
  • Commit SHAs are permanently recorded in archive
  • Fork relationships (ForkEvent) survive deletion
  • Forensic Value: Access commit metadata even after threat actor deletes evidence

Deleted User Accounts:

  • All activity events remain attributed to deleted username
  • Timeline reconstruction remains possible
  • Limitation: Direct code access lost, but commit SHAs can be searched elsewhere
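
To illustrate, a minimal sketch (hypothetical username and date; assumes client is an authenticated bigquery.Client, see Setup) that enumerates the surviving activity of a deleted account:

query = """
SELECT type, created_at, repo.name as repo_name
FROM `githubarchive.day.20250615`
WHERE actor.login = 'deleted-username'  -- events persist after account deletion
ORDER BY created_at
"""
for row in client.query(query):
    print(f"{row.created_at}: {row.type} in {row.repo_name}")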

Quick Start

Investigate whether a user opened PRs in June 2025:

from google.cloud import bigquery
from google.oauth2 import service_account

# Initialize client (see Setup section for credentials)
credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Query for PR events
query = """
SELECT
    created_at,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.202506*`
WHERE
    actor.login = 'suspected-actor'
    AND repo.name = 'target/repository'
    AND type = 'PullRequestEvent'
ORDER BY created_at
"""

results = client.query(query)
for row in results:
    print(f"{row.created_at}: PR #{row.pr_number} - {row.action}")
    print(f"  Title: {row.pr_title}")

Expected Output (if PR exists):

2025-06-15 14:23:11 UTC: PR #123 - opened
  Title: Add new feature
2025-06-20 09:45:22 UTC: PR #123 - closed
  Title: Add new feature

Interpretation:

  • No results → Claim disproven (no PR activity found)
  • Results found → Claim verified, proceed with detailed analysis

Setup

Prerequisites

  1. Google Cloud Project:

    • Login to Google Developer Console
    • Create a project and activate BigQuery API
    • Create a service account with BigQuery User role
    • Download JSON credentials file
  2. Install BigQuery Client:

pip install google-cloud-bigquery google-auth

Initialize Client

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)

client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id
)

Free Tier: Google provides 1 TiB of query data processed per month at no charge.

Cost Management & Query Optimization

Understanding GitHub Archive Costs

BigQuery charges $6.25 per TiB of data scanned (after the 1 TiB free tier). GitHub Archive tables are large - a single month table can be 50-100 GB, and yearly wildcards can scan multiple TiBs. Unoptimized queries can cost $10-100+, while optimized versions of the same query cost $0.10-1.00.

Key Cost Principle: BigQuery uses columnar storage - you pay for ALL data in the columns you SELECT, not just matching rows. A query with SELECT * on one day of data scans ~3 GB even with LIMIT 10.

ALWAYS Estimate Costs Before Querying

CRITICAL RULE: Run a dry run to estimate costs before executing any query against GitHub Archive production tables.

from google.cloud import bigquery

def estimate_gharchive_cost(query: str) -> dict:
    """Estimate cost before running GitHub Archive query."""
    client = bigquery.Client()

    # Dry run - validates query and returns bytes to scan
    dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(query, job_config=dry_run_config)

    bytes_processed = job.total_bytes_processed
    gb_processed = bytes_processed / (1024**3)
    tib_processed = bytes_processed / (1024**4)
    estimated_cost = tib_processed * 6.25

    return {
        'bytes': bytes_processed,
        'gigabytes': round(gb_processed, 2),
        'tib': round(tib_processed, 4),
        'estimated_cost_usd': round(estimated_cost, 4)
    }

# Example: Always check cost before running
estimate = estimate_gharchive_cost(your_query)
print(f"Cost estimate: {estimate['gigabytes']} GB → ${estimate['estimated_cost_usd']}")

if estimate['estimated_cost_usd'] > 1.0:
    print("⚠️ HIGH COST QUERY - Review optimization before proceeding")

Command-line dry run:

bq query --dry_run --use_legacy_sql=false 'YOUR_QUERY_HERE' 2>&1 | grep "bytes"

When to Ask the User About Costs

ASK USER BEFORE RUNNING if any of these conditions apply:

  1. Estimated cost > $1.00 - Always confirm with user for queries over $1
  2. Wildcard spans > 3 months - Queries like githubarchive.day.2025* scan entire year (~400 GB)
  3. No partition filter - Queries without date/time filters scan entire table range
  4. SELECT * used - Selecting all columns dramatically increases cost
  5. Cross-repository searches - Queries without repo.name filter scan all GitHub activity

Example user confirmation:

Query estimate: 120 GB ($0.75)
Scanning: githubarchive.day.202506* (June 2025, 30 days)
Reason: Cross-repository search for actor 'suspected-user'

This exceeds typical query cost ($0.10-0.30). Proceed? [y/n]

DON'T ASK if:

  • Estimated cost < $0.50 AND query is well-scoped (specific repo + date range)
  • User explicitly requested broad analysis (e.g., "scan all of 2025")

Cost Optimization Techniques for GitHub Archive

1. Select Only Required Columns (50-90% cost reduction)

-- ❌ EXPENSIVE: Scans ALL columns (~3 GB per day)
SELECT * FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Scans only needed columns (~0.3 GB per day)
SELECT
    type,
    created_at,
    repo.name,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

Never use SELECT * in production queries. Always specify exact columns needed.

2. Use Specific Date Ranges (10-100x cost reduction)

-- ❌ EXPENSIVE: Scans entire year (~400 GB)
SELECT ... FROM `githubarchive.day.2025*`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Scans specific month (~40 GB)
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'

-- ✅ BEST: Scans single day (~3 GB)
SELECT ... FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

Strategy: Start with narrow date ranges (1-7 days), then expand if needed. Use monthly tables (githubarchive.month.202506) for multi-month queries instead of daily wildcards.
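
A sketch of that strategy (placeholder actor), pricing a monthly-table scan with the estimate_gharchive_cost() helper defined above before running it:

# Monthly table instead of a daily wildcard over the same range
query = """
SELECT type, created_at, repo.name as repo_name
FROM `githubarchive.month.202506`
WHERE actor.login = 'target-user'
"""
estimate = estimate_gharchive_cost(query)  # helper defined above
print(f"Monthly scan: {estimate['gigabytes']} GB → ${estimate['estimated_cost_usd']}")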

3. Filter by Repository Name (5-50x cost reduction)

-- ❌ EXPENSIVE: Scans all GitHub activity
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Filter by repo (BigQuery can prune data blocks)
SELECT ... FROM `githubarchive.day.202506*`
WHERE
    repo.name = 'target-org/target-repo'
    AND actor.login = 'target-user'

Rule: Always include repo.name filter when investigating a specific repository.

4. Avoid SELECT * with Wildcards (Critical)

-- ❌ CATASTROPHIC: Can scan 1+ TiB ($6.25+)
SELECT * FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'

-- ✅ OPTIMIZED: Scans ~50 GB ($0.31)
SELECT
    created_at,
    actor.login,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'

5. Use LIMIT Correctly (Does NOT reduce cost on GHArchive)

IMPORTANT: LIMIT does not reduce BigQuery costs on non-clustered tables like GitHub Archive. BigQuery must scan all matching data before applying LIMIT.

-- ❌ MISCONCEPTION: Still scans full dataset
SELECT * FROM `githubarchive.day.20250615`
LIMIT 100  -- Cost: ~3 GB scanned

-- ✅ CORRECT: Use WHERE filters and column selection
SELECT type, created_at, actor.login
FROM `githubarchive.day.20250615`
WHERE repo.name = 'target/repo'  -- Cost: ~0.2 GB scanned
LIMIT 100

Safe Query Execution Template

Use this template for all GitHub Archive queries in production:

from google.cloud import bigquery

def safe_gharchive_query(query: str, max_cost_usd: float = 1.0):
    """Execute a GitHub Archive query with cost controls."""
    client = bigquery.Client()

    # Step 1: Dry run estimate
    dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(query, job_config=dry_run_config)

    bytes_processed = dry_job.total_bytes_processed
    gb = bytes_processed / (1024**3)
    estimated_cost = (bytes_processed / (1024**4)) * 6.25

    print(f"📊 Estimate: {gb:.2f} GB → ${estimated_cost:.4f}")

    # Step 2: Check budget
    if estimated_cost > max_cost_usd:
        raise ValueError(
            f"Query exceeds ${max_cost_usd} budget (estimated ${estimated_cost:.2f}). "
            f"Optimize query or increase max_cost_usd parameter."
        )

    # Step 3: Execute with safety limit
    job_config = bigquery.QueryJobConfig(
        maximum_bytes_billed=int(bytes_processed * 1.2)  # 20% buffer
    )

    print(f"✅ Executing query (max ${estimated_cost:.2f})...")
    return client.query(query, job_config=job_config).result()

# Usage
results = safe_gharchive_query("""
    SELECT created_at, repo.name, actor.login
    FROM `githubarchive.day.20250615`
    WHERE repo.name = 'aws/aws-toolkit-vscode'
        AND type = 'PushEvent'
""", max_cost_usd=0.50)

Common Investigation Patterns: Cost Comparison

  • Verify user opened PR in June
      Expensive (~$5.00): SELECT * FROM githubarchive.day.202506*
      Optimized (~$0.30): SELECT created_at, repo.name, payload FROM githubarchive.day.202506* WHERE actor.login='user' AND type='PullRequestEvent'
  • Find all actor activity in 2025
      Expensive (~$60.00): SELECT * FROM githubarchive.day.2025*
      Optimized (~$5.00): SELECT type, created_at, repo.name FROM githubarchive.month.2025*
  • Recover deleted PR content
      Expensive (~$0.20): SELECT * FROM githubarchive.day.20250615
      Optimized (~$0.02): SELECT created_at, payload FROM githubarchive.day.20250615 WHERE repo.name='target/repo' AND type='PullRequestEvent'
  • Cross-repo behavioral analysis
      Expensive (~$5.00): SELECT * FROM githubarchive.day.202506*
      Optimized (~$0.50): Start with githubarchive.month.202506, identify specific repos, then query daily tables

Development vs Production Queries

During investigation/development:

  1. Start with single-day queries to test pattern: githubarchive.day.20250615
  2. Verify query returns expected results
  3. Expand to date range only after validation: githubarchive.day.202506*

Production checklist:

  • Used specific column names (no SELECT *)
  • Included narrowest possible date range
  • Added repo.name filter if investigating specific repository
  • Ran dry run and verified cost < $1.00 (or got user approval)
  • Set maximum_bytes_billed in query config
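
A hedged pre-flight helper that applies this checklist mechanically (simple string heuristics, not a full SQL parser) might look like:

import re

def lint_gharchive_query(query: str) -> list:
    """Heuristic checks against the production checklist above."""
    warnings = []
    if re.search(r'SELECT\s+\*', query, re.IGNORECASE):
        warnings.append("SELECT * scans every column - list only needed fields")
    if re.search(r'githubarchive\.day\.\d{0,6}\*', query):
        warnings.append("Daily wildcard - narrow the date range or use monthly tables")
    if 'repo.name' not in query:
        warnings.append("No repo.name filter - scans all GitHub activity")
    return warnings

# Run before the dry-run estimate
for warning in lint_gharchive_query(your_query):
    print(f"⚠️ {warning}")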

Cost Monitoring

Track your BigQuery spending with this query:

-- View GitHub Archive query costs (last 7 days)
SELECT
    DATE(creation_time) as query_date,
    COUNT(*) as queries,
    ROUND(SUM(total_bytes_billed) / (1024*1024*1024), 2) as total_gb,
    ROUND(SUM(total_bytes_billed) / (1024*1024*1024*1024) * 6.25, 2) as cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
    creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND job_type = 'QUERY'
    AND REGEXP_CONTAINS(query, r'githubarchive\.')
GROUP BY query_date
ORDER BY query_date DESC

Schema Reference

Table Organization

Dataset: githubarchive

Table Patterns:

  • Daily tables: githubarchive.day.YYYYMMDD (e.g., githubarchive.day.20250713)
  • Monthly tables: githubarchive.month.YYYYMM (e.g., githubarchive.month.202507)
  • Yearly tables: githubarchive.year.YYYY (e.g., githubarchive.year.2025)

Wildcard Patterns:

  • All days in June 2025: githubarchive.day.202506*
  • All months in 2025: githubarchive.month.2025*
  • All data in 2025: githubarchive.year.2025 (single yearly table, no wildcard needed)

Data Availability: February 12, 2011 to present (updated hourly)

Schema Structure

Top-Level Fields:

type              -- Event type (PushEvent, IssuesEvent, etc.)
created_at        -- Timestamp when event occurred (UTC)
actor.login       -- GitHub username who performed the action
actor.id          -- GitHub user ID
repo.name         -- Repository name (org/repo format)
repo.id           -- Repository ID
org.login         -- Organization login (if applicable)
org.id            -- Organization ID
payload           -- JSON string with event-specific data

Payload Field: JSON-encoded string containing event-specific details. Must be parsed with JSON_EXTRACT_SCALAR() in SQL or json.loads() in Python.
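
For example, a short sketch (assuming row comes from a client.query() result) parsing a PushEvent payload in Python:

import json

payload = json.loads(row.payload)  # payload is a JSON-encoded string
for commit in payload.get('commits', []):  # PushEvent-specific field
    print(commit['sha'], commit['message'])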

Event Types Reference

Repository Events

PushEvent - Commits pushed to a repository

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.ref')        -- Branch (refs/heads/master)
JSON_EXTRACT_SCALAR(payload, '$.before')     -- SHA before push
JSON_EXTRACT_SCALAR(payload, '$.head')       -- SHA after push
JSON_EXTRACT_SCALAR(payload, '$.size')       -- Number of commits
-- payload.commits[] contains array of commit objects with sha, message, author

PullRequestEvent - Pull request opened, closed, merged

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.action')              -- opened, closed, reopened (merge = closed with merged=true)
JSON_EXTRACT_SCALAR(payload, '$.pull_request.number')
JSON_EXTRACT_SCALAR(payload, '$.pull_request.title')
JSON_EXTRACT_SCALAR(payload, '$.pull_request.merged') -- true/false

CreateEvent - Branch or tag created

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.ref_type')   -- branch, tag, repository
JSON_EXTRACT_SCALAR(payload, '$.ref')        -- Name of branch/tag

DeleteEvent - Branch or tag deleted

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.ref_type')   -- branch or tag
JSON_EXTRACT_SCALAR(payload, '$.ref')        -- Name of deleted ref

ForkEvent - Repository forked

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.forkee.full_name')  -- New fork name

Automation & CI/CD Events

WorkflowRunEvent - GitHub Actions workflow run status changes

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.action')               -- requested, completed
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name')
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.path')    -- .github/workflows/file.yml
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.status')  -- queued, in_progress, completed
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') -- success, failure, cancelled
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha')
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_branch')

  • WorkflowJobEvent - Individual job within a workflow
  • CheckRunEvent - Check run status (CI systems)
  • CheckSuiteEvent - Check suite for commits

Issue & Discussion Events

IssuesEvent - Issue opened, closed, edited

-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.action')        -- opened, closed, reopened
JSON_EXTRACT_SCALAR(payload, '$.issue.number')
JSON_EXTRACT_SCALAR(payload, '$.issue.title')
JSON_EXTRACT_SCALAR(payload, '$.issue.body')

  • IssueCommentEvent - Comment on an issue or pull request
  • PullRequestReviewEvent - PR review submitted
  • PullRequestReviewCommentEvent - Comment on a PR diff

Other Events

  • WatchEvent - Repository starred
  • ReleaseEvent - Release published
  • MemberEvent - Collaborator added/removed
  • PublicEvent - Repository made public

Investigation Patterns

Deleted Issue & PR Text Recovery

Scenario: Issue or PR was deleted from GitHub (by author, maintainer, or moderation) but you need to recover the original title and body text for investigation, compliance, or historical reference.

Step 1: Recover Deleted Issue Content

SELECT
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action,
    JSON_EXTRACT_SCALAR(payload, '$.issue.number') as issue_number,
    JSON_EXTRACT_SCALAR(payload, '$.issue.title') as title,
    JSON_EXTRACT_SCALAR(payload, '$.issue.body') as body
FROM `githubarchive.day.20250713`
WHERE
    repo.name = 'aws/aws-toolkit-vscode'
    AND actor.login = 'lkmanka58'
    AND type = 'IssuesEvent'
ORDER BY created_at

Step 2: Recover Deleted PR Description

SELECT
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as title,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.body') as body,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.merged') as merged
FROM `githubarchive.day.202506*`
WHERE
    repo.name = 'target/repository'
    AND actor.login = 'target-user'
    AND type = 'PullRequestEvent'
ORDER BY created_at

Evidence Recovery:

  • Issue/PR Title: Full title text preserved in $.issue.title or $.pull_request.title
  • Issue/PR Body: Complete body text preserved in $.issue.body or $.pull_request.body
  • Comments: IssueCommentEvent preserves comment text in $.comment.body
  • Actor Attribution: actor.login identifies who created the content
  • Timestamps: Exact creation time in created_at
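
The queries above recover issue and PR bodies; a sketch for the comment case (placeholder repo and day) follows the same pattern:

query = """
SELECT
    created_at,
    actor.login as commenter,
    JSON_EXTRACT_SCALAR(payload, '$.issue.number') as issue_number,
    JSON_EXTRACT_SCALAR(payload, '$.comment.body') as comment_body
FROM `githubarchive.day.20250713`
WHERE
    repo.name = 'target/repository'
    AND type = 'IssueCommentEvent'
ORDER BY created_at
"""
for row in client.query(query):
    print(f"{row.created_at} {row.commenter} #{row.issue_number}: {row.comment_body}")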

Real Example: Amazon Q investigation recovered deleted issue content from lkmanka58. The issue titled "aws amazon donkey aaaaaaiii aaaaaaaiii" contained a rant calling Amazon Q "deceptive" and "scripted fakery". The full issue body was preserved in GitHub Archive despite deletion from github.com, providing context for the timeline reconstruction.

Deleted PRs

Scenario: Media claims attacker submitted a PR in "late June" containing malicious code, but PR is now deleted and cannot be found on github.com.

Step 1: Query Archive

query = """
SELECT
    type,
    created_at,
    repo.name as repo_name,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title
FROM `githubarchive.day.202506*`
WHERE
    actor.login = 'suspected-actor'
    AND repo.name = 'target/repository'
    AND type = 'PullRequestEvent'
ORDER BY created_at
"""

results = client.query(query)
pr_events = list(results)

Step 2: Analyze Results

if not pr_events:
    print("❌ CLAIM DISPROVEN: No PR activity found in June 2025")
else:
    for event in pr_events:
        print(f"✓ VERIFIED: PR #{event.pr_number} {event.action} on {event.created_at}")
        print(f"  Title: {event.pr_title}")
        print(f"  Repo: {event.repo_name}")

Evidence Validation:

  • Claim TRUE: Archive shows PullRequestEvent with action='opened'
  • Claim FALSE: No events found → claim disproven
  • Investigation Outcome: Definitively verify or refute timeline claims

Real Example: The Amazon Q investigation found no PR activity from the attacker's account in late June 2025, disproving media claims that malicious code was committed via a since-deleted PR.

Deleted Repository Forensics

Scenario: Threat actor creates staging repository, pushes malicious code, then deletes repo to cover tracks.

Step 1: Find Repository Activity

query = """
SELECT
    type,
    created_at,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as ref,
    JSON_EXTRACT_SCALAR(payload, '$.repository.name') as repo_name,
    payload
FROM `githubarchive.day.2025*`
WHERE
    actor.login = 'threat-actor'
    AND type IN ('CreateEvent', 'PushEvent')
    AND (
        JSON_EXTRACT_SCALAR(payload, '$.repository.name') = 'staging-repo'
        OR repo.name LIKE 'threat-actor/staging-repo'
    )
ORDER BY created_at
"""

results = client.query(query)

Step 2: Extract Commit SHAs

import json

commits = []
for row in results:
    if row.type == 'PushEvent':
        payload_data = json.loads(row.payload)
        for commit in payload_data.get('commits', []):
            commits.append({
                'sha': commit['sha'],
                'message': commit['message'],
                'timestamp': row.created_at
            })

for c in commits:
    print(f"{c['timestamp']}: {c['sha'][:8]} - {c['message']}")

Evidence Recovery:

  • CreateEvent reveals repository creation timestamp
  • PushEvent records contain commit SHAs and metadata
  • Commit SHAs can be used to recover code content via other archives or forks
  • Investigation Outcome: Complete reconstruction of attacker's staging infrastructure
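
As a hedged follow-up on recovered SHAs, GitHub's commit search API supports a hash: qualifier that can locate copies of a commit surviving in forks (unauthenticated sketch; add a token for higher rate limits):

import requests

def find_commit_copies(sha: str) -> list:
    """Search GitHub for repositories still containing a commit SHA."""
    resp = requests.get(
        "https://api.github.com/search/commits",
        params={"q": f"hash:{sha}"},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["repository"]["full_name"] for item in resp.json().get("items", [])]

for repo in find_commit_copies(commits[0]['sha']):  # SHAs recovered in Step 2
    print(f"Commit survives in: {repo}")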

Real Example: The lkmanka58/code_whisperer repository was deleted after the attack, but GitHub Archive revealed its June 13 creation and 3 commits containing AWS IAM role assumption attempts.

Deleted Tag Analysis

Scenario: Malicious tag used for payload delivery, then deleted to hide evidence.

Step 1: Search for Tag Events

SELECT
    type,
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as tag_name,
    JSON_EXTRACT_SCALAR(payload, '$.ref_type') as ref_type
FROM `githubarchive.day.20250713`
WHERE
    repo.name = 'target/repository'
    AND type IN ('CreateEvent', 'DeleteEvent')
    AND JSON_EXTRACT_SCALAR(payload, '$.ref_type') = 'tag'
ORDER BY created_at

Timeline Reconstruction:

2025-07-13 19:41:44 UTC | CreateEvent | aws-toolkit-automation | tag 'stability'
2025-07-13 20:30:24 UTC | PushEvent   | aws-toolkit-automation | commit references tag
2025-07-14 08:15:33 UTC | DeleteEvent | aws-toolkit-automation | tag 'stability' deleted

Analysis: The window between tag creation and deletion reveals the staging period for the attack infrastructure.

Real Example: The Amazon Q attack used the 'stability' tag for malicious payload delivery. The tag was later deleted, but the CreateEvent in GitHub Archive preserved the creation timestamp and actor, establishing the staging window.

Deleted Branch Reconstruction

Scenario: Attacker creates development branch with malicious code, pushes commits, then deletes branch after merging or to cover tracks.

Step 1: Find Branch Lifecycle

SELECT
    type,
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch_name,
    JSON_EXTRACT_SCALAR(payload, '$.ref_type') as ref_type
FROM `githubarchive.day.2025*`
WHERE
    repo.name = 'target/repository'
    AND type IN ('CreateEvent', 'DeleteEvent')
    AND JSON_EXTRACT_SCALAR(payload, '$.ref_type') = 'branch'
ORDER BY created_at

Step 2: Extract All Commit SHAs from Deleted Branch

SELECT
    created_at,
    actor.login as pusher,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch_ref,
    JSON_EXTRACT_SCALAR(commit, '$.sha') as commit_sha,
    JSON_EXTRACT_SCALAR(commit, '$.message') as commit_message,
    JSON_EXTRACT_SCALAR(commit, '$.author.name') as author_name,
    JSON_EXTRACT_SCALAR(commit, '$.author.email') as author_email
FROM `githubarchive.day.2025*`,
UNNEST(JSON_EXTRACT_ARRAY(payload, '$.commits')) as commit
WHERE
    repo.name = 'target/repository'
    AND type = 'PushEvent'
    AND JSON_EXTRACT_SCALAR(payload, '$.ref') = 'refs/heads/deleted-branch-name'
ORDER BY created_at

Evidence Recovery:

  • Commit SHAs: All commit identifiers permanently recorded in PushEvent payload
  • Commit Messages: Full commit messages preserved in commits array
  • Author Metadata: Name and email from commit author field
  • Pusher Identity: Actor who executed the push operation
  • Temporal Sequence: Exact timestamps for each push operation
  • Branch Lifecycle: Complete creation-to-deletion timeline

Forensic Value: Even after branch deletion, commit SHAs can be used to:

  • Search for commits in forked repositories
  • Check if commits were merged into other branches
  • Search external code archives (Software Heritage, etc.)
  • Reconstruct complete attack development timeline

Automation vs Direct API Attribution

Scenario: Suspicious commits appear under automation account name. Determine if they came from legitimate GitHub Actions workflow execution or direct API abuse with compromised token.

Step 1: Search for Workflow Events During Suspicious Window

query = """
SELECT
    type,
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') as workflow_name,
    JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') as commit_sha,
    JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') as conclusion
FROM `githubarchive.day.20250713`
WHERE
    repo.name = 'org/repository'
    AND type IN ('WorkflowRunEvent', 'WorkflowJobEvent')
    AND created_at >= '2025-07-13T20:25:00Z'
    AND created_at <= '2025-07-13T20:35:00Z'
ORDER BY created_at
"""

workflow_events = list(client.query(query))

Step 2: Establish Baseline Pattern

baseline_query = """
SELECT
    type,
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') as workflow_name
FROM `githubarchive.day.20250713`
WHERE
    repo.name = 'org/repository'
    AND actor.login = 'automation-account'
    AND type = 'WorkflowRunEvent'
ORDER BY created_at
"""

baseline = list(client.query(baseline_query))
print(f"Total workflows for day: {len(baseline)}")

Step 3: Analyze Results

if not workflow_events:
    print("🚨 DIRECT API ATTACK DETECTED")
    print("No WorkflowRunEvent during suspicious commit window")
    print("Commit was NOT from legitimate workflow execution")
else:
    print("✓ Legitimate workflow execution detected")
    for event in workflow_events:
        print(f"{event.created_at}: {event.workflow_name} - {event.conclusion}")

Expected Results if Legitimate Workflow:

2025-07-13 20:30:15 UTC | WorkflowRunEvent | deploy-automation | requested
2025-07-13 20:30:24 UTC | PushEvent        | aws-toolkit-automation | refs/heads/main
2025-07-13 20:31:08 UTC | WorkflowRunEvent | deploy-automation | completed

Expected Results if Direct API Abuse:

2025-07-13 20:30:24 UTC | PushEvent | aws-toolkit-automation | refs/heads/main
[NO WORKFLOW EVENTS IN ±10 MINUTE WINDOW]

Investigation Outcome: Absence of WorkflowRunEvent = Direct API attack with stolen token

Real Example: Amazon Q investigation needed to determine if malicious commit 678851bbe9776228f55e0460e66a6167ac2a1685 (pushed July 13, 2025 20:30:24 UTC by aws-toolkit-automation) came from compromised workflow or direct API abuse. GitHub Archive query showed ZERO WorkflowRunEvent or WorkflowJobEvent records during the 20:25-20:35 UTC window. Baseline analysis revealed the same automation account had 18 workflows that day, all clustered in 20:48-21:02 UTC. The temporal gap and complete workflow absence during the malicious commit proved direct API attack, not workflow compromise.

Troubleshooting

Permission denied errors:

  • Verify service account has BigQuery User role
  • Check credentials file path is correct
  • Ensure BigQuery API is enabled in Google Cloud project

Query exceeds free tier (>1 TiB):

  • Use daily tables instead of wildcard: githubarchive.day.20250615
  • Add date filters: WHERE created_at >= '2025-06-01' AND created_at < '2025-07-01'
  • Limit columns: Select only needed fields, not SELECT *
  • Use monthly tables for broader searches: githubarchive.month.202506

No results for known event:

  • Verify date range (archive starts Feb 12, 2011)
  • Check timezone (GitHub Archive uses UTC)
  • Confirm actor.login spelling (case-sensitive)
  • Some events may take up to 1 hour to appear (hourly updates)

Payload extraction returns NULL:

  • Verify JSON path exists with JSON_EXTRACT() before using JSON_EXTRACT_SCALAR()
  • Check event type has that payload field (not all events have all fields)
  • Inspect raw payload: SELECT payload FROM ... LIMIT 1

Query timeout or slow performance:

  • Add repo.name filter when possible (significantly reduces data scanned)
  • Use specific date ranges instead of wildcards
  • Consider using monthly aggregated tables for long-term analysis
  • Partition queries by date and run in parallel

Force Push Recovery (Zero-Commit PushEvents)

Scenario: Developer accidentally commits secrets, then force pushes to "delete" the commit. The commit remains accessible on GitHub, but finding it requires knowing the SHA.

Background: When a developer runs git reset --hard HEAD~1 && git push --force, Git removes the reference to that commit from the branch. However:

  • GitHub stores these "dangling" commits indefinitely
  • GitHub Archive records the before SHA in PushEvent payloads
  • Force pushes appear as PushEvents with zero commits (empty commits array)

Step 1: Find All Zero-Commit PushEvents (Organization-Wide)

SELECT
    created_at,
    actor.login,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.before') as deleted_commit_sha,
    JSON_EXTRACT_SCALAR(payload, '$.head') as current_head,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE
    repo.name LIKE 'target-org/%'
    AND type = 'PushEvent'
    AND JSON_EXTRACT_SCALAR(payload, '$.size') = '0'
ORDER BY created_at DESC

Step 2: Search for Specific Repository

SELECT
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.before') as deleted_commit_sha,
    JSON_EXTRACT_SCALAR(payload, '$.head') as after_sha,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.202506*`
WHERE
    repo.name = 'org/repository'
    AND type = 'PushEvent'
    AND JSON_EXTRACT_SCALAR(payload, '$.size') = '0'
ORDER BY created_at

Step 3: Bulk Recovery Query

query = """
SELECT
    created_at,
    actor.login as actor_login,
    repo.name as repo_name,
    JSON_EXTRACT_SCALAR(payload, '$.before') as deleted_sha,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.year.2024`
WHERE
    type = 'PushEvent'
    AND JSON_EXTRACT_SCALAR(payload, '$.size') = '0'
    AND repo.name LIKE 'target-org/%'
"""

results = client.query(query)
deleted_commits = []
for row in results:
    deleted_commits.append({
        'timestamp': row.created_at,
        'actor': row.actor_login,
        'repo': row.repo_name,
        'deleted_sha': row.deleted_sha,
        'branch': row.branch
    })

print(f"Found {len(deleted_commits)} force-pushed commits to investigate")

Evidence Recovery:

  • before SHA: The commit that was "deleted" by the force push
  • head SHA: The commit the branch was reset to
  • ref: Which branch was force pushed
  • actor.login: Who performed the force push
  • Commit Access: Use recovered SHA to access commit via GitHub API or web UI
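
As the last bullet notes, a recovered before SHA can be fetched directly even though no branch references it. A minimal sketch (hypothetical repo and SHA; add an Authorization header for rate limits):

import requests

def fetch_dangling_commit(repo: str, sha: str) -> dict:
    """Fetch a force-pushed commit that no branch references anymore."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/commits/{sha}",
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

commit = fetch_dangling_commit('target-org/repository', 'recovered-before-sha')
print(commit['commit']['message'])
for changed_file in commit.get('files', []):
    print(f"{changed_file['filename']}: +{changed_file['additions']} -{changed_file['deletions']}")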

Forensic Applications:

  • Secret Scanning: Scan recovered commits for leaked credentials, API keys, tokens
  • Incident Timeline: Identify when secrets were committed and when they were "hidden"
  • Attribution: Determine who committed secrets and who attempted to cover them up
  • Compliance: Prove data exposure window for breach notifications

Real Example: Security researcher Sharon Brizinov scanned all zero-commit PushEvents since 2020 across GitHub, recovering "deleted" commits and scanning them for secrets. This technique uncovered credentials worth $25k in bug bounties, including an admin-level GitHub PAT with access to all Istio repositories (36k stars, used by Google, IBM, Red Hat). The token could have enabled a massive supply-chain attack.

Important Notes:

  • Force pushing does NOT delete commits from GitHub - they remain accessible via SHA
  • GitHub Archive preserves the before SHA indefinitely
  • Zero-commit PushEvents are the forensic fingerprint of history rewrites
  • This technique provides 100% coverage of "deleted" commits (vs brute-forcing 4-char SHA prefixes)
