| name | github-archive |
| description | Investigate GitHub security incidents using tamper-proof GitHub Archive data via BigQuery. Use when verifying repository activity claims, recovering deleted PRs/branches/tags/repos, attributing actions to actors, or reconstructing attack timelines. Provides immutable forensic evidence of all public GitHub events since 2011. |
| version | 1 |
| author | mbrg |
| tags | github, gharchive, security, osint, forensics, git |
GitHub Archive
Purpose: Query immutable GitHub event history via BigQuery to obtain tamper-proof forensic evidence for security investigations.
When to Use This Skill
- Investigating security incidents involving GitHub repositories
- Building threat actor attribution profiles
- Verifying claims about repository activity (media reports, incident reports)
- Reconstructing attack timelines with definitive timestamps
- Analyzing automation system compromises
- Detecting supply chain reconnaissance
- Cross-repository behavioral analysis
- Workflow execution verification (legitimate vs API abuse)
- Pattern-based anomaly detection
- Recovering deleted content: PRs, issues, branches, tags, entire repositories
GitHub Archive analysis should be your FIRST step in any GitHub-related security investigation. Start with the immutable record, then enrich with additional sources.
Core Principles
ALWAYS PREFER GitHub Archive as forensic evidence over:
- Local git command outputs (`git log`, `git show`) - commits can be backdated/forged
- Unverified claims from articles or reports - require independent confirmation
- GitHub web interface screenshots - can be manipulated
- Single-source evidence - always cross-verify
GitHub Archive IS your ground truth for:
- Actor attribution (who performed actions)
- Timeline reconstruction (when events occurred)
- Event verification (what actually happened)
- Pattern analysis (behavioral fingerprinting)
- Cross-repository activity tracking
- Deleted content recovery (issues, PRs, tags, commit references remain in archive)
- Repository deletion forensics (commit SHAs persist even after repo deletion and history rewrites)
What Persists After Deletion
Deleted Issues & PRs:
- Issue creation events (`IssuesEvent`) remain in archive
- Issue comments (`IssueCommentEvent`) remain accessible
- PR open/close/merge events (`PullRequestEvent`) persist
- Forensic Value: Recover deleted evidence of social engineering, reconnaissance, or coordination
Deleted Tags & Branches:
- `CreateEvent` records for tag/branch creation persist
- `DeleteEvent` records document when deletion occurred
- Forensic Value: Reconstruct attack staging infrastructure (e.g., malicious payload delivery tags)
Deleted Repositories:
- All `PushEvent` records to the repository remain queryable
- Commit SHAs are permanently recorded in archive
- Fork relationships (`ForkEvent`) survive deletion
- Forensic Value: Access commit metadata even after threat actor deletes evidence
Deleted User Accounts:
- All activity events remain attributed to deleted username
- Timeline reconstruction remains possible (see the query sketch after this list)
- Limitation: Direct code access lost, but commit SHAs can be searched elsewhere
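To reconstruct a deleted account's timeline, a minimal sketch (the username and month are placeholders) that summarizes every event type the account generated:
SELECT
  type,
  DATE(created_at) as activity_date,
  COUNT(*) as events,
  COUNT(DISTINCT repo.name) as repos_touched
FROM `githubarchive.month.202506`
WHERE actor.login = 'deleted-threat-actor'  -- placeholder username
GROUP BY type, activity_date
ORDER BY activity_date, type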
Quick Start
Investigate if user opened PRs in June 2025:
from google.cloud import bigquery
from google.oauth2 import service_account
# Initialize client (see Setup section for credentials)
credentials = service_account.Credentials.from_service_account_file(
'path/to/credentials.json',
scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
# Query for PR events
query = """
SELECT
created_at,
repo.name,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title,
JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.202506*`
WHERE
actor.login = 'suspected-actor'
AND repo.name = 'target/repository'
AND type = 'PullRequestEvent'
ORDER BY created_at
"""
results = client.query(query)
for row in results:
print(f"{row.created_at}: PR #{row.pr_number} - {row.action}")
print(f" Title: {row.pr_title}")
Expected Output (if PR exists):
2025-06-15 14:23:11 UTC: PR #123 - opened
Title: Add new feature
2025-06-20 09:45:22 UTC: PR #123 - closed
Title: Add new feature
Interpretation:
- No results → Claim disproven (no PR activity found)
- Results found → Claim verified, proceed with detailed analysis
Setup
Prerequisites
Google Cloud Project:
- Login to Google Developer Console
- Create a project and activate BigQuery API
- Create a service account with the `BigQuery User` role
- Download the JSON credentials file
Install BigQuery Client:
pip install google-cloud-bigquery google-auth
Initialize Client
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
'path/to/credentials.json',
scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(
credentials=credentials,
project=credentials.project_id
)
Free Tier: Google provides 1 TiB of query data processed per month free.
Cost Management & Query Optimization
Understanding GitHub Archive Costs
BigQuery charges $6.25 per TiB of data scanned (after the 1 TiB free tier). GitHub Archive tables are large - a single month table can be 50-100 GB, and yearly wildcards can scan multiple TiBs. Unoptimized queries can cost $10-100+, while optimized versions of the same query cost $0.10-1.00.
Key Cost Principle: BigQuery uses columnar storage - you pay for ALL data in the columns you SELECT, not just matching rows. A query with SELECT * on one day of data scans ~3 GB even with LIMIT 10.
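To see the effect directly, a minimal sketch (assuming default application credentials; the username is a placeholder) that dry-runs the same filter with `SELECT *` versus two columns and prints the bytes each would scan:
from google.cloud import bigquery

def bytes_scanned(client: bigquery.Client, query: str) -> int:
    """Dry-run a query and return how many bytes it would scan."""
    cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(query, job_config=cfg).total_bytes_processed

client = bigquery.Client()
table = "`githubarchive.day.20250615`"
wide = f"SELECT * FROM {table} WHERE actor.login = 'target-user' LIMIT 10"
narrow = f"SELECT type, created_at FROM {table} WHERE actor.login = 'target-user' LIMIT 10"

# Both queries match the same rows; only the selected columns differ.
for label, q in [("SELECT *", wide), ("two columns", narrow)]:
    gb = bytes_scanned(client, q) / (1024**3)
    print(f"{label}: {gb:.2f} GB scanned")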
ALWAYS Estimate Costs Before Querying
CRITICAL RULE: Run a dry run to estimate costs before executing any query against GitHub Archive production tables.
from google.cloud import bigquery
def estimate_gharchive_cost(query: str) -> dict:
"""Estimate cost before running GitHub Archive query."""
client = bigquery.Client()
# Dry run - validates query and returns bytes to scan
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=dry_run_config)
bytes_processed = job.total_bytes_processed
gb_processed = bytes_processed / (1024**3)
tib_processed = bytes_processed / (1024**4)
estimated_cost = tib_processed * 6.25
return {
'bytes': bytes_processed,
'gigabytes': round(gb_processed, 2),
'tib': round(tib_processed, 4),
'estimated_cost_usd': round(estimated_cost, 4)
}
# Example: Always check cost before running
estimate = estimate_gharchive_cost(your_query)
print(f"Cost estimate: {estimate['gigabytes']} GB → ${estimate['estimated_cost_usd']}")
if estimate['estimated_cost_usd'] > 1.0:
print("⚠️ HIGH COST QUERY - Review optimization before proceeding")
Command-line dry run:
bq query --dry_run --use_legacy_sql=false 'YOUR_QUERY_HERE' 2>&1 | grep "bytes"
When to Ask the User About Costs
ASK USER BEFORE RUNNING if any of these conditions apply:
- Estimated cost > $1.00 - Always confirm with user for queries over $1
- Wildcard spans > 3 months - Queries like `githubarchive.day.2025*` scan the entire year (~400 GB)
- No partition filter - Queries without date/time filters scan the entire table range
- SELECT * used - Selecting all columns dramatically increases cost
- Cross-repository searches - Queries without a `repo.name` filter scan all GitHub activity
Example user confirmation:
Query estimate: 120 GB ($0.75)
Scanning: githubarchive.day.202506* (June 2025, 30 days)
Reason: Cross-repository search for actor 'suspected-user'
This exceeds typical query cost ($0.10-0.30). Proceed? [y/n]
DON'T ASK if:
- Estimated cost < $0.50 AND query is well-scoped (specific repo + date range)
- User explicitly requested broad analysis (e.g., "scan all of 2025")
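A minimal sketch of how this confirmation rule could be automated, reusing the dry-run pattern above (the $1.00 threshold and prompt wording are assumptions to adjust):
from google.cloud import bigquery

def confirm_before_running(query: str, threshold_usd: float = 1.00) -> bool:
    """Dry-run the query and ask the user to confirm if the estimate exceeds the threshold."""
    client = bigquery.Client()
    cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(query, job_config=cfg)
    gb = job.total_bytes_processed / (1024**3)
    cost = (job.total_bytes_processed / (1024**4)) * 6.25
    print(f"Query estimate: {gb:.0f} GB (${cost:.2f})")
    if cost <= threshold_usd:
        return True  # well-scoped query, no confirmation needed
    answer = input("This exceeds typical query cost. Proceed? [y/n] ")
    return answer.strip().lower() == "y"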
Cost Optimization Techniques for GitHub Archive
1. Select Only Required Columns (50-90% cost reduction)
-- ❌ EXPENSIVE: Scans ALL columns (~3 GB per day)
SELECT * FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'
-- ✅ OPTIMIZED: Scans only needed columns (~0.3 GB per day)
SELECT
type,
created_at,
repo.name,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'
Never use SELECT * in production queries. Always specify exact columns needed.
2. Use Specific Date Ranges (10-100x cost reduction)
-- ❌ EXPENSIVE: Scans entire year (~400 GB)
SELECT ... FROM `githubarchive.day.2025*`
WHERE actor.login = 'target-user'
-- ✅ OPTIMIZED: Scans specific month (~40 GB)
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'
-- ✅ BEST: Scans single day (~3 GB)
SELECT ... FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'
Strategy: Start with narrow date ranges (1-7 days), then expand if needed. Use monthly tables (githubarchive.month.202506) for multi-month queries instead of daily wildcards.
3. Filter by Repository Name (5-50x cost reduction)
-- ❌ EXPENSIVE: Scans all GitHub activity
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'
-- ✅ OPTIMIZED: Filter by repo (BigQuery can prune data blocks)
SELECT ... FROM `githubarchive.day.202506*`
WHERE
repo.name = 'target-org/target-repo'
AND actor.login = 'target-user'
Rule: Always include repo.name filter when investigating a specific repository.
4. Avoid SELECT * with Wildcards (Critical)
-- ❌ CATASTROPHIC: Can scan 1+ TiB ($6.25+)
SELECT * FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'
-- ✅ OPTIMIZED: Scans ~50 GB ($0.31)
SELECT
created_at,
actor.login,
repo.name,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'
5. Use LIMIT Correctly (Does NOT reduce cost on GHArchive)
IMPORTANT: LIMIT does not reduce BigQuery costs on non-clustered tables like GitHub Archive. BigQuery must scan all matching data before applying LIMIT.
-- ❌ MISCONCEPTION: Still scans full dataset
SELECT * FROM `githubarchive.day.20250615`
LIMIT 100 -- Cost: ~3 GB scanned
-- ✅ CORRECT: Use WHERE filters and column selection
SELECT type, created_at, actor.login
FROM `githubarchive.day.20250615`
WHERE repo.name = 'target/repo' -- Cost: ~0.2 GB scanned
LIMIT 100
Safe Query Execution Template
Use this template for all GitHub Archive queries in production:
def safe_gharchive_query(query: str, max_cost_usd: float = 1.0):
"""Execute GitHub Archive query with cost controls."""
client = bigquery.Client()
# Step 1: Dry run estimate
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(query, job_config=dry_run_config)
bytes_processed = dry_job.total_bytes_processed
gb = bytes_processed / (1024**3)
estimated_cost = (bytes_processed / (1024**4)) * 6.25
print(f"📊 Estimate: {gb:.2f} GB → ${estimated_cost:.4f}")
# Step 2: Check budget
if estimated_cost > max_cost_usd:
raise ValueError(
f"Query exceeds ${max_cost_usd} budget (estimated ${estimated_cost:.2f}). "
f"Optimize query or increase max_cost_usd parameter."
)
# Step 3: Execute with safety limit
job_config = bigquery.QueryJobConfig(
maximum_bytes_billed=int(bytes_processed * 1.2) # 20% buffer
)
print(f"✅ Executing query (max ${estimated_cost:.2f})...")
return client.query(query, job_config=job_config).result()
# Usage
results = safe_gharchive_query("""
SELECT created_at, repo.name, actor.login
FROM `githubarchive.day.20250615`
WHERE repo.name = 'aws/aws-toolkit-vscode'
AND type = 'PushEvent'
""", max_cost_usd=0.50)
Common Investigation Patterns: Cost Comparison
| Investigation Type | Expensive Approach | Cost | Optimized Approach | Cost |
|---|---|---|---|---|
| Verify user opened PR in June | `SELECT * FROM githubarchive.day.202506*` | ~$5.00 | `SELECT created_at, repo.name, payload FROM githubarchive.day.202506* WHERE actor.login='user' AND type='PullRequestEvent'` | ~$0.30 |
| Find all actor activity in 2025 | `SELECT * FROM githubarchive.day.2025*` | ~$60.00 | `SELECT type, created_at, repo.name FROM githubarchive.month.2025*` | ~$5.00 |
| Recover deleted PR content | `SELECT * FROM githubarchive.day.20250615` | ~$0.20 | `SELECT created_at, payload FROM githubarchive.day.20250615 WHERE repo.name='target/repo' AND type='PullRequestEvent'` | ~$0.02 |
| Cross-repo behavioral analysis | `SELECT * FROM githubarchive.day.202506*` | ~$5.00 | Start with `githubarchive.month.202506`, identify specific repos, then query daily tables | ~$0.50 |
Development vs Production Queries
During investigation/development:
- Start with single-day queries to test the pattern: `githubarchive.day.20250615`
- Verify the query returns expected results
- Expand to a date range only after validation: `githubarchive.day.202506*`
Production checklist:
- Used specific column names (no `SELECT *`)
- Included narrowest possible date range
- Added `repo.name` filter if investigating a specific repository
- Ran dry run and verified cost < $1.00 (or got user approval)
- Set `maximum_bytes_billed` in query config
Cost Monitoring
Track your BigQuery spending with this query:
-- View GitHub Archive query costs (last 7 days)
SELECT
DATE(creation_time) as query_date,
COUNT(*) as queries,
ROUND(SUM(total_bytes_billed) / (1024*1024*1024), 2) as total_gb,
ROUND(SUM(total_bytes_billed) / (1024*1024*1024*1024) * 6.25, 2) as cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND REGEXP_CONTAINS(query, r'githubarchive\.')
GROUP BY query_date
ORDER BY query_date DESC
Schema Reference
Table Organization
Dataset: githubarchive
Table Patterns:
- Daily tables: `githubarchive.day.YYYYMMDD` (e.g., `githubarchive.day.20250713`)
- Monthly tables: `githubarchive.month.YYYYMM` (e.g., `githubarchive.month.202507`)
- Yearly tables: `githubarchive.year.YYYY` (e.g., `githubarchive.year.2025`)
Wildcard Patterns:
- All days in June 2025: `githubarchive.day.202506*`
- All months in 2025: `githubarchive.month.2025*`
- All data in 2025: `githubarchive.year.2025*`
Data Availability: February 12, 2011 to present (updated hourly)
Schema Structure
Top-Level Fields:
type -- Event type (PushEvent, IssuesEvent, etc.)
created_at -- Timestamp when event occurred (UTC)
actor.login -- GitHub username who performed the action
actor.id -- GitHub user ID
repo.name -- Repository name (org/repo format)
repo.id -- Repository ID
org.login -- Organization login (if applicable)
org.id -- Organization ID
payload -- JSON string with event-specific data
Payload Field: JSON-encoded string containing event-specific details. Must be parsed with JSON_EXTRACT_SCALAR() in SQL or json.loads() in Python.
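For example, a minimal client-side parsing sketch (the fields used are from the PushEvent reference below; the table and repository are placeholders):
import json
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("""
    SELECT type, payload
    FROM `githubarchive.day.20250615`
    WHERE repo.name = 'target/repository' AND type = 'PushEvent'
    LIMIT 5
""").result()

for row in rows:
    payload = json.loads(row.payload)  # payload arrives as a JSON string
    print(row.type, payload.get("ref"), payload.get("size"))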
Event Types Reference
Repository Events
PushEvent - Commits pushed to a repository
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.ref') -- Branch (refs/heads/master)
JSON_EXTRACT_SCALAR(payload, '$.before') -- SHA before push
JSON_EXTRACT_SCALAR(payload, '$.after') -- SHA after push
JSON_EXTRACT_SCALAR(payload, '$.size') -- Number of commits
-- payload.commits[] contains array of commit objects with sha, message, author
PullRequestEvent - Pull request opened, closed, merged
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.action') -- opened, closed, merged
JSON_EXTRACT_SCALAR(payload, '$.pull_request.number')
JSON_EXTRACT_SCALAR(payload, '$.pull_request.title')
JSON_EXTRACT_SCALAR(payload, '$.pull_request.merged') -- true/false
CreateEvent - Branch or tag created
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.ref_type') -- branch, tag, repository
JSON_EXTRACT_SCALAR(payload, '$.ref') -- Name of branch/tag
DeleteEvent - Branch or tag deleted
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.ref_type') -- branch or tag
JSON_EXTRACT_SCALAR(payload, '$.ref') -- Name of deleted ref
ForkEvent - Repository forked
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.forkee.full_name') -- New fork name
Automation & CI/CD Events
WorkflowRunEvent - GitHub Actions workflow run status changes
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.action') -- requested, completed
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name')
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.path') -- .github/workflows/file.yml
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.status') -- queued, in_progress, completed
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') -- success, failure, cancelled
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha')
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_branch')
WorkflowJobEvent - Individual job within a workflow
CheckRunEvent - Check run status (CI systems)
CheckSuiteEvent - Check suite for commits
Issue & Discussion Events
IssuesEvent - Issue opened, closed, edited
-- Payload fields:
JSON_EXTRACT_SCALAR(payload, '$.action') -- opened, closed, reopened
JSON_EXTRACT_SCALAR(payload, '$.issue.number')
JSON_EXTRACT_SCALAR(payload, '$.issue.title')
JSON_EXTRACT_SCALAR(payload, '$.issue.body')
IssueCommentEvent - Comment on an issue or pull request
PullRequestReviewEvent - PR review submitted
PullRequestReviewCommentEvent - Comment on a PR diff
Other Events
WatchEvent - Repository starred
ReleaseEvent - Release published
MemberEvent - Collaborator added/removed
PublicEvent - Repository made public
Investigation Patterns
Deleted Issue & PR Text Recovery
Scenario: Issue or PR was deleted from GitHub (by author, maintainer, or moderation) but you need to recover the original title and body text for investigation, compliance, or historical reference.
Step 1: Recover Deleted Issue Content
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') as action,
JSON_EXTRACT_SCALAR(payload, '$.issue.number') as issue_number,
JSON_EXTRACT_SCALAR(payload, '$.issue.title') as title,
JSON_EXTRACT_SCALAR(payload, '$.issue.body') as body
FROM `githubarchive.day.20250713`
WHERE
repo.name = 'aws/aws-toolkit-vscode'
AND actor.login = 'lkmanka58'
AND type = 'IssuesEvent'
ORDER BY created_at
Step 2: Recover Deleted PR Description
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') as action,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as title,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.body') as body,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.merged') as merged
FROM `githubarchive.day.202506*`
WHERE
repo.name = 'target/repository'
AND actor.login = 'target-user'
AND type = 'PullRequestEvent'
ORDER BY created_at
Evidence Recovery:
- Issue/PR Title: Full title text preserved in `$.issue.title` or `$.pull_request.title`
- Issue/PR Body: Complete body text preserved in `$.issue.body` or `$.pull_request.body`
- Comments: `IssueCommentEvent` preserves comment text in `$.comment.body` (see the recovery query below)
- Actor Attribution: `actor.login` identifies who created the content
- Timestamps: Exact creation time in `created_at`
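A minimal sketch of the comment-recovery query referenced above (repository and date are placeholders):
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.issue.number') as issue_number,
  JSON_EXTRACT_SCALAR(payload, '$.comment.body') as comment_body
FROM `githubarchive.day.20250713`
WHERE
  repo.name = 'target/repository'
  AND type = 'IssueCommentEvent'
ORDER BY created_at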
Real Example: Amazon Q investigation recovered deleted issue content from lkmanka58. The issue titled "aws amazon donkey aaaaaaiii aaaaaaaiii" contained a rant calling Amazon Q "deceptive" and "scripted fakery". The full issue body was preserved in GitHub Archive despite deletion from github.com, providing context for the timeline reconstruction.
Deleted PRs
Scenario: Media claims attacker submitted a PR in "late June" containing malicious code, but PR is now deleted and cannot be found on github.com.
Step 1: Query Archive
query = """
SELECT
type,
created_at,
repo.name as repo_name,
JSON_EXTRACT_SCALAR(payload, '$.action') as action,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title
FROM `githubarchive.day.202506*`
WHERE
actor.login = 'suspected-actor'
AND repo.name = 'target/repository'
AND type = 'PullRequestEvent'
ORDER BY created_at
"""
results = client.query(query)
pr_events = list(results)
Step 2: Analyze Results
if not pr_events:
print("❌ CLAIM DISPROVEN: No PR activity found in June 2025")
else:
for event in pr_events:
print(f"✓ VERIFIED: PR #{event.pr_number} {event.action} on {event.created_at}")
print(f" Title: {event.pr_title}")
print(f" Repo: {event.repo_name}")
Evidence Validation:
- Claim TRUE: Archive shows `PullRequestEvent` with `action='opened'`
- Claim FALSE: No events found → claim disproven
- Investigation Outcome: Definitively verify or refute timeline claims
Real Example: Amazon Q investigation verified no PR from attacker's account in late June 2025, disproving media's claim of malicious code committed via deleted PR.
Deleted Repository Forensics
Scenario: Threat actor creates staging repository, pushes malicious code, then deletes repo to cover tracks.
Step 1: Find Repository Activity
query = """
SELECT
type,
created_at,
JSON_EXTRACT_SCALAR(payload, '$.ref') as ref,
JSON_EXTRACT_SCALAR(payload, '$.repository.name') as repo_name,
payload
FROM `githubarchive.day.2025*`
WHERE
actor.login = 'threat-actor'
AND type IN ('CreateEvent', 'PushEvent')
AND (
JSON_EXTRACT_SCALAR(payload, '$.repository.name') = 'staging-repo'
OR repo.name LIKE 'threat-actor/staging-repo'
)
ORDER BY created_at
"""
results = client.query(query)
Step 2: Extract Commit SHAs
import json
commits = []
for row in results:
if row.type == 'PushEvent':
payload_data = json.loads(row.payload)
for commit in payload_data.get('commits', []):
commits.append({
'sha': commit['sha'],
'message': commit['message'],
'timestamp': row.created_at
})
for c in commits:
print(f"{c['timestamp']}: {c['sha'][:8]} - {c['message']}")
Evidence Recovery:
- `CreateEvent` reveals repository creation timestamp
- `PushEvent` records contain commit SHAs and metadata
- Commit SHAs can be used to recover code content via other archives or forks (see the fork-check sketch below)
- Investigation Outcome: Complete reconstruction of attacker's staging infrastructure
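A minimal sketch of the fork-check idea (repository names and SHA are placeholders; it assumes unauthenticated GitHub REST API access is sufficient for a handful of lookups):
import requests
from google.cloud import bigquery

client = bigquery.Client()
# Forks recorded before the repository was deleted
forks = client.query("""
    SELECT DISTINCT JSON_EXTRACT_SCALAR(payload, '$.forkee.full_name') as fork_name
    FROM `githubarchive.day.2025*`
    WHERE repo.name = 'threat-actor/staging-repo' AND type = 'ForkEvent'
""").result()

commit_sha = "0000000000000000000000000000000000000000"  # placeholder SHA from PushEvent records
for row in forks:
    # If the fork still exists, the commit may be reachable there by SHA
    resp = requests.get(f"https://api.github.com/repos/{row.fork_name}/commits/{commit_sha}", timeout=30)
    print(row.fork_name, "->", "commit present" if resp.status_code == 200 else f"HTTP {resp.status_code}")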
Real Example: lkmanka58/code_whisperer repository deleted after attack, but GitHub Archive revealed June 13 creation with 3 commits containing AWS IAM role assumption attempts.
Deleted Tag Analysis
Scenario: Malicious tag used for payload delivery, then deleted to hide evidence.
Step 1: Search for Tag Events
SELECT
type,
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.ref') as tag_name,
JSON_EXTRACT_SCALAR(payload, '$.ref_type') as ref_type
FROM `githubarchive.day.20250713`
WHERE
repo.name = 'target/repository'
AND type IN ('CreateEvent', 'DeleteEvent')
AND JSON_EXTRACT_SCALAR(payload, '$.ref_type') = 'tag'
ORDER BY created_at
Timeline Reconstruction:
2025-07-13 19:41:44 UTC | CreateEvent | aws-toolkit-automation | tag 'stability'
2025-07-13 20:30:24 UTC | PushEvent | aws-toolkit-automation | commit references tag
2025-07-14 08:15:33 UTC | DeleteEvent | aws-toolkit-automation | tag 'stability' deleted
Analysis: The window between tag creation and deletion reveals the staging period for the attack infrastructure.
Real Example: Amazon Q attack used 'stability' tag for malicious payload delivery. Tag was deleted, but CreateEvent in GitHub Archive preserved creation timestamp and actor, proving 48-hour staging window.
Deleted Branch Reconstruction
Scenario: Attacker creates development branch with malicious code, pushes commits, then deletes branch after merging or to cover tracks.
Step 1: Find Branch Lifecycle
SELECT
type,
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch_name,
JSON_EXTRACT_SCALAR(payload, '$.ref_type') as ref_type
FROM `githubarchive.day.2025*`
WHERE
repo.name = 'target/repository'
AND type IN ('CreateEvent', 'DeleteEvent')
AND JSON_EXTRACT_SCALAR(payload, '$.ref_type') = 'branch'
ORDER BY created_at
Step 2: Extract All Commit SHAs from Deleted Branch
SELECT
created_at,
actor.login as pusher,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch_ref,
JSON_EXTRACT_SCALAR(commit, '$.sha') as commit_sha,
JSON_EXTRACT_SCALAR(commit, '$.message') as commit_message,
JSON_EXTRACT_SCALAR(commit, '$.author.name') as author_name,
JSON_EXTRACT_SCALAR(commit, '$.author.email') as author_email
FROM `githubarchive.day.2025*`,
UNNEST(JSON_EXTRACT_ARRAY(payload, '$.commits')) as commit
WHERE
repo.name = 'target/repository'
AND type = 'PushEvent'
AND JSON_EXTRACT_SCALAR(payload, '$.ref') = 'refs/heads/deleted-branch-name'
ORDER BY created_at
Evidence Recovery:
- Commit SHAs: All commit identifiers permanently recorded in `PushEvent` payload
- Commit Messages: Full commit messages preserved in the commits array
- Author Metadata: Name and email from commit author field
- Pusher Identity: Actor who executed the push operation
- Temporal Sequence: Exact timestamps for each push operation
- Branch Lifecycle: Complete creation-to-deletion timeline
Forensic Value: Even after branch deletion, commit SHAs can be used to:
- Search for commits in forked repositories
- Check if commits were merged into other branches
- Search external code archives (Software Heritage, etc.; see the sketch after this list)
- Reconstruct complete attack development timeline
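One possible sketch of the external-archive check (it assumes the public Software Heritage revision-lookup endpoint; verify against their current API docs, and the SHA is a placeholder):
import requests

def in_software_heritage(commit_sha: str) -> bool:
    """Return True if Software Heritage has archived a revision with this SHA."""
    url = f"https://archive.softwareheritage.org/api/1/revision/{commit_sha}/"
    return requests.get(url, timeout=30).status_code == 200

print(in_software_heritage("0000000000000000000000000000000000000000"))  # placeholder SHA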
Automation vs Direct API Attribution
Scenario: Suspicious commits appear under automation account name. Determine if they came from legitimate GitHub Actions workflow execution or direct API abuse with compromised token.
Step 1: Search for Workflow Events During Suspicious Window
query = """
SELECT
type,
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') as workflow_name,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') as commit_sha,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') as conclusion
FROM `githubarchive.day.20250713`
WHERE
repo.name = 'org/repository'
AND type IN ('WorkflowRunEvent', 'WorkflowJobEvent')
AND created_at >= '2025-07-13T20:25:00Z'
AND created_at <= '2025-07-13T20:35:00Z'
ORDER BY created_at
"""
workflow_events = list(client.query(query))
Step 2: Establish Baseline Pattern
baseline_query = """
SELECT
type,
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') as workflow_name
FROM `githubarchive.day.20250713`
WHERE
repo.name = 'org/repository'
AND actor.login = 'automation-account'
AND type = 'WorkflowRunEvent'
ORDER BY created_at
"""
baseline = list(client.query(baseline_query))
print(f"Total workflows for day: {len(baseline)}")
Step 3: Analyze Results
if not workflow_events:
print("🚨 DIRECT API ATTACK DETECTED")
print("No WorkflowRunEvent during suspicious commit window")
print("Commit was NOT from legitimate workflow execution")
else:
print("✓ Legitimate workflow execution detected")
for event in workflow_events:
print(f"{event.created_at}: {event.workflow_name} - {event.conclusion}")
Expected Results if Legitimate Workflow:
2025-07-13 20:30:15 UTC | WorkflowRunEvent | deploy-automation | requested
2025-07-13 20:30:24 UTC | PushEvent | aws-toolkit-automation | refs/heads/main
2025-07-13 20:31:08 UTC | WorkflowRunEvent | deploy-automation | completed
Expected Results if Direct API Abuse:
2025-07-13 20:30:24 UTC | PushEvent | aws-toolkit-automation | refs/heads/main
[NO WORKFLOW EVENTS IN ±10 MINUTE WINDOW]
Investigation Outcome: Absence of WorkflowRunEvent = Direct API attack with stolen token
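A small helper sketch that applies the ±10-minute correlation window programmatically, given a suspicious push timestamp (the repository name and window size are assumptions to adjust per investigation):
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

def workflow_events_near(repo: str, push_time: datetime, window_minutes: int = 10):
    """Return workflow events within ±window_minutes of a suspicious push."""
    client = bigquery.Client()
    start = push_time - timedelta(minutes=window_minutes)
    end = push_time + timedelta(minutes=window_minutes)
    table = f"githubarchive.day.{push_time:%Y%m%d}"
    query = f"""
        SELECT type, created_at, actor.login
        FROM `{table}`
        WHERE repo.name = @repo
          AND type IN ('WorkflowRunEvent', 'WorkflowJobEvent')
          AND created_at BETWEEN @start AND @end
    """
    cfg = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("repo", "STRING", repo),
        bigquery.ScalarQueryParameter("start", "TIMESTAMP", start),
        bigquery.ScalarQueryParameter("end", "TIMESTAMP", end),
    ])
    return list(client.query(query, job_config=cfg).result())

push_time = datetime(2025, 7, 13, 20, 30, 24, tzinfo=timezone.utc)
events = workflow_events_near("org/repository", push_time)
print("Direct API abuse suspected" if not events else f"{len(events)} workflow events found")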
Real Example: Amazon Q investigation needed to determine if malicious commit 678851bbe9776228f55e0460e66a6167ac2a1685 (pushed July 13, 2025 20:30:24 UTC by aws-toolkit-automation) came from compromised workflow or direct API abuse. GitHub Archive query showed ZERO WorkflowRunEvent or WorkflowJobEvent records during the 20:25-20:35 UTC window. Baseline analysis revealed the same automation account had 18 workflows that day, all clustered in 20:48-21:02 UTC. The temporal gap and complete workflow absence during the malicious commit proved direct API attack, not workflow compromise.
Troubleshooting
Permission denied errors:
- Verify service account has the `BigQuery User` role
- Check that the credentials file path is correct
- Ensure BigQuery API is enabled in Google Cloud project
Query exceeds free tier (>1 TiB):
- Use daily tables instead of wildcards: `githubarchive.day.20250615`
- Add date filters: `WHERE created_at >= '2025-06-01' AND created_at < '2025-07-01'`
- Limit columns: Select only needed fields, not `SELECT *`
- Use monthly tables for broader searches: `githubarchive.month.202506`
No results for known event:
- Verify date range (archive starts Feb 12, 2011)
- Check timezone (GitHub Archive uses UTC)
- Confirm `actor.login` spelling (case-sensitive)
- Some events may take up to 1 hour to appear (hourly updates)
Payload extraction returns NULL:
- Verify the JSON path exists with `JSON_EXTRACT()` before using `JSON_EXTRACT_SCALAR()`
- Check that the event type has that payload field (not all events have all fields)
- Inspect the raw payload: `SELECT payload FROM ... LIMIT 1` (see the diagnostic sketch below)
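A possible diagnostic sketch for the NULL case (table, repository, and JSON path are placeholders): compare JSON_EXTRACT against JSON_EXTRACT_SCALAR to see whether the path is missing or simply not a scalar.
SELECT
  type,
  JSON_EXTRACT(payload, '$.pull_request') as raw_value,
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as scalar_value
FROM `githubarchive.day.20250615`
WHERE repo.name = 'target/repository'
LIMIT 5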
Query timeout or slow performance:
- Add a `repo.name` filter when possible (significantly reduces data scanned)
- Use specific date ranges instead of wildcards
- Consider using monthly aggregated tables for long-term analysis
- Partition queries by date and run in parallel
Force Push Recovery (Zero-Commit PushEvents)
Scenario: Developer accidentally commits secrets, then force pushes to "delete" the commit. The commit remains accessible on GitHub, but finding it requires knowing the SHA.
Background: When a developer runs git reset --hard HEAD~1 && git push --force, Git removes the reference to that commit from the branch. However:
- GitHub stores these "dangling" commits indefinitely
- GitHub Archive records the `before` SHA in PushEvent payloads
- Force pushes appear as PushEvents with zero commits (empty commits array)
Step 1: Find All Zero-Commit PushEvents (Organization-Wide)
SELECT
created_at,
actor.login,
repo.name,
JSON_EXTRACT_SCALAR(payload, '$.before') as deleted_commit_sha,
JSON_EXTRACT_SCALAR(payload, '$.head') as current_head,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE
repo.name LIKE 'target-org/%'
AND type = 'PushEvent'
AND JSON_EXTRACT_SCALAR(payload, '$.size') = '0'
ORDER BY created_at DESC
Step 2: Search for Specific Repository
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.before') as deleted_commit_sha,
JSON_EXTRACT_SCALAR(payload, '$.head') as after_sha,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.202506*`
WHERE
repo.name = 'org/repository'
AND type = 'PushEvent'
AND JSON_EXTRACT_SCALAR(payload, '$.size') = '0'
ORDER BY created_at
Step 3: Bulk Recovery Query
query = """
SELECT
created_at,
actor.login as actor_login,
repo.name as repo_name,
JSON_EXTRACT_SCALAR(payload, '$.before') as deleted_sha,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.year.2024`
WHERE
type = 'PushEvent'
AND JSON_EXTRACT_SCALAR(payload, '$.size') = '0'
AND repo.name LIKE 'target-org/%'
"""
results = client.query(query)
deleted_commits = []
for row in results:
deleted_commits.append({
'timestamp': row.created_at,
'actor': row.actor_login,
'repo': row.repo_name,
'deleted_sha': row.deleted_sha,
'branch': row.branch
})
print(f"Found {len(deleted_commits)} force-pushed commits to investigate")
Evidence Recovery:
- `before` SHA: The commit that was "deleted" by the force push
- `head` SHA: The commit the branch was reset to
- `ref`: Which branch was force pushed
- `actor.login`: Who performed the force push
- Commit Access: Use recovered SHA to access commit via GitHub API or web UI
Forensic Applications:
- Secret Scanning: Scan recovered commits for leaked credentials, API keys, tokens (see the sketch after this list)
- Incident Timeline: Identify when secrets were committed and when they were "hidden"
- Attribution: Determine who committed secrets and who attempted to cover them up
- Compliance: Prove data exposure window for breach notifications
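A minimal sketch of the recover-and-scan step (repository and SHA are placeholders; the regex is deliberately naive, and a real investigation would use a dedicated scanner such as TruffleHog):
import re
import requests

def scan_recovered_commit(repo: str, sha: str):
    """Fetch a force-pushed ('deleted') commit by SHA and grep its patch for obvious secrets."""
    resp = requests.get(f"https://api.github.com/repos/{repo}/commits/{sha}", timeout=30)
    resp.raise_for_status()
    pattern = re.compile(r"(AKIA[0-9A-Z]{16}|ghp_[0-9A-Za-z]{36})")  # naive AWS key / GitHub PAT patterns
    for f in resp.json().get("files", []):
        for match in pattern.findall(f.get("patch", "") or ""):
            print(f"{sha[:8]} {f['filename']}: possible secret {match[:12]}...")

scan_recovered_commit("target-org/target-repo", "0000000000000000000000000000000000000000")  # placeholders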
Real Example: Security researcher Sharon Brizinov scanned all zero-commit PushEvents since 2020 across GitHub, recovering "deleted" commits and scanning them for secrets. This technique uncovered credentials worth $25k in bug bounties, including an admin-level GitHub PAT with access to all Istio repositories (36k stars, used by Google, IBM, Red Hat). The token could have enabled a massive supply-chain attack.
Important Notes:
- Force pushing does NOT delete commits from GitHub - they remain accessible via SHA
- GitHub Archive preserves the `before` SHA indefinitely
- Zero-commit PushEvents are the forensic fingerprint of history rewrites
- This technique provides 100% coverage of "deleted" commits (vs brute-forcing 4-char SHA prefixes)
Learn More
- GH Archive Documentation: https://www.gharchive.org/
- GitHub Event Types Schema: https://docs.github.com/en/rest/using-the-rest-api/github-event-types
- BigQuery Documentation: https://cloud.google.com/bigquery/docs
- BigQuery SQL Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
- Force Push Scanner Tool: https://github.com/trufflesecurity/force-push-scanner