| name | vm-infrastructure-ops |
| description | Troubleshoot and manage GCP e2-micro VM running eth-realtime-collector. Use when VM service is down, systemd failures occur, real-time data stream stops, or VM network issues arise. Keywords systemd, journalctl, eth-collector, gcloud compute. |
VM Infrastructure Operations
Version: 1.0.0 Last Updated: 2025-11-13 Purpose: Troubleshoot and manage GCP e2-micro VM running eth-realtime-collector
When to Use
Use this skill when:
- VM service down, "eth-collector" systemd failures
- Real-time data stream stopped (ClickHouse not receiving blocks)
- VM network issues, DNS resolution failures
- Need to check service status, view logs, or restart services
- Keywords: systemd, journalctl, eth-collector, gcloud compute
Prerequisites
- GCP project access:
eonlabs-ethereum-bq - VM instance:
eth-realtime-collectorin zoneus-east1-b - gcloud CLI configured with appropriate credentials
Workflows
1. Check Service Status
Check if eth-collector systemd service is running:
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
--command='sudo systemctl status eth-collector'
Expected Output (healthy):
● eth-collector.service - Ethereum Real-Time Collector
Loaded: loaded (/etc/systemd/system/eth-collector.service; enabled)
Active: active (running) since ...
Alternative (use provided script):
.claude/skills/vm-infrastructure-ops/scripts/check_vm_status.sh
2. View Logs (Live Tail)
Stream real-time logs from the collector service:
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
--command='sudo journalctl -u eth-collector -f'
What to Look For:
- "Block inserted" messages every ~12 seconds (healthy)
- gRPC errors, DNS resolution failures (unhealthy)
- "Connection refused" or "Metadata server unreachable" (network issues)
3. View Recent Logs (Last 100 Lines)
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
--command='sudo journalctl -u eth-collector -n 100'
4. Restart Service
Restart the collector service after configuration changes or to recover from errors:
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
--command='sudo systemctl restart eth-collector'
Alternative (use provided script with pre-checks):
.claude/skills/vm-infrastructure-ops/scripts/restart_collector.sh
When to Use:
- After deploying code updates
- Recovering from gRPC metadata validation errors
- After Secret Manager credential updates
5. VM Hard Reset
Hard reset the VM instance (use as last resort):
gcloud compute instances reset eth-realtime-collector --zone=us-east1-b
When to Use:
- VM network connectivity completely lost
- DNS resolution failures
- Metadata server unreachable
- Service restart doesn't resolve issues
Warning: This forcefully restarts the VM. All in-memory state is lost.
6. Verify Data Flow
After restarting services, verify data is flowing to ClickHouse:
cd
doppler run --project aws-credentials --config prd -- python3 -c "
import clickhouse_connect
import os
client = clickhouse_connect.get_client(
host=os.environ['CLICKHOUSE_HOST'],
port=8443,
username='default',
password=os.environ['CLICKHOUSE_PASSWORD'],
secure=True
)
result = client.query('SELECT MAX(timestamp), MAX(number) FROM ethereum_mainnet.blocks FINAL')
print(f'Latest block: {result.result_rows[0][1]:,} at {result.result_rows[0][0]}')
"
Expected Output (healthy):
Latest block: 23,800,000+ at <within last 60 seconds>
Common Failure Modes
See VM Failure Modes for detailed troubleshooting guide.
Quick Reference:
| Symptom | Likely Cause | Solution |
|---|---|---|
Service status: failed |
gRPC metadata error | Check logs, restart with .strip() fix |
| No blocks for >5 minutes | Network connectivity | Check network, reset VM if needed |
| DNS resolution errors | Metadata server unreachable | VM hard reset |
| "Connection refused" | Service not running | Restart service |
Systemd Commands
See Systemd Commands Reference for complete systemd operations.
Quick Reference:
# Status
sudo systemctl status eth-collector
# Start
sudo systemctl start eth-collector
# Stop
sudo systemctl stop eth-collector
# Restart
sudo systemctl restart eth-collector
# Enable (start on boot)
sudo systemctl enable eth-collector
# Disable (don't start on boot)
sudo systemctl disable eth-collector
# View service logs
sudo journalctl -u eth-collector
# Follow logs live
sudo journalctl -u eth-collector -f
Operational History
Infrastructure Recovery (2025-11-10 07:00 UTC):
- VM network failure detected (DNS resolution failed, metadata server unreachable)
- Recovery: VM reset restored network connectivity
- eth-collector service restarted with
.strip()fix (gRPC metadata validation resolved) - Real-time data flow confirmed: blocks streaming every ~12 seconds
- Database verified: 23.8M blocks (2015-2025), latest block within seconds
Maintainability SLO Achievement: Critical infrastructure failure (VM network down) resolved in <30 minutes (VM reset + service restart + verification).
Related Documentation
- ClickHouse Migration ADR - Production database migration
- Real-Time Collector Deployment Guide - VM deployment
- Gap Monitor README - Automated gap detection
- Data Pipeline Monitoring Skill - Cloud Run Jobs monitoring
Scripts
- `check_vm_status.sh` - Automated status check via gcloud
- `restart_collector.sh` - Safe restart with pre-checks
References
- `vm-failure-modes.md` - Common failure scenarios and solutions
- `systemd-commands.md` - Complete systemd operations reference