---
name: persistent-cache-gap-filling
description: "Persistent data cache with gap-filling for historical market data. Trigger when: (1) cache re-downloads complete data unnecessarily, (2) time-based cache expiry wastes API calls, (3) historical data needs incremental updates only."
author: Claude Code
date: 2026-01-01
---
# Persistent Cache with Gap-Filling (v2.8.0)
## Experiment Overview

| Item | Details |
|---|---|
| Date | 2026-01-01 |
| Goal | Eliminate redundant downloads of historical data by removing time-based cache expiry |
| Environment | `alpaca_trading/data/` modules |
| Status | Success |
## Context

Re-running the training notebook caused complete re-downloads of historical data even though:

- the data had been downloaded earlier the same day,
- historical data is immutable (past candles never change), and
- only new bars since the last download were needed.

The root cause was time-based cache expiry:

- SQLite cache (`cache.py`): 12-hour TTL via `PERSISTED_TTL_HOURS`
- Pickle cache (`caching_fetcher.py`): 3-7 day expiry via `cache_expiry_days`
## v2.8.0 Solution: Persistent Cache + Gap-Filling

### Core Principle

Historical market data is immutable. Once downloaded and validated, it should persist indefinitely. Only fetch new bars to fill the gap between the cache end and the current time.
### Changes Made

#### 1. `cache.py` - SQLite Cache

```python
# Before: TTL always checked
def get(self, ..., ttl_hours: int = 24):
    if created_at < ttl_cutoff or expires_at < now_ts:
        self._remove_entry(cache_key)
        return None

# After: TTL is optional (None = no expiry)
def get(self, ..., ttl_hours: Optional[int] = None):
    # Only check TTL if explicitly specified
    if ttl_hours is not None:
        ...  # TTL check
    # Otherwise return cached data regardless of age
```
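For context, here is a runnable sketch of the opt-in TTL pattern against a minimal key/payload schema. The `SqliteCache` class and `entries` table are illustrative stand-ins, not the real `cache.py` (which stores serialized DataFrames), but the TTL logic is the same:

```python
import sqlite3
import time
from typing import Optional

class SqliteCache:
    """Illustrative cache with opt-in TTL (None = entries never expire)."""

    def __init__(self, path: str = "cache.db"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entries "
            "(key TEXT PRIMARY KEY, payload BLOB, created_at REAL)"
        )

    def get(self, key: str, ttl_hours: Optional[int] = None) -> Optional[bytes]:
        row = self._conn.execute(
            "SELECT payload, created_at FROM entries WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        payload, created_at = row
        # TTL is opt-in: with the default of None, age is never checked.
        if ttl_hours is not None:
            age_hours = (time.time() - created_at) / 3600
            if age_hours > ttl_hours:
                self._conn.execute("DELETE FROM entries WHERE key = ?", (key,))
                self._conn.commit()
                return None
        return payload
```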
#### 2. `fetcher.py` - DataFetcher

```python
# Removed:
# PERSISTED_TTL_HOURS = 12
# self._cache_ttl_hours = PERSISTED_TTL_HOURS

# Updated _load_persisted - no TTL check
def _load_persisted(self, symbol: str, timeframe: str) -> pd.DataFrame:
    # No TTL - historical data is immutable
    cached = self._cache.get(symbol, timeframe, start="", end="")
    return cached
```
#### 3. `caching_fetcher.py` - CachingDataFetcher

```python
class CachingDataFetcher:
    def get_bars(self, symbol, timeframe, lookback_days, **kwargs):
        end_dt = pd.Timestamp.now(tz="UTC")
        start_dt = end_dt - pd.Timedelta(days=lookback_days)
        cached_df = load_from_cache(symbol, timeframe, cache_dir=self._cache_dir)
        if cached_df is not None:
            cache_start = cached_df.index.min()
            cache_end = cached_df.index.max()
            # Check if cache covers requested range
            if cache_start <= start_dt and cache_end >= end_dt - tolerance:
                return cached_df  # Complete - no API call
            # Gap-fill: only fetch bars after the cache end
            fetch_start = cache_end + timedelta(hours=1)
            new_df = self._fetcher.get_bars(symbol, start=fetch_start, ...)
            # Merge and save
            combined = pd.concat([cached_df, new_df])
            save_to_cache(symbol, combined, ...)
            return combined
```
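One detail worth noting at the merge step: if the last cached bar was still forming at the previous fetch, `pd.concat` can leave a duplicated timestamp at the seam. A minimal dedup-and-sort merge (the `merge_bars` helper name is hypothetical):

```python
import pandas as pd

def merge_bars(cached_df: pd.DataFrame, new_df: pd.DataFrame) -> pd.DataFrame:
    """Merge freshly fetched bars into cached ones.

    Where the two frames overlap on a timestamp (e.g. a bar that was
    still forming at the last fetch), the newly fetched copy wins.
    """
    combined = pd.concat([cached_df, new_df])
    # keep="last" prefers the row from new_df on duplicate index values
    combined = combined[~combined.index.duplicated(keep="last")]
    return combined.sort_index()
```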
## Behavior Comparison

### Before (Time-Based Expiry)

```
Run 1 (10:00 AM): Fetch 4 years of data [API] -> Cache (12h TTL)
Run 2 (10:30 AM): Cache valid -> [CACHE] instant
Run 3 (11:00 PM): Cache expired -> [API] Fetch 4 years AGAIN
```

### After (Persistent + Gap-Fill)

```
Run 1 (10:00 AM): Fetch 4 years of data [API] -> Cache (persistent)
Run 2 (10:30 AM): Cache complete -> [CACHE] instant
Run 3 (11:00 PM): Cache + gap-fill -> [GAP-FILL] Fetch 13 new bars only
```
## Output Messages

| Message | Meaning |
|---|---|
| `[CACHE] AAPL: 35,040 bars (complete)` | Cache covers full range, no API call |
| `[GAP-FILL] AAPL: Fetching 2026-01-01 to 2026-01-01...` | Fetching only new bars |
| `[UPDATED] AAPL: 35,038 + 2 = 35,040 bars` | Merged new bars with cache |
| `[API] AAPL: Fetching 1460 days...` | No cache, full download |
## Cache Statistics

A new `gap_fills` counter was added:

```python
stats = fetcher.get_cache_stats()
# {
#     'cache_hits': 8,    # Returned cached data unchanged
#     'cache_misses': 2,  # No cache, full download
#     'gap_fills': 5,     # Merged new bars with cache
#     'hit_rate': 0.87,   # (hits + gap_fills) / total
# }
```
## Failed Attempts

| Approach | Result | Why It Failed |
|---|---|---|
| Increase TTL to 30 days | Worked but fragile | Still expires eventually; arbitrary cutoff |
| Check file modification time | Partial | Doesn't verify data completeness |
## Key Insights

- **Historical data is immutable** - past candles never change, so there is no reason to re-fetch them
- **Only the edge needs updating** - new bars appear only at the end of the series
- **Time-based expiry is the wrong model** - TTL makes sense for mutable data (news, weather); for historical OHLCV it only wastes API calls
- **Completeness beats freshness** - check whether the cache covers the requested date range, not how old the file is (see the sketch below)
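The completeness check from the last insight, as a standalone sketch (the one-hour `tolerance` default is an assumption; it absorbs the bar that has not yet closed at the right edge of the range):

```python
import pandas as pd

def cache_covers(df: pd.DataFrame,
                 start_dt: pd.Timestamp,
                 end_dt: pd.Timestamp,
                 tolerance: pd.Timedelta = pd.Timedelta(hours=1)) -> bool:
    """Return True if the cached frame spans [start_dt, end_dt].

    Freshness (file age) is irrelevant; only coverage matters.
    """
    if df is None or df.empty:
        return False
    return df.index.min() <= start_dt and df.index.max() >= end_dt - tolerance
```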
## Files Modified

`alpaca_trading/data/cache.py`:
- `get()`: `ttl_hours` is now `Optional[int] = None` (no expiry by default)

`alpaca_trading/data/fetcher.py`:
- Removed the `PERSISTED_TTL_HOURS` constant
- `_load_persisted()`: no TTL check
- `_save_persisted()`: uses a 10-year TTL (effectively infinite)

`alpaca_trading/data/caching_fetcher.py`:
- `DEFAULT_*_CACHE_EXPIRY_DAYS = None` (no expiry)
- `is_cache_valid()`: just checks that the file exists
- `get_bars()`: gap-filling logic added
- `get_cache_stats()`: added `gap_fills` counter
## Backward Compatibility

- Existing `.pkl` cache files work unchanged
- The `cache_expiry_days` parameter is still accepted but ignored
- Old caches are automatically upgraded (no migration needed)
## References

- Skill: `selection-data-caching` - original caching implementation (v2.5.1)
- Skill: `data-source-priority` - data fetching hierarchy
- `alpaca_trading/data/caching_fetcher.py`: gap-filling implementation