This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Caching & Storage System
Purpose and Scope
This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive. For information about network request throttling and retry logic, see Network Layer & SecClient. For database storage used by the Python components, see Database & Storage Integration.
Overview
The caching system provides two distinct cache layers:
- HTTP Cache : Stores raw HTTP responses from SEC EDGAR API requests
- Preprocessor Cache : Stores transformed and processed data structures
Both caches use the simd-r-drive key-value storage backend with persistent file-based storage, initialized once at application startup using the OnceLock pattern for thread-safe static access.
Sources: src/caches.rs:1-66
Caching Architecture
The following diagram illustrates the complete caching architecture and how it integrates with the network layer:
Cache Initialization Flow
graph TB
subgraph "Application Initialization"
Main["main.rs"]
ConfigMgr["ConfigManager"]
CachesInit["Caches::init(config_manager)"]
end
subgraph "Static Cache Storage (OnceLock)"
HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nOnceLock<Arc<DataStore>>"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\nOnceLock<Arc<DataStore>>"]
end
subgraph "File System"
HTTPFile["cache_base_dir/http_storage_cache.bin"]
PreFile["cache_base_dir/preprocessor_cache.bin"]
end
subgraph "Network Layer"
SecClient["SecClient"]
CacheMiddleware["reqwest_drive\nCache Middleware"]
ThrottleMiddleware["Throttle Middleware"]
CachePolicy["CachePolicy\nTTL: 1 week\nrespect_headers: false"]
end
subgraph "Cache Access"
GetHTTP["Caches::get_http_cache_store()"]
GetPre["Caches::get_preprocessor_cache()"]
end
Main --> ConfigMgr
ConfigMgr --> CachesInit
CachesInit --> HTTPCache
CachesInit --> PreCache
HTTPCache -.->|DataStore::open| HTTPFile
PreCache -.->|DataStore::open| PreFile
SecClient --> GetHTTP
GetHTTP --> HTTPCache
HTTPCache --> CacheMiddleware
CacheMiddleware --> CachePolicy
CacheMiddleware --> ThrottleMiddleware
GetPre --> PreCache
The caching system is initialized early in the application lifecycle using a configuration-driven approach that ensures thread-safe access across the async runtime.
Sources: src/caches.rs:1-66 src/network/sec_client.rs:1-181
Cache Initialization System
OnceLock Pattern
The system stores each cache in a static OnceLock, which starts empty and is set once at runtime by Caches::init(), giving thread-safe one-time initialization of the static cache instances.
This pattern ensures:
- Single initialization : Each cache is initialized exactly once
- Thread safety : Safe access from multiple async tasks
- Low overhead after init : Reads take no lock once the value is set; OnceLock::get() only checks an atomic initialization flag
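The properties listed above can be illustrated with a minimal, standard-library-only sketch. DataStore here is a stand-in struct, not the simd-r-drive type, and HTTP_CACHE, init_cache, cache, and read_from_threads are hypothetical names chosen for illustration; only the OnceLock<Arc<DataStore>> shape is taken from the diagram earlier on this page.

```rust
use std::sync::{Arc, OnceLock};
use std::thread;

/// Stand-in for the real storage type (assumption: not the simd-r-drive API).
#[derive(Debug)]
struct DataStore {
    name: String,
}

/// Static slot that starts empty and can be filled exactly once.
static HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();

/// Returns `true` if this call performed the initialization,
/// `false` if the slot was already set (the existing value is kept).
fn init_cache(name: &str) -> bool {
    let store = Arc::new(DataStore { name: name.to_string() });
    HTTP_CACHE.set(store).is_ok()
}

/// Hands out a shared handle; this clones the `Arc`, never the store itself.
fn cache() -> Option<Arc<DataStore>> {
    HTTP_CACHE.get().cloned()
}

/// Reads the store name from several threads without any explicit locking.
fn read_from_threads(n: usize) -> Vec<String> {
    let handles: Vec<_> = (0..n)
        .map(|_| thread::spawn(|| cache().map(|s| s.name.clone()).unwrap_or_default()))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

A second call to init_cache leaves the first store in place, which is the behavior the single-initialization guarantee relies on.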
Sources: src/caches.rs:7-8
Initialization Process
Initialization Code Flow
The Caches::init() method performs the following steps:
- Retrieves cache_base_dir from ConfigManager
- Constructs paths for both cache files
- Opens each DataStore instance
- Sets the OnceLock with Arc-wrapped stores
- Logs warnings if already initialized (prevents re-initialization panics)
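The steps above can be sketched as follows. This is not the code in src/caches.rs: DataStore is a stand-in whose hypothetical open only records the path, init_slot is an invented helper, and init returns its warnings as strings instead of calling a logger so that the sketch needs only the standard library. The two file names are the ones shown in the architecture diagram.

```rust
use std::path::{Path, PathBuf};
use std::sync::{Arc, OnceLock};

/// Stand-in store (assumption); the real type comes from simd-r-drive.
struct DataStore {
    path: PathBuf,
}

impl DataStore {
    /// Hypothetical opener that only records the path; no file I/O happens here.
    fn open(path: &Path) -> std::io::Result<Self> {
        Ok(Self { path: path.to_path_buf() })
    }
}

static HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
static PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();

/// Opens one store and places it in `slot`.
/// Panics if opening fails; returns a warning if the slot was already set.
fn init_slot(slot: &OnceLock<Arc<DataStore>>, path: &Path, label: &str) -> Option<String> {
    let store = DataStore::open(path)
        .unwrap_or_else(|e| panic!("failed to open {} cache: {}", label, e));
    match slot.set(Arc::new(store)) {
        Ok(()) => None,
        Err(_) => Some(format!("{} cache already initialized; keeping existing store", label)),
    }
}

/// Mirrors the documented steps: build paths, open stores, set the locks, collect warnings.
fn init(cache_base_dir: &Path) -> Vec<String> {
    let http_path = cache_base_dir.join("http_storage_cache.bin");
    let pre_path = cache_base_dir.join("preprocessor_cache.bin");

    let mut warnings = Vec::new();
    warnings.extend(init_slot(&HTTP_CACHE, &http_path, "HTTP"));
    warnings.extend(init_slot(&PREPROCESSOR_CACHE, &pre_path, "preprocessor"));
    warnings
}
```

Calling init a second time produces one warning per cache and leaves the stores from the first call untouched.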
Sources: src/caches.rs:13-48
HTTP Cache Integration
graph LR
subgraph "SecClient Initialization"
FromConfig["SecClient::from_config_manager()"]
GetHTTPCache["Caches::get_http_cache_store()"]
CreatePolicy["CachePolicy::new()"]
CreateThrottle["ThrottlePolicy::new()"]
end
subgraph "Middleware Chain"
InitDrive["init_cache_with_drive_and_throttle()"]
DriveCache["drive_cache middleware"]
ThrottleCache["throttle_cache middleware"]
InitClient["init_client_with_cache_and_throttle()"]
end
subgraph "HTTP Client"
ClientWithMiddleware["ClientWithMiddleware"]
end
FromConfig --> GetHTTPCache
FromConfig --> CreatePolicy
FromConfig --> CreateThrottle
GetHTTPCache --> InitDrive
CreatePolicy --> InitDrive
CreateThrottle --> InitDrive
InitDrive --> DriveCache
InitDrive --> ThrottleCache
DriveCache --> InitClient
ThrottleCache --> InitClient
InitClient --> ClientWithMiddleware
SecClient Cache Setup
The SecClient integrates with the HTTP cache through the reqwest_drive middleware system.
SecClient Construction
The SecClient struct maintains references to both cache and throttle policies:
| Field | Type | Purpose |
|---|---|---|
| email | String | User-Agent identification |
| http_client | ClientWithMiddleware | HTTP client with middleware chain |
| cache_policy | Arc<CachePolicy> | Cache configuration settings |
| throttle_policy | Arc<ThrottlePolicy> | Request throttling configuration |
Sources: src/network/sec_client.rs:14-19 src/network/sec_client.rs:73-81
CachePolicy Configuration
The CachePolicy struct defines cache behavior:
| Parameter | Value | Description |
|---|---|---|
| default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week time-to-live |
| respect_headers | false | Ignore HTTP cache headers from SEC API |
| cache_status_override | None | No custom status code handling |
The 1-week TTL is currently hardcoded but marked for future configurability.
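The values in the table can be written out as a small sketch. CachePolicySketch is a local mirror of the three documented fields, not the reqwest_drive type, and the type of cache_status_override is assumed to be an optional status code purely for illustration; ONE_WEEK_SECS and sec_cache_policy are hypothetical names.

```rust
use std::time::Duration;

/// Local mirror of the documented fields (assumption: not the reqwest_drive struct).
#[derive(Debug, Clone, PartialEq)]
struct CachePolicySketch {
    default_ttl: Duration,
    respect_headers: bool,
    /// Assumed to be an optional status code; the real field type may differ.
    cache_status_override: Option<u16>,
}

/// 60 s * 60 min * 24 h * 7 days = 604_800 seconds.
const ONE_WEEK_SECS: u64 = 60 * 60 * 24 * 7;

/// Builds a policy with the values listed in the table.
fn sec_cache_policy() -> CachePolicySketch {
    CachePolicySketch {
        default_ttl: Duration::from_secs(ONE_WEEK_SECS),
        respect_headers: false,
        cache_status_override: None,
    }
}
```

Keeping the TTL in a named constant is one way to prepare for the configurability mentioned above, since the constant can later be replaced by a value read from configuration.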
Sources: src/network/sec_client.rs:45-50
Cache Access Patterns
HTTP Cache Access Flow
Request Processing
When SecClient makes a request:
- The fetch_json() method calls raw_request() with the target URL
- The middleware chain intercepts the request
- Cache middleware checks the DataStore for a matching entry
- If found and not expired (TTL check), returns cached response
- If not found or expired, forwards to throttle middleware and HTTP client
- Response is stored in cache with timestamp before returning
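The lookup-then-store logic in the steps above can be sketched with an in-memory map. This is an illustration only: TtlCache, Entry, get_fresh, and get_or_fetch are hypothetical names, a HashMap stands in for the persistent DataStore, the throttle middleware is not modeled, and treating an entry as expired once its age reaches the TTL is an assumption about the boundary case.

```rust
use std::collections::HashMap;

/// Cached body plus the time (in seconds) at which it was stored.
struct Entry {
    stored_at: u64,
    body: Vec<u8>,
}

/// In-memory stand-in for the persistent store (assumption for illustration).
struct TtlCache {
    ttl_secs: u64,
    entries: HashMap<String, Entry>,
}

impl TtlCache {
    fn new(ttl_secs: u64) -> Self {
        Self { ttl_secs, entries: HashMap::new() }
    }

    /// Returns the cached body only if it exists and is younger than the TTL.
    fn get_fresh(&self, url: &str, now: u64) -> Option<&[u8]> {
        self.entries
            .get(url)
            .filter(|e| now.saturating_sub(e.stored_at) < self.ttl_secs)
            .map(|e| e.body.as_slice())
    }

    /// Serves from cache when fresh; otherwise calls `fetch`, stores the
    /// result with the current timestamp, and returns it.
    /// The bool reports whether the response came from the cache.
    fn get_or_fetch<F>(&mut self, url: &str, now: u64, fetch: F) -> (Vec<u8>, bool)
    where
        F: FnOnce(&str) -> Vec<u8>,
    {
        if let Some(body) = self.get_fresh(url, now) {
            return (body.to_vec(), true);
        }
        let body = fetch(url);
        self.entries
            .insert(url.to_string(), Entry { stored_at: now, body: body.clone() });
        (body, false)
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real middleware would read the clock itself.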
Sources: src/network/sec_client.rs:140-179 tests/sec_client_tests.rs:35-62
Preprocessor Cache Access
The preprocessor cache is accessed directly through the Caches module API, using Caches::get_preprocessor_cache().
This cache is intended for storing transformed data structures after processing. The current codebase does not yet make substantial use of it; it exists mainly as infrastructure for future preprocessing optimization.
Sources: src/caches.rs:59-64
Storage Backend: simd-r-drive
DataStore Integration
The simd-r-drive crate provides the persistent key-value storage backend.
DataStore Characteristics
| Feature | Description |
|---|---|
| Persistence | All data survives application restarts |
| Binary Format | Stores arbitrary byte arrays |
| Thread-Safe | Safe for concurrent access from async tasks |
| File-Backed | Single file per cache instance |
| No Network | Local-only storage (unlike WebSocket mode) |
Sources: src/caches.rs:3 src/caches.rs:22-37
Cache Configuration
ConfigManager Integration
The caching system reads configuration from ConfigManager:
| Config Field | Purpose | Default |
|---|---|---|
| cache_base_dir | Parent directory for cache files | Required (no default) |
The cache base directory path is constructed at runtime and must be set before initialization:
Cache File Paths
- HTTP Cache: {cache_base_dir}/http_storage_cache.bin
- Preprocessor Cache: {cache_base_dir}/preprocessor_cache.bin
Sources: src/caches.rs:16 src/caches.rs:20 src/caches.rs:34
ThrottlePolicy Configuration
While not part of the cache storage itself, the ThrottlePolicy is closely integrated with the cache middleware:
| Parameter | Config Source | Purpose |
|---|---|---|
| base_delay_ms | config.min_delay_ms | Minimum delay between requests |
| max_concurrent | config.max_concurrent | Maximum parallel requests |
| max_retries | config.max_retries | Retry attempts for failed requests |
| adaptive_jitter_ms | Hardcoded: 500 | Random jitter for backoff |
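The mapping in the table above can be written as a small constructor. ThrottleConfig and ThrottlePolicySketch are local illustrative types, not the application's config struct or the reqwest_drive ThrottlePolicy, and the integer types of the fields are assumptions; only the field names, the config sources, and the hardcoded 500 ms jitter come from the table.

```rust
/// Hypothetical subset of the application config used for throttling.
struct ThrottleConfig {
    min_delay_ms: u64,
    max_concurrent: usize,
    max_retries: usize,
}

/// Local mirror of the documented parameters (assumption: not the reqwest_drive struct).
#[derive(Debug, PartialEq)]
struct ThrottlePolicySketch {
    base_delay_ms: u64,
    max_concurrent: usize,
    max_retries: usize,
    adaptive_jitter_ms: u64,
}

/// Jitter value that the table lists as hardcoded rather than configurable.
const ADAPTIVE_JITTER_MS: u64 = 500;

/// Copies the three configurable values and fills in the fixed jitter.
fn throttle_policy_from(config: &ThrottleConfig) -> ThrottlePolicySketch {
    ThrottlePolicySketch {
        base_delay_ms: config.min_delay_ms,
        max_concurrent: config.max_concurrent,
        max_retries: config.max_retries,
        adaptive_jitter_ms: ADAPTIVE_JITTER_MS,
    }
}
```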
Sources: src/network/sec_client.rs:53-59
Error Handling
Initialization Errors
The cache initialization handles several error scenarios:
Panic Conditions:
- DataStore::open() fails (I/O error, permission denied, etc.)
- Calling get_http_cache_store() before Caches::init()
- Calling get_preprocessor_cache() before Caches::init()
Warnings:
- Attempting to reinitialize already-set OnceLock (logs warning, doesn't fail)
Sources: src/caches.rs:25-29 src/caches.rs:51-55
Runtime Access Errors
Both cache accessor methods use expect() with descriptive messages.
These panics indicate programmer errors (accessing cache before initialization) rather than runtime failures.
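A minimal sketch of such an accessor, assuming a stand-in DataStore type and a hypothetical panic message (the wording in src/caches.rs may differ):

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the real storage type (assumption: not the simd-r-drive API).
struct DataStore;

static PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();

/// Returns a shared handle to the store.
/// Panics if the static slot has not been set yet, which signals a
/// programming error in initialization order rather than a runtime failure.
fn get_preprocessor_cache() -> Arc<DataStore> {
    let store = PREPROCESSOR_CACHE
        .get()
        .expect("preprocessor cache accessed before initialization");
    Arc::clone(store)
}
```

Because the accessor returns a clone of the Arc, every caller shares the same underlying store.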
Sources: src/caches.rs:51-56 src/caches.rs:59-64
Testing
Cache Testing Strategy
The test suite verifies caching behavior indirectly through SecClient tests:
Test Coverage:
| Test | Purpose | File |
|---|---|---|
| test_fetch_json_without_retry_success | Verifies basic fetch with caching enabled | tests/sec_client_tests.rs:36-62 |
| test_fetch_json_with_retry_success | Tests cache behavior with successful retry | tests/sec_client_tests.rs:65-91 |
| test_fetch_json_with_retry_failure | Validates cache doesn't store failed responses | tests/sec_client_tests.rs:94-120 |
| test_fetch_json_with_retry_backoff | Tests cache with retry backoff logic | tests/sec_client_tests.rs:123-158 |
Mock Server Testing:
Tests use mockito::Server to simulate SEC API responses without requiring cache setup, as the test environment creates ephemeral cache instances per test run.
Sources: tests/sec_client_tests.rs:1-159
Usage Patterns
Typical Initialization Flow
Future Extensions
The preprocessor cache infrastructure is in place but not yet fully utilized. Potential use cases include:
- Caching parsed/transformed US GAAP concepts
- Storing intermediate data structures from distill_us_gaap_fundamental_concepts()
- Caching resolved CIK lookups from ticker symbols
- Storing compiled ticker-to-CIK mappings
Sources: src/caches.rs:59-64
Integration with Other Systems
The caching system integrates with:
- ConfigManager (#2.1): Provides cache directory configuration
- SecClient (#3.1): Primary consumer of HTTP cache
- Network Functions (#3.2): All fetch operations benefit from caching
- simd-r-drive : External storage backend (see #6 for dependency details)
The cache layer operates transparently below the network API, requiring no changes to calling code when caching is enabled or disabled.
Sources: src/caches.rs:1-66 src/network/sec_client.rs:73-81