This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Caching & Storage System

Purpose and Scope

This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive. For information about network request throttling and retry logic, see Network Layer & SecClient. For database storage used by the Python components, see Database & Storage Integration.

Overview

The caching system provides two distinct cache layers:

  1. HTTP Cache: Stores raw HTTP responses from SEC EDGAR API requests
  2. Preprocessor Cache: Stores transformed and processed data structures

Both caches use the simd-r-drive key-value storage backend with persistent file-based storage, initialized once at application startup using the OnceLock pattern for thread-safe static access.

Sources: src/caches.rs:1-66


Caching Architecture

The following diagram illustrates the complete caching architecture and how it integrates with the network layer:

Cache Initialization Flow

graph TB
    subgraph "Application Initialization"
        Main["main.rs"]
        ConfigMgr["ConfigManager"]
        CachesInit["Caches::init(config_manager)"]
    end

    subgraph "Static Cache Storage (OnceLock)"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nOnceLock<Arc<DataStore>>"]
        PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\nOnceLock<Arc<DataStore>>"]
    end

    subgraph "File System"
        HTTPFile["cache_base_dir/http_storage_cache.bin"]
        PreFile["cache_base_dir/preprocessor_cache.bin"]
    end

    subgraph "Network Layer"
        SecClient["SecClient"]
        CacheMiddleware["reqwest_drive\nCache Middleware"]
        ThrottleMiddleware["Throttle Middleware"]
        CachePolicy["CachePolicy\nTTL: 1 week\nrespect_headers: false"]
    end

    subgraph "Cache Access"
        GetHTTP["Caches::get_http_cache_store()"]
        GetPre["Caches::get_preprocessor_cache()"]
    end

    Main --> ConfigMgr
    ConfigMgr --> CachesInit
    CachesInit --> HTTPCache
    CachesInit --> PreCache

    HTTPCache -.->|DataStore::open| HTTPFile
    PreCache -.->|DataStore::open| PreFile

    SecClient --> GetHTTP
    GetHTTP --> HTTPCache
    HTTPCache --> CacheMiddleware
    CacheMiddleware --> CachePolicy
    CacheMiddleware --> ThrottleMiddleware

    GetPre --> PreCache

The caching system is initialized early in the application lifecycle using a configuration-driven approach that ensures thread-safe access across the async runtime.

Sources: src/caches.rs:1-66 src/network/sec_client.rs:1-181


Cache Initialization System

OnceLock Pattern

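The system uses Rust's OnceLock for one-time, thread-safe initialization of the two static cache instances, SIMD_R_DRIVE_HTTP_CACHE and SIMD_R_DRIVE_PREPROCESSOR_CACHE, each typed as OnceLock<Arc<DataStore>>.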

This pattern ensures:

  • Single initialization: Each cache is initialized exactly once
  • Thread safety: Safe access from multiple async tasks
  • Low overhead after init: Reads after initialization take no lock, only an atomic check
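The set-once, read-many behavior can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the DataStore struct is a stand-in for the simd-r-drive type, and the helper functions install_http_cache and http_cache are hypothetical names introduced here.

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the real storage type (hypothetical; not the simd-r-drive API).
#[derive(Debug)]
pub struct DataStore {
    pub label: &'static str,
}

/// Process-wide slot that starts empty and can be filled exactly once.
static SIMD_R_DRIVE_HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();

/// Tries to install the store. Returns `true` only for the call that wins;
/// later calls leave the first value in place.
pub fn install_http_cache(store: DataStore) -> bool {
    SIMD_R_DRIVE_HTTP_CACHE.set(Arc::new(store)).is_ok()
}

/// Returns a cheap clone of the shared handle, or `None` before initialization.
pub fn http_cache() -> Option<Arc<DataStore>> {
    SIMD_R_DRIVE_HTTP_CACHE.get().cloned()
}
```

Wrapping the store in Arc lets every caller hold its own handle to the same instance, so handing the cache to middleware does not require a borrow of the static.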

Sources: src/caches.rs:7-8

Initialization Process

Initialization Code Flow

The Caches::init() method performs the following steps:

  1. Retrieves cache_base_dir from ConfigManager
  2. Constructs paths for both cache files
  3. Opens each DataStore instance
  4. Sets the OnceLock with Arc-wrapped stores
  5. Logs warnings if already initialized (prevents re-initialization panics)
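The steps above can be sketched roughly as follows, assuming a stand-in DataStore whose open only creates the backing file. The function init_caches, its return value, and the shortened static names are hypothetical simplifications; the real Caches::init() takes a ConfigManager and opens simd-r-drive stores.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Arc, OnceLock};

/// Stand-in for the storage type; it only remembers its backing file.
pub struct DataStore {
    pub path: PathBuf,
}

impl DataStore {
    /// Hypothetical `open`: creates the file if missing, as a placeholder for real I/O.
    pub fn open(path: &Path) -> io::Result<Self> {
        OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { path: path.to_path_buf() })
    }
}

static HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
static PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();

/// Opens both stores under `cache_base_dir` and installs them once.
/// Returns how many slots were already set (each one is reported as a warning).
pub fn init_caches(cache_base_dir: &Path) -> usize {
    let mut already_set = 0;
    for (slot, file_name) in [
        (&HTTP_CACHE, "http_storage_cache.bin"),
        (&PREPROCESSOR_CACHE, "preprocessor_cache.bin"),
    ] {
        let path = cache_base_dir.join(file_name);
        // The document lists a failed open as a panic condition, so the sketch panics too.
        let store = DataStore::open(&path)
            .unwrap_or_else(|e| panic!("failed to open {}: {}", path.display(), e));
        if slot.set(Arc::new(store)).is_err() {
            eprintln!("warning: cache already initialized: {}", file_name);
            already_set += 1;
        }
    }
    already_set
}
```

Because a second call only logs a warning and keeps the first stores, calling the initializer twice is harmless in this sketch.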

Sources: src/caches.rs:13-48


HTTP Cache Integration

graph LR
    subgraph "SecClient Initialization"
        FromConfig["SecClient::from_config_manager()"]
        GetHTTPCache["Caches::get_http_cache_store()"]
        CreatePolicy["CachePolicy::new()"]
        CreateThrottle["ThrottlePolicy::new()"]
    end

    subgraph "Middleware Chain"
        InitDrive["init_cache_with_drive_and_throttle()"]
        DriveCache["drive_cache middleware"]
        ThrottleCache["throttle_cache middleware"]
        InitClient["init_client_with_cache_and_throttle()"]
    end

    subgraph "HTTP Client"
        ClientWithMiddleware["ClientWithMiddleware"]
    end

    FromConfig --> GetHTTPCache
    FromConfig --> CreatePolicy
    FromConfig --> CreateThrottle

    GetHTTPCache --> InitDrive
    CreatePolicy --> InitDrive
    CreateThrottle --> InitDrive

    InitDrive --> DriveCache
    InitDrive --> ThrottleCache

    DriveCache --> InitClient
    ThrottleCache --> InitClient

    InitClient --> ClientWithMiddleware

SecClient Cache Setup

The SecClient integrates with the HTTP cache through the reqwest_drive middleware system.

SecClient Construction

The SecClient struct maintains references to both cache and throttle policies:

| Field | Type | Purpose |
|---|---|---|
| email | String | User-Agent identification |
| http_client | ClientWithMiddleware | HTTP client with middleware chain |
| cache_policy | Arc<CachePolicy> | Cache configuration settings |
| throttle_policy | Arc<ThrottlePolicy> | Request throttling configuration |

Sources: src/network/sec_client.rs:14-19 src/network/sec_client.rs:73-81

CachePolicy Configuration

The CachePolicy struct defines cache behavior:

| Parameter | Value | Description |
|---|---|---|
| default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week time-to-live |
| respect_headers | false | Ignore HTTP cache headers from SEC API |
| cache_status_override | None | No custom status code handling |

The 1-week TTL is currently hardcoded but marked for future configurability.
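As a small worked example of the values in the table, the sketch below mirrors the three settings in a local struct. CachePolicySketch, ONE_WEEK, and sec_cache_policy are hypothetical names, and the Option<u16> type for cache_status_override is an assumption; this is not the reqwest_drive CachePolicy type itself.

```rust
use std::time::Duration;

/// Local mirror of the three settings in the table (not the reqwest_drive type).
#[derive(Debug, Clone, PartialEq)]
pub struct CachePolicySketch {
    pub default_ttl: Duration,
    pub respect_headers: bool,
    /// Assumed type: an optional HTTP status code.
    pub cache_status_override: Option<u16>,
}

/// One week, written the same way as in the table: 60 s * 60 min * 24 h * 7 d.
pub const ONE_WEEK: Duration = Duration::from_secs(60 * 60 * 24 * 7);

/// Builds the policy values described above.
pub fn sec_cache_policy() -> CachePolicySketch {
    CachePolicySketch {
        default_ttl: ONE_WEEK,
        respect_headers: false,
        cache_status_override: None,
    }
}
```

The product works out to 604,800 seconds. Keeping the TTL in a single named constant would also make the planned move to a configurable value a one-line change.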

Sources: src/network/sec_client.rs:45-50


Cache Access Patterns

HTTP Cache Access Flow

Request Processing

When SecClient makes a request:

  1. The fetch_json() method calls raw_request() with the target URL
  2. The middleware chain intercepts the request
  3. Cache middleware checks the DataStore for a matching entry
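  4. If a matching entry is found and has not expired (TTL check), the cached response is returned
  5. If no entry is found, or the entry has expired, the request is forwarded to the throttle middleware and HTTP client
  6. The response is stored in the cache with a timestamp before being returned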
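The lookup logic in steps 3 to 6 can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: TtlCache and get_or_fetch are hypothetical names, entries are keyed by URL in a HashMap rather than persisted in a DataStore, and the fetch closure stands in for the throttle middleware and HTTP client.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A cached body plus the moment it was stored.
struct Entry {
    body: Vec<u8>,
    stored_at: Instant,
}

/// In-memory stand-in for the cache middleware's lookup logic.
pub struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, Entry>,
}

impl TtlCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached body when it is still fresh; otherwise calls `fetch`,
    /// stores the result with the current timestamp, and returns it.
    pub fn get_or_fetch<F>(&mut self, url: &str, now: Instant, fetch: F) -> Vec<u8>
    where
        F: FnOnce() -> Vec<u8>,
    {
        if let Some(entry) = self.entries.get(url) {
            if now.duration_since(entry.stored_at) < self.ttl {
                return entry.body.clone(); // cache hit within the TTL
            }
        }
        // Miss or expired entry: go to the (stand-in) network path.
        let body = fetch();
        self.entries.insert(
            url.to_string(),
            Entry { body: body.clone(), stored_at: now },
        );
        body
    }
}
```

Passing the current time in as an argument keeps the expiry check deterministic, which makes the hit, miss, and expired paths easy to exercise without sleeping.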

Sources: src/network/sec_client.rs:140-179 tests/sec_client_tests.rs:35-62

Preprocessor Cache Access

The preprocessor cache is accessed directly through the Caches module API, via Caches::get_preprocessor_cache().

This cache is intended for storing transformed data structures after processing. In the current codebase it mainly serves as infrastructure for future preprocessing optimizations.

Sources: src/caches.rs:59-64


Storage Backend: simd-r-drive

DataStore Integration

The simd-r-drive crate provides the persistent key-value storage backend.

DataStore Characteristics

| Feature | Description |
|---|---|
| Persistence | All data survives application restarts |
| Binary Format | Stores arbitrary byte arrays |
| Thread-Safe | Safe for concurrent access from async tasks |
| File-Backed | Single file per cache instance |
| No Network | Local-only storage (unlike WebSocket mode) |

Sources: src/caches.rs:3 src/caches.rs:22-37


Cache Configuration

ConfigManager Integration

The caching system reads configuration from ConfigManager:

| Config Field | Purpose | Default |
|---|---|---|
| cache_base_dir | Parent directory for cache files | Required (no default) |

The cache base directory path is constructed at runtime and must be set before initialization:

Cache File Paths

  • HTTP Cache: {cache_base_dir}/http_storage_cache.bin
  • Preprocessor Cache: {cache_base_dir}/preprocessor_cache.bin

Sources: src/caches.rs:16 src/caches.rs:20 src/caches.rs:34

ThrottlePolicy Configuration

While not part of the cache storage itself, the ThrottlePolicy is closely integrated with the cache middleware:

| Parameter | Config Source | Purpose |
|---|---|---|
| base_delay_ms | config.min_delay_ms | Minimum delay between requests |
| max_concurrent | config.max_concurrent | Maximum parallel requests |
| max_retries | config.max_retries | Retry attempts for failed requests |
| adaptive_jitter_ms | Hardcoded: 500 | Random jitter for backoff |
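To make the roles of these parameters concrete, here is one common way a base delay and a jitter bound combine into a retry delay: exponential backoff plus jitter. This scheme is an assumption for illustration only; the document does not state how reqwest_drive computes its delays. ThrottlePolicySketch, retry_delay, the field types, and the jitter_fraction argument are all hypothetical.

```rust
use std::time::Duration;

/// Local mirror of the table's parameters (not the reqwest_drive type).
/// Field types are assumptions.
#[derive(Debug, Clone)]
pub struct ThrottlePolicySketch {
    pub base_delay_ms: u64,
    pub max_concurrent: usize,
    pub max_retries: usize,
    pub adaptive_jitter_ms: u64,
}

/// Assumed scheme: double the base delay per retry attempt, then add jitter.
/// `jitter_fraction` in [0.0, 1.0] stands in for a random sample so the
/// function stays deterministic and testable.
pub fn retry_delay(policy: &ThrottlePolicySketch, attempt: u32, jitter_fraction: f64) -> Duration {
    // Cap the shift so the multiplication cannot overflow the shift width.
    let backoff = policy.base_delay_ms.saturating_mul(1u64 << attempt.min(16));
    let jitter = (policy.adaptive_jitter_ms as f64 * jitter_fraction.clamp(0.0, 1.0)) as u64;
    Duration::from_millis(backoff.saturating_add(jitter))
}
```

With a base delay of 100 ms and the hardcoded 500 ms jitter bound, the third attempt (attempt index 2) with a mid-range jitter sample of 0.5 waits 400 ms + 250 ms = 650 ms under this scheme.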

Sources: src/network/sec_client.rs:53-59


Error Handling

Initialization Errors

The cache initialization handles several error scenarios:

Panic Conditions:

  • DataStore::open() fails (I/O error, permission denied, etc.)
  • Calling get_http_cache_store() before Caches::init()
  • Calling get_preprocessor_cache() before Caches::init()

Warnings:

  • Attempting to reinitialize already-set OnceLock (logs warning, doesn't fail)

Sources: src/caches.rs:25-29 src/caches.rs:51-55

Runtime Access Errors

Both cache accessor methods, get_http_cache_store() and get_preprocessor_cache(), use expect() with descriptive messages.

These panics indicate programmer errors (accessing cache before initialization) rather than runtime failures.
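A minimal sketch of such an accessor, assuming a stand-in DataStore and a simplified init() that takes no configuration. The panic message text is illustrative, not the one used in src/caches.rs.

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the storage type.
pub struct DataStore;

static PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();

pub struct Caches;

impl Caches {
    /// Simplified init for this sketch: installs an empty stand-in store.
    /// A repeated call is ignored and the first store is kept.
    pub fn init() {
        let _ = PREPROCESSOR_CACHE.set(Arc::new(DataStore));
    }

    /// Panics with a descriptive message when called before `init`.
    pub fn get_preprocessor_cache() -> Arc<DataStore> {
        PREPROCESSOR_CACHE
            .get()
            .expect("preprocessor cache accessed before Caches::init()")
            .clone()
    }
}
```

After init(), every call returns a handle to the same shared store, so callers never need to handle a missing cache at the call site.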

Sources: src/caches.rs:51-56 src/caches.rs:59-64


Testing

Cache Testing Strategy

The test suite verifies caching behavior indirectly through SecClient tests:

Test Coverage:

| Test | Purpose | File |
|---|---|---|
| test_fetch_json_without_retry_success | Verifies basic fetch with caching enabled | tests/sec_client_tests.rs:36-62 |
| test_fetch_json_with_retry_success | Tests cache behavior with successful retry | tests/sec_client_tests.rs:65-91 |
| test_fetch_json_with_retry_failure | Validates cache doesn't store failed responses | tests/sec_client_tests.rs:94-120 |
| test_fetch_json_with_retry_backoff | Tests cache with retry backoff logic | tests/sec_client_tests.rs:123-158 |

Mock Server Testing:

Tests use mockito::Server to simulate SEC API responses without requiring cache setup, as the test environment creates ephemeral cache instances per test run.

Sources: tests/sec_client_tests.rs:1-159


Usage Patterns

Typical Initialization Flow

Future Extensions

The preprocessor cache infrastructure is in place but not yet fully utilized. Potential use cases include:

  • Caching parsed/transformed US GAAP concepts
  • Storing intermediate data structures from distill_us_gaap_fundamental_concepts()
  • Caching resolved CIK lookups from ticker symbols
  • Storing compiled ticker-to-CIK mappings

Sources: src/caches.rs:59-64


Integration with Other Systems

The caching system integrates with:

  • ConfigManager (#2.1): Provides cache directory configuration
  • SecClient (#3.1): Primary consumer of HTTP cache
  • Network Functions (#3.2): All fetch operations benefit from caching
  • simd-r-drive: External storage backend (see #6 for dependency details)

The cache layer operates transparently below the network API, requiring no changes to calling code when caching is enabled or disabled.

Sources: src/caches.rs:1-66 src/network/sec_client.rs:73-81