This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Rust sec-fetcher Application
Purpose and Scope
This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured CSV files. The application serves as the data collection and preprocessing layer in a larger dual-language system that combines Rust's performance for I/O operations with Python's machine learning capabilities.
This page covers the high-level architecture, module organization, and data flow patterns; detailed documentation for individual components lives on their own pages. The Python machine learning pipeline that consumes this data is documented in Python narrative_stack System.
Sources: src/lib.rs:1-11 Cargo.toml:1-45 src/main.rs:1-241
Application Architecture
The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format, and output structured CSV files for downstream consumption.
```mermaid
graph TB
    subgraph "src/lib.rs Module Organization"
        config["config\nConfigManager, AppConfig"]
        enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
        models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
        network["network\nSecClient, fetch_* functions\nThrottlePolicy, CachePolicy"]
        transformers["transformers\ndistill_us_gaap_fundamental_concepts"]
        parsers["parsers\nXML/JSON parsing utilities"]
        caches["caches\nCaches (internal)\nHTTP cache, preprocessor cache"]
        fs["fs\nFile system operations"]
        utils["utils\nVecExtensions, helpers"]
    end
    subgraph "External Dependencies"
        reqwest["reqwest\nHTTP client"]
        polars["polars\nDataFrame operations"]
        simd["simd-r-drive\nDrive-based cache storage"]
        tokio["tokio\nAsync runtime"]
        serde["serde\nSerialization"]
    end
    config --> caches
    network --> config
    network --> caches
    network --> models
    network --> enums
    network --> parsers
    transformers --> models
    transformers --> enums
    parsers --> models
    network --> reqwest
    network --> simd
    network --> tokio
    transformers --> polars
    models --> serde
    style config fill:#f9f9f9
    style network fill:#f9f9f9
    style transformers fill:#f9f9f9
    style caches fill:#f9f9f9
```
Module Structure
The application is organized into nine core modules as declared in src/lib.rs:1-10:
| Module | Purpose | Key Components |
|---|---|---|
| `config` | Configuration management and credential handling | `ConfigManager`, `AppConfig` |
| `enums` | Type-safe enumerations for domain concepts | `FundamentalConcept`, `Url`, `CacheNamespacePrefix`, `TickerOrigin` |
| `models` | Data structures representing SEC entities | `Ticker`, `CikSubmission`, `NportInvestment`, `AccessionNumber` |
| `network` | HTTP client and data fetching functions | `SecClient`, `fetch_company_tickers`, `fetch_us_gaap_fundamentals` |
| `transformers` | Data normalization and transformation | `distill_us_gaap_fundamental_concepts` |
| `parsers` | XML/JSON parsing utilities | Filing parsers, XML extractors |
| `caches` | Internal caching infrastructure | `Caches` (singleton), HTTP cache, preprocessor cache |
| `fs` | File system operations | Directory creation, path utilities |
| `utils` | Helper functions and extensions | `VecExtensions`, `invert_multivalue_indexmap` |
Sources: src/lib.rs:1-11 Cargo.toml:8-40
Data Flow Architecture
Request-Response Flow with Caching
The data flow follows a pipeline pattern:
- Request Initiation: application code calls functions like `fetch_us_gaap_fundamentals` or `fetch_nport_filing_by_ticker_symbol`
- Client Middleware: `SecClient` applies throttling and caching policies before making HTTP requests
- Cache Check: `CachePolicy` checks `simd-r-drive` storage for cached responses
- API Request: on a cache miss, the request is sent to the SEC EDGAR API with proper headers and rate limiting
- Parsing: raw XML/JSON responses are parsed into structured data models
- Transformation: data passes through normalization functions like `distill_us_gaap_fundamental_concepts`
- Output: transformed data is written to CSV files organized by ticker symbol or category
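The cache-check step of this pipeline can be sketched in isolation. This is a minimal illustration, not the actual `CachePolicy` implementation: `ResponseCache` and `get_or_fetch` are hypothetical names, and a plain `HashMap` stands in for the `simd-r-drive` backing store.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the cache layer; the real CachePolicy
/// persists responses in simd-r-drive storage rather than in memory.
struct ResponseCache {
    entries: HashMap<String, String>,
}

impl ResponseCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached body on a hit; otherwise invoke `fetch`,
    /// store the result, and return it (the cache-check and
    /// API-request steps of the pipeline above).
    fn get_or_fetch<F>(&mut self, url: &str, fetch: F) -> String
    where
        F: FnOnce(&str) -> String,
    {
        if let Some(body) = self.entries.get(url) {
            return body.clone(); // cache hit: no network round trip
        }
        let body = fetch(url); // cache miss: go to the SEC EDGAR API
        self.entries.insert(url.to_string(), body.clone());
        body
    }
}

fn main() {
    let mut cache = ResponseCache::new();
    let url = "https://www.sec.gov/files/company_tickers.json";
    // First call misses the cache and "fetches"; second call is served
    // from the cache, so its fetch closure is never invoked.
    let first = cache.get_or_fetch(url, |_| r#"{"0":{"ticker":"AAPL"}}"#.to_string());
    let second = cache.get_or_fetch(url, |_| unreachable!("served from cache"));
    assert_eq!(first, second);
}
```

The transparent-fallback behavior described under Error Handling fits naturally here: if the cache read fails, the code path simply degrades to the fetch branch.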
Sources: src/main.rs:174-240 src/lib.rs:1-11
Key Dependencies and Technology Stack
The application leverages modern Rust crates for performance and reliability:
Critical Dependencies
| Category | Crate | Version | Purpose |
|---|---|---|---|
| Async Runtime | tokio | 1.43.0 | Asynchronous I/O, task scheduling |
| HTTP Client | reqwest | 0.12.15 | HTTP requests with JSON support |
| Data Frames | polars | 0.46.0 | High-performance DataFrame operations |
| Caching | simd-r-drive | 0.3.0 | WebSocket-based key-value storage |
| Parallelism | rayon | 1.10.0 | Data parallelism for CPU-bound work |
| Configuration | config | 0.15.9 | TOML/JSON configuration file parsing |
| Credentials | keyring | 3.6.2 | Secure credential storage (OS keychain) |
| XML Parsing | quick-xml | 0.37.2 | Fast XML parsing for SEC filings |
| Serialization | serde | 1.0.218 | Data structure serialization/deserialization |
Sources: Cargo.toml:8-40
Application Entry Point and Main Loop
The application entry point in src/main.rs:174-240 demonstrates the typical processing pattern:
Main Processing Loop Structure
The main application follows this pattern:
- Configuration Loading: `ConfigManager::load()` reads configuration from TOML files and environment variables
- Client Initialization: `SecClient::from_config_manager()` creates the HTTP client with throttling and caching middleware
- Ticker Fetching: `fetch_company_tickers()` retrieves all company ticker symbols from the SEC
- Processing Loop: iterates over each ticker symbol:
  - Calls `fetch_us_gaap_fundamentals()` to retrieve financial data
  - Transforms data through `distill_us_gaap_fundamental_concepts`
  - Writes results to CSV files organized by ticker: `data/us-gaap/{ticker}.csv`
  - Logs errors to the `error_log` HashMap for later reporting
- Error Reporting: prints a summary of any errors encountered during processing
Example Main Function Flow
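The shape of that loop can be sketched with stub types. This is an illustration of the control flow only: `fetch_fundamentals` below is a hypothetical synchronous stand-in for the real async `fetch_us_gaap_fundamentals`, and the distill/CSV-write steps are elided.

```rust
use std::collections::HashMap;

/// Stand-in for fetch_us_gaap_fundamentals: returns rows of
/// (concept, value) or an error message. The real function makes
/// async HTTP calls through SecClient.
fn fetch_fundamentals(ticker: &str) -> Result<Vec<(String, f64)>, String> {
    match ticker {
        "BAD" => Err("no filings found".to_string()),
        _ => Ok(vec![("Revenues".to_string(), 1_000.0)]),
    }
}

fn main() {
    let tickers = ["AAPL", "BAD", "MSFT"];
    let mut error_log: HashMap<String, String> = HashMap::new();

    for ticker in tickers {
        match fetch_fundamentals(ticker) {
            // In the real loop the rows would be distilled and written
            // to data/us-gaap/{ticker}.csv; here we just count them.
            Ok(rows) => println!("{ticker}: {} rows", rows.len()),
            // A failure is recorded and the loop moves on to the next
            // ticker rather than aborting the batch.
            Err(e) => {
                error_log.insert(ticker.to_string(), e);
            }
        }
    }

    // Error reporting: summarize failures after the batch completes.
    for (ticker, err) in &error_log {
        eprintln!("error for {ticker}: {err}");
    }
    assert_eq!(error_log.len(), 1);
}
```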
Sources: src/main.rs:174-240
Module Interaction Patterns
Configuration and Client Initialization
The initialization sequence ensures:
- Configuration is loaded before any network operations
- Credentials are retrieved securely from the OS keychain via the `keyring` crate
- The `Caches` singleton is initialized with a drive storage connection
- HTTP client middleware is properly configured with throttling and caching
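The layered-configuration idea behind `ConfigManager::load()` can be shown with a stdlib-only sketch. The struct field, function name, and the `SEC_FETCHER_EMAIL` variable below are illustrative assumptions, not the application's actual config schema.

```rust
use std::env;

/// Minimal stand-in for AppConfig: just the contact email the SEC
/// requires in the User-Agent header. Field name is illustrative.
#[derive(Debug, PartialEq)]
struct AppConfig {
    email: String,
}

/// Layered lookup: an environment variable overrides the file value,
/// mirroring how configuration sources are merged before any network
/// operation runs.
fn load_config(file_value: Option<&str>) -> Result<AppConfig, String> {
    let email = env::var("SEC_FETCHER_EMAIL")
        .ok()
        .or_else(|| file_value.map(str::to_string))
        .ok_or("no email configured in file or environment")?;
    Ok(AppConfig { email })
}

fn main() {
    env::remove_var("SEC_FETCHER_EMAIL");
    // Falls back to the file value when no environment override exists.
    let cfg = load_config(Some("user@example.com")).unwrap();
    assert_eq!(cfg.email, "user@example.com");
    // Missing in both layers is a hard error, caught before the
    // client is ever constructed.
    assert!(load_config(None).is_err());
}
```

Failing fast here is what guarantees the first bullet above: configuration problems surface before any request is sent.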
Sources: src/lib.rs:1-11
Data Transformation Pipeline
The transformation pipeline is the most critical component (importance: 8.37 as noted in the high-level architecture). The distill_us_gaap_fundamental_concepts function handles:
- Synonym Consolidation: maps 6+ variations of "Cost of Revenue" to the single `CostOfRevenue` concept
- Industry-Specific Revenue: handles 57+ revenue field variations (`SalesRevenueNet`, `RevenueFromContractWithCustomer`, etc.)
- Hierarchical Mapping: maps specific concepts to parent categories (e.g., `CurrentAssets` also maps to `Assets`)
- Unit Normalization: standardizes currency, shares, and percentage units
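The consolidation and hierarchical-mapping behaviors can be sketched as a match over tag names. This is a tiny illustrative subset, not the real `distill_us_gaap_fundamental_concepts` function or the full `FundamentalConcept` enum; the specific tag-to-concept pairs are assumptions.

```rust
/// Illustrative subset of canonical fundamental concepts that many
/// filing-specific US GAAP tag names collapse into.
#[derive(Debug, PartialEq)]
enum FundamentalConcept {
    CostOfRevenue,
    Revenues,
    Assets,
    CurrentAssets,
}

/// Sketch of the distillation step: several tag variants map to one
/// canonical concept, and a specific concept such as AssetsCurrent can
/// map to more than one output (hierarchical mapping), hence the Vec.
fn distill_concept(tag: &str) -> Vec<FundamentalConcept> {
    use FundamentalConcept::*;
    match tag {
        // Synonym consolidation: variants of "Cost of Revenue"
        "CostOfRevenue" | "CostOfGoodsAndServicesSold" | "CostOfGoodsSold" => {
            vec![CostOfRevenue]
        }
        // Industry-specific revenue variants collapse into Revenues
        "Revenues" | "SalesRevenueNet"
        | "RevenueFromContractWithCustomerExcludingAssessedTax" => vec![Revenues],
        // Hierarchical mapping: current assets also roll up into Assets
        "AssetsCurrent" => vec![CurrentAssets, Assets],
        "Assets" => vec![Assets],
        // Unrecognized tags pass through unmapped
        _ => vec![],
    }
}

fn main() {
    assert_eq!(distill_concept("SalesRevenueNet"), vec![FundamentalConcept::Revenues]);
    assert_eq!(
        distill_concept("AssetsCurrent"),
        vec![FundamentalConcept::CurrentAssets, FundamentalConcept::Assets]
    );
}
```

Returning a `Vec` rather than a single concept is what lets one reported fact contribute to both its specific category and its parent aggregate.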
Sources: Cargo.toml:24, high-level architecture diagram
Error Handling Strategy
The application uses a multi-layered error handling approach:
| Layer | Strategy | Example |
|---|---|---|
| Network | Retry with exponential backoff | ThrottlePolicy with max_retries |
| Cache | Fallback to API on cache errors | CachePolicy transparent fallback |
| Parsing | Log and continue processing | Error log in main loop |
| File I/O | Log error, continue to next item | Individual CSV write failures don't stop processing |
The main loop accumulates errors in a HashMap<String, String> and reports them at the end, ensuring that one failure doesn't halt the entire batch processing job.
Sources: src/main.rs:185-240
Performance Characteristics
The application is optimized for I/O-bound workloads:
- Async I/O: the `tokio` runtime enables concurrent network requests
- CPU Parallelism: `rayon` parallelizes DataFrame operations in `polars`
- Caching: `simd-r-drive` reduces redundant API calls with a 1-week TTL
- Throttling: respects SEC rate limits with adaptive jitter
- Streaming: processes large datasets without loading them entirely into memory
The combination of async networking for I/O operations and Rayon parallelism for CPU-bound transformations provides optimal throughput for SEC data fetching and processing.
Sources: Cargo.toml:8-40 high-level architecture diagrams