This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Dependencies & Technology Stack
This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system. For information about the configuration system and credential management, see Configuration System. For details on how these dependencies are used within specific components, see Rust sec-fetcher Application and Python narrative_stack System.
Overview
The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.
Sources: Cargo.toml:1-45 High-Level System Architecture diagrams
Rust Technology Stack
Core Direct Dependencies
The Rust application declares 29 direct dependencies in its manifest, each serving specific architectural roles:
Dependency Categories and Usage:
graph TB
subgraph "Async Runtime & Concurrency"
tokio["tokio 1.43.0\nFull async runtime"]
rayon["rayon 1.10.0\nData parallelism"]
dashmap["dashmap 6.1.0\nConcurrent hashmap"]
end
subgraph "HTTP & Network"
reqwest["reqwest 0.12.15\nHTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nDrive middleware"]
end
subgraph "Data Processing"
polars["polars 0.46.0\nDataFrame operations"]
csv_crate["csv 1.3.1\nCSV parsing"]
serde["serde 1.0.218\nSerialization"]
serde_json["serde_json 1.0.140\nJSON support"]
quick_xml["quick-xml 0.37.2\nXML parsing"]
end
subgraph "Storage & Caching"
simd_r_drive["simd-r-drive 0.3.0-alpha.1\nKey-value store"]
simd_r_drive_ext["simd-r-drive-extensions\n0.4.0-alpha.6"]
end
subgraph "Configuration & Validation"
config_crate["config 0.15.9\nConfig management"]
keyring["keyring 3.6.2\nCredential storage"]
email_address["email_address 0.2.9\nEmail validation"]
rust_decimal["rust_decimal 1.36.0\nDecimal numbers"]
chrono["chrono 0.4.40\nDate/time handling"]
end
subgraph "Development Tools"
mockito["mockito 1.7.0\nHTTP mocking"]
tempfile["tempfile 3.18.0\nTemp file creation"]
env_logger["env_logger 0.11.7\nLogging"]
end
| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling, timers |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing |
| Concurrency | rayon, dashmap, crossbeam | Parallel processing, concurrent data structures |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage |
| Configuration | config, keyring | TOML config loading, secure credential management |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files |
Sources: Cargo.toml:8-44
HTTP and Network Layer
The HTTP stack consists of multiple layers providing comprehensive request handling:
TLS Configuration: The system supports dual TLS backends for platform compatibility:
graph TB
SecClient["SecClient\n(src/sec_client.rs)"]
reqwest["reqwest 0.12.15\nHigh-level HTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nMiddleware framework"]
hyper["hyper 1.6.0\nHTTP/1.1 & HTTP/2"]
h2["h2 0.4.8\nHTTP/2 protocol"]
TLS_Layer["TLS Layer"]
rustls["rustls 0.23.21\nPure Rust TLS"]
native_tls["native-tls 0.2.14\nPlatform TLS"]
openssl["openssl 0.10.71\nOpenSSL bindings"]
hyper_util["hyper-util 0.1.10\nUtilities"]
hyper_rustls["hyper-rustls 0.27.5\nRustls connector"]
hyper_tls["hyper-tls 0.6.0\nNative TLS connector"]
SecClient --> reqwest
reqwest --> reqwest_drive
reqwest --> hyper
reqwest --> TLS_Layer
hyper --> h2
hyper --> hyper_util
TLS_Layer --> hyper_rustls
TLS_Layer --> hyper_tls
hyper_rustls --> rustls
hyper_tls --> native_tls
native_tls --> openssl
- rustls (0.23.21): Pure Rust implementation, no external dependencies, used by default
- native-tls (0.2.14): Platform-native TLS (OpenSSL on Linux, Security.framework on macOS, SChannel on Windows)
Sources: Cargo.lock:1280-1351 Cargo.toml:28
Data Processing Stack
Polars Features Enabled:
- json: JSON file reading/writing
- lazy: Lazy evaluation engine for query optimization
- pivot: Pivot table operations
Key Numeric Types:
- rust_decimal::Decimal: Exact decimal arithmetic for financial calculations, used in US GAAP processing
- chrono::NaiveDate, chrono::DateTime: Timestamp handling for filing dates and report periods
Sources: Cargo.toml:24 Cargo.lock:2170-2394
Concurrency and Parallelism
Concurrency Model:
- tokio : Handles I/O-bound operations (HTTP requests, file I/O)
- rayon : Handles CPU-bound operations (DataFrame transformations, parallel iterations)
- dashmap : Thread-safe caching without explicit locking
Key Usage Locations:
- tokio: src/main.rs - main async runtime
- rayon: Cargo.toml:14 - dashmap integration, parallel processing
- once_cell: src/caches.rs:7-8 - OnceLock<Arc<DataStore>> for global cache initialization
Sources: Cargo.toml:14-40 src/caches.rs:1-66 Cargo.lock:612-762
Storage and Caching System
Cache Architecture:
- Two separate DataStore instances for different caching layers
- Thread-safe access via Arc<DataStore> wrapped in OnceLock statics
- Initialized once at application startup via Caches::init()
- 1-week TTL for HTTP responses, indefinite for preprocessor results
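The two retention policies can be illustrated with a small sketch. It is written in Python for brevity and is not taken from src/caches.rs; `CachePolicy`, `is_fresh`, and the two constants are hypothetical names, and the TTL values are simply the ones listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class CachePolicy:
    """Hypothetical retention policy: ttl=None means entries never expire."""
    ttl: Optional[timedelta]


# Assumed values, copied from the cache architecture list.
HTTP_CACHE_POLICY = CachePolicy(ttl=timedelta(weeks=1))
PREPROCESSOR_CACHE_POLICY = CachePolicy(ttl=None)


def is_fresh(stored_at: datetime, now: datetime, policy: CachePolicy) -> bool:
    """Return True if an entry written at `stored_at` is still usable at `now`."""
    if policy.ttl is None:
        return True
    return now - stored_at < policy.ttl


stored = datetime(2025, 1, 1, tzinfo=timezone.utc)
eight_days_later = stored + timedelta(days=8)

print(is_fresh(stored, eight_days_later, HTTP_CACHE_POLICY))          # False
print(is_fresh(stored, eight_days_later, PREPROCESSOR_CACHE_POLICY))  # True
```

Keeping the policy separate from the store means both caches can share one lookup path and differ only in the `ttl` they pass in.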
Storage Format:
- Binary .bin files for efficient serialization
- Uses bincode 1.3.3 for encoding cached data
- Persistent across application restarts
Sources: src/caches.rs:7-65 Cargo.toml:36-37 Cargo.lock:246-252
Configuration and Credentials
Configuration Sources (Priority Order):
- Environment variables
- config.toml in current directory
- ~/.config/sec-fetcher/config.toml
- Default values
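The priority order above amounts to a "first layer that defines the key wins" lookup. The following is a minimal sketch of that idea using Python's standard `collections.ChainMap`; it does not reproduce the Rust `config` crate, and every key and value shown is made up for illustration.

```python
from collections import ChainMap

# Layers in descending priority. All keys and values are hypothetical.
env_vars = {"email": "env@example.com"}
local_config = {"email": "local@example.com", "max_concurrent": 4}
user_config = {"max_concurrent": 2, "cache_dir": "/tmp/cache"}
defaults = {"email": None, "max_concurrent": 1, "cache_dir": "data", "min_delay_ms": 1000}

# ChainMap searches its mappings left to right and returns the first match.
settings = ChainMap(env_vars, local_config, user_config, defaults)

print(settings["email"])           # env@example.com
print(settings["max_concurrent"])  # 4
print(settings["cache_dir"])       # /tmp/cache
print(settings["min_delay_ms"])    # 1000
```

A key that only appears in the defaults still resolves, which is why the default layer sits last rather than being merged in up front.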
Keyring Features:
- apple-native: Direct Security.framework integration on macOS
- windows-native: Windows Credential Manager via Win32 APIs
- sync-secret-service: D-Bus Secret Service for Linux
Sources: Cargo.toml:12-20 Cargo.lock:1641-1653
Validation and Type Safety
| Crate | Version | Purpose | Key Types |
|---|---|---|---|
| email_address | 0.2.9 | SEC email validation | EmailAddress |
| rust_decimal | 1.36.0 | Financial calculations | Decimal |
| chrono | 0.4.40 | Date/time operations | NaiveDate, DateTime<Utc> |
| chrono-tz | 0.10.1 | Timezone handling | Tz::America__New_York |
| strum / strum_macros | 0.27.1 | Enum utilities | EnumString, Display |
Decimal Precision:
- 28-29 significant digits
- No floating-point rounding errors
- Critical for US GAAP fundamental values
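To see why exact decimal arithmetic matters for reported financial values, the sketch below contrasts binary floating point with a decimal type. It uses Python's standard `decimal.Decimal` as a stand-in; `rust_decimal::Decimal` is a different implementation, so this only illustrates the class of error being avoided.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum = 0.1 + 0.2

# Decimal values built from strings keep the exact base-10 digits.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))  # True
```

Constructing decimals from strings rather than floats matters here: `Decimal(0.1)` would inherit the float's representation error.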
Date Handling:
- All SEC filing dates parsed to chrono::NaiveDate
- UTC timestamps for API requests
- Eastern Time conversion for market hours
Sources: Cargo.toml:16-39 Cargo.lock:403-868
Python Technology Stack
Machine Learning Framework
Training Configuration:
- Patience : 20 epochs for early stopping
- Checkpoint Strategy : Save top 1 model by validation loss
- Learning Rate : Configurable via checkpoint override
- Gradient Clipping : Prevents exploding gradients
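The interaction between patience-based early stopping and "save top 1" checkpointing can be sketched in plain Python. This is an illustration of the general technique, not the project's training loop or PyTorch Lightning's callbacks; `run_with_early_stopping` is a hypothetical helper and the loss values are invented.

```python
from typing import Iterable, Optional, Tuple


def run_with_early_stopping(
    val_losses: Iterable[float], patience: int = 20
) -> Tuple[Optional[int], float, int]:
    """Return (best_epoch, best_loss, epochs_run) for a stream of validation losses."""
    best_epoch: Optional[int] = None
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # "Save top 1": only the single best checkpoint is kept.
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, epochs_run


losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(run_with_early_stopping(losses, patience=3))  # (1, 0.8, 5)
print(run_with_early_stopping(losses))              # (5, 0.7, 6)
```

With a patience of 3 the run stops before reaching the better loss at the last epoch, which is the trade-off a longer patience such as 20 is meant to soften.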
Sources: High-Level System Architecture (Diagram 3), narrative_stack training pipeline
Data Processing and Scientific Computing
graph TB
subgraph "Core Scientific Libraries"
numpy["NumPy\nArray operations"]
pandas["pandas\nDataFrame manipulation"]
sklearn["scikit-learn\nPCA, RobustScaler"]
end
subgraph "Preprocessing Pipeline"
PCA["PCA\nDimensionality reduction\nVariance threshold: 0.95"]
RobustScaler["RobustScaler\nPer concept/unit pair\nOutlier-robust normalization"]
sklearn --> PCA
sklearn --> RobustScaler
end
subgraph "Data Structures"
triplets["Concept/Unit/Value Triplets"]
embeddings["Semantic Embeddings\nPCA-compressed"]
RobustScaler --> triplets
PCA --> embeddings
triplets --> embeddings
end
subgraph "Data Ingestion"
csv_files["CSV Files\nfrom Rust sec-fetcher"]
UsGaapRowRecord["UsGaapRowRecord\nParsed row structure"]
csv_files --> UsGaapRowRecord
UsGaapRowRecord --> numpy
end
scikit-learn Components:
- PCA : Reduces embedding dimensionality while retaining 95% variance
- RobustScaler : Normalizes using median and IQR, resistant to outliers
- Both fitted per unique concept/unit pair for specialized normalization
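Median/IQR scaling fitted separately for each concept/unit pair can be sketched with the standard library alone. This is not the project's code and does not call scikit-learn; `fit_robust_params` and `scale_per_pair` are hypothetical helpers, the sample rows are invented, and the "inclusive" quantile method is chosen to mirror linear interpolation between data points.

```python
from collections import defaultdict
from statistics import median, quantiles
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (concept, unit)


def fit_robust_params(values: List[float]) -> Tuple[float, float]:
    """Return (center, scale) = (median, IQR). Requires at least two values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Fall back to a scale of 1.0 when all central values coincide.
    return median(values), iqr if iqr != 0 else 1.0


def scale_per_pair(rows: Iterable[Tuple[str, str, float]]) -> Dict[Key, List[float]]:
    """Group values by (concept, unit) and scale each group independently."""
    grouped: Dict[Key, List[float]] = defaultdict(list)
    for concept, unit, value in rows:
        grouped[(concept, unit)].append(value)
    scaled: Dict[Key, List[float]] = {}
    for key, values in grouped.items():
        center, scale = fit_robust_params(values)
        scaled[key] = [(v - center) / scale for v in values]
    return scaled


rows = [
    ("Revenues", "USD", 1.0), ("Revenues", "USD", 2.0), ("Revenues", "USD", 3.0),
    ("Revenues", "USD", 4.0), ("Revenues", "USD", 100.0),  # 100.0 is an outlier
    ("SharesOutstanding", "shares", 10.0), ("SharesOutstanding", "shares", 30.0),
]
scaled = scale_per_pair(rows)
print(scaled[("Revenues", "USD")])  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier barely affects how the other four values are scaled, because the median and IQR ignore the extremes; a mean/standard-deviation scaler would have compressed them toward zero.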
NumPy Usage:
- Validation via np.isclose() checks
- Scaler transformation verification
- Embedding matrix storage
Sources: High-Level System Architecture (Diagram 3), us_gaap_store preprocessing pipeline
Database and Storage Integration
graph TB
subgraph "Database Layer"
mysql_db[("MySQL Database\nus_gaap_test")]
DbUsGaap["DbUsGaap\nDatabase interface"]
asyncio["asyncio\nAsync queries"]
DbUsGaap --> mysql_db
DbUsGaap --> asyncio
end
subgraph "WebSocket Storage"
DataStoreWsClient["DataStoreWsClient\nWebSocket client"]
simd_r_drive_server["simd-r-drive\nWebSocket server"]
DataStoreWsClient --> simd_r_drive_server
end
subgraph "Unified Access"
UsGaapStore["UsGaapStore\nFacade pattern"]
UsGaapStore --> DbUsGaap
UsGaapStore --> DataStoreWsClient
end
subgraph "Data Models"
triplet_storage["Triplet Storage:\n- concept\n- unit\n- scaled_value\n- scaler\n- embedding"]
UsGaapStore --> triplet_storage
end
Storage Strategy:
- MySQL : Relational queries for ingested CSV data
- WebSocket Store : High-performance embedding matrix storage
- Facade Pattern : UsGaapStore abstracts storage backend choice
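A facade of this kind can be sketched as one class that routes relational rows to one backend and embedding blobs to another. Everything below is hypothetical: the two stub backends stand in for DbUsGaap and DataStoreWsClient, and the method names, key format, and JSON encoding are assumptions rather than the real UsGaapStore interface.

```python
import json
from typing import Dict, List, Optional, Tuple


class RelationalBackendStub:
    """Stand-in for the MySQL-backed interface; keeps rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str, float]] = []

    def insert(self, concept: str, unit: str, value: float) -> None:
        self.rows.append((concept, unit, value))


class KeyValueBackendStub:
    """Stand-in for the WebSocket key-value client; keeps bytes in a dict."""

    def __init__(self) -> None:
        self.data: Dict[bytes, bytes] = {}

    def write(self, key: bytes, payload: bytes) -> None:
        self.data[key] = payload

    def read(self, key: bytes) -> Optional[bytes]:
        return self.data.get(key)


class UsGaapStoreSketch:
    """Facade: callers never choose a backend directly."""

    def __init__(self, db: RelationalBackendStub, kv: KeyValueBackendStub) -> None:
        self._db = db
        self._kv = kv

    def ingest_row(self, concept: str, unit: str, value: float) -> None:
        self._db.insert(concept, unit, value)

    def put_embedding(self, concept: str, unit: str, vector: List[float]) -> None:
        self._kv.write(f"{concept}|{unit}".encode(), json.dumps(vector).encode())

    def get_embedding(self, concept: str, unit: str) -> Optional[List[float]]:
        payload = self._kv.read(f"{concept}|{unit}".encode())
        return None if payload is None else json.loads(payload)


db, kv = RelationalBackendStub(), KeyValueBackendStub()
store = UsGaapStoreSketch(db, kv)
store.ingest_row("Revenues", "USD", 1.5)
store.put_embedding("Revenues", "USD", [0.1, 0.2])
print(store.get_embedding("Revenues", "USD"))  # [0.1, 0.2]
```

Because callers only see the facade, either stub could be swapped for a real client without touching the calling code.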
Python WebSocket Client:
- Package: simd-r-drive-client (Python equivalent of Rust simd-r-drive)
- Async communication with WebSocket server
- Shared embedding matrices between Rust and Python
Sources: High-Level System Architecture (Diagram 3), Database & Storage Integration section
Visualization and Debugging
| Library | Purpose | Key Outputs |
|---|---|---|
| matplotlib | Static plots | PCA variance plots, scaler distribution |
| TensorBoard | Training metrics | Loss curves, learning rate schedules |
| itertools | Data iteration | Batching, grouping concept pairs |
Visualization Types:
- PCA Explanation Plots : Cumulative variance by component
- Semantic Embedding Scatterplots : t-SNE/UMAP projections of concept space
- Variance Analysis : Per-concept/unit value distributions
- Training Curves : TensorBoard integration via PyTorch Lightning
Sources: High-Level System Architecture (Diagram 3), Validation & Analysis section
Shared Infrastructure
simd-r-drive WebSocket Server
graph TB
subgraph "Server Implementation"
server["simd-r-drive-ws-server\nWebSocket server"]
DataStore_server["DataStore\nBackend storage"]
server --> DataStore_server
end
subgraph "Rust Clients"
http_cache_rust["HTTP Cache\n(Rust)"]
preprocessor_cache_rust["Preprocessor Cache\n(Rust)"]
http_cache_rust --> DataStore_server
preprocessor_cache_rust --> DataStore_server
end
subgraph "Python Clients"
DataStoreWsClient_py["DataStoreWsClient\n(Python)"]
embedding_matrix["Embedding Matrix\nStorage"]
DataStoreWsClient_py --> DataStore_server
embedding_matrix --> DataStoreWsClient_py
end
subgraph "Docker Deployment"
dockerfile["Dockerfile\nContainer config"]
rust_image["rust:1.87\nBase image"]
dockerfile --> rust_image
dockerfile --> server
end
Key Features:
- Cross-language data sharing via WebSocket protocol
- Binary-efficient key-value storage
- Persistent storage across restarts
- Docker containerization for CI/CD environments
Usage Patterns:
- Rust writes HTTP cache entries
- Rust writes preprocessor transformations
- Python reads/writes embedding matrices
- Python reads ingested US GAAP triplets
Sources: Cargo.toml:36-37 High-Level System Architecture (Diagram 1, Diagram 5)
File System Interchange
Directory Structure:
- fund-holdings/A-Z/ : NPORT filing holdings, alphabetized by ticker
- us-gaap/ : Fundamental concepts CSV outputs
- Row-based CSV format with standardized columns
Data Flow:
- Rust fetches from SEC API
- Rust transforms via distill_us_gaap_fundamental_concepts
- Rust writes CSV files
- Python walks directories
- Python parses into UsGaapRowRecord
- Python ingests to MySQL and WebSocket store
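The Python half of this flow (walk the output directories, then parse each CSV row into a record) can be sketched as follows. The real UsGaapRowRecord fields and CSV column names are not given on this page, so `RowRecordSketch` and the `concept`, `unit`, and `value` columns are assumptions, and the demo writes its own temporary file rather than reading real output.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RowRecordSketch:
    """Hypothetical stand-in for UsGaapRowRecord."""
    concept: str
    unit: str
    value: float


def iter_records(root: str) -> Iterator[RowRecordSketch]:
    """Walk `root` recursively and yield one record per CSV data row."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            if not name.endswith(".csv"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle):
                    yield RowRecordSketch(row["concept"], row["unit"], float(row["value"]))


def demo() -> List[RowRecordSketch]:
    """Write a tiny CSV into a temporary us-gaap/ directory and parse it back."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "us-gaap"))
        path = os.path.join(root, "us-gaap", "EXAMPLE.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["concept", "unit", "value"])
            writer.writerow(["Revenues", "USD", "1000.5"])
            writer.writerow(["Assets", "USD", "250"])
        return list(iter_records(root))


records = demo()
print(len(records))  # 2
```

Yielding records lazily keeps memory use flat when the directory tree is large; the ingestion step can consume the iterator in batches.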
Sources: High-Level System Architecture (Diagram 1), main.rs CSV output logic
Development and CI/CD Stack
GitHub Actions Workflow
graph TB
subgraph "CI Pipeline"
workflow[".github/workflows/\nus-gaap-store-integration-test"]
checkout["Checkout code"]
rust_setup["Setup Rust 1.87"]
docker_compose["Docker Compose\nsimd-r-drive-ci-server"]
mysql_container["MySQL test container"]
workflow --> checkout
checkout --> rust_setup
checkout --> docker_compose
docker_compose --> mysql_container
end
subgraph "Test Execution"
cargo_test["cargo test\nRust unit tests"]
pytest["pytest\nPython integration tests"]
rust_setup --> cargo_test
docker_compose --> pytest
end
subgraph "Testing Tools"
mockito_tests["mockito 1.7.0\nHTTP mock server"]
tempfile_tests["tempfile 3.18.0\nTemp directories"]
cargo_test --> mockito_tests
cargo_test --> tempfile_tests
end
Docker Configuration:
- Image : rust:1.87 for reproducible builds
- Services : simd-r-drive-ws-server, MySQL 8.0
- Environment Variables : Database credentials, WebSocket ports
- Volume Mounts : Test data directories
Testing Strategy:
- Rust Unit Tests : mockito for HTTP mocking, tempfile for isolated test fixtures
- Python Integration Tests : pytest with MySQL fixtures, WebSocket client validation
- End-to-End : Full pipeline from CSV ingestion through ML training
Sources: Cargo.toml:42-44 High-Level System Architecture (Diagram 5), CI/CD Pipeline section
Testing Dependencies
| Language | Framework | Mocking | Assertions |
|---|---|---|---|
| Rust | Built-in #[test] | mockito 1.7.0 | Standard assertions |
| Python | pytest | unittest.mock | np.isclose() |
Rust Test Modules:
- config_manager_tests: Configuration loading and merging
- sec_client_tests: HTTP client behavior
- distill_tests: US GAAP concept transformation
Python Test Modules:
- Integration tests: End-to-end pipeline validation
- Unit tests: Scaler, PCA, embedding generation
Sources: Cargo.lock:1802-1824 Testing Strategy documentation
Version Compatibility Matrix
Critical Version Constraints
| Component | Version | Constraints | Reason |
|---|---|---|---|
| tokio | 1.43.0 | >= 1.0 | Stable async runtime API |
| reqwest | 0.12.15 | 0.12.x | Compatible with reqwest-drive middleware |
| polars | 0.46.0 | 0.46.x | Breaking changes between minor versions |
| simd-r-drive | 0.3.0-alpha.1 | Exact match | Alpha API, unstable |
| serde | 1.0.218 | >= 1.0.100 | Derive macro stability |
| hyper | 1.6.0 | 1.x | HTTP/2 support, reqwest dependency |
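The split between "1.x" and "0.46.x" style constraints follows from Cargo's default (caret) requirement semantics, where a leading zero makes the next component the compatibility boundary. The sketch below models only that rule; `parse` and `caret_compatible` are hypothetical helpers, and pre-release tags such as `-alpha.1` are simply stripped, which does not reflect how Cargo actually treats pre-releases.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Parse "MAJOR.MINOR.PATCH", dropping any pre-release suffix."""
    core = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def caret_compatible(required: str, candidate: str) -> bool:
    """Return True if `candidate` satisfies a caret requirement on `required`."""
    req, cand = parse(required), parse(candidate)
    if cand < req:
        return False  # never accept anything older than the requirement
    if req[0] > 0:
        return cand[0] == req[0]    # 1.43.0 accepts any later 1.x.y
    if req[1] > 0:
        return cand[:2] == req[:2]  # 0.46.0 accepts only later 0.46.y
    return cand == req              # 0.0.z accepts only itself


print(caret_compatible("1.43.0", "1.44.2"))  # True
print(caret_compatible("0.46.0", "0.46.3"))  # True
print(caret_compatible("0.46.0", "0.47.0"))  # False
```

Under this rule a 0.x crate such as polars or reqwest is effectively pinned to its minor series, while 1.x crates such as tokio, serde, and hyper can float across minor releases.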
Rust Toolchain
Minimum Rust Version: 1.70+ (for async trait bounds and GAT support)
Feature Requirements:
- async-trait for trait async methods
- #[tokio::main] macro support
- Generic Associated Types (GATs) for polars iterators
Sources: Cargo.toml:4 Cargo.lock:1-3
Python Version Requirements
Estimated Python Version: 3.8+ (for PyTorch Lightning compatibility)
Key Constraints:
- PyTorch: Requires Python 3.8 or later
- PyTorch Lightning: Requires PyTorch >= 1.9
- asyncio: Stable async/await syntax (Python 3.7+)
Sources: High-Level System Architecture (Python technology sections)
Transitive Dependency Highlights
HTTP/TLS Stack Depth
graph LR
reqwest["reqwest"] --> hyper
hyper --> h2["h2\nHTTP/2"]
hyper --> http["http\nTypes"]
reqwest --> rustls
reqwest --> native_tls
rustls --> ring["ring\nCryptography"]
native_tls --> openssl["openssl\nSystem TLS"]
h2 --> tokio
hyper --> tokio
Notable Transitive Dependencies:
- h2 0.4.8: HTTP/2 protocol implementation (37 dependencies)
- rustls 0.23.21: TLS 1.2/1.3 without OpenSSL (14 dependencies)
- tokio-util 0.7.13: Codec, framing utilities (8 dependencies)
Sources: Cargo.lock:1141-1351
Polars Ecosystem Depth
The polars dependency pulls in 23 related crates:
- polars-core, polars-lazy, polars-io, polars-ops
- polars-arrow: Apache Arrow array implementation
- polars-parquet: Parquet file format support
- polars-sql: SQL query interface
- polars-time: Temporal operations
- polars-utils: Shared utilities
Total Transitive: ~180 crates in the full dependency tree
Sources: Cargo.lock:2170-2394
Security and Cryptography
Platform-Specific Cryptography:
- macOS: Security.framework via security-framework-sys
- Linux: D-Bus + libdbus-sys
- Windows: Win32 Credential Manager via windows-sys
Sources: Cargo.lock:764-1653
Total Rust Dependencies: ~250 crates (including transitive)
Key Architectural Decisions:
- Dual TLS support for platform flexibility
- Polars instead of native DataFrame for performance
- simd-r-drive for cross-language data sharing
- Alpha versions for simd-r-drive (active development)
- Platform-specific credential storage via keyring
Sources: Cargo.toml:1-45 Cargo.lock:1 High-Level System Architecture (all diagrams)