This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Dependencies & Technology Stack

This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system. For information about the configuration system and credential management, see Configuration System. For details on how these dependencies are used within specific components, see Rust sec-fetcher Application and Python narrative_stack System.

Overview

The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.

Sources: Cargo.toml:1-45 High-Level System Architecture diagrams

Rust Technology Stack

Core Direct Dependencies

The Rust application declares 29 direct dependencies in its manifest, each serving specific architectural roles:

Dependency Categories and Usage:

graph TB
    subgraph "Async Runtime & Concurrency"
        tokio["tokio 1.43.0\nFull async runtime"]
        rayon["rayon 1.10.0\nData parallelism"]
        dashmap["dashmap 6.1.0\nConcurrent hashmap"]
    end

    subgraph "HTTP & Network"
        reqwest["reqwest 0.12.15\nHTTP client"]
        reqwest_drive["reqwest-drive 0.1.0-alpha.9\nDrive middleware"]
    end

    subgraph "Data Processing"
        polars["polars 0.46.0\nDataFrame operations"]
        csv_crate["csv 1.3.1\nCSV parsing"]
        serde["serde 1.0.218\nSerialization"]
        serde_json["serde_json 1.0.140\nJSON support"]
        quick_xml["quick-xml 0.37.2\nXML parsing"]
    end

    subgraph "Storage & Caching"
        simd_r_drive["simd-r-drive 0.3.0-alpha.1\nKey-value store"]
        simd_r_drive_ext["simd-r-drive-extensions\n0.4.0-alpha.6"]
    end

    subgraph "Configuration & Validation"
        config_crate["config 0.15.9\nConfig management"]
        keyring["keyring 3.6.2\nCredential storage"]
        email_address["email_address 0.2.9\nEmail validation"]
        rust_decimal["rust_decimal 1.36.0\nDecimal numbers"]
        chrono["chrono 0.4.40\nDate/time handling"]
    end

    subgraph "Development Tools"
        mockito["mockito 1.7.0\nHTTP mocking"]
        tempfile["tempfile 3.18.0\nTemp file creation"]
        env_logger["env_logger 0.11.7\nLogging"]
    end

| Category | Crates | Primary Use Cases |
| --- | --- | --- |
| Async Runtime | tokio | Event loop, async I/O, task scheduling, timers |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing |
| Concurrency | rayon, dashmap, crossbeam | Parallel processing, concurrent data structures |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage |
| Configuration | config, keyring | TOML config loading, secure credential management |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files |

Sources: Cargo.toml:8-44

HTTP and Network Layer

The HTTP stack is layered, from the application-level SecClient down to the TLS backends:

graph TB
    SecClient["SecClient\n(src/sec_client.rs)"]
    reqwest["reqwest 0.12.15\nHigh-level HTTP client"]
    reqwest_drive["reqwest-drive 0.1.0-alpha.9\nMiddleware framework"]
    hyper["hyper 1.6.0\nHTTP/1.1 & HTTP/2"]
    h2["h2 0.4.8\nHTTP/2 protocol"]
    TLS_Layer["TLS Layer"]
    rustls["rustls 0.23.21\nPure Rust TLS"]
    native_tls["native-tls 0.2.14\nPlatform TLS"]
    openssl["openssl 0.10.71\nOpenSSL bindings"]
    hyper_util["hyper-util 0.1.10\nUtilities"]
    hyper_rustls["hyper-rustls 0.27.5\nRustls connector"]
    hyper_tls["hyper-tls 0.6.0\nNative TLS connector"]

    SecClient --> reqwest
    reqwest --> reqwest_drive
    reqwest --> hyper
    reqwest --> TLS_Layer

    hyper --> h2
    hyper --> hyper_util

    TLS_Layer --> hyper_rustls
    TLS_Layer --> hyper_tls
    hyper_rustls --> rustls
    hyper_tls --> native_tls
    native_tls --> openssl

TLS Configuration: The system supports dual TLS backends for platform compatibility:

  • rustls (0.23.21): Pure Rust implementation with no dependency on a system TLS library such as OpenSSL, used by default
  • native-tls (0.2.14): Platform-native TLS (OpenSSL on Linux, Security.framework on macOS, SChannel on Windows)

Sources: Cargo.lock:1280-1351 Cargo.toml:28

Data Processing Stack

Polars Features Enabled:

  • json: JSON file reading/writing
  • lazy: Lazy evaluation engine for query optimization
  • pivot: Pivot table operations

Key Numeric Types:

  • rust_decimal::Decimal: Exact decimal arithmetic for financial calculations, used for US GAAP processing
  • chrono::NaiveDate, chrono::DateTime: Timestamp handling for filing dates and report periods

Sources: Cargo.toml:24 Cargo.lock:2170-2394

Concurrency and Parallelism

Concurrency Model:

  • tokio: Handles I/O-bound operations (HTTP requests, file I/O)
  • rayon: Handles CPU-bound operations (DataFrame transformations, parallel iterations)
  • dashmap: Thread-safe caching without explicit locking

Key Usage Locations:

Sources: Cargo.toml:14-40 src/caches.rs:1-66 Cargo.lock:612-762

Storage and Caching System

Cache Architecture:

  • Two separate DataStore instances for different caching layers
  • Thread-safe access via Arc<DataStore> wrapped in OnceLock statics
  • Initialized once at application startup via Caches::init()
  • 1-week TTL for HTTP responses, indefinite for preprocessor results
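
The TTL policy in the last bullet can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from src/caches.rs: `TtlCache`, its methods, and the in-memory dictionary are hypothetical stand-ins for the DataStore-backed caches.

```python
import time
from typing import Dict, Optional, Tuple

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


class TtlCache:
    """Hypothetical in-memory cache; ttl_seconds=None means entries never expire."""

    def __init__(self, ttl_seconds: Optional[float] = None) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes, now: Optional[float] = None) -> None:
        # Record the write time so get() can decide whether the entry is stale.
        stored_at = time.time() if now is None else now
        self._entries[key] = (stored_at, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        current = time.time() if now is None else now
        if self.ttl_seconds is not None and current - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return value


# Assumed mapping of the two caching layers onto the sketch.
http_cache = TtlCache(ttl_seconds=ONE_WEEK_SECONDS)
preprocessor_cache = TtlCache(ttl_seconds=None)
```

The optional `now` argument exists only so that expiry can be exercised without waiting; a real cache would read the clock directly.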

Storage Format:

  • Binary .bin files for efficient serialization
  • Uses bincode 1.3.3 for encoding cached data
  • Persistent across application restarts

Sources: src/caches.rs:7-65 Cargo.toml:36-37 Cargo.lock:246-252

Configuration and Credentials

Configuration Sources (Priority Order):

  1. Environment variables
  2. config.toml in current directory
  3. ~/.config/sec-fetcher/config.toml
  4. Default values
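
The priority order above amounts to a layered lookup in which the first source that defines a key wins. The following Python sketch illustrates that rule only; the function name and every key and value in it are hypothetical, and the actual Rust implementation relies on the config crate rather than on this code.

```python
from collections import ChainMap
from typing import Dict, Mapping


def resolve_config(
    env: Mapping[str, str],
    local_file: Mapping[str, str],
    user_file: Mapping[str, str],
    defaults: Mapping[str, str],
) -> Dict[str, str]:
    """Merge the four sources so that earlier arguments override later ones."""
    # ChainMap searches its maps left to right, which mirrors the priority list.
    return dict(ChainMap(dict(env), dict(local_file), dict(user_file), dict(defaults)))


# Hypothetical keys and values, chosen only to show the override behaviour.
settings = resolve_config(
    env={"email": "env@example.com"},
    local_file={"email": "local@example.com", "max_concurrent": "4"},
    user_file={"max_concurrent": "2", "cache_dir": "/tmp/user-cache"},
    defaults={"max_concurrent": "1", "cache_dir": "data", "min_delay_ms": "500"},
)
```

Here `email` comes from the environment, `max_concurrent` from the local file, `cache_dir` from the user-level file, and `min_delay_ms` from the defaults.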

Keyring Features:

  • apple-native: Direct Security.framework integration on macOS
  • windows-native: Windows Credential Manager via Win32 APIs
  • sync-secret-service: D-Bus Secret Service for Linux

Sources: Cargo.toml:12-20 Cargo.lock:1641-1653

Validation and Type Safety

| Crate | Version | Purpose | Key Types |
| --- | --- | --- | --- |
| email_address | 0.2.9 | SEC email validation | EmailAddress |
| rust_decimal | 1.36.0 | Financial calculations | Decimal |
| chrono | 0.4.40 | Date/time operations | NaiveDate, DateTime<Utc> |
| chrono-tz | 0.10.1 | Timezone handling | Tz::America__New_York |
| strum / strum_macros | 0.27.1 | Enum utilities | EnumString, Display |

Decimal Precision:

  • 28-29 significant digits
  • No floating-point rounding errors
  • Critical for US GAAP fundamental values
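
The difference between binary floating point and exact decimal arithmetic is easy to demonstrate. The sketch below uses Python's standard decimal module as a stand-in for rust_decimal; it illustrates the general property, not the project's own code.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum differs slightly from the float literal 0.3.
float_sum = 0.1 + 0.2
float_is_exact = float_sum == 0.3

# Decimal values built from strings keep the decimal digits exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = decimal_sum == Decimal("0.3")
```

`float_is_exact` is False while `decimal_is_exact` is True, which is why reported financial values are better held in a decimal type than in f64.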

Date Handling:

  • All SEC filing dates parsed to chrono::NaiveDate
  • UTC timestamps for API requests
  • Eastern Time conversion for market hours

Sources: Cargo.toml:16-39 Cargo.lock:403-868

Python Technology Stack

Machine Learning Framework

Training Configuration:

  • Patience: 20 epochs for early stopping
  • Checkpoint Strategy: Save top 1 model by validation loss
  • Learning Rate: Configurable via checkpoint override
  • Gradient Clipping: Prevents exploding gradients
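
As an illustration of the patience setting, the sketch below shows the basic early-stopping rule in plain Python. It is a simplified stand-in, not the training code itself: the function name is hypothetical, and a framework callback would normally offer further options such as a minimum improvement delta.

```python
from typing import Iterable, Optional


def early_stopping_epoch(val_losses: Iterable[float], patience: int = 20) -> Optional[int]:
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# The loss improves for two epochs and then plateaus for 25 epochs.
stop_at = early_stopping_epoch([1.0, 0.9] + [0.95] * 25, patience=20)
```

With the best loss at epoch 1 and a patience of 20, the rule fires at epoch 21, the twentieth consecutive epoch without improvement.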

Sources: High-Level System Architecture (Diagram 3), narrative_stack training pipeline

Data Processing and Scientific Computing

graph TB
    subgraph "Core Scientific Libraries"
        numpy["NumPy\nArray operations"]
        pandas["pandas\nDataFrame manipulation"]
        sklearn["scikit-learn\nPCA, RobustScaler"]
    end

    subgraph "Preprocessing Pipeline"
        PCA["PCA\nDimensionality reduction\nVariance threshold: 0.95"]
        RobustScaler["RobustScaler\nPer concept/unit pair\nOutlier-robust normalization"]
        sklearn --> PCA
        sklearn --> RobustScaler
    end

    subgraph "Data Structures"
        triplets["Concept/Unit/Value Triplets"]
        embeddings["Semantic Embeddings\nPCA-compressed"]
        RobustScaler --> triplets
        PCA --> embeddings
        triplets --> embeddings
    end

    subgraph "Data Ingestion"
        csv_files["CSV Files\nfrom Rust sec-fetcher"]
        UsGaapRowRecord["UsGaapRowRecord\nParsed row structure"]
        csv_files --> UsGaapRowRecord
        UsGaapRowRecord --> numpy
    end

scikit-learn Components:

  • PCA: Reduces embedding dimensionality while retaining 95% variance
  • RobustScaler: Normalizes using median and IQR, resistant to outliers
  • Both fitted per unique concept/unit pair for specialized normalization
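
To make the median/IQR normalization concrete, here is a minimal standard-library sketch of the transformation for a single concept/unit pair. It illustrates the technique and is not the scikit-learn implementation; the helper names and the sample values are hypothetical, and the quartiles use linear interpolation.

```python
import math
from statistics import median, quantiles
from typing import List, Tuple


def fit_robust_scaler(values: List[float]) -> Tuple[float, float]:
    """Return (center, scale), i.e. the median and the interquartile range."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Fall back to 1.0 so a constant column does not cause division by zero.
    scale = iqr if iqr != 0 else 1.0
    return median(values), scale


def transform(values: List[float], center: float, scale: float) -> List[float]:
    return [(v - center) / scale for v in values]


def inverse_transform(scaled: List[float], center: float, scale: float) -> List[float]:
    return [s * scale + center for s in scaled]


# One extreme outlier (100.0) among otherwise small values.
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
center, scale = fit_robust_scaler(raw)
scaled = transform(raw, center, scale)

# Round-trip check: inverting the scaled values should recover the inputs.
restored = inverse_transform(scaled, center, scale)
round_trip_ok = all(math.isclose(a, b) for a, b in zip(raw, restored))
```

Because the median (3.0) and IQR (2.0) ignore the magnitude of the outlier, the four typical values land in [-1, 0.5] while the outlier stays clearly separated at 48.5. A mean and standard deviation scaler would instead compress the typical values toward each other.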

NumPy Usage:

  • Validation via np.isclose() checks
  • Scaler transformation verification
  • Embedding matrix storage

Sources: High-Level System Architecture (Diagram 3), us_gaap_store preprocessing pipeline

Database and Storage Integration

graph TB
    subgraph "Database Layer"
        mysql_db[("MySQL Database\nus_gaap_test")]
        DbUsGaap["DbUsGaap\nDatabase interface"]
        asyncio["asyncio\nAsync queries"]
        DbUsGaap --> mysql_db
        DbUsGaap --> asyncio
    end

    subgraph "WebSocket Storage"
        DataStoreWsClient["DataStoreWsClient\nWebSocket client"]
        simd_r_drive_server["simd-r-drive\nWebSocket server"]
        DataStoreWsClient --> simd_r_drive_server
    end

    subgraph "Unified Access"
        UsGaapStore["UsGaapStore\nFacade pattern"]
        UsGaapStore --> DbUsGaap
        UsGaapStore --> DataStoreWsClient
    end

    subgraph "Data Models"
        triplet_storage["Triplet Storage:\n- concept\n- unit\n- scaled_value\n- scaler\n- embedding"]
        UsGaapStore --> triplet_storage
    end

Storage Strategy:

  • MySQL: Relational queries for ingested CSV data
  • WebSocket Store: High-performance embedding matrix storage
  • Facade Pattern: UsGaapStore abstracts storage backend choice
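
The facade idea can be sketched as follows. Everything in this example is hypothetical: the two backend classes are in-memory stand-ins for DbUsGaap and DataStoreWsClient, and the method names are invented for illustration rather than copied from UsGaapStore.

```python
from typing import Dict, List, Optional, Tuple


class RelationalBackend:
    """Hypothetical in-memory stand-in for the MySQL-backed database interface."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str, float]] = []

    def insert(self, concept: str, unit: str, value: float) -> None:
        self._rows.append((concept, unit, value))

    def select_values(self, concept: str, unit: str) -> List[float]:
        return [v for c, u, v in self._rows if (c, u) == (concept, unit)]


class KeyValueBackend:
    """Hypothetical in-memory stand-in for the WebSocket key-value client."""

    def __init__(self) -> None:
        self._blobs: Dict[bytes, bytes] = {}

    def write(self, key: bytes, payload: bytes) -> None:
        self._blobs[key] = payload

    def read(self, key: bytes) -> Optional[bytes]:
        return self._blobs.get(key)


class StoreFacade:
    """Single entry point: callers never choose a backend themselves."""

    def __init__(self, relational: RelationalBackend, key_value: KeyValueBackend) -> None:
        self._relational = relational
        self._key_value = key_value

    def ingest_value(self, concept: str, unit: str, value: float) -> None:
        self._relational.insert(concept, unit, value)  # row data goes to SQL

    def values_for(self, concept: str, unit: str) -> List[float]:
        return self._relational.select_values(concept, unit)

    def save_embedding(self, concept: str, unit: str, embedding: bytes) -> None:
        self._key_value.write(self._key(concept, unit), embedding)  # blobs go to KV

    def load_embedding(self, concept: str, unit: str) -> Optional[bytes]:
        return self._key_value.read(self._key(concept, unit))

    @staticmethod
    def _key(concept: str, unit: str) -> bytes:
        return f"embedding:{concept}:{unit}".encode("utf-8")


store = StoreFacade(RelationalBackend(), KeyValueBackend())
store.ingest_value("Revenues", "USD", 1250.0)
store.save_embedding("Revenues", "USD", b"\x00\x01\x02")
```

The routing decision lives in one place, so a backend can be swapped or mocked in tests without touching calling code.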

Python WebSocket Client:

  • Package: simd-r-drive-client (Python equivalent of Rust simd-r-drive)
  • Async communication with WebSocket server
  • Shared embedding matrices between Rust and Python

Sources: High-Level System Architecture (Diagram 3), Database & Storage Integration section

Visualization and Debugging

| Library | Purpose | Key Outputs |
| --- | --- | --- |
| matplotlib | Static plots | PCA variance plots, scaler distribution |
| TensorBoard | Training metrics | Loss curves, learning rate schedules |
| itertools | Data iteration | Batching, grouping concept pairs |

Visualization Types:

  1. PCA Explanation Plots: Cumulative variance by component
  2. Semantic Embedding Scatterplots: t-SNE/UMAP projections of concept space
  3. Variance Analysis: Per-concept/unit value distributions
  4. Training Curves: TensorBoard integration via PyTorch Lightning
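
The cumulative-variance view in item 1 is also what determines how many PCA components are kept for a given variance threshold. The sketch below shows that calculation in plain Python on made-up per-component variances; the function name is hypothetical. scikit-learn's PCA can perform an equivalent selection internally when n_components is given as a fraction, but whether narrative_stack configures it that way is not stated here.

```python
from itertools import accumulate
from typing import List


def components_for_variance(explained_variance: List[float], threshold: float = 0.95) -> int:
    """Smallest number of leading components whose cumulative variance ratio reaches threshold."""
    total = sum(explained_variance)
    cumulative = accumulate(v / total for v in explained_variance)
    for count, ratio in enumerate(cumulative, start=1):
        if ratio >= threshold:
            return count
    return len(explained_variance)


# Made-up variances, sorted in descending order as PCA reports them.
variances = [5.0, 3.0, 1.0, 0.6, 0.4]
kept = components_for_variance(variances, threshold=0.95)
```

For these values the cumulative ratios are roughly 0.50, 0.80, 0.90, 0.96 and 1.00, so four components are needed to retain at least 95% of the variance.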

Sources: High-Level System Architecture (Diagram 3), Validation & Analysis section

Shared Infrastructure

simd-r-drive WebSocket Server

graph TB
    subgraph "Server Implementation"
        server["simd-r-drive-ws-server\nWebSocket server"]
        DataStore_server["DataStore\nBackend storage"]
        server --> DataStore_server
    end

    subgraph "Rust Clients"
        http_cache_rust["HTTP Cache\n(Rust)"]
        preprocessor_cache_rust["Preprocessor Cache\n(Rust)"]
        http_cache_rust --> DataStore_server
        preprocessor_cache_rust --> DataStore_server
    end

    subgraph "Python Clients"
        DataStoreWsClient_py["DataStoreWsClient\n(Python)"]
        embedding_matrix["Embedding Matrix\nStorage"]
        DataStoreWsClient_py --> DataStore_server
        embedding_matrix --> DataStoreWsClient_py
    end

    subgraph "Docker Deployment"
        dockerfile["Dockerfile\nContainer config"]
        rust_image["rust:1.87\nBase image"]
        dockerfile --> rust_image
        dockerfile --> server
    end

Key Features:

  • Cross-language data sharing via WebSocket protocol
  • Binary-efficient key-value storage
  • Persistent storage across restarts
  • Docker containerization for CI/CD environments

Usage Patterns:

  1. Rust writes HTTP cache entries
  2. Rust writes preprocessor transformations
  3. Python reads/writes embedding matrices
  4. Python reads ingested US GAAP triplets

Sources: Cargo.toml:36-37 High-Level System Architecture (Diagram 1, Diagram 5)

File System Interchange

Directory Structure:

  • fund-holdings/A-Z/: NPORT filing holdings, alphabetized by ticker
  • us-gaap/: Fundamental concepts CSV outputs
  • Row-based CSV format with standardized columns

Data Flow:

  1. Rust fetches from SEC API
  2. Rust transforms via distill_us_gaap_fundamental_concepts
  3. Rust writes CSV files
  4. Python walks directories
  5. Python parses into UsGaapRowRecord
  6. Python ingests to MySQL and WebSocket store
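
Steps 4 and 5 of this flow can be sketched with the standard library alone. The example is illustrative only: `RowRecord` is a hypothetical stand-in for UsGaapRowRecord, and the column names `concept`, `unit` and `value` are assumptions, since the actual CSV columns are not listed on this page.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator


@dataclass
class RowRecord:
    """Hypothetical row shape; the real UsGaapRowRecord fields may differ."""
    source_file: str
    concept: str
    unit: str
    value: float


def iter_rows(root_dir: str) -> Iterator[RowRecord]:
    """Walk root_dir recursively and yield one record per CSV data row."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith(".csv"):
                continue  # ignore anything that is not a CSV output
            path = os.path.join(dirpath, name)
            with open(path, newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle):
                    yield RowRecord(name, row["concept"], row["unit"], float(row["value"]))


# Build a throwaway directory that mimics the us-gaap/ output layout.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "us-gaap"))
    sample_path = os.path.join(root, "us-gaap", "AAPL.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as sample:
        sample.write("concept,unit,value\nRevenues,USD,100.5\nAssets,USD,200\n")
    records = list(iter_rows(root))
```

Yielding records lazily keeps memory use flat when the directory tree holds many files; the final ingestion step would consume this iterator.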

Sources: High-Level System Architecture (Diagram 1), main.rs CSV output logic

Development and CI/CD Stack

GitHub Actions Workflow

graph TB
    subgraph "CI Pipeline"
        workflow[".github/workflows/\nus-gaap-store-integration-test"]
        checkout["Checkout code"]
        rust_setup["Setup Rust 1.87"]
        docker_compose["Docker Compose\nsimd-r-drive-ci-server"]
        mysql_container["MySQL test container"]
        workflow --> checkout
        checkout --> rust_setup
        checkout --> docker_compose
        docker_compose --> mysql_container
    end

    subgraph "Test Execution"
        cargo_test["cargo test\nRust unit tests"]
        pytest["pytest\nPython integration tests"]
        rust_setup --> cargo_test
        docker_compose --> pytest
    end

    subgraph "Testing Tools"
        mockito_tests["mockito 1.7.0\nHTTP mock server"]
        tempfile_tests["tempfile 3.18.0\nTemp directories"]
        cargo_test --> mockito_tests
        cargo_test --> tempfile_tests
    end

Docker Configuration:

  • Image: rust:1.87 for reproducible builds
  • Services: simd-r-drive-ws-server, MySQL 8.0
  • Environment Variables: Database credentials, WebSocket ports
  • Volume Mounts: Test data directories

Testing Strategy:

  • Rust Unit Tests: mockito for HTTP mocking, tempfile for isolated test fixtures
  • Python Integration Tests: pytest with MySQL fixtures, WebSocket client validation
  • End-to-End: Full pipeline from CSV ingestion through ML training

Sources: Cargo.toml:42-44 High-Level System Architecture (Diagram 5), CI/CD Pipeline section

Testing Dependencies

| Language | Framework | Mocking | Assertions |
| --- | --- | --- | --- |
| Rust | Built-in #[test] | mockito 1.7.0 | Standard assertions |
| Python | pytest | unittest.mock | np.isclose() |

Rust Test Modules:

  • config_manager_tests: Configuration loading and merging
  • sec_client_tests: HTTP client behavior
  • distill_tests: US GAAP concept transformation

Python Test Modules:

  • Integration tests: End-to-end pipeline validation
  • Unit tests: Scaler, PCA, embedding generation

Sources: Cargo.lock:1802-1824 Testing Strategy documentation

Version Compatibility Matrix

Critical Version Constraints

| Component | Version | Constraints | Reason |
| --- | --- | --- | --- |
| tokio | 1.43.0 | >= 1.0 | Stable async runtime API |
| reqwest | 0.12.15 | 0.12.x | Compatible with reqwest-drive middleware |
| polars | 0.46.0 | 0.46.x | Breaking changes between minor versions |
| simd-r-drive | 0.3.0-alpha.1 | Exact match | Alpha API, unstable |
| serde | 1.0.218 | >= 1.0.100 | Derive macro stability |
| hyper | 1.6.0 | 1.x | HTTP/2 support, reqwest dependency |

Rust Toolchain

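Minimum Rust Version: 1.70+ (std::sync::OnceLock, used for the cache statics, was stabilized in 1.70; GATs have been stable since 1.65). Individual dependencies may require a newer compiler; CI builds with Rust 1.87.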

Feature Requirements:

  • async-trait for trait async methods
  • #[tokio::main] macro support
  • Generic Associated Types (GATs) for polars iterators

Sources: Cargo.toml:4 Cargo.lock:1-3

Python Version Requirements

Estimated Python Version: 3.8+ (for PyTorch Lightning compatibility)

Key Constraints:

  • PyTorch: Requires Python 3.8 or later
  • PyTorch Lightning: Requires PyTorch >= 1.9
  • asyncio: Stable async/await syntax (Python 3.7+)

Sources: High-Level System Architecture (Python technology sections)

Transitive Dependency Highlights

HTTP/TLS Stack Depth

graph LR
    reqwest["reqwest"] --> hyper
    hyper --> h2["h2\nHTTP/2"]
    hyper --> http["http\nTypes"]
    reqwest --> rustls
    reqwest --> native_tls

    rustls --> ring["ring\nCryptography"]
    native_tls --> openssl["openssl\nSystem TLS"]
    h2 --> tokio
    hyper --> tokio

Notable Transitive Dependencies:

  • h2 0.4.8: HTTP/2 protocol implementation (37 dependencies)
  • rustls 0.23.21: TLS 1.2/1.3 without OpenSSL (14 dependencies)
  • tokio-util 0.7.13: Codec, framing utilities (8 dependencies)

Sources: Cargo.lock:1141-1351

Polars Ecosystem Depth

The polars dependency pulls in 23 related crates:

  • polars-core, polars-lazy, polars-io, polars-ops
  • polars-arrow: Apache Arrow array implementation
  • polars-parquet: Parquet file format support
  • polars-sql: SQL query interface
  • polars-time: Temporal operations
  • polars-utils: Shared utilities

Total Transitive: ~180 crates in the full dependency tree

Sources: Cargo.lock:2170-2394

Security and Cryptography

Platform-Specific Cryptography:

  • macOS: Security.framework via security-framework-sys
  • Linux: D-Bus + libdbus-sys
  • Windows: Win32 Credential Manager via windows-sys

Sources: Cargo.lock:764-1653


Total Rust Dependencies: ~250 crates (including transitive)

Key Architectural Decisions:

  1. Dual TLS support for platform flexibility
  2. Polars for DataFrame operations, chosen for performance
  3. simd-r-drive for cross-language data sharing
  4. Alpha releases of simd-r-drive, accepted because the crate is under active development
  5. Platform-specific credential storage via keyring

Sources: Cargo.toml:1-45 Cargo.lock:1 High-Level System Architecture (all diagrams)