This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Dependencies & Technology Stack

This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system.

Overview

The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.

Sources: Cargo.toml:1-82 Cargo.lock:1-100

Rust Technology Stack

Core Direct Dependencies

The Rust application declares its direct dependencies in the manifest, each serving specific architectural roles:

Dependency Categories and Usage:

graph TB
    subgraph "Async Runtime & Concurrency"
        tokio["tokio 1.50.0\nFull async runtime"]
        rayon["rayon 1.11.0\nData parallelism"]
        dashmap["dashmap 6.1.0\nConcurrent hashmap"]
    end

    subgraph "HTTP & Network"
        reqwest["reqwest 0.13.2\nHTTP client"]
        reqwest_drive["reqwest-drive 0.13.2-alpha\nDrive middleware"]
    end

    subgraph "Data Processing"
        polars["polars 0.46.0\nDataFrame operations"]
        csv_crate["csv 1.4.0\nCSV parsing"]
        serde["serde 1.0.228\nSerialization"]
        serde_json["serde_json 1.0.149\nJSON support"]
        quick_xml["quick-xml 0.39.2\nXML parsing"]
    end

    subgraph "Storage & Caching"
        simd_r_drive["simd-r-drive 0.15.5-alpha\nKey-value store"]
        simd_r_drive_ext["simd-r-drive-extensions\n0.15.5-alpha"]
    end

    subgraph "Configuration & Validation"
        config_crate["config 0.15.9\nConfig management"]
        keyring["keyring 3.6.2\nCredential storage"]
        email_address["email_address 0.2.9\nEmail validation"]
        rust_decimal["rust_decimal 1.40.0\nDecimal numbers"]
        chrono["chrono 0.4.44\nDate/time handling"]
    end

    subgraph "Development Tools"
        mockito["mockito 1.7.0\nHTTP mocking"]
        tempfile["tempfile 3.27.0\nTemp file creation"]
    end

| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling Cargo.toml:79 |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration Cargo.toml:66-67 |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing Cargo.toml:61 |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing Cargo.toml:71-73 |
| Concurrency | rayon, dashmap | Parallel processing, concurrent data structures Cargo.toml:50-64 |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage Cargo.toml:74-75 |
| Configuration | config, keyring | TOML config loading, secure credential management Cargo.toml:48-59 |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps Cargo.toml:46-68 |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation Cargo.toml:45-58 |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files Cargo.toml:78-85 |

Sources: Cargo.toml:44-86

Storage and Caching System

The caching layer utilizes simd-r-drive to persist HTTP responses and preprocessed results.

Cache Architecture:

graph TB
    subgraph "simd-r-drive Ecosystem"
        simd_r_drive["simd-r-drive 0.15.5-alpha\nCore key-value store"]
        DataStore["DataStore\n(simd_r_drive::DataStore)"]
        simd_r_drive --> DataStore
    end

    subgraph "Cache Management"
        Caches["Caches struct\n(src/caches.rs)"]
        http_cache["http_cache: Arc<DataStore>"]
        pre_cache["preprocessor_cache: Arc<DataStore>"]
        Caches --> http_cache
        Caches --> pre_cache
    end

    subgraph "On-Disk Files"
        f1["http_storage_cache.bin"]
        f2["preprocessor_cache.bin"]
        http_cache -.-> f1
        pre_cache -.-> f2
    end

  • Isolation: The Caches struct manages two distinct DataStore instances src/caches.rs:11-14
  • Initialization: The Caches::open function ensures the base directory exists and opens the .bin storage files src/caches.rs:29-51
  • Thread Safety: Access to stores is provided via Arc<DataStore> clones src/caches.rs:53-59

Sources: src/caches.rs:1-60 Cargo.toml:74-75
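
The three properties above (isolation, initialization, thread safety) can be sketched with the standard library alone. This is a minimal illustration, not the contents of src/caches.rs: the `DataStore` type below is a hypothetical stand-in for `simd_r_drive::DataStore` that only creates an empty file, and the accessor names `http_cache()` and `preprocessor_cache()` are assumptions. Only the field names and the two `.bin` file names come from the diagram.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::Arc;

/// Hypothetical stand-in for `simd_r_drive::DataStore`. The real type is an
/// on-disk key-value store with its own API, which is not reproduced here.
#[derive(Debug)]
pub struct DataStore {
    path: PathBuf,
}

impl DataStore {
    /// Stand-in constructor: creates the backing file if it is missing.
    pub fn open(path: &Path) -> io::Result<Self> {
        fs::OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { path: path.to_path_buf() })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

/// Two isolated stores, mirroring the fields named in the diagram.
pub struct Caches {
    http_cache: Arc<DataStore>,
    preprocessor_cache: Arc<DataStore>,
}

impl Caches {
    /// Ensures the base directory exists, then opens both `.bin` files.
    pub fn open(base_dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(base_dir)?;
        let http = DataStore::open(&base_dir.join("http_storage_cache.bin"))?;
        let pre = DataStore::open(&base_dir.join("preprocessor_cache.bin"))?;
        Ok(Self {
            http_cache: Arc::new(http),
            preprocessor_cache: Arc::new(pre),
        })
    }

    /// Cloning the `Arc` shares the same store; it does not reopen the file.
    pub fn http_cache(&self) -> Arc<DataStore> {
        Arc::clone(&self.http_cache)
    }

    /// Same sharing semantics for the preprocessor store (assumed accessor name).
    pub fn preprocessor_cache(&self) -> Arc<DataStore> {
        Arc::clone(&self.preprocessor_cache)
    }
}
```

In this sketch, a caller would run `Caches::open` once at startup and hand `Arc` clones to worker tasks. Each clone only bumps a reference count, so all tasks see the same underlying store, while the HTTP and preprocessor data never share a file.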

Numeric Support and Precision

The system relies on rust_decimal for financial calculations where floating-point errors are unacceptable.

| Crate | Version | Key Types | Purpose |
|---|---|---|---|
| rust_decimal | 1.40.0 | Decimal | Exact decimal arithmetic for US GAAP values Cargo.toml:68 |
| chrono | 0.4.44 | NaiveDate | Date handling for filing periods Cargo.toml:46 |
| polars | 0.46.0 | DataFrame | High-performance columnar data processing Cargo.toml:61 |

Sources: Cargo.toml:46-68
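
To make the floating-point concern concrete, the sketch below adds ten-cent amounts two ways using only the standard library. The scaled-integer version is merely a stand-in for an exact decimal type such as `rust_decimal::Decimal`; it is not how the project performs its arithmetic, and the function names are invented for this example.

```rust
/// Adds `n` ten-cent amounts using binary floating point (`f64`).
/// 0.10 has no exact binary representation, so rounding error accumulates.
pub fn sum_f64(n: u32) -> f64 {
    let mut total = 0.0_f64;
    for _ in 0..n {
        total += 0.10;
    }
    total
}

/// Adds the same amounts as whole cents, so every step is exact.
/// Scaled integers are only a stand-in here for a decimal type.
pub fn sum_cents(n: u32) -> u64 {
    let mut total = 0_u64;
    for _ in 0..n {
        total += 10;
    }
    total
}

/// Renders whole cents as a decimal string with two fractional digits.
pub fn format_cents(cents: u64) -> String {
    format!("{}.{:02}", cents / 100, cents % 100)
}
```

Here `sum_f64(10)` comes out slightly below 1.0, while `format_cents(sum_cents(10))` yields exactly "1.00". An exact decimal type avoids the drift in the same way, without forcing every value onto one fixed scale.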

Python Technology Stack

The Python narrative_stack system focuses on the machine learning pipeline and data analysis.

Machine Learning Framework

The training pipeline uses PyTorch and PyTorch Lightning to build and train autoencoders on US GAAP concepts.

Preprocessing Logic:

graph TB
    subgraph "Training Pipeline"
        Stage1Autoencoder["Stage1Autoencoder\n(PyTorch Lightning Module)"]
        IterableConceptValueDataset["IterableConceptValueDataset\n(PyTorch Dataset)"]
        Stage1Autoencoder --> IterableConceptValueDataset
    end

    subgraph "Scientific Stack"
        numpy["NumPy\nArray operations"]
        pandas["pandas\nData manipulation"]
        sklearn["scikit-learn\nPreprocessing & PCA"]
    end

    subgraph "Preprocessing"
        RobustScaler["RobustScaler\n(Normalization)"]
        PCA["PCA\n(Dimensionality Reduction)"]
        sklearn --> RobustScaler
        sklearn --> PCA
    end

  • RobustScaler: Normalizes values per concept/unit pair to handle outliers in financial data.
  • PCA: Reduces the dimensionality of semantic embeddings while preserving variance.

Sources: Project architecture overview, Cargo.toml:61 (Polars/Python integration context)
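
For reference, robust scaling with scikit-learn's default RobustScaler settings (centering on the median, scaling by the range between the 25th and 75th percentiles) amounts to the transform below. Whether narrative_stack overrides those defaults is not stated on this page, so the quantile range should be read as an assumption. The subscript g, denoting one concept/unit group, is notation introduced only for this sketch.

```latex
x'_{g,i} = \frac{x_{g,i} - \operatorname{median}(x_g)}{Q_{75}(x_g) - Q_{25}(x_g)}
```

Because the median and the interquartile range shift very little when a few extreme values are present, one outsized reported value does not compress the rest of its group toward zero, as it would under mean and standard-deviation scaling.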

Database and Storage Integration

The Python stack interacts with both relational and key-value stores:

  • MySQL: Stores ingested US GAAP triplets (concept, unit, value).
  • simd-r-drive (via WebSocket): The DataStoreWsClient allows the Python stack to access the same high-performance storage used by the Rust application.

Sources: Project architecture overview.

Shared Infrastructure

graph TB
    subgraph "Rust: sec-fetcher"
        distill["distill_us_gaap_fundamental_concepts"]
        csv_out["CSV Export\n(src/bin/pulls/us_gaap_bulk.rs)"]
        distill --> csv_out
    end

    subgraph "Storage"
        shared_dir["/data/us-gaap/"]
        csv_out --> shared_dir
    end

    subgraph "Python: narrative_stack"
        ingest["Data Ingestion"]
        shared_dir --> ingest
        ingest --> model["Stage1Autoencoder"]
    end

File System Interchange

Data is passed between the Rust fetcher and the Python ML stack primarily through CSV files and shared storage.

Sources: Cargo.toml:28-34 src/caches.rs:25-51
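
As a rough picture of what a CSV hand-off can look like, the sketch below serializes records into CSV text using only the standard library. The three-column layout is hypothetical: it borrows the (concept, unit, value) triplet shape mentioned for the MySQL store, and the actual columns written by the Rust exporter are not documented on this page. The project itself depends on the csv crate; the hand-written quoting here is a substitute chosen so the example compiles without external crates.

```rust
/// Hypothetical record shape, borrowed from the (concept, unit, value)
/// triplets mentioned for the MySQL store; the real CSV columns may differ.
pub struct Triplet<'a> {
    pub concept: &'a str,
    pub unit: &'a str,
    /// Kept as text so no precision is lost between writer and reader.
    pub value: &'a str,
}

/// Quotes a field in the RFC 4180 style when it contains a comma,
/// a double quote, or a line break; embedded quotes are doubled.
pub fn csv_field(raw: &str) -> String {
    if raw.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", raw.replace('"', "\"\""))
    } else {
        raw.to_string()
    }
}

/// Builds the full CSV text: one header line, then one line per record.
pub fn to_csv(rows: &[Triplet<'_>]) -> String {
    let mut out = String::from("concept,unit,value\n");
    for row in rows {
        out.push_str(&csv_field(row.concept));
        out.push(',');
        out.push_str(&csv_field(row.unit));
        out.push(',');
        out.push_str(&csv_field(row.value));
        out.push('\n');
    }
    out
}
```

Carrying the value as text rather than as a float is a deliberate choice in this sketch: the consumer decides how to parse it, and the exact decimal digits survive the trip across the language boundary.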

Development and CI/CD Stack

GitHub Actions Workflow

The project uses GitHub Actions for continuous integration, ensuring cross-platform compatibility and code quality.

  • Rust Tests: Executes cargo test across the workspace.
  • Lints: Runs cargo clippy and cargo fmt.
  • Integration: Uses docker-compose to spin up simd-r-drive-ws-server and MySQL for end-to-end testing.

Sources: Cargo.toml:83-86 Cargo.lock:1-100

Version Compatibility Matrix

| Component | Version | Requirement |
|---|---|---|
| Rust Edition | 2024 | Cargo.toml:6 |
| Polars | 0.46.0 | Cargo.toml:61 |
| Tokio | 1.50.0 | Cargo.toml:79 |
| Reqwest | 0.13.2 | Cargo.toml:66 |

Sources: Cargo.toml:1-82
