This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Dependencies & Technology Stack
This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system.
Overview
The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.
Sources: Cargo.toml:1-82 Cargo.lock:1-100
Rust Technology Stack
Core Direct Dependencies
The Rust application declares its direct dependencies in the manifest, each serving specific architectural roles:
Dependency Categories and Usage:
```mermaid
graph TB
subgraph "Async Runtime & Concurrency"
tokio["tokio 1.50.0\nFull async runtime"]
rayon["rayon 1.11.0\nData parallelism"]
dashmap["dashmap 6.1.0\nConcurrent hashmap"]
end
subgraph "HTTP & Network"
reqwest["reqwest 0.13.2\nHTTP client"]
reqwest_drive["reqwest-drive 0.13.2-alpha\nDrive middleware"]
end
subgraph "Data Processing"
polars["polars 0.46.0\nDataFrame operations"]
csv_crate["csv 1.4.0\nCSV parsing"]
serde["serde 1.0.228\nSerialization"]
serde_json["serde_json 1.0.149\nJSON support"]
quick_xml["quick-xml 0.39.2\nXML parsing"]
end
subgraph "Storage & Caching"
simd_r_drive["simd-r-drive 0.15.5-alpha\nKey-value store"]
simd_r_drive_ext["simd-r-drive-extensions\n0.15.5-alpha"]
end
subgraph "Configuration & Validation"
config_crate["config 0.15.9\nConfig management"]
keyring["keyring 3.6.2\nCredential storage"]
email_address["email_address 0.2.9\nEmail validation"]
rust_decimal["rust_decimal 1.40.0\nDecimal numbers"]
chrono["chrono 0.4.44\nDate/time handling"]
end
subgraph "Development Tools"
mockito["mockito 1.7.0\nHTTP mocking"]
tempfile["tempfile 3.27.0\nTemp file creation"]
end
```
| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling Cargo.toml:79 |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration Cargo.toml:66-67 |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing Cargo.toml:61 |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing Cargo.toml:71-73 |
| Concurrency | rayon, dashmap | Parallel processing, concurrent data structures Cargo.toml:50-64 |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage Cargo.toml:74-75 |
| Configuration | config, keyring | TOML config loading, secure credential management Cargo.toml:48-59 |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps Cargo.toml:46-68 |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation Cargo.toml:45-58 |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files Cargo.toml:78-85 |
Sources: Cargo.toml:44-86
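The categories above correspond to declarations in the Cargo manifest. As an abridged, illustrative sketch (using only the versions cited on this page; the real Cargo.toml contains additional crates and feature flags not shown here):

```toml
[dependencies]
# Async runtime and HTTP stack
tokio = { version = "1.50.0", features = ["full"] }
reqwest = "0.13.2"

# Data processing and serialization
polars = "0.46.0"
csv = "1.4.0"
serde = { version = "1.0.228", features = ["derive"] }
serde_json = "1.0.149"

# Financial precision and time handling
rust_decimal = "1.40.0"
chrono = "0.4.44"

[dev-dependencies]
mockito = "1.7.0"
tempfile = "3.27.0"
```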
Storage and Caching System
The caching layer utilizes simd-r-drive to persist HTTP responses and preprocessed results.
Cache Architecture:
```mermaid
graph TB
subgraph "simd-r-drive Ecosystem"
simd_r_drive["simd-r-drive 0.15.5-alpha\nCore key-value store"]
DataStore["DataStore\n(simd_r_drive::DataStore)"]
simd_r_drive --> DataStore
end
subgraph "Cache Management"
Caches["Caches struct\n(src/caches.rs)"]
http_cache["http_cache: Arc<DataStore>"]
pre_cache["preprocessor_cache: Arc<DataStore>"]
Caches --> http_cache
Caches --> pre_cache
end
subgraph "On-Disk Files"
f1["http_storage_cache.bin"]
f2["preprocessor_cache.bin"]
http_cache -.-> f1
pre_cache -.-> f2
end
```
- Isolation: The Caches struct manages two distinct DataStore instances src/caches.rs:11-14
- Initialization: The Caches::open function ensures the base directory exists and opens the .bin storage files src/caches.rs:29-51
- Thread safety: Access to the stores is provided via Arc&lt;DataStore&gt; clones src/caches.rs:53-59
Sources: src/caches.rs:1-60 Cargo.toml:74-75
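The isolation pattern can be illustrated with a minimal Python analogue: one base directory holding two independent file-backed stores. This is purely a sketch of the design (the class below is not the simd-r-drive API; its method names and the store behavior are invented for illustration):

```python
import os
import pickle
import tempfile

class FileBackedStore:
    """Minimal stand-in for a persistent key-value store.

    Illustrative only -- not the simd-r-drive DataStore API.
    """
    def __init__(self, path):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self._data = pickle.load(f)

    def write(self, key, value):
        self._data[key] = value
        with open(self.path, "wb") as f:
            pickle.dump(self._data, f)

    def read(self, key):
        return self._data.get(key)

# Mirror of the Caches pattern: one base directory, two isolated stores.
base_dir = tempfile.mkdtemp()
http_cache = FileBackedStore(os.path.join(base_dir, "http_storage_cache.bin"))
preprocessor_cache = FileBackedStore(os.path.join(base_dir, "preprocessor_cache.bin"))

# A key written to one store is invisible to the other.
http_cache.write(b"https://example.test/filing", b"<response bytes>")
preprocessor_cache.write(b"concept:Revenues", b"<preprocessed bytes>")
```

In the Rust code, the same separation keeps HTTP responses and preprocessor results from colliding, while `Arc` cloning gives each consumer cheap shared access.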
Numeric Support and Precision
The system relies on rust_decimal for financial calculations where floating-point errors are unacceptable.
| Crate | Version | Key Types | Purpose |
|---|---|---|---|
| rust_decimal | 1.40.0 | Decimal | Exact decimal arithmetic for US GAAP values Cargo.toml:68 |
| chrono | 0.4.44 | NaiveDate | Date handling for filing periods Cargo.toml:46 |
| polars | 0.46.0 | DataFrame | High-performance columnar data processing Cargo.toml:61 |
Sources: Cargo.toml:46-68
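The motivation for exact decimal arithmetic is easy to demonstrate. Python's decimal module serves here as a stand-in for Rust's Decimal type; both avoid the representation error inherent to binary floating point:

```python
from decimal import Decimal

# Binary floating point accumulates representation error.
float_total = 0.1 + 0.2

# Exact decimal arithmetic, as rust_decimal provides for US GAAP values.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False: 0.1 + 0.2 is 0.30000000000000004
print(decimal_total == Decimal("0.3"))  # True: exact
```

For financial figures pulled from SEC filings, this exactness is the reason to prefer a decimal type over f64.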
Python Technology Stack
The Python narrative_stack system focuses on the machine learning pipeline and data analysis.
Machine Learning Framework
The training pipeline uses PyTorch and PyTorch Lightning to build and train autoencoders on US GAAP concepts.
Preprocessing Logic:
```mermaid
graph TB
subgraph "Training Pipeline"
Stage1Autoencoder["Stage1Autoencoder\n(PyTorch Lightning Module)"]
IterableConceptValueDataset["IterableConceptValueDataset\n(PyTorch Dataset)"]
Stage1Autoencoder --> IterableConceptValueDataset
end
subgraph "Scientific Stack"
numpy["NumPy\nArray operations"]
pandas["pandas\nData manipulation"]
sklearn["scikit-learn\nPreprocessing & PCA"]
end
subgraph "Preprocessing"
RobustScaler["RobustScaler\n(Normalization)"]
PCA["PCA\n(Dimensionality Reduction)"]
sklearn --> RobustScaler
sklearn --> PCA
end
```
- RobustScaler : Normalizes values per concept/unit pair to handle outliers in financial data.
- PCA : Reduces the dimensionality of semantic embeddings while preserving variance.
Sources: Project architecture overview, Cargo.toml:61 (Polars/Python integration context)
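RobustScaler centers values on the median and scales by the interquartile range, which bounds the influence of outliers common in financial data. A minimal pure-Python sketch of that transform (sklearn's actual implementation additionally supports configurable quantile ranges and sparse input):

```python
from statistics import median, quantiles

def robust_scale(values):
    """Scale as (x - median) / IQR, mirroring RobustScaler's defaults."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = (q3 - q1) or 1.0              # guard against zero spread
    return [(v - med) / iqr for v in values]

# Outlier-heavy sample, as financial concept values often are.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(data)
```

The median maps to 0 and the bulk of the data stays near it, while the outlier is compressed rather than dominating the scale, as it would under min-max normalization.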
Database and Storage Integration
The Python stack interacts with both relational and key-value stores:
- MySQL: Stores ingested US GAAP triplets (concept, unit, value).
- simd-r-drive (via WebSocket): The DataStoreWsClient allows the Python stack to access the same high-performance storage used by the Rust application.
Sources: Project architecture overview.
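The triplet storage can be sketched with sqlite3 as a lightweight stand-in for MySQL. Note the table and column names below are illustrative, not the actual narrative_stack schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical triplet table; values kept as text to preserve decimal precision.
conn.execute(
    """CREATE TABLE us_gaap_triplets (
           concept TEXT NOT NULL,
           unit    TEXT NOT NULL,
           value   TEXT NOT NULL
       )"""
)
rows = [
    ("Revenues", "USD", "1234567.89"),
    ("Assets", "USD", "9876543.21"),
]
conn.executemany("INSERT INTO us_gaap_triplets VALUES (?, ?, ?)", rows)

usd_count = conn.execute(
    "SELECT COUNT(*) FROM us_gaap_triplets WHERE unit = ?", ("USD",)
).fetchone()[0]
```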
Shared Infrastructure
```mermaid
graph TB
subgraph "Rust: sec-fetcher"
distill["distill_us_gaap_fundamental_concepts"]
csv_out["CSV Export\n(src/bin/pulls/us_gaap_bulk.rs)"]
distill --> csv_out
end
subgraph "Storage"
shared_dir["/data/us-gaap/"]
csv_out --> shared_dir
end
subgraph "Python: narrative_stack"
ingest["Data Ingestion"]
shared_dir --> ingest
ingest --> model["Stage1Autoencoder"]
end
```
File System Interchange
Data is passed between the Rust fetcher and the Python ML stack primarily through CSV files and shared storage.
Sources: Cargo.toml:28-34 src/caches.rs:25-51
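The interchange can be sketched as a CSV round trip using Python's stdlib. The column names below are illustrative, not the fetcher's actual export schema:

```python
import csv
import io

# Rust side (conceptually): distilled US GAAP rows exported as CSV.
rows = [
    {"concept": "Revenues", "unit": "USD", "value": "1000.50"},
    {"concept": "NetIncomeLoss", "unit": "USD", "value": "-42.00"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["concept", "unit", "value"])
writer.writeheader()
writer.writerows(rows)

# Python side: narrative_stack ingests the same file.
buf.seek(0)
ingested = list(csv.DictReader(buf))
```

Because CSV carries values as text, the ingesting side decides when to parse them into exact decimals, keeping the precision guarantees established on the Rust side.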
Development and CI/CD Stack
GitHub Actions Workflow
The project uses GitHub Actions for continuous integration, ensuring cross-platform compatibility and code quality.
- Rust Tests: Executes cargo test across the workspace.
- Lints: Runs cargo clippy and cargo fmt.
- Integration: Uses docker-compose to spin up simd-r-drive-ws-server and MySQL for end-to-end testing.
Sources: Cargo.toml:83-86 Cargo.lock:1-100
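The steps above might map to a GitHub Actions job along these lines. This is a sketch only; the workflow name, action versions, and step layout are assumptions, as the actual workflow file is not shown on this page:

```yaml
name: CI
on: [push, pull_request]

jobs:
  rust:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lints
        run: |
          cargo fmt --check
          cargo clippy --workspace
      - name: Rust Tests
        run: cargo test --workspace
      - name: Integration services
        run: docker-compose up -d  # simd-r-drive-ws-server + MySQL
```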
Version Compatibility Matrix
| Component | Version | Source |
|---|---|---|
| Rust Edition | 2024 | Cargo.toml:6 |
| Polars | 0.46.0 | Cargo.toml:61 |
| Tokio | 1.50.0 | Cargo.toml:79 |
| Reqwest | 0.13.2 | Cargo.toml:66 |
Sources: Cargo.toml:1-82