This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see Getting Started. For detailed implementation documentation of individual components, see Rust sec-fetcher Application and Python narrative_stack System.
Sources: Cargo.toml, README.md, src/lib.rs
System Purpose
The rust-sec-fetcher repository implements a dual-language financial data processing system that:
- Fetches company financial data from the SEC EDGAR API
- Transforms raw SEC filings into normalized US GAAP fundamental concepts
- Stores structured data as CSV files organized by ticker symbol
- Trains machine learning models to understand financial concept relationships
The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing 57+ variations of concepts like revenue and consolidating them into a standardized taxonomy of 64 FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
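The normalization idea can be sketched as a many-to-one lookup table. The tag names below are illustrative examples, not the crate's actual mapping table:

```python
# Many raw US GAAP tags collapse onto one standardized fundamental concept.
# Hypothetical sketch only -- the real mapping lives in the Rust transformers module.
SYNONYMS = {
    "Revenues": "Revenues",
    "SalesRevenueNet": "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "Revenues",
    "CostOfGoodsSold": "CostOfRevenue",
    "CostOfServices": "CostOfRevenue",
}

def distill(raw_tag):
    """Return the standardized concept for a raw tag, or None if unmapped."""
    return SYNONYMS.get(raw_tag)
```

With a table like this, three different revenue tags all answer to `Revenues`, which is what makes cross-company queries consistent.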
Sources: Cargo.toml:1-6, src/main.rs:173-240
Dual-Language Architecture
The repository employs a dual-language design that leverages the strengths of both Rust and Python:
| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, CSV generation | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |
Rust Layer (sec-fetcher):
- Implements `SecClient` with sophisticated throttling and caching policies
- Fetches company tickers, CIK submissions, NPORT filings, and US GAAP fundamentals
- Transforms raw financial data via `distill_us_gaap_fundamental_concepts`
- Outputs structured CSV files organized by ticker symbol

Python Layer (narrative_stack):
- Ingests CSV files generated by the Rust layer
- Generates semantic embeddings for concept/unit pairs
- Applies dimensionality reduction via PCA
- Trains `Stage1Autoencoder` models using PyTorch Lightning
Sources: Cargo.toml:1-40, src/main.rs:1-16
High-Level Data Flow

```mermaid
graph TB
    SEC["SEC EDGAR API\ncompany_tickers.json\nCIK submissions\ncompanyfacts dataset"]
    SecClient["SecClient\n(network layer)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_us_gaap_fundamentals\nfetch_nport_filing_by_ticker_symbol\nfetch_investment_company_series_and_class_dataset"]
    Distill["distill_us_gaap_fundamental_concepts\n(transformers module)\nMaps 57+ variations → 64 concepts"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\nFundamentalConcept enum"]
    CSV["File System\ndata/fund-holdings/A-Z/\ndata/us-gaap/"]
    Ingest["Python Ingestion\nus_gaap_store.ingest_us_gaap_csvs\nWalks CSV directories"]
    Preprocess["Preprocessing\nPCA dimensionality reduction\nRobustScaler normalization\nSemantic embeddings"]
    DataStore["simd-r-drive\nWebSocket Key-Value Store\nEmbedding matrix storage"]
    Training["Stage1Autoencoder\nPyTorch Lightning\nTensorBoard logging"]
    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Distill
    Distill --> Models
    Models --> CSV
    CSV -.->|CSV files| Ingest
    Ingest --> Preprocess
    Preprocess --> DataStore
    DataStore --> Training
```
Data Flow Summary:
- Fetch: `SecClient` retrieves data from SEC EDGAR API endpoints
- Transform: raw financial data passes through `distill_us_gaap_fundamental_concepts` to normalize concept names
- Store: structured data is written to CSV files, organized by the first letter of the ticker symbol
- Ingest: Python scripts walk CSV directories and parse records
- Preprocess: generate embeddings, apply PCA, normalize values
- Train: ML models learn financial concept relationships
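The steps above can be illustrated with a toy end-to-end walk-through. All function names and record shapes here are stand-ins, not the project's actual APIs:

```python
# Toy sketch of the fetch -> transform -> store portion of the pipeline.
# fetch() stands in for SecClient; transform() for distill_us_gaap_fundamental_concepts.

def fetch(ticker):
    # 1. Fetch: pretend we queried SEC EDGAR for this ticker's facts
    return [{"concept": "SalesRevenueNet", "unit": "USD", "value": 1000.0}]

def transform(rows):
    # 2. Transform: normalize raw tags into fundamental concepts
    synonyms = {"SalesRevenueNet": "Revenues"}
    return [dict(r, concept=synonyms.get(r["concept"], r["concept"])) for r in rows]

def store_path(ticker):
    # 3. Store: one CSV per ticker
    return f"data/us-gaap/{ticker.upper()}.csv"

rows = transform(fetch("AAPL"))
```

The Python ingest/preprocess/train steps then pick up from the CSV files these stages produce.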
Sources: src/main.rs:1-16, src/main.rs:173-240, src/lib.rs:1-11
Core Module Structure

```mermaid
graph TB
    main["main.rs\nApplication entry point"]
    config["config module\nConfigManager\nAppConfig"]
    network["network module\nSecClient\nfetch_* functions\nThrottlePolicy\nCachePolicy"]
    transformers["transformers module\ndistill_us_gaap_fundamental_concepts"]
    models["models module\nTicker\nCik\nCikSubmission\nNportInvestment\nAccessionNumber"]
    enums["enums module\nFundamentalConcept\nCacheNamespacePrefix\nTickerOrigin\nUrl"]
    caches["caches module (private)\nCaches struct\nOnceLock statics\nHTTP cache\nPreprocessor cache"]
    utils["utils module\ninvert_multivalue_indexmap\nis_development_mode\nis_interactive_mode\nVecExtensions"]
    fs["fs module\nFile system utilities"]
    parsers["parsers module\nData parsing functions"]
    main --> config
    main --> network
    main --> utils
    network --> config
    network --> caches
    network --> models
    network --> transformers
    network --> parsers
    transformers --> enums
    transformers --> models
    models --> enums
    caches --> config
```
Module Descriptions:
| Module | Primary Purpose | Key Exports |
|---|---|---|
| `config` | Configuration management and credential loading | `ConfigManager`, `AppConfig` |
| `network` | HTTP client, data fetching, throttling, caching | `SecClient`, `fetch_company_tickers`, `fetch_us_gaap_fundamentals`, `fetch_nport_filing_by_ticker_symbol` |
| `transformers` | US GAAP concept normalization | `distill_us_gaap_fundamental_concepts` |
| `models` | Data structures for SEC entities | `Ticker`, `Cik`, `CikSubmission`, `NportInvestment`, `AccessionNumber` |
| `enums` | Type-safe enumerations | `FundamentalConcept` (64 variants), `CacheNamespacePrefix`, `TickerOrigin`, `Url` |
| `caches` | Internal caching infrastructure | `Caches` (private module) |
| `utils` | Utility functions | `invert_multivalue_indexmap`, `VecExtensions`, `is_development_mode` |
Sources: src/lib.rs:1-11, src/main.rs:1-16, src/utils.rs:1-12
Rust Component Architecture
The Rust layer is organized around SecClient, which provides a high-level HTTP interface with integrated throttling and caching. Network functions (fetch_*) use this client to retrieve data from SEC EDGAR endpoints. The most critical component is distill_us_gaap_fundamental_concepts, which normalizes financial concepts using four mapping patterns:
- One-to-One: direct mappings (e.g., `Assets` → `FundamentalConcept::Assets`)
- Hierarchical: specific concepts also map to parent categories (e.g., `CurrentAssets` maps to both `CurrentAssets` and `Assets`)
- Synonym Consolidation: multiple terms map to a single concept (e.g., 6 cost variations → `CostOfRevenue`)
- Industry-Specific: 57+ revenue variations map to `Revenues`
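Because of the hierarchical pattern, one raw tag can fan out to several concepts, so the mapping is naturally a tag-to-list lookup. A hypothetical sketch (illustrative tags, not the crate's real table):

```python
# The four mapping patterns as a raw-tag -> [concepts] table.
MAPPING = {
    "Assets": ["Assets"],                          # one-to-one
    "AssetsCurrent": ["CurrentAssets", "Assets"],  # hierarchical: parent included
    "CostOfGoodsSold": ["CostOfRevenue"],          # synonym consolidation
    "SalesRevenueNet": ["Revenues"],               # industry-specific variation
}

def distill_concepts(raw_tag):
    """Return every fundamental concept a raw tag maps to (possibly empty)."""
    return MAPPING.get(raw_tag, [])
```

Returning a list rather than a single concept is what lets a query for `Assets` also pick up rows originally tagged `AssetsCurrent`.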
Sources: src/main.rs:1-16, Cargo.toml:8-40
Python Component Architecture
The Python narrative_stack system consumes CSV files produced by the Rust layer and trains machine learning models:
Key Components:
| Component | Module/Class | Purpose |
|---|---|---|
| Ingestion | `us_gaap_store.ingest_us_gaap_csvs` | Walks CSV directories, parses `UsGaapRowRecord` entries |
| Preprocessing | `PCA`, `RobustScaler` | Generates semantic embeddings, normalizes values, reduces dimensionality |
| Storage | `DbUsGaap`, `DataStoreWsClient`, `UsGaapStore` | Database interface, WebSocket client, unified data access facade |
| Training | `Stage1Autoencoder` | PyTorch Lightning autoencoder for learning concept embeddings |
| Validation | `np.isclose` checks | Scaler verification, embedding validation |
The preprocessing pipeline creates concept/unit/value triplets with associated embeddings and scalers, storing them in both MySQL and the simd-r-drive WebSocket data store. The Stage1Autoencoder learns latent representations by reconstructing embedding + scaled_value inputs.
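The value-normalization half of that pipeline can be sketched with the standard library alone. This mimics the idea behind scikit-learn's `RobustScaler` (center on the median, divide by the IQR); it is a simplified stand-in, not the project's actual preprocessing code:

```python
import statistics

def robust_scale(values):
    """Median-center and IQR-scale a list of raw financial values,
    the same idea as scikit-learn's RobustScaler (sketch only)."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = (q3 - q1) or 1.0                         # guard against zero IQR
    return [(v - med) / iqr for v in values]
```

Median/IQR scaling is less sensitive to outliers than mean/variance scaling, which matters for financial values that span many orders of magnitude.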
Sources: (Python code not included in provided files, based on architecture diagrams)
Key Technologies
Rust Dependencies:
| Crate | Purpose |
|---|---|
tokio | Async runtime for I/O operations |
reqwest | HTTP client library |
polars | DataFrame operations and CSV handling |
rayon | CPU parallelism |
simd-r-drive | WebSocket data store integration |
serde, serde_json | Serialization/deserialization |
chrono | Date/time handling |
Python Dependencies:
- PyTorch & PyTorch Lightning (ML training)
- pandas, numpy (data processing)
- scikit-learn (PCA, RobustScaler)
- matplotlib, TensorBoard (visualization)
Sources: Cargo.toml:8-40
CSV Output Organization
The Rust application writes CSV files to organized directories:
```
data/
├── fund-holdings/
│   ├── A/
│   │   ├── AAPL.csv
│   │   ├── AMZN.csv
│   │   └── ...
│   ├── B/
│   ├── C/
│   └── ...
└── us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── ...
```
Fund-holdings files are bucketed into subdirectories by the first letter of the ticker symbol (uppercased), which facilitates parallel processing and improves file system performance with large datasets.
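The bucketing rule amounts to a short path helper. A hypothetical sketch (the helper name and signature are illustrative, not the Rust application's code):

```python
from pathlib import Path

def fund_holdings_path(base, ticker):
    """Route a fund-holdings CSV under its ticker's uppercased first letter."""
    t = ticker.upper()
    return Path(base) / "fund-holdings" / t[0] / f"{t}.csv"
```

Keeping each letter bucket small means no single directory grows unbounded, and independent buckets can be processed in parallel.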
Sources: src/main.rs:83-102, src/main.rs:206-221
Next Steps
- For installation and configuration, see Getting Started
- For Rust implementation details, see Rust sec-fetcher Application
- For Python ML pipeline details, see Python narrative_stack System
- For development guidelines, see Development Guide
- For comprehensive dependency documentation, see Dependencies & Technology Stack
Sources: src/main.rs:1-240, Cargo.toml:1-45, src/lib.rs:1-11