This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see Getting Started. For detailed implementation documentation of individual components, see Rust sec-fetcher Application and Python narrative_stack System.
Sources: Cargo.toml, README.md, src/lib.rs
System Purpose
The rust-sec-fetcher repository implements a dual-language financial data processing system that:
- Fetches company financial data from the SEC EDGAR API
- Transforms raw SEC filings into normalized US GAAP fundamental concepts
- Stores structured data as CSV files organized by ticker symbol
- Trains machine learning models to understand financial concept relationships
The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing 57+ variations of concepts like revenue and consolidating them into a standardized taxonomy of 64 FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
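The normalization idea can be sketched as a many-to-one lookup table. The tag names below are illustrative examples, not the crate's actual mapping table:

```python
# Many raw US GAAP tags collapse onto one standardized fundamental concept.
# Hypothetical sketch only -- the real mapping lives in the Rust transformers module.
SYNONYMS = {
    "Revenues": "Revenues",
    "SalesRevenueNet": "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "Revenues",
    "CostOfGoodsSold": "CostOfRevenue",
    "CostOfServices": "CostOfRevenue",
}

def distill(raw_tag):
    """Return the standardized concept for a raw tag, or None if unmapped."""
    return SYNONYMS.get(raw_tag)
```

With a table like this, three different revenue tags all answer to `Revenues`, which is what makes cross-company queries consistent.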
Sources: Cargo.toml:1-6, src/main.rs:173-240
Dual-Language Architecture
The repository employs a dual-language design that leverages the strengths of both Rust and Python:
| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, CSV generation | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |
Rust Layer (sec-fetcher):
- Implements `SecClient` with sophisticated throttling and caching policies
- Fetches company tickers, CIK submissions, NPORT filings, and US GAAP fundamentals
- Transforms raw financial data via `distill_us_gaap_fundamental_concepts`
- Outputs structured CSV files organized by ticker symbol

Python Layer (narrative_stack):
- Ingests CSV files generated by the Rust layer
- Generates semantic embeddings for concept/unit pairs
- Applies dimensionality reduction via PCA
- Trains `Stage1Autoencoder` models using PyTorch Lightning
Sources: Cargo.toml:1-40, src/main.rs:1-16
High-Level Data Flow

```mermaid
graph TB
    SEC["SEC EDGAR API\ncompany_tickers.json\nCIK submissions\ncompanyfacts dataset"]
    SecClient["SecClient\n(network layer)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_us_gaap_fundamentals\nfetch_nport_filing_by_ticker_symbol\nfetch_investment_company_series_and_class_dataset"]
    Distill["distill_us_gaap_fundamental_concepts\n(transformers module)\nMaps 57+ variations → 64 concepts"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\nFundamentalConcept enum"]
    CSV["File System\ndata/fund-holdings/A-Z/\ndata/us-gaap/"]
    Ingest["Python Ingestion\nus_gaap_store.ingest_us_gaap_csvs\nWalks CSV directories"]
    Preprocess["Preprocessing\nPCA dimensionality reduction\nRobustScaler normalization\nSemantic embeddings"]
    DataStore["simd-r-drive\nWebSocket Key-Value Store\nEmbedding matrix storage"]
    Training["Stage1Autoencoder\nPyTorch Lightning\nTensorBoard logging"]
    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Distill
    Distill --> Models
    Models --> CSV
    CSV -.->|CSV files| Ingest
    Ingest --> Preprocess
    Preprocess --> DataStore
    DataStore --> Training
```
Data Flow Summary:
- Fetch: `SecClient` retrieves data from SEC EDGAR API endpoints
- Transform: raw financial data passes through `distill_us_gaap_fundamental_concepts` to normalize concept names
- Store: structured data is written to CSV files, organized by the first letter of the ticker symbol
- Ingest: Python scripts walk CSV directories and parse records
- Preprocess: generate embeddings, apply PCA, normalize values
- Train: ML models learn financial concept relationships
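The steps above can be illustrated with a toy end-to-end walk-through. All function names and record shapes here are stand-ins, not the project's actual APIs:

```python
# Toy sketch of the fetch -> transform -> store portion of the pipeline.
# fetch() stands in for SecClient; transform() for distill_us_gaap_fundamental_concepts.

def fetch(ticker):
    # 1. Fetch: pretend we queried SEC EDGAR for this ticker's facts
    return [{"concept": "SalesRevenueNet", "unit": "USD", "value": 1000.0}]

def transform(rows):
    # 2. Transform: normalize raw tags into fundamental concepts
    synonyms = {"SalesRevenueNet": "Revenues"}
    return [dict(r, concept=synonyms.get(r["concept"], r["concept"])) for r in rows]

def store_path(ticker):
    # 3. Store: one CSV per ticker
    return f"data/us-gaap/{ticker.upper()}.csv"

rows = transform(fetch("AAPL"))
```

The Python ingest/preprocess/train steps then pick up from the CSV files these stages produce.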
Sources: src/main.rs:1-16, src/main.rs:173-240, src/lib.rs:1-11
Core Module Structure

```mermaid
graph TB
    main["main.rs\nApplication entry point"]
    config["config module\nConfigManager\nAppConfig"]
    network["network module\nSecClient\nfetch_* functions\nThrottlePolicy\nCachePolicy"]
    transformers["transformers module\ndistill_us_gaap_fundamental_concepts"]
    models["models module\nTicker\nCik\nCikSubmission\nNportInvestment\nAccessionNumber"]
    enums["enums module\nFundamentalConcept\nCacheNamespacePrefix\nTickerOrigin\nUrl"]
    caches["caches module (private)\nCaches struct\nOnceLock statics\nHTTP cache\nPreprocessor cache"]
    utils["utils module\ninvert_multivalue_indexmap\nis_development_mode\nis_interactive_mode\nVecExtensions"]
    fs["fs module\nFile system utilities"]
    parsers["parsers module\nData parsing functions"]
    main --> config
    main --> network
    main --> utils
    network --> config
    network --> caches
    network --> models
    network --> transformers
    network --> parsers
    transformers --> enums
    transformers --> models
    models --> enums
    caches --> config
```
Module Descriptions:
| Module | Primary Purpose | Key Exports |
|---|---|---|
| `config` | Configuration management and credential loading | `ConfigManager`, `AppConfig` |
| `network` | HTTP client, data fetching, throttling, caching | `SecClient`, `fetch_company_tickers`, `fetch_us_gaap_fundamentals`, `fetch_nport_filing_by_ticker_symbol` |
| `transformers` | US GAAP concept normalization | `distill_us_gaap_fundamental_concepts` |
| `models` | Data structures for SEC entities | `Ticker`, `Cik`, `CikSubmission`, `NportInvestment`, `AccessionNumber` |
| `enums` | Type-safe enumerations | `FundamentalConcept` (64 variants), `CacheNamespacePrefix`, `TickerOrigin`, `Url` |
| `caches` | Internal caching infrastructure | `Caches` (private module) |
| `utils` | Utility functions | `invert_multivalue_indexmap`, `VecExtensions`, `is_development_mode` |
Sources: src/lib.rs:1-11, src/main.rs:1-16, src/utils.rs:1-12
Rust Component Architecture
The Rust layer is organized around SecClient, which provides a high-level HTTP interface with integrated throttling and caching. Network functions (fetch_*) use this client to retrieve data from SEC EDGAR endpoints. The most critical component is distill_us_gaap_fundamental_concepts, which normalizes financial concepts using four mapping patterns:
- One-to-One: direct mappings (e.g., `Assets` → `FundamentalConcept::Assets`)
- Hierarchical: specific concepts also map to parent categories (e.g., `CurrentAssets` maps to both `CurrentAssets` and `Assets`)
- Synonym Consolidation: multiple terms map to a single concept (e.g., 6 cost variations → `CostOfRevenue`)
- Industry-Specific: 57+ revenue variations map to `Revenues`
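Because of the hierarchical pattern, one raw tag can fan out to several concepts, so the mapping is naturally a tag-to-list lookup. A hypothetical sketch (illustrative tags, not the crate's real table):

```python
# The four mapping patterns as a raw-tag -> [concepts] table.
MAPPING = {
    "Assets": ["Assets"],                          # one-to-one
    "AssetsCurrent": ["CurrentAssets", "Assets"],  # hierarchical: parent included
    "CostOfGoodsSold": ["CostOfRevenue"],          # synonym consolidation
    "SalesRevenueNet": ["Revenues"],               # industry-specific variation
}

def distill_concepts(raw_tag):
    """Return every fundamental concept a raw tag maps to (possibly empty)."""
    return MAPPING.get(raw_tag, [])
```

Returning a list rather than a single concept is what lets a query for `Assets` also pick up rows originally tagged `AssetsCurrent`.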
Sources: src/main.rs:1-16, Cargo.toml:8-40
Python Component Architecture
The Python narrative_stack system consumes CSV files produced by the Rust layer and trains machine learning models:
Key Components:
| Component | Module/Class | Purpose |
|---|---|---|
| Ingestion | `us_gaap_store.ingest_us_gaap_csvs` | Walks CSV directories, parses `UsGaapRowRecord` entries |
| Preprocessing | `PCA`, `RobustScaler` | Generates semantic embeddings, normalizes values, reduces dimensionality |
| Storage | `DbUsGaap`, `DataStoreWsClient`, `UsGaapStore` | Database interface, WebSocket client, unified data access facade |
| Training | `Stage1Autoencoder` | PyTorch Lightning autoencoder for learning concept embeddings |
| Validation | `np.isclose` checks | Scaler verification, embedding validation |
The preprocessing pipeline creates concept/unit/value triplets with associated embeddings and scalers, storing them in both MySQL and the simd-r-drive WebSocket data store. The Stage1Autoencoder learns latent representations by reconstructing embedding + scaled_value inputs.
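The value-normalization half of that pipeline can be sketched with the standard library alone. This mimics the idea behind scikit-learn's `RobustScaler` (center on the median, divide by the IQR); it is a simplified stand-in, not the project's actual preprocessing code:

```python
import statistics

def robust_scale(values):
    """Median-center and IQR-scale a list of raw financial values,
    the same idea as scikit-learn's RobustScaler (sketch only)."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = (q3 - q1) or 1.0                         # guard against zero IQR
    return [(v - med) / iqr for v in values]
```

Median/IQR scaling is less sensitive to outliers than mean/variance scaling, which matters for financial values that span many orders of magnitude.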
Sources: (Python code not included in provided files, based on architecture diagrams)
Key Technologies
Rust Dependencies:
| Crate | Purpose |
|---|---|
tokio | Async runtime for I/O operations |
reqwest | HTTP client library |
polars | DataFrame operations and CSV handling |
rayon | CPU parallelism |
simd-r-drive | WebSocket data store integration |
serde, serde_json | Serialization/deserialization |
chrono | Date/time handling |
Python Dependencies:
- PyTorch & PyTorch Lightning (ML training)
- pandas, numpy (data processing)
- scikit-learn (PCA, RobustScaler)
- matplotlib, TensorBoard (visualization)
Sources: Cargo.toml:8-40
CSV Output Organization
The Rust application writes CSV files to organized directories:
```
data/
├── fund-holdings/
│   ├── A/
│   │   ├── AAPL.csv
│   │   ├── AMZN.csv
│   │   └── ...
│   ├── B/
│   ├── C/
│   └── ...
└── us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── ...
```
Fund-holdings files are bucketed into subdirectories by the first letter of the ticker symbol (uppercased), which facilitates parallel processing and improves file system performance with large datasets.
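The bucketing rule amounts to a short path helper. A hypothetical sketch (the helper name and signature are illustrative, not the Rust application's code):

```python
from pathlib import Path

def fund_holdings_path(base, ticker):
    """Route a fund-holdings CSV under its ticker's uppercased first letter."""
    t = ticker.upper()
    return Path(base) / "fund-holdings" / t[0] / f"{t}.csv"
```

Keeping each letter bucket small means no single directory grows unbounded, and independent buckets can be processed in parallel.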
Sources: src/main.rs:83-102, src/main.rs:206-221
Next Steps
- For installation and configuration, see Getting Started
- For Rust implementation details, see Rust sec-fetcher Application
- For Python ML pipeline details, see Python narrative_stack System
- For development guidelines, see Development Guide
- For comprehensive dependency documentation, see Dependencies & Technology Stack
Sources: src/main.rs:1-240, Cargo.toml:1-45, src/lib.rs:1-11