This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Relevant source files
This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see Getting Started. For detailed implementation documentation of individual components, see Rust sec-fetcher Application and Python narrative_stack System.
Sources: Cargo.toml, README.md, src/lib.rs
System Purpose
The rust-sec-fetcher repository implements a dual-language financial data processing system that:
- Fetches company financial data from the SEC EDGAR API
- Transforms raw SEC filings into normalized US GAAP fundamental concepts
- Stores structured data as CSV files organized by ticker symbol
- Trains machine learning models to understand financial concept relationships
The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing 57+ variations of concepts like revenue and consolidating them into a standardized taxonomy of 64 FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
Sources: Cargo.toml:1-6, src/main.rs:173-240
Dual-Language Architecture
The repository employs a dual-language design that leverages the strengths of both Rust and Python:
| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, CSV generation | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |
Rust Layer (sec-fetcher):
- Implements `SecClient` with sophisticated throttling and caching policies
- Fetches company tickers, CIK submissions, NPORT filings, and US GAAP fundamentals
- Transforms raw financial data via `distill_us_gaap_fundamental_concepts`
- Outputs structured CSV files organized by ticker symbol
Python Layer (narrative_stack):
- Ingests CSV files generated by Rust layer
- Generates semantic embeddings for concept/unit pairs
- Applies dimensionality reduction via PCA
- Trains `Stage1Autoencoder` models using PyTorch Lightning
Sources: Cargo.toml:1-40, src/main.rs:1-16
graph TB
SEC["SEC EDGAR API\ncompany_tickers.json\nCIK submissions\ncompanyfacts dataset"]
SecClient["SecClient\n(network layer)"]
NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_us_gaap_fundamentals\nfetch_nport_filing_by_ticker_symbol\nfetch_investment_company_series_and_class_dataset"]
Distill["distill_us_gaap_fundamental_concepts\n(transformers module)\nMaps 57+ variations → 64 concepts"]
Models["Data Models\nTicker, CikSubmission\nNportInvestment\nFundamentalConcept enum"]
CSV["File System\ndata/fund-holdings/A-Z/\ndata/us-gaap/"]
Ingest["Python Ingestion\nus_gaap_store.ingest_us_gaap_csvs\nWalks CSV directories"]
Preprocess["Preprocessing\nPCA dimensionality reduction\nRobustScaler normalization\nSemantic embeddings"]
DataStore["simd-r-drive\nWebSocket Key-Value Store\nEmbedding matrix storage"]
Training["Stage1Autoencoder\nPyTorch Lightning\nTensorBoard logging"]
SEC -->|HTTP GET| SecClient
SecClient --> NetworkFuncs
NetworkFuncs --> Distill
Distill --> Models
Models --> CSV
CSV -.->|CSV files| Ingest
Ingest --> Preprocess
Preprocess --> DataStore
DataStore --> Training
High-Level Data Flow
Data Flow Summary:
- Fetch: `SecClient` retrieves data from SEC EDGAR API endpoints
- Transform: Raw financial data passes through `distill_us_gaap_fundamental_concepts` to normalize concept names
- Store: Structured data is written to CSV files, organized by the first letter of the ticker symbol
- Ingest: Python scripts walk the CSV directories and parse records
- Preprocess: Generate embeddings, apply PCA, normalize values
- Train: ML models learn financial concept relationships
Sources: src/main.rs:1-16, src/main.rs:173-240, src/lib.rs:1-11
graph TB
main["main.rs\nApplication entry point"]
config["config module\nConfigManager\nAppConfig"]
network["network module\nSecClient\nfetch_* functions\nThrottlePolicy\nCachePolicy"]
transformers["transformers module\ndistill_us_gaap_fundamental_concepts"]
models["models module\nTicker\nCik\nCikSubmission\nNportInvestment\nAccessionNumber"]
enums["enums module\nFundamentalConcept\nCacheNamespacePrefix\nTickerOrigin\nUrl"]
caches["caches module (private)\nCaches struct\nOnceLock statics\nHTTP cache\nPreprocessor cache"]
utils["utils module\ninvert_multivalue_indexmap\nis_development_mode\nis_interactive_mode\nVecExtensions"]
fs["fs module\nFile system utilities"]
parsers["parsers module\nData parsing functions"]
main --> config
main --> network
main --> utils
network --> config
network --> caches
network --> models
network --> transformers
network --> parsers
transformers --> enums
transformers --> models
models --> enums
caches --> config
Core Module Structure
Module Descriptions:
| Module | Primary Purpose | Key Exports |
|---|---|---|
config | Configuration management and credential loading | ConfigManager, AppConfig |
network | HTTP client, data fetching, throttling, caching | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals, fetch_nport_filing_by_ticker_symbol |
transformers | US GAAP concept normalization (importance: 8.37) | distill_us_gaap_fundamental_concepts |
models | Data structures for SEC entities | Ticker, Cik, CikSubmission, NportInvestment, AccessionNumber |
enums | Type-safe enumerations | FundamentalConcept (64 variants), CacheNamespacePrefix, TickerOrigin, Url |
caches | Internal caching infrastructure | Caches (private module) |
utils | Utility functions | invert_multivalue_indexmap, VecExtensions, is_development_mode |
Sources: src/lib.rs:1-11, src/main.rs:1-16, src/utils.rs:1-12
Rust Component Architecture
The Rust layer is organized around SecClient, which provides a high-level HTTP interface with integrated throttling and caching. Network functions (fetch_*) use this client to retrieve data from SEC EDGAR endpoints. The most critical component is distill_us_gaap_fundamental_concepts, which normalizes financial concepts using four mapping patterns:
- One-to-One: Direct mappings (e.g., `Assets` → `FundamentalConcept::Assets`)
- Hierarchical: Specific concepts also map to parent categories (e.g., `CurrentAssets` maps to both `CurrentAssets` and `Assets`)
- Synonym Consolidation: Multiple terms map to a single concept (e.g., 6 cost variations → `CostOfRevenue`)
- Industry-Specific: 57+ revenue variations map to `Revenues`
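A minimal sketch of what such a normalization lookup can look like. The enum variants and tag strings below are illustrative only; the real function lives in the `transformers` module and covers the full 64-variant taxonomy.

```rust
#[derive(Debug, PartialEq)]
enum FundamentalConcept {
    Assets,
    CurrentAssets,
    CostOfRevenue,
    Revenues,
}

// Maps a raw US GAAP tag to zero or more normalized concepts, including
// hierarchical parents (e.g., CurrentAssets also maps to Assets).
fn distill_us_gaap_fundamental_concepts(tag: &str) -> Vec<FundamentalConcept> {
    match tag {
        "Assets" => vec![FundamentalConcept::Assets],
        "CurrentAssets" => vec![
            FundamentalConcept::CurrentAssets,
            FundamentalConcept::Assets,
        ],
        "CostOfGoodsSold" | "CostOfServices" => vec![FundamentalConcept::CostOfRevenue],
        "SalesRevenueNet" | "RevenueFromContractWithCustomer" => {
            vec![FundamentalConcept::Revenues]
        }
        _ => vec![],
    }
}

fn main() {
    assert_eq!(
        distill_us_gaap_fundamental_concepts("CurrentAssets"),
        vec![FundamentalConcept::CurrentAssets, FundamentalConcept::Assets]
    );
}
```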
Sources: src/main.rs:1-16, Cargo.toml:8-40
Python Component Architecture
The Python narrative_stack system consumes CSV files produced by the Rust layer and trains machine learning models:
Key Components:
| Component | Module/Class | Purpose |
|---|---|---|
| Ingestion | us_gaap_store.ingest_us_gaap_csvs | Walks CSV directories, parses UsGaapRowRecord entries |
| Preprocessing | PCA, RobustScaler | Generates semantic embeddings, normalizes values, reduces dimensionality |
| Storage | DbUsGaap, DataStoreWsClient, UsGaapStore | Database interface, WebSocket client, unified data access facade |
| Training | Stage1Autoencoder | PyTorch Lightning autoencoder for learning concept embeddings |
| Validation | np.isclose checks | Scaler verification, embedding validation |
The preprocessing pipeline creates concept/unit/value triplets with associated embeddings and scalers, storing them in both MySQL and the simd-r-drive WebSocket data store. The Stage1Autoencoder learns latent representations by reconstructing embedding + scaled_value inputs.
Sources: (Python code not included in provided files, based on architecture diagrams)
Key Technologies
Rust Dependencies:
| Crate | Purpose |
|---|---|
tokio | Async runtime for I/O operations |
reqwest | HTTP client library |
polars | DataFrame operations and CSV handling |
rayon | CPU parallelism |
simd-r-drive | WebSocket data store integration |
serde, serde_json | Serialization/deserialization |
chrono | Date/time handling |
Python Dependencies:
- PyTorch & PyTorch Lightning (ML training)
- pandas, numpy (data processing)
- scikit-learn (PCA, RobustScaler)
- matplotlib, TensorBoard (visualization)
Sources: Cargo.toml:8-40
CSV Output Organization
The Rust application writes CSV files to organized directories:
data/
├── fund-holdings/
│ ├── A/
│ │ ├── AAPL.csv
│ │ ├── AMZN.csv
│ │ └── ...
│ ├── B/
│ ├── C/
│ └── ...
└── us-gaap/
├── AAPL.csv
├── MSFT.csv
└── ...
Files are organized by the first letter of the ticker symbol (uppercased) to facilitate parallel processing and improve file system performance with large datasets.
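A small sketch of the documented layout rule (the helper name is hypothetical):

```rust
use std::path::PathBuf;

// Groups fund-holdings CSVs under data/fund-holdings/{first letter of ticker, uppercased}/.
fn fund_holdings_path(base: &str, ticker: &str) -> PathBuf {
    let letter = ticker
        .chars()
        .next()
        .map(|c| c.to_ascii_uppercase())
        .unwrap_or('_');
    PathBuf::from(base)
        .join("fund-holdings")
        .join(letter.to_string())
        .join(format!("{ticker}.csv"))
}

fn main() {
    // Prints data/fund-holdings/A/AAPL.csv (path separator depends on the OS)
    println!("{}", fund_holdings_path("data", "AAPL").display());
}
```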
Sources: src/main.rs:83-102, src/main.rs:206-221
Next Steps
- For installation and configuration, see Getting Started
- For Rust implementation details, see Rust sec-fetcher Application
- For Python ML pipeline details, see Python narrative_stack System
- For development guidelines, see Development Guide
- For comprehensive dependency documentation, see Dependencies & Technology Stack
Sources: src/main.rs:1-240, Cargo.toml:1-45, src/lib.rs:1-11
Getting Started
Relevant source files
This page guides you through installing, configuring, and running the rust-sec-fetcher application. It covers building the Rust binary, setting up required credentials, and executing your first data fetch. For detailed configuration options, see Configuration System. For comprehensive examples, see Running Examples.
The rust-sec-fetcher is the Rust component of a dual-language system. It fetches and transforms SEC financial data into structured CSV files. The companion Python system (narrative_stack) processes these files for machine learning applications.
Prerequisites
Before installation, ensure you have:
| Requirement | Purpose | Notes |
|---|---|---|
| Rust 1.87+ | Compile sec-fetcher | Edition 2021 features required |
| Email Address | SEC EDGAR API access | Required by SEC for API identification |
| 4+ GB Disk Space | Cache and CSV storage | Default location: data/ directory |
| Internet Connection | SEC API access | Throttled to 1 request/second |
Optional Components:
- Python 3.8+ for ML pipeline (narrative_stack)
- Docker for simd-r-drive WebSocket server (Docker Deployment)
- MySQL for US GAAP data storage (Database Integration)
Sources: Cargo.toml:1-45
Installation
Clone Repository
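The repository URL is not included in this extract; substitute the project's actual GitHub location:

```sh
git clone https://github.com/<owner>/rust-sec-fetcher.git
cd rust-sec-fetcher
```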
Build from Source
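Build with Cargo; the release profile is recommended for full fetch runs:

```sh
# Debug build
cargo build

# Optimized release build
cargo build --release
```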
The compiled binary will be located at:
- Debug: `target/debug/sec-fetcher`
- Release: `target/release/sec-fetcher`
Verify Installation
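Run the bundled configuration example to confirm the build and configuration work:

```sh
cargo run --example config
```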
If successful, this will load the configuration and display it in JSON format. If no configuration exists, it will prompt for your email address in interactive mode.
Installation Flow Diagram
graph TB
Clone["Clone Repository\nrust-sec-fetcher"]
Build["cargo build --release"]
Binary["Binary Created\ntarget/release/sec-fetcher"]
Config["Configuration Setup\nConfigManager::load()"]
Verify["Run Example\ncargo run --example config"]
Clone --> Build
Build --> Binary
Binary --> Config
Config --> Verify
ConfigFile["Configuration File\nsec_fetcher_config.toml"]
Credential["Email Credential\nCredentialManager"]
Config --> ConfigFile
Config --> Credential
Verify --> Success["Display AppConfig\nJSON Output"]
Verify --> Error["Missing Email\nPrompt in Interactive Mode"]
Sources: Cargo.toml:1-6 src/config/config_manager.rs:20-23 examples/config.rs:1-17
Basic Configuration
The application uses a TOML configuration file combined with system credential storage for the required email address.
Configuration File Location
The ConfigManager searches for configuration files in this order:
1. System Config Directory: Platform-specific location returned by `ConfigManager::get_suggested_system_path()`
   - Linux: `~/.config/sec-fetcher/config.toml`
   - macOS: `~/Library/Application Support/sec-fetcher/config.toml`
   - Windows: `C:\Users\<User>\AppData\Roaming\sec-fetcher\config.toml`
2. Current Directory: `sec_fetcher_config.toml` (fallback)
Configuration Fields
The AppConfig structure src/config/app_config.rs:15-32 supports the following fields:
| Field | Type | Default | Description |
|---|---|---|---|
email | Option<String> | None | Required - Your email for SEC API identification |
max_concurrent | Option<usize> | 1 | Maximum concurrent requests |
min_delay_ms | Option<u64> | 1000 | Minimum delay between requests (milliseconds) |
max_retries | Option<usize> | 5 | Maximum retry attempts for failed requests |
cache_base_dir | Option<PathBuf> | "data" | Base directory for caching and CSV output |
Example Configuration File
Create sec_fetcher_config.toml:
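A minimal example using the field names from the table above (only `email` is strictly required; the other values mirror the documented defaults):

```toml
email = "you@example.com"
max_concurrent = 1
min_delay_ms = 1000
max_retries = 5
cache_base_dir = "data"
```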
Email Credential Setup
The SEC EDGAR API requires an email address in the User-Agent header. The application manages this through the CredentialManager:
Interactive Mode (when running from a terminal): the application prompts for your email address and can store it in the system credential manager for future runs.
Non-Interactive Mode (CI/CD, background processes):
- Email must be pre-configured in `sec_fetcher_config.toml`
- Or stored in the system credential manager via a prior interactive session
Configuration Loading Flow Diagram
graph TB
Start["ConfigManager::load()"]
PathCheck{"Config Path\nExists?"}
LoadFile["Config::builder()\nadd_source(File)"]
DefaultConfig["AppConfig::default()"]
MergeUser["settings.merge(user_settings)"]
EmailCheck{"Email\nConfigured?"}
InteractiveCheck{"is_interactive_mode()?"}
Prompt["CredentialManager::from_prompt()"]
KeyringGet["credential_manager.get_credential()"]
Error["Error: Could not obtain email"]
InitCaches["Caches::init(config_manager)"]
Complete["ConfigManager Instance"]
Start --> PathCheck
PathCheck -->|Yes| LoadFile
PathCheck -->|No Fallback| LoadFile
LoadFile --> DefaultConfig
DefaultConfig --> MergeUser
MergeUser --> EmailCheck
EmailCheck -->|Missing| InteractiveCheck
EmailCheck -->|Present| InitCaches
InteractiveCheck -->|Yes| Prompt
InteractiveCheck -->|No| Error
Prompt --> KeyringGet
KeyringGet -->|Success| InitCaches
KeyringGet -->|Failure| Error
InitCaches --> Complete
Sources: src/config/config_manager.rs:20-86 src/config/app_config.rs:15-54 Cargo.toml:20
Running Your First Data Fetch
Example: Configuration Display
The simplest example displays the loaded configuration:
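```sh
cargo run --example config
```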
Code Structure examples/config.rs:1-17:
- `ConfigManager::load()` - Loads configuration from file + credentials
- `config_manager.get_config()` - Retrieves an `AppConfig` reference
- `config.pretty_print()` - Serializes to formatted JSON
Expected Output:
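Illustrative only; the actual values reflect your configuration file and defaults:

```json
{
  "email": "you@example.com",
  "max_concurrent": 1,
  "min_delay_ms": 1000,
  "max_retries": 5,
  "cache_base_dir": "data"
}
```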
Example: Lookup CIK by Ticker
Fetch the Central Index Key (CIK) for a company ticker symbol:
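Run it with a ticker symbol as the single argument (standard Cargo invocation shown):

```sh
cargo run --example lookup_cik -- AAPL
```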
This example demonstrates:
- `SecClient` initialization with throttling
- `fetch_company_tickers()` - Downloads the SEC company tickers JSON
- `fetch_cik_by_ticker_symbol()` - Maps ticker → CIK
- Caching behavior (subsequent runs use cached data)
Example: Fetch NPORT Filing
Download and parse an NPORT-P investment company filing:
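Run it with an NPORT-P accession number (placeholder shown; the invocation form is an assumption):

```sh
cargo run --example nport_filing -- <accession-number>
```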
This example shows:
- Fetching XML filing by accession number
- Parsing `NportInvestment` data structures
- CSV output to `data/fund-holdings/{A-Z}/` directories
For detailed walkthrough of all examples, see Running Examples.
Example Execution Flow Diagram
Sources: examples/config.rs:1-17 src/config/config_manager.rs:20-23 Cargo.toml:28-29
Data Output Structure
The application organizes fetched data into a structured directory hierarchy:
data/
├── http_cache/ # HTTP response cache (simd-r-drive)
│ └── sec.gov/
│ └── *.bin # Cached API responses
│
├── fund-holdings/ # NPORT filing data by ticker
│ ├── A/
│ │ ├── AAPL_holdings.csv
│ │ └── AMZN_holdings.csv
│ ├── B/
│ │ └── MSFT_holdings.csv
│ └── ... # A-Z directories
│
└── us-gaap/ # US GAAP fundamental data
├── AAPL_fundamentals.csv
├── MSFT_fundamentals.csv
└── ...
CSV File Formats
US GAAP Fundamentals (us-gaap/*.csv):
- Ticker symbol
- Filing date
- Fiscal period
- `FundamentalConcept` (64 normalized concepts)
- Value
- Units
- Accession number
NPORT Holdings (fund-holdings/{A-Z}/*.csv):
- Fund CIK
- Investment ticker symbol
- Investment name
- Balance (shares)
- Value (USD)
- Percentage of portfolio
- Asset category
- Issuer category
Data Flow from API to CSV Diagram
Sources: src/config/app_config.rs:31-44 Cargo.toml:24
Cache Behavior
The application implements two-tier caching to minimize redundant API calls:
HTTP Cache
- Storage: `simd-r-drive` key-value store (Cargo.toml:36)
- Location: `{cache_base_dir}/http_cache/`
- TTL: 1 week (168 hours)
- Scope: Raw HTTP responses from the SEC API
Preprocessor Cache
- Storage: In-memory `DashMap` with persistent backup
- Scope: Transformed data structures (after `distill_us_gaap_fundamental_concepts`)
- Purpose: Skip expensive concept normalization on repeated runs
Cache Initialization: The `Caches::init()` function src/config/config_manager.rs:98-100 is called automatically during `ConfigManager` construction.
For detailed caching architecture, see Caching & Storage System.
Sources: Cargo.toml:14-37 src/config/config_manager.rs:98-100
Troubleshooting Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "Could not obtain email credential" | No email configured in non-interactive mode | Add email = "..." to config file or run interactively once |
| "Config path does not exist" | Invalid custom config path | Check path spelling or omit to use defaults |
| "unknown field" in config | Typo in TOML key name | Run cargo run --example config to see valid keys |
| Rate limit errors from SEC | min_delay_ms too low | Increase to 1000+ ms (SEC requires 1 req/sec max) |
| Cache directory permission denied | Insufficient filesystem permissions | Change cache_base_dir to writable location |
Debug Configuration Issues: run `cargo run --example config` to validate your configuration; on an invalid key, the error message lists all valid keys and their types.
The AppConfig::get_valid_keys() function src/config/app_config.rs:62-77 dynamically generates a list of valid configuration fields with their expected types using JSON schema introspection.
Sources: src/config/config_manager.rs:49-77 src/config/app_config.rs:62-77
Next Steps
Now that you have the application configured and running, explore these topics:
- Configuration System - Deep dive into `AppConfig`, `ConfigManager`, credential management, and TOML structure
- Running Examples - Comprehensive walkthrough of all example programs (`config.rs`, `funds.rs`, `lookup_cik.rs`, `nport_filing.rs`, `us_gaap_human_readable.rs`)
- Network Layer & SecClient - Understanding the HTTP client, throttling policies, and retry logic
- US GAAP Concept Transformation - How 57+ revenue variations are normalized into 64 standardized concepts
- Main Application Flow - Complete data fetching workflow from `main.rs`
For Python ML pipeline setup, see Python narrative_stack System.
Sources: examples/config.rs:1-17 src/config/config_manager.rs:1-121 src/config/app_config.rs:1-159
Configuration System
Relevant source files
- examples/config.rs
- src/config/app_config.rs
- src/config/config_manager.rs
- tests/config_manager_tests.rs
This document describes the configuration management system in the Rust sec-fetcher application. The configuration system provides a flexible, multi-source approach to managing application settings including SEC API credentials, request throttling parameters, and cache directories.
For information about how the configuration integrates with the caching system, see Caching & Storage System. For details on credential storage mechanisms, see the credential management section below.
Overview
The configuration system consists of three primary components:
| Component | Purpose | File Location |
|---|---|---|
AppConfig | Data structure holding all configuration fields | src/config/app_config.rs |
ConfigManager | Loads, merges, and validates configuration from multiple sources | src/config/config_manager.rs |
CredentialManager | Handles email credential storage and retrieval (keyring or prompt) | Referenced in src/config/config_manager.rs:66-72 |
The system supports configuration from three sources with increasing priority:
1. Default values - Hard-coded defaults in `AppConfig::default()`
2. TOML configuration file - User-specified settings loaded from disk
3. Interactive prompts - Credential collection when running in interactive mode
Sources: src/config/app_config.rs:15-54 src/config/config_manager.rs:11-120
Configuration Structure
AppConfig Fields
All fields in AppConfig are wrapped in Option<T> to support partial configuration and merging strategies. The #[merge(strategy = overwrite_option)] attribute ensures that non-None values from user configuration always replace default values.
Sources: src/config/app_config.rs:15-54 examples/config.rs:13-16
Configuration Loading Flow
ConfigManager Initialization
Sources: src/config/config_manager.rs:19-86 src/config/config_manager.rs:107-119
Configuration File Format
TOML Structure
The configuration file uses TOML format with strict schema validation. Any unrecognized keys will cause a descriptive error listing all valid keys and their types.
Example configuration file (sec_fetcher_config.toml):
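The example file is not reproduced in this extract; a plausible version based on the documented fields:

```toml
# Required by the SEC EDGAR API (sent in the User-Agent header)
email = "you@example.com"

# Request throttling
max_concurrent = 1
min_delay_ms = 1000
max_retries = 5

# Base directory for the HTTP cache and CSV output
cache_base_dir = "data"
```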
Configuration File Locations
The system searches for configuration files in the following order:
| Priority | Location | Description |
|---|---|---|
| 1 | User-provided path | Passed to ConfigManager::from_config(Some(path)) |
| 2 | System config directory | `~/.config/sec-fetcher/config.toml` (Unix), `%APPDATA%\sec-fetcher\config.toml` (Windows) |
| 3 | Current directory | ./sec_fetcher_config.toml |
Sources: src/config/config_manager.rs:107-119 tests/config_manager_tests.rs:36-57
Credential Management
Email Credential Requirement
The SEC EDGAR API requires a valid email address in the HTTP User-Agent header. The configuration system ensures this credential is available through multiple strategies:
Interactive Mode Detection:
graph TD
Start["ConfigManager Initialization"]
CheckEmail{"Email in\nAppConfig?"}
CheckInteractive{"is_interactive_mode()?"}
PromptUser["CredentialManager::from_prompt()"]
ReadKeyring["Read from system keyring"]
PromptInput["Prompt user for email"]
SaveKeyring["Save to keyring (optional)"]
SetEmail["Set email in user_settings"]
Error["Return Error:\nCould not obtain email credential"]
MergeConfig["Merge configurations"]
Start --> CheckEmail
CheckEmail -->|Yes| MergeConfig
CheckEmail -->|No| CheckInteractive
CheckInteractive -->|Yes| PromptUser
CheckInteractive -->|No| Error
PromptUser --> ReadKeyring
ReadKeyring -->|Found| SetEmail
ReadKeyring -->|Not found| PromptInput
PromptInput --> SaveKeyring
SaveKeyring --> SetEmail
SetEmail --> MergeConfig
The `is_interactive_mode()` helper:
- Returns `true` if stdout is a terminal (TTY)
- Can be overridden via `set_interactive_mode_override()` for testing
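A minimal sketch of the TTY check, assuming the standard-library `IsTerminal` trait; the real helper also honors the test override:

```rust
use std::io::{self, IsTerminal};

fn is_interactive_mode() -> bool {
    // True when stdout is attached to a terminal rather than a pipe or file.
    io::stdout().is_terminal()
}

fn main() {
    println!("interactive: {}", is_interactive_mode());
}
```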
Sources: src/config/config_manager.rs:64-77 tests/config_manager_tests.rs:19-33
Configuration Merging Strategy
Merge Behavior
The AppConfig struct uses the merge crate with a custom overwrite_option strategy:
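A minimal sketch of such a strategy, consistent with the merge rules listed below (the real implementation is applied through the `merge` crate's attribute):

```rust
fn overwrite_option<T>(left: &mut Option<T>, right: Option<T>) {
    // A non-None right-hand value always wins; None never erases a value.
    if right.is_some() {
        *left = right;
    }
}

fn main() {
    let mut email = Some("default@example.com".to_string());
    overwrite_option(&mut email, Some("user@example.com".to_string()));
    assert_eq!(email.as_deref(), Some("user@example.com"));

    overwrite_option(&mut email, None);
    assert_eq!(email.as_deref(), Some("user@example.com"));
}
```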
Merge Rules:
- `Some(new_value)` always replaces `Some(old_value)`
- `Some(new_value)` always replaces `None`
- `None` never replaces `Some(old_value)`
This ensures user-provided values take absolute precedence over defaults while allowing partial configuration.
Sources: src/config/app_config.rs:8-13 src/config/config_manager.rs79
Schema Validation and Error Handling
graph LR
TOML["TOML File with\ninvalid_key"]
Deserialize["config.try_deserialize()"]
ExtractSchema["AppConfig::get_valid_keys()"]
Schema["schemars::schema_for!(AppConfig)"]
FormatError["Format error message with\nvalid keys and types"]
TOML --> Deserialize
Deserialize -->|Error| ExtractSchema
ExtractSchema --> Schema
Schema --> FormatError
FormatError --> Error["Return descriptive error:\n- email (String or Null)\n- max_concurrent (Integer or Null)\n- min_delay_ms (Integer or Null)\n- max_retries (Integer or Null)\n- cache_base_dir (String or Null)"]
Invalid Key Detection
The configuration system uses serde(deny_unknown_fields) to reject unknown keys in TOML files. When an invalid key is detected, the error message includes a complete list of valid keys with their types:
Example error output:
unknown field `invalid_key`, expected one of `email`, `max_concurrent`, `min_delay_ms`, `max_retries`, `cache_base_dir`
Valid configuration keys are:
- email (String | Null)
- max_concurrent (Integer | Null)
- min_delay_ms (Integer | Null)
- max_retries (Integer | Null)
- cache_base_dir (String | Null)
Sources: src/config/app_config.rs:62-77 src/config/config_manager.rs:45-62 tests/config_manager_tests.rs:67-94
Usage Examples
Loading Configuration
Default configuration path:
Custom configuration path:
Direct AppConfig construction:
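A combined sketch of the three cases; the module paths, the `from_config` argument type, and the field access are assumptions based on this page:

```rust
use sec_fetcher::config::{AppConfig, ConfigManager};

fn load_examples() -> Result<(), Box<dyn std::error::Error>> {
    // Default configuration path
    let manager = ConfigManager::load()?;
    let _config = manager.get_config();

    // Custom configuration path
    let _custom = ConfigManager::from_config(Some("sec_fetcher_config.toml".into()))?;

    // Direct AppConfig construction
    let mut app_config = AppConfig::default();
    app_config.email = Some("you@example.com".to_string());

    Ok(())
}
```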
Sources: examples/config.rs:1-17 tests/config_manager_tests.rs:36-57
graph LR
CM["ConfigManager::from_config()"]
InitCaches["init_caches()"]
CachesInit["Caches::init(self)"]
HTTPCache["Initialize HTTP cache\nwith cache_base_dir"]
PreprocCache["Initialize preprocessor cache"]
OnceLock["Store in OnceLock statics"]
CM --> InitCaches
InitCaches --> CachesInit
CachesInit --> HTTPCache
CachesInit --> PreprocCache
HTTPCache --> OnceLock
PreprocCache --> OnceLock
Cache Initialization
The ConfigManager automatically initializes the global Caches module during construction:
The cache_base_dir field from AppConfig is passed to the Caches module to determine where HTTP responses and preprocessed data are stored. For detailed information on cache policies and storage mechanisms, see Caching & Storage System.
Sources: src/config/config_manager.rs:83-100
Testing
Test Coverage
The configuration system includes comprehensive unit tests:
| Test Function | Purpose |
|---|---|
test_load_custom_config | Verifies loading from custom TOML file path |
test_load_non_existent_config | Ensures proper error on missing file |
test_fails_on_invalid_key | Validates schema enforcement and error messages |
test_fails_if_no_email_available | Checks email requirement in non-interactive mode (commented) |
Test Utilities:
- `create_temp_config(contents: &str)` - Creates temporary TOML files for testing
- `set_interactive_mode_override(Some(false))` - Disables interactive prompts during tests
Sources: tests/config_manager_tests.rs:1-95
Configuration System Components
Complete Component Diagram
Sources: src/config/app_config.rs:1-159 src/config/config_manager.rs:1-121
Running Examples
Relevant source files
This page demonstrates how to run the example programs included in the repository. Each example illustrates specific functionality of the sec-fetcher system, from basic configuration validation to fetching company data from the SEC EDGAR API. These examples serve as practical starting points for understanding how to integrate the library into your own applications.
For detailed information about configuring the application before running these examples, see Configuration System. For understanding the underlying network layer and data fetching mechanisms, see Network Layer & SecClient and Data Fetching Functions.
Example Programs Overview
The repository includes five example programs located in the examples/ directory. Each demonstrates different aspects of the system:
| Example Program | Primary Purpose | Key Functions Demonstrated | Command-Line Args |
|---|---|---|---|
config.rs | Configuration validation and inspection | ConfigManager::load(), AppConfig::pretty_print() | None |
funds.rs | Fetch investment company datasets | fetch_investment_company_series_and_class_dataset() | None |
lookup_cik.rs | CIK lookup and submission retrieval | fetch_cik_by_ticker_symbol(), fetch_cik_submissions() | Ticker symbol (e.g., AAPL) |
nport_filing.rs | Parse NPORT filing holdings | fetch_nport_filing(), XML parsing | Accession number |
us_gaap_human_readable.rs | Fetch and display US GAAP fundamentals | fetch_us_gaap_fundamentals(), distill_us_gaap_fundamental_concepts() | Ticker symbol |
Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75
Example Program Architecture
Diagram: Example Programs and System Integration
Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75
config.rs - Configuration Validation
Purpose : Load and display the current application configuration to verify that configuration files are properly formatted and credentials are correctly set.
How to Run :
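From the repository root:

```sh
cargo run --example config
```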
Expected Output : The example displays the merged configuration in JSON format, showing values from the configuration file merged with defaults.
Code Flow :
graph TB
Start["main()
entry point\nexamples/config.rs:4"]
Load["ConfigManager::load()\nLoad config from file\nexamples/config.rs:13"]
GetConfig["config_manager.get_config()\nRetrieve AppConfig reference\nexamples/config.rs:15"]
Pretty["config.pretty_print()\nSerialize to JSON\nexamples/config.rs:16"]
Print["print! macro\nDisplay formatted config\nexamples/config.rs:16"]
End["Program exit"]
Start --> Load
Load --> GetConfig
GetConfig --> Pretty
Pretty --> Print
Print --> End
Key Functions Used :
| Function | File Location | Purpose |
|---|---|---|
| ConfigManager::load() | examples/config.rs:13 | Loads configuration from default paths |
| get_config() | examples/config.rs:15 | Returns reference to AppConfig struct |
| pretty_print() | examples/config.rs:16 | Serializes configuration to formatted JSON |
What This Demonstrates :
- Basic configuration loading workflow
- Configuration file resolution and merging
- Validation of configuration structure
Sources: examples/config.rs:1-18 src/config/app_config.rs:1-159
funds.rs - Investment Company Dataset
Purpose : Fetch the complete investment company series and class dataset from the SEC EDGAR API, demonstrating network client initialization and basic data fetching.
How to Run :
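From the repository root:

```sh
cargo run --example funds
```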
Expected Output : A list of investment company records, each containing series information, fund names, CIKs, and classification details. The output format is the Debug representation of the Ticker structs.
Code Flow :
graph TB
Start["#[tokio::main] async fn main()\nexamples/funds.rs:6-7"]
LoadCfg["ConfigManager::load()\nexamples/funds.rs:8"]
CreateClient["SecClient::from_config_manager()\nInitialize HTTP client\nexamples/funds.rs:9"]
FetchData["fetch_investment_company_series_and_class_dataset()\nNetwork request to SEC API\nexamples/funds.rs:11"]
LoopStart["for fund in funds\nIterate results\nexamples/funds.rs:13"]
PrintFund["print! macro\nDisplay fund Debug output\nexamples/funds.rs:14"]
End["Return Ok(())\nexamples/funds.rs:17"]
Start --> LoadCfg
LoadCfg --> CreateClient
CreateClient --> FetchData
FetchData --> LoopStart
LoopStart --> PrintFund
PrintFund -->|More funds| LoopStart
LoopStart -->|Done| End
Key Functions and Structures :
| Entity | Type | Purpose |
|---|---|---|
fetch_investment_company_series_and_class_dataset() | Function | Fetches investment company data from SEC |
SecClient | Struct | HTTP client with throttling and caching |
Ticker | Struct | Represents company ticker information |
tokio::main | Macro | Enables async runtime |
What This Demonstrates :
- Async/await pattern usage with `tokio`
- `SecClient` initialization from configuration
- Network function invocation
- Iteration over fetched data structures
Sources: examples/funds.rs:1-19
lookup_cik.rs - CIK Lookup and Submission Retrieval
Purpose : Demonstrate CIK (Central Index Key) lookup by ticker symbol and retrieval of company submissions, with special handling for NPORT-P filings.
How to Run :
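From the repository root, passing the ticker as the single argument:

```sh
cargo run --example lookup_cik -- AAPL
```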
Replace AAPL with any valid ticker symbol.
Expected Output :
- The CIK number for the ticker
- URL to the SEC submissions JSON
- Most recent NPORT-P submission (if the company files NPORT-P forms)
- Or a list of other submission types (e.g., 10-K)
- EDGAR archive URLs for each submission
Execution Flow with Code Entities :
Command-Line Argument Handling :
The example requires exactly one argument (the ticker symbol). The argument parsing logic is implemented at examples/lookup_cik.rs:10-14:
args.len() check → if not 2, print usage and exit
ticker_symbol = args[1]
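An idiomatic equivalent of that pseudocode (a sketch; the example's actual messages may differ):

```rust
use std::env;

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        eprintln!("Usage: {} <TICKER_SYMBOL>", args[0]);
        std::process::exit(1);
    }
    let ticker_symbol = &args[1];
    println!("Looking up CIK for {ticker_symbol}");
}
```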
Key Code Entities :
| Entity | Type | Location | Purpose |
|---|---|---|---|
| fetch_cik_by_ticker_symbol() | Function | examples/lookup_cik.rs:21-23 | Retrieves CIK for ticker |
| fetch_cik_submissions() | Function | examples/lookup_cik.rs:43 | Fetches all submissions for CIK |
| CikSubmission::most_recent_nport_p_submission() | Method | examples/lookup_cik.rs:45-46 | Filters for latest NPORT-P |
| as_edgar_archive_url() | Method | examples/lookup_cik.rs:54 | Generates SEC EDGAR URL |
| Cik | Struct | examples/lookup_cik.rs:28 | 10-digit CIK identifier |
| CikSubmission | Struct | examples/lookup_cik.rs:43 | Represents a single SEC filing |
What This Demonstrates :
- Command-line argument parsing using `std::env::args()`
- Error handling with `Result` and `Option` types
- Conditional logic based on filing types
- Method chaining for data transformation
- URL generation from structured data
Sources: examples/lookup_cik.rs:1-75
Prerequisites and Common Patterns
Configuration Requirement : All examples (except config.rs) require a valid configuration file with at least an email address set. See Configuration System for setup instructions.
Common Code Pattern Across Examples :
All examples follow this pattern:
- Load configuration using `ConfigManager::load()`
- Initialize `SecClient` from the configuration
- Call network functions to fetch data
- Process and display results
Error Handling : Examples use the ? operator extensively to propagate errors and return Result<(), Box<dyn Error>> from main(). This allows errors to bubble up with descriptive messages.
Async Runtime : Examples that perform network operations use #[tokio::main] to enable the async runtime, allowing await syntax for asynchronous operations.
Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75
Running Examples in Development Mode
When running examples, the application may detect development mode based on environment variables or the presence of certain files. Development mode affects:
- Cache behavior : May use more aggressive caching strategies
- Logging verbosity : May produce more detailed output
- Rate limiting : May use relaxed throttling policies for testing
Refer to Utility Functions for details on development mode detection.
Next Steps
After running these examples:
- Explore the Network Layer & SecClient to understand HTTP client configuration
- Learn about Data Fetching Functions for all available network operations
- Review Data Models & Enumerations to understand returned data structures
- See Main Application Flow for production usage patterns
Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75
Rust sec-fetcher Application
Relevant source files
Purpose and Scope
This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured CSV files. The application serves as the data collection and preprocessing layer in a larger dual-language system that combines Rust's performance for I/O operations with Python's machine learning capabilities.
This page covers the high-level architecture, module organization, and data flow patterns. For detailed information about specific components, see:
The Python machine learning pipeline that consumes this data is documented in Python narrative_stack System.
Sources: src/lib.rs:1-11 Cargo.toml:1-45 src/main.rs:1-241
Application Architecture
The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format, and output structured CSV files for downstream consumption.
graph TB
subgraph "src/lib.rs Module Organization"
config["config\nConfigManager, AppConfig"]
enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
network["network\nSecClient, fetch_* functions\nThrottlePolicy, CachePolicy"]
transformers["transformers\ndistill_us_gaap_fundamental_concepts"]
parsers["parsers\nXML/JSON parsing utilities"]
caches["caches\nCaches (internal)\nHTTP cache, preprocessor cache"]
fs["fs\nFile system operations"]
utils["utils\nVecExtensions, helpers"]
end
subgraph "External Dependencies"
reqwest["reqwest\nHTTP client"]
polars["polars\nDataFrame operations"]
simd["simd-r-drive\nDrive-based cache storage"]
tokio["tokio\nAsync runtime"]
serde["serde\nSerialization"]
end
config --> caches
network --> config
network --> caches
network --> models
network --> enums
network --> parsers
transformers --> models
transformers --> enums
parsers --> models
network --> reqwest
network --> simd
network --> tokio
transformers --> polars
models --> serde
style config fill:#f9f9f9
style network fill:#f9f9f9
style transformers fill:#f9f9f9
style caches fill:#f9f9f9
Module Structure
The application is organized into nine core modules as declared in src/lib.rs:1-10:
| Module | Purpose | Key Components |
|---|---|---|
config | Configuration management and credential handling | ConfigManager, AppConfig |
enums | Type-safe enumerations for domain concepts | FundamentalConcept, Url, CacheNamespacePrefix, TickerOrigin |
models | Data structures representing SEC entities | Ticker, CikSubmission, NportInvestment, AccessionNumber |
network | HTTP client and data fetching functions | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals |
transformers | Data normalization and transformation | distill_us_gaap_fundamental_concepts |
parsers | XML/JSON parsing utilities | Filing parsers, XML extractors |
caches | Internal caching infrastructure | Caches (singleton), HTTP cache, preprocessor cache |
fs | File system operations | Directory creation, path utilities |
utils | Helper functions and extensions | VecExtensions, invert_multivalue_indexmap |
Sources: src/lib.rs:1-11 Cargo.toml:8-40
Data Flow Architecture
Request-Response Flow with Caching
The data flow follows a pipeline pattern:
1. Request Initiation: Application code calls functions like `fetch_us_gaap_fundamentals` or `fetch_nport_filing_by_ticker_symbol`
2. Client Middleware: `SecClient` applies throttling and caching policies before making HTTP requests
3. Cache Check: `CachePolicy` checks `simd-r-drive` storage for cached responses
4. API Request: On a cache miss, the request is sent to the SEC EDGAR API with proper headers and rate limiting
5. Parsing: Raw XML/JSON responses are parsed into structured data models
6. Transformation: Data passes through normalization functions like `distill_us_gaap_fundamental_concepts`
7. Output: Transformed data is written to CSV files organized by ticker symbol or category
Sources: src/main.rs:174-240 src/lib.rs:1-11
Key Dependencies and Technology Stack
The application leverages modern Rust crates for performance and reliability:
Critical Dependencies
| Category | Crate | Version | Purpose |
|---|---|---|---|
| Async Runtime | tokio | 1.43.0 | Asynchronous I/O, task scheduling |
| HTTP Client | reqwest | 0.12.15 | HTTP requests with JSON support |
| Data Frames | polars | 0.46.0 | High-performance DataFrame operations |
| Caching | simd-r-drive | 0.3.0 | WebSocket-based key-value storage |
| Parallelism | rayon | 1.10.0 | Data parallelism for CPU-bound work |
| Configuration | config | 0.15.9 | TOML/JSON configuration file parsing |
| Credentials | keyring | 3.6.2 | Secure credential storage (OS keychain) |
| XML Parsing | quick-xml | 0.37.2 | Fast XML parsing for SEC filings |
| Serialization | serde | 1.0.218 | Data structure serialization/deserialization |
Sources: Cargo.toml:8-40
Application Entry Point and Main Loop
The application entry point in src/main.rs:174-240 demonstrates the typical processing pattern:
Main Processing Loop Structure
The main application follows this pattern:
1. Configuration Loading: `ConfigManager::load()` reads configuration from TOML files and environment variables
2. Client Initialization: `SecClient::from_config_manager()` creates the HTTP client with throttling and caching middleware
3. Ticker Fetching: `fetch_company_tickers()` retrieves all company ticker symbols from the SEC
4. Processing Loop: Iterates over each ticker symbol:
   - Calls `fetch_us_gaap_fundamentals()` to retrieve financial data
   - Transforms data through `distill_us_gaap_fundamental_concepts`
   - Writes results to CSV files organized by ticker: `data/us-gaap/{ticker}.csv`
   - Logs errors to an `error_log` HashMap for later reporting
5. Error Reporting: Prints a summary of any errors encountered during processing
Example Main Function Flow
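A condensed sketch of that flow; module paths, function signatures, and the CSV-writing step are assumptions and are simplified relative to the real src/main.rs:

```rust
use std::collections::HashMap;

// Module paths below are assumptions based on the module table above.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::{fetch_company_tickers, fetch_us_gaap_fundamentals, SecClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config_manager = ConfigManager::load()?;
    let client = SecClient::from_config_manager(&config_manager)?;

    let tickers = fetch_company_tickers(&client).await?;
    let mut error_log: HashMap<String, String> = HashMap::new();

    for ticker in &tickers {
        match fetch_us_gaap_fundamentals(&client, &tickers, &ticker.symbol).await {
            Ok(_fundamentals) => {
                // Writing data/us-gaap/{ticker}.csv is elided in this sketch.
            }
            Err(err) => {
                error_log.insert(ticker.symbol.clone(), err.to_string());
            }
        }
    }

    for (symbol, err) in &error_log {
        eprintln!("{symbol}: {err}");
    }
    Ok(())
}
```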
Sources: src/main.rs:174-240
Module Interaction Patterns
Configuration and Client Initialization
The initialization sequence ensures:
- Configuration is loaded before any network operations
- Credentials are retrieved securely from the OS keychain via the `keyring` crate
- The `Caches` singleton is initialized with a drive storage connection
- HTTP client middleware is properly configured with throttling and caching
Sources: src/lib.rs:1-11
Data Transformation Pipeline
The transformation pipeline is the most critical component (importance: 8.37 as noted in the high-level architecture). The distill_us_gaap_fundamental_concepts function handles:
- Synonym Consolidation: Maps 6+ variations of "Cost of Revenue" to the single `CostOfRevenue` concept
- Industry-Specific Revenue: Handles 57+ revenue field variations (`SalesRevenueNet`, `RevenueFromContractWithCustomer`, etc.)
- Hierarchical Mapping: Maps specific concepts to parent categories (e.g., `CurrentAssets` also maps to `Assets`)
- Unit Normalization: Standardizes currency, shares, and percentage units
Sources: Cargo.toml:24, high-level architecture diagram
Error Handling Strategy
The application uses a multi-layered error handling approach:
| Layer | Strategy | Example |
|---|---|---|
| Network | Retry with exponential backoff | ThrottlePolicy with max_retries |
| Cache | Fallback to API on cache errors | CachePolicy transparent fallback |
| Parsing | Log and continue processing | Error log in main loop |
| File I/O | Log error, continue to next item | Individual CSV write failures don't stop processing |
The main loop accumulates errors in a HashMap<String, String> and reports them at the end, ensuring that one failure doesn't halt the entire batch processing job.
Sources: src/main.rs:185-240
Performance Characteristics
The application is optimized for I/O-bound workloads:
- Async I/O: The `tokio` runtime enables concurrent network requests
- CPU Parallelism: `rayon` parallelizes DataFrame operations in `polars`
- Caching: `simd-r-drive` reduces redundant API calls with a 1-week TTL
- Throttling: Respects SEC rate limits with adaptive jitter
- Streaming : Processes large datasets without loading entirely into memory
The combination of async networking for I/O operations and Rayon parallelism for CPU-bound transformations provides optimal throughput for SEC data fetching and processing.
Sources: Cargo.toml:8-40 high-level architecture diagrams
Network Layer & SecClient
Relevant source files
Purpose and Scope
This page documents the network layer of the Rust sec-fetcher application, specifically the SecClient HTTP client and its associated infrastructure. The SecClient provides the foundational HTTP communication layer for all SEC EDGAR API interactions, implementing throttling, caching, and retry logic to ensure reliable and compliant data fetching.
This page covers:
- The `SecClient` structure and initialization
- Throttling and rate limiting policies
- HTTP caching mechanisms
- Request/response handling
- User-Agent management and email validation
For information about the specific network fetching functions that use SecClient (such as fetch_company_tickers, fetch_us_gaap_fundamentals, etc.), see Data Fetching Functions. For details on the caching system architecture, see Caching & Storage System.
SecClient Architecture Overview
Component Diagram
Sources: src/network/sec_client.rs:1-181
SecClient Structure and Initialization
SecClient Fields
The SecClient struct maintains four core components:
| Field | Type | Purpose |
|---|---|---|
email | String | Contact email for SEC User-Agent header |
http_client | ClientWithMiddleware | reqwest client with middleware stack |
cache_policy | Arc<CachePolicy> | Shared cache configuration |
throttle_policy | Arc<ThrottlePolicy> | Shared throttle configuration |
Sources: src/network/sec_client.rs:14-19
Construction from ConfigManager
The from_config_manager() constructor performs the following initialization sequence:
- Extract Configuration : Reads
email,max_concurrent,min_delay_ms, andmax_retriesfromAppConfig - Create CachePolicy : Configures cache with 1-week TTL and disables header respect
- Create ThrottlePolicy : Configures rate limiting with adaptive jitter
- Initialize HTTP Cache : Retrieves the shared HTTP cache store from
Caches - Build Middleware Stack : Combines cache and throttle layers
- Construct Client : Creates
ClientWithMiddlewarewith full middleware stack
Sources: src/network/sec_client.rs:23-89
Configuration Parameters
The following table details the configuration parameters required by SecClient:
| Parameter | AppConfig Field | Required | Default | Purpose |
|---|---|---|---|---|
| Email | email | Yes | None | SEC User-Agent compliance |
| Max Concurrent | max_concurrent | Yes | None | Concurrent request limit |
| Min Delay (ms) | min_delay_ms | Yes | None | Base throttle delay |
| Max Retries | max_retries | Yes | None | Retry attempt limit |
Sources: src/network/sec_client.rs:28-43
Throttle Policy Configuration
ThrottlePolicy Structure
The ThrottlePolicy from reqwest_drive controls request rate limiting with the following parameters:
| Field | Type | Source | Purpose |
|---|---|---|---|
base_delay_ms | u64 | AppConfig.min_delay_ms | Minimum delay between requests |
max_concurrent | u64 | AppConfig.max_concurrent | Maximum concurrent requests |
max_retries | u64 | AppConfig.max_retries | Maximum retry attempts |
adaptive_jitter_ms | u64 | Hardcoded: 500 | Randomized delay for retry backoff |
Sources: src/network/sec_client.rs:52-59
Request Throttling Mechanism
The throttle mechanism operates as follows:
1. Concurrent Limit: Enforces at most `max_concurrent` simultaneous requests
2. Base Delay: Applies `base_delay_ms` between sequential requests
3. Retry Logic: Retries failed requests up to `max_retries` times
4. Adaptive Jitter: Adds a randomized delay of up to `adaptive_jitter_ms` on retries to prevent thundering-herd behavior
The ThrottlePolicy can be overridden on a per-request basis by passing a custom policy to raw_request():
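A hedged sketch of such an override; the `ThrottlePolicy` field names come from the table above, but its construction and the exact `raw_request()` parameter types are assumptions:

```rust
use reqwest::Method;
use reqwest_drive::ThrottlePolicy;
use sec_fetcher::network::SecClient; // module path is an assumption

async fn fetch_with_slower_policy(client: &SecClient) -> Result<(), Box<dyn std::error::Error>> {
    let slow_policy = ThrottlePolicy {
        base_delay_ms: 2000,
        max_concurrent: 1,
        max_retries: 3,
        adaptive_jitter_ms: 500,
    };

    let response = client
        .raw_request(
            Method::GET,
            "https://www.sec.gov/files/company_tickers.json",
            None,              // no extra headers
            Some(slow_policy), // per-request throttle override
        )
        .await?;

    println!("status: {}", response.status());
    Ok(())
}
```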
Sources: src/network/sec_client.rs:140-165
Cache Policy Configuration
CachePolicy Structure
The CachePolicy configuration defines HTTP caching behavior:
| Field | Value | Purpose |
|---|---|---|
default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week cache lifetime |
respect_headers | false | Ignore HTTP cache control headers |
cache_status_override | None | No status code override |
Note : The 1-week TTL is currently hardcoded and marked with a TODO comment for future configurability.
Sources: src/network/sec_client.rs:45-50
Cache Storage Integration
The cache storage uses the following integration:
- HTTP Cache Store: Retrieved via `Caches::get_http_cache_store()`, a static `OnceLock` singleton
- Drive Integration: Uses `init_cache_with_drive_and_throttle()` from `reqwest_drive`
- Persistent Storage: Backed by the `simd-r-drive` WebSocket server for cross-session persistence
Sources: src/network/sec_client.rs:73-81
User-Agent Management
User-Agent Format
The get_user_agent() method generates a compliant User-Agent string for SEC EDGAR API requests:
Format: <package_name>/<version> (+<email>)
Example: sec-fetcher/0.1.0 (+contact@example.com)
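A minimal sketch of that format using Cargo's compile-time package metadata (the real method also validates the email before formatting):

```rust
fn user_agent(email: &str) -> String {
    // "<package_name>/<version> (+<email>)"
    format!(
        "{}/{} (+{})",
        env!("CARGO_PKG_NAME"),
        env!("CARGO_PKG_VERSION"),
        email
    )
}

fn main() {
    println!("{}", user_agent("contact@example.com"));
}
```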
Sources: src/network/sec_client.rs:91-108
Email Validation
The email validation occurs at User-Agent generation time rather than during instantiation. This design ensures that:
- Every network request validates the email format
- Invalid configurations fail fast on first network call
- The email is validated even if the `SecClient` is constructed through different paths
Sources: src/network/sec_client.rs:91-99
Request Methods
raw_request Method
The raw_request() method provides low-level HTTP request functionality:
Parameters:
- `method`: HTTP method (GET, POST, etc.)
- `url`: Target URL
- `headers`: Optional additional headers as key-value tuples
- `custom_throttle_policy`: Optional per-request throttle override
Behavior:
- Constructs request with User-Agent header
- Applies optional custom headers
- Injects custom throttle policy if provided (via request extensions)
- Executes request through middleware stack
- Returns the raw `reqwest::Response`
Sources: src/network/sec_client.rs:140-165
fetch_json Method
The fetch_json() method is a convenience wrapper for JSON API requests:
Flow:
1. Calls `raw_request()` with the GET method
2. Awaits the response
3. Deserializes the response body to `serde_json::Value`
4. Returns the parsed JSON
Sources: src/network/sec_client.rs:167-179
Request Flow Diagram
Sources: src/network/sec_client.rs:140-179
Testing Infrastructure
Test Organization
The SecClient tests use the mockito crate for HTTP mocking:
Sources: tests/sec_client_tests.rs:1-159
Test Coverage
The test suite covers the following scenarios:
| Test | Purpose | Mock Behavior | Assertion |
|---|---|---|---|
test_user_agent | User-Agent formatting | N/A | Matches expected format |
test_invalid_email_panic | Email validation | N/A | Panics with expected message |
test_fetch_json_without_retry_success | Successful request | 200 JSON response | JSON parsed correctly |
test_fetch_json_with_retry_success | Retry not needed | 200 JSON response | JSON parsed correctly |
test_fetch_json_with_retry_failure | Retry exhaustion | 500 error (3x) | Returns error |
test_fetch_json_with_retry_backoff | Retry with recovery | 500 → 200 | JSON parsed correctly |
Key Testing Patterns:
- Mock Server: `mockito::Server::new_async()` creates isolated HTTP endpoints
- Configuration: Tests use `AppConfig::default()` with overrides
- Async Execution: All network tests use `#[tokio::test]`
- Expectations: `mockito` verifies request counts with `.expect(n)`
Sources: tests/sec_client_tests.rs:7-158
Dependencies and Integration
External Dependencies
| Crate | Purpose | Usage in SecClient |
|---|---|---|
reqwest | HTTP client | Core HTTP functionality |
reqwest_drive | Middleware stack | Cache and throttle integration |
tokio | Async runtime | All async operations |
serde_json | JSON parsing | Response deserialization |
email_address | Email validation | User-Agent email checking |
Sources: src/network/sec_client.rs:1-12 Cargo.lock:1-100
Internal Dependencies
The SecClient integrates with:
- ConfigManager : Provides configuration for initialization (Configuration System)
- Caches : Supplies HTTP cache storage singleton (Caching & Storage System)
- Network Module: Exposed via the `src/network.rs` module
- Fetch Functions: Used by all data fetching operations (Data Fetching Functions)
Sources: src/network.rs:1-23 src/network/sec_client.rs:1-12
Error Handling
Error Types
SecClient methods return Result<T, Box<dyn Error>>, allowing propagation of:
- reqwest::Error : HTTP client errors (connection, timeout, etc.)
- serde_json::Error : JSON deserialization errors
- Custom Errors : From configuration or validation
Panic Conditions
The only panic condition is invalid email format detected in get_user_agent():
This is intentional as an invalid email violates SEC API requirements.
Sources: src/network/sec_client.rs:95-98
Usage Example
The following example demonstrates typical SecClient usage:
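A sketch of typical usage; module paths and the exact `fetch_json()` parameter list are assumptions based on this page:

```rust
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::SecClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config_manager = ConfigManager::load()?;
    let client = SecClient::from_config_manager(&config_manager)?;

    // Fetch and parse a JSON endpoint through the throttled, cached client.
    let json = client
        .fetch_json("https://www.sec.gov/files/company_tickers.json")
        .await?;
    println!("{json}");
    Ok(())
}
```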
Sources: tests/sec_client_tests.rs:36-62
Data Fetching Functions
Relevant source files
- examples/lookup_cik.rs
- examples/nport_filing.rs
- src/network/fetch_cik_submissions.rs
- src/network/fetch_company_tickers.rs
- src/network/fetch_investment_company_series_and_class_dataset.rs
- src/network/fetch_nport_filing.rs
- src/network/fetch_us_gaap_fundamentals.rs
- src/parsers/parse_nport_xml.rs
Purpose and Scope
This document describes the data fetching functions in the Rust sec-fetcher application. These functions provide the interface for retrieving financial data from the SEC EDGAR API, including company tickers, CIK submissions, NPORT filings, US GAAP fundamentals, and investment company datasets.
For information about the underlying HTTP client, throttling, and caching infrastructure, see Network Layer & SecClient. For details on how US GAAP data is transformed after fetching, see US GAAP Concept Transformation. For information about the data structures returned by these functions, see Data Models & Enumerations.
Overview of Data Fetching Architecture
The network module provides six primary fetching functions that retrieve different types of financial data from the SEC EDGAR API. Each function accepts a SecClient reference and returns structured data types.
Diagram: Data Fetching Function Overview
graph TB
subgraph "Entry Points"
FetchTickers["fetch_company_tickers()\nsrc/network/fetch_company_tickers.rs"]
FetchCIK["fetch_cik_by_ticker_symbol()\nNot shown in files"]
FetchSubs["fetch_cik_submissions()\nsrc/network/fetch_cik_submissions.rs"]
FetchNPORT["fetch_nport_filing_by_ticker_symbol()\nfetch_nport_filing_by_cik_and_accession_number()\nsrc/network/fetch_nport_filing.rs"]
FetchGAAP["fetch_us_gaap_fundamentals()\nsrc/network/fetch_us_gaap_fundamentals.rs"]
FetchInvCo["fetch_investment_company_series_and_class_dataset()\nsrc/network/fetch_investment_company_series_and_class_dataset.rs"]
end
subgraph "SEC EDGAR API Endpoints"
APITickers["company_tickers.json"]
APISubmissions["submissions/CIK{cik}.json"]
APINport["Archives/edgar/data/{cik}/{accession}/primary_doc.xml"]
APIFacts["api/xbrl/companyfacts/CIK{cik}.json"]
APIInvCo["files/investment/data/other/.../{year}.csv"]
end
subgraph "Return Types"
VecTicker["Vec<Ticker>"]
CikType["Cik"]
VecCikSub["Vec<CikSubmission>"]
VecNport["Vec<NportInvestment>"]
DataFrame["TickerFundamentalsDataFrame"]
VecInvCo["Vec<InvestmentCompany>"]
end
FetchTickers -->|HTTP GET| APITickers
APITickers --> VecTicker
FetchCIK -->|Searches| VecTicker
FetchCIK --> CikType
FetchSubs -->|HTTP GET| APISubmissions
APISubmissions --> VecCikSub
FetchNPORT -->|HTTP GET| APINport
APINport --> VecNport
FetchGAAP -->|HTTP GET| APIFacts
APIFacts --> DataFrame
FetchInvCo -->|HTTP GET| APIInvCo
APIInvCo --> VecInvCo
Sources: src/network/fetch_company_tickers.rs src/network/fetch_cik_submissions.rs src/network/fetch_nport_filing.rs src/network/fetch_us_gaap_fundamentals.rs src/network/fetch_investment_company_series_and_class_dataset.rs
Company Ticker Fetching
Function: fetch_company_tickers
The fetch_company_tickers function retrieves the master list of operating company tickers from the SEC EDGAR API. This dataset provides the mapping between ticker symbols, company names, and CIK numbers.
Signature:
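The signature is not reproduced in this extract; based on the description and return type above, it is approximately (the error type is an assumption):

```rust
pub async fn fetch_company_tickers(
    sec_client: &SecClient,
) -> Result<Vec<Ticker>, Box<dyn std::error::Error>>
```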
Implementation Details:
| Aspect | Detail |
|---|---|
| API Endpoint | Url::CompanyTickers (company_tickers.json) |
| Response Format | JSON object with numeric keys mapping to ticker data |
| Parsing Logic | Lines 1:8-31 |
| CIK Conversion | Uses Cik::from_u64() to format CIK numbers 22 |
| Origin Tag | All tickers tagged with TickerOrigin::CompanyTickers 28 |
Returned Data Structure:
Each Ticker in the result contains:
- `cik`: 10-digit formatted CIK number
- `symbol`: Ticker symbol (e.g., "AAPL")
- `company_name`: Full company name
- `origin`: Set to `TickerOrigin::CompanyTickers`
Usage Example:
The function is commonly used as a prerequisite for other operations that require ticker-to-CIK mapping:
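A hedged usage sketch (field names taken from the list above): fetch the ticker list once, then search it for a symbol-to-CIK mapping.

```rust
let tickers = fetch_company_tickers(&sec_client).await?;
if let Some(apple) = tickers.iter().find(|t| t.symbol == "AAPL") {
    println!("AAPL CIK: {:?}", apple.cik);
}
```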
Sources: src/network/fetch_company_tickers.rs:1-34
CIK Lookup and Submissions
Function: fetch_cik_submissions
The fetch_cik_submissions function retrieves all SEC filings (submissions) for a given company, identified by its CIK number.
Signature:
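The signature is not reproduced here; based on the surrounding description it is approximately (parameter passing and the error type are assumptions):

```rust
pub async fn fetch_cik_submissions(
    sec_client: &SecClient,
    cik: &Cik,
) -> Result<Vec<CikSubmission>, Box<dyn std::error::Error>>
```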
Implementation Details:
The function performs the following operations:
1. URL Construction: Creates the submissions endpoint URL using the `Url::CikSubmission` enum
2. JSON Fetching: Retrieves submission data via `sec_client.fetch_json()`
3. Data Extraction: Parses the `filings.recent` object containing parallel arrays
4. Parallel Array Processing: Uses `itertools::izip!` to iterate over multiple arrays simultaneously
JSON Structure:
Diagram: CikSubmission JSON Structure
Parsing Implementation:
The function extracts four parallel arrays and combines them using izip!:
- accessionNumber: Accession numbers for filings
- form: Form types (e.g., "10-K", "NPORT-P")
- primaryDocument: Primary document filenames
- filingDate: Filing dates in "YYYY-MM-DD" format
Each set of parallel values is combined into a CikSubmission struct (lines 68-75).
Date Parsing:
Filing dates are parsed from strings into NaiveDate objects (lines 61-63):
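A sketch of that parsing step with chrono; the literal date is illustrative and the format string is an assumption consistent with the "YYYY-MM-DD" format noted above:

```rust
use chrono::NaiveDate;

// "YYYY-MM-DD" as reported in the submissions JSON.
let filing_date = NaiveDate::parse_from_str("2023-06-22", "%Y-%m-%d")?;
```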
Sources: src/network/fetch_cik_submissions.rs:1-79 examples/lookup_cik.rs:43-70
NPORT Filing Fetching
NPORT-P filings contain detailed portfolio holdings for registered investment companies (mutual funds and ETFs). The module provides two related functions for fetching this data.
Function: fetch_nport_filing_by_ticker_symbol
Signature:
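A sketch of the expected shape; see src/network/fetch_nport_filing.rs for the real declaration:

```rust
// Sketch only; parameter and error types are assumptions.
pub async fn fetch_nport_filing_by_ticker_symbol(
    sec_client: &SecClient,
    ticker_symbol: &str,
) -> Result<Vec<NportInvestment>, Box<dyn std::error::Error>> {
    todo!()
}
```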
Workflow:
Diagram: NPORT Filing Fetch by Ticker Symbol
This convenience function orchestrates multiple calls (lines 10-30):
- Lookup the CIK from the ticker symbol (line 14)
- Fetch all submissions for that CIK (line 16)
- Filter for the most recent NPORT-P submission (lines 18-20)
- Fetch the filing details (lines 22-27)
Function: fetch_nport_filing_by_cik_and_accession_number
Signature:
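A sketch of the expected shape; whether the CIK and accession number are passed by reference is an assumption, and the authoritative declaration lives in src/network/fetch_nport_filing.rs:

```rust
// Sketch only; see src/network/fetch_nport_filing.rs.
pub async fn fetch_nport_filing_by_cik_and_accession_number(
    sec_client: &SecClient,
    cik: &Cik,
    accession_number: &AccessionNumber,
) -> Result<Vec<NportInvestment>, Box<dyn std::error::Error>> {
    todo!()
}
```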
Implementation Details:
| Step | Operation | Code Reference |
|---|---|---|
| 1. Fetch company tickers | Required for ticker mapping | Line 39 |
| 2. Construct URL | Url::CikAccessionPrimaryDocument | Line 41 |
| 3. Fetch XML | raw_request() with GET method | Lines 43-46 |
| 4. Parse XML | parse_nport_xml() | Line 48 |
XML Parsing Details:
The parse_nport_xml function extracts investment holdings from the primary document XML:
- Main Element : <invstOrSec> (investment or security) (lines 26-46)
- Fields Extracted : name, LEI, title, CUSIP, ISIN, balance, currency, USD value, percentage value, payoff profile, asset category, issuer category, country
- Ticker Mapping : Fuzzy matches investment names to company tickers (lines 131-142)
- Sorting : Results sorted by pct_val descending (line 125)
Sources: src/network/fetch_nport_filing.rs:1-49 src/parsers/parse_nport_xml.rs:12-146 examples/nport_filing.rs:1-29
US GAAP Fundamentals Fetching
Function: fetch_us_gaap_fundamentals
The fetch_us_gaap_fundamentals function retrieves standardized financial statement data (US Generally Accepted Accounting Principles) for a company.
Signature:
Type Alias:
The function returns a Polars DataFrame containing financial facts (line 9).
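A sketch of the alias and the signature implied by this section and by the call site in src/main.rs (three arguments: client, full ticker list, target symbol); the exact types should be confirmed in src/network/fetch_us_gaap_fundamentals.rs:

```rust
use polars::prelude::DataFrame;

// Assumed alias: the documented return type wraps a Polars DataFrame.
pub type TickerFundamentalsDataFrame = DataFrame;

// Sketch only; argument types are assumptions.
pub async fn fetch_us_gaap_fundamentals(
    sec_client: &SecClient,
    company_tickers: &[Ticker],
    ticker_symbol: &str,
) -> Result<TickerFundamentalsDataFrame, Box<dyn std::error::Error>> {
    todo!()
}
```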
Implementation Flow:
Diagram: US GAAP Fundamentals Fetch Flow
Key Operations:
- CIK Lookup : Resolves the ticker symbol to a CIK using the provided company tickers list (line 18)
- URL Construction : Builds the company facts endpoint URL (line 20)
- API Call : Fetches JSON data from the SEC (line 25)
- Parsing : Converts the JSON to a structured DataFrame (line 27)
API Response Structure:
The SEC company facts endpoint returns nested JSON containing:
- US-GAAP taxonomy : Standardized accounting concepts
- Unit types : Currency units (USD), shares, etc.
- Time series data : Historical values with filing dates and periods
The parsing function (documented in detail in US GAAP Concept Transformation) extracts this into a tabular DataFrame format.
Sources: src/network/fetch_us_gaap_fundamentals.rs:1-28
Investment Company Dataset Fetching
The investment company dataset provides comprehensive information about registered investment companies, including mutual funds and ETFs, their series, and share classes.
graph TB
Start["Start"]
CheckCache["Check preprocessor_cache\nfor latest_funds_year"]
CacheHit{{"Cache Hit?"}}
UseCache["Use cached year"]
UseCurrent["Use current year"]
TryFetch["fetch_investment_company_series_and_class_dataset_for_year(year)"]
Success{{"Success?"}}
Store["Store year in cache\nTTL: 1 week"]
Return["Return data"]
Decrement["year -= 1"]
CheckLimit{{"year >= 2024?"}}
Error["Return error"]
Start --> CheckCache
CheckCache --> CacheHit
CacheHit -->|Yes| UseCache
CacheHit -->|No| UseCurrent
UseCache --> TryFetch
UseCurrent --> TryFetch
TryFetch --> Success
Success -->|Yes| Store
Store --> Return
Success -->|No| Decrement
Decrement --> CheckLimit
CheckLimit -->|Yes| TryFetch
CheckLimit -->|No| Error
Function: fetch_investment_company_series_and_class_dataset
Signature:
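A sketch of the expected shape; see src/network/fetch_investment_company_series_and_class_dataset.rs for the real declaration:

```rust
// Sketch only; error type follows the module-wide convention.
pub async fn fetch_investment_company_series_and_class_dataset(
    sec_client: &SecClient,
) -> Result<Vec<InvestmentCompany>, Box<dyn std::error::Error>> {
    todo!()
}
```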
Year Fallback Strategy:
This function implements a sophisticated year-based fallback mechanism because the SEC updates the dataset annually and may not have data for the current year immediately:
Diagram: Investment Company Dataset Year Fallback Logic
Implementation Details:
| Component | Description | Code Reference |
|---|---|---|
| Cache Key | NAMESPACE_HASHER_LATEST_FUNDS_YEAR | Lines 11-15 |
| Namespace | CacheNamespacePrefix::LatestFundsYear | Line 13 |
| Cache Query | preprocessor_cache.read_with_ttl::<usize>() | Lines 39-42 |
| Year Range | 2024 to current year | Line 46 |
| Cache TTL | 1 week (604800 seconds) | Line 59 |
Caching Strategy:
The function caches the most recent successful year to avoid repeated year fallback attempts (lines 38-42):
Function: fetch_investment_company_series_and_class_dataset_for_year
Signature:
This lower-level function fetches the dataset for a specific year (lines 80-105):
- URL Construction : Uses Url::InvestmentCompanySeriesAndClassDataset(year) (line 84)
- Throttle Override : Reduces max retries to 2 for faster fallback (lines 86-92)
- Raw Request : Uses raw_request() instead of higher-level fetch methods (lines 94-101)
- CSV Parsing : Calls parse_investment_companies_csv() (line 104)
Throttle Policy Override:
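A hedged sketch of the override described above, using the ThrottlePolicy field names listed under Caching & Storage System; the base policy and any other values are illustrative:

```rust
// Illustrative only: copy an existing policy but cap retries at 2 so the year
// fallback loop fails fast when a dataset year does not exist.
let fallback_policy = ThrottlePolicy {
    max_retries: 2,
    ..base_policy // `base_policy` is a hypothetical ThrottlePolicy already in scope
};
```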
Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:1-178
graph TB
subgraph "Independent Functions"
FT["fetch_company_tickers()"]
FCS["fetch_cik_submissions(cik)"]
FIC["fetch_investment_company_series_and_class_dataset()"]
end
subgraph "Dependent Functions"
FCIK["fetch_cik_by_ticker_symbol()"]
FGAAP["fetch_us_gaap_fundamentals()"]
FNPORT1["fetch_nport_filing_by_ticker_symbol()"]
FNPORT2["fetch_nport_filing_by_cik_and_accession_number()"]
end
subgraph "Data Models"
VT["Vec<Ticker>"]
C["Cik"]
VCS["Vec<CikSubmission>"]
end
FT --> VT
VT --> FCIK
VT --> FGAAP
VT --> FNPORT2
FCIK --> C
C --> FCS
C --> FGAAP
FCS --> VCS
VCS --> FNPORT1
FCIK --> FNPORT1
FNPORT1 --> FNPORT2
Function Dependencies and Integration
The data fetching functions form a dependency graph where some functions rely on others to complete their tasks.
Diagram: Function Dependency Graph
Common Integration Patterns:
- Ticker to CIK Resolution :
  - fetch_company_tickers() → fetch_cik_by_ticker_symbol() → Cik
  - Used by: NPORT filing fetch, US GAAP fundamentals fetch
- Submission Filtering :
  - fetch_cik_submissions() → CikSubmission::most_recent_nport_p_submission()
  - Used by: NPORT filing by ticker symbol
- Ticker Mapping in Results :
  - fetch_company_tickers() → parsing functions (e.g., parse_nport_xml)
  - Enriches raw data with ticker information
Example Integration (from examples):
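A hedged sketch of the pattern used in examples/nport_filing.rs and examples/lookup_cik.rs; the ticker symbol and printed fields are illustrative, and the constructor argument passing is an assumption:

```rust
let config_manager = ConfigManager::load()?;
let client = SecClient::from_config_manager(&config_manager)?;

// Resolve a fund ticker to its latest NPORT-P holdings via the dependency chain above.
let holdings = fetch_nport_filing_by_ticker_symbol(&client, "VOO").await?;
for investment in holdings.iter().take(10) {
    println!("{} ({}% of portfolio)", investment.name, investment.pct_val);
}
```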
Sources: src/network/fetch_nport_filing.rs:3-5 examples/nport_filing.rs:16-22 examples/lookup_cik.rs:18-24
Error Handling
All data fetching functions return Result<T, Box<dyn Error>>, providing consistent error handling across the module. Common error scenarios include:
| Error Type | Cause | Example Functions |
|---|---|---|
| Network Errors | HTTP request failures, timeouts | All functions |
| Parsing Errors | Invalid JSON/XML structure | fetch_cik_submissions, fetch_nport_filing_by_cik_and_accession_number |
| Not Found | Ticker symbol not found, no NPORT-P filing exists | fetch_nport_filing_by_ticker_symbol |
| Year Fallback Exhaustion | No data available for any year | fetch_investment_company_series_and_class_dataset |
Error Propagation:
Functions use the ? operator to propagate errors up the call stack, allowing callers to handle errors at the appropriate level. For example, in fetch_nport_filing_by_ticker_symbol:
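A sketch of that propagation (function names follow the dependency graph above; exact signatures are assumptions):

```rust
// Any error from either call is returned immediately to the caller via `?`.
let cik = fetch_cik_by_ticker_symbol(sec_client, ticker_symbol).await?;
let submissions = fetch_cik_submissions(sec_client, cik).await?;
```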
If either dependency function fails, the error is immediately returned to the caller.
Sources: src/network/fetch_cik_submissions.rs:8-11 src/network/fetch_us_gaap_fundamentals.rs:12-16 src/network/fetch_nport_filing.rs:10-13
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
US GAAP Concept Transformation
Relevant source files
Purpose and Scope
This page documents the US GAAP concept transformation system, which normalizes raw financial concept names from SEC EDGAR filings into a standardized taxonomy. The core functionality is provided by the distill_us_gaap_fundamental_concepts function, which maps the diverse US GAAP terminology (57+ revenue variations, 6 cost variants, multiple equity representations) into a consistent set of 64 FundamentalConcept enum variants.
For information about fetching US GAAP data from the SEC API, see Data Fetching Functions. For details on the data models that use these concepts, see Data Models & Enumerations. For the Python ML pipeline that processes the transformed concepts, see Python narrative_stack System.
System Overview
The transformation system acts as a critical normalization layer between raw SEC EDGAR filings and downstream data processing. Companies report financial data using various US GAAP concept names (e.g., Revenues, SalesRevenueNet, HealthCareOrganizationRevenue), and this system ensures all variations map to consistent concept identifiers.
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9
graph TB
subgraph "Input Layer"
RawFiling["SEC EDGAR Filing\nRaw JSON Data"]
RawConcepts["Raw US GAAP Concepts\n- Revenues\n- SalesRevenueNet\n- HealthCareOrganizationRevenue\n- InterestAndDividendIncomeOperating\n- 57+ more revenue types"]
end
subgraph "Transformation Layer"
DistillFn["distill_us_gaap_fundamental_concepts"]
MappingEngine["Mapping Engine\n4 Pattern Types"]
OneToOne["One-to-One\nAssets → Assets"]
Hierarchical["Hierarchical\nCurrentAssets → [CurrentAssets, Assets]"]
Synonyms["Synonyms\n6 cost types → CostOfRevenue"]
Industry["Industry-Specific\n57+ revenue types → Revenues"]
MappingEngine --> OneToOne
MappingEngine --> Hierarchical
MappingEngine --> Synonyms
MappingEngine --> Industry
end
subgraph "Output Layer"
FundamentalConcept["FundamentalConcept Enum\n64 standardized variants"]
BS["Balance Sheet Concepts\n- Assets, CurrentAssets\n- Liabilities, CurrentLiabilities\n- Equity, EquityAttributableToParent"]
IS["Income Statement Concepts\n- Revenues, CostOfRevenue\n- GrossProfit, OperatingIncomeLoss\n- NetIncomeLoss"]
CF["Cash Flow Concepts\n- NetCashFlowFromOperatingActivities\n- NetCashFlowFromInvestingActivities\n- NetCashFlowFromFinancingActivities"]
FundamentalConcept --> BS
FundamentalConcept --> IS
FundamentalConcept --> CF
end
RawFiling --> RawConcepts
RawConcepts --> DistillFn
DistillFn --> MappingEngine
MappingEngine --> FundamentalConcept
style DistillFn fill:#f9f9f9
style FundamentalConcept fill:#f9f9f9
The FundamentalConcept Taxonomy
The FundamentalConcept enum defines 64 standardized financial concept variants organized into four main categories: Balance Sheet, Income Statement, Cash Flow, and Equity classifications. Each variant represents a normalized concept that may map from multiple raw US GAAP names.
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275
graph TB
subgraph "FundamentalConcept Enum"
Root["FundamentalConcept\n(64 total variants)"]
end
subgraph "Balance Sheet Concepts"
Assets["Assets"]
CurrentAssets["CurrentAssets"]
NoncurrentAssets["NoncurrentAssets"]
Liabilities["Liabilities"]
CurrentLiabilities["CurrentLiabilities"]
NoncurrentLiabilities["NoncurrentLiabilities"]
LiabilitiesAndEquity["LiabilitiesAndEquity"]
CommitmentsAndContingencies["CommitmentsAndContingencies"]
end
subgraph "Income Statement Concepts"
Revenues["Revenues"]
RevenuesNetInterestExpense["RevenuesNetInterestExpense"]
RevenuesExcludingInterestAndDividends["RevenuesExcludingInterestAndDividends"]
InterestAndDividendIncomeOperating["InterestAndDividendIncomeOperating"]
CostOfRevenue["CostOfRevenue"]
GrossProfit["GrossProfit"]
OperatingExpenses["OperatingExpenses"]
OperatingIncomeLoss["OperatingIncomeLoss"]
NonoperatingIncomeLoss["NonoperatingIncomeLoss"]
IncomeTaxExpenseBenefit["IncomeTaxExpenseBenefit"]
NetIncomeLoss["NetIncomeLoss"]
NetIncomeLossAttributableToParent["NetIncomeLossAttributableToParent"]
NetIncomeLossAttributableToNoncontrollingInterest["NetIncomeLossAttributableToNoncontrollingInterest"]
end
subgraph "Cash Flow Concepts"
NetCashFlow["NetCashFlow"]
NetCashFlowContinuing["NetCashFlowContinuing"]
NetCashFlowDiscontinued["NetCashFlowDiscontinued"]
NetCashFlowFromOperatingActivities["NetCashFlowFromOperatingActivities"]
NetCashFlowFromInvestingActivities["NetCashFlowFromInvestingActivities"]
NetCashFlowFromFinancingActivities["NetCashFlowFromFinancingActivities"]
ExchangeGainsLosses["ExchangeGainsLosses"]
end
subgraph "Equity Concepts"
Equity["Equity"]
EquityAttributableToParent["EquityAttributableToParent"]
EquityAttributableToNoncontrollingInterest["EquityAttributableToNoncontrollingInterest"]
TemporaryEquity["TemporaryEquity"]
RedeemableNoncontrollingInterest["RedeemableNoncontrollingInterest"]
end
Root --> Assets
Root --> CurrentAssets
Root --> NoncurrentAssets
Root --> Liabilities
Root --> CurrentLiabilities
Root --> NoncurrentLiabilities
Root --> LiabilitiesAndEquity
Root --> CommitmentsAndContingencies
Root --> Revenues
Root --> RevenuesNetInterestExpense
Root --> RevenuesExcludingInterestAndDividends
Root --> InterestAndDividendIncomeOperating
Root --> CostOfRevenue
Root --> GrossProfit
Root --> OperatingExpenses
Root --> OperatingIncomeLoss
Root --> NonoperatingIncomeLoss
Root --> IncomeTaxExpenseBenefit
Root --> NetIncomeLoss
Root --> NetIncomeLossAttributableToParent
Root --> NetIncomeLossAttributableToNoncontrollingInterest
Root --> NetCashFlow
Root --> NetCashFlowContinuing
Root --> NetCashFlowDiscontinued
Root --> NetCashFlowFromOperatingActivities
Root --> NetCashFlowFromInvestingActivities
Root --> NetCashFlowFromFinancingActivities
Root --> ExchangeGainsLosses
Root --> Equity
Root --> EquityAttributableToParent
Root --> EquityAttributableToNoncontrollingInterest
Root --> TemporaryEquity
Root --> RedeemableNoncontrollingInterest
Mapping Pattern Types
The transformation system implements four distinct mapping patterns to handle the diverse ways companies report financial concepts.
Pattern 1: One-to-One Mapping
Simple direct mappings where a single US GAAP concept name maps to exactly one FundamentalConcept variant.
graph LR
A["Assets"] --> FA["FundamentalConcept::Assets"]
B["Liabilities"] --> FB["FundamentalConcept::Liabilities"]
C["GrossProfit"] --> FC["FundamentalConcept::GrossProfit"]
D["CommitmentsAndContingencies"] --> FD["FundamentalConcept::CommitmentsAndContingencies"]
style FA fill:#f9f9f9
style FB fill:#f9f9f9
style FC fill:#f9f9f9
style FD fill:#f9f9f9
| Raw US GAAP Concept | FundamentalConcept Output |
|---|---|
Assets | vec![Assets] |
Liabilities | vec![Liabilities] |
GrossProfit | vec![GrossProfit] |
OperatingIncomeLoss | vec![OperatingIncomeLoss] |
CommitmentsAndContingencies | vec![CommitmentsAndContingencies] |
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:4-17 tests/distill_us_gaap_fundamental_concepts_tests.rs:31-36 tests/distill_us_gaap_fundamental_concepts_tests.rs:336-341
Pattern 2: Hierarchical Mapping
Specific concepts map to multiple variants, including both the specific concept and parent categories. This enables queries at different levels of granularity.
graph LR
subgraph "Input"
CA["AssetsCurrent"]
CL["LiabilitiesCurrent"]
SE["StockholdersEquity"]
end
subgraph "Output: Multiple Concepts"
CA1["FundamentalConcept::CurrentAssets"]
CA2["FundamentalConcept::Assets"]
SE1["FundamentalConcept::EquityAttributableToParent"]
SE2["FundamentalConcept::Equity"]
end
CA --> CA1
CA --> CA2
SE --> SE1
SE --> SE2
style CA1 fill:#f9f9f9
style CA2 fill:#f9f9f9
style SE1 fill:#f9f9f9
style SE2 fill:#f9f9f9
| Raw US GAAP Concept | FundamentalConcept Output (Ordered) |
|---|---|
AssetsCurrent | vec![CurrentAssets, Assets] |
StockholdersEquity | vec![EquityAttributableToParent, Equity] |
NetIncomeLoss | vec![NetIncomeLossAttributableToParent, NetIncomeLoss] |
IncomeLossFromContinuingOperations | vec![IncomeLossFromContinuingOperationsAfterTax, NetIncomeLoss] |
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:10-17 tests/distill_us_gaap_fundamental_concepts_tests.rs:158-165 tests/distill_us_gaap_fundamental_concepts_tests.rs:262-286 tests/distill_us_gaap_fundamental_concepts_tests.rs:692-698
Pattern 3: Synonym Consolidation
Multiple US GAAP concept names that represent the same financial concept are consolidated into a single FundamentalConcept variant.
graph LR
subgraph "Input: 6 Cost Variants"
C1["CostOfRevenue"]
C2["CostOfGoodsAndServicesSold"]
C3["CostOfServices"]
C4["CostOfGoodsSold"]
C5["CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization"]
C6["CostOfGoodsSoldElectric"]
end
subgraph "Output: Single Concept"
CO["FundamentalConcept::CostOfRevenue"]
end
C1 --> CO
C2 --> CO
C3 --> CO
C4 --> CO
C5 --> CO
C6 --> CO
style CO fill:#f9f9f9
Cost of Revenue Synonyms
| Raw US GAAP Concept | FundamentalConcept Output |
|---|---|
CostOfRevenue | vec![CostOfRevenue] |
CostOfGoodsAndServicesSold | vec![CostOfRevenue] |
CostOfServices | vec![CostOfRevenue] |
CostOfGoodsSold | vec![CostOfRevenue] |
CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization | vec![CostOfRevenue] |
CostOfGoodsSoldElectric | vec![CostOfRevenue] |
Equity Noncontrolling Interest Synonyms
| Raw US GAAP Concept | FundamentalConcept Output |
|---|---|
MinorityInterest | vec![EquityAttributableToNoncontrollingInterest] |
PartnersCapitalAttributableToNoncontrollingInterest | vec![EquityAttributableToNoncontrollingInterest] |
MinorityInterestInLimitedPartnerships | vec![EquityAttributableToNoncontrollingInterest] |
MinorityInterestInOperatingPartnerships | vec![EquityAttributableToNoncontrollingInterest] |
MinorityInterestInJointVentures | vec![EquityAttributableToNoncontrollingInterest] |
NonredeemableNoncontrollingInterest | vec![EquityAttributableToNoncontrollingInterest] |
NoncontrollingInterestInVariableInterestEntity | vec![EquityAttributableToNoncontrollingInterest] |
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:80-112 tests/distill_us_gaap_fundamental_concepts_tests.rs:196-259
Pattern 4: Industry-Specific Revenue Mapping
The most complex pattern handles 57+ industry-specific revenue variations, mapping them all to the Revenues concept. Some revenue types also map to hierarchical categories.
graph TB
subgraph "Industry-Specific Revenue Types"
R1["Revenues"]
R2["SalesRevenueNet"]
R3["HealthCareOrganizationRevenue"]
R4["RealEstateRevenueNet"]
R5["OilAndGasRevenue"]
R6["FinancialServicesRevenue"]
R7["AdvertisingRevenue"]
R8["SubscriptionRevenue"]
R9["RoyaltyRevenue"]
R10["ElectricUtilityRevenue"]
R11["PassengerRevenue"]
R12["CargoAndFreightRevenue"]
R13["... 45+ more types"]
end
subgraph "Special Revenue Categories"
RH1["InterestAndDividendIncomeOperating"]
RH2["RevenuesExcludingInterestAndDividends"]
RH3["RevenuesNetOfInterestExpense"]
RH4["InvestmentBankingRevenue"]
end
subgraph "Output Concepts"
Rev["FundamentalConcept::Revenues"]
RevInt["FundamentalConcept::InterestAndDividendIncomeOperating"]
RevExcl["FundamentalConcept::RevenuesExcludingInterestAndDividends"]
RevNet["FundamentalConcept::RevenuesNetInterestExpense"]
end
R1 --> Rev
R2 --> Rev
R3 --> Rev
R4 --> Rev
R5 --> Rev
R6 --> Rev
R7 --> Rev
R8 --> Rev
R9 --> Rev
R10 --> Rev
R11 --> Rev
R12 --> Rev
R13 --> Rev
RH1 --> RevInt
RH1 --> Rev
RH2 --> RevExcl
RH2 --> Rev
RH3 --> RevNet
RH3 --> Rev
RH4 --> RevExcl
RH4 --> Rev
style Rev fill:#f9f9f9
style RevInt fill:#f9f9f9
style RevExcl fill:#f9f9f9
style RevNet fill:#f9f9f9
Sample Revenue Mappings
| Industry | Raw US GAAP Concept | FundamentalConcept Output |
|---|---|---|
| General | Revenues | vec![Revenues] |
| General | SalesRevenueNet | vec![Revenues] |
| Healthcare | HealthCareOrganizationRevenue | vec![Revenues] |
| Real Estate | RealEstateRevenueNet | vec![Revenues] |
| Energy | OilAndGasRevenue | vec![Revenues] |
| Mining | RevenueMineralSales | vec![Revenues] |
| Hospitality | RevenueFromLeasedAndOwnedHotels | vec![Revenues] |
| Franchise | FranchisorRevenue | vec![Revenues] |
| Media | SubscriptionRevenue | vec![Revenues] |
| Media | AdvertisingRevenue | vec![Revenues] |
| Entertainment | AdmissionsRevenue | vec![Revenues] |
| Licensing | LicensesRevenue | vec![Revenues] |
| Licensing | RoyaltyRevenue | vec![Revenues] |
| Transportation | PassengerRevenue | vec![Revenues] |
| Transportation | CargoAndFreightRevenue | vec![Revenues] |
| Utilities | ElectricUtilityRevenue | vec![Revenues] |
| Financial | InterestAndDividendIncomeOperating | vec![InterestAndDividendIncomeOperating, Revenues] |
| Financial | InvestmentBankingRevenue | vec![RevenuesExcludingInterestAndDividends, Revenues] |
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:954-1188 tests/distill_us_gaap_fundamental_concepts_tests.rs:1191-1225
The distill_us_gaap_fundamental_concepts Function
The core transformation function accepts a string representation of a US GAAP concept name and returns an Option<Vec<FundamentalConcept>>. The return type is an Option because not all US GAAP concepts are mapped (unmapped concepts return None), and a Vec because some concepts map to multiple standardized variants (hierarchical pattern).
graph TB
subgraph "Function Input"
InputStr["Input: &str\nUS GAAP concept name"]
end
subgraph "distill_us_gaap_fundamental_concepts"
Parse["Parse input string"]
MatchEngine["Pattern Matching Engine"]
Decision{"Concept\nRecognized?"}
BuildVec["Construct Vec<FundamentalConcept>\nApply mapping pattern"]
ReturnSome["Return Some(Vec<FundamentalConcept>)"]
ReturnNone["Return None"]
end
subgraph "Function Output"
OutputSome["Option::Some(Vec<FundamentalConcept>)\n1-2 variants typically"]
OutputNone["Option::None\nUnrecognized concept"]
end
InputStr --> Parse
Parse --> MatchEngine
MatchEngine --> Decision
Decision -->|Yes| BuildVec
Decision -->|No| ReturnNone
BuildVec --> ReturnSome
ReturnSome --> OutputSome
ReturnNone --> OutputNone
style MatchEngine fill:#f9f9f9
style ReturnSome fill:#f9f9f9
Function Signature and Flow
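A sketch of the signature implied by the description above (the module path and exact declaration are not reproduced here):

```rust
// Sketch; consistent with the behavior described in this section.
pub fn distill_us_gaap_fundamental_concepts(
    us_gaap_concept: &str,
) -> Option<Vec<FundamentalConcept>> {
    todo!()
}
```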
Example Usage
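A hedged usage sketch based on the mappings listed in the tables above:

```rust
// Hierarchical mapping: the specific concept comes first, then its parent.
assert_eq!(
    distill_us_gaap_fundamental_concepts("AssetsCurrent"),
    Some(vec![
        FundamentalConcept::CurrentAssets,
        FundamentalConcept::Assets,
    ])
);

// Unrecognized concepts are not mapped.
assert_eq!(distill_us_gaap_fundamental_concepts("NotARealConcept"), None);
```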
Sources: examples/us_gaap_human_readable.rs:1-9 tests/distill_us_gaap_fundamental_concepts_tests.rs:4-8
Complex Hierarchical Examples
Some concepts demonstrate multi-level hierarchical relationships where a specific concept may relate to multiple parent categories.
graph TD
subgraph "Net Income Loss Hierarchy"
NIL1["NetIncomeLoss\n(raw input)"]
NIL2["FundamentalConcept::NetIncomeLossAttributableToParent"]
NIL3["FundamentalConcept::NetIncomeLoss"]
NILCS["NetIncomeLossAvailableToCommonStockholdersBasic\n(raw input)"]
NILCS1["FundamentalConcept::NetIncomeLossAvailableToCommonStockholdersBasic"]
NILCS2["FundamentalConcept::NetIncomeLoss"]
ILCO1["IncomeLossFromContinuingOperations\n(raw input)"]
ILCO2["FundamentalConcept::IncomeLossFromContinuingOperationsAfterTax"]
ILCO3["FundamentalConcept::NetIncomeLoss"]
NIL1 --> NIL2
NIL1 --> NIL3
NILCS --> NILCS1
NILCS --> NILCS2
ILCO1 --> ILCO2
ILCO1 --> ILCO3
end
subgraph "Comprehensive Income Hierarchy"
CI1["ComprehensiveIncomeNetOfTax\n(raw input)"]
CI2["FundamentalConcept::ComprehensiveIncomeLossAttributableToParent"]
CI3["FundamentalConcept::ComprehensiveIncomeLoss"]
CI1 --> CI2
CI1 --> CI3
end
subgraph "Revenue Hierarchy"
RNOI["RevenuesNetOfInterestExpense\n(raw input)"]
RNOI1["FundamentalConcept::RevenuesNetInterestExpense"]
RNOI2["FundamentalConcept::Revenues"]
RNOI --> RNOI1
RNOI --> RNOI2
end
style NIL2 fill:#f9f9f9
style NIL3 fill:#f9f9f9
style NILCS1 fill:#f9f9f9
style NILCS2 fill:#f9f9f9
style ILCO2 fill:#f9f9f9
style ILCO3 fill:#f9f9f9
style CI2 fill:#f9f9f9
style CI3 fill:#f9f9f9
style RNOI1 fill:#f9f9f9
style RNOI2 fill:#f9f9f9
Income Statement Hierarchies
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:686-725 tests/distill_us_gaap_fundamental_concepts_tests.rs:39-77 tests/distill_us_gaap_fundamental_concepts_tests.rs:976-982
Cash Flow Hierarchies
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:550-683
Equity Concept Mappings
Equity concepts demonstrate all four mapping patterns, including complex organizational structures (stockholders vs. partners vs. members) and noncontrolling interest variations.
Equity Structure Overview
| Organizational Type | Raw US GAAP Concept | FundamentalConcept Output |
|---|---|---|
| Corporation (total) | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | vec![Equity] |
| Corporation (parent) | StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| Partnership (total) | PartnersCapitalIncludingPortionAttributableToNoncontrollingInterest | vec![Equity] |
| Partnership (parent) | PartnersCapital | vec![EquityAttributableToParent, Equity] |
| LLC (parent) | MembersEquity | vec![EquityAttributableToParent, Equity] |
| Generic | CommonStockholdersEquity | vec![Equity] |
Temporary Equity Variations
| Raw US GAAP Concept | FundamentalConcept Output |
|---|---|
TemporaryEquityCarryingAmount | vec![TemporaryEquity] |
TemporaryEquityRedemptionValue | vec![TemporaryEquity] |
RedeemablePreferredStockCarryingAmount | vec![TemporaryEquity] |
TemporaryEquityCarryingAmountAttributableToParent | vec![TemporaryEquity] |
TemporaryEquityCarryingAmountAttributableToNoncontrollingInterest | vec![TemporaryEquity] |
TemporaryEquityLiquidationPreference | vec![TemporaryEquity] |
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:150-193 tests/distill_us_gaap_fundamental_concepts_tests.rs:262-286 tests/distill_us_gaap_fundamental_concepts_tests.rs:1228-1274
graph TB
subgraph "Test Structure"
TestFile["distill_us_gaap_fundamental_concepts_tests.rs\n1275 lines, 70+ tests"]
BalanceSheet["Balance Sheet Tests\n- Assets (one-to-one)\n- Current Assets (hierarchical)\n- Equity (multiple org types)\n- Temporary Equity (7 variations)"]
IncomeStatement["Income Statement Tests\n- Revenues (57+ variations)\n- Cost of Revenue (6 synonyms)\n- Net Income (hierarchical)\n- Comprehensive Income (hierarchical)"]
CashFlow["Cash Flow Tests\n- Operating Activities\n- Investing Activities\n- Financing Activities\n- Continuing vs Discontinued"]
Special["Special Category Tests\n- Commitments and Contingencies\n- Nature of Operations\n- Exchange Gains/Losses\n- Research and Development"]
TestFile --> BalanceSheet
TestFile --> IncomeStatement
TestFile --> CashFlow
TestFile --> Special
end
Testing Strategy
The transformation system has comprehensive test coverage with 70+ test functions covering all 64 FundamentalConcept variants and their various input mappings.
Test Organization
Test Pattern Examples
Each test function validates one or more related mappings:
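A hedged sketch of that style; the test name is invented and the assertions mirror the synonym table above (the real tests live in tests/distill_us_gaap_fundamental_concepts_tests.rs):

```rust
#[test]
fn it_consolidates_cost_of_revenue_synonyms() {
    for concept in [
        "CostOfRevenue",
        "CostOfGoodsAndServicesSold",
        "CostOfServices",
        "CostOfGoodsSold",
    ] {
        assert_eq!(
            distill_us_gaap_fundamental_concepts(concept),
            Some(vec![FundamentalConcept::CostOfRevenue]),
            "expected {concept} to map to CostOfRevenue"
        );
    }
}
```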
Test Coverage Summary
| Concept Category | Test Functions | Total Assertions | Pattern Types Tested |
|---|---|---|---|
| Balance Sheet | 15 | 35+ | All 4 patterns |
| Income Statement | 25 | 80+ | All 4 patterns |
| Cash Flow | 18 | 25+ | Hierarchical, One-to-One |
| Equity | 12 | 40+ | All 4 patterns |
| Total | 70+ | 180+ | All 4 patterns |
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275
graph TB
subgraph "Data Source"
SECAPI["SEC EDGAR API\ndata endpoint"]
RawJSON["Raw JSON Response\nCompany Facts"]
end
subgraph "Rust Processing Pipeline"
FetchFn["fetch_us_gaap_fundamentals\n(network module)"]
ParseJSON["Parse JSON\nExtract concept names"]
DistillFn["distill_us_gaap_fundamental_concepts\n(transformers module)"]
FilterMapped["Filter to mapped concepts\nDiscard None results"]
BuildRecords["Build data records\nwith FundamentalConcept enum"]
end
subgraph "Storage Layer"
CSVOutput["CSV Files\nus-gaap/[ticker].csv"]
PolarsDF["Polars DataFrame\nStructured columns"]
end
subgraph "Python ML Pipeline"
Ingestion["Data Ingestion\nus_gaap_store.ingest_us_gaap_csvs"]
Preprocessing["Preprocessing\nConcept/unit pair extraction"]
end
SECAPI --> RawJSON
RawJSON --> FetchFn
FetchFn --> ParseJSON
ParseJSON --> DistillFn
DistillFn --> FilterMapped
FilterMapped --> BuildRecords
BuildRecords --> PolarsDF
PolarsDF --> CSVOutput
CSVOutput --> Ingestion
Ingestion --> Preprocessing
style DistillFn fill:#f9f9f9
style FilterMapped fill:#f9f9f9
Integration with Data Pipeline
The distill_us_gaap_fundamental_concepts function is a critical component in the broader data processing pipeline, serving as the normalization layer between raw SEC filings and structured data storage.
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9
Summary
The US GAAP concept transformation system provides:
- Standardization : Maps 200+ raw US GAAP concept names to 64 standardized FundamentalConcept variants
- Flexibility : Supports four mapping patterns (one-to-one, hierarchical, synonyms, industry-specific) to handle diverse reporting practices
- Queryability : Hierarchical mappings enable queries at multiple granularity levels (e.g., query for all Assets or specifically CurrentAssets)
- Reliability : Comprehensive test coverage with 70+ test functions and 180+ assertions validates all mapping patterns
- Integration : Serves as a critical normalization layer between the SEC EDGAR API and downstream data processing/ML pipelines
The transformation system represents the highest-importance component (8.37) in the Rust codebase, enabling consistent financial data analysis across companies with varying reporting conventions.
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Data Models & Enumerations
Relevant source files
- src/enums.rs
- src/enums/fundamental_concept_enum.rs
- src/models.rs
- src/models/accession_number.rs
- src/models/nport_investment.rs
Purpose and Scope
This page documents the core data structures and enumerations used throughout the rust-sec-fetcher application. These models represent SEC financial data, including company identifiers, filing metadata, investment holdings, and financial concepts. The data models are defined in src/models.rs:1-18 and src/enums.rs:1-12.
For information about how these models are used in data fetching operations, see Data Fetching Functions. For details on how FundamentalConcept is used in the transformation pipeline, see US GAAP Concept Transformation.
Sources: src/models.rs:1-18 src/enums.rs:1-12
SEC Identifier Models
The system uses three primary identifier types to reference companies and filings within the SEC EDGAR system.
Ticker
The Ticker struct represents a company's stock ticker symbol along with its SEC identifiers. It is returned by the fetch_company_tickers function and serves as the primary entry point for company lookups.
Structure:
- cik: Cik - The company's Central Index Key
- ticker_symbol: String - Stock ticker symbol (e.g., "AAPL")
- company_name: String - Full company name
- origin: TickerOrigin - Source of the ticker data
Sources: src/models.rs:10-11
Cik (Central Index Key)
The Cik struct represents a 10-digit SEC identifier that uniquely identifies a company or entity. CIKs are zero-padded to exactly 10 digits.
Structure:
- value: u64 - The numeric CIK value (0 to 9,999,999,999)
Validation:
- CIK values must not exceed 10 digits
- Values are zero-padded when formatted (e.g., 123 → "0000000123")
- Parsing from strings handles both padded and unpadded formats
Error Handling:
- CikError::InvalidCik - Raised when the CIK exceeds 10 digits
- CikError::ParseError - Raised when parsing fails
Sources: src/models.rs:4-5
AccessionNumber
The AccessionNumber struct represents a unique identifier for SEC filings. Each accession number is exactly 18 digits and encodes the filer's CIK, filing year, and sequence number.
Format: XXXXXXXXXX-YY-NNNNNN
Components:
- CIK (10 digits) - The SEC identifier of the filer
- Year (2 digits) - Last two digits of the filing year
- Sequence (6 digits) - Unique sequence number within the year
Example: "0001234567-23-000045" represents:
- CIK: 0001234567
- Year: 2023
- Sequence: 000045
Key Methods:
- from_str(accession_str: &str) - Parses from a string (with or without dashes)
- from_parts(cik_u64: u64, year: u16, sequence: u32) - Constructs from components
- to_string() - Returns the dash-separated format
- to_unformatted_string() - Returns the plain 18-digit string
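A usage sketch based on the example above and the listed methods (error handling elided; exposing from_str via the FromStr trait is an assumption):

```rust
use std::str::FromStr;

let accession = AccessionNumber::from_str("0001234567-23-000045")?;
assert_eq!(accession.to_string(), "0001234567-23-000045");
assert_eq!(accession.to_unformatted_string(), "000123456723000045");
```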
Sources: src/models/accession_number.rs:1-188 src/models.rs:1-3
SEC Identifier Relationships
Sources: src/models.rs:1-18 src/models/accession_number.rs:35-187
Filing Data Structures
CikSubmission
The CikSubmission struct represents metadata about an SEC filing submission. It is returned by the fetch_cik_submissions function.
Key Fields:
- cik: Cik - The filer's Central Index Key
- accession_number: AccessionNumber - Unique filing identifier
- form: String - Filing form type (e.g., "10-K", "10-Q", "NPORT-P")
- primary_document: String - Main document filename
- filing_date: String - Date the filing was submitted
Sources: src/models.rs:7-8
NportInvestment
The NportInvestment struct represents a single investment holding from an NPORT-P filing (monthly portfolio holdings report for registered investment companies).
Mapped Fields (linked to company data):
- mapped_ticker_symbol: Option<String> - Ticker symbol if matched
- mapped_company_name: Option<String> - Company name if matched
- mapped_company_cik_number: Option<String> - CIK if matched
Investment Identifiers:
- name: String - Investment name
- lei: String - Legal Entity Identifier
- cusip: String - Committee on Uniform Securities Identification Procedures ID
- isin: String - International Securities Identification Number
- title: String - Investment title
Financial Values:
- balance: Decimal - Number of shares or units held
- val_usd: Decimal - Value in USD
- pct_val: Decimal - Percentage of total portfolio value
- cur_cd: String - Currency code
Classifications:
- asset_cat: String - Asset category
- issuer_cat: String - Issuer category
- payoff_profile: String - Payoff profile type
- inv_country: String - Investment country
Utility Methods:
- sort_by_pct_val_desc(investments: &mut Vec<NportInvestment>) - Sorts holdings by percentage value in descending order
Sources: src/models/nport_investment.rs:1-46 src/models.rs:16-17
InvestmentCompany
The InvestmentCompany struct represents an investment company (mutual fund, ETF, etc.) registered with the SEC. It is returned by the fetch_investment_companies function.
Sources: src/models.rs:13-14
Filing Data Structure Relationships
Sources: src/models.rs:1-18 src/models/nport_investment.rs:8-46
FundamentalConcept Enumeration
The FundamentalConcept enum is the most critical enumeration in the system (importance: 8.37), defining 64 standardized financial concepts derived from US GAAP (Generally Accepted Accounting Principles) taxonomies. These concepts are used by the distill_us_gaap_fundamental_concepts transformer to normalize diverse financial reporting variations into a consistent taxonomy.
Definition: src/enums/fundamental_concept_enum.rs:1-72
Traits Implemented:
- Eq, PartialEq - Equality comparison
- Hash - Hashable for use in maps
- Clone - Cloneable
- EnumString - Parse from string
- EnumIter - Iterate over all variants
- Display - Format as string
- Debug - Debug formatting
Sources: src/enums/fundamental_concept_enum.rs:1-5 src/enums.rs:4-5
Concept Categories
The 64 FundamentalConcept variants are organized into four primary categories corresponding to major financial statement types, plus a small group of other and special-purpose concepts:
| Category | Concept Count | Description |
|---|---|---|
| Balance Sheet | 13 | Assets, liabilities, equity positions |
| Income Statement | 23 | Revenues, expenses, income/loss measures |
| Cash Flow Statement | 13 | Operating, investing, financing cash flows |
| Equity & Comprehensive Income | 6 | Equity attributions and comprehensive income |
| Other | 9 | Special items, metadata, and miscellaneous |
Sources: src/enums/fundamental_concept_enum.rs:4-72
Complete FundamentalConcept Taxonomy
Balance Sheet Concepts
| Variant | Description |
|---|---|
Assets | Total assets |
CurrentAssets | Assets expected to be converted to cash within one year |
NoncurrentAssets | Long-term assets |
Liabilities | Total liabilities |
CurrentLiabilities | Obligations due within one year |
NoncurrentLiabilities | Long-term obligations |
LiabilitiesAndEquity | Total liabilities plus equity (must equal total assets) |
Equity | Total shareholder equity |
EquityAttributableToParent | Equity attributable to parent company shareholders |
EquityAttributableToNoncontrollingInterest | Equity attributable to minority shareholders |
TemporaryEquity | Mezzanine equity (e.g., redeemable preferred stock) |
RedeemableNoncontrollingInterest | Redeemable minority interest |
CommitmentsAndContingencies | Off-balance-sheet obligations |
Income Statement Concepts
| Variant | Description |
|---|---|
IncomeStatementStartPeriodYearToDate | Statement start period marker |
Revenues | Total revenues (consolidated from 57+ variations) |
RevenuesExcludingInterestAndDividends | Non-interest revenues |
RevenuesNetInterestExpense | Revenues after interest expense |
CostOfRevenue | Direct costs of producing goods/services |
GrossProfit | Revenue minus cost of revenue |
OperatingExpenses | Operating expenses excluding COGS |
ResearchAndDevelopment | R&D expenses |
CostsAndExpenses | Total costs and expenses |
BenefitsCostsExpenses | Employee benefits and related costs |
OperatingIncomeLoss | Income from operations |
NonoperatingIncomeLoss | Income from non-operating activities |
OtherOperatingIncomeExpenses | Other operating items |
IncomeLossBeforeEquityMethodInvestments | Income before equity investments |
IncomeLossFromEquityMethodInvestments | Income from equity-method investees |
IncomeLossFromContinuingOperationsBeforeTax | Pre-tax income from continuing operations |
IncomeTaxExpenseBenefit | Income tax expense or benefit |
IncomeLossFromContinuingOperationsAfterTax | After-tax income from continuing operations |
IncomeLossFromDiscontinuedOperationsNetOfTax | After-tax income from discontinued operations |
ExtraordinaryItemsOfIncomeExpenseNetOfTax | Extraordinary items (after-tax) |
NetIncomeLoss | Bottom-line net income or loss |
NetIncomeLossAttributableToParent | Net income attributable to parent shareholders |
NetIncomeLossAttributableToNoncontrollingInterest | Net income attributable to minority shareholders |
NetIncomeLossAvailableToCommonStockholdersBasic | Net income available to common shareholders |
PreferredStockDividendsAndOtherAdjustments | Preferred dividends and adjustments |
Industry-Specific Income Statement Concepts
| Variant | Description |
|---|---|
InterestAndDividendIncomeOperating | Interest and dividend income (financial institutions) |
InterestExpenseOperating | Interest expense (financial institutions) |
InterestIncomeExpenseOperatingNet | Net interest income (banks) |
InterestAndDebtExpense | Total interest and debt expense |
InterestIncomeExpenseAfterProvisionForLosses | Interest income after loan loss provisions |
ProvisionForLoanLeaseAndOtherLosses | Provision for credit losses (banks) |
NoninterestIncome | Non-interest income (financial institutions) |
NoninterestExpense | Non-interest expense (financial institutions) |
Cash Flow Statement Concepts
| Variant | Description |
|---|---|
NetCashFlow | Total net cash flow |
NetCashFlowContinuing | Net cash flow from continuing operations |
NetCashFlowDiscontinued | Net cash flow from discontinued operations |
NetCashFlowFromOperatingActivities | Cash from operating activities |
NetCashFlowFromOperatingActivitiesContinuing | Operating cash flow (continuing) |
NetCashFlowFromOperatingActivitiesDiscontinued | Operating cash flow (discontinued) |
NetCashFlowFromInvestingActivities | Cash from investing activities |
NetCashFlowFromInvestingActivitiesContinuing | Investing cash flow (continuing) |
NetCashFlowFromInvestingActivitiesDiscontinued | Investing cash flow (discontinued) |
NetCashFlowFromFinancingActivities | Cash from financing activities |
NetCashFlowFromFinancingActivitiesContinuing | Financing cash flow (continuing) |
NetCashFlowFromFinancingActivitiesDiscontinued | Financing cash flow (discontinued) |
ExchangeGainsLosses | Foreign exchange gains/losses |
Equity & Comprehensive Income Concepts
| Variant | Description |
|---|---|
ComprehensiveIncomeLoss | Total comprehensive income |
ComprehensiveIncomeLossAttributableToParent | Comprehensive income attributable to parent |
ComprehensiveIncomeLossAttributableToNoncontrollingInterest | Comprehensive income attributable to minorities |
OtherComprehensiveIncomeLoss | Other comprehensive income (OCI) |
Other Concepts
| Variant | Description |
|---|---|
NatureOfOperations | Description of business operations |
GainLossOnSalePropertiesNetTax | Gains/losses on property sales (after-tax) |
Sources: src/enums/fundamental_concept_enum.rs:4-72
FundamentalConcept Taxonomy Diagram
Sources: src/enums/fundamental_concept_enum.rs:4-72
Other Enumerations
CacheNamespacePrefix
The CacheNamespacePrefix enum defines namespace prefixes used by the caching system to organize cached data by type. Each prefix corresponds to a specific data fetching operation.
Usage: Cache keys are constructed by combining a namespace prefix with a specific identifier (e.g., CacheNamespacePrefix::CompanyTickers + ticker_symbol).
Sources: src/enums.rs:1-2
TickerOrigin
The TickerOrigin enum indicates the source or origin of a ticker symbol.
Variants:
- Different ticker sources (e.g., SEC company tickers API, NPORT filings, manual mapping)
Usage: Stored in the Ticker struct to track data provenance.
Sources: src/enums.rs:7-8
Url
The Url enum defines the various SEC EDGAR API endpoints used by the application. Each variant represents a specific API URL pattern.
Usage: Used by the network layer to construct API requests without hardcoding URLs throughout the codebase.
Sources: src/enums.rs:10-11
Enumeration Module Structure
Sources: src/enums.rs:1-12
graph TB
subgraph Identifiers["SEC Identifier Models"]
Cik["Cik\nvalue: u64"]
Ticker["Ticker\ncik: Cik\nticker_symbol: String\ncompany_name: String\norigin: TickerOrigin"]
AccessionNumber["AccessionNumber\ncik: Cik\nyear: u16\nsequence: u32"]
end
subgraph Filings["Filing Data Models"]
CikSubmission["CikSubmission\ncik: Cik\naccession_number: AccessionNumber\nform: String\nprimary_document: String\nfiling_date: String"]
InvestmentCompany["InvestmentCompany\n(identified by Cik)"]
NportInvestment["NportInvestment\nmapped_ticker_symbol: Option<String>\nmapped_company_name: Option<String>\nmapped_company_cik_number: Option<String>\nname, lei, cusip, isin\nbalance, val_usd, pct_val"]
end
subgraph Enums["Enumerations"]
FundamentalConcept["FundamentalConcept\n64 variants\n(Balance Sheet, Income Statement,\nCash Flow, Equity)"]
TickerOrigin["TickerOrigin\n(ticker source/origin)"]
CacheNamespacePrefix["CacheNamespacePrefix\n(cache organization)"]
UrlEnum["Url\n(SEC EDGAR endpoints)"]
end
subgraph NetworkFunctions["Network Functions (3.2)"]
FetchTickers["fetch_company_tickers\n→ Vec<Ticker>"]
FetchCIK["fetch_cik_by_ticker_symbol\n→ Option<Cik>"]
FetchSubmissions["fetch_cik_submissions\n→ Vec<CikSubmission>"]
FetchNPORT["fetch_nport_filing\n→ Vec<NportInvestment>"]
FetchInvCo["fetch_investment_companies\n→ Vec<InvestmentCompany>"]
end
subgraph Transformer["Transformation (3.3)"]
Distill["distill_us_gaap_fundamental_concepts\nuses FundamentalConcept"]
end
Ticker -->|contains| Cik
Ticker -->|uses| TickerOrigin
AccessionNumber -->|contains| Cik
CikSubmission -->|contains| Cik
CikSubmission -->|contains| AccessionNumber
InvestmentCompany -.->|identified by| Cik
NportInvestment -.->|mapped to| Ticker
FetchTickers -.->|returns| Ticker
FetchCIK -.->|returns| Cik
FetchSubmissions -.->|returns| CikSubmission
FetchNPORT -.->|returns| NportInvestment
FetchInvCo -.->|returns| InvestmentCompany
Distill -.->|uses| FundamentalConcept
Data Model Relationships
The following diagram illustrates how the core data models relate to each other and which models contain or reference other models.
Sources: src/models.rs:1-18 src/enums.rs:1-12
Implementation Details
Error Handling
Both Cik and AccessionNumber implement custom error types to handle parsing and validation failures:
CikError:
InvalidCik- CIK exceeds 10 digitsParseError- Parsing from string failed
AccessionNumberError:
InvalidLength- Accession number is not 18 digitsParseError- Parsing failedCikError- Wrapped CIK error
Both error types implement std::fmt::Display and std::error::Error for proper error propagation.
Sources: src/models/accession_number.rs:42-76
Serialization
The NportInvestment struct uses serde for serialization/deserialization with serde_with for custom formatting:
- #[serde_as(as = "DisplayFromStr")] - Applied to Decimal fields (balance, val_usd, pct_val) to serialize them as strings
- Option<T> fields are used for nullable data from NPORT filings
Sources: src/models/nport_investment.rs:3-38
Decimal Precision
Financial values in NportInvestment use the rust_decimal::Decimal type, which provides:
- Arbitrary precision decimal arithmetic
- No floating-point rounding errors
- Safe comparison operations
Sources: src/models/nport_investment.rs:2-33
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Caching & Storage System
Relevant source files
Purpose and Scope
This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive. For information about network request throttling and retry logic, see Network Layer & SecClient. For database storage used by the Python components, see Database & Storage Integration.
Overview
The caching system provides two distinct cache layers:
- HTTP Cache : Stores raw HTTP responses from SEC EDGAR API requests
- Preprocessor Cache : Stores transformed and processed data structures
Both caches use the simd-r-drive key-value storage backend with persistent file-based storage, initialized once at application startup using the OnceLock pattern for thread-safe static access.
Sources: src/caches.rs:1-66
Caching Architecture
The following diagram illustrates the complete caching architecture and how it integrates with the network layer:
Cache Initialization Flow
graph TB
subgraph "Application Initialization"
Main["main.rs"]
ConfigMgr["ConfigManager"]
CachesInit["Caches::init(config_manager)"]
end
subgraph "Static Cache Storage (OnceLock)"
HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nOnceLock<Arc<DataStore>>"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\nOnceLock<Arc<DataStore>>"]
end
subgraph "File System"
HTTPFile["cache_base_dir/http_storage_cache.bin"]
PreFile["cache_base_dir/preprocessor_cache.bin"]
end
subgraph "Network Layer"
SecClient["SecClient"]
CacheMiddleware["reqwest_drive\nCache Middleware"]
ThrottleMiddleware["Throttle Middleware"]
CachePolicy["CachePolicy\nTTL: 1 week\nrespect_headers: false"]
end
subgraph "Cache Access"
GetHTTP["Caches::get_http_cache_store()"]
GetPre["Caches::get_preprocessor_cache()"]
end
Main --> ConfigMgr
ConfigMgr --> CachesInit
CachesInit --> HTTPCache
CachesInit --> PreCache
HTTPCache -.->|DataStore::open| HTTPFile
PreCache -.->|DataStore::open| PreFile
SecClient --> GetHTTP
GetHTTP --> HTTPCache
HTTPCache --> CacheMiddleware
CacheMiddleware --> CachePolicy
CacheMiddleware --> ThrottleMiddleware
GetPre --> PreCache
The caching system is initialized early in the application lifecycle using a configuration-driven approach that ensures thread-safe access across the async runtime.
Sources: src/caches.rs:1-66 src/network/sec_client.rs:1-181
Cache Initialization System
OnceLock Pattern
The system uses Rust's OnceLock for lazy, thread-safe initialization of static cache instances:
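A sketch of the statics named in the diagram above (the DataStore import path is an assumption about the simd-r-drive crate):

```rust
use std::sync::{Arc, OnceLock};
use simd_r_drive::DataStore; // assumed import path

static SIMD_R_DRIVE_HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
static SIMD_R_DRIVE_PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
```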
This pattern ensures:
- Single initialization : Each cache is initialized exactly once
- Thread safety : Safe access from multiple async tasks
- Zero overhead after init : No locking required for read access after initialization
Sources: src/caches.rs:7-8
Initialization Process
Initialization Code Flow
The Caches::init() method performs the following steps:
- Retrieves cache_base_dir from ConfigManager
- Constructs paths for both cache files
- Opens each DataStore instance
- Sets the OnceLock with Arc-wrapped stores
- Logs warnings if already initialized (prevents re-initialization panics)
Sources: src/caches.rs:13-48
HTTP Cache Integration
graph LR
subgraph "SecClient Initialization"
FromConfig["SecClient::from_config_manager()"]
GetHTTPCache["Caches::get_http_cache_store()"]
CreatePolicy["CachePolicy::new()"]
CreateThrottle["ThrottlePolicy::new()"]
end
subgraph "Middleware Chain"
InitDrive["init_cache_with_drive_and_throttle()"]
DriveCache["drive_cache middleware"]
ThrottleCache["throttle_cache middleware"]
InitClient["init_client_with_cache_and_throttle()"]
end
subgraph "HTTP Client"
ClientWithMiddleware["ClientWithMiddleware"]
end
FromConfig --> GetHTTPCache
FromConfig --> CreatePolicy
FromConfig --> CreateThrottle
GetHTTPCache --> InitDrive
CreatePolicy --> InitDrive
CreateThrottle --> InitDrive
InitDrive --> DriveCache
InitDrive --> ThrottleCache
DriveCache --> InitClient
ThrottleCache --> InitClient
InitClient --> ClientWithMiddleware
SecClient Cache Setup
The SecClient integrates with the HTTP cache through the reqwest_drive middleware system:
SecClient Construction
The SecClient struct maintains references to both cache and throttle policies:
| Field | Type | Purpose |
|---|---|---|
email | String | User-Agent identification |
http_client | ClientWithMiddleware | HTTP client with middleware chain |
cache_policy | Arc<CachePolicy> | Cache configuration settings |
throttle_policy | Arc<ThrottlePolicy> | Request throttling configuration |
Sources: src/network/sec_client.rs:14-19 src/network/sec_client.rs:73-81
CachePolicy Configuration
The CachePolicy struct defines cache behavior:
| Parameter | Value | Description |
|---|---|---|
default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week time-to-live |
respect_headers | false | Ignore HTTP cache headers from SEC API |
cache_status_override | None | No custom status code handling |
The 1-week TTL is currently hardcoded but marked for future configurability.
Sources: src/network/sec_client.rs:45-50
Cache Access Patterns
HTTP Cache Access Flow
Request Processing
When SecClient makes a request:
- The fetch_json() method calls raw_request() with the target URL
- The middleware chain intercepts the request
- Cache middleware checks the DataStore for a matching entry
- If found and not expired (TTL check), returns the cached response
- If not found, forwards the request to the throttle middleware and HTTP client
- The response is stored in the cache with a timestamp before returning
Sources: src/network/sec_client.rs:140-179 tests/sec_client_tests.rs:35-62
Preprocessor Cache Access
The preprocessor cache is accessed directly through the Caches module API:
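A hedged sketch of that access pattern, reusing the TTL-aware read shown for the investment-company year cache; the key construction and return type are assumptions:

```rust
// Illustrative only: read a previously cached value with a TTL check.
// In the real code this is an associated function on Caches.
let preprocessor_cache = Caches::get_preprocessor_cache();
let latest_funds_year: Option<usize> =
    preprocessor_cache.read_with_ttl::<usize>(&cache_key); // `cache_key` is hypothetical
```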
This cache is intended for storing transformed data structures after processing, though the current codebase primarily uses it as infrastructure for future preprocessing optimization.
Sources: src/caches.rs:59-64
Storage Backend: simd-r-drive
DataStore Integration
The simd-r-drive crate provides the persistent key-value storage backend:
DataStore Characteristics
| Feature | Description |
|---|---|
| Persistence | All data survives application restarts |
| Binary Format | Stores arbitrary byte arrays |
| Thread-Safe | Safe for concurrent access from async tasks |
| File-Backed | Single file per cache instance |
| No Network | Local-only storage (unlike WebSocket mode) |
Sources: src/caches.rs:3 src/caches.rs:22-37
Cache Configuration
ConfigManager Integration
The caching system reads configuration from ConfigManager:
| Config Field | Purpose | Default |
|---|---|---|
cache_base_dir | Parent directory for cache files | Required (no default) |
The cache base directory path is constructed at runtime and must be set before initialization:
Cache File Paths
- HTTP Cache: {cache_base_dir}/http_storage_cache.bin
- Preprocessor Cache: {cache_base_dir}/preprocessor_cache.bin
Sources: src/caches.rs:16 src/caches.rs:20 src/caches.rs:34
ThrottlePolicy Configuration
While not part of the cache storage itself, the ThrottlePolicy is closely integrated with the cache middleware:
| Parameter | Config Source | Purpose |
|---|---|---|
base_delay_ms | config.min_delay_ms | Minimum delay between requests |
max_concurrent | config.max_concurrent | Maximum parallel requests |
max_retries | config.max_retries | Retry attempts for failed requests |
adaptive_jitter_ms | Hardcoded: 500 | Random jitter for backoff |
Sources: src/network/sec_client.rs:53-59
Error Handling
Initialization Errors
The cache initialization handles several error scenarios:
Panic Conditions:
- DataStore::open() fails (I/O error, permission denied, etc.)
- Calling get_http_cache_store() before Caches::init()
- Calling get_preprocessor_cache() before Caches::init()
Warnings:
- Attempting to reinitialize an already-set OnceLock (logs a warning, doesn't fail)
Sources: src/caches.rs:25-29 src/caches.rs:51-55
Runtime Access Errors
Both cache accessor methods use expect() with descriptive messages:
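A sketch of what such an accessor looks like; in the real code these are associated functions on Caches and the message text is illustrative:

```rust
pub fn get_http_cache_store() -> Arc<DataStore> {
    SIMD_R_DRIVE_HTTP_CACHE
        .get()
        .expect("HTTP cache accessed before Caches::init() was called")
        .clone()
}
```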
These panics indicate programmer errors (accessing cache before initialization) rather than runtime failures.
Sources: src/caches.rs:51-56 src/caches.rs:59-64
Testing
Cache Testing Strategy
The test suite verifies caching behavior indirectly through SecClient tests:
Test Coverage:
| Test | Purpose | File |
|---|---|---|
test_fetch_json_without_retry_success | Verifies basic fetch with caching enabled | tests/sec_client_tests.rs:36-62 |
test_fetch_json_with_retry_success | Tests cache behavior with successful retry | tests/sec_client_tests.rs:65-91 |
test_fetch_json_with_retry_failure | Validates cache doesn't store failed responses | tests/sec_client_tests.rs:94-120 |
test_fetch_json_with_retry_backoff | Tests cache with retry backoff logic | tests/sec_client_tests.rs:123-158 |
Mock Server Testing:
Tests use mockito::Server to simulate SEC API responses without requiring cache setup, as the test environment creates ephemeral cache instances per test run.
Sources: tests/sec_client_tests.rs:1-159
Usage Patterns
Typical Initialization Flow
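A hedged sketch of the startup order described on this page; whether Caches::init and SecClient::from_config_manager take the ConfigManager by reference (and what they return) are assumptions:

```rust
let config_manager = ConfigManager::load()?;

// Initialize both cache stores before any SecClient exists, since the client
// pulls the HTTP DataStore out of the OnceLock statics.
Caches::init(&config_manager);

let client = SecClient::from_config_manager(&config_manager)?;
```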
Future Extensions
The preprocessor cache infrastructure is in place but not yet fully utilized. Potential use cases include:
- Caching parsed/transformed US GAAP concepts
- Storing intermediate data structures from distill_us_gaap_fundamental_concepts()
- Caching resolved CIK lookups from ticker symbols
- Storing compiled ticker-to-CIK mappings
Sources: src/caches.rs:59-64
Integration with Other Systems
The caching system integrates with:
- ConfigManager (#2.1): Provides cache directory configuration
- SecClient (#3.1): Primary consumer of HTTP cache
- Network Functions (#3.2): All fetch operations benefit from caching
- simd-r-drive : External storage backend (see #6 for dependency details)
The cache layer operates transparently below the network API, requiring no changes to calling code when caching is enabled or disabled.
Sources: src/caches.rs:1-66 src/network/sec_client.rs:73-81
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Main Application Flow
Relevant source files
- examples/nport_filing.rs
- src/main.rs
- src/network/fetch_investment_company_series_and_class_dataset.rs
- src/network/fetch_nport_filing.rs
- src/network/fetch_us_gaap_fundamentals.rs
- src/parsers/parse_nport_xml.rs
This document describes the main application entry point and execution flow in src/main.rs, including initialization, data fetching loops, CSV output organization, and error handling strategies. The main application orchestrates the high-level data collection process by coordinating the configuration system, network layer, and storage systems.
For detailed information about the network client and its middleware layers, see SecClient. For data model structures used throughout the application, see Data Models & Enumerations. For configuration and credential management, see Configuration System.
Application Entry Point and Initialization
The application entry point is the main() function in src/main.rs:174-240. The initialization sequence follows a consistent pattern regardless of which processing mode is active:
Sources: src/main.rs:174-180
graph TB
Start["Program Start\n#[tokio::main]\nasync fn main()"]
LoadConfig["ConfigManager::load()\nLoad TOML configuration\nValidate credentials"]
CreateClient["SecClient::from_config_manager()\nInitialize HTTP client\nApply throttling & caching"]
CheckMode{"Processing\nMode?"}
FetchTickers["fetch_company_tickers(&client)\nRetrieve complete ticker dataset"]
FetchInvCo["fetch_investment_company_\nseries_and_class_dataset(&client)\nGet fund listings"]
USGAAPLoop["US-GAAP Processing Loop\n(Active Implementation)"]
InvCoLoop["Investment Company Loop\n(Commented Prototypes)"]
ErrorReport["Generate Error Summary\nPrint failed tickers"]
End["Program Exit"]
Start --> LoadConfig
LoadConfig --> CreateClient
CreateClient --> CheckMode
CheckMode -->|US-GAAP Mode| FetchTickers
CheckMode -->|Investment Co Mode| FetchInvCo
FetchTickers --> USGAAPLoop
FetchInvCo --> InvCoLoop
USGAAPLoop --> ErrorReport
InvCoLoop --> ErrorReport
ErrorReport --> End
The initialization handles errors at each stage, returning early if critical components fail to load:
| Initialization Step | Error Handling | Line Reference |
|---|---|---|
| Configuration Loading | Returns Box<dyn Error> on failure | src/main.rs:176 |
| Client Creation | Returns Box<dyn Error> on failure | src/main.rs:178 |
| Ticker Fetching | Returns Box<dyn Error> on failure | src/main.rs:180 |
Active Implementation: US-GAAP Fundamentals Processing
The currently active implementation in main.rs processes US-GAAP fundamentals for all company tickers. This represents the production data collection pipeline.
Sources: src/main.rs:174-240
graph TB
Start["Start US-GAAP Processing"]
FetchTickers["fetch_company_tickers(&client)\nReturns Vec<Ticker>"]
PrintTotal["Print total records\nDisplay first 60 tickers"]
InitErrorLog["Initialize error_log:\nHashMap<String, String>"]
LoopStart{"For each\ncompany_ticker\nin tickers"}
PrintProgress["Print processing status:\nticker_symbol (i+1 of total)"]
FetchFundamentals["fetch_us_gaap_fundamentals(\n&client,\n&company_tickers,\n&ticker_symbol)"]
CreateFile["File::create(\ndata/22-june-us-gaap/{ticker}.csv)"]
WriteCSV["CsvWriter::new(&mut file)\n.include_header(true)\n.finish(&mut df)"]
CaptureError["Insert error into error_log:\nticker → error message"]
CheckErrors{"error_log\nempty?"}
PrintErrors["Print error summary:\nList all failed tickers"]
PrintSuccess["Print success message"]
End["End Processing"]
Start --> FetchTickers
FetchTickers --> PrintTotal
PrintTotal --> InitErrorLog
InitErrorLog --> LoopStart
LoopStart -->|Next ticker| PrintProgress
PrintProgress --> FetchFundamentals
FetchFundamentals -->|Success| CreateFile
CreateFile -->|Success| WriteCSV
FetchFundamentals -->|Error| CaptureError
CreateFile -->|Error| CaptureError
WriteCSV -->|Error| CaptureError
WriteCSV -->|Success| LoopStart
CaptureError --> LoopStart
LoopStart -->|Complete| CheckErrors
CheckErrors -->|Has errors| PrintErrors
CheckErrors -->|No errors| PrintSuccess
PrintErrors --> End
PrintSuccess --> End
Processing Loop Details
The main processing loop iterates through all company tickers sequentially, performing the following operations for each ticker:
- Progress Logging - src/main.rs:190-195: Prints the current position in the dataset
- Data Fetching - src/main.rs:203: Calls fetch_us_gaap_fundamentals() with the client, the full ticker list, and the current ticker symbol
- CSV Generation - src/main.rs:206-214: Creates the output file and writes the DataFrame with headers
- Error Capture - src/main.rs:212-225: Logs any failures to the error_log HashMap
Error Handling Strategy
The application uses a non-fatal error handling approach:
| Error Type | Handling Strategy | Result |
|---|---|---|
| Fetch Error | Logged to error_log, processing continues | src/main.rs:223-225 |
| File Creation Error | Logged to error_log, processing continues | src/main.rs:217-220 |
| CSV Write Error | Logged to error_log, processing continues | src/main.rs:211-214 |
| Configuration Error | Fatal, returns immediately | src/main.rs:176 |
| Client Creation Error | Fatal, returns immediately | src/main.rs:178 |
Sources: src/main.rs:185-227
CSV Output Organization
The US-GAAP processing mode writes output to a flat directory structure:
data/22-june-us-gaap/
├── AAPL.csv
├── MSFT.csv
├── GOOGL.csv
└── ...
Each file is named using the ticker symbol and contains the complete US-GAAP fundamentals DataFrame for that company.
Sources: src/main.rs:205
Investment Company Processing Flow (Prototype)
Two commented-out prototype implementations exist in main.rs for processing investment company data. These represent alternative processing modes that may be activated in future iterations.
Production Prototype (Lines 18-122)
The first prototype implements production-ready investment company processing with comprehensive error handling:
Sources: src/main.rs:18-122
graph TB
Start["Start Investment Co Processing"]
InitLogger["env_logger::Builder\n.filter(None, LevelFilter::Info)"]
InitErrorLog["Initialize error_log: Vec<String>"]
LoadConfig["ConfigManager::load()\nHandle fatal errors"]
CreateClient["SecClient::from_config_manager()\nHandle fatal errors"]
FetchInvCo["fetch_investment_company_\nseries_and_class_dataset(&client)"]
LogTotal["info! Total investment companies"]
LoopStart{"For each fund\nin investment_\ncompanies"}
LogProgress["info! Processing: i+1 of total"]
CheckTicker{"fund.class_\nticker exists?"}
LogTicker["info! Ticker symbol"]
FetchNPORT["fetch_nport_filing_by_\nticker_symbol(&client, ticker)"]
GetFirstLetter["ticker.chars().next()\n.to_ascii_uppercase()"]
CreateDir["create_dir_all(\ndata/fund-holdings/{letter})"]
LogRecords["info! Total records"]
WriteCSV["latest_nport_filing\n.write_to_csv(file_path)"]
LogError["error! Log message\nerror_log.push(msg)"]
CheckFinal{"error_log\nempty?"}
PrintSummary["error! Print error summary"]
PrintSuccess["info! All funds successful"]
End["End Processing"]
Start --> InitLogger
InitLogger --> InitErrorLog
InitErrorLog --> LoadConfig
LoadConfig -->|Error| LogError
LoadConfig -->|Success| CreateClient
CreateClient -->|Error| LogError
CreateClient -->|Success| FetchInvCo
FetchInvCo -->|Error| LogError
FetchInvCo -->|Success| LogTotal
LogTotal --> LoopStart
LoopStart -->|Next fund| LogProgress
LogProgress --> CheckTicker
CheckTicker -->|Yes| LogTicker
CheckTicker -->|No| LoopStart
LogTicker --> FetchNPORT
FetchNPORT -->|Error| LogError
FetchNPORT -->|Success| GetFirstLetter
GetFirstLetter --> CreateDir
CreateDir -->|Error| LogError
CreateDir -->|Success| LogRecords
LogRecords --> WriteCSV
WriteCSV -->|Error| LogError
WriteCSV -->|Success| LoopStart
LogError --> LoopStart
LoopStart -->|Complete| CheckFinal
CheckFinal -->|Has errors| PrintSummary
CheckFinal -->|No errors| PrintSuccess
PrintSummary --> End
PrintSuccess --> End
Directory Organization for Fund Holdings
The investment company prototype organizes CSV output by the first letter of the ticker symbol:
data/fund-holdings/
├── A/
│ ├── AADR.csv
│ └── AAXJ.csv
├── B/
│ ├── BKLN.csv
│ └── BND.csv
├── V/
│ ├── VTI.csv
│ └── VXUS.csv
└── ...
This alphabetical categorization is implemented at src/main.rs:83-84 using ticker_symbol.chars().next().unwrap().to_ascii_uppercase().
Sources: src/main.rs:82-107
Debug Prototype (Lines 124-171)
A simpler debug version exists for testing and development, with minimal error handling and console output:
| Feature | Debug Prototype | Production Prototype |
|---|---|---|
| Error Logging | Fatal errors only | Comprehensive Vec/HashMap logs |
| Logger | Not initialized | env_logger with INFO level |
| Progress Tracking | Simple println! | Structured info! macros |
| Error Recovery | Immediate exit with ? | Continue processing, log errors |
| Output | Console + CSV | CSV only |
Sources: src/main.rs:124-171
Data Flow Through Network Layer
The main application delegates all SEC API interactions to the network layer, which handles caching, throttling, and retries transparently:
Sources: src/main.rs:3-9 src/network/fetch_us_gaap_fundamentals.rs:12-28 src/network/fetch_nport_filing.rs:10-49 src/network/fetch_investment_company_series_and_class_dataset.rs:31-105
graph LR
Main["main.rs\nmain()"]
FetchTickers["network::fetch_company_tickers\nGET company_tickers.json"]
FetchGAAP["network::fetch_us_gaap_fundamentals\nGET CIK{cik}/company-facts.json"]
FetchInvCo["network::fetch_investment_company_\nseries_and_class_dataset\nGET investment-company-series-class-{year}.csv"]
FetchNPORT["network::fetch_nport_filing_\nby_ticker_symbol\nGET data/{cik}/{accession}/primary_doc.xml"]
SecClient["SecClient\nHTTP middleware\nThrottling, Caching, Retries"]
SECAPI["SEC EDGAR API\nwww.sec.gov"]
Main --> FetchTickers
Main --> FetchGAAP
Main --> FetchInvCo
Main --> FetchNPORT
FetchTickers --> SecClient
FetchGAAP --> SecClient
FetchInvCo --> SecClient
FetchNPORT --> SecClient
SecClient --> SECAPI
The network functions are designed to be composable, allowing the main application to chain multiple API calls together (e.g., fetching a CIK first, then using it to fetch submissions).
Composite Data Fetching Operations
Several network functions perform composite operations by calling other network functions. For example, fetch_nport_filing_by_ticker_symbol orchestrates multiple API calls:
Sources: src/network/fetch_nport_filing.rs:10-49
graph TB
Input["Input: ticker_symbol"]
FetchCIK["fetch_cik_by_ticker_symbol\n(&sec_client, ticker_symbol)"]
FetchSubs["fetch_cik_submissions\n(&sec_client, cik)"]
GetRecent["CikSubmission::most_recent_\nnport_p_submission(submissions)"]
FetchFiling["fetch_nport_filing_by_cik_\nand_accession_number(\n&sec_client, cik, accession)"]
FetchTickers["fetch_company_tickers\n(&sec_client)"]
ParseXML["parse_nport_xml(\n&xml_data, &company_tickers)"]
Output["Output: Vec<NportInvestment>"]
Input --> FetchCIK
FetchCIK --> FetchSubs
FetchSubs --> GetRecent
GetRecent --> FetchFiling
FetchFiling --> FetchTickers
FetchTickers --> ParseXML
ParseXML --> Output
This composition pattern allows the main application to use high-level functions while the network layer handles the complexity of multi-step API interactions.
graph TB
MainFn["main() -> Result<(), Box<dyn Error>>"]
NetworkFn["Network functions\n-> Result<T, Box<dyn Error>>"]
Parser["Parsers\n-> Result<T, Box<dyn Error>>"]
SecClient["SecClient\nImplements retries & backoff"]
MainFatal["Fatal Errors:\nConfig load failure\nClient creation failure\nInitial ticker fetch failure"]
MainNonFatal["Non-Fatal Errors:\nIndividual ticker fetch failures\nCSV write failures\n→ Logged to error_log\n→ Processing continues"]
NetworkRetry["Automatic Retries:\nHTTP timeouts\nRate limit errors\n→ Exponential backoff\n→ Max 3 retries"]
ParserErr["Parser Errors:\nMalformed XML/JSON\nMissing required fields\n→ Propagate to caller"]
MainFn --> MainFatal
MainFn --> MainNonFatal
MainFn --> NetworkFn
NetworkFn --> NetworkRetry
NetworkFn --> SecClient
NetworkFn --> Parser
Parser --> ParserErr
NetworkRetry --> MainNonFatal
ParserErr --> MainNonFatal
Error Propagation and Recovery
The application uses Rust's Result type throughout the call stack, with different error handling strategies at each level:
Sources: src/main.rs:174-240 src/network/fetch_us_gaap_fundamentals.rs:12-28
Fatal errors (configuration, client setup) immediately return from main(), while per-ticker errors are captured in error_log and reported at the end without interrupting the processing loop.
File System Output Structure
The application writes structured CSV files to the local file system. The output directory organization differs between processing modes:
US-GAAP Mode Output
data/
└── 22-june-us-gaap/
├── AAPL.csv
├── MSFT.csv
└── {ticker}.csv
Sources: src/main.rs:205
Investment Company Mode Output
data/
└── fund-holdings/
├── A/
│ ├── AADR.csv
│ └── {ticker}.csv
├── B/
│ └── {ticker}.csv
└── {letter}/
└── {ticker}.csv
Sources: src/main.rs:82-102
The alphabetical organization in investment company mode enables efficient file system operations when dealing with thousands of fund ticker symbols, avoiding the creation of a single directory with excessive entries.
Performance Characteristics
The main application is designed for batch processing with the following characteristics:
| Aspect | Implementation | Location |
|---|---|---|
| Concurrency | Sequential processing (no parallelization in main loop) | src/main.rs:187 |
| Rate Limiting | Handled by SecClient throttle policy | See SecClient |
| Caching | HTTP cache and preprocessor cache in SecClient | See Caching System |
| Memory Management | Processes one ticker at a time, drops DataFrames after write | src/main.rs:203-221 |
| Error Recovery | Non-fatal errors logged, processing continues | src/main.rs:185-227 |
The sequential processing model ensures consistent rate limiting and simplifies error tracking, though it could be parallelized in future iterations using Tokio tasks or Rayon parallel iterators.
Sources: src/main.rs:174-240
Summary
The main application flow follows a simple but robust pattern:
- Initialize configuration and HTTP client with comprehensive error handling
- Fetch the complete dataset of tickers or investment companies
- Iterate through each entry sequentially
- Process each entry by fetching additional data and writing to CSV
- Log any errors without stopping processing
- Report a summary of successes and failures
This design prioritizes resilience (continue processing on errors) and observability (detailed logging and error summaries) over performance, making it suitable for overnight batch jobs that process thousands of securities. The commented prototype implementations demonstrate alternative processing modes that can be activated by uncommenting the relevant sections and commenting out the active implementation.
Utility Functions
Relevant source files
Purpose and Scope
This document covers the utility functions and helper modules provided by the utils module in the Rust sec-fetcher application. These utilities provide cross-cutting functionality used throughout the codebase, including data structure transformations, runtime mode detection, and collection extensions.
For information about the main application architecture and data flow, see Main Application Flow. For details on data models and core structures, see Data Models & Enumerations.
Sources: src/utils.rs:1-12
Module Overview
The utils module is organized as a collection of focused sub-modules, each providing a specific utility function or set of related functions. The module uses Rust's re-export pattern to provide a clean public API.
| Sub-module | Primary Export | Purpose |
|---|---|---|
| invert_multivalue_indexmap | invert_multivalue_indexmap() | Inverts multi-value index mappings |
| is_development_mode | is_development_mode() | Runtime environment detection |
| is_interactive_mode | is_interactive_mode(), set_interactive_mode_override() | Interactive mode state management |
| vec_extensions | VecExtensions trait | Extension methods for Vec<T> |
The module structure follows a pattern where each utility is isolated in its own file, promoting maintainability and testability while the parent utils.rs file acts as a facade.
Sources: src/utils.rs:1-12
Utility Module Architecture
Sources: src/utils.rs:1-12
IndexMap Inversion Utility
Function Signature
The invert_multivalue_indexmap function transforms a mapping where keys point to multiple values into a reverse mapping where values point to all their associated keys.
Algorithm and Complexity
The function performs the following operations:
- Initialization: Creates a new IndexMap with capacity matching the input map
- Iteration: Traverses all key-value pairs in the original mapping
- Inversion: For each value in a key's vector, adds that key to the value's entry in the inverted map
- Preservation: Maintains insertion order via the IndexMap data structure
Time Complexity: O(N) where N is the total number of key-value associations across all vectors
Space Complexity: O(N) for storing the inverted mapping
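The algorithm itself is language-agnostic. As an illustrative sketch only (the actual implementation is a generic Rust function over IndexMap), the same inversion can be expressed in Python, where plain dicts preserve insertion order:

```python
def invert_multivalue_mapping(mapping):
    """Invert a key -> [values] mapping into value -> [keys], preserving insertion order."""
    inverted = {}
    for key, values in mapping.items():
        for value in values:
            # Append the key to this value's entry, creating the entry on first sight.
            inverted.setdefault(value, []).append(key)
    return inverted

# Example: two US GAAP tags mapping to overlapping concepts (made-up data).
print(invert_multivalue_mapping({
    "Revenues": ["Revenues"],
    "SalesRevenueNet": ["Revenues", "NetRevenues"],
}))
# {'Revenues': ['Revenues', 'SalesRevenueNet'], 'NetRevenues': ['SalesRevenueNet']}
```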
Use Cases in the Codebase
This utility is primarily used in the US GAAP transformation pipeline where bidirectional concept mappings are required. The function enables:
- Synonym Resolution: Mapping from multiple US GAAP tags to a single FundamentalConcept enum variant
- Reverse Lookups: Given a FundamentalConcept, finding all original US GAAP tags that map to it
- Hierarchical Queries: Supporting both forward (tag → concept) and reverse (concept → tags) navigation
Example Usage Pattern
Sources: src/utils/invert_multivalue_indexmap.rs:1-65
Development Mode Detection
The is_development_mode function provides runtime detection of development versus production environments. This utility enables conditional behavior based on the execution context.
Typical Implementation Pattern
Development mode detection typically checks for:
- Cargo Profile: Whether the binary was compiled with the --release flag
- Environment Variables: Presence of RUST_DEV_MODE, DEBUG, or similar markers
- Build Configuration: Compile-time flags like #[cfg(debug_assertions)]
Usage in Configuration System
The function integrates with the configuration system to adjust behavior:
| Development Mode | Production Mode |
|---|---|
| Relaxed validation | Strict validation |
| Verbose logging enabled | Minimal logging |
| Mock data allowed | Real API calls required |
| Cache disabled or short TTL | Full caching enabled |
This utility is likely referenced in src/config/config_manager.rs to adjust validation rules and in src/main.rs to control application initialization.
Sources: src/utils.rs:4-5
Interactive Mode Management
The interactive mode utilities manage application state related to user interaction, controlling whether the application should prompt for user input or run in automated mode.
Function Signatures
State Management Pattern
These functions likely implement a static or thread-local state management pattern:
- Default Behavior: Detect whether the process is running in a TTY (terminal) environment
- Override Mechanism: Allow explicit setting via set_interactive_mode_override()
- Query Interface: Check the current state via is_interactive_mode()
Use Cases
| Scenario | Interactive Mode | Non-Interactive Mode |
|---|---|---|
| Missing config | Prompt user for input | Exit with error |
| API rate limit | Pause and notify user | Automatic retry with backoff |
| Data validation failure | Ask user to continue | Fail fast and exit |
| Progress reporting | Show progress bars | Log to stdout/file |
Integration with Main Application Flow
The interactive mode flags are likely used in src/main.rs to control:
- Whether to display progress indicators during fund processing
- How to handle missing credentials (prompt vs. error)
- Whether to confirm before writing large CSV files
- User confirmation for destructive operations
Sources: src/utils.rs:7-8
Vector Extensions Trait
The VecExtensions trait provides extension methods for Rust's standard Vec<T> type, adding domain-specific functionality needed by the sec-fetcher application.
Trait Pattern
Extension traits in Rust declare a trait containing the desired methods and then implement that trait for Vec<T>; once the trait is brought into scope, the methods become available on any vector.
Likely Extension Methods
Based on common patterns in data processing applications and the context of US GAAP data transformation, the trait likely provides:
| Method Category | Potential Methods | Purpose |
|---|---|---|
| Chunking | chunk_by_size(), batch() | Process large datasets in manageable batches |
| Deduplication | unique(), deduplicate_by() | Remove duplicate entries from fetched data |
| Filtering | filter_not_empty(), compact() | Remove null or empty elements |
| Transformation | map_parallel(), flat_map_concurrent() | Parallel data transformation |
| Validation | all_valid(), partition_valid() | Separate valid from invalid records |
Usage in Data Pipeline
The extensions are likely used extensively in:
- Network Module : Batching API requests, deduplicating ticker symbols
- Transform Module : Parallel processing of US GAAP concepts
- Main Application : Chunking investment company lists for concurrent processing
Sources: src/utils.rs:10-11
Utility Function Relationships
Sources: src/utils.rs:1-12 src/utils/invert_multivalue_indexmap.rs:1-65
Design Principles
Modularity
Each utility is isolated in its own sub-module with a single, well-defined responsibility. This design:
- Reduces Coupling : Utilities can be tested independently
- Improves Reusability : Functions can be used across different modules without dependencies
- Simplifies Maintenance : Changes to one utility don't affect others
Generic Programming
The invert_multivalue_indexmap function demonstrates generic programming principles:
- Type Parameters: Works with any types implementing Eq + Hash + Clone
- Zero-Cost Abstractions: No runtime overhead compared to specialized versions
- Compile-Time Guarantees: Type safety ensured by the compiler
Order Preservation
The use of IndexMap instead of HashMap in the inversion utility preserves insertion order, which is critical for:
- Deterministic Output : Consistent CSV file generation across runs
- Reproducible Transformations : Same input always produces same output order
- Debugging : Predictable ordering aids in troubleshooting
Sources: src/utils/invert_multivalue_indexmap.rs:4-14
Performance Characteristics
IndexMap Inversion
| Operation | Complexity | Notes |
|---|---|---|
| Inversion | O(N) | N = total key-value associations |
| Lookup in result | O(1) average | Hash-based access |
| Insertion order | Preserved | Via IndexMap internal ordering |
| Memory overhead | O(N) | Stores inverted mapping |
Runtime Mode Detection
| Operation | Complexity | Caching Strategy |
|---|---|---|
| First check | O(1) - O(log n) | Environment lookup or compile-time constant |
| Subsequent checks | O(1) | Static or lazy_static cached value |
The mode detection functions likely use lazy initialization patterns to cache their results, avoiding repeated environment variable lookups or system calls.
Sources: src/utils/invert_multivalue_indexmap.rs:26-28
Integration Points
US GAAP Transformation Pipeline
The utilities integrate with the concept transformation system (see US GAAP Concept Transformation):
- Forward Mapping: distill_us_gaap_fundamental_concepts uses hard-coded mappings
- Reverse Mapping: invert_multivalue_indexmap generates the reverse index
- Lookup Optimization: Enables O(1) queries for "which concepts contain this tag?"
Configuration System
The development and interactive mode utilities integrate with configuration management (see Configuration System):
- Validation: is_development_mode relaxes validation in development
- Credential Handling: is_interactive_mode determines prompt vs. error
- Logging: Both functions control verbosity and output format
Main Application Loop
The utilities support the main processing flow (see Main Application Flow):
- Batch Processing: VecExtensions enables efficient chunking of investment companies
- User Feedback: is_interactive_mode controls progress display
- Error Handling: Mode detection influences retry behavior and error messages
Sources: src/utils.rs:1-12
Python narrative_stack System
Relevant source files
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb
- src/caches.rs
Purpose and Scope
The narrative_stack system is the Python machine learning component of the dual-language financial data processing architecture. This document provides an overview of the Python system's architecture, its integration with the Rust sec-fetcher component, and the high-level data flow from CSV ingestion through model training.
For detailed information on specific subsystems:
- Data ingestion and preprocessing pipeline: See Data Ingestion & Preprocessing
- Machine learning model architecture and training: See Machine Learning Training Pipeline
- Database and storage integration: See Database & Storage Integration
The Rust data fetching layer is documented in Rust sec-fetcher Application.
System Architecture Overview
The narrative_stack system processes US GAAP financial data through a multi-stage pipeline that transforms raw CSV files into learned latent representations. The architecture consists of three primary layers: storage/ingestion, preprocessing, and training.
graph TB
subgraph "Storage Layer"
DbUsGaap["DbUsGaap\nMySQL Database Interface"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client"]
UsGaapStore["UsGaapStore\nUnified Data Facade"]
end
subgraph "Data Sources"
RustCSV["CSV Files\nfrom sec-fetcher\ntruncated_csvs/"]
MySQL["MySQL Database\nus_gaap_test"]
SimdRDrive["simd-r-drive\nWebSocket Server"]
end
subgraph "Preprocessing Components"
IngestCSV["ingest_us_gaap_csvs\nCSV Walker & Parser"]
GenPCA["generate_pca_embeddings\nPCA Compression"]
RobustScaler["RobustScaler\nPer-Pair Normalization"]
ConceptEmbed["Semantic Embeddings\nConcept/Unit Pairs"]
end
subgraph "Training Components"
IterableDS["IterableConceptValueDataset\nStreaming Data Loader"]
Stage1AE["Stage1Autoencoder\nPyTorch Lightning Module"]
PLTrainer["pl.Trainer\nTraining Orchestration"]
Callbacks["EarlyStopping\nModelCheckpoint\nTensorBoard Logger"]
end
RustCSV --> IngestCSV
MySQL --> DbUsGaap
SimdRDrive --> DataStoreWsClient
DbUsGaap --> UsGaapStore
DataStoreWsClient --> UsGaapStore
IngestCSV --> UsGaapStore
UsGaapStore --> GenPCA
UsGaapStore --> RobustScaler
UsGaapStore --> ConceptEmbed
GenPCA --> UsGaapStore
RobustScaler --> UsGaapStore
ConceptEmbed --> UsGaapStore
UsGaapStore --> IterableDS
IterableDS --> Stage1AE
Stage1AE --> PLTrainer
PLTrainer --> Callbacks
Callbacks -.->|Checkpoints| Stage1AE
Component Architecture
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176
Core Component Responsibilities
| Component | Type | Responsibilities |
|---|---|---|
| DbUsGaap | Database Interface | MySQL connection management, query execution |
| DataStoreWsClient | WebSocket Client | Real-time communication with simd-r-drive server |
| UsGaapStore | Data Facade | Unified API for ingestion, lookup, embedding management |
| IterableConceptValueDataset | PyTorch Dataset | Streaming data loader with internal batching |
| Stage1Autoencoder | Lightning Module | Encoder-decoder architecture for concept embeddings |
| pl.Trainer | Training Framework | Orchestration, logging, checkpointing |
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176
Data Flow Pipeline
The system processes financial data through a sequential pipeline from raw CSV files to trained model checkpoints.
graph LR
subgraph "Input Stage"
CSV1["Rust CSV Output\nproject_paths.rust_data/\ntruncated_csvs/"]
end
subgraph "Ingestion Stage"
Walk["walk_us_gaap_csvs\nDirectory Walker"]
Parse["UsGaapRowRecord\nParser"]
Store1["Store to\nDbUsGaap & DataStore"]
end
subgraph "Preprocessing Stage"
Extract["Extract\nconcept/unit pairs"]
GenEmbed["generate_concept_unit_embeddings\nSemantic Embeddings"]
Scale["RobustScaler\nPer-Pair Normalization"]
PCA["PCA Compression\nvariance_threshold=0.95"]
Triplet["Triplet Storage\nconcept+unit+scaled_value\n+scaler+embedding"]
end
subgraph "Training Stage"
Stream["IterableConceptValueDataset\ninternal_batch_size=64"]
DataLoader["DataLoader\nbatch_size from hparams\ncollate_with_scaler"]
Encode["Stage1Autoencoder.encode\nInput → Latent"]
Decode["Stage1Autoencoder.decode\nLatent → Reconstruction"]
Loss["MSE Loss\nReconstruction Error"]
end
subgraph "Output Stage"
Ckpt["Model Checkpoints\nstage1_resume-vN.ckpt"]
TB["TensorBoard Logs\nval_loss_epoch monitoring"]
end
CSV1 --> Walk
Walk --> Parse
Parse --> Store1
Store1 --> Extract
Extract --> GenEmbed
GenEmbed --> Scale
Scale --> PCA
PCA --> Triplet
Triplet --> Stream
Stream --> DataLoader
DataLoader --> Encode
Encode --> Decode
Decode --> Loss
Loss --> Ckpt
Loss --> TB
End-to-End Processing Flow
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:21-68
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-556
Processing Stages
1. CSV Ingestion
The system ingests CSV files produced by the Rust sec-fetcher using us_gaap_store.ingest_us_gaap_csvs(). This function walks the directory structure, parses each CSV file, and stores the data in both MySQL and the WebSocket data store.
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:21-23
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68
2. Concept/Unit Pair Extraction
The preprocessing pipeline extracts unique (concept, unit) pairs from the ingested data. Each pair represents a specific financial metric with its measurement unit (e.g., ("Revenues", "USD")).
Sources:
3. Semantic Embedding Generation
For each concept/unit pair, the system generates semantic embeddings that capture the meaning and relationship of financial concepts. These embeddings are then compressed using PCA with a variance threshold of 0.95.
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:263-269
4. Value Normalization
The RobustScaler normalizes financial values separately for each concept/unit pair, making the data robust to outliers while preserving relative magnitudes within each group.
Sources:
5. Model Training
The Stage1Autoencoder learns to reconstruct the concatenated embedding+value input through a bottleneck architecture, forcing the model to learn compressed representations in the latent space.
Sources:
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:541-556
graph TB
subgraph "Python Application"
App["narrative_stack\nNotebooks & Scripts"]
end
subgraph "Storage Facade"
UsGaapStore["UsGaapStore\nUnified Interface"]
end
subgraph "Backend Interfaces"
DbInterface["DbUsGaap\ndb_config"]
WsInterface["DataStoreWsClient\nsimd_r_drive_server_config"]
end
subgraph "Storage Systems"
MySQL["MySQL Server\nus_gaap_test database"]
WsServer["simd-r-drive\nWebSocket Server\nKey-Value Store"]
FileCache["File System\nCache Storage"]
end
subgraph "Data Types"
Raw["Raw Records\nUsGaapRowRecord"]
Triplets["Triplets\nconcept+unit+scaled_value\n+scaler+embedding"]
Matrix["Embedding Matrix\nnumpy arrays"]
PCAModel["PCA Models\nsklearn objects"]
end
App --> UsGaapStore
UsGaapStore --> DbInterface
UsGaapStore --> WsInterface
DbInterface --> MySQL
WsInterface --> WsServer
WsServer --> FileCache
MySQL -.->|stores| Raw
WsServer -.->|stores| Triplets
WsServer -.->|stores| Matrix
WsServer -.->|stores| PCAModel
Storage Architecture Integration
The Python system integrates with multiple storage backends to support different access patterns and data requirements.
Storage Backend Architecture
Sources:
Storage System Characteristics
| Storage Backend | Use Case | Data Types | Access Pattern |
|---|---|---|---|
| MySQL (DbUsGaap) | Raw record storage | UsGaapRowRecord entries | Batch queries during ingestion |
| WebSocket (DataStoreWsClient) | Preprocessed data | Triplets, embeddings, scalers, PCA models | Real-time streaming during training |
| File System | Cache persistence | HTTP cache, preprocessor cache | Transparent via simd-r-drive |
Sources:
sequenceDiagram
participant NB as "Jupyter Notebook"
participant Config as "config module"
participant DB as "DbUsGaap"
participant WS as "DataStoreWsClient"
participant Store as "UsGaapStore"
NB->>Config: Import db_config
NB->>Config: Import simd_r_drive_server_config
NB->>DB: DbUsGaap(db_config)
NB->>WS: DataStoreWsClient(host, port)
NB->>Store: UsGaapStore(data_store)
Store->>WS: Delegate operations
Store->>DB: Query raw records
Configuration and Initialization
The system uses centralized configuration for database connections and WebSocket server endpoints.
Initialization Sequence
Sources:
Configuration Objects
| Configuration | Purpose | Key Parameters |
|---|---|---|
| db_config | MySQL connection | Host, port, database name, credentials |
| simd_r_drive_server_config | WebSocket server | Host, port |
| project_paths | File system paths | rust_data, python_data |
Sources:
graph TB
subgraph "Model Configuration"
Model["Stage1Autoencoder\nhparams"]
LoadCkpt["load_from_checkpoint\nckpt_path"]
LR["Learning Rate\nlr, min_lr"]
end
subgraph "Data Configuration"
DS["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
DL["DataLoader\nbatch_size from hparams\ncollate_fn=collate_with_scaler\nnum_workers=2\nprefetch_factor=4"]
end
subgraph "Callback Configuration"
ES["EarlyStopping\nmonitor=val_loss_epoch\npatience=20\nmode=min"]
MC["ModelCheckpoint\nmonitor=val_loss_epoch\nsave_top_k=1\nfilename=stage1_resume"]
TB["TensorBoardLogger\nOUTPUT_PATH\nname=stage1_autoencoder"]
end
subgraph "Trainer Configuration"
Trainer["pl.Trainer\nmax_epochs=1000\naccelerator=auto\ndevices=1\ngradient_clip_val"]
end
Model --> Trainer
LoadCkpt -.->|optional resume| Model
LR --> Model
DS --> DL
DL --> Trainer
ES --> Trainer
MC --> Trainer
TB --> Trainer
Training Infrastructure
The training system uses PyTorch Lightning for experiment management and reproducible training workflows.
Training Configuration
Sources:
Key Training Parameters
| Parameter | Value | Purpose |
|---|---|---|
| EPOCHS | 1000 | Maximum training epochs |
| PATIENCE | 20 | Early stopping patience |
| internal_batch_size | 64 | Dataset internal batching |
| num_workers | 2 | DataLoader worker processes |
| prefetch_factor | 4 | Batches to prefetch per worker |
| gradient_clip_val | From hparams | Gradient clipping threshold |
Sources:
Data Access Patterns
The UsGaapStore provides multiple access methods for different use cases.
Access Methods
| Method | Purpose | Return Type | Usage Context |
|---|---|---|---|
| ingest_us_gaap_csvs(csv_dir, db) | Bulk CSV ingestion | None | Initial data loading |
| generate_pca_embeddings() | PCA compression | None | Preprocessing |
| lookup_by_index(idx) | Single triplet retrieval | Dict with concept, unit, scaled_value, scaler, embedding | Validation, debugging |
| batch_lookup_by_indices(indices) | Multiple triplet retrieval | List of dicts | Batch validation |
| get_triplet_count() | Count stored triplets | Integer | Monitoring |
| get_pair_count() | Count unique pairs | Integer | Statistics |
| get_embedding_matrix() | Full embedding matrix | (numpy array, pairs list) | Visualization, analysis |
Sources:
Validation and Monitoring
The system includes validation mechanisms to ensure data integrity throughout the pipeline.
Scaler Validation Example
The preprocessing notebook includes validation code to verify that stored scalers correctly transform values:
Sources:
Visualization Tools
The system provides visualization utilities for monitoring preprocessing quality:
| Function | Purpose | Output |
|---|---|---|
| plot_pca_explanation() | Show PCA variance explained | Matplotlib plot with dimensionality recommendation |
| plot_semantic_embeddings() | Visualize embedding space | 2D scatterplot of embeddings |
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-269
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:337-342
graph LR
subgraph "Rust sec-fetcher"
Fetch["Network Fetching\nfetch_us_gaap_fundamentals"]
Transform["Concept Transformation\ndistill_us_gaap_fundamental_concepts"]
CSVWrite["CSV Output\nus-gaap/*.csv"]
end
subgraph "File System"
CSVStore["CSV Files\nproject_paths.rust_data/\ntruncated_csvs/"]
end
subgraph "Python narrative_stack"
CSVRead["CSV Ingestion\ningest_us_gaap_csvs"]
Process["Preprocessing Pipeline"]
Train["Model Training"]
end
Fetch --> Transform
Transform --> CSVWrite
CSVWrite --> CSVStore
CSVStore --> CSVRead
CSVRead --> Process
Process --> Train
Integration with Rust sec-fetcher
The Python system consumes data produced by the Rust sec-fetcher application through a file-based interface.
Rust-Python Data Bridge
Sources:
Data Format Expectations
The Python system expects CSV files produced by the Rust sec-fetcher to contain US GAAP fundamental data with the following structure:
- Each CSV file represents data for a single ticker symbol
- Files are organized in subdirectories (e.g., truncated_csvs/)
- Each row contains concept, unit, value, and metadata fields
- Concepts are standardized through the Rust distill_us_gaap_fundamental_concepts transformer
The preprocessing pipeline parses these CSVs into UsGaapRowRecord objects for further processing.
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:21-23
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68
Checkpoint Management
The training system supports checkpoint-based workflows for resuming training and fine-tuning.
Checkpoint Loading Patterns
The notebook demonstrates multiple checkpoint loading strategies:
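The exact notebook cells are not reproduced here; the following sketch shows the three common PyTorch Lightning patterns such strategies draw on, with Stage1Autoencoder, the data loaders, and the checkpoint path treated as placeholders:

```python
import pytorch_lightning as pl

ckpt_path = "outputs/stage1_resume-v3.ckpt"  # hypothetical checkpoint file

# 1. Train from scratch with default hyperparameters.
model = Stage1Autoencoder()

# 2. Restore weights and hyperparameters saved in a checkpoint.
model = Stage1Autoencoder.load_from_checkpoint(ckpt_path)

# 3. Resume a full training run (optimizer state, epoch counter) via the Trainer.
trainer = pl.Trainer(max_epochs=1000)
trainer.fit(model, train_dataloaders=train_loader,
            val_dataloaders=val_loader, ckpt_path=ckpt_path)
```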
Sources:
Checkpoint Configuration
| Parameter | Purpose | Typical Value |
|---|---|---|
| dirpath | Checkpoint save directory | OUTPUT_PATH |
| filename | Checkpoint filename pattern | "stage1_resume" |
| monitor | Metric to monitor | "val_loss_epoch" |
| mode | Optimization direction | "min" |
| save_top_k | Number of checkpoints to keep | 1 |
Sources:
Data Ingestion & Preprocessing
Relevant source files
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb
- src/caches.rs
Purpose and Scope
This document describes the data ingestion and preprocessing pipeline in the Python narrative_stack system. This pipeline transforms raw US GAAP financial data from CSV files (generated by the Rust sec-fetcher application) into normalized, embedded triplets suitable for machine learning training. The pipeline handles CSV parsing, concept/unit pair extraction, semantic embedding generation, robust statistical normalization, and PCA-based dimensionality reduction.
For information about the machine learning training pipeline that consumes this preprocessed data, see Machine Learning Training Pipeline. For details about the Rust data fetching and CSV generation, see Rust sec-fetcher Application.
Overview
The preprocessing pipeline orchestrates multiple transformation stages to prepare raw financial data for machine learning. The system reads CSV files containing US GAAP fundamental concepts (Assets, Revenues, etc.) with associated values and units of measure, then generates semantic embeddings for each concept/unit pair, normalizes values using robust statistical techniques, and compresses embeddings via PCA.
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:1-383
graph TB
subgraph "Input"
CSVFiles["CSV Files\nrust_data/us-gaap/*.csv\nSymbol-specific filings"]
end
subgraph "Ingestion Layer"
WalkCSV["walk_us_gaap_csvs()\nIterator over CSV files"]
ParseRow["UsGaapRowRecord\nParsed entries per symbol"]
end
subgraph "Extraction & Embedding"
ExtractPairs["Extract Concept/Unit Pairs\n(concept, unit)
tuples"]
GenEmbeddings["generate_concept_unit_embeddings()\nSemantic embeddings per pair"]
end
subgraph "Normalization"
GroupValues["Group by (concept, unit)\nCollect all values per pair"]
FitScaler["RobustScaler.fit()\nPer-pair scaling parameters"]
TransformValues["Transform values\nScaled to robust statistics"]
end
subgraph "Dimensionality Reduction"
PCAFit["PCA.fit()\nvariance_threshold=0.95"]
PCATransform["PCA.transform()\nCompress embeddings"]
end
subgraph "Storage"
BuildTriplets["Build Triplets\n(concept, unit, scaled_value,\nscaler, embedding)"]
StoreDS["DataStoreWsClient\nWrite to simd-r-drive"]
UsGaapStore["UsGaapStore\nUnified access facade"]
end
CSVFiles --> WalkCSV
WalkCSV --> ParseRow
ParseRow --> ExtractPairs
ExtractPairs --> GenEmbeddings
ParseRow --> GroupValues
GroupValues --> FitScaler
FitScaler --> TransformValues
GenEmbeddings --> PCAFit
PCAFit --> PCATransform
TransformValues --> BuildTriplets
PCATransform --> BuildTriplets
BuildTriplets --> StoreDS
StoreDS --> UsGaapStore
Architecture Components
The preprocessing system is built around three primary components that handle data access, transformation, and storage:
graph TB
subgraph "Database Layer"
DbUsGaap["DbUsGaap\nMySQL interface\nus_gaap_test database"]
end
subgraph "Storage Layer"
DataStoreWsClient["DataStoreWsClient\nWebSocket client\nsimd_r_drive_server_config"]
end
subgraph "Facade Layer"
UsGaapStore["UsGaapStore\nUnified data access\nOrchestrates ingestion & retrieval"]
end
subgraph "Core Operations"
Ingest["ingest_us_gaap_csvs()\nCSV → Database → WebSocket"]
GenPCA["generate_pca_embeddings()\nPCA compression pipeline"]
Lookup["lookup_by_index()\nRetrieve triplet + metadata"]
end
DbUsGaap --> UsGaapStore
DataStoreWsClient --> UsGaapStore
UsGaapStore --> Ingest
UsGaapStore --> GenPCA
UsGaapStore --> Lookup
| Component | Type | Purpose |
|---|---|---|
| DbUsGaap | Database Interface | Provides async MySQL access to stored US GAAP data |
| DataStoreWsClient | WebSocket Client | Connects to simd-r-drive server for key-value storage |
| UsGaapStore | Facade | Coordinates ingestion, embedding generation, and data retrieval |
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
CSV Data Ingestion
The ingestion process begins by reading CSV files produced by the Rust sec-fetcher application. These files are organized by ticker symbol and contain US GAAP fundamental concepts extracted from SEC filings.
Directory Structure
CSV files are read from a configurable directory path that points to the Rust application's output:
graph LR
ProjectPaths["project_paths.rust_data"] --> CSVDir["csv_data_dir\nPath to CSV directory"]
CSVDir --> Files["Symbol CSV Files\nAAPL.csv, GOOGL.csv, etc."]
Files --> Walker["walk_us_gaap_csvs()\nGenerator function"]
The ingestion system walks through the directory structure, processing one CSV file per symbol. Each file contains multiple filing entries with concept/value/unit triplets.
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:20-23
Ingestion Invocation
The primary ingestion method is us_gaap_store.ingest_us_gaap_csvs(), which accepts a CSV directory path and database connection:
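A minimal sketch of the call, assuming the db_config, simd_r_drive_server_config, project_paths, DbUsGaap, DataStoreWsClient, and UsGaapStore objects described elsewhere on this page are already imported (exact argument names may differ in the notebook):

```python
# Illustrative sketch; constructor arguments mirror the initialization sequence above.
db_us_gaap = DbUsGaap(db_config)                        # MySQL interface
data_store = DataStoreWsClient(simd_r_drive_server_config.host,
                               simd_r_drive_server_config.port)
us_gaap_store = UsGaapStore(data_store)

csv_data_dir = project_paths.rust_data / "truncated_csvs"
us_gaap_store.ingest_us_gaap_csvs(csv_data_dir, db_us_gaap)
```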
This method orchestrates:
- CSV file walking and parsing
- Concept/unit pair extraction
- Value aggregation per pair
- Scaler fitting and transformation
- Storage in both MySQL and simd-r-drive
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67
Concept/Unit Pair Extraction
Financial data in CSV files contains concept/value/unit triplets. The preprocessing pipeline extracts unique (concept, unit) pairs and groups all values associated with each pair.
Data Structure
Each row in the CSV contains:
- Concept: A FundamentalConcept variant (e.g., "Assets", "Revenues", "NetIncomeLoss")
- Unit of Measure (UOM): The measurement unit (e.g., "USD", "shares", "USD_per_share")
- Value: The numeric value reported in the filing
- Additional Metadata: Period type, balance type, filing date, etc.
Grouping Strategy
Values are grouped by (concept, unit) combinations because:
- Different concepts have different value ranges (e.g., Assets vs. EPS)
- The same concept in different units requires separate scaling (e.g., Revenue in USD vs. Revenue in thousands)
- This grouping enables per-pair statistical normalization
Each pair becomes associated with:
- A semantic embedding vector
- A fitted RobustScaler instance
- A collection of normalized values
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:190-249
Semantic Embedding Generation
The system generates semantic embeddings for each unique (concept, unit) pair using a pre-trained language model. These embeddings capture the semantic meaning of the financial concept and its measurement unit.
graph LR
Pairs["Concept/Unit Pairs\n(Revenues, USD)\n(Assets, USD)\n(NetIncomeLoss, USD)"] --> Model["Embedding Model\nPre-trained transformer"]
Model --> Vectors["Embedding Vectors\nFixed dimensionality\nSemantic representation"]
Vectors --> EmbedMap["embedding_map\nDict[(concept, unit) → embedding]"]
Embedding Process
The generate_concept_unit_embeddings() function creates fixed-dimensional vector representations:
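The exact model behind generate_concept_unit_embeddings() is not shown in the notebook excerpts; the sketch below illustrates the general approach using the sentence-transformers library (an assumption), which produces 384-dimensional vectors consistent with the dimensionalities listed below:

```python
from sentence_transformers import SentenceTransformer

# Assumed model choice for illustration; the notebook may use a different encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

pairs = [("Revenues", "USD"), ("Assets", "USD"), ("EarningsPerShareBasic", "USD_per_share")]
texts = [f"{concept} ({unit})" for concept, unit in pairs]

# One deterministic embedding per (concept, unit) pair.
vectors = model.encode(texts, convert_to_numpy=True)
embedding_map = dict(zip(pairs, vectors))
print(vectors.shape)  # (3, 384)
```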
Embedding Properties
| Property | Description |
|---|---|
| Dimensionality | Fixed size (typically 384 or 768 dimensions) |
| Semantic Preservation | Similar concepts produce similar embeddings |
| Deterministic | Same input always produces same output |
| Device-Agnostic | Can be computed on CPU or GPU |
The embeddings are stored in a mapping structure that associates each (concept, unit) pair with its corresponding vector. This mapping is used throughout the preprocessing and training pipelines.
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:279-292
Value Normalization with RobustScaler
Financial values exhibit extreme variance and outliers. The preprocessing pipeline uses RobustScaler from scikit-learn to normalize values within each (concept, unit) group.
RobustScaler Characteristics
RobustScaler is chosen over standard normalization techniques because:
- Uses median and interquartile range (IQR) instead of mean and standard deviation
- Resistant to outliers that are common in financial data
- Scales each feature independently
- Preserves zero values when appropriate
Scaling Formula
For each value in a (concept, unit) group:
scaled_value = (raw_value - median) / IQR
Where:
- median is the 50th percentile of all values in the group
- IQR is the interquartile range (75th percentile - 25th percentile)
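A sketch of this per-pair normalization using scikit-learn (the grouped values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Values grouped by (concept, unit) pair; numbers are illustrative only.
grouped_values = {
    ("Revenues", "USD"): np.array([1.2e9, 3.4e9, 9.8e8, 2.1e10]),
    ("EarningsPerShareBasic", "USD_per_share"): np.array([0.42, 1.05, -0.13, 2.30]),
}

scalers, scaled_values = {}, {}
for pair, values in grouped_values.items():
    scaler = RobustScaler()                        # median/IQR based, outlier-resistant
    scaled = scaler.fit_transform(values.reshape(-1, 1)).ravel()
    scalers[pair] = scaler                         # kept for inverse_transform later
    scaled_values[pair] = scaled
```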
Scaler Persistence
Each fitted RobustScaler instance is stored alongside the scaled values. This enables:
- Inverse transformation for interpreting model outputs
- Validation of scaling correctness
- Consistent processing of new data points
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:108-166
PCA Dimensionality Reduction
The preprocessing pipeline applies Principal Component Analysis (PCA) to compress semantic embeddings while preserving most of the variance. This reduces memory footprint and training time.
PCA Configuration
| Parameter | Value | Purpose |
|---|---|---|
| variance_threshold | 0.95 | Retain 95% of original variance |
| fit_data | All concept/unit embeddings | Learn principal components |
| transform_data | Same embeddings | Apply compression |
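In scikit-learn, passing a float in (0, 1) as n_components selects the smallest number of components whose cumulative explained variance reaches that threshold, which matches the 0.95 setting above. A minimal sketch (the embeddings array is a placeholder for the concept/unit embedding matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(500, 384)        # placeholder embedding matrix

pca = PCA(n_components=0.95)                 # keep enough components for 95% of the variance
compressed = pca.fit_transform(embeddings)

print(compressed.shape[1])                   # number of retained components
print(pca.explained_variance_ratio_.sum())   # >= 0.95
```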
Variance Explanation Visualization
The system provides visualization tools to analyze PCA effectiveness:
This function generates plots showing:
- Cumulative explained variance vs. number of components
- Individual component variance contribution
- The optimal number of components to retain 95% variance
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-270
graph TB
subgraph "Triplet Components"
Concept["concept: str\nFundamentalConcept variant"]
Unit["unit: str\nUnit of measure"]
ScaledVal["scaled_value: float\nRobustScaler output"]
UnscaledVal["unscaled_value: float\nOriginal raw value"]
Scaler["scaler: RobustScaler\nFitted scaler instance"]
Embedding["embedding: ndarray\nPCA-compressed vector"]
end
subgraph "Storage Format"
Index["Index: int\nSequential triplet ID"]
Serialized["Serialized Data\nPickled Python object"]
end
Concept --> Serialized
Unit --> Serialized
ScaledVal --> Serialized
UnscaledVal --> Serialized
Scaler --> Serialized
Embedding --> Serialized
Index --> Serialized
Serialized --> WebSocket["DataStoreWsClient\nWebSocket write"]
Triplet Storage Structure
The final output of preprocessing is a collection of triplets stored in the simd-r-drive key-value store via DataStoreWsClient. Each triplet contains all information needed for training and inference.
Triplet Schema
Storage Interface
The UsGaapStore facade provides methods for retrieving stored triplets:
| Method | Purpose |
|---|---|
| lookup_by_index(idx: int) | Retrieve single triplet by sequential index |
| batch_lookup_by_indices(indices: List[int]) | Retrieve multiple triplets efficiently |
| get_triplet_count() | Get total number of stored triplets |
| get_pair_count() | Get number of unique (concept, unit) pairs |
| get_embedding_matrix() | Retrieve full embedding matrix and pair list |
Retrieval Example
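A sketch of a single-triplet lookup, assuming us_gaap_store is an initialized UsGaapStore and using the dictionary fields described in the method table above:

```python
triplet = us_gaap_store.lookup_by_index(0)

print(triplet["concept"], triplet["unit"])   # e.g. "Revenues", "USD"
print(triplet["scaled_value"])               # RobustScaler-normalized value
print(triplet["embedding"].shape)            # PCA-compressed embedding vector
# triplet["scaler"] is the fitted RobustScaler for this (concept, unit) pair
```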
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:78-97
Data Validation
The preprocessing pipeline includes validation mechanisms to ensure data integrity throughout the transformation process.
Scaler Verification
A validation check confirms that stored scaled values match the transformation applied by the stored scaler:
graph LR
Lookup["Lookup triplet"] --> Extract["Extract unscaled_value\nand scaler"]
Extract --> Transform["scaler.transform()"]
Transform --> Compare["np.isclose()\ncheck vs scaled_value"]
Compare --> Pass["Validation Pass"]
Compare --> Fail["Validation Fail"]
This test:
- Retrieves a random triplet
- Re-applies the stored scaler to the unscaled value
- Verifies the result matches the stored scaled value
- Uses np.isclose() for floating-point tolerance, as in the sketch below
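A condensed sketch of this check, assuming an initialized us_gaap_store and the triplet fields named above:

```python
import random
import numpy as np

idx = random.randrange(us_gaap_store.get_triplet_count())
triplet = us_gaap_store.lookup_by_index(idx)

# Re-apply the stored scaler to the original value and compare against what was stored.
recomputed = triplet["scaler"].transform([[triplet["unscaled_value"]]])[0, 0]
assert np.isclose(recomputed, triplet["scaled_value"])
```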
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:91-96
Visualization Tools
The preprocessing module provides visualization utilities for analyzing embedding quality and PCA effectiveness.
PCA Explanation Plot
The plot_pca_explanation() function generates cumulative variance plots:
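The plotting helper itself belongs to the narrative_stack codebase; an equivalent sketch of the underlying cumulative-variance plot, built directly with matplotlib from the stored embedding matrix:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

embedding_matrix, pairs = us_gaap_store.get_embedding_matrix()

pca = PCA().fit(embedding_matrix)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.axhline(0.95, linestyle="--", label="95% variance threshold")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.legend()
plt.show()
```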
This visualization shows:
- How many principal components are needed to reach the variance threshold
- The diminishing returns of adding more components
- The optimal dimensionality for the compressed embeddings
Semantic Embedding Scatterplot
The plot_semantic_embeddings() function creates 2D projections of embeddings:
This visualization helps assess:
- Clustering patterns in concept/unit pairs
- Separation between different financial concepts
- Embedding space structure
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-270 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:337-343
graph LR
subgraph "Preprocessing Output"
Triplets["Stored Triplets\nsimd-r-drive storage"]
end
subgraph "Training Input"
Dataset["IterableConceptValueDataset\nStreaming data loader"]
Collate["collate_with_scaler()\nBatch construction"]
DataLoader["PyTorch DataLoader\nTraining batches"]
end
Triplets --> Dataset
Dataset --> Collate
Collate --> DataLoader
Integration with Training Pipeline
The preprocessed data serves as input to the machine learning training pipeline. The stored triplets are loaded via streaming datasets during training.
| Component | Purpose |
|---|---|
| Triplets in simd-r-drive | Persistent storage of preprocessed data |
| IterableConceptValueDataset | Streams triplets without loading all into memory |
| collate_with_scaler | Custom collation function for batch processing |
| PyTorch DataLoader | Manages batching, shuffling, and parallel loading |
For detailed information about the training pipeline, see Machine Learning Training Pipeline.
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-522
Performance Considerations
The preprocessing pipeline is designed to handle large-scale financial datasets efficiently.
Memory Management
| Strategy | Implementation |
|---|---|
| Streaming CSV reading | Process files one at a time, not all in memory |
| Incremental scaler fitting | Use online algorithms for large groups |
| Compressed storage | PCA reduces embedding footprint by ~50-70% |
| WebSocket batching | DataStoreWsClient batches writes for efficiency |
graph TB
subgraph "Rust Caching Layer"
HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nhttp_storage_cache.bin\n1 week TTL"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\npreprocessor_cache.bin\nPersistent storage"]
end
subgraph "Python Preprocessing"
CSVRead["Read CSV files"]
Process["Transform & normalize"]
end
HTTPCache -.->|Cached SEC data| CSVRead
PreCache -.->|Cached computations| Process
Caching Strategy
The Rust layer provides HTTP and preprocessor caches that the preprocessing pipeline can leverage:
Sources: src/caches.rs:1-66
Machine Learning Training Pipeline
Relevant source files
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb
Purpose and Scope
This page documents the machine learning training pipeline for the narrative_stack system, specifically the Stage 1 autoencoder that learns latent representations of US GAAP financial concepts. The training pipeline consumes preprocessed concept/unit/value triplets and their semantic embeddings (see Data Ingestion & Preprocessing) to train a neural network that can encode financial data into a compressed latent space.
The pipeline uses PyTorch Lightning for training orchestration, implements custom iterable datasets for efficient data streaming from the simd-r-drive WebSocket server, and provides comprehensive experiment tracking through TensorBoard. For details on the underlying storage mechanisms and data retrieval, see Database & Storage Integration.
Training Pipeline Architecture
The training pipeline operates as a streaming system that continuously fetches preprocessed triplets from the UsGaapStore and feeds them through the autoencoder model. The architecture emphasizes memory efficiency and reproducibility.
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:456-556
graph TB
subgraph "Data Source Layer"
DataStoreWsClient["DataStoreWsClient\n(simd_r_drive_ws_client)"]
UsGaapStore["UsGaapStore\nlookup_by_index()"]
end
subgraph "Dataset Layer"
IterableDataset["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
CollateFunction["collate_with_scaler()\nBatch construction"]
end
subgraph "PyTorch Lightning Training Loop"
DataLoader["DataLoader\nbatch_size from hparams\nnum_workers=2\npin_memory=True\npersistent_workers=True"]
Model["Stage1Autoencoder\nEncoder → Latent → Decoder"]
Optimizer["Adam Optimizer\n+ CosineAnnealingWarmRestarts\nReduceLROnPlateau"]
Trainer["pl.Trainer\nEarlyStopping\nModelCheckpoint\ngradient_clip_val"]
end
subgraph "Monitoring & Persistence"
TensorBoard["TensorBoardLogger\ntrain_loss\nval_loss_epoch\nlearning_rate"]
Checkpoints["Model Checkpoints\n.ckpt files\nsave_top_k=1\nmonitor='val_loss_epoch'"]
end
DataStoreWsClient --> UsGaapStore
UsGaapStore --> IterableDataset
IterableDataset --> CollateFunction
CollateFunction --> DataLoader
DataLoader --> Model
Model --> Optimizer
Optimizer --> Trainer
Trainer --> TensorBoard
Trainer --> Checkpoints
Checkpoints -.->|Resume training| Model
Stage1Autoencoder Model
Model Architecture
The Stage1Autoencoder is a fully-connected autoencoder that learns to compress financial concept embeddings combined with their scaled values into a lower-dimensional latent space. The model reconstructs its input, forcing the latent representation to capture the most important features.
graph LR
Input["Input Tensor\n[embedding + scaled_value]\nDimension: embedding_dim + 1"]
Encoder["Encoder Network\nfc1 → dropout → ReLU\nfc2 → dropout → ReLU"]
Latent["Latent Space\nDimension: latent_dim"]
Decoder["Decoder Network\nfc3 → dropout → ReLU\nfc4 → dropout → output"]
Output["Reconstructed Input\nSame dimension as input"]
Loss["MSE Loss\ninput vs output"]
Input --> Encoder
Encoder --> Latent
Latent --> Decoder
Decoder --> Output
Output --> Loss
Input --> Loss
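The full module definition lives in the narrative_stack package; the sketch below only captures the shape of such a model under the hyperparameters listed in the next table (hidden width, layer count, and the simplified optimizer are assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class Stage1AutoencoderSketch(pl.LightningModule):
    def __init__(self, input_dim=385, latent_dim=64, dropout_rate=0.1,
                 lr=1e-5, weight_decay=1e-6):
        super().__init__()
        self.save_hyperparameters()
        hidden = 256  # assumed hidden width for illustration
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.Dropout(dropout_rate), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.Dropout(dropout_rate), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Dropout(dropout_rate), nn.ReLU(),
            nn.Linear(hidden, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def training_step(self, batch, batch_idx):
        x, y, _scalers = batch                 # collate_with_scaler keeps scalers in a list
        loss = F.mse_loss(self(x), y)          # reconstruction error
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y, _scalers = batch
        loss = F.mse_loss(self(x), y)
        # Logged under the name the EarlyStopping/ModelCheckpoint callbacks monitor.
        self.log("val_loss_epoch", loss, on_epoch=True)

    def configure_optimizers(self):
        # Schedulers (CosineAnnealingWarmRestarts, ReduceLROnPlateau) omitted for brevity.
        return torch.optim.Adam(self.parameters(),
                                lr=self.hparams.lr,
                                weight_decay=self.hparams.weight_decay)
```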
Hyperparameters
The model exposes the following configurable hyperparameters through its hparams attribute:
| Parameter | Description | Typical Value |
|---|---|---|
| input_dim | Dimension of input (embedding + 1 for value) | Varies based on embedding size |
| latent_dim | Dimension of compressed latent space | 64-128 |
| dropout_rate | Dropout probability for regularization | 0.0-0.2 |
| lr | Initial learning rate | 1e-5 to 5e-5 |
| min_lr | Minimum learning rate for scheduler | 1e-7 to 1e-6 |
| batch_size | Training batch size | 32 (typical) |
| weight_decay | L2 regularization parameter | 1e-8 to 1e-4 |
| gradient_clip | Maximum gradient norm | 0.0-1.0 |
The model can be instantiated with default parameters or loaded from a checkpoint with overridden hyperparameters:
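For example, a checkpoint can be reloaded while overriding the learning rate via load_from_checkpoint keyword arguments (the path and values here are placeholders):

```python
# Resume from a prior checkpoint but continue with a smaller learning rate.
model = Stage1Autoencoder.load_from_checkpoint(
    "outputs/stage1_resume-v2.ckpt",  # hypothetical checkpoint path
    lr=1e-5,
    min_lr=1e-7,
)
print(model.hparams.lr)
```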
Loading from checkpoint with modified learning rate: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490
Loss Function and Optimization
The model uses Mean Squared Error (MSE) loss between the input and reconstructed output. The optimization strategy combines:
- Adam optimizer with configurable learning rate and weight decay
- CosineAnnealingWarmRestarts scheduler for cyclical learning rate annealing
- ReduceLROnPlateau for adaptive learning rate reduction when validation loss plateaus
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490
Dataset and Data Loading
IterableConceptValueDataset
The IterableConceptValueDataset is a custom PyTorch IterableDataset that streams training data from the UsGaapStore without loading the entire dataset into memory. This design enables training on datasets larger than available RAM.
Key characteristics:
graph TB
subgraph "Dataset Initialization"
Config["simd_r_drive_server_config\nhost + port"]
Params["Dataset Parameters\ninternal_batch_size\nreturn_scaler\nshuffle"]
end
subgraph "Data Streaming Process"
Store["UsGaapStore instance\nget_triplet_count()"]
IndexGen["Index Generator\nSequential or shuffled\nbased on shuffle param"]
BatchFetch["Internal Batching\nFetch internal_batch_size items\nvia batch_lookup_by_indices()"]
Unpack["Unpack Triplet Data\nembedding\nscaled_value\nscaler (optional)"]
end
subgraph "Output"
Tensor["PyTorch Tensors\nx: [embedding + scaled_value]\ny: [embedding + scaled_value]\nscaler: RobustScaler object"]
end
Config --> Store
Params --> IndexGen
Store --> IndexGen
IndexGen --> BatchFetch
BatchFetch --> Unpack
Unpack --> Tensor
- Iterable streaming : Data is fetched on-demand during iteration, not preloaded
- Internal batching : Fetches `internal_batch_size` items at once from the WebSocket server to reduce network overhead (typically 64)
- Optional shuffling : Can randomize index order for training or maintain sequential order for validation
- Scaler inclusion : Optionally returns the `RobustScaler` object with each sample for validation purposes
Dataset instantiation: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522
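A hedged sketch of this instantiation is shown below; the import path and the configuration-object shape are assumptions, while the argument names follow the description on this page.

```python
# Hypothetical construction of the training and validation datasets.
from narrative_stack.datasets import IterableConceptValueDataset  # assumed import path

simd_r_drive_server_config = {"host": "127.0.0.1", "port": 8080}  # assumed config shape

train_dataset = IterableConceptValueDataset(
    simd_r_drive_server_config,
    internal_batch_size=64,   # items fetched per WebSocket round trip
    return_scaler=True,       # include the RobustScaler with each sample
    shuffle=True,             # randomized index order for training
)
val_dataset = IterableConceptValueDataset(
    simd_r_drive_server_config,
    internal_batch_size=64,
    return_scaler=True,
    shuffle=False,            # sequential order for reproducible validation
)
```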
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522
DataLoader Configuration
PyTorch DataLoader instances wrap the iterable dataset with additional configuration for efficient batch loading:
| Parameter | Value | Purpose |
|---|---|---|
batch_size | From model.hparams.batch_size | Outer batch size for model training |
collate_fn | collate_with_scaler | Custom collation function to handle scaler objects |
num_workers | 2 | Number of parallel data loading processes |
pin_memory | True | Enables faster host-to-GPU transfers |
persistent_workers | True | Keeps worker processes alive between epochs |
prefetch_factor | 4 | Number of batches each worker pre-fetches |
Training loader uses shuffle=True in the dataset while validation loader uses shuffle=False to ensure reproducible validation metrics.
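Continuing the dataset sketch above, the two loaders can be assembled as follows; the values mirror the table, and `collate_with_scaler` is described in the next section.

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=model.hparams.batch_size,  # outer batch size from the model's hparams
    collate_fn=collate_with_scaler,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=4,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=model.hparams.batch_size,
    collate_fn=collate_with_scaler,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=4,
)
```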
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522
collate_with_scaler Function
The collate_with_scaler function handles batch construction when the dataset returns triplets (x, y, scaler). It stacks the tensors into batches while preserving the scaler objects in a list:
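A minimal sketch of such a collate function, assuming each batch item is an `(x, y, scaler)` triple:

```python
import torch

def collate_with_scaler(batch):
    """Stack (x, y) tensors into batches and keep the scalers as a plain list."""
    xs, ys, scalers = zip(*batch)
    return torch.stack(xs), torch.stack(ys), list(scalers)
```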
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb506-507
graph LR
Input["Batch Items\n[(x1, y1, scaler1),\n(x2, y2, scaler2),\n...]"]
Stack["Stack Tensors\ntorch.stack(x_list)\ntorch.stack(y_list)"]
Output["Batched Output\n(x_batch, y_batch, scaler_list)"]
Input --> Stack
Stack --> Output
Training Configuration
PyTorch Lightning Trainer Setup
The training pipeline uses PyTorch Lightning's Trainer class to orchestrate the training loop, validation, and callbacks:
Configuration values:
graph TB
subgraph "Training Configuration"
MaxEpochs["max_epochs\nTypically 1000"]
Accelerator["accelerator='auto'\ndevices=1"]
GradientClip["gradient_clip_val\nFrom model.hparams"]
end
subgraph "Callbacks"
EarlyStopping["EarlyStopping\nmonitor='val_loss_epoch'\npatience=20\nmode='min'"]
ModelCheckpoint["ModelCheckpoint\nmonitor='val_loss_epoch'\nsave_top_k=1\nfilename='stage1_resume'"]
end
subgraph "Logging"
TensorBoardLogger["TensorBoardLogger\nlog_dir=OUTPUT_PATH\nname='stage1_autoencoder'"]
end
Trainer["pl.Trainer"]
MaxEpochs --> Trainer
Accelerator --> Trainer
GradientClip --> Trainer
EarlyStopping --> Trainer
ModelCheckpoint --> Trainer
TensorBoardLogger --> Trainer
- EPOCHS : 1000 (allows for long training with early stopping)
- PATIENCE : 20 (compensates for learning rate annealing periods)
- OUTPUT_PATH : `project_paths.python_data / "stage1_23_(no_pre_dedupe)"` or a similar versioned directory
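A hedged sketch of assembling the `Trainer` from these values is shown below. The `model` comes from the earlier sketches, `OUTPUT_PATH` is a placeholder for the versioned output directory, and scheduler/callback details follow the diagram rather than the notebook's exact code.

```python
from pathlib import Path

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

EPOCHS = 1000
PATIENCE = 20
OUTPUT_PATH = Path("stage1_output")   # placeholder for the versioned output directory

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    accelerator="auto",
    devices=1,
    gradient_clip_val=model.hparams.gradient_clip,
    callbacks=[
        EarlyStopping(monitor="val_loss_epoch", patience=PATIENCE, mode="min"),
        ModelCheckpoint(monitor="val_loss_epoch", save_top_k=1, filename="stage1_resume"),
    ],
    logger=TensorBoardLogger(save_dir=str(OUTPUT_PATH), name="stage1_autoencoder"),
)
```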
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb468-548
Callbacks
EarlyStopping
The EarlyStopping callback monitors val_loss_epoch and stops training if no improvement occurs for 20 consecutive epochs. This prevents overfitting and saves computational resources.
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb528-530
ModelCheckpoint
The ModelCheckpoint callback saves the best model based on validation loss:
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb532-539
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb528-539
Training Execution
The training process is initiated with the Trainer.fit() method:
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556
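Continuing the sketches above, the call looks like this; `ckpt_path` is optional and is covered in the checkpointing section below.

```python
# Start training with the loaders built earlier.
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```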
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556
Checkpointing and Resuming Training
Saving Checkpoints
Checkpoints are automatically saved by the ModelCheckpoint callback when validation loss improves. The checkpoint file contains:
- Model weights (encoder and decoder parameters)
- Optimizer state
- Learning rate scheduler state
- All hyperparameters (`hparams`)
- Current epoch number
- Best validation loss
Checkpoint files are saved with the .ckpt extension in the OUTPUT_PATH directory.
Loading and Resuming
The training pipeline supports two modes of checkpoint loading:
1. Resume Training with Existing Configuration
Pass ckpt_path to trainer.fit() to resume training with the exact state from the checkpoint:
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb555
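A short sketch of this resume path is shown below; the checkpoint filename is an assumption based on the `ModelCheckpoint` configuration.

```python
# Resume with the full optimizer/scheduler state stored in the checkpoint.
trainer.fit(
    model,
    train_dataloaders=train_loader,
    val_dataloaders=val_loader,
    ckpt_path=str(OUTPUT_PATH / "stage1_resume.ckpt"),
)
```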
2. Fine-tune with Modified Hyperparameters
Load the checkpoint explicitly and override specific hyperparameters before training:
This approach is useful for fine-tuning with a lower learning rate after initial training convergence.
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb479-486
Checkpoint File Naming
The checkpoint filename is determined by the ModelCheckpoint configuration:
- filename : `"stage1_resume"` produces files like `stage1_resume.ckpt` or `stage1_resume-v1.ckpt`, `stage1_resume-v2.ckpt`, etc.
- Versioning : PyTorch Lightning automatically increments version numbers when the same filename exists
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb475-486 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb532-539 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556
graph TB
subgraph "Logged Metrics"
TrainLoss["train_loss\nPer-batch training loss"]
ValLoss["val_loss_epoch\nEpoch-averaged validation loss"]
LearningRate["learning_rate\nCurrent optimizer LR"]
end
subgraph "TensorBoard Logger"
Logger["TensorBoardLogger\nsave_dir=OUTPUT_PATH\nname='stage1_autoencoder'"]
end
subgraph "Log Directory Structure"
OutputPath["OUTPUT_PATH/"]
VersionDir["stage1_autoencoder/"]
Events["events.out.tfevents.*\nhparams.yaml"]
Checkpoints["*.ckpt checkpoint files"]
end
TrainLoss --> Logger
ValLoss --> Logger
LearningRate --> Logger
Logger --> OutputPath
OutputPath --> VersionDir
VersionDir --> Events
OutputPath --> Checkpoints
Monitoring and Logging
TensorBoard Integration
The TensorBoardLogger automatically logs training metrics for visualization in TensorBoard:
Log directory structure:
OUTPUT_PATH/
├── stage1_autoencoder/
│ └── version_0/
│ ├── events.out.tfevents.*
│ ├── hparams.yaml
│ └── checkpoints/
├── stage1_resume.ckpt
└── optuna_study.db (if using Optuna)
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb543
Visualizing Training Progress
Launch TensorBoard against the output directory (for example, `tensorboard --logdir <OUTPUT_PATH>`) to monitor training.
TensorBoard displays:
- Loss curves : Training and validation loss over epochs
- Learning rate schedule : Visualization of LR changes during training
- Hyperparameters : Table of all model hyperparameters
- Gradient histograms : Distribution of gradients (if enabled)
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb541-548
Training Workflow Summary
The complete training workflow follows this sequence: stream triplets from the UsGaapStore, batch them with collate_with_scaler, fit the Stage1Autoencoder with the PyTorch Lightning Trainer, and persist checkpoints and TensorBoard logs under OUTPUT_PATH.
This pipeline design enables:
- Memory-efficient training through streaming data loading
- Reproducible experiments with comprehensive logging
- Flexible resumption with hyperparameter overrides
- Robust training with gradient clipping and early stopping
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb456-556
Database & Storage Integration
Relevant source files
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb
- python/narrative_stack/us_gaap_store_integration_test.sh
- src/caches.rs
Purpose and Scope
This document describes the database and storage infrastructure used by the Python narrative_stack system to persist and retrieve US GAAP financial data. The system employs a dual-storage architecture: MySQL for structured financial records and a WebSocket-based key-value store (simd-r-drive) for high-performance embedding and triplet data. This page covers the DbUsGaap database interface, DataStoreWsClient WebSocket integration, UsGaapStore facade pattern, triplet storage format, and embedding matrix management.
For information about CSV ingestion and preprocessing operations that populate these storage systems, see Data Ingestion & Preprocessing. For information about how the training pipeline consumes this stored data, see Machine Learning Training Pipeline.
Storage Architecture Overview
The narrative_stack system uses two complementary storage backends to optimize for different data access patterns:
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
graph TB
subgraph "Python narrative_stack"
UsGaapStore["UsGaapStore\nUnified Facade"]
Preprocessing["Preprocessing Pipeline\ningest_us_gaap_csvs\ngenerate_pca_embeddings"]
Training["ML Training\nIterableConceptValueDataset"]
end
subgraph "Storage Backends"
DbUsGaap["DbUsGaap\nMySQL Interface\nus_gaap_test database"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client\nsimd-r-drive protocol"]
end
subgraph "External Services"
MySQL[("MySQL Server\nStructured Records\nSymbol/Concept/Value")]
SimdRDrive["simd-r-drive-ws-server\nKey-Value Store\nEmbeddings/Triplets/Scalers"]
end
Preprocessing -->|ingest CSV data| UsGaapStore
UsGaapStore -->|store records| DbUsGaap
UsGaapStore -->|store triplets/embeddings| DataStoreWsClient
DbUsGaap -->|SQL queries| MySQL
DataStoreWsClient -->|WebSocket protocol| SimdRDrive
UsGaapStore -->|retrieve triplets| Training
Training -->|batch lookups| UsGaapStore
The dual-storage design separates concerns:
- MySQL stores structured financial data with SQL query capabilities for filtering and aggregation
- simd-r-drive stores high-dimensional embeddings and preprocessed triplets for fast sequential access during training
MySQL Database Interface (DbUsGaap)
The DbUsGaap class provides an abstraction layer over MySQL database operations for US GAAP financial records.
Database Schema
The system expects a MySQL database named us_gaap_test with the following structure:
| Table/Field | Type | Purpose |
|---|---|---|
| Raw records table | - | Stores original CSV-ingested data |
symbol | VARCHAR | Company ticker symbol |
concept | VARCHAR | US GAAP concept name |
unit | VARCHAR | Unit of measurement (USD, shares, etc.) |
value | DECIMAL | Numeric value for the concept |
form | VARCHAR | SEC form type (10-K, 10-Q, etc.) |
filing_date | DATE | Date of filing |
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:30-35
DbUsGaap Connection Configuration
The DbUsGaap class is instantiated with configuration from db_config:
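A hedged sketch of this instantiation is shown below; the `DbUsGaap` constructor signature, import path, and `db_config` fields are assumptions consistent with the integration-test defaults described on this page.

```python
# Hypothetical MySQL interface construction for the us_gaap_test database.
from narrative_stack.db import DbUsGaap  # assumed import path

db_config = {
    "host": "127.0.0.1",
    "port": 3306,
    "user": "root",
    "password": "onlylocal",     # integration-test default
    "database": "us_gaap_test",
}
db_us_gaap = DbUsGaap(**db_config)
```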
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-18
The database connection is established during initialization and used for querying structured financial data during the ingestion phase. The DbUsGaap interface supports filtering by symbol, concept, and date ranges for selective data loading.
WebSocket Data Store (DataStoreWsClient)
The DataStoreWsClient provides a WebSocket-based interface to the simd-r-drive key-value store for high-performance binary data storage and retrieval.
simd-r-drive Integration
The simd-r-drive system is a custom WebSocket server that provides persistent key-value storage optimized for embedding matrices and preprocessed training data:
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:33-40
graph TB
subgraph "Python Client"
WSClient["DataStoreWsClient\nWebSocket protocol handler"]
BatchWrite["batch_write\n[(key, value)]"]
BatchRead["batch_read\n[keys] → [values]"]
end
subgraph "simd-r-drive Server"
WSServer["simd-r-drive-ws-server\nWebSocket endpoint"]
DataStore["DataStore\nBinary file backend"]
end
BatchWrite -->|WebSocket message| WSClient
BatchRead -->|WebSocket message| WSClient
WSClient <-->|ws://host:port| WSServer
WSServer <-->|read/write| DataStore
Connection Initialization
The WebSocket client is instantiated with server configuration:
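A hedged sketch of this initialization is shown below; the import path and constructor arguments are assumptions, while the host and port match the simd-r-drive test server configuration described elsewhere on this page.

```python
# Hypothetical WebSocket client construction for the simd-r-drive server.
from simd_r_drive_ws_client import DataStoreWsClient  # assumed import path

data_store = DataStoreWsClient("127.0.0.1", 8080)     # connects to ws://127.0.0.1:8080
```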
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb38
Batch Operations
The DataStoreWsClient supports efficient batch read and write operations:
| Operation | Input | Output | Purpose |
|---|---|---|---|
batch_write | [(key: bytes, value: bytes)] | None | Store multiple key-value pairs atomically |
batch_read | [key: bytes] | [value: bytes] | Retrieve multiple values by key |
The WebSocket protocol enables low-latency access to stored embeddings and preprocessed triplets during training iterations.
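An illustrative round trip through the two batch operations is shown below. Keys and values are raw bytes, so structured data must be serialized first; `pickle` here is an assumption for illustration, not the project's actual serialization format.

```python
import pickle

data_store.batch_write([
    (b"triplet:0", pickle.dumps({"concept": "Revenues", "unit": "USD", "scaled_value": 0.42})),
    (b"triplet:1", pickle.dumps({"concept": "Assets", "unit": "USD", "scaled_value": -0.13})),
])

values = data_store.batch_read([b"triplet:0", b"triplet:1"])
triplets = [pickle.loads(v) for v in values]
```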
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:50-55
Rust-side Cache Storage
On the Rust side, the simd-r-drive storage system is accessed through the Caches module, which provides two cache stores:
Sources: src/caches.rs:7-65
graph TB
ConfigManager["ConfigManager\ncache_base_dir"]
CachesInit["Caches::init\nInitialize cache stores"]
subgraph "Cache Stores"
HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nhttp_storage_cache.bin\nOnceLock<Arc<DataStore>>"]
PreprocessorCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\npreprocessor_cache.bin\nOnceLock<Arc<DataStore>>"]
end
HTTPCacheGet["Caches::get_http_cache_store"]
PreprocessorCacheGet["Caches::get_preprocessor_cache"]
ConfigManager -->|cache_base_dir path| CachesInit
CachesInit -->|open DataStore| HTTPCache
CachesInit -->|open DataStore| PreprocessorCache
HTTPCache -->|Arc clone| HTTPCacheGet
PreprocessorCache -->|Arc clone| PreprocessorCacheGet
The Rust implementation uses OnceLock<Arc<DataStore>> for thread-safe lazy initialization of cache stores. The DataStore::open method creates or opens persistent binary files at paths derived from ConfigManager:
- `http_storage_cache.bin` - HTTP response caching
- `preprocessor_cache.bin` - Preprocessed data caching
UsGaapStore Facade
The UsGaapStore class provides a unified interface that coordinates operations across both storage backends (MySQL and simd-r-drive):
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb40 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-68
graph TB
subgraph "UsGaapStore Facade"
IngestCSV["ingest_us_gaap_csvs\nCSV → MySQL + triplets"]
GeneratePCA["generate_pca_embeddings\nCompress embeddings"]
LookupByIndex["lookup_by_index\nRetrieve triplet"]
BatchLookup["batch_lookup_by_indices\nBulk retrieval"]
GetEmbeddingMatrix["get_embedding_matrix\nReturn all embeddings"]
GetTripletCount["get_triplet_count\nTotal stored triplets"]
GetPairCount["get_pair_count\nUnique concept/unit pairs"]
end
subgraph "Backend Operations"
MySQLWrite["DbUsGaap.insert\nStore raw records"]
SimdWrite["DataStoreWsClient.batch_write\nStore triplets/embeddings"]
SimdRead["DataStoreWsClient.batch_read\nRetrieve data"]
end
IngestCSV -->|raw data| MySQLWrite
IngestCSV -->|triplets| SimdWrite
GeneratePCA -->|compressed embeddings| SimdWrite
LookupByIndex -->|key query| SimdRead
BatchLookup -->|batch keys| SimdRead
GetEmbeddingMatrix -->|matrix key| SimdRead
UsGaapStore Initialization
The facade is initialized with a reference to the DataStoreWsClient:
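A hedged sketch of this wiring is shown below; the import path and constructor signature are assumptions.

```python
# Hypothetical facade construction on top of the WebSocket store.
from narrative_stack.us_gaap_store import UsGaapStore  # assumed import path

us_gaap_store = UsGaapStore(data_store)  # data_store: the DataStoreWsClient from above
```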
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb40
Key Methods
| Method | Parameters | Returns | Purpose |
|---|---|---|---|
ingest_us_gaap_csvs | csv_data_dir: Path, db_us_gaap: DbUsGaap | None | Walk CSV directory, parse records, store to both backends |
generate_pca_embeddings | None | None | Apply PCA dimensionality reduction to stored embeddings |
lookup_by_index | index: int | Triplet dict | Retrieve single triplet by sequential index |
batch_lookup_by_indices | indices: List[int] | List[Triplet dict] | Retrieve multiple triplets efficiently |
get_embedding_matrix | None | (ndarray, List[Tuple]) | Return full embedding matrix and concept/unit pairs |
get_triplet_count | None | int | Total number of stored triplets |
get_pair_count | None | int | Number of unique (concept, unit) pairs |
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-96
The UsGaapStore abstracts storage implementation details, allowing preprocessing and training code to interact with a simple, high-level API.
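An illustrative walk through the facade methods from the table above is shown below; argument and return shapes are assumptions drawn from this page, not verified signatures.

```python
from pathlib import Path

us_gaap_store.ingest_us_gaap_csvs(Path("data/us-gaap"), db_us_gaap)  # CSV directory + MySQL interface
us_gaap_store.generate_pca_embeddings()

print(us_gaap_store.get_triplet_count())   # total stored triplets
print(us_gaap_store.get_pair_count())      # unique (concept, unit) pairs

triplet = us_gaap_store.lookup_by_index(0)
batch = us_gaap_store.batch_lookup_by_indices([0, 1, 2])
matrix, pairs = us_gaap_store.get_embedding_matrix()
```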
Triplet Storage Format
The core data structure stored in simd-r-drive is the triplet : a tuple of (concept, unit, scaled_value) along with associated metadata.
Triplet Structure
Each triplet stored in simd-r-drive contains the following fields:
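The exact serialized layout is not reproduced here; an assumed (unverified) shape combining the fields referenced on this page looks roughly like the following.

```python
# Assumed shape only: field names are inferred from this page, not from the code.
example_triplet = {
    "concept": "Revenues",          # US GAAP concept name
    "unit": "USD",                  # unit of measurement
    "value": 1_250_000_000.0,       # unscaled value from the filing
    "scaled_value": 0.42,           # RobustScaler-transformed value
    "scaler": None,                 # RobustScaler for this (concept, unit) pair
    "pair_index": 17,               # row into the shared embedding matrix
}
```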
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:88-96
Scaler Validation
The storage system maintains consistency between stored values and scalers. During retrieval, the system can verify that the scaler produces the expected scaled value:
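A sketch of this round-trip check is shown below, assuming the retrieved triplet exposes the scaler alongside the scaled and unscaled values.

```python
import numpy as np

triplet = us_gaap_store.lookup_by_index(0)
recomputed = triplet["scaler"].transform(np.array([[triplet["value"]]]))[0, 0]
assert np.isclose(recomputed, triplet["scaled_value"])
```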
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:92-96
This validation ensures that the RobustScaler instance stored with each (concept, unit) pair correctly transforms the unscaled value to the stored scaled value, maintaining data integrity across storage and retrieval cycles.
graph LR
Index0["Index 0\nTriplet 0"]
Index1["Index 1\nTriplet 1"]
Index2["Index 2\nTriplet 2"]
IndexN["Index N\nTriplet N"]
BatchLookup["batch_lookup_by_indices\n[0, 1, 2, ..., N]"]
Index0 -->|retrieve| BatchLookup
Index1 -->|retrieve| BatchLookup
Index2 -->|retrieve| BatchLookup
IndexN -->|retrieve| BatchLookup
Triplet Indexing
Triplets are stored with sequential integer indices, enabling efficient batch retrieval during training:
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:87-89
The sequential indexing scheme allows the training dataset to iterate efficiently through all stored triplets without loading the entire dataset into memory.
graph TB
subgraph "Embedding Matrix Generation"
ConceptUnitPairs["Unique Pairs\n(concept, unit)\nCollected during ingestion"]
GenerateEmbeddings["generate_concept_unit_embeddings\nSentence transformer model"]
PCACompression["PCA Compression\nVariance threshold: 0.95"]
EmbeddingMatrix["Embedding Matrix\nShape: (n_pairs, embedding_dim)"]
end
subgraph "Storage"
StorePairs["Store pairs list\nKey: 'concept_unit_pairs'"]
StoreMatrix["Store matrix\nKey: 'embedding_matrix'"]
end
ConceptUnitPairs -->|semantic text| GenerateEmbeddings
GenerateEmbeddings -->|full embeddings| PCACompression
PCACompression -->|compressed| EmbeddingMatrix
EmbeddingMatrix --> StoreMatrix
ConceptUnitPairs --> StorePairs
Embedding Matrix Management
The UsGaapStore maintains a pre-computed embedding matrix that maps each unique (concept, unit) pair to its semantic embedding vector.
Embedding Matrix Structure
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb68 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:263-269
Retrieval and Usage
The embedding matrix can be retrieved for analysis or visualization:
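A short retrieval sketch follows; the return shapes follow the Key Methods table above.

```python
matrix, pairs = us_gaap_store.get_embedding_matrix()
print(matrix.shape)    # (n_pairs, embedding_dim) after PCA compression
print(pairs[:3])       # e.g. [("Revenues", "USD"), ("Assets", "USD"), ...]
```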
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:106-108 python/narrative_stack/notebooks/stage1_preprocessing.ipynb263
During training, each triplet's embedding is retrieved from storage rather than recomputed, ensuring consistent representations across training runs.
PCA Dimensionality Reduction
The generate_pca_embeddings method applies PCA to compress the raw semantic embeddings while retaining 95% of the variance:
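The sketch below is a stand-alone illustration (not the repository's code) of a 95%-variance PCA compression with scikit-learn; the raw embeddings here are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

raw_embeddings = np.random.rand(500, 384)   # placeholder for raw sentence embeddings
pca = PCA(n_components=0.95)                # keep components explaining 95% of variance
compressed = pca.fit_transform(raw_embeddings)
print(compressed.shape)                     # (500, reduced_dim)
```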
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb68
The PCA compression reduces storage requirements and computational overhead during training while maintaining semantic relationships between concept/unit pairs.
Data Flow Integration
The storage system integrates with both preprocessing and training workflows:
Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-68 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522
Training Data Access Pattern
During training, the IterableConceptValueDataset streams triplets from storage:
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-512
The dataset communicates with the simd-r-drive WebSocket server to stream triplets, avoiding memory exhaustion on large datasets. The collate_with_scaler function combines triplets into batches while preserving scaler metadata for potential validation or analysis.
Integration Testing
The storage integration is validated through automated integration tests:
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
The integration test script:
- Starts isolated Docker containers for MySQL and simd-r-drive using `docker compose` with project name `us_gaap_it`
- Waits for MySQL readiness using `mysqladmin ping`
- Creates the `us_gaap_test` database and loads the schema from `tests/integration/assets/us_gaap_schema_2025.sql`
- Runs pytest integration tests against the live services
- Tears down containers and volumes in a cleanup trap
This ensures that the storage layer functions correctly across both backends before code is merged.
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-38
Development Guide
Relevant source files
- python/narrative_stack/us_gaap_store_integration_test.sh
- src/network/sec_client.rs
- tests/config_manager_tests.rs
- tests/sec_client_tests.rs
Purpose and Scope
This guide provides an overview of development practices, code organization, and workflows for contributing to the rust-sec-fetcher project. It covers environment setup, code organization principles, development workflows, and common development tasks.
For detailed information about specific development topics, see:
- Testing strategies and test fixtures: Testing Strategy
- Continuous integration and automated testing: CI/CD Pipeline
- Docker container configuration: Docker Deployment
Development Environment Setup
Prerequisites
The project requires the following tools installed:
| Tool | Purpose | Version Requirement |
|---|---|---|
| Rust | Core application development | 1.87+ |
| Python | ML pipeline and preprocessing | 3.8+ |
| Docker | Integration testing and services | Latest stable |
| Git LFS | Large file support for test assets | Latest stable |
| MySQL | Database for US GAAP storage | 5.7+ or 8.0+ |
Rust Development Setup
1. Clone the repository and navigate to the root directory
2. Build the Rust application (`cargo build`)
3. Run the test suite (`cargo test`) to verify the setup
The Rust workspace is configured in Cargo.toml with all necessary dependencies declared. Key development dependencies include:
- `mockito` for HTTP mocking in tests
- `tempfile` for temporary file/directory creation in tests
- `tokio` test macros for async test support
Python Development Setup
1. Create a virtual environment: `uv venv --python=3.12`
2. Install dependencies using `uv`: `uv pip install -e . --group dev`
3. Verify installation by running the integration tests (requires Docker): `./us_gaap_store_integration_test.sh`
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Configuration Setup
The application requires a configuration file at ~/.config/sec-fetcher/config.toml or a custom path specified via command-line argument. Minimum configuration:
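A minimal sketch of such a file is shown below; `email` is the field the SEC-compliant User-Agent relies on, and the remaining keys (with illustrative values) are optional and taken from the documented configuration keys.

```toml
# ~/.config/sec-fetcher/config.toml (values below are illustrative assumptions)
email = "you@example.com"
max_concurrent = 2
max_retries = 5
min_delay_ms = 1000
```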
For non-interactive testing, use AppConfig directly in test code as shown in tests/config_manager_tests.rs:36-57
Sources: tests/config_manager_tests.rs:36-57 tests/sec_client_tests.rs:8-20
Code Organization and Architecture
Repository Structure
Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Module Dependency Flow
The dependency flow follows a layered architecture:
- Configuration Layer : `ConfigManager` loads settings from TOML files and credentials from keyring
- Network Layer : `SecClient` wraps the HTTP client with caching and throttling middleware
- Data Fetching Layer : Network module functions fetch raw data from SEC APIs
- Transformation Layer : Transformers normalize raw data into standardized concepts
- Model Layer : Data structures represent domain entities
Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95
Development Workflow
Standard Development Cycle
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Running Tests Locally
Rust Unit Tests
Run all Rust tests with `cargo test`.
Run a specific test module with `cargo test --test sec_client_tests`.
Run with output visibility using `cargo test -- --nocapture`.
Test Structure Mapping:
| Test File | Tests Component | Key Test Functions |
|---|---|---|
tests/config_manager_tests.rs | ConfigManager | test_load_custom_config, test_load_non_existent_config, test_fails_on_invalid_key |
tests/sec_client_tests.rs | SecClient | test_user_agent, test_fetch_json_without_retry_success, test_fetch_json_with_retry_failure |
Sources: tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159
Python Integration Tests
Integration tests require Docker services. Run them via the provided shell script, `./us_gaap_store_integration_test.sh`, from the `python/narrative_stack/` directory.
This script performs the following steps as defined in python/narrative_stack/us_gaap_store_integration_test.sh:1-39:
- Activates Python virtual environment
- Installs dependencies with `uv pip install -e . --group dev`
- Starts Docker Compose services (`db_test`, `simd_r_drive_ws_server_test`)
- Waits for MySQL availability
- Creates the `us_gaap_test` database
- Loads the schema from `tests/integration/assets/us_gaap_schema_2025.sql`
- Runs pytest integration tests
- Tears down containers on exit
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Writing Tests
Unit Test Pattern (Rust)
The codebase follows standard Rust testing patterns with mockito for HTTP mocking:
Key patterns demonstrated in tests/sec_client_tests.rs:35-62:
- Use `#[tokio::test]` for async tests
- Create a `mockito::Server` for HTTP endpoint mocking
- Construct `AppConfig` programmatically for test isolation
- Use `ConfigManager::from_app_config()` to bypass file system dependencies
- Assert on specific JSON fields in responses
Sources: tests/sec_client_tests.rs:35-62
Test Fixture Pattern
The codebase uses temporary directories for file-based tests:
This pattern ensures test isolation and automatic cleanup as shown in tests/config_manager_tests.rs:8-17
Sources: tests/config_manager_tests.rs:8-17
Error Case Testing
Test error conditions explicitly:
This test from tests/sec_client_tests.rs:93-120 verifies retry behavior by expecting exactly 3 HTTP requests (initial + 2 retries) before failing.
Sources: tests/sec_client_tests.rs:93-120
Common Development Tasks
Adding a New SEC Data Endpoint
To add support for fetching a new SEC data endpoint:
- Add URL enum variant in `src/models/url.rs`
- Create fetch function in `src/network/` following the pattern of existing functions
- Define data models in `src/models/` for the response structure
- Add transformation logic in `src/transformers/` if normalization is needed
- Write unit tests in `tests/` using `mockito::Server` for mocking
- Update main.rs to integrate the new endpoint into the processing pipeline
Example function signature pattern:
Adding a New FundamentalConcept Mapping
The distill_us_gaap_fundamental_concepts function maps raw SEC concept names to the FundamentalConcept enum. To add a new concept:
- Add enum variant to `FundamentalConcept` in `src/models/fundamental_concept.rs`
- Update the match arms in `src/transformers/distill_us_gaap_fundamental_concepts.rs`
- Add test case to verify the mapping in `tests/distill_tests.rs`
See the existing mapping patterns in the transformer module for hierarchical mappings (concepts that map to multiple parent categories).
Modifying HTTP Client Behavior
The SecClient is configured in src/network/sec_client.rs:21-89 Key configuration points:
| Configuration | Location | Purpose |
|---|---|---|
CachePolicy | src/network/sec_client.rs:45-50 | Controls cache TTL and behavior |
ThrottlePolicy | src/network/sec_client.rs:53-59 | Controls rate limiting and retries |
| User-Agent | src/network/sec_client.rs:91-108 | Constructs SEC-compliant User-Agent header |
To modify throttling behavior, adjust the ThrottlePolicy parameters:
- `base_delay_ms`: Minimum delay between requests
- `max_concurrent`: Maximum concurrent requests
- `max_retries`: Number of retry attempts on failure
- `adaptive_jitter_ms`: Random jitter to prevent thundering herd
Sources: src/network/sec_client.rs:21-89
Working with Caches
The system uses two cache types managed by the Caches module:
- HTTP Cache : Stores raw HTTP responses with configurable TTL (default: 1 week)
- Preprocessor Cache : Stores transformed/preprocessed data
Cache instances are accessed via Caches::get_http_cache_store() as shown in src/network/sec_client.rs73
During development, you may need to clear caches when testing data transformations. Cache data is persisted via the simd-r-drive backend.
Sources: src/network/sec_client.rs73
Code Quality Standards
TODO Comments and Technical Debt
The codebase uses TODO comments to mark areas for improvement. Examples from src/network/sec_client.rs:
- src/network/sec_client.rs46: Cache TTL should be configurable
- src/network/sec_client.rs57: Adaptive jitter should be configurable
- src/network/sec_client.rs100: Repository URL should be included in User-Agent
When adding TODO comments:
- Be specific about what needs to be done
- Include context about why it's not done now
- Reference related issues if applicable
Panic vs Result
The codebase follows Rust best practices:
- Use `Result<T, E>` for recoverable errors
- Use `panic!` only for non-recoverable errors or programming errors
Example from src/network/sec_client.rs:95-98:
This panics because an invalid email makes all SEC API calls fail, representing a configuration error rather than a runtime error.
Sources: src/network/sec_client.rs:95-98
Error Validation in Tests
Configuration validation is tested by verifying error messages contain expected content, as shown in tests/config_manager_tests.rs:68-94:
This pattern ensures configuration errors are informative to users.
Sources: tests/config_manager_tests.rs:68-94
Integration Test Architecture
The integration test script from python/narrative_stack/us_gaap_store_integration_test.sh:1-39 orchestrates:
- Python environment setup with dependencies
- Docker Compose service startup (isolated project name: `us_gaap_it`)
- MySQL container health check via `mysqladmin ping`
- Database creation and schema loading
- pytest execution with verbose output
- Automatic cleanup via EXIT trap
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Best Practices Summary
| Practice | Implementation | Reference |
|---|---|---|
| Test isolation | Use temporary directories and AppConfig::default() | tests/config_manager_tests.rs:9-17 |
| HTTP mocking | Use mockito::Server for endpoint simulation | tests/sec_client_tests.rs:37-45 |
| Async testing | Use #[tokio::test] attribute | tests/sec_client_tests.rs35 |
| Error handling | Prefer Result<T, E> over panic | src/network/sec_client.rs:140-165 |
| Configuration | Use ConfigManager::from_app_config() in tests | tests/sec_client_tests.rs10 |
| Integration testing | Use Docker Compose with isolated project names | python/narrative_stack/us_gaap_store_integration_test.sh8 |
| Cleanup | Use trap handlers for guaranteed cleanup | python/narrative_stack/us_gaap_store_integration_test.sh:14-19 |
Sources: tests/config_manager_tests.rs:9-17 tests/sec_client_tests.rs:35-62 src/network/sec_client.rs:140-165 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Testing Strategy
Relevant source files
- examples/us_gaap_human_readable.rs
- python/narrative_stack/us_gaap_store_integration_test.sh
- src/network/sec_client.rs
- tests/config_manager_tests.rs
- tests/distill_us_gaap_fundamental_concepts_tests.rs
- tests/sec_client_tests.rs
This page documents the testing approach for the rust-sec-fetcher codebase, covering both Rust unit tests and Python integration tests. The testing strategy emphasizes isolated unit testing with mocking for Rust components and end-to-end integration testing for the Python ML pipeline. For information about the CI/CD automation of these tests, see CI/CD Pipeline.
Test Architecture Overview
The codebase employs a dual-layer testing strategy that mirrors its dual-language architecture. Rust components use unit tests with HTTP mocking, while Python components use containerized integration tests with real database instances.
graph TB
subgraph "Rust Unit Tests"
SecClientTest["sec_client_tests.rs\nTest SecClient HTTP layer"]
DistillTest["distill_us_gaap_fundamental_concepts_tests.rs\nTest concept transformation"]
ConfigTest["config_manager_tests.rs\nTest configuration loading"]
end
subgraph "Test Infrastructure"
MockitoServer["mockito::Server\nHTTP mock server"]
TempDir["tempfile::TempDir\nTemporary config files"]
end
subgraph "Python Integration Tests"
IntegrationScript["us_gaap_store_integration_test.sh\nTest orchestration"]
PytestRunner["pytest\nTest execution"]
end
subgraph "Docker Test Environment"
MySQLContainer["us_gaap_test_db\nMySQL 8.0 container"]
SimdRDriveContainer["simd-r-drive-ws-server\nWebSocket server"]
SQLSchema["us_gaap_schema_2025.sql\nDatabase schema fixture"]
end
SecClientTest --> MockitoServer
ConfigTest --> TempDir
IntegrationScript --> MySQLContainer
IntegrationScript --> SimdRDriveContainer
IntegrationScript --> SQLSchema
IntegrationScript --> PytestRunner
PytestRunner --> MySQLContainer
PytestRunner --> SimdRDriveContainer
Test Component Relationships
Sources: tests/sec_client_tests.rs:1-159 tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 tests/config_manager_tests.rs:1-95 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Rust Unit Testing Strategy
The Rust test suite focuses on three critical areas: HTTP client behavior, data transformation correctness, and configuration management. Tests are isolated using mocking frameworks and temporary file systems.
SecClient HTTP Testing
The SecClient tests verify HTTP operations, retry logic, and authentication header generation using the mockito library for HTTP mocking.
| Test Function | Purpose | Mock Behavior |
|---|---|---|
test_user_agent | Validates User-Agent header format | N/A (direct method call) |
test_invalid_email_panic | Ensures invalid emails cause panic | N/A (panic verification) |
test_fetch_json_without_retry_success | Tests successful JSON fetch | Returns 200 with valid JSON |
test_fetch_json_with_retry_success | Tests successful fetch with retry available | Returns 200 immediately |
test_fetch_json_with_retry_failure | Verifies retry exhaustion | Returns 500 three times |
test_fetch_json_with_retry_backoff | Tests retry with eventual success | Returns 500 once, then 200 |
User Agent Validation Test Pattern
Sources: tests/sec_client_tests.rs:6-21
The test creates a minimal AppConfig with only the email field set tests/sec_client_tests.rs:8-10 constructs a ConfigManager and SecClient tests/sec_client_tests.rs:10-12 then verifies the User-Agent format matches the expected pattern including the package version from CARGO_PKG_VERSION tests/sec_client_tests.rs:14-20
HTTP Retry and Backoff Testing
The retry logic tests use mockito::Server to simulate various HTTP response scenarios:
Sources: tests/sec_client_tests.rs:64-158
graph TB
SetupMock["Setup mockito::Server"]
ConfigureResponse["Configure mock response\nStatus, Body, Expect count"]
CreateClient["Create SecClient\nwith max_retries config"]
FetchJSON["Call client.fetch_json()"]
VerifyResult["Verify result:\nSuccess or Error"]
VerifyCallCount["Verify mock was called\nexpected number of times"]
SetupMock --> ConfigureResponse
ConfigureResponse --> CreateClient
CreateClient --> FetchJSON
FetchJSON --> VerifyResult
FetchJSON --> VerifyCallCount
The retry failure test configures a mock to return HTTP 500 status and expects exactly 3 calls (initial + 2 retries) tests/sec_client_tests.rs:96-103 setting max_retries = 2 in the config tests/sec_client_tests.rs106 The test verifies the result is an error after exhausting retries tests/sec_client_tests.rs:111-119
The backoff test simulates transient failures by creating two mock endpoints: one that returns 500 once, and another that returns 200 tests/sec_client_tests.rs:126-140 This validates that the retry mechanism successfully recovers from temporary failures.
Sources: tests/sec_client_tests.rs:94-158
US GAAP Concept Transformation Testing
The distill_us_gaap_fundamental_concepts_tests.rs file contains 1,275 lines of comprehensive tests covering all 64 FundamentalConcept variants. These tests verify the complex mapping logic that normalizes diverse US GAAP terminology into a standardized taxonomy.
Test Coverage by Concept Category
| Category | Test Functions | Concept Variants Tested | Lines |
|---|---|---|---|
| Assets & Liabilities | 6 | 9 | 5-148 |
| Income Statement | 15 | 25 | 20-464 |
| Cash Flow | 12 | 15 | 550-683 |
| Equity | 7 | 13 | 150-286 |
| Revenue Variations | 2 (with 57+ assertions) | 45+ | 954-1188 |
Hierarchical Mapping Test Pattern
The tests verify that specific concepts map to both their precise category and parent categories:
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:131-139
graph TB
InputConcept["Input: 'AssetsCurrent'"]
CallDistill["Call distill_us_gaap_fundamental_concepts()"]
ExpectVector["Expect Vec with 2 elements"]
CheckSpecific["Assert contains:\nFundamentalConcept::CurrentAssets"]
CheckParent["Assert contains:\nFundamentalConcept::Assets"]
InputConcept --> CallDistill
CallDistill --> ExpectVector
ExpectVector --> CheckSpecific
ExpectVector --> CheckParent
The test for AssetsCurrent verifies it maps to both CurrentAssets (specific) and Assets (parent category) tests/distill_us_gaap_fundamental_concepts_tests.rs:132-138 This hierarchical pattern appears throughout the test suite for concepts that have parent-child relationships.
Synonym Consolidation Testing
Multiple test functions verify that synonym variations of the same concept map to a single canonical form:
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:80-112
graph LR
CostOfRevenue["CostOfRevenue"]
CostOfGoodsAndServicesSold["CostOfGoodsAndServicesSold"]
CostOfServices["CostOfServices"]
CostOfGoodsSold["CostOfGoodsSold"]
CostOfGoodsSoldExcluding["CostOfGoodsSold...\nExcludingDepreciation"]
CostOfGoodsSoldElectric["CostOfGoodsSoldElectric"]
Canonical["FundamentalConcept::\nCostOfRevenue"]
CostOfRevenue --> Canonical
CostOfGoodsAndServicesSold --> Canonical
CostOfServices --> Canonical
CostOfGoodsSold --> Canonical
CostOfGoodsSoldExcluding --> Canonical
CostOfGoodsSoldElectric --> Canonical
The test_cost_of_revenue function contains 6 assertions verifying different US GAAP terminology variants all map to FundamentalConcept::CostOfRevenue tests/distill_us_gaap_fundamental_concepts_tests.rs:81-111 This pattern ensures consistent normalization across different reporting styles.
Industry-Specific Revenue Testing
The revenue tests demonstrate the most complex mapping scenario, where 57+ industry-specific revenue concepts all normalize to FundamentalConcept::Revenues:
Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:954-1188
graph TB
subgraph "Standard Revenue Terms"
Revenues["Revenues"]
SalesRevenueNet["SalesRevenueNet"]
SalesRevenueServicesNet["SalesRevenueServicesNet"]
end
subgraph "Industry-Specific Terms"
HealthCare["HealthCareOrganizationRevenue"]
RealEstate["RealEstateRevenueNet"]
OilGas["OilAndGasRevenue"]
Financial["FinancialServicesRevenue"]
Advertising["AdvertisingRevenue"]
Subscription["SubscriptionRevenue"]
Mining["RevenueMineralSales"]
end
subgraph "Specialized Terms"
Franchisor["FranchisorRevenue"]
Admissions["AdmissionsRevenue"]
Licenses["LicensesRevenue"]
Royalty["RoyaltyRevenue"]
Clearing["ClearingFeesRevenue"]
Passenger["PassengerRevenue"]
end
Canonical["FundamentalConcept::Revenues"]
Revenues --> Canonical
SalesRevenueNet --> Canonical
SalesRevenueServicesNet --> Canonical
HealthCare --> Canonical
RealEstate --> Canonical
OilGas --> Canonical
Financial --> Canonical
Advertising --> Canonical
Subscription --> Canonical
Mining --> Canonical
Franchisor --> Canonical
Admissions --> Canonical
Licenses --> Canonical
Royalty --> Canonical
Clearing --> Canonical
Passenger --> Canonical
The test_revenues function contains 47 distinct assertions, each verifying a different industry-specific revenue term maps to the canonical FundamentalConcept::Revenues tests/distill_us_gaap_fundamental_concepts_tests.rs:955-1188 Some terms also map to more specific sub-categories, such as InterestAndDividendIncomeOperating mapping to both InterestAndDividendIncomeOperating and Revenues tests/distill_us_gaap_fundamental_concepts_tests.rs:989-994
graph TB
CreateTempDir["tempfile::tempdir()"]
CreatePath["dir.path().join('config.toml')"]
WriteContents["Write TOML contents\nto file"]
LoadConfig["ConfigManager::from_config(path)"]
AssertValues["Assert config values\nmatch expectations"]
DropTempDir["Drop TempDir\n(automatic cleanup)"]
CreateTempDir --> CreatePath
CreatePath --> WriteContents
WriteContents --> LoadConfig
LoadConfig --> AssertValues
AssertValues --> DropTempDir
Configuration Manager Testing
The config_manager_tests.rs file tests configuration loading, validation, and error handling using temporary files created with the tempfile crate.
Temporary Config File Test Pattern
Sources: tests/config_manager_tests.rs:8-57
The helper function create_temp_config creates a temporary directory and writes config contents to a config.toml file tests/config_manager_tests.rs:9-17 The function returns both the TempDir (to prevent premature cleanup) and the PathBuf tests/config_manager_tests.rs16 Tests explicitly drop the TempDir at the end to ensure cleanup tests/config_manager_tests.rs56
Invalid Configuration Key Detection
The test suite verifies that unknown configuration keys are rejected with helpful error messages:
| Test | Config Content | Expected Behavior |
|---|---|---|
test_load_custom_config | Valid keys only | Success, values loaded |
test_load_non_existent_config | N/A (file doesn't exist) | Error returned |
test_fails_on_invalid_key | Contains invalid_key | Error with valid keys listed |
Sources: tests/config_manager_tests.rs:68-94
The invalid key test verifies that the error message contains documentation of valid configuration keys, including their types tests/config_manager_tests.rs:88-93:
- `email` (String | Null)
- `max_concurrent` (Integer | Null)
- `max_retries` (Integer | Null)
- `min_delay_ms` (Integer | Null)
Python Integration Testing Strategy
The Python testing strategy uses Docker Compose to orchestrate a complete test environment with MySQL database and WebSocket services, enabling end-to-end validation of the data pipeline.
graph TB
ScriptStart["us_gaap_store_integration_test.sh"]
SetupEnv["Set environment variables\nPROJECT_NAME=us_gaap_it\nCOMPOSE=docker compose -p ..."]
ActivateVenv["source .venv/bin/activate"]
InstallDeps["uv pip install -e . --group dev"]
RegisterTrap["trap 'cleanup' EXIT"]
StartContainers["docker compose up -d\n--profile test"]
WaitMySQL["Wait for MySQL ping\nmysqladmin ping loop"]
CreateDB["CREATE DATABASE us_gaap_test"]
LoadSchema["mysql < us_gaap_schema_2025.sql"]
RunPytest["pytest -s -v\ntest_us_gaap_store.py"]
Cleanup["Cleanup trap:\ndocker compose down\n--volumes --remove-orphans"]
ScriptStart --> SetupEnv
SetupEnv --> ActivateVenv
ActivateVenv --> InstallDeps
InstallDeps --> RegisterTrap
RegisterTrap --> StartContainers
StartContainers --> WaitMySQL
WaitMySQL --> CreateDB
CreateDB --> LoadSchema
LoadSchema --> RunPytest
RunPytest --> Cleanup
Integration Test Orchestration Flow
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Docker Compose Test Profile
The script uses an isolated Docker Compose project name to prevent conflicts with development containers python/narrative_stack/us_gaap_store_integration_test.sh:8-9:
- `PROJECT_NAME="us_gaap_it"`
- `COMPOSE="docker compose -p $PROJECT_NAME --profile test"`
The --profile test flag ensures only test-specific services are started python/narrative_stack/us_gaap_store_integration_test.sh23 which include:
- `db_test` (MySQL container named `us_gaap_test_db`)
- `simd_r_drive_ws_server_test` (WebSocket server)
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-23
MySQL Test Database Setup
The integration test script performs a multi-step database initialization:
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35
The script waits for MySQL to be ready using a ping loop python/narrative_stack/us_gaap_store_integration_test.sh:25-28:
Then creates the test database python/narrative_stack/us_gaap_store_integration_test.sh:30-31 and loads the schema from a fixture file python/narrative_stack/us_gaap_store_integration_test.sh:33-35
Test Cleanup and Isolation
The script registers a cleanup trap to ensure test containers are always removed python/narrative_stack/us_gaap_store_integration_test.sh:14-19:
This trap executes on both successful completion and script failure, ensuring:
- All containers in the `us_gaap_it` project are stopped and removed
- All volumes are deleted (`--volumes`)
- Orphaned containers from previous runs are cleaned up (`--remove-orphans`)
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19
Pytest Execution Environment
The test environment is configured with specific paths and options python/narrative_stack/us_gaap_store_integration_test.sh:37-38:
| Configuration | Value | Purpose |
|---|---|---|
PYTHONPATH | src | Ensure module imports work |
pytest flags | -s -v | Show output, verbose mode |
| Test path | tests/integration/test_us_gaap_store.py | Integration test module |
The -s flag disables output capturing, allowing print statements and logs to display immediately during test execution, which is useful for debugging long-running integration tests. The -v flag enables verbose mode with detailed test function names.
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37-38
Test Fixtures and Mocking Patterns
The codebase uses different mocking strategies appropriate to each language and component.
graph TB
CreateServer["mockito::Server::new_async()"]
ConfigureMock["server.mock(method, path)\n.with_status(code)\n.with_body(json)\n.expect(count)"]
CreateAsyncMock["create_async().await"]
GetServerUrl["server.url()"]
MakeRequest["client.fetch_json(url)"]
VerifyMock["Mock automatically verifies\nexpected call count"]
CreateServer --> ConfigureMock
ConfigureMock --> CreateAsyncMock
CreateAsyncMock --> GetServerUrl
GetServerUrl --> MakeRequest
MakeRequest --> VerifyMock
Rust Mocking with mockito
The mockito library provides HTTP server mocking for testing the SecClient:
Sources: tests/sec_client_tests.rs:36-62
Mock configuration example from test_fetch_json_without_retry_success tests/sec_client_tests.rs:39-45:
- Method: `GET`
- Path: `/files/company_tickers.json`
- Status: `200`
- Header: `Content-Type: application/json`
- Body: JSON string with sample ticker data
- Call using `server.url()` to get the mock endpoint
graph LR
CreateTempDir["tempfile::tempdir()"]
GetPath["dir.path().join('config.toml')"]
CreateFile["fs::File::create(path)"]
WriteContents["writeln!(file, contents)"]
ReturnBoth["Return (TempDir, PathBuf)"]
CreateTempDir --> GetPath
GetPath --> CreateFile
CreateFile --> WriteContents
WriteContents --> ReturnBoth
Temporary File Fixtures
Configuration tests use the tempfile crate to create isolated test environments:
Sources: tests/config_manager_tests.rs:8-17
The create_temp_config helper returns both the TempDir and PathBuf to prevent premature cleanup tests/config_manager_tests.rs16 The TempDir must remain in scope until after the test completes, as dropping it deletes the temporary directory.
SQL Schema Fixtures
The Python integration tests use a versioned SQL schema file as a fixture:
| File | Purpose | Usage |
|---|---|---|
tests/integration/assets/us_gaap_schema_2025.sql | Define us_gaap_test database structure | Loaded via docker exec python/narrative_stack/us_gaap_store_integration_test.sh:33-35 |
This approach ensures tests run against a known database structure that matches production schema, enabling reliable testing of database operations and query logic.
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:33-35
Test Execution Commands
The following table summarizes how to run different test suites:
| Test Suite | Command | Working Directory | Prerequisites |
|---|---|---|---|
| All Rust unit tests | cargo test | Repository root | Rust toolchain |
| Specific Rust test file | cargo test --test sec_client_tests | Repository root | Rust toolchain |
| Single Rust test function | cargo test test_user_agent | Repository root | Rust toolchain |
| Python integration tests | ./us_gaap_store_integration_test.sh | python/narrative_stack/ | Docker, Git LFS, uv |
| Python integration with pytest | pytest -s -v tests/integration/ | python/narrative_stack/ | Docker, containers running |
Prerequisites for Integration Tests
The Python integration test script requires python/narrative_stack/us_gaap_store_integration_test.sh:1-2:
- Git LFS enabled and updated (for large test fixtures)
- Docker with Docker Compose v2
- Python virtual environment with the `uv` package manager
- Test data files committed with Git LFS
Sources: tests/sec_client_tests.rs:1-159 tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 tests/config_manager_tests.rs:1-95 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
CI/CD Pipeline
Relevant source files
- .github/workflows/us-gaap-store-integration-test.yml
- python/narrative_stack/Dockerfile.simd-r-drive-ci-server
- python/narrative_stack/us_gaap_store_integration_test.sh
Purpose and Scope
This document explains the continuous integration and continuous deployment (CI/CD) infrastructure for the rust-sec-fetcher repository. It covers the GitHub Actions workflow configuration, integration test automation, Docker container orchestration for testing, and the complete test execution flow.
The CI/CD pipeline focuses specifically on the Python narrative_stack system's integration testing. For general testing strategies including Rust unit tests and Python test fixtures, see Testing Strategy. For production Docker deployment of the simd-r-drive WebSocket server, see Docker Deployment.
GitHub Actions Workflow Overview
The repository implements a single GitHub Actions workflow named US GAAP Store Integration Test that validates the Python machine learning pipeline's integration with external dependencies.
Workflow Trigger Conditions
The workflow is configured to run automatically under specific conditions defined in .github/workflows/us-gaap-store-integration-test.yml:3-11:
Sources: .github/workflows/us-gaap-store-integration-test.yml:3-11
The workflow only executes when changes affect:
- Any file under the `python/narrative_stack/` directory
- The workflow definition file itself (`.github/workflows/us_gaap_store_integration_test.yml`)
This targeted triggering prevents unnecessary CI runs when changes are made to the Rust sec-fetcher components or unrelated Python modules.
Integration Test Job Structure
The workflow defines a single job named integration-test that runs on ubuntu-latest runners. The job executes a sequence of setup and validation steps.
Sources: .github/workflows/us-gaap-store-integration-test.yml:12-51
graph TB
Start["Job: integration-test\nruns-on: ubuntu-latest"]
Checkout["Step 1: Checkout repo\nactions/checkout@v4\nwith lfs: true"]
SetupPython["Step 2: Set up Python\nactions/setup-python@v5\npython-version: 3.12"]
InstallUV["Step 3: Install uv\ncurl astral.sh/uv/install.sh\nAdd to PATH"]
InstallDeps["Step 4: Install Python dependencies\nuv venv --python=3.12\nuv pip install -e . --group dev"]
Ruff["Step 5: Check code style with Ruff\nruff check ."]
Chmod["Step 6: Make test script executable\nchmod +x us_gaap_store_integration_test.sh"]
RunTest["Step 7: Run integration test\n./us_gaap_store_integration_test.sh"]
Start --> Checkout
Checkout --> SetupPython
SetupPython --> InstallUV
InstallUV --> InstallDeps
InstallDeps --> Ruff
Ruff --> Chmod
Chmod --> RunTest
style Start fill:#f9f9f9
style RunTest fill:#f9f9f9
Key Workflow Steps
| Step | Action | Purpose |
|---|---|---|
| Checkout repo | actions/checkout@v4 with lfs: true | Clone repository with Git LFS support for large test data files |
| Set up Python | actions/setup-python@v5 | Install Python 3.12 runtime |
| Install uv | Shell script execution | Install uv package manager from astral.sh |
| Install dependencies | uv pip install -e . --group dev | Install narrative_stack package in editable mode with dev dependencies |
| Check code style | ruff check . | Validate Python code formatting and linting rules |
| Make executable | chmod +x | Grant execution permissions to test script |
| Run integration test | Execute bash script | Orchestrate Docker containers and run pytest tests |
Sources: .github/workflows/us-gaap-store-integration-test.yml:17-50
Integration Test Architecture
The integration test orchestrates multiple Docker containers to create an isolated test environment that validates the narrative_stack system's ability to ingest, process, and store US GAAP financial data.
graph TB
subgraph "Docker Compose Project: us_gaap_it"
MySQL["Container: us_gaap_test_db\nImage: mysql\nPort: 3306\nUser: root\nPassword: onlylocal\nDatabase: us_gaap_test"]
SimdRDrive["Container: simd_r_drive_ws_server_test\nBuilt from: Dockerfile.simd-r-drive-ci-server\nPort: 8080\nBinary: simd-r-drive-ws-server@0.10.0-alpha"]
TestRunner["Test Runner\npytest process\ntests/integration/test_us_gaap_store.py"]
end
Schema["SQL Schema\ntests/integration/assets/us_gaap_schema_2025.sql"]
TestRunner -->|SQL queries| MySQL
TestRunner -->|WebSocket connection| SimdRDrive
Schema -->|Loaded via mysql CLI| MySQL
style MySQL fill:#f9f9f9
style SimdRDrive fill:#f9f9f9
style TestRunner fill:#f9f9f9
Container Architecture
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39 python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34
The test environment consists of:
- MySQL Database Container (`us_gaap_test_db`): Provides persistent storage for US GAAP financial data with a pre-loaded schema
- simd-r-drive WebSocket Server (`simd_r_drive_ws_server_test`): Handles embedding matrix storage and data synchronization
- pytest Test Runner : Executes integration tests against both containers
Test Execution Flow
The integration test script python/narrative_stack/us_gaap_store_integration_test.sh:1-39 orchestrates the complete test lifecycle from container startup through teardown.
graph TB
Start["Start: us_gaap_store_integration_test.sh"]
SetVars["Set variables\nPROJECT_NAME=us_gaap_it\nCOMPOSE=docker compose -p us_gaap_it"]
ActivateVenv["Activate virtual environment\nsource .venv/bin/activate"]
InstallDeps["Install dependencies\nuv pip install -e . --group dev"]
RegisterTrap["Register cleanup trap\ntrap 'cleanup' EXIT"]
DockerUp["Start Docker containers\ndocker compose up -d\nProfile: test"]
WaitMySQL["Wait for MySQL ready\nmysqladmin ping loop"]
CreateDB["Create database\nCREATE DATABASE us_gaap_test"]
LoadSchema["Load schema\nmysql < us_gaap_schema_2025.sql"]
SetPythonPath["Set PYTHONPATH=src"]
RunPytest["Execute pytest\npytest -s -v tests/integration/test_us_gaap_store.py"]
Cleanup["Cleanup function\ndocker compose down --volumes"]
Start --> SetVars
SetVars --> ActivateVenv
ActivateVenv --> InstallDeps
InstallDeps --> RegisterTrap
RegisterTrap --> DockerUp
DockerUp --> WaitMySQL
WaitMySQL --> CreateDB
CreateDB --> LoadSchema
LoadSchema --> SetPythonPath
SetPythonPath --> RunPytest
RunPytest --> Cleanup
style Start fill:#f9f9f9
style RunPytest fill:#f9f9f9
style Cleanup fill:#f9f9f9
Test Script Flow Diagram
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Cleanup Mechanism
The script implements a robust cleanup mechanism using bash trap handlers python/narrative_stack/us_gaap_store_integration_test.sh:14-19:
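The exact function body lives in the script itself; a minimal sketch of the pattern, reusing the project name and profile named elsewhere on this page, might look like:

```bash
#!/usr/bin/env bash
set -euo pipefail

PROJECT_NAME="us_gaap_it"
COMPOSE="docker compose -p ${PROJECT_NAME}"

cleanup() {
  # Tear down containers, networks, and volumes even if pytest failed
  ${COMPOSE} --profile test down --volumes --remove-orphans
}

# Run cleanup on any script exit: success, failure, or interruption
trap cleanup EXIT
```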
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19
The cleanup function ensures that all Docker resources are removed regardless of test outcome, preventing resource leaks and port conflicts in subsequent test runs.
Docker Container Configuration
simd-r-drive-ws-server Container Build
The Dockerfile python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34 creates a single-stage image optimized for CI environments.
Build Configuration:
| Parameter | Value | Purpose |
|---|---|---|
| Base Image | rust:1.87-slim | Minimal Rust runtime environment |
| Binary Installation | cargo install --locked simd-r-drive-ws-server@0.10.0-alpha | Install specific locked version of server |
| Default Server Args | data.bin --host 127.0.0.1 --port 8080 | Configure server binding and data file |
| Exposed Port | 8080 | WebSocket server port |
| Entrypoint | Shell wrapper with $SERVER_ARGS interpolation | Allow runtime argument override |
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-33
graph LR
BuildTime["Build Time\n--build-arg SERVER_ARGS=..."]
BakeArgs["Bake into ENV variable"]
Runtime["Run Time\n-e SERVER_ARGS=... OR\ndocker run ... extra flags"]
Entrypoint["ENTRYPOINT interpolates\n$SERVER_ARGS + $@"]
ServerExec["Execute:\nsimd-r-drive-ws-server\nwith combined args"]
BuildTime --> BakeArgs
BakeArgs --> Runtime
Runtime --> Entrypoint
Entrypoint --> ServerExec
style Entrypoint fill:#f9f9f9
The Dockerfile uses build arguments to support configuration at both build-time and run-time, as illustrated in the diagram above.
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-33
MySQL Container Configuration
The MySQL container is configured through Docker Compose with the following specifications python/narrative_stack/us_gaap_store_integration_test.sh:23-35:
| Configuration | Value | Source |
|---|---|---|
| Container Name | us_gaap_test_db | Docker Compose service name |
| Root Password | onlylocal | Environment variable |
| Database Name | us_gaap_test | Created via CREATE DATABASE command |
| Schema File | tests/integration/assets/us_gaap_schema_2025.sql | Loaded via mysql CLI |
| Health Check | mysqladmin ping -h127.0.0.1 | Readiness probe |
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35
Environment Configuration and Dependencies
graph TB
UVInstall["Install uv package manager\ncurl -LsSf astral.sh/uv/install.sh"]
CreateVenv["Create virtual environment\nuv venv --python=3.12"]
ActivateVenv["Activate environment\nsource .venv/bin/activate"]
InstallPackage["Install narrative_stack\nuv pip install -e . --group dev"]
DevGroup["Dev dependency group includes:\n- pytest\n- ruff\n- integration test utilities"]
UVInstall --> CreateVenv
CreateVenv --> ActivateVenv
ActivateVenv --> InstallPackage
InstallPackage --> DevGroup
style InstallPackage fill:#f9f9f9
Python Environment Setup
The CI pipeline uses uv as the package manager to create isolated Python environments and install dependencies:
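A minimal local sketch of the same steps (the installer URL and flags follow the commands named in the diagram above; the workflow drives them through GitHub Actions):

```bash
# Install the uv package manager (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a Python 3.12 virtual environment
uv venv --python=3.12
source .venv/bin/activate

# Install narrative_stack in editable mode with the dev dependency group
uv pip install -e . --group dev
```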
Sources: .github/workflows/us-gaap-store-integration-test.yml:27-37 python/narrative_stack/us_gaap_store_integration_test.sh:11-12
Code Quality Validation
Before running integration tests, the workflow executes Ruff code quality checks .github/workflows/us-gaap-store-integration-test.yml:39-43:
| Check | Command | Purpose |
|---|---|---|
| Linting | ruff check . | Validate code style, detect anti-patterns, enforce naming conventions |
The Ruff check acts as a fast pre-test gate that fails the CI pipeline if code quality issues are detected, preventing broken code from reaching the integration test phase.
Docker Compose Project Isolation
The integration test uses Docker Compose project isolation to prevent conflicts with other running containers python/narrative_stack/us_gaap_store_integration_test.sh:7-9:
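A sketch of the two variables involved, matching the values listed in the test script flow above:

```bash
# All compose invocations in the script go through this prefixed command,
# so every container, network, and volume lands in the us_gaap_it namespace
PROJECT_NAME="us_gaap_it"
COMPOSE="docker compose -p ${PROJECT_NAME}"
```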
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:7-9
The project name us_gaap_it creates an isolated namespace for all Docker resources (containers, networks, volumes) associated with the integration test, allowing multiple test runs or development environments to coexist without interference.
Test Execution and Reporting
pytest Integration Test Execution
The final step executes the integration test suite using pytest python/narrative_stack/us_gaap_store_integration_test.sh:37-38:
| Parameter | Value | Purpose |
|---|---|---|
| Test File | tests/integration/test_us_gaap_store.py | Integration test module |
| Verbosity | -v | Verbose output with test names |
| Output | -s | Show stdout/stderr (no capture) |
| Python Path | PYTHONPATH=src | Locate source modules |
| Working Directory | python/narrative_stack | Test execution context |
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37-38
The pytest command runs with stdout capture disabled (-s flag), allowing real-time visibility into test progress and enabling debugging output from the integration tests.
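Put together, the invocation is roughly the following, assuming it is run from python/narrative_stack:

```bash
# PYTHONPATH=src lets pytest import the package from the source tree
PYTHONPATH=src pytest -s -v tests/integration/test_us_gaap_store.py
```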
graph TB
GitPush["Git Push/PR\nto python/narrative_stack/"]
GHAStart["GitHub Actions Trigger\nus-gaap-store-integration-test.yml"]
EnvSetup["Environment Setup:\n- Python 3.12\n- uv package manager\n- Dev dependencies"]
CodeQuality["Code Quality Check:\nruff check ."]
QualityFail["❌ Fail CI"]
QualityPass["✓ Pass"]
ScriptExec["Execute Test Script:\nus_gaap_store_integration_test.sh"]
DockerSetup["Docker Compose Setup:\n- MySQL (us_gaap_test_db)\n- simd-r-drive-ws-server"]
SchemaLoad["Load SQL Schema:\nus_gaap_schema_2025.sql"]
PytestRun["Run Integration Tests:\npytest tests/integration/"]
TestFail["❌ Fail CI"]
TestPass["✓ Pass"]
Cleanup["Cleanup:\ndocker compose down --volumes"]
CIComplete["✅ CI Complete"]
GitPush --> GHAStart
GHAStart --> EnvSetup
EnvSetup --> CodeQuality
CodeQuality -->|Issues detected| QualityFail
CodeQuality -->|Clean| QualityPass
QualityPass --> ScriptExec
ScriptExec --> DockerSetup
DockerSetup --> SchemaLoad
SchemaLoad --> PytestRun
PytestRun -->|Tests fail| TestFail
PytestRun -->|Tests pass| TestPass
TestFail --> Cleanup
TestPass --> Cleanup
Cleanup --> CIComplete
style GHAStart fill:#f9f9f9
style CodeQuality fill:#f9f9f9
style PytestRun fill:#f9f9f9
style CIComplete fill:#f9f9f9
CI/CD Pipeline Flow Summary
The complete CI/CD pipeline integrates all components into a continuous validation workflow:
Sources: .github/workflows/us-gaap-store-integration-test.yml:1-51 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Key Design Decisions
Git LFS Requirement
The workflow explicitly enables Git LFS .github/workflows/us-gaap-store-integration-test.yml:19-20 to support large test data files. The test script includes a comment python/narrative_stack/us_gaap_store_integration_test.sh:1-2 noting this requirement for local execution.
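For local runs, the rough equivalent of the workflow's lfs: true checkout is pulling the LFS objects manually (a sketch; the script itself only documents the requirement):

```bash
# Fetch the large test fixtures tracked by Git LFS before running the script locally
git lfs install
git lfs pull
```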
Single-Stage Docker Build
The Dockerfile uses a single-stage build python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-15 rather than multi-stage, prioritizing simplicity for CI environments where image size is less critical than build reliability and transparency.
Profile-Based Container Selection
The Docker Compose command uses --profile test python/narrative_stack/us_gaap_store_integration_test.sh:9 to selectively start only test-related services, preventing unnecessary containers from consuming CI resources.
Working Directory Isolation
All workflow steps and script commands execute from the python/narrative_stack working directory .github/workflows/us-gaap-store-integration-test.yml:33, ensuring consistent path resolution and dependency isolation from other repository components.
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Docker Deployment
Relevant source files
- python/narrative_stack/Dockerfile.simd-r-drive-ci-server
- python/narrative_stack/us_gaap_store_integration_test.sh
Purpose and Scope
This document describes the Docker containerization strategy for the rust-sec-fetcher system, focusing on the simd-r-drive-ws-server deployment and the Docker Compose infrastructure used for integration testing. The Dockerfile builds a lightweight container that runs the WebSocket server, while Docker Compose orchestrates multi-container test environments with MySQL and the WebSocket server.
For information about the CI/CD workflows that use these Docker configurations, see CI/CD Pipeline. For details about the simd-r-drive WebSocket server functionality, see Database & Storage Integration.
Docker Image Architecture
The system uses a single-stage Dockerfile to build and deploy the simd-r-drive-ws-server binary in a minimal Rust container. This approach prioritizes simplicity and reproducibility for CI/CD environments.
graph TB
BaseImage["rust:1.87-slim\nBase Image"]
CargoInstall["cargo install\nsimd-r-drive-ws-server@0.10.0-alpha"]
EnvSetup["Environment Setup\nSERVER_ARGS variable\ndefault: 'data.bin --host 127.0.0.1 --port 8080'"]
Expose["EXPOSE 8080\nWebSocket port"]
Entrypoint["ENTRYPOINT\n/bin/sh -c 'simd-r-drive-ws-server ${SERVER_ARGS}'"]
BaseImage --> CargoInstall
CargoInstall --> EnvSetup
EnvSetup --> Expose
Expose --> Entrypoint
Dockerfile Structure
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34
Build Configuration
The Dockerfile uses build arguments and environment variables to provide flexible configuration:
| Configuration | Type | Default Value | Purpose |
|---|---|---|---|
RUST_VERSION | Build ARG | 1.87 | Rust toolchain version for compilation |
SERVER_ARGS | Build ARG / ENV | data.bin --host 127.0.0.1 --port 8080 | Server command-line arguments |
Build-time customization:
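An illustrative example of baking a different SERVER_ARGS default into the image at build time, assuming the command is run from python/narrative_stack and the image tag is chosen locally:

```bash
docker build \
  -f Dockerfile.simd-r-drive-ci-server \
  --build-arg SERVER_ARGS="data.bin --host 0.0.0.0 --port 8080" \
  -t simd-r-drive-ci-server .
```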
Runtime customization:
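And the runtime counterpart, overriding the baked-in default through the environment (image tag again assumed):

```bash
docker run --rm \
  -e SERVER_ARGS="data.bin --host 0.0.0.0 --port 8080" \
  simd-r-drive-ci-server
```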
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-33
Image Composition
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:10-26
Docker Compose Test Infrastructure
The integration testing environment uses Docker Compose to orchestrate multiple services with isolated networking and lifecycle management.
graph TB
subgraph "Docker Compose Project: us_gaap_it"
subgraph "db_test Service"
MySQLContainer["Container: us_gaap_test_db\nImage: mysql\nPort: 3306\nRoot password: onlylocal"]
Database["Database: us_gaap_test\nSchema: us_gaap_schema_2025.sql"]
end
subgraph "simd_r_drive_ws_server_test Service"
WSServer["Container: simd-r-drive-ws-server\nPort: 8080\nData file: data.bin"]
end
subgraph "Test Runner (Host)"
TestScript["us_gaap_store_integration_test.sh\npytest integration tests"]
end
end
MySQLContainer --> Database
TestScript -->|docker exec mysql commands| MySQLContainer
TestScript -->|pytest via network| WSServer
TestScript -->|pytest via network| Database
Service Architecture
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
sequenceDiagram
participant Script as us_gaap_store_integration_test.sh
participant Compose as docker compose
participant MySQL as us_gaap_test_db
participant WSServer as simd-r-drive-ws-server
participant Pytest as pytest
Script->>Script: Set PROJECT_NAME="us_gaap_it"
Script->>Script: Register cleanup trap
Script->>Compose: up -d (profile: test)
Compose->>MySQL: Start container
Compose->>WSServer: Start container
loop Wait for MySQL
Script->>MySQL: mysqladmin ping
MySQL-->>Script: Connection status
end
Script->>MySQL: CREATE DATABASE us_gaap_test
Script->>MySQL: Load us_gaap_schema_2025.sql
Script->>Pytest: Run tests/integration/test_us_gaap_store.py
Pytest->>MySQL: Query database
Pytest->>WSServer: WebSocket operations
Pytest-->>Script: Test results
Script->>Compose: down --volumes --remove-orphans
Compose->>MySQL: Stop and remove
Compose->>WSServer: Stop and remove
Test Execution Flow
The integration test script implements a robust lifecycle with automatic cleanup:
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:7-39
Docker Compose Configuration Pattern
The test script uses the following Docker Compose invocation pattern:
| Component | Value | Purpose |
|---|---|---|
| Project Name | us_gaap_it | Isolates test containers from other Docker resources |
| Profile | test | Selects services tagged with profiles: [test] |
| Cleanup Strategy | --volumes --remove-orphans | Ensures complete teardown of test infrastructure |
Key commands:
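A sketch of the two compose invocations, using the project name and profile listed above:

```bash
# Start only the services tagged with the "test" profile, detached
docker compose -p us_gaap_it --profile test up -d

# Tear everything down, including named volumes and orphaned containers
docker compose -p us_gaap_it --profile test down --volumes --remove-orphans
```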
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-18
graph TB
Start["Container Start\nImage: mysql\nRoot password: onlylocal"]
WaitReady["Health Check Loop\nmysqladmin ping -h127.0.0.1"]
CreateDB["docker exec\nCREATE DATABASE IF NOT EXISTS us_gaap_test"]
LoadSchema["docker exec\nmysql < us_gaap_schema_2025.sql"]
Ready["Database Ready\nSchema applied"]
Start --> WaitReady
WaitReady -->|Ping successful| CreateDB
WaitReady -->|Still initializing| WaitReady
CreateDB --> LoadSchema
LoadSchema --> Ready
Database Container Setup
The MySQL container requires specific initialization steps to prepare the test database:
Initialization Sequence
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35
MySQL Container Configuration
The script executes the following commands against the us_gaap_test_db container:
- Health check: Polls MySQL readiness using mysqladmin ping
- Database creation: Creates the test database if it does not already exist
- Schema loading: Applies the SQL schema from the repository
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35
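A condensed sketch of that sequence, using the container name, credentials, and schema path listed above (the script's exact flags may differ):

```bash
# Wait until MySQL answers before touching the database
until docker exec us_gaap_test_db mysqladmin ping -h127.0.0.1 --silent; do
  sleep 2
done

# Create the test database, then load the schema through the mysql CLI
docker exec us_gaap_test_db \
  mysql -uroot -ponlylocal -e "CREATE DATABASE IF NOT EXISTS us_gaap_test"
docker exec -i us_gaap_test_db \
  mysql -uroot -ponlylocal us_gaap_test < tests/integration/assets/us_gaap_schema_2025.sql
```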
Environment Variables and Configuration
The Docker deployment uses environment variables for runtime configuration of both the WebSocket server and test infrastructure.
simd-r-drive-ws-server Configuration
| Variable | Default | Configurable At | Description |
|---|---|---|---|
SERVER_ARGS | data.bin --host 127.0.0.1 --port 8080 | Build & Runtime | Complete command-line arguments for the server |
The SERVER_ARGS environment variable controls:
- Data file: Path to the persistent storage file (e.g., data.bin)
- Host binding: IP address for the WebSocket server (127.0.0.1 for localhost, 0.0.0.0 for all interfaces)
- Port: TCP port for the WebSocket endpoint (default 8080)
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-32
Test Environment Variables
The integration test script sets the following environment variables:
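In shell terms this amounts to a single assignment (a sketch; the script may set it inline on the pytest command instead):

```bash
# Make the src/ layout importable without installing the package system-wide
export PYTHONPATH=src
```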
This ensures Python can locate the narrative_stack module when running tests.
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37
Deployment Considerations
Port Exposure and Network Configuration
The Dockerfile exposes port 8080 for WebSocket connections via the EXPOSE 8080 directive.
When deploying, map this port to the host system with -p 8080:8080 (or another external port).
For production deployments, consider:
- Using a different external port (e.g., -p 9000:8080)
- Binding to all interfaces with --host 0.0.0.0 in SERVER_ARGS
- Placing the container behind a reverse proxy (nginx, traefik)
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:25
Data Persistence
The server expects a data file specified in SERVER_ARGS (default: data.bin). For persistent storage across container restarts:
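One way to keep the data file on the host is a bind mount paired with a matching SERVER_ARGS override (paths and image tag are illustrative):

```bash
# Persist data.bin on the host so restarts reuse the same store
docker run --rm -p 8080:8080 \
  -v "$(pwd)/simd-data:/data" \
  -e SERVER_ARGS="/data/data.bin --host 0.0.0.0 --port 8080" \
  simd-r-drive-ci-server
```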
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:22-23
graph TB
Setup["Setup Phase\ndocker compose up -d"]
Trap["Register Trap\ntrap 'cleanup' EXIT"]
Execute["Test Execution\npytest integration tests"]
Cleanup["Cleanup Phase\ndocker compose down\n--volumes --remove-orphans"]
Setup --> Trap
Trap --> Execute
Execute -->|Success or Failure| Cleanup
Container Lifecycle Management
The integration test script demonstrates proper container lifecycle management, as shown in the diagram above.
The trap command ensures cleanup runs even if tests fail or the script is interrupted.
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19
Build and Deployment Commands
The image is built, run, and exercised by the integration tests with a handful of commands.
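A combined sketch of these steps, assuming they are issued from python/narrative_stack with a locally chosen image tag:

```bash
# Build the Docker image
docker build -f Dockerfile.simd-r-drive-ci-server -t simd-r-drive-ci-server .

# Run the container
docker run --rm -p 8080:8080 simd-r-drive-ci-server

# Run integration tests (orchestrates the compose services and pytest)
bash us_gaap_store_integration_test.sh
```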
Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-3 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Integration with CI/CD
The Docker configuration integrates with GitHub Actions for automated testing. The CI pipeline:
- Builds the simd-r-drive-ci-server image
- Spins up Docker Compose services with the test profile
- Initializes the MySQL database with the test schema
- Runs pytest integration tests
- Tears down all containers and volumes
This ensures reproducible test environments across local development and CI systems.
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-2 python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-8
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Dependencies & Technology Stack
Relevant source files
This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system. For information about the configuration system and credential management, see Configuration System. For details on how these dependencies are used within specific components, see Rust sec-fetcher Application and Python narrative_stack System.
Overview
The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.
Sources: Cargo.toml:1-45 High-Level System Architecture diagrams
Rust Technology Stack
Core Direct Dependencies
The Rust application declares 29 direct dependencies in its manifest, each serving specific architectural roles:
Dependency Categories and Usage:
graph TB
subgraph "Async Runtime & Concurrency"
tokio["tokio 1.43.0\nFull async runtime"]
rayon["rayon 1.10.0\nData parallelism"]
dashmap["dashmap 6.1.0\nConcurrent hashmap"]
end
subgraph "HTTP & Network"
reqwest["reqwest 0.12.15\nHTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nDrive middleware"]
end
subgraph "Data Processing"
polars["polars 0.46.0\nDataFrame operations"]
csv_crate["csv 1.3.1\nCSV parsing"]
serde["serde 1.0.218\nSerialization"]
serde_json["serde_json 1.0.140\nJSON support"]
quick_xml["quick-xml 0.37.2\nXML parsing"]
end
subgraph "Storage & Caching"
simd_r_drive["simd-r-drive 0.3.0-alpha.1\nKey-value store"]
simd_r_drive_ext["simd-r-drive-extensions\n0.4.0-alpha.6"]
end
subgraph "Configuration & Validation"
config_crate["config 0.15.9\nConfig management"]
keyring["keyring 3.6.2\nCredential storage"]
email_address["email_address 0.2.9\nEmail validation"]
rust_decimal["rust_decimal 1.36.0\nDecimal numbers"]
chrono["chrono 0.4.40\nDate/time handling"]
end
subgraph "Development Tools"
mockito["mockito 1.7.0\nHTTP mocking"]
tempfile["tempfile 3.18.0\nTemp file creation"]
env_logger["env_logger 0.11.7\nLogging"]
end
| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling, timers |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing |
| Concurrency | rayon, dashmap, crossbeam | Parallel processing, concurrent data structures |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage |
| Configuration | config, keyring | TOML config loading, secure credential management |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files |
Sources: Cargo.toml:8-44
HTTP and Network Layer
The HTTP stack consists of multiple layers providing comprehensive request handling:
TLS Configuration: The system supports dual TLS backends for platform compatibility:
graph TB
SecClient["SecClient\n(src/sec_client.rs)"]
reqwest["reqwest 0.12.15\nHigh-level HTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nMiddleware framework"]
hyper["hyper 1.6.0\nHTTP/1.1 & HTTP/2"]
h2["h2 0.4.8\nHTTP/2 protocol"]
TLS_Layer["TLS Layer"]
rustls["rustls 0.23.21\nPure Rust TLS"]
native_tls["native-tls 0.2.14\nPlatform TLS"]
openssl["openssl 0.10.71\nOpenSSL bindings"]
hyper_util["hyper-util 0.1.10\nUtilities"]
hyper_rustls["hyper-rustls 0.27.5\nRustls connector"]
hyper_tls["hyper-tls 0.6.0\nNative TLS connector"]
SecClient --> reqwest
reqwest --> reqwest_drive
reqwest --> hyper
reqwest --> TLS_Layer
hyper --> h2
hyper --> hyper_util
TLS_Layer --> hyper_rustls
TLS_Layer --> hyper_tls
hyper_rustls --> rustls
hyper_tls --> native_tls
native_tls --> openssl
- rustls (0.23.21): Pure Rust implementation, no external dependencies, used by default
- native-tls (0.2.14): Platform-native TLS (OpenSSL on Linux, Security.framework on macOS, SChannel on Windows)
Sources: Cargo.lock:1280-1351 Cargo.toml:28
Data Processing Stack
Polars Features Enabled:
- json: JSON file reading/writing
- lazy: Lazy evaluation engine for query optimization
- pivot: Pivot table operations
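For illustration, the same feature set could be declared with cargo add (a sketch; the repository pins these directly in Cargo.toml):

```bash
# Add polars with the JSON, lazy-evaluation, and pivot features enabled
cargo add polars@0.46.0 --features json,lazy,pivot
```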
Key Numeric Types:
- rust_decimal::Decimal: Exact decimal arithmetic for financial calculations in US GAAP processing
- chrono::NaiveDate, chrono::DateTime: Timestamp handling for filing dates and report periods
Sources: Cargo.toml:24 Cargo.lock:2170-2394
Concurrency and Parallelism
Concurrency Model:
- tokio : Handles I/O-bound operations (HTTP requests, file I/O)
- rayon : Handles CPU-bound operations (DataFrame transformations, parallel iterations)
- dashmap : Thread-safe caching without explicit locking
Key Usage Locations:
- tokio: src/main.rs - main async runtime
- rayon: Cargo.toml:14 - dashmap integration, parallel processing
- once_cell: src/caches.rs:7-8 - OnceLock<Arc<DataStore>> for global cache initialization
Sources: Cargo.toml:14-40 src/caches.rs:1-66 Cargo.lock:612-762
Storage and Caching System
Cache Architecture:
- Two separate DataStore instances for different caching layers
- Thread-safe access via Arc<DataStore> wrapped in OnceLock statics
- Initialized once at application startup via Caches::init()
- 1-week TTL for HTTP responses, indefinite for preprocessor results
Storage Format:
- Binary .bin files for efficient serialization
- Uses bincode 1.3.3 for encoding cached data
- Persistent across application restarts
Sources: src/caches.rs:7-65 Cargo.toml:36-37 Cargo.lock:246-252
Configuration and Credentials
Configuration Sources (Priority Order):
- Environment variables
- config.toml in the current directory
- ~/.config/sec-fetcher/config.toml
- Default values
Keyring Features:
- apple-native: Direct Security.framework integration on macOS
- windows-native: Windows Credential Manager via Win32 APIs
- sync-secret-service: D-Bus Secret Service for Linux
Sources: Cargo.toml:12-20 Cargo.lock:1641-1653
Validation and Type Safety
| Crate | Version | Purpose | Key Types |
|---|---|---|---|
email_address | 0.2.9 | SEC email validation | EmailAddress |
rust_decimal | 1.36.0 | Financial calculations | Decimal |
chrono | 0.4.40 | Date/time operations | NaiveDate, DateTime<Utc> |
chrono-tz | 0.10.1 | Timezone handling | Tz::America__New_York |
strum / strum_macros | 0.27.1 | Enum utilities | EnumString, Display |
Decimal Precision:
- 28-29 significant digits
- No floating-point rounding errors
- Critical for US GAAP fundamental values
Date Handling:
- All SEC filing dates parsed to chrono::NaiveDate
- UTC timestamps for API requests
- Eastern Time conversion for market hours
Sources: Cargo.toml:16-39 Cargo.lock:403-868
Python Technology Stack
Machine Learning Framework
Training Configuration:
- Patience : 20 epochs for early stopping
- Checkpoint Strategy : Save top 1 model by validation loss
- Learning Rate : Configurable via checkpoint override
- Gradient Clipping : Prevents exploding gradients
graph TB
subgraph "Core Scientific Libraries"
numpy["NumPy\nArray operations"]
pandas["pandas\nDataFrame manipulation"]
sklearn["scikit-learn\nPCA, RobustScaler"]
end
subgraph "Preprocessing Pipeline"
PCA["PCA\nDimensionality reduction\nVariance threshold: 0.95"]
RobustScaler["RobustScaler\nPer concept/unit pair\nOutlier-robust normalization"]
sklearn --> PCA
sklearn --> RobustScaler
end
subgraph "Data Structures"
triplets["Concept/Unit/Value Triplets"]
embeddings["Semantic Embeddings\nPCA-compressed"]
RobustScaler --> triplets
PCA --> embeddings
triplets --> embeddings
end
subgraph "Data Ingestion"
csv_files["CSV Files\nfrom Rust sec-fetcher"]
UsGaapRowRecord["UsGaapRowRecord\nParsed row structure"]
csv_files --> UsGaapRowRecord
UsGaapRowRecord --> numpy
end
Sources: High-Level System Architecture (Diagram 3), narrative_stack training pipeline
Data Processing and Scientific Computing
scikit-learn Components:
- PCA : Reduces embedding dimensionality while retaining 95% variance
- RobustScaler : Normalizes using median and IQR, resistant to outliers
- Both fitted per unique concept/unit pair for specialized normalization
NumPy Usage:
- Validation via np.isclose() checks
- Embedding matrix storage
graph TB
subgraph "Database Layer"
mysql_db[("MySQL Database\nus_gaap_test")]
DbUsGaap["DbUsGaap\nDatabase interface"]
asyncio["asyncio\nAsync queries"]
DbUsGaap --> mysql_db
DbUsGaap --> asyncio
end
subgraph "WebSocket Storage"
DataStoreWsClient["DataStoreWsClient\nWebSocket client"]
simd_r_drive_server["simd-r-drive\nWebSocket server"]
DataStoreWsClient --> simd_r_drive_server
end
subgraph "Unified Access"
UsGaapStore["UsGaapStore\nFacade pattern"]
UsGaapStore --> DbUsGaap
UsGaapStore --> DataStoreWsClient
end
subgraph "Data Models"
triplet_storage["Triplet Storage:\n- concept\n- unit\n- scaled_value\n- scaler\n- embedding"]
UsGaapStore --> triplet_storage
end
Sources: High-Level System Architecture (Diagram 3), us_gaap_store preprocessing pipeline
Database and Storage Integration
Storage Strategy:
- MySQL : Relational queries for ingested CSV data
- WebSocket Store : High-performance embedding matrix storage
- Facade Pattern : UsGaapStore abstracts the storage backend choice
Python WebSocket Client:
- Package: simd-r-drive-client (Python equivalent of the Rust simd-r-drive crate)
- Async communication with the WebSocket server
- Shared embedding matrices between Rust and Python
Sources: High-Level System Architecture (Diagram 3), Database & Storage Integration section
Visualization and Debugging
| Library | Purpose | Key Outputs |
|---|---|---|
matplotlib | Static plots | PCA variance plots, scaler distribution |
TensorBoard | Training metrics | Loss curves, learning rate schedules |
itertools | Data iteration | Batching, grouping concept pairs |
Visualization Types:
- PCA Explanation Plots : Cumulative variance by component
- Semantic Embedding Scatterplots : t-SNE/UMAP projections of concept space
- Variance Analysis : Per-concept/unit value distributions
- Training Curves : TensorBoard integration via PyTorch Lightning
graph TB
subgraph "Server Implementation"
server["simd-r-drive-ws-server\nWebSocket server"]
DataStore_server["DataStore\nBackend storage"]
server --> DataStore_server
end
subgraph "Rust Clients"
http_cache_rust["HTTP Cache\n(Rust)"]
preprocessor_cache_rust["Preprocessor Cache\n(Rust)"]
http_cache_rust --> DataStore_server
preprocessor_cache_rust --> DataStore_server
end
subgraph "Python Clients"
DataStoreWsClient_py["DataStoreWsClient\n(Python)"]
embedding_matrix["Embedding Matrix\nStorage"]
DataStoreWsClient_py --> DataStore_server
embedding_matrix --> DataStoreWsClient_py
end
subgraph "Docker Deployment"
dockerfile["Dockerfile\nContainer config"]
rust_image["rust:1.87\nBase image"]
dockerfile --> rust_image
dockerfile --> server
end
Sources: High-Level System Architecture (Diagram 3), Validation & Analysis section
Shared Infrastructure
simd-r-drive WebSocket Server
Key Features:
- Cross-language data sharing via WebSocket protocol
- Binary-efficient key-value storage
- Persistent storage across restarts
- Docker containerization for CI/CD environments
Usage Patterns:
- Rust writes HTTP cache entries
- Rust writes preprocessor transformations
- Python reads/writes embedding matrices
- Python reads ingested US GAAP triplets
Sources: Cargo.toml:36-37 High-Level System Architecture (Diagram 1, Diagram 5)
File System Interchange
Directory Structure:
- fund-holdings/A-Z/ : NPORT filing holdings, alphabetized by ticker
- us-gaap/ : Fundamental concepts CSV outputs
- Row-based CSV format with standardized columns
Data Flow:
- Rust fetches from SEC API
- Rust transforms via
distill_us_gaap_fundamental_concepts - Rust writes CSV files
- Python walks directories
- Python parses into
UsGaapRowRecord - Python ingests to MySQL and WebSocket store
graph TB
subgraph "CI Pipeline"
workflow[".github/workflows/\nus-gaap-store-integration-test"]
checkout["Checkout code"]
rust_setup["Setup Rust 1.87"]
docker_compose["Docker Compose\nsimd-r-drive-ci-server"]
mysql_container["MySQL test container"]
workflow --> checkout
checkout --> rust_setup
checkout --> docker_compose
docker_compose --> mysql_container
end
subgraph "Test Execution"
cargo_test["cargo test\nRust unit tests"]
pytest["pytest\nPython integration tests"]
rust_setup --> cargo_test
docker_compose --> pytest
end
subgraph "Testing Tools"
mockito_tests["mockito 1.7.0\nHTTP mock server"]
tempfile_tests["tempfile 3.18.0\nTemp directories"]
cargo_test --> mockito_tests
cargo_test --> tempfile_tests
end
Sources: High-Level System Architecture (Diagram 1), main.rs CSV output logic
Development and CI/CD Stack
GitHub Actions Workflow
Docker Configuration:
- Image : rust:1.87 for reproducible builds
- Services : simd-r-drive-ws-server, MySQL 8.0
- Environment Variables : Database credentials, WebSocket ports
- Volume Mounts : Test data directories
Testing Strategy:
- Rust Unit Tests : mockito for HTTP mocking, tempfile for isolated test fixtures
- Python Integration Tests : pytest with MySQL fixtures, WebSocket client validation
- End-to-End : Full pipeline from CSV ingestion through ML training
Sources: Cargo.toml:42-44 High-Level System Architecture (Diagram 5), CI/CD Pipeline section
Testing Dependencies
| Language | Framework | Mocking | Assertions |
|---|---|---|---|
| Rust | Built-in #[test] | mockito 1.7.0 | Standard assertions |
| Python | pytest | unittest.mock | np.isclose() |
Rust Test Modules:
config_manager_tests: Configuration loading and mergingsec_client_tests: HTTP client behaviordistill_tests: US GAAP concept transformation
Python Test Modules:
- Integration tests: End-to-end pipeline validation
- Unit tests: Scaler, PCA, embedding generation
Sources: Cargo.lock:1802-1824 Testing Strategy documentation
Version Compatibility Matrix
Critical Version Constraints
| Component | Version | Constraints | Reason |
|---|---|---|---|
tokio | 1.43.0 | >= 1.0 | Stable async runtime API |
reqwest | 0.12.15 | 0.12.x | Compatible with reqwest-drive middleware |
polars | 0.46.0 | 0.46.x | Breaking changes between minor versions |
simd-r-drive | 0.3.0-alpha.1 | Exact match | Alpha API, unstable |
serde | 1.0.218 | >= 1.0.100 | Derive macro stability |
hyper | 1.6.0 | 1.x | HTTP/2 support, reqwest dependency |
Rust Toolchain
Minimum Rust Version: 1.70+ (for async trait bounds and GAT support)
Feature Requirements:
- async-trait for trait async methods
- #[tokio::main] macro support
- Generic Associated Types (GATs) for polars iterators
Sources: Cargo.toml:4 Cargo.lock:1-3
Python Version Requirements
Estimated Python Version: 3.8+ (for PyTorch Lightning compatibility)
Key Constraints:
graph LR
reqwest["reqwest"] --> hyper
hyper --> h2["h2\nHTTP/2"]
hyper --> http["http\nTypes"]
reqwest --> rustls
reqwest --> native_tls
rustls --> ring["ring\nCryptography"]
native_tls --> openssl["openssl\nSystem TLS"]
h2 --> tokio
hyper --> tokio
- PyTorch: Requires Python 3.8 or later
- PyTorch Lightning: Requires PyTorch >= 1.9
- asyncio: Stable async/await syntax (Python 3.7+)
Sources: High-Level System Architecture (Python technology sections)
Transitive Dependency Highlights
HTTP/TLS Stack Depth
Notable Transitive Dependencies:
- h2 0.4.8: HTTP/2 protocol implementation (37 dependencies)
- rustls 0.23.21: TLS 1.2/1.3 without OpenSSL (14 dependencies)
- tokio-util 0.7.13: Codec, framing utilities (8 dependencies)
Sources: Cargo.lock:1141-1351
Polars Ecosystem Depth
The polars dependency pulls in 23 related crates:
- polars-core, polars-lazy, polars-io, polars-ops
- polars-arrow: Apache Arrow array implementation
- polars-parquet: Parquet file format support
- polars-sql: SQL query interface
- polars-time: Temporal operations
- polars-utils: Shared utilities
Total Transitive: ~180 crates in the full dependency tree
Sources: Cargo.lock:2170-2394
Security and Cryptography
Platform-Specific Cryptography:
- macOS: Security.framework via security-framework-sys
- Linux: D-Bus + libdbus-sys
- Windows: Win32 Credential Manager via windows-sys
Sources: Cargo.lock:764-1653
Total Rust Dependencies: ~250 crates (including transitive)
Key Architectural Decisions:
- Dual TLS support for platform flexibility
- Polars instead of native DataFrame for performance
- simd-r-drive for cross-language data sharing
- Alpha versions for simd-r-drive (active development)
- Platform-specific credential storage via keyring
Sources: Cargo.toml:1-45 Cargo.lock:1 High-Level System Architecture (all diagrams)