Overview


This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see Getting Started. For detailed implementation documentation of individual components, see Rust sec-fetcher Application and Python narrative_stack System.

Sources: Cargo.toml, README.md, src/lib.rs

System Purpose

The rust-sec-fetcher repository implements a dual-language financial data processing system that:

  1. Fetches company financial data from the SEC EDGAR API
  2. Transforms raw SEC filings into normalized US GAAP fundamental concepts
  3. Stores structured data as CSV files organized by ticker symbol
  4. Trains machine learning models to understand financial concept relationships

The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing 57+ variations of concepts like revenue and consolidating them into a standardized taxonomy of 64 FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.

Sources: Cargo.toml:1-6, src/main.rs:173-240

Dual-Language Architecture

The repository employs a dual-language design that leverages the strengths of both Rust and Python:

| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, CSV generation | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |

Rust Layer (sec-fetcher):

  • Implements SecClient with sophisticated throttling and caching policies
  • Fetches company tickers, CIK submissions, NPORT filings, and US GAAP fundamentals
  • Transforms raw financial data via distill_us_gaap_fundamental_concepts
  • Outputs structured CSV files organized by ticker symbol

Python Layer (narrative_stack):

  • Ingests CSV files generated by Rust layer
  • Generates semantic embeddings for concept/unit pairs
  • Applies dimensionality reduction via PCA
  • Trains Stage1Autoencoder models using PyTorch Lightning

Sources: Cargo.toml:1-40, src/main.rs:1-16

graph TB
    SEC["SEC EDGAR API\ncompany_tickers.json\nCIK submissions\ncompanyfacts dataset"]
SecClient["SecClient\n(network layer)"]
NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_us_gaap_fundamentals\nfetch_nport_filing_by_ticker_symbol\nfetch_investment_company_series_and_class_dataset"]
Distill["distill_us_gaap_fundamental_concepts\n(transformers module)\nMaps 57+ variations → 64 concepts"]
Models["Data Models\nTicker, CikSubmission\nNportInvestment\nFundamentalConcept enum"]
CSV["File System\ndata/fund-holdings/A-Z/\ndata/us-gaap/"]
Ingest["Python Ingestion\nus_gaap_store.ingest_us_gaap_csvs\nWalks CSV directories"]
Preprocess["Preprocessing\nPCA dimensionality reduction\nRobustScaler normalization\nSemantic embeddings"]
DataStore["simd-r-drive\nWebSocket Key-Value Store\nEmbedding matrix storage"]
Training["Stage1Autoencoder\nPyTorch Lightning\nTensorBoard logging"]
SEC -->|HTTP GET| SecClient
 
   SecClient --> NetworkFuncs
 
   NetworkFuncs --> Distill
 
   Distill --> Models
 
   Models --> CSV
    
 
   CSV -.->|CSV files| Ingest
 
   Ingest --> Preprocess
 
   Preprocess --> DataStore
 
   DataStore --> Training

High-Level Data Flow

Data Flow Summary:

  1. Fetch : SecClient retrieves data from SEC EDGAR API endpoints
  2. Transform : Raw financial data passes through distill_us_gaap_fundamental_concepts to normalize concept names
  3. Store : Structured data is written to CSV files, organized by ticker symbol first letter
  4. Ingest : Python scripts walk CSV directories and parse records
  5. Preprocess : Generate embeddings, apply PCA, normalize values
  6. Train : ML models learn financial concept relationships

Sources: src/main.rs:1-16, src/main.rs:173-240, src/lib.rs:1-11

graph TB
    main["main.rs\nApplication entry point"]
config["config module\nConfigManager\nAppConfig"]
network["network module\nSecClient\nfetch_* functions\nThrottlePolicy\nCachePolicy"]
transformers["transformers module\ndistill_us_gaap_fundamental_concepts"]
models["models module\nTicker\nCik\nCikSubmission\nNportInvestment\nAccessionNumber"]
enums["enums module\nFundamentalConcept\nCacheNamespacePrefix\nTickerOrigin\nUrl"]
caches["caches module (private)\nCaches struct\nOnceLock statics\nHTTP cache\nPreprocessor cache"]
utils["utils module\ninvert_multivalue_indexmap\nis_development_mode\nis_interactive_mode\nVecExtensions"]
fs["fs module\nFile system utilities"]
parsers["parsers module\nData parsing functions"]
main --> config
 
   main --> network
 
   main --> utils
    
 
   network --> config
 
   network --> caches
 
   network --> models
 
   network --> transformers
 
   network --> parsers
    
 
   transformers --> enums
 
   transformers --> models
    
 
   models --> enums
    
 
   caches --> config

Core Module Structure

Module Descriptions:

| Module | Primary Purpose | Key Exports |
|---|---|---|
| config | Configuration management and credential loading | ConfigManager, AppConfig |
| network | HTTP client, data fetching, throttling, caching | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals, fetch_nport_filing_by_ticker_symbol |
| transformers | US GAAP concept normalization (importance: 8.37) | distill_us_gaap_fundamental_concepts |
| models | Data structures for SEC entities | Ticker, Cik, CikSubmission, NportInvestment, AccessionNumber |
| enums | Type-safe enumerations | FundamentalConcept (64 variants), CacheNamespacePrefix, TickerOrigin, Url |
| caches | Internal caching infrastructure | Caches (private module) |
| utils | Utility functions | invert_multivalue_indexmap, VecExtensions, is_development_mode |

Sources: src/lib.rs:1-11, src/main.rs:1-16, src/utils.rs:1-12

Rust Component Architecture

The Rust layer is organized around SecClient, which provides a high-level HTTP interface with integrated throttling and caching. Network functions (fetch_*) use this client to retrieve data from SEC EDGAR endpoints. The most critical component is distill_us_gaap_fundamental_concepts, which normalizes financial concepts using four mapping patterns:

  • One-to-One : Direct mappings (e.g., Assets → FundamentalConcept::Assets)
  • Hierarchical : Specific concepts also map to parent categories (e.g., CurrentAssets maps to both CurrentAssets and Assets)
  • Synonym Consolidation : Multiple terms map to single concept (e.g., 6 cost variations → CostOfRevenue)
  • Industry-Specific : 57+ revenue variations map to Revenues
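
To make these patterns concrete, here is a minimal, hypothetical sketch of how such a mapping could be expressed; the actual distill_us_gaap_fundamental_concepts implementation in the transformers module is more extensive, and the raw tag names shown here are illustrative assumptions rather than the real mapping table:

```rust
use crate::enums::FundamentalConcept;

// Hypothetical sketch: map one raw US GAAP tag to zero or more normalized
// FundamentalConcept variants, illustrating the four mapping patterns above.
fn map_concept(raw_tag: &str) -> Vec<FundamentalConcept> {
    match raw_tag {
        // One-to-One: direct mapping
        "Assets" => vec![FundamentalConcept::Assets],
        // Hierarchical: a specific concept also maps to its parent category
        "AssetsCurrent" => vec![
            FundamentalConcept::CurrentAssets,
            FundamentalConcept::Assets,
        ],
        // Synonym consolidation: multiple cost variations collapse to one concept
        "CostOfGoodsSold" | "CostOfServices" => vec![FundamentalConcept::CostOfRevenue],
        // Industry-specific: the many revenue variations all normalize to Revenues
        "SalesRevenueNet" | "RevenueFromContractWithCustomer" => {
            vec![FundamentalConcept::Revenues]
        }
        _ => vec![],
    }
}
```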

Sources: src/main.rs:1-16, Cargo.toml:8-40

Python Component Architecture

The Python narrative_stack system consumes CSV files produced by the Rust layer and trains machine learning models:

Key Components:

| Component | Module/Class | Purpose |
|---|---|---|
| Ingestion | us_gaap_store.ingest_us_gaap_csvs | Walks CSV directories, parses UsGaapRowRecord entries |
| Preprocessing | PCA, RobustScaler | Generates semantic embeddings, normalizes values, reduces dimensionality |
| Storage | DbUsGaap, DataStoreWsClient, UsGaapStore | Database interface, WebSocket client, unified data access facade |
| Training | Stage1Autoencoder | PyTorch Lightning autoencoder for learning concept embeddings |
| Validation | np.isclose checks | Scaler verification, embedding validation |

The preprocessing pipeline creates concept/unit/value triplets with associated embeddings and scalers, storing them in both MySQL and the simd-r-drive WebSocket data store. The Stage1Autoencoder learns latent representations by reconstructing embedding + scaled_value inputs.

Sources: (Python code not included in provided files, based on architecture diagrams)

Key Technologies

Rust Dependencies:

| Crate | Purpose |
|---|---|
| tokio | Async runtime for I/O operations |
| reqwest | HTTP client library |
| polars | DataFrame operations and CSV handling |
| rayon | CPU parallelism |
| simd-r-drive | WebSocket data store integration |
| serde, serde_json | Serialization/deserialization |
| chrono | Date/time handling |

Python Dependencies:

  • PyTorch & PyTorch Lightning (ML training)
  • pandas, numpy (data processing)
  • scikit-learn (PCA, RobustScaler)
  • matplotlib, TensorBoard (visualization)

Sources: Cargo.toml:8-40

CSV Output Organization

The Rust application writes CSV files to organized directories:

data/
├── fund-holdings/
│   ├── A/
│   │   ├── AAPL.csv
│   │   ├── AMZN.csv
│   │   └── ...
│   ├── B/
│   ├── C/
│   └── ...
└── us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── ...

Files are organized by the first letter of the ticker symbol (uppercased) to facilitate parallel processing and improve file system performance with large datasets.

Sources: src/main.rs:83-102, src/main.rs:206-221

Next Steps

Sources: src/main.rs:1-240, Cargo.toml:1-45, src/lib.rs:1-11



Getting Started


This page guides you through installing, configuring, and running the rust-sec-fetcher application. It covers building the Rust binary, setting up required credentials, and executing your first data fetch. For detailed configuration options, see Configuration System. For comprehensive examples, see Running Examples.

The rust-sec-fetcher is the Rust component of a dual-language system. It fetches and transforms SEC financial data into structured CSV files. The companion Python system (narrative_stack) processes these files for machine learning applications.


Prerequisites

Before installation, ensure you have:

| Requirement | Purpose | Notes |
|---|---|---|
| Rust 1.87+ | Compile sec-fetcher | Edition 2021 features required |
| Email Address | SEC EDGAR API access | Required by SEC for API identification |
| 4+ GB Disk Space | Cache and CSV storage | Default location: data/ directory |
| Internet Connection | SEC API access | Throttled to 1 request/second |

Optional Components:

Sources: Cargo.toml:1-45


Installation

Clone Repository

Build from Source
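
Assuming the standard Cargo workflow shown in the installation diagram below (the exact commands are not included in the extracted page):

```sh
cargo build            # debug build
cargo build --release  # optimized release build
```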

The compiled binary will be located at:

  • Debug: target/debug/sec-fetcher
  • Release: target/release/sec-fetcher

Verify Installation
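
The verification command is not shown in the extracted page; based on the installation diagram below, it is presumably the configuration example:

```sh
cargo run --example config
```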

If successful, this will load the configuration and display it in JSON format. If no configuration exists, it will prompt for your email address in interactive mode.

Installation Flow Diagram

graph TB
    Clone["Clone Repository\nrust-sec-fetcher"]
Build["cargo build --release"]
Binary["Binary Created\ntarget/release/sec-fetcher"]
Config["Configuration Setup\nConfigManager::load()"]
Verify["Run Example\ncargo run --example config"]
Clone --> Build
 
   Build --> Binary
 
   Binary --> Config
 
   Config --> Verify
    
    ConfigFile["Configuration File\nsec_fetcher_config.toml"]
Credential["Email Credential\nCredentialManager"]
Config --> ConfigFile
 
   Config --> Credential
    
 
   Verify --> Success["Display AppConfig\nJSON Output"]
Verify --> Error["Missing Email\nPrompt in Interactive Mode"]

Sources: Cargo.toml:1-6 src/config/config_manager.rs:20-23 examples/config.rs:1-17


Basic Configuration

The application uses a TOML configuration file combined with system credential storage for the required email address.

Configuration File Location

The ConfigManager searches for configuration files in this order:

  1. System Config Directory : Platform-specific location returned by ConfigManager::get_suggested_system_path()

    • Linux: ~/.config/sec-fetcher/config.toml
    • macOS: ~/Library/Application Support/sec-fetcher/config.toml
    • Windows: C:\Users\<User>\AppData\Roaming\sec-fetcher\config.toml
  2. Current Directory : sec_fetcher_config.toml (fallback)

Configuration Fields

The AppConfig structure src/config/app_config.rs:15-32 supports the following fields:

| Field | Type | Default | Description |
|---|---|---|---|
| email | Option<String> | None | Required - Your email for SEC API identification |
| max_concurrent | Option<usize> | 1 | Maximum concurrent requests |
| min_delay_ms | Option<u64> | 1000 | Minimum delay between requests (milliseconds) |
| max_retries | Option<usize> | 5 | Maximum retry attempts for failed requests |
| cache_base_dir | Option<PathBuf> | "data" | Base directory for caching and CSV output |

Example Configuration File

Create sec_fetcher_config.toml:
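
A minimal example based on the fields documented in the table above (the values shown here are illustrative):

```toml
email = "you@example.com"
max_concurrent = 1
min_delay_ms = 1000
max_retries = 5
cache_base_dir = "data"
```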

Email Credential Setup

The SEC EDGAR API requires an email address in the User-Agent header. The application manages this through the CredentialManager:

Interactive Mode (when running from terminal):

Non-Interactive Mode (CI/CD, background processes):

  • Email must be pre-configured in sec_fetcher_config.toml
  • Or stored in system credential manager via prior interactive session

Configuration Loading Flow Diagram

graph TB
    Start["ConfigManager::load()"]
PathCheck{"Config Path\nExists?"}
LoadFile["Config::builder()\nadd_source(File)"]
DefaultConfig["AppConfig::default()"]
MergeUser["settings.merge(user_settings)"]
EmailCheck{"Email\nConfigured?"}
InteractiveCheck{"is_interactive_mode()?"}
Prompt["CredentialManager::from_prompt()"]
KeyringGet["credential_manager.get_credential()"]
Error["Error: Could not obtain email"]
InitCaches["Caches::init(config_manager)"]
Complete["ConfigManager Instance"]
Start --> PathCheck
 
   PathCheck -->|Yes| LoadFile
 
   PathCheck -->|No Fallback| LoadFile
 
   LoadFile --> DefaultConfig
 
   DefaultConfig --> MergeUser
    
 
   MergeUser --> EmailCheck
 
   EmailCheck -->|Missing| InteractiveCheck
 
   EmailCheck -->|Present| InitCaches
    
 
   InteractiveCheck -->|Yes| Prompt
 
   InteractiveCheck -->|No| Error
    
 
   Prompt --> KeyringGet
 
   KeyringGet -->|Success| InitCaches
 
   KeyringGet -->|Failure| Error
    
 
   InitCaches --> Complete

Sources: src/config/config_manager.rs:20-86, src/config/app_config.rs:15-54, Cargo.toml:20


Running Your First Data Fetch

Example: Configuration Display

The simplest example displays the loaded configuration:

Code Structure examples/config.rs:1-17:

  1. ConfigManager::load() - Loads configuration from file + credentials
  2. config_manager.get_config() - Retrieves AppConfig reference
  3. config.pretty_print() - Serializes to formatted JSON

Expected Output:

Example: Lookup CIK by Ticker

Fetch the Central Index Key (CIK) for a company ticker symbol:

This example demonstrates:

  • SecClient initialization with throttling
  • fetch_company_tickers() - Downloads SEC company tickers JSON
  • fetch_cik_by_ticker_symbol() - Maps ticker → CIK
  • Caching behavior (subsequent runs use cached data)

Example: Fetch NPORT Filing

Download and parse an NPORT-P investment company filing:

This example shows:

  • Fetching XML filing by accession number
  • Parsing NportInvestment data structures
  • CSV output to data/fund-holdings/{A-Z}/ directories

For detailed walkthrough of all examples, see Running Examples.

Example Execution Flow Diagram

Sources: examples/config.rs:1-17 src/config/config_manager.rs:20-23 Cargo.toml:28-29


Data Output Structure

The application organizes fetched data into a structured directory hierarchy:

data/
├── http_cache/              # HTTP response cache (simd-r-drive)
│   └── sec.gov/
│       └── *.bin            # Cached API responses
│
├── fund-holdings/           # NPORT filing data by ticker
│   ├── A/
│   │   ├── AAPL_holdings.csv
│   │   └── AMZN_holdings.csv
│   ├── B/
│   │   └── MSFT_holdings.csv
│   └── ...                  # A-Z directories
│
└── us-gaap/                 # US GAAP fundamental data
    ├── AAPL_fundamentals.csv
    ├── MSFT_fundamentals.csv
    └── ...

CSV File Formats

US GAAP Fundamentals (us-gaap/*.csv):

  • Ticker symbol
  • Filing date
  • Fiscal period
  • FundamentalConcept (64 normalized concepts)
  • Value
  • Units
  • Accession number

NPORT Holdings (fund-holdings/{A-Z}/*.csv):

  • Fund CIK
  • Investment ticker symbol
  • Investment name
  • Balance (shares)
  • Value (USD)
  • Percentage of portfolio
  • Asset category
  • Issuer category

Data Flow from API to CSV Diagram

Sources: src/config/app_config.rs:31-44, Cargo.toml:24


Cache Behavior

The application implements two-tier caching to minimize redundant API calls:

HTTP Cache

  • Storage : simd-r-drive key-value store (Cargo.toml:36)
  • Location : {cache_base_dir}/http_cache/
  • TTL : 1 week (168 hours)
  • Scope : Raw HTTP responses from SEC API

Preprocessor Cache

  • Storage : In-memory DashMap with persistent backup
  • Scope : Transformed data structures (after distill_us_gaap_fundamental_concepts)
  • Purpose : Skip expensive concept normalization on repeated runs

Cache Initialization : The Caches::init() function src/config/config_manager.rs:98-100 is called automatically during ConfigManager construction.

For detailed caching architecture, see Caching & Storage System.

Sources: Cargo.toml:14-37 src/config/config_manager.rs:98-100


Troubleshooting Common Issues

| Issue | Cause | Solution |
|---|---|---|
| "Could not obtain email credential" | No email configured in non-interactive mode | Add email = "..." to config file or run interactively once |
| "Config path does not exist" | Invalid custom config path | Check path spelling or omit to use defaults |
| "unknown field" in config | Typo in TOML key name | Run cargo run --example config to see valid keys |
| Rate limit errors from SEC | min_delay_ms too low | Increase to 1000+ ms (SEC requires 1 req/sec max) |
| Cache directory permission denied | Insufficient filesystem permissions | Change cache_base_dir to writable location |

Debug Configuration Issues:

The AppConfig::get_valid_keys() function src/config/app_config.rs:62-77 dynamically generates a list of valid configuration fields with their expected types using JSON schema introspection.

Sources: src/config/config_manager.rs:49-77 src/config/app_config.rs:62-77


Next Steps

Now that you have the application configured and running, explore these topics:

For Python ML pipeline setup, see Python narrative_stack System.

Sources: examples/config.rs:1-17 src/config/config_manager.rs:1-121 src/config/app_config.rs:1-159



Configuration System


This document describes the configuration management system in the Rust sec-fetcher application. The configuration system provides a flexible, multi-source approach to managing application settings including SEC API credentials, request throttling parameters, and cache directories.

For information about how the configuration integrates with the caching system, see Caching & Storage System. For details on credential storage mechanisms, see the credential management section below.


Overview

The configuration system consists of three primary components:

| Component | Purpose | File Location |
|---|---|---|
| AppConfig | Data structure holding all configuration fields | src/config/app_config.rs |
| ConfigManager | Loads, merges, and validates configuration from multiple sources | src/config/config_manager.rs |
| CredentialManager | Handles email credential storage and retrieval (keyring or prompt) | Referenced in src/config/config_manager.rs:66-72 |

The system supports configuration from three sources with increasing priority:

  1. Default values - Hard-coded defaults in AppConfig::default()
  2. TOML configuration file - User-specified settings loaded from disk
  3. Interactive prompts - Credential collection when running in interactive mode

Sources: src/config/app_config.rs:15-54 src/config/config_manager.rs:11-120


Configuration Structure

AppConfig Fields

All fields in AppConfig are wrapped in Option<T> to support partial configuration and merging strategies. The #[merge(strategy = overwrite_option)] attribute ensures that non-None values from user configuration always replace default values.

Sources: src/config/app_config.rs:15-54 examples/config.rs:13-16


Configuration Loading Flow

ConfigManager Initialization

Sources: src/config/config_manager.rs:19-86 src/config/config_manager.rs:107-119


Configuration File Format

TOML Structure

The configuration file uses TOML format with strict schema validation. Any unrecognized keys will cause a descriptive error listing all valid keys and their types.

Example configuration file (sec_fetcher_config.toml):

Configuration File Locations

The system searches for configuration files in the following order:

| Priority | Location | Description |
|---|---|---|
| 1 | User-provided path | Passed to ConfigManager::from_config(Some(path)) |
| 2 | System config directory | ~/.config/sec-fetcher/config.toml (Unix), %APPDATA%\sec-fetcher\config.toml (Windows) |
| 3 | Current directory | ./sec_fetcher_config.toml |

Sources: src/config/config_manager.rs:107-119 tests/config_manager_tests.rs:36-57


Credential Management

Email Credential Requirement

The SEC EDGAR API requires a valid email address in the HTTP User-Agent header. The configuration system ensures this credential is available through multiple strategies:

Interactive Mode Detection:

graph TD
    Start["ConfigManager Initialization"]
CheckEmail{"Email in\nAppConfig?"}
CheckInteractive{"is_interactive_mode()?"}
PromptUser["CredentialManager::from_prompt()"]
ReadKeyring["Read from system keyring"]
PromptInput["Prompt user for email"]
SaveKeyring["Save to keyring (optional)"]
SetEmail["Set email in user_settings"]
Error["Return Error:\nCould not obtain email credential"]
MergeConfig["Merge configurations"]
Start --> CheckEmail
 
   CheckEmail -->|Yes| MergeConfig
 
   CheckEmail -->|No| CheckInteractive
 
   CheckInteractive -->|Yes| PromptUser
 
   CheckInteractive -->|No| Error
 
   PromptUser --> ReadKeyring
 
   ReadKeyring -->|Found| SetEmail
 
   ReadKeyring -->|Not found| PromptInput
 
   PromptInput --> SaveKeyring
 
   SaveKeyring --> SetEmail
 
   SetEmail --> MergeConfig

The is_interactive_mode() check:

  • Returns true if stdout is a terminal (TTY)
  • Can be overridden via set_interactive_mode_override() for testing

Sources: src/config/config_manager.rs:64-77 tests/config_manager_tests.rs:19-33


Configuration Merging Strategy

Merge Behavior

The AppConfig struct uses the merge crate with a custom overwrite_option strategy:

Merge Rules:

  1. Some(new_value) always replaces Some(old_value)
  2. Some(new_value) always replaces None
  3. None never replaces Some(old_value)

This ensures user-provided values take absolute precedence over defaults while allowing partial configuration.
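
A minimal sketch of an overwrite_option strategy consistent with these rules (the actual function in src/config/app_config.rs may differ in detail):

```rust
// Merge strategy sketch: a Some value from the higher-priority source always
// wins; None never erases an existing value.
fn overwrite_option<T>(left: &mut Option<T>, right: Option<T>) {
    if right.is_some() {
        *left = right;
    }
}
```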

Sources: src/config/app_config.rs:8-13, src/config/config_manager.rs:79


Schema Validation and Error Handling

graph LR
    TOML["TOML File with\ninvalid_key"]
    Deserialize["config.try_deserialize()"]
    ExtractSchema["AppConfig::get_valid_keys()"]
    Schema["schemars::schema_for!(AppConfig)"]
    FormatError["Format error message with\nvalid keys and types"]
    Error["Return descriptive error:\n- email (String | Null)\n- max_concurrent (Integer | Null)\n- min_delay_ms (Integer | Null)\n- max_retries (Integer | Null)\n- cache_base_dir (String | Null)"]
    TOML --> Deserialize
    Deserialize -->|Error| ExtractSchema
    ExtractSchema --> Schema
    Schema --> FormatError
    FormatError --> Error

Invalid Key Detection

The configuration system uses serde(deny_unknown_fields) to reject unknown keys in TOML files. When an invalid key is detected, the error message includes a complete list of valid keys with their types:

Example error output:

unknown field `invalid_key`, expected one of `email`, `max_concurrent`, `min_delay_ms`, `max_retries`, `cache_base_dir`

Valid configuration keys are:
  - email (String | Null)
  - max_concurrent (Integer | Null)
  - min_delay_ms (Integer | Null)
  - max_retries (Integer | Null)
  - cache_base_dir (String | Null)

Sources: src/config/app_config.rs:62-77 src/config/config_manager.rs:45-62 tests/config_manager_tests.rs:67-94


Usage Examples

Loading Configuration

Default configuration path:

Custom configuration path:

Direct AppConfig construction:
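
The code for these snippets is not present in the extracted page; a minimal sketch based on the APIs referenced elsewhere in this document (module paths and exact parameter types are assumptions):

```rust
use std::path::PathBuf;

// Assumed re-export paths, shown for illustration only.
use sec_fetcher::config::{AppConfig, ConfigManager};

fn load_configuration_examples() -> Result<(), Box<dyn std::error::Error>> {
    // Default configuration path: system config dir, then ./sec_fetcher_config.toml
    let _default = ConfigManager::load()?;

    // Custom configuration path
    let _custom = ConfigManager::from_config(Some(PathBuf::from("my_config.toml")))?;

    // Direct AppConfig construction (all fields at their defaults)
    let _config = AppConfig::default();

    Ok(())
}
```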

Sources: examples/config.rs:1-17 tests/config_manager_tests.rs:36-57


graph LR
    CM["ConfigManager::from_config()"]
InitCaches["init_caches()"]
CachesInit["Caches::init(self)"]
HTTPCache["Initialize HTTP cache\nwith cache_base_dir"]
PreprocCache["Initialize preprocessor cache"]
OnceLock["Store in OnceLock statics"]
CM --> InitCaches
 
   InitCaches --> CachesInit
 
   CachesInit --> HTTPCache
 
   CachesInit --> PreprocCache
 
   HTTPCache --> OnceLock
 
   PreprocCache --> OnceLock

Cache Initialization

The ConfigManager automatically initializes the global Caches module during construction:

The cache_base_dir field from AppConfig is passed to the Caches module to determine where HTTP responses and preprocessed data are stored. For detailed information on cache policies and storage mechanisms, see Caching & Storage System.

Sources: src/config/config_manager.rs:83-100


Testing

Test Coverage

The configuration system includes comprehensive unit tests:

| Test Function | Purpose |
|---|---|
| test_load_custom_config | Verifies loading from custom TOML file path |
| test_load_non_existent_config | Ensures proper error on missing file |
| test_fails_on_invalid_key | Validates schema enforcement and error messages |
| test_fails_if_no_email_available | Checks email requirement in non-interactive mode (commented) |

Test Utilities:

  • create_temp_config(contents: &str) - Creates temporary TOML files for testing
  • set_interactive_mode_override(Some(false)) - Disables interactive prompts during tests

Sources: tests/config_manager_tests.rs:1-95


Configuration System Components

Complete Component Diagram

Sources: src/config/app_config.rs:1-159 src/config/config_manager.rs:1-121



Running Examples


This page demonstrates how to run the example programs included in the repository. Each example illustrates specific functionality of the sec-fetcher system, from basic configuration validation to fetching company data from the SEC EDGAR API. These examples serve as practical starting points for understanding how to integrate the library into your own applications.

For detailed information about configuring the application before running these examples, see Configuration System. For understanding the underlying network layer and data fetching mechanisms, see Network Layer & SecClient and Data Fetching Functions.

Example Programs Overview

The repository includes five example programs located in the examples/ directory. Each demonstrates different aspects of the system:

| Example Program | Primary Purpose | Key Functions Demonstrated | Command-Line Args |
|---|---|---|---|
| config.rs | Configuration validation and inspection | ConfigManager::load(), AppConfig::pretty_print() | None |
| funds.rs | Fetch investment company datasets | fetch_investment_company_series_and_class_dataset() | None |
| lookup_cik.rs | CIK lookup and submission retrieval | fetch_cik_by_ticker_symbol(), fetch_cik_submissions() | Ticker symbol (e.g., AAPL) |
| nport_filing.rs | Parse NPORT filing holdings | fetch_nport_filing(), XML parsing | Accession number |
| us_gaap_human_readable.rs | Fetch and display US GAAP fundamentals | fetch_us_gaap_fundamentals(), distill_us_gaap_fundamental_concepts() | Ticker symbol |

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75

Example Program Architecture

Diagram: Example Programs and System Integration

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75

config.rs - Configuration Validation

Purpose : Load and display the current application configuration to verify that configuration files are properly formatted and credentials are correctly set.

How to Run :
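
The command is not shown in the extracted page; assuming the standard Cargo example runner:

```sh
cargo run --example config
```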

Expected Output : The example displays the merged configuration in JSON format, showing values from the configuration file merged with defaults.

Code Flow :

graph TB
    Start["main()
entry point\nexamples/config.rs:4"]
Load["ConfigManager::load()\nLoad config from file\nexamples/config.rs:13"]
GetConfig["config_manager.get_config()\nRetrieve AppConfig reference\nexamples/config.rs:15"]
Pretty["config.pretty_print()\nSerialize to JSON\nexamples/config.rs:16"]
Print["print! macro\nDisplay formatted config\nexamples/config.rs:16"]
End["Program exit"]
Start --> Load
 
   Load --> GetConfig
 
   GetConfig --> Pretty
 
   Pretty --> Print
 
   Print --> End

Key Functions Used :

| Function | File Location | Purpose |
|---|---|---|
| ConfigManager::load() | examples/config.rs:13 | Loads configuration from default paths |
| get_config() | examples/config.rs:15 | Returns reference to AppConfig struct |
| pretty_print() | examples/config.rs:16 | Serializes configuration to formatted JSON |

What This Demonstrates :

  • Basic configuration loading workflow
  • Configuration file resolution and merging
  • Validation of configuration structure

Sources: examples/config.rs:1-18 src/config/app_config.rs:1-159

funds.rs - Investment Company Dataset

Purpose : Fetch the complete investment company series and class dataset from the SEC EDGAR API, demonstrating network client initialization and basic data fetching.

How to Run :
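
The command is not shown in the extracted page; assuming the standard Cargo example runner:

```sh
cargo run --example funds
```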

Expected Output : A list of investment company records, each containing series information, fund names, CIKs, and classification details. The output format is the Debug representation of the Ticker structs.

Code Flow :

graph TB
    Start["#[tokio::main] async fn main()\nexamples/funds.rs:6-7"]
LoadCfg["ConfigManager::load()\nexamples/funds.rs:8"]
CreateClient["SecClient::from_config_manager()\nInitialize HTTP client\nexamples/funds.rs:9"]
FetchData["fetch_investment_company_series_and_class_dataset()\nNetwork request to SEC API\nexamples/funds.rs:11"]
LoopStart["for fund in funds\nIterate results\nexamples/funds.rs:13"]
PrintFund["print! macro\nDisplay fund Debug output\nexamples/funds.rs:14"]
End["Return Ok(())\nexamples/funds.rs:17"]
Start --> LoadCfg
 
   LoadCfg --> CreateClient
 
   CreateClient --> FetchData
 
   FetchData --> LoopStart
 
   LoopStart --> PrintFund
 
   PrintFund -->|More funds| LoopStart
 
   LoopStart -->|Done| End

Key Functions and Structures :

| Entity | Type | Purpose |
|---|---|---|
| fetch_investment_company_series_and_class_dataset() | Function | Fetches investment company data from SEC |
| SecClient | Struct | HTTP client with throttling and caching |
| Ticker | Struct | Represents company ticker information |
| tokio::main | Macro | Enables async runtime |

What This Demonstrates :

  • Async/await pattern usage with tokio
  • SecClient initialization from configuration
  • Network function invocation
  • Iteration over fetched data structures

Sources: examples/funds.rs:1-19

lookup_cik.rs - CIK Lookup and Submission Retrieval

Purpose : Demonstrate CIK (Central Index Key) lookup by ticker symbol and retrieval of company submissions, with special handling for NPORT-P filings.

How to Run :
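
The command is not shown in the extracted page; assuming the standard Cargo example runner with a ticker symbol argument:

```sh
cargo run --example lookup_cik AAPL
```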

Replace AAPL with any valid ticker symbol.

Expected Output :

  • The CIK number for the ticker
  • URL to the SEC submissions JSON
  • Most recent NPORT-P submission (if the company files NPORT-P forms)
  • Or a list of other submission types (e.g., 10-K)
  • EDGAR archive URLs for each submission

Execution Flow with Code Entities :

Command-Line Argument Handling :

The example requires exactly one argument (the ticker symbol). The argument parsing logic is implemented at examples/lookup_cik.rs:10-14:

let args: Vec<String> = std::env::args().collect();
if args.len() != 2 { eprintln!("Usage: lookup_cik <TICKER_SYMBOL>"); std::process::exit(1); }
let ticker_symbol = &args[1];

Key Code Entities :

| Entity | Type | Location | Purpose |
|---|---|---|---|
| fetch_cik_by_ticker_symbol() | Function | examples/lookup_cik.rs:21-23 | Retrieves CIK for ticker |
| fetch_cik_submissions() | Function | examples/lookup_cik.rs:43 | Fetches all submissions for CIK |
| CikSubmission::most_recent_nport_p_submission() | Method | examples/lookup_cik.rs:45-46 | Filters for latest NPORT-P |
| as_edgar_archive_url() | Method | examples/lookup_cik.rs:54 | Generates SEC EDGAR URL |
| Cik | Struct | examples/lookup_cik.rs:28 | 10-digit CIK identifier |
| CikSubmission | Struct | examples/lookup_cik.rs:43 | Represents a single SEC filing |

What This Demonstrates :

  • Command-line argument parsing using std::env::args()
  • Error handling with Result and Option types
  • Conditional logic based on filing types
  • Method chaining for data transformation
  • URL generation from structured data

Sources: examples/lookup_cik.rs:1-75

Prerequisites and Common Patterns

Configuration Requirement : All examples (except config.rs) require a valid configuration file with at least an email address set. See Configuration System for setup instructions.

Common Code Pattern Across Examples :

All examples follow this pattern:

  1. Load configuration using ConfigManager::load()
  2. Initialize SecClient from the configuration
  3. Call network functions to fetch data
  4. Process and display results

Error Handling : Examples use the ? operator extensively to propagate errors and return Result<(), Box<dyn Error>> from main(). This allows errors to bubble up with descriptive messages.

Async Runtime : Examples that perform network operations use #[tokio::main] to enable the async runtime, allowing await syntax for asynchronous operations.
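
A condensed sketch of this shared pattern (module paths and exact function parameters are assumptions; see the individual example files for the real code):

```rust
use std::error::Error;

// Assumed module paths, shown for illustration only.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::{fetch_investment_company_series_and_class_dataset, SecClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // 1. Load configuration (file + credentials)
    let config_manager = ConfigManager::load()?;

    // 2. Initialize SecClient with throttling and caching middleware
    let client = SecClient::from_config_manager(&config_manager)?;

    // 3. Call a network function to fetch data
    let funds = fetch_investment_company_series_and_class_dataset(&client).await?;

    // 4. Process and display results
    for fund in funds {
        println!("{:?}", fund);
    }
    Ok(())
}
```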

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75

Running Examples in Development Mode

When running examples, the application may detect development mode based on environment variables or the presence of certain files. Development mode affects:

  • Cache behavior : May use more aggressive caching strategies
  • Logging verbosity : May produce more detailed output
  • Rate limiting : May use relaxed throttling policies for testing

Refer to Utility Functions for details on development mode detection.

Next Steps

After running these examples:

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75



Rust sec-fetcher Application


Purpose and Scope

This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured CSV files. The application serves as the data collection and preprocessing layer in a larger dual-language system that combines Rust's performance for I/O operations with Python's machine learning capabilities.

This page covers the high-level architecture, module organization, and data flow patterns. For detailed information about specific components, see:

The Python machine learning pipeline that consumes this data is documented in Python narrative_stack System.

Sources: src/lib.rs:1-11 Cargo.toml:1-45 src/main.rs:1-241

Application Architecture

The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format, and output structured CSV files for downstream consumption.

graph TB
    subgraph "src/lib.rs Module Organization"
        config["config\nConfigManager, AppConfig"]
enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
network["network\nSecClient, fetch_* functions\nThrottlePolicy, CachePolicy"]
transformers["transformers\ndistill_us_gaap_fundamental_concepts"]
parsers["parsers\nXML/JSON parsing utilities"]
caches["caches\nCaches (internal)\nHTTP cache, preprocessor cache"]
fs["fs\nFile system operations"]
utils["utils\nVecExtensions, helpers"]
end
    
    subgraph "External Dependencies"
        reqwest["reqwest\nHTTP client"]
polars["polars\nDataFrame operations"]
simd["simd-r-drive\nDrive-based cache storage"]
tokio["tokio\nAsync runtime"]
serde["serde\nSerialization"]
end
    
 
   config --> caches
 
   network --> config
 
   network --> caches
 
   network --> models
 
   network --> enums
 
   network --> parsers
 
   transformers --> models
 
   transformers --> enums
 
   parsers --> models
    
 
   network --> reqwest
 
   network --> simd
 
   network --> tokio
 
   transformers --> polars
 
   models --> serde
    
    style config fill:#f9f9f9
    style network fill:#f9f9f9
    style transformers fill:#f9f9f9
    style caches fill:#f9f9f9

Module Structure

The application is organized into nine core modules as declared in src/lib.rs:1-10:

| Module | Purpose | Key Components |
|---|---|---|
| config | Configuration management and credential handling | ConfigManager, AppConfig |
| enums | Type-safe enumerations for domain concepts | FundamentalConcept, Url, CacheNamespacePrefix, TickerOrigin |
| models | Data structures representing SEC entities | Ticker, CikSubmission, NportInvestment, AccessionNumber |
| network | HTTP client and data fetching functions | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals |
| transformers | Data normalization and transformation | distill_us_gaap_fundamental_concepts |
| parsers | XML/JSON parsing utilities | Filing parsers, XML extractors |
| caches | Internal caching infrastructure | Caches (singleton), HTTP cache, preprocessor cache |
| fs | File system operations | Directory creation, path utilities |
| utils | Helper functions and extensions | VecExtensions, invert_multivalue_indexmap |

Sources: src/lib.rs:1-11 Cargo.toml:8-40

Data Flow Architecture

Request-Response Flow with Caching

The data flow follows a pipeline pattern:

  1. Request Initiation : Application code calls functions like fetch_us_gaap_fundamentals or fetch_nport_filing_by_ticker_symbol
  2. Client Middleware : SecClient applies throttling and caching policies before making HTTP requests
  3. Cache Check : CachePolicy checks simd-r-drive storage for cached responses
  4. API Request : If cache miss, request is sent to SEC EDGAR API with proper headers and rate limiting
  5. Parsing : Raw XML/JSON responses are parsed into structured data models
  6. Transformation : Data passes through normalization functions like distill_us_gaap_fundamental_concepts
  7. Output : Transformed data is written to CSV files organized by ticker symbol or category

Sources: src/main.rs:174-240 src/lib.rs:1-11

Key Dependencies and Technology Stack

The application leverages modern Rust crates for performance and reliability:

Critical Dependencies

| Category | Crate | Version | Purpose |
|---|---|---|---|
| Async Runtime | tokio | 1.43.0 | Asynchronous I/O, task scheduling |
| HTTP Client | reqwest | 0.12.15 | HTTP requests with JSON support |
| Data Frames | polars | 0.46.0 | High-performance DataFrame operations |
| Caching | simd-r-drive | 0.3.0 | WebSocket-based key-value storage |
| Parallelism | rayon | 1.10.0 | Data parallelism for CPU-bound work |
| Configuration | config | 0.15.9 | TOML/JSON configuration file parsing |
| Credentials | keyring | 3.6.2 | Secure credential storage (OS keychain) |
| XML Parsing | quick-xml | 0.37.2 | Fast XML parsing for SEC filings |
| Serialization | serde | 1.0.218 | Data structure serialization/deserialization |

Sources: Cargo.toml:8-40

Application Entry Point and Main Loop

The application entry point in src/main.rs:174-240 demonstrates the typical processing pattern:

Main Processing Loop Structure

The main application follows this pattern:

  1. Configuration Loading : ConfigManager::load() reads configuration from TOML files and environment variables
  2. Client Initialization : SecClient::from_config_manager() creates HTTP client with throttling and caching middleware
  3. Ticker Fetching : fetch_company_tickers() retrieves all company ticker symbols from SEC
  4. Processing Loop : Iterates over each ticker symbol:
    • Calls fetch_us_gaap_fundamentals() to retrieve financial data
    • Transforms data through distill_us_gaap_fundamental_concepts
    • Writes results to CSV files organized by ticker: data/us-gaap/{ticker}.csv
    • Logs errors to error_log HashMap for later reporting
  5. Error Reporting : Prints summary of any errors encountered during processing

Example Main Function Flow
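
The full main function is not reproduced here; a simplified sketch of the pattern described above (module paths, function parameters, and the error type are assumptions) is:

```rust
use std::collections::HashMap;

// Assumed module paths, shown for illustration only.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::{fetch_company_tickers, fetch_us_gaap_fundamentals, SecClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1-2. Configuration loading and client initialization
    let config_manager = ConfigManager::load()?;
    let client = SecClient::from_config_manager(&config_manager)?;

    // 3. Fetch the master ticker list
    let tickers = fetch_company_tickers(&client).await?;

    // 4. Process each ticker, accumulating errors instead of aborting the batch
    let mut error_log: HashMap<String, String> = HashMap::new();
    for ticker in &tickers {
        match fetch_us_gaap_fundamentals(&client, &tickers, &ticker.symbol).await {
            Ok(_df) => {
                // Transform via distill_us_gaap_fundamental_concepts and write
                // data/us-gaap/{ticker}.csv (omitted in this sketch)
            }
            Err(err) => {
                error_log.insert(ticker.symbol.clone(), err.to_string());
            }
        }
    }

    // 5. Report any errors encountered during processing
    for (symbol, err) in &error_log {
        eprintln!("{symbol}: {err}");
    }
    Ok(())
}
```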

Sources: src/main.rs:174-240

Module Interaction Patterns

Configuration and Client Initialization

The initialization sequence ensures:

  • Configuration is loaded before any network operations
  • Credentials are retrieved securely from OS keychain via keyring crate
  • Caches singleton is initialized with drive storage connection
  • HTTP client middleware is properly configured with throttling and caching

Sources: src/lib.rs:1-11

Data Transformation Pipeline

The transformation pipeline is the most critical component (importance: 8.37 as noted in the high-level architecture). The distill_us_gaap_fundamental_concepts function handles:

  • Synonym Consolidation : Maps 6+ variations of "Cost of Revenue" to single CostOfRevenue concept
  • Industry-Specific Revenue : Handles 57+ revenue field variations (SalesRevenueNet, RevenueFromContractWithCustomer, etc.)
  • Hierarchical Mapping : Maps specific concepts to parent categories (e.g., CurrentAssets also maps to Assets)
  • Unit Normalization : Standardizes currency, shares, and percentage units

Sources: Cargo.toml:24, high-level architecture diagram

Error Handling Strategy

The application uses a multi-layered error handling approach:

| Layer | Strategy | Example |
|---|---|---|
| Network | Retry with exponential backoff | ThrottlePolicy with max_retries |
| Cache | Fallback to API on cache errors | CachePolicy transparent fallback |
| Parsing | Log and continue processing | Error log in main loop |
| File I/O | Log error, continue to next item | Individual CSV write failures don't stop processing |

The main loop accumulates errors in a HashMap<String, String> and reports them at the end, ensuring that one failure doesn't halt the entire batch processing job.

Sources: src/main.rs:185-240

Performance Characteristics

The application is optimized for I/O-bound workloads:

  • Async I/O : tokio runtime enables concurrent network requests
  • CPU Parallelism : rayon parallelizes DataFrame operations in polars
  • Caching : simd-r-drive reduces redundant API calls with 1-week TTL
  • Throttling : Respects SEC rate limits with adaptive jitter
  • Streaming : Processes large datasets without loading entirely into memory

The combination of async networking for I/O operations and Rayon parallelism for CPU-bound transformations provides optimal throughput for SEC data fetching and processing.

Sources: Cargo.toml:8-40 high-level architecture diagrams



Network Layer & SecClient


Purpose and Scope

This page documents the network layer of the Rust sec-fetcher application, specifically the SecClient HTTP client and its associated infrastructure. The SecClient provides the foundational HTTP communication layer for all SEC EDGAR API interactions, implementing throttling, caching, and retry logic to ensure reliable and compliant data fetching.

This page covers:

  • The SecClient structure and initialization
  • Throttling and rate limiting policies
  • HTTP caching mechanisms
  • Request/response handling
  • User-Agent management and email validation

For information about the specific network fetching functions that use SecClient (such as fetch_company_tickers, fetch_us_gaap_fundamentals, etc.), see Data Fetching Functions. For details on the caching system architecture, see Caching & Storage System.


SecClient Architecture Overview

Component Diagram

Sources: src/network/sec_client.rs:1-181


SecClient Structure and Initialization

SecClient Fields

The SecClient struct maintains four core components:

| Field | Type | Purpose |
|---|---|---|
| email | String | Contact email for SEC User-Agent header |
| http_client | ClientWithMiddleware | reqwest client with middleware stack |
| cache_policy | Arc<CachePolicy> | Shared cache configuration |
| throttle_policy | Arc<ThrottlePolicy> | Shared throttle configuration |

Sources: src/network/sec_client.rs:14-19

Construction from ConfigManager

The from_config_manager() constructor performs the following initialization sequence:

  1. Extract Configuration : Reads email, max_concurrent, min_delay_ms, and max_retries from AppConfig
  2. Create CachePolicy : Configures cache with 1-week TTL and disables header respect
  3. Create ThrottlePolicy : Configures rate limiting with adaptive jitter
  4. Initialize HTTP Cache : Retrieves the shared HTTP cache store from Caches
  5. Build Middleware Stack : Combines cache and throttle layers
  6. Construct Client : Creates ClientWithMiddleware with full middleware stack

Sources: src/network/sec_client.rs:23-89

Configuration Parameters

The following table details the configuration parameters required by SecClient:

| Parameter | AppConfig Field | Required | Default | Purpose |
|---|---|---|---|---|
| Email | email | Yes | None | SEC User-Agent compliance |
| Max Concurrent | max_concurrent | Yes | None | Concurrent request limit |
| Min Delay (ms) | min_delay_ms | Yes | None | Base throttle delay |
| Max Retries | max_retries | Yes | None | Retry attempt limit |

Sources: src/network/sec_client.rs:28-43


Throttle Policy Configuration

ThrottlePolicy Structure

The ThrottlePolicy from reqwest_drive controls request rate limiting with the following parameters:

| Field | Type | Source | Purpose |
|---|---|---|---|
| base_delay_ms | u64 | AppConfig.min_delay_ms | Minimum delay between requests |
| max_concurrent | u64 | AppConfig.max_concurrent | Maximum concurrent requests |
| max_retries | u64 | AppConfig.max_retries | Maximum retry attempts |
| adaptive_jitter_ms | u64 | Hardcoded: 500 | Randomized delay for retry backoff |

Sources: src/network/sec_client.rs:52-59

Request Throttling Mechanism

The throttle mechanism operates as follows:

  1. Concurrent Limit : Enforces max_concurrent simultaneous requests
  2. Base Delay : Applies base_delay_ms between sequential requests
  3. Retry Logic : Retries failed requests up to max_retries times
  4. Adaptive Jitter : Adds randomized delay of up to adaptive_jitter_ms on retries to prevent thundering herd

The ThrottlePolicy can be overridden on a per-request basis by passing a custom policy to raw_request():
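
The override snippet itself is not included in the extracted page; a hedged sketch based on the fields and parameters documented on this page (the struct construction, Arc wrapping, import paths, and the exact raw_request signature are assumptions) could look like:

```rust
use std::sync::Arc;

// Import paths assumed from the documentation above.
use reqwest_drive::ThrottlePolicy;
use sec_fetcher::network::SecClient;

// Sketch: issue one request with a relaxed, per-request throttle override.
async fn fetch_with_relaxed_throttle(
    client: &SecClient,
) -> Result<reqwest::Response, Box<dyn std::error::Error>> {
    let relaxed = ThrottlePolicy {
        base_delay_ms: 2_000, // slower base delay for this call only
        max_concurrent: 1,
        max_retries: 10,
        adaptive_jitter_ms: 500,
    };

    let response = client
        .raw_request(
            reqwest::Method::GET,
            "https://www.sec.gov/files/company_tickers.json",
            None,                    // no extra headers
            Some(Arc::new(relaxed)), // custom throttle policy for this request
        )
        .await?;
    Ok(response)
}
```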

Sources: src/network/sec_client.rs:140-165


Cache Policy Configuration

CachePolicy Structure

The CachePolicy configuration defines HTTP caching behavior:

| Field | Value | Purpose |
|---|---|---|
| default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week cache lifetime |
| respect_headers | false | Ignore HTTP cache control headers |
| cache_status_override | None | No status code override |

Note : The 1-week TTL is currently hardcoded and marked with a TODO comment for future configurability.

Sources: src/network/sec_client.rs:45-50

Cache Storage Integration

The cache storage uses the following integration:

  1. HTTP Cache Store : Retrieved via Caches::get_http_cache_store(), a static OnceLock singleton
  2. Drive Integration : Uses init_cache_with_drive_and_throttle() from reqwest_drive
  3. Persistent Storage : Backed by simd-r-drive WebSocket server for cross-session persistence

Sources: src/network/sec_client.rs:73-81


User-Agent Management

User-Agent Format

The get_user_agent() method generates a compliant User-Agent string for SEC EDGAR API requests:

Format: <package_name>/<version> (+<email>)
Example: sec-fetcher/0.1.0 (+contact@example.com)

Sources: src/network/sec_client.rs:91-108

Email Validation

The email validation occurs at User-Agent generation time rather than during instantiation. This design ensures that:

  1. Every network request validates the email format
  2. Invalid configurations fail fast on first network call
  3. The email is validated even if the SecClient is constructed through different paths

Sources: src/network/sec_client.rs:91-99


Request Methods

raw_request Method

The raw_request() method provides low-level HTTP request functionality:

Parameters:

  • method: HTTP method (GET, POST, etc.)
  • url: Target URL
  • headers: Optional additional headers as key-value tuples
  • custom_throttle_policy: Optional per-request throttle override

Behavior:

  1. Constructs request with User-Agent header
  2. Applies optional custom headers
  3. Injects custom throttle policy if provided (via request extensions)
  4. Executes request through middleware stack
  5. Returns raw reqwest::Response

Sources: src/network/sec_client.rs:140-165

fetch_json Method

The fetch_json() method is a convenience wrapper for JSON API requests:

Flow:

  1. Calls raw_request() with GET method
  2. Awaits response
  3. Deserializes response body to serde_json::Value
  4. Returns parsed JSON

Sources: src/network/sec_client.rs:167-179

Request Flow Diagram

Sources: src/network/sec_client.rs:140-179


Testing Infrastructure

Test Organization

The SecClient tests use the mockito crate for HTTP mocking:

Sources: tests/sec_client_tests.rs:1-159

Test Coverage

The test suite covers the following scenarios:

| Test | Purpose | Mock Behavior | Assertion |
|---|---|---|---|
| test_user_agent | User-Agent formatting | N/A | Matches expected format |
| test_invalid_email_panic | Email validation | N/A | Panics with expected message |
| test_fetch_json_without_retry_success | Successful request | 200 JSON response | JSON parsed correctly |
| test_fetch_json_with_retry_success | Retry not needed | 200 JSON response | JSON parsed correctly |
| test_fetch_json_with_retry_failure | Retry exhaustion | 500 error (3x) | Returns error |
| test_fetch_json_with_retry_backoff | Retry with recovery | 500 → 200 | JSON parsed correctly |

Key Testing Patterns:

  1. Mock Server : mockito::Server::new_async() creates isolated HTTP endpoints
  2. Configuration : Tests use AppConfig::default() with overrides
  3. Async Execution : All network tests use #[tokio::test]
  4. Expectations : mockito verifies request counts with .expect(n)

Sources: tests/sec_client_tests.rs:7-158


Dependencies and Integration

External Dependencies

| Crate | Purpose | Usage in SecClient |
|---|---|---|
| reqwest | HTTP client | Core HTTP functionality |
| reqwest_drive | Middleware stack | Cache and throttle integration |
| tokio | Async runtime | All async operations |
| serde_json | JSON parsing | Response deserialization |
| email_address | Email validation | User-Agent email checking |

Sources: src/network/sec_client.rs:1-12 Cargo.lock:1-100

Internal Dependencies

The SecClient integrates with:

  1. ConfigManager : Provides configuration for initialization (Configuration System)
  2. Caches : Supplies HTTP cache storage singleton (Caching & Storage System)
  3. Network Module : Exposed via src/network.rs module
  4. Fetch Functions : Used by all data fetching operations (Data Fetching Functions)

Sources: src/network.rs:1-23 src/network/sec_client.rs:1-12


Error Handling

Error Types

SecClient methods return Result<T, Box<dyn Error>>, allowing propagation of:

  1. reqwest::Error : HTTP client errors (connection, timeout, etc.)
  2. serde_json::Error : JSON deserialization errors
  3. Custom Errors : From configuration or validation

Panic Conditions

The only panic condition is invalid email format detected in get_user_agent():

This is intentional as an invalid email violates SEC API requirements.

Sources: src/network/sec_client.rs:95-98


Usage Example

The following example demonstrates typical SecClient usage:
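
The original snippet is not included in the extracted page; a minimal sketch assuming the constructor and fetch_json method described above (module paths and the exact fetch_json parameters are assumptions):

```rust
// Assumed module paths, shown for illustration only.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::SecClient;

async fn show_company_tickers_json() -> Result<(), Box<dyn std::error::Error>> {
    // Load configuration (email, throttling, cache settings) and build the client
    let config_manager = ConfigManager::load()?;
    let client = SecClient::from_config_manager(&config_manager)?;

    // fetch_json issues a throttled, cached GET and deserializes the body to JSON
    let json: serde_json::Value = client
        .fetch_json("https://www.sec.gov/files/company_tickers.json")
        .await?;
    println!("{json}");
    Ok(())
}
```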

Sources: tests/sec_client_tests.rs:36-62



Data Fetching Functions


Purpose and Scope

This document describes the data fetching functions in the Rust sec-fetcher application. These functions provide the interface for retrieving financial data from the SEC EDGAR API, including company tickers, CIK submissions, NPORT filings, US GAAP fundamentals, and investment company datasets.

For information about the underlying HTTP client, throttling, and caching infrastructure, see Network Layer & SecClient. For details on how US GAAP data is transformed after fetching, see US GAAP Concept Transformation. For information about the data structures returned by these functions, see Data Models & Enumerations.

Overview of Data Fetching Architecture

The network module provides six primary fetching functions that retrieve different types of financial data from the SEC EDGAR API. Each function accepts a SecClient reference and returns structured data types.

Diagram: Data Fetching Function Overview

graph TB
    subgraph "Entry Points"
        FetchTickers["fetch_company_tickers()\nsrc/network/fetch_company_tickers.rs"]
FetchCIK["fetch_cik_by_ticker_symbol()\nNot shown in files"]
FetchSubs["fetch_cik_submissions()\nsrc/network/fetch_cik_submissions.rs"]
FetchNPORT["fetch_nport_filing_by_ticker_symbol()\nfetch_nport_filing_by_cik_and_accession_number()\nsrc/network/fetch_nport_filing.rs"]
FetchGAAP["fetch_us_gaap_fundamentals()\nsrc/network/fetch_us_gaap_fundamentals.rs"]
FetchInvCo["fetch_investment_company_series_and_class_dataset()\nsrc/network/fetch_investment_company_series_and_class_dataset.rs"]
end
    
    subgraph "SEC EDGAR API Endpoints"
        APITickers["company_tickers.json"]
APISubmissions["submissions/CIK{cik}.json"]
APINport["Archives/edgar/data/{cik}/{accession}/primary_doc.xml"]
APIFacts["api/xbrl/companyfacts/CIK{cik}.json"]
APIInvCo["files/investment/data/other/.../{year}.csv"]
end
    
    subgraph "Return Types"
        VecTicker["Vec&lt;Ticker&gt;"]
CikType["Cik"]
VecCikSub["Vec&lt;CikSubmission&gt;"]
VecNport["Vec&lt;NportInvestment&gt;"]
DataFrame["TickerFundamentalsDataFrame"]
VecInvCo["Vec&lt;InvestmentCompany&gt;"]
end
    
 
   FetchTickers -->|HTTP GET| APITickers
 
   APITickers --> VecTicker
    
 
   FetchCIK -->|Searches| VecTicker
 
   FetchCIK --> CikType
    
 
   FetchSubs -->|HTTP GET| APISubmissions
 
   APISubmissions --> VecCikSub
    
 
   FetchNPORT -->|HTTP GET| APINport
 
   APINport --> VecNport
    
 
   FetchGAAP -->|HTTP GET| APIFacts
 
   APIFacts --> DataFrame
    
 
   FetchInvCo -->|HTTP GET| APIInvCo
 
   APIInvCo --> VecInvCo

Sources: src/network/fetch_company_tickers.rs src/network/fetch_cik_submissions.rs src/network/fetch_nport_filing.rs src/network/fetch_us_gaap_fundamentals.rs src/network/fetch_investment_company_series_and_class_dataset.rs

Company Ticker Fetching

Function: fetch_company_tickers

The fetch_company_tickers function retrieves the master list of operating company tickers from the SEC EDGAR API. This dataset provides the mapping between ticker symbols, company names, and CIK numbers.

Signature:

Implementation Details:

| Aspect | Detail |
|---|---|
| API Endpoint | Url::CompanyTickers (company_tickers.json) |
| Response Format | JSON object with numeric keys mapping to ticker data |
| Parsing Logic | Lines 18-31 |
| CIK Conversion | Uses Cik::from_u64() to format CIK numbers (line 22) |
| Origin Tag | All tickers tagged with TickerOrigin::CompanyTickers (line 28) |

Returned Data Structure:

Each Ticker in the result contains:

  • cik: 10-digit formatted CIK number
  • symbol: Ticker symbol (e.g., "AAPL")
  • company_name: Full company name
  • origin: Set to TickerOrigin::CompanyTickers

Usage Example:

The function is commonly used as a prerequisite for other operations that require ticker-to-CIK mapping:
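
A hedged sketch of that prerequisite pattern (module paths, field types, and parameter shapes are assumed from the descriptions above):

```rust
// Assumed module paths, shown for illustration only.
use sec_fetcher::models::{Cik, Ticker};
use sec_fetcher::network::{fetch_company_tickers, SecClient};

async fn lookup_cik_for(
    client: &SecClient,
    symbol: &str,
) -> Result<Option<Cik>, Box<dyn std::error::Error>> {
    // Fetch the master ticker list once, then search it for the requested symbol
    let tickers: Vec<Ticker> = fetch_company_tickers(client).await?;
    Ok(tickers.into_iter().find(|t| t.symbol == symbol).map(|t| t.cik))
}
```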

Sources: src/network/fetch_company_tickers.rs:1-34

CIK Lookup and Submissions

Function: fetch_cik_submissions

The fetch_cik_submissions function retrieves all SEC filings (submissions) for a given company, identified by its CIK number.

Signature:

Implementation Details:

The function performs the following operations:

  1. URL Construction : Creates the submissions endpoint URL using the Url::CikSubmission enum (line 12)
  2. JSON Fetching : Retrieves submission data via sec_client.fetch_json() (line 14)
  3. Data Extraction : Parses the filings.recent object containing parallel arrays (lines 25-51)
  4. Parallel Array Processing : Uses itertools::izip! to iterate over multiple arrays simultaneously (lines 55-60)

JSON Structure:

Diagram: CikSubmission JSON Structure

Parsing Implementation:

The function extracts four parallel arrays and combines them using izip!:

  • accessionNumber: Accession numbers for filings
  • form: Form types (e.g., "10-K", "NPORT-P")
  • primaryDocument: Primary document filenames
  • filingDate: Filing dates in "YYYY-MM-DD" format

Each set of parallel values is combined into a CikSubmission struct (lines 68-75).

Date Parsing:

Filing dates are parsed from strings into NaiveDate objects (lines 61-63):
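
The snippet itself is not shown here; with chrono (listed in the dependencies), parsing the "YYYY-MM-DD" format typically looks like this:

```rust
use chrono::NaiveDate;

// Parse an EDGAR filing date such as "2024-02-15" into a NaiveDate.
fn parse_filing_date(raw: &str) -> Result<NaiveDate, chrono::ParseError> {
    NaiveDate::parse_from_str(raw, "%Y-%m-%d")
}
```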

Sources: src/network/fetch_cik_submissions.rs:1-79 examples/lookup_cik.rs:43-70

NPORT Filing Fetching

NPORT-P filings contain detailed portfolio holdings for registered investment companies (mutual funds and ETFs). The module provides two related functions for fetching this data.

Function: fetch_nport_filing_by_ticker_symbol

Signature:

Workflow:

Diagram: NPORT Filing Fetch by Ticker Symbol

This convenience function orchestrates multiple calls (lines 10-30):

  1. Lookup CIK from ticker symbol (line 14)
  2. Fetch all submissions for that CIK (line 16)
  3. Filter for most recent NPORT-P submission (lines 18-20)
  4. Fetch the filing details (lines 22-27)

Function: fetch_nport_filing_by_cik_and_accession_number

Signature:

Implementation Details:

| Step | Operation | Code Reference |
|---|---|---|
| 1. Fetch company tickers | Required for ticker mapping | Line 39 |
| 2. Construct URL | Url::CikAccessionPrimaryDocument | Line 41 |
| 3. Fetch XML | raw_request() with GET method | Lines 43-46 |
| 4. Parse XML | parse_nport_xml() | Line 48 |

XML Parsing Details:

The parse_nport_xml function extracts investment holdings from the primary document XML:

  • Main Element : <invstOrSec> (investment or security) (lines 26-46)
  • Fields Extracted : name, LEI, title, CUSIP, ISIN, balance, currency, USD value, percentage value, payoff profile, asset category, issuer category, country
  • Ticker Mapping : Fuzzy matches investment names to company tickers (lines 131-142)
  • Sorting : Results sorted by pct_val descending (line 125)

Sources: src/network/fetch_nport_filing.rs:1-49 src/parsers/parse_nport_xml.rs:12-146 examples/nport_filing.rs:1-29

US GAAP Fundamentals Fetching

Function: fetch_us_gaap_fundamentals

The fetch_us_gaap_fundamentals function retrieves standardized financial statement data (US Generally Accepted Accounting Principles) for a company.

Signature:
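A hedged sketch of the signature, based on the parameters listed under Key Operations below and the documented DataFrame return type:

```rust
use polars::prelude::DataFrame;

// Hedged sketch: parameter names, order, and exact types are assumptions.
pub async fn fetch_us_gaap_fundamentals(
    sec_client: &SecClient,
    company_tickers: &[Ticker],
    ticker_symbol: &str,
) -> Result<DataFrame, Box<dyn std::error::Error>>
```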

Type Alias:

The function returns a Polars DataFrame containing financial facts (line 9).

Implementation Flow:

Diagram: US GAAP Fundamentals Fetch Flow

Key Operations:

  1. CIK Lookup : Resolves ticker symbol to CIK using the provided company tickers list (line 18)
  2. URL Construction : Builds the company facts endpoint URL (line 20)
  3. API Call : Fetches JSON data from SEC (line 25)
  4. Parsing : Converts JSON to structured DataFrame (line 27)

API Response Structure:

The SEC company facts endpoint returns nested JSON containing:

  • US-GAAP taxonomy : Standardized accounting concepts
  • Unit types : Currency units (USD), shares, etc.
  • Time series data : Historical values with filing dates and periods

The parsing function (documented in detail in US GAAP Concept Transformation) extracts this into a tabular DataFrame format.

Sources: src/network/fetch_us_gaap_fundamentals.rs:1-28

Investment Company Dataset Fetching

The investment company dataset provides comprehensive information about registered investment companies, including mutual funds and ETFs, their series, and share classes.

graph TB
    Start["Start"]
CheckCache["Check preprocessor_cache\nfor latest_funds_year"]
CacheHit{{"Cache Hit?"}}
UseCache["Use cached year"]
UseCurrent["Use current year"]
TryFetch["fetch_investment_company_series_and_class_dataset_for_year(year)"]
Success{{"Success?"}}
Store["Store year in cache\nTTL: 1 week"]
Return["Return data"]
Decrement["year -= 1"]
CheckLimit{{"year >= 2024?"}}
Error["Return error"]
Start --> CheckCache
 
   CheckCache --> CacheHit
 
   CacheHit -->|Yes| UseCache
 
   CacheHit -->|No| UseCurrent
 
   UseCache --> TryFetch
 
   UseCurrent --> TryFetch
    
 
   TryFetch --> Success
 
   Success -->|Yes| Store
 
   Store --> Return
 
   Success -->|No| Decrement
 
   Decrement --> CheckLimit
 
   CheckLimit -->|Yes| TryFetch
 
   CheckLimit -->|No| Error

Function: fetch_investment_company_series_and_class_dataset

Signature:

Year Fallback Strategy:

This function implements a sophisticated year-based fallback mechanism because the SEC updates the dataset annually and may not have data for the current year immediately:

Diagram: Investment Company Dataset Year Fallback Logic

Implementation Details:

| Component | Description | Code Reference |
| --- | --- | --- |
| Cache Key | NAMESPACE_HASHER_LATEST_FUNDS_YEAR | Lines 11-15 |
| Namespace | CacheNamespacePrefix::LatestFundsYear | Line 13 |
| Cache Query | preprocessor_cache.read_with_ttl::<usize>() | Lines 39-42 |
| Year Range | 2024 to current year | Line 46 |
| Cache TTL | 1 week (604800 seconds) | Line 59 |

Caching Strategy:

The function caches the most recent successful year to avoid repeated year fallback attempts (lines 38-42).
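A hedged sketch of the fallback loop (cache reads and writes are omitted; names and call shapes are illustrative, not the source's):

```rust
// Hedged sketch of the year-fallback strategy described above.
let mut year = cached_latest_funds_year.unwrap_or(current_year);
let dataset = loop {
    match fetch_investment_company_series_and_class_dataset_for_year(&sec_client, year).await {
        // On success, the year would also be written to the preprocessor cache (TTL: 1 week).
        Ok(dataset) => break dataset,
        // Nothing is published before 2024, so stop falling back there.
        Err(err) if year <= 2024 => return Err(err),
        // Otherwise try the previous year's dataset.
        Err(_) => year -= 1,
    }
};
```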

Function: fetch_investment_company_series_and_class_dataset_for_year

Signature:

This lower-level function fetches the dataset for a specific year (lines 80-105):

  1. URL Construction : Uses Url::InvestmentCompanySeriesAndClassDataset(year) (line 84)
  2. Throttle Override : Reduces max retries to 2 for faster fallback (lines 86-92)
  3. Raw Request : Uses raw_request() instead of higher-level fetch methods (lines 94-101)
  4. CSV Parsing : Calls parse_investment_companies_csv() (line 104)

Throttle Policy Override:

Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:1-178

graph TB
    subgraph "Independent Functions"
        FT["fetch_company_tickers()"]
FCS["fetch_cik_submissions(cik)"]
FIC["fetch_investment_company_series_and_class_dataset()"]
end
    
    subgraph "Dependent Functions"
        FCIK["fetch_cik_by_ticker_symbol()"]
FGAAP["fetch_us_gaap_fundamentals()"]
FNPORT1["fetch_nport_filing_by_ticker_symbol()"]
FNPORT2["fetch_nport_filing_by_cik_and_accession_number()"]
end
    
    subgraph "Data Models"
        VT["Vec&lt;Ticker&gt;"]
C["Cik"]
VCS["Vec&lt;CikSubmission&gt;"]
end
    
 
   FT --> VT
 
   VT --> FCIK
 
   VT --> FGAAP
 
   VT --> FNPORT2
    
 
   FCIK --> C
 
   C --> FCS
 
   C --> FGAAP
    
 
   FCS --> VCS
 
   VCS --> FNPORT1
    
 
   FCIK --> FNPORT1
 
   FNPORT1 --> FNPORT2

Function Dependencies and Integration

The data fetching functions form a dependency graph where some functions rely on others to complete their tasks.

Diagram: Function Dependency Graph

Common Integration Patterns:

  1. Ticker to CIK Resolution :

    • fetch_company_tickers() → fetch_cik_by_ticker_symbol() → Cik
    • Used by: NPORT filing fetch, US GAAP fundamentals fetch
  2. Submission Filtering :

    • fetch_cik_submissions() → CikSubmission::most_recent_nport_p_submission()
    • Used by: NPORT filing by ticker symbol
  3. Ticker Mapping in Results :

    • fetch_company_tickers() → parsing functions (e.g., parse_nport_xml)
    • Enriches raw data with ticker information

Example Integration (from examples):
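The example code itself is not reproduced here; a hedged sketch in the spirit of examples/lookup_cik.rs and examples/nport_filing.rs, combining the three patterns above (call shapes are assumptions):

```rust
// Hedged sketch, not the examples' verbatim code.
let client = SecClient::from_config_manager(&config_manager)?;

// Pattern 1: ticker-to-CIK resolution.
let cik = fetch_cik_by_ticker_symbol(&client, "AAPL").await?;

// Pattern 2: submission filtering.
let submissions = fetch_cik_submissions(&client, cik).await?;
let latest_nport = CikSubmission::most_recent_nport_p_submission(&submissions);

// Pattern 3: ticker mapping happens inside parse_nport_xml, which is given the
// company ticker list obtained from fetch_company_tickers().
```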

Sources: src/network/fetch_nport_filing.rs:3-5 examples/nport_filing.rs:16-22 examples/lookup_cik.rs:18-24

Error Handling

All data fetching functions return Result<T, Box<dyn Error>>, providing consistent error handling across the module. Common error scenarios include:

| Error Type | Cause | Example Functions |
| --- | --- | --- |
| Network Errors | HTTP request failures, timeouts | All functions |
| Parsing Errors | Invalid JSON/XML structure | fetch_cik_submissions, fetch_nport_filing_by_cik_and_accession_number |
| Not Found | Ticker symbol not found, no NPORT-P filing exists | fetch_nport_filing_by_ticker_symbol |
| Year Fallback Exhaustion | No data available for any year | fetch_investment_company_series_and_class_dataset |

Error Propagation:

Functions use the ? operator to propagate errors up the call stack, allowing callers to handle errors at the appropriate level. For example, in fetch_nport_filing_by_ticker_symbol:
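A hedged illustration of that pattern (not the actual function body):

```rust
// Each fallible step uses `?`, so the first failure short-circuits the function.
let cik = fetch_cik_by_ticker_symbol(sec_client, ticker_symbol).await?;
let submissions = fetch_cik_submissions(sec_client, cik).await?;
```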

If either dependency function fails, the error is immediately returned to the caller.

Sources: src/network/fetch_cik_submissions.rs:8-11 src/network/fetch_us_gaap_fundamentals.rs:12-16 src/network/fetch_nport_filing.rs:10-13


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

US GAAP Concept Transformation

Relevant source files

Purpose and Scope

This page documents the US GAAP concept transformation system, which normalizes raw financial concept names from SEC EDGAR filings into a standardized taxonomy. The core functionality is provided by the distill_us_gaap_fundamental_concepts function, which maps the diverse US GAAP terminology (57+ revenue variations, 6 cost variants, multiple equity representations) into a consistent set of 64 FundamentalConcept enum variants.

For information about fetching US GAAP data from the SEC API, see Data Fetching Functions. For details on the data models that use these concepts, see Data Models & Enumerations. For the Python ML pipeline that processes the transformed concepts, see Python narrative_stack System.


System Overview

The transformation system acts as a critical normalization layer between raw SEC EDGAR filings and downstream data processing. Companies report financial data using various US GAAP concept names (e.g., Revenues, SalesRevenueNet, HealthCareOrganizationRevenue), and this system ensures all variations map to consistent concept identifiers.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9

graph TB
    subgraph "Input Layer"
        RawFiling["SEC EDGAR Filing\nRaw JSON Data"]
RawConcepts["Raw US GAAP Concepts\n- Revenues\n- SalesRevenueNet\n- HealthCareOrganizationRevenue\n- InterestAndDividendIncomeOperating\n- 57+ more revenue types"]
end
    
    subgraph "Transformation Layer"
        DistillFn["distill_us_gaap_fundamental_concepts"]
MappingEngine["Mapping Engine\n4 Pattern Types"]
OneToOne["One-to-One\nAssets → Assets"]
Hierarchical["Hierarchical\nCurrentAssets → [CurrentAssets, Assets]"]
Synonyms["Synonyms\n6 cost types → CostOfRevenue"]
Industry["Industry-Specific\n57+ revenue types → Revenues"]
MappingEngine --> OneToOne
 
       MappingEngine --> Hierarchical
 
       MappingEngine --> Synonyms
 
       MappingEngine --> Industry
    end
    
    subgraph "Output Layer"
        FundamentalConcept["FundamentalConcept Enum\n64 standardized variants"]
BS["Balance Sheet Concepts\n- Assets, CurrentAssets\n- Liabilities, CurrentLiabilities\n- Equity, EquityAttributableToParent"]
IS["Income Statement Concepts\n- Revenues, CostOfRevenue\n- GrossProfit, OperatingIncomeLoss\n- NetIncomeLoss"]
CF["Cash Flow Concepts\n- NetCashFlowFromOperatingActivities\n- NetCashFlowFromInvestingActivities\n- NetCashFlowFromFinancingActivities"]
FundamentalConcept --> BS
 
       FundamentalConcept --> IS
 
       FundamentalConcept --> CF
    end
    
 
   RawFiling --> RawConcepts
 
   RawConcepts --> DistillFn
 
   DistillFn --> MappingEngine
 
   MappingEngine --> FundamentalConcept
    
    style DistillFn fill:#f9f9f9
    style FundamentalConcept fill:#f9f9f9

The FundamentalConcept Taxonomy

The FundamentalConcept enum defines 64 standardized financial concept variants organized into four main categories: Balance Sheet, Income Statement, Cash Flow, and Equity classifications. Each variant represents a normalized concept that may map from multiple raw US GAAP names.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275

graph TB
    subgraph "FundamentalConcept Enum"
        Root["FundamentalConcept\n(64 total variants)"]
end
    
    subgraph "Balance Sheet Concepts"
        Assets["Assets"]
CurrentAssets["CurrentAssets"]
NoncurrentAssets["NoncurrentAssets"]
Liabilities["Liabilities"]
CurrentLiabilities["CurrentLiabilities"]
NoncurrentLiabilities["NoncurrentLiabilities"]
LiabilitiesAndEquity["LiabilitiesAndEquity"]
CommitmentsAndContingencies["CommitmentsAndContingencies"]
end
    
    subgraph "Income Statement Concepts"
        Revenues["Revenues"]
RevenuesNetInterestExpense["RevenuesNetInterestExpense"]
RevenuesExcludingInterestAndDividends["RevenuesExcludingInterestAndDividends"]
InterestAndDividendIncomeOperating["InterestAndDividendIncomeOperating"]
CostOfRevenue["CostOfRevenue"]
GrossProfit["GrossProfit"]
OperatingExpenses["OperatingExpenses"]
OperatingIncomeLoss["OperatingIncomeLoss"]
NonoperatingIncomeLoss["NonoperatingIncomeLoss"]
IncomeTaxExpenseBenefit["IncomeTaxExpenseBenefit"]
NetIncomeLoss["NetIncomeLoss"]
NetIncomeLossAttributableToParent["NetIncomeLossAttributableToParent"]
NetIncomeLossAttributableToNoncontrollingInterest["NetIncomeLossAttributableToNoncontrollingInterest"]
end
    
    subgraph "Cash Flow Concepts"
        NetCashFlow["NetCashFlow"]
NetCashFlowContinuing["NetCashFlowContinuing"]
NetCashFlowDiscontinued["NetCashFlowDiscontinued"]
NetCashFlowFromOperatingActivities["NetCashFlowFromOperatingActivities"]
NetCashFlowFromInvestingActivities["NetCashFlowFromInvestingActivities"]
NetCashFlowFromFinancingActivities["NetCashFlowFromFinancingActivities"]
ExchangeGainsLosses["ExchangeGainsLosses"]
end
    
    subgraph "Equity Concepts"
        Equity["Equity"]
EquityAttributableToParent["EquityAttributableToParent"]
EquityAttributableToNoncontrollingInterest["EquityAttributableToNoncontrollingInterest"]
TemporaryEquity["TemporaryEquity"]
RedeemableNoncontrollingInterest["RedeemableNoncontrollingInterest"]
end
    
 
   Root --> Assets
 
   Root --> CurrentAssets
 
   Root --> NoncurrentAssets
 
   Root --> Liabilities
 
   Root --> CurrentLiabilities
 
   Root --> NoncurrentLiabilities
 
   Root --> LiabilitiesAndEquity
 
   Root --> CommitmentsAndContingencies
    
 
   Root --> Revenues
 
   Root --> RevenuesNetInterestExpense
 
   Root --> RevenuesExcludingInterestAndDividends
 
   Root --> InterestAndDividendIncomeOperating
 
   Root --> CostOfRevenue
 
   Root --> GrossProfit
 
   Root --> OperatingExpenses
 
   Root --> OperatingIncomeLoss
 
   Root --> NonoperatingIncomeLoss
 
   Root --> IncomeTaxExpenseBenefit
 
   Root --> NetIncomeLoss
 
   Root --> NetIncomeLossAttributableToParent
 
   Root --> NetIncomeLossAttributableToNoncontrollingInterest
    
 
   Root --> NetCashFlow
 
   Root --> NetCashFlowContinuing
 
   Root --> NetCashFlowDiscontinued
 
   Root --> NetCashFlowFromOperatingActivities
 
   Root --> NetCashFlowFromInvestingActivities
 
   Root --> NetCashFlowFromFinancingActivities
 
   Root --> ExchangeGainsLosses
    
 
   Root --> Equity
 
   Root --> EquityAttributableToParent
 
   Root --> EquityAttributableToNoncontrollingInterest
 
   Root --> TemporaryEquity
 
   Root --> RedeemableNoncontrollingInterest

Mapping Pattern Types

The transformation system implements four distinct mapping patterns to handle the diverse ways companies report financial concepts.

Pattern 1: One-to-One Mapping

Simple direct mappings where a single US GAAP concept name maps to exactly one FundamentalConcept variant.

graph LR
 
   A["Assets"] --> FA["FundamentalConcept::Assets"]
B["Liabilities"] --> FB["FundamentalConcept::Liabilities"]
C["GrossProfit"] --> FC["FundamentalConcept::GrossProfit"]
D["CommitmentsAndContingencies"] --> FD["FundamentalConcept::CommitmentsAndContingencies"]
style FA fill:#f9f9f9
    style FB fill:#f9f9f9
    style FC fill:#f9f9f9
    style FD fill:#f9f9f9
| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| Assets | vec![Assets] |
| Liabilities | vec![Liabilities] |
| GrossProfit | vec![GrossProfit] |
| OperatingIncomeLoss | vec![OperatingIncomeLoss] |
| CommitmentsAndContingencies | vec![CommitmentsAndContingencies] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:4-17 tests/distill_us_gaap_fundamental_concepts_tests.rs:31-36 tests/distill_us_gaap_fundamental_concepts_tests.rs:336-341

Pattern 2: Hierarchical Mapping

Specific concepts map to multiple variants, including both the specific concept and parent categories. This enables queries at different levels of granularity.

graph LR
    subgraph "Input"
        CA["AssetsCurrent"]
CL["LiabilitiesCurrent"]
SE["StockholdersEquity"]
end
    
    subgraph "Output: Multiple Concepts"
        CA1["FundamentalConcept::CurrentAssets"]
CA2["FundamentalConcept::Assets"]
SE1["FundamentalConcept::EquityAttributableToParent"]
SE2["FundamentalConcept::Equity"]
end
    
 
   CA --> CA1
 
   CA --> CA2
 
   SE --> SE1
 
   SE --> SE2
    
    style CA1 fill:#f9f9f9
    style CA2 fill:#f9f9f9
    style SE1 fill:#f9f9f9
    style SE2 fill:#f9f9f9
| Raw US GAAP Concept | FundamentalConcept Output (Ordered) |
| --- | --- |
| AssetsCurrent | vec![CurrentAssets, Assets] |
| StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| NetIncomeLoss | vec![NetIncomeLossAttributableToParent, NetIncomeLoss] |
| IncomeLossFromContinuingOperations | vec![IncomeLossFromContinuingOperationsAfterTax, NetIncomeLoss] |
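As a concrete illustration (a hedged paraphrase of the test style, using mappings from the tables above):

```rust
// One-to-one: a single raw concept maps to exactly one variant.
assert_eq!(
    distill_us_gaap_fundamental_concepts("Assets"),
    Some(vec![FundamentalConcept::Assets])
);

// Hierarchical: the specific concept comes first, followed by its parent category.
assert_eq!(
    distill_us_gaap_fundamental_concepts("AssetsCurrent"),
    Some(vec![FundamentalConcept::CurrentAssets, FundamentalConcept::Assets])
);
```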

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:10-17 tests/distill_us_gaap_fundamental_concepts_tests.rs:158-165 tests/distill_us_gaap_fundamental_concepts_tests.rs:262-286 tests/distill_us_gaap_fundamental_concepts_tests.rs:692-698

Pattern 3: Synonym Consolidation

Multiple US GAAP concept names that represent the same financial concept are consolidated into a single FundamentalConcept variant.

graph LR
    subgraph "Input: 6 Cost Variants"
        C1["CostOfRevenue"]
C2["CostOfGoodsAndServicesSold"]
C3["CostOfServices"]
C4["CostOfGoodsSold"]
C5["CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization"]
C6["CostOfGoodsSoldElectric"]
end
    
    subgraph "Output: Single Concept"
        CO["FundamentalConcept::CostOfRevenue"]
end
    
 
   C1 --> CO
 
   C2 --> CO
 
   C3 --> CO
 
   C4 --> CO
 
   C5 --> CO
 
   C6 --> CO
    
    style CO fill:#f9f9f9

Cost of Revenue Synonyms

| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| CostOfRevenue | vec![CostOfRevenue] |
| CostOfGoodsAndServicesSold | vec![CostOfRevenue] |
| CostOfServices | vec![CostOfRevenue] |
| CostOfGoodsSold | vec![CostOfRevenue] |
| CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization | vec![CostOfRevenue] |
| CostOfGoodsSoldElectric | vec![CostOfRevenue] |

Equity Noncontrolling Interest Synonyms

| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| MinorityInterest | vec![EquityAttributableToNoncontrollingInterest] |
| PartnersCapitalAttributableToNoncontrollingInterest | vec![EquityAttributableToNoncontrollingInterest] |
| MinorityInterestInLimitedPartnerships | vec![EquityAttributableToNoncontrollingInterest] |
| MinorityInterestInOperatingPartnerships | vec![EquityAttributableToNoncontrollingInterest] |
| MinorityInterestInJointVentures | vec![EquityAttributableToNoncontrollingInterest] |
| NonredeemableNoncontrollingInterest | vec![EquityAttributableToNoncontrollingInterest] |
| NoncontrollingInterestInVariableInterestEntity | vec![EquityAttributableToNoncontrollingInterest] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:80-112 tests/distill_us_gaap_fundamental_concepts_tests.rs:196-259

Pattern 4: Industry-Specific Revenue Mapping

The most complex pattern handles 57+ industry-specific revenue variations, mapping them all to the Revenues concept. Some revenue types also map to hierarchical categories.

graph TB
    subgraph "Industry-Specific Revenue Types"
        R1["Revenues"]
R2["SalesRevenueNet"]
R3["HealthCareOrganizationRevenue"]
R4["RealEstateRevenueNet"]
R5["OilAndGasRevenue"]
R6["FinancialServicesRevenue"]
R7["AdvertisingRevenue"]
R8["SubscriptionRevenue"]
R9["RoyaltyRevenue"]
R10["ElectricUtilityRevenue"]
R11["PassengerRevenue"]
R12["CargoAndFreightRevenue"]
R13["... 45+ more types"]
end
    
    subgraph "Special Revenue Categories"
        RH1["InterestAndDividendIncomeOperating"]
RH2["RevenuesExcludingInterestAndDividends"]
RH3["RevenuesNetOfInterestExpense"]
RH4["InvestmentBankingRevenue"]
end
    
    subgraph "Output Concepts"
        Rev["FundamentalConcept::Revenues"]
RevInt["FundamentalConcept::InterestAndDividendIncomeOperating"]
RevExcl["FundamentalConcept::RevenuesExcludingInterestAndDividends"]
RevNet["FundamentalConcept::RevenuesNetInterestExpense"]
end
    
 
   R1 --> Rev
 
   R2 --> Rev
 
   R3 --> Rev
 
   R4 --> Rev
 
   R5 --> Rev
 
   R6 --> Rev
 
   R7 --> Rev
 
   R8 --> Rev
 
   R9 --> Rev
 
   R10 --> Rev
 
   R11 --> Rev
 
   R12 --> Rev
 
   R13 --> Rev
    
 
   RH1 --> RevInt
 
   RH1 --> Rev
 
   RH2 --> RevExcl
 
   RH2 --> Rev
 
   RH3 --> RevNet
 
   RH3 --> Rev
 
   RH4 --> RevExcl
 
   RH4 --> Rev
    
    style Rev fill:#f9f9f9
    style RevInt fill:#f9f9f9
    style RevExcl fill:#f9f9f9
    style RevNet fill:#f9f9f9

Sample Revenue Mappings

| Industry | Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- | --- |
| General | Revenues | vec![Revenues] |
| General | SalesRevenueNet | vec![Revenues] |
| Healthcare | HealthCareOrganizationRevenue | vec![Revenues] |
| Real Estate | RealEstateRevenueNet | vec![Revenues] |
| Energy | OilAndGasRevenue | vec![Revenues] |
| Mining | RevenueMineralSales | vec![Revenues] |
| Hospitality | RevenueFromLeasedAndOwnedHotels | vec![Revenues] |
| Franchise | FranchisorRevenue | vec![Revenues] |
| Media | SubscriptionRevenue | vec![Revenues] |
| Media | AdvertisingRevenue | vec![Revenues] |
| Entertainment | AdmissionsRevenue | vec![Revenues] |
| Licensing | LicensesRevenue | vec![Revenues] |
| Licensing | RoyaltyRevenue | vec![Revenues] |
| Transportation | PassengerRevenue | vec![Revenues] |
| Transportation | CargoAndFreightRevenue | vec![Revenues] |
| Utilities | ElectricUtilityRevenue | vec![Revenues] |
| Financial | InterestAndDividendIncomeOperating | vec![InterestAndDividendIncomeOperating, Revenues] |
| Financial | InvestmentBankingRevenue | vec![RevenuesExcludingInterestAndDividends, Revenues] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:954-1188 tests/distill_us_gaap_fundamental_concepts_tests.rs:1191-1225


The distill_us_gaap_fundamental_concepts Function

The core transformation function accepts a string representation of a US GAAP concept name and returns an Option<Vec<FundamentalConcept>>. The return type is an Option because not all US GAAP concepts are mapped (unmapped concepts return None), and a Vec because some concepts map to multiple standardized variants (hierarchical pattern).

graph TB
    subgraph "Function Input"
        InputStr["Input: &str\nUS GAAP concept name"]
end
    
    subgraph "distill_us_gaap_fundamental_concepts"
        Parse["Parse input string"]
MatchEngine["Pattern Matching Engine"]
Decision{"Concept\nRecognized?"}
BuildVec["Construct Vec<FundamentalConcept>\nApply mapping pattern"]
ReturnSome["Return Some(Vec<FundamentalConcept>)"]
ReturnNone["Return None"]
end
    
    subgraph "Function Output"
        OutputSome["Option::Some(Vec<FundamentalConcept>)\n1-2 variants typically"]
OutputNone["Option::None\nUnrecognized concept"]
end
    
 
   InputStr --> Parse
 
   Parse --> MatchEngine
 
   MatchEngine --> Decision
 
   Decision -->|Yes| BuildVec
 
   Decision -->|No| ReturnNone
 
   BuildVec --> ReturnSome
 
   ReturnSome --> OutputSome
 
   ReturnNone --> OutputNone
    
    style MatchEngine fill:#f9f9f9
    style ReturnSome fill:#f9f9f9

Function Signature and Flow
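The signature is not reproduced in this export; based on the described input and the Option<Vec<FundamentalConcept>> return type, it plausibly takes the form:

```rust
// Hedged sketch: inferred from the documented input and output types.
pub fn distill_us_gaap_fundamental_concepts(concept: &str) -> Option<Vec<FundamentalConcept>>
```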

Example Usage
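A usage sketch in the spirit of examples/us_gaap_human_readable.rs (illustrative, not the example's actual code):

```rust
match distill_us_gaap_fundamental_concepts("SalesRevenueNet") {
    Some(concepts) => {
        // Industry-specific revenue names collapse to the standardized Revenues concept.
        for concept in concepts {
            println!("{concept}"); // prints "Revenues"
        }
    }
    None => println!("Concept is not part of the fundamental taxonomy"),
}
```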

Sources: examples/us_gaap_human_readable.rs:1-9 tests/distill_us_gaap_fundamental_concepts_tests.rs:4-8


Complex Hierarchical Examples

Some concepts demonstrate multi-level hierarchical relationships where a specific concept may relate to multiple parent categories.

graph TD
    subgraph "Net Income Loss Hierarchy"
        NIL1["NetIncomeLoss\n(raw input)"]
NIL2["FundamentalConcept::NetIncomeLossAttributableToParent"]
NIL3["FundamentalConcept::NetIncomeLoss"]
NILCS["NetIncomeLossAvailableToCommonStockholdersBasic\n(raw input)"]
NILCS1["FundamentalConcept::NetIncomeLossAvailableToCommonStockholdersBasic"]
NILCS2["FundamentalConcept::NetIncomeLoss"]
ILCO1["IncomeLossFromContinuingOperations\n(raw input)"]
ILCO2["FundamentalConcept::IncomeLossFromContinuingOperationsAfterTax"]
ILCO3["FundamentalConcept::NetIncomeLoss"]
NIL1 --> NIL2
 
       NIL1 --> NIL3
 
       NILCS --> NILCS1
 
       NILCS --> NILCS2
 
       ILCO1 --> ILCO2
 
       ILCO1 --> ILCO3
    end
    
    subgraph "Comprehensive Income Hierarchy"
        CI1["ComprehensiveIncomeNetOfTax\n(raw input)"]
CI2["FundamentalConcept::ComprehensiveIncomeLossAttributableToParent"]
CI3["FundamentalConcept::ComprehensiveIncomeLoss"]
CI1 --> CI2
 
       CI1 --> CI3
    end
    
    subgraph "Revenue Hierarchy"
        RNOI["RevenuesNetOfInterestExpense\n(raw input)"]
RNOI1["FundamentalConcept::RevenuesNetInterestExpense"]
RNOI2["FundamentalConcept::Revenues"]
RNOI --> RNOI1
 
       RNOI --> RNOI2
    end
    
    style NIL2 fill:#f9f9f9
    style NIL3 fill:#f9f9f9
    style NILCS1 fill:#f9f9f9
    style NILCS2 fill:#f9f9f9
    style ILCO2 fill:#f9f9f9
    style ILCO3 fill:#f9f9f9
    style CI2 fill:#f9f9f9
    style CI3 fill:#f9f9f9
    style RNOI1 fill:#f9f9f9
    style RNOI2 fill:#f9f9f9

Income Statement Hierarchies

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:686-725 tests/distill_us_gaap_fundamental_concepts_tests.rs:39-77 tests/distill_us_gaap_fundamental_concepts_tests.rs:976-982

Cash Flow Hierarchies

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:550-683


Equity Concept Mappings

Equity concepts demonstrate all four mapping patterns, including complex organizational structures (stockholders vs. partners vs. members) and noncontrolling interest variations.

Equity Structure Overview

| Organizational Type | Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- | --- |
| Corporation (total) | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | vec![Equity] |
| Corporation (parent) | StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| Partnership (total) | PartnersCapitalIncludingPortionAttributableToNoncontrollingInterest | vec![Equity] |
| Partnership (parent) | PartnersCapital | vec![EquityAttributableToParent, Equity] |
| LLC (parent) | MembersEquity | vec![EquityAttributableToParent, Equity] |
| Generic | CommonStockholdersEquity | vec![Equity] |

Temporary Equity Variations

| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| TemporaryEquityCarryingAmount | vec![TemporaryEquity] |
| TemporaryEquityRedemptionValue | vec![TemporaryEquity] |
| RedeemablePreferredStockCarryingAmount | vec![TemporaryEquity] |
| TemporaryEquityCarryingAmountAttributableToParent | vec![TemporaryEquity] |
| TemporaryEquityCarryingAmountAttributableToNoncontrollingInterest | vec![TemporaryEquity] |
| TemporaryEquityLiquidationPreference | vec![TemporaryEquity] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:150-193 tests/distill_us_gaap_fundamental_concepts_tests.rs:262-286 tests/distill_us_gaap_fundamental_concepts_tests.rs:1228-1274


graph TB
    subgraph "Test Structure"
        TestFile["distill_us_gaap_fundamental_concepts_tests.rs\n1275 lines, 70+ tests"]
BalanceSheet["Balance Sheet Tests\n- Assets (one-to-one)\n- Current Assets (hierarchical)\n- Equity (multiple org types)\n- Temporary Equity (7 variations)"]
IncomeStatement["Income Statement Tests\n- Revenues (57+ variations)\n- Cost of Revenue (6 synonyms)\n- Net Income (hierarchical)\n- Comprehensive Income (hierarchical)"]
CashFlow["Cash Flow Tests\n- Operating Activities\n- Investing Activities\n- Financing Activities\n- Continuing vs Discontinued"]
Special["Special Category Tests\n- Commitments and Contingencies\n- Nature of Operations\n- Exchange Gains/Losses\n- Research and Development"]
TestFile --> BalanceSheet
 
       TestFile --> IncomeStatement
 
       TestFile --> CashFlow
 
       TestFile --> Special
    end

Testing Strategy

The transformation system has comprehensive test coverage with 70+ test functions covering all 64 FundamentalConcept variants and their various input mappings.

Test Organization

Test Pattern Examples

Each test function validates one or more related mappings:
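A representative shape, hedged (the real tests live in tests/distill_us_gaap_fundamental_concepts_tests.rs; names and grouping may differ):

```rust
#[test]
fn consolidates_cost_of_revenue_synonyms() {
    // All documented cost variants collapse to the single CostOfRevenue concept.
    for raw_concept in ["CostOfGoodsAndServicesSold", "CostOfServices", "CostOfGoodsSold"] {
        assert_eq!(
            distill_us_gaap_fundamental_concepts(raw_concept),
            Some(vec![FundamentalConcept::CostOfRevenue])
        );
    }
}
```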

Test Coverage Summary

| Concept Category | Test Functions | Total Assertions | Pattern Types Tested |
| --- | --- | --- | --- |
| Balance Sheet | 15 | 35+ | All 4 patterns |
| Income Statement | 25 | 80+ | All 4 patterns |
| Cash Flow | 18 | 25+ | Hierarchical, One-to-One |
| Equity | 12 | 40+ | All 4 patterns |
| Total | 70+ | 180+ | All 4 patterns |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275


graph TB
    subgraph "Data Source"
        SECAPI["SEC EDGAR API\ndata endpoint"]
RawJSON["Raw JSON Response\nCompany Facts"]
end
    
    subgraph "Rust Processing Pipeline"
        FetchFn["fetch_us_gaap_fundamentals\n(network module)"]
ParseJSON["Parse JSON\nExtract concept names"]
DistillFn["distill_us_gaap_fundamental_concepts\n(transformers module)"]
FilterMapped["Filter to mapped concepts\nDiscard None results"]
BuildRecords["Build data records\nwith FundamentalConcept enum"]
end
    
    subgraph "Storage Layer"
        CSVOutput["CSV Files\nus-gaap/[ticker].csv"]
PolarsDF["Polars DataFrame\nStructured columns"]
end
    
    subgraph "Python ML Pipeline"
        Ingestion["Data Ingestion\nus_gaap_store.ingest_us_gaap_csvs"]
Preprocessing["Preprocessing\nConcept/unit pair extraction"]
end
    
 
   SECAPI --> RawJSON
 
   RawJSON --> FetchFn
 
   FetchFn --> ParseJSON
 
   ParseJSON --> DistillFn
 
   DistillFn --> FilterMapped
 
   FilterMapped --> BuildRecords
 
   BuildRecords --> PolarsDF
 
   PolarsDF --> CSVOutput
 
   CSVOutput --> Ingestion
 
   Ingestion --> Preprocessing
    
    style DistillFn fill:#f9f9f9
    style FilterMapped fill:#f9f9f9

Integration with Data Pipeline

The distill_us_gaap_fundamental_concepts function is a critical component in the broader data processing pipeline, serving as the normalization layer between raw SEC filings and structured data storage.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9


Summary

The US GAAP concept transformation system provides:

  1. Standardization : Maps 200+ raw US GAAP concept names to 64 standardized FundamentalConcept variants
  2. Flexibility : Supports four mapping patterns (one-to-one, hierarchical, synonyms, industry-specific) to handle diverse reporting practices
  3. Queryability : Hierarchical mappings enable queries at multiple granularity levels (e.g., query for all Assets or specifically CurrentAssets)
  4. Reliability : Comprehensive test coverage with 70+ test functions and 180+ assertions validates all mapping patterns
  5. Integration : Serves as critical normalization layer between SEC EDGAR API and downstream data processing/ML pipelines

The transformation system represents the highest-importance component (8.37) in the Rust codebase, enabling consistent financial data analysis across companies with varying reporting conventions.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Data Models & Enumerations

Relevant source files

Purpose and Scope

This page documents the core data structures and enumerations used throughout the rust-sec-fetcher application. These models represent SEC financial data, including company identifiers, filing metadata, investment holdings, and financial concepts. The data models are defined in src/models.rs:1-18 and src/enums.rs:1-12

For information about how these models are used in data fetching operations, see Data Fetching Functions. For details on how FundamentalConcept is used in the transformation pipeline, see US GAAP Concept Transformation.

Sources: src/models.rs:1-18 src/enums.rs:1-12


SEC Identifier Models

The system uses three primary identifier types to reference companies and filings within the SEC EDGAR system.

Ticker

The Ticker struct represents a company's stock ticker symbol along with its SEC identifiers. It is returned by the fetch_company_tickers function and serves as the primary entry point for company lookups.

Structure:

  • cik: Cik - The company's Central Index Key
  • ticker_symbol: String - Stock ticker symbol (e.g., "AAPL")
  • company_name: String - Full company name
  • origin: TickerOrigin - Source of the ticker data
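Assembled from the fields above, the struct plausibly looks like this (a hedged sketch; derives and field order are assumptions):

```rust
// Hedged sketch of the Ticker model.
pub struct Ticker {
    pub cik: Cik,
    pub ticker_symbol: String,
    pub company_name: String,
    pub origin: TickerOrigin,
}
```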

Sources: src/models.rs:10-11

Cik (Central Index Key)

The Cik struct represents a 10-digit SEC identifier that uniquely identifies a company or entity. CIKs are zero-padded to exactly 10 digits.

Structure:

  • value: u64 - The numeric CIK value (0 to 9,999,999,999)

Validation:

  • CIK values must not exceed 10 digits
  • Values are zero-padded when formatted (e.g., 123 → "0000000123")
  • Parsing from strings handles both padded and unpadded formats

Error Handling:

  • CikError::InvalidCik - Raised when CIK exceeds 10 digits
  • CikError::ParseError - Raised when parsing fails

Sources: src/models.rs:4-5

AccessionNumber

The AccessionNumber struct represents a unique identifier for SEC filings. Each accession number is exactly 18 digits and encodes the filer's CIK, filing year, and sequence number.

Format: XXXXXXXXXX-YY-NNNNNN

Components:

  • CIK (10 digits) - The SEC identifier of the filer
  • Year (2 digits) - Last two digits of the filing year
  • Sequence (6 digits) - Unique sequence number within the year

Example: "0001234567-23-000045" represents:

  • CIK: 0001234567
  • Year: 2023
  • Sequence: 000045

Key Methods:

  • from_str(accession_str: &str) - Parses from string (with or without dashes)
  • from_parts(cik_u64: u64, year: u16, sequence: u32) - Constructs from components
  • to_string() - Returns dash-separated format
  • to_unformatted_string() - Returns plain 18-digit string
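A hedged round-trip example using the documented format and methods (assumes from_str is exposed via the standard FromStr trait):

```rust
use std::str::FromStr;

// Hedged sketch: parse, then re-emit in both documented formats.
let accession = AccessionNumber::from_str("0001234567-23-000045")?;
assert_eq!(accession.to_string(), "0001234567-23-000045");
assert_eq!(accession.to_unformatted_string(), "000123456723000045");
```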

Sources: src/models/accession_number.rs:1-188 src/models.rs:1-3

SEC Identifier Relationships

Sources: src/models.rs:1-18 src/models/accession_number.rs:35-187


Filing Data Structures

CikSubmission

The CikSubmission struct represents metadata about an SEC filing submission. It is returned by the fetch_cik_submissions function.

Key Fields:

  • cik: Cik - The filer's Central Index Key
  • accession_number: AccessionNumber - Unique filing identifier
  • form: String - Filing form type (e.g., "10-K", "10-Q", "NPORT-P")
  • primary_document: String - Main document filename
  • filing_date: String - Date the filing was submitted

Sources: src/models.rs:7-8

NportInvestment

The NportInvestment struct represents a single investment holding from an NPORT-P filing (monthly portfolio holdings report for registered investment companies).

Mapped Fields (linked to company data):

  • mapped_ticker_symbol: Option<String> - Ticker symbol if matched
  • mapped_company_name: Option<String> - Company name if matched
  • mapped_company_cik_number: Option<String> - CIK if matched

Investment Identifiers:

  • name: String - Investment name
  • lei: String - Legal Entity Identifier
  • cusip: String - Committee on Uniform Securities Identification Procedures ID
  • isin: String - International Securities Identification Number
  • title: String - Investment title

Financial Values:

  • balance: Decimal - Number of shares or units held
  • val_usd: Decimal - Value in USD
  • pct_val: Decimal - Percentage of total portfolio value
  • cur_cd: String - Currency code

Classifications:

  • asset_cat: String - Asset category
  • issuer_cat: String - Issuer category
  • payoff_profile: String - Payoff profile type
  • inv_country: String - Investment country

Utility Methods:

  • sort_by_pct_val_desc(investments: &mut Vec<NportInvestment>) - Sorts holdings by percentage value in descending order

Sources: src/models/nport_investment.rs:1-46 src/models.rs:16-17

InvestmentCompany

The InvestmentCompany struct represents an investment company (mutual fund, ETF, etc.) registered with the SEC. It is returned by the fetch_investment_companies function.

Sources: src/models.rs:13-14

Filing Data Structure Relationships

Sources: src/models.rs:1-18 src/models/nport_investment.rs:8-46


FundamentalConcept Enumeration

The FundamentalConcept enum is the most critical enumeration in the system (importance: 8.37), defining 64 standardized financial concepts derived from US GAAP (Generally Accepted Accounting Principles) taxonomies. These concepts are used by the distill_us_gaap_fundamental_concepts transformer to normalize diverse financial reporting variations into a consistent taxonomy.

Definition: src/enums/fundamental_concept_enum.rs:1-72

Traits Implemented:

  • Eq, PartialEq - Equality comparison
  • Hash - Hashable for use in maps
  • Clone - Cloneable
  • EnumString - Parse from string
  • EnumIter - Iterate over all variants
  • Display - Format as string
  • Debug - Debug formatting
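Because of the EnumString and Display implementations, string round-trips look roughly like this (hedged sketch assuming strum's default variant naming):

```rust
use std::str::FromStr;

// Hedged sketch: parse a concept name and format it back.
let concept = FundamentalConcept::from_str("GrossProfit")?;
assert_eq!(concept, FundamentalConcept::GrossProfit);
assert_eq!(concept.to_string(), "GrossProfit");
```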

Sources: src/enums/fundamental_concept_enum.rs:1-5 src/enums.rs:4-5

Concept Categories

The 64 FundamentalConcept variants are organized into four primary categories corresponding to major financial statement types:

| Category | Concept Count | Description |
| --- | --- | --- |
| Balance Sheet | 13 | Assets, liabilities, equity positions |
| Income Statement | 23 | Revenues, expenses, income/loss measures |
| Cash Flow Statement | 13 | Operating, investing, financing cash flows |
| Equity & Comprehensive Income | 6 | Equity attributions and comprehensive income |
| Other | 9 | Special items, metadata, and miscellaneous |

Sources: src/enums/fundamental_concept_enum.rs:4-72

Complete FundamentalConcept Taxonomy

Balance Sheet Concepts

| Variant | Description |
| --- | --- |
| Assets | Total assets |
| CurrentAssets | Assets expected to be converted to cash within one year |
| NoncurrentAssets | Long-term assets |
| Liabilities | Total liabilities |
| CurrentLiabilities | Obligations due within one year |
| NoncurrentLiabilities | Long-term obligations |
| LiabilitiesAndEquity | Total liabilities plus equity (must equal total assets) |
| Equity | Total shareholder equity |
| EquityAttributableToParent | Equity attributable to parent company shareholders |
| EquityAttributableToNoncontrollingInterest | Equity attributable to minority shareholders |
| TemporaryEquity | Mezzanine equity (e.g., redeemable preferred stock) |
| RedeemableNoncontrollingInterest | Redeemable minority interest |
| CommitmentsAndContingencies | Off-balance-sheet obligations |

Income Statement Concepts

| Variant | Description |
| --- | --- |
| IncomeStatementStartPeriodYearToDate | Statement start period marker |
| Revenues | Total revenues (consolidated from 57+ variations) |
| RevenuesExcludingInterestAndDividends | Non-interest revenues |
| RevenuesNetInterestExpense | Revenues after interest expense |
| CostOfRevenue | Direct costs of producing goods/services |
| GrossProfit | Revenue minus cost of revenue |
| OperatingExpenses | Operating expenses excluding COGS |
| ResearchAndDevelopment | R&D expenses |
| CostsAndExpenses | Total costs and expenses |
| BenefitsCostsExpenses | Employee benefits and related costs |
| OperatingIncomeLoss | Income from operations |
| NonoperatingIncomeLoss | Income from non-operating activities |
| OtherOperatingIncomeExpenses | Other operating items |
| IncomeLossBeforeEquityMethodInvestments | Income before equity investments |
| IncomeLossFromEquityMethodInvestments | Income from equity-method investees |
| IncomeLossFromContinuingOperationsBeforeTax | Pre-tax income from continuing operations |
| IncomeTaxExpenseBenefit | Income tax expense or benefit |
| IncomeLossFromContinuingOperationsAfterTax | After-tax income from continuing operations |
| IncomeLossFromDiscontinuedOperationsNetOfTax | After-tax income from discontinued operations |
| ExtraordinaryItemsOfIncomeExpenseNetOfTax | Extraordinary items (after-tax) |
| NetIncomeLoss | Bottom-line net income or loss |
| NetIncomeLossAttributableToParent | Net income attributable to parent shareholders |
| NetIncomeLossAttributableToNoncontrollingInterest | Net income attributable to minority shareholders |
| NetIncomeLossAvailableToCommonStockholdersBasic | Net income available to common shareholders |
| PreferredStockDividendsAndOtherAdjustments | Preferred dividends and adjustments |

Industry-Specific Income Statement Concepts

| Variant | Description |
| --- | --- |
| InterestAndDividendIncomeOperating | Interest and dividend income (financial institutions) |
| InterestExpenseOperating | Interest expense (financial institutions) |
| InterestIncomeExpenseOperatingNet | Net interest income (banks) |
| InterestAndDebtExpense | Total interest and debt expense |
| InterestIncomeExpenseAfterProvisionForLosses | Interest income after loan loss provisions |
| ProvisionForLoanLeaseAndOtherLosses | Provision for credit losses (banks) |
| NoninterestIncome | Non-interest income (financial institutions) |
| NoninterestExpense | Non-interest expense (financial institutions) |

Cash Flow Statement Concepts

| Variant | Description |
| --- | --- |
| NetCashFlow | Total net cash flow |
| NetCashFlowContinuing | Net cash flow from continuing operations |
| NetCashFlowDiscontinued | Net cash flow from discontinued operations |
| NetCashFlowFromOperatingActivities | Cash from operating activities |
| NetCashFlowFromOperatingActivitiesContinuing | Operating cash flow (continuing) |
| NetCashFlowFromOperatingActivitiesDiscontinued | Operating cash flow (discontinued) |
| NetCashFlowFromInvestingActivities | Cash from investing activities |
| NetCashFlowFromInvestingActivitiesContinuing | Investing cash flow (continuing) |
| NetCashFlowFromInvestingActivitiesDiscontinued | Investing cash flow (discontinued) |
| NetCashFlowFromFinancingActivities | Cash from financing activities |
| NetCashFlowFromFinancingActivitiesContinuing | Financing cash flow (continuing) |
| NetCashFlowFromFinancingActivitiesDiscontinued | Financing cash flow (discontinued) |
| ExchangeGainsLosses | Foreign exchange gains/losses |

Equity & Comprehensive Income Concepts

| Variant | Description |
| --- | --- |
| ComprehensiveIncomeLoss | Total comprehensive income |
| ComprehensiveIncomeLossAttributableToParent | Comprehensive income attributable to parent |
| ComprehensiveIncomeLossAttributableToNoncontrollingInterest | Comprehensive income attributable to minorities |
| OtherComprehensiveIncomeLoss | Other comprehensive income (OCI) |

Other Concepts

| Variant | Description |
| --- | --- |
| NatureOfOperations | Description of business operations |
| GainLossOnSalePropertiesNetTax | Gains/losses on property sales (after-tax) |

Sources: src/enums/fundamental_concept_enum.rs:4-72

FundamentalConcept Taxonomy Diagram

Sources: src/enums/fundamental_concept_enum.rs:4-72


Other Enumerations

CacheNamespacePrefix

The CacheNamespacePrefix enum defines namespace prefixes used by the caching system to organize cached data by type. Each prefix corresponds to a specific data fetching operation.

Usage: Cache keys are constructed by combining a namespace prefix with a specific identifier (e.g., CacheNamespacePrefix::CompanyTickers + ticker_symbol).

Sources: src/enums.rs:1-2

TickerOrigin

The TickerOrigin enum indicates the source or origin of a ticker symbol.

Variants:

  • Different ticker sources (e.g., SEC company tickers API, NPORT filings, manual mapping)

Usage: Stored in the Ticker struct to track data provenance.

Sources: src/enums.rs:7-8

Url

The Url enum defines the various SEC EDGAR API endpoints used by the application. Each variant represents a specific API URL pattern.

Usage: Used by the network layer to construct API requests without hardcoding URLs throughout the codebase.

Sources: src/enums.rs:10-11

Enumeration Module Structure

Sources: src/enums.rs:1-12


graph TB
    subgraph Identifiers["SEC Identifier Models"]
Cik["Cik\nvalue: u64"]
Ticker["Ticker\ncik: Cik\nticker_symbol: String\ncompany_name: String\norigin: TickerOrigin"]
AccessionNumber["AccessionNumber\ncik: Cik\nyear: u16\nsequence: u32"]
end
    
    subgraph Filings["Filing Data Models"]
CikSubmission["CikSubmission\ncik: Cik\naccession_number: AccessionNumber\nform: String\nprimary_document: String\nfiling_date: String"]
InvestmentCompany["InvestmentCompany\n(identified by Cik)"]
NportInvestment["NportInvestment\nmapped_ticker_symbol: Option&lt;String&gt;\nmapped_company_name: Option&lt;String&gt;\nmapped_company_cik_number: Option&lt;String&gt;\nname, lei, cusip, isin\nbalance, val_usd, pct_val"]
end
    
    subgraph Enums["Enumerations"]
FundamentalConcept["FundamentalConcept\n64 variants\n(Balance Sheet, Income Statement,\nCash Flow, Equity)"]
TickerOrigin["TickerOrigin\n(ticker source/origin)"]
CacheNamespacePrefix["CacheNamespacePrefix\n(cache organization)"]
UrlEnum["Url\n(SEC EDGAR endpoints)"]
end
    
    subgraph NetworkFunctions["Network Functions (3.2)"]
FetchTickers["fetch_company_tickers\n→ Vec&lt;Ticker&gt;"]
FetchCIK["fetch_cik_by_ticker_symbol\n→ Option&lt;Cik&gt;"]
FetchSubmissions["fetch_cik_submissions\n→ Vec&lt;CikSubmission&gt;"]
FetchNPORT["fetch_nport_filing\n→ Vec&lt;NportInvestment&gt;"]
FetchInvCo["fetch_investment_companies\n→ Vec&lt;InvestmentCompany&gt;"]
end
    
    subgraph Transformer["Transformation (3.3)"]
Distill["distill_us_gaap_fundamental_concepts\nuses FundamentalConcept"]
end
    
 
   Ticker -->|contains| Cik
 
   Ticker -->|uses| TickerOrigin
 
   AccessionNumber -->|contains| Cik
 
   CikSubmission -->|contains| Cik
 
   CikSubmission -->|contains| AccessionNumber
 
   InvestmentCompany -.->|identified by| Cik
 
   NportInvestment -.->|mapped to| Ticker
    
 
   FetchTickers -.->|returns| Ticker
 
   FetchCIK -.->|returns| Cik
 
   FetchSubmissions -.->|returns| CikSubmission
 
   FetchNPORT -.->|returns| NportInvestment
 
   FetchInvCo -.->|returns| InvestmentCompany
    
 
   Distill -.->|uses| FundamentalConcept

Data Model Relationships

The following diagram illustrates how the core data models relate to each other and which models contain or reference other models.

Sources: src/models.rs:1-18 src/enums.rs:1-12


Implementation Details

Error Handling

Both Cik and AccessionNumber implement custom error types to handle parsing and validation failures:

CikError:

  • InvalidCik - CIK exceeds 10 digits
  • ParseError - Parsing from string failed

AccessionNumberError:

  • InvalidLength - Accession number is not 18 digits
  • ParseError - Parsing failed
  • CikError - Wrapped CIK error

Both error types implement std::fmt::Display and std::error::Error for proper error propagation.

Sources: src/models/accession_number.rs:42-76

Serialization

The NportInvestment struct uses serde for serialization/deserialization with serde_with for custom formatting:

  • #[serde_as(as = "DisplayFromStr")] - Applied to Decimal fields (balance, val_usd, pct_val) to serialize them as strings
  • Option<T> fields are used for nullable data from NPORT filings

Sources: src/models/nport_investment.rs:3-38

Decimal Precision

Financial values in NportInvestment use the rust_decimal::Decimal type, which provides:

  • Arbitrary precision decimal arithmetic
  • No floating-point rounding errors
  • Safe comparison operations

Sources: src/models/nport_investment.rs:2-33


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Caching & Storage System

Relevant source files

Purpose and Scope

This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive. For information about network request throttling and retry logic, see Network Layer & SecClient. For database storage used by the Python components, see Database & Storage Integration.

Overview

The caching system provides two distinct cache layers:

  1. HTTP Cache : Stores raw HTTP responses from SEC EDGAR API requests
  2. Preprocessor Cache : Stores transformed and processed data structures

Both caches use the simd-r-drive key-value storage backend with persistent file-based storage, initialized once at application startup using the OnceLock pattern for thread-safe static access.

Sources: src/caches.rs:1-66


Caching Architecture

The following diagram illustrates the complete caching architecture and how it integrates with the network layer:

Cache Initialization Flow

graph TB
    subgraph "Application Initialization"
        Main["main.rs"]
ConfigMgr["ConfigManager"]
CachesInit["Caches::init(config_manager)"]
end
    
    subgraph "Static Cache Storage (OnceLock)"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nOnceLock&lt;Arc&lt;DataStore&gt;&gt;"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\nOnceLock&lt;Arc&lt;DataStore&gt;&gt;"]
end
    
    subgraph "File System"
        HTTPFile["cache_base_dir/http_storage_cache.bin"]
PreFile["cache_base_dir/preprocessor_cache.bin"]
end
    
    subgraph "Network Layer"
        SecClient["SecClient"]
CacheMiddleware["reqwest_drive\nCache Middleware"]
ThrottleMiddleware["Throttle Middleware"]
CachePolicy["CachePolicy\nTTL: 1 week\nrespect_headers: false"]
end
    
    subgraph "Cache Access"
        GetHTTP["Caches::get_http_cache_store()"]
GetPre["Caches::get_preprocessor_cache()"]
end
    
 
   Main --> ConfigMgr
 
   ConfigMgr --> CachesInit
 
   CachesInit --> HTTPCache
 
   CachesInit --> PreCache
    
 
   HTTPCache -.->|DataStore::open| HTTPFile
 
   PreCache -.->|DataStore::open| PreFile
    
 
   SecClient --> GetHTTP
 
   GetHTTP --> HTTPCache
 
   HTTPCache --> CacheMiddleware
 
   CacheMiddleware --> CachePolicy
 
   CacheMiddleware --> ThrottleMiddleware
    
 
   GetPre --> PreCache

The caching system is initialized early in the application lifecycle using a configuration-driven approach that ensures thread-safe access across the async runtime.

Sources: src/caches.rs:1-66 src/network/sec_client.rs:1-181


Cache Initialization System

OnceLock Pattern

The system uses Rust's OnceLock for lazy, thread-safe initialization of static cache instances:
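A sketch matching the names shown in the architecture diagram above (assumed to live in src/caches.rs; the import path is an assumption):

```rust
use std::sync::{Arc, OnceLock};
use simd_r_drive::DataStore; // assumed import path

// Hedged sketch of the two static cache slots.
static SIMD_R_DRIVE_HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
static SIMD_R_DRIVE_PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
```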

This pattern ensures:

  • Single initialization : Each cache is initialized exactly once
  • Thread safety : Safe access from multiple async tasks
  • Zero overhead after init : No locking required for read access after initialization

Sources: src/caches.rs:7-8

Initialization Process

Initialization Code Flow

The Caches::init() method performs the following steps:

  1. Retrieves cache_base_dir from ConfigManager
  2. Constructs paths for both cache files
  3. Opens each DataStore instance
  4. Sets the OnceLock with Arc-wrapped stores
  5. Logs warnings if already initialized (prevents re-initialization panics)

Sources: src/caches.rs:13-48


HTTP Cache Integration

graph LR
    subgraph "SecClient Initialization"
        FromConfig["SecClient::from_config_manager()"]
GetHTTPCache["Caches::get_http_cache_store()"]
CreatePolicy["CachePolicy::new()"]
CreateThrottle["ThrottlePolicy::new()"]
end
    
    subgraph "Middleware Chain"
        InitDrive["init_cache_with_drive_and_throttle()"]
DriveCache["drive_cache middleware"]
ThrottleCache["throttle_cache middleware"]
InitClient["init_client_with_cache_and_throttle()"]
end
    
    subgraph "HTTP Client"
        ClientWithMiddleware["ClientWithMiddleware"]
end
    
 
   FromConfig --> GetHTTPCache
 
   FromConfig --> CreatePolicy
 
   FromConfig --> CreateThrottle
    
 
   GetHTTPCache --> InitDrive
 
   CreatePolicy --> InitDrive
 
   CreateThrottle --> InitDrive
    
 
   InitDrive --> DriveCache
 
   InitDrive --> ThrottleCache
    
 
   DriveCache --> InitClient
 
   ThrottleCache --> InitClient
    
 
   InitClient --> ClientWithMiddleware

SecClient Cache Setup

The SecClient integrates with the HTTP cache through the reqwest_drive middleware system:

SecClient Construction

The SecClient struct maintains references to both cache and throttle policies:

| Field | Type | Purpose |
| --- | --- | --- |
| email | String | User-Agent identification |
| http_client | ClientWithMiddleware | HTTP client with middleware chain |
| cache_policy | Arc<CachePolicy> | Cache configuration settings |
| throttle_policy | Arc<ThrottlePolicy> | Request throttling configuration |

Sources: src/network/sec_client.rs:14-19 src/network/sec_client.rs:73-81

CachePolicy Configuration

The CachePolicy struct defines cache behavior:

| Parameter | Value | Description |
| --- | --- | --- |
| default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week time-to-live |
| respect_headers | false | Ignore HTTP cache headers from SEC API |
| cache_status_override | None | No custom status code handling |

The 1-week TTL is currently hardcoded but marked for future configurability.

Sources: src/network/sec_client.rs:45-50


Cache Access Patterns

HTTP Cache Access Flow

Request Processing

When SecClient makes a request:

  1. The fetch_json() method calls raw_request() with the target URL
  2. The middleware chain intercepts the request
  3. Cache middleware checks the DataStore for a matching entry
  4. If found and not expired (TTL check), returns cached response
  5. If not found, forwards to throttle middleware and HTTP client
  6. Response is stored in cache with timestamp before returning

Sources: src/network/sec_client.rs:140-179 tests/sec_client_tests.rs:35-62

Preprocessor Cache Access

The preprocessor cache is accessed directly through the Caches module API:
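For example (a minimal sketch; assumes Caches::init() has already run):

```rust
// Hedged sketch: obtain the shared preprocessor DataStore handle.
let preprocessor_cache = Caches::get_preprocessor_cache();
```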

This cache is intended for storing transformed data structures after processing, though the current codebase primarily uses it as infrastructure for future preprocessing optimization.

Sources: src/caches.rs:59-64


Storage Backend: simd-r-drive

DataStore Integration

The simd-r-drive crate provides the persistent key-value storage backend:

DataStore Characteristics

| Feature | Description |
| --- | --- |
| Persistence | All data survives application restarts |
| Binary Format | Stores arbitrary byte arrays |
| Thread-Safe | Safe for concurrent access from async tasks |
| File-Backed | Single file per cache instance |
| No Network | Local-only storage (unlike WebSocket mode) |

Sources: src/caches.rs:3 src/caches.rs:22-37


Cache Configuration

ConfigManager Integration

The caching system reads configuration from ConfigManager:

| Config Field | Purpose | Default |
| --- | --- | --- |
| cache_base_dir | Parent directory for cache files | Required (no default) |

The cache base directory path is constructed at runtime and must be set before initialization:

Cache File Paths

  • HTTP Cache: {cache_base_dir}/http_storage_cache.bin
  • Preprocessor Cache: {cache_base_dir}/preprocessor_cache.bin

Sources: src/caches.rs:16 src/caches.rs:20 src/caches.rs:34

ThrottlePolicy Configuration

While not part of the cache storage itself, the ThrottlePolicy is closely integrated with the cache middleware:

| Parameter | Config Source | Purpose |
| --- | --- | --- |
| base_delay_ms | config.min_delay_ms | Minimum delay between requests |
| max_concurrent | config.max_concurrent | Maximum parallel requests |
| max_retries | config.max_retries | Retry attempts for failed requests |
| adaptive_jitter_ms | Hardcoded: 500 | Random jitter for backoff |

Sources: src/network/sec_client.rs:53-59


Error Handling

Initialization Errors

The cache initialization handles several error scenarios:

Panic Conditions:

  • DataStore::open() fails (I/O error, permission denied, etc.)
  • Calling get_http_cache_store() before Caches::init()
  • Calling get_preprocessor_cache() before Caches::init()

Warnings:

  • Attempting to reinitialize already-set OnceLock (logs warning, doesn't fail)

Sources: src/caches.rs:25-29 src/caches.rs:51-55

Runtime Access Errors

Both cache accessor methods use expect() with descriptive messages:
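Roughly like the following (a hedged sketch; the actual message text and return type may differ):

```rust
// Hedged sketch of one accessor body, relying on the statics shown earlier.
pub fn get_http_cache_store() -> Arc<DataStore> {
    SIMD_R_DRIVE_HTTP_CACHE
        .get()
        .expect("HTTP cache accessed before Caches::init()")
        .clone()
}
```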

These panics indicate programmer errors (accessing cache before initialization) rather than runtime failures.

Sources: src/caches.rs:51-56 src/caches.rs:59-64


Testing

Cache Testing Strategy

The test suite verifies caching behavior indirectly through SecClient tests:

Test Coverage:

| Test | Purpose | File |
| --- | --- | --- |
| test_fetch_json_without_retry_success | Verifies basic fetch with caching enabled | tests/sec_client_tests.rs:36-62 |
| test_fetch_json_with_retry_success | Tests cache behavior with successful retry | tests/sec_client_tests.rs:65-91 |
| test_fetch_json_with_retry_failure | Validates cache doesn't store failed responses | tests/sec_client_tests.rs:94-120 |
| test_fetch_json_with_retry_backoff | Tests cache with retry backoff logic | tests/sec_client_tests.rs:123-158 |

Mock Server Testing:

Tests use mockito::Server to simulate SEC API responses without requiring cache setup, as the test environment creates ephemeral cache instances per test run.

Sources: tests/sec_client_tests.rs:1-159


Usage Patterns

Typical Initialization Flow
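A minimal sketch of the typical startup sequence, assuming the entry points named elsewhere on this page (not verbatim main.rs code):

```rust
// Hedged sketch: configuration, cache initialization, then client construction.
let config_manager = ConfigManager::load()?;
Caches::init(&config_manager);
let client = SecClient::from_config_manager(&config_manager)?;
```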

Future Extensions

The preprocessor cache infrastructure is in place but not yet fully utilized. Potential use cases include:

  • Caching parsed/transformed US GAAP concepts
  • Storing intermediate data structures from distill_us_gaap_fundamental_concepts()
  • Caching resolved CIK lookups from ticker symbols
  • Storing compiled ticker-to-CIK mappings

Sources: src/caches.rs:59-64


Integration with Other Systems

The caching system integrates with:

  • ConfigManager (#2.1): Provides cache directory configuration
  • SecClient (#3.1): Primary consumer of HTTP cache
  • Network Functions (#3.2): All fetch operations benefit from caching
  • simd-r-drive : External storage backend (see #6 for dependency details)

The cache layer operates transparently below the network API, requiring no changes to calling code when caching is enabled or disabled.

Sources: src/caches.rs:1-66 src/network/sec_client.rs:73-81


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Main Application Flow

Relevant source files

This document describes the main application entry point and execution flow in src/main.rs, including initialization, data fetching loops, CSV output organization, and error handling strategies. The main application orchestrates the high-level data collection process by coordinating the configuration system, network layer, and storage systems.

For detailed information about the network client and its middleware layers, see SecClient. For data model structures used throughout the application, see Data Models & Enumerations. For configuration and credential management, see Configuration System.

Application Entry Point and Initialization

The application entry point is the main() function in src/main.rs:174-240. The initialization sequence follows a consistent pattern regardless of which processing mode is active:

Sources: src/main.rs:174-180

graph TB
    Start["Program Start\n#[tokio::main]\nasync fn main()"]
LoadConfig["ConfigManager::load()\nLoad TOML configuration\nValidate credentials"]
CreateClient["SecClient::from_config_manager()\nInitialize HTTP client\nApply throttling & caching"]
CheckMode{"Processing\nMode?"}
FetchTickers["fetch_company_tickers(&client)\nRetrieve complete ticker dataset"]
FetchInvCo["fetch_investment_company_\nseries_and_class_dataset(&client)\nGet fund listings"]
USGAAPLoop["US-GAAP Processing Loop\n(Active Implementation)"]
InvCoLoop["Investment Company Loop\n(Commented Prototypes)"]
ErrorReport["Generate Error Summary\nPrint failed tickers"]
End["Program Exit"]
Start --> LoadConfig
 
   LoadConfig --> CreateClient
 
   CreateClient --> CheckMode
 
   CheckMode -->|US-GAAP Mode| FetchTickers
 
   CheckMode -->|Investment Co Mode| FetchInvCo
 
   FetchTickers --> USGAAPLoop
 
   FetchInvCo --> InvCoLoop
 
   USGAAPLoop --> ErrorReport
 
   InvCoLoop --> ErrorReport
 
   ErrorReport --> End

The initialization handles errors at each stage, returning early if critical components fail to load:

Initialization Step | Error Handling | Line Reference
--- | --- | ---
Configuration Loading | Returns Box<dyn Error> on failure | src/main.rs:176
Client Creation | Returns Box<dyn Error> on failure | src/main.rs:178
Ticker Fetching | Returns Box<dyn Error> on failure | src/main.rs:180

Active Implementation: US-GAAP Fundamentals Processing

The currently active implementation in main.rs processes US-GAAP fundamentals for all company tickers. This represents the production data collection pipeline.

Sources: src/main.rs:174-240

graph TB
    Start["Start US-GAAP Processing"]
FetchTickers["fetch_company_tickers(&client)\nReturns Vec&lt;Ticker&gt;"]
PrintTotal["Print total records\nDisplay first 60 tickers"]
InitErrorLog["Initialize error_log:\nHashMap&lt;String, String&gt;"]
LoopStart{"For each\ncompany_ticker\nin tickers"}
PrintProgress["Print processing status:\nticker_symbol (i+1 of total)"]
FetchFundamentals["fetch_us_gaap_fundamentals(\n&client,\n&company_tickers,\n&ticker_symbol)"]
CreateFile["File::create(\ndata/22-june-us-gaap/{ticker}.csv)"]
WriteCSV["CsvWriter::new(&mut file)\n.include_header(true)\n.finish(&mut df)"]
CaptureError["Insert error into error_log:\nticker → error message"]
CheckErrors{"error_log\nempty?"}
PrintErrors["Print error summary:\nList all failed tickers"]
PrintSuccess["Print success message"]
End["End Processing"]
Start --> FetchTickers
 
   FetchTickers --> PrintTotal
 
   PrintTotal --> InitErrorLog
 
   InitErrorLog --> LoopStart
 
   LoopStart -->|Next ticker| PrintProgress
 
   PrintProgress --> FetchFundamentals
 
   FetchFundamentals -->|Success| CreateFile
 
   CreateFile -->|Success| WriteCSV
 
   FetchFundamentals -->|Error| CaptureError
 
   CreateFile -->|Error| CaptureError
 
   WriteCSV -->|Error| CaptureError
 
   WriteCSV -->|Success| LoopStart
 
   CaptureError --> LoopStart
 
   LoopStart -->|Complete| CheckErrors
 
   CheckErrors -->|Has errors| PrintErrors
 
   CheckErrors -->|No errors| PrintSuccess
 
   PrintErrors --> End
 
   PrintSuccess --> End

Processing Loop Details

The main processing loop iterates through all company tickers sequentially, performing the following operations for each ticker:

  1. Progress Logging - src/main.rs:190-195: Prints current position in the dataset
  2. Data Fetching - src/main.rs:203: Calls fetch_us_gaap_fundamentals() with client, full ticker list, and current ticker symbol
  3. CSV Generation - src/main.rs:206-214: Creates file and writes DataFrame with headers
  4. Error Capture - src/main.rs:212-225: Logs any failures to error_log HashMap

Error Handling Strategy

The application uses a non-fatal error handling approach:

Error Type | Handling Strategy | Location
--- | --- | ---
Fetch Error | Logged to error_log, processing continues | src/main.rs:223-225
File Creation Error | Logged to error_log, processing continues | src/main.rs:217-220
CSV Write Error | Logged to error_log, processing continues | src/main.rs:211-214
Configuration Error | Fatal, returns immediately | src/main.rs:176
Client Creation Error | Fatal, returns immediately | src/main.rs:178

Sources: src/main.rs:185-227

CSV Output Organization

The US-GAAP processing mode writes output to a flat directory structure:

data/22-june-us-gaap/
├── AAPL.csv
├── MSFT.csv
├── GOOGL.csv
└── ...

Each file is named using the ticker symbol and contains the complete US-GAAP fundamentals DataFrame for that company.

Sources: src/main.rs:205

Investment Company Processing Flow (Prototype)

Two commented-out prototype implementations exist in main.rs for processing investment company data. These represent alternative processing modes that may be activated in future iterations.

Production Prototype (Lines 18-122)

The first prototype implements production-ready investment company processing with comprehensive error handling:

Sources: src/main.rs:18-122

graph TB
    Start["Start Investment Co Processing"]
InitLogger["env_logger::Builder\n.filter(None, LevelFilter::Info)"]
InitErrorLog["Initialize error_log: Vec&lt;String&gt;"]
LoadConfig["ConfigManager::load()\nHandle fatal errors"]
CreateClient["SecClient::from_config_manager()\nHandle fatal errors"]
FetchInvCo["fetch_investment_company_\nseries_and_class_dataset(&client)"]
LogTotal["info! Total investment companies"]
LoopStart{"For each fund\nin investment_\ncompanies"}
LogProgress["info! Processing: i+1 of total"]
CheckTicker{"fund.class_\nticker exists?"}
LogTicker["info! Ticker symbol"]
FetchNPORT["fetch_nport_filing_by_\nticker_symbol(&client, ticker)"]
GetFirstLetter["ticker.chars().next()\n.to_ascii_uppercase()"]
CreateDir["create_dir_all(\ndata/fund-holdings/{letter})"]
LogRecords["info! Total records"]
WriteCSV["latest_nport_filing\n.write_to_csv(file_path)"]
LogError["error! Log message\nerror_log.push(msg)"]
CheckFinal{"error_log\nempty?"}
PrintSummary["error! Print error summary"]
PrintSuccess["info! All funds successful"]
End["End Processing"]
Start --> InitLogger
 
   InitLogger --> InitErrorLog
 
   InitErrorLog --> LoadConfig
 
   LoadConfig -->|Error| LogError
 
   LoadConfig -->|Success| CreateClient
 
   CreateClient -->|Error| LogError
 
   CreateClient -->|Success| FetchInvCo
 
   FetchInvCo -->|Error| LogError
 
   FetchInvCo -->|Success| LogTotal
 
   LogTotal --> LoopStart
 
   LoopStart -->|Next fund| LogProgress
 
   LogProgress --> CheckTicker
 
   CheckTicker -->|Yes| LogTicker
 
   CheckTicker -->|No| LoopStart
 
   LogTicker --> FetchNPORT
 
   FetchNPORT -->|Error| LogError
 
   FetchNPORT -->|Success| GetFirstLetter
 
   GetFirstLetter --> CreateDir
 
   CreateDir -->|Error| LogError
 
   CreateDir -->|Success| LogRecords
 
   LogRecords --> WriteCSV
 
   WriteCSV -->|Error| LogError
 
   WriteCSV -->|Success| LoopStart
 
   LogError --> LoopStart
 
   LoopStart -->|Complete| CheckFinal
 
   CheckFinal -->|Has errors| PrintSummary
 
   CheckFinal -->|No errors| PrintSuccess
 
   PrintSummary --> End
 
   PrintSuccess --> End

Directory Organization for Fund Holdings

The investment company prototype organizes CSV output by the first letter of the ticker symbol:

data/fund-holdings/
├── A/
│   ├── AADR.csv
│   └── AAXJ.csv
├── B/
│   ├── BKLN.csv
│   └── BND.csv
├── V/
│   ├── VTI.csv
│   └── VXUS.csv
└── ...

This alphabetical categorization is implemented at src/main.rs:83-84 using ticker_symbol.chars().next().unwrap().to_ascii_uppercase().

Sources: src/main.rs:82-107

Debug Prototype (Lines 124-171)

A simpler debug version exists for testing and development, with minimal error handling and console output:

Feature | Debug Prototype | Production Prototype
--- | --- | ---
Error Logging | Fatal errors only | Comprehensive Vec/HashMap logs
Logger | Not initialized | env_logger with INFO level
Progress Tracking | Simple println! | Structured info! macros
Error Recovery | Immediate exit with ? | Continue processing, log errors
Output | Console + CSV | CSV only

Sources: src/main.rs:124-171

Data Flow Through Network Layer

The main application delegates all SEC API interactions to the network layer, which handles caching, throttling, and retries transparently:

Sources: src/main.rs:3-9 src/network/fetch_us_gaap_fundamentals.rs:12-28 src/network/fetch_nport_filing.rs:10-49 src/network/fetch_investment_company_series_and_class_dataset.rs:31-105

graph LR
    Main["main.rs\nmain()"]
FetchTickers["network::fetch_company_tickers\nGET company_tickers.json"]
FetchGAAP["network::fetch_us_gaap_fundamentals\nGET CIK{cik}/company-facts.json"]
FetchInvCo["network::fetch_investment_company_\nseries_and_class_dataset\nGET investment-company-series-class-{year}.csv"]
FetchNPORT["network::fetch_nport_filing_\nby_ticker_symbol\nGET data/{cik}/{accession}/primary_doc.xml"]
SecClient["SecClient\nHTTP middleware\nThrottling, Caching, Retries"]
SECAPI["SEC EDGAR API\nwww.sec.gov"]
Main --> FetchTickers
 
   Main --> FetchGAAP
 
   Main --> FetchInvCo
 
   Main --> FetchNPORT
    
 
   FetchTickers --> SecClient
 
   FetchGAAP --> SecClient
 
   FetchInvCo --> SecClient
 
   FetchNPORT --> SecClient
    
 
   SecClient --> SECAPI

The network functions are designed to be composable, allowing the main application to chain multiple API calls together (e.g., fetching a CIK first, then using it to fetch submissions).

Composite Data Fetching Operations

Several network functions perform composite operations by calling other network functions. For example, fetch_nport_filing_by_ticker_symbol orchestrates multiple API calls:

Sources: src/network/fetch_nport_filing.rs:10-49

graph TB
    Input["Input: ticker_symbol"]
FetchCIK["fetch_cik_by_ticker_symbol\n(&sec_client, ticker_symbol)"]
FetchSubs["fetch_cik_submissions\n(&sec_client, cik)"]
GetRecent["CikSubmission::most_recent_\nnport_p_submission(submissions)"]
FetchFiling["fetch_nport_filing_by_cik_\nand_accession_number(\n&sec_client, cik, accession)"]
FetchTickers["fetch_company_tickers\n(&sec_client)"]
ParseXML["parse_nport_xml(\n&xml_data, &company_tickers)"]
Output["Output: Vec&lt;NportInvestment&gt;"]
Input --> FetchCIK
 
   FetchCIK --> FetchSubs
 
   FetchSubs --> GetRecent
 
   GetRecent --> FetchFiling
 
   FetchFiling --> FetchTickers
 
   FetchTickers --> ParseXML
 
   ParseXML --> Output

This composition pattern allows the main application to use high-level functions while the network layer handles the complexity of multi-step API interactions.

graph TB
    MainFn["main() -> Result&lt;(), Box&lt;dyn Error&gt;&gt;"]
NetworkFn["Network functions\n-> Result&lt;T, Box&lt;dyn Error&gt;&gt;"]
Parser["Parsers\n-> Result&lt;T, Box&lt;dyn Error&gt;&gt;"]
SecClient["SecClient\nImplements retries & backoff"]
MainFatal["Fatal Errors:\nConfig load failure\nClient creation failure\nInitial ticker fetch failure"]
MainNonFatal["Non-Fatal Errors:\nIndividual ticker fetch failures\nCSV write failures\n→ Logged to error_log\n→ Processing continues"]
NetworkRetry["Automatic Retries:\nHTTP timeouts\nRate limit errors\n→ Exponential backoff\n→ Max 3 retries"]
ParserErr["Parser Errors:\nMalformed XML/JSON\nMissing required fields\n→ Propagate to caller"]
MainFn --> MainFatal
 
   MainFn --> MainNonFatal
    
 
   MainFn --> NetworkFn
 
   NetworkFn --> NetworkRetry
 
   NetworkFn --> SecClient
    
 
   NetworkFn --> Parser
 
   Parser --> ParserErr
    
 
   NetworkRetry --> MainNonFatal
 
   ParserErr --> MainNonFatal

Error Propagation and Recovery

The application uses Rust's Result type throughout the call stack, with different error handling strategies at each level:

Sources: src/main.rs:174-240 src/network/fetch_us_gaap_fundamentals.rs:12-28

Fatal errors (configuration, client setup) immediately return from main(), while per-ticker errors are captured in error_log and reported at the end without interrupting the processing loop.

File System Output Structure

The application writes structured CSV files to the local file system. The output directory organization differs between processing modes:

US-GAAP Mode Output

data/
└── 22-june-us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── {ticker}.csv

Sources: src/main.rs:205

Investment Company Mode Output

data/
└── fund-holdings/
    ├── A/
    │   ├── AADR.csv
    │   └── {ticker}.csv
    ├── B/
    │   └── {ticker}.csv
    └── {letter}/
        └── {ticker}.csv

Sources: src/main.rs:82-102

The alphabetical organization in investment company mode enables efficient file system operations when dealing with thousands of fund ticker symbols, avoiding the creation of a single directory with excessive entries.

Performance Characteristics

The main application is designed for batch processing with the following characteristics:

Aspect | Implementation | Location
--- | --- | ---
Concurrency | Sequential processing (no parallelization in main loop) | src/main.rs:187
Rate Limiting | Handled by SecClient throttle policy | See SecClient
Caching | HTTP cache and preprocessor cache in SecClient | See Caching System
Memory Management | Processes one ticker at a time, drops DataFrames after write | src/main.rs:203-221
Error Recovery | Non-fatal errors logged, processing continues | src/main.rs:185-227

The sequential processing model ensures consistent rate limiting and simplifies error tracking, though it could be parallelized in future iterations using Tokio tasks or Rayon parallel iterators.

Sources: src/main.rs:174-240

Summary

The main application flow follows a simple but robust pattern:

  1. Initialize configuration and HTTP client with comprehensive error handling
  2. Fetch the complete dataset of tickers or investment companies
  3. Iterate through each entry sequentially
  4. Process each entry by fetching additional data and writing to CSV
  5. Log any errors without stopping processing
  6. Report a summary of successes and failures

This design prioritizes resilience (continue processing on errors) and observability (detailed logging and error summaries) over performance, making it suitable for overnight batch jobs that process thousands of securities. The commented prototype implementations demonstrate alternative processing modes that can be activated by uncommenting the relevant sections and commenting out the active implementation.



Utility Functions

Relevant source files

Purpose and Scope

This document covers the utility functions and helper modules provided by the utils module in the Rust sec-fetcher application. These utilities provide cross-cutting functionality used throughout the codebase, including data structure transformations, runtime mode detection, and collection extensions.

For information about the main application architecture and data flow, see Main Application Flow. For details on data models and core structures, see Data Models & Enumerations.

Sources: src/utils.rs:1-12


Module Overview

The utils module is organized as a collection of focused sub-modules, each providing a specific utility function or set of related functions. The module uses Rust's re-export pattern to provide a clean public API.

Sub-module | Primary Export | Purpose
--- | --- | ---
invert_multivalue_indexmap | invert_multivalue_indexmap() | Inverts multi-value index mappings
is_development_mode | is_development_mode() | Runtime environment detection
is_interactive_mode | is_interactive_mode(), set_interactive_mode_override() | Interactive mode state management
vec_extensions | VecExtensions trait | Extension methods for Vec<T>

The module structure follows a pattern where each utility is isolated in its own file, promoting maintainability and testability while the parent utils.rs file acts as a facade.

Sources: src/utils.rs:1-12


Utility Module Architecture

Sources: src/utils.rs:1-12


IndexMap Inversion Utility

Function Signature

The invert_multivalue_indexmap function transforms a mapping where keys point to multiple values into a reverse mapping where values point to all their associated keys.

Algorithm and Complexity

The function performs the following operations:

  1. Initialization : Creates a new IndexMap with capacity matching the input map
  2. Iteration : Traverses all key-value pairs in the original mapping
  3. Inversion : For each value in a key's vector, adds that key to the value's entry in the inverted map
  4. Preservation : Maintains insertion order via IndexMap data structure

Time Complexity: O(N) where N is the total number of key-value associations across all vectors

Space Complexity: O(N) for storing the inverted mapping

Use Cases in the Codebase

This utility is primarily used in the US GAAP transformation pipeline where bidirectional concept mappings are required. The function enables:

  • Synonym Resolution : Mapping from multiple US GAAP tags to a single FundamentalConcept enum variant
  • Reverse Lookups : Given a FundamentalConcept, finding all original US GAAP tags that map to it
  • Hierarchical Queries : Supporting both forward (tag → concept) and reverse (concept → tags) navigation

Example Usage Pattern
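
The repository's example code is not reproduced here. As a stand-in, the following Python sketch shows the same inversion logic conceptually; the real function is Rust code operating on IndexMap, so the names and types below are illustrative only.

```python
# Illustrative sketch only: invert a key -> [values] mapping into value -> [keys].
# Python dicts preserve insertion order, loosely mirroring IndexMap's behavior.
def invert_multivalue_mapping(mapping: dict[str, list[str]]) -> dict[str, list[str]]:
    inverted: dict[str, list[str]] = {}
    for key, values in mapping.items():
        for value in values:
            inverted.setdefault(value, []).append(key)
    return inverted

# Hypothetical forward mapping: concept -> original US GAAP tags.
forward = {
    "Revenues": ["Revenues", "SalesRevenueNet"],
    "NetIncomeLoss": ["NetIncomeLoss", "ProfitLoss"],
}
print(invert_multivalue_mapping(forward))
# {'Revenues': ['Revenues'], 'SalesRevenueNet': ['Revenues'], ...}
```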

Sources: src/utils/invert_multivalue_indexmap.rs:1-65


Development Mode Detection

The is_development_mode function provides runtime detection of development versus production environments. This utility enables conditional behavior based on the execution context.

Typical Implementation Pattern

Development mode detection typically checks for:

  • Cargo Profile : Whether compiled with --release flag
  • Environment Variables : Presence of RUST_DEV_MODE, DEBUG, or similar markers
  • Build Configuration : Compile-time flags like #[cfg(debug_assertions)]

Usage in Configuration System

The function integrates with the configuration system to adjust behavior:

Development Mode | Production Mode
--- | ---
Relaxed validation | Strict validation
Verbose logging enabled | Minimal logging
Mock data allowed | Real API calls required
Cache disabled or short TTL | Full caching enabled

This utility is likely referenced in src/config/config_manager.rs to adjust validation rules and in src/main.rs to control application initialization.

Sources: src/utils.rs:4-5


Interactive Mode Management

The interactive mode utilities manage application state related to user interaction, controlling whether the application should prompt for user input or run in automated mode.

Function Signatures

State Management Pattern

These functions likely implement a static or thread-local state management pattern:

  1. Default Behavior : Detect if running in a TTY (terminal) environment
  2. Override Mechanism : Allow explicit setting via set_interactive_mode_override()
  3. Query Interface : Check current state via is_interactive_mode()

Use Cases

Scenario | Interactive Mode | Non-Interactive Mode
--- | --- | ---
Missing config | Prompt user for input | Exit with error
API rate limit | Pause and notify user | Automatic retry with backoff
Data validation failure | Ask user to continue | Fail fast and exit
Progress reporting | Show progress bars | Log to stdout/file

Integration with Main Application Flow

The interactive mode flags are likely used in src/main.rs to control:

  • Whether to display progress indicators during fund processing
  • How to handle missing credentials (prompt vs. error)
  • Whether to confirm before writing large CSV files
  • User confirmation for destructive operations

Sources: src/utils.rs:7-8


Vector Extensions Trait

The VecExtensions trait provides extension methods for Rust's standard Vec<T> type, adding domain-specific functionality needed by the sec-fetcher application.

Trait Pattern

Extension traits in Rust follow this pattern:

Likely Extension Methods

Based on common patterns in data processing applications and the context of US GAAP data transformation, the trait likely provides:

Method Category | Potential Methods | Purpose
--- | --- | ---
Chunking | chunk_by_size(), batch() | Process large datasets in manageable batches
Deduplication | unique(), deduplicate_by() | Remove duplicate entries from fetched data
Filtering | filter_not_empty(), compact() | Remove null or empty elements
Transformation | map_parallel(), flat_map_concurrent() | Parallel data transformation
Validation | all_valid(), partition_valid() | Separate valid from invalid records

Usage in Data Pipeline

The extensions are likely used extensively in:

  • Network Module : Batching API requests, deduplicating ticker symbols
  • Transform Module : Parallel processing of US GAAP concepts
  • Main Application : Chunking investment company lists for concurrent processing

Sources: src/utils.rs:10-11


Utility Function Relationships

Sources: src/utils.rs:1-12 src/utils/invert_multivalue_indexmap.rs:1-65


Design Principles

Modularity

Each utility is isolated in its own sub-module with a single, well-defined responsibility. This design:

  • Reduces Coupling : Utilities can be tested independently
  • Improves Reusability : Functions can be used across different modules without dependencies
  • Simplifies Maintenance : Changes to one utility don't affect others

Generic Programming

The invert_multivalue_indexmap function demonstrates generic programming principles:

  • Type Parameters : Works with any types implementing Eq + Hash + Clone
  • Zero-Cost Abstractions : No runtime overhead compared to specialized versions
  • Compile-Time Guarantees : Type safety ensured by the compiler

Order Preservation

The use of IndexMap instead of HashMap in the inversion utility preserves insertion order, which is critical for:

  • Deterministic Output : Consistent CSV file generation across runs
  • Reproducible Transformations : Same input always produces same output order
  • Debugging : Predictable ordering aids in troubleshooting

Sources: src/utils/invert_multivalue_indexmap.rs:4-14


Performance Characteristics

IndexMap Inversion

Operation | Complexity | Notes
--- | --- | ---
Inversion | O(N) | N = total key-value associations
Lookup in result | O(1) average | Hash-based access
Insertion order | Preserved | Via IndexMap internal ordering
Memory overhead | O(N) | Stores inverted mapping

Runtime Mode Detection

Operation | Complexity | Caching Strategy
--- | --- | ---
First check | O(1) - O(log n) | Environment lookup or compile-time constant
Subsequent checks | O(1) | Static or lazy_static cached value

The mode detection functions likely use lazy initialization patterns to cache their results, avoiding repeated environment variable lookups or system calls.

Sources: src/utils/invert_multivalue_indexmap.rs:26-28


Integration Points

US GAAP Transformation Pipeline

The utilities integrate with the concept transformation system (see US GAAP Concept Transformation):

  1. Forward Mapping : distill_us_gaap_fundamental_concepts uses hard-coded mappings
  2. Reverse Mapping : invert_multivalue_indexmap generates reverse index
  3. Lookup Optimization : Enables O(1) queries for "which concepts contain this tag?"

Configuration System

The development and interactive mode utilities integrate with configuration management (see Configuration System):

  1. Validation : is_development_mode relaxes validation in development
  2. Credential Handling : is_interactive_mode determines prompt vs. error
  3. Logging : Both functions control verbosity and output format

Main Application Loop

The utilities support the main processing flow (see Main Application Flow):

  1. Batch Processing : VecExtensions enables efficient chunking of investment companies
  2. User Feedback : is_interactive_mode controls progress display
  3. Error Handling : Mode detection influences retry behavior and error messages

Sources: src/utils.rs:1-12



Python narrative_stack System

Relevant source files

Purpose and Scope

The narrative_stack system is the Python machine learning component of the dual-language financial data processing architecture. This document provides an overview of the Python system's architecture, its integration with the Rust sec-fetcher component, and the high-level data flow from CSV ingestion through model training.

For detailed information on specific subsystems:

The Rust data fetching layer is documented in Rust sec-fetcher Application.

System Architecture Overview

The narrative_stack system processes US GAAP financial data through a multi-stage pipeline that transforms raw CSV files into learned latent representations. The architecture consists of three primary layers: storage/ingestion, preprocessing, and training.

graph TB
    subgraph "Storage Layer"
        DbUsGaap["DbUsGaap\nMySQL Database Interface"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client"]
UsGaapStore["UsGaapStore\nUnified Data Facade"]
end
    
    subgraph "Data Sources"
        RustCSV["CSV Files\nfrom sec-fetcher\ntruncated_csvs/"]
MySQL["MySQL Database\nus_gaap_test"]
SimdRDrive["simd-r-drive\nWebSocket Server"]
end
    
    subgraph "Preprocessing Components"
        IngestCSV["ingest_us_gaap_csvs\nCSV Walker & Parser"]
GenPCA["generate_pca_embeddings\nPCA Compression"]
RobustScaler["RobustScaler\nPer-Pair Normalization"]
ConceptEmbed["Semantic Embeddings\nConcept/Unit Pairs"]
end
    
    subgraph "Training Components"
        IterableDS["IterableConceptValueDataset\nStreaming Data Loader"]
Stage1AE["Stage1Autoencoder\nPyTorch Lightning Module"]
PLTrainer["pl.Trainer\nTraining Orchestration"]
Callbacks["EarlyStopping\nModelCheckpoint\nTensorBoard Logger"]
end
    
 
   RustCSV --> IngestCSV
 
   MySQL --> DbUsGaap
 
   SimdRDrive --> DataStoreWsClient
    
 
   DbUsGaap --> UsGaapStore
 
   DataStoreWsClient --> UsGaapStore
    
 
   IngestCSV --> UsGaapStore
 
   UsGaapStore --> GenPCA
 
   UsGaapStore --> RobustScaler
 
   UsGaapStore --> ConceptEmbed
    
 
   GenPCA --> UsGaapStore
 
   RobustScaler --> UsGaapStore
 
   ConceptEmbed --> UsGaapStore
    
 
   UsGaapStore --> IterableDS
 
   IterableDS --> Stage1AE
 
   Stage1AE --> PLTrainer
 
   PLTrainer --> Callbacks
    
 
   Callbacks -.->|Checkpoints| Stage1AE

Component Architecture

Sources:

Core Component Responsibilities

Component | Type | Responsibilities
--- | --- | ---
DbUsGaap | Database Interface | MySQL connection management, query execution
DataStoreWsClient | WebSocket Client | Real-time communication with simd-r-drive server
UsGaapStore | Data Facade | Unified API for ingestion, lookup, embedding management
IterableConceptValueDataset | PyTorch Dataset | Streaming data loader with internal batching
Stage1Autoencoder | Lightning Module | Encoder-decoder architecture for concept embeddings
pl.Trainer | Training Framework | Orchestration, logging, checkpointing

Sources:

Data Flow Pipeline

The system processes financial data through a sequential pipeline from raw CSV files to trained model checkpoints.

graph LR
    subgraph "Input Stage"
        CSV1["Rust CSV Output\nproject_paths.rust_data/\ntruncated_csvs/"]
end
    
    subgraph "Ingestion Stage"
        Walk["walk_us_gaap_csvs\nDirectory Walker"]
Parse["UsGaapRowRecord\nParser"]
Store1["Store to\nDbUsGaap & DataStore"]
end
    
    subgraph "Preprocessing Stage"
        Extract["Extract\nconcept/unit pairs"]
GenEmbed["generate_concept_unit_embeddings\nSemantic Embeddings"]
Scale["RobustScaler\nPer-Pair Normalization"]
PCA["PCA Compression\nvariance_threshold=0.95"]
Triplet["Triplet Storage\nconcept+unit+scaled_value\n+scaler+embedding"]
end
    
    subgraph "Training Stage"
        Stream["IterableConceptValueDataset\ninternal_batch_size=64"]
DataLoader["DataLoader\nbatch_size from hparams\ncollate_with_scaler"]
Encode["Stage1Autoencoder.encode\nInput → Latent"]
Decode["Stage1Autoencoder.decode\nLatent → Reconstruction"]
Loss["MSE Loss\nReconstruction Error"]
end
    
    subgraph "Output Stage"
        Ckpt["Model Checkpoints\nstage1_resume-vN.ckpt"]
TB["TensorBoard Logs\nval_loss_epoch monitoring"]
end
    
 
   CSV1 --> Walk
 
   Walk --> Parse
 
   Parse --> Store1
 
   Store1 --> Extract
 
   Extract --> GenEmbed
 
   GenEmbed --> Scale
 
   Scale --> PCA
 
   PCA --> Triplet
 
   Triplet --> Stream
 
   Stream --> DataLoader
 
   DataLoader --> Encode
 
   Encode --> Decode
 
   Decode --> Loss
 
   Loss --> Ckpt
 
   Loss --> TB

End-to-End Processing Flow

Sources:

Processing Stages

1. CSV Ingestion

The system ingests CSV files produced by the Rust sec-fetcher using us_gaap_store.ingest_us_gaap_csvs(). This function walks the directory structure, parses each CSV file, and stores the data in both MySQL and the WebSocket data store.

Sources:

2. Concept/Unit Pair Extraction

The preprocessing pipeline extracts unique (concept, unit) pairs from the ingested data. Each pair represents a specific financial metric with its measurement unit (e.g., ("Revenues", "USD")).

Sources:

3. Semantic Embedding Generation

For each concept/unit pair, the system generates semantic embeddings that capture the meaning and relationship of financial concepts. These embeddings are then compressed using PCA with a variance threshold of 0.95.

Sources:

4. Value Normalization

The RobustScaler normalizes financial values separately for each concept/unit pair, making the data robust to outliers while preserving relative magnitudes within each group.

Sources:

5. Model Training

The Stage1Autoencoder learns to reconstruct the concatenated embedding+value input through a bottleneck architecture, forcing the model to learn compressed representations in the latent space.

Sources:

graph TB
    subgraph "Python Application"
        App["narrative_stack\nNotebooks & Scripts"]
end
    
    subgraph "Storage Facade"
        UsGaapStore["UsGaapStore\nUnified Interface"]
end
    
    subgraph "Backend Interfaces"
        DbInterface["DbUsGaap\ndb_config"]
WsInterface["DataStoreWsClient\nsimd_r_drive_server_config"]
end
    
    subgraph "Storage Systems"
        MySQL["MySQL Server\nus_gaap_test database"]
WsServer["simd-r-drive\nWebSocket Server\nKey-Value Store"]
FileCache["File System\nCache Storage"]
end
    
    subgraph "Data Types"
        Raw["Raw Records\nUsGaapRowRecord"]
Triplets["Triplets\nconcept+unit+scaled_value\n+scaler+embedding"]
Matrix["Embedding Matrix\nnumpy arrays"]
PCAModel["PCA Models\nsklearn objects"]
end
    
 
   App --> UsGaapStore
 
   UsGaapStore --> DbInterface
 
   UsGaapStore --> WsInterface
    
 
   DbInterface --> MySQL
 
   WsInterface --> WsServer
 
   WsServer --> FileCache
    
 
   MySQL -.->|stores| Raw
 
   WsServer -.->|stores| Triplets
 
   WsServer -.->|stores| Matrix
 
   WsServer -.->|stores| PCAModel

Storage Architecture Integration

The Python system integrates with multiple storage backends to support different access patterns and data requirements.

Storage Backend Architecture

Sources:

Storage System Characteristics

Storage Backend | Use Case | Data Types | Access Pattern
--- | --- | --- | ---
MySQL (DbUsGaap) | Raw record storage | UsGaapRowRecord entries | Batch queries during ingestion
WebSocket (DataStoreWsClient) | Preprocessed data | Triplets, embeddings, scalers, PCA models | Real-time streaming during training
File System | Cache persistence | HTTP cache, preprocessor cache | Transparent via simd-r-drive

Sources:

sequenceDiagram
    participant NB as "Jupyter Notebook"
    participant Config as "config module"
    participant DB as "DbUsGaap"
    participant WS as "DataStoreWsClient"
    participant Store as "UsGaapStore"
    
    NB->>Config: Import db_config
    NB->>Config: Import simd_r_drive_server_config
    NB->>DB: DbUsGaap(db_config)
    NB->>WS: DataStoreWsClient(host, port)
    NB->>Store: UsGaapStore(data_store)
    Store->>WS: Delegate operations
    Store->>DB: Query raw records

Configuration and Initialization

The system uses centralized configuration for database connections and WebSocket server endpoints.

Initialization Sequence
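
A minimal sketch of this sequence, using the component names from the diagram above; the import paths and constructor signatures are assumptions, not the package's documented API.

```python
# Import paths and constructor arguments below are assumptions based on the
# component names used in this document.
from narrative_stack.config import db_config, simd_r_drive_server_config
from narrative_stack.db import DbUsGaap
from narrative_stack.data_store import DataStoreWsClient
from narrative_stack.us_gaap_store import UsGaapStore

db = DbUsGaap(db_config)                      # MySQL interface for raw records
data_store = DataStoreWsClient(
    simd_r_drive_server_config.host,
    simd_r_drive_server_config.port,
)                                             # simd-r-drive WebSocket client
us_gaap_store = UsGaapStore(data_store)       # unified facade over both backends
```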

Sources:

Configuration Objects

Configuration | Purpose | Key Parameters
--- | --- | ---
db_config | MySQL connection | Host, port, database name, credentials
simd_r_drive_server_config | WebSocket server | Host, port
project_paths | File system paths | rust_data, python_data

Sources:

graph TB
    subgraph "Model Configuration"
        Model["Stage1Autoencoder\nhparams"]
LoadCkpt["load_from_checkpoint\nckpt_path"]
LR["Learning Rate\nlr, min_lr"]
end
    
    subgraph "Data Configuration"
        DS["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
DL["DataLoader\nbatch_size from hparams\ncollate_fn=collate_with_scaler\nnum_workers=2\nprefetch_factor=4"]
end
    
    subgraph "Callback Configuration"
        ES["EarlyStopping\nmonitor=val_loss_epoch\npatience=20\nmode=min"]
MC["ModelCheckpoint\nmonitor=val_loss_epoch\nsave_top_k=1\nfilename=stage1_resume"]
TB["TensorBoardLogger\nOUTPUT_PATH\nname=stage1_autoencoder"]
end
    
    subgraph "Trainer Configuration"
        Trainer["pl.Trainer\nmax_epochs=1000\naccelerator=auto\ndevices=1\ngradient_clip_val"]
end
    
 
   Model --> Trainer
 
   LoadCkpt -.->|optional resume| Model
 
   LR --> Model
    
 
   DS --> DL
 
   DL --> Trainer
    
 
   ES --> Trainer
 
   MC --> Trainer
 
   TB --> Trainer

Training Infrastructure

The training system uses PyTorch Lightning for experiment management and reproducible training workflows.

Training Configuration

Sources:

Key Training Parameters

Parameter | Value | Purpose
--- | --- | ---
EPOCHS | 1000 | Maximum training epochs
PATIENCE | 20 | Early stopping patience
internal_batch_size | 64 | Dataset internal batching
num_workers | 2 | DataLoader worker processes
prefetch_factor | 4 | Batches to prefetch per worker
gradient_clip_val | From hparams | Gradient clipping threshold

Sources:

Data Access Patterns

The UsGaapStore provides multiple access methods for different use cases.

Access Methods

Method | Purpose | Return Type | Usage Context
--- | --- | --- | ---
ingest_us_gaap_csvs(csv_dir, db) | Bulk CSV ingestion | None | Initial data loading
generate_pca_embeddings() | PCA compression | None | Preprocessing
lookup_by_index(idx) | Single triplet retrieval | Dict with concept, unit, scaled_value, scaler, embedding | Validation, debugging
batch_lookup_by_indices(indices) | Multiple triplet retrieval | List of dicts | Batch validation
get_triplet_count() | Count stored triplets | Integer | Monitoring
get_pair_count() | Count unique pairs | Integer | Statistics
get_embedding_matrix() | Full embedding matrix | (numpy array, pairs list) | Visualization, analysis
Sources:

Validation and Monitoring

The system includes validation mechanisms to ensure data integrity throughout the pipeline.

Scaler Validation Example

The preprocessing notebook includes validation code to verify that stored scalers correctly transform values:

Sources:

Visualization Tools

The system provides visualization utilities for monitoring preprocessing quality:

Function | Purpose | Output
--- | --- | ---
plot_pca_explanation() | Show PCA variance explained | Matplotlib plot with dimensionality recommendation
plot_semantic_embeddings() | Visualize embedding space | 2D scatterplot of embeddings

Sources:

graph LR
    subgraph "Rust sec-fetcher"
        Fetch["Network Fetching\nfetch_us_gaap_fundamentals"]
Transform["Concept Transformation\ndistill_us_gaap_fundamental_concepts"]
CSVWrite["CSV Output\nus-gaap/*.csv"]
end
    
    subgraph "File System"
        CSVStore["CSV Files\nproject_paths.rust_data/\ntruncated_csvs/"]
end
    
    subgraph "Python narrative_stack"
        CSVRead["CSV Ingestion\ningest_us_gaap_csvs"]
Process["Preprocessing Pipeline"]
Train["Model Training"]
end
    
 
   Fetch --> Transform
 
   Transform --> CSVWrite
 
   CSVWrite --> CSVStore
 
   CSVStore --> CSVRead
 
   CSVRead --> Process
 
   Process --> Train

Integration with Rust sec-fetcher

The Python system consumes data produced by the Rust sec-fetcher application through a file-based interface.

Rust-Python Data Bridge

Sources:

Data Format Expectations

The Python system expects CSV files produced by the Rust sec-fetcher to contain US GAAP fundamental data with the following structure:

  • Each CSV file represents data for a single ticker symbol
  • Files are organized in subdirectories (e.g., truncated_csvs/)
  • Each row contains concept, unit, value, and metadata fields
  • Concepts are standardized through the Rust distill_us_gaap_fundamental_concepts transformer

The preprocessing pipeline parses these CSVs into UsGaapRowRecord objects for further processing.
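
As a rough illustration of that layout, the sketch below walks a directory of per-ticker CSVs with pandas and collects the unique concept/unit pairs; the directory name and column names are assumptions, not the exact schema emitted by the Rust writer.

```python
from pathlib import Path
import pandas as pd

csv_dir = Path("data/us-gaap")            # assumed location of the Rust CSV output
pairs = set()
for csv_path in sorted(csv_dir.glob("*.csv")):
    df = pd.read_csv(csv_path)            # one file per ticker symbol, e.g. AAPL.csv
    pairs.update(zip(df["concept"], df["uom"]))  # column names are assumptions
print(f"{len(pairs)} unique (concept, unit) pairs")
```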

Sources:

Checkpoint Management

The training system supports checkpoint-based workflows for resuming training and fine-tuning.

Checkpoint Loading Patterns

The notebook demonstrates multiple checkpoint loading strategies:

Sources:

Checkpoint Configuration

Parameter | Purpose | Typical Value
--- | --- | ---
dirpath | Checkpoint save directory | OUTPUT_PATH
filename | Checkpoint filename pattern | "stage1_resume"
monitor | Metric to monitor | "val_loss_epoch"
mode | Optimization direction | "min"
save_top_k | Number of checkpoints to keep | 1

Sources:



Data Ingestion & Preprocessing

Relevant source files

Purpose and Scope

This document describes the data ingestion and preprocessing pipeline in the Python narrative_stack system. This pipeline transforms raw US GAAP financial data from CSV files (generated by the Rust sec-fetcher application) into normalized, embedded triplets suitable for machine learning training. The pipeline handles CSV parsing, concept/unit pair extraction, semantic embedding generation, robust statistical normalization, and PCA-based dimensionality reduction.

For information about the machine learning training pipeline that consumes this preprocessed data, see Machine Learning Training Pipeline. For details about the Rust data fetching and CSV generation, see Rust sec-fetcher Application.

Overview

The preprocessing pipeline orchestrates multiple transformation stages to prepare raw financial data for machine learning. The system reads CSV files containing US GAAP fundamental concepts (Assets, Revenues, etc.) with associated values and units of measure, then generates semantic embeddings for each concept/unit pair, normalizes values using robust statistical techniques, and compresses embeddings via PCA.

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:1-383

graph TB
    subgraph "Input"
        CSVFiles["CSV Files\nrust_data/us-gaap/*.csv\nSymbol-specific filings"]
end
    
    subgraph "Ingestion Layer"
        WalkCSV["walk_us_gaap_csvs()\nIterator over CSV files"]
ParseRow["UsGaapRowRecord\nParsed entries per symbol"]
end
    
    subgraph "Extraction & Embedding"
        ExtractPairs["Extract Concept/Unit Pairs\n(concept, unit)
tuples"]
GenEmbeddings["generate_concept_unit_embeddings()\nSemantic embeddings per pair"]
end
    
    subgraph "Normalization"
        GroupValues["Group by (concept, unit)\nCollect all values per pair"]
FitScaler["RobustScaler.fit()\nPer-pair scaling parameters"]
TransformValues["Transform values\nScaled to robust statistics"]
end
    
    subgraph "Dimensionality Reduction"
        PCAFit["PCA.fit()\nvariance_threshold=0.95"]
PCATransform["PCA.transform()\nCompress embeddings"]
end
    
    subgraph "Storage"
        BuildTriplets["Build Triplets\n(concept, unit, scaled_value,\nscaler, embedding)"]
StoreDS["DataStoreWsClient\nWrite to simd-r-drive"]
UsGaapStore["UsGaapStore\nUnified access facade"]
end
    
 
   CSVFiles --> WalkCSV
 
   WalkCSV --> ParseRow
 
   ParseRow --> ExtractPairs
 
   ExtractPairs --> GenEmbeddings
 
   ParseRow --> GroupValues
 
   GroupValues --> FitScaler
 
   FitScaler --> TransformValues
 
   GenEmbeddings --> PCAFit
 
   PCAFit --> PCATransform
 
   TransformValues --> BuildTriplets
 
   PCATransform --> BuildTriplets
 
   BuildTriplets --> StoreDS
 
   StoreDS --> UsGaapStore

Architecture Components

The preprocessing system is built around three primary components that handle data access, transformation, and storage:

graph TB
    subgraph "Database Layer"
        DbUsGaap["DbUsGaap\nMySQL interface\nus_gaap_test database"]
end
    
    subgraph "Storage Layer"
        DataStoreWsClient["DataStoreWsClient\nWebSocket client\nsimd_r_drive_server_config"]
end
    
    subgraph "Facade Layer"
        UsGaapStore["UsGaapStore\nUnified data access\nOrchestrates ingestion & retrieval"]
end
    
    subgraph "Core Operations"
        Ingest["ingest_us_gaap_csvs()\nCSV → Database → WebSocket"]
GenPCA["generate_pca_embeddings()\nPCA compression pipeline"]
Lookup["lookup_by_index()\nRetrieve triplet + metadata"]
end
    
 
   DbUsGaap --> UsGaapStore
 
   DataStoreWsClient --> UsGaapStore
 
   UsGaapStore --> Ingest
 
   UsGaapStore --> GenPCA
 
   UsGaapStore --> Lookup

Component | Type | Purpose
--- | --- | ---
DbUsGaap | Database Interface | Provides async MySQL access to stored US GAAP data
DataStoreWsClient | WebSocket Client | Connects to simd-r-drive server for key-value storage
UsGaapStore | Facade | Coordinates ingestion, embedding generation, and data retrieval

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40

CSV Data Ingestion

The ingestion process begins by reading CSV files produced by the Rust sec-fetcher application. These files are organized by ticker symbol and contain US GAAP fundamental concepts extracted from SEC filings.

Directory Structure

CSV files are read from a configurable directory path that points to the Rust application's output:

graph LR
 
   ProjectPaths["project_paths.rust_data"] --> CSVDir["csv_data_dir\nPath to CSV directory"]
CSVDir --> Files["Symbol CSV Files\nAAPL.csv, GOOGL.csv, etc."]
Files --> Walker["walk_us_gaap_csvs()\nGenerator function"]

The ingestion system walks through the directory structure, processing one CSV file per symbol. Each file contains multiple filing entries with concept/value/unit triplets.

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:20-23

Ingestion Invocation

The primary ingestion method is us_gaap_store.ingest_us_gaap_csvs(), which accepts a CSV directory path and database connection:
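
A sketch of the call as described above; the exact parameter list of ingest_us_gaap_csvs() and the directory layout are assumptions.

```python
# Hypothetical invocation; argument names are assumptions based on the
# description above (a CSV directory path plus a database connection).
csv_data_dir = project_paths.rust_data / "truncated_csvs"
us_gaap_store.ingest_us_gaap_csvs(csv_data_dir, db_us_gaap)
```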

This method orchestrates:

  1. CSV file walking and parsing
  2. Concept/unit pair extraction
  3. Value aggregation per pair
  4. Scaler fitting and transformation
  5. Storage in both MySQL and simd-r-drive

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67

Concept/Unit Pair Extraction

Financial data in CSV files contains concept/value/unit triplets. The preprocessing pipeline extracts unique (concept, unit) pairs and groups all values associated with each pair.

Data Structure

Each row in the CSV contains:

  • Concept : A FundamentalConcept variant (e.g., "Assets", "Revenues", "NetIncomeLoss")
  • Unit of Measure (UOM) : The measurement unit (e.g., "USD", "shares", "USD_per_share")
  • Value : The numeric value reported in the filing
  • Additional Metadata : Period type, balance type, filing date, etc.

Grouping Strategy

Values are grouped by (concept, unit) combinations because:

  • Different concepts have different value ranges (e.g., Assets vs. EPS)
  • The same concept in different units requires separate scaling (e.g., Revenue in USD vs. Revenue in thousands)
  • This grouping enables per-pair statistical normalization

Each pair becomes associated with:

  • A semantic embedding vector
  • A fitted RobustScaler instance
  • A collection of normalized values

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:190-249

Semantic Embedding Generation

The system generates semantic embeddings for each unique (concept, unit) pair using a pre-trained language model. These embeddings capture the semantic meaning of the financial concept and its measurement unit.

graph LR
 
   Pairs["Concept/Unit Pairs\n(Revenues, USD)\n(Assets, USD)\n(NetIncomeLoss, USD)"] --> Model["Embedding Model\nPre-trained transformer"]
Model --> Vectors["Embedding Vectors\nFixed dimensionality\nSemantic representation"]
Vectors --> EmbedMap["embedding_map\nDict[(concept, unit) → embedding]"]

Embedding Process

The generate_concept_unit_embeddings() function creates fixed-dimensional vector representations:
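
The notebook call itself is not shown here. As an illustration of the idea, the sketch below embeds each (concept, unit) pair with a sentence-transformers model; the model choice and the actual signature of generate_concept_unit_embeddings() are assumptions.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative only: encode each (concept, unit) pair as a single text string.
model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim model; choice is an assumption
pairs = [("Revenues", "USD"), ("Assets", "USD"), ("NetIncomeLoss", "USD")]
texts = [f"{concept} ({unit})" for concept, unit in pairs]
vectors = np.asarray(model.encode(texts))          # shape: (len(pairs), 384)
embedding_map = dict(zip(pairs, vectors))
```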

Embedding Properties

Property | Description
--- | ---
Dimensionality | Fixed size (typically 384 or 768 dimensions)
Semantic Preservation | Similar concepts produce similar embeddings
Deterministic | Same input always produces same output
Device-Agnostic | Can be computed on CPU or GPU

The embeddings are stored in a mapping structure that associates each (concept, unit) pair with its corresponding vector. This mapping is used throughout the preprocessing and training pipelines.

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:279-292

Value Normalization with RobustScaler

Financial values exhibit extreme variance and outliers. The preprocessing pipeline uses RobustScaler from scikit-learn to normalize values within each (concept, unit) group.

RobustScaler Characteristics

RobustScaler is chosen over standard normalization techniques because:

  • Uses median and interquartile range (IQR) instead of mean and standard deviation
  • Resistant to outliers that are common in financial data
  • Scales each feature independently
  • Preserves zero values when appropriate

Scaling Formula

For each value in a (concept, unit) group:

scaled_value = (raw_value - median) / IQR

Where:

  • median is the 50th percentile of all values in the group
  • IQR is the interquartile range (75th percentile - 25th percentile)
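
A minimal sketch of this per-pair scaling with scikit-learn's RobustScaler, which centers on the median and scales by the IQR by default; the grouping shown here is illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Values grouped by (concept, unit); one scaler is fitted per group.
values_by_pair = {
    ("Revenues", "USD"): np.array([[1.2e9], [9.8e8], [4.1e10], [2.5e9]]),
    ("EarningsPerShareBasic", "USD_per_share"): np.array([[1.1], [0.4], [12.7]]),
}

scalers, scaled_by_pair = {}, {}
for pair, values in values_by_pair.items():
    scaler = RobustScaler()                       # (x - median) / IQR
    scaled_by_pair[pair] = scaler.fit_transform(values)
    scalers[pair] = scaler                        # persisted for later inverse_transform
```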

Scaler Persistence

Each fitted RobustScaler instance is stored alongside the scaled values. This enables:

  • Inverse transformation for interpreting model outputs
  • Validation of scaling correctness
  • Consistent processing of new data points

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:108-166

PCA Dimensionality Reduction

The preprocessing pipeline applies Principal Component Analysis (PCA) to compress semantic embeddings while preserving most of the variance. This reduces memory footprint and training time.

PCA Configuration

Parameter | Value | Purpose
--- | --- | ---
variance_threshold | 0.95 | Retain 95% of original variance
fit_data | All concept/unit embeddings | Learn principal components
transform_data | Same embeddings | Apply compression
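
scikit-learn's PCA can express this variance threshold directly by passing a float as n_components; the sketch below uses a stand-in embedding matrix and is not the project's wrapper code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in embedding matrix: one row per (concept, unit) pair.
embedding_matrix = np.random.default_rng(0).normal(size=(500, 384))

pca = PCA(n_components=0.95)      # keep enough components to explain 95% of variance
compressed = pca.fit_transform(embedding_matrix)
print(pca.n_components_, "components retained; compressed shape:", compressed.shape)
```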

Variance Explanation Visualization

The system provides visualization tools to analyze PCA effectiveness. The plot_pca_explanation() function generates plots showing:

  • Cumulative explained variance vs. number of components
  • Individual component variance contribution
  • The optimal number of components to retain 95% variance

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-270

graph TB
    subgraph "Triplet Components"
        Concept["concept: str\nFundamentalConcept variant"]
Unit["unit: str\nUnit of measure"]
ScaledVal["scaled_value: float\nRobustScaler output"]
UnscaledVal["unscaled_value: float\nOriginal raw value"]
Scaler["scaler: RobustScaler\nFitted scaler instance"]
Embedding["embedding: ndarray\nPCA-compressed vector"]
end
    
    subgraph "Storage Format"
        Index["Index: int\nSequential triplet ID"]
Serialized["Serialized Data\nPickled Python object"]
end
    
 
   Concept --> Serialized
 
   Unit --> Serialized
 
   ScaledVal --> Serialized
 
   UnscaledVal --> Serialized
 
   Scaler --> Serialized
 
   Embedding --> Serialized
 
   Index --> Serialized
    
 
   Serialized --> WebSocket["DataStoreWsClient\nWebSocket write"]

Triplet Storage Structure

The final output of preprocessing is a collection of triplets stored in the simd-r-drive key-value store via DataStoreWsClient. Each triplet contains all information needed for training and inference.

Triplet Schema

Storage Interface

The UsGaapStore facade provides methods for retrieving stored triplets:

Method | Purpose
--- | ---
lookup_by_index(idx: int) | Retrieve single triplet by sequential index
batch_lookup_by_indices(indices: List[int]) | Retrieve multiple triplets efficiently
get_triplet_count() | Get total number of stored triplets
get_pair_count() | Get number of unique (concept, unit) pairs
get_embedding_matrix() | Retrieve full embedding matrix and pair list

Retrieval Example
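
A sketch of retrieving a single triplet by index; the dictionary field names follow the schema above, but the exact return structure is an assumption.

```python
total = us_gaap_store.get_triplet_count()
triplet = us_gaap_store.lookup_by_index(0)     # any index in range(total)

concept = triplet["concept"]                   # e.g. "Revenues"
unit = triplet["unit"]                         # e.g. "USD"
scaled_value = triplet["scaled_value"]         # RobustScaler output
embedding = triplet["embedding"]               # PCA-compressed vector
scaler = triplet["scaler"]                     # fitted RobustScaler for inverse_transform
```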

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:78-97

Data Validation

The preprocessing pipeline includes validation mechanisms to ensure data integrity throughout the transformation process.

Scaler Verification

A validation check confirms that stored scaled values match the transformation applied by the stored scaler:

graph LR
 
   Lookup["Lookup triplet"] --> Extract["Extract unscaled_value\nand scaler"]
Extract --> Transform["scaler.transform()"]
Transform --> Compare["np.isclose()\ncheck vs scaled_value"]
Compare --> Pass["Validation Pass"]
Compare --> Fail["Validation Fail"]

This test:

  1. Retrieves a random triplet
  2. Re-applies the stored scaler to the unscaled value
  3. Verifies the result matches the stored scaled value
  4. Uses np.isclose() for floating-point tolerance
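
A sketch of this check, assuming the triplet dictionary fields shown earlier on this page:

```python
import random
import numpy as np

idx = random.randrange(us_gaap_store.get_triplet_count())
triplet = us_gaap_store.lookup_by_index(idx)

# Re-apply the stored scaler to the stored raw value.
rescaled = triplet["scaler"].transform([[triplet["unscaled_value"]]])[0][0]

# np.isclose tolerates floating-point round-off.
assert np.isclose(rescaled, triplet["scaled_value"])
```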

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:91-96

Visualization Tools

The preprocessing module provides visualization utilities for analyzing embedding quality and PCA effectiveness.

PCA Explanation Plot

The plot_pca_explanation() function generates cumulative variance plots:

This visualization shows:

  • How many principal components are needed to reach the variance threshold
  • The diminishing returns of adding more components
  • The optimal dimensionality for the compressed embeddings

Semantic Embedding Scatterplot

The plot_semantic_embeddings() function creates 2D projections of embeddings:

This visualization helps assess:

  • Clustering patterns in concept/unit pairs
  • Separation between different financial concepts
  • Embedding space structure

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-270 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:337-343

graph LR
    subgraph "Preprocessing Output"
        Triplets["Stored Triplets\nsimd-r-drive storage"]
end
    
    subgraph "Training Input"
        Dataset["IterableConceptValueDataset\nStreaming data loader"]
Collate["collate_with_scaler()\nBatch construction"]
DataLoader["PyTorch DataLoader\nTraining batches"]
end
    
 
   Triplets --> Dataset
 
   Dataset --> Collate
 
   Collate --> DataLoader

Integration with Training Pipeline

The preprocessed data serves as input to the machine learning training pipeline. The stored triplets are loaded via streaming datasets during training.

Component | Purpose
--- | ---
Triplets in simd-r-drive | Persistent storage of preprocessed data
IterableConceptValueDataset | Streams triplets without loading all into memory
collate_with_scaler | Custom collation function for batch processing
PyTorch DataLoader | Manages batching, shuffling, and parallel loading

For detailed information about the training pipeline, see Machine Learning Training Pipeline.

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-522

Performance Considerations

The preprocessing pipeline is designed to handle large-scale financial datasets efficiently.

Memory Management

Strategy | Implementation
--- | ---
Streaming CSV reading | Process files one at a time, not all in memory
Incremental scaler fitting | Use online algorithms for large groups
Compressed storage | PCA reduces embedding footprint by ~50-70%
WebSocket batching | DataStoreWsClient batches writes for efficiency
graph TB
    subgraph "Rust Caching Layer"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nhttp_storage_cache.bin\n1 week TTL"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\npreprocessor_cache.bin\nPersistent storage"]
end
    
    subgraph "Python Preprocessing"
        CSVRead["Read CSV files"]
Process["Transform & normalize"]
end
    
 
   HTTPCache -.->|Cached SEC data| CSVRead
 
   PreCache -.->|Cached computations| Process

Caching Strategy

The Rust layer provides HTTP and preprocessor caches that the preprocessing pipeline can leverage:

Sources: src/caches.rs:1-66



Machine Learning Training Pipeline

Relevant source files

Purpose and Scope

This page documents the machine learning training pipeline for the narrative_stack system, specifically the Stage 1 autoencoder that learns latent representations of US GAAP financial concepts. The training pipeline consumes preprocessed concept/unit/value triplets and their semantic embeddings (see Data Ingestion & Preprocessing) to train a neural network that can encode financial data into a compressed latent space.

The pipeline uses PyTorch Lightning for training orchestration, implements custom iterable datasets for efficient data streaming from the simd-r-drive WebSocket server, and provides comprehensive experiment tracking through TensorBoard. For details on the underlying storage mechanisms and data retrieval, see Database & Storage Integration.

Training Pipeline Architecture

The training pipeline operates as a streaming system that continuously fetches preprocessed triplets from the UsGaapStore and feeds them through the autoencoder model. The architecture emphasizes memory efficiency and reproducibility.

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:456-556

graph TB
    subgraph "Data Source Layer"
        DataStoreWsClient["DataStoreWsClient\n(simd_r_drive_ws_client)"]
UsGaapStore["UsGaapStore\nlookup_by_index()"]
end
    
    subgraph "Dataset Layer"
        IterableDataset["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
CollateFunction["collate_with_scaler()\nBatch construction"]
end
    
    subgraph "PyTorch Lightning Training Loop"
        DataLoader["DataLoader\nbatch_size from hparams\nnum_workers=2\npin_memory=True\npersistent_workers=True"]
Model["Stage1Autoencoder\nEncoder → Latent → Decoder"]
Optimizer["Adam Optimizer\n+ CosineAnnealingWarmRestarts\nReduceLROnPlateau"]
Trainer["pl.Trainer\nEarlyStopping\nModelCheckpoint\ngradient_clip_val"]
end
    
    subgraph "Monitoring & Persistence"
        TensorBoard["TensorBoardLogger\ntrain_loss\nval_loss_epoch\nlearning_rate"]
Checkpoints["Model Checkpoints\n.ckpt files\nsave_top_k=1\nmonitor='val_loss_epoch'"]
end
    
 
   DataStoreWsClient --> UsGaapStore
 
   UsGaapStore --> IterableDataset
 
   IterableDataset --> CollateFunction
 
   CollateFunction --> DataLoader
    
 
   DataLoader --> Model
 
   Model --> Optimizer
 
   Optimizer --> Trainer
    
 
   Trainer --> TensorBoard
 
   Trainer --> Checkpoints
    
 
   Checkpoints -.->|Resume training| Model

Stage1Autoencoder Model

Model Architecture

The Stage1Autoencoder is a fully-connected autoencoder that learns to compress financial concept embeddings combined with their scaled values into a lower-dimensional latent space. The model reconstructs its input, forcing the latent representation to capture the most important features.

graph LR
    Input["Input Tensor\n[embedding + scaled_value]\nDimension: embedding_dim + 1"]
Encoder["Encoder Network\nfc1 → dropout → ReLU\nfc2 → dropout → ReLU"]
Latent["Latent Space\nDimension: latent_dim"]
Decoder["Decoder Network\nfc3 → dropout → ReLU\nfc4 → dropout → output"]
Output["Reconstructed Input\nSame dimension as input"]
Loss["MSE Loss\ninput vs output"]
Input --> Encoder
 
   Encoder --> Latent
 
   Latent --> Decoder
 
   Decoder --> Output
 
   Output --> Loss
 
   Input --> Loss

Hyperparameters

The model exposes the following configurable hyperparameters through its hparams attribute:

Parameter | Description | Typical Value
--- | --- | ---
input_dim | Dimension of input (embedding + 1 for value) | Varies based on embedding size
latent_dim | Dimension of compressed latent space | 64-128
dropout_rate | Dropout probability for regularization | 0.0-0.2
lr | Initial learning rate | 1e-5 to 5e-5
min_lr | Minimum learning rate for scheduler | 1e-7 to 1e-6
batch_size | Training batch size | 32 (typical)
weight_decay | L2 regularization parameter | 1e-8 to 1e-4
gradient_clip | Maximum gradient norm | 0.0-1.0

The model can be instantiated with default parameters or loaded from a checkpoint with overridden hyperparameters:
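
The sketch below shows both patterns using Lightning's standard load_from_checkpoint API; the constructor arguments, hyperparameter names, and checkpoint path are assumptions.

```python
# Fresh model with assumed dimensions:
model = Stage1Autoencoder(input_dim=385, latent_dim=128)

# Or resume from a checkpoint while overriding the learning-rate hyperparameters:
model = Stage1Autoencoder.load_from_checkpoint(
    OUTPUT_PATH / "stage1_resume.ckpt",   # path pattern is an assumption
    lr=1e-5,
    min_lr=1e-7,
)
```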

Loading from checkpoint with modified learning rate: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490

Loss Function and Optimization

The model uses Mean Squared Error (MSE) loss between the input and reconstructed output. The optimization strategy combines:

  • Adam optimizer with configurable learning rate and weight decay
  • CosineAnnealingWarmRestarts scheduler for cyclical learning rate annealing
  • ReduceLROnPlateau for adaptive learning rate reduction when validation loss plateaus
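
A condensed sketch of an autoencoder LightningModule with this shape; layer sizes, names, and the scheduler wiring are assumptions (only CosineAnnealingWarmRestarts is shown), so this is not the repository's Stage1Autoencoder.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class TinyStage1Autoencoder(pl.LightningModule):
    """Illustrative stand-in for Stage1Autoencoder, not the actual implementation."""

    def __init__(self, input_dim: int, latent_dim: int = 128, dropout_rate: float = 0.1,
                 lr: float = 5e-5, min_lr: float = 1e-6, weight_decay: float = 1e-6,
                 batch_size: int = 32, gradient_clip: float = 1.0):
        super().__init__()
        self.save_hyperparameters()
        hidden = max(latent_dim * 2, 256)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.Dropout(dropout_rate), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Dropout(dropout_rate), nn.ReLU(),
            nn.Linear(hidden, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def training_step(self, batch, batch_idx):
        x, y, _scalers = batch                            # collate_with_scaler output
        loss = nn.functional.mse_loss(self(x), y)         # reconstruction error
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y, _scalers = batch
        loss = nn.functional.mse_loss(self(x), y)
        self.log("val_loss_epoch", loss, on_epoch=True, prog_bar=True)

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=self.hparams.lr,
                               weight_decay=self.hparams.weight_decay)
        sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
            opt, T_0=10, eta_min=self.hparams.min_lr)
        return {"optimizer": opt, "lr_scheduler": sched}
```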

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490

Dataset and Data Loading

IterableConceptValueDataset

The IterableConceptValueDataset is a custom PyTorch IterableDataset that streams training data from the UsGaapStore without loading the entire dataset into memory. This design enables training on datasets larger than available RAM.

Key characteristics:

graph TB
    subgraph "Dataset Initialization"
        Config["simd_r_drive_server_config\nhost + port"]
Params["Dataset Parameters\ninternal_batch_size\nreturn_scaler\nshuffle"]
end
    
    subgraph "Data Streaming Process"
        Store["UsGaapStore instance\nget_triplet_count()"]
IndexGen["Index Generator\nSequential or shuffled\nbased on shuffle param"]
BatchFetch["Internal Batching\nFetch internal_batch_size items\nvia batch_lookup_by_indices()"]
Unpack["Unpack Triplet Data\nembedding\nscaled_value\nscaler (optional)"]
end
    
    subgraph "Output"
        Tensor["PyTorch Tensors\nx: [embedding + scaled_value]\ny: [embedding + scaled_value]\nscaler: RobustScaler object"]
end
    
 
   Config --> Store
 
   Params --> IndexGen
 
   Store --> IndexGen
 
   IndexGen --> BatchFetch
 
   BatchFetch --> Unpack
 
   Unpack --> Tensor
  • Iterable streaming : Data is fetched on-demand during iteration, not preloaded
  • Internal batching : Fetches internal_batch_size items at once from the WebSocket server to reduce network overhead (typically 64)
  • Optional shuffling : Can randomize index order for training or maintain sequential order for validation
  • Scaler inclusion : Optionally returns the RobustScaler object with each sample for validation purposes
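
A minimal sketch of such a streaming dataset, assuming the UsGaapStore methods described on this page; the triplet field names ("embedding", "scaled_value", "scaler") are assumptions based on how the data is described here, and the class is illustrative rather than the actual implementation:

```python
import random
from typing import Iterator, Tuple

import torch
from torch.utils.data import IterableDataset


class IterableConceptValueDatasetSketch(IterableDataset):
    """Illustrative streaming dataset; not the actual narrative_stack class."""

    def __init__(self, us_gaap_store, internal_batch_size: int = 64,
                 shuffle: bool = False, return_scaler: bool = False):
        self.store = us_gaap_store
        self.internal_batch_size = internal_batch_size
        self.shuffle = shuffle
        self.return_scaler = return_scaler

    def __iter__(self) -> Iterator[Tuple]:
        indices = list(range(self.store.get_triplet_count()))
        if self.shuffle:
            random.shuffle(indices)
        # Fetch internal_batch_size items per round trip to reduce network overhead.
        for start in range(0, len(indices), self.internal_batch_size):
            chunk = indices[start:start + self.internal_batch_size]
            for item in self.store.batch_lookup_by_indices(chunk):
                # x concatenates the concept/unit embedding with the scaled value.
                x = torch.tensor([*item["embedding"], item["scaled_value"]],
                                 dtype=torch.float32)
                y = x.clone()                      # autoencoder target == input
                if self.return_scaler:
                    yield x, y, item["scaler"]
                else:
                    yield x, y
```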

Dataset instantiation: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

DataLoader Configuration

PyTorch DataLoader instances wrap the iterable dataset with additional configuration for efficient batch loading:

| Parameter | Value | Purpose |
| --- | --- | --- |
| batch_size | From model.hparams.batch_size | Outer batch size for model training |
| collate_fn | collate_with_scaler | Custom collation function to handle scaler objects |
| num_workers | 2 | Number of parallel data loading processes |
| pin_memory | True | Enables faster host-to-GPU transfers |
| persistent_workers | True | Keeps worker processes alive between epochs |
| prefetch_factor | 4 | Number of batches each worker pre-fetches |

Training loader uses shuffle=True in the dataset while validation loader uses shuffle=False to ensure reproducible validation metrics.
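
A sketch of this DataLoader configuration, using the parameters from the table above and assuming the model and the train/validation dataset instances described in the previous section:

```python
from torch.utils.data import DataLoader

common = dict(
    batch_size=model.hparams.batch_size,
    collate_fn=collate_with_scaler,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=4,
)

train_loader = DataLoader(train_dataset, **common)  # dataset created with shuffle=True
val_loader = DataLoader(val_dataset, **common)      # dataset created with shuffle=False
```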

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

collate_with_scaler Function

The collate_with_scaler function handles batch construction when the dataset returns triplets (x, y, scaler). It stacks the tensors into batches while preserving the scaler objects in a list:
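
A minimal sketch of such a collate function, assuming each batch item is an (x, y, scaler) triplet:

```python
import torch

def collate_with_scaler(batch):
    """Stack tensors into batches while keeping the scaler objects in a list."""
    xs, ys, scalers = zip(*batch)
    return torch.stack(xs), torch.stack(ys), list(scalers)
```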

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb506-507

graph LR
    Input["Batch Items\n[(x1, y1, scaler1),\n(x2, y2, scaler2),\n...]"]
Stack["Stack Tensors\ntorch.stack(x_list)\ntorch.stack(y_list)"]
Output["Batched Output\n(x_batch, y_batch, scaler_list)"]
Input --> Stack
 
   Stack --> Output

Training Configuration

PyTorch Lightning Trainer Setup

The training pipeline uses PyTorch Lightning's Trainer class to orchestrate the training loop, validation, and callbacks:

Configuration values (a Trainer setup sketch follows the list below):

graph TB
    subgraph "Training Configuration"
        MaxEpochs["max_epochs\nTypically 1000"]
Accelerator["accelerator='auto'\ndevices=1"]
GradientClip["gradient_clip_val\nFrom model.hparams"]
end
    
    subgraph "Callbacks"
        EarlyStopping["EarlyStopping\nmonitor='val_loss_epoch'\npatience=20\nmode='min'"]
ModelCheckpoint["ModelCheckpoint\nmonitor='val_loss_epoch'\nsave_top_k=1\nfilename='stage1_resume'"]
end
    
    subgraph "Logging"
        TensorBoardLogger["TensorBoardLogger\nlog_dir=OUTPUT_PATH\nname='stage1_autoencoder'"]
end
    
    Trainer["pl.Trainer"]
MaxEpochs --> Trainer
 
   Accelerator --> Trainer
 
   GradientClip --> Trainer
 
   EarlyStopping --> Trainer
 
   ModelCheckpoint --> Trainer
 
   TensorBoardLogger --> Trainer
  • EPOCHS : 1000 (allows for long training with early stopping)
  • PATIENCE : 20 (compensates for learning rate annealing periods)
  • OUTPUT_PATH : project_paths.python_data / "stage1_23_(no_pre_dedupe)" or similar versioned directory
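
A sketch of this Trainer setup, assuming the model from the sections above; the OUTPUT_PATH directory name below is hypothetical and stands in for the versioned directory described above:

```python
from pathlib import Path

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

OUTPUT_PATH = Path("stage1_output")   # hypothetical; see OUTPUT_PATH above
EPOCHS = 1000
PATIENCE = 20

early_stopping = EarlyStopping(monitor="val_loss_epoch", patience=PATIENCE, mode="min")
checkpoint = ModelCheckpoint(
    dirpath=OUTPUT_PATH,
    filename="stage1_resume",
    monitor="val_loss_epoch",
    save_top_k=1,
    mode="min",
)
logger = TensorBoardLogger(save_dir=OUTPUT_PATH, name="stage1_autoencoder")

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    accelerator="auto",
    devices=1,
    gradient_clip_val=model.hparams.gradient_clip,
    callbacks=[early_stopping, checkpoint],
    logger=logger,
)
```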

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb468-548

Callbacks

EarlyStopping

The EarlyStopping callback monitors val_loss_epoch and stops training if no improvement occurs for 20 consecutive epochs. This prevents overfitting and saves computational resources.

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb528-530

ModelCheckpoint

The ModelCheckpoint callback saves the best model based on validation loss:

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb532-539

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb528-539

Training Execution

The training process is initiated with the Trainer.fit() method:

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556
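
A sketch of the call, assuming the trainer, model, and loaders defined above; the commented line shows the optional resume form:

```python
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

# Or resume from a saved checkpoint (restores optimizer/scheduler state as well):
# trainer.fit(model, train_loader, val_loader,
#             ckpt_path=OUTPUT_PATH / "stage1_resume.ckpt")
```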

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556

Checkpointing and Resuming Training

Saving Checkpoints

Checkpoints are automatically saved by the ModelCheckpoint callback when validation loss improves. The checkpoint file contains:

  • Model weights (encoder and decoder parameters)
  • Optimizer state
  • Learning rate scheduler state
  • All hyperparameters (hparams)
  • Current epoch number
  • Best validation loss

Checkpoint files are saved with the .ckpt extension in the OUTPUT_PATH directory.

Loading and Resuming

The training pipeline supports two modes of checkpoint loading:

1. Resume Training with Existing Configuration

Pass ckpt_path to trainer.fit() to resume training with the exact state from the checkpoint:

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb555

2. Fine-tune with Modified Hyperparameters

Load the checkpoint explicitly and override specific hyperparameters before training:

This approach is useful for fine-tuning with a lower learning rate after initial training convergence.

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb479-486

Checkpoint File Naming

The checkpoint filename is determined by the ModelCheckpoint configuration:

  • filename : "stage1_resume" produces files like stage1_resume.ckpt or stage1_resume-v1.ckpt, stage1_resume-v2.ckpt, etc.
  • Versioning : PyTorch Lightning automatically increments version numbers when the same filename exists

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb475-486 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb532-539 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556

graph TB
    subgraph "Logged Metrics"
        TrainLoss["train_loss\nPer-batch training loss"]
ValLoss["val_loss_epoch\nEpoch-averaged validation loss"]
LearningRate["learning_rate\nCurrent optimizer LR"]
end
    
    subgraph "TensorBoard Logger"
        Logger["TensorBoardLogger\nsave_dir=OUTPUT_PATH\nname='stage1_autoencoder'"]
end
    
    subgraph "Log Directory Structure"
        OutputPath["OUTPUT_PATH/"]
VersionDir["stage1_autoencoder/"]
Events["events.out.tfevents.*\nhparams.yaml"]
Checkpoints["*.ckpt checkpoint files"]
end
    
 
   TrainLoss --> Logger
 
   ValLoss --> Logger
 
   LearningRate --> Logger
    
 
   Logger --> OutputPath
 
   OutputPath --> VersionDir
 
   VersionDir --> Events
 
   OutputPath --> Checkpoints

Monitoring and Logging

TensorBoard Integration

The TensorBoardLogger automatically logs training metrics for visualization in TensorBoard:

Log directory structure:

OUTPUT_PATH/
├── stage1_autoencoder/
│   └── version_0/
│       ├── events.out.tfevents.*
│       ├── hparams.yaml
│       └── checkpoints/
├── stage1_resume.ckpt
└── optuna_study.db (if using Optuna)

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb543

Visualizing Training Progress

Launch TensorBoard to monitor training by pointing it at the log directory, for example with tensorboard --logdir set to OUTPUT_PATH.

TensorBoard displays:

  • Loss curves : Training and validation loss over epochs
  • Learning rate schedule : Visualization of LR changes during training
  • Hyperparameters : Table of all model hyperparameters
  • Gradient histograms : Distribution of gradients (if enabled)

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb541-548

Training Workflow Summary

The complete training workflow follows this sequence:

This pipeline design enables:

  • Memory-efficient training through streaming data loading
  • Reproducible experiments with comprehensive logging
  • Flexible resumption with hyperparameter overrides
  • Robust training with gradient clipping and early stopping

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb456-556



Database & Storage Integration

Relevant source files

Purpose and Scope

This document describes the database and storage infrastructure used by the Python narrative_stack system to persist and retrieve US GAAP financial data. The system employs a dual-storage architecture: MySQL for structured financial records and a WebSocket-based key-value store (simd-r-drive) for high-performance embedding and triplet data. This page covers the DbUsGaap database interface, DataStoreWsClient WebSocket integration, UsGaapStore facade pattern, triplet storage format, and embedding matrix management.

For information about CSV ingestion and preprocessing operations that populate these storage systems, see Data Ingestion & Preprocessing. For information about how the training pipeline consumes this stored data, see Machine Learning Training Pipeline.

Storage Architecture Overview

The narrative_stack system uses two complementary storage backends to optimize for different data access patterns:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40

graph TB
    subgraph "Python narrative_stack"
        UsGaapStore["UsGaapStore\nUnified Facade"]
Preprocessing["Preprocessing Pipeline\ningest_us_gaap_csvs\ngenerate_pca_embeddings"]
Training["ML Training\nIterableConceptValueDataset"]
end
    
    subgraph "Storage Backends"
        DbUsGaap["DbUsGaap\nMySQL Interface\nus_gaap_test database"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client\nsimd-r-drive protocol"]
end
    
    subgraph "External Services"
        MySQL[("MySQL Server\nStructured Records\nSymbol/Concept/Value")]
        SimdRDrive["simd-r-drive-ws-server\nKey-Value Store\nEmbeddings/Triplets/Scalers"]
end
    
 
   Preprocessing -->|ingest CSV data| UsGaapStore
 
   UsGaapStore -->|store records| DbUsGaap
 
   UsGaapStore -->|store triplets/embeddings| DataStoreWsClient
 
   DbUsGaap -->|SQL queries| MySQL
 
   DataStoreWsClient -->|WebSocket protocol| SimdRDrive
 
   UsGaapStore -->|retrieve triplets| Training
 
   Training -->|batch lookups| UsGaapStore

The dual-storage design separates concerns:

  • MySQL stores structured financial data with SQL query capabilities for filtering and aggregation
  • simd-r-drive stores high-dimensional embeddings and preprocessed triplets for fast sequential access during training

MySQL Database Interface (DbUsGaap)

The DbUsGaap class provides an abstraction layer over MySQL database operations for US GAAP financial records.

Database Schema

The system expects a MySQL database named us_gaap_test with the following structure:

| Table/Field | Type | Purpose |
| --- | --- | --- |
| Raw records table | - | Stores original CSV-ingested data |
| symbol | VARCHAR | Company ticker symbol |
| concept | VARCHAR | US GAAP concept name |
| unit | VARCHAR | Unit of measurement (USD, shares, etc.) |
| value | DECIMAL | Numeric value for the concept |
| form | VARCHAR | SEC form type (10-K, 10-Q, etc.) |
| filing_date | DATE | Date of filing |

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:30-35

DbUsGaap Connection Configuration

The DbUsGaap class is instantiated with configuration from db_config:
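
A sketch of the instantiation, assuming DbUsGaap accepts keyword connection parameters; the exact constructor signature may differ, and the values below (drawn from the local test setup) are illustrative:

```python
# Hypothetical db_config; real values come from the project's configuration.
db_config = {
    "host": "127.0.0.1",
    "port": 3306,
    "user": "root",
    "password": "onlylocal",
    "database": "us_gaap_test",
}

db_us_gaap = DbUsGaap(**db_config)
```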

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-18

The database connection is established during initialization and used for querying structured financial data during the ingestion phase. The DbUsGaap interface supports filtering by symbol, concept, and date ranges for selective data loading.

WebSocket Data Store (DataStoreWsClient)

The DataStoreWsClient provides a WebSocket-based interface to the simd-r-drive key-value store for high-performance binary data storage and retrieval.

simd-r-drive Integration

The simd-r-drive system is a custom WebSocket server that provides persistent key-value storage optimized for embedding matrices and preprocessed training data:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:33-40

graph TB
    subgraph "Python Client"
        WSClient["DataStoreWsClient\nWebSocket protocol handler"]
BatchWrite["batch_write\n[(key, value)]"]
BatchRead["batch_read\n[keys] → [values]"]
end
    
    subgraph "simd-r-drive Server"
        WSServer["simd-r-drive-ws-server\nWebSocket endpoint"]
DataStore["DataStore\nBinary file backend"]
end
    
 
   BatchWrite -->|WebSocket message| WSClient
 
   BatchRead -->|WebSocket message| WSClient
 
   WSClient <-->|ws://host:port| WSServer
 
   WSServer <-->|read/write| DataStore

Connection Initialization

The WebSocket client is instantiated with server configuration:
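
A sketch of the connection setup, assuming a host/port constructor fed from simd_r_drive_server_config; the values are illustrative:

```python
# Hypothetical host/port; in practice these come from simd_r_drive_server_config.
data_store_client = DataStoreWsClient(host="127.0.0.1", port=8080)
```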

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb38

Batch Operations

The DataStoreWsClient supports efficient batch read and write operations:

| Operation | Input | Output | Purpose |
| --- | --- | --- | --- |
| batch_write | [(key: bytes, value: bytes)] | None | Store multiple key-value pairs atomically |
| batch_read | [key: bytes] | [value: bytes] | Retrieve multiple values by key |

The WebSocket protocol enables low-latency access to stored embeddings and preprocessed triplets during training iterations.
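
An illustrative use of the batch operations, assuming the client created above; the keys and the serialized_* variables are hypothetical placeholders, since keys and values are raw bytes and serialization is the caller's concern:

```python
# Write two entries in one round trip.
data_store_client.batch_write([
    (b"concept_unit_pairs", serialized_pairs),   # e.g. pickled list of (concept, unit)
    (b"embedding_matrix", serialized_matrix),    # e.g. serialized ndarray bytes
])

# Read them back by key, in order.
pairs_bytes, matrix_bytes = data_store_client.batch_read([
    b"concept_unit_pairs",
    b"embedding_matrix",
])
```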

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:50-55

Rust-side Cache Storage

On the Rust side, the simd-r-drive storage system is accessed through the Caches module, which provides two cache stores:

Sources: src/caches.rs:7-65

graph TB
    ConfigManager["ConfigManager\ncache_base_dir"]
CachesInit["Caches::init\nInitialize cache stores"]
subgraph "Cache Stores"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nhttp_storage_cache.bin\nOnceLock<Arc<DataStore>>"]
PreprocessorCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\npreprocessor_cache.bin\nOnceLock<Arc<DataStore>>"]
end
    
    HTTPCacheGet["Caches::get_http_cache_store"]
PreprocessorCacheGet["Caches::get_preprocessor_cache"]
ConfigManager -->|cache_base_dir path| CachesInit
 
   CachesInit -->|open DataStore| HTTPCache
 
   CachesInit -->|open DataStore| PreprocessorCache
 
   HTTPCache -->|Arc clone| HTTPCacheGet
 
   PreprocessorCache -->|Arc clone| PreprocessorCacheGet

The Rust implementation uses OnceLock<Arc<DataStore>> for thread-safe lazy initialization of cache stores. The DataStore::open method creates or opens persistent binary files at paths derived from ConfigManager:

  • http_storage_cache.bin - HTTP response caching
  • preprocessor_cache.bin - Preprocessed data caching

UsGaapStore Facade

The UsGaapStore class provides a unified interface that coordinates operations across both storage backends (MySQL and simd-r-drive):

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb40 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-68

graph TB
    subgraph "UsGaapStore Facade"
        IngestCSV["ingest_us_gaap_csvs\nCSV → MySQL + triplets"]
GeneratePCA["generate_pca_embeddings\nCompress embeddings"]
LookupByIndex["lookup_by_index\nRetrieve triplet"]
BatchLookup["batch_lookup_by_indices\nBulk retrieval"]
GetEmbeddingMatrix["get_embedding_matrix\nReturn all embeddings"]
GetTripletCount["get_triplet_count\nTotal stored triplets"]
GetPairCount["get_pair_count\nUnique concept/unit pairs"]
end
    
    subgraph "Backend Operations"
        MySQLWrite["DbUsGaap.insert\nStore raw records"]
SimdWrite["DataStoreWsClient.batch_write\nStore triplets/embeddings"]
SimdRead["DataStoreWsClient.batch_read\nRetrieve data"]
end
    
 
   IngestCSV -->|raw data| MySQLWrite
 
   IngestCSV -->|triplets| SimdWrite
 
   GeneratePCA -->|compressed embeddings| SimdWrite
 
   LookupByIndex -->|key query| SimdRead
 
   BatchLookup -->|batch keys| SimdRead
 
   GetEmbeddingMatrix -->|matrix key| SimdRead

UsGaapStore Initialization

The facade is initialized with a reference to the DataStoreWsClient:
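
A sketch of the initialization, assuming the facade takes the WebSocket client as its argument (the exact signature may differ):

```python
us_gaap_store = UsGaapStore(data_store_client)
```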

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb40

Key Methods

| Method | Parameters | Returns | Purpose |
| --- | --- | --- | --- |
| ingest_us_gaap_csvs | csv_data_dir: Path, db_us_gaap: DbUsGaap | None | Walk CSV directory, parse records, store to both backends |
| generate_pca_embeddings | None | None | Apply PCA dimensionality reduction to stored embeddings |
| lookup_by_index | index: int | Triplet dict | Retrieve single triplet by sequential index |
| batch_lookup_by_indices | indices: List[int] | List[Triplet dict] | Retrieve multiple triplets efficiently |
| get_embedding_matrix | None | (ndarray, List[Tuple]) | Return full embedding matrix and concept/unit pairs |
| get_triplet_count | None | int | Total number of stored triplets |
| get_pair_count | None | int | Number of unique (concept, unit) pairs |

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-96

The UsGaapStore abstracts storage implementation details, allowing preprocessing and training code to interact with a simple, high-level API.
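
An end-to-end usage sketch built from the methods in the table above, assuming the db_us_gaap and us_gaap_store objects from the previous sections and a hypothetical csv_data_dir path:

```python
from pathlib import Path

csv_data_dir = Path("data/us-gaap")          # hypothetical CSV directory

# Ingest CSVs into MySQL and simd-r-drive, then compress the embeddings.
us_gaap_store.ingest_us_gaap_csvs(csv_data_dir, db_us_gaap)
us_gaap_store.generate_pca_embeddings()

print(us_gaap_store.get_pair_count())        # unique (concept, unit) pairs
print(us_gaap_store.get_triplet_count())     # total stored triplets
triplet = us_gaap_store.lookup_by_index(0)   # single triplet by sequential index
```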

Triplet Storage Format

The core data structure stored in simd-r-drive is the triplet : a tuple of (concept, unit, scaled_value) along with associated metadata.

Triplet Structure

Each triplet stored in simd-r-drive contains the following fields:
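
The concrete field layout is defined in the notebook; an illustrative shape, with field names and example values reconstructed from how the data is described and consumed elsewhere on this page, might look like this:

```python
from sklearn.preprocessing import RobustScaler

# Illustrative only; field names and example values are assumptions.
triplet = {
    "concept": "Revenues",         # US GAAP concept name
    "unit": "USD",                 # unit of measurement
    "unscaled_value": 1.25e9,      # raw reported value
    "scaled_value": 0.42,          # value after RobustScaler transformation
    "embedding": [0.01, -0.07],    # PCA-compressed (concept, unit) embedding
    "scaler": RobustScaler(),      # fitted scaler for this (concept, unit) pair
}
```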

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:88-96

Scaler Validation

The storage system maintains consistency between stored values and scalers. During retrieval, the system can verify that the scaler produces the expected scaled value:
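
A sketch of such a check, assuming the triplet field names used in the example above; RobustScaler.transform expects a 2-D array:

```python
import numpy as np

triplet = us_gaap_store.lookup_by_index(0)
recomputed = triplet["scaler"].transform(
    np.array([[triplet["unscaled_value"]]])
)[0, 0]
assert np.isclose(recomputed, triplet["scaled_value"])
```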

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:92-96

This validation ensures that the RobustScaler instance stored with each (concept, unit) pair correctly transforms the unscaled value to the stored scaled value, maintaining data integrity across storage and retrieval cycles.

graph LR
    Index0["Index 0\nTriplet 0"]
Index1["Index 1\nTriplet 1"]
Index2["Index 2\nTriplet 2"]
IndexN["Index N\nTriplet N"]
BatchLookup["batch_lookup_by_indices\n[0, 1, 2, ..., N]"]
Index0 -->|retrieve| BatchLookup
 
   Index1 -->|retrieve| BatchLookup
 
   Index2 -->|retrieve| BatchLookup
 
   IndexN -->|retrieve| BatchLookup

Triplet Indexing

Triplets are stored with sequential integer indices, enabling efficient batch retrieval during training:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:87-89

The sequential indexing scheme allows the training dataset to iterate efficiently through all stored triplets without loading the entire dataset into memory.

graph TB
    subgraph "Embedding Matrix Generation"
        ConceptUnitPairs["Unique Pairs\n(concept, unit)\nCollected during ingestion"]
GenerateEmbeddings["generate_concept_unit_embeddings\nSentence transformer model"]
PCACompression["PCA Compression\nVariance threshold: 0.95"]
EmbeddingMatrix["Embedding Matrix\nShape: (n_pairs, embedding_dim)"]
end
    
    subgraph "Storage"
        StorePairs["Store pairs list\nKey: 'concept_unit_pairs'"]
StoreMatrix["Store matrix\nKey: 'embedding_matrix'"]
end
    
 
   ConceptUnitPairs -->|semantic text| GenerateEmbeddings
 
   GenerateEmbeddings -->|full embeddings| PCACompression
 
   PCACompression -->|compressed| EmbeddingMatrix
 
   EmbeddingMatrix --> StoreMatrix
 
   ConceptUnitPairs --> StorePairs

Embedding Matrix Management

The UsGaapStore maintains a pre-computed embedding matrix that maps each unique (concept, unit) pair to its semantic embedding vector.

Embedding Matrix Structure

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb68 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:263-269

Retrieval and Usage

The embedding matrix can be retrieved for analysis or visualization:
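
A short usage sketch based on the return shape listed in the methods table above:

```python
embedding_matrix, concept_unit_pairs = us_gaap_store.get_embedding_matrix()

print(embedding_matrix.shape)      # (n_pairs, embedding_dim)
print(concept_unit_pairs[0])       # a (concept, unit) tuple
```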

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:106-108 python/narrative_stack/notebooks/stage1_preprocessing.ipynb263

During training, each triplet's embedding is retrieved from storage rather than recomputed, ensuring consistent representations across training runs.

PCA Dimensionality Reduction

The generate_pca_embeddings method applies PCA to compress the raw semantic embeddings while retaining 95% of the variance:
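
A minimal sketch of variance-threshold PCA as scikit-learn expresses it, assuming full_embeddings is the raw (n_pairs, raw_dim) embedding matrix; this illustrates the 95% variance criterion rather than the method's actual internals:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)              # keep enough components for 95% variance
compressed = pca.fit_transform(full_embeddings)
print(pca.n_components_, compressed.shape)
```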

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb68

The PCA compression reduces storage requirements and computational overhead during training while maintaining semantic relationships between concept/unit pairs.

Data Flow Integration

The storage system integrates with both preprocessing and training workflows:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-68 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

Training Data Access Pattern

During training, the IterableConceptValueDataset streams triplets from storage:

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-512

The dataset communicates with the simd-r-drive WebSocket server to stream triplets, avoiding memory exhaustion on large datasets. The collate_with_scaler function combines triplets into batches while preserving scaler metadata for potential validation or analysis.

Integration Testing

The storage integration is validated through automated integration tests:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

The integration test script:

  1. Starts isolated Docker containers for MySQL and simd-r-drive using docker compose with project name us_gaap_it
  2. Waits for MySQL readiness using mysqladmin ping
  3. Creates the us_gaap_test database and loads the schema from tests/integration/assets/us_gaap_schema_2025.sql
  4. Runs pytest integration tests against the live services
  5. Tears down containers and volumes in cleanup trap

This ensures that the storage layer functions correctly across both backends before code is merged.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-38



Development Guide

Relevant source files

Purpose and Scope

This guide provides an overview of development practices, code organization, and workflows for contributing to the rust-sec-fetcher project. It covers environment setup, code organization principles, development workflows, and common development tasks.

For detailed information about specific development topics, see:

Development Environment Setup

Prerequisites

The project requires the following tools installed:

| Tool | Purpose | Version Requirement |
| --- | --- | --- |
| Rust | Core application development | 1.87+ |
| Python | ML pipeline and preprocessing | 3.8+ |
| Docker | Integration testing and services | Latest stable |
| Git LFS | Large file support for test assets | Latest stable |
| MySQL | Database for US GAAP storage | 5.7+ or 8.0+ |

Rust Development Setup

  1. Clone the repository and navigate to the root directory

  2. Build the Rust application with cargo build (add --release for an optimized build)

  3. Run cargo test to verify the setup

The Rust workspace is configured in Cargo.toml with all necessary dependencies declared. Key development dependencies include:

  • mockito for HTTP mocking in tests
  • tempfile for temporary file/directory creation in tests
  • tokio test macros for async test support

Python Development Setup

  1. Create a virtual environment, e.g. uv venv --python=3.12, and activate it with source .venv/bin/activate

  2. Install dependencies using uv: uv pip install -e . --group dev

  3. Verify the installation by running the integration tests (requires Docker): ./us_gaap_store_integration_test.sh from python/narrative_stack/

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Configuration Setup

The application requires a configuration file at ~/.config/sec-fetcher/config.toml or a custom path specified via command-line argument. At minimum, the configuration must set the email key, since SEC API requests require a valid contact email in the User-Agent header.

For non-interactive testing, use AppConfig directly in test code as shown in tests/config_manager_tests.rs:36-57

Sources: tests/config_manager_tests.rs:36-57 tests/sec_client_tests.rs:8-20

Code Organization and Architecture

Repository Structure

Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159 python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Module Dependency Flow

The dependency flow follows a layered architecture:

  1. Configuration Layer : ConfigManager loads settings from TOML files and credentials from keyring
  2. Network Layer : SecClient wraps HTTP client with caching and throttling middleware
  3. Data Fetching Layer : Network module functions fetch raw data from SEC APIs
  4. Transformation Layer : Transformers normalize raw data into standardized concepts
  5. Model Layer : Data structures represent domain entities

Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95

Development Workflow

Standard Development Cycle

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Running Tests Locally

Rust Unit Tests

Run all Rust tests with cargo test. Run a specific test module with cargo test --test sec_client_tests, and show test output with cargo test -- --nocapture.

Test Structure Mapping:

| Test File | Tests Component | Key Test Functions |
| --- | --- | --- |
| tests/config_manager_tests.rs | ConfigManager | test_load_custom_config, test_load_non_existent_config, test_fails_on_invalid_key |
| tests/sec_client_tests.rs | SecClient | test_user_agent, test_fetch_json_without_retry_success, test_fetch_json_with_retry_failure |

Sources: tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159

Python Integration Tests

Integration tests require Docker services. Run them via the provided shell script, ./us_gaap_store_integration_test.sh, from the python/narrative_stack/ directory.

This script performs the following steps as defined in python/narrative_stack/us_gaap_store_integration_test.sh:1-39:

  1. Activates Python virtual environment
  2. Installs dependencies with uv pip install -e . --group dev
  3. Starts Docker Compose services (db_test, simd_r_drive_ws_server_test)
  4. Waits for MySQL availability
  5. Creates us_gaap_test database
  6. Loads schema from tests/integration/assets/us_gaap_schema_2025.sql
  7. Runs pytest integration tests
  8. Tears down containers on exit

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Writing Tests

Unit Test Pattern (Rust)

The codebase follows standard Rust testing patterns with mockito for HTTP mocking.

Key patterns demonstrated in tests/sec_client_tests.rs:35-62:

  • Use #[tokio::test] for async tests
  • Create mockito::Server for HTTP endpoint mocking
  • Construct AppConfig programmatically for test isolation
  • Use ConfigManager::from_app_config() to bypass file system dependencies
  • Assert on specific JSON fields in responses

Sources: tests/sec_client_tests.rs:35-62

Test Fixture Pattern

The codebase uses temporary directories for file-based tests.

This pattern ensures test isolation and automatic cleanup as shown in tests/config_manager_tests.rs:8-17

Sources: tests/config_manager_tests.rs:8-17

Error Case Testing

Test error conditions explicitly. For example, the test at tests/sec_client_tests.rs:93-120 verifies retry behavior by expecting exactly 3 HTTP requests (initial + 2 retries) before failing.

Sources: tests/sec_client_tests.rs:93-120

Common Development Tasks

Adding a New SEC Data Endpoint

To add support for fetching a new SEC data endpoint:

  1. Add URL enum variant in src/models/url.rs
  2. Create fetch function in src/network/ following the pattern of existing functions
  3. Define data models in src/models/ for the response structure
  4. Add transformation logic in src/transformers/ if normalization is needed
  5. Write unit tests in tests/ using mockito::Server for mocking
  6. Update main.rs to integrate the new endpoint into the processing pipeline

New fetch functions should follow the signature pattern of the existing fetch functions in src/network/.

Adding a New FundamentalConcept Mapping

The distill_us_gaap_fundamental_concepts function maps raw SEC concept names to the FundamentalConcept enum. To add a new concept:

  1. Add enum variant to FundamentalConcept in src/models/fundamental_concept.rs
  2. Update the match arms in src/transformers/distill_us_gaap_fundamental_concepts.rs
  3. Add test case to verify the mapping in tests/distill_tests.rs

See the existing mapping patterns in the transformer module for hierarchical mappings (concepts that map to multiple parent categories).

Modifying HTTP Client Behavior

The SecClient is configured in src/network/sec_client.rs:21-89 Key configuration points:

| Configuration | Location | Purpose |
| --- | --- | --- |
| CachePolicy | src/network/sec_client.rs:45-50 | Controls cache TTL and behavior |
| ThrottlePolicy | src/network/sec_client.rs:53-59 | Controls rate limiting and retries |
| User-Agent | src/network/sec_client.rs:91-108 | Constructs SEC-compliant User-Agent header |

To modify throttling behavior, adjust the ThrottlePolicy parameters:

  • base_delay_ms: Minimum delay between requests
  • max_concurrent: Maximum concurrent requests
  • max_retries: Number of retry attempts on failure
  • adaptive_jitter_ms: Random jitter to prevent thundering herd

Sources: src/network/sec_client.rs:21-89

Working with Caches

The system uses two cache types managed by the Caches module:

  1. HTTP Cache : Stores raw HTTP responses with configurable TTL (default: 1 week)
  2. Preprocessor Cache : Stores transformed/preprocessed data

Cache instances are accessed via Caches::get_http_cache_store() as shown in src/network/sec_client.rs73

During development, you may need to clear caches when testing data transformations. Cache data is persisted via the simd-r-drive backend.

Sources: src/network/sec_client.rs73

Code Quality Standards

TODO Comments and Technical Debt

The codebase uses TODO comments to mark areas for improvement; examples can be found in src/network/sec_client.rs.

When adding TODO comments:

  1. Be specific about what needs to be done
  2. Include context about why it's not done now
  3. Reference related issues if applicable

Panic vs Result

The codebase follows Rust best practices:

  • Use Result<T, E> for recoverable errors
  • Use panic! only for non-recoverable errors or programming errors

For example, the User-Agent construction at src/network/sec_client.rs:95-98 panics when the configured email is invalid, because an invalid email makes all SEC API calls fail; this represents a configuration error rather than a runtime error.

Sources: src/network/sec_client.rs:95-98

Error Validation in Tests

Configuration validation is tested by verifying that error messages contain expected content, as shown in tests/config_manager_tests.rs:68-94.

This pattern ensures configuration errors are informative to users.

Sources: tests/config_manager_tests.rs:68-94

Integration Test Architecture

The integration test script from python/narrative_stack/us_gaap_store_integration_test.sh:1-39 orchestrates:

  1. Python environment setup with dependencies
  2. Docker Compose service startup (isolated project name: us_gaap_it)
  3. MySQL container health check via mysqladmin ping
  4. Database creation and schema loading
  5. pytest execution with verbose output
  6. Automatic cleanup via EXIT trap

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Best Practices Summary

| Practice | Implementation | Reference |
| --- | --- | --- |
| Test isolation | Use temporary directories and AppConfig::default() | tests/config_manager_tests.rs:9-17 |
| HTTP mocking | Use mockito::Server for endpoint simulation | tests/sec_client_tests.rs:37-45 |
| Async testing | Use #[tokio::test] attribute | tests/sec_client_tests.rs:35 |
| Error handling | Prefer Result<T, E> over panic | src/network/sec_client.rs:140-165 |
| Configuration | Use ConfigManager::from_app_config() in tests | tests/sec_client_tests.rs:10 |
| Integration testing | Use Docker Compose with isolated project names | python/narrative_stack/us_gaap_store_integration_test.sh:8 |
| Cleanup | Use trap handlers for guaranteed cleanup | python/narrative_stack/us_gaap_store_integration_test.sh:14-19 |

Sources: tests/config_manager_tests.rs:9-17 tests/sec_client_tests.rs:35-62 src/network/sec_client.rs:140-165 python/narrative_stack/us_gaap_store_integration_test.sh:1-39



Testing Strategy

Relevant source files

This page documents the testing approach for the rust-sec-fetcher codebase, covering both Rust unit tests and Python integration tests. The testing strategy emphasizes isolated unit testing with mocking for Rust components and end-to-end integration testing for the Python ML pipeline. For information about the CI/CD automation of these tests, see CI/CD Pipeline.


Test Architecture Overview

The codebase employs a dual-layer testing strategy that mirrors its dual-language architecture. Rust components use unit tests with HTTP mocking, while Python components use containerized integration tests with real database instances.

graph TB
    subgraph "Rust Unit Tests"
        SecClientTest["sec_client_tests.rs\nTest SecClient HTTP layer"]
DistillTest["distill_us_gaap_fundamental_concepts_tests.rs\nTest concept transformation"]
ConfigTest["config_manager_tests.rs\nTest configuration loading"]
end
    
    subgraph "Test Infrastructure"
        MockitoServer["mockito::Server\nHTTP mock server"]
TempDir["tempfile::TempDir\nTemporary config files"]
end
    
    subgraph "Python Integration Tests"
        IntegrationScript["us_gaap_store_integration_test.sh\nTest orchestration"]
PytestRunner["pytest\nTest execution"]
end
    
    subgraph "Docker Test Environment"
        MySQLContainer["us_gaap_test_db\nMySQL 8.0 container"]
SimdRDriveContainer["simd-r-drive-ws-server\nWebSocket server"]
SQLSchema["us_gaap_schema_2025.sql\nDatabase schema fixture"]
end
    
 
   SecClientTest --> MockitoServer
 
   ConfigTest --> TempDir
    
 
   IntegrationScript --> MySQLContainer
 
   IntegrationScript --> SimdRDriveContainer
 
   IntegrationScript --> SQLSchema
 
   IntegrationScript --> PytestRunner
    
 
   PytestRunner --> MySQLContainer
 
   PytestRunner --> SimdRDriveContainer

Test Component Relationships

Sources: tests/sec_client_tests.rs:1-159 tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 tests/config_manager_tests.rs:1-95 python/narrative_stack/us_gaap_store_integration_test.sh:1-39


Rust Unit Testing Strategy

The Rust test suite focuses on three critical areas: HTTP client behavior, data transformation correctness, and configuration management. Tests are isolated using mocking frameworks and temporary file systems.

SecClient HTTP Testing

The SecClient tests verify HTTP operations, retry logic, and authentication header generation using the mockito library for HTTP mocking.

| Test Function | Purpose | Mock Behavior |
| --- | --- | --- |
| test_user_agent | Validates User-Agent header format | N/A (direct method call) |
| test_invalid_email_panic | Ensures invalid emails cause panic | N/A (panic verification) |
| test_fetch_json_without_retry_success | Tests successful JSON fetch | Returns 200 with valid JSON |
| test_fetch_json_with_retry_success | Tests successful fetch with retry available | Returns 200 immediately |
| test_fetch_json_with_retry_failure | Verifies retry exhaustion | Returns 500 three times |
| test_fetch_json_with_retry_backoff | Tests retry with eventual success | Returns 500 once, then 200 |

User Agent Validation Test Pattern

Sources: tests/sec_client_tests.rs:6-21

The test creates a minimal AppConfig with only the email field set (tests/sec_client_tests.rs:8-10), constructs a ConfigManager and SecClient (tests/sec_client_tests.rs:10-12), then verifies the User-Agent format matches the expected pattern, including the package version from CARGO_PKG_VERSION (tests/sec_client_tests.rs:14-20).

HTTP Retry and Backoff Testing

The retry logic tests use mockito::Server to simulate various HTTP response scenarios:

Sources: tests/sec_client_tests.rs:64-158

graph TB
    SetupMock["Setup mockito::Server"]
ConfigureResponse["Configure mock response\nStatus, Body, Expect count"]
CreateClient["Create SecClient\nwith max_retries config"]
FetchJSON["Call client.fetch_json()"]
VerifyResult["Verify result:\nSuccess or Error"]
VerifyCallCount["Verify mock was called\nexpected number of times"]
SetupMock --> ConfigureResponse
 
   ConfigureResponse --> CreateClient
 
   CreateClient --> FetchJSON
 
   FetchJSON --> VerifyResult
 
   FetchJSON --> VerifyCallCount

The retry failure test configures a mock to return HTTP 500 status and expects exactly 3 calls (initial + 2 retries) (tests/sec_client_tests.rs:96-103), setting max_retries = 2 in the config (tests/sec_client_tests.rs:106). The test verifies the result is an error after exhausting retries (tests/sec_client_tests.rs:111-119).

The backoff test simulates transient failures by creating two mock endpoints: one that returns 500 once, and another that returns 200 (tests/sec_client_tests.rs:126-140). This validates that the retry mechanism successfully recovers from temporary failures.

Sources: tests/sec_client_tests.rs:94-158

US GAAP Concept Transformation Testing

The distill_us_gaap_fundamental_concepts_tests.rs file contains 1,275 lines of comprehensive tests covering all 64 FundamentalConcept variants. These tests verify the complex mapping logic that normalizes diverse US GAAP terminology into a standardized taxonomy.

Test Coverage by Concept Category

CategoryTest FunctionsConcept Variants TestedLines
Assets & Liabilities695-148
Income Statement152520-464
Cash Flow1215550-683
Equity713150-286
Revenue Variations2 (with 57+ assertions)45+954-1188

Hierarchical Mapping Test Pattern

The tests verify that specific concepts map to both their precise category and parent categories:

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:131-139

graph TB
    InputConcept["Input: 'AssetsCurrent'"]
CallDistill["Call distill_us_gaap_fundamental_concepts()"]
ExpectVector["Expect Vec with 2 elements"]
CheckSpecific["Assert contains:\nFundamentalConcept::CurrentAssets"]
CheckParent["Assert contains:\nFundamentalConcept::Assets"]
InputConcept --> CallDistill
 
   CallDistill --> ExpectVector
 
   ExpectVector --> CheckSpecific
 
   ExpectVector --> CheckParent

The test for AssetsCurrent verifies it maps to both CurrentAssets (specific) and Assets (parent category) (tests/distill_us_gaap_fundamental_concepts_tests.rs:132-138). This hierarchical pattern appears throughout the test suite for concepts that have parent-child relationships.

Synonym Consolidation Testing

Multiple test functions verify that synonym variations of the same concept map to a single canonical form:

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:80-112

graph LR
    CostOfRevenue["CostOfRevenue"]
CostOfGoodsAndServicesSold["CostOfGoodsAndServicesSold"]
CostOfServices["CostOfServices"]
CostOfGoodsSold["CostOfGoodsSold"]
CostOfGoodsSoldExcluding["CostOfGoodsSold...\nExcludingDepreciation"]
CostOfGoodsSoldElectric["CostOfGoodsSoldElectric"]
Canonical["FundamentalConcept::\nCostOfRevenue"]
CostOfRevenue --> Canonical
 
   CostOfGoodsAndServicesSold --> Canonical
 
   CostOfServices --> Canonical
 
   CostOfGoodsSold --> Canonical
 
   CostOfGoodsSoldExcluding --> Canonical
 
   CostOfGoodsSoldElectric --> Canonical

The test_cost_of_revenue function contains 6 assertions verifying that different US GAAP terminology variants all map to FundamentalConcept::CostOfRevenue (tests/distill_us_gaap_fundamental_concepts_tests.rs:81-111). This pattern ensures consistent normalization across different reporting styles.

Industry-Specific Revenue Testing

The revenue tests demonstrate the most complex mapping scenario, where 57+ industry-specific revenue concepts all normalize to FundamentalConcept::Revenues:

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:954-1188

graph TB
    subgraph "Standard Revenue Terms"
        Revenues["Revenues"]
SalesRevenueNet["SalesRevenueNet"]
SalesRevenueServicesNet["SalesRevenueServicesNet"]
end
    
    subgraph "Industry-Specific Terms"
        HealthCare["HealthCareOrganizationRevenue"]
RealEstate["RealEstateRevenueNet"]
OilGas["OilAndGasRevenue"]
Financial["FinancialServicesRevenue"]
Advertising["AdvertisingRevenue"]
Subscription["SubscriptionRevenue"]
Mining["RevenueMineralSales"]
end
    
    subgraph "Specialized Terms"
        Franchisor["FranchisorRevenue"]
Admissions["AdmissionsRevenue"]
Licenses["LicensesRevenue"]
Royalty["RoyaltyRevenue"]
Clearing["ClearingFeesRevenue"]
Passenger["PassengerRevenue"]
end
    
    Canonical["FundamentalConcept::Revenues"]
Revenues --> Canonical
 
   SalesRevenueNet --> Canonical
 
   SalesRevenueServicesNet --> Canonical
 
   HealthCare --> Canonical
 
   RealEstate --> Canonical
 
   OilGas --> Canonical
 
   Financial --> Canonical
 
   Advertising --> Canonical
 
   Subscription --> Canonical
 
   Mining --> Canonical
 
   Franchisor --> Canonical
 
   Admissions --> Canonical
 
   Licenses --> Canonical
 
   Royalty --> Canonical
 
   Clearing --> Canonical
 
   Passenger --> Canonical

The test_revenues function contains 47 distinct assertions, each verifying that a different industry-specific revenue term maps to the canonical FundamentalConcept::Revenues (tests/distill_us_gaap_fundamental_concepts_tests.rs:955-1188). Some terms also map to more specific sub-categories, such as InterestAndDividendIncomeOperating mapping to both InterestAndDividendIncomeOperating and Revenues (tests/distill_us_gaap_fundamental_concepts_tests.rs:989-994).

graph TB
    CreateTempDir["tempfile::tempdir()"]
CreatePath["dir.path().join('config.toml')"]
WriteContents["Write TOML contents\nto file"]
LoadConfig["ConfigManager::from_config(path)"]
AssertValues["Assert config values\nmatch expectations"]
DropTempDir["Drop TempDir\n(automatic cleanup)"]
CreateTempDir --> CreatePath
 
   CreatePath --> WriteContents
 
   WriteContents --> LoadConfig
 
   LoadConfig --> AssertValues
 
   AssertValues --> DropTempDir

Configuration Manager Testing

The config_manager_tests.rs file tests configuration loading, validation, and error handling using temporary files created with the tempfile crate.

Temporary Config File Test Pattern

Sources: tests/config_manager_tests.rs:8-57

The helper function create_temp_config creates a temporary directory and writes config contents to a config.toml file (tests/config_manager_tests.rs:9-17). The function returns both the TempDir (to prevent premature cleanup) and the PathBuf (tests/config_manager_tests.rs:16). Tests explicitly drop the TempDir at the end to ensure cleanup (tests/config_manager_tests.rs:56).

Invalid Configuration Key Detection

The test suite verifies that unknown configuration keys are rejected with helpful error messages:

| Test | Config Content | Expected Behavior |
| --- | --- | --- |
| test_load_custom_config | Valid keys only | Success, values loaded |
| test_load_non_existent_config | N/A (file doesn't exist) | Error returned |
| test_fails_on_invalid_key | Contains invalid_key | Error with valid keys listed |

Sources: tests/config_manager_tests.rs:68-94

The invalid key test verifies that the error message contains documentation of valid configuration keys, including their types tests/config_manager_tests.rs:88-93:

  • email (String | Null)
  • max_concurrent (Integer | Null)
  • max_retries (Integer | Null)
  • min_delay_ms (Integer | Null)

Python Integration Testing Strategy

The Python testing strategy uses Docker Compose to orchestrate a complete test environment with MySQL database and WebSocket services, enabling end-to-end validation of the data pipeline.

graph TB
    ScriptStart["us_gaap_store_integration_test.sh"]
SetupEnv["Set environment variables\nPROJECT_NAME=us_gaap_it\nCOMPOSE=docker compose -p ..."]
ActivateVenv["source .venv/bin/activate"]
InstallDeps["uv pip install -e . --group dev"]
RegisterTrap["trap 'cleanup' EXIT"]
StartContainers["docker compose up -d\n--profile test"]
WaitMySQL["Wait for MySQL ping\nmysqladmin ping loop"]
CreateDB["CREATE DATABASE us_gaap_test"]
LoadSchema["mysql < us_gaap_schema_2025.sql"]
RunPytest["pytest -s -v\ntest_us_gaap_store.py"]
Cleanup["Cleanup trap:\ndocker compose down\n--volumes --remove-orphans"]
ScriptStart --> SetupEnv
 
   SetupEnv --> ActivateVenv
 
   ActivateVenv --> InstallDeps
 
   InstallDeps --> RegisterTrap
 
   RegisterTrap --> StartContainers
 
   StartContainers --> WaitMySQL
 
   WaitMySQL --> CreateDB
 
   CreateDB --> LoadSchema
 
   LoadSchema --> RunPytest
 
   RunPytest --> Cleanup

Integration Test Orchestration Flow

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Docker Compose Test Profile

The script uses an isolated Docker Compose project name to prevent conflicts with development containers python/narrative_stack/us_gaap_store_integration_test.sh:8-9:

  • PROJECT_NAME="us_gaap_it"
  • COMPOSE="docker compose -p $PROJECT_NAME --profile test"

The --profile test flag ensures only test-specific services are started (python/narrative_stack/us_gaap_store_integration_test.sh:23), which include:

  • db_test (MySQL container named us_gaap_test_db)
  • simd_r_drive_ws_server_test (WebSocket server)

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-23

MySQL Test Database Setup

The integration test script performs a multi-step database initialization:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35

The script waits for MySQL to be ready using a ping loop (python/narrative_stack/us_gaap_store_integration_test.sh:25-28), then creates the test database (python/narrative_stack/us_gaap_store_integration_test.sh:30-31) and loads the schema from a fixture file (python/narrative_stack/us_gaap_store_integration_test.sh:33-35).

Test Cleanup and Isolation

The script registers a cleanup trap to ensure test containers are always removed (python/narrative_stack/us_gaap_store_integration_test.sh:14-19).

This trap executes on both successful completion and script failure, ensuring:

  • All containers in the us_gaap_it project are stopped and removed
  • All volumes are deleted (--volumes)
  • Orphaned containers from previous runs are cleaned up (--remove-orphans)

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19

Pytest Execution Environment

The test environment is configured with specific paths and options python/narrative_stack/us_gaap_store_integration_test.sh:37-38:

| Configuration | Value | Purpose |
| --- | --- | --- |
| PYTHONPATH | src | Ensure module imports work |
| pytest flags | -s -v | Show output, verbose mode |
| Test path | tests/integration/test_us_gaap_store.py | Integration test module |

The -s flag disables output capturing, allowing print statements and logs to display immediately during test execution, which is useful for debugging long-running integration tests. The -v flag enables verbose mode with detailed test function names.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37-38


Test Fixtures and Mocking Patterns

The codebase uses different mocking strategies appropriate to each language and component.

graph TB
    CreateServer["mockito::Server::new_async()"]
ConfigureMock["server.mock(method, path)\n.with_status(code)\n.with_body(json)\n.expect(count)"]
CreateAsyncMock["create_async().await"]
GetServerUrl["server.url()"]
MakeRequest["client.fetch_json(url)"]
VerifyMock["Mock automatically verifies\nexpected call count"]
CreateServer --> ConfigureMock
 
   ConfigureMock --> CreateAsyncMock
 
   CreateAsyncMock --> GetServerUrl
 
   GetServerUrl --> MakeRequest
 
   MakeRequest --> VerifyMock

Rust Mocking with mockito

The mockito library provides HTTP server mocking for testing the SecClient:

Sources: tests/sec_client_tests.rs:36-62

Mock configuration example from test_fetch_json_without_retry_success tests/sec_client_tests.rs:39-45:

  • Method: GET
  • Path: /files/company_tickers.json
  • Status: 200
  • Header: Content-Type: application/json
  • Body: JSON string with sample ticker data
  • Call using server.url() to get the mock endpoint
graph LR
    CreateTempDir["tempfile::tempdir()"]
GetPath["dir.path().join('config.toml')"]
CreateFile["fs::File::create(path)"]
WriteContents["writeln!(file, contents)"]
ReturnBoth["Return (TempDir, PathBuf)"]
CreateTempDir --> GetPath
 
   GetPath --> CreateFile
 
   CreateFile --> WriteContents
 
   WriteContents --> ReturnBoth

Temporary File Fixtures

Configuration tests use the tempfile crate to create isolated test environments:

Sources: tests/config_manager_tests.rs:8-17

The create_temp_config helper returns both the TempDir and PathBuf to prevent premature cleanup (tests/config_manager_tests.rs:16). The TempDir must remain in scope until after the test completes, as dropping it deletes the temporary directory.

SQL Schema Fixtures

The Python integration tests use a versioned SQL schema file as a fixture:

| File | Purpose | Usage |
| --- | --- | --- |
| tests/integration/assets/us_gaap_schema_2025.sql | Define us_gaap_test database structure | Loaded via docker exec (python/narrative_stack/us_gaap_store_integration_test.sh:33-35) |

This approach ensures tests run against a known database structure that matches production schema, enabling reliable testing of database operations and query logic.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:33-35


Test Execution Commands

The following table summarizes how to run different test suites:

| Test Suite | Command | Working Directory | Prerequisites |
| --- | --- | --- | --- |
| All Rust unit tests | cargo test | Repository root | Rust toolchain |
| Specific Rust test file | cargo test --test sec_client_tests | Repository root | Rust toolchain |
| Single Rust test function | cargo test test_user_agent | Repository root | Rust toolchain |
| Python integration tests | ./us_gaap_store_integration_test.sh | python/narrative_stack/ | Docker, Git LFS, uv |
| Python integration with pytest | pytest -s -v tests/integration/ | python/narrative_stack/ | Docker, containers running |

Prerequisites for Integration Tests

The Python integration test script requires python/narrative_stack/us_gaap_store_integration_test.sh:1-2:

  • Git LFS enabled and updated (for large test fixtures)
  • Docker with Docker Compose v2
  • Python virtual environment with uv package manager
  • Test data files committed with Git LFS

Sources: tests/sec_client_tests.rs:1-159 tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 tests/config_manager_tests.rs:1-95 python/narrative_stack/us_gaap_store_integration_test.sh:1-39



CI/CD Pipeline

Relevant source files

Purpose and Scope

This document explains the continuous integration and continuous deployment (CI/CD) infrastructure for the rust-sec-fetcher repository. It covers the GitHub Actions workflow configuration, integration test automation, Docker container orchestration for testing, and the complete test execution flow.

The CI/CD pipeline focuses specifically on the Python narrative_stack system's integration testing. For general testing strategies including Rust unit tests and Python test fixtures, see Testing Strategy. For production Docker deployment of the simd-r-drive WebSocket server, see Docker Deployment.


GitHub Actions Workflow Overview

The repository implements a single GitHub Actions workflow named US GAAP Store Integration Test that validates the Python machine learning pipeline's integration with external dependencies.

Workflow Trigger Conditions

The workflow is configured to run automatically under specific conditions defined in .github/workflows/us-gaap-store-integration-test.yml:3-11:

Sources: .github/workflows/us-gaap-store-integration-test.yml:3-11

The workflow only executes when changes affect:

  • Any file under python/narrative_stack/ directory
  • The workflow definition file itself (.github/workflows/us_gaap_store_integration_test.yml)

This targeted triggering prevents unnecessary CI runs when changes are made to the Rust sec-fetcher components or unrelated Python modules.


Integration Test Job Structure

The workflow defines a single job named integration-test that runs on ubuntu-latest runners. The job executes a sequence of setup and validation steps.

Sources: .github/workflows/us-gaap-store-integration-test.yml:12-51

graph TB
    Start["Job: integration-test\nruns-on: ubuntu-latest"]
Checkout["Step 1: Checkout repo\nactions/checkout@v4\nwith lfs: true"]
SetupPython["Step 2: Set up Python\nactions/setup-python@v5\npython-version: 3.12"]
InstallUV["Step 3: Install uv\ncurl astral.sh/uv/install.sh\nAdd to PATH"]
InstallDeps["Step 4: Install Python dependencies\nuv venv --python=3.12\nuv pip install -e . --group dev"]
Ruff["Step 5: Check code style with Ruff\nruff check ."]
Chmod["Step 6: Make test script executable\nchmod +x us_gaap_store_integration_test.sh"]
RunTest["Step 7: Run integration test\n./us_gaap_store_integration_test.sh"]
Start --> Checkout
 
   Checkout --> SetupPython
 
   SetupPython --> InstallUV
 
   InstallUV --> InstallDeps
 
   InstallDeps --> Ruff
 
   Ruff --> Chmod
 
   Chmod --> RunTest
    
    style Start fill:#f9f9f9
    style RunTest fill:#f9f9f9

Key Workflow Steps

| Step | Action | Purpose |
| --- | --- | --- |
| Checkout repo | actions/checkout@v4 with lfs: true | Clone repository with Git LFS support for large test data files |
| Set up Python | actions/setup-python@v5 | Install Python 3.12 runtime |
| Install uv | Shell script execution | Install uv package manager from astral.sh |
| Install dependencies | uv pip install -e . --group dev | Install narrative_stack package in editable mode with dev dependencies |
| Check code style | ruff check . | Validate Python code formatting and linting rules |
| Make executable | chmod +x | Grant execution permissions to test script |
| Run integration test | Execute bash script | Orchestrate Docker containers and run pytest tests |

Sources: .github/workflows/us-gaap-store-integration-test.yml:17-50


Integration Test Architecture

The integration test orchestrates multiple Docker containers to create an isolated test environment that validates the narrative_stack system's ability to ingest, process, and store US GAAP financial data.

graph TB
    subgraph "Docker Compose Project: us_gaap_it"
        MySQL["Container: us_gaap_test_db\nImage: mysql\nPort: 3306\nUser: root\nPassword: onlylocal\nDatabase: us_gaap_test"]
SimdRDrive["Container: simd_r_drive_ws_server_test\nBuilt from: Dockerfile.simd-r-drive-ci-server\nPort: 8080\nBinary: simd-r-drive-ws-server@0.10.0-alpha"]
TestRunner["Test Runner\npytest process\ntests/integration/test_us_gaap_store.py"]
end
    
    Schema["SQL Schema\ntests/integration/assets/us_gaap_schema_2025.sql"]
TestRunner -->|SQL queries| MySQL
 
   TestRunner -->|WebSocket connection| SimdRDrive
    
 
   Schema -->|Loaded via mysql CLI| MySQL
    
    style MySQL fill:#f9f9f9
    style SimdRDrive fill:#f9f9f9
    style TestRunner fill:#f9f9f9

Container Architecture

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39 python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34

The test environment consists of:

  1. MySQL Database Container (us_gaap_test_db): Provides persistent storage for US GAAP financial data with pre-loaded schema
  2. simd-r-drive WebSocket Server (simd_r_drive_ws_server_test): Handles embedding matrix storage and data synchronization
  3. pytest Test Runner : Executes integration tests against both containers

Test Execution Flow

The integration test script python/narrative_stack/us_gaap_store_integration_test.sh:1-39 orchestrates the complete test lifecycle from container startup through teardown.

graph TB
    Start["Start: us_gaap_store_integration_test.sh"]
SetVars["Set variables\nPROJECT_NAME=us_gaap_it\nCOMPOSE=docker compose -p us_gaap_it"]
ActivateVenv["Activate virtual environment\nsource .venv/bin/activate"]
InstallDeps["Install dependencies\nuv pip install -e . --group dev"]
RegisterTrap["Register cleanup trap\ntrap 'cleanup' EXIT"]
DockerUp["Start Docker containers\ndocker compose up -d\nProfile: test"]
WaitMySQL["Wait for MySQL ready\nmysqladmin ping loop"]
CreateDB["Create database\nCREATE DATABASE us_gaap_test"]
LoadSchema["Load schema\nmysql < us_gaap_schema_2025.sql"]
SetPythonPath["Set PYTHONPATH=src"]
RunPytest["Execute pytest\npytest -s -v tests/integration/test_us_gaap_store.py"]
Cleanup["Cleanup function\ndocker compose down --volumes"]
Start --> SetVars
 
   SetVars --> ActivateVenv
 
   ActivateVenv --> InstallDeps
 
   InstallDeps --> RegisterTrap
 
   RegisterTrap --> DockerUp
 
   DockerUp --> WaitMySQL
 
   WaitMySQL --> CreateDB
 
   CreateDB --> LoadSchema
 
   LoadSchema --> SetPythonPath
 
   SetPythonPath --> RunPytest
 
   RunPytest --> Cleanup
    
    style Start fill:#f9f9f9
    style RunPytest fill:#f9f9f9
    style Cleanup fill:#f9f9f9

Test Script Flow Diagram

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Cleanup Mechanism

The script implements a robust cleanup mechanism using bash trap handlers python/narrative_stack/us_gaap_store_integration_test.sh:14-19:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19

The cleanup function ensures that all Docker resources are removed regardless of test outcome, preventing resource leaks and port conflicts in subsequent test runs.


Docker Container Configuration

simd-r-drive-ws-server Container Build

The Dockerfile python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34 creates a single-stage image optimized for CI environments.

Build Configuration:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Base Image | rust:1.87-slim | Minimal Rust runtime environment |
| Binary Installation | cargo install --locked simd-r-drive-ws-server@0.10.0-alpha | Install specific locked version of server |
| Default Server Args | data.bin --host 127.0.0.1 --port 8080 | Configure server binding and data file |
| Exposed Port | 8080 | WebSocket server port |
| Entrypoint | Shell wrapper with $SERVER_ARGS interpolation | Allow runtime argument override |

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-33

graph LR
    BuildTime["Build Time\n--build-arg SERVER_ARGS=..."]
BakeArgs["Bake into ENV variable"]
Runtime["Run Time\n-e SERVER_ARGS=... OR\ndocker run ... extra flags"]
Entrypoint["ENTRYPOINT interpolates\n$SERVER_ARGS + $@"]
ServerExec["Execute:\nsimd-r-drive-ws-server\nwith combined args"]
BuildTime --> BakeArgs
 
   BakeArgs --> Runtime
 
   Runtime --> Entrypoint
 
   Entrypoint --> ServerExec
    
    style Entrypoint fill:#f9f9f9

The Dockerfile uses build arguments to support configuration at both build time and run time, as illustrated in the diagram above.

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-33

MySQL Container Configuration

The MySQL container is configured through Docker Compose with the following specifications python/narrative_stack/us_gaap_store_integration_test.sh:23-35:

| Configuration | Value | Source |
|---|---|---|
| Container Name | us_gaap_test_db | Docker Compose service name |
| Root Password | onlylocal | Environment variable |
| Database Name | us_gaap_test | Created via CREATE DATABASE command |
| Schema File | tests/integration/assets/us_gaap_schema_2025.sql | Loaded via mysql CLI |
| Health Check | mysqladmin ping -h127.0.0.1 | Readiness probe |

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35


Environment Configuration and Dependencies

graph TB
    UVInstall["Install uv package manager\ncurl -LsSf astral.sh/uv/install.sh"]
CreateVenv["Create virtual environment\nuv venv --python=3.12"]
ActivateVenv["Activate environment\nsource .venv/bin/activate"]
InstallPackage["Install narrative_stack\nuv pip install -e . --group dev"]
DevGroup["Dev dependency group includes:\n- pytest\n- ruff\n- integration test utilities"]
UVInstall --> CreateVenv
 
   CreateVenv --> ActivateVenv
 
   ActivateVenv --> InstallPackage
 
   InstallPackage --> DevGroup
    
    style InstallPackage fill:#f9f9f9

Python Environment Setup

The CI pipeline uses uv as the package manager to create isolated Python environments and install dependencies:
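
Condensed from the diagram above, the setup roughly amounts to the following; exact flags may differ slightly from the workflow:

```bash
# Install the uv package manager.
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate an isolated Python 3.12 virtual environment.
uv venv --python=3.12
source .venv/bin/activate

# Install narrative_stack in editable mode with the dev dependency group.
uv pip install -e . --group dev
```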

Sources: .github/workflows/us-gaap-store-integration-test.yml:27-37 python/narrative_stack/us_gaap_store_integration_test.sh:11-12

Code Quality Validation

Before running integration tests, the workflow executes Ruff code quality checks .github/workflows/us-gaap-store-integration-test.yml:39-43:

| Check | Command | Purpose |
|---|---|---|
| Linting | ruff check . | Validate code style, detect anti-patterns, enforce naming conventions |

The Ruff check acts as a fast pre-test gate that fails the CI pipeline if code quality issues are detected, preventing broken code from reaching the integration test phase.


Docker Compose Project Isolation

The integration test uses Docker Compose project isolation to prevent conflicts with other running containers python/narrative_stack/us_gaap_store_integration_test.sh:7-9:
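
The relevant variables, following the values shown in the test-script flow diagram earlier (a sketch, not a verbatim excerpt):

```bash
# All compose resources are namespaced under this project name,
# so test containers, networks, and volumes never collide with
# other environments on the same host.
PROJECT_NAME="us_gaap_it"
COMPOSE="docker compose -p ${PROJECT_NAME}"
```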

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:7-9

The project name us_gaap_it creates an isolated namespace for all Docker resources (containers, networks, volumes) associated with the integration test, allowing multiple test runs or development environments to coexist without interference.


Test Execution and Reporting

pytest Integration Test Execution

The final step executes the integration test suite using pytest python/narrative_stack/us_gaap_store_integration_test.sh:37-38:

| Parameter | Value | Purpose |
|---|---|---|
| Test File | tests/integration/test_us_gaap_store.py | Integration test module |
| Verbosity | -v | Verbose output with test names |
| Output | -s | Show stdout/stderr (no capture) |
| Python Path | PYTHONPATH=src | Locate source modules |
| Working Directory | python/narrative_stack | Test execution context |

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37-38

The pytest command runs with stdout capture disabled (-s flag), allowing real-time visibility into test progress and enabling debugging output from the integration tests.
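
Reconstructed from the parameters above, the invocation is roughly:

```bash
cd python/narrative_stack
PYTHONPATH=src pytest -s -v tests/integration/test_us_gaap_store.py
```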


graph TB
    GitPush["Git Push/PR\nto python/narrative_stack/"]
GHAStart["GitHub Actions Trigger\nus-gaap-store-integration-test.yml"]
EnvSetup["Environment Setup:\n- Python 3.12\n- uv package manager\n- Dev dependencies"]
CodeQuality["Code Quality Check:\nruff check ."]
QualityFail["❌ Fail CI"]
QualityPass["✓ Pass"]
ScriptExec["Execute Test Script:\nus_gaap_store_integration_test.sh"]
DockerSetup["Docker Compose Setup:\n- MySQL (us_gaap_test_db)\n- simd-r-drive-ws-server"]
SchemaLoad["Load SQL Schema:\nus_gaap_schema_2025.sql"]
PytestRun["Run Integration Tests:\npytest tests/integration/"]
TestFail["❌ Fail CI"]
TestPass["✓ Pass"]
Cleanup["Cleanup:\ndocker compose down --volumes"]
CIComplete["✅ CI Complete"]
GitPush --> GHAStart
 
   GHAStart --> EnvSetup
 
   EnvSetup --> CodeQuality
    
 
   CodeQuality -->|Issues detected| QualityFail
 
   CodeQuality -->|Clean| QualityPass
    
 
   QualityPass --> ScriptExec
 
   ScriptExec --> DockerSetup
 
   DockerSetup --> SchemaLoad
 
   SchemaLoad --> PytestRun
    
 
   PytestRun -->|Tests fail| TestFail
 
   PytestRun -->|Tests pass| TestPass
    
 
   TestFail --> Cleanup
 
   TestPass --> Cleanup
    
 
   Cleanup --> CIComplete
    
    style GHAStart fill:#f9f9f9
    style CodeQuality fill:#f9f9f9
    style PytestRun fill:#f9f9f9
    style CIComplete fill:#f9f9f9

CI/CD Pipeline Flow Summary

The complete CI/CD pipeline integrates all components into a continuous validation workflow:

Sources: .github/workflows/us-gaap-store-integration-test.yml:1-51 python/narrative_stack/us_gaap_store_integration_test.sh:1-39


Key Design Decisions

Git LFS Requirement

The workflow explicitly enables Git LFS .github/workflows/us-gaap-store-integration-test.yml:19-20 to support large test data files. The test script includes a comment python/narrative_stack/us_gaap_store_integration_test.sh:1-2 noting this requirement for local execution.

Single-Stage Docker Build

The Dockerfile uses a single-stage build python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-15 rather than multi-stage, prioritizing simplicity for CI environments where image size is less critical than build reliability and transparency.

Profile-Based Container Selection

The Docker Compose command uses --profile test python/narrative_stack/us_gaap_store_integration_test.sh:9 to selectively start only test-related services, preventing unnecessary containers from consuming CI resources.

Working Directory Isolation

All workflow steps and script commands execute from the python/narrative_stack working directory .github/workflows/us-gaap-store-integration-test.yml:33, ensuring consistent path resolution and dependency isolation from other repository components.


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Docker Deployment

Relevant source files

Purpose and Scope

This document describes the Docker containerization strategy for the rust-sec-fetcher system, focusing on the simd-r-drive-ws-server deployment and the Docker Compose infrastructure used for integration testing. The Dockerfile builds a lightweight container that runs the WebSocket server, while Docker Compose orchestrates multi-container test environments with MySQL and the WebSocket server.

For information about the CI/CD workflows that use these Docker configurations, see CI/CD Pipeline. For details about the simd-r-drive WebSocket server functionality, see Database & Storage Integration.


Docker Image Architecture

The system uses a single-stage Dockerfile to build and deploy the simd-r-drive-ws-server binary in a minimal Rust container. This approach prioritizes simplicity and reproducibility for CI/CD environments.

graph TB
    BaseImage["rust:1.87-slim\nBase Image"]
CargoInstall["cargo install\nsimd-r-drive-ws-server@0.10.0-alpha"]
EnvSetup["Environment Setup\nSERVER_ARGS variable\ndefault: 'data.bin --host 127.0.0.1 --port 8080'"]
Expose["EXPOSE 8080\nWebSocket port"]
Entrypoint["ENTRYPOINT\n/bin/sh -c 'simd-r-drive-ws-server ${SERVER_ARGS}'"]
BaseImage --> CargoInstall
 
   CargoInstall --> EnvSetup
 
   EnvSetup --> Expose
 
   Expose --> Entrypoint

Dockerfile Structure

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34

Build Configuration

The Dockerfile uses build arguments and environment variables to provide flexible configuration:

| Configuration | Type | Default Value | Purpose |
|---|---|---|---|
| RUST_VERSION | Build ARG | 1.87 | Rust toolchain version for compilation |
| SERVER_ARGS | Build ARG / ENV | data.bin --host 127.0.0.1 --port 8080 | Server command-line arguments |

Build-time customization:
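
For example, baking alternate server arguments into the image at build time (the image tag is illustrative; run from the repository root):

```bash
docker build \
  -f python/narrative_stack/Dockerfile.simd-r-drive-ci-server \
  --build-arg SERVER_ARGS="data.bin --host 0.0.0.0 --port 8080" \
  -t simd-r-drive-ci-server \
  python/narrative_stack
```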

Runtime customization:
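
Or overriding the baked-in default when starting the container (same illustrative image tag):

```bash
docker run --rm -p 8080:8080 \
  -e SERVER_ARGS="data.bin --host 0.0.0.0 --port 8080" \
  simd-r-drive-ci-server
```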

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-33

Image Composition

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:10-26


Docker Compose Test Infrastructure

The integration testing environment uses Docker Compose to orchestrate multiple services with isolated networking and lifecycle management.

graph TB
    subgraph "Docker Compose Project: us_gaap_it"
        
        subgraph "db_test Service"
            MySQLContainer["Container: us_gaap_test_db\nImage: mysql\nPort: 3306\nRoot password: onlylocal"]
Database["Database: us_gaap_test\nSchema: us_gaap_schema_2025.sql"]
end
        
        subgraph "simd_r_drive_ws_server_test Service"
            WSServer["Container: simd-r-drive-ws-server\nPort: 8080\nData file: data.bin"]
end
        
        subgraph "Test Runner (Host)"
            TestScript["us_gaap_store_integration_test.sh\npytest integration tests"]
end
    end
    
 
   MySQLContainer --> Database
 
   TestScript -->|docker exec mysql commands| MySQLContainer
 
   TestScript -->|pytest via network| WSServer
 
   TestScript -->|pytest via network| Database

Service Architecture

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

sequenceDiagram
    participant Script as us_gaap_store_integration_test.sh
    participant Compose as docker compose
    participant MySQL as us_gaap_test_db
    participant WSServer as simd-r-drive-ws-server
    participant Pytest as pytest
    
    Script->>Script: Set PROJECT_NAME="us_gaap_it"
    Script->>Script: Register cleanup trap
    
    Script->>Compose: up -d (profile: test)
    Compose->>MySQL: Start container
    Compose->>WSServer: Start container
    
    loop Wait for MySQL
        Script->>MySQL: mysqladmin ping
        MySQL-->>Script: Connection status
    end
    
    Script->>MySQL: CREATE DATABASE us_gaap_test
    Script->>MySQL: Load us_gaap_schema_2025.sql
    
    Script->>Pytest: Run tests/integration/test_us_gaap_store.py
    Pytest->>MySQL: Query database
    Pytest->>WSServer: WebSocket operations
    Pytest-->>Script: Test results
    
    Script->>Compose: down --volumes --remove-orphans
    Compose->>MySQL: Stop and remove
    Compose->>WSServer: Stop and remove

Test Execution Flow

The integration test script implements a robust lifecycle with automatic cleanup:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:7-39

Docker Compose Configuration Pattern

The test script uses the following Docker Compose invocation pattern:

| Component | Value | Purpose |
|---|---|---|
| Project Name | us_gaap_it | Isolates test containers from other Docker resources |
| Profile | test | Selects services tagged with profiles: [test] |
| Cleanup Strategy | --volumes --remove-orphans | Ensures complete teardown of test infrastructure |

Key commands:
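
Approximately, using the values from the configuration table above:

```bash
# Start the test-profile services in the isolated project namespace.
docker compose -p us_gaap_it --profile test up -d

# Full teardown, including named volumes and orphaned containers.
docker compose -p us_gaap_it --profile test down --volumes --remove-orphans
```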

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-18


graph TB
    Start["Container Start\nImage: mysql\nRoot password: onlylocal"]
WaitReady["Health Check Loop\nmysqladmin ping -h127.0.0.1"]
CreateDB["docker exec\nCREATE DATABASE IF NOT EXISTS us_gaap_test"]
LoadSchema["docker exec\nmysql < us_gaap_schema_2025.sql"]
Ready["Database Ready\nSchema applied"]
Start --> WaitReady
 
   WaitReady -->|Ping successful| CreateDB
 
   WaitReady -->|Still initializing| WaitReady
 
   CreateDB --> LoadSchema
 
   LoadSchema --> Ready

Database Container Setup

The MySQL container requires specific initialization steps to prepare the test database:

Initialization Sequence

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35

MySQL Container Configuration

The script executes the following commands against the us_gaap_test_db container:

  1. Health check: Polls MySQL readiness using mysqladmin ping

  2. Database creation: Creates the test database if it doesn't exist

  3. Schema loading: Applies the SQL schema from the repository
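
A sketch of these three steps, using the container name, root password, and schema path documented above (exact flags may differ from the script):

```bash
# 1. Poll until the MySQL server answers pings.
until docker exec us_gaap_test_db mysqladmin ping -h127.0.0.1 --silent; do
  sleep 1
done

# 2. Create the test database if it does not already exist.
docker exec us_gaap_test_db \
  mysql -uroot -ponlylocal -e "CREATE DATABASE IF NOT EXISTS us_gaap_test"

# 3. Load the schema from the repository into the new database.
docker exec -i us_gaap_test_db \
  mysql -uroot -ponlylocal us_gaap_test < tests/integration/assets/us_gaap_schema_2025.sql
```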

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35


Environment Variables and Configuration

The Docker deployment uses environment variables for runtime configuration of both the WebSocket server and test infrastructure.

simd-r-drive-ws-server Configuration

| Variable | Default | Configurable At | Description |
|---|---|---|---|
| SERVER_ARGS | data.bin --host 127.0.0.1 --port 8080 | Build & Runtime | Complete command-line arguments for the server |

The SERVER_ARGS environment variable controls:

  • Data file: Path to the persistent storage file (e.g., data.bin)
  • Host binding: IP address for the WebSocket server (127.0.0.1 for localhost, 0.0.0.0 for all interfaces)
  • Port: TCP port for the WebSocket endpoint (default 8080)

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-32

Test Environment Variables

The integration test script sets the following environment variables:
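
In essence (the script sets this inline on the pytest command):

```bash
export PYTHONPATH=src
```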

This ensures Python can locate the narrative_stack module when running tests.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37


Deployment Considerations

Port Exposure and Network Configuration

The Dockerfile exposes port 8080 for WebSocket connections:

When deploying, map this port to the host system:
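
For example (the image tag is illustrative):

```bash
# Map the container's exposed WebSocket port 8080 to the host (host:container).
docker run --rm -p 8080:8080 simd-r-drive-ci-server
```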

For production deployments, consider:

  • Using a different external port (e.g., -p 9000:8080)
  • Binding to specific interfaces with --host 0.0.0.0 in SERVER_ARGS
  • Placing the container behind a reverse proxy (nginx, traefik)

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:25

Data Persistence

The server expects a data file specified in SERVER_ARGS (default: data.bin). For persistent storage across container restarts:
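
One option is a bind mount, pointing SERVER_ARGS at the mounted path; the host directory, container path, and image tag below are illustrative:

```bash
docker run --rm -p 8080:8080 \
  -v "$(pwd)/simd-r-drive-data:/data" \
  -e SERVER_ARGS="/data/data.bin --host 0.0.0.0 --port 8080" \
  simd-r-drive-ci-server
```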

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:22-23

graph TB
    Setup["Setup Phase\ndocker compose up -d"]
Trap["Register Trap\ntrap 'cleanup' EXIT"]
Execute["Test Execution\npytest integration tests"]
Cleanup["Cleanup Phase\ndocker compose down\n--volumes --remove-orphans"]
Setup --> Trap
 
   Trap --> Execute
 
   Execute -->|Success or Failure| Cleanup

Container Lifecycle Management

The integration test script demonstrates proper container lifecycle management:

The trap command ensures cleanup runs even if tests fail or the script is interrupted:
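
As shown in the lifecycle diagram above:

```bash
# Invoke the cleanup function on every exit path, including errors and Ctrl-C.
trap cleanup EXIT
```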

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19

Build and Deployment Commands

The typical local workflow has three steps: build the Docker image, run the container, and execute the integration tests.
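
A sketch assuming execution from the repository root (the image tag is illustrative):

```bash
# Build the simd-r-drive WebSocket server image.
docker build \
  -f python/narrative_stack/Dockerfile.simd-r-drive-ci-server \
  -t simd-r-drive-ci-server \
  python/narrative_stack

# Run the container, exposing the WebSocket port.
docker run --rm -p 8080:8080 simd-r-drive-ci-server

# Run the integration test suite (manages its own Docker Compose stack).
cd python/narrative_stack
bash us_gaap_store_integration_test.sh
```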

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-3 python/narrative_stack/us_gaap_store_integration_test.sh:1-39


Integration with CI/CD

The Docker configuration integrates with GitHub Actions for automated testing. The CI pipeline:

  1. Builds the simd-r-drive-ci-server image
  2. Spins up Docker Compose services with the test profile
  3. Initializes the MySQL database with the test schema
  4. Runs pytest integration tests
  5. Tears down all containers and volumes

This ensures reproducible test environments across local development and CI systems.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-2 python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-8


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Dependencies & Technology Stack

Relevant source files

This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system. For information about the configuration system and credential management, see Configuration System. For details on how these dependencies are used within specific components, see Rust sec-fetcher Application and Python narrative_stack System.

Overview

The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.

Sources: Cargo.toml:1-45 High-Level System Architecture diagrams

Rust Technology Stack

Core Direct Dependencies

The Rust application declares 29 direct dependencies in its manifest, each serving specific architectural roles:

Dependency Categories and Usage:

graph TB
    subgraph "Async Runtime & Concurrency"
        tokio["tokio 1.43.0\nFull async runtime"]
rayon["rayon 1.10.0\nData parallelism"]
dashmap["dashmap 6.1.0\nConcurrent hashmap"]
end
    
    subgraph "HTTP & Network"
        reqwest["reqwest 0.12.15\nHTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nDrive middleware"]
end
    
    subgraph "Data Processing"
        polars["polars 0.46.0\nDataFrame operations"]
csv_crate["csv 1.3.1\nCSV parsing"]
serde["serde 1.0.218\nSerialization"]
serde_json["serde_json 1.0.140\nJSON support"]
quick_xml["quick-xml 0.37.2\nXML parsing"]
end
    
    subgraph "Storage & Caching"
        simd_r_drive["simd-r-drive 0.3.0-alpha.1\nKey-value store"]
simd_r_drive_ext["simd-r-drive-extensions\n0.4.0-alpha.6"]
end
    
    subgraph "Configuration & Validation"
        config_crate["config 0.15.9\nConfig management"]
keyring["keyring 3.6.2\nCredential storage"]
email_address["email_address 0.2.9\nEmail validation"]
rust_decimal["rust_decimal 1.36.0\nDecimal numbers"]
chrono["chrono 0.4.40\nDate/time handling"]
end
    
    subgraph "Development Tools"
        mockito["mockito 1.7.0\nHTTP mocking"]
tempfile["tempfile 3.18.0\nTemp file creation"]
env_logger["env_logger 0.11.7\nLogging"]
end

| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling, timers |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing |
| Concurrency | rayon, dashmap, crossbeam | Parallel processing, concurrent data structures |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage |
| Configuration | config, keyring | TOML config loading, secure credential management |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files |

Sources: Cargo.toml:8-44

HTTP and Network Layer

The HTTP stack consists of multiple layers providing comprehensive request handling:

TLS Configuration: The system supports dual TLS backends for platform compatibility:

graph TB
    SecClient["SecClient\n(src/sec_client.rs)"]
reqwest["reqwest 0.12.15\nHigh-level HTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nMiddleware framework"]
hyper["hyper 1.6.0\nHTTP/1.1 & HTTP/2"]
h2["h2 0.4.8\nHTTP/2 protocol"]
TLS_Layer["TLS Layer"]
rustls["rustls 0.23.21\nPure Rust TLS"]
native_tls["native-tls 0.2.14\nPlatform TLS"]
openssl["openssl 0.10.71\nOpenSSL bindings"]
hyper_util["hyper-util 0.1.10\nUtilities"]
hyper_rustls["hyper-rustls 0.27.5\nRustls connector"]
hyper_tls["hyper-tls 0.6.0\nNative TLS connector"]
SecClient --> reqwest
 
   reqwest --> reqwest_drive
 
   reqwest --> hyper
 
   reqwest --> TLS_Layer
    
 
   hyper --> h2
 
   hyper --> hyper_util
    
 
   TLS_Layer --> hyper_rustls
 
   TLS_Layer --> hyper_tls
 
   hyper_rustls --> rustls
 
   hyper_tls --> native_tls
 
   native_tls --> openssl
  • rustls (0.23.21): Pure Rust implementation, no external dependencies, used by default
  • native-tls (0.2.14): Platform-native TLS (OpenSSL on Linux, Security.framework on macOS, SChannel on Windows)
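
As an illustration only (this is not taken from the repository's manifest), a reqwest consumer can select a backend through feature flags, for example with cargo add:

```bash
# Pure-Rust TLS (rustls), dropping the default native-tls backend.
cargo add reqwest --no-default-features --features rustls-tls

# Or keep the platform-native TLS backend.
cargo add reqwest --features native-tls
```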

Sources: Cargo.lock:1280-1351 Cargo.toml:28

Data Processing Stack

Polars Features Enabled:

  • json: JSON file reading/writing
  • lazy: Lazy evaluation engine for query optimization
  • pivot: Pivot table operations

Key Numeric Types:

  • rust_decimal::Decimal: Exact decimal arithmetic for financial calculations, critical for US GAAP processing
  • chrono::NaiveDate, chrono::DateTime: Timestamp handling for filing dates and report periods

Sources: Cargo.toml:24 Cargo.lock:2170-2394

Concurrency and Parallelism

Concurrency Model:

  • tokio : Handles I/O-bound operations (HTTP requests, file I/O)
  • rayon : Handles CPU-bound operations (DataFrame transformations, parallel iterations)
  • dashmap : Thread-safe caching without explicit locking

Key Usage Locations:

Sources: Cargo.toml:14-40 src/caches.rs:1-66 Cargo.lock:612-762

Storage and Caching System

Cache Architecture:

  • Two separate DataStore instances for different caching layers
  • Thread-safe access via Arc<DataStore> wrapped in OnceLock statics
  • Initialized once at application startup via Caches::init()
  • 1-week TTL for HTTP responses, indefinite for preprocessor results

Storage Format:

  • Binary .bin files for efficient serialization
  • Uses bincode 1.3.3 for encoding cached data
  • Persistent across application restarts

Sources: src/caches.rs:7-65 Cargo.toml:36-37 Cargo.lock:246-252

Configuration and Credentials

Configuration Sources (Priority Order):

  1. Environment variables
  2. config.toml in current directory
  3. ~/.config/sec-fetcher/config.toml
  4. Default values

Keyring Features:

  • apple-native: Direct Security.framework integration on macOS
  • windows-native: Windows Credential Manager via Win32 APIs
  • sync-secret-service: D-Bus Secret Service for Linux

Sources: Cargo.toml:12-20 Cargo.lock:1641-1653

Validation and Type Safety

| Crate | Version | Purpose | Key Types |
|---|---|---|---|
| email_address | 0.2.9 | SEC email validation | EmailAddress |
| rust_decimal | 1.36.0 | Financial calculations | Decimal |
| chrono | 0.4.40 | Date/time operations | NaiveDate, DateTime<Utc> |
| chrono-tz | 0.10.1 | Timezone handling | Tz::America__New_York |
| strum / strum_macros | 0.27.1 | Enum utilities | EnumString, Display |

Decimal Precision:

  • 28-29 significant digits
  • No floating-point rounding errors
  • Critical for US GAAP fundamental values

Date Handling:

  • All SEC filing dates parsed to chrono::NaiveDate
  • UTC timestamps for API requests
  • Eastern Time conversion for market hours

Sources: Cargo.toml:16-39 Cargo.lock:403-868

Python Technology Stack

Machine Learning Framework

Training Configuration:

  • Patience : 20 epochs for early stopping
  • Checkpoint Strategy : Save top 1 model by validation loss
  • Learning Rate : Configurable via checkpoint override
  • Gradient Clipping : Prevents exploding gradients
graph TB
    subgraph "Core Scientific Libraries"
        numpy["NumPy\nArray operations"]
pandas["pandas\nDataFrame manipulation"]
sklearn["scikit-learn\nPCA, RobustScaler"]
end
    
    subgraph "Preprocessing Pipeline"
        PCA["PCA\nDimensionality reduction\nVariance threshold: 0.95"]
RobustScaler["RobustScaler\nPer concept/unit pair\nOutlier-robust normalization"]
sklearn --> PCA
 
       sklearn --> RobustScaler
    end
    
    subgraph "Data Structures"
        triplets["Concept/Unit/Value Triplets"]
embeddings["Semantic Embeddings\nPCA-compressed"]
RobustScaler --> triplets
 
       PCA --> embeddings
 
       triplets --> embeddings
    end
    
    subgraph "Data Ingestion"
        csv_files["CSV Files\nfrom Rust sec-fetcher"]
UsGaapRowRecord["UsGaapRowRecord\nParsed row structure"]
csv_files --> UsGaapRowRecord
 
       UsGaapRowRecord --> numpy
    end

Sources: High-Level System Architecture (Diagram 3), narrative_stack training pipeline

Data Processing and Scientific Computing

scikit-learn Components:

  • PCA : Reduces embedding dimensionality while retaining 95% variance
  • RobustScaler : Normalizes using median and IQR, resistant to outliers
  • Both fitted per unique concept/unit pair for specialized normalization

NumPy Usage:

  • Validation via np.isclose() checks
  • Scaler transformation verification
  • Embedding matrix storage
graph TB
    subgraph "Database Layer"
        mysql_db[("MySQL Database\nus_gaap_test")]
        DbUsGaap["DbUsGaap\nDatabase interface"]
asyncio["asyncio\nAsync queries"]
DbUsGaap --> mysql_db
 
       DbUsGaap --> asyncio
    end
    
    subgraph "WebSocket Storage"
        DataStoreWsClient["DataStoreWsClient\nWebSocket client"]
simd_r_drive_server["simd-r-drive\nWebSocket server"]
DataStoreWsClient --> simd_r_drive_server
    end
    
    subgraph "Unified Access"
        UsGaapStore["UsGaapStore\nFacade pattern"]
UsGaapStore --> DbUsGaap
 
       UsGaapStore --> DataStoreWsClient
    end
    
    subgraph "Data Models"
        triplet_storage["Triplet Storage:\n- concept\n- unit\n- scaled_value\n- scaler\n- embedding"]
UsGaapStore --> triplet_storage
    end

Sources: High-Level System Architecture (Diagram 3), us_gaap_store preprocessing pipeline

Database and Storage Integration

Storage Strategy:

  • MySQL : Relational queries for ingested CSV data
  • WebSocket Store : High-performance embedding matrix storage
  • Facade Pattern : UsGaapStore abstracts storage backend choice

Python WebSocket Client:

  • Package: simd-r-drive-client (Python equivalent of Rust simd-r-drive)
  • Async communication with WebSocket server
  • Shared embedding matrices between Rust and Python

Sources: High-Level System Architecture (Diagram 3), Database & Storage Integration section

Visualization and Debugging

| Library | Purpose | Key Outputs |
|---|---|---|
| matplotlib | Static plots | PCA variance plots, scaler distribution |
| TensorBoard | Training metrics | Loss curves, learning rate schedules |
| itertools | Data iteration | Batching, grouping concept pairs |

Visualization Types:

  1. PCA Explanation Plots : Cumulative variance by component
  2. Semantic Embedding Scatterplots : t-SNE/UMAP projections of concept space
  3. Variance Analysis : Per-concept/unit value distributions
  4. Training Curves : TensorBoard integration via PyTorch Lightning
graph TB
    subgraph "Server Implementation"
        server["simd-r-drive-ws-server\nWebSocket server"]
DataStore_server["DataStore\nBackend storage"]
server --> DataStore_server
    end
    
    subgraph "Rust Clients"
        http_cache_rust["HTTP Cache\n(Rust)"]
preprocessor_cache_rust["Preprocessor Cache\n(Rust)"]
http_cache_rust --> DataStore_server
 
       preprocessor_cache_rust --> DataStore_server
    end
    
    subgraph "Python Clients"
        DataStoreWsClient_py["DataStoreWsClient\n(Python)"]
embedding_matrix["Embedding Matrix\nStorage"]
DataStoreWsClient_py --> DataStore_server
 
       embedding_matrix --> DataStoreWsClient_py
    end
    
    subgraph "Docker Deployment"
        dockerfile["Dockerfile\nContainer config"]
rust_image["rust:1.87\nBase image"]
dockerfile --> rust_image
 
       dockerfile --> server
    end

Sources: High-Level System Architecture (Diagram 3), Validation & Analysis section

Shared Infrastructure

simd-r-drive WebSocket Server

Key Features:

  • Cross-language data sharing via WebSocket protocol
  • Binary-efficient key-value storage
  • Persistent storage across restarts
  • Docker containerization for CI/CD environments

Usage Patterns:

  1. Rust writes HTTP cache entries
  2. Rust writes preprocessor transformations
  3. Python reads/writes embedding matrices
  4. Python reads ingested US GAAP triplets

Sources: Cargo.toml:36-37 High-Level System Architecture (Diagram 1, Diagram 5)

File System Interchange

Directory Structure:

  • fund-holdings/A-Z/ : NPORT filing holdings, alphabetized by ticker
  • us-gaap/ : Fundamental concepts CSV outputs
  • Row-based CSV format with standardized columns

Data Flow:

  1. Rust fetches from SEC API
  2. Rust transforms via distill_us_gaap_fundamental_concepts
  3. Rust writes CSV files
  4. Python walks directories
  5. Python parses into UsGaapRowRecord
  6. Python ingests to MySQL and WebSocket store
graph TB
    subgraph "CI Pipeline"
        workflow[".github/workflows/\nus-gaap-store-integration-test"]
checkout["Checkout code"]
rust_setup["Setup Rust 1.87"]
docker_compose["Docker Compose\nsimd-r-drive-ci-server"]
mysql_container["MySQL test container"]
workflow --> checkout
 
       checkout --> rust_setup
 
       checkout --> docker_compose
 
       docker_compose --> mysql_container
    end
    
    subgraph "Test Execution"
        cargo_test["cargo test\nRust unit tests"]
pytest["pytest\nPython integration tests"]
rust_setup --> cargo_test
 
       docker_compose --> pytest
    end
    
    subgraph "Testing Tools"
        mockito_tests["mockito 1.7.0\nHTTP mock server"]
tempfile_tests["tempfile 3.18.0\nTemp directories"]
cargo_test --> mockito_tests
 
       cargo_test --> tempfile_tests
    end

Sources: High-Level System Architecture (Diagram 1), main.rs CSV output logic

Development and CI/CD Stack

GitHub Actions Workflow

Docker Configuration:

  • Image : rust:1.87 for reproducible builds
  • Services : simd-r-drive-ws-server, MySQL 8.0
  • Environment Variables : Database credentials, WebSocket ports
  • Volume Mounts : Test data directories

Testing Strategy:

  • Rust Unit Tests : mockito for HTTP mocking, tempfile for isolated test fixtures
  • Python Integration Tests : pytest with MySQL fixtures, WebSocket client validation
  • End-to-End : Full pipeline from CSV ingestion through ML training

Sources: Cargo.toml:42-44 High-Level System Architecture (Diagram 5), CI/CD Pipeline section

Testing Dependencies

| Language | Framework | Mocking | Assertions |
|---|---|---|---|
| Rust | Built-in #[test] | mockito 1.7.0 | Standard assertions |
| Python | pytest | unittest.mock | np.isclose() |

Rust Test Modules:

  • config_manager_tests: Configuration loading and merging
  • sec_client_tests: HTTP client behavior
  • distill_tests: US GAAP concept transformation

Python Test Modules:

  • Integration tests: End-to-end pipeline validation
  • Unit tests: Scaler, PCA, embedding generation

Sources: Cargo.lock:1802-1824 Testing Strategy documentation

Version Compatibility Matrix

Critical Version Constraints

| Component | Version | Constraints | Reason |
|---|---|---|---|
| tokio | 1.43.0 | >= 1.0 | Stable async runtime API |
| reqwest | 0.12.15 | 0.12.x | Compatible with reqwest-drive middleware |
| polars | 0.46.0 | 0.46.x | Breaking changes between minor versions |
| simd-r-drive | 0.3.0-alpha.1 | Exact match | Alpha API, unstable |
| serde | 1.0.218 | >= 1.0.100 | Derive macro stability |
| hyper | 1.6.0 | 1.x | HTTP/2 support, reqwest dependency |

Rust Toolchain

Minimum Rust Version: 1.70+ (for async trait bounds and GAT support)

Feature Requirements:

  • async-trait for trait async methods
  • #[tokio::main] macro support
  • Generic Associated Types (GATs) for polars iterators

Sources: Cargo.toml:4 Cargo.lock:1-3

Python Version Requirements

Estimated Python Version: 3.8+ (for PyTorch Lightning compatibility)

Key Constraints:

  • PyTorch: Requires Python 3.8 or later
  • PyTorch Lightning: Requires PyTorch >= 1.9
  • asyncio: Stable async/await syntax (Python 3.7+)

Sources: High-Level System Architecture (Python technology sections)

Transitive Dependency Highlights

HTTP/TLS Stack Depth

graph LR
    reqwest["reqwest"] --> hyper
    hyper --> h2["h2\nHTTP/2"]
    hyper --> http["http\nTypes"]
    reqwest --> rustls
    reqwest --> native_tls
    rustls --> ring["ring\nCryptography"]
    native_tls --> openssl["openssl\nSystem TLS"]
    h2 --> tokio
    hyper --> tokio

Notable Transitive Dependencies:

  • h2 0.4.8: HTTP/2 protocol implementation (37 dependencies)
  • rustls 0.23.21: TLS 1.2/1.3 without OpenSSL (14 dependencies)
  • tokio-util 0.7.13: Codec, framing utilities (8 dependencies)

Sources: Cargo.lock:1141-1351

Polars Ecosystem Depth

The polars dependency pulls in 23 related crates:

  • polars-core, polars-lazy, polars-io, polars-ops
  • polars-arrow: Apache Arrow array implementation
  • polars-parquet: Parquet file format support
  • polars-sql: SQL query interface
  • polars-time: Temporal operations
  • polars-utils: Shared utilities

Total Transitive: ~180 crates in the full dependency tree
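
The tree can be inspected locally with cargo's built-in tree subcommand (shown for illustration; not a required build step):

```bash
# Everything polars pulls into the build.
cargo tree -p polars

# Which crates depend, directly or transitively, on rustls.
cargo tree -i rustls
```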

Sources: Cargo.lock:2170-2394

Security and Cryptography

Platform-Specific Cryptography:

  • macOS: Security.framework via security-framework-sys
  • Linux: D-Bus + libdbus-sys
  • Windows: Win32 Credential Manager via windows-sys

Sources: Cargo.lock:764-1653


Total Rust Dependencies: ~250 crates (including transitive)

Key Architectural Decisions:

  1. Dual TLS support for platform flexibility
  2. Polars instead of native DataFrame for performance
  3. simd-r-drive for cross-language data sharing
  4. Alpha versions for simd-r-drive (active development)
  5. Platform-specific credential storage via keyring

Sources: Cargo.toml:1-45 Cargo.lock:1 High-Level System Architecture (all diagrams)