Overview


This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see Getting Started. For detailed implementation documentation of individual components, see Rust sec-fetcher Application and Python narrative_stack System.

Sources: Cargo.toml, README.md, src/lib.rs

System Purpose

The rust-sec-fetcher repository implements a dual-language financial data processing system that:

  1. Fetches company financial data from the SEC EDGAR API
  2. Transforms raw SEC filings into normalized US GAAP fundamental concepts
  3. Stores structured data as CSV files organized by ticker symbol
  4. Trains machine learning models to understand financial concept relationships

The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing 57+ variations of concepts like revenue and consolidating them into a standardized taxonomy of 64 FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.

Sources: Cargo.toml:1-6, src/main.rs:173-240

Dual-Language Architecture

The repository employs a dual-language design that leverages the strengths of both Rust and Python:

| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, CSV generation | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |

Rust Layer (sec-fetcher):

  • Implements SecClient with sophisticated throttling and caching policies
  • Fetches company tickers, CIK submissions, NPORT filings, and US GAAP fundamentals
  • Transforms raw financial data via distill_us_gaap_fundamental_concepts
  • Outputs structured CSV files organized by ticker symbol

Python Layer (narrative_stack):

  • Ingests CSV files generated by Rust layer
  • Generates semantic embeddings for concept/unit pairs
  • Applies dimensionality reduction via PCA
  • Trains Stage1Autoencoder models using PyTorch Lightning

Sources: Cargo.toml:1-40, src/main.rs:1-16

graph TB
    SEC["SEC EDGAR API\ncompany_tickers.json\nCIK submissions\ncompanyfacts dataset"]
SecClient["SecClient\n(network layer)"]
NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_us_gaap_fundamentals\nfetch_nport_filing_by_ticker_symbol\nfetch_investment_company_series_and_class_dataset"]
Distill["distill_us_gaap_fundamental_concepts\n(transformers module)\nMaps 57+ variations → 64 concepts"]
Models["Data Models\nTicker, CikSubmission\nNportInvestment\nFundamentalConcept enum"]
CSV["File System\ndata/fund-holdings/A-Z/\ndata/us-gaap/"]
Ingest["Python Ingestion\nus_gaap_store.ingest_us_gaap_csvs\nWalks CSV directories"]
Preprocess["Preprocessing\nPCA dimensionality reduction\nRobustScaler normalization\nSemantic embeddings"]
DataStore["simd-r-drive\nWebSocket Key-Value Store\nEmbedding matrix storage"]
Training["Stage1Autoencoder\nPyTorch Lightning\nTensorBoard logging"]
SEC -->|HTTP GET| SecClient
 
   SecClient --> NetworkFuncs
 
   NetworkFuncs --> Distill
 
   Distill --> Models
 
   Models --> CSV
    
 
   CSV -.->|CSV files| Ingest
 
   Ingest --> Preprocess
 
   Preprocess --> DataStore
 
   DataStore --> Training

High-Level Data Flow

Data Flow Summary:

  1. Fetch : SecClient retrieves data from SEC EDGAR API endpoints
  2. Transform : Raw financial data passes through distill_us_gaap_fundamental_concepts to normalize concept names
  3. Store : Structured data is written to CSV files, organized by ticker symbol first letter
  4. Ingest : Python scripts walk CSV directories and parse records
  5. Preprocess : Generate embeddings, apply PCA, normalize values
  6. Train : ML models learn financial concept relationships

Sources: src/main.rs:1-16, src/main.rs:173-240, src/lib.rs:1-11

graph TB
    main["main.rs\nApplication entry point"]
config["config module\nConfigManager\nAppConfig"]
network["network module\nSecClient\nfetch_* functions\nThrottlePolicy\nCachePolicy"]
transformers["transformers module\ndistill_us_gaap_fundamental_concepts"]
models["models module\nTicker\nCik\nCikSubmission\nNportInvestment\nAccessionNumber"]
enums["enums module\nFundamentalConcept\nCacheNamespacePrefix\nTickerOrigin\nUrl"]
caches["caches module (private)\nCaches struct\nOnceLock statics\nHTTP cache\nPreprocessor cache"]
utils["utils module\ninvert_multivalue_indexmap\nis_development_mode\nis_interactive_mode\nVecExtensions"]
fs["fs module\nFile system utilities"]
parsers["parsers module\nData parsing functions"]
main --> config
 
   main --> network
 
   main --> utils
    
 
   network --> config
 
   network --> caches
 
   network --> models
 
   network --> transformers
 
   network --> parsers
    
 
   transformers --> enums
 
   transformers --> models
    
 
   models --> enums
    
 
   caches --> config

Core Module Structure

Module Descriptions:

| Module | Primary Purpose | Key Exports |
|---|---|---|
| config | Configuration management and credential loading | ConfigManager, AppConfig |
| network | HTTP client, data fetching, throttling, caching | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals, fetch_nport_filing_by_ticker_symbol |
| transformers | US GAAP concept normalization (importance: 8.37) | distill_us_gaap_fundamental_concepts |
| models | Data structures for SEC entities | Ticker, Cik, CikSubmission, NportInvestment, AccessionNumber |
| enums | Type-safe enumerations | FundamentalConcept (64 variants), CacheNamespacePrefix, TickerOrigin, Url |
| caches | Internal caching infrastructure | Caches (private module) |
| utils | Utility functions | invert_multivalue_indexmap, VecExtensions, is_development_mode |

Sources: src/lib.rs:1-11, src/main.rs:1-16, src/utils.rs:1-12

Rust Component Architecture

The Rust layer is organized around SecClient, which provides a high-level HTTP interface with integrated throttling and caching. Network functions (fetch_*) use this client to retrieve data from SEC EDGAR endpoints. The most critical component is distill_us_gaap_fundamental_concepts, which normalizes financial concepts using four mapping patterns:

  • One-to-One : Direct mappings (e.g., Assets → FundamentalConcept::Assets)
  • Hierarchical : Specific concepts also map to parent categories (e.g., CurrentAssets maps to both CurrentAssets and Assets)
  • Synonym Consolidation : Multiple terms map to single concept (e.g., 6 cost variations → CostOfRevenue)
  • Industry-Specific : 57+ revenue variations map to Revenues
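
To make these patterns concrete, here is a minimal, hypothetical sketch of how such a mapping could be expressed; the actual distill_us_gaap_fundamental_concepts implementation in the transformers module is more extensive, and the raw tag names shown here are illustrative assumptions rather than the real mapping table:

```rust
use crate::enums::FundamentalConcept;

// Hypothetical sketch: map one raw US GAAP tag to zero or more normalized
// FundamentalConcept variants, illustrating the four mapping patterns above.
fn map_concept(raw_tag: &str) -> Vec<FundamentalConcept> {
    match raw_tag {
        // One-to-One: direct mapping
        "Assets" => vec![FundamentalConcept::Assets],
        // Hierarchical: a specific concept also maps to its parent category
        "AssetsCurrent" => vec![
            FundamentalConcept::CurrentAssets,
            FundamentalConcept::Assets,
        ],
        // Synonym consolidation: multiple cost variations collapse to one concept
        "CostOfGoodsSold" | "CostOfServices" => vec![FundamentalConcept::CostOfRevenue],
        // Industry-specific: the many revenue variations all normalize to Revenues
        "SalesRevenueNet" | "RevenueFromContractWithCustomer" => {
            vec![FundamentalConcept::Revenues]
        }
        _ => vec![],
    }
}
```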

Sources: src/main.rs:1-16, Cargo.toml:8-40

Python Component Architecture

The Python narrative_stack system consumes CSV files produced by the Rust layer and trains machine learning models:

Key Components:

| Component | Module/Class | Purpose |
|---|---|---|
| Ingestion | us_gaap_store.ingest_us_gaap_csvs | Walks CSV directories, parses UsGaapRowRecord entries |
| Preprocessing | PCA, RobustScaler | Generates semantic embeddings, normalizes values, reduces dimensionality |
| Storage | DbUsGaap, DataStoreWsClient, UsGaapStore | Database interface, WebSocket client, unified data access facade |
| Training | Stage1Autoencoder | PyTorch Lightning autoencoder for learning concept embeddings |
| Validation | np.isclose checks | Scaler verification, embedding validation |

The preprocessing pipeline creates concept/unit/value triplets with associated embeddings and scalers, storing them in both MySQL and the simd-r-drive WebSocket data store. The Stage1Autoencoder learns latent representations by reconstructing embedding + scaled_value inputs.

Sources: (Python code not included in provided files, based on architecture diagrams)

Key Technologies

Rust Dependencies:

| Crate | Purpose |
|---|---|
| tokio | Async runtime for I/O operations |
| reqwest | HTTP client library |
| polars | DataFrame operations and CSV handling |
| rayon | CPU parallelism |
| simd-r-drive | WebSocket data store integration |
| serde, serde_json | Serialization/deserialization |
| chrono | Date/time handling |

Python Dependencies:

  • PyTorch & PyTorch Lightning (ML training)
  • pandas, numpy (data processing)
  • scikit-learn (PCA, RobustScaler)
  • matplotlib, TensorBoard (visualization)

Sources: Cargo.toml:8-40

CSV Output Organization

The Rust application writes CSV files to organized directories:

data/
├── fund-holdings/
│   ├── A/
│   │   ├── AAPL.csv
│   │   ├── AMZN.csv
│   │   └── ...
│   ├── B/
│   ├── C/
│   └── ...
└── us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── ...

Files are organized by the first letter of the ticker symbol (uppercased) to facilitate parallel processing and improve file system performance with large datasets.

Sources: src/main.rs:83-102, src/main.rs:206-221

Next Steps

Sources: src/main.rs:1-240, Cargo.toml:1-45, src/lib.rs:1-11



Getting Started


This page guides you through installing, configuring, and running the rust-sec-fetcher application. It covers building the Rust binary, setting up required credentials, and executing your first data fetch. For detailed configuration options, see Configuration System. For comprehensive examples, see Running Examples.

The rust-sec-fetcher is the Rust component of a dual-language system. It fetches and transforms SEC financial data into structured CSV files. The companion Python system (narrative_stack) processes these files for machine learning applications.


Prerequisites

Before installation, ensure you have:

| Requirement | Purpose | Notes |
|---|---|---|
| Rust 1.87+ | Compile sec-fetcher | Edition 2021 features required |
| Email Address | SEC EDGAR API access | Required by SEC for API identification |
| 4+ GB Disk Space | Cache and CSV storage | Default location: data/ directory |
| Internet Connection | SEC API access | Throttled to 1 request/second |

Optional Components:

Sources: Cargo.toml:1-45


Installation

Clone Repository

Build from Source
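
Assuming the standard Cargo workflow shown in the installation diagram below (the exact commands are not included in the extracted page):

```sh
cargo build            # debug build
cargo build --release  # optimized release build
```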

The compiled binary will be located at:

  • Debug: target/debug/sec-fetcher
  • Release: target/release/sec-fetcher

Verify Installation
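
The verification command is not shown in the extracted page; based on the installation diagram below, it is presumably the configuration example:

```sh
cargo run --example config
```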

If successful, this will load the configuration and display it in JSON format. If no configuration exists, it will prompt for your email address in interactive mode.

Installation Flow Diagram

graph TB
    Clone["Clone Repository\nrust-sec-fetcher"]
Build["cargo build --release"]
Binary["Binary Created\ntarget/release/sec-fetcher"]
Config["Configuration Setup\nConfigManager::load()"]
Verify["Run Example\ncargo run --example config"]
Clone --> Build
 
   Build --> Binary
 
   Binary --> Config
 
   Config --> Verify
    
    ConfigFile["Configuration File\nsec_fetcher_config.toml"]
Credential["Email Credential\nCredentialManager"]
Config --> ConfigFile
 
   Config --> Credential
    
 
   Verify --> Success["Display AppConfig\nJSON Output"]
Verify --> Error["Missing Email\nPrompt in Interactive Mode"]

Sources: Cargo.toml:1-6 src/config/config_manager.rs:20-23 examples/config.rs:1-17


Basic Configuration

The application uses a TOML configuration file combined with system credential storage for the required email address.

Configuration File Location

The ConfigManager searches for configuration files in this order:

  1. System Config Directory : Platform-specific location returned by ConfigManager::get_suggested_system_path()

    • Linux: ~/.config/sec-fetcher/config.toml
    • macOS: ~/Library/Application Support/sec-fetcher/config.toml
    • Windows: C:\Users\<User>\AppData\Roaming\sec-fetcher\config.toml
  2. Current Directory : sec_fetcher_config.toml (fallback)

Configuration Fields

The AppConfig structure src/config/app_config.rs:15-32 supports the following fields:

| Field | Type | Default | Description |
|---|---|---|---|
| email | Option<String> | None | Required - Your email for SEC API identification |
| max_concurrent | Option<usize> | 1 | Maximum concurrent requests |
| min_delay_ms | Option<u64> | 1000 | Minimum delay between requests (milliseconds) |
| max_retries | Option<usize> | 5 | Maximum retry attempts for failed requests |
| cache_base_dir | Option<PathBuf> | "data" | Base directory for caching and CSV output |

Example Configuration File

Create sec_fetcher_config.toml:
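
A minimal example based on the fields documented in the table above (the values shown here are illustrative):

```toml
email = "you@example.com"
max_concurrent = 1
min_delay_ms = 1000
max_retries = 5
cache_base_dir = "data"
```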

Email Credential Setup

The SEC EDGAR API requires an email address in the User-Agent header. The application manages this through the CredentialManager:

Interactive Mode (when running from terminal):

Non-Interactive Mode (CI/CD, background processes):

  • Email must be pre-configured in sec_fetcher_config.toml
  • Or stored in system credential manager via prior interactive session

Configuration Loading Flow Diagram

graph TB
    Start["ConfigManager::load()"]
PathCheck{"Config Path\nExists?"}
LoadFile["Config::builder()\nadd_source(File)"]
DefaultConfig["AppConfig::default()"]
MergeUser["settings.merge(user_settings)"]
EmailCheck{"Email\nConfigured?"}
InteractiveCheck{"is_interactive_mode()?"}
Prompt["CredentialManager::from_prompt()"]
KeyringGet["credential_manager.get_credential()"]
Error["Error: Could not obtain email"]
InitCaches["Caches::init(config_manager)"]
Complete["ConfigManager Instance"]
Start --> PathCheck
 
   PathCheck -->|Yes| LoadFile
 
   PathCheck -->|No Fallback| LoadFile
 
   LoadFile --> DefaultConfig
 
   DefaultConfig --> MergeUser
    
 
   MergeUser --> EmailCheck
 
   EmailCheck -->|Missing| InteractiveCheck
 
   EmailCheck -->|Present| InitCaches
    
 
   InteractiveCheck -->|Yes| Prompt
 
   InteractiveCheck -->|No| Error
    
 
   Prompt --> KeyringGet
 
   KeyringGet -->|Success| InitCaches
 
   KeyringGet -->|Failure| Error
    
 
   InitCaches --> Complete

Sources: src/config/config_manager.rs:20-86, src/config/app_config.rs:15-54, Cargo.toml:20


Running Your First Data Fetch

Example: Configuration Display

The simplest example displays the loaded configuration:

Code Structure examples/config.rs:1-17:

  1. ConfigManager::load() - Loads configuration from file + credentials
  2. config_manager.get_config() - Retrieves AppConfig reference
  3. config.pretty_print() - Serializes to formatted JSON

Expected Output:

Example: Lookup CIK by Ticker

Fetch the Central Index Key (CIK) for a company ticker symbol:

This example demonstrates:

  • SecClient initialization with throttling
  • fetch_company_tickers() - Downloads SEC company tickers JSON
  • fetch_cik_by_ticker_symbol() - Maps ticker → CIK
  • Caching behavior (subsequent runs use cached data)

Example: Fetch NPORT Filing

Download and parse an NPORT-P investment company filing:

This example shows:

  • Fetching XML filing by accession number
  • Parsing NportInvestment data structures
  • CSV output to data/fund-holdings/{A-Z}/ directories

For detailed walkthrough of all examples, see Running Examples.

Example Execution Flow Diagram

Sources: examples/config.rs:1-17 src/config/config_manager.rs:20-23 Cargo.toml:28-29


Data Output Structure

The application organizes fetched data into a structured directory hierarchy:

data/
├── http_cache/              # HTTP response cache (simd-r-drive)
│   └── sec.gov/
│       └── *.bin            # Cached API responses
│
├── fund-holdings/           # NPORT filing data by ticker
│   ├── A/
│   │   ├── AAPL_holdings.csv
│   │   └── AMZN_holdings.csv
│   ├── B/
│   │   └── MSFT_holdings.csv
│   └── ...                  # A-Z directories
│
└── us-gaap/                 # US GAAP fundamental data
    ├── AAPL_fundamentals.csv
    ├── MSFT_fundamentals.csv
    └── ...

CSV File Formats

US GAAP Fundamentals (us-gaap/*.csv):

  • Ticker symbol
  • Filing date
  • Fiscal period
  • FundamentalConcept (64 normalized concepts)
  • Value
  • Units
  • Accession number

NPORT Holdings (fund-holdings/{A-Z}/*.csv):

  • Fund CIK
  • Investment ticker symbol
  • Investment name
  • Balance (shares)
  • Value (USD)
  • Percentage of portfolio
  • Asset category
  • Issuer category

Data Flow from API to CSV Diagram

Sources: src/config/app_config.rs:31-44, Cargo.toml:24


Cache Behavior

The application implements two-tier caching to minimize redundant API calls:

HTTP Cache

  • Storage : simd-r-drive key-value store (Cargo.toml:36)
  • Location : {cache_base_dir}/http_cache/
  • TTL : 1 week (168 hours)
  • Scope : Raw HTTP responses from SEC API

Preprocessor Cache

  • Storage : In-memory DashMap with persistent backup
  • Scope : Transformed data structures (after distill_us_gaap_fundamental_concepts)
  • Purpose : Skip expensive concept normalization on repeated runs

Cache Initialization : The Caches::init() function src/config/config_manager.rs:98-100 is called automatically during ConfigManager construction.

For detailed caching architecture, see Caching & Storage System.

Sources: Cargo.toml:14-37 src/config/config_manager.rs:98-100


Troubleshooting Common Issues

| Issue | Cause | Solution |
|---|---|---|
| "Could not obtain email credential" | No email configured in non-interactive mode | Add email = "..." to config file or run interactively once |
| "Config path does not exist" | Invalid custom config path | Check path spelling or omit to use defaults |
| "unknown field" in config | Typo in TOML key name | Run cargo run --example config to see valid keys |
| Rate limit errors from SEC | min_delay_ms too low | Increase to 1000+ ms (SEC requires 1 req/sec max) |
| Cache directory permission denied | Insufficient filesystem permissions | Change cache_base_dir to writable location |

Debug Configuration Issues:

The AppConfig::get_valid_keys() function src/config/app_config.rs:62-77 dynamically generates a list of valid configuration fields with their expected types using JSON schema introspection.

Sources: src/config/config_manager.rs:49-77 src/config/app_config.rs:62-77


Next Steps

Now that you have the application configured and running, explore these topics:

For Python ML pipeline setup, see Python narrative_stack System.

Sources: examples/config.rs:1-17 src/config/config_manager.rs:1-121 src/config/app_config.rs:1-159



Configuration System


This document describes the configuration management system in the Rust sec-fetcher application. The configuration system provides a flexible, multi-source approach to managing application settings including SEC API credentials, request throttling parameters, and cache directories.

For information about how the configuration integrates with the caching system, see Caching & Storage System. For details on credential storage mechanisms, see the credential management section below.


Overview

The configuration system consists of three primary components:

| Component | Purpose | File Location |
|---|---|---|
| AppConfig | Data structure holding all configuration fields | src/config/app_config.rs |
| ConfigManager | Loads, merges, and validates configuration from multiple sources | src/config/config_manager.rs |
| CredentialManager | Handles email credential storage and retrieval (keyring or prompt) | Referenced in src/config/config_manager.rs:66-72 |

The system supports configuration from three sources with increasing priority:

  1. Default values - Hard-coded defaults in AppConfig::default()
  2. TOML configuration file - User-specified settings loaded from disk
  3. Interactive prompts - Credential collection when running in interactive mode

Sources: src/config/app_config.rs:15-54 src/config/config_manager.rs:11-120


Configuration Structure

AppConfig Fields

All fields in AppConfig are wrapped in Option<T> to support partial configuration and merging strategies. The #[merge(strategy = overwrite_option)] attribute ensures that non-None values from user configuration always replace default values.

Sources: src/config/app_config.rs:15-54 examples/config.rs:13-16


Configuration Loading Flow

ConfigManager Initialization

Sources: src/config/config_manager.rs:19-86 src/config/config_manager.rs:107-119


Configuration File Format

TOML Structure

The configuration file uses TOML format with strict schema validation. Any unrecognized keys will cause a descriptive error listing all valid keys and their types.

Example configuration file (sec_fetcher_config.toml):

Configuration File Locations

The system searches for configuration files in the following order:

| Priority | Location | Description |
|---|---|---|
| 1 | User-provided path | Passed to ConfigManager::from_config(Some(path)) |
| 2 | System config directory | ~/.config/sec-fetcher/config.toml (Unix), %APPDATA%\sec-fetcher\config.toml (Windows) |
| 3 | Current directory | ./sec_fetcher_config.toml |

Sources: src/config/config_manager.rs:107-119 tests/config_manager_tests.rs:36-57


Credential Management

Email Credential Requirement

The SEC EDGAR API requires a valid email address in the HTTP User-Agent header. The configuration system ensures this credential is available through multiple strategies:

Interactive Mode Detection:

graph TD
    Start["ConfigManager Initialization"]
CheckEmail{"Email in\nAppConfig?"}
CheckInteractive{"is_interactive_mode()?"}
PromptUser["CredentialManager::from_prompt()"]
ReadKeyring["Read from system keyring"]
PromptInput["Prompt user for email"]
SaveKeyring["Save to keyring (optional)"]
SetEmail["Set email in user_settings"]
Error["Return Error:\nCould not obtain email credential"]
MergeConfig["Merge configurations"]
Start --> CheckEmail
 
   CheckEmail -->|Yes| MergeConfig
 
   CheckEmail -->|No| CheckInteractive
 
   CheckInteractive -->|Yes| PromptUser
 
   CheckInteractive -->|No| Error
 
   PromptUser --> ReadKeyring
 
   ReadKeyring -->|Found| SetEmail
 
   ReadKeyring -->|Not found| PromptInput
 
   PromptInput --> SaveKeyring
 
   SaveKeyring --> SetEmail
 
   SetEmail --> MergeConfig

The is_interactive_mode() check:

  • Returns true if stdout is a terminal (TTY)
  • Can be overridden via set_interactive_mode_override() for testing

Sources: src/config/config_manager.rs:64-77 tests/config_manager_tests.rs:19-33


Configuration Merging Strategy

Merge Behavior

The AppConfig struct uses the merge crate with a custom overwrite_option strategy:

Merge Rules:

  1. Some(new_value) always replaces Some(old_value)
  2. Some(new_value) always replaces None
  3. None never replaces Some(old_value)

This ensures user-provided values take absolute precedence over defaults while allowing partial configuration.
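
A minimal sketch of an overwrite_option strategy consistent with these rules (the actual function in src/config/app_config.rs may differ in detail):

```rust
// Merge strategy sketch: a Some value from the higher-priority source always
// wins; None never erases an existing value.
fn overwrite_option<T>(left: &mut Option<T>, right: Option<T>) {
    if right.is_some() {
        *left = right;
    }
}
```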

Sources: src/config/app_config.rs:8-13, src/config/config_manager.rs:79


Schema Validation and Error Handling

graph LR
    TOML["TOML File with\ninvalid_key"]
    Deserialize["config.try_deserialize()"]
    ExtractSchema["AppConfig::get_valid_keys()"]
    Schema["schemars::schema_for!(AppConfig)"]
    FormatError["Format error message with\nvalid keys and types"]
    Error["Return descriptive error:\n- email (String | Null)\n- max_concurrent (Integer | Null)\n- min_delay_ms (Integer | Null)\n- max_retries (Integer | Null)\n- cache_base_dir (String | Null)"]
    TOML --> Deserialize
    Deserialize -->|Error| ExtractSchema
    ExtractSchema --> Schema
    Schema --> FormatError
    FormatError --> Error

Invalid Key Detection

The configuration system uses serde(deny_unknown_fields) to reject unknown keys in TOML files. When an invalid key is detected, the error message includes a complete list of valid keys with their types:

Example error output:

unknown field `invalid_key`, expected one of `email`, `max_concurrent`, `min_delay_ms`, `max_retries`, `cache_base_dir`

Valid configuration keys are:
  - email (String | Null)
  - max_concurrent (Integer | Null)
  - min_delay_ms (Integer | Null)
  - max_retries (Integer | Null)
  - cache_base_dir (String | Null)

Sources: src/config/app_config.rs:62-77 src/config/config_manager.rs:45-62 tests/config_manager_tests.rs:67-94


Usage Examples

Loading Configuration

Default configuration path:

Custom configuration path:

Direct AppConfig construction:
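
The code for these snippets is not present in the extracted page; a minimal sketch based on the APIs referenced elsewhere in this document (module paths and exact parameter types are assumptions):

```rust
use std::path::PathBuf;

// Assumed re-export paths, shown for illustration only.
use sec_fetcher::config::{AppConfig, ConfigManager};

fn load_configuration_examples() -> Result<(), Box<dyn std::error::Error>> {
    // Default configuration path: system config dir, then ./sec_fetcher_config.toml
    let _default = ConfigManager::load()?;

    // Custom configuration path
    let _custom = ConfigManager::from_config(Some(PathBuf::from("my_config.toml")))?;

    // Direct AppConfig construction (all fields at their defaults)
    let _config = AppConfig::default();

    Ok(())
}
```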

Sources: examples/config.rs:1-17 tests/config_manager_tests.rs:36-57


graph LR
    CM["ConfigManager::from_config()"]
InitCaches["init_caches()"]
CachesInit["Caches::init(self)"]
HTTPCache["Initialize HTTP cache\nwith cache_base_dir"]
PreprocCache["Initialize preprocessor cache"]
OnceLock["Store in OnceLock statics"]
CM --> InitCaches
 
   InitCaches --> CachesInit
 
   CachesInit --> HTTPCache
 
   CachesInit --> PreprocCache
 
   HTTPCache --> OnceLock
 
   PreprocCache --> OnceLock

Cache Initialization

The ConfigManager automatically initializes the global Caches module during construction:

The cache_base_dir field from AppConfig is passed to the Caches module to determine where HTTP responses and preprocessed data are stored. For detailed information on cache policies and storage mechanisms, see Caching & Storage System.

Sources: src/config/config_manager.rs:83-100


Testing

Test Coverage

The configuration system includes comprehensive unit tests:

| Test Function | Purpose |
|---|---|
| test_load_custom_config | Verifies loading from custom TOML file path |
| test_load_non_existent_config | Ensures proper error on missing file |
| test_fails_on_invalid_key | Validates schema enforcement and error messages |
| test_fails_if_no_email_available | Checks email requirement in non-interactive mode (commented) |

Test Utilities:

  • create_temp_config(contents: &str) - Creates temporary TOML files for testing
  • set_interactive_mode_override(Some(false)) - Disables interactive prompts during tests

Sources: tests/config_manager_tests.rs:1-95


Configuration System Components

Complete Component Diagram

Sources: src/config/app_config.rs:1-159 src/config/config_manager.rs:1-121



Running Examples


This page demonstrates how to run the example programs included in the repository. Each example illustrates specific functionality of the sec-fetcher system, from basic configuration validation to fetching company data from the SEC EDGAR API. These examples serve as practical starting points for understanding how to integrate the library into your own applications.

For detailed information about configuring the application before running these examples, see Configuration System. For understanding the underlying network layer and data fetching mechanisms, see Network Layer & SecClient and Data Fetching Functions.

Example Programs Overview

The repository includes five example programs located in the examples/ directory. Each demonstrates different aspects of the system:

| Example Program | Primary Purpose | Key Functions Demonstrated | Command-Line Args |
|---|---|---|---|
| config.rs | Configuration validation and inspection | ConfigManager::load(), AppConfig::pretty_print() | None |
| funds.rs | Fetch investment company datasets | fetch_investment_company_series_and_class_dataset() | None |
| lookup_cik.rs | CIK lookup and submission retrieval | fetch_cik_by_ticker_symbol(), fetch_cik_submissions() | Ticker symbol (e.g., AAPL) |
| nport_filing.rs | Parse NPORT filing holdings | fetch_nport_filing(), XML parsing | Accession number |
| us_gaap_human_readable.rs | Fetch and display US GAAP fundamentals | fetch_us_gaap_fundamentals(), distill_us_gaap_fundamental_concepts() | Ticker symbol |

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75

Example Program Architecture

Diagram: Example Programs and System Integration

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75

config.rs - Configuration Validation

Purpose : Load and display the current application configuration to verify that configuration files are properly formatted and credentials are correctly set.

How to Run :
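
The command is not shown in the extracted page; assuming the standard Cargo example runner:

```sh
cargo run --example config
```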

Expected Output : The example displays the merged configuration in JSON format, showing values from the configuration file merged with defaults.

Code Flow :

graph TB
    Start["main()
entry point\nexamples/config.rs:4"]
Load["ConfigManager::load()\nLoad config from file\nexamples/config.rs:13"]
GetConfig["config_manager.get_config()\nRetrieve AppConfig reference\nexamples/config.rs:15"]
Pretty["config.pretty_print()\nSerialize to JSON\nexamples/config.rs:16"]
Print["print! macro\nDisplay formatted config\nexamples/config.rs:16"]
End["Program exit"]
Start --> Load
 
   Load --> GetConfig
 
   GetConfig --> Pretty
 
   Pretty --> Print
 
   Print --> End

Key Functions Used :

| Function | File Location | Purpose |
|---|---|---|
| ConfigManager::load() | examples/config.rs:13 | Loads configuration from default paths |
| get_config() | examples/config.rs:15 | Returns reference to AppConfig struct |
| pretty_print() | examples/config.rs:16 | Serializes configuration to formatted JSON |

What This Demonstrates :

  • Basic configuration loading workflow
  • Configuration file resolution and merging
  • Validation of configuration structure

Sources: examples/config.rs:1-18 src/config/app_config.rs:1-159

funds.rs - Investment Company Dataset

Purpose : Fetch the complete investment company series and class dataset from the SEC EDGAR API, demonstrating network client initialization and basic data fetching.

How to Run :
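
The command is not shown in the extracted page; assuming the standard Cargo example runner:

```sh
cargo run --example funds
```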

Expected Output : A list of investment company records, each containing series information, fund names, CIKs, and classification details. The output format is the Debug representation of the Ticker structs.

Code Flow :

graph TB
    Start["#[tokio::main] async fn main()\nexamples/funds.rs:6-7"]
LoadCfg["ConfigManager::load()\nexamples/funds.rs:8"]
CreateClient["SecClient::from_config_manager()\nInitialize HTTP client\nexamples/funds.rs:9"]
FetchData["fetch_investment_company_series_and_class_dataset()\nNetwork request to SEC API\nexamples/funds.rs:11"]
LoopStart["for fund in funds\nIterate results\nexamples/funds.rs:13"]
PrintFund["print! macro\nDisplay fund Debug output\nexamples/funds.rs:14"]
End["Return Ok(())\nexamples/funds.rs:17"]
Start --> LoadCfg
 
   LoadCfg --> CreateClient
 
   CreateClient --> FetchData
 
   FetchData --> LoopStart
 
   LoopStart --> PrintFund
 
   PrintFund -->|More funds| LoopStart
 
   LoopStart -->|Done| End

Key Functions and Structures :

| Entity | Type | Purpose |
|---|---|---|
| fetch_investment_company_series_and_class_dataset() | Function | Fetches investment company data from SEC |
| SecClient | Struct | HTTP client with throttling and caching |
| Ticker | Struct | Represents company ticker information |
| tokio::main | Macro | Enables async runtime |

What This Demonstrates :

  • Async/await pattern usage with tokio
  • SecClient initialization from configuration
  • Network function invocation
  • Iteration over fetched data structures

Sources: examples/funds.rs:1-19

lookup_cik.rs - CIK Lookup and Submission Retrieval

Purpose : Demonstrate CIK (Central Index Key) lookup by ticker symbol and retrieval of company submissions, with special handling for NPORT-P filings.

How to Run :
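
The command is not shown in the extracted page; assuming the standard Cargo example runner with a ticker symbol argument:

```sh
cargo run --example lookup_cik AAPL
```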

Replace AAPL with any valid ticker symbol.

Expected Output :

  • The CIK number for the ticker
  • URL to the SEC submissions JSON
  • Most recent NPORT-P submission (if the company files NPORT-P forms)
  • Or a list of other submission types (e.g., 10-K)
  • EDGAR archive URLs for each submission

Execution Flow with Code Entities :

Command-Line Argument Handling :

The example requires exactly one argument (the ticker symbol). The argument parsing logic is implemented at examples/lookup_cik.rs:10-14:

let args: Vec<String> = std::env::args().collect();
if args.len() != 2 { eprintln!("Usage: lookup_cik <TICKER_SYMBOL>"); std::process::exit(1); }
let ticker_symbol = &args[1];

Key Code Entities :

| Entity | Type | Location | Purpose |
|---|---|---|---|
| fetch_cik_by_ticker_symbol() | Function | examples/lookup_cik.rs:21-23 | Retrieves CIK for ticker |
| fetch_cik_submissions() | Function | examples/lookup_cik.rs:43 | Fetches all submissions for CIK |
| CikSubmission::most_recent_nport_p_submission() | Method | examples/lookup_cik.rs:45-46 | Filters for latest NPORT-P |
| as_edgar_archive_url() | Method | examples/lookup_cik.rs:54 | Generates SEC EDGAR URL |
| Cik | Struct | examples/lookup_cik.rs:28 | 10-digit CIK identifier |
| CikSubmission | Struct | examples/lookup_cik.rs:43 | Represents a single SEC filing |

What This Demonstrates :

  • Command-line argument parsing using std::env::args()
  • Error handling with Result and Option types
  • Conditional logic based on filing types
  • Method chaining for data transformation
  • URL generation from structured data

Sources: examples/lookup_cik.rs:1-75

Prerequisites and Common Patterns

Configuration Requirement : All examples (except config.rs) require a valid configuration file with at least an email address set. See Configuration System for setup instructions.

Common Code Pattern Across Examples :

All examples follow this pattern:

  1. Load configuration using ConfigManager::load()
  2. Initialize SecClient from the configuration
  3. Call network functions to fetch data
  4. Process and display results

Error Handling : Examples use the ? operator extensively to propagate errors and return Result<(), Box<dyn Error>> from main(). This allows errors to bubble up with descriptive messages.

Async Runtime : Examples that perform network operations use #[tokio::main] to enable the async runtime, allowing await syntax for asynchronous operations.
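
A condensed sketch of this shared pattern (module paths and exact function parameters are assumptions; see the individual example files for the real code):

```rust
use std::error::Error;

// Assumed module paths, shown for illustration only.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::{fetch_investment_company_series_and_class_dataset, SecClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // 1. Load configuration (file + credentials)
    let config_manager = ConfigManager::load()?;

    // 2. Initialize SecClient with throttling and caching middleware
    let client = SecClient::from_config_manager(&config_manager)?;

    // 3. Call a network function to fetch data
    let funds = fetch_investment_company_series_and_class_dataset(&client).await?;

    // 4. Process and display results
    for fund in funds {
        println!("{:?}", fund);
    }
    Ok(())
}
```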

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75

Running Examples in Development Mode

When running examples, the application may detect development mode based on environment variables or the presence of certain files. Development mode affects:

  • Cache behavior : May use more aggressive caching strategies
  • Logging verbosity : May produce more detailed output
  • Rate limiting : May use relaxed throttling policies for testing

Refer to Utility Functions for details on development mode detection.

Next Steps

After running these examples:

Sources: examples/config.rs:1-18 examples/funds.rs:1-19 examples/lookup_cik.rs:1-75



Rust sec-fetcher Application


Purpose and Scope

This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured CSV files. The application serves as the data collection and preprocessing layer in a larger dual-language system that combines Rust's performance for I/O operations with Python's machine learning capabilities.

This page covers the high-level architecture, module organization, and data flow patterns. For detailed information about specific components, see:

The Python machine learning pipeline that consumes this data is documented in Python narrative_stack System.

Sources: src/lib.rs:1-11 Cargo.toml:1-45 src/main.rs:1-241

Application Architecture

The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format, and output structured CSV files for downstream consumption.

graph TB
    subgraph "src/lib.rs Module Organization"
        config["config\nConfigManager, AppConfig"]
enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
network["network\nSecClient, fetch_* functions\nThrottlePolicy, CachePolicy"]
transformers["transformers\ndistill_us_gaap_fundamental_concepts"]
parsers["parsers\nXML/JSON parsing utilities"]
caches["caches\nCaches (internal)\nHTTP cache, preprocessor cache"]
fs["fs\nFile system operations"]
utils["utils\nVecExtensions, helpers"]
end
    
    subgraph "External Dependencies"
        reqwest["reqwest\nHTTP client"]
polars["polars\nDataFrame operations"]
simd["simd-r-drive\nDrive-based cache storage"]
tokio["tokio\nAsync runtime"]
serde["serde\nSerialization"]
end
    
 
   config --> caches
 
   network --> config
 
   network --> caches
 
   network --> models
 
   network --> enums
 
   network --> parsers
 
   transformers --> models
 
   transformers --> enums
 
   parsers --> models
    
 
   network --> reqwest
 
   network --> simd
 
   network --> tokio
 
   transformers --> polars
 
   models --> serde
    
    style config fill:#f9f9f9
    style network fill:#f9f9f9
    style transformers fill:#f9f9f9
    style caches fill:#f9f9f9

Module Structure

The application is organized into nine core modules as declared in src/lib.rs:1-10:

| Module | Purpose | Key Components |
|---|---|---|
| config | Configuration management and credential handling | ConfigManager, AppConfig |
| enums | Type-safe enumerations for domain concepts | FundamentalConcept, Url, CacheNamespacePrefix, TickerOrigin |
| models | Data structures representing SEC entities | Ticker, CikSubmission, NportInvestment, AccessionNumber |
| network | HTTP client and data fetching functions | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals |
| transformers | Data normalization and transformation | distill_us_gaap_fundamental_concepts |
| parsers | XML/JSON parsing utilities | Filing parsers, XML extractors |
| caches | Internal caching infrastructure | Caches (singleton), HTTP cache, preprocessor cache |
| fs | File system operations | Directory creation, path utilities |
| utils | Helper functions and extensions | VecExtensions, invert_multivalue_indexmap |

Sources: src/lib.rs:1-11 Cargo.toml:8-40

Data Flow Architecture

Request-Response Flow with Caching

The data flow follows a pipeline pattern:

  1. Request Initiation : Application code calls functions like fetch_us_gaap_fundamentals or fetch_nport_filing_by_ticker_symbol
  2. Client Middleware : SecClient applies throttling and caching policies before making HTTP requests
  3. Cache Check : CachePolicy checks simd-r-drive storage for cached responses
  4. API Request : If cache miss, request is sent to SEC EDGAR API with proper headers and rate limiting
  5. Parsing : Raw XML/JSON responses are parsed into structured data models
  6. Transformation : Data passes through normalization functions like distill_us_gaap_fundamental_concepts
  7. Output : Transformed data is written to CSV files organized by ticker symbol or category

Sources: src/main.rs:174-240 src/lib.rs:1-11

Key Dependencies and Technology Stack

The application leverages modern Rust crates for performance and reliability:

Critical Dependencies

| Category | Crate | Version | Purpose |
|---|---|---|---|
| Async Runtime | tokio | 1.43.0 | Asynchronous I/O, task scheduling |
| HTTP Client | reqwest | 0.12.15 | HTTP requests with JSON support |
| Data Frames | polars | 0.46.0 | High-performance DataFrame operations |
| Caching | simd-r-drive | 0.3.0 | WebSocket-based key-value storage |
| Parallelism | rayon | 1.10.0 | Data parallelism for CPU-bound work |
| Configuration | config | 0.15.9 | TOML/JSON configuration file parsing |
| Credentials | keyring | 3.6.2 | Secure credential storage (OS keychain) |
| XML Parsing | quick-xml | 0.37.2 | Fast XML parsing for SEC filings |
| Serialization | serde | 1.0.218 | Data structure serialization/deserialization |

Sources: Cargo.toml:8-40

Application Entry Point and Main Loop

The application entry point in src/main.rs:174-240 demonstrates the typical processing pattern:

Main Processing Loop Structure

The main application follows this pattern:

  1. Configuration Loading : ConfigManager::load() reads configuration from TOML files and environment variables
  2. Client Initialization : SecClient::from_config_manager() creates HTTP client with throttling and caching middleware
  3. Ticker Fetching : fetch_company_tickers() retrieves all company ticker symbols from SEC
  4. Processing Loop : Iterates over each ticker symbol:
    • Calls fetch_us_gaap_fundamentals() to retrieve financial data
    • Transforms data through distill_us_gaap_fundamental_concepts
    • Writes results to CSV files organized by ticker: data/us-gaap/{ticker}.csv
    • Logs errors to error_log HashMap for later reporting
  5. Error Reporting : Prints summary of any errors encountered during processing

Example Main Function Flow
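
The full main function is not reproduced here; a simplified sketch of the pattern described above (module paths, function parameters, and the error type are assumptions) is:

```rust
use std::collections::HashMap;

// Assumed module paths, shown for illustration only.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::{fetch_company_tickers, fetch_us_gaap_fundamentals, SecClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1-2. Configuration loading and client initialization
    let config_manager = ConfigManager::load()?;
    let client = SecClient::from_config_manager(&config_manager)?;

    // 3. Fetch the master ticker list
    let tickers = fetch_company_tickers(&client).await?;

    // 4. Process each ticker, accumulating errors instead of aborting the batch
    let mut error_log: HashMap<String, String> = HashMap::new();
    for ticker in &tickers {
        match fetch_us_gaap_fundamentals(&client, &tickers, &ticker.symbol).await {
            Ok(_df) => {
                // Transform via distill_us_gaap_fundamental_concepts and write
                // data/us-gaap/{ticker}.csv (omitted in this sketch)
            }
            Err(err) => {
                error_log.insert(ticker.symbol.clone(), err.to_string());
            }
        }
    }

    // 5. Report any errors encountered during processing
    for (symbol, err) in &error_log {
        eprintln!("{symbol}: {err}");
    }
    Ok(())
}
```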

Sources: src/main.rs:174-240

Module Interaction Patterns

Configuration and Client Initialization

The initialization sequence ensures:

  • Configuration is loaded before any network operations
  • Credentials are retrieved securely from OS keychain via keyring crate
  • Caches singleton is initialized with drive storage connection
  • HTTP client middleware is properly configured with throttling and caching

Sources: src/lib.rs:1-11

Data Transformation Pipeline

The transformation pipeline is the most critical component (importance: 8.37 as noted in the high-level architecture). The distill_us_gaap_fundamental_concepts function handles:

  • Synonym Consolidation : Maps 6+ variations of "Cost of Revenue" to single CostOfRevenue concept
  • Industry-Specific Revenue : Handles 57+ revenue field variations (SalesRevenueNet, RevenueFromContractWithCustomer, etc.)
  • Hierarchical Mapping : Maps specific concepts to parent categories (e.g., CurrentAssets also maps to Assets)
  • Unit Normalization : Standardizes currency, shares, and percentage units

Sources: Cargo.toml:24, high-level architecture diagram

Error Handling Strategy

The application uses a multi-layered error handling approach:

| Layer | Strategy | Example |
|---|---|---|
| Network | Retry with exponential backoff | ThrottlePolicy with max_retries |
| Cache | Fallback to API on cache errors | CachePolicy transparent fallback |
| Parsing | Log and continue processing | Error log in main loop |
| File I/O | Log error, continue to next item | Individual CSV write failures don't stop processing |

The main loop accumulates errors in a HashMap<String, String> and reports them at the end, ensuring that one failure doesn't halt the entire batch processing job.

Sources: src/main.rs:185-240

Performance Characteristics

The application is optimized for I/O-bound workloads:

  • Async I/O : tokio runtime enables concurrent network requests
  • CPU Parallelism : rayon parallelizes DataFrame operations in polars
  • Caching : simd-r-drive reduces redundant API calls with 1-week TTL
  • Throttling : Respects SEC rate limits with adaptive jitter
  • Streaming : Processes large datasets without loading entirely into memory

The combination of async networking for I/O operations and Rayon parallelism for CPU-bound transformations provides optimal throughput for SEC data fetching and processing.

Sources: Cargo.toml:8-40 high-level architecture diagrams



Network Layer & SecClient


Purpose and Scope

This page documents the network layer of the Rust sec-fetcher application, specifically the SecClient HTTP client and its associated infrastructure. The SecClient provides the foundational HTTP communication layer for all SEC EDGAR API interactions, implementing throttling, caching, and retry logic to ensure reliable and compliant data fetching.

This page covers:

  • The SecClient structure and initialization
  • Throttling and rate limiting policies
  • HTTP caching mechanisms
  • Request/response handling
  • User-Agent management and email validation

For information about the specific network fetching functions that use SecClient (such as fetch_company_tickers, fetch_us_gaap_fundamentals, etc.), see Data Fetching Functions. For details on the caching system architecture, see Caching & Storage System.


SecClient Architecture Overview

Component Diagram

Sources: src/network/sec_client.rs:1-181


SecClient Structure and Initialization

SecClient Fields

The SecClient struct maintains four core components:

| Field | Type | Purpose |
|---|---|---|
| email | String | Contact email for SEC User-Agent header |
| http_client | ClientWithMiddleware | reqwest client with middleware stack |
| cache_policy | Arc<CachePolicy> | Shared cache configuration |
| throttle_policy | Arc<ThrottlePolicy> | Shared throttle configuration |

Sources: src/network/sec_client.rs:14-19

Construction from ConfigManager

The from_config_manager() constructor performs the following initialization sequence:

  1. Extract Configuration : Reads email, max_concurrent, min_delay_ms, and max_retries from AppConfig
  2. Create CachePolicy : Configures cache with 1-week TTL and disables header respect
  3. Create ThrottlePolicy : Configures rate limiting with adaptive jitter
  4. Initialize HTTP Cache : Retrieves the shared HTTP cache store from Caches
  5. Build Middleware Stack : Combines cache and throttle layers
  6. Construct Client : Creates ClientWithMiddleware with full middleware stack

Sources: src/network/sec_client.rs:23-89

Configuration Parameters

The following table details the configuration parameters required by SecClient:

| Parameter | AppConfig Field | Required | Default | Purpose |
|---|---|---|---|---|
| Email | email | Yes | None | SEC User-Agent compliance |
| Max Concurrent | max_concurrent | Yes | None | Concurrent request limit |
| Min Delay (ms) | min_delay_ms | Yes | None | Base throttle delay |
| Max Retries | max_retries | Yes | None | Retry attempt limit |

Sources: src/network/sec_client.rs:28-43


Throttle Policy Configuration

ThrottlePolicy Structure

The ThrottlePolicy from reqwest_drive controls request rate limiting with the following parameters:

| Field | Type | Source | Purpose |
|---|---|---|---|
| base_delay_ms | u64 | AppConfig.min_delay_ms | Minimum delay between requests |
| max_concurrent | u64 | AppConfig.max_concurrent | Maximum concurrent requests |
| max_retries | u64 | AppConfig.max_retries | Maximum retry attempts |
| adaptive_jitter_ms | u64 | Hardcoded: 500 | Randomized delay for retry backoff |

Sources: src/network/sec_client.rs:52-59

Request Throttling Mechanism

The throttle mechanism operates as follows:

  1. Concurrent Limit : Enforces max_concurrent simultaneous requests
  2. Base Delay : Applies base_delay_ms between sequential requests
  3. Retry Logic : Retries failed requests up to max_retries times
  4. Adaptive Jitter : Adds randomized delay of up to adaptive_jitter_ms on retries to prevent thundering herd

The ThrottlePolicy can be overridden on a per-request basis by passing a custom policy to raw_request():
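
The override snippet itself is not included in the extracted page; a hedged sketch based on the fields and parameters documented on this page (the struct construction, Arc wrapping, import paths, and the exact raw_request signature are assumptions) could look like:

```rust
use std::sync::Arc;

// Import paths assumed from the documentation above.
use reqwest_drive::ThrottlePolicy;
use sec_fetcher::network::SecClient;

// Sketch: issue one request with a relaxed, per-request throttle override.
async fn fetch_with_relaxed_throttle(
    client: &SecClient,
) -> Result<reqwest::Response, Box<dyn std::error::Error>> {
    let relaxed = ThrottlePolicy {
        base_delay_ms: 2_000, // slower base delay for this call only
        max_concurrent: 1,
        max_retries: 10,
        adaptive_jitter_ms: 500,
    };

    let response = client
        .raw_request(
            reqwest::Method::GET,
            "https://www.sec.gov/files/company_tickers.json",
            None,                    // no extra headers
            Some(Arc::new(relaxed)), // custom throttle policy for this request
        )
        .await?;
    Ok(response)
}
```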

Sources: src/network/sec_client.rs:140-165


Cache Policy Configuration

CachePolicy Structure

The CachePolicy configuration defines HTTP caching behavior:

| Field | Value | Purpose |
|---|---|---|
| default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week cache lifetime |
| respect_headers | false | Ignore HTTP cache control headers |
| cache_status_override | None | No status code override |

Note : The 1-week TTL is currently hardcoded and marked with a TODO comment for future configurability.

Sources: src/network/sec_client.rs:45-50

Cache Storage Integration

The cache storage uses the following integration:

  1. HTTP Cache Store : Retrieved via Caches::get_http_cache_store(), a static OnceLock singleton
  2. Drive Integration : Uses init_cache_with_drive_and_throttle() from reqwest_drive
  3. Persistent Storage : Backed by simd-r-drive WebSocket server for cross-session persistence

Sources: src/network/sec_client.rs:73-81


User-Agent Management

User-Agent Format

The get_user_agent() method generates a compliant User-Agent string for SEC EDGAR API requests:

Format: <package_name>/<version> (+<email>)
Example: sec-fetcher/0.1.0 (+contact@example.com)

Sources: src/network/sec_client.rs:91-108

Email Validation

The email validation occurs at User-Agent generation time rather than during instantiation. This design ensures that:

  1. Every network request validates the email format
  2. Invalid configurations fail fast on first network call
  3. The email is validated even if the SecClient is constructed through different paths

Sources: src/network/sec_client.rs:91-99


Request Methods

raw_request Method

The raw_request() method provides low-level HTTP request functionality:

Parameters:

  • method: HTTP method (GET, POST, etc.)
  • url: Target URL
  • headers: Optional additional headers as key-value tuples
  • custom_throttle_policy: Optional per-request throttle override

Behavior:

  1. Constructs request with User-Agent header
  2. Applies optional custom headers
  3. Injects custom throttle policy if provided (via request extensions)
  4. Executes request through middleware stack
  5. Returns raw reqwest::Response

Sources: src/network/sec_client.rs:140-165

fetch_json Method

The fetch_json() method is a convenience wrapper for JSON API requests:

Flow:

  1. Calls raw_request() with GET method
  2. Awaits response
  3. Deserializes response body to serde_json::Value
  4. Returns parsed JSON

Sources: src/network/sec_client.rs:167-179

Request Flow Diagram

Sources: src/network/sec_client.rs:140-179


Testing Infrastructure

Test Organization

The SecClient tests use the mockito crate for HTTP mocking:

Sources: tests/sec_client_tests.rs:1-159

Test Coverage

The test suite covers the following scenarios:

| Test | Purpose | Mock Behavior | Assertion |
|---|---|---|---|
| test_user_agent | User-Agent formatting | N/A | Matches expected format |
| test_invalid_email_panic | Email validation | N/A | Panics with expected message |
| test_fetch_json_without_retry_success | Successful request | 200 JSON response | JSON parsed correctly |
| test_fetch_json_with_retry_success | Retry not needed | 200 JSON response | JSON parsed correctly |
| test_fetch_json_with_retry_failure | Retry exhaustion | 500 error (3x) | Returns error |
| test_fetch_json_with_retry_backoff | Retry with recovery | 500 → 200 | JSON parsed correctly |

Key Testing Patterns:

  1. Mock Server : mockito::Server::new_async() creates isolated HTTP endpoints
  2. Configuration : Tests use AppConfig::default() with overrides
  3. Async Execution : All network tests use #[tokio::test]
  4. Expectations : mockito verifies request counts with .expect(n)

Sources: tests/sec_client_tests.rs:7-158


Dependencies and Integration

External Dependencies

| Crate | Purpose | Usage in SecClient |
|---|---|---|
| reqwest | HTTP client | Core HTTP functionality |
| reqwest_drive | Middleware stack | Cache and throttle integration |
| tokio | Async runtime | All async operations |
| serde_json | JSON parsing | Response deserialization |
| email_address | Email validation | User-Agent email checking |

Sources: src/network/sec_client.rs:1-12 Cargo.lock:1-100

Internal Dependencies

The SecClient integrates with:

  1. ConfigManager : Provides configuration for initialization (Configuration System)
  2. Caches : Supplies HTTP cache storage singleton (Caching & Storage System)
  3. Network Module : Exposed via src/network.rs module
  4. Fetch Functions : Used by all data fetching operations (Data Fetching Functions)

Sources: src/network.rs:1-23 src/network/sec_client.rs:1-12


Error Handling

Error Types

SecClient methods return Result<T, Box<dyn Error>>, allowing propagation of:

  1. reqwest::Error : HTTP client errors (connection, timeout, etc.)
  2. serde_json::Error : JSON deserialization errors
  3. Custom Errors : From configuration or validation

Panic Conditions

The only panic condition is invalid email format detected in get_user_agent():

This is intentional as an invalid email violates SEC API requirements.

Sources: src/network/sec_client.rs:95-98


Usage Example

The following example demonstrates typical SecClient usage:
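
The original snippet is not included in the extracted page; a minimal sketch assuming the constructor and fetch_json method described above (module paths and the exact fetch_json parameters are assumptions):

```rust
// Assumed module paths, shown for illustration only.
use sec_fetcher::config::ConfigManager;
use sec_fetcher::network::SecClient;

async fn show_company_tickers_json() -> Result<(), Box<dyn std::error::Error>> {
    // Load configuration (email, throttling, cache settings) and build the client
    let config_manager = ConfigManager::load()?;
    let client = SecClient::from_config_manager(&config_manager)?;

    // fetch_json issues a throttled, cached GET and deserializes the body to JSON
    let json: serde_json::Value = client
        .fetch_json("https://www.sec.gov/files/company_tickers.json")
        .await?;
    println!("{json}");
    Ok(())
}
```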

Sources: tests/sec_client_tests.rs:36-62



Data Fetching Functions


Purpose and Scope

This document describes the data fetching functions in the Rust sec-fetcher application. These functions provide the interface for retrieving financial data from the SEC EDGAR API, including company tickers, CIK submissions, NPORT filings, US GAAP fundamentals, and investment company datasets.

For information about the underlying HTTP client, throttling, and caching infrastructure, see Network Layer & SecClient. For details on how US GAAP data is transformed after fetching, see US GAAP Concept Transformation. For information about the data structures returned by these functions, see Data Models & Enumerations.

Overview of Data Fetching Architecture

The network module provides six primary fetching functions that retrieve different types of financial data from the SEC EDGAR API. Each function accepts a SecClient reference and returns structured data types.

Diagram: Data Fetching Function Overview

graph TB
    subgraph "Entry Points"
        FetchTickers["fetch_company_tickers()\nsrc/network/fetch_company_tickers.rs"]
FetchCIK["fetch_cik_by_ticker_symbol()\nNot shown in files"]
FetchSubs["fetch_cik_submissions()\nsrc/network/fetch_cik_submissions.rs"]
FetchNPORT["fetch_nport_filing_by_ticker_symbol()\nfetch_nport_filing_by_cik_and_accession_number()\nsrc/network/fetch_nport_filing.rs"]
FetchGAAP["fetch_us_gaap_fundamentals()\nsrc/network/fetch_us_gaap_fundamentals.rs"]
FetchInvCo["fetch_investment_company_series_and_class_dataset()\nsrc/network/fetch_investment_company_series_and_class_dataset.rs"]
end
    
    subgraph "SEC EDGAR API Endpoints"
        APITickers["company_tickers.json"]
APISubmissions["submissions/CIK{cik}.json"]
APINport["Archives/edgar/data/{cik}/{accession}/primary_doc.xml"]
APIFacts["api/xbrl/companyfacts/CIK{cik}.json"]
APIInvCo["files/investment/data/other/.../{year}.csv"]
end
    
    subgraph "Return Types"
        VecTicker["Vec&lt;Ticker&gt;"]
CikType["Cik"]
VecCikSub["Vec&lt;CikSubmission&gt;"]
VecNport["Vec&lt;NportInvestment&gt;"]
DataFrame["TickerFundamentalsDataFrame"]
VecInvCo["Vec&lt;InvestmentCompany&gt;"]
end
    
 
   FetchTickers -->|HTTP GET| APITickers
 
   APITickers --> VecTicker
    
 
   FetchCIK -->|Searches| VecTicker
 
   FetchCIK --> CikType
    
 
   FetchSubs -->|HTTP GET| APISubmissions
 
   APISubmissions --> VecCikSub
    
 
   FetchNPORT -->|HTTP GET| APINport
 
   APINport --> VecNport
    
 
   FetchGAAP -->|HTTP GET| APIFacts
 
   APIFacts --> DataFrame
    
 
   FetchInvCo -->|HTTP GET| APIInvCo
 
   APIInvCo --> VecInvCo

Sources: src/network/fetch_company_tickers.rs src/network/fetch_cik_submissions.rs src/network/fetch_nport_filing.rs src/network/fetch_us_gaap_fundamentals.rs src/network/fetch_investment_company_series_and_class_dataset.rs

Company Ticker Fetching

Function: fetch_company_tickers

The fetch_company_tickers function retrieves the master list of operating company tickers from the SEC EDGAR API. This dataset provides the mapping between ticker symbols, company names, and CIK numbers.

Signature:

Implementation Details:

| Aspect | Detail |
|---|---|
| API Endpoint | Url::CompanyTickers (company_tickers.json) |
| Response Format | JSON object with numeric keys mapping to ticker data |
| Parsing Logic | Lines 18-31 |
| CIK Conversion | Uses Cik::from_u64() to format CIK numbers (line 22) |
| Origin Tag | All tickers tagged with TickerOrigin::CompanyTickers (line 28) |

Returned Data Structure:

Each Ticker in the result contains:

  • cik: 10-digit formatted CIK number
  • symbol: Ticker symbol (e.g., "AAPL")
  • company_name: Full company name
  • origin: Set to TickerOrigin::CompanyTickers

Usage Example:

The function is commonly used as a prerequisite for other operations that require ticker-to-CIK mapping:
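
A hedged sketch of that prerequisite pattern (module paths, field types, and parameter shapes are assumed from the descriptions above):

```rust
// Assumed module paths, shown for illustration only.
use sec_fetcher::models::{Cik, Ticker};
use sec_fetcher::network::{fetch_company_tickers, SecClient};

async fn lookup_cik_for(
    client: &SecClient,
    symbol: &str,
) -> Result<Option<Cik>, Box<dyn std::error::Error>> {
    // Fetch the master ticker list once, then search it for the requested symbol
    let tickers: Vec<Ticker> = fetch_company_tickers(client).await?;
    Ok(tickers.into_iter().find(|t| t.symbol == symbol).map(|t| t.cik))
}
```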

Sources: src/network/fetch_company_tickers.rs:1-34

CIK Lookup and Submissions

Function: fetch_cik_submissions

The fetch_cik_submissions function retrieves all SEC filings (submissions) for a given company, identified by its CIK number.

Signature:

Implementation Details:

The function performs the following operations:

  1. URL Construction : Creates the submissions endpoint URL using the Url::CikSubmission enum (line 12)
  2. JSON Fetching : Retrieves submission data via sec_client.fetch_json() (line 14)
  3. Data Extraction : Parses the filings.recent object containing parallel arrays (lines 25-51)
  4. Parallel Array Processing : Uses itertools::izip! to iterate over multiple arrays simultaneously (lines 55-60)

JSON Structure:

Diagram: CikSubmission JSON Structure

Parsing Implementation:

The function extracts four parallel arrays and combines them using izip!:

  • accessionNumber: Accession numbers for filings
  • form: Form types (e.g., "10-K", "NPORT-P")
  • primaryDocument: Primary document filenames
  • filingDate: Filing dates in "YYYY-MM-DD" format

Each set of parallel values is combined into a CikSubmission struct (lines 68-75).

Date Parsing:

Filing dates are parsed from strings into NaiveDate objects (lines 61-63):
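
The snippet itself is not shown here; with chrono (listed in the dependencies), parsing the "YYYY-MM-DD" format typically looks like this:

```rust
use chrono::NaiveDate;

// Parse an EDGAR filing date such as "2024-02-15" into a NaiveDate.
fn parse_filing_date(raw: &str) -> Result<NaiveDate, chrono::ParseError> {
    NaiveDate::parse_from_str(raw, "%Y-%m-%d")
}
```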

Sources: src/network/fetch_cik_submissions.rs:1-79 examples/lookup_cik.rs:43-70

NPORT Filing Fetching

NPORT-P filings contain detailed portfolio holdings for registered investment companies (mutual funds and ETFs). The module provides two related functions for fetching this data.

Function: fetch_nport_filing_by_ticker_symbol

Signature:

Workflow:

Diagram: NPORT Filing Fetch by Ticker Symbol

This convenience function orchestrates multiple calls (lines 10-30):

  1. Lookup CIK from ticker symbol (line 14)
  2. Fetch all submissions for that CIK (line 16)
  3. Filter for most recent NPORT-P submission (lines 18-20)
  4. Fetch the filing details (lines 22-27)

Function: fetch_nport_filing_by_cik_and_accession_number

Signature:

Implementation Details:

| Step | Operation | Code Reference |
|---|---|---|
| 1. Fetch company tickers | Required for ticker mapping | Line 39 |
| 2. Construct URL | Url::CikAccessionPrimaryDocument | Line 41 |
| 3. Fetch XML | raw_request() with GET method | Lines 43-46 |
| 4. Parse XML | parse_nport_xml() | Line 48 |

XML Parsing Details:

The parse_nport_xml function extracts investment holdings from the primary document XML:

  • Main Element : <invstOrSec> (investment or security) (lines 26-46)
  • Fields Extracted : name, LEI, title, CUSIP, ISIN, balance, currency, USD value, percentage value, payoff profile, asset category, issuer category, country
  • Ticker Mapping : Fuzzy matches investment names to company tickers (lines 131-142)
  • Sorting : Results sorted by pct_val descending (line 125)

Sources: src/network/fetch_nport_filing.rs:1-49 src/parsers/parse_nport_xml.rs:12-146 examples/nport_filing.rs:1-29

US GAAP Fundamentals Fetching

Function: fetch_us_gaap_fundamentals

The fetch_us_gaap_fundamentals function retrieves standardized financial statement data (US Generally Accepted Accounting Principles) for a company.

Signature:
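A hedged sketch of the signature, based on the parameters listed under Key Operations below and the documented DataFrame return type:

```rust
use polars::prelude::DataFrame;

// Hedged sketch: parameter names, order, and exact types are assumptions.
pub async fn fetch_us_gaap_fundamentals(
    sec_client: &SecClient,
    company_tickers: &[Ticker],
    ticker_symbol: &str,
) -> Result<DataFrame, Box<dyn std::error::Error>>
```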

Type Alias:

The function returns a Polars DataFrame containing financial facts (line 9).

Implementation Flow:

Diagram: US GAAP Fundamentals Fetch Flow

Key Operations:

  1. CIK Lookup : Resolves ticker symbol to CIK using the provided company tickers list (line 18)
  2. URL Construction : Builds the company facts endpoint URL (line 20)
  3. API Call : Fetches JSON data from SEC (line 25)
  4. Parsing : Converts JSON to structured DataFrame (line 27)

API Response Structure:

The SEC company facts endpoint returns nested JSON containing:

  • US-GAAP taxonomy : Standardized accounting concepts
  • Unit types : Currency units (USD), shares, etc.
  • Time series data : Historical values with filing dates and periods

The parsing function (documented in detail in US GAAP Concept Transformation) extracts this into a tabular DataFrame format.

Sources: src/network/fetch_us_gaap_fundamentals.rs:1-28

Investment Company Dataset Fetching

The investment company dataset provides comprehensive information about registered investment companies, including mutual funds and ETFs, their series, and share classes.

graph TB
    Start["Start"]
CheckCache["Check preprocessor_cache\nfor latest_funds_year"]
CacheHit{{"Cache Hit?"}}
UseCache["Use cached year"]
UseCurrent["Use current year"]
TryFetch["fetch_investment_company_series_and_class_dataset_for_year(year)"]
Success{{"Success?"}}
Store["Store year in cache\nTTL: 1 week"]
Return["Return data"]
Decrement["year -= 1"]
CheckLimit{{"year >= 2024?"}}
Error["Return error"]
Start --> CheckCache
 
   CheckCache --> CacheHit
 
   CacheHit -->|Yes| UseCache
 
   CacheHit -->|No| UseCurrent
 
   UseCache --> TryFetch
 
   UseCurrent --> TryFetch
    
 
   TryFetch --> Success
 
   Success -->|Yes| Store
 
   Store --> Return
 
   Success -->|No| Decrement
 
   Decrement --> CheckLimit
 
   CheckLimit -->|Yes| TryFetch
 
   CheckLimit -->|No| Error

Function: fetch_investment_company_series_and_class_dataset

Signature:

Year Fallback Strategy:

This function implements a sophisticated year-based fallback mechanism because the SEC updates the dataset annually and may not have data for the current year immediately:

Diagram: Investment Company Dataset Year Fallback Logic

Implementation Details:

| Component | Description | Code Reference |
| --- | --- | --- |
| Cache Key | NAMESPACE_HASHER_LATEST_FUNDS_YEAR | Lines 11-15 |
| Namespace | CacheNamespacePrefix::LatestFundsYear | Line 13 |
| Cache Query | preprocessor_cache.read_with_ttl::<usize>() | Lines 39-42 |
| Year Range | 2024 to current year | Line 46 |
| Cache TTL | 1 week (604800 seconds) | Line 59 |

Caching Strategy:

The function caches the most recent successful year to avoid repeated year fallback attempts (lines 38-42).
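A hedged sketch of the fallback loop (cache reads and writes are omitted; names and call shapes are illustrative, not the source's):

```rust
// Hedged sketch of the year-fallback strategy described above.
let mut year = cached_latest_funds_year.unwrap_or(current_year);
let dataset = loop {
    match fetch_investment_company_series_and_class_dataset_for_year(&sec_client, year).await {
        // On success, the year would also be written to the preprocessor cache (TTL: 1 week).
        Ok(dataset) => break dataset,
        // Nothing is published before 2024, so stop falling back there.
        Err(err) if year <= 2024 => return Err(err),
        // Otherwise try the previous year's dataset.
        Err(_) => year -= 1,
    }
};
```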

Function: fetch_investment_company_series_and_class_dataset_for_year

Signature:

This lower-level function fetches the dataset for a specific year (lines 80-105):

  1. URL Construction : Uses Url::InvestmentCompanySeriesAndClassDataset(year) (line 84)
  2. Throttle Override : Reduces max retries to 2 for faster fallback (lines 86-92)
  3. Raw Request : Uses raw_request() instead of higher-level fetch methods (lines 94-101)
  4. CSV Parsing : Calls parse_investment_companies_csv() (line 104)

Throttle Policy Override:

Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:1-178

graph TB
    subgraph "Independent Functions"
        FT["fetch_company_tickers()"]
FCS["fetch_cik_submissions(cik)"]
FIC["fetch_investment_company_series_and_class_dataset()"]
end
    
    subgraph "Dependent Functions"
        FCIK["fetch_cik_by_ticker_symbol()"]
FGAAP["fetch_us_gaap_fundamentals()"]
FNPORT1["fetch_nport_filing_by_ticker_symbol()"]
FNPORT2["fetch_nport_filing_by_cik_and_accession_number()"]
end
    
    subgraph "Data Models"
        VT["Vec&lt;Ticker&gt;"]
C["Cik"]
VCS["Vec&lt;CikSubmission&gt;"]
end
    
 
   FT --> VT
 
   VT --> FCIK
 
   VT --> FGAAP
 
   VT --> FNPORT2
    
 
   FCIK --> C
 
   C --> FCS
 
   C --> FGAAP
    
 
   FCS --> VCS
 
   VCS --> FNPORT1
    
 
   FCIK --> FNPORT1
 
   FNPORT1 --> FNPORT2

Function Dependencies and Integration

The data fetching functions form a dependency graph where some functions rely on others to complete their tasks.

Diagram: Function Dependency Graph

Common Integration Patterns:

  1. Ticker to CIK Resolution :

    • fetch_company_tickers() → fetch_cik_by_ticker_symbol() → Cik
    • Used by: NPORT filing fetch, US GAAP fundamentals fetch
  2. Submission Filtering :

    • fetch_cik_submissions() → CikSubmission::most_recent_nport_p_submission()
    • Used by: NPORT filing by ticker symbol
  3. Ticker Mapping in Results :

    • fetch_company_tickers() → parsing functions (e.g., parse_nport_xml)
    • Enriches raw data with ticker information

Example Integration (from examples):
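The example code itself is not reproduced here; a hedged sketch in the spirit of examples/lookup_cik.rs and examples/nport_filing.rs, combining the three patterns above (call shapes are assumptions):

```rust
// Hedged sketch, not the examples' verbatim code.
let client = SecClient::from_config_manager(&config_manager)?;

// Pattern 1: ticker-to-CIK resolution.
let cik = fetch_cik_by_ticker_symbol(&client, "AAPL").await?;

// Pattern 2: submission filtering.
let submissions = fetch_cik_submissions(&client, cik).await?;
let latest_nport = CikSubmission::most_recent_nport_p_submission(&submissions);

// Pattern 3: ticker mapping happens inside parse_nport_xml, which is given the
// company ticker list obtained from fetch_company_tickers().
```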

Sources: src/network/fetch_nport_filing.rs:3-5 examples/nport_filing.rs:16-22 examples/lookup_cik.rs:18-24

Error Handling

All data fetching functions return Result<T, Box<dyn Error>>, providing consistent error handling across the module. Common error scenarios include:

| Error Type | Cause | Example Functions |
| --- | --- | --- |
| Network Errors | HTTP request failures, timeouts | All functions |
| Parsing Errors | Invalid JSON/XML structure | fetch_cik_submissions, fetch_nport_filing_by_cik_and_accession_number |
| Not Found | Ticker symbol not found, no NPORT-P filing exists | fetch_nport_filing_by_ticker_symbol |
| Year Fallback Exhaustion | No data available for any year | fetch_investment_company_series_and_class_dataset |

Error Propagation:

Functions use the ? operator to propagate errors up the call stack, allowing callers to handle errors at the appropriate level. For example, in fetch_nport_filing_by_ticker_symbol:
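A hedged illustration of that pattern (not the actual function body):

```rust
// Each fallible step uses `?`, so the first failure short-circuits the function.
let cik = fetch_cik_by_ticker_symbol(sec_client, ticker_symbol).await?;
let submissions = fetch_cik_submissions(sec_client, cik).await?;
```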

If either dependency function fails, the error is immediately returned to the caller.

Sources: src/network/fetch_cik_submissions.rs:8-11 src/network/fetch_us_gaap_fundamentals.rs:12-16 src/network/fetch_nport_filing.rs:10-13


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

US GAAP Concept Transformation

Relevant source files

Purpose and Scope

This page documents the US GAAP concept transformation system, which normalizes raw financial concept names from SEC EDGAR filings into a standardized taxonomy. The core functionality is provided by the distill_us_gaap_fundamental_concepts function, which maps the diverse US GAAP terminology (57+ revenue variations, 6 cost variants, multiple equity representations) into a consistent set of 64 FundamentalConcept enum variants.

For information about fetching US GAAP data from the SEC API, see Data Fetching Functions. For details on the data models that use these concepts, see Data Models & Enumerations. For the Python ML pipeline that processes the transformed concepts, see Python narrative_stack System.


System Overview

The transformation system acts as a critical normalization layer between raw SEC EDGAR filings and downstream data processing. Companies report financial data using various US GAAP concept names (e.g., Revenues, SalesRevenueNet, HealthCareOrganizationRevenue), and this system ensures all variations map to consistent concept identifiers.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9

graph TB
    subgraph "Input Layer"
        RawFiling["SEC EDGAR Filing\nRaw JSON Data"]
RawConcepts["Raw US GAAP Concepts\n- Revenues\n- SalesRevenueNet\n- HealthCareOrganizationRevenue\n- InterestAndDividendIncomeOperating\n- 57+ more revenue types"]
end
    
    subgraph "Transformation Layer"
        DistillFn["distill_us_gaap_fundamental_concepts"]
MappingEngine["Mapping Engine\n4 Pattern Types"]
OneToOne["One-to-One\nAssets → Assets"]
Hierarchical["Hierarchical\nCurrentAssets → [CurrentAssets, Assets]"]
Synonyms["Synonyms\n6 cost types → CostOfRevenue"]
Industry["Industry-Specific\n57+ revenue types → Revenues"]
MappingEngine --> OneToOne
 
       MappingEngine --> Hierarchical
 
       MappingEngine --> Synonyms
 
       MappingEngine --> Industry
    end
    
    subgraph "Output Layer"
        FundamentalConcept["FundamentalConcept Enum\n64 standardized variants"]
BS["Balance Sheet Concepts\n- Assets, CurrentAssets\n- Liabilities, CurrentLiabilities\n- Equity, EquityAttributableToParent"]
IS["Income Statement Concepts\n- Revenues, CostOfRevenue\n- GrossProfit, OperatingIncomeLoss\n- NetIncomeLoss"]
CF["Cash Flow Concepts\n- NetCashFlowFromOperatingActivities\n- NetCashFlowFromInvestingActivities\n- NetCashFlowFromFinancingActivities"]
FundamentalConcept --> BS
 
       FundamentalConcept --> IS
 
       FundamentalConcept --> CF
    end
    
 
   RawFiling --> RawConcepts
 
   RawConcepts --> DistillFn
 
   DistillFn --> MappingEngine
 
   MappingEngine --> FundamentalConcept
    
    style DistillFn fill:#f9f9f9
    style FundamentalConcept fill:#f9f9f9

The FundamentalConcept Taxonomy

The FundamentalConcept enum defines 64 standardized financial concept variants organized into four main categories: Balance Sheet, Income Statement, Cash Flow, and Equity classifications. Each variant represents a normalized concept that may map from multiple raw US GAAP names.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275

graph TB
    subgraph "FundamentalConcept Enum"
        Root["FundamentalConcept\n(64 total variants)"]
end
    
    subgraph "Balance Sheet Concepts"
        Assets["Assets"]
CurrentAssets["CurrentAssets"]
NoncurrentAssets["NoncurrentAssets"]
Liabilities["Liabilities"]
CurrentLiabilities["CurrentLiabilities"]
NoncurrentLiabilities["NoncurrentLiabilities"]
LiabilitiesAndEquity["LiabilitiesAndEquity"]
CommitmentsAndContingencies["CommitmentsAndContingencies"]
end
    
    subgraph "Income Statement Concepts"
        Revenues["Revenues"]
RevenuesNetInterestExpense["RevenuesNetInterestExpense"]
RevenuesExcludingInterestAndDividends["RevenuesExcludingInterestAndDividends"]
InterestAndDividendIncomeOperating["InterestAndDividendIncomeOperating"]
CostOfRevenue["CostOfRevenue"]
GrossProfit["GrossProfit"]
OperatingExpenses["OperatingExpenses"]
OperatingIncomeLoss["OperatingIncomeLoss"]
NonoperatingIncomeLoss["NonoperatingIncomeLoss"]
IncomeTaxExpenseBenefit["IncomeTaxExpenseBenefit"]
NetIncomeLoss["NetIncomeLoss"]
NetIncomeLossAttributableToParent["NetIncomeLossAttributableToParent"]
NetIncomeLossAttributableToNoncontrollingInterest["NetIncomeLossAttributableToNoncontrollingInterest"]
end
    
    subgraph "Cash Flow Concepts"
        NetCashFlow["NetCashFlow"]
NetCashFlowContinuing["NetCashFlowContinuing"]
NetCashFlowDiscontinued["NetCashFlowDiscontinued"]
NetCashFlowFromOperatingActivities["NetCashFlowFromOperatingActivities"]
NetCashFlowFromInvestingActivities["NetCashFlowFromInvestingActivities"]
NetCashFlowFromFinancingActivities["NetCashFlowFromFinancingActivities"]
ExchangeGainsLosses["ExchangeGainsLosses"]
end
    
    subgraph "Equity Concepts"
        Equity["Equity"]
EquityAttributableToParent["EquityAttributableToParent"]
EquityAttributableToNoncontrollingInterest["EquityAttributableToNoncontrollingInterest"]
TemporaryEquity["TemporaryEquity"]
RedeemableNoncontrollingInterest["RedeemableNoncontrollingInterest"]
end
    
 
   Root --> Assets
 
   Root --> CurrentAssets
 
   Root --> NoncurrentAssets
 
   Root --> Liabilities
 
   Root --> CurrentLiabilities
 
   Root --> NoncurrentLiabilities
 
   Root --> LiabilitiesAndEquity
 
   Root --> CommitmentsAndContingencies
    
 
   Root --> Revenues
 
   Root --> RevenuesNetInterestExpense
 
   Root --> RevenuesExcludingInterestAndDividends
 
   Root --> InterestAndDividendIncomeOperating
 
   Root --> CostOfRevenue
 
   Root --> GrossProfit
 
   Root --> OperatingExpenses
 
   Root --> OperatingIncomeLoss
 
   Root --> NonoperatingIncomeLoss
 
   Root --> IncomeTaxExpenseBenefit
 
   Root --> NetIncomeLoss
 
   Root --> NetIncomeLossAttributableToParent
 
   Root --> NetIncomeLossAttributableToNoncontrollingInterest
    
 
   Root --> NetCashFlow
 
   Root --> NetCashFlowContinuing
 
   Root --> NetCashFlowDiscontinued
 
   Root --> NetCashFlowFromOperatingActivities
 
   Root --> NetCashFlowFromInvestingActivities
 
   Root --> NetCashFlowFromFinancingActivities
 
   Root --> ExchangeGainsLosses
    
 
   Root --> Equity
 
   Root --> EquityAttributableToParent
 
   Root --> EquityAttributableToNoncontrollingInterest
 
   Root --> TemporaryEquity
 
   Root --> RedeemableNoncontrollingInterest

Mapping Pattern Types

The transformation system implements four distinct mapping patterns to handle the diverse ways companies report financial concepts.

Pattern 1: One-to-One Mapping

Simple direct mappings where a single US GAAP concept name maps to exactly one FundamentalConcept variant.

graph LR
 
   A["Assets"] --> FA["FundamentalConcept::Assets"]
B["Liabilities"] --> FB["FundamentalConcept::Liabilities"]
C["GrossProfit"] --> FC["FundamentalConcept::GrossProfit"]
D["CommitmentsAndContingencies"] --> FD["FundamentalConcept::CommitmentsAndContingencies"]
style FA fill:#f9f9f9
    style FB fill:#f9f9f9
    style FC fill:#f9f9f9
    style FD fill:#f9f9f9
| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| Assets | vec![Assets] |
| Liabilities | vec![Liabilities] |
| GrossProfit | vec![GrossProfit] |
| OperatingIncomeLoss | vec![OperatingIncomeLoss] |
| CommitmentsAndContingencies | vec![CommitmentsAndContingencies] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:4-17 tests/distill_us_gaap_fundamental_concepts_tests.rs:31-36 tests/distill_us_gaap_fundamental_concepts_tests.rs:336-341

Pattern 2: Hierarchical Mapping

Specific concepts map to multiple variants, including both the specific concept and parent categories. This enables queries at different levels of granularity.

graph LR
    subgraph "Input"
        CA["AssetsCurrent"]
CL["LiabilitiesCurrent"]
SE["StockholdersEquity"]
end
    
    subgraph "Output: Multiple Concepts"
        CA1["FundamentalConcept::CurrentAssets"]
CA2["FundamentalConcept::Assets"]
SE1["FundamentalConcept::EquityAttributableToParent"]
SE2["FundamentalConcept::Equity"]
end
    
 
   CA --> CA1
 
   CA --> CA2
 
   SE --> SE1
 
   SE --> SE2
    
    style CA1 fill:#f9f9f9
    style CA2 fill:#f9f9f9
    style SE1 fill:#f9f9f9
    style SE2 fill:#f9f9f9
| Raw US GAAP Concept | FundamentalConcept Output (Ordered) |
| --- | --- |
| AssetsCurrent | vec![CurrentAssets, Assets] |
| StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| NetIncomeLoss | vec![NetIncomeLossAttributableToParent, NetIncomeLoss] |
| IncomeLossFromContinuingOperations | vec![IncomeLossFromContinuingOperationsAfterTax, NetIncomeLoss] |
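As a concrete illustration (a hedged paraphrase of the test style, using mappings from the tables above):

```rust
// One-to-one: a single raw concept maps to exactly one variant.
assert_eq!(
    distill_us_gaap_fundamental_concepts("Assets"),
    Some(vec![FundamentalConcept::Assets])
);

// Hierarchical: the specific concept comes first, followed by its parent category.
assert_eq!(
    distill_us_gaap_fundamental_concepts("AssetsCurrent"),
    Some(vec![FundamentalConcept::CurrentAssets, FundamentalConcept::Assets])
);
```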

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:10-17 tests/distill_us_gaap_fundamental_concepts_tests.rs:158-165 tests/distill_us_gaap_fundamental_concepts_tests.rs:262-286 tests/distill_us_gaap_fundamental_concepts_tests.rs:692-698

Pattern 3: Synonym Consolidation

Multiple US GAAP concept names that represent the same financial concept are consolidated into a single FundamentalConcept variant.

graph LR
    subgraph "Input: 6 Cost Variants"
        C1["CostOfRevenue"]
C2["CostOfGoodsAndServicesSold"]
C3["CostOfServices"]
C4["CostOfGoodsSold"]
C5["CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization"]
C6["CostOfGoodsSoldElectric"]
end
    
    subgraph "Output: Single Concept"
        CO["FundamentalConcept::CostOfRevenue"]
end
    
 
   C1 --> CO
 
   C2 --> CO
 
   C3 --> CO
 
   C4 --> CO
 
   C5 --> CO
 
   C6 --> CO
    
    style CO fill:#f9f9f9

Cost of Revenue Synonyms

| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| CostOfRevenue | vec![CostOfRevenue] |
| CostOfGoodsAndServicesSold | vec![CostOfRevenue] |
| CostOfServices | vec![CostOfRevenue] |
| CostOfGoodsSold | vec![CostOfRevenue] |
| CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization | vec![CostOfRevenue] |
| CostOfGoodsSoldElectric | vec![CostOfRevenue] |

Equity Noncontrolling Interest Synonyms

| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| MinorityInterest | vec![EquityAttributableToNoncontrollingInterest] |
| PartnersCapitalAttributableToNoncontrollingInterest | vec![EquityAttributableToNoncontrollingInterest] |
| MinorityInterestInLimitedPartnerships | vec![EquityAttributableToNoncontrollingInterest] |
| MinorityInterestInOperatingPartnerships | vec![EquityAttributableToNoncontrollingInterest] |
| MinorityInterestInJointVentures | vec![EquityAttributableToNoncontrollingInterest] |
| NonredeemableNoncontrollingInterest | vec![EquityAttributableToNoncontrollingInterest] |
| NoncontrollingInterestInVariableInterestEntity | vec![EquityAttributableToNoncontrollingInterest] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:80-112 tests/distill_us_gaap_fundamental_concepts_tests.rs:196-259

Pattern 4: Industry-Specific Revenue Mapping

The most complex pattern handles 57+ industry-specific revenue variations, mapping them all to the Revenues concept. Some revenue types also map to hierarchical categories.

graph TB
    subgraph "Industry-Specific Revenue Types"
        R1["Revenues"]
R2["SalesRevenueNet"]
R3["HealthCareOrganizationRevenue"]
R4["RealEstateRevenueNet"]
R5["OilAndGasRevenue"]
R6["FinancialServicesRevenue"]
R7["AdvertisingRevenue"]
R8["SubscriptionRevenue"]
R9["RoyaltyRevenue"]
R10["ElectricUtilityRevenue"]
R11["PassengerRevenue"]
R12["CargoAndFreightRevenue"]
R13["... 45+ more types"]
end
    
    subgraph "Special Revenue Categories"
        RH1["InterestAndDividendIncomeOperating"]
RH2["RevenuesExcludingInterestAndDividends"]
RH3["RevenuesNetOfInterestExpense"]
RH4["InvestmentBankingRevenue"]
end
    
    subgraph "Output Concepts"
        Rev["FundamentalConcept::Revenues"]
RevInt["FundamentalConcept::InterestAndDividendIncomeOperating"]
RevExcl["FundamentalConcept::RevenuesExcludingInterestAndDividends"]
RevNet["FundamentalConcept::RevenuesNetInterestExpense"]
end
    
 
   R1 --> Rev
 
   R2 --> Rev
 
   R3 --> Rev
 
   R4 --> Rev
 
   R5 --> Rev
 
   R6 --> Rev
 
   R7 --> Rev
 
   R8 --> Rev
 
   R9 --> Rev
 
   R10 --> Rev
 
   R11 --> Rev
 
   R12 --> Rev
 
   R13 --> Rev
    
 
   RH1 --> RevInt
 
   RH1 --> Rev
 
   RH2 --> RevExcl
 
   RH2 --> Rev
 
   RH3 --> RevNet
 
   RH3 --> Rev
 
   RH4 --> RevExcl
 
   RH4 --> Rev
    
    style Rev fill:#f9f9f9
    style RevInt fill:#f9f9f9
    style RevExcl fill:#f9f9f9
    style RevNet fill:#f9f9f9

Sample Revenue Mappings

| Industry | Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- | --- |
| General | Revenues | vec![Revenues] |
| General | SalesRevenueNet | vec![Revenues] |
| Healthcare | HealthCareOrganizationRevenue | vec![Revenues] |
| Real Estate | RealEstateRevenueNet | vec![Revenues] |
| Energy | OilAndGasRevenue | vec![Revenues] |
| Mining | RevenueMineralSales | vec![Revenues] |
| Hospitality | RevenueFromLeasedAndOwnedHotels | vec![Revenues] |
| Franchise | FranchisorRevenue | vec![Revenues] |
| Media | SubscriptionRevenue | vec![Revenues] |
| Media | AdvertisingRevenue | vec![Revenues] |
| Entertainment | AdmissionsRevenue | vec![Revenues] |
| Licensing | LicensesRevenue | vec![Revenues] |
| Licensing | RoyaltyRevenue | vec![Revenues] |
| Transportation | PassengerRevenue | vec![Revenues] |
| Transportation | CargoAndFreightRevenue | vec![Revenues] |
| Utilities | ElectricUtilityRevenue | vec![Revenues] |
| Financial | InterestAndDividendIncomeOperating | vec![InterestAndDividendIncomeOperating, Revenues] |
| Financial | InvestmentBankingRevenue | vec![RevenuesExcludingInterestAndDividends, Revenues] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:954-1188 tests/distill_us_gaap_fundamental_concepts_tests.rs:1191-1225


The distill_us_gaap_fundamental_concepts Function

The core transformation function accepts a string representation of a US GAAP concept name and returns an Option<Vec<FundamentalConcept>>. The return type is an Option because not all US GAAP concepts are mapped (unmapped concepts return None), and a Vec because some concepts map to multiple standardized variants (hierarchical pattern).

graph TB
    subgraph "Function Input"
        InputStr["Input: &str\nUS GAAP concept name"]
end
    
    subgraph "distill_us_gaap_fundamental_concepts"
        Parse["Parse input string"]
MatchEngine["Pattern Matching Engine"]
Decision{"Concept\nRecognized?"}
BuildVec["Construct Vec<FundamentalConcept>\nApply mapping pattern"]
ReturnSome["Return Some(Vec<FundamentalConcept>)"]
ReturnNone["Return None"]
end
    
    subgraph "Function Output"
        OutputSome["Option::Some(Vec<FundamentalConcept>)\n1-2 variants typically"]
OutputNone["Option::None\nUnrecognized concept"]
end
    
 
   InputStr --> Parse
 
   Parse --> MatchEngine
 
   MatchEngine --> Decision
 
   Decision -->|Yes| BuildVec
 
   Decision -->|No| ReturnNone
 
   BuildVec --> ReturnSome
 
   ReturnSome --> OutputSome
 
   ReturnNone --> OutputNone
    
    style MatchEngine fill:#f9f9f9
    style ReturnSome fill:#f9f9f9

Function Signature and Flow
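The signature is not reproduced in this export; based on the described input and the Option<Vec<FundamentalConcept>> return type, it plausibly takes the form:

```rust
// Hedged sketch: inferred from the documented input and output types.
pub fn distill_us_gaap_fundamental_concepts(concept: &str) -> Option<Vec<FundamentalConcept>>
```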

Example Usage
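A usage sketch in the spirit of examples/us_gaap_human_readable.rs (illustrative, not the example's actual code):

```rust
match distill_us_gaap_fundamental_concepts("SalesRevenueNet") {
    Some(concepts) => {
        // Industry-specific revenue names collapse to the standardized Revenues concept.
        for concept in concepts {
            println!("{concept}"); // prints "Revenues"
        }
    }
    None => println!("Concept is not part of the fundamental taxonomy"),
}
```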

Sources: examples/us_gaap_human_readable.rs:1-9 tests/distill_us_gaap_fundamental_concepts_tests.rs:4-8


Complex Hierarchical Examples

Some concepts demonstrate multi-level hierarchical relationships where a specific concept may relate to multiple parent categories.

graph TD
    subgraph "Net Income Loss Hierarchy"
        NIL1["NetIncomeLoss\n(raw input)"]
NIL2["FundamentalConcept::NetIncomeLossAttributableToParent"]
NIL3["FundamentalConcept::NetIncomeLoss"]
NILCS["NetIncomeLossAvailableToCommonStockholdersBasic\n(raw input)"]
NILCS1["FundamentalConcept::NetIncomeLossAvailableToCommonStockholdersBasic"]
NILCS2["FundamentalConcept::NetIncomeLoss"]
ILCO1["IncomeLossFromContinuingOperations\n(raw input)"]
ILCO2["FundamentalConcept::IncomeLossFromContinuingOperationsAfterTax"]
ILCO3["FundamentalConcept::NetIncomeLoss"]
NIL1 --> NIL2
 
       NIL1 --> NIL3
 
       NILCS --> NILCS1
 
       NILCS --> NILCS2
 
       ILCO1 --> ILCO2
 
       ILCO1 --> ILCO3
    end
    
    subgraph "Comprehensive Income Hierarchy"
        CI1["ComprehensiveIncomeNetOfTax\n(raw input)"]
CI2["FundamentalConcept::ComprehensiveIncomeLossAttributableToParent"]
CI3["FundamentalConcept::ComprehensiveIncomeLoss"]
CI1 --> CI2
 
       CI1 --> CI3
    end
    
    subgraph "Revenue Hierarchy"
        RNOI["RevenuesNetOfInterestExpense\n(raw input)"]
RNOI1["FundamentalConcept::RevenuesNetInterestExpense"]
RNOI2["FundamentalConcept::Revenues"]
RNOI --> RNOI1
 
       RNOI --> RNOI2
    end
    
    style NIL2 fill:#f9f9f9
    style NIL3 fill:#f9f9f9
    style NILCS1 fill:#f9f9f9
    style NILCS2 fill:#f9f9f9
    style ILCO2 fill:#f9f9f9
    style ILCO3 fill:#f9f9f9
    style CI2 fill:#f9f9f9
    style CI3 fill:#f9f9f9
    style RNOI1 fill:#f9f9f9
    style RNOI2 fill:#f9f9f9

Income Statement Hierarchies

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:686-725 tests/distill_us_gaap_fundamental_concepts_tests.rs:39-77 tests/distill_us_gaap_fundamental_concepts_tests.rs:976-982

Cash Flow Hierarchies

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:550-683


Equity Concept Mappings

Equity concepts demonstrate all four mapping patterns, including complex organizational structures (stockholders vs. partners vs. members) and noncontrolling interest variations.

Equity Structure Overview

| Organizational Type | Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- | --- |
| Corporation (total) | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | vec![Equity] |
| Corporation (parent) | StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| Partnership (total) | PartnersCapitalIncludingPortionAttributableToNoncontrollingInterest | vec![Equity] |
| Partnership (parent) | PartnersCapital | vec![EquityAttributableToParent, Equity] |
| LLC (parent) | MembersEquity | vec![EquityAttributableToParent, Equity] |
| Generic | CommonStockholdersEquity | vec![Equity] |

Temporary Equity Variations

| Raw US GAAP Concept | FundamentalConcept Output |
| --- | --- |
| TemporaryEquityCarryingAmount | vec![TemporaryEquity] |
| TemporaryEquityRedemptionValue | vec![TemporaryEquity] |
| RedeemablePreferredStockCarryingAmount | vec![TemporaryEquity] |
| TemporaryEquityCarryingAmountAttributableToParent | vec![TemporaryEquity] |
| TemporaryEquityCarryingAmountAttributableToNoncontrollingInterest | vec![TemporaryEquity] |
| TemporaryEquityLiquidationPreference | vec![TemporaryEquity] |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:150-193 tests/distill_us_gaap_fundamental_concepts_tests.rs:262-286 tests/distill_us_gaap_fundamental_concepts_tests.rs:1228-1274


graph TB
    subgraph "Test Structure"
        TestFile["distill_us_gaap_fundamental_concepts_tests.rs\n1275 lines, 70+ tests"]
BalanceSheet["Balance Sheet Tests\n- Assets (one-to-one)\n- Current Assets (hierarchical)\n- Equity (multiple org types)\n- Temporary Equity (7 variations)"]
IncomeStatement["Income Statement Tests\n- Revenues (57+ variations)\n- Cost of Revenue (6 synonyms)\n- Net Income (hierarchical)\n- Comprehensive Income (hierarchical)"]
CashFlow["Cash Flow Tests\n- Operating Activities\n- Investing Activities\n- Financing Activities\n- Continuing vs Discontinued"]
Special["Special Category Tests\n- Commitments and Contingencies\n- Nature of Operations\n- Exchange Gains/Losses\n- Research and Development"]
TestFile --> BalanceSheet
 
       TestFile --> IncomeStatement
 
       TestFile --> CashFlow
 
       TestFile --> Special
    end

Testing Strategy

The transformation system has comprehensive test coverage with 70+ test functions covering all 64 FundamentalConcept variants and their various input mappings.

Test Organization

Test Pattern Examples

Each test function validates one or more related mappings:
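A representative shape, hedged (the real tests live in tests/distill_us_gaap_fundamental_concepts_tests.rs; names and grouping may differ):

```rust
#[test]
fn consolidates_cost_of_revenue_synonyms() {
    // All documented cost variants collapse to the single CostOfRevenue concept.
    for raw_concept in ["CostOfGoodsAndServicesSold", "CostOfServices", "CostOfGoodsSold"] {
        assert_eq!(
            distill_us_gaap_fundamental_concepts(raw_concept),
            Some(vec![FundamentalConcept::CostOfRevenue])
        );
    }
}
```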

Test Coverage Summary

| Concept Category | Test Functions | Total Assertions | Pattern Types Tested |
| --- | --- | --- | --- |
| Balance Sheet | 15 | 35+ | All 4 patterns |
| Income Statement | 25 | 80+ | All 4 patterns |
| Cash Flow | 18 | 25+ | Hierarchical, One-to-One |
| Equity | 12 | 40+ | All 4 patterns |
| Total | 70+ | 180+ | All 4 patterns |

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275


graph TB
    subgraph "Data Source"
        SECAPI["SEC EDGAR API\ndata endpoint"]
RawJSON["Raw JSON Response\nCompany Facts"]
end
    
    subgraph "Rust Processing Pipeline"
        FetchFn["fetch_us_gaap_fundamentals\n(network module)"]
ParseJSON["Parse JSON\nExtract concept names"]
DistillFn["distill_us_gaap_fundamental_concepts\n(transformers module)"]
FilterMapped["Filter to mapped concepts\nDiscard None results"]
BuildRecords["Build data records\nwith FundamentalConcept enum"]
end
    
    subgraph "Storage Layer"
        CSVOutput["CSV Files\nus-gaap/[ticker].csv"]
PolarsDF["Polars DataFrame\nStructured columns"]
end
    
    subgraph "Python ML Pipeline"
        Ingestion["Data Ingestion\nus_gaap_store.ingest_us_gaap_csvs"]
Preprocessing["Preprocessing\nConcept/unit pair extraction"]
end
    
 
   SECAPI --> RawJSON
 
   RawJSON --> FetchFn
 
   FetchFn --> ParseJSON
 
   ParseJSON --> DistillFn
 
   DistillFn --> FilterMapped
 
   FilterMapped --> BuildRecords
 
   BuildRecords --> PolarsDF
 
   PolarsDF --> CSVOutput
 
   CSVOutput --> Ingestion
 
   Ingestion --> Preprocessing
    
    style DistillFn fill:#f9f9f9
    style FilterMapped fill:#f9f9f9

Integration with Data Pipeline

The distill_us_gaap_fundamental_concepts function is a critical component in the broader data processing pipeline, serving as the normalization layer between raw SEC filings and structured data storage.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9


Summary

The US GAAP concept transformation system provides:

  1. Standardization : Maps 200+ raw US GAAP concept names to 64 standardized FundamentalConcept variants
  2. Flexibility : Supports four mapping patterns (one-to-one, hierarchical, synonyms, industry-specific) to handle diverse reporting practices
  3. Queryability : Hierarchical mappings enable queries at multiple granularity levels (e.g., query for all Assets or specifically CurrentAssets)
  4. Reliability : Comprehensive test coverage with 70+ test functions and 180+ assertions validates all mapping patterns
  5. Integration : Serves as critical normalization layer between SEC EDGAR API and downstream data processing/ML pipelines

The transformation system represents the highest-importance component (8.37) in the Rust codebase, enabling consistent financial data analysis across companies with varying reporting conventions.

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 examples/us_gaap_human_readable.rs:1-9


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Data Models & Enumerations

Relevant source files

Purpose and Scope

This page documents the core data structures and enumerations used throughout the rust-sec-fetcher application. These models represent SEC financial data, including company identifiers, filing metadata, investment holdings, and financial concepts. The data models are defined in src/models.rs:1-18 and src/enums.rs:1-12

For information about how these models are used in data fetching operations, see Data Fetching Functions. For details on how FundamentalConcept is used in the transformation pipeline, see US GAAP Concept Transformation.

Sources: src/models.rs:1-18 src/enums.rs:1-12


SEC Identifier Models

The system uses three primary identifier types to reference companies and filings within the SEC EDGAR system.

Ticker

The Ticker struct represents a company's stock ticker symbol along with its SEC identifiers. It is returned by the fetch_company_tickers function and serves as the primary entry point for company lookups.

Structure:

  • cik: Cik - The company's Central Index Key
  • ticker_symbol: String - Stock ticker symbol (e.g., "AAPL")
  • company_name: String - Full company name
  • origin: TickerOrigin - Source of the ticker data
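Assembled from the fields above, the struct plausibly looks like this (a hedged sketch; derives and field order are assumptions):

```rust
// Hedged sketch of the Ticker model.
pub struct Ticker {
    pub cik: Cik,
    pub ticker_symbol: String,
    pub company_name: String,
    pub origin: TickerOrigin,
}
```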

Sources: src/models.rs:10-11

Cik (Central Index Key)

The Cik struct represents a 10-digit SEC identifier that uniquely identifies a company or entity. CIKs are zero-padded to exactly 10 digits.

Structure:

  • value: u64 - The numeric CIK value (0 to 9,999,999,999)

Validation:

  • CIK values must not exceed 10 digits
  • Values are zero-padded when formatted (e.g., 123 → "0000000123")
  • Parsing from strings handles both padded and unpadded formats

Error Handling:

  • CikError::InvalidCik - Raised when CIK exceeds 10 digits
  • CikError::ParseError - Raised when parsing fails

Sources: src/models.rs:4-5

AccessionNumber

The AccessionNumber struct represents a unique identifier for SEC filings. Each accession number is exactly 18 digits and encodes the filer's CIK, filing year, and sequence number.

Format: XXXXXXXXXX-YY-NNNNNN

Components:

  • CIK (10 digits) - The SEC identifier of the filer
  • Year (2 digits) - Last two digits of the filing year
  • Sequence (6 digits) - Unique sequence number within the year

Example: "0001234567-23-000045" represents:

  • CIK: 0001234567
  • Year: 2023
  • Sequence: 000045

Key Methods:

  • from_str(accession_str: &str) - Parses from string (with or without dashes)
  • from_parts(cik_u64: u64, year: u16, sequence: u32) - Constructs from components
  • to_string() - Returns dash-separated format
  • to_unformatted_string() - Returns plain 18-digit string
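A hedged round-trip example using the documented format and methods (assumes from_str is exposed via the standard FromStr trait):

```rust
use std::str::FromStr;

// Hedged sketch: parse, then re-emit in both documented formats.
let accession = AccessionNumber::from_str("0001234567-23-000045")?;
assert_eq!(accession.to_string(), "0001234567-23-000045");
assert_eq!(accession.to_unformatted_string(), "000123456723000045");
```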

Sources: src/models/accession_number.rs:1-188 src/models.rs:1-3

SEC Identifier Relationships

Sources: src/models.rs:1-18 src/models/accession_number.rs:35-187


Filing Data Structures

CikSubmission

The CikSubmission struct represents metadata about an SEC filing submission. It is returned by the fetch_cik_submissions function.

Key Fields:

  • cik: Cik - The filer's Central Index Key
  • accession_number: AccessionNumber - Unique filing identifier
  • form: String - Filing form type (e.g., "10-K", "10-Q", "NPORT-P")
  • primary_document: String - Main document filename
  • filing_date: String - Date the filing was submitted

Sources: src/models.rs:7-8

NportInvestment

The NportInvestment struct represents a single investment holding from an NPORT-P filing (monthly portfolio holdings report for registered investment companies).

Mapped Fields (linked to company data):

  • mapped_ticker_symbol: Option<String> - Ticker symbol if matched
  • mapped_company_name: Option<String> - Company name if matched
  • mapped_company_cik_number: Option<String> - CIK if matched

Investment Identifiers:

  • name: String - Investment name
  • lei: String - Legal Entity Identifier
  • cusip: String - Committee on Uniform Securities Identification Procedures ID
  • isin: String - International Securities Identification Number
  • title: String - Investment title

Financial Values:

  • balance: Decimal - Number of shares or units held
  • val_usd: Decimal - Value in USD
  • pct_val: Decimal - Percentage of total portfolio value
  • cur_cd: String - Currency code

Classifications:

  • asset_cat: String - Asset category
  • issuer_cat: String - Issuer category
  • payoff_profile: String - Payoff profile type
  • inv_country: String - Investment country

Utility Methods:

  • sort_by_pct_val_desc(investments: &mut Vec<NportInvestment>) - Sorts holdings by percentage value in descending order

Sources: src/models/nport_investment.rs:1-46 src/models.rs:16-17

InvestmentCompany

The InvestmentCompany struct represents an investment company (mutual fund, ETF, etc.) registered with the SEC. It is returned by the fetch_investment_companies function.

Sources: src/models.rs:13-14

Filing Data Structure Relationships

Sources: src/models.rs:1-18 src/models/nport_investment.rs:8-46


FundamentalConcept Enumeration

The FundamentalConcept enum is the most critical enumeration in the system (importance: 8.37), defining 64 standardized financial concepts derived from US GAAP (Generally Accepted Accounting Principles) taxonomies. These concepts are used by the distill_us_gaap_fundamental_concepts transformer to normalize diverse financial reporting variations into a consistent taxonomy.

Definition: src/enums/fundamental_concept_enum.rs:1-72

Traits Implemented:

  • Eq, PartialEq - Equality comparison
  • Hash - Hashable for use in maps
  • Clone - Cloneable
  • EnumString - Parse from string
  • EnumIter - Iterate over all variants
  • Display - Format as string
  • Debug - Debug formatting
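Because of the EnumString and Display implementations, string round-trips look roughly like this (hedged sketch assuming strum's default variant naming):

```rust
use std::str::FromStr;

// Hedged sketch: parse a concept name and format it back.
let concept = FundamentalConcept::from_str("GrossProfit")?;
assert_eq!(concept, FundamentalConcept::GrossProfit);
assert_eq!(concept.to_string(), "GrossProfit");
```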

Sources: src/enums/fundamental_concept_enum.rs:1-5 src/enums.rs:4-5

Concept Categories

The 64 FundamentalConcept variants are organized into four primary categories corresponding to major financial statement types:

| Category | Concept Count | Description |
| --- | --- | --- |
| Balance Sheet | 13 | Assets, liabilities, equity positions |
| Income Statement | 23 | Revenues, expenses, income/loss measures |
| Cash Flow Statement | 13 | Operating, investing, financing cash flows |
| Equity & Comprehensive Income | 6 | Equity attributions and comprehensive income |
| Other | 9 | Special items, metadata, and miscellaneous |

Sources: src/enums/fundamental_concept_enum.rs:4-72

Complete FundamentalConcept Taxonomy

Balance Sheet Concepts

| Variant | Description |
| --- | --- |
| Assets | Total assets |
| CurrentAssets | Assets expected to be converted to cash within one year |
| NoncurrentAssets | Long-term assets |
| Liabilities | Total liabilities |
| CurrentLiabilities | Obligations due within one year |
| NoncurrentLiabilities | Long-term obligations |
| LiabilitiesAndEquity | Total liabilities plus equity (must equal total assets) |
| Equity | Total shareholder equity |
| EquityAttributableToParent | Equity attributable to parent company shareholders |
| EquityAttributableToNoncontrollingInterest | Equity attributable to minority shareholders |
| TemporaryEquity | Mezzanine equity (e.g., redeemable preferred stock) |
| RedeemableNoncontrollingInterest | Redeemable minority interest |
| CommitmentsAndContingencies | Off-balance-sheet obligations |

Income Statement Concepts

| Variant | Description |
| --- | --- |
| IncomeStatementStartPeriodYearToDate | Statement start period marker |
| Revenues | Total revenues (consolidated from 57+ variations) |
| RevenuesExcludingInterestAndDividends | Non-interest revenues |
| RevenuesNetInterestExpense | Revenues after interest expense |
| CostOfRevenue | Direct costs of producing goods/services |
| GrossProfit | Revenue minus cost of revenue |
| OperatingExpenses | Operating expenses excluding COGS |
| ResearchAndDevelopment | R&D expenses |
| CostsAndExpenses | Total costs and expenses |
| BenefitsCostsExpenses | Employee benefits and related costs |
| OperatingIncomeLoss | Income from operations |
| NonoperatingIncomeLoss | Income from non-operating activities |
| OtherOperatingIncomeExpenses | Other operating items |
| IncomeLossBeforeEquityMethodInvestments | Income before equity investments |
| IncomeLossFromEquityMethodInvestments | Income from equity-method investees |
| IncomeLossFromContinuingOperationsBeforeTax | Pre-tax income from continuing operations |
| IncomeTaxExpenseBenefit | Income tax expense or benefit |
| IncomeLossFromContinuingOperationsAfterTax | After-tax income from continuing operations |
| IncomeLossFromDiscontinuedOperationsNetOfTax | After-tax income from discontinued operations |
| ExtraordinaryItemsOfIncomeExpenseNetOfTax | Extraordinary items (after-tax) |
| NetIncomeLoss | Bottom-line net income or loss |
| NetIncomeLossAttributableToParent | Net income attributable to parent shareholders |
| NetIncomeLossAttributableToNoncontrollingInterest | Net income attributable to minority shareholders |
| NetIncomeLossAvailableToCommonStockholdersBasic | Net income available to common shareholders |
| PreferredStockDividendsAndOtherAdjustments | Preferred dividends and adjustments |

Industry-Specific Income Statement Concepts

| Variant | Description |
| --- | --- |
| InterestAndDividendIncomeOperating | Interest and dividend income (financial institutions) |
| InterestExpenseOperating | Interest expense (financial institutions) |
| InterestIncomeExpenseOperatingNet | Net interest income (banks) |
| InterestAndDebtExpense | Total interest and debt expense |
| InterestIncomeExpenseAfterProvisionForLosses | Interest income after loan loss provisions |
| ProvisionForLoanLeaseAndOtherLosses | Provision for credit losses (banks) |
| NoninterestIncome | Non-interest income (financial institutions) |
| NoninterestExpense | Non-interest expense (financial institutions) |

Cash Flow Statement Concepts

| Variant | Description |
| --- | --- |
| NetCashFlow | Total net cash flow |
| NetCashFlowContinuing | Net cash flow from continuing operations |
| NetCashFlowDiscontinued | Net cash flow from discontinued operations |
| NetCashFlowFromOperatingActivities | Cash from operating activities |
| NetCashFlowFromOperatingActivitiesContinuing | Operating cash flow (continuing) |
| NetCashFlowFromOperatingActivitiesDiscontinued | Operating cash flow (discontinued) |
| NetCashFlowFromInvestingActivities | Cash from investing activities |
| NetCashFlowFromInvestingActivitiesContinuing | Investing cash flow (continuing) |
| NetCashFlowFromInvestingActivitiesDiscontinued | Investing cash flow (discontinued) |
| NetCashFlowFromFinancingActivities | Cash from financing activities |
| NetCashFlowFromFinancingActivitiesContinuing | Financing cash flow (continuing) |
| NetCashFlowFromFinancingActivitiesDiscontinued | Financing cash flow (discontinued) |
| ExchangeGainsLosses | Foreign exchange gains/losses |

Equity & Comprehensive Income Concepts

| Variant | Description |
| --- | --- |
| ComprehensiveIncomeLoss | Total comprehensive income |
| ComprehensiveIncomeLossAttributableToParent | Comprehensive income attributable to parent |
| ComprehensiveIncomeLossAttributableToNoncontrollingInterest | Comprehensive income attributable to minorities |
| OtherComprehensiveIncomeLoss | Other comprehensive income (OCI) |

Other Concepts

| Variant | Description |
| --- | --- |
| NatureOfOperations | Description of business operations |
| GainLossOnSalePropertiesNetTax | Gains/losses on property sales (after-tax) |

Sources: src/enums/fundamental_concept_enum.rs:4-72

FundamentalConcept Taxonomy Diagram

Sources: src/enums/fundamental_concept_enum.rs:4-72


Other Enumerations

CacheNamespacePrefix

The CacheNamespacePrefix enum defines namespace prefixes used by the caching system to organize cached data by type. Each prefix corresponds to a specific data fetching operation.

Usage: Cache keys are constructed by combining a namespace prefix with a specific identifier (e.g., CacheNamespacePrefix::CompanyTickers + ticker_symbol).

Sources: src/enums.rs:1-2

TickerOrigin

The TickerOrigin enum indicates the source or origin of a ticker symbol.

Variants:

  • Different ticker sources (e.g., SEC company tickers API, NPORT filings, manual mapping)

Usage: Stored in the Ticker struct to track data provenance.

Sources: src/enums.rs:7-8

Url

The Url enum defines the various SEC EDGAR API endpoints used by the application. Each variant represents a specific API URL pattern.

Usage: Used by the network layer to construct API requests without hardcoding URLs throughout the codebase.

Sources: src/enums.rs:10-11

Enumeration Module Structure

Sources: src/enums.rs:1-12


graph TB
    subgraph Identifiers["SEC Identifier Models"]
Cik["Cik\nvalue: u64"]
Ticker["Ticker\ncik: Cik\nticker_symbol: String\ncompany_name: String\norigin: TickerOrigin"]
AccessionNumber["AccessionNumber\ncik: Cik\nyear: u16\nsequence: u32"]
end
    
    subgraph Filings["Filing Data Models"]
CikSubmission["CikSubmission\ncik: Cik\naccession_number: AccessionNumber\nform: String\nprimary_document: String\nfiling_date: String"]
InvestmentCompany["InvestmentCompany\n(identified by Cik)"]
NportInvestment["NportInvestment\nmapped_ticker_symbol: Option&lt;String&gt;\nmapped_company_name: Option&lt;String&gt;\nmapped_company_cik_number: Option&lt;String&gt;\nname, lei, cusip, isin\nbalance, val_usd, pct_val"]
end
    
    subgraph Enums["Enumerations"]
FundamentalConcept["FundamentalConcept\n64 variants\n(Balance Sheet, Income Statement,\nCash Flow, Equity)"]
TickerOrigin["TickerOrigin\n(ticker source/origin)"]
CacheNamespacePrefix["CacheNamespacePrefix\n(cache organization)"]
UrlEnum["Url\n(SEC EDGAR endpoints)"]
end
    
    subgraph NetworkFunctions["Network Functions (3.2)"]
FetchTickers["fetch_company_tickers\n→ Vec&lt;Ticker&gt;"]
FetchCIK["fetch_cik_by_ticker_symbol\n→ Option&lt;Cik&gt;"]
FetchSubmissions["fetch_cik_submissions\n→ Vec&lt;CikSubmission&gt;"]
FetchNPORT["fetch_nport_filing\n→ Vec&lt;NportInvestment&gt;"]
FetchInvCo["fetch_investment_companies\n→ Vec&lt;InvestmentCompany&gt;"]
end
    
    subgraph Transformer["Transformation (3.3)"]
Distill["distill_us_gaap_fundamental_concepts\nuses FundamentalConcept"]
end
    
 
   Ticker -->|contains| Cik
 
   Ticker -->|uses| TickerOrigin
 
   AccessionNumber -->|contains| Cik
 
   CikSubmission -->|contains| Cik
 
   CikSubmission -->|contains| AccessionNumber
 
   InvestmentCompany -.->|identified by| Cik
 
   NportInvestment -.->|mapped to| Ticker
    
 
   FetchTickers -.->|returns| Ticker
 
   FetchCIK -.->|returns| Cik
 
   FetchSubmissions -.->|returns| CikSubmission
 
   FetchNPORT -.->|returns| NportInvestment
 
   FetchInvCo -.->|returns| InvestmentCompany
    
 
   Distill -.->|uses| FundamentalConcept

Data Model Relationships

The following diagram illustrates how the core data models relate to each other and which models contain or reference other models.

Sources: src/models.rs:1-18 src/enums.rs:1-12


Implementation Details

Error Handling

Both Cik and AccessionNumber implement custom error types to handle parsing and validation failures:

CikError:

  • InvalidCik - CIK exceeds 10 digits
  • ParseError - Parsing from string failed

AccessionNumberError:

  • InvalidLength - Accession number is not 18 digits
  • ParseError - Parsing failed
  • CikError - Wrapped CIK error

Both error types implement std::fmt::Display and std::error::Error for proper error propagation.

Sources: src/models/accession_number.rs:42-76

Serialization

The NportInvestment struct uses serde for serialization/deserialization with serde_with for custom formatting:

  • #[serde_as(as = "DisplayFromStr")] - Applied to Decimal fields (balance, val_usd, pct_val) to serialize them as strings
  • Option<T> fields are used for nullable data from NPORT filings

Sources: src/models/nport_investment.rs:3-38

Decimal Precision

Financial values in NportInvestment use the rust_decimal::Decimal type, which provides:

  • Arbitrary precision decimal arithmetic
  • No floating-point rounding errors
  • Safe comparison operations

Sources: src/models/nport_investment.rs:2-33


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Caching & Storage System

Relevant source files

Purpose and Scope

This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive. For information about network request throttling and retry logic, see Network Layer & SecClient. For database storage used by the Python components, see Database & Storage Integration.

Overview

The caching system provides two distinct cache layers:

  1. HTTP Cache : Stores raw HTTP responses from SEC EDGAR API requests
  2. Preprocessor Cache : Stores transformed and processed data structures

Both caches use the simd-r-drive key-value storage backend with persistent file-based storage, initialized once at application startup using the OnceLock pattern for thread-safe static access.

Sources: src/caches.rs:1-66


Caching Architecture

The following diagram illustrates the complete caching architecture and how it integrates with the network layer:

Cache Initialization Flow

graph TB
    subgraph "Application Initialization"
        Main["main.rs"]
ConfigMgr["ConfigManager"]
CachesInit["Caches::init(config_manager)"]
end
    
    subgraph "Static Cache Storage (OnceLock)"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nOnceLock&lt;Arc&lt;DataStore&gt;&gt;"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\nOnceLock&lt;Arc&lt;DataStore&gt;&gt;"]
end
    
    subgraph "File System"
        HTTPFile["cache_base_dir/http_storage_cache.bin"]
PreFile["cache_base_dir/preprocessor_cache.bin"]
end
    
    subgraph "Network Layer"
        SecClient["SecClient"]
CacheMiddleware["reqwest_drive\nCache Middleware"]
ThrottleMiddleware["Throttle Middleware"]
CachePolicy["CachePolicy\nTTL: 1 week\nrespect_headers: false"]
end
    
    subgraph "Cache Access"
        GetHTTP["Caches::get_http_cache_store()"]
GetPre["Caches::get_preprocessor_cache()"]
end
    
 
   Main --> ConfigMgr
 
   ConfigMgr --> CachesInit
 
   CachesInit --> HTTPCache
 
   CachesInit --> PreCache
    
 
   HTTPCache -.->|DataStore::open| HTTPFile
 
   PreCache -.->|DataStore::open| PreFile
    
 
   SecClient --> GetHTTP
 
   GetHTTP --> HTTPCache
 
   HTTPCache --> CacheMiddleware
 
   CacheMiddleware --> CachePolicy
 
   CacheMiddleware --> ThrottleMiddleware
    
 
   GetPre --> PreCache

The caching system is initialized early in the application lifecycle using a configuration-driven approach that ensures thread-safe access across the async runtime.

Sources: src/caches.rs:1-66 src/network/sec_client.rs:1-181


Cache Initialization System

OnceLock Pattern

The system uses Rust's OnceLock for lazy, thread-safe initialization of static cache instances:
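A sketch matching the names shown in the architecture diagram above (assumed to live in src/caches.rs; the import path is an assumption):

```rust
use std::sync::{Arc, OnceLock};
use simd_r_drive::DataStore; // assumed import path

// Hedged sketch of the two static cache slots.
static SIMD_R_DRIVE_HTTP_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
static SIMD_R_DRIVE_PREPROCESSOR_CACHE: OnceLock<Arc<DataStore>> = OnceLock::new();
```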

This pattern ensures:

  • Single initialization : Each cache is initialized exactly once
  • Thread safety : Safe access from multiple async tasks
  • Zero overhead after init : No locking required for read access after initialization

Sources: src/caches.rs:7-8

Initialization Process

Initialization Code Flow

The Caches::init() method performs the following steps:

  1. Retrieves cache_base_dir from ConfigManager
  2. Constructs paths for both cache files
  3. Opens each DataStore instance
  4. Sets the OnceLock with Arc-wrapped stores
  5. Logs warnings if already initialized (prevents re-initialization panics)

Sources: src/caches.rs:13-48


HTTP Cache Integration

graph LR
    subgraph "SecClient Initialization"
        FromConfig["SecClient::from_config_manager()"]
GetHTTPCache["Caches::get_http_cache_store()"]
CreatePolicy["CachePolicy::new()"]
CreateThrottle["ThrottlePolicy::new()"]
end
    
    subgraph "Middleware Chain"
        InitDrive["init_cache_with_drive_and_throttle()"]
DriveCache["drive_cache middleware"]
ThrottleCache["throttle_cache middleware"]
InitClient["init_client_with_cache_and_throttle()"]
end
    
    subgraph "HTTP Client"
        ClientWithMiddleware["ClientWithMiddleware"]
end
    
 
   FromConfig --> GetHTTPCache
 
   FromConfig --> CreatePolicy
 
   FromConfig --> CreateThrottle
    
 
   GetHTTPCache --> InitDrive
 
   CreatePolicy --> InitDrive
 
   CreateThrottle --> InitDrive
    
 
   InitDrive --> DriveCache
 
   InitDrive --> ThrottleCache
    
 
   DriveCache --> InitClient
 
   ThrottleCache --> InitClient
    
 
   InitClient --> ClientWithMiddleware

SecClient Cache Setup

The SecClient integrates with the HTTP cache through the reqwest_drive middleware system:

SecClient Construction

The SecClient struct maintains references to both cache and throttle policies:

| Field | Type | Purpose |
| --- | --- | --- |
| email | String | User-Agent identification |
| http_client | ClientWithMiddleware | HTTP client with middleware chain |
| cache_policy | Arc<CachePolicy> | Cache configuration settings |
| throttle_policy | Arc<ThrottlePolicy> | Request throttling configuration |

Sources: src/network/sec_client.rs:14-19 src/network/sec_client.rs:73-81

CachePolicy Configuration

The CachePolicy struct defines cache behavior:

| Parameter | Value | Description |
| --- | --- | --- |
| default_ttl | Duration::from_secs(60 * 60 * 24 * 7) | 1 week time-to-live |
| respect_headers | false | Ignore HTTP cache headers from SEC API |
| cache_status_override | None | No custom status code handling |

The 1-week TTL is currently hardcoded but marked for future configurability.

Sources: src/network/sec_client.rs:45-50


Cache Access Patterns

HTTP Cache Access Flow

Request Processing

When SecClient makes a request:

  1. The fetch_json() method calls raw_request() with the target URL
  2. The middleware chain intercepts the request
  3. Cache middleware checks the DataStore for a matching entry
  4. If found and not expired (TTL check), returns cached response
  5. If not found, forwards to throttle middleware and HTTP client
  6. Response is stored in cache with timestamp before returning

Sources: src/network/sec_client.rs:140-179 tests/sec_client_tests.rs:35-62

Preprocessor Cache Access

The preprocessor cache is accessed directly through the Caches module API:
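For example (a minimal sketch; assumes Caches::init() has already run):

```rust
// Hedged sketch: obtain the shared preprocessor DataStore handle.
let preprocessor_cache = Caches::get_preprocessor_cache();
```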

This cache is intended for storing transformed data structures after processing, though the current codebase primarily uses it as infrastructure for future preprocessing optimization.

Sources: src/caches.rs:59-64


Storage Backend: simd-r-drive

DataStore Integration

The simd-r-drive crate provides the persistent key-value storage backend:

DataStore Characteristics

| Feature | Description |
| --- | --- |
| Persistence | All data survives application restarts |
| Binary Format | Stores arbitrary byte arrays |
| Thread-Safe | Safe for concurrent access from async tasks |
| File-Backed | Single file per cache instance |
| No Network | Local-only storage (unlike WebSocket mode) |

Sources: src/caches.rs:3 src/caches.rs:22-37


Cache Configuration

ConfigManager Integration

The caching system reads configuration from ConfigManager:

| Config Field | Purpose | Default |
| --- | --- | --- |
| cache_base_dir | Parent directory for cache files | Required (no default) |

The cache base directory path is constructed at runtime and must be set before initialization:

Cache File Paths

  • HTTP Cache: {cache_base_dir}/http_storage_cache.bin
  • Preprocessor Cache: {cache_base_dir}/preprocessor_cache.bin

Sources: src/caches.rs:16 src/caches.rs:20 src/caches.rs:34

ThrottlePolicy Configuration

While not part of the cache storage itself, the ThrottlePolicy is closely integrated with the cache middleware:

| Parameter | Config Source | Purpose |
| --- | --- | --- |
| base_delay_ms | config.min_delay_ms | Minimum delay between requests |
| max_concurrent | config.max_concurrent | Maximum parallel requests |
| max_retries | config.max_retries | Retry attempts for failed requests |
| adaptive_jitter_ms | Hardcoded: 500 | Random jitter for backoff |

Sources: src/network/sec_client.rs:53-59


Error Handling

Initialization Errors

The cache initialization handles several error scenarios:

Panic Conditions:

  • DataStore::open() fails (I/O error, permission denied, etc.)
  • Calling get_http_cache_store() before Caches::init()
  • Calling get_preprocessor_cache() before Caches::init()

Warnings:

  • Attempting to reinitialize already-set OnceLock (logs warning, doesn't fail)

Sources: src/caches.rs:25-29 src/caches.rs:51-55

Runtime Access Errors

Both cache accessor methods use expect() with descriptive messages:
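Roughly like the following (a hedged sketch; the actual message text and return type may differ):

```rust
// Hedged sketch of one accessor body, relying on the statics shown earlier.
pub fn get_http_cache_store() -> Arc<DataStore> {
    SIMD_R_DRIVE_HTTP_CACHE
        .get()
        .expect("HTTP cache accessed before Caches::init()")
        .clone()
}
```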

These panics indicate programmer errors (accessing cache before initialization) rather than runtime failures.

Sources: src/caches.rs:51-56 src/caches.rs:59-64


Testing

Cache Testing Strategy

The test suite verifies caching behavior indirectly through SecClient tests:

Test Coverage:

| Test | Purpose | File |
| --- | --- | --- |
| test_fetch_json_without_retry_success | Verifies basic fetch with caching enabled | tests/sec_client_tests.rs:36-62 |
| test_fetch_json_with_retry_success | Tests cache behavior with successful retry | tests/sec_client_tests.rs:65-91 |
| test_fetch_json_with_retry_failure | Validates cache doesn't store failed responses | tests/sec_client_tests.rs:94-120 |
| test_fetch_json_with_retry_backoff | Tests cache with retry backoff logic | tests/sec_client_tests.rs:123-158 |

Mock Server Testing:

Tests use mockito::Server to simulate SEC API responses without requiring cache setup, as the test environment creates ephemeral cache instances per test run.

Sources: tests/sec_client_tests.rs:1-159


Usage Patterns

Typical Initialization Flow
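A minimal sketch of the typical startup sequence, assuming the entry points named elsewhere on this page (not verbatim main.rs code):

```rust
// Hedged sketch: configuration, cache initialization, then client construction.
let config_manager = ConfigManager::load()?;
Caches::init(&config_manager);
let client = SecClient::from_config_manager(&config_manager)?;
```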

Future Extensions

The preprocessor cache infrastructure is in place but not yet fully utilized. Potential use cases include:

  • Caching parsed/transformed US GAAP concepts
  • Storing intermediate data structures from distill_us_gaap_fundamental_concepts()
  • Caching resolved CIK lookups from ticker symbols
  • Storing compiled ticker-to-CIK mappings

Sources: src/caches.rs:59-64


Integration with Other Systems

The caching system integrates with:

  • ConfigManager (#2.1): Provides cache directory configuration
  • SecClient (#3.1): Primary consumer of HTTP cache
  • Network Functions (#3.2): All fetch operations benefit from caching
  • simd-r-drive : External storage backend (see #6 for dependency details)

The cache layer operates transparently below the network API, requiring no changes to calling code when caching is enabled or disabled.

Sources: src/caches.rs:1-66 src/network/sec_client.rs:73-81


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Main Application Flow

Relevant source files

This document describes the main application entry point and execution flow in src/main.rs, including initialization, data fetching loops, CSV output organization, and error handling strategies. The main application orchestrates the high-level data collection process by coordinating the configuration system, network layer, and storage systems.

For detailed information about the network client and its middleware layers, see SecClient. For data model structures used throughout the application, see Data Models & Enumerations. For configuration and credential management, see Configuration System.

Application Entry Point and Initialization

The application entry point is the main() function in src/main.rs:174-240. The initialization sequence follows a consistent pattern regardless of which processing mode is active:

Sources: src/main.rs:174-180

graph TB
    Start["Program Start\n#[tokio::main]\nasync fn main()"]
LoadConfig["ConfigManager::load()\nLoad TOML configuration\nValidate credentials"]
CreateClient["SecClient::from_config_manager()\nInitialize HTTP client\nApply throttling & caching"]
CheckMode{"Processing\nMode?"}
FetchTickers["fetch_company_tickers(&client)\nRetrieve complete ticker dataset"]
FetchInvCo["fetch_investment_company_\nseries_and_class_dataset(&client)\nGet fund listings"]
USGAAPLoop["US-GAAP Processing Loop\n(Active Implementation)"]
InvCoLoop["Investment Company Loop\n(Commented Prototypes)"]
ErrorReport["Generate Error Summary\nPrint failed tickers"]
End["Program Exit"]
Start --> LoadConfig
 
   LoadConfig --> CreateClient
 
   CreateClient --> CheckMode
 
   CheckMode -->|US-GAAP Mode| FetchTickers
 
   CheckMode -->|Investment Co Mode| FetchInvCo
 
   FetchTickers --> USGAAPLoop
 
   FetchInvCo --> InvCoLoop
 
   USGAAPLoop --> ErrorReport
 
   InvCoLoop --> ErrorReport
 
   ErrorReport --> End

The initialization handles errors at each stage, returning early if critical components fail to load:

Initialization Step | Error Handling | Line Reference
--- | --- | ---
Configuration Loading | Returns Box<dyn Error> on failure | src/main.rs:176
Client Creation | Returns Box<dyn Error> on failure | src/main.rs:178
Ticker Fetching | Returns Box<dyn Error> on failure | src/main.rs:180

Active Implementation: US-GAAP Fundamentals Processing

The currently active implementation in main.rs processes US-GAAP fundamentals for all company tickers. This represents the production data collection pipeline.

Sources: src/main.rs:174-240

graph TB
    Start["Start US-GAAP Processing"]
FetchTickers["fetch_company_tickers(&client)\nReturns Vec&lt;Ticker&gt;"]
PrintTotal["Print total records\nDisplay first 60 tickers"]
InitErrorLog["Initialize error_log:\nHashMap&lt;String, String&gt;"]
LoopStart{"For each\ncompany_ticker\nin tickers"}
PrintProgress["Print processing status:\nticker_symbol (i+1 of total)"]
FetchFundamentals["fetch_us_gaap_fundamentals(\n&client,\n&company_tickers,\n&ticker_symbol)"]
CreateFile["File::create(\ndata/22-june-us-gaap/{ticker}.csv)"]
WriteCSV["CsvWriter::new(&mut file)\n.include_header(true)\n.finish(&mut df)"]
CaptureError["Insert error into error_log:\nticker → error message"]
CheckErrors{"error_log\nempty?"}
PrintErrors["Print error summary:\nList all failed tickers"]
PrintSuccess["Print success message"]
End["End Processing"]
Start --> FetchTickers
 
   FetchTickers --> PrintTotal
 
   PrintTotal --> InitErrorLog
 
   InitErrorLog --> LoopStart
 
   LoopStart -->|Next ticker| PrintProgress
 
   PrintProgress --> FetchFundamentals
 
   FetchFundamentals -->|Success| CreateFile
 
   CreateFile -->|Success| WriteCSV
 
   FetchFundamentals -->|Error| CaptureError
 
   CreateFile -->|Error| CaptureError
 
   WriteCSV -->|Error| CaptureError
 
   WriteCSV -->|Success| LoopStart
 
   CaptureError --> LoopStart
 
   LoopStart -->|Complete| CheckErrors
 
   CheckErrors -->|Has errors| PrintErrors
 
   CheckErrors -->|No errors| PrintSuccess
 
   PrintErrors --> End
 
   PrintSuccess --> End

Processing Loop Details

The main processing loop iterates through all company tickers sequentially, performing the following operations for each ticker:

  1. Progress Logging - src/main.rs:190-195: Prints current position in the dataset
  2. Data Fetching - src/main.rs:203: Calls fetch_us_gaap_fundamentals() with client, full ticker list, and current ticker symbol
  3. CSV Generation - src/main.rs:206-214: Creates file and writes DataFrame with headers
  4. Error Capture - src/main.rs:212-225: Logs any failures to error_log HashMap

Error Handling Strategy

The application uses a non-fatal error handling approach:

Error Type | Handling Strategy | Location
--- | --- | ---
Fetch Error | Logged to error_log, processing continues | src/main.rs:223-225
File Creation Error | Logged to error_log, processing continues | src/main.rs:217-220
CSV Write Error | Logged to error_log, processing continues | src/main.rs:211-214
Configuration Error | Fatal, returns immediately | src/main.rs:176
Client Creation Error | Fatal, returns immediately | src/main.rs:178

Sources: src/main.rs:185-227

CSV Output Organization

The US-GAAP processing mode writes output to a flat directory structure:

data/22-june-us-gaap/
├── AAPL.csv
├── MSFT.csv
├── GOOGL.csv
└── ...

Each file is named using the ticker symbol and contains the complete US-GAAP fundamentals DataFrame for that company.

Sources: src/main.rs:205

Investment Company Processing Flow (Prototype)

Two commented-out prototype implementations exist in main.rs for processing investment company data. These represent alternative processing modes that may be activated in future iterations.

Production Prototype (Lines 18-122)

The first prototype implements production-ready investment company processing with comprehensive error handling:

Sources: src/main.rs:18-122

graph TB
    Start["Start Investment Co Processing"]
InitLogger["env_logger::Builder\n.filter(None, LevelFilter::Info)"]
InitErrorLog["Initialize error_log: Vec&lt;String&gt;"]
LoadConfig["ConfigManager::load()\nHandle fatal errors"]
CreateClient["SecClient::from_config_manager()\nHandle fatal errors"]
FetchInvCo["fetch_investment_company_\nseries_and_class_dataset(&client)"]
LogTotal["info! Total investment companies"]
LoopStart{"For each fund\nin investment_\ncompanies"}
LogProgress["info! Processing: i+1 of total"]
CheckTicker{"fund.class_\nticker exists?"}
LogTicker["info! Ticker symbol"]
FetchNPORT["fetch_nport_filing_by_\nticker_symbol(&client, ticker)"]
GetFirstLetter["ticker.chars().next()\n.to_ascii_uppercase()"]
CreateDir["create_dir_all(\ndata/fund-holdings/{letter})"]
LogRecords["info! Total records"]
WriteCSV["latest_nport_filing\n.write_to_csv(file_path)"]
LogError["error! Log message\nerror_log.push(msg)"]
CheckFinal{"error_log\nempty?"}
PrintSummary["error! Print error summary"]
PrintSuccess["info! All funds successful"]
End["End Processing"]
Start --> InitLogger
 
   InitLogger --> InitErrorLog
 
   InitErrorLog --> LoadConfig
 
   LoadConfig -->|Error| LogError
 
   LoadConfig -->|Success| CreateClient
 
   CreateClient -->|Error| LogError
 
   CreateClient -->|Success| FetchInvCo
 
   FetchInvCo -->|Error| LogError
 
   FetchInvCo -->|Success| LogTotal
 
   LogTotal --> LoopStart
 
   LoopStart -->|Next fund| LogProgress
 
   LogProgress --> CheckTicker
 
   CheckTicker -->|Yes| LogTicker
 
   CheckTicker -->|No| LoopStart
 
   LogTicker --> FetchNPORT
 
   FetchNPORT -->|Error| LogError
 
   FetchNPORT -->|Success| GetFirstLetter
 
   GetFirstLetter --> CreateDir
 
   CreateDir -->|Error| LogError
 
   CreateDir -->|Success| LogRecords
 
   LogRecords --> WriteCSV
 
   WriteCSV -->|Error| LogError
 
   WriteCSV -->|Success| LoopStart
 
   LogError --> LoopStart
 
   LoopStart -->|Complete| CheckFinal
 
   CheckFinal -->|Has errors| PrintSummary
 
   CheckFinal -->|No errors| PrintSuccess
 
   PrintSummary --> End
 
   PrintSuccess --> End

Directory Organization for Fund Holdings

The investment company prototype organizes CSV output by the first letter of the ticker symbol:

data/fund-holdings/
├── A/
│   ├── AADR.csv
│   └── AAXJ.csv
├── B/
│   ├── BKLN.csv
│   └── BND.csv
├── V/
│   ├── VTI.csv
│   └── VXUS.csv
└── ...

This alphabetical categorization is implemented at src/main.rs:83-84 using ticker_symbol.chars().next().unwrap().to_ascii_uppercase().

Sources: src/main.rs:82-107

Debug Prototype (Lines 124-171)

A simpler debug version exists for testing and development, with minimal error handling and console output:

Feature | Debug Prototype | Production Prototype
--- | --- | ---
Error Logging | Fatal errors only | Comprehensive Vec/HashMap logs
Logger | Not initialized | env_logger with INFO level
Progress Tracking | Simple println! | Structured info! macros
Error Recovery | Immediate exit with ? | Continue processing, log errors
Output | Console + CSV | CSV only

Sources: src/main.rs:124-171

Data Flow Through Network Layer

The main application delegates all SEC API interactions to the network layer, which handles caching, throttling, and retries transparently:

Sources: src/main.rs:3-9 src/network/fetch_us_gaap_fundamentals.rs:12-28 src/network/fetch_nport_filing.rs:10-49 src/network/fetch_investment_company_series_and_class_dataset.rs:31-105

graph LR
    Main["main.rs\nmain()"]
FetchTickers["network::fetch_company_tickers\nGET company_tickers.json"]
FetchGAAP["network::fetch_us_gaap_fundamentals\nGET CIK{cik}/company-facts.json"]
FetchInvCo["network::fetch_investment_company_\nseries_and_class_dataset\nGET investment-company-series-class-{year}.csv"]
FetchNPORT["network::fetch_nport_filing_\nby_ticker_symbol\nGET data/{cik}/{accession}/primary_doc.xml"]
SecClient["SecClient\nHTTP middleware\nThrottling, Caching, Retries"]
SECAPI["SEC EDGAR API\nwww.sec.gov"]
Main --> FetchTickers
 
   Main --> FetchGAAP
 
   Main --> FetchInvCo
 
   Main --> FetchNPORT
    
 
   FetchTickers --> SecClient
 
   FetchGAAP --> SecClient
 
   FetchInvCo --> SecClient
 
   FetchNPORT --> SecClient
    
 
   SecClient --> SECAPI

The network functions are designed to be composable, allowing the main application to chain multiple API calls together (e.g., fetching a CIK first, then using it to fetch submissions).

Composite Data Fetching Operations

Several network functions perform composite operations by calling other network functions. For example, fetch_nport_filing_by_ticker_symbol orchestrates multiple API calls:

Sources: src/network/fetch_nport_filing.rs:10-49

graph TB
    Input["Input: ticker_symbol"]
FetchCIK["fetch_cik_by_ticker_symbol\n(&sec_client, ticker_symbol)"]
FetchSubs["fetch_cik_submissions\n(&sec_client, cik)"]
GetRecent["CikSubmission::most_recent_\nnport_p_submission(submissions)"]
FetchFiling["fetch_nport_filing_by_cik_\nand_accession_number(\n&sec_client, cik, accession)"]
FetchTickers["fetch_company_tickers\n(&sec_client)"]
ParseXML["parse_nport_xml(\n&xml_data, &company_tickers)"]
Output["Output: Vec&lt;NportInvestment&gt;"]
Input --> FetchCIK
 
   FetchCIK --> FetchSubs
 
   FetchSubs --> GetRecent
 
   GetRecent --> FetchFiling
 
   FetchFiling --> FetchTickers
 
   FetchTickers --> ParseXML
 
   ParseXML --> Output

This composition pattern allows the main application to use high-level functions while the network layer handles the complexity of multi-step API interactions.

graph TB
    MainFn["main() -> Result&lt;(), Box&lt;dyn Error&gt;&gt;"]
NetworkFn["Network functions\n-> Result&lt;T, Box&lt;dyn Error&gt;&gt;"]
Parser["Parsers\n-> Result&lt;T, Box&lt;dyn Error&gt;&gt;"]
SecClient["SecClient\nImplements retries & backoff"]
MainFatal["Fatal Errors:\nConfig load failure\nClient creation failure\nInitial ticker fetch failure"]
MainNonFatal["Non-Fatal Errors:\nIndividual ticker fetch failures\nCSV write failures\n→ Logged to error_log\n→ Processing continues"]
NetworkRetry["Automatic Retries:\nHTTP timeouts\nRate limit errors\n→ Exponential backoff\n→ Max 3 retries"]
ParserErr["Parser Errors:\nMalformed XML/JSON\nMissing required fields\n→ Propagate to caller"]
MainFn --> MainFatal
 
   MainFn --> MainNonFatal
    
 
   MainFn --> NetworkFn
 
   NetworkFn --> NetworkRetry
 
   NetworkFn --> SecClient
    
 
   NetworkFn --> Parser
 
   Parser --> ParserErr
    
 
   NetworkRetry --> MainNonFatal
 
   ParserErr --> MainNonFatal

Error Propagation and Recovery

The application uses Rust's Result type throughout the call stack, with different error handling strategies at each level:

Sources: src/main.rs:174-240 src/network/fetch_us_gaap_fundamentals.rs:12-28

Fatal errors (configuration, client setup) immediately return from main(), while per-ticker errors are captured in error_log and reported at the end without interrupting the processing loop.

File System Output Structure

The application writes structured CSV files to the local file system. The output directory organization differs between processing modes:

US-GAAP Mode Output

data/
└── 22-june-us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── {ticker}.csv

Sources: src/main.rs:205

Investment Company Mode Output

data/
└── fund-holdings/
    ├── A/
    │   ├── AADR.csv
    │   └── {ticker}.csv
    ├── B/
    │   └── {ticker}.csv
    └── {letter}/
        └── {ticker}.csv

Sources: src/main.rs:82-102

The alphabetical organization in investment company mode enables efficient file system operations when dealing with thousands of fund ticker symbols, avoiding the creation of a single directory with excessive entries.

Performance Characteristics

The main application is designed for batch processing with the following characteristics:

Aspect | Implementation | Location
--- | --- | ---
Concurrency | Sequential processing (no parallelization in main loop) | src/main.rs:187
Rate Limiting | Handled by SecClient throttle policy | See SecClient
Caching | HTTP cache and preprocessor cache in SecClient | See Caching System
Memory Management | Processes one ticker at a time, drops DataFrames after write | src/main.rs:203-221
Error Recovery | Non-fatal errors logged, processing continues | src/main.rs:185-227

The sequential processing model ensures consistent rate limiting and simplifies error tracking, though it could be parallelized in future iterations using Tokio tasks or Rayon parallel iterators.

Sources: src/main.rs:174-240

Summary

The main application flow follows a simple but robust pattern:

  1. Initialize configuration and HTTP client with comprehensive error handling
  2. Fetch the complete dataset of tickers or investment companies
  3. Iterate through each entry sequentially
  4. Process each entry by fetching additional data and writing to CSV
  5. Log any errors without stopping processing
  6. Report a summary of successes and failures

This design prioritizes resilience (continue processing on errors) and observability (detailed logging and error summaries) over performance, making it suitable for overnight batch jobs that process thousands of securities. The commented prototype implementations demonstrate alternative processing modes that can be activated by uncommenting the relevant sections and commenting out the active implementation.



Utility Functions

Relevant source files

Purpose and Scope

This document covers the utility functions and helper modules provided by the utils module in the Rust sec-fetcher application. These utilities provide cross-cutting functionality used throughout the codebase, including data structure transformations, runtime mode detection, and collection extensions.

For information about the main application architecture and data flow, see Main Application Flow. For details on data models and core structures, see Data Models & Enumerations.

Sources: src/utils.rs:1-12


Module Overview

The utils module is organized as a collection of focused sub-modules, each providing a specific utility function or set of related functions. The module uses Rust's re-export pattern to provide a clean public API.

Sub-module | Primary Export | Purpose
--- | --- | ---
invert_multivalue_indexmap | invert_multivalue_indexmap() | Inverts multi-value index mappings
is_development_mode | is_development_mode() | Runtime environment detection
is_interactive_mode | is_interactive_mode(), set_interactive_mode_override() | Interactive mode state management
vec_extensions | VecExtensions trait | Extension methods for Vec<T>

The module structure follows a pattern where each utility is isolated in its own file, promoting maintainability and testability while the parent utils.rs file acts as a facade.

Sources: src/utils.rs:1-12


Utility Module Architecture

Sources: src/utils.rs:1-12


IndexMap Inversion Utility

Function Signature

The invert_multivalue_indexmap function transforms a mapping where keys point to multiple values into a reverse mapping where values point to all their associated keys.

Algorithm and Complexity

The function performs the following operations:

  1. Initialization : Creates a new IndexMap with capacity matching the input map
  2. Iteration : Traverses all key-value pairs in the original mapping
  3. Inversion : For each value in a key's vector, adds that key to the value's entry in the inverted map
  4. Preservation : Maintains insertion order via IndexMap data structure

Time Complexity: O(N) where N is the total number of key-value associations across all vectors

Space Complexity: O(N) for storing the inverted mapping

Use Cases in the Codebase

This utility is primarily used in the US GAAP transformation pipeline where bidirectional concept mappings are required. The function enables:

  • Synonym Resolution : Mapping from multiple US GAAP tags to a single FundamentalConcept enum variant
  • Reverse Lookups : Given a FundamentalConcept, finding all original US GAAP tags that map to it
  • Hierarchical Queries : Supporting both forward (tag → concept) and reverse (concept → tags) navigation

Example Usage Pattern
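
The repository's example code is not reproduced here. As a stand-in, the following Python sketch shows the same inversion logic conceptually; the real function is Rust code operating on IndexMap, so the names and types below are illustrative only.

```python
# Illustrative sketch only: invert a key -> [values] mapping into value -> [keys].
# Python dicts preserve insertion order, loosely mirroring IndexMap's behavior.
def invert_multivalue_mapping(mapping: dict[str, list[str]]) -> dict[str, list[str]]:
    inverted: dict[str, list[str]] = {}
    for key, values in mapping.items():
        for value in values:
            inverted.setdefault(value, []).append(key)
    return inverted

# Hypothetical forward mapping: concept -> original US GAAP tags.
forward = {
    "Revenues": ["Revenues", "SalesRevenueNet"],
    "NetIncomeLoss": ["NetIncomeLoss", "ProfitLoss"],
}
print(invert_multivalue_mapping(forward))
# {'Revenues': ['Revenues'], 'SalesRevenueNet': ['Revenues'], ...}
```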

Sources: src/utils/invert_multivalue_indexmap.rs:1-65


Development Mode Detection

The is_development_mode function provides runtime detection of development versus production environments. This utility enables conditional behavior based on the execution context.

Typical Implementation Pattern

Development mode detection typically checks for:

  • Cargo Profile : Whether compiled with --release flag
  • Environment Variables : Presence of RUST_DEV_MODE, DEBUG, or similar markers
  • Build Configuration : Compile-time flags like #[cfg(debug_assertions)]

Usage in Configuration System

The function integrates with the configuration system to adjust behavior:

Development Mode | Production Mode
--- | ---
Relaxed validation | Strict validation
Verbose logging enabled | Minimal logging
Mock data allowed | Real API calls required
Cache disabled or short TTL | Full caching enabled

This utility is likely referenced in src/config/config_manager.rs to adjust validation rules and in src/main.rs to control application initialization.

Sources: src/utils.rs:4-5


Interactive Mode Management

The interactive mode utilities manage application state related to user interaction, controlling whether the application should prompt for user input or run in automated mode.

Function Signatures

State Management Pattern

These functions likely implement a static or thread-local state management pattern:

  1. Default Behavior : Detect if running in a TTY (terminal) environment
  2. Override Mechanism : Allow explicit setting via set_interactive_mode_override()
  3. Query Interface : Check current state via is_interactive_mode()

Use Cases

Scenario | Interactive Mode | Non-Interactive Mode
--- | --- | ---
Missing config | Prompt user for input | Exit with error
API rate limit | Pause and notify user | Automatic retry with backoff
Data validation failure | Ask user to continue | Fail fast and exit
Progress reporting | Show progress bars | Log to stdout/file

Integration with Main Application Flow

The interactive mode flags are likely used in src/main.rs to control:

  • Whether to display progress indicators during fund processing
  • How to handle missing credentials (prompt vs. error)
  • Whether to confirm before writing large CSV files
  • User confirmation for destructive operations

Sources: src/utils.rs:7-8


Vector Extensions Trait

The VecExtensions trait provides extension methods for Rust's standard Vec<T> type, adding domain-specific functionality needed by the sec-fetcher application.

Trait Pattern

Extension traits in Rust follow this pattern:

Likely Extension Methods

Based on common patterns in data processing applications and the context of US GAAP data transformation, the trait likely provides:

Method Category | Potential Methods | Purpose
--- | --- | ---
Chunking | chunk_by_size(), batch() | Process large datasets in manageable batches
Deduplication | unique(), deduplicate_by() | Remove duplicate entries from fetched data
Filtering | filter_not_empty(), compact() | Remove null or empty elements
Transformation | map_parallel(), flat_map_concurrent() | Parallel data transformation
Validation | all_valid(), partition_valid() | Separate valid from invalid records

Usage in Data Pipeline

The extensions are likely used extensively in:

  • Network Module : Batching API requests, deduplicating ticker symbols
  • Transform Module : Parallel processing of US GAAP concepts
  • Main Application : Chunking investment company lists for concurrent processing

Sources: src/utils.rs:10-11


Utility Function Relationships

Sources: src/utils.rs:1-12 src/utils/invert_multivalue_indexmap.rs:1-65


Design Principles

Modularity

Each utility is isolated in its own sub-module with a single, well-defined responsibility. This design:

  • Reduces Coupling : Utilities can be tested independently
  • Improves Reusability : Functions can be used across different modules without dependencies
  • Simplifies Maintenance : Changes to one utility don't affect others

Generic Programming

The invert_multivalue_indexmap function demonstrates generic programming principles:

  • Type Parameters : Works with any types implementing Eq + Hash + Clone
  • Zero-Cost Abstractions : No runtime overhead compared to specialized versions
  • Compile-Time Guarantees : Type safety ensured by the compiler

Order Preservation

The use of IndexMap instead of HashMap in the inversion utility preserves insertion order, which is critical for:

  • Deterministic Output : Consistent CSV file generation across runs
  • Reproducible Transformations : Same input always produces same output order
  • Debugging : Predictable ordering aids in troubleshooting

Sources: src/utils/invert_multivalue_indexmap.rs:4-14


Performance Characteristics

IndexMap Inversion

Operation | Complexity | Notes
--- | --- | ---
Inversion | O(N) | N = total key-value associations
Lookup in result | O(1) average | Hash-based access
Insertion order | Preserved | Via IndexMap internal ordering
Memory overhead | O(N) | Stores inverted mapping

Runtime Mode Detection

Operation | Complexity | Caching Strategy
--- | --- | ---
First check | O(1) - O(log n) | Environment lookup or compile-time constant
Subsequent checks | O(1) | Static or lazy_static cached value

The mode detection functions likely use lazy initialization patterns to cache their results, avoiding repeated environment variable lookups or system calls.

Sources: src/utils/invert_multivalue_indexmap.rs:26-28


Integration Points

US GAAP Transformation Pipeline

The utilities integrate with the concept transformation system (see US GAAP Concept Transformation):

  1. Forward Mapping : distill_us_gaap_fundamental_concepts uses hard-coded mappings
  2. Reverse Mapping : invert_multivalue_indexmap generates reverse index
  3. Lookup Optimization : Enables O(1) queries for "which concepts contain this tag?"

Configuration System

The development and interactive mode utilities integrate with configuration management (see Configuration System):

  1. Validation : is_development_mode relaxes validation in development
  2. Credential Handling : is_interactive_mode determines prompt vs. error
  3. Logging : Both functions control verbosity and output format

Main Application Loop

The utilities support the main processing flow (see Main Application Flow):

  1. Batch Processing : VecExtensions enables efficient chunking of investment companies
  2. User Feedback : is_interactive_mode controls progress display
  3. Error Handling : Mode detection influences retry behavior and error messages

Sources: src/utils.rs:1-12



Python narrative_stack System

Relevant source files

Purpose and Scope

The narrative_stack system is the Python machine learning component of the dual-language financial data processing architecture. This document provides an overview of the Python system's architecture, its integration with the Rust sec-fetcher component, and the high-level data flow from CSV ingestion through model training.

For detailed information on specific subsystems:

The Rust data fetching layer is documented in Rust sec-fetcher Application.

System Architecture Overview

The narrative_stack system processes US GAAP financial data through a multi-stage pipeline that transforms raw CSV files into learned latent representations. The architecture consists of three primary layers: storage/ingestion, preprocessing, and training.

graph TB
    subgraph "Storage Layer"
        DbUsGaap["DbUsGaap\nMySQL Database Interface"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client"]
UsGaapStore["UsGaapStore\nUnified Data Facade"]
end
    
    subgraph "Data Sources"
        RustCSV["CSV Files\nfrom sec-fetcher\ntruncated_csvs/"]
MySQL["MySQL Database\nus_gaap_test"]
SimdRDrive["simd-r-drive\nWebSocket Server"]
end
    
    subgraph "Preprocessing Components"
        IngestCSV["ingest_us_gaap_csvs\nCSV Walker & Parser"]
GenPCA["generate_pca_embeddings\nPCA Compression"]
RobustScaler["RobustScaler\nPer-Pair Normalization"]
ConceptEmbed["Semantic Embeddings\nConcept/Unit Pairs"]
end
    
    subgraph "Training Components"
        IterableDS["IterableConceptValueDataset\nStreaming Data Loader"]
Stage1AE["Stage1Autoencoder\nPyTorch Lightning Module"]
PLTrainer["pl.Trainer\nTraining Orchestration"]
Callbacks["EarlyStopping\nModelCheckpoint\nTensorBoard Logger"]
end
    
 
   RustCSV --> IngestCSV
 
   MySQL --> DbUsGaap
 
   SimdRDrive --> DataStoreWsClient
    
 
   DbUsGaap --> UsGaapStore
 
   DataStoreWsClient --> UsGaapStore
    
 
   IngestCSV --> UsGaapStore
 
   UsGaapStore --> GenPCA
 
   UsGaapStore --> RobustScaler
 
   UsGaapStore --> ConceptEmbed
    
 
   GenPCA --> UsGaapStore
 
   RobustScaler --> UsGaapStore
 
   ConceptEmbed --> UsGaapStore
    
 
   UsGaapStore --> IterableDS
 
   IterableDS --> Stage1AE
 
   Stage1AE --> PLTrainer
 
   PLTrainer --> Callbacks
    
 
   Callbacks -.->|Checkpoints| Stage1AE

Component Architecture

Sources:

Core Component Responsibilities

Component | Type | Responsibilities
--- | --- | ---
DbUsGaap | Database Interface | MySQL connection management, query execution
DataStoreWsClient | WebSocket Client | Real-time communication with simd-r-drive server
UsGaapStore | Data Facade | Unified API for ingestion, lookup, embedding management
IterableConceptValueDataset | PyTorch Dataset | Streaming data loader with internal batching
Stage1Autoencoder | Lightning Module | Encoder-decoder architecture for concept embeddings
pl.Trainer | Training Framework | Orchestration, logging, checkpointing

Sources:

Data Flow Pipeline

The system processes financial data through a sequential pipeline from raw CSV files to trained model checkpoints.

graph LR
    subgraph "Input Stage"
        CSV1["Rust CSV Output\nproject_paths.rust_data/\ntruncated_csvs/"]
end
    
    subgraph "Ingestion Stage"
        Walk["walk_us_gaap_csvs\nDirectory Walker"]
Parse["UsGaapRowRecord\nParser"]
Store1["Store to\nDbUsGaap & DataStore"]
end
    
    subgraph "Preprocessing Stage"
        Extract["Extract\nconcept/unit pairs"]
GenEmbed["generate_concept_unit_embeddings\nSemantic Embeddings"]
Scale["RobustScaler\nPer-Pair Normalization"]
PCA["PCA Compression\nvariance_threshold=0.95"]
Triplet["Triplet Storage\nconcept+unit+scaled_value\n+scaler+embedding"]
end
    
    subgraph "Training Stage"
        Stream["IterableConceptValueDataset\ninternal_batch_size=64"]
DataLoader["DataLoader\nbatch_size from hparams\ncollate_with_scaler"]
Encode["Stage1Autoencoder.encode\nInput → Latent"]
Decode["Stage1Autoencoder.decode\nLatent → Reconstruction"]
Loss["MSE Loss\nReconstruction Error"]
end
    
    subgraph "Output Stage"
        Ckpt["Model Checkpoints\nstage1_resume-vN.ckpt"]
TB["TensorBoard Logs\nval_loss_epoch monitoring"]
end
    
 
   CSV1 --> Walk
 
   Walk --> Parse
 
   Parse --> Store1
 
   Store1 --> Extract
 
   Extract --> GenEmbed
 
   GenEmbed --> Scale
 
   Scale --> PCA
 
   PCA --> Triplet
 
   Triplet --> Stream
 
   Stream --> DataLoader
 
   DataLoader --> Encode
 
   Encode --> Decode
 
   Decode --> Loss
 
   Loss --> Ckpt
 
   Loss --> TB

End-to-End Processing Flow

Sources:

Processing Stages

1. CSV Ingestion

The system ingests CSV files produced by the Rust sec-fetcher using us_gaap_store.ingest_us_gaap_csvs(). This function walks the directory structure, parses each CSV file, and stores the data in both MySQL and the WebSocket data store.

Sources:

2. Concept/Unit Pair Extraction

The preprocessing pipeline extracts unique (concept, unit) pairs from the ingested data. Each pair represents a specific financial metric with its measurement unit (e.g., ("Revenues", "USD")).

Sources:

3. Semantic Embedding Generation

For each concept/unit pair, the system generates semantic embeddings that capture the meaning and relationship of financial concepts. These embeddings are then compressed using PCA with a variance threshold of 0.95.

Sources:

4. Value Normalization

The RobustScaler normalizes financial values separately for each concept/unit pair, making the data robust to outliers while preserving relative magnitudes within each group.

Sources:

5. Model Training

The Stage1Autoencoder learns to reconstruct the concatenated embedding+value input through a bottleneck architecture, forcing the model to learn compressed representations in the latent space.

Sources:

graph TB
    subgraph "Python Application"
        App["narrative_stack\nNotebooks & Scripts"]
end
    
    subgraph "Storage Facade"
        UsGaapStore["UsGaapStore\nUnified Interface"]
end
    
    subgraph "Backend Interfaces"
        DbInterface["DbUsGaap\ndb_config"]
WsInterface["DataStoreWsClient\nsimd_r_drive_server_config"]
end
    
    subgraph "Storage Systems"
        MySQL["MySQL Server\nus_gaap_test database"]
WsServer["simd-r-drive\nWebSocket Server\nKey-Value Store"]
FileCache["File System\nCache Storage"]
end
    
    subgraph "Data Types"
        Raw["Raw Records\nUsGaapRowRecord"]
Triplets["Triplets\nconcept+unit+scaled_value\n+scaler+embedding"]
Matrix["Embedding Matrix\nnumpy arrays"]
PCAModel["PCA Models\nsklearn objects"]
end
    
 
   App --> UsGaapStore
 
   UsGaapStore --> DbInterface
 
   UsGaapStore --> WsInterface
    
 
   DbInterface --> MySQL
 
   WsInterface --> WsServer
 
   WsServer --> FileCache
    
 
   MySQL -.->|stores| Raw
 
   WsServer -.->|stores| Triplets
 
   WsServer -.->|stores| Matrix
 
   WsServer -.->|stores| PCAModel

Storage Architecture Integration

The Python system integrates with multiple storage backends to support different access patterns and data requirements.

Storage Backend Architecture

Sources:

Storage System Characteristics

Storage Backend | Use Case | Data Types | Access Pattern
--- | --- | --- | ---
MySQL (DbUsGaap) | Raw record storage | UsGaapRowRecord entries | Batch queries during ingestion
WebSocket (DataStoreWsClient) | Preprocessed data | Triplets, embeddings, scalers, PCA models | Real-time streaming during training
File System | Cache persistence | HTTP cache, preprocessor cache | Transparent via simd-r-drive

Sources:

sequenceDiagram
    participant NB as "Jupyter Notebook"
    participant Config as "config module"
    participant DB as "DbUsGaap"
    participant WS as "DataStoreWsClient"
    participant Store as "UsGaapStore"
    
    NB->>Config: Import db_config
    NB->>Config: Import simd_r_drive_server_config
    NB->>DB: DbUsGaap(db_config)
    NB->>WS: DataStoreWsClient(host, port)
    NB->>Store: UsGaapStore(data_store)
    Store->>WS: Delegate operations
    Store->>DB: Query raw records

Configuration and Initialization

The system uses centralized configuration for database connections and WebSocket server endpoints.

Initialization Sequence
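
A minimal sketch of this sequence, using the component names from the diagram above; the import paths and constructor signatures are assumptions, not the package's documented API.

```python
# Import paths and constructor arguments below are assumptions based on the
# component names used in this document.
from narrative_stack.config import db_config, simd_r_drive_server_config
from narrative_stack.db import DbUsGaap
from narrative_stack.data_store import DataStoreWsClient
from narrative_stack.us_gaap_store import UsGaapStore

db = DbUsGaap(db_config)                      # MySQL interface for raw records
data_store = DataStoreWsClient(
    simd_r_drive_server_config.host,
    simd_r_drive_server_config.port,
)                                             # simd-r-drive WebSocket client
us_gaap_store = UsGaapStore(data_store)       # unified facade over both backends
```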

Sources:

Configuration Objects

Configuration | Purpose | Key Parameters
--- | --- | ---
db_config | MySQL connection | Host, port, database name, credentials
simd_r_drive_server_config | WebSocket server | Host, port
project_paths | File system paths | rust_data, python_data

Sources:

graph TB
    subgraph "Model Configuration"
        Model["Stage1Autoencoder\nhparams"]
LoadCkpt["load_from_checkpoint\nckpt_path"]
LR["Learning Rate\nlr, min_lr"]
end
    
    subgraph "Data Configuration"
        DS["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
DL["DataLoader\nbatch_size from hparams\ncollate_fn=collate_with_scaler\nnum_workers=2\nprefetch_factor=4"]
end
    
    subgraph "Callback Configuration"
        ES["EarlyStopping\nmonitor=val_loss_epoch\npatience=20\nmode=min"]
MC["ModelCheckpoint\nmonitor=val_loss_epoch\nsave_top_k=1\nfilename=stage1_resume"]
TB["TensorBoardLogger\nOUTPUT_PATH\nname=stage1_autoencoder"]
end
    
    subgraph "Trainer Configuration"
        Trainer["pl.Trainer\nmax_epochs=1000\naccelerator=auto\ndevices=1\ngradient_clip_val"]
end
    
 
   Model --> Trainer
 
   LoadCkpt -.->|optional resume| Model
 
   LR --> Model
    
 
   DS --> DL
 
   DL --> Trainer
    
 
   ES --> Trainer
 
   MC --> Trainer
 
   TB --> Trainer

Training Infrastructure

The training system uses PyTorch Lightning for experiment management and reproducible training workflows.

Training Configuration

Sources:

Key Training Parameters

Parameter | Value | Purpose
--- | --- | ---
EPOCHS | 1000 | Maximum training epochs
PATIENCE | 20 | Early stopping patience
internal_batch_size | 64 | Dataset internal batching
num_workers | 2 | DataLoader worker processes
prefetch_factor | 4 | Batches to prefetch per worker
gradient_clip_val | From hparams | Gradient clipping threshold

Sources:

Data Access Patterns

The UsGaapStore provides multiple access methods for different use cases.

Access Methods

Method | Purpose | Return Type | Usage Context
--- | --- | --- | ---
ingest_us_gaap_csvs(csv_dir, db) | Bulk CSV ingestion | None | Initial data loading
generate_pca_embeddings() | PCA compression | None | Preprocessing
lookup_by_index(idx) | Single triplet retrieval | Dict with concept, unit, scaled_value, scaler, embedding | Validation, debugging
batch_lookup_by_indices(indices) | Multiple triplet retrieval | List of dicts | Batch validation
get_triplet_count() | Count stored triplets | Integer | Monitoring
get_pair_count() | Count unique pairs | Integer | Statistics
get_embedding_matrix() | Full embedding matrix | (numpy array, pairs list) | Visualization, analysis
Sources:

Validation and Monitoring

The system includes validation mechanisms to ensure data integrity throughout the pipeline.

Scaler Validation Example

The preprocessing notebook includes validation code to verify that stored scalers correctly transform values:

Sources:

Visualization Tools

The system provides visualization utilities for monitoring preprocessing quality:

Function | Purpose | Output
--- | --- | ---
plot_pca_explanation() | Show PCA variance explained | Matplotlib plot with dimensionality recommendation
plot_semantic_embeddings() | Visualize embedding space | 2D scatterplot of embeddings

Sources:

graph LR
    subgraph "Rust sec-fetcher"
        Fetch["Network Fetching\nfetch_us_gaap_fundamentals"]
Transform["Concept Transformation\ndistill_us_gaap_fundamental_concepts"]
CSVWrite["CSV Output\nus-gaap/*.csv"]
end
    
    subgraph "File System"
        CSVStore["CSV Files\nproject_paths.rust_data/\ntruncated_csvs/"]
end
    
    subgraph "Python narrative_stack"
        CSVRead["CSV Ingestion\ningest_us_gaap_csvs"]
Process["Preprocessing Pipeline"]
Train["Model Training"]
end
    
 
   Fetch --> Transform
 
   Transform --> CSVWrite
 
   CSVWrite --> CSVStore
 
   CSVStore --> CSVRead
 
   CSVRead --> Process
 
   Process --> Train

Integration with Rust sec-fetcher

The Python system consumes data produced by the Rust sec-fetcher application through a file-based interface.

Rust-Python Data Bridge

Sources:

Data Format Expectations

The Python system expects CSV files produced by the Rust sec-fetcher to contain US GAAP fundamental data with the following structure:

  • Each CSV file represents data for a single ticker symbol
  • Files are organized in subdirectories (e.g., truncated_csvs/)
  • Each row contains concept, unit, value, and metadata fields
  • Concepts are standardized through the Rust distill_us_gaap_fundamental_concepts transformer

The preprocessing pipeline parses these CSVs into UsGaapRowRecord objects for further processing.
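
As a rough illustration of that layout, the sketch below walks a directory of per-ticker CSVs with pandas and collects the unique concept/unit pairs; the directory name and column names are assumptions, not the exact schema emitted by the Rust writer.

```python
from pathlib import Path
import pandas as pd

csv_dir = Path("data/us-gaap")            # assumed location of the Rust CSV output
pairs = set()
for csv_path in sorted(csv_dir.glob("*.csv")):
    df = pd.read_csv(csv_path)            # one file per ticker symbol, e.g. AAPL.csv
    pairs.update(zip(df["concept"], df["uom"]))  # column names are assumptions
print(f"{len(pairs)} unique (concept, unit) pairs")
```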

Sources:

Checkpoint Management

The training system supports checkpoint-based workflows for resuming training and fine-tuning.

Checkpoint Loading Patterns

The notebook demonstrates multiple checkpoint loading strategies:

Sources:

Checkpoint Configuration

Parameter | Purpose | Typical Value
--- | --- | ---
dirpath | Checkpoint save directory | OUTPUT_PATH
filename | Checkpoint filename pattern | "stage1_resume"
monitor | Metric to monitor | "val_loss_epoch"
mode | Optimization direction | "min"
save_top_k | Number of checkpoints to keep | 1

Sources:



Data Ingestion & Preprocessing

Relevant source files

Purpose and Scope

This document describes the data ingestion and preprocessing pipeline in the Python narrative_stack system. This pipeline transforms raw US GAAP financial data from CSV files (generated by the Rust sec-fetcher application) into normalized, embedded triplets suitable for machine learning training. The pipeline handles CSV parsing, concept/unit pair extraction, semantic embedding generation, robust statistical normalization, and PCA-based dimensionality reduction.

For information about the machine learning training pipeline that consumes this preprocessed data, see Machine Learning Training Pipeline. For details about the Rust data fetching and CSV generation, see Rust sec-fetcher Application.

Overview

The preprocessing pipeline orchestrates multiple transformation stages to prepare raw financial data for machine learning. The system reads CSV files containing US GAAP fundamental concepts (Assets, Revenues, etc.) with associated values and units of measure, then generates semantic embeddings for each concept/unit pair, normalizes values using robust statistical techniques, and compresses embeddings via PCA.

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:1-383

graph TB
    subgraph "Input"
        CSVFiles["CSV Files\nrust_data/us-gaap/*.csv\nSymbol-specific filings"]
end
    
    subgraph "Ingestion Layer"
        WalkCSV["walk_us_gaap_csvs()\nIterator over CSV files"]
ParseRow["UsGaapRowRecord\nParsed entries per symbol"]
end
    
    subgraph "Extraction & Embedding"
        ExtractPairs["Extract Concept/Unit Pairs\n(concept, unit)
tuples"]
GenEmbeddings["generate_concept_unit_embeddings()\nSemantic embeddings per pair"]
end
    
    subgraph "Normalization"
        GroupValues["Group by (concept, unit)\nCollect all values per pair"]
FitScaler["RobustScaler.fit()\nPer-pair scaling parameters"]
TransformValues["Transform values\nScaled to robust statistics"]
end
    
    subgraph "Dimensionality Reduction"
        PCAFit["PCA.fit()\nvariance_threshold=0.95"]
PCATransform["PCA.transform()\nCompress embeddings"]
end
    
    subgraph "Storage"
        BuildTriplets["Build Triplets\n(concept, unit, scaled_value,\nscaler, embedding)"]
StoreDS["DataStoreWsClient\nWrite to simd-r-drive"]
UsGaapStore["UsGaapStore\nUnified access facade"]
end
    
 
   CSVFiles --> WalkCSV
 
   WalkCSV --> ParseRow
 
   ParseRow --> ExtractPairs
 
   ExtractPairs --> GenEmbeddings
 
   ParseRow --> GroupValues
 
   GroupValues --> FitScaler
 
   FitScaler --> TransformValues
 
   GenEmbeddings --> PCAFit
 
   PCAFit --> PCATransform
 
   TransformValues --> BuildTriplets
 
   PCATransform --> BuildTriplets
 
   BuildTriplets --> StoreDS
 
   StoreDS --> UsGaapStore

Architecture Components

The preprocessing system is built around three primary components that handle data access, transformation, and storage:

graph TB
    subgraph "Database Layer"
        DbUsGaap["DbUsGaap\nMySQL interface\nus_gaap_test database"]
end
    
    subgraph "Storage Layer"
        DataStoreWsClient["DataStoreWsClient\nWebSocket client\nsimd_r_drive_server_config"]
end
    
    subgraph "Facade Layer"
        UsGaapStore["UsGaapStore\nUnified data access\nOrchestrates ingestion & retrieval"]
end
    
    subgraph "Core Operations"
        Ingest["ingest_us_gaap_csvs()\nCSV → Database → WebSocket"]
GenPCA["generate_pca_embeddings()\nPCA compression pipeline"]
Lookup["lookup_by_index()\nRetrieve triplet + metadata"]
end
    
 
   DbUsGaap --> UsGaapStore
 
   DataStoreWsClient --> UsGaapStore
 
   UsGaapStore --> Ingest
 
   UsGaapStore --> GenPCA
 
   UsGaapStore --> Lookup

Component | Type | Purpose
--- | --- | ---
DbUsGaap | Database Interface | Provides async MySQL access to stored US GAAP data
DataStoreWsClient | WebSocket Client | Connects to simd-r-drive server for key-value storage
UsGaapStore | Facade | Coordinates ingestion, embedding generation, and data retrieval

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40

CSV Data Ingestion

The ingestion process begins by reading CSV files produced by the Rust sec-fetcher application. These files are organized by ticker symbol and contain US GAAP fundamental concepts extracted from SEC filings.

Directory Structure

CSV files are read from a configurable directory path that points to the Rust application's output:

graph LR
 
   ProjectPaths["project_paths.rust_data"] --> CSVDir["csv_data_dir\nPath to CSV directory"]
CSVDir --> Files["Symbol CSV Files\nAAPL.csv, GOOGL.csv, etc."]
Files --> Walker["walk_us_gaap_csvs()\nGenerator function"]

The ingestion system walks through the directory structure, processing one CSV file per symbol. Each file contains multiple filing entries with concept/value/unit triplets.

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:20-23

Ingestion Invocation

The primary ingestion method is us_gaap_store.ingest_us_gaap_csvs(), which accepts a CSV directory path and database connection:
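
A sketch of the call as described above; the exact parameter list of ingest_us_gaap_csvs() and the directory layout are assumptions.

```python
# Hypothetical invocation; argument names are assumptions based on the
# description above (a CSV directory path plus a database connection).
csv_data_dir = project_paths.rust_data / "truncated_csvs"
us_gaap_store.ingest_us_gaap_csvs(csv_data_dir, db_us_gaap)
```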

This method orchestrates:

  1. CSV file walking and parsing
  2. Concept/unit pair extraction
  3. Value aggregation per pair
  4. Scaler fitting and transformation
  5. Storage in both MySQL and simd-r-drive

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67

Concept/Unit Pair Extraction

Financial data in CSV files contains concept/value/unit triplets. The preprocessing pipeline extracts unique (concept, unit) pairs and groups all values associated with each pair.

Data Structure

Each row in the CSV contains:

  • Concept : A FundamentalConcept variant (e.g., "Assets", "Revenues", "NetIncomeLoss")
  • Unit of Measure (UOM) : The measurement unit (e.g., "USD", "shares", "USD_per_share")
  • Value : The numeric value reported in the filing
  • Additional Metadata : Period type, balance type, filing date, etc.

Grouping Strategy

Values are grouped by (concept, unit) combinations because:

  • Different concepts have different value ranges (e.g., Assets vs. EPS)
  • The same concept in different units requires separate scaling (e.g., Revenue in USD vs. Revenue in thousands)
  • This grouping enables per-pair statistical normalization

Each pair becomes associated with:

  • A semantic embedding vector
  • A fitted RobustScaler instance
  • A collection of normalized values

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:190-249

Semantic Embedding Generation

The system generates semantic embeddings for each unique (concept, unit) pair using a pre-trained language model. These embeddings capture the semantic meaning of the financial concept and its measurement unit.

graph LR
 
   Pairs["Concept/Unit Pairs\n(Revenues, USD)\n(Assets, USD)\n(NetIncomeLoss, USD)"] --> Model["Embedding Model\nPre-trained transformer"]
Model --> Vectors["Embedding Vectors\nFixed dimensionality\nSemantic representation"]
Vectors --> EmbedMap["embedding_map\nDict[(concept, unit) → embedding]"]

Embedding Process

The generate_concept_unit_embeddings() function creates fixed-dimensional vector representations:
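
The notebook call itself is not shown here. As an illustration of the idea, the sketch below embeds each (concept, unit) pair with a sentence-transformers model; the model choice and the actual signature of generate_concept_unit_embeddings() are assumptions.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative only: encode each (concept, unit) pair as a single text string.
model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim model; choice is an assumption
pairs = [("Revenues", "USD"), ("Assets", "USD"), ("NetIncomeLoss", "USD")]
texts = [f"{concept} ({unit})" for concept, unit in pairs]
vectors = np.asarray(model.encode(texts))          # shape: (len(pairs), 384)
embedding_map = dict(zip(pairs, vectors))
```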

Embedding Properties

Property | Description
--- | ---
Dimensionality | Fixed size (typically 384 or 768 dimensions)
Semantic Preservation | Similar concepts produce similar embeddings
Deterministic | Same input always produces same output
Device-Agnostic | Can be computed on CPU or GPU

The embeddings are stored in a mapping structure that associates each (concept, unit) pair with its corresponding vector. This mapping is used throughout the preprocessing and training pipelines.

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:279-292

Value Normalization with RobustScaler

Financial values exhibit extreme variance and outliers. The preprocessing pipeline uses RobustScaler from scikit-learn to normalize values within each (concept, unit) group.

RobustScaler Characteristics

RobustScaler is chosen over standard normalization techniques because:

  • Uses median and interquartile range (IQR) instead of mean and standard deviation
  • Resistant to outliers that are common in financial data
  • Scales each feature independently
  • Preserves zero values when appropriate

Scaling Formula

For each value in a (concept, unit) group:

scaled_value = (raw_value - median) / IQR

Where:

  • median is the 50th percentile of all values in the group
  • IQR is the interquartile range (75th percentile - 25th percentile)
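
A minimal sketch of this per-pair scaling with scikit-learn's RobustScaler, which centers on the median and scales by the IQR by default; the grouping shown here is illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Values grouped by (concept, unit); one scaler is fitted per group.
values_by_pair = {
    ("Revenues", "USD"): np.array([[1.2e9], [9.8e8], [4.1e10], [2.5e9]]),
    ("EarningsPerShareBasic", "USD_per_share"): np.array([[1.1], [0.4], [12.7]]),
}

scalers, scaled_by_pair = {}, {}
for pair, values in values_by_pair.items():
    scaler = RobustScaler()                       # (x - median) / IQR
    scaled_by_pair[pair] = scaler.fit_transform(values)
    scalers[pair] = scaler                        # persisted for later inverse_transform
```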

Scaler Persistence

Each fitted RobustScaler instance is stored alongside the scaled values. This enables:

  • Inverse transformation for interpreting model outputs
  • Validation of scaling correctness
  • Consistent processing of new data points

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:108-166

PCA Dimensionality Reduction

The preprocessing pipeline applies Principal Component Analysis (PCA) to compress semantic embeddings while preserving most of the variance. This reduces memory footprint and training time.

PCA Configuration

Parameter | Value | Purpose
--- | --- | ---
variance_threshold | 0.95 | Retain 95% of original variance
fit_data | All concept/unit embeddings | Learn principal components
transform_data | Same embeddings | Apply compression
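
scikit-learn's PCA can express this variance threshold directly by passing a float as n_components; the sketch below uses a stand-in embedding matrix and is not the project's wrapper code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in embedding matrix: one row per (concept, unit) pair.
embedding_matrix = np.random.default_rng(0).normal(size=(500, 384))

pca = PCA(n_components=0.95)      # keep enough components to explain 95% of variance
compressed = pca.fit_transform(embedding_matrix)
print(pca.n_components_, "components retained; compressed shape:", compressed.shape)
```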

Variance Explanation Visualization

The system provides visualization tools to analyze PCA effectiveness. The plot_pca_explanation() function generates plots showing:

  • Cumulative explained variance vs. number of components
  • Individual component variance contribution
  • The optimal number of components to retain 95% variance

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-270

graph TB
    subgraph "Triplet Components"
        Concept["concept: str\nFundamentalConcept variant"]
Unit["unit: str\nUnit of measure"]
ScaledVal["scaled_value: float\nRobustScaler output"]
UnscaledVal["unscaled_value: float\nOriginal raw value"]
Scaler["scaler: RobustScaler\nFitted scaler instance"]
Embedding["embedding: ndarray\nPCA-compressed vector"]
end
    
    subgraph "Storage Format"
        Index["Index: int\nSequential triplet ID"]
Serialized["Serialized Data\nPickled Python object"]
end
    
 
   Concept --> Serialized
 
   Unit --> Serialized
 
   ScaledVal --> Serialized
 
   UnscaledVal --> Serialized
 
   Scaler --> Serialized
 
   Embedding --> Serialized
 
   Index --> Serialized
    
 
   Serialized --> WebSocket["DataStoreWsClient\nWebSocket write"]

Triplet Storage Structure

The final output of preprocessing is a collection of triplets stored in the simd-r-drive key-value store via DataStoreWsClient. Each triplet contains all information needed for training and inference.

Triplet Schema

Storage Interface

The UsGaapStore facade provides methods for retrieving stored triplets:

Method | Purpose
--- | ---
lookup_by_index(idx: int) | Retrieve single triplet by sequential index
batch_lookup_by_indices(indices: List[int]) | Retrieve multiple triplets efficiently
get_triplet_count() | Get total number of stored triplets
get_pair_count() | Get number of unique (concept, unit) pairs
get_embedding_matrix() | Retrieve full embedding matrix and pair list

Retrieval Example
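
A sketch of retrieving a single triplet by index; the dictionary field names follow the schema above, but the exact return structure is an assumption.

```python
total = us_gaap_store.get_triplet_count()
triplet = us_gaap_store.lookup_by_index(0)     # any index in range(total)

concept = triplet["concept"]                   # e.g. "Revenues"
unit = triplet["unit"]                         # e.g. "USD"
scaled_value = triplet["scaled_value"]         # RobustScaler output
embedding = triplet["embedding"]               # PCA-compressed vector
scaler = triplet["scaler"]                     # fitted RobustScaler for inverse_transform
```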

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:78-97

Data Validation

The preprocessing pipeline includes validation mechanisms to ensure data integrity throughout the transformation process.

Scaler Verification

A validation check confirms that stored scaled values match the transformation applied by the stored scaler:

graph LR
 
   Lookup["Lookup triplet"] --> Extract["Extract unscaled_value\nand scaler"]
Extract --> Transform["scaler.transform()"]
Transform --> Compare["np.isclose()\ncheck vs scaled_value"]
Compare --> Pass["Validation Pass"]
Compare --> Fail["Validation Fail"]

This test:

  1. Retrieves a random triplet
  2. Re-applies the stored scaler to the unscaled value
  3. Verifies the result matches the stored scaled value
  4. Uses np.isclose() for floating-point tolerance
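
A sketch of this check, assuming the triplet dictionary fields shown earlier on this page:

```python
import random
import numpy as np

idx = random.randrange(us_gaap_store.get_triplet_count())
triplet = us_gaap_store.lookup_by_index(idx)

# Re-apply the stored scaler to the stored raw value.
rescaled = triplet["scaler"].transform([[triplet["unscaled_value"]]])[0][0]

# np.isclose tolerates floating-point round-off.
assert np.isclose(rescaled, triplet["scaled_value"])
```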

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:91-96

Visualization Tools

The preprocessing module provides visualization utilities for analyzing embedding quality and PCA effectiveness.

PCA Explanation Plot

The plot_pca_explanation() function generates cumulative variance plots:

This visualization shows:

  • How many principal components are needed to reach the variance threshold
  • The diminishing returns of adding more components
  • The optimal dimensionality for the compressed embeddings

Semantic Embedding Scatterplot

The plot_semantic_embeddings() function creates 2D projections of embeddings:

This visualization helps assess:

  • Clustering patterns in concept/unit pairs
  • Separation between different financial concepts
  • Embedding space structure

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:259-270 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:337-343

graph LR
    subgraph "Preprocessing Output"
        Triplets["Stored Triplets\nsimd-r-drive storage"]
end
    
    subgraph "Training Input"
        Dataset["IterableConceptValueDataset\nStreaming data loader"]
Collate["collate_with_scaler()\nBatch construction"]
DataLoader["PyTorch DataLoader\nTraining batches"]
end
    
 
   Triplets --> Dataset
 
   Dataset --> Collate
 
   Collate --> DataLoader

Integration with Training Pipeline

The preprocessed data serves as input to the machine learning training pipeline. The stored triplets are loaded via streaming datasets during training.

Component | Purpose
--- | ---
Triplets in simd-r-drive | Persistent storage of preprocessed data
IterableConceptValueDataset | Streams triplets without loading all into memory
collate_with_scaler | Custom collation function for batch processing
PyTorch DataLoader | Manages batching, shuffling, and parallel loading

For detailed information about the training pipeline, see Machine Learning Training Pipeline.

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-522

Performance Considerations

The preprocessing pipeline is designed to handle large-scale financial datasets efficiently.

Memory Management

Strategy | Implementation
--- | ---
Streaming CSV reading | Process files one at a time, not all in memory
Incremental scaler fitting | Use online algorithms for large groups
Compressed storage | PCA reduces embedding footprint by ~50-70%
WebSocket batching | DataStoreWsClient batches writes for efficiency
graph TB
    subgraph "Rust Caching Layer"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nhttp_storage_cache.bin\n1 week TTL"]
PreCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\npreprocessor_cache.bin\nPersistent storage"]
end
    
    subgraph "Python Preprocessing"
        CSVRead["Read CSV files"]
Process["Transform & normalize"]
end
    
 
   HTTPCache -.->|Cached SEC data| CSVRead
 
   PreCache -.->|Cached computations| Process

Caching Strategy

The Rust layer provides HTTP and preprocessor caches that the preprocessing pipeline can leverage:

Sources: src/caches.rs:1-66



Machine Learning Training Pipeline

Relevant source files

Purpose and Scope

This page documents the machine learning training pipeline for the narrative_stack system, specifically the Stage 1 autoencoder that learns latent representations of US GAAP financial concepts. The training pipeline consumes preprocessed concept/unit/value triplets and their semantic embeddings (see Data Ingestion & Preprocessing) to train a neural network that can encode financial data into a compressed latent space.

The pipeline uses PyTorch Lightning for training orchestration, implements custom iterable datasets for efficient data streaming from the simd-r-drive WebSocket server, and provides comprehensive experiment tracking through TensorBoard. For details on the underlying storage mechanisms and data retrieval, see Database & Storage Integration.

Training Pipeline Architecture

The training pipeline operates as a streaming system that continuously fetches preprocessed triplets from the UsGaapStore and feeds them through the autoencoder model. The architecture emphasizes memory efficiency and reproducibility.

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:456-556

graph TB
    subgraph "Data Source Layer"
        DataStoreWsClient["DataStoreWsClient\n(simd_r_drive_ws_client)"]
UsGaapStore["UsGaapStore\nlookup_by_index()"]
end
    
    subgraph "Dataset Layer"
        IterableDataset["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
CollateFunction["collate_with_scaler()\nBatch construction"]
end
    
    subgraph "PyTorch Lightning Training Loop"
        DataLoader["DataLoader\nbatch_size from hparams\nnum_workers=2\npin_memory=True\npersistent_workers=True"]
Model["Stage1Autoencoder\nEncoder → Latent → Decoder"]
Optimizer["Adam Optimizer\n+ CosineAnnealingWarmRestarts\nReduceLROnPlateau"]
Trainer["pl.Trainer\nEarlyStopping\nModelCheckpoint\ngradient_clip_val"]
end
    
    subgraph "Monitoring & Persistence"
        TensorBoard["TensorBoardLogger\ntrain_loss\nval_loss_epoch\nlearning_rate"]
Checkpoints["Model Checkpoints\n.ckpt files\nsave_top_k=1\nmonitor='val_loss_epoch'"]
end
    
 
   DataStoreWsClient --> UsGaapStore
 
   UsGaapStore --> IterableDataset
 
   IterableDataset --> CollateFunction
 
   CollateFunction --> DataLoader
    
 
   DataLoader --> Model
 
   Model --> Optimizer
 
   Optimizer --> Trainer
    
 
   Trainer --> TensorBoard
 
   Trainer --> Checkpoints
    
 
   Checkpoints -.->|Resume training| Model

Stage1Autoencoder Model

Model Architecture

The Stage1Autoencoder is a fully-connected autoencoder that learns to compress financial concept embeddings combined with their scaled values into a lower-dimensional latent space. The model reconstructs its input, forcing the latent representation to capture the most important features.

graph LR
    Input["Input Tensor\n[embedding + scaled_value]\nDimension: embedding_dim + 1"]
Encoder["Encoder Network\nfc1 → dropout → ReLU\nfc2 → dropout → ReLU"]
Latent["Latent Space\nDimension: latent_dim"]
Decoder["Decoder Network\nfc3 → dropout → ReLU\nfc4 → dropout → output"]
Output["Reconstructed Input\nSame dimension as input"]
Loss["MSE Loss\ninput vs output"]
Input --> Encoder
 
   Encoder --> Latent
 
   Latent --> Decoder
 
   Decoder --> Output
 
   Output --> Loss
 
   Input --> Loss

Hyperparameters

The model exposes the following configurable hyperparameters through its hparams attribute:

Parameter | Description | Typical Value
--- | --- | ---
input_dim | Dimension of input (embedding + 1 for value) | Varies based on embedding size
latent_dim | Dimension of compressed latent space | 64-128
dropout_rate | Dropout probability for regularization | 0.0-0.2
lr | Initial learning rate | 1e-5 to 5e-5
min_lr | Minimum learning rate for scheduler | 1e-7 to 1e-6
batch_size | Training batch size | 32 (typical)
weight_decay | L2 regularization parameter | 1e-8 to 1e-4
gradient_clip | Maximum gradient norm | 0.0-1.0

The model can be instantiated with default parameters or loaded from a checkpoint with overridden hyperparameters:
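
The sketch below shows both patterns using Lightning's standard load_from_checkpoint API; the constructor arguments, hyperparameter names, and checkpoint path are assumptions.

```python
# Fresh model with assumed dimensions:
model = Stage1Autoencoder(input_dim=385, latent_dim=128)

# Or resume from a checkpoint while overriding the learning-rate hyperparameters:
model = Stage1Autoencoder.load_from_checkpoint(
    OUTPUT_PATH / "stage1_resume.ckpt",   # path pattern is an assumption
    lr=1e-5,
    min_lr=1e-7,
)
```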

Loading from checkpoint with modified learning rate: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490

Loss Function and Optimization

The model uses Mean Squared Error (MSE) loss between the input and reconstructed output. The optimization strategy combines:

  • Adam optimizer with configurable learning rate and weight decay
  • CosineAnnealingWarmRestarts scheduler for cyclical learning rate annealing
  • ReduceLROnPlateau for adaptive learning rate reduction when validation loss plateaus
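
A condensed sketch of an autoencoder LightningModule with this shape; layer sizes, names, and the scheduler wiring are assumptions (only CosineAnnealingWarmRestarts is shown), so this is not the repository's Stage1Autoencoder.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class TinyStage1Autoencoder(pl.LightningModule):
    """Illustrative stand-in for Stage1Autoencoder, not the actual implementation."""

    def __init__(self, input_dim: int, latent_dim: int = 128, dropout_rate: float = 0.1,
                 lr: float = 5e-5, min_lr: float = 1e-6, weight_decay: float = 1e-6,
                 batch_size: int = 32, gradient_clip: float = 1.0):
        super().__init__()
        self.save_hyperparameters()
        hidden = max(latent_dim * 2, 256)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.Dropout(dropout_rate), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Dropout(dropout_rate), nn.ReLU(),
            nn.Linear(hidden, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def training_step(self, batch, batch_idx):
        x, y, _scalers = batch                            # collate_with_scaler output
        loss = nn.functional.mse_loss(self(x), y)         # reconstruction error
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y, _scalers = batch
        loss = nn.functional.mse_loss(self(x), y)
        self.log("val_loss_epoch", loss, on_epoch=True, prog_bar=True)

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=self.hparams.lr,
                               weight_decay=self.hparams.weight_decay)
        sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
            opt, T_0=10, eta_min=self.hparams.min_lr)
        return {"optimizer": opt, "lr_scheduler": sched}
```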

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490

Dataset and Data Loading

IterableConceptValueDataset

The IterableConceptValueDataset is a custom PyTorch IterableDataset that streams training data from the UsGaapStore without loading the entire dataset into memory. This design enables training on datasets larger than available RAM.

Key characteristics:

graph TB
    subgraph "Dataset Initialization"
        Config["simd_r_drive_server_config\nhost + port"]
Params["Dataset Parameters\ninternal_batch_size\nreturn_scaler\nshuffle"]
end
    
    subgraph "Data Streaming Process"
        Store["UsGaapStore instance\nget_triplet_count()"]
IndexGen["Index Generator\nSequential or shuffled\nbased on shuffle param"]
BatchFetch["Internal Batching\nFetch internal_batch_size items\nvia batch_lookup_by_indices()"]
Unpack["Unpack Triplet Data\nembedding\nscaled_value\nscaler (optional)"]
end
    
    subgraph "Output"
        Tensor["PyTorch Tensors\nx: [embedding + scaled_value]\ny: [embedding + scaled_value]\nscaler: RobustScaler object"]
end
    
 
   Config --> Store
 
   Params --> IndexGen
 
   Store --> IndexGen
 
   IndexGen --> BatchFetch
 
   BatchFetch --> Unpack
 
   Unpack --> Tensor
  • Iterable streaming : Data is fetched on-demand during iteration, not preloaded
  • Internal batching : Fetches internal_batch_size items at once from the WebSocket server to reduce network overhead (typically 64)
  • Optional shuffling : Can randomize index order for training or maintain sequential order for validation
  • Scaler inclusion : Optionally returns the RobustScaler object with each sample for validation purposes
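
A minimal sketch of such a streaming dataset, assuming the UsGaapStore methods described on this page; the triplet field names ("embedding", "scaled_value", "scaler") are assumptions based on how the data is described here, and the class is illustrative rather than the actual implementation:

```python
import random
from typing import Iterator, Tuple

import torch
from torch.utils.data import IterableDataset


class IterableConceptValueDatasetSketch(IterableDataset):
    """Illustrative streaming dataset; not the actual narrative_stack class."""

    def __init__(self, us_gaap_store, internal_batch_size: int = 64,
                 shuffle: bool = False, return_scaler: bool = False):
        self.store = us_gaap_store
        self.internal_batch_size = internal_batch_size
        self.shuffle = shuffle
        self.return_scaler = return_scaler

    def __iter__(self) -> Iterator[Tuple]:
        indices = list(range(self.store.get_triplet_count()))
        if self.shuffle:
            random.shuffle(indices)
        # Fetch internal_batch_size items per round trip to reduce network overhead.
        for start in range(0, len(indices), self.internal_batch_size):
            chunk = indices[start:start + self.internal_batch_size]
            for item in self.store.batch_lookup_by_indices(chunk):
                # x concatenates the concept/unit embedding with the scaled value.
                x = torch.tensor([*item["embedding"], item["scaled_value"]],
                                 dtype=torch.float32)
                y = x.clone()                      # autoencoder target == input
                if self.return_scaler:
                    yield x, y, item["scaler"]
                else:
                    yield x, y
```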

Dataset instantiation: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

DataLoader Configuration

PyTorch DataLoader instances wrap the iterable dataset with additional configuration for efficient batch loading:

| Parameter | Value | Purpose |
| --- | --- | --- |
| batch_size | From model.hparams.batch_size | Outer batch size for model training |
| collate_fn | collate_with_scaler | Custom collation function to handle scaler objects |
| num_workers | 2 | Number of parallel data loading processes |
| pin_memory | True | Enables faster host-to-GPU transfers |
| persistent_workers | True | Keeps worker processes alive between epochs |
| prefetch_factor | 4 | Number of batches each worker pre-fetches |

Training loader uses shuffle=True in the dataset while validation loader uses shuffle=False to ensure reproducible validation metrics.
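
A sketch of this DataLoader configuration, using the parameters from the table above and assuming the model and the train/validation dataset instances described in the previous section:

```python
from torch.utils.data import DataLoader

common = dict(
    batch_size=model.hparams.batch_size,
    collate_fn=collate_with_scaler,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=4,
)

train_loader = DataLoader(train_dataset, **common)  # dataset created with shuffle=True
val_loader = DataLoader(val_dataset, **common)      # dataset created with shuffle=False
```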

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

collate_with_scaler Function

The collate_with_scaler function handles batch construction when the dataset returns triplets (x, y, scaler). It stacks the tensors into batches while preserving the scaler objects in a list:
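
A minimal sketch of such a collate function, assuming each batch item is an (x, y, scaler) triplet:

```python
import torch

def collate_with_scaler(batch):
    """Stack tensors into batches while keeping the scaler objects in a list."""
    xs, ys, scalers = zip(*batch)
    return torch.stack(xs), torch.stack(ys), list(scalers)
```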

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb506-507

graph LR
    Input["Batch Items\n[(x1, y1, scaler1),\n(x2, y2, scaler2),\n...]"]
Stack["Stack Tensors\ntorch.stack(x_list)\ntorch.stack(y_list)"]
Output["Batched Output\n(x_batch, y_batch, scaler_list)"]
Input --> Stack
 
   Stack --> Output

Training Configuration

PyTorch Lightning Trainer Setup

The training pipeline uses PyTorch Lightning's Trainer class to orchestrate the training loop, validation, and callbacks:

Configuration values (a Trainer setup sketch follows the list below):

graph TB
    subgraph "Training Configuration"
        MaxEpochs["max_epochs\nTypically 1000"]
Accelerator["accelerator='auto'\ndevices=1"]
GradientClip["gradient_clip_val\nFrom model.hparams"]
end
    
    subgraph "Callbacks"
        EarlyStopping["EarlyStopping\nmonitor='val_loss_epoch'\npatience=20\nmode='min'"]
ModelCheckpoint["ModelCheckpoint\nmonitor='val_loss_epoch'\nsave_top_k=1\nfilename='stage1_resume'"]
end
    
    subgraph "Logging"
        TensorBoardLogger["TensorBoardLogger\nlog_dir=OUTPUT_PATH\nname='stage1_autoencoder'"]
end
    
    Trainer["pl.Trainer"]
MaxEpochs --> Trainer
 
   Accelerator --> Trainer
 
   GradientClip --> Trainer
 
   EarlyStopping --> Trainer
 
   ModelCheckpoint --> Trainer
 
   TensorBoardLogger --> Trainer
  • EPOCHS : 1000 (allows for long training with early stopping)
  • PATIENCE : 20 (compensates for learning rate annealing periods)
  • OUTPUT_PATH : project_paths.python_data / "stage1_23_(no_pre_dedupe)" or similar versioned directory
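
A sketch of this Trainer setup, assuming the model from the sections above; the OUTPUT_PATH directory name below is hypothetical and stands in for the versioned directory described above:

```python
from pathlib import Path

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

OUTPUT_PATH = Path("stage1_output")   # hypothetical; see OUTPUT_PATH above
EPOCHS = 1000
PATIENCE = 20

early_stopping = EarlyStopping(monitor="val_loss_epoch", patience=PATIENCE, mode="min")
checkpoint = ModelCheckpoint(
    dirpath=OUTPUT_PATH,
    filename="stage1_resume",
    monitor="val_loss_epoch",
    save_top_k=1,
    mode="min",
)
logger = TensorBoardLogger(save_dir=OUTPUT_PATH, name="stage1_autoencoder")

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    accelerator="auto",
    devices=1,
    gradient_clip_val=model.hparams.gradient_clip,
    callbacks=[early_stopping, checkpoint],
    logger=logger,
)
```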

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb468-548

Callbacks

EarlyStopping

The EarlyStopping callback monitors val_loss_epoch and stops training if no improvement occurs for 20 consecutive epochs. This prevents overfitting and saves computational resources.

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb528-530

ModelCheckpoint

The ModelCheckpoint callback saves the best model based on validation loss:

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb532-539

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb528-539

Training Execution

The training process is initiated with the Trainer.fit() method:

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556
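
A sketch of the call, assuming the trainer, model, and loaders defined above; the commented line shows the optional resume form:

```python
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

# Or resume from a saved checkpoint (restores optimizer/scheduler state as well):
# trainer.fit(model, train_loader, val_loader,
#             ckpt_path=OUTPUT_PATH / "stage1_resume.ckpt")
```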

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556

Checkpointing and Resuming Training

Saving Checkpoints

Checkpoints are automatically saved by the ModelCheckpoint callback when validation loss improves. The checkpoint file contains:

  • Model weights (encoder and decoder parameters)
  • Optimizer state
  • Learning rate scheduler state
  • All hyperparameters (hparams)
  • Current epoch number
  • Best validation loss

Checkpoint files are saved with the .ckpt extension in the OUTPUT_PATH directory.

Loading and Resuming

The training pipeline supports two modes of checkpoint loading:

1. Resume Training with Existing Configuration

Pass ckpt_path to trainer.fit() to resume training with the exact state from the checkpoint:

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb555

2. Fine-tune with Modified Hyperparameters

Load the checkpoint explicitly and override specific hyperparameters before training:

This approach is useful for fine-tuning with a lower learning rate after initial training convergence.

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb479-486

Checkpoint File Naming

The checkpoint filename is determined by the ModelCheckpoint configuration:

  • filename : "stage1_resume" produces files like stage1_resume.ckpt or stage1_resume-v1.ckpt, stage1_resume-v2.ckpt, etc.
  • Versioning : PyTorch Lightning automatically increments version numbers when the same filename exists

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb475-486 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb532-539 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb550-556

graph TB
    subgraph "Logged Metrics"
        TrainLoss["train_loss\nPer-batch training loss"]
ValLoss["val_loss_epoch\nEpoch-averaged validation loss"]
LearningRate["learning_rate\nCurrent optimizer LR"]
end
    
    subgraph "TensorBoard Logger"
        Logger["TensorBoardLogger\nsave_dir=OUTPUT_PATH\nname='stage1_autoencoder'"]
end
    
    subgraph "Log Directory Structure"
        OutputPath["OUTPUT_PATH/"]
VersionDir["stage1_autoencoder/"]
Events["events.out.tfevents.*\nhparams.yaml"]
Checkpoints["*.ckpt checkpoint files"]
end
    
 
   TrainLoss --> Logger
 
   ValLoss --> Logger
 
   LearningRate --> Logger
    
 
   Logger --> OutputPath
 
   OutputPath --> VersionDir
 
   VersionDir --> Events
 
   OutputPath --> Checkpoints

Monitoring and Logging

TensorBoard Integration

The TensorBoardLogger automatically logs training metrics for visualization in TensorBoard:

Log directory structure:

OUTPUT_PATH/
├── stage1_autoencoder/
│   └── version_0/
│       ├── events.out.tfevents.*
│       ├── hparams.yaml
│       └── checkpoints/
├── stage1_resume.ckpt
└── optuna_study.db (if using Optuna)

python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb543

Visualizing Training Progress

Launch TensorBoard to monitor training by pointing it at the log directory, for example with tensorboard --logdir set to OUTPUT_PATH.

TensorBoard displays:

  • Loss curves : Training and validation loss over epochs
  • Learning rate schedule : Visualization of LR changes during training
  • Hyperparameters : Table of all model hyperparameters
  • Gradient histograms : Distribution of gradients (if enabled)

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb541-548

Training Workflow Summary

The complete training workflow follows this sequence:

This pipeline design enables:

  • Memory-efficient training through streaming data loading
  • Reproducible experiments with comprehensive logging
  • Flexible resumption with hyperparameter overrides
  • Robust training with gradient clipping and early stopping

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb456-556



Database & Storage Integration

Relevant source files

Purpose and Scope

This document describes the database and storage infrastructure used by the Python narrative_stack system to persist and retrieve US GAAP financial data. The system employs a dual-storage architecture: MySQL for structured financial records and a WebSocket-based key-value store (simd-r-drive) for high-performance embedding and triplet data. This page covers the DbUsGaap database interface, DataStoreWsClient WebSocket integration, UsGaapStore facade pattern, triplet storage format, and embedding matrix management.

For information about CSV ingestion and preprocessing operations that populate these storage systems, see Data Ingestion & Preprocessing. For information about how the training pipeline consumes this stored data, see Machine Learning Training Pipeline.

Storage Architecture Overview

The narrative_stack system uses two complementary storage backends to optimize for different data access patterns:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40

graph TB
    subgraph "Python narrative_stack"
        UsGaapStore["UsGaapStore\nUnified Facade"]
Preprocessing["Preprocessing Pipeline\ningest_us_gaap_csvs\ngenerate_pca_embeddings"]
Training["ML Training\nIterableConceptValueDataset"]
end
    
    subgraph "Storage Backends"
        DbUsGaap["DbUsGaap\nMySQL Interface\nus_gaap_test database"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client\nsimd-r-drive protocol"]
end
    
    subgraph "External Services"
        MySQL[("MySQL Server\nStructured Records\nSymbol/Concept/Value")]
        SimdRDrive["simd-r-drive-ws-server\nKey-Value Store\nEmbeddings/Triplets/Scalers"]
end
    
 
   Preprocessing -->|ingest CSV data| UsGaapStore
 
   UsGaapStore -->|store records| DbUsGaap
 
   UsGaapStore -->|store triplets/embeddings| DataStoreWsClient
 
   DbUsGaap -->|SQL queries| MySQL
 
   DataStoreWsClient -->|WebSocket protocol| SimdRDrive
 
   UsGaapStore -->|retrieve triplets| Training
 
   Training -->|batch lookups| UsGaapStore

The dual-storage design separates concerns:

  • MySQL stores structured financial data with SQL query capabilities for filtering and aggregation
  • simd-r-drive stores high-dimensional embeddings and preprocessed triplets for fast sequential access during training

MySQL Database Interface (DbUsGaap)

The DbUsGaap class provides an abstraction layer over MySQL database operations for US GAAP financial records.

Database Schema

The system expects a MySQL database named us_gaap_test with the following structure:

| Table/Field | Type | Purpose |
| --- | --- | --- |
| Raw records table | - | Stores original CSV-ingested data |
| symbol | VARCHAR | Company ticker symbol |
| concept | VARCHAR | US GAAP concept name |
| unit | VARCHAR | Unit of measurement (USD, shares, etc.) |
| value | DECIMAL | Numeric value for the concept |
| form | VARCHAR | SEC form type (10-K, 10-Q, etc.) |
| filing_date | DATE | Date of filing |

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:30-35

DbUsGaap Connection Configuration

The DbUsGaap class is instantiated with configuration from db_config:
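
A sketch of the instantiation, assuming DbUsGaap accepts keyword connection parameters; the exact constructor signature may differ, and the values below (drawn from the local test setup) are illustrative:

```python
# Hypothetical db_config; real values come from the project's configuration.
db_config = {
    "host": "127.0.0.1",
    "port": 3306,
    "user": "root",
    "password": "onlylocal",
    "database": "us_gaap_test",
}

db_us_gaap = DbUsGaap(**db_config)
```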

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-18

The database connection is established during initialization and used for querying structured financial data during the ingestion phase. The DbUsGaap interface supports filtering by symbol, concept, and date ranges for selective data loading.

WebSocket Data Store (DataStoreWsClient)

The DataStoreWsClient provides a WebSocket-based interface to the simd-r-drive key-value store for high-performance binary data storage and retrieval.

simd-r-drive Integration

The simd-r-drive system is a custom WebSocket server that provides persistent key-value storage optimized for embedding matrices and preprocessed training data:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:33-40

graph TB
    subgraph "Python Client"
        WSClient["DataStoreWsClient\nWebSocket protocol handler"]
BatchWrite["batch_write\n[(key, value)]"]
BatchRead["batch_read\n[keys] → [values]"]
end
    
    subgraph "simd-r-drive Server"
        WSServer["simd-r-drive-ws-server\nWebSocket endpoint"]
DataStore["DataStore\nBinary file backend"]
end
    
 
   BatchWrite -->|WebSocket message| WSClient
 
   BatchRead -->|WebSocket message| WSClient
 
   WSClient <-->|ws://host:port| WSServer
 
   WSServer <-->|read/write| DataStore

Connection Initialization

The WebSocket client is instantiated with server configuration:
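
A sketch of the connection setup, assuming a host/port constructor fed from simd_r_drive_server_config; the values are illustrative:

```python
# Hypothetical host/port; in practice these come from simd_r_drive_server_config.
data_store_client = DataStoreWsClient(host="127.0.0.1", port=8080)
```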

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb38

Batch Operations

The DataStoreWsClient supports efficient batch read and write operations:

| Operation | Input | Output | Purpose |
| --- | --- | --- | --- |
| batch_write | [(key: bytes, value: bytes)] | None | Store multiple key-value pairs atomically |
| batch_read | [key: bytes] | [value: bytes] | Retrieve multiple values by key |

The WebSocket protocol enables low-latency access to stored embeddings and preprocessed triplets during training iterations.
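
An illustrative use of the batch operations, assuming the client created above; the keys and the serialized_* variables are hypothetical placeholders, since keys and values are raw bytes and serialization is the caller's concern:

```python
# Write two entries in one round trip.
data_store_client.batch_write([
    (b"concept_unit_pairs", serialized_pairs),   # e.g. pickled list of (concept, unit)
    (b"embedding_matrix", serialized_matrix),    # e.g. serialized ndarray bytes
])

# Read them back by key, in order.
pairs_bytes, matrix_bytes = data_store_client.batch_read([
    b"concept_unit_pairs",
    b"embedding_matrix",
])
```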

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:50-55

Rust-side Cache Storage

On the Rust side, the simd-r-drive storage system is accessed through the Caches module, which provides two cache stores:

Sources: src/caches.rs:7-65

graph TB
    ConfigManager["ConfigManager\ncache_base_dir"]
CachesInit["Caches::init\nInitialize cache stores"]
subgraph "Cache Stores"
        HTTPCache["SIMD_R_DRIVE_HTTP_CACHE\nhttp_storage_cache.bin\nOnceLock<Arc<DataStore>>"]
PreprocessorCache["SIMD_R_DRIVE_PREPROCESSOR_CACHE\npreprocessor_cache.bin\nOnceLock<Arc<DataStore>>"]
end
    
    HTTPCacheGet["Caches::get_http_cache_store"]
PreprocessorCacheGet["Caches::get_preprocessor_cache"]
ConfigManager -->|cache_base_dir path| CachesInit
 
   CachesInit -->|open DataStore| HTTPCache
 
   CachesInit -->|open DataStore| PreprocessorCache
 
   HTTPCache -->|Arc clone| HTTPCacheGet
 
   PreprocessorCache -->|Arc clone| PreprocessorCacheGet

The Rust implementation uses OnceLock<Arc<DataStore>> for thread-safe lazy initialization of cache stores. The DataStore::open method creates or opens persistent binary files at paths derived from ConfigManager:

  • http_storage_cache.bin - HTTP response caching
  • preprocessor_cache.bin - Preprocessed data caching

UsGaapStore Facade

The UsGaapStore class provides a unified interface that coordinates operations across both storage backends (MySQL and simd-r-drive):

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb40 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-68

graph TB
    subgraph "UsGaapStore Facade"
        IngestCSV["ingest_us_gaap_csvs\nCSV → MySQL + triplets"]
GeneratePCA["generate_pca_embeddings\nCompress embeddings"]
LookupByIndex["lookup_by_index\nRetrieve triplet"]
BatchLookup["batch_lookup_by_indices\nBulk retrieval"]
GetEmbeddingMatrix["get_embedding_matrix\nReturn all embeddings"]
GetTripletCount["get_triplet_count\nTotal stored triplets"]
GetPairCount["get_pair_count\nUnique concept/unit pairs"]
end
    
    subgraph "Backend Operations"
        MySQLWrite["DbUsGaap.insert\nStore raw records"]
SimdWrite["DataStoreWsClient.batch_write\nStore triplets/embeddings"]
SimdRead["DataStoreWsClient.batch_read\nRetrieve data"]
end
    
 
   IngestCSV -->|raw data| MySQLWrite
 
   IngestCSV -->|triplets| SimdWrite
 
   GeneratePCA -->|compressed embeddings| SimdWrite
 
   LookupByIndex -->|key query| SimdRead
 
   BatchLookup -->|batch keys| SimdRead
 
   GetEmbeddingMatrix -->|matrix key| SimdRead

UsGaapStore Initialization

The facade is initialized with a reference to the DataStoreWsClient:
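
A sketch of the initialization, assuming the facade takes the WebSocket client as its argument (the exact signature may differ):

```python
us_gaap_store = UsGaapStore(data_store_client)
```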

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb40

Key Methods

| Method | Parameters | Returns | Purpose |
| --- | --- | --- | --- |
| ingest_us_gaap_csvs | csv_data_dir: Path, db_us_gaap: DbUsGaap | None | Walk CSV directory, parse records, store to both backends |
| generate_pca_embeddings | None | None | Apply PCA dimensionality reduction to stored embeddings |
| lookup_by_index | index: int | Triplet dict | Retrieve single triplet by sequential index |
| batch_lookup_by_indices | indices: List[int] | List[Triplet dict] | Retrieve multiple triplets efficiently |
| get_embedding_matrix | None | (ndarray, List[Tuple]) | Return full embedding matrix and concept/unit pairs |
| get_triplet_count | None | int | Total number of stored triplets |
| get_pair_count | None | int | Number of unique (concept, unit) pairs |

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-96

The UsGaapStore abstracts storage implementation details, allowing preprocessing and training code to interact with a simple, high-level API.
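
An end-to-end usage sketch built from the methods in the table above, assuming the db_us_gaap and us_gaap_store objects from the previous sections and a hypothetical csv_data_dir path:

```python
from pathlib import Path

csv_data_dir = Path("data/us-gaap")          # hypothetical CSV directory

# Ingest CSVs into MySQL and simd-r-drive, then compress the embeddings.
us_gaap_store.ingest_us_gaap_csvs(csv_data_dir, db_us_gaap)
us_gaap_store.generate_pca_embeddings()

print(us_gaap_store.get_pair_count())        # unique (concept, unit) pairs
print(us_gaap_store.get_triplet_count())     # total stored triplets
triplet = us_gaap_store.lookup_by_index(0)   # single triplet by sequential index
```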

Triplet Storage Format

The core data structure stored in simd-r-drive is the triplet : a tuple of (concept, unit, scaled_value) along with associated metadata.

Triplet Structure

Each triplet stored in simd-r-drive contains the following fields:
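
The concrete field layout is defined in the notebook; an illustrative shape, with field names and example values reconstructed from how the data is described and consumed elsewhere on this page, might look like this:

```python
from sklearn.preprocessing import RobustScaler

# Illustrative only; field names and example values are assumptions.
triplet = {
    "concept": "Revenues",         # US GAAP concept name
    "unit": "USD",                 # unit of measurement
    "unscaled_value": 1.25e9,      # raw reported value
    "scaled_value": 0.42,          # value after RobustScaler transformation
    "embedding": [0.01, -0.07],    # PCA-compressed (concept, unit) embedding
    "scaler": RobustScaler(),      # fitted scaler for this (concept, unit) pair
}
```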

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:88-96

Scaler Validation

The storage system maintains consistency between stored values and scalers. During retrieval, the system can verify that the scaler produces the expected scaled value:
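
A sketch of such a check, assuming the triplet field names used in the example above; RobustScaler.transform expects a 2-D array:

```python
import numpy as np

triplet = us_gaap_store.lookup_by_index(0)
recomputed = triplet["scaler"].transform(
    np.array([[triplet["unscaled_value"]]])
)[0, 0]
assert np.isclose(recomputed, triplet["scaled_value"])
```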

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:92-96

This validation ensures that the RobustScaler instance stored with each (concept, unit) pair correctly transforms the unscaled value to the stored scaled value, maintaining data integrity across storage and retrieval cycles.

graph LR
    Index0["Index 0\nTriplet 0"]
Index1["Index 1\nTriplet 1"]
Index2["Index 2\nTriplet 2"]
IndexN["Index N\nTriplet N"]
BatchLookup["batch_lookup_by_indices\n[0, 1, 2, ..., N]"]
Index0 -->|retrieve| BatchLookup
 
   Index1 -->|retrieve| BatchLookup
 
   Index2 -->|retrieve| BatchLookup
 
   IndexN -->|retrieve| BatchLookup

Triplet Indexing

Triplets are stored with sequential integer indices, enabling efficient batch retrieval during training:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:87-89

The sequential indexing scheme allows the training dataset to iterate efficiently through all stored triplets without loading the entire dataset into memory.

graph TB
    subgraph "Embedding Matrix Generation"
        ConceptUnitPairs["Unique Pairs\n(concept, unit)\nCollected during ingestion"]
GenerateEmbeddings["generate_concept_unit_embeddings\nSentence transformer model"]
PCACompression["PCA Compression\nVariance threshold: 0.95"]
EmbeddingMatrix["Embedding Matrix\nShape: (n_pairs, embedding_dim)"]
end
    
    subgraph "Storage"
        StorePairs["Store pairs list\nKey: 'concept_unit_pairs'"]
StoreMatrix["Store matrix\nKey: 'embedding_matrix'"]
end
    
 
   ConceptUnitPairs -->|semantic text| GenerateEmbeddings
 
   GenerateEmbeddings -->|full embeddings| PCACompression
 
   PCACompression -->|compressed| EmbeddingMatrix
 
   EmbeddingMatrix --> StoreMatrix
 
   ConceptUnitPairs --> StorePairs

Embedding Matrix Management

The UsGaapStore maintains a pre-computed embedding matrix that maps each unique (concept, unit) pair to its semantic embedding vector.

Embedding Matrix Structure

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb68 python/narrative_stack/notebooks/stage1_preprocessing.ipynb:263-269

Retrieval and Usage

The embedding matrix can be retrieved for analysis or visualization:
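
A short usage sketch based on the return shape listed in the methods table above:

```python
embedding_matrix, concept_unit_pairs = us_gaap_store.get_embedding_matrix()

print(embedding_matrix.shape)      # (n_pairs, embedding_dim)
print(concept_unit_pairs[0])       # a (concept, unit) tuple
```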

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:106-108 python/narrative_stack/notebooks/stage1_preprocessing.ipynb263

During training, each triplet's embedding is retrieved from storage rather than recomputed, ensuring consistent representations across training runs.

PCA Dimensionality Reduction

The generate_pca_embeddings method applies PCA to compress the raw semantic embeddings while retaining 95% of the variance:
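
A minimal sketch of variance-threshold PCA as scikit-learn expresses it, assuming full_embeddings is the raw (n_pairs, raw_dim) embedding matrix; this illustrates the 95% variance criterion rather than the method's actual internals:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)              # keep enough components for 95% variance
compressed = pca.fit_transform(full_embeddings)
print(pca.n_components_, compressed.shape)
```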

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb68

The PCA compression reduces storage requirements and computational overhead during training while maintaining semantic relationships between concept/unit pairs.

Data Flow Integration

The storage system integrates with both preprocessing and training workflows:

Sources: python/narrative_stack/notebooks/stage1_preprocessing.ipynb:65-68 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-522

Training Data Access Pattern

During training, the IterableConceptValueDataset streams triplets from storage:

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb502-512

The dataset communicates with the simd-r-drive WebSocket server to stream triplets, avoiding memory exhaustion on large datasets. The collate_with_scaler function combines triplets into batches while preserving scaler metadata for potential validation or analysis.

Integration Testing

The storage integration is validated through automated integration tests:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

The integration test script:

  1. Starts isolated Docker containers for MySQL and simd-r-drive using docker compose with project name us_gaap_it
  2. Waits for MySQL readiness using mysqladmin ping
  3. Creates the us_gaap_test database and loads the schema from tests/integration/assets/us_gaap_schema_2025.sql
  4. Runs pytest integration tests against the live services
  5. Tears down containers and volumes in cleanup trap

This ensures that the storage layer functions correctly across both backends before code is merged.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-38



Development Guide

Relevant source files

Purpose and Scope

This guide provides an overview of development practices, code organization, and workflows for contributing to the rust-sec-fetcher project. It covers environment setup, code organization principles, development workflows, and common development tasks.

For detailed information about specific development topics, see:

Development Environment Setup

Prerequisites

The project requires the following tools installed:

| Tool | Purpose | Version Requirement |
| --- | --- | --- |
| Rust | Core application development | 1.87+ |
| Python | ML pipeline and preprocessing | 3.8+ |
| Docker | Integration testing and services | Latest stable |
| Git LFS | Large file support for test assets | Latest stable |
| MySQL | Database for US GAAP storage | 5.7+ or 8.0+ |

Rust Development Setup

  1. Clone the repository and navigate to the root directory

  2. Build the Rust application with cargo build (add --release for an optimized build)

  3. Run cargo test to verify the setup

The Rust workspace is configured in Cargo.toml with all necessary dependencies declared. Key development dependencies include:

  • mockito for HTTP mocking in tests
  • tempfile for temporary file/directory creation in tests
  • tokio test macros for async test support

Python Development Setup

  1. Create a virtual environment, e.g. uv venv --python=3.12, and activate it with source .venv/bin/activate

  2. Install dependencies using uv: uv pip install -e . --group dev

  3. Verify the installation by running the integration tests (requires Docker): ./us_gaap_store_integration_test.sh from python/narrative_stack/

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Configuration Setup

The application requires a configuration file at ~/.config/sec-fetcher/config.toml or a custom path specified via command-line argument. At minimum, the configuration must set the email key, since SEC API requests require a valid contact email in the User-Agent header.

For non-interactive testing, use AppConfig directly in test code as shown in tests/config_manager_tests.rs:36-57

Sources: tests/config_manager_tests.rs:36-57 tests/sec_client_tests.rs:8-20

Code Organization and Architecture

Repository Structure

Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159 python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Module Dependency Flow

The dependency flow follows a layered architecture:

  1. Configuration Layer : ConfigManager loads settings from TOML files and credentials from keyring
  2. Network Layer : SecClient wraps HTTP client with caching and throttling middleware
  3. Data Fetching Layer : Network module functions fetch raw data from SEC APIs
  4. Transformation Layer : Transformers normalize raw data into standardized concepts
  5. Model Layer : Data structures represent domain entities

Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95

Development Workflow

Standard Development Cycle

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Running Tests Locally

Rust Unit Tests

Run all Rust tests with cargo test. Run a specific test module with cargo test --test sec_client_tests, and show test output with cargo test -- --nocapture.

Test Structure Mapping:

| Test File | Tests Component | Key Test Functions |
| --- | --- | --- |
| tests/config_manager_tests.rs | ConfigManager | test_load_custom_config, test_load_non_existent_config, test_fails_on_invalid_key |
| tests/sec_client_tests.rs | SecClient | test_user_agent, test_fetch_json_without_retry_success, test_fetch_json_with_retry_failure |

Sources: tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159

Python Integration Tests

Integration tests require Docker services. Run them via the provided shell script, ./us_gaap_store_integration_test.sh, from the python/narrative_stack/ directory.

This script performs the following steps as defined in python/narrative_stack/us_gaap_store_integration_test.sh:1-39:

  1. Activates Python virtual environment
  2. Installs dependencies with uv pip install -e . --group dev
  3. Starts Docker Compose services (db_test, simd_r_drive_ws_server_test)
  4. Waits for MySQL availability
  5. Creates us_gaap_test database
  6. Loads schema from tests/integration/assets/us_gaap_schema_2025.sql
  7. Runs pytest integration tests
  8. Tears down containers on exit

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Writing Tests

Unit Test Pattern (Rust)

The codebase follows standard Rust testing patterns with mockito for HTTP mocking.

Key patterns demonstrated in tests/sec_client_tests.rs:35-62:

  • Use #[tokio::test] for async tests
  • Create mockito::Server for HTTP endpoint mocking
  • Construct AppConfig programmatically for test isolation
  • Use ConfigManager::from_app_config() to bypass file system dependencies
  • Assert on specific JSON fields in responses

Sources: tests/sec_client_tests.rs:35-62

Test Fixture Pattern

The codebase uses temporary directories for file-based tests.

This pattern ensures test isolation and automatic cleanup as shown in tests/config_manager_tests.rs:8-17

Sources: tests/config_manager_tests.rs:8-17

Error Case Testing

Test error conditions explicitly. For example, the test at tests/sec_client_tests.rs:93-120 verifies retry behavior by expecting exactly 3 HTTP requests (initial + 2 retries) before failing.

Sources: tests/sec_client_tests.rs:93-120

Common Development Tasks

Adding a New SEC Data Endpoint

To add support for fetching a new SEC data endpoint:

  1. Add URL enum variant in src/models/url.rs
  2. Create fetch function in src/network/ following the pattern of existing functions
  3. Define data models in src/models/ for the response structure
  4. Add transformation logic in src/transformers/ if normalization is needed
  5. Write unit tests in tests/ using mockito::Server for mocking
  6. Update main.rs to integrate the new endpoint into the processing pipeline

New fetch functions should follow the signature pattern of the existing fetch functions in src/network/.

Adding a New FundamentalConcept Mapping

The distill_us_gaap_fundamental_concepts function maps raw SEC concept names to the FundamentalConcept enum. To add a new concept:

  1. Add enum variant to FundamentalConcept in src/models/fundamental_concept.rs
  2. Update the match arms in src/transformers/distill_us_gaap_fundamental_concepts.rs
  3. Add test case to verify the mapping in tests/distill_tests.rs

See the existing mapping patterns in the transformer module for hierarchical mappings (concepts that map to multiple parent categories).

Modifying HTTP Client Behavior

The SecClient is configured in src/network/sec_client.rs:21-89 Key configuration points:

| Configuration | Location | Purpose |
| --- | --- | --- |
| CachePolicy | src/network/sec_client.rs:45-50 | Controls cache TTL and behavior |
| ThrottlePolicy | src/network/sec_client.rs:53-59 | Controls rate limiting and retries |
| User-Agent | src/network/sec_client.rs:91-108 | Constructs SEC-compliant User-Agent header |

To modify throttling behavior, adjust the ThrottlePolicy parameters:

  • base_delay_ms: Minimum delay between requests
  • max_concurrent: Maximum concurrent requests
  • max_retries: Number of retry attempts on failure
  • adaptive_jitter_ms: Random jitter to prevent thundering herd

Sources: src/network/sec_client.rs:21-89

Working with Caches

The system uses two cache types managed by the Caches module:

  1. HTTP Cache : Stores raw HTTP responses with configurable TTL (default: 1 week)
  2. Preprocessor Cache : Stores transformed/preprocessed data

Cache instances are accessed via Caches::get_http_cache_store() as shown in src/network/sec_client.rs73

During development, you may need to clear caches when testing data transformations. Cache data is persisted via the simd-r-drive backend.

Sources: src/network/sec_client.rs73

Code Quality Standards

TODO Comments and Technical Debt

The codebase uses TODO comments to mark areas for improvement; examples can be found in src/network/sec_client.rs.

When adding TODO comments:

  1. Be specific about what needs to be done
  2. Include context about why it's not done now
  3. Reference related issues if applicable

Panic vs Result

The codebase follows Rust best practices:

  • Use Result<T, E> for recoverable errors
  • Use panic! only for non-recoverable errors or programming errors

For example, the User-Agent construction at src/network/sec_client.rs:95-98 panics when the configured email is invalid, because an invalid email makes all SEC API calls fail; this represents a configuration error rather than a runtime error.

Sources: src/network/sec_client.rs:95-98

Error Validation in Tests

Configuration validation is tested by verifying that error messages contain expected content, as shown in tests/config_manager_tests.rs:68-94.

This pattern ensures configuration errors are informative to users.

Sources: tests/config_manager_tests.rs:68-94

Integration Test Architecture

The integration test script from python/narrative_stack/us_gaap_store_integration_test.sh:1-39 orchestrates:

  1. Python environment setup with dependencies
  2. Docker Compose service startup (isolated project name: us_gaap_it)
  3. MySQL container health check via mysqladmin ping
  4. Database creation and schema loading
  5. pytest execution with verbose output
  6. Automatic cleanup via EXIT trap

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Best Practices Summary

| Practice | Implementation | Reference |
| --- | --- | --- |
| Test isolation | Use temporary directories and AppConfig::default() | tests/config_manager_tests.rs:9-17 |
| HTTP mocking | Use mockito::Server for endpoint simulation | tests/sec_client_tests.rs:37-45 |
| Async testing | Use #[tokio::test] attribute | tests/sec_client_tests.rs:35 |
| Error handling | Prefer Result<T, E> over panic | src/network/sec_client.rs:140-165 |
| Configuration | Use ConfigManager::from_app_config() in tests | tests/sec_client_tests.rs:10 |
| Integration testing | Use Docker Compose with isolated project names | python/narrative_stack/us_gaap_store_integration_test.sh:8 |
| Cleanup | Use trap handlers for guaranteed cleanup | python/narrative_stack/us_gaap_store_integration_test.sh:14-19 |

Sources: tests/config_manager_tests.rs:9-17 tests/sec_client_tests.rs:35-62 src/network/sec_client.rs:140-165 python/narrative_stack/us_gaap_store_integration_test.sh:1-39



Testing Strategy

Relevant source files

This page documents the testing approach for the rust-sec-fetcher codebase, covering both Rust unit tests and Python integration tests. The testing strategy emphasizes isolated unit testing with mocking for Rust components and end-to-end integration testing for the Python ML pipeline. For information about the CI/CD automation of these tests, see CI/CD Pipeline.


Test Architecture Overview

The codebase employs a dual-layer testing strategy that mirrors its dual-language architecture. Rust components use unit tests with HTTP mocking, while Python components use containerized integration tests with real database instances.

graph TB
    subgraph "Rust Unit Tests"
        SecClientTest["sec_client_tests.rs\nTest SecClient HTTP layer"]
DistillTest["distill_us_gaap_fundamental_concepts_tests.rs\nTest concept transformation"]
ConfigTest["config_manager_tests.rs\nTest configuration loading"]
end
    
    subgraph "Test Infrastructure"
        MockitoServer["mockito::Server\nHTTP mock server"]
TempDir["tempfile::TempDir\nTemporary config files"]
end
    
    subgraph "Python Integration Tests"
        IntegrationScript["us_gaap_store_integration_test.sh\nTest orchestration"]
PytestRunner["pytest\nTest execution"]
end
    
    subgraph "Docker Test Environment"
        MySQLContainer["us_gaap_test_db\nMySQL 8.0 container"]
SimdRDriveContainer["simd-r-drive-ws-server\nWebSocket server"]
SQLSchema["us_gaap_schema_2025.sql\nDatabase schema fixture"]
end
    
 
   SecClientTest --> MockitoServer
 
   ConfigTest --> TempDir
    
 
   IntegrationScript --> MySQLContainer
 
   IntegrationScript --> SimdRDriveContainer
 
   IntegrationScript --> SQLSchema
 
   IntegrationScript --> PytestRunner
    
 
   PytestRunner --> MySQLContainer
 
   PytestRunner --> SimdRDriveContainer

Test Component Relationships

Sources: tests/sec_client_tests.rs:1-159 tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 tests/config_manager_tests.rs:1-95 python/narrative_stack/us_gaap_store_integration_test.sh:1-39


Rust Unit Testing Strategy

The Rust test suite focuses on three critical areas: HTTP client behavior, data transformation correctness, and configuration management. Tests are isolated using mocking frameworks and temporary file systems.

SecClient HTTP Testing

The SecClient tests verify HTTP operations, retry logic, and authentication header generation using the mockito library for HTTP mocking.

| Test Function | Purpose | Mock Behavior |
| --- | --- | --- |
| test_user_agent | Validates User-Agent header format | N/A (direct method call) |
| test_invalid_email_panic | Ensures invalid emails cause panic | N/A (panic verification) |
| test_fetch_json_without_retry_success | Tests successful JSON fetch | Returns 200 with valid JSON |
| test_fetch_json_with_retry_success | Tests successful fetch with retry available | Returns 200 immediately |
| test_fetch_json_with_retry_failure | Verifies retry exhaustion | Returns 500 three times |
| test_fetch_json_with_retry_backoff | Tests retry with eventual success | Returns 500 once, then 200 |

User Agent Validation Test Pattern

Sources: tests/sec_client_tests.rs:6-21

The test creates a minimal AppConfig with only the email field set (tests/sec_client_tests.rs:8-10), constructs a ConfigManager and SecClient (tests/sec_client_tests.rs:10-12), then verifies the User-Agent format matches the expected pattern, including the package version from CARGO_PKG_VERSION (tests/sec_client_tests.rs:14-20).

HTTP Retry and Backoff Testing

The retry logic tests use mockito::Server to simulate various HTTP response scenarios:

Sources: tests/sec_client_tests.rs:64-158

graph TB
    SetupMock["Setup mockito::Server"]
ConfigureResponse["Configure mock response\nStatus, Body, Expect count"]
CreateClient["Create SecClient\nwith max_retries config"]
FetchJSON["Call client.fetch_json()"]
VerifyResult["Verify result:\nSuccess or Error"]
VerifyCallCount["Verify mock was called\nexpected number of times"]
SetupMock --> ConfigureResponse
 
   ConfigureResponse --> CreateClient
 
   CreateClient --> FetchJSON
 
   FetchJSON --> VerifyResult
 
   FetchJSON --> VerifyCallCount

The retry failure test configures a mock to return HTTP 500 status and expects exactly 3 calls (initial + 2 retries) (tests/sec_client_tests.rs:96-103), setting max_retries = 2 in the config (tests/sec_client_tests.rs:106). The test verifies the result is an error after exhausting retries (tests/sec_client_tests.rs:111-119).

The backoff test simulates transient failures by creating two mock endpoints: one that returns 500 once, and another that returns 200 (tests/sec_client_tests.rs:126-140). This validates that the retry mechanism successfully recovers from temporary failures.

Sources: tests/sec_client_tests.rs:94-158

US GAAP Concept Transformation Testing

The distill_us_gaap_fundamental_concepts_tests.rs file contains 1,275 lines of comprehensive tests covering all 64 FundamentalConcept variants. These tests verify the complex mapping logic that normalizes diverse US GAAP terminology into a standardized taxonomy.

Test Coverage by Concept Category

CategoryTest FunctionsConcept Variants TestedLines
Assets & Liabilities695-148
Income Statement152520-464
Cash Flow1215550-683
Equity713150-286
Revenue Variations2 (with 57+ assertions)45+954-1188

Hierarchical Mapping Test Pattern

The tests verify that specific concepts map to both their precise category and parent categories:

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:131-139

graph TB
    InputConcept["Input: 'AssetsCurrent'"]
CallDistill["Call distill_us_gaap_fundamental_concepts()"]
ExpectVector["Expect Vec with 2 elements"]
CheckSpecific["Assert contains:\nFundamentalConcept::CurrentAssets"]
CheckParent["Assert contains:\nFundamentalConcept::Assets"]
InputConcept --> CallDistill
 
   CallDistill --> ExpectVector
 
   ExpectVector --> CheckSpecific
 
   ExpectVector --> CheckParent

The test for AssetsCurrent verifies it maps to both CurrentAssets (specific) and Assets (parent category) (tests/distill_us_gaap_fundamental_concepts_tests.rs:132-138). This hierarchical pattern appears throughout the test suite for concepts that have parent-child relationships.

Synonym Consolidation Testing

Multiple test functions verify that synonym variations of the same concept map to a single canonical form:

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:80-112

graph LR
    CostOfRevenue["CostOfRevenue"]
CostOfGoodsAndServicesSold["CostOfGoodsAndServicesSold"]
CostOfServices["CostOfServices"]
CostOfGoodsSold["CostOfGoodsSold"]
CostOfGoodsSoldExcluding["CostOfGoodsSold...\nExcludingDepreciation"]
CostOfGoodsSoldElectric["CostOfGoodsSoldElectric"]
Canonical["FundamentalConcept::\nCostOfRevenue"]
CostOfRevenue --> Canonical
 
   CostOfGoodsAndServicesSold --> Canonical
 
   CostOfServices --> Canonical
 
   CostOfGoodsSold --> Canonical
 
   CostOfGoodsSoldExcluding --> Canonical
 
   CostOfGoodsSoldElectric --> Canonical

The test_cost_of_revenue function contains 6 assertions verifying that different US GAAP terminology variants all map to FundamentalConcept::CostOfRevenue (tests/distill_us_gaap_fundamental_concepts_tests.rs:81-111). This pattern ensures consistent normalization across different reporting styles.

Industry-Specific Revenue Testing

The revenue tests demonstrate the most complex mapping scenario, where 57+ industry-specific revenue concepts all normalize to FundamentalConcept::Revenues:

Sources: tests/distill_us_gaap_fundamental_concepts_tests.rs:954-1188

graph TB
    subgraph "Standard Revenue Terms"
        Revenues["Revenues"]
SalesRevenueNet["SalesRevenueNet"]
SalesRevenueServicesNet["SalesRevenueServicesNet"]
end
    
    subgraph "Industry-Specific Terms"
        HealthCare["HealthCareOrganizationRevenue"]
RealEstate["RealEstateRevenueNet"]
OilGas["OilAndGasRevenue"]
Financial["FinancialServicesRevenue"]
Advertising["AdvertisingRevenue"]
Subscription["SubscriptionRevenue"]
Mining["RevenueMineralSales"]
end
    
    subgraph "Specialized Terms"
        Franchisor["FranchisorRevenue"]
Admissions["AdmissionsRevenue"]
Licenses["LicensesRevenue"]
Royalty["RoyaltyRevenue"]
Clearing["ClearingFeesRevenue"]
Passenger["PassengerRevenue"]
end
    
    Canonical["FundamentalConcept::Revenues"]
Revenues --> Canonical
 
   SalesRevenueNet --> Canonical
 
   SalesRevenueServicesNet --> Canonical
 
   HealthCare --> Canonical
 
   RealEstate --> Canonical
 
   OilGas --> Canonical
 
   Financial --> Canonical
 
   Advertising --> Canonical
 
   Subscription --> Canonical
 
   Mining --> Canonical
 
   Franchisor --> Canonical
 
   Admissions --> Canonical
 
   Licenses --> Canonical
 
   Royalty --> Canonical
 
   Clearing --> Canonical
 
   Passenger --> Canonical

The test_revenues function contains 47 distinct assertions, each verifying that a different industry-specific revenue term maps to the canonical FundamentalConcept::Revenues (tests/distill_us_gaap_fundamental_concepts_tests.rs:955-1188). Some terms also map to more specific sub-categories, such as InterestAndDividendIncomeOperating mapping to both InterestAndDividendIncomeOperating and Revenues (tests/distill_us_gaap_fundamental_concepts_tests.rs:989-994).

graph TB
    CreateTempDir["tempfile::tempdir()"]
CreatePath["dir.path().join('config.toml')"]
WriteContents["Write TOML contents\nto file"]
LoadConfig["ConfigManager::from_config(path)"]
AssertValues["Assert config values\nmatch expectations"]
DropTempDir["Drop TempDir\n(automatic cleanup)"]
CreateTempDir --> CreatePath
 
   CreatePath --> WriteContents
 
   WriteContents --> LoadConfig
 
   LoadConfig --> AssertValues
 
   AssertValues --> DropTempDir

Configuration Manager Testing

The config_manager_tests.rs file tests configuration loading, validation, and error handling using temporary files created with the tempfile crate.

Temporary Config File Test Pattern

Sources: tests/config_manager_tests.rs:8-57

The helper function create_temp_config creates a temporary directory and writes config contents to a config.toml file (tests/config_manager_tests.rs:9-17). The function returns both the TempDir (to prevent premature cleanup) and the PathBuf (tests/config_manager_tests.rs:16). Tests explicitly drop the TempDir at the end to ensure cleanup (tests/config_manager_tests.rs:56).

Invalid Configuration Key Detection

The test suite verifies that unknown configuration keys are rejected with helpful error messages:

| Test | Config Content | Expected Behavior |
| --- | --- | --- |
| test_load_custom_config | Valid keys only | Success, values loaded |
| test_load_non_existent_config | N/A (file doesn't exist) | Error returned |
| test_fails_on_invalid_key | Contains invalid_key | Error with valid keys listed |

Sources: tests/config_manager_tests.rs:68-94

The invalid key test verifies that the error message contains documentation of valid configuration keys, including their types tests/config_manager_tests.rs:88-93:

  • email (String | Null)
  • max_concurrent (Integer | Null)
  • max_retries (Integer | Null)
  • min_delay_ms (Integer | Null)

Python Integration Testing Strategy

The Python testing strategy uses Docker Compose to orchestrate a complete test environment with MySQL database and WebSocket services, enabling end-to-end validation of the data pipeline.

graph TB
    ScriptStart["us_gaap_store_integration_test.sh"]
SetupEnv["Set environment variables\nPROJECT_NAME=us_gaap_it\nCOMPOSE=docker compose -p ..."]
ActivateVenv["source .venv/bin/activate"]
InstallDeps["uv pip install -e . --group dev"]
RegisterTrap["trap 'cleanup' EXIT"]
StartContainers["docker compose up -d\n--profile test"]
WaitMySQL["Wait for MySQL ping\nmysqladmin ping loop"]
CreateDB["CREATE DATABASE us_gaap_test"]
LoadSchema["mysql < us_gaap_schema_2025.sql"]
RunPytest["pytest -s -v\ntest_us_gaap_store.py"]
Cleanup["Cleanup trap:\ndocker compose down\n--volumes --remove-orphans"]
ScriptStart --> SetupEnv
 
   SetupEnv --> ActivateVenv
 
   ActivateVenv --> InstallDeps
 
   InstallDeps --> RegisterTrap
 
   RegisterTrap --> StartContainers
 
   StartContainers --> WaitMySQL
 
   WaitMySQL --> CreateDB
 
   CreateDB --> LoadSchema
 
   LoadSchema --> RunPytest
 
   RunPytest --> Cleanup

Integration Test Orchestration Flow

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Docker Compose Test Profile

The script uses an isolated Docker Compose project name to prevent conflicts with development containers python/narrative_stack/us_gaap_store_integration_test.sh:8-9:

  • PROJECT_NAME="us_gaap_it"
  • COMPOSE="docker compose -p $PROJECT_NAME --profile test"

The --profile test flag ensures only test-specific services are started (python/narrative_stack/us_gaap_store_integration_test.sh:23), which include:

  • db_test (MySQL container named us_gaap_test_db)
  • simd_r_drive_ws_server_test (WebSocket server)

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-23

MySQL Test Database Setup

The integration test script performs a multi-step database initialization:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35

The script waits for MySQL to be ready using a ping loop (python/narrative_stack/us_gaap_store_integration_test.sh:25-28), then creates the test database (python/narrative_stack/us_gaap_store_integration_test.sh:30-31) and loads the schema from a fixture file (python/narrative_stack/us_gaap_store_integration_test.sh:33-35).

Test Cleanup and Isolation

The script registers a cleanup trap to ensure test containers are always removed (python/narrative_stack/us_gaap_store_integration_test.sh:14-19).

This trap executes on both successful completion and script failure, ensuring:

  • All containers in the us_gaap_it project are stopped and removed
  • All volumes are deleted (--volumes)
  • Orphaned containers from previous runs are cleaned up (--remove-orphans)

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19

Pytest Execution Environment

The test environment is configured with specific paths and options python/narrative_stack/us_gaap_store_integration_test.sh:37-38:

| Configuration | Value | Purpose |
| --- | --- | --- |
| PYTHONPATH | src | Ensure module imports work |
| pytest flags | -s -v | Show output, verbose mode |
| Test path | tests/integration/test_us_gaap_store.py | Integration test module |

The -s flag disables output capturing, allowing print statements and logs to display immediately during test execution, which is useful for debugging long-running integration tests. The -v flag enables verbose mode with detailed test function names.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37-38


Test Fixtures and Mocking Patterns

The codebase uses different mocking strategies appropriate to each language and component.

graph TB
    CreateServer["mockito::Server::new_async()"]
ConfigureMock["server.mock(method, path)\n.with_status(code)\n.with_body(json)\n.expect(count)"]
CreateAsyncMock["create_async().await"]
GetServerUrl["server.url()"]
MakeRequest["client.fetch_json(url)"]
VerifyMock["Mock automatically verifies\nexpected call count"]
CreateServer --> ConfigureMock
 
   ConfigureMock --> CreateAsyncMock
 
   CreateAsyncMock --> GetServerUrl
 
   GetServerUrl --> MakeRequest
 
   MakeRequest --> VerifyMock

Rust Mocking with mockito

The mockito library provides HTTP server mocking for testing the SecClient:

Sources: tests/sec_client_tests.rs:36-62

Mock configuration example from test_fetch_json_without_retry_success tests/sec_client_tests.rs:39-45:

  • Method: GET
  • Path: /files/company_tickers.json
  • Status: 200
  • Header: Content-Type: application/json
  • Body: JSON string with sample ticker data
  • Call using server.url() to get the mock endpoint
graph LR
    CreateTempDir["tempfile::tempdir()"]
GetPath["dir.path().join('config.toml')"]
CreateFile["fs::File::create(path)"]
WriteContents["writeln!(file, contents)"]
ReturnBoth["Return (TempDir, PathBuf)"]
CreateTempDir --> GetPath
 
   GetPath --> CreateFile
 
   CreateFile --> WriteContents
 
   WriteContents --> ReturnBoth

Temporary File Fixtures

Configuration tests use the tempfile crate to create isolated test environments:

Sources: tests/config_manager_tests.rs:8-17

The create_temp_config helper returns both the TempDir and PathBuf to prevent premature cleanup (tests/config_manager_tests.rs:16). The TempDir must remain in scope until after the test completes, as dropping it deletes the temporary directory.

SQL Schema Fixtures

The Python integration tests use a versioned SQL schema file as a fixture:

| File | Purpose | Usage |
| --- | --- | --- |
| tests/integration/assets/us_gaap_schema_2025.sql | Define us_gaap_test database structure | Loaded via docker exec (python/narrative_stack/us_gaap_store_integration_test.sh:33-35) |

This approach ensures tests run against a known database structure that matches production schema, enabling reliable testing of database operations and query logic.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:33-35


Test Execution Commands

The following table summarizes how to run different test suites:

| Test Suite | Command | Working Directory | Prerequisites |
| --- | --- | --- | --- |
| All Rust unit tests | cargo test | Repository root | Rust toolchain |
| Specific Rust test file | cargo test --test sec_client_tests | Repository root | Rust toolchain |
| Single Rust test function | cargo test test_user_agent | Repository root | Rust toolchain |
| Python integration tests | ./us_gaap_store_integration_test.sh | python/narrative_stack/ | Docker, Git LFS, uv |
| Python integration with pytest | pytest -s -v tests/integration/ | python/narrative_stack/ | Docker, containers running |

Prerequisites for Integration Tests

The Python integration test script requires python/narrative_stack/us_gaap_store_integration_test.sh:1-2:

  • Git LFS enabled and updated (for large test fixtures)
  • Docker with Docker Compose v2
  • Python virtual environment with uv package manager
  • Test data files committed with Git LFS

Sources: tests/sec_client_tests.rs:1-159 tests/distill_us_gaap_fundamental_concepts_tests.rs:1-1275 tests/config_manager_tests.rs:1-95 python/narrative_stack/us_gaap_store_integration_test.sh:1-39



CI/CD Pipeline

Relevant source files

Purpose and Scope

This document explains the continuous integration and continuous deployment (CI/CD) infrastructure for the rust-sec-fetcher repository. It covers the GitHub Actions workflow configuration, integration test automation, Docker container orchestration for testing, and the complete test execution flow.

The CI/CD pipeline focuses specifically on the Python narrative_stack system's integration testing. For general testing strategies including Rust unit tests and Python test fixtures, see Testing Strategy. For production Docker deployment of the simd-r-drive WebSocket server, see Docker Deployment.


GitHub Actions Workflow Overview

The repository implements a single GitHub Actions workflow named US GAAP Store Integration Test that validates the Python machine learning pipeline's integration with external dependencies.

Workflow Trigger Conditions

The workflow is configured to run automatically under specific conditions defined in .github/workflows/us-gaap-store-integration-test.yml:3-11:

Sources: .github/workflows/us-gaap-store-integration-test.yml:3-11

The workflow only executes when changes affect:

  • Any file under python/narrative_stack/ directory
  • The workflow definition file itself (.github/workflows/us_gaap_store_integration_test.yml)

This targeted triggering prevents unnecessary CI runs when changes are made to the Rust sec-fetcher components or unrelated Python modules.


Integration Test Job Structure

The workflow defines a single job named integration-test that runs on ubuntu-latest runners. The job executes a sequence of setup and validation steps.

Sources: .github/workflows/us-gaap-store-integration-test.yml:12-51

graph TB
    Start["Job: integration-test\nruns-on: ubuntu-latest"]
Checkout["Step 1: Checkout repo\nactions/checkout@v4\nwith lfs: true"]
SetupPython["Step 2: Set up Python\nactions/setup-python@v5\npython-version: 3.12"]
InstallUV["Step 3: Install uv\ncurl astral.sh/uv/install.sh\nAdd to PATH"]
InstallDeps["Step 4: Install Python dependencies\nuv venv --python=3.12\nuv pip install -e . --group dev"]
Ruff["Step 5: Check code style with Ruff\nruff check ."]
Chmod["Step 6: Make test script executable\nchmod +x us_gaap_store_integration_test.sh"]
RunTest["Step 7: Run integration test\n./us_gaap_store_integration_test.sh"]
Start --> Checkout
 
   Checkout --> SetupPython
 
   SetupPython --> InstallUV
 
   InstallUV --> InstallDeps
 
   InstallDeps --> Ruff
 
   Ruff --> Chmod
 
   Chmod --> RunTest
    
    style Start fill:#f9f9f9
    style RunTest fill:#f9f9f9

Key Workflow Steps

| Step | Action | Purpose |
| --- | --- | --- |
| Checkout repo | actions/checkout@v4 with lfs: true | Clone repository with Git LFS support for large test data files |
| Set up Python | actions/setup-python@v5 | Install Python 3.12 runtime |
| Install uv | Shell script execution | Install uv package manager from astral.sh |
| Install dependencies | uv pip install -e . --group dev | Install narrative_stack package in editable mode with dev dependencies |
| Check code style | ruff check . | Validate Python code formatting and linting rules |
| Make executable | chmod +x | Grant execution permissions to test script |
| Run integration test | Execute bash script | Orchestrate Docker containers and run pytest tests |

Sources: .github/workflows/us-gaap-store-integration-test.yml:17-50


Integration Test Architecture

The integration test orchestrates multiple Docker containers to create an isolated test environment that validates the narrative_stack system's ability to ingest, process, and store US GAAP financial data.

graph TB
    subgraph "Docker Compose Project: us_gaap_it"
        MySQL["Container: us_gaap_test_db\nImage: mysql\nPort: 3306\nUser: root\nPassword: onlylocal\nDatabase: us_gaap_test"]
SimdRDrive["Container: simd_r_drive_ws_server_test\nBuilt from: Dockerfile.simd-r-drive-ci-server\nPort: 8080\nBinary: simd-r-drive-ws-server@0.10.0-alpha"]
TestRunner["Test Runner\npytest process\ntests/integration/test_us_gaap_store.py"]
end
    
    Schema["SQL Schema\ntests/integration/assets/us_gaap_schema_2025.sql"]
TestRunner -->|SQL queries| MySQL
 
   TestRunner -->|WebSocket connection| SimdRDrive
    
 
   Schema -->|Loaded via mysql CLI| MySQL
    
    style MySQL fill:#f9f9f9
    style SimdRDrive fill:#f9f9f9
    style TestRunner fill:#f9f9f9

Container Architecture

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39 python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34

The test environment consists of:

  1. MySQL Database Container (us_gaap_test_db): Provides persistent storage for US GAAP financial data with pre-loaded schema
  2. simd-r-drive WebSocket Server (simd_r_drive_ws_server_test): Handles embedding matrix storage and data synchronization
  3. pytest Test Runner : Executes integration tests against both containers

Test Execution Flow

The integration test script python/narrative_stack/us_gaap_store_integration_test.sh:1-39 orchestrates the complete test lifecycle from container startup through teardown.

graph TB
    Start["Start: us_gaap_store_integration_test.sh"]
SetVars["Set variables\nPROJECT_NAME=us_gaap_it\nCOMPOSE=docker compose -p us_gaap_it"]
ActivateVenv["Activate virtual environment\nsource .venv/bin/activate"]
InstallDeps["Install dependencies\nuv pip install -e . --group dev"]
RegisterTrap["Register cleanup trap\ntrap 'cleanup' EXIT"]
DockerUp["Start Docker containers\ndocker compose up -d\nProfile: test"]
WaitMySQL["Wait for MySQL ready\nmysqladmin ping loop"]
CreateDB["Create database\nCREATE DATABASE us_gaap_test"]
LoadSchema["Load schema\nmysql < us_gaap_schema_2025.sql"]
SetPythonPath["Set PYTHONPATH=src"]
RunPytest["Execute pytest\npytest -s -v tests/integration/test_us_gaap_store.py"]
Cleanup["Cleanup function\ndocker compose down --volumes"]
Start --> SetVars
 
   SetVars --> ActivateVenv
 
   ActivateVenv --> InstallDeps
 
   InstallDeps --> RegisterTrap
 
   RegisterTrap --> DockerUp
 
   DockerUp --> WaitMySQL
 
   WaitMySQL --> CreateDB
 
   CreateDB --> LoadSchema
 
   LoadSchema --> SetPythonPath
 
   SetPythonPath --> RunPytest
 
   RunPytest --> Cleanup
    
    style Start fill:#f9f9f9
    style RunPytest fill:#f9f9f9
    style Cleanup fill:#f9f9f9

Test Script Flow Diagram

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Cleanup Mechanism

The script implements a robust cleanup mechanism using bash trap handlers python/narrative_stack/us_gaap_store_integration_test.sh:14-19:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19

The cleanup function ensures that all Docker resources are removed regardless of test outcome, preventing resource leaks and port conflicts in subsequent test runs.


Docker Container Configuration

simd-r-drive-ws-server Container Build

The Dockerfile python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34 creates a single-stage image optimized for CI environments.

Build Configuration:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Base Image | rust:1.87-slim | Minimal Rust runtime environment |
| Binary Installation | cargo install --locked simd-r-drive-ws-server@0.10.0-alpha | Install specific locked version of server |
| Default Server Args | data.bin --host 127.0.0.1 --port 8080 | Configure server binding and data file |
| Exposed Port | 8080 | WebSocket server port |
| Entrypoint | Shell wrapper with $SERVER_ARGS interpolation | Allow runtime argument override |

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-33

graph LR
    BuildTime["Build Time\n--build-arg SERVER_ARGS=..."]
BakeArgs["Bake into ENV variable"]
Runtime["Run Time\n-e SERVER_ARGS=... OR\ndocker run ... extra flags"]
Entrypoint["ENTRYPOINT interpolates\n$SERVER_ARGS + $@"]
ServerExec["Execute:\nsimd-r-drive-ws-server\nwith combined args"]
BuildTime --> BakeArgs
 
   BakeArgs --> Runtime
 
   Runtime --> Entrypoint
 
   Entrypoint --> ServerExec
    
    style Entrypoint fill:#f9f9f9

The Dockerfile uses build arguments to support configuration at both build time and run time, as illustrated in the diagram above.

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-33

MySQL Container Configuration

The MySQL container is configured through Docker Compose with the following specifications python/narrative_stack/us_gaap_store_integration_test.sh:23-35:

| Configuration | Value | Source |
|---|---|---|
| Container Name | us_gaap_test_db | Docker Compose service name |
| Root Password | onlylocal | Environment variable |
| Database Name | us_gaap_test | Created via CREATE DATABASE command |
| Schema File | tests/integration/assets/us_gaap_schema_2025.sql | Loaded via mysql CLI |
| Health Check | mysqladmin ping -h127.0.0.1 | Readiness probe |

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35


Environment Configuration and Dependencies

graph TB
    UVInstall["Install uv package manager\ncurl -LsSf astral.sh/uv/install.sh"]
CreateVenv["Create virtual environment\nuv venv --python=3.12"]
ActivateVenv["Activate environment\nsource .venv/bin/activate"]
InstallPackage["Install narrative_stack\nuv pip install -e . --group dev"]
DevGroup["Dev dependency group includes:\n- pytest\n- ruff\n- integration test utilities"]
UVInstall --> CreateVenv
 
   CreateVenv --> ActivateVenv
 
   ActivateVenv --> InstallPackage
 
   InstallPackage --> DevGroup
    
    style InstallPackage fill:#f9f9f9

Python Environment Setup

The CI pipeline uses uv as the package manager to create isolated Python environments and install dependencies:
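
Condensed from the diagram above, the setup roughly amounts to the following; exact flags may differ slightly from the workflow:

```bash
# Install the uv package manager.
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate an isolated Python 3.12 virtual environment.
uv venv --python=3.12
source .venv/bin/activate

# Install narrative_stack in editable mode with the dev dependency group.
uv pip install -e . --group dev
```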

Sources: .github/workflows/us-gaap-store-integration-test.yml:27-37 python/narrative_stack/us_gaap_store_integration_test.sh:11-12

Code Quality Validation

Before running integration tests, the workflow executes Ruff code quality checks .github/workflows/us-gaap-store-integration-test.yml:39-43:

| Check | Command | Purpose |
|---|---|---|
| Linting | ruff check . | Validate code style, detect anti-patterns, enforce naming conventions |

The Ruff check acts as a fast pre-test gate that fails the CI pipeline if code quality issues are detected, preventing broken code from reaching the integration test phase.


Docker Compose Project Isolation

The integration test uses Docker Compose project isolation to prevent conflicts with other running containers python/narrative_stack/us_gaap_store_integration_test.sh:7-9:
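
The relevant variables, following the values shown in the test-script flow diagram earlier (a sketch, not a verbatim excerpt):

```bash
# All compose resources are namespaced under this project name,
# so test containers, networks, and volumes never collide with
# other environments on the same host.
PROJECT_NAME="us_gaap_it"
COMPOSE="docker compose -p ${PROJECT_NAME}"
```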

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:7-9

The project name us_gaap_it creates an isolated namespace for all Docker resources (containers, networks, volumes) associated with the integration test, allowing multiple test runs or development environments to coexist without interference.


Test Execution and Reporting

pytest Integration Test Execution

The final step executes the integration test suite using pytest python/narrative_stack/us_gaap_store_integration_test.sh:37-38:

| Parameter | Value | Purpose |
|---|---|---|
| Test File | tests/integration/test_us_gaap_store.py | Integration test module |
| Verbosity | -v | Verbose output with test names |
| Output | -s | Show stdout/stderr (no capture) |
| Python Path | PYTHONPATH=src | Locate source modules |
| Working Directory | python/narrative_stack | Test execution context |

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37-38

The pytest command runs with stdout capture disabled (-s flag), allowing real-time visibility into test progress and enabling debugging output from the integration tests.
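
Reconstructed from the parameters above, the invocation is roughly:

```bash
cd python/narrative_stack
PYTHONPATH=src pytest -s -v tests/integration/test_us_gaap_store.py
```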


graph TB
    GitPush["Git Push/PR\nto python/narrative_stack/"]
GHAStart["GitHub Actions Trigger\nus-gaap-store-integration-test.yml"]
EnvSetup["Environment Setup:\n- Python 3.12\n- uv package manager\n- Dev dependencies"]
CodeQuality["Code Quality Check:\nruff check ."]
QualityFail["❌ Fail CI"]
QualityPass["✓ Pass"]
ScriptExec["Execute Test Script:\nus_gaap_store_integration_test.sh"]
DockerSetup["Docker Compose Setup:\n- MySQL (us_gaap_test_db)\n- simd-r-drive-ws-server"]
SchemaLoad["Load SQL Schema:\nus_gaap_schema_2025.sql"]
PytestRun["Run Integration Tests:\npytest tests/integration/"]
TestFail["❌ Fail CI"]
TestPass["✓ Pass"]
Cleanup["Cleanup:\ndocker compose down --volumes"]
CIComplete["✅ CI Complete"]
GitPush --> GHAStart
 
   GHAStart --> EnvSetup
 
   EnvSetup --> CodeQuality
    
 
   CodeQuality -->|Issues detected| QualityFail
 
   CodeQuality -->|Clean| QualityPass
    
 
   QualityPass --> ScriptExec
 
   ScriptExec --> DockerSetup
 
   DockerSetup --> SchemaLoad
 
   SchemaLoad --> PytestRun
    
 
   PytestRun -->|Tests fail| TestFail
 
   PytestRun -->|Tests pass| TestPass
    
 
   TestFail --> Cleanup
 
   TestPass --> Cleanup
    
 
   Cleanup --> CIComplete
    
    style GHAStart fill:#f9f9f9
    style CodeQuality fill:#f9f9f9
    style PytestRun fill:#f9f9f9
    style CIComplete fill:#f9f9f9

CI/CD Pipeline Flow Summary

The complete CI/CD pipeline integrates all components into a continuous validation workflow:

Sources: .github/workflows/us-gaap-store-integration-test.yml:1-51 python/narrative_stack/us_gaap_store_integration_test.sh:1-39


Key Design Decisions

Git LFS Requirement

The workflow explicitly enables Git LFS .github/workflows/us-gaap-store-integration-test.yml:19-20 to support large test data files. The test script includes a comment python/narrative_stack/us_gaap_store_integration_test.sh:1-2 noting this requirement for local execution.

Single-Stage Docker Build

The Dockerfile uses a single-stage build python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-15 rather than multi-stage, prioritizing simplicity for CI environments where image size is less critical than build reliability and transparency.

Profile-Based Container Selection

The Docker Compose command uses --profile test python/narrative_stack/us_gaap_store_integration_test.sh:9 to selectively start only test-related services, preventing unnecessary containers from consuming CI resources.

Working Directory Isolation

All workflow steps and script commands execute from the python/narrative_stack working directory .github/workflows/us-gaap-store-integration-test.yml:33, ensuring consistent path resolution and dependency isolation from other repository components.


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Docker Deployment

Relevant source files

Purpose and Scope

This document describes the Docker containerization strategy for the rust-sec-fetcher system, focusing on the simd-r-drive-ws-server deployment and the Docker Compose infrastructure used for integration testing. The Dockerfile builds a lightweight container that runs the WebSocket server, while Docker Compose orchestrates multi-container test environments with MySQL and the WebSocket server.

For information about the CI/CD workflows that use these Docker configurations, see CI/CD Pipeline. For details about the simd-r-drive WebSocket server functionality, see Database & Storage Integration.


Docker Image Architecture

The system uses a single-stage Dockerfile to build and deploy the simd-r-drive-ws-server binary in a minimal Rust container. This approach prioritizes simplicity and reproducibility for CI/CD environments.

graph TB
    BaseImage["rust:1.87-slim\nBase Image"]
CargoInstall["cargo install\nsimd-r-drive-ws-server@0.10.0-alpha"]
EnvSetup["Environment Setup\nSERVER_ARGS variable\ndefault: 'data.bin --host 127.0.0.1 --port 8080'"]
Expose["EXPOSE 8080\nWebSocket port"]
Entrypoint["ENTRYPOINT\n/bin/sh -c 'simd-r-drive-ws-server ${SERVER_ARGS}'"]
BaseImage --> CargoInstall
 
   CargoInstall --> EnvSetup
 
   EnvSetup --> Expose
 
   Expose --> Entrypoint

Dockerfile Structure

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34

Build Configuration

The Dockerfile uses build arguments and environment variables to provide flexible configuration:

| Configuration | Type | Default Value | Purpose |
|---|---|---|---|
| RUST_VERSION | Build ARG | 1.87 | Rust toolchain version for compilation |
| SERVER_ARGS | Build ARG / ENV | data.bin --host 127.0.0.1 --port 8080 | Server command-line arguments |

Build-time customization:
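
For example, baking alternate server arguments into the image at build time (the image tag is illustrative; run from the repository root):

```bash
docker build \
  -f python/narrative_stack/Dockerfile.simd-r-drive-ci-server \
  --build-arg SERVER_ARGS="data.bin --host 0.0.0.0 --port 8080" \
  -t simd-r-drive-ci-server \
  python/narrative_stack
```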

Runtime customization:
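
Or overriding the baked-in default when starting the container (same illustrative image tag):

```bash
docker run --rm -p 8080:8080 \
  -e SERVER_ARGS="data.bin --host 0.0.0.0 --port 8080" \
  simd-r-drive-ci-server
```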

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:9-33

Image Composition

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:10-26


Docker Compose Test Infrastructure

The integration testing environment uses Docker Compose to orchestrate multiple services with isolated networking and lifecycle management.

graph TB
    subgraph "Docker Compose Project: us_gaap_it"
        
        subgraph "db_test Service"
            MySQLContainer["Container: us_gaap_test_db\nImage: mysql\nPort: 3306\nRoot password: onlylocal"]
Database["Database: us_gaap_test\nSchema: us_gaap_schema_2025.sql"]
end
        
        subgraph "simd_r_drive_ws_server_test Service"
            WSServer["Container: simd-r-drive-ws-server\nPort: 8080\nData file: data.bin"]
end
        
        subgraph "Test Runner (Host)"
            TestScript["us_gaap_store_integration_test.sh\npytest integration tests"]
end
    end
    
 
   MySQLContainer --> Database
 
   TestScript -->|docker exec mysql commands| MySQLContainer
 
   TestScript -->|pytest via network| WSServer
 
   TestScript -->|pytest via network| Database

Service Architecture

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

sequenceDiagram
    participant Script as us_gaap_store_integration_test.sh
    participant Compose as docker compose
    participant MySQL as us_gaap_test_db
    participant WSServer as simd-r-drive-ws-server
    participant Pytest as pytest
    
    Script->>Script: Set PROJECT_NAME="us_gaap_it"
    Script->>Script: Register cleanup trap
    
    Script->>Compose: up -d (profile: test)
    Compose->>MySQL: Start container
    Compose->>WSServer: Start container
    
    loop Wait for MySQL
        Script->>MySQL: mysqladmin ping
        MySQL-->>Script: Connection status
    end
    
    Script->>MySQL: CREATE DATABASE us_gaap_test
    Script->>MySQL: Load us_gaap_schema_2025.sql
    
    Script->>Pytest: Run tests/integration/test_us_gaap_store.py
    Pytest->>MySQL: Query database
    Pytest->>WSServer: WebSocket operations
    Pytest-->>Script: Test results
    
    Script->>Compose: down --volumes --remove-orphans
    Compose->>MySQL: Stop and remove
    Compose->>WSServer: Stop and remove

Test Execution Flow

The integration test script implements a robust lifecycle with automatic cleanup:

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:7-39

Docker Compose Configuration Pattern

The test script uses the following Docker Compose invocation pattern:

| Component | Value | Purpose |
|---|---|---|
| Project Name | us_gaap_it | Isolates test containers from other Docker resources |
| Profile | test | Selects services tagged with profiles: [test] |
| Cleanup Strategy | --volumes --remove-orphans | Ensures complete teardown of test infrastructure |

Key commands:
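
Approximately, using the values from the configuration table above:

```bash
# Start the test-profile services in the isolated project namespace.
docker compose -p us_gaap_it --profile test up -d

# Full teardown, including named volumes and orphaned containers.
docker compose -p us_gaap_it --profile test down --volumes --remove-orphans
```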

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:8-18


graph TB
    Start["Container Start\nImage: mysql\nRoot password: onlylocal"]
WaitReady["Health Check Loop\nmysqladmin ping -h127.0.0.1"]
CreateDB["docker exec\nCREATE DATABASE IF NOT EXISTS us_gaap_test"]
LoadSchema["docker exec\nmysql < us_gaap_schema_2025.sql"]
Ready["Database Ready\nSchema applied"]
Start --> WaitReady
 
   WaitReady -->|Ping successful| CreateDB
 
   WaitReady -->|Still initializing| WaitReady
 
   CreateDB --> LoadSchema
 
   LoadSchema --> Ready

Database Container Setup

The MySQL container requires specific initialization steps to prepare the test database:

Initialization Sequence

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35

MySQL Container Configuration

The script executes the following commands against the us_gaap_test_db container:

  1. Health check: Polls MySQL readiness using mysqladmin ping

  2. Database creation: Creates the test database if it doesn't exist

  3. Schema loading: Applies the SQL schema from the repository
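
A sketch of these three steps, using the container name, root password, and schema path documented above (exact flags may differ from the script):

```bash
# 1. Poll until the MySQL server answers pings.
until docker exec us_gaap_test_db mysqladmin ping -h127.0.0.1 --silent; do
  sleep 1
done

# 2. Create the test database if it does not already exist.
docker exec us_gaap_test_db \
  mysql -uroot -ponlylocal -e "CREATE DATABASE IF NOT EXISTS us_gaap_test"

# 3. Load the schema from the repository into the new database.
docker exec -i us_gaap_test_db \
  mysql -uroot -ponlylocal us_gaap_test < tests/integration/assets/us_gaap_schema_2025.sql
```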

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:25-35


Environment Variables and Configuration

The Docker deployment uses environment variables for runtime configuration of both the WebSocket server and test infrastructure.

simd-r-drive-ws-server Configuration

| Variable | Default | Configurable At | Description |
|---|---|---|---|
| SERVER_ARGS | data.bin --host 127.0.0.1 --port 8080 | Build & Runtime | Complete command-line arguments for the server |

The SERVER_ARGS environment variable controls:

  • Data file: Path to the persistent storage file (e.g., data.bin)
  • Host binding: IP address for the WebSocket server (127.0.0.1 for localhost, 0.0.0.0 for all interfaces)
  • Port: TCP port for the WebSocket endpoint (default 8080)

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-32

Test Environment Variables

The integration test script sets the following environment variables:
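
In essence (the script sets this inline on the pytest command):

```bash
export PYTHONPATH=src
```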

This ensures Python can locate the narrative_stack module when running tests.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:37


Deployment Considerations

Port Exposure and Network Configuration

The Dockerfile exposes port 8080 for WebSocket connections:

When deploying, map this port to the host system:
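
For example (the image tag is illustrative):

```bash
# Map the container's exposed WebSocket port 8080 to the host (host:container).
docker run --rm -p 8080:8080 simd-r-drive-ci-server
```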

For production deployments, consider:

  • Using a different external port (e.g., -p 9000:8080)
  • Binding to specific interfaces with --host 0.0.0.0 in SERVER_ARGS
  • Placing the container behind a reverse proxy (nginx, traefik)

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:25

Data Persistence

The server expects a data file specified in SERVER_ARGS (default: data.bin). For persistent storage across container restarts:
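
One option is a bind mount, pointing SERVER_ARGS at the mounted path; the host directory, container path, and image tag below are illustrative:

```bash
docker run --rm -p 8080:8080 \
  -v "$(pwd)/simd-r-drive-data:/data" \
  -e SERVER_ARGS="/data/data.bin --host 0.0.0.0 --port 8080" \
  simd-r-drive-ci-server
```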

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:22-23

graph TB
    Setup["Setup Phase\ndocker compose up -d"]
Trap["Register Trap\ntrap 'cleanup' EXIT"]
Execute["Test Execution\npytest integration tests"]
Cleanup["Cleanup Phase\ndocker compose down\n--volumes --remove-orphans"]
Setup --> Trap
 
   Trap --> Execute
 
   Execute -->|Success or Failure| Cleanup

Container Lifecycle Management

The integration test script demonstrates proper container lifecycle management:

The trap command ensures cleanup runs even if tests fail or the script is interrupted:
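
As shown in the lifecycle diagram above:

```bash
# Invoke the cleanup function on every exit path, including errors and Ctrl-C.
trap cleanup EXIT
```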

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:14-19

Build and Deployment Commands

The typical local workflow has three steps: build the Docker image, run the container, and execute the integration tests.
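
A sketch assuming execution from the repository root (the image tag is illustrative):

```bash
# Build the simd-r-drive WebSocket server image.
docker build \
  -f python/narrative_stack/Dockerfile.simd-r-drive-ci-server \
  -t simd-r-drive-ci-server \
  python/narrative_stack

# Run the container, exposing the WebSocket port.
docker run --rm -p 8080:8080 simd-r-drive-ci-server

# Run the integration test suite (manages its own Docker Compose stack).
cd python/narrative_stack
bash us_gaap_store_integration_test.sh
```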

Sources: python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-3 python/narrative_stack/us_gaap_store_integration_test.sh:1-39


Integration with CI/CD

The Docker configuration integrates with GitHub Actions for automated testing. The CI pipeline:

  1. Builds the simd-r-drive-ci-server image
  2. Spins up Docker Compose services with the test profile
  3. Initializes the MySQL database with the test schema
  4. Runs pytest integration tests
  5. Tears down all containers and volumes

This ensures reproducible test environments across local development and CI systems.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-2 python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-8


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Dependencies & Technology Stack

Relevant source files

This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system. For information about the configuration system and credential management, see Configuration System. For details on how these dependencies are used within specific components, see Rust sec-fetcher Application and Python narrative_stack System.

Overview

The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.

Sources: Cargo.toml:1-45 High-Level System Architecture diagrams

Rust Technology Stack

Core Direct Dependencies

The Rust application declares 29 direct dependencies in its manifest, each serving specific architectural roles:

Dependency Categories and Usage:

graph TB
    subgraph "Async Runtime & Concurrency"
        tokio["tokio 1.43.0\nFull async runtime"]
rayon["rayon 1.10.0\nData parallelism"]
dashmap["dashmap 6.1.0\nConcurrent hashmap"]
end
    
    subgraph "HTTP & Network"
        reqwest["reqwest 0.12.15\nHTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nDrive middleware"]
end
    
    subgraph "Data Processing"
        polars["polars 0.46.0\nDataFrame operations"]
csv_crate["csv 1.3.1\nCSV parsing"]
serde["serde 1.0.218\nSerialization"]
serde_json["serde_json 1.0.140\nJSON support"]
quick_xml["quick-xml 0.37.2\nXML parsing"]
end
    
    subgraph "Storage & Caching"
        simd_r_drive["simd-r-drive 0.3.0-alpha.1\nKey-value store"]
simd_r_drive_ext["simd-r-drive-extensions\n0.4.0-alpha.6"]
end
    
    subgraph "Configuration & Validation"
        config_crate["config 0.15.9\nConfig management"]
keyring["keyring 3.6.2\nCredential storage"]
email_address["email_address 0.2.9\nEmail validation"]
rust_decimal["rust_decimal 1.36.0\nDecimal numbers"]
chrono["chrono 0.4.40\nDate/time handling"]
end
    
    subgraph "Development Tools"
        mockito["mockito 1.7.0\nHTTP mocking"]
tempfile["tempfile 3.18.0\nTemp file creation"]
env_logger["env_logger 0.11.7\nLogging"]
end

| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling, timers |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing |
| Concurrency | rayon, dashmap, crossbeam | Parallel processing, concurrent data structures |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage |
| Configuration | config, keyring | TOML config loading, secure credential management |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files |

Sources: Cargo.toml:8-44

HTTP and Network Layer

The HTTP stack consists of multiple layers providing comprehensive request handling:

TLS Configuration: The system supports dual TLS backends for platform compatibility:

graph TB
    SecClient["SecClient\n(src/sec_client.rs)"]
reqwest["reqwest 0.12.15\nHigh-level HTTP client"]
reqwest_drive["reqwest-drive 0.1.0-alpha.9\nMiddleware framework"]
hyper["hyper 1.6.0\nHTTP/1.1 & HTTP/2"]
h2["h2 0.4.8\nHTTP/2 protocol"]
TLS_Layer["TLS Layer"]
rustls["rustls 0.23.21\nPure Rust TLS"]
native_tls["native-tls 0.2.14\nPlatform TLS"]
openssl["openssl 0.10.71\nOpenSSL bindings"]
hyper_util["hyper-util 0.1.10\nUtilities"]
hyper_rustls["hyper-rustls 0.27.5\nRustls connector"]
hyper_tls["hyper-tls 0.6.0\nNative TLS connector"]
SecClient --> reqwest
 
   reqwest --> reqwest_drive
 
   reqwest --> hyper
 
   reqwest --> TLS_Layer
    
 
   hyper --> h2
 
   hyper --> hyper_util
    
 
   TLS_Layer --> hyper_rustls
 
   TLS_Layer --> hyper_tls
 
   hyper_rustls --> rustls
 
   hyper_tls --> native_tls
 
   native_tls --> openssl
  • rustls (0.23.21): Pure Rust implementation, no external dependencies, used by default
  • native-tls (0.2.14): Platform-native TLS (OpenSSL on Linux, Security.framework on macOS, SChannel on Windows)
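
As an illustration only (this is not taken from the repository's manifest), a reqwest consumer can select a backend through feature flags, for example with cargo add:

```bash
# Pure-Rust TLS (rustls), dropping the default native-tls backend.
cargo add reqwest --no-default-features --features rustls-tls

# Or keep the platform-native TLS backend.
cargo add reqwest --features native-tls
```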

Sources: Cargo.lock:1280-1351 Cargo.toml:28

Data Processing Stack

Polars Features Enabled:

  • json: JSON file reading/writing
  • lazy: Lazy evaluation engine for query optimization
  • pivot: Pivot table operations

Key Numeric Types:

  • rust_decimal::Decimal: Exact decimal arithmetic for financial calculations, critical for US GAAP processing
  • chrono::NaiveDate, chrono::DateTime: Timestamp handling for filing dates and report periods

Sources: Cargo.toml:24 Cargo.lock:2170-2394

Concurrency and Parallelism

Concurrency Model:

  • tokio : Handles I/O-bound operations (HTTP requests, file I/O)
  • rayon : Handles CPU-bound operations (DataFrame transformations, parallel iterations)
  • dashmap : Thread-safe caching without explicit locking

Key Usage Locations:

Sources: Cargo.toml:14-40 src/caches.rs:1-66 Cargo.lock:612-762

Storage and Caching System

Cache Architecture:

  • Two separate DataStore instances for different caching layers
  • Thread-safe access via Arc<DataStore> wrapped in OnceLock statics
  • Initialized once at application startup via Caches::init()
  • 1-week TTL for HTTP responses, indefinite for preprocessor results

Storage Format:

  • Binary .bin files for efficient serialization
  • Uses bincode 1.3.3 for encoding cached data
  • Persistent across application restarts

Sources: src/caches.rs:7-65 Cargo.toml:36-37 Cargo.lock:246-252

Configuration and Credentials

Configuration Sources (Priority Order):

  1. Environment variables
  2. config.toml in current directory
  3. ~/.config/sec-fetcher/config.toml
  4. Default values

Keyring Features:

  • apple-native: Direct Security.framework integration on macOS
  • windows-native: Windows Credential Manager via Win32 APIs
  • sync-secret-service: D-Bus Secret Service for Linux

Sources: Cargo.toml:12-20 Cargo.lock:1641-1653

Validation and Type Safety

| Crate | Version | Purpose | Key Types |
|---|---|---|---|
| email_address | 0.2.9 | SEC email validation | EmailAddress |
| rust_decimal | 1.36.0 | Financial calculations | Decimal |
| chrono | 0.4.40 | Date/time operations | NaiveDate, DateTime<Utc> |
| chrono-tz | 0.10.1 | Timezone handling | Tz::America__New_York |
| strum / strum_macros | 0.27.1 | Enum utilities | EnumString, Display |

Decimal Precision:

  • 28-29 significant digits
  • No floating-point rounding errors
  • Critical for US GAAP fundamental values

Date Handling:

  • All SEC filing dates parsed to chrono::NaiveDate
  • UTC timestamps for API requests
  • Eastern Time conversion for market hours

Sources: Cargo.toml:16-39 Cargo.lock:403-868

Python Technology Stack

Machine Learning Framework

Training Configuration:

  • Patience : 20 epochs for early stopping
  • Checkpoint Strategy : Save top 1 model by validation loss
  • Learning Rate : Configurable via checkpoint override
  • Gradient Clipping : Prevents exploding gradients
graph TB
    subgraph "Core Scientific Libraries"
        numpy["NumPy\nArray operations"]
pandas["pandas\nDataFrame manipulation"]
sklearn["scikit-learn\nPCA, RobustScaler"]
end
    
    subgraph "Preprocessing Pipeline"
        PCA["PCA\nDimensionality reduction\nVariance threshold: 0.95"]
RobustScaler["RobustScaler\nPer concept/unit pair\nOutlier-robust normalization"]
sklearn --> PCA
 
       sklearn --> RobustScaler
    end
    
    subgraph "Data Structures"
        triplets["Concept/Unit/Value Triplets"]
embeddings["Semantic Embeddings\nPCA-compressed"]
RobustScaler --> triplets
 
       PCA --> embeddings
 
       triplets --> embeddings
    end
    
    subgraph "Data Ingestion"
        csv_files["CSV Files\nfrom Rust sec-fetcher"]
UsGaapRowRecord["UsGaapRowRecord\nParsed row structure"]
csv_files --> UsGaapRowRecord
 
       UsGaapRowRecord --> numpy
    end

Sources: High-Level System Architecture (Diagram 3), narrative_stack training pipeline

Data Processing and Scientific Computing

scikit-learn Components:

  • PCA : Reduces embedding dimensionality while retaining 95% variance
  • RobustScaler : Normalizes using median and IQR, resistant to outliers
  • Both fitted per unique concept/unit pair for specialized normalization

NumPy Usage:

  • Validation via np.isclose() checks
  • Scaler transformation verification
  • Embedding matrix storage
graph TB
    subgraph "Database Layer"
        mysql_db[("MySQL Database\nus_gaap_test")]
        DbUsGaap["DbUsGaap\nDatabase interface"]
asyncio["asyncio\nAsync queries"]
DbUsGaap --> mysql_db
 
       DbUsGaap --> asyncio
    end
    
    subgraph "WebSocket Storage"
        DataStoreWsClient["DataStoreWsClient\nWebSocket client"]
simd_r_drive_server["simd-r-drive\nWebSocket server"]
DataStoreWsClient --> simd_r_drive_server
    end
    
    subgraph "Unified Access"
        UsGaapStore["UsGaapStore\nFacade pattern"]
UsGaapStore --> DbUsGaap
 
       UsGaapStore --> DataStoreWsClient
    end
    
    subgraph "Data Models"
        triplet_storage["Triplet Storage:\n- concept\n- unit\n- scaled_value\n- scaler\n- embedding"]
UsGaapStore --> triplet_storage
    end

Sources: High-Level System Architecture (Diagram 3), us_gaap_store preprocessing pipeline

Database and Storage Integration

Storage Strategy:

  • MySQL : Relational queries for ingested CSV data
  • WebSocket Store : High-performance embedding matrix storage
  • Facade Pattern : UsGaapStore abstracts storage backend choice

Python WebSocket Client:

  • Package: simd-r-drive-client (Python equivalent of Rust simd-r-drive)
  • Async communication with WebSocket server
  • Shared embedding matrices between Rust and Python

Sources: High-Level System Architecture (Diagram 3), Database & Storage Integration section

Visualization and Debugging

| Library | Purpose | Key Outputs |
|---|---|---|
| matplotlib | Static plots | PCA variance plots, scaler distribution |
| TensorBoard | Training metrics | Loss curves, learning rate schedules |
| itertools | Data iteration | Batching, grouping concept pairs |

Visualization Types:

  1. PCA Explanation Plots : Cumulative variance by component
  2. Semantic Embedding Scatterplots : t-SNE/UMAP projections of concept space
  3. Variance Analysis : Per-concept/unit value distributions
  4. Training Curves : TensorBoard integration via PyTorch Lightning
graph TB
    subgraph "Server Implementation"
        server["simd-r-drive-ws-server\nWebSocket server"]
DataStore_server["DataStore\nBackend storage"]
server --> DataStore_server
    end
    
    subgraph "Rust Clients"
        http_cache_rust["HTTP Cache\n(Rust)"]
preprocessor_cache_rust["Preprocessor Cache\n(Rust)"]
http_cache_rust --> DataStore_server
 
       preprocessor_cache_rust --> DataStore_server
    end
    
    subgraph "Python Clients"
        DataStoreWsClient_py["DataStoreWsClient\n(Python)"]
embedding_matrix["Embedding Matrix\nStorage"]
DataStoreWsClient_py --> DataStore_server
 
       embedding_matrix --> DataStoreWsClient_py
    end
    
    subgraph "Docker Deployment"
        dockerfile["Dockerfile\nContainer config"]
rust_image["rust:1.87\nBase image"]
dockerfile --> rust_image
 
       dockerfile --> server
    end

Sources: High-Level System Architecture (Diagram 3), Validation & Analysis section

Shared Infrastructure

simd-r-drive WebSocket Server

Key Features:

  • Cross-language data sharing via WebSocket protocol
  • Binary-efficient key-value storage
  • Persistent storage across restarts
  • Docker containerization for CI/CD environments

Usage Patterns:

  1. Rust writes HTTP cache entries
  2. Rust writes preprocessor transformations
  3. Python reads/writes embedding matrices
  4. Python reads ingested US GAAP triplets

Sources: Cargo.toml:36-37 High-Level System Architecture (Diagram 1, Diagram 5)

File System Interchange

Directory Structure:

  • fund-holdings/A-Z/ : NPORT filing holdings, alphabetized by ticker
  • us-gaap/ : Fundamental concepts CSV outputs
  • Row-based CSV format with standardized columns

Data Flow:

  1. Rust fetches from SEC API
  2. Rust transforms via distill_us_gaap_fundamental_concepts
  3. Rust writes CSV files
  4. Python walks directories
  5. Python parses into UsGaapRowRecord
  6. Python ingests to MySQL and WebSocket store
graph TB
    subgraph "CI Pipeline"
        workflow[".github/workflows/\nus-gaap-store-integration-test"]
checkout["Checkout code"]
rust_setup["Setup Rust 1.87"]
docker_compose["Docker Compose\nsimd-r-drive-ci-server"]
mysql_container["MySQL test container"]
workflow --> checkout
 
       checkout --> rust_setup
 
       checkout --> docker_compose
 
       docker_compose --> mysql_container
    end
    
    subgraph "Test Execution"
        cargo_test["cargo test\nRust unit tests"]
pytest["pytest\nPython integration tests"]
rust_setup --> cargo_test
 
       docker_compose --> pytest
    end
    
    subgraph "Testing Tools"
        mockito_tests["mockito 1.7.0\nHTTP mock server"]
tempfile_tests["tempfile 3.18.0\nTemp directories"]
cargo_test --> mockito_tests
 
       cargo_test --> tempfile_tests
    end

Sources: High-Level System Architecture (Diagram 1), main.rs CSV output logic

Development and CI/CD Stack

GitHub Actions Workflow

Docker Configuration:

  • Image : rust:1.87 for reproducible builds
  • Services : simd-r-drive-ws-server, MySQL 8.0
  • Environment Variables : Database credentials, WebSocket ports
  • Volume Mounts : Test data directories

Testing Strategy:

  • Rust Unit Tests : mockito for HTTP mocking, tempfile for isolated test fixtures
  • Python Integration Tests : pytest with MySQL fixtures, WebSocket client validation
  • End-to-End : Full pipeline from CSV ingestion through ML training

Sources: Cargo.toml:42-44 High-Level System Architecture (Diagram 5), CI/CD Pipeline section

Testing Dependencies

| Language | Framework | Mocking | Assertions |
|---|---|---|---|
| Rust | Built-in #[test] | mockito 1.7.0 | Standard assertions |
| Python | pytest | unittest.mock | np.isclose() |

Rust Test Modules:

  • config_manager_tests: Configuration loading and merging
  • sec_client_tests: HTTP client behavior
  • distill_tests: US GAAP concept transformation

Python Test Modules:

  • Integration tests: End-to-end pipeline validation
  • Unit tests: Scaler, PCA, embedding generation

Sources: Cargo.lock:1802-1824 Testing Strategy documentation

Version Compatibility Matrix

Critical Version Constraints

| Component | Version | Constraints | Reason |
|---|---|---|---|
| tokio | 1.43.0 | >= 1.0 | Stable async runtime API |
| reqwest | 0.12.15 | 0.12.x | Compatible with reqwest-drive middleware |
| polars | 0.46.0 | 0.46.x | Breaking changes between minor versions |
| simd-r-drive | 0.3.0-alpha.1 | Exact match | Alpha API, unstable |
| serde | 1.0.218 | >= 1.0.100 | Derive macro stability |
| hyper | 1.6.0 | 1.x | HTTP/2 support, reqwest dependency |

Rust Toolchain

Minimum Rust Version: 1.70+ (for async trait bounds and GAT support)

Feature Requirements:

  • async-trait for trait async methods
  • #[tokio::main] macro support
  • Generic Associated Types (GATs) for polars iterators

Sources: Cargo.toml:4 Cargo.lock:1-3

Python Version Requirements

Estimated Python Version: 3.8+ (for PyTorch Lightning compatibility)

Key Constraints:

  • PyTorch: Requires Python 3.8 or later
  • PyTorch Lightning: Requires PyTorch >= 1.9
  • asyncio: Stable async/await syntax (Python 3.7+)

Sources: High-Level System Architecture (Python technology sections)

Transitive Dependency Highlights

HTTP/TLS Stack Depth

graph LR
    reqwest["reqwest"] --> hyper
    hyper --> h2["h2\nHTTP/2"]
    hyper --> http["http\nTypes"]
    reqwest --> rustls
    reqwest --> native_tls
    rustls --> ring["ring\nCryptography"]
    native_tls --> openssl["openssl\nSystem TLS"]
    h2 --> tokio
    hyper --> tokio

Notable Transitive Dependencies:

  • h2 0.4.8: HTTP/2 protocol implementation (37 dependencies)
  • rustls 0.23.21: TLS 1.2/1.3 without OpenSSL (14 dependencies)
  • tokio-util 0.7.13: Codec, framing utilities (8 dependencies)

Sources: Cargo.lock:1141-1351

Polars Ecosystem Depth

The polars dependency pulls in 23 related crates:

  • polars-core, polars-lazy, polars-io, polars-ops
  • polars-arrow: Apache Arrow array implementation
  • polars-parquet: Parquet file format support
  • polars-sql: SQL query interface
  • polars-time: Temporal operations
  • polars-utils: Shared utilities

Total Transitive: ~180 crates in the full dependency tree
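
The tree can be inspected locally with cargo's built-in tree subcommand (shown for illustration; not a required build step):

```bash
# Everything polars pulls into the build.
cargo tree -p polars

# Which crates depend, directly or transitively, on rustls.
cargo tree -i rustls
```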

Sources: Cargo.lock:2170-2394

Security and Cryptography

Platform-Specific Cryptography:

  • macOS: Security.framework via security-framework-sys
  • Linux: D-Bus + libdbus-sys
  • Windows: Win32 Credential Manager via windows-sys

Sources: Cargo.lock:764-1653


Total Rust Dependencies: ~250 crates (including transitive)

Key Architectural Decisions:

  1. Dual TLS support for platform flexibility
  2. Polars instead of native DataFrame for performance
  3. simd-r-drive for cross-language data sharing
  4. Alpha versions for simd-r-drive (active development)
  5. Platform-specific credential storage via keyring

Sources: Cargo.toml:1-45 Cargo.lock:1 High-Level System Architecture (all diagrams)