This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Overview

This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see Getting Started. For detailed implementation documentation of individual components, see Rust sec-fetcher Application and Python narrative_stack System.

Sources: Cargo.toml, README.md, src/lib.rs

System Purpose

The rust-sec-fetcher repository implements a dual-language financial data processing system that:

  1. Fetches company financial data from the SEC EDGAR API
  2. Transforms raw SEC filings into normalized US GAAP fundamental concepts
  3. Stores structured data as CSV files organized by ticker symbol
  4. Trains machine learning models to understand financial concept relationships

The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing 57+ variations of concepts like revenue and consolidating them into a standardized taxonomy of 64 FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
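To illustrate the kind of consolidation involved, a handful of revenue tag variations might all collapse to one canonical concept. The tag names and mapping below are a hypothetical sketch, not the actual table from the `transformers` module:

```python
# Illustrative sketch only: a few US GAAP revenue tag variations collapsing
# into one canonical concept. Tag names are examples, not the real mapping.
REVENUE_VARIATIONS = {
    "Revenues": "Revenues",
    "SalesRevenueNet": "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "Revenues",
    "RevenueFromContractWithCustomerIncludingAssessedTax": "Revenues",
}

def normalize_concept(raw_tag: str):
    """Return the canonical concept for a raw tag, or None if unmapped."""
    return REVENUE_VARIATIONS.get(raw_tag)

print(normalize_concept("SalesRevenueNet"))  # -> Revenues
```

Because every variation resolves to the same canonical name, downstream queries can ask for "Revenues" once instead of enumerating dozens of filer-specific tags.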

Sources: Cargo.toml:1-6, src/main.rs:173-240

Dual-Language Architecture

The repository employs a dual-language design that leverages the strengths of both Rust and Python:

| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, CSV generation | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |

Rust Layer (sec-fetcher):

  • Implements SecClient with sophisticated throttling and caching policies
  • Fetches company tickers, CIK submissions, NPORT filings, and US GAAP fundamentals
  • Transforms raw financial data via distill_us_gaap_fundamental_concepts
  • Outputs structured CSV files organized by ticker symbol

Python Layer (narrative_stack):

  • Ingests CSV files generated by Rust layer
  • Generates semantic embeddings for concept/unit pairs
  • Applies dimensionality reduction via PCA
  • Trains Stage1Autoencoder models using PyTorch Lightning

Sources: Cargo.toml:1-40, src/main.rs:1-16

High-Level Data Flow

graph TB
    SEC["SEC EDGAR API\ncompany_tickers.json\nCIK submissions\ncompanyfacts dataset"]
    SecClient["SecClient\n(network layer)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_us_gaap_fundamentals\nfetch_nport_filing_by_ticker_symbol\nfetch_investment_company_series_and_class_dataset"]
    Distill["distill_us_gaap_fundamental_concepts\n(transformers module)\nMaps 57+ variations → 64 concepts"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\nFundamentalConcept enum"]
    CSV["File System\ndata/fund-holdings/A-Z/\ndata/us-gaap/"]
    Ingest["Python Ingestion\nus_gaap_store.ingest_us_gaap_csvs\nWalks CSV directories"]
    Preprocess["Preprocessing\nPCA dimensionality reduction\nRobustScaler normalization\nSemantic embeddings"]
    DataStore["simd-r-drive\nWebSocket Key-Value Store\nEmbedding matrix storage"]
    Training["Stage1Autoencoder\nPyTorch Lightning\nTensorBoard logging"]

    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Distill
    Distill --> Models
    Models --> CSV

    CSV -.->|CSV files| Ingest
    Ingest --> Preprocess
    Preprocess --> DataStore
    DataStore --> Training

Data Flow Summary:

  1. Fetch : SecClient retrieves data from SEC EDGAR API endpoints
  2. Transform : Raw financial data passes through distill_us_gaap_fundamental_concepts to normalize concept names
  3. Store : Structured data is written to CSV files, organized by ticker symbol first letter
  4. Ingest : Python scripts walk CSV directories and parse records
  5. Preprocess : Generate embeddings, apply PCA, normalize values
  6. Train : ML models learn financial concept relationships
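The six stages above can be sketched as a simple pipeline of functions. Every name and data shape here is a placeholder standing in for the real Rust/Python components, kept trivial to show only how the stages hand off to one another:

```python
# Hypothetical end-to-end pipeline skeleton mirroring the six stages.
# Each step is a stub standing in for the real component named in the comment.
def fetch(ticker):            # SecClient + fetch_* functions (Rust)
    return {"ticker": ticker, "raw": {"SalesRevenueNet": 1000.0}}

def transform(record):        # distill_us_gaap_fundamental_concepts (Rust)
    record["concepts"] = {"Revenues": record["raw"]["SalesRevenueNet"]}
    return record

def store(record):            # CSV output organized by ticker symbol (Rust)
    path = f"data/us-gaap/{record['ticker']}.csv"
    return path, record

def ingest(path_record):      # us_gaap_store.ingest_us_gaap_csvs (Python)
    _, record = path_record
    return record["concepts"]

def preprocess(concepts):     # embeddings + PCA + scaling (Python)
    return [(name, value) for name, value in concepts.items()]

def train(pairs):             # Stage1Autoencoder (Python)
    return f"trained on {len(pairs)} concept/value pairs"

result = train(preprocess(ingest(store(transform(fetch("AAPL"))))))
print(result)  # -> trained on 1 concept/value pairs
```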

Sources: src/main.rs:1-16, src/main.rs:173-240, src/lib.rs:1-11

Core Module Structure

graph TB
    main["main.rs\nApplication entry point"]
    config["config module\nConfigManager\nAppConfig"]
    network["network module\nSecClient\nfetch_* functions\nThrottlePolicy\nCachePolicy"]
    transformers["transformers module\ndistill_us_gaap_fundamental_concepts"]
    models["models module\nTicker\nCik\nCikSubmission\nNportInvestment\nAccessionNumber"]
    enums["enums module\nFundamentalConcept\nCacheNamespacePrefix\nTickerOrigin\nUrl"]
    caches["caches module (private)\nCaches struct\nOnceLock statics\nHTTP cache\nPreprocessor cache"]
    utils["utils module\ninvert_multivalue_indexmap\nis_development_mode\nis_interactive_mode\nVecExtensions"]
    fs["fs module\nFile system utilities"]
    parsers["parsers module\nData parsing functions"]

    main --> config
    main --> network
    main --> utils

    network --> config
    network --> caches
    network --> models
    network --> transformers
    network --> parsers

    transformers --> enums
    transformers --> models

    models --> enums

    caches --> config

Module Descriptions:

| Module | Primary Purpose | Key Exports |
|---|---|---|
| config | Configuration management and credential loading | ConfigManager, AppConfig |
| network | HTTP client, data fetching, throttling, caching | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals, fetch_nport_filing_by_ticker_symbol |
| transformers | US GAAP concept normalization | distill_us_gaap_fundamental_concepts |
| models | Data structures for SEC entities | Ticker, Cik, CikSubmission, NportInvestment, AccessionNumber |
| enums | Type-safe enumerations | FundamentalConcept (64 variants), CacheNamespacePrefix, TickerOrigin, Url |
| caches | Internal caching infrastructure (private module) | Caches |
| utils | Utility functions | invert_multivalue_indexmap, VecExtensions, is_development_mode |

Sources: src/lib.rs:1-11, src/main.rs:1-16, src/utils.rs:1-12

Rust Component Architecture

The Rust layer is organized around SecClient, which provides a high-level HTTP interface with integrated throttling and caching. Network functions (fetch_*) use this client to retrieve data from SEC EDGAR endpoints. The most critical component is distill_us_gaap_fundamental_concepts, which normalizes financial concepts using four mapping patterns:

  • One-to-One : Direct mappings (e.g., Assets → FundamentalConcept::Assets)
  • Hierarchical : Specific concepts also map to parent categories (e.g., CurrentAssets maps to both CurrentAssets and Assets)
  • Synonym Consolidation : Multiple terms map to a single concept (e.g., 6 cost variations → CostOfRevenue)
  • Industry-Specific : 57+ revenue variations map to Revenues
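The hierarchical pattern is what distinguishes this from a plain rename table: one raw tag can contribute to several canonical concepts at once. A minimal sketch (in Python for illustration, with hypothetical table contents; the real implementation is the Rust function distill_us_gaap_fundamental_concepts):

```python
# Sketch of the four mapping patterns. A raw tag maps to a LIST of canonical
# concepts, so hierarchical tags can feed both themselves and their parent.
# Table contents are illustrative, not the actual mapping.
CONCEPT_MAP = {
    "Assets": ["Assets"],                          # one-to-one
    "AssetsCurrent": ["CurrentAssets", "Assets"],  # hierarchical
    "CostOfGoodsSold": ["CostOfRevenue"],          # synonym consolidation
    "SalesRevenueNet": ["Revenues"],               # industry-specific
}

def distill(raw_tag: str):
    """Return every canonical concept a raw tag contributes to."""
    return CONCEPT_MAP.get(raw_tag, [])

print(distill("AssetsCurrent"))  # -> ['CurrentAssets', 'Assets']
```

Returning a list rather than a single name is what lets a query for total Assets automatically include filers that only reported the more specific CurrentAssets tag.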

Sources: src/main.rs:1-16, Cargo.toml:8-40

Python Component Architecture

The Python narrative_stack system consumes CSV files produced by the Rust layer and trains machine learning models:

Key Components:

| Component | Module/Class | Purpose |
|---|---|---|
| Ingestion | us_gaap_store.ingest_us_gaap_csvs | Walks CSV directories, parses UsGaapRowRecord entries |
| Preprocessing | PCA, RobustScaler | Generates semantic embeddings, normalizes values, reduces dimensionality |
| Storage | DbUsGaap, DataStoreWsClient, UsGaapStore | Database interface, WebSocket client, unified data access facade |
| Training | Stage1Autoencoder | PyTorch Lightning autoencoder for learning concept embeddings |
| Validation | np.isclose checks | Scaler verification, embedding validation |

The preprocessing pipeline creates concept/unit/value triplets with associated embeddings and scalers, storing them in both MySQL and the simd-r-drive WebSocket data store. The Stage1Autoencoder learns latent representations by reconstructing embedding + scaled_value inputs.
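As a rough illustration of the normalization step (not the actual narrative_stack code), robust scaling centers values on the median and divides by the interquartile range, which is what scikit-learn's RobustScaler does by default. Financial values are outlier-heavy, so this behaves far better than mean/std scaling:

```python
import numpy as np

# Median/IQR scaling, equivalent in spirit to sklearn's RobustScaler
# defaults; written with plain numpy so the arithmetic is visible.
def robust_scale(values: np.ndarray) -> np.ndarray:
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier
scaled = robust_scale(values)
# The median maps to 0 and the outlier no longer dominates the scale.
assert np.isclose(scaled[2], 0.0)
```

In the pipeline described above, a value scaled this way is concatenated with its concept/unit embedding to form the input the Stage1Autoencoder learns to reconstruct.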

Sources: (Python code not included in provided files, based on architecture diagrams)

Key Technologies

Rust Dependencies:

| Crate | Purpose |
|---|---|
| tokio | Async runtime for I/O operations |
| reqwest | HTTP client library |
| polars | DataFrame operations and CSV handling |
| rayon | CPU parallelism |
| simd-r-drive | WebSocket data store integration |
| serde, serde_json | Serialization/deserialization |
| chrono | Date/time handling |

Python Dependencies:

  • PyTorch & PyTorch Lightning (ML training)
  • pandas, numpy (data processing)
  • scikit-learn (PCA, RobustScaler)
  • matplotlib, TensorBoard (visualization)

Sources: Cargo.toml:8-40

CSV Output Organization

The Rust application writes CSV files to organized directories:

data/
├── fund-holdings/
│   ├── A/
│   │   ├── AAPL.csv
│   │   ├── AMZN.csv
│   │   └── ...
│   ├── B/
│   ├── C/
│   └── ...
└── us-gaap/
    ├── AAPL.csv
    ├── MSFT.csv
    └── ...

Files are organized by the first letter of the ticker symbol (uppercased) to facilitate parallel processing and improve file system performance with large datasets.
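The bucketing rule can be sketched as a small path helper. This is a hypothetical Python illustration of the layout, not the Rust code that actually writes the files:

```python
from pathlib import Path

# Hypothetical helper reproducing the directory layout: fund holdings are
# bucketed by the uppercased first letter of the ticker symbol.
def fund_holdings_path(data_dir: str, ticker: str) -> Path:
    bucket = ticker[0].upper()
    return Path(data_dir) / "fund-holdings" / bucket / f"{ticker.upper()}.csv"

print(fund_holdings_path("data", "aapl").as_posix())
# -> data/fund-holdings/A/AAPL.csv
```

Bucketing by first letter keeps any single directory to a manageable size and gives parallel workers a natural way to partition the ticker space.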

Sources: src/main.rs:83-102, src/main.rs:206-221

Next Steps

Sources: src/main.rs:1-240, Cargo.toml:1-45, src/lib.rs:1-11