
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Rust sec-fetcher Application

Purpose and Scope

This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured CSV files. The application serves as the data collection and preprocessing layer in a larger dual-language system that combines Rust's performance for I/O operations with Python's machine learning capabilities.

This page covers the high-level architecture, module organization, and data flow patterns. The Python machine learning pipeline that consumes this data is documented in Python narrative_stack System.

Sources: src/lib.rs:1-11 Cargo.toml:1-45 src/main.rs:1-241

Application Architecture

The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format, and output structured CSV files for downstream consumption.

graph TB
    subgraph "src/lib.rs Module Organization"
        config["config\nConfigManager, AppConfig"]
        enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
        models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
        network["network\nSecClient, fetch_* functions\nThrottlePolicy, CachePolicy"]
        transformers["transformers\ndistill_us_gaap_fundamental_concepts"]
        parsers["parsers\nXML/JSON parsing utilities"]
        caches["caches\nCaches (internal)\nHTTP cache, preprocessor cache"]
        fs["fs\nFile system operations"]
        utils["utils\nVecExtensions, helpers"]
    end

    subgraph "External Dependencies"
        reqwest["reqwest\nHTTP client"]
        polars["polars\nDataFrame operations"]
        simd["simd-r-drive\nDrive-based cache storage"]
        tokio["tokio\nAsync runtime"]
        serde["serde\nSerialization"]
    end

    config --> caches
    network --> config
    network --> caches
    network --> models
    network --> enums
    network --> parsers
    transformers --> models
    transformers --> enums
    parsers --> models

    network --> reqwest
    network --> simd
    network --> tokio
    transformers --> polars
    models --> serde

    style config fill:#f9f9f9
    style network fill:#f9f9f9
    style transformers fill:#f9f9f9
    style caches fill:#f9f9f9

Module Structure

The application is organized into nine core modules as declared in src/lib.rs:1-10:

| Module | Purpose | Key Components |
| --- | --- | --- |
| config | Configuration management and credential handling | ConfigManager, AppConfig |
| enums | Type-safe enumerations for domain concepts | FundamentalConcept, Url, CacheNamespacePrefix, TickerOrigin |
| models | Data structures representing SEC entities | Ticker, CikSubmission, NportInvestment, AccessionNumber |
| network | HTTP client and data fetching functions | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals |
| transformers | Data normalization and transformation | distill_us_gaap_fundamental_concepts |
| parsers | XML/JSON parsing utilities | Filing parsers, XML extractors |
| caches | Internal caching infrastructure | Caches (singleton), HTTP cache, preprocessor cache |
| fs | File system operations | Directory creation, path utilities |
| utils | Helper functions and extensions | VecExtensions, invert_multivalue_indexmap |
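
The declarations in src/lib.rs plausibly look like the following sketch. This is an assumption based on the table above; the real file may differ in ordering and visibility:

```rust
// Hypothetical src/lib.rs layout consistent with the module table above;
// the actual file may differ in ordering and visibility.
pub mod caches;
pub mod config;
pub mod enums;
pub mod fs;
pub mod models;
pub mod network;
pub mod parsers;
pub mod transformers;
pub mod utils;
```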

Sources: src/lib.rs:1-11 Cargo.toml:8-40

Data Flow Architecture

Request-Response Flow with Caching

The data flow follows a pipeline pattern:

  1. Request Initiation: Application code calls functions like fetch_us_gaap_fundamentals or fetch_nport_filing_by_ticker_symbol
  2. Client Middleware: SecClient applies throttling and caching policies before making HTTP requests
  3. Cache Check: CachePolicy checks simd-r-drive storage for cached responses
  4. API Request: On a cache miss, the request is sent to the SEC EDGAR API with proper headers and rate limiting
  5. Parsing: Raw XML/JSON responses are parsed into structured data models
  6. Transformation: Data passes through normalization functions like distill_us_gaap_fundamental_concepts
  7. Output: Transformed data is written to CSV files organized by ticker symbol or category
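
To make steps 3 and 4 concrete, here is a minimal, self-contained sketch of the cache-or-fetch pattern using an in-memory map. The real application persists responses through simd-r-drive behind its CachePolicy middleware, which is not shown here:

```rust
// Illustrative cache-then-fetch flow; this toy Cache stands in for the
// crate's real CachePolicy/simd-r-drive machinery, whose APIs are not shown.
use std::collections::HashMap;

struct Cache {
    store: HashMap<String, String>,
}

impl Cache {
    // Return a cached response, or fetch and cache it on a miss.
    fn get_or_fetch(&mut self, url: &str, fetch: impl Fn(&str) -> String) -> String {
        if let Some(hit) = self.store.get(url) {
            return hit.clone();
        }
        let body = fetch(url);
        self.store.insert(url.to_string(), body.clone());
        body
    }
}

fn main() {
    let mut cache = Cache { store: HashMap::new() };
    // First call is a miss (simulated fetch); second call is a hit.
    let fetch = |url: &str| format!("response for {url}");
    let a = cache.get_or_fetch("https://example.test/filing", &fetch);
    let b = cache.get_or_fetch("https://example.test/filing", &fetch);
    assert_eq!(a, b);
}
```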

Sources: src/main.rs:174-240 src/lib.rs:1-11

Key Dependencies and Technology Stack

The application leverages modern Rust crates for performance and reliability:

Critical Dependencies

| Category | Crate | Version | Purpose |
| --- | --- | --- | --- |
| Async Runtime | tokio | 1.43.0 | Asynchronous I/O, task scheduling |
| HTTP Client | reqwest | 0.12.15 | HTTP requests with JSON support |
| Data Frames | polars | 0.46.0 | High-performance DataFrame operations |
| Caching | simd-r-drive | 0.3.0 | WebSocket-based key-value storage |
| Parallelism | rayon | 1.10.0 | Data parallelism for CPU-bound work |
| Configuration | config | 0.15.9 | TOML/JSON configuration file parsing |
| Credentials | keyring | 3.6.2 | Secure credential storage (OS keychain) |
| XML Parsing | quick-xml | 0.37.2 | Fast XML parsing for SEC filings |
| Serialization | serde | 1.0.218 | Data structure serialization/deserialization |
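
As a small illustration of how the async stack composes (not sec-fetcher's actual SecClient code), the following fetches one SEC endpoint with tokio and reqwest. The User-Agent value is a placeholder, and tokio's macros and rt-multi-thread features are assumed:

```rust
// Minimal tokio + reqwest composition, independent of the crate's SecClient.
// The User-Agent string is a placeholder; SEC EDGAR expects a descriptive one.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = reqwest::Client::new()
        .get("https://www.sec.gov/files/company_tickers.json")
        .header("User-Agent", "example-app contact@example.com")
        .send()
        .await?
        .text()
        .await?;
    println!("fetched {} bytes", body.len());
    Ok(())
}
```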

Sources: Cargo.toml:8-40

Application Entry Point and Main Loop

The application entry point in src/main.rs:174-240 demonstrates the typical processing pattern.

Main Processing Loop Structure

The main application follows this pattern:

  1. Configuration Loading: ConfigManager::load() reads configuration from TOML files and environment variables
  2. Client Initialization: SecClient::from_config_manager() creates an HTTP client with throttling and caching middleware
  3. Ticker Fetching: fetch_company_tickers() retrieves all company ticker symbols from the SEC
  4. Processing Loop: Iterates over each ticker symbol:
    • Calls fetch_us_gaap_fundamentals() to retrieve financial data
    • Transforms data through distill_us_gaap_fundamental_concepts
    • Writes results to CSV files organized by ticker: data/us-gaap/{ticker}.csv
    • Logs errors to an error_log HashMap for later reporting
  5. Error Reporting: Prints a summary of any errors encountered during processing

Example Main Function Flow
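
A condensed sketch of this flow, with hypothetical stubs standing in for the real fetch_company_tickers and fetch_us_gaap_fundamentals (their actual signatures live in src/main.rs and the network module):

```rust
// Condensed sketch of the main loop; the stub functions below stand in for
// the real fetch_company_tickers / fetch_us_gaap_fundamentals implementations.
use std::collections::HashMap;

fn fetch_company_tickers() -> Result<Vec<String>, String> {
    Ok(vec!["AAPL".to_string(), "MSFT".to_string()])
}

fn fetch_us_gaap_fundamentals(ticker: &str) -> Result<String, String> {
    Ok(format!("csv rows for {ticker}"))
}

fn main() {
    // Errors are accumulated per ticker so one failure never halts the batch.
    let mut error_log: HashMap<String, String> = HashMap::new();

    for ticker in fetch_company_tickers().expect("ticker list is required") {
        match fetch_us_gaap_fundamentals(&ticker) {
            // The real application distills fundamental concepts and writes
            // data/us-gaap/{ticker}.csv at this point.
            Ok(csv) => println!("{ticker}: {} bytes", csv.len()),
            Err(e) => {
                error_log.insert(ticker, e);
            }
        }
    }

    // Error reporting happens once, after the whole batch completes.
    for (ticker, err) in &error_log {
        eprintln!("error for {ticker}: {err}");
    }
}
```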

Sources: src/main.rs:174-240

Module Interaction Patterns

Configuration and Client Initialization

The initialization sequence ensures:

  • Configuration is loaded before any network operations
  • Credentials are retrieved securely from OS keychain via keyring crate
  • Caches singleton is initialized with drive storage connection
  • HTTP client middleware is properly configured with throttling and caching
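
A sketch of that ordering, using stub types in place of the crate's real ConfigManager and SecClient (whose actual signatures live in the config and network modules):

```rust
// Hypothetical initialization sequence; the stubs mirror the order described
// above rather than the crate's real APIs.
struct ConfigManager;
struct SecClient;

impl ConfigManager {
    // Real code reads TOML files, environment variables, and the OS keychain.
    fn load() -> Self {
        ConfigManager
    }
}

impl SecClient {
    // Real code wires throttling and caching middleware into the HTTP client.
    fn from_config_manager(_config: &ConfigManager) -> Self {
        SecClient
    }
}

fn main() {
    let config = ConfigManager::load(); // 1. configuration first
    let _client = SecClient::from_config_manager(&config); // 2. then the client
    // 3. network calls may now safely rely on config, credentials, and caches.
}
```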

Sources: src/lib.rs:1-11

Data Transformation Pipeline

The transformation pipeline is the most critical component of the application. The distill_us_gaap_fundamental_concepts function handles:

  • Synonym Consolidation: Maps 6+ variations of "Cost of Revenue" to the single CostOfRevenue concept
  • Industry-Specific Revenue: Handles 57+ revenue field variations (SalesRevenueNet, RevenueFromContractWithCustomer, etc.)
  • Hierarchical Mapping: Maps specific concepts to parent categories (e.g., CurrentAssets also maps to Assets)
  • Unit Normalization: Standardizes currency, shares, and percentage units
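
The following sketch shows the flavor of these mappings with a handful of illustrative tags. The real distill_us_gaap_fundamental_concepts covers far more variants, and its exact signature is not reproduced here:

```rust
// Illustrative synonym/hierarchy mapping in the spirit of
// distill_us_gaap_fundamental_concepts; tags and coverage are a small sample.
#[derive(Debug, PartialEq)]
enum FundamentalConcept {
    Assets,
    CurrentAssets,
    CostOfRevenue,
    Revenues,
}

fn distill(tag: &str) -> Vec<FundamentalConcept> {
    use FundamentalConcept::*;
    match tag {
        // Synonym consolidation: several source tags collapse to one concept.
        "CostOfRevenue" | "CostOfGoodsSold" => vec![CostOfRevenue],
        "SalesRevenueNet" | "RevenueFromContractWithCustomerExcludingAssessedTax" => {
            vec![Revenues]
        }
        // Hierarchical mapping: a specific concept also maps to its parent.
        "AssetsCurrent" => vec![CurrentAssets, Assets],
        _ => vec![],
    }
}

fn main() {
    assert_eq!(
        distill("AssetsCurrent"),
        vec![FundamentalConcept::CurrentAssets, FundamentalConcept::Assets]
    );
}
```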

Sources: Cargo.toml:24 high-level architecture diagram

Error Handling Strategy

The application uses a multi-layered error handling approach:

| Layer | Strategy | Example |
| --- | --- | --- |
| Network | Retry with exponential backoff | ThrottlePolicy with max_retries |
| Cache | Fallback to API on cache errors | CachePolicy transparent fallback |
| Parsing | Log and continue processing | Error log in main loop |
| File I/O | Log error, continue to next item | Individual CSV write failures don't stop processing |

The main loop accumulates errors in a HashMap<String, String> and reports them at the end, ensuring that one failure doesn't halt the entire batch processing job.
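
A generic sketch of the network layer's retry strategy follows; the real ThrottlePolicy middleware also applies jitter and SEC-specific rate limits:

```rust
// Generic retry-with-exponential-backoff sketch; names like max_retries echo
// the table above, but this is not the crate's ThrottlePolicy implementation.
use std::thread::sleep;
use std::time::Duration;

fn retry_with_backoff<T, E>(
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= max_retries => return Err(e),
            Err(_) => {
                sleep(delay);
                delay *= 2; // double the wait between attempts
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Fails twice, then succeeds; three attempts in total.
    let result = retry_with_backoff(5, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(calls) }
    });
    assert_eq!(result, Ok(3));
}
```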

Sources: src/main.rs:185-240

Performance Characteristics

The application is optimized for I/O-bound workloads:

  • Async I/O: tokio runtime enables concurrent network requests
  • CPU Parallelism: rayon parallelizes DataFrame operations in polars
  • Caching: simd-r-drive reduces redundant API calls with a 1-week TTL
  • Throttling: Respects SEC rate limits with adaptive jitter
  • Streaming: Processes large datasets without loading them entirely into memory

The combination of async networking for I/O operations and rayon parallelism for CPU-bound transformations provides high throughput for SEC data fetching and processing.
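
For the CPU-parallelism point, a minimal rayon example (a toy computation, not the DataFrame work sec-fetcher actually parallelizes):

```rust
// Minimal rayon illustration; par_iter splits the work across the thread pool.
use rayon::prelude::*;

fn main() {
    let rows: Vec<u64> = (0..1_000_000).collect();
    // Each element is transformed in parallel, then reduced with a sum.
    let checksum: u64 = rows.par_iter().map(|v| v % 97).sum();
    println!("checksum = {checksum}");
}
```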

Sources: Cargo.toml:8-40 high-level architecture diagrams