
GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Overview

Relevant source files

This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system’s overall design and data flow. For installation and configuration instructions, see [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting%20Started). For detailed implementation documentation of individual components, see [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust%20sec-fetcher%20Application) and [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python%20narrative_stack%20System).

Sources: Cargo.toml:1-10 README.md:1-8

System Purpose

The rust-sec-fetcher repository implements a dual-language financial data processing system that:

  1. Fetches company financial data from the SEC EDGAR API, including filings (10-K, 10-Q, 8-K), fund holdings (13F, N-PORT), and IPO registrations. README.md:5-8
  2. Transforms raw SEC filings into normalized US GAAP fundamental concepts or clean Markdown/text views. README.md:5-8 src/ops/filing_ops.rs:1-15
  3. Stores structured data as CSV files or provides it via high-level data models like Ticker, Cik, and NportInvestment. src/models/ticker.rs:1-10 src/models/nport.rs:1-10
  4. Trains machine learning models (in the Python narrative_stack) to understand financial concept relationships and perform semantic analysis.

The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing variations of concepts and consolidating them into a standardized taxonomy of FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
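The consolidation described above can be sketched in miniature. The variant tag names and the mapping below are illustrative assumptions for this sketch, not the crate's actual taxonomy in src/enums/fundamental_concept.rs:

```rust
// Minimal sketch of US GAAP concept normalization: several raw reported
// tag variants consolidate into one canonical FundamentalConcept variant.
// The tag names and mapping here are illustrative, not the real taxonomy.
#[derive(Debug, PartialEq)]
enum FundamentalConcept {
    Revenues,
    NetIncomeLoss,
}

fn normalize_concept(raw_tag: &str) -> Option<FundamentalConcept> {
    match raw_tag {
        // Multiple reported variants collapse to one canonical concept.
        "Revenues" | "SalesRevenueNet" => Some(FundamentalConcept::Revenues),
        "NetIncomeLoss" | "ProfitLoss" => Some(FundamentalConcept::NetIncomeLoss),
        _ => None,
    }
}

fn main() {
    // Differently-tagged filings become queryable under one concept.
    println!("{:?}", normalize_concept("SalesRevenueNet"));
    println!("{:?}", normalize_concept("ProfitLoss"));
}
```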

Sources: Cargo.toml:1-10 README.md:1-8 src/enums/fundamental_concept.rs:1-20

Dual-Language Architecture

The repository employs a dual-language design that leverages the strengths of both Rust and Python:

| Layer | Language | Primary Responsibilities | Key Reason |
| --- | --- | --- | --- |
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, XML/JSON parsing | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |

Rust Layer (sec-fetcher)

Python Layer (narrative_stack)

  • Ingests data generated by the Rust layer.
  • Generates semantic embeddings for concept/unit pairs.
  • Trains Stage1Autoencoder models using PyTorch Lightning.

Sources: Cargo.toml:1-42 README.md:15-50

High-Level Data Flow

The following diagram bridges the gap between the natural language data flow and the specific code entities responsible for each stage.

```mermaid
graph TB
    SEC["SEC EDGAR API\n(company_tickers.json, submissions, archives)"]
    SecClient["SecClient\n(src/network/sec_client.rs)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_filings\nfetch_nport_filings\n(src/network/mod.rs)"]
    Parsers["Parsers\nparse_13f_xml\nparse_nport_xml\n(src/parsers/mod.rs)"]
    Ops["Operations Logic\nrender_filing\nfetch_and_render\n(src/ops/filing_ops.rs)"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\n(src/models/mod.rs)"]
    Views["Views\nMarkdownView\nEmbeddingTextView\n(src/views/mod.rs)"]
    PythonIngest["Python narrative_stack\nData Ingestion & Training"]

    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Parsers
    Parsers --> Models
    NetworkFuncs --> Ops
    Ops --> Views
    Models --> PythonIngest
```

System Data Flow and Code Entities

Data Flow Summary:

  1. Fetch: SecClient retrieves data from SEC EDGAR API endpoints using structured Url variants. src/enums/url_enum.rs:5-116
  2. Parse: Raw XML/JSON data is processed by specialized parsers (e.g., parse_13f_xml) into internal models. src/parsers/thirteen_f.rs:1-10
  3. Transform/Render: Filings are rendered into human-readable or machine-learning-ready formats via render_filing. src/ops/filing_ops.rs:118-130
  4. Analyze: Normalized data is passed to the Python layer for ML training and embedding generation.

Sources: src/network/sec_client.rs:1-50 src/enums/url_enum.rs:5-116 src/ops/filing_ops.rs:118-130 README.md:110-140

```mermaid
graph TB
    main["main.rs / lib.rs\nEntry Points"]
    config["config module\nConfigManager\nAppConfig\n(src/config/mod.rs)"]
    network["network module\nSecClient\nThrottlePolicy\n(src/network/mod.rs)"]
    ops["ops module\nrender_filing\nholdings operations\n(src/ops/mod.rs)"]
    models["models module\nTicker, Cik, AccessionNumber\n(src/models/mod.rs)"]
    enums["enums module\nFundamentalConcept\nUrl, FormType\n(src/enums/mod.rs)"]
    parsers["parsers module\nXML/JSON parsers\n(src/parsers/mod.rs)"]
    views["views module\nMarkdownView\n(src/views/mod.rs)"]

    main --> config
    main --> network
    main --> ops
    network --> config
    network --> models
    network --> enums
    ops --> network
    ops --> views
    ops --> models
    parsers --> models
    models --> enums
```

Core Module Structure

The Rust codebase is modularized to separate networking, data modeling, and business logic.

Module Dependency Graph

| Module | Primary Purpose | Key Exports |
| --- | --- | --- |
| config | Configuration management and credential loading | ConfigManager, AppConfig src/config/mod.rs |
| network | HTTP client, data fetching, and throttling | SecClient, fetch_filings, fetch_company_profile src/network/mod.rs |
| ops | High-level business logic and data orchestration | render_filing, fetch_and_render src/ops/mod.rs |
| models | Domain-specific data structures | Ticker, Cik, CikSubmission, NportInvestment src/models/mod.rs |
| enums | Type-safe enumerations for SEC concepts | FundamentalConcept, FormType, Url src/enums/mod.rs |
| parsers | Logic for converting SEC formats to Rust structs | parse_13f_xml, parse_nport_xml src/parsers/mod.rs |
| views | Rendering logic for different output formats | MarkdownView, EmbeddingTextView src/views/mod.rs |

Sources: Cargo.toml:15-35 src/enums/url_enum.rs:1-5

Key Technologies

Rust Dependencies:

  • tokio: Async runtime for non-blocking I/O. Cargo.toml:79
  • reqwest: Robust HTTP client with JSON support. Cargo.toml:66
  • polars: High-performance DataFrame library for data manipulation. Cargo.toml:61
  • quick-xml: Fast XML parsing for SEC filing documents. Cargo.toml:62
  • html-to-markdown-rs: Converts SEC HTML filings to Markdown. Cargo.toml:55
  • simd-r-drive: Integration with a high-performance data store. Cargo.toml:74

Python Dependencies (narrative_stack):

  • PyTorch & PyTorch Lightning (ML training)
  • scikit-learn (Preprocessing and PCA)
  • BERT (Contextual embeddings for concept clustering)

Sources: Cargo.toml:44-82

Next Steps

  • Getting Started : Learn how to configure your SEC contact email and run your first lookup. See [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started)
  • Rust Architecture : Dive deep into the SecClient and networking layer. See [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application)
  • ML Pipeline : Explore the autoencoder training and US GAAP alignment. See [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System)

Sources: README.md:9-15


Getting Started

Relevant source files

This page guides you through installing, configuring, and running the rust-sec-fetcher application. It covers building the Rust binary, setting up required credentials, and executing your first data fetch. For detailed configuration options, see Configuration System. For comprehensive examples, see Running Examples.

The rust-sec-fetcher is the Rust component of a dual-language system. It fetches and transforms SEC financial data into structured formats. The companion Python system (Python narrative_stack System) processes these files for machine learning applications.


Prerequisites

Before installation, ensure you have:

| Requirement | Purpose | Notes |
| --- | --- | --- |
| Rust 1.87+ | Compile sec-fetcher | Edition 2024 features required Cargo.toml:6 |
| Email Address | SEC EDGAR API access | Required by SEC for User-Agent identification README.md:13-14 |
| Disk Space | Cache and data storage | Default location: data/ directory README.md:204-205 |
| Internet Connection | SEC API access | Throttled to stay within SEC guidelines README.md:5-8 |

Optional Components:

Sources: Cargo.toml:1-82 README.md:1-205


Installation

Clone Repository

Build from Source

The project includes several specialized binaries for data maintenance and bulk pulling. Cargo.toml:15-35

Verify Installation

Run the configuration display example to ensure the environment is ready:

Installation and Setup Flow

Sources: Cargo.toml:15-35 README.md:51-53 src/enums/url_enum.rs:30-52


Basic Configuration

The application uses ConfigManager to coordinate settings from files, environment variables, and the system keychain.

Configuration Entry Points

Every program using the library initializes through one of two patterns README.md:17-49:

  1. ConfigManager::load() : The standard approach. It reads from local config files or environment variables and handles interactive email prompts if missing.
  2. ConfigManager::from_app_config(&AppConfig { ... }): Used for hard-coding values directly in specialized tools.

Required Email Identification

The SEC mandates a contact email in the User-Agent header README.md:13-14. sec-fetcher provides multiple ways to supply this:

  • Interactive: Prompted on first run and stored via keyring (if the feature is enabled) Cargo.toml:42
  • Environment: Setting SEC_FETCHER_EMAIL.
  • Code: Passing it directly to AppConfig.
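A minimal sketch of this fallback order (a configured value wins over the environment variable, matching the documented precedence). The exact User-Agent layout shown is an assumption for illustration; the real header is assembled inside SecClient from app_name, app_version, and email:

```rust
use std::env;

// Sketch of email resolution: prefer an explicitly configured email
// (e.g. from the TOML file or AppConfig), then fall back to the
// SEC_FETCHER_EMAIL environment variable.
fn resolve_email(configured: Option<&str>) -> Option<String> {
    configured
        .map(str::to_string)
        .or_else(|| env::var("SEC_FETCHER_EMAIL").ok())
}

// The User-Agent format below is an illustrative assumption, not the
// crate's exact header string.
fn build_user_agent(app_name: &str, app_version: &str, email: &str) -> String {
    format!("{app_name}/{app_version} ({email})")
}

fn main() {
    let email = resolve_email(Some("admin@example.com")).expect("no email configured");
    println!("{}", build_user_agent("sec-fetcher", "0.1.0", &email));
}
```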

Configuration Loading Logic

Sources: README.md:15-49 Cargo.toml:36-43 src/enums/url_enum.rs:121-163


Running Your First Data Fetch

Example: Ticker to CIK Lookup

The library maps human-readable tickers to SEC Central Index Keys (CIKs) using fetch_cik_by_ticker_symbol README.md:63-71

Example: Fetch and Render Filings

You can retrieve specific forms (like 10-K or 8-K) and render them into Markdown for LLM processing or human reading README.md:112-140

Example: Fund Holdings (13F/N-PORT)

The library can parse complex XML filings into structured data README.md:195-202

Data Retrieval Flow

Sources: README.md:112-140 src/enums/url_enum.rs:26-29 src/enums/url_enum.rs:146-150


Data Output & Caching

sec-fetcher is designed to be a “good citizen” on the SEC servers:

  • Throttling : Automatically limits requests to 10 per second per SEC policy README.md:5-8
  • Caching : Uses simd-r-drive to cache raw responses locally, reducing redundant network traffic Cargo.toml:74-75
  • Structured Storage : Data is typically organized by CIK or Ticker in the data/ directory.

Specialized Binaries

For bulk operations, use the provided binaries Cargo.toml:15-35:

  • pull-us-gaap-bulk: Downloads large-scale XBRL datasets for ML training.
  • pull-fund-holdings: Aggregates holdings from multiple investment companies.
  • refresh-test-fixtures: Updates local mock data for the test suite.

For details on the transformation of these datasets, see US GAAP Concept Transformation.

Sources: Cargo.toml:15-35 README.md:204-205 src/enums/url_enum.rs:54-55


Configuration System

Relevant source files

This document describes the configuration management system in the Rust sec-fetcher application. The configuration system provides a flexible, multi-source approach to managing application settings including SEC API credentials, request throttling parameters, and cache directories.

For information about how the configuration integrates with the caching system, see Caching & Storage System. For details on credential storage mechanisms, see the credential management section below.


Overview

The configuration system consists of three primary components:

| Component | Purpose | File Location |
| --- | --- | --- |
| AppConfig | Data structure holding all configuration fields with JSON schema support | src/config/app_config.rs:16-41 |
| ConfigManager | Loads, merges, and validates configuration from multiple sources | src/config/config_manager.rs:95-101 |
| CredentialManager | Handles email credential storage and retrieval via system keyring | src/config/credential_manager.rs:19-22 |

The system supports configuration from four sources with increasing priority:

  1. Default values - Hard-coded defaults in AppConfig::default() src/config/app_config.rs:43-62
  2. Environment Variables - SEC_FETCHER_EMAIL, SEC_FETCHER_APP_NAME, etc. src/config/config_manager.rs:56-88
  3. TOML configuration file - User-specified settings loaded from disk src/config/config_manager.rs:168-171
  4. Interactive prompts - Credential collection when running in interactive mode src/config/credential_manager.rs:81-127

Sources: src/config/app_config.rs:16-62 src/config/config_manager.rs:56-155


Configuration Structure

AppConfig Fields

All fields in AppConfig are wrapped in Option<T> to support partial configuration and merging strategies. The #[merge(strategy = overwrite_option)] attribute ensures that non-None values from user configuration always replace default values.

Sources: src/config/app_config.rs:16-41 src/config/app_config.rs:43-62


Configuration Loading Flow

ConfigManager Initialization

Sources: src/config/config_manager.rs:107-182 src/config/credential_manager.rs:81-127


Configuration File Format

TOML Structure

The configuration file uses TOML format with strict schema validation. Any unrecognized keys will cause a descriptive error listing all valid keys and their types using schemars integration.

Example configuration file (sec_fetcher_config.toml):
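A minimal illustrative example using the keys the schema validator reports (all values below are placeholders, not recommendations):

```toml
# sec_fetcher_config.toml — all keys are optional; omitted keys fall
# back to AppConfig::default(). Values shown here are illustrative.
email = "you@example.com"
app_name = "sec-fetcher"
app_version = "0.1.0"
max_concurrent = 1
min_delay_ms = 500
max_retries = 3
cache_base_dir = "data/cache"
```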

Configuration File Locations

The system searches for configuration files in the following order:

| Priority | Location | Description |
| --- | --- | --- |
| 1 | User-provided path | Passed to ConfigManager::from_config(Some(path)) |
| 2 | System config directory | ~/.config/sec_fetcher_config.toml (Unix) |
| 3 | Current directory | ./sec_fetcher_config.toml |

Sources: src/config/config_manager.rs:103-104 src/config/config_manager.rs:129-131 src/config/config_manager.rs:166-170


Credential Management

Email Credential Requirement

The SEC EDGAR API mandates a valid email address in the HTTP User-Agent header for identification. The configuration system resolves this using the following precedence:

Interactive Mode Detection:

```mermaid
graph TD
    Start["ConfigManager Resolution"]
    CheckArg["App Identity Override?"]
    CheckFile["Email in TOML Config?"]
    CheckEnv["SEC_FETCHER_EMAIL Env Var?"]
    CheckInteractive{"is_interactive_mode()?"}
    PromptUser["CredentialManager::from_prompt()"]
    ReadKeyring["Read from system keyring"]
    PromptInput["Prompt user for email"]
    SaveKeyring["Save to keyring"]
    SetEmail["Set email in ConfigManager"]
    Error["Return Error:\nCould not obtain email credential"]

    Start --> CheckArg
    CheckArg -->|No| CheckFile
    CheckFile -->|No| CheckEnv
    CheckEnv -->|No| CheckInteractive
    CheckInteractive -->|Yes| PromptUser
    CheckInteractive -->|No| Error
    PromptUser --> ReadKeyring
    ReadKeyring -->|Found| SetEmail
    ReadKeyring -->|Not found| PromptInput
    PromptInput --> SaveKeyring
    SaveKeyring --> SetEmail
```

Sources: src/config/config_manager.rs:41-56 src/config/credential_manager.rs:32-76


Configuration Merging Strategy

Merge Behavior

The AppConfig struct uses the merge crate with a custom overwrite_option strategy defined in src/config/app_config.rs:8-12:

Merge Rules:

  1. Some(new_value) always replaces Some(old_value) src/config/app_config.rs:9-11
  2. Some(new_value) always replaces None
  3. None never replaces Some(old_value)

This ensures user-provided values take absolute precedence over defaults while allowing partial configuration.
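The three rules above can be expressed as a small, self-contained sketch (mirroring the described behavior, not the crate's exact code):

```rust
// Sketch of the `overwrite_option` merge strategy: a non-None incoming
// value always wins; None never clobbers an existing value.
fn overwrite_option<T>(left: &mut Option<T>, right: Option<T>) {
    if right.is_some() {
        *left = right;
    }
}

fn main() {
    let mut email: Option<String> = Some("default@example.com".into());

    // Rule 3: None never replaces Some(old_value).
    overwrite_option(&mut email, None);
    assert_eq!(email.as_deref(), Some("default@example.com"));

    // Rule 1: Some(new_value) always replaces Some(old_value).
    overwrite_option(&mut email, Some("user@example.com".into()));
    println!("merged email: {:?}", email);
}
```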

Sources: src/config/app_config.rs:8-12 src/config/config_manager.rs:172-182


Schema Validation and Error Handling

Invalid Key Detection

The configuration system uses #[serde(deny_unknown_fields)] to reject unknown keys in TOML files src/config/app_config.rs:15. When an invalid key is detected, the error message includes a complete list of valid keys with their types extracted via schemars.

```mermaid
graph LR
    TOML["TOML File with\ninvalid_key"]
    Deserialize["config.try_deserialize()"]
    ExtractSchema["AppConfig::get_valid_keys()"]
    Schema["schemars::schema_for!(AppConfig)"]
    FormatError["Format error message with\nvalid keys and types"]
    Error["Return descriptive error"]

    TOML --> Deserialize
    Deserialize -->|Error| ExtractSchema
    ExtractSchema --> Schema
    Schema --> FormatError
    FormatError --> Error
```

Example error output:

```
unknown field `invalid_key`, expected one of `email`, `app_name`, `app_version`, `max_concurrent`, `min_delay_ms`, `max_retries`, `cache_base_dir`

Valid configuration keys are:
  - email (String | Null)
  - app_name (String | Null)
  - app_version (String | Null)
  - max_concurrent (Integer | Null)
  - min_delay_ms (Integer | Null)
  - max_retries (Integer | Null)
  - cache_base_dir (String | Null)
```

Sources: src/config/app_config.rs:69-83 src/config/config_manager.rs:175-182 tests/config_manager_tests.rs:154-181


Usage Examples

Loading Configuration

Default configuration path:

Custom configuration path:

Overriding App Identity:

Sources: src/config/config_manager.rs:108-110 src/config/config_manager.rs:129-131 src/config/config_manager.rs:151-155


Cache Isolation

The ConfigManager manages the lifetime of the cache directory. If cache_base_dir is not provided in the configuration, it creates a unique tempfile::TempDir src/config/app_config.rs:45-48

This ensures that tests and ephemeral runs have fully isolated cache storage that is cleaned up when the ConfigManager is dropped src/config/app_config.rs:47-48
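The lifecycle idea (create a unique directory on construction, delete it on drop) can be sketched with the standard library alone; the real implementation uses the tempfile crate's TempDir, so this is an illustration, not the crate's code:

```rust
use std::fs;
use std::path::PathBuf;
use std::time::{SystemTime, UNIX_EPOCH};

// Simplified illustration of per-run cache isolation: when no
// cache_base_dir is configured, create a unique directory under the
// system temp dir and remove it when the owner is dropped.
struct EphemeralCacheDir {
    path: PathBuf,
}

impl EphemeralCacheDir {
    fn new() -> std::io::Result<Self> {
        // A timestamp-based suffix keeps concurrent runs from colliding
        // (tempfile::TempDir does this more robustly).
        let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos();
        let path = std::env::temp_dir().join(format!("sec_fetcher_cache_{nanos}"));
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }
}

impl Drop for EphemeralCacheDir {
    fn drop(&mut self) {
        // Best-effort cleanup when the owner (e.g. ConfigManager) drops.
        let _ = fs::remove_dir_all(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    let cache = EphemeralCacheDir::new()?;
    println!("ephemeral cache at {}", cache.path.display());
    Ok(())
}
```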

Sources: src/config/app_config.rs:43-62 src/config/config_manager.rs:98-100


Testing

Test Coverage

The configuration system includes unit tests in tests/config_manager_tests.rs covering:

| Test Function | Purpose |
| --- | --- |
| test_load_custom_config | Verifies loading from a custom TOML file path tests/config_manager_tests.rs:46 |
| test_fails_on_invalid_key | Validates schema enforcement and error messages tests/config_manager_tests.rs:154 |
| test_email_from_env_var_non_interactive | Checks SEC_FETCHER_EMAIL resolution tests/config_manager_tests.rs:103 |
| test_config_file_email_takes_precedence | Validates priority of TOML over env vars tests/config_manager_tests.rs:130 |

Sources: tests/config_manager_tests.rs:1-207


Running Examples

Relevant source files

This page demonstrates how to run the example programs included in the rust-sec-fetcher repository. These programs illustrate the library’s capabilities, ranging from simple company lookups to complex filing rendering and EDGAR feed polling.

Example Programs Overview

The repository includes a variety of example programs in the examples/ directory. These serve as functional documentation for the system’s core modules.

| Example Program | Primary Purpose | Key Functions / Models |
| --- | --- | --- |
| config_show | Verify active configuration and credentials | ConfigManager::load, AppConfig::pretty_print |
| company_search | Fuzzy-match names against the SEC ticker list | fetch_company_tickers, Ticker::get_by_fuzzy_matched_name |
| company_show | Display full SEC profile and industry metadata | fetch_company_profile, fetch_company_description |
| cik_show | Look up CIK and find specific recent filings | fetch_cik_by_ticker_symbol, CikSubmission::most_recent_by_form |
| 8k_list | List all 8-K filings for a ticker with URLs | fetch_8k_filings, as_primary_document_url |
| filing_render | Fetch and render any SEC URL to clean text | fetch_and_render, MarkdownView, EmbeddingTextView |
| filing_show | Render primary documents and/or exhibits | render_filing, fetch_filings |
| edgar_feed_poll | Delta-poll the live SEC Atom feed | fetch_edgar_feeds_since, FeedEntry |
| edgar_index_browse | Browse historical quarterly master indexes | fetch_edgar_master_index, MasterIndexEntry |
| fund_series_list | List registered mutual funds and share classes | fetch_investment_company_series_and_class_dataset |

Sources: examples/config_show.rs:1-12 examples/company_search.rs:1-9 examples/company_show.rs:1-14 examples/cik_show.rs:1-15 examples/8k_list.rs:1-10 examples/filing_render.rs:1-12 examples/filing_show.rs:1-7 examples/edgar_feed_poll.rs:1-36 examples/edgar_index_browse.rs:1-21 examples/fund_series_list.rs:1-14

System Integration Architecture

Diagram: Example Program Data Flow

Sources: examples/company_search.rs:46-49 examples/company_show.rs:88-109 examples/filing_render.rs:67-79 examples/edgar_feed_poll.rs:144-157

company_search: Fuzzy Matching

Purpose: Resolves ambiguous company names or ticker symbols against the official SEC ticker list using tokenization and fuzzy matching.

Usage:

Implementation:

  1. It calls fetch_company_tickers to get the master list examples/company_search.rs:49
  2. It tokenizes the query using Ticker::tokenize_company_name examples/company_search.rs:43
  3. It performs an exact match check before falling back to Ticker::get_by_fuzzy_matched_name examples/company_search.rs:58-78
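The exact-match-then-fuzzy pattern can be illustrated with a simplified, self-contained sketch; a crude token-overlap score stands in here for the library's real fuzzy matcher (Ticker::get_by_fuzzy_matched_name), so the scoring logic is an assumption for illustration:

```rust
// Normalize a company name into lowercase alphanumeric tokens,
// loosely mirroring the idea of Ticker::tokenize_company_name.
fn tokenize(name: &str) -> Vec<String> {
    name.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

// Exact (case-insensitive) match first, then fall back to the company
// with the highest token overlap. Returns the matched ticker symbol.
fn search<'a>(query: &str, companies: &'a [(&'a str, &'a str)]) -> Option<&'a str> {
    if let Some(t) = companies
        .iter()
        .find(|(t, n)| t.eq_ignore_ascii_case(query) || n.eq_ignore_ascii_case(query))
        .map(|(t, _)| *t)
    {
        return Some(t);
    }
    let q = tokenize(query);
    companies
        .iter()
        .map(|&(t, n)| {
            let overlap = tokenize(n).iter().filter(|tok| q.contains(*tok)).count();
            (overlap, t)
        })
        .filter(|(overlap, _)| *overlap > 0)
        .max_by_key(|(overlap, _)| *overlap)
        .map(|(_, t)| t)
}

fn main() {
    let companies = [("AAPL", "Apple Inc."), ("MSFT", "Microsoft Corporation")];
    println!("{:?}", search("microsoft corp", &companies));
}
```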

company_show: Detailed Metadata

Purpose: Aggregates data from multiple SEC endpoints to build a comprehensive profile, including SIC codes, industry descriptions, and fiscal year details.

Usage:

Implementation:

Sources: examples/company_search.rs:1-83 examples/company_show.rs:1-150

Filing Retrieval and Rendering

filing_render: Generic URL Rendering

Purpose: Fetches any arbitrary SEC document URL and converts it to clean text using a specified FilingView.

Usage:

Implementation: The program uses fetch_and_render, passing either MarkdownView or EmbeddingTextView examples/filing_render.rs:76-79

filing_show: Smart Part Selection

Purpose: Automatically finds the most recent filing of a specific type (e.g., 10-K, 8-K) and renders the body, exhibits, or both.

Usage:

Implementation:

  • It uses fetch_filings to locate the target document examples/filing_show.rs:97
  • It calls ops::render_filing, which handles the complexity of identifying “substantive” exhibits (filtering out SOX certifications, XBRL schemas, and graphics) examples/filing_show.rs:115

Sources: examples/filing_render.rs:1-84 examples/filing_show.rs:1-158

Live Feeds and Historical Indexes

edgar_feed_poll: Real-time Delta Polling

Purpose: Monitors the SEC Atom feed for new filings. It supports “delta mode,” where it only shows filings strictly newer than a provided high-water-mark timestamp.

Usage:

Implementation: The core logic resides in fetch_edgar_feeds_since, which handles pagination and timestamp filtering examples/edgar_feed_poll.rs:157. It identifies special routes like earnings releases by checking is_earnings_release() on FeedEntry examples/edgar_feed_poll.rs:84
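The high-water-mark filter can be sketched without the feed machinery. This illustrates the delta idea, not the code of fetch_edgar_feeds_since; it assumes same-format RFC 3339 UTC timestamps, for which lexicographic order matches chronological order, so no date crate is needed:

```rust
// Keep only feed entries strictly newer than the high-water mark.
// Each entry is (updated_timestamp, title); timestamps are assumed to be
// RFC 3339 UTC strings like "2024-05-01T12:00:00Z".
fn entries_since<'a>(entries: &'a [(&'a str, &'a str)], high_water: &str) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|(updated, _)| *updated > high_water)
        .map(|(_, title)| *title)
        .collect()
}

fn main() {
    let feed = [
        ("2024-05-01T12:00:00Z", "8-K Example Corp"),
        ("2024-05-01T12:05:00Z", "10-Q Sample Inc"),
    ];
    // Only the entry after the mark survives; the caller would then
    // advance the mark to the newest timestamp it has seen.
    println!("{:?}", entries_since(&feed, "2024-05-01T12:00:00Z"));
}
```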

edgar_index_browse: Historical Backfills

Purpose: Accesses the quarterly master.idx files, which contain every filing since 1993.

Usage:

Implementation: It uses fetch_edgar_master_index to download and parse the pipe-delimited index file for the requested year and quarter examples/edgar_index_browse.rs:113
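EDGAR master.idx data rows are pipe-delimited with the shape CIK|Company Name|Form Type|Date Filed|Filename. A minimal illustrative parser for one such row (the struct here is a sketch, not the crate's MasterIndexEntry; the sample filename is a placeholder):

```rust
// Parsed representation of one master.idx data row.
#[derive(Debug, PartialEq)]
struct IndexEntry {
    cik: u64,
    company: String,
    form_type: String,
    date_filed: String,
    filename: String,
}

// Returns None for header lines, separators, or malformed rows.
fn parse_index_line(line: &str) -> Option<IndexEntry> {
    let mut fields = line.split('|');
    Some(IndexEntry {
        cik: fields.next()?.trim().parse().ok()?,
        company: fields.next()?.trim().to_string(),
        form_type: fields.next()?.trim().to_string(),
        date_filed: fields.next()?.trim().to_string(),
        filename: fields.next()?.trim().to_string(),
    })
}

fn main() {
    let line = "320193|Apple Inc.|10-K|2023-11-03|edgar/data/320193/sample.txt";
    println!("{:?}", parse_index_line(line));
}
```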

Sources: examples/edgar_feed_poll.rs:1-209 examples/edgar_index_browse.rs:1-186

Code Entity Mapping

Diagram: Example Program CLI to Model Mapping

Sources: examples/cik_show.rs:33 examples/8k_list.rs:44 examples/edgar_feed_poll.rs:119-129 examples/edgar_index_browse.rs:98-99

Execution Patterns

All examples share a standard initialization sequence:

  1. Config Loading: ConfigManager::load() is called to resolve environment variables and .toml files examples/config_show.rs:32
  2. Client Setup: SecClient::from_config_manager(&config) initializes the HTTP client with required User-Agent headers and rate limiting examples/company_show.rs:89
  3. Async Runtime: Examples use the #[tokio::main] attribute to manage asynchronous network calls examples/cik_show.rs:30

Common CLI Pattern: Most examples use clap for argument parsing, providing consistent help menus and type validation examples/filing_show.rs:54-69

Sources: examples/config_show.rs:26-37 examples/company_show.rs:83-100 examples/cik_show.rs:30-92


Rust sec-fetcher Application

Relevant source files

Purpose and Scope

This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured formats. The application serves as the high-performance data collection and preprocessing layer in a larger system that combines Rust’s safety and speed for I/O operations with Python’s machine learning capabilities.

This page covers the high-level architecture, module organization, and data flow patterns. For detailed information about specific components, see:

Sources: src/lib.rs:1-12 src/network.rs:1-47

Application Architecture

The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format (often using polars DataFrames), and output structured data for downstream consumption.

```mermaid
graph TB
    subgraph "src/lib.rs Module Organization"
        config["config\nConfigManager, AppConfig"]
        enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
        models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
        network["network\nSecClient, fetch_* functions"]
        ops["ops\nrender_filing, fetch_and_render"]
        parsers["parsers\nXML/JSON parsing utilities"]
        caches["caches\nInternal caching infrastructure"]
        views["views\nMarkdownView, EmbeddingTextView"]
        utils["utils\nVecExtensions, helpers"]
    end

    subgraph "External Dependencies"
        reqwest["reqwest\nHTTP client"]
        polars["polars\nDataFrame operations"]
        simd["simd-r-drive\nDrive-based cache storage"]
        tokio["tokio\nAsync runtime"]
        serde["serde\nSerialization"]
    end

    config --> caches
    network --> config
    network --> caches
    network --> models
    network --> enums
    network --> parsers
    ops --> network
    ops --> views
    parsers --> models

    network --> reqwest
    network --> simd
    network --> tokio
    network --> polars
    models --> serde
```

Module Structure

The application is organized into several core modules as declared in the library root:

| Module | Purpose | Key Components |
| --- | --- | --- |
| config | Configuration management and credential handling | ConfigManager, AppConfig |
| enums | Type-safe enumerations for domain concepts | FundamentalConcept, Url, CacheNamespacePrefix, FormType |
| models | Data structures representing SEC entities | Ticker, CikSubmission, NportInvestment, AccessionNumber |
| network | HTTP client and data fetching functions | SecClient, fetch_company_tickers, fetch_us_gaap_fundamentals |
| ops | Higher-level business logic and workflows | render_filing, fetch_and_render, diff_holdings |
| parsers | XML/JSON parsing utilities | parse_us_gaap_fundamentals, parse_cik_submissions_json |
| caches | Internal caching infrastructure | Caches (singleton), HTTP cache, preprocessor cache |
| views | Rendering logic for filing data | MarkdownView, EmbeddingTextView |
| normalize | Data normalization and cleaning | Pct type, 13F normalization logic |

Sources: src/lib.rs:1-12 src/network.rs:1-47 src/network/fetch_us_gaap_fundamentals.rs:1-108

Data Flow Architecture

Request-Response Flow with Caching

The data flow follows a pipeline pattern:

  1. Request Initiation: High-level operations in ops or CLI binaries call specific fetching functions like fetch_us_gaap_fundamentals src/network/fetch_us_gaap_fundamentals.rs:54-58
  2. Client Middleware: SecClient applies throttling and caching policies before making HTTP requests src/network/sec_client.rs:1-10
  3. Cache Check: The system checks simd-r-drive storage for cached responses based on CacheNamespacePrefix.
  4. API Request: If a cache miss occurs, the request is sent to the SEC EDGAR API (e.g., the CompanyFacts endpoint src/network/fetch_us_gaap_fundamentals.rs:62-67).
  5. Parsing: Raw JSON/XML is converted into structured models or DataFrames via the parsers module src/network/fetch_us_gaap_fundamentals.rs:69
  6. Enrichment: Data is often cross-referenced; for example, fundamentals are joined with submission data to resolve primary document URLs src/network/fetch_us_gaap_fundamentals.rs:74-105

Sources: src/network/fetch_us_gaap_fundamentals.rs:54-108 src/network.rs:1-47

Key Dependencies and Technology Stack

The application leverages modern Rust crates for performance and reliability:

| Category | Crate | Purpose |
| --- | --- | --- |
| Async Runtime | tokio | Asynchronous I/O and task scheduling |
| HTTP Client | reqwest | Underlying HTTP engine for SecClient |
| Data Frames | polars | High-performance data manipulation, especially for US GAAP data src/network/fetch_us_gaap_fundamentals.rs:9 |
| Caching | simd-r-drive | WebSocket-based key-value storage for persistent caching |
| Serialization | serde | JSON/CSV serialization and deserialization |
| XML Parsing | quick-xml | Fast parsing for SEC XML filings (13F, N-PORT, Form 4) |

Sources: src/network/fetch_us_gaap_fundamentals.rs:1-10 src/lib.rs:1-12

Module Interaction Patterns

US GAAP Data Retrieval Example

The interaction between modules is best exemplified by the US GAAP fundamentals retrieval process:

  1. Network Module: fetch_us_gaap_fundamentals is called src/network/fetch_us_gaap_fundamentals.rs:54
  2. Models Module: It uses Cik::get_company_cik_by_ticker_symbol to resolve the ticker src/network/fetch_us_gaap_fundamentals.rs:60
  3. Enums Module: It constructs the target URL using Url::CompanyFacts src/network/fetch_us_gaap_fundamentals.rs:62
  4. Parsers Module: It delegates the raw JSON to parsers::parse_us_gaap_fundamentals src/network/fetch_us_gaap_fundamentals.rs:69
  5. Network (Sub-call): It calls fetch_cik_submissions to enrich the data with filing URLs src/network/fetch_us_gaap_fundamentals.rs:74

Sources: src/network/fetch_us_gaap_fundamentals.rs:54-108

Error Handling Strategy

The application uses a layered error handling approach:

  • Network Layer : Handles transient HTTP errors and rate limiting via retries and throttling.
  • Parsing Layer : Returns specific error types (e.g., CikError, AccessionNumberError) when SEC data doesn’t match expected formats.
  • Operations Layer : Often implements “non-fatal” logic, where a failure to fetch secondary data (like submissions for URL enrichment) results in a warning rather than a process crash src/network/fetch_us_gaap_fundamentals.rs:101-105
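The "non-fatal" pattern for secondary data can be sketched in a few lines; this is an illustration of the behavior described above, not the actual enrichment code in fetch_us_gaap_fundamentals:

```rust
// Sketch of non-fatal enrichment: if fetching secondary data fails,
// log a warning and continue with the primary data instead of aborting.
fn enrich(primary: Vec<String>, secondary: Result<Vec<String>, String>) -> Vec<String> {
    match secondary {
        Ok(extra) => primary.into_iter().chain(extra).collect(),
        Err(e) => {
            // Warn instead of propagating: secondary data is optional.
            eprintln!("warning: enrichment skipped: {e}");
            primary
        }
    }
}

fn main() {
    let rows = enrich(
        vec!["fundamentals_row".into()],
        Err("submissions fetch failed".into()),
    );
    println!("{} row(s) survive the failed enrichment", rows.len());
}
```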

Sources: src/network/fetch_us_gaap_fundamentals.rs:101-105 src/models.rs:1-5


Network Layer & SecClient

Relevant source files

Purpose and Scope

This page documents the network layer of the rust-sec-fetcher application, specifically the SecClient HTTP client and its associated infrastructure. The SecClient provides the foundational HTTP communication layer for all SEC EDGAR API interactions, implementing throttling, caching, and retry logic to ensure reliable and compliant data fetching.

For information about the specific network fetching functions that use SecClient, see Data Fetching Functions. For details on the caching system architecture, see Caching & Storage System.


SecClient Architecture Overview

Component Diagram

Sources: src/network/sec_client.rs:14-108 src/network/sec_client.rs:159-214


SecClient Structure and Initialization

SecClient Fields

The SecClient struct maintains core components for networking and caching:

| Field | Type | Purpose |
|---|---|---|
| email | String | Contact email for SEC User-Agent header src/network/sec_client.rs:15 |
| http_client | ClientWithMiddleware | reqwest client with middleware stack src/network/sec_client.rs:18 |
| cache_policy | Arc<CachePolicy> | Shared cache configuration src/network/sec_client.rs:19 |
| throttle_policy | Arc<ThrottlePolicy> | Shared throttle configuration src/network/sec_client.rs:20 |
| preprocessor_cache | Arc<DataStore> | Cache for processed/transformed data src/network/sec_client.rs:21 |

Sources: src/network/sec_client.rs:14-22

Construction from ConfigManager

The from_config_manager() constructor performs the following initialization sequence:

  1. Extract Configuration : Reads email, app_name, app_version, max_concurrent, min_delay_ms, and max_retries from AppConfig src/network/sec_client.rs:31-56
  2. Create CachePolicy : Configures cache with 1-week TTL (hardcoded) and disables header respect src/network/sec_client.rs:58-63
  3. Create ThrottlePolicy : Configures rate limiting based on max_concurrent and min_delay_ms with a 500ms adaptive jitter src/network/sec_client.rs:82-88
  4. Initialize Caches : Retrieves both HTTP and preprocessor cache stores from the ConfigManager src/network/sec_client.rs:90-91
  5. Build Middleware Stack : Combines the DataStore with policies using reqwest_drive helpers src/network/sec_client.rs:93-98

Sources: src/network/sec_client.rs:26-109


Throttle Policy & Compliance

The SEC’s public guidance for EDGAR states a maximum request rate of 10 requests/second. The SecClient chooses a conservative default of max_concurrent=1 and min_delay_ms=500 (~2 req/s) src/network/sec_client.rs:67-81

| Parameter | AppConfig Field | Default | Purpose |
|---|---|---|---|
| base_delay_ms | min_delay_ms | 500 | Minimum delay between requests src/network/sec_client.rs:83 |
| max_concurrent | max_concurrent | 1 | Concurrent request limit src/network/sec_client.rs:84 |
| max_retries | max_retries | 3 | Retry attempt limit src/network/sec_client.rs:85 |
| adaptive_jitter_ms | N/A | 500 | Randomized delay for retry backoff src/network/sec_client.rs:87 |

Sources: src/network/sec_client.rs:65-88
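As a sanity check on these defaults, the steady-state request rate implied by a minimum inter-request delay can be computed directly. This is a back-of-the-envelope sketch, not part of the library's API; jitter only adds extra delay, so it cannot raise the rate.

```rust
// Upper bound on sustained request rate given a minimum delay per request
// and a concurrency limit (illustrative helper, not the crate's code).
fn max_requests_per_second(min_delay_ms: u64, max_concurrent: u64) -> f64 {
    max_concurrent as f64 * 1000.0 / min_delay_ms as f64
}
```

With min_delay_ms=500 and max_concurrent=1 this yields 2 requests/second, well under the SEC's stated 10 requests/second ceiling.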


Request Flow and Caching

raw_request vs raw_request_bypass_cache

The client provides two primary ways to interact with the network:

  1. raw_request : The standard path. It uses the http_client which includes the CacheMiddleware. Requests are served from the on-disk cache if available and within TTL src/network/sec_client.rs:159-184
  2. raw_request_bypass_cache : Uses the CacheBypass extension to force a network fetch. This skips both reading from and writing to the cache while still respecting the ThrottlePolicy src/network/sec_client.rs:186-214

Request Pipeline

Sources: src/network/sec_client.rs:159-224 src/network/sec_client.rs:93-98


User-Agent Management

The get_user_agent() method generates a compliant User-Agent string as required by the SEC. It validates the email format every time it is called src/network/sec_client.rs:111-123

Format: {app_name}/{app_version} (+{email})
Example: sec-fetcher/0.1.0 (+user@example.com)

If the email provided in the configuration is invalid according to EmailAddress::is_valid, the client will panic src/network/sec_client.rs:115-118

Sources: src/network/sec_client.rs:111-123 tests/sec_client_tests.rs:7-23
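The documented format can be reproduced with ordinary string formatting. This is a sketch: the real client panics on an invalid email, while this illustrative helper returns a Result and uses a simplistic stand-in for the EmailAddress::is_valid check.

```rust
// Sketch of the documented User-Agent format: "{app_name}/{app_version} (+{email})".
fn build_user_agent(app_name: &str, app_version: &str, email: &str) -> Result<String, String> {
    // Minimal plausibility check standing in for EmailAddress::is_valid.
    if !email.contains('@') || email.starts_with('@') || email.ends_with('@') {
        return Err(format!("invalid contact email: {email}"));
    }
    Ok(format!("{app_name}/{app_version} (+{email})"))
}
```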


Fetch and Render Pipeline

The fetch_and_render function (exported in src/network.rs) provides a high-level orchestration for retrieving filing documents and converting them to specific views (e.g., Markdown).

  1. Fetch : Retrieves the raw document from EDGAR using SecClient.
  2. Parse : Normalizes the document structure.
  3. Render : Applies a View transformation (like MarkdownView).

Sources: src/network.rs:45-46 src/network/fetch_and_render.rs:1-10


Testing Infrastructure

SecClient logic is verified in tests/sec_client_tests.rs using mockito to simulate SEC API responses.

| Test Case | Code Entity | Purpose |
|---|---|---|
| test_user_agent | SecClient::get_user_agent | Verifies UA string construction tests/sec_client_tests.rs:7-23 |
| test_invalid_email_panic | SecClient::get_user_agent | Ensures panic on malformed email tests/sec_client_tests.rs:105-117 |
| test_fetch_json_with_retry_failure | SecClient::fetch_json | Verifies max_retries enforcement tests/sec_client_tests.rs:194-222 |
| test_fetch_json_with_retry_backoff | SecClient::fetch_json | Validates recovery after 500 error tests/sec_client_tests.rs:225-233 |

Sources: tests/sec_client_tests.rs:1-233

CLI Binaries

Relevant source files

This page documents the standalone binary programs provided by the rust-sec-fetcher crate. These tools are located in src/bin/ and serve specialized purposes ranging from test fixture maintenance and enum validation to bulk data extraction for machine learning pipelines.

1. refresh-test-fixtures

The refresh-test-fixtures utility automates the retrieval of real SEC EDGAR data to serve as test fixtures. It ensures that integration tests operate against authentic, version-controlled data without requiring live network access during test execution.

Purpose and Usage

This binary should be run whenever new test cases are added or when existing fixtures need to be updated to reflect modern EDGAR schema changes (e.g., the 2023 change in 13F value reporting).

Implementation Details

The program iterates through a hardcoded manifest of Fixture structs src/bin/refresh_test_fixtures.rs:55-63. Each fixture defines a TickerSymbol, an output filename, and a FixtureKind which determines the specific SEC endpoint to hit.

Key Components:

Fixture Generation Flow

Sources: src/bin/refresh_test_fixtures.rs:90-173 src/bin/refresh_test_fixtures.rs:178-240


2. check-form-type-coverage

This binary validates the completeness of the FormType enum against actual data in the EDGAR Master Index. It performs both “Forward” and “Reverse” coverage checks.

Coverage Logic

  1. Forward Check : Ensures every variant defined in the FormType enum (that isn’t marked as retired) actually appears in recent SEC filings src/bin/check_form_type_coverage.rs:16-19
  2. Reverse Check : Identifies any form types appearing frequently in the most recent quarter (above MINIMUM_FILINGS_THRESHOLD) that are not currently represented in the enum src/bin/check_form_type_coverage.rs:20-22
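The two checks amount to set differences in opposite directions, which can be sketched as follows. This is illustrative only: the real binary scans the EDGAR Master Index rather than an in-memory list, and its thresholds live in constants like MINIMUM_FILINGS_THRESHOLD.

```rust
use std::collections::HashSet;

// Illustrative two-way coverage check: enum variants vs. form types observed
// in recent filings, with a frequency threshold for the reverse pass.
fn coverage_gaps(
    enum_variants: &[&str],
    observed_counts: &[(&str, usize)],
    min_filings: usize,
) -> (Vec<String>, Vec<String>) {
    let observed: HashSet<&str> = observed_counts.iter().map(|&(form, _)| form).collect();
    let known: HashSet<&str> = enum_variants.iter().copied().collect();

    // Forward check: enum variants never seen in the scanned index.
    let unseen: Vec<String> = enum_variants
        .iter()
        .copied()
        .filter(|form| !observed.contains(form))
        .map(|form| form.to_string())
        .collect();

    // Reverse check: frequent form types missing from the enum.
    let missing: Vec<String> = observed_counts
        .iter()
        .filter(|&&(form, count)| count >= min_filings && !known.contains(form))
        .map(|&(form, _)| form.to_string())
        .collect();

    (unseen, missing)
}
```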

Usage

Technical Implementation

The program calculates the last_completed_quarter src/bin/check_form_type_coverage.rs:51-61 and scans backwards up to MAX_LOOKBACK_QUARTERS (default 8) src/bin/check_form_type_coverage.rs:34. It uses fetch_edgar_master_index to retrieve the list of all filings for those periods src/bin/check_form_type_coverage.rs:110.

Sources: src/bin/check_form_type_coverage.rs:1-40 src/bin/check_form_type_coverage.rs:72-146
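The last-completed-quarter arithmetic can be illustrated as follows. This is a sketch assuming calendar quarters ending in March, June, September, and December, not the binary's exact code.

```rust
// Returns (year, quarter) of the most recently completed calendar quarter.
// During Q1 (Jan-Mar) no quarter of the current year has completed yet,
// so the answer falls back to Q4 of the previous year.
fn last_completed_quarter(year: i32, month: u32) -> (i32, u32) {
    assert!((1..=12).contains(&month));
    let completed_this_year = (month - 1) / 3; // 0..=3 full quarters so far
    if completed_this_year == 0 {
        (year - 1, 4)
    } else {
        (year, completed_this_year)
    }
}
```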


3. pull-us-gaap-bulk

The pull-us-gaap-bulk binary is the primary data ingestion tool for the narrative_stack ML pipeline. It performs a massive sweep of all primary-listed companies and extracts their XBRL fundamentals.

Purpose and Usage

It fetches CompanyFacts for every ticker and flattens the complex JSON structure into a tabular CSV format suitable for training autoencoders or dimensionality reduction models.

Data Flow and Constraints

Sources: src/bin/pulls/us_gaap_bulk.rs:1-33 src/bin/pulls/us_gaap_bulk.rs:45-95


4. pull-fund-holdings

This binary targets the investment management domain, specifically fetching N-PORT holdings for all registered investment companies (ETFs, Mutual Funds).

Purpose and Usage

It iterates through the SEC’s investment company dataset, finds the latest N-PORT-P (monthly portfolio holdings) filing for each fund, and exports the holdings to CSV.

Implementation Logic

  1. Dataset Retrieval : Calls fetch_investment_company_series_and_class_dataset to get the master list of funds src/bin/pulls/fund_holdings.rs:74-75
  2. Filing Discovery : For each fund ticker, it resolves the CIK and calls fetch_nport_filings to find the most recent submission src/bin/pulls/fund_holdings.rs:96-114
  3. Holdings Extraction : It fetches the specific N-PORT XML, parses the investment table, and normalizes the data src/bin/pulls/fund_holdings.rs:126-134
  4. Partitioned Storage : To avoid directories with tens of thousands of files, it organizes output by the first letter of the ticker (e.g., data/fund-holdings/S/SPY.csv) src/bin/pulls/fund_holdings.rs:137-155

Fund Processing Pipeline

Sources: src/bin/pulls/fund_holdings.rs:1-38 src/bin/pulls/fund_holdings.rs:74-157
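Step 4's partitioning scheme can be sketched with a small path helper. The function name and base directory here are illustrative assumptions; only the layout (first-letter bucket, e.g. data/fund-holdings/S/SPY.csv) comes from the source.

```rust
use std::path::PathBuf;

// Illustrative first-letter partitioning: "SPY" -> <base>/S/SPY.csv.
// Keeps output directories from accumulating tens of thousands of files.
fn fund_holdings_path(base: &str, ticker: &str) -> PathBuf {
    let bucket = ticker
        .chars()
        .next()
        .map(|c| c.to_ascii_uppercase().to_string())
        .unwrap_or_else(|| "_".to_string());
    PathBuf::from(base).join(bucket).join(format!("{ticker}.csv"))
}
```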

Data Fetching Functions

Relevant source files

Purpose and Scope

This document describes the data fetching functions in the Rust sec-fetcher application. These functions provide the interface for retrieving financial data from the SEC EDGAR API, including company tickers, CIK lookups, submissions, company descriptions, SIC codes, EDGAR master index, NPORT filings, US GAAP fundamentals, and investment company datasets.

For information about the underlying HTTP client, throttling, and caching infrastructure, see [3.1 Network Layer & SecClient](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/3.1 Network Layer & SecClient) For details on how US GAAP data is transformed after fetching, see [3.4 US GAAP Concept Transformation](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/3.4 US GAAP Concept Transformation) For information about the data structures returned by these functions, see [3.5 Data Models & Enumerations](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/3.5 Data Models & Enumerations)

Overview of Data Fetching Architecture

The network module provides specialized fetching functions that retrieve different types of financial data from the SEC EDGAR API. Each function accepts a SecClient reference and returns structured data types or Polars DataFrames.

Data Flow: Natural Language to Code Entity Space

The following diagram maps high-level data requirements to specific Rust functions and their corresponding SEC EDGAR endpoints.

Diagram: Mapping Data Requirements to Code Entities

graph TB
    subgraph "Data Requirement (Natural Language)"
        ReqTickers["'Get all stock symbols'"]
ReqCIK["'Find CIK for AAPL'"]
ReqDesc["'What does this company do?'"]
ReqGAAP["'Get revenue and net income'"]
ReqFeed["'What was filed today?'"]
ReqFunds["'Search for mutual fund CIKs'"]
end
    
    subgraph "Code Entity Space (Functions)"
        fetch_company_tickers["fetch_company_tickers()\nsrc/network/fetch_company_tickers.rs"]
fetch_cik_by_ticker_symbol["fetch_cik_by_ticker_symbol()\nsrc/network/fetch_cik_by_ticker_symbol.rs"]
fetch_company_description["fetch_company_description()\nsrc/network/fetch_company_description.rs"]
fetch_us_gaap_fundamentals["fetch_us_gaap_fundamentals()\nsrc/network/fetch_us_gaap_fundamentals.rs"]
parse_edgar_atom_feed["parse_edgar_atom_feed()\nsrc/network/fetch_edgar_feed.rs"]
fetch_investment_company_series_and_class_dataset["fetch_investment_company_series_and_class_dataset()\nsrc/network/fetch_investment_company_series_and_class_dataset.rs"]
end
    
    subgraph "EDGAR API Endpoints (Url Enum)"
        Url_CompanyTickersJson["Url::CompanyTickersJson"]
Url_CompanyFacts["Url::CompanyFacts(cik)"]
Url_CikAccessionDocument["Url::CikAccessionDocument"]
Url_InvestmentCompanySeriesAndClassDataset["Url::InvestmentCompanySeriesAndClassDataset(year)"]
end
    
 
   ReqTickers --> fetch_company_tickers
 
   ReqCIK --> fetch_cik_by_ticker_symbol
 
   ReqDesc --> fetch_company_description
 
   ReqGAAP --> fetch_us_gaap_fundamentals
 
   ReqFeed --> parse_edgar_atom_feed
 
   ReqFunds --> fetch_investment_company_series_and_class_dataset
    
 
   fetch_company_tickers --> Url_CompanyTickersJson
 
   fetch_us_gaap_fundamentals --> Url_CompanyFacts
 
   fetch_company_description --> Url_CikAccessionDocument
 
   fetch_investment_company_series_and_class_dataset --> Url_InvestmentCompanySeriesAndClassDataset

Sources: src/network/fetch_company_tickers.rs:58-61 src/network/fetch_cik_by_ticker_symbol.rs:53-56 src/network/fetch_company_description.rs:35-38 src/network/fetch_us_gaap_fundamentals.rs:54-58 src/network/fetch_investment_company_series_and_class_dataset.rs:43-45 src/network/fetch_edgar_feed.rs:118

Company Ticker and CIK Resolution

Function: fetch_company_tickers

Retrieves operating-company equity tickers. It supports merging primary listings with derived instruments (warrants, units, preferreds) src/network/fetch_company_tickers.rs:8-18

Function: fetch_cik_by_ticker_symbol

Resolves a ticker to a 10-digit CIK. It prioritizes operating companies before falling back to the investment company dataset for mutual funds/ETFs src/network/fetch_cik_by_ticker_symbol.rs:27-36

Fuzzy Matching: Ticker::get_by_fuzzy_matched_name

Performs weighted token overlap matching to resolve company names to tickers src/models/ticker.rs:38-42. It uses a TOKEN_MATCH_THRESHOLD of 0.6 and applies boosts for exact matches and common stock src/models/ticker.rs:27-32.

Sources: src/network/fetch_company_tickers.rs:58-61 src/network/fetch_cik_by_ticker_symbol.rs:53-72 src/models/ticker.rs:35-136
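A minimal version of token-overlap scoring might look like this. It is illustrative only: the real implementation's tokenization, weights, and boost values differ, and the 0.2 exact-match boost here is an assumption.

```rust
use std::collections::HashSet;

// Toy weighted token-overlap score: fraction of query tokens found in the
// candidate name, with a small boost when the strings match exactly.
fn token_overlap_score(query: &str, candidate: &str) -> f64 {
    fn tokens(s: &str) -> HashSet<String> {
        s.to_lowercase()
            .split_whitespace()
            // Strip surrounding punctuation like "Inc." -> "inc".
            .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
            .filter(|t| !t.is_empty())
            .collect()
    }
    let q = tokens(query);
    let c = tokens(candidate);
    if q.is_empty() {
        return 0.0;
    }
    let overlap = q.intersection(&c).count() as f64 / q.len() as f64;
    // Hypothetical exact-match boost, capped at 1.0.
    if query.eq_ignore_ascii_case(candidate) {
        (overlap + 0.2).min(1.0)
    } else {
        overlap
    }
}
```

Under such a scheme, scores below the 0.6 threshold would be rejected as non-matches.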

Company Profiles and Descriptions

Function: fetch_company_description

Extracts the “Item 1. Business” section from a company’s most recent 10-K filing src/network/fetch_company_description.rs:11-13

Implementation Strategy :

  1. Locates all occurrences of “Item 1” and “Item 1A” using regex src/network/fetch_company_description.rs:81-82
  2. Identifies the “real” section by finding the largest HTML byte gap between an Item 1 and its subsequent Item 1A (avoiding Table of Contents entries) src/network/fetch_company_description.rs:90-96
  3. Uses html2text for tag stripping and entity decoding src/network/fetch_company_description.rs:103-105
  4. Skips short heading lines and truncates at a sentence boundary near 800 characters src/network/fetch_company_description.rs:109-133
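Step 2's largest-gap heuristic can be sketched as follows. This is illustrative, and assumes both offset lists are in ascending document order: Table of Contents entries sit close together, so the ("Item 1", next "Item 1A") pair with the largest byte gap is taken as the real business section.

```rust
// For each "Item 1" offset, pair it with the first "Item 1A" offset after
// it, then pick the pair with the largest byte gap.
fn pick_business_section(
    item1_offsets: &[usize],
    item1a_offsets: &[usize],
) -> Option<(usize, usize)> {
    item1_offsets
        .iter()
        .filter_map(|&start| {
            item1a_offsets
                .iter()
                .copied()
                .find(|&end| end > start)
                .map(|end| (start, end))
        })
        .max_by_key(|&(start, end)| end - start)
}
```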

Function: fetch_sic_codes

Fetches the complete SEC SIC code list from siccodes.htm src/network/fetch_sic_codes.rs:11-15. It parses the HTML table rows into SicCode models containing the code, office, and industry title src/network/fetch_sic_codes.rs:63-71.

Sources: src/network/fetch_company_description.rs:35-68 src/network/fetch_sic_codes.rs:33-45

US GAAP Fundamentals

Function: fetch_us_gaap_fundamentals

Fetches all XBRL-tagged financial data for a company as a structured DataFrame src/network/fetch_us_gaap_fundamentals.rs:11-12

Data Flow :

  1. Resolves Ticker to CIK src/network/fetch_us_gaap_fundamentals.rs:60
  2. Fetches JSON from the companyfacts endpoint src/network/fetch_us_gaap_fundamentals.rs:67
  3. Accession Resolution: Calls fetch_cik_submissions to map accession numbers to primary document URLs (e.g., “aapl-20241228.htm”) src/network/fetch_us_gaap_fundamentals.rs:74-81
  4. Updates the filing_url column in the resulting DataFrame src/network/fetch_us_gaap_fundamentals.rs:99

Sources: src/network/fetch_us_gaap_fundamentals.rs:54-108

Investment Company Datasets

Function: fetch_investment_company_series_and_class_dataset

Retrieves the annual CSV of registered investment company share classes src/network/fetch_investment_company_series_and_class_dataset.rs:22-32

Year Fallback and Caching :

Diagram: Investment Company Dataset Fetch Sequence

Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:43-80 src/network/fetch_investment_company_series_and_class_dataset.rs:92-117

Real-time Feeds

Function: parse_edgar_atom_feed

Parses the EDGAR Atom XML feed into FeedEntry items src/network/fetch_edgar_feed.rs:112-118

Sources: src/network/fetch_edgar_feed.rs:46-109 src/network/fetch_edgar_feed.rs:118-155

Filing Retrieval & Rendering

Relevant source files

Purpose and Scope

This document details the filings submodule within the network layer, responsible for retrieving and parsing form-specific SEC filings. It covers the implementation of specialized fetchers for major form types (10-K, 10-Q, 8-K, 13F, Form 4, N-PORT, etc.), the FilingIndex parser for exploring filing archives, IPO feed polling, and the views system for rendering document content into Markdown or embedding-optimized text.

Filing Retrieval Architecture

The filing retrieval system is built on top of the SecClient and CikSubmission models. It provides a tiered approach: generic filing retrieval, form-specific convenience functions, and deep-parsing functions that extract structured data (like XML-based holdings) from within a filing.

Diagram: Filing Retrieval Logic Flow

graph TD
    subgraph "Code Entity Space: Filing Retrieval"
        FetchFilings["fetch_filings()\nsrc/network/filings/fetch_filings.rs"]
SpecificFetch["fetch_10k_filings()\nfetch_8k_filings()\n..."]
DeepFetch["fetch_13f()\nfetch_form4()\nfetch_nport()"]
FilingIndex["fetch_filing_index()\nsrc/network/filings/filing_index.rs"]
end

    subgraph "Natural Language Space: SEC Concepts"
        Submissions["Company Submissions\n(submissions/CIK.json)"]
Archive["EDGAR Archive Directory\n(data/CIK/Accession/)"]
PrimaryDoc["Primary Document\n(10-K HTML, 13F XML)"]
Exhibits["Exhibits & Supporting Docs\n(EX-99.1, InfoTable.xml)"]
end

 
   FetchFilings -->|Filters| Submissions
 
   SpecificFetch -->|Wraps| FetchFilings
 
   DeepFetch -->|Parses| PrimaryDoc
 
   DeepFetch -->|Uses| FilingIndex
 
   FilingIndex -->|Scrapes| Archive
 
   Archive --> Exhibits

Sources: src/network/filings/mod.rs:1-29 src/network/filings/fetch_filings.rs:67-77 src/network/filings/filing_index.rs:108-114

Form-Specific Fetchers

The library provides dedicated functions for common SEC forms. These functions encapsulate form-specific logic, such as including historical variants (e.g., 10-K405) or handling amendments.

| Function | Form Type(s) | Key Features |
|---|---|---|
| fetch_10k_filings | 10-K, 10-K405 | Returns comprehensive annual reports; re-sorts mixed types newest-first. src/network/filings/fetch_10k.rs:59-77 |
| fetch_10q_filings | 10-Q | Returns quarterly reports (Q1-Q3). src/network/filings/fetch_10q.rs:43-52 |
| fetch_8k_filings | 8-K | Returns material event notifications. src/network/filings/fetch_8k.rs:54-63 |
| fetch_13f_filings | 13F-HR | Returns institutional holdings report metadata. src/network/filings/fetch_13f.rs:32-41 |
| fetch_form4_filings | 4, 4/A | Returns insider trading reports and amendments. src/network/filings/fetch_form4.rs:33-49 |
| fetch_def14a_filings | DEF 14A | Returns definitive proxy statements for shareholder meetings. src/network/filings/fetch_def14a.rs:57-66 |
| fetch_nport_filings | NPORT-P | Returns monthly portfolio holdings for registered funds. src/network/filings/fetch_nport.rs:35-44 |

Sources: src/network/filings/fetch_10k.rs:59-77 src/network/filings/fetch_13f.rs:32-41 src/network/filings/fetch_form4.rs:33-49

Filing Index and Deep Parsing

While most filings are identified by a primary document, many (like 13F or 8-K) contain critical data in secondary files or require XML parsing of the primary document.

Filing Index Parser

The fetch_filing_index function scrapes the EDGAR -index.htm page to discover all files associated with an accession number. This is necessary because the submissions.json API only points to the “Primary Document,” which may not be the file containing the raw data (e.g., a 13F’s informationTable.xml).

Deep Data Extraction

Several functions go beyond metadata to return structured Rust models:

Sources: src/network/filings/filing_index.rs:23-76 src/network/filings/fetch_13f.rs:81-93 src/network/filings/fetch_form4.rs:91-102

IPO Feed Polling

The system includes specialized logic for tracking Initial Public Offerings (IPOs) via the EDGAR Atom feed.

Diagram: IPO Feed and Registration Lifecycle

graph LR
    subgraph "Code Entity Space: IPO Tracking"
        GetIPOFeed["get_ipo_feed_entries()\nsrc/ops/ipo_ops.rs"]
GetIPOReg["get_ipo_registration_filings()\nsrc/ops/ipo_ops.rs"]
FormType["FormType::IPO_REGISTRATION_FORM_TYPES\nsrc/enums/form_type.rs"]
end

    subgraph "Natural Language Space: IPO Lifecycle"
        S1["S-1 / F-1\n(Initial Registration)"]
S1A["S-1/A / F-1/A\n(Amendments)"]
B4["424B4\n(Final Pricing)"]
AtomFeed["EDGAR Live Feed\n(Polling)"]
end

 
   GetIPOFeed -->|Polls| AtomFeed
 
   AtomFeed -->|Filters for| FormType
 
   GetIPOReg -->|Aggregates| S1
 
   GetIPOReg -->|Aggregates| S1A
 
   GetIPOReg -->|Aggregates| B4

  • get_ipo_feed_entries: Polls the EDGAR Atom feed (the fastest source for new filings) and filters for S-1, F-1, and 424B4 forms examples/ipo_list.rs:43-51
  • get_ipo_registration_filings: Retrieves the full timeline of a company’s IPO process, from initial S-1 through all amendments to the final prospectus examples/ipo_show.rs:48-58

Sources: examples/ipo_list.rs:1-17 examples/ipo_show.rs:21-33 examples/ipo_list.rs:88-108

Document Rendering (Views)

The views system provides traits and implementations for converting SEC HTML/XBRL documents into readable text formats.

The FilingView Trait

The core abstraction for rendering. It defines how to format headers, sections, and tables.

Implementations

  1. MarkdownView :
    • Goal : Lossless representation.
    • Tables : Preserved as GitHub-Flavored Markdown (GFM) pipe tables examples/ipo_show.rs:94-95
    • Usage : Standard reading and documentation.
  2. EmbeddingTextView :
    • Goal : Optimization for Large Language Model (LLM) embeddings.
    • Tables : Flattened into labeled sentences to preserve semantic context (e.g., “The value of Assets for 2023 was 100M”) examples/ipo_show.rs:96-97
    • Prose : Cleaned of excessive whitespace and HTML artifacts.
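The table-flattening idea behind EmbeddingTextView can be illustrated with a one-line helper. The name and signature are hypothetical; the real view handles full tables, units, and HTML cleanup.

```rust
// Flatten one table cell into a labeled sentence so semantic context
// survives LLM embedding, e.g. "The value of Assets for 2023 was 100M."
fn flatten_table_cell(row_label: &str, column_label: &str, value: &str) -> String {
    format!("The value of {row_label} for {column_label} was {value}.")
}
```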

Rendering Pipeline

The render_filing operation examples/ipo_show.rs:48 orchestrates the process:

  1. Fetch the document content.
  2. Clean HTML using html_helpers.
  3. Apply the selected FilingView implementation.

Sources: examples/ipo_show.rs:92-108 src/views/markdown_view.rs:1-10 src/views/embedding_text_view.rs:1-10

US GAAP Concept Transformation

Relevant source files

Purpose and Scope

This page documents the US GAAP concept transformation system, which normalizes raw financial concept names from SEC EDGAR filings into a standardized taxonomy. The core functionality is provided by the distill_us_gaap_fundamental_concepts function, which maps the diverse US GAAP terminology (57+ revenue variations, 6 cost variants, multiple equity representations) into a consistent set of 71 FundamentalConcept enum variants src/enums/fundamental_concept_enum.rs:5-71

For information about fetching US GAAP data from the SEC API, see Data Fetching Functions. For details on the data models that use these concepts, see Data Models & Enumerations. For the Python ML pipeline that processes the transformed concepts, see Python narrative_stack System.


System Overview

The transformation system acts as a critical normalization layer between raw SEC EDGAR filings and downstream data processing. Companies report financial data using various US GAAP concept names (e.g., Revenues, SalesRevenueNet, HealthCareOrganizationRevenue), and this system ensures all variations map to consistent concept identifiers.

Data Flow: Natural Language to Code Entity Space

The following diagram bridges the gap between the natural language of financial reporting and the internal code entities used for processing.

Sources: src/enums/fundamental_concept_enum.rs:5-71 src/enums.rs:7-8

graph TB
    subgraph "Natural Language Space (SEC Filings)"
        RawConcepts["Raw US GAAP Concept Names\n'Revenues'\n'SalesRevenueNet'\n'AssetsCurrent'"]
end
    
    subgraph "Code Entity Space (rust-sec-fetcher)"
        DistillFn["distill_us_gaap_fundamental_concepts\n(Function)"]
FCEnum["FundamentalConcept\n(Enum)"]
Assets["FundamentalConcept::Assets"]
CurrentAssets["FundamentalConcept::CurrentAssets"]
Revenues["FundamentalConcept::Revenues"]
end
    
 
   RawConcepts -->|Input: &str| DistillFn
 
   DistillFn -->|Output: Option<Vec<FC>>| FCEnum
 
   FCEnum --> Assets
 
   FCEnum --> CurrentAssets
 
   FCEnum --> Revenues

The FundamentalConcept Taxonomy

The FundamentalConcept enum defines 71 standardized financial concept variants organized into main categories: Balance Sheet, Income Statement, Cash Flow, and Equity classifications src/enums/fundamental_concept_enum.rs:5-71 Each variant represents a normalized concept that may map from multiple raw US GAAP names.

Sources: src/enums/fundamental_concept_enum.rs:5-71

graph TB
    subgraph "FundamentalConcept Enum Variants"
        Root["FundamentalConcept\n(71 total variants)"]
end
    
    subgraph "Balance Sheet"
        Assets["Assets"]
CurrentAssets["CurrentAssets"]
NoncurrentAssets["NoncurrentAssets"]
Liabilities["Liabilities"]
CurrentLiabilities["CurrentLiabilities"]
NoncurrentLiabilities["NoncurrentLiabilities"]
LiabilitiesAndEquity["LiabilitiesAndEquity"]
end
    
    subgraph "Income Statement"
        Revenues["Revenues"]
CostOfRevenue["CostOfRevenue"]
GrossProfit["GrossProfit"]
OperatingExpenses["OperatingExpenses"]
OperatingIncomeLoss["OperatingIncomeLoss"]
NetIncomeLoss["NetIncomeLoss"]
InterestExpenseOperating["InterestExpenseOperating"]
ResearchAndDevelopment["ResearchAndDevelopment"]
end
    
    subgraph "Cash Flow"
        NetCashFlow["NetCashFlow"]
NetCashFlowFromOperatingActivities["NetCashFlowFromOperatingActivities"]
NetCashFlowFromInvestingActivities["NetCashFlowFromInvestingActivities"]
NetCashFlowFromFinancingActivities["NetCashFlowFromFinancingActivities"]
end
    
 
   Root --> Assets
 
   Root --> Liabilities
 
   Root --> Revenues
 
   Root --> NetIncomeLoss
 
   Root --> NetCashFlow

Mapping Pattern Types

The transformation system implements four distinct mapping patterns to handle the diverse ways companies report financial concepts.

Pattern 1: One-to-One Mapping

Simple direct mappings where a single US GAAP concept name maps to exactly one FundamentalConcept variant.

| Raw US GAAP Concept | FundamentalConcept Output |
|---|---|
| Assets | vec![Assets] |
| Liabilities | vec![Liabilities] |
| GrossProfit | vec![GrossProfit] |
| OperatingIncomeLoss | vec![OperatingIncomeLoss] |
| CommitmentsAndContingencies | vec![CommitmentsAndContingencies] |

Pattern 2: Hierarchical Mapping

Specific concepts map to multiple variants, including both the specific concept and parent categories. This enables queries at different levels of granularity.

| Raw US GAAP Concept | FundamentalConcept Output (Ordered) |
|---|---|
| AssetsCurrent | vec![CurrentAssets, Assets] |
| StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| NetIncomeLoss | vec![NetIncomeLossAttributableToParent, NetIncomeLoss] |

Pattern 3: Synonym Consolidation

Multiple US GAAP concept names that represent the same financial concept are consolidated into a single FundamentalConcept variant. For example, CostOfGoodsAndServicesSold, CostOfServices, and CostOfGoodsSold all map to FundamentalConcept::CostOfRevenue.

Pattern 4: Industry-Specific Revenue Mapping

The system handles dozens of industry-specific revenue variations (e.g., HealthCareOrganizationRevenue, OilAndGasRevenue, ElectricUtilityRevenue), mapping them all to the Revenues concept.

Sources: src/enums/fundamental_concept_enum.rs:5-71
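The four patterns above can be condensed into a match-based sketch. This is a toy subset: the real distill_us_gaap_fundamental_concepts covers hundreds of concept names across all 71 variants, and the enum shown here lists only the variants needed for the example.

```rust
// Toy subset of the FundamentalConcept taxonomy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FundamentalConcept {
    Assets,
    CurrentAssets,
    Revenues,
    CostOfRevenue,
}

// Condensed sketch of the four mapping patterns.
fn distill(concept: &str) -> Option<Vec<FundamentalConcept>> {
    use FundamentalConcept::*;
    match concept {
        // Pattern 1: one-to-one.
        "Assets" => Some(vec![Assets]),
        // Pattern 2: hierarchical (specific concept first, then parent).
        "AssetsCurrent" => Some(vec![CurrentAssets, Assets]),
        // Pattern 3: synonym consolidation.
        "CostOfGoodsAndServicesSold" | "CostOfServices" | "CostOfGoodsSold" => {
            Some(vec![CostOfRevenue])
        }
        // Pattern 4: industry-specific revenue variants.
        "Revenues" | "SalesRevenueNet" | "HealthCareOrganizationRevenue" => Some(vec![Revenues]),
        // Unmapped concepts yield None.
        _ => None,
    }
}
```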


The distill_us_gaap_fundamental_concepts Function

The core transformation function accepts a string representation of a US GAAP concept name and returns an Option<Vec<FundamentalConcept>>. The return type is an Option because not all US GAAP concepts are mapped, and a Vec because some concepts map to multiple standardized variants.

Implementation Logic

The function serves as the primary entry point for the transformation logic. It is utilized by higher-level operations to normalize data before it is stored or used for training.

Sources: src/enums/fundamental_concept_enum.rs:5-71 src/enums.rs:7-8

graph LR
    subgraph "Input"
        RawStr["&str: 'SalesRevenueNet'"]
end
    
    subgraph "distill_us_gaap_fundamental_concepts"
        Match["Pattern Match Engine"]
end
    
    subgraph "Output"
        Result["Some(vec![FundamentalConcept::Revenues])"]
end
    
 
   RawStr --> Match
 
   Match --> Result

Summary

The US GAAP concept transformation system provides:

  1. Standardization : Maps hundreds of raw US GAAP concept names to 71 standardized FundamentalConcept variants src/enums/fundamental_concept_enum.rs:5-71
  2. Flexibility : Supports four mapping patterns (one-to-one, hierarchical, synonyms, industry-specific) to handle diverse reporting practices.
  3. Queryability : Hierarchical mappings enable queries at multiple granularity levels (e.g., query for all Assets or specifically CurrentAssets).
  4. Integration : Serves as the critical normalization layer between SEC EDGAR API and downstream data processing/ML pipelines.

Sources: src/enums/fundamental_concept_enum.rs:5-71 src/enums.rs:7-8

Data Models & Enumerations

Relevant source files

Purpose and Scope

This page documents the core data structures and enumerations used throughout the rust-sec-fetcher application. These models represent SEC financial data, including company identifiers, filing metadata, investment holdings, and financial concepts. The data models are defined across the src/models/ directory and centralized in src/models.rs:1-18 while enumerations are managed in src/enums.rs:1-15

Sources: src/models.rs:1-18 src/enums.rs:1-15


SEC Identifier Models

The system uses three primary identifier types to reference companies and filings within the SEC EDGAR system.

Ticker

The Ticker struct represents a company’s stock ticker symbol along with its SEC identifiers. It is the primary structure for mapping human-readable symbols to regulatory keys.

Structure:

Fuzzy Matching: The Ticker model includes a sophisticated fuzzy matching engine in get_by_fuzzy_matched_name src/models/ticker.rs:38-136. It uses tokenization, SIMD-accelerated cleaning src/models/ticker.rs:148-204, and weighted scoring (e.g., EXACT_MATCH_BOOST, PREFERRED_STOCK_PENALTY) to resolve company names to CIKs src/models/ticker.rs:27-33

Sources: src/models/ticker.rs:19-35 src/models/ticker.rs:38-136

Cik (Central Index Key)

The Cik struct represents a 10-digit SEC identifier that uniquely identifies a company or entity. CIKs are permanent and never reused src/models/cik.rs:11-36

Structure:

Key Characteristics:

  • Formatting: Always zero-padded to 10 digits when displayed (e.g., 320193 becomes "0000320193") src/models/cik.rs:66-69
  • Resolution: get_company_cik_by_ticker_symbol handles the logic of resolving derived instruments (warrants, units) back to their parent registrant’s CIK src/models/cik.rs:143-167

Sources: src/models/cik.rs:37-40 src/models/cik.rs:143-167

AccessionNumber

The AccessionNumber struct represents a unique identifier for SEC filings. Each accession number is exactly 18 digits and encodes the filer’s CIK, filing year, and sequence number.

Format: XXXXXXXXXX-YY-NNNNNN src/models/accession_number.rs:11-14
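As a hedged sketch of that layout, the dashed form splits into a 10-digit filer CIK, a 2-digit year, and a 6-digit sequence; the real AccessionNumber type has its own constructors and validation, so this helper is illustrative only:

```rust
/// Split a dashed accession number into (filer CIK, 2-digit year, sequence).
/// Returns None if the shape doesn't match XXXXXXXXXX-YY-NNNNNN.
fn parse_accession(raw: &str) -> Option<(u64, u16, u32)> {
    let parts: Vec<&str> = raw.split('-').collect();
    if parts.len() != 3 || parts[0].len() != 10 || parts[1].len() != 2 || parts[2].len() != 6 {
        return None;
    }
    Some((
        parts[0].parse().ok()?, // filer CIK
        parts[1].parse().ok()?, // filing year (two digits)
        parts[2].parse().ok()?, // sequence number
    ))
}

fn main() {
    let (cik, year, seq) = parse_accession("0000320193-23-000077").unwrap();
    assert_eq!((cik, year, seq), (320193, 23, 77));
    assert!(parse_accession("bogus").is_none());
}
```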

Key Methods:

Sources: src/models/accession_number.rs:35-187

SEC Identifier Relationships

Sources: src/models/ticker.rs:20-25 src/models/cik.rs:143-167 src/models/accession_number.rs:35-40


Filing Data Structures

NportInvestment

The NportInvestment struct represents a single investment holding from an NPORT-P filing. It includes both raw data from the SEC and “mapped” fields enriched by the fetcher.

Key Fields:

Sources: src/models/nport_investment.rs:9-41

ThirteenfHolding

The ThirteenfHolding struct represents a row in a Form 13F-HR information table. Unlike raw XML data, these fields are stored in normalized form src/models/thirteenf_holding.rs:4-9

Sources: src/models/thirteenf_holding.rs:10-33

InvestmentCompany

Represents mutual funds and ETFs. It is primarily used to resolve tickers that do not appear in the standard operating company list src/models/investment_company.rs:6-49

Sources: src/models/investment_company.rs:52-67 src/network/fetch_cik_by_ticker_symbol.rs:67-69


Enumerations

FundamentalConcept

The FundamentalConcept enum defines 64 standardized financial concepts (e.g., Assets, NetIncomeLoss, Revenues). It is the backbone of the US GAAP transformation pipeline src/enums/fundamental_concept_enum.rs:1-72

FormType

The FormType enum covers SEC forms explicitly handled by the library, such as TenK (“10-K”), EightK (“8-K”), and Sc13G (“SCHEDULE 13G”) src/enums/form_type_enum.rs:65-200 It uses strum for case-insensitive parsing and provides the canonical EDGAR string via as_edgar_str src/enums/form_type_enum.rs:56-63

CacheNamespacePrefix

Defines the organizational structure of the simd-r-drive cache.

  • CompanyTickerFuzzyMatch: Used to cache expensive fuzzy matching results src/models/ticker.rs:15
  • CompanyTickers: Used for the raw ticker dataset.

Url

A centralized registry of SEC EDGAR endpoints, such as CompanyTickersJson and CompanyTickersTxt src/network/fetch_company_tickers.rs:62-73

TickerOrigin

Distinguishes between PrimaryListing (from company_tickers.json) and DerivedInstrument (from ticker.txt, including warrants and preferreds) src/network/fetch_company_tickers.rs:22-32


graph LR
    subgraph Input["Natural Language Space"]
Query["'Apple' or 'AAPL'"]
end

    subgraph Logic["Code Entity Space"]
SClient["SecClient"]
FCT["fetch_company_tickers"]
T_Fuzzy["Ticker::get_by_fuzzy_matched_name"]
C_Lookup["Cik::get_company_cik_by_ticker_symbol"]
subgraph Models["Data Models"]
M_Ticker["Ticker"]
M_Cik["Cik"]
M_Origin["TickerOrigin"]
end
    end

 
   Query --> T_Fuzzy
 
   SClient --> FCT
 
   FCT --> M_Ticker
 
   T_Fuzzy --> M_Ticker
 
   M_Ticker --> M_Origin
 
   M_Ticker --> C_Lookup
 
   C_Lookup --> M_Cik

Data Flow & Relationships

The following diagram bridges the natural language concepts of “Searching for a Company” to the specific code entities involved.

Sources: src/network/fetch_company_tickers.rs:58-65 src/models/ticker.rs:38-42 src/models/cik.rs:143-146 examples/fuzzy_match_company.rs:35-75

Implementation Details: Precision & Normalization

The system prioritizes financial accuracy by using specialized types:

Sources: src/models/nport_investment.rs:2-35 src/models/thirteenf_holding.rs:1-32

Parsers & Data Normalization

Relevant source files

Purpose and Scope

The parsers and normalize modules form the data ingestion backbone of rust-sec-fetcher. While the network layer retrieves raw bytes (XML, JSON, or CSV), the parsers transform these into structured Rust models. The normalize module ensures that numeric inconsistencies across SEC eras—such as the transition from thousands-of-dollars to actual-dollars in Form 13F—are handled centrally rather than being scattered across the codebase.

Sources: src/parsers.rs:1-21 src/normalize/mod.rs:1-16


Normalization Logic

The normalize module is the single source of truth for scale conversions and unit adjustments. It prevents “inline conversions” in parsers, ensuring that logic for handling SEC schema changes is testable in isolation.

13F Value Normalization

The SEC changed the <value> unit in Form 13F-HR informationTable.xml filings around January 1, 2023.

  • Legacy Era: Values reported in thousands of USD.
  • Modern Era: Values reported in actual USD.

The function normalize_13f_value_usd src/normalize/thirteenf.rs:144-150 uses the THIRTEENF_THOUSANDS_ERA_CUTOFF constant src/normalize/thirteenf.rs:72 to determine if a 1000x multiplier should be applied based on the filing_date.
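A simplified sketch of the cutoff logic, using ISO date strings in place of the real date types and integer values in place of Decimal:

```rust
/// Filings dated before the cutoff report <value> in thousands of USD;
/// later filings report actual USD. ISO "YYYY-MM-DD" strings compare
/// correctly lexicographically, so plain string comparison suffices here.
const THIRTEENF_THOUSANDS_ERA_CUTOFF: &str = "2023-01-01";

fn normalize_13f_value_usd(raw_value: u64, filing_date: &str) -> u64 {
    if filing_date < THIRTEENF_THOUSANDS_ERA_CUTOFF {
        raw_value * 1000 // legacy era: thousands of USD
    } else {
        raw_value // modern era: already USD
    }
}

fn main() {
    assert_eq!(normalize_13f_value_usd(1_500, "2021-06-30"), 1_500_000);
    assert_eq!(normalize_13f_value_usd(1_500_000, "2023-06-30"), 1_500_000);
}
```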

Percentage Handling (Pct Type)

The Pct struct src/normalize/pct.rs:31 is a type-safe wrapper around Decimal that enforces a 0–100 scale (e.g., 7.75 means 7.75%, not 0.0775).

Sources: src/normalize/thirteenf.rs:1-150 src/normalize/pct.rs:1-110
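A minimal sketch of such a wrapper, using f64 in place of Decimal; the real constructor and conversion semantics may differ:

```rust
/// Sketch of a 0-100-scale percentage wrapper (the real Pct wraps Decimal).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pct(f64);

impl Pct {
    /// Construct from a value already on the 0-100 scale (7.75 == 7.75%).
    fn from_pct(v: f64) -> Option<Pct> {
        if (0.0..=100.0).contains(&v) { Some(Pct(v)) } else { None }
    }

    /// Convert to a 0-1 fraction for arithmetic.
    fn as_fraction(self) -> f64 {
        self.0 / 100.0
    }
}

fn main() {
    let p = Pct::from_pct(7.75).unwrap();
    assert_eq!(p.as_fraction(), 0.0775);
    assert!(Pct::from_pct(107.75).is_none()); // out of scale, rejected
}
```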


XML Parsers

The system uses quick-xml for high-performance, stream-based parsing of large SEC filings.

N-PORT XML Parser

The parse_nport_xml function processes monthly portfolio holdings for registered investment companies.

Form 13F-HR Parser

The parse_13f_xml function extracts institutional investment manager holdings.

Form 4 Parser

The parse_form4_xml function parses insider trading transactions.

Sources: src/parsers/parse_nport_xml.rs:15-149 src/parsers/parse_13f_xml.rs:26-128 src/parsers/parse_form4_xml.rs:14-158


US GAAP Fundamentals Parser

The parse_us_gaap_fundamentals function src/parsers/parse_us_gaap_fundamentals.rs:41 converts the SEC’s companyfacts JSON into a Polars DataFrame.

Deduplication & Sorting

The parser implements a “Last-in Wins” strategy for amended filings:

  1. Chronological Sort: Data is sorted by the filed date descending src/parsers/parse_us_gaap_fundamentals.rs:32-33
  2. Pivot: During the pivot operation, it uses .first() to select the most recent filing for any given fiscal period (fy/fp) src/parsers/parse_us_gaap_fundamentals.rs:34-38
  3. Metadata: Every row is prefixed with US_GAAP_META_COLUMNS including accn and filing_url src/parsers/parse_us_gaap_fundamentals.rs:12-21
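The sort-then-first strategy above can be sketched without Polars as follows; the tuple layout and helper name are illustrative, not the actual DataFrame pipeline:

```rust
use std::collections::HashMap;

/// "Last-in wins": sort facts by filed date descending, then keep the first
/// (i.e. most recently filed) value seen for each (fy, fp) fiscal period.
fn dedupe_latest(mut facts: Vec<(String, u16, String, i64)>) -> HashMap<(u16, String), i64> {
    // Tuple layout: (filed_date, fy, fp, value)
    facts.sort_by(|a, b| b.0.cmp(&a.0)); // filed descending
    let mut out = HashMap::new();
    for (_, fy, fp, value) in facts {
        out.entry((fy, fp)).or_insert(value); // first() == latest filing
    }
    out
}

fn main() {
    let facts = vec![
        ("2023-02-01".into(), 2022, "FY".into(), 100), // original filing
        ("2023-08-01".into(), 2022, "FY".into(), 110), // amendment (wins)
    ];
    let deduped = dedupe_latest(facts);
    assert_eq!(deduped[&(2022, "FY".to_string())], 110);
}
```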

Magnitude Sanity Checks

Because filers often make “scale errors” (e.g., reporting millions as ones), the parser applies strictness checks:
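The exact checks live in the source; one plausible check of this kind, comparing the reported fiscal year against the year of the period end date (the helper name and tolerance here are assumptions), might look like:

```rust
/// Hypothetical magnitude/consistency check: flag facts whose reported
/// fiscal year (fy) disagrees with the year of the period end date by more
/// than one year, a common symptom of mis-keyed filings.
fn fy_is_plausible(fy: i32, end_date: &str) -> bool {
    let end_year: i32 = match end_date.get(..4).and_then(|y| y.parse().ok()) {
        Some(y) => y,
        None => return false, // malformed date: reject
    };
    (fy - end_year).abs() <= 1
}

fn main() {
    assert!(fy_is_plausible(2022, "2022-12-31"));
    assert!(fy_is_plausible(2023, "2022-12-31")); // off-calendar fiscal years
    assert!(!fy_is_plausible(2012, "2022-12-31")); // likely a keying error
}
```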

Sources: src/parsers/parse_us_gaap_fundamentals.rs:12-127


Data Flow Diagrams

graph TD
    subgraph "Natural Language Space"
        SEC_XML["SEC XML Source\n(13F / N-PORT)"]
RawVal["'value' (13F)\n'pctVal' (N-PORT)"]
FilingDate["'filingDate'"]
end

    subgraph "Code Entity Space: Parsers"
        P13F["parse_13f_xml()"]
PNPORT["parse_nport_xml()"]
end

    subgraph "Code Entity Space: Normalize"
        N13F["normalize_13f_value_usd()"]
NPCT["Pct::from_pct()"]
EraCheck["is_13f_thousands_era()"]
end

 
   SEC_XML --> P13F
 
   SEC_XML --> PNPORT
    
 
   P13F -->|passes raw decimal| N13F
 
   P13F -->|passes date| EraCheck
 
   EraCheck -->|returns bool| N13F
    
 
   PNPORT -->|passes raw decimal| NPCT
    
 
   N13F -->|Result| Model13F["ThirteenfHolding.value_usd"]
NPCT -->|Result| ModelNPORT["NportInvestment.pct_val"]

From XML to Normalized Model

This diagram bridges the “Natural Language” SEC fields to the “Code Entities” in the normalize and parsers modules.

Sources: src/parsers/parse_13f_xml.rs:98 src/normalize/thirteenf.rs:144 src/parsers/parse_nport_xml.rs:104

graph LR
    subgraph "Input"
        J["JSON Value\n(companyfacts)"]
end

    subgraph "src/parsers/parse_us_gaap_fundamentals.rs"
        Ext["Extraction Loop\n(fact_category_values, etc)"]
Sanity["Magnitude Sanity Checks\n(fy vs end_year)"]
DF["DataFrame::new()"]
Sort["sort(['filed'], descending=true)"]
Pivot["pivot(['fy', 'fp'])\naggregate: first()"]
end

    subgraph "Output"
        TDF["TickerFundamentalsDataFrame"]
end

 
   J --> Ext
 
   Ext --> Sanity
 
   Sanity --> DF
 
   DF --> Sort
 
   Sort --> Pivot
 
   Pivot --> TDF

US GAAP DataFrame Construction

The pipeline for converting raw JSON facts into an analysis-ready Polars DataFrame.

Sources: src/parsers/parse_us_gaap_fundamentals.rs:41-103 src/parsers/parse_us_gaap_fundamentals.rs:32-38


CSV and Text Parsers

Company Tickers

Master Index

  • parse_master_idx : Parses the quarterly master.idx files from EDGAR. It skips header lines and extracts CIK, Company Name, Form Type, Date Filed, and File Name (URL) src/parsers/parse_master_idx.rs:10-11

Investment Companies

Sources: src/parsers.rs:10-21

Operations & Business Logic

Relevant source files

The ops module provides high-level, orchestrated operations built on top of the lower-level network fetch functions. Each operation encapsulates multi-step workflows—such as filing index parsing, portfolio position normalization, and automated feed polling—allowing callers to work at a higher level of abstraction than raw HTTP requests.

Filing Operations

Filing operations handle the retrieval and transformation of SEC documents into human-readable or machine-learning-ready text. The primary entry point is render_filing, which coordinates fetching the primary document and its associated exhibits.

Rendering Pipeline

The rendering logic distinguishes between “substantive” exhibits (press releases, material contracts) and boilerplate (SOX certifications, auditor consents).

Sources: src/ops/filing.rs:61-84 src/ops/filing.rs:130-141

graph TD
    subgraph "ops::filing"
        RF["render_filing()"]
RE["render_exhibit_docs()"]
RED["render_exhibit_doc()"]
end

    subgraph "network::filings"
        FI["fetch_filing_index()"]
FAR["fetch_and_render()"]
end

 
   RF -->|if render_body| FAR
 
   RF -->|if render_exhibits| FI
 
   FI -->|FilingIndex| RE
 
   RE --> RED
 
   RED --> FAR
 
   FAR -->|FilingView| Output["Rendered Text"]

Key Functions and Structures

| Entity | Role | Source |
|---|---|---|
| RenderedFiling | Container for the optional body text and a Vec of RenderedExhibit. | src/ops/filing.rs:20-25 |
| render_filing | High-level orchestrator that fetches the primary doc and substantive exhibits. | src/ops/filing.rs:61-84 |
| render_all_exhibits | Variant that skips substantive filtering to return every document in the archive. | src/ops/filing.rs:92-100 |
| fetch_filing_index | Parses the EDGAR -index.htm page to find document filenames and types. | src/network/filings/filing_index.rs:108-114 |

The FilingIndex parser uses regex to extract data from the SEC’s HTML table, identifying documents by their Seq, Description, and Type src/network/filings/filing_index.rs:23-76

Holdings Operations

Holdings operations normalize investment data from disparate SEC forms (13F for institutional managers and N-PORT for registered investment companies) into a common Position format for comparison.

Position Normalization and Diffing

The system matches positions by CUSIP and calculates weight changes. A “significant” change is defined by the WEIGHT_CHANGE_THRESHOLD (default 0.10 percentage points).

Sources: src/ops/holdings.rs:45-71 src/ops/holdings.rs:79-120
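A sketch of CUSIP-keyed diffing with the threshold; the real Position and Diff types carry more fields than the simplified ones here:

```rust
use std::collections::HashMap;

/// Weight changes at or above this many percentage points are "significant".
const WEIGHT_CHANGE_THRESHOLD: f64 = 0.10;

struct Diff {
    added: Vec<String>,
    removed: Vec<String>,
    changed: Vec<(String, f64, f64)>, // (cusip, old_weight, new_weight)
}

/// Match positions by CUSIP (key) with portfolio weight (value) and classify
/// each as added, removed, or significantly changed.
fn diff_holdings(old: &HashMap<String, f64>, new: &HashMap<String, f64>) -> Diff {
    let mut diff = Diff { added: vec![], removed: vec![], changed: vec![] };
    for (cusip, &w_new) in new {
        match old.get(cusip) {
            None => diff.added.push(cusip.clone()),
            Some(&w_old) if (w_new - w_old).abs() >= WEIGHT_CHANGE_THRESHOLD => {
                diff.changed.push((cusip.clone(), w_old, w_new))
            }
            _ => {} // present in both, change below threshold
        }
    }
    for cusip in old.keys() {
        if !new.contains_key(cusip) {
            diff.removed.push(cusip.clone());
        }
    }
    diff
}

fn main() {
    let old = HashMap::from([("037833100".to_string(), 5.0), ("594918104".to_string(), 3.0)]);
    let new = HashMap::from([("037833100".to_string(), 5.5), ("023135106".to_string(), 1.0)]);
    let d = diff_holdings(&old, &new);
    assert_eq!(d.added, vec!["023135106".to_string()]);
    assert_eq!(d.removed, vec!["594918104".to_string()]);
    assert_eq!(d.changed.len(), 1); // 5.0 -> 5.5 exceeds the threshold
}
```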

graph LR
 
   NPORT["NportInvestment"] -->|positions_from_nport| P1["Position"]
T13F["ThirteenfHolding"] -->|positions_from_13f| P2["Position"]
P1 --> DH["diff_holdings()"]
P2 --> DH
    
 
   DH -->|Result| Diff["Diff Structure"]
subgraph "Diff Results"
 
       Diff --> Added["added: Vec&lt;Position&gt;"]
Diff --> Removed["removed: Vec&lt;Position&gt;"]
Diff --> Changed["changed: Vec&lt;(Old, New)&gt;"]
end

Implementation Details

IPO Operations & Lifecycle

The IPO module manages the discovery and retrieval of registration statements (S-1/F-1) and their subsequent amendments (S-1/A, F-1/A).

Registration Filing Lifecycle

The system tracks companies through the registration process, starting from the initial filing through amendments to the final pricing prospectus.

| Form Type | Description | Constant Group |
|---|---|---|
| S-1 / F-1 | Initial registration statement. | FormType::IPO_REGISTRATION_FORM_TYPES |
| S-1/A / F-1/A | Amendments responding to SEC comments. | FormType::IPO_REGISTRATION_FORM_TYPES |
| 424B4 | Final pricing prospectus (deal terms). | FormType::IPO_PRICING_FORM_TYPES |

Sources: src/ops/ipo.rs:40-43 examples/ipo_show.rs:26-32

Feed Polling and Deduplication

The get_ipo_feed_entries function provides a “delta-poll” capability. It filters the EDGAR Atom feed, which uses prefix matching (e.g., searching “S-1” returns “S-11”), to ensure exact form type matches.

Sources: src/ops/ipo.rs:83-110
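The exact-match filter amounts to a plain string comparison after trimming; a minimal sketch (helper name is illustrative):

```rust
/// EDGAR's Atom feed search is prefix-based, so querying "S-1" also returns
/// "S-11" entries; an exact string comparison filters those out.
fn is_exact_form_match(wanted: &[&str], entry_form: &str) -> bool {
    wanted.iter().any(|w| *w == entry_form.trim())
}

fn main() {
    let wanted = ["S-1", "F-1"];
    assert!(is_exact_form_match(&wanted, "S-1"));
    assert!(!is_exact_form_match(&wanted, "S-11")); // prefix match, rejected
}
```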

graph TB
    Start["get_ipo_feed_entries()"]
Fetch["fetch_edgar_feeds_since()"]
ExactMatch{"Exact Form Match?"}
Dedup{"Seen Accession?"}
Start --> Fetch
 
   Fetch --> ExactMatch
 
   ExactMatch -->|No| Drop["Discard (e.g. S-11)"]
ExactMatch -->|Yes| Dedup
 
   Dedup -->|New| Collect["Add to Results"]
Dedup -->|Duplicate| Drop
    
 
   Collect --> HW["Update High Water Mark"]

Identity Resolution

Because pre-IPO companies lack ticker symbols, the logic supports resolution via CIK. The ipo_show example demonstrates this by prioritizing CIK lookup and falling back to ticker-based CIK discovery for companies that have already listed examples/ipo_show.rs:115-121

Sources: src/ops/ipo.rs:8-49 examples/ipo_list.rs:87-108

Caching & Storage System

Relevant source files

Purpose and Scope

This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive.

The caching system is designed to be isolated per ConfigManager instance, ensuring that different environments (e.g., production vs. unit tests) do not suffer from cross-test cache pollution.

Overview

The caching system provides two distinct cache layers managed by the Caches struct:

  1. HTTP Cache : Stores raw HTTP responses from SEC EDGAR API requests to avoid re-downloading immutable filing data.
  2. Preprocessor Cache : Stores transformed and processed data structures (e.g., mapping tables, calculated values, or TTL-based metadata).

Both caches use the simd-r-drive key-value storage backend with persistent file-based storage.

Sources: src/caches.rs:1-14 src/caches.rs:25-51


Caching Architecture

The following diagram illustrates the caching architecture and its integration with the configuration and network layers:

Sources: src/caches.rs:11-14 src/caches.rs:29-51 src/network/fetch_investment_company_series_and_class_dataset.rs:43-46

graph TB
    subgraph "Initialization Space"
        ConfigMgr["ConfigManager"]
CachesStruct["Caches Struct"]
OpenFn["Caches::open(base_path)"]
end
    
    subgraph "Code Entity Space: Caches Module"
        HTTP_DS["http_cache: Arc&lt;DataStore&gt;"]
PRE_DS["preprocessor_cache: Arc&lt;DataStore&gt;"]
end
    
    subgraph "File System (On-Disk)"
        HTTP_File["http_storage_cache.bin"]
PRE_File["preprocessor_cache.bin"]
end
    
    subgraph "Network Integration"
        SecClient["SecClient"]
FetchInv["fetch_investment_company_..."]
end
    
 
   ConfigMgr -->|provides path| OpenFn
 
   OpenFn -->|instantiates| CachesStruct
 
   CachesStruct --> HTTP_DS
 
   CachesStruct --> PRE_DS
    
 
   HTTP_DS -->|persists to| HTTP_File
 
   PRE_DS -->|persists to| PRE_File
    
 
   SecClient -->|uses| HTTP_DS
 
   FetchInv -->|uses| PRE_DS

Implementation Details

The Caches Struct

Unlike previous versions that used global OnceLock statics, the current implementation encapsulates the storage logic within the Caches struct. This allows for better dependency injection and testing isolation.

| Method | Description |
|---|---|
| open(base: &Path) | Creates the directory if missing and opens two DataStore files: http_storage_cache.bin and preprocessor_cache.bin. |
| get_http_cache_store() | Returns an Arc<DataStore> for the HTTP response cache. |
| get_preprocessor_cache() | Returns an Arc<DataStore> for the preprocessor/metadata cache. |

Sources: src/caches.rs:25-59

CacheNamespacePrefix

To prevent key collisions within a single DataStore, the system utilizes CacheNamespacePrefix. This enum provides distinct prefixes for different types of cached data, which are then hashed using simd_r_drive::utils::NamespaceHasher.

Common namespaces include:

  • LatestFundsYear: Used to track the most recent available year for investment company datasets.

Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:11-15 src/network/fetch_investment_company_series_and_class_dataset.rs:47-48
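Conceptually, namespacing amounts to deriving the store key from both a prefix and the logical key, so identical keys in different namespaces cannot collide. The real code delegates to NamespaceHasher, which hashes rather than concatenates; this sketch uses simple concatenation for illustration:

```rust
/// Illustrative key namespacing: prepend a prefix plus a separator byte so
/// the same logical key maps to distinct store keys per namespace.
fn namespaced_key(prefix: &str, key: &str) -> Vec<u8> {
    let mut k = Vec::with_capacity(prefix.len() + 1 + key.len());
    k.extend_from_slice(prefix.as_bytes());
    k.push(0); // separator byte
    k.extend_from_slice(key.as_bytes());
    k
}

fn main() {
    let a = namespaced_key("LatestFundsYear", "2024");
    let b = namespaced_key("CompanyTickers", "2024");
    assert_ne!(a, b); // same key, different namespace -> different store key
}
```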


Preprocessor Cache Usage

The preprocessor cache is used for logic that requires persistence but isn’t a direct 1:1 mapping of an HTTP response. A primary example is the “Year-Fallback Logic” used when fetching investment company datasets.

sequenceDiagram
    participant App as Fetch Logic
    participant PreCache as Preprocessor Cache
    participant SEC as SEC EDGAR API
    
    App->>PreCache: read_with_ttl(Namespace: LatestFundsYear)
    alt Cache Hit
        PreCache-->>App: Return cached year (e.g., 2024)
    else Cache Miss
        App->>App: Default to Utc::now().year()
    end
    
    loop Fallback Logic
        App->>SEC: GET Dataset for Year
        alt 200 OK
            SEC-->>App: CSV Data
            App->>PreCache: write_with_ttl(year, TTL: 1 week)
            Note over App: Break Loop
        else 404 Not Found
            App->>App: decrement year
        end
    end

Data Flow: Investment Company Dataset Fetching

Implementation Details:

Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:46-73


HTTP Cache & SecClient

The SecClient utilizes the http_cache provided by the Caches struct. This integration typically happens during the construction of the SecClient via the ConfigManager.

Storage Characteristics (simd-r-drive)

The underlying storage provided by simd-r-drive offers:

  • High Performance : Optimized for fast key-value lookups.
  • Atomic Operations : Ensures data integrity during writes.
  • Simplicity : Single-file binary format (.bin) per store.

Sources: src/caches.rs:31-46


Integration Summary

| Component | Role | File Reference |
|---|---|---|
| Caches | Owner of DataStore handles | src/caches.rs:11-14 |
| simd_r_drive::DataStore | Low-level storage engine | src/caches.rs:1 |
| NamespaceHasher | Scopes keys within a DataStore | src/network/fetch_investment_company_series_and_class_dataset.rs:11-15 |
| StorageCacheExt | Provides read_with_ttl and write_with_ttl | src/network/fetch_investment_company_series_and_class_dataset.rs:7 |

Sources: src/caches.rs:1-60 src/network/fetch_investment_company_series_and_class_dataset.rs:1-73

Utility Functions & Accessors

Relevant source files

Purpose and Scope

This document covers the utility functions and helper modules provided by the utils module in the Rust sec-fetcher application. These utilities provide cross-cutting functionality used throughout the codebase, including data structure transformations, runtime mode detection, and collection extensions. Additionally, it documents the internal data extraction patterns used by the network layer to process complex SEC JSON responses.

Sources: src/utils.rs:1-9


Module Overview

The utils module is organized as a collection of focused sub-modules. The module uses Rust’s re-export pattern to provide a clean public API for environment detection and collection manipulation.

| Sub-module | Primary Export | Purpose |
|---|---|---|
| is_development_mode | is_development_mode() | Runtime environment detection for configuration and logging. |
| is_interactive_mode | is_interactive_mode(), set_interactive_mode_override() | Detects if the app is running in a TTY or should simulate one. |
| vec_extensions | VecExtensions trait | Extension methods for Vec<T> used in data processing. |

Sources: src/utils.rs:1-9


Interactive Mode Management

The interactive mode utilities manage application state related to user interaction, controlling whether the application should prompt for user input or run in automated mode (e.g., CI/CD or piped output).

Implementation Details

The is_interactive_mode function checks the environment and standard streams to determine the session type:

  1. Override Check : It first looks for the INTERACTIVE_MODE_OVERRIDE environment variable. Values "1"/"true" force interactive mode; "0"/"false" force non-interactive src/utils/is_interactive_mode.rs:8-28
  2. Terminal Detection : If no override exists, it uses std::io::IsTerminal to check if both stdin and stdout are connected to a terminal src/utils/is_interactive_mode.rs:30-32
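This detection order can be sketched using only the standard library; the real module also exposes set_interactive_mode_override and may differ in details:

```rust
use std::env;
use std::io::{stdin, stdout, IsTerminal};

/// Sketch of the detection order: an explicit override beats TTY detection.
fn is_interactive_mode() -> bool {
    match env::var("INTERACTIVE_MODE_OVERRIDE").ok().as_deref() {
        Some("1") | Some("true") => return true,
        Some("0") | Some("false") => return false,
        _ => {} // no override: fall through to terminal detection
    }
    stdin().is_terminal() && stdout().is_terminal()
}

fn main() {
    env::set_var("INTERACTIVE_MODE_OVERRIDE", "0");
    assert!(!is_interactive_mode());
    env::set_var("INTERACTIVE_MODE_OVERRIDE", "true");
    assert!(is_interactive_mode());
}
```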

Function Signatures

src/utils/is_interactive_mode.rs:21 src/utils/is_interactive_mode.rs:62

Usage Scenarios

| Scenario | is_interactive_mode() is true | is_interactive_mode() is false |
|---|---|---|
| Missing Config | Prompt user for email/API keys | Exit with error message |
| Progress | Render dynamic progress bars | Log static status updates |
| Piping | Warning if output is not redirected | Clean data output for grep/jq |

Sources: src/utils/is_interactive_mode.rs:1-76


Data Accessors and Parsing Patterns

While not a standalone “accessor” module, the codebase implements standardized internal functions to extract data from SEC JSON structures (handled as DataFrame-like objects in Python, but as raw serde_json::Value in Rust).

Submission Parsing Logic

The fetch_cik_submissions module contains specialized logic to flatten the SEC’s columnar JSON format into a vector of strongly-typed models.

graph TD
    subgraph "SEC JSON Structure"
        JSON["Root JSON Object"]
Filings["'filings' Object"]
Recent["'recent' Block (Columnar)"]
Files["'files' Array (Pagination)"]
end

    subgraph "Logic: extract_filings_from_block"
        Zip["itertools::izip!"]
Acc["accessionNumber[]"]
Form["form[]"]
Doc["primaryDocument[]"]
Date["filingDate[]"]
end

    subgraph "Output Space"
        Model["Vec&lt;CikSubmission&gt;"]
end

 
   JSON --> Filings
 
   Filings --> Recent
 
   Filings --> Files
 
   Recent --> Acc & Form & Doc & Date
   Acc & Form & Doc & Date --> Zip
 
   Zip --> Model
 
   Files -.->|Recursively Fetch| JSON

Key Accessor Functions

Sources: src/network/fetch_cik_submissions.rs:10-177


Utility Function Relationships

This diagram maps the relationship between utility functions and the higher-level system components that consume them.

Sources: src/utils.rs:1-9 src/network/fetch_cik_submissions.rs:1-178

graph LR
    subgraph "Utility Layer (src/utils/)"
        IsDev["is_development_mode()"]
IsInt["is_interactive_mode()"]
VecExt["VecExtensions"]
end

    subgraph "Network Layer (src/network/)"
        Client["SecClient"]
SubParser["parse_cik_submissions_json()"]
end

    subgraph "App Logic"
        Config["ConfigManager"]
Main["main.rs / CLI"]
end

 
   IsDev --> Config
 
   IsInt --> Main
 
   VecExt --> SubParser
 
   SubParser --> Client

Development Mode Detection

The is_development_mode utility is a simple toggle used to gate features that should not be active in production, such as test fixture generation or bypasses for rate limiting in local mock environments.

Usage in Configuration

The ConfigManager and SecClient utilize these flags to determine if they should load production SEC endpoints or local mock servers during integration testing tests/sec_client_tests.rs:7-23

Sources: src/utils.rs:1-2 tests/sec_client_tests.rs:1-23


Vector Extensions Trait

The VecExtensions trait provides ergonomic helpers for common operations performed on lists of SEC data, such as deduplication or specialized filtering before rendering.

Trait Definition Pattern

src/utils/vec_extensions.rs:1-10 (referenced via src/utils.rs:7-8)

Sources: src/utils.rs:7-8

Python narrative_stack System

Relevant source files

Purpose and Scope

The narrative_stack system is the Python machine learning component of the dual-language financial data processing architecture. This system is responsible for transforming raw financial data fetched by the Rust sec-fetcher into learned latent representations using deep learning techniques.

For detailed information on specific subsystems:

  • Data Ingestion & Preprocessing: See Data Ingestion & Preprocessing for details on CSV parsing, semantic embedding generation, and PCA dimensionality reduction.
  • Machine Learning Training Pipeline : See Machine Learning Training Pipeline for documentation on the Stage1Autoencoder model, PyTorch Lightning setup, and dataset streaming.
  • Database & Storage Integration: See Database & Storage Integration for details on the DbUsGaap interface and DataStoreWsClient WebSocket integration.
  • US GAAP Distribution Analyzer : See US GAAP Distribution Analyzer for information on the Rust-based clustering tool used to analyze concept distributions.

The Rust data fetching layer is documented in Rust sec-fetcher Application.

System Architecture Overview

The narrative_stack system processes US GAAP financial data through a multi-stage pipeline that transforms raw CSV files into learned latent representations. The architecture consists of three primary layers: storage/ingestion, preprocessing, and training.

graph TB
    subgraph "Storage Layer"
        DbUsGaap["DbUsGaap\nMySQL Database Interface"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client"]
UsGaapStore["UsGaapStore\nUnified Data Facade"]
end
    
    subgraph "Data Sources"
        RustCSV["CSV Files\nfrom sec-fetcher\ntruncated_csvs/"]
MySQL["MySQL Database\nus_gaap_test"]
SimdRDrive["simd-r-drive\nWebSocket Server"]
end
    
    subgraph "Preprocessing Components"
        IngestCSV["ingest_us_gaap_csvs\nCSV Walker & Parser"]
GenPCA["generate_pca_embeddings\nPCA Compression"]
RobustScaler["RobustScaler\nPer-Pair Normalization"]
ConceptEmbed["Semantic Embeddings\nConcept/Unit Pairs"]
end
    
    subgraph "Training Components"
        IterableDS["IterableConceptValueDataset\nStreaming Data Loader"]
Stage1AE["Stage1Autoencoder\nPyTorch Lightning Module"]
PLTrainer["pl.Trainer\nTraining Orchestration"]
Callbacks["EarlyStopping\nModelCheckpoint\nTensorBoard Logger"]
end
    
 
   RustCSV --> IngestCSV
 
   MySQL --> DbUsGaap
 
   SimdRDrive --> DataStoreWsClient
    
 
   DbUsGaap --> UsGaapStore
 
   DataStoreWsClient --> UsGaapStore
    
 
   IngestCSV --> UsGaapStore
 
   UsGaapStore --> GenPCA
 
   UsGaapStore --> RobustScaler
 
   UsGaapStore --> ConceptEmbed
    
 
   GenPCA --> UsGaapStore
 
   RobustScaler --> UsGaapStore
 
   ConceptEmbed --> UsGaapStore
    
 
   UsGaapStore --> IterableDS
 
   IterableDS --> Stage1AE
 
   Stage1AE --> PLTrainer
 
   PLTrainer --> Callbacks
    
 
   Callbacks -.->|Checkpoints| Stage1AE

Component Architecture

Sources:

Core Component Responsibilities

| Component | Type | Responsibilities |
|---|---|---|
| DbUsGaap | Database Interface | MySQL connection management, query execution. |
| DataStoreWsClient | WebSocket Client | Real-time communication with simd-r-drive server. |
| UsGaapStore | Data Facade | Unified API for ingestion, lookup, and embedding management. |
| IterableConceptValueDataset | PyTorch Dataset | Streaming data loader with internal batching to handle large datasets. |
| Stage1Autoencoder | Lightning Module | Encoder-decoder architecture for learning financial concept embeddings. |
| pl.Trainer | Training Framework | Orchestration of the training loop, logging, and checkpointing. |

Sources:

Data Flow Pipeline

The system processes financial data through a sequential pipeline from raw CSV files to trained model checkpoints.

graph LR
    subgraph "Input Stage"
        CSV1["Rust CSV Output\nproject_paths.rust_data/\ntruncated_csvs/"]
end
    
    subgraph "Ingestion Stage"
        Walk["walk_us_gaap_csvs\nDirectory Walker"]
Parse["UsGaapRowRecord\nParser"]
Store1["Store to\nDbUsGaap & DataStore"]
end
    
    subgraph "Preprocessing Stage"
        Extract["Extract\nconcept/unit pairs"]
GenEmbed["generate_concept_unit_embeddings\nSemantic Embeddings"]
Scale["RobustScaler\nPer-Pair Normalization"]
PCA["PCA Compression\nvariance_threshold=0.95"]
Triplet["Triplet Storage\nconcept+unit+scaled_value\n+scaler+embedding"]
end
    
    subgraph "Training Stage"
        Stream["IterableConceptValueDataset\ninternal_batch_size=64"]
DataLoader["DataLoader\nbatch_size from hparams\ncollate_with_scaler"]
Encode["Stage1Autoencoder.encode\nInput → Latent"]
Decode["Stage1Autoencoder.decode\nLatent → Reconstruction"]
Loss["MSE Loss\nReconstruction Error"]
end
    
    subgraph "Output Stage"
        Ckpt["Model Checkpoints\nstage1_resume-vN.ckpt"]
TB["TensorBoard Logs\nval_loss_epoch monitoring"]
end
    
 
   CSV1 --> Walk
 
   Walk --> Parse
 
   Parse --> Store1
 
   Store1 --> Extract
 
   Extract --> GenEmbed
 
   GenEmbed --> Scale
 
   Scale --> PCA
 
   PCA --> Triplet
 
   Triplet --> Stream
 
   Stream --> DataLoader
 
   DataLoader --> Encode
 
   Encode --> Decode
 
   Decode --> Loss
 
   Loss --> Ckpt
 
   Loss --> TB

End-to-End Processing Flow

Sources:

Processing Stages

  1. CSV Ingestion : The system ingests CSV files produced by the Rust sec-fetcher using us_gaap_store.ingest_us_gaap_csvs(). python/narrative_stack/notebooks/stage1_preprocessing.ipynb:21-23
  2. Concept/Unit Pair Extraction : Unique (concept, unit) pairs are extracted to define the semantic space. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68
  3. Semantic Embedding Generation : Embeddings capture relationships between financial concepts, compressed via PCA with a 0.95 variance threshold. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:263-269
  4. Value Normalization : RobustScaler is applied per-pair to handle outliers in financial magnitudes. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:88-96
  5. Model Training : The Stage1Autoencoder learns a bottleneck representation of the concatenated embedding and scaled value. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486
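For intuition on step 4, robust scaling centers each (concept, unit) pair's values on the median and divides by the interquartile range, which resists the extreme outliers common in financial magnitudes. A simplified sketch (in Rust, for consistency with this wiki's other examples; the pipeline itself uses sklearn's RobustScaler, whose quantile interpolation differs):

```rust
/// Per-pair robust scaling: center on the median, scale by the IQR.
fn robust_scale(values: &[f64]) -> Vec<f64> {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let q = |p: f64| {
        // nearest-rank style quantile; sklearn interpolates linearly
        let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
        sorted[idx]
    };
    let (median, iqr) = (q(0.5), q(0.75) - q(0.25));
    let scale = if iqr == 0.0 { 1.0 } else { iqr };
    values.iter().map(|v| (v - median) / scale).collect()
}

fn main() {
    // A single huge outlier barely moves the scale of the other values.
    let scaled = robust_scale(&[1.0, 2.0, 3.0, 4.0, 1_000_000.0]);
    assert_eq!(scaled[2], 0.0); // the median maps to zero
}
```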

Storage Architecture Integration

The Python system integrates with multiple storage backends to support different access patterns and data requirements.

graph TB
    subgraph "Python Application"
        App["narrative_stack\nNotebooks & Scripts"]
    end

    subgraph "Storage Facade"
        UsGaapStore["UsGaapStore\nUnified Interface"]
    end

    subgraph "Backend Interfaces"
        DbInterface["DbUsGaap\ndb_config"]
        WsInterface["DataStoreWsClient\nsimd_r_drive_server_config"]
    end

    subgraph "Storage Systems"
        MySQL["MySQL Server\nus_gaap_test database"]
        WsServer["simd-r-drive\nWebSocket Server\nKey-Value Store"]
        FileCache["File System\nCache Storage"]
    end

    subgraph "Data Types"
        Raw["Raw Records\nUsGaapRowRecord"]
        Triplets["Triplets\nconcept+unit+scaled_value\n+scaler+embedding"]
        Matrix["Embedding Matrix\nnumpy arrays"]
        PCAModel["PCA Models\nsklearn objects"]
    end

    App --> UsGaapStore
    UsGaapStore --> DbInterface
    UsGaapStore --> WsInterface
    DbInterface --> MySQL
    WsInterface --> WsServer
    WsServer --> FileCache
    MySQL -.->|stores| Raw
    WsServer -.->|stores| Triplets
    WsServer -.->|stores| Matrix
    WsServer -.->|stores| PCAModel

Storage Backend Architecture


Configuration and Initialization

The system uses centralized configuration for database connections and WebSocket server endpoints. The .vscode/settings.json file points to the specific Python environment for the stack.

Training Infrastructure

The training system uses PyTorch Lightning for experiment management.

Training Configuration


Key Training Parameters

| Parameter | Value | Purpose |
| --- | --- | --- |
| EPOCHS | 1000 | Maximum training epochs |
| internal_batch_size | 64 | Dataset internal batching size |
| num_workers | 2 | DataLoader worker processes |
| gradient_clip_val | From hparams | Gradient clipping threshold |


Machine Learning Training Pipeline

Relevant source files

Purpose and Scope

This page documents the machine learning training pipeline for the narrative_stack system, specifically the Stage 1 autoencoder that learns latent representations of US GAAP financial concepts. The training pipeline consumes preprocessed concept/unit/value triplets and their semantic embeddings to train a neural network that can encode financial data into a compressed latent space.

The pipeline uses PyTorch Lightning for training orchestration, implements custom iterable datasets for efficient data streaming from the simd-r-drive WebSocket server, and provides comprehensive experiment tracking through TensorBoard.

Training Pipeline Architecture

The training pipeline operates as a streaming system that continuously fetches preprocessed triplets from the UsGaapStore and feeds them through the autoencoder model. The architecture emphasizes memory efficiency and reproducibility by avoiding full-dataset loads into RAM.

graph TB
    subgraph "Data Source Layer"
        DataStoreWsClient["DataStoreWsClient\n(simd_r_drive_ws_client)"]
        UsGaapStore["UsGaapStore\nlookup_by_index()"]
    end

    subgraph "Dataset Layer"
        IterableDataset["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
        CollateFunction["collate_with_scaler()\nBatch construction"]
    end

    subgraph "PyTorch Lightning Training Loop"
        DataLoader["DataLoader\nbatch_size from hparams\nnum_workers=2\npin_memory=True\npersistent_workers=True"]
        Model["Stage1Autoencoder\nEncoder → Latent → Decoder"]
        Optimizer["Adam Optimizer\n+ CosineAnnealingWarmRestarts\nReduceLROnPlateau"]
        Trainer["pl.Trainer\nEarlyStopping\nModelCheckpoint\ngradient_clip_val"]
    end

    subgraph "Monitoring & Persistence"
        TensorBoard["TensorBoardLogger\ntrain_loss\nval_loss_epoch\nlearning_rate"]
        Checkpoints["Model Checkpoints\n.ckpt files\nsave_top_k=1\nmonitor='val_loss_epoch'"]
    end

    DataStoreWsClient --> UsGaapStore
    UsGaapStore --> IterableDataset
    IterableDataset --> CollateFunction
    CollateFunction --> DataLoader
    DataLoader --> Model
    Model --> Optimizer
    Optimizer --> Trainer
    Trainer --> TensorBoard
    Trainer --> Checkpoints
    Checkpoints -.->|Resume training| Model

Training Pipeline: Data Flow

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:456-556

Stage1Autoencoder Model

Model Architecture

The Stage1Autoencoder is a fully-connected autoencoder that learns to compress financial concept embeddings combined with their scaled values into a lower-dimensional latent space. The model reconstructs its input, forcing the latent representation to capture the most important features of the US GAAP taxonomy.

graph LR
    Input["Input Tensor\n[embedding + scaled_value]\nDimension: embedding_dim + 1"]
    Encoder["Encoder Network\nfc1 → dropout → ReLU\nfc2 → dropout → ReLU"]
    Latent["Latent Space\nDimension: latent_dim"]
    Decoder["Decoder Network\nfc3 → dropout → ReLU\nfc4 → dropout → output"]
    Output["Reconstructed Input\nSame dimension as input"]
    Loss["MSE Loss\ninput vs output"]

    Input --> Encoder
    Encoder --> Latent
    Latent --> Decoder
    Decoder --> Output
    Output --> Loss
    Input --> Loss

Hyperparameters

The model exposes the following configurable hyperparameters through its hparams attribute:

| Parameter | Description | Typical Value |
| --- | --- | --- |
| input_dim | Dimension of input (embedding + 1 for value) | Varies based on embedding size |
| latent_dim | Dimension of compressed latent space | 64-128 |
| dropout_rate | Dropout probability for regularization | 0.0-0.2 |
| lr | Initial learning rate | 1e-5 to 5e-5 |
| min_lr | Minimum learning rate for scheduler | 1e-7 to 1e-6 |
| batch_size | Training batch size | 32 |
| weight_decay | L2 regularization parameter | 1e-8 to 1e-4 |
| gradient_clip | Maximum gradient norm | 0.0-1.0 |

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490

Loss Function and Optimization

The model uses Mean Squared Error (MSE) loss between the input and the reconstructed output. The optimization strategy combines an Adam optimizer with CosineAnnealingWarmRestarts and ReduceLROnPlateau learning-rate scheduling.
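These two pieces can be sketched with stdlib Python: the reconstruction loss, and the closed-form learning-rate curve within one warm-restart cycle. The actual training uses torch.nn.functional.mse_loss and torch.optim.lr_scheduler.CosineAnnealingWarmRestarts; base_lr, min_lr, and the cycle length below are illustrative:

```python
import math

def mse_loss(x, x_hat):
    """Mean squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def cosine_annealing_lr(base_lr, min_lr, t_cur, t_i):
    """LR at step t_cur within a restart cycle of length t_i
    (the closed form behind CosineAnnealingWarmRestarts)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_i))

loss = mse_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
lr_start = cosine_annealing_lr(5e-5, 1e-7, t_cur=0, t_i=100)   # cycle start
lr_end = cosine_annealing_lr(5e-5, 1e-7, t_cur=100, t_i=100)  # cycle end
```

At the start of each cycle the learning rate resets to base_lr, then decays along a cosine to min_lr before the next restart.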

Dataset and Data Loading

IterableConceptValueDataset

The IterableConceptValueDataset is a custom PyTorch IterableDataset that streams training data from the UsGaapStore without loading the entire dataset into memory.

Its key streaming characteristics are illustrated in the following diagram:

graph TB
    subgraph "Dataset Initialization"
        Config["simd_r_drive_server_config\nhost + port"]
        Params["Dataset Parameters\ninternal_batch_size\nreturn_scaler\nshuffle"]
    end

    subgraph "Data Streaming Process"
        Store["UsGaapStore instance\nget_triplet_count()"]
        IndexGen["Index Generator\nSequential or shuffled\nbased on shuffle param"]
        BatchFetch["Internal Batching\nFetch internal_batch_size items\nvia batch_lookup_by_indices()"]
        Unpack["Unpack Triplet Data\nembedding\nscaled_value\nscaler (optional)"]
    end

    subgraph "Output"
        Tensor["PyTorch Tensors\nx: [embedding + scaled_value]\ny: [embedding + scaled_value]\nscaler: RobustScaler object"]
    end

    Config --> Store
    Params --> IndexGen
    Store --> IndexGen
    IndexGen --> BatchFetch
    BatchFetch --> Unpack
    Unpack --> Tensor
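The streaming pattern in the diagram can be sketched as a generator that fetches items in internal batches but yields them one at a time. This is a stdlib sketch, not the real dataset: lookup_batch stands in for UsGaapStore.batch_lookup_by_indices(), and the fake store below is illustrative:

```python
import random

def stream_triplets(lookup_batch, total, internal_batch_size=64,
                    shuffle=True, seed=0):
    """Yield items one at a time while fetching them in internal batches.

    Only internal_batch_size indices are materialized per fetch, so the
    full dataset never sits in memory at once.
    """
    indices = list(range(total))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, total, internal_batch_size):
        batch = lookup_batch(indices[start:start + internal_batch_size])
        yield from batch

# Fake store: index -> (embedding, scaled_value).
fake_store = {i: ([float(i)] * 3, i / 10.0) for i in range(10)}
items = list(stream_triplets(lambda idx: [fake_store[i] for i in idx],
                             total=10, internal_batch_size=4, shuffle=False))
```

With shuffle=True a seeded permutation of the index space is streamed instead, which keeps epochs reproducible without loading everything up front.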

DataLoader and Collation

The collate_with_scaler function handles batch construction when the dataset returns triplets (x, y, scaler). It stacks the tensors into batches while preserving the scaler objects in a list. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:506-507
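The collation pattern can be sketched in stdlib Python. The real collate_with_scaler stacks torch tensors with torch.stack; this hypothetical version uses plain lists to show the shape of the transformation:

```python
def collate_with_scaler_sketch(batch):
    """Group (x, y, scaler) triplets into batched xs/ys plus a scaler list.

    Sketch only: the real collate function produces stacked torch tensors
    for xs and ys, while scalers stay as ordinary Python objects.
    """
    xs, ys, scalers = zip(*batch)
    return list(xs), list(ys), list(scalers)

batch = [([0.1, 0.2], [0.1, 0.2], "scaler_a"),
         ([0.3, 0.4], [0.3, 0.4], "scaler_b")]
xs, ys, scalers = collate_with_scaler_sketch(batch)
```

Keeping the scalers in a side list matters because they are fitted sklearn objects, not tensors, and are needed later to invert the normalization.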

| Parameter | Value | Purpose |
| --- | --- | --- |
| batch_size | model.hparams.batch_size | Outer batch size for model training |
| num_workers | 2 | Parallel data loading processes |
| pin_memory | True | Faster host-to-GPU transfers |
| persistent_workers | True | Keeps worker processes alive between epochs |

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-522

Training Configuration

PyTorch Lightning Trainer Setup

The training pipeline uses PyTorch Lightning’s Trainer class to orchestrate the training loop, validation, and callbacks. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:468-548

Callbacks and Persistence

Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:528-548

Checkpointing and Resuming Training

The pipeline supports resuming training from a .ckpt file. This is handled by passing ckpt_path to trainer.fit(). python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:555

Alternatively, a model can be loaded for fine-tuning with modified hyperparameters. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486

Integration with Rust Caches

While training occurs in Python, the underlying data is often derived from the Rust preprocessor. The Caches struct in the Rust application manages preprocessor_cache.bin and http_storage_cache.bin src/caches.rs:11-14, which provide the raw data that the Python UsGaapStore eventually consumes. The Caches::open function src/caches.rs:29-51 ensures these data stores are correctly initialized on disk before the training pipeline attempts to access them via the WebSocket bridge.

Sources: src/caches.rs:11-60 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:456-556

US GAAP Distribution Analyzer

Relevant source files

The us-gaap-dist-analyzer is a specialized Rust sub-crate designed to perform unsupervised clustering and distribution analysis of US GAAP (Generally Accepted Accounting Principles) financial concepts. It bridges the gap between raw SEC filing data and semantic financial modeling by grouping taxonomical labels based on their contextual and mathematical distributions.

Purpose and Scope

The analyzer processes large-scale financial datasets to identify patterns in how public companies report specific financial metrics. By utilizing BERT embeddings for semantic understanding and K-Means clustering for statistical grouping, the tool allows researchers to visualize the high-dimensional space of US GAAP concepts and identify synonymous or related reporting items that may not share identical taxonomical names.


Technical Architecture

The system follows a linear pipeline that transforms raw US GAAP column names and their associated values into a clustered spatial representation.

Data Flow and Pipeline

The transformation process moves from high-dimensional natural language space to a reduced numerical space for efficient clustering.

  1. Embedding Generation : Converts US GAAP concept strings into vector representations using a BERT-based transformer model.
  2. Normalization : Applies scaling to ensure that concept values (magnitudes) do not disproportionately bias the clustering.
  3. Dimensionality Reduction : Uses Principal Component Analysis (PCA) to project high-dimensional embeddings into a lower-dimensional space while preserving variance.
  4. Clustering : Executes K-Means algorithms to group concepts into k distinct clusters.
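Step 4 above can be illustrated with a tiny, hand-rolled 1-D K-Means loop. This is a didactic sketch only: the crate uses linfa-clustering on PCA-reduced embedding vectors, not raw 1-D values, and the naive initialization here is a simplification:

```python
def kmeans_1d(points, k=2, iters=10):
    """Tiny 1-D K-Means: alternate centroid assignment and centroid update."""
    centroids = sorted(points)[:k]  # naive init: the k smallest points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean (keep empty clusters).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups of concept magnitudes:
centroids = sorted(kmeans_1d([1.0, 1.2, 0.9, 10.0, 10.4, 9.8], k=2))
```

Even with a poor initialization, the iterative assign/update loop converges to one centroid per natural group, which is the behavior the analyzer relies on at a much higher dimension.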

System Components Diagram

The following diagram illustrates the relationship between the logical analysis steps and the underlying implementation components.

US GAAP Analysis Pipeline

graph TD
    subgraph "Natural Language Space"
        A["US GAAP Concept Names"]
        B["distill_us_gaap_fundamental_concepts"]
    end

    subgraph "Vector Space (Code Entities)"
        C["BERT Embedding Engine"]
        D["PCA (Principal Component Analysis)"]
        E["K-Means Clusterer"]
    end

    subgraph "Output & Analysis"
        F["Cluster Centroids"]
        G["Concept Distribution Maps"]
    end

    A --> B
    B -- "Normalized Strings" --> C
    C -- "High-Dim Vectors" --> D
    D -- "Reduced Vectors" --> E
    E --> F
    E --> G

Sources: us-gaap-dist-analyzer/Cargo.lock:1-217 us-gaap-dist-analyzer/Cargo.lock:198-217


Implementation Details

Dependency Stack

The analyzer relies on several heavy-duty mathematical and machine learning libraries to perform its operations.

| Component | Library / Crate | Purpose |
| --- | --- | --- |
| Embeddings | rust-bert / tch | Loading and executing transformer models for semantic encoding |
| Linear Algebra | ndarray / nalgebra | Matrix operations for PCA and distance calculations |
| Clustering | linfa-clustering | Implementation of the K-Means algorithm |
| Data Handling | polars | High-performance DataFrame operations for managing large SEC datasets |
| Caching | cached-path | Managing local storage of model weights and pre-computed embeddings |

Sources: us-gaap-dist-analyzer/Cargo.lock:198-201 us-gaap-dist-analyzer/Cargo.lock:53-61

Key Logic Flow

The analyzer’s execution logic is centered around the transition from raw SEC data to categorized clusters.

Clustering Logic Flow

sequenceDiagram
    participant Data as "SEC Data (Polars DataFrame)"
    participant BERT as "BERT Encoder"
    participant DimRed as "PCA Transformer"
    participant Cluster as "K-Means Engine"

    Data->>BERT: Extract Concept Labels
    BERT->>BERT: Generate 768-dim Embeddings
    BERT->>DimRed: Pass Embedding Matrix
    DimRed->>DimRed: Compute Covariance & Eigenvectors
    DimRed->>Cluster: Reduced Matrix (n_components)
    Cluster->>Cluster: Iterative Centroid Assignment
    Cluster-->>Data: Append 'cluster_id' to Labels

Sources: us-gaap-dist-analyzer/Cargo.lock:44-50 us-gaap-dist-analyzer/Cargo.lock:151-157


Data Integration

The us-gaap-dist-analyzer works in tandem with the narrative_stack Python components and the core Rust sec-fetcher library.

Relationship to Fundamental Concepts

While the Rust core defines a strict taxonomy in the FundamentalConcept enum [3.4 US GAAP Concept Transformation], this analyzer is used to discover new relationships or validate the existing taxonomy by observing how concepts are actually used in the wild.

  1. Input : The analyzer typically consumes the output of pull-us-gaap-bulk or the UsGaapStore.
  2. Processing : It clusters concepts like CashAndCashEquivalentsAtCarryingValue and CashAndCashEquivalentsPeriodIncreaseDecrease to see if they consistently appear in the same reporting “neighborhood.”
  3. Validation : Results are used to refine the mapping patterns used in distill_us_gaap_fundamental_concepts.

Performance Considerations

  • Memory Usage : Because it handles large embedding matrices, the crate utilizes ndarray for memory-efficient slicing and tch (LibTorch) for GPU-accelerated tensor operations when available.
  • Persistence : Clustering models and PCA projections are often serialized to disk using serde to allow for incremental analysis of new filing batches without re-training the entire distribution map.

Sources: us-gaap-dist-analyzer/Cargo.lock:209-211 us-gaap-dist-analyzer/Cargo.lock:165-175

Development Guide

Relevant source files

Purpose and Scope

This guide provides an overview of development practices, code organization, and workflows for contributing to the rust-sec-fetcher project. It covers environment setup, code organization principles, development workflows, and common development tasks.

For detailed information about specific development topics, see:

  • Testing strategies and test fixtures: Testing Strategy
  • Continuous integration and automated testing: CI/CD Pipeline
  • Docker container configuration: Docker Deployment

Development Environment Setup

Prerequisites

The project requires the following tools installed:

| Tool | Purpose | Version Requirement |
| --- | --- | --- |
| Rust | Core application development | 1.87+ |
| Python | ML pipeline and preprocessing | 3.8+ |
| Docker | Integration testing and services | Latest stable |
| Git LFS | Large file support for test assets | Latest stable |
| MySQL | Database for US GAAP storage | 5.7+ or 8.0+ |

Rust Development Setup

  1. Clone the repository and navigate to the root directory.

  2. Build the Rust application:

  3. Run tests to verify setup:

The Rust workspace is configured with necessary dependencies. Key development dependencies include:

  • mockito for HTTP mocking in tests.
  • tempfile for temporary file/directory creation in tests.
  • tokio test macros for async test support.

Python Development Setup

  1. Create a virtual environment:

  2. Install dependencies using uv:

  3. Verify installation by running integration tests (requires Docker):

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Configuration Setup

The application requires a configuration file at ~/.config/sec-fetcher/config.toml or a custom path. Minimum configuration:

For non-interactive testing, use AppConfig directly in test code as shown in tests/config_manager_tests.rs:36-57

Sources: tests/config_manager_tests.rs:36-57 tests/sec_client_tests.rs:8-20

Code Organization and Architecture

Repository Structure

Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159 python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Module Dependency Flow

The dependency flow follows a layered architecture:

  1. Configuration Layer : ConfigManager loads settings from TOML files and credentials.
  2. Network Layer : SecClient wraps HTTP client with caching and throttling middleware.
  3. Data Fetching Layer : Network module functions fetch raw data from SEC APIs via Url variants src/enums/url_enum.rs:5-116
  4. Transformation Layer : Transformers normalize raw data into standardized concepts.
  5. Model Layer : Data structures represent domain entities.

Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 src/enums/url_enum.rs:5-116

Development Workflow

Standard Development Cycle

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Running Tests Locally

Rust Unit Tests

Run all Rust tests with cargo:

Run specific test modules:

Test Structure Mapping:

| Test File | Tests Component | Key Test Functions |
| --- | --- | --- |
| tests/config_manager_tests.rs | ConfigManager | test_load_custom_config, test_load_non_existent_config |
| tests/sec_client_tests.rs | SecClient | test_user_agent, test_fetch_json_without_retry_success |

Sources: tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159

Python Integration Tests

Integration tests require Docker services. Run via the provided shell script:

This script performs the following steps as defined in python/narrative_stack/us_gaap_store_integration_test.sh:1-39:

  1. Activates Python virtual environment.
  2. Installs dependencies with uv.
  3. Starts Docker Compose services.
  4. Waits for MySQL availability.
  5. Creates us_gaap_test database and loads schema.
  6. Runs pytest integration tests.
  7. Tears down containers on exit.

Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39

Writing Tests

Unit Test Pattern (Rust)

The codebase uses mockito for HTTP mocking:

Sources: tests/sec_client_tests.rs:35-62

Test Fixture Pattern

The codebase uses temporary directories for file-based tests via tempfile::tempdir() as shown in tests/config_manager_tests.rs:8-17

Sources: tests/config_manager_tests.rs:8-17

Common Development Tasks

Adding a New SEC Data Endpoint

To add support for a new SEC data endpoint:

  1. Add URL enum variant in src/enums/url_enum.rs src/enums/url_enum.rs:5-116
  2. Update Url::value() to return the formatted URL string src/enums/url_enum.rs:121-165
  3. Create fetch function in src/network/ using the new Url variant.
  4. Define data models in src/models/ for the response structure.

Sources: src/enums/url_enum.rs:5-165

Modifying HTTP Client Behavior

The SecClient is configured in src/network/sec_client.rs:21-89 Key configuration points:

| Configuration | Location | Purpose |
| --- | --- | --- |
| CachePolicy | src/network/sec_client.rs:45-50 | Controls cache TTL and behavior |
| ThrottlePolicy | src/network/sec_client.rs:53-59 | Controls rate limiting and retries |
| User-Agent | src/network/sec_client.rs:91-108 | Constructs SEC-compliant User-Agent header |

Sources: src/network/sec_client.rs:21-108

Code Quality Standards

CI/CD and Maintenance

The project uses GitHub Actions for automated quality checks:

  • Linting : rust-lint.yml runs clippy and rustfmt.
  • Testing : rust-tests.yml runs the test suite.
  • Documentation : build-docs.yml generates documentation weekly build-docs.yml:1-81
  • Dependency Updates : Dependabot is configured for weekly Cargo updates dependabot.yml:1-10

Sources: .github/workflows/build-docs.yml:1-81 .github/dependabot.yml:1-10

Testing Strategy

Relevant source files

This page documents the testing approach for the rust-sec-fetcher codebase. The strategy emphasizes high-fidelity parsing verification using compressed EDGAR fixtures, HTTP layer isolation via mockito, and exhaustive mapping tests for US GAAP concept normalization.

Test Architecture Overview

The testing infrastructure is designed to handle the “dirty” nature of SEC EDGAR data (e.g., changing scale conventions in 13F filings or inconsistent XBRL tagging) by using real-world data snapshots as the primary source of truth.

graph TB
    subgraph "Rust Unit & Integration Tests"
        SecClientTest["sec_client_tests.rs\nTest SecClient HTTP & Retries"]
        CikSubTest["cik_submissions_tests.rs\nTest Submissions JSON Parsing"]
        ThirteenFTest["parse_13f_xml_tests.rs\nTest 13F Scaling & Sums"]
        EraTest["thirteenf_era_fixture_tests.rs\nTest 2023 Schema Crossover"]
        UsGaapAccTest["us_gaap_parser_accuracy_tests.rs\nVerify DF against JSON Source"]
    end

    subgraph "Test Infrastructure & Fixtures"
        MockitoServer["mockito::Server\nHTTP mock server"]
        FixtureFiles["tests/fixtures/*.gz\nCompressed SEC Snapshots"]
        RefreshBin["refresh_test_fixtures.rs\nUtility to update snapshots"]
    end

    SecClientTest --> MockitoServer
    CikSubTest --> FixtureFiles
    ThirteenFTest --> FixtureFiles
    EraTest --> FixtureFiles
    UsGaapAccTest --> FixtureFiles
    RefreshBin --> FixtureFiles

Test Component Relationships

Sources: tests/sec_client_tests.rs:1-132 tests/cik_submissions_tests.rs:1-39 tests/thirteenf_era_fixture_tests.rs:1-30 src/bin/refresh_test_fixtures.rs:1-30


Fixture-Based Testing Strategy

The project relies on a “Fixture-First” approach for data parsers. Instead of mocking complex nested JSON/XML structures by hand, the refresh_test_fixtures binary src/bin/refresh_test_fixtures.rs:1-173 downloads real filings from EDGAR and stores them as Gzip-compressed files in tests/fixtures/.

graph LR
    subgraph "Development Space"
        RefreshBin["bin/refresh_test_fixtures.rs"]
FixturesDir["tests/fixtures/"]
end

    subgraph "Code Entity Space"
        LoadFixture["load_fixture()"]
GzDecoder["flate2::read::GzDecoder"]
Parser["parse_cik_submissions_json()"]
end

    RefreshBin -- "Fetch & Compress" --> FixturesDir
    FixturesDir -- "Read .gz" --> LoadFixture
    LoadFixture -- "Decompress" --> GzDecoder
    GzDecoder -- "Stream JSON/XML" --> Parser

The Fixture Lifecycle

Sources: src/bin/refresh_test_fixtures.rs:31-173 tests/cik_submissions_tests.rs:16-30 tests/us_gaap_parser_accuracy_tests.rs:9-23

Parser Accuracy Verification

The us_gaap_parser_accuracy_tests.rs implements a deep-validation logic that ensures the Polars DataFrame produced by parse_us_gaap_fundamentals src/parsers/parse_us_gaap_fundamentals.rs:41-103 is 100% traceable to the source JSON.

Sources: tests/us_gaap_parser_accuracy_tests.rs:31-160 src/parsers/parse_us_gaap_fundamentals.rs:25-103


Specific Domain Testing

13F Era Crossover Testing

The SEC changed the <value> field in 13F-HR filings from “thousands of USD” to “actual USD” on 2023-01-01. The thirteenf_era_fixture_tests.rs uses three specific Berkshire Hathaway (BRK-B) fixtures to verify the normalization logic tests/thirteenf_era_fixture_tests.rs:1-12

| Fixture | Filing Date | Expected Scaling | Sanity Check |
| --- | --- | --- | --- |
| BRK_B_13f_ancient.xml | 2022-11-14 | Value * 1,000 | AAPL Price ~$138/sh tests/thirteenf_era_fixture_tests.rs:92-106 |
| BRK_B_13f_transition.xml | 2023-02-14 | Value * 1 | AAPL Price ~$130/sh tests/thirteenf_era_fixture_tests.rs:148-162 |
| BRK_B_13f_modern.xml | 2026-02-17 | Value * 1 | Modern Schema tests/thirteenf_era_fixture_tests.rs:12 |
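The era rule exercised by these fixtures can be expressed as a small helper. The function name and inputs below are hypothetical; the real normalization lives in the Rust 13F parser:

```python
from datetime import date

def normalize_13f_value(raw_value, filing_date):
    """Normalize a 13F <value> field to actual USD.

    Filings dated before 2023-01-01 report thousands of USD; filings on
    or after that date report actual USD. Hypothetical sketch of the rule.
    """
    cutover = date(2023, 1, 1)
    return raw_value * 1_000 if filing_date < cutover else raw_value

ancient = normalize_13f_value(100, date(2022, 11, 14))  # thousands-era filing
modern = normalize_13f_value(100, date(2023, 2, 14))    # actual-USD filing
```

The per-share price sanity checks in the table exist precisely because a wrong era decision produces values off by three orders of magnitude.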

Sources: tests/thirteenf_era_fixture_tests.rs:1-162 src/bin/refresh_test_fixtures.rs:147-172

EDGAR Atom Feed Testing

The edgar_feed_tests.rs validates the polling logic for the live EDGAR “Latest Filings” feed.

  • Delta Filtering : Tests the take_while logic that stops fetching when it hits a “high-water mark” (the timestamp of the last processed filing) tests/edgar_feed_tests.rs:43-51
  • High-Water Mark : Ensures the FeedDelta correctly identifies the newest entry’s timestamp as the next mark tests/edgar_feed_tests.rs:179-194
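The delta-filtering idea can be sketched with itertools.takewhile. Entries arrive newest-first, so iteration stops at the first entry at or before the high-water mark; the dict shape below is illustrative, as the real logic operates on parsed EDGAR Atom entries:

```python
from itertools import takewhile

def new_entries(entries, high_water_mark):
    """Return feed entries strictly newer than the last processed timestamp.

    ISO-8601 timestamps compare correctly as strings, so takewhile can
    stop at the first entry at or before the mark.
    """
    return list(takewhile(lambda e: e["updated"] > high_water_mark, entries))

feed = [{"updated": "2024-05-03T12:00:00Z"},
        {"updated": "2024-05-03T11:00:00Z"},
        {"updated": "2024-05-03T10:00:00Z"}]
delta = new_entries(feed, high_water_mark="2024-05-03T11:00:00Z")
```

After processing, the newest entry's timestamp becomes the next high-water mark, so each poll fetches only the delta.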

Sources: tests/edgar_feed_tests.rs:11-194


Network & Client Testing

The SecClient is tested using mockito to simulate SEC server responses, ensuring the crate respects the SEC’s strict User-Agent and rate-limiting requirements.

sequenceDiagram
    participant Test as test_fetch_json_with_retry_failure
    participant Client as SecClient
    participant Mock as mockito::Server
    
    Test->>Mock: Expect(3) calls to /path
    Test->>Client: fetch_json(mock_url)
    
    Client->>Mock: Attempt 1
    Mock-->>Client: 500 Internal Server Error
    Note over Client: Backoff Delay
    
    Client->>Mock: Attempt 2 (Retry)
    Mock-->>Client: 500 Internal Server Error
    
    Client->>Mock: Attempt 3 (Final Retry)
    Mock-->>Client: 500 Internal Server Error
    
    Client-->>Test: Err("Max retries exceeded")
    Test->>Test: assert!(result.is_err())

HTTP Mocking Pattern

Sources: tests/sec_client_tests.rs:194-222

User-Agent Compliance

The SEC requires a User-Agent in the format AppName/Version (+Email). Tests verify that SecClient correctly falls back to crate defaults or uses custom overrides provided in AppConfig tests/sec_client_tests.rs:7-82

Sources: tests/sec_client_tests.rs:7-82


Summary of Test Modules

| File | Subsystem Tested | Key Functions |
| --- | --- | --- |
| sec_client_tests.rs | Network / Auth | test_fetch_json_with_retry_backoff, test_user_agent |
| cik_submissions_tests.rs | Submissions Parser | parse_cik_submissions_json, items_split_correctly_on_comma |
| parse_13f_xml_tests.rs | 13F Holdings | weight_pct_is_on_0_to_100_scale, weights_sum_to_100 |
| parse_nport_xml_tests.rs | N-PORT Holdings | pct_val_is_on_0_to_100_scale_for_tiny_position |
| edgar_feed_tests.rs | Polling / Feed | parse_edgar_atom_feed, delta_filter_excludes_entries_at_or_before_mark |
| us_gaap_parser_accuracy_tests.rs | XBRL Financials | validate_dataframe_against_json |

Sources: tests/sec_client_tests.rs:1-132 tests/cik_submissions_tests.rs:1-176 tests/parse_13f_xml_tests.rs:1-187 tests/parse_nport_xml_tests.rs:1-177 tests/edgar_feed_tests.rs:1-195 tests/us_gaap_parser_accuracy_tests.rs:1-160

CI/CD Pipeline

Relevant source files

Purpose and Scope

This document explains the continuous integration and continuous deployment (CI/CD) infrastructure for the rust-sec-fetcher repository. It covers the GitHub Actions workflow configuration, integration test automation, documentation deployment, and dependency management.

The CI/CD architecture is split between Rust-specific validation (linting, testing) and the Python narrative_stack system’s integration testing. For general testing strategies including Rust unit tests and Python test fixtures, see [Testing Strategy](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Testing Strategy)


GitHub Actions Workflows

The repository implements several GitHub Actions workflows to ensure code quality and system reliability across the dual Rust/Python architecture.

1. US GAAP Store Integration Test

This workflow validates the Python machine learning pipeline’s integration with external dependencies. It is triggered by changes to the python/narrative_stack/ directory [.github/workflows/us-gaap-store-integration-test.yml:3-11].

Sources: [.github/workflows/us-gaap-store-integration-test.yml:3-11]

graph TB
    PushTrigger["Push Event"]
    PRTrigger["Pull Request Event"]
    PathCheck{"Changed paths include:\npython/narrative_stack/**\nor workflow file itself?"}
    WorkflowRun["Execute us-gaap-store-integration-test.yml"]
    Skip["Skip workflow execution"]

    PushTrigger --> PathCheck
    PRTrigger --> PathCheck
    PathCheck -->|Yes| WorkflowRun
    PathCheck -->|No| Skip

2. Build and Deploy Documentation

This workflow automates the generation of the project’s documentation using deepwiki-to-mdbook. It runs weekly or on manual dispatch [.github/workflows/build-docs.yml:4-7].

| Step | Implementation | Purpose |
| --- | --- | --- |
| Resolve Metadata | Shell script | Determines repo name and book title [.github/workflows/build-docs.yml:25-52] |
| Generate Docs | jzombie/deepwiki-to-mdbook@main | Converts wiki content to mdBook format [.github/workflows/build-docs.yml:59-64] |
| Deploy | actions/deploy-pages@v4 | Publishes to GitHub Pages [.github/workflows/build-docs.yml:78-80] |

Sources: [.github/workflows/build-docs.yml:1-81]


Integration Test Job Structure

The us-gaap-store-integration-test workflow defines a single job named integration-test that executes on ubuntu-latest [.github/workflows/us-gaap-store-integration-test.yml:12-15].

Sources: [.github/workflows/us-gaap-store-integration-test.yml:17-50]

graph TB
    Start["Job: integration-test"]
    Checkout["Step 1: Checkout repo\nactions/checkout@v4\nwith lfs: true"]
    SetupPython["Step 2: Set up Python\nactions/setup-python@v5\npython-version: 3.12"]
    InstallUV["Step 3: Install uv\ncurl astral.sh/uv/install.sh"]
    InstallDeps["Step 4: Install Python dependencies\nuv pip install -e . --group dev"]
    Ruff["Step 5: Check style with Ruff\nruff check ."]
    RunTest["Step 6: Run integration test\n./us_gaap_store_integration_test.sh"]

    Start --> Checkout
    Checkout --> SetupPython
    SetupPython --> InstallUV
    InstallUV --> InstallDeps
    InstallDeps --> Ruff
    Ruff --> RunTest

Integration Test Architecture

The integration test orchestrates multiple Docker containers to create an isolated environment for validating the narrative_stack data flow.

Container & Entity Mapping

This diagram maps the CI orchestration to specific code entities and external services.

Sources: [python/narrative_stack/us_gaap_store_integration_test.sh:1-39], [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34]

graph TB
    subgraph "Docker Compose Project: us_gaap_it"
        MySQL["Container: us_gaap_test_db\n(MySQL)"]
        SimdRDrive["Container: simd_r_drive_ws_server_test\n(WebSocket Server)"]
        TestRunner["Test Runner\npytest process"]
    end

    Schema["SQL Schema\ntests/integration/assets/us_gaap_schema_2025.sql"]
    PyTestFile["tests/integration/test_us_gaap_store.py"]

    TestRunner -->|Executes| PyTestFile
    PyTestFile -->|SQL queries| MySQL
    PyTestFile -->|WS connection| SimdRDrive
    Schema -->|Loaded via mysql CLI| MySQL

Test Execution Flow

The integration test script [python/narrative_stack/us_gaap_store_integration_test.sh:1-39] manages the container lifecycle.

Sources: [python/narrative_stack/us_gaap_store_integration_test.sh:1-39]

graph TB
    Start["Start: us_gaap_store_integration_test.sh"]
    SetVars["Set variables\nPROJECT_NAME=us_gaap_it"]
    RegisterTrap["Register cleanup trap\ntrap 'cleanup' EXIT"]
    DockerUp["Start Docker containers\ndocker compose up -d --profile test"]
    WaitMySQL["Wait for MySQL ready\nmysqladmin ping loop"]
    LoadSchema["Load schema\nmysql < us_gaap_schema_2025.sql"]
    RunPytest["Execute pytest\npytest tests/integration/test_us_gaap_store.py"]
    Cleanup["Cleanup function\ndocker compose down --volumes"]

    Start --> SetVars
    SetVars --> RegisterTrap
    RegisterTrap --> DockerUp
    DockerUp --> WaitMySQL
    WaitMySQL --> LoadSchema
    LoadSchema --> RunPytest
    RunPytest --> Cleanup

Docker Container Configuration

simd-r-drive-ws-server Container Build

The Dockerfile [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34] creates a single-stage image for the CI server. It installs the simd-r-drive-ws-server crate version 0.10.0-alpha [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:12].

Sources: [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-33]

graph LR
    BuildTime["Build Time\n--build-arg SERVER_ARGS"]
    BakeArgs["ENV SERVER_ARGS"]
    Entrypoint["ENTRYPOINT interpolates\n$SERVER_ARGS + $@"]
    ServerExec["Execute:\nsimd-r-drive-ws-server"]

    BuildTime --> BakeArgs
    BakeArgs --> Entrypoint
    Entrypoint --> ServerExec
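The interpolation step can be illustrated with a short POSIX shell sketch. The flag value below is hypothetical, and the real entrypoint would `exec` the server rather than echo the command line:

```shell
#!/bin/sh
# SERVER_ARGS is baked into the image at build time (ENV SERVER_ARGS=...);
# "$@" carries any extra arguments passed at `docker run` time.
SERVER_ARGS="${SERVER_ARGS:---host 0.0.0.0}"   # hypothetical default flag

# The real entrypoint would be:
#   exec simd-r-drive-ws-server $SERVER_ARGS "$@"
# Here we only assemble the final command line to show the interpolation:
final_cmd="simd-r-drive-ws-server $SERVER_ARGS $*"
echo "$final_cmd"
```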

Dependency Management

The project uses Dependabot to maintain up-to-date dependencies for the Rust components.

| Ecosystem | Directory | Schedule |
| --- | --- | --- |
| cargo | / | Weekly [.github/dependabot.yml:6-9] |

Sources: [.github/dependabot.yml:1-10]


Environment Configuration

Python Environment

The CI pipeline uses uv for fast, reproducible environment setup [.github/workflows/us-gaap-store-integration-test.yml:27-37].

  • Python Version : 3.12
  • Installation : uv pip install -e . --group dev

Project Isolation

To prevent resource conflicts, the integration test uses a specific Docker Compose project name: us_gaap_it [python/narrative_stack/us_gaap_store_integration_test.sh:7]. This ensures that networks and volumes are isolated from other development or CI tasks.

Sources: [python/narrative_stack/us_gaap_store_integration_test.sh:7-9]


Dependencies & Technology Stack


Relevant source files

This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system.

Overview

The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.

Sources: Cargo.toml:1-82 Cargo.lock:1-100

Rust Technology Stack

Core Direct Dependencies

The Rust application declares its direct dependencies in the manifest, each serving specific architectural roles:

Dependency Categories and Usage:

graph TB
    subgraph "Async Runtime & Concurrency"
        tokio["tokio 1.50.0\nFull async runtime"]
        rayon["rayon 1.11.0\nData parallelism"]
        dashmap["dashmap 6.1.0\nConcurrent hashmap"]
    end

    subgraph "HTTP & Network"
        reqwest["reqwest 0.13.2\nHTTP client"]
        reqwest_drive["reqwest-drive 0.13.2-alpha\nDrive middleware"]
    end

    subgraph "Data Processing"
        polars["polars 0.46.0\nDataFrame operations"]
        csv_crate["csv 1.4.0\nCSV parsing"]
        serde["serde 1.0.228\nSerialization"]
        serde_json["serde_json 1.0.149\nJSON support"]
        quick_xml["quick-xml 0.39.2\nXML parsing"]
    end

    subgraph "Storage & Caching"
        simd_r_drive["simd-r-drive 0.15.5-alpha\nKey-value store"]
        simd_r_drive_ext["simd-r-drive-extensions\n0.15.5-alpha"]
    end

    subgraph "Configuration & Validation"
        config_crate["config 0.15.9\nConfig management"]
        keyring["keyring 3.6.2\nCredential storage"]
        email_address["email_address 0.2.9\nEmail validation"]
        rust_decimal["rust_decimal 1.40.0\nDecimal numbers"]
        chrono["chrono 0.4.44\nDate/time handling"]
    end

    subgraph "Development Tools"
        mockito["mockito 1.7.0\nHTTP mocking"]
        tempfile["tempfile 3.27.0\nTemp file creation"]
    end

| Category | Crates | Primary Use Cases |
| --- | --- | --- |
| Async Runtime | tokio | Event loop, async I/O, task scheduling Cargo.toml:79 |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration Cargo.toml:66-67 |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing Cargo.toml:61 |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing Cargo.toml:71-73 |
| Concurrency | rayon, dashmap | Parallel processing, concurrent data structures Cargo.toml:50-64 |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage Cargo.toml:74-75 |
| Configuration | config, keyring | TOML config loading, secure credential management Cargo.toml:48-59 |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps Cargo.toml:46-68 |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation Cargo.toml:45-58 |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files Cargo.toml:78-85 |

Sources: Cargo.toml:44-86

Storage and Caching System

The caching layer utilizes simd-r-drive to persist HTTP responses and preprocessed results.

Cache Architecture:

graph TB
    subgraph "simd-r-drive Ecosystem"
        simd_r_drive["simd-r-drive 0.15.5-alpha\nCore key-value store"]
        DataStore["DataStore\n(simd_r_drive::DataStore)"]
        simd_r_drive --> DataStore
    end

    subgraph "Cache Management"
        Caches["Caches struct\n(src/caches.rs)"]
        http_cache["http_cache: Arc&lt;DataStore&gt;"]
        pre_cache["preprocessor_cache: Arc&lt;DataStore&gt;"]
        Caches --> http_cache
        Caches --> pre_cache
    end

    subgraph "On-Disk Files"
        f1["http_storage_cache.bin"]
        f2["preprocessor_cache.bin"]
        http_cache -.-> f1
        pre_cache -.-> f2
    end

  • Isolation : The Caches struct manages two distinct DataStore instances src/caches.rs:11-14
  • Initialization : The Caches::open function ensures the base directory exists and opens the .bin storage files src/caches.rs:29-51
  • Thread Safety : Access to stores is provided via Arc<DataStore> clones src/caches.rs:53-59

Sources: src/caches.rs:1-60 Cargo.toml:74-75

Numeric Support and Precision

The system relies on rust_decimal for financial calculations where floating-point errors are unacceptable.

| Crate | Version | Key Types | Purpose |
| --- | --- | --- | --- |
| rust_decimal | 1.40.0 | Decimal | Exact decimal arithmetic for US GAAP values Cargo.toml:68 |
| chrono | 0.4.44 | NaiveDate | Date handling for filing periods Cargo.toml:46 |
| polars | 0.46.0 | DataFrame | High-performance columnar data processing Cargo.toml:61 |

Sources: Cargo.toml:46-68
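The motivation for exact decimal arithmetic can be shown with Python's `decimal` module, which plays the same role here that `rust_decimal` plays in the Rust code (a language-neutral illustration, not code from this repository):

```python
from decimal import Decimal

# Binary floats accumulate rounding error on currency-style values:
float_sum = 0.1 + 0.2                          # 0.30000000000000004
decimal_sum = Decimal("0.1") + Decimal("0.2")  # exactly Decimal('0.3')

assert float_sum != 0.3
assert decimal_sum == Decimal("0.3")
```

For US GAAP values, where reported figures must round-trip exactly, this is why a decimal type is used instead of `f64`.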

Python Technology Stack

The Python narrative_stack system focuses on the machine learning pipeline and data analysis.

Machine Learning Framework

The training pipeline uses PyTorch and PyTorch Lightning to build and train autoencoders on US GAAP concepts.

Preprocessing Logic:

graph TB
    subgraph "Training Pipeline"
        Stage1Autoencoder["Stage1Autoencoder\n(PyTorch Lightning Module)"]
        IterableConceptValueDataset["IterableConceptValueDataset\n(PyTorch Dataset)"]
        Stage1Autoencoder --> IterableConceptValueDataset
    end

    subgraph "Scientific Stack"
        numpy["NumPy\nArray operations"]
        pandas["pandas\nData manipulation"]
        sklearn["scikit-learn\nPreprocessing & PCA"]
    end

    subgraph "Preprocessing"
        RobustScaler["RobustScaler\n(Normalization)"]
        PCA["PCA\n(Dimensionality Reduction)"]
        sklearn --> RobustScaler
        sklearn --> PCA
    end

  • RobustScaler : Normalizes values per concept/unit pair to handle outliers in financial data.
  • PCA : Reduces the dimensionality of semantic embeddings while preserving variance.

Sources: Project architecture overview, Cargo.toml:61 (Polars/Python integration context)
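The robust-scaling idea above can be approximated in a few lines of pure Python: values are centered on the median and scaled by the interquartile range, which is what makes the normalization resistant to the extreme outliers common in financial data. This is a simplified sketch with crude quartile estimates; the real pipeline uses scikit-learn's `RobustScaler` per concept/unit pair.

```python
from statistics import median

def robust_scale(values):
    """Center on the median and scale by the IQR (sketch of RobustScaler)."""
    m = median(values)
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]           # crude quartile estimates, for illustration only
    q3 = s[(3 * n) // 4]
    iqr = (q3 - q1) or 1.0   # avoid division by zero for constant inputs
    return [(v - m) / iqr for v in values]

# An outlier shifts the mean wildly but barely moves the median/IQR,
# so the four inlier values stay in a small, comparable range:
scaled = robust_scale([100.0, 102.0, 98.0, 101.0, 1_000_000.0])
```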

Database and Storage Integration

The Python stack interacts with both relational and key-value stores:

  • MySQL : Stores ingested US GAAP triplets (concept, unit, value).
  • simd-r-drive (via WebSocket) : The DataStoreWsClient allows the Python stack to access the same high-performance storage used by the Rust application.

Sources: Project architecture overview.
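The triplet schema can be illustrated with an in-memory SQLite sketch. SQLite stands in for MySQL here, and the table, column names, and values are hypothetical, chosen only to mirror the (concept, unit, value) description above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE us_gaap_triplets ("
    "  concept TEXT NOT NULL,"   # e.g. 'NetIncomeLoss'
    "  unit    TEXT NOT NULL,"   # e.g. 'USD'
    "  value   REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO us_gaap_triplets VALUES (?, ?, ?)",
    [("NetIncomeLoss", "USD", 1_000_000.0),   # illustrative values
     ("Assets", "USD", 2_000_000.0)],
)
rows = conn.execute(
    "SELECT concept, unit, value FROM us_gaap_triplets WHERE concept = ?",
    ("NetIncomeLoss",),
).fetchall()
```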

Shared Infrastructure

graph TB
    subgraph "Rust: sec-fetcher"
        distill["distill_us_gaap_fundamental_concepts"]
        csv_out["CSV Export\n(src/bin/pulls/us_gaap_bulk.rs)"]
        distill --> csv_out
    end

    subgraph "Storage"
        shared_dir["/data/us-gaap/"]
        csv_out --> shared_dir
    end

    subgraph "Python: narrative_stack"
        ingest["Data Ingestion"]
        shared_dir --> ingest
        ingest --> model["Stage1Autoencoder"]
    end

File System Interchange

Data is passed between the Rust fetcher and the Python ML stack primarily through CSV files and shared storage.

Sources: Cargo.toml:28-34 src/caches.rs:25-51
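A minimal sketch of the CSV handoff, using Python's `csv` module on both sides of the exchange. The column names are hypothetical; in the real system the files are produced by the Rust exporter and consumed by the Python ingestion step.

```python
import csv
import io

# "Rust side": write normalized rows to CSV (simulated with StringIO).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["cik", "concept", "unit", "value"])
writer.writeheader()
writer.writerow({"cik": "0000320193", "concept": "NetIncomeLoss",
                 "unit": "USD", "value": "1000000"})

# "Python side": ingest the same rows for the ML pipeline.
buf.seek(0)
rows = list(csv.DictReader(buf))
```

File-based interchange keeps the two languages decoupled: either side can be rerun independently as long as the column contract holds.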

Development and CI/CD Stack

GitHub Actions Workflow

The project uses GitHub Actions for continuous integration, ensuring cross-platform compatibility and code quality.

  • Rust Tests : Executes cargo test across the workspace.
  • Lints : Runs cargo clippy and cargo fmt.
  • Integration : Uses docker-compose to spin up simd-r-drive-ws-server and MySQL for end-to-end testing.

Sources: Cargo.toml:83-86 Cargo.lock:1-100

Version Compatibility Matrix

| Component | Version | Requirement |
| --- | --- | --- |
| Rust Edition | 2024 | Cargo.toml:6 |
| Polars | 0.46.0 | Cargo.toml:61 |
| Tokio | 1.50.0 | Cargo.toml:79 |
| Reqwest | 0.13.2 | Cargo.toml:66 |

Sources: Cargo.toml:1-82


Glossary


Relevant source files

This glossary defines the technical terms, abbreviations, and domain-specific concepts used throughout the rust-sec-fetcher codebase. It serves as a reference for onboarding engineers to understand the intersection of SEC regulatory requirements, financial data structures, and the system’s implementation patterns.

SEC & EDGAR Terminology

The following terms originate from the U.S. Securities and Exchange Commission (SEC) and its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system.

Accession Number

A unique identifier assigned by the SEC to every filing. It follows the format 0000320193-25-000008, where the first part is the CIK of the filer, the second is the year, and the third is a sequence number.
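The format can be unpacked with a short Python sketch (a helper written for illustration, not part of the codebase):

```python
def parse_accession_number(acc: str) -> dict:
    """Split '0000320193-25-000008' into its three components."""
    cik, year, sequence = acc.split("-")
    return {"cik": cik, "year": year, "sequence": sequence}

parts = parse_accession_number("0000320193-25-000008")
```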

CIK (Central Index Key)

A 10-digit number used by the SEC to uniquely identify corporations and individuals who have filed disclosures.
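EDGAR identifiers zero-pad the CIK to 10 digits, which is a one-liner in any language (illustrative Python; the URL follows the standard `/submissions/CIK{cik}.json` pattern referenced elsewhere on this page):

```python
# Apple's CIK is 320193; EDGAR identifiers zero-pad it to 10 digits.
cik = 320193
padded = str(cik).zfill(10)            # '0000320193'
url = f"https://data.sec.gov/submissions/CIK{padded}.json"
```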

Filing Index

The HTML landing page for a specific filing (e.g., -index.htm). It lists all documents associated with a submission, including the primary form (10-K) and all exhibits (EX-99.1).

US GAAP & XBRL

US GAAP (Generally Accepted Accounting Principles) is the standard framework of guidelines for financial accounting. XBRL (eXtensible Business Reporting Language) is the XML-based standard used to “tag” these financial concepts (e.g., NetIncomeLoss) in filings.

Sources: src/network/fetch_us_gaap_fundamentals.rs:11-53 src/network/filings/filing_index.rs:14-22 src/models/ticker.rs:19-25


Codebase-Specific Concepts

Derived Instrument

Financial instruments that are not the primary common stock listing but share the same CIK, such as warrants (-WT), units (-UN), or preferred shares (-PA).

Fuzzy Matching (Company Names)

A mechanism to resolve a human-readable company name to a Ticker and Cik using tokenization and weighted scoring.
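The tokenization-and-scoring idea can be sketched in a few lines of Python. This is a simplified stand-in for `Ticker::get_by_fuzzy_matched_name`; the stop-word list and scoring rule are invented for illustration.

```python
def tokenize(name: str) -> set[str]:
    # Lowercase, strip punctuation, drop common corporate suffixes.
    stop = {"inc", "corp", "co", "ltd", "plc"}
    tokens = name.lower().replace(".", "").replace(",", "").split()
    return {t for t in tokens if t not in stop}

def fuzzy_score(query: str, candidate: str) -> float:
    """Token overlap: |intersection| / |query tokens|."""
    q, c = tokenize(query), tokenize(candidate)
    return len(q & c) / len(q) if q else 0.0

best = max(["Apple Inc.", "Applied Materials Inc."],
           key=lambda name: fuzzy_score("Apple", name))
```

Dropping suffixes like "Inc" before scoring is what lets a bare query such as "Apple" match the registered company name exactly.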

Rendering Views

Traits and structures that define how a raw SEC HTML document is transformed into a readable format.

Sources: src/network/fetch_company_tickers.rs:50-55 src/models/ticker.rs:43-122 examples/ipo_show.rs:92-108


System Architecture Diagrams

graph TD
    subgraph "Input Space"
        Input["User Input: 'AAPL' or 'Apple Inc'"]
    end

    subgraph "Code Entity Space (src/network & src/models)"
        Client["SecClient"]
        FetchTicker["fetch_company_tickers"]
        FuzzyMatch["Ticker::get_by_fuzzy_matched_name"]
        CikLookup["fetch_cik_by_ticker_symbol"]
    end

    subgraph "Data Sources"
        JSON["company_tickers.json"]
        TXT["ticker.txt"]
    end

    Input --> Client
    Client --> FetchTicker
    FetchTicker --> JSON
    FetchTicker --> TXT
    FetchTicker --> FuzzyMatch
    Input --> CikLookup
    CikLookup --> Client

Ticker Resolution Pipeline

This diagram bridges the natural language “Ticker/Name” input to the code entities responsible for resolution.

Sources: src/network/fetch_company_tickers.rs:58-61 src/models/ticker.rs:38-42 src/network/fetch_cik_by_ticker_symbol.rs

graph LR
    subgraph "SEC EDGAR"
        Submissions["/submissions/CIK{cik}.json"]
        IndexPage["-index.htm"]
        DocHTML["primary_doc.htm"]
    end

    subgraph "Logic (src/network/filings & src/ops)"
        FetchSub["fetch_cik_submissions"]
        FetchIdx["fetch_filing_index"]
        RenderOp["render_filing"]
    end

    subgraph "Views (src/views)"
        MDV["MarkdownView"]
        ETV["EmbeddingTextView"]
    end

    Submissions --> FetchSub
    FetchSub --> FetchIdx
    FetchIdx --> IndexPage
    IndexPage --> RenderOp
    RenderOp --> DocHTML
    RenderOp --> MDV
    RenderOp --> ETV

Filing Retrieval and Rendering Flow

This diagram shows the transition from a raw SEC submission to a rendered view.

Sources: src/network/filings/filing_index.rs:108-114 src/network/fetch_us_gaap_fundamentals.rs:74-81 examples/ipo_show.rs:110-114


Terminology Summary Table

| Term | Context | Code Reference |
| --- | --- | --- |
| N-PORT | Monthly portfolio holdings for funds | NportInvestment src/models/nport_investment.rs:11 |
| 13F | Quarterly holdings for institutional managers | fetch_13f_filings src/network/filings/mod.rs:19 |
| S-1 / F-1 | IPO Registration Statements | FormType::IPO_REGISTRATION_FORM_TYPES examples/ipo_list.rs:88 |
| 424B4 | Final Pricing Prospectus | FormType::IPO_PRICING_FORM_TYPES examples/ipo_list.rs:90-91 |
| Pct | Normalized percentage type (0-100) | Pct src/models/nport_investment.rs:35 |
| NamespaceHasher | Cache key generation with prefixing | NamespaceHasher src/models/ticker.rs:13-17 |

Sources: src/models/nport_investment.rs:11-41 src/network/filings/mod.rs:16-29 examples/ipo_list.rs:1-17 src/models/ticker.rs:13-17
