This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system’s overall design and data flow. For installation and configuration instructions, see [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started). For detailed implementation documentation of individual components, see [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application) and [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System).
Sources: Cargo.toml:1-10 README.md:1-8
System Purpose
The rust-sec-fetcher repository implements a dual-language financial data processing system that:
- Fetches company financial data from the SEC EDGAR API, including filings (10-K, 10-Q, 8-K), fund holdings (13F, N-PORT), and IPO registrations. README.md:5-8
- Transforms raw SEC filings into normalized US GAAP fundamental concepts or clean Markdown/text views. README.md:5-8 src/ops/filing_ops.rs:1-15
- Stores structured data as CSV files or provides it via high-level data models like `Ticker`, `Cik`, and `NportInvestment`. src/models/ticker.rs:1-10 src/models/nport.rs:1-10
- Trains machine learning models (in the Python `narrative_stack`) to understand financial concept relationships and perform semantic analysis.
The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing variations of concepts and consolidating them into a standardized taxonomy of FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
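The taxonomy itself lives in the crate's `FundamentalConcept` enum (src/enums/fundamental_concept.rs) and the mapping in `distill_us_gaap_fundamental_concepts`. As a rough illustration of what this normalization means, here is a minimal self-contained sketch; the tag groupings and function name below are simplified assumptions, not the crate's actual mapping:

```rust
// Illustrative sketch of US GAAP tag normalization. The real taxonomy is the
// crate's `FundamentalConcept` enum; the groupings here are assumptions.
#[derive(Debug, PartialEq)]
enum FundamentalConcept {
    Revenues,
    NetIncomeLoss,
}

/// Map a raw XBRL tag to a normalized concept, if one is recognized.
fn normalize_concept(raw_tag: &str) -> Option<FundamentalConcept> {
    match raw_tag {
        // Several revenue tag variations consolidate into one concept.
        "Revenues"
        | "SalesRevenueNet"
        | "RevenueFromContractWithCustomerExcludingAssessedTax" => {
            Some(FundamentalConcept::Revenues)
        }
        "NetIncomeLoss" | "ProfitLoss" => Some(FundamentalConcept::NetIncomeLoss),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        normalize_concept("SalesRevenueNet"),
        Some(FundamentalConcept::Revenues)
    );
    assert_eq!(normalize_concept("SomeCustomTag"), None);
}
```

Consolidating variant tags into one enum variant is what allows downstream consumers to query "revenue" once instead of handling every filer-specific tag.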
Sources: Cargo.toml:1-10 README.md:1-8 src/enums/fundamental_concept.rs:1-20
Dual-Language Architecture
The repository employs a dual-language design that leverages the strengths of both Rust and Python:
| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, XML/JSON parsing | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |
Rust Layer (sec-fetcher)
- Implements `SecClient` with sophisticated throttling and caching policies. src/network/sec_client.rs:1-50
- Fetches company tickers, CIK submissions, N-PORT filings, and US GAAP fundamentals. src/network/mod.rs:1-30
- Transforms raw financial data via `distill_us_gaap_fundamental_concepts`. src/transformers/us_gaap.rs:1-20
- Provides specialized parsers for 13F, N-PORT, and Form 4 XML documents. src/parsers/mod.rs:1-20
Python Layer (narrative_stack)
- Ingests data generated by the Rust layer.
- Generates semantic embeddings for concept/unit pairs.
- Trains `Stage1Autoencoder` models using PyTorch Lightning.
Sources: Cargo.toml:1-42 README.md:15-50
High-Level Data Flow
The following diagram maps each stage of the high-level data flow to the code entities responsible for it.
```mermaid
graph TB
SEC["SEC EDGAR API\n(company_tickers.json, submissions, archives)"]
SecClient["SecClient\n(src/network/sec_client.rs)"]
NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_filings\nfetch_nport_filings\n(src/network/mod.rs)"]
Parsers["Parsers\nparse_13f_xml\nparse_nport_xml\n(src/parsers/mod.rs)"]
Ops["Operations Logic\nrender_filing\nfetch_and_render\n(src/ops/filing_ops.rs)"]
Models["Data Models\nTicker, CikSubmission\nNportInvestment\n(src/models/mod.rs)"]
Views["Views\nMarkdownView\nEmbeddingTextView\n(src/views/mod.rs)"]
PythonIngest["Python narrative_stack\nData Ingestion & Training"]
SEC -->|HTTP GET| SecClient
SecClient --> NetworkFuncs
NetworkFuncs --> Parsers
Parsers --> Models
NetworkFuncs --> Ops
Ops --> Views
Models --> PythonIngest
```
System Data Flow and Code Entities
Data Flow Summary:
- Fetch: `SecClient` retrieves data from SEC EDGAR API endpoints using structured `Url` variants. src/enums/url_enum.rs:5-116
- Parse: Raw XML/JSON data is processed by specialized parsers (e.g., `parse_13f_xml`) into internal models. src/parsers/thirteen_f.rs:1-10
- Transform/Render: Filings are rendered into human-readable or machine-learning-ready formats via `render_filing`. src/ops/filing_ops.rs:118-130
- Analyze: Normalized data is passed to the Python layer for ML training and embedding generation.
Sources: src/network/sec_client.rs:1-50 src/enums/url_enum.rs:5-116 src/ops/filing_ops.rs:118-130 README.md:110-140
Core Module Structure
The Rust codebase is modularized to separate networking, data modeling, and business logic.
Module Dependency Graph
```mermaid
graph TB
main["main.rs / lib.rs\nEntry Points"]
config["config module\nConfigManager\nAppConfig\n(src/config/mod.rs)"]
network["network module\nSecClient\nThrottlePolicy\n(src/network/mod.rs)"]
ops["ops module\nrender_filing\nholdings operations\n(src/ops/mod.rs)"]
models["models module\nTicker, Cik, AccessionNumber\n(src/models/mod.rs)"]
enums["enums module\nFundamentalConcept\nUrl, FormType\n(src/enums/mod.rs)"]
parsers["parsers module\nXML/JSON parsers\n(src/parsers/mod.rs)"]
views["views module\nMarkdownView\n(src/views/mod.rs)"]
main --> config
main --> network
main --> ops
network --> config
network --> models
network --> enums
ops --> network
ops --> views
ops --> models
parsers --> models
models --> enums
```
| Module | Primary Purpose | Key Exports |
|---|---|---|
| `config` | Configuration management and credential loading. | `ConfigManager`, `AppConfig` src/config/mod.rs |
| `network` | HTTP client, data fetching, and throttling. | `SecClient`, `fetch_filings`, `fetch_company_profile` src/network/mod.rs |
| `ops` | High-level business logic and data orchestration. | `render_filing`, `fetch_and_render` src/ops/mod.rs |
| `models` | Domain-specific data structures. | `Ticker`, `Cik`, `CikSubmission`, `NportInvestment` src/models/mod.rs |
| `enums` | Type-safe enumerations for SEC concepts. | `FundamentalConcept`, `FormType`, `Url` src/enums/mod.rs |
| `parsers` | Logic for converting SEC formats to Rust structs. | `parse_13f_xml`, `parse_nport_xml` src/parsers/mod.rs |
| `views` | Rendering logic for different output formats. | `MarkdownView`, `EmbeddingTextView` src/views/mod.rs |
Sources: Cargo.toml:15-35 src/enums/url_enum.rs:1-5
Key Technologies
Rust Dependencies:
- `tokio`: Async runtime for non-blocking I/O. Cargo.toml:79
- `reqwest`: Robust HTTP client with JSON support. Cargo.toml:66
- `polars`: High-performance DataFrame library for data manipulation. Cargo.toml:61
- `quick-xml`: Fast XML parsing for SEC filing documents. Cargo.toml:62
- `html-to-markdown-rs`: Converts SEC HTML filings to Markdown. Cargo.toml:55
- `simd-r-drive`: Integration with a high-performance data store. Cargo.toml:74
Python Dependencies (narrative_stack):
- PyTorch & PyTorch Lightning (ML training)
- scikit-learn (Preprocessing and PCA)
- BERT (Contextual embeddings for concept clustering)
Sources: Cargo.toml:44-82
Next Steps
- Getting Started: Learn how to configure your SEC contact email and run your first lookup. See [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started)
- Rust Architecture: Dive deep into the `SecClient` and networking layer. See [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application)
- ML Pipeline: Explore the autoencoder training and US GAAP alignment. See [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System)
Sources: README.md:9-15
Getting Started
This page guides you through installing, configuring, and running the rust-sec-fetcher application. It covers building the Rust binary, setting up required credentials, and executing your first data fetch. For detailed configuration options, see Configuration System. For comprehensive examples, see Running Examples.
The rust-sec-fetcher is the Rust component of a dual-language system. It fetches and transforms SEC financial data into structured formats. The companion Python system (Python narrative_stack System) processes these files for machine learning applications.
Prerequisites
Before installation, ensure you have:
| Requirement | Purpose | Notes |
|---|---|---|
| Rust 1.87+ | Compile sec-fetcher | Edition 2024 features required Cargo.toml:6 |
| Email Address | SEC EDGAR API access | Required by SEC for User-Agent identification README.md:13-14 |
| Disk Space | Cache and data storage | Default location: data/ directory README.md:204-205 |
| Internet Connection | SEC API access | Throttled to stay within SEC guidelines README.md:5-8 |
Optional Components:
- Python 3.8+ for the ML pipeline (Python narrative_stack System)
- Docker for the `simd-r-drive` WebSocket server (Docker Deployment)
Sources: Cargo.toml:1-82 README.md:1-205
Installation
Clone Repository
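The wiki omits the exact commands here; assuming the GitHub repository referenced by the links in this documentation, a standard clone looks like:

```shell
git clone https://github.com/jzombie/rust-sec-fetcher.git
cd rust-sec-fetcher
```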
Build from Source
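With Rust 1.87+ installed, a standard release build compiles the library, the main binary, and the specialized binaries:

```shell
cargo build --release
```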
The project includes several specialized binaries for data maintenance and bulk pulling. Cargo.toml:15-35
Verify Installation
Run the configuration display example to ensure the environment is ready:
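Assuming the `config_show` example listed under Running Examples, a quick smoke test is:

```shell
cargo run --example config_show
```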
Installation and Setup Flow
Sources: Cargo.toml:15-35 README.md:51-53 src/enums/url_enum.rs:30-52
Basic Configuration
The application uses ConfigManager to coordinate settings from files, environment variables, and the system keychain.
Configuration Entry Points
Every program using the library initializes through one of two patterns README.md:17-49:
- `ConfigManager::load()`: The standard approach. It reads from local config files or environment variables and prompts interactively for the email if it is missing.
- `ConfigManager::from_app_config(&AppConfig { ... })`: Used to hard-code values directly in specialized tools.
Required Email Identification
The SEC mandates a contact email in the User-Agent header README.md:13-14. sec-fetcher provides multiple ways to supply it:
- Interactive: Prompted on first run and stored via `keyring` (if the feature is enabled). Cargo.toml:42
- Environment: Setting the `SEC_FETCHER_EMAIL` variable.
- Code: Passing it directly to `AppConfig`.
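As a minimal sketch of the environment-variable path only (keyring storage and interactive prompting are omitted; the real resolution lives in `ConfigManager`/`CredentialManager`):

```rust
use std::env;

/// Resolve the SEC contact email from the environment, mirroring one of the
/// lookup sources described above. This is an illustrative helper, not the
/// crate's actual API.
fn email_from_env() -> Option<String> {
    env::var("SEC_FETCHER_EMAIL").ok().filter(|s| !s.is_empty())
}

fn main() {
    env::set_var("SEC_FETCHER_EMAIL", "user@example.com");
    assert_eq!(email_from_env().as_deref(), Some("user@example.com"));
}
```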
Configuration Loading Logic
Sources: README.md:15-49 Cargo.toml:36-43 src/enums/url_enum.rs:121-163
Running Your First Data Fetch
Example: Ticker to CIK Lookup
The library maps human-readable tickers to SEC Central Index Keys (CIKs) using `fetch_cik_by_ticker_symbol`. README.md:63-71
Example: Fetch and Render Filings
You can retrieve specific forms (like 10-K or 8-K) and render them into Markdown for LLM processing or human reading. README.md:112-140
Example: Fund Holdings (13F/N-PORT)
The library can parse complex XML filings into structured data. README.md:195-202
Data Retrieval Flow
Sources: README.md:112-140 src/enums/url_enum.rs:26-29 src/enums/url_enum.rs:146-150
Data Output & Caching
sec-fetcher is designed to be a “good citizen” on the SEC servers:
- Throttling: Automatically limits requests to 10 per second per SEC policy. README.md:5-8
- Caching: Uses `simd-r-drive` to cache raw responses locally, reducing redundant network traffic. Cargo.toml:74-75
- Structured Storage: Data is typically organized by CIK or ticker in the `data/` directory.
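The crate's actual policy lives in `SecClient` and its `ThrottlePolicy`; the underlying idea can be sketched in a few lines as a minimum interval between requests (this simplified version ignores concurrency and retries):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Minimal interval-based throttle: enforces at most `max_per_sec` calls per
/// second by sleeping out the remainder of each interval. Illustrative only;
/// the crate's real `ThrottlePolicy` is more sophisticated.
struct Throttle {
    min_interval: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(max_per_sec: u32) -> Self {
        Self {
            min_interval: Duration::from_secs(1) / max_per_sec,
            last: None,
        }
    }

    /// Block until it is safe to issue the next request.
    fn acquire(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                sleep(self.min_interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    let mut throttle = Throttle::new(10); // SEC guideline: 10 requests/second
    let start = Instant::now();
    for _ in 0..3 {
        throttle.acquire();
        // issue a request here
    }
    // Three calls at 10/s require at least two 100 ms gaps.
    assert!(start.elapsed() >= Duration::from_millis(200));
}
```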
Specialized Binaries
For bulk operations, use the provided binaries Cargo.toml:15-35:
- `pull-us-gaap-bulk`: Downloads large-scale XBRL datasets for ML training.
- `pull-fund-holdings`: Aggregates holdings from multiple investment companies.
- `refresh-test-fixtures`: Updates local mock data for the test suite.
For details on the transformation of these datasets, see US GAAP Concept Transformation.
Sources: Cargo.toml:15-35 README.md:204-205 src/enums/url_enum.rs:54-55
Configuration System
Relevant source files
- src/config.rs
- src/config/app_config.rs
- src/config/config_manager.rs
- src/config/credential_manager.rs
- src/models/cik.rs
- src/models/investment_company.rs
- src/parsers/parse_nport_xml.rs
- tests/config_manager_tests.rs
This document describes the configuration management system in the Rust sec-fetcher application. The configuration system provides a flexible, multi-source approach to managing application settings including SEC API credentials, request throttling parameters, and cache directories.
For information about how the configuration integrates with the caching system, see Caching & Storage System. For details on credential storage mechanisms, see the credential management section below.
Overview
The configuration system consists of three primary components:
| Component | Purpose | File Location |
|---|---|---|
| `AppConfig` | Data structure holding all configuration fields with JSON schema support | src/config/app_config.rs:16-41 |
| `ConfigManager` | Loads, merges, and validates configuration from multiple sources | src/config/config_manager.rs:95-101 |
| `CredentialManager` | Handles email credential storage and retrieval via system keyring | src/config/credential_manager.rs:19-22 |
The system supports configuration from four sources with increasing priority:
- Default values - Hard-coded defaults in `AppConfig::default()`. src/config/app_config.rs:43-62
- Environment variables - `SEC_FETCHER_EMAIL`, `SEC_FETCHER_APP_NAME`, etc. src/config/config_manager.rs:56-88
- TOML configuration file - User-specified settings loaded from disk. src/config/config_manager.rs:168-171
- Interactive prompts - Credential collection when running in interactive mode. src/config/credential_manager.rs:81-127
Sources: src/config/app_config.rs:16-62 src/config/config_manager.rs:56-155
Configuration Structure
AppConfig Fields
All fields in AppConfig are wrapped in Option<T> to support partial configuration and merging strategies. The #[merge(strategy = overwrite_option)] attribute ensures that non-None values from user configuration always replace default values.
Sources: src/config/app_config.rs:16-41 src/config/app_config.rs:43-62
Configuration Loading Flow
ConfigManager Initialization
Sources: src/config/config_manager.rs:107-182 src/config/credential_manager.rs:81-127
Configuration File Format
TOML Structure
The configuration file uses TOML format with strict schema validation. Any unrecognized keys will cause a descriptive error listing all valid keys and their types using schemars integration.
Example configuration file (sec_fetcher_config.toml):
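The original example file is not reproduced in this wiki snapshot; the following is an illustrative file using the keys accepted by `AppConfig` (the values shown are placeholders, not recommended settings):

```toml
# sec_fetcher_config.toml — illustrative values only
email = "user@example.com"
app_name = "sec-fetcher"
app_version = "1.0.0"
max_concurrent = 4
min_delay_ms = 100
max_retries = 3
cache_base_dir = "data/cache"
```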
Configuration File Locations
The system searches for configuration files in the following order:
| Priority | Location | Description |
|---|---|---|
| 1 | User-provided path | Passed to ConfigManager::from_config(Some(path)) |
| 2 | System config directory | ~/.config/sec_fetcher_config.toml (Unix) |
| 3 | Current directory | ./sec_fetcher_config.toml |
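The search order above can be sketched as follows; the function name and exact paths are illustrative assumptions, and the real logic lives in `ConfigManager::from_config`:

```rust
use std::path::PathBuf;

/// Illustrative sketch of the config-file lookup order: explicit path,
/// then the Unix config directory, then the current working directory.
fn resolve_config_path(user_path: Option<PathBuf>) -> Option<PathBuf> {
    let mut candidates: Vec<PathBuf> = Vec::new();
    if let Some(p) = user_path {
        candidates.push(p); // 1. explicit user-provided path
    }
    if let Some(home) = std::env::var_os("HOME") {
        // 2. system config directory (Unix layout)
        candidates.push(PathBuf::from(home).join(".config/sec_fetcher_config.toml"));
    }
    // 3. current working directory
    candidates.push(PathBuf::from("sec_fetcher_config.toml"));
    candidates.into_iter().find(|p| p.exists())
}

fn main() {
    let tmp = std::env::temp_dir().join("sec_fetcher_config_example.toml");
    std::fs::write(&tmp, "email = \"user@example.com\"\n").unwrap();
    // An existing user-provided path wins over the other locations.
    assert_eq!(resolve_config_path(Some(tmp.clone())), Some(tmp.clone()));
    std::fs::remove_file(&tmp).unwrap();
}
```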
Sources: src/config/config_manager.rs:103-104 src/config/config_manager.rs:129-131 src/config/config_manager.rs:166-170
Credential Management
Email Credential Requirement
The SEC EDGAR API mandates a valid email address in the HTTP User-Agent header for identification. The configuration system resolves this using the following precedence:
```mermaid
graph TD
Start["ConfigManager Resolution"]
CheckArg["App Identity Override?"]
CheckFile["Email in TOML Config?"]
CheckEnv["SEC_FETCHER_EMAIL Env Var?"]
CheckInteractive{"is_interactive_mode()?"}
PromptUser["CredentialManager::from_prompt()"]
ReadKeyring["Read from system keyring"]
PromptInput["Prompt user for email"]
SaveKeyring["Save to keyring"]
SetEmail["Set email in ConfigManager"]
Error["Return Error:\nCould not obtain email credential"]
Start --> CheckArg
CheckArg -->|No| CheckFile
CheckFile -->|No| CheckEnv
CheckEnv -->|No| CheckInteractive
CheckInteractive -->|Yes| PromptUser
CheckInteractive -->|No| Error
PromptUser --> ReadKeyring
ReadKeyring -->|Found| SetEmail
ReadKeyring -->|Not found| PromptInput
PromptInput --> SaveKeyring
SaveKeyring --> SetEmail
```
Interactive Mode Detection:
- `is_interactive_mode()` returns `true` if `stdin`/`stdout` are a terminal (TTY). src/config/credential_manager.rs:82-84
- It can be overridden via `set_interactive_mode_override()` for testing. tests/config_manager_tests.rs:81
Sources: src/config/config_manager.rs:41-56 src/config/credential_manager.rs:32-76
Configuration Merging Strategy
Merge Behavior
The AppConfig struct uses the merge crate with a custom overwrite_option strategy defined in src/config/app_config.rs:8-12:
Merge Rules:
- `Some(new_value)` always replaces `Some(old_value)`. src/config/app_config.rs:9-11
- `Some(new_value)` always replaces `None`.
- `None` never replaces `Some(old_value)`.
This ensures user-provided values take absolute precedence over defaults while allowing partial configuration.
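The three merge rules can be expressed in a few lines; this is a self-contained sketch of the strategy's behavior (the signature follows the common `merge`-crate strategy shape, but is not copied from the crate's source):

```rust
/// Sketch of the `overwrite_option` merge strategy: a non-`None` incoming
/// value always overwrites, while `None` leaves the existing value untouched.
fn overwrite_option<T>(left: &mut Option<T>, right: Option<T>) {
    if right.is_some() {
        *left = right;
    }
}

fn main() {
    let mut email: Option<&str> = Some("default@example.com");

    // Some(new) replaces Some(old).
    overwrite_option(&mut email, Some("user@example.com"));
    assert_eq!(email, Some("user@example.com"));

    // None never replaces Some(old).
    overwrite_option(&mut email, None);
    assert_eq!(email, Some("user@example.com"));

    // Some(new) replaces None.
    let mut app_name: Option<&str> = None;
    overwrite_option(&mut app_name, Some("sec-fetcher"));
    assert_eq!(app_name, Some("sec-fetcher"));
}
```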
Sources: src/config/app_config.rs:8-12 src/config/config_manager.rs:172-182
Schema Validation and Error Handling
Invalid Key Detection
The configuration system uses `#[serde(deny_unknown_fields)]` to reject unknown keys in TOML files src/config/app_config.rs:15. When an invalid key is detected, the error message includes a complete list of valid keys with their types, extracted via `schemars`.
```mermaid
graph LR
TOML["TOML File with\ninvalid_key"]
Deserialize["config.try_deserialize()"]
ExtractSchema["AppConfig::get_valid_keys()"]
Schema["schemars::schema_for!(AppConfig)"]
FormatError["Format error message with\nvalid keys and types"]
TOML --> Deserialize
Deserialize -->|Error| ExtractSchema
ExtractSchema --> Schema
Schema --> FormatError
FormatError --> Error["Return descriptive error"]
```
Example error output:
```text
unknown field `invalid_key`, expected one of `email`, `app_name`, `app_version`, `max_concurrent`, `min_delay_ms`, `max_retries`, `cache_base_dir`
Valid configuration keys are:
- email (String | Null)
- app_name (String | Null)
- app_version (String | Null)
- max_concurrent (Integer | Null)
- min_delay_ms (Integer | Null)
- max_retries (Integer | Null)
- cache_base_dir (String | Null)
```
Sources: src/config/app_config.rs:69-83 src/config/config_manager.rs:175-182 tests/config_manager_tests.rs:154-181
Usage Examples
Loading Configuration
Default configuration path:
Custom configuration path:
Overriding App Identity:
Sources: src/config/config_manager.rs:108-110 src/config/config_manager.rs:129-131 src/config/config_manager.rs:151-155
Cache Isolation
The ConfigManager manages the lifetime of the cache directory. If cache_base_dir is not provided in the configuration, it creates a unique `tempfile::TempDir`. src/config/app_config.rs:45-48
This ensures that tests and ephemeral runs have fully isolated cache storage that is cleaned up when the ConfigManager is dropped. src/config/app_config.rs:47-48
Sources: src/config/app_config.rs:43-62 src/config/config_manager.rs:98-100
Testing
Test Coverage
The configuration system includes unit tests in tests/config_manager_tests.rs covering:
| Test Function | Purpose |
|---|---|
| `test_load_custom_config` | Verifies loading from a custom TOML file path. tests/config_manager_tests.rs:46 |
| `test_fails_on_invalid_key` | Validates schema enforcement and error messages. tests/config_manager_tests.rs:154 |
| `test_email_from_env_var_non_interactive` | Checks `SEC_FETCHER_EMAIL` resolution. tests/config_manager_tests.rs:103 |
| `test_config_file_email_takes_precedence` | Validates priority of TOML over environment variables. tests/config_manager_tests.rs:130 |
Sources: tests/config_manager_tests.rs:1-207
Running Examples
Relevant source files
- deny.toml
- examples/8k_exhibits_as_markdown.rs
- examples/8k_list.rs
- examples/cik_show.rs
- examples/company_search.rs
- examples/company_show.rs
- examples/config_show.rs
- examples/edgar_feed_poll.rs
- examples/edgar_index_browse.rs
- examples/filing_render.rs
- examples/filing_show.rs
- examples/fund_series_list.rs
- examples/holdings_show.rs
- examples/nport_render.rs
- examples/press_release_show.rs
- examples/ticker_list.rs
- examples/ticker_show.rs
- examples/us_gaap_column_stats.rs
- examples/us_gaap_search.rs
This page demonstrates how to run the example programs included in the rust-sec-fetcher repository. These programs illustrate the library’s capabilities, ranging from simple company lookups to complex filing rendering and EDGAR feed polling.
Example Programs Overview
The repository includes a variety of example programs in the examples/ directory. These serve as functional documentation for the system’s core modules.
| Example Program | Primary Purpose | Key Functions / Models |
|---|---|---|
| `config_show` | Verify active configuration and credentials | `ConfigManager::load`, `AppConfig::pretty_print` |
| `company_search` | Fuzzy-match names against the SEC ticker list | `fetch_company_tickers`, `Ticker::get_by_fuzzy_matched_name` |
| `company_show` | Display full SEC profile and industry metadata | `fetch_company_profile`, `fetch_company_description` |
| `cik_show` | Look up a CIK and find specific recent filings | `fetch_cik_by_ticker_symbol`, `CikSubmission::most_recent_by_form` |
| `8k_list` | List all 8-K filings for a ticker with URLs | `fetch_8k_filings`, `as_primary_document_url` |
| `filing_render` | Fetch and render any SEC URL to clean text | `fetch_and_render`, `MarkdownView`, `EmbeddingTextView` |
| `filing_show` | Render primary documents and/or exhibits | `render_filing`, `fetch_filings` |
| `edgar_feed_poll` | Delta-poll the live SEC Atom feed | `fetch_edgar_feeds_since`, `FeedEntry` |
| `edgar_index_browse` | Browse historical quarterly master indexes | `fetch_edgar_master_index`, `MasterIndexEntry` |
| `fund_series_list` | List registered mutual funds and share classes | `fetch_investment_company_series_and_class_dataset` |
Sources: examples/config_show.rs:1-12 examples/company_search.rs:1-9 examples/company_show.rs:1-14 examples/cik_show.rs:1-15 examples/8k_list.rs:1-10 examples/filing_render.rs:1-12 examples/filing_show.rs:1-7 examples/edgar_feed_poll.rs:1-36 examples/edgar_index_browse.rs:1-21 examples/fund_series_list.rs:1-14
System Integration Architecture
Diagram: Example Program Data Flow
Sources: examples/company_search.rs:46-49 examples/company_show.rs:88-109 examples/filing_render.rs:67-79 examples/edgar_feed_poll.rs:144-157
Company Lookup and Search
company_search: Fuzzy Matching
Purpose : Resolves ambiguous company names or ticker symbols against the official SEC ticker list using tokenization and fuzzy matching.
Usage :
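The exact invocation is not shown in this wiki snapshot; a plausible one (the query argument is an assumption — check the example's `--help` output) is:

```shell
cargo run --example company_search -- "Apple"
```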
Implementation:
- It calls `fetch_company_tickers` to get the master list. examples/company_search.rs:49
- It tokenizes the query using `Ticker::tokenize_company_name`. examples/company_search.rs:43
- It performs an exact match check before falling back to `Ticker::get_by_fuzzy_matched_name`. examples/company_search.rs:58-78
company_show: Detailed Metadata
Purpose : Aggregates data from multiple SEC endpoints to build a comprehensive profile, including SIC codes, industry descriptions, and fiscal year details.
Usage :
Implementation:
- It fetches the CIK via `fetch_cik_by_ticker_symbol`. examples/company_show.rs:106
- It retrieves the core profile via `fetch_company_profile`. examples/company_show.rs:108
- It fetches the extended company description via `fetch_company_description`. examples/company_show.rs:109
Sources: examples/company_search.rs:1-83 examples/company_show.rs:1-150
Filing Retrieval and Rendering
filing_render: Generic URL Rendering
Purpose : Fetches any arbitrary SEC document URL and converts it to clean text using a specified FilingView.
Usage :
Implementation: The program uses `fetch_and_render`, passing either `MarkdownView` or `EmbeddingTextView`. examples/filing_render.rs:76-79
filing_show: Smart Part Selection
Purpose : Automatically finds the most recent filing of a specific type (e.g., 10-K, 8-K) and renders the body, exhibits, or both.
Usage :
Implementation:
- It uses `fetch_filings` to locate the target document. examples/filing_show.rs:97
- It calls `ops::render_filing`, which handles the complexity of identifying “substantive” exhibits (filtering out SOX certifications, XBRL schemas, and graphics). examples/filing_show.rs:115
Sources: examples/filing_render.rs:1-84 examples/filing_show.rs:1-158
Live Feeds and Historical Indexes
edgar_feed_poll: Real-time Delta Polling
Purpose : Monitors the SEC Atom feed for new filings. It supports “delta mode,” where it only shows filings strictly newer than a provided high-water mark timestamp.
Usage :
Implementation: The core logic resides in `fetch_edgar_feeds_since`, which handles pagination and timestamp filtering. examples/edgar_feed_poll.rs:157 It identifies special routes, such as earnings releases, by checking `is_earnings_release()` on `FeedEntry`. examples/edgar_feed_poll.rs:84
edgar_index_browse: Historical Backfills
Purpose : Accesses the quarterly master.idx files, which contain every filing since 1993.
Usage :
Implementation: It uses `fetch_edgar_master_index` to download and parse the pipe-delimited index file for the requested year and quarter. examples/edgar_index_browse.rs:113
Sources: examples/edgar_feed_poll.rs:1-209 examples/edgar_index_browse.rs:1-186
Code Entity Mapping
Diagram: Example Program CLI to Model Mapping
Sources: examples/cik_show.rs:33 examples/8k_list.rs:44 examples/edgar_feed_poll.rs:119-129 examples/edgar_index_browse.rs:98-99
Execution Patterns
All examples share a standard initialization sequence:
- Config Loading: `ConfigManager::load()` is called to resolve environment variables and `.toml` files. examples/config_show.rs:32
- Client Setup: `SecClient::from_config_manager(&config)` initializes the HTTP client with the required User-Agent headers and rate limiting. examples/company_show.rs:89
- Async Runtime: Examples use the `#[tokio::main]` attribute to manage asynchronous network calls. examples/cik_show.rs:30
Common CLI Pattern: Most examples use `clap` for argument parsing, providing consistent help menus and type validation. examples/filing_show.rs:54-69
Sources: examples/config_show.rs:26-37 examples/company_show.rs:83-100 examples/cik_show.rs:30-92
Rust sec-fetcher Application
Purpose and Scope
This page provides an architectural overview of the Rust sec-fetcher application, which is responsible for fetching financial data from the SEC EDGAR API and transforming it into structured formats. The application serves as the high-performance data collection and preprocessing layer in a larger system that combines Rust’s safety and speed for I/O operations with Python’s machine learning capabilities.
This page covers the high-level architecture, module organization, and data flow patterns. For detailed information about specific components, see:
Sources: src/lib.rs:1-12 src/network.rs:1-47
Application Architecture
The sec-fetcher application is built around a modular architecture that separates concerns into distinct layers: configuration, networking, data transformation, and storage. The core design principle is to fetch data from SEC APIs with robust error handling and caching, transform it into a standardized format (often using polars DataFrames), and output structured data for downstream consumption.
```mermaid
graph TB
subgraph "src/lib.rs Module Organization"
config["config\nConfigManager, AppConfig"]
enums["enums\nFundamentalConcept, Url\nCacheNamespacePrefix"]
models["models\nTicker, CikSubmission\nNportInvestment, AccessionNumber"]
network["network\nSecClient, fetch_* functions"]
ops["ops\nrender_filing, fetch_and_render"]
parsers["parsers\nXML/JSON parsing utilities"]
caches["caches\nInternal caching infrastructure"]
views["views\nMarkdownView, EmbeddingTextView"]
utils["utils\nVecExtensions, helpers"]
end
subgraph "External Dependencies"
reqwest["reqwest\nHTTP client"]
polars["polars\nDataFrame operations"]
simd["simd-r-drive\nDrive-based cache storage"]
tokio["tokio\nAsync runtime"]
serde["serde\nSerialization"]
end
config --> caches
network --> config
network --> caches
network --> models
network --> enums
network --> parsers
ops --> network
ops --> views
parsers --> models
network --> reqwest
network --> simd
network --> tokio
network --> polars
models --> serde
```
Module Structure
The application is organized into several core modules as declared in the library root:
| Module | Purpose | Key Components |
|---|---|---|
| `config` | Configuration management and credential handling | `ConfigManager`, `AppConfig` |
| `enums` | Type-safe enumerations for domain concepts | `FundamentalConcept`, `Url`, `CacheNamespacePrefix`, `FormType` |
| `models` | Data structures representing SEC entities | `Ticker`, `CikSubmission`, `NportInvestment`, `AccessionNumber` |
| `network` | HTTP client and data fetching functions | `SecClient`, `fetch_company_tickers`, `fetch_us_gaap_fundamentals` |
| `ops` | Higher-level business logic and workflows | `render_filing`, `fetch_and_render`, `diff_holdings` |
| `parsers` | XML/JSON parsing utilities | `parse_us_gaap_fundamentals`, `parse_cik_submissions_json` |
| `caches` | Internal caching infrastructure | `Caches` (singleton), HTTP cache, preprocessor cache |
| `views` | Rendering logic for filing data | `MarkdownView`, `EmbeddingTextView` |
| `normalize` | Data normalization and cleaning | `Pct` type, 13F normalization logic |
Sources: src/lib.rs:1-12 src/network.rs:1-47 src/network/fetch_us_gaap_fundamentals.rs:1-108
Data Flow Architecture
Request-Response Flow with Caching
The data flow follows a pipeline pattern:
- Request Initiation: High-level operations in `ops` or CLI binaries call specific fetching functions like `fetch_us_gaap_fundamentals`. src/network/fetch_us_gaap_fundamentals.rs:54-58
- Client Middleware: `SecClient` applies throttling and caching policies before making HTTP requests. src/network/sec_client.rs:1-10
- Cache Check: The system checks `simd-r-drive` storage for cached responses based on `CacheNamespacePrefix`.
- API Request: On a cache miss, the request is sent to the SEC EDGAR API (e.g., the `CompanyFacts` endpoint). src/network/fetch_us_gaap_fundamentals.rs:62-67
- Parsing: Raw JSON/XML is converted into structured models or DataFrames via the `parsers` module. src/network/fetch_us_gaap_fundamentals.rs:69
- Enrichment: Data is often cross-referenced; for example, fundamentals are joined with submission data to resolve primary document URLs. src/network/fetch_us_gaap_fundamentals.rs:74-105
Sources: src/network/fetch_us_gaap_fundamentals.rs:54-108 src/network.rs:1-47
Key Dependencies and Technology Stack
The application leverages modern Rust crates for performance and reliability:
| Category | Crate | Purpose |
|---|---|---|
| Async Runtime | tokio | Asynchronous I/O and task scheduling. |
| HTTP Client | reqwest | Underlying HTTP engine for SecClient. |
| Data Frames | polars | High-performance data manipulation, especially for US GAAP data. src/network/fetch_us_gaap_fundamentals.rs:9 |
| Caching | simd-r-drive | WebSocket-based key-value storage for persistent caching. |
| Serialization | serde | JSON/CSV serialization and deserialization. |
| XML Parsing | quick-xml | Fast parsing for SEC XML filings (13F, N-PORT, Form 4). |
Sources: src/network/fetch_us_gaap_fundamentals.rs:1-10 src/lib.rs:1-12
Module Interaction Patterns
US GAAP Data Retrieval Example
The interaction between modules is best exemplified by the US GAAP fundamentals retrieval process:
- Network Module: `fetch_us_gaap_fundamentals` is called. src/network/fetch_us_gaap_fundamentals.rs:54
- Models Module: It uses `Cik::get_company_cik_by_ticker_symbol` to resolve the ticker. src/network/fetch_us_gaap_fundamentals.rs:60
- Enums Module: It constructs the target URL using `Url::CompanyFacts`. src/network/fetch_us_gaap_fundamentals.rs:62
- Parsers Module: It delegates the raw JSON to `parsers::parse_us_gaap_fundamentals`. src/network/fetch_us_gaap_fundamentals.rs:69
- Network (Sub-call): It calls `fetch_cik_submissions` to enrich the data with filing URLs. src/network/fetch_us_gaap_fundamentals.rs:74
Sources: src/network/fetch_us_gaap_fundamentals.rs:54-108
Error Handling Strategy
The application uses a layered error handling approach:
- Network Layer : Handles transient HTTP errors and rate limiting via retries and throttling.
- Parsing Layer: Returns specific error types (e.g., `CikError`, `AccessionNumberError`) when SEC data doesn’t match expected formats.
- Operations Layer: Often implements “non-fatal” logic, where a failure to fetch secondary data (like submissions for URL enrichment) results in a warning rather than a process crash src/network/fetch_us_gaap_fundamentals.rs:101-105
Sources: src/network/fetch_us_gaap_fundamentals.rs:101-105 src/models.rs:1-5
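The “non-fatal” pattern described above can be sketched in isolation. This is an illustrative, self-contained example, not the crate’s code: `fetch_submissions` is a hypothetical stand-in for the real `fetch_cik_submissions` call, and the enrichment logic is simplified to show only the warn-and-continue behavior.

```rust
// Hypothetical sketch of the "non-fatal" enrichment pattern: a failure in a
// secondary fetch downgrades to a warning instead of aborting the pipeline.
fn fetch_submissions(cik: u64) -> Result<Vec<String>, String> {
    // Simulate a transient failure for one CIK.
    if cik == 0 {
        Err("submissions endpoint returned 404".to_string())
    } else {
        Ok(vec![format!("doc-{cik}.htm")])
    }
}

/// Returns the filing URLs if available; an error only produces a warning.
fn enrich_with_submissions(cik: u64) -> Vec<String> {
    match fetch_submissions(cik) {
        Ok(urls) => urls,
        Err(e) => {
            eprintln!("warning: could not enrich CIK {cik}: {e}");
            Vec::new() // primary data remains usable without filing URLs
        }
    }
}

fn main() {
    assert_eq!(enrich_with_submissions(320193), vec!["doc-320193.htm"]);
    assert!(enrich_with_submissions(0).is_empty()); // warns, does not panic
}
```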
Network Layer & SecClient
Relevant source files
- src/network.rs
- src/network/fetch_cik_submissions.rs
- src/network/sec_client.rs
- src/utils.rs
- tests/sec_client_tests.rs
Purpose and Scope
This page documents the network layer of the rust-sec-fetcher application, specifically the SecClient HTTP client and its associated infrastructure. The SecClient provides the foundational HTTP communication layer for all SEC EDGAR API interactions, implementing throttling, caching, and retry logic to ensure reliable and compliant data fetching.
This page covers:
- The `SecClient` structure and initialization src/network/sec_client.rs:14-22
- Throttling and rate limiting policies src/network/sec_client.rs:65-88
- HTTP caching and preprocessor cache integration src/network/sec_client.rs:58-63 src/network/sec_client.rs:90-91
- Request/response handling, including cache bypass src/network/sec_client.rs:159-184 src/network/sec_client.rs:186-214
- User-Agent management and email validation src/network/sec_client.rs:111-123
For information about the specific network fetching functions that use SecClient, see Data Fetching Functions. For details on the caching system architecture, see Caching & Storage System.
SecClient Architecture Overview
Component Diagram
Sources: src/network/sec_client.rs:14-108 src/network/sec_client.rs:159-214
SecClient Structure and Initialization
SecClient Fields
The SecClient struct maintains core components for networking and caching:
| Field | Type | Purpose |
|---|---|---|
| email | String | Contact email for the SEC User-Agent header src/network/sec_client.rs:15 |
| http_client | ClientWithMiddleware | reqwest client with middleware stack src/network/sec_client.rs:18 |
| cache_policy | Arc&lt;CachePolicy&gt; | Shared cache configuration src/network/sec_client.rs:19 |
| throttle_policy | Arc&lt;ThrottlePolicy&gt; | Shared throttle configuration src/network/sec_client.rs:20 |
| preprocessor_cache | Arc&lt;DataStore&gt; | Cache for processed/transformed data src/network/sec_client.rs:21 |
Sources: src/network/sec_client.rs:14-22
Construction from ConfigManager
The from_config_manager() constructor performs the following initialization sequence:
- Extract Configuration: Reads `email`, `app_name`, `app_version`, `max_concurrent`, `min_delay_ms`, and `max_retries` from `AppConfig` src/network/sec_client.rs:31-56
- Create CachePolicy: Configures the cache with a hardcoded 1-week TTL and disables respect for HTTP cache headers src/network/sec_client.rs:58-63
- Create ThrottlePolicy: Configures rate limiting based on `max_concurrent` and `min_delay_ms`, with 500 ms of adaptive jitter src/network/sec_client.rs:82-88
- Initialize Caches: Retrieves both the HTTP and preprocessor cache stores from the `ConfigManager` src/network/sec_client.rs:90-91
- Build Middleware Stack: Combines the `DataStore` with the policies using `reqwest_drive` helpers src/network/sec_client.rs:93-98
Sources: src/network/sec_client.rs:26-109
Throttle Policy & Compliance
The SEC’s public guidance for EDGAR states a maximum request rate of 10 requests/second. `SecClient` chooses conservative defaults of `max_concurrent=1` and `min_delay_ms=500` (~2 requests/second) src/network/sec_client.rs:67-81
| Parameter | AppConfig Field | Default | Purpose |
|---|---|---|---|
| base_delay_ms | min_delay_ms | 500 | Minimum delay between requests src/network/sec_client.rs:83 |
| max_concurrent | max_concurrent | 1 | Concurrent request limit src/network/sec_client.rs:84 |
| max_retries | max_retries | 3 | Retry attempt limit src/network/sec_client.rs:85 |
| adaptive_jitter_ms | N/A | 500 | Randomized delay for retry backoff src/network/sec_client.rs:87 |
Sources: src/network/sec_client.rs:65-88
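The `base_delay_ms` behavior can be sketched with a minimal blocking throttle. This is an assumption-laden illustration only: the real `ThrottlePolicy` is asynchronous and adds adaptive jitter; this sketch shows just the minimum-delay floor with `max_concurrent = 1`.

```rust
use std::time::{Duration, Instant};

/// Minimal sketch of a min-delay throttle (the crate's real ThrottlePolicy is
/// async and also applies adaptive jitter; neither is modeled here).
struct Throttle {
    base_delay: Duration,
    last_request: Option<Instant>,
}

impl Throttle {
    fn new(base_delay_ms: u64) -> Self {
        Self {
            base_delay: Duration::from_millis(base_delay_ms),
            last_request: None,
        }
    }

    /// Blocks until at least `base_delay` has elapsed since the last call.
    fn wait(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.base_delay {
                std::thread::sleep(self.base_delay - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    let mut throttle = Throttle::new(500); // ~2 requests/second
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait(); // a real client would issue the HTTP request here
    }
    // Three calls with a 500 ms floor take at least 1 second in total.
    assert!(start.elapsed() >= Duration::from_millis(1000));
}
```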
Request Flow and Caching
raw_request vs raw_request_bypass_cache
The client provides two primary ways to interact with the network:
- `raw_request`: The standard path. It uses the `http_client`, which includes the `CacheMiddleware`; requests are served from the on-disk cache if available and within TTL src/network/sec_client.rs:159-184
- `raw_request_bypass_cache`: Uses the `CacheBypass` extension to force a network fetch. This skips both reading from and writing to the cache while still respecting the `ThrottlePolicy` src/network/sec_client.rs:186-214
Request Pipeline
Sources: src/network/sec_client.rs:159-224 src/network/sec_client.rs:93-98
User-Agent Management
The `get_user_agent()` method generates a compliant User-Agent string as required by the SEC, validating the email format on every call src/network/sec_client.rs:111-123
Format: `{app_name}/{app_version} (+{email})`
Example: `sec-fetcher/0.1.0 (+user@example.com)`
If the email provided in the configuration is invalid according to `EmailAddress::is_valid`, the client panics src/network/sec_client.rs:115-118
Sources: src/network/sec_client.rs:111-123 tests/sec_client_tests.rs:7-23
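The documented format can be sketched as a small helper. The format string below comes from the docs; the crude email check is a placeholder assumption standing in for the real `EmailAddress::is_valid` validation, and only the panic-on-invalid behavior mirrors the client.

```rust
/// Sketch of SEC-compliant User-Agent construction following the documented
/// format `{app_name}/{app_version} (+{email})`. The trivial email check is a
/// placeholder for the EmailAddress crate's real validation.
fn build_user_agent(app_name: &str, app_version: &str, email: &str) -> String {
    // Crude placeholder validation: panic on clearly malformed addresses,
    // mirroring the client's panic-on-invalid-email behavior.
    let valid = email.contains('@') && email.contains('.') && !email.contains(' ');
    if !valid {
        panic!("invalid contact email for SEC User-Agent: {email}");
    }
    format!("{app_name}/{app_version} (+{email})")
}

fn main() {
    let ua = build_user_agent("sec-fetcher", "0.1.0", "user@example.com");
    assert_eq!(ua, "sec-fetcher/0.1.0 (+user@example.com)");
}
```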
Fetch and Render Pipeline
The fetch_and_render function (exported in src/network.rs) provides a high-level orchestration for retrieving filing documents and converting them to specific views (e.g., Markdown).
- Fetch: Retrieves the raw document from EDGAR using `SecClient`.
- Parse: Normalizes the document structure.
- Render: Applies a `View` transformation (like `MarkdownView`).
Sources: src/network.rs:45-46 src/network/fetch_and_render.rs:1-10
Testing Infrastructure
SecClient logic is verified in tests/sec_client_tests.rs using mockito to simulate SEC API responses.
| Test Case | Code Entity | Purpose |
|---|---|---|
| test_user_agent | SecClient::get_user_agent | Verifies UA string construction tests/sec_client_tests.rs:7-23 |
| test_invalid_email_panic | SecClient::get_user_agent | Ensures panic on malformed email tests/sec_client_tests.rs:105-117 |
| test_fetch_json_with_retry_failure | SecClient::fetch_json | Verifies max_retries enforcement tests/sec_client_tests.rs:194-222 |
| test_fetch_json_with_retry_backoff | SecClient::fetch_json | Validates recovery after 500 error tests/sec_client_tests.rs:225-233 |
Sources: tests/sec_client_tests.rs:1-233
CLI Binaries
Relevant source files
- src/bin/check_form_type_coverage.rs
- src/bin/pulls/fund_holdings.rs
- src/bin/pulls/us_gaap_bulk.rs
- src/bin/refresh_test_fixtures.rs
This page documents the standalone binary programs provided by the rust-sec-fetcher crate. These tools are located in src/bin/ and serve specialized purposes ranging from test fixture maintenance and enum validation to bulk data extraction for machine learning pipelines.
1. refresh-test-fixtures
The refresh-test-fixtures utility automates the retrieval of real SEC EDGAR data to serve as test fixtures. It ensures that integration tests operate against authentic, version-controlled data without requiring live network access during test execution.
Purpose and Usage
This binary should be run whenever new test cases are added or when existing fixtures need to be updated to reflect modern EDGAR schema changes (e.g., the 2023 change in 13F value reporting).
Implementation Details
The program iterates through a hardcoded manifest of `Fixture` structs src/bin/refresh_test_fixtures.rs:55-63. Each fixture defines a `TickerSymbol`, an output filename, and a `FixtureKind`, which determines the specific SEC endpoint to hit.
Key Components:
- FixtureKind: An enum specifying the target data: `Submissions`, `CompanyFacts`, `EightKPrimary`, `EightKFirstHtmlExhibit`, or `ThirteenF` (with a specific accession number) src/bin/refresh_test_fixtures.rs:65-88
- Data Flow: The program resolves the ticker to a CIK src/bin/refresh_test_fixtures.rs:205, fetches the raw bytes from the SEC via `SecClient`, and saves them as Gzip-compressed files in `tests/fixtures/` src/bin/refresh_test_fixtures.rs:197-230
Fixture Generation Flow
Sources: src/bin/refresh_test_fixtures.rs:90-173 src/bin/refresh_test_fixtures.rs:178-240
2. check-form-type-coverage
This binary validates the completeness of the FormType enum against actual data in the EDGAR Master Index. It performs both “Forward” and “Reverse” coverage checks.
Coverage Logic
- Forward Check: Ensures every variant defined in the `FormType` enum (that isn’t marked as retired) actually appears in recent SEC filings src/bin/check_form_type_coverage.rs:16-19
- Reverse Check: Identifies any form types appearing frequently in the most recent quarter (above `MINIMUM_FILINGS_THRESHOLD`) that are not currently represented in the enum src/bin/check_form_type_coverage.rs:20-22
Usage
Technical Implementation
The program calculates the `last_completed_quarter` src/bin/check_form_type_coverage.rs:51-61 and scans backwards up to `MAX_LOOKBACK_QUARTERS` (default 8) src/bin/check_form_type_coverage.rs:34. It uses `fetch_edgar_master_index` to retrieve the list of all filings for those periods src/bin/check_form_type_coverage.rs:110
Sources: src/bin/check_form_type_coverage.rs:1-40 src/bin/check_form_type_coverage.rs:72-146
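The last-completed-quarter calculation can be sketched as plain quarter arithmetic. This is an illustration under assumptions; the binary’s actual date handling (source of “today,” edge cases) may differ.

```rust
/// Sketch of deriving the last completed calendar quarter from a (year, month)
/// pair. The actual binary's date handling may differ; this only illustrates
/// the quarter arithmetic.
fn last_completed_quarter(year: i32, month: u32) -> (i32, u32) {
    let current_quarter = (month - 1) / 3 + 1; // 1..=4
    if current_quarter == 1 {
        (year - 1, 4) // Q1 in progress means Q4 of last year just completed
    } else {
        (year, current_quarter - 1)
    }
}

fn main() {
    assert_eq!(last_completed_quarter(2024, 2), (2023, 4)); // February → prior Q4
    assert_eq!(last_completed_quarter(2024, 7), (2024, 2)); // July → Q2
    assert_eq!(last_completed_quarter(2024, 12), (2024, 3)); // December → Q3
}
```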
3. pull-us-gaap-bulk
The pull-us-gaap-bulk binary is the primary data ingestion tool for the narrative_stack ML pipeline. It performs a massive sweep of all primary-listed companies and extracts their XBRL fundamentals.
Purpose and Usage
It fetches CompanyFacts for every ticker and flattens the complex JSON structure into a tabular CSV format suitable for training autoencoders or dimensionality reduction models.
Data Flow and Constraints
- Primary Listings Only: It calls `fetch_company_tickers(client, false)` to exclude warrants, units, and preferred shares, preventing duplicate data for the same CIK src/bin/pulls/us_gaap_bulk.rs:54-55
- Transformation: It uses `fetch_us_gaap_fundamentals`, which converts the SEC’s hierarchical XBRL data into a Polars `DataFrame` src/bin/pulls/us_gaap_bulk.rs:72-73
- Persistence: Data is written using `polars::prelude::CsvWriter` src/bin/pulls/us_gaap_bulk.rs:78-81
Sources: src/bin/pulls/us_gaap_bulk.rs:1-33 src/bin/pulls/us_gaap_bulk.rs:45-95
4. pull-fund-holdings
This binary targets the investment management domain, specifically fetching N-PORT holdings for all registered investment companies (ETFs, Mutual Funds).
Purpose and Usage
It iterates through the SEC’s investment company dataset, finds the latest N-PORT-P (monthly portfolio holdings) filing for each fund, and exports the holdings to CSV.
Implementation Logic
- Dataset Retrieval: Calls `fetch_investment_company_series_and_class_dataset` to get the master list of funds src/bin/pulls/fund_holdings.rs:74-75
- Filing Discovery: For each fund ticker, it resolves the CIK and calls `fetch_nport_filings` to find the most recent submission src/bin/pulls/fund_holdings.rs:96-114
- Holdings Extraction: It fetches the specific N-PORT XML, parses the investment table, and normalizes the data src/bin/pulls/fund_holdings.rs:126-134
- Partitioned Storage: To avoid directories with tens of thousands of files, it organizes output by the first letter of the ticker (e.g., `data/fund-holdings/S/SPY.csv`) src/bin/pulls/fund_holdings.rs:137-155
Fund Processing Pipeline
Sources: src/bin/pulls/fund_holdings.rs:1-38 src/bin/pulls/fund_holdings.rs:74-157
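The first-letter partitioning scheme can be sketched as a small path helper. The `data/fund-holdings/S/SPY.csv` layout comes from the docs above; the fallback bucket for non-alphabetic tickers is an assumption of this sketch, not necessarily the binary’s behavior.

```rust
use std::path::PathBuf;

/// Sketch of first-letter output partitioning, producing paths like
/// data/fund-holdings/S/SPY.csv. The "_" bucket for tickers that do not start
/// with an ASCII letter is an assumption for illustration.
fn holdings_csv_path(base: &str, ticker: &str) -> PathBuf {
    let bucket = ticker
        .chars()
        .next()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_uppercase().to_string())
        .unwrap_or_else(|| "_".to_string());
    PathBuf::from(base).join(bucket).join(format!("{ticker}.csv"))
}

fn main() {
    let p = holdings_csv_path("data/fund-holdings", "SPY");
    assert_eq!(p, PathBuf::from("data/fund-holdings/S/SPY.csv"));
}
```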
Data Fetching Functions
Relevant source files
- examples/fuzzy_match_company.rs
- src/lib.rs
- src/models/nport_investment.rs
- src/models/ticker.rs
- src/network/fetch_cik_by_ticker_symbol.rs
- src/network/fetch_company_description.rs
- src/network/fetch_company_tickers.rs
- src/network/fetch_edgar_feed.rs
- src/network/fetch_investment_company_series_and_class_dataset.rs
- src/network/fetch_sic_codes.rs
- src/network/fetch_us_gaap_fundamentals.rs
Purpose and Scope
This document describes the data fetching functions in the Rust sec-fetcher application. These functions provide the interface for retrieving financial data from the SEC EDGAR API, including company tickers, CIK lookups, submissions, company descriptions, SIC codes, EDGAR master index, NPORT filings, US GAAP fundamentals, and investment company datasets.
For information about the underlying HTTP client, throttling, and caching infrastructure, see [3.1 Network Layer & SecClient](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/3.1 Network Layer & SecClient) For details on how US GAAP data is transformed after fetching, see [3.4 US GAAP Concept Transformation](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/3.4 US GAAP Concept Transformation) For information about the data structures returned by these functions, see [3.5 Data Models & Enumerations](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/3.5 Data Models & Enumerations)
Overview of Data Fetching Architecture
The network module provides specialized fetching functions that retrieve different types of financial data from the SEC EDGAR API. Each function accepts a SecClient reference and returns structured data types or Polars DataFrames.
Data Flow: Natural Language to Code Entity Space
The following diagram maps high-level data requirements to specific Rust functions and their corresponding SEC EDGAR endpoints.
Diagram: Mapping Data Requirements to Code Entities
graph TB
subgraph "Data Requirement (Natural Language)"
ReqTickers["'Get all stock symbols'"]
ReqCIK["'Find CIK for AAPL'"]
ReqDesc["'What does this company do?'"]
ReqGAAP["'Get revenue and net income'"]
ReqFeed["'What was filed today?'"]
ReqFunds["'Search for mutual fund CIKs'"]
end
subgraph "Code Entity Space (Functions)"
fetch_company_tickers["fetch_company_tickers()\nsrc/network/fetch_company_tickers.rs"]
fetch_cik_by_ticker_symbol["fetch_cik_by_ticker_symbol()\nsrc/network/fetch_cik_by_ticker_symbol.rs"]
fetch_company_description["fetch_company_description()\nsrc/network/fetch_company_description.rs"]
fetch_us_gaap_fundamentals["fetch_us_gaap_fundamentals()\nsrc/network/fetch_us_gaap_fundamentals.rs"]
parse_edgar_atom_feed["parse_edgar_atom_feed()\nsrc/network/fetch_edgar_feed.rs"]
fetch_investment_company_series_and_class_dataset["fetch_investment_company_series_and_class_dataset()\nsrc/network/fetch_investment_company_series_and_class_dataset.rs"]
end
subgraph "EDGAR API Endpoints (Url Enum)"
Url_CompanyTickersJson["Url::CompanyTickersJson"]
Url_CompanyFacts["Url::CompanyFacts(cik)"]
Url_CikAccessionDocument["Url::CikAccessionDocument"]
Url_InvestmentCompanySeriesAndClassDataset["Url::InvestmentCompanySeriesAndClassDataset(year)"]
end
ReqTickers --> fetch_company_tickers
ReqCIK --> fetch_cik_by_ticker_symbol
ReqDesc --> fetch_company_description
ReqGAAP --> fetch_us_gaap_fundamentals
ReqFeed --> parse_edgar_atom_feed
ReqFunds --> fetch_investment_company_series_and_class_dataset
fetch_company_tickers --> Url_CompanyTickersJson
fetch_us_gaap_fundamentals --> Url_CompanyFacts
fetch_company_description --> Url_CikAccessionDocument
fetch_investment_company_series_and_class_dataset --> Url_InvestmentCompanySeriesAndClassDataset
Sources: src/network/fetch_company_tickers.rs:58-61 src/network/fetch_cik_by_ticker_symbol.rs:53-56 src/network/fetch_company_description.rs:35-38 src/network/fetch_us_gaap_fundamentals.rs:54-58 src/network/fetch_investment_company_series_and_class_dataset.rs:43-45 src/network/fetch_edgar_feed.rs:118
Company Ticker and CIK Resolution
Function: fetch_company_tickers
Retrieves operating-company equity tickers. It supports merging primary listings with derived instruments (warrants, units, preferreds) src/network/fetch_company_tickers.rs:8-18
- Primary Source: `company_tickers.json` src/network/fetch_company_tickers.rs:24-25
- Derived Source: `ticker.txt` (if `include_derived_instruments` is true) src/network/fetch_company_tickers.rs:27-32
- Backfilling: Uses the CIK-to-name mapping from the JSON source to enrich text-only derived entries src/network/fetch_company_tickers.rs:84-88
Function: fetch_cik_by_ticker_symbol
Resolves a ticker to a 10-digit CIK. It prioritizes operating companies before falling back to the investment company dataset for mutual funds/ETFs src/network/fetch_cik_by_ticker_symbol.rs:27-36
Fuzzy Matching: Ticker::get_by_fuzzy_matched_name
Performs weighted token-overlap matching to resolve company names to tickers src/models/ticker.rs:38-42. It uses a `TOKEN_MATCH_THRESHOLD` of 0.6 and applies boosts for exact matches and common stock src/models/ticker.rs:27-32
Sources: src/network/fetch_company_tickers.rs:58-61 src/network/fetch_cik_by_ticker_symbol.rs:53-72 src/models/ticker.rs:35-136
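Token-overlap scoring of this kind can be sketched with std collections only. The 0.6 threshold mirrors the documented `TOKEN_MATCH_THRESHOLD`; the tokenization, weighting, and scoring below are simplifying assumptions, not the crate’s exact algorithm (which also applies exact-match and common-stock boosts).

```rust
use std::collections::HashSet;

/// Sketch of token-overlap scoring between a query name and a candidate
/// company name: the fraction of query tokens found in the candidate.
fn token_overlap_score(query: &str, candidate: &str) -> f64 {
    let tokenize = |s: &str| -> HashSet<String> {
        s.to_lowercase()
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(String::from)
            .collect()
    };
    let q = tokenize(query);
    let c = tokenize(candidate);
    if q.is_empty() || c.is_empty() {
        return 0.0;
    }
    let overlap = q.intersection(&c).count() as f64;
    overlap / q.len() as f64
}

fn main() {
    const TOKEN_MATCH_THRESHOLD: f64 = 0.6; // mirrors the documented constant
    assert!(token_overlap_score("Apple Inc", "Apple Inc.") >= TOKEN_MATCH_THRESHOLD);
    assert!(token_overlap_score("Apple Inc", "Microsoft Corporation") < TOKEN_MATCH_THRESHOLD);
}
```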
Company Profiles and Descriptions
Function: fetch_company_description
Extracts the “Item 1. Business” section from a company’s most recent 10-K filing src/network/fetch_company_description.rs:11-13
Implementation Strategy :
- Locates all occurrences of “Item 1” and “Item 1A” using regex src/network/fetch_company_description.rs:81-82
- Identifies the “real” section by finding the largest HTML byte gap between an Item 1 and its subsequent Item 1A (avoiding Table of Contents entries) src/network/fetch_company_description.rs:90-96
- Uses `html2text` for tag stripping and entity decoding src/network/fetch_company_description.rs:103-105
- Skips short heading lines and truncates at a sentence boundary near 800 characters src/network/fetch_company_description.rs:109-133
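The largest-gap heuristic can be sketched over match offsets. This is an illustrative simplification: offsets stand in for regex match positions in the filing HTML, and the real implementation’s tie-breaking and bounds handling may differ.

```rust
/// Sketch of the "largest gap" heuristic: among all (Item 1, Item 1A) offset
/// pairs, the pair with the widest byte gap is taken as the real Business
/// section, since Table of Contents entries sit close together.
fn find_business_section(
    item1_offsets: &[usize],
    item1a_offsets: &[usize],
) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    for &start in item1_offsets {
        // First "Item 1A" occurring after this "Item 1".
        if let Some(&end) = item1a_offsets.iter().find(|&&e| e > start) {
            if best.map_or(true, |(bs, be)| end - start > be - bs) {
                best = Some((start, end));
            }
        }
    }
    best
}

fn main() {
    // First pair is a Table of Contents entry (tiny gap); second is the body.
    let item1 = [1_000, 50_000];
    let item1a = [1_080, 180_000];
    assert_eq!(find_business_section(&item1, &item1a), Some((50_000, 180_000)));
}
```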
Function: fetch_sic_codes
Fetches the complete SEC SIC code list from `siccodes.htm` src/network/fetch_sic_codes.rs:11-15. It parses the HTML table rows into `SicCode` models containing the code, office, and industry title src/network/fetch_sic_codes.rs:63-71
Sources: src/network/fetch_company_description.rs:35-68 src/network/fetch_sic_codes.rs:33-45
US GAAP Fundamentals
Function: fetch_us_gaap_fundamentals
Fetches all XBRL-tagged financial data for a company as a structured DataFrame src/network/fetch_us_gaap_fundamentals.rs:11-12
Data Flow :
- Resolves the ticker to a CIK src/network/fetch_us_gaap_fundamentals.rs:60
- Fetches JSON from the `companyfacts` endpoint src/network/fetch_us_gaap_fundamentals.rs:67
- Accession Resolution: Calls `fetch_cik_submissions` to map accession numbers to primary document URLs (e.g., “aapl-20241228.htm”) src/network/fetch_us_gaap_fundamentals.rs:74-81
- Updates the `filing_url` column in the resulting DataFrame src/network/fetch_us_gaap_fundamentals.rs:99
Sources: src/network/fetch_us_gaap_fundamentals.rs:54-108
Investment Company Datasets
Function: fetch_investment_company_series_and_class_dataset
Retrieves the annual CSV of registered investment company share classes src/network/fetch_investment_company_series_and_class_dataset.rs:22-32
Year Fallback and Caching :
- It attempts to fetch the current year’s file. If it receives a 404, it decrements the year and retries until it finds a valid file (minimum year 2024) src/network/fetch_investment_company_series_and_class_dataset.rs:39-66
- The successful year is stored in the `preprocessor_cache` with a 1-week TTL to avoid repeated 404s src/network/fetch_investment_company_series_and_class_dataset.rs:68-73
Diagram: Investment Company Dataset Fetch Sequence
Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:43-80 src/network/fetch_investment_company_series_and_class_dataset.rs:92-117
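The year-fallback loop can be sketched with the HTTP probe abstracted away. This is an assumption-based illustration: `probe` stands in for the status check the real function performs through `SecClient`, and caching of the resolved year is only noted in a comment.

```rust
/// Sketch of the year-fallback loop: try the current year's dataset URL and,
/// on a 404, step back a year until a file is found or the floor year is hit.
fn resolve_dataset_year(
    current_year: u32,
    floor_year: u32,
    probe: impl Fn(u32) -> bool, // stand-in for the HTTP existence check
) -> Option<u32> {
    let mut year = current_year;
    while year >= floor_year {
        if probe(year) {
            return Some(year); // the real code caches this year for a week
        }
        year -= 1; // 404: the dataset for this year is not published yet
    }
    None
}

fn main() {
    // Simulate: only the 2024 file exists so far.
    let exists = |year: u32| year == 2024;
    assert_eq!(resolve_dataset_year(2025, 2024, exists), Some(2024));
    assert_eq!(resolve_dataset_year(2025, 2024, |_| false), None);
}
```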
Real-time Feeds
Function: parse_edgar_atom_feed
Parses the EDGAR Atom XML feed into `FeedEntry` items src/network/fetch_edgar_feed.rs:112-118
- Extraction Logic: Uses regex to pull the CIK from the archive URL src/network/fetch_edgar_feed.rs:70-74, the company name from the title src/network/fetch_edgar_feed.rs:78-82, and 8-K item codes from the summary text src/network/fetch_edgar_feed.rs:93-97
- HTML Handling: Strips HTML tags from the summary to extract the filing date and item codes src/network/fetch_edgar_feed.rs:86-91
Sources: src/network/fetch_edgar_feed.rs:46-109 src/network/fetch_edgar_feed.rs:118-155
Filing Retrieval & Rendering
Relevant source files
- examples/ipo_list.rs
- examples/ipo_show.rs
- src/network/filings/fetch_10k.rs
- src/network/filings/fetch_10q.rs
- src/network/filings/fetch_13f.rs
- src/network/filings/fetch_8k.rs
- src/network/filings/fetch_def14a.rs
- src/network/filings/fetch_filings.rs
- src/network/filings/fetch_form4.rs
- src/network/filings/fetch_nport.rs
- src/network/filings/fetch_s1.rs
- src/network/filings/fetch_s2.rs
- src/network/filings/fetch_s3.rs
- src/network/filings/fetch_schedule_13d.rs
- src/network/filings/fetch_schedule_13g.rs
- src/network/filings/filing_index.rs
- src/network/filings/mod.rs
- src/views/embedding_text_view.rs
- src/views/html_helpers.rs
- src/views/markdown_view.rs
Purpose and Scope
This document details the filings submodule within the network layer, responsible for retrieving and parsing form-specific SEC filings. It covers the implementation of specialized fetchers for major form types (10-K, 10-Q, 8-K, 13F, Form 4, N-PORT, etc.), the FilingIndex parser for exploring filing archives, IPO feed polling, and the views system for rendering document content into Markdown or embedding-optimized text.
Filing Retrieval Architecture
The filing retrieval system is built on top of the SecClient and CikSubmission models. It provides a tiered approach: generic filing retrieval, form-specific convenience functions, and deep-parsing functions that extract structured data (like XML-based holdings) from within a filing.
Diagram: Filing Retrieval Logic Flow
graph TD
subgraph "Code Entity Space: Filing Retrieval"
FetchFilings["fetch_filings()\nsrc/network/filings/fetch_filings.rs"]
SpecificFetch["fetch_10k_filings()\nfetch_8k_filings()\n..."]
DeepFetch["fetch_13f()\nfetch_form4()\nfetch_nport()"]
FilingIndex["fetch_filing_index()\nsrc/network/filings/filing_index.rs"]
end
subgraph "Natural Language Space: SEC Concepts"
Submissions["Company Submissions\n(submissions/CIK.json)"]
Archive["EDGAR Archive Directory\n(data/CIK/Accession/)"]
PrimaryDoc["Primary Document\n(10-K HTML, 13F XML)"]
Exhibits["Exhibits & Supporting Docs\n(EX-99.1, InfoTable.xml)"]
end
FetchFilings -->|Filters| Submissions
SpecificFetch -->|Wraps| FetchFilings
DeepFetch -->|Parses| PrimaryDoc
DeepFetch -->|Uses| FilingIndex
FilingIndex -->|Scrapes| Archive
Archive --> Exhibits
Sources: src/network/filings/mod.rs:1-29 src/network/filings/fetch_filings.rs:67-77 src/network/filings/filing_index.rs:108-114
Form-Specific Fetchers
The library provides dedicated functions for common SEC forms. These functions encapsulate form-specific logic, such as including historical variants (e.g., 10-K405) or handling amendments.
| Function | Form Type(s) | Key Features |
|---|---|---|
| fetch_10k_filings | 10-K, 10-K405 | Returns comprehensive annual reports; re-sorts mixed types newest-first. src/network/filings/fetch_10k.rs:59-77 |
| fetch_10q_filings | 10-Q | Returns quarterly reports (Q1-Q3). src/network/filings/fetch_10q.rs:43-52 |
| fetch_8k_filings | 8-K | Returns material event notifications. src/network/filings/fetch_8k.rs:54-63 |
| fetch_13f_filings | 13F-HR | Returns institutional holdings report metadata. src/network/filings/fetch_13f.rs:32-41 |
| fetch_form4_filings | 4, 4/A | Returns insider trading reports and amendments. src/network/filings/fetch_form4.rs:33-49 |
| fetch_def14a_filings | DEF 14A | Returns definitive proxy statements for shareholder meetings. src/network/filings/fetch_def14a.rs:57-66 |
| fetch_nport_filings | NPORT-P | Returns monthly portfolio holdings for registered funds. src/network/filings/fetch_nport.rs:35-44 |
Sources: src/network/filings/fetch_10k.rs:59-77 src/network/filings/fetch_13f.rs:32-41 src/network/filings/fetch_form4.rs:33-49
Filing Index and Deep Parsing
While most filings are identified by a primary document, many (like 13F or 8-K) contain critical data in secondary files or require XML parsing of the primary document.
Filing Index Parser
The fetch_filing_index function scrapes the EDGAR -index.htm page to discover all files associated with an accession number. This is necessary because the submissions.json API only points to the “Primary Document,” which may not be the file containing the raw data (e.g., a 13F’s informationTable.xml).
- Implementation: Uses regex to parse the HTML table on the index page src/network/filings/filing_index.rs:9-12
- Normalization: Automatically strips iXBRL viewer prefixes (`/ix?doc=`) to find the actual file path src/network/filings/filing_index.rs:44-51
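The prefix normalization can be sketched as a string transformation. The `/ix?doc=` marker comes from the docs above; the exact prefix handling in the crate (and the example path, which is fabricated for illustration) may differ.

```rust
/// Sketch of iXBRL viewer-prefix normalization: EDGAR index pages may link
/// documents through the inline viewer ("/ix?doc=/Archives/..."), and the raw
/// file path has to be recovered before fetching.
fn strip_ixbrl_prefix(href: &str) -> &str {
    match href.split_once("/ix?doc=") {
        Some((_, path)) => path, // keep everything after the viewer marker
        None => href,            // plain links pass through unchanged
    }
}

fn main() {
    // Hypothetical path, for illustration only.
    let viewer = "/ix?doc=/Archives/edgar/data/0000000000/example-10k.htm";
    assert_eq!(
        strip_ixbrl_prefix(viewer),
        "/Archives/edgar/data/0000000000/example-10k.htm"
    );
    assert_eq!(strip_ixbrl_prefix("/Archives/x.htm"), "/Archives/x.htm");
}
```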
Deep Data Extraction
Several functions go beyond metadata to return structured Rust models:
- `fetch_13f`: Uses the `FilingIndex` to find the “INFORMATION TABLE” XML, then parses it into `ThirteenfHolding` objects src/network/filings/fetch_13f.rs:77-107
- `fetch_form4`: Strips XSLT prefixes from the primary document path to fetch raw XML and parses it into `Form4Transaction` objects src/network/filings/fetch_form4.rs:83-109
- `fetch_nport`: Fetches the primary XML and enriches it with ticker symbols from `fetch_company_tickers` src/network/filings/fetch_nport.rs:73-97
Sources: src/network/filings/filing_index.rs:23-76 src/network/filings/fetch_13f.rs:81-93 src/network/filings/fetch_form4.rs:91-102
IPO Feed Polling
The system includes specialized logic for tracking Initial Public Offerings (IPOs) via the EDGAR Atom feed.
Diagram: IPO Feed and Registration Lifecycle
graph LR
subgraph "Code Entity Space: IPO Tracking"
GetIPOFeed["get_ipo_feed_entries()\nsrc/ops/ipo_ops.rs"]
GetIPOReg["get_ipo_registration_filings()\nsrc/ops/ipo_ops.rs"]
FormType["FormType::IPO_REGISTRATION_FORM_TYPES\nsrc/enums/form_type.rs"]
end
subgraph "Natural Language Space: IPO Lifecycle"
S1["S-1 / F-1\n(Initial Registration)"]
S1A["S-1/A / F-1/A\n(Amendments)"]
B4["424B4\n(Final Pricing)"]
AtomFeed["EDGAR Live Feed\n(Polling)"]
end
GetIPOFeed -->|Polls| AtomFeed
AtomFeed -->|Filters for| FormType
GetIPOReg -->|Aggregates| S1
GetIPOReg -->|Aggregates| S1A
GetIPOReg -->|Aggregates| B4
- `get_ipo_feed_entries`: Polls the EDGAR Atom feed (the fastest source for new filings) and filters for S-1, F-1, and 424B4 forms examples/ipo_list.rs:43-51
- `get_ipo_registration_filings`: Retrieves the full timeline of a company’s IPO process, from the initial S-1 through all amendments to the final prospectus examples/ipo_show.rs:48-58
Sources: examples/ipo_list.rs:1-17 examples/ipo_show.rs:21-33 examples/ipo_list.rs:88-108
Document Rendering (Views)
The views system provides traits and implementations for converting SEC HTML/XBRL documents into readable text formats.
The FilingView Trait
The core abstraction for rendering. It defines how to format headers, sections, and tables.
Implementations
- `MarkdownView`:
  - Goal: Lossless representation.
  - Tables: Preserved as GitHub-Flavored Markdown (GFM) pipe tables examples/ipo_show.rs:94-95
  - Usage: Standard reading and documentation.
- `EmbeddingTextView`:
  - Goal: Optimization for Large Language Model (LLM) embeddings.
  - Tables: Flattened into labeled sentences to preserve semantic context (e.g., “The value of Assets for 2023 was 100M”) examples/ipo_show.rs:96-97
  - Prose: Cleaned of excessive whitespace and HTML artifacts.
Rendering Pipeline
The `render_filing` operation examples/ipo_show.rs:48 orchestrates the process:
- Fetch the document content.
- Clean the HTML using `html_helpers`.
- Apply the selected `FilingView` implementation.
Sources: examples/ipo_show.rs:92-108 src/views/markdown_view.rs:1-10 src/views/embedding_text_view.rs:1-10
US GAAP Concept Transformation
Relevant source files
Purpose and Scope
This page documents the US GAAP concept transformation system, which normalizes raw financial concept names from SEC EDGAR filings into a standardized taxonomy. The core functionality is provided by the `distill_us_gaap_fundamental_concepts` function, which maps diverse US GAAP terminology (57+ revenue variations, 6 cost variants, multiple equity representations) into a consistent set of 71 `FundamentalConcept` enum variants src/enums/fundamental_concept_enum.rs:5-71
For information about fetching US GAAP data from the SEC API, see Data Fetching Functions. For details on the data models that use these concepts, see Data Models & Enumerations. For the Python ML pipeline that processes the transformed concepts, see Python narrative_stack System.
System Overview
The transformation system acts as a critical normalization layer between raw SEC EDGAR filings and downstream data processing. Companies report financial data using various US GAAP concept names (e.g., Revenues, SalesRevenueNet, HealthCareOrganizationRevenue), and this system ensures all variations map to consistent concept identifiers.
Data Flow: Natural Language to Code Entity Space
The following diagram bridges the gap between the natural language of financial reporting and the internal code entities used for processing.
Sources: src/enums/fundamental_concept_enum.rs:5-71 src/enums.rs:7-8
graph TB
subgraph "Natural Language Space (SEC Filings)"
RawConcepts["Raw US GAAP Concept Names\n'Revenues'\n'SalesRevenueNet'\n'AssetsCurrent'"]
end
subgraph "Code Entity Space (rust-sec-fetcher)"
DistillFn["distill_us_gaap_fundamental_concepts\n(Function)"]
FCEnum["FundamentalConcept\n(Enum)"]
Assets["FundamentalConcept::Assets"]
CurrentAssets["FundamentalConcept::CurrentAssets"]
Revenues["FundamentalConcept::Revenues"]
end
RawConcepts -->|Input: &str| DistillFn
DistillFn -->|Output: Option<Vec<FC>>| FCEnum
FCEnum --> Assets
FCEnum --> CurrentAssets
FCEnum --> Revenues
The FundamentalConcept Taxonomy
The FundamentalConcept enum defines 71 standardized financial concept variants organized into main categories: Balance Sheet, Income Statement, Cash Flow, and Equity classifications src/enums/fundamental_concept_enum.rs:5-71 Each variant represents a normalized concept that may map from multiple raw US GAAP names.
Sources: src/enums/fundamental_concept_enum.rs:5-71
graph TB
subgraph "FundamentalConcept Enum Variants"
Root["FundamentalConcept\n(71 total variants)"]
end
subgraph "Balance Sheet"
Assets["Assets"]
CurrentAssets["CurrentAssets"]
NoncurrentAssets["NoncurrentAssets"]
Liabilities["Liabilities"]
CurrentLiabilities["CurrentLiabilities"]
NoncurrentLiabilities["NoncurrentLiabilities"]
LiabilitiesAndEquity["LiabilitiesAndEquity"]
end
subgraph "Income Statement"
Revenues["Revenues"]
CostOfRevenue["CostOfRevenue"]
GrossProfit["GrossProfit"]
OperatingExpenses["OperatingExpenses"]
OperatingIncomeLoss["OperatingIncomeLoss"]
NetIncomeLoss["NetIncomeLoss"]
InterestExpenseOperating["InterestExpenseOperating"]
ResearchAndDevelopment["ResearchAndDevelopment"]
end
subgraph "Cash Flow"
NetCashFlow["NetCashFlow"]
NetCashFlowFromOperatingActivities["NetCashFlowFromOperatingActivities"]
NetCashFlowFromInvestingActivities["NetCashFlowFromInvestingActivities"]
NetCashFlowFromFinancingActivities["NetCashFlowFromFinancingActivities"]
end
Root --> Assets
Root --> Liabilities
Root --> Revenues
Root --> NetIncomeLoss
Root --> NetCashFlow
Mapping Pattern Types
The transformation system implements four distinct mapping patterns to handle the diverse ways companies report financial concepts.
Pattern 1: One-to-One Mapping
Simple direct mappings where a single US GAAP concept name maps to exactly one FundamentalConcept variant.
| Raw US GAAP Concept | FundamentalConcept Output |
|---|---|
| Assets | vec![Assets] |
| Liabilities | vec![Liabilities] |
| GrossProfit | vec![GrossProfit] |
| OperatingIncomeLoss | vec![OperatingIncomeLoss] |
| CommitmentsAndContingencies | vec![CommitmentsAndContingencies] |
Pattern 2: Hierarchical Mapping
Specific concepts map to multiple variants, including both the specific concept and parent categories. This enables queries at different levels of granularity.
| Raw US GAAP Concept | FundamentalConcept Output (Ordered) |
|---|---|
| AssetsCurrent | vec![CurrentAssets, Assets] |
| StockholdersEquity | vec![EquityAttributableToParent, Equity] |
| NetIncomeLoss | vec![NetIncomeLossAttributableToParent, NetIncomeLoss] |
Pattern 3: Synonym Consolidation
Multiple US GAAP concept names that represent the same financial concept are consolidated into a single FundamentalConcept variant. For example, CostOfGoodsAndServicesSold, CostOfServices, and CostOfGoodsSold all map to FundamentalConcept::CostOfRevenue.
Pattern 4: Industry-Specific Revenue Mapping
The system handles dozens of industry-specific revenue variations (e.g., HealthCareOrganizationRevenue, OilAndGasRevenue, ElectricUtilityRevenue), mapping them all to the Revenues concept.
Sources: src/enums/fundamental_concept_enum.rs:5-71
The distill_us_gaap_fundamental_concepts Function
The core transformation function accepts a string representation of a US GAAP concept name and returns an Option<Vec<FundamentalConcept>>. The return type is an Option because not all US GAAP concepts are mapped, and a Vec because some concepts map to multiple standardized variants.
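As a rough illustration, a trimmed-down sketch of this signature and its match-based dispatch might look like the following. The variant subset and the arms are hypothetical; the real enum and mapping table are far larger.

```rust
// Hypothetical, trimmed-down sketch of the mapping logic; the real enum
// has many more variants and match arms.
#[derive(Debug, Clone, PartialEq)]
enum FundamentalConcept {
    Assets,
    CurrentAssets,
    Revenues,
    CostOfRevenue,
}

fn distill(concept: &str) -> Option<Vec<FundamentalConcept>> {
    use FundamentalConcept::*;
    match concept {
        // Pattern 1: one-to-one
        "Assets" => Some(vec![Assets]),
        // Pattern 2: hierarchical (specific concept first, then parent)
        "AssetsCurrent" => Some(vec![CurrentAssets, Assets]),
        // Pattern 3: synonym consolidation
        "CostOfGoodsSold" | "CostOfServices" | "CostOfGoodsAndServicesSold" => {
            Some(vec![CostOfRevenue])
        }
        // Pattern 4: industry-specific revenue variants
        "Revenues" | "SalesRevenueNet" | "HealthCareOrganizationRevenue" => {
            Some(vec![Revenues])
        }
        // Unmapped concepts yield None rather than an error
        _ => None,
    }
}
```

The `Option` cleanly distinguishes "not mapped" from "mapped to one or more variants" without forcing callers to handle errors.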
Implementation Logic
The function serves as the primary entry point for the transformation logic. It is utilized by higher-level operations to normalize data before it is stored or used for training.
Sources: src/enums/fundamental_concept_enum.rs:5-71 src/enums.rs:7-8
graph LR
subgraph "Input"
RawStr["&str: 'SalesRevenueNet'"]
end
subgraph "distill_us_gaap_fundamental_concepts"
Match["Pattern Match Engine"]
end
subgraph "Output"
Result["Some(vec![FundamentalConcept::Revenues])"]
end
RawStr --> Match
Match --> Result
Summary
The US GAAP concept transformation system provides:
- Standardization: Maps hundreds of raw US GAAP concept names to 71 standardized FundamentalConcept variants src/enums/fundamental_concept_enum.rs:5-71
- Flexibility: Supports four mapping patterns (one-to-one, hierarchical, synonyms, industry-specific) to handle diverse reporting practices.
- Queryability: Hierarchical mappings enable queries at multiple granularity levels (e.g., for all Assets or specifically CurrentAssets).
- Integration: Serves as the critical normalization layer between the SEC EDGAR API and downstream data processing/ML pipelines.
Sources: src/enums/fundamental_concept_enum.rs:5-71 src/enums.rs:7-8
Data Models & Enumerations
Relevant source files
- examples/fuzzy_match_company.rs
- src/config/config_manager.rs
- src/config/credential_manager.rs
- src/enums.rs
- src/enums/form_type_enum.rs
- src/enums/fundamental_concept_enum.rs
- src/models.rs
- src/models/accession_number.rs
- src/models/cik.rs
- src/models/investment_company.rs
- src/models/nport_investment.rs
- src/models/thirteenf_holding.rs
- src/models/ticker.rs
- src/network/fetch_cik_by_ticker_symbol.rs
- src/network/fetch_company_tickers.rs
Purpose and Scope
This page documents the core data structures and enumerations used throughout the rust-sec-fetcher application. These models represent SEC financial data, including company identifiers, filing metadata, investment holdings, and financial concepts. The data models are defined across the src/models/ directory and centralized in src/models.rs:1-18 while enumerations are managed in src/enums.rs:1-15
Sources: src/models.rs:1-18 src/enums.rs:1-15
SEC Identifier Models
The system uses three primary identifier types to reference companies and filings within the SEC EDGAR system.
Ticker
The Ticker struct represents a company’s stock ticker symbol along with its SEC identifiers. It is the primary structure for mapping human-readable symbols to regulatory keys.
Structure:
- cik: Cik - The company's Central Index Key src/models/ticker.rs:21
- symbol: TickerSymbol - Stock ticker symbol (e.g., "AAPL") src/models/ticker.rs:22
- company_name: String - Full company name src/models/ticker.rs:23
- origin: TickerOrigin - Source of the ticker data (Primary vs Derived) src/models/ticker.rs:24
Fuzzy Matching: The Ticker model includes a sophisticated fuzzy matching engine in get_by_fuzzy_matched_name src/models/ticker.rs:38-136 It uses tokenization, SIMD-accelerated cleaning src/models/ticker.rs:148-204 and weighted scoring (e.g., EXACT_MATCH_BOOST, PREFERRED_STOCK_PENALTY) to resolve company names to CIKs src/models/ticker.rs:27-33
Sources: src/models/ticker.rs:19-35 src/models/ticker.rs:38-136
Cik (Central Index Key)
The Cik struct represents a 10-digit SEC identifier that uniquely identifies a company or entity. CIKs are permanent and never reused src/models/cik.rs:11-36
Structure:
- value: u64 - The numeric CIK value src/models/cik.rs:39
Key Characteristics:
- Formatting: Always zero-padded to 10 digits when displayed (e.g., 320193 → "0000320193") src/models/cik.rs:66-69
- Resolution: get_company_cik_by_ticker_symbol handles the logic of resolving derived instruments (warrants, units) back to their parent registrant's CIK src/models/cik.rs:143-167
Sources: src/models/cik.rs:37-40 src/models/cik.rs:143-167
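The padding rule can be sketched with Rust's zero-fill width specifier. The method name below is hypothetical; the real struct exposes its own formatting implementation.

```rust
// Minimal sketch: a Cik wraps a u64 and renders as 10 zero-padded digits.
struct Cik {
    value: u64,
}

impl Cik {
    // Hypothetical helper name for illustration.
    fn to_padded_string(&self) -> String {
        format!("{:010}", self.value) // e.g. 320193 -> "0000320193"
    }
}
```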
AccessionNumber
The AccessionNumber struct represents a unique identifier for SEC filings. Each accession number is exactly 18 digits and encodes the filer’s CIK, filing year, and sequence number.
Format: XXXXXXXXXX-YY-NNNNNN src/models/accession_number.rs:11-14
Key Methods:
- from_str(accession_str: &str) - Parses from string, handling both dashed and plain formats src/models/accession_number.rs:80-112
- to_string() - Returns the canonical dash-separated format src/models/accession_number.rs:179-181
Sources: src/models/accession_number.rs:35-187
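A minimal sketch of the parse/format round trip, assuming the 10-2-6 digit layout described above. The function names and tuple return type are hypothetical; the real model is a dedicated struct.

```rust
// Hypothetical sketch: 18 digits encoding filer CIK (10), year (2),
// and sequence (6), canonically rendered dash-separated.
fn parse_accession(s: &str) -> Option<(u64, u8, u32)> {
    // Accept both dashed and plain input by keeping only digits.
    let digits: String = s.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.len() != 18 {
        return None;
    }
    let cik = digits[0..10].parse().ok()?;
    let year = digits[10..12].parse().ok()?;
    let seq = digits[12..18].parse().ok()?;
    Some((cik, year, seq))
}

fn format_accession(cik: u64, year: u8, seq: u32) -> String {
    format!("{:010}-{:02}-{:06}", cik, year, seq)
}
```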
SEC Identifier Relationships
Sources: src/models/ticker.rs:20-25 src/models/cik.rs:143-167 src/models/accession_number.rs:35-40
Filing Data Structures
NportInvestment
The NportInvestment struct represents a single investment holding from an NPORT-P filing. It includes both raw data from the SEC and “mapped” fields enriched by the fetcher.
Key Fields:
- Mapped Data: mapped_ticker_symbol, mapped_company_name, mapped_company_cik_number src/models/nport_investment.rs:14-16
- Identifiers: name, lei, cusip, isin src/models/nport_investment.rs:18-22
- Financials: balance, val_usd, and pct_val (stored as a normalized Pct type) src/models/nport_investment.rs:24-35
Sources: src/models/nport_investment.rs:9-41
ThirteenfHolding
The ThirteenfHolding struct represents a row in a Form 13F-HR information table. Unlike raw XML data, these fields are stored in normalized form src/models/thirteenf_holding.rs:4-9
- value_usd: Normalized to actual dollars (correcting pre-2023 "thousands" reporting) src/models/thirteenf_holding.rs:17-21
- weight_pct: Portfolio weight on a 0–100 scale src/models/thirteenf_holding.rs:30-32
Sources: src/models/thirteenf_holding.rs:10-33
InvestmentCompany
Represents mutual funds and ETFs. It is primarily used to resolve tickers that do not appear in the standard operating company list src/models/investment_company.rs:6-49
- get_fund_cik_by_ticker_symbol: Specifically searches the series/class dataset for fund CIKs src/models/investment_company.rs:52-67
Sources: src/models/investment_company.rs:52-67 src/network/fetch_cik_by_ticker_symbol.rs:67-69
Enumerations
FundamentalConcept
The FundamentalConcept enum defines 71 standardized financial concepts (e.g., Assets, NetIncomeLoss, Revenues). It is the backbone of the US GAAP transformation pipeline src/enums/fundamental_concept_enum.rs:1-72
FormType
The FormType enum covers SEC forms explicitly handled by the library, such as TenK (“10-K”), EightK (“8-K”), and Sc13G (“SCHEDULE 13G”) src/enums/form_type_enum.rs:65-200 It uses strum for case-insensitive parsing and provides the canonical EDGAR string via as_edgar_str src/enums/form_type_enum.rs:56-63
CacheNamespacePrefix
Defines the organizational structure of the simd-r-drive cache.
- CompanyTickerFuzzyMatch: Used to cache expensive fuzzy-matching results src/models/ticker.rs:15
- CompanyTickers: Used for the raw ticker dataset.
Url
A centralized registry of SEC EDGAR endpoints, such as CompanyTickersJson and CompanyTickersTxt src/network/fetch_company_tickers.rs:62-73
TickerOrigin
Distinguishes between PrimaryListing (from company_tickers.json) and DerivedInstrument (from ticker.txt, including warrants and preferreds) src/network/fetch_company_tickers.rs:22-32
graph LR
subgraph Input["Natural Language Space"]
Query["'Apple' or 'AAPL'"]
end
subgraph Logic["Code Entity Space"]
SClient["SecClient"]
FCT["fetch_company_tickers"]
T_Fuzzy["Ticker::get_by_fuzzy_matched_name"]
C_Lookup["Cik::get_company_cik_by_ticker_symbol"]
subgraph Models["Data Models"]
M_Ticker["Ticker"]
M_Cik["Cik"]
M_Origin["TickerOrigin"]
end
end
Query --> T_Fuzzy
SClient --> FCT
FCT --> M_Ticker
T_Fuzzy --> M_Ticker
M_Ticker --> M_Origin
M_Ticker --> C_Lookup
C_Lookup --> M_Cik
Data Flow & Relationships
The preceding diagram bridges the natural language concepts of "Searching for a Company" to the specific code entities involved.
Sources: src/network/fetch_company_tickers.rs:58-65 src/models/ticker.rs:38-42 src/models/cik.rs:143-146 examples/fuzzy_match_company.rs:35-75
Implementation Details: Precision & Normalization
The system prioritizes financial accuracy by using specialized types:
- rust_decimal::Decimal: Used for all currency and balance fields to avoid floating-point errors src/models/nport_investment.rs:25-30
- Pct: A custom wrapper for percentage values (0-100 scale) used in portfolio weighting src/models/nport_investment.rs:35 src/models/thirteenf_holding.rs:32
Sources: src/models/nport_investment.rs:2-35 src/models/thirteenf_holding.rs:1-32
Parsers & Data Normalization
Relevant source files
- src/config/app_config.rs
- src/normalize/mod.rs
- src/normalize/pct.rs
- src/normalize/thirteenf.rs
- src/parsers.rs
- src/parsers/parse_13f_xml.rs
- src/parsers/parse_form4_xml.rs
- src/parsers/parse_nport_xml.rs
- src/parsers/parse_us_gaap_fundamentals.rs
- tests/config_manager_tests.rs
- tests/us_gaap_parser_accuracy_tests.rs
Purpose and Scope
The parsers and normalize modules form the data ingestion backbone of rust-sec-fetcher. While the network layer retrieves raw bytes (XML, JSON, or CSV), the parsers transform these into structured Rust models. The normalize module ensures that numeric inconsistencies across SEC eras—such as the transition from thousands-of-dollars to actual-dollars in Form 13F—are handled centrally rather than being scattered across the codebase.
Sources: src/parsers.rs:1-21 src/normalize/mod.rs:1-16
Normalization Logic
The normalize module is the single source of truth for scale conversions and unit adjustments. It prevents “inline conversions” in parsers, ensuring that logic for handling SEC schema changes is testable in isolation.
13F Value Normalization
The SEC changed the <value> unit in Form 13F-HR informationTable.xml filings around January 1, 2023.
- Legacy Era: Values reported in thousands of USD.
- Modern Era: Values reported in actual USD.
The function normalize_13f_value_usd src/normalize/thirteenf.rs:144-150 uses the THIRTEENF_THOUSANDS_ERA_CUTOFF constant src/normalize/thirteenf.rs:72 to determine whether a 1000x multiplier should be applied based on the filing_date.
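A sketch of that cutoff logic, using ISO-8601 date strings in place of the real date type and f64 in place of Decimal. The function and constant names follow the ones cited above, but the bodies are illustrative.

```rust
// Sketch: ISO dates compare correctly as plain strings, so the era check
// reduces to a lexicographic comparison against the cutoff.
const THIRTEENF_THOUSANDS_ERA_CUTOFF: &str = "2023-01-01";

fn is_13f_thousands_era(filing_date: &str) -> bool {
    filing_date < THIRTEENF_THOUSANDS_ERA_CUTOFF
}

fn normalize_13f_value_usd(raw_value: f64, filing_date: &str) -> f64 {
    if is_13f_thousands_era(filing_date) {
        raw_value * 1000.0 // legacy filings report thousands of USD
    } else {
        raw_value // modern filings report actual USD
    }
}
```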
Percentage Handling (Pct Type)
The Pct struct src/normalize/pct.rs:31 is a type-safe wrapper around Decimal that enforces a 0–100 scale (e.g., 7.75 means 7.75%, not 0.0775).
- Pct::from_pct: Used when the SEC already provides a 0-100 value (e.g., N-PORT pctVal) src/normalize/pct.rs:57-59
- Pct::from_ratio: Multiplies a 0-1 ratio by 100 src/normalize/pct.rs:76-78
Sources: src/normalize/thirteenf.rs:1-150 src/normalize/pct.rs:1-110
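A sketch of the wrapper's two constructors, using f64 instead of rust_decimal::Decimal for brevity; the invariant is the 0-100 scale, not the underlying numeric type.

```rust
// Sketch of the Pct wrapper; the real type wraps rust_decimal::Decimal
// to avoid floating-point error.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pct(f64); // invariant: 0-100 scale, so 7.75 means 7.75%

impl Pct {
    fn from_pct(value: f64) -> Self {
        Pct(value) // source already on a 0-100 scale (e.g. N-PORT pctVal)
    }

    fn from_ratio(ratio: f64) -> Self {
        Pct(ratio * 100.0) // convert a 0-1 ratio to the 0-100 scale
    }
}
```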
XML Parsers
The system uses quick-xml for high-performance, stream-based parsing of large SEC filings.
N-PORT XML Parser
The parse_nport_xml function processes monthly portfolio holdings for registered investment companies.
- Stream Processing: It iterates through invstOrSec tags src/parsers/parse_nport_xml.rs:30
- Fuzzy Matching: After parsing, it attempts to map holding names/titles to known Ticker symbols using Ticker::get_by_fuzzy_matched_name src/parsers/parse_nport_xml.rs:135-138
- Normalization: Percentages are wrapped in the Pct type via Pct::from_pct src/parsers/parse_nport_xml.rs:104-105
Form 13F-HR Parser
The parse_13f_xml function extracts institutional investment manager holdings.
- Two-Pass Logic:
  - Extracts all <infoTable> entries and normalizes USD values src/parsers/parse_13f_xml.rs:91-106
  - Calculates portfolio weights (weight_pct) by summing total value and calling compute_13f_weight_pct src/parsers/parse_13f_xml.rs:121-124
- Sorting: Returns holdings sorted by value_usd descending src/parsers/parse_13f_xml.rs:126
Form 4 Parser
The parse_form4_xml function parses insider trading transactions.
- Table Support: Handles both nonDerivativeTable and derivativeTable src/parsers/parse_form4_xml.rs:47
- State Management: Uses a tag_stack to track nested <value> elements within parents like transactionShares src/parsers/parse_form4_xml.rs:79-88
Sources: src/parsers/parse_nport_xml.rs:15-149 src/parsers/parse_13f_xml.rs:26-128 src/parsers/parse_form4_xml.rs:14-158
US GAAP Fundamentals Parser
The parse_us_gaap_fundamentals function src/parsers/parse_us_gaap_fundamentals.rs:41 converts the SEC's companyfacts JSON into a Polars DataFrame.
Deduplication & Sorting
The parser implements a “Last-in Wins” strategy for amended filings:
- Chronological Sort: Data is sorted by the filed date descending src/parsers/parse_us_gaap_fundamentals.rs:32-33
- Pivot: During the pivot operation, it uses .first() to select the most recent filing for any given fiscal period (fy/fp) src/parsers/parse_us_gaap_fundamentals.rs:34-38
- Metadata: Every row is prefixed with US_GAAP_META_COLUMNS including accn and filing_url src/parsers/parse_us_gaap_fundamentals.rs:12-21
Magnitude Sanity Checks
Because filers often make “scale errors” (e.g., reporting millions as ones), the parser applies strictness checks:
- Annual (FY): If the fiscal year (fy) is greater than the calendar year of the end date, the record is discarded src/parsers/parse_us_gaap_fundamentals.rs:88-90
- Interim (Q1-Q3): Allows fy to be up to end_year + 1 to account for fiscal years ending early in a calendar year src/parsers/parse_us_gaap_fundamentals.rs:96-98
Sources: src/parsers/parse_us_gaap_fundamentals.rs:12-127
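The two rules above can be sketched as a single predicate. The signature is hypothetical; the real checks operate on rows during DataFrame construction.

```rust
// Hypothetical sketch of the fiscal-period plausibility rules.
fn is_plausible_period(fiscal_year: i32, fiscal_period: &str, end_year: i32) -> bool {
    match fiscal_period {
        // Annual records: the stated fiscal year may not exceed the
        // calendar year of the period end date.
        "FY" => fiscal_year <= end_year,
        // Interim records: allow fy == end_year + 1 for fiscal years
        // that end early in the following calendar year.
        "Q1" | "Q2" | "Q3" => fiscal_year <= end_year + 1,
        _ => false,
    }
}
```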
Data Flow Diagrams
graph TD
subgraph "Natural Language Space"
SEC_XML["SEC XML Source\n(13F / N-PORT)"]
RawVal["'value' (13F)\n'pctVal' (N-PORT)"]
FilingDate["'filingDate'"]
end
subgraph "Code Entity Space: Parsers"
P13F["parse_13f_xml()"]
PNPORT["parse_nport_xml()"]
end
subgraph "Code Entity Space: Normalize"
N13F["normalize_13f_value_usd()"]
NPCT["Pct::from_pct()"]
EraCheck["is_13f_thousands_era()"]
end
SEC_XML --> P13F
SEC_XML --> PNPORT
P13F -->|passes raw decimal| N13F
P13F -->|passes date| EraCheck
EraCheck -->|returns bool| N13F
PNPORT -->|passes raw decimal| NPCT
N13F -->|Result| Model13F["ThirteenfHolding.value_usd"]
NPCT -->|Result| ModelNPORT["NportInvestment.pct_val"]
From XML to Normalized Model
This diagram bridges the “Natural Language” SEC fields to the “Code Entities” in the normalize and parsers modules.
Sources: src/parsers/parse_13f_xml.rs:98 src/normalize/thirteenf.rs:144 src/parsers/parse_nport_xml.rs:104
graph LR
subgraph "Input"
J["JSON Value\n(companyfacts)"]
end
subgraph "src/parsers/parse_us_gaap_fundamentals.rs"
Ext["Extraction Loop\n(fact_category_values, etc)"]
Sanity["Magnitude Sanity Checks\n(fy vs end_year)"]
DF["DataFrame::new()"]
Sort["sort(['filed'], descending=true)"]
Pivot["pivot(['fy', 'fp'])\naggregate: first()"]
end
subgraph "Output"
TDF["TickerFundamentalsDataFrame"]
end
J --> Ext
Ext --> Sanity
Sanity --> DF
DF --> Sort
Sort --> Pivot
Pivot --> TDF
US GAAP DataFrame Construction
The pipeline for converting raw JSON facts into an analysis-ready Polars DataFrame.
Sources: src/parsers/parse_us_gaap_fundamentals.rs:41-103 src/parsers/parse_us_gaap_fundamentals.rs:32-38
CSV and Text Parsers
Company Tickers
- parse_company_tickers_json: Parses the SEC's company_tickers.json, which maps CIKs to tickers and names src/parsers/parse_company_tickers.rs:19-20
- parse_ticker_txt: Parses the ticker.txt file used for broader ticker coverage src/parsers/parse_company_tickers.rs:20
Master Index
- parse_master_idx: Parses the quarterly master.idx files from EDGAR. It skips header lines and extracts CIK, Company Name, Form Type, Date Filed, and File Name (URL) src/parsers/parse_master_idx.rs:10-11
Investment Companies
- parse_investment_companies_csv: Processes the investment_company_registrants_list.csv to identify entities registered under the Investment Company Act src/parsers/parse_investment_companies_csv.rs:13-14
Sources: src/parsers.rs:10-21
Operations & Business Logic
Relevant source files
- examples/ipo_list.rs
- examples/ipo_show.rs
- src/network/filings/filing_index.rs
- src/network/filings/mod.rs
- src/ops/filing.rs
- src/ops/holdings.rs
- src/ops/ipo.rs
- src/ops/mod.rs
The ops module provides high-level, orchestrated operations built on top of the lower-level network fetch functions. Each operation encapsulates multi-step workflows—such as filing index parsing, portfolio position normalization, and automated feed polling—allowing callers to work at a higher level of abstraction than raw HTTP requests.
Filing Operations
Filing operations handle the retrieval and transformation of SEC documents into human-readable or machine-learning-ready text. The primary entry point is render_filing, which coordinates fetching the primary document and its associated exhibits.
Rendering Pipeline
The rendering logic distinguishes between “substantive” exhibits (press releases, material contracts) and boilerplate (SOX certifications, auditor consents).
Sources: src/ops/filing.rs:61-84 src/ops/filing.rs:130-141
graph TD
subgraph "ops::filing"
RF["render_filing()"]
RE["render_exhibit_docs()"]
RED["render_exhibit_doc()"]
end
subgraph "network::filings"
FI["fetch_filing_index()"]
FAR["fetch_and_render()"]
end
RF -->|if render_body| FAR
RF -->|if render_exhibits| FI
FI -->|FilingIndex| RE
RE --> RED
RED --> FAR
FAR -->|FilingView| Output["Rendered Text"]
Key Functions and Structures
| Entity | Role | Source |
|---|---|---|
| RenderedFiling | Container for the optional body text and a Vec of RenderedExhibit. | src/ops/filing.rs:20-25 |
| render_filing | High-level orchestrator that fetches the primary doc and substantive exhibits. | src/ops/filing.rs:61-84 |
| render_all_exhibits | Variant that skips substantive filtering to return every document in the archive. | src/ops/filing.rs:92-100 |
| fetch_filing_index | Parses the EDGAR -index.htm page to find document filenames and types. | src/network/filings/filing_index.rs:108-114 |
The FilingIndex parser uses regex to extract data from the SEC’s HTML table, identifying documents by their Seq, Description, and Type src/network/filings/filing_index.rs:23-76
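The substantive/boilerplate split can be sketched as a predicate over exhibit type codes. This is a hypothetical illustration; the real filter in src/ops/filing.rs may use different rules and inputs.

```rust
// Hypothetical sketch: SOX certifications (EX-31/EX-32) and auditor
// consents (EX-23) are boilerplate, while press releases (EX-99) and
// material contracts (EX-10) are usually worth rendering.
fn is_substantive_exhibit(exhibit_type: &str) -> bool {
    let t = exhibit_type.to_ascii_uppercase();
    !(t.starts_with("EX-31") || t.starts_with("EX-32") || t.starts_with("EX-23"))
}
```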
Holdings Operations
Holdings operations normalize investment data from disparate SEC forms (13F for institutional managers and N-PORT for registered investment companies) into a common Position format for comparison.
Position Normalization and Diffing
The system matches positions by CUSIP and calculates weight changes. A “significant” change is defined by the WEIGHT_CHANGE_THRESHOLD (default 0.10 percentage points).
Sources: src/ops/holdings.rs:45-71 src/ops/holdings.rs:79-120
graph LR
NPORT["NportInvestment"] -->|positions_from_nport| P1["Position"]
T13F["ThirteenfHolding"] -->|positions_from_13f| P2["Position"]
P1 --> DH["diff_holdings()"]
P2 --> DH
DH -->|Result| Diff["Diff Structure"]
subgraph "Diff Results"
Diff --> Added["added: Vec<Position>"]
Diff --> Removed["removed: Vec<Position>"]
Diff --> Changed["changed: Vec<(Old, New)>"]
end
Implementation Details
- Position: Stores cusip, name, val_usd, and weight (as a Pct type) src/ops/holdings.rs:18-28
- diff_holdings: Uses HashMap lookups to identify new buys (added), full sells (removed), and significant weight adjustments (changed) src/ops/holdings.rs:79-120
- Sorting: The changed list is automatically sorted by absolute weight change descending src/ops/holdings.rs:109-113
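A sketch of the CUSIP-keyed diff, using f64 weights in place of the Pct wrapper. The tuple return shape and threshold handling are illustrative; the real code returns a dedicated Diff structure.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Position {
    cusip: String,
    weight: f64, // portfolio weight, 0-100 scale
}

const WEIGHT_CHANGE_THRESHOLD: f64 = 0.10; // percentage points

fn diff_holdings(
    old: &[Position],
    new: &[Position],
) -> (Vec<Position>, Vec<Position>, Vec<(Position, Position)>) {
    let old_map: HashMap<&str, &Position> =
        old.iter().map(|p| (p.cusip.as_str(), p)).collect();
    let new_map: HashMap<&str, &Position> =
        new.iter().map(|p| (p.cusip.as_str(), p)).collect();

    // New buys: present in `new` but not `old`.
    let added = new.iter()
        .filter(|p| !old_map.contains_key(p.cusip.as_str()))
        .cloned().collect();
    // Full sells: present in `old` but not `new`.
    let removed = old.iter()
        .filter(|p| !new_map.contains_key(p.cusip.as_str()))
        .cloned().collect();
    // Significant weight adjustments, sorted by absolute change descending.
    let mut changed: Vec<(Position, Position)> = old.iter()
        .filter_map(|o| new_map.get(o.cusip.as_str()).map(|n| (o.clone(), (*n).clone())))
        .filter(|(o, n)| (n.weight - o.weight).abs() >= WEIGHT_CHANGE_THRESHOLD)
        .collect();
    changed.sort_by(|a, b| {
        let da = (a.1.weight - a.0.weight).abs();
        let db = (b.1.weight - b.0.weight).abs();
        db.partial_cmp(&da).unwrap()
    });
    (added, removed, changed)
}
```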
IPO Operations & Lifecycle
The IPO module manages the discovery and retrieval of registration statements (S-1/F-1) and their subsequent amendments (S-1/A, F-1/A).
Registration Filing Lifecycle
The system tracks companies through the registration process, starting from the initial filing through amendments to the final pricing prospectus.
| Form Type | Description | Constant Group |
|---|---|---|
| S-1 / F-1 | Initial registration statement. | FormType::IPO_REGISTRATION_FORM_TYPES |
| S-1/A / F-1/A | Amendments responding to SEC comments. | FormType::IPO_REGISTRATION_FORM_TYPES |
| 424B4 | Final pricing prospectus (deal terms). | FormType::IPO_PRICING_FORM_TYPES |
Sources: src/ops/ipo.rs:40-43 examples/ipo_show.rs:26-32
Feed Polling and Deduplication
The get_ipo_feed_entries function provides a “delta-poll” capability. It filters the EDGAR Atom feed, which uses prefix matching (e.g., searching “S-1” returns “S-11”), to ensure exact form type matches.
Sources: src/ops/ipo.rs:83-110
graph TB
Start["get_ipo_feed_entries()"]
Fetch["fetch_edgar_feeds_since()"]
ExactMatch{"Exact Form Match?"}
Dedup{"Seen Accession?"}
Start --> Fetch
Fetch --> ExactMatch
ExactMatch -->|No| Drop["Discard (e.g. S-11)"]
ExactMatch -->|Yes| Dedup
Dedup -->|New| Collect["Add to Results"]
Dedup -->|Duplicate| Drop
Collect --> HW["Update High Water Mark"]
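The exact-match and deduplication steps above can be sketched as a single filter pass. The FeedEntry fields are hypothetical; the real code works with parsed Atom feed entries.

```rust
use std::collections::HashSet;

// Hypothetical minimal feed entry.
struct FeedEntry {
    form_type: String,
    accession: String,
}

fn filter_ipo_entries<'a>(
    entries: &'a [FeedEntry],
    wanted_forms: &[&str],
    seen: &mut HashSet<String>,
) -> Vec<&'a FeedEntry> {
    entries
        .iter()
        // Exact match guards against EDGAR's prefix search
        // (searching "S-1" also returns "S-11").
        .filter(|e| wanted_forms.contains(&e.form_type.as_str()))
        // Deduplicate across polls by accession number; insert()
        // returns false for values already seen.
        .filter(|e| seen.insert(e.accession.clone()))
        .collect()
}
```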
Identity Resolution
Because pre-IPO companies lack ticker symbols, the logic supports resolution via CIK. The ipo_show example demonstrates this by prioritizing CIK lookup and falling back to ticker-based CIK discovery for companies that have already listed examples/ipo_show.rs:115-121
Sources: src/ops/ipo.rs:8-49 examples/ipo_list.rs:87-108
Caching & Storage System
Relevant source files
Purpose and Scope
This document describes the caching and storage infrastructure used by the Rust sec-fetcher application to minimize redundant API requests and improve performance. The system implements a two-tier caching architecture with persistent storage backed by simd-r-drive.
The caching system is designed to be isolated per ConfigManager instance, ensuring that different environments (e.g., production vs. unit tests) do not suffer from cross-test cache pollution.
Overview
The caching system provides two distinct cache layers managed by the Caches struct:
- HTTP Cache : Stores raw HTTP responses from SEC EDGAR API requests to avoid re-downloading immutable filing data.
- Preprocessor Cache : Stores transformed and processed data structures (e.g., mapping tables, calculated values, or TTL-based metadata).
Both caches use the simd-r-drive key-value storage backend with persistent file-based storage.
Sources: src/caches.rs:1-14 src/caches.rs:25-51
Caching Architecture
The following diagram illustrates the caching architecture and its integration with the configuration and network layers:
Sources: src/caches.rs:11-14 src/caches.rs:29-51 src/network/fetch_investment_company_series_and_class_dataset.rs:43-46
graph TB
subgraph "Initialization Space"
ConfigMgr["ConfigManager"]
CachesStruct["Caches Struct"]
OpenFn["Caches::open(base_path)"]
end
subgraph "Code Entity Space: Caches Module"
HTTP_DS["http_cache: Arc<DataStore>"]
PRE_DS["preprocessor_cache: Arc<DataStore>"]
end
subgraph "File System (On-Disk)"
HTTP_File["http_storage_cache.bin"]
PRE_File["preprocessor_cache.bin"]
end
subgraph "Network Integration"
SecClient["SecClient"]
FetchInv["fetch_investment_company_..."]
end
ConfigMgr -->|provides path| OpenFn
OpenFn -->|instantiates| CachesStruct
CachesStruct --> HTTP_DS
CachesStruct --> PRE_DS
HTTP_DS -->|persists to| HTTP_File
PRE_DS -->|persists to| PRE_File
SecClient -->|uses| HTTP_DS
FetchInv -->|uses| PRE_DS
Implementation Details
The Caches Struct
Unlike previous versions that used global OnceLock statics, the current implementation encapsulates the storage logic within the Caches struct. This allows for better dependency injection and testing isolation.
| Method | Description |
|---|---|
| open(base: &Path) | Creates the directory if missing and opens two DataStore files: http_storage_cache.bin and preprocessor_cache.bin. |
| get_http_cache_store() | Returns an Arc<DataStore> for the HTTP response cache. |
| get_preprocessor_cache() | Returns an Arc<DataStore> for the preprocessor/metadata cache. |
Sources: src/caches.rs:25-59
CacheNamespacePrefix
To prevent key collisions within a single DataStore, the system utilizes CacheNamespacePrefix. This enum provides distinct prefixes for different types of cached data, which are then hashed using simd_r_drive::utils::NamespaceHasher.
Common namespaces include:
- LatestFundsYear: Used to track the most recent available year for investment company datasets.
Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:11-15 src/network/fetch_investment_company_series_and_class_dataset.rs:47-48
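Conceptually, the namespace scopes each key before it reaches the shared store. A plain-concatenation sketch of that idea follows; the real code hashes via simd_r_drive::utils::NamespaceHasher rather than concatenating.

```rust
// Illustrative only: prefix the key with its namespace so two callers
// using the same raw key cannot collide in one DataStore.
fn namespaced_key(namespace: &str, key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(namespace.len() + 1 + key.len());
    out.extend_from_slice(namespace.as_bytes());
    out.push(b':'); // separator keeps "ab"+"c" distinct from "a"+"bc"
    out.extend_from_slice(key);
    out
}
```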
Preprocessor Cache Usage
The preprocessor cache is used for logic that requires persistence but isn’t a direct 1:1 mapping of an HTTP response. A primary example is the “Year-Fallback Logic” used when fetching investment company datasets.
sequenceDiagram
participant App as Fetch Logic
participant PreCache as Preprocessor Cache
participant SEC as SEC EDGAR API
App->>PreCache: read_with_ttl(Namespace: LatestFundsYear)
alt Cache Hit
PreCache-->>App: Return cached year (e.g., 2024)
else Cache Miss
App->>App: Default to Utc::now().year()
end
loop Fallback Logic
App->>SEC: GET Dataset for Year
alt 200 OK
SEC-->>App: CSV Data
App->>PreCache: write_with_ttl(year, TTL: 1 week)
Note over App: Break Loop
else 404 Not Found
App->>App: decrement year
end
end
Data Flow: Investment Company Dataset Fetching
Implementation Details:
- Function: fetch_investment_company_series_and_class_dataset src/network/fetch_investment_company_series_and_class_dataset.rs:43-80
- Namespace: CacheNamespacePrefix::LatestFundsYear src/network/fetch_investment_company_series_and_class_dataset.rs:11-15
- TTL: Hardcoded to 1 week (604,800 seconds) for the fallback year metadata src/network/fetch_investment_company_series_and_class_dataset.rs:71
Sources: src/network/fetch_investment_company_series_and_class_dataset.rs:46-73
HTTP Cache & SecClient
The SecClient utilizes the http_cache provided by the Caches struct. This integration typically happens during the construction of the SecClient via the ConfigManager.
Storage Characteristics (simd-r-drive)
The underlying storage provided by simd-r-drive offers:
- High Performance: Optimized for fast key-value lookups.
- Atomic Operations: Ensures data integrity during writes.
- Simplicity: Single-file binary format (.bin) per store.
Sources: src/caches.rs:31-46
Integration Summary
| Component | Role | File Reference |
|---|---|---|
| Caches | Owner of DataStore handles | src/caches.rs:11-14 |
| simd_r_drive::DataStore | Low-level storage engine | src/caches.rs:1 |
| NamespaceHasher | Scopes keys within a DataStore | src/network/fetch_investment_company_series_and_class_dataset.rs:11-15 |
| StorageCacheExt | Provides read_with_ttl and write_with_ttl | src/network/fetch_investment_company_series_and_class_dataset.rs:7 |
Sources: src/caches.rs:1-60 src/network/fetch_investment_company_series_and_class_dataset.rs:1-73
Utility Functions & Accessors
Relevant source files
- src/network/fetch_cik_submissions.rs
- src/utils.rs
- src/utils/is_interactive_mode.rs
- tests/sec_client_tests.rs
Purpose and Scope
This document covers the utility functions and helper modules provided by the utils module in the Rust sec-fetcher application. These utilities provide cross-cutting functionality used throughout the codebase, including data structure transformations, runtime mode detection, and collection extensions. Additionally, it documents the internal data extraction patterns used by the network layer to process complex SEC JSON responses.
Sources: src/utils.rs:1-9
Module Overview
The utils module is organized as a collection of focused sub-modules. The module uses Rust’s re-export pattern to provide a clean public API for environment detection and collection manipulation.
| Sub-module | Primary Export | Purpose |
|---|---|---|
| is_development_mode | is_development_mode() | Runtime environment detection for configuration and logging. |
| is_interactive_mode | is_interactive_mode(), set_interactive_mode_override() | Detects if the app is running in a TTY or should simulate one. |
| vec_extensions | VecExtensions trait | Extension methods for Vec<T> used in data processing. |
Sources: src/utils.rs:1-9
Interactive Mode Management
The interactive mode utilities manage application state related to user interaction, controlling whether the application should prompt for user input or run in automated mode (e.g., CI/CD or piped output).
Implementation Details
The is_interactive_mode function checks the environment and standard streams to determine the session type:
- Override Check: It first looks for the INTERACTIVE_MODE_OVERRIDE environment variable. Values "1"/"true" force interactive mode; "0"/"false" force non-interactive src/utils/is_interactive_mode.rs:8-28
- Terminal Detection: If no override exists, it uses std::io::IsTerminal to check whether both stdin and stdout are connected to a terminal src/utils/is_interactive_mode.rs:30-32
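That detection order can be sketched using only the standard library. This is an illustrative reconstruction, not the actual implementation, which also exposes a programmatic override via set_interactive_mode_override.

```rust
use std::env;
use std::io::{stdin, stdout, IsTerminal};

// Sketch of the detection order: explicit override first, then TTY checks.
fn is_interactive_mode() -> bool {
    match env::var("INTERACTIVE_MODE_OVERRIDE").ok().as_deref() {
        Some("1") | Some("true") => return true,
        Some("0") | Some("false") => return false,
        _ => {}
    }
    // Fall back to requiring that both standard streams are terminals.
    stdin().is_terminal() && stdout().is_terminal()
}
```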
Function Signatures
src/utils/is_interactive_mode.rs:21 src/utils/is_interactive_mode.rs:62
Usage Scenarios
| Scenario | is_interactive_mode() is true | is_interactive_mode() is false |
|---|---|---|
| Missing Config | Prompt user for email/API keys | Exit with error message |
| Progress | Render dynamic progress bars | Log static status updates |
| Piping | Warning if output is not redirected | Clean data output for grep/jq |
Sources: src/utils/is_interactive_mode.rs:1-76
Data Accessors and Parsing Patterns
While not a standalone “accessor” module, the codebase implements standardized internal functions to extract data from SEC JSON structures, operating on raw serde_json::Value in Rust (analogous to DataFrame-style access in the Python layer).
Submission Parsing Logic
The fetch_cik_submissions module contains specialized logic to flatten the SEC’s columnar JSON format into a vector of strongly-typed models.
graph TD
subgraph "SEC JSON Structure"
JSON["Root JSON Object"]
Filings["'filings' Object"]
Recent["'recent' Block (Columnar)"]
Files["'files' Array (Pagination)"]
end
subgraph "Logic: extract_filings_from_block"
Zip["itertools::izip!"]
Acc["accessionNumber[]"]
Form["form[]"]
Doc["primaryDocument[]"]
Date["filingDate[]"]
end
subgraph "Output Space"
Model["Vec<CikSubmission>"]
end
JSON --> Filings
Filings --> Recent
Filings --> Files
Recent --> Acc & Form & Doc & Date
    Acc & Form & Doc & Date --> Zip
Zip --> Model
Files -.->|Recursively Fetch| JSON
Key Accessor Functions
- extract_filings_from_block: Zips multiple arrays (accession numbers, forms, dates) from a serde_json::Value into a Vec<CikSubmission>. src/network/fetch_cik_submissions.rs:10-63
- parse_cik_submissions_json: The public entry point for converting a raw SEC JSON response into models, specifically targeting the filings.recent block. src/network/fetch_cik_submissions.rs:75-87
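The columnar-to-row transformation can be illustrated in Python (a sketch of the same zip-based flattening the Rust code performs with itertools::izip!; field names follow the SEC's filings.recent schema):

```python
def extract_filings_from_block(recent: dict) -> list:
    """Flatten SEC columnar JSON (parallel arrays) into one record per filing."""
    columns = ("accessionNumber", "form", "primaryDocument", "filingDate")
    rows = zip(*(recent.get(col, []) for col in columns))
    return [dict(zip(columns, row)) for row in rows]

# Minimal example of the SEC's columnar layout.
recent_block = {
    "accessionNumber": ["0001-23-000001", "0001-23-000002"],
    "form": ["10-K", "8-K"],
    "primaryDocument": ["doc1.htm", "doc2.htm"],
    "filingDate": ["2023-02-01", "2023-03-15"],
}
filings = extract_filings_from_block(recent_block)
```

Because `zip` stops at the shortest column, a truncated array silently drops trailing rows; the Rust implementation benefits from the same property when EDGAR returns ragged data.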
Sources: src/network/fetch_cik_submissions.rs:10-177
Utility Function Relationships
This diagram maps the relationship between utility functions and the higher-level system components that consume them.
Sources: src/utils.rs:1-9 src/network/fetch_cik_submissions.rs:1-178
graph LR
subgraph "Utility Layer (src/utils/)"
IsDev["is_development_mode()"]
IsInt["is_interactive_mode()"]
VecExt["VecExtensions"]
end
subgraph "Network Layer (src/network/)"
Client["SecClient"]
SubParser["parse_cik_submissions_json()"]
end
subgraph "App Logic"
Config["ConfigManager"]
Main["main.rs / CLI"]
end
IsDev --> Config
IsInt --> Main
VecExt --> SubParser
SubParser --> Client
Development Mode Detection
The is_development_mode utility is a simple toggle used to gate features that should not be active in production, such as test fixture generation or bypasses for rate limiting in local mock environments.
Usage in Configuration
The ConfigManager and SecClient utilize these flags to determine if they should load production SEC endpoints or local mock servers during integration testing tests/sec_client_tests.rs:7-23
Sources: src/utils.rs:1-2 tests/sec_client_tests.rs:1-23
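One common shape for such a toggle is an environment-driven gate (a hypothetical Python sketch; the variable name SEC_FETCHER_ENV and the localhost URL are illustrative assumptions, not the project's actual configuration):

```python
import os

def is_development_mode(env=None):
    """Gate development-only features (fixture refresh, rate-limit bypass)."""
    env = os.environ if env is None else env
    return env.get("SEC_FETCHER_ENV") == "development"

def sec_base_url(env=None):
    # Development sessions hit a local mock server instead of sec.gov.
    return "http://localhost:8080" if is_development_mode(env) else "https://www.sec.gov"
```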
Vector Extensions Trait
The VecExtensions trait provides ergonomic helpers for common operations performed on lists of SEC data, such as deduplication or specialized filtering before rendering.
Trait Definition Pattern
src/utils/vec_extensions.rs:1-10 (referenced via src/utils.rs:7-8)
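The exact extension methods are not listed on this page, but a typical helper of this kind is order-preserving deduplication. A hypothetical Python analogue (the function name is illustrative, not the trait's actual API):

```python
def dedup_preserving_order(items):
    """Remove duplicates while keeping first-seen order, a common cleanup
    step for lists of tickers or filings before rendering."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out
```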
Sources: src/utils.rs:7-8
Python narrative_stack System
Relevant source files
Purpose and Scope
The narrative_stack system is the Python machine learning component of the dual-language financial data processing architecture. This system is responsible for transforming raw financial data fetched by the Rust sec-fetcher into learned latent representations using deep learning techniques.
For detailed information on specific subsystems:
- Data Ingestion & Preprocessing: See Data Ingestion & Preprocessing for details on CSV parsing, semantic embedding generation, and PCA dimensionality reduction.
- Machine Learning Training Pipeline: See Machine Learning Training Pipeline for documentation on the Stage1Autoencoder model, PyTorch Lightning setup, and dataset streaming.
- Database & Storage Integration: See Database & Storage Integration for details on the DbUsGaap interface and DataStoreWsClient WebSocket integration.
- US GAAP Distribution Analyzer: See US GAAP Distribution Analyzer for information on the Rust-based clustering tool used to analyze concept distributions.
The Rust data fetching layer is documented in Rust sec-fetcher Application.
System Architecture Overview
The narrative_stack system processes US GAAP financial data through a multi-stage pipeline that transforms raw CSV files into learned latent representations. The architecture consists of three primary layers: storage/ingestion, preprocessing, and training.
graph TB
subgraph "Storage Layer"
DbUsGaap["DbUsGaap\nMySQL Database Interface"]
DataStoreWsClient["DataStoreWsClient\nWebSocket Client"]
UsGaapStore["UsGaapStore\nUnified Data Facade"]
end
subgraph "Data Sources"
RustCSV["CSV Files\nfrom sec-fetcher\ntruncated_csvs/"]
MySQL["MySQL Database\nus_gaap_test"]
SimdRDrive["simd-r-drive\nWebSocket Server"]
end
subgraph "Preprocessing Components"
IngestCSV["ingest_us_gaap_csvs\nCSV Walker & Parser"]
GenPCA["generate_pca_embeddings\nPCA Compression"]
RobustScaler["RobustScaler\nPer-Pair Normalization"]
ConceptEmbed["Semantic Embeddings\nConcept/Unit Pairs"]
end
subgraph "Training Components"
IterableDS["IterableConceptValueDataset\nStreaming Data Loader"]
Stage1AE["Stage1Autoencoder\nPyTorch Lightning Module"]
PLTrainer["pl.Trainer\nTraining Orchestration"]
Callbacks["EarlyStopping\nModelCheckpoint\nTensorBoard Logger"]
end
RustCSV --> IngestCSV
MySQL --> DbUsGaap
SimdRDrive --> DataStoreWsClient
DbUsGaap --> UsGaapStore
DataStoreWsClient --> UsGaapStore
IngestCSV --> UsGaapStore
UsGaapStore --> GenPCA
UsGaapStore --> RobustScaler
UsGaapStore --> ConceptEmbed
GenPCA --> UsGaapStore
RobustScaler --> UsGaapStore
ConceptEmbed --> UsGaapStore
UsGaapStore --> IterableDS
IterableDS --> Stage1AE
Stage1AE --> PLTrainer
PLTrainer --> Callbacks
Callbacks -.->|Checkpoints| Stage1AE
Component Architecture
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176
Core Component Responsibilities
| Component | Type | Responsibilities |
|---|---|---|
DbUsGaap | Database Interface | MySQL connection management, query execution. |
DataStoreWsClient | WebSocket Client | Real-time communication with simd-r-drive server. |
UsGaapStore | Data Facade | Unified API for ingestion, lookup, and embedding management. |
IterableConceptValueDataset | PyTorch Dataset | Streaming data loader with internal batching to handle large datasets. |
Stage1Autoencoder | Lightning Module | Encoder-decoder architecture for learning financial concept embeddings. |
pl.Trainer | Training Framework | Orchestration of the training loop, logging, and checkpointing. |
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176
Data Flow Pipeline
The system processes financial data through a sequential pipeline from raw CSV files to trained model checkpoints.
graph LR
subgraph "Input Stage"
CSV1["Rust CSV Output\nproject_paths.rust_data/\ntruncated_csvs/"]
end
subgraph "Ingestion Stage"
Walk["walk_us_gaap_csvs\nDirectory Walker"]
Parse["UsGaapRowRecord\nParser"]
Store1["Store to\nDbUsGaap & DataStore"]
end
subgraph "Preprocessing Stage"
Extract["Extract\nconcept/unit pairs"]
GenEmbed["generate_concept_unit_embeddings\nSemantic Embeddings"]
Scale["RobustScaler\nPer-Pair Normalization"]
PCA["PCA Compression\nvariance_threshold=0.95"]
Triplet["Triplet Storage\nconcept+unit+scaled_value\n+scaler+embedding"]
end
subgraph "Training Stage"
Stream["IterableConceptValueDataset\ninternal_batch_size=64"]
DataLoader["DataLoader\nbatch_size from hparams\ncollate_with_scaler"]
Encode["Stage1Autoencoder.encode\nInput → Latent"]
Decode["Stage1Autoencoder.decode\nLatent → Reconstruction"]
Loss["MSE Loss\nReconstruction Error"]
end
subgraph "Output Stage"
Ckpt["Model Checkpoints\nstage1_resume-vN.ckpt"]
TB["TensorBoard Logs\nval_loss_epoch monitoring"]
end
CSV1 --> Walk
Walk --> Parse
Parse --> Store1
Store1 --> Extract
Extract --> GenEmbed
GenEmbed --> Scale
Scale --> PCA
PCA --> Triplet
Triplet --> Stream
Stream --> DataLoader
DataLoader --> Encode
Encode --> Decode
Decode --> Loss
Loss --> Ckpt
Loss --> TB
End-to-End Processing Flow
Sources:
- python/narrative_stack/notebooks/stage1_preprocessing.ipynb:21-68
- python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-556
Processing Stages
- CSV Ingestion: The system ingests CSV files produced by the Rust sec-fetcher using us_gaap_store.ingest_us_gaap_csvs(). python/narrative_stack/notebooks/stage1_preprocessing.ipynb:21-23
- Concept/Unit Pair Extraction: Unique (concept, unit) pairs are extracted to define the semantic space. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:67-68
- Semantic Embedding Generation: Embeddings capture relationships between financial concepts and are compressed via PCA with a 0.95 variance threshold. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:263-269
- Value Normalization: RobustScaler is applied per (concept, unit) pair to handle outliers in financial magnitudes. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:88-96
- Model Training: The Stage1Autoencoder learns a bottleneck representation of the concatenated embedding and scaled value. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486
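The normalization and PCA-threshold steps above can be sketched numerically. The helper names below are illustrative, not the actual narrative_stack API: robust scaling centers on the median and divides by the interquartile range (what sklearn's RobustScaler does per pair), and the variance threshold keeps the smallest number of principal components whose cumulative explained-variance ratio reaches 0.95.

```python
def robust_scale(values):
    """Median/IQR scaling for one (concept, unit) pair's values."""
    s = sorted(values)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # coarse quartiles, fine for a sketch
    iqr = (q3 - q1) or 1.0                # avoid division by zero
    return [(v - median) / iqr for v in values]

def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest k such that the top-k eigenvalues explain >= threshold variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative >= threshold:
            return k
    return len(eigenvalues)
```

Note how the outlier (100) stays large after robust scaling but no longer distorts the center and spread of the other values — the property that matters for financial magnitudes.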
Storage Architecture Integration
The Python system integrates with multiple storage backends to support different access patterns and data requirements.
graph TB
subgraph "Python Application"
App["narrative_stack\nNotebooks & Scripts"]
end
subgraph "Storage Facade"
UsGaapStore["UsGaapStore\nUnified Interface"]
end
subgraph "Backend Interfaces"
DbInterface["DbUsGaap\ndb_config"]
WsInterface["DataStoreWsClient\nsimd_r_drive_server_config"]
end
subgraph "Storage Systems"
MySQL["MySQL Server\nus_gaap_test database"]
WsServer["simd-r-drive\nWebSocket Server\nKey-Value Store"]
FileCache["File System\nCache Storage"]
end
subgraph "Data Types"
Raw["Raw Records\nUsGaapRowRecord"]
Triplets["Triplets\nconcept+unit+scaled_value\n+scaler+embedding"]
Matrix["Embedding Matrix\nnumpy arrays"]
PCAModel["PCA Models\nsklearn objects"]
end
App --> UsGaapStore
UsGaapStore --> DbInterface
UsGaapStore --> WsInterface
DbInterface --> MySQL
WsInterface --> WsServer
WsServer --> FileCache
MySQL -.->|stores| Raw
WsServer -.->|stores| Triplets
WsServer -.->|stores| Matrix
WsServer -.->|stores| PCAModel
Storage Backend Architecture
Configuration and Initialization
The system uses centralized configuration for database connections and WebSocket server endpoints. The .vscode/settings.json file points to the specific Python environment for the stack.
- Python Interpreter: python/narrative_stack/.venv/bin/python3. .vscode/settings.json:5
- Initialization: The UsGaapStore is initialized with a DataStoreWsClient and a DbUsGaap instance using db_config and simd_r_drive_server_config. python/narrative_stack/notebooks/stage1_preprocessing.ipynb:13-40
Training Infrastructure
The training system uses PyTorch Lightning for experiment management.
Training Configuration
Key Training Parameters
| Parameter | Value | Purpose |
|---|---|---|
EPOCHS | 1000 | Maximum training epochs. |
internal_batch_size | 64 | Dataset internal batching size. |
num_workers | 2 | DataLoader worker processes. |
gradient_clip_val | From hparams | Gradient clipping threshold. |
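Gradient clipping rescales the entire gradient vector whenever its global L2 norm exceeds the threshold; a pure-Python sketch of what gradient_clip_val asks PyTorch Lightning to do (Lightning's default algorithm is clipping by norm):

```python
import math

def clip_by_global_norm(gradients, clip_val):
    """Scale gradients down so their global L2 norm is at most clip_val."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= clip_val or norm == 0.0:
        return gradients
    scale = clip_val / norm
    return [g * scale for g in gradients]
```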
Machine Learning Training Pipeline
Relevant source files
Purpose and Scope
This page documents the machine learning training pipeline for the narrative_stack system, specifically the Stage 1 autoencoder that learns latent representations of US GAAP financial concepts. The training pipeline consumes preprocessed concept/unit/value triplets and their semantic embeddings to train a neural network that can encode financial data into a compressed latent space.
The pipeline uses PyTorch Lightning for training orchestration, implements custom iterable datasets for efficient data streaming from the simd-r-drive WebSocket server, and provides comprehensive experiment tracking through TensorBoard.
Training Pipeline Architecture
The training pipeline operates as a streaming system that continuously fetches preprocessed triplets from the UsGaapStore and feeds them through the autoencoder model. The architecture emphasizes memory efficiency and reproducibility by avoiding full-dataset loads into RAM.
graph TB
subgraph "Data Source Layer"
DataStoreWsClient["DataStoreWsClient\n(simd_r_drive_ws_client)"]
UsGaapStore["UsGaapStore\nlookup_by_index()"]
end
subgraph "Dataset Layer"
IterableDataset["IterableConceptValueDataset\ninternal_batch_size=64\nreturn_scaler=True\nshuffle=True/False"]
CollateFunction["collate_with_scaler()\nBatch construction"]
end
subgraph "PyTorch Lightning Training Loop"
DataLoader["DataLoader\nbatch_size from hparams\nnum_workers=2\npin_memory=True\npersistent_workers=True"]
Model["Stage1Autoencoder\nEncoder → Latent → Decoder"]
Optimizer["Adam Optimizer\n+ CosineAnnealingWarmRestarts\nReduceLROnPlateau"]
Trainer["pl.Trainer\nEarlyStopping\nModelCheckpoint\ngradient_clip_val"]
end
subgraph "Monitoring & Persistence"
TensorBoard["TensorBoardLogger\ntrain_loss\nval_loss_epoch\nlearning_rate"]
Checkpoints["Model Checkpoints\n.ckpt files\nsave_top_k=1\nmonitor='val_loss_epoch'"]
end
DataStoreWsClient --> UsGaapStore
UsGaapStore --> IterableDataset
IterableDataset --> CollateFunction
CollateFunction --> DataLoader
DataLoader --> Model
Model --> Optimizer
Optimizer --> Trainer
Trainer --> TensorBoard
Trainer --> Checkpoints
Checkpoints -.->|Resume training| Model
Training Pipeline: Data Flow
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:456-556
Stage1Autoencoder Model
Model Architecture
The Stage1Autoencoder is a fully-connected autoencoder that learns to compress financial concept embeddings combined with their scaled values into a lower-dimensional latent space. The model reconstructs its input, forcing the latent representation to capture the most important features of the US GAAP taxonomy.
graph LR
Input["Input Tensor\n[embedding + scaled_value]\nDimension: embedding_dim + 1"]
Encoder["Encoder Network\nfc1 → dropout → ReLU\nfc2 → dropout → ReLU"]
Latent["Latent Space\nDimension: latent_dim"]
Decoder["Decoder Network\nfc3 → dropout → ReLU\nfc4 → dropout → output"]
Output["Reconstructed Input\nSame dimension as input"]
Loss["MSE Loss\ninput vs output"]
Input --> Encoder
Encoder --> Latent
Latent --> Decoder
Decoder --> Output
Output --> Loss
Input --> Loss
Hyperparameters
The model exposes the following configurable hyperparameters through its hparams attribute:
| Parameter | Description | Typical Value |
|---|---|---|
input_dim | Dimension of input (embedding + 1 for value) | Varies based on embedding size |
latent_dim | Dimension of compressed latent space | 64-128 |
dropout_rate | Dropout probability for regularization | 0.0-0.2 |
lr | Initial learning rate | 1e-5 to 5e-5 |
min_lr | Minimum learning rate for scheduler | 1e-7 to 1e-6 |
batch_size | Training batch size | 32 |
weight_decay | L2 regularization parameter | 1e-8 to 1e-4 |
gradient_clip | Maximum gradient norm | 0.0-1.0 |
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490
Loss Function and Optimization
The model uses Mean Squared Error (MSE) loss between the input and reconstructed output. The optimization strategy combines:
- Adam optimizer with configurable learning rate and weight decay. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-490
- CosineAnnealingWarmRestarts scheduler for cyclical learning rate annealing.
- ReduceLROnPlateau for adaptive learning rate reduction when validation loss plateaus.
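The reconstruction objective itself is plain mean squared error; a minimal pure-Python stand-in for torch.nn.functional.mse_loss over a batch of vectors:

```python
def mse_loss(inputs, reconstructions):
    """Mean squared error between input vectors and their reconstructions,
    averaged over every element in the batch."""
    total, count = 0.0, 0
    for x, x_hat in zip(inputs, reconstructions):
        for a, b in zip(x, x_hat):
            total += (a - b) ** 2
            count += 1
    return total / count
```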
Dataset and Data Loading
IterableConceptValueDataset
The IterableConceptValueDataset is a custom PyTorch IterableDataset that streams training data from the UsGaapStore without loading the entire dataset into memory.
Key characteristics:
graph TB
subgraph "Dataset Initialization"
Config["simd_r_drive_server_config\nhost + port"]
Params["Dataset Parameters\ninternal_batch_size\nreturn_scaler\nshuffle"]
end
subgraph "Data Streaming Process"
Store["UsGaapStore instance\nget_triplet_count()"]
IndexGen["Index Generator\nSequential or shuffled\nbased on shuffle param"]
BatchFetch["Internal Batching\nFetch internal_batch_size items\nvia batch_lookup_by_indices()"]
Unpack["Unpack Triplet Data\nembedding\nscaled_value\nscaler (optional)"]
end
subgraph "Output"
Tensor["PyTorch Tensors\nx: [embedding + scaled_value]\ny: [embedding + scaled_value]\nscaler: RobustScaler object"]
end
Config --> Store
Params --> IndexGen
Store --> IndexGen
IndexGen --> BatchFetch
BatchFetch --> Unpack
Unpack --> Tensor
- Iterable streaming: Data is fetched on demand during iteration. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-522
- Internal batching: Fetches internal_batch_size items (typically 64) at once from the WebSocket server to reduce network overhead. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176
- Optional shuffling: Randomizes index order for training, or maintains sequential order for validation.
DataLoader and Collation
The collate_with_scaler function handles batch construction when the dataset returns triplets (x, y, scaler). It stacks the tensors into batches while preserving the scaler objects in a list. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:506-507
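A minimal stand-in for this collate function (using plain lists in place of torch tensors; the real version would call torch.stack on the x and y halves):

```python
def collate_with_scaler(batch):
    """Stack (x, y) samples into batch lists while carrying scalers alongside,
    since scaler objects cannot be stacked into a tensor."""
    xs, ys, scalers = zip(*batch)
    return list(xs), list(ys), list(scalers)
```

Keeping the scalers in a parallel list lets downstream code invert the per-pair normalization after decoding.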
| Parameter | Value | Purpose |
|---|---|---|
batch_size | model.hparams.batch_size | Outer batch size for model training. |
num_workers | 2 | Parallel data loading processes. |
pin_memory | True | Faster host-to-GPU transfers. |
persistent_workers | True | Keeps worker processes alive between epochs. |
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:175-176 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:502-522
Training Configuration
PyTorch Lightning Trainer Setup
The training pipeline uses PyTorch Lightning’s Trainer class to orchestrate the training loop, validation, and callbacks. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:468-548
Callbacks and Persistence
- EarlyStopping: Monitors val_loss_epoch and stops training if no improvement occurs for 20 consecutive epochs. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:528-530
- ModelCheckpoint: Saves the best model weights, based on validation loss, to the OUTPUT_PATH. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:532-539
- TensorBoardLogger: Automatically logs train_loss, val_loss_epoch, and learning_rate. python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:543
Sources: python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:528-548
Checkpointing and Resuming Training
The pipeline supports resuming training from a .ckpt file. This is handled by passing ckpt_path to trainer.fit(). python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:555
Alternatively, a model can be loaded for fine-tuning with modified hyperparameters:
python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb:479-486
Integration with Rust Caches
While the training occurs in Python, the underlying data is often derived from the Rust preprocessor. The Caches struct in the Rust application manages preprocessor_cache.bin and http_storage_cache.bin src/caches.rs:11-14, which provide the raw data that the Python UsGaapStore eventually consumes. The Caches::open function src/caches.rs:29-51 ensures these data stores are correctly initialized on disk before the training pipeline attempts to access them via the WebSocket bridge.
Sources: src/caches.rs:11-60 python/narrative_stack/notebooks/old.stage1_training_(no_pre_dedupe).ipynb456-556
US GAAP Distribution Analyzer
Relevant source files
The us-gaap-dist-analyzer is a specialized Rust sub-crate designed to perform unsupervised clustering and distribution analysis of US GAAP (Generally Accepted Accounting Principles) financial concepts. It bridges the gap between raw SEC filing data and semantic financial modeling by grouping taxonomical labels based on their contextual and mathematical distributions.
Purpose and Scope
The analyzer processes large-scale financial datasets to identify patterns in how public companies report specific financial metrics. By utilizing BERT embeddings for semantic understanding and K-Means clustering for statistical grouping, the tool allows researchers to visualize the high-dimensional space of US GAAP concepts and identify synonymous or related reporting items that may not share identical taxonomical names.
Technical Architecture
The system follows a linear pipeline that transforms raw US GAAP column names and their associated values into a clustered spatial representation.
Data Flow and Pipeline
The transformation process moves from high-dimensional natural language space to a reduced numerical space for efficient clustering.
- Embedding Generation : Converts US GAAP concept strings into vector representations using a BERT-based transformer model.
- Normalization : Applies scaling to ensure that concept values (magnitudes) do not disproportionately bias the clustering.
- Dimensionality Reduction : Uses Principal Component Analysis (PCA) to project high-dimensional embeddings into a lower-dimensional space while preserving variance.
- Clustering: Executes K-Means to group concepts into k distinct clusters.
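The final step can be sketched with a tiny one-dimensional K-Means (Lloyd's algorithm) over reduced embedding coordinates; the real crate uses linfa-clustering on multi-dimensional vectors, so this is an illustrative simplification:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two clearly separated groups of reduced concept coordinates.
centers = kmeans_1d([0.1, 0.2, 0.15, 5.0, 5.2, 4.9], centroids=[0.0, 1.0])
```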
System Components Diagram
The following diagram illustrates the relationship between the logical analysis steps and the underlying implementation components.
US GAAP Analysis Pipeline
graph TD
subgraph "Natural Language Space"
A["US GAAP Concept Names"]
B["distill_us_gaap_fundamental_concepts"]
end
subgraph "Vector Space (Code Entities)"
C["BERT Embedding Engine"]
D["PCA (Principal Component Analysis)"]
E["K-Means Clusterer"]
end
subgraph "Output & Analysis"
F["Cluster Centroids"]
G["Concept Distribution Maps"]
end
A --> B
B -- "Normalized Strings" --> C
C -- "High-Dim Vectors" --> D
D -- "Reduced Vectors" --> E
E --> F
E --> G
Sources: us-gaap-dist-analyzer/Cargo.lock:1-217 us-gaap-dist-analyzer/Cargo.lock:198-217
Implementation Details
Dependency Stack
The analyzer relies on several heavy-duty mathematical and machine learning libraries to perform its operations.
| Component | Library / Crate | Purpose |
|---|---|---|
| Embeddings | rust-bert / tch | Loading and executing transformer models for semantic encoding. |
| Linear Algebra | ndarray / nalgebra | Matrix operations for PCA and distance calculations. |
| Clustering | linfa-clustering | Implementation of the K-Means algorithm. |
| Data Handling | polars | High-performance DataFrame operations for managing large SEC datasets. |
| Caching | cached-path | Managing local storage of model weights and pre-computed embeddings. |
Sources: us-gaap-dist-analyzer/Cargo.lock:198-201 us-gaap-dist-analyzer/Cargo.lock:53-61
Key Logic Flow
The analyzer’s execution logic is centered around the transition from raw SEC data to categorized clusters.
Clustering Logic Flow
sequenceDiagram
participant Data as "SEC Data (Polars DataFrame)"
participant BERT as "BERT Encoder"
participant DimRed as "PCA Transformer"
participant Cluster as "K-Means Engine"
Data->>BERT: Extract Concept Labels
BERT->>BERT: Generate 768-dim Embeddings
BERT->>DimRed: Pass Embedding Matrix
DimRed->>DimRed: Compute Covariance & Eigenvectors
DimRed->>Cluster: Reduced Matrix (n_components)
Cluster->>Cluster: Iterative Centroid Assignment
Cluster-->>Data: Append 'cluster_id' to Labels
Sources: us-gaap-dist-analyzer/Cargo.lock:44-50 us-gaap-dist-analyzer/Cargo.lock:151-157
Data Integration
The us-gaap-dist-analyzer works in tandem with the narrative_stack Python components and the core Rust sec-fetcher library.
Relationship to Fundamental Concepts
While the Rust core defines a strict taxonomy in the FundamentalConcept enum [3.4 US GAAP Concept Transformation], this analyzer is used to discover new relationships or validate the existing taxonomy by observing how concepts are actually used in the wild.
- Input: The analyzer typically consumes the output of pull-us-gaap-bulk or the UsGaapStore.
- Processing: It clusters concepts like CashAndCashEquivalentsAtCarryingValue and CashAndCashEquivalentsPeriodIncreaseDecrease to see if they consistently appear in the same reporting “neighborhood.”
- Validation: Results are used to refine the mapping patterns in distill_us_gaap_fundamental_concepts.
Performance Considerations
- Memory Usage: Because it handles large embedding matrices, the crate utilizes ndarray for memory-efficient slicing and tch (LibTorch) for GPU-accelerated tensor operations when available.
- Persistence: Clustering models and PCA projections are often serialized to disk using serde, allowing incremental analysis of new filing batches without re-training the entire distribution map.
Sources: us-gaap-dist-analyzer/Cargo.lock:209-211 us-gaap-dist-analyzer/Cargo.lock:165-175
Development Guide
Relevant source files
Purpose and Scope
This guide provides an overview of development practices, code organization, and workflows for contributing to the rust-sec-fetcher project. It covers environment setup, code organization principles, development workflows, and common development tasks.
For detailed information about specific development topics, see:
- Testing strategies and test fixtures: Testing Strategy
- Continuous integration and automated testing: CI/CD Pipeline
- Docker container configuration: Docker Deployment
Development Environment Setup
Prerequisites
The project requires the following tools installed:
| Tool | Purpose | Version Requirement |
|---|---|---|
| Rust | Core application development | 1.87+ |
| Python | ML pipeline and preprocessing | 3.8+ |
| Docker | Integration testing and services | Latest stable |
| Git LFS | Large file support for test assets | Latest stable |
| MySQL | Database for US GAAP storage | 5.7+ or 8.0+ |
Rust Development Setup
- Clone the repository and navigate to the root directory.
- Build the Rust application with cargo build.
- Run cargo test to verify the setup.
The Rust workspace is configured with necessary dependencies. Key development dependencies include:
- mockito for HTTP mocking in tests.
- tempfile for temporary file/directory creation in tests.
- tokio test macros for async test support.
Python Development Setup
- Create a virtual environment.
- Install dependencies using uv.
- Verify the installation by running the integration tests (requires Docker).
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Configuration Setup
The application requires a configuration file at ~/.config/sec-fetcher/config.toml or a custom path. Minimum configuration:
For non-interactive testing, use AppConfig directly in test code as shown in tests/config_manager_tests.rs:36-57
Sources: tests/config_manager_tests.rs:36-57 tests/sec_client_tests.rs:8-20
Code Organization and Architecture
Repository Structure
Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159 python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Module Dependency Flow
The dependency flow follows a layered architecture:
- Configuration Layer: ConfigManager loads settings from TOML files and credentials.
- Network Layer: SecClient wraps the HTTP client with caching and throttling middleware.
- Data Fetching Layer: Network module functions fetch raw data from SEC APIs via Url variants. src/enums/url_enum.rs:5-116
- Transformation Layer: Transformers normalize raw data into standardized concepts.
- Model Layer: Data structures represent domain entities.
Sources: src/network/sec_client.rs:1-181 tests/config_manager_tests.rs:1-95 src/enums/url_enum.rs:5-116
Development Workflow
Standard Development Cycle
Sources: python/narrative_stack/us_gaap_store_integration_test.sh:1-39
Running Tests Locally
Rust Unit Tests
Run all Rust tests with cargo test.
Run a specific test module with cargo test --test <module_name>.
Test Structure Mapping:
| Test File | Tests Component | Key Test Functions |
|---|---|---|
tests/config_manager_tests.rs | ConfigManager | test_load_custom_config, test_load_non_existent_config |
tests/sec_client_tests.rs | SecClient | test_user_agent, test_fetch_json_without_retry_success |
Sources: tests/config_manager_tests.rs:1-95 tests/sec_client_tests.rs:1-159
Python Integration Tests
Integration tests require Docker services. Run via the provided shell script:
This script performs the following steps as defined in python/narrative_stack/us_gaap_store_integration_test.sh:1-39:
- Activates the Python virtual environment.
- Installs dependencies with uv.
- Starts Docker Compose services.
- Waits for MySQL availability.
- Creates the us_gaap_test database and loads the schema.
- Runs pytest integration tests.
- Tears down containers on exit.
Writing Tests
Unit Test Pattern (Rust)
The codebase uses mockito for HTTP mocking:
Sources: tests/sec_client_tests.rs:35-62
Test Fixture Pattern
The codebase uses temporary directories for file-based tests via tempfile::tempdir() as shown in tests/config_manager_tests.rs:8-17
Sources: tests/config_manager_tests.rs:8-17
Common Development Tasks
Adding a New SEC Data Endpoint
To add support for a new SEC data endpoint:
- Add a URL enum variant in src/enums/url_enum.rs. src/enums/url_enum.rs:5-116
- Update Url::value() to return the formatted URL string. src/enums/url_enum.rs:121-165
- Create a fetch function in src/network/ using the new Url variant.
- Define data models in src/models/ for the response structure.
Sources: src/enums/url_enum.rs:5-165
Modifying HTTP Client Behavior
The SecClient is configured in src/network/sec_client.rs:21-89 Key configuration points:
| Configuration | Location | Purpose |
|---|---|---|
CachePolicy | src/network/sec_client.rs:45-50 | Controls cache TTL and behavior |
ThrottlePolicy | src/network/sec_client.rs:53-59 | Controls rate limiting and retries |
| User-Agent | src/network/sec_client.rs:91-108 | Constructs SEC-compliant User-Agent header |
Sources: src/network/sec_client.rs:21-108
Code Quality Standards
CI/CD and Maintenance
The project uses GitHub Actions for automated quality checks:
- Linting: rust-lint.yml runs clippy and rustfmt.
- Testing: rust-tests.yml runs the test suite.
- Documentation: build-docs.yml generates documentation weekly. build-docs.yml:1-81
- Dependency Updates: Dependabot is configured for weekly Cargo updates. dependabot.yml:1-10
Sources: .github/workflows/build-docs.yml:1-81 .github/dependabot.yml:1-10
Testing Strategy
Relevant source files
- src/bin/refresh_test_fixtures.rs
- src/network/fetch_cik_submissions.rs
- src/parsers/parse_us_gaap_fundamentals.rs
- src/utils.rs
- tests/cik_submissions_tests.rs
- tests/edgar_feed_tests.rs
- tests/parse_13f_xml_tests.rs
- tests/parse_nport_xml_tests.rs
- tests/sec_client_tests.rs
- tests/thirteenf_era_fixture_tests.rs
- tests/us_gaap_parser_accuracy_tests.rs
This page documents the testing approach for the rust-sec-fetcher codebase. The strategy emphasizes high-fidelity parsing verification using compressed EDGAR fixtures, HTTP layer isolation via mockito, and exhaustive mapping tests for US GAAP concept normalization.
Test Architecture Overview
The testing infrastructure is designed to handle the “dirty” nature of SEC EDGAR data (e.g., changing scale conventions in 13F filings or inconsistent XBRL tagging) by using real-world data snapshots as the primary source of truth.
graph TB
subgraph "Rust Unit & Integration Tests"
SecClientTest["sec_client_tests.rs\nTest SecClient HTTP & Retries"]
CikSubTest["cik_submissions_tests.rs\nTest Submissions JSON Parsing"]
ThirteenFTest["parse_13f_xml_tests.rs\nTest 13F Scaling & Sums"]
EraTest["thirteenf_era_fixture_tests.rs\nTest 2023 Schema Crossover"]
UsGaapAccTest["us_gaap_parser_accuracy_tests.rs\nVerify DF against JSON Source"]
end
subgraph "Test Infrastructure & Fixtures"
MockitoServer["mockito::Server\nHTTP mock server"]
FixtureFiles["tests/fixtures/*.gz\nCompressed SEC Snapshots"]
RefreshBin["refresh_test_fixtures.rs\nUtility to update snapshots"]
end
SecClientTest --> MockitoServer
CikSubTest --> FixtureFiles
ThirteenFTest --> FixtureFiles
EraTest --> FixtureFiles
UsGaapAccTest --> FixtureFiles
RefreshBin --> FixtureFiles
Test Component Relationships
Sources: tests/sec_client_tests.rs:1-132 tests/cik_submissions_tests.rs:1-39 tests/thirteenf_era_fixture_tests.rs:1-30 src/bin/refresh_test_fixtures.rs:1-30
Fixture-Based Testing Strategy
The project relies on a “Fixture-First” approach for data parsers. Instead of mocking complex nested JSON/XML structures by hand, the refresh_test_fixtures binary src/bin/refresh_test_fixtures.rs:1-173 downloads real filings from EDGAR and stores them as Gzip-compressed files in tests/fixtures/.
graph LR
subgraph "Development Space"
RefreshBin["bin/refresh_test_fixtures.rs"]
FixturesDir["tests/fixtures/"]
end
subgraph "Code Entity Space"
LoadFixture["load_fixture()"]
GzDecoder["flate2::read::GzDecoder"]
Parser["parse_cik_submissions_json()"]
end
RefreshBin -- "Fetch & Compress" --> FixturesDir
FixturesDir -- "Read .gz" --> LoadFixture
LoadFixture -- "Decompress" --> GzDecoder
GzDecoder -- "Stream JSON/XML" --> Parser
The Fixture Lifecycle
Sources: src/bin/refresh_test_fixtures.rs:31-173 tests/cik_submissions_tests.rs:16-30 tests/us_gaap_parser_accuracy_tests.rs:9-23
Parser Accuracy Verification
The us_gaap_parser_accuracy_tests.rs implements deep-validation logic that ensures the Polars DataFrame produced by parse_us_gaap_fundamentals src/parsers/parse_us_gaap_fundamentals.rs:41-103 is 100% traceable to the source JSON.
- Deduplication Logic: Replicates the “Last-in Wins” rule, where later amendments (e.g., 10-K/A) overwrite earlier filings for the same fiscal period src/parsers/parse_us_gaap_fundamentals.rs:27-40
- Validation: For every cell in the DataFrame, the test searches the source JSON facts object tests/us_gaap_parser_accuracy_tests.rs:75-82, applies the same fiscal year (FY) derivation logic tests/us_gaap_parser_accuracy_tests.rs:101-114, and asserts the values match within a small epsilon tests/us_gaap_parser_accuracy_tests.rs:148-150
Sources: tests/us_gaap_parser_accuracy_tests.rs:31-160 src/parsers/parse_us_gaap_fundamentals.rs:25-103
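The two checks above can be sketched as follows. The key layout, helper names, and epsilon value are illustrative, not copied from the test file.

```rust
use std::collections::HashMap;

// "Last-in Wins": a later amendment (e.g., 10-K/A) for the same
// (concept, fiscal period) key overwrites the earlier filing's fact.
fn dedupe_facts(facts: &[(&str, &str, f64)]) -> HashMap<(String, String), f64> {
    let mut latest = HashMap::new();
    for (concept, period, value) in facts {
        latest.insert((concept.to_string(), period.to_string()), *value);
    }
    latest
}

// Values are compared within a small epsilon rather than exact equality.
fn approx_eq(a: f64, b: f64, epsilon: f64) -> bool {
    (a - b).abs() <= epsilon
}

fn main() {
    let facts = [
        ("NetIncomeLoss", "FY2024", 100.0), // original 10-K
        ("NetIncomeLoss", "FY2024", 105.0), // later 10-K/A amendment wins
    ];
    let deduped = dedupe_facts(&facts);
    let key = ("NetIncomeLoss".to_string(), "FY2024".to_string());
    assert!(approx_eq(deduped[&key], 105.0, 1e-6));
}
```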
Specific Domain Testing
13F Era Crossover Testing
The SEC changed the <value> field in 13F-HR filings from “thousands of USD” to “actual USD” on 2023-01-01. The thirteenf_era_fixture_tests.rs uses three specific Berkshire Hathaway (BRK-B) fixtures to verify the normalization logic tests/thirteenf_era_fixture_tests.rs:1-12
| Fixture | Filing Date | Expected Scaling | Sanity Check |
|---|---|---|---|
| BRK_B_13f_ancient.xml | 2022-11-14 | Value * 1,000 | AAPL Price ~$138/sh tests/thirteenf_era_fixture_tests.rs:92-106 |
| BRK_B_13f_transition.xml | 2023-02-14 | Value * 1 | AAPL Price ~$130/sh tests/thirteenf_era_fixture_tests.rs:148-162 |
| BRK_B_13f_modern.xml | 2026-02-17 | Value * 1 | Modern Schema tests/thirteenf_era_fixture_tests.rs:12 |
Sources: tests/thirteenf_era_fixture_tests.rs:1-162 src/bin/refresh_test_fixtures.rs:147-172
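The cutover rule described above can be sketched as a single date comparison. The function name is illustrative; ISO-8601 date strings compare correctly as plain strings, so no date library is needed for the sketch.

```rust
// Before 2023-01-01, the 13F <value> field is in thousands of USD;
// from 2023 onward it is actual USD.
const ERA_CUTOVER: &str = "2023-01-01";

fn normalize_13f_value(filing_date: &str, raw_value: f64) -> f64 {
    if filing_date < ERA_CUTOVER {
        raw_value * 1_000.0 // "ancient" era: reported in thousands
    } else {
        raw_value // transition/modern era: already actual USD
    }
}

fn main() {
    // 2022 filing: 1,234 reported units == $1,234,000.
    assert_eq!(normalize_13f_value("2022-11-14", 1_234.0), 1_234_000.0);
    // 2023 filing: value is taken as-is.
    assert_eq!(normalize_13f_value("2023-02-14", 1_234.0), 1_234.0);
}
```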
EDGAR Atom Feed Testing
The edgar_feed_tests.rs validates the polling logic for the live EDGAR “Latest Filings” feed.
- Delta Filtering: Tests the take_while logic that stops fetching once it reaches the “high-water mark” (the timestamp of the last processed filing) tests/edgar_feed_tests.rs:43-51
- High-Water Mark: Ensures the FeedDelta correctly identifies the newest entry’s timestamp as the next mark tests/edgar_feed_tests.rs:179-194
Sources: tests/edgar_feed_tests.rs:11-194
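The delta-filtering idea can be sketched with a plain iterator. Entries arrive newest-first from the Atom feed; timestamps are modeled as i64 seconds, and the feed_delta signature is illustrative (it mirrors FeedDelta only in spirit).

```rust
// Returns the entries newer than the high-water mark, plus the next mark.
fn feed_delta(entries_newest_first: &[i64], high_water_mark: i64) -> (Vec<i64>, Option<i64>) {
    let fresh: Vec<i64> = entries_newest_first
        .iter()
        .copied()
        .take_while(|&ts| ts > high_water_mark) // stop at or before the mark
        .collect();
    // The newest entry (if any) becomes the next high-water mark.
    let next_mark = fresh.first().copied();
    (fresh, next_mark)
}

fn main() {
    let entries = [400, 300, 200, 100]; // newest first
    let (fresh, next_mark) = feed_delta(&entries, 200);
    assert_eq!(fresh, vec![400, 300]); // entries at or before the mark are excluded
    assert_eq!(next_mark, Some(400));
}
```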
Network & Client Testing
The SecClient is tested using mockito to simulate SEC server responses, ensuring the crate respects the SEC’s strict User-Agent and rate-limiting requirements.
sequenceDiagram
participant Test as test_fetch_json_with_retry_failure
participant Client as SecClient
participant Mock as mockito::Server
Test->>Mock: Expect(3) calls to /path
Test->>Client: fetch_json(mock_url)
Client->>Mock: Attempt 1
Mock-->>Client: 500 Internal Server Error
Note over Client: Backoff Delay
Client->>Mock: Attempt 2 (Retry)
Mock-->>Client: 500 Internal Server Error
Client->>Mock: Attempt 3 (Final Retry)
Mock-->>Client: 500 Internal Server Error
Client-->>Test: Err("Max retries exceeded")
Test->>Test: assert!(result.is_err())
HTTP Mocking Pattern
Sources: tests/sec_client_tests.rs:194-222
User-Agent Compliance
The SEC requires a User-Agent in the format AppName/Version (+Email). Tests verify that SecClient correctly falls back to crate defaults or uses custom overrides provided in AppConfig tests/sec_client_tests.rs:7-82
- Default: sec-fetcher/0.1.0 (+test@example.com) tests/sec_client_tests.rs:16-22
- Custom: my-custom-app/2.0.0 (+test@example.com) tests/sec_client_tests.rs:58-61
Sources: tests/sec_client_tests.rs:7-82
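The required AppName/Version (+Email) shape reduces to a single format string; the helper below is an illustration, while the real construction lives in src/network/sec_client.rs:91-108.

```rust
// SEC-compliant User-Agent: "AppName/Version (+Email)".
fn user_agent(app_name: &str, version: &str, email: &str) -> String {
    format!("{app_name}/{version} (+{email})")
}

fn main() {
    // Matches the default documented above.
    assert_eq!(
        user_agent("sec-fetcher", "0.1.0", "test@example.com"),
        "sec-fetcher/0.1.0 (+test@example.com)"
    );
}
```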
Summary of Test Modules
| File | Subsystem Tested | Key Functions |
|---|---|---|
| sec_client_tests.rs | Network / Auth | test_fetch_json_with_retry_backoff, test_user_agent |
| cik_submissions_tests.rs | Submissions Parser | parse_cik_submissions_json, items_split_correctly_on_comma |
| parse_13f_xml_tests.rs | 13F Holdings | weight_pct_is_on_0_to_100_scale, weights_sum_to_100 |
| parse_nport_xml_tests.rs | N-PORT Holdings | pct_val_is_on_0_to_100_scale_for_tiny_position |
| edgar_feed_tests.rs | Polling / Feed | parse_edgar_atom_feed, delta_filter_excludes_entries_at_or_before_mark |
| us_gaap_parser_accuracy_tests.rs | XBRL Financials | validate_dataframe_against_json |
Sources: tests/sec_client_tests.rs:1-132 tests/cik_submissions_tests.rs:1-176 tests/parse_13f_xml_tests.rs:1-187 tests/parse_nport_xml_tests.rs:1-177 tests/edgar_feed_tests.rs:1-195 tests/us_gaap_parser_accuracy_tests.rs:1-160
CI/CD Pipeline
Purpose and Scope
This document explains the continuous integration and continuous deployment (CI/CD) infrastructure for the rust-sec-fetcher repository. It covers the GitHub Actions workflow configuration, integration test automation, documentation deployment, and dependency management.
The CI/CD architecture is split between Rust-specific validation (linting, testing) and the Python narrative_stack system’s integration testing. For general testing strategies including Rust unit tests and Python test fixtures, see [Testing Strategy](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Testing Strategy)
GitHub Actions Workflows
The repository implements several GitHub Actions workflows to ensure code quality and system reliability across the dual Rust/Python architecture.
1. US GAAP Store Integration Test
This workflow validates the Python machine learning pipeline’s integration with external dependencies. It is triggered by changes to the python/narrative_stack/ directory [.github/workflows/us-gaap-store-integration-test.yml:3-11].
Sources: [.github/workflows/us-gaap-store-integration-test.yml:3-11]
graph TB
PushTrigger["Push Event"]
PRTrigger["Pull Request Event"]
PathCheck{"Changed paths include:\npython/narrative_stack/**\nor workflow file itself?"}
WorkflowRun["Execute us-gaap-store-integration-test.yml"]
Skip["Skip workflow execution"]
PushTrigger --> PathCheck
PRTrigger --> PathCheck
PathCheck -->|Yes| WorkflowRun
PathCheck -->|No| Skip
2. Build and Deploy Documentation
This workflow automates the generation of the project’s documentation using deepwiki-to-mdbook. It runs weekly or on manual dispatch [.github/workflows/build-docs.yml:4-7].
| Step | Implementation | Purpose |
|---|---|---|
| Resolve Metadata | Shell script | Determines repo name and book title [.github/workflows/build-docs.yml:25-52] |
| Generate Docs | jzombie/deepwiki-to-mdbook@main | Converts wiki content to mdBook format [.github/workflows/build-docs.yml:59-64] |
| Deploy | actions/deploy-pages@v4 | Publishes to GitHub Pages [.github/workflows/build-docs.yml:78-80] |
Sources: [.github/workflows/build-docs.yml:1-81]
Integration Test Job Structure
The us-gaap-store-integration-test workflow defines a single job named integration-test that executes on ubuntu-latest [.github/workflows/us-gaap-store-integration-test.yml:12-15].
Sources: [.github/workflows/us-gaap-store-integration-test.yml:17-50]
graph TB
Start["Job: integration-test"]
Checkout["Step 1: Checkout repo\nactions/checkout@v4\nwith lfs: true"]
SetupPython["Step 2: Set up Python\nactions/setup-python@v5\npython-version: 3.12"]
InstallUV["Step 3: Install uv\ncurl astral.sh/uv/install.sh"]
InstallDeps["Step 4: Install Python dependencies\nuv pip install -e . --group dev"]
Ruff["Step 5: Check style with Ruff\nruff check ."]
RunTest["Step 6: Run integration test\n./us_gaap_store_integration_test.sh"]
Start --> Checkout
Checkout --> SetupPython
SetupPython --> InstallUV
InstallUV --> InstallDeps
InstallDeps --> Ruff
Ruff --> RunTest
Integration Test Architecture
The integration test orchestrates multiple Docker containers to create an isolated environment for validating the narrative_stack data flow.
Container & Entity Mapping
This diagram maps the CI orchestration to specific code entities and external services.
Sources: [python/narrative_stack/us_gaap_store_integration_test.sh:1-39], [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34]
graph TB
subgraph "Docker Compose Project: us_gaap_it"
MySQL["Container: us_gaap_test_db\n(MySQL)"]
SimdRDrive["Container: simd_r_drive_ws_server_test\n(WebSocket Server)"]
TestRunner["Test Runner\npytest process"]
end
Schema["SQL Schema\ntests/integration/assets/us_gaap_schema_2025.sql"]
PyTestFile["tests/integration/test_us_gaap_store.py"]
TestRunner -->|Executes| PyTestFile
PyTestFile -->|SQL queries| MySQL
PyTestFile -->|WS connection| SimdRDrive
Schema -->|Loaded via mysql CLI| MySQL
Test Execution Flow
The integration test script [python/narrative_stack/us_gaap_store_integration_test.sh:1-39] manages the container lifecycle.
Sources: [python/narrative_stack/us_gaap_store_integration_test.sh:1-39]
graph TB
Start["Start: us_gaap_store_integration_test.sh"]
SetVars["Set variables\nPROJECT_NAME=us_gaap_it"]
RegisterTrap["Register cleanup trap\ntrap 'cleanup' EXIT"]
DockerUp["Start Docker containers\ndocker compose up -d --profile test"]
WaitMySQL["Wait for MySQL ready\nmysqladmin ping loop"]
LoadSchema["Load schema\nmysql < us_gaap_schema_2025.sql"]
RunPytest["Execute pytest\npytest tests/integration/test_us_gaap_store.py"]
Cleanup["Cleanup function\ndocker compose down --volumes"]
Start --> SetVars
SetVars --> RegisterTrap
RegisterTrap --> DockerUp
DockerUp --> WaitMySQL
WaitMySQL --> LoadSchema
LoadSchema --> RunPytest
RunPytest --> Cleanup
Docker Container Configuration
simd-r-drive-ws-server Container Build
The Dockerfile [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:1-34] creates a single-stage image for the CI server. It installs the simd-r-drive-ws-server crate version 0.10.0-alpha [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:12].
Sources: [python/narrative_stack/Dockerfile.simd-r-drive-ci-server:18-33]
graph LR
BuildTime["Build Time\n--build-arg SERVER_ARGS"]
BakeArgs["ENV SERVER_ARGS"]
Entrypoint["ENTRYPOINT interpolates\n$SERVER_ARGS + $@"]
ServerExec["Execute:\nsimd-r-drive-ws-server"]
BuildTime --> BakeArgs
BakeArgs --> Entrypoint
Entrypoint --> ServerExec
Dependency Management
The project uses Dependabot to maintain up-to-date dependencies for the Rust components.
| Ecosystem | Directory | Schedule |
|---|---|---|
| cargo | / | Weekly [.github/dependabot.yml:6-9] |
Sources: [.github/dependabot.yml:1-10]
Environment Configuration
Python Environment
The CI pipeline uses uv for fast, reproducible environment setup [.github/workflows/us-gaap-store-integration-test.yml:27-37].
- Python Version: 3.12
- Installation: uv pip install -e . --group dev
Project Isolation
To prevent resource conflicts, the integration test uses a specific Docker Compose project name: us_gaap_it [python/narrative_stack/us_gaap_store_integration_test.sh:7]. This ensures that networks and volumes are isolated from other development or CI tasks.
Sources: [python/narrative_stack/us_gaap_store_integration_test.sh:7-9]
Dependencies & Technology Stack
This page provides a comprehensive overview of all external dependencies used in the rust-sec-fetcher codebase, covering both the Rust sec-fetcher application and the Python narrative_stack ML system.
Overview
The system employs a dual-language architecture with distinct but complementary technology stacks. The Rust layer prioritizes high-performance I/O operations, concurrent data fetching, and reliable HTTP caching. The Python layer focuses on scientific computing, machine learning model training, and numerical data processing. Both layers share common infrastructure through the simd-r-drive data storage system and file-based CSV interchange.
Sources: Cargo.toml:1-82 Cargo.lock:1-100
Rust Technology Stack
Core Direct Dependencies
The Rust application declares its direct dependencies in the manifest, each serving specific architectural roles:
Dependency Categories and Usage:
graph TB
subgraph "Async Runtime & Concurrency"
tokio["tokio 1.50.0\nFull async runtime"]
rayon["rayon 1.11.0\nData parallelism"]
dashmap["dashmap 6.1.0\nConcurrent hashmap"]
end
subgraph "HTTP & Network"
reqwest["reqwest 0.13.2\nHTTP client"]
reqwest_drive["reqwest-drive 0.13.2-alpha\nDrive middleware"]
end
subgraph "Data Processing"
polars["polars 0.46.0\nDataFrame operations"]
csv_crate["csv 1.4.0\nCSV parsing"]
serde["serde 1.0.228\nSerialization"]
serde_json["serde_json 1.0.149\nJSON support"]
quick_xml["quick-xml 0.39.2\nXML parsing"]
end
subgraph "Storage & Caching"
simd_r_drive["simd-r-drive 0.15.5-alpha\nKey-value store"]
simd_r_drive_ext["simd-r-drive-extensions\n0.15.5-alpha"]
end
subgraph "Configuration & Validation"
config_crate["config 0.15.9\nConfig management"]
keyring["keyring 3.6.2\nCredential storage"]
email_address["email_address 0.2.9\nEmail validation"]
rust_decimal["rust_decimal 1.40.0\nDecimal numbers"]
chrono["chrono 0.4.44\nDate/time handling"]
end
subgraph "Development Tools"
mockito["mockito 1.7.0\nHTTP mocking"]
tempfile["tempfile 3.27.0\nTemp file creation"]
end
| Category | Crates | Primary Use Cases |
|---|---|---|
| Async Runtime | tokio | Event loop, async I/O, task scheduling Cargo.toml:79 |
| HTTP Stack | reqwest, reqwest-drive | SEC API communication, middleware integration Cargo.toml:66-67 |
| Data Frames | polars | Large-scale data transformation, CSV/JSON processing Cargo.toml:61 |
| Serialization | serde, serde_json, serde_with | Data structure serialization, API response parsing Cargo.toml:71-73 |
| Concurrency | rayon, dashmap | Parallel processing, concurrent data structures Cargo.toml:50-64 |
| Storage | simd-r-drive, simd-r-drive-extensions | HTTP cache, preprocessor cache, persistent storage Cargo.toml:74-75 |
| Configuration | config, keyring | TOML config loading, secure credential management Cargo.toml:48-59 |
| Validation | email_address, rust_decimal, chrono | Input validation, financial precision, timestamps Cargo.toml:46-68 |
| Utilities | itertools, indexmap, bytes | Iterator extensions, ordered maps, byte manipulation Cargo.toml:45-58 |
| Testing | mockito, tempfile | HTTP mock servers, temporary test files Cargo.toml:78-85 |
Sources: Cargo.toml:44-86
Storage and Caching System
The caching layer utilizes simd-r-drive to persist HTTP responses and preprocessed results.
Cache Architecture:
graph TB
subgraph "simd-r-drive Ecosystem"
simd_r_drive["simd-r-drive 0.15.5-alpha\nCore key-value store"]
DataStore["DataStore\n(simd_r_drive::DataStore)"]
simd_r_drive --> DataStore
end
subgraph "Cache Management"
Caches["Caches struct\n(src/caches.rs)"]
http_cache["http_cache: Arc<DataStore>"]
pre_cache["preprocessor_cache: Arc<DataStore>"]
Caches --> http_cache
Caches --> pre_cache
end
subgraph "On-Disk Files"
f1["http_storage_cache.bin"]
f2["preprocessor_cache.bin"]
http_cache -.-> f1
pre_cache -.-> f2
end
- Isolation: The Caches struct manages two distinct DataStore instances src/caches.rs:11-14
- Initialization: The Caches::open function ensures the base directory exists and opens the .bin storage files src/caches.rs:29-51
- Thread Safety: Access to stores is provided via Arc<DataStore> clones src/caches.rs:53-59
Sources: src/caches.rs:1-60 Cargo.toml:74-75
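A toy, std-only stand-in for this pattern: two isolated stores handed out as cheap Arc clones across threads. InMemoryStore is a hypothetical substitute for simd_r_drive::DataStore (the real stores persist to .bin files), and all names besides Caches and its two fields are illustrative.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Toy substitute for simd_r_drive::DataStore.
#[derive(Default)]
struct InMemoryStore {
    entries: Mutex<HashMap<String, Vec<u8>>>,
}

// Mirrors the two-store layout described above.
struct Caches {
    http_cache: Arc<InMemoryStore>,
    preprocessor_cache: Arc<InMemoryStore>,
}

impl Caches {
    fn open() -> Self {
        Caches {
            http_cache: Arc::new(InMemoryStore::default()),
            preprocessor_cache: Arc::new(InMemoryStore::default()),
        }
    }

    // Accessors hand out cheap Arc clones for use across threads.
    fn http_cache(&self) -> Arc<InMemoryStore> {
        Arc::clone(&self.http_cache)
    }
}

fn main() {
    let caches = Caches::open();
    let store = caches.http_cache();
    let handle = thread::spawn(move || {
        store
            .entries
            .lock()
            .unwrap()
            .insert("GET /tickers".into(), b"{}".to_vec());
    });
    handle.join().unwrap();
    assert!(caches.http_cache.entries.lock().unwrap().contains_key("GET /tickers"));
    // The preprocessor cache is isolated from the HTTP cache.
    assert!(caches.preprocessor_cache.entries.lock().unwrap().is_empty());
}
```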
Numeric Support and Precision
The system relies on rust_decimal for financial calculations where floating-point errors are unacceptable.
| Crate | Version | Key Types | Purpose |
|---|---|---|---|
| rust_decimal | 1.40.0 | Decimal | Exact decimal arithmetic for US GAAP values Cargo.toml:68 |
| chrono | 0.4.44 | NaiveDate | Date handling for filing periods Cargo.toml:46 |
| polars | 0.46.0 | DataFrame | High-performance columnar data processing Cargo.toml:61 |
Sources: Cargo.toml:46-68
Python Technology Stack
The Python narrative_stack system focuses on the machine learning pipeline and data analysis.
Machine Learning Framework
The training pipeline uses PyTorch and PyTorch Lightning to build and train autoencoders on US GAAP concepts.
Preprocessing Logic:
graph TB
subgraph "Training Pipeline"
Stage1Autoencoder["Stage1Autoencoder\n(PyTorch Lightning Module)"]
IterableConceptValueDataset["IterableConceptValueDataset\n(PyTorch Dataset)"]
Stage1Autoencoder --> IterableConceptValueDataset
end
subgraph "Scientific Stack"
numpy["NumPy\nArray operations"]
pandas["pandas\nData manipulation"]
sklearn["scikit-learn\nPreprocessing & PCA"]
end
subgraph "Preprocessing"
RobustScaler["RobustScaler\n(Normalization)"]
PCA["PCA\n(Dimensionality Reduction)"]
sklearn --> RobustScaler
sklearn --> PCA
end
- RobustScaler : Normalizes values per concept/unit pair to handle outliers in financial data.
- PCA : Reduces the dimensionality of semantic embeddings while preserving variance.
Sources: Project architecture overview, Cargo.toml:61 (Polars/Python integration context)
Database and Storage Integration
The Python stack interacts with both relational and key-value stores:
- MySQL : Stores ingested US GAAP triplets (concept, unit, value).
- simd-r-drive (via WebSocket): The DataStoreWsClient allows the Python stack to access the same high-performance storage used by the Rust application.
Sources: Project architecture overview.
Shared Infrastructure
graph TB
subgraph "Rust: sec-fetcher"
distill["distill_us_gaap_fundamental_concepts"]
csv_out["CSV Export\n(src/bin/pulls/us_gaap_bulk.rs)"]
distill --> csv_out
end
subgraph "Storage"
shared_dir["/data/us-gaap/"]
csv_out --> shared_dir
end
subgraph "Python: narrative_stack"
ingest["Data Ingestion"]
shared_dir --> ingest
ingest --> model["Stage1Autoencoder"]
end
File System Interchange
Data is passed between the Rust fetcher and the Python ML stack primarily through CSV files and shared storage.
Sources: Cargo.toml:28-34 src/caches.rs:25-51
Development and CI/CD Stack
GitHub Actions Workflow
The project uses GitHub Actions for continuous integration, ensuring cross-platform compatibility and code quality.
- Rust Tests: Executes cargo test across the workspace.
- Lints: Runs cargo clippy and cargo fmt.
- Integration: Uses docker-compose to spin up simd-r-drive-ws-server and MySQL for end-to-end testing.
Sources: Cargo.toml:83-86 Cargo.lock:1-100
Version Compatibility Matrix
| Component | Version | Requirement |
|---|---|---|
| Rust Edition | 2024 | Cargo.toml:6 |
| Polars | 0.46.0 | Cargo.toml:61 |
| Tokio | 1.50.0 | Cargo.toml:79 |
| Reqwest | 0.13.2 | Cargo.toml:66 |
Sources: Cargo.toml:1-82
Glossary
Relevant source files
- .github/workflows/build-docs.yml
- Cargo.toml
- README.md
- examples/fuzzy_match_company.rs
- examples/ipo_list.rs
- examples/ipo_show.rs
- src/enums.rs
- src/enums/form_type_enum.rs
- src/enums/fundamental_concept_enum.rs
- src/enums/url_enum.rs
- src/lib.rs
- src/models/nport_investment.rs
- src/models/ticker.rs
- src/network/fetch_cik_by_ticker_symbol.rs
- src/network/fetch_company_tickers.rs
- src/network/fetch_us_gaap_fundamentals.rs
- src/network/filings/filing_index.rs
- src/network/filings/mod.rs
This glossary defines the technical terms, abbreviations, and domain-specific concepts used throughout the rust-sec-fetcher codebase. It serves as a reference for onboarding engineers to understand the intersection of SEC regulatory requirements, financial data structures, and the system’s implementation patterns.
SEC & EDGAR Terminology
The following terms originate from the U.S. Securities and Exchange Commission (SEC) and its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system.
Accession Number
A unique identifier assigned by the SEC to every filing. It follows the format 0000320193-25-000008, where the first part is the CIK of the filer, the second is the two-digit filing year, and the third is a sequence number.
- Code Entity: AccessionNumber src/models/mod.rs (not shown, but referenced in src/network/fetch_us_gaap_fundamentals.rs:2).
- Usage: Used to construct direct URLs to filing index pages and primary documents src/network/fetch_us_gaap_fundamentals.rs:88-94
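The format described above can be split mechanically. The split_accession helper below is hypothetical (the real AccessionNumber type lives in src/models), but the three-part structure follows the definition given here.

```rust
// Hypothetical helper splitting "CIK-YY-SEQUENCE" accession numbers.
fn split_accession(acc: &str) -> Option<(u64, u32, u32)> {
    let mut parts = acc.splitn(3, '-');
    let cik: u64 = parts.next()?.parse().ok()?;      // filer CIK (zero-padded)
    let year: u32 = parts.next()?.parse().ok()?;     // two-digit filing year
    let sequence: u32 = parts.next()?.parse().ok()?; // per-year sequence number
    Some((cik, year, sequence))
}

fn main() {
    assert_eq!(split_accession("0000320193-25-000008"), Some((320193, 25, 8)));
}
```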
CIK (Central Index Key)
A 10-digit number used by the SEC to uniquely identify corporations and individuals who have filed disclosures.
- Code Entity: Cik src/models/ticker.rs:21
- Data Flow: Resolved from tickers via fetch_cik_by_ticker_symbol src/network/fetch_cik_by_ticker_symbol.rs and used as a primary key for most API calls like CompanyFacts src/network/fetch_us_gaap_fundamentals.rs:62
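Because the CIK is a 10-digit key, URLs built from it zero-pad the numeric value. The cik_url helper is illustrative; the /submissions/CIK{cik}.json path is the SEC submissions endpoint referenced elsewhere in this glossary.

```rust
// Zero-pad the CIK to 10 digits when building a submissions URL.
fn cik_url(cik: u64) -> String {
    format!("https://data.sec.gov/submissions/CIK{cik:010}.json")
}

fn main() {
    assert_eq!(
        cik_url(320193),
        "https://data.sec.gov/submissions/CIK0000320193.json"
    );
}
```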
Filing Index
The HTML landing page for a specific filing (e.g., -index.htm). It lists all documents associated with a submission, including the primary form (10-K) and all exhibits (EX-99.1).
- Code Entity: FilingIndex src/network/filings/filing_index.rs:2, fetch_filing_index src/network/filings/filing_index.rs:108-114
- Implementation: Parsed using regex to extract filenames and document types src/network/filings/filing_index.rs:23-76
US GAAP & XBRL
US GAAP (Generally Accepted Accounting Principles) is the standard framework of guidelines for financial accounting. XBRL (eXtensible Business Reporting Language) is the XML-based standard used to “tag” these financial concepts (e.g., NetIncomeLoss) in filings.
- Code Entity: FundamentalConcept src/enums/fundamental_concept_enum.rs
- Data Fetching: fetch_us_gaap_fundamentals retrieves these tags from the SEC companyfacts endpoint src/network/fetch_us_gaap_fundamentals.rs:54-58
Sources: src/network/fetch_us_gaap_fundamentals.rs:11-53 src/network/filings/filing_index.rs:14-22 src/models/ticker.rs:19-25
Codebase-Specific Concepts
Derived Instrument
Financial instruments that are not the primary common stock listing but share the same CIK, such as warrants (-WT), units (-UN), or preferred shares (-PA).
- Code Entity: TickerOrigin::DerivedInstrument src/enums.rs
- Logic: These are merged from ticker.txt and often require name backfilling from primary listings src/network/fetch_company_tickers.rs:27-35
Fuzzy Matching (Company Names)
A mechanism to resolve a human-readable company name to a Ticker and Cik using tokenization and weighted scoring.
- Code Entity: Ticker::get_by_fuzzy_matched_name src/models/ticker.rs:38-42
- Algorithm: Uses TOKEN_MATCH_THRESHOLD (0.6) and various boosts (e.g., EXACT_MATCH_BOOST) to rank candidates src/models/ticker.rs:27-33
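An illustrative token-overlap scorer in the spirit of the matcher described above. Only the 0.6 threshold is taken from the source; the scoring formula, token_score helper, and lack of boosts are assumptions, and the real algorithm in src/models/ticker.rs is more elaborate.

```rust
const TOKEN_MATCH_THRESHOLD: f64 = 0.6;

// Fraction of query tokens that appear verbatim in the candidate name
// (lowercased, whitespace-tokenized). Simplified: no punctuation
// stripping and no exact-match boosts.
fn token_score(query: &str, candidate: &str) -> f64 {
    let query_tokens: Vec<String> = query
        .to_lowercase()
        .split_whitespace()
        .map(|t| t.to_string())
        .collect();
    let candidate_tokens: Vec<String> = candidate
        .to_lowercase()
        .split_whitespace()
        .map(|t| t.to_string())
        .collect();
    if query_tokens.is_empty() {
        return 0.0;
    }
    let hits = query_tokens
        .iter()
        .filter(|t| candidate_tokens.contains(*t))
        .count();
    hits as f64 / query_tokens.len() as f64
}

fn main() {
    // A full-token match clears the 0.6 threshold.
    assert!(token_score("Apple", "Apple Hospitality") >= TOKEN_MATCH_THRESHOLD);
    // "inc" vs "inc." misses without punctuation stripping: score 0.5.
    assert!(token_score("Apple Inc", "Apple Inc.") < TOKEN_MATCH_THRESHOLD);
}
```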
Rendering Views
Traits and structures that define how a raw SEC HTML document is transformed into a readable format.
- Code Entity: FilingView examples/ipo_show.rs:49, MarkdownView examples/ipo_show.rs:49, EmbeddingTextView examples/ipo_show.rs:49
- Usage: Passed to render_filing to determine the output format src/ops/mod.rs (referenced in examples/ipo_show.rs:130).
Sources: src/network/fetch_company_tickers.rs:50-55 src/models/ticker.rs:43-122 examples/ipo_show.rs:92-108
System Architecture Diagrams
graph TD
subgraph "Input Space"
Input["User Input: 'AAPL' or 'Apple Inc'"]
end
subgraph "Code Entity Space (src/network & src/models)"
Client["SecClient"]
FetchTicker["fetch_company_tickers"]
FuzzyMatch["Ticker::get_by_fuzzy_matched_name"]
CikLookup["fetch_cik_by_ticker_symbol"]
end
subgraph "Data Sources"
JSON["company_tickers.json"]
TXT["ticker.txt"]
end
Input --> Client
Client --> FetchTicker
FetchTicker --> JSON
FetchTicker --> TXT
FetchTicker --> FuzzyMatch
Input --> CikLookup
CikLookup --> Client
Ticker Resolution Pipeline
This diagram bridges the natural language “Ticker/Name” input to the code entities responsible for resolution.
Sources: src/network/fetch_company_tickers.rs:58-61 src/models/ticker.rs:38-42 src/network/fetch_cik_by_ticker_symbol.rs
graph LR
subgraph "SEC EDGAR"
Submissions["/submissions/CIK{cik}.json"]
IndexPage["-index.htm"]
DocHTML["primary_doc.htm"]
end
subgraph "Logic (src/network/filings & src/ops)"
FetchSub["fetch_cik_submissions"]
FetchIdx["fetch_filing_index"]
RenderOp["render_filing"]
end
subgraph "Views (src/views)"
MDV["MarkdownView"]
ETV["EmbeddingTextView"]
end
Submissions --> FetchSub
FetchSub --> FetchIdx
FetchIdx --> IndexPage
IndexPage --> RenderOp
RenderOp --> DocHTML
RenderOp --> MDV
RenderOp --> ETV
Filing Retrieval and Rendering Flow
This diagram shows the transition from a raw SEC submission to a rendered view.
Sources: src/network/filings/filing_index.rs:108-114 src/network/fetch_us_gaap_fundamentals.rs:74-81 examples/ipo_show.rs:110-114
Terminology Summary Table
| Term | Context | Code Reference |
|---|---|---|
| N-PORT | Monthly portfolio holdings for funds | NportInvestment src/models/nport_investment.rs:11 |
| 13F | Quarterly holdings for institutional managers | fetch_13f_filings src/network/filings/mod.rs:19 |
| S-1 / F-1 | IPO Registration Statements | FormType::IPO_REGISTRATION_FORM_TYPES examples/ipo_list.rs:88 |
| 424B4 | Final Pricing Prospectus | FormType::IPO_PRICING_FORM_TYPES examples/ipo_list.rs:90-91 |
| Pct | Normalized percentage type (0-100) | Pct src/models/nport_investment.rs:35 |
| NamespaceHasher | Cache key generation with prefixing | NamespaceHasher src/models/ticker.rs:13-17 |
Sources: src/models/nport_investment.rs:11-41 src/network/filings/mod.rs:16-29 examples/ipo_list.rs:1-17 src/models/ticker.rs:13-17