This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started). For detailed implementation documentation of individual components, see [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application) and [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System).
Sources: Cargo.toml:1-10 README.md:1-8
System Purpose
The rust-sec-fetcher repository implements a dual-language financial data processing system that:
- Fetches company financial data from the SEC EDGAR API, including filings (10-K, 10-Q, 8-K), fund holdings (13F, N-PORT), and IPO registrations. README.md:5-8
- Transforms raw SEC filings into normalized US GAAP fundamental concepts or clean Markdown/text views. README.md:5-8 src/ops/filing_ops.rs:1-15
- Stores structured data as CSV files or provides it via high-level data models like Ticker, Cik, and NportInvestment. src/models/ticker.rs:1-10 src/models/nport.rs:1-10
- Trains machine learning models (in the Python narrative_stack) to understand financial concept relationships and perform semantic analysis.
The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing variations of concepts and consolidating them into a standardized taxonomy of FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
Sources: Cargo.toml:1-10 README.md:1-8 src/enums/fundamental_concept.rs:1-20
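The normalization described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the real FundamentalConcept enum and its mappings live in src/enums/fundamental_concept.rs, and the variant names and tag mappings below are assumptions.

```rust
// Illustrative sketch: consolidating raw US GAAP tag variants into a
// normalized concept enum. Variants and mappings here are hypothetical.

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FundamentalConcept {
    Revenues,
    NetIncomeLoss,
}

/// Normalize a raw taxonomy tag to a standardized concept, if recognized.
fn normalize_concept(raw_tag: &str) -> Option<FundamentalConcept> {
    match raw_tag {
        // Several reported variants consolidate into one concept.
        "Revenues"
        | "RevenueFromContractWithCustomerExcludingAssessedTax"
        | "SalesRevenueNet" => Some(FundamentalConcept::Revenues),
        "NetIncomeLoss" | "ProfitLoss" => Some(FundamentalConcept::NetIncomeLoss),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        normalize_concept("SalesRevenueNet"),
        Some(FundamentalConcept::Revenues)
    );
    assert_eq!(normalize_concept("UnknownTag"), None);
    println!("normalization sketch ok");
}
```

This is the essence of what makes cross-company querying possible: many reported tag spellings collapse into one canonical variant.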
Dual-Language Architecture
The repository employs a dual-language design that leverages the strengths of both Rust and Python:
| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, XML/JSON parsing | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |
Rust Layer (sec-fetcher)
- Implements SecClient with sophisticated throttling and caching policies. src/network/sec_client.rs:1-50
- Fetches company tickers, CIK submissions, N-PORT filings, and US GAAP fundamentals. src/network/mod.rs:1-30
- Transforms raw financial data via distill_us_gaap_fundamental_concepts. src/transformers/us_gaap.rs:1-20
- Provides specialized parsers for 13F, N-PORT, and Form 4 XML documents. src/parsers/mod.rs:1-20
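As a rough illustration of what a throttling policy involves, here is a minimal interval-based sketch. The ThrottlePolicy name echoes the crate's networking exports, but the fields and logic below are assumptions, not the actual implementation; SEC EDGAR's fair-access guidance asks clients to stay at or below roughly 10 requests per second.

```rust
use std::time::{Duration, Instant};

// Hypothetical minimal throttle: enforce a minimum interval between
// outgoing requests. The real SecClient policy is more sophisticated
// (caching, retries); this only shows the core pacing idea.
struct ThrottlePolicy {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl ThrottlePolicy {
    fn new(max_requests_per_sec: u32) -> Self {
        Self {
            min_interval: Duration::from_secs(1) / max_requests_per_sec,
            last_request: None,
        }
    }

    /// Block until enough time has passed since the previous request.
    fn wait_turn(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                std::thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    let mut throttle = ThrottlePolicy::new(10); // 10 req/s -> >= 100 ms apart
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait_turn();
    }
    // First call is immediate; the next two each wait ~100 ms.
    assert!(start.elapsed() >= Duration::from_millis(200));
    println!("throttle sketch ok");
}
```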
Python Layer (narrative_stack)
- Ingests data generated by the Rust layer.
- Generates semantic embeddings for concept/unit pairs.
- Trains Stage1Autoencoder models using PyTorch Lightning.
Sources: Cargo.toml:1-42 README.md:15-50
High-Level Data Flow
The following diagram bridges the gap between the natural language data flow and the specific code entities responsible for each stage.
```mermaid
graph TB
    SEC["SEC EDGAR API\n(company_tickers.json, submissions, archives)"]
    SecClient["SecClient\n(src/network/sec_client.rs)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_filings\nfetch_nport_filings\n(src/network/mod.rs)"]
    Parsers["Parsers\nparse_13f_xml\nparse_nport_xml\n(src/parsers/mod.rs)"]
    Ops["Operations Logic\nrender_filing\nfetch_and_render\n(src/ops/filing_ops.rs)"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\n(src/models/mod.rs)"]
    Views["Views\nMarkdownView\nEmbeddingTextView\n(src/views/mod.rs)"]
    PythonIngest["Python narrative_stack\nData Ingestion & Training"]
    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Parsers
    Parsers --> Models
    NetworkFuncs --> Ops
    Ops --> Views
    Models --> PythonIngest
```
System Data Flow and Code Entities
Data Flow Summary:
- Fetch: SecClient retrieves data from SEC EDGAR API endpoints using structured Url variants. src/enums/url_enum.rs:5-116
- Parse: Raw XML/JSON data is processed by specialized parsers (e.g., parse_13f_xml) into internal models. src/parsers/thirteen_f.rs:1-10
- Transform/Render: Filings are rendered into human-readable or machine-learning-ready formats via render_filing. src/ops/filing_ops.rs:118-130
- Analyze: Normalized data is passed to the Python layer for ML training and embedding generation.
Sources: src/network/sec_client.rs:1-50 src/enums/url_enum.rs:5-116 src/ops/filing_ops.rs:118-130 README.md:110-140
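The first three stages can be sketched end to end with stand-in types. Nothing below is the crate's real API (no SecClient, parse_13f_xml, or render_filing appear); it only mirrors the shape of the fetch → parse → render flow.

```rust
// Stand-in types for the sketch; the real models are Ticker,
// CikSubmission, NportInvestment, etc.
struct RawFiling {
    body: String,
}

struct ParsedFiling {
    company: String,
    concept: String,
    value: f64,
}

// Stage 1: fetch (stubbed; the real layer performs throttled HTTP GETs).
fn fetch_filing() -> RawFiling {
    RawFiling {
        body: "ACME|Revenues|1000.0".to_string(),
    }
}

// Stage 2: parse raw text into an internal model.
fn parse_filing(raw: &RawFiling) -> Option<ParsedFiling> {
    let mut parts = raw.body.split('|');
    Some(ParsedFiling {
        company: parts.next()?.to_string(),
        concept: parts.next()?.to_string(),
        value: parts.next()?.parse().ok()?,
    })
}

// Stage 3: render into a Markdown-like view for downstream consumers.
fn render_markdown(filing: &ParsedFiling) -> String {
    format!("## {}\n- {}: {}", filing.company, filing.concept, filing.value)
}

fn main() {
    let raw = fetch_filing();
    let parsed = parse_filing(&raw).expect("parse failed");
    let view = render_markdown(&parsed);
    assert!(view.starts_with("## ACME"));
    println!("{view}");
}
```

In the actual system, the rendered or CSV output of stage 3 is what the Python narrative_stack layer ingests for stage 4 (analysis).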
Core Module Structure
The Rust codebase is modularized to separate networking, data modeling, and business logic.
Module Dependency Graph

```mermaid
graph TB
    main["main.rs / lib.rs\nEntry Points"]
    config["config module\nConfigManager\nAppConfig\n(src/config/mod.rs)"]
    network["network module\nSecClient\nThrottlePolicy\n(src/network/mod.rs)"]
    ops["ops module\nrender_filing\nholdings operations\n(src/ops/mod.rs)"]
    models["models module\nTicker, Cik, AccessionNumber\n(src/models/mod.rs)"]
    enums["enums module\nFundamentalConcept\nUrl, FormType\n(src/enums/mod.rs)"]
    parsers["parsers module\nXML/JSON parsers\n(src/parsers/mod.rs)"]
    views["views module\nMarkdownView\n(src/views/mod.rs)"]
    main --> config
    main --> network
    main --> ops
    network --> config
    network --> models
    network --> enums
    ops --> network
    ops --> views
    ops --> models
    parsers --> models
    models --> enums
```
| Module | Primary Purpose | Key Exports |
|---|---|---|
| config | Configuration management and credential loading. | ConfigManager, AppConfig src/config/mod.rs |
| network | HTTP client, data fetching, and throttling. | SecClient, fetch_filings, fetch_company_profile src/network/mod.rs |
| ops | High-level business logic and data orchestration. | render_filing, fetch_and_render src/ops/mod.rs |
| models | Domain-specific data structures. | Ticker, Cik, CikSubmission, NportInvestment src/models/mod.rs |
| enums | Type-safe enumerations for SEC concepts. | FundamentalConcept, FormType, Url src/enums/mod.rs |
| parsers | Logic for converting SEC formats to Rust structs. | parse_13f_xml, parse_nport_xml src/parsers/mod.rs |
| views | Rendering logic for different output formats. | MarkdownView, EmbeddingTextView src/views/mod.rs |
Sources: Cargo.toml:15-35 src/enums/url_enum.rs:1-5
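As an example of the kind of type-safe enumeration the enums module exports, here is a hypothetical FormType sketch. The variant names and string mappings below are assumptions drawn from the form types mentioned on this page, not the actual implementation.

```rust
// Hypothetical FormType enum: parsing free-form SEC form strings into a
// closed set of variants keeps downstream matching exhaustive and typo-free.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FormType {
    TenK,
    TenQ,
    EightK,
    ThirteenF,
    NportP,
}

impl FormType {
    /// Map a raw form string to a known variant, if recognized.
    fn from_form_str(s: &str) -> Option<Self> {
        match s {
            "10-K" => Some(Self::TenK),
            "10-Q" => Some(Self::TenQ),
            "8-K" => Some(Self::EightK),
            "13F-HR" => Some(Self::ThirteenF),
            "NPORT-P" => Some(Self::NportP),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(FormType::from_form_str("10-K"), Some(FormType::TenK));
    assert_eq!(FormType::from_form_str("S-1"), None);
    println!("form type sketch ok");
}
```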
Key Technologies
Rust Dependencies:
- tokio: Async runtime for non-blocking I/O. Cargo.toml:79
- reqwest: Robust HTTP client with JSON support. Cargo.toml:66
- polars: High-performance DataFrame library for data manipulation. Cargo.toml:61
- quick-xml: Fast XML parsing for SEC filing documents. Cargo.toml:62
- html-to-markdown-rs: Converts SEC HTML filings to Markdown. Cargo.toml:55
- simd-r-drive: Integration with a high-performance data store. Cargo.toml:74
Python Dependencies (narrative_stack):
- PyTorch & PyTorch Lightning (ML training)
- scikit-learn (Preprocessing and PCA)
- BERT-based models (contextual embeddings for concept clustering)
Sources: Cargo.toml:44-82
Next Steps
- Getting Started: Learn how to configure your SEC contact email and run your first lookup. See [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started)
- Rust Architecture: Dive deep into the SecClient and networking layer. See [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application)
- ML Pipeline: Explore the autoencoder training and US GAAP alignment. See [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System)
Sources: README.md:9-15