This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
This document provides a high-level introduction to the rust-sec-fetcher repository, explaining its purpose, architecture, and the relationship between its Rust and Python components. This page covers the system's overall design and data flow. For installation and configuration instructions, see [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started). For detailed implementation documentation of individual components, see [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application) and [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System).
Sources: Cargo.toml:1-10 README.md:1-8
System Purpose
The rust-sec-fetcher repository implements a dual-language financial data processing system that:
- Fetches company financial data from the SEC EDGAR API, including filings (10-K, 10-Q, 8-K), fund holdings (13F, N-PORT), and IPO registrations. README.md:5-8
- Transforms raw SEC filings into normalized US GAAP fundamental concepts or clean Markdown/text views. README.md:5-8 src/ops/filing_ops.rs:1-15
- Stores structured data as CSV files or provides it via high-level data models like Ticker, Cik, and NportInvestment. src/models/ticker.rs:1-10 src/models/nport.rs:1-10
- Trains machine learning models (in the Python narrative_stack) to understand financial concept relationships and perform semantic analysis.
The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing variations of concepts and consolidating them into a standardized taxonomy of FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.
Sources: Cargo.toml:1-10 README.md:1-8 src/enums/fundamental_concept.rs:1-20
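The normalization described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the real FundamentalConcept enum and its mappings live in src/enums/fundamental_concept.rs, and the variant names and tag mappings below are assumptions.

```rust
// Illustrative sketch: consolidating raw US GAAP tag variants into a
// normalized concept enum. Variants and mappings here are hypothetical.

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FundamentalConcept {
    Revenues,
    NetIncomeLoss,
}

/// Normalize a raw taxonomy tag to a standardized concept, if recognized.
fn normalize_concept(raw_tag: &str) -> Option<FundamentalConcept> {
    match raw_tag {
        // Several reported variants consolidate into one concept.
        "Revenues"
        | "RevenueFromContractWithCustomerExcludingAssessedTax"
        | "SalesRevenueNet" => Some(FundamentalConcept::Revenues),
        "NetIncomeLoss" | "ProfitLoss" => Some(FundamentalConcept::NetIncomeLoss),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        normalize_concept("SalesRevenueNet"),
        Some(FundamentalConcept::Revenues)
    );
    assert_eq!(normalize_concept("UnknownTag"), None);
    println!("normalization sketch ok");
}
```

This is the essence of what makes cross-company querying possible: many reported tag spellings collapse into one canonical variant.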
Dual-Language Architecture
The repository employs a dual-language design that leverages the strengths of both Rust and Python:
| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, XML/JSON parsing | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |
Rust Layer (sec-fetcher)
- Implements SecClient with sophisticated throttling and caching policies. src/network/sec_client.rs:1-50
- Fetches company tickers, CIK submissions, N-PORT filings, and US GAAP fundamentals. src/network/mod.rs:1-30
- Transforms raw financial data via distill_us_gaap_fundamental_concepts. src/transformers/us_gaap.rs:1-20
- Provides specialized parsers for 13F, N-PORT, and Form 4 XML documents. src/parsers/mod.rs:1-20
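As a rough illustration of what a throttling policy involves, here is a minimal interval-based sketch. The ThrottlePolicy name echoes the crate's networking exports, but the fields and logic below are assumptions, not the actual implementation; SEC EDGAR's fair-access guidance asks clients to stay at or below roughly 10 requests per second.

```rust
use std::time::{Duration, Instant};

// Hypothetical minimal throttle: enforce a minimum interval between
// outgoing requests. The real SecClient policy is more sophisticated
// (caching, retries); this only shows the core pacing idea.
struct ThrottlePolicy {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl ThrottlePolicy {
    fn new(max_requests_per_sec: u32) -> Self {
        Self {
            min_interval: Duration::from_secs(1) / max_requests_per_sec,
            last_request: None,
        }
    }

    /// Block until enough time has passed since the previous request.
    fn wait_turn(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                std::thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    let mut throttle = ThrottlePolicy::new(10); // 10 req/s -> >= 100 ms apart
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait_turn();
    }
    // First call is immediate; the next two each wait ~100 ms.
    assert!(start.elapsed() >= Duration::from_millis(200));
    println!("throttle sketch ok");
}
```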
Python Layer (narrative_stack)
- Ingests data generated by the Rust layer.
- Generates semantic embeddings for concept/unit pairs.
- Trains Stage1Autoencoder models using PyTorch Lightning.
Sources: Cargo.toml:1-42 README.md:15-50
High-Level Data Flow
The following diagram bridges the gap between the natural language data flow and the specific code entities responsible for each stage.
```mermaid
graph TB
    SEC["SEC EDGAR API\n(company_tickers.json, submissions, archives)"]
    SecClient["SecClient\n(src/network/sec_client.rs)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_filings\nfetch_nport_filings\n(src/network/mod.rs)"]
    Parsers["Parsers\nparse_13f_xml\nparse_nport_xml\n(src/parsers/mod.rs)"]
    Ops["Operations Logic\nrender_filing\nfetch_and_render\n(src/ops/filing_ops.rs)"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\n(src/models/mod.rs)"]
    Views["Views\nMarkdownView\nEmbeddingTextView\n(src/views/mod.rs)"]
    PythonIngest["Python narrative_stack\nData Ingestion & Training"]
    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Parsers
    Parsers --> Models
    NetworkFuncs --> Ops
    Ops --> Views
    Models --> PythonIngest
```
System Data Flow and Code Entities
Data Flow Summary:
- Fetch: SecClient retrieves data from SEC EDGAR API endpoints using structured Url variants. src/enums/url_enum.rs:5-116
- Parse: Raw XML/JSON data is processed by specialized parsers (e.g., parse_13f_xml) into internal models. src/parsers/thirteen_f.rs:1-10
- Transform/Render: Filings are rendered into human-readable or machine-learning-ready formats via render_filing. src/ops/filing_ops.rs:118-130
- Analyze: Normalized data is passed to the Python layer for ML training and embedding generation.
Sources: src/network/sec_client.rs:1-50 src/enums/url_enum.rs:5-116 src/ops/filing_ops.rs:118-130 README.md:110-140
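The first three stages can be sketched end to end with stand-in types. Nothing below is the crate's real API (no SecClient, parse_13f_xml, or render_filing appear); it only mirrors the shape of the fetch → parse → render flow.

```rust
// Stand-in types for the sketch; the real models are Ticker,
// CikSubmission, NportInvestment, etc.
struct RawFiling {
    body: String,
}

struct ParsedFiling {
    company: String,
    concept: String,
    value: f64,
}

// Stage 1: fetch (stubbed; the real layer performs throttled HTTP GETs).
fn fetch_filing() -> RawFiling {
    RawFiling {
        body: "ACME|Revenues|1000.0".to_string(),
    }
}

// Stage 2: parse raw text into an internal model.
fn parse_filing(raw: &RawFiling) -> Option<ParsedFiling> {
    let mut parts = raw.body.split('|');
    Some(ParsedFiling {
        company: parts.next()?.to_string(),
        concept: parts.next()?.to_string(),
        value: parts.next()?.parse().ok()?,
    })
}

// Stage 3: render into a Markdown-like view for downstream consumers.
fn render_markdown(filing: &ParsedFiling) -> String {
    format!("## {}\n- {}: {}", filing.company, filing.concept, filing.value)
}

fn main() {
    let raw = fetch_filing();
    let parsed = parse_filing(&raw).expect("parse failed");
    let view = render_markdown(&parsed);
    assert!(view.starts_with("## ACME"));
    println!("{view}");
}
```

In the actual system, the rendered or CSV output of stage 3 is what the Python narrative_stack layer ingests for stage 4 (analysis).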
Core Module Structure
The Rust codebase is modularized to separate networking, data modeling, and business logic.
Module Dependency Graph

```mermaid
graph TB
    main["main.rs / lib.rs\nEntry Points"]
    config["config module\nConfigManager\nAppConfig\n(src/config/mod.rs)"]
    network["network module\nSecClient\nThrottlePolicy\n(src/network/mod.rs)"]
    ops["ops module\nrender_filing\nholdings operations\n(src/ops/mod.rs)"]
    models["models module\nTicker, Cik, AccessionNumber\n(src/models/mod.rs)"]
    enums["enums module\nFundamentalConcept\nUrl, FormType\n(src/enums/mod.rs)"]
    parsers["parsers module\nXML/JSON parsers\n(src/parsers/mod.rs)"]
    views["views module\nMarkdownView\n(src/views/mod.rs)"]
    main --> config
    main --> network
    main --> ops
    network --> config
    network --> models
    network --> enums
    ops --> network
    ops --> views
    ops --> models
    parsers --> models
    models --> enums
```
| Module | Primary Purpose | Key Exports |
|---|---|---|
| config | Configuration management and credential loading. | ConfigManager, AppConfig src/config/mod.rs |
| network | HTTP client, data fetching, and throttling. | SecClient, fetch_filings, fetch_company_profile src/network/mod.rs |
| ops | High-level business logic and data orchestration. | render_filing, fetch_and_render src/ops/mod.rs |
| models | Domain-specific data structures. | Ticker, Cik, CikSubmission, NportInvestment src/models/mod.rs |
| enums | Type-safe enumerations for SEC concepts. | FundamentalConcept, FormType, Url src/enums/mod.rs |
| parsers | Logic for converting SEC formats to Rust structs. | parse_13f_xml, parse_nport_xml src/parsers/mod.rs |
| views | Rendering logic for different output formats. | MarkdownView, EmbeddingTextView src/views/mod.rs |
Sources: Cargo.toml:15-35 src/enums/url_enum.rs:1-5
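As an example of the kind of type-safe enumeration the enums module exports, here is a hypothetical FormType sketch. The variant names and string mappings below are assumptions drawn from the form types mentioned on this page, not the actual implementation.

```rust
// Hypothetical FormType enum: parsing free-form SEC form strings into a
// closed set of variants keeps downstream matching exhaustive and typo-free.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FormType {
    TenK,
    TenQ,
    EightK,
    ThirteenF,
    NportP,
}

impl FormType {
    /// Map a raw form string to a known variant, if recognized.
    fn from_form_str(s: &str) -> Option<Self> {
        match s {
            "10-K" => Some(Self::TenK),
            "10-Q" => Some(Self::TenQ),
            "8-K" => Some(Self::EightK),
            "13F-HR" => Some(Self::ThirteenF),
            "NPORT-P" => Some(Self::NportP),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(FormType::from_form_str("10-K"), Some(FormType::TenK));
    assert_eq!(FormType::from_form_str("S-1"), None);
    println!("form type sketch ok");
}
```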
Key Technologies
Rust Dependencies:
- tokio: Async runtime for non-blocking I/O. Cargo.toml:79
- reqwest: Robust HTTP client with JSON support. Cargo.toml:66
- polars: High-performance DataFrame library for data manipulation. Cargo.toml:61
- quick-xml: Fast XML parsing for SEC filing documents. Cargo.toml:62
- html-to-markdown-rs: Converts SEC HTML filings to Markdown. Cargo.toml:55
- simd-r-drive: Integration with a high-performance data store. Cargo.toml:74
Python Dependencies (narrative_stack):
- PyTorch & PyTorch Lightning (ML training)
- scikit-learn (Preprocessing and PCA)
- BERT-based models (contextual embeddings for concept clustering)
Sources: Cargo.toml:44-82
Next Steps
- Getting Started: Learn how to configure your SEC contact email and run your first lookup. See [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started)
- Rust Architecture: Dive deep into the SecClient and networking layer. See [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application)
- ML Pipeline: Explore the autoencoder training and US GAAP alignment. See [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System)
Sources: README.md:9-15