This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Overview


This page introduces the rust-sec-fetcher repository: its purpose, its architecture, the relationship between its Rust and Python components, and the overall data flow. For installation and configuration instructions, see [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started). For detailed implementation documentation of individual components, see [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application) and [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System).

Sources: Cargo.toml:1-10 README.md:1-8

System Purpose

The rust-sec-fetcher repository implements a dual-language financial data processing system that:

  1. Fetches company financial data from the SEC EDGAR API, including filings (10-K, 10-Q, 8-K), fund holdings (13F, N-PORT), and IPO registrations. README.md:5-8
  2. Transforms raw SEC filings into normalized US GAAP fundamental concepts or clean Markdown/text views. README.md:5-8 src/ops/filing_ops.rs:1-15
  3. Stores structured data as CSV files or provides it via high-level data models like Ticker, Cik, and NportInvestment. src/models/ticker.rs:1-10 src/models/nport.rs:1-10
  4. Trains machine learning models (in the Python narrative_stack) to understand financial concept relationships and perform semantic analysis.
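The value of typed models such as Cik can be illustrated with a small newtype. The sketch below is hypothetical and is not the crate's actual Cik implementation: the `parse` and `value` methods and the `MAX` constant are assumptions made for illustration. It relies only on the fact that EDGAR writes CIKs as zero-padded 10-digit numbers in its URLs.

```rust
use std::fmt;

/// Hypothetical stand-in for a `Cik` model; the real type in
/// src/models may be shaped differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Cik(u64);

impl Cik {
    /// Largest value that fits in the 10-digit form EDGAR uses.
    const MAX: u64 = 9_999_999_999;

    /// Parses input such as "320193" or "0000320193".
    pub fn parse(input: &str) -> Result<Self, String> {
        let trimmed = input.trim();
        if trimmed.is_empty() || !trimmed.bytes().all(|b| b.is_ascii_digit()) {
            return Err(format!("not a numeric CIK: {input:?}"));
        }
        let value: u64 = trimmed
            .parse()
            .map_err(|e| format!("invalid CIK {input:?}: {e}"))?;
        if value > Self::MAX {
            return Err(format!("CIK {value} exceeds 10 digits"));
        }
        Ok(Cik(value))
    }

    /// Returns the numeric value without padding.
    pub fn value(self) -> u64 {
        self.0
    }
}

impl fmt::Display for Cik {
    /// Formats the CIK zero-padded to 10 digits.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:010}", self.0)
    }
}
```

Wrapping the identifier in its own type keeps validation and padding in one place, so a ticker symbol or an over-long number is rejected at the boundary rather than deep inside a request.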

The system specializes in processing US GAAP (Generally Accepted Accounting Principles) financial data, normalizing variations of concepts and consolidating them into a standardized taxonomy of FundamentalConcept variants. This normalization enables consistent querying across diverse financial reports and powers downstream machine learning applications.

Sources: Cargo.toml:1-10 README.md:1-8 src/enums/fundamental_concept.rs:1-20
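A minimal sketch of what this normalization might look like follows. The three variants and the tag groupings are assumptions chosen for illustration; they are not taken from src/enums/fundamental_concept.rs, and the real taxonomy is likely larger and may map tags differently.

```rust
/// Hypothetical subset of a normalized taxonomy. The variant names
/// here are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FundamentalConcept {
    Revenues,
    NetIncomeLoss,
    Assets,
}

/// Maps a raw US GAAP tag to a normalized concept, or `None` when
/// this sketch has no mapping for the tag.
pub fn normalize(raw_tag: &str) -> Option<FundamentalConcept> {
    match raw_tag {
        // Several reporting variants collapse into a single concept.
        "Revenues"
        | "SalesRevenueNet"
        | "RevenueFromContractWithCustomerExcludingAssessedTax" => {
            Some(FundamentalConcept::Revenues)
        }
        "NetIncomeLoss" | "ProfitLoss" => Some(FundamentalConcept::NetIncomeLoss),
        "Assets" => Some(FundamentalConcept::Assets),
        _ => None,
    }
}
```

With a mapping of this kind, a query for `FundamentalConcept::Revenues` can match filings regardless of which revenue tag a company reported under.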

Dual-Language Architecture

The repository employs a dual-language design that leverages the strengths of both Rust and Python:

| Layer | Language | Primary Responsibilities | Key Reason |
|---|---|---|---|
| Data Fetching & Processing | Rust | HTTP requests, throttling, caching, data transformation, XML/JSON parsing | I/O-bound operations, memory safety, high performance |
| Machine Learning | Python | Embedding generation, model training, statistical analysis | Rich ML ecosystem (PyTorch, scikit-learn) |

Rust Layer (sec-fetcher)

Python Layer (narrative_stack)

  • Ingests data generated by the Rust layer.
  • Generates semantic embeddings for concept/unit pairs.
  • Trains Stage1Autoencoder models using PyTorch Lightning.

Sources: Cargo.toml:1-42 README.md:15-50

High-Level Data Flow

The following diagram maps each stage of the data flow to the code entities responsible for it.

graph TB
    SEC["SEC EDGAR API\n(company_tickers.json, submissions, archives)"]
    SecClient["SecClient\n(src/network/sec_client.rs)"]
    NetworkFuncs["Network Functions\nfetch_company_tickers\nfetch_filings\nfetch_nport_filings\n(src/network/mod.rs)"]
    Parsers["Parsers\nparse_13f_xml\nparse_nport_xml\n(src/parsers/mod.rs)"]
    Ops["Operations Logic\nrender_filing\nfetch_and_render\n(src/ops/filing_ops.rs)"]
    Models["Data Models\nTicker, CikSubmission\nNportInvestment\n(src/models/mod.rs)"]
    Views["Views\nMarkdownView\nEmbeddingTextView\n(src/views/mod.rs)"]
    PythonIngest["Python narrative_stack\nData Ingestion & Training"]

    SEC -->|HTTP GET| SecClient
    SecClient --> NetworkFuncs
    NetworkFuncs --> Parsers
    Parsers --> Models
    NetworkFuncs --> Ops
    Ops --> Views
    Models --> PythonIngest

System Data Flow and Code Entities

Data Flow Summary:

  1. Fetch: SecClient retrieves data from SEC EDGAR API endpoints using structured Url variants. src/enums/url_enum.rs:5-116
  2. Parse: Raw XML/JSON data is processed by specialized parsers (e.g., parse_13f_xml) into internal models. src/parsers/thirteen_f.rs:1-10
  3. Transform/Render: Filings are rendered into human-readable or machine-learning-ready formats via render_filing. src/ops/filing_ops.rs:118-130
  4. Analyze: Normalized data is passed to the Python layer for ML training and embedding generation.

Sources: src/network/sec_client.rs:1-50 src/enums/url_enum.rs:5-116 src/ops/filing_ops.rs:118-130 README.md:110-140
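The first three stages above can be sketched in miniature. Everything in this block is hypothetical: `EdgarUrl`, `Holding`, `parse_holdings`, and `render_markdown` are illustrative names rather than the crate's API, and the parse step reads simple `name,value` lines as a stand-in for real XML parsing. Only the two endpoint URL shapes reflect the public EDGAR layout.

```rust
/// Hypothetical endpoint enum, loosely modeled on the idea of
/// structured URL variants. Not the crate's `Url` enum.
pub enum EdgarUrl {
    CompanyTickers,
    Submissions { cik: u64 },
}

impl EdgarUrl {
    /// Fetch stage input: builds the request URL for an endpoint.
    pub fn to_url(&self) -> String {
        match self {
            EdgarUrl::CompanyTickers => {
                "https://www.sec.gov/files/company_tickers.json".to_string()
            }
            EdgarUrl::Submissions { cik } => {
                format!("https://data.sec.gov/submissions/CIK{cik:010}.json")
            }
        }
    }
}

/// Hypothetical internal model produced by the parse stage.
#[derive(Debug, PartialEq)]
pub struct Holding {
    pub name: String,
    pub value_usd: u64,
}

/// Parse stage: turns "name,value" lines into models and skips
/// lines that do not fit that shape.
pub fn parse_holdings(raw: &str) -> Vec<Holding> {
    raw.lines()
        .filter_map(|line| {
            let (name, value) = line.split_once(',')?;
            let value_usd = value.trim().parse().ok()?;
            Some(Holding { name: name.trim().to_string(), value_usd })
        })
        .collect()
}

/// Render stage: formats the models as a Markdown table.
pub fn render_markdown(holdings: &[Holding]) -> String {
    let mut out = String::from("| Name | Value (USD) |\n|---|---|\n");
    for h in holdings {
        out.push_str(&format!("| {} | {} |\n", h.name, h.value_usd));
    }
    out
}
```

Keeping each stage as a separate function over plain data means parsing and rendering can be tested without network access, which is one plausible reason for the module split this page describes.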

graph TB
    main["main.rs / lib.rs\nEntry Points"]
    config["config module\nConfigManager\nAppConfig\n(src/config/mod.rs)"]
    network["network module\nSecClient\nThrottlePolicy\n(src/network/mod.rs)"]
    ops["ops module\nrender_filing\nholdings operations\n(src/ops/mod.rs)"]
    models["models module\nTicker, Cik, AccessionNumber\n(src/models/mod.rs)"]
    enums["enums module\nFundamentalConcept\nUrl, FormType\n(src/enums/mod.rs)"]
    parsers["parsers module\nXML/JSON parsers\n(src/parsers/mod.rs)"]
    views["views module\nMarkdownView\n(src/views/mod.rs)"]

    main --> config
    main --> network
    main --> ops

    network --> config
    network --> models
    network --> enums

    ops --> network
    ops --> views
    ops --> models

    parsers --> models
    models --> enums

Core Module Structure

The Rust codebase is modularized to separate networking, data modeling, and business logic.

Module Dependency Graph

| Module | Primary Purpose | Key Exports |
|---|---|---|
| config | Configuration management and credential loading. | ConfigManager, AppConfig (src/config/mod.rs) |
| network | HTTP client, data fetching, and throttling. | SecClient, fetch_filings, fetch_company_profile (src/network/mod.rs) |
| ops | High-level business logic and data orchestration. | render_filing, fetch_and_render (src/ops/mod.rs) |
| models | Domain-specific data structures. | Ticker, Cik, CikSubmission, NportInvestment (src/models/mod.rs) |
| enums | Type-safe enumerations for SEC concepts. | FundamentalConcept, FormType, Url (src/enums/mod.rs) |
| parsers | Logic for converting SEC formats to Rust structs. | parse_13f_xml, parse_nport_xml (src/parsers/mod.rs) |
| views | Rendering logic for different output formats. | MarkdownView, EmbeddingTextView (src/views/mod.rs) |

Sources: Cargo.toml:15-35 src/enums/url_enum.rs:1-5
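The network module's throttling responsibility can be illustrated with a fixed-interval limiter. This is a sketch under stated assumptions and is not the crate's ThrottlePolicy: `MinIntervalThrottle` and its `reserve` method are hypothetical names, and the approach (spacing requests evenly) is one simple technique among several. It takes the current time as a parameter so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical fixed-interval throttle. Illustrative only; the
/// real throttling logic in src/network may work differently.
pub struct MinIntervalThrottle {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl MinIntervalThrottle {
    /// Creates a throttle that spaces requests evenly so that at most
    /// `max_requests_per_second` go out in any one-second window.
    pub fn new(max_requests_per_second: u32) -> Self {
        assert!(max_requests_per_second > 0, "rate must be positive");
        Self {
            min_interval: Duration::from_secs(1) / max_requests_per_second,
            last_request: None,
        }
    }

    /// Returns how long the caller should wait before sending a
    /// request at `now`, and records when that request will go out.
    pub fn reserve(&mut self, now: Instant) -> Duration {
        let send_at = match self.last_request {
            // Never schedule earlier than one interval after the
            // previously reserved slot, and never in the past.
            Some(last) => (last + self.min_interval).max(now),
            None => now,
        };
        self.last_request = Some(send_at);
        send_at.duration_since(now)
    }
}
```

A caller would sleep for the returned duration before issuing each request. SEC's fair-access guidance currently caps automated traffic at 10 requests per second, so `MinIntervalThrottle::new(10)` would be a natural starting point under that assumption.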

Key Technologies

Rust Dependencies:

  • tokio: Async runtime for non-blocking I/O. Cargo.toml:79
  • reqwest: Robust HTTP client with JSON support. Cargo.toml:66
  • polars: High-performance DataFrame library for data manipulation. Cargo.toml:61
  • quick-xml: Fast XML parsing for SEC filing documents. Cargo.toml:62
  • html-to-markdown-rs: Converts SEC HTML filings to Markdown. Cargo.toml:55
  • simd-r-drive: Integration with a high-performance data store. Cargo.toml:74

Python Dependencies (narrative_stack):

  • PyTorch & PyTorch Lightning (ML training)
  • scikit-learn (Preprocessing and PCA)
  • BERT (Contextual embeddings for concept clustering)

Sources: Cargo.toml:44-82

Next Steps

  • Getting Started: Learn how to configure your SEC contact email and run your first lookup. See [Getting Started](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Getting Started).
  • Rust Architecture: Dive deep into the SecClient and networking layer. See [Rust sec-fetcher Application](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Rust sec-fetcher Application).
  • ML Pipeline: Explore the autoencoder training and US GAAP alignment. See [Python narrative_stack System](https://github.com/jzombie/rust-sec-fetcher/blob/345ac64c/Python narrative_stack System).

Sources: README.md:9-15