This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Parsers & Data Normalization

Purpose and Scope

The parsers and normalize modules form the data ingestion backbone of rust-sec-fetcher. While the network layer retrieves raw bytes (XML, JSON, or CSV), the parsers transform these into structured Rust models. The normalize module ensures that numeric inconsistencies across SEC eras—such as the transition from thousands-of-dollars to actual-dollars in Form 13F—are handled centrally rather than being scattered across the codebase.

Sources: src/parsers.rs:1-21 src/normalize/mod.rs:1-16


Normalization Logic

The normalize module is the single source of truth for scale conversions and unit adjustments. It prevents “inline conversions” in parsers, ensuring that logic for handling SEC schema changes is testable in isolation.

13F Value Normalization

The SEC changed the <value> unit in Form 13F-HR informationTable.xml filings around January 1, 2023.

  • Legacy Era: Values reported in thousands of USD.
  • Modern Era: Values reported in actual USD.

The function normalize_13f_value_usd (src/normalize/thirteenf.rs:144-150) uses the THIRTEENF_THOUSANDS_ERA_CUTOFF constant (src/normalize/thirteenf.rs:72) to determine whether a 1000x multiplier should be applied, based on the filing_date.
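The era logic reduces to a single cutoff comparison. A minimal sketch, using f64 and ISO-8601 date strings rather than the crate's actual Decimal and date types:

```rust
/// Hypothetical sketch of the era cutoff; the real constant and function
/// signatures in src/normalize/thirteenf.rs may differ.
const THIRTEENF_THOUSANDS_ERA_CUTOFF: &str = "2023-01-01";

/// ISO-8601 date strings compare lexicographically in chronological order.
fn is_13f_thousands_era(filing_date: &str) -> bool {
    filing_date < THIRTEENF_THOUSANDS_ERA_CUTOFF
}

fn normalize_13f_value_usd(raw_value: f64, filing_date: &str) -> f64 {
    if is_13f_thousands_era(filing_date) {
        raw_value * 1000.0 // legacy era: values were reported in thousands of USD
    } else {
        raw_value // modern era: values are already actual USD
    }
}

fn main() {
    assert_eq!(normalize_13f_value_usd(42.0, "2022-06-30"), 42_000.0);
    assert_eq!(normalize_13f_value_usd(42.0, "2023-06-30"), 42.0);
    println!("ok");
}
```

Centralizing this comparison means parsers never need to know which era a filing belongs to; they pass the raw value and filing date through.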

Percentage Handling (Pct Type)

The Pct struct (src/normalize/pct.rs:31) is a type-safe wrapper around Decimal that enforces a 0–100 scale (e.g., 7.75 means 7.75%, not 0.0775).
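The pattern can be sketched as a newtype with a validating constructor; this simplified version wraps f64 instead of the crate's actual Decimal, and the method names are illustrative:

```rust
/// Illustrative sketch of a 0–100 scale percentage wrapper; the real Pct
/// type wraps rust_decimal::Decimal and lives in src/normalize/pct.rs.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pct(f64);

impl Pct {
    /// Construct from a value already on the 0–100 scale (7.75 => 7.75%).
    /// Rejects values outside [0, 100].
    fn from_pct(v: f64) -> Option<Self> {
        (0.0..=100.0).contains(&v).then(|| Pct(v))
    }

    /// Convert to a 0.0–1.0 fraction for arithmetic.
    fn as_fraction(self) -> f64 {
        self.0 / 100.0
    }
}

fn main() {
    let p = Pct::from_pct(7.75).unwrap();
    assert!((p.as_fraction() - 0.0775).abs() < 1e-12);
    assert!(Pct::from_pct(150.0).is_none());
    println!("ok");
}
```

The newtype prevents callers from accidentally mixing a 0–100 percentage with a 0.0–1.0 fraction in the same computation.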

Sources: src/normalize/thirteenf.rs:1-150 src/normalize/pct.rs:1-110


XML Parsers

The system uses quick-xml for high-performance, stream-based parsing of large SEC filings.
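quick-xml streams events rather than building a DOM, so filings of any size can be processed with bounded memory. The std-only sketch below illustrates the underlying "scan for a tag, take its inner text" idea the parsers apply to elements such as 13F's value; the real code uses quick_xml::Reader's event loop, not string search:

```rust
// Simplified stand-in for the quick-xml event loop: collect the text
// content of every occurrence of a given element. The tag name and the
// sample XML are illustrative, not taken from an actual filing.
fn extract_tag_texts<'a>(xml: &'a str, tag: &str) -> Vec<&'a str> {
    let open = format!("<{}>", tag);
    let close = format!("</{}>", tag);
    let mut out = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find(&open) {
        let after = &rest[start + open.len()..];
        match after.find(&close) {
            Some(end) => {
                out.push(&after[..end]);
                rest = &after[end + close.len()..];
            }
            None => break, // unclosed element: stop scanning
        }
    }
    out
}

fn main() {
    let xml = "<infoTable><value>1234</value></infoTable>\
               <infoTable><value>567</value></infoTable>";
    assert_eq!(extract_tag_texts(xml, "value"), vec!["1234", "567"]);
    println!("ok");
}
```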

N-PORT XML Parser

The parse_nport_xml function processes monthly portfolio holdings for registered investment companies.

Form 13F-HR Parser

The parse_13f_xml function extracts institutional investment manager holdings.

Form 4 Parser

The parse_form4_xml function parses insider trading transactions.

Sources: src/parsers/parse_nport_xml.rs:15-149 src/parsers/parse_13f_xml.rs:26-128 src/parsers/parse_form4_xml.rs:14-158


US GAAP Fundamentals Parser

The parse_us_gaap_fundamentals function (src/parsers/parse_us_gaap_fundamentals.rs:41) converts the SEC’s companyfacts JSON into a Polars DataFrame.

Deduplication & Sorting

The parser implements a “Last-in Wins” strategy for amended filings:

  1. Chronological Sort: Data is sorted by the filed date, descending (src/parsers/parse_us_gaap_fundamentals.rs:32-33).
  2. Pivot: During the pivot operation, .first() selects the most recent filing for any given fiscal period (fy/fp) (src/parsers/parse_us_gaap_fundamentals.rs:34-38).
  3. Metadata: Every row is prefixed with US_GAAP_META_COLUMNS, including accn and filing_url (src/parsers/parse_us_gaap_fundamentals.rs:12-21).
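The sort-then-first strategy can be sketched without Polars: sort newest-first by filed date, then keep only the first row seen for each (fy, fp) key. The Fact struct and field names here are hypothetical simplifications of the actual columnar data:

```rust
use std::collections::HashSet;

// Hypothetical simplified fact row; the real parser operates on Polars columns.
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    fy: i32,
    fp: &'static str,
    filed: &'static str, // ISO-8601 date: lexicographic order == chronological
    value: f64,
}

// "Last-in wins": sort newest-first, then keep the first row per fiscal
// period, mirroring the sort + pivot-with-first() combination.
fn dedupe_last_in_wins(mut facts: Vec<Fact>) -> Vec<Fact> {
    facts.sort_by(|a, b| b.filed.cmp(a.filed));
    let mut seen = HashSet::new();
    facts
        .into_iter()
        .filter(|f| seen.insert((f.fy, f.fp)))
        .collect()
}

fn main() {
    let facts = vec![
        Fact { fy: 2022, fp: "Q1", filed: "2022-05-01", value: 10.0 }, // original
        Fact { fy: 2022, fp: "Q1", filed: "2022-08-15", value: 12.0 }, // amendment
    ];
    let kept = dedupe_last_in_wins(facts);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].value, 12.0); // the amendment wins
    println!("ok");
}
```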

Magnitude Sanity Checks

Because filers sometimes make “scale errors” (e.g., reporting millions as ones), the parser applies magnitude sanity checks, such as comparing the reported fiscal year (fy) against the year of the period end date.
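One such check can be sketched as follows. This is an assumption based on the “fy vs end_year” step in the pipeline diagram below; the function name and tolerance are hypothetical:

```rust
// Illustrative plausibility check (hypothetical): flag a fact whose reported
// fiscal year is implausibly far from the year of its period end date,
// a common symptom of filer typo or scale errors.
fn fy_is_plausible(fy: i32, end_date: &str) -> bool {
    // end_date is ISO-8601, e.g. "2022-12-31"; take the 4-digit year prefix.
    end_date
        .get(..4)
        .and_then(|y| y.parse::<i32>().ok())
        .map(|end_year| (fy - end_year).abs() <= 1)
        .unwrap_or(false)
}

fn main() {
    assert!(fy_is_plausible(2022, "2022-12-31"));
    assert!(fy_is_plausible(2022, "2023-03-31")); // fiscal year offset by one
    assert!(!fy_is_plausible(2022, "1985-12-31")); // implausible: reject
    println!("ok");
}
```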

Sources: src/parsers/parse_us_gaap_fundamentals.rs:12-127


Data Flow Diagrams

```mermaid
graph TD
    subgraph "Natural Language Space"
        SEC_XML["SEC XML Source\n(13F / N-PORT)"]
        RawVal["'value' (13F)\n'pctVal' (N-PORT)"]
        FilingDate["'filingDate'"]
    end

    subgraph "Code Entity Space: Parsers"
        P13F["parse_13f_xml()"]
        PNPORT["parse_nport_xml()"]
    end

    subgraph "Code Entity Space: Normalize"
        N13F["normalize_13f_value_usd()"]
        NPCT["Pct::from_pct()"]
        EraCheck["is_13f_thousands_era()"]
    end

    SEC_XML --> P13F
    SEC_XML --> PNPORT
    P13F -->|passes raw decimal| N13F
    P13F -->|passes date| EraCheck
    EraCheck -->|returns bool| N13F
    PNPORT -->|passes raw decimal| NPCT
    N13F -->|Result| Model13F["ThirteenfHolding.value_usd"]
    NPCT -->|Result| ModelNPORT["NportInvestment.pct_val"]
```

From XML to Normalized Model

This diagram bridges the “Natural Language” SEC fields to the “Code Entities” in the normalize and parsers modules.

Sources: src/parsers/parse_13f_xml.rs:98 src/normalize/thirteenf.rs:144 src/parsers/parse_nport_xml.rs:104

```mermaid
graph LR
    subgraph "Input"
        J["JSON Value\n(companyfacts)"]
    end

    subgraph "src/parsers/parse_us_gaap_fundamentals.rs"
        Ext["Extraction Loop\n(fact_category_values, etc)"]
        Sanity["Magnitude Sanity Checks\n(fy vs end_year)"]
        DF["DataFrame::new()"]
        Sort["sort(['filed'], descending=true)"]
        Pivot["pivot(['fy', 'fp'])\naggregate: first()"]
    end

    subgraph "Output"
        TDF["TickerFundamentalsDataFrame"]
    end

    J --> Ext
    Ext --> Sanity
    Sanity --> DF
    DF --> Sort
    Sort --> Pivot
    Pivot --> TDF
```

US GAAP DataFrame Construction

The pipeline for converting raw JSON facts into an analysis-ready Polars DataFrame.

Sources: src/parsers/parse_us_gaap_fundamentals.rs:41-103 src/parsers/parse_us_gaap_fundamentals.rs:32-38


CSV and Text Parsers

Company Tickers

Master Index

  • parse_master_idx: Parses the quarterly master.idx files from EDGAR. It skips the header lines and extracts CIK, Company Name, Form Type, Date Filed, and File Name (URL) (src/parsers/parse_master_idx.rs:10-11).
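Data rows in master.idx are pipe-delimited. A minimal sketch of parsing one such row, assuming a hypothetical IdxEntry struct (the sample row values are fabricated for illustration):

```rust
// Hypothetical record type; the actual parse_master_idx output model may differ.
#[derive(Debug, PartialEq)]
struct IdxEntry {
    cik: String,
    company: String,
    form_type: String,
    date_filed: String,
    file_name: String,
}

// Parse one pipe-delimited data row of the form
// CIK|Company Name|Form Type|Date Filed|File Name.
// Returns None if the row has fewer than five fields.
fn parse_idx_line(line: &str) -> Option<IdxEntry> {
    let mut parts = line.splitn(5, '|');
    Some(IdxEntry {
        cik: parts.next()?.trim().to_string(),
        company: parts.next()?.trim().to_string(),
        form_type: parts.next()?.trim().to_string(),
        date_filed: parts.next()?.trim().to_string(),
        file_name: parts.next()?.trim().to_string(),
    })
}

fn main() {
    // Fabricated example row in the master.idx column layout.
    let line = "1000045|EXAMPLE CORP|10-Q|2023-02-14|edgar/data/1000045/example.txt";
    let e = parse_idx_line(line).unwrap();
    assert_eq!(e.cik, "1000045");
    assert_eq!(e.form_type, "10-Q");
    println!("ok");
}
```

The real parser additionally skips the header block at the top of the file before applying this per-row logic.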

Investment Companies

Sources: src/parsers.rs:10-21
