This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Parsers & Data Normalization
Relevant source files
- src/config/app_config.rs
- src/normalize/mod.rs
- src/normalize/pct.rs
- src/normalize/thirteenf.rs
- src/parsers.rs
- src/parsers/parse_13f_xml.rs
- src/parsers/parse_form4_xml.rs
- src/parsers/parse_nport_xml.rs
- src/parsers/parse_us_gaap_fundamentals.rs
- tests/config_manager_tests.rs
- tests/us_gaap_parser_accuracy_tests.rs
Purpose and Scope
The parsers and normalize modules form the data ingestion backbone of rust-sec-fetcher. While the network layer retrieves raw bytes (XML, JSON, or CSV), the parsers transform these into structured Rust models. The normalize module ensures that numeric inconsistencies across SEC eras—such as the transition from thousands-of-dollars to actual-dollars in Form 13F—are handled centrally rather than being scattered across the codebase.
Sources: src/parsers.rs:1-21 src/normalize/mod.rs:1-16
Normalization Logic
The normalize module is the single source of truth for scale conversions and unit adjustments. It prevents “inline conversions” in parsers, ensuring that logic for handling SEC schema changes is testable in isolation.
13F Value Normalization
The SEC changed the <value> unit in Form 13F-HR informationTable.xml filings around January 1, 2023.
- Legacy Era: Values reported in thousands of USD.
- Modern Era: Values reported in actual USD.
The function normalize_13f_value_usd src/normalize/thirteenf.rs:144-150 uses the THIRTEENF_THOUSANDS_ERA_CUTOFF constant src/normalize/thirteenf.rs:72 to determine whether a 1000x multiplier should be applied based on the filing_date.
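The era check can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the `Date` tuple, the `ASSUMED_CUTOFF` value (taken from the "around January 1, 2023" wording above), the strict `<` comparison, and the `u64` value type are all assumptions. The real function works with the project's own date and decimal types.

```rust
/// Hypothetical stand-in for a calendar date: (year, month, day).
/// Tuples compare lexicographically, so `<` gives chronological order.
type Date = (u16, u8, u8);

/// Assumed cutoff based on the prose; the real value lives in
/// THIRTEENF_THOUSANDS_ERA_CUTOFF and may differ.
const ASSUMED_CUTOFF: Date = (2023, 1, 1);

/// True when a filing falls in the legacy thousands-of-USD era.
/// Assumption: filings dated exactly on the cutoff count as modern.
fn is_thousands_era(filing_date: Date) -> bool {
    filing_date < ASSUMED_CUTOFF
}

/// Converts a raw <value> into actual USD.
/// u64 keeps this sketch std-only; the real code does not use u64.
fn normalize_value_usd(raw: u64, filing_date: Date) -> u64 {
    if is_thousands_era(filing_date) {
        // Legacy era: the filer reported thousands of dollars.
        raw.saturating_mul(1_000)
    } else {
        // Modern era: the filer reported actual dollars.
        raw
    }
}
```

Keeping the date comparison in one small function means a parser only has to pass the raw value and the filing date, which is the separation the normalize module is described as enforcing.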
Percentage Handling (Pct Type)
The Pct struct src/normalize/pct.rs:31 is a type-safe wrapper around Decimal that enforces a 0–100 scale (e.g., 7.75 means 7.75%, not 0.0775).
- Pct::from_pct: Used when the SEC already provides a 0–100 value (e.g., N-PORT pctVal). src/normalize/pct.rs:57-59
- Pct::from_ratio: Multiplies a 0–1 ratio by 100. src/normalize/pct.rs:76-78
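The two constructors can be illustrated with a small newtype. This is a sketch under stated assumptions: `PctSketch` and its `value` accessor are hypothetical names, and `f64` stands in for the Decimal that the real Pct wraps, so this version does not share Decimal's exactness.

```rust
/// Sketch of a percentage newtype on a 0-100 scale.
/// Assumption: f64 replaces the Decimal used by the real Pct.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PctSketch(f64);

impl PctSketch {
    /// Input is already on the 0-100 scale (e.g. N-PORT pctVal).
    fn from_pct(value: f64) -> Self {
        PctSketch(value)
    }

    /// Input is a 0-1 ratio; scale it up by 100.
    fn from_ratio(ratio: f64) -> Self {
        PctSketch(ratio * 100.0)
    }

    /// Returns the stored 0-100 value (hypothetical accessor).
    fn value(self) -> f64 {
        self.0
    }
}
```

Because the scale is fixed by the constructor a caller picks, code that receives a `PctSketch` never has to guess whether it holds 7.75 or 0.0775.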
Sources: src/normalize/thirteenf.rs:1-150 src/normalize/pct.rs:1-110
XML Parsers
The system uses quick-xml for high-performance, stream-based parsing of large SEC filings.
N-PORT XML Parser
The parse_nport_xml function processes monthly portfolio holdings for registered investment companies.
- Stream Processing: It iterates through invstOrSec tags src/parsers/parse_nport_xml.rs:30
- Fuzzy Matching: After parsing, it attempts to map holding names/titles to known Ticker symbols using Ticker::get_by_fuzzy_matched_name src/parsers/parse_nport_xml.rs:135-138
- Normalization: Percentages are wrapped in the Pct type via Pct::from_pct src/parsers/parse_nport_xml.rs:104-105
Form 13F-HR Parser
The parse_13f_xml function extracts institutional investment manager holdings.
- Two-Pass Logic:
  - Extracts all <infoTable> entries and normalizes USD values src/parsers/parse_13f_xml.rs:91-106
  - Calculates portfolio weights (weight_pct) by summing total value and calling compute_13f_weight_pct src/parsers/parse_13f_xml.rs:121-124
- Sorting: Returns holdings sorted by value_usd descending src/parsers/parse_13f_xml.rs:126
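The second pass and the final sort might look roughly like the sketch below. `HoldingSketch` and `finalize_holdings` are hypothetical names, the weight is computed inline rather than through compute_13f_weight_pct, `f64` replaces the project's numeric types, and returning a weight of 0 for an empty or zero-value portfolio is an assumption.

```rust
/// Hypothetical, trimmed-down holding record.
#[derive(Debug, Clone, PartialEq)]
struct HoldingSketch {
    name: String,
    value_usd: u64,
    weight_pct: f64,
}

/// Takes holdings whose value_usd is already normalized (first pass),
/// fills in weight_pct, and returns them sorted by value descending.
fn finalize_holdings(mut holdings: Vec<HoldingSketch>) -> Vec<HoldingSketch> {
    // The total is only known once every entry has been extracted,
    // which is why weights need a second pass.
    let total: u128 = holdings.iter().map(|h| h.value_usd as u128).sum();

    for h in holdings.iter_mut() {
        h.weight_pct = if total == 0 {
            0.0 // Assumption: avoid dividing by zero on empty portfolios.
        } else {
            h.value_usd as f64 / total as f64 * 100.0
        };
    }

    // Largest position first.
    holdings.sort_by(|a, b| b.value_usd.cmp(&a.value_usd));
    holdings
}
```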
Form 4 Parser
The parse_form4_xml function parses insider trading transactions.
- Table Support: Handles both nonDerivativeTable and derivativeTable src/parsers/parse_form4_xml.rs:47
- State Management: Uses a tag_stack to track nested <value> elements within parents like transactionShares src/parsers/parse_form4_xml.rs:79-88
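The tag-stack idea can be shown without an XML library. In this sketch the `XmlEvent` enum is a hypothetical, simplified stand-in for the start, text, and end events a streaming reader emits, and `values_under` is an illustrative helper rather than a function from the project.

```rust
/// Simplified parse events; a hypothetical stand-in for the events
/// produced by a streaming XML reader.
enum XmlEvent<'a> {
    Start(&'a str),
    Text(&'a str),
    End,
}

/// Collects the text of every <value> whose direct parent is `parent`.
fn values_under<'a>(events: &[XmlEvent<'a>], parent: &str) -> Vec<&'a str> {
    let mut tag_stack: Vec<&str> = Vec::new();
    let mut found = Vec::new();

    for event in events {
        match event {
            // Entering an element: remember where we are.
            XmlEvent::Start(name) => tag_stack.push(*name),
            // Leaving an element: forget the innermost tag.
            XmlEvent::End => {
                tag_stack.pop();
            }
            // Text only matters when the stack ends in parent/value.
            XmlEvent::Text(text) => {
                let n = tag_stack.len();
                if n >= 2 && tag_stack[n - 1] == "value" && tag_stack[n - 2] == parent {
                    found.push(*text);
                }
            }
        }
    }
    found
}
```

The stack is needed because <value> appears under many different parents, so the element name alone does not say which field the text belongs to.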
Sources: src/parsers/parse_nport_xml.rs:15-149 src/parsers/parse_13f_xml.rs:26-128 src/parsers/parse_form4_xml.rs:14-158
US GAAP Fundamentals Parser
The parse_us_gaap_fundamentals function src/parsers/parse_us_gaap_fundamentals.rs:41 converts the SEC’s companyfacts JSON into a Polars DataFrame.
Deduplication & Sorting
The parser implements a “Last-in Wins” strategy for amended filings:
- Chronological Sort: Data is sorted by the filed date descending src/parsers/parse_us_gaap_fundamentals.rs:32-33
- Pivot: During the pivot operation, it uses .first() to select the most recent filing for any given fiscal period (fy/fp) src/parsers/parse_us_gaap_fundamentals.rs:34-38
- Metadata: Every row is prefixed with US_GAAP_META_COLUMNS including accn and filing_url src/parsers/parse_us_gaap_fundamentals.rs:12-21
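The sort-then-first behavior can be reproduced on plain structs to show why the latest amendment wins. This sketch swaps the Polars sort and pivot for a `Vec` sort and a `HashSet`; `FactSketch` and `latest_per_period` are hypothetical names, and `filed` is assumed to be an ISO date string so that string order matches chronological order.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down fact row.
#[derive(Debug, Clone, PartialEq)]
struct FactSketch {
    fy: u16,
    fp: String,
    filed: String, // ISO date (YYYY-MM-DD), so string order is date order
    value: f64,
}

/// Keeps only the most recently filed row for each (fy, fp) pair.
fn latest_per_period(mut facts: Vec<FactSketch>) -> Vec<FactSketch> {
    // Newest filings first, mirroring the descending sort on `filed`.
    facts.sort_by(|a, b| b.filed.cmp(&a.filed));

    // The first row seen for a period is the newest one; later rows
    // for the same period are older filings and get dropped.
    let mut seen: HashSet<(u16, String)> = HashSet::new();
    facts
        .into_iter()
        .filter(|f| seen.insert((f.fy, f.fp.clone())))
        .collect()
}
```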
Magnitude Sanity Checks
Because filers often make “scale errors” (e.g., reporting millions as ones), the parser applies strictness checks:
- Annual (FY): If the fiscal year (fy) is greater than the calendar year of the end date, the record is discarded src/parsers/parse_us_gaap_fundamentals.rs:88-90
- Interim (Q1-Q3): Allows fy to be up to end_year + 1 to account for fiscal years ending early in a calendar year src/parsers/parse_us_gaap_fundamentals.rs:96-98
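The two rules above reduce to a small predicate. `fiscal_year_is_plausible` is a hypothetical name, and letting any period label other than FY and Q1-Q3 pass through is an assumption, since the text does not say how such rows are treated.

```rust
/// Returns true when `fy` is plausible for a fact whose period ends
/// in calendar year `end_year`.
fn fiscal_year_is_plausible(fy: i32, fp: &str, end_year: i32) -> bool {
    match fp {
        // Annual facts: fy may not exceed the end date's calendar year.
        "FY" => fy <= end_year,
        // Interim facts: one extra year of slack for fiscal years that
        // end early in a calendar year.
        "Q1" | "Q2" | "Q3" => fy <= end_year + 1,
        // Assumption: other labels are not checked by this sketch.
        _ => true,
    }
}
```

For example, a Q1 fact with fy = 2023 and an end date in 2022 is kept, while an FY fact with the same values is discarded.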
Sources: src/parsers/parse_us_gaap_fundamentals.rs:12-127
Data Flow Diagrams
graph TD
subgraph "Natural Language Space"
SEC_XML["SEC XML Source\n(13F / N-PORT)"]
RawVal["'value' (13F)\n'pctVal' (N-PORT)"]
FilingDate["'filingDate'"]
end
subgraph "Code Entity Space: Parsers"
P13F["parse_13f_xml()"]
PNPORT["parse_nport_xml()"]
end
subgraph "Code Entity Space: Normalize"
N13F["normalize_13f_value_usd()"]
NPCT["Pct::from_pct()"]
EraCheck["is_13f_thousands_era()"]
end
SEC_XML --> P13F
SEC_XML --> PNPORT
P13F -->|passes raw decimal| N13F
P13F -->|passes date| EraCheck
EraCheck -->|returns bool| N13F
PNPORT -->|passes raw decimal| NPCT
N13F -->|Result| Model13F["ThirteenfHolding.value_usd"]
NPCT -->|Result| ModelNPORT["NportInvestment.pct_val"]
From XML to Normalized Model
This diagram bridges the “Natural Language” SEC fields to the “Code Entities” in the normalize and parsers modules.
Sources: src/parsers/parse_13f_xml.rs:98 src/normalize/thirteenf.rs:144 src/parsers/parse_nport_xml.rs:104
graph LR
subgraph "Input"
J["JSON Value\n(companyfacts)"]
end
subgraph "src/parsers/parse_us_gaap_fundamentals.rs"
Ext["Extraction Loop\n(fact_category_values, etc)"]
Sanity["Magnitude Sanity Checks\n(fy vs end_year)"]
DF["DataFrame::new()"]
Sort["sort(['filed'], descending=true)"]
Pivot["pivot(['fy', 'fp'])\naggregate: first()"]
end
subgraph "Output"
TDF["TickerFundamentalsDataFrame"]
end
J --> Ext
Ext --> Sanity
Sanity --> DF
DF --> Sort
Sort --> Pivot
Pivot --> TDF
US GAAP DataFrame Construction
The pipeline for converting raw JSON facts into an analysis-ready Polars DataFrame.
Sources: src/parsers/parse_us_gaap_fundamentals.rs:41-103 src/parsers/parse_us_gaap_fundamentals.rs:32-38
CSV and Text Parsers
Company Tickers
- parse_company_tickers_json: Parses the SEC’s company_tickers.json, which maps CIKs to Tickers and Names src/parsers/parse_company_tickers.rs:19-20
- parse_ticker_txt: Parses the ticker.txt file used for broader ticker coverage src/parsers/parse_company_tickers.rs:20
Master Index
parse_master_idx: Parses the quarterly master.idx files from EDGAR. It skips header lines and extracts CIK, Company Name, Form Type, Date Filed, and File Name (URL) src/parsers/parse_master_idx.rs:10-11
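A rough sketch of this kind of line parser follows, assuming the index rows are pipe-delimited with the five fields in the order listed above. `IndexRowSketch` and `parse_master_idx_sketch` are hypothetical names, and treating "exactly five fields with a numeric CIK" as the test for a data row is an assumption about how header lines could be skipped, not a description of the project's logic.

```rust
/// Hypothetical row type holding the five extracted fields.
#[derive(Debug, PartialEq)]
struct IndexRowSketch {
    cik: u64,
    company_name: String,
    form_type: String,
    date_filed: String,
    file_name: String,
}

/// Parses index text, silently skipping anything that is not a data row.
fn parse_master_idx_sketch(text: &str) -> Vec<IndexRowSketch> {
    text.lines()
        .filter_map(|line| {
            let fields: Vec<&str> = line.split('|').collect();
            if fields.len() != 5 {
                return None; // Banner text, blank lines, separators.
            }
            // The column-header row has five fields too, but its first
            // field is not a number, so it is dropped here.
            let cik = fields[0].trim().parse::<u64>().ok()?;
            Some(IndexRowSketch {
                cik,
                company_name: fields[1].trim().to_string(),
                form_type: fields[2].trim().to_string(),
                date_filed: fields[3].trim().to_string(),
                file_name: fields[4].trim().to_string(),
            })
        })
        .collect()
}
```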
Investment Companies
parse_investment_companies_csv: Processes the investment_company_registrants_list.csv to identify entities registered under the Investment Company Act src/parsers/parse_investment_companies_csv.rs:13-14
Sources: src/parsers.rs:10-21