This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
CLI Binaries
Relevant source files
- src/bin/check_form_type_coverage.rs
- src/bin/pulls/fund_holdings.rs
- src/bin/pulls/us_gaap_bulk.rs
- src/bin/refresh_test_fixtures.rs
This page documents the standalone binary programs provided by the rust-sec-fetcher crate. These tools are located in src/bin/ and serve specialized purposes ranging from test fixture maintenance and enum validation to bulk data extraction for machine learning pipelines.
1. refresh-test-fixtures
The refresh-test-fixtures utility automates the retrieval of real SEC EDGAR data to serve as test fixtures. It ensures that integration tests operate against authentic, version-controlled data without requiring live network access during test execution.
Purpose and Usage
This binary should be run whenever new test cases are added or when existing fixtures need to be updated to reflect modern EDGAR schema changes (e.g., the 2023 change in 13F value reporting).
Implementation Details
The program iterates through a hardcoded manifest of Fixture structs src/bin/refresh_test_fixtures.rs:55-63. Each fixture defines a TickerSymbol, an output filename, and a FixtureKind, which determines the SEC endpoint to fetch.
Key Components:
- FixtureKind: An enum specifying the target data: Submissions, CompanyFacts, EightKPrimary, EightKFirstHtmlExhibit, or ThirteenF (with a specific accession number) src/bin/refresh_test_fixtures.rs:65-88
- Data Flow: The program resolves the ticker to a CIK src/bin/refresh_test_fixtures.rs:205, fetches the raw bytes from the SEC via SecClient, and saves them as Gzip-compressed files in tests/fixtures/ src/bin/refresh_test_fixtures.rs:197-230
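The manifest described above can be pictured with a minimal sketch. The variant names come from this page; the field names, the sample entries, the placeholder accession number, and the output_path helper are hypothetical and are not taken from the crate. Network access and Gzip compression are omitted.

```rust
use std::path::PathBuf;

/// Which SEC resource a fixture captures (variant names as listed on this page).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum FixtureKind {
    Submissions,
    CompanyFacts,
    EightKPrimary,
    EightKFirstHtmlExhibit,
    /// 13F fixtures pin one filing by accession number.
    ThirteenF { accession: String },
}

/// One manifest entry. Field names are hypothetical.
#[derive(Debug, Clone)]
struct Fixture {
    ticker: String,
    filename: String,
    kind: FixtureKind,
}

/// Location the compressed fixture would be written to.
fn output_path(fixture: &Fixture) -> PathBuf {
    PathBuf::from("tests/fixtures").join(&fixture.filename)
}

/// Illustrative manifest; the entries are made up for this sketch.
fn manifest() -> Vec<Fixture> {
    vec![
        Fixture {
            ticker: "AAPL".to_string(),
            filename: "aapl_submissions.json.gz".to_string(),
            kind: FixtureKind::Submissions,
        },
        Fixture {
            ticker: "EXAMPLE".to_string(),
            filename: "example_13f.xml.gz".to_string(),
            // Placeholder accession number, not a real filing.
            kind: FixtureKind::ThirteenF {
                accession: "0000000000-00-000000".to_string(),
            },
        },
    ]
}

fn main() {
    for fixture in manifest() {
        println!(
            "{} {:?} -> {}",
            fixture.ticker,
            fixture.kind,
            output_path(&fixture).display()
        );
    }
}
```

Keeping the endpoint choice in an enum means a new fixture type needs one new variant plus one fetch branch, rather than changes scattered across the manifest.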
Fixture Generation Flow
Sources: src/bin/refresh_test_fixtures.rs:90-173 src/bin/refresh_test_fixtures.rs:178-240
2. check-form-type-coverage
This binary validates the completeness of the FormType enum against actual data in the EDGAR Master Index. It performs both “Forward” and “Reverse” coverage checks.
Coverage Logic
- Forward Check: Ensures every variant defined in the FormType enum (that isn’t marked as retired) actually appears in recent SEC filings src/bin/check_form_type_coverage.rs:16-19
- Reverse Check: Identifies any form types appearing frequently in the most recent quarter (above MINIMUM_FILINGS_THRESHOLD) that are not currently represented in the enum src/bin/check_form_type_coverage.rs:20-22
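The two checks amount to set differences in opposite directions. The sketch below assumes the enum is represented as a list of (name, retired) pairs and the index as a map from form type to filing count; both representations, the function names, the threshold value of 10, and the sample form names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical value; the real MINIMUM_FILINGS_THRESHOLD is not stated on this page.
const MINIMUM_FILINGS_THRESHOLD: usize = 10;

/// Forward check: non-retired known form types that never appear in the index.
fn forward_missing(known: &[(&str, bool)], seen: &HashMap<String, usize>) -> Vec<String> {
    let mut missing: Vec<String> = known
        .iter()
        .filter(|entry| !entry.1 && !seen.contains_key(entry.0))
        .map(|entry| entry.0.to_string())
        .collect();
    missing.sort();
    missing
}

/// Reverse check: frequently filed form types that the known list does not contain.
fn reverse_missing(known: &[(&str, bool)], seen: &HashMap<String, usize>) -> Vec<String> {
    let known_names: HashSet<&str> = known.iter().map(|entry| entry.0).collect();
    let mut unknown: Vec<String> = seen
        .iter()
        .filter(|(name, count)| {
            **count > MINIMUM_FILINGS_THRESHOLD && !known_names.contains(name.as_str())
        })
        .map(|(name, _)| name.clone())
        .collect();
    unknown.sort();
    unknown
}

fn main() {
    // (name, retired) pairs standing in for the enum; sample data only.
    let known = [("10-K", false), ("8-K", false), ("OLD-FORM", true)];

    let mut seen: HashMap<String, usize> = HashMap::new();
    seen.insert("10-K".to_string(), 12);
    seen.insert("NEW-FORM".to_string(), 11);
    seen.insert("RARE".to_string(), 1);

    println!("forward gaps: {:?}", forward_missing(&known, &seen));
    println!("reverse gaps: {:?}", reverse_missing(&known, &seen));
}
```

In this sample, the retired entry is skipped by the forward check, and the rarely filed type stays below the threshold, so neither is reported.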
Usage
Technical Implementation
The program calculates the last_completed_quarter src/bin/check_form_type_coverage.rs:51-61 and scans backwards up to MAX_LOOKBACK_QUARTERS (default 8) src/bin/check_form_type_coverage.rs:34. It uses fetch_edgar_master_index to retrieve the list of all filings for those periods src/bin/check_form_type_coverage.rs:110.
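As an illustration of the quarter arithmetic, here is a minimal sketch that derives the last completed quarter from a calendar year and month and then lists the candidate quarters to scan. The signatures and the helper names previous_quarter and lookback_quarters are assumptions; the actual binary may obtain the date differently and may stop scanning early.

```rust
/// Matches the default named on this page.
const MAX_LOOKBACK_QUARTERS: usize = 8;

/// (year, quarter) of the most recent fully ended quarter, given today's year and month (1-12).
fn last_completed_quarter(year: i32, month: u32) -> (i32, u32) {
    let current_quarter = (month - 1) / 3 + 1;
    if current_quarter == 1 {
        (year - 1, 4)
    } else {
        (year, current_quarter - 1)
    }
}

/// The quarter immediately before the given one, wrapping across year boundaries.
fn previous_quarter((year, quarter): (i32, u32)) -> (i32, u32) {
    if quarter == 1 {
        (year - 1, 4)
    } else {
        (year, quarter - 1)
    }
}

/// Quarters to scan, newest first, capped at MAX_LOOKBACK_QUARTERS.
fn lookback_quarters(year: i32, month: u32) -> Vec<(i32, u32)> {
    let mut quarters = Vec::with_capacity(MAX_LOOKBACK_QUARTERS);
    let mut current = last_completed_quarter(year, month);
    for _ in 0..MAX_LOOKBACK_QUARTERS {
        quarters.push(current);
        current = previous_quarter(current);
    }
    quarters
}

fn main() {
    for (year, quarter) in lookback_quarters(2024, 5) {
        println!("{} Q{}", year, quarter);
    }
}
```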
Sources: src/bin/check_form_type_coverage.rs:1-40 src/bin/check_form_type_coverage.rs:72-146
3. pull-us-gaap-bulk
The pull-us-gaap-bulk binary is the primary data ingestion tool for the narrative_stack ML pipeline. It sweeps all primary-listed companies and extracts their XBRL fundamentals.
Purpose and Usage
It fetches CompanyFacts for every ticker and flattens the complex JSON structure into a tabular CSV format suitable for training autoencoders or dimensionality reduction models.
Data Flow and Constraints
- Primary Listings Only: It calls fetch_company_tickers(client, false) to exclude warrants, units, and preferred shares, preventing duplicate data for the same CIK src/bin/pulls/us_gaap_bulk.rs:54-55
- Transformation: It utilizes fetch_us_gaap_fundamentals, which converts the SEC’s hierarchical XBRL data into a Polars DataFrame src/bin/pulls/us_gaap_bulk.rs:72-73
- Persistence: Data is written using polars::prelude::CsvWriter src/bin/pulls/us_gaap_bulk.rs:78-81
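To make the flattening step concrete, the sketch below turns a simplified concept -> unit -> facts hierarchy into long-format CSV rows using only the standard library. It does not use Polars and does not reproduce fetch_us_gaap_fundamentals; the Fact struct, its field names, the column layout, and the sample values are all assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// One reported value. Field names are hypothetical.
struct Fact {
    period_end: String,
    value: f64,
}

/// Simplified hierarchy: concept -> unit -> reported facts.
type Facts = BTreeMap<String, BTreeMap<String, Vec<Fact>>>;

/// Emit one CSV row per fact, with the nesting levels repeated as columns.
fn flatten_to_csv(facts: &Facts) -> String {
    let mut csv = String::from("concept,unit,period_end,value\n");
    for (concept, units) in facts {
        for (unit, entries) in units {
            for fact in entries {
                csv.push_str(&format!(
                    "{},{},{},{}\n",
                    concept, unit, fact.period_end, fact.value
                ));
            }
        }
    }
    csv
}

/// Made-up sample data, not real filings.
fn sample_facts() -> Facts {
    let mut facts = Facts::new();
    facts
        .entry("Assets".to_string())
        .or_default()
        .entry("USD".to_string())
        .or_default()
        .push(Fact { period_end: "2023-12-31".to_string(), value: 1.5 });
    facts
        .entry("Revenues".to_string())
        .or_default()
        .entry("USD".to_string())
        .or_default()
        .push(Fact { period_end: "2023-12-31".to_string(), value: 2.5 });
    facts
}

fn main() {
    print!("{}", flatten_to_csv(&sample_facts()));
}
```

A BTreeMap keeps the rows in a stable, sorted order, which makes the output reproducible between runs.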
Sources: src/bin/pulls/us_gaap_bulk.rs:1-33 src/bin/pulls/us_gaap_bulk.rs:45-95
4. pull-fund-holdings
This binary targets the investment management domain: it fetches N-PORT holdings for all registered investment companies (ETFs and mutual funds).
Purpose and Usage
It iterates through the SEC’s investment company dataset, finds the latest N-PORT-P (monthly portfolio holdings) filing for each fund, and exports the holdings to CSV.
Implementation Logic
- Dataset Retrieval: Calls fetch_investment_company_series_and_class_dataset to get the master list of funds src/bin/pulls/fund_holdings.rs:74-75
- Filing Discovery: For each fund ticker, it resolves the CIK and calls fetch_nport_filings to find the most recent submission src/bin/pulls/fund_holdings.rs:96-114
- Holdings Extraction: It fetches the specific N-PORT XML, parses the investment table, and normalizes the data src/bin/pulls/fund_holdings.rs:126-134
- Partitioned Storage: To avoid directories with tens of thousands of files, it organizes output by the first letter of the ticker (e.g., data/fund-holdings/S/SPY.csv) src/bin/pulls/fund_holdings.rs:137-155
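The partitioning scheme in the last bullet can be sketched as a small path helper. The function name holdings_path, the uppercase normalization, and the handling of empty tickers are assumptions rather than details confirmed by the source.

```rust
use std::path::PathBuf;

/// Build `<base>/<first letter>/<TICKER>.csv`.
/// Returns None for an empty ticker. Uppercasing is an assumption of this sketch.
fn holdings_path(base: &str, ticker: &str) -> Option<PathBuf> {
    let upper = ticker.trim().to_ascii_uppercase();
    let first = upper.chars().next()?;
    Some(
        PathBuf::from(base)
            .join(first.to_string())
            .join(format!("{}.csv", upper)),
    )
}

fn main() {
    for ticker in ["SPY", "qqq", ""] {
        match holdings_path("data/fund-holdings", ticker) {
            Some(path) => println!("{:?} -> {}", ticker, path.display()),
            None => println!("{:?} -> skipped (empty ticker)", ticker),
        }
    }
}
```

Bucketing by first letter caps each directory at the number of tickers sharing that letter, which keeps directory listings manageable.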
Fund Processing Pipeline
Sources: src/bin/pulls/fund_holdings.rs:1-38 src/bin/pulls/fund_holdings.rs:74-157