
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

US GAAP Distribution Analyzer


The us-gaap-dist-analyzer is a specialized Rust sub-crate designed to perform unsupervised clustering and distribution analysis of US GAAP (Generally Accepted Accounting Principles) financial concepts. It bridges the gap between raw SEC filing data and semantic financial modeling by grouping taxonomical labels based on their contextual and mathematical distributions.

Purpose and Scope

The analyzer processes large-scale financial datasets to identify patterns in how public companies report specific financial metrics. By utilizing BERT embeddings for semantic understanding and K-Means clustering for statistical grouping, the tool allows researchers to visualize the high-dimensional space of US GAAP concepts and identify synonymous or related reporting items that may not share identical taxonomical names.


Technical Architecture

The system follows a linear pipeline that transforms raw US GAAP column names and their associated values into a clustered spatial representation.

Data Flow and Pipeline

The transformation process moves from high-dimensional natural language space to a reduced numerical space for efficient clustering.

  1. Embedding Generation: Converts US GAAP concept strings into vector representations using a BERT-based transformer model.
  2. Normalization: Applies scaling so that concept values (magnitudes) do not disproportionately bias the clustering.
  3. Dimensionality Reduction: Uses Principal Component Analysis (PCA) to project high-dimensional embeddings into a lower-dimensional space while preserving variance.
  4. Clustering: Runs the K-Means algorithm to group concepts into k distinct clusters.
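The normalization step above can be illustrated with a minimal, std-only Rust sketch. The `min_max_scale` helper is an illustrative assumption, not an actual crate API; the real pipeline applies its scaling through the linear-algebra stack listed in the dependency table.

```rust
/// Min-max scaling of raw concept values into [0, 1], so that large reported
/// magnitudes (e.g. revenues in the billions) do not dominate distance
/// computations during clustering. Illustrative sketch only; the crate's
/// actual scaling strategy may differ (e.g. z-score standardization).
fn min_max_scale(values: &[f64]) -> Vec<f64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    if range == 0.0 {
        // Degenerate case: all values identical, nothing to separate.
        return vec![0.0; values.len()];
    }
    values.iter().map(|v| (v - min) / range).collect()
}

fn main() {
    // Hypothetical raw magnitudes of one concept across three filings.
    let raw = [1_000_000.0, 250_000_000.0, 4_000_000_000.0];
    let scaled = min_max_scale(&raw);
    println!("{:?}", scaled); // all values now lie in [0, 1]
}
```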

System Components Diagram

The following diagram illustrates the relationship between the logical analysis steps and the underlying implementation components.

US GAAP Analysis Pipeline

graph TD
    subgraph "Natural Language Space"
        A["US GAAP Concept Names"]
        B["distill_us_gaap_fundamental_concepts"]
    end

    subgraph "Vector Space (Code Entities)"
        C["BERT Embedding Engine"]
        D["PCA (Principal Component Analysis)"]
        E["K-Means Clusterer"]
    end

    subgraph "Output & Analysis"
        F["Cluster Centroids"]
        G["Concept Distribution Maps"]
    end

    A --> B
    B -- "Normalized Strings" --> C
    C -- "High-Dim Vectors" --> D
    D -- "Reduced Vectors" --> E
    E --> F
    E --> G
Sources: us-gaap-dist-analyzer/Cargo.lock:1-217 us-gaap-dist-analyzer/Cargo.lock:198-217


Implementation Details

Dependency Stack

The analyzer relies on several heavy-duty mathematical and machine learning libraries to perform its operations.

| Component | Library / Crate | Purpose |
| --- | --- | --- |
| Embeddings | rust-bert / tch | Loading and executing transformer models for semantic encoding. |
| Linear Algebra | ndarray / nalgebra | Matrix operations for PCA and distance calculations. |
| Clustering | linfa-clustering | Implementation of the K-Means algorithm. |
| Data Handling | polars | High-performance DataFrame operations for managing large SEC datasets. |
| Caching | cached-path | Managing local storage of model weights and pre-computed embeddings. |

Sources: us-gaap-dist-analyzer/Cargo.lock:198-201 us-gaap-dist-analyzer/Cargo.lock:53-61

Key Logic Flow

The analyzer’s execution logic centers on the transition from raw SEC data to categorized clusters.
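The dimensionality-reduction stage can be sketched in std-only Rust via power iteration on the covariance matrix, which recovers the top principal component. This is an illustrative toy (2-dimensional input, a single component); the crate itself operates on 768-dimensional embedding matrices through ndarray/nalgebra.

```rust
/// Top principal component via power iteration on the covariance matrix.
/// Std-only illustrative sketch; real PCA here is done over high-dimensional
/// BERT embeddings with ndarray/nalgebra, not fixed-size arrays.
fn top_component(data: &[[f64; 2]]) -> [f64; 2] {
    let n = data.len() as f64;
    // Center the data around its mean.
    let mean = data
        .iter()
        .fold([0.0, 0.0], |m, p| [m[0] + p[0] / n, m[1] + p[1] / n]);
    // Build the 2x2 covariance matrix.
    let mut c = [[0.0; 2]; 2];
    for p in data {
        let d = [p[0] - mean[0], p[1] - mean[1]];
        for i in 0..2 {
            for j in 0..2 {
                c[i][j] += d[i] * d[j] / n;
            }
        }
    }
    // Power iteration: repeatedly apply C and renormalize; converges to the
    // eigenvector with the largest eigenvalue (the direction of max variance).
    let mut v = [1.0, 0.0];
    for _ in 0..100 {
        let w = [c[0][0] * v[0] + c[0][1] * v[1], c[1][0] * v[0] + c[1][1] * v[1]];
        let norm = (w[0] * w[0] + w[1] * w[1]).sqrt();
        v = [w[0] / norm, w[1] / norm];
    }
    v
}

fn main() {
    // Points spread along the y = x diagonal: principal axis is ~(0.707, 0.707).
    let data = [[0.0, 0.0], [1.0, 1.1], [2.0, 1.9], [3.0, 3.05]];
    let v = top_component(&data);
    println!("principal axis = ({:.3}, {:.3})", v[0], v[1]);
}
```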

Clustering Logic Flow

sequenceDiagram
    participant Data as "SEC Data (Polars DataFrame)"
    participant BERT as "BERT Encoder"
    participant DimRed as "PCA Transformer"
    participant Cluster as "K-Means Engine"

    Data->>BERT: Extract Concept Labels
    BERT->>BERT: Generate 768-dim Embeddings
    BERT->>DimRed: Pass Embedding Matrix
    DimRed->>DimRed: Compute Covariance & Eigenvectors
    DimRed->>Cluster: Reduced Matrix (n_components)
    Cluster->>Cluster: Iterative Centroid Assignment
    Cluster-->>Data: Append 'cluster_id' to Labels

Sources: us-gaap-dist-analyzer/Cargo.lock:44-50 us-gaap-dist-analyzer/Cargo.lock:151-157
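The "Iterative Centroid Assignment" step in the diagram can be sketched as a toy one-dimensional K-Means with k = 2. This is a std-only illustration with hypothetical values; the crate delegates the real work to linfa-clustering.

```rust
/// One-dimensional K-Means (k = 2): alternate between assigning each point
/// to its nearest centroid and moving each centroid to the mean of its
/// members. Std-only toy; linfa-clustering provides the production version.
fn kmeans_1d(points: &[f64], mut centroids: [f64; 2], iters: usize) -> ([f64; 2], Vec<usize>) {
    let mut labels = vec![0usize; points.len()];
    for _ in 0..iters {
        // Assignment step: attach each point to its nearest centroid.
        for (i, p) in points.iter().enumerate() {
            labels[i] = if (p - centroids[0]).abs() <= (p - centroids[1]).abs() { 0 } else { 1 };
        }
        // Update step: move each centroid to the mean of its members.
        for k in 0..2 {
            let members: Vec<f64> = points
                .iter()
                .zip(&labels)
                .filter(|(_, &l)| l == k)
                .map(|(p, _)| *p)
                .collect();
            if !members.is_empty() {
                centroids[k] = members.iter().sum::<f64>() / members.len() as f64;
            }
        }
    }
    (centroids, labels)
}

fn main() {
    // Two obvious groups of normalized concept values.
    let points = [0.1, 0.2, 0.15, 0.9, 0.95, 1.0];
    let (centroids, labels) = kmeans_1d(&points, [0.0, 1.0], 10);
    println!("centroids = {:?}, labels = {:?}", centroids, labels);
}
```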


Data Integration

The us-gaap-dist-analyzer works in tandem with the narrative_stack Python components and the core Rust sec-fetcher library.

Relationship to Fundamental Concepts

While the Rust core defines a strict taxonomy in the FundamentalConcept enum [3.4 US GAAP Concept Transformation], this analyzer is used to discover new relationships or validate the existing taxonomy by observing how concepts are actually used in the wild.

  1. Input: The analyzer typically consumes the output of pull-us-gaap-bulk or the UsGaapStore.
  2. Processing: It clusters concepts like CashAndCashEquivalentsAtCarryingValue and CashAndCashEquivalentsPeriodIncreaseDecrease to see if they consistently appear in the same reporting “neighborhood.”
  3. Validation: Results are used to refine the mapping patterns used in distill_us_gaap_fundamental_concepts.
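The "neighborhood" check in the processing step reduces to a distance measure between concept embeddings. A minimal cosine-similarity sketch follows; the 3-dimensional vectors are toy stand-ins for the 768-dimensional BERT embeddings the analyzer actually produces, and their values are invented for illustration.

```rust
/// Cosine similarity between two embedding vectors; values near 1.0 mean
/// the underlying concepts occupy the same semantic "neighborhood".
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Toy 3-dim stand-ins for embeddings of two related cash concepts
    // and one unrelated concept.
    let cash_carrying = [0.8, 0.1, 0.1];
    let cash_increase = [0.75, 0.15, 0.1];
    let unrelated = [0.0, 0.1, 0.9];
    println!("related:   {:.3}", cosine_similarity(&cash_carrying, &cash_increase));
    println!("unrelated: {:.3}", cosine_similarity(&cash_carrying, &unrelated));
}
```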

Performance Considerations

  • Memory Usage: Because it handles large embedding matrices, the crate utilizes ndarray for memory-efficient slicing and tch (LibTorch) for GPU-accelerated tensor operations when available.
  • Persistence: Clustering models and PCA projections are often serialized to disk using serde to allow incremental analysis of new filing batches without re-training the entire distribution map.
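The persistence idea can be sketched as a centroid round-trip to disk. This std-only text format (and the file path) is an illustrative stand-in for the serde-based serialization the crate uses; the helper names are assumptions, not crate APIs.

```rust
use std::fs;

/// Persist cluster centroids to disk and read them back, so a new filing
/// batch can be assigned to existing clusters without re-training.
/// Std-only text format as an illustrative stand-in for serde serialization.
fn save_centroids(path: &str, centroids: &[Vec<f64>]) -> std::io::Result<()> {
    let body: String = centroids
        .iter()
        .map(|c| c.iter().map(|v| v.to_string()).collect::<Vec<_>>().join(","))
        .collect::<Vec<_>>()
        .join("\n");
    fs::write(path, body)
}

fn load_centroids(path: &str) -> std::io::Result<Vec<Vec<f64>>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|l| l.split(',').map(|v| v.parse().unwrap()).collect())
        .collect())
}

fn main() -> std::io::Result<()> {
    let centroids = vec![vec![0.15, 0.2], vec![0.95, 0.9]];
    save_centroids("centroids.txt", &centroids)?;
    let restored = load_centroids("centroids.txt")?;
    // Rust's f64 Display is shortest-roundtrip, so this is lossless.
    assert_eq!(centroids, restored);
    Ok(())
}
```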

Sources: us-gaap-dist-analyzer/Cargo.lock:209-211 us-gaap-dist-analyzer/Cargo.lock:165-175
