This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
US GAAP Distribution Analyzer
The us-gaap-dist-analyzer is a specialized Rust sub-crate designed to perform unsupervised clustering and distribution analysis of US GAAP (Generally Accepted Accounting Principles) financial concepts. It bridges the gap between raw SEC filing data and semantic financial modeling by grouping taxonomical labels based on their contextual and mathematical distributions.
Purpose and Scope
The analyzer processes large-scale financial datasets to identify patterns in how public companies report specific financial metrics. By utilizing BERT embeddings for semantic understanding and K-Means clustering for statistical grouping, the tool allows researchers to visualize the high-dimensional space of US GAAP concepts and identify synonymous or related reporting items that may not share identical taxonomical names.
Technical Architecture
The system follows a linear pipeline that transforms raw US GAAP column names and their associated values into a clustered spatial representation.
Data Flow and Pipeline
The transformation process moves from high-dimensional natural language space to a reduced numerical space for efficient clustering.
- Embedding Generation : Converts US GAAP concept strings into vector representations using a BERT-based transformer model.
- Normalization : Applies scaling to ensure that concept values (magnitudes) do not disproportionately bias the clustering.
- Dimensionality Reduction : Uses Principal Component Analysis (PCA) to project high-dimensional embeddings into a lower-dimensional space while preserving variance.
- Clustering : Executes the K-Means algorithm to group concepts into `k` distinct clusters.
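The normalization step above can be sketched in plain Rust. This is a hypothetical illustration only; the function name and the choice of z-score scaling are assumptions, not details taken from the crate's source:

```rust
/// Hypothetical sketch: z-score normalization of concept values so that
/// large magnitudes (e.g. total revenue vs. per-share figures) do not
/// dominate the distance metric used later by K-Means.
fn z_score_normalize(values: &[f64]) -> Vec<f64> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    if std_dev == 0.0 {
        // All values identical: map everything to zero.
        return vec![0.0; values.len()];
    }
    values.iter().map(|v| (v - mean) / std_dev).collect()
}

fn main() {
    let raw = [1_000.0, 2_000.0, 3_000.0];
    let normalized = z_score_normalize(&raw);
    println!("{:?}", normalized); // zero mean, unit variance
}
```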
System Components Diagram
The following diagram illustrates the relationship between the logical analysis steps and the underlying implementation components.
US GAAP Analysis Pipeline
graph TD
subgraph "Natural Language Space"
A["US GAAP Concept Names"]
B["distill_us_gaap_fundamental_concepts"]
end
subgraph "Vector Space (Code Entities)"
C["BERT Embedding Engine"]
D["PCA (Principal Component Analysis)"]
E["K-Means Clusterer"]
end
subgraph "Output & Analysis"
F["Cluster Centroids"]
G["Concept Distribution Maps"]
end
A --> B
B -- "Normalized Strings" --> C
C -- "High-Dim Vectors" --> D
D -- "Reduced Vectors" --> E
E --> F
E --> G
Sources: us-gaap-dist-analyzer/Cargo.lock:1-217 us-gaap-dist-analyzer/Cargo.lock:198-217
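The PCA node in the diagram can be illustrated with a std-only toy: power iteration on a covariance matrix to recover the first principal component. This is a sketch under stated assumptions, not the crate's implementation; a real pipeline would use a linear-algebra crate (e.g. `ndarray` or `nalgebra`) and retain several components, not one:

```rust
/// Toy dimensionality reduction: compute the 2x2 covariance matrix of
/// centered 2-D vectors, then use power iteration to find its top
/// eigenvector (the first principal component).
fn first_principal_component(points: &[[f64; 2]]) -> [f64; 2] {
    let n = points.len() as f64;
    // Column means for centering.
    let mut mean = [0.0; 2];
    for p in points {
        mean[0] += p[0] / n;
        mean[1] += p[1] / n;
    }
    // Covariance matrix of the centered data.
    let mut cov = [[0.0; 2]; 2];
    for p in points {
        let c = [p[0] - mean[0], p[1] - mean[1]];
        for i in 0..2 {
            for j in 0..2 {
                cov[i][j] += c[i] * c[j] / n;
            }
        }
    }
    // Power iteration: repeatedly apply the matrix and renormalize;
    // the vector converges to the dominant eigenvector.
    let mut v = [1.0, 0.0];
    for _ in 0..100 {
        let w = [
            cov[0][0] * v[0] + cov[0][1] * v[1],
            cov[1][0] * v[0] + cov[1][1] * v[1],
        ];
        let norm = (w[0] * w[0] + w[1] * w[1]).sqrt();
        v = [w[0] / norm, w[1] / norm];
    }
    v
}

fn main() {
    // Points lying on the line y = x: the first PC is (1/√2, 1/√2).
    let data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [-1.0, -1.0]];
    println!("{:?}", first_principal_component(&data));
}
```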
Implementation Details
Dependency Stack
The analyzer relies on several heavy-duty mathematical and machine learning libraries to perform its operations.
| Component | Library / Crate | Purpose |
|---|---|---|
| Embeddings | rust-bert / tch | Loading and executing transformer models for semantic encoding. |
| Linear Algebra | ndarray / nalgebra | Matrix operations for PCA and distance calculations. |
| Clustering | linfa-clustering | Implementation of the K-Means algorithm. |
| Data Handling | polars | High-performance DataFrame operations for managing large SEC datasets. |
| Caching | cached-path | Managing local storage of model weights and pre-computed embeddings. |
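A dependency section consistent with the table above might look like the following. The crate names match the table, but the version numbers are illustrative placeholders, not values read from the actual `Cargo.lock`:

```toml
# Illustrative only: versions are placeholders, not taken from Cargo.lock.
[dependencies]
rust-bert = "0.22"        # transformer models for semantic encoding
tch = "0.14"              # LibTorch bindings used by rust-bert
ndarray = "0.15"          # matrix operations for PCA
nalgebra = "0.32"         # additional linear algebra
linfa-clustering = "0.7"  # K-Means implementation
polars = "0.36"           # DataFrame handling for SEC datasets
cached-path = "0.6"       # local caching of model weights
```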
Sources: us-gaap-dist-analyzer/Cargo.lock:198-201 us-gaap-dist-analyzer/Cargo.lock:53-61
Key Logic Flow
The analyzer’s execution logic is centered around the transition from raw SEC data to categorized clusters.
Clustering Logic Flow
sequenceDiagram
participant Data as "SEC Data (Polars DataFrame)"
participant BERT as "BERT Encoder"
participant DimRed as "PCA Transformer"
participant Cluster as "K-Means Engine"
Data->>BERT: Extract Concept Labels
BERT->>BERT: Generate 768-dim Embeddings
BERT->>DimRed: Pass Embedding Matrix
DimRed->>DimRed: Compute Covariance & Eigenvectors
DimRed->>Cluster: Reduced Matrix (n_components)
Cluster->>Cluster: Iterative Centroid Assignment
Cluster-->>Data: Append 'cluster_id' to Labels
Sources: us-gaap-dist-analyzer/Cargo.lock:44-50 us-gaap-dist-analyzer/Cargo.lock:151-157
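The "iterative centroid assignment" step in the sequence above can be sketched with a minimal std-only K-Means. The crate reportedly delegates this to `linfa-clustering`; this toy version (2-D points, fixed iteration count, caller-supplied seeds) only illustrates the assign-then-update loop:

```rust
/// Minimal K-Means sketch: alternate between assigning each point to its
/// nearest centroid and moving each centroid to the mean of its members.
fn k_means(
    points: &[[f64; 2]],
    mut centroids: Vec<[f64; 2]>,
    iters: usize,
) -> (Vec<[f64; 2]>, Vec<usize>) {
    let k = centroids.len();
    let mut labels = vec![0usize; points.len()];
    for _ in 0..iters {
        // Assignment step: each point joins its nearest centroid.
        for (i, p) in points.iter().enumerate() {
            labels[i] = (0..k)
                .min_by(|&a, &b| {
                    dist2(p, &centroids[a])
                        .partial_cmp(&dist2(p, &centroids[b]))
                        .unwrap()
                })
                .unwrap();
        }
        // Update step: move each centroid to the mean of its members.
        for c in 0..k {
            let members: Vec<&[f64; 2]> = points
                .iter()
                .zip(&labels)
                .filter(|&(_, &l)| l == c)
                .map(|(p, _)| p)
                .collect();
            if members.is_empty() {
                continue; // keep an empty cluster's centroid in place
            }
            let n = members.len() as f64;
            centroids[c] = [
                members.iter().map(|p| p[0]).sum::<f64>() / n,
                members.iter().map(|p| p[1]).sum::<f64>() / n,
            ];
        }
    }
    (centroids, labels)
}

fn dist2(a: &[f64; 2], b: &[f64; 2]) -> f64 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

fn main() {
    // Two obvious groups; seed one centroid near each.
    let points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]];
    let (centroids, labels) = k_means(&points, vec![[0.0, 0.0], [5.0, 5.0]], 10);
    println!("{:?} {:?}", centroids, labels);
}
```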
Data Integration
The us-gaap-dist-analyzer works in tandem with the narrative_stack Python components and the core Rust sec-fetcher library.
Relationship to Fundamental Concepts
While the Rust core defines a strict taxonomy in the FundamentalConcept enum [3.4 US GAAP Concept Transformation], this analyzer is used to discover new relationships or validate the existing taxonomy by observing how concepts are actually used in the wild.
- Input : The analyzer typically consumes the output of `pull-us-gaap-bulk` or the `UsGaapStore`.
- Processing : It clusters concepts like `CashAndCashEquivalentsAtCarryingValue` and `CashAndCashEquivalentsPeriodIncreaseDecrease` to see if they consistently appear in the same reporting "neighborhood."
- Validation : Results are used to refine the mapping patterns used in `distill_us_gaap_fundamental_concepts`.
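One way to quantify such a "neighborhood" check is cosine similarity between two concept embeddings. The function below is a generic sketch, not code from the crate, and the vector values in it are fabricated; real embeddings would come from the BERT encoder:

```rust
/// Cosine similarity between two embedding vectors: 1.0 means identical
/// direction, 0.0 means orthogonal (semantically unrelated, roughly).
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Fabricated low-dimensional stand-ins for two concept embeddings.
    let carrying_value = [0.8, 0.1, 0.6];
    let period_change = [0.7, 0.2, 0.5];
    println!("{:.3}", cosine_similarity(&carrying_value, &period_change));
}
```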
Performance Considerations
- Memory Usage : Because it handles large embedding matrices, the crate utilizes `ndarray` for memory-efficient slicing and `tch` (LibTorch) for GPU-accelerated tensor operations when available.
- Persistence : Clustering models and PCA projections are often serialized to disk using `serde` to allow for incremental analysis of new filing batches without re-training the entire distribution map.
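As a simplified, std-only stand-in for that serde-based persistence, the sketch below round-trips K-Means centroids through a plain text format so a later run could resume from saved state. A real implementation would use serde with a binary or JSON format; the function names here are hypothetical:

```rust
/// Serialize centroids as one comma-separated line per centroid.
fn centroids_to_string(centroids: &[Vec<f64>]) -> String {
    centroids
        .iter()
        .map(|c| c.iter().map(|v| v.to_string()).collect::<Vec<_>>().join(","))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Parse the same text format back into centroid vectors.
fn centroids_from_string(s: &str) -> Vec<Vec<f64>> {
    s.lines()
        .map(|line| line.split(',').map(|v| v.parse().unwrap()).collect())
        .collect()
}

fn main() {
    let centroids = vec![vec![0.05, 0.1], vec![5.1, 4.95]];
    let saved = centroids_to_string(&centroids);
    let restored = centroids_from_string(&saved);
    assert_eq!(centroids, restored); // lossless round trip
    println!("round-trip ok");
}
```

Rust's `f64` `Display` output is round-trip exact, which is what makes the text format lossless here.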
Sources: us-gaap-dist-analyzer/Cargo.lock:209-211 us-gaap-dist-analyzer/Cargo.lock:165-175