Tesla's .smol Patent: Data Format IP for Counsel
Your ML infrastructure team may have already built something similar to what Tesla is trying to patent. On April 4, 2024, WIPO published Tesla's PCT application WO2024073080, which describes a proprietary file format for storing and accessing large-scale sensor and video data in AI training pipelines. Internally, Tesla calls it .smol. The interesting legal question is not whether the engineering is clever. It is whether the claims describe a defensible technical invention or a familiar way of organizing data for faster access.
For in-house counsel at tech companies with ML data infrastructure, the answer matters. If Tesla secures broad claims on hybrid columnar-row file formats with header-based indexing, companies building competing systems could face freedom-to-operate questions. The analysis below separates what the application actually discloses from what secondary commentary infers, then walks through the IP risk framework your team should apply.
Last updated: March 2026. This page is informational only and not legal advice. Consult a patent attorney for your specific situation.
WO2024073080 is a PCT application, not an issued patent. Claims can still change substantially during national-phase prosecution.
Section 101 eligibility is the primary battleground. Whether the format claims survive turns on Enfish vs. Alice.
Competing prior-art formats (Parquet, ORC, HDF5, Arrow, Zarr, TFRecord) each implement variations of the claimed techniques.
The reported IOPS reduction appears only in secondary commentary, not the patent document. Treat it as anecdotal until independently replicated.
What the Application Actually Discloses
WO2024073080 describes a hybrid data layout that nests columnar organization inside a row-based structure. Traditional row-based formats like CSV store entire records sequentially. Columnar formats like Apache Parquet store all values for a single field together. Tesla's format does both: it groups records into row-based segments but organizes the fields within each segment in a columnar layout.
The file header contains an index that maps field names to their byte offsets within each segment. This enables selective retrieval. A training pipeline can read only the timestamps or only the lidar readings from a segment without loading the entire record into memory. The stated goal is to cut the number of I/O operations each read requires, because IOPS (I/O operations per second) is the primary throughput constraint when training ML models on petabyte-scale fleet data.
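The layout described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Tesla's format: the JSON header, the 4-byte length prefix, and the names `write_file` and `read_field` are all hypothetical choices made for this example, and nothing here is taken from the claims.

```python
import io
import json
import struct


def write_file(segments):
    """Serialize segments as: [4-byte header length][JSON header index][column blobs].

    `segments` is a list of dicts, one per row-based segment, each mapping a
    field name to the bytes of that field's column for the segment's records.
    """
    index = []          # one entry per segment: field name -> [offset, length]
    body = bytearray()  # all column blobs, laid out segment by segment
    for segment in segments:
        entry = {}
        for name, blob in segment.items():
            entry[name] = [len(body), len(blob)]  # offset is relative to the body start
            body += blob
        index.append(entry)
    header = json.dumps(index).encode("utf-8")
    return struct.pack("<I", len(header)) + header + bytes(body)


def read_field(f, segment_no, name):
    """Return one field's bytes from one segment, seeking past everything else."""
    f.seek(0)
    (header_len,) = struct.unpack("<I", f.read(4))
    index = json.loads(f.read(header_len))
    offset, length = index[segment_no][name]
    f.seek(4 + header_len + offset)
    return f.read(length)


# Two segments of three records each; the lidar column stands in for a large payload.
segments = [
    {"timestamp": struct.pack("<3d", 0.0, 0.1, 0.2), "lidar": bytes(3000)},
    {"timestamp": struct.pack("<3d", 0.3, 0.4, 0.5), "lidar": bytes(3000)},
]
f = io.BytesIO(write_file(segments))
timestamps = struct.unpack("<3d", read_field(f, 1, "timestamp"))
```

Reading the second segment's timestamps touches the header and 24 bytes of column data, and never reads the lidar payload. A production format would presumably cache the parsed header rather than re-read it on every call.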
| Claim | Source | Reliability |
|---|---|---|
| Hybrid columnar-within-row layout with header-based field indexing | Patent application (WO2024073080) | Primary |
| Selective field retrieval without loading full records into memory | Patent application (WO2024073080) | Primary |
| 4× reduction in IOPS for AI training workloads | Social media & third-party commentary | Anecdotal |
| 400% training throughput improvement | Secondary analysis, no benchmark methodology | Anecdotal |
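The two anecdotal figures in the table are not interchangeable, which is one more reason to treat them with caution. The arithmetic below is a minimal sketch; the function names are hypothetical, and it assumes the figures are read literally.

```python
def pct_fewer_operations(reduction_factor):
    """Percent fewer I/O operations implied by an N-fold reduction."""
    return (1 - 1 / reduction_factor) * 100


def pct_improvement(speedup_factor):
    """Percent improvement implied by an N-fold speedup."""
    return (speedup_factor - 1) * 100


def speedup_from_pct(pct):
    """Speedup factor implied by a stated percent improvement."""
    return 1 + pct / 100


fewer_ops = pct_fewer_operations(4)   # a 4x IOPS reduction is 75% fewer operations
as_percent = pct_improvement(4)       # a 4x throughput gain is a 300% improvement
as_factor = speedup_from_pct(400)     # a literal 400% improvement is a 5x gain
```

Commentary that uses "4×" and "400%" as synonyms is describing either a 4× or a 5× change, and without a benchmark methodology there is no way to tell which.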
Section 101 Eligibility: Enfish or Alice?
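The central eligibility question is whether Tesla's claims describe an improvement to the functioning of a computer system or an abstract idea of organizing data. The answer depends on how the claims fare under the Supreme Court's two-step Alice framework, and specifically on whether they fall closer to the Federal Circuit's decision in Enfish or to the cases that treat data organization as an abstract idea.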
In Enfish LLC v. Microsoft Corp., 822 F.3d 1327 (Fed. Cir. 2016), the court held that claims directed to a self-referential table in a database were patent-eligible because they improved the computer's own functioning. The self-referential structure was a specific technical mechanism, not just a result. Making the computer itself work better was sufficient. No improvement outside the computer was required.
Tesla's application has structural similarities to Enfish. The hybrid columnar-within-row layout combined with header-based field indexing describes a specific technical mechanism for reducing unnecessary disk I/O. If the claims recite that mechanism with sufficient particularity, they may survive Step 1 of the Alice/Mayo framework as a concrete improvement to computer storage technology.
Efficiency gains alone do not guarantee eligibility. Courts have repeatedly rejected claims that amount to organizing data or accelerating known database operations when the claims recite only a desired result rather than specific technical means. If Tesla's claims are drafted at a high level of abstraction ("a file format that enables selective field retrieval"), a challenger could argue they are directed to the abstract idea of organizing information efficiently.
Enfish-Favorable Drafting
Claims that recite the specific hybrid layout, the header index structure, and the selective-read mechanism as a concrete technical architecture. Analogous to the self-referential table in Enfish.
Alice-Vulnerable Drafting
Claims drafted around the concept of reducing IOPS through format optimization without specifying the technical architecture. Directed to an abstract data-organization scheme.
What to Look For
Read the actual claims. If they recite the specific hybrid layout, the header index, and the selective-read mechanism, the Enfish analogy is strong. If they claim only a result, they are vulnerable.
The Key Principle
The outcome turns on claim drafting, not just the underlying technology. A clever engineering solution can still produce an ineligible patent claim if the drafter reaches too high.
Prior Art and Obviousness Under §§ 102 and 103
The most likely challenge to WO2024073080 is that its constituent techniques are individually well-known. Columnar data layouts have been standard in analytics databases for over a decade. File-level index metadata that maps fields to byte ranges is a core feature of binary formats like HDF5, Apache ORC, and Parquet, although ORC and Parquet store that metadata in a file footer rather than a header. Selective field retrieval is the defining characteristic of columnar storage. Compression within segments is implemented across nearly every modern data format.
Apache Parquet and ORC
Columnar formats widely used in ML pipelines. Both implement columnar storage with file-level metadata (kept in a footer) that enables selective column reads, directly analogous to Tesla's stated benefits.
HDF5 and Zarr
Hierarchical and chunked array formats common in scientific computing and large-scale ML. Both support structured indexing and selective retrieval of data subsets.
Apache Arrow
In-memory columnar format optimized for analytics and ML. Defines a standardized layout for heterogeneous data types, including nested and variable-length fields.
TFRecord and Memory-Mapped Key-Value Stores
TensorFlow's native sequential format and a memory-mapped key-value store. Both are purpose-built for ML training pipelines and address similar throughput constraints.
Proprietary Internal Formats
Many teams build custom binary formats with integrated indexing, compression, and layout optimizations for specific workloads. If undocumented, this art may not surface in a standard prior art search.
The question under 35 U.S.C. § 103 is whether combining these known techniques in a single format, specifically optimized for heterogeneous autonomous driving data (video, lidar, radar, CAN bus, timestamps, metadata), would have been obvious to a person of ordinary skill. Tesla's argument for non-obviousness likely rests on the specific combination and organization of these elements, not on any single component.
Section 112: Enablement and Written Description
A third vulnerability is whether the application provides enough implementation detail to support the scope of its claims under 35 U.S.C. § 112. This matters most if Tesla is claiming a broad family of file formats rather than a single narrowly disclosed implementation.
Autonomous driving data is unusually heterogeneous. A single training example might include synchronized video frames from multiple cameras, lidar point clouds, radar returns, CAN bus telemetry, GPS coordinates, and timestamped metadata. Storing and selectively retrieving fields from such multimodal records is a harder engineering problem than doing the same for homogeneous tabular data.
If the claims cover selective retrieval across all these data modalities, the specification must enable a skilled person to implement the format for each modality without undue experimentation. Claims that describe outcomes ("faster reads for multimodal sensor data") without fully enabling the architecture for each data type could face written-description or enablement challenges.
Strategic Implications for Your Team
Tesla's filing signals a broader strategy: protect not just autonomous driving algorithms but the data infrastructure that feeds those algorithms. Companies building competing autonomous systems or general-purpose ML data platforms could face freedom-to-operate questions if Tesla secures broad claims on file-format optimizations for ML training pipelines.
WO2024073080 can enter the national phase in individual patent offices, and each office will examine it separately. Track any resulting U.S. application for claim amendments and examiner rejections. Issued claims may look very different from the PCT publication.
Identify published work on hybrid columnar-row formats, header-based indexing in binary files, and selective field retrieval in ML training contexts. If your team has internal documentation of similar approaches predating Tesla's priority date, preserve it.
Many infrastructure teams build custom binary formats and treat them as internal tooling rather than patentable innovations. If your format solves a concrete technical problem in a specific, non-obvious way, the patent window may still be open.
Even at the PCT stage, understanding the claim boundaries helps you plan design-arounds or identify licensing exposure early. Map your format's architecture to the specific elements recited in the current claims.
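One way to start that mapping is a simple claim chart kept alongside the format's design docs. The sketch below is illustrative only: the three elements are paraphrased from this article's summary of the application, not quoted from the claims, and `our_format`, `build_claim_chart`, and `unmatched_elements` are hypothetical names.

```python
# Hypothetical claim elements, paraphrased from this article's summary of the
# application. Replace them with the actual claim language before relying on this.
CLAIM_ELEMENTS = {
    "hybrid_layout": "records grouped into row-based segments with columnar fields inside each segment",
    "header_index": "file header index mapping field names to byte offsets within each segment",
    "selective_read": "retrieval of a single field without loading the full record",
}


def build_claim_chart(claim_elements, format_features):
    """Pair each claim element with the matching feature of your format, or None."""
    return [
        {"element": key, "claim_text": text, "our_feature": format_features.get(key)}
        for key, text in claim_elements.items()
    ]


def unmatched_elements(chart):
    """List the claim elements for which no counterpart was recorded."""
    return [row["element"] for row in chart if row["our_feature"] is None]


# Hypothetical in-house format that keeps its index in a trailing footer, not a header.
our_format = {
    "hybrid_layout": "row groups containing per-column chunks",
    "selective_read": "column projection at read time",
}
chart = build_claim_chart(CLAIM_ELEMENTS, our_format)
gaps = unmatched_elements(chart)
```

An element with no counterpart in your format is a starting point for a design-around discussion with patent counsel, not a conclusion about infringement. Rebuild the chart whenever the claims are amended.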
What This Means for Data-Infrastructure Patents Generally
Tesla's .smol format is one data point in a larger trend. Companies are increasingly seeking patent protection on the infrastructure layer of their ML stacks, not just the models themselves. Data formats, training pipeline architectures, data labeling workflows, and feature stores are all active targets.
Apply three questions to any data-infrastructure patent you encounter or consider filing:
1. Does the claim recite a specific technical mechanism (layout, structure, protocol) or merely a desired result like faster reads or lower IOPS? The former tracks Enfish. The latter risks Alice.
2. Is the claimed combination of known techniques one a skilled infrastructure engineer would arrive at given the same constraints? The answer often depends on how heterogeneous and demanding the target workload is.
3. Does the specification enable the full scope of the claims across all the data types and use cases they purport to cover? Broad claims need broad enablement.
Frequently Asked Questions
Is WO2024073080 an issued patent?
No. WO2024073080 is a PCT application published by WIPO on April 4, 2024. It is not an issued patent in any jurisdiction. The claims are subject to change during national-phase prosecution and have not been examined or allowed by any patent office. Any freedom-to-operate analysis should account for potential claim amendments before and after national-phase entry.
How much of Tesla's claimed format is actually novel versus known techniques?
The individual components (columnar layouts, header-based indexes, selective field retrieval, segment-based compression) are all well-established in storage engineering. Formats like Parquet, ORC, HDF5, and Arrow implement variations of these techniques. Tesla's potential novelty rests on the specific combination and organization of these elements for heterogeneous autonomous driving data. Whether that combination is non-obvious under 35 U.S.C. § 103 is the key question, and it has not yet been tested by an examiner.
Would these claims survive an Alice challenge?
It depends on claim drafting. If the claims recite the specific hybrid columnar-within-row layout, the header index structure, and the selective-read mechanism as a concrete technical architecture, they have a reasonable argument under Enfish LLC v. Microsoft Corp., 822 F.3d 1327 (Fed. Cir. 2016). If the claims are drafted broadly around the concept of reducing IOPS without specifying the technical means, a challenger could argue they are directed to an abstract data-organization scheme. The outcome hinges on specificity.
Is the reported 4× IOPS reduction reliable?
The 4× figure appears in secondary commentary and social media discussion, not in the patent document itself. No published benchmark methodology, workload specification, or independent replication supports it. Treat it as anecdotal. If you are assessing the patent's technical merit for licensing or litigation purposes, conduct independent testing against your own workloads and data characteristics.
Could this patent create freedom-to-operate issues for ML platform teams?
Potentially, but the risk depends on the final issued claims. If Tesla secures broad claims covering hybrid columnar-row formats with header indexing for ML training data, companies building custom binary formats for similar workloads could face exposure. If the claims are narrowed during prosecution to Tesla's specific implementation details, the risk decreases significantly. Monitor the national-phase prosecution in your key jurisdictions, particularly the USPTO application, and map claim amendments against your own format architecture.
Should my company consider patenting our own internal data formats?
If your data format solves a concrete technical problem in a specific, non-obvious way, it may be worth disclosing. Many engineering teams treat custom binary formats as internal tooling and miss the patent window. Document the problem your format solves, the specific technical mechanism it uses, and how that mechanism differs from existing formats. Start with a problem-first disclosure and have patent counsel evaluate eligibility under § 101, novelty under § 102, and non-obviousness under § 103.