3DCodeVerse: Towards Building the Universe via Code
Gallery
Everything you see here is built by code, from scratch.
Datasets & Progress
| Name | Status | Attribute | Language | Subfolders | Count | Source | License | Note |
|---|---|---|---|---|---|---|---|---|
| 3DCodeBench | Completed | 3D Objects | | | 4.3k | | ✓ Commercial · ✓ Academic · BSD-3-Clause | Samples deduplicated from the 26K-sample 3DCodeBench corpus. |
| ShaderToy | Captioning | Shader | | | 120k | | ✗ Commercial · ✓ Academic · CC BY-NC-SA | A public platform where anyone can write and share OpenGL shaders. |
| DeepCAD | Captioning | CAD Objects | | | 174k | | ✓ Commercial · ✓ Academic · MIT | DeepCAD's 178k command sequences → standalone CadQuery via GenCAD-Code's converter; STEP + OCC checked → 174k usable, dedup → 128k unique. Fidelity vs ground truth: IoU 0.985 (96.9% > 0.9); the small tail is tagged per-sample for filtering. Each sample carries Text2CAD captions + renders (PNG / STEP / GLB). |
| Articraft | Completed | Articulated Objects | | | 13k | | ✓ Commercial · ✓ Academic · CC BY 4.0 | Work in progress: to be filtered into Blender Python code, but may be multi-file. |
| Thingiverse-OpenSCAD | Curating | 3D Objects | | | 7k | | ✗ Commercial · ✓ Academic · Varies | OpenSCAD (.scad) models scraped from Thingiverse (declarative CSG scripts). Currently curating; per-item licenses vary and are being checked before release. |
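The DeepCAD note reports conversion fidelity as IoU against ground truth and tags the low-scoring tail per sample. As a rough illustration of that metric, not the project's actual evaluation code, the sketch below computes volumetric IoU over voxel occupancy sets and applies the 0.9 cut-off quoted in the table. The names `voxel_iou`, `fidelity_tag`, and the tag strings are hypothetical.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def voxel_iou(a: Iterable[Voxel], b: Iterable[Voxel]) -> float:
    """Intersection-over-union of two voxel occupancy sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 1.0  # assumption: two empty shapes count as identical
    return len(sa & sb) / len(union)


def fidelity_tag(iou: float, threshold: float = 0.9) -> str:
    """Hypothetical per-sample tag; 0.9 mirrors the cut-off in the table."""
    return "ok" if iou > threshold else "low_fidelity"


def box(nx: int, ny: int, nz: int) -> Set[Voxel]:
    """Solid axis-aligned box of unit voxels, used only as toy input."""
    return {(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)}


reference = box(4, 4, 4)  # 64 voxels
converted = box(4, 4, 3)  # 48 voxels, one layer missing
score = voxel_iou(reference, converted)
print(score, fidelity_tag(score))  # 0.75 low_fidelity
```

In practice the occupancy sets would come from voxelizing the ground-truth and regenerated solids at the same resolution; that step is outside this sketch.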
Data format
Every entry is a self-contained, runnable project: the code, its rendered outputs, a metadata card, and text captions. One source folder sits at the top, coarse-category subfolders beneath it (the same ones in the table above), and one folder per sample inside each subfolder.
3dcodebench/                          one source folder per origin
  factories_geo/                      subfolder: a coarse category
    <sample>/                         one self-contained, runnable sample
      code.py                         the code: single file, or many; runs as-is
      renders/                        rendered outputs for this sample
        view_00.png view_01.png …     multi-view rendered images
        video.mp4                     video (optional)
        object.glb                    3D mesh: .glb / .obj / .stl (optional)
      meta.json                       identity card (see below)
      captions.json                   text descriptions, keyed by type (see below)
  factories_tex/ …                    same structure

meta.json: the identity card
Declares what the sample is and exactly how to run it — type, code dialect, entry point, the runtime environment, where it came from, and who curated it.
{
"id": "factories_geo/0007",
"type": "3D Objects", // Attribute — what it is
"language": "Blender Python", // code dialect
"entry": "code.py", // file to run
"multi_file": false, // single file, or a multi-file project
"environment": "Blender 5.0", // exact runtime needed to execute
"renders": [ // artifacts under renders/
"renders/view_00.png",
"renders/object.glb"
],
"source": "3DCodeBench", // origin dataset
"license": "commercial-ok", // reuse terms
"curator": "yipeng", // who reviewed / curated it
"status": "curated" // curated | pending | revise
}

captions.json: text descriptions, keyed by type
A flat { type: caption } map: one object can carry several captions from different angles (brief, detailed, geometry, function…). It is kept separate from meta.json so captions can grow without touching the identity card.
{
"brief": "A wooden four-drawer dresser.",
"detailed": "A rectangular wooden dresser with four stacked drawers, each with a small round knob, a flat top and short tapered legs.",
"geometry": "Box body ~0.8 x 0.45 x 1.1 m, split into four equal drawer compartments; cylindrical handles centered on each drawer front.",
"function": "Bedroom storage; the four drawers slide out along the front face."
}

On our radar: candidate sources
3DCodeVerse aims to unify multiple dialects of 3D code under one quality-gated roof. Today: Infinigen (organic / natural, Blender Python). On our radar are the sources below, all permissively licensed and so combinable with attribution:
Mechanical / parametric CAD code
| Dataset | Code form | Scale · License | Why it fits |
|---|---|---|---|
| CADFS (FeatureScript) | Onshape FeatureScript + text, multi-view, STEP / STL | ~450k models · 382,609 dedup · ~90.5 GB · CC BY 4.0 | Closest to real-world “3D code”: executable scripts with full design history (sketch, extrude, revolve, sweep, loft, fillet, chamfer, shell, boolean, pattern). Already shipped as text→code / image→code JSONL. |
| CAD-Coder / GenCAD-Code (CadQuery, Python) | CadQuery Python + rendered image | 163k image–code pairs · Apache-2.0 | Most LLM-friendly CAD code: CadQuery is Python, trivially standalone. Complements FeatureScript with Python-CAD. |
| Omni-CAD / CAD-MLLM (command seq, JSON) | CAD command-sequence JSON + text + multi-view + point clouds | ~450k · ~1.25 GB · MIT | Strong multimodal grounding (image / text / point → CAD seq), cheap to ingest. Needs an executor / translator. |
| SketchGraphs + DeepCAD (sketch / construction seq) | 2D sketch constraint graphs / 3D construction sequences | 15M sketches (~43 GB) · 178k models · MIT * | Foundational CAD grammar: sketches + constraints; DeepCAD convertible to CadQuery / FeatureScript. * Onshape sketch-copyright caveat. |
Scene-level procedural code
| Dataset | Code form | Scale · License | Why it fits |
|---|---|---|---|
| ProcTHOR-10K (house JSON + Python gen) | Procedural house generator (AI2-THOR) | 1,633 objects · 108 categories · 3,278 materials · Apache-2.0 | Best fit for scene-level code: rooms, placement, materials, lighting, interaction state. Complements Infinigen’s natural scenes; convertible to Blender scene scripts. |
Open challenges
Where the harness work lives.
- Translators / executors: command-sequence → code, .blend → Blender Python, AI2-THOR → Blender scene scripts, FeatureScript ↔ CadQuery.
- Cross-dataset dedup: DeepCAD ↔ CADFS ↔ Omni-CAD ↔ CAD-Coder share lineage; geometry / ID dedup before merge.
- License & attribution tracking: keep provenance per entry so the merged corpus stays redistributable.
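The dedup and provenance points above can be combined in one pass. The following is a minimal sketch, assuming Python-based dialects such as CadQuery: it fingerprints code after stripping comments and layout differences, keeps the first occurrence, and records where duplicates came from so attribution survives the merge. It only catches near-identical source text; geometry-level dedup would need a separate mesh or B-rep hash. The function names, record fields, and the source / sample IDs in the demo are all hypothetical.

```python
import hashlib
import io
import tokenize

_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
_MARKERS = {tokenize.NEWLINE: "<NL>", tokenize.INDENT: "<IN>", tokenize.DEDENT: "<DE>"}


def code_fingerprint(source: str) -> str:
    """SHA-256 of the token stream, ignoring comments, blank lines and spacing."""
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        # Structural tokens become fixed markers so indentation width is ignored.
        parts.append(_MARKERS.get(tok.type, tok.string))
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


def merge_unique(entries):
    """entries: iterable of (source_name, sample_id, code).

    Keeps the first occurrence of each fingerprint and lists later
    duplicates under 'also_in' so provenance is not lost.
    """
    kept = {}
    for source_name, sample_id, code in entries:
        fp = code_fingerprint(code)
        ref = f"{source_name}/{sample_id}"
        if fp in kept:
            kept[fp]["also_in"].append(ref)
        else:
            kept[fp] = {"origin": ref, "code": code, "also_in": []}
    return list(kept.values())


a = "import cadquery as cq\n\n# a box\nresult = cq.Workplane('XY').box(1, 2, 3)\n"
b = "import cadquery as cq\nresult = cq.Workplane('XY').box(1,2,3)  # same box\n"
c = "import cadquery as cq\nresult = cq.Workplane('XY').sphere(2)\n"

unique = merge_unique([("deepcad", "0001", a), ("cad-coder", "a17", b), ("deepcad", "0002", c)])
for record in unique:
    print(record["origin"], record["also_in"])
```

The code strings are only tokenized, never executed, so the sketch runs without CadQuery installed.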
Best practices
Lessons from building the pipeline so far.
1. Coding agent does the conversion
   A coding agent transcribes each asset into standalone, runnable code: the heavy lifting of the asset → code transform.
2. Human feedback closes the loop
   Humans keep / drop / revise; their feedback drives the auto-refine loop instead of hand-rewriting code.
3. Give the agent eyes
   Build visualization tools the coding agent can call (render previews, diffs) so it verifies and self-corrects its output instead of working blind.
4. Anomaly detection as a final gate
   After curation, sweep for outliers (over-long code files, excessive character counts, degenerate geometry) to catch what slips past human review.
5. Scale with Claude Code
   The 20× (Max) plan plus headless mode (claude -p) lets you batch-convert and clean data in bulk, not one file at a time.
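The final-gate sweep in item 4 can start very simply. The sketch below flags over-long code files using the median and the median absolute deviation (MAD) of character counts, which stays stable even when a few files are enormous. This is one possible heuristic, not the pipeline's actual detector; the function name, the factor `k = 5`, and the sample IDs are assumptions.

```python
import statistics


def flag_length_outliers(lengths: dict, k: float = 5.0) -> list:
    """Return IDs whose character count exceeds median + k * MAD.

    lengths maps a sample ID to the character count of its code file.
    """
    values = list(lengths.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = mad if mad > 0 else 1.0  # avoid a zero threshold when all sizes match
    return sorted(sid for sid, n in lengths.items() if (n - median) / scale > k)


char_counts = {"0001": 1200, "0002": 1350, "0003": 1100, "0004": 1280, "0005": 98000}
print(flag_length_outliers(char_counts))  # ['0005']
```

The same pattern applies to other scalar signals such as vertex count or bounding-box volume, which is one way to surface degenerate geometry.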
Contribute data — how it works
Bring your own (code → 3D) projects into the corpus with the 3dcode toolkit. Data uploads straight from your own machine to a private staging bucket, where it is auto-validated and dedup-checked; a maintainer then reviews it and ingests it into the corpus. Your bandwidth, not ours: nothing routes through a central server.
How data is organized
One source folder per origin; one subfolder per data sample. Each sample is an independent, runnable project: code at the top, renders + mesh in renders/, and a meta.json identity card.
<source>/            one source per origin
  <sample>/          one folder per data sample
    model.py         the code (runnable)
    renders/         rendered images + .glb mesh
    prompt.txt       text prompt
    meta.json        auto-generated (id, dialect, hashes, status)

Three steps
1. Install
   pipx install "git+https://github.com/gaoypeng/3dcode_toolkit"
2. Configure
   3dcode config set --token <from a maintainer>
3. Push
   3dcode push ./data --source you
- Validates: layout + anomaly checks before anything uploads
- Fingerprints: code / geometry / visual hashes; flags duplicates vs the corpus
- Uploads direct: from your machine to the private bucket, no central bottleneck
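To check a folder before pushing, a local layout check along the following lines can help. This is a hedged sketch and not the toolkit's own validator: it assumes the layout described under "How data is organized" (model.py, prompt.txt, and a renders/ folder with at least one image) and skips meta.json because that file is described as auto-generated. The names `check_sample` and `check_source` and the accepted image suffixes are assumptions.

```python
from pathlib import Path

REQUIRED_FILES = ("model.py", "prompt.txt")   # assumption, taken from the layout above
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}    # assumption


def check_sample(sample_dir: Path) -> list:
    """Return a list of human-readable problems for one sample folder."""
    problems = []
    for name in REQUIRED_FILES:
        path = sample_dir / name
        if not path.is_file():
            problems.append(f"missing {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty {name}")
    renders = sample_dir / "renders"
    if not renders.is_dir():
        problems.append("missing renders/")
    elif not any(p.suffix.lower() in IMAGE_SUFFIXES for p in renders.iterdir()):
        problems.append("renders/ has no images")
    return problems


def check_source(source_dir: Path) -> dict:
    """Map each sample folder name under source_dir to its list of problems."""
    return {
        p.name: check_sample(p)
        for p in sorted(source_dir.iterdir())
        if p.is_dir()
    }
```

Calling `check_source(Path("./data"))` returns an empty list for every sample that passes, so a non-empty value points at the folders to fix before running the push step.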
Toolkit & docs: github.com/gaoypeng/3dcode_toolkit
The 3D-code landscape
A concise, web-verified map of the languages and open datasets for representing 3D as code: the knowledge base 3DCodeVerse draws on. Licenses fall into three groups: permissive, non-commercial / restricted, and unclear. Always re-check a source’s license before use.
3D code dialects — languages & formats
- CadQuery (Python): Fluent Python parametric solid CAD on the OpenCASCADE kernel; exports STEP/STL.
- build123d (Python): Modern Pythonic B-rep CAD (algebraic + builder modes); a CadQuery sibling.
- OpenSCAD (.scad): “The programmer’s solid 3D CAD modeller”; declarative functional CSG scripting.
- JSCAD (JavaScript): Parametric 2D/3D design in JS, runnable in the browser or CLI.
- FeatureScript (Onshape DSL): Onshape’s language for custom parametric features, with a 3D-math type system.
- Blender Python (bpy) (Python in Blender): Scripts that drive Blender’s whole pipeline; the host for Infinigen-style generators.
- Geometry Nodes (Blender nodes): Visual, non-destructive procedural geometry (transpilable to Python).
- L-systems (L-Py) (grammar + Python): Lindenmayer rewriting rules for branching/growth structures, e.g. plants.
- URDF (XML): Declarative description of a single robot (links + joints); the ROS standard.
- SDFormat (XML): Describes whole worlds (robots, objects, physics, lighting) beyond URDF’s single body.
- MJCF (MuJoCo XML): MuJoCo’s native physics model format (kinematics, actuators, contacts).
- glTF (.gltf / .glb): A transmission/runtime format; baked results, NOT generative code (the delivery baseline).
- ShapeAssembly DSL (research DSL): Builds shapes by declaring + attaching hierarchical cuboid part proxies.
- CGA (CityEngine) (.cga): Rule-based shape grammar for mass-generating buildings/cities (proprietary).
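To make one of these dialects concrete, the rewriting step behind an L-system fits in a few lines. The sketch below is plain Python, not L-Py's API: it applies Lindenmayer's classic two-symbol "algae" rules in parallel to every symbol of the string. Turning the resulting string into branching geometry (the turtle-interpretation step) is left out.

```python
def rewrite(axiom: str, rules: dict, steps: int) -> str:
    """Apply the production rules to every symbol in parallel, `steps` times."""
    current = axiom
    for _ in range(steps):
        # Symbols without a rule are copied unchanged.
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


ALGAE = {"A": "AB", "B": "A"}  # Lindenmayer's original algae system

for n in range(5):
    print(n, rewrite("A", ALGAE, n))
# 0 A
# 1 AB
# 2 ABA
# 3 ABAAB
# 4 ABAABABA
```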
Open datasets & generators
| Dataset | Format | Scale | License |
|---|---|---|---|
| Zero-to-CAD | CadQuery Python + STEP/STL | ~1M programs (100k curated) | Apache-2.0 |
| GenCAD-Code | CadQuery Python + image | ~163k img–code pairs | unclear (code: Apache-2.0) |
| DeepCAD | construction-seq JSON | ~178k models | MIT |
| Fusion 360 Gallery | B-rep + design seq + assemblies | 8.6k recon · 8.3k assemblies | Non-commercial research |
| SketchGraphs | 2D sketch constraint graphs | 15M sketches (~43 GB) | MIT |
| Text2CAD | text → CAD sequences | ~170k models · ~660k captions | CC BY-NC-SA 4.0 |
| Omni-CAD / CAD-MLLM | text+image+points+command-seq | ~450k multimodal | MIT |
| CADPrompt | prompt → CadQuery + STL | 200 prompts (eval set) | unclear |
| Infinigen | Blender Python | infinite generator | BSD-3-Clause |
| ProcFunc | Python (Blender procedural) | library / infinite | BSD-3-Clause |
| BlenderProc | Python render pipeline | infinite (pipeline) | GPL-3.0 |
| ProcTHOR-10K | house scene JSON (AI2-THOR) | 10k houses · 108 categories | Apache-2.0 |
| Holodeck | text → scene JSON | infinite (Objaverse assets) | Apache-2.0 |
| Sapling Tree Gen | Blender add-on (params) | infinite generator | GPL-3.0-or-later |
| MuJoCo Menagerie | MJCF (MuJoCo XML) | ~70 curated models | Apache-2.0 (per-model varies) |
| PartNet-Mobility | URDF + mesh | 2,346 objects · ~14k parts | unclear (gated) |
| ShapeAssembly | Python DSL programs | PartNet-based programs | Brown non-commercial |
| CSGNet | CSG programs | synthetic + CAD | MIT |
| VLMaterial | Blender material Python | procedural-material set | code MIT · data CC BY-NC 4.0 |