3DCodeVerse: Towards Building the Universe via Code

Text / Image / Video → 3D Code → 3D Assets
3DCodeVerse aims to gather any code that creates, renders, or governs 3D: every dialect, one corpus.
Links: Hugging Face · 3D Code Data
Object Modeling · Scene & World Modeling · Rendering · Materials & Shaders · Physics & Simulation · Animation & Articulation

Gallery

Everything you see here is built by code, from scratch.
  • Black Hole: OpenGL · GLSL shader
  • Ocean: OpenGL · GLSL shader
  • Characters: OpenGL · GLSL shader
  • Code → AI Render: Blender scene (code) → Seedance 2.0 video
  • Articulated Bicycle: URDF → articulated object
  • 3DCodeBench Showcase: Blender Python → rendered objects

Datasets & Progress

Columns: Name · Status · Attribute · Language · Subfolders · Count · Source · License · Note

3DCodeBench
  Status: Completed
  Attribute: 3D Objects
  Subfolders:
    • factories_geo: 243
    • factories_tex: 243
    • instances_geo: 1953
    • instances_tex: 1953
  Count: 4.3k
  License: Commercial · Academic · BSD-3-Clause
  Note: Samples deduplicated from the 26K-sample 3DCodeBench corpus.

ShaderToy
  Status: Captioning
  Attribute: Shader
  Subfolders:
    • TBD
  Count: 120k
  License: Commercial · Academic · CC BY-NC-SA
  Note: A public platform where anyone can write and share OpenGL shaders.

DeepCAD
  Status: Captioning
  Attribute: CAD Objects
  Subfolders:
    • TBD
  Count: 174k
  License: Commercial · Academic · MIT
  Note: DeepCAD's 178k command sequences are converted to standalone CadQuery via GenCAD-Code's converter; STEP + OCC checks leave 174k usable, and dedup leaves 128k unique. Fidelity vs ground truth: IoU 0.985 (96.9% > 0.9); the small tail is tagged per sample for filtering. Each sample carries Text2CAD captions + renders (PNG / STEP / GLB).

Articraft
  Status: Completed
  Attribute: Articulated Objects
  Subfolders:
    • urdf_geo_only: 3.1k
    • urdf_tex: 3k
    • cadquery_single_tex: 6.9k
  Count: 13k
  License: Commercial · Academic · CC BY 4.0
  Note: Work in progress; to be filtered into Blender Python code, but samples may be multi-file.

Thingiverse-OpenSCAD
  Status: Curating
  Attribute: 3D Objects
  Subfolders:
    • TBD
  Count: 7k
  License: Commercial · Academic · Varies
  Note: OpenSCAD (.scad) models scraped from Thingiverse: declarative CSG scripts. Currently curating; per-item licenses vary and are being checked before release.
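
The DeepCAD note reports fidelity as an IoU figure plus a share above 0.9. The page does not say how that IoU is computed, so the following is only a minimal sketch assuming a volumetric IoU over occupied voxel cells; `voxel_iou`, `summarize`, and `box` are hypothetical helper names, not part of any 3DCodeVerse tool.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def voxel_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Intersection-over-union of two sets of occupied voxel cells."""
    union = a | b
    if not union:
        # Two empty shapes are treated as identical (an assumption).
        return 1.0
    return len(a & b) / len(union)


def summarize(ious: Iterable[float], threshold: float = 0.9) -> Tuple[float, float]:
    """Return (mean IoU, share of samples strictly above the threshold)."""
    values = list(ious)
    mean = sum(values) / len(values)
    share = sum(1 for v in values if v > threshold) / len(values)
    return mean, share


def box(nx: int, ny: int, nz: int) -> Set[Voxel]:
    """Occupied cells of an axis-aligned box, used here as toy geometry."""
    return {(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)}


if __name__ == "__main__":
    reference = box(4, 4, 4)   # stand-in for the ground-truth shape
    converted = box(4, 4, 3)   # stand-in for the converted shape
    print(voxel_iou(reference, converted))
    print(summarize([1.0, voxel_iou(reference, converted)]))
```

Per-sample scores like these are what would let a low-fidelity tail be tagged and filtered later, rather than dropped at conversion time.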

Data format

Every entry is a self-contained, runnable project: the code, its rendered outputs, a metadata card, and text captions. There is one source folder at the top, coarse-category subfolders beneath it (the same ones listed in the table above), and one folder per sample.

3dcodebench/                          one source folder per origin
  factories_geo/                      subfolder — a coarse category
    <sample>/                         one self-contained, runnable sample
      code.py                         the code — single file, or many; runs as-is
      renders/                        rendered outputs for this sample
        view_00.png  view_01.png  …   multi-view rendered images
        video.mp4                     video (optional)
        object.glb                    3D mesh — .glb / .obj / .stl (optional)
      meta.json                       identity card (see below)
      captions.json                   text descriptions, keyed by type (see below)
  factories_tex/                      … same structure
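
The layout above can be walked with a few lines of standard-library Python. This is a minimal sketch, assuming the folder and file names shown in the tree; `iter_samples` and `missing_parts` are hypothetical helpers, not functions from the 3dcode toolkit.

```python
from pathlib import Path
from typing import Iterator, List


def iter_samples(source_root: Path) -> Iterator[Path]:
    """Yield every sample folder under <source>/<subfolder>/<sample>/."""
    for subfolder in sorted(p for p in source_root.iterdir() if p.is_dir()):
        for sample in sorted(p for p in subfolder.iterdir() if p.is_dir()):
            yield sample


def missing_parts(sample: Path) -> List[str]:
    """List the expected parts that are absent from one sample folder."""
    missing = []
    for name in ("meta.json", "captions.json"):
        if not (sample / name).is_file():
            missing.append(name)
    renders = sample / "renders"
    # At least one multi-view image is expected; video and mesh are optional.
    if not renders.is_dir() or not any(renders.glob("view_*.png")):
        missing.append("renders/view_*.png")
    return missing


if __name__ == "__main__":
    root = Path("3dcodebench")  # path taken from the tree above
    if root.is_dir():
        for sample in iter_samples(root):
            gaps = missing_parts(sample)
            if gaps:
                print(sample, "is missing", ", ".join(gaps))
```

The code file itself is not checked here because its name is declared per sample in the `entry` field of meta.json.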

meta.json: the identity card

Declares what the sample is and exactly how to run it — type, code dialect, entry point, the runtime environment, where it came from, and who curated it.

{
  "id": "factories_geo/0007",
  "type": "3D Objects",            // Attribute — what it is
  "language": "Blender Python",    // code dialect
  "entry": "code.py",              // file to run
  "multi_file": false,             // single file, or a multi-file project
  "environment": "Blender 5.0",    // exact runtime needed to execute
  "renders": [                     // artifacts under renders/
    "renders/view_00.png",
    "renders/object.glb"
  ],
  "source": "3DCodeBench",         // origin dataset
  "license": "commercial-ok",      // reuse terms
  "curator": "yipeng",             // who reviewed / curated it
  "status": "curated"              // curated | pending | revise
}
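
A card like this can be checked mechanically before a sample is accepted. The sketch below assumes every field in the example is required and that `status` takes only the three values listed in its comment; neither rule is stated as a formal schema on this page, and `check_meta` is a hypothetical name. The `//` annotations in the example are not valid JSON for Python's `json` module, so the sketch parses a comment-free copy.

```python
import json
from typing import List

# Assumption: every field shown in the example card is mandatory.
REQUIRED_KEYS = {
    "id", "type", "language", "entry", "multi_file", "environment",
    "renders", "source", "license", "curator", "status",
}
ALLOWED_STATUS = {"curated", "pending", "revise"}


def check_meta(meta: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the card passes."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(meta))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if meta.get("status") not in ALLOWED_STATUS:
        problems.append(f"unknown status: {meta.get('status')!r}")
    if not isinstance(meta.get("multi_file"), bool):
        problems.append("multi_file must be true or false")
    renders = meta.get("renders", [])
    if not isinstance(renders, list) or not all(
        isinstance(r, str) and r.startswith("renders/") for r in renders
    ):
        problems.append("renders must list paths under renders/")
    return problems


# The example card from above, with the // annotations removed.
EXAMPLE = json.loads("""
{
  "id": "factories_geo/0007",
  "type": "3D Objects",
  "language": "Blender Python",
  "entry": "code.py",
  "multi_file": false,
  "environment": "Blender 5.0",
  "renders": ["renders/view_00.png", "renders/object.glb"],
  "source": "3DCodeBench",
  "license": "commercial-ok",
  "curator": "yipeng",
  "status": "curated"
}
""")

if __name__ == "__main__":
    print(check_meta(EXAMPLE) or "meta.json looks consistent")
```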

captions.json: text descriptions, keyed by type

A flat { type: caption } map: one object can carry several captions from different angles (brief, detailed, geometry, function…). It is kept separate from meta.json so captions can grow without touching the identity card.

{
  "brief":    "A wooden four-drawer dresser.",
  "detailed": "A rectangular wooden dresser with four stacked drawers, each with a small round knob, a flat top and short tapered legs.",
  "geometry": "Box body ~0.8 x 0.45 x 1.1 m, split into four equal drawer compartments; cylindrical handles centered on each drawer front.",
  "function": "Bedroom storage; the four drawers slide out along the front face."
}
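
Because the map is flat and the set of caption types is open-ended, a consumer usually needs a fallback order. Here is a minimal sketch of one way to read it; `pick_caption` and its default preference order are assumptions for illustration, not part of the corpus tooling.

```python
from typing import Dict, Optional, Tuple


def pick_caption(
    captions: Dict[str, str],
    preferred: Tuple[str, ...] = ("detailed", "brief"),
) -> Optional[str]:
    """Return the first non-empty caption among the preferred types, else None."""
    for kind in preferred:
        text = captions.get(kind, "").strip()
        if text:
            return text
    return None


if __name__ == "__main__":
    captions = {
        "brief": "A wooden four-drawer dresser.",
        "function": "Bedroom storage; the four drawers slide out along the front face.",
    }
    # "detailed" is absent here, so the lookup falls back to "brief".
    print(pick_caption(captions))
    print(pick_caption(captions, ("geometry", "function")))
```

New caption types can be added to captions.json without changing this reader; unknown keys are simply ignored unless a caller asks for them.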

On our radar — candidate sources

3DCodeVerse aims to unify multiple dialects of 3D code under one quality-gated roof. Today: Infinigen (organic / natural, Blender Python). On our radar, all permissively licensed and therefore combinable with attribution:

Mechanical / parametric CAD code

CADFS
  Code form: FeatureScript (Onshape FeatureScript + text, multi-view, STEP / STL)
  Scale: ~450k models · 382,609 dedup · ~90.5 GB
  License: CC BY 4.0
  Why it fits: Closest to real-world “3D code”: executable scripts with full design history (sketch, extrude, revolve, sweep, loft, fillet, chamfer, shell, boolean, pattern). Already shipped as text→code / image→code JSONL.

CAD-Coder / GenCAD-Code
  Code form: CadQuery (Python) (CadQuery Python + rendered image)
  Scale: 163k image–code pairs
  License: Apache-2.0
  Why it fits: Most LLM-friendly CAD code: CadQuery is Python, trivially standalone. Complements FeatureScript with Python-CAD.

Omni-CAD / CAD-MLLM
  Code form: Command seq (JSON) (CAD command-sequence JSON + text + multi-view + point clouds)
  Scale: ~450k · ~1.25 GB
  License: MIT
  Why it fits: Strong multimodal grounding (image / text / point → CAD seq), cheap to ingest. Needs an executor / translator.

SketchGraphs + DeepCAD
  Code form: Sketch / construction seq (2D sketch constraint graphs / 3D construction sequences)
  Scale: 15M sketches (~43 GB) · 178k models
  License: MIT *
  Why it fits: Foundational CAD grammar: sketches + constraints; DeepCAD is convertible to CadQuery / FeatureScript.
  * Onshape sketch-copyright caveat.

Scene-level procedural code

ProcTHOR-10K
  Code form: House JSON + Python gen (procedural house generator, AI2-THOR)
  Scale: 1,633 objects · 108 categories · 3,278 materials
  License: Apache-2.0
  Why it fits: Best fit for scene-level code: rooms, placement, materials, lighting, interaction state. Complements Infinigen’s natural scenes; convertible to Blender scene scripts.

Open challenges

Where the harness work lives.

  • Translators / executors — command-sequence → code, .blend → Blender Python, AI2-THOR → Blender scene scripts, FeatureScript ↔ CadQuery.
  • Cross-dataset dedup — DeepCAD ↔ CADFS ↔ Omni-CAD ↔ CAD-Coder share lineage; geometry / ID dedup before merge.
  • License & attribution tracking — keep provenance per entry so the merged corpus stays redistributable.

Best practices

Lessons from building the pipeline so far.

  1. Coding agent does the conversion

    A coding agent transcribes each asset into standalone, runnable code: the heavy lifting of the asset → code transform.

  2. Human feedback closes the loop

    Humans keep, drop, or flag samples for revision; their feedback drives the auto-refine loop, so nobody hand-rewrites code.

  3. Give the agent eyes

    Build visualization tools the coding agent can call (render previews, diffs) so it verifies and self-corrects its output instead of working blind.

  4. Anomaly detection as a final gate

    After curation, sweep for outliers (over-long code files, excessive character counts, degenerate geometry) to catch what slips past human review.

  5. Scale with Claude Code

    The 20× (Max) plan plus headless mode (claude -p) lets you batch-convert and clean data in bulk, not one file at a time.
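
The anomaly sweep in practice 4 can start from very simple statistics. The sketch below is one possible shape for such a gate, assuming a median-ratio rule for code length and a bounding-box rule for geometry; the thresholds and the names `flag_overlong` and `is_degenerate` are illustrative assumptions, not the pipeline's actual checks.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def flag_overlong(char_counts: Dict[str, int], ratio: float = 5.0) -> List[str]:
    """Flag samples whose code is more than `ratio` times the median length."""
    if not char_counts:
        return []
    typical = median(char_counts.values())
    return sorted(sid for sid, n in char_counts.items() if n > ratio * typical)


def is_degenerate(vertices: Sequence[Vertex], eps: float = 1e-6) -> bool:
    """Treat a mesh as degenerate if it is empty or flat along any axis.

    Note: a deliberately flat object (a single plane) is also flagged,
    so hits are candidates for review, not automatic rejections.
    """
    if not vertices:
        return True
    extents = [
        max(v[i] for v in vertices) - min(v[i] for v in vertices)
        for i in range(3)
    ]
    return min(extents) < eps


if __name__ == "__main__":
    counts = {"0001": 1000, "0002": 1200, "0003": 900, "0004": 50000}
    print(flag_overlong(counts))
    print(is_degenerate([(0, 0, 0), (1, 0, 0), (0, 1, 0)]))
```

Running the gate after human curation keeps reviewers focused on visual quality while the script catches what is tedious to spot by eye.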

Contribute data — how it works

Bring your own (code → 3D) projects into the corpus with the 3dcode toolkit. Data uploads straight from your own machine to a private staging bucket, where it is auto-validated and dedup-checked; a maintainer then reviews it and ingests it into the corpus. Your bandwidth, not ours: nothing routes through a central server.

your server → 3dcode push → R2 staging (private) → maintainer review → curated corpus

How data is organized

One source folder per origin; one sub-folder per data sample. Each sample is an independent, runnable project — code at the top, renders + mesh in renders/, and a meta.json identity card.

<source>/                  one source per origin
  <sample>/                one folder per data sample
    model.py               the code (runnable)
    renders/               rendered images + .glb mesh
    prompt.txt             text prompt
    meta.json              auto-generated (id, dialect, hashes, status)

Three steps

  1. Install
    pipx install "git+https://github.com/gaoypeng/3dcode_toolkit"
  2. Configure
    3dcode config set --token <from a maintainer>
  3. Push
    3dcode push ./data --source you

  • Validates: layout + anomaly checks before anything uploads
  • Fingerprints: code / geometry / visual hashes; flags duplicates vs the corpus
  • Uploads direct: from your machine to the private bucket, no central bottleneck
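
The page does not describe how the toolkit computes its fingerprints. As an illustration of the code-hash idea only, here is a minimal sketch that assumes light whitespace normalization followed by SHA-256; `code_fingerprint` and `find_duplicates` are hypothetical names, and the geometry and visual hashes are out of scope.

```python
import hashlib
from typing import Dict, List, Set


def code_fingerprint(source: str) -> str:
    """Hash source code after dropping blank lines and trailing whitespace."""
    lines = [line.rstrip() for line in source.splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(candidates: Dict[str, str], known: Set[str]) -> List[str]:
    """Return ids whose fingerprint is already known or seen earlier in the batch."""
    seen = set(known)
    duplicates = []
    for sample_id, source in candidates.items():
        fingerprint = code_fingerprint(source)
        if fingerprint in seen:
            duplicates.append(sample_id)
        seen.add(fingerprint)
    return duplicates


if __name__ == "__main__":
    batch = {
        "a": "import bpy\n\nbpy.ops.mesh.primitive_cube_add()\n",
        "b": "import bpy   \nbpy.ops.mesh.primitive_cube_add()",
        "c": "import bpy\nbpy.ops.mesh.primitive_uv_sphere_add()\n",
    }
    # "b" differs from "a" only in whitespace, so it is reported as a duplicate.
    print(find_duplicates(batch, known=set()))
```

An exact hash like this only catches near-verbatim copies; renamed variables or reordered statements would need the geometry or visual hashes mentioned above.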

Toolkit & docs: github.com/gaoypeng/3dcode_toolkit

Reference

The 3D-code landscape

A concise, web-verified map of the languages and open datasets for representing 3D as code — the knowledge base 3DCodeVerse draws on. License colours: permissive, non-commercial / restricted, unclear. Always re-check a source’s license before use.

3D code dialects — languages & formats

CAD-as-code
  • CadQuery · Python · Fluent Python parametric solid CAD on the OpenCASCADE kernel; exports STEP/STL.
  • build123d · Python · Modern Pythonic B-rep CAD (algebraic + builder modes); a CadQuery sibling.
  • OpenSCAD · .scad · “The programmer’s solid 3D CAD modeller”: declarative functional CSG scripting.
  • JSCAD · JavaScript · Parametric 2D/3D design in JS, runnable in the browser or CLI.
  • FeatureScript · Onshape DSL · Onshape’s language for custom parametric features, with a 3D-math type system.
Procedural / DCC
  • Blender Python (bpy) · Python in Blender · Scripts that drive Blender’s whole pipeline; the host for Infinigen-style generators.
  • Geometry Nodes · Blender nodes · Visual, non-destructive procedural geometry (transpilable to Python).
  • L-systems (L-Py) · grammar + Python · Lindenmayer rewriting rules for branching/growth structures, e.g. plants.
Scene / world / sim
  • URDF · XML · Declarative description of a single robot (links + joints); the ROS standard.
  • SDFormat · XML · Describes whole worlds (robots, objects, physics, lighting), beyond URDF’s single body.
  • MJCF · MuJoCo XML · MuJoCo’s native physics model format (kinematics, actuators, contacts).
  • glTF · .gltf / .glb · A transmission/runtime format: baked results, NOT generative code (the delivery baseline).
Shape grammar / DSL
  • ShapeAssembly DSL · research DSL · Builds shapes by declaring + attaching hierarchical cuboid part proxies.
  • CGA (CityEngine) · .cga · Rule-based shape grammar for mass-generating buildings/cities (proprietary).
Material / shader
  • MaterialX · .mtlx · Open standard node graphs that compile to GLSL/OSL/MDL across renderers.
  • OSL · .osl · Programmable closure-based shading language for production renderers.
  • GLSL · GPU shaders · C-like real-time shader language for the OpenGL/Vulkan/WebGL pipeline.

Open datasets & generators

CAD / parametric code
Dataset | Format | Scale | License
Zero-to-CAD | CadQuery Python + STEP/STL | ~1M programs (100k curated) | Apache-2.0
GenCAD-Code | CadQuery Python + image | ~163k img–code pairs | unclear (code: Apache-2.0)
DeepCAD | construction-seq JSON | ~178k models | MIT
Fusion 360 Gallery | B-rep + design seq + assemblies | 8.6k recon · 8.3k assemblies | Non-commercial research
SketchGraphs | 2D sketch constraint graphs | 15M sketches (~43 GB) | MIT
Text2CAD | text → CAD sequences | ~170k models · ~660k captions | CC BY-NC-SA 4.0
Omni-CAD / CAD-MLLM | text+image+points+command-seq | ~450k multimodal | MIT
CADPrompt | prompt → CadQuery + STL | 200 prompts (eval set) | unclear
Procedural generators & scenes
Dataset | Format | Scale | License
Infinigen | Blender Python | infinite generator | BSD-3-Clause
ProcFunc | Python (Blender procedural) | library / infinite | BSD-3-Clause
BlenderProc | Python render pipeline | infinite (pipeline) | GPL-3.0
ProcTHOR-10K | house scene JSON (AI2-THOR) | 10k houses · 108 categories | Apache-2.0
Holodeck | text → scene JSON | infinite (Objaverse assets) | Apache-2.0
Sapling Tree Gen | Blender add-on (params) | infinite generator | GPL-3.0-or-later
Robot / simulation structure
Dataset | Format | Scale | License
MuJoCo Menagerie | MJCF (MuJoCo XML) | ~70 curated models | Apache-2.0 (per-model varies)
PartNet-Mobility | URDF + mesh | 2,346 objects · ~14k parts | unclear (gated)
Shape programs & grammar
Dataset | Format | Scale | License
ShapeAssembly | Python DSL programs | PartNet-based programs | Brown non-commercial
CSGNet | CSG programs | synthetic + CAD | MIT
Material / shader
Dataset | Format | Scale | License
VLMaterial | Blender material Python | procedural-material set | code MIT · data CC BY-NC 4.0