Evaluation Workflow

This guide describes the process for evaluating vision-language models (VLMs) on the PhyBlock 3D block assembly benchmark.

Dataset

The dataset is officially released on Hugging Face.

Standard Evaluation Procedure

1. Generate Predictions

Use the scripts in inference_scripts/ to generate raw model predictions for all 400 block assembly scenarios. Each model should output its predictions in JSON format.
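Before evaluating, it can be useful to confirm that the prediction directory is complete and parseable. The following is a minimal sketch, not part of the PhyBlock scripts: the helper name check_predictions is hypothetical, and it assumes one JSON file per scenario, which the guide does not state explicitly.

```python
import json
from itertools import islice  # not required here; kept out to stay minimal
from pathlib import Path


def check_predictions(pred_dir, expected=400):
    """Return (n_valid, problems) for the *.json files found in pred_dir.

    Assumption: one prediction file per scenario, so a complete run
    produces `expected` parseable JSON files.
    """
    problems = []
    n_valid = 0
    for path in sorted(Path(pred_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)  # only checks that the file is valid JSON
            n_valid += 1
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems.append(f"{path.name}: {exc}")
    if n_valid != expected:
        problems.append(f"expected {expected} valid files, found {n_valid}")
    return n_valid, problems
```

An empty problems list means every file parsed and the count matched. The check says nothing about whether the content follows the schema the evaluation script expects.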

2. Evaluate Model Outputs

Prepare the following directories:

Argument                 Description
--raw_outputs_json_dir   Path to raw model prediction JSON files
--matching_json_dir      Candidate block JSONs (see data/SCENEs_400_Cand_Jsons)
--groundtruth_json_dir   Ground-truth structure JSONs (see data/SCENEs_400_Goal_Jsons)
--extracted_json_dir     Output directory for parsed assembly sequences
--results_dir            Directory for evaluation results (CSV/TXT)

Run the evaluation with:

python evaluate_block_construction.py
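If the script's defaults do not match your layout, the five directories can presumably be passed explicitly. The sketch below assembles such a command; it assumes the script accepts the documented arguments as command-line flags, and the paths marked hypothetical are placeholders rather than names used by the repository.

```python
import shlex
import sys

# The two data/ paths come from the argument descriptions in this guide;
# the other three are hypothetical placeholders for your own directories.
EVAL_DIRS = {
    "raw_outputs_json_dir": "outputs/raw",          # hypothetical
    "matching_json_dir": "data/SCENEs_400_Cand_Jsons",
    "groundtruth_json_dir": "data/SCENEs_400_Goal_Jsons",
    "extracted_json_dir": "outputs/extracted",      # hypothetical
    "results_dir": "outputs/results",               # hypothetical
}


def build_eval_command(dirs, script="evaluate_block_construction.py"):
    """Return an argv list with one --flag per directory.

    Assumption: the script reads these flags from the command line.
    """
    cmd = [sys.executable, script]
    for name, path in dirs.items():
        cmd += [f"--{name}", path]
    return cmd


# Print the command for inspection instead of running it.
print(shlex.join(build_eval_command(EVAL_DIRS)))
```

To launch the evaluation from Python, the same list can be handed to subprocess.run(cmd, check=True), which raises an exception if the script exits with a non-zero status.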

3. Results

After evaluation, metrics and summary files are written to the directory given by --results_dir, such as:

  • Results_level1.csv: Level 1 (basic matching)

  • Results_level234.csv: Levels 2–4 (complex reasoning)

  • Results_Levels.txt: Overall summary
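This guide does not document the column layout of the CSV files, so a quick way to get oriented is to look at the header and the first few rows. The helper below is a hypothetical sketch that makes no assumption about column names.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return (column_names, rows), where rows holds up to n_rows dicts."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        rows = list(islice(reader, n_rows))
        # fieldnames is None for an empty file, so fall back to an empty list
        columns = list(reader.fieldnames or [])
    return columns, rows
```

For example, preview_csv("outputs/results/Results_level1.csv") returns the column names and the first five rows, where outputs/results stands in for whatever directory was passed as --results_dir.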

Visualization Examples

The following are sample visualizations of model-predicted assembly processes:

_images/006.gif _images/007.gif _images/008.gif _images/018.gif _images/029.gif _images/110.gif _images/059.gif _images/089.gif

Note

Each animation demonstrates how a model constructs the target block structure step by step.


For detailed API documentation and further customization, refer to the script docstrings and in-code comments.