Evaluation Workflow
This guide describes the process for evaluating vision-language models (VLMs) on the PhyBlock 3D block assembly benchmark.
Dataset
The dataset is officially released on Hugging Face.
Standard Evaluation Procedure
1. Generate Predictions
Use the scripts in inference_scripts/ to generate raw model predictions for all 400 block assembly scenarios.
- Each model should output predictions in JSON format.
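As a rough illustration of the JSON output step, the sketch below writes one prediction file per scenario. The function name `save_prediction`, the one-file-per-scenario layout, and the `scenario_id` / `response` fields are assumptions made for this example; the actual schema is whatever the scripts in inference_scripts/ produce.

```python
import json
from pathlib import Path


def save_prediction(output_dir, scenario_id, raw_response):
    """Write one scenario's raw model response to its own JSON file."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical schema: the real field names are defined by the inference scripts.
    record = {"scenario_id": scenario_id, "response": raw_response}
    out_path = out_dir / f"{scenario_id}.json"
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
    return out_path
```

Keeping the raw response untouched in the file leaves all parsing to the evaluation step, so a parsing change does not require rerunning inference.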
2. Evaluate Model Outputs
Prepare the following directories:
| Argument | Description |
|---|---|
| | Path to raw model prediction JSON files |
| | Candidate block JSONs (see |
| | Ground-truth structure JSONs (see |
| | Output directory for parsed assembly sequences |
| | Directory for evaluation results (CSV/TXT) |
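Before launching the evaluation, it can help to confirm that the input directories exist and to create the output directories up front. The helper below is a minimal sketch of that check; `check_eval_dirs` and the dictionary keys a caller passes in are hypothetical and are not part of the evaluation script.

```python
from pathlib import Path


def check_eval_dirs(input_dirs, output_dirs):
    """Return the names of missing input directories and create output directories.

    Both arguments map a label (hypothetical, chosen by the caller) to a path.
    """
    # Inputs must already exist; collect the labels of any that do not.
    missing = [name for name, path in input_dirs.items() if not Path(path).is_dir()]
    # Outputs are created on demand so the evaluation can write into them.
    for path in output_dirs.values():
        Path(path).mkdir(parents=True, exist_ok=True)
    return missing
```

An empty return value means every input directory was found; otherwise the returned labels indicate which paths to fix before running the script.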
Run the evaluation with:
python evaluate_block_construction.py
3. Results
After evaluation, you will obtain metrics and summary files, such as:
- Results_level1.csv: Level 1 (basic matching)
- Results_level234.csv: Levels 2–4 (complex reasoning)
- Results_Levels.txt: Overall summary
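The CSV outputs can be inspected with the standard library. The sketch below reads a results file into a list of dictionaries keyed by its header row; `load_results` is a hypothetical helper, and no column names are assumed because the exact columns depend on the evaluation script.

```python
import csv
from pathlib import Path


def load_results(csv_path):
    """Read a results CSV into a list of dicts keyed by the header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `rows = load_results("Results_level1.csv")` followed by `print(rows[0].keys())` shows which columns the file contains, assuming it has at least one data row.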
Visualization Examples
The following are sample visualizations of model-predicted assembly processes:
Note
Each animation demonstrates how a model constructs the target block structure step by step.
For detailed API documentation and further customization, refer to the script docstrings and in-code comments.