

Fall '22

Stacky-Block Environment is an environment for training and tuning a reinforcement learning (RL) agent to solve a block-stacking puzzle game built in Unity. The objective of the game is to take a random sequence of block types and stack as many blocks as possible before the stack reaches the maximum height. The environment is built with Unity's ML-Agents package, with the aim of training an agent that performs significantly better than a random agent. This paper explores the potential for RL agents to observe 3D space in order to develop spatial configurations for design-oriented tasks.

Possible block selections to be placed

There are six block types, differing in size and orientation, that can be placed on the grid. Blocks cannot be rotated when they are placed, so the variation comes from the fixed set of static shapes. The image above shows the block types and their dimensions. The agent acts within a discrete action space: when a decision is requested, it provides a two-length discrete vector whose entries each range from 0 to 4. These values give the (x, y) grid coordinate at which the next block is placed.
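The action format described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's Unity code: the names `GRID_SIZE`, `decode_action`, and `random_action` are hypothetical, and the 5-cell axis length is taken from the stated 0 to 4 range of each action entry.

```python
import random

# Each action entry ranges over 0-4 (from the text), so each axis has 5 choices.
GRID_SIZE = 5


def decode_action(action):
    """Validate a two-length discrete action and return it as an (x, y) cell."""
    if len(action) != 2:
        raise ValueError("expected a two-length action vector")
    x, y = int(action[0]), int(action[1])
    if not (0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE):
        raise ValueError(f"action {action!r} is outside the valid 0-4 range")
    return x, y


def random_action(rng):
    """Sample a uniformly random placement, as a random baseline agent might."""
    return [rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(decode_action(random_action(rng)))
```

With two entries of five values each, the agent chooses among 25 possible placements per step.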

Observation space of 3D toggles

When an action is requested, the agent observes a 251-length vector of integers that describes the 3D space and the next block to be placed. Index 0 is an integer from 0 to 5 identifying the next block type; each of the remaining 250 entries is either 0 or 1 and corresponds to one coordinate in the 3D puzzle space. An entry is 1 if a block occupies that coordinate and 0 otherwise. This is implemented by placing a Box Collider at each possible 3D coordinate, with a boolean that toggles to record whether a block is present.
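As a rough illustration of this layout, the sketch below builds such a vector in Python. Several details are assumptions rather than facts from the project: the 5 x 5 x 10 volume is inferred from 250 = 5 * 5 * 10 and the 0 to 4 action range, the flattening order in `cell_index` is arbitrary, and all names are hypothetical.

```python
NUM_BLOCK_TYPES = 6      # block type ids 0-5 (from the text)
GRID_X, GRID_Y = 5, 5    # assumption: matches the 0-4 action range
HEIGHT = 10              # assumption: inferred from 250 / (5 * 5)
OBS_LENGTH = 1 + GRID_X * GRID_Y * HEIGHT  # 251, as stated in the text


def cell_index(x, y, z):
    """Map a 3D coordinate to its observation index (assumed x-fastest order)."""
    return 1 + x + GRID_X * (y + GRID_Y * z)


def build_observation(next_block, occupied_cells):
    """Return the 251-length observation for a block id and occupied cells."""
    if not 0 <= next_block < NUM_BLOCK_TYPES:
        raise ValueError("block type must be in 0-5")
    obs = [0] * OBS_LENGTH
    obs[0] = next_block                 # index 0: next block type
    for x, y, z in occupied_cells:
        obs[cell_index(x, y, z)] = 1    # 1 marks an occupied coordinate
    return obs


if __name__ == "__main__":
    obs = build_observation(3, [(0, 0, 0), (1, 0, 0), (4, 4, 9)])
    print(len(obs), obs[0], sum(obs[1:]))
```

Only the vector length, the role of index 0, and the binary occupancy values come from the description; the actual ordering of coordinates in the Unity implementation may differ.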

Random agent performance

Rewards are given at each action step. In this environment, they are assigned as follows:

  • If a boolean 3D switch is activated:

    • Agent receives: r / (h + 1) + z

  • If a block collides with the maximum height:

    • Agent receives: -m


The terms are defined as follows, with values taken from the final trained agent:

  • r = 1, a hyperparameter setting the base reward

  • h = the height of the activated boolean 3D switch

  • z = 0.5, a hyperparameter for tuning rewards

  • m = 10, a hyperparameter setting the penalty for reaching the maximum height
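The reward rules above can be written out as a short Python sketch. It uses the stated values r = 1, z = 0.5, and m = 10; the function names are hypothetical, and two points are assumptions not confirmed by the text: that h starts at 0 for the lowest layer, and that a step's reward is the sum over every switch activated in that step.

```python
R = 1.0   # base reward hyperparameter (from the text)
Z = 0.5   # reward tuning hyperparameter (from the text)
M = 10.0  # penalty for reaching the maximum height (from the text)


def switch_reward(h, r=R, z=Z):
    """Reward for activating one boolean 3D switch at height h: r / (h + 1) + z."""
    return r / (h + 1) + z


def step_reward(activated_heights, hit_max_height, r=R, z=Z, m=M):
    """Total reward for one step.

    Assumption: rewards for all switches activated in the step are summed,
    and the penalty -m is applied if the block collides with the maximum height.
    """
    total = sum(switch_reward(h, r, z) for h in activated_heights)
    if hit_max_height:
        total -= m
    return total


if __name__ == "__main__":
    for h in (0, 1, 3):
        print(h, switch_reward(h))
    print(step_reward([], hit_max_height=True))
```

Because r is divided by h + 1, a switch activated low in the stack earns more than one activated higher up (1.5 at h = 0 versus 0.75 at h = 3 under these values), which favors filling lower layers first.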

Trained agent performance

Evaluation Average Return


Trained agent training evaluation

Although the environment is abstracted as a Tetris-like puzzle video game, 3D arrangements of typed blocks correspond to a variety of design problems, such as program blocks in massing design for architecture or object parts in industrial design. Applying the approach to a specific application would require changing the action space and reward function. Even so, the project suggests that, in 3D spatial design configurations, reward functions may serve as a tool for steering agent-based design automation toward specific strategies and behaviors.
