Our objective is to develop a framework that allows agents to trade off predictability against progress toward the goal. Our motivation is that by accounting for predictability, a group of agents can reduce uncertainty in the environment and implicitly induce decentralized coordination. Additionally, accounting for predictability can serve as a means to ‘robustify’ a prediction model a posteriori: agents modify their behavior to match the prediction distribution, widening the space of suitable prediction models for a given problem.
Incorporating predictability in this manner parallels the principle of free-energy minimization found in active inference and control systems. Here, agents do not solely pursue reward maximization; they also seek to minimize the discrepancy between their internal prediction models and actual observations. Within multi-agent interactions, each agent maintains probabilistic beliefs about the actions of others, and the accuracy of these beliefs is directly influenced by its own behavior. By minimizing free energy, an agent can select actions that both reduce environmental uncertainty and validate its internal model, while simultaneously working toward reward maximization. This ensures that agents act in a manner that is not only goal-oriented but also supportive of maintaining coherent, accurate beliefs about the behavior of surrounding agents. Given the interdependence of trajectories in multi-agent interactions, a prediction distribution over surrounding agents is conditioned on a trajectory distribution for the ego agent. If the ego agent deviates significantly from this distribution, the predictions for surrounding agents are compromised.
Accordingly, we introduce a free-energy term in our derivation and establish a set of transformations that enable its use as an objective function in planning. This formulation provides a structured approach for balancing predictability with progress in multi-agent systems.
Trajectory Planning Formulation
The trajectory planning problem for a single agent is formulated as the following optimization problem:
\[
\min_{\mathbf{u} \in U, \mathbf{x} \in X} \sum_{k=0}^{K-1} J_k(\mathbf{x}_k, \mathbf{u}_k) + J_K(\mathbf{x}_K)
\]
subject to:
\[
\mathbf{x}_0 = \mathbf{x}_{\text{init}},
\]
\[
\mathbf{x}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k), \quad k = 0, \dots, K-1,
\]
\[
P \left[ C(\mathbf{x}_k, \delta_{o_k}), \forall o \right] \geq 1 - \epsilon_k, \quad \forall k.
\]
where:
- \( \mathbf{u} = \{\mathbf{u}_0, \dots, \mathbf{u}_{K-1}\} \) represents the control inputs,
- \( \mathbf{x}_k \) denotes the robot state at timestep \( k \),
- \( J_k(\mathbf{x}_k, \mathbf{u}_k) \) is the stage cost function encoding performance metrics,
- \( f(\cdot) \) is the state transition dynamics function,
- \( C(\cdot) \) denotes the collision avoidance constraints, and \( \delta_{o_k} \) accounts for the uncertainty in other agents' positions.
This formulation ensures, through chance constraints, that the probability of collision remains below a specified threshold \( \epsilon_k \) at each timestep.
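As a concrete illustration, the sketch below sets up a small instance of this problem with single-integrator dynamics, quadratic stage and terminal costs, and the chance constraint approximated by an inflated safety radius around a predicted obstacle path. The dynamics, costs, horizon, and margin values are illustrative assumptions, not part of the formulation above.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the chance-constrained planning problem above, assuming
# single-integrator dynamics x_{k+1} = x_k + dt * u_k and a single predicted
# obstacle whose positional uncertainty delta_o is folded into an inflated
# safety radius (a common deterministic surrogate for the chance constraint
# P[C(x_k, delta_ok)] >= 1 - eps_k). All numerical values are illustrative.

K, dt = 10, 0.2
x_init = np.array([0.0, 0.0])
x_goal = np.array([2.0, 2.0])
obst_path = np.tile(np.array([1.0, 1.0]), (K + 1, 1))  # predicted obstacle positions
safety_radius = 0.3 + 0.2                              # nominal radius + uncertainty margin

def rollout(u_flat):
    """Integrate the dynamics to obtain the state trajectory x_0, ..., x_K."""
    u = u_flat.reshape(K, 2)
    x = np.zeros((K + 1, 2))
    x[0] = x_init
    for k in range(K):
        x[k + 1] = x[k] + dt * u[k]
    return x

def cost(u_flat):
    """sum_k J_k(x_k, u_k) + J_K(x_K) with quadratic goal and control terms."""
    u = u_flat.reshape(K, 2)
    x = rollout(u_flat)
    stage = sum(np.sum((x[k] - x_goal) ** 2) + 0.1 * np.sum(u[k] ** 2) for k in range(K))
    return stage + 10.0 * np.sum((x[K] - x_goal) ** 2)

def clearance(u_flat):
    """Distance to the inflated obstacle at every timestep; >= 0 means safe."""
    x = rollout(u_flat)
    return np.linalg.norm(x - obst_path, axis=1) - safety_radius

res = minimize(cost, x0=np.zeros(K * 2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": clearance}])
print("planned controls:\n", res.x.reshape(K, 2))
```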
Free Energy as a Predictability Surrogate
Drawing inspiration from previous works, we begin by defining the free energy of a trajectory prediction distribution \( \mathcal{F}(S, P, \lambda) \) as follows:
\[
\mathcal{F}(S, P, \lambda) = -\lambda \log\left( \mathbb{E}_{\mathbf{x} \sim P} \left[ \exp\left(-\frac{1}{\lambda} S(\mathbf{x})\right) \right] \right)
\]
where:
- \( S \) is a state cost function representing the trajectory objective,
- \( P \) is a trajectory prediction distribution,
- \( \lambda \) is an inverse temperature parameter.
The free energy function can be interpreted as measuring how effectively a prediction distribution \( P \) minimizes cost. Under this interpretation, to minimize the free energy an agent would behave so as to push the prediction distribution toward the optimal trajectory, thereby minimizing the expected cost \( \mathbb{E}_{\mathbf{x} \sim P} [S(\mathbf{x})] \) of trajectories sampled from the distribution.
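As a small numerical illustration, the free energy can be estimated by Monte-Carlo sampling from the prediction distribution. The sampler, cost function, and trajectory dimensions below are illustrative assumptions, not quantities defined above.

```python
import numpy as np

def free_energy(S, sample_P, lam, n_samples=1000, rng=None):
    """Monte-Carlo estimate of F(S, P, lambda) = -lambda * log E_{x~P}[exp(-S(x)/lambda)].

    S        : callable mapping a sampled trajectory to its state cost
    sample_P : callable returning n trajectories drawn from the prediction distribution P
    lam      : the temperature parameter lambda
    """
    rng = np.random.default_rng() if rng is None else rng
    trajs = sample_P(n_samples, rng)
    costs = np.array([S(x) for x in trajs])
    # log-sum-exp for a numerically stable log E[exp(-S / lam)]
    log_mean = np.logaddexp.reduce(-costs / lam) - np.log(n_samples)
    return -lam * log_mean

# Toy usage: Gaussian predicted trajectories and a quadratic cost around a goal.
goal = np.array([1.0, 1.0])
S = lambda x: float(np.sum((x - goal) ** 2))
sample_P = lambda n, rng: rng.normal(loc=0.5, scale=0.2, size=(n, 5, 2))  # n trajs, 5 steps, 2D
print(free_energy(S, sample_P, lam=1.0))
```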
By performing an expectation switch to include the agent's plan distribution \( q(\mathbf{x}) \) and applying Jensen's inequality, this free energy function can be bounded and simplified as follows:
\[
\mathcal{F}(S, P, \lambda) \leq \mathbb{E}_{\mathbf{x} \sim q}[S(\mathbf{x})] + \lambda \, \text{KL}(q(\mathbf{x}) \| p(\mathbf{x})),
\]
where \( \text{KL} \) denotes the Kullback-Leibler divergence between the agent’s plan distribution \( q(\mathbf{x}) \) and the predicted distribution \( p(\mathbf{x}) \). This formulation provides a cost function with two main components:
- Performance Cost: Encourages goal-oriented behavior.
- Predictability Cost: Penalizes deviations from expected behavior.
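For completeness, the bound above follows by rewriting the expectation over \( P \) as an expectation over the plan distribution \( q(\mathbf{x}) \) through importance weights and applying Jensen's inequality to the convex negative logarithm:
\[
\mathcal{F}(S, P, \lambda)
= -\lambda \log\left( \mathbb{E}_{\mathbf{x} \sim q}\left[ \frac{p(\mathbf{x})}{q(\mathbf{x})} \exp\left(-\tfrac{1}{\lambda} S(\mathbf{x})\right) \right] \right)
\leq \mathbb{E}_{\mathbf{x} \sim q}\left[ S(\mathbf{x}) + \lambda \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \right]
= \mathbb{E}_{\mathbf{x} \sim q}[S(\mathbf{x})] + \lambda \, \text{KL}(q(\mathbf{x}) \| p(\mathbf{x})).
\]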
The stage cost function for the planning problem then becomes:
\[
J(\tau_{0:K}) = \sum_{k=0}^{K-1} \left[ J_k(\mathbf{x}_k, \mathbf{u}_k) + \lambda \, \text{KL}(q(\mathbf{x}_k) \| p(\mathbf{x}_k)) \right] + J_K(\mathbf{x}_K),
\]
where \( J_k(\mathbf{x}_k, \mathbf{u}_k) \) represents the performance cost together with a control cost, and \( \lambda \) adjusts the weight assigned to predictability.
This cost function serves as an upper bound on the free energy; thus, minimizing the cost function also minimizes the free energy.
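As an illustration of how the predictability term can be evaluated in practice, the sketch below assumes that the per-timestep plan and prediction marginals are Gaussian (an assumption not made above), for which the KL divergence has a closed form; the quadratic performance term and all numerical values are likewise illustrative.

```python
import numpy as np

def kl_gaussian(mu_q, cov_q, mu_p, cov_p):
    """Closed-form KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) for the per-step penalty.

    Illustrative helper: the formulation above does not fix a distribution
    family; Gaussians are assumed here so the KL term is available in closed form.
    """
    d = mu_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    return 0.5 * (
        np.trace(cov_p_inv @ cov_q)
        + diff @ cov_p_inv @ diff
        - d
        + np.log(np.linalg.det(cov_p) / np.linalg.det(cov_q))
    )

def stage_cost(x_k, u_k, mu_q, cov_q, mu_p, cov_p, lam=1.0, goal=None):
    """J_k(x_k, u_k) + lambda * KL(q(x_k) || p(x_k)) with a quadratic performance term."""
    goal = np.zeros_like(x_k) if goal is None else goal
    perf = np.sum((x_k - goal) ** 2) + 0.1 * np.sum(u_k ** 2)
    return perf + lam * kl_gaussian(mu_q, cov_q, mu_p, cov_p)

# Toy usage: plan and prediction marginals at one timestep.
x_k, u_k = np.array([0.4, 0.1]), np.array([0.5, 0.0])
print(stage_cost(x_k, u_k,
                 mu_q=np.array([0.4, 0.1]), cov_q=0.05 * np.eye(2),
                 mu_p=np.array([0.5, 0.2]), cov_p=0.10 * np.eye(2),
                 lam=1.0, goal=np.array([2.0, 2.0])))
```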