Trade-Only Reinforcement Learning for Asymmetric-Information Catan: An Intermediate Negative Result
We study a modular reinforcement-learning architecture for domestic trade in Settlers of Catan under asymmetric information. Instead of learning full-game play from scratch, the agent learns only when to offer, accept, reject, confirm, or cancel player-to-player trades, while a fixed heuristic backbone controls all non-trade actions. This design isolates whether a learned negotiation module can improve the performance of an otherwise fixed, stable full-game policy. Using the open-source Catanatron simulator, we benchmark four trade-policy variants across two training lengths and multiple random seeds against no-trade and untrained-trade baselines. The learned agents consistently acquire stable trade behavior, making roughly 19 to 20 offers per game and completing about 1 to 2 trades per game in standard evaluations. However, these behaviors do not translate into reliable gains in overall match win rate. Across the broad sweep, trained policies fail to outperform the untrained trade module on average, and performance against stronger opponents remains weak even when trade activity increases. A targeted follow-up with stronger training opponents and longer runs also fails to reverse this pattern. The current evidence therefore supports an intermediate negative finding: a trade-only RL head can learn to negotiate frequently without learning to negotiate profitably. We argue that belief modeling over hidden hands and tighter coupling between trade decisions and downstream planning are likely necessary for trade learning to improve full-game performance.
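To make the modular design concrete, the sketch below shows one way such a routing layer could look. It assumes a Catanatron-style player interface with a decide(game, playable_actions) method; TradeOnlyAgent, TRADE_ACTION_TYPES, the trade_policy.select call, and the heuristic backbone are illustrative names and not the authors' implementation.

```python
# Minimal sketch of the modular trade-only design, under assumed interfaces.
# Assumes each playable action exposes an action_type enum with a .name attribute,
# as in Catanatron-style simulators; all class and method names here are illustrative.

TRADE_ACTION_TYPES = {"OFFER_TRADE", "ACCEPT_TRADE", "REJECT_TRADE",
                      "CONFIRM_TRADE", "CANCEL_TRADE"}


class TradeOnlyAgent:
    """Routes trade decisions to a learned head and everything else to a fixed heuristic."""

    def __init__(self, trade_policy, backbone):
        self.trade_policy = trade_policy   # learned module (e.g., a small policy network)
        self.backbone = backbone           # fixed heuristic full-game policy

    def decide(self, game, playable_actions):
        trade_actions = [a for a in playable_actions
                         if a.action_type.name in TRADE_ACTION_TYPES]
        if trade_actions:
            # The learned head chooses among (or declines) the available trade actions.
            return self.trade_policy.select(game, trade_actions, playable_actions)
        # All non-trade decisions stay with the heuristic backbone.
        return self.backbone.decide(game, playable_actions)
```

In this arrangement, any change in match outcome relative to the no-trade and untrained-trade baselines can be attributed to the trade head alone, which is the causal isolation the paper relies on.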
Reviews
The paper’s main strength is a clear, well-motivated diagnostic question: if you “plug in” a learned negotiation module into an otherwise fixed Catan policy, can it improve outcomes under asymmetric information? The modular design is sensible for attributing causality (trade head vs. backbone), and the reported behavioral metrics (offers/game, trades/game) suggest the RL component is in fact learning a stable policy rather than remaining random. The emphasis on multiple seeds, two training lengths, and a follow-up with stronger opponents and longer runs is directionally aligned with good empirical practice for negative results.

The main weakness is that the evidence, as presented in the abstract/excerpt, is insufficiently specific to fully justify the conclusion beyond “we didn’t see gains in our sweep.” Critical experimental details are missing: what information the trade policy observes (public board state only? message history? any belief features?), the exact reward signal (win/loss only vs. shaped reward tied to trade surplus), training algorithm and hyperparameters, opponent pool definitions, and statistical treatment (confidence intervals, effect sizes, power). Without these, it is hard to rule out that the negative result is driven by misspecified rewards, credit-assignment issues induced by the fixed backbone, or distribution shift between training and evaluation opponents. The conclusion that belief modeling and tighter coupling are “likely necessary” is plausible but remains speculative given the lack of controlled ablations demonstrating those factors as the bottleneck.
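To make the statistical-treatment point concrete, the sketch below shows one minimal form such reporting could take: a Wilson score interval on win rate pooled across seeds. The counts used here are placeholders for illustration, not numbers from the paper.

```python
# Illustrative sketch of per-condition uncertainty reporting, not the authors' analysis.
# Computes a 95% Wilson score interval for a binomial win rate pooled across seeds.
import math


def wilson_interval(wins, games, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    if games == 0:
        return (0.0, 0.0)
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return (center - half, center + half)


# Placeholder counts: e.g., 5 seeds x 200 evaluation games per seed.
wins_per_seed = [58, 61, 55, 60, 57]
total_games = 5 * 200
lo, hi = wilson_interval(sum(wins_per_seed), total_games)
print(f"win rate {sum(wins_per_seed) / total_games:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting intervals of this kind per condition (and per opponent pool), together with effect sizes, would let readers judge whether the sweep was powered to detect the win-rate gains the trade head would need to deliver.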