Robust Algorithmic Collusion
This paper develops a formal framework to assess policies of learning algorithms in economic games. We investigate whether reinforcement-learning agents with collusive pricing policies can successfully extrapolate collusive behavior from training to the market. We find that in testing environments collusion consistently breaks down. Instead, we observe static Nash play. We then show that restricting algorithms’ strategy space can make algorithmic collusion robust, because it limits overfitting to rival strategies. Our findings suggest that policy-makers should focus on firm behavior aimed at coordinating algorithm design in order to make collusive policies robust.
1 Introduction
Software systems that take over pricing decisions are becoming widespread. Pricing algorithms can allow firms to monitor and process large amounts of data and adjust prices quickly to changing circumstances. The ascent of such systems poses a potential challenge for the current regulatory landscape: pricing algorithms based on artificial intelligence (AI) may learn to autonomously collude without any previous intentional agreement or explicit instruction to do so.
∗Nicolas Eschenbaum: University of St. Gallen, Institute of Economics, Varnbüelstrasse 19, 9000 St. Gallen, Switzerland (nicolas.eschenbaum@unisg.ch); Filip Mellgren: Stockholm University, Department of Economics, Universitetsvägen 10 A, SE-106 91 Stockholm, Sweden (filip.mellgren@ne.su.se); Philipp Zahn: University of St. Gallen, Institute of Economics, Varnbüelstrasse 19, 9000 St. Gallen, Switzerland (philipp.zahn@unisg.ch). Eschenbaum and Zahn gratefully acknowledge financial support from the Basic Research Fund of the University of St. Gallen and the Hasler Foundation.
A growing literature has shown algorithmic collusion to be possible in principle.1 The results documented so far are a clear warning sign: Even simple algorithms learn to tacitly collude and thereby harm consumers. However, existing analyses have studied the behavior of algorithms in their training environment. It is well-known that algorithms tend to overfit to the training environment, and results cannot easily be extrapolated to other environments (e.g. Lanctot et al., 2017). In practice, firms train their algorithms offline before using them and face substantial uncertainty about important parameters of the market and their competitors, as well as potentially significant cost from randomized learning in the marketplace. Conclusions drawn from existing work therefore implicitly assume that the training environment and market environment are symmetric and identical, and that results can be extrapolated from one environment to the other.

1 For example, Calvano et al. (2020b, 2021); Klein (2021); Kastius and Schlosser (2021).
This paper develops a framework to guide the analysis of learning algorithms in economic games. We provide a formal representation of the environments that algorithms face that is parameterized by a “context”. Evaluating policies in their training context compared to a (suitably chosen) testing context allows for an assessment of the behavior of algorithms in markets. We apply our framework to Q-learning algorithms in repeated Bertrand games. We show that algorithms overfit to rival strategies from training and cannot successfully extrapolate their collusive policies to other counterparts, or differently parameterized environments. In testing contexts algorithmic collusion vanishes, and does not recover with further iterations. Instead, we observe evidence of Nash play of the underlying stage game, in particular with (nearly) identically parameterized environments. Continued policy updating allows algorithms to overcome this breakdown in collusion, but requires many iterations and is unlikely to be feasible in market environments. We then show that restricting algorithms’ strategy space by only allowing them to condition on their own past price, but not competitors’ prices, can make algorithmic collusion robust for a set of parameterizations, because it forces them to learn collusive policies based on simpler patterns that are not too specific to the training context and can thus be successfully extrapolated.
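To fix ideas, the sketch below shows a tabular Q-learning pricer of the kind studied here, with a switch between the usual strategy space (conditioning on both firms’ last prices) and the restricted one (conditioning only on the agent’s own last price). The price grid, learning parameters, and the simple linear demand are illustrative placeholders, not the paper’s exact specification, which follows Calvano et al. (2020b).

import numpy as np

PRICE_GRID = np.linspace(1.0, 2.0, 15)   # hypothetical discretised price grid
N_PRICES = len(PRICE_GRID)


def demand(p_own, p_rival):
    """Toy symmetric linear demand, a stand-in for the logit demand used in
    Calvano et al.-style environments."""
    return max(0.0, 1.0 - p_own + 0.5 * p_rival)


class QPricer:
    """Tabular Q-learning pricer; observe_rival toggles the strategy space."""

    def __init__(self, observe_rival=True, alpha=0.15, gamma=0.95):
        self.observe_rival = observe_rival
        self.alpha, self.gamma = alpha, gamma
        n_states = N_PRICES ** 2 if observe_rival else N_PRICES
        self.Q = np.zeros((n_states, N_PRICES))

    def state(self, own_last, rival_last):
        # The restricted agent conditions only on its own past price index.
        if self.observe_rival:
            return own_last * N_PRICES + rival_last
        return own_last

    def act(self, s, eps):
        # Epsilon-greedy action selection over the price grid.
        if np.random.rand() < eps:
            return np.random.randint(N_PRICES)
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, reward, s_next):
        # Standard one-step Q-learning update.
        target = reward + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

An agent constructed with observe_rival=False can only learn policies that map its own last price to a new price, which rules out policies that are tailored to the specific rival encountered in training.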
Our findings illustrate a key challenge with the current setup of the analysis of algorithmic collusion: results are reported based on the outcomes in the training environment. In practice, however, algorithms are trained and deployed in separate environments. It is well-established that machine learning algorithms tend to overfit to the training environment (or data), and that a separate testing environment is therefore required to assess their behavior (see e.g. Lanctot et al., 2017; Zhang et al., 2018b,a; Song et al., 2019). But this testing environment is not readily available with reinforcement learners. To the best of our knowledge, we are the first to provide a formal framework to overcome this limitation for algorithms in economic games and to develop a consistent approach to assessing behavior.
Our findings highlight the relevance of coordination at the level of algorithm design. The tendency of algorithms to jointly learn to collude appears generally insufficient for firms to be able to achieve collusive outcomes in the market. However, they may still be able to successfully achieve algorithmic (tacit) collusion by coordinating on high-level approaches to the implementation of pricing algorithms. Each firm implements and trains its algorithm independently, yet by having coordinated parts of the parameterization and the strategy space appropriately, collusive policies robust to deployment in the market can be consistently learnt. Intuitively, because the extent of coordination among competing firms is (legally) restricted, and algorithms may need to work in a range of environments, the policies employed must rely on simpler patterns.
This paper contributes to a growing literature on algorithmic collusion (for a recent survey of the economic literature on AI see Abrardi et al. (2021)). We employ simple Q-learning algorithms in line with related work in e.g. Calvano et al. (2020b, 2021); Klein (2021). Our baseline scenario and parameterization are built on the environment studied in Calvano et al. (2020b). There is also a related and growing literature on reinforcement learning in revenue management practice, demonstrating the relevance of assessing reinforcement learners in economic games (e.g. Kastius and Schlosser, 2021; Acuna-Agost et al., 2021; Bondoux et al., 2020).
Our paper is related to the computer science literature that studies the overfitting of reinforcement learning (RL) algorithms. Lanctot et al. (2017) show that the overfitting to rival agents’ policies we observe is a common problem in RL. Zhang et al. (2018b) examine different ways in which deep RL algorithms overfit to the environment and show that remedies proposed in the literature, such as adding stochasticity to the environment, do not necessarily prevent overfitting. The difficulty for algorithms to extrapolate policies to new environments is known as the “zero-shot coordination problem” (e.g. Treutlein et al., 2021; Hu et al., 2020). Our framework builds on Kirk et al. (2021). Related to our approach is a strand of literature that assumes there exists a distribution of Markov decision problems of the scenario of interest, and then trains algorithms on a finite set of samples from this distribution before testing the behavior on the entire distribution (e.g. Zhang et al., 2018a; Nichol et al., 2018; Justesen et al., 2018).
Lastly, our paper contributes to the literature on competition policy and regulatory responses to algorithmic collusion. The potential challenge to policy has been previously discussed both by the European Commissioner for Competition (Vestager, 2017) and by a Commissioner of the Federal Trade Commission (Ohlhausen, 2017), and potential solutions have been suggested (e.g. Calvano et al., 2020a; Harrington, 2018; Beneke and Mackenrodt, 2021). Our results provide some novel perspectives and qualify existing results. When learning in the market itself is feasible and not too costly, algorithmic collusion appears very likely to arise. In any other case, our findings suggest that the main policy challenge is to detect or prevent coordination of algorithm design, implying that the actual danger of algorithmic collusion may not necessarily lie in the market interaction itself, but in coordinative moves beforehand.2 For instance, pricing specialists at rival firms may be well-informed about the work of their counterparts, and there may be an industry-level understanding of the best specification of pricing algorithms.

2 One particular case of this, where competing firms buy pricing services from the same upstream supplier, has been noted before in the literature. For instance, Harrington Jr (2021) develops a model of sellers that outsource their pricing algorithms to a third party.
The remainder of this paper is organized as follows. In Section 2 we develop a formal framework for the analysis of learners in testing environments. Section 3 introduces the specific setting of algorithmic collusion that we study, and Section 3.6 details how we apply our framework in this setting. Section 4 presents and discusses our results. Finally, Section 5 concludes.
2 Training vs. Testing: A General Framework
The evidence accumulated on the behavior of reinforcement learners in economic games is based on scenarios where the agents learn together in the same environment. In the simplest scenario, two learners interact iteratively until either convergence or a fixed number of iterations is reached. The assessment of pricing behavior is then based on the last rounds of these interactions. In practice, however, firms train their algorithms offline first before deploying them to the market. Algorithms must therefore successfully extrapolate collusive policies from training to the market, and evidence of collusion during training does not imply that firms can actually employ learning systems that tacitly collude. Instead, an assessment of the behavior of algorithms requires a separation between “training” and “testing” environments.3

3 This is a general challenge in the reinforcement learning literature, and is also known as the “zero-shot coordination problem” (see e.g. Treutlein et al., 2021; Hu et al., 2020).
In this section, we develop a general framework to guide the analysis of learners in testing environments. We provide a formal representation of the interactive environments that learning agents face. We consider these environments to be parameterized by some “context”. This context describes how the environment varies with a change in exogenous conditions. Evaluating the difference in behavior between the training and testing environment is then equivalent to comparing the performance of an algorithm between two contexts, one for training and one for testing.
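The sketch below illustrates this evaluation protocol in code, reusing the toy QPricer and demand function from the snippet in the introduction: two agents first learn jointly in a training context, their policies are then frozen, and behavior is measured in a testing context, proxied here by pairing one agent with a rival from an independent training run. The episode lengths, the constant exploration rate, and the pairing rule are illustrative assumptions rather than the exact procedure used in the paper.

import numpy as np

# Reuses PRICE_GRID, demand() and QPricer from the earlier sketch.

def play(agent_a, agent_b, n_periods, eps, learn):
    """Let two pricers interact for n_periods; returns the mean price.
    With eps=0 and learn=False, policies are frozen: a 'testing' run."""
    a_last, b_last = 0, 0                       # arbitrary initial price indices
    avg_prices = []
    for _ in range(n_periods):
        s_a = agent_a.state(a_last, b_last)
        s_b = agent_b.state(b_last, a_last)
        a_idx = agent_a.act(s_a, eps)
        b_idx = agent_b.act(s_b, eps)
        p_a, p_b = PRICE_GRID[a_idx], PRICE_GRID[b_idx]
        r_a = (p_a - 1.0) * demand(p_a, p_b)    # toy profit, marginal cost = 1
        r_b = (p_b - 1.0) * demand(p_b, p_a)
        if learn:
            agent_a.update(s_a, a_idx, r_a, agent_a.state(a_idx, b_idx))
            agent_b.update(s_b, b_idx, r_b, agent_b.state(b_idx, a_idx))
        a_last, b_last = a_idx, b_idx
        avg_prices.append((p_a + p_b) / 2)
    return float(np.mean(avg_prices))


# Training context: two firms learn jointly with exploration (a constant
# epsilon here; the literature typically uses a declining schedule).
f1, f2 = QPricer(), QPricer()
play(f1, f2, n_periods=200_000, eps=0.1, learn=True)

# A second, independent training run stands in for a rival trained offline
# by another firm.
g1, g2 = QPricer(), QPricer()
play(g1, g2, n_periods=200_000, eps=0.1, learn=True)

# Evaluation: freeze the learnt policies (no exploration, no updating) and
# compare outcomes in the training context vs. a testing context.
price_training = play(f1, f2, n_periods=1_000, eps=0.0, learn=False)
price_testing = play(f1, g2, n_periods=1_000, eps=0.0, learn=False)
print(f"mean price, training context: {price_training:.3f}")
print(f"mean price, testing context:  {price_testing:.3f}")

Comparing price_training and price_testing is the analogue of comparing outcomes across the two contexts: a collusive markup that survives only in the first comparison has not been successfully extrapolated.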
We begin by defining the dynamic system in which interactive reinforcement learning takes place. Scenarios of interest for economists can be cast in terms of a Partially Observable Markov Game.4 Formally:
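As a point of reference, a standard formulation of an n-player Partially Observable Markov Game is the following tuple; the notation here is the common one and the paper’s own definition may differ in its details.

\[
  G = \big\langle N,\; S,\; \{A_i\}_{i \in N},\; \{O_i\}_{i \in N},\; T,\; \{\Omega_i\}_{i \in N},\; \{r_i\}_{i \in N},\; \gamma \big\rangle ,
\]
where $N$ is the set of players, $S$ the set of states, $A_i$ player $i$'s action set (with joint action set $A = \times_{i \in N} A_i$), $O_i$ player $i$'s observation set, $T \colon S \times A \to \Delta(S)$ the state transition function, $\Omega_i \colon S \times A \to \Delta(O_i)$ the observation function, $r_i \colon S \times A \to \mathbb{R}$ the reward function, and $\gamma \in [0,1)$ the discount factor.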
Our analysis suggests that a key driving force of algorithmic collusion in practice is coordination of algorithm design. Firms may be able to successfully achieve algorithmic (tacit) collusion by coordinating on high-level ideas for the implementation of learning algorithms. Each firm then implements and trains its algorithm independently. Yet by choosing the parameterization of the environment, and the observation and thus policy space, appropriately, collusive policies can be made sufficiently robust to extrapolate to the market. Intuitively, precisely because the degree of coordination among competing firms is (legally) restricted, and algorithms must potentially work in a range of environments, the policy employed must rely on simpler patterns and cannot be too specific.12

12 Treutlein et al. (2021) study the “zero-shot coordination problem” in multiple settings and similarly note the importance for algorithms to “only rely on general principles for coordination” (p. 10).
5 Conclusion
Companies worldwide increasingly make use of reinforcement learning (RL) and other machine learning techniques for their pricing decisions. The application of such tools and the resulting automation of pricing decisions has gained the attention of policy-makers and researchers. One major concern is that self-learning pricing algorithms may lead to collusion without any explicit instruction by firms to do so. A nascent literature that studies the behavior of RL in economic games indicates that concerns about algorithmic collusion are not unfounded and that algorithms can indeed learn sophisticated strategies supporting supra-competitive pricing. However, existing analyses have studied the behavior of algorithms in their training environment; in practice, the offline training environment is not identical to the market environment in which an algorithm is subsequently deployed.
This paper develops a framework to guide the analysis of learning algorithms in economic games. We consider a formal, general representation of the environment that is parameterized by a context. Evaluating policies learnt by algorithms in their training context in a suitably chosen testing context allows inference about whether policies can be successfully extrapolated to the market. We show that Q-learning algorithms in repeated Bertrand games overfit to rival strategies, and are unable to successfully use their collusive policies outside their training environment. In testing contexts, algorithmic collusion vanishes and does not recover with further iterations. Instead, we observe evidence of static Nash play, in particular with (nearly) identically parameterized contexts. We show that this is due to policies overfitting to the rival’s training policy, and as a consequence we find that restricting algorithms’ strategy space can make algorithmic collusion robust.
Our results provide a novel perspective on the existing findings of algorithmic collusion, and highlight the relevance of coordination at the level of algorithm design. While jointly learning algorithms appear prone to collusion, our analysis shows that this does not imply that collusive outcomes can be achieved in the market. Yet, by appropriately coordinating on aspects of the parameterization and strategy space, independently trained algorithms from rival firms may be designed to learn collusive policies that are robust to deployment in the market. Intuitively, because the extent of coordination among competing firms is (legally) restricted, and algorithms must work in a range of environments, the collusive policies employed must rely on simpler patterns and be fit to one another.