At Conductrics, we enable companies to run A/B and MVT experiments, optimize campaigns with contextual bandit technology, and gather relevant customer experience insights through our A/B Survey technology. However, as great as our technology is, we ultimately believe that the overarching value of any experimentation program lies in how it provides a principled procedure for companies to be explicit about their beliefs and assumptions about their customers. This means being intentional about what to learn and what actions to take, and taking those actions based on the value they provide for their customers and organizations.
This view of experimentation guides how we develop and service our products. It is why our experimentation platform allows for K-anonymized data collection and is built following Privacy by Design (PbD) principles (link to our paper on K-Anonymous A/B Testing). PbD, especially the principle of data minimization, ultimately involves thinking about how the next additional bit of collected data adds value for the customer – an idea cut from the same cloth as intentional experimental design. Our novel A/B Survey technology helps our clients be more mindful of, and responsive to, not just the technical delivery of their services, but how that service makes their customers feel. It guides us to think first about how more complex statistical methods might have unintended consequences before pushing them to market. For example, more complex methods like CUPED and regression adjustment can do wonders for precision and reduced time to test by efficiently recycling existing data. But to be useful that data needs to exist, and the application of these methods requires additional hyperparameters and additional pre-experiment data estimates. Perhaps more importantly, if these methods are used blindly, without understanding certain nuances in their setup, they can unintentionally lead to increases in both Type 1 and Type 2 errors.
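To make the "recycling existing data" point concrete, here is a minimal sketch of a CUPED-style adjustment. This is our own toy illustration, not code from any production system: the simulated data, the 0.8 pre/post correlation, and the sample size are all assumptions. It shows both the benefit (lower variance, hence more precision for the same sample) and the prerequisite the text flags: the pre-experiment covariate `x` has to exist before the method can help.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical experiment data: the pre-period metric x is
# correlated with the in-experiment metric y.
n = 10_000
x = rng.normal(100, 15, n)           # pre-experiment covariate
y = 0.8 * x + rng.normal(0, 10, n)   # in-experiment metric

# CUPED adjustment: y_cuped = y - theta * (x - mean(x)),
# with theta chosen to minimize variance: theta = cov(x, y) / var(x).
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

# The adjustment leaves the mean unchanged but shrinks the variance,
# which is where the precision / time-to-test gains come from.
print(f"raw variance:      {np.var(y, ddof=1):.1f}")
print(f"adjusted variance: {np.var(y_cuped, ddof=1):.1f}")
```

Note that `theta` is exactly the kind of extra pre-experiment estimate the paragraph above warns about: estimate it on the wrong data window, or on data entangled with the treatment, and the "free" precision gain can quietly turn into bias.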
Our intentional lens is not the mainstream industry view. The prevailing view is that value is achieved less through intentional decision making and more through scale. This view of A/B testing and experimentation is influenced by both the ideas and the money of venture capital. More Peter Thiel than George Box (see his '76 "Science and Statistics"), it is a techno-solutionism that rests on appeals to emulate the actions of a few top tech companies – Amazon, Alphabet, Meta, Microsoft, and Netflix. The implicit argument goes like this: the big, successful tech companies represent a potential ideal future state. These companies run thousands of experiments with an increasingly complex set of statistical methods, so experimentation appears to be a core reason they are successful. Therefore, emulating their experimentation programs and methods is the required path for other companies to arrive at the same advanced, successful state.
Ironically, this argument about the value of A/B testing – a causal inference approach – rests mostly on a correlation between running experiments at scale and success. It ignores potential confounders, such as the fact that successful tech companies already employ many PhDs who are expert at working with code at scale, and that these companies have the capacity to run experiments at scale in the first place. So it is not clear whether success and earnings are caused by the act of mechanizing experimentation, or by other factors associated with the skills and resources of these companies. If the value of this approach were generalizable, why is it the same 5 to 10 technology companies that are the experimentation role models year after year? Why aren't companies from other industries being added to the list of innovators each year?
It is not that expanding experimentation and improving efficiency can't be extremely valuable – it often is. However, the mere act of implementing software and then productizing the running of thousands of experiments is unlikely to provide much learning or value. Inference is not a technology. In order to learn, an experimentation flywheel needs to also incorporate Box's learning feedback loop, where "… learning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice …" (Box '76). It is intentional experimentation, rather than performative experimentation, that leads to organizational learning.
There has always been an internal inconsistency, or at least a tension, inherent in the scale/velocity approach to A/B testing. What the VC world really cares about is the Type 2 error – the false negative. It is the lost opportunities that carry the exponential loss – missing the early investment in Google, Facebook, or Netflix. This is the world of fat-tail payoffs and power laws lurking in the shadows. Type 2 errors are the concern of people of action – the bold, the leaders and innovators who Move Fast and Break Things. Worrying about the Type 1 error, the false positive, is to be conservative and, from the VC view of the world, is for the meek, the NPCs, the followers.
The funny thing is that the gateway to experimentation, A/B testing (at least in the Neyman-Pearson approach), is fundamentally about controlling Type 1 errors (see Neyman & Pearson '33). Yet controlling these errors incurs a cost in both samples and time, while reducing time to action is a key guiding principle of the scale-and-velocity view. So 'new' methods and additional complexity must be continuously sold to mitigate this holdup – e.g. continuous sequential designs, regression adjustment, etc. Any consideration of the trade-offs and unintended consequences of this added complexity (biased treatment effects, inflation factors, computing costs, more hyperparameters, errors due to human factors) is left out of the sale. Even if we ignore these trade-offs, as we hit the limit of the speed-ups these approaches can deliver, the pressure to reduce time to action will build, and we will see more arguments for reducing, or eliminating, Type 1 error control as the default behavior rather than the exception.
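The "inflation factor" trade-off above can be made concrete with a toy A/A simulation. This is our own sketch, with arbitrary assumed sample sizes and peeking schedule: since there is no true effect, every "significant" result is a false positive, and naively re-checking a fixed-threshold z-test at each peek inflates the Type 1 error well above the nominal 5% – the very cost that properly designed sequential methods exist to control.

```python
import numpy as np

rng = np.random.default_rng(42)

# A/A simulation: both arms draw from the same distribution, so any
# "significant" result is a false positive. We compare a single look
# at n=1000 per arm against peeking every 100 observations, using the
# same |z| > 1.96 cutoff each time.
n_sims, n_max, peek_every, z_crit = 2000, 1000, 100, 1.96

fixed_fp = peeking_fp = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_max)
    b = rng.normal(0, 1, n_max)
    sig_at = []
    for n in range(peek_every, n_max + 1, peek_every):
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        z = (a[:n].mean() - b[:n].mean()) / se
        sig_at.append(abs(z) > z_crit)
    fixed_fp += sig_at[-1]     # one look at the end: nominal ~5%
    peeking_fp += any(sig_at)  # stop at the first "significant" peek

print(f"fixed-horizon false positive rate: {fixed_fp / n_sims:.3f}")
print(f"peeking false positive rate:       {peeking_fp / n_sims:.3f}")
```

The single-look rate lands near the nominal 5%, while the peeking rate climbs to several times that. Sequential designs buy back the early stopping by widening the thresholds at each look, which is exactly the sample-size inflation that tends to be glossed over in the sale.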
That is not inherently bad. Having developed and worked with A/B testing and reinforcement learning/bandit technology for years, we have found that different classes of problems require different approaches. We have written extensively on the benefits of relaxing Type 1 error control (see our Power Pick Bandits posts here and here). But that decision is based on being mindful of the problem at hand, and then selecting the right solution with an understanding of the various trade-offs.
Of course, all things come with a cost. Like medicine, both the increased use of tech and its added complexity can have positive and negative effects. And, like medicine, tech solutions can be overprescribed, leading to a field addicted to complexity – always seeking more, when less may be more beneficial.
In many cases these approaches are helpful, and we use many of them ourselves. However, when used mindlessly, without intention, they can be less than helpful. Simply trying to emulate the successful, if one is not careful, can lead to a mindlessness of action and a costly ritual of performative experimentation, rather than to enlightenment and understanding of the customer. Perhaps being seen performing the same rituals as the successful has its own signaling benefits. But scaling experimentation for the optics is a costly shibboleth.
There is no ground truth here – just different attitudes and ways of seeing the world. But attitudes matter. They inform next steps and the most fruitful directions to follow. Our view is that experimentation is valuable not just in the doing, but when it helps clarify beliefs and make assumptions explicit, and when it serves as a process for iteratively updating and honing those beliefs based on direct feedback from the environment.
