Demystifying Optimize A/A Tests
November 12, 2020
It’s no mystery that when something new is implemented, the development team wants to make sure it works before deploying the new configuration to Production. Skipping that check can not only stop data from flowing but, in some cases, even break your website (WordPress developers know what I’m talking about).
With Google Marketing Platform (GMP) tools, it’s no different. Of course, you won’t break your site by implementing GA, GTM or Optimize, but the flow of data is extremely important, as it is the main reason you implement these tools. For example, after implementing Google Analytics you can add UTM parameters to your URLs and check the Real-time data. With GTM you can use preview mode or the Tag Assistant (by Google) Chrome extension. However, Optimize does not have anything specific that would help the implementation team confirm the tool is working.
At least nothing official.
Rather than just checking the website, trusting that the implementation was done correctly and waiting for an experiment to start getting sessions, most marketers like to run an A/A calibration test.
Some other reasons an A/A calibration test is used are:
- To check how accurate the A/B testing tool is.
- To set a baseline conversion rate.
- To decide on a minimum sample size.
What is an A/A test?
An A/A test is a type of experiment where marketers test two identical versions (variants) of a specific page against each other. This is different from an A/B test where the test consists of comparing two different pages – Original and Variation.
The main idea here is to make sure the tool is working properly. If the A/A test returns one of the variants as a winner, something may be wrong, but it doesn’t necessarily mean there’s a problem.
Things to keep in mind
The purpose of an A/A test is to return an inconclusive result, but that doesn’t mean a difference in conversion rate won’t show up. A difference does not necessarily mean the tool has a problem or that there’s an issue with its implementation. If something is wrong, however, some possible reasons are:
- There’s something wrong with the implementation.
- There’s something wrong with the test setup.
- The testing tool does not yet have enough data to process (collecting it takes time).
And, as always, we can also have a false positive. This means the tool is working properly but, because of the type of algorithm and how the tool processes the information, one of the variants eventually ends up as a winner.
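To get a feel for how often a false positive shows up by chance alone, here is a minimal simulation sketch. It does not reproduce Optimize’s model, which is not public; it swaps in a plain frequentist two-proportion z-test. The traffic numbers, the 10% conversion rate and the 5% threshold are all assumptions chosen for illustration.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value, using the pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # both arms converted 0% or 100%: no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def aa_false_positive_rate(n_experiments=1000, visitors_per_arm=500,
                           true_rate=0.10, alpha=0.05, seed=42):
    """Run many simulated A/A tests and return the share that declared a winner."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both arms share the SAME true conversion rate: this is an A/A test.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        p = two_sided_p_value(conv_a, visitors_per_arm, conv_b, visitors_per_arm)
        if p < alpha:
            false_positives += 1
    return false_positives / n_experiments


rate = aa_false_positive_rate()
print(f"Share of A/A tests that declared a 'winner': {rate:.1%}")
```

With these assumed settings, the share should land somewhere near the 5% threshold: identical variants, a correctly working test, and still an occasional “winner”.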
What is Google’s opinion about A/A testing?
Google Optimize uses Bayesian inference instead of a frequentist approach. That’s a fancy way of saying Google uses data it already has to make better assumptions about new data. As new data comes in, Google refines its “model” of the world, producing more accurate results.
For a better understanding, here’s a practical example.
Imagine you’ve lost your phone in your house, and you hear it ringing in one of 5 rooms. You know from previous experience that you often leave your phone in your bedroom.
A frequentist approach would require you to stand still and listen to it ring, hoping you can tell with enough certainty from where you’re standing (without moving!) which room it’s in. And, by the way, you wouldn’t be allowed to use the knowledge of where you usually leave your phone.
On the other hand, a Bayesian approach is well aligned with our common sense. First, you know you often leave your phone in your bedroom, so you have an increased chance of finding it there, and you’re allowed to use that knowledge. Second, each time the phone rings, you’re allowed to walk a bit closer to where you think the phone is. Your chances of finding your phone quickly are much better.
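The phone example can be written out as a small Bayes’ rule sketch. All of the numbers here are made up for illustration: the prior says the bedroom is the most likely room, and each ring is assumed to sound like it comes from the kitchen.

```python
def bayes_update(prior, likelihood):
    """Return the posterior: prior times likelihood, rescaled to sum to 1."""
    unnormalized = {room: prior[room] * likelihood[room] for room in prior}
    total = sum(unnormalized.values())
    return {room: value / total for room, value in unnormalized.items()}


# Hypothetical prior: "I often leave my phone in the bedroom."
prior = {"bedroom": 0.40, "kitchen": 0.15, "living room": 0.15,
         "bathroom": 0.15, "office": 0.15}

# Hypothetical likelihood: how probable this ring's sound would be
# if the phone were in each room.
ring_likelihood = {"bedroom": 0.20, "kitchen": 0.60, "living room": 0.10,
                   "bathroom": 0.05, "office": 0.05}

belief = prior
for ring in range(1, 4):
    belief = bayes_update(belief, ring_likelihood)
    best_guess = max(belief, key=belief.get)
    print(f"After ring {ring}: best guess = {best_guess} ({belief[best_guess]:.0%})")
```

The belief starts with the bedroom in the lead, and each ring shifts it further toward the room the sound points to. That is the “walk a bit closer each time” step from the example.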
So, where do we stand on A/A testing?
There is a lot of speculation about A/A tests and whether they’re a good thing for Napkyn or not. It’s not wrong to say the main objective of an A/A test is to verify that the testing tool is properly implemented; it’s a sanity check more than anything else.
However, this speculation is always tied to the tool a specific marketing company is using, and not to the results themselves. Different testing tools use different inference methods, which means different results depending on a variety of variables. As a result, a test that one tool reports as inconclusive could produce a winning variant in another tool.
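As a rough illustration of how two inference methods can read the same numbers differently, the sketch below runs a frequentist z-test and a simple Bayesian “probability that B beats A” on one made-up data set. The visitor counts are hypothetical, the flat Beta(1, 1) prior is an assumption, and neither calculation is claimed to be what Optimize or any other specific tool does internally.

```python
import math
import random

# Hypothetical experiment data.
visitors_a, conversions_a = 1000, 100
visitors_b, conversions_b = 1000, 120

# Frequentist view: two-proportion z-test with pooled standard error.
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
z = (conversions_b / visitors_b - conversions_a / visitors_a) / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))

# Bayesian view: draw conversion rates from each variant's Beta posterior
# (flat Beta(1, 1) prior assumed) and count how often B comes out ahead.
rng = random.Random(7)
draws = 20_000
wins = 0
for _ in range(draws):
    rate_a = rng.betavariate(1 + conversions_a, 1 + visitors_a - conversions_a)
    rate_b = rng.betavariate(1 + conversions_b, 1 + visitors_b - conversions_b)
    wins += rate_b > rate_a
prob_b_beats_a = wins / draws

print(f"Frequentist p-value:           {p_value:.3f}")
print(f"Bayesian probability B beats A: {prob_b_beats_a:.3f}")
```

The two numbers answer different questions, so each tool’s verdict depends on which quantity it reports and which threshold it applies to it.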
How does the Optimize statistical model work?
Optimize uses a statistical model that draws on data it already has to make better assumptions about new data: not the data from GA, but the data already collected in its own experiment. The more data it collects, the more accurate the results can be. Because the model starts from a Bayesian prior probability distribution (an initial belief held before any data is seen), results can be misleading if an experiment has not run for long enough.
At first, Bayesian priors are modeled beliefs about how Optimize thinks a variant or experiment will behave. With little data, it will look at the results and calculate, among other things, how fast a specific variant can convert. This can give that specific variant the lead in the A/A (or A/B) test. When data comes in, the prior is blended with the data to form a posterior, which is the result. As more data comes in, the prior is said to be “overwhelmed”, and matters less and less.
For Optimize, Google uses a variety of priors. As more and more data comes in, the prior’s influence fades away.
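A minimal sketch of a prior being “overwhelmed”, assuming a Beta prior on the conversion rate. Optimize’s real priors are not public, so the two priors and the 10% observed rate below are purely hypothetical. The sketch starts from two very different beliefs and shows how far apart their posterior estimates remain as traffic grows.

```python
def posterior_mean(prior_alpha, prior_beta, conversions, visitors):
    """Mean of the Beta posterior after observing `conversions` out of `visitors`."""
    return (prior_alpha + conversions) / (prior_alpha + prior_beta + visitors)


OPTIMISTIC_PRIOR = (20, 80)  # hypothetical belief: about 20% of visitors convert
SKEPTICAL_PRIOR = (2, 98)    # hypothetical belief: about 2% of visitors convert
OBSERVED_RATE = 0.10         # assumed conversion rate seen in the data

gaps = []
for visitors in (0, 100, 1_000, 10_000, 100_000):
    conversions = round(visitors * OBSERVED_RATE)
    optimistic = posterior_mean(*OPTIMISTIC_PRIOR, conversions, visitors)
    skeptical = posterior_mean(*SKEPTICAL_PRIOR, conversions, visitors)
    gaps.append(optimistic - skeptical)
    print(f"{visitors:>7} visitors: optimistic={optimistic:.4f} "
          f"skeptical={skeptical:.4f} gap={gaps[-1]:.4f}")
```

With no data the two estimates are simply the two priors. As visitors accumulate, both estimates move toward the observed rate and the gap between them shrinks, which is why short experiments are the ones most exposed to the prior.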
Where does everything fit into the A/A testing, then?
Now, as we mentioned earlier, running an A/A test means running the same page against itself. Because Optimize, using the Bayesian model, cannot differentiate between these pages, it will be looking at something else in that experiment. As per Google, “any small difference will make the difference”. The tool looks at many different models to return the best results. Unfortunately, we don’t know which models, as that information is not available to the general public. However, let’s say, for example, that one of the models is session duration.
If a specific group of users spent, on average, more time on that page than others, the tool “understands” that this group has a higher probability of converting, and so returns a specific variant as the winner. The more data comes in, the more variables the statistical model can analyze.
Furthermore, the code behind Optimize that evaluates the variants is improving every day, and running an A/A test goes against the main objective of the tool: comparing two different variants. In other words, you are trying to “break” Optimize in order to make sure it’s working as it should. It doesn’t make sense!
All that being said, depending on the tool you’re using for your A/B testing, we do not recommend using A/A testing as a calibration method. If the main objective is to confirm the tool is properly installed, Google Optimize has, under the Settings section of each experiment, an Optimize Installation option where you can run a diagnostic that verifies Optimize is correctly installed. We recommend running it for every experiment before it starts. It reports Errors like “Optimize plugin not found” and Suggestions such as “Anti-flicker snippet not found” (not mandatory).
So, how can I solve my calibration test problem?
We still have the main problem that the A/A test “came” to solve: making sure my tool implementation is working. So, if you’re not using an A/A test, what should you use? Well, you can always run a controlled-environment experiment. This could be an A/B test with expected results or an experiment where you already know the results. If the results come out as expected, the tool is properly installed.
For example, create two variants in an A/B test and set the weight of one of the variants to 100%.
This means you’re allocating ALL traffic to that specific variant and, because of this allocation, sessions and visits on variant 2 should be 0. If users are seeing variant 2, something is wrong with the tool.
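The check described above can be sketched as a small helper. It does not call any Optimize API: the function name is hypothetical, and the weights and session counts are made-up numbers that you would copy by hand from the experiment report.

```python
def check_controlled_experiment(weights, sessions):
    """Compare configured traffic weights (in %) with observed sessions per variant.

    Returns a list of human-readable problems; an empty list means the
    observed traffic matches what the weights predict.
    """
    problems = []
    for variant, weight in weights.items():
        seen = sessions.get(variant, 0)
        if weight == 0 and seen > 0:
            problems.append(f"{variant}: weight is 0% but {seen} sessions were recorded")
        if weight > 0 and seen == 0:
            problems.append(f"{variant}: weight is {weight}% but no sessions were recorded")
    return problems


# Hypothetical setup: all traffic allocated to the original.
weights = {"Original": 100, "Variant 2": 0}

healthy = check_controlled_experiment(weights, {"Original": 1840, "Variant 2": 0})
broken = check_controlled_experiment(weights, {"Original": 1795, "Variant 2": 45})

print("Healthy report:", healthy)
print("Broken report: ", broken)
```

An empty list for the first report matches the expected outcome, while the second report flags the variant that received sessions it should never have seen.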
As another example, set up Page Targeting for a URL that does not exist. Unless users know that specific URL, the experiment should have NO sessions or users going through. Again, if sessions are showing here, there’s probably something wrong with the tool.
NOTE: If your site is configured to redirect users from a 404 page to an actual real page, you may have a problem here, so keep that in mind.
If you still decide to set up an A/A test, here are a few things to consider:
Give it time: In general, Google’s documentation says two weeks is the minimum you should run your experiment for the tool to gather enough data to determine a winner.
Keep an eye on the daily results: You’re not running a regular test where you can just set it up and leave it running until a result comes up. Compare the data the tool collected with the previous day’s, and see if there are any major changes. This should give you a hint if one of the variants is performing better than the other when it shouldn’t. This check does not need to take more than 5 minutes.
A/A tests are susceptible to false positives: As mentioned before, a false positive can happen and, if it does, you may have to rethink the method you’re using for the A/A test.
An A/A test is a non-typical type of test: Because the A/A test goes against what an A/B test should be, you may get different and/or nonlinear results. To some extent this is expected, so don’t be alarmed. If you’re not happy with the results, you can always set up a new experiment.
Don’t mix apples and oranges: Just because someone used a specific tool and the A/A test was inconclusive, it doesn’t mean that tool is better than Optimize. Each tool uses a different approach to measure experiment results. Most of them use a frequentist analysis of the results over the life of the experiment, which means they draw conclusions from sample data by emphasizing the frequency or proportion of the data.
Don’t do the A/A test: You read that right. A/A tests are not mandatory and, more often than not, they only serve the marketer’s peace of mind that the tool is working. Google Optimize has a diagnostic tool inside each experiment and a preview mode you can use to make sure the experiment will run as expected. So take advantage of them.
Finally, there’s no need to create a whole process for a test that goes against the main purpose of the tool, takes more time than usual (depending on the amount of data), and could return something you don’t want to “hear”, sending you off chasing ghosts.