The Results are In! My A/A Testing Experiment Outcomes
January 14, 2021
A while ago I wrote an article about A/A tests and why this type of test may not be the best option in Google Optimize when the objective is to verify that the implementation was done properly. Using an A/A test for deployment validation was actually the idea of Jim (Napkyn’s CEO). He was wrong :). I know, it’s a bold move to say this in public, but let’s see if we can prove our point with this article.
As I mentioned in the previous article, using this type of test can take a bad turn and lead you down a path where even though the tool may be working, the results from your experiment will say otherwise. For more details (and if you haven’t read it already) check my first article Demystifying Optimize A/A Tests for better context.
To give more context on why an A/A test may not be a good idea, I decided to run my own for 90 days. I created an A/B test with the following configuration:
- 3 variants: Original, Variant 1, and Variant 2 equally weighted with 33% each.
- Page Targeting: Main home page of my site.
- Audience targeting: All visitors to the page(s) targeted above.
- Measurement and Objectives:
  - Primary Objective: Pageviews
  - Additional Objectives: Bounces and Session Duration
- Variants had NO modifications (otherwise it wouldn’t be an A/A test anymore).
As mentioned, the experiment ran for 90 days and collected 865 sessions. That may not seem like a lot, and there were some days with no users at all, but most days were fairly stable, and I believe it is enough to help us understand how Google Optimize reports A/A test results.
Interestingly enough, and somewhat as expected, after 90 days one of the variants had “100% probability to be best”.
How was that expected when it’s supposed to be an A/A test?
As mentioned in my previous article, the way Google Optimize decides which variant is best depends not only on the page and what we can see on it, but also on many other factors that we’re going to explore in this article.
So, let’s break down the report that Optimize provided.
At the top, we have a dropdown menu showing Pageviews (Primary). This corresponds to the Objectives set at the beginning of the experiment. Clicking on that menu lists every objective configured for the experiment: in my case Pageviews, Bounces, and Session Duration.
Keep in mind, each of those options will change the information in the Observed Data and Optimize Analysis fields. Google Optimize gives you not only Observed Data related to Pageviews but also to Additional Objectives that you had set up initially. Although this won’t necessarily affect the final results of the winner variant, it’s good information to have access to.
Here’s an overview of my reporting with Pageviews.
Observed Data is the empirical or raw data that is used as input to Optimize’s statistical models. It is composed of three sub-items:
Experiment Sessions
All the sessions attributed to the experiment. All sessions during this period, whether the experiment was applied or not, will be counted. In my experiment, the Original variant had 266 sessions, Variant 1 had 266, and Variant 2 had 333 (which adds up to 865, the total number of collected sessions). In summary, this is the number of sessions assigned to each variant.
Experiment Pageviews
Pageviews in any experiment session, excluding any pageviews prior to experiment execution in the first session per user. For example, if your experiment is supposed to run on https://mypage.com/contactus, any navigation before this page (home, shop, details, products, etc.) won’t be counted until the user reaches the contact us page.
Calculated Pageviews per Session
Experiment Pageviews divided by Experiment Sessions. This is a simple division shown for convenience. For example, if we divide 788 Experiment Pageviews by 266 Experiment Sessions, the result is 2.96.
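The division is simple enough to reproduce yourself. Here is a minimal sketch in Python; the function name is my own and is not something Optimize exposes:

```python
def pageviews_per_session(experiment_pageviews: int, experiment_sessions: int) -> float:
    """Return Experiment Pageviews divided by Experiment Sessions."""
    if experiment_sessions <= 0:
        raise ValueError("experiment_sessions must be positive")
    return experiment_pageviews / experiment_sessions


# Figures quoted in the article: 788 pageviews over 266 sessions.
ratio = pageviews_per_session(788, 266)
print(f"{ratio:.2f}")  # 2.96
```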
You may have noticed a “few” differences between the variants’ numbers, and at first glance one may think “yeah, we can clearly see that Variant 2 is the best possible one”, but that’s not the only reason. Observed Data is simply what the name says: the data that is observed in order to make a final decision. In some cases the Experiment Sessions and Pageviews could be really close across variants and none of them would be a winner. It is also important to highlight that over the course of those three months, the Observed Data changed as the experiment collected more data. That means we know NOW that Variant 2 is the winner, but earlier in the experiment it was not possible to tell.
Finally, on the far right of our results we have Optimize Analysis. Unlike Observed Data, this one is divided into 4 sub-categories. One of them, the third column, changes depending on the objective selected.
These are the ones I have:
Pageview reports: You will have Modeled Pageviews per Session.
Session Duration: You will have Modeled Duration per Session.
Bounces: You will have Modeled Bounce Rate.
What does that really mean if it’s different for everyone?
This data shows how the variants have performed to date against the selected objective. It is also shown as a graph:
In this case, I’m looking at Modeled Bounce Rate. At the beginning of an experiment, it’s common for the graph to be all over the place and then settle over time as more and more data is taken into account, allowing for a better determination of how each variant is performing against the Original.
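To see why a modeled rate settles as data accumulates, here is a small sketch. It assumes a simple Beta-Binomial model with a flat Beta(1, 1) prior, which is my own simplification and not necessarily what Optimize does internally. The 50% bounce rate and the session counts are made up for illustration.

```python
import math


def beta_posterior_sd(bounces: int, sessions: int) -> float:
    """Standard deviation of a Beta(1 + bounces, 1 + non-bounces) posterior.

    Assumption: a flat Beta(1, 1) prior on the bounce rate.
    """
    a = 1 + bounces
    b = 1 + (sessions - bounces)
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# Hypothetical data: a steady 50% bounce rate over a growing number of sessions.
for sessions in (10, 50, 250):
    sd = beta_posterior_sd(sessions // 2, sessions)
    print(f"{sessions:>4} sessions: uncertainty (posterior sd) = {sd:.3f}")
```

The uncertainty shrinks as sessions grow, which matches the graph calming down over time.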
The other sub-categories under Optimize Analysis are:
Probability to be Best
The probability that a given variant is performing better than all the other variants, given the available experiment data. Keep in mind that this number can change almost every day, based on how each variant is performing, until the experiment is over.
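To make the idea more concrete, here is a rough sketch of how a “probability to be best” could be estimated by simulation. This is not Optimize’s actual model. I’m assuming pageviews per session follow a Poisson rate with a Gamma(1, 1) prior, and the function name and pageview totals are hypothetical. Only the session counts come from my experiment.

```python
import random


def probability_to_be_best(pageviews, sessions, draws=20000, seed=0):
    """Estimate, per variant, how often it has the highest sampled rate.

    Assumed model (not Optimize's actual one): a Poisson rate of pageviews
    per session with a Gamma(1, 1) prior, so each variant's posterior is
    Gamma(shape=1 + pageviews, rate=1 + sessions).
    """
    rng = random.Random(seed)
    wins = [0] * len(pageviews)
    for _ in range(draws):
        # random.gammavariate takes (shape, scale), and scale = 1 / rate.
        samples = [
            rng.gammavariate(1 + pv, 1.0 / (1 + n))
            for pv, n in zip(pageviews, sessions)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Session counts are from the article; the pageview totals are made up.
sessions = [266, 266, 333]
pageviews = [790, 770, 1100]
names = ["Original", "Variant 1", "Variant 2"]
for name, p in zip(names, probability_to_be_best(pageviews, sessions)):
    print(f"{name}: {p:.1%}")
```

Each draw samples one plausible rate per variant and counts which variant came out on top. The share of draws a variant wins is its estimated probability to be best.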
Probability to Beat Original
The probability that a variant performs better than the Original. When creating an experiment in Optimize, variants are competing not only with each other but also with the Original. That is useful because none of your modifications may turn out to be good enough, in which case you wouldn’t have to modify anything.
The last sub-category is, for a given objective, the difference in conversion rate, measured as a percentage, between the variation and the baseline, a.k.a. “Lift”. In other words, it estimates how much better or worse a variant performs relative to the Original.
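As a point of reference, the basic lift arithmetic looks like this. It is only the raw calculation with hypothetical numbers. Optimize derives its figure from its model rather than from this simple ratio, so don’t expect the two to match exactly.

```python
def lift_percent(variant_rate: float, baseline_rate: float) -> float:
    """Relative difference between variant and baseline, as a percentage."""
    if baseline_rate == 0:
        raise ValueError("baseline_rate must be non-zero")
    return (variant_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical rates: baseline at 2.0 pageviews per session, variant at 2.5.
print(f"{lift_percent(2.5, 2.0):+.1f}%")  # +25.0%
```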
Optimize results in Google Analytics
What about Variant 2? Why is that the winner?
To answer that question we have to view all this data in Analytics. Optimize has a pretty cool option for that: View In Analytics.
And that would give you this:
Now I have access not only to the sessions, pageviews, and calculated values, but also to Conversions, Site Usage, Ecommerce, Segments, data comparison against a specific metric, and much more. Keep in mind that some of this information is also available in Optimize, but Google Analytics makes the data more readable and makes it easier for us to answer our questions.
For example, under Site Usage in the Explorer option, I have:
And here we can see some quite interesting data, such as Average Session Duration, % New Sessions, and Bounce Rate. Unfortunately (or fortunately), we can’t base the winner’s decision on this information alone. It’s important to understand that Optimize generates its reports using Bayesian inference, which allows the tool to continually refine results as more data is gathered. So, it takes everything into consideration.
Let’s take a look at the Ecommerce option in the Explorer:
Did you notice something here? Variant 2 was the only variant to generate revenue, even though Variant 1 had a higher percentage of new sessions and a lower Bounce Rate. At the end of the day, Optimize selected Variant 2 as the winner based, as well, on the fact that this variant generated revenue and could keep generating it, while the other variants had none.
What if the revenue had been close between variants, rather than $0 vs. $250?
As I mentioned, Optimize looks at everything. If Variant 1 and Variant 2 both had exactly $250 in revenue, chances are that it would evaluate the variants based on Average Session Duration or Bounce Rate. Unfortunately, we don’t know. What we do know is that the decision is BASED on something. As per Google’s documentation, “Bayes’ theorem is an equation that tells us how we can use observable data to make inferences on unobservable things”.
We looked at all the items in the Optimize reporting view and in GA, and broke each of them down to better understand why, in an A/A test, Optimize still selected a variant as the winner. At this point, you should know enough to decide whether or not to run this type of test.
Finally, I’m not going deep into Bayesian inference, as I already touched on that in my first article. What I want you to take away is that even though you may be looking at A/A tests to make sure Optimize (or any other A/B testing tool) is working and properly implemented, they may not be the best approach, for the reasons mentioned above.