Channel Groupings In Google Analytics 360 BigQuery Data

by Colin Temple

In my previous posts on Google Analytics 360’s BigQuery export, I outlined the basics of the Google Analytics 360-BigQuery integration, and some introductory lessons on how to query the data once you have it.

Now, it’s time to get a little more sophisticated.

In that first post, I mentioned that Google Analytics 360’s BigQuery export does not include much data that is derived from the raw data. By way of example, I mentioned that custom channel groupings are not included. That being said, custom channel groupings can be quite valuable, especially if you’re considering making changes to your attribution scheme or you want to make comparisons between different groupings. If these custom channel groupings are derived from the raw data, how do we re-create them in BigQuery?

For this example, I’m going to be re-creating the default channel groupings in BigQuery. Fortunately, the definitions of the default channels are quite straightforward, and everything you need to create them is exported to BigQuery via that link. Unfortunately, the scoping and naming of that information doesn’t exactly match the dimensions and metrics in Google Analytics, so the SQL required to do this gets a little long.

The SQL

Before we get into details, here’s a look at an SQL statement that re-creates the default Google Analytics channel groupings. Note that I used BigQuery’s Standard SQL dialect to produce these results; the same can be done with Legacy SQL, but you’ll need to use different functions.

The Results

Let’s look at what that produces:

channel revenue transactions users sessions pageviews bounces
Paid Search 3892105 16869 312032 345671 125368121 80926
Display 2156102 31054 450211 468100 241588014 91855
. . .
Social 1856721 89152 989362 1153251 568841352 102568
(Other) 423256 52612 301256 561225 2231381 40325

Basically, we get a breakdown of all of the metrics I’ve chosen to include here by marketing channel, which is a recreation of the Default Channel Grouping dimension in Google Analytics, assuming you haven’t changed its definitions.

Dissecting the Query

Let’s look at what I’m doing in this query from the bottom up. You might notice that I have two SELECT  statements here, one nested in the other. This creates a table with the inner statement, which is then queried by the outer statement. Let’s look first at the FROM clause for that inner statement — where am I fundamentally getting this data from?

At its base, all of the data here comes from the exported Google Analytics 360 sessions tables. As I’ve shown in my previous post, I’m using a wildcard on the sessions table name to query a set of tables with the date range I specified in the WHERE clause. But I’m also doing something with the repeated hits records within each session. Namely, I’m flattening the resulting table with UNNEST() .

The  UNNEST() function, which is similar to Legacy SQL’s FLATTEN() function, takes the supplied table with its repeated nested records and flattens the data. This means that, where I originally had a row representing a session and nested rows within it representing hits, I now have just a set of rows representing all hits in the date range, with repeated information about the session each hit belongs to in every row. Hits with common values for session-level columns belong to the same session.

I’m doing all this for one reason: According to the default channel groupings, the definition of the Social channel includes as a sufficient condition, a check for a social referrer. Google Analytics identifies social referrers from an internal list and sets a dimension value to “Yes” or “No” whenever a hit’s referrer is on the social list. This dimension is set at the hit level, so we need to identify sessions that contain a hit with a “Yes” social referrer value to drop them in the Social channel. More on this later.

Let’s look at the columns I’m pulling from my newly-flattened data:

The first few things I’m collecting here are pretty basic. I want the ID of the visit to isolate individual sessions from my flattened data, the ID of the visitor to calculate my Users metric later, as well as the medium and source of each hit’s session, since these are important to determining the channels each belongs to.

The next bit uses a conditional check to determine whether or not a session is social. Recall that every hit that has a social referrer is so-marked, and we can identify social sessions as those including a social-referrer hit. This is done by the hit.social.hasSocialSourceReferral column. This column contains the string “Yes” if the hit is from a social referrer. So, using a COUNTIF()  function, I count the number of times in each session that there’s a “Yes” value for the hasSocialSourceReferral  column within that session’s hits. If that number is greater than zero, we set “Yes” for every row in the session so that when we re-aggregate the results, we’ll have a “Yes” for our new socialSession  column.

In summary, what I’m doing here is saying that as I re-aggregate the data into sessions I am setting a new column, socialSession, to “Yes” if the session contains a social hit. Otherwise I’m setting it to “No”.

Next I collect the trafficSource.adwordsClickInfo.adNetworkType column as well, as that’s important for distinguishing between Display and Paid Search channels when it comes to AdWords ads.

Finally, I’m grabbing my metrics but a bit differently than in some of my past examples:

Recall that I’m working with flattened data at this stage, meaning that I have a row for every hit, not just every session. The row for each hit includes all the data about the session, including its totals for revenue, transactions, etc. Those total values are going to be repeated across each hit. In previous examples, I’ve just used the  SUM() function to add up totals but now that I have repeated redundant values, I need a way of getting the individual values for each session. I only need to select each metric once, so instead of using SUM() to add the values together I’m using MAX()  to get the highest value of each group (they should all be the same, anyways).

Lastly, to end my inner query I group my results by the dimensions I’ve selected:

This ensures that I have only one row per session again.

Back to the top of my query, we can now see what I’m selecting from this inner table. It begins with a new column I’m creating called channel . I finally have everything I need to re-create the channel definitions in Google Analytics, so let’s get to it:

Basically, I’m using a big CASE  statement to go through each definition, applying the label for the channel WHEN  its definition conditions are met. The sources are mostly defined by source and/or medium values, but there are a few cases where things are more interesting:

  • I’m using the socialSession  column I created in the definition of the Social channel.
  • In various cases I use the REGEXP_CONTAINS()  function to determine whether the given column (always medium in these cases) matches some regular expression, per the definitions Google supplies for its channels.
  • If none of the conditions are met, I assign the channel “(Other)” just as Google Analytics does.

Now it’s time to sum up my totals for each channel:

Since I used the MAX()  function in the inner query to ensure each row got the literal value of the original session’s metrics and not a total, I can safely add together the rows that match each channel definition to get my new totals using SUM() . I also use the COUNT(DISTINCT)  method to determine the number of unique visitor IDs in my data and get a Users metric.

  • You’ll notice that revenue is divided by one million here; This is because Google Analytics stores currency values as Micros (or one millionths) of the view’s currency.
  • The same caveat I applied to the Users metric in my previous post applies: this is the “old” method of counting users that is still used in unsampled reporting. You may see different values in the UI and Analytics Reporting API because a different method of counting users is used there by default.

Finally, all that’s left is to re-aggregate the results by channel using GROUP BY  and make our table a little nicer by sorting by revenue, using ORDER BY :

Takeaways

That’s a fair bit of work to accomplish something that seems fairly simple in Google Analytics, but it highlights some of the specifics of this data. Of course, this isn’t the only way to do this. I wouldn’t claim it’s the most efficient either, but it gets the job done.

The main thing I hope you get from this is a better understanding of the way this data is structured so you can start playing around and making your own channel definitions, or regrouping other dimension values in the same way. Happy querying!

Colin Temple

Director, Analytics Solutions, Napkyn Analytics

Colin serves as Director of Analytics Solutions for Napkyn Analytics, where he focuses on keeping his team at the forefront of digital analytics. A diverse background in software and marketing and an education in logic and philosophy give Colin a unique perspective on where the analytics practice is, and where it should be.