Channel Groupings In Google Analytics 360 BigQuery Data

April 2, 2018 -
In my previous posts on Google Analytics 360’s BigQuery export, I outlined the basics of the Google Analytics 360-BigQuery integration, and some introductory lessons on how to query the data once you have it.
Now, it’s time to get a little more sophisticated.
In that first post, I mentioned that Google Analytics 360’s BigQuery export does not include much data that is derived from the raw data. By way of example, I mentioned that custom channel groupings are not included. That being said, custom channel groupings can be quite valuable, especially if you’re considering making changes to your attribution scheme or you want to make comparisons between different groupings. If these custom channel groupings are derived from the raw data, how do we re-create them in BigQuery?
For this example, I’m going to be re-creating the default channel groupings in BigQuery. Fortunately, the definitions of the default channels are quite straightforward, and everything you need to create them is exported to BigQuery via that link. Unfortunately, the scoping and naming of that information doesn’t exactly match the dimensions and metrics in Google Analytics, so the SQL required to do this gets a little long.
The SQL
Before we get into details, here’s a look at an SQL statement that re-creates the default Google Analytics channel groupings. Note that I used BigQuery’s Standard SQL dialect to produce these results; the same can be done with Legacy SQL, but you’ll need to use different functions.
SELECT
CASE
WHEN
source = "(direct)"
AND
(medium = "(not set)" OR medium = "(none)")
THEN "Direct"
WHEN
medium = "organic"
THEN "Organic Search"
WHEN
socialSession = "Yes"
OR
REGEXP_CONTAINS(medium, r"^(social|social-network|social-media|sm|social network|social media)$")
THEN "Social"
WHEN
medium = "email"
THEN "Email"
WHEN
medium = "affiliate"
THEN "Affiliates"
WHEN
medium = "referral"
THEN "Referral"
WHEN
REGEXP_CONTAINS(medium, r"^(cpc|ppc|paidsearch)$")
AND
adNetworkType <> "Content"
THEN "Paid Search"
WHEN
REGEXP_CONTAINS(medium, r" ^(cpv|cpa|cpp|content-text)$")
THEN "Other Advertising"
WHEN
REGEXP_CONTAINS(medium, r"^(display|cpm|banner)$")
OR
adNetworkType = "Content"
THEN "Display"
ELSE "(Other)"
END AS channel,
SUM(revenue)/1000000 as revenue,
SUM(transactions) as transactions,
COUNT(DISTINCT fullVisitorId) AS users,
SUM(sessions) as sessions,
SUM(pageviews) as pageviews,
SUM(bounces) as bounces
FROM (
SELECT
visitId,
fullVisitorId,
trafficSource.medium AS medium,
trafficSource.source AS source,
CASE
WHEN
COUNTIF(hits.social.hasSocialSourceReferral = "Yes") > 0
THEN
"Yes"
ELSE
"No"
END AS socialSession,
trafficSource.adwordsClickInfo.adNetworkType AS adNetworkType,
MAX(totals.totalTransactionRevenue) AS revenue,
MAX(totals.transactions) AS transactions,
MAX(totals.visits) AS sessions,
MAX(totals.pageviews) AS pageviews,
MAX(totals.bounces) AS bounces
FROM
`project-id.VIEW_ID.ga_sessions_*` AS sessionsTable,
UNNEST(sessionsTable.hits) AS hits
WHERE
sessionsTable._TABLE_SUFFIX BETWEEN "20180201" AND "20180228"
GROUP BY
visitId, fullVisitorId, medium, source, adNetworkType
)
GROUP BY
channel
ORDER BY
revenue DESC
The Results
Let’s look at what that produces:
channel | revenue | transactions | users | sessions | pageviews | bounces |
---|---|---|---|---|---|---|
Paid Search | 3892105 | 16869 | 312032 | 345671 | 125368121 | 80926 |
Display | 2156102 | 31054 | 450211 | 468100 | 241588014 | 91855 |
. . . | ||||||
Social | 1856721 | 89152 | 989362 | 1153251 | 568841352 | 102568 |
(Other) | 423256 | 52612 | 301256 | 561225 | 2231381 | 40325 |
Basically, we get a breakdown of all of the metrics I’ve chosen to include here by marketing channel, which is a recreation of the Default Channel Grouping dimension in Google Analytics, assuming you haven’t changed its definitions.
Dissecting the Query
Let’s look at what I’m doing in this query from the bottom up. You might notice that I have two SELECT
statements here, one nested in the other. This creates a table with the inner statement, which is then queried by the outer statement. Let’s look first at the FROM
clause for that inner statement — where am I fundamentally getting this data from?
FROM
`project-id.VIEW_ID.ga_sessions_*` AS sessionsTable,
UNNEST(sessionsTable.hits) AS hits
WHERE
sessionsTable._TABLE_SUFFIX BETWEEN "20180201" AND "20180228"
At its base, all of the data here comes from the exported Google Analytics 360 sessions tables. As I’ve shown in my previous post, I’m using a wildcard on the sessions table name to query a set of tables with the date range I specified in the WHERE clause. But I’m also doing something with the repeated hits records within each session. Namely, I’m flattening the resulting table with UNNEST()
.
The UNNEST()
function, which is similar to Legacy SQL’s FLATTEN()
function, takes the supplied table with its repeated nested records and flattens the data. This means that, where I originally had a row representing a session and nested rows within it representing hits, I now have just a set of rows representing all hits in the date range, with repeated information about the session each hit belongs to in every row. Hits with common values for session-level columns belong to the same session.
I’m doing all this for one reason: According to the default channel groupings, the definition of the Social channel includes as a sufficient condition, a check for a social referrer. Google Analytics identifies social referrers from an internal list and sets a dimension value to “Yes” or “No” whenever a hit’s referrer is on the social list. This dimension is set at the hit level, so we need to identify sessions that contain a hit with a “Yes” social referrer value to drop them in the Social channel. More on this later.
Let’s look at the columns I’m pulling from my newly-flattened data:
SELECT
visitId,
fullVisitorId,
trafficSource.medium AS medium,
trafficSource.source AS source,
CASE
WHEN
COUNTIF(hits.social.hasSocialSourceReferral = "Yes") > 0
THEN
"Yes"
ELSE
"No"
END AS socialSession,
trafficSource.adwordsClickInfo.adNetworkType AS adNetworkType,
MAX(totals.totalTransactionRevenue) AS revenue,
MAX(totals.transactions) AS transactions,
MAX(totals.visits) AS sessions,
MAX(totals.pageviews) AS pageviews,
MAX(totals.bounces) AS bounces
The first few things I’m collecting here are pretty basic. I want the ID of the visit to isolate individual sessions from my flattened data, the ID of the visitor to calculate my Users metric later, as well as the medium and source of each hit’s session, since these are important to determining the channels each belongs to.
The next bit uses a conditional check to determine whether or not a session is social. Recall that every hit that has a social referrer is so-marked, and we can identify social sessions as those including a social-referrer hit. This is done by the hit.social.hasSocialSourceReferral column. This column contains the string “Yes” if the hit is from a social referrer. So, using a COUNTIF()
function, I count the number of times in each session that there’s a “Yes” value for the hasSocialSourceReferral column within that session’s hits. If that number is greater than zero, we set “Yes” for every row in the session so that when we re-aggregate the results, we’ll have a “Yes” for our new socialSession column.
CASE
WHEN
COUNTIF(hits.social.hasSocialSourceReferral = "Yes") > 0
THEN
"Yes"
ELSE
"No"
END AS socialSession,
In summary, what I’m doing here is saying that as I re-aggregate the data into sessions I am setting a new column, socialSession
, to “Yes” if the session contains a social hit. Otherwise I’m setting it to “No”.
Next I collect the trafficSource.adwordsClickInfo.adNetworkType
column as well, as that’s important for distinguishing between Display and Paid Search channels when it comes to AdWords ads.
Finally, I’m grabbing my metrics but a bit differently than in some of my past examples:
MAX(totals.totalTransactionRevenue) AS revenue,
MAX(totals.transactions) AS transactions,
MAX(totals.visits) AS sessions,
MAX(totals.pageviews) AS pageviews,
MAX(totals.bounces) AS bounces
Recall that I’m working with flattened data at this stage, meaning that I have a row for every hit, not just every session. The row for each hit includes all the data about the session, including its totals for revenue, transactions, etc. Those total values are going to be repeated across each hit. In previous examples, I’ve just used the SUM()
function to add up totals but now that I have repeated redundant values, I need a way of getting the individual values for each session. I only need to select each metric once, so instead of using SUM()
to add the values together I’m using MAX()
to get the highest value of each group (they should all be the same, anyways).
Lastly, to end my inner query I group my results by the dimensions I’ve selected:
GROUP BY
visitId, fullVisitorId, medium, source, adNetworkType
This ensures that I have only one row per session again.
Back to the top of my query, we can now see what I’m selecting from this inner table. It begins with a new column I’m creating called channel . I finally have everything I need to re-create the channel definitions in Google Analytics, so let’s get to it:
SELECT
CASE
WHEN
source = "(direct)"
AND
(medium = "(not set)" OR medium = "(none)")
THEN "Direct"
WHEN
medium = "organic"
THEN "Organic Search"
WHEN
socialSession = "Yes"
OR
REGEXP_CONTAINS(medium, r"^(social|social-network|social-media|sm|social network|social media)$")
THEN "Social"
WHEN
medium = "email"
THEN "Email"
WHEN
medium = "affiliate"
THEN "Affiliates"
WHEN
medium = "referral"
THEN "Referral"
WHEN
REGEXP_CONTAINS(medium, r"^(cpc|ppc|paidsearch)$")
AND
adNetworkType <> "Content"
THEN "Paid Search"
WHEN
REGEXP_CONTAINS(medium, r" ^(cpv|cpa|cpp|content-text)$")
THEN "Other Advertising"
WHEN
REGEXP_CONTAINS(medium, r"^(display|cpm|banner)$")
OR
adNetworkType = "Content"
THEN "Display"
ELSE "(Other)"
END AS channel,
Basically, I’m using a big CASE
statement to go through each definition, applying the label for the channel WHEN
its definition conditions are met. The sources are mostly defined by source and/or medium values, but there are a few cases where things are more interesting:
- I’m using the
socialSession
column I created in the definition of the Social channel. - In various cases I use the
REGEXP_CONTAINS()
function to determine whether the given column (always medium in these cases) matches some regular expression, per the definitions Google supplies for its channels. - If none of the conditions are met, I assign the channel “(Other)” just as Google Analytics does.
Now it’s time to sum up my totals for each channel:
SUM(revenue)/1000000 as revenue,
SUM(transactions) as transactions,
COUNT(DISTINCT fullVisitorId) AS users,
SUM(sessions) as sessions,
SUM(pageviews) as pageviews,
SUM(bounces) as bounces
Since I used the MAX()
function in the inner query to ensure each row got the literal value of the original session’s metrics and not a total, I can safely add together the rows that match each channel definition to get my new totals using SUM()
. I also use the COUNT(DISTINCT)
method to determine the number of unique visitor IDs in my data and get a Users metric.
- You’ll notice that revenue is divided by one million here; This is because Google Analytics stores currency values as Micros (or one millionths) of the view’s currency.
- The same caveat I applied to the Users metric in my previous post applies: this is the “old” method of counting users that is still used in unsampled reporting. You may see different values in the UI and Analytics Reporting API because a different method of counting users is used there by default.
Finally, all that’s left is to re-aggregate the results by channel using GROUP BY and make our table a little nicer by sorting by revenue, using ORDER BY :
GROUP BY
channel
ORDER BY
revenue DESC
Takeaways
That’s a fair bit of work to accomplish something that seems fairly simple in Google Analytics, but it highlights some of the specifics of this data. Of course, this isn’t the only way to do this. I wouldn’t claim it’s the most efficient either, but it gets the job done.
The main thing I hope you get from this is a better understanding of the way this data is structured so you can start playing around and making your own channel definitions, or regrouping other dimension values in the same way. Happy querying!