How To Query Google Analytics 360 Data Via BigQuery
March 26, 2018 -
In a previous post, I covered the basics of setting up the Google Analytics 360 BigQuery Export, and covered some details about the schema that export provides. In this post, I want to start digging in to the fun part: querying the Google Analytics data in Google BigQuery.
Now, this post requires a very basic understanding of Structured Query Language (SQL), which forms the basis of the languages you can use to query data using BigQuery. We’re not going to be updating the data in BigQuery with any of the queries we’ll be writing, so we really only need to understand the syntax for the
SELECT statement, which is used to pull data rows from a database, such as the one that powers BigQuery.
In SQL, you obtain data from the database using a
SELECT statement, which describes what data you want to look at. The basic anatomy of a
SELECT statement is like this:
SELECT /* the columns you want */ FROM /* the table that contains the data */ WHERE /* some condition is met */ GROUP BY /* columns that, when their values match, should result in a rollup of the data */ HAVING /* some condition that is met after grouping */ ORDER BY /* some column you want to order results by */ LIMIT /* some number of results you want to limit the response to */
There are more things you can do, of course. You can create or rename columns, you can join multiple tables together. But the basic structure looks like the above.
When looking at these examples, take note that BigQuery currently has two query languages. One is Legacy SQL, the original BigQuery language. The other is called Standard SQL, and aims to bring BigQuery’s SQL dialect closer to the standard SQL used by most database servers. I’ve generally supplied examples in Standard SQL, unless the Legacy SQL method differs significantly, in which case I’ve provided both.
The first thing you’ll want to understand about the Google Analytics data set in BigQuery is that it follows the
<prefix><day> format that BigQuery requires to save each day’s worth of data in its own table. This means that, in addition to querying the literal name of a table, you can also query multiple tables simultaneously by defining a date range. So, if you were pulling data only for a single day, you might do something like this in Legacy SQL:
SELECT SUM(totals.visits) as Sessions FROM [project-name:VIEW_ID.ga_sessions_20171001]
Or, in Standard SQL:
SELECT SUM(totals.visits) as Sessions FROM `project-name.VIEW_ID.ga_sessions_20171001`
That basic query will get you a session count for a single day, in this case October 1, 2017. But what if I want to get all sessions in October? This is accomplished differently in the two dialects of BigQuery SQL. Using Legacy SQL, you can do this with the
TABLE_DATE_RANGE() function and some
TIMESTAMP() -encoded dates:
SELECT SUM(totals.visits) as Sessions FROM TABLE_DATE_RANGE( [project-name:VIEW_ID.ga_sessions_], TIMESTAMP('2017-10-01'), TIMESTAMP('2017-10-31') )
Notice that I left off the date portion of the table name when I supplied it to the
TABLE_DATE_RANGE() function. That function will automatically add the date portions when it builds the list of tables to query.
If you’re using Standard SQL, you can instead query the dates using a table suffix wildcard:
SELECT SUM(totals.visits) as Sessions FROM `project-name.VIEW_ID.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN "20171001" AND "20171031"
As I mentioned in my previous post, most of the data in the BigQuery export for Google Analytics 360 excludes derivative dimensions and metrics, although there are some totals that are kept for common top-line metrics. Those totals are summed up for each session (therefore, in each row). You can query them like this, in Standard SQL:
SELECT date, SUM(totals.totalTransactionRevenue)/1000000 AS revenue, SUM(totals.transactions) AS transactions, COUNT(DISTINCT fullVisitorId) AS users, SUM(totals.visits) AS sessions, SUM(totals.pageviews) AS pageviews, SUM(totals.bounces) AS bounces FROM `projectname.viewid.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN "20171001" AND "20171031" GROUP BY date ORDER BY date ASC
This query will pull the total revenue, transactions, users, sessions, pageviews and bounces for each session and add them up, grouping them together by date. The result will be a table that looks like this:
|. . .|
Some things to notice about this query:
- We are defining new columns for each of our metrics, calculated using the
- We are naming each column
- We are grouping by date, which means that instead of having the date spread out over several rows, we are aggregating the results for each date (using the
- We are ordering by ascending date, which means our output goes from the first to the 31st of October, in this case.
- Sessions are called “visits” in the exported schema. This is an artefact of the days when Google Analytics used the term “visits” instead of “sessions”. We’ve re-named the column as “sessions” in our output.
- Revenue here is divided by one million before returned. This is because Google Analytics stores currency amounts as micros, one millionths of the currency used in the view. We need to convert this back to single units before reporting it.
- We’re using the
COUNT(DISTINCT fullVisitorId)method to count users. This is roughly the old calculation that Google Analytics used to calculate its Users metric before early 2017 when it changed things. The old calculation is still used in unsampled reporting in Google Analytics 360. Analytics standard reports still use the old calculation by default, but if you want your Analytics 360 reports to match BigQuery exactly, you’ll have to disable the Enable Users Metric in Reporting feature in the property settings. This forces the old calculation to be used.
In this query, I used Date as a dimension. That gets us those metrics broken down by day, but you can use the same method to pull data by any other dimension. For instance, to pull data by medium and source:
SELECT medium, source, SUM(totals.totalTransactionRevenue)/1000000 AS revenue, SUM(totals.transactions) AS transactions, COUNT(DISTINCT fullVisitorId) AS users, SUM(totals.visits) AS sessions, SUM(totals.pageviews) AS pageviews, SUM(totals.bounces) AS bounces FROM `projectname.viewid.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN "20171001" AND "20171031" GROUP BY medium, source ORDER BY revenue DESC
This produces a similar table to the above, but in this case we have data by medium and source, and it’s sorted by a metric this time (in this case, transaction revenue).
That covers off some basics when it comes to pulling this data, but we haven’t even scratched the surface. In my next post in this series, I’m going to look at re-creating the default channel groupings in Google Analytics 360 within the BigQuery-exported data. Stay tuned!