Same-Weekday Year-Over-Year Comparisons In Google Data Studio With GA 360 And BigQuery

by Colin Temple

When it comes to performance reporting, Google Data Studio has been a game-changer. Its place in the Google Analytics 360 Suite has made it a common platform for reporting digital analytics data, but that’s hardly a limitation of the tool. With the ability to connect to many sources of arbitrary data, including Google Sheets, Google BigQuery, and countless third-party sources via Community Connectors, along with its ease of use and portability, Data Studio is a fantastic tool for measurement reporting, full stop.

Despite my love for it, Data Studio has a few limitations that we’ve eagerly been waiting for enhancements to overcome. Among them is our present topic: flexibility in date range comparisons.

The problem with date range comparisons is a common one. There are many organizations, especially the larger ones, that have this problem with Data Studio. So, be forewarned: this is a long post, in which I go into detail on why this is a problem, and how we can fix it using the Google Analytics 360 integration with Google BigQuery. This gets technical; the way we solve this is with a fair bit of SQL code, and I do my best to explain how and why it works in detail. I also provide some walkthrough on getting things set up to use this data in the Data Studio user interface. There’s a lot of ground to cover here, but it’s a common enough problem and the solutions are not clear.

It is my hope that, before long, Google will render this post obsolete by providing more options for date ranges and date range comparisons in Data Studio reports. I will be happy to see a better way, and at that time this post can simply hang around for posterity so that others might learn something from the mechanics. Until that happens, though, this is the best method I’m aware of for same-weekday year-over-year comparisons using Google Analytics data in Data Studio, and so off we go.

TL;DR

Given that this is quite long, a summary: You can use the query I provide in this post as a Custom Query in Google Data Studio to obtain Analytics 360 data that has been exported to BigQuery and have year-over-year comparisons that line up the same days of the week, rather than calendar dates. There are some things you’ll need to be careful about when using the resulting data set, and the date range comparisons you’re left with have some limitations around conditional formatting. Still, if same-weekday year-over-year comparisons are critical for your reporting, this is presently the best way to do this while taking advantage of the excellent features that Data Studio does have.

The Problem

Elements in a Data Studio report feature a date range comparison option that allows you to show a metric’s value or % change compared to its value for a previous reporting period. For the moment, that comparison tool has two options: “previous period” and “previous year”.

Choosing ‘previous period’ is simple enough: if your date range is set to Yesterday, the comparison will be with the day before that. If you choose last week, then the metric values will be compared with the same from two weeks ago. Previous year works the way you’d probably expect on the face of it: it selects data for the same calendar date range in the previous year. If I set my date range for April 30, 2018, the previous-year comparison will show me the change since April 30, 2017.

That’s fine for many purposes, but it’s not always the best comparison. April 30, 2018 was a Monday, but April 30, 2017 was a Sunday. In many cases, user behavior on a weekend differs significantly from behavior on a weekday, so Monday-vs-Sunday is not a fair comparison. Select a Sunday-to-Saturday week and your previous-year comparison shows you deltas from a Saturday-to-Friday week, or a Friday-to-Thursday week if there’s a leap day in the intervening year. In some cases, you’ll want to see the date range lined up with the same day(s) of the week in the previous year, rather than the same dates. If you’re, say, a U.S. retailer and the week in question is the lead-up to Black Friday, the difference in the comparison here could be huge.

Option: Multiple Date Ranges?

Data Studio features a built-in date range widget that allows users of the report to change the date range that elements within a report will, by default, display data for. Connectors like that for Google Analytics automatically respond to these widgets, while other data sources, such as Google BigQuery, allow you to choose a date dimension to pull data according to.

If we’re looking at a Google Analytics data source, our date range dimension is already picked for us, and can’t be changed. Our only real choice here is to use multiple elements with multiple date ranges. Data Studio lets you associate elements in a report with each other using Groups. Not only does that make those elements easy to select and move together, but if a date range picker is in the group, all elements in that group will respond to it. This makes it easy to have a date range for primary reporting and a second date range for comparisons. The nice part about this option is that it leaves you with more user control. Users can make any comparisons they want, since the date range selections are completely up to them. There are some downsides to this, though, that make this option pretty limiting:

  • It’s less usable: date range comparisons take more action by the user and open up more possibility for user error. It’s also hard to choose good defaults for that comparison selector.
  • Calculating the delta in a metric value between the two date ranges is pretty much impossible because any calculated metric can only display one date range at a time.
  • Similarly, any table or chart can only show you data for a single date range, so comparisons in the same report element can’t be done.

Any one of those issues is a likely deal breaker. What else can be done?

Option: Google BigQuery?

Google Analytics 360 customers have a better option. The premium version of Analytics comes with an export to Google BigQuery, Google’s speedy data warehousing and analytics tool. Data Studio comes equipped with a BigQuery connector, so if you have the link set up and appropriate access to the Google Cloud Platform project receiving the data, connecting to BigQuery from Data Studio is trivial.

However, it doesn’t take long to realize that we haven’t gotten much further with this basic connection. The date functions available to calculated fields don’t give us many options for producing another date field we might use, and by default we are locked into using the table suffix as our date range. This leaves us with exactly the same problems, at least until we make some changes.

Option: Google BigQuery Custom Query

Data in BigQuery is retrieved using Structured Query Language (SQL) queries on the data table(s) connected. Data Studio writes its own SQL queries to pull data when it needs to. It also provides you with the option to introduce an intervening step between the permanent BigQuery data and its own queries. This step is called a Custom Query.

Custom Queries allow you to include some SQL code that will run as a subquery within the queries that Data Studio writes. By that, I mean that you can write a query that accesses the raw table in BigQuery and transforms it, or adds to it, producing a result that Data Studio’s queries will then act on.

To create a Custom Query, go to the Data Sources page in Google Data Studio. Choose to add a new one and select BigQuery (by Google) under the Google Connectors section. This will give you multiple columns. Under the first one, choose CUSTOM QUERY, and then under Project, choose the Google Cloud Platform project that contains the data exported by Google Analytics 360. If you haven’t enabled this link to actually push the data to BigQuery yet, have a look at my earlier post outlining the basics of the Google Analytics 360-BigQuery integration.

Once you choose a Cloud Platform project to use, the box in which to enter your query will appear. (In the case of a custom query, you don’t need to select a data set or table, since your query could pull data in from multiple sources.) You can enter any query in here that works in BigQuery, such as my example below. In the case of my query, I’m using the newer Standard SQL dialect (rather than BigQuery’s Legacy SQL), so if you use my example be sure to uncheck the “Use Legacy SQL” option underneath the query text box.

Let’s dive in to a query I wrote to obtain the data for any date range and its year-over-year comparison with the same days of the week.
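Here is a condensed, approximate sketch of its overall shape. Only the hits array is treated, and the project and view IDs are placeholders; the full version applies the same pattern to every repeated field in the export. We’ll walk through each piece below.

```sql
WITH
  data AS (
    -- All sessions, with a num property added to every session and every hit,
    -- and the table suffix exposed as tableDate.
    SELECT
      * REPLACE(
        ARRAY(SELECT AS STRUCT h.*, 0 AS num FROM UNNEST(hits) AS h) AS hits
      ),
      _TABLE_SUFFIX AS tableDate,
      0 AS num
    FROM `project-id.VIEW_ID.ga_sessions_*`
    WHERE _TABLE_SUFFIX LIKE '2%'  -- skip the intraday tables
  ),
  justOne AS (
    SELECT 1 AS num
  ),
  empty AS (
    -- A single all-NULL "session" with the same schema as data, holding one
    -- dummy member inside each repeated field.
    SELECT
      d.* EXCEPT (num) REPLACE(
        ARRAY(SELECT AS STRUCT h.* EXCEPT (num), j2.num
              FROM justOne AS j2
              LEFT JOIN UNNEST(d.hits) AS h ON TRUE) AS hits
      ),
      j.num
    FROM justOne AS j
    FULL OUTER JOIN (SELECT * FROM data WHERE FALSE) AS d ON TRUE
  )
-- Every session appears twice: once as thisYear on its own date, and once
-- as lastYear dated 364 days later.
SELECT d AS thisYear, e AS lastYear, d.tableDate AS dataDate
FROM data AS d CROSS JOIN empty AS e
UNION ALL
SELECT e AS thisYear, d AS lastYear,
  FORMAT_DATE('%Y%m%d',
    DATE_ADD(PARSE_DATE('%Y%m%d', d.tableDate), INTERVAL 364 DAY)) AS dataDate
FROM data AS d CROSS JOIN empty AS e
```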

Why, yes, the full query is 253 lines long (after some formatting). There’s a fair bit going on in it, so let’s dissect it piece by piece and see what it’s doing.

I begin my query using a WITH statement. WITH is used in SQL to identify queries that you can later refer to by name. The first one, which I call data, is the only one that refers to my exported Google Analytics tables directly. It’s also the only one that you need to change in order to make use of this example.

This SELECT statement queries data FROM the `project-id.VIEW_ID.ga_sessions_*` set of tables. That table reference is the only part you need to change in order to refer to the Cloud Platform project ID and Analytics view ID that your data comes from. Note the WHERE clause. Because we’re using a wildcard ( * ) in our table selection, our table name will match both the ga_sessions_YYYYMMDD tables and the ga_sessions_intraday_YYYYMMDD tables (from the non-streaming intraday batch import) if you have them. We haven’t set a date range in this query; that will come in the outer query that Data Studio will wrap around ours. By making sure the _TABLE_SUFFIX that matches our wildcard starts with a “2”, we’re just ensuring that we don’t grab those intraday tables by mistake. This code will be good until the year 3000, and as much as I want my resources here to be helpful and long-lasting, I sincerely hope your successors have moved on from this by that time.

Anyway, that’s all pretty standard. What is perhaps less common is what I’m selecting from the table. Because I generally want everything that Google Analytics gives me to be available to anyone editing the Data Studio reports, I’m selecting * here, meaning all columns from the ga_sessions_* tables are to be included for every row we select. We’ll let Data Studio worry about narrowing down the fields for individual queries. But I’m choosing here to modify some of the columns obtained from *. This is done using the SELECT * REPLACE([expression] AS column) syntax, which removes a column from the resulting table and replaces it with one that you specify by giving it the same alias. The first one is hits. The hits column is an array; it’s a repeated record containing all of the hits for each individual session (row) we select. We want to keep all of the information present, but notice that I’m just adding an extra column to each hit. That is, I UNNEST the hits array (converting it from an array into a virtual table), select every hit from it, and add to each hit a new property, called num, which is set to 0. Then I re-package the hits back into an ARRAY so future queries can operate on it as expected. Each hit is formed into a STRUCT (essentially an object with a set of properties) so that it can be placed into an array, hence SELECT AS STRUCT.

I do this with every repeated field (or array) in the table. These are hits, hits.product, hits.product.customDimensions, hits.product.customMetrics, hits.promotion, hits.experiment, hits.customVariables, hits.customDimensions, hits.customMetrics, hits.publisher_infos, and customDimensions. Any repeated object in the data, however deeply it is nested, gets a new property called num.
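For those deeper levels, the same pattern simply repeats inside the parent record. Sketching it for hits and hits.product only (the full query treats the other nested arrays the same way):

```sql
ARRAY(
  SELECT AS STRUCT
    h.* REPLACE(
      -- each product inside the hit gets its own num property too
      ARRAY(SELECT AS STRUCT p.*, 0 AS num FROM UNNEST(h.product) AS p) AS product
    ),
    0 AS num
  FROM UNNEST(hits) AS h
) AS hits
```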

It’s probably not immediately clear why I’m doing this — why should all hits have an extra appendage with a predictable, constant value? Ultimately, it’s not the value that’s important here; it’s what I’m doing to the schema of a hit that I’m returning. These hits will be worked on as individual objects called STRUCTs, and any STRUCT type in BigQuery has a well-defined and immutable set of properties. More on that later.

Up a level, I’m adding two properties to each row of data we select. The first is a representation of the _TABLE_SUFFIX for the table the row came from. This is the property that will define what tables we actually pull rows from. We’re naming this property tableDate, mostly because Data Studio gets fussy if we try to give a column a name that starts with an underscore. The second property I’m adding is another num property for each session, again populated with the integer 0. As with hits, I’m just making a schema change to this temporary table, and to the STRUCT objects we’ll be pulling out of it.

Next is a much simpler named query, called justOne. Its name is informative: it returns a table with one cell, containing the integer one (1) in a column called “num”. We’ll make use of that in a moment.

My next named query is called empty, and it’s where I’m going to start making use of those modified schemas I generated for hits and sessions, as well as my number one.

As with the data query, let’s start with where this data comes FROM, at the bottom of the empty subquery. In this case, I’m joining two tables together. The first one is the single row returned by the justOne query.

To my single-celled table, I join all of the exactly 0 rows that I will now select from the  data  query. I do this by selecting *  (all columns) from data, under the condition WHERE FALSE. The constant FALSE has this useful property where it never evaluates to TRUE, and so for every single row in data, we’re not selecting it. Just like my extra num columns above, I’m doing this because, in this instance, I’m not interested in the data in this table. I’m interested in the schema of it.

A Brief Overview Of A FULL OUTER JOIN

Let’s consider what a FULL OUTER JOIN does. It takes two sets of rows of data and combines them together. This is typically illustrated by means of a Venn diagram.

Let’s consider an example. Say we have two tables of data, one called Customers:

Id Name
123 Stacy
456 Luke
789 Vanessa

and another called Transactions:

Id CustomerId Items ShippingCountry
A12 123 3 Canada
B23 456 22 Angola
C34 456 1 NULL

Suppose we join them with a little bit of SQL:
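Using c and t as aliases for Customers and Transactions, it looks like this:

```sql
SELECT
  c.Name,
  t.Items,
  t.ShippingCountry
FROM Customers AS c
FULL OUTER JOIN Transactions AS t
  ON c.Id = t.CustomerId
```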

The result would look something like this:

c.Name t.Items t.ShippingCountry
Stacy 3 Canada
Luke 22 Angola
Luke 1 NULL
Vanessa NULL NULL

A FULL OUTER JOIN includes every row from both tables, whether or not the condition specified in ON is met, so everything from both tables is represented here. We have Stacy’s order aligned with her customer name. We have each of Luke’s orders aligned with his. And, since no orders were placed by Vanessa, all of the columns from the Transactions table are null in her row.

That last bit is important to us here. Any rows of the first table (the “left” table) that are not matched to rows in the second table (the “right” table) are added to the resulting table with null values for the columns from the second table. The inverse is also true; if we had transactions with no matching customer record, they’d show up in our joined table with a null customer name.

Back To The Matter At Hand

So, here I’m joining a single row, with a single cell, containing the number 1 in it to another table, with the schema of our data  table, but with no actual rows. I set my join condition to count a match in every case ( ON TRUE), though I could have easily put ON FALSE  in its place — since one of our tables is empty, no comparisons are actually made.

So, in the left table I have:

num
1

and in the right I have no rows, but a fairly detailed schema (which I will abbreviate here):

visitorId visitNumber visitId visitStartTime date

When we fully join those together, we get:

num visitorId visitNumber visitId visitStartTime date
1 NULL NULL NULL NULL NULL

That is, I’ve managed to create a session with NULL  values for every field that normally comes in a Google Analytics export without having to define what that schema is.

That’s the kicker. I could declare an empty STRUCT, but recall that STRUCT objects have defined sets of properties, and if we want to, say, create a table that joins multiple STRUCT objects together under the same schema, they must either have identical schemas, or we must be prepared to list all of their properties. So, if I wanted to make a session with all-null values, I’d need to do something like this to define it:
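Abbreviated heavily, it would mean spelling out every field and its type by hand, something like:

```sql
SELECT
  CAST(NULL AS STRING) AS fullVisitorId,
  CAST(NULL AS INT64)  AS visitNumber,
  CAST(NULL AS INT64)  AS visitId,
  CAST(NULL AS INT64)  AS visitStartTime,
  CAST(NULL AS STRING) AS date,
  STRUCT(
    CAST(NULL AS INT64) AS visits,
    CAST(NULL AS INT64) AS hits,
    CAST(NULL AS INT64) AS pageviews
    -- ...and every other field in totals...
  ) AS totals,
  ARRAY<STRUCT<hitNumber INT64, type STRING>>[] AS hits
  -- ...and so on, for every remaining property in the export...
```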

That’s not great for a couple of reasons. The first is that I’m lazy; I don’t want to write out the full schema of the BigQuery export. At the time of this writing there are 337 properties involved here once we count all of the nested records. The second is that, should the schema of the BigQuery exports from Google Analytics 360 ever change, so too must our queries. I’d rather infer the schema dynamically.

The method I’m using here (making a dummy row and joining the full schema to it) is pretty much the easiest way to instantiate a STRUCT  with an unknown type. BigQuery doesn’t give you a table’s schema in its query-accessible metadata. You can try to rebuild the schema from a row of the data, if you’d like, by JSON-encoding a row with the TO_JSON_STRING()  function, passing that resulting string to a JavaScript user-defined function (UDF) that, in turn, returns a set of fully-qualified keys paired with their types, and then (somehow) using those values to construct a new virtual table. (Don’t even think about instantiating the null-value session from a UDF. My kingdom for dynamic types in those functions.) I know these methods don’t work particularly well because I tried them.

So, once those two sets are joined, and I have a single empty row, what am I selecting? I start by selecting all of the columns from the 0-row data result, with the exception of num, the column I added to each row in the data query. Note that I gave it the alias d when I selected it.

As with the data query, I’m replacing some of the columns I select using REPLACE(); specifically, all of those columns that contain repeated records. However, this time I’m doing it a little differently.

When I was working with the raw data, I joined all of the properties of the object (in this case a custom dimension) with a new property, num, which I explicitly set to 0. The FROM in those subqueries was simple: I was just pulling the columns from the nested table. In this case, I’m doing a little more than that. I’m pulling all of the columns from the nested object, but remember that we’re creating a dummy session here. All of the repeated columns are empty, but I want my resulting “empty” table to contain at least one record for every repeatable column. To do this, I’m using a LEFT JOIN to take our single-row table (just containing the number 1 in one cell), and extending it with the schema for the repeatable record.
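For the session-level custom dimensions, for example, that replacement looks roughly like this (d being the alias for the zero-row selection from data):

```sql
ARRAY(
  SELECT AS STRUCT cd.* EXCEPT (num), j.num
  FROM justOne AS j
  LEFT JOIN UNNEST(d.customDimensions) AS cd ON TRUE
) AS customDimensions
```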

This works because of what LEFT JOIN  does. Specifically, it takes every row from the first (or “left”) table, and compares it with every row from the second (or “right”) table using the condition specified in ON . If the condition evaluates to TRUE  for that pair of rows, the values from the right table are joined to the values in the left table. If there are any rows remaining in the left table to which nothing has been joined, those rows are added to the result with NULL  values for all columns not in the left table’s schema. Any rows that were not joined from the right table are ignored.

Because there are no rows in the right table to join at all, no comparisons are made. (I set the condition for the join to ON TRUE, but it could have been anything.) I’m left with one unjoined row in the left table, and it is returned with NULL values for all columns in the right table’s schema.

OK, so now we know how we can get an all-null STRUCT with a schema consistent with the one we used when we selected actual rows from our data table. What I have not done yet is explain why we would want to do this. To understand that, we first need to look at how my example finally outputs our data.

Onward!

Now we’re on the last part of the query, the actual select statement that triggers the subqueries we defined above. This is where we’re going to put together our current-year and previous-year data into a common table, from which we can query values for either or both as we please.
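Roughly, it is the two-pass UNION ALL from the sketch above:

```sql
-- Pass 1: real sessions in thisYear, the dummy row in lastYear,
-- dated by the table each session came from.
SELECT d AS thisYear, e AS lastYear, d.tableDate AS dataDate
FROM data AS d CROSS JOIN empty AS e

UNION ALL

-- Pass 2: the same sessions in lastYear, the dummy row in thisYear,
-- dated 364 days later.
SELECT e AS thisYear, d AS lastYear,
  FORMAT_DATE('%Y%m%d',
    DATE_ADD(PARSE_DATE('%Y%m%d', d.tableDate), INTERVAL 364 DAY)) AS dataDate
FROM data AS d CROSS JOIN empty AS e
```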

This query will (if run on its own) SELECT all of the rows of sessions generated from the data query and put them into a column called thisYear. It will then take the results of my empty query (which is a single row with nearly all-null values) and throw them into a column called lastYear. It will then add to these two columns a third column named dataDate, which we will simply set to the tableDate value we obtained from _TABLE_SUFFIX of the original tables.

To all of those rows just obtained, it will add a second pass through the data query’s results, this time assigning the data obtained to the column called lastYear. To the thisYear column, it will add the empty row. And this time, it will set the dataDate column to a different value: 364 days after the tableDate from data.

The rows from each pass are combined together into a single table using UNION ALL.

So, the result is a table that, in principle, contains two rows for every session in the original table. One of those rows has the data for the session in the thisYear column, and has a dataDate that matches the date column (the Google Analytics Date dimension) within it, and the _TABLE_SUFFIX of the table it originally came from. The other row has the same session data in the lastYear column, but this time has a dataDate that is 364 days after the value of the date column and _TABLE_SUFFIX of its original table.

That seems like a lot of data, but the magic happens when Data Studio wraps a query around this. If we create an element in a Data Studio report and set the date dimension of that element to dataDate, then when Data Studio asks for a value for a given date range, it will limit the set of columns it ultimately selects, and add a WHERE clause to its outer query. For a date range like April 23–29, 2018, that clause will look approximately like this:
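```sql
WHERE dataDate >= '20180423' AND dataDate <= '20180429'
```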

… and consequently we’re not actually aggregating all the tables in your Google Analytics dataset.

The 364-day offset is what we need to grab last year’s data. A row in our table that has yesterday’s date in dataDate  but nothing but an empty STRUCT  in thisYear  will have a record in lastYear  corresponding to a session on the same weekday last year. So, when we select based on dataDate, we’ll end up with all the data from our date range, and all of the data for the same range shifted 364 days in the past, which lines the dates up by weekday.

The 364 number is constant, and accounts for any year, because lining up weekdays is as simple as counting back 52 weeks, and 52 × 7 is 364. The same date, on the other hand, drifts back one weekday every year (365 days), except for leap years (366 days) when it drifts back two.

Why the “empty” STRUCT values?

We’re finally in a place where I can explain why I went to all that trouble to alter my table schemas and create empty STRUCT  objects by joining nothing with almost nothing. The answer has to do with the way Data Studio queries our data when we want to view this year’s data and last year’s data in the same element, such as in a chart or a table. Let’s say we want to grab the total number of hits this year and last year. In Data Studio, we define this as a calculated field with the formula COUNT(thisYear.hits.type) for this year, and COUNT(lastYear.hits.type) for last year, and then we either group them by a common dimension in a table, or we have a third calculated field that computes the difference between these values.

When we do this, Data Studio will generate a query that grabs all of the hits from both our lastYear and thisYear objects. Since hits are a nested, repeated record inside each row, they need to be unnested and joined back to the data. Here’s what Data Studio does:
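The generated query is machine-written, but it looks roughly like this (aliases simplified here for readability):

```sql
SELECT
  dataDate,
  COUNT(t_hits.type) AS hits_this_year,
  COUNT(l_hits.type) AS hits_last_year
FROM (
  -- our entire custom query from above goes here
) AS t
CROSS JOIN UNNEST(t.thisYear.hits) AS t_hits
CROSS JOIN UNNEST(t.lastYear.hits) AS l_hits
WHERE dataDate >= '20180423' AND dataDate <= '20180429'
GROUP BY dataDate
```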

This is exactly what you would do if you wanted to get all of the hits from this year and last year for the relevant dates, under the assumption that every row has hits for both this year and last year.

In our case, unless we add in that empty struct, either thisYear.hits  or lastYear.hits  is a NULL  array, and so UNNEST  produces a table with no rows.

We’ve looked at FULL OUTER JOIN and LEFT JOIN; what happens when we CROSS JOIN?

 

CROSS JOIN produces a Cartesian product of the two tables. That is, from two tables (left and right), it produces a result that contains all combinations of the rows in the left table and the rows in the right table: for each pairing of a left row with a right row, you get one result row containing the values of both.

And so, that’s exactly what happens when, say, lastYear.hits is empty — the join pairs every hit in the session for this year with each of the zero hits from last year, which produces zero rows. And vice-versa for rows where thisYear.hits is empty. So, when we join arrays from both objects, we end up with no rows every time, because at least one of those arrays is null in every row.

If we were writing this query ourselves, we could simply replace CROSS JOIN with a LEFT JOIN and we wouldn’t have this problem. But since Data Studio is writing this query, we need to be prepared. I’ve done this by instantiating an empty object for every single array in the table, from custom dimensions and metrics to hits and products. Note that it is not enough to create an array with a single null member; BigQuery interprets those arrays as empty. Further, if you declare something as NULL, it’s interpreted as a null integer unless the type of the column or object is already specified. So, instantiating an empty STRUCT requires us to know or infer the full schema if we ever hope to put that object in a table.

And since, in our case, we potentially want to compare any repeatable record with its same values from the previous year, I’ve chosen to instantiate an object in each of them.

Using The Results In Data Studio

In Data Studio, you can basically use the thisYear  or lastYear columns as you would any Google Analytics 360 data in BigQuery, but some care needs to be taken when you want to query these two fields together, such as when grouping them by common dimensions. And, if you’re unfamiliar with the GA 360 BigQuery export to begin with, you’ll want to know something about the nature of this data. My quick primer on querying Google Analytics 360 data in BigQuery may help as well.

Here’s a look at what we’re building:

A couple of things to note quickly here:

  • Rather than contain comparisons in a single scorecard, or a single table column, the metrics here are each represented by two separate metrics, one for our date range and one for our last-year date range. For example, we have a “Sessions” metric and a “Sessions Last Year” metric.
  • Similarly, the comparison is itself done in a separate metric. That is, we have a “YoY %Δ Sessions” metric.

The Data Studio Data Set

When you load in a normal BigQuery data set containing Google Analytics 360 data, you get a set of fields that match the BigQuery export schema for Google Analytics. Fields are named in the hierarchy of that data: nested records, like those belonging to hits, are preceded by their corresponding prefixes (e.g., fields include visitId, hits.type, hits.product.productPrice and so on, based on their nesting).

In our new data set, the fields are named the same, but duplicated — there are two versions of each field, one prefixed with thisYear and another prefixed with lastYear.

There’s also an important, extra field here, called dataDate. That’s the one we want to use as the Date field for all of our report elements in our report. Recall that this field’s value will, in each row, represent either the date corresponding to the data that is in the thisYear column (in which case lastYear will contain a single dummy row), or the date 364 days after the date of the data in the lastYear column (in which case thisYear will contain a single dummy row). This means that when we set a date range using the dataDate column, we’re grabbing all rows (all sessions) for the date range we set, and for that same date range shifted 364 days in the past, with an easy way to differentiate between data belonging to the current date range or the shifted one.

Dimension And Metric Formatting

There’s a consequence of our division of all fields into this-year and last-year columns. While it’s easy enough to count metric values as being uniquely for this year or last, we will want to aggregate results based on common values for the dimensions, such as in the Channel Grouping table in our above output. But, which column do you use? If you choose to group your aggregated metrics by thisYear.channelGrouping, you’ll end up with a proper breakdown for this year, but your totals for last year will all be in a “null” grouping.

For this reason, you’ll want to create a calculated field in Data Studio for every dimension you want to use in a date range comparison. The formula for such a dimension will look something like this:
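For the channel grouping, for instance, it can be as simple as checking which side of the row actually has data:

```
CASE
  WHEN lastYear.date IS NULL THEN thisYear.channelGrouping
  ELSE lastYear.channelGrouping
END
```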

(In this example, I’m using lastYear.date as my field to check whether the last-year side of the row is empty, because I know that if there’s data there, date will always have a value, where some fields may not.)

The above formula is the one I’m using for the “Channel Grouping” calculated field that I’m using as a dimension in the table for my report.

If you’re creating a detailed report, making all these custom dimensions can be a pain. For presentation purposes, you probably want to at least rename them from the BigQuery data, but there is an option to set up your data source to include a single column that can be used for your dimensions by default. In the SQL, it’s just a matter of duplicating your data that’s going into the thisYear or lastYear objects into a third column, which we could call all, for this example. If you did this, you could set up your table to use all.channelGrouping as a dimension, and then thisYear.visits and lastYear.visits for your Sessions metrics. That gives you a common field that will be populated for both sets of rows, but it does add some redundancy to your query, which makes it a little less efficient in terms of data transport. Your call.
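In the final SELECT, that just means adding the same session record under one more name in both branches of the UNION ALL, roughly like this (note that all is a reserved word in BigQuery, so it has to be backquoted):

```sql
SELECT d AS thisYear, e AS lastYear, d AS `all`, d.tableDate AS dataDate
FROM data AS d CROSS JOIN empty AS e
-- ...and likewise d AS `all` in the second branch of the UNION ALL.
```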

Finally, a bit of niceness to add: Metrics in the output will, when BigQuery returns no rows for a query, show “null” instead of “0” in your reports. For example, a table breaking down Channel Groupings like ours that had no sessions for “Paid Search” last year will show “null” in the table for last year’s sessions. That’s a bit ugly and not what you really expect in reporting. This can be quickly solved with calculated fields as well:
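A sketch of the idea, using totals.visits from the export as the session count:

```
SUM(CASE WHEN thisYear.totals.visits IS NULL THEN 0 ELSE thisYear.totals.visits END)
```

The “Sessions Last Year” field is the same formula with lastYear.totals.visits in place of thisYear.totals.visits.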

Using the above formula, I can create a field called “Sessions” which will report null values as “0”.

Once I’ve made my “Sessions” and “Sessions Last Year” calculated fields, building a third one to calculate their % difference is easy:
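Since a calculated field can refer to other calculated fields by name, it is just the usual percentage-change arithmetic:

```
(Sessions - Sessions Last Year) / Sessions Last Year
```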

Using that formula, and formatting the resulting field as a percentage, gives you what you need.

Once I’ve created my calculated fields for the dimensions and metrics that I want to use in my report, it’s a simple matter of dropping the reporting elements into place and pointing them at the right data.

Limitations

There are still some limitations here. One is that our comparisons are separate calculated fields, rather than Data Studio’s built-in date range comparisons, so they do not appear within the same scorecards as a lift indicator. There’s no arrow or color coding to give a clear visual indication of whether the delta we’re displaying is an increase or a decrease — this is identified simply by whether the value is positive or negative.

This is an annoying limitation for presentation, but it does come with a silver lining: because the deltas here are their own metrics, you can sort a table by them, meaning that you can rank dimension values (such as channel groupings) by their biggest change year-over-year for some metric.

Directional Arrows

Because Data Studio doesn’t yet have any conditional formatting, there’s no good way to add a color-based indicator to these values (such as red for negative, green for positive). I just said there’s also no arrow to indicate direction, but can we make one?

As it turns out, you can, but it’s difficult. Say, for example, I wanted my first comparison there to appear not as “7.2%” but as “+7.2% ⯅” instead. You might expect that you could do something like this:
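Perhaps a CASE that builds the whole string from the delta field, along these lines:

```
CASE
  WHEN Sessions >= Sessions Last Year THEN CONCAT("+", YoY %Δ Sessions, " ⯅")
  ELSE CONCAT(YoY %Δ Sessions, " ⯆")
END
```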

Interestingly, Data Studio will not parse a CASE statement that has a CONCAT() function as its result. Trying the above simply gets you an “Invalid formula” error, and looking at the JSON that’s actually returned by the server when validating the expression, the error message is “Invalid input expression – Failed to parse CASE statement”. OK, perhaps we could simply get an arrow that we could present on its own:
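Say, by comparing the two session metrics directly:

```
CASE
  WHEN Sessions >= Sessions Last Year THEN "⯅"
  ELSE "⯆"
END
```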

You could then use that arrow in a table report element, perhaps, with no header, row numbers or pagination. The single cell could then be placed next to our scorecards at the top of the report.

Unfortunately, this yields another error: “Only numeric literals are allowed if aggregated fields are used in CASE.” Ultimately, I was able to get those arrows to appear by separating out my reasoning into stages. First, I made a “Sessions Direction Stage 1” calculated field with the following formula:
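It reduces the comparison to a plain number, which is the one kind of result that restriction allows:

```
CASE
  WHEN Sessions >= Sessions Last Year THEN 1
  ELSE 0
END
```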

That worked. Then, I made another “Stage 2” field to turn that number into a text value with the arrow:
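Something along these lines:

```
CASE
  WHEN Sessions Direction Stage 1 = 1 THEN "⯅"
  ELSE "⯆"
END
```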

I was able to then use that “Stage 2” field on the report (again, in a table with one cell and no header, row numbers or pagination).

This seems a little unnecessary. Given that Data Studio is indeed able to compute these text results from aggregated metric values, it seems to me that these restrictions in saving formulas may not be entirely necessary, or perhaps some version of them that handles these use cases better is possible. (In fact, although most of the time I couldn’t save a formula like my second one in this section, wherein I output a character based on an aggregate metric, I did get it through once and it worked as I expected it to, but attempting to edit it, even without actual changes, would not save the same field. That seems like a bug.)

At any rate, I did want to make a nicer version of this that put the direction and the metric together (perhaps with a “+” for positive values), but ultimately was unable to save any of those formulas.

I’ll note quickly that I also tried to solve this issue from the BigQuery side by making a version of my query that provided pre-aggregated top-line totals to Data Studio that could be interpreted as text and given nice directional indicators. The problem with that surfaced quickly: it’s hard to determine what the date range being requested is from the inner query that you have access to, and aggregate totals for it when you have to output the date as an individual column as well. If anyone has a good solution there, I’m all ears.

Optimizing Queries

To cover off some housekeeping, I have a couple of notes on custom queries that you may want to be aware of.

(1) I am using a custom query within Data Studio to transform the data in this example. This task can be pushed off, though, to IT resources managing Cloud Platform for your organization. If that’s desirable, you can leave the SQL out of Data Studio and instead use it to define a BigQuery View: an intermediary table that can be queried as if it were a permanent one. This behaves essentially the same way that Custom Queries do: the intermediate query is run every time you request data from it.

(2) Because of this fact, both Custom Queries and Views have computing costs associated with them. Refreshing the data causes a significant amount of the raw tables to be scanned, pulled and transformed. It happens quickly, but that work can be distributed among hundreds or thousands of machines. The costs aren’t astronomical, but there are things that can be done to lower them. One of these things is to trade computing costs for storage ones. If you have a report used by many users, that is frequently refreshed and read, it may be worth it to replace the Custom Query or View with some scripting that routinely runs the same intermediate query and writes the resulting transformations to a permanent table.

As I said, the costs here aren’t astronomical. BigQuery is pretty cheap at the volume of data we’re talking about, and if you’re an Analytics 360 customer already, you’re likely to be able to handle the expense, especially given the $500 U.S./month BigQuery credit that Analytics 360 customers get.

Wrapping up

This has been quite the journey to solve a problem that Google will probably solve with some new Data Studio features in the (hopefully) near future. But on it, I was able to uncover quite a bit about how the way Data Studio queries BigQuery impacts what’s actually possible in your end result. My lengthy post is a testament to how complicated such a seemingly simple thing can become, but it does solve the problem, at least until Google provides us with something more elegant.

Aside from the actual answer to the question, I hope this does provide some insights into these tools.

Colin Temple

Director, Analytics Solutions, Napkyn Analytics

Colin serves as Director of Analytics Solutions for Napkyn Analytics, where he focuses on keeping his team at the forefront of digital analytics. A diverse background in software and marketing and an education in logic and philosophy give Colin a unique perspective on where the analytics practice is, and where it should be.