Vandalism with Google Analytics exploits
Google Analytics has a design approach to web analytics software that differs from many of its competitors. Whereas some tools require you to pre-define anything you want to track (such as events, page names and campaign data), Google Analytics allows you to define these in the code or URL of a page, and simply accepts whatever data is thrown at it. This greatly cuts down on the cost, in both time and money, of implementing and maintaining a Google Analytics account. The ease of implementation has been a huge win for Google Analytics.
However, this philosophy comes at a price. Because Google Analytics indiscriminately accepts any data it’s given, it will record bogus data just as readily as the real thing. The result is that, if someone with the right skills is feeling particularly malicious, they can vandalize and seriously distort your business’s data. There are two ways this can be done.
We’ve been aware of these potential issues for some time now, but we wrestled a bit with the decision of whether or not to post this. On one hand, we like to share our knowledge and, since this is a very real fact about Google Analytics, it’s good for GA users to be aware of it. On the other hand, we’re potentially teaching people how to mess with someone’s GA deployment. Ultimately we decided on transparency and honesty — after all, we’re also going to tell you what you can do to protect yourself from these. But we must begin with a caveat: we do not endorse doing anything like this. We offer this information so you can be aware of potential security risks with your own data, and take the necessary steps to protect your data integrity. We are strong supporters of the Web Analyst’s Code of Ethics, and though that code doesn’t say much about messing with others’ data, the idea is generally to be open and honest with data.
(Update: I should also point out that Google Analytics is not alone in being vulnerable to some of this. The approach to campaigns and ease of copying other data makes it easier than with some tools, I think, but those stem from Google’s strengths rather than weakness. I offer Google Analytics up because they don’t have a service level agreement for everyone, and hence it’s up to you to protect some of your data. Despite any vulnerability, I do want to be clear that Google Analytics is a fine tool and this alone is not cause for alarm, just something to be aware of when implementing this tool, and by extension, others like it.)
With that out of the way, here are the potential exploits we’ve seen:
Campaign Vandalism
Google Analytics makes campaign tracking easy. Unlike tools such as Adobe SiteCatalyst, which store campaign tracking codes and convert them into useful data, Google Analytics sets campaign names directly in the URL query parameters, accepting any campaign name that it receives. This saves you time managing all your campaigns and channels, and makes setting up Google Analytics significantly faster. But with GA merely accepting any campaign names it gets, what’s to stop me from visiting your site using a bogus campaign name?
What’s the problem?
Campaigns in Google Analytics work by adding the names of campaigns, media and sources to URLs. For instance, if you want to track a summer email campaign that links to www.example.com, you may enter a URL like this:
http://www.example.com/?utm_source=newsletter&utm_medium=email&utm_campaign=SummerBlast
In this example, you’re pushing through three pieces of information: the source of your list (newsletter), the medium over which you’re marketing (email) and the name of the individual campaign (SummerBlast). This data will be recorded in Google Analytics, no questions asked. You don’t even have to tell Google about the campaign ahead of time.
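To make the mechanics concrete, here is a minimal Python sketch of how such a tagged URL is put together and what a visit to it would report. The parameter names utm_source, utm_medium and utm_campaign are Google Analytics’ standard campaign parameters; the helper names tag_url and read_campaign are made up for this illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def tag_url(base_url, source, medium, campaign):
    """Append Google Analytics campaign parameters to a landing-page URL."""
    query = urlencode({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    # Use "&" if the landing page already has a query string.
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


def read_campaign(url):
    """Return the campaign values carried in a URL's query string."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items() if key.startswith("utm_")}


url = tag_url("http://www.example.com/", "newsletter", "email", "SummerBlast")
print(url)
print(read_campaign(url))
```

Nothing in this process involves registering the campaign anywhere: the values live entirely in the URL, which is exactly why anyone can change them.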
But what’s to stop me from visiting your site with made-up values in those same parameters?
The answer is nothing. If I were to visit a GA-tracked website with bogus query parameters attached, its Google Analytics implementation would show that someone came to the site magically, by means of a spaceship, through a campaign called Stupidhead. I did this to one of my own sites, and that is what showed up in my reports.
How dangerous is it?
The most someone can do is create a bunch of meaningless data. The effect of a single vandal acting alone would be minimal, though an extremely determined vandal could set up a sort of vandalism bot: automated software that repeatedly visits your website using falsified campaign data.
If you’re smart about your reporting, you’re probably more concerned about your converting campaigns. In order for vandals to mess with those reports, they’d have to become converting visitors. They may not have a problem with filling out a lead generation form, but if you are running an ecommerce site, these reports have a built-in protection: vandals will have to pay for the opportunity to seriously mess up your reports. (However, you’ll still need to account for the second scenario below.)
How do you fix it?
The first step is to identify vandalism. Chances are, it will be obvious — if someone has decided to vandalize your site, it’s probably because they want you to see it. So a bogus campaign name that shows up in your reports will be clear. If you’ve been smart about maintaining a convention for your campaign naming, you should have an easier time detecting falsified campaign information, though a determined vandal could spoof your own conventions.
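If you do keep a naming convention, checking a report against it can be automated. The following is only a sketch: it assumes you have exported your campaign report into rows with source, medium and campaign fields, and the allowed values listed here are hypothetical stand-ins for your own convention.

```python
# Hypothetical naming convention: the sources and media this site really uses.
KNOWN_SOURCES = {"newsletter", "google", "(direct)"}
KNOWN_MEDIUMS = {"email", "cpc", "organic", "(none)"}


def find_suspicious(rows):
    """Return report rows whose source or medium falls outside the convention."""
    return [
        row for row in rows
        if row["source"] not in KNOWN_SOURCES or row["medium"] not in KNOWN_MEDIUMS
    ]


# Hypothetical rows, shaped like an exported campaign report.
report = [
    {"source": "newsletter", "medium": "email", "campaign": "SummerBlast", "visits": 1200},
    {"source": "magic", "medium": "spaceship", "campaign": "Stupidhead", "visits": 1},
]

for row in find_suspicious(report):
    print(row["source"], row["medium"], row["campaign"])
```

A check like this only catches values outside your convention. As noted above, a vandal who copies your real source and medium names would pass it.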
Getting rid of the campaign data isn’t as easy. In fact, it’s impossible. What you can do instead is segment it out, so that you see only data from non-vandals. To do this, you need to create an advanced segment. Creating a new custom segment (using the ‘Advanced Segments’ area at the top of a report in the new Google Analytics interface), you can choose to exclude campaigns, media or sources that contain the offending terms.
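The logic of such an exclusion segment is simple enough to sketch in Python, for instance if you would rather clean an exported report than apply the segment by hand. The offending terms and the row layout below are assumptions made for illustration.

```python
import re

# Hypothetical terms spotted in vandalized campaign data.
OFFENDING_TERMS = ["stupidhead", "spaceship"]

# One case-insensitive pattern that matches any of the terms.
OFFENDING = re.compile(
    "|".join(re.escape(term) for term in OFFENDING_TERMS), re.IGNORECASE
)


def exclude_vandalism(rows):
    """Keep only rows whose source, medium and campaign contain no offending term."""
    fields = ("source", "medium", "campaign")
    return [
        row for row in rows
        if not any(OFFENDING.search(row[field]) for field in fields)
    ]


rows = [
    {"source": "newsletter", "medium": "email", "campaign": "SummerBlast", "visits": 1200},
    {"source": "magic", "medium": "Spaceship", "campaign": "Stupidhead", "visits": 40},
]

clean = exclude_vandalism(rows)
print(sum(row["visits"] for row in clean))
```

Like the segment itself, this hides the vandalized rows rather than deleting them from the underlying data.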
The problem here is that if you’re the victim of serious vandalism, such as from the bot scenario given above, you have to use this segment every time you look at a report in Google Analytics. That’s a pain.
If you’re a large organization and you’re afraid of an attack on your Google Analytics account, you may consider running more than one analytics solution, or copying the relevant data to your own datamart. The larger you are financially, the more likely such an attack is, but the more resources you’ll have to back up your data.
What should Google do?
Probably nothing. I think that the fact that you don’t have to do campaign management within Google Analytics is a plus. It cuts down the overhead — every organization should have some method to the madness of creating campaigns and campaign names, but the extra work of punching data into your web analytics tool isn’t always worth the benefit, especially for smaller organizations.
Traffic has to convert, and actually spend money if you’re an ecommerce site, in order to mess with valuable reports. So if someone really wanted to hit your site hard with this, the most they could do is become a nuisance. It won’t destroy your reporting, but it will make it harder to pull clean data.
However, since Google is gradually approaching the enterprise market with its Analytics product, its product team may consider providing two options for campaign management: both the current consume-everything version, and an internally-managed campaign list in the style of SiteCatalyst. The benefit would be for large customers, who have the resources to properly manage their campaigns, to be able to do so risk-free.
Fake Data Injection
Ok, so, if I want to, I can mess up the campaign data a bit. And if I want to mess up your revenue sources, then at least I have to pay you for the opportunity to do so, and it may not be so bad. But what if I want to mess up the rest of your data? Surely, I wouldn’t be able to do that, right?
Wrong. Unless you’ve set up filters to prevent this, Google Analytics will accept data for your Google Analytics tracking account from any server, as long as it sends the web property ID for your website.
What’s the problem?
Because Google Analytics accepts this data from anywhere, anyone can create a web page using your Google Analytics tracking code, view it, and have traffic, events or ecommerce data show up in your Google Analytics report.
For example, what happens to your reports if I create a fake transaction, using your Google Analytics tracking code, with a value of -$90 million? The fake transaction swamps the revenue chart.
The other days in that report aren’t at zero dollars. They range from $50,000 to $100,000, but you can’t see the trends because the fake transaction has skewed everything.
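A little arithmetic shows why the real trend disappears. In this sketch the daily revenue figures are invented, chosen only to sit in the $50,000 to $100,000 range described above. The -$90 million transaction is the one from the example.

```python
# Hypothetical daily revenue, in dollars, within the range described above.
daily_revenue = [62_000, 85_000, 54_000, 97_000, 71_000]

# The fake transaction lands on the third day.
FAKE_TRANSACTION = -90_000_000
with_fake = daily_revenue.copy()
with_fake[2] += FAKE_TRANSACTION

# The vertical range a chart has to cover, before and after the vandalism.
real_span = max(daily_revenue) - min(daily_revenue)
vandalized_span = max(with_fake) - min(with_fake)

# Share of the chart's height left for the genuine day-to-day variation.
share = real_span / vandalized_span
print(f"Real variation: ${real_span:,}")
print(f"Chart range after the fake transaction: ${vandalized_span:,}")
print(f"Share of chart height showing the real trend: {share:.4%}")
```

With numbers like these, the genuine variation takes up well under a tenth of a percent of the chart, so every real day looks like a flat line.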
How dangerous is it?
The damage here is greater, in that it will severely distort any reports. Someone could take an obvious step, like the example above, of pushing huge transactions into your Google Analytics account. However, the vandalism could be more subtle: one could push several smaller transactions with false source data to try to mislead you, or push events that you can’t reconcile with your order management system.
The effect of this and the campaign vandalism method I mentioned above can be compounded. Recall that you’d have to buy something to mess with revenue source data with the method above? It turns out that, if you fake realistic-looking transactions while using spoofed campaigns, you can make an even bigger mess of things.
One limiting factor here is that the visits have to be run from a server that’s connected to the Internet and can host web pages. As a result, you can use the Hostnames report in Google Analytics to identify where the fake data came from. This does mean that if someone wishes to vandalize your data in this way, they will have to do so carefully, otherwise they may be identifiable. Potential vandals would have to go to greater lengths to ensure their anonymity.
How do you fix it?
Finding the fake data could be tricky. In the case of revenue and transaction data, you probably have an order management system with which you can compare the data. However, when you’re strictly looking at Google Analytics, the fake data may not be obvious if the vandal has chosen to be sneaky about it. The first step is to check your Hostnames report. Hostnames are the domain names or IP addresses from which your website is viewed. In the new Google Analytics, you can find the list of hostnames that have been used to view your site from the Visitors > Technology > Network report.
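The comparison against an order management system can be sketched as follows. This assumes you can export both sides as a mapping from transaction ID to revenue. The IDs, amounts and the function name are hypothetical.

```python
def unmatched_transactions(ga_transactions, oms_orders, tolerance=0.01):
    """Flag analytics transactions that the order management system cannot confirm.

    Both arguments map a transaction ID to its revenue.
    """
    suspicious = {}
    for transaction_id, revenue in ga_transactions.items():
        if transaction_id not in oms_orders:
            suspicious[transaction_id] = "no matching order"
        elif abs(oms_orders[transaction_id] - revenue) > tolerance:
            suspicious[transaction_id] = "revenue mismatch"
    return suspicious


# Hypothetical exports from each system.
ga = {"T1001": 59.90, "T1002": 120.00, "T9999": -90_000_000.00}
oms = {"T1001": 59.90, "T1002": 125.00}

print(unmatched_transactions(ga, oms))
```

This only covers transactions. Fake pageviews and events have no order record to compare against, which is where the hostname check comes in.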
If Google Analytics code executes on hostnames that you don’t own, you’ll want to investigate the problem. In some cases, those hostnames will simply be search engine caches or translation services that are copying your analytics code. However, if you notice transactions or strange events and campaign data from suspicious hostnames, then you may want to look into the matter.
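As a rough illustration of that check, the sketch below counts hits per hostname and reports every hostname that is not on a list you own. The owned hostnames and the hit records are assumptions standing in for an export of your Hostnames report.

```python
from collections import Counter

# Hypothetical list of hostnames you actually operate.
OWNED_HOSTNAMES = {"example.com", "www.example.com"}


def foreign_hostnames(hits):
    """Count hits per hostname for every hostname that is not on the owned list."""
    counts = Counter(hit["hostname"] for hit in hits)
    return {host: n for host, n in counts.items() if host not in OWNED_HOSTNAMES}


# Hypothetical hit records.
hits = [
    {"hostname": "www.example.com", "type": "pageview"},
    {"hostname": "www.example.com", "type": "transaction"},
    {"hostname": "evil.example.net", "type": "transaction"},
    {"hostname": "evil.example.net", "type": "pageview"},
]

print(foreign_hostnames(hits))
```

Anything this returns deserves a look, though not all of it will be hostile. Caches and translation services show up here too.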
However, if you want to protect yourself from these attacks entirely, then you’ll need to add some filters to your Google Analytics profiles. The goal here would be to create a list of hostnames (the domain names and subdomains that you use for your website) and ONLY accept data from those hosts. You’ll probably want to set up your filters on a new profile that is a filtered version of your original. That way, you keep 100% of the data collected by your site, but also have a clean, safe copy to work with.
A filter of this kind, set to include only the hostname ‘example.com’, will only count traffic, events and transactions from that domain name. A better way might be to only include traffic from specific IP addresses, if you know the IP addresses of your website(s). This could prevent attempts to spoof your hostname and push vandalism that appears to be legitimate. In either case, be sure to keep this up-to-date! If you change your domains, subdomains or IP addresses, it may affect your filtered profile and cut out some legitimate, valuable data.
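Hostname filters in Google Analytics take a regular expression as their pattern, so it is worth testing a candidate pattern before relying on it. The sketch below uses Python’s re module for that. The pattern is an assumption about what you might enter, and Python’s regex dialect may differ in small ways from the one Google Analytics uses, so treat it as a starting point only.

```python
import re

# Candidate include pattern: example.com and any of its subdomains, nothing else.
# The anchors (^ and $) and escaped dots stop look-alike hostnames from matching.
HOSTNAME_PATTERN = re.compile(r"^([a-z0-9-]+\.)*example\.com$")


def is_accepted(hostname):
    """Return True if the hostname would pass the include filter."""
    return HOSTNAME_PATTERN.match(hostname.lower()) is not None


candidates = [
    "example.com",
    "www.example.com",
    "shop.example.com",
    "evil-example.com",
    "example.com.evil.net",
    "webcache.googleusercontent.com",
]

for host in candidates:
    print(host, is_accepted(host))
```

An unanchored pattern such as a bare example.com would also accept the two look-alike hostnames in this list, which is the main reason to test before saving the filter.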
One quick note: Sometimes you’ll see additional domains in your list that are from hostnames that have a legitimate purpose. For instance, Google will serve up your site when it shows either a cached version or a translated version — in both cases, the hostname includes ‘googleusercontent.com’. Bing also shows page caching on cc.bingj.com. You may want to exclude data from caches or translated versions of your pages, but if you’d prefer to see all of it, include data from those domains as well.
What should Google do?
Google should provide these filters as standard options. When creating a profile in Google Analytics, you should be able to specify what hostnames and/or IP addresses you’re willing to accept data from, with an on/off switch for accepting data from other sources. Making this option more prominent may help businesses be aware of the issue and protect themselves from day one.
At any rate, rumblings of a paid, enterprise-focused Google Analytics can be heard from the horizon. If a service level agreement becomes available to some Google Analytics customers, data integrity and security will be chief concerns.
Until next time,
UPDATE: Just a quick note of clarification. Although I focused this post on Google Analytics, I should clarify that GA is not the only tool vulnerable to this — especially the second method of vandalism. The first method is the easiest thing, and that’s more specific to GA. This post came out of an internal discussion about the campaign vandalism. To be clear, this kind of thing isn’t particularly common, and as Emer mentions in a comment below, it tends to be a result of negligence when people copy code or designs, rather than a malicious attempt. So, there’s no need for immediate concern for most people, but I think it’s worth being aware of what you can do to protect yourself from this inherent vulnerability in most analytics tools.