We’re often asked how to do A/B testing with Gatsby. Like a good senior developer, we usually answer that question with another question: what are you trying to achieve?
A/B testing is one tool in a broader marketing analytics or digital insights toolkit. Occasionally, it’s the best tool for the job. Often, though, other types of software and workflows can provide the same functionality.
In addition, out-of-the-box tools for A/B testing can degrade performance. We care deeply about performance, so we’ve outlined a specific method we recommend that will yield insights while preserving Gatsby’s page load speed.
Before diving into the recommendations, it’s important to understand how we got here.
Origins
In the late 2000s and early 2010s, American popular media brimmed with excitement about data. Books like Freakonomics and Moneyball, starring statistics nerds, topped the best-seller lists and were turned into movies. In tech company backends, the rise of “big data” realigned infrastructure teams, creating dedicated data teams and the role of the data scientist.
On the web, the concept of A/B testing — creating and testing multiple variations of a page in order to find the best-performing one — has been around since the turn of the century, at least in the largest consumer technology companies. Google famously tested 41 shades of blue on their homepage to find the one that converted best.
In 2007, a Google product manager joined the fledgling campaign of then-presidential candidate Barack Obama and introduced them to A/B testing. Turquoise-tinted Obama family photos were tested against black-and-white versions; “Learn More” was tested against “Sign Up Now”. It worked. One variant frequently outperformed another by 10% or 20%, and sometimes those improvements stacked on top of each other. Looking back afterwards, the campaign estimated that around one-third of its email signups were attributable to A/B testing.
In 2010, after Obama’s victory, that ex-Google employee founded a company to popularize this idea more broadly. That company, Optimizely, was built around a product allowing websites to run A/B tests.
Growth, and challenges
Riding presidential star power, and hitting the mood of the moment, Optimizely captured a base of early adopters — household-name TV networks and newspapers in the media world, plus prominent consumer brands and technology companies. By 2012, Optimizely had a set of homepage customer logos most startups would kill for — Fox, NBC, Disney, the Guardian….
For these companies — especially the media ones — content was their business. They changed their headlines multiple times per day as stories unfolded. They published dozens of new stories every day, and compelling headlines largely determined revenue. In addition, as national and global media brands, they received enough daily traffic for A/B tests to yield winners in hours.
But content and media firms weren’t the only ones to use A/B testing. The allure of data-driven decision making pulled the software into all sorts of firms — consumer companies, B2B companies, and so on. These companies didn’t have the kind of traffic media websites did, which caused problems — slower decision-making and more incorrect conclusions.
And with smaller benefits, A/B testing’s cost — flickering and delayed page loads — loomed larger.
Flickering
Many users visiting pages with A/B tests experienced a visual “flicker” during page load.
Formally known as a “Flash of Original Content”, flickers are typically caused by the sequence in which the browser loads different content types. First, the browser loads the HTML, including the page content, and renders the page — original headline and all. A few hundred milliseconds later, Optimizely’s JavaScript snippet loads and rewrites the headline.
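To make the mechanics concrete, here is a rough sketch of that pattern. This is an illustration only, not any vendor’s actual snippet; the headline selector and variant logic are placeholders.

```js
// Illustration of why flicker happens: by the time a late-loading testing
// script runs code like this, the original headline has already been painted.
const assignedVariant = Math.random() < 0.5 ? "control" : "test"
const headline = document.querySelector("h1")

if (headline && assignedVariant === "test") {
  // This rewrite happens a few hundred milliseconds after first paint;
  // the visible swap is what users perceive as a flicker.
  headline.textContent = "New headline being tested"
}
```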
The flickering phenomenon was and is quite disorienting for users, often leading to wholesale page abandonment.
Delaying page loads
To avoid flickers, A/B testing tools sometimes simply blocked the whole page load until their JavaScript initialized, but this added an extra second or two of delay. In addition, content teams often forget to clean up and remove finished A/B tests, so testing snippets keep running on pages where no test is active — further hurting performance.
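As a simplified sketch of that “page hiding” pattern (again, not any specific vendor’s snippet; the `ab-script-ready` event name is made up for the example):

```js
// Hide the page immediately, before anything is painted...
document.documentElement.style.opacity = "0"

function showPage() {
  // ...and reveal it once the testing script has applied its changes.
  document.documentElement.style.opacity = ""
}

// "ab-script-ready" is a hypothetical signal from the testing script. The
// timeout ensures a slow script can't keep the page hidden indefinitely,
// but that timeout is exactly the extra delay users end up feeling.
window.addEventListener("ab-script-ready", showPage, { once: true })
setTimeout(showPage, 2000)
```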
Slower decision-making
Digital media organizations have a high content change frequency — headlines often change multiple times a day! A/B testing in that world — or in email campaigns, where you might send out multiple emails per week — makes a lot of sense.
But if you’re an e-commerce company, you probably change headlines on the order of months, not days. At the same time, you probably need weeks to accumulate the tens of thousands of visitors required for statistically significant results.
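A rough back-of-envelope calculation shows why. Using Lehr’s rule of thumb (80% power at a 5% significance level), and assuming (purely for illustration) a 2% baseline conversion rate and a hoped-for lift to 2.5%:

```js
// Approximate visitors needed per variant, via Lehr's rule: n ≈ 16 · p(1-p) / δ²
const baseline = 0.02 // assumed baseline conversion rate
const target = 0.025 // assumed conversion rate we hope the variant achieves
const pBar = (baseline + target) / 2
const perVariant = Math.ceil((16 * pBar * (1 - pBar)) / (baseline - target) ** 2)

console.log(perVariant) // ≈ 14,000 visitors per variant, roughly 28,000 in total
```

At a few thousand visitors a day, that is weeks of traffic for a single test.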
And if you only change headlines occasionally, running tests for weeks before making a decision distracts you from other priorities and campaigns. Many marketers would rather ship a good headline today than a great one next month.
Incorrect conclusions
A/B testing is based on statistics. Statistics is complicated. It’s easy to get wrong, and data still needs people to interpret it. (WPEngine founder Jason Cohen provides perhaps the most humorous example here, sharing the results of an “A/A” test he ran in 2012, where both “variants” were identical and any apparent winner was pure noise.)
Worse, with weeks-long waits for results, impatient teams often threw A/B tests at the wall until something showed a result. This led to incorrect conclusions driven by false positives — in other words, the test said you had a winner, but it was actually just random chance.
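The arithmetic behind that failure mode is straightforward. At a standard 5% significance level, running many tests where nothing actually differs still produces “winners” fairly often. The numbers below are illustrative:

```js
// Chance of at least one false positive across several independent tests,
// each run at a 5% significance level, when there is no real effect at all
const tests = 10
const falsePositiveChance = 1 - 0.95 ** tests

console.log(falsePositiveChance.toFixed(2)) // ≈ 0.40, a 40% chance of a spurious "winner"
```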
Alternatives emerge
Due to these challenges, usage of and interest in A/B testing started to plateau around 2014. Meanwhile, marketing analytics as a field was exploding. The number of marketing analytics tools included in annual industry surveys grew from 1,000 at the beginning of 2015 to over 5,000 three years later.
A number of different categories of software began emerging to measure user behavior on websites and web applications:
- Session recording tools like FullStory, instead of focusing on just a few key decisions, capture everything your users do so you can analyze it after the fact. They generate heatmaps, clickmaps, scrollmaps, rage-click reports, and so on.
- Product analytics tools like Heap take a similar approach, but focus more on identifying key events and constructing funnels — for example, measuring how often site visitors who take a specific action go on to purchase within a 28-day window. In addition, product analytics tools tend to give users fairly rich ways to query their data and find anomalously high- or low-performing items.
- Feature flagging tools like LaunchDarkly let SaaS product teams gradually ramp features up from 5% of users to 10%, 20%, and 50% while observing the impact on user actions and error rates.
As these products matured in the late 2010s, feature sets converged. Feature flagging tools added A/B testing. A/B testing tools added session recording and product analytics. Product analytics tools added session recording, and vice versa.
For a typical demand generation or growth marketing user, this meant that where there used to be just A/B testing, there were now a number of ways to test messaging.
Certainly, running an A/B test on the main headline of key pages is one method. But other ways exist, too.
Running AdWords campaigns against different landing pages and comparing conversion rates. Testing messaging and phrasing via email split tests. Just changing the headline and monitoring before/after conversion rates. Carefully observing user behavior via FullStory or Heap. And so on.
A/B testing with Gatsby
Where does that leave us today?
By the late 2010s, it became clear that Optimizely, despite raising a lot of venture funding, was struggling. The company abandoned its free tier, moved upmarket to sell to enterprises, and was acquired by the enterprise CMS company Episerver, which then renamed itself Optimizely. Users largely moved to other tools, most prominently Google Optimize and VWO.
Unfortunately, these tools don’t work well on the Jamstack. They do work, technically — but they tend to negate the performance benefits of Gatsby and other Jamstack frameworks. We experimented with VWO on our homepage earlier this year, and the automatic Lighthouse reports generated in Gatsby Cloud showed that it reduced our Lighthouse performance score by 10 points. As a result, we removed it.
Other approaches are emerging. One approach native to edge-based HTML file serving is to generate multiple versions of a page and serve one of them from the CDN depending on the user. This can be configured in one of two ways.
First, branch-based edge A/B testing, where some users are served a site compiled from the main branch, and some users are served a site compiled from an alternative branch.
This was an early approach in the Jamstack world, and it has some fans, but it faces a couple of challenges. One, it introduces technical debt, since development teams have to maintain two “main” code branches for the duration of the test. Two, if the source of truth for content is a headless CMS rather than version-controlled code, it can be unintuitive to store and refer to the variant.
Second, page-based edge A/B testing — generating multiple versions of individual pages on the edge depending on A/B test configuration. This is a promising approach, and there are two ways to do it.
First, newer vendors like Uniform and Outsmartly enable both A/B testing and content personalization at the edge. Right now, these approaches are only available at enterprise-level price points, out of reach for most teams.
Second, teams can implement this themselves by creating two variants, randomly assigning users to a variant, and persisting that choice on the client using cookies or localStorage. While this requires a bit of setup, it can be done without adding any additional vendors. We used this approach earlier this quarter to relaunch our Starters page.
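As a minimal sketch of what that can look like in a Gatsby/React component (the hook name, experiment key, and headlines are placeholders of ours, not the actual Starters page implementation):

```jsx
import * as React from "react"

// Assign the visitor to a variant once, persist it in localStorage, and reuse
// the same assignment on every subsequent visit.
function useAbVariant(experimentKey, variants = ["control", "test"]) {
  const [variant, setVariant] = React.useState(null)

  React.useEffect(() => {
    const storageKey = `ab:${experimentKey}`
    let chosen = window.localStorage.getItem(storageKey)
    if (!chosen || !variants.includes(chosen)) {
      chosen = variants[Math.floor(Math.random() * variants.length)]
      window.localStorage.setItem(storageKey, chosen)
    }
    setVariant(chosen)
    // In practice you'd also report `chosen` to your analytics tool here,
    // so conversion rates can be segmented by variant.
  }, [experimentKey])

  return variant
}

export default function Hero() {
  const variant = useAbVariant("hero-headline")

  // Render nothing until the variant is known, to avoid flashing the wrong
  // headline during hydration.
  if (!variant) return null

  return (
    <h1>
      {variant === "test"
        ? "Build blazing fast websites"
        : "Build websites with Gatsby"}
    </h1>
  )
}
```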
We’ll see how this space evolves more over the coming years.
Alternate approaches to actionable analytics
One trend we’ve noticed is teams using Gatsby adopting alternate approaches to get the user insight they need to further boost conversions. One example is the digital agency Novvum, which uses Heap and Gatsby together to boost conversion rates for their client, the men’s jewelry brand Jaxxon.
Novvum’s workflow is to add specific content modules into their e-commerce pages, but not in the main flow — “below the fold” of the page, in a clickable modal, and so on.
They let the module sit there for a few weeks, then compare 28-day conversion-to-purchase rates for users who completed the action against users who did not.
Jaxxon’s dashboard in Heap: purchase conversion rates for users who clicked an accordion element on a product page vs users who didn’t.
Jaxxon’s normal visitor conversion rate is 2%. Using Heap, they can compare conversion rates for users who completed an action against users who did not, measuring the impact of that module on conversion.
Because anyone who clicks on an out-of-the-way content module is likely a “high-intent user”, they don’t use 2% as a threshold — instead, they use a threshold of around 6-8%. One content module they shipped — the Jaxxon Fair Pricing Chart — gave a conversion rate of 11%, so they immediately moved it above the fold and saw an increase in conversions.
Jaxxon Fair Pricing module
Another example is a quiz in a modal to help visitors find the right jewelry for them. The quiz is controlled by a third party provider. After finding that adding it lifted conversion, they moved the modal link to a more prominent spot on their homepage, immediately below the fold.
Jaxxon Style quiz homepage banner (background) and modal (foreground)
Conclusion
A/B testing is a useful tool to get data on how different messaging converts. But if you are performance-conscious and value user experience — and we all should be — there may be better techniques to gain measurable insights.
A/B testing has its place, like any tool in your stack, but you should be deliberate about using it, and carefully assess whether it’s the right tool for the job.