Peak performance: How retailers used Google Cloud during Black Friday/Cyber MondayPeak performance: How retailers used Google Cloud during Black Friday/Cyber MondaySr. Director, Cloud Support

At Google Cloud, we work with businesses in a range of industries, and we’ve seen nearly every business experience peak events when their online traffic skyrockets. For retailers, their peak events are Black Friday and Cyber Monday (or BFCM)—the period right after Thanksgiving in the U.S., when holiday shopping starts. The weekend kicks off the all-important holiday shopping season of November and December, when an estimated 20% of all annual retail sales occur.

During an average day, online retail sales in the U.S. total about $1.4 billion, CNET reports. In contrast, on Black Friday 2018, U.S. online sales totaled $6.22 billion (up 24% from 2017). Cyber Monday 2018 sales surged to $7.9 billion (up 19% from 2017)—the biggest online sales day ever in the U.S., according to Adobe Analytics.  

Traffic to retailers’ mobile and shopping apps surges to levels unmatched during the rest of the year, and availability or scalability issues can result in millions of dollars of lost sales. Every year, there are well-publicized retail website crashes, so avoiding downtime—along with the accompanying reputation damage, unhappy customers and stressed, overworked IT teams—is particularly important for retailers.

We know that a solid technology infrastructure is the foundation for retailers to stay ahead of demand and succeed during this busy season. Beyond that, though, support for that infrastructure is essential. Support isn’t just activated if something goes wrong. Support for an event like Black Friday and Cyber Monday involves preparation well ahead of time, and includes testing, architecture reviews, capacity planning, operational drills, and war rooms during the event itself. We took a prescriptive approach to BFCM support, setting expectations and ownership early (more than six months ahead), to understand what each retail customer needed, both on their side and from our team.  

We’ll go through the steps that helped our retail customers have a fruitful and disaster-free season. These steps can generally help you prepare for your own peak event. We’ll also describe how one large-scale retail platform in particular—Shopify—had a successful BFCM using Google Cloud.  

Preparing to support retailers on Black Friday/Cyber Monday

We started planning for Black Friday and Cyber Monday for our retail customers in the spring of 2018 to align with their typical preparation timeline. We formed a task force composed of representatives from Google Cloud’s Professional Services, Customer Engineering, Support, Customer Reliability Engineering (CRE), and Product and Engineering teams. We met regularly to strategize, develop tactics, and execute on those tactics with the goal of making sure Google team members and our GCP retail customers were well-prepared.

We focused on a few key technology areas where planning could help prevent any issues.

1. Early capacity planning

As early as May 2018, our account teams began reaching out to GCP retail customers. We discussed high-level planning, such as their particular holiday shopping objectives and the infrastructure capacity they might need to meet those goals.

We worked closely with retailers to review their architectures and advise on techniques to forecast and plan for increases in capacity before Black Friday, since scalability is essential when planning for traffic spikes. We conducted tests across teams and services, and stress-tested systems to uncover any constraints or weaknesses and remediate as needed. Those tailored preparations paid off across the board. With GCP capacity status firmly green—available—throughout Black Friday and Cyber Monday, shoppers visiting our retail customers’ sites could make their purchases without running into a slow or unresponsive site.

2. Reliability testing

Identifying potential reliability issues in a “pre-mortem” (an important component of CRE) was another preemptive step we took. Early on, our CRE team partnered with our retail customers to analyze the reliability of their infrastructures, and run through tabletop exercises to see how well-prepared the customer was in the face of a failure. In some cases, the Professional Services team helped perform load testing to make sure retailers’ platforms could handle expected levels of peak traffic, and in others we encouraged regular load testing and evaluation. And given how important mobile commerce has become, we also tested the performance and reliability of customers’ mobile apps. We also employed Apigee’s API monitoring tools to ensure API stability. We’ve seen APIs become more important in retail technology, since they allow more flexible, microservice-based e-commerce sites.

3. Operational war rooms

“What could possibly go wrong?”

That’s the million-dollar question to ask before a big IT event. We got together with our retail customers’ IT and engineering teams to explore and test for possible worst-case scenarios, like an entire site crash. We created a central Black Friday/Cyber Monday war room staffed with senior-level, experienced Googlers from the Professional Services, Support, and Site Reliability Engineering (SRE) teams. This team of first responders was prepared to use real-time communications to stay connected and address any problems as soon as they arose. This was in addition to understanding customer and vendor integrations and making sure escalation paths were defined ahead of time, so that customer expectations were clear for various channels.

During that weekend, we doubled the number of on-call support staff available to retail customers. In some cases, we placed account teams on-site at GCP and Apigee retail customer locations to help as needed. We monitored whether any retail customers were starting to have reliability or latency problems. If something needed to be triaged, the war room team kicked into action, tackling issues and advising on next steps. The Google war room team also had direct, open access to Google engineers and executives for additional support.

Apigee team members kept a close eye on API traffic during the Black Friday period. The number of API calls for Apigee’s customers (excluding those who host the platform on-premises) grew 95% compared to the same span of time in 2017. Peak API traffic running through Apigee more than doubled, from 48,000 transactions per second (TPS) to 108,000 TPS this year, and the platform remained 99.999% available.

How retailers sailed through Black Friday and Cyber Monday

One of our retail partners, Shopify, is an e-commerce platform supporting more than 600,000 independent retailers. The complexity of managing all those storefronts makes predicting holiday site traffic and sales spikes even more challenging. Shopify provides a platform with 99.98% uptime, and calls BFCM their annual “World Cup” event.