Azure wobbles

Hello. I'm doing a lot of work with Azure these days, and I think it's worth sharing my experiences with it. It's something of a love/hate thing for me. Azure has shedloads of services and configuration options that are pretty impressive. But... it's sometimes lacking in the one thing you need most from a PaaS: stability.

About my app

So yeah, I've got this web application. It's a fairly typical ASP.NET MVC app: Dapper as my (micro-)ORM, ReactJS on the front end, and SQL Server for the database - basically a pretty standard stack.

We've got a lot of users (tens of thousands), but they don't use the app very often. That works out at roughly 10,000+ page views a day, and generally no more than 20 people on the site at any given time.

Hosting on Azure

There's an enormous range of options for hosting on Azure, depending on your needs and budget. For some applications, downtime is something you must guard against at all costs. If that's the case, you can use things like Azure Traffic Manager, and SQL Server geo-replication with secondary servers to cover major data centre outages. On top of that, a web app on Azure has plenty of auto-scale options to help with spiky traffic. All very cool stuff indeed...

For me, I'm currently trying to strike a balance between performance, resilience, and cost. I've got a well-specced single-instance web app with auto-scale, and my database runs on a decent "standard" tier with geo-replication configured to a fail-over database.

For the most part...

Usually, things are great. Performance is good, and I've got a fairly solid setup with auto-scale in case of the odd peak. Azure WebJobs do a great job of running background processes, I have Azure Redis Cache for, well, caching, and my database runs happily on SQL Azure. Deployment is a joy thanks to git deployment and web app deployment slots.
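As a flavour of the caching piece, this is roughly the cache-aside pattern I mean: try Redis first, fall back to SQL via Dapper, then populate the cache. It's a minimal sketch rather than my actual code - the Product class, ProductReader, key format, and connection strings are just placeholders - and it assumes the StackExchange.Redis and Newtonsoft.Json packages.

    using System;
    using System.Data.SqlClient;
    using System.Linq;
    using Dapper;
    using Newtonsoft.Json;
    using StackExchange.Redis;

    public class Product
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public class ProductReader
    {
        // Share one multiplexer for the app's lifetime - creating one per request is expensive.
        private static readonly Lazy<ConnectionMultiplexer> Redis = new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(
                "<your-cache>.redis.cache.windows.net:6380,ssl=True,password=<key>,abortConnect=False"));

        private readonly string _sqlConnectionString;

        public ProductReader(string sqlConnectionString)
        {
            _sqlConnectionString = sqlConnectionString;
        }

        public Product GetProduct(int id)
        {
            IDatabase cache = Redis.Value.GetDatabase();
            string key = "product:" + id;

            // 1. Try the cache first.
            RedisValue cached = cache.StringGet(key);
            if (cached.HasValue)
                return JsonConvert.DeserializeObject<Product>(cached);

            // 2. Cache miss - read from SQL Azure via Dapper.
            Product product;
            using (var conn = new SqlConnection(_sqlConnectionString))
            {
                product = conn.Query<Product>(
                    "SELECT Id, Name FROM Products WHERE Id = @id", new { id }).SingleOrDefault();
            }

            // 3. Populate the cache with a short TTL so stale entries age out on their own.
            if (product != null)
                cache.StringSet(key, JsonConvert.SerializeObject(product), TimeSpan.FromMinutes(10));

            return product;
        }
    }

The ConnectionMultiplexer is deliberately shared via a static Lazy<T>; spinning one up per request is slow and is a common cause of Redis timeouts.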

But then things like this happen...

[Image: New Relic monitoring chart showing a sudden spike]

Eek! That's a bit of a spike. This is from New Relic's awesome monitoring service. OK, so what happened? We must have had a massive peak of traffic, or maybe some long-running actions that blocked requests, right?

No. I have no idea. There was no peak in traffic, no non-standard operations, nothing that should raise suspicion. But Azure had what I've coined "an Azure wobble" - for 10 minutes, my app was pretty much out. Nothing was reported in the Azure status portal when it happened. In all likelihood, Azure was doing some maintenance in the data centre that caused this temporary issue. When you host in a cloud PaaS like Azure you have to accept that things like this will happen, and plan for them as best you can. But... these events seem to happen more often than you'd expect on Azure - sometimes I see something like this every day. Let's think about the mitigation options:

  • Auto-scale? Well that was active, but didn't trip. I've configured it by CPU load though, so I'm going to experiment with response-time-based auto-scale. Although - if the outage was localised to all web apps running in one data centre region, auto-scale might not make a difference.
  • Fail-over to another web app hosted in another data centre. This is probably the only thing that would have prevented the issue (and all the others like it). I could use Azure Traffic Manager to configure a failover and route all traffic to another instance in another data centre (it needs a health-check endpoint to probe - see the sketch after this list). But then what about the database - should the fail-over instance point at the same one (latency!), or should it have its own active database kept in sync with replication... argh! As you can imagine, this gets tricky quickly, and cost becomes a major factor.
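If I do go down the Traffic Manager route, each web app needs a URL that Traffic Manager's endpoint monitoring can probe: a 200 means healthy, anything else and traffic gets routed away. Something along these lines is what I have in mind - a hypothetical MVC controller (the HealthController name and the "Default" connection-string key are made up) that checks the app can actually reach its database:

    using System;
    using System.Configuration;
    using System.Data.SqlClient;
    using System.Web.Mvc;

    public class HealthController : Controller
    {
        // Traffic Manager probes this action; any non-200 response marks the endpoint
        // as degraded and traffic is routed to the other data centre.
        [AllowAnonymous]
        public ActionResult Index()
        {
            try
            {
                var connectionString = ConfigurationManager.ConnectionStrings["Default"].ConnectionString;

                // Cheap end-to-end check: can the app open a connection and run a trivial query?
                using (var conn = new SqlConnection(connectionString))
                {
                    conn.Open();
                    using (var cmd = new SqlCommand("SELECT 1", conn))
                    {
                        cmd.ExecuteScalar();
                    }
                }

                return new HttpStatusCodeResult(200, "Healthy");
            }
            catch (Exception)
            {
                return new HttpStatusCodeResult(503, "Unhealthy");
            }
        }
    }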

Closing thoughts

As I soon found out, on Azure you need to consider and plan for failure everywhere - from the Azure configuration and setup all the way down to the design and architecture of your code. Everything needs to be built with resilience in mind. On the code side, you need to think about retry strategies for transient faults, and I found I needed to make sure all my web job code was idempotent (there's a rough sketch of both below).
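To make that a bit more concrete, here's the rough shape of it: a small hand-rolled retry helper with exponential back-off for transient SQL failures, and a web job handler that records what it has processed so a redelivered message becomes a no-op. All the names here (Transient, SendEmailJob, ProcessedMessages, and so on) are made up for illustration, and in practice a library like Polly can handle the retry policy for you.

    using System;
    using System.Data.SqlClient;
    using System.Threading;
    using Dapper;

    public static class Transient
    {
        // Retry an action a few times with exponential back-off when SQL Azure
        // throws a transient error (dropped connection, throttling, and so on).
        public static void Retry(Action action, int maxAttempts = 3)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    action();
                    return;
                }
                catch (SqlException) when (attempt < maxAttempts)
                {
                    // Back off: 1s, 2s, 4s...
                    Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
                }
            }
        }
    }

    public class SendEmailJob
    {
        private readonly string _connectionString;

        public SendEmailJob(string connectionString)
        {
            _connectionString = connectionString;
        }

        // Queue messages can be delivered more than once, so the job has to be idempotent:
        // processing the same message id twice must have the same effect as processing it once.
        public void Process(Guid messageId, string recipient)
        {
            Transient.Retry(() =>
            {
                using (var conn = new SqlConnection(_connectionString))
                {
                    conn.Open();

                    // Already handled? Then a redelivery is a no-op.
                    bool alreadyDone = conn.ExecuteScalar<int>(
                        "SELECT COUNT(1) FROM ProcessedMessages WHERE MessageId = @messageId",
                        new { messageId }) > 0;
                    if (alreadyDone) return;

                    // ... do the actual work here (send the email to `recipient`) ...

                    // Record the fact that the work was done.
                    conn.Execute(
                        "INSERT INTO ProcessedMessages (MessageId, ProcessedAtUtc) VALUES (@messageId, @now)",
                        new { messageId, now = DateTime.UtcNow });
                }
            });
        }
    }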

I knew that, compared to in-house hosting, moving to a behemoth of a PaaS like Azure meant expecting the odd blip of downtime (all within the SLA). However, I didn't quite expect them at the frequency I got them. I've no experience of other providers like AWS, but I hear a lot of similar grumbles from other Azure customers. It's still something of a learning curve for me, so I'll update as and when I find a solution.

PS While I was writing this, Azure had a mini-wobble and I had an outage for 5 mins!
