Ezora is a Cloud BI product that delivers Financial Control, drives Business Performance and supports Strategic Decision-Making.
Last year we migrated our entire infrastructure from a co-located setup in a Dutch datacentre to Amazon Web Services. Now that the dust has settled, I thought I’d share some insights on the migration process and what we learned.
Our Existing Infrastructure
When we put our previous infrastructure in place in 2010, AWS did exist, but it was still at a relatively early stage and lacked critical security features like VPCs. Convincing our customers that their data was going to be stored “in the cloud” was also a significant barrier back then.
By 2015, our infrastructure was in need of a major overhaul, and the improvements in IaaS over the intervening years made it the obvious choice. While our previous data centre did offer a “cloud” product, switching to it would still have been a significant undertaking, and we felt that if we were going to make a switch it should be the right one.
When we started to evaluate the different IaaS providers, it became clear that there was really only one choice. As I’ve discussed before, choosing AWS as our infrastructure partner became an easy decision.
Defining a New Architecture
After the choice of IaaS provider, we started to think about our architecture on AWS, and how it would change from our traditional setup. A first step was to go to one of the AWS roadshow events that they run on a regular basis in major cities. This gave us an overview of how AWS works and the different components we would need to consider.
After some more research and planning, we had initial discussions with an AWS solutions architect who helped us to shape the plan for our architecture. This continued to evolve over time – particularly during the work with our implementation partner – until we had a blueprint that we were happy to start implementing.
There are numerous articles & whitepapers on how to design your application to get the most out of AWS, so I won’t repeat them here. However, if you are planning a migration, you should take the time to read these, as there is a different mindset required.
One of the key things to understand is that instances (servers) should be treated as stateless and failure should be expected. If you can design with that in mind from the start it will make your life a lot easier.
Choosing an Implementation Partner
Our next decision was whether to implement the solution ourselves or to get help with the implementation. There’s no right answer to this; it will depend on your team size and the amount of resources you can allocate to the project. One important consideration, though, is to choose a partner that will give you the level of access you need during the project. Given that you will ultimately be responsible for your infrastructure, you don’t want a partner that implements the solution without involving you in key decisions along the way.
We ended up choosing a local implementation partner called TerraAlto, as we felt they gave us the right level of involvement with the project, while taking care of the lower-level fundamentals of AWS implementation.
Our final architecture utilised Amazon’s Elastic Load Balancers combined with Auto Scaling Groups to balance the traffic across multiple instances and automate the creation/termination of instances across multiple Availability Zones. We also split out a number of components of our infrastructure to make it more modular and fault-tolerant. This provides us with redundancy across multiple geographic locations, while also having the ability to scale horizontally with ease. At the data-layer, we decided to use RDS rather than trying to manage MySQL on custom instances.
Deployment & Elastic Beanstalk
Initially we had looked at using Elastic Beanstalk to manage our deployment, ELBs, ASGs, logging & monitoring. When we looked at it in more detail, though, the deployment process through Elastic Beanstalk was pretty restrictive. Specifically, deployment of an application version resulted in the staged termination of the instances running that application, and their replacement with new instances running the updated codebase.
For us this wasn’t an option: we needed more control over how our application was deployed and what happened at the different stages. We also needed to be able to deploy without terminating instances in certain parts of the product. We ended up with a custom approach, which involved shell scripts for setting up our base AMIs, user-data scripts that run automatically whenever a new instance comes online, and Fabric scripts to automate deployment across multiple instances.
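To illustrate the user-data approach, here’s a minimal sketch of how a per-role boot script might be assembled. The role names, repo URL and paths are hypothetical placeholders, not our actual scripts:

```python
# Build a user-data shell script for a new instance, based on its role.
# Roles, repo URL and paths below are illustrative placeholders only.

BOOTSTRAP = {
    "web": ["service nginx start", "service php-fpm start"],
    "worker": ["service queue-runner start"],
}

def build_user_data(role: str, git_ref: str) -> str:
    """Return the shell script run automatically when an instance boots."""
    if role not in BOOTSTRAP:
        raise ValueError("unknown role: %s" % role)
    lines = [
        "#!/bin/bash",
        "set -e",
        # Pull the application code at the version we want deployed.
        "git clone --branch %s https://example.com/app.git /opt/app" % git_ref,
    ]
    lines.extend(BOOTSTRAP[role])
    return "\n".join(lines)
```

Because the script is generated rather than hand-edited, every instance of a given role comes up identically, which is what makes treating instances as disposable practical.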
Version Controlling Your Infrastructure
A nice advantage of this is that it allowed us to start versioning our infrastructure. This started very simply – we just created a git repo with all of the relevant setup & configuration scripts, along with any relevant documentation etc. This has evolved over time, and we now have relatively detailed information on how to recreate our infrastructure from scratch. Combining this with CloudFormation allows you to take this to another level, where you can template and automate the creation of almost all AWS services.
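As a minimal illustration of the CloudFormation side, a template is just a versioned JSON (or YAML) document describing resources, so it fits naturally into the same git repo. The resource, instance type and AMI ID below are made-up placeholders:

```python
import json

# A minimal CloudFormation template sketch -- the instance type and
# AMI ID are placeholders, not real values.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Example app server (illustrative only)",
    "Resources": {
        "AppServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {"InstanceType": "t2.micro", "ImageId": "ami-12345678"},
        }
    },
}

def template_json(template: dict) -> str:
    """Serialise the template so it can be committed and diffed in git."""
    for logical_id, resource in template["Resources"].items():
        # Every CloudFormation resource must declare a Type.
        assert "Type" in resource, "%s is missing a Type" % logical_id
    return json.dumps(template, indent=2, sort_keys=True)
```

Serialising with sorted keys keeps diffs stable, so infrastructure changes show up in code review just like application changes.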
The Data Layer & Amazon Aurora
As mentioned above, we decided to use RDS for our data-layer. From the outset, we had planned on running MySQL as the engine on all of our RDS instances… however, during our implementation Amazon Aurora became publicly available. Aurora is a MySQL-compatible RDBMS which was designed from the ground up to run on AWS, and has a number of benefits over MySQL running on RDS.
We had already done a lot of testing on MySQL-RDS, but decided that as we hadn’t gone live yet, it was a great opportunity to test switching to Aurora. The switch itself was pretty straightforward, and didn’t throw up any significant issues. We didn’t notice a significant performance boost – certainly not the 5x that Amazon claim – but it did give us the other benefits of Aurora over MySQL, such as improved redundancy & durability.
While we have found Aurora very good, there have been a few issues worth knowing about – particularly if you’re planning on switching to it. Firstly, while Aurora does replicate your data six ways across three Availability Zones, failure of the primary instance can lead to up to 10 minutes of downtime while it is rebuilt. If you’re running Aurora replicas this is cut to 60–120 seconds, but it’s something to be aware of, and it isn’t very obvious from reading the documentation initially.
Secondly, Aurora patching happens occasionally (every month or two) and results in a restart of your instance. Apparently this is something they’re working on at AWS, but currently there is no workaround – it won’t fail over to an Aurora replica if you’re running one, for example.
Getting Our Application Ready for AWS
The next step for us was to get our application ready. If you’re migrating from a traditional architecture, you’ll probably have some work to get your application “cloud ready”. This will typically involve making sure that your application can run on stateless instances, and a switch in thinking to assume that local storage is ephemeral. For us, this meant shifting any local file storage to S3, using Elasticache for sessions & caching, and updating our queue and cron services… among other changes.
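To sketch the kind of change involved in moving file storage off local disk (class and bucket names here are illustrative), putting storage behind a small interface makes the local-vs-S3 decision a configuration detail rather than something scattered through the application:

```python
import os
import tempfile

class FileStore:
    """Abstract file storage, so the application never assumes local disk."""
    def put(self, key: str, data: bytes) -> None:
        raise NotImplementedError
    def get(self, key: str) -> bytes:
        raise NotImplementedError

class LocalStore(FileStore):
    """Disk-backed store -- fine on one server, not on stateless instances."""
    def __init__(self, root: str):
        self.root = root
    def put(self, key: str, data: bytes) -> None:
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)
    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

# An S3-backed store would implement the same interface, e.g. via
#   boto3.client("s3").put_object(Bucket="my-bucket", Key=key, Body=data)
# so switching backends doesn't touch application code.
```

With this shape, instances stay stateless: any instance can serve any request because no file lives only on one machine’s disk.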
The Migration Process
One thing we wanted to avoid with the migration process was having to switch everything over in one go. Given the number of clients involved, this just seemed like a recipe for disaster.
To avoid this, we came up with a system that allowed us to migrate clients individually. This involved registering some extra SSL certs, some updates to our admin application to manage & track the state of clients and export/import data, and some extra logic in our application to handle redirects.
This meant that our support team could handle the migration on a client-by-client basis, allowing us to start by migrating and testing our test clients. Once we were happy that everything was fully working, we were then able to plan out the migration of all live clients over a two-week period.
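The tracking logic for this kind of client-by-client migration can be as simple as a small state machine. A minimal sketch – the state names are illustrative, not our actual schema:

```python
# Per-client migration states, in the order a migration moves through them.
# The names are illustrative -- a sketch of the approach, not a real schema.
TRANSITIONS = {
    "on_old_infra": {"exporting"},
    "exporting": {"importing", "on_old_infra"},  # can roll back
    "importing": {"testing"},
    "testing": {"live_on_aws", "on_old_infra"},  # failed test -> roll back
    "live_on_aws": set(),
}

class ClientMigration:
    def __init__(self, client_id: str):
        self.client_id = client_id
        self.state = "on_old_infra"

    def advance(self, new_state: str) -> None:
        """Move to a new state, refusing any transition not explicitly allowed."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError("cannot go from %s to %s" % (self.state, new_state))
        self.state = new_state
```

Making invalid transitions impossible is what lets a support team drive migrations safely: a client can’t be flipped live without passing through export, import and testing first.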
Performance Down Under
A number of our clients are based in Australia & New Zealand. During the latter stages of the implementation process, we started getting performance complaints from some of them. We had been using a CDN to deliver static content from local edge-locations in Australia, but the round-trip for dynamic content was still causing a problem.
We were able to take a simplified version of our entire infrastructure – with only the components used by those clients – and set it up in the Sydney AWS region in a single day. After some testing with the clients to ensure everything was working, we switched them over to use the local region for their production environment. The resulting performance benefits have been significant, and the performance issues are now a thing of the past.
This really illustrated the power of AWS to us – what would have previously been a significant infrastructure & procurement project, became a small side-project for us, with immediate benefits to our customers.
The Benefits We’ve Noticed Since Migrating to AWS
We’ve noticed a number of benefits since migrating to AWS; these are some of the key ones:
- Improved uptime
Our uptime has improved significantly since the migration. We have had the same 3rd party uptime monitoring software running since before the migration, and the reported weekly uptime improved noticeably as soon as we migrated.
- Peace of mind
While more difficult to quantify, the extra peace of mind provided by knowing that we’re running on state-of-the-art infrastructure, and all single points of failure have been eliminated, is significant.
- Flexibility
The ability to set up a replicated version of our infrastructure to serve Australian clients locally demonstrated the power and flexibility of IaaS to us.
- Ability to innovate
Being on AWS opens up an array of new services and technologies that are now significantly more accessible to us. Whether it’s looking at new storage engines like Redshift or services like Amazon Machine Learning or Lambda, the time to implement – and therefore innovate – is significantly reduced.
- Integration to other systems
A large part of what we do at Ezora involves integrating with other systems. Being on AWS opens up new possibilities in terms of how we integrate and the tools & services we can use to connect to other systems.
- Control
Being in complete control of our infrastructure has significant benefits for us. Before, we always felt one step removed from what was actually happening, and were always reliant on other people to make certain things happen. That dependency & delay is now gone.
- Scalability
Our ability to handle scale has changed dramatically. Whether it’s scaling our data layer within RDS, or scaling out our application layer horizontally, our infrastructure is now built to handle it.
- High availability & redundancy
All of the components in our new infrastructure have been designed to be highly available and redundant. There are no single points of failure, and all data and services are spread over at least 2 independent geographical locations.
Problems We Encountered Along the Way
Obviously we had a few problems along the way. Here’s a list of the key ones; hopefully you can avoid them if you’re planning your own migration:
- Elastic Beanstalk
As previously discussed, Elastic Beanstalk wasn’t really flexible enough for our needs. The way it handled deployments in particular just didn’t give us enough control over what happened during the process.
- Insert performance on write heavy applications
One interesting issue we had was with the performance of a large volume of inserts. As part of our application, we import data from other sources. To profile this, we set up a sample data import that processed and performed around 3 million transactions. On Aurora, it took around 7 times as long to complete with the default parameter sets. After a lot of testing, we finally narrowed it down to the parameter “innodb_flush_log_at_trx_commit”. With it set to 1 (the default), the log buffer is written out and flushed to disk after every transaction commit; set to 0, this only happens around once a second. Setting it to 0 dropped the time on Aurora to around twice what it was on our old infrastructure. A key difference here is that our previous database was not being replicated, so there was always going to be some drop in performance once inserts were being replicated to one or more replicas. For us, this was an acceptable trade-off for the performance gain.
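On RDS and Aurora this setting lives in a DB parameter group rather than a my.cnf file. A hedged sketch of making the change (the parameter group name is a placeholder); the payload built below is what you would pass to the RDS `modify_db_parameter_group` API:

```python
def flush_log_param(value: int) -> list:
    """Build the RDS parameter payload for innodb_flush_log_at_trx_commit."""
    assert value in (0, 1, 2), "MySQL only accepts 0, 1 or 2 for this parameter"
    return [{
        "ParameterName": "innodb_flush_log_at_trx_commit",
        "ParameterValue": str(value),
        "ApplyMethod": "immediate",
    }]

# Applied with boto3 (the group name here is a placeholder):
#   boto3.client("rds").modify_db_parameter_group(
#       DBParameterGroupName="my-aurora-params",
#       Parameters=flush_log_param(0),
#   )
```

Note the durability trade-off is real: with the value at 0, up to a second of committed transactions can be lost on a crash, which is exactly the call we had to make.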
- Query indexing
We noticed some unusual differences in the way indexes were chosen when we migrated to Aurora. Initially we thought this was Aurora-specific behaviour, but after further investigation we found that it actually came down to changes in the query optimiser between MySQL 5.5 and 5.6. For us, the answer was to use FORCE INDEX in a few specific places.
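For reference, FORCE INDEX is a hint placed in the query itself; the table and index names below are made up for illustration:

```sql
-- Without the hint, the 5.6 optimiser sometimes picked a poorer index
-- for queries like this; FORCE INDEX pins the choice explicitly.
SELECT client_id, SUM(amount)
FROM transactions FORCE INDEX (idx_client_date)
WHERE client_id = 42 AND trx_date >= '2015-01-01'
GROUP BY client_id;
```

It’s worth using sparingly, as a forced index that later becomes suboptimal will silently stay forced.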
- Aurora Patching & Failover – as discussed above.
- EC2 Swap Space
One small problem we ran into initially was that swap space was not enabled by default on EC2 instances running Amazon Linux. This is something we hadn’t allowed for, so we had to update our base AMIs to include swap space.
- Backup Policies
This was less of a problem, but just something we had to figure out. The backup options for RDS are a little confusing. At a base level, you can just turn on automated nightly backups and you have cover for a retention period of up to 35 days, with point-in-time restore to any moment within it. While this is very easy to implement and a great start, there are a few limitations to it.
- Our backup policies and SLA required longer than a 35-day retention period.
- The snapshot covers the whole instance, so you have to restore the whole thing. There is no option to restore an individual database, for example.
- Similarly, if you wanted access to the SQL for a particular database or table, you would have to restore an entire snapshot first, then export the data you needed.
- RDS snapshots are also somewhat limited… you can’t archive them to S3 or Glacier, for example.
In order to get around these limitations we implemented our own backup solution on top of the automated snapshots. This involved taking nightly SQL dumps for all clients, encrypting them and storing them on S3 with specific retention periods set.
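As an illustration of the retention side of such a scheme (the policy and layout are simplified, and the 90-day default is illustrative rather than our actual SLA), deciding which nightly dumps are due for deletion can be a small pure function run after each upload:

```python
import datetime

def dumps_to_delete(dump_dates, today, retention_days=90):
    """Given the dates of existing nightly SQL dumps, return those older
    than the retention period and therefore due for deletion.
    The 90-day default is illustrative, not an actual policy."""
    cutoff = today - datetime.timedelta(days=retention_days)
    return sorted(d for d in dump_dates if d < cutoff)
```

Keeping this logic separate from the upload/encryption code makes the retention policy easy to audit and to change per client.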
Cost Comparison with Previous Infrastructure
One question that everyone asks when you mention migrating to AWS is “what does it cost?”. The billing structure for AWS is complicated, and it can be notoriously difficult to predict your costs accurately.
We had done some calculations and estimated that our monthly costs would be a little less than we were spending on our previous infrastructure. Now that things have settled down, our costs are approximately the same as what we were spending previously. However, we now have a vastly superior infrastructure, with all single points of failure eliminated. It includes significantly more power and the ability to scale. That price comparison also includes a replicated version of our infrastructure in Sydney for our Australian customer-base – so to compare the two really isn’t fair.
Final Recommendations
Some final recommendations if you’re going to take on your own migration to IaaS:
- Version & automate everything – if you can, do as little as possible on the console. Script everything and version control it. Try to think of your infrastructure as another codebase, and treat it accordingly.
- Choose your implementation partner carefully.
- Make your application as fault-tolerant as possible – plan for failure.
- Make the most of IaaS – use the available services, decouple your components, build in elasticity and try to make your product as cloud-ready as possible.
- Plan your actual migration process carefully.
While the project definitely took longer than we initially estimated, a lot of this was due to decisions we made to get the most out of AWS and IaaS in general. Whether this was decoupling components or updating our product to be better suited to IaaS architecture, they were all worth it at the end of the day.
We now have full control over our infrastructure and can implement the changes we want, when we want them. While this is definitely an advantage, it is something that you need to factor in when looking at tech resourcing, particularly with a small team.
Our new infrastructure is significantly more flexible, redundant, scalable and performant than the old one ever was. Combined with the value of the service and the new possibilities it opens up, the project has been a real success for us.
If you have any questions about Ezora or our migration to AWS, feel free to get in touch.