We are dreadfully sorry, but you appear to be using a rather out of date browser…
There's nothing wrong with that but our site was built to take advantage of the latest HTML & CSS features.
If you want to look at updating to a newer browser you can visit this site to get an idea of the options you have: https://whatbrowser.org/
The story of our website server migration to WireHive.
Warning: This is one for the techies amongst you.
We moved to Rackspace a few years ago when we started using MODX as our CMS platform, I designed an infrastructure using Puppet and the new “auto scaling” product that they had to offer.
The concept was simple - design a solution that would have nearly no SPOF but be cost effective as well. For a few years this process worked well but we grew… grew so much that the total disk space for all clients applications was just over 20GB and we started to see some issues with auto scale times over the past six months.
There are two approaches to auto scaling:
Both of these concepts work fine with a small amount of changes - but with bigger disk sizes they but take way too long to scale.
With 1) We saw the puppet execution time take nearly 25-30 minutes to run from scratch. (we even tried to host git repos on our own gitlab instance to reduce network latency for git checkout)
The second option was what we had always done, but because of the way Open Stack handles server images we were getting penalised for the network transmit. When you create a server image it is moved off to Cloud Files for storage so when you create a new server from the image it has to copy it back first. This meant our auto-scale took anywhere from 5 minutes to 25 minutes.
This caused a lot of intermittent “slow downs” that caused us frustrations due to not being able to scale quickly enough to cater for our clients traffic needs.
We also saw some disk latency issues with our Percona Galera Cluster where the cluster would have to wait for disk IO to free up before it could continue processing SQL statements.
I’ve been an advocate for Rackspace for many years - but I think we finally outgrew their cloud offering. So the time has come to migrate our hosting to another provider to allow us to continue controlling our infrastructure.
Welcome WireHive & Amazon Web Services (AWS). Moving to AWS was an interesting choice because they have a lot of offerings and we need to be sure they can cater for our use-case.
To move all client sites from Rackspace London to AWS Ireland with minimal downtime for all clients and ensure the least amount of impact across the service
The two biggest issues we had surrounding the migration were:
To start with I created a Google doc highlighting the details of every client that was impacted. This displayed as much information as possible to help maintain visibility across the whole migration. Some of the columns were:
This sheet was shared with the other developers and the account managers at Adido. Thankfully our account managers dealt with all client communications amazingly well and everyone was prepared for when the time comes to update DNS records.
Once we had our AWS account set-up and ready to go the time came to start the move.
Originally I planned to move cloud files after the servers, but the dev team had a little downtime at the beginning of the week to get ahead of this. Surprisingly getting files out of Cloud Files was quite a challenge. We had time-out issues, concurrency issues, and things just taking ages.
To give an idea of how much content we had to move - 250,000 files totalling 30GB in about 60 different containers. The key to this content migration was to tackle each client one at a time and to move the files as quickly as possible - to minimize the risk of a client uploading a file to the old container before the settings were updated to look at S3.
We tried cyberduck to download the containers locally and then to upload direct to an S3 bucket but we had issues with the total files in folders/containers which would cause cyberduck to crash.
There was also Turbolift which I have used in the past to bulk upload vast quantities of files very quickly. Unfortunately I could not get the download functionality to work at all.
I wrote a very basic script that would download a container through the API but this was very slow as it was only doing one file at a time. However at the eleventh hour whilst reading Hacker News a tool appeared on the front pages called rclone. This would let us copy files from a container directly to S3 from a server and is also concurrent.
So we now had a process to migrate clients from cloud files to S3:
This process worked flawlessly for most of our clients. We had issues where the file migration took a long time (the longest was nearly 6 hours long). So these ran overnight and updated the production details in the morning.
We immediately saw some issues with the MODX S3 media source that is to do with generating image thumbnails and the amount of images within folders. So we made some changes to the media sources we use internally and hoping to push these back to the community in the coming weeks.
On to the next task
Because we had issues with the boot time of servers at Rackspace, I was aware that this was something we would have to get right from the start and using AWS allowed us to create our own AMIs.
Previously we were using Puppet to orchestrate our servers and ensure they were up to date, however over the years our Puppet codebase got quite messy and hacked apart so it was time to start afresh. Welcome Ansible.
Ansible looked a lot cleaner and also the WireHive team are all promoters of the language so they can give support when we need it on these scripts. Our playbook had two main objectives:
The first objective is fairly simple with the yaml structure of roles allowed us to create a few different folders and tasks to cover our whole infrastructure relatively simply.
Once this objective was complete it was time for the tricky one: How do you manage many client sites and only deploy one if you need to. The solution was using tags and variables. By writing our own role and creating a group variable which included all client sites and their respective branch + commit hash allowed me to use the “with_items” option and loop our build task for every client. It also gave us the ability to run a deployment for a single client - couple this with the ec2 inventory script we had a fully dynamic deployment system.
So now we had ansible doing deployments and managing the server infrastructure we needed a way to invoke Ansible and manage everything on a human level. Previously we were using Hipchat for our Chat Ops but have moved to Slack for our Chat Ops in the hope that we can add more tools for the rest of the company and role out to everyone increasing visibility to everyone.
Marvin is our new robot. Based on Github’s Hubot he handles all our deployments and AWS interactions. From creating nightly AMIs and updating our auto-scale groups to deploying new versions of websites through Ansible. In the coming weeks and months we are going to extend him further to integrate more with our internal tooling.
In our main Ops room we also have New Relic sending us performance alerts if client websites are performing poorly and I’ve set up an AWS Lambda function to send notifications into slack about scaling activity and also alarms like high CPU and unhealthy nodes.
We were nearly there.
One thing we had to ensure was that once we moved any third party integrations would remain working. So I had a list of sites that would require extra testing from the new infrastructure. Once we had Ansible updating the auto-scale group (it was sat on idle as it was receiving no traffic at the moment), I could direct my own traffic (updating my local hosts file) to the respective client sites and test the process end-to-end. This gave us the reassurance that everything would be OK
We moved from our Percona Galera Cluster to a MySQL RDS Instance. After some playing around with security groups I finally managed to get all our data imported (Thank you mydumper and the great Percona blog for explaining how we set this up)
After the 90 minutes to import all the data, the RDS instance was set up as a slave and another thing was checked off the list. Although we wouldn’t be sure what settings to put in the Parameter group - we had a few that we could set based on our previous database setup. But this would be something that would need tweaks over the coming weeks.
One thing that we had to be careful about was when migrating the sites - We had to not break replication. This meant some very clever DNS tricks were needed. All our sites use the same hostname for the DB connection with is a CNAME to its actual destination so I updated the hosts file on the old infrastructure to reference the old database and used the normal DNS to reference the new server. This would mean we wouldn’t need any code updates for the new DB connections.
When we were finally happy to go live on the new infrastructure, there was a very simple process to follow:
This process will send all traffic to AWS and the new infrastructure without requiring any DNS updates in the immediate future.
So late Sunday night - I hit the switch to redirect the traffic and after what can only be known as the longest single minute of my life. After all the load balancers updated the sites were back up and it’s onto the next task.
We had access to a few clients sites and this was key to get clients moved off a single server and being proxied through Rackspace and onto the Auto Scale infrastructure at AWS. Each client had there own CNAME record which is managed in route 5s. This would allow us in the future to move their sites around and not have to trouble them with DNS updates.
So Monday morning our account managers got busy notifying clients telling them it is time to update to the new DNS records. Because we could not use a CNAME on the root domain of a website we maintain an A record for the root and CNAMe for the www. Record.
Five days on and the infrastructure has settled down. It’s been a bit of a steep learning curve with understanding how EC2 instances work and what would be best for our set-up but with the help of WireHive and www.ec2instances.info, it looks like we have a scaling solution that works well for us.
The main point of this exercise was to improve performance of our clients websites and to reduce the time it takes our platform to scale. We are now scaling new servers in around 1-2 minutes which is considerably lower than the old method.
In terms of site performance the below graphs are from an e-commerce site, on our old platform against our new platform
Read Part 2 of David Berendt's talk on Voice Search and the implications is has for your business at our Attention! 2017 event.
Read Part 1 of David Berendt's talk on Voice Search and the implications is has for your business at our Attention! 2017 event.
We decided to move all our hosting to WireHive & Amazon Web Services (AWS). Our senior developer Mark describes the process we took and the experience we had.