Shifting a Legacy Site to a CDN, Part 2

Submitted by Josh

In the first half of this CDN blog, I talked a little about why I wanted to move my biggest website onto cloud-delivered content, and the broad strokes of how I planned to get started. This second half gets into the specifics of how I actually pulled off the early stages of that transition.

An Actual, Factual CDN

Photo by Pedro da Silva on Unsplash

This is the part that needs some Amazon. Once you have your Amazon Web Services account configured - yeah, it needs to be different from the one you use to buy stuff - a simple CDN like this needs only two services: Simple Storage Service (S3) and CloudFront. S3 is the storage layer, organized into containers called "buckets" that you create for storing files. In my case, I wanted two buckets - one for my development site and one for the production site. There are likely more clever ways to organize a static CDN that could get by with a single bucket, using versioning or other tweaks, but separate buckets keep things clearly segmented and straightforward.
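If you'd rather work from a terminal than the web console, creating the pair of buckets is a one-liner each. A quick sketch, using placeholder bucket names and the region that shows up in my pipeline config later:
# Placeholder bucket names and region; swap in your own.
aws s3 mb s3://dev-bucket-name --region us-east-1
aws s3 mb s3://prod-bucket-name --region us-east-1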
 
CloudFront is the service that actually creates the CDN. You can serve files straight from S3 if you really want - but you don't want to, because you lose most of the benefits of a real CDN and you'll end up paying more for it, too. CloudFront retrieves files from S3 and caches them at its own edge locations before serving them outward, which is what creates the performance gains a CDN is known for.
 
For my implementation, matching the pair of S3 buckets, I needed two "distributions," the individual instances of CloudFront that operate as the CDN. I maintain the development site and the production site under two completely different top-level domains, so for each of those I configured a new CNAME record in that domain's DNS using the pattern "cdn.domain.com". The exact steps will depend on where and how you manage your DNS, but if you're in the business of running your own domains it shouldn't be complicated to get this working with whatever vanity URLs you need. Meanwhile, back at AWS, I generated SSL certificates for both hostnames so that all the content could be served securely, then updated the DNS records back on the webserver to validate the certificates. None of this is strictly necessary to operate the CDN, but serving the content from nicer vanity URLs makes for a more professional implementation.
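For what it's worth, the certificate half can also be done from the command line. This is only a sketch with made-up hostnames - the one real gotcha is that CloudFront wants its ACM certificates issued in us-east-1:
# Hypothetical hostname; CloudFront requires the ACM certificate to live in us-east-1.
aws acm request-certificate \
  --domain-name cdn.example.com \
  --validation-method DNS \
  --region us-east-1
# The vanity hostname itself is just a CNAME in your DNS zone pointing at the
# distribution's *.cloudfront.net address, something like:
#   cdn.example.com.  3600  IN  CNAME  d1234abcdef8.cloudfront.net.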
 
Finally, I needed to set up an IAM user to manage access to these resources. IAM is essentially AWS's way of setting up privileged users for security's sake, so in order to programmatically maintain the content in the S3 buckets, I needed a user that could write files to the buckets while locking out any other user or service from doing so. I have to admit I don't remember which resources I read online to get this working, as I did have a couple of permissions issues early on, but I can tell you that in order for the next steps to work, the IAM user needed these five S3 permissions: ListBucket, GetBucketLocation, GetEncryptionConfiguration, PutObject, and PutObjectAcl.
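For reference, here's roughly what that looks like as an inline policy attached to the deployment user - a sketch with hypothetical user and bucket names, covering only the dev bucket. Note that the bucket-level actions attach to the bucket ARN while the object-level actions attach to its contents:
# Hypothetical user and bucket names; add matching statements for the production bucket.
aws iam put-user-policy \
  --user-name cdn-deployer \
  --policy-name s3-cdn-deploy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetEncryptionConfiguration"],
        "Resource": "arn:aws:s3:::dev-bucket-name"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:PutObjectAcl"],
        "Resource": "arn:aws:s3:::dev-bucket-name/*"
      }
    ]
  }'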

The Bucket Brigade

As briefly mentioned, I use Bitbucket as my repository host. Why Bitbucket specifically? I don't really remember; it's been a long time. But I wanted to host the new repository there too, because it's important to have the content stored offsite anyway, it's convenient to keep it alongside the other repos I use for the site, and Bitbucket can connect directly to S3.
 
So, the first step, of course, is to create this new repository and get the files into it. This boils down to grabbing all the files from where they lived in the old repository, setting up the new file structure in the new repository, and committing them. If you've read this far, there's no need for more detail there - I hope you already understand the basics of creating a versioned git repository. I also took that file structure and, for the initial load, ran the files up to S3 manually using the web interface so I could get started on the deployment faster.
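I did that first load through the S3 web console, but if you'd rather stay in a terminal, the CLI equivalent is roughly this, with a hypothetical local path and the same placeholder bucket name as before:
# Hypothetical local path; --acl public-read matches how the pipeline uploads files later.
aws s3 sync ./images s3://dev-bucket-name/images --acl public-read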
For a longer-term solution, I plugged in a Bitbucket Pipeline to ship files straight from Bitbucket to S3. Most of the pipeline setup comes straight from Atlassian's docs and a little Google searching, so it's not incredibly exciting, but it does handle my use cases:
  1. Watch for commits to the develop and master branches.
  2. Trigger a build to the correct bucket based on the commit.
  3. Upload only a best guess of changed or added files, to avoid paying to copy files that don't need it.
What it doesn't do is trigger an automatic invalidation (a.k.a. cache clear) on CloudFront - to the best of my knowledge and experience, that requires extra scripting to authenticate against AWS and make the invalidation request via the AWS command-line interface. For my use right now, I'm fine doing the invalidations manually when needed, because brand new files never need invalidating, and the existing files will almost never change.
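For the record, the manual version is a single CLI call once your credentials are set up - the distribution ID here is a placeholder, and the paths are whatever you actually changed:
# Placeholder distribution ID; find yours in the CloudFront console or via "aws cloudfront list-distributions".
aws cloudfront create-invalidation \
  --distribution-id E1234EXAMPLE \
  --paths "/images/some-changed-file.png" "/css/*"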
 
So, here's my pipeline configuration:
image: atlassian/default-image:3
pipelines:
  branches:
    develop:
      - step:
          name: Deploy to Development S3
          deployment: production
          script:
            - pipe: atlassian/aws-s3-deploy:1.1.0
              variables:
                AWS_ACCESS_KEY_ID: $aws_key
                AWS_SECRET_ACCESS_KEY: $aws_secret
                AWS_DEFAULT_REGION: 'us-east-1'
                S3_BUCKET: 'dev-bucket-name'
                LOCAL_PATH: '/opt/atlassian/pipelines/agent/build'
                ACL: 'public-read'
                EXTRA_ARGS: '--size-only'
    master:
      - step:
          name: Deploy to Production S3
          deployment: production
          script:
            - pipe: atlassian/aws-s3-deploy:1.1.0
              variables:
                AWS_ACCESS_KEY_ID: $aws_key
                AWS_SECRET_ACCESS_KEY: $aws_secret
                AWS_DEFAULT_REGION: 'us-east-1'
                S3_BUCKET: 'prod-bucket-name'
                LOCAL_PATH: '/opt/atlassian/pipelines/agent/build'
                ACL: 'public-read'
                EXTRA_ARGS: '--size-only' 
You can see here that Atlassian did all the heavy lifting. My involvement was simply setting up the two configs for the branches, setting up the pipeline variables to store my AWS key/secret pairs, and setting EXTRA_ARGS to handle point #3 above. These pipelines will now automatically handle any images changed or added when they're committed to their respective branches - the pipeline spins up a virtual machine, copies the images into that machine, and then uses it to update the S3 bucket. I suspect there's a way to set the S3_BUCKET variable programmatically, which would let me condense the develop/master configs into a single config, but I haven't fully explored it. From a brief glance, I think adding some logic to the pipeline to check the value of BITBUCKET_BRANCH would do the trick, something like the sketch below.
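Untested, but the sketch I have in mind looks something like this: a glob pattern to catch both branches, and a small shell check that exports the right bucket name before the pipe runs. (Whether the exported variable reaches the pipe cleanly is exactly the part I'd verify first.)
image: atlassian/default-image:3
pipelines:
  branches:
    '{develop,master}':
      - step:
          name: Deploy to S3 (branch-aware sketch)
          script:
            # Pick the bucket based on which branch triggered the build.
            - if [ "$BITBUCKET_BRANCH" = "master" ]; then export S3_BUCKET="prod-bucket-name"; else export S3_BUCKET="dev-bucket-name"; fi
            - pipe: atlassian/aws-s3-deploy:1.1.0
              variables:
                AWS_ACCESS_KEY_ID: $aws_key
                AWS_SECRET_ACCESS_KEY: $aws_secret
                AWS_DEFAULT_REGION: 'us-east-1'
                S3_BUCKET: $S3_BUCKET
                LOCAL_PATH: '/opt/atlassian/pipelines/agent/build'
                ACL: 'public-read'
                EXTRA_ARGS: '--size-only'
One thing this drops is the per-step deployment tag, which may turn out to be a good reason to keep the two configs separate after all.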

Converting the Codebase

Naturally, the old webcode is not set up to point at the new CDN URLs. And, because it's very old code built over the span of twenty years, it's not really optimized for a quick change to this sort of path. So this required a lot of manual changes - first to set up a configured variable to store the correct CDN URL, and then, within the dozens of files impacted, changes to use the new URL. Changes had to be made in the webcode, the static CSS, some Javascript, and even in a couple of database locations, thanks to a big mix of old code that isn't really up to the standards I would set now. I made the changes section by section, testing them on the dev site and pushing them to production as quickly and as often as possible, both to verify the previous steps and to start taking advantage of the CDN sooner.
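If you're doing the same, a recursive grep across the docroot will find most of the stragglers before you ever look at a log - the docroot and path prefix here are hypothetical:
# Hypothetical docroot and image path; lists every file still referencing the old on-server location.
grep -rl "/images/" /var/www/example.com/public_html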
 
Once the bulk of the code was updated, I also began tailing the access logs on the server to see what I missed, since there was no single change I could make to switch every image. If you have a site where you're looking to do something similar, I have to assume you have access to the raw access logs too. You'll have to find them yourself, of course, but if you're following my steps on a similar Linux+Apache system, the general command to watch them live will look like this:
tail -f <log name> | grep -iE "jpg|gif|png|svg"
If you're not familiar with your server logs or Linux commands, this one-liner displays log lines on your terminal as they come in, but only the lines containing one of those common image file extensions. Adjust to taste. What this shows you is any image matching that pattern that is still being requested from the webserver, where the images originally lived. It also shows you which webpage made the call, so you can track that file down and point it at the CDN.
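If your logs are in Apache's common "combined" format - that's an assumption, so check yours - you can squeeze the output down to just the requested image and the referring page by splitting each line on the quote characters:
# Field 2 is the request line and field 4 is the referer, assuming the combined log format.
tail -f <log name> | grep -iE "jpg|gif|png|svg" | awk -F'"' '{print $2, $4}'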

For the Future

I'm writing this post really early on in the process, relatively. It hasn't even been two weeks since the first code was deployed that used the nascent CDN, and only a few days since I considered the work "done" save for tweaks I catch in the logs. So, there are certainly a few things that I might want to do later to finish things up.
Of course, I'll continue monitoring the logs to see what I might have missed. The very, very old code that runs the discussion forums on this site, for instance, recorded the old-style image emoji as HTML inside each post's database record whenever an emoji was used. Those paths were forever hardcoded to the images on the webserver, which meant that to convert them, I had to first verify that newer posts were using the CDN correctly, and then run a text search/replace inside the database itself to convert the old posts. Search engines also have records of the old paths for use in image search, so at some point I will need to force those engines to recrawl the images and pick up the CDN URLs, or else I will reduce my SEO attractiveness when I get to the next step...
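The search/replace itself was nothing fancy - on a MySQL-style database it boils down to something like the following, with entirely made-up table, column, and path names:
-- Hypothetical table, column, and paths; take a backup before running anything like this.
UPDATE forum_posts
SET body = REPLACE(body, 'src="/images/emoji/', 'src="https://cdn.example.com/images/emoji/')
WHERE body LIKE '%src="/images/emoji/%';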
 
That step, as you might surmise, is to get rid of all the images that still live on the webserver. If you look back over the steps so far, you'll notice there's been no cleanup step yet, because I don't want users to start getting broken or missing images while there are still paths in the webcode that point at the old locations, or search indices that crawled the images for their image search results. Eventually, though, I want those images gone entirely from their old homes, living only in the new one, to slim down that repository as I mentioned waaaaaaaaaay back in the first post. This will also have to include the user-submitted content I mentioned before, which will certainly be a bit of a trick to automate so that new submissions land on the CDN without me manually copying the images every time.
 
And, after that, I need to take better advantage of my new infrastructure. Can I tweak the CDN to better serve the files, or even tweak the files themselves to be more conducive to being served, by running them through optimizers and re-adding them to the repo? Can I start serving more files via the CDN, such as my static Javascript or CSS? Is this an opportunity to finally build a CSS preprocessor into the flow so I can start using easy minification and variables like everyone else has been doing for years? Or is it as simple as needing a few more steps to make the project cost-effective? I'll need to see a couple of months of typical usage to determine what my average costs are going to be and decide where to go from there. These are all opportunities that finally opened up by taking baby steps to make a very old codebase look just a little newer and more efficient.