Hosting & Servers

Two Cases Where Frontend Microcaching Saved The Day

I've done sysadmin-type work in addition to general web development for a long time. I've built many a server stack and helped solve a lot of performance issues.

I've heard the term microcaching a lot recently, and I had never heard it used for HTTP web caches before. If it is a new term, then it's a new name for something people have been doing for a long time. If I just somehow missed that it was called that, then ooops :p

What is Microcaching?

I don't know if there is a technical definition for it specifically, but essentially it's the process of caching things for unusually short lifetimes. It helps a ton on sites that have specific dynamic pages which are also highly trafficked.

The Semi-Unusual Cases Recently Where Microcaching Was A Big Win

Twice in the last 6 months I've been contacted by clients who were in similar situations. Both were steadily growing and had been asked by their hosting companies to upgrade because they were hitting limits or impacting other users.

By the time I was contacted both had already upgraded from shared to VPS hosting, and one of the sites had even employed someone previously to tweak caching plugins and Apache configs.

The usual big-win caching items were all enabled and tweaked, and in-memory caches were available on the local host. The problem was that most of that caching applied only to logged-out users.

Logged-in users often need fresher content – or the rules of caching are more complicated than is viable to offer through a plugin interface – so the sites could only make minor improvements through the plugins.

A reverse proxy with some specially tailored caching rules was an incredible fix in areas that the plugins couldn't handle.

Even though both sites had previously been looked into by their hosts, and at one point even by a performance specialist, somehow they had missed something…

Logs showed that both sites were getting 50-75% of all their traffic on a single page type from logged-in users. Looking at the pages, there were obvious reasons for the high number of requests to them. Speaking with the site owners, though, I discovered the pages didn't need to be generated fresh for every request.

Group Social Freshness

There was a social network where the main group channel was the culprit for most of the server load. 

The page was long, with many short messages, each requiring its own query and loop to get the data and output it to the page. New messages were posted every 5 seconds during peak times and there was no client-side loading. It was an expensive operation compared to other pages on the site.

It turns out that the group channel showed the same content regardless of who viewed it so long as they were logged in. 

Short term caching of this page made a huge difference. ~85% cache hit rate on a highly dynamic page that changes roughly every 5 seconds dropped the server load to less than half what it was before.

The rule: cached globally for all logged-in users who visit it, with a freshness lifetime of 5 seconds; stale pages allowed, but only for up to 30 seconds.
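To make a rule like that concrete, here is a minimal Python sketch of a cache with a fresh window and a stale window. It is not the proxy configuration used on the site. The `MicroCache` class and its names are hypothetical, and it assumes that "stale for up to 30 seconds" means 30 seconds past the end of the fresh window, with a stale copy served only when regenerating the page fails.

```python
import time


class MicroCache:
    """Tiny in-memory cache with a short fresh window and a stale window."""

    def __init__(self, fresh_for=5.0, stale_for=30.0, clock=time.monotonic):
        self.fresh_for = fresh_for    # seconds a copy counts as fresh
        self.stale_for = stale_for    # extra seconds a stale copy may be used
        self.clock = clock            # injectable so the sketch can be tested
        self._store = {}              # key -> (stored_at, body)

    def get(self, key, generate):
        """Return (body, status) where status is HIT, MISS or STALE."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] <= self.fresh_for:
            return entry[1], "HIT"
        try:
            body = generate()  # the expensive page build on the backend
        except Exception:
            # Assumption: only fall back to stale content if the backend fails.
            limit = self.fresh_for + self.stale_for
            if entry is not None and now - entry[0] <= limit:
                return entry[1], "STALE"
            raise
        self._store[key] = (now, body)
        return body, "MISS"
```

Because every logged-in user shares the same key here, the page is built at most once per fresh window no matter how many people request it.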

Online Store Confirmation/Delivery Details

The other site was an online store, and it had a slightly different problem. The page getting the most hits was unique to each user and each order. It was the confirmation page which each customer visited directly after ordering. Customers would sit on the page and refresh it over and over again until the status changed from pending to success.

Since this site sold items with customization options that were fulfilled through drop shipping, it took between 5 and 10 minutes for an order to be fully confirmed.

Refreshing many times between order and confirmation created dozens and dozens of needlessly generated pages. To make it worse, the pages were also making Ajax calls to check for updated shipping info on each request, so every page view was essentially 2 requests.

Orders took between 300 and 600 seconds to confirm because of the drop shipping, and no amount of refreshing is going to change what is returned within that window.

There were at least 50 times as many requests to that page as there were unique sales.

The rule: a unique cached item for each user, with 60 seconds of freshness and stale allowed for up to 90 seconds.
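Since each customer gets their own cache entry, one customer can be looked at in isolation. The rough Python simulation below shows how much a 60 second fresh window could save for a single order. The 10 second refresh interval is an assumption for illustration, not a figure from the logs, and `simulate` is a hypothetical helper.

```python
def simulate(wait_seconds, refresh_every, fresh_for):
    """Count page requests and backend page builds for one waiting customer.

    The customer requests the page every `refresh_every` seconds until the
    order confirms after `wait_seconds`. The cached copy is reused while its
    age is at most `fresh_for` seconds.
    """
    total_requests = 0
    backend_builds = 0
    stored_at = None
    t = 0
    while t < wait_seconds:
        total_requests += 1
        if stored_at is None or t - stored_at > fresh_for:
            backend_builds += 1   # cache miss or expired: rebuild the page
            stored_at = t
        t += refresh_every
    return total_requests, backend_builds


print(simulate(300, 10, 60))  # (30, 5)
print(simulate(600, 10, 60))  # (60, 9)
```

Under those assumptions a customer who waits 5 minutes makes 30 requests but only triggers 5 page builds.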

Takeaway: Win-Win

Microcaching is a thing. It brings big wins on highly dynamic pages with lots of requests per second.

Ultimately in these 2 specific cases a single rule was able to drop server load by more than half and prevent the need for these companies to upgrade to more expensive hosting plans.

Bonus: users get pages much faster.

Nginx Reverse Proxy Cache of WordPress on Apache

An NGINX reverse proxy for WordPress sites running on Apache is my standard setup for running WP sites. I've got a pretty slick setup running an entirely self-contained NGINX reverse proxy to WP on Apache with PHP7, using Docker to Proxy Multiple WordPress Instances.

Every single shared and managed host I've personally used in the last 10-15 years ran Apache as the default HTTP server. The same goes for every client I've ever had with a shared or managed account. I've only ever once been offered the option of anything different, and it was not the default configuration.

NGINX is very capable of doing the exact same thing as Apache but I see it used more commonly as a proxy. You can also use Apache for a proxy if you want to.

Apache and NGINX are both HTTP servers. They are pretty interchangeable if all you are interested in is the end result of a page reaching the requesting user.

Some Key High Level Differences Between Apache and NGINX

Apache is incredibly well supported and used by a huge number of servers. It can be installed and works almost right out of the box. It's modular, works on many systems and is capable of hosting a wide range of sites with relatively minimal configuration.

It's the default http server of choice for so many for a reason – it copes well with most situations and is generally simple to configure.

On the other hand, NGINX has a smaller market share, can be a little trickier to install and get working right, and may require additional setup for particular applications.

It's not as modular (turning on features sometimes requires a complete rebuild from source) but it performs a lot better than non-tuned Apache installs. It is less memory hungry and handles static content way better than Apache. In comparisons it excels particularly when handling concurrent connections.

Why Put An HTTP Server In Front Of An HTTP Server?

I get asked this by site builders a lot more than I ever thought I would. There are several technical and infrastructure reasons why you may want to do this. There are also performance reasons and privacy reasons. I won't go into great detail about any of them but I encourage you to Google for more detail if you are intrigued.

There are 2 simple reasons why I do this that are both related to separating the access to a site from the operation of a site.

  1. Isolating the front-end from the back-end means that I can have specially tweaked configurations, run necessary services spanning multiple host machines and know that all of that is transparent to the end user.
  2. The other reason is performance based. The front-end does nothing dynamic; it serves only static html and other static content that is provided to it by the backend services. It can manage load balancing and handle service failover. It can cache many of the resources it has, which results in less dynamic work generating pages and more work actually serving the pages once they have been generated.

When To Cache A Site At The Proxy

I cache almost every request to WordPress sites when users are not logged in. Images, styles and scripts, the generated html. Cache it all, and for a long time.

That is because the kinds of sites I host are almost completely content-providing sites. They are blogs, service sites and resources. I think most sites fit into that same bucket.

These kinds of sites are not always updated daily, and comments on some posts arrive days or weeks apart. Single pages often stay the same for a long time. Homepages and taxonomy pages may need updating more often, but still not so often as to require a freshly generated page every time.

Some Particular Caching Rules and Configs For These Sites

A good baseline config for my kind of sites would follow rules similar to these:

  • Default cache time of 1 month.
  • Default Cache-Control of public.
  • Cache statics, like images and scripts, on first request – cache for 1 year. 
  • Cache html only after 2 requests, pass back 5-10% of requests to backend to check for updated page.
  • Allow serving of stale objects and do a refresh check in the background when it occurs.
  • Clear unrequested objects every 7 days.

A long default cache lifetime is good to start with, I'd even default to 1 year in some instances. 1 month is more appropriate for more cases though.

Setting the cache type to public means that not just browsers will cache the response, but also any other services that sit between request and response.

Static resources are unlikely to change ever. Long cache lifetimes for these items. Some single pages may have content that doesn't ever change but the markup can still be different sometimes – maybe there's a widget of latest articles or comments that would output a new item every now and again.

Because of that you should send some of the requests to the backend to check for an updated page. Depending on how much traffic you have and how dynamic the pages are you can tweak the percentage.

The reason that html is set not to be cached on the first 2 requests is that the backend sometimes does its own caching and optimizations that require 1 or 2 requests to start showing. We should let the backend have some requests to prime its cache so that when the page is cached at the proxy it is the fully optimized version that gets cached.

Serving stale objects while grabbing new ones from the backend helps to ensure that as many requests as possible are served from cache. If the backend object hasn't changed then the cached copy just has its date updated, but if it has changed then the cache is updated with the new item.

Every so often, clearing out cached items that have not been requested helps to keep the total size of the cache down.
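The "cache html only after 2 requests" and "pass back 5-10% of requests" rules can be sketched as a small decision function. This is a minimal Python sketch, not a proxy configuration. `HtmlCachePolicy` and its return values are hypothetical names, and the 5% bypass rate is just the low end of the range given above.

```python
import random


class HtmlCachePolicy:
    """Decide what the proxy should do with a request for an html page."""

    def __init__(self, min_uses=2, bypass_rate=0.05, rng=random.random):
        self.min_uses = min_uses        # requests passed through before storing
        self.bypass_rate = bypass_rate  # share of cached requests sent back anyway
        self.rng = rng                  # injectable so the sketch can be tested
        self.seen = {}                  # url -> number of requests so far

    def decide(self, url, cached):
        """Return 'serve-cache', 'pass-and-store' or 'pass-only'."""
        count = self.seen.get(url, 0) + 1
        self.seen[url] = count
        if cached and self.rng() >= self.bypass_rate:
            return "serve-cache"
        if count > self.min_uses:
            # The backend has had time to prime its own caches, so keep the result.
            return "pass-and-store"
        return "pass-only"
```

The first two requests for a url go straight to the backend, the third is stored, and after that only the occasional request is passed back to check for an updated page.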

Ensuring Email Deliverability – SPF, DKIM & DMARC

Email deliverability is deceptively complex. For most people it just works. You write an email, send it and it arrives at the other end. A lot goes on between when you click send and when it is accepted at the other end.

What goes on between clients and mail servers – and between one mail server and another – is complicated enough, but people also need to make sure that when their messages get there they don't end up in the SPAM folder.

Ensuring Email Deliverability

There is so much SPAM email being sent that almost every email sent goes through more than one SPAM check on its journey between sender and receiver.

Different places do different kinds of checks. Often when email is sent from your computer or phone it goes up to an external outgoing mail server to be sent. Even at that early stage some checks might be done – your mail client might do SPAM score checking and the mail server should certainly require authentication for outgoing mail.

When it leaves your server it bounces through routers and switches, different hosts and relays, before arriving at the receiving mail server. Checks may be done in the process of its transfer.

When the end server receives the message it will probably do more checks before putting it into the mailbox of the receiver. In the end the receiver might even do additional checks in the mail client.

Securing Your Outgoing Mail

There are a handful of accepted standards to help make sure mail you send gets to where it needs to be and that it stays out of the SPAM folder. They also help prevent anyone sending mail that spoofs your address or pretends to be you.

Mail Missing In Transit

Mail from known bad hosts, IP ranges and domains is often terminated en route.

You want this to happen. You should not be sending mail from any known bad addresses.

The most commonly used method to ensure the host sending outgoing mail is authorised to send for that domain is called SPF.

SPF – Sender Policy Framework

In your domain's DNS you can add records that inform others which hosts and IPs you want to allow mail to be sent from. You also set the default action to take when messages fail the SPF check.
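As an illustration, here is a minimal Python sketch that splits an SPF record into its mechanisms. The record itself is made up (example domains and a documentation IP range), `parse_spf` is a hypothetical helper, and it ignores modifiers such as `redirect=`, so treat it as a reading aid rather than a validator.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(record):
    """Return a list of (result, mechanism, value) tuples from an SPF record."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    mechanisms = []
    for term in terms[1:]:
        if "=" in term:
            continue  # modifier (e.g. redirect=), skipped in this sketch
        qualifier = "+"  # a mechanism with no qualifier means pass
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        mechanisms.append((QUALIFIERS[qualifier], name.lower(), value))
    return mechanisms


# Hypothetical record: allow one IP range and one included sender, fail the rest.
example = "v=spf1 ip4:192.0.2.0/24 include:_spf.example.net -all"
for result, mechanism, value in parse_spf(example):
    print(result, mechanism, value)
```

The trailing `all` mechanism is where the default action lives: `-all` asks receivers to fail anything not matched earlier, while `~all` only asks for a soft fail.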

Not everyone treats SPF records with the respect they deserve. That's because a lot of SPF records are actually misconfigured, and strictly trusting a system that many have obviously misconfigured would not work out well for everyone.

The next common way to secure your outgoing mail is DKIM.

DKIM – DomainKeys Identified Mail

DKIM is a method to cryptographically sign a message, either as the origin or as an authorised intermediary host. Receivers can use the signer's public key, published in DNS, to verify the signature and confirm that the message is authorised and untampered.

Since DKIM requires key generation and is underpinned by a more complex set of sub-systems it is often treated with much more authority than SPF.

The final piece of the trio is DMARC.

DMARC – Domain-based Message Authentication, Reporting & Conformance

Some mail hosts will use SPF or DKIM to validate a message. Some hosts don't. And many treat failures differently.

DMARC allows you to instruct mail servers who listen exactly what you want to happen to messages that fail those SPF or DKIM checks.

You can set a policy of:

  • do nothing
  • quarantine (goes to spam)
  • or reject

You can also set the percentage of mails to apply the policy to (this helps during initial testing and when any changes are made).

It also gives mail receivers an easy way to contact you and report the results for the mail they have processed for you. They will report sending IPs and the results from SPF/DKIM, as well as what they did with the message in the end.

That information is extremely useful to anyone managing an outgoing mail server and can be used to spot problems with sending (or fake senders) very quickly.
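For a feel of how the policy, percentage and reporting address fit together, here is a minimal Python sketch that reads a DMARC record. The record and the reporting address are made up, `parse_dmarc` is a hypothetical helper, and it only looks at the `p`, `pct` and `rua` tags.

```python
def parse_dmarc(record):
    """Return the policy, percentage and report addresses from a DMARC record."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError("malformed tag: %r" % part)
        tags[name.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    policy = tags.get("p")
    if policy not in ("none", "quarantine", "reject"):
        raise ValueError("unknown policy: %r" % policy)
    return {
        "policy": policy,
        # pct is optional; when absent the policy applies to all mail.
        "pct": int(tags.get("pct", "100")),
        "rua": [a.strip() for a in tags.get("rua", "").split(",") if a.strip()],
    }


# Hypothetical record: quarantine a quarter of failing mail and send reports.
example = "v=DMARC1; p=quarantine; pct=25; rua=mailto:dmarc-reports@example.com"
print(parse_dmarc(example))
```

Starting with `p=none` and a reporting address lets you read the reports before asking receivers to quarantine or reject anything.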

When You Want Mail To Be Terminated In Transit

If mail is received and you have not authorised it then you want it to be terminated before it gets into anyone's mailbox. At the very least you will want it to go to SPAM.

Mail failing authorisation is probably using a spoofed from address or is otherwise illegitimate.

SPF, DKIM and DMARC combined help to stop any mail you did not authorise from ending up in front of the user. That prevents server algorithms picking up on negative cues from users when they delete messages without opening them or throw them into spam folders.

When Termination In Transit Is A Problem

I'm going to say that you always want unauthenticated mail to be terminated. No exceptions. The problem is that very often other sites spoof your email for a legitimate reason.

Say you fill in a form online and add your email address. Often that notification is sent to the site owner via email with your address as the FROM address.

Those messages will fail your checks (actually sometimes they might not and instead be allowed through but treated as a soft failure).

It's a common practice but I'm going to say it right now. It's just plain wrong. You should never be sending mail with a FROM address that you are not explicitly allowed to send for.

The proper configuration is this, please use it:

  • FROM: [server address]
  • TO: [receiver address]
  • REPLY-TO: [form filler address]
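In code, that layout could look like the minimal Python sketch below, using the standard library `email` package. The addresses and the `build_form_notification` helper are hypothetical, and actually sending the message is left out.

```python
from email.message import EmailMessage


def build_form_notification(server_address, owner_address, visitor_address, body):
    """Build a contact form notification without spoofing the visitor."""
    msg = EmailMessage()
    msg["From"] = server_address       # an address this server is allowed to send for
    msg["To"] = owner_address          # the site owner receiving the notification
    msg["Reply-To"] = visitor_address  # replies still go to the person who filled the form
    msg["Subject"] = "New contact form submission"
    msg.set_content(body)
    return msg


message = build_form_notification(
    "forms@example.com", "owner@example.com", "visitor@example.org", "Hello!"
)
print(message)
```

The site owner can still hit reply and reach the visitor, but the message itself only claims to come from a domain the sending server is authorised for.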

Deliverability for Senders with SPF, DKIM and DMARC is Dramatically Improved

No matter what you are sending mail for: it could be personal mail or business mail; follow ups, outreach messages or newsletters. No matter the purpose of the mail it's always better when it arrives at its destination.

Using these systems helps to build domain trust from receivers and shows you have taken steps to secure your mail. Deliverability of mail whose sender has taken steps to ensure it arrives is generally better than that of mail sent with no thought about it.

The only messages you do not want to arrive are SPAM messages you have not authorised. These systems allow you to publish policies instructing receiving servers that you do not want that unauthorised mail to arrive.

Terminating mail that is questionable before users see it also means that cues used by email providers to spot messages users consider as SPAM are never shown on your messages. This increases the domain trust even more.

Monitoring Site and Service Uptime

Testing that a site is operating correctly, and that its required services are also available, can be challenging. There are various surface metrics you can test but often they are not reliable and are unable to give any depth of information about many important factors.

Surface Level Data

When it comes to web services you can get a good set of working or broken tests running with ease. By testing only surface data you can find out with some certainty if all services are up or if they are not.

There are loads of online systems that offer free site uptime checks. I've used Pingdom for it before but there are many others. The Jetpack WordPress plugin also has an uptime monitor feature which I have used.

Pinging Hosts

Many hosts are accessible through the internet and they will respond when you ask them to. You can ping the domain and assume a response from the host means it's ok. Checking ping response times and packet loss is a decent metric as well.

This doesn't check that what you want returned to user requests is what is being sent through. It only checks if the host is accessible.

Making HTTP Requests

When checking a website is running you can go a step further than pinging and send an HTTP request to the site. Every HTTP response should contain a status code which can be used to determine success or failure.

When the http service returns code 200 it indicates success. The downside of relying on http response codes is that even success codes don't necessarily mean that a site is running. Other services might not be working correctly and the site might not be giving the correct output.

One way to enhance http testing for site and service uptime is to do additional checks when success codes are returned, testing the response for some known output (for example, look for a certain tag in the head, perhaps the inclusion of a style.css file). If your known output doesn't exist in the response and a success code is returned then there is a chance a supporting service is down.
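A check along those lines can be sketched in a few lines of Python using only the standard library. The function names and the "up", "degraded" and "down" labels are my own for illustration, and the URL and marker in the usage comment are placeholders.

```python
import urllib.error
import urllib.request


def classify(status, body, marker):
    """Turn a status code and response body into a simple health label."""
    if status != 200:
        return "down"
    if marker not in body:
        # Success code but the expected output is missing: a supporting
        # service (database, cache, theme files) may be broken.
        return "degraded"
    return "up"


def check_site(url, marker, timeout=10):
    """Fetch a URL and classify the response. Network errors count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return classify(response.status, body, marker)
    except (urllib.error.URLError, OSError):
        return "down"


# Example usage with placeholder values:
# print(check_site("https://example.com/", "style.css"))
```

Keeping the classification separate from the fetch makes it easy to add further checks on the body later without touching the networking code.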

Deeper System Health Metrics

Surface-level metrics can be an easy way to tell "mostly all working" from "something is broken somewhere". They often don't give any insight into what is broken or how well the working services are performing.

You can get all kinds of information from the server that runs your sites and services if you are able to open a session to the host.

Shared hosts rarely give shell access, and when they do it's always severely limited to ensure security between customers.

System Monitor

Even in a limited shell you can probably get information about your own running processes. Linux shells usually have access to the `top` command. It's essentially a task manager that shows things like CPU usage, Memory usage etc.

In top you should be able to see the total CPU cores, RAM, Virtual Memory, average system load and some detailed information about the processes running. In limited shells you may only see processes running from your user accounts but on a dedicated server or VM you will probably be able to see all of the processes and which is using what system resource and how often.

Realtime system metrics like this can show what is happening right now on a host.
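If you want the same load figures in a script rather than on screen, a minimal Python sketch might look like this. It assumes a Unix-like host (the load average call is not available on Windows), and the helper names are hypothetical.

```python
import os


def load_per_core(load_average, cores):
    """Express a load average relative to the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def snapshot():
    """Read the 1, 5 and 15 minute load averages, as shown at the top of top."""
    one, five, fifteen = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    return {
        "cores": cores,
        "load_1m": one,
        "load_5m": five,
        "load_15m": fifteen,
        "load_1m_per_core": load_per_core(one, cores),
    }


# Example usage on a Unix-like host:
# print(snapshot())
```

A per-core figure that stays near or above 1 for long periods is a rough hint that the host is busy, though what counts as too busy depends on the workload.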

Checking on Important Services

There are a number of ways to check status of different services.

Upstart Scripts

Many services will provide a way to check their status. Often these are provided as scripts for your operating system to execute. I've heard them called startup scripts, upstart scripts, init scripts.

Depending on your OS commands like these could be used to check on some service statuses.

service httpd status
service mysqld status
service memcached status
/etc/init.d/mysql status
/etc/init.d/apache2 status
/etc/init.d/memcached status
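If you want to run checks like these from a script, a minimal Python sketch follows. It assumes the status scripts follow the LSB convention for exit codes (0 for running, 3 for not running), which not every service does. `describe_status` and `service_status` are hypothetical helpers.

```python
import subprocess

# Assumed LSB-style meanings for the exit code of a "status" action.
LSB_STATUS = {
    0: "running",
    1: "dead (pid file exists)",
    2: "dead (lock file exists)",
    3: "not running",
}


def describe_status(returncode):
    """Map a status exit code to a label. Anything unexpected is 'unknown'."""
    return LSB_STATUS.get(returncode, "unknown")


def service_status(name):
    """Run `service NAME status` and describe the result."""
    try:
        result = subprocess.run(
            ["service", name, "status"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return "unknown"  # no `service` command on this system
    return describe_status(result.returncode)


# Example usage; the service names depend on your OS:
# for name in ("httpd", "mysqld", "memcached"):
#     print(name, service_status(name))
```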

Checking Log files

Most production software has built-in logging facilities. It can push data into different system logs or to its own logging mechanisms. Usually logs end up as easily readable text files stored somewhere on a system. Many *nix systems store a lot of the logs in /var/log/ or /home/[username]/logs/.

When it comes to running websites the most common setup is a LAMP stack. Default settings are usually to log requests, some types of queries and php errors in those systems.

Reading logs can give you all kinds of useful information about services. There are also ways to configure some services to output more verbose data to the logs.
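As a small example of what can be pulled out of a request log, the Python sketch below counts status codes and paths. It assumes lines in the common access log layout that Apache uses by default. The sample lines are made up, and `count_field` is a hypothetical helper.

```python
import re
from collections import Counter

# host ident user [time] "METHOD path protocol" status size
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_field(lines, field):
    """Count how often each value of `field` appears in the log lines."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:  # lines that do not fit the layout are skipped
            counts[match.group(field)] += 1
    return counts


sample = [
    '192.0.2.10 - - [10/Oct/2020:13:55:36 +0000] "GET /group HTTP/1.1" 200 5120',
    '192.0.2.11 - - [10/Oct/2020:13:55:37 +0000] "GET /group HTTP/1.1" 200 5120',
    '192.0.2.12 - - [10/Oct/2020:13:55:38 +0000] "POST /order HTTP/1.1" 500 312',
]
print(count_field(sample, "status").most_common())
print(count_field(sample, "path").most_common())
```

Counting by path is a quick way to spot a single page type that is taking most of the traffic.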

External Site, Service & Infrastructure Monitors

There are a number of dedicated server health monitoring suites available. Premium services like New Relic and DataDog can track all kinds of deep-level data using purpose-built reporting agents that run on your systems and report that data from your servers and processes.

Until very recently I was a customer of New Relic for personal sites. I used them especially for infrastructure monitoring and deep error reporting and I would highly recommend them if that's what you're looking for. NOTE: New Relic also offer other services I did not use, check them out to see all features.

Open Source Monitoring Suites

In addition to the premium services available for monitoring there are also some fairly strong contenders in the Open Source market that are very capable.

Most can run on a host and check metrics for it and many can also check remote hosts as well.

Nagios springs to mind right away. It can do various service checks, pings, resource monitoring and system tests in a very configurable way. Its highly configurable nature makes it extremely powerful.

Munin is another piece of software I've used to keep track of things like network and disk IO as well as post queue monitoring.

Nagios and Munin I can recommend as potential monitoring solutions if you want to self-host them.

Using Varnish as a CDN

Update – 22/02/15: This site now uses a Varnish Backed CDN. Turns out it was pretty straightforward to implement 🙂

Varnish is a front-end caching proxy that serves only static content. The way it works is not too far removed from how the top level server operations work with an origin-pull CDN.

A CDN server receives a request for a static file and it delivers it. If the file is on hand then it sends it; if it isn’t available then it pulls the file from the origin and serves that while storing the file in its cache. That’s basically what Varnish does as well.

I’ve had this idea for a while but only recently added a spare Varnish caching proxy to my cluster that I can use for testing.

The primary server will be the one rewriting urls to point to the CDN domains.

I already have a basic statics server set-up and running on a separate box from the main domain. It’s got a push-style propagation method, is configured with long expirations for files and doesn’t accept cookies.

That already gets populated with files so I’ll use that as the origin to keep the busy worker count on the main server down. I want about half the requests to hit the statics server and half to hit the varnish server. The main domain will be the one that handles rewriting to make that happen. I may use the W3 Total Cache plugin to do it through PHP or I might use mod_pagespeed domain sharding through Apache.
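The split between the two hosts can be sketched as a rewrite function. This is a minimal Python sketch of hash-based sharding, not how W3 Total Cache or mod_pagespeed do it internally. The hostnames are placeholders, `shard_host` and `rewrite` are hypothetical helpers, and it assumes the goal is a roughly even split where each file always maps to the same host.

```python
import zlib


def shard_host(path, hosts):
    """Pick a host for a static file path, always the same one for that path."""
    checksum = zlib.crc32(path.encode("utf-8"))
    return hosts[checksum % len(hosts)]


def rewrite(path, hosts):
    """Rewrite a site-relative static path to a full URL on its shard."""
    return "https://%s%s" % (shard_host(path, hosts), path)


# Placeholder hostnames for the statics server and the Varnish node.
HOSTS = ["static.example.com", "cdn.example.com"]

for asset in ("/css/style.css", "/js/site.js", "/img/logo.png"):
    print(rewrite(asset, HOSTS))
```

Mapping each path to one fixed host matters for caching: if the same file were requested from a different host on every page view, browsers and the Varnish node would each end up holding duplicate copies.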

Using Pagespeed for it has its benefits but it would likely be a much simpler set-up process if done with W3 Total Cache.

The statics server already gets filled with its files (initially filled with rsync and kept up to date by W3 Total Cache – rsync runs on cron to make sure the files are always the latest versions). The Varnish server is not primed when it starts like the statics server is. We need to tell Varnish where it can find the files it doesn’t have.

To do that, a backend is set that points to the statics server. Varnish gets told that when a request arrives on a specific domain and it doesn’t have the file in its cache, it should request the file from that backend, then cache it and serve it from the cache next time.
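The origin-pull behaviour itself is simple enough to sketch in Python. This is only a model of the idea, not Varnish configuration. `OriginPullCache` is a hypothetical name, and the origin is stood in for by a plain function.

```python
class OriginPullCache:
    """Serve files from a local cache, pulling from the origin on a miss."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # callable: path -> file body
        self.cache = {}                             # path -> file body

    def get(self, path):
        """Return (body, status) where status is HIT or MISS."""
        if path in self.cache:
            return self.cache[path], "HIT"
        body = self.fetch_from_origin(path)  # first request goes to the origin
        self.cache[path] = body              # later requests are served locally
        return body, "MISS"


# Stand-in for the statics server used as the origin.
ORIGIN_FILES = {"/css/style.css": "body { margin: 0 }"}
node = OriginPullCache(lambda path: ORIGIN_FILES[path])

print(node.get("/css/style.css"))  # pulled from the origin and stored
print(node.get("/css/style.css"))  # served from the node's own cache
```

An unprimed node starts empty and fills itself as requests arrive, which is why nothing needs to be pushed to it up front.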

Once things are set-up and the Varnish cache has been running for a while it’ll serve the function of being a 2nd node in the Content Delivery Network. The benefit of doing it this way is the ability to add extra nodes with ease and they can be put to use like I’ve described, behind load balancers or with GeoIP targeting all with minimal configuration.

That’s the basic idea in a nutshell. In reality things are a little more complicated. There are on-the-fly optimizations performed on the upstream server by mod_pagespeed, and likely a need to create some kind of tweaked LRU eviction function for Varnish so that its cache doesn’t fill with multiple versions of the same file at various different optimization stages.

I’ll deal with the problems as I come across them but there’s no harm in testing the idea and building the system. The potential performance increase for a small single site is probably negligible but across an entire network of sites it’s likely to amount to a substantial performance improvement all round.