ETags vs resource versioning when using a CDN

I gave a presentation at Webbisauna II on Effective use of a CDN. All presentations were short, only 25 minutes each, so I decided to run through the basics first and then move on to some lessons that I have learnt from experience. After the presentation people asked some really good questions, which impressed me. One of the questions was “Could a server-generated ETag header be used to avoid explicit resource versioning?” and I wasn’t able to give a proper answer at the time. I gave it some thought and I now have an answer for you.

What is explicit resource versioning?

In my presentation I recommended that all static resources deployed with an application should be versioned or hashed if they are to be distributed through a CDN. Instead of using just /js/foo.js one should use /js/foo-1.0.0.js or /js/foo-d3b07384d113edec49eaa6238ad5ff00.js, where the hash could be the build or release id or a hash calculated from the file contents. When a resource gets updated, the URI acting as the cache key changes and the old version sitting in various caches is no longer used or referred to. The application and/or deploy mechanics must update all links and references to use the new version.

There are resources that cannot be versioned this way (at least favicon.ico and other hard-coded icon names), and their time-to-live has to be set to a lower value than “practically indefinite” to allow reasonable content updates. It may be possible to flush the CDN cache, but nothing will instantly remove the old resource globally from all browser caches, ill-behaving proxies etc. In some cases resource versioning may also be done with a URL parameter, but because the number of URL parameters may vary and their order is free, the query string may not be as effective a cache key as the URI path itself.
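To make the hashed naming concrete, here is a minimal sketch in Python. The function name versioned_path and the choice of a SHA-256 digest truncated to 32 hex characters are my own assumptions for illustration; any build id or content hash works the same way.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_path(path: str, content: bytes) -> str:
    """Insert a content digest before the file extension of a URL path.

    Hypothetical helper: the digest algorithm and its length are assumptions.
    """
    p = PurePosixPath(path)
    digest = hashlib.sha256(content).hexdigest()[:32]
    return str(p.with_name(f"{p.stem}-{digest}{p.suffix}"))


# Two releases of the same file end up under two different cache keys.
old = versioned_path("/js/foo.js", b"console.log('v1');\n")
new = versioned_path("/js/foo.js", b"console.log('v2');\n")
print(old)
print(new)
```

Because the name depends only on the content, an unchanged file keeps its URL from one deploy to the next and stays cached, while a changed file gets a URL no cache has seen before.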

What is an ETag?

So could the server-generated ETag header be used to avoid this versioning? What exactly is an ETag? It is an HTTP/1.1 header defined in RFC 7232, “Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests”. In short, an HTTP server may send an ETag header with the response, and the header value is usually a hash calculated from the file content and/or file metadata. Once the client knows the ETag value of an object, it may send that value back to the server in subsequent requests, in an If-None-Match header for cache validation (the related If-Match header is mainly used to guard updates). The idea is to tell the HTTP server “we already have this resource with this ETag, if the hash hasn’t changed there’s no need to send the same object to us again”. The server may then respond with a short 304 Not Modified response to reduce data transfer.
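The exchange can be sketched from the server's side as below. This is a simplified illustration, not a complete RFC 7232 implementation: make_etag and respond are hypothetical names, the ETag is assumed to be a truncated content hash, and weak validators (the W/ prefix) are not handled.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Quoted content hash used as the ETag (one common choice, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, request_headers: dict) -> tuple:
    """Return (status, headers, payload) for a GET of a resource with this body."""
    etag = make_etag(body)
    # If-None-Match may carry several comma-separated tags, or "*".
    sent = [t.strip() for t in request_headers.get("If-None-Match", "").split(",")]
    if "*" in sent or etag in sent:
        return 304, {"ETag": etag}, b""   # client copy is current, send no body
    return 200, {"ETag": etag}, body


# First request: no validator yet, the full body is sent.
status1, headers1, payload1 = respond(b"alert(1);", {})
# Second request: the client echoes the ETag and gets a short 304.
status2, _, payload2 = respond(b"alert(1);", {"If-None-Match": headers1["ETag"]})
# The file changed on the server: the old ETag no longer matches.
status3, _, payload3 = respond(b"alert(2);", {"If-None-Match": headers1["ETag"]})
print(status1, status2, status3)
```

Note that the client only learns about the change in the third request because it asked. The ETag saves bytes on the wire, but it does nothing until a request is actually made.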

How does the ETag header function with a CDN?

If a CDN Edge is caching the passing traffic from Origin to Client, the Client may use these conditional requests to reduce traffic between the Client and the Edge. The Edge can also use conditional requests to reduce traffic between the Origin and the Edge. But if a static object gets updated in the Origin, the Edge would probably do a conditional “verification request” only after the object is due to be expired from the cache. There’s no reason for the Edge to do it any earlier or in sync with updates to the Origin.
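A toy simulation may help here. The Origin and Edge classes below are purely hypothetical stand-ins, and real CDNs have far more elaborate freshness rules; the sketch only assumes a fixed time-to-live and revalidation with an ETag after expiry.

```python
import hashlib


class Origin:
    """Stand-in for the origin server: a dict of path -> body."""

    def __init__(self):
        self.files = {}

    def get(self, path, if_none_match=None):
        body = self.files[path]
        etag = hashlib.sha256(body).hexdigest()
        if etag == if_none_match:
            return 304, etag, b""
        return 200, etag, body


class Edge:
    """Stand-in for a CDN edge with a fixed TTL (an assumption of this sketch)."""

    def __init__(self, origin, ttl):
        self.origin = origin
        self.ttl = ttl
        self.cache = {}             # path -> (body, etag, expires_at)
        self.bytes_from_origin = 0

    def get(self, path, now):
        entry = self.cache.get(path)
        if entry and now < entry[2]:
            return entry[0]         # still fresh: the Origin is not contacted
        known_etag = entry[1] if entry else None
        status, etag, body = self.origin.get(path, known_etag)
        if status == 304:
            body = entry[0]         # reuse the cached body, nothing transferred
        else:
            self.bytes_from_origin += len(body)
        self.cache[path] = (body, etag, now + self.ttl)
        return body


origin = Origin()
edge = Edge(origin, ttl=3600)
origin.files["/js/foo.js"] = b"old"
first = edge.get("/js/foo.js", now=0)         # cache miss, fetched from Origin
origin.files["/js/foo.js"] = b"new!"          # deploy happens at the Origin
stale = edge.get("/js/foo.js", now=1800)      # Edge still serves the old body
updated = edge.get("/js/foo.js", now=3600)    # expired: revalidation sees the change
unchanged = edge.get("/js/foo.js", now=7200)  # expired again: 304, no body sent
print(first, stale, updated, unchanged, edge.bytes_from_origin)
```

The request at now=1800 is the interesting one: the Origin already has the new file, but the Edge has no reason to ask, so the ETag never comes into play.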

Conclusion

To me it seems that:

  • Without explicit versioning or hashing there’s no way for the CDN Edge to be aware of content changes in the Origin before cache expiration, so the old version doesn’t get flushed when we want it to. ETags don’t really help with this at all; they just optimize the data transfer when possible.

  • Without explicit versioning there’s no way for the Client to refer to a specific version of the object, so there’s no control over which version gets downloaded. The updated html page cannot indicate that “I want a newer version of /js/foo.js”. I just don’t see how that could happen without either explicit versioning or the cache expiration.
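Both points come down to the URL being the cache key. A deliberately tiny sketch, in which a dict named cache stands in for any browser or Edge cache whose entry has not yet expired (all names and paths here are made up for illustration):

```python
cache = {}  # url -> body; stands in for a cache whose entries are still fresh


def fetch(url, origin_files):
    """Serve from the cache when possible, otherwise fetch and remember."""
    if url not in cache:
        cache[url] = origin_files[url]
    return cache[url]


# Release 1: the page refers to both an unversioned and a versioned URL.
origin_files = {"/js/foo.js": b"v1", "/js/foo-1.0.0.js": b"v1"}
fetch("/js/foo.js", origin_files)
fetch("/js/foo-1.0.0.js", origin_files)

# Release 2: same unversioned name, new versioned name.
origin_files = {"/js/foo.js": b"v2", "/js/foo-1.0.1.js": b"v2"}
unversioned = fetch("/js/foo.js", origin_files)      # cache hit on the old body
versioned = fetch("/js/foo-1.0.1.js", origin_files)  # new key, so new body
print(unversioned, versioned)
```

The updated page can name /js/foo-1.0.1.js and get the new content immediately, whereas a request for /js/foo.js looks identical before and after the release.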

All in all it was an excellent question! I hadn’t really thought about ETags in combination with CDNs before.

Random tip

In Apache HTTP Server 2.2 the default FileETag setup builds the ETag from file metadata (size and modification time) AND the filesystem inode number. This means that if you have the same file on two separate filesystems, with the same size and the same timestamp, they will almost certainly have different inode numbers and therefore the ETags from different filesystems will be different. Happily this default has been changed in 2.4 so that the inode is no longer used.
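The effect is easy to reproduce. The sketch below is only loosely modelled on the idea and does not reproduce Apache's exact ETag format; the two function names are hypothetical, and two files in one temporary directory stand in for copies of the same file on two servers.

```python
import os
import tempfile


def etag_with_inode(path):
    """Metadata ETag that includes the inode (the 2.2-style idea)."""
    st = os.stat(path)
    return f'"{st.st_ino:x}-{st.st_size:x}-{int(st.st_mtime):x}"'


def etag_without_inode(path):
    """Metadata ETag from size and modification time only (the 2.4-style idea)."""
    st = os.stat(path)
    return f'"{st.st_size:x}-{int(st.st_mtime):x}"'


with tempfile.TemporaryDirectory() as d:
    copy_a = os.path.join(d, "server-a-foo.js")
    copy_b = os.path.join(d, "server-b-foo.js")
    for p in (copy_a, copy_b):
        with open(p, "wb") as f:
            f.write(b"same content")
        os.utime(p, (1_000_000, 1_000_000))  # force identical timestamps

    inode_tags_differ = etag_with_inode(copy_a) != etag_with_inode(copy_b)
    plain_tags_match = etag_without_inode(copy_a) == etag_without_inode(copy_b)

print(inode_tags_differ, plain_tags_match)
```

With the inode in the mix, a client that was served by one machine and revalidates against another would get a full 200 response for a file that has not changed.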