Is “Transparent” Web Caching Dead?

Over the last few months, I’ve been re-examining the value of web caches in a network architecture. Peers have asked me to write up the results, and what follows are those results. They are provided to help operators avoid being pushed by vendors into costly mistakes.

“Web Caching” has been an effective and productive tool to scale the Internet. Those of us who advocated, developed, and deployed huge cache infrastructures have seen the OPEX savings. The customer experience benefits (lower latency), bandwidth savings, traffic optimization, and detailed visibility all combine to make web caching a powerful tool to cost effectively scale a network. But over the last few years, technology forces have been diminishing the value of web caching systems. This diminishing value requires all operators to re-examine all past “caching” assumptions. The assumption that “web caching is always good” could be wrong. Several mobile vendors are putting forward new “smart caching” systems using these old assumptions. Test these assumptions and validate the return on investment (ROI). In the end, there may not be any benefits from traditional web caching.

Past Benefits of Demand Based Web Caches

Why did we deploy web caches before? The number one reason was bandwidth savings. A 30% cache hit rate would save enough bandwidth to pay for the disk within one year. But bandwidth savings were not the only benefit. Cache hits meant the content downloaded from the cache would have a lower latency and a faster download. This improved the customer experience. In addition, the TCP connections to the web cache would use TCP options optimized to improve download speeds. This improved the customer experience and maximized the goodput on the links between the cache and the customer. Web caches also provided visibility into what customers were browsing. The business, operations, security, and planning teams would be able to build content maps of the customers’ demands. For service providers, this content visibility allows optimization of peering and of which content to colocate. With enterprise networks, it allows the security teams to understand what their fellow employees are doing and if there are “leaks.” Finally, web caches were one way to censor content. The URL based visibility was more granular than the IP and domain based filters. But, like anything on the Internet, “barriers” to communication are always opportunities to “engineer” a solution. So web caching based filters have an appearance of effectiveness, but are never 100% effective.

These five benefits have been the cornerstone of the case for web caching. The problem is that all five benefits are under pressure from new trends, architectures, and technologies. The principle of saving “enough bandwidth to pay for the disk within one year” is not true anymore. Here is why.

Lots of Cacheable Content – Not True Anymore

The common perception is that much of the Internet is cacheable. This perception is not true.

Caching content has always been controversial. The W3C/IETF standards for HTTP (RFC 7234) give control over what does and does not get cached to the content provider. This is governed by the Cache-Control header. For example, Yahoo, for over a decade, set “cache-control” to “no-cache” in all their HTTP traffic. This means no web cache that complied with the standards would cache Yahoo. Yahoo did this to maintain control over the content – allowing for optimization per service and per property (business team inside Yahoo).
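To make the header's role concrete, here is a minimal Python sketch of how a shared cache might decide whether it can serve a stored response without going back to the origin. The function names and the reduced rule set are assumptions for illustration only; a real cache implements the full RFC 7234 logic (freshness calculation, validation, Vary handling, quoted field lists, and more). In this sketch "no-cache" is treated as "cannot be reused without contacting the origin," which is the effect described above.

```python
def parse_cache_control(header_value):
    """Split a Cache-Control header into a dict of lowercase directives.
    Simplification: does not handle commas inside quoted field lists."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


def shared_cache_may_reuse(header_value):
    """Return True if a shared cache could serve this response
    without contacting the origin (simplified, hypothetical rule set)."""
    d = parse_cache_control(header_value)
    # no-store: must not be stored at all.
    # private:  only the end user's own cache may store it.
    # no-cache: must not be reused without revalidating with the origin.
    if "no-store" in d or "private" in d or "no-cache" in d:
        return False
    # This sketch requires an explicit, positive freshness lifetime.
    # s-maxage applies to shared caches and takes priority over max-age.
    lifetime = d.get("s-maxage", d.get("max-age"))
    return lifetime is not None and lifetime.isdigit() and int(lifetime) > 0
```

A response carrying "no-cache", as in the Yahoo example, is rejected by this check, while something like "public, max-age=3600" is accepted.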

In addition, truly “cacheable” content is content that has some chance to have a “hit.” In other words, it would be content that many people would wish to view and download. Caching for “hit rates” was very effective until ~2002. At that time, “Web 2.0” emerged. The core element of “Web 2.0” is personalization. Unique pages and content tuned to the user were the core value proposition. This started a long tail problem with caching, where a huge amount of disk space was needed to cover a huge range of content just to achieve any kind of hit rate (see Figure 1).

Figure 1 – Example of taking the top 100 sites and using them to determine the ROI for caching.
(graphic courtesy of http://www2.alcatel-lucent.com/techzine/a-new-approach-to-publishing-and-caching-video/)
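The long tail effect can be illustrated with a small Python sketch. It assumes that object popularity follows a Zipf distribution, which is a common modeling assumption and not something measured for this article; the catalog size, cache sizes, and exponents are hypothetical values chosen only to show the shape of the curve.

```python
def zipf_hit_rate(cached_objects, catalog_size, exponent=1.0):
    """Fraction of requests served if the cache holds the
    `cached_objects` most popular items out of `catalog_size`,
    with request popularity proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, catalog_size + 1)]
    return sum(weights[:cached_objects]) / sum(weights)


if __name__ == "__main__":
    catalog = 100_000  # hypothetical number of distinct objects
    for exponent in (1.0, 0.6):  # smaller exponent = flatter, longer tail
        for cached in (100, 1_000, 10_000):
            rate = zipf_hit_rate(cached, catalog, exponent)
            print(f"exponent={exponent} cached={cached:>6} hit rate={rate:.1%}")
```

As the popularity curve flattens (a smaller exponent, which is one way to model personalized content), the same number of cached objects serves a smaller share of requests, so more disk is needed to hold the hit rate steady.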

This long tail issue was exacerbated by the effective growth of Content Delivery Networks (CDNs). Companies like Akamai, Mirror Image, Limelight and many others help organizations scale their web properties by moving the content closer to the customer. In many cases these CDNs partner with the service provider, putting their CDN node inside the service provider’s network. The content on these CDN nodes would be “pre-cached” or “pushed” to be ready for the first download. The CDN provider would have their special redirection tools to ensure the client would connect to the topologically closest CDN node. This provides a better customer experience. It saves a lot of bandwidth (some Akamai sites get a 5:1 effectiveness ratio). But these CDNs target the top 1000 “cacheable” sites on the Internet. The benefits the CDN brings to the service provider make the benefit of a normal demand based web cache obsolete.

Google jumped in with Google Global Cache (GGC). Think of GGC as a reverse proxy that is co-located inside of a service provider’s network. GGC nodes speed access to Google properties, improve the customer experience, save Google bandwidth, and save the service provider a lot of bandwidth cost. The GGCs are very popular with the service provider community. For the majority of these operators, Google traffic is among those top 1000 sites. The more Google caches with GGC, the less effective the traditional demand based web caches become.

The impact of the CDNs and GGC combined with Web 2.0 has diminished the return on investment (ROI) of a web cache. Service providers with existing web caches should do a detailed investigation before expanding. Service providers who are looking to buy new web caches need to build a realistic ROI model to ensure that the perceived savings really exist.
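As a starting point for such an ROI model, the Python sketch below estimates the payback period of a cache from measured inputs. Every function name, parameter name, and figure here is a hypothetical placeholder and not data from any real network; substitute numbers taken directly from your own traffic and contracts.

```python
def monthly_savings(peak_mbps, cacheable_share, hit_rate, cost_per_mbps):
    """Upstream bandwidth cost avoided per month.

    peak_mbps:       billed upstream HTTP traffic, in Mbps
    cacheable_share: fraction of that traffic the cache is allowed to store
                     (excludes encrypted, no-cache, and CDN-served traffic)
    hit_rate:        byte hit rate achieved on the cacheable share
    cost_per_mbps:   monthly transit cost per Mbps
    """
    return peak_mbps * cacheable_share * hit_rate * cost_per_mbps


def payback_months(capex, monthly_opex, savings_per_month):
    """Months until cumulative net savings cover the purchase cost.
    Returns None when the cache never pays for itself."""
    net = savings_per_month - monthly_opex
    if net <= 0:
        return None
    return capex / net


if __name__ == "__main__":
    # Hypothetical inputs: 10 Gbps of HTTP, 25% still cacheable, 30% hit rate.
    savings = monthly_savings(10_000, 0.25, 0.30, cost_per_mbps=2.0)
    print("savings per month:", savings)
    print("payback in months:", payback_months(100_000, 500, savings))
```

The key design choice is keeping the cacheable share separate from the hit rate: a 30% hit rate only applies to the traffic the cache is still permitted to store, and that share is what CDNs, GGC, and personalization erode.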

Secure Everything – The End of Web Caching as We Know It

Content that is encrypted cannot be cached! Hence, any HTTPS content is NON-CACHEABLE. That means all the content on Facebook, Google+, Twitter, and other social media sites that secure their content is not cacheable. It means YouTube is not cacheable. The function of HTTPS is to secure the end-to-end communications between the client and the server. Web caches get in between the conversation – potentially breaking the session security.

Granted, some web caches will still manage an HTTPS session between client and server. But these sessions are glued together to maintain the end-to-end HTTPS integrity. None of the content is ever cached. The cache only diminishes the customer experience by adding unnecessary latency.

The movement to secure more and more of the web space has accelerated over the last three years. The increased cyber-crime threat accelerated the movement up to 2013. Then the Edward Snowden exposures prompted transformational thinking – where security is not optional anymore. Cisco measured that 10% of the 1 billion websites were encrypted. Then we had a wave of security vulnerabilities. This pushed the industry to even more encryption. The “The Matter of Heartbleed” study looked specifically at HTTPS, showing a larger increase in encryption across the Internet. In Jan 2014, Julien Vehent did a study of Alexa’s top 1 million web sites. He found “451,470 websites have been found to have TLS enabled. Out of 1,000,000, that’s a 45% ratio.” (see https://jve.linuxwall.info/blog/index.php?post/TLS_Survey). Finally, the W3C and IETF went all in with HTTP 2.0 being encrypted by default. While default encryption is controversial, forcing “default encryption” will further drive deployment. All of this impacts the effectiveness of demand based caching.

Figure 2 – Growth of Web Encryption (Source ATIS Open Web Alliance)

What does this mean for the old demand based web caching? With HTTP 2.0, nothing will be “cacheable” in the ways it was in the past. But “caching” was just one of the benefits. What about protocol optimization?

Traffic Optimization Benefits from Web Caching – Not any more

There was a time when service providers on one side of the ocean would put web caches on the other side and link the two nodes using the HTTP 1.1 pipelining function. This allowed for highly effective, large TCP window sessions between the two caches. It maximized the “goodput” of the link and validated that there were better, more effective alternatives to the default TCP options. These “trans-oceanic web caches” lasted for a few years until the economics would not justify them. But that has not stopped innovation. Two new protocols have been put forward that either build on the lessons of HTTP 1.1 pipelining or think “out of the box” to optimize the client-server connection. Both disrupt traditional web caching.
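Why the TCP window matters so much on a trans-oceanic link can be sketched with the standard bandwidth-delay relationship: a single TCP connection can have at most one window of data in flight per round trip. The Python below is a back-of-the-envelope illustration; the function names, the 150 ms round-trip time, and the link speed are assumed example values, not measurements from any deployment described here.

```python
def max_tcp_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound on one TCP connection's throughput in Mbps:
    at most one window of data can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


def window_needed_bytes(link_mbps, rtt_seconds):
    """Bandwidth-delay product: the window size, in bytes,
    needed for one connection to keep the link full."""
    return link_mbps * 1_000_000 * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.150  # assumed trans-oceanic round-trip time, in seconds
    # 65,535 bytes is the largest window without TCP window scaling.
    print("unscaled window limit (Mbps):", max_tcp_throughput_mbps(65_535, rtt))
    print("window to fill 100 Mbps (bytes):", window_needed_bytes(100, rtt))
```

Under these assumptions an unscaled window limits each connection to a few Mbps regardless of link capacity, which is why a pair of caches holding long-lived, tuned connections between them could extract more goodput from the same link.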

SPDY (pronounced speedy) is a protocol developed to optimize the connection between the web client and server. Its adoption has been growing, with all the major web browsers and an increasing number of web sites supporting it. It has been put forward as a candidate for the HTTP 2.0 protocol. It also uses encryption by default. Use this link to get started: http://en.wikipedia.org/wiki/SPDY.

Figure 3 – SPDY Illustration (Source: Google Developers Session – http://www.youtube.com/watch?v=hQZ-0mXFmk8)

QUIC (Quick UDP Internet Connections, pronounced quick) is an experimental protocol built into Android to improve performance on mobile systems. It has very interesting characteristics, with an ecosystem that application developers can capitalize on. If you are a mobile operator, get to know QUIC. If you are an application developer, code a prototype using QUIC and see what happens. Everything in QUIC is encrypted, so none of it is cacheable. Use this link to get started: http://en.wikipedia.org/wiki/QUIC

No Head of Line Blocking with QUIC (Source: Google Developers Session)

Application Oriented Universe and the Impact to Web Caching

Smartphones, tablets, and wearables are all user interface devices where the application rules. Browsers play a supporting role to the applications. Traditional web caching is a browser technology. This means the majority of the content from a smartphone is driven by the application, not the browser. Little to none of this “application content” is cacheable.

Recommendation for the Service Provider – Content Caching’s Future

The message should be clear by now:

Traditional demand based content caching needs a very detailed cost-benefit analysis using data directly from the network. HTTP 2.0 (everything encrypted) will have a huge impact on cache effectiveness.
CDNs and Hybrid CDNs (e.g. GGC) should be encouraged. They are cost effective partnerships where everyone wins. They also work around the HTTP 2.0 problem – with encrypted connections directly to the local CDN and encrypted paths to their upstream content.

Barry Greene is an Internet Technologist, System Architect, long time CyberSecurity Expert, and mentor of new talent. Connect to Barry via LinkedIn (www.linkedin.com/in/barryrgreene/), follow on Twitter (@BarryRGreene), catch his blogs on Senki (www.senki.org).