Caching Mastodon Preview Card Responses

Posted on 02.11.2024

I ran into two instances recently where people remarked that the Fediverse can be a bit of a Distributed Denial of Service attack: when a post links to a URL, some Fediverse software helpfully tries to collect some metadata from the page to show a preview card, like any modern social media software is supposed to do.

The problem is that in the Fediverse, a post gets replicated to every server that is supposed to receive it, through subscriptions or reposts, and every single one of these servers will then fetch the same page for the same metadata, usually within a very short period of time.

jwz called it the “Mastodon stampede”, which I think is a wonderful name for the phenomenon. Now, my all-static-files blog doesn’t care too much. It’s a bit of traffic (or maybe a bit more, since the blog is connected to the world wide web through a rather slim connection), but otherwise no biggie.

However, when the link points to a site that makes a database lookup (or a few hundred) for each request, that site can easily be overwhelmed, even though Mastodon servers (and other software fetching that data) aren’t all that interested in any of the variable stuff anyway. And on the other end of the consideration, the stampede only lasts for a short while, and then everything is quiet again.

So why not cache things, and cache them early enough that even complex database stress-test software like WordPress is no problem, because it never learns about the traffic in the first place? (If you care about SEO and website analytics, this might be a problem because you never see that precious traffic, but that’s not a me-problem, so…)

And that’s exactly what I did: I run a bunch of services, each of which could be stomped into the ground by a well-distributed Mastodon link, nearly all of them fronted by an nginx webserver, using either the proxy or the fastcgi module to forward requests to the actual software.
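For context, a typical service here looks roughly like the following sketch; the hostnames, the upstream address and the socket path are placeholders for illustration, not my actual configuration:

# HTTP backend behind the proxy module
server {
  server_name app.example.org;  # placeholder
  location / {
    proxy_pass http://127.0.0.1:8080;  # placeholder upstream
  }
}

# PHP application (e.g. WordPress) behind the fastcgi module
server {
  server_name blog.example.org;  # placeholder
  location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php-fpm.sock;  # placeholder socket
  }
}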

With the change below, only the first request by a Mastodon server gets actually calculated data. Every other server fetching the same Open Graph data is served by my photocopier.

And as a bonus, since I was messing around in that general area anyway, I also added a 403 response for a bunch of AI bots. Good riddance.

http {
  # …
  map "$http_user_agent" $skipcache {
    "~*(anthropic-ai|Claude-Web|ClaudeBot)"       403; # Anthrophic-AI
    "~*Applebot-Extended"                         403; # Apple AI
    "~*Bytespider"                                403; # ByteDance / TikTok
    "~*cohere-ai"                                 403; # Cohere AI
    "~*Diffbot"                                   403; # Diffbot
    "~*(GoogleOther|Google-Extended)"             403; # Google: unspecified research, Gemini
    "~*FacebookBot"                               403; # Meta
    "~*(OAI-SearchBot|ChatGPT-User|GPTBot)"       403; # OpenAI
    "~*PerplexityBot"                             403; # Perplexity
    "~*ImagesiftBot"                              403; # The Hive
  
    "~*Mastodon"                                  0; # Mastodon Preview Card
  
    default                                       1;
  }
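
  # How $skipcache is used below: 0 (Mastodon preview fetches) means the
  # cache is consulted and filled as usual; any other non-empty, non-zero
  # value (the default 1, or the 403 for the blocked bots) makes the
  # *_cache_bypass and *_no_cache directives skip the cache entirely, and
  # 403 is additionally turned into a hard rejection in the server block.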
  
  proxy_cache_path /var/cache/nginx/proxy keys_zone=fedicache:10m;
  proxy_cache fedicache;
  proxy_cache_valid any 10m;
  proxy_cache_bypass $skipcache;
  proxy_no_cache $skipcache;
  
  fastcgi_cache_path /var/cache/nginx/fcgi keys_zone=fedicache_fcgi:10m;
  fastcgi_cache fedicache_fcgi;
  fastcgi_cache_valid any 10m;
  fastcgi_cache_key $scheme$host$request_uri; # $host, since $proxy_host is only set by the proxy module
  fastcgi_cache_bypass $skipcache;
  fastcgi_no_cache $skipcache;
  
  server {
    if ($skipcache = 403) {
      return 403;
    }
    # …
  }
}
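
To check that this actually works, I find it helpful to expose nginx’s $upstream_cache_status variable (MISS, HIT, BYPASS, …) as a response header while testing; a minimal sketch, with the header name being my own choice:

server {
  # … (the same server block as above)
  # Show whether the response was served from the cache (debugging only).
  add_header X-Cache-Status $upstream_cache_status always;
}

A request with a Mastodon-like User-Agent should then report MISS on the first fetch and HIT for the following ten minutes, while a regular browser request reports BYPASS.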