Carbon Brief: Archive audit
A cornerstone of the web ecosystem, links make it possible to connect documents freely across different sites. Links provide context, evidence, definitions, depth and plenty more besides. As time passes, though, links tend to rot: hosting payments lapse, sites are carelessly redesigned or restructured (contra Tim Berners-Lee's famous Cool URIs don't change (1998) memo), and so on. For a news website that aims to be authoritative and trustworthy this is a problem: if you're appealing to an external source as evidence, you want that source to still be there when a reader encounters it weeks or years down the line.
The Carbon Brief site has grown, and its linking strategy has evolved over time in a more or less ad hoc fashion. As we approach the 10th anniversary of the last major redesign, I wanted to get a handle on the archive and form some idea of how we've used links and how that usage has evolved. To that end I put together an archiving process which takes a snapshot of the site on a regular basis and saves it as a series of plain markdown files1. These can then form the material for further analysis, e.g. the creation of a database of links. Last weekend I ran a Node.js script over that link database to test the health of the links in our archive. After 24 hours of HTTP requests, here are the headline results of that process:
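The post doesn't show the database-building step, but the first pass over the markdown snapshots presumably looks something like this: pull every `[text](url)` pair out of each file. The function name and the returned shape here are illustrative assumptions, not the actual pipeline.

```javascript
// Sketch only: extract markdown-style links from a snapshot file's text.
// Assumes links use the standard [text](url) form with an http(s) URL.
function extractLinks(markdown) {
  const linkPattern = /\[([^\]]*)\]\((https?:\/\/[^\s)]+)\)/g;
  const links = [];
  for (const match of markdown.matchAll(linkPattern)) {
    // Keep both the link text and the target URL; the text matters
    // later for any analysis of how link text is written.
    links.push({ text: match[1], url: match[2] });
  }
  return links;
}
```

Run over every file in the archive, the output of a function like this is enough to populate a simple links table (article, link text, URL) ready for the health check.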
About one in five of our links are dead

HTTP response status codes for Carbon Brief's 135,543 links as of 26 May 2025.
To be honest, this is better than I was expecting. A huge chunk (somewhere between 1 in 4 and 1 in 5) of the non-functional red links comes from social media sites. X, for example, has recently changed domain name, tends to provide obfuscated links via link shorteners (i.e. t.co), and many of its users have recently deleted their accounts, all of which is bad for link stability. In addition, when a request is not coming from a web browser, X's servers appear to pick response status codes more or less at random. LinkedIn, on the other hand, prefers to use a totally made-up status code (999). At the other end of the scale, the single biggest link target that's not a social media site is nature.com, almost all of the links to which still return a 200 status code2.
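Whatever a server returns, the audit still has to sort it into a bucket for the chart. The exact grouping Carbon Brief used isn't specified, so the categories below (2xx alive, 3xx redirect, everything else dead, including non-standard inventions like LinkedIn's 999) are an assumed, minimal version:

```javascript
// Sketch: bucket an HTTP status code for reporting purposes.
// The category names and boundaries here are assumptions.
function classifyStatus(code) {
  if (code >= 200 && code < 300) return 'ok';       // page responded normally
  if (code >= 300 && code < 400) return 'redirect'; // moved, possibly recoverable
  return 'dead'; // 4xx, 5xx, and made-up codes like LinkedIn's 999
}
```

Treating redirects as their own bucket rather than following them is a reporting choice: a 301 often means the content survived a restructure and the link could be rewritten rather than removed.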
Next steps
Links occupy a strange middle ground between editorial and user experience (insert "they're the same picture" meme here). They're embedded in the content of a website but are clearly UI elements too. I'd like to get an idea of what we use links for and how that serves our readers. There's some information we can draw from the link database and the archive, but it seems likely these questions will need a more qualitative approach. The end goal is to have some solid guidelines around the use of links and the writing of link text, as well as to be more intentional with respect to digital preservation and archival practices.
Notes
1. Why markdown? That's a whole post by itself! Briefly though, WordPress tends to mash up content and presentation by storing HTML in its database. This isn't ideal for a couple of reasons, but most pragmatically it can make it hard to analyse the content.
2. Probably worth saying that a 200 status code does not necessarily mean the page still contains the same content as when the link was originally created.