A proposal for creating databases of URL equivalences, to bypass redirects and counteract link rot.

Status: just an idea, waiting for interest/validation.

The problem

Documents are often referred to by multiple URLs due to widespread (ab)use of HTTP redirection, often through link shorteners and tracking URLs. For example, a link might point to a shortener's URL when it actually means to refer to the document's real address.

This practice creates problems, because it is impossible to tell what a link points to without first resolving each URL. As a simple example, a browser cannot indicate whether you have already visited a linked document (through some other path). A more worrisome problem is that the redirection service might have become unreachable or even have ceased to exist, thereby breaking the links completely. Of course, there is also the risk that the linked documents themselves disappear from the web, but redirections add yet another form of link rot.

Possible solution

In web archiving, graph analysis and other systems, it would be pleasant to be able to ignore redirections and treat equivalent URLs as if they were exactly the same. A hypothetical solution would be a universal database of URL equivalences, so that for any given URL you can obtain its ultimate target or get a list of all equivalent URLs. Though it may be unrealistic to obtain and store every pair, we can build useful tools that deliver part of the result, and take steps in the right direction:

  1. A first step would be to decide what is to be stored, and how. Most basic would be to store pairs of URLs. One may want to add the date this was resolved, and possibly more data. The format of such a data record could be agreed upon and standardised.
  2. We can then build software components/libraries (in any language) that let one create and query such data easily. For example, getRealUrl(url: String) would look the URL up in a local cache, query a remote equivalence database (see next item), and/or resolve it for you. For normal (non-redirect) URLs, it would just be the identity function. It could possibly use heuristics (e.g. a hard-coded list of redirection services) to decide whether to bother looking it up.
  3. We could build and host a service that builds a database with many records and exposes them to users. It could provide a simple HTTP API for querying the data; e.g. a GET request with a URL as its query parameter would return the resolved target URL. If the URL was not yet in the database, it could be resolved on the fly and added.
  4. To avoid centralising this service, we could make it replicate among multiple instances. Instances could possibly specialise in serving URLs from particular domains, effectively creating replicas of the original redirection services (but speaking a different protocol).
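To make step 1 concrete, the record format could be as minimal as the following sketch (field names and example URLs are hypothetical, not an agreed standard):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical record for one URL equivalence (step 1): the redirecting
# URL, the URL it ultimately resolves to, and when this was observed.
@dataclass(frozen=True)
class UrlEquivalence:
    source: str       # the redirecting URL, e.g. a shortened link
    target: str       # the URL it ultimately points to
    resolved_at: str  # ISO 8601 timestamp of when the redirect was followed

record = UrlEquivalence(
    source="https://short.example/abc",
    target="https://publisher.example/articles/full-story",
    resolved_at=datetime.now(timezone.utc).isoformat(),
)
```

Such records serialise trivially to JSON or CSV, so the exact storage format could be decided later without changing the data model.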
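The lookup function from step 2 might look like this sketch, with the actual resolution strategy injected as a callable (the redirector list and all names are illustrative assumptions):

```python
from urllib.parse import urlparse

# Hard-coded heuristic: hostnames known to be redirection services.
KNOWN_REDIRECTORS = {"bit.ly", "t.co", "tinyurl.com"}

def get_real_url(url, cache, resolver):
    """Return the ultimate target of `url`.

    For normal (non-redirect) URLs this is the identity function; for
    known redirectors it consults the local cache first, then falls back
    to `resolver` (which in practice would follow HTTP redirects or
    query a remote equivalence database) and caches the answer.
    """
    host = urlparse(url).hostname
    if host not in KNOWN_REDIRECTORS:
        return url  # heuristic: don't bother looking it up
    if url in cache:
        return cache[url]
    target = resolver(url)
    cache[url] = target
    return target
```

Passing the resolver in keeps the component testable and lets callers choose between live HTTP resolution and querying a shared database.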
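For the HTTP API of step 3, a client-side query could be as simple as a GET request with the URL as a query parameter; the endpoint path and service hostname below are made up for illustration:

```python
from urllib.parse import urlencode

def build_query_url(service_base, url):
    """Build the query URL for a hypothetical equivalence service:
    GET <service>/resolve?url=<encoded URL> returns the target URL
    (the service resolving and storing the pair on a cache miss)."""
    return service_base.rstrip("/") + "/resolve?" + urlencode({"url": url})

query = build_query_url("https://equivalences.example", "https://bit.ly/abc")
# The request itself could then be made with any HTTP client.
```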

More ambitious plans could of course be made, to perhaps go beyond URL equivalences and store documents themselves, to archive the whole web on a distributed hash table, or whatever. Fun, but to be discussed elsewhere. This idea is about doing something simple that solves one small problem, so others can build upon that and forget about the issue.

Whatever the path, we should end up not having to rely on singleton services to know how things connect on the web.

All these steps are currently just rough ideas. Only if more people are interested in the issue might it be worth making a plan and tackling it for real (if you want to express interest, for now just shoot me a message).

Related work