A proposal for creating databases of URL equivalences, to bypass redirects and counteract link rot.
Status: just an idea, waiting for interest/validation.
Documents are often referred to by multiple URLs, due to widespread (ab)use of HTTP redirection, often through link shorteners and tracking URLs. For example, a link might point to https://t.co/1PT68A6LEt when it actually means to refer to the document behind that redirect.
This practice creates problems, because it is impossible to tell what a link points to without first resolving each URL. As a simple example, a browser cannot indicate whether you have already visited a linked document (through some other path). A more worrisome problem is that the redirection service might have become unreachable, or might even have ceased to exist, thereby breaking the links completely. Of course, there is also the risk that the linked documents themselves disappear from the web, but redirections add yet another form of link rot.
In web archiving, graph analysis, and other systems, it would be pleasant to be able to ignore redirections and treat equivalent URLs as if they were exactly the same. A hypothetical solution would be a universal database of URL equivalences, so that for any given URL you can obtain its ultimate target, or get a list of all equivalent URLs. Though it may be unrealistic to collect and store every pair, we can build useful tools that deliver part of the result, and make steps in the right direction:
- A first step would be to decide what is to be stored, and how. The most basic approach would be to store pairs of URLs. One may want to add the date the pair was resolved, and possibly more data. The format of such a data record could be agreed upon and standardised.
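To make this concrete, a record could be as small as a JSON object holding the source URL, its target, and the resolution date. A minimal sketch (the field names are just an assumption, not a proposed standard, and the target URL is made up):

```python
import json

# A hypothetical equivalence record; field names and the target URL are
# illustrative only, not a proposed standard.
record = {
    "source": "https://t.co/1PT68A6LEt",
    "target": "https://example.org/some-document",  # made-up target
    "resolved_at": "2024-01-15",                    # date of resolution
}

# One JSON object per line would make an easily appendable, mergeable log.
line = json.dumps(record)
```

Storing one record per line keeps the format trivial to append to and to merge between databases.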
- We can then build software components/libraries (in any language) that let one create and query such data easily. For example, `getRealUrl(url: String)` would look the URL up in a local cache, query a remote equivalence database (see next item), and/or resolve it for you. For normal (non-redirect) URLs, it would just be the identity function. It could possibly use heuristics (e.g. a hard-coded list of redirection services) to decide whether to bother looking a URL up.
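That lookup logic could be sketched as follows (a rough Python sketch; the function name, the heuristic domain list, and the `remote_lookup` hook are all assumptions, and actual network resolution is left out):

```python
from urllib.parse import urlparse

# Hard-coded heuristic: domains known to be redirection services (assumed list).
KNOWN_SHORTENERS = {"t.co", "bit.ly", "goo.gl", "tinyurl.com"}

def get_real_url(url, cache, remote_lookup=None):
    """Return the ultimate target of `url`, or `url` itself if it does not
    look like a redirect or cannot be resolved from the available sources."""
    host = urlparse(url).netloc
    if host not in KNOWN_SHORTENERS:
        return url  # identity function for normal URLs
    if url in cache:
        return cache[url]  # local cache hit
    if remote_lookup is not None:
        target = remote_lookup(url)  # e.g. query a remote equivalence database
        if target is not None:
            cache[url] = target
            return target
    return url  # unresolved; a real implementation would fetch the URL itself
```

The cache here is just a dict; a real library would persist it and fall back to issuing an HTTP request and following the redirect chain.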
- We could build and host a service that collects many such records and exposes them to users. It could provide a simple HTTP API for querying the data; e.g. `https://directer.org/get/https://t.co/1PT68A6LEt` would return the resolved target URL. If the URL is not yet in the database, it could be resolved on the fly and added.
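Server-side, handling such a request could be as simple as stripping the `/get/` prefix and looking up the remainder. A sketch, with a plain dict standing in for the database (the path scheme follows the example above; everything else is an assumption):

```python
def handle_get(path, database):
    """Resolve a request path like '/get/https://t.co/xyz' against a
    dict-backed database of source -> target URLs. Returns the target URL,
    or None if unknown (a real service would then resolve the URL itself
    and store the result before responding)."""
    prefix = "/get/"
    if not path.startswith(prefix):
        return None  # not a query for this endpoint
    source = path[len(prefix):]
    return database.get(source)
```

Putting the queried URL directly in the path keeps the API trivially usable from a browser or curl, at the cost of some care with URL escaping.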
- To avoid centralising this service, we could make it replicate among multiple instances. Instances could specialise in serving URLs from particular domains, effectively creating replicas of the original redirection services (but speaking a different protocol).
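Routing a query to a specialised instance could then be a simple mapping from the URL's domain to an instance's base address. A sketch (the registry contents and instance URLs are entirely made up):

```python
from urllib.parse import urlparse

# Hypothetical registry: which instance serves which redirector's domain.
INSTANCES = {
    "t.co": "https://tco.directer.example",
    "bit.ly": "https://bitly.directer.example",
}

def pick_instance(url, registry=INSTANCES, default="https://directer.example"):
    """Return the base URL of the instance responsible for `url`'s domain,
    falling back to a general-purpose instance for unknown domains."""
    return registry.get(urlparse(url).netloc, default)
```

Such a registry could itself be replicated or discovered, so no single instance is a point of failure.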
More ambitious plans could of course be made: going beyond URL equivalences to store the documents themselves, archiving the whole web on a distributed hash table, or whatever. Fun, but to be discussed elsewhere. This idea is about doing something simple that solves one small problem, so that others can build upon it and forget about the issue.
Whatever the path, we should end up not having to rely on singleton services to know how things connect on the web.
All these steps are currently just rough ideas. Only if more people are interested in the issue will it be worth making a plan and tackling it for real (to express interest, for now just shoot me a message).
- URLTeam: Crawls and archives URL pairs from dozens of shortener services, solving a large part of the problem already!
- 301Works.org: A similar (past?) effort by the Internet Archive, obtaining data directly from redirection services, and keeping the data private while those services still operate (according to its FAQ).
- N2T (Name-to-Thing Identifier Resolver): A service that resolves DOIs, URNs, et cetera to "the best known web addresses".
- Hash Archive: Service that lets you query (in either direction) the hashes of files and the URLs they are served at.