THE environmental impact of AI is a key theme of this year’s Green Libraries Conference. AI bots not only consume energy in training models but also have major ramifications for the systems that have to fend them off. Jonathan Field, Founder & Managing Director of Open Fifth, discusses how library management systems are keeping them out... for now.
This article follows on from a blog post Steven Raith and I wrote back in June this year. At that time, Open Fifth, like many other organisations across the globe, was being attacked by AI (artificial intelligence) bots.
This isn’t just the usual aggressive search engine crawling. It is, in effect, the equivalent of a distributed denial of service (DDoS) attack. To give you an indication of the problem, one site was receiving over a million web requests a week, of which only a few thousand were genuine. This kind of load is far above what most library systems are specified to accommodate.
This would all be OK if library platforms were just a few static web pages. However, libraries have been hit especially hard by these bots because of the nature of the data they hold in their catalogues and discovery layers. These are full of metadata, all subtly connected together by author links, subject links, series links and so on. So, essentially, the bots can go on following one link after another, pretty much indefinitely! In this update, I’ll share our progress in fighting the AI bots.
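To see why this link structure is such a magnet for crawlers, consider a tiny, entirely hypothetical catalogue where records, author pages, subject pages and series pages all link to one another. A naive crawler that simply follows every link ends up requesting the whole connected graph; scale the six pages below up to millions of records and the traffic figures above become easy to believe.

```python
from collections import deque

# Hypothetical, tiny catalogue: records link to author, subject and
# series pages, which in turn link back to more records. Real catalogues
# contain millions of such interconnected pages.
catalogue = {
    "/record/1": ["/author/smith", "/subject/history"],
    "/author/smith": ["/record/1", "/record/2"],
    "/subject/history": ["/record/2", "/record/3"],
    "/record/2": ["/author/smith", "/series/annals"],
    "/series/annals": ["/record/2", "/record/3"],
    "/record/3": ["/subject/history", "/series/annals"],
}

def crawl(start):
    """Naive breadth-first crawl: follow every link found on every page."""
    seen, queue = {start}, deque([start])
    requests = 0
    while queue:
        page = queue.popleft()
        requests += 1  # one web request per page visited
        for link in catalogue.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return requests

# Starting from a single record, the crawler requests every page
# reachable through the metadata links.
print(crawl("/record/1"))  # 6
```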
The Aftermath
We are now well into Autumn and these problems are largely behind us. At the time, a large proportion of our sites were behind our Cloudflare Content Delivery Network (CDN). The rules inside Cloudflare allow you to be very specific about what you allow and deny.
More surprisingly, however, we discovered that the sites that use their own institutional (‘vanity’) domains and DNS (Domain Name System) records to point to our applications didn’t appear to be doing much to protect those domains. Perhaps this reflected the fact that many of their institutional systems were internal and could therefore be protected by other tools guarding the wider network?
For these systems, we implemented a fantastic open source tool called Anubis Botstopper. It uses a series of checks on the behaviour of each web request to try to identify whether the request is genuine. This is the beauty of open source: an application like this can quickly become available, be deployed, and then be improved upon by others in short order.
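One of the ideas behind challenge tools of this kind is proof-of-work: before serving content, the server hands the visitor a small computational puzzle that a browser solves almost instantly, but which becomes expensive for a bot making millions of requests. The sketch below is purely illustrative of that general idea; the function names and difficulty setting are my own, not Anubis’s actual implementation or API.

```python
import hashlib
import secrets

DIFFICULTY = 4  # illustrative: leading zero hex digits required in the hash

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash meets the target.
    Cheap for one genuine visitor; costly at bot scale."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

challenge = issue_challenge()
nonce = solve(challenge)
print(verify(challenge, nonce))  # True
```

The asymmetry is the point: verification is a single hash, while solving takes thousands of attempts on average, so the cost lands on whoever is making the requests.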
For us, Anubis was a huge success, and we decided to sponsor the project to recognise the hard work that had gone into it. Around the same time, Duke University described the great results they had running Anubis in front of their catalogue, archives and repository. Out of curiosity, I went back to look at one of our university libraries that had Anubis installed in mid-July, when they first had problems.
Since then, 2.5 million requests to their site have been blocked, with only 17,000 requests allowed through. To be specific, these figures cover requests from outside JANET (the Joint Academic NETwork); traffic from within JANET is already allowed through by default.
Since the summer we have been finessing and adapting both our Cloudflare and Anubis rules, trying to accommodate all the exceptions. We found a couple of things. Firstly, some genuine third-party scrapers (discovery tools and citation software, for example) were either using very old User-Agent header strings or, worse still, no User-Agent string at all. Consequently, we had to add extra exceptions to cover these, or speak to the third parties (some of whom have since improved their own applications).
The other thing we have found is that the AI bots have become more sophisticated. In the early days, they were sending User-Agents that were rather random: old versions of Windows CE and Internet Explorer, for example. As time has gone by, they have started using more modern and realistic User-Agents, which has meant, again, adapting our rule sets.
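The kind of rule set this produces can be sketched roughly as follows. The patterns, the allowlisted tool name and the three-way outcome are all illustrative inventions of mine, not our production Cloudflare or Anubis configuration, but they capture the shape of the problem: obviously implausible User-Agents get blocked, known legitimate tools get allowed, and everything else (including requests with no User-Agent at all) gets challenged rather than rejected outright.

```python
import re

# Illustrative patterns that, on traffic like ours, signalled bots
# masquerading as browsers.
SUSPICIOUS_PATTERNS = [
    re.compile(r"Windows CE", re.I),    # long-obsolete platform
    re.compile(r"MSIE [2-6]\.", re.I),  # ancient Internet Explorer versions
]

# Genuine third-party tools that need explicit exceptions
# (the name here is hypothetical).
ALLOWLIST = [
    re.compile(r"ExampleCitationTool", re.I),
]

def classify(user_agent):
    """Return 'allow', 'challenge' or 'block' for a request's User-Agent."""
    if not user_agent:
        # Some legitimate tools send no User-Agent at all,
        # so challenge rather than block outright.
        return "challenge"
    if any(p.search(user_agent) for p in ALLOWLIST):
        return "allow"
    if any(p.search(user_agent) for p in SUSPICIOUS_PATTERNS):
        return "block"
    return "challenge"

print(classify("Mozilla/4.0 (compatible; MSIE 5.0; Windows CE)"))  # block
print(classify(None))                                              # challenge
print(classify("ExampleCitationTool/2.1"))                         # allow
```

As the bots adopt realistic modern User-Agents, the "block" list becomes less useful on its own, which is exactly why behavioural challenges have to carry more of the load.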
I guess this is the ‘ugly’ side of AI: an ‘arms race’ to spin up LLMs (Large Language Models), gather training data scraped from the entire internet, and be the next big thing, regardless of the cost to anyone else. As the bots are blocked, they quickly adapt to find new ways around our rules. We certainly haven’t heard the end of it.
The Years Ahead
Open source software has deep roots in library technology, with systems like Koha and DSpace in use for over two decades; yet these systems continue to evolve rapidly. The difference now is that open source is no longer the preserve of early adopters: it is increasingly mainstream, across the wider technology landscape as well as libraries.
A significant and growing portion of artificial intelligence software is being released under open source licences. Of course, by their nature, libraries share many of the same values as open source software and open data: collaboration and community, privacy and security, and information neutrality, for example. So how do we, in libraries, balance this need for openness and information sharing with what is, in effect, the theft of copyrighted content by people building LLMs? These are questions we are going to have to address in the coming years.
The open source ethos of transparency and shared stewardship aligns naturally with the principles libraries need to uphold as AI systems become more pervasive. If AI depends on open data, then libraries can help model what ethical openness looks like. At Open Fifth we believe that open source software (including the systems we support, like Koha, Aspen Discovery, DSpace, etc.) will be at the heart of information technology going forward and, more specifically, of libraries.
While the current wave of AI-driven scraping represents the darker side of this new technology landscape, we shouldn’t lose sight of how open source and open data continue to empower libraries and other organisations to build better, more transparent, and more inclusive solutions.
Open Fifth is exhibiting at the Green Libraries Conference on Monday 27 October; remaining tickets are on sale until Friday.