Guns of the SysAdmin

"Ninja Consultant" - Jemimus

War has changed.

Sorry, let me start again.

The Internet has changed.

In Metal Gear Solid 4: Guns of the Patriots, proxy conflicts permeate a dystopian world where mindless robots wage war in a controlled fashion to keep a war economy well oiled. In a real-world parallel, the internet is now chock-full of IPs and User-Agents waging war on the sysAdmin to keep the data economy well fed.

The System Administrator is a long-standing mythical figure in the computer world. They are the guardians who protect data from malicious use. They keep employees from falling for innocuous-looking phishing scams. They are professionals who know the entire computer and network infrastructure well enough to diagnose an issue when the going gets tough. SysAdmins ensure that computers are running correctly to keep the web running.

I recently came in contact with a sysAdmin from OpenStreetMap who had noticed an enormous uptick in network traffic over the previous week, caused by bots scraping data from the project’s websites. OpenStreetMap is an open-data and open-source project that started in 2004, providing free mapping information. They usually block around 10,000 residential IP addresses from accessing their websites. However, the sysAdmin now has to manage upwards of 300,000 residential IPs in bursts lasting hours. Each network request from an IP takes up server resources to process. Now imagine an onslaught of IPs trying to use up all of the network resources at once.

“It’s been a tough few years for private sysAdmin.”

Wikimedia and GNOME are other important free software projects, and they pride themselves on self-hosting, owning their own hardware and services. For these communities, self-hosting is non-negotiable. Content delivery network (CDN) providers like Cloudflare, platforms engineered to manage these onslaughts, are avoided at all costs.

The sysAdmin soldiers are akin to veterans who lived a glorious past, now overtaken by time and reckoning with their age. What started with gunpowder ended with the fighter jet. What started with hobby server racks ended with oversized companies controlling the routes of the online world.

“The Internet has changed.”

This increase in traffic, the sysAdmin hypothesizes, is linked to third-party applications that install code in other applications, code that internet scraping bots can use to hide their true nature. With the LLM race heating up among big tech companies, ever larger volumes of data are being scraped as companies run out of data to train their models. A sysAdmin can identify that a bot is trying to reach their websites and will sometimes rely on a robots.txt file, which acts as a gate telling bots to keep out.
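
As a minimal sketch of how that gate works, the Python below parses a made-up robots.txt with the standard library's urllib.robotparser and asks whether a given User-Agent may fetch a URL. The bot names, paths, and example.org URLs are assumptions for illustration, not OpenStreetMap's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleScraperBot
Disallow: /

User-agent: *
Disallow: /api/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from a string, without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named bot is shut out of the whole site.
blocked_bot = parser.can_fetch("ExampleScraperBot", "https://example.org/map")

# Any other bot falls under the wildcard rule: /api/ is closed, the rest is open.
other_bot_api = parser.can_fetch("SomeOtherBot", "https://example.org/api/data")
other_bot_page = parser.can_fetch("SomeOtherBot", "https://example.org/map")

print(blocked_bot, other_bot_api, other_bot_page)
```

The check runs on the bot's side, which is the catch: robots.txt only stops crawlers that choose to ask.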

To avoid being blocked, the bots use these trojan-like third-party apps as a proxy: the bot camouflages itself behind the user's device, then requests connections as if it were a regular person browsing the site.

“Got no human grace, your eyes without a face.”

In our email exchange, private sysAdmin explains, “The scraping of our website isn't necessary, we already publish full, official data exports on our download service.”

The sysAdmin was able to get some of the third-party app developers to block OpenStreetMap websites, but some have not yet answered. With limited resources, it is unclear what the next best path forward is for our embattled sysAdmin as they face a continued traffic onslaught.

While it is true that some cases are caused by scraping for LLM data, a clear cause has not been established. Another recent event partially absolves the LLM data market of this activity.

On January 28, Google published a report on disrupting a network of residential IPs that acted in a scheme similar to what was described above. It is claimed that the network is managed by IPIDEA with bot IP addresses originating from Iran, Russia, North Korea, and China. Was it coordinated by data brokers around the world? Was it randomly coordinated? Time will tell, but there are clear markets for this kind of activity.

While it is good that Google is fighting this, we should address the elephant in the room: Every single data leech is to blame, including Google. They created the market conditions for such operations to thrive. Make no mistake, this problem was turbo-charged by the AI hype, but it exploded in our faces many years ago. When ClaudeBot runs this kind of operation, Anthropic gets a pass. However, when a group outside of the tech cartel does, it is condemned and seen as a threat.

LLM data, cyber-attacks, ad analytics, big data: these trenches are not friendly for private sysAdmin. Perhaps it is best to leave the front.

Jonathan Corbet, writing at lwn.net, a website covering Linux projects, says, “LWN is not served by some massive set of machines just waiting to keep the scraperbots happy.” He laments that to protect LWN’s services, they may have to resort to platforms such as Cloudflare. The risk with this growing reliance is that there are only a handful of these CDN companies, surprisingly still in good standing, and nothing guarantees that they will not abuse their reach.

“The Internet has changed.”