The futility of good website coding practices in the LLM era?
April, 2024
When I started this website, I tried to maintain good practices. Over time, as I learned more about best practices and HTML semantics, I wanted to rework the code into a more "proper" form.
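To give a rough idea of what I mean (an illustrative sketch, not my actual markup), "semantic" HTML swaps generic containers for elements that describe the content itself:

  <!-- Generic markup: readable by humans, opaque to machines -->
  <div class="post">
    <div class="title">A blog post</div>
    <div class="date">April 2024</div>
    <div class="body">Some text...</div>
  </div>

  <!-- Semantic markup: the structure itself is machine-readable -->
  <article>
    <h1>A blog post</h1>
    <time datetime="2024-04">April 2024</time>
    <p>Some text...</p>
  </article>

Which, of course, is exactly the property that makes it so convenient to scrape.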
However, one comment I read stuck in my mind. Someone speculated that proper "semantic web" practices merely make it easier for large language models (LLMs, the cutting edge of the current machine-learning craze in artificial intelligence) to acquire training data.
As I read more, I discovered that was in fact the plan all along. Back in the 1990s, Tim Berners-Lee (the inventor of the world wide web, HTML, and HTTP) discussed "intelligent agents" ("agentic AI") that would need a machine-readable web to work efficiently. I will give him the benefit of the doubt that he was blissfully ignorant of just how badly these high-tech aspirations would affect the world. Or maybe I shouldn't.
The entire web is being scraped to create monstrosities that none of us consented to. This affects everyone, from major news websites to random forums and small blogs.
Perhaps LLMs today are just fun and games, but the hubris, lack of ethics, and carelessness of those who program these things make it clear that the negative consequences will outweigh any positive ones by a ratio of over a million to one... Already, LLMs are being used to create bots that spam the web at massive scale; their human-like language has been used in political propaganda campaigns; and we all know these programmers aren't going to stop until they unleash a dystopia that over a hundred years of writers and film-makers have kept warning the world about.
In a never-ending arms race, some have turned to "data poisoning" to interfere with the algorithms. In particular, artists have adopted tools that subtly alter their images in order to disrupt the generative AI models that steal their art. Garbling a website as a defensive act is unfortunate, because users of screen readers and other assistive technologies that depend on good coding practices shouldn't be caught in the crossfire. But what's the point of carefully crafting "proper" code when half of web traffic is already bots (and has been since at least 2017, or even 2014), when SEO-optimized spam websites and LLM-generated comments are clogging up the web, and when our lifetime will unleash horrors beyond our comprehension?