do-not-train.txt
There’s always been a quiet understanding humming beneath the internet. Not a rule, exactly. More like etiquette. A social contract between strangers: you can visit, but tread lightly. Don’t index everything. Don’t take more than you need.
We encoded that unspoken thing into a tiny file: robots.txt. No passwords. No locks. Just a polite request whispered to the bots: please don’t go here. And for the most part, they listened. Not because they had to, but because the web worked better when we respected each other.
But that was then.
AI doesn’t crawl. It learns.
So maybe it’s time to say something else. Not “don’t visit.” That ship’s already out of port. Maybe now we need to say: don’t learn. Maybe we need a do-not-train.txt.
The Ghibli Glitch
Let's rewind for a second.
Early 2024. OpenAI unveils Sora — a video generation model that can conjure entire scenes from text prompts. The tech was impressive. But the buzz? That started somewhere else, about a year later.
When ChatGPT’s new image generation rolled out, someone asked it to "Ghiblify" a photo. And it did. Soft brushstroke textures. Wistful pacing. Warm, saturated hues. That quiet magic you instantly recognize — not copied, not traced, but unmistakably Ghibli-coded. The emotional cadence. The stillness between beats. The way light bends like memory.
Sam Altman’s Twitter profile photo looked… familiar. Ghiblified. It wasn’t a stock avatar. It was him, re-rendered through that same dreamlike filter.
And that’s when things took off. Artists started asking questions. Fans blinked. Wait — did Studio Ghibli collaborate? Nope. No nod. No credit. No permission. Just a model that, when prompted, could channel a style it had very likely seen — a lot. When asked about the training data, OpenAI stayed vague. But let’s be honest: you don’t get that good at mimicking without steeping in the source for a while.
There’s no law it’s breaking. No stroke-for-stroke theft. But that’s not the point.
It’s not stealing. It’s absorbing.
And for many, that’s what makes it even more uncanny — the sense that something intimate and handcrafted has been metabolized by something that never asked.
That’s when the question slipped in sideways: How do you tell an AI not to learn from something?
Not copy. Not steal. Just… don’t study it. Don’t internalize it.
Vibes ≠ Rights
Here’s where things get sticky: there’s a big gap between what’s legal and what actually feels okay.
Legally? If something’s online and public, chances are it’s treated as fair game for training AI. Models are built to scrape, digest, and learn from massive piles of data — the more, the better. So if your blog posts, your artwork, your tweets are out there, AI companies might see them as an open buffet.
But culturally? It’s a messier, fuzzier story.
Imagine you spend years curating your personal brand on Instagram — your unique way of storytelling, your voice, your aesthetic. Now imagine a giant corporation copies the style of your posts, the rhythm of your captions, and the feel of your feed, then uses it to train an AI that churns out “content” just like yours, filling the platform with knock-offs. They didn’t swipe your pictures pixel-for-pixel, and they didn’t steal your text outright — but they took your essence without ever asking.
They didn’t break copyright law. They broke the vibe.
That feeling — the slow burn of something intangible slipping away — is what’s missing from the legal landscape. It’s not about plagiarism in the traditional sense; it’s about the erosion of trust and respect.
The AI didn’t outright steal. But it didn’t ask permission either. There was no conversation, no consent, no acknowledgment that your work is more than just raw data to be digested.
And that’s the problem. The uncomfortable sense that your creative identity, your cultural signature, your vibe, is being metabolized and repurposed without so much as a nod in your direction.
So what is do-not-train.txt?
Imagine a simple file, dropped at the root of your site. Like robots.txt, but for AI. A way to say: Hey. This space isn’t training data.
It doesn’t stop bad actors. Neither did robots.txt. That was never the point. The point was to declare intent. To set a boundary — protocol-level politeness. A flag planted in digital soil that says: this was made by humans, for humans.
It could look like this:
User-Agent: *
Disallow-Learn: /
Allow-Browse: /
Note: This site does not consent to AI training.
Or:
User-Agent: GPT-4o
Disallow-Learn: /journal/
Allow-Learn: /blog/
Maybe it gets fancier over time — opt-in hashes, dataset audit tags, even training fingerprints. But it doesn’t have to start complex. robots.txt didn’t. The spec can grow.
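To make that concrete, here’s a rough sketch of how a training crawler might honor the file, written against the hypothetical directives above (Disallow-Learn, Allow-Learn) and robots.txt-style longest-prefix matching. None of this is an adopted standard; the file name, the field names, and the matching rules are all assumptions.

# Sketch only: do-not-train.txt and its Disallow-Learn / Allow-Learn fields
# are hypothetical, not an adopted standard. Matching mimics robots.txt.
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_policy(site_url: str) -> str:
    """Fetch /do-not-train.txt from the site root; empty string if absent."""
    parts = urlparse(site_url)
    policy_url = f"{parts.scheme}://{parts.netloc}/do-not-train.txt"
    try:
        with urlopen(policy_url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""  # no file means no declared preference

def may_train_on(policy: str, agent: str, path: str) -> bool:
    """Return True unless a matching Disallow-Learn rule covers the path.
    Longest matching prefix wins; Allow-Learn carves out exceptions."""
    applies = False
    best_prefix, allowed = "", True
    for raw_line in policy.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = value == "*" or value.lower() == agent.lower()
        elif applies and field in ("disallow-learn", "allow-learn"):
            if path.startswith(value) and len(value) >= len(best_prefix):
                best_prefix, allowed = value, (field == "allow-learn")
    return allowed

if __name__ == "__main__":
    example = "User-Agent: *\nDisallow-Learn: /journal/\nAllow-Learn: /blog/\n"
    print(may_train_on(example, "GPT-4o", "/journal/2025/notes"))  # False
    print(may_train_on(example, "GPT-4o", "/blog/hello-world"))    # True

A real spec would still need to settle precedence, wildcards, and caching, but even this much would let a crawler check a site’s stated preference before folding its pages into a training set.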
Why bother?
Because even a whisper can matter.
Will it stop every model? No. But neither do copyright claims or Creative Commons badges or Terms of Service pages. We still use them. Not because they’re perfect — but because they say something.
They create friction. Not the kind that breaks systems, but the kind that reminds us we’re still here. That someone made this. That not everything should be digested by the machine.
And maybe — just maybe — AI companies will start respecting these signals. Not out of fear, but out of reputation. Because ignoring a clear “no” might one day cost more than obeying it.
The Meaning of Restraint
What made Studio Ghibli special wasn’t just the animation. It was the restraint. The care. The refusal to move fast and break things. An AI can mimic that in seconds. But it can’t mean it. Not unless we build space for meaning. For asking. For not-touching. For not-learning.
Maybe that’s what do-not-train.txt really is — a soft refusal. A protocol for pausing. A digital do-not-disturb sign for the parts of the internet that still have a pulse.
So what now?
What's Next?
Maybe next post, we sketch the spec. Break down enforcement ideas, training audits, even community tooling. Maybe we build a browser plugin that warns you if a site’s been absorbed into a model.
We won’t solve this overnight. But for now, maybe we just whisper:
This wasn’t made for you. Or at least… not yet.