How LLMs actually work.

The system is vast, partially visible, and changing fast. Here's an honest map of what we can see — the known architecture, the reverse-engineered edges, and the territory that's still dark.

01

The Honest Starting Point

Before we explain anything, we need to be upfront about the state of knowledge in this space.

The large AI companies — OpenAI, Anthropic, Google, Meta — train their models on massive datasets. These datasets, and the exact processes that turn them into model behaviour, are proprietary. They don't publish what's in the training data. They don't publish the full details of how models are fine-tuned. This is their competitive advantage, and they guard it.

That means anyone claiming to know exactly how to "get into" an LLM's training data is either oversimplifying or misleading you.

What we're honest about

Known

How LLMs process text at inference time — the mechanics of how they read, reason, and respond

Observed

What types of content LLMs tend to surface, cite, and recommend — based on testing and reverse engineering

Unknown

The exact contents of training datasets and the precise weighting of different signals in model behaviour

02

Two Layers of AI Discovery

When people talk about "getting discovered by AI," they're usually conflating two very different things. Understanding the distinction matters.

Layer 1

Training data

Extremely difficult to influence

This is what the model "knows" before you ever talk to it. The vast corpus of text scraped from the web, books, code repositories, and other sources that was used to train the model.

You can't control this. The AI companies decide what goes in, how it's weighted, and when it's updated. If your brand or content made it into the training data, the model may know about you. If it didn't, it doesn't. And you have very limited ability to change that.

This is the part most "AI SEO" vendors oversell. They imply you can engineer your way into training data. The reality is much messier.

Layer 2

Inference — how the model reads you live

Where the real value is

This is where it gets interesting. When an LLM browses the web, reads your website, processes a document, or is given context about you — it's doing inference. It's reading and interpreting your content in real time.

This is the layer you can actually influence. Not by gaming anything, but by understanding how LLMs read and making your content genuinely useful to them.

03

How LLMs Read

An LLM reading your website is similar to Google's crawler in some ways and completely different in others. Understanding both the similarities and the differences is where the leverage is.

Like Google

  • They crawl and read web pages
  • They value clear structure and headings
  • They care about the quality and relevance of content
  • They follow links and understand site structure
  • They can identify authoritative vs. spammy content

Unlike Google

  • They actually understand the meaning of text, not just keywords
  • They can synthesize information across multiple sources
  • They don't rank by backlinks; they evaluate content on coherence and usefulness
  • They read the full text, not just the first paragraph
  • They respond to how clearly you explain things, not how well you keyword-stuff

The biggest shift: traditional search ranking matched pages to keywords. LLMs model what the text actually means. This changes everything about how you should structure content.

04

What LLMs Value

Based on publicly available research, our testing, and the broader reverse-engineering community, these are the signals that consistently matter when LLMs process content. We mark our confidence level for each.

Clear, structured content

High confidence

Semantic HTML, logical heading hierarchies, clear section boundaries. LLMs parse structure to understand what content is about and how ideas relate to each other. This isn't about gaming — it's about being readable.
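A logical heading hierarchy is something you can check mechanically. Here's a minimal sketch using Python's standard-library `html.parser` — the function name and its messages are our own illustration, not any standard tool:

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collect h1-h6 levels in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Match h1..h6 tags (two characters: 'h' plus a digit)
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

def heading_problems(html: str) -> list[str]:
    """Return a list of hierarchy issues an LLM-oriented audit might flag."""
    parser = HeadingAudit()
    parser.feed(html)
    problems = []
    if parser.levels.count(1) != 1:
        problems.append(f"expected exactly one <h1>, found {parser.levels.count(1)}")
    for prev, cur in zip(parser.levels, parser.levels[1:]):
        if cur > prev + 1:  # e.g. an <h2> followed directly by an <h4>
            problems.append(f"heading jumps from h{prev} to h{cur}")
    return problems
```

A page that jumps from `<h2>` straight to `<h4>` isn't broken for human readers, but it blurs the section boundaries a parser relies on; checks like this catch it early.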

Specificity over vagueness

High confidence

Concrete facts, specific numbers, named entities. "Founded in 2019 in Melbourne, employs 45 people" is vastly more useful to an LLM than "a growing company with a dedicated team." Specific information can be extracted, verified, and cited.

Structured data and metadata

High confidence

Schema.org markup, Open Graph tags, well-formed meta descriptions. These are machine-readable signals that help LLMs (and their tool-use pipelines) quickly understand what a page is about without ambiguity.
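Schema.org markup is commonly embedded as a JSON-LD block in the page head. As a hedged sketch, here's a small Python helper that builds a minimal Organization object — the helper name and example values are invented for illustration, while the property names (`name`, `url`, `foundingDate`, `sameAs`) are real Schema.org properties:

```python
import json

def organization_jsonld(name, url, founding_year=None, same_as=()):
    """Build a minimal Schema.org Organization object as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
    }
    if founding_year is not None:
        data["foundingDate"] = str(founding_year)
    if same_as:
        # Links to profiles on other sites: one way machines cross-check identity.
        data["sameAs"] = list(same_as)
    return json.dumps(data, indent=2)
```

The returned string would sit in the page head inside a `<script type="application/ld+json">` element, giving any machine reader an unambiguous statement of who the page belongs to.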

LLM-specific files (llms.txt)

Medium confidence

A growing convention where websites provide a plain-text file specifically for LLM consumption. Where robots.txt tells search crawlers what they may access, llms.txt tells language models what the site is about and which pages matter. Adoption is early but growing.
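Under the draft llms.txt convention, the file lives at the site root and uses simple Markdown: a title, a short blockquote summary, and annotated links. A sketch of what one might look like — every name, number, and URL below is invented:

```text
# Acme Analytics

> Acme Analytics is a Melbourne-based company (founded 2019, 45 staff)
> that builds reporting dashboards for mid-sized retailers.

## Key pages

- [Product overview](https://acme.example/product): what the dashboard does
- [Pricing](https://acme.example/pricing): plans and limits
- [Docs](https://acme.example/docs): setup and API reference
```

Because the convention is young, treat this as a low-cost experiment rather than a guaranteed signal.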

Consistent identity across sources

Medium confidence

When your name, brand, and key facts appear consistently across your website, social profiles, directories, and publications, LLMs are more likely to surface accurate information about you. Inconsistency creates confusion for models just like it does for people.

Freshness and update signals

Medium confidence

Content with clear dates, regular updates, and recent timestamps tends to be preferred by LLMs with web access. Stale content without dates gets less weight. This is more relevant for models with retrieval-augmented generation (RAG) than for training-data-only responses.

05

What We Don't Know

Intellectual honesty means being explicit about the gaps. Here's what nobody in this space can tell you with certainty — and you should be skeptical of anyone who claims otherwise.

?

Exact training data composition

No one outside the AI companies knows exactly what's in the training data, how it's weighted, or how often it's refreshed.

?

How citation decisions are made

When an LLM cites a source in its response, the exact mechanism for choosing that source over alternatives isn't fully understood.

?

Model-specific differences

ChatGPT, Claude, Gemini, and others all behave differently. What works for one may not work for another. Anyone selling a "universal AI strategy" is oversimplifying.

?

How fast things are changing

These models are updated frequently. Behaviours observed today may change tomorrow. Any strategy needs to be adaptive, not fixed.

06

The SEO Parallel

Google never published its full ranking algorithm. But an entire discipline — SEO — emerged from observation, testing, and reverse engineering. The best practitioners understood the principles even when the specifics shifted with every algorithm update. They mapped an opaque system and built on what they found.

Shelf engineering is at the same frontier. We don't have the algorithm. But we understand the principles. And the principles are rooted in something durable: make your information clear, structured, specific, and genuinely useful.

The tactics will change. The fundamentals won't. Good content that's well-structured for machines will always outperform vague content that's only designed for human eyes — because increasingly, the machines are the first readers. The grid is being built now. The question is whether you can see it.

07

What You Can Do Now

You don't need to wait for the field to mature. The fundamentals are already clear.

01

Audit your content for clarity

Can an LLM read your website and immediately understand what you do, who you serve, and what makes you different? Test it — paste your homepage into ChatGPT and ask it to summarize your business.

02

Add structured data

Schema.org markup, clean meta descriptions, consistent naming. These aren't just good for Google — they're increasingly important for AI systems that use web browsing tools.

03

Be specific

Replace vague marketing language with concrete facts. Numbers, locations, dates, capabilities. The more specific your content, the more useful it is to both humans and AI.

04

Consider an llms.txt

A plain-text summary of your site designed for LLM consumption. It's a small effort with potential upside as AI systems increasingly look for these files.

05

Stay informed, stay skeptical

This field is moving fast. Read broadly, test things yourself, and be wary of anyone selling certainty. The honest practitioners will always tell you what they don't know.
