
Starting from scratch: Training a 30M Topological Transformer

If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, then one of the best ways to do that would probably be to retrain an already existing model, just with swapped attention modules. Once you have such a model, you can do apples-to-apples benchmarks.

This has been done successfully in the past:

https://huggingface.co/featherless-ai/QRWKV-72B

Note that this is a 72B model which would be very expensive to train from scratch, but here they did the conversion for less than $2000.

3 days ago | kouteiheika
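(For illustration only, a minimal PyTorch sketch of the module-swapping idea described above. `MyAlternativeAttention` is a placeholder, not anything from the linked work; a real conversion would also have to match the original attention block's interface and then fine-tune the swapped model.)

    import torch.nn as nn

    class MyAlternativeAttention(nn.Module):
        """Placeholder for whatever replaces standard attention."""
        def __init__(self, embed_dim: int):
            super().__init__()
            self.proj = nn.Linear(embed_dim, embed_dim)

        def forward(self, x):
            # the new mechanism would go here
            return self.proj(x)

    def swap_attention(model: nn.Module, embed_dim: int) -> nn.Module:
        # Walk the module tree and replace every nn.MultiheadAttention with
        # the alternative block; all other weights stay as pre-trained.
        for name, child in model.named_children():
            if isinstance(child, nn.MultiheadAttention):
                setattr(model, name, MyAlternativeAttention(embed_dim))
            else:
                swap_attention(child, embed_dim)
        return model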

I'd say try the nanogpt speedrun. It's much easier to train, and gives you a better comparison vs optimized systems.

https://github.com/KellerJordan/modded-nanogpt

2 days ago | Herring

The linked paper tested nanoGPT with this new transformer:

https://www.techrxiv.org/users/685780/articles/1375955-topol...

2 days ago | naasking

thanks for linking.

Yes, the paper compares the new architecture (which is also a fork of my implementation of nanoGPT) with Karpathy's nanoGPT. There are also links to the code and the benchmark used.

2 days ago | tuned

Note I didn't say Karpathy's nanoGPT, I said use the speedrun.

Transformers are universal function approximators. When well-tuned, they often start to approximate other innovations. Not always, thank god, but often enough that you have to be careful.

2 days ago | Herring

Labs were also competing to train BERTs for $20 or less. People still use them a lot, too.

https://www.databricks.com/blog/mosaicbert

I'll add that they should do a number of small training runs with different architectures and data mixes. That proves generalization.

2 days ago | nickpsecurity

Depending on how different the attention mechanism is, that might not work. If it's just a faster / different way of finding the tokens to attend to, sure. But I get the sense the author is implying this method uses different semantics somehow. Although tbh I didn't follow it entirely.

3 days ago | oofbey

This is interesting. Has there been more research into this architecture? I hear about it once every few years but it always seems like a niche / experimental thing. But based on the graph in their blog post you'd expect every company to be using this.

3 days ago | andai

This is a novel re-interpretation of the Transformer, based on my previous research done with a library called `arrowspace`.

It is somewhat like what is called a "Grassmann-like flow" but without the Plücker embedding, and also similar to what is done in DavisTensor, but relying on a spectral Laplacian instead of purely geometric distances.

The problem with a lot of prior work is that it focuses on dense representations. This architecture focuses on sparse representations and provides a new approximation computation based on energy-informed graphs.

2 days ago | tuned

Thanks for reading. I cannot retrain an existing model, as the self-attention mechanism has been completely redesigned. The keys and values in self-attention are stored as scalars, so a latent space with traditional weights does not make sense in the context of a topological transformer. The two latent spaces would eventually be somewhat equivalent, but they would store totally different values.

2 days ago | tuned

That doesn’t tell you if the new method continues to perform better at higher parameter counts.

2 days ago | throwaway314155

It most likely will in terms of performance, as it uses 50% less memory (certainly at inference time, which is the most-used operation on web services), because it can leverage longer T and D, provided the design is confirmed and the quality of generation is comparable to other models. If this very basic assumption is correct, it means a lot of savings in electricity, as the same GPUs can resolve more requests.

2 days ago | tuned

By performance, I meant the accuracy of the model, not the runtime/memory characteristics.

a day ago | throwaway314155

Nor that the training from scratch will even work.

2 days ago | amelius

Exactly, that is the current objective: to prove that generation for a specific domain is on par with causal-attention models.

2 days ago | tuned

I wonder what would happen if we just crammed more into the "tokens". I am running an experiment replacing discrete tokens with embeddings + a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it contain much more nuance.

Experiments I want to build on top of it:

1. Adding LSP context to the embeddings - that way the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.

2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, and we would not rely on a huge static token dictionary.

I'm aware that papers exist that explore these ideas, but so far no popular/good open source models employ this. Unless someone can prove me wrong.

3 days ago | ashirviskas
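(A minimal, hypothetical sketch of the embeddings-plus-small-byte-encoder/decoder idea above. The chunk size, widths and class names are illustrative only, not from an existing model.)

    import torch
    import torch.nn as nn

    CHUNK = 8      # bytes folded into one "token" embedding (the compression ratio)
    D_MODEL = 512  # embedding width consumed by the main model

    class ByteChunkEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.byte_emb = nn.Embedding(256, 64)      # one embedding per byte value
            self.mix = nn.Linear(CHUNK * 64, D_MODEL)  # fold a chunk into one vector

        def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
            # byte_ids: (batch, seq_len * CHUNK) integers in [0, 255]
            b, n = byte_ids.shape
            x = self.byte_emb(byte_ids).view(b, n // CHUNK, CHUNK * 64)
            return self.mix(x)                         # (batch, seq_len, D_MODEL)

    class ByteChunkDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.unmix = nn.Linear(D_MODEL, CHUNK * 256)  # logits for each byte in the chunk

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            b, n, _ = h.shape
            return self.unmix(h).view(b, n, CHUNK, 256)   # per-byte logits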

I found a few papers in this direction with Perplexity, like this one https://ceur-ws.org/Vol-4005/paper1.pdf and it doesn't seem to be that relevant for now.

The progress of a handful of models seems to be so much better (because of limited compute we have only a handful of big ones, I presume) that these fine-tunings are just not yet relevant.

I'm also curious what an English + Java + HTML + CSS + JavaScript-only model would look like in size and speed, for example.

Unfortunately, whenever I ask myself the question of fine-tuning tokens (just a few days ago this question came up again), deep diving takes too much time.

Claude only got LSP support in November, I think. And it's not even clear to me to what extent. So despite the feeling that we are moving fast, tons of basic ideas haven't even made it in yet.

3 days ago | Yemoshino

if you have a corpus of code snippets to train the manifold (Laplacian) on (and a good embedding model), it is definitely possible to try something like this.

2 days ago | tuned

There are many examples of noisily encoding a large embedding vocabulary. This sounds a bit like T-free or H-net? Or BLT?

One of the main issues with lines of work around this is that you end up trading embedding parameters for active parameters. This is rarely a good trade-off for the sake of compute.

3 days ago | stephantul

Isn't this just an awkward way of adding an extra layer to the NN, except without end-to-end training?

Models like Stable Diffusion sort of do a similar thing using CLIP embeddings. It works, and it's an easy way to benefit from the pre-training CLIP has. But for a language model it would seemingly make more sense to just add the extra layer.

2 days ago | nl

I mean this is exactly what it is. Just a wrapper to replace the tokenizer. That is exactly how LLMs can read images.

I'm just focusing on different parts

2 days ago | ashirviskas

Not an expert in the space, but I'm not sure you need to modify tokens to get the model to see syntax; you basically get that exact association from attention.

3 days ago | appplication

You only get the association that is relevant to your project if you can cram in the whole codebase. Otherwise it is making rough estimates, and some of the time that seems to be where the models fail.

It can only be fully resolved with either infinite context length, or by doing it similarly to how humans do it: add some LSP "color" to the code tokens.

You can get a feel for what LLMs deal with when you try opening 3000 lines of code in a simple text editor and try to do something. That may work for simple fixes, but not whole-codebase refactors. Only ultra-skilled humans can be productive in it (using my subjective definition of "productive").

3 days ago | ashirviskas
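(A very rough, purely illustrative sketch of the "add some LSP color to the code tokens" idea: alongside the usual token embedding, embed a symbol-kind tag obtained from a language server and add it in. The tag set and wiring below are made up for illustration.)

    import torch
    import torch.nn as nn

    LSP_KINDS = ["unknown", "function", "class", "variable", "parameter", "module"]

    class LspColoredEmbedding(nn.Module):
        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.kind = nn.Embedding(len(LSP_KINDS), d_model)

        def forward(self, token_ids: torch.Tensor, kind_ids: torch.Tensor) -> torch.Tensor:
            # token_ids, kind_ids: (batch, seq_len); kind_ids would come from LSP queries
            return self.tok(token_ids) + self.kind(kind_ids)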

Thanks to all who have read. I would be glad to answer further scoped questions on the content of the post and the paper. I have answered some comments that may clarify the ideas behind the redesign.

2 days ago | tuned

Comparison with vanilla of the same size/flops budget?

3 days ago | lostmsu

The big bet with this technique is in having a fixed (non-learned) matrix which converts the tokens' latent space to the linear attention space. So you can kinda cheat and say your model is small because a bunch of the smarts are in this fixed big graph Laplacian matrix L.

So how do you scale this up from a toy problem? Well, that L would have to get bigger. And it's hard to imagine it being useful if L is not trained. Then it starts to look a lot more like a conventional transformer, but probably harder to train, with the benefit of smaller KV caches. (Half the size - not a massive win.)

So overall doesn’t seem to me like it’s gonna amount to anything.

2 days ago | oofbey

Also: precomputing a sparse Laplacian for N vectors at dimension D (NxD) is vastly cheaper (if using `arrowspace`, my previous paper) than computing distances on the same full dense vectors billions of times. There are published tests that compute a Laplacian on a 300Kx384 space in 500 secs on a laptop on CPU. So it is a trade-off: potentially a few minutes of pretraining versus hours of dot-products on dense matrices.

2 days ago | tuned
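(For a concrete sense of what precomputing a sparse Laplacian over an NxD embedding set involves, here is a rough sketch using generic scikit-learn/SciPy tooling. It is not the `arrowspace` implementation the comment refers to, just the standard kNN-graph construction.)

    import numpy as np
    from sklearn.neighbors import kneighbors_graph
    from scipy.sparse.csgraph import laplacian

    def sparse_laplacian(vectors: np.ndarray, k: int = 16):
        # vectors: (N, D) array of embeddings
        adj = kneighbors_graph(vectors, n_neighbors=k, mode="connectivity")  # sparse N x N
        adj = 0.5 * (adj + adj.T)            # symmetrise the adjacency
        return laplacian(adj, normed=True)   # sparse (normalised) graph Laplacian

    # e.g. L = sparse_laplacian(np.random.randn(300_000, 384).astype(np.float32))
    # roughly the 300Kx384 scale quoted above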

The idea is to have a lot of "narrow" models that work with RAG, instead of one model for all knowledge domains, or also to distil the metadata that currently sits in enterprise Knowledge Graphs.

2 days ago | tuned

I'm not sure if that is the right calculation.

Provided the flops are not prohibitive, output quality per model byte might be better. In general, people run the largest model they can.

I certainly think trading speed for quality at the same size is worth looking at. Especially if it uses methods that can benefit from the efforts of others to improve speed in general.

That said, the performance difference at 30M may not be representative of the performance difference at 30B.

There are probably a lot of really good ideas out there waiting for someone to drop a few million in training to reveal how good they are at large sizes.

3 days ago | Lerc

So no comparison?

3 days ago | lostmsu

Comparisons will be run when the quality of generation is on par with other available models. It is useless to have performance if the quality is not at least on par.

The paper runs a benchmark (code and benchmark in the paper) to compare the performance with a causal-attention GPT-2 model (nanoGPT) at inference (20% faster) and at training (equivalent for T and D larger than a threshold).

2 days ago | tuned

Does this make any sense, to anyone?

3 days ago | keyle

I think this is an attempt to try to enrich the locality model in transformers.

One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.

This is obviously not powerful enough to express non-linear relationships - like graph relationships.

This person seems to be experimenting with doing pre-processing of the input token set, to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationship between each token.

3 days ago | kannanvijayan
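(For reference, the "position vector" being discussed is, in the original Transformer, a sinusoidal encoding added elementwise to the token embeddings; many modern models use learned or rotary variants instead. A standard sketch, assuming an even d_model:)

    import math
    import torch

    def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
        # Classic sin/cos positional encoding from "Attention Is All You Need".
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (T, 1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))                    # (D/2,)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # token embeddings x of shape (batch, T, D) become x + sinusoidal_positions(T, D)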

  > like graph relationships
Once upon a time, when I was a language modeling researcher, I built and fine-tuned a big (at the time, about 5 billion parameters) Sparse Non-Negative Matrix Language Model [1].

[1] https://aclanthology.org/Q16-1024/

As this model allows for mixing and matching various contexts, one thing I did was to use a word-sorted context. This effectively transforms a position-based context into a word-set-based context, where "you and me", "me and you" and "and me you" are the same.

This allowed for longer contexts and better prediction.

2 days ago | thesz
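(A tiny sketch to make the order-invariance concrete: sorting the words inside a context window turns a position-based context into a word-set-based one, so different orderings share the same key.)

    def sorted_context_key(context_words):
        return tuple(sorted(context_words))

    assert sorted_context_key(["you", "and", "me"]) == sorted_context_key(["me", "and", "you"])
    assert sorted_context_key(["and", "me", "you"]) == ("and", "me", "you")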

I've saved it to look at in the future. I also remembered Kristina Toutanova's name (your editor). Looking up recent publications, she's done interesting work on analyzing pretraining mixtures.

https://aclanthology.org/2025.acl-long.1564/

Thanks to you both for two interesting papers tonight. :)

2 days ago | nickpsecurity

I am not an author of the SNMLM paper. ;)

I was using their model in my work.

2 days ago | thesz

I misunderstood what you said.

Well, in your work, what benefit did you get from it? And do you think it would be beneficial today combined with modern techniques? Or obsoleted by other techniques?

(I ask because I'm finding many old techniques are still good or could be mixed with deep learning.)

a day ago | nickpsecurity

At the time (2018), it had perplexity close to LSTM, while having more coefficients and much shorter (hours vs days) training time.

I tried to apply SNMLM's ideas to the byte-level prediction modeling here: https://github.com/thesz/snmlm-per-byte

It was not bad, but I had trouble scaling it to the 1B set, mostly because I did not have enough time.

I hold the same mindset as you, that many old techniques are misunderstood or underapplied. For example, decision trees, in my experiments, allow for a bit-length-per-byte comparable to LSTM (lstm-compress or the LSTM in the nncp experiments): https://github.com/thesz/codeta

a day ago | thesz

> This is obviously not powerful enough to express non-linear relationships - like graph relationships.

The distance metric used is based on energy-informed graphs that encode energy relations in a distribution called taumode; see my previous paper on spectral indexing for vector databases for a complete roll-out.

2 days ago | tuned

Adding the position vector is basic, sure, but it's naive to think the model doesn't develop its own positional system bootstrapping on top of the barebones one.

3 days ago | adroniser

For some reason people are still adding position encodings into embeddings.

As if they are not relying on the model's ability to develop its own "positional system bootstrapping on top of the barebones one."

2 days ago | thesz

it makes sense architecturally

they replace dot-product attention with topology-based scalar distances derived from a Laplacian embedding - that effectively reduces attention scoring to a 1D energy comparison, which can save memory and compute

that said, I'd treat the results with a grain of salt given there is no peer review, and benchmarks are only on a 30M-parameter model so far

3 days ago | liteclient

Yup, keyword here is “under the right conditions”.

This may work well for their use case but fail horribly in others without further peer review and testing.

3 days ago | reactordev

No, from my point of view it is about being more domain-focused instead of going full-orthogonal.

2 days ago | tuned

Right, this is a proposal that needs to be tested. I started testing it at 30M parameters; then I will move to 100M and evaluate the generation on domain-specific assisting tasks.

2 days ago | tuned

It made sense to me, as it is a very simple idea I guess: causal self-attention computes QKV distances on the full vectors for Q, K and V; the topological transformer can provide the same computation using full Q but scalar K and V. Instead of [N², N², N²], [N², N, N²] is used. If generation is confirmed to be on par in terms of quality, the gains are evident.

2 days ago | tuned
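(A rough, hypothetical sketch of the shape argument above, not the author's code: standard attention keeps full D-dimensional K and V per token, while the scheme described keeps a full Q but only one scalar per token for K and V, which is where the memory saving comes from. The scalar "projection" and scoring rule below are stand-ins.)

    import torch

    B, T, D = 1, 1024, 512

    # Standard causal self-attention: full D-dimensional vectors for Q, K and V.
    q = torch.randn(B, T, D)
    k_full, v_full = torch.randn(B, T, D), torch.randn(B, T, D)
    scores_full = q @ k_full.transpose(-2, -1)              # (B, T, T) dot-product scores

    # Scalar-K/V variant as described in the thread: one value per token for K and V
    # (e.g. a Laplacian/energy-derived quantity), so the KV cache shrinks from
    # 2*T*D floats to 2*T floats per layer.
    k_scalar, v_scalar = torch.randn(B, T, 1), torch.randn(B, T, 1)
    q_scalar = q.mean(dim=-1, keepdim=True)                 # stand-in 1-D projection of Q
    scores_scalar = -(q_scalar - k_scalar.transpose(-2, -1)).abs()  # (B, T, T) 1-D comparison

    print(k_full.numel() + v_full.numel(), "KV floats vs", k_scalar.numel() + v_scalar.numel())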

I haven't read the paper yet, but the graph Laplacian is quite useful in reordering matrices, so it isn't that surprising if they managed to get something out of it in ML.

3 days ago | bee_rider

No, it's a new form of alchemy that turns electricity into hype. The technical jargon is more of a thieves' cant to help identify other con men to one another.

3 days ago | pwndByDeath

that's a strange way to spell "no, I didn't understand the paper"

3 days ago | postflopclarity

Perhaps someone who does understand the paper will kindly make it a bit clearer for those of us who get a bit lost.

3 days ago | vixen99

Honestly, while I would really appreciate something like this, HN is not an explainer platform.

For sure, some words or feedback on what you understood (did you get it right?), etc. - yeah.

But otherwise, if you do not understand a research paper, you have to do the same hard work as everyone else: sitting down, going through it paragraph by paragraph and learning it. This takes a massive amount of time.

And for a high-level overview, ChatGPT and co. are really, really good at getting through papers.

2 days ago | Yemoshino

Try to get over your AI hate.

If you need help getting more out of AI, you can use ChatGPT and co. to go through papers and have them ELI5 paragraphs for you. 3Blue1Brown also has a few great videos about transformers and how they work.

3 days ago | Yemoshino

Ideologues usually aren't great at primary source understanding/reasoning, hence why they end up with such strong opinions.

3 days ago | Workaccount2

I dug into this a bit (with AI ofc) and it spat this out. I found it an easy way to visualise and start to understand:

> Standard AI models (like GPT-4) treat data using Global Geometry. They imagine every word as a point floating in a massive, flat, high-dimensional room. To see how two words relate, they draw a straight line between them.

> Local Topology changes the "room" into a landscape (a manifold). Instead of a flat void, the data exists on a curved surface that has hills, valleys, and paths.

3 days ago | geoffbp

What is a "high-dimensional room"? A "room" is by definition three-dimensional in so far as we're using metaphor for description. Then to add this "high-dimensional" modifier does little for me, since the only visualizable high-dimensional cube is a tesseract, which still leaves you at 4-d.

The presented counterpoint to this metaphor has the "room" change into a "landscape". The room is a "flat void" compared to a landscape with "hills, valleys, and paths". None of these landscape features evoke higher dimensionality in my imagination. Certainly not in the way, say, the metaphor of the "coastline" of Great Britain does when discussing the unusual properties of a fractal.

These moves don't shift my railroad mind from one track onto another. So I wonder, if a metaphoric usage is not in some way universal, how can it be instructive?

3 days ago | xtiansimon

The metaphor works only if you already understand the maths.

2 days ago | inimino

Maths I've never heard of. Possible. Probable. And what you're saying is that the words "room" and "landscape" are _over-coded_ to such an extent that the natural logic of 3-D rooms and 2-D landscapes is easily overcome by scaffolds of mathematical instruction, such that the latter could be imagined as having *higher* dimensionality than the former, for example? Or whatever other idea orbits those words counter to their nature. That's very interesting.