# LWTMoE
LessWrong Transformer MoE (LWTMoE) is a ten-million-parameter mixture-of-experts model trained on 102M tokens of LessWrong content. Of those ten million parameters, only six million are active per token. LWTMoE shows surprisingly strong results, beating both Pythia-14m-deduped and its dense counterpart, LWTDense-6M, on a coherence benchmark.
## Architecture
LWTMoE uses the Qwen3MoE architecture.
| Parameter | Value |
|---|---|
| Hidden Layers | 4 |
| Hidden Size | 256 |
| Attention Heads | 4 |
| KV Heads | 4 |
| Experts | 6 |
| Experts per Token | 2 |
| Vocab Size | 8,004 |
| Intermediate Size | 1,024 |
| MoE Intermediate Size | 512 |
| RoPE Theta | 100,000 |
| Max Position Embeddings | 512 |
| Tie Word Embeddings | True |
| MLP-Only Layers | [0] |
| Max Window Layers | 4 |
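For reference, the table maps onto the Hugging Face `Qwen3MoeConfig` roughly as sketched below. This is a minimal sketch, not the exact configuration of the released checkpoint: the field names assume the standard `transformers` Qwen3MoE API, `head_dim` is inferred from the hidden size and head count rather than taken from the table, and anything not listed above is left at its default, so the printed parameter count may differ slightly.

```python
# Minimal sketch of the configuration above using the standard transformers
# Qwen3MoE classes. Values not in the table are left at their defaults;
# head_dim is an assumption (hidden_size / num_attention_heads).
from transformers import Qwen3MoeConfig, Qwen3MoeForCausalLM

config = Qwen3MoeConfig(
    vocab_size=8_004,
    hidden_size=256,
    intermediate_size=1_024,       # dense MLP size (used by layer 0)
    moe_intermediate_size=512,     # per-expert MLP size
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
    head_dim=64,                   # assumed: 256 / 4, not listed in the table
    num_experts=6,
    num_experts_per_tok=2,
    mlp_only_layers=[0],           # layer 0 is a dense MLP; layers 1-3 are MoE
    max_position_embeddings=512,
    max_window_layers=4,
    rope_theta=100_000,
    tie_word_embeddings=True,
)

model = Qwen3MoeForCausalLM(config)
print(f"total parameters: {model.num_parameters() / 1e6:.1f}M")  # lands near the 10M quoted above
```

Because only 2 of the 6 experts are routed per token, each MoE layer uses roughly a third of its expert weights on any given forward pass, which is where the roughly 6M active-parameter figure comes from.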
## Training
We trained LWTMoE on 102M tokens of LessWrong content. The specific dataset is linked here.
## Hardware
LWTMoE was trained on one NVIDIA T4 GPU in Google Colaboratory with a batch size of 16.
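For anyone trying to reproduce the run, the setup corresponds roughly to the `TrainingArguments` sketch below. Only the batch size (16), the step count (25,000), and the 2,500-step eval interval come from this card; the learning rate and the other values are placeholders rather than what was actually used, and the real run may not have used the `Trainer` API at all.

```python
# Rough sketch of the single-T4 training setup. Batch size, max steps, and the
# eval interval come from this card; the remaining values are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lwtmoe-10m-a6m",
    per_device_train_batch_size=16,   # from this card
    max_steps=25_000,                 # matches the last row of the results table
    eval_strategy="steps",
    eval_steps=2_500,                 # matches the eval points in the results table
    fp16=True,                        # the T4 supports fp16 but not bf16
    learning_rate=3e-4,               # placeholder, not the value actually used
    logging_steps=500,
    save_steps=2_500,
    report_to="none",
)
```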
## Training Results
| Step | Val Loss | PPL |
|---|---|---|
| 2500 | 4.5856 | 98.1 |
| 5000 | 4.1450 | 63.1 |
| 7500 | 3.9416 | 51.6 |
| 10000 | 3.8021 | 44.7 |
| 12500 | 3.7019 | 40.6 |
| 15000 | 3.6332 | 37.8 |
| 17500 | 3.5840 | 36.0 |
| 20000 | 3.5492 | 34.8 |
| 22500 | 3.5277 | 34.0 |
| 25000 | 3.5160 | 33.6 |
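The perplexity column is the exponential of the validation loss, so the two columns can be sanity-checked against each other:

```python
# PPL = exp(validation cross-entropy); the table rows agree with this
# relationship to within rounding.
import math

for step, loss in [(2500, 4.5856), (15000, 3.6332), (25000, 3.5160)]:
    print(f"step {step}: exp({loss}) = {math.exp(loss):.1f}")
# step 2500: exp(4.5856) = 98.1
# step 15000: exp(3.6332) = 37.8
# step 25000: exp(3.5160) = 33.6
```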
## Generations
**Prompt:** `The`
**Output:**
world has been very useful.
* **Diplomacy:** Take the world as a whole, it's often difficult to understand what it does and how you think about your future experiences in this way (and thus it is not necessary). But with the idea of taking an AI system for that goal, there are many possible ways in which it can be applied to humans or other AIs, such as [Singzen](https://www-cdn.anthropic.com/1adeficial-reasoning/) and [CFAR's AI alignment frameworks](http://www.damnnecius.org/), including some more "civilization" version of the AI safety community. This may well help us address these problems if we're trying to find them interestingly; but I don't really need to know anything about this. There might also be more work on building AI systems than we do. It would require making people go down the lines we have nowhere near the end of our life - but even then, things will become better at doing stuff like our own. If so, what kind of research should we make into account the extent to whoever doesn't want to pursue its goals? And I wouldn't worry: suppose it could actually take over, no one else gets it done! But that seems like an opportunity to get out of the right side until we've heard ourselves having their power to achieve their goal.
* What sort of problem-solving argument could happen if everyone were willing to take actions from each other. Or perhaps they'd come up with something that would produce a bad outcome where someone was wrong – and sometimes he'd ask me if I had the money to build AGI. We didn't realize why the human mind wasn't going to die before the last two years ago. So yesterday people got out of their circles, but when it came to believe, maybe they would look like they'll just get off by it because they'd say nothing about whether you already knew the next step.
* Why aren't you working together for the first time the first day of the gamework is too expensive, rather than brave enough that they won't give you any feedback to the second weekend for the first hour without being able to run. In particular, they might still be able to figure out what happens.
* To those who want to live in ARC, let's try to pick a
**Prompt:** `#`
**Output:**
**SAEs are not only good, but they don't yet have to be at the end of this post.** It's easy for people to find some sort of "catches" in their mind which is basically just using an abstraction as a set of features and functions that would allow them to perform well. They're also useful as part of a sequence of neurons which can do it with those two components:
* *Theorem Barfish*, where we think about each neuron (which corresponds to what's going on in our neural network) or something like that. For example, you could imagine that when I talk about LTA at all, then the [Normal Frameworks](https://arxiv.org/abs/2402.13744), which seems plausible because if you want to use NOR-in-the-state-of-context learning theory, there may be ways to improve these algorithms. If you might get a bunch of stuff into your brain, you need to know how much of a thing from me. So here are my guesses about whether or not this is relevant to learning things from other concepts - but mostly some of the work that helps us understand more about what is happening.
*Putting up every textbook you want yourself to learn, so let's go back to a bit more detailing any of these concepts. The idea isn't really obvious. You might find a way to explain why [this essay](https://www.lesswrong.com/posts/gMJ4XDb6jCKGvhqnF/how-to-play-predict-conceptivity?commentId=HQBw9ZzYyxErL8WRd). And note that I'm interested in trying to give this discussion, especially with the rest of this section on the topic. But yeah, I feel like most researchers were interested in thinking that AI research was very hard different reasons. And I don't believe, like, "Do you have done this right now?" I expect AIs should still have a lot less time than others at some point, and I've been doing better than humans by years before, though they seem promising enough to make progress without getting rid of alignment. What will happen next year? Like, maybe someone gets the money as a whole!
This feels great, however
**Prompt:** `AI`
**Output:**
development is likely to be a very promising approach. The most recent progress on AGI's alignment research, and the general case of transformative AI will emerge from AIs trained with the same capability level; I think that it would not significantly decrease compute as the scale over time could result in more capable capabilities than humans do, e.g., [AI 2027](https://ai-2027.com/blog/trendanon).
* **More generally**:** We are currently getting ahead of us at [Artificial Intelligence: A Conjecture for Humanization Institute.](https://www.alignmentforum.org/posts/vHCuWKh5zLYpSw7Ea/agi-safety-controlling-problem)[^2] The first step forward this year was to help me understand why we should expect takeoff scenarios by 2040, and how much does superintelligence exist? In particular, if we can get into these risks, and what they need to do about them (though there aren't some reasons to have done so far.)
* If an existential risk has been done, then there might be other opportunities for increasing our chances of AGI being solved in a way which is more efficient but less expensive because the next 10 years were required. This kind of work would allow us to build up some of those options under a given range of timelines, including many different ways of thinking about "safe"", where even though such plans don't make sense at all, it may take longer than you expected, especially when the future is too low and faster to actually solve the problem.
* For example, suppose that a human-level plane here ends up doing something like 50% of things or 80%. And there's no reason to believe the goal isn't worth it, unless it feels crazy to me that it takes place around 10%, compared to 100x slower than it depends on how well it gets into AI safety. It probably wond out if someone wanted to give up with it and also tried it right away.
**Prompt:** `Through a`
**Output:**
between fine and dying than most other people, the fact that I am a miracle is an extremely large number of different species.
I think this sort of thing has been done in terms of how many countries with more powerful AIs are likely to be able to achieve some goal or not by doing it without having any goals (or if I get these kinds of AI alignment stuff), but I do not think there's much better ways to take seriously what you want. There was something like 2-3 years ago when humans were just going through the game before their current job was created as a whole. But in general I'm very confident I have to use them all for me, so we'll go back to a bit more precisely because they're trying to prevent things from getting into training my ability to figure out which side of our capabilities will lead to AGI at a higher rate. So I'm worried about using the two groups here:
**Rohin Shah:** Yeah, let's say no one else does! And then... now that part of why it could actually happen is… well, I don't know whether Eliezer's research project is important, nor I see it today, and maybe he thinks \[the\] field isn't really hard; it's probably only a little more of a big deal. I also haven't heard it yet. What would help me? We've got lots of additional data on AI safety work and making progress on this topic, though I still won't come up with various forms of discussion that feels less useful than it seems like it should make us more valuable to build superintelligence as a result. And I feel like I can possibly solve it! And this is mostly my main point at the idea that we will need to solve this problem, but I expect I should spend time in the next few months.
## Benchmark
| Model | Total Params | Activated Params | Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent |
|---|---|---|---|---|---|---|---|
| gpt2 | 124M | 124M | 0.4227 | 0 | 51 | 78 | 171 |
| LWTMoE-10M-A6M (this) | 10M | 6M | 0.3520 | 1 | 31 | 49 | 219 |
| LWTDense-6M | 6M | 6M | 0.2962 | 2 | 13 | 32 | 253 |
| Pythia-14m-deduped | 14M | 14M | 0.2846 | 0 | 23 | 24 | 253 |
To evaluate the coherence, factuality, and fluency of our models, we use Qwen3-32B to judge 300 generations from unconditional prompts. Our model, LWTMoE-10M-A6M, beats Pythia-14m-deduped with less than half its active parameter count, and trails only gpt2, which has roughly 20x the active parameters.
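The evaluation loop looks roughly like the sketch below. It is illustrative only: it assumes a `generate(prompt)` helper like the one in the Inference section and a hypothetical `judge(text)` callable wrapping Qwen3-32B, and it does not reproduce the exact judge prompt or the way the four labels are turned into the aggregate score.

```python
# Illustrative sketch of the LLM-as-judge evaluation, not the exact code used.
# `generate` samples from the model under test; `judge` is a hypothetical
# callable that asks Qwen3-32B to label a generation with one of four classes.
LABELS = ["coherent", "mostly coherent", "partially coherent", "incoherent"]

def judge_generations(generate, judge, n=300):
    counts = {label: 0 for label in LABELS}
    for _ in range(n):
        text = generate("")  # unconditional: generation continues from <|bos|> alone
        verdict = judge(
            "Label the following text as exactly one of: "
            + ", ".join(LABELS) + ".\n\n" + text
        ).strip().lower()
        if verdict in counts:
            counts[verdict] += 1
    return counts
```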
## Limitations
- Does not generalize to out-of-distribution prompts
- No instruction-following
- Can't reason
- Mostly incoherent
## Use Cases
- Educational Research
- Deployment on constrained devices
- Or, for fun.
## Inference
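The script below loads the checkpoint and tokenizer from the Hub with the standard `transformers` API and runs an interactive sampling loop. Note the `output_router_logits = False` line, which works around a Qwen3MoE load-balancing issue during generation.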
```python
import torch
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast, AddedToken

# --- Config ---
MODEL_PATH = "harley-ml/LWTMoE-10M-A6M"
TOKENIZER_PATH = MODEL_PATH
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_K = 50
TOP_P = 0.9
REP_PENALTY = 1.3

# --- Tokenizer ---
tok = PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH)

# Register any special tokens the tokenizer is missing.
specials = {}
if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
if tok.pad_token is None: specials["pad_token"] = AddedToken("<|pad|>", special=True)
if specials:
    tok.add_special_tokens(specials)
tok.padding_side = "left"

# --- Model ---
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    # device_map="auto",
    # low_cpu_mem_usage=True,
)
# Qwen3MoE load-balancing bug workaround: keep router logits out of generation.
model.config.output_router_logits = False
model.eval()

# Move the model to the target device unless accelerate already placed it.
if getattr(model, "hf_device_map", None) is None:
    model.to(DEVICE)

# --- Debug prints ---
print(f"tok.vocab_size: {tok.vocab_size}")
print(f"len(tok): {len(tok)}")
print(f"model vocab size: {model.config.vocab_size}")
print(f"bos_token: {tok.bos_token!r} id={tok.bos_token_id}")
print(f"eos_token: {tok.eos_token!r} id={tok.eos_token_id}")
print(f"pad_token: {tok.pad_token!r} id={tok.pad_token_id}")
print("Qwen3MoE load-balancing bug workaround applied (output_router_logits=False)")


def generate(prompt: str) -> str:
    # Prepend <|bos|>, sample a continuation, and return only the new text.
    full_prompt = tok.bos_token + prompt
    inputs = tok(full_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            temperature=TEMPERATURE,
            top_k=TOP_K,
            top_p=TOP_P,
            repetition_penalty=REP_PENALTY,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    # Strip the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print("\nModel loaded successfully! Ready for inference.\n")
    while True:
        try:
            prompt = input("Prompt: ")
            if not prompt.strip():
                print("Empty prompt → exiting.")
                break
            print("Generating...")
            response = generate(prompt)
            print(response)
            print("-" * 80)
        except KeyboardInterrupt:
            print("\nInterrupted by user.")
            break
        except Exception as e:
            print(f"Error during generation: {e}")
            import traceback
            traceback.print_exc()
            break
```