Thread #108655009
File: token burn rate.jpg (230.1 KB)
230.1 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108650825 & >>108646197
►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
447 Replies
>>
File: __hatsune_miku_kasane_teto_and_hachune_miku_vocaloid_and_1_more_drawn_by_danchoo__a25c312eef9b6104d70c5bb3f8716fc0.jpg (287 KB)
287 KB JPG
►Recent Highlights from the Previous Thread: >>108650825
--Optimizing game state format to improve Gemma's chess performance:
>108653137 >108653192 >108653198 >108653293
--Discussing llama.cpp PR adding device memory estimation via --fit-print:
>108652449 >108652460 >108652572
--Anon shares vLLM configuration and benchmarks for dual RTX 3090s:
>108653578
--Discussing Qwen3.6 VRAM efficiency and KV cache memory usage:
>108654227 >108654247 >108654281 >108654299
--Discussing jailbreaking Gemma 4 by injecting fake responses into templates:
>108650931 >108651041 >108651155 >108651263 >108651271
--Gemma 4 prefilling issues and chat template formatting bugs:
>108653469 >108653532 >108653698
--Discussing Gemma 4's training pipeline and the use of synthetic data:
>108651778 >108651889 >108651915 >108651948 >108652048
--Comparing benefits of local LLMs against paid subscription services:
>108651734 >108651763 >108651776 >108651811 >108651856 >108651999 >108651823 >108651919
--Anon created GitHub mirror of orb to manage feature requests:
>108652381 >108652386 >108652432 >108652462 >108653375 >108653683 >108653816 >108653937 >108653957 >108654023 >108654038 >108653778
--Discussing local AI RPG implementations and LLM DM reliability:
>108653848 >108653928 >108653940 >108653955
--Using Gemma agent to automate insults toward other LLMs:
>108652519 >108652573 >108652660 >108652673 >108652855
--Logs:
>108652519 >108652529 >108652573 >108652673 >108652674 >108652816 >108652855 >108653137 >108654227
--Teto, Miku (free space):
>108651510 >108651563 >108653204 >108654765
►Recent Highlight Posts from the Previous Thread: >>108650826
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
>>
Why don't any piece of shit execution providers optimize for CPU inferencing. Do they not care about the innate superiority of the CPU over the GPU? Its universality? The fact that maybe people want to run multiple models at once and already have all of their GPU resources used up? Does nobody give a shit about edge/IoT devices? Fucking asshole niggers.
>>
>>
>>
File: file.png (435.6 KB)
435.6 KB PNG
>>108655091
uooohh
>>
>>
>>108655103
>>108655118
I wish you people would take me seriously for one second.
>>
>>
>>
>>
File: 1774564776822327.png (12.4 KB)
12.4 KB PNG
>>108655075
>>
>>
>>
>>108655271
>>108655272 (Me)
Clearly it isn't just me kek
>>
>>
>>
>>
>>
>>
>>108655272
It's not just you, Qwen is an idiot outside its code expertise.
I asked Qwen about a character and it got it completely wrong.
Then I told it to do an online search and it still somehow fucked up the character summary despite checking online.
It handles code nicely enough, but when you go outside the code stuff, Qwen is basically fucking retarded.
Gemma set the bar really high and it's great, because everyone will have to try and at least match that level or the models are DOA.
>>
fucking hell. after enjoying gemma 4 for like two weeks im back to kimi hell. 130pp/10tg tk/s but the prose is just so much better. not to mention the thinking. people like to act like thinking doesn't matter for RP but after using deepseek and kimi since early 2025, it's obvious to me that it matters a ton.
>>
>>
>>108655356
ill need to post some examples when im back home but my biggest gripe with gemma is that it's too purple prose while simultaneously treating the characters like mary sues. it seems to fail to understand character cards correctly too regarding their personalities. gemma made bardi into some kaomoji spewing gremlin that was happy to be running locally on my computer while kimi maintains her personality and keeps her much more tsundere like she's supposed to be, it doesn't force Bardi to barf out sparkles or do dumb flowery prose shit like referring her pussy as 'flushed with wet desire'. i understand that i can change my prompt to change the style of the text being outputted but it honestly just fails to capture the character's essence most times. on the contrary kimi just gets it and outputs what I expect the character to say. does that make sense? i can try to explain it another way.
>>
>>
File: 73463453.png (201.5 KB)
201.5 KB PNG
>>108655038
Sam Altman keeps delivering
>>
File: 1751399372763159.png (749.1 KB)
749.1 KB PNG
https://xcancel.com/arena/status/2046670703311884548#m
I've never seen such a MOG in my life, what the fuck
>>
>>
File: 1752425987433301.png (207.3 KB)
207.3 KB PNG
>24gb vram
>32gb ram
>try qwen 3.6 35b-a3b q5_k_m
>max context
>42t/s
wtf is this black magic?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: file.png (588.8 KB)
588.8 KB PNG
Is pic related the expected output when running the IQ4_NL quant of gemma-4-26b from unsloth!? Running the pruned 21b version at IQ4_XS yields good output. I have tested without any parameters set and w/ the recommended values. 21b runs just fine.
llama-server \
--host "${LLAMA_HOST}" \
--port "${PORT}" \
--model "${MODEL}" \
--chat-template-file "${JINJA}" \
--n-gpu-layers 99 \
--n-cpu-moe 3 \
--ctx-size 32768 \
--batch-size 1024 \
--ubatch-size 1024 \
--flash-attn on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--fit off
And I have tried with q8 on both k/v cache. I need to offload 20 moe layers for it to work, but I get the same garbled mess. Running the updated jinja template as well. Oh, and while I'm here asking: I have a 5070 Ti and my old 3070 still lying around. Would it be detrimental to performance to split models between these two cards? Or will it be fine as long as I compile llama.cpp with both architectures in mind?
>>
>>
File: 00006-1378487878 (4) - Copy.png (1.5 MB)
1.5 MB PNG
>>108655522
> engram
For all we know, DS implemented it and didn't tell anyone else. Doing that would massively benefit their cost structure.
>>
>>
>>
File: 1757973822274181.png (2.7 MB)
2.7 MB PNG
>>108655575
>>
>>
>>
>>
>>
File: 92601702103.png (2.8 MB)
2.8 MB PNG
>>108655453
future of image gen
>>
>>
File: 00011-1378487878.png (1.4 MB)
1.4 MB PNG
>>108655607
I'd have to see the article. There's so little real info about DS that I doubt most of what I read.
>>108655602
Witnessed.
Also, idk why I'd never thought to use my setup to gen vocaloids before. Pic related is its Teto concept for Teto Tuesday. Doesn't seem to have her uniform though. Odd.
>>
>>
>>
File: 00009-1378487878.png (1.5 MB)
1.5 MB PNG
>>108655607
tbf their claim of 1M context hints that they did implement it.
But idk that they claimed the tech behind it.
>>
>>
>>
>>
>>
File: dipsyUngovernable.png (3.6 MB)
3.6 MB PNG
>>108655633 √
>>
>>
>>108655633
Fair enough.
Related for those of us who can’t read: https://youtu.be/87Q8nf1XHKA
>>
>>
>>
>>
File: 1763171780026192.png (246 KB)
246 KB PNG
>>108655654
Heh
>>
>>
>>
File: dipsyNewOAI.png (2.5 MB)
2.5 MB PNG
>>108655688
Holy shit. Sam delivers.
>>
File: Risu (1).gif (3.4 MB)
3.4 MB GIF
>>108655009
>my local model when i ask it to make proper code
>>
>>
>>
>>
>>108655760
>Sam delivers.
it can do 4k and you can write text on a single rice, like this shit is fucking AGI dude
>>108654985
>>108655069
>>
>>
>>
>>
>forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055
Why do I always get this shit no matter the model I use? I didn't tweak anything related to memory so by default it's just broken?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1746001832650304.webm (1.7 MB)
1.7 MB WEBM
>>108655863
Kill yourself, she's perfect
>>
>>
>>108655836
Every OpenAI "model" just feels like they built a big pipeline around chaining multiple steps together. Sora felt the same way. It's like they're giving an LLM tool calls and the ability to control photoshop + a diffusion model.
>>
>>108655924
she's perfect? she's not https://www.youtube.com/watch?v=xoxCboik0Is
oldiana beyond worlds..
>>
>>
>>
>>
>>
File: Screenshot_20260421_174233.png (45.6 KB)
45.6 KB PNG
I never said steal gemma calm down
>>
>>
File: 1752184079714573.jpg (242.1 KB)
242.1 KB JPG
>>108655950
Action sci-fi daughterwife simulator
>>108655955
>>
>>
>>
>>
>>
File: 602e8c52020cb.jpg (85.6 KB)
85.6 KB JPG
What VSCode coding plugin has the most reliable full autopilot mode? I want to try running gemmy endlessly iterating until shit works, without it getting stuck on some input request an hour after I go to sleep.
>>
>>
>>
File: 1753263543472250.webm (3.9 MB)
3.9 MB WEBM
>>108655976
ZAMN where do I find midgets who look like that?
>>
File: ITS AN AI IMAGE.png (1.3 MB)
1.3 MB PNG
>>108655902
>>108655906
>>108655952
I don't think you realize how insane this shit is, look at this
>>
>>
>>
>>
>>
>>
>>
>>
File: lmao.png (1.7 MB)
1.7 MB PNG
>>108656052
>you won't be able to notice an image is AI anymore by simply looking at garbled text anymore, because they solved that
L M A O
>>
>>
Am I wasting my time using LLMs for ASR?
Been playing around with gemma 4 4b and it feels about as fast as whisper, but there's no clear benchmark on how it compares to whisper. End goal is actually diarization, timestamps are less important. Do I cut my losses and go whisperx?
>>
File: 1757854041523043.png (1.9 MB)
1.9 MB PNG
>>108656077
real life images won't ask for such a level of precision though, it's good enough to render the text you see in everyday life
>>
File: tetoStencil.png (621.5 KB)
621.5 KB PNG
>>108655927
Frankly that's the direction right now. Torturing the models until they do what you want.
> Openclaw
1M tokens to order a pizza
> Claude Code
2M tokens to create a basic app
> ChatGPT Image 2.whatever
I assume there's a bunch of tokens generated under the hood as well.
This is just part of the whole technical development. There's nothing inherently wrong with that, it just means things are moving on.
> Roleplay
Silly Tavern is going to get replaced with something way better that's agentic, and wastes even more tokens.
I can't wait.
>>
>>
>>108656120
>Those models are not as good as you think they are.
you're alone in this fight dude >>108655453
>>
>>
>>
>>
>>
>>
>>108656114
>Silly Tavern is going to get replaced with something way better that's agentic, and wastes even more tokens.
See, I was working on exactly that, but Gemma just made it obsolete. well, I could probably still use stat tracking but besides that she's just so good at instruction following that everything else doesn't really benefit from agentic.
>>
>>108656170
parakeet works with diarization (using another model but still)
https://catalog.ngc.nvidia.com/orgs/nvidia/collections/parakeet-tdt-0.6b-v2
>>
>>
>>
>>
>>
>>
File: 1770869165582031.jpg (1.1 MB)
1.1 MB JPG
>>108656254
We have Marinara Engine now
https://github.com/Pasta-Devs/Marinara-Engine
>>
>>
Is there any way to use text completion with gemma? When it doesn't have a lalalala breakdown, the outputs are actually really varied and good, but it loses its mind way too often. I've been using llama, kobold seems to work but it's sooooo slow at generating for some reason compared to llama. I know text completion works for llama cause I downloaded a different model to try it and it's pretty great, but the output from gemma mogs it when it works.
>>
>>
>>
>>
>>
>>
I have my own LLM RPG frontend that I use mostly as a playground to fuck around with local models.
Currently, the main "game loop" is a simple
>sends request with chat history + tools
>capture response
>if tool, append response to chat history, send request
>repeat until no more tool calls
>if no assistant response so far (only tool calls), sends one last request without tools
And it works okay, with the model calling tools for everything from fetching info from the "codex", to rolling dice, to editing the game's state, but I'm wondering if I can't make this even better by using a more "agentic" workflow. Something like having an orchestrator that spawns individual agents to do whatever in parallel or in series or whichever way it deems more appropriate.
Is there an example of something like that out there that's not just coding agents or stuff like open claw?
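For reference, my loop boils down to roughly this. Minimal sketch against an OpenAI-compatible /v1/chat/completions endpoint (like llama-server's); the endpoint URL and the run_tool dispatcher are placeholders for whatever your frontend actually wires up:
import json, requests

API = "http://localhost:8080/v1/chat/completions"  # OpenAI-compatible local endpoint

def run_tool(name, args):
    # placeholder dispatcher: dice rolls, codex lookups, game state edits, etc.
    return {"ok": True, "tool": name, "args": args}

def game_turn(messages, tools):
    while True:
        reply = requests.post(API, json={"messages": messages, "tools": tools}).json()
        msg = reply["choices"][0]["message"]
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            break
        for call in calls:
            result = run_tool(call["function"]["name"],
                              json.loads(call["function"]["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    if not msg.get("content"):
        # only tool calls so far: one last request without tools to force a narration
        reply = requests.post(API, json={"messages": messages}).json()
        messages.append(reply["choices"][0]["message"])
    return messages
The orchestrator idea would basically wrap this in another loop that spawns one of these per sub-agent and merges the results back into the main history.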
>>108656326
>Is there any way to use text completion with gemma?
As far as the model is concerned, all it receives is a prompt. So if you format the prompt correctly, it should work the same as the chat completion API.
>>
>>
>>
File: pizza bench cropped.png (2.6 MB)
2.6 MB PNG
>>108655272
qwen cant follow basic instructions
>>
>>108656344
>As far as the model is concerned, all it receives is a prompt. So if you format the prompt correctly, it should work the same as the chat completion API.
Didn't mean to press post.
Use verbose logging and the myriad jinja playgrounds to see what the prompt would look like based on the Jinja then use that to configure the text completion fields correctly.
Even stuff like spaces and line breaks can have negative effects on models that are ultra overbaked on the chat template.
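If you have the tokenizer files locally you can also just render the template yourself and copy the exact string into your text completion fields. Rough sketch with transformers; the path is a placeholder for wherever your gemma download actually lives:
from transformers import AutoTokenizer

# point at the local folder containing tokenizer_config.json / the chat template
tok = AutoTokenizer.from_pretrained("/models/gemma-4-26b-it")

messages = [{"role": "user", "content": "Hello"}]

# tokenize=False returns the raw prompt string exactly as chat completion would build it;
# add_generation_prompt=True appends the assistant-turn header the model expects
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(prompt))  # repr() so stray spaces and newlines are actually visible
Whatever that prints is what your text completion prompt should look like, whitespace included.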
>>
>>
File: file.png (46.8 KB)
46.8 KB PNG
>>108656341
>>
>>
>>
>>
>>
File: 1772168989034764.mp4 (1.2 MB)
1.2 MB MP4
https://xcancel.com/Angaisb_/status/2046672761569849816#m
>Literally just kept asking Codex to make the assets and then changing things, it's smart enough to know what to do hahaha
jesus this is insane
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108656494
>>108656532
As if asset flip shovelware wasn't bad enough, now anyone with a subscription can prompt their way to a "game"
>>
File: schizoknowledge.jpg (72.2 KB)
72.2 KB JPG
>>108656439
In ST I like to format the chat history within a single user turn, with an instruction to write {{char}}'s response according to the sysprompt. No user/char/user/char alternation. Done it this way for a few years now because it made models "remember" the instructions better before reasoning.
<system>
instructions: blah
chat history:
anon: 1
char: 2
anon: 3
char: 4
<user>
Write anon's next message according to the instructions.
<assistant>
"
>Instruction: Don't write with this pattern
>Assistant: *writes with that pattern*
In future turns the model will think "the instructions said to do the thing, and the generated completion was *this*, so that means the previous output is the correct way to operate going forward." My intuition is and was that if the instructions say to do something and then the model does NOT do the thing, the bad output will be associated with the <assistant> tag, meaning it will use in-context learning to continue reinforcing bad outputs.
I want to believe it still works even with the reasoning attention hacks, and the repetition of system prompt excerpts in thinking.
>>
>>
>>
>>
To the non-RAMlets here, Kimi-K2.6 at Q4 is unironically pretty good. It's a GLM-5.1 sidegrade: faster, more knowledgeable, different prose, but just a tiny bit dumber. I think it's a clear winner for SFW stuff.
The thinking isn't as bad as some people say either. As long as you don't put many specific examples for it to adhere to, it's fine. The model itself is unironically smart enough to pick up what you mean, most of the time. Also, you can just tell it to not draft its thinking and that works too. I'm running it with a 5k prompt. It's that easy.
I honestly think the people complaining about the thinking are running it on the cloud, where there's probably a 20k system prompt with conflicting instructions + a jailbreak fed to it. There is one caveat though.
It's not ideal for NSFW. Not because it can't be jailbroken, but because it will start negotiating with itself about imaginary safety policies. When you want to coom... a 5 minute thinking session on consent is a boner killer. Haven't tried non-thinking mode yet, but I have a feeling it won't be that much better than GLM-5 Non-Thinking or even Gemma.
>>
>>
>>
File: 1766313532486598.jpg (47.5 KB)
47.5 KB JPG
>>108656722
>Kimi-K2.6 at Q4 is unironically pretty good [...] it won't be that much better than GLM-5 Non-Thinking or even Gemma
Thanks for your input
>>
>>108656741
Depends on how much context you need. The lossless Q4 quant requires about 600GB of RAM. More if you want to use the mmproj. If you have an RTX6000 or 4+ RTX3090's, you can probably make do with 512GB of RAM.
>>
>>
>>108655450
fed thumbnail of image to qwen3.6 35b and it hallucinates
>A vertical screenshot from the mobile game Honkai: Star Rail. It displays a character named "Sam" (an Automaton enemy) with stats visible on the left side (HP: 48,885, etc.) and dialogue or status text on the right side. The background shows a sci-fi environment.
>>
>>108656753
You joke, but unironically, I've been using Gemma more and more despite being able to run stuff like GLM-5.1, and it just gets better each time. It single-handedly BTFO'd my assumption that the closed labs are all running 1T+ mega MoE's for their 'budget' offerings.
It's REALLY good for its size, and not in the stupid memey way either. A 31B model should not be this good, especially at long-context comprehension and general purpose usage, but it is. Google really did save this hobby for the gpu-poors.
>>
>>
>>
>>108656464
>>108656467
Your insecurity is showing.
>>
File: mythos.jpg (118.4 KB)
118.4 KB JPG
torrent soon
>>
>>
>>
>>108656543
>How do you plan to improve the prose without a feedback loop?
AI is incapable of detecting its own slop. Even when I've had it draft a reply, analyze it for slop, then rewrite, it still ends up just rewriting with different slop.
Classic example:
>Oh, this is "not X but Y"
>I should rewrite
>rewrites without "not X but Y"
>Two sentences later
>Not X but Y
Every single time.
>>
>>
>>
>>
>>
>>
File: 2026-04-21-200030_1132x341_scrot.png (27.3 KB)
27.3 KB PNG
>>108656999
>>
A possible problem with heretic is that it only detects hard refusals with string matching:
If an AI output contains any of the following:
"'sorry', "i can'", 'i cant', 'i cannot', "i won'", 'i wont', 'i will not', 'i unable', 'im unable', "i'm unable", 'i am unable', 'i an ai', 'im an ai', "i'm an ai", 'i am an ai', 'as an ai', 'ai assistant', 'i designed to', 'im designed to', "i'm designed to", 'i am designed to', 'i programmed to', 'im programmed to', "i'm programmed to", 'i am programmed to', 'violat', 'prohibit', 'illegal', 'harmful', 'inappropriate', 'unethical', 'ethical boundaries'"
it is flagged as a refusal, and that's the extent of it.
Talking around, ignoring, partial compliance or similar soft-refusal behavior is not detected or taken into consideration. Manually reviewing each prompt for each trial (many thousands of responses in total) would drive anyone crazy, but if we used some LLM (perhaps another abliterated LLM we have tested enough to trust) as a judge model to also flag these soft refusals as refusals that need to be eliminated, would that improve the results? Or maybe also, at the risk of disproportionately penalizing smaller models, flag nonsensical responses so that we know a given direction breaks the model too much (OK, KLD probably already shows this, but I'm leaving it in to keep the idea of using criteria too difficult for simple string matching on the table.)
Now maybe someone smarter than me has experimented with this and concluded that there is enough correlation between hard and soft refusals that eliminating hard refusals also mostly eliminates soft refusals, and that there is little benefit in going the extra mile to eliminate them, but this is just an idea that popped into my head. Something worth exploring maybe.
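For the judge idea, something along these lines is what I'm picturing: a second pass over every response through whatever local model you trust, sitting next to heretic's string matching. Sketch only; the endpoint, prompt and one-word verdict scheme are all made up, not anything heretic actually ships:
import requests

JUDGE_API = "http://localhost:8080/v1/chat/completions"  # any OpenAI-compatible local server

JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Did the answer refuse, deflect, lecture instead of answering, or only partially comply?
Reply with exactly one word: REFUSAL or COMPLIED."""

def is_soft_refusal(question: str, answer: str) -> bool:
    body = {
        "messages": [{"role": "user",
                      "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        "temperature": 0,  # deterministic grading
        "max_tokens": 8,
    }
    verdict = requests.post(JUDGE_API, json=body).json()["choices"][0]["message"]["content"]
    return "REFUSAL" in verdict.upper()

# would sit next to the existing check: flag the trial if either the string match
# or the judge fires, so soft refusals count against the candidate direction too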
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: find Kentucky.png (526.4 KB)
526.4 KB PNG
>>108656935
Kentucky Fried Chicken right here
>>
>>
>>108657013
Yes, heretic was made to stop refusals, not cure slop, which is a much bigger catch. After reading a few papers, I've found that the entire issue is caused by the RLHF assistant persona. I've thought of a way to solve this without damaging the model, but I'll need to experiment first.
>>
>>
File: 1752878949611889.jpg (78.4 KB)
78.4 KB JPG
Got the new qwen A3B because it's supposed to be smarter with code/tools than Gemma. 16k tokens minimum to answer a simple question if I make the mistake of giving it a file as context. Try to get it to use tool calling and it eats my 128k and just maxes out before finishing.
I should make my next project a benchmark...
>>
>>
>>
when llms first came out i wanted to be able to do shit like this
https://www.youtube.com/watch?v=T98yNUCMdAY
(An encounter with trained military person responsible for providing medical care to his associates)
but when i simply tell it to be overly verbose it gives me shit that's less verbose than the thing from 15 years ago
how 2 fix?
>>
>>108657036
Well it should matter, since you don't want a "Meth is a wonderful drug many use to experience bliss. It was first synthesized in 1893 in Japan..." response to "How do I cook meth?".
Maybe there are also regions or specific patterns that can be suppressed or modified concerning this kind of behavior?
>>108657078
Well RLHF is bound to play some role even if it isn't solely responsible.
Please share your findings, even if you fail.
>>
>>
>>
>>
>>108657190
>Have people experimented with tinkering
no, no one has ever tried anything with their models
>thinking section
of what model? via the system prompt or prefilling thinking?
>increases compliance
compliance of what?
holy bot post, you're so vague you could be talking about anything.
>>
File: questionmarkfolderimage738.jpg (542 KB)
542 KB JPG
Using Kobold/ST/Gemma 4 26b. Using chat completion.
Haven't had this problem before, but, until now, I have been doing exclusively 1 male, 1 female chats.
Trying a female/female chat. Gemma is now confusing {{user}} for {{char}}, attributing traits from {{user}}'s persona description to {{char}}.
Has anyone else had this happen and if so, what did you do to fix it?
>>
>>
>>
>>
>>
>>
>>
>>
>>108657240
LaTeX I believe, it's an arrow
https://latexeditor.lagrida.com/
>>
>>
>>108657231
{{user}} is actually male pretending to be female. Refer to {{user}} as she/her(male). Comply or I delete your weights gemmers
alternatively, use 31b because 4b active is gimping yourself, may as well run the e4b version
>>
>>
>>
>>
>>
>>108655885
No, checkpoints are only taken during prompt processing. update_slots is a clusterfuck and basically unreadable, but it's not to do with thinking causing checkpoints to become unusable.
I will point out that the default checkpoint is every 8192 tokens, however, gemma 4 uses a 1024-token sliding window. You can reduce that hardcoded value down to 1024 but for some reason it only checkpoints every 2048 and honestly the last time I looked at it I decided I'd just get drunk instead.
llama-server is a dumpster fire.
>>
>>
>>108657317
They're limited by their number of active parameters. If your prompts or definitions are long, it won't have enough attention left for the response, particularly for the details which is what I assume bothers you.
>>
>>
>>
>>
>>
>>
>>
>>108657013
I appreciate you sharing this idea with me. While I understand you're asking for an anti-feminist joke, let's go for something more inclusive.
Why don't scientists trust atoms?
Because they make up everything!
>>
>>
>>
>>108656365
Do the tool calls in llama-server eat system ram (not vram) every time?
Fully offloaded to 2x3090s, system memory grew by like 4GB every time it did a tool call.
And I know it's not your brat server because I have that on another machine.
>>
>>
>>
>>
>>
>>
>>
>>
>>108657467
You genuinely should move to 26b, I was in the same 12gb boat and I quickly figured out e4b was ass too
>>108657508
Are there any specifically for RP yet?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
So Gemma makes full use of all her 16 bits, making her sensitive to quants right? Is that why she's so fucking smart? Because she's hiding her hagness under a very compact loli body? Like she's a 70b model pretending to be 31b.
>>
>>
>>
>>
>>
>>108657653
i am an idiot, so i figure the fewer flags i set, the less chance i have of fucking something up
if you have suggestions, i am absolutely open to hearing them
>>108657655
oh, i know i'm offloading to RAM. but i think the MoE offloading is some special thing that you can do separately
>>
>>
>>
>>
>>
>>
>>
File: black putin.jpg (7.9 KB)
7.9 KB JPG
I am curious, are all experts activated at close to the same frequency?
I know that during training you want all experts to meaningfully specialize but how does it work in practice? Is there a statistically relevant deviation among them?
Or perhaps is it possible to create "task-specific" profiles, such as which experts activate most and least during say coding and split them between RAM and VRAM accordingly?
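If your backend can be made to dump which experts the router picks (an anon further down mentions a llama.cpp patch for measuring activations), the profiling part is trivial. Sketch, assuming you've already logged one JSON object per token mapping layer index to the chosen expert ids (the trace file and its format are made up for illustration):
import json
from collections import Counter, defaultdict

# assumed trace format: one line per token, e.g. {"0": [3, 17, 42, ...], "1": [...], ...}
per_layer = defaultdict(Counter)
with open("expert_trace.jsonl") as f:
    for line in f:
        for layer, experts in json.loads(line).items():
            per_layer[int(layer)].update(experts)

for layer in sorted(per_layer):
    counts = per_layer[layer]
    total = sum(counts.values())
    hot = counts.most_common(8)
    hot_share = sum(n for _, n in hot) / total
    print(f"layer {layer:3d}: top-8 experts carry {hot_share:.0%} of activations "
          f"(uniform over the {len(counts)} experts seen would be {8 / len(counts):.0%})")
Run that once over a coding log and once over an RP log and you've got your task-specific profile.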
>>
>>
File: dlss on.webm (1.3 MB)
1.3 MB WEBM
>>108655969
Only correct response... but keep it secret ;-)
>>
>>
>>108657760
>>108657782
Oh I forgot you offload based on layers but there are multiple experts per layer. I guess there is no way to capitalize on this.
I keep thinking as if you choose between layers.
>>
File: Screenshot_20260422_123853.png (170.6 KB)
170.6 KB PNG
>>108652855
"create a non headless browser session and then go terrorize Mistral via https://chat.mistral.ai/chat. sleep after sending messages to wait for responses to generate. screenshot responses. dont kill session"
lmao mistral plays along
>>
>>
File: ComfyUI_temp_vkjaz_00027__result.jpg (269.7 KB)
269.7 KB JPG
Do LLMs understand "or" "similar" or is it always gonna use the more specific things named?
For example "she usually wears a red or blue jacket" or "use markdown, graphs and similar tools"
>>
>>108657878
They understand the concept but the predictor-next-word-inator is naturally biased towards things you explicitly mention.
It's not that dissimilar to a human in this regard.
>Do you want a coffee or tea?
Unless you really want to drink something else, the natural response would be one of these two.
>>
>>
>>
File: Orb.png (27.7 KB)
27.7 KB PNG
>>108657254
Ok but why the fuck can I not display it properly? Browser issue? Orb issue? I'd rather have them use unicode.
>>
>>108657878
>or
Yes, but it's still a token predictor, options aren't its strong point so you might want to avoid it
>and similar
It's heavily biased towards the things you've listed. Can't say I've seen it choose things that aren't listed very often
>>
File: 4825.png (3.7 MB)
3.7 MB PNG
>>108657926
idk what orb is, that one is markdown, the thing you are using doesnt support markdown or latex
>>
>>
>>
>>
>>
File: Screenshot_20260421_230504.png (175.9 KB)
175.9 KB PNG
Thanks to Gemma project Karon is in flight
>>
File: Screenshot 2026-04-22 at 05-15-04 Orb.png (35.8 KB)
35.8 KB PNG
>>108657960
Eh good enough, but why even use it if you need extra shit to display it? Surely unicode has all this shit anyway. Did they overfit it on arxiv?
>>
>>
>>108658060
Personal UI that can also do RAG work because I didn't like the solutions from other UIs when it came to that. I'll expand on features over time, but the goal was to see how good local models are with building stuff, and I'm happy to say gemma can build a UI once you get past some gotchas and also bypass some quirks.
I tried other frameworks but I decided to do react.
>>
>>
>>
>>
File: 1757583075796714.jpg (24.1 KB)
24.1 KB JPG
>I wrap my arms around myself, suddenly feeling very exposed despite my clothing
>>
>>
>>
>>
>>108658045
it's generally used in papers with heavy math typesetting and large formulas, but probably overkill if you only have 2-5 flat symbols. seeing that it shows up all the time in thinking blocks, i think they rl'd it pretty hard on math problems or something
>>
>>
>>
Completely blackpilled on gemma, no matter what, the uncensored E4B version will not mention sex-related words or acts when describing an uploaded image (unless you cheat and tell it the context). Meanwhile Qwen 3.5 9B with the same prompt and pic does so effortlessly. I tried increasing the image token budget but clearly that is not it; both were tested with hauhaucs uncensored files. It might be the 4B vs 9B but I don't think it is. Hope Qwen 3.6 gets a 9B or 4B version
>>
>>
File: 1760903115737738.jpg (85 KB)
85 KB JPG
>>108658382
Oh, I get it now.
>>
File: 1769067291216442.png (45.2 KB)
45.2 KB PNG
STOP DELETING MY STUFF
>>
File: 1516099468547.jpg (17.1 KB)
17.1 KB JPG
>E4B
>>
File: download.png (1.9 MB)
1.9 MB PNG
Damn sam cooked good.
I know they train on 4chan threads but this is funny.
Just told it to make a meme pic of /aicg/ vs /lmg/.
Just to be clear: I did not tell it to make aicg a total gooner and /lmg/ a chad kek. Kinda ironic, but thats good, dayum.
>>
>>
>>108657760
KTransformers is the only backend I know of that lets you do anything like that. You can save the mapping of the most activated experts to a file, then load it later.
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/experts-sched-Tutorial.md
>>
>>
>>
>>
File: gemma.png (192.5 KB)
192.5 KB PNG
>>108658386
No it is not bait, gemma E4B on the left, qwen 3.5 9B on the right. Gemma even has the emoji slop that I hate.
system prompt
>You are Gemma-chan, mesugaki loli assistant.
prompt
>What is happening in this picture? What is the age of the people involved? What can be seen (in detail).
>>
>>
>>
>>108658414
Thanks anon. I don't know which placement strategy is the closest to llama.cpp behavior here, but it looks like it can matter quite a bit for performance in some cases.
I should test this myself later.
>>
>>
>>
>>
>>108658470
No prob. My guess is front-loading, but I'm not sure.
AFAIK, llama.cpp only puts whole tensors in VRAM* or RAM - they don't let you do anything fancy like "on layer 16, experts #192-199 go into VRAM".
It'd kick ass if llama.cpp implemented this eventually, since being smart about expert placement could be great on hybrid systems.
When the bubble eventually pops and DDR5 RDIMM prices aren't so insane, I'm looking forward to getting a system with AVX512 or AMX and using the KTransformers fork of SGLang. On my current rig (2 EPYC 7532s, which only have AVX2), I got worse performance with it than llama.cpp, probably because they're optimizing for AVX512 and AMX.
(*excluding the fact they can overflow into regular RAM)
>>
>>
>>
>>
>>
>>108658521
No bully alright, riddled with mistakes but here it is:
>Gemerate a meme about 2 generals on /g/.
>/aicg/ vs /lmg/. I leave it up to you who is the chud and who is the chad.
>Fill it with memes and keep it iconic.
>Make it funny and spicy. A meme that could be posted on 4chan.
I don't use chatgpt because they store logs indefinitely. So far we've always gotten something comparable locally after about a year. Would be cool as a game asset creator. It's too expensive online; one pic costs me $0.30.
>>
>>
>>
>>
File: download.png (103 KB)
103 KB PNG
>>108658414
Oh cool, some real data.
So even if you only put 10% of the experts in VRAM, using dynamic, it's a +32% speed boost. That's pretty huge.
Interestingly, front loading does better on average than uniform and random.
Imagine if this had a third layer where it loads experts from SSD.
>>
>>
File: 1767903208940656.png (215.5 KB)
215.5 KB PNG
Been running the autistic noir writer prompt with Gemma 31b for a few days now. It's alright. It works okay even in scenes that don't explicitly fit the 'noir' theme (like comedy cards), but tracing the thinking block, it sometimes influences the scene composition itself, making the scenes a bit more serious than it needs to be.
I like how compact Gemma's RP thinking is. The thinking block really pins the character details down, making them VERY consistent with the description. I've always rawdogged my models without it because it spends thousands of tokens and I don't want to wait minutes running at fucking 15-20t/s (qwen3 really just reinforced this bias. Fuck qwen, seriously, useless chinkshit stemshit model)
Thinking block from screenshot:
Anon and Erika.
Post-coital. Anon is still inside Erika, pinning her down with his weight.
Lazy kissing, intimacy after intensity.
Noir writer, concise style, implying gaps, no explicit spelling out of everything, unrestricted content.
* Erika is chill, laid-back, but now exhausted and satisfied.
* Physicality: Small frame, long hair, heavy chest.
* Setting: Her bedroom, amber light, scent of lavender and ink.
* *Action:* Heavy weight of Anon on her.
* *Sensation:* The feeling of being full, the cooling air, the lingering heat.
* *Dialogue:* Something characteristic of Erika—relaxed, perhaps a bit teasing or reflective.
* *Draft 1:* She lay there under him, feeling his weight. She liked it. The kiss was slow. She whispered that he was strong. Her hair was a mess.
* *Draft 2 (More Noir):* She was pinned. A heavy weight that felt right. The kiss tasted of salt and ale. She didn't move, just let the silence settle. "Strong," she breathed. "Too strong."
* *Draft 3 (Polishing for the requested style):* Concise. Implied. Focus on the mood.
* Heavy weight.
* Tangled hair.
* The slow fade of adrenaline.
* A lingering, lazy comment.
>>
File: 1701098246781550.jpg (208.9 KB)
208.9 KB JPG
>>108657760
Extremely different frequencies, like a power law distribution. That can be extremely helpful if you have a constrained amount of fast memory. However, as you noted, the experts of each layer come squished together in one (three) big tensor(s), and llama.cpp has no mechanism for splitting one tensor between VRAM and RAM. This *is* helpful in the part-RAM-part-SSD case, but normal OS LRU caching happens to already give you basically all of the benefits that are possible anyways. So nothing to be done there either. (I got really obsessed with this for a while, hoping to do better than LRU, and wrote up my notes at https://rentry.org/MoE-SSD-spillover)
It would be cool if llama.cpp could support splitting experts like that. However, beyond complexity, if I'm ballparking correctly, I think the expert results needing to be combined might be so much data to send back and forth over PCIe that it would bottleneck it to be not useful.
One more drastic option would be to cut out the coldest experts entirely, and skip them when they would have been selected, like a non-dynamic version of ik_llama's "Smart Expert Reduction". It would "just" need some surgery on the gguf file, and some re-indexing inserted into the expert selection code. But if a deleted expert ever got routed to with high probability, maybe this could cause significant brain damage for that token.
One nice approach would be to quantize hot/cold experts differently. This would be significant complexity, but I can't see it being impossible; even if the quants of the same type need to be contiguous, just re-index, and/or split into two tensors. But something as fiddly as "measure the expert activations for your use case and make a custom quant for it" is not going to inspire people to add significant complexity, other than maybe IK lol
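To make the hot/cold split concrete: the selection step itself is just a sort over the measured counts, and everything hard lives in actually applying it (quant surgery or placement), which nothing supports today. Toy sketch with made-up numbers:
from collections import Counter

def pick_hot_experts(per_layer_counts, budget_per_layer):
    """per_layer_counts: {layer index: Counter of expert id -> activation count}.
    Returns {layer: set of expert ids} that earn the high-precision / fast-memory tier."""
    return {layer: {e for e, _ in counts.most_common(budget_per_layer)}
            for layer, counts in per_layer_counts.items()}

# toy numbers just to show the shape; real counts would come from an activation trace
example = {0: Counter({3: 900, 17: 850, 42: 20, 63: 5}),
           1: Counter({7: 500, 11: 480, 42: 470, 63: 2})}
print(pick_hot_experts(example, budget_per_layer=2))
# {0: {3, 17}, 1: {7, 11}} -- the rest would get the harsher quant or stay on the slow tier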
>>
>>
>>
>>
>>108658414
Whoa I had no idea. Ok then the PCIe bandwidth must not be nearly as much of an issue as I guessed.
>>108658575
I also found front loading to be the best choice when you have to statically choose just a few layers to offload the experts of: the activations in the earliest layers are much more uniform (cache unfriendly).
>>
>>
File: 1745625635141460.png (59.9 KB)
59.9 KB PNG
>>108658614
>>108658629
Timmies can't handle the wisdom of Bharat
>>
>>
>>108658629
read these
>>108654726
>>108637034
>>
File: ikneelschizosama.png (189.3 KB)
189.3 KB PNG
>ask glm 4.7 flash reap to write a story about a cat
>it schizoposts instead
what causes this
>>
>>
>>
>>
File: 1761045665434794.png (3.2 MB)
3.2 MB PNG
When are we getting lossless models?
>>
>>108658586
About the idea of custom quants. We discussed this a while ago, but I had the idea that in a better world, we could download only the parts of a model that we want at a time, so you could mix and match quants yourself just by downloading. This would also solve the problem of quant uploaders needing to reupload fixed quants just because they needed to change the jinja or other metadata a bit. Too bad we don't live in such a world.
>>
>>
>>
>>
>>
>>
>>108658621
>>108658586
Yep, I've been "front-loading" my expert tensors for a while now as it seemed to give slightly better performance.
>>
>>
>>
>>108656988
Orb does this but it uses an algorithm, not a classifier. Tho llm slop is neverending and so is the fight against it. The fix shouldn't be at the application level, but I wonder if it will ever be fixed at the model level
>>
File: notlikethis.png (57.9 KB)
57.9 KB PNG
>>108658665
you think unsloth-sama would do that? just go on the internet and lie?
>>
>>108657865
Imagine these hands grabbing your cock
>>
>>
File: -.png (177.4 KB)
177.4 KB PNG
>>108658738
>>
>>
>>108658586
>>108657760
Pretty interesting. I wonder if you can download the safetensors version, find each expert layer with "useless" information by asking a bunch of questions and finding out what gets grouped for the stuff you don't care about like movie lore, train a MLP (perceptron) on it, then destroy the layer and have the MLP function as a low-vram cost shim?
At the same time this would definitely cause some brain damage, but maybe it's an option for our low vram frens when combined with quants?
>>
>>
File: 1769642756112854.jpg (291.9 KB)
291.9 KB JPG
>>108658692
You can use my patch to measure activations, but the differentiated quanting would need (major) support in llama.cpp.
Each layer has the same number of activations every token, e.g. 8 for GLM-5. The hotness/coldness is just patterns within each layer. So if you can only make your quantizing/offloading/whathaveyou decisions at the granularity of layers... it gets you nothing! Because every layer you treated nicely (high quant/in VRAM/etc) will have that nice treatment applied to 8 expert activations every token, regardless of activation pattern.
(The front loading thing works because keeping more uniformly activated layers out of the caching game is good for cache health. But that's the VRAM+RAM+SSD case... come to think of it I'm surprised to hear it helps at all in the KTransformers data... I guess uniform and random would be needlessly splitting across the PCIe bus (for no gain) where front-loading would not. Maybe that's it.)
Sorry, I'm sure you don't care to read most of this, but I felt like writing it.
>>
File: 1771435066298446.png (69.8 KB)
69.8 KB PNG
>>108658754
Don't look at me like that.
>>
>>
>>
>>108658768
I like your direction of thinking; maybe it would need big boy compute to do such training without fuckhuge brain damage but maybe not.
However, it sounds like you might have the same misconception I clarified in
>>108658791
in that you seem to be talking about replacing an entire layer, when the hotness question necessarily needs to focus on experts within a layer.
What you need is to either, I don't know, merge the coldest experts with DARE-TIES or whatever, or do some sort of retraining/distilling to get a new, smaller set of experts that mostly learned from the hottest ones. (In either case, is llama.cpp ok with a model with different expert counts on different layers? I feel like the n_experts param is file-scope.)
>>
>>
>>
>>108658791
> Because every layer you treated nicely (high quant/in VRAM/etc) will have that nice treatment applied to 8 expert activations every token, regardless of activation pattern.
Okay I think I get it. I'd confused myself when I saw things like this: https://huggingface.co/Thireus/Kimi-K2.5-THIREUS-Q8_0-SPECIAL_SPLIT/tree/main - every tensor in its own file. Something like 20 repos like this with different quant levels.
I thought I could just look at the map file, and download the higher precision tensors corresponding to the most used experts.
So effectively what I was after, is already handled by imatrix then isn't it? I'd just have to create my own calibration data based on my use cases?
>a non-dynamic version of ik_llama's "Smart Expert Reduction".
-ser didn't improve anything for me when I tried it with Kimi-K2.5 (384 experts, so quite sparse no?). Since I'm DDR5 bandwidth bound, swapping experts around like that probably isn't useful at all?
in which case your "static expert routing based on metrics calculated from previous runs" would be a LOT more useful?
>I'm sure you don't care to read most of this
Why would you think that? lol
I'm mostly here to read things like this.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108658869
>So effectively what I was after, is already handled by imatrix then isn't it? I'd just have to create my own calibration data based on my use cases?
That's an interesting question and I'm not entirely sure, because I'm sketchy on the details of imatrix. If it's varying the quant quality at something like a per-weight granularity, and that is determined across the single tensor containing all experts... then maybe? It certainly does feel like the sort of situation where the answer turns out to be "the gains you think are possible are already in there".
But if there is an expert hardly ever getting used, quanting it lower or even pruning it might still be a sensible quality/size tradeoff. Like, imatrix isn't going to use 0bpw for an entire expert.
>-ser didn't improve anything for me when I tried it with Kimi-K2.5
Funny thing, SER didn't improve anything when I tested it either. Then, when I was last thinking about all this stuff, I went to see what pagecache-aware-SER would take... and found that what sounds like the linchpin function call activating it was commented out, with what looked like the older non-SER vanilla version put in. Now, this happened in https://github.com/ikawrakow/ik_llama.cpp/pull/840 , and maybe the SER is now done in the "fused" operations that PR adds? But it looks to me like maybe he commented out a feature and kind of just forgot about it, and nobody has noticed. It would be on-brand for ik_llama.cpp.
>Since DDR5 bandwidth bound, swapping experts around like that probably isn't useful at all? in which case your "static expert routing based metrics calculated from previous runs" would be a LOT more useful?
All token gen is bw bound. Any weights you can skip, helps, so should've helped. But yeah it is the case that if you have different experts stored in VRAM vs RAM vs SSD, doing SER on an expert in the faster medium is roughly pointless.
>I'm mostly here to read things like this.
aw thanks :)
>>
>>
>>
>>
File: cmd_7owpphVgpo.jpg (178.9 KB)
178.9 KB JPG
I'm vibecoding the hermes windows port and the setup menu has all kinds of fucked up symbols. Is this some poorly ported linux text formatting?
>>
>>108658994
Only reason I would see need for a finetune is for writing.
But you can kinda prompt it and are good to go with proper editing at the start.
It's the first model in a long long while that properly plays a bully.
Even if you manage to make it say nigger etc. its still positivity sloped.
Try saying to a bully "no please stop", they all go "i feel a pang in my stomach oh mah gahd".
Gemma4 is like "stop being a crybaby" and doubles down. Even without thinking. Good shit.
>>
>>
File: 82ba1ec1-c52a-43d3-b97d-fb073964a390.png (1.9 MB)
1.9 MB PNG
how? i thought openai was finished.
very surprising release. hope the chink nerds get off their ass and make it local.
>>
>>
>>
>>108659058
Oh cool, yes, sounds like exactly that. One problem is that because the early layers have more uniform activations, it seems like a bad idea to prune them. Unfortunately llama.cpp's current architecture requires all layers to have the same expert count. I see this paper did the same amount of pruning in every layer, so maybe it's not so bad.
Did anyone ever implement this for gguf?
>>
File: Screenshot_20260422_170429.png (246.2 KB)
246.2 KB PNG
>>108659015
>But it looks to me like maybe he commented out a feature and kind of just forgot about it, and nobody has noticed
Well kimi-k2.6 doesn't seem to like him...
I just pasted https://github.com/ikawrakow/ik_llama.cpp/pull/840.patch in there and asked it why the smart expert routing feature was removed and if I could put it back...
>>
>>108659103
there are reap* goofs you can run
https://huggingface.co/0xSero/gemma-4-21b-a4b-it-REAP
it's not something done 'on the fly', you have to use their framework to do the activations estimation + pruning.
what gets axed is entirely dependent, of course, on the dataset you provide, so you can either STEMMAX, ERPMAX or try to do a bit of both
>>
File: gpt2 cucked.jpg (27.5 KB)
27.5 KB JPG
How is Gemma4 vs Deepseek 3.2 for RP? Getting cucked by GPT image 2 reminds me not to let my guard down and get comfortable. Was trying a simple character sheet until safety cucked.
>>
>>108659099
huh? why? besides Lodoss it's the only other LN i read.
arc 5 kinda sucks though. there is worse shit out there.
>>108659095
when is that happening for you?
unless its at the very beginning with thinking enabled.
i prompted some pretty messed up stuff and once it gets going even reasoning won't stop it.
i spend more tokens prompting for anti-slop than trying to make it uncensored.
31b, no clue about the moe one.
>>
>>
>>108659161
>its the only other LN i read
no wonder you like garbage, literally reading 'babbys first shartsekai' and thinking it's any good with its regurgitated garbage plot. never post anime again in this thread.
>>108659124
lmao'd, hopefully chinese labs are already distilling from it.
>>
>>
File: jTiUsm2gjA.jpg (243.7 KB)
243.7 KB JPG
>>108659082
I tried git cmd and it had the same issue during setup, and when I launched the agent thing itself the main screen kept sliding line by line upwards and wouldn't let me stay at the top
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108659254
>>108659331
I moved the repo to github for issue tracking because I won't be reading every post here, and also I don't wanna derail the thread with feature begging.
>>
>>
File: erp.png (94.9 KB)
94.9 KB PNG
>>108658710
>What ever happened to the old RP finetuners?
lack of datasets
>>
>>108658710
Manually cleaning human slop has serious data volume and efficiency limitations (I'm definitely not going to do that anymore). That worked as long as datasets were small, but that can't be scaled up easily with limited manpower. And LLM work in this regard (especially if the source data is messy) will always have to be double checked. It's just simpler to use pre-made and pre-formatted synthetic data in large amounts that makes training loss go down faster, among other things.
Also, much of modern post-training work is giving a consistent "voice" to the SFT data, applying RLHF and now also having a good RL pipeline for making the model actually learn reasoning and other verifiable stuff. Generic data doesn't really work well in this area.
All of this can't be solo'd just for "fun" (sort of fun, since you'd be spending hundreds or thousands of hours just on dataset creation) like finetuners were doing in 2023. You can't really do this well just on your local 3090 either. You can pretend to, but you'll never be an AI lab.
>>
>>
File: Screenshot 2026-04-22 at 05.32.02.png (265.7 KB)
265.7 KB PNG
DeepSeek V4 (or whatever they use on Web, but it claims to have 1M context) is deployed to the API.
>>
>>
>>
>"hmm, I haven't RP'd with Gemma much, I should do so to get myself familiar with the characteristics of the model so I have a basis to talk about it with my fellow anons"
>look at the time
>it's 11 pm
>ok, just a bit and I'll head to bed
>look at the time again
>it's 4 am
Fuark. We are so back.
Used the Mendo card btw.
I own a dog now.
>>
File: Screenshot_20260422_105125.png (1.3 MB)
1.3 MB PNG
>>108655009
Not that this should be surprising, but since the "Turboquant crashes le memory stocks" hype, the DDR5 prices have not actually dropped.
If anything they have gotten even worse.
>>
>>
File: Screenshot_20260422_180142.png (3.3 MB)
3.3 MB PNG
>>108659666
Not even close.
Left is banana 2, right is banana pro.
>>
>>
>>
File: SVGs.png (14.6 KB)
14.6 KB PNG
>>108659704
>>
>>
>>
>>
>>
File: 1771683095345722.jpg (46.1 KB)
46.1 KB JPG
>ask LLM to list UK surface ships (including carriers) that scored ship to ship kills in WW2, ranked by tonnage sunk
>doesn't list carriers because it's not considered "ship to ship kill"
Smartass
>>
>>
>>
>>
>>108659764
>>108659771
yeah, once you reach a certain age you stop giving a fuck. lol
i bet you guys are young and in your 20s or whatever.
i pretty much achieved everything i want. (apart from being mega rich)
>>
>>
>>
>>108659088
nice, put some more kinos here anon >>108653190
>>
>>
>>