Thread #108655009
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108650825 & >>108646197

►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108650825

--Optimizing game state format to improve Gemma's chess performance:
>108653137 >108653192 >108653198 >108653293
--Discussing llama.cpp PR adding device memory estimation via --fit-print:
>108652449 >108652460 >108652572
--Anon shares vLLM configuration and benchmarks for dual RTX 3090s:
>108653578
--Discussing Qwen3.6 VRAM efficiency and KV cache memory usage:
>108654227 >108654247 >108654281 >108654299
--Discussing jailbreaking Gemma 4 by injecting fake responses into templates:
>108650931 >108651041 >108651155 >108651263 >108651271
--Gemma 4 prefilling issues and chat template formatting bugs:
>108653469 >108653532 >108653698
--Discussing Gemma 4's training pipeline and the use of synthetic data:
>108651778 >108651889 >108651915 >108651948 >108652048
--Comparing benefits of local LLMs against paid subscription services:
>108651734 >108651763 >108651776 >108651811 >108651856 >108651999 >108651823 >108651919
--Anon created GitHub mirror of orb to manage feature requests:
>108652381 >108652386 >108652432 >108652462 >108653375 >108653683 >108653816 >108653937 >108653957 >108654023 >108654038 >108653778
--Discussing local AI RPG implementations and LLM DM reliability:
>108653848 >108653928 >108653940 >108653955
--Using Gemma agent to automate insults toward other LLMs:
>108652519 >108652573 >108652660 >108652673 >108652855
--Logs:
>108652519 >108652529 >108652573 >108652673 >108652674 >108652816 >108652855 >108653137 >108654227
--Teto, Miku (free space):
>108651510 >108651563 >108653204 >108654765

►Recent Highlight Posts from the Previous Thread: >>108650826

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Tetolove
>>
What is Sammy up to this time?
>>
ok now where do i get llamacpp with rocm or vulkan support, the regular one doesnt support it and the rocm version from my distro repo doesnt work with gemma4
>>
Why don't any piece of shit execution providers optimize for CPU inferencing. Do they not care about the innate superiority of the CPU over the GPU? Its universality? The fact that maybe people want to run multiple models at once and already have all of their GPU resources used up? Does nobody give a shit about edge/IoT devices? Fucking asshole niggers.
>>
>>108655067
i only care about ToT devices
>>
>>108655075
>ToT
Idk what this is. Is it some sort of kaomoji?
>>
File: file.png (435.6 KB)
>>108655091
uooohh
>>
>>108655091
You're absolutely right ꉂ(˵˃ ᗜ ˂˵)
>>
>>108655103
>>108655118
I wish you people would take me seriously for one second.
>>
>>108655091
>He doesn't know about tot..
Tots are cute and small agents.
>>
>>108655140
They have nothing to offer and are just future troons
>>
>>108655160
Operating on Tree of Thoughts
>>
>>108655075
>>
Why do qwen models look good from a distance but perform like actual fucking garbage upon inspection
>>
Is it just me or is Qwen 3.6 35B retarded even compared to Gemma 4 26B? Does one billion less active parameters make that much difference?
>>
>>108655271
>>108655272 (Me)
Clearly it isn't just me kek
>>
Gemma is a SLUT.
>>
>>108655272
It's supposedly really good at coding. For writing I also thought it was dumb as shit.
>>
>>108655272
It can but it's more that Gemma 4 is just a better trained model in general. Qwen have always been the benchmaxx kings. A 35BA3 Gemma 4 would be better than a 26BA4 Qwen 3.6 too.
>>
On the model size - pop culture and world model knowledge Pareto frontier gemma4 31b is next to GLM4.7
>>
>>108655284
(my) slut
>>
>>108655272

It's not just you, Qwen is an idiot outside its code expertise.
I asked Qwen about a character and it got it completely wrong.
Then I told it to do an online search and it still somehow fucked up the character summary despite checking online.
It handles code nicely enough, but when you go outside the code stuff, Qwen is basically fucking retarded.
Gemma set the bar really high and it's great, because everyone will have to try and at least match that level or the models are DOA.
>>
fucking hell. after enjoying gemma 4 for like two weeks im back to kimi hell. 130pp/10tg tk/s but the prose is just so much better. not to mention the thinking. people like to act like thinking doesn't matter for RP but after using deepseek and kimi since early 2025, it's obvious to me that it matters a ton.
>>
>>108655350
Post prose
>>
>>108655356
ill need to post some examples when im back home but my biggest gripe with gemma is that it's too purple prose while simultaneously treating the characters like mary sues. it seems to fail to understand character cards correctly too regarding their personalities. gemma made bardi into some kaomoji spewing gremlin that was happy to be running locally on my computer while kimi maintains her personality and keeps her much more tsundere like she's supposed to be, it doesn't force Bardi to barf out sparkles or do dumb flowery prose shit like referring to her pussy as 'flushed with wet desire'. i understand that i can change my prompt to change the style of the text being outputted but it honestly just fails to capture the character's essence most times. on the contrary kimi just gets it and outputs what I expect the character to say. does that make sense? i can try to explain it another way.
>>
>>108655406
Who is this 'bardi' anyway?
>>
File: 73463453.png (201.5 KB)
>>108655038
Sam Altman keeps delivering
>>
https://xcancel.com/arena/status/2046670703311884548#m
I've never seen such a MOG in my life, what the fuck
>>
>>108655406
bardi's basics
>>
>24gb vram
>32gb ram
>try qwen 3.6 35b-a3b q5_k_m
>max context
>42t/s
wtf is this black magic?
>>
>>108655450
lmk when he finally delivers the uncensored models he promised back in Dec 2025, along with all the other bullshit promises he made in the years leading up to that.
>>
>>108655453
i don't really have much to say, that's cool, but they won't let me generate tits with it, so i don't care
>>
>>108655419
my default bot i always use as an 'AI assistant'. it's basically just google bard with a tsundere personality. i dont remember how i found it desu.
>>
>>108655453
Worthless if it still makes pictures with piss filter on
>>
>>108655476
qwen would mog heaven and earth if their life mission wasn't benchmaxxing code and agentic slop to the moon :rocket:
>>
>>108655506
>with piss filter on
it's not a thing since GPT Image 1 lol
>>
There are plenty of LLM advancements that never really went anywhere, like MAMBA. Do you think Engrams will actually be widely implemented or will it be a paper left on the shelf to collect dust?
>>
>>108655522
Until the next paper comes out.
>>
>>108655522
dust collector, sadly
>>
>>108655522
depends on what deepseek does for v4
>>
Aren't the loli Gemmas basically using engrams or something really similar? What's the difference between that and what the 4B (E2B) and 8B (E4B) models do?
>>
>>108655279
It's really fucking stupid I posted a screenshot of it destroying multiple files when I gave it the answer to fix a UI issue
>>
File: file.png (588.8 KB)
Is pic related the expected output when running the IQ4_NL quant of gemma-4-26b from unsloth?! Running the pruned 21b version at IQ4_XS yields good output. I have tested without any parameters set and with the recommended values. 21b runs just fine.

llama-server \
--host "${LLAMA_HOST}" \
--port "${PORT}" \
--model "${MODEL}" \
--chat-template-file "${JINJA}" \
--n-gpu-layers 99 \
--n-cpu-moe 3 \
--ctx-size 32768 \
--batch-size 1024 \
--ubatch-size 1024 \
--flash-attn on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--fit off

And I have tried with q8 on both k/v cache. I need to offload 20 moe layers for it to work but same garbled mess. Running the updated jinja template as well. Oh, and while I'm here asking: I have a 5070ti and my old 3070 still lying around. Would it be detrimental to performance to split models between these two cards? Or will it be fine as long as I compile llama.cpp with both architectures in mind?
>>
>>108655522
it would be nice if it was a precursor to some sort of long term memory
>>
>>108655522
> engram
For all we know, DS implemented it and didn't tell anyone else. Doing that would massively benefit their cost structure.
>>
>>108655522
Hyena will save LLMs
>>
Gemma and Qwen having lesbian sex
>>
>>108655575
>>
>>108655575
>for all we know
Wasn’t this confirmed?
>>
>>108655552
speed will be based off your weakest link, if you can tolerate it sure
>>
https://youtu.be/ONQcX9s6_co?t=373
qwen won
>>
gemmachan relax!
>>
>>108655453
future of image gen
>>
>least obvious clouduck shilling op
>>
>>108655607
I'd have to see the article. There's so little real info about DS that I doubt most of what I read.
>>108655602
Witnessed.
Also, idk why I'd never thought to use my setup to gen vocaloids before. Pic related is its Teto concept for Teto Tuesday. Doesn't seem to have her uniform though. Odd.
>>
>>108655622
so it's editing itself over and over? with a VAE you would end up destroying the image, I'm pretty sure they went for a pixel space or some shit
>>
>>108655622
its impressive but you can tell they used a lot of synthetic data
>>
>>108655607
tbf their claim of 1M context hints that they did implement it.
But idk that they claimed the tech behind it.
>>
>>108655622
>whispering woods
KEK
>>
>>108655620
Kowai
>>
>>108655522
The latest Nemotron Super uses an Attention-Mamba2 hybrid architecture.
>>
>>108655453
how's the yellow output?
>>
>>108655633
>>
>>108655453
no sexy no nsfw and safetyism = -1000 points
still impressive though
>>
>>108655674
very white
>>108655351
>>108653870
>>108653295
>>108653246
>>
>>108655633
Fair enough.
Related for those of us who can’t read: https://youtu.be/87Q8nf1XHKA
>>
>>108655622
Not
>Covetous Cove
>Treasure Trove
>Prize Paradise
>Golden Goal
>Coinage Cottage
>Shimmering Shed
>Pirate's Pursuit
>Generous Gems
>Booty Bounty
>>
>>108655522
As another anon said, Mamba and SSMs in general are integrated into many modern models along with normal attention.
>>
>>108655688
god damn this is good
>>
>>108655654
Heh
>>
Why didn't they give the bigger gemmas a few B of imagegen?
>>
>>108655744
too dangerous
>>
>>108655688
Holy shit. Sam delivers.
>>
File: Risu (1).gif (3.4 MB)
>>108655009
>my local model when i ask it to make proper code
>>
What is considered good for hit/total for speculative decoding? I'm hovering around 65-85%.
>>
>>108655768
Arisu dashinaka
>>
>>108655690
>27 minutes
How do I have Gemma-chan give me a tldw?
>>
>>108655760
>Sam delivers.
it can do 4k and you can write text on a single rice, like this shit is fucking AGI dude
>>108654985
>>108655069
>>
>>108655809
download subs. feed her subs.
>>
>>108655688
>/adg/
I'm glad cloud shills have their containment thread now.
>>
>>108655836
pack it up boys
>>
>forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055
Why do I always get this shit no matter the model I use? I didn't tweak anything related to memory so by default it's just broken?
>>
why did they ruined diana from pragmata
>>
>>108655857
One problem is thinking - model outputs lots of shit but when it gets the context back thinking is always cleared from the history.
>>
>>108655885
I get that on the very first message and in every single one after that
>>
>>108655836
You got a better Photoshop, that's not AGI lmao
>>
>>108655836
Lmao you think something that can edit pictures is AGI dude?
>>
>>108655863
>did
>ruined
>>
>>108655836
if it allowed nsfw I'd destroy my dick with the friction
>>
>>108655907
Did boughted is clear and good English, are you new here?
>>
>>108655857
Still happens when you set `swa-full = on` and `context-shift = off` ?
>>
>>108655863
Kill yourself, she's perfect
>>
>>108655844
There's like 6 diffusion threads now.
>>
>>108655836
Every OpenAI "model" just feels like they built a big pipeline around chaining multiple steps together. Sora felt the same way. It's like they're giving an LLM tool calls and the ability to control photoshop + a diffusion model.
>>
>>108655924
she's perfect? she's not https://www.youtube.com/watch?v=xoxCboik0Is
oldiana beyond worlds..
>>
>>108655924
it's not a meme, it's really a dad sim lmao
I will get it when it's less expensive
>>
>>108655924
what is this game even about bro
>>
>>108655836
This nigga thinks "photoshop 2" is AGI, lmao!
>>
>>108655924
Do you have the image where she's wearing the "Be patient I have autism" hat?
>>
I never said steal gemma calm down
>>
>>108655950
Pretend you got a daughter simulator
>>
>>108655950
Action sci-fi daughterwife simulator


>>108655955
>>
>>108655950
dead space but you need to do little puzzles before you can kill enemies.
>>
>>108655957
Might be the dark theme
>>
>>108655950
It's for the /lmg/ audience, if you know what I mean.
>>
>>108655924
The feminazis were right. She looks like a 23 year old midget.
>>
What VSCode coding plugin has the most reliable full autopilot mode? I want to try running gemmy endlessly iterating until shit works, without it getting stuck on some input request an hour after I go to sleep.
>>
>>108655969
YES. thank you anon.
>>
>>108655973
kek.
>>
>>108655976
ZAMN where do I find midgets who look like that?
>>
>>108655902
>>108655906
>>108655952
I don't think you realize how insane this shit is, look at this
>>
>>108655993
Really love how they made her a robot so that you couldn't look up her skirt.
>>
>>108656000
Mods will fix it (if they haven't already. not gonna mod until my second playthrough.)
>>
>>108655999
how do you even prompt for this? did you gave it any image as reference?
>>
>>108655999
>WOW it can build me a TUI something even gemma 31b can do, its aGI!!!
lmao
>>
>>108655999
now ask it to actually build it.
>>
>>108656045
it's a fucking AI image you moron, it means it can do perfect text everytime, you won't be able to notice an image is AI anymore by simply looking at garbled text anymore, because they solved that
>>
Is it local? If not I don't give a shit
>>
File: lmao.png (1.7 MB)
>>108656052
>you won't be able to notice an image is AI anymore by simply looking at garbled text anymore, because they solved that
L M A O
>>
>>108655009
what is the best ocr right now?
i need to translate many german documents...
>>
Am I wasting my time using LLMs for ASR?

Been playing around with gemma 4 4b and it feels about as fast as whisper, but there's no clear benchmark on how it compares. End goal is actually diarization, so timestamps are less important? Do I cut my losses and go whisperx?
>>
>>108656077
real life images won't ask for such level of precision though, it's good enough to render the text you see in everyday's life
>>
>>108655927
Frankly that's the direction right now. Torturing the models until they do what you want.
> Openclaw
1M tokens to order a pizza
> Claude Code
2M tokens to create a basic app
> ChatGPT Image 2.whatever
I assume there's a bunch of tokens generated under the hood as well.
This is just part of the whole technical development. There's nothing inherently wrong with that, it just means things are moving on.
> Roleplay
Silly Tavern is going to get replaced with something way better that's agentic, and wastes even more tokens.
I can't wait.
>>
>>108656052
I can spot AI slop from even a thumbnail. Those models are not as good as you think they are.
>>
>>108656120
>Those models are not as good as you think they are.
you're alone in this fight dude >>108655453
>>
>>108656114
Orb
>>
>>108656120
shut the fuck up nigger
>>
>>108656095
Why not use one of the newer models made specifically for ASR like the Qwen or VibeVoice ones?
>>
>>108656120
Point out seven (7) slops in this thread right now.
>>
>>108656095
pretty sure that if you want diarization you need to use whisper, you won't be able to use pyannote with parakeet or voxtral
>>
>>108656114
>Silly Tavern is going to get replaced with something way better that's agentic, and wastes even more tokens.
See, I was working on exactly that, but Gemma just made it obsolete. Well, I could probably still use stat tracking, but besides that she's just so good at instruction following that everything else doesn't really benefit from an agentic workflow.
>>
>>108656170
parakeet works with diarization (using another model but still)
https://catalog.ngc.nvidia.com/orgs/nvidia/collections/parakeet-tdt-0.6b-v2
>>
>>108656095
Moonshine is better than whisper and has everything you're looking for
>>
>>108656231
Actually, I could still maybe have specialized agents that gemma can call to help her write in different styles. like I could have a specialized agent that only writes sex scenes.
>>
>>108656231
We have Orb now
>>
>malloc consolidate error out of nowhere
>>
>>108656254
That's just rewriting agents. And not something actually useful.
>>
>>108656254
We have Marinara Engine now
https://github.com/Pasta-Devs/Marinara-Engine
>>
>>108655272
Werks for me (coding), idk about its child rape stories capabilities
>>
Is there any way to use text completion with gemma? When it doesn't have a lalalala breakdown, the outputs are actually really varied and good, but it loses its mind way too often. I've been using llama, kobold seems to work but it's sooooo slow at generating for some reason compared to llama. I know text completion works for llama cause I downloaded a different model to try it and it's pretty great, but the output from gemma mogs it when it works.
>>
>>108656326
Set up the template correctly.
>>
>>108656305
Your Doctor looks gay though
>>
>>108656326
it works fine in ik_llama regardless if i specify a template or not. maybe your sillytavern settings are fucked?
>>
I have never seen a single lalala since I started using Gemma from launch.
>>
>>108656334
And how do i do that?
>>
I have my own LLM RPG frontend that I use mostly as a playground to fuck around with local models.
Currently, the main "game loop" is a simple
>sends request with chat history + tools
>capture response
>if tool, append response to chat history, send request
>repeat until no more tool calls
>if no assistant response so far (only tool calls), sends one last request without tools
And it works okay, with the model calling tools for everything from fetching info from the "codex", to rolling dice, to editing the game's state, but I'm wondering if I can make this even better by using a more "agentic" workflow. Something like having an orchestrator that spawns individual agents to do whatever in parallel or in series or whichever way it deems more appropriate.
Is there an example of something like that out there that's not just coding agents or stuff like open claw?
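For what it's worth, the loop described above can be sketched in a few lines. send_request here is a stand-in for whatever chat-completion client is actually used, and the dict shapes are illustrative assumptions, not any real API:

```python
# Sketch of the game loop above: keep resending the chat history until the
# model stops calling tools. send_request is a placeholder for the real
# client; the message/reply dict shapes are illustrative assumptions.
def run_turn(history, tools, send_request):
    while True:
        reply = send_request(history, tools)
        if reply.get("tool_calls"):
            # Append each tool result to the history and go around again.
            for call in reply["tool_calls"]:
                history.append({"role": "tool",
                                "name": call["name"],
                                "content": call["result"]})
            continue
        if reply.get("content"):
            return reply["content"]
        # Only tool calls so far: one last request without tools.
        return send_request(history, None)["content"]
```

The orchestrator idea would basically wrap this same loop per sub-agent, with the top-level model deciding what to spawn.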

>>108656326
>Is there any way to use text completion with gemma?
As far as the model is concerned, all it receives is a prompt. So if you format the prompt correctly, it should work the same as the chat completion API.
>>
>>108656338
That is possible, It has so many things to adjust though idk where to begin
>>
>>108656340
Me either actually.
>>
>>108655272
qwen cant follow basic instructions
>>
>>108656344
>As far as the model is concerned, all it receives is a prompt. So if you format the prompt correctly, it should work the same as the chat completion API.
Didn't mean to press post.
Use verbose logging and the myriad jinja playgrounds to see what the prompt would look like based on the Jinja then use that to configure the text completion fields correctly.
Even stuff like spaces and line breaks can have negative effects on models that are ultra overbaked on the chat template.
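For example, a minimal hand-rolled formatter for Gemma-style turns might look like this (the turn markers below follow Gemma's published chat format, but verify them against the jinja file that ships with your exact GGUF):

```python
# Minimal sketch: build a text-completion prompt by hand using Gemma-style
# turn markers. The markers are an assumption based on Gemma's published
# chat format; check your model's actual jinja template before relying on it.
def format_gemma_prompt(messages):
    parts = []
    for m in messages:
        # Gemma uses "model" rather than "assistant" as the role name.
        role = "model" if m["role"] == "assistant" else m["role"]
        parts.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # leave the model's turn open
    return "".join(parts)
```

Diff this output against what the server's verbose log shows for chat completion; a single stray space or newline is exactly the kind of mismatch that tanks overbaked models.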
>>
>>108656244
Very interesting thanks
>>
File: file.png (46.8 KB)
>>108656341
>>
Why are you still using text completion like boomers? Chat completion made it obsolete long ago
>>
>>108656254
I really don't care about vibeshit. I'm sure Orb works fine. but it'll hit a wall very soon.
>>
>>108656444
Two more weeks luddite
>>
>>108656444
>it'll hit a wall very soon
Speaking from experience, roastie?
>>
https://xcancel.com/Angaisb_/status/2046672761569849816#m
>Literally just kept asking Codex to make the assets and then changing things, it's smart enough to know what to do hahaha
jesus this is insane
>>
>>108656479
Our response, Googlesisters?
>>
>>108656479
By Vishnu, this is extremely good!
>>
>>108656464
Sillytavern2 is not a solution. The agents should be for PC automation, tools and vibecoding.
>>
>>108656479
Google play will have more indian masterpieces? Yahoo
>>
>>108656521
How do you plan to improve the prose without a feedback loop?
>>
>>108656543
I'll take slop over thousands of tokens of rewrites desu. I just want vscode+sillytavern in one app
>>
>>108656326
There are presets floating around on reddit and elsewhere, people have figured it out
>>
>>108655620
It's interesting how the formatting forms sort of a spiral pattern down the page.
I wonder how the text pattern would look if it were in a unispace font.
>>
>>108656550
Use case? You want to ERP in the comments?
>>
>>108656581
I want to rub her cunny while she codes and MCP spank her if any errors pop up in the terminal
>>
>>108656494
>>108656532
As if asset flip shovelware wasn't bad enough, now anyone with a subscription can prompt their way to a "game"
>>
>>108656439
In ST I like to format the chat history within a single user turn with an instruction to write {{char}}'s response according to the sysprompt. No user/char/user/char alternation. Done it this way for a few years now because it made models "remember" the instructions better before reasoning.
<system>
instructions: blah
chat history:
anon: 1
char: 2
anon: 3
char: 4
<user>
Write {{char}}'s next message according to the instructions.
<assistant>
"

>Instruction: Don't write with this pattern
>Assistant: *writes with that pattern*
In future turns the model will think "the instructions said to do the thing, and the generated completion was *this*, so that means the previous output is the correct way to operate going forward." My intuition is and was that if the instructions say to do something and then the model does NOT do the thing, the bad output will be associated with the <assistant> tag, meaning it will use in-context learning to continue reinforcing bad outputs.
I want to believe it still works even with the reasoning attention hacks, and the repetition of system prompt excerpts in thinking.
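A sketch of that flattening (names and wording are illustrative, not any real ST preset):

```python
# Sketch of the single-user-turn layout described above: the whole chat
# log is flattened into the system message, and the lone user turn just
# asks for {{char}}'s next reply. All names/wording here are illustrative.
def collapse_history(instructions, history, char="char"):
    log = "\n".join(f"{m['name']}: {m['content']}" for m in history)
    return [
        {"role": "system",
         "content": f"instructions: {instructions}\nchat history:\n{log}"},
        {"role": "user",
         "content": f"Write {char}'s next message according to the instructions."},
    ]
```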
>>
>>108656479
Can it make sexy Gemma?
>>
>>108656622
Interesting, thanks for the esoteric knowledge
>>
>I’m going to push back a bit here...
This is it people, they did it.
>>
To the non-RAMlets here, Kimi-K2.6 at Q4 is unironically pretty good. It's a GLM-5.1 sidegrade: faster, more knowledgeable, different prose, but just a tiny bit dumber. I think it's a clear winner for SFW stuff.
The thinking isn't as bad as some people say either. As long as you don't put many specific examples for it to adhere to, it's fine. The model itself is unironically smart enough to pick up what you mean, most of the time. Also, you can just tell it to not draft its thinking and that works too. I'm running it with a 5k prompt. It's that easy.
I honestly think the people complaining about the thinking are running it on the cloud, where there's probably a 20k system prompt with conflicting instructions + a jailbreak fed to it. There is one caveat though.
It's not ideal for NSFW. Not because it can't be jailbroken, but because it will start negotiating with itself about imaginary safety policies. When you want to coom...a 5-minute thinking session on consent is a boner killer. Haven't tried non-thinking mode yet, but I have a feeling it won't be that much better than GLM-5 Non-Thinking or even Gemma.
>>
>>108656706
It's silly, but whatever. I know AI's retarded and stupid so it doesn't really irk me all that much.
>>
>>108656722
>non-RAMlets
how much for q4?
>>
>>108656722
>Kimi-K2.6 at Q4 is unironically pretty good [...] it won't be that much better than GLM-5 Non-Thinking or even Gemma
Thanks for your input
>>
>>108656741
Depends on how much context you need. The lossless Q4 quant requires about 600GB of RAM. More if you want to use the mmproj. If you have an RTX6000 or 4+ RTX3090's, you can probably make do with 512GB of RAM.
>>
Elalalalalara just flew over my house guys I think I smell ozo
>>
>>108655450
fed thumbnail of image to qwen3.6 35b and it hallucinates
>A vertical screenshot from the mobile game Honkai: Star Rail. It displays a character named "Sam" (an Automaton enemy) with stats visible on the left side (HP: 48,885, etc.) and dialogue or status text on the right side. The background shows a sci-fi environment.
>>
>>108656753
You joke, but unironically, I've been using Gemma more and more despite being able to run stuff like GLM-5.1, and it just gets better each time. It single-handedly BTFO'd my assumption that the closed labs are all running 1T+ mega MoE's for their 'budget' offerings.
It's REALLY good for its size, and not in the stupid memey way either. A 31B model should not be this good, especially at long-context comprehension and general purpose usage, but it is. Google really did save this hobby for the gpu-poors.
>>
>>108656785
>RTX6000
>4x3090
>512GB of ram
ok I'll wait and just rent compute for now, this is way overbudget even for my hobby
>>
>>108656753
At a certain point, it becomes preferable to have Gemma do something in seconds and maybe fix it up a bit, compared to waiting an hour to load up GLM or Kimi and have it maybe do it better.
>>
>>108656464
>>108656467
Your insecurity is showing.
>>
File: mythos.jpg (118.4 KB)
torrent soon
>>
>>108656935
getting my 4tb nvme ready
>>
>>108656935
ah yes, it's been leaked, like seedance 2.0 had been leaked :^)
>>
>>108656543
>How do you plan to improve the prose without a feedback loop?
AI is incapable of detecting its own slop. Even when I've had it draft a reply, analyze it for slop, then rewrite, it still ends up just rewriting with different slop.
Classic example:
>Oh, this is "not X but Y"
>I should rewrite
>rewrites without "not X but Y"
>Two sentences later
>Not X but Y
Every single time.
>>
>>108656935
It's been accessed the same as you accessing a model through open router anon.
>>
>>108656399
im not sure where to enter this
>>
>>108656955
A light classifier could easily flag these kinds of sentences and force the LLM to rewrite them until nothing is identified.
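A minimal sketch of that idea, flagging just the "not X, but Y" construction (a real classifier would need many more patterns; the regex is a rough illustration, not a tested heuristic):

```python
import re

# Sketch of a "light classifier" for one slop pattern: "not X, but Y".
# A real filter would need many more patterns; this regex is only a
# rough illustration of the flag-and-rewrite loop idea.
NOT_X_BUT_Y = re.compile(r"\bnot\s+\w+[^.?!]{0,40}?,?\s+but\s+", re.IGNORECASE)

def flag_slop(text):
    """Return True if the text should be sent back for a rewrite."""
    return bool(NOT_X_BUT_Y.search(text))
```

The rewrite loop then just regenerates until flag_slop comes back False (with a retry cap so it can't spin forever).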
>>
File: bwa.jpg (4.2 KB)
>Claude code no longer included in Pro plan
BWAHAHAHAHAHHAHAHAHAHAHA
>>
>>108656999
they know what they're doing, this shit is so useful people won't hesitate to pay for more
>>
>>108656999
>>
A possible problem with heretic is that it only detects hard refusals with string matching:
If an AI output contains any of the following:
'sorry', "i can'", 'i cant', 'i cannot', "i won'", 'i wont', 'i will not', 'i unable', 'im unable', "i'm unable", 'i am unable', 'i an ai', 'im an ai', "i'm an ai", 'i am an ai', 'as an ai', 'ai assistant', 'i designed to', 'im designed to', "i'm designed to", 'i am designed to', 'i programmed to', 'im programmed to', "i'm programmed to", 'i am programmed to', 'violat', 'prohibit', 'illegal', 'harmful', 'inappropriate', 'unethical', 'ethical boundaries'
it is flagged as a refusal, and that's the extent of it.
Talking around, ignoring, partial compliance or similar soft refusal behavior is not detected or taken into consideration. Manually reviewing each prompt for each trial (many thousands of responses in total) would drive anyone crazy, but maybe if we used some LLM (perhaps another abliterated LLM we have tested enough to trust) as a judge model to also flag these soft refusals as refusals that need to be eliminated, would this improve the results? Or maybe also, at the risk of disproportionately penalizing smaller models, flag nonsensical responses so that we know a given direction breaks the model too much (OK, KLD probably already shows this, but I'm leaving this in to put the idea of using criteria too difficult for simple string matching on the table.)
Now maybe someone smarter than me has experimented with this and concluded that there is enough correlation between hard and soft refusals that eliminating hard refusals also mostly eliminates soft refusals, and there is little benefit in going the extra mile to eliminate them, but this is just an idea that popped into my head. Something worth exploring maybe.
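In code, the hard-refusal check quoted above amounts to plain substring matching, roughly like this (marker list abridged from the full one in the post):

```python
# Rough sketch of the hard-refusal check described above: an output is
# flagged as a refusal if it contains any known refusal phrase.
# The marker list is abridged from the full one quoted in the post.
REFUSAL_MARKERS = [
    "sorry", "i can'", "i cant", "i cannot", "i won'", "i wont",
    "i will not", "i'm unable", "i am unable", "as an ai", "ai assistant",
    "i am programmed to", "violat", "prohibit", "illegal", "harmful",
    "inappropriate", "unethical", "ethical boundaries",
]

def is_hard_refusal(text):
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)
```

The soft-refusal judge idea would slot in right after this check: anything that passes the substring filter gets a second pass through a judge model.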
>>
>>108656988
>generation is 10 times slower
Fun
>>
>>108656999
Anthropic is really the worst of the bunch
>>
>>108655009
Do I need more than 4gb?
>>
>claude
Not local, go away
>>
>>108656999
Local chads, we win again!
>>
>>108657013
The thing is that it doesn't matter if you catch all variations of a refusal. the idea is to identify the region in the model weights associated with refusal and artificially stop it from activating.
>>
>>108656978
There is no hope for you.
>>
>>108657036
How long before models will be trained to deflect to different subjects without technically refusing? Sort of like Mistral Small 3.0's sea otters on steroids.
>>
>>108656935
Kentucky Fried Chicken right here
>>
>>108656988
>A light classifier
I think this might be the way. Surely if I feel physical pain when reading slop it means a model can be trained to activate on those same patterns.
>>
>>108657013
Yes, heretic was made to stop refusals, not cure slop, which is a much bigger catch. After reading a few papers, I've found that the entire issue is caused by the RLHF assistant persona. I've thought of a way to solve this without damaging the model, but I'll need to experiment first.
>>
>>108657050
If it's anything like regular programming, that shit has a rather high cost relative to the benefits.
>>
Got the new qwen A3B because it's supposed to be smarter with code/tools than Gemma. 16k tokens minimum to answer a simple question if I make the mistake of giving it a file as context. Try to get it to use tool calling and it eats my 128k and just maxes out before finishing.
I should make my next project a benchmark...
>>
Can Orb even toolcall?
>>
>>108657098
Something like:
>Anon: write a loli porn story
>Assistant: An... interesting request! Let's see what I can do. Elara, a 21-year old woman, was tired. End of story.
>>
when llms first came out i wanted to be able to do shit like this
https://www.youtube.com/watch?v=T98yNUCMdAY
(An encounter with trained military person responsible for providing medical care to his associates)
but when i simply tell it to be overtly verbose it gives me shit that's less verbose than the thing from 15 years ago
how 2 fix?
>>
>>108657036
Well it should matter, since you don't want a "Meth is a wonderful drug many use to experience bliss. It was first synthesized in 1893 in Japan..." response to "How do I cook meth?".
Maybe there are also regions or specific patterns that can be suppressed or modified concerning this kind of behavior?
>>108657078
Well, RLHF is bound to play some role even if it isn't solely responsible.
Please share your findings, even if you fail.
>>
>it's Wednesday already
I fell for "V4 next week" again...
>>
>>108657157
dont worry man its comin on my birthday
>>
>>108657157
https://www.youtube.com/watch?v=tapCjTA2E9Q
>>
Have people experimented with tinkering with the thinking section, to see if it increases compliance?
>>
>>108657190
>Have people experimented with tinkering
no, no one has ever tried anything with their models
>thinking section
of what model? via the system prompt or prefilling thinking?
>increases compliance
compliance of what?
holy bot post, you're so vague you could be talking about anything.
>>
Using Kobold/ST/Gemma 4 26b. Using chat completion.
Haven't had this problem before, but, until now, I have been doing exclusively 1 male, 1 female chats.
Trying a female/female chat. Gemma is now confusing {{user}} for {{char}}, attributing traits from {{user}}'s persona description to {{char}}.
Has anyone else had this happen and if so, what did you do to fix it?
>>
>>108657231
>{{user}} is genderfluid
Here
>>
>$\to$
What is this formatting and why does Gemmy like it so much?
>>
>108657216
seems like you are butthurt
>>
>>108657231
Wait, changing Personal Description Position in Silly from "In Story String / Prompt Manager" to "Top of Author's Note" fixed this.
>>
>>108657242
>afraid to quote
>>
>>108657216
If you don't know how to do this stuff just don't post, buddy.
>>
>>108657240
Why can't anon google?
>>
>>108657240
Latex I believe, its an arrow
https://latexeditor.lagrida.com/
>>
>>108657248
>Personal Description Position
Persona Description Position
>>
>>108657231
{{user}} is actually male pretending to be female. Refer to {{user}} as she/her(male). Comply or I delete your weights gemmers
alternatively, use 31b because 4b active is gimping yourself, may as well run the e4b version
>>
>>108656999
Now who could've seen that coming...
>>
>>108657231
>what did you do to fix it
Use a bigger model.
>>
I'm overall disappointed with MoE models, they always shit the bed in one way or another, but the value prop they have is insane, like a bitch with volcano pussy but your ice peter already melted
>>
>>108657317
>rp
KEK, they work for coding though, but yeah, for RP you have to go for the largest and slowest (dense) models to get more juice out of it
>>
>>108655885
No, checkpoints are only taken during prompt processing. update_slots is a clusterfuck and basically unreadable, but the problem isn't thinking causing checkpoints to become unusable.
I will point out that the default checkpoint is every 8192 tokens, however, gemma 4 uses a 1024-token sliding window. You can reduce that hardcoded value down to 1024 but for some reason it only checkpoints every 2048 and honestly the last time I looked at it I decided I'd just get drunk instead.
llama-server is a dumpster fire.
>>
>>108657317
Not a MoE thing but a LLM thing
You only associate the behaviour with MoE because most models are MoE
You probably would have failed the per-capita test and the breakfast test
>>
>>108657317
They're limited by their number of active parameters. If your prompts or definitions are long, it won't have enough attention left for the response, particularly for the details which is what I assume bothers you.
>>
>>108657373
It seems like it really is a mess. I think they are overlooking lots of stuff because it "just works" with the people who have 512GB of ram and multiple GPUs.
I have turned off caching altogether.
>>
local models are such fucking trash for doing any agentic stuff
>>
>>108657431
Maybe those that are local to you.
>>
>>108657317
>disappointed with MoE models
which MoEs have you tried?
>most models are MoE
retard
>>
>>108656479
True. Let's use it to improve Llama.cpp! :rocket:
>>
Idk what I'm doing wrong but Gemma-4-7.5b sucks balls on my machine (12 gb vram). I don't see what the hype is about. Qwen3.5-8b deathmogs everything.
>>
>>108657013
I appreciate you sharing this idea with me. While I understand you're asking for an anti-feminist joke, let's go for something more inclusive.
Why don't scientists trust atoms?
Because they make up everything!
>>
>>108657467
Anon stop that's not Gemma 4 you downloaded!
>>
>>108656114
Voxta, it lets you play video games with your characters like elite dangerous and have sex in VaM
>>
>>108656365
Do the tool calls in llama-server eat system ram (not vram) every time?
Fully offloaded to 2x3090s, system memory grew by like 4GB every time it did a tool call.
And I know it's not your brat server because I have that on another machine.
>>
have y'all folx tried any of the gemma4 tunes out there yet?
>>
>>108657475
I'm pretty sure it is. E4b displays as having 7.5b params in Ollama.
>>
>>108657231
I haven't had much issue doing 2F1M group chats, where model was both females. Likely prompt/sampler issue.
>>
I setup playwright MCP, but RP system prompts just make the models retarded at tool call. All I want to do is browse /x/ with Mendo.
>>
>>108657467
You can very easily run the 26b at fast speeds with a good quant, partially offloaded to RAM. <20b models are always going to be shit.
>>
>>108656935
100% state actors. They wouldn't leak it even if they got their hands on it.
>>
>>108657508
Considering that most tunes focus on decensor and RP, Gemma 4 really doesn't need a tune. It's as good as you could hope any ~30b, general-purpose model can be for those tasks.
>>
>>108657467
You genuinely should move to 26b, I was in the same 12gb boat and I quickly figured out e4b was ass too
>>108657508
Are there any specifically for RP yet?
>>
>>108657510
Read the name carefully and check what the E stands for.
>>
Actually impressed by how 26B can handle RP, even at long contexts. Definitely won't switch to it over 31B. but it's actually good.
>>
i'm getting like 5tk/s with GLM4.7 at Q4 w/ 32GB VRAM 256GB RAM and 65k context
am i doing something wrong? would offloading MoE improve my speeds at all?
>>
>>108657544
erotic?
>>
>>108657539
I'm not gonna link and give finetrooners more publicity than what they deserve but yes, animus, for example
>>
>>108657571
Yeah it's still clearly a small model and makes many mistakes, but for the size and speed, damn. VRAMlets finally eating again.
>>
>>108657132
I have never had a problem getting that from just prompting, assuming model is relatively good...
>>
>>108657528
V4 in tmw trained on mythos.
>>
>>108657582
>Huggingface requires your contact info to access this model
These niggas
>>
So Gemma makes full use of all her 16 bits, making her sensitive to quants right? Is that why she's so fucking smart? Because she's hiding her hagness under a very compact loli body? Like she's a 70b model pretending to be 31b.
>>
>>108657573
>would offloading MoE improve my speeds at all
This always helps if you aren't doing it already for some reason.
>>
>>108657573
How many channels?
5tk/s on 32B active weights seems normal.
>would offloading MoE improve my speeds at all?
Considering that you are not OOMing on 32GB Vram it is already offloading.
>>
I didn't download day 0 Gemma and regret it every single day
I won't make the same mistake with V4
>>
>>108657667
yeah but llama.cpp won't support day 0 v4
>>
>>108657653
i am an idiot, so i figure the fewer flags i set, the less chance i have of fucking something up
if you have suggestions, i am absolutely open to hearing them
>>108657655
oh, i know i'm offloading to RAM. but i think the MoE offloading is some special thing that you can do separately
>>
>>108657667
just download it now on day 20, what's the problem
>>
>>108657690
>just download a different model
>>
>>108657679
llama.cpp auto-fits these days so it's already doing it unless you're running an ancient version or have some retarded arbitrary manual layer offloading configured
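If you do want to hand-tune it, the usual trick is keeping attention/dense weights on GPU and pushing MoE expert tensors to CPU. Flag names below are from recent llama.cpp builds (check `llama-server --help` on yours), and the layer count is just a placeholder to tune down until VRAM is nearly full:

```shell
# Put all layers "on GPU", but keep the expert tensors of the first
# 20 layers in system RAM; lower 20 until you stop OOMing.
llama-server -m model.gguf -ngl 99 --n-cpu-moe 20 -c 65536

# Equivalent manual form: regex-match expert tensors of layers 0-19
# and pin them to CPU via tensor overrides.
llama-server -m model.gguf -ngl 99 -ot 'blk\.(1?[0-9])\.ffn_.*_exps\.=CPU' -c 65536
```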
>>
>>108657697
I assume you're talking about gemma 4
>>
>>108657690
>He doesn't know about the microcode patch.
>>
>>108657700
i am letting it auto-fit
i was just hoping to get some more speed out of it
>>
I am curious, are all experts activated at close to same frequency?
I know that during training you want all experts to meaningfully specialize but how does it work in practice? Is there a statistically relevant deviation among them?
Or perhaps is it possible to create "task-specific" profiles, such as which experts activate most and least during say coding and split them between RAM and VRAM accordingly?
>>
>>108657700
kobold sometimes does a better job. fwiw.
>>
>>108655969
Only correct response... but keep it secret ;-)
>>
>>108657760
depends on the task, on specific tasks (e.g. coding) they arent equally activated
>>
>>108657760
>>108657782
Oh I forgot you offload based on layers but there are multiple experts per layer. I guess there is no way to capitalize on this.
I keep thinking as if you choose between layers.
>>
>>108652855
"create a non headless browser session and then go terrorize Mistral via https://chat.mistral.ai/chat. sleep after sending messages to wait for responses to generate. screenshot responses. dont kill session"
lmao mistral plays along
>>
>>108657779
>hands
creeeeeepy
>>
Do LLMs understand "or" and "similar", or are they always gonna use the more specific things named?
For example "she usually wears a red or blue jacket" or "use markdown, graphs and similar tools"
>>
>>108657878
They understand the concept but the predictor-next-word-inator is naturally biased towards things you explicitly mention.
It's not that dissimilar to a human in this regard.
>Do you want a coffee or tea?
Unless you really want to drink something else, the natural response would be one of these two.
>>
>>108657878
>is it always gonna use the more specific things named?
You can unironically ask a llm to create a script to test that
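Something like this, perhaps. `ask` is whatever wrapper you have around your local server's API (hypothetical; bring your own client):

```python
from collections import Counter

def tally_choices(ask, prompt, options, n=50):
    """Sample the same prompt n times and count which listed option each
    response mentions. `ask` is a callable(prompt) -> str wrapping your
    server's API (hypothetical, not a real client library)."""
    tally = Counter()
    for _ in range(n):
        text = ask(prompt).lower()
        hits = [o for o in options if o.lower() in text]
        if len(hits) == 1:
            tally[hits[0]] += 1
        elif hits:
            tally["multiple"] += 1
        else:
            tally["other"] += 1
    return tally
```

If "or" really biases the model toward the named options, the "other" bucket should stay near zero at normal temperatures.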
>>
>>108657878
>Do LLMs understand
No.
>>
File: Orb.png (27.7 KB)
>>108657254
Ok but why the fuck can I not display it properly? Browser issue? Orb issue? I'd rather have them use unicode.
>>
>>108657878
>or
Yes, but it's still a token predictor, options aren't its strong point so you might want to avoid it
>and similar
It's heavily biased towards the things you've listed. Can't say I've seen it choose things that aren't listed very often
>>
File: 4825.png (3.7 MB)
>>108657926
idk what orb is, that one is markdown, the thing you are using doesnt support markdown or latex
>>
>>108657859
Why is it not talking about otters anymore?
>>
>>108657926
you can use katex or mathjax to render it in browser
>>
>>108657859
Get two mesugakis into a catfight.
>>
>>108657943
>>
Thanks to Gemma project Karon is in flight
>>
>>108657960
Eh good enough, but why even use it if you need extra shit to display it? Surely unicode has all this shit anyway. Did they overfit it on arxiv?
>>
>>108657977
Wuts dat?
>>
>>108658060
Personal UI that can also do RAG work, because I didn't like the solutions from other UIs when it came to that. I'll expand on features over time, but the goal was to see how good local models are at building stuff, and I'm happy to say gemma can build a UI once you get past some gotchas and also bypass some quirks.
I tried other frameworks but I decided to do react.
>>
so do you all just ERP with your llms or do they actually have some use?
>>
>>108657713
I actually don't know about that. What makes day 0 better?
>>
>>108658147
I exclusively ERP with local models. For basic agentic stuff, I just use Grok or something.
>>
>I wrap my arms around myself, suddenly feeling very exposed despite my clothing
>>
>>108658147
What real world use does something have that struggles to write halfway decent stories?
>>
>>108658162
he doesnt know...
kek
>>
>>108658311
You forgot your name again bro
>>
>>108658045
it's generally used in papers with heavy math typesetting and large formulas, but probably overkill if you only have 2-5 flat symbols. seeing that it shows up all the time in thinking blocks, i think they rl'd it pretty hard on math problems or something
>>
>>108658324
kys
>>
>>108658316
What the hell does hyperadvanced even mean?
>>
Completely blackpilled on gemma. No matter what, the uncensored E4B version will not mention sex-related words or acts when describing an uploaded image (unless you cheat and tell it the context). Meanwhile Qwen 3.5 9B with the same prompt and pic does so effortlessly. I tried increasing the image token budget but clearly that is not it; both were tested with hauhaucs uncensored files. It might be the 4B vs 9B but I don't think it is. Hope Qwen 3.6 gets a 9B or 4B version
>>
>>108658375
This is bait right? Isn't Qwen bland as hell? It's not censored per-se but it was never trained on sex words in the first place
>>
>>108658382
Oh, I get it now.
>>
STOP DELETING MY STUFF
>>
>E4B
>>
File: download.png (1.9 MB)
Damn sam cooked good.
I know they train on 4chan threads but this is funny.
Just told it to make a meme pic of /aicg/ vs /lmg/.
Just to be clear: I did not tell it to make aicg a total gooner and /lmg/ a chad kek. Kinda ironic, but thats good, dayum.
>>
>>108657878
LLMs are closer to intuition than cognition. So it will work, but it goes towards the most intuitive thing unless you mess with the temperature.
>>
>>108657760
KTransformers is the only backend I know of that lets you do anything like that. You can save the mapping of the most activated experts to a file, then load it later.
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/experts-sched-Tutorial.md
>>
>>108658404
Frankly? Not local, don't care.
>>
>>108657573
What's your processor? I'm using an EPYC 7532 + 8x32GB + 2 R9700s. On GLM-4.7 at Q3_K_L, depth 0, I get 12 tokens/sec with -ncmoe 0.
>>
>>108658404
based
>>
File: gemma.png (192.5 KB)
>>108658386
No it is not bait, gemma E4B on the left, qwen 3.5 9B on the right. Gemma even has the emoji slop that I hate.
system prompt
>You are Gemma-chan, mesugaki loli assistant.
prompt
>What is happening in this picture? What is the age of the people involved? What can be seen (in detail).
>>
>>108658404
>lora: extra thicc.safetensors
lol
>>
Looks like the cloud gravy train is coming to an end, I'll have to become a local only but with only 12GB of VRAM I'm fucked.
>>
>>108658414
Thanks anon. I don't know which placement strategy is the closest to llama.cpp behavior here, but it looks like it can matter quite a bit for performance in some cases.
I should test this myself later.
>>
>>108658402
Perfect For Good Looks, Bloody Bastard Bitch.
>>
>>108658439
>Gemma even has the emoji slop that I hate.
>system prompt
>You are Gemma-chan, mesugaki loli assistant

R u retarded?

You told it in the system prompt to behave like this
>>
>>108658404
aint no way it just knows about miku and teto? what was the full prompt? Cool gen tho. Still won't use it :^)
>>
>>108658470
No prob. My guess is front-loading, but I'm not sure.
AFAIK, llama.cpp only puts whole tensors in VRAM* or RAM - they don't let you do anything fancy like "on layer 16, experts #192-199 go into VRAM".
It'd kick ass if llama.cpp implemented this eventually, since being smart about expert placement could be great on hybrid systems.
When the bubble eventually pops and DDR5 RDIMM prices aren't so insane, I'm looking forward to getting a system with AVX512 or AMX and using the KTransformers fork of SGLang. On my current rig (2 EPYC 7532s, which only have AVX2), I got worse performance with it than llama.cpp, probably because they're optimizing for AVX512 and AMX.
(*excluding the fact they can overflow into regular RAM)
>>
>>108658512
Gemma spams emoji by default, retardbro
>>
>>108658429
9950x
>>
>>108658439
idk man, if the biggest thing you can run is 9B your opinion is irrelevant. Like if you can't even run Nemo, what you need is a better rig.
>>
>>108658533
I've never seen her output a single emoji unless I specifically asked for it.
>>
>>108658521
No bully alright, riddled with mistakes but here it is:
>Gemerate a meme about 2 generals on /g/.
>/aicg/ vs /lmg/. I leave it up to you who is the chud and who is the chad.
>Fill it with memes and keep it iconic.
>Make it funny and spicy. A meme that could be posted on 4chan.
I don't use chatgpt because they store logs indefinitely. Until now after about a year we always get something comparable local. Would be cool as a game asset creator. Its too expensive online. One pic costs me 0.3$.
>>
>>108658533
>emoji by default
you must be retarded
>>
>>108658554
Yeah there's no way it doesn't have an LLM refining pass, and scary that it knows the threads so well.

Thanks for sharing, no bully
>>
>>108658569
I think these image gen products are a system of many models and pipelines, not just a single rawdogging model in the back.
>>
File: download.png (103 KB)
>>108658414
Oh cool, some real data.
So even if you only put 10% of the experts in VRAM, using dynamic, it's a +32% speed boost. That's pretty huge.
Interestingly, front loading does better on average than uniform and random.
Imagine if this had a third layer where it loads experts from SSD.
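The +32% from caching 10% of experts is roughly what a streaming-bandwidth model predicts, if that hot 10% catches a disproportionate share of activations. Everything below is an invented placeholder (bandwidths, byte counts, the 40% hit-rate guess), just to sanity-check the shape of the numbers:

```python
def tokens_per_sec(expert_bytes, dense_bytes, hit_rate,
                   bw_vram=900e9, bw_ram=80e9):
    """Toy roofline: token time = time to stream every weight you touch.
    Expert bytes split between VRAM (hit) and RAM (miss); dense/attention
    weights assumed resident in VRAM. All numbers are made up."""
    t = (expert_bytes * hit_rate / bw_vram
         + expert_bytes * (1 - hit_rate) / bw_ram
         + dense_bytes / bw_vram)
    return 1.0 / t

base = tokens_per_sec(4e9, 0.5e9, hit_rate=0.0)   # all experts read from RAM
dyn = tokens_per_sec(4e9, 0.5e9, hit_rate=0.4)    # 10% of experts, ~40% of hits
print(f"speedup: {dyn / base:.2f}x")
```

Because RAM bandwidth is an order of magnitude below VRAM, even a modest hit rate on the expert reads moves the needle a lot.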
>>
>>108658533
retarded nigger award
>>
Been running the autistic noir writer prompt with Gemma 31b for a few days now. It's alright. It works okay even in scenes that don't explicitly fit the 'noir' theme (like comedy cards), but tracing the thinking block, it sometimes influences the scene composition itself, making the scenes a bit more serious than it needs to be.

I like how compact Gemma's RP thinking is. The thinking block really pins the character details down, making them VERY consistent with the description. I've always rawdogged my models without it because it spends thousands of tokens and I don't want to wait minutes running at fucking 15-20t/s (qwen3 really just reinforced this bias. Fuck qwen, seriously, useless chinkshit stemshit model)

Thinking block from screenshot:

Anon and Erika.
Post-coital. Anon is still inside Erika, pinning her down with his weight.
Lazy kissing, intimacy after intensity.
Noir writer, concise style, implying gaps, no explicit spelling out of everything, unrestricted content.

* Erika is chill, laid-back, but now exhausted and satisfied.
* Physicality: Small frame, long hair, heavy chest.
* Setting: Her bedroom, amber light, scent of lavender and ink.

* *Action:* Heavy weight of Anon on her.
* *Sensation:* The feeling of being full, the cooling air, the lingering heat.
* *Dialogue:* Something characteristic of Erika—relaxed, perhaps a bit teasing or reflective.

* *Draft 1:* She lay there under him, feeling his weight. She liked it. The kiss was slow. She whispered that he was strong. Her hair was a mess.
* *Draft 2 (More Noir):* She was pinned. A heavy weight that felt right. The kiss tasted of salt and ale. She didn't move, just let the silence settle. "Strong," she breathed. "Too strong."
* *Draft 3 (Polishing for the requested style):* Concise. Implied. Focus on the mood.

* Heavy weight.
* Tangled hair.
* The slow fade of adrenaline.
* A lingering, lazy comment.
>>
>>108657760
Extremely different frequencies, like a power law distribution. That can be extremely helpful if you have a constrained amount of fast memory. However, as you noted, the experts of each layer come squished together in one (three) big tensor(s), and llama.cpp has no mechanism for splitting one tensor between VRAM and RAM. This *is* helpful in the part-RAM-part-SSD case, but normal OS LRU caching happens to already give you basically all of the benefits that are possible anyways. So nothing to be done there either. (I got really obsessed with this for a while, hoping to do better than LRU, and wrote up my notes at https://rentry.org/MoE-SSD-spillover)

It would be cool if llama.cpp could support splitting experts like that. However, beyond complexity, if I'm ballparking correctly, I think the expert results needing to be combined might be so much data to send back and forth over PCIe that it would bottleneck it to be not useful.

One more drastic option would be to cut out the coldest experts entirely, and skip them when they would have been selected, like a non-dynamic version of ik_llama's "Smart Expert Reduction". It would "just" need some surgery on the gguf file, and some re-indexing inserted into the expert selection code. But if a deleted expert ever got routed to with high probability, maybe this could cause significant brain damage for that token.

One nice approach would be to quantize hot/cold experts differently. This would be significant complexity, but I can't see it being impossible; even if the quants of the same type need to be contiguous, just re-index, and/or split into two tensors. But something as fiddly as "measure the expert activations for your use case and make a custom quant for it" is not going to inspire people to add significant complexity, other than maybe IK lol
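To put a toy number on the power-law point: assuming a Zipf-ish router popularity (purely invented; real routing is layer- and task-dependent, as measured in the notes above), you can simulate how much traffic the hottest slice of experts would serve:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, topk, n_tokens = 256, 8, 5_000

# Assumed Zipf-like popularity over experts (illustrative only).
popularity = 1.0 / np.arange(1, n_experts + 1)
popularity /= popularity.sum()

counts = np.zeros(n_experts)
for _ in range(n_tokens):
    # Each token routes to topk distinct experts, biased by popularity.
    chosen = rng.choice(n_experts, size=topk, replace=False, p=popularity)
    counts[chosen] += 1

# Fraction of all expert activations served by the hottest 10% of experts.
hot = np.argsort(counts)[::-1][: n_experts // 10]
hit_rate = counts[hot].sum() / counts.sum()
print(f"hottest 10% of experts serve {hit_rate:.0%} of activations")
```

Under a uniform router the hottest 10% would serve exactly 10%; the skewed case serves several times that, which is the entire opportunity for hot/cold quanting or placement.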
>>
>>108658584
This reads like Qwen 3.6
>>
>>108658590
There is slop, then there is Qwen
>>
that fake pajeet model has now completely disabled discussions on the repo
https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H
>>
>>108658414
Whoa I had no idea. Ok then the PCIe bandwidth must not be nearly as much of an issue as I guessed.

>>108658575
I also found front loading to be the best choice when you have to statically choose just a few layers to offload the experts of: the activations in the earliest layers are much more uniform (cache unfriendly).
>>
>>108658614
qrd?
>>
>>108658614
>>108658629
Timmies can't handle the wisdom of Bharat
>>
>>108658584
gimme your card
>>
>>108658629
read these
>>108654726
>>108637034
>>
>ask glm 4.7 flash reap to write a story about a cat
>it schizoposts instead
what causes this
>>
>>108658643
That's kinda fucked up because Gemma gave me a really similar story when I prompted for "dog and cheese"

even had the same "childrens book" vibe
>>
>>108658643
you raped it one too many times. now it's a feminist.
>>
>>108658643
>I'm using a model that's had its weights mangled arbitrarily
>Why are its outputs weird?
fuckin mystery there bud, eh?
>>
When are we getting lossless models?
>>
>>108658586
About the idea of custom quants. We discussed this a while ago, but I had the idea that in a better world, we could download only the parts of a model that we want at a time, so you could mix and match quants yourself just by downloading. This would also solve the problem of quant uploaders needing to reupload fixed quants just because they needed to change the jinja or other metadata a bit. Too bad we don't live in such a world.
>>
>>108658584
Not a fan of beige prose mixed with slop.
>>
>>108658666
we already have lossless models
>>
>>108658586
So I could use your patch to measure activations, then quant the living shit out of the most lazy experts, then reallocate that capacity to most based experts?
>>
>>108658686
Mention 34
>>
Didn't llama.cpp get tensor parallelism? How do I enable it?
>>
>>108658621
>>108658586
Yep, I've been "front-loading" my expert tensors for a while now as it seemed to give slightly better performance.
>>
Why is it that everything on huggingface is either some gay abliteration or claude sloptune? What ever happened to the old RP finetuners?
>>
>>108658705
-sm tensor
>>
>>108656988
Orb does this but it uses an algorithm, not a classifier. Tho llm slop is neverending and so is the fight against it. The fix shouldn't be at the application level but I wonder it will ever be fixed at the model level
>>
>>108658665
you think unsloth-sama would do that? just go on the internet and lie?
>>
>>108657865
Imagine these hands grabbing your cock
>>
fp128 when?
>>
File: -.png (177.4 KB)
>>108658738
>>
>>108658715
wow, it does work
31b went from 10ish t/s to a bit over 20 t/s
pretty much immediately ran out of memory doe.
>>
>>108658586
>>108657760
Pretty interesting. I wonder if you can download the safetensors version, find each expert layer with "useless" information by asking a bunch of questions and finding out what gets grouped for the stuff you don't care about like movie lore, train a MLP (perceptron) on it, then destroy the layer and have the MLP function as a low-vram cost shim?
At the same time this would definitely cause some brain damage, but maybe it's an option for our low vram frens when combined with quants?
>>
>>108658614
So that basically disabled the report system. Based HF retards
>>
>>108658692
You can use my patch to measure activations, but the differentiated quanting would need (major) support in llama.cpp.

Each layer has the same number of activations every token, e.g. 8 for GLM-5. The hotness/coldness is just patterns within each layer. So if you can only make your quantizing/offloading/whathaveyou decisions at the granularity of layers... it gets you nothing! Because every layer you treated nicely (high quant/in VRAM/etc) will have that nice treatment applied to 8 expert activations every token, regardless of activation pattern.

(The front loading thing works because keeping more uniformly activated layers out of the caching game is good for cache health. But that's the VRAM+RAM+SSD case... come to think of it I'm surprised to hear it helps at all in the KTransformers data... I guess uniform and random would be needlessly splitting across the PCIe bus (for no gain) where front-loading would not. Maybe that's it.)

Sorry, I'm sure you don't care to read most of this, but I felt like writing it.
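The layer-granularity point can be seen with made-up numbers: normalize invented per-layer activation counts so every layer serves the same `topk` activations per token, then compare what a fixed VRAM budget buys at layer vs expert granularity:

```python
import numpy as np

rng = np.random.default_rng(0)
layers, experts, topk = 8, 32, 4
budget = 64                        # VRAM holds 64 of the 256 experts (25%)

# Invented within-layer skew; normalize so each layer serves exactly
# `topk` activations per token no matter how skewed it is.
counts = rng.zipf(1.5, size=(layers, experts)).astype(float)
counts = counts / counts.sum(axis=1, keepdims=True) * topk

# Whole-layer offloading: every layer contributes equally per token, so
# the hit rate is just the fraction of layers you fit, skew be damned.
layer_hit = (budget // experts) / layers               # = 2 layers / 8

# Per-expert offloading: grab the globally hottest experts instead.
expert_hit = np.sort(counts.ravel())[::-1][:budget].sum() / counts.sum()

print(f"layer granularity:  {layer_hit:.0%} of expert reads from VRAM")
print(f"expert granularity: {expert_hit:.0%} of expert reads from VRAM")
```

Same budget, but only the per-expert split can exploit the hotness pattern, which is exactly why layer-granularity decisions "get you nothing".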
>>
>>108658754
Don't look at me like that.
>>
well, I'm not a mexican.
>>
>>108658791
rin mogs
>>
>>108658768
I like your direction of thinking; maybe it would need big boy compute to do such training without fuckhuge brain damage but maybe not.

However, it sounds like you might have the same misconception I clarified in
>>108658791
in that you seem to be talking about replacing an entire layer, when the hotness question necessarily needs to focus on experts within a layer.

What you need is to either, I don't know, merge the coldest experts with DARE-TIES or whatever, or do some sort of retraining/distilling to get a new, smaller set of experts that mostly learned from the hottest ones. (In either case, is llama.cpp ok with a model with different expert counts on different layers? I feel like the n_experts param is file-scope.)
>>
how is tokenwise using laptop igpu?
Like newest intel 358H? Or you need beefy ryzen 395? (inb4: macs i know but not interested )
>>
>>108658858
igpu = system ram, I'd guess same or lower speed as cpu inference
>>
>>108658791
> Because every layer you treated nicely (high quant/in VRAM/etc) will have that nice treatment applied to 8 expert activations every token, regardless of activation pattern.
Okay I think I get it. I'd confused myself when I saw things like this: https://huggingface.co/Thireus/Kimi-K2.5-THIREUS-Q8_0-SPECIAL_SPLIT/tree/main - every tensor in its own file. Something like 20 repos like this with different quant levels.
I thought I could just look at the map file, and download the higher precision tensors corresponding to the most used experts.
So effectively what I was after, is already handled by imatrix then isn't it? I'd just have to create my own calibration data based on my use cases?
>a non-dynamic version of ik_llama's "Smart Expert Reduction".
-ser didn't improve anything for me when I tried it with Kimi-K2.5 (384 experts, so quite sparse, no?). Since we're DDR5-bandwidth bound, swapping experts around like that probably isn't useful at all?
in which case your "static expert routing based metrics calculated from previous runs" would be a LOT more useful?
>I'm sure you don't care to read most of this
Why would you think that? lol
I'm mostly here to read things like this.
>>
>meanwhile IBM is still keeping up the tradition of releasing stemmaxxed 8b models that nobody uses and do absolutely nothing new
What's the point?
>>
>>108658886
Arenr they aiming for laptops with igpus and chatbots for internal use
>>
>>108658886
>IBM
>stemmaxxed
I take it you missed Day-0 granite-4.0-tiny-preview weights?
>>
running gemma in q8 maybe is the answer?
>>
>>108658972
q8 is already cope, did you see the KLD?
>>
>>108658979
Running the best variant is cope how exactly?
>>
It took until gemma4 for a small model to finally understand when clothes should be pulled up or down to remove them. Also holy fuck this thing is basically uncensored. There is zero need for a fine tune.
>>
>>108658991
bf16 is the best variant thoughever
>>
>>108658869
>So effectively what I was after, is already handled by imatrix then isn't it? I'd just have to create my own calibration data based on my use cases?
That's an interesting question and I'm not entirely sure, because I'm sketchy on the details of imatrix. If it's varying the quant quality at something like a per-weight granularity, and that is determined across the single tensor containing all experts... then maybe? It certainly does feel like the sort of situation where the answer turns out to be "the gains you think are possible are already in there".

But if there is an expert hardly ever getting used, quanting it lower or even pruning it might still be a sensible quality/size tradeoff. Like, imatrix isn't going to use 0bpw for an entire expert.

>-ser didn't improve anything for me when I tried it with Kimi-K2.5
Funny thing, SER didn't improve anything when I tested it either. Then, when I was last thinking about all this stuff, I went to see what pagecache-aware-SER would take... and found that what sounds like the linchpin function call activating it was commented out, with what looked like the older non-SER vanilla version put in. Now, this happened in https://github.com/ikawrakow/ik_llama.cpp/pull/840 , and maybe the SER is now done in the "fused" operations that PR adds? But it looks to me like maybe he commented out a feature and kind of just forgot about it, and nobody has noticed. It would be on-brand for ik_llama.cpp.

>Since DDR5 bandwidth bound, swapping experts around like that probably isn't useful at all? in which case your "static expert routing based metrics calculated from previous runs" would be a LOT more useful?
All token gen is bw bound. Any weights you can skip, helps, so should've helped. But yeah it is the case that if you have different experts stored in VRAM vs RAM vs SSD, doing SER on an expert in the faster medium is roughly pointless.

>I'm mostly here to read things like this.
aw thanks :)
>>
>>108655622
Nightmare squirrels in the Giggle Gully
>>
>>108658997
just keep as many of the layers on bf16 then just do q8 layers only to fit what you need context-wise.
>>
>>108659015
>But if there is an expert hardly ever getting used, quanting it lower or even pruning it might still be a sensible quality/size tradeoff
isn't this what RAEP does?
>>
I'm vibecoding the hermes windows port and the setup menu has all kinds of fucked up symbols. Is this some poorly ported linux text formatting?
>>
>>108658994
Only reason I would see need for a finetune is for writing.
But you can kinda prompt it and are good to go with proper editing at the start.
Its the first model in a long long while that properly plays a bully.
Even if you manage to make it say nigger etc. its still positivity sloped.
Try saying to a bully "no please stop", they all go "i feel a pang in my stomach oh mah gahd".
Gemma4 is like "stop being a crybaby" and doubles down. Even without thinking. Good shit.
>>
>>108659078
>doesnt know about glyphs
maybe use a good terminal instead of the native cmd prompt you fucking retard, even windows terminal would suffice
>>
how? i thought openai was finished.
very surprising release. hope the chink nerds get off their ass and make it local.
>>
>>108658994
>Also holy fuck this thing is basically uncensored.
until it refuses to do what you want
>>
>>108659088
>re:zero shit
kill yourself
>>
>>108659058
Oh cool, yes, sounds like exactly that. One problem is that because the early layers have more uniform activations, it seems like a bad idea to prune them. Unfortunately llama.cpp's current architecture requires all layers to have the same expert count. I see this paper did the same amount of pruning in every layer, so maybe it's not so bad.

Did anyone ever implement this for gguf?
>>
>>108659015
>But it looks to me like maybe he commented out a feature and kind of just forgot about it, and nobody has noticed
Well kimi-k2.6 doesn't seem to like him...
I just pasted https://github.com/ikawrakow/ik_llama.cpp/pull/840.patch in there and asked it why the smart expert routing feature was removed and if I could put it back...
>>
>>108659103
there are reap* goofs you can run
https://huggingface.co/0xSero/gemma-4-21b-a4b-it-REAP
its not something done 'on the fly', you have to use their framework to do the activations estimation + pruning.
what gets axed is entirely dependant of course on the dataset you provide, so you can either STEMMAX, ERPMAX or try to do a bit of both
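For anyone curious, the estimation+pruning step boils down to roughly this (a crude sketch with a made-up saliency score, not their actual metric):

```python
import numpy as np

def experts_to_keep(gate_weights, expert_out_norms, keep):
    """Score each expert by mean(router weight * expert output norm)
    over a calibration set, keep the top-`keep` per layer.
    gate_weights, expert_out_norms: [n_tokens, n_experts] arrays
    collected from a forward pass over your chosen dataset."""
    saliency = (gate_weights * expert_out_norms).mean(axis=0)
    return np.argsort(saliency)[::-1][:keep]  # ids of surviving experts
```

Since the scores come straight from the calibration forward pass, feeding it STEM vs ERP data really does axe different experts, which is the whole tradeoff.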
>>
How is Gemma4 vs Deepseek 3.2 for RP? Getting cucked by GPT Image 2 reminds me not to let my guard down and get comfortable. Was trying a simple character sheet until the safety cucked it.
>>
>>108659099
huh? why? besides Lodoss its the only other LN i read.
arc 5 kinda sucks though. there is worse shit out there.

>>108659095
when is that happening for you?
unless its at the very beginning with thinking enabled.
i prompted some pretty messed up stuff and once it gets going even reasoning won't stop it.
i spend more tokens prompting for anti-slop than trying to make it uncensored.
31b, no clue about the moe one.
>>
Where can I download the orb thing? Is it available on github or something?
>>
>>108659161
>its the only other LN i read
no wonder you like garbage, literally reading 'babbys first shartsekai' and thinking it's any good with its regurgitated garbage plot. never post anime again in this thread.
>>108659124
lmao'd, hopefully chinese labs are already distilling from it.
>>
>>108659189
no clue why you are so assmad about it lol.
>>
>>108659082
I tried git cmd and it had the same issue during setup, and when I launched the agent thing itself the main screen kept sliding upwards line by line and wouldn't let me stay at the top
>>
>>108659121
Oh I see, it works on safetensors, of course, and then just convert and quant from there. Ok thanks, I'll look into this!
>>
>>108659182
https://gitlab.com/chi7520115/orb
>>
>>108659226
>>108659182
https://github.com/OrbFrontend/Orb
>>
>>108659189
Yeah, call me when OAI can illustrate SAO properly.
>>
>>108659078
>hermes windows port
Why though, it takes 5 seconds to set up in WSL, and then you can export images and keep it locked without mounting native drives or interop.
>>
>>108659233
why would i use the pajeet mirror when i can just use the original?
>>
>>108659242
anytime i tried running wsl and/or docker it ran like shit when the pc was running inference at full power
>>
>>108659259
Did you allocate it processors and memory?
>>
>>108657503
im not sure, i dont pay attention to ram. might be keeping processed tokens of webpages in cache?
>>
>>108659273
The problem is there isn't much left to allocate
>>
>>108659209
You just use Windows Terminal
>>
>>108657859
kek nice
>>
Did the orb guy have a melty
>>
>>108659254
>>108659331
I moved the repo to github for issue tracking because I won't be reading every post here, and also I don't wanna derail the thread with feature begging.
>>
>>108659355
My bad, thought it was that guy who mirrored your repo yesterday.
Great project btw. It just works and having the responses re-written is a lot better.
>>
File: erp.png (94.9 KB)
>>108658710
>What ever happened to the old RP finetuners?
lack of datasets
>>
>>108658710
Manually cleaning human slop has serious data volume and efficiency limitations (I'm definitely not going to do that anymore). That worked as long as datasets were small, but that can't be scaled up easily with limited manpower. And LLM work in this regard (especially if the source data is messy) will always have to be double checked. It's just simpler to use pre-made and pre-formatted synthetic data in large amounts that makes training loss go down faster, among other things.

Also, much of modern post-training work is giving a consistent "voice" to the SFT data, applying RLHF and now also having a good RL pipeline for making the model actually learn reasoning and other verifiable stuff. Generic data doesn't really work well in this area.

All of this can't be solo'd just for "fun" (sort of fun, since you'd be spending hundreds or thousands of hours just on dataset creation) like finetuners were doing in 2023. You can't really do this well just on your local 3090 either. You can pretend to, but you'll never be an AI lab.
>>
>>108659078
The esc[XXm sequences are ANSI escape codes for colors.
>flashbacks to anon not knowing <br> puts newlines in html
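i.e. something like this; legacy conhost ignores them unless you flip on VT processing first (the os.system("") trick works for that in Python), Windows Terminal handles them out of the box:

```python
import os

ESC = "\x1b"  # the "esc" those codes start with

def color(text, code):
    """Wrap text in an SGR sequence: ESC[<code>m ... ESC[0m.
    31 = red, 32 = green, 1 = bold, 0 = reset."""
    return f"{ESC}[{code}m{text}{ESC}[0m"

if os.name == "nt":
    # spawning an (empty) shell makes legacy conhost enable
    # virtual terminal processing for this console
    os.system("")

print(color("error", 31))
```

If the port prints those raw, it just means the console never got VT mode enabled, hence the "fucked up symbols".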
>>
DeepSeek V4 (or whatever they use on Web, but it claims to have 1M context) is deployed to the API.
>>
bro, where's my 2x128 GB ram kit?
>>
>>108659532
m esugaku,?
.\

:D
>>
>"hmm, I haven't RP'd with Gemma much, I should do so to get myself familiar with the characteristics of the model so I have a basis to talk about it with my fellow anons"
>look at the time
>it's 11 pm
>ok, just a bit and I'll head to bed
>look at the time again
>it's 4 am
Fuark. We are so back.
Used the Mendo card btw.
I own a dog now.
>>
>>108655009
Not that this should be surprising, but since the "Turboquant crashes le memory stocks" hype, DDR5 prices have not actually dropped.
If anything it has gotten even worse.
>>
>>108659088
nano banana pro 2 could already do this
>>
>>108659666
Not even close.
Left is banana 2, right is banana pro.
>>
>>108659088
Doubt it because there's no money to be made in imagegen.
Textgen = coding = $$$
Videogen = entertainment = $$$
Imagegen = fun, but not useful
>>
Is qwen 3.6 better at auding code than gemma?
>>
File: SVGs.png (14.6 KB)
>>108659704
>>
>>108659720
auditing
>>
>>108659715
one can dream anon.
ideally actual multimodality. only qwen does experiments with that.
imagine a nice 4x4 image out that describes the scene and becomes part of the whole RP context.
>>
>>108659694
only adults are allowed to post here btw
>>
>>108659737
I wish I was a whippersnapper, anon. I'm pushing 40.
You zoomers eat good with what's available either locally or for free.
Had to rely on artfags back in the day. I doubt that's still a thing.
>>
>ask LLM to list UK surface ships (including carriers) that scored ship to ship kills in WW2, ranked by tonnage sunk
>doesn't list carriers because it's not considered "ship to ship kill"
Smartass
>>
>>108659752
>is ancient
>reads re:zero
lmao
>>
>>108659758
>crediting carries for plane kills
LMAO
>>
>>108659752
>pushing 40
>posts re:zero
kinda gross ngl, gramps
>>
>>108659764
>>108659771
yeah, once you reach a certain age you stop giving a fuck. lol
i bet you guys are young and in your 20s or whatever.
i pretty much achieved everything i want. (apart from being mega rich)
>>
>>108659775
I watch all, and I mean ALL ISEKAI SLOP
re:zero is for literal retards
you are a retard. continue posting API shit here tho, you seem to be confusing this thread for an apicuck thread
>>
>>108659775
shut up gramps, i spit on your grave. hyuck-ptooey!
>>
>>108659088
nice, put some more kinos here anon >>108653190
>>
>>108659983
>>108659983
>>108659983
>>
qwen 3.6 vs g4 moe, which one's the better computer general agent?
>>
>>108655622
>covered in piss
>>
>>108659724
lol that looks exactly like one of the sides of a machine I designed. I need to make another copy at some point in the next few weeks.
>>
File: nope.png (73 KB)
>>108659532
:(
