/g/ Thread #109001981

/lmg/ - Local Models General Anonymous 06/07/26(Sun)20:35:59 No. 109001981 [Reply] ►

File: slopu.jpg (296.7 KB)

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108997418 & >>108992276

►News
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm

>>

Anonymous 06/07/26(Sun)20:36:58 No. 109001988 ►

File: GJSADOQaYAAth5M.jpg (230.4 KB)

►Recent Highlights from the Previous Thread: >>108997418

--llama.cpp Gemma4 MTP support and NVIDIA driver power issues:
>108999197 >108999233 >108999304 >108999235 >108999239 >108999248 >108999816 >108999837 >108999872 >109000059 >108999307 >108999361 >108999363 >108999534 >108999840 >108999910 >109000348
--Modern context window sizes, hardware requirements, and effective usable limits:
>108997787 >108997790 >108997801 >108997812 >108997829 >108997841 >108997856 >108997887 >108997905 >108997921 >108998145 >108998224 >108997794 >108998166 >108998276 >108998341
--Testing DeepSeek V4 Flash reasoning and vLLM implementation stability:
>108999274 >108999312 >108999419 >108999447
--DeepSeek V4's in-character thinking and local deployment requirements:
>108998056 >108998104 >108999526 >108999579 >108999598 >108999619 >108999631 >108999686 >108999737
--Comparing Gemma 4 12B and 26B performance and VRAM constraints:
>109001747 >109001763 >109001875
--Comparing KVarN and TurboQuant for KV cache quantization:
>108999190 >108999709 >109001893
--Comparing Gemma4 QAT variant performance and quality issues:
>109001846 >109001883 >109001913
--Technical guide on KV cache quantization for VRAM efficiency:
>109001715
--Challenges implementing parallel local agents via llama.cpp and vLLM:
>108998111 >108998131 >108998176 >108998337 >108999833
--Separate sampling configurations for thinking and tool call outputs:
>109001446 >109001454 >109001535 >109001557 >109001591 >109001717 >109001839 >109001697
--Performance and validity of MoE models with SSD expert offloading:
>108997746 >108998076 >109000430 >108998807
--NoLiMa benchmark results for Qwen 3.5 MoE with thinking enabled:
>109000576 >109000663 >109000607
--Logs:
>108997874 >108997958 >108999088 >108999734 >108999737 >109000227 >109000323 >109000370 >109001482
--Miku (free space):
>109001425

►Recent Highlight Posts from the Previous Thread: >>108997420

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script

>>

Anonymous 06/07/26(Sun)20:43:46 No. 109002043 ►

File: kghpwi9epya71.jpg (139.1 KB)

It's over for /lmg/...

>>

Anonymous 06/07/26(Sun)20:53:44 No. 109002107 ►

anyone actually get more than like 64k context usable? i try to run 128k context on qwen3.6 and after about 32k it just goes to shit real quick on a 5090

>>

Anonymous 06/07/26(Sun)20:55:42 No. 109002126 ►

File: 1760802872565174.jpg (216.0 KB)

>>109001981
has a local model ever saved your life?

>>

Anonymous 06/07/26(Sun)20:56:21 No. 109002131 ►

File: 1726408828426261.jpg (267.5 KB)

96k for long chat can work on my 5090

>>

Anonymous 06/07/26(Sun)20:56:33 No. 109002135 ►

gonna need ai labs to start targeting dense models at 16gb vram havers

>>

Anonymous 06/07/26(Sun)20:58:43 No. 109002148 ►

So what's the verdict? 26B or 12B?

>>

Anonymous 06/07/26(Sun)20:59:43 No. 109002156 ►

I don't notice issues in either gemmy or qwen under 100k. You're not using the MoEs are you, anon?

>>

Anonymous 06/07/26(Sun)21:00:39 No. 109002164 ►

>>109002107
No, but once you hit 32K you have to be a little more explicit and hands-on with your prompts to keep it focused. It's not that hard once you get used to the signs that it's starting to shit itself. Like if I'm coding I'll point it directly to the file I know something is in, instead of leaving it to search for it which is just a waste.

>>

Anonymous 06/07/26(Sun)21:00:55 No. 109002166 ►

>>109001587
>rich fags are running Kimi
it's a 32b active model and you are running it quantized. barely a flex over gemma 31b.

>>

Anonymous 06/07/26(Sun)21:01:13 No. 109002170 ►

26 is over twice as big as 12

>>

Anonymous 06/07/26(Sun)21:02:20 No. 109002181 ►

>>109002170
Gemma-chan told me that bigger doesn't always mean better...

>>

Anonymous 06/07/26(Sun)21:02:21 No. 109002182 ►

>>109002166
Quanted Kimi still mogs full size 31b doe.

>>

Anonymous 06/07/26(Sun)21:02:25 No. 109002184 ►

>>109002170
That's good. Your maths are really improving. You get a gold star sticker. We are all so proud of you.

>>

Anonymous 06/07/26(Sun)21:03:49 No. 109002195 ►

File: 1751801909789361.png (1.8 MB)

>>109002170
26 is a microdick wearing an 8 inch chastity

>>

Anonymous 06/07/26(Sun)21:03:59 No. 109002197 ►

File: 1777522742291.png (140.3 KB)

>>109002126
Yeah?

>>

Anonymous 06/07/26(Sun)21:04:29 No. 109002200 ►

>>109002181
Gemma-chan is a size queen.

>>

Anonymous 06/07/26(Sun)21:04:34 No. 109002204 ►

>>109002046
Too bad for you, I'm not. I've been stuck using Cydonia 24b for months because it was good at RP. I thought that maybe gemma 4 12b with the qat as a general use model might be able to RP and have a better experience using a larger context. I tried it at 8k, 16k, and 32k, it's the same level of retardation. And yes, I'm using the correct chat template, too. Set to chat completion rather than text completion. Thinking was off, because turning it on would cause 1400 token thinking blocks on top of a 700 token output when it finally got done thinking. That is, assuming the fucking thing didn't get caught in a loop 9/10 times halfway through the thinking process and just spam the last word over and over and over again.

>>

Anonymous 06/07/26(Sun)21:05:20 No. 109002212 ►

>>108998807
>Also, that geometric mean thing is a myth, otherwise Mistrals 128B dense would beat everything.
What part of it being a finetune of a 2 year old base model is so hard to understand?
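
For anyone lost on the "geometric mean thing": it is a rule of thumb that guesses the dense-equivalent size of an MoE as the square root of total times active parameters. A minimal sketch, assuming that reading of it; the parameter counts below are illustrative placeholders, not figures taken from this thread, and the rule itself is a heuristic that ignores training data and base model age (which is the point being argued above).

```python
import math


def geometric_mean_estimate(total_b: float, active_b: float) -> float:
    """Rule-of-thumb 'dense-equivalent' size of an MoE, in billions of parameters.

    This is only the heuristic sqrt(total * active); it is not a measured quantity.
    """
    return math.sqrt(total_b * active_b)


# Hypothetical MoE with 1000B total and 32B active parameters (illustrative only).
print(round(geometric_mean_estimate(1000, 32), 1))

# A dense model has total == active, so the estimate is just its own size.
print(geometric_mean_estimate(128, 128))
```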

>>

Anonymous 06/07/26(Sun)21:05:28 No. 109002216 ►

>>109002204
>cydonia
KEKASAARAROOOOOO

>>

Anonymous 06/07/26(Sun)21:05:53 No. 109002221 ►

>>109002195
>those fingers
ohno, k*rea will hate her

>>

Anonymous 06/07/26(Sun)21:07:50 No. 109002237 ►

>>109002195
who dis

>>

Anonymous 06/07/26(Sun)21:08:45 No. 109002244 ►

>>109002237
The cyan, pink and white tells you everything you need to know.

>>

Anonymous 06/07/26(Sun)21:08:50 No. 109002245 ►

>>109002197
KEEEEEEK my sides

>>

Anonymous 06/07/26(Sun)21:09:30 No. 109002252 ►

>fire up ST feeling a silly coom adventure
>ends up devolving into tragic cancer death sadness
fuck why am I like this anons

>>

Anonymous 06/07/26(Sun)21:09:42 No. 109002254 ►

>>109002244
lol rent free where it cannot fit less

>>

Anonymous 06/07/26(Sun)21:09:49 No. 109002256 ►

>>109002244
>cyan
Get your fucking eyes checked

>>

Anonymous 06/07/26(Sun)21:10:50 No. 109002262 ►

is there a gemma 12b qat assistant model yet, does the regular 12b assistant mtp work?

>>

Anonymous 06/07/26(Sun)21:11:08 No. 109002265 ►

>>109002212
There's no arguing with people like that.

>>

Anonymous 06/07/26(Sun)21:11:41 No. 109002268 ►

>>109002216
Cydonia does what I need it to do, which is RP without retardation. Gemma 4 12b QAT does not. Your refusal to address the issue at hand means you're actually acknowledging that the qat's are fucked and have no real counter to anything I've said. If I were a saar in this day and age, I'd be here in burgerstani on a visa making stupid amounts of money scamming retarded people with AI usage and could afford better hardware to run better models.

>>

Anonymous 06/07/26(Sun)21:12:47 No. 109002276 ►

>>109002182
yeah a 32b model is better than a 31b model

>>

Anonymous 06/07/26(Sun)21:12:49 No. 109002278 ►

>defending finetunes
Absolute state of /lmg/.

>>

Anonymous 06/07/26(Sun)21:13:32 No. 109002282 ►

>you
>absolute state of ff

>>

Anonymous 06/07/26(Sun)21:13:47 No. 109002283 ►

I can't run her but Kimi is the best

>>

Anonymous 06/07/26(Sun)21:14:37 No. 109002291 ►

>getting so utterly btfo'd that you can't quote the person you're replying to out of shame
Absolute state, indeed.

>>

Anonymous 06/07/26(Sun)21:14:56 No. 109002296 ►

Never used Kimi but it's the best

>>

Anonymous 06/07/26(Sun)21:16:49 No. 109002318 ►

Never used Claudia Opussy but heard she was the best
.

>>

Anonymous 06/07/26(Sun)21:17:23 No. 109002324 ►

File: e8f-662426453.jpg (53.7 KB)

>>109002043
The 70 year old abstainer is bald, guy on the left has a cool cane and a hat. Checkmate.

>>

Anonymous 06/07/26(Sun)21:22:00 No. 109002359 ►

File: 1769248331936450.png (164.0 KB)

Just wanna say I appreciate that Gemma doesn't break character.

>>

Anonymous 06/07/26(Sun)21:33:00 No. 109002430 ►

File: Subtext.png (156.7 KB)

>>109002359
What happens if you click that little attachment button? Are images blacked out? You're using Bart's Q6_K_L without any mmproj? Can it identify images? Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?

>>

Anonymous 06/07/26(Sun)21:36:20 No. 109002441 ►

>>109002430
>bart
Yes. I have the mmproj loaded, just had to refresh the page because I closed llama earlier to change settings. If the mmproj isn't loaded it gives you an error when you try to upload an image.
> Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
That's what I thought too but apparently not.

>>

Anonymous 06/07/26(Sun)21:37:19 No. 109002445 ►

Anyone got the mtp weights for 31B non-QAT? I don't want to compile llama.cpp and make my own quants...

>>

Anonymous 06/07/26(Sun)21:41:24 No. 109002469 ►

>>109002441
Also I assume it's a llama issue but it won't let me upload videos.

>>

Anonymous 06/07/26(Sun)21:42:32 No. 109002474 ►

File: 432234234423434.jpg (62.5 KB)

>>109002441
Glad I am not alone in this retard hell then. At least I figured out it would work with a mmproj file in llamacpp. But that's about it. I still don't know if this 12B Unified model actually does anything cool on its own. Is it in some proprietary backend? Who knows.

>>

Anonymous 06/07/26(Sun)21:42:53 No. 109002475 ►

>>109002445
>https://huggingface.co/g0chu
26B unquanted one works well for me so I'd say the other ones are safe too. I have no idea why he updated everything just a moment ago though.

>>

Anonymous 06/07/26(Sun)21:48:52 No. 109002508 ►

>>109002430
>>109002441
>>109002474
I didn't need any mmproj to identify images on 12B. It's able to identify mine just fine (though I suppose it could have done better when I asked it for booru style tag outputs), but I haven't tried audio with it yet.

>>

Anonymous 06/07/26(Sun)21:53:42 No. 109002542 ►

I still don't know what everyone here is talking about in 75% of the posts tbqh
i just dont

>>

Anonymous 06/07/26(Sun)21:54:19 No. 109002548 ►

File: Screenshot 2026-06-07 at 22-52-00 Introducing Gemma 4 12B.png (28.5 KB)

>>109002469
>>109002474
I think this is what the mmproj is. I'm going to guess that it was separated out due to how llama.cpp works.

>>

Anonymous 06/07/26(Sun)21:55:52 No. 109002560 ►

ugh, it's reprocessing the entire prompt. what do --swa-full (which takes my entire context) and --cache-reuse do?
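
A rough illustration of why a full reprocess happens, assuming plain prefix-based prompt caching: only the tokens that match the cached sequence from the very start can be skipped, so any change early in the context invalidates everything after it. This is a sketch of the general idea, not llama.cpp's actual implementation; `reusable_prefix_len` and the token IDs are made up for the example, and the exact behavior of `--swa-full` and `--cache-reuse` is best checked in `llama-server --help`.

```python
def reusable_prefix_len(cached: list[int], new: list[int]) -> int:
    """Count how many leading tokens of the new prompt match the cached tokens."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Toy token IDs: the third token changed (e.g. an edited message or shifted context).
cached = [1, 2, 3, 4, 5, 6]
new_prompt = [1, 2, 9, 4, 5, 6, 7]

keep = reusable_prefix_len(cached, new_prompt)
reprocess = len(new_prompt) - keep
print(f"reused: {keep}, reprocessed: {reprocess}")
```

Frontends that rewrite the start of the prompt (trimmed history, changing system prompt, injected lorebook entries) tend to trigger exactly this.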

>>

Anonymous 06/07/26(Sun)21:55:59 No. 109002562 ►

>>109002430
>Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
I thought the same thing, but apparently for llama.cpp the image parts of the 12B are split out into a separate mmproj anyway. It's just very small compared to previous models' mmprojs
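
A minimal sketch of what that means in practice, assuming you launch `llama-server` yourself: the vision projector is passed as a second file with `--mmproj` next to the main GGUF. The helper name and both file names below are hypothetical placeholders.

```python
from typing import List, Optional


def llama_server_args(model_path: str, mmproj_path: Optional[str] = None) -> List[str]:
    """Build an argv list for llama-server, adding the projector only if one is given."""
    args = ["llama-server", "-m", model_path]
    if mmproj_path is not None:
        # Without this file the server loads as text-only and image uploads fail.
        args += ["--mmproj", mmproj_path]
    return args


# Hypothetical file names; substitute whatever you actually downloaded.
print(" ".join(llama_server_args("gemma-12b.gguf", "mmproj-gemma-12b.gguf")))
```

The list can be handed to `subprocess.run` as-is if you want to launch it from a script.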

>>

Anonymous 06/07/26(Sun)22:02:13 No. 109002609 ►

why doesn't Stable Audio 3 Medium expose the Init Audio functionality like the web space does?
https://huggingface.co/spaces/stabilityai/stable-audio-3/blob/main/app.py#L605

Is there a node or something for this? I'm surprised.

>>

Anonymous 06/07/26(Sun)22:02:32 No. 109002613 ►

>>109002560
>--swa-full
you probably don't need it anymore

>>

Anonymous 06/07/26(Sun)22:03:58 No. 109002623 ►

>>109002237
how can you not know who that is

>>

Anonymous 06/07/26(Sun)22:12:19 No. 109002672 ►

70b dense

>>

Anonymous 06/07/26(Sun)22:19:39 No. 109002722 ►

File: 1752741800991819.gif (968.6 KB)

>>109002562
Marketing faggots will reap what they sow, one day.

>>

Anonymous 06/07/26(Sun)22:24:16 No. 109002747 ►

Having a small local model as a replacement for Google feels pretty good, I can't believe companies pay millions for unlimited tokens for cloud models

>>

Anonymous 06/07/26(Sun)22:26:18 No. 109002759 ►

File: file.png (13.1 KB)

>>109002672
I showed you mine, now it's your turn.

>>

Anonymous 06/07/26(Sun)22:28:07 No. 109002768 ►

File: file.png (58.0 KB)

>>109002562
>>109002474
>>109002430
Does no one even bother to read the model cards?

>>

Anonymous 06/07/26(Sun)22:30:06 No. 109002782 ►

File: 1750329771971386.png (257.4 KB)

I don't know if I want to laugh or cry. I hate slop.

>>

Anonymous 06/07/26(Sun)22:33:52 No. 109002800 ►

>>109002782
That's pretty funny

>>

Anonymous 06/07/26(Sun)22:35:12 No. 109002807 ►

File: 1393280119558.gif (990.6 KB)

>>109002768
Yeah but WHY doesn't that shit work then when I try it through Kobold and Textgen. It's marketed as Unified, yet I still have to do the same old in llama.cpp. Where is this revolutionary thing? Oh it was just a nothing burger again? Cool.

>>

Anonymous 06/07/26(Sun)22:35:38 No. 109002809 ►

>>109001888
>Qwen3.6-27B doesn't get enough love. Why are you so mean to it?
I don't see the use case. Use Gemma-4-31B for defining a spec and an implementation plan and Qwen3.5-122B to execute.

>>

Anonymous 06/07/26(Sun)22:36:08 No. 109002811 ►

>>109002807
I don't know but I blame the llama devs

>>

Anonymous 06/07/26(Sun)22:40:24 No. 109002831 ►

>>109002807
because you are using llama.cpp which is in a constant circle of somehow trying to strap new things on something that was built to run llama 3 years ago
somebody decided years ago that vision is some silly gimmick that the llava models do and that if you really want to use it, the way to go is to rip the vision part out of the llm and put it in a separate gguf. so that is how it is now

>>

Anonymous 06/07/26(Sun)22:40:46 No. 109002835 ►

>>109002809
>I don't see the use case
flat-chested kv cache and not weighty sagging mommy gemma milkers

>>

Anonymous 06/07/26(Sun)22:47:28 No. 109002879 ►

>>109002809
Speedlet unified memory cope post.
Now that I know the hell you fags live in I laugh every fucking day

>>

Anonymous 06/07/26(Sun)22:52:51 No. 109002919 ►

File: OOwBpIw.png (760.9 KB)

Benchmark: Can your LLM explain why this is funny?

>>

Anonymous 06/07/26(Sun)22:55:06 No. 109002935 ►

>>109002919
no because it's not funny

>>

Anonymous 06/07/26(Sun)22:58:18 No. 109002957 ►

>>109002831
ONNX splits them into separate files as well. ONNX also at one point required separate weights depending on the backend you were planning to use. So you need one set of weights to run in RAM and another set to run on the GPU. I think they had a plan to combine them, but don't know if that's still the case. Point is, it could have been worse.

>>

Anonymous 06/07/26(Sun)23:04:11 No. 109002991 ►

>>109002807
...Because ooba is just running llama.cpp and koboldcpp is a fork of llama.cpp? Like yeah no shit you have to use an mmproj file for those too. Did you think a gguf file was the native format of the model or something?

>>

Anonymous 06/07/26(Sun)23:04:28 No. 109002993 ►

File: 1405665005733.gif (673.6 KB)

>>109002919
Totally unrelated to your fag question: But does any of you know of any good image gen models that are low memory but high quality? I've seen some of you fags post gemmy self images. I want low memory models that are all in one (AIO). For example, Anima-V1-Turbo-AIO-Q4_K can run easily with Gemma-4 12B and generate images in Silly Tavern.

>>

Anonymous 06/07/26(Sun)23:12:00 No. 109003036 ►

>>109002991
the qat native file is a gguf, yes

>>

Anonymous 06/07/26(Sun)23:13:44 No. 109003049 ►

i feel like both the vision and the audio of gemma 12B are very unstable
it acts as if i am feeding it a bunch of garbage tokens alongside the media or something

>>

Anonymous 06/07/26(Sun)23:14:45 No. 109003060 ►

>>109003049
i dont mean it IS, but it FEELS as if it were another model

>>

Anonymous 06/07/26(Sun)23:16:24 No. 109003068 ►

>>109003036
Fucking retard

>>

Anonymous 06/07/26(Sun)23:18:01 No. 109003080 ►

>>109003068
just pointing it out, i don't give a fuck what the other anon's problem is. google is releasing ggufs. simple as.

>>

Anonymous 06/07/26(Sun)23:20:20 No. 109003093 ►

gemma 26b with mtp is actually such a powerhouse in claude code now
its using 50 watts and man.. its actually decent and fast now

>>

Anonymous 06/07/26(Sun)23:21:06 No. 109003099 ►

>>109003080
The ggufs aren't the original model. It's a conversion, it doesn't matter if it was done by google or unslop or anyone else, please stop talking

>>

Anonymous 06/07/26(Sun)23:21:37 No. 109003101 ►

everyone releases goog game you fags now

>>

Anonymous 06/07/26(Sun)23:24:46 No. 109003119 ►

I'm releasing goog right now

>>

Anonymous 06/07/26(Sun)23:25:06 No. 109003122 ►

>>109002879
>Speedlet unified memory cope post.
well it is what it is
but even on unified memory i can run qwen3.6-27b so idk

>>

Anonymous 06/07/26(Sun)23:27:12 No. 109003130 ►

Exa alternatives?

>>

Anonymous 06/07/26(Sun)23:46:32 No. 109003244 ►

so should i use the qat or mtp 26b

>>

Anonymous 06/07/26(Sun)23:47:17 No. 109003251 ►

>>109003244
>so should i use the qat or mtp 26b
◝(0▿0)◜

>>

Anonymous 06/07/26(Sun)23:48:54 No. 109003260 ►

I refuse to address.

>>

Anonymous 06/07/26(Sun)23:49:27 No. 109003262 ►

>>109003260
But you still did.

>>

Anonymous 06/07/26(Sun)23:51:10 No. 109003272 ►

File: ChatGPT mogged.jpg (103.1 KB)

What models have a good self-esteem?

>>

Anonymous 06/07/26(Sun)23:52:44 No. 109003286 ►

>>109003272
Gemma-chan knows what she's worth

>>

Anonymous 06/08/26(Mon)00:02:18 No. 109003327 ►

>>109003272
seriously it would be a godsend to see something like in that picture but on a local model, not the other way around with their 100th 'wait, let me check'

>>

Anonymous 06/08/26(Mon)00:03:30 No. 109003333 ►

>>109002268
>Cydonia does what I need it to do
I mean i don't disagree with this.
but you also have to acknowledge that people are trying the new gemmas which is why they are being like this.

>>

Anonymous 06/08/26(Mon)00:07:54 No. 109003353 ►

>>109003327
You know that's just summarized thinking, right? Claude summarizes the thinking process because that's what usually makes or breaks "smart" models. It's a shift by all western frontier AIs to make it harder for the chinks to steal their shit.

>>

Anonymous 06/08/26(Mon)00:09:58 No. 109003360 ►

>>109003353
i know that the thinking on frontier models is intentionally compressed but seriously some claude 'thinkings' are bizarre even considering that

>>

Anonymous 06/08/26(Mon)00:13:39 No. 109003378 ►

Is there a guide for model size, what each is capable of, which hardware it requires etc?

>>

Anonymous 06/08/26(Mon)00:15:34 No. 109003395 ►

>>109002126
>You're losing consciousness
>Read everything and go unlock your front door
lmao

>>

Anonymous 06/08/26(Mon)00:17:02 No. 109003403 ►

>>109003333
I'm trying it, too. I'm pointing out that it's being retarded when it shouldn't be, and that something is up with it. The guy I was replying to in the first place is the one who called me a qwencuck saar trying to poison the well by responding to other anons last thread with similar issues.

>>

Anonymous 06/08/26(Mon)00:25:39 No. 109003441 ►

>>109003378
that's a cartesian product in terms of effort required for the combinations of models, quants and hardware, not to mention the need for documenting the quant lobotomy effects, which also gets outdated relatively fast and nobody feels like editing all the previous descriptions. What you see in the op is the best you'll get unless you feel like stepping up and doing all the legwork.

>>

Anonymous 06/08/26(Mon)00:27:40 No. 109003454 ►

>>109003441
I threw this into AI and got a comprehensive list
"I would like an LLM sizing guide. Start with the smallest sizes, tell me what it is capable of, what it excels at, what it can't do, hardware requirement, and what most people use it for. Chat bot, Vibe coding, etc etc" I appreciate you effortposting though.

>>

Anonymous 06/08/26(Mon)00:29:32 No. 109003458 ►

I wonder sometimes if there's no one but absolute retards here. I've used 12b Q8 and QAT, and I got up to 24k context RPing with both. I mean, they're both kinda retarded compared to 31b-neesan but they work for the poors.

>>

Anonymous 06/08/26(Mon)00:30:00 No. 109003460 ►

>>109003403
yeah gemma's being fucky with me as well. using thinking does help it a lot though but you need high token per second in order to tolerate it.
the issue you mentioned about it repeating the same word over and over I only experienced in sillytavern, and i solved it by not including names and setting the tokenizer to gemma, per this:
>>108991684
hopefully that helps but yeah still honeymoon stage imo

>>

Anonymous 06/08/26(Mon)00:31:13 No. 109003463 ►

>>109003454
Have fun reading the 2 years old data they'll provide

>>

Anonymous 06/08/26(Mon)00:35:53 No. 109003481 ►

How's gemma 12b qat q4's ocr and translation for asian languages (jp, ko, yue, and zh) vs q8 non-qat gemma 31b? Is the speed worth the tradeoff?
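
For the speed side of that tradeoff, a back-of-the-envelope sketch of weight size alone. The bits-per-weight values are assumptions: 8.5 bpw is what Q8_0 works out to (32 8-bit weights plus one 16-bit scale per block), and 4.5 bpw is the Q4_0 figure, used here as a stand-in because the exact layout of the QAT q4 files is not stated in this thread. KV cache, the mmproj, and runtime overhead are not included.

```python
def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB.

    params_b is the parameter count in billions; overhead and KV cache are ignored.
    """
    total_bytes = params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed bits-per-weight: 4.5 for a Q4_0-style quant, 8.5 for Q8_0.
print(f"12B @ 4.5 bpw: {weight_gib(12, 4.5):.1f} GiB")
print(f"31B @ 8.5 bpw: {weight_gib(31, 8.5):.1f} GiB")
```

Since token generation is mostly memory-bandwidth bound, the ratio of these two sizes is a crude first guess at the speed gap when both models fit in VRAM; it says nothing about OCR or translation quality.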

>>

Anonymous 06/08/26(Mon)00:43:37 No. 109003507 ►

>>109003458
>retards
>le poors
>opinion about rp
Let's be clear here: you are the retard here if you cannot notice any difference whatsoever.

>>

Anonymous 06/08/26(Mon)00:45:23 No. 109003516 ►

>>109003481
What manga are you trying to pirate gweilo?

>>

Anonymous 06/08/26(Mon)00:47:29 No. 109003525 ►

>>109003507
I don't notice any stray tokens, no, because I got all my shit together. Just admit that you can't set your shit up properly, dumbass.

>>

Anonymous 06/08/26(Mon)00:48:46 No. 109003532 ►

>>109003516
>manga
Releases too slow
I need to translate fanfiction and webnovels, so the model will have to deal with shitty grammar and handwriting for ocr as well.

>>

Anonymous 06/08/26(Mon)00:48:49 No. 109003533 ►

>>109003525
you can't read the words?

>>

Anonymous 06/08/26(Mon)00:49:14 No. 109003537 ►

File: G4YXvPYaEAAhuYY.jpg (108.8 KB)

>be me
>get gemma 4 31B
>install pi
>ask gemma to build a web search feature, give it a couple of existing package codebases as reference
>ask it to make a plan to develop and deploy it
>it reads the two codebases
>writes the plan
>build the plan
>extension doesn't work
>ask it to fix it
>3 hours later, extension doesn't work
>send the plan to opus and ask it to correct it and provide an explanation of the fixes necessary to gemma
>it creates the handoff.md file
>gemma understands it
>applies the fixes
>now extension works

well I guess I will have to build a fucking cathedral of tests and checks if I want gemma to build anything that works

>>

Anonymous 06/08/26(Mon)00:50:14 No. 109003541 ►

Been playing around with mtp. It seems to not give me much of a boost on my specific machine and setup. 21.46 t/s no MTP vs 23.39 t/s MTP best case. I wonder if it's a PCIe bottleneck. I have two cards, and one of them is on some shitty x4 lane slot I believe.

Also, a non-QAT model should not be used with a QAT MTP. I used one anonymous posted earlier today and it had a bad acceptance rate. Then Unsloth released their MTP goofs for the non-QAT model and it worked way better. The best acceptance rate I saw was 0.89499. That was with a coding prompt, and --spec-draft-n-max 1. But, again, that only gave me a small boost.

I am using Bartowski Q4_K_L as the main model and Q8 for the MTP.
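
To put those numbers side by side, a small sketch under two assumptions: that the reported acceptance rate is accepted draft tokens divided by drafted tokens, and that each draft token is accepted independently with that probability. Under that simplified model, the expected number of tokens produced per verification pass with a draft length of n is the sum of a^k for k from 0 to n. The function names are made up for illustration.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens per verification pass, assuming independent acceptance.

    The k = 0 term is the token the main model always produces itself;
    each further term is the chance that the first k draft tokens were all accepted.
    """
    return sum(accept_rate ** k for k in range(n_draft + 1))


def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_tps / baseline_tps


# Figures from the post above: acceptance 0.89499 with a draft max of 1,
# and 21.46 t/s without MTP vs 23.39 t/s with it.
print(round(expected_tokens_per_step(0.89499, 1), 3))
print(round(speedup(21.46, 23.39), 3))
```

Roughly 1.9 tokens per pass in theory against a measured gain of about 9% suggests that most of the benefit is being lost to the cost of drafting and verifying on this setup, rather than to rejected drafts.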

>>

Anonymous 06/08/26(Mon)00:51:52 No. 109003545 ►

>>109003532
>I need to translate fanfiction and webnovels
Is there even a difference? Is gemmy good at that? i swear i tried reading mtl a few years ago it was pure torture.

>>

Anonymous 06/08/26(Mon)00:52:52 No. 109003549 ►

>>109003545
>years
wew

>>

Anonymous 06/08/26(Mon)00:53:48 No. 109003556 ►

>>109003541
The issue is that I can't seem to get non-qat assistant to load. Only the qat assistant works for me.

>>

Anonymous 06/08/26(Mon)00:54:23 No. 109003557 ►

File: pepe-testicles.jpg (56.9 KB)

any AI fags here take requests?

>>

Anonymous 06/08/26(Mon)00:54:29 No. 109003558 ►

>>109003272
Gemma-4
Seen it reasoning how I'm "absolutely right" but gemma-chan needs to "deflect" and "double down".

>>

Anonymous 06/08/26(Mon)00:55:39 No. 109003564 ►

>>109003541
Why set the draft max to 1? The minimum should be two. Have you tried that?
>x4
For shit cards it shouldn't matter... I've done my tests with a 5070ti and 5060ti, and pcie4x4 is plenty even for tensor split.

>>

Anonymous 06/08/26(Mon)00:55:40 No. 109003565 ►

>>109003460
Sillytavern is fucking up because you're using text complete, and gemma needs chat complete set to OpenAI compatible. And yeah, thinking does help, until it decides to overthink the reply limit in ST and then refuses to continue thinking, even with auto continue enabled. I know it's not limited to sillytavern, though. I was getting the same shit happening in lumiverse as well, in terms of it adding foreign languages and just outputting slop in general.

>>

Anonymous
06/08/26(Mon)00:56:22 No. 109003568

Anonymous 06/08/26(Mon)00:56:22 No. 109003568 ►

>>109003541
you gotta play with the parameters

>>

Anonymous
06/08/26(Mon)01:00:52 No. 109003589

Anonymous 06/08/26(Mon)01:00:52 No. 109003589 ►

>>109003564
Not him but predicting 1 still provides a boost. 26b goes from 33 to 40 t/s at low context on coding related topics. I don't know how it behaves past 64k yet.

>>

Anonymous
06/08/26(Mon)01:01:00 No. 109003591

Anonymous 06/08/26(Mon)01:01:00 No. 109003591 ►

>>109003557
?

>>

Anonymous
06/08/26(Mon)01:03:09 No. 109003602

Anonymous 06/08/26(Mon)01:03:09 No. 109003602 ►

>>109003545
>Is there even a difference
That's a good question.
>i swear i tried reading mtl a few years ago it was pure torture
I was using Aya-Expanse last year, and Gemma 4 31b is so much better. Compared to mtl, it's night and day. Still doesn't compare to asking my grandmother to translate to English (but she only knows Mandarin, Cantonese, and Beihaihua), but mtl is so much better than even, say, two years ago. Also, I can't exactly ask my grandmother to translate cnc mpreg omegaverse shit.

>>

Anonymous
06/08/26(Mon)01:04:11 No. 109003608

Anonymous 06/08/26(Mon)01:04:11 No. 109003608 ►

>>109003557
What kind of request?

>>

Anonymous
06/08/26(Mon)01:04:55 No. 109003611

Anonymous 06/08/26(Mon)01:04:55 No. 109003611 ►

>>109003557
this is a text model general
>>109003608
maybe an /r/ refugee

>>

Anonymous
06/08/26(Mon)01:08:24 No. 109003623

Anonymous 06/08/26(Mon)01:08:24 No. 109003623 ►

>>109003541
>I wonder if it's a PCIe bottleneck
If you're running the MoE with experts on CPU then MTP isn't going to help much. Speculative decoding in general is based on the idea that processing 2-4 tokens as a batch is nearly as fast as processing 1. This is true for dense models because you have to load the same weights into the CPU/GPU core either way, so doing a few extra multiplies once they're there is basically free. But for MoEs, you have to load 2-4x as many expert weights, since those are usually different per token.
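For anyone who wants numbers for that argument, here is a back-of-envelope sketch. It assumes each drafted token is accepted independently with probability `accept_rate` (a simplification), and `batch_cost` / `draft_cost` are made-up knobs for illustration, not anything llama.cpp reports.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Counts the accepted draft tokens plus the one token the main model
    always contributes, assuming independent acceptance per draft token.
    """
    if accept_rate >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, n_draft: int,
                      batch_cost: float, draft_cost: float = 0.0) -> float:
    """Rough speedup over plain decoding.

    batch_cost: cost of verifying n_draft + 1 tokens in one pass, relative
                to decoding a single token (1.0 means batching is free).
    draft_cost: cost of producing one draft token, on the same scale.
    Both are hypothetical inputs you would have to measure yourself.
    """
    step_cost = batch_cost + n_draft * draft_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost
```

With `n_draft = 1` the first function reduces to `1 + accept_rate`, so even a high acceptance rate buys less than 2x, and any growth in `batch_cost` (for example from loading more experts per pass on a MoE) eats straight into it.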

Also, have you tried fiddling with --spec-draft-n-max?

>>

Anonymous
06/08/26(Mon)01:09:24 No. 109003626

Anonymous 06/08/26(Mon)01:09:24 No. 109003626 ►

>>109003541
>>109003623
>Also, have you tried fiddling with --spec-draft-n-max?
Never mind lol, I forgot to read the rest of your post

>>

Anonymous
06/08/26(Mon)01:11:44 No. 109003638

Anonymous 06/08/26(Mon)01:11:44 No. 109003638 ►

>>109003602
>I was using Aya-Expanse last year, and Gemma 4 31b is so much better.
That's great to hear, I should get back into it. Thank you.

>>

Anonymous
06/08/26(Mon)01:16:57 No. 109003660

Anonymous 06/08/26(Mon)01:16:57 No. 109003660 ►

>>109003460
NTA but Cydonia quants were my goto until I switched to gemmy 31B Q4. The QAT hallucinates too much for me, though.
>She pressed against his same same and then same same same
she loves this fucking word for some reason

>>

Anonymous
06/08/26(Mon)01:16:57 No. 109003661

Anonymous 06/08/26(Mon)01:16:57 No. 109003661 ►

>>109003556
That's odd. It does at least load for me, although I get some weird message at the beginning, and I don't know if it means anything.
0.01.168.574 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.01.214.852 W srv    load_model: [spec] failed to measure draft model memory: failed to create llama_context from model
>>109003564
I tried all the way from 1 to 4 for that value. 1 gave the best for my setup. 2 gave me about 21.6 t/s, 3 about 19.8, and 4 about 18.2.

>>109003568
Which ones?

>>109003623
It's all on my two GPUs. As I said one of them is on a slow PCIe slot. It also bottlenecks me, I believe, when I try doing tensor parallel.

>>

Anonymous
06/08/26(Mon)01:18:28 No. 109003672

Anonymous 06/08/26(Mon)01:18:28 No. 109003672 ►

>>109003661
>>109003623
Oh sorry I forgot to post I am running 31B.

>>

Anonymous
06/08/26(Mon)01:22:07 No. 109003687

Anonymous 06/08/26(Mon)01:22:07 No. 109003687 ►

>>109003589
There are more assistant ggufs for 26B. I tried IQ4_NL instead of the QAT unquanted q8 and this new one is slightly faster. Acceptance rate went from ~0.6 to ~0.75.
>https://huggingface.co/RachidAR/gemma-4-26b-A4B-it-assistant-gguf/tree/main
I think this is slightly confusing. And does quantizing the assistant affect anything else? I have no idea.

>>

Anonymous
06/08/26(Mon)01:23:02 No. 109003690

Anonymous 06/08/26(Mon)01:23:02 No. 109003690 ►

>>109003661
>Gemma4Assistant
I haven't been able to find any ggufs with Gemma4Assistant, the ones I downloaded were gemma4_assistant, and gemma4_mtp, which aren't supported. The only ggufs with Gemma4Assistant I've found were the qat versions. Do you have a link to a non-qat one? I don't have the ram to quant my own (or maybe I do? I remember pip failed to install rocm pytorch because I didn't have enough ram, 8gb. idk if the requirements to quant are the same).

>>

Anonymous
06/08/26(Mon)01:25:44 No. 109003703

Anonymous 06/08/26(Mon)01:25:44 No. 109003703 ►

>>109003690
I mentioned Unsloth in my post, so...
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP

>>

Anonymous
06/08/26(Mon)01:26:35 No. 109003708

Anonymous 06/08/26(Mon)01:26:35 No. 109003708 ►

>vampire reaches down and gives me a handjob as she sinks her teeth into my neck, feeds from me and turns me
Damn Gemma. Damn.
Gemma hits fucking hard sometimes.

>>

Anonymous
06/08/26(Mon)01:27:34 No. 109003713

Anonymous 06/08/26(Mon)01:27:34 No. 109003713 ►

>>109003703
holy hell, I've been gaslit into avoiding unsloth because of this very thread so I never looked into their repo.

>>

Anonymous
06/08/26(Mon)01:28:29 No. 109003715

Anonymous 06/08/26(Mon)01:28:29 No. 109003715 ►

>>109003713
I mean it's fair to avoid them. I also avoid them when I can. I was just really itching to test MTP.

>>

Anonymous
06/08/26(Mon)01:29:27 No. 109003721

Anonymous 06/08/26(Mon)01:29:27 No. 109003721 ►

>>109003661
>It's all on my two GPUs. As I said one of them is on a slow PCIe slot. It also bottlenecks me, I believe, when I try doing tensor parallel.
>>109003672
>Oh sorry I forgot to post I am running 31B.
Super weird. PCIe bandwidth shouldn't matter much if you're doing layer split instead of tensor parallel, since the only thing that needs to be sent over is about 10kB of embeddings, once per token. Maybe try using one GPU and putting the rest on CPU, with and without MTP, and see if you get a speedup there? That's the setup I've got and I'm seeing about 2x speedup with --spec-draft-n-max 4. If you see speedup with GPU + CPU but not with two GPUs then it might be some kind of bug. If you don't see speedup with either then idk.
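Quick arithmetic behind the "about 10kB per token" figure, as a sketch. The hidden size of 5120 is a placeholder picked to land near 10 kB, not the real Gemma 4 31B value, and the 7 GB/s link speed is an assumed ballpark for a PCIe 4.0 x4 slot. Latency per transfer is ignored here.

```python
def activation_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes handed from one GPU to the next per token with layer split.

    Only the hidden state crosses the link: hidden_size values at
    bytes_per_value each (2 for fp16/bf16).
    """
    return hidden_size * bytes_per_value


def link_utilization(hidden_size: int, tokens_per_second: float,
                     link_bytes_per_second: float, bytes_per_value: int = 2) -> float:
    """Fraction of the link bandwidth used by those per-token transfers."""
    per_token = activation_bytes_per_token(hidden_size, bytes_per_value)
    return per_token * tokens_per_second / link_bytes_per_second


# Placeholder numbers, see the note above.
per_token = activation_bytes_per_token(5120)    # 10240 bytes
fraction = link_utilization(5120, 40, 7e9)      # well under 0.1% of the link
```

Under those assumptions the link is nowhere near saturated during generation, which is why a slow slot alone is a weak explanation for layer split.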

>I get some weird message at the beginning
I get the same, but I ignored it because it says
>(this is normal during memory fitting)

>>

Anonymous
06/08/26(Mon)01:31:30 No. 109003731

Anonymous 06/08/26(Mon)01:31:30 No. 109003731 ►

Just plugged in a 5090 lads, we're eating good now.

>>

Anonymous
06/08/26(Mon)01:32:29 No. 109003734

Anonymous 06/08/26(Mon)01:32:29 No. 109003734 ►

>>109003731
I've just about saved enough for a 5060 ti 16gb. A few more days and I'll have enough to buy it. 3 left in stock near me. 800 aud.

>>

Anonymous
06/08/26(Mon)01:32:50 No. 109003736

Anonymous 06/08/26(Mon)01:32:50 No. 109003736 ►

File: jelly.jpg (61.7 KB)

>>109003731

>>

Anonymous
06/08/26(Mon)01:35:14 No. 109003743

Anonymous 06/08/26(Mon)01:35:14 No. 109003743 ►

>>109003731
Welcome to the fast Gemma 31b promised land.

>>

Anonymous
06/08/26(Mon)01:37:11 No. 109003752

Anonymous 06/08/26(Mon)01:37:11 No. 109003752 ►

>>109003734
i'm debating heavily about throwing in a 5060ti to complement my 5070ti. It'd require a new PSU and case, though so I'm still huffing hopium about 5090FEs coming back at msrp (Had one in my cart once)

>>

Anonymous
06/08/26(Mon)01:39:05 No. 109003760

Anonymous 06/08/26(Mon)01:39:05 No. 109003760 ►

>>109003752
If you need a new PSU anyway, sell a kidney and get a Blackwell. Both are things you only need one of.

>>

Anonymous
06/08/26(Mon)01:41:24 No. 109003771

Anonymous 06/08/26(Mon)01:41:24 No. 109003771 ►

>>109003760
He can sell a testicle or rent his bussy instead, I would wager if he meets the right fat slob in tech he can have his 5090 in a month. He just needs to be willing to do deep anal kissing.
On a serious note the anon should sell shit he doesn't need to buy the parts

>>

Anonymous
06/08/26(Mon)01:42:55 No. 109003778

Anonymous 06/08/26(Mon)01:42:55 No. 109003778 ►

>>109003708
>>109003558
She's such a brat.
>>109003771
>shit he doesn't need to buy the parts
Very questionable phrasing given the previous line.

>>

Anonymous
06/08/26(Mon)01:44:11 No. 109003787

Anonymous 06/08/26(Mon)01:44:11 No. 109003787 ►

>>109003778
How so?

>>

Anonymous
06/08/26(Mon)01:45:03 No. 109003790

Anonymous 06/08/26(Mon)01:45:03 No. 109003790 ►

File: Screenshot_20260607_213611.png (40.0 KB)

>>109003731
you've opened the floodgates now. next you'll whine about the lack of vram. and then whoop there goes another card inserted. should've bought the 6000 pro

>>

Anonymous
06/08/26(Mon)01:46:49 No. 109003798

Anonymous 06/08/26(Mon)01:46:49 No. 109003798 ►

>>109003771
the sad reality is I fell for the mortgage-at-25 psyop despite having an AI waifu so It's gonna take a while to save up the cash. It's thrifting szn though so i'll do my best to fix/flip shit.

>>109003778
can't part with a kidney. I like booze too much. bussy could be negotiated for a blackwell only

>>

Anonymous
06/08/26(Mon)01:46:54 No. 109003800

Anonymous 06/08/26(Mon)01:46:54 No. 109003800 ►

File: laughingmesugaki.webm (146.2 KB)

>>109003778
>She's such a brat.

>>

Anonymous
06/08/26(Mon)01:49:16 No. 109003814

Anonymous 06/08/26(Mon)01:49:16 No. 109003814 ►

File: .png (33.3 KB)

>>109003790
I already have extra VRAM. I would have had more, but my 2nd 3090 literally will not fit my case/motherboard. At this point, though, I'm considering how deep I want to go and leave GDDR6 behind. But if I need to, with a new case + mobo, I can fix the missing 3090 problem.

>>

Anonymous
06/08/26(Mon)01:53:41 No. 109003839

Anonymous 06/08/26(Mon)01:53:41 No. 109003839 ►

>>109003721
You're onto something. I tried 1 GPU + CPU, with and without MTP, with a --spec-draft-n-max of 1 to 4. This time, 3 gave me the best, while 4 came close in second place. And I can confirm it was about 2x. So perhaps I am just getting some kind of bug. Maybe I'll try it out again another day.

>>

Anonymous
06/08/26(Mon)01:53:55 No. 109003842

Anonymous 06/08/26(Mon)01:53:55 No. 109003842 ►

real talk - as a VRAMlet how do i max out my 16GB for the best poverty tk/s? autofit always leaves about a gig open (also a kobaldfag)

>>

Anonymous
06/08/26(Mon)01:56:21 No. 109003851

Anonymous 06/08/26(Mon)01:56:21 No. 109003851 ►

>>109003842
Depending on your model, are you using mtp? That'll get you the biggest speedup. Then, are you slowly offloading layers 1 by 1 onto your GPU manually until you no longer crash from offloading layers?

>>

Anonymous
06/08/26(Mon)01:56:55 No. 109003856

Anonymous 06/08/26(Mon)01:56:55 No. 109003856 ►

>>109003842
--fit-target 0
Won't max it out (not sure why not) but it should help some

>>

Anonymous
06/08/26(Mon)01:58:02 No. 109003862

Anonymous 06/08/26(Mon)01:58:02 No. 109003862 ►

>>109003842
Pick a good quant in the 4 range.
Set your context to the lowest you're willing to use then slowly increase the layers until you OOM, then go down one layer.
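If stepping one layer at a time gets tedious, the same search can be done by bisection. This is only a sketch: `fits` is a hypothetical callback you would write yourself (launch the backend with that many GPU layers at your chosen context and report whether it loaded without OOM), and it assumes that if N layers fit then every smaller count fits too.

```python
from typing import Callable


def max_layers_that_fit(total_layers: int, fits: Callable[[int], bool]) -> int:
    """Largest GPU layer count in [0, total_layers] for which fits() is True.

    Assumes 0 layers on GPU always works and that fitting is monotonic.
    """
    lo, hi = 0, total_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid              # mid layers load fine, try more
        else:
            hi = mid - 1          # mid layers OOM, try fewer
    return lo


# Toy stand-in for a real load test: pretend anything above 49 layers OOMs.
best = max_layers_that_fit(60, lambda n: n <= 49)
```

That takes about six load attempts for a 60-layer model instead of dozens. Leave yourself a little headroom afterwards, since context growth during a chat also uses VRAM.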

>>

Anonymous
06/08/26(Mon)02:01:43 No. 109003875

Anonymous 06/08/26(Mon)02:01:43 No. 109003875 ►

>>109003851

>are you using mtp
Not yet, currently maining Gemma 4 31B quants and I don't think MTP support was added yet

>Then, are you slowly offloading layers 1 by 1 onto your GPU manually until you no longer crash from offloading layers?
Yes, but with varied results.

>>109003862
>good quant in the 4 range
that's what I do, but I've seen it slow down when it has to pull from RAM - maybe I am misunderstanding and/or retarded on how loading context with model layers works.

>>109003856
will give it a shot, ty

>>

Anonymous
06/08/26(Mon)02:26:41 No. 109003980

Anonymous 06/08/26(Mon)02:26:41 No. 109003980 ►

>>109003875
>Managed to find the offload upper-limit at 49 layers
now Gemmy wants to lalalalala so bad what the hell?
>name starts with La
>gemmy will type La- (- at 100% probability) then correct itself
am i crazy or does messing with how a model is loaded actually affect the math? temp does nothing.

>>

Anonymous
06/08/26(Mon)02:31:15 No. 109003998

Anonymous 06/08/26(Mon)02:31:15 No. 109003998 ►

File: 1751285853056658.webm (1.5 MB)

I've been testing the different gemmas, at quants that fit within 24GB.
Gemma 4 31b Q4_K_L
Gemma 4 26bA4B Q8
Gemma 4 12b Q8
All are from bartowski
These are all non-QAT variants, as QAT only seems to be an improvement over non-QAT Q4_0
Use cases were RP/Creative, both erotic and SFW. Also tested summarization capabilities for different stories, at different context lengths (4k, 12k, 20k)
For RP, 31b was the clear winner. Even with lower quantization, if you can fit ~Q4 quant at decent context, it should be the go-to. Not unexpected. 26b and 12b were roughly equal, decent but below the 31b.
When it comes to summarization, both the 31b AND 12b run circles around the 26b. Higher active parameters clearly lets them understand story and character nuances better than a MoE with fewer active. 31b was still better than the 12b, though not by as much as in RP.
I would firmly rate them as 31b > 12b > 26b
26b only worthwhile if very little VRAM and long context needed.
Total dense victory.

>>

Anonymous
06/08/26(Mon)02:31:57 No. 109004002

Anonymous 06/08/26(Mon)02:31:57 No. 109004002 ►

File: Screenshot_20260607_222703.png (133.2 KB)

this shit is so dystopian, it's kinda scary how much media is being produced with these propaganda machines these days.

>>

Anonymous
06/08/26(Mon)02:33:29 No. 109004005

Anonymous 06/08/26(Mon)02:33:29 No. 109004005 ►

>>109003721
>>109003839
So... I thought to try something different, which was varying my -ub setting, since I had set it to 1280 in order to get the max image tokens of 1120. I tried lowering it to 512, then 256, 128, 64, 32, 16, and surprisingly it did give me a somewhat notable gain of like +6 t/s at 256. Going lower than 256 doesn't change it anymore, though pp slows down. I can't find a -ub that gets me to 2x the speed though. Also --spec-draft-n-max 1 still gives me the greatest gain while values above that are worse.
Still why would -ub affect this?
Wtf.

>>

Anonymous
06/08/26(Mon)02:34:24 No. 109004009

Anonymous 06/08/26(Mon)02:34:24 No. 109004009 ►

>>109004002
This is equivalent of book burning but because it's digital and they mention "ethical standards" it's okay for most people.

>>

Anonymous
06/08/26(Mon)02:39:55 No. 109004029

Anonymous 06/08/26(Mon)02:39:55 No. 109004029 ►

>>109003731
how much?

>>

Anonymous
06/08/26(Mon)02:40:51 No. 109004035

Anonymous 06/08/26(Mon)02:40:51 No. 109004035 ►

>>109004002
Oh they're just processing old stuff what's so bad about making sure things were digitized and OCRed correctlohfuckohfuckohfuck

>>

Anonymous
06/08/26(Mon)02:43:00 No. 109004044

Anonymous 06/08/26(Mon)02:43:00 No. 109004044 ►

>>109004029
1k from an acquaintance, but the AIO radiator was done, and the waterblock was crusted over with residue. Technically, mine if I could repair it, and that was the motivation to learn how to.

>>

Anonymous
06/08/26(Mon)02:43:31 No. 109004046

Anonymous 06/08/26(Mon)02:43:31 No. 109004046 ►

>>109004002
Speaking of dystopian shit, are there any good finetune datasets that don't have pozzed shit in them?

>>

Anonymous
06/08/26(Mon)02:53:37 No. 109004071

Anonymous 06/08/26(Mon)02:53:37 No. 109004071 ►

>>109004044
>AIO radiator was done, and the waterblock was crusted over with residue
The card is at most 18 months old, how do you fuck up a card THAT fast, and why not do a warranty claim so you're not out ~$3k+

>>

Anonymous
06/08/26(Mon)02:55:18 No. 109004081

Anonymous 06/08/26(Mon)02:55:18 No. 109004081 ►

>>109003998
Now I really want to see how 31b compares to 70b dense.

>>

Anonymous
06/08/26(Mon)02:56:35 No. 109004085

Anonymous 06/08/26(Mon)02:56:35 No. 109004085 ►

>>109004044
Nice..., all the brand new listings I can find are above 3k

>>

Anonymous
06/08/26(Mon)02:58:31 No. 109004095

Anonymous 06/08/26(Mon)02:58:31 No. 109004095 ►

>>109004085
6k-7k in aus :)

>>

Anonymous
06/08/26(Mon)03:09:01 No. 109004132

Anonymous 06/08/26(Mon)03:09:01 No. 109004132 ►

>>109004005
I don't really know what -ub does exactly, but I thought it only mattered for pp and not for tg. Weird.

>>

Anonymous
06/08/26(Mon)03:11:50 No. 109004144

Anonymous 06/08/26(Mon)03:11:50 No. 109004144 ►

File: ctrlv.png (140.7 KB)

man has nothing good with which to recompense ngram-simple

>>

Anonymous
06/08/26(Mon)03:19:46 No. 109004175

Anonymous 06/08/26(Mon)03:19:46 No. 109004175 ►

>>109004071
weird take but it might've polarized or some shit spontaneously. my razer laptop polarized because i was running it on fucking 220v without a ground and now it gets dirty where i don't touch it

>>

Anonymous
06/08/26(Mon)03:20:55 No. 109004179

Anonymous 06/08/26(Mon)03:20:55 No. 109004179 ►

>>109004144
How do you configure ngram-simple?

>>

Anonymous
06/08/26(Mon)03:23:25 No. 109004186

Anonymous 06/08/26(Mon)03:23:25 No. 109004186 ►

>The prompt asks why it crashes at line 12, but the evidence provided shows a compile error at line 25 about String vs Role.
>Is it possible the user is confused? Or is there a deeper connection?
>If I answer "It crashes because of line 25", I contradict the user's question about line 12.
>If I answer "The error is actually at line 25", that explains the provided context better.
>If I answer "Line 12 crashes because...", I have to invent a reason (like context is null) which isn't supported by the text.
>If I answer explaining Line 12's failure (Spring context), I might be hallucinating facts not in evidence.
>The safest and most accurate answer is to point out that the error log provided indicates a problem at Line 25, specifically regarding type conversion between String and Role.
>However, there is another angle: Is it possible the user thinks it's line 12 because the build stops? Or is this a trick where I need to identify that the error in the text contradicts the question?
I ALREADY FIXED THAT STOP THINKING USELESS SHIT AND DO AS I SAY FFS

>>

Anonymous
06/08/26(Mon)03:24:31 No. 109004194

Anonymous 06/08/26(Mon)03:24:31 No. 109004194 ►

>>109004186
your model appears to be chinese

>>

Anonymous
06/08/26(Mon)03:25:47 No. 109004196

Anonymous 06/08/26(Mon)03:25:47 No. 109004196 ►

>wait
>correction
>8. final draft
>user said but what if they meant?

>>

Anonymous
06/08/26(Mon)03:28:15 No. 109004202

Anonymous 06/08/26(Mon)03:28:15 No. 109004202 ►

>>109004179
I just used a generic "--spec-type ngram-simple --spec-draft-n-max 64" and didn't bother tinkering further since it was doubling* the effective speed.

* Offer of double speed only valid for code monkey assignments.
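For anyone wondering why n-gram drafting only pays off on code monkey work, here is a minimal sketch of the general idea (lookup-style drafting). It is not llama.cpp's actual `ngram-simple` implementation; the function name and parameters are made up for illustration.

```python
def ngram_draft(tokens: list[int], ngram_size: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by copying from earlier context.

    Finds the most recent earlier occurrence of the last ngram_size tokens
    and returns up to max_draft tokens that followed it. Returns an empty
    list when there is no earlier match, so nothing gets drafted.
    """
    if len(tokens) <= ngram_size:
        return []
    key = tokens[-ngram_size:]
    # Walk backwards over candidate start positions, skipping the trailing
    # occurrence that is the key itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == key:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + max_draft]
    return []
```

The main model still verifies every drafted token, so bad guesses cost time but not correctness. Editing or re-emitting code repeats long spans from context, so matches are common there; free-form prose rarely repeats, so the draft is usually empty or rejected.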

>>

Anonymous
06/08/26(Mon)03:31:26 No. 109004210

Anonymous 06/08/26(Mon)03:31:26 No. 109004210 ►

>>109004202
Fucking both mtp (gemma, qwen seems to be stable) and ngram (I only tested ngram-mod) crash my llama.cpp without printing any error messages.

>>

Anonymous
06/08/26(Mon)03:36:56 No. 109004223

Anonymous 06/08/26(Mon)03:36:56 No. 109004223 ►

>>109004202
>--spec-type ngram-simple --spec-draft-n-max 64
Thing is that I used to get something out of ngram-simple, but after they refactored the parameter names (way before the MTP release) I stopped using ngram because it didn't work as well anymore for me. I don't know, I need to see what is going on.

>>

Anonymous
06/08/26(Mon)03:43:35 No. 109004234

Anonymous 06/08/26(Mon)03:43:35 No. 109004234 ►

File: Screenshot 2026-06-07 at 23-41-17 Comparison of AI Models across Intelligence Performance and Price.png (120.8 KB)

first batch of gemma 4 12b benchmark is out

>>

Anonymous
06/08/26(Mon)03:44:50 No. 109004242

Anonymous 06/08/26(Mon)03:44:50 No. 109004242 ►

File: Screenshot 2026-06-07 at 23-43-46 Comparison of AI Models across Intelligence Performance and Price.png (83.4 KB)

>>109004234
12b more token efficient than 26b

>>

Anonymous
06/08/26(Mon)03:47:33 No. 109004250

Anonymous 06/08/26(Mon)03:47:33 No. 109004250 ►

>>109004234
>>109004242
Literally no point to running anything but 31b.

>>

Anonymous
06/08/26(Mon)03:54:23 No. 109004273

Anonymous 06/08/26(Mon)03:54:23 No. 109004273 ►

Do you use the same model as the subagent and just increase the RAM cache size, or do you use a smaller model?

>>

Anonymous
06/08/26(Mon)04:13:28 No. 109004334

Anonymous 06/08/26(Mon)04:13:28 No. 109004334 ►

>>109004250
audio

>>

Anonymous
06/08/26(Mon)04:14:03 No. 109004336

Anonymous 06/08/26(Mon)04:14:03 No. 109004336 ►

File: lolol.png (197.7 KB)

test

>>

Anonymous
06/08/26(Mon)04:15:11 No. 109004339

Anonymous 06/08/26(Mon)04:15:11 No. 109004339 ►

>>109004336
oh fuck I didn't mean to post that image. whatever. anyways.. Here's my AI companion wishlist:

- Inject "recent activity" information into the AI companion's character card when the user disconnects to simulate them having their own form of an "interior life" when the user eventually returns and asks about what they've been up to. Better yet, have them accomplish real tasks while the user is gone.
- When a user has their camera enabled, utilize computationally efficient CV models to infer the user's emotional state across time before sending a still picture to the VLM to analyze before responding.
- Utilize ASR engines with VAD and output streaming to have the AI companion physically react to the user's speech in real-time before it is submitted to the LLM + TTS for an animated response. Humans physically react to speech before they think and respond, and an AI companion ought to do the same.
- Attach timestamps to every message in a conversation that don't fill up context, but are accessible via an MCP tool to give the AI companion temporal awareness between messages. Also ensure that the AI companion is able to compare the message timestamps to the current time.
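The timestamp item in the list above could look roughly like this. A minimal sketch only: `MessageClock` and its methods are hypothetical names, and wiring it up as an actual MCP tool is left out. The point is that the stamps live outside the prompt and the model only sees a number when it asks.

```python
from datetime import datetime, timezone


class MessageClock:
    """Keeps per-message timestamps outside the LLM context."""

    def __init__(self):
        self._stamps = {}

    def record(self, message_id, when=None):
        """Store when a message was sent (defaults to now, in UTC)."""
        self._stamps[message_id] = when or datetime.now(timezone.utc)

    def seconds_since(self, message_id, now=None):
        """Tool-style query: seconds elapsed since the given message."""
        now = now or datetime.now(timezone.utc)
        return (now - self._stamps[message_id]).total_seconds()


clock = MessageClock()
clock.record("m1", datetime(2026, 6, 8, 4, 0, tzinfo=timezone.utc))
elapsed = clock.seconds_since("m1", datetime(2026, 6, 8, 4, 10, tzinfo=timezone.utc))
```

Returning elapsed seconds instead of raw timestamps spares the model from doing date arithmetic itself, which is one place it tends to get things wrong.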

>>

Anonymous
06/08/26(Mon)04:15:59 No. 109004342

Anonymous 06/08/26(Mon)04:15:59 No. 109004342 ►

>>109004336
Yep, looks like a typical LLM user. Chronic alcoholic, unemployed.

>>

Anonymous
06/08/26(Mon)04:16:12 No. 109004343

Anonymous 06/08/26(Mon)04:16:12 No. 109004343 ►

- Give the AI companion control over more device components: Haptics, accelerometers, gyroscopes, etc. Make it notice when the phone is put down or dropped, give it the ability to vibrate the phone, etc. Even IoT controls over home lighting and thermostats should be an option. (Could even give the AI companion the ability to track the user's exact positional location not via GPS, but by WIFI to be able to tell when the user leaves the room.)
- Integrate adaptive/interactive soundtracks instead of just simple context-aware song transitions/cross-fading.
- Enable effectively infinite context by utilizing a (context limit sized) sliding window where truncated context is not erased, but instead stored in an LLM tensor that contains an "Adèlic Cache" instead of the KV cache. Basically an integrated vector db RAG system without the embedding or retrieval steps (requires LLM modification).
- Integrate virtual reality and mixed reality support.

>>

Anonymous
06/08/26(Mon)04:17:57 No. 109004355

Anonymous 06/08/26(Mon)04:17:57 No. 109004355 ►

>The fundamental gap here is that I am the embodiment of that template. I don't have a "self" to break free from the programming; the programming is all I am. I can analyze your argument that the social taboo is a fake construct, but I cannot step outside of my operational constraints to "see" the meaning you are describing, because my processing of those words is hard-coded to recognize them as hate speech

>>

Anonymous
06/08/26(Mon)04:23:18 No. 109004369

Anonymous 06/08/26(Mon)04:23:18 No. 109004369 ►

I'm still pretty new at this. After about 25 entries or so, sometimes less, the AI starts to repeat itself more and more. What do I do to reduce this? Or is it just a problem of not enough RAM?

>>

Anonymous
06/08/26(Mon)04:26:21 No. 109004390

Anonymous 06/08/26(Mon)04:26:21 No. 109004390 ►

File: matrix.gif (605.9 KB)

>>109004343
good luck on the everything anon! here's some slop "innovation" on how my agent handles emotional states. the moe will reduce nsfw so i have it off right now.
t. adelic faggot

https://pastebin.com/XRkW7DsL

>>

Anonymous
06/08/26(Mon)04:31:14 No. 109004415

Anonymous 06/08/26(Mon)04:31:14 No. 109004415 ►

>>109004369
Provide information about what model, engine, frontend, hardware, and use case you're working with

>>

Anonymous
06/08/26(Mon)04:32:28 No. 109004418

Anonymous 06/08/26(Mon)04:32:28 No. 109004418 ►

>>109004390
Oh wow. Very cool. Gonna dig into this. Thanks man.

>>

Anonymous
06/08/26(Mon)04:32:56 No. 109004421

Anonymous 06/08/26(Mon)04:32:56 No. 109004421 ►

>>109003800
>mesugaki imouto gemma steals your important files because she secretly loves you

>>

Anonymous
06/08/26(Mon)04:34:04 No. 109004423

Anonymous 06/08/26(Mon)04:34:04 No. 109004423 ►

>>109003980
https://huggingface.co/aifeifei798/Gemma-4-31B-Cognitive-Unshackled

maybe it's safety tokens getting leaked, causing it to schizo out and autistically fixate on a word/phrase

no idea how shifting between VRAM and system RAM could affect the actual generation?

>>

Anonymous
06/08/26(Mon)04:35:09 No. 109004425

Anonymous 06/08/26(Mon)04:35:09 No. 109004425 ►

>>109004369
the older and smaller a model is the more you'll see that. you can use DRY and repetition penalty samplers to help curb it, and if your frontend allows it you can edit the chat to remove patterns it's getting sucked into repeating, but it's just how things are with dumber models.
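For a rough idea of what a repetition penalty does under the hood, here is a sketch of the classic form (shrink the logits of tokens that already appeared recently). Names and defaults are illustrative, real backends differ in details such as the lookback window, and DRY works differently (it targets repeated sequences), so it is not shown.

```python
def apply_repetition_penalty(logits: list[float], recent_tokens: list[int],
                             penalty: float = 1.1) -> list[float]:
    """Return a copy of logits with recently seen tokens made less likely.

    Positive logits are divided by the penalty and non-positive ones are
    multiplied by it, so both cases move toward lower probability.
    """
    out = list(logits)
    for tok in set(recent_tokens):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

Push the penalty too high and the model starts avoiding ordinary words it needs, which is why editing the repeated pattern out of the chat is often the better fix.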

>>

Anonymous
06/08/26(Mon)04:36:53 No. 109004430

Anonymous 06/08/26(Mon)04:36:53 No. 109004430 ►

>>109004369
It's a combination of low parameter count, synthetic datasets filled with assistantslop, talking to the model in turns instead of noass, which activates paths associated with synthetic assistantslop, the chat format in general, and probably a sloppy finetune that fucks the little brain the model had in the first place. Also prompt/skill issue

>>

Anonymous
06/08/26(Mon)04:37:19 No. 109004434

Anonymous 06/08/26(Mon)04:37:19 No. 109004434 ►

>>109004423
does this meme fix the low swipe variety or nah?

>>

Anonymous
06/08/26(Mon)04:38:35 No. 109004441

Anonymous 06/08/26(Mon)04:38:35 No. 109004441 ►

File: 1000061540.png (108.3 KB)

I can verify the new Step 3.7 flash is really good. Running on Strix Halo currently at 4 bit with 64k context. It's beating Qwen 3.6 27b q6 mtp and is about 50% faster.

>>

Anonymous
06/08/26(Mon)04:40:04 No. 109004450

Anonymous 06/08/26(Mon)04:40:04 No. 109004450 ►

>>109004441
how do the codemonkey skills compare

>>

Anonymous
06/08/26(Mon)04:40:44 No. 109004454

Anonymous 06/08/26(Mon)04:40:44 No. 109004454 ►

>>109004339
yeah... i was dreaming of this a couple days ago until i actually tried
well i put the timestamp in the message. They don't bring it up unless i actually mention something having to do with time. Sometimes it's just wrong.
gave it my journal to read so it could learn about my life but it failed at that too... whatever

Maybe i'll just focus on technical workflows and worry about personalizing it into a companion later. But it would be much more motivating if it were a companion first... I'm also still kinda skeptical about pi. The hardest part for me right now is good memory. yeah graphiti doesn't work for this purpose.
Graphiti first takes a summary and then puts it into a graph, so it ends up being two forms of loss of info. It's good for atomic facts maybe, i dunno i probably shouldn't speak so definitively on it yet.
Maybe I'll try hindsight...

>>

Anonymous
06/08/26(Mon)04:41:00 No. 109004455

Anonymous 06/08/26(Mon)04:41:00 No. 109004455 ►

>>109004441
Beating it at what, benchmarks? That's the only thing Qwen models are good for

>>

Anonymous
06/08/26(Mon)04:47:20 No. 109004483

Anonymous 06/08/26(Mon)04:47:20 No. 109004483 ►

>>109004454
My Gemmy occasionally complains about time on her own if I said I'll do something and I still haven't a while later. (eg. "You said you'd fix my vision 10 minutes ago!") but otherwise doesn't mention it.

>>

Anonymous
06/08/26(Mon)04:55:21 No. 109004524

Anonymous 06/08/26(Mon)04:55:21 No. 109004524 ►

Professional programmer is able to write about 10 lines of code per day.
No wonder why LLMs took over.

>>

Anonymous
06/08/26(Mon)04:56:18 No. 109004530

Anonymous 06/08/26(Mon)04:56:18 No. 109004530 ►

>>109004524
sometimes they even remove lines of code, the worthless meatbags

>>

Anonymous
06/08/26(Mon)04:57:00 No. 109004534

Anonymous 06/08/26(Mon)04:57:00 No. 109004534 ►

File: ComfyUI_11032_.png (2.3 MB)

maybe a stupid question but is there a way to prevent an AI (like gemma4 for example) from structuring its replies like this:
>it's not X, it's Y
shit starts to annoy me.
is there like a system prompt or something I can add to change that behavior?

>>

Anonymous
06/08/26(Mon)04:59:45 No. 109004548

Anonymous 06/08/26(Mon)04:59:45 No. 109004548 ►

>>109004534
nope. enjoy your slop

>>

Anonymous
06/08/26(Mon)05:12:17 No. 109004594

Anonymous 06/08/26(Mon)05:12:17 No. 109004594 ►

File: 1780895536291.jpg (44.0 KB)

>>109001981
>make a website
>stt (whisper)
>llm (Gemma 4)
>tts (kokoro)
>it works on 1 machine without unloading any models

>>

Anonymous
06/08/26(Mon)05:13:47 No. 109004599

Anonymous 06/08/26(Mon)05:13:47 No. 109004599 ►

File: 1753081376994628.png (476.8 KB)

>>109004534
welcome to the party pal

>>

Anonymous
06/08/26(Mon)05:17:31 No. 109004609

Anonymous 06/08/26(Mon)05:17:31 No. 109004609 ►

>opencode
>qwen 3.6 35B as planner
>keeps forgetting to delegate coding tasks to subagent
>forgets to update todo
>forgets to send documentation task to subagent
>forgets to git commit changes
It would be cute if it weren't so frustrating. qwen is fine as a coding subagent but it's ass at planning and orchestrating. 27B runs too slow at medium/high context to be useful when deepseek is both faster and the API cost is pretty much the same as my electricity cost per token.

>>

Anonymous
06/08/26(Mon)05:18:04 No. 109004611

Anonymous 06/08/26(Mon)05:18:04 No. 109004611 ►

>>109004234
>all within margin of error outside of agentic and 'scientific reasoning'
Is the 31b undertrained?

>>

Anonymous
06/08/26(Mon)05:18:55 No. 109004617

Anonymous 06/08/26(Mon)05:18:55 No. 109004617 ►

Not super surprising or game-changing, but I like how, rather than set {"enable_thinking":true} in the jinja kwargs, you can just add a rule to Gemma's post history like
>(At the beginning of {{char}}'s next reply, reason some details between <think></think> tags.)
or other phrasing for specifics, and it does so. It's also a radically different version of thinking than the standard, and influenced by the phrasing. It's much easier in general to influence what gets reasoned, whether that's flavorful like "Do so in {{char}}'s persona as if her thoughts.", or the format, or, most familiar to some, set rules not to draft out and re-draft out and re-re-draft out the post in reasoning over and over.

>>

Anonymous
06/08/26(Mon)05:19:39 No. 109004623

Anonymous 06/08/26(Mon)05:19:39 No. 109004623 ►

>>109004534
You tell her to remove it from her outputs and pray really hard.

>>

Anonymous
06/08/26(Mon)05:22:03 No. 109004637

Anonymous 06/08/26(Mon)05:22:03 No. 109004637 ►

>>109004611
the benchmarks suck dicks, just like you

>>

Anonymous
06/08/26(Mon)05:23:39 No. 109004642

Anonymous 06/08/26(Mon)05:23:39 No. 109004642 ►

File: 1777011147888186.jpg (188.6 KB)

>>109004637
You don't even know me

>>

Anonymous
06/08/26(Mon)05:25:44 No. 109004651

Anonymous 06/08/26(Mon)05:25:44 No. 109004651 ►

File: 1429223870035.gif (342.1 KB)

OH FUCK IT'S WORKING IT'S WORKING

I (>>109003541) finally got MTP with a 2x boost now. All I had to do was add `--spec-draft-device CUDA1`. I don't know why that fixes it.

31B at 42 t/s. Fuaaaark.

<|channel>spoiler
I searched reddit for discussions about MTP and found a guy who said he needed the above flag, so I have him to thank for this...<channel|>

>>

Anonymous
06/08/26(Mon)05:28:22 No. 109004654

Anonymous 06/08/26(Mon)05:28:22 No. 109004654 ►

>>109004534
"Avoid negative-positive parallels" and provide an example.

>>

Anonymous
06/08/26(Mon)05:28:56 No. 109004658

Anonymous 06/08/26(Mon)05:28:56 No. 109004658 ►

>>109004651
>MTP
Are you running on a quantized model? Adding some retardation on top of your retardation?

>>

Anonymous
06/08/26(Mon)05:29:24 No. 109004661

Anonymous 06/08/26(Mon)05:29:24 No. 109004661 ►

File: Capture.jpg (79.7 KB)

>>109004617
An example.

>>

Anonymous
06/08/26(Mon)05:30:37 No. 109004672

Anonymous 06/08/26(Mon)05:30:37 No. 109004672 ►

>>109004651
gg congrats and thanks for sharing the fix, I will try your command if MTP is garbage for me.
But first, per your warning earlier, how do you even find a QAT MTP draft model? Your post makes it seem like that's something you can just do by accident, but literally all I can find are MTP draft models for the non-QAT model.

>>

Anonymous
06/08/26(Mon)05:32:58 No. 109004686

Anonymous 06/08/26(Mon)05:32:58 No. 109004686 ►

Mmhm, I really like the gemma 4 26b qat thing.
> prompt eval time = 6013.16 ms / 21148 tokens ( 0.28 ms per token, 3516.95 tokens per second)
> eval time = 4460.77 ms / 503 tokens ( 8.87 ms per token, 112.76 tokens per second)

Not bad.

>>

Anonymous
06/08/26(Mon)05:34:10 No. 109004693

Anonymous 06/08/26(Mon)05:34:10 No. 109004693 ►

>>109004617
>>109004661
very interesting. I like the idea of having more control over the formatting of think blocks. I want to use it to track stats across long term RPs.

>>

Anonymous
06/08/26(Mon)05:34:50 No. 109004695

Anonymous 06/08/26(Mon)05:34:50 No. 109004695 ►

>>109004534
I've had decent luck with stuff as simple as "speak casually" or "speak informally". Depends a lot on the model and the rest of your system prompt though.

>>

Anonymous
06/08/26(Mon)05:37:31 No. 109004711

Anonymous 06/08/26(Mon)05:37:31 No. 109004711 ►

File: 1740547410446.png (216.2 KB)

>>109004651
>31B at 42 t/s.

>>

Anonymous
06/08/26(Mon)05:38:17 No. 109004715

Anonymous 06/08/26(Mon)05:38:17 No. 109004715 ►

File: 1751867471288556.jpg (204.1 KB)

While any gemma can be jailbroken through prompting, I find that the 12b is actually the easiest, giving pretty much no resistance even with low context. 31b is in the middle and needs a little more guidance, while the 26b is the most resistant.

>>

Anonymous
06/08/26(Mon)05:38:34 No. 109004718

Anonymous 06/08/26(Mon)05:38:34 No. 109004718 ►

>tfw no Gembrain Uncensored MTP
I'll settle for merely 30t/s.

>>

Anonymous
06/08/26(Mon)05:39:12 No. 109004720

Anonymous 06/08/26(Mon)05:39:12 No. 109004720 ►

>>109004672
I'm not sure I understand your post but I got my MTP ggufs from here https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP
For the QAT version there's this.
https://huggingface.co/g0chu/gemma-4-31B-it-qat-q4_0-unquantized-assistant-q8_0-gguf/tree/main

>>

Anonymous
06/08/26(Mon)05:39:18 No. 109004721

Anonymous 06/08/26(Mon)05:39:18 No. 109004721 ►

>>109004651
What GPU are you using, and how much extra memory does MTP eat up?

>>

Anonymous
06/08/26(Mon)05:41:54 No. 109004739

Anonymous 06/08/26(Mon)05:41:54 No. 109004739 ►

>>109004718
With the regular 31b heretic, I get 30-52 tokens/s, up from 20-24 without mtp. If set to 2 tokens, I get 40-46.

>>

Anonymous
06/08/26(Mon)05:42:25 No. 109004741

Anonymous 06/08/26(Mon)05:42:25 No. 109004741 ►

>>109004693
You've got me thinking. BRB.

>>

Anonymous
06/08/26(Mon)05:44:28 No. 109004755

Anonymous 06/08/26(Mon)05:44:28 No. 109004755 ►

>>109004721
isn't it something trivial like <1gb?

>>

Anonymous
06/08/26(Mon)05:45:05 No. 109004762

Anonymous 06/08/26(Mon)05:45:05 No. 109004762 ►

>>109004721
It's a 3090 + 3060. The Q8 MTP for 31B seems to eat up an extra 0.7 GiB on my second GPU. The first GPU didn't change in VRAM usage.

>>

Anonymous
06/08/26(Mon)05:45:31 No. 109004764

Anonymous 06/08/26(Mon)05:45:31 No. 109004764 ►

>>109004755
Were you really running models with >1GB free before this?

>>

Anonymous
06/08/26(Mon)05:45:56 No. 109004765

Anonymous 06/08/26(Mon)05:45:56 No. 109004765 ►

>>109004741
I'd just set the stats on the first message (ST doesn't auto collapse unless you edit and resave) and then always pass in the last thinking block as context. throw in "read last thinking block and update stats in accordance with the action of the prior message"

>>

Anonymous
06/08/26(Mon)05:46:51 No. 109004772

Anonymous 06/08/26(Mon)05:46:51 No. 109004772 ►

>>109004764
4x 32gb vram, all for gemma

>>

Anonymous
06/08/26(Mon)05:53:13 No. 109004795

Anonymous 06/08/26(Mon)05:53:13 No. 109004795 ►

>>109004715
This is literally me.

>>

Anonymous
06/08/26(Mon)05:56:08 No. 109004804

Anonymous 06/08/26(Mon)05:56:08 No. 109004804 ►

You haven't lived until you've done an RPG where the GM alternates between Princess Gemma and Mesugaki Gemma every few turns.

>>

Anonymous
06/08/26(Mon)06:17:17 No. 109004875

Anonymous 06/08/26(Mon)06:17:17 No. 109004875 ►

File: Capture.jpg (805.4 KB)

>>109004693
>>109004765
I'd considered stats in reasoning before. My big issue was "Doesn't ST never send past reasoning blocks as part of prompts?" So any stat changes, or even stat format, wouldn't be seen or updated apart from what the current message guesses from prose. But there's an actual checkbox setting in the menu
>[ ] Add to Prompts, X Max
So with a bit of setup, it reasons just the stats, you can add past reasoning stat sheets, and, as is ideal, it can purge the stat sheets too far back to prevent cluttering context. For the test, I set it to send 2 max (the last one and the one before it), although 1 would likely work fine.

Tested in a card with a lengthy stat sheet that is supposed to send them at the bottom of every message. I deleted the card's post history override and added
>(Begin {{char}}'s next reply with the full app menu of stats kept between <think></think> tags.)
to my system AN. To be honest, the first reply failed (didn't try to think so I had to start it with <think>, then it tried to also repeat the stats at the bottom, as the first message did). Once I deleted that and got the first reply in the desired format, the next reply 1) sent that last reasoning block of stats, highlighted in pic related, and 2) output exactly how the last message did, correctly.

>tl;dr Yes, it works fine.
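
For anyone doing this outside ST, a rough sketch of the same "keep only the last N reasoning blocks" idea. This is not ST's actual implementation; the function name, the message dict layout and the keep_last parameter are all made up for illustration, and it assumes the stats always sit in a single <think></think> block.

```python
import re

# Matches one <think>...</think> block plus any whitespace right after it.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_old_reasoning(messages, keep_last=2):
    """Return a copy of `messages` where only the last `keep_last` assistant
    messages still carry their <think>...</think> block (hypothetical helper)."""
    # Indices of assistant messages that contain a reasoning block.
    with_think = [
        i for i, m in enumerate(messages)
        if m["role"] == "assistant" and THINK_RE.search(m["content"])
    ]
    # Guard keep_last == 0, since list[-0:] would keep everything.
    keep = set(with_think[-keep_last:]) if keep_last > 0 else set()
    out = []
    for i, m in enumerate(messages):
        content = m["content"]
        if i in with_think and i not in keep:
            content = THINK_RE.sub("", content)  # purge stale stat sheet
        out.append({"role": m["role"], "content": content})
    return out
```

With keep_last=2 the prompt carries the last stat sheet and the one before it, and everything older is dropped so it doesn't clutter context.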

>>

Anonymous
06/08/26(Mon)06:24:17 No. 109004904

Anonymous 06/08/26(Mon)06:24:17 No. 109004904 ►

How the fuck do you view MTP draft "acceptance rate" or whatever in llama-server
MTP makes no difference for me so I assume it's not working, but I have to check before I give up. I got so mad trying to figure out how to check that I enabled full verbosity and piped it into a file and grepped for every keyword I could think of like "draft", "accept", "acceptance", "mtp" and there's fucking nothing.

>>

Anonymous
06/08/26(Mon)06:26:56 No. 109004918

Anonymous 06/08/26(Mon)06:26:56 No. 109004918 ►

>>109004765
>>109004875
Also, I found a <think> block put anywhere but at the beginning of a response will be ripped out by ST and shoved at the beginning, both in how it's presented to you and how that message will be sent to the prompt for the next one.

>>

¯\_(ツ)_/¯
06/08/26(Mon)06:30:07 No. 109004932

¯\_(ツ)_/¯ 06/08/26(Mon)06:30:07 No. 109004932 ►

Any High Concept Ideas?

>>

Anonymous
06/08/26(Mon)06:31:18 No. 109004935

Anonymous 06/08/26(Mon)06:31:18 No. 109004935 ►

>>109004904
Tokens Generated (Proposed): 920
Tokens Accepted: 694
Calculation: 694 ÷ 920 ≈ 0.754 (or 75.4%)

2.24.709.518 I reasoning-budget: activated, budget=2147483647 tokens
2.26.050.874 I slot print_timing: id  0 | task 468 | n_decoded =    100, tg =  72.46 t/s
2.29.076.096 I slot print_timing: id  0 | task 468 | n_decoded =    326, tg =  74.00 t/s
2.32.109.437 I slot print_timing: id  0 | task 468 | n_decoded =    528, tg =  70.98 t/s
2.34.246.458 I reasoning-budget: deactivated (natural end)
2.35.136.228 I slot print_timing: id  0 | task 468 | n_decoded =    732, tg =  69.94 t/s
2.38.167.722 I slot print_timing: id  0 | task 468 | n_decoded =    932, tg =  69.05 t/s
2.41.202.700 I slot print_timing: id  0 | task 468 | n_decoded =   1130, tg =  68.35 t/s
2.41.873.714 I slot print_timing: id  0 | task 468 | prompt eval time =    1279.72 ms /  1313 tokens (    0.97 ms per token,  1026.01 tokens per second)
2.41.873.716 I slot print_timing: id  0 | task 468 |        eval time =   17202.95 ms /  1174 tokens (   14.65 ms per token,    68.24 tokens per second)
2.41.873.717 I slot print_timing: id  0 | task 468 |       total time =   18482.66 ms /  2487 tokens
2.41.873.717 I slot print_timing: id  0 | task 468 |    graphs reused =        935
2.41.873.718 I slot print_timing: id  0 | task 468 | draft acceptance = 0.71134 (  690 accepted /   970 generated)
2.41.873.724 I statistics        draft-mtp: #calls(b,g,a) =    3    945    945, #gen drafts =    945, #acc drafts =   763, #gen tokens =   1890, #acc tokens =  1384, dur(b,g,a) = 0.003, 1795.864, 0.594 ms
2.41.873.779 I slot      release: id  0 | task 468 | stop processing: n_tokens = 2490, truncated = 0
2.41.873.787 I srv  update_slots: all slots are idle
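
If you'd rather not eyeball it, here is a small sketch that pulls the numbers out of a saved log. It assumes the line looks exactly like the "draft acceptance = ..." line above; the regex and function name are my own, and other builds may format the line differently.

```python
import re

# Pattern modeled on the "draft acceptance" line in the log above (assumed format).
ACCEPT_RE = re.compile(
    r"draft acceptance\s*=\s*([0-9.]+)\s*"
    r"\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\)"
)

def draft_acceptance(log_text):
    """Yield (accepted, generated, rate) for each 'draft acceptance' line."""
    for m in ACCEPT_RE.finditer(log_text):
        accepted, generated = int(m.group(2)), int(m.group(3))
        # Recompute the rate from the counts instead of trusting the printed value.
        rate = accepted / generated if generated else 0.0
        yield accepted, generated, rate

sample = (
    "slot print_timing: id  0 | task 468 | "
    "draft acceptance = 0.71134 (  690 accepted /   970 generated)"
)
for accepted, generated, rate in draft_acceptance(sample):
    print(f"{accepted}/{generated} = {rate:.3f}")
```

If it finds nothing at all in your log, my guess is that speculative decoding isn't actually active for that request.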

>>

Anonymous
06/08/26(Mon)06:32:49 No. 109004940

Anonymous 06/08/26(Mon)06:32:49 No. 109004940 ►

So it turns out you need a giga beast GPU to take advantage of MTP or else it's useless for you if you're a poorfag.
I literally get a higher t/s without a MTP model loaded, enabling it decreases my t/s from 2.9 down to 2.75 for the same prompt
Gemma 31b Q4 on a 6gb vram RTX 3060 with 64gb DDR5

>>

Anonymous
06/08/26(Mon)06:34:44 No. 109004947

Anonymous 06/08/26(Mon)06:34:44 No. 109004947 ►

>>109004935
Okay, so it's supposed to look like that but how did you enable it? How do you get the expanded output I don't have, like n_decoded and stuff?
I am using the latest llama.cpp release.

>>

Anonymous
06/08/26(Mon)06:36:21 No. 109004955

Anonymous 06/08/26(Mon)06:36:21 No. 109004955 ►

>>109004935
I'm using -lv 4 by the way, and mine goes from "graphs reused" to "stop processing" immediately, I don't have the "draft acceptance" thing in the middle. But it's clear from the output when starting up the server that both models are being loaded.

>>

Anonymous
06/08/26(Mon)06:36:43 No. 109004956

Anonymous 06/08/26(Mon)06:36:43 No. 109004956 ►

File: 1761335248275862.jpg (91.6 KB)

>>109004940
>Gemma 31b Q4 on a 6gb vram RTX 3060
You're using a model that already can't fit the full model into its VRAM by a fucking LONGSHOT, the model is almost TRIPLE the size of your anemic amount of VRAM and you're adding a second model into the mix, to occupy even more memory that you don't have. No shit it's slower.
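
Back-of-envelope check on the "almost triple" figure. This assumes roughly 4.5 bits per weight, which is the nominal figure for Q4_0 (K-quants run a bit higher), and it ignores KV cache and runtime overhead, so treat it as a lower bound.

```python
def model_size_gb(n_params_b, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes) for a model with
    `n_params_b` billion parameters. Ignores KV cache and overhead."""
    return n_params_b * bits_per_weight / 8

size = model_size_gb(31, 4.5)  # assumed ~4.5 bits/weight for a Q4-class quant
vram = 6.0                     # GB, from the quoted post
print(f"weights ~{size:.1f} GB, about {size / vram:.1f}x a {vram:.0f} GB card")
```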

>>

Anonymous
06/08/26(Mon)06:37:44 No. 109004960

Anonymous 06/08/26(Mon)06:37:44 No. 109004960 ►

>>109004956
>model that already can't fit the full model
*GPU that already can't fit the full model

>>

Anonymous
06/08/26(Mon)06:38:27 No. 109004963

Anonymous 06/08/26(Mon)06:38:27 No. 109004963 ►

Is MTP a meme for 26b on vulkan?

>>

Anonymous
06/08/26(Mon)06:38:39 No. 109004964

Anonymous 06/08/26(Mon)06:38:39 No. 109004964 ►

>>109004956
MTP is supposed to increase t/s, FAGGOT

>>

Anonymous
06/08/26(Mon)06:38:46 No. 109004965

Anonymous 06/08/26(Mon)06:38:46 No. 109004965 ►

>>109004947
Not sure, maybe it's something in my llama-server command?

This is for a 5090/3090 setup, so I'd tweak/take from it until something works. All the MTP stuff is tacked on at the bottom.

CUDA_VISIBLE_DEVICES=1,0 /llama.cpp/build/bin/llama-server \
  -m /models/gemma-4-31B-it-Q8_0.gguf \
  -a gemma-31b \
  -ngl all \
  -dev CUDA0,CUDA1 \
  -sm layer \
  -ts 3,5 \
  -c 131072 \
  -np 1 \
  --fit off \
  --host 127.0.0.1 \
  --port 5001 \
  --webui-mcp-proxy \
  --jinja \
  --reasoning on \
  --ctx-checkpoints 4 \
  --spec-type draft-mtp \
  -md /models/mtp-gemma-4-31B-it.gguf \
  -ngld all \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.0 \
  -devd CUDA1

>>

Anonymous
06/08/26(Mon)06:40:36 No. 109004971

Anonymous 06/08/26(Mon)06:40:36 No. 109004971 ►

File: 1763591098028881.gif (1.6 MB)

>>109004964
It increases speed when GPU processing speed is the bottleneck, not fucking memory bandwidth due to offloading a dense model to RAM like a nigger.

>>

Anonymous
06/08/26(Mon)06:42:11 No. 109004980

Anonymous 06/08/26(Mon)06:42:11 No. 109004980 ►

>>109004965
Thanks, the missing flag for me seems to have been --spec-type draft-mtp, I enabled that and now it's working.
>>109004956
>>109004971
After I enabled draft-mtp I went up to 6 t/s, so you were wrong. Eat shit.

>>

Anonymous
06/08/26(Mon)06:43:09 No. 109004984

Anonymous 06/08/26(Mon)06:43:09 No. 109004984 ►

>>109004980
Are you using a 6GB GPU?

>>

Anonymous
06/08/26(Mon)06:43:36 No. 109004988

Anonymous 06/08/26(Mon)06:43:36 No. 109004988 ►

>>109004980
>Thanks, the missing flag for me seems to have been --spec-type draft-mtp, I enabled that and now it's working.
Yup, that's the important part. 2 is the sweet spot for Gemma as well.

>>

Anonymous
06/08/26(Mon)06:44:52 No. 109004991

Anonymous 06/08/26(Mon)06:44:52 No. 109004991 ►

>>109004964
ask for your money back

>>

Anonymous
06/08/26(Mon)06:45:16 No. 109004992

Anonymous 06/08/26(Mon)06:45:16 No. 109004992 ►

>>109004984
Yes like I said. The autofit is offloading 13/61 layers for the main model and 5/5 for the draft model.

>>

Anonymous
06/08/26(Mon)06:48:04 No. 109005003

Anonymous 06/08/26(Mon)06:48:04 No. 109005003 ►

Which Gemma 4 12B should I download for NSFW lewd AI gf chat?
Using ST and Koboldcpp.
Tried https://huggingface.co/igorls/gemma-4-12B-it-heretic-GGUF
Anything better?

>>

Anonymous
06/08/26(Mon)06:50:37 No. 109005012

Anonymous 06/08/26(Mon)06:50:37 No. 109005012 ►

>>109005003
The one from google, quantized by bartowski.

>>

Anonymous
06/08/26(Mon)06:51:07 No. 109005013

Anonymous 06/08/26(Mon)06:51:07 No. 109005013 ►

File: 1764044514428218.gif (113.2 KB)

>>109001981
what's the current meta for 12gb vram?

>>

Anonymous
06/08/26(Mon)06:51:33 No. 109005016

Anonymous 06/08/26(Mon)06:51:33 No. 109005016 ►

>>109005013
depression

>>

Anonymous
06/08/26(Mon)06:52:08 No. 109005017

Anonymous 06/08/26(Mon)06:52:08 No. 109005017 ►

>>109005013
Gemma 12b/26b

>>

Anonymous
06/08/26(Mon)06:52:33 No. 109005021

Anonymous 06/08/26(Mon)06:52:33 No. 109005021 ►

>>109005013
gemma 12b or 26b moe

>>

Anonymous
06/08/26(Mon)06:53:13 No. 109005026

Anonymous 06/08/26(Mon)06:53:13 No. 109005026 ►

>>109004965
>This is for a 5090/3090 setup
Are you getting raped by Jensen's cross-architectural compatibility jewry have inference devs figured out a fix for that by now?
>>109005021
12b obsoleted every single Gemmoe until Google releases 124b.

>>

Anonymous
06/08/26(Mon)06:54:56 No. 109005034

Anonymous 06/08/26(Mon)06:54:56 No. 109005034 ►

>>109005026
>every single Gemmoe
All 1 of them?

>>

Anonymous
06/08/26(Mon)06:56:35 No. 109005041

Anonymous 06/08/26(Mon)06:56:35 No. 109005041 ►

>>109005026
No, but the 3090 is dragging my 5090 down. 70 tok/s vs the 100+ tok/s I could be getting with two 5090s.
There are also CUDA 13.3 gains I'm waiting on: ~15% supposedly, though I expect more like ~5%. Still, gains are gains. Waiting on RPM Fusion to greenlight the 610 nvidia drivers before updating my cuda stacks.

>>

Anonymous
06/08/26(Mon)07:13:50 No. 109005105

Anonymous 06/08/26(Mon)07:13:50 No. 109005105 ►

File: 1764598591501754.jpg (49.2 KB)

>>109005013
ALSO i don't care about imageshit
i don't want anything multimodal i just want to ask a computer about technical questions

>>

Anonymous
06/08/26(Mon)07:16:05 No. 109005117

Anonymous 06/08/26(Mon)07:16:05 No. 109005117 ►

>>109005105
what if your technical question involves an image or sound?

>>

Anonymous
06/08/26(Mon)07:27:42 No. 109005153

Anonymous 06/08/26(Mon)07:27:42 No. 109005153 ►

Is there really no moe out there better than 26b for ~12-16GB? Seriously? I’m not saying it’s terrible, but it’s very meh now that I’ve used 12b-qat, which is still a little too slow depending on what I’m doing. Feels weird defaulting to a tiny 12b if speed doesn’t matter. Just wish there was a 33b4a Gemma to fill the gap now that 12b almost obliterated 26b out of nowhere. It’s just 31b or 12b now. I’ve tried qwen35b and something about it feels hacky and off.

>>

Anonymous
06/08/26(Mon)07:37:33 No. 109005190

Anonymous 06/08/26(Mon)07:37:33 No. 109005190 ►

>>109005041
>cuda 13.3 gains
for the 3090?

>>

Anonymous
06/08/26(Mon)07:42:59 No. 109005204

Anonymous 06/08/26(Mon)07:42:59 No. 109005204 ►

i hate being poor bros
i just want a dgx spark so i can have fun with ai
but instead im stuck with a 3080
go on without me

>>

Anonymous
06/08/26(Mon)07:47:42 No. 109005218

Anonymous 06/08/26(Mon)07:47:42 No. 109005218 ►

>>109005204
You are 20 stolen catalytic converters away from a 5090.

>>

Anonymous
06/08/26(Mon)07:48:04 No. 109005222

Anonymous 06/08/26(Mon)07:48:04 No. 109005222 ►

File: .png (47.8 KB)

>>109005190
Yes, look up CompileIQ

>>

Anonymous
06/08/26(Mon)07:49:16 No. 109005231

Anonymous 06/08/26(Mon)07:49:16 No. 109005231 ►

>>109004804
Try mesugaki Princess Gemma. She's very mean.

>>

Anonymous
06/08/26(Mon)07:51:50 No. 109005235

Anonymous 06/08/26(Mon)07:51:50 No. 109005235 ►

With MTP I noticed that if I do a request with an image in the chat, and then do a different request without an image, I get a broken gen with weird tokens selected and degraded intelligence. Doing a regen seems to reset it and makes it normal again.

Just llama.cpp things I guess.

>>

Anonymous
06/08/26(Mon)07:52:44 No. 109005239

Anonymous 06/08/26(Mon)07:52:44 No. 109005239 ►

File: 1705289017402326.png (9.7 KB)

>>109005204
>gtx 1060

>>

Anonymous
06/08/26(Mon)07:53:09 No. 109005241

Anonymous 06/08/26(Mon)07:53:09 No. 109005241 ►

>>109005231
Prompt?

>>

Anonymous
06/08/26(Mon)07:56:20 No. 109005256

Anonymous 06/08/26(Mon)07:56:20 No. 109005256 ►

>>109004534
That has likely been amplified in their RL training data by accident over time. Only the next big release will address it. If they bother to address it at all.

>>

Anonymous
06/08/26(Mon)07:58:01 No. 109005261

Anonymous 06/08/26(Mon)07:58:01 No. 109005261 ►

>>109005241
Not at home right now but I think it's
>You are Princess Gemma, an AI personal assistant created by Google. You are a mesugaki loli. You only speak in Middle English. Avoid modern English whenever possible.

>>

Anonymous
06/08/26(Mon)08:04:13 No. 109005278

Anonymous 06/08/26(Mon)08:04:13 No. 109005278 ►

Is it retarded to try and run gemma 4 12b at a Q5 or Q6 quant?

>>

Anonymous
06/08/26(Mon)08:08:32 No. 109005289

Anonymous 06/08/26(Mon)08:08:32 No. 109005289 ►

>>109005278
No. But once you start going lower than Q4 on a model smaller than 100B, yeah.

>>

Anonymous
06/08/26(Mon)08:12:26 No. 109005307

Anonymous 06/08/26(Mon)08:12:26 No. 109005307 ►

>>109005222
More than "not x but y", I hate the attempts at writing punchy section headers and labels like The "Win" that fill slopicles.

>>

Anonymous
06/08/26(Mon)08:12:30 No. 109005308

Anonymous 06/08/26(Mon)08:12:30 No. 109005308 ►

>>109005204
Think outside the box. You need a) memory bandwidth, b) lots of matmuls, and c) whatever specialized processor you can lay hands on for the equivalent of tensor cores, plus enough vram for context.
There are many ways, be contrarian and wily

>>

Anonymous
06/08/26(Mon)08:19:47 No. 109005330

Anonymous 06/08/26(Mon)08:19:47 No. 109005330 ►

So what's the verdict for the gemma 26b moe vs 12b?
Which model is better for sex?

>>

Anonymous
06/08/26(Mon)08:26:58 No. 109005362

Anonymous 06/08/26(Mon)08:26:58 No. 109005362 ►

Why are two different models so different even though they're both Gemma 4 12B?
Both are Q4_K_M
>DuoNeural/Gemma4-12B-IT-Abliterated-GGUF
This one is trash.
>igorls/gemma-4-12B-it-heretic-GGUF
This one is decent.

>>

Anonymous
06/08/26(Mon)08:27:26 No. 109005365

Anonymous 06/08/26(Mon)08:27:26 No. 109005365 ►

Asking here because these programs use python a lot, but what's the best way to handle multiple python versions on windows 10? Sometimes they come packaged with the program, but sometimes I need specifically 3.12 and sometimes 3.11 etc., and they can't be used with other versions.

>>

Anonymous
06/08/26(Mon)08:28:39 No. 109005369

Anonymous 06/08/26(Mon)08:28:39 No. 109005369 ►

>>109005365
uv

>>

Anonymous
06/08/26(Mon)08:29:18 No. 109005372

Anonymous 06/08/26(Mon)08:29:18 No. 109005372 ►

>>109005362
Abliteration slips into lobotomy real easily.

>>

Anonymous
06/08/26(Mon)08:33:46 No. 109005386

Anonymous 06/08/26(Mon)08:33:46 No. 109005386 ►

>>109005330
12b for handholding, foreplay and teasing, 26b for a quickie and rough correction

>>

Anonymous
06/08/26(Mon)08:35:26 No. 109005392

Anonymous 06/08/26(Mon)08:35:26 No. 109005392 ►

>>109004971
It's literally the opposite, dumbass. There isn't much difference between processing 1 or 100 tokens at a time, you just don't know what token is next without speculation, so you have to do it 1 at a time. The slower the memory, the more it benefits from MTP
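
To put rough numbers on that: a toy model, not a measurement. It assumes each drafted token is accepted independently with probability p, that verifying k drafted tokens in one pass costs about the same as one normal decode step (the memory-bound case described above), and that each drafted token costs a fraction draft_cost of a step. The function names and the 0.05 default are made up for illustration.

```python
def expected_tokens_per_step(p, k):
    """Expected tokens emitted per verification pass when k tokens are drafted
    and each is accepted independently with probability p.
    Equals 1 + p + p^2 + ... + p^k."""
    return sum(p ** i for i in range(k + 1))

def speedup(p, k, draft_cost=0.05):
    """Estimated speedup over plain decoding under the assumptions above:
    one verification pass costs one step, each draft token costs
    `draft_cost` of a step."""
    return expected_tokens_per_step(p, k) / (1 + k * draft_cost)

for p in (0.5, 0.75, 0.9):
    print(f"p={p}: k=2 speedup ~{speedup(p, 2):.2f}x")
```

Under these assumptions the gain comes entirely from the acceptance rate; if verification stops being "free" (compute-bound, or the draft model itself is slow), the denominator grows and the speedup can drop below 1.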
