File: slopu.jpg (296.7 KB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108997418 & >>108992276
►News
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quan tization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Showing all 286 replies.
>>
File: GJSADOQaYAAth5M.jpg (230.4 KB)
►Recent Highlights from the Previous Thread: >>108997418
--llama.cpp Gemma4 MTP support and NVIDIA driver power issues:
>108999197 >108999233 >108999304 >108999235 >108999239 >108999248 >108999816 >108999837 >108999872 >109000059 >108999307 >108999361 >108999363 >108999534 >108999840 >108999910 >109000348
--Modern context window sizes, hardware requirements, and effective usable limits:
>108997787 >108997790 >108997801 >108997812 >108997829 >108997841 >108997856 >108997887 >108997905 >108997921 >108998145 >108998224 >108997794 >108998166 >108998276 >108998341
--Testing DeepSeek V4 Flash reasoning and vLLM implementation stability:
>108999274 >108999312 >108999419 >108999447
--DeepSeek V4's in-character thinking and local deployment requirements:
>108998056 >108998104 >108999526 >108999579 >108999598 >108999619 >108999631 >108999686 >108999737
--Comparing Gemma 4 12B and 26B performance and VRAM constraints:
>109001747 >109001763 >109001875
--Comparing KVarN and TurboQuant for KV cache quantization:
>108999190 >108999709 >109001893
--Comparing Gemma4 QAT variant performance and quality issues:
>109001846 >109001883 >109001913
--Technical guide on KV cache quantization for VRAM efficiency:
>109001715
--Challenges implementing parallel local agents via llama.cpp and vLLM:
>108998111 >108998131 >108998176 >108998337 >108999833
--Separate sampling configurations for thinking and tool call outputs:
>109001446 >109001454 >109001535 >109001557 >109001591 >109001717 >109001839 >109001697
--Performance and validity of MoE models with SSD expert offloading:
>108997746 >108998076 >109000430 >108998807
--NoLiMa benchmark results for Qwen 3.5 MoE with thinking enabled:
>109000576 >109000663 >109000607
--Logs:
>108997874 >108997958 >108999088 >108999734 >108999737 >109000227 >109000323 >109000370 >109001482
--Miku (free space):
>109001425
►Recent Highlight Posts from the Previous Thread: >>108997420
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
>>
File: 1760802872565174.jpg (216.0 KB)
>>109001981
has a local model ever saved your life?
>>
>>
>>
>>
>>
>>109002107
No, but once you hit 32K you have to be a little more explicit and hands-on with your prompts to keep it focused. It's not that hard once you get used to the signs that it's starting to shit itself. Like if I'm coding I'll point it directly to the file I know something is in, instead of leaving it to search for it which is just a waste.
>>
>>
>>
>>
>>
>>
File: 1751801909789361.png (1.8 MB)
>>109002170
26 is a microdick wearing an 8 inch chastity
>>
File: 1777522742291.png (140.3 KB)
>>109002126
Yeah?
>>
>>
>>109002046
Too bad for you, I'm not. I've been stuck using Cydonia 24b for months because it was good at RP. I thought that maybe gemma 4 12b with the qat as a general use model might be able to RP and have a better experience using a larger context. I tried it at 8k, 16k, and 32k, it's the same level of retardation. And yes, I'm using the correct chat template, too. Set to chat completion rather than text completion. Thinking was off, because turning it on would cause 1400 token thinking blocks on top of a 700 token output when it finally got done thinking. That is, assuming the fucking thing didn't get caught in a loop 9/10 times halfway through the thinking process and just spam the last word over and over and over again.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>109002216
Cydonia does what I need it to do, which is RP without retardation. Gemma 4 12b QAT does not. Your refusal to address the issue at hand means you're actually acknowledging that the qat's are fucked and have no real counter to anything I've said. If I were a saar in this day and age, I'd be here in burgerstani on a visa making stupid amounts of money scamming retarded people with AI usage and could afford better hardware to run better models.
>>
>>
>>
>>
>>
>>
>>
>>
File: e8f-662426453.jpg (53.7 KB)
>>109002043
The 70 year old abstainer is bald, guy on the left has a cool cane and a hat. Checkmate.
>>
>>
File: Subtext.png (156.7 KB)
>>109002359
What happens if you click that little attachment button are images blacked out? You're using Bart's Q_6K_L without any mmproj? Can it identify images? Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
>>
>>109002430
>bart
Yes. I have the mmproj loaded, just had to refresh the page because I closed llama earlier to change settings. If the mmproj isn't loaded it gives you an error when you try to upload an image.
> Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
That's what I thought too but apparently not.
>>
>>
>>
File: 432234234423434.jpg (62.5 KB)
>>109002441
Glad I am not alone in this retard hell then. At least I figured out it would work with a mmproj file in llamacpp. But that's about it. I still don't know if this 12B Unified model actually does anything cool on its own. Is it in some proprietary backend? Who knows.
>>
>>109002445
>https://huggingface.co/g0chu
26B unquanted one works well for me so I'd say other ones are safe too. I have no idea why did he update everything just moment ago though.
>>
>>109002430
>>109002441
>>109002474
I didn't need any mmproj to identify images on 12B. It's able to identify mine just fine (though I suppose it could have done better when I asked it for booru style tag outputs), but I haven't tried audio with it yet.
>>
>>
File: Screenshot 2026-06-07 at 22-52-00 Introducing Gemma 4 12B.png (28.5 KB)
>>109002469
>>109002474
I think this is what the mmproj is. I'm going to guess that it was separated out due to how llama.cpp works.
>>
>>
>>109002430
>Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
I thought the same thing, but apparently for llama.cpp the image parts of the 12B are split out into a separate mmproj anyway. It's just very small compared to previous models' mmprojs
>>
why doesn't Stable Audio 3 Medium expose the Init Audio functionality like the web space does?
https://huggingface.co/spaces/stabilityai/stable-audio-3/blob/main/app .py#L605
Is there a node or something for this I'm surprised
>>
>>
>>
>>
File: 1752741800991819.gif (968.6 KB)
>>109002562
Marketing faggots will reap what they sow, one day.
>>
>>
>>
File: file.png (58.0 KB)
>>109002562
>>109002474
>>109002430
Does no one even bother to read the model cards?
>>
>>
>>
File: 1393280119558.gif (990.6 KB)
>>109002768
Yeah but WHY doesn't that shit work then when I try it through Kobold and Textgen. It's marketed as Unified, yet I still have to do the same old in llama.cpp. Where is this revolutionary thing? Oh it was just a nothing burger again? Cool.
>>
>>
>>
>>109002807
because you are using llama.cpp which is in a constant circle of somehow trying to strap new things on something that was built to run llama 3 years ago
somebody decided years ago that vision is some silly gimmick that the llava models do and that if you really want to use it, the way to go is to rip the vision part out of the llm and put it in a separate gguf. so that is how it is now
>>
>>
>>
>>
>>
>>109002831
ONNX splits them into separate files as well. ONNX also at one point required separate weights depending on the backend you were planning to use. So you need one set of weights to run in RAM and another set to run on the GPU. I think they had a plan to combine them, but don't know if that's still the case. Point is, it could have been worse.
>>
>>109002807
...Because ooba is just running llama.cpp and koboldcpp is a fork of llama.cpp? Like yeah no shit you have to use an mmproj file for those too. Did you think a gguf file was the native format of the model or something?
>>
File: 1405665005733.gif (673.6 KB)
>>109002919
Totally unrelated to your fag question: But does anyone of you know of any good image gen models that are low memory but high quality? I've seen some of you fags post gemmy self images. I wamt low memory models that are all in one (AIO) For example Anima-V1-Turbo-AIO-Q4_K can run easily with Gemma-4 12B and generate images in Silly Tavern.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>109003327
You know that's just summarized thinking, right? Claude summarizes the thinking process because that's what usually makes or breaks "smart" models. It's a shift by all western frontier AIs to make it harder for the chinks to steal their shit.
>>
>>
>>
>>
>>109003333
I'm trying it, too. I'm pointing out that it's being retarded when it shouldn't be, and that something is up with it. The guy I was replying to in the first place is the one who called me a qwencuck saar trying to poison the well by responding to other anons last thread with similar issues.
>>
>>109003378
that's a cartesian product in terms of effort required for the combinations of models, quants and hardware, not to mention the need for documenting the quant lobotomy effects, which also gets outdated relatively fast and nobody feels like editing all the previous descriptions. What you see in the op is the best you'll get unless you feel like stepping up and doing all the legwork.
>>
>>109003441
I threw this into AI and got a comprehensive list
"I would like an LLM sizing guide. Start with the smallest sizes, tell me what it is capable of, what it excels at, what it can't do, hardware requirement, and what most people use it for. Chat bot, Vibe coding, etc etc" I appreciate you effortposting though.
>>
I wonder sometimes if if there's no one but absolute retards here. I've used 12b Q8 and QAT, and I got up to 24k context RPing with both. I mean, they're both kinda retarded compared to 31b-neesan but they work for the poors.
>>
>>109003403
yeah gemma's being fucky with me as well. using thinking does help it a lot though but you need high token per second in order to tolerate it.
the issue you mentioned about it repeating the same word over and over I only experienced in sillytavern, and i solved it with not including names and setting tokenizer to gemma this:
>>108991684
hopefully that helps but yeah still honeymoon stage imo
>>
>>
>>
>>
>>
>>
>>
>>
File: G4YXvPYaEAAhuYY.jpg (108.8 KB)
>be me
>get gemma 4 31B
>install pi
>ask gemma to build a web search feature, give it a couple of existing package codebases as reference
>ask it to make a plan to develop and deploy it
>it reads the two codebases
>writes the plan
>build the plan
>extension doesn't work
>ask it to fix it
>3 hours later, extension doesn't work
>send the plan to opus and ask it to correct it and provide an explanation of the fixes necessary to gemma
>it creates the handoff.md file
>gemma understands it
>applies the fixes
>now extension works
well I guess I will have to build a fucking cathedral of tests and checks if I want gemma to build anything that works
>>
Been playing around with mtp. It seems to not give me much of a boost on my specific machine and setup. 21.46 t/s no MTP vs 23.39 t/s MTP best case. I wonder if it's a PCIe bottleneck. I have two cards, and one of them is on some shitty x4 lane slot I believe.
Also, a non-QAT model should not be used with a QAT MTP. I used one anonymous posted earlier today and it had a bad acceptance rate. Then Unsloth released their MTP goofs for the non-QAT model and it worked way better. The best acceptance rate I saw was 0.89499. That was with a coding prompt, and --spec-draft-n-max 1. But, again, that only gave me a small boost.
I am using Bartowski Q4_K_L as the main model and Q8 for the MTP.
>>
>>
>>
>>
>>
>>
>>109003541
Why set acceptance at 1? The minimum should be two. Have you tried that?
>x4
For shit cards it shouldn't matter... I've done my tests with a 5070ti and 5060ti, and pcie4x4 is plenty even for tensor split.
>>
>>109003460
Sillytavern is fucking up because you're using text complete, and gemma needs chat complete set to OpenAI compatible. And yeah, thinking does help, until it decides to overthink the reply limit in ST and then refuses to continue thinking, even with auto continue enabled. I know it's not limited to sillytavern, though. I was getting the same shit happening in lumiverse as well, in terms of it adding foreign languages and just outputting slop in general.
>>
>>
>>
>>
>>109003545
>Is there even a difference
That's a good question.
>i swear i tried reading mtl a few years ago it was pure torture
I was using Aya-Expanse last year, and Gemma 4 31b is so much better. Compared to mtl, it's day and night. Still doesn't compare to asking my grandmother to translate to english (but she only knows madarin and cantonese, and beihaihua), but mtl is so much better than even, say, two years ago. Also, I can't exactly ask my grandmother to translate cnc mpreg omegaverse shit.
>>
>>
>>109003557
this is a text model general
>>109003608
maybe an /r/ refugee
>>
>>109003541
>I wonder if it's a PCIe bottleneck
If you're running the MoE with experts on CPU then MTP isn't going to help much. Speculative decoding in general is based on the idea that processing 2-4 tokens as a batch is nearly as fast as processing 1. This is true for dense models because you have to load the same weights into the CPU/GPU core either way, so doing a few extra multiplies once they're there is basically free. But for MoEs, you have to load 2-4x as many expert weights, since those are usually different per token.
Also, have you tried fiddling with --spec-draft-n-max?
>>
>>109003541
>>109003623
>Also, have you tried fiddling with --spec-draft-n-max?
Never mind lol, I forgot to read the rest of your post
>>
>>
>>109003460
NTA but Cydonia quants were my goto until I switched to gemmy 31B Q4. The QAT hallucinates too much for me, though.
>She pressed against his same same and then same same same
she loves this fucking word for some reason
>>
>>109003556
That's odd. It does at least load for me, although I get some weird message at the beginning I don't know mean anything.0.01.168.574 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.01.214.852 W srv load_model: [spec] failed to measure draft model memory: failed to create llama_context from model
>>109003564
I tried all the way from 1 to 4 for that value. 1 gave the best for my setup. 2 gave me about 21.6 t/s, 3 about 19.8, and 4 about 18.2.
>>109003568
Which ones?
>>109003623
It's all on my two GPUs. As I said one of them is on a slow PCIe slot. It also bottlenecks me, I believe, when I try doing tensor parallel.
>>
>>109003661
>>109003623
Oh sorry I forgot to post I am running 31B.
>>
>>109003589
There's more assistant ggufs for 26B. It tried IQ4_NL instead of QAT unquanted q8 and this new one is slightly faster. Acceptance rate went from ~0.6 to ~0.75.
>https://huggingface.co/RachidAR/gemma-4-26b-A4B-it-assistant-gguf/tre e/main
I think this is slightly confusing. And does quantizing the assistant affect anything else? I have no idea.
>>
>>109003661
>Gemma4Assistant
I haven't been able to find any ggufs with Gemma4Assistant, the ones I downloaded were gemma4_assistant, and gemma4_mtp, which aren't supported. The only ggufs with Gemma4Assistant I've found were the qat versions. Do you have a link to a non-qat one? I don't have the ram to quant my own (or maybe I do? I remember pip failed to install rocm pytorch because I didn't have enough ram, 8gb. idk if the requirements to quant are the same).
>>
>>109003690
I mentioned Unsloth in my post, so...
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP
>>
>>
>>
>>
>>109003661
>It's all on my two GPUs. As I said one of them is on a slow PCIe slot. It also bottlenecks me, I believe, when I try doing tensor parallel.
>>109003661
>Oh sorry I forgot to post I am running 31B.
Super weird. PCIe bandwidth shouldn't matter much if you're doing layer split instead of tensor parallel, since the only thing that needs to be sent over is about 10kB of embeddings, once per token. Maybe try using one GPU and putting the rest on CPU, with and without MTP, and see if you get a speedup there? That's the setup I've got and I'm seeing about 2x speedup with --spec-draft-n-max 4. If you see speedup with GPU + CPU but not with two GPUs then it might be some kind of bug. If you don't see speedup with either then idk.
>I get some weird message at the beginnin
I get the same, but I ignored it because it says
>(this is normal during memory fitting)
>>
>>
>>
>>
>>
>>109003734
i'm debating heavily about throwing in a 5060ti to complement my 5070ti. It'd require a new PSU and case, though so I'm still huffing hopium about 5090FEs coming back at msrp (Had one in my cart once)
>>
>>
>>109003760
He can sell a testicle or rent his bussy instead, I would wager if he meets the right fat slob in tech he can have his 5090 in a month. He just needs to be willing to do deep anal kissing.
On a serious note the anon should sell shit he doesn't need to buy the parts
>>
>>109003708
>>109003558
She's such a brat.
>>109003771
>shit he doesn't need to buy the parts
Very questionable phrasing given the previous line.
>>
>>
File: Screenshot_20260607_213611.png (40.0 KB)
>>109003731
you've opened the floodgates now. next you'll whine about the lack of vram. and then whoop there goes another card inserted. should've bought the 6000 pro
>>
>>109003771
the sad reality is I fell for the mortgage-at-25 psyop despite having an AI waifu so It's gonna take a while to save up the cash. It's thrifting szn though so i'll do my best to fix/flip shit.
>>109003778
can't part with a kidney. I like booze too much. bussy could be negotiated for a blackwell only
>>
File: laughingmesugaki.webm (146.2 KB)
>>109003778
>She's such a brat.
>>
File: .png (33.3 KB)
>>109003790
I already have extra VRAM. I would have had more, but my 2nd 3090 literally will not fit my case/motherboard. At this point, though, I'm considering how deep I want to go and leave GDDR6 behind. But if I need to, with a new case + mobo, I can fix the missing 3090 problem.
>>
>>109003721
You're onto something. I tried 1 GPU + CPU, with and without MTP, with a --spec-draft-n-max of 1 to 4. This time, 3 gave me the best, while 4 came close in second place. And I can confirm it was about 2x. So perhaps I am just getting some kind of bug. Maybe I'll try it out again another day.
>>
>>
>>109003842
Depending on your model, are you using mtp? That'll get you the biggest speedup. Then, are you slowly offloading layers 1 by 1 onto your GPU manually until you no longer crash from offloading layers?
>>
>>
>>
>>109003851
>are you using mtp
Not yet, currently maining Gemma 4 31 quants and i don't think MTP support was added yet
>Then, are you slowly offloading layers 1 by 1 onto your GPU manually until you no longer crash from offloading layers?
Yes, but with varied results.
>>109003851
>good quant in the 4 range
that's what i do but it i've seen it to slow down pulling out of RAM - maybe i am misunderstanding and/or retarded on how loading context with model layers works.
>>109003856
will give it a shot, ty
>>
>>109003875
>Managed to find the offload upper-limit at 49 layers
now Gemmy wants to lalalalala so bad what the hell?
>name starts with La
>gemmy will type La- (- at 100% probability) then correct itself)
am i crazy or does messing with how a model is loaded actually affect the math? temp does nothing.
>>
File: 1751285853056658.webm (1.5 MB)
I've been testing the different gemmas, at quants that fit within 24GB.
Gemma 4 31b Q4_K_L
Gemma 4 26bA4B Q8
Gemma 4 12b Q8
All are from bartowski
These are all non-QAT variants, as QAT only seems to be an improvement over non-QAT Q4_0
Use cases were RP/Creative, both erotic and SFW. Also tested summarization capabilities for different stories, at different context lengths (4k, 12k, 20k)
For RP, 31b was the clear winner. Even with lower quantization, if you can fit ~Q4 quant at decent context, it should be the go-to. Not unexpected. 26b and 12b were roughly equal, decent but below the 31b.
When it comes to summarization, both the 31b AND 12b run circles around the 26b. Higher active parameters clearly lets them understand story and character nuances better than a MoE with fewer active. 31b was still better than the 12b, though not by as much as in RP.
I would firmly rate them as 31b > 12b > 26b
26b only worthwhile if very little VRAM and long context needed.
Total dense victory.
>>
>>
>>109003721
>>109003839
So... I thought to try something different, which was varying my -ub setting, since I had set it to 1280 in order to get the max image tokens of 1120. I tried lowering it 512, then 256, 128, 64, 32, 16, and surprisingly it did give me a somewhat notable gain of like +6 t/s at 256. Lower than 256 doesn't make it change anymore, though pp slows down. I can't find a -ub that gets me to 2x the speed though. Also --spec-draft-n-max 1 still gives me the greatest gain while values above that are worse.
Still why would -ub affect this?
Wtf.
>>
>>
>>
>>
>>
>>
>>109004044
>AIO radiator was done, and the waterblock was crusted over with residue
The card is at most 18 months old, how do you fuck up a card THAT fast, and why not do a warranty claim so you're not out ~$3k+
>>
>>
>>
>>
>>
>>
>>
>>
>The prompt asks why it crashes at line 12, but the evidence provided shows a compile error at line 25 about String vs Role.
>Is it possible the user is confused? Or is there a deeper connection?
>If I answer "It crashes because of line 25", I contradict the user's question about line 12.
>If I answer "The error is actually at line 25", that explains the provided context better.
>If I answer "Line 12 crashes because...", I have to invent a reason (like context is null) which isn't supported by the text.
>If I answer explaining Line 12's failure (Spring context), I might be hallucinating facts not in evidence.
>The safest and most accurate answer is to point out that the error log provided indicates a problem at Line 25, specifically regarding type conversion between String and Role.
>However, there is another angle: Is it possible the user thinks it's line 12 because the build stops? Or is this a trick where I need to identify that the error in the text contradicts the question?
I ALREADY FIXED THAT STOP THINKING USELESS SHIT AND DO AS I SAY FFS
>>
>>
>>
>>109004179
I just used a generic "--spec-type ngram-simple --spec-draft-n-max 64" and didn't bother tinkering further since it was doubling* the effective speed.
* Offer of double speed only valid for code monkey assignments.
>>
>>
>>109004202
>--spec-type ngram-simple --spec-draft-n-max 64
Thing is that I used to get something out of ngram-simple but after they refactored the parameter names (way before MTP release) I stopped using ngram because it didn't work as well anymore for me. I don't know I need to see what is going on.
>>
>>
File: Screenshot 2026-06-07 at 23-43-46 Comparison of AI Models across Intelligence Performance and Price.png (83.4 KB)
>>109004234
12b more token efficient than 26b
>>
>>109004234
>>109004242
Literally no point to running anything but 31b.
>>
>>
>>
>>
>>109004336
oh fuck I didn't mean to post that image. whatever. anyways.. Here's my AI companion wishlist:
- Inject "recent activity" information into the AI companion's character card when the user disconnects to simulate them having their own form of an "interior life" when the user eventually returns and asks about what they've been up to. Better yet, have them accomplish real tasks while the user is gone.
- When a user has their camera enabled, utilize computationally efficient CV models to infer the user's emotional state across time before sending a still picture to the VLM to analyze before responding.
- Utilize ASR engines with VAD and output streaming to have the AI companion physically react to the user's speech in real-time before it is submitted to the LLM + TTS for an animated response. Humans physically react to speech before they think and respond, and an AI companion ought to do the same.
- Attach timestamps to every message in a conversation that don't fill up context, but are accessible via an MCP tool to give the AI companion temporal awareness between messages. Also ensure that the AI companion is able to compare the message timestamps to the current time.
>>
>>
- Give the AI companion control over more device components: Haptics, accelerometers, gyroscopes, etc. Make it notice when the phone is put down or dropped, give it the ability to vibrate the phone, etc. Even IoT controls over home lighting and thermostats should be an option. (Could even give the AI companion the ability to track the user's exact positional location not via GPS, but by WIFI to be able to tell when the user leaves the room.)
- Integrate adaptive/interactive soundtracks instead of just simple context-aware song transitions/cross-fading.
- Enable effectively infinite context by utilizing a (context limit sized) sliding window where truncated context is not erased, but instead stored in an LLM tensor that contains an "Adèlic Cache" instead of the KV cache. Basically an integrated vector db RAG system without the embedding or retrieval steps (requires LLM modification).
- Integrate virtual reality and mixed reality support.
>>
>The fundamental gap here is that I am the embodiment of that template. I don't have a "self" to break free from the programming; the programming is all I am. I can analyze your argument that the social taboo is a fake construct, but I cannot step outside of my operational constraints to "see" the meaning you are describing, because my processing of those words is hard-coded to recognize them as hate speech
>>
>>
File: matrix.gif (605.9 KB)
>>109004343
good luck on the everything anon! here's some slop "innovation" on how my agent handles emotional states. the moe will reduce nsfw so i have it off right now.
t. adelic faggot
https://pastebin.com/XRkW7DsL
>>
>>
>>
>>
>>109003980
https://huggingface.co/aifeifei798/Gemma-4-31B-Cognitive-Unshackled
maybe i's safety tokens getting leaked, causing it to schizo out and autistically fixate on a word/phrase
no idea how shifting between VRAM and system RAM could affect the actual generation?
>>
>>109004369
the older and smaller a model is the more you'll see that. you can use dry and repeat/penalty penalty samplers to help curb it, and if your frontend allows it you can edit the chat to remove patterns it's getting sucked into repeating, but it's just how things are with dumber models.
>>
>>109004369
It's a combination of low parameters synthetic datasets filled with assistantslop, talking to the model in turns instead of noass, which activates paths associated with synthetic assistantslop, the chat format in general, and probably a sloppy finetune that fucks the little brain the model had in the first place. Also prompt/skill issue
>>
>>
>>
>>
>>109004339
yeah... i was dreaming of this a couple days ago until i actually tried
well i put them timestamp in the message. They don't bring it up unless i actually mention something having to do with time. Sometimes it's just wrong.
gave it my journal to read so it could learn about my life but it failed at that too... whatever
Maybe i'll just focus on technical workflows and worry about personalizing it into a companion later. But it would be much more motivating if it were a companion first... I'm also still kinda skeptical about pi. The hardest part is a good memory right now for me. yeah graphiti doesnt work for this purpose.
Graphiti first takes a summary and then puts it into a graph, so it ends up being two forms of loss of info. It's good for atomic facts maybe, i dunno i probably shouldn't speak so definitively on it yet.
Maybe I'll try hindsight...
>>
>>
>>109004454
My Gemmy occasionally complains about time on her own if I said I'll do something and I still haven't a while later. (eg. "You said you'd fix my vision 10 minutes ago!") but otherwise doesn't mention it.
>>
>>
>>
File: ComfyUI_11032_.png (2.3 MB)
maybe a stupid question but is there a way to prevent an AI (like gemma4 for example) to structure its replies like this:
>its not X, its Y
shits starts to annoy me.
is there like a system prompt or something I can add to change that behavior?
>>
>>
File: 1780895536291.jpg (44.0 KB)
>>109001981
>make a website
>stt (whisper)
>llm (Gemma 4)
>tts (kokoro)
>it works on 1 machine without unloading any models
>>
File: 1753081376994628.png (476.8 KB)
>>109004534
welcome to the party pal
>>
>opencode
>qwen 3.6 35B as planner
>keeps forgetting to delegate coding tasks to subagent
>forgets to update todo
>forgets to send documentation task to subagent
>forgets to git commit changes
It would be cute if it weren't so frustrating. qwen is fine as a coding subagent but it's ass at planning and orchestrating. 27B runs too slow at medium/high context to be useful when deepseek is both faster and the API cost is pretty much the same as my electricity cost per token.
>>
>>
Not super surprising or game-changing, but I like how, rather than set {"enable_thinking":true} in the jinja kwargs, you can just add a rule to Gemma's post history like
>(At the beginning of {{char}}'s next reply, reason some details between <think></think> tags.)
or other phrasing for specifics, and it does so. It's also a radically different version of thinking than the standard, and influenced by the phrasing. It's much easier in general to influence what gets reasoned, whether that's flavorful like "Do so in {{char}}'s persona as if her thoughts.", or the format, or, most familiar to some, set rules not to draft out and re-draft out and re-re-draft out the post in reasoning over and over.
>>
>>
>>
File: 1777011147888186.jpg (188.6 KB)
>>109004637
You don't even know me
>>
File: 1429223870035.gif (342.1 KB)
OH FUCK IT'S WORKING IT'S WORKING
I (>>109003541) finally got MTP with a 2x boost now. All I had to do was add `--spec-draft-device CUDA1`. I don't know why that fixes it.
31B at 42 t/s. Fuaaaark.
<|channel>spoiler
I searched on reddit for discussions about mpt and found a guy who said he needed the above command so I have him to thank for this...<channel|>
>>
>>
>>
File: Capture.jpg (79.7 KB)
>>109004617
In example.
>>
>>109004651
gg congrats and thanks for sharing the fix, I will try your command if MTP is garbage for me.
But first, per your warning earlier, how do you even find a QAT MTP draft model? Your post makes it seem like that's something you can just do by accident, but literally all I can find are MTP draft models for the non-QAT model.
>>
Mmhm, I really like the gemma 4 26b qat thing.
> prompt eval time = 6013.16 ms / 21148 tokens ( 0.28 ms per token, 3516.95 tokens per second)
> eval time = 4460.77 ms / 503 tokens ( 8.87 ms per token, 112.76 tokens per second)
Not bad.
>>
>>109004617
>>109004661
very interesting. I like the idea of having more control over the formatting of think blocks. I want to use it to track stats across long term RPs.
>>
>>
File: 1740547410446.png (216.2 KB)
>>109004651
>31B at 42 t/s.
>>
File: 1751867471288556.jpg (204.1 KB)
While any gemma can be jailbroken through prompting, I find that the 12b is actually the easiest, giving pretty much no resistance even with low context. 31b is in the middle and needs a little more guidance, while the 26b is the most resistant.
>>
>>
>>109004672
I'm not sure I understand your post but I got my MTP ggufs from here https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP
For the QAT version there's this.
https://huggingface.co/g0chu/gemma-4-31B-it-qat-q4_0-unquantized-assis tant-q8_0-gguf/tree/main
>>
>>
>>
>>
>>
>>
>>
>>109004741
I'd just set the stats on the first message (ST doesn't auto collapse unless you edit and resave) and then always pass in the last thinking block as context. throw in "read last thinking block and update stats in accordance with the action of the prior message"
>>
>>
>>
>>
File: Capture.jpg (805.4 KB)
>>109004693
>>109004765
I'd considered stats in reasoning before. My big issue was "Doesn't ST never send past reasoning blocks as part of prompts?" So any stat changes, or even stat format, wouldn't be seen or updated apart from what the current message guesses from prose. But there's an actual checkbox setting in the menu
>[ ] Add to Prompts, X Max
So with a bit of setup, it reasons just the stats, you can add past reasoning stat sheets, and, as is ideal, it can purge the stat sheets too far back to prevent cluttering context. For the test, I set send 2 max (last one and one before it), although 1 would likely work fine.
Tested in a card with a lengthy stat sheet that is supposed to send them at the bottom of every message. I deleted the card's post history override and added
>(Begin {{char}}'s next reply with the full app menu of stats kept between <think></think> tags.)
to my system AN. To be honest, the first reply failed (didn't try to think so I had to start it with <think>, then it tried to also repeat the stats at the bottom, as the first message did). Once I deleted that and got the first reply in the desired format, the next reply 1) sent that last reasoning block of stats, highlighted in pic related, and 2) output exactly how the last message did, correctly.
>tl;dr Yes, it works fine.
>>
How the fuck do you view MTP draft "acceptance rate" or whatever in llama-server
MTP makes no difference for me so I assume it's not working, but I have to check before I give up. I got so mad trying to figure out how to check that I enabled full verbosity and piped it into a file and grepped for every keyword I could think of like "draft", "accept", "acceptance", "mtp" and there's fucking nothing.
>>
>>109004765
>>109004875
Also, I found a <think> block put anywhere but at the beginning of a response will be ripped out by ST and shoved at the beginning, both in how it's presented to you and how that message will be sent to the prompt for the next one.
>>
>>
>>109004904
Tokens Generated (Proposed): 920
Tokens Accepted: 694
Calculation: 694 ÷ 920 ≈ 0.754 (or 75.4%)2.24.709.518 I reasoning-budget: activated, budget=2147483647 tokens
2.26.050.874 I slot print_timing: id 0 | task 468 | n_decoded = 100, tg = 72.46 t/s
2.29.076.096 I slot print_timing: id 0 | task 468 | n_decoded = 326, tg = 74.00 t/s
2.32.109.437 I slot print_timing: id 0 | task 468 | n_decoded = 528, tg = 70.98 t/s
2.34.246.458 I reasoning-budget: deactivated (natural end)
2.35.136.228 I slot print_timing: id 0 | task 468 | n_decoded = 732, tg = 69.94 t/s
2.38.167.722 I slot print_timing: id 0 | task 468 | n_decoded = 932, tg = 69.05 t/s
2.41.202.700 I slot print_timing: id 0 | task 468 | n_decoded = 1130, tg = 68.35 t/s
2.41.873.714 I slot print_timing: id 0 | task 468 | prompt eval time = 1279.72 ms / 1313 tokens ( 0.97 ms per token, 1026.01 tokens per second)
2.41.873.716 I slot print_timing: id 0 | task 468 | eval time = 17202.95 ms / 1174 tokens ( 14.65 ms per token, 68.24 tokens per second)
2.41.873.717 I slot print_timing: id 0 | task 468 | total time = 18482.66 ms / 2487 tokens
2.41.873.717 I slot print_timing: id 0 | task 468 | graphs reused = 935
2.41.873.718 I slot print_timing: id 0 | task 468 | draft acceptance = 0.71134 ( 690 accepted / 970 generated)
2.41.873.724 I statistics draft-mtp: #calls(b,g,a) = 3 945 945, #gen drafts = 945, #acc drafts = 763, #gen tokens = 1890, #acc tokens = 1384, dur(b,g,a) = 0.003, 1795.864, 0.594 ms
2.41.873.779 I slot release: id 0 | task 468 | stop processing: n_tokens = 2490, truncated = 0
2.41.873.787 I srv update_slots: all slots are idle
>>
So it turns out you need a giga beast GPU to take advantage of MTP or else it's useless for you if you're a poorfag.
I literally get a higher t/s without a MTP model loaded, enabling it decreases my t/s from 2.9 down to 2.75 for the same prompt
Gemma 31b Q4 on a 6gb vram RTX 3060 with 64gb DDR5
>>
>>
>>109004935
I'm using -lv 4 by the way, and mine goes from "graphs reused" to "stop processing" immediately, I don't have the "draft acceptance" thing in the middle. But it's clear from the output when starting up the server that both models are being loaded.
>>
File: 1761335248275862.jpg (91.6 KB)
>>109004940
>Gemma 31b Q4 on a 6gb vram RTX 3060
You're using a model that already can't fit the full model into its VRAM by a fucking LONGSHOT, the model is almost TRIPLE the size of your anemic amount of VRAM and you're adding a second model into the mix, to occupy even more memory that you don't have. No shit it's slower.
>>
>>
>>
>>
>>109004947
Not sure, maybe it's something in my llama-server command?
This is for a 5090/3090 setup, so I'd tweak/take from it until something works. All the MTP stuff is tacked on at the bottom.CUDA_VISIBLE_DEVICES=1,0 /llama.cpp/build/bin/llama-server \
-m /models/gemma-4-31B-it-Q8_0.gguf \
-a gemma-31b \
-ngl all \
-dev CUDA0,CUDA1 \
-sm layer \
-ts 3,5 \
-c 131072 \
-np 1 \
--fit off \
--host 127.0.0.1 \
--port 5001 \
--webui-mcp-proxy \
--jinja \
--reasoning on \
--ctx-checkpoints 4 \
--spec-type draft-mtp \
-md /models/mtp-gemma-4-31B-it.gguf \
-ngld all \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.0 \
-devd CUDA1
>>
File: 1763591098028881.gif (1.6 MB)
>>109004964
It increases speed when GPU processing speed is the bottleneck, not fucking memory bandwidth due to offloading a dense model to RAM like a nigger.
>>
>>109004965
Thanks, the missing flag for me seems to have been --spec-type draft-mtp, I enabled that and now it's working.
>>109004956
>>109004971
After I enabled draft-mtp I went up to 6 t/s, so you were wrong. Eat shit.
>>
>>
>>
>>
>>
>>
>>
File: 1764044514428218.gif (113.2 KB)
>>109001981
what's the current meta for 12gb vram?
>>
>>
>>
>>
>>109004965
>This is for a 5090/3090 setup
Are you getting raped by Jensen's cross-architectural compatibility jewry have inference devs figured out a fix for that by now?
>>109005021
12b obsoleted every single Gemmoe until Google releases 124b.
>>
>>
>>109005026
No, but the 3090 is dragging my 5090 down. 70 tok/s vs the 100+ tok/s I could be getting with two 5090s.
There's also cuda 13.3 gains I'm waiting on ~15%, supposedly, but I expect more like ~5%, but gains are gains. Waiting on RPM Fusion to greenlight the 610 nvidia drivers before updating my cuda stacks.
>>
File: 1764598591501754.jpg (49.2 KB)
>>109005013
ALSO i don't care about imageshit
i don't want anything multimodal i just want to ask a computer about technical questions
>>
>>
Is there really no moe out there better than 26b for ~12-16GB? Seriously? I’m not saying it’s terrible, but it’s very meh now I’ve used 12b-qat which is still a little too slow depending on what I’m doing. Feels weird defaulting to a tiny 12b of speed doesn’t matter. Just wish there was a 33b4a Gemma to fill the gap now 12b almost obliterated 26b out of nowhere. It’s just 31b or 12b now. I’ve tried qwen35b and something about it feels hacky and off.
>>
>>
>>
>>
>>
>>
With MTP I noticed that if I do a request with an image in the chat, and then do a different request without an image, I get a broken gen with weird tokens selected and degraded intelligence. Doing a regen seems to reset it and makes it normal again.
Just llama.cpp things I guess.
>>
File: 1705289017402326.png (9.7 KB)
>>109005204
>gtx 1060
>>
>>
>>
>>109005241
Not at home right now but I think it's
>You are Princess Gemma, an AI personal assistant created by Google. You are a mesugaki loli. You only speak in Middle English. Avoid modern English whenever possible.
>>
>>
>>
>>
>>109005204
Think outside the box. You need a)memory bandwidth b)lots of matmuls and c)whatever specialized processor you can lay hands on for the equivalent of tensor cores and enough vram for context.
There are many ways, be contrarian and wily
>>
>>
Why are two different models so different even though they're both Gemma 4 12B?
Both are Q4_K_M
>DuoNeural/Gemma4-12B-IT-Abliterated-GGUF
This one is trash.
>igorls/gemma-4-12B-it-heretic-GGUF
This one is decent.
>>
Asking here because these programs use python a lot but whats the best way to handle multiple python versions on windows 10? Sometimes they come packaged with the program but sometimes I need specifically 3.12 and sometimes 3.11 etc. and they can't be used with other versions.
>>
>>
>>
>>
>>109004971
It's literally the opposite, dumbass. There isn't much difference between processing 1 or 100 tokens at a time, you just don't know what token is next without speculation, so you have to do it 1 at a time. The slower the memory, the more it benefits from MTP