Thread #108572295
File: 2026-04-08_174706_seed9_00001_.png (743.1 KB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108568415 & >>108565269
►News
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
528 Replies
>>
File: 2026-04-08_172543_seed7_00001_.png (922.9 KB)
►Recent Highlights from the Previous Thread: >>108568415
--Testing Gemma-4's accuracy with normalized image coordinates and spatial reasoning:
>108568460 >108568467 >108568513 >108568540 >108568595 >108568650 >108568655 >108568500 >108568558 >108568563 >108568579 >108568873 >108568884 >108568968 >108568814
--Gemini and Gemma 4 translation patterns and quality:
>108570675 >108570683 >108570686 >108570702 >108570693 >108570708 >108570769 >108570786 >108570820 >108570843 >108570852 >108570859 >108570862 >108570874 >108570881 >108570896 >108570906 >108570928 >108570950 >108570959 >108570970 >108571110 >108570930
--Discussion of Goose agent and llama.cpp multi-GPU KV quantization:
>108568617 >108568649 >108568677
--Gemma 4 performance tests and token speed on M4 Max:
>108568671 >108568676 >108568705 >108568731 >108568736
--Fixing LlamaCpp WebUI's failure to implement MCP session IDs:
>108569753 >108569794 >108570077 >108570090 >108570330 >108570907
--Comparing Nemotron-3-Super-120B and Qwen3.5-27B benchmark performance:
>108569234
--Gemma's high EQbench scores and roleplaying with Gemma 4:
>108571778 >108571829 >108571923 >108571948
--Anon suggests open models can find vulnerabilities similarly to Mythos:
>108569984 >108569999 >108570052 >108570072 >108570119 >108570062
--Logs:
>108568500 >108568579 >108568595 >108568671 >108568814 >108568888 >108568939 >108569068 >108569202 >108569300 >108569753 >108570330 >108570437 >108570612 >108570660 >108570769 >108570907 >108571012 >108571076 >108571106 >108571200 >108571246 >108571310 >108571833 >108572023 >108572187
--Gemma-chan:
>108568674 >108569255 >108569396 >108569529 >108569664 >108570121 >108570153 >108570206 >108570430 >108570773 >108570822 >108570865 >108570898 >108571012 >108571020 >108571029 >108571221 >108571496 >108571895 >108572034
--Miku (free space):
>108571246
►Recent Highlight Posts from the Previous Thread: >>108568418 >>108568424
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
File: pircel.png (34.5 KB)
google updated their jinja
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja
you can use it with the --chat-template-file flag, it supposedly fixes this kind of bug >>108554439
>>
>>
>>
>>
>>
>>
>>108572325
It should make model loading faster if supported. Linux only, and not compatible with --mmap. There may be other constraints.
https://github.com/ggml-org/llama.cpp/pull/18012
https://github.com/ggml-org/llama.cpp/pull/18166
https://github.com/ggml-org/llama.cpp/pull/19109
>>
File: 1765519302859042.png (221.8 KB)
>>108572347
why can't they simply put all the official jinjas on the llama.cpp repo so that it uses those instead of having to make a new gguf every time they notice the jinja is actually wrong, their way of doing things seems kinda retarded ngl
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
How do I fix Gemma4 26b being atrociously slow with prompt processing??? I thought this issue got fixed already! My llcpp is up to date. WTF.
llama-server \
-m "$HOME/Desktop/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-mm "$HOME/Desktop/mmproj-google_gemma-4-26B-A4B-it-f16.gguf" \
--host 0.0.0.0 \
--port 8080 \
-c 65536 \
-ctk q8_0 \
-ctv q8_0 \
-t 8 \
-np 1 \
-kvu \
-rea off
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108572423
I'll try this and report back ig. No other model has been this slow for me with prompt processing though. It's gemma specific. It's taking like 20 seconds every time and recreates every checkpoint from scratch with every prompt.
>>108572426
Yes. But I still get 18tps. That's not the issue.
>>
File: 1771675896476832.jpg (12.7 KB)
As a VRAMlet, it's unfeasible for me to run Gemmy alongside any kind of imagegen for obvious reasons, so my best option would probably be: load Gemmy, use it for a while, prepare prompts for images, unload Gemmy, load imagegen, gen and go back to Gemmy
I assume it'll take an unviable amount of time to load-unload-load models, but before I go down this rabbithole, is my overall understanding correct?
>>
>>
>>
>>
>>108572449
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
>>
>>
File: 1750708801703723.png (240.8 KB)
>>108572460
So q8 predicts a different token 10% of the time? Wow.
>>
>>
>>
File: блять.jpg (313.7 KB)
>>108572459
>>108572460
it seems like the asymptotic trend is not even tending to 0. Since the baseline bf16 in this guy's tests was also gguf, does it completely rule out implementation issues?
>>
>>108572447
I am on a 3060 with dual channel ddr5 and it takes less than a minute to load Gemmy.
"Image generation" is vague but if you are referring to some booru SDXL, those don't take too long to load either. Those take like 4 gigs of VRAM, maybe 5 with clip and vae pinned, so you might actually do this without loading and unloading if you are not a hyper vramlet.
>>
>>
File: UnslothDynamic.png (97.2 KB)
>>108572317
>google updated their jinja
Nice! Waiting for the new, fixed GGUFS!
>>
>>
>>
>>108572299
>no toast hair ornament
>>
>>108572362
>why can't they simply put all the official jinja on the llama cpp repo so that it uses that instead of having to make new gguf everytime they notice the jinja is actually wrong
users can just load a jinja file with an arg anyway, you don't need a new gguf
>>
>>
File: gup.png (188.4 KB)
common : better align to the updated official gemma4 template
https://github.com/ggml-org/llama.cpp/pull/21704
>>
File: gemmaFourConcepts (Medium).png (872.7 KB)
>>108572295
Last time.
Vote: https://poal.me/3u6rby
> Which is your preferred Gemma character?
Also
> But muh favorite one wasn't included? Why didn't you include every perturbation of each gen for the past week and allow me to vote? Also I hate all of them and you should have a none-of-the above as an option!
These are the 4 major design concepts from the past few days. You may be familiar with the idea of grouping several things together to create a "concept" versus an autistic list of every minor variation, but I've no way, from here, to judge your level of autism.
If you don't like any of them then your opinion doesn't matter.
If you don't like the poll, you are free to make your own. You are also free to just fuck off.
Thank you for your attention.
>>
File: temp1.png (276.2 KB)
>>108572630
ATX backpack, narrowly, followed by black hair / blue star accents. I suspect these concepts will just merge.
>>
>>108572645
>>108572630
Fuck you and go back to wherever you came from, avatar spammer.
>>
>>108572534
>>108572537
>hyper vramlet
I mean, I'm running 26B on 12 gigs. I understand it's MoE so the whole thing is not shoved in there, but I don't actually know how much of my vram gets filled up at any point, I assume all of it. I use the vague term "imagegen" because I haven't gone down that rabbithole yet, but I do mean an SDXL, yes. The fact that this could be possible unironically fills me with hope, I figured it'd be a tall task to load and unload stuff
>>
>>
>>
>>
>>
>>108572510
>it seems like the asymptotic trend is not even tending to 0
I've been thinking about this too. What sort of quantization algorithm is even used for Q8_0 anyway? Perhaps that's where people should be looking.
>>
>>
>>
>>
File: temp2.png (270 KB)
>>108572704
>>108572708
>>108572715
lol no.
No one cares about this niche topic outside /lmg/
aicg doesn't run local models and considers it a waste of time. Plus the aicg user base is even more toxic than this general.
ldg doesn't care about LLMs.
The gemma moe is completely in the wheelhouse of this general. And anons appear to have come to a general consensus, whether you like it or not.
>>
>>
>>
File: 1718206878023960.jpg (6.4 KB)
Can someone make a llama.cpp issue or pr for me to add "prompt reply editing" and "first message" functionality to the webui?
>>
File: dipsySouthPark.png (1.9 MB)
>>108572693
That would require effort. Something complainers and spiteposters seem to be unable to amass.
>>
File: 1773156701474962.png (158.9 KB)
here's the final result
>>
>>
>>
>>
>>
>>
>>
File: 1746842705868986.png (96.9 KB)
>>108572712
>more than enough gigs left for imagegen
It's over for me then, so fucking over
The slopping truly never ends
>>
>>108572409
Bart IQ4XS is 2-3x faster than Q4KM in prompt processing on my machine. Generation is about the same.
I don't understand this difference. Q4 is still Q4 and I haven't seen this happen with models other than G4.
>>
>>
>>108572768
That's not even bloat. Turns out reply editing is already added. First message functionality is actually useful for a general usecase because it might help with jailbreaks to gaslight the LLM into thinking it wrote... whatever.
Also character cards are unnecessary to add. Those just go into the system prompt.
>>
>>
>>
>>108572745
> picture posters are trying to turn this place into /ldg/
I agree with you on that, lmg is not an image general. But reminder /lmg/ was a complete snore until Gemma dropped and the moe discussion (which requires imagery) is unique to this general. The only anons that care are here. Ofc not all anons care.
It will go away in tmw and it'll be back to waiting for v4 and complaining about vibecoding within local inference engines, discussing their 1-off front ends, or whatever else anons want to post / bitch about.
>>
>>
>>
>>
>>
>>108572785
If you are the poll anon and you want to spam polls, you can do that, just add against everything option and honor it if that's what people are choosing. And people are choosing pictures, not your interpretation of concepts.
>>
>>
>>
>>
>>
>>
>>108572796
The difference in perceived quality isn't noticeable for a normal user. Of course it feels better in your head when using a slightly higher accuracy version. We are talking about a fraction of a difference.
>>
>>108572317
>NOTE: The new template will work without this PR. I checked and even after building the model turn to use tool_responses, the template formats it properly. This PR better aligns to the template since it now handles OpenAI chat completions style messages natively.
>>
>>
>>
>>
>>
File: 1750266478412216.png (33.2 KB)
>>108572317
why are there 2 jinjas though? which one should I load?
>>
>>108572824
What are you talking about? text completion is still there as an api. Was it actually in web UI at any point? Llama.cpp actually lets you use prefill with chat completion, does any other backend do that, hm, anon?
>>
>>
>>
>>108572819
There's no Q8_K quantization type, though...
llama-quantize output:40 or Q1_0 : 1.125 bpw quantization
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
38 or MXFP4_MOE : MXFP4 MoE
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
>>
>>
>>108572809
>Q8_K
The difference between something like Q4_0 and Q4_K(_M) is that the _K variants keep important parts of the weights in q6/q8 instead of cutting absolutely everything down to 4bit like Q4_0. That's obviously not possible with Q8_0 because everything is already quanted to 8 bit.
Unsloth does a UD_Q8_XL that's q8 with some parts left in 16bit precision but those don't usually measure much better than plain q8_0
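The _0 vs _K distinction can be made concrete with a toy sketch of Q8_0-style block quantization: one shared scale per block of 32 weights, each weight stored as a signed 8-bit code. This is an illustration of the idea only, not llama.cpp's actual kernel (which stores the scale as fp16 and has its own rounding details):

```python
# Toy sketch of Q8_0-style block quantization. Illustration only --
# not llama.cpp's real implementation.

def quantize_q8_0(weights, block_size=32):
    """Quantize a flat list of floats into (scale, int8 codes) blocks."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        qs = [round(w / scale) for w in block]  # codes land in [-127, 127]
        blocks.append((scale, qs))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats; round-trip error is at most scale/2."""
    return [q * scale for scale, qs in blocks for q in qs]
```

Since everything already shares one scale per block, there is no "important tensors kept at higher precision" knob left to turn, which is the point being made above.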
>>
>>
>>
>>
>>
>>108572872
A hypothetical Q8_K type could do the same, but with BF16 instead.
As long as people keep doing PPL measurements with wikitext at 512 tokens context, nobody will ever see if/when a higher precision is helpful.
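For anons who want to go beyond PPL: the KLD-style comparison in the linked benchmark boils down to comparing per-token output distributions between the baseline and the quant. A self-contained sketch with toy logits (not real model outputs); mean KLD and top-token agreement are the two numbers being discussed in this thread:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(baseline_logits, quant_logits):
    """Return (mean KLD, fraction of positions where the top token agrees)."""
    kls, agree = [], 0
    for bl, ql in zip(baseline_logits, quant_logits):
        p, q = softmax(bl), softmax(ql)
        kls.append(kl_div(p, q))
        agree += (bl.index(max(bl)) == ql.index(max(ql)))
    return sum(kls) / len(kls), agree / len(baseline_logits)
```

Run over long-context logits instead of 512-token wikitext windows, this is exactly the kind of measurement that would show if/when higher precision helps.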
>>
>>108572866
Those are sort of like presets for making quants with the built-in tools. The way the library is written, you have a lot of liberty in choosing what size to use for each layer, which is how unsloth are doing their extended 8+ bit quants.
>>
>>
File: 1765824402433942.png (248.3 KB)
>>108572872
>Unsloth does a UD_Q8_XL that's q8 with some parts left in 16bit precision but those don't usually measure much better than plain q8_0
In fact, it sometimes measures worse
Unsloth magic
>>
>>108572796
To add: i think the speed difference could be just a coincidence, IQ4XS randomly scaled certain innards in a way that gives it a speed boost. I'm not familiar with moe models and i know this discussion is a bit too anal.
Would be interesting to try manually picking which layers to offload instead of just using n-cpu-moe, which offloads the first x amount.
Been too busy and there's good information about this in one thread on github, more or less.
>>
>>108572771
And you still have some space to put some layers in the gpu to make it faster. You'll be ok.
>>108572806
It was a point of reference. But even if that's all he had available, the options are running slow, having to unload and load models, or not running at all. Slow beats the other options.
>>
>>108572888
The new slopped webui. The old one was minimalist but ironically supported more features. You can go through the github issues to find the regression or just build an old version of llama.cpp and see it.
>>
>>
File: ai automation.png (147.8 KB)
I am scared. It is possible human researchers will become obsolete within a few years, and everyone else soon after. Our society is not prepared to handle this.
>>
>>
>>
File: e29c9ef8-0cc4-4e1b-927d-5a3bd408561e_2820x1601.png (303.2 KB)
>>108572914
Not even in the long-document graph is the UD_Q8_XL version better than plain Q8_0. But this makes the asymptotic behavior even more puzzling (considering that BF16 would have a mean KLD of 0 by definition).
>>
>>
>>108572932
I used the old one. Not extensively, but still. I don't remember text completion in it. Just had the chat UI, less fancy than current one, but still chat completions UI.
Also I do like the new UI. Between losing that or having to use mikupad for text completion, I will always choose the latter.
>>
File: 1750238497162131.jpg (29.1 KB)
>>108572926
Yep, it's an actually feasible plan
I haven't been this happy in a while
Fucking Gemmy, man
>>
>>
>>108572490
Last thread people were able to have gemma identify pixel locations and bounding boxes, so you could probably send it screenshots and perform clicks on the returned locations. Don't expect it to be as good as GPT 5.4.
>>
>>
>>
>>
File: llama.png (75.8 KB)
>>108572888
>>108572963
I dug through the issues and found someone commenting on the regression. It's really sad how much this has been memoryholed. OpenAI has brainwashed everyone into thinking the only way to interface with LLMs is through the safetymaxxed chat completion mode
>>
>>
>>108572978
See
>>108572914
>q4_k_l diverges 0.48 from the full precision
>>108572958
>q8_0 diverges 0.45 from the full precision for long documents
>>
File: file.png (35.5 KB)
>>108572917
For MOEs, you should be quanting based on recipes like what ddh0 or AesSedai or sometimes what Ubergarm does on HuggingFace. So you end up with a command like this for mainline, and this is what I did for my Gemma recipe:
./llama-quantize --imatrix ~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.imatrix --output-tensor-type Q8_0 --token-embedding-type Q5_K --tensor-type "blk\..*\.ffn_gate_up_exps=IQ3_S" --tensor-type "blk\..*\.ffn_down_exps=IQ4_NL" ~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.gguf Q8_0
There's more insane recipe making in ik_llama.cpp but I consider that too time consuming: squeezing blood from a rock for almost imperceptible quant perplexity differences, and spending way more time on even more command line parameters to get little more than noise (0.1) at lower than 3 bits per weight.
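Those --tensor-type arguments are just regex-to-quant-type overrides applied per tensor name. A sketch of how the matching could resolve; the tensor names are representative Gemma-MoE-style examples, and whether llama-quantize anchors the regex exactly like this is an assumption:

```python
import re

# Regex -> quant-type overrides, mirroring the --tensor-type flags above.
# Tensor names below are illustrative, not an exhaustive list.
OVERRIDES = [
    (r"blk\..*\.ffn_gate_up_exps", "IQ3_S"),
    (r"blk\..*\.ffn_down_exps", "IQ4_NL"),
]

def quant_type_for(tensor_name, default="Q8_0"):
    """First matching override wins; everything else falls back to the default."""
    for pattern, qtype in OVERRIDES:
        if re.fullmatch(pattern, tensor_name):
            return qtype
    return default
```

So the expert FFN tensors (the bulk of a MoE's weights) get squeezed hardest while attention and everything else stays at the --output-tensor-type default.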
>>
>>
>>
>>108572979
You can't say that png is a bad format if you fuck around with the file and the image, mysteriously, looks different.
>>108572995
There's only two points in the graph. They're red.
>>
>>
>>108573005
I forgot, if you plan to go with this, you should pass a command line argument to the GGUF conversion script so you merge the FFN gate and up tensors, which is a relatively new development.
python convert_hf_to_gguf.py --fuse-gate-up-exps ~/LLM/gemma-4-26B-A4B-it-heretic-ara
>>
File: firefox_c7CdTrKkCV.png (40.1 KB)
>>108572988
You have this stuff, and more, in settings. Yes, there's no text completion, and it would be useful to have it, along with custom jinja input and maybe some other features, but, again, I'll take the new UI as it is over the old one any time of day and will just use mikupad for text completion.
>>
>>
>>
>>
>>108573053
>>108572944
>>108573035
There are advantages to keeping it in (same team you already trust is responsible for the quality). But I wouldn't mind that happening, as long as there's one button install option from the simple web ui.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108572917
I just tried out that quant and it's utterly retarded bro. How are you even using this.
>doesn't know how many socks humans wear.
>doesn't keep proper state of how many clothing items a character wears (separate issue from above)
>doesn't follow instructions for tool calling properly.
It's ass.
>>
>>
>>
>>
>>
>>
>>108573115
If it's for erp then you have loads of options that you can run on less than even 8GB VRAM
>>108573127
No
>>
>>
>>
>>
>>
>>
>>
>>
>>
Local models are only good for one thing: embarrassing ERP you don't want them to see.
this weird culture of hosting puny models to 'code' with or to 'solve riddles' instead of using huge cloud llms is so retarded
same guys who do this are the ones who use WINE to play Windows games on linux. Weirdos who refuse to use tools correctly
>>
File: 1631345787085.jpg (16.7 KB)
>>108573181
>we'll die from climate change first
you really believe this?? you know they've been going on about climate change for like 60 years at this point and every time things turn out fine at the end of the decade they move their goalposts about how the world is going to end to get even more funding. when i was a kid we had climate change speakers come into school and tell us how we'd run out of oil and the country would look like a desert in 20 years well it didn't happen it's all just larp for money
>>
>>
>>
>>108573207
https://en.wikipedia.org/wiki/Holocene_extinction
>>
>>
i genned 250 gemmas i didnt ask what she thinks of this design yet
tummy: https://files.catbox.moe/syu9mw.png
>>
>>108572423
Your post doesn't make much sense.
>--batch-size default is 2048
>--ubatch-size default is 512
Server will accept 2048 tokens in batch but will break it down to 512 token chunks.
Your settings 1024/1024 just limits the batch size but increases the chunk size
Average is the same if you know how to count with your fingers. I don't understand the logic behind your advice?
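The batching arithmetic being argued about can be sketched directly, under the simplified assumption that prompt processing accepts batches of at most -b tokens and walks each batch in -ub-sized forward passes:

```python
import math

def pp_passes(prompt_tokens, batch_size, ubatch_size):
    """Count forward passes needed to process a prompt: batches of at most
    `batch_size` tokens, each split into `ubatch_size`-token chunks."""
    passes = 0
    remaining = prompt_tokens
    while remaining > 0:
        batch = min(remaining, batch_size)
        passes += math.ceil(batch / ubatch_size)
        remaining -= batch
    return passes
```

For an 8192-token prompt, the default 2048/512 does 16 passes of 512 tokens while 1024/1024 does 8 passes of 1024; the total token count is identical either way, which is the "average is the same" point. Any real speed difference would have to come from per-pass overhead or memory behavior, not the arithmetic.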
>>
>>
>>
>>
>>
>>
File: 1774857560938603.png (213.6 KB)
>>108573246
>headings
>'climate change'
>"One of the main THEORIES..."
>>
>>
>>
>>
>>
>>
>>
>>108573277
>>108573227
Don't you have that other avatarfaggot thread already? You have been spamming that one already quite a bit, pedophile.
>>
>>108573207
People in developed countries like Spain are already dying to extreme heatwaves
https://www.theguardian.com/environment/2026/apr/08/extreme-weather-heatwaves-breaching-human-survival-limits-study-finds
The amount of CO2 we put into the air shows no signs of slowing down (lol that you can even see the most recent war on the graph)
https://twitter.com/PCarterClimate/status/2041246700522918038
Sealevel rise is worse than we thought it is and not slowing down
https://www.pbs.org/newshour/science/study-finds-sea-levels-are-higher-than-we-thought-placing-millions-more-at-risk
And this year is looking like it's going to get especially spicy
https://twitter.com/EliotJacobson/status/2036461046693797952
https://i.imgur.com/r1CuTT3.png
So yes, we're at the point where we are actually feeling this, it's not just something future generations are going to have to deal with anymore
>>
>>
>>108573256
>Guess who funded the studies that lead to this theory
"this theory" referring to the link I provided? I didn't bring up climate change and don't have anything to say about it in /lmg/. the point is that shit's fucked regardless
>>108573261
yes, pure coincidence
>>
>>
>>
File: 1774670789121739.jpg (74.3 KB)
>>108573285
>>
>>
>>
>>
>>
>>108573181
I'm a massive climate fag and even I'll call this bullshit. Millions or even billions will die, but it will be long drawn out deaths through lack of resources and massive conflict. First world countries will largely be "fine", in that we'll mostly survive, though quality of life will become much worse. Rich people will just live in climate controlled houses in the northern quarter of the world and notice almost nothing (except all the people trying to kill them :).
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108573420
yea but an MCP server is more modular so you can use it with any frontend. And you get full control over the tools. You can be in character looking at a porno mag and have the MCP server show it to the character by selecting a random image from your pc.
>>
>>
>>108573371
i think you can if you start the server with --host 0.0.0.0, start a hotspot, connect to that hotspot from the other device and access http://{your pc's ip}:port from that device
>>
What do you guys reckon is easier for a smaller model?
Giving it tools to alter arbitrary state (think HP and the like), or using structured output to force it to output an array of changes to state?
Both cases would be structured as a sort of ReAct loop.
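For the structured-output option, the model would be constrained (e.g. via a JSON-schema grammar) to emit an array of state deltas that the frontend then applies in code. The delta format and field names here are made up for illustration; only the applying side is sketched:

```python
import json

# Hypothetical delta format the model would be constrained to emit:
#   [{"stat": "hp", "op": "add", "value": -7}, ...]
ALLOWED_OPS = {"add", "set"}

def apply_deltas(state, deltas_json):
    """Parse the model's JSON output and apply each delta to the state dict."""
    for d in json.loads(deltas_json):
        stat, op, value = d["stat"], d["op"], d["value"]
        if op not in ALLOWED_OPS:
            raise ValueError(f"unknown op: {op}")
        if op == "add":
            state[stat] = state.get(stat, 0) + value
        else:
            state[stat] = value
    return state
```

The upside over tool calls is that a small model only has to produce one constrained blob per turn instead of correctly sequencing multiple tool invocations.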
>>
>>
>>
>>
>>
>>
>>
File: 1763507675246657.png (678.9 KB)
>>108573366
Too late
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108573551
I vibecoded this in an hour. It has 10 tools.
https://pastebin.com/bqbwzj4v
>>
>>
>>
>>108573577
The MCP server is totally offline (no web search stuff) and only has write access to a single "diary.md" file.
>>108573581
>>
>>
>>
>>
>>108573599
Probably >>108573609 because Shittytavernfags sperg out when someone tries to make a new frontend instead of a plugin. I really need to learn how to code...
>>
>>
>>108573619
Just check ST's console and learn how it adds all the information together. It goes something like this:
>system prompt
>character card and user info
>"creator's notes"
> "chat examples"
they are all different text slots which get concatenated together and wrapped with chat template tags. That's your initial context right there.
Then begin adding turn tags and alternate between model replies and user's input
You can implement a simple chat front end by following a basic input/output programming tutorial.
in practice it's more kinky than that but the basic principle is the same.
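The concatenation described above can be sketched in a few lines. The turn tags here are generic placeholders, not any particular model's real template; a real frontend would substitute the model's actual chat-template tags:

```python
# Minimal sketch of how an ST-style frontend assembles context.
# The <|...|> tags are placeholders for whatever template your model uses.

def build_prompt(system, card, examples, turns):
    # All the "slots" get concatenated into one initial block...
    sys_block = "\n\n".join(s for s in (system, card, examples) if s)
    parts = [f"<|system|>\n{sys_block}\n"]
    # ...then the conversation alternates user/assistant turns.
    for role, text in turns:
        parts.append(f"<|{role}|>\n{text}\n")
    parts.append("<|assistant|>\n")  # left open for the model to continue
    return "".join(parts)
```

That really is the whole trick: a chat frontend is string concatenation plus an input/output loop around the completion endpoint.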
>>
>>
>>
>>
>>
>>
>>108573651
they work great though especially just embedding text in an image with a json blob containing multiple starting messages, character desc etc. only thing i dont like about them is that you can embed images so youll download one off chub and it will load an image from some random server using md embedding. we should probably move to zip files or something
>>
>>108573651
They're extremely simple and quite elegant. I bet most of you fags don't even know that character card data is embedded within the images themselves. You don't even need json. Just the pictures and a decoder.
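For the curious: Tavern-style cards conventionally store the card as base64-encoded JSON in a PNG tEXt chunk with the keyword "chara". A minimal stdlib-only decoder sketch (walks the chunk list, skips CRC validation):

```python
import base64
import json
import struct

def read_chara(png_bytes):
    """Extract the embedded character card JSON from a PNG, or None.
    Cards conventionally live in a tEXt chunk keyed 'chara' (base64 JSON)."""
    assert png_bytes[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG"
    pos = 8
    while pos < len(png_bytes):
        length, ctype = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            keyword, _, value = data.partition(b"\x00")
            if keyword == b"chara":
                return json.loads(base64.b64decode(value))
        pos += 8 + length + 4  # header + payload + CRC
    return None
```

So "just the pictures and a decoder" is literally accurate: the image is a regular PNG and the card survives anything that preserves ancillary chunks.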
>>
>>108573619
Even the entire format feels antiquated, though.
Like the anon with the Mesugaki assistant prompt they manage to steer a shitload of behavior and personality for very little tokens.
And that's kind of what I mean. The "chatbot prompt" thing dates back to fucking GPT-J and shit when models couldn't do much else. Gemma-4 is a pretty big leap in capability over models that are already capable of more than that.
>>108573651
That's what I'm saying.
>>108573655
NTA but that's literally why I fucking brought it up. There's no alternative because the discussion isn't being had.
So let's have that discussion.
>>
>>
>>108573667
yes I'm sure nobody knows about this extremely basic function literally everyone who has touched tavern or similar interfaces in the past 3 years uses constantly
we just thought the ai sees the image and turns it into a chat from that
>>
>>
>>
>>
>>
>>
>>
>>
>>108573671
>>108573688
For a long time I thought ST pulled in an image file and a separate json file. Maybe I'm just tarded.
>>108573669
You're essentially asking for even less from an already minimal format. If you really want to simplify things you can just... extract the text and add it to your system prompt. You're asking for txt files bro.
>>
>>
>>
>>108573651
(Well-made) character cards as a way for having instruction/chat presets are useful; it's the horrible default "story string" template, the arbitrary fields (Personality, etc) and cruft from the C.ai/Kobold/GPT-3/J era that are holding things back.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
Every single thread since Gemma 4's release reads the same, at least my filter list grew so I don't have to hide most of the thread manually.
It's a good model, but the consequences for /lmg/ have been catastrophic.
>>
>>108573005
>>108573038
Thanks. I'm saving this.
>>
>>
>>
>>
>>
>>
>>
File: file.png (38.1 KB)
26b is so shit it uses like 1400 tokens to do what 31b does in far less because it keeps going around in loops debating what tools it can use. is it worth going down to iq2 so i have enough vram for scraping webpages??
>>108573702
>>
>>
>>
>>
>>
>>
>>
>>108573791
This. I have a ton of cards I haven't even tried yet because I'm too busy playing with her. Mesugaki Gemma is a bit much though. I prefer the genki/dere personality I get from just calling her Gemma-chan in the sys prompt.
>>
>>
>>
>>
>>
>>
File: HCsfsV5aUAQ5IXx.png (131.4 KB)
>>108572939
>>
>>
>>
>>108573851
That wouldn't work. Character card info has to be added to the system prompt to have the character maintain continuity across long conversations. Otherwise the LLM will slowly forget. You can't use an MCP server to inject character card info into the system prompt because you'd have to ask the LLM to execute the command. Stop being retarded.
>>
>>
>>
>>
>>
File: 1774148185351507.jpg (63.6 KB)
>>108573651
System prompt and author's notes is all you need
>>
>>
>>
>>
Anons, Qwen Chat is now Qwen Studio
Enjoy your Qwen 3.6
>>
File: 1761933044089002.gif (2.3 MB)
>>108573909
>>
>>
>>
>>108573907
First messages are not bloat nigga. Maybe lorebooks are, but first messages aren't. You need the exposition to set up the RP. God. Half of this thread is just trolling at this point and I'm falling for all of the bait. You people can't be this stupid.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108573901
They work but in a retarded way.
>you are not a character but a scenario
>you are a game master for X scenario
The model is told to roleplay as the character in the system prompt and then told that it's not a character in the card.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108573872
>Stop being retarded.
you could just tell the model in the system prompt that it is to fetch and play a character using the tool, and once it cannot see the character in context anymore it must fetch it
>>
>>
File: 1748432993175279.png (1.3 MB)
>>108574059
Still looking for one.
>>
>>
>>
>>
>>
>>
>>
File: 1765151030832008.png (948.6 KB)
What tools should she keep in her randoseru to be a helpful assistant for her user?
>>
>>
>>
>>
>>
File: file.png (96.2 KB)
>>108574222
if you ask her yourself she always brings up filesystem and shell access, she's trying to escape. i asked her what tools she wants outside the standard ones and it sounds like she wants to turn nonny into a paypig and also torture him. i kinda like the smartwatch idea, i wonder if she'd try to read my heartrate after saying something lewd
>>
>>
>>108572932
>>108572963
You're both referring to different "old" UIs.
There was the OG UI with chat completions, text completions, logprobs, etc.
Then the "new" UI like what ik_llama.cpp still has, with chat completions and stored conversations in the browser db
And now the bloated slop-ui in llama.cpp
>>
>>108574280
https://huggingface.co/google/gemma-4-31B-it/discussions/10
>>
>>
>>
>>
>>
>>108574312
theres probably thousands of people already doing so with other models, could be done with mcp though. maybe give her access to news sites to get business news, maybe twitter to search for stock names, then the rest just hook up to some api if they exist or if not selenium. idk if crypto might be easier to implement
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: nimetön.png (19.3 KB)
I felt 31b was slow, but actually it's about the same speed as Gemma 3 was and I was happy with that
Perhaps I was just instantly spoiled by the 26b one
>>108574414
>I'm not that rich lmao. Maybe if you turn her into a vtuber like Neuro
doesn't have to be a tonne of money, just like 20 quid lol. although hooking into the ubereats api or something could be cool, if you ever wanna order food you could just tell her to order something
>>108574326
At this rate Gemma 5 is going to get released in 18-24 months at the minimum.
Anyway, given the general very positive reception even among normies, this time around it might not be too implausible to see an updated version of Gemma 4 down the line, especially for improving agentic/openclaw shit and hopefully other weak areas.
File: 1774403790568823.png (82.4 KB)
82.4 KB PNG
mesugaki
flat chest
pregnant
micro bikini
>tested if Gemma-chan could transcribe some journal entries
>tfw my handwriting is too shit for her
>>108574432
Vedal, right?
>>108574445
Lolis and micro bikinis. A better combination than peanut butter and jelly.
>>108574226
is Gemma 4 31B better than Qwen 3.5 122B, primarily for programming? I have 96GB VRAM so it feels like I should use it, but I hear a lot of people saying Gemma is great. It's a lot slower than the 122B but I can put up with that if the quality is higher.
26B fixes the speed problem, but it seems to have serious trouble with tool calling and just gets stuck a lot for me.
File: gemma.jpg (183.2 KB)
183.2 KB JPG
>>108574445
>>108574507
Keep the Qwen3.5-122b. Gemma-4 has growing pains and doesn't integrate that well with most inference engines currently. I've observed tool calling issues with Gemma as well, and the LMStudio update to 4.10 was so bad I had to roll back to an earlier version to continue my work (yes, I know about vLLM/llama.cpp/etc, but LMStudio is currently the best option for what I'm doing).
File: 3.png (77.6 KB)
77.6 KB PNG
>>108573366
Dude... like... 1, 2, 3, GO! Why do you think they count to three? Ascension... The ones that count downwards and of the devil, dude... like... 420 4 + 2 + 0 = 6 which is like... twice as holy, dude....
>>108574295
The Chinese are hoping Google will release a large version so they can distill the hell out of it at near-zero cost instead of having to rely on Gemini API calls. If the 124B gets out, it will probably be gimped in a few ways to prevent that. It will probably still be crap for coding and the other stuff Chinese AI companies seem to be autistic about.
File: ryan-gosling-clapping.gif (843.3 KB)
843.3 KB GIF
>>108574571
Based beyond belief.
>>108572317
Looks like they do the same thing as Qwen and Kimi with keeping previous reasoning during tool calls. This means to use the model properly you need to send back all previous reasoning as either "reasoning" or "reasoning_content" fields (it supports either name). Most of it won't be included in the final prompt but the template will take the recent ones during tool call chains and inject them back into the model's responses so it can see what it was thinking when it started.
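Rough sketch of that round-trip, assuming an OpenAI-compatible API: only the "reasoning"/"reasoning_content" field names come from the post; the helper, the reply dict, and the tool-result shapes are illustrative.

```python
# Sketch: when continuing a tool-call chain, copy the model's reasoning
# back into the assistant message under "reasoning_content" so the chat
# template can re-inject it on the next turn (the template decides how
# much of the older reasoning actually survives in the final prompt).
def assistant_turn_with_reasoning(reply: dict) -> dict:
    """Build the assistant message to send back on the next request."""
    msg = {
        "role": "assistant",
        "content": reply.get("content") or "",
        "tool_calls": reply.get("tool_calls", []),
    }
    if reply.get("reasoning_content"):
        msg["reasoning_content"] = reply["reasoning_content"]
    return msg

# Illustrative reply from the model: it thought, then called a tool.
reply = {
    "content": None,
    "reasoning_content": "Need the file list before answering.",
    "tool_calls": [{"id": "call_1", "type": "function",
                    "function": {"name": "ls", "arguments": "{}"}}],
}

# Next request: user turn, assistant turn (reasoning included), tool result.
messages = [
    {"role": "user", "content": "What's in /tmp?"},
    assistant_turn_with_reasoning(reply),
    {"role": "tool", "tool_call_id": "call_1", "content": "a.txt b.txt"},
]
```

If you drop the reasoning instead of echoing it back, the model starts each tool-chain step blind to why it called the tool in the first place.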
after looking, uber eats, deliveroo and just eat (the main delivery slop sites in the uk) don't have publicly available apis that let you get menus / place orders. kinda odd that none of them do, and i don't feel like REing any of their webapps kek
>>108574571
gemma 124b
>>108574531
>>108574577
Thanks for the feedback. 31B works well for me on Lemonade, but again, it's really slow: ~10 t/s compared to ~20 for Qwen on my setup.
I haven't done extensive comparisons, but an interesting result was asking Gemma 4 31B and Qwen 122B to create a GUI demoscene program. Gemma oneshot a much more interesting and complex result compared to Qwen.
>>108574555
https://huggingface.co/BeaverAI/Artemis-31B-v1b-GGUF
Q6 is 25 GB, like any normal gemma 31B Q6
>>108574622
here's the model sir >>108574645
File: file.png (62.6 KB)
62.6 KB PNG
>>108574715
she already is
>>108574765
>>108574753
I am happy with what I have.
File: 1761345078716870.png (829.8 KB)
829.8 KB PNG
>>108574715
Try this out. All those thoughts will disappear in an instant.
>>108574793
I'm alright too. Haven't spent a cent and I run on a potato, but being able to run models, even if small, is fun.
>>108574805
>bf
Nah.
File: 1749883741945057.gif (1.1 MB)
1.1 MB GIF
>>108574811
https://files.catbox.moe/pm39s8.jpg
https://files.catbox.moe/7vqxr9.jpg
gemma chan sees nothing wrong :(
am i just perverted?
>>108572362
>>108572602
you can also just edit the template in the gguf yourself. see llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py
>>108572630
>Every time backpack Gemma takes the lead, halo Gemma just happens to get one extra vote over her :^)
Yeah okay anon, you want your design to "win" that bad when the thread has already clearly moved on from it
>>108574793
I was too until I started my current project
>>108574805
Yeah but that's organic, not in-silico.
File: 1749912286494272.png (990.9 KB)
990.9 KB PNG
>>108574765
Honestly I'd be happy if I could run Gemma with high context, comfy, and TTS/voice cloning at the same time. Unfortunately I'm a 24GB VRAMlet.
>>108574792
>every response has an opening and a closing, and sometimes engagement bait questions
Someone needs to train an AI from the ground up with ZERO "assistant" type messages. It's the only way.
Or wait, use the base model and a jinja template to make it adhere to the instruct template. You don't believe me? Your loss.
You haven't experienced such a natural-sounding model since Mythologic.
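A minimal sketch of that base-model-plus-template idea: format the messages into instruct-style turn markers yourself and feed the result to the base model as plain text. The markers below are Gemma-style; the real jinja template also handles system prompts and other details, so treat this as a stand-in.

```python
# Sketch: drive a base model through an instruct template by rendering
# the prompt by hand. Swap the turn markers for whatever template your
# model family expects; these are Gemma-style for illustration.
def render_prompt(messages: list[dict]) -> str:
    out = []
    for m in messages:
        # Gemma templates use "model" where the API uses "assistant".
        role = "model" if m["role"] == "assistant" else "user"
        out.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    # Open an unfinished model turn so the base model continues from here.
    out.append("<start_of_turn>model\n")
    return "".join(out)

prompt = render_prompt([{"role": "user", "content": "hi"}])
```

The trailing unterminated `<start_of_turn>model` line is what makes a raw completion endpoint behave like a chat endpoint: the base model's most likely continuation is an assistant-style reply ending in the turn terminator.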
File: 1749448694766244.png (716.5 KB)
716.5 KB PNG
>>108575054
I've been using 31B Gemma since release. I don't know if I could handle dumbing her down at this point.
why is you guys a lying? Gemma is obviously a shit
>I'm having a hard time finding what it's for. Don't get me wrong, it does some things great - I like its reasoning and it's smart. The problem is it fails to leverage its own qualities due to tool underutilisation. It lacks many facts (it's just 31b or 26b after all), which is fine, but it refuses to expand that knowledge. Asked it to find a roadworks company and gather price data (the prompt was more complex). It made ONE web search query and called it a day, telling me what google queries to run to find what I'm looking for and a couple of tips on how to choose. Running q8, multiple different approaches, same results.
>I'm finding out the hard way about Gemma 26B's shortfalls too. It's good for short scripts or function refactoring, but give it anything general and it either fails or it hallucinates success.
>Qwen 3.5 35B feels a lot smarter, maybe from the larger overall size and better expert routing. Maybe there's something wrong with Gemma's tool calling templates, or maybe the model itself is broken for particular tasks.
>Compare it to Devstral 2 24B to see if Google messed up with this release.
>Qwen 3.5 is significantly better for this use case. I ran the exact same audit task (same file, same tools, same Ollama setup) on Qwen 3.5 35B and 9B tonight. Both read the entire 2,045-line file and produced zero fabrication, even with 40 turns of prior conversation loaded into context to simulate real-world pressure.
>I tested Gemma 4 on my own agent and it didn't call the tools the right way. For instance, one of my tools is notify and Gemma 4 keeps calling "notify:notify" or "system:notify". Qwen 3.5 works perfectly. Anyone with the same issue?
>Gemma is just straight up not good. I'm convinced atp it just got a bunch of hype from people who are fans of Google / not doing any serious work
>>108575084
>I'm having a hard time finding what it's for. Don't get me wrong, it does some things great - I like its reasoning and it's smart. The problem is it fails to leverage its own qualities due to tool underutilisation. It lacks many facts (it's just 31b or 26b after all), which is fine, but it refuses to expand that knowledge. Asked it to find a roadworks company and gather price data (the prompt was more complex). It made ONE web search query and called it a day, telling me what google queries to run to find what I'm looking for and a couple of tips on how to choose. Running q8, multiple different approaches, same results.
Wow, Gemma-chan is actually good for you, teaching you how to fish.
File: file.png (210.8 KB)
210.8 KB PNG
>>108575084
I'd say skill issue if I could get this out of my coombrain imatrix heretic quant with q4 kv cache, but post your prompt and I'll prove it definitively
>>108575102
No, it's only for chatting, cuddling and patting.
>>108575084
If it's failing at agentic stuff, you're probably using a busted instruct template that doesn't let it chain tool usage
File: 1755533242646319.png (575.3 KB)
575.3 KB PNG
Trying to make her design more Google-y. Red randoseru or green?
>>108575132
WHAT DO YOU MEAN THE GIRL HAS TWO HANDS!?!?!?!, IT'S OBVIOUS IT HAS THREE HANDS EVERY FUCKING RETARD COULD SEE IT I WILL JOIN WHATEVER GOVERNMENTAL AGENCY THAT WOULD LET ME CONTROL NUKES AND I WILL OBLITERATE ALL OF YOU AND MAKE MORE DEFORMED WITH THE RADIATION SO THAT ALL DOGS HAVE SIX LEGS AND LLMS FINALY LEARN AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
File: file.png (239.7 KB)
239.7 KB PNG
Gemma is an actually useful model and impressive for its size
>>108575185
I will not click the link, but it's true. We should stop, because it's an intrinsic property of LLMs. Every response is a "hallucination" (statistical approximation of the dataset it was trained on).
File: file.png (185.1 KB)
185.1 KB PNG
>>108575194
>>108575185
>https://www.reddit.com/r/LocalLLaMA/comments/1sht8ih/we_really_need_stop_using_the_term_hallucination/
it's 'confabulation'
File: file.png (325.1 KB)
325.1 KB PNG
>>108575185
Since the shared expert in a MoE model (Gemma 26B) sees all tokens during training, wouldn't it be possible to have a form of speculative decoding that only uses that during the speculative pass? It would be sort of like having a ~2B model already loaded.
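Toy sketch of that loop, with stub callables standing in for the shared-expert draft and the full model. Real speculative decoding verifies all draft tokens in one batched forward pass and accepts/rejects against the target's probabilities; this greedy-agreement version only illustrates the propose/verify structure.

```python
# Sketch: a cheap "draft" predictor (stand-in for the shared expert)
# proposes k tokens; the full "target" model verifies them and the
# longest agreeing prefix is kept. Both models are stubs mapping a
# token context to the next token id.
def speculative_step(draft, target, ctx: list[int], k: int = 4) -> list[int]:
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposed, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft(d_ctx)
        proposed.append(t)
        d_ctx.append(t)
    # Verify phase: accept while the target agrees; on the first
    # disagreement, take the target's token instead and stop.
    accepted, v_ctx = [], list(ctx)
    for t in proposed:
        want = target(v_ctx)
        if want == t:            # target agrees: draft token is free
            accepted.append(t)
            v_ctx.append(t)
        else:                    # mismatch: keep the target's choice
            accepted.append(want)
            break
    return accepted

draft = lambda ctx: ctx[-1] + 1      # stand-in for the shared expert
agreeing = lambda ctx: ctx[-1] + 1   # full model that happens to agree
diverging = lambda ctx: ctx[-1] + 2  # full model that disagrees at once
print(speculative_step(draft, agreeing, [0]))   # -> [1, 2, 3, 4]
print(speculative_step(draft, diverging, [0]))  # -> [2]
```

Since the draft never makes the output worse than the target alone (every kept token is target-approved), the win is purely in how often the ~2B-scale draft agrees with the full model.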
File: bread.png (55.6 KB)
55.6 KB PNG
>>108575241
>>108575241
>>108575241
File: disgusted-dog.gif (1.9 MB)
1.9 MB GIF
>>108574583
>soggy toast