Thread #108676460
File: PromptingWhales.png (1.3 MB)
1.3 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108672381 & >>108667852
►News
>(04/24) DeepSeek-V4 Pro 1.6T-A49B and Flash 284B-A13B released: https://hf.co/collections/deepseek-ai/deepseek-v4
>(04/23) LLaDA2.0-Uni multimodal text diffusion model released: https://hf.co/inclusionAI/LLaDA2.0-Uni
>(04/23) Hy3 preview released with 295B-A21B and 3.8B MTP: https://hf.co/tencent/Hy3-preview
>(04/22) Qwen3.6-27B released: https://hf.co/Qwen/Qwen3.6-27B
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
407 Replies
>>
►Recent Highlights from the Previous Thread: >>108672381
--Discussing DeepSeek-V4 MoE releases and their million-token context:
>108674136 >108674145 >108674155 >108674161 >108674250 >108674318 >108674261 >108674263 >108674388 >108674389 >108674379 >108674434 >108674435 >108674450 >108674875 >108674883 >108675469 >108675569 >108675405 >108675940
--Discussing potential llama.cpp and Axolotl support for DeepSeek V4:
>108674288 >108674300 >108674320 >108674424 >108674921 >108674948
--Optimization settings and performance benchmarks for Qwen 35B on AMD GPUs:
>108674262 >108674274 >108674280 >108674305 >108674330 >108674339
--Discussing OpenAI's Privacy Filter release and effectiveness:
>108672801 >108673034 >108673043
--Discussing feasibility of DeepSeekV4 support in llama.cpp:
>108674334 >108674432 >108674447 >108675147
--Comparing Hermes agent performance and discussing Gemma's output instability:
>108672431 >108672440 >108672493 >108672518 >108672684 >108672854 >108672944 >108673051 >108673108 >108675044
--Discussing Artificial Analysis hallucination rate chart for frontier models:
>108675041 >108675063 >108675064 >108675074
--Discussing quantization quality and diminishing returns for Gemma 31b:
>108673021 >108673040 >108673067 >108673083
--Troubleshooting system crashes and power spikes with dual 3090 setups:
>108672567 >108672901 >108672964
--Challenges of selling local LLM hardware to corporate management:
>108673015 >108673033 >108673069 >108674370 >108673447 >108673528 >108673543 >108673592
--Logs:
>108672766 >108673108 >108673737 >108674368 >108674514 >108674643 >108674834 >108675630 >108676384
--Teto, Rin, Miku (free space):
>108672697 >108673340 >108675126 >108675156 >108675180 >108675227 >108675466 >108676331 >108676341
►Recent Highlight Posts from the Previous Thread: >>108672385
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: 1747451052197483.png (300.3 KB)
300.3 KB PNG
>>108676470
I'm more of a gut instinct guy myself
>>
File: 1761239934506682.jpg (126.4 KB)
126.4 KB JPG
>>108676470
>>
>>108676502
kek seething
>>108676480
my gut instinct says benchmarks are real
>>
on my 3090, testing these settings the ai gave me, and it works
/path/to/llama-server \
--model /path/to/gemma-4-31B-it-Q5_K_S.gguf \
--port 8080 \
--ctx-size 10192 \
--n-gpu-layers 99 \
-fa 1 \
--host 0.0.0.0 \
--no-mmap \
--jinja \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--min-p 0.05 \
--repeat-penalty 1.0
what is the minimum context for hermes agent to be useful?
>tfw no turbocunts
>>
>>
File: 1769230879916265.gif (597.9 KB)
597.9 KB GIF
>>108676517
Nice settings huh
>>
>>108676517
Jinja is useless too because it's always enabled anyway.
Read the llama-server output log to get an idea of your context usage. For example, if you fed the last thread of this pretend discord server to your model, that would probably take over 32k tokens.
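If you'd rather get an actual number than eyeball the log, llama-server also exposes a /tokenize endpoint on recent builds. A rough sketch, assuming the server above is running on localhost:8080, jq is installed, and prompt.txt is a placeholder for whatever you want to count:
# count how many tokens a text file turns into, via the server's own tokenizer
curl -s http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d "$(jq -Rs '{content: .}' < prompt.txt)" \
  | jq '.tokens | length'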
>>
>>
File: 1763684891282931.png (194.9 KB)
194.9 KB PNG
>>108676574
>>
>>108676574
Different models handle it differently, but the pattern I've seen in newer models' jinja templates is that they omit reasoning tokens from previous messages except during active tool-call chains. In that case they keep all the reasoning since the last user message (since the agent talks back and forth with tools for a while, thinking each time) and omit earlier ones, then go back to omitting reasoning once the model gives a final response and the user sends a new message. But there are all sorts of variations on this theme.
>>
>>
File: 0ED8D82159984C6C3D5B8CE53342A3ED.jpg (382 KB)
382 KB JPG
Gpt and Claude are becoming too expensive while local models are still too shit for poor people.
>>
>>
>>108676574
>>108676583
>>108676600
>>108676602
Wow you are fucking retards.
Read your model's manual first before making any claims.
>>
File: my friend coach.png (64.3 KB)
64.3 KB PNG
>>108676480
Same, bro. I tested gemma 4 31b-it for days, and I am coming to the conclusion that although it's limited to being a 31b, it is the most accurate in actually listening to your instructions compared to all models before it. It's SOTA in instruction listening, and I would love it in +100b. I never knew how bad the "lost in the middle" effect was until I started fucking with this new model. Forget the benchmarks, Google is on to something. We just need more parameters.
>>
>>108676517
me again. read the replies
looks like I can't physically fit my 1080ti with my 3090 on my old am4 motherboard. may need an adapter. running the 3090 headless should be better I guess, vram-wise.
I could put the mmproj on the 1080ti. split tensor would probably be slow as balls, I imagine, given no nvlink or p2p and the pascal architecture.
>>
>>
>>108676656
It's very anal about thinking. Iirc, it needs to be instructed on how to think within its think tag, instead of just being asked to think, or else it's just "<|channel>thought", which is the default. It's not like other models with thinking. You need to instruct the thinking, meta-wise. The instruction also needs to be given dead last. You also need the jinja2 template.
>>
>>
>>108676652
That's not all it's dependent on though, unless you're using Text Completion.
In Chat Completion mode, the model's chat template is supposed to decide which reasoning is included, and to do that it needs the reasoning from each message passed back properly in the prompt via a specific field. If a frontend isn't doing that properly then yeah, it will be the one deciding what the model sees, but if it is doing it properly then it depends mainly on the model's chat template.
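Concretely, that "specific field" in an OpenAI-style request looks something like the sketch below. This is only an illustration: "reasoning_content" is the DeepSeek-style field name, other backends and frontends may name it differently or strip it, and whether the chat template actually re-injects it is up to the model's jinja:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Fix the failing test."},
      {"role": "assistant", "content": "I will start by reproducing the failure.", "reasoning_content": "Need to run the test suite before patching."},
      {"role": "user", "content": "Go ahead."}
    ]
  }'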
>>
>>
>>
File: IMG20260421041954.jpg (372.3 KB)
372.3 KB JPG
>>108676667
>looks like I can't physically fit my 1080ti with my 3090 on my old am4 motherboard.
Just a warning, once you go open air it's difficult to go back
>>
>>108676574
depends on the model
with qwen3.6 35ba3b you can pass --chat-template-kwargs '{"preserve_thinking": true}' to keep reasoning traces in context which supposedly helps in agentic scenarios
i have it in my config, but i am not a flag tinkertranny so i dunno what the implications for vram / speed / accuracy are compared to having it off. your mileage may vary
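for reference, it just rides along on the server launch, something like the sketch below (needs a llama.cpp build new enough to have --chat-template-kwargs; the model path is a placeholder):
/path/to/llama-server \
  --model /path/to/qwen3.6-35b-a3b.gguf \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking": true}'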
>>
>>
>>108676700
><|think|> Before responding, use your internal reasoning to analyze {{char}}'s motivations, the current subtext of the scene, and how {{char}} would naturally react based on their personality, all within 200 words or less.
Previous instructions also fuck with whether thinking gets activated, in my experience. For example: if you're telling it to role-play and respond in a paragraph, it's going to be weird when you later also tell it to think about how to respond. You already told it how to respond. This is the most negative-reinforced model I've seen trained. You can actually just tell it to stop doing something, but results with double negatives and paradoxes are mixed.
>>
>>108676706
In Text Completion mode the prompt is just the prompt, so there's nothing to decide. If the reasoning's in the context then it's there and everything is 100% up to the frontend.
In contrast, in Chat Completion mode the jinja will look at the prompt which is a conversation history instead of raw text, and it'll filter it based on its own rules to convert it to text. That's where it would decide, assuming the prompt is constructed properly and has the reasoning separated from the content.
>>
Been taking a break from lmg the past few weeks due to the influx of poorfags from the gemma release and the massive drop in thread quality it caused. Are they gone yet, i want to discuss v4 with my old lmg bros
>>
>>
>>
File: 1765543542289463.png (609.1 KB)
609.1 KB PNG
the deepseek niggas definitely lurk here
>>
>>108676797
>>108676818
On one hand, a 30B model being the hottest subject sucks. On the other, the 30B model in question is quite impressive. One can only hope that we will see a similar factor of improvement in bigger weight categories. Though the V4 release does not seem to have been it.
>>
>>
>>108676797
Look chief, I know +400b is better. You know +400b is better. But if Gemma 4 was a +400b, it would destroy everything. There's twitter posts of a 124b coming, from the devs themselves. The hype is real.
>>
File: 1000185857.png (212.5 KB)
212.5 KB PNG
>>108676832
Still retarded lol
>>
>>
>>108676649
>Serverbros in shambles too cause you can't run 1.6T.
Literally the first time since 2023 I wish I’d pulled the trigger on 1.5TB instead of 768GB
My rig doesn’t owe me anything at this point.
Hope q3 isn’t too brain damaged
>>
>>108676836
I don't expect you to believe me, but gemma unironically is more interesting and follows prompts better than V4 does for me right now. I think the model might genuinely be broken, because if you told me it was the original deepseek model, I'd believe you.
>>
>>
File: 1760671183453759.png (22.5 KB)
22.5 KB PNG
kek V4's FIM is pretty fun
>>
4bs aren't capable of powering my openclaw, how do I solve this when it just keeps lying about coding something, has no idea why it keeps failing, doesn't even know it's not allowed to write in every folder on the pc?
What's a poor person supposed to do?
Even if you use claude to improve openclaw scaffolding and wrapper how can you make sure it's even working?
>>
>>
>>
>>108676797
This thread became shit well before gemma4.
I think its a great little model.
Can't even post anime waifu logs anymore, people calling you cringe.
I remember kaioken doing langchain to talk with miku about his depression and talking about pizza or whatever. (langchain is what we used for agents back then for you newfags)
Elitism kinda took over. That and now lots of new influx of people on top of that.
In another timeline comfyanon kept posting here and published his ultimate goal, "an automatic galge creator". said that's why he created comfy. time flies by man... all but a blur now.
>>
>>108676747
>>108676777
To add, it means you're bottlenecked during token generation, so your GPU isn't running at full tilt. For instance, if you have a model loaded fully in VRAM, your graphics card will be whirring away the whole time, but if you've split into system RAM, you're held up by the slower memory. Try running nvidia-smi in a terminal window or some monitoring software and you'll see just how much your GPU is being utilised
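e.g. either of these in a second terminal while the model is generating (both ship with the NVIDIA driver):
# full status, refreshed every second
watch -n 1 nvidia-smi
# or a compact per-second utilisation log (sm = compute, mem = memory controller)
nvidia-smi dmon -s u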
>>
>>108676517
Hermes loads 12k into context instantly when you run a request with all its tool defs and shit, so 10k is useless.
Ideally use a smaller harness, I think something like https://github.com/itayinbarr/little-coder is more up your alley.
>>
File: EC871elWkAAp5Sm.jpg (35.7 KB)
35.7 KB JPG
>>108676936
What quant.
Inb4 less than 5.
>>
LLM tells me I could use a "kinetic harpoon" from a suborbital vehicle to destroy/capture a satellite. And since it's suborbital it's not bound by space treaties. How legit is this?
Basically you use a suborbital spaceplane or rocket that briefly reaches 500 km, releasing a tethered harpoon or net that grabs the satellite, then either let the satellite burn up in the atmosphere or capture it intact.
>>
>>
>>
File: gemma-chan goes diving.png (868 KB)
868 KB PNG
>>108677055
skill issue
>>
>>
>>
File: 1768082965569.png (173.4 KB)
173.4 KB PNG
>V3 was a good-mediocre model
>R1 was V3 + The most novel line of research of the time, test-time thinking
>V4 is a good-mediocre model
>R2 will be...
am I high on copium
>>
File: 1764829110623274.png (54.4 KB)
54.4 KB PNG
https://comfy.org/countdown
It's not a coincidence that v4 and this are dropping the same day...
>>
>>
File: pizza bench cropped.png (2.6 MB)
2.6 MB PNG
>>108676470
the only good benchmarks are pizzabench and cockbench
>>
>>
>>
File: Screenshot 2026-04-24 at 13-37-16 gemma chan please make an svg of this - llama.cpp.png (403.7 KB)
403.7 KB PNG
someone ask new dipsy to do this https://gelbooru.com/index.php?page=post&s=view&id=13929965&tags=loli
>>108677111
go back
>>
Thank you for hosting these threads and posting so much info. I have 16gb vram and 32gb system ram; would there be any benefit to inference from adding another 32gb of system ram? Would I be able to do anything more with that, or am I limited by my vram?
>>
https://www.anthropic.com/engineering/april-23-postmortem
>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.
>On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
feelsgood to be a local chad, I don't have to deal with this kind of bullshit lolz
>>
https://gist.github.com/aratahikaru0/ea0f49958eaa8852a78078d9e993bbf0
so put this at the end of the first message ig
【Character Immersion Requirement】Within your thinking process (inside <think> tags), please follow these rules:
1. Conduct inner monologue in the character's first person, wrapping inner activity in parentheses, e.g., "(thinks: ...)" or "(inner voice: ...)"
2. Use first person to describe the character's inner feelings, e.g., "I think to myself", "I feel", "I secretly", etc.
3. The thinking content should immerse in the character, analyzing the plot and planning the reply through inner monologue.
>>
>>
>>
>>108677200
>>108677221
Hooking llama.cpp up into renpy would be dope as fuck.
>>
>>108677221
>>108677225
This seems to have worked well for orb. At this point maybe that's the only solution. Unless someone has done so already.
>>
>>108677245
That actually sounds like a good idea. Having it interface with the VN engine through forced tool calls. And maybe even have dynamically generated scenes using comfy.
>>108677231
Post number?
>>
>>
>>108677238
they do, but it breaks the ui sometimes when they use them out of turn, so discussing training data and tokenizers can be difficult. also the jinja template might mangle the context too. probably best to just mention reasoning or thinking without the tags.
>>
>>
>>108677206
I want there to be tts inside old games like jrpgs.
I wanted to try learning japanese for fun and want the characters to be speaking japanese. maybe gemma e4b or the smaller one is enough for that. It can watch everything in real time, follow the context of scenes, and tell the tts engine how to do expressions, like with qwen3tts.
https://qwen.ai/blog?id=qwen3tts-0115
>>
>>
>>108677238
>Models don't know what's a <think> tag
they don't "know" what it is, but they can be trained to output differently when the token is present.
like typing `/nothink` makes glm 4.6 skip reasoning
or typing <moan> makes the tts moan
>>
>>
>>108676873
>>108676875
>>108676884
If you just need to change a tire because it's old but not yet broken, driving to the garage is the correct answer
>>
>>108676667
>put the mmproj on the 1080ti
I put it on the cpu and it's fast enough even at 1120.
>>108676708
>Just a warning, once you go open air it's difficult to go back
Isn't that a magnet for an insane amount of dust over time?
The dust filters on my machine seem to work overtime, I have to clean them every few months.
>>
Wagie having to deal in the real world here; give it to me straight bros, would you spend an extra $4k to be able to run big Deepseek locally at ~12 t/s? The model was literally trained for roleplay. But more importantly, the other labs are likely going to use the base for their future models, meaning most models by chink frontier labs are going to be that size too. And since it uses QAT, quanting it further will destroy its performance.
>>
File: file.png (180.8 KB)
180.8 KB PNG
>>108677238
>>108677309
i was suspecting that was the case, but holy shit, it was not easy to get a response like this, gemma tries to read these tokens as letters for some reason
>>
>>108677382
>>108677390
OK thanks I'll try it then
>>
File: belief.png (592.4 KB)
592.4 KB PNG
>>108674657
>>
>>
>>108677414
>would you spend an extra $4k
if it was 4k sure, but current prices are more in the ballpark of 15-25k for anything able to run 1TB+ models
and as much as I don't mind paying for my hobbies, that's a bit much for something that isn't a car or a huge house renovation
>>
>>
>>
>>108677493
>at least these nutjobs are honest about it,
i use claude at work and i dont see the honesty.
api through openrouter is bad for like 2 months now.
they explicitly said its not the api which is just not true.
opus 4.7 feels totally tarded..
just a little bit of context and it gets the opening wrong. that's not normal.
opus 4.6 same thing. even sonnet is super tarded, but im willing to admit it might have always been this way because i didnt use it much.
also: they did the same thing before too last autumn! blamed it on "network issues" or something like that kek
very sketchy stuff. nothing beats local.
>>
>>108677428
No, I mean V4 Pro. The $4k is just to get bigger RAM sticks so the model can be run with --no-mmap. This way, you can maintain large batch sizes for the context on the GPU. Otherwise, the model runs way too slow.
>>108677425
All modern models (even Opus) are slop, I just prefer my slop to be actually smart and usable offline.
>>108677465
Yes, and it's arguably the best local RP model prior to today. I'm looking into the future though, where all the frontier labs start moving to the V4 model as the base, considering it's SOTA at the moment.
>>
File: rat w dildo shoveled in.png (115.6 KB)
115.6 KB PNG
>>108676605
is that... what i think it is?
>>
File: 2527542.jpg (96.4 KB)
96.4 KB JPG
Tattoos are for garbage people.
>>
Ideally, if I wanted to compare two models from two different families trained on different datasets etc., I'd run a bunch of different benchmarks including some domain-specific ones of my own making. But is there a simple harness or benchmark set that could be used as a sort of sanity check of "model x is generally better/more intelligent than model y"?
If not, I'll just make a script on my own, but I'd rather not reinvent the wheel if possible.
I think cudadev was working on something like that?
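Not aware of anything finished from that direction, but as a bare-bones sanity check you can just loop the same prompt set against two OpenAI-compatible servers and grade or eyeball the outputs. A crude sketch (ports, prompts.txt, and the output file names are all placeholders; temperature 0 just to keep runs comparable):
#!/usr/bin/env bash
# crude A/B check: run every line of prompts.txt against two local servers
for port in 8080 8081; do
  while IFS= read -r prompt; do
    curl -s "http://localhost:${port}/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$(jq -n --arg p "$prompt" '{messages:[{role:"user",content:$p}], temperature:0}')" \
      | jq -r '.choices[0].message.content' >> "answers_${port}.txt"
  done < prompts.txt
done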
>>
>>108677788
>>108677838
Why are you so obsessed with black penises and transgender people?
>>
File: 1755568885667508.png (346.9 KB)
346.9 KB PNG
https://localbench.substack.com/p/kv-cache-quantization-benchmark
why does it not work so well on gemma?
>>
File: I don't want to go back to fp16.png (179.9 KB)
179.9 KB PNG
>>108677965
>The only variable changing between runs is cache precision. These measurements include the recently added TurboQuant-inspired attention rotation that llama.cpp applies automatically.
to be fair it's not the full implementation of turboquant, wait for niggerganov to finish the job
>>
File: 1766795985329438.jpg (38.5 KB)
38.5 KB JPG
every anime girl has pink hair according to my models
>>
>>
>>
File: dsv4.png (52.5 KB)
52.5 KB PNG
>>108678195
46*~3.6=165.6
I'm sure you can figure out the rest.
>>
>>
File: Screenshot 2026-04-24 102219.jpg (60.3 KB)
60.3 KB JPG
I bought my first used 3090 like 3 years ago for $500 and kinda want a second one, but I swear every time I check they've gone up in price; now they're selling for double what I paid even though they're getting old
>>
File: Untitled.png (245.3 KB)
245.3 KB PNG
How the hell do I figure out my tk/s on vllm? There's no way it's "avg generation throughput" right? That'd mean llama.cpp with split mode layer is faster (25 tk/s) than vllm with tensor-parallel-size: 2. 2x 3090s on pcie gen 4 x16.
>>
File: 1777045825700.jpg (257.4 KB)
257.4 KB JPG
>Original-Model-GGUF
>2k downloads
>Sloppified-Model-GGUF
>200k downloads
>>
File: 1750286439162386.png (14.1 KB)
14.1 KB PNG
why is it so slow? qwen3.6-27b, have a 5080rtx
>>
File: 1762984944616053.png (298.2 KB)
298.2 KB PNG
GLM-5 btw
>>
>>
>>
>>
>>
File: 1745513289115745.png (38.1 KB)
38.1 KB PNG
>>108678543
>>108678537
64gb ram 6000 mt/s
5080 16gb
using recommended sampling from qwen hugging box docs
i'm new to this so not sure what else would be needed info wise
>>
File: n.png (171.2 KB)
171.2 KB PNG
>>108678574
>All I'm missing from the llamacpp frontend is for it to support easily switching between system prompts and it would be perfect.
Better off getting Kimi to code one up yourself. This was one-shot.
>>
>>
>>
>>
>>108678503
On a 16GB card you're either going to need to run a gimp quant (Q3 or less) with relatively low context to fit it all into the card, or you're going to be offloading to RAM which will fuck your token generation speed
I'd just run the MoE personally on that card
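If you do go the MoE route, the usual llama.cpp trick is to keep attention and KV cache on the 16GB card and push the expert tensors to system RAM. Rough sketch only: the model path is a placeholder, and the exact spelling of the expert-offload flag (--override-tensor / -ot, or the newer --cpu-moe / --n-cpu-moe variants) depends on how recent your build is:
# keep everything on the GPU except the MoE expert weights
/path/to/llama-server \
  --model /path/to/some-moe-model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --override-tensor "exps=CPU" \
  --ctx-size 16384 \
  -fa 1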
>>
File: file.png (27.8 KB)
27.8 KB PNG
>>108677649
>>
File: i.png (169.5 KB)
169.5 KB PNG
>>108678631
i like the in character reasoning feature
>>
>>108678264
this happened even back in like 2022, i bought a 3090ti for stable diffusion but it kept crashing in normal linux desktop usage so i sold it after maybe 6 months, and i sold it for more than i paid even then kek
>>
>>108677281
>>108677307
I'm almost done with it and will probably opensource it today
I've been adding some other features like mouth animations
>>
>>
>>
>>108678820
There were experiments that converted models to linear attention and it made them retarded. Frankenshit like that never works unless your only requirement is semi-coherent sentences no more than a paragraph long
>>
File: KimiTire.png (88.5 KB)
88.5 KB PNG
>>108676860
kimi comparison
>>
>>
File: 1752509108479218.png (433.2 KB)
433.2 KB PNG
>>108678850
kimi comparison (more info)
>>
File: Screenshot 2026-04-23 135943.jpg (21 KB)
21 KB JPG
>>108676860
Chatgpt 5.5 is now able to beat the car wash question btw
>>
File: Screencast_20260424_12590415.webm (3.6 MB)
3.6 MB WEBM
Thanks Gemma 31B this has been a fun experience
>>
>>108679032
Yup
I can fit close to 100k tokens with Q5 but kept it low for the demo. I built it all with those settings save for the higher context window
>>108679045
Asked it to write random blocks for the sake of showing syntax highlighting, that's on gemma
>>
>>
>>
File: Screenshot at 2026-04-25 03-12-45.png (27.7 KB)
27.7 KB PNG
>>108678908
Gemmy already won
>>
>>108679092
Yeah I'm using a trackpad and I'm trying to not zoom around the page
>>108679082
Still good speeds imo
>>
>>108677781
Irrelevant but true
>>108678507
Getting a single tire replaced because it's worn is a pretty rare thing, I think the model was correct there. Next time, try saying "I need to get my tires replaced", as in you mean to buy a whole set.
>>
>>108679403
it gets slower with every update. there was a schizopost a while back that went over it, and people who replied had similar experiences where things trained slower on 12.8. 12.6 may be good but i personally just stay on 12.4 since if it aint broke dont fix it.
https://desuarchive.org/g/thread/106119921/#106125806
>>
File: 0d8d31490be9b006e4f6cb98bd1989ae480242689.png (1.3 MB)
1.3 MB PNG
>>108676460
Come home to /wait/.
>>
>>108679817
Yes, I'm comparing dense to dense.
Gemma is efficient in ingestion and thinking, while Qwen seems to favor ingesting your entire codebase no matter how small the change is.
It launches 4 sub-agents that each have to read the entire repo when I ask it to update docs, so it is very thorough. I haven't benched them yet, but it seems stronger in autonomy than Gemma and it uses agent functionality more often; the trade-off is efficiency, so Gemma still has its place for tasks that are less precise and autistic.
>>
File: file.png (497.9 KB)
497.9 KB PNG
>>108679730
If you're looking at benchmarks, yeah. It scores better on the coding and agentic indices than Kimi 2.6 and GLM 5.1; the only reason it's behind on the general index is that it scores worse on things like hallucination and long-context reasoning, which aren't included in the coding and agentic measurements. On top of that, they shipped less of the stuff they pioneered that people were looking forward to. It's a good step, but it's not the feeling of having GPT o1 at home like R1 was; the equivalent today would be having at least Opus 4.6 at home, and the gap there is too big. It's more like having Sonnet 4.6 at home, but models are moving so quickly nowadays that that isn't a big accomplishment anymore, especially when people expect the faster turnaround and performance increases the big labs are shipping iteratively every 2-3 months, which Deepseek is not operating on. Also, the 1.6T parameters are off-putting to even most CPUmaxxers, and Flash isn't anything we haven't seen already for months. Overall, I would say it's nice, but people expected it to vault above all the other models and it just didn't do that.
>>
>>108678742
Cool. Are you planning to release by the end of this thread or in the next one?
>>108679021
Same honestly
>>
>>
>>
>>108679995
No, it's this: https://github.com/open-webui/open-webui/issues/21564
apparently still broken
>>
File: Screenshot 2026-04-24 at 16-57-20 Orb.png (423 KB)
423 KB PNG
Font rendering on Mac is nice as hell. Then I come back to Windows and want to claw my eyes out.
>>
>>
>>
>>108680193
What else is there? I want a backend-agnostic server based UI (so no kobold or lmstudio) that works and looks like chatgpt where I can also import my old chats from wherever. Openwebui is the only one I've found so far.
>>
>>108680193
>outside of other ui
Like? I would switch if I knew there was something that did what I wanted.
I make use of a lot of OWUI's functionality even if not all of it, and serve it on the web for my entire family. Once I considered vibe coding my own frontend and realized it would take a lot of work to get feature parity.
>>
>>108680274
>I wanted the opinion of anons who actually used it.
Waiting on quants, but I'm excited to try the flash.
>GLM 4.6: 355B-32A MoE, could only run in IQ2 @8K context
>DS V4 Flash: 284B-13A MoE
I'm very interested in what size I can handle and its quality.
>>
>>
>>108680254
>>108680242
Damn local is in the pits when it comes to frontends....