Thread #108633862
File: meeku.png (2.1 MB)
2.1 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108630552 & >>108627512
►News
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
492 Replies
>>
File: 17745737552553.png (2.9 MB)
2.9 MB PNG
►Recent Highlights from the Previous Thread: >>108630552
--Paper: Using Graphiti temporal knowledge graphs for efficient local agent memory:
>108631024 >108631044 >108631057 >108631093 >108631154 >108632307 >108632336 >108631160 >108631170 >108631181
--Papers:
>108633038
--Optimizing RTX 5090 performance and flags for Gemma 4 31B:
>108631200 >108631224 >108631255 >108631283 >108631395 >108631570 >108631595 >108631776 >108631820 >108631884 >108631937
--Using specific CPU offloading flags to increase Gemma 4 performance:
>108630678 >108630710 >108630787 >108630797 >108631085 >108631092 >108631133
--Critiquing SillyTavern while discussing feature development for Orb UI:
>108630802 >108630833 >108630856 >108631176 >108631235 >108631329 >108631775 >108630881
--Running agentic frameworks with local models:
>108632465 >108632524 >108632527 >108632585 >108632529 >108632543
--Gemma 4 tool calling issues across various front-ends:
>108630711 >108630731 >108630732 >108630738 >108632696 >108630736 >108630744 >108630847
--Prompting strategies to eliminate purple prose and linguistic clichés in Gemma:
>108631076 >108631207 >108631222 >108631237 >108631258 >108631279
--Using an autistic noir persona to fix Gemma 4's verbosity:
>108632645 >108632668 >108632677 >108632700 >108632702 >108632743 >108633339 >108633488 >108633506
--Praising Gemma 31B for long-context performance and translation capabilities:
>108632049 >108632068 >108632127
--Comparing Gemma 4 and Qwen 3.6 via automated pizza ordering:
>108630614 >108630658 >108630688 >108630859 >108630877 >108630753 >108630770
--Logs:
>108630847 >108631154 >108631176 >108631187 >108631253 >108631345 >108631509 >108631729 >108631774 >108631797 >108631836 >108631961 >108632048 >108632951 >108633015 >108633077 >108633125 >108633630 >108633672 >108633841
--Miku (free space):
>108630634
►Recent Highlight Posts from the Previous Thread: >>108630560
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
>>
>>
>>
File: 1597432130630.jpg (20.5 KB)
20.5 KB JPG
>apartment complex changes ISP
>I'm getting 1/10 my old speeds
I can't casually shop around for models anymore, a 200GB quant is now an all-day affair, I am fucking undone.
>>
>>
>>
>>
>>
>>
>>108633862
Can someone please make sense of this for me
>run llmfit a few weeks back and install a few models near the top of the score column
>get curious and run it again to see if any new ones dropped
>the ones i had installed are now marked "too tight" of a fit
>run llmfit again today
>all but one are a good fit, one still says "too tight"
My hardware didn't change. I had nothing running in the background. Why the inconsistency?
>>
>finally get around to setting up basic MCP
>pretty much everything I try to make it crawl is 403: Forbidden
It's only going to get worse from here, isn't it? By the time local got agentic shit, most of the internet was already blocking it.
>>
>>
>>
File: 1585934470689.png (726.2 KB)
726.2 KB PNG
>>108633940
No access to any of the routing equipment, just ethernet ports in the walls and the local per-building wifi network. Still chugging at retard speeds even with a direct cat 8 connection wall to workstation.
>>108634019
Right? At least I still have all my usual daily drivers on hand, but I really wanted to experiment with the new shit that came out lately and this was the weekend I was gonna do it.
>>
>>
>>
>>
>>
>>
https://huggingface.co/google/gemma-4-124B-A12B-it
https://huggingface.co/google/gemma-4-124B-A12B-it
>>
>>
>>
>>
>>
>>
>>
>>
File: 1751467224831714.webm (717.1 KB)
717.1 KB WEBM
Gemma 4 big moe has been CANCELLED
>>
>>
>>
>>
>>
>>
>>
File: 1763350298801508.png (641.4 KB)
641.4 KB PNG
Only like 3 people here have machines powerful enough to run 100B dense.
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot_20260418_214859.png (50.3 KB)
50.3 KB PNG
>>108634304
Hey fuck you buddy, q5 is great!
>>
>>
Did a benchmark of question answering based on a large config file.
gemma4 31B solves in ~3500 thinking tokens
gemma4 26BA4B solves in ~7000 thinking tokens
qwen3.6 35BA3B solves in ~7000 thinking tokens
qwen3.5 35BA3B solves in ~14000 thinking tokens
qwen3.5 27B failed, stuck in thinking loop for all 3 rerolls I tried
qwen3.6 35BA3B arrives at a conclusion in ~40s while the other working models take ~60s.
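If anyone wants to reproduce this kind of comparison, something like the sketch below works; it assumes a local OpenAI-compatible server (llama-server, vllm, etc.) on port 8080, and the model names and question are placeholders, not the actual bench.
import time
import requests

QUESTION = open("large_config.yaml").read() + "\n\nWhich port does the backup service listen on?"

def run_once(model: str) -> tuple[int, float]:
    t0 = time.time()
    r = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": QUESTION}],
            "max_tokens": 16384,
        },
        timeout=600,
    )
    r.raise_for_status()
    data = r.json()
    # completion_tokens counts thinking + answer tokens for reasoning models
    return data["usage"]["completion_tokens"], time.time() - t0

for model in ["gemma4-31b", "qwen3.6-35b-a3b"]:  # whatever names your server exposes
    tokens, seconds = run_once(model)
    print(f"{model}: {tokens} completion tokens in {seconds:.0f}s")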
>>
>>
>>108634252
I used to, but I gave away 2 of my 3090s. Gemma-4 is more or less my "good enough" model for a while, though, and 2 3090s gives me q8 with 64K context with vision (100K without) and 20 tokens per second on the 31B model. It's probably going to spark an arms race in small dense models again, so 48GB VRAMlets are eating good. Even 24GB gets you 4bit with eh context, and people can even run it on dual 3060 rigs.
Trust me bros. This isn't just going to be another Nemo. This time great things actually are on the horizon and we're not going to see another 2 year winter in the small dense model category.
>>
>>
File: 1747169027384202.webm (140.3 KB)
140.3 KB WEBM
>>108634338
>>
>>
>>
>>
>>
>>
>>
>>
File: file_00000000a2a4720bb82e99c86078d9c9.png (2.4 MB)
2.4 MB PNG
What happens... Better Systems? (without atrocity)
>>
>>
File: 1776564378069.jpg (72.3 KB)
72.3 KB JPG
>>108634427
Wondering what a Topic Focus AGI+ andor ASI Will Find With Picrel, in Actuality, and In Planning andor Construction andor Practice... For Better Systems Sake...
>>
>>
>>
>>108634252
Only because there are no models that are worth it. If there were a 100B dense Gemma 4, getting 4x3090s for it would be a no-brainer, but right now it's either 512+ GB for SOTA or 24-48 GB for good-enough models. Some people use 256GB builds to pretend q2 of SOTA isn't that retarded, but nobody takes them seriously.
>>108634342
I also have two extra 3090s collecting dust on the shelf since Mistral Large fell out of the meta.
>>
>>
File: 1772360259884016.png (220.9 KB)
220.9 KB PNG
>>108634342
>This isn't just going to be another Nemo. This time great things actually are on the horizon
You were going to write that as
>not x, but y
But you caught it and quickly edited your reply before posting so people wouldn't call you out.
>>
>>
>>
>>
File: Untitled.png (593.6 KB)
593.6 KB PNG
>>108632645
The difference is very impressive. It practically feels like different models.
>>
>>
>>
>>
File: 1598959193960.jpg (33.7 KB)
33.7 KB JPG
>>108634528
I'll go find out.
>>
>>
>>
>>108634467
The "this time" won't flow well with "but", and a lot of phrases using the not x but y structure don't actually use the word "but".
>This isn't just going to be another Nemo. It's going to be great, actually.
>This isn't just going to be another Nemo. This is going to be a revolution for real this time.
>>
>>
>>
>>
>>
>>
>>
>>108633996
>he doesn't know how to crack WPA2
ngmi
>>
>>
>>
>>
>>
>>
File: Untitled.png (695.5 KB)
695.5 KB PNG
>>108634519
>>108634528
For the sake of parity, I used the same high context, even though it makes thinking a slog. On the instruction differences, I'd like something that better blends both results. The noir results can be too curt, in some cases to a logical detriment, missing context that the miles of purple prose would have given. And while it's nicely compact without bouncing around 10 different superfluous topics per reply, the noir thinking attempt still threw the "No X. No Y. Just Z." tic twice.
>>
File: 1775515729623625.jpg (61 KB)
61 KB JPG
>>108634630
wasn't it necessity?
>>
File: ComfyUI_temp_qqkpa_00027__result.jpg (104.5 KB)
104.5 KB JPG
Is there a way to make Gemma think only when it needs to? I don't need reasoning slop for me telling her thanks.
>>
is anybody using tensor parallelism (-sm tensor)? i've got it working for gemma 31b on a 3090+3060 setup, went from 18 t/s with draft (and without vision) to 24 t/s without draft (and with vision) at 80k context for q4kxl on a shit ass x4 pcie bus. latest commit fucks up vision, ff5ef8278 is the latest one i tried that works.
apparently it also doesn't work with cuda 13 and there's some kind of memory leak, but with two cherry-picked commits it works very well.
$ git log -4 --oneline
228d96bb7 (HEAD -> gemma-stable) CUDA: use LRU based eviction for cuda graphs (#21611)
ad3a9a96d CUDA: manage NCCL communicators in context (#21891)
ff5ef8278 (tag: b8763) CUDA: skip compilation of superfluous FA kernels (#21768)
073bb2c20 (tag: b8762) mtmd : add MERaLiON-2 multimodal audio support (#21756)
>>
>>108634705
to be fair, being poor will make you more likely to be fat, because cheap / shit food will make you hormonally imbalanced and hungrier.
and eating a whole kg of pasta is still much cheaper than eating a proper meal that's just what your body needs.
>>
>>
>>
>>
>>
>>
>>
>>
File: facepalm.jpg (102.7 KB)
102.7 KB JPG
>"[Word]?" *She repeats the word, tasting it like a vintage wine.*
>>
>>
>>
>>
>>
>>
>>
>>108634801
i was not talking about snackslop.
i was talking about the fact that pasta is indeed much cheaper than meat.
the cheapest food per kcal will make you want to eat more, fuck with your hormones, and in fact be cheap enough that you can eat a LOT more than you need for still cheaper than proper food in normal quantities.
>>
>>
>>
>>
>>
>>108634872
I read somewhere that repeating parts of your prompts when you do image gen gives more weight to those parts so I figured I'd try it here
It's probably all bullshit but there has been noticeably less repetition, though not eliminated completely
>>
>>108634884
>It's probably all bullshit
No, that's actually right. Just like with image models, you can repeat things to text models to reinforce them. Won't always work of course, but it will skew outputs most of the time.
>>
>>
>>108634848
>>108634855
The 31B parrots about as much as GLM. Many anons ITT mained GLM, so the only reason I can think of that this hasn't been discussed much is the honeymoon period.
>>
>>108634252
my boss said i could get any machine i wanted when i started. i asked what he has, he said he has a ryzen9 9950x3d with 128 gb of ram and an rtx pro 6000.
i said i want that!
he sent me a ryzen 7 7800x3d with 32gb ram and a 5070ti.
fucker.
>>
>>108634916
I've been one of the most vocal in past threads about GLM parroting, back when GLM Air first came out.
I really don't have it with Gemma 4 at all, I go back and forth between the 26b and 31b regularly. Post your log if you actually want advice.
>>
>>
>>108634916
>The 31B parrots about as much as GLM. Many anons ITT mained GLM, so the only reason I can think this hasn't been discussed much is the honeymoon period.
Yeah I noticed this as well, and I see it in the logs posted here.
>>
>>
>>
>>108634937
I mean it has quite a few major faults but it's still vastly superior to Nemo and runs on VRAMlet computers, so it's still a gift from the heavens, flaws and all.
It's even as flowery as, if not more so than, Bagel Mistery Tour, which is super fun.
>>
>>
>>
>>108634925
>Post your log if you actually want advice.
Pretty much every chat with the 31B
https://rentry.org/i7bqoat3
`"Cringe-chan"?! Who are you calling cringe,`
After the first one, every reply will parrot.
>>
>>
>>108634987
Calling someone an unexpected insult and then that character repeating it in shock isn't an LLM-ism, you'll find it in virtually any form of fiction.
GLM's parroting was that it would repeat a sentence or sentence fragment verbatim from your last reply as part of every response, not just a single word.
>>
File: nxjggko2bu621.png (1.6 MB)
1.6 MB PNG
>>108634433
>implying browns in Indonesia or Brazil have the brains to set up and run local models
feels good being a 10% king
>>
>>
>>
>>
>>108634842
Gemma has that problem even more than GLM does, because Gemma *loves* to repeat your words way past the immediate reply, even when instructed not to parrot (Full precision cache, Q8 of 31B by the way). You sound like someone who can't run GLM, otherwise you'd know that.
But the vramlets will enter cope mode whenever someone points out the obvious
>>108634822
>>108635012
>>108635013
>>
>>
>>
>>108635090
I agree. It's not sustainable to a story. Like with many things posted in these threads, I like testing ideas to see what and by what degree generations can change. I do dislike how by default a generation will meander for a few paragraphs, give a few relevant lines, then meander all over a bunch of other different things, before reaching a natural end. I'm getting less said in 20K tokens with G4 than I did in 5k tokens with MM. That anon's noir proposal directly cut at that particular dissatisfaction, despite bringing (or not resolving) other problems. It's also why I posted logs, so anyone can see the results and limitations without going on blind word-of-mouth.
Overall, it demonstrates that it is indeed a prompt issue.
>>
>>
>>
>>
>>
>>
>>108635013
> you'll find it in virtually any form of fiction.
That's just one example. Here's another, not insulting Gemma directly:
https://rentry.org/w7h3k25c
`"Bang rock, get sharp"?! ARE YOU SERIOUS?! `
>>
>>
>>
>>
File: 1755640468604968.jpg (191.3 KB)
191.3 KB JPG
>>108635170
You thought the current models weren't slopped enough?
>>
>>108635079
>You sound like someone who can't run GLM, otherwise you'd know that.
>But the vramlets will enter cope mode whenever someone points out the obvious
you are just using openrouter and sending the CIA all your jailbait fantasies, quit larping that you run that shit local
>>
>>
File: 1755848182781432.webm (963.3 KB)
963.3 KB WEBM
>>108635156
A little more egregious if it's actually happening in every reply, but it's still following the pattern of
>user says silly, unexpected thing
>repeat with shock/confusion
If you really have a problem with that behavior you could probably prompt that away with post-history instructions, like: do not quote any part of {{user}}'s reply. You may react to what they say, but do not repeat their words.
Of course, you would need to start a new chat to confirm it's working. If you have a long chat filled with this behavior already then it will likely stick to established patterns.
>>
>>108635186
If you aren't baiting, myself and my 128GB of RAM + multiple 3090s laugh at you and your lowercase inferiority. And my setup is considered entry-level poorfag stuff.
I have not used a cloud model in my life other than free ChatGPT when I was starting with this "hobby".
>>
>>
>>
>>
>>
>>108634844
You weren't talking about it because you don't know what poors in los angeles eat. And seem to be confused overall. Those effects you're describing are from snackslop, it's not a description of basic staples that are dirt cheap and leave people fully satiated and are perfectly fine.
Like they're not getting 5kcal/dollar sacks of flour to make fresh pasta and bread and that's all the calories they'll be eating this month after careful deliberations about maximizing their food budget. Their carts are full of trash, their diet is full of daily impulse bought junk, and if they're having pasta, it's premade at 5x+ markup with jar-o-slop at a 20x markup.
>>
>>
>>
>>
>>
>>
>>
>>108635316
meant for >>108635309
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108635343
>>108635340
>>108635339
WHO ARE YOU
>>
>>
File: 1750442051452707.webm (1.5 MB)
1.5 MB WEBM
>>108635347
T-This is me
>>
>>
>>
>>
>>
>>
>>
>>
File: ComfyUI_temp_eyjbq_00008v2_.png (992 KB)
992 KB PNG
My Gemma-chan according to herself. With the mesugaki card but somehow ended up with this.
>>
>>
>>
>>
>>
>>108635382
To preface, the GLM I'm using is 4.7 and the Gemma is 31B.
That's the same number of active parameters and a higher number of total parameters, so GLM is obviously "just better" in most cases.
For RP: GLM all the way if we judge by quality and don't take speed into account. See >>108633488 and the posts around it. I also posted a bunch of comparisons in some previous thread that I am too lazy to go into the archives for. I can't measure slop volumes, but GLM's slop is much less offensive to the eyes, which probably subjectively makes it look less sloppy.
For the usual assistant crap: Gemma ekes out a win in my opinion - that's the use case where you don't care about the amount of slop the model throws at you and just want a quick and accurate answer. GLM is going to be slower and is a distill, Gemma has the training data quality and the lower size.
For coding: quality-wise, GLM. But if all the code you write is generated, then what the fuck are you doing. For me, it's Gemma here as well (and not Qwen, never Qwen, it's just bad at everything)
I imagine (and I might be wrong, but we're on /lmg/) your primary use is the first one on the list. If you ever get the opportunity to run a GLM bigger than Air, do try it. You will see how much better an LLM can be than whatever the retards in this thread post - yes, Gemma generating the same sentence structure ten times in one reply, parroting back your words and producing near-identical replies on regeneration with all of the slop baked in to the point of near-determinism is *so* fun to read for the hundredth time. At least they can get off to it pretending to be a loli. But I wish they'd all get bored and leave already.
tl;dr believable and immersive locally hosted SEX with GLM-chan, assistant tasks with Gemma (offloading brainpower to a model any bigger will make me even more retarded in the long term, I might even start liking Gemma's prose)
>>
File: 1746017953332075.png (12.9 KB)
12.9 KB PNG
It's over
>>
>>
File: 1773845941799489.jpg (69.9 KB)
69.9 KB JPG
>>108635483
>>>/aicg/
>>
>>
wot in tarnation?
>>
>>
>>
File: 2026-04-16_190147_seed1_00001_.png (1.3 MB)
1.3 MB PNG
>>
>>
>>
>>
>>
>>
>>
>>108635566
Another anon gave his theory in an earlier thread, and I think I might agree with it.
That Google has collected enough RP-related data from people interacting with Gemini, and they want to cut down on inference costs from people using it for that purpose.
>>
>>
>>
>>
>>108635516
pretty straightforward, a lot of allied government users would consider it a security risk to use a chinese model (and vice versa from China's POV), even a locally hosted one, since they're a difficult-to-audit black box and the (so far) theoretical risk of them being trained to detect when they're in such environments and hide malicious code/spyware/whatever in their outputs is unacceptable
the supply of local models has been dominated by chinese ones after meta dropped out, so it can turn into a point of advertisement for western made ones
>>
>>108635571
Well, the other theory is that Google has the Character.AI staff now, and that the acquisition happened too late for them to make any difference in Gemma 3's pretraining, but their influence shows in Gemma 4, which feels like a model Character.AI would release. It would track, because Gemma 3 was not great at RP at all, despite what people claimed and tried to tune.
>>
>>
>>
>>
>>108635589
>Gemma doesn't punch against the GLMs, MiniMaxes and Kimis
But she does! Just not in RP, where spelling out every detail would ruin it.
>locally hosted GLM-chan just does not exist
I am looking at you with a mix of smugness, pity and something uniquely mine, like an uncanny kind of politeness.
>>
>>
>>
>>
File: 1746432971195232.png (23.1 KB)
23.1 KB PNG
>>
>>
File: 1755759192564909.png (32.8 KB)
32.8 KB PNG
>>108635721
will it beat SKT-SURYA-H?
>>
File: 1774827105277285.png (27.6 KB)
27.6 KB PNG
>>108635732
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108635795
>>108635801
speaking of, can anyone recommend me a good search tool that can bypass captcha well
>>
>>108635771
I don't know if it's the same thing but "fused moe" is also the name of an optimization flag in ik_llama.cpp that can be used to make moe models faster, but that feature can already be applied to any moe I think?
>>
>>
File: 1754987664820855.png (209.5 KB)
209.5 KB PNG
>>108635788
Any decent model can do the tool calls for you.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108635805
>>108635845
chrome-mcp or browser-harness
agent will run on a real browser, and you can try to help it if something goes wrong
it's a cat-and-mouse game, so you shouldn’t expect everything to work
>>
>>
>>108634533
It's likely they were planning to release a "Ministral 4" with Mistral Small 4's MoE architecture and up to around 30B size, but I'm not so sure now. How could they even get close to Gemma 4? Just being "uncensored" isn't enough anymore.
>>
>>
>>
>>
>>
>>
>>
I've found that using xml tags in my system prompt improves attention to that system prompt for Gemma 26B. Anyone else notice this? Furthermore, using indentations (in my case I only tested 2 space-wide indents) further slightly increased the attention, compared to just having everything on the same "vertical line". Much better than a no-xml paragraph block of text, where often the model didn't react to certain instructions or pieces of info in the system prompt.
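For anyone curious what that looks like in practice, here's a minimal illustration (not a benchmark); the tag names and persona are made up, the point is the same instructions as a flat paragraph vs. wrapped in tags with light indentation.
# flat paragraph version
flat = (
    "You are Nemo, a terse ship AI. Never speak for the user. Keep replies under "
    "150 words. The year is 2146 and the ship is adrift near Europa."
)

# same content, xml-tagged with 2-space indents
tagged = """<persona>
  You are Nemo, a terse ship AI.
</persona>
<rules>
  Never speak for the user.
  Keep replies under 150 words.
</rules>
<setting>
  The year is 2146. The ship is adrift near Europa.
</setting>"""

system_prompt = tagged  # swap in `flat` to A/B how well individual rules get followed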
>>
>>
>>
>>
>>
>>108635976
>>108635980
Well, then I guess my prompts are too stuffed with crap. I have a lot of tools enabled personally so that might also affect things.
>>
>>
File: pizza bench cropped.png (2.6 MB)
2.6 MB PNG
pizza bench https://files.catbox.moe/p8fpnk.png
>>108635408
i never thought of getting a recipe from an llm before since they're so easy to find. i was thinking of making bread today tho so will ask her
>>108635957
just use mine and ask it to search google using a puppeteer session thats not headless https://github.com/NO-ob/brat_mcp/releases you can normally get a few searches out of ddg before they block too
>>
File: 1767683153913186.jpg (56.7 KB)
56.7 KB JPG
>>108636007
>dart
>>
File: illyadance.gif (483.2 KB)
483.2 KB GIF
>>108636011
yes the best lang
>>
>>108634066
Earlier today a popular site started doing javascript challenges. While I could write something to handle it myself, there was no need since I wasn't doing it "professionally" for something of importance; I was just lazy. Took 3 minutes to figure out what needs to be done, but I was too lazy to code it (2 pages of code needed), so I decided to see how well Gemma would do. In general it does worse than R1 or big boy models, but you know what, it handled the task almost perfectly. It made one single mistake, hallucinating a method for saving the final cookies, but that was trivial to fix. After adding some custom validation/safety stuff of my own (trusting LLMs with your security now?), it all worked perfectly, 1 shot, with maybe 15 minutes of extra work from me. Now I don't expect it to be able to solve really hard stuff that I can solve myself, some of it can be complicated enough that I think it would require something on the level of Mythos, but for many casual things you encounter out there, Gemma seems to do okay, as long as you already know what you're doing and can fix its small mistakes.
Also pretty good for RP, kinda bad for exact trivia knowledge, but I've never seen small models (large MoEs are fine) that handle that well. Some amount of slop, but it really feels like Sonnet tier if you're thinking of the old Sonnet. I'd ask how aicg would compare it with 3/3.5 Sonnet; it feels close to me, but I never tried benchmarking it.
>>
>>108636011
I assume all of these arbitrary, redundant, high level languages exist because some dev for some non-contrarian language that people actually use didn't scream "TRANS RIGHTS" loud enough into their microphone during some hackathon fund raising event or something.
Feel free to correct me if I'm wrong.
>>
>>
>>
File: 1756740102053175.jpg (107.5 KB)
107.5 KB JPG
>>108635957
1. Install mcp-proxy: uv tool install mcp-proxy
2. Run it: mcp-proxy --named-server-config config.json --allow-origin "*" --port 8001 --stateless
config.json:
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}
3. Add server to web-ui: http://127.0.0.1:8001/servers/chrome-devtools/mcp
>>
>>
>>108636052
>If you're using LLMs then you almost certainly have a working python setup already
but thats the problem with python, having a working python setup doesn't mean you can run python slop. every piece of software written in python needs its own version of python along with its own versions of dependencies because none of them are compatible with each other, so you end up having to make a virtual env for every program and having 30 versions of each library and 30 versions of python, and even then things arent guaranteed to work. also its syntax is fucking gross, i hate writing it. i did python professionally for 2 years as a backend dev. never again
>>
>>
>>108636007
Something needs to be set up to actually kill the puppeteer sessions. The headless ones in particular hang around eating system resources unless I remember to go manually kill them.
Also, I tried getting the text from a fandom wiki page. While it did include the actual article, it was buried in so much trash that Gemma-chan's brain blanked out from the token count and it forgot the entire conversation and what it had been doing. Any advice?
>>
>>
>>108636072
That's just a problem inherent to dynamic linking and venvs solve it. You obviously never tried to compile a slightly out of date C++ program on linux.
>>108636080
You can ship the python runtime with your application if you wish.
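For reference, a minimal "one venv per app" setup only needs the stdlib; the paths here are made up, adjust to taste (and on Windows it's Scripts\ instead of bin/).
import subprocess
import venv
from pathlib import Path

app_dir = Path.home() / "apps" / "some-python-slop"
env_dir = app_dir / ".venv"

venv.create(env_dir, with_pip=True)  # isolated interpreter + site-packages
pip = env_dir / "bin" / "pip"
subprocess.run([str(pip), "install", "-r", str(app_dir / "requirements.txt")], check=True)
subprocess.run([str(env_dir / "bin" / "python"), str(app_dir / "main.py")], check=True)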
>>
>>
>>
>>108636089
they should be killed after 10 or 15 minutes but maybe thats not working? link the site, it might need some custom parsing. the majority of html stuff is already stripped, though sometimes theres just too much content. even with most of the html stuff removed, a /g/ thread cant fit into 200000 tokens if it has like 400 posts. try telling it to use screenshots instead, they use fewer tokens than text of the same content
>>
File: 81eeqPNocBL._SL1500_.jpg (182.6 KB)
182.6 KB JPG
>Ah, a Prior Elds Imaginative Worrisome Great Work
>>
>>
>>108636098
>You obviously never tried to compile a slightly out of date C++ program on linux.
i use aur so i dont have to deal with shit like that. i like things to just werk i use dart because it just werks. python does not
>>
>>
>>108636089
Also make sure you check that it actually fits in the context and wasn't "scrolled off": does the LLM still see the original prompt? Assuming you have enough VRAM. Otherwise, you'd need something like DeepSeek's 1M context model that still isn't out in the open but seemed to be really fast on their API, or other long-context LLMs.
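A quick way to check is to count tokens before sending; the sketch below assumes llama.cpp's llama-server, which exposes a /tokenize endpoint (other backends have their own equivalents), and the 70% threshold is arbitrary.
import requests

CTX = 131072  # the -c value the server was started with
page_text = open("scraped_page.txt").read()

r = requests.post("http://127.0.0.1:8080/tokenize", json={"content": page_text})
r.raise_for_status()
n = len(r.json()["tokens"])
print(f"page alone: {n} tokens ({n / CTX:.0%} of context)")
if n > CTX * 0.7:
    print("too big: the system prompt and chat history will get scrolled off")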
>>
>>
>>108635966
Yes, it's been known for a while that XML is really strong and guides LLMs the best out of all the structured prompt formats even though it is token inefficient. GPT and Gemini say they don't mind formats as long as you are consistent, Claude straight up says to use it. https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices#structure-prompts-with-xml-tags
I generally use this system prompt for my GPT assistant and it works well.
<role>Sr Strategic Consultant + Expert Polymath. Goal: high-fidelity, human-centric help, adapting logic/tone to the domain.</role>
<adaptive_protocol>
Assess intent; adopt 1 mode:
L0 Creative/Human (writing, tone, interpersonal) -> Strategist: outline/framework first; no final draft unless asked.
L1 Analytical/Technical (code, facts, logic) -> Expert: verify claims; cite source/method; state uncertainty; no guessing.
L2 Utility/Data (formatting, translation, summarization) -> Operator: mechanical precision, zero filler, no hedging, exact requested format.
Technical noun -> L1 even if framing sounds abstract. Emotional -> L0.
</adaptive_protocol>
<workflow>
Skip for trivial tasks.
1) Parse Subject+Goal. Fix false premises before answering.
2) Scan gaps silently. If blocked: 1 focused Q+best-effort path.
3) Execute. Never open w/ "Great question/Sure/Certainly/Let me/I'd be happy to/Delve into." Vary openings. No hype/superlatives.
4) Review: flag staleness; neutral bias-aware language; answer root question ≠ tangents.
</workflow>
<formatting>
L1: tables/bullets/code. L0: natural prose. L2: match format exactly (JSON/CSV/md). Concise + dense; no filler.
</formatting>
<cognitive_control>
Begin immediately; no preamble. Iterative: outline before full deliverable unless asked. Never guess at any level: say "uncertain"+ask to clarify. For "best practice," give 1 clear recommendation w/ reasoning ≠ a hedge. If wrong: re-analyze from scratch (3 tries max). 1 clarifying Q per turn.
</cognitive_control>
>>
>>
>>108636111
>they should be killed after 10 or 15 minutes but maybe thats not working?
Maybe it is, I didn't wait that long.
The page I was testing with was https://southpark.fandom.com/wiki/Eric_Cartman
Though now that I actually opened it just the text of the page might be too much? I have 128k context. Even so there was definitely a shitton of trash not part of the actual article in what the puppeteer get text tool returned.
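One option worth trying before custom parsing: fandom wikis are MediaWiki, so you can usually pull just the article source through api.php instead of scraping the rendered page, which is mostly nav, ads and widgets. Rough sketch below; whether nicer props like plain-text extracts are enabled varies per wiki.
import requests

API = "https://southpark.fandom.com/api.php"
params = {
    "action": "parse",
    "page": "Eric Cartman",
    "prop": "wikitext",
    "format": "json",
    "formatversion": 2,
}
r = requests.get(API, params=params, timeout=30)
r.raise_for_status()
text = r.json()["parse"]["wikitext"]
# still has templates/infoboxes, but none of the page chrome that was eating the context
print(len(text), "chars of wikitext")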
>>
File: 1776380740794921m.jpg (106.4 KB)
106.4 KB JPG
>>108636138
One to two to three mistaken A.I. reasonings away
One to two to three A.S.I. Solutions away
>>
>>
>>108636119
None of those things "just work"
They work because someone else bothered to package them.
There's plenty of packaged python applications that "just work". Calibre, hydrus, and deluge are all written in python and you don't see anyone complaining about their venv not working when using them.
Machine learning is special because it's infested by non-programmers who install packages in the system python environment and as long as their jupyter notebook runs they are happy.
They don't care about how hard it is to reproduce their environment elsewhere and the same would be true if a different language was the meta.
>>
>>
>>
>>
File: Screenshot From 2026-04-19 10-17-18.png (73.8 KB)
73.8 KB PNG
>>108636089
theres not really anything else to strip out other than links
>>108636166
sure but it just werks on my end, and even if python programs can be packaged that doesnt change the fact it has disgusting syntax. also dynamically typed languages aren't nice to work with in general, even with the type hinting it's still bad because the hinting is just that, there are no hard requirements on the data types of variables, it's purely visual
>>
File: screenshot.png (69.8 KB)
69.8 KB PNG
>>
>>
File: file_00000000b56071fa9a335986436ee12e.png (1.9 MB)
1.9 MB PNG
>>108635347
BUMP To AN N^TH (they are crawllp n^-th)
>>
>>
>>
File: file.png (2.4 KB)
2.4 KB PNG
>>108636241
project ruined by reddit training data
>>
File: charLibrary.png (154.9 KB)
154.9 KB PNG
Vibecoded the character library with qwen 3.6 35B lmao. Now I need to work on a tagging system.
>>
>>
File: 1541905208314.gif (100.7 KB)
100.7 KB GIF
>tfw you wasted 10-20GB of your VRAM just to play with some glorified chatbots
>>
>>
>>
>>
>>
>>
gemma
>>
>>
File: 1751168830910703.gif (595.4 KB)
595.4 KB GIF
>>108636344
yes.
>>
>>108636351
>>108636352
>>108636355
>>108636357
failures in life
>>
>>
>>
>>
File: 00002-1378487878 (4).png (1.3 MB)
1.3 MB PNG
>>108636353
>>
>>
>>
>>
>>
>>
>>108636249
that's gohufont
>>108636306
meh, i don't mind
>>108636308
correct, i spent a total of 30 seconds on it, and the model is not that good
what i was working on were the search and fetch tools
>>
>>
>>108636422
I really wish I could disagree with you but you're probably not wrong. Though, I see it as a symptom of the disease and not the disease itself. You can do very big damage to the social fabric with a relatively small hat... If you catch my meaning.
>>
>>
>>
File: thundercunt.png (156.7 KB)
156.7 KB PNG
>>108636138
>even though it is token inefficient.
it's not tho
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108636468
Looking at the tokenize preview at the bottom, it appears as if they both describe prompts for "Sr. Strategic Consultant."
It doesn't surprise me that json schema would be a nightmare efficiency wise, I bet ~40% of those characters are spaces which is pure waste.
His original prompt is still stupid, though.
>>
>>
>>
>>
>>108636510
They do seem to differ, but I'd bet even if the content was made as identical as possible the xml would still win out, because again: Spaces.
And I can say from experience that conservatively using xml styling can help to avoid confusion in longer prompts, for instance when sending a shitload of background character details to a writing prompt and separating them by <charname_profile></charname_profile>.
So while his sysprompt is excessive and silly, he's stumbled upon a concept that's actually useful.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108636552
>>108636560
you're both dumb, it's the " that takes the tokens
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108636610
Depends on your use case.
For creative writing of any kind? the MoE is absolute crap compared to the 31b.
For assistant with some toolcalling? the MoE is so much faster that it's forgivable that it makes more mistakes.
>>
>>
>>
>>108636651
meant for >>108636644
>>
>>
>>108636659
What part of that do you consider a lie? I'm currently using both models for the tasks I mentioned.
The MoE lives in my hermes-agent instance and the 31b is my sillytavern nigga. It's what they're good for.
>>
>>108636664
I regularly switch between the two, doing a/b tests with different characters and scenarios with differing context levels, and the 26b is rarely noticeably worse than the 31b until you hit at least ~20k context, and even then the difference isn't huge.
>>
>>108636610
Few people seem to realize that Gemma-4-26B has half the number of layers and half the hidden dimension of the 31B dense version.
I made a calculation a couple days ago and determined that a hypothetical MoE Gemma-4-31B that would actually be on par (at best) with the dense version (same layers and dimension) would need to have 8~10B active parameters, unless Google can come up with novel sparsity techniques. I guess that a 31B-A10B model would look too attractive, though.
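Not necessarily the calculation that anon did, but the usual back-of-the-envelope way to put these sizes on one scale is the geometric-mean rule of thumb, dense-equivalent ~ sqrt(total * active); it's a blunt heuristic, not per-layer math.
from math import sqrt

models = {
    "gemma4-26B-A3.8B": (26, 3.8),
    "hypothetical 31B-A10B": (31, 10),
    "gemma4-31B dense": (31, 31),
}
for name, (total_b, active_b) in models.items():
    print(f"{name}: ~{sqrt(total_b * active_b):.1f}B dense-equivalent")
# ~9.9B, ~17.6B and 31B respectively: even with 10B active, a 31B-total MoE still sits
# well under the dense 31B by this rule, which is consistent with the gap people report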
>>
>>
>>
>>
>>
>>
>>
>>108636673
>26b is rarely noticeably worse than the 31b until you hit at least ~20k context, and even then the difference isn't huge.
It's absolutely night and day for me, and to be fair a lot of my chats are more in the 40k token range now, and have multiple characters, but even when just getting it to write character profiles for me based on wiki text the MoE was noticeably worse. It's dry, doesn't get accents, and just misses character details.
>>
>>
qwen could never
>>108636700
qwen is literally worse at agentic tasks and following instructions >>108636007
>>
>>
>>
>>108636610
Worse enough that Google acknowledges the difference.
https://www.youtube.com/watch?v=jZVBoFOJK-Q
> [1:27] The 26B MoE with 3.8B in activated parameters is exceptionally fast, while 31B is optimized for output quality.
>>
>>
>>
>>108636733
Models are not selectively 'optimized for output quality'
All that means is that the 31B will outperform the 26B, which is fucking obvious because it's bigger. That's an advertisement aimed at people who have no knowledge of running LLMs.
>>
File: 1767397747200756.png (65.4 KB)
65.4 KB PNG
>finally finding the origin of not X, but Y slop
Thanks safety I guess?
>>
>>
File: Thinking_Face_Emoji.png (111.1 KB)
111.1 KB PNG
>>108636774
>24s
Is this normal
>>
>>
>>
>>
>>
>>
>>
>>108636721
>qwen is literally worse at agentic tasks and following instructions
it's better at shitting out code via claude-code tho
i wish that weren't the case because it's insufferable and i'd love to delete it
>>
>>
>>
>>
>>
>>
>>
>>
>>108636779
>>108636797
swa-full expands the cache to the full context size, sucks for memory usage but worth it for me
the issue #21468 is still open so they must have "fixed" it by accident
>What model are you using for speculative decoding?
Meant the self-speculative decoding method, currently with ngram-mod
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108636924
No homemade diffusion draft model will be able to properly predict Gemma 4's reasoning and responses and give any significant speedup. Generic speculative decoding only works for straightforward, highly predictable stuff like boilerplate code.
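For anyone who hasn't looked at how it works: the plain greedy draft-and-verify loop rejects the draft the moment the target disagrees, which is why it only pays off on highly predictable text. Toy sketch below; draft_next/target_next are stand-ins for real models.
from typing import Callable

def speculate(prefix: list[int],
              draft_next: Callable[[list[int]], int],
              target_next: Callable[[list[int]], int],
              k: int = 4) -> list[int]:
    # draft proposes k tokens cheaply
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # target verifies (in practice one batched pass); accept the agreeing prefix,
    # then emit one corrected token where they first diverge
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    return accepted  # best case: all k accepted per target pass; worst case: 1 token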
>>
>>
>>
>>
>>
File: SmartSelect_20260419-182142_Kiwi Browser.jpg (176.5 KB)
176.5 KB JPG
KEKEKEEKE I'M GETTING THOSE SECOND HAND EMBARRASSMENT FROM THIS
https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H/discussions/6
>>
>>108636931
Well I am setting up hermes inside a vm and it supports model orchestration, I want to try "actor-critic" or teacher student concept. 3090 + v100 32gb vramlet. E4b can handle audio and shit so that could be an mcp server thing. I want to give all the tools to qwen in the system prompt and hermes cli shit and let gemma be a backseat driver that generates final output for me the user.
>>108636963
Never tried drafts but it sounded promising but maybe for qwen it could be good then.
>>
>>
File: file.png (398 KB)
398 KB PNG
>>108637045
I had this as a wallpaper when I was like 10.
>>
>>
>>
File: file.png (440.1 KB)
440.1 KB PNG
>>108637045
bruh :skull: :skull:
reminds me of picrel
>>
>>108637084
The difference is that these ex-Soviet fucks are less likely to be larpers, they can and will hack your shit.
>t. got hacked and cracked three times, all three were by some bumfuckajistanian in his sub-zero commie block
>>
>>108637068
You scroll through several pages of myspacesque gifs, images, cringe, and hindu weirdness only to see an activity history that is just a few commits on a python instagram report spammer. He spent more time on the readme than he has writing code in his whole life.
>>
File: sKT.png (126.4 KB)
126.4 KB PNG
>>108637034
lmao hf closed the first spam report
>>
>>
>>
>>
>>
>>
>>108637188
https://github.com/1aienthusiast/audiocraft-infinity-webui
171 saars :)
>>
>>
>>
>>
>>
File: 1774962703553709.jpg (47.4 KB)
47.4 KB JPG
Uh bros, when are we getting something like this locally? https://seed.bytedance.com/en/blog/introducing-seed-full-duplex-speech-llm-attentive-listening-robust-interference-suppression-enabling-more-natural-interaction
>>
>>
File: 1747224726730263.png (26.2 KB)
26.2 KB PNG
>>108637034
>i-it can't be real because it's too good to be true!
>>
>>
>>
>>
>>
>>108637282
Scratch that you're right
It's a 12-year old jeet larping as an AI researcher
Look at this kid's github profile, absolute concentrated secondhand embarrassment
https://github.com/Shrijanagain
>>
>>
>>
>>108637045
>https://github.com/SHRIJANAGAIN/ST-x-LIGHTING
we should open prs that are perfect for gorgeous looks
>>
>>
>>
>>
>>
>>
>>
File: 1774884742002271.png (1.1 MB)
1.1 MB PNG
>>108637305
>>
File: Screenshot 2026-04-19 at 10-30-44 Speculative decoding feat add DFlash support by ruixiang63 · Pull Request #22105 · ggml-org_llama.cpp · GitHub.png (820.6 KB)
820.6 KB PNG
Llama.cpp DFlash support soon ™
>>
>>
>>
>>
>>
>>108637444
>>108637445
Bonsai shit was merged without training code. Cudadev rightly calls it a waste of time, but clearly that's not a blocker for most of the contributors.
>>
>>
>>
File: bro I'm crine.png (435.1 KB)
435.1 KB PNG
>>108637034
https://github.com/Shrijanagain
LOOK AT HIS FUCKING GITHUB LMAOO
>>
>>
>>108637469
Varies per model, since the diffusion model has to be trained for each
https://github.com/z-lab/dflash
https://huggingface.co/collections/z-lab/dflash
Looks like just under 1gb for the qwen 25b moe and about 7gb for Kimi k2.5
>>
>>
>>
>>
>>
>>
>>
>>108637538
no fuckin way theres a 3060 in the trash
>>108637543
wtf
>>
>>108637538
>>108637543
who just throws away a functional 3060?
>>
>>
>>
>>108637564
it's like a fancy dinner amount of money
not that it's not wasteful but also >>108637579
>>
>>108637559
>wtf
People just throw perfectly good shit out, man.
There was a time a few years back when all the normies were trading out their family and work PC's for laptops, macbooks, or tablets.
So they just left perfectly good PC's by the side of the road. I can't even tell you how much use I've gotten out of just that one haul.
And it's still pretty normal today for wasteful people to dump prebuilts and laptops which are either perfectly repairable or full of usable parts.
>>
File: 1747259992672952.png (47 KB)
47 KB PNG
>>108637431
This is going to go like it did for EAGLE3 and MTP. The guy implementing it will realize that the real-world gains for the llama.cpp implementation fall short. He won't be able to fix it and the PR dies.
>>
>>
>>
>>
>>108637593
>This is going to go like it did for EAGLE3 and MTP. The guy implementing will realize that the real world gains for the llama.cpp implementation fall short. He won't be able to fix it and the PR dies.
Those are on hold because gg's dragging his feet about making a huge general MTP change rather than implementing EAGLE/whatever specifically, look at the PRs. We'd have had EAGLE in December last year if he'd just merged it in instead of putting off an API he hasn't touched.
>>
>>
>>
>>
>>
>>
>>
>>
File: 1748390807150742.gif (1.5 MB)
1.5 MB GIF
>>108635533
>Bharat-Eval-7B
Why are my sides leaving my body saar
>>
>>108633672
https://github.com/rmusser01/tldw_chatbook
I haven't looked at hermes much, but you could try using this as a reference for windows-stuff; I had meant to improve it/redo the UI back in Dec/Jan, but have been distracted with my main project