Thread #108659983
File: 2026-04-16_190147_seed1_00001_.png (1.3 MB)
1.3 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108655009 & >>108650825
►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
469 Replies
>>
File: reward function.jpg (183.8 KB)
183.8 KB JPG
►Recent Highlights from the Previous Thread: >>108655009
--Optimizing MoE throughput via expert-specific VRAM placement and KTransformers:
>108657760 >108657782 >108658414 >108658470 >108658530 >108658575 >108658621 >108658586 >108658667 >108658692 >108658791 >108658869 >108659015 >108659058 >108659103 >108659121 >108659225 >108659116 >108658708 >108657857 >108658768 >108658845
--Discussing GPT-Image-2 performance, agentic RP frontends, and prose refinement:
>108655453 >108655622 >108655648 >108655651 >108655698 >108655674 >108655760 >108655836 >108655927 >108656114 >108656231 >108656247 >108656273 >108656305 >108656444 >108656521 >108656543 >108656550 >108656581 >108656955 >108656988 >108657014 >108657067 >108658723 >108657487 >108655952 >108655999 >108656025 >108656045 >108656052 >108656077 >108656111 >108656050
--Discussing llama-server prompt re-processing and KV-cache checkpoint issues:
>108655857 >108655885 >108655892 >108657373 >108657410 >108655920
--Speculating on Engrams and the adoption of Mamba-hybrid architectures:
>108655522 >108655563 >108655575 >108655607 >108655690 >108655839 >108655652 >108655664 >108655696
--Discussing Heretic's string matching limitations and soft refusal detection:
>108657013 >108657036 >108657050 >108657098 >108657128 >108657078 >108657135 >108657469
--Comparing Qwen 3.6 and Gemma 4 performance and tool use:
>108655272 >108655289 >108655291 >108655338 >108656365 >108657503 >108659276
--Comparing Kimi-K2.6 performance and hardware requirements against Gemma:
>108656722 >108656741 >108656785 >108656867 >108656855 >108656868
--Logs:
>108655476 >108655552 >108655622 >108655698 >108656305 >108656399 >108657859 >108657977 >108658392 >108658439 >108658584 >108658643 >108659124
--Teto, Miku (free space):
>108655633 >108655652 >108656114 >108658404 >108658791
►Recent Highlight Posts from the Previous Thread: >>108655011
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
I never really dabbled with local LLM's and there are so many models idk what I should go with.
>3090, 24GB VRAM
>something tuned for troubleshooting tech problems
>something tuned for language agnostic coding
>maybe something small and simple for Illustrious/NoobXL prompting
>ideally both something to fill up all of my VRAM and something lighter (8-12GB) depending on how much buffer I'll need
>>
>>
>>
>>
File: kek.png (93.6 KB)
93.6 KB PNG
https://xcancel.com/zerohedge/status/2046706218924691894#m
I hope you're ready to heat up your 4TB RAM machine anon, good times are coming
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108660268
>As long as the shared experts are in VRAM
Also I don't know shit about fuck about LLM terminology or how to properly run it. I'm guessing the "experts" are an inherent part of the model and I can pass a CLI argument to llama.cpp to offload that part of it to VRAM?
>>
File: elephant strawberry.jpg (70.4 KB)
70.4 KB JPG
you now remember "strawberry model" hype cycle
>>
>>
>>108660279
Qwen 35b a3b for example: 35b total parameters but only 3b are active per token as experts, and you have many of these experts. Very fast vs dense.
Dense models are smarter, but they're slower and offloading a dense model across different gpus/cpu is slow as balls, while MoE trades some of that quality for speed.
>>
>>108660279
NTA but yes, MoE (Mixture of Experts) models are a type of LLM which can feasibly be split between VRAM and RAM without a massive speed penalty. llamacpp has an -ncmoe argument for automatically shoving the expert weights into ram (it takes the number of layers whose experts to keep there), so you'd have something like -ngl 99 -ncmoe <N> in your args.
Conversely a 'dense' model is the original format and slows to a crawl if you put it into ram rather than vram, even partially.
You can usually tell the two types apart at a glance because a dense model will just be called whatever-100b whereas an MoE model will be whatever-100b-a5b because it's separated into total and active parameters
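To make it concrete, a minimal sketch of the invocation (filenames and the -ncmoe count are placeholders, raise or lower the number until your VRAM is nearly full):
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 -ncmoe 20 -c 32768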
>>
>>
>>108660302
nvm, googled it up and got my answer lol
https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0
>>108660317
>>108660312
Good to know, thanks. Guess it makes plenty of sense to design them like this nowadays since the quality of the model depends on how much context data you have, but you don't necessarily need all of it loaded all the time for it to be speedy and accurate, so the RAM offload is a good compromise. idk I mainly dabbled more in diffusion models so my LLM knowledge is a bit sparse
>>
>>
File: Screenshot_20260422_131828_Chrome.jpg (636 KB)
636 KB JPG
>>108660319
Bro use ai mode or sumfin. Copy paste terminal errors and shit in there until it works
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1757612581632355.png (21.8 KB)
21.8 KB PNG
>>108660349
We webui engineers are gooder
>>
>>
>>108660360
It actually does. If you want to specifically put the shared expert on a device (i.e. a gpu), -ot with a regex is the correct answer.
More generally though, the shared expert should automatically go to vram with -ngl
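Rough sketch of what that looks like (the regex and device name here are assumptions, check the actual tensor names printed at load time before copying it):
llama-server -m model.gguf -ngl 99 -ncmoe 30 -ot "ffn_.*_shexp=CUDA0"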
>>
File: GPT Image 2.png (1.2 MB)
1.2 MB PNG
kek
>>
>>
>>
>>
>>
https://docs.cactuscompute.com/latest/blog/turboquant-h/
>TurboQuant-H shares the core insight with TurboQuant; rotation concentrates coordinates into a well-behaved distribution, enabling aggressive scalar quantization, but simplifies the pipeline for offline weight quantization. Follow the link for a deeper dive into the technique.
>Cactus baseline used INT4 linears + INT8 embedding, yielding 4.8GB for E2B (5B total params). TurboQuant-H squishes this to INT4 linears + INT2 embeddings, reducing to 2.9GB. The perplexity on our calibration went from 1.8547 to 1.9111, complete evaluation coming in the paper.
desu if I can go from Q8 to Q4 KV cache I'll be happy, this shit is eating so much VRAM
>>
File: 1763251779552153.jpg (187.2 KB)
187.2 KB JPG
>>108660349
Cenile interface
>>
>>
Threadly reminder if you're using your llm for coding or anything that requires repeating something in context almost verbatim and you're not using --spec-type ngram-mod --spec-ngram-size-n 24 --draft-max 64
You're leaving a shitload of performance on the table.
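i.e. tacked onto a normal llama-server line, something like this (sketch; flag names as above, assuming your build is recent enough to have them):
llama-server -m model.gguf -ngl 99 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-max 64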
>>
>>
>>
>>
File: 00001-1378487878.png (1.4 MB)
1.4 MB PNG
> Another day
> Still no V4
Happy Wednesday I guess.
>>108660349
I'd really like a terminal interface / green screen RP engine. But not enough to spend the tokens vibe coding one.
>>
>>
>>
>>
>>
>>
>>108660589
I think it's a pretty neat idea, but it's completely anathema to how I do my best to stop context from reprocessing ever. If I was running my backend with parallel requests and it were properly designed around that, I'd probably use it over other rp frontends.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1749908189602018.gif (1.5 MB)
1.5 MB GIF
>Mythos leaked
>turns out to be an overhyped dud
>either not as good as Anthropic claimed it was or so oversized it's basically useless without a gargantuan datacenter only a few US corpos have making it more of a money sink than an existential threat
>major investor doubt arises when the model that was supposedly """too dangerous to release""" was yet another overhyped trashfire that was never financially viable to begin with
please god let this happen it would be so fucking funny
>>
>>
>>
>>
Any suggestion to improve my kobold batch file for gemma 4 31b?
=========================
@echo off
SET KOBOLD_EXE=koboldcpp.exe
"%KOBOLD_EXE%" ^
--model "D:/Models/LLM_Models/lmstudio-community/google_gemma-4-31B-it-Q6_K-gg uf/google_gemma-4-31B-it-Q6_K.gguf" ^
--mmproj "D:/Models/LLM_Models/lmstudio-community/coder3101_gemma_4_31b_it_here tic-Q6_K.gguf/mmproj-model-f16.gguf " ^
--port 5001 ^
--threads 8 ^
--usecuda 0 mmq ^
--contextsize 32768 ^
--gpulayers 99 ^
--tensor_split 8.0 32.0 ^
--maingpu 1 ^
--batchsize 512 ^
--noshift ^
--useswa ^
--usemmap ^
--multiuser 1 ^
--highpriority ^
--jinja ^
--jinja_tools ^
--jinja_kwargs "{\"enable_thinking\":true}" ^
--draftamount 8 ^
--draftgpulayers 999 ^
--chatcompletionsadapter AutoGuess ^
--defaultgenamt 1024 ^
--maxrequestsize 32
pause
>>
>>108660075
>>108660365
>>108660630
>>108660674
for a text gen general, reading comprehension skills in here are abysmal
>>
>>
>>
File: hlsgb1l7ycwg1.png (845.5 KB)
845.5 KB PNG
>>108660701
yeah, why are you using that specific quant?
pretty atrocious graph, i doubt 31b is better.
>>
>>
>>
>>108660735
>sloppy, looping mess
stop using text completion
>>108660741
>unsloth-made graph promotes unsloth
noway??
>>
vramlets it's your time to shine
https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
https://github.com/itayinbarr/little-coder
>>
>>
>>
>>
>>108660749
yeah they are faggots but you trust closed lmstudio "community" more? who even is that
well suit yourself then.
>"Element Labs can provided terminate the Agreement at its discretion upon no less than 10 days-notice via any reasonable means, including by posting a notice on the website..."
Just another company that is a llama.cpp wrapper.
>>
File: file.png (238.1 KB)
238.1 KB PNG
>>108660741
I just downloaded whatever was most downloaded on HF at the time, i will definitely try another one thank you anon!
>>108660743
I am not sure, still learning what most flags do.. would you recommend to set up a draft model or just remove it? use case for now is mostly rp/assistant
>>
File: 00004-1260451778.png (1.4 MB)
1.4 MB PNG
>>108660571
Not sure how I'd proceed, but certain that I would not use ST as a starting point. lol. That whole thing is a big ole mess. To me the entire purpose of doing it as text-only / CLI is to debloat it, since I'd only have one inference connection. I'd probably make all the configurations as plain text files you'd update using Nano while out of interface.
>>108660600
/lmg/ would never claim me as one of their own.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108660787
>I am not sure, still learning what most flags do.. would you recommend to set up a draft model or just remove it? use case for now is mostly rp/assistant
A draft model can give you a notable speedup if you have spare vram for it, but it's incompatible with having an mmproj loaded, so no multimodal.
For reference, I use gemma 31b at q8, without a draft model (and with the mmproj) my output is around 25 t/s. When using it with the 26b q2 as a draft model, I get around 41 t/s doing rp and 80 t/s doing technical work.
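In llama.cpp terms that setup looks roughly like this (sketch, filenames are placeholders; kobold exposes the same idea through its draftmodel/draftgpulayers options):
llama-server -m gemma-4-31B-it-Q8_0.gguf -md gemma-4-26B-A4B-it-Q2_K.gguf -ngl 99 -ngld 99 --draft-max 16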
>>
What happened to Elon? He used to do many amazing things for society. But his actions look increasingly selfish. Instead of delivering superior products, he attacks his competitors. He calls OpenAI "ClosedAI", yet they have contributed much more to open source than xAI. He calls Anthropic misanthropic, but they have the most mature efforts to ensure AI benefits all humans while Elon's position boils down to UBI and "truthseeking AI will probably not kill us" with no justification. It feels like he wants to be the one in charge of AGI without providing any reason why the rest of mankind can trust him with this responsibility.
>>
>>
>>
>>
>>
>>108660663
>4000b
https://huggingface.co/unsloth/Claude-Mythos-GGUF/blob/main/Claude-Mythos-TurboCuqed-Bonsai-Stretched-Rotated-smol-UD-Q0.5_XS.gguf
>>
>>
>>
>>
File: 2649435.png (125.5 KB)
125.5 KB PNG
https://huggingface.co/Jongsim/gemma-4-26B-A4B-it-upcycled-192-pretrained
What is this schizophrenia?
>>
>>
>>
>>108660922
China saw the success of https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H and decided to copy it
>>
>>108660908
OpenAI was co-founded by Elon was it not? Whisper is open source, and so are the early GPTs. They only started their for-profit jewish tactics after Elon left, didn't he try to sue the board because of it?
>>
>>
>>
>>
>>
>>108660961
first u need to use bahrat sovregirn models please god name model 4 trillion params made for good looks and pr.
then u ask model to produce money
sir if you need of support u can find me on fiver and i can be of guide to u for your jurney to make real money :rocket: :rocket:
>>
>>
>>108660908
Say what you will about the guy, but it's practically exclusively because of Musk's autism about space that we have self-landing rockets and even the faintest modicum of interest towards space exploration in the West.
>>
>>108660934
>Not that guy but do you need some snowflake tunes for draft model or it can be anything?
It just needs to be a model with similar logits, so ideally it's a model from the same company and series but smaller than your main one, hence why I'm using the gemma4 26b-A4b as a draft for gemma4 31b.
Same logic applies for a qwen model, want to use drafting for a big qwen 3.5? use a smaller qwen 3.5.
>>108660934
>Also does quanting matter for those?
Yes, for two reasons.
1. is that a quanted model is less and less likely to generate the same tokens as a larger one the more brain damaged it gets, so the acceptance rate (the big model going "yes, those are correct, send it") is going to be lower.
2. is that quanted models generally run faster on your hardware, and your draft model NEEDS to be faster than your main one to be of any use.
That said, I'm getting an acceptance rate of 0.6 to 0.75 with a model quanted down to q2 (which is really good), so you can get away with a lot.
I'd consider it worth playing around with if you have the patience, but 1.5gigs of VRAM is quite tight, I don't see it as likely that you can fit any quanted model in that space without cutting your context from your main model.
If your use case is technical, consider using an ngram like >>108660554 mentions, as that costs basically nothing and just uses your existing context to predict upcoming tokens for basically free speed.
>>
>>
File: 1758609192749490.jpg (129.3 KB)
129.3 KB JPG
https://huggingface.co/Qwen/Qwen3.6-27B
>>
>>
>>108660554
>>108660990
I'm using kobold as backend. Does it inherit these flags as well? I don't think the gui launcher has any. Manually editing the preset file maybe?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1762403533077253.jpg (282.5 KB)
282.5 KB JPG
>>108660821
Not really the flex you think it is
>>
>>
>>
>>
File: 1758027777728551.png (93.9 KB)
93.9 KB PNG
>>108660998
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/tree/main
FASTER UNSLOP I WANNA TRY THIS
>>
>>
>>
>>
File: mao mao meow nigga.png (420.8 KB)
420.8 KB PNG
>>108660998
>Only GGUFs are Unsloth.
I'll wait, they're probably going to be reuploaded 4 times in the next 3 hours anyway.
>>
>>
>>
>>108661108
To be fair, there are extraordinarily few models that focus on the chat assistant aspect, rather than codeslop.
I wouldn't be surprised at all if Gemma 31b had better trivia and creative writing knowledge than Qwen 397b.
>>
>>108660821
that makes you a cuck
https://github.com/SillyTavern/SillyTavern-Pronouns
>>
>>
File: hjf8iEb99UIDWrVXlgEGh.png (1.5 MB)
1.5 MB PNG
>>108661101
Not even. They did the usual empty repo "LOOK AT ME!!"
This is the only quant: https://huggingface.co/sm54/Qwen3.6-27B-Q6_K-GGUF/tree/main
>>
>>
>>
>>108661168
I don't hate them, but their quantization snake oil wouldn't exist if llama-quantize from llama.cpp quantized models optimally on its own. The Unsloth bros now have priority access to unreleased models mostly because of that.
>>
>>
>>
File: 1776304754397567.png (485 KB)
485 KB PNG
>>108660284
Remember this? Sounds familiar...
>>
>>
>>
>>
>>108661211
Anon they're getting paid by github to run llama-quantize.
A project that they did not write, which is on github.
ikawrakov is a whiny pissbaby but he has infinitely more claim to their success than they do, because he implemented imatrix quantization.
>>
>>
File: 00106-3050314564.png (321.3 KB)
321.3 KB PNG
>ask LLM for instructions to re-season iron skillet
>Seasoning isn’t just “oil coating”—it’s polymerization.
Why do we carry on like this?
>>
>>
>>
>>
>>108661238
A true mystery https://archive.ph/Mjynm
>>
>>
>>
>>
>>
>>
>>108661252
Because there's a specific nuance of using a solid at room temperature fat that I needed to be addressed that isn't necessarily addressed since undoubtedly most of these tutorials are written by American boomers who will suggest using sneed oils and taking the added step of sacrificing my foreskin to Benjamin Netanyahu
>>
>>108661281
>>108661273
this dude is ODing / bipolar
>>
>>
>>
>>108661274
>>108661287
meant for this, oof
>>
>>
>>
>>108661238
Here's a thinking process:
1. **Identify Core Steps (Essential Info):**
- Scrub clean (steel wool/salt + soap)
- Dry completely (stove/oven)
- Apply thin oil layer, wipe off excess
- Bake upside down at 450°F for 1 hour
- Cool in oven
- Repeat 2-3x for best results
- Use high-smoke-point oil (canola/grapeseed)
2. **Adapt to /g/ Style:**
- /g/ posts are typically:
- Anonymous, direct, no greeting/signoff
- Use code blocks or minimal formatting
- Slightly technical but practical
- Might include subtle board humor/cynicism ("not even a hard ask", "basic metallurgy/polymerization", "try again anon")
- Short paragraphs or bullet-like structure
- Avoids over-explaining
- Uses terms like "anon", "OP", "try", "do it right"
4. **Check Against Requirements:**
- 4chan /g/ style? Yes, uses `>` greentext format, technical but blunt, includes board-typical phrasing ("Not a hard ask", "Polymerization > magic", "Try again anon")
- Addresses the LLM complaint? Implicitly by delivering a flawless, concise answer without fluff
5. **Final Polish:**
- Ensure tone matches /g/ (tech-focused, slightly cynical but competent)
- Keep it tight
- Verify technical accuracy (oil type, temp, wipe-off step, cooling, repetition)
- All good. Output matches draft.
>scrub rust/flaking seasoning off with steel wool + dish soap
>dry completely on stove or 200°F oven
>apply thin coat of high-smoke-point oil (grapeseed/canola), then wipe until it looks nearly dry
>bake upside down at 450°F for 1 hour, cool in oven
>repeat 2-3x
>hand wash, dry immediately, light oil coat after use
Not a hard ask. It's just lipid polymerization. Wipe off excess oil or you get gunk. Try again anon.
>>
>>108661274
>>108661287
>>108661298
Also this.
Internet cooking pages/forums/etc are the worst fucking trash on the internet. If anything needs to be replaced by LLMs it's this. Go to a recipe website on a 64 core threadripper and it'll still struggle to run from all the trackers and ads and other bullshit all over the page. Just to find the one piece of information you're looking for.
>>
>>
>>
Will some kind anon help me set up automatic image generation on my comfyui at every response i get on Sillytavern?
So far, at the end of every message, my character writes...
[Prompt: light blue hair, medium hair, center-flap bangs, blue eyes..
where i'm stuck is how do i get sillytavern to capture only that part and send it to comfyui? Gemini recommends regex but a lot of the information it gives me is outdated so many settings aren't actually available..
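For what it's worth, the capture itself would presumably be something like \[Prompt:\s*([^\]]*)\] (assuming the bracket actually closes) with group 1 forwarded as the image prompt, I just don't know which ST box that goes in.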
>>
>>
>>
>>108661252
>>108661314
A key detail I never hear people talk about is that with manual search engines you can assess the quality of the sources yourself
You can see who wrote and edited wikipedia articles. You can click through on blogs to see related articles, you can judge websites by the amount of slop on there. You can read books and academic articles yourself.
With LLMs you kinda just have to trust that the synthesized output based on its training data is accurate. There is no way to know where the machine got its information, because LLMs are a black box by design.
Granted some chatbots have "sources" now but I've found those are often unreliable and added post-hoc, i.e. the AI generates something plausible and then after the fact tries to find a reddit thread or something that vaguely addresses the same problem. They are not sources in the traditional sense.
I can see a future where the internet completely sloppifies and becomes impenetrable without AI agents because companies are training AI on itself. A future where quality, vetted information becomes privatized again, with research groups hosting their own private knowledge bases in Logseq/Obsidian like the encyclopedias of old
>>
>>108661339
How about your read the docs? https://docs.sillytavern.app/extensions/stable-diffusion/
>>
File: Code_kwZLwtIdVo.jpg (37.4 KB)
37.4 KB JPG
Is camofox copium or does it work for ai automated browsing?
>>
>>
File: 1776660989723673.png (130.4 KB)
130.4 KB PNG
wtf is this shit? I just want to talk to my LLM dude
>>
>>108661375
I'm sure this is a feature in the eyes of many in power. Those who control the AI can rewrite history subtly and it would be difficult or impossible for most to disprove the "truth" as presented by them. Most people are already more than happy to offload their critical thinking, before it was result #1 on google or wikipedia, now chatbots. Memory hole-ing made easy.
>>
>>108661358
i mean why steal second hand with sloppy way when you can torrent a really nice recipe directly
>>108661375
pmuch this
>>108661397
open webui is for corpo intranet/grifter tier hosting solution, not 'local first'
>>
>>
>>
>>
>>
Before another wave of Chinese shills floods /lmg/ once the goofs for the benchmaxed garbage that is Qwen 27B (this time it's 3.6 soi it's gooder than 3.5!!!), remember
Qwen did not "fucking cook"
Qwen 3.5 was not good at code (people who say that here are either shills, or never used it and parrot others), Qwen 3.6 isn't either
If you do fall for the meme and download the model, every issue you have with it is likely the model, and not the meme samplers they want you to use so that their garbage stops thinking for 39045639085 tokens when prompted with "hi" (imagine RECOMMENDING rep pen, Jesus Christ)
If you're a vramlet, just use the smaller Gemma models. If you REALLY want Qwen (why?) and you're vramlet-lite, use Qwen Coder Next. They have no other good models.
>>
>>
>>108661077
>>108661437
Why even come here? Reddit is more your speed. They love to slopcode and suck off qwen all day.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108661511
>>108661509
>cooking mentioned
deranged bot?
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot from 2026-04-22 07-30-47.png (71 KB)
71 KB PNG
so, yes, I have a local inference stack where I roll a qwen 3.6 31b as well as a gemma 4 26b moe. Right now, they both crank out 100 t/s, and the content is actually good. I've been doing local model shit on a pair of a6000s for a while now, and for the first time, these things are actually usable. I was running Kimi K, with it spilling onto memory, and it was good (tool calling, coding etc.) but it was so slooow. I use opencode with the oh my opencode plugin. It works; I can make codeslop all day long with this setup. I can fix broken code and implement new features for work. Is it as fast as claude? no. Accurate as claude? also no. Am I ever going to get back the $10k I dropped in this stupid thing? probably not... Do the subagents fight it out and make things that work... Yes. For the first time, it feels like these things aren't toys.
I actually have cancelled my max subscription... I also opened an account on fireworks so I can run kimi k with the quickness if I really want to.
>>
>>108661545
I hate the loli Gemma posters as much as the next guy, but Qwen 3.5 is a retard. I used it at Q8 with every harness imaginable and it failed every single time when the task wasn't something a drunken junior could do.
>for 96gb vram
bloody benchod oyu have brahmin amount of vram and you use a model for DALITS, did you really not manage to find a BIGGER MODEL?
fuck i think i got baited again
>>
>>
>>
>>
>>108661533
A few years ago I used to have regular drunk discussions with researchers in an AI Explainability and Ethics group from which the conclusion basically was that the field is in its infancy and getting any kind of sensible interpretations out of the activations of trillion parameter models is extremely difficult. I should maybe try to catch up on the latest research though.
>>
>>
>>
>>
>>
>>
>>
>>108661606
Final draft:
<The entirety of the response with one word changed>
Okay, let's write
<Doesn't end the thinking block, outputs the entire response again>
</think>
<The entire response>
Yeah, pisses me off too. If you use the superior Text Completions endpoint, use a <think> prefill that gives it a plan for its own plan. I always put "I will jot down a response plan without writing out the entire response or doing any drafting or polishing" in there. Look at how GLM formats its reasoning and adapt to that.
If you're on Chat Completions, what the fuck are you doing
>>
>>
>>
>>
>>
>>108661664
Gooning is *the* usecase for gooning.
I wonder why these retards shoot themselves in the foot by using and suggesting others use Chat Completions.
You do not need a template to be added to Silly Tavern with an update to read the docs and figure it out on your own. aicgsirs are braindead.
>>
>>108660998
>https://huggingface.co/Qwen/Qwen3.6-27B
Oh wow another benchmaxxed model. can't wait for it to be in reality worse than Gemma.
>>
>>
>>
>>
File: 1432498179182.png (296.2 KB)
296.2 KB PNG
><|turn>model\n<|think|><|channel>thought
So on gemmy do I prefill with this entire thing?
>>
>>
>>
>>
>>
>>
>>
File: 2026-03-01-163613_1044x1782_scrot.png (496 KB)
496 KB PNG
>>108661168
Because they're supposedly the expert in quantization and get a shit ton of money for it, yet their quants are actually extremely mid and are often broken.
>>
>>
>>
>>
>>
>>
>>108661562
>bloody benchod oyu have brahmin amount of vram and you use a model for DALITS
Well I'm not going to offload to cpu or shrink the 256k context for coding am I??
I'm using the 112b at q4. It replaced minimax because it just works better (c#)
Also, I hate every prior Qwen model.
>>
>>
>>108661745
My raging hate boner for Qwen's socially irresponsible marketing practices aside, have you tried the Coder Next model or 27B?
I used 112b too, it felt irredeemably retarded for an MoE its size. The above models performed much better, with QCN remaining the only model I actually enjoyed using out of their entire lineup.
>>
>>
>>
>>108661680
>>108661686
Well yeah but they're genuinely too stupid to set up the template. Gemma 4 release I figured it out in 2 seconds reasoning on and off by eyeing the jinja, meanwhile half the thread couldn't and were crying hard. Sad!
>>
>>
>>108661743
It has the same access to an LLM, plus the ability to actually edit the prompt instead of asking the underlying API to hold your hand.
Converting images to embeddings is not magic. Neither is parsing tool calls. For the model - it's all tokens, for you - it's all text.
Figure it out.
>>
>>
File: 1758105037530551.png (927.4 KB)
927.4 KB PNG
managed to get llama cpp server Ui to finally display the image when the LLM is sending it lol
>>
>>
File: cbclyf.png (479.9 KB)
479.9 KB PNG
turns out I had day0 gemma this whole time??
sha256sum eafb...b720 gemma-4-31B-it-UD-Q4_K_XL.gguf
compare to new hash:
sha256sum 6340...6b88 gemma-4-31B-it-UD-Q4_K_XL.gguf
>>
>>
>>108661763
>Coder Next
Yeah I tried that when it was released. It seemed to get confused more with c# syntax.
I don't think that had vision support either right? I tend to use that sometimes.
>27B
Yes, this is actually smarter and I did use it for a while. But it's also slower on my system, even with tensor-parallel. 112b is "smart enough" and so much faster, meaning I spend less time working lol.
I haven't tried the 2.7 Minimax or Qwen 3.6 yet. I'm guessing the 27b will be smart but slow, and the small MoE will be retarded like the 3.5 one.
>>
>>
>>
>>
>>
>>
>>108661828
https://github.com/BigStationW/Local-MCP-server/blob/main/docs/Use_on_llamacpp_server.md
>>
>>
>>
>>
>>
>>108661664
Periodic reminder that /lmg/ was created in early 2023 by /aicg/ anons who wanted a place for discussing local models (Pygmalion, ...) without the cloud model/proxy background noise. It's always been for gooning by gooners.
>>
>>108661866
Most chat completion proponents here see it as an impediment to their cooming in the only frontend they know, which is ST.
Filling in 4 fields there is not hard at all and gives you an ability to do whatever the hell you want with the entire prompt, including thinking edits and messing with special tokens.
Fine, image input, you can't be bothered to use libmtmd directly. But giving up on the template configuration step because ST didn't add it with an update is pathetic. If I couldn't do even that, why would I bother with local models when cloud ones just werk better, cheaper and faster?
>>
File: 1749622181770588.png (278.4 KB)
278.4 KB PNG
>>108661743
>Why the fuck would you use text completion in the current year? It has no vision or tool calling.
I have no idea but switching to chat completion has made my life way better
>>
>>
>>
>>
>>108661902
Oh yeah for llama-server mtmd I might as well ask, do I run images through stb and pass the rgb as b64 to /completion or does it just take regular image files (still b64) and do the processing itself?
I would assume the latter from what I'm seeing.
>>
>>
>>
>>
>>
File: 1748907056142177.png (122.4 KB)
122.4 KB PNG
Which goof do I get bros? I don't recognize any of these names, except unslop which I refuse to touch
>>
>>
>>
>>
>>
>>108662039
I wait for bartowski personally; his 122b / 35b / 27b of 3.5 have all been good.
>>108662052
why make your own gguf, seems like a lot of work for no reason
>>
File: 1746154015310533.png (46.3 KB)
46.3 KB PNG
>>108662052
>>108662053
Fugg...
>>
>>
>>
>>
>>108661970
They already pass images into stb under the hood, if that's what you were wondering
https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/mtmd-helper.cpp#L500
>>108662025
Too finicky and should be something doable on the frontend.
>>
>>
>>108660922
That's actually pretty interesting.
Basically an upcycle + pretrain on some public dataset to initialize the router and make use of the new experts.
With more data and actual compute, you essentially get a new model.
>>
File: 1776689080363464.jpg (125.9 KB)
125.9 KB JPG
>>108662053
It's not OK. They work, but you're not getting the best performance/size ratio.
>>
>>
File: 1758787701709512.gif (140.9 KB)
140.9 KB GIF
>>108662076
I look like this doe
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108662113
From the same graph you can easily see that of those tested there, Bartowski's are the second best choice.
I'm not shilling anybody, if anything I wish we didn't have to rely on "quanters" with their own llama.cpp forks for the best possible quantizations.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108662176
I have to download a bunch of shit then wait an unspecified amount of time, it's not just 2 commands
Also there is clearly a nonzero chance of fucking it up, or else there wouldn't be so many jeets on huggingface posting scuffed quants
>>
>>
File: 6391246783178153927041694.png (217.9 KB)
217.9 KB PNG
>>108662179
local is fucked though if spud 5.5 is coming and is as good as it seems
>>
>>
>>
>>
>>
>>
>>
File: 1774496776426871.png (392.4 KB)
392.4 KB PNG
>>108662208
Thanks anon, I never considered it, because all the other boards I use have links that are short for English words...
>>
>>
>>
>>
>>
>>
>>
>>
File: 360_F_18423836_VRNpZI1jrIVnD72dhSWYts93Pc4ahLi3.jpg (31.2 KB)
31.2 KB JPG
Completely cucked.
>>
>>
>>
>>108662190
>>108662207
It's because they are doing snowflake quants and schizos want to be the next unsloth so they are also doing snowflake quants.
>>
>>
>>
>>
>>108662295
>>108662298
it's the first time I'm seeing that, Gemma never did that shit lol
>>
>>
>>108662295
>>108662298
well why would it search twice? was it not happy with what it found the first time? do you normally make multiple searches if you want to look something up? sounds like the chinese shit is just broken and you're shills making excuses
>>
>>
>>108662257
There ought to be a "multipass" mode in llama-quantize that first creates a logfile with the quantization error and size measured for all tensors at various quantization levels, and then in a second pass aims for a specific filesize using that information (and/or optionally the saved tensors so you don't have to quantize them again, at the cost of storage).
If niggerganov can't be bothered implementing quantization advancements because ikawrakov implemented them first in his fork and/or he's not capable of it, at least he should improve llama-quantize's default quantization schemes.
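A dumb manual version of that first pass is already doable today, something like this (sketch; assumes you already have the bf16 gguf, an imatrix and wiki.test.raw lying around):
for q in Q2_K Q3_K_M Q4_K_M Q5_K_M Q6_K; do
  llama-quantize --imatrix imatrix.gguf model-BF16.gguf model-$q.gguf $q
  llama-perplexity -m model-$q.gguf -f wiki.test.raw -ngl 99 2>&1 | tee -a quant-sweep.log
done
then the "second pass" is just you eyeballing the size/PPL tradeoff in the log instead of the tool doing it per-tensor.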
>>
>>
>>108662314
>well why would it search twice? was it not happy with what it found the first time?
it's not even that, it just launched the two tools at the same time, that's weird, usually you do one search, then you reflect on that
>>
>>
File: 1761933552977027.png (68.3 KB)
68.3 KB PNG
>>108662260
bruh, there's a fucking "OR" in that tool, gemma knows it so that it can just use the tool once and spam "OR"s, and qwen prefers to spam the tools instead
>>
>>
>>
File: 1754840524102248.png (3.5 MB)
3.5 MB PNG
>>108662252
you're right, making your goofs should be democratized.
bart's calibration data:
https://gist.github.com/bartowski1182/82ae9b520227f57d79ba04add13d0d0d
first step:
PRODUCING THE BASE GOOOF:
>checkout llama.cpp repo
>do uv venv --python 3.12
>uv pip install -r requirements/requirements-convert_hf_to_gguf.txt
alternatively install manually the libraries you need, sometimes the requirements are outdated, which means do uv pip install ggml transformers accelerate sentencepiece torch protobuf --extra-index-url https://download.pytorch.org/whl/cpu
>download the weights of the model you want
>uvx hf download qwen/qwen3.6-27b --local-dir . (this will download the model into the current path, replace the . with whatever path you want, be it relative or absolute)
now it's time for the base conversion
>uv run convert_hf_to_gguf.py $PATH_TO_MODEL --outfile $OUTPUTFILE-BF16.gguf --outtype bf16
congrats! you created your first base bf16 gooof!!!!!!!!!!!!
now time to do imatrix shit
>llama-imatrix -m $PATH_TO_BF16_MODEL -f $PATH_TO_CALIBRATION_DATA -o imatrix.gguf -t $CPU_THREADS -b $BATCH_SIZE(2048) -ngl $GPU_OFFLOAD_LAYERS --parse-special
now you created the imatrix.gguf file!
from the bf16 you created earlier you can now create all the subquants you want!
for q8_0 you don't really apply the imatrix, so you do:
>llama-quantize $PATH_TO_BF16_MODEL $OUTPUT_QUANT_FILENAME Q8_0 $CPU_THREADS
to apply imatrix instead
>llama-quantize --imatrix $PATH_TO_IMATRIX_FILE $PATH_TO_BF16_MODEL $OUTPUT_QUANT_FILENAME Q4_K_M $CPU_THREADS
of course you can replace the Q4_K_M with whatever quant level you desire. you're welcome!
>>
File: 3d printer.png (37.4 KB)
37.4 KB PNG
>>108662175
The few months of overlap after the nerds invaded when this was simultaneously both a tech and guro board was the peak of /g/.
>>
>>
>>
>>108662162
the screenshot of quants didn't have the typical quanters available, I usually shill for barts or mrader. I assumed the person making the inquiry was looking for one of those guys, so I directed him to ggml, the only other name on the list I could identify. if you have the hard drive space and internet bandwidth you can just make your own, you don't need enough ram or vram for the safetensors, just disk space.
>>
>>
>>
>>
>>108662357
https://www.reddit.com/r/LocalLLaMA/comments/1sc8kdg/running_gemma4_26b_a4b_on_the_rockchip_npu_using/
>>
>>
>>
>>
>>
File: dd964229a55cf670ddcb6693e739f694.jpg (211.1 KB)
211.1 KB JPG
>>108662353
you are a good man, thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>108660554
>ngram-mod
The ngram cache resets way too early by default. Change this from 3 to 64 and recompile. Makes it usable.
>if (n_low >= 3) {
https://github.com/ggml-org/llama.cpp/blob/8bccdbbff9d0d91d54838471f6eea182b9ab1b79/common/speculative.cpp#L747
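i.e. the entire patch is just that line becoming (sketch, the exact line number will drift between commits):
>if (n_low >= 64) {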
>>
>>108662353
adding to this, if you want to get the mmproj, you do:
>uv run convert_hf_to_gguf.py $PATH_TO_MODEL --outfile $mmproj-OUTPUTFILE-BF16.gguf --outtype bf16 --mmproj
also with llama quantize you can get cheeky and do specific quant levels on the layers you want by passing --tensor-type (or multiple of them) if you want to keep certain layers at q8_0 bf16/f16
you did imatrix shit but HOW DO YOU KNOW IT'S WORKING????? BY MEASURING PPL!!!!!!!
first you need to obtain wikitext shit!
https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.test.raw
https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.train.raw
https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.valid.raw
then you run this shit, $DATASET should be the wiki.test.raw file
>llama-perplexity -m $PATH_TO_MODEL -f $DATASET -ngl $GPU_LAYERS_OFFLOAD
run this on both an imatrix'd and non imatrix'd quant to see how much of an effect your shit did!
>b-but muh KLD
I actually didn't look into how to do kld calculations so... sorry! just ask your fave llm or hit google lmao!
>>
OpenAI released an open source model!
https://huggingface.co/openai/privacy-filter
>1.5B parameters total and 50M active parameters.
>sparse mixture-of-experts feed-forward blocks with 128 experts total (top-4 routing per token)
>>
>>
>>
>>
>>
>>
>>
File: 8x ddr4 2400 vs 12x ddr5 6400.png (199.6 KB)
199.6 KB PNG
>>108662533
Here's hoping.
The best we can expect is competition and rivalries where they keep trying to one up each other by releasing increasingly more capable open weight models.
>>
>>
>>
>>
>>
>>
File: 1775155673440.png (1.4 MB)
1.4 MB PNG
>>108662589
>>108662594
>>
File: 1751577236872156.png (261.2 KB)
261.2 KB PNG
@kache shut the fuck up you leaf faggot
>>
>>
>>
>>
File: 1774443294977572.png (81.6 KB)
81.6 KB PNG
https://www.reddit.com/r/LocalLLaMA/comments/1ssl1xh/qwen_36_27b_is_out/
kek
>>
>>
>>
>>108662489
I know people like to shit on oai releases, but this strikes me as actually somewhat useful. I'd feel better about letting my """agents""" that have access to my personal notes interact with other people with more filters in place. Especially when using comparatively-more-retarded local models that are easier for other people to fool than some 10T cloud abomination.
That being said, for the personal usecase, idk if stacking a filter model is really more effective than a regex filter with my name in it.
>>
File: quant_error.png (784 KB)
784 KB PNG
>>108662321
>quantization error
Getting closer; turns out it could be easily vibe coded, in one way or another.
>>
File: 1774790474325284.png (1.2 MB)
1.2 MB PNG
>>108662752
lmaooo
>>
File: HGQXeUBWUAAuLgm.png (30 KB)
30 KB PNG
>>108661019
>>108661023
BEHOLD
https://x.com/saltjsx/status/2045874466958270903
>>
File: 1769170576230608.png (603.7 KB)
603.7 KB PNG
https://www.reuters.com/world/asia-pacific/tencent-alibaba-talks-invest-deepseek-information-reports-2026-04-22/
Deepseek's comeback??
>>
>>108662813
https://arxiv.org/abs/2309.08632
must have implemented this old paper
>>
>>
File: HGX7Z0YWkAANkJ3.jpg (83.1 KB)
83.1 KB JPG
Should I pull the trigger on a second GPU bros..
Models are getting so much better every week (days, even) that one card might be all I needed after all.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>[59301] ggml_cuda_init: failed to initialize CUDA: unknown error
>[59301] load_backend: loaded CUDA backend from /app/libggml-cuda.so
>[59301] load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
>[59301] warning: no usable GPU found, --gpu-layers option will be ignored
>[59301] warning: one possible reason is that llama.cpp was compiled without GPU support
What the fuck? it worked fine yesterday bitch!
>>
>>
File: 1750659002865044.png (49.7 KB)
49.7 KB PNG
lol
>>
If your qwen starts going
Wait… but I should
Wait… but maybe
Wait… what if
isn't that a sign your quantization is garbage? Mine will think for a long time potentially but I stopped getting the Wait… shit when I switched from unsloth to a non shitty quant provider. That and --reasoning-budget 4096 / --reasoning-budget-message "Thinking time exceeded. Output answer now\n"
>>
>>
>>
>>
File: 1768183262130103.png (173.6 KB)
173.6 KB PNG
>>108663091
Yes, and?
>>
>>
>>
>>
>>
File: glow-shine.jpg (210.3 KB)
210.3 KB JPG
>>108662957
buy a 3090 and a cloud sub. Wait like the rest of us who didn't CPUMAXX and wait...
>>
>>108662614
>>108662594
Wonder if I'd be able to run it with my 7900xtx and 32gb ram...
>>
>>
>>
>>
>>
>>
Thoughts on new Qwen 27B.
I almost like the thinking process it does. Everything makes sense. I don't even mind it drafting the whole response and then checking / fixing some things in post.
But, 2000 tokens of thinking take so fucking long, man. If this ran at 100+ tk/s it would be pretty fun.
It writes a lot like Gemma 4 31B does but thinks 1500 more tokens while running 5 tk/s faster.
If this released a month ago I'd be using it but now it's just not worth it.
>>
>>
>>
>>108662654
Yooo, @yacineMTB / kache is a pedophile / hates niggers like us? Based! I've heard roon also visits here and posts pics from his personal stash.
>>108662783
Reddit cooked with this epic cuckoldry meme
>>
>>
File: Screenshot048.png (5.2 KB)
5.2 KB PNG
>>108663256
ngmi
DOA
>>
File: 1776345292760771.png (36.6 KB)
36.6 KB PNG
>>
>decide to migrate to emacs, ditch vim and micro
>spend 8 hours configuring init.el and still not sure about everything...
Finally I can begin editing my text files! Thinking about making an elisp version of my llm client. That will probably be spaghetti.
>>
>>
>>
>>
>>
File: 1763559247414133.jpg (25.7 KB)
25.7 KB JPG
>>108663621
>>
>>
>>
>>
File: 00000-1378487878.png (1.3 MB)
1.3 MB PNG
>>108662834
Sounds more like a potential tech sharing agreement in the works. West acts like China's all kumbaya and shares everything, but ofc they are hyper competitive.
If you're the DS founder, gotta love the $20B valuation on his side-gig. lol. Idk what that guy makes at his real job but this implies he's now a multibillionaire at minimum.
>>
>>108664332
That would be more of something the CCP would arrange under cover; an investment opportunity like this wouldn't cover tech, and it's not like they can't just look at what Deepseek publishes to get something out of whatever they are planning.