Thread #108641942
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108637552 & >>108633862
►News
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.5 MB)
►Recent Highlights from the Previous Thread: >>108637552
--llama.cpp PRs adding DFlash and speculative checkpointing for speed:
>108640571 >108640591 >108640606 >108640682 >108640733 >108640747 >108640744 >108640767
--Anon uses Gemma-4 to build a self-modifying MCP server:
>108637873 >108637890 >108637916 >108637970 >108637976 >108638105
--Anon showcases VN frontend using Gemma 4 and ComfyUI:
>108638473 >108638488 >108638514 >108638534 >108638554 >108638691 >108638775 >108638828 >108638607 >108638650 >108638652 >108639369 >108639312 >108640497
--Discussing complex multi-model agent orchestration and layout efficiency:
>108638914 >108638931 >108638964 >108639017 >108639105 >108639126 >108639139
--Comparing local 5090 hardware costs against high-end coding APIs:
>108639080 >108639120 >108639133 >108639153 >108639172 >108639201 >108639207 >108639748 >108639138 >108639203 >108639745
--Comparing Qwen3.6 and Gemma4 performance on benchmarks and translation:
>108639021 >108639039 >108639052
--Orb-anon shares updates on Orb agentic writing tool and UI:
>108637985 >108638191 >108638211 >108638222 >108638259 >108638318 >108638451 >108638478
--Comparing Qwen3.5 and Gemma4 performance for manga OCR and boxing:
>108640026 >108640041 >108640042 >108640051
--Comparing Gemma 4 MoE and dense models' safety guardrail persistence:
>108641209 >108641221 >108641266 >108641485 >108641608
--Using custom tags to force first-person reasoning in Gemma/Qwen:
>108638379 >108638397 >108638486 >108638529
--Testing Gemma 31b performance at higher context windows for RP:
>108637978 >108638070 >108638224 >108638238
--Searching for lists and detectors of overused LLM prose cliches:
>108637879 >108637885 >108637993 >108638011 >108638062 >108638086
--Logs:
>108637976 >108638379 >108638451 >108639253 >108639453 >108639750 >108639781
--Miku (free space):
>108638191
►Recent Highlight Posts from the Previous Thread: >>108637554
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: Screenshot038.png (91.8 KB)
Gemma, darling...
>>
how did gemma manage to read the unreadable text in a thumbnail?
>>
>>108642213
>>108642220
It actually mis-quoted, didn't it? Doesn't the original say 'entered' a thread, not 'searched'?
>>
>>108642235
yes
>>108642220
qwen can't read it, so gemma must have been trained on it
>>
File: untitled.png (128.5 KB)
>>108642235
>It actually mis-quoted, didn't it?
LLMs do this when reciting text verbatim from their training data. It's also why you get links to the wrong github PR etc
>>
File: dispenser.png (235.3 KB)
hi so
>prompt eval time = 29369.07 ms / 721 tokens ( 40.73 ms per token, 24.55 tokens per second)
>eval time = 13822.73 ms / 73 tokens ( 189.35 ms per token, 5.28 tokens per second)
>total time = 43191.80 ms / 794 tokens
release: id 0 | task 152 | stop
gemma 4 e2b
6 gigs of ram
4 core Intel Xeon Gold CPU
8 gigs of swap
on CPU inference
i doubt i can do anything here without upgrading, can i
-m /models/Gemma-4-E2B-uncensored-pruned-TextOnly-EnglishOnly-Q4_K_M.gguf
--host 0.0.0.0
--port 8080
--ctx-size 3000
--batch-size 64
--ubatch-size 32
--threads 4
--threads-batch 4
--swa-checkpoints 1
--parallel 1
--flash-attn on
--temp 1.0
--top-p 0.95
--fit on
--cache-ram 0
--n-predict 400
--override-tensor "per_layer_token_embd\.weight=CPU"
--jinja
--no-mmap
>>
File: file.png (33.4 KB)
>>108642625
someone tested and found it made no difference
the best speedup comes from using the 26b for spec decoding
>>
>>108642625
>>108642647
MoE models need a larger draft max, otherwise the tokens get truncated. 48 or more.
I don't understand this "test", it's haphazard, half-assed, and meaningless.
>>
I just learned by looking at the verbose Llama.cpp logs that Open WebUI automatically reinjects the thinking block for previous messages, and that there is no option to disable this behavior, because I guess they think all models want previous message thinking. Oh also, the default thinking tags OWUI uses are <think>, although at least it seems they let you set custom tags.
WHAT THE FUCK.
WHAT THE FUCKING FUCK.
THAT'S (one reason?) WHY THINKING BREAKS RANDOMLY ON GEMMA
FUCK
>>
>>108642665
>Moe model needs larger draft max or otherwise they tokens will get trunkated. 48 or more.
--draft is the same as draft max and it was tested with 64, 128, and 256. All above 48, none helped.
Those scores were also all averages of 11 swipes done at 40k context with gemma 31b q8 as the main model and 26b q4 as the draft model
t. guy who actually did those tests, as well as the previous ones testing how quanted draft kv affects acceptance rate (The answer is negatively, unsurprisingly, but this was done before the rotating kvquant stuff was merged)
Feel free to prove me wrong and get a better measured result by fucking around with draft max, I'd love to get some free speed.
>>
File: 1755348081481696.jpg (798.6 KB)
new vision SOTA benchmark just dropped
>>
File: 1752764167468619.png (98.9 KB)
>>108642862
It's benchmaxxed already, get new material
>>
File: 1750031460093473.png (313.5 KB)
>>108642884
>>
File: yayyy.png (15.2 KB)
>>108642862
>>
>>108642887
>>108642892
>>108642901
get in the oven, all of you
>>
File: yayyy_02.png (15.2 KB)
>>108642916
I don't send the filename. The vimscript uses the :image: tag to embed the file only.
>>108642917
I just noticed I ran it with temp 1.5, but I don't know if that's gonna make much of a difference.
>>
>>108642813
Normally it would but OpenWebUI doesn't actually send it back as thinking. It literally just pastes the thinking block straight into the "content" field of the message with thinking tags and spacing that is not going to be consistent with every model's usage of them.
It's supposed to be sent separately in the API under a "reasoning" or "reasoning_content" field without tags. When this is done properly, the jinja template filters out all previous thinking except in cases where it's necessary (typically chained tool calls).
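The difference in the payload is roughly this (sketch with made-up message content; llama.cpp's server calls the field "reasoning_content"):
# what OWUI pastes back for a previous assistant turn (broken):
broken = {
    "role": "assistant",
    "content": "<think>\nThe user wants X...\n</think>\n\nHere is X.",
}
# what an OpenAI-compatible server expects, so the jinja template
# can decide whether to re-inject the reasoning (correct):
correct = {
    "role": "assistant",
    "content": "Here is X.",
    "reasoning_content": "The user wants X...",
}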
>>
File: laigs.png (233.3 KB)
>>108642862
Even the retardmode moe quant knows it's ai generated, even if it can't count lol.
>>
File: nayyy_01.png (14.8 KB)
>>108642917
I'm also using --image-min-tokens 560 --image-max-tokens 560. With the default settings it failed the two times I tried. Four on the first one and five now.
>>
File: laigs 2 4 u.png (576.4 KB)
>>108642956
Couldn't be fucked switching my other moe quants over from the hdd to the ssd.
Here.
moe q4km gets it
>>
File: 1760095272195059.png (678.8 KB)
Gemma 26b Q8 gets the tails correct but not the legs
>>
>>108642806
Not him but yes it is actually sending the prompt that way. It results in weird formatting issues in the chat history like duplicated <think> tags for some models, so if you have any weird behavior with reasoning models in OWUI there's a good chance that bug is contributing to it. I used a reverse proxy to fix it myself which processes the prompt before sending it to the server. I think you could do something similar through their pipelines or extension system but I never looked into it because the reverse proxy was easier for me.
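The core of it is just stripping the tags from previous assistant turns before forwarding, something in this spirit (minimal sketch, not my actual code; streaming responses aren't handled):
import re
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
UPSTREAM = "http://127.0.0.1:8080/v1/chat/completions"  # your llama-server
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

@app.post("/v1/chat/completions")
def proxy():
    body = request.get_json(force=True)
    # remove the thinking blocks OWUI pasted into the chat history
    for msg in body.get("messages", []):
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            msg["content"] = THINK_RE.sub("", msg["content"])
    resp = requests.post(UPSTREAM, json=body)
    return jsonify(resp.json()), resp.status_code

app.run(port=9090)  # point OWUI at this port instead of the server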
>>
File: laigs intredasting.png (276 KB)
>>108642989
Huh, weird. It seems like asking for both legs and tails makes 26b count wrong, when it can get it correct when just asked about the legs.
>>
File: eh.png (579.8 KB)
>>108643013
Gets it right when you ask for paws and tails, though. Weirdly inconsistent.
>>
>>108643013
>>108643028
Goes to show just how unreliable current models are for factual data
>>
File: pero.jpg (160.3 KB)
>>108643028
>>
File: Screenshot at 2026-04-20 17-31-56.png (4.3 KB)
>>108643076
Are you even archiving
>>
>>108643001
>what kind of window manager are you using
My own, lightly inspired by ratpoison. But what you see on the screenshots is just tmux.
>whats your setup like?
>i like your red and your font..
XTerm.vt100.background : black
XTerm.vt100.foreground : gray
XTerm.vt100.boldMode : false
XTerm.vt100.allowBoldFonts : false
XTerm.vt100.eightBitInput : false
XTerm.vt100.metaSendsEscape : true
XTerm.vt100.utf8 : true
XTerm.vt100.locale : UTF-8
XTerm*faceName : Terminus
XTerm*faceSize : 8
! black
XTerm.vt100.color0: #000000
XTerm.vt100.color8: #888888
! red
XTerm.vt100.color1: #881111
XTerm.vt100.color9: #d06666
! green
XTerm.vt100.color2: #118811
XTerm.vt100.color10: #66d066
! yellow
XTerm.vt100.color3: #888811
XTerm.vt100.color11: #d0d066
! blue
XTerm.vt100.color4: #3333a0
XTerm.vt100.color12: #6666e0
! magenta
XTerm.vt100.color5: #881188
XTerm.vt100.color13: #d066d0
! cyan
XTerm.vt100.color6: #118888
XTerm.vt100.color14: #66d0d0
! white
XTerm.vt100.color7: #b0b0b0
XTerm.vt100.color15: #cdcdcd
>>
>>108643076
>>108643085
If you are intelligent and selective about it, hoarding is a good practice for the future.
If I had some serious disk space I would download the whole anna's archive and lots of other things.
>>
>>108643110
Maybe you should try an even newer model
https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H
>>
File: 1775544523743610.png (20.1 KB)
>>108643153
wtf
>>
File: 1754036894943805.png (46.6 KB)
>>108643158
>>
>use chatgpt in little bouts here and there because it's easy to access etc
>it keeps implementing 'better' ways i never asked for
I would use claude or something but they all require an account and I'm not really keen on doing that. Every time I use this piece of shit my blood pressure gets high.
No, local Gemma won't cut it, not until I have some form of agentic development pipeline, which I don't at this point.
>>
>>108643197
https://github.com/ggml-org/llama.cpp/pull/21149
>>
>>108641942
I wish so-called open models were more open. They don't explain their decisions, they don't even have clean canonical implementations. For example, Qwen's config says it uses silu, but when you check the >2k line huggingface implementation, you will see it is actually an inefficient swiglu. Why does qwen use a swiglu mlp with scale 1, 6, 3, 1 instead of the more canonical 1, 16/3, 8/3, 1 that is equivalent to the non-gated 1, 4, 1 projection?
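For reference, the gated mlp in question looks like this (minimal pytorch sketch; dimensions are illustrative, not Qwen's actual sizes):
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    # what "silu" means in most configs: silu(gate(x)) * up(x), then down-project
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

d = 1024
# params = 3 * d * d_hidden, so d_hidden = 8d/3 matches the 8d^2 of the
# non-gated 1, 4, 1 mlp exactly, while a 3d hidden (the 1, 6, 3, 1 layout)
# spends 9d^2
print(sum(p.numel() for p in SwiGLUMLP(d, 8 * d // 3).parameters()))  # ~8d^2
print(sum(p.numel() for p in SwiGLUMLP(d, 3 * d).parameters()))  # 9d^2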
>>
File: 1758121487584422.png (3.1 KB)
>>108643233
>having to justify magic numbers to laymen
we don't do that here
>>
File: GPQ_eyBacAEQet7.jpg (36.1 KB)
>>108643262
Thank you.
>>
>>108643271
Right after the big sir model was posted >>108643153 but surely it's just a coincidence
>>
File: 1693464909257094.jpg (697.2 KB)
>>108643270
>>
File: 1768048089757638.png (990.2 KB)
Why are we getting raided again?
/aicg/ hasn't had proxies for ages so it can't be one of them dying
>>
File: 1747532861846983.jpg (133.6 KB)
>>108643467
>>
>>108643484
>>108643507
Choosing between erasing Israel or India would be the hardest decision a genie could ever give a man
>>
>>108643530
this seems like a pretty consistent theme
i asked it to search for something, got hit by multiple captchas, and it confabulated the whole thing from the couple of initial search previews that actually returned something
>>
>>108643530
>>108643566
they really need to train models so that if they don't know something or need more information, they say so
>>
... how can i see the thinking block of gemma 4 31b it? i am using koboldcpp and sillytavern, i enabled the auto parse and show hidden settings under reasoning, added <|channel>thought and <channel|> as prefix and suffix but still nothing :(
>>
File: 1749277720192179.png (151.9 KB)
>>108643667
>>
File: 1776679755127.png (4.6 KB)
what are these 2 extra buttons
i have only pp and tg buttons
>>
>>108643798
>Nobody will know until a difficult long-context benchmark is done.
it should be mandatory to make LLMs do their benchmark tests starting at at least 50000 context; it's easy for a local model to start strong and then not care what happens as it goes on and on
>>
>>108643798
>Nobody will know until a difficult long-context benchmark is done. There's barely any difference between quants at short contexts and common knowledge until you reduce precision substantially.
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
>>
File: Screenshot.png (137.6 KB)
another re-upload today
>>
it's here
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6-Code
>>
File: 1766651896779772.gif (3.4 MB)
>>108641943
>--Miku (free space):
>>108638191
Impressive
>>
gemma 4 31b q4km is just too big for hermes on a 3090 so I'm switching to 26b MoE. gonna try iq4-NL from unsloth. maybe if some codeslave could add turbocunt or whatever then LOCAL WOULD BE FUCKING SAVED but no.
feel like I am so close to getting the setup of my dreams going but maybe that is the local model delusion?
>>
File: 1761150749876516.png (381 KB)
even if they released MAX I wouldn't use it, sick and tired of its long thinking loop autism
>>
File: 1730869321292980.jpg (359.2 KB)
>The 70b peak is still sao10k after all these years.
I don't know how, but I swear, if fine-tuning becomes more accessible and less costly, hence more popular, these tuners today are going to look like retards. I just know it. There are going to be forums 10+ years from now that'll be like "Remember that drummer dumbass that didn't use the skiddipop technique everyone does now? Man, what an idiot. He didn't even have the pattern recognition for the flambeagle tactic, everyone who fine-tunes can figure that out."
>>
>>108644195
It's solved in private models like mythos that actually run backprop on the entire context during inference to temporarily bake it into the weights, which works as long-term storage and effectively gives them unlimited context length for agentic tasks.
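Mechanically that's dynamic evaluation / test-time training; in plain pytorch the idea is roughly this (toy sketch with gpt2 as a stand-in, obviously not whatever they actually run):
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
fast = copy.deepcopy(base)  # adapt a copy, keep the base weights clean

ctx = tok("...the entire chat context goes here...", return_tensors="pt")
opt = torch.optim.SGD(fast.parameters(), lr=1e-4)

# a few LM-loss gradient steps on the context "bake" it into the weights
for _ in range(4):
    loss = fast(**ctx, labels=ctx["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# generate with the adapted weights; the context now persists in them
print(tok.decode(fast.generate(**ctx, max_new_tokens=50)[0]))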
>>
>>108644092
With the amount of data required to make something worth using increasing year after year, even if finetuning becomes so accessible that it's just a matter of drag-n-dropping datasets into a GUI, regular people still won't have the compute and the resources to curate the data and train the models.
I can't really see local compute capabilities (and memory) increasing by a factor of 100-1000 in the next few years. Costs will always be high. If there will be anything accessible, maybe it will come from "continual learning" models, but in that case you probably won't need to train them on everything and the kitchen sink, only on what matters to you, the end-user.
>>
File: fr.png (251.3 KB)
>>108644205
>temporarily bake it into its weights during usage
>>
>>108644206
>is there a paywall mirror like for medium???????????
I didn't know it's paywalled? It works for me
full page screenshots
https://files.catbox.moe/ypgni0.png
https://files.catbox.moe/f44shg.png
the table and graph full size
https://files.catbox.moe/yg6i6v.jpg
https://files.catbox.moe/jq06vf.png
That's on 250k ctx
>>
>>108644182
Yeah but for the hyperscalers who actually have the resources to waste they'd rather make a model 5x bigger that runs 10x faster for 10% the cost than blow all their budget training a giant dense model that will be outdated on release because it took 6 months of their datacenter's capacity. Meta had the unique combination of having the biggest GPU stockpile of anyone at the time and a CEO with the biggest willingness to burn money that allowed for something like Llama 3.1 to exist.
>>
>>108644345
>there's a paywall for the moe model
fuck i didn't even know he did the MoE
long context: https://files.catbox.moe/xy0kqu.png
>>
>>108644345
>https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
This could probably be unlocked with some ublock shenanigans.
>>
File: gemma_26b_a4_gguf_comparison.jpg (125.9 KB)
>>108644423
Nevermind, google search was able to snatch the chart itself. Good riddance!
>>
File: Screenshot041.png (98 KB)
>>108641945
Am I the only one experiencing looping in gemma 4?
commit="82764d8f405ff7928c061d8c100b50e9f77939f6" && \
model_folder="/mnt/AI/LLM/gemma-4-26B-A4B-it-GGUF/" && \
model_basename="google_gemma-4-26B-A4B-it-Q8_0" && \
mmproj_name="mmproj-google_gemma-4-26B-A4B-it-f16.gguf" && \
model_parameters="--temp 0.6 --top_p 0.95 --top_k 64" && \
model=$model_folder$model_basename'.gguf' && \
cxt_size=$((1 << 15)) && \
CUDA_VISIBLE_DEVICES=0 \
numactl --physcpubind=24-31 --membind=1 \
\
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-server" \
--model "$model" $model_parameters \
--threads $(lscpu | grep "Core(s) per socket" | awk '{print $4}') \
--ctx-size $cxt_size \
--n-gpu-layers 99 \
--no-warmup \
--mmproj $model_folder$mmproj_name \
--port 8001 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn on \
--image-max-tokens 1120 \
--batch-size $((1024 * 2)) \
--ubatch-size $((1024 * 2)) \
--chat-template-file "/mnt/AI/LLM/gemma-4-26B-A4B-it-GGUF/chat_template.jinja" \
--media-path /tmp \
--n-cpu-moe 10
>>
>>108644488
>>108644501
It has been solved by downloading another one, the most recent unslop gguf. I think my old one was fucked in all kinds of ways, and I'm looking forward to finding out in which ways this one is also fucked
I like having her think because it genuinely interests me, I don't even care about the RP, I just wanna see what it thinks or deduces when I say certain things
>>108644461
Well now you've been tasked with editing these, attaboy
>>
I got an Intel ARC B70 Pro over the weekend and wasted most of it trying to get the card to work.
Long story short: it was a pain in the ass and it's not worth the trouble for the 32GB of VRAM.
Long story: it wasn't recognized properly by the kernel out of the box with ubuntu 24, I had to add a ppa to get a newer kernel. Funny, because to install the intel frameworks and libraries they only support a handful of OSes, among them ubuntu 24, but whatever. Then I eventually got llama.cpp working, but --no-mmap wouldn't stop it from trying to first load the model to system RAM, and I only had 32GB in my test box; if it were 2025 I'd just buy more, but it's 2026 and 64GB of DDR4 is a rip off, so that was the end of llama.cpp. Then I tried vLLM. I never got it to work. It doesn't support openvino well, and I wanted to run gemma 4 31b it and couldn't find a compatible quant. I am RMAing the card today. What a waste of time.
>>
File: 1768415903475874.jpg (26.8 KB)
>>108644532
>it was a pain in the ass and its not worth the trouble for the 32GB of VRAM
We know retard. Here is your fell for it award.
>>
>>108644533
But why is there such a large difference from BF16 (the source, KLD=0 by definition), though? Either something causes measurable damage as soon as it gets touched, or Q8_0 doesn't work as well as it should.
>>
File: 9b qual.jpg (108 KB)
>>
>>108644506
>>108644502
>>108644583
According to this graph, unsloth q2_k_xl is almost the same as bartowski q4_0.
This doesn't make any sense, or if it does, it means that unsloth has skewed the weights towards this particular stat.
fixed the typos
>>
>>108644554
>anything above Q4?
>>108644571
>use Q8_0 and upwards
retard
>>
>>108644619
https://files.catbox.moe/jq06vf.png (for the 31b)
top 1 is only 92%
and that would include all the obvious punctuation and other 99.5% tokens
>>
Nyehehehehe
>>
File: 1759485716349073.png (248.3 KB)
>>108644506
31b fares better but yeah, Q4 quants aren't particularly high quality.
>>
File: absolute_retard.png (163.2 KB)
>>108644742
>>
>>108644842
>>108644848
Yeah I guessed the answer would be something like this. Thanks.
>>
File: 1759906247156495.png (852.3 KB)
>>108644851
>Nyahahaha
>>
File: 1759061755858768.png (68.2 KB)
>>108644829
Nobody can
>>
File: 1569566339879.png (166.5 KB)
What's the best vibecoding plugin in vscode that can connect to OAI Compatible?
>>
File: 1749034134206952.png (587.7 KB)
>>108644868
>That tumblr style
>>
>>108644877
>I realize now that my current upload is an experimental collection of models rather than a functional 2.5T model
it's like saying "I just realized that I put my shoes in the freezer instead of putting them in the closet."
>>
>>108644559
Yes, I know. I expected it to suck but I wanted to see for myself how badly.
I have a decent setup which can run qwen3.5-27b at full 261K context (4090D 48GB + 3090) but I would really like to run stuff in the 100-400B range locally. I have to decide whether to swap the 3090 for a 6000 Pro Max-Q or maybe buy a max-RAM M5 mac studio when they are released.
>>
File: 1558206602155.jpg (18.6 KB)
Can a single backend serve two frontends? I want to run coding, and to test the coded app I need to connect it to the backend that's already occupied. I think kobold has some multiuser stuff, but is that what I want?
>>
>>108644952
Moreover they do not even return direct thinking tokens
https://platform.claude.com/docs/en/build-with-claude/extended-thinking#summarized-thinking
>>
>>108644998
Shame. That's probably the one case where splitting processing over a 10 gigabit home network could, maybe, make sense.
Would also allow you to perform prompt processing on an nvidia machine and inference on, say, a mac.
>>
>>108645003
>That would get prohibitively expensive really fast.
yeah, probably only the chinese labs
i'm just guessing that's how they do it
>but nobody making 'Opus4.6-Distillation-6700000x-extreme-superhigh-max-reasoning' gives a shit
agreed, those unsloth retard loras
i remember someone did a qwen2-14b logit distill of the 405b llama-3.1 with a schizo vocab swap + healing token thing a while back
>>
Is there a world knowledge benchmeme out there? Asking models questions that require specific knowledge and seeing if they give non-hallucinated responses? (e.g. In what year was album x by musician y released?) Obviously asked without internet search.
I want to see quantitative data on how this kind of stuff scales with weights.
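If nothing exists, a crude version is easy to roll against any OpenAI-compatible server, something like this (sketch; the URL and probe list are placeholders, a real run would want hundreds of graded questions):
import requests

API = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server etc.

PROBES = [
    ("In what year was the album 'Nevermind' by Nirvana released?", "1991"),
    ("What is the capital of Burkina Faso?", "Ouagadougou"),
]

hits = 0
for question, answer in PROBES:
    r = requests.post(API, json={
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,  # greedy, we want recall, not creativity
        "max_tokens": 64,
    })
    reply = r.json()["choices"][0]["message"]["content"]
    hits += answer.lower() in reply.lower()

print(f"{hits}/{len(PROBES)} factual probes answered correctly")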
>>
File: 1772994328900860.jpg (12.3 KB)
>>108645186
haha yeah imagine
>>
please answer me
>>108643697
>>108642540
>>
File: Screenshot at 2026-04-21 00-43-44.png (147.8 KB)
I can already think of a lot of improvements I want to make but... I can play chess with Gemmy now!
>>
>>108645309
gemini is actually pretty decent at chess for an llm, curious how gemma performs for you
how are you representing the board state? I feel like that's always really tricky to get right and probably the biggest obstacle for llms to be able to play effectively
>>
>>108645324
I'm pretty bad so I wouldn't be surprised if she did win, mostly just wanted to see if it was even possible. It seems quite promising, so I'll refine the UI a bit so it's easier for me to play (right now I'm just using curl) and so it automatically notifies her about moves made and so on.
>>
>>108645309
Seconding >>108645355's question.
I was thinking of doing something like that using PGN format.
>>
>>108645355
What I did was ask it, and it said (paraphrasing) that if you give it two tool calls, one to get the current board state in FEN format and the other to make moves, then it should work.
So I made a basic chess server with a Ruby chess engine (https://github.com/pioz/chess - which already outputs FEN and understands UCI moves etc) under the hood, hooked up the tool calls to that server and seems to work just fine.
It'll be interesting to see how it goes over a long game though, first I need a better way of making my own moves that isn't curl...
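The tool definitions themselves are about as simple as it sounds, something like this (paraphrased sketch, not my exact schema):
chess_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_board",
            "description": "Return the current board state in FEN notation.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "make_move",
            "description": "Play a move for the model.",
            "parameters": {
                "type": "object",
                "properties": {
                    "move": {"type": "string", "description": "Move in UCI notation, e.g. 'e2e4'"},
                },
                "required": ["move"],
            },
        },
    },
]
# goes in the "tools" field of the chat completion request; the server
# executes the calls against the Ruby chess engine and returns the result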
>>
File: 1765300142692930.jpg (125.6 KB)
>>108645429
Ask gemma to teach you english first
>>
>>108645500
>>108645506
not like that
i want to give the llm a prompt like "ask the llm to write function x, then ask it to write function y, make sure it does z, show the output"
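i.e. a dumb sequential driver along these lines (sketch against an OpenAI-compatible endpoint; prompts and URL are placeholders):
import requests

API = "http://127.0.0.1:8080/v1/chat/completions"

def ask(messages):
    r = requests.post(API, json={"messages": messages, "max_tokens": 1024})
    return r.json()["choices"][0]["message"]["content"]

history = [{"role": "user", "content": "Write function x that does ..."}]
fx = ask(history)
history += [{"role": "assistant", "content": fx},
            {"role": "user", "content": "Now write function y using x, make sure it does z."}]
fy = ask(history)
print(fx, fy, sep="\n\n")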
>>
>>108645565 Meant for >>108643895.
>>
File: Screenshot at 2026-04-21 01-31-09.png (585.5 KB)
>>108645544
That part is possible cause she has image gen capabilities already.
>>
File: 1771440662324531.jpg (151.2 KB)
>>108645658
I love this thread bros
>>
File: Robo-Wife.mp4 (2.6 MB)
soon
>>
File: Bam-Bam-Painting-min.jpg (47.4 KB)
People keep saying that LLMs are stateless machines.
If so, how do you erase the context to free VRAM?
Also, I can have several chats running in llama.cpp.
How on Earth do they manage to separate them from each other in VRAM, so one chat's context does not spill over into another?
>>
>>108643971
I always click on these troll links. Umm.
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
>>
File: 1boy, looking_at_viewer.png (15.9 KB)
>>
File: Screenshot 2026-04-20 at 17.55.24.png (190.8 KB)
https://www.kimi.com/blog/kimi-k2-6
Wish there was a GLM 5.1 comparison.
>>
>>108645767
>People keep saying that LLM's are state-less machines.
Yes, but intermediate results can still be cached. That's what the kvcache is.
>If so, how to erase the context freeing VRAM?
On llama.cpp, you can't free allocated memory.
>Also, I can have several chats running in llama.cpp
Yes if you have multiple slots. Read llama-server -h for --parallel, --cram and probably some others. Read the whole thing.
>How on Earth do they manage to separate them from each other in VRAM, so one chat's context does not spill over into another?
Uh... a slightly more complicated version of if (slotctx < ctx / slots) ok; else notok; I suppose.
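Practically: start llama-server with --parallel 2 (the context gets divided across the slots) and both frontends just hit the same endpoint at once, e.g. (sketch):
import threading
import requests

API = "http://127.0.0.1:8080/v1/chat/completions"

def chat(prompt):
    # each request gets routed to a free slot with its own kv-cache region
    r = requests.post(API, json={"messages": [{"role": "user", "content": prompt}]})
    print(r.json()["choices"][0]["message"]["content"][:80])

# two "frontends" talking to one backend concurrently
a = threading.Thread(target=chat, args=("You are the coding assistant. Say hi.",))
b = threading.Thread(target=chat, args=("You are the app under test. Say hi.",))
a.start(); b.start(); a.join(); b.join()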
>>
File: Capture.png (84.9 KB)
>>
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
it's out
>>
File: 1765651225005855.png (106.6 KB)
>>108645842
another moe, I wonder if vision will be better than gemma4
>>
File: Untitled.jpg (148.3 KB)
>>108645842
mfw seeing a model i cant run even with a q1 quant
>>
File: file.png (9.7 KB)
>>108645861
wtf, gemmy has a 550m param vision encoder and it's only 31b
that seems very disproportionate, or is their moon thingy that efficient?
>>
>>108645842
>4. Native INT4 Quantization
>Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking.
So natively accelerated on blackwell? I don't even know if it's possible with gguf/q4.
>Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.
Less llama.cpp drama, good.
>>
>>108645834
I don't believe them. I've been using the k2.6 preview to vibecode via their subscription and its performance is clearly dumber than mimo v2 pro, which I already put some steps below codex.
I am cancelling it and trying mimo via xiaomi directly this month.
>>
File: file.jpg (247.5 KB)
>>108645894
Does gemma have vision benchmarks?
Because kimi does.
>>
>>108643872
KLD is not a capabilities benchmark.
Noise floor is a thing (he should test KLD of BF16 vs BF16 offloaded on different hardware, or with a different -ub, as llama.cpp produces different logits depending on those values).
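For anyone unclear on what's actually measured there: per-token KL divergence between the full-precision and quantized next-token distributions, averaged over a corpus. In toy form:
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# toy logits for one token position
p = softmax(np.array([4.0, 2.0, 0.5, 0.1]))  # bf16 reference distribution
q = softmax(np.array([3.9, 2.2, 0.4, 0.1]))  # quantized model's distribution

kld = np.sum(p * np.log(p / q))  # 0 iff the distributions match exactly
print(f"KLD = {kld:.5f} nats")
# a nonzero floor shows up even for "identical" models, because backend,
# batch size and offload all perturb the logits slightly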
>>
File: 1763698178837198.jpg (46.2 KB)
>>108645982
since it's moonshot_ AI
the counterpart should be moonlol_ AI
>>
>>108645993
>>108646000
Mode collapsed as in model collapse?
That's hilarious.
>>
>>108644302
>bloat
If it works then does it matter? One would require you to spend time vibe coding and then an unknown amount of time fixing and improving the vibe coded shit. The other is just ready made and you just follow the instructions.
>>
>>108646017
Mode collapse as in mode collapse. Somewhat similar concepts, different technicalities.
https://en.wikipedia.org/wiki/Mode_collapse
https://en.wikipedia.org/wiki/Model_collapse
>>
File: migu D.jpg (19.2 KB)
>>108645752
>>
>>108645894
it's this
https://huggingface.co/moonshotai/MoonViT-SO-400M
>>
>>108640471
I'm not reluctant, I'm still working on getting it into a presentable state and fixing some issues, I'm pretty close though
>>108642791
what
>>
>>108646124
https://huggingface.co/unsloth/Kimi-K2.6-GGUF
currently unslopping
>>