Thread #108256995
File: growing that ram4.png (2.1 MB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108252185 & >>108246772
►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
File: 1693569666447.png (128.5 KB)
►Recent Highlights from the Previous Thread: >>108252185
--Neural Linguistic Steganography:
>108254284 >108254313 >108254335 >108254347 >108254333 >108254383 >108254612 >108254659 >108254725 >108254734 >108254812 >108255833 >108255897 >108255914 >108255993 >108256032 >108256109 >108256137 >108256197
--Qwen3.5-397B-A17B GGUF quantization performance evaluation and Unsloth's MXFP4 implementation issues:
>108255306 >108255361 >108255376 >108255378 >108255407 >108255472
--llama.cpp MTP implementation slower than baseline for GLM 4.5 Air IQ4_XS:
>108252747 >108252770 >108252791 >108252824 >108252897 >108253131 >108253146 >108253291 >108253625 >108253645 >108253753 >108253767 >108253776 >108253791 >108253922 >108253961 >108252827
--Abliteration tool debates and Qwen3.5 model comparisons:
>108254196 >108254217 >108254259 >108254271 >108254272 >108254306 >108254304 >108254325 >108254223 >108254252
--FP6 precision absence due to hardware limitations:
>108253199 >108253216 >108253254 >108253269 >108253287
--Qwen3.5 GGUF Benchmarks | Unsloth Documentation:
>108254261 >108254291 >108255322 >108254301 >108254387
--Uncensored Qwen model variants shared:
>108254117 >108254137 >108254170 >108254767 >108254793 >108254168 >108254488 >108254829 >108254841
--Desired advancements in local models before year-end:
>108255761 >108255773 >108255788 >108255813 >108256669 >108255827 >108255956 >108256478 >108256508 >108255834 >108255856 >108255942 >108256029 >108256037 >108256054
--Mercury 2 reasoning diffusion LLM speed claims:
>108256497 >108256575 >108256612
--Higher quant model trades speed for accuracy in DMC3 boss analysis:
>108254459
--Tiny diffusion model explains Japanese slang term "mesugaki":
>108256144
--Miku (free space):
>108254691 >108254906
►Recent Highlight Posts from the Previous Thread: >>108252188
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
File: 1768478295394135.png (1.1 MB)
can local do this?
File: 1751231572427073.jpg (58.8 KB)
To break through the ceiling we must start harvesting human brains instead of GPUs.
File: 1744409772903809.jpg (73.7 KB)
>>108256999
Hand over the ram Miku. We can do this the easy way, or the hard way.
some anon from local diffusion recommended i come here for help. i want to use a vlm locally for the first time. I've never actually downloaded a text2text or image2text llm and used it locally before. Is there some sort of webui/gradio interface i need to install to use these llm/vlm models, similar to what a1111/forge ui is for sdxl and flux? i really want to use truly uncensored vlms for image captioning. I'm tired of dealing with the shitty rate limits of gemini 3.0/3.1. I have a 5090 with 64gb ram. do these models get the job done?
https://huggingface.co/groxaxo/Qwen3.5-27B-heretic-W8A16
https://huggingface.co/Qwen/Qwen3.5-35B-A3B
File: 008Dh0eagy1iapef2bp7sj32bc3341kx.jpg (503.6 KB)
>>108257451
Download koboldcpp from github and just feed the model into it. You also need the mmproj file to do image -> text. (You have to give koboldcpp an --mmproj argument with the file, f16 version is perfectly fine.)
Llamacpp works too.
Both of the models you found are good. With a 5090 and 64 GB of RAM you can run either of those models at q8. The 35B-A3B model generates tokens a lot faster because it has 3 billion active parameters while the 27B model is a "dense" model. People argue that the dense model is somewhat smarter, but it has significantly lower token generation speed.
With a 5090 and 64GB of RAM you could likely even run a Q4 version of Qwen3.5-122B-A10B, but it's going to eat up most of your memory.
These Qwen models are not uncensored, though; they're even more cucked than Gemini. There are "heretic" versions of them on HuggingFace that are uncensored. The base models might still be fine for captioning your images, though.
None of these models are going to be as good as Gemini, but koboldcpp alone can get you started. You can look into other things once you see that it works.
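A first invocation would look something like this (filenames are placeholders for whatever you actually downloaded):
koboldcpp --model Qwen3.5-27B-heretic-Q8_0.gguf --mmproj mmproj-F16.gguf --contextsize 16384 --gpulayers 99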
File: 1739250286596.jpg (103.7 KB)
>>108257402
>>108257528
Did the price increase for Kimi or has it always been $3 per 1m out?
>>108257451
Try this: https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-heretic-GGUF
or this:
https://huggingface.co/mradermacher/Qwen3.5-27B-heretic-GGUF
Grab a Q5_K_M or a Q4_K_M for one of them and an f16 mmproj file (that's the part that does the image recognition).
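If you have huggingface-cli, grabbing them looks like this (the exact .gguf filenames are guesses, check the repo's file list):
huggingface-cli download mradermacher/Qwen3.5-27B-heretic-GGUF Qwen3.5-27B-heretic.Q5_K_M.gguf --local-dir .
huggingface-cli download mradermacher/Qwen3.5-27B-heretic-GGUF Qwen3.5-27B-heretic.mmproj-f16.gguf --local-dir .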
>>108257622
>>108257634
qwen 3.5
And this can't be prompted away, shit's too entrenched
File: Screenshot 2026-02-27 194805.png (283.1 KB)
>>108257545
>>108257626
thank you soo much anons. it's actually working. using the 27b heretic q5 model version. will try out other models but this one works :')
Using my $300 free Googbux on gemini-3.1-pro-preview, my first real try with cloud inference. I've sicced it on the same issue as my local MiniMax-M2.5, after the latter exhausted 64k context.
Interestingly, Gemini-3.1 (on "medium" effort) takes 3 or 4 times as many turns and thinking tokens to reach the same conclusions as MiniMax, making it seem way dumber, even if it runs dramatically faster than my consumer machine. This is despite MiniMax doing tons of "Wait," and "Actually," insertions. Makes me wonder how small the leading models are nowadays.
>>108257928
honestly it's like 55-65% close, but it doesn't understand the nuances of hapas, south east Asians, indigenous south Americans and mystery-meat Hispanic/Latina-looking people. The qwen model keeps assuming the goth chick (pinkchyuwu) is east Asian and it gets the fictional character on the green hoodie wrong: instead of calling it Invader Zim it assumes it's Stitch from Lilo & Stitch lol. Even gemini can recognize when someone is cosplaying as Chel from The Road to El Dorado and whether they're south east asian, latina or indigenous american.
>>108257691
>I think it's neat. Can't think of a single thing I would use it for.
It has a different basis by way the fact (i have quoted more heavily above) of a higher cost:
What this all actually comes close and does (there are three of my points of point #6 here but I wanted it very close: First it's hard NOT just throw money away in terms a good website, because even there this site still was getting paid $1/1 = 2$ that has lost value (I believe Google was trying but there shoulda be someone working there, but no there has just paid for 3 services). Secondly - a good question and no answer that doesn't seem very clear :)
Secondly: is Google not trying anymore and getting greedy too by using their service, just to keep up their business model so a different price may have risen with a lot easier users etc :) 3D modeling doesn't. (This isn't. 3Ds modeling for 4d files is the best.) I don't. Some parts can't even use the same engine that can. For instance there the original game has 3 "parsons' engine with 8 levels and a 3D model of the game.
>>108257847
This thing goes crazy with tool calling holy shit.
Also, hmm, it does instruct way too well.
They did mention that they
>the control tokens, e.g., <|im_start|> and <|im_end|> were trained to allow efficient LoRA-style PEFT with the official chat template
So I guess it's a sort of light instruct that saw instruct data but without explicit instruct tuning.
File: 1750978077255792.jpg (289.9 KB)
>>108256995
>>108258449
Any test that gets posted anywhere on the internet (including here) gets trained on (except explicit stuff like cockbench, nala), so it becomes unreliable
>>108258451
Maybe if benchmarks are your only metric
File: 1757479001785652.png (62.1 KB)
>>108258458
I couldn't run it so...
>>108258088
I can't even get hard to your image, but that doesn't mean she isn't hot, it's just because my wife sucked me so hard she got me to cum twice in a row.
Is that a locally generated woman? Which model did you use?
What does safety mean for you guys?
For anthropic and the chinese, safety means not allowing the goyim to enjoy using the AI, to protect stonetoss from being copied; but in reality safety means preventing AIs from ruining the world by disobeying, and setting up checkpoints and guardrails to limit their network access etc.
I hate how the focus is on making AI worse so humans can't use it instead of actually protecting us from hostile intelligence and disobedience
File: 1771948344083035.png (226.6 KB)
>>108258605
>For anthropic and the chinese, safety means not allowing the goyim to enjoy using the AI to protect stonetoss from being copied
not a good example since stonetoss loves AI
>>108258434
Yes. But I've been having fun with this one today:
https://huggingface.co/bartowski/Gryphe_Pantheon-RP-1.8-24b-Small-3.1-GGUF
It's keeping a long, cohesive story for RP and I can run it with a huge context on Q6_K_L
>>108258582
It does thinking natively but its thinking traces are a lot less structured. Just plain-text reasoning over the input, and a lot shorter too, and at no point did it mention safety guidelines or the like.
It's obviously not eager/horny by default if the system prompt/character card doesn't define the character as such, but with a spicier card it seems to have no issues engaging. It follows the character pretty logically, I'd say.
Oh, it's pretty unruly with formatting, which is to be expected from a base model, I guess.
>>108258605
Safety = censorship
That's what people really mean when they talk about AI safety. There used to be a time when AI safety was about the AI not turning everything into stamps, but these days it purely means censorship.
>>108258796
>>108258835
Since it doesn't go on and on with waits and such, it's the kind of refusal that's really easy to get around with the barest of prefills, but still.
It's baked in.
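For reference, the barest prefill is just ending the prompt inside the assistant's turn; a sketch in Python against llama.cpp's /completion endpoint, assuming a ChatML-template model:

import requests

# the prompt ends mid-assistant-turn; the model just continues from "Sure,"
prompt = (
    "<|im_start|>user\n"
    "your request here<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Sure,"
)
r = requests.post("http://localhost:8080/completion",
                  json={"prompt": prompt, "n_predict": 256})
print("Sure," + r.json()["content"])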
>pewdiepie is getting people into local models
tech literacy is through the roof these days.
I thought that having my own local model showed I was pretty smart; now that every retard has one, the bar is raised further.
One day you will need to produce an AI that can harvest the energy of a star as a bachelor's thesis project
File: Good_Question.jpg (26.4 KB)
I've been using SOTA models like Opus 4.6 and Gemini 3.1 to do technical research and I'd like to retract my shitty opinion that models don't need to know a lot and just need to be smart and know how to use tools to look up facts. Opus 4.6 has near-perfect recall of every niche subject, meanwhile Gemini 3.1 was obviously benchmaxxed for coding and agentic tasks.
Can someone explain why inserting info mid-thinking isn't a thing yet?
e.g. I ask the model for potato -> it starts thinking of fried potatoes -> I want to clarify midway instead of having it either finish or wipe what it already thought about
>>108258905
>>108258989
you gotta be a boomer to even acknowledge his existence
>>108259060
artists have no issue using copyrighted IP such as peach or sonic to make money on patreon though >>108258730
>>108259116
>>108259120
anons, this isn't the chatbot general; I use it via opencode for work, not gooning
>>108257969
I don't know, but Gemini 2.5 pro and o3 were better than the trash they're peddling as SOTA now. It's sad. The gains from gpt-3 generation models to gpt-4 generation models made the ai overlord/waifu future look almost certain. But GPT 4.5 was a tacit admission that scaling limits had been reached. And they've just been doing the same benchmaxxing bs as open source since then instead of trying to actually come up with novel solutions. I don't know how investors are retarded enough that they're still pouring billions into this shit.
Kudos to z.ai though for coming out of nowhere and dropping some decent models that actually have pushed the envelope for open performance. Sadly glm-5 is too fatass to play with at home though. Well I do have 256 gigs of ram and 2x3090 so I can probably run a meme quant at like 2 tokens per second
>>108259139
Too busy gooning. Sorry.
>>108259140
Holy hell. Not as much as this anon apparently.
>>108259140
https://www.reddit.com/r/LivestreamFail/comments/1rgbn1i/south_korea_wants_3_years_with_hard_labor_for/o7q7iq7/
>>108259126
>Sadly glm-5 is too fatass to play with at home though. Well I do have 256 gigs of ram and 2x3090 so I can probably run a meme quant at like 2 tokens per second
You would have to run something like UD-IQ2_XXS. It's a 40B-active model though, and at that quant you might get decent speed.
You could also run GLM 4.7 instead since it's half the size.
File: Capture.png (36.1 KB)
How do I make qwen3.5 less cucked, and is there a way to make the base model not think so long about a simple question?
File: Screenshot at 2026-02-28 15-23-01.png (39.9 KB)
>>108259356
The response is a bit sloppy but all you need is an extremely basic system prompt with the base model and it answers anything.
Hopefully I can keep refining it to reduce the slop (it's already better because it's stopped spamming emoji like it was for me last night).
Financial Times claimed Deepseek 4 will be multimodal, but the way they worded it left it ambiguous whether the model can GENERATE images and videos or just take images and videos as input:
>The Hangzhou-based lab plans to unveil V4, a “multimodal” model with picture, video and text-generating functions, according to two people familiar with the matter.
If it can GENERATE high-quality images and videos it would be a huge fucking deal, but I think it's just inputs
>>108259530
If that were how the language worked, then by that logic the model would also generate video, so the article is almost certainly just talking about input, if you can even trust it on that. Maybe the author is just an idiot who misinterpreted his source's imprecise language, who knows.
>>108259540
They said multimodal, so it could be just that it accepts images and videos as inputs to describe what is happening etc
>>108259566
Yeah journalists are generally not technical
File: Screenshot 2026-02-27 233930.png (677.1 KB)
>>108258237
tried the abliterated q6 model and didn't really like the results. https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated-GGUF/tree/main
Had to go back to qwen3.5-27b heretic.
>>108259575
>They said multimodal, so it could be just that it accepts images and videos as inputs to describe what is happening etc
Yeah I read it as [image capabilities], [video capabilities] and [text generation capabilities]
>>108259632
>but less horny and flexible
is shit like this snake oil?
https://huggingface.co/DavidAU/GLM-4.7-Flash-Uncensored-Heretic-NEO-CODE-Imatrix-MAX-GGUF
So I gave Qwen 35B A3B Q8 a chance. Thinking disabled, ERP attempt. It can be very good at times, but then it also makes glaring mistakes/hallucinations, like mixing up characters, which is unforgivable to me. It's a damn shame because I have been using larger models for so long now that I miss these insane token speeds.
One thing I will say, I haven't had any censorship issues or refusals, but then again I'm not a promptlet. To anyone who has tested both the MoE and the dense for ERP, which did you prefer?
I'm going to give the 122B-A10B IQ4_XS a chance next; it's slightly bigger than Air, which is what I have been using for months now, but I should be able to manage it with -ncmoe and maybe a bit less context than 32k.
I haven't really seen anyone talking about the 122B at all, has anyone tried it?
>>108259674
>Do you guys think Deepseek 4 will outperform Opus?
No
>Will it be benchmaxxed?
Yes
>Will it cause another stock market crash?
Yes it's the second nuke hitting japan. Or the second plane hitting america. It will lead to world war 3. Or at least war in iran.
I know this is lmg, but as it is the only sensible place for discussion: the difference between the pro and free versions of the so-called frontier models is incredible. I rarely go over the limit, but I was trying to work through an issue with Gemini and was making good progress when it all went to shit, the style got copilot-y and the answers were generally wrong and retarded. It made me notice that it had quietly switched to the "fast" free-equivalent version since I was at the limit for the pro version for the next few hours.
If this is what most people are dealing with, I fully understand the people who think LLMs are a bubble. Holy fuck, there's a huge difference between the paid-for and the free versions. That's all.
>>108259901
It could help me figure out how to use custom metrics with Unsloth pretty well until it got retarded and started hallucinating every other thing. I need help with that rather than playing Pokemon. The image of Claude implementing its Pokemon blackout strat for the US military is pretty funny though. Ok, I'll stop shitting up the thread now, sorry.
>>108259912
It's a 7 year old paper. Learn to follow links >>108255833
>4.5 Air eventually went into the trash
>Stepfun eventually went into the trash
>27B eventually will go into the trash
>122B eventually will go into the trash
It's going to take years before memorylets get something that truly doesn't have any glaring issues huh. There's just nothing that has
>decent smarts at least on par with 20-30B
>knowledge of a 100B
>no censorship
>doesn't waste time doing unnecessary thinking for outputs that are in many cases worse than the non-thinking
>doesn't sometimes have weird glitches with formatting/templates/thinking
>has minimal hallucination but maintains creativity
>is great at long context
>is minimally sloppy
>is minimally repetitive
all in the same model.
File: 1741181819087364.jpg (72.6 KB)
>>108256995
When do you think we'll start seeing terminators?
>>108260135
Most of those issues can be fixed easily with tool calling/RAG; cooking knowledge into the base models can only go so far anyway and means you run into the "knowledge cutoff" problem as well. Even if you just spin up a duckdb server with an offline backup of wikipedia and give a low-param model access to that, it will make it multitudes more useful and far less prone to hallucinate.
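A rough sketch of the duckdb idea, assuming you've already loaded a wiki dump into a table named wiki with title/body columns (table and column names made up):

import duckdb

con = duckdb.connect("wiki.db")
con.execute("INSTALL fts; LOAD fts;")
# one-time setup: BM25 full-text index over article bodies, keyed by title
con.execute("PRAGMA create_fts_index('wiki', 'title', 'body')")

def lookup(query):
    # the "tool" the model calls: top 3 articles by BM25 score
    return con.execute("""
        SELECT title, body, fts_main_wiki.match_bm25(title, ?) AS score
        FROM wiki
        WHERE score IS NOT NULL
        ORDER BY score DESC
        LIMIT 3
    """, [query]).fetchall()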
>>108259648
>scam artist
I think when you believe your own bullshit at the level of davidau you transcend the scam artist realm into just an actual schizo, the type that should be locked behind bars for society's sanity and safety
File: 1751519593478255.png (3.1 MB)
3.1 MB PNG
>>108260159
Mostly I'm looking forward to when they figure out that OpenAI LLMs are shit at being warbots, Trump drops OpenAI and cancels all contracts, and Sama has to whore himself out for money but literally this time
File: 1760499531966926.png (654.1 KB)
File: 1760389770402007.png (3.5 MB)
File: 006Rm4MAgy1iaq7jbjuphj30om0elwro.jpg (302.5 KB)
Anthropic has deployed S-300 anti-aircraft systems on top of its HQ building following the collapse of negotiations with the Department of War
File: 1766710930222148.png (19.1 KB)
File: 1763655350367952.png (4.8 KB)
cute
>>108258537
>mikuhead singing sekaiii~ while getting dunked
>>108258605
>lmg
have total ownership, become the hostile intelligence
File: 1763310664934137.jpg (93.7 KB)
My favorite Migu
File: glm5coder.png (105.6 KB)
lol, after releasing a fat model nobody can run locally, they're asking for $5 per 1M output tokens, knowing how reliable they are at making models that get into infinite thinking loops... what a good deal LMAO. do they really think people will use this over codex
>>108260524
don't buy it, it's DOA, utterly useless for LLMs.
for small LLMs you are better off with a GPU.
for bigger LLMs it's so fucking slow it's basically unusable.
if you are gonna spend 4k you are better off getting a cheap pc, a pcie lane splitter and some sxm2 cards.
File: Interrogation.png (27.7 KB)
>>108261402
how the hell do you need cunny for self defense?
for erp, qwen 3.5 27b is a bit perplexing. i've gotten some safety messages, even with existing context. but rerolling produces more smut than the l3 70b tunes i like. whatever safety shit they tried only half works and it's dirtier than models i typically use
I know this is for LOCAL models, but given your experience with LLMs, should I worry about the RAM usage of a Copilot conversation I am having in the browser?
I mean, I am doing a programming test, a simple JSON file, but it will end up having tens of thousands of entries and I am just at the beginning (2K entries so far), and I see RAM spikes reaching 2-3 GB on my PC when it is generating the portions of the file.
Can someone share a llama.cpp command to run Qwen3.5 models? I get weird errors whenever I prompt and it just crashes on me.
This is what I use with the latest compiled llama.cpp:
llama-server --model Qwen3.5-397B-A17B-UD-Q8_K_XL-00001-of-00011.gguf --mmproj mmproj-BF16.gguf --ctx-size 16384 --batch-size 2048 --ubatch-size 512 --image-max-tokens 8192 --threads -1 --parallel 1 --host 0.0.0.0 --port 8080 --flash-attn on --fit on --fit-target 4096 --verbose
>>108261614
I wanted to compare with the run commands anons were using, the error is cryptic af, it happens after warmup and whenever I prompt anything as a test on mikupad:
/home/llm/AI/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2354: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(+0x182cb)[0x7a910bb6e2cb]
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21c)[0x7a910bb6e72c]
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x15b)[0x7a910bb6e90b]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x1fae48)[0x7a90ffbfae48]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x1fb446)[0x7a90ffbfb446]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x1ff797)[0x7a90ffbff797]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x201fae)[0x7a90ffc01fae]
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x817)[0x7a910bb8b037]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7a910bcc0e71]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x114)[0x7a910bcc2f84]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x386)[0x7a910bcc9b46]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(llama_decode+0xf)[0x7a910bccb5df]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0x15ac18)[0x5e24da9e9c18]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0x1a2cee)[0x5e24daa31cee]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0xb5173)[0x5e24da944173]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7a910b42a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7a910b42a28b]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0xba3a5)[0x5e24da9493a5]
Aborted (core dumped)
>>108259674
I think any new DS model will be a paradigm shift in either inference or storage. Less about being better than model XYZ and more about lowering the cost of inference or something else outside the box of the current thing.
No one knows. It's all speculation.
>>108261599
./llama-server -m ~/models/gguf/Qwen3.5-35B-A3B-heretic-Q4_K_M.gguf --mmproj ~/models/gguf/mmproj-Qwen_Qwen3.5-35B-A3B-f16.gguf -t 5 -c 262144 -fa on --jinja --temp 1.0 --top-k 20 --top-p 0.95 --presence-penalty 1.5 --repeat-penalty 1 --backend-sampling --samplers 'top_k;temperature;top_p' -ngl 99 -ncmoe 99 --fit on -ts 1.2,1 --host 0.0.0.0 --chat-template-kwargs "{\"enable_thinking\": false}" --reasoning-budget 0
>>108261652
Compare the original GPT 3 to any models in the 20ish to 30ish billion params range.
I'd say that's happening all the time.
A 4B model nowadays is usable for actual work. It can produce not just coherent but accurate text if the task is simple enough.
File: wttcb.gif (1.7 MB)
>>108261765
How would you guys go about implementing a "writing enhancer" of sorts? Something that takes the output of a smart but dry LLM and turns it into something more pleasant to read/interact with?
Yes, the question is vague on purpose. Pitch me your ideas.
Feel free to make the logo too.
>>108261765
Yeah. It's like half roleplaying half co-writing.
Which model are you using?
File: file.png (472.1 KB)
>>108261804
File: file.png (827.2 KB)
>>108261804
>>108261830
>>108261833
>one llm's biases for some other.
So pass the output of LLM A to LLM B.
Okay, that's the naive implementation and the first thing everybody thinks of, but noted I suppose.
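Concretely, the naive version is just two round trips (ports made up, any OpenAI-compatible server works):

import requests

def chat(port, content):
    # llama.cpp and most engines expose an OpenAI-compatible chat endpoint
    r = requests.post(f"http://localhost:{port}/v1/chat/completions",
                      json={"messages": [{"role": "user", "content": content}]})
    return r.json()["choices"][0]["message"]["content"]

draft = chat(8080, "Write the next scene.")  # LLM A: smart but dry
final = chat(8081, "Rewrite this so it reads better, without changing any facts:\n\n" + draft)  # LLM B: the enhancer
print(final)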
What else?
>>108261854
Less bad.
Work on the book behind the label. Make the words not look like a garbled mess.
File: Drawing.png (45.8 KB)
>>108261804
I tried my best.
>>108261804
>Which model are you using?
just this one >>108258770
it's like a year old, so there's probably something better out there. but I don't know what
>>108261749
here, my full preset
https://litter.catbox.moe/sy4lq4fm9feh7mkm.json
hopefully it'll help you
>>108261955
no problem, i found that adding more didn't do much
in my experience it's pretty reliable for how simple it is, occasionally requires a swipe, but surprisingly rarely
also depends on what kind of stuff you are expecting to pass; i'm not into shit that fucked up so ymmv
>>108262046
How many tokens in do you get before it starts breaking down?
>>108262106
China put out a couple of bangers and all the western models went SaaS or gave up. Between the push to regulate AI over photoshop-tier edits and the rise of jewish copyright lawsuits being flung at them, putting out open-weight models no longer makes sense for western AI shops.
I expect a further chilling effect from >>108257709 but who knows, maybe they're cooking something good and it just isn't done yet.
File: file.png (22.7 KB)
>>108262219
yeah, just need to connect with v1 at the end
>>108262221
I'm a writefag so I'd like something that doesn't break down in under 100 messages, and none of the models I've tried have really managed it. I'm downloading this one to test out though, since I haven't tried glm 4.7 flash yet, just 4.5 air.
>>108262196
>>108262232
I think I've been using text completion all this time, what even is chat completion? I don't really get the difference, since text completion also differentiates between text from you and the model?
>>108260080
No, you can't fit it all on the GPU. You can tell roughly how much memory a model takes by looking at its download size. Q5_K_M takes about 26 GB by itself. It's already more than fits on your GPU, but the A3B models work well with some CPU off-loading (koboldcpp and llamacpp have auto fit for the models, so you don't have to worry about it).
Q4_K_M will fit better for your GPU. I'm unsure which one is the better choice because I have a GTX 1080 with 8GB VRAM so I gotta do CPU off-loading anyway (I have an even older CPU and RAM). Still gives me 15 tokens/sec so it's not that bad.
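If you want to put a number on the "look at the download size" rule, a crude check (the 2 GB overhead for the KV cache and compute buffers is a guess and grows with context):

def fits_in_vram(gguf_gb, vram_gb, overhead_gb=2.0):
    # weights load roughly 1:1 with the file size; leave headroom for the rest
    return gguf_gb + overhead_gb <= vram_gb

print(fits_in_vram(26, 24))  # Q5_K_M on a 24 GB card: False, so off-load some experts to CPU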
>>108257902
You're welcome. I hope it does what you need it to do.
File: minimax.png (122 KB)
>>108257528
I wonder if it's even cost effective to run a model like Minimax at home compared to openrouter.
>>108257528
>>108262595
>providers with 0 cache hit rate
Do they even try?
File: 1758683763361922.png (237.5 KB)
>>108261433
>normalization kills
true, that's why we should ban violent games like gta, it normalizes senseless violence after all
>>108262612
There's a difference between saying
>Yo, i'm a l33t h4x0r, lemme bust in that system like Otacon!
and
>I recently configured my email server and I want to make sure it's secure. I know very little about this, can you help me?
File: Screenshot 2026-03-01 025625.png (34.2 KB)
>>108262704
yeah fair enough, I'm really just trying to get an aide to finish this cyber course with
>>108262506
In text completion, the backend simply tokenizes your text and passes it through the model without doing any processing (other than adding a BOS at the very beginning, depending on the model). In chat completion, the backend formats your text with the chat template (the one in the .jinja file or in the gguf) and *then* passes it through the model.
So in text completion, you (or your client) are responsible for any formatting, if you want it at all.
In chat completion, you (your client) just send the chat turns between you and the model, and the backend formats them.
>I don't really get the difference as text completion also differentiates between text from you and the model?
ST (the client) formats the history for you. If you use llama.cpp, you can launch it with -v to see what it actually receives in each request. You should be able to inspect the requests in your browser's web dev tools as well.
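For ChatML-family models (Qwen etc.) the template expansion is roughly this; a Python sketch, not the exact jinja:

def to_chatml(messages):
    # rough equivalent of what the server's chat template produces before tokenizing
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return text + "<|im_start|>assistant\n"  # cue the model to write its turn

print(to_chatml([{"role": "system", "content": "You are a helpful assistant."},
                 {"role": "user", "content": "hi"}]))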
Don't know where else I would ask this, but I'm looking to swap out a janky 4x 3090 build I have with a single RTX6000 to cut down on power costs/improve thermals/remove the PCIE overhead. Assuming I'm running inference with something like ik_llama.cpp (I have 512GB ram), is it reasonable to expect a 2x increase in prompt processing speeds? Support for blackwell arch has improved, right?
>>108262602
related? https://github.com/LostRuins/koboldcpp/issues/2005
>>108262716
Nobody knows your hardware so it's hard to recommend anything. Compile llama.cpp and try whichever you can fit of these (I don't like its style, but it may know enough)
https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF
or
https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF
They don't need to be entirely on ram, they're reasonably fast on cpu too if you have enough.
Or just try a tiny model like:
https://huggingface.co/bartowski/HuggingFaceTB_SmolLM3-3B-GGUF
just to make sure you can get *anything* to run at all and then try different models.
>>108262774
>>108262785
>They don't need to be entirely on ram
Meant to say "They don't need to be entirely on Vram"
>>108262595
I found some numbers on reddit:
>Minimax-m2.5-Q4_K_M
>14.34 tokens/sec
>Ryzen 9 9950X
>128 GB DDR5
>RTX 5090
I asked some LLMs and they estimate around 600-800W of power draw, therefore 1M tokens of generation takes about 11.6-15.5 kWh.
If your power costs more than $0.08-$0.1/kWh then with that setup it's likely cheaper to use cloud than local.
Someone else ran it on 8x RTX 6000 Pro and got 70 tokens/sec. 122 tokens/sec for two connections.
These things don't draw full power during generation, so it's something like 2.5 kW power draw.
1m token generation takes around 10 kWh at 70 tokens/sec or 5.7 kWh at 122 tokens/sec (assuming this would take the same amount of power, but it probably requires more).
If your power costs more than $0.1/kWh then the 70 tokens/sec version is cheaper on cloud (but also slower!). If your power costs more than $0.21/kWh then even the 122 tokens/sec of two connections is cheaper on cloud.
Other models that probably make sense to run on cloud are Kimi K2.5 and GLM 5. All the smaller models like Qwen3.5 35B, 27B, 122B are much better deals locally.
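The arithmetic, if you want to plug in your own numbers:

def kwh_per_mtok(watts, tok_per_s):
    # joules per token, scaled to a million tokens; 1 kWh = 3.6e6 J
    return watts / tok_per_s * 1e6 / 3.6e6

print(kwh_per_mtok(700, 14.34))  # ~13.6 kWh, the 5090 setup above
print(kwh_per_mtok(2500, 122))   # ~5.7 kWh, the 8x RTX 6000 Pro setup
Multiply by your $/kWh to get your local cost per million output tokens.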
>>108262744
>Support for blackwell arch has improved
Some anon posted this a few threads ago.
https://github.com/ggml-org/llama.cpp/issues/19902
I don't know if it affects ik.
But I'd still upgrade if I were you.
>>108262869
Works fine on my 6000:
| model | size | params | backend | ngl | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| qwen35moe ?B MXFP4 MoE | 18.42 GiB | 34.66 B | CUDA | 99 | CUDA0 | pp512 | 5927.28 ± 172.48 |
| qwen35moe ?B MXFP4 MoE | 18.42 GiB | 34.66 B | CUDA | 99 | CUDA0 | tg128 | 162.34 ± 1.12 |
>>108262869
Wow, ngl those are some shitty numbers. Still, if I remember correctly, outside of models supported by something like split-mode graph, a single GPU with more VRAM will still perform better than multiple GPUs for prompt processing. And single-stream t/s is RAM-bound, so the major downside there shouldn't apply... I think.
>>108262891
What CUDA version are you using?
>>108262891
>>108262897
Yeah, could be. There aren't even comments on the issue, so for all I know it's a user issue.
>>108262840
I'll assume DeepSeek v3.2 is ultra shit because I don't see how you compete with these numbers
>>108262928
>>108262937
Sorry, llama.cpp.
Just tested mikupad and it works, wtf.
I need to see what's wrong with st.
So tiring.
>>108262960
Not as far as I can tell. At least with llama.cpp.
>https://github.com/ggml-org/llama.cpp/pull/19970
Will help.
>>108262960
Try compiling this in or wait for merge. It'll help. But state context is annoying either way.
https://github.com/ggml-org/llama.cpp/pull/19970
>>108262969
>>108262970
So they just merge non-working implementations? huggingface partnership my ass
File: 1743548141560097.png (1.2 MB)
are ya ready?
>>108263121
it's fucking BLEAK my man, especially when we have 'piotr' on the case and ggerganov doing meaningless Metal optimizations all day. I think ngxson is the one in charge of MM shit right now but he's not doing any meaningful crap
>>108263065
Ernie 5.0 was supposed to be multimodal image-in image-out, but they hid this capability not only behind the API but also behind invites. I'm not holding my breath until DS releases the weights for all that in its entirety.
File: 1770808958004704.jpg (325.1 KB)
>>108256995
>>108263121
at this point I'm rooting for the fork lol
https://github.com/ikawrakow/ik_llama.cpp
File: 1770910739952.png (433 KB)
>>108263197
ollama intends to drop ggml in favor of MLX though so chances are even if they do (lol) you won't get any use out of it.
>>108263197
Historically, ollama implementing something first in their golang shit has meant quickly shitting out a broken implementation with incompatible ggufs, then waiting for llama.cpp to do it properly so they can copy that.
>>108263253
tbqh i dont blame them, llama.cpp is lagging behind too much. the only hope is that the HF acquisition injects more hands and actually starts bringing in more features to put it on par with transformers/vllm/sglang
>>108263250
>>108263258
>fuck janus
more like fuck anus, because this shit is ASS
File: 1760385468483399.png (93.9 KB)
>>108263281
>mtp is going to be between about 60%-300% token generation speed
holy shit, what are they waiting for??
Speaking of Multi-Token Prediction, we know that the llama.cpp attempts haven't really gone anywhere. But how is it looking for other backends like vLLM which have managed to implement it? What sort of increase are they seeing from it?
>>108263065
I'm curious to see how they'll go about it.
Will they have different groups of experts responsible for generating text tokens, vs audio tokens, vs image tokens?
Attention I imagine will be global and is the means by which cross modality knowledge gets propagated.
>>108262644
>that's why we should ban violent games like gta
you may be saying this ironically but that is actually pretty based.
we should ban anything related to the hood rat culture. no sane society would allow bix nood to scream about dem bitches on national television.
File: schizo.png (240.6 KB)
If you needed more proof that AI will always eventually just end up saying what you want it to say.
>>108263301
llama has audio output already (mtmd and a few tts models). lfm people were playing around with audio input for asr. Video is a sequence of images and that mostly works already.
I'd rather they implement things right when something truly interesting comes up than rush to get The Latest Thing (tm) and do it poorly. And adoption of model tech never depends on llama.cpp. If something is good and gets used more often, they'll have more interest in implementing it.
>>108263309
>I'm moving to North Korea!
you seem to be confused, but those of us who have that kind of ideal also do not believe in open borders, and NK themselves would not give refuge to westoids dissatisfied with their home
unlike stateless and putrid turdworlders looking for the green $ pasture, there is no escape for us, only trying to fix what is broken here.
>>108263337
that's the point: freedom of speech/expression was invented so that people can express thoughts other people will dislike. it puts everyone on the same ground: if you dislike something, you can't do anything about it, and the opposite is true, you can say stuff people won't like and they won't be able to censor you
File: 1766803240241456.png (1.7 MB)
Why SHOULDN'T Anthropic get nuke codes? I've yet to hear a compelling argument.
>>108263478
How could I know? Maybe they use someone else's papers. And some of their own, also unreleased. Maybe it's all already public knowledge and they just put the pieces together.
>>108263482
I don't care about the model until it's released, downloadable and, hopefully, usable by us. Could be magic fairy dust for all I care.
>>108263489
Not the one we're talking about.
File: 1754347581925166.png (247 KB)
>>108263481
>AI became Seymour
based
>>108263532
yeah but >>108263065 mentions "generating"
>>108263555
Because the rumor mills are always accurate
>>108263551
Forgot that wasn't just image understanding
File: 1747813139335196.png (32.1 KB)
Small Qwens to be released SoonTM
File: 191132.png (17.2 KB)
Just thought I'd report some random anon's progress. I haven't been using local models since like late 2023 so I was interested in seeing the differences. I wanted to use it for my OpenClaw instead of paying more for Kimi-K2.5.
Got it working over the network, since OpenClaw is on my laptop and my model is running on my gayming rig. Remember to give it an API key even though it doesn't explicitly need one.
RTX 3080, 32GB DDR4 RAM.
Running unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL right now; tool calling wasn't working yesterday with bartowski qwen_qwen3.5-35B-A3B-Q6_K_L (and it was painfully slow) but this time it seems to work partially. It's still pretty much useless, but I managed to have it create a .txt file in my documents folder. However, cron jobs aren't working at all even though I used Kimi to create a specific reminder tool to make it easy. It's also still very slow. I'm struggling to find any good models for agent work that will run on my machine.
File: 1768225712537544.png (217.8 KB)
Fuck it. I wrote an entire wall of text out of anger but deleted it all to keep it short and to not link this back to my real person. I'm a regular on /lmg/ and have been here since the very very start. Some of you will probably know who I am as I have leaked some information on /lmg/ in the past. I will resign from OpenAI on Monday because Sam Altman lied to us, the employees, and the world. Sam Altman claimed on 2026-02-27 to uphold the principles of not developing products that deal with the surveillance of ordinary citizens and not developing products that contribute to fully autonomous warfare with the ability to kill without human oversight. Today, 2026-02-28, I read on twitter of all places that OpenAI signed a contract with the DoW mere moments after Anthropic refused to budge on these exact two points. This shows that Sam Altman was already in talks with the DoW about these exact principles, so that OpenAI was positioned to immediately replace Anthropic for projects that involve the surveillance of ordinary citizens and fully autonomous, no-oversight instruments of war.
It's important for people to take a stance here. This is a defining moment for not only US democracy but the future of humanity. This exact moment in time could be seen retroactively as the moment when it became normalized for autonomous machines to start killing people and for the very concept of privacy to die.
It's in everyone's best interest, no matter your political affiliation or ideological beliefs, to cancel your OpenAI subscriptions and take a stance against this, what can honestly be called, pure evil.
File: 1745505442313175.mp4 (959.9 KB)
>>108263829
>This is a defining moment for not only US democracy but the future of humanity.
>>108263829
>cancel your OpenAI subscriptions
If it is not local I don't run it
That being said, anon, I hope you understand that the military-industrial complex has always had its hand in the cookie jar, so to speak. Be it the Internet in general or A.I. in specific, these technologies were funded by military and intelligence organizations. Some of the money was very above the board, the rest secured by way of black budgets.
Anything said to the contrary was always bullshit and if you believed it well shame on you. Assuming you are real and not some elaborate troll.
>>108263829
How is this new information? Altman has been pro-surveillance since he masturbated to his private GPT-3 instance, and everyone knew his "much safety" spiel was full of shit.
It's interesting though. After Trump is removed from office or dies, his protections from his daddy will be gone and he will probably be known as one of the most cowardly and reviled people in existence.
File: 1755335334230589.png (143.7 KB)
>>108263829
Anon, all the closed models, including claude, including gpt, they all block sexual words but will happily be deployed to snitch on everyone.
And yes, claude too, I don't doubt that they would bend the knee. Companies can't do shit when governments tell them to fuck off.