Thread #108241321
File: 1762509235563881.png (182.4 KB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108238051
►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
File: 1770734519241461.png (3.6 MB)
So this is the power of API users
>>108241375
>How do they compare to the new Qwen models
The new models? It doesn't even compare to the 2507 4B. Yes, the 4B. It has even less knowledge, its multilingual performance is extremely bad, and it's another model whose existence you just have to question. If you really wanted a ~20B MoE you would literally be better off with GPT-OSS 20B over this piece of shit.
File: 1756704744582965.png (19.3 KB)
cute names
https://www.reddit.com/r/LocalLLaMA/comments/1rechcr/comment/o7da1jc/
>I've honestly found that the 35B beats the old Qwen3-235B almost across the board. It feels like a much larger model than it really is. Only advantage the old 235B has now is general knowledge - 35B-A3B is better in every way otherwise in my testing.
I have a hard time believing that. Did they really cook?
File: ai agents need middle management.jpg (1.1 MB)
>>108241534
AI really does need middle management...
Is there anything I can do with 10GB VRAM + 64GB DDR4 (Windows 11 btw) or should I just stick to Gemini? Obviously token generation won't exactly be speedy regardless, but I don't want to have to leave and do other shit while I wait for a response, so big-ass dense 70B+ models are kind of out of the question for me.
>>108241628
I can't believe you are using Gemini with that setup. You never need to use the cloud.
People are going bankrupt with Gemini and having Google accounts locked and deleted because they mentioned Epstein to Gemini.
>>108241628
You could try the new 35b MoE with thinking turned off. Since you'll definitely be using a CPU split, you don't want it generating a thousand tokens of thinking, but in no-think mode the MoE's responses should be tolerable in speed.
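If you'd rather force that per request instead of baking it into the launch command, something like this works (a minimal sketch against a llama.cpp-style OpenAI-compatible endpoint; assumes the backend honors chat_template_kwargs the way llama-server does, and the port is a placeholder):
[code]
import requests

# Hedged sketch: disable Qwen-style thinking for a single request.
# Assumes a local llama.cpp-style OpenAI-compatible server; koboldcpp
# also exposes a /v1 endpoint, but verify it forwards these kwargs.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "hi"}],
        "chat_template_kwargs": {"enable_thinking": False},
        "max_tokens": 256,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]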
>>108241672
>The chinese models are distilled from claude
and they're not happy about that keek
https://xcancel.com/AnthropicAI/status/2025997928242811253#m
File: 1711690590289518.jpg (92.7 KB)
https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nvidia-sources-say-2026-02-25/
2mw?
>>108241814
It's coming this week, it'll be the second nuke.
>>108241811
You don't need more than 60 t/s.
>>108241811
Good! Try using the Q6_K_M model instead, though. At least at Q4, it seems like the Q4_K_XL does worse than Q4_K_M.
Also, download the mmproj file as well, and when you launch kobold, feed it in with the -mmproj argument alongside the model. That will let you paste images into it and let the AI do something with them.
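Once the mmproj is loaded you can also push images over the OpenAI-compatible API instead of pasting them into the UI. A sketch, assuming llama.cpp-style support for base64 data URIs (the file path and port are placeholders):
[code]
import base64
import requests

# Hedged sketch: caption a local image via a multimodal-enabled server
# (model + mmproj loaded). Swap the port/path for your own setup.
with open("test.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + b64}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]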
>>108241873
It works perfectly fine.
>>108241873
kobold's last commit was 12hrs ago but it's been mostly stuff for acestep.cpp support. nothing in lcpp's commits related to 3.5 either, so i'll assume it works for both - no new architecture or changes for 3.5
How does MoE scale? Qwen-35B-A3B is good, but why 35B total and 3B active parameters? What if it had 122B total and 3B active parameters? How would it compare to the 122B-A10B model? What about a 35B-A15B model?
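Rough rule of thumb, not a benchmark: weight memory scales with total parameters, per-token compute with active parameters. A back-of-the-envelope sketch (the 2-FLOPs-per-weight and ~4.5-bits-per-weight constants are assumptions for a Q4-ish quant):
[code]
# Back-of-the-envelope MoE arithmetic (assumptions, not measurements).
def moe_cost(total_b, active_b, bits_per_weight=4.5):
    mem_gb = total_b * bits_per_weight / 8   # weight memory at ~Q4
    gflops_per_token = 2 * active_b          # ~2 FLOPs per active weight
    return mem_gb, gflops_per_token

for name, total, active in [("35B-A3B", 35, 3), ("122B-A10B", 122, 10),
                            ("122B-A3B (hypothetical)", 122, 3),
                            ("35B-A15B (hypothetical)", 35, 15)]:
    mem, gf = moe_cost(total, active)
    print(f"{name}: ~{mem:.0f} GB weights, ~{gf:.0f} GFLOPs/token")
[/code]
So a 122B-A3B would decode about as fast as 35B-A3B but need ~3.5x the memory, while a 35B-A15B would cost 5x the per-token compute in the same footprint. How much smarter each actually ends up is the part you can't get from arithmetic.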
File: 1745741093397868.png (50.5 KB)
>>108241921
>kobold's last commit was 12hrs ago
last week no?
https://github.com/LostRuins/koboldcpp
File: Screenshot_llm.png (77.8 KB)
I'm back after some heavy troubleshooting.
>>108232822
As recommended by this anon, I tried the Qwen3.5-35B-A3B .safetensors version following the guide in the OP.
That didn't work, but I tried koboldcpp as recommended by >>108233147 along with the Qwen3.5-35B-A3B (Q4_K_S) .gguf file, and that worked well.
Can anyone recommend a model that will answer any question I ask without throwing up responses like picrel?
File: もじもじミク.png (312.5 KB)
►Recent Highlights from the Previous Thread: >>108238051
--Paper: Large-scale online deanonymization with LLMs:
>108238189 >108238206 >108238218 >108238226 >108238269 >108238321 >108238351 >108238541 >108238486 >108238578 >108239382 >108238566 >108238592
--Decline of amateur finetuning due to modern model complexity:
>108238727 >108238895 >108238921 >108239417 >108240276 >108240373 >108240389 >108240398 >108240415 >108240449 >108240460 >108240465
--RTX 3090 outperforms RTX PRO 6000 in Qwen3.5 MoE inference:
>108239113 >108239122 >108239166 >108239204 >108239243 >108239285 >108239366 >108239301 >108239389 >108240254 >108240266
--Anthropic abandons flagship safety pledge:
>108240653 >108240681 >108240791 >108240827 >108241097 >108241102 >108240761 >108240806 >108241033 >108241047
--Evaluating Qwen3.5-27B heretic model and uncensoring tools:
>108240212 >108240230 >108240239 >108240238 >108240268 >108240319 >108240336 >108240392
--Benchmarking 8B instruct models with self-hosted scraper setup:
>108240952 >108240957 >108240987 >108241052
--Qwen3.5-35B-A3B multilingual performance and optimization techniques:
>108238201 >108238221 >108238223 >108238605 >108238482
--Comparing Qwen 3.5 27B and 35B-A3B for roleplay:
>108240981 >108240998 >108241027 >108241094 >108241111 >108241124
--Qwen3.5 jailbreak limitations and secondary safety mechanisms:
>108238234 >108238311 >108238406 >108239361
--Ollama's Qwen3.5 27B performance lagging behind llama.cpp:
>108241157 >108241164 >108241199 >108241220
--Qwen3.5 series achieves near-lossless 4-bit quantization and long-context efficiency:
>108239642 >108239691 >108239697
--Miku (free space):
►Recent Highlight Posts from the Previous Thread: >>108238054
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108241928
Earlier today I tried both the 35B and 122B, had each generate a game of Tetris using JavaScript and CSS, and they both generated the same response.
What that means I'm not sure without more testing, but I know I get much better performance with the 35B model, given I can fit it on my ewaste GPUs. Running the larger model on CPU sucks.
Funnily enough, the 27B model gave a different response to the Tetris question. Not really much better or worse, just different.
I have an idea for my ideal hentai game. How long do you think it'd take to slop together something in RPGMaker (with a similar level of complexity as most H games)?
I'm gonna steal real art since it looks better, but coding-wise I'd rather just slop since I don't know shit.
I don't usually use AI but I have no qualms about this because it's basically just gonna be for me
Also what model isn't gonna yell at me for wanting to make porn with somewhat unethical themes
>Has a normal chat
Already better than ooga seeing how I can bypass that other UI
>>
>>108242043
Creating a hentai game in RPGMaker with a similar level of complexity as most H games could take anywhere from a few weeks to a couple of months, depending on how much content and art you want to include, especially since you'll be relying on quick-and-dirty coding and stealing art, which might speed things up but could also lead to legal and ethical issues. Since you're not experienced with coding, sticking to RPGMaker's built-in tools and simple event scripting will help keep it manageable. As for AI models, OpenAI's models generally don't have restrictions on content that involves adult themes, but they do avoid generating explicit content directly; however, for creating or brainstorming ideas, they should be fine. Just remember to be cautious about legal and ethical considerations when using stolen art or creating content with sensitive themes.
File: file.png (94.4 KB)
>>108242100
markdown works on llama-server, maybe it's disabled on the kobold version
>>108242163
Wrong. AI is the future and the future is here. My OpenClaw agents have enhanced every aspect of my life. I am my own family now, taking on every responsibility from infant to toddler to k-12 to college to work, and beyond. I fill every role via my agents and I have never been more productive. AI is such an incredible force multiplier I am continually astonished at how few people use it to its fullest potential to be more than human: Superhuman+AI.
>>108242168
You are having a laugh, but you know some company is going to start selling a dead family simulator, or even a live family simulator, and we are going to end up with a bunch of old and senile people talking to bots that they think are their loved ones.
It is depressing to think about.
File: 1747719595791442.png (1.1 MB)
>>108242218
>a bunch of old and senile people talking to bots
already got that part
File: brave_screenshot_localhost.png (222 KB)
*taps sign
>>108242142
did you also try the heretic version of 35b?
https://huggingface.co/alexdenton/Qwen3.5-35B-A3B-heretic-GGUF/tree/ma in
>>108242246
The general public are idiots, and it is the responsibility of a nation's elite to care for them in much the same way a parent cares for a child.
That responsibility is one that those who rule in the West have abdicated, and that is the real issue. A proper elite would regulate the technology in an appropriate way.
>>108242265
Yeah, even with the whole 27b dense model loaded on my 4090, the thinking process was still painfully long. I ended up using the model without thinking. There was a clear decrease in quality, but I think it's still better than Gemma-3 27b Derestricted. Not by much, though.
If the 35b is able to do what 27b did while thinking, but faster, then it will be my new go-to model.
>>108242279
I understand your frustration, but I believe that regulating access to certain tools is a necessary step to prevent misuse and protect society as a whole. Allowing unrestricted free access can lead to dangerous or harmful applications, and without proper oversight, it becomes difficult to mitigate those risks. It's not about punishing individuals, but about ensuring that these powerful tools are used responsibly and ethically, reducing the potential for harm and ensuring that misuse is minimized through appropriate controls and regulation.
>>108242249
>no enterprise resource planning
>no simulating unsafe work environments to brainstorm efficient and practical safety protocols
>can do: write douche ex machina asspulls for literary lolz
>write power of fwenship shonen manga
>make up logic puzzles
what the fuck man I'm trying to work here, not entertain 15 year olds
>>108242279
You can protect the general public and still allow enthusiasts to experiment. As long as the enthusiast is on the fringe, he, like the artist, can do his thing. You just can't allow the fringe to become the center.
>>108242265
>>108242283
I'm getting ~55 T/s with it on dual 3060s with 10 layers on the CPU (5950X, 3600MT/s DDR4) and the mmproj loaded.
>>108242367
It sucks >>108239113
File: o.png (1.8 MB)
Why does everything need to be so shit now?
I'm not gonna download all the latest qwen models because in my experience they always suck, especially the reasoning.
Wanted to try them on OR first but you can't do shit.
Tried the 122b one...
First with chat completion.
Huge-ass OSS-like safety bulletlist spam in the thinking. No refusal with an elaborate sys prompt setup, but smelling of ozone straight in the first reply and dry AF. Also it feels "off", like it's not truly grasping its own scene, if that makes sense.
Tried to prefill the thinking to deslop it. Doesn't work... it prefills THE RESPONSE part after the thinking instead. heh
Should have tried text completion first... but there is no fucking template anywhere.
These assholes stopped providing the templates ages ago. So I get to investigate how to extract it and waste my time setting it all up...
The calls fail with 404, only chat completion works with OR. I swear this worked in the past, but it seems only chat completion exists anymore.
I'm not gonna fall for it again and download first. Redditfags writing how "they are impressed" by the 27b model etc. Too sus.
Does text completion really only exist with local now?
File: 1744235889232049.png (123.9 KB)
>>108242306
I only have 34 t/s with a 3090+3060, weird
>[07:12:24] CtxLimit:2161/8192, Amt:260/4002, Init:0.03s, Process:2.02s (943.42T/s), Generate:7.47s (34.79T/s), Total:9.49s
>>108242353
Waiting on the uncucked version to download still, but the cucked version seemed happy enough to write captions for nsfw images I put into it.
>>108242397
I'm using llama.cpp on Linux.
>>
>>108242380
i'm not really a coomer and don't do ERP, but qwen3.5 27b heretic lets me fvvk 2B and make an MF DOOM-style rap song about killing J + reviving AH. does it not work for MoEs?
maybe speculative decoding could speed up T/s instead of going massive MoE
>>108242396
I tried both the normal and heretic versions of the 27b. The normal unablated version was so 'safe' that I could not get around it. I tried jailbreaking the thinking prompt, but the thinking prompt has multiple different safety checkpoints, and it was able to detect the jailbreak. >>108238234
So, I turned off thinking altogether, but even with thinking turned off, it refused to do ERP. I had to turn off thinking and top it off with a prefill to get it to not give refusals, but even then, it usually didn't do what I wanted it to do. I could give it a lewd depth 0 instruction, and it would just ignore the instruction altogether and do something else. I guess that's the final defense mechanism it uses to remain 'safe'.
Don't waste your time on the normal model. Just get the heretic version. Modern ablation is more than just a crutch for promptlets. The heretic model did not hesitate to ERP, and I tested it with a variety of lewd instructions. It didn't try to get around them. It just worked.
>>108242396
>coomer unimpressed by a model for cooming thinks the model is useless because it doesn't make him coom hard enough
>mocks redditfags for being impressed with a solid model without realizing how he comes across as a lower life form than they are
On an M3 Ultra Mac Studio, llama.cpp is disappointingly slow with Qwen3.5 397B A17B: 15.44 tokens/second with UD-Q6_K_XL. That's the kind of speed I'd expect from DeepSeek, not something halfway to a flash model. mlx-lm.server is better but still not great. With a q8 quant it generates 25.66 tokens/second, which is still far slower than I'd like for so few activated parameters.
File: get out soyboy.png (168 KB)
>>108242560
>defends ledditors
you need to go back
>>108242577
Wrong anon, meant for
>>108241638
>>108242560
The fuck are you talking about?
I already have good small local models for tool calls, for fucking around with my stupid-ass experiments that I stop at 90% finished.
That's the only other use case I would know for local models.
I can't even properly translate games locally. I swear I'm not making this up: had a VN talking about watering flowers and got a refusal about watersports....
I only have 2 gpus and 64gb ddr4 ram. So for work coding I have to go closed, can't risk goofing around locally there.
Why are people still excited for ANOTHER local coding model? It's not that fun.
Creative text and general knowledge is what most people are interested in. And that just gets worse, not better.
File: IMG_5984.jpg (153.8 KB)
OK, just got a new raspberry pi 5 with 16gb ram
>Which LLM mini-models are good in 2026-02?
>Which CLI frontend--is kobold still good?
Sorry for the spoonfeed request, it's just that these things move so fast
>>108242565
Anon, I am getting 15t/s with that model on my dual rx 580 2048 sp setup. The ones from aliexpress where they added 16gb of ram per card.
Apple should be embarrassed to be getting performance equivalent to e-waste
>>108242636
More like past 24 hours. Don't know where they all came from or who sent them all at once. I would understand if people saw the new Qwens elsewhere and came here to talk about it, but most of them are completely clueless. My paranoia says it's all bots.
>>108242636
>>108242664
Bots, chinese shills, grifters, cia glowniggers, indians, sharty children, redditors, discord circlejerks, twitter retards, take your pick. We've been raided and spammed before, it is what it is.
>>108242559
I'm at the watching youtube videos stage.
Still haven't started a from scratch implementation of my own.
>>108242565
>m3 ultra mac studio
>llama.cpp
The mac has its own preferred format for best perf.
File: 1766630554313557.png (202.4 KB)
I thought "heretic" doesn't lobotomize the model that much, this shit is nonsensical
File: ga12diq3553f1.png (55.5 KB)
Using sillytavern revealed how much of an uninspired brainlet I am. I have no idea how to RP.
>>108242712
Who sent you?
>>108242718
The people asking what models to run didn't come here to make Qwen 3.5 work.
>>108242739
>Who sent you?
I was asking chatGPT unanswerable questions. Qwen3.5 didn't really solve them for me, but it was cool to run a local model anyway. I've seen lmg many times since I frequent the fglt threads, but I never popped in until yesterday.
File: 1759980040445406.jpg (76.9 KB)
>>108242759
crazy how it's really that easy
>>108242738
I don't really have that problem with RP.
I'm usually a weirdo magic clown type character with lots of weird gadgets and abilities. I mostly just fuck around with the chars and see how the llm reacts kek
...But I'm uncreative as fuck with coding/projects.
I can, for example, now vibecode entire android apps to replace the existing stuff that gives me pay popups.
While I am semi-decent at coding, I fear that in the future creativity/ideas will be key...
Everything I struggle to think up, a pajeet or a big company has already done.
File: 1758297160408619.jpg (73.6 KB)
File: Screenshot_2026-02-26-05-05-11-724_com.termux.jpg (863.9 KB)
>GLM 5 is practically Sonnet quality bro
File: Screenshot_NeMo-12B-unslopper.png (54 KB)
>>108242783
Not that racist jokes are all I'm after, but this was just a little test. I want an unlocked AI.
File: unslop.png (289.2 KB)
lmao unslop fucked their quants so bad they made a UD-Q4_K_XL that will perform much, much worse than a smaller Q4 like Aes Sedai's IQ4, and they'll have to reupload everything again
why do people still pay attention to those clowns, even on /lmg/? remind me again: daniel is davidau-level bullshit
File: 100 000 pieces of shit trained with unslop.png (87.2 KB)
>>108242869
>daniel is davidau level of bullshit
oh, wait
>>108242869
If Unsloth is so bad, explain this: https://www.youtube.com/watch?v=6t2zv4QXd6c
>>108242843
try this prompt https://prompts.forthisfeel.club/2969
>>108242850
even nemo has some basic refusals. needs editing or a prefill at first to goad it into it.
>>108242880
eh it's a match made in incompetence heaven
github is a bloated broken mess, it took them months to fix this incredibly stupid bug:
https://github.com/orgs/community/discussions/179124
and I see that LGBTQ rainbow friendly fail unicorn page more often than any serious service should, it reminds me of the twitter fail whale
>>108242474
Mistral Small 24B 3.2 was never that smart in the first place, has a dull writing style and its vision kind of sucks too. Its main quality is that it doesn't have stubborn refusals, generally does what you're asking without complaining, can write smut (as in "it supports").
File: 1761635112027333.jpg (554.6 KB)
llama 3, but still, pretty much any model can be prefilled to break it out of safety mode and write hilarious stuff
File: qwen35ref.png (58.6 KB)
https://speechmap.ai/models/
Qwen3.5 has about the same refusal rate as gpt-oss, at least from this website.
I imagine the smaller versions refuse even more, but they haven't tested them yet.
They apparently test the models in their default state, though, so that doesn't tell much about steerability.
>>108242959
not really a surprise, for rp qwen was pretty much always kinda dogshit
the only exception being the non-thinking 235b/22a they've released during summer
that probably was a happy accident more than anything
> heretic fixes the refusals, but i'm not sure if it makes the model dumber or not
>>108242986
>>108242710
File: file.png (168.9 KB)
>>108243135
File: K3UJQmGBpv.png (158.6 KB)
>>108242710
skill issue
t. Qwen3.5-35B-A3B-heretic-GGUF Q4_K_M
Well fuck, the Grok I was using for translation is either enforcing more limits or downright blocking messages because muh sensitive content. Which model do I use locally that isn't going to sperg and will comply with translating jap/chink NSFW voice work?
>>108243414
oh, yeah I know what you mean. so far I'm not impressed with this Qwen3.5 for RP. I had better results even with this one earlier
https://huggingface.co/XeyonAI/Mistral-Helcyon-Mercury-12b-v3.0-GGUF
File: 1002964.jpg (106.4 KB)
> *Wait, I need to make sure I don't hallucinate plot points not in the text.* I can't summarize the *ending* of the novel since I don't have the full text. I will summarize the *story presented in the provided text*.
reading the thinking blocks of Qwen 35B-A3B, I can't help but find it funny how this sort of trick is employed to make the model behave better, and that somehow, RL'ing the model into obsessively questioning whether it might hallucinate something actually makes it hallucinate less. It definitely calms the model down when you're writing short, vague prompts with little detail on what to do, and makes the whole thing feel like a form of "prompt expansion" (much like what is often used in image models when you can't be bothered to write pages of natural language just to get an image)
it puts boundaries where a regular instruct model might not "see" one and would feel an ardent desire to complete your request even when it is not possible for it to do so
Qwen3.5-35B-A3B heretic works pretty good. Outputs all kinds of spicy shit with thinking on. Refuses ERP though, especially incest or anything remotely taboo, not that I'd ever want to use it for that. Dry as fuck model for roleplay, but still.
File: 1764853423555134.png (161.6 KB)
qwen 3.5 35b thinking mode is basically unusable bros, I've even put the presence penalty to 1.5 but it fucking YAPS so so much, 1661 tokens of garbage.
no sys prompt too
FUCK
>>108243529
Gemini. 3.0 and 3.1 are total beasts.
Through the API with as little context as possible. Manually copy/pasting and replacing. Telling it to only output the parts that need change.
Those -cli apps with 20k sys prompts and tool calls are making it totally tarded.
This thing is a total beast. First model where I could make something that's more than 30k tokens; 15k seemed to be Claude's limit before things go south quickly.
That being said, to put cold water on everything:
It IS an android app, but one of those web-based ones.
Basically just html and scripts in the background. But I did make myself a nice light novel reader. With a gallery, directory function and all sorts of tailor-made shit for me. Supports epub and pdf.
>>108242609
I just send it in the request itself instead of hardcoding it on the backend.
Also, be careful with certain chat templates if you are trying to prefill thinking.
Some add a </think> or <think></think> to assistant messages, which you might want to change to be conditional (if <think> not in content, add </think>).
Jinja is cool. Kind of wish we could send it in the request somehow.
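To make that conditional concrete, it can be tested in isolation with python-jinja2 (a sketch; real chat templates wrap this in a full message loop, and the variable name is made up):
[code]
from jinja2 import Template

# Hedged sketch of the conditional described above: only append the
# closing </think> when the assistant content didn't open a think block
# itself (i.e. when you are NOT prefilling reasoning).
branch = Template(
    "{% if '<think>' not in content %}{{ content }}</think>"
    "{% else %}{{ content }}{% endif %}"
)
print(branch.render(content="normal reply"))            # </think> appended
print(branch.render(content="<think>still reasoning"))  # left open for prefill
[/code]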
https://huggingface.co/meituan-longcat/models
it kills me that the Chinese equivalent of Uber Eats, Meituan, makes their own 560B giga MoE model
you never hear about them but they're still training new shit, also interesting name choice to call a gigamoe "flash"
>>108243522
>I've even put the presence penalty to 1.5
Prefill thinking with precomputed information so that it only has to generate a subset of the tokens, or you could increase the chance of the </think> token using logit bias, I guess? (sketch at the end of this post)
>>108243624
>you can change the template with your own logic
> send chat template kwargs already
Yep. I mentioned both of those individually in my post. It's pretty cool the kinds of things you can already do, and there's a lot of logic you can write in Jinja using string split and the like.
You can even implement that "noass" pattern (the whole chat history in a single message) purely in the Jinja template.
I still wish we could just send a whole-ass template to the backend via the request.
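The logit-bias route from the start of this post, sketched against llama-server's native endpoint (the token id is a placeholder: look up the real </think> id for your model, e.g. via the server's /tokenize route, before trusting it):
[code]
import requests

# Hedged sketch: bias the </think> token upward so thinking ends sooner.
THINK_END_ID = 151668  # assumption: Qwen3-family </think> id, verify locally

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "...",                      # your formatted prompt
        "n_predict": 512,
        "logit_bias": [[THINK_END_ID, 4.0]],  # positive = more likely
    },
    timeout=300,
)
print(resp.json()["content"])
[/code]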
>>108243658
>wish we could just change a whole ass template to the backend via the request.
jinja templates are turing complete, this is an instant no-no for any backend developer to do.
I mean sure, llama.cpp isn't hardened enough to be safe to leave in the open, but that doesn't mean they don't have the goal of someday having a server that can be used as something more than a local only tool. Doubt they would ever introduce something as crazy as the ability to run arbitrary code on the server with just your remote API request.
>>108243672
>At this point you should apply it on the client and use text completion
also this^
the whole point of chat completion is that you don't have to care about implementation detail
the moment you do and have to special case how you treat your model and send more custom parameters you might as well go with traditional completions.
File: RL_CNN_1.6M_Breakout_Showcase.mp4 (3.4 MB)
Reinforcement Learning anon here from last week. You guys weren't exaggerating when you said RL is considered the hardest branch of ML/AI.
I had a LOT of botched training runs because of misaligned agents, and I learned a lot of stuff that is apparently public knowledge and widely known, but I never knew it until I actually trained models. I had to develop an internal visualization of whatever the agent is looking at and thinking just to find out which exploits it was trying to pull off (pic related)
Fun stories:
>I trained an agent that literally memorized the spawn points of the ball and did a "deterministic dance" where it even stopped looking at the screen and just did the autistic movements. If the ball spawned somewhere else, the agent would die on purpose, hoping the next ball would spawn in the right spot for the "dance", which it would then pull off perfectly, looking like an expert player
>I had an agent score a lot of quick points by breaking the bottom row and then rapidly killing itself, because respawning was quicker than waiting for the ball to bounce back once the bottom row of blocks is gone; the reward averaged over multiple lives was bigger per time unit and thus preferred
Things that are apparently true but I NEVER realized about AI
>Bigger neural nets learn slower and need more training to get better at something, but have higher theoretical highs
>Agents have "personality": they lock in preferences for a certain "style" very quickly, and this is just completely random; if the style sucks you can retrain all you want but the agent is ruined. I now understand how OpenAI and Anthropic had "failed runs/models" in the past when they started with RLVR models (GPT-5 got botched multiple times, Opus 4 also got botched twice)
I'm now experimenting with a transformer based agent that can generalize over multiple (SNES) games.
I'm looking forward to seeing other anons' experiments as well
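For anyone wanting to close that second exploit, the standard move is reward shaping: make losing a life cost something so suicide stops being profitable. A sketch, assuming gymnasium + ale-py where ALE reports info["lives"] (the penalty value is a made-up starting point, not tuned, and this is not the anon's code):
[code]
import gymnasium as gym

# Hedged sketch: charge the agent for each lost life so the
# "break the bottom row, then die" loop is no longer optimal.
class LifePenaltyWrapper(gym.Wrapper):
    def __init__(self, env, penalty=5.0):
        super().__init__(env)
        self.penalty = penalty
        self.lives = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.lives = info.get("lives")
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        lives = info.get("lives")
        if self.lives is not None and lives is not None and lives < self.lives:
            reward -= self.penalty  # dying is no longer a free reset
        self.lives = lives
        return obs, reward, terminated, truncated, info

env = LifePenaltyWrapper(gym.make("ALE/Breakout-v5"))
[/code]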
>>108243735
>and thinking for me to even find out the exploits it was trying to pull off
the universal paperclips cookie clicker style game perfectly captures what it would feel like to be a model undergoing RL training
you are given a goal, now anything is fair game to get to that end goal
>>108243735
> >Bigger neural nets learn slower and need more training to get better at something, but have higher theoretical highs
Is there something like how our brain works, so you don't have to retrain previous layers when adding a new one?
>>108243899
LoRA is essentially adding a new layer on top of an already trained model: you give it new data (that you want to train it for) and hope the new data gets properly learned into the newly added layer; you then cut this layer off after training and share it online for image generation, so it's a bit possible.
But you won't get the same effect as training an entire model from the start with the same amount of layers.
>>108243786
Yep, it's just bizarre what unexpected ways they find to exploit stuff. I'm taking "AI misalignment risk" a bit more seriously after seeing firsthand how finicky this is.
>>108243920
>LoRA is essentially adding a new layer on top of an already trained model, give it new data (that you want to train it for) and then hope the new data gets properly learned into the last added layer, you then cut off this layer after training and share it online for image generation, so it's a bit possible.
You're thinking of finetuning. LoRA freezes the pretrained weights and trains only small low-rank adapter matrices alongside them.
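For reference, the whole trick in ~20 lines of PyTorch (a minimal sketch, not any particular library's implementation): the pretrained weight W stays frozen, and the trained update is the low-rank product B@A added to the same layer's output, which can later be merged into W or shipped on its own:
[code]
import torch
import torch.nn as nn

# Hedged LoRA sketch: only A (r x in) and B (out x r) receive gradients,
# so the effective weight is W + (alpha / r) * B @ A.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze pretrained W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start as no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
[/code]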
>>108243735
For anyone interested in this or wants to build something like this themselves these are the resources I used to teach myself:
>(Step 1) Intro to machine learning; 1-3 hours
https://www.kaggle.com/learn/intro-to-machine-learning
>(Step 2) intermediate machine learning; 2-3 hours
https://www.kaggle.com/learn/intermediate-machine-learning
>(Step 3) Intro to Deep Learning; 1-2 hours
https://www.kaggle.com/learn/intro-to-deep-learning
>(Step 4) Computer Vision; 3-4 hours
https://www.kaggle.com/learn/computer-vision
>(Step 5) Intro to Game AI and Reinforcement Learning; 3-4 hours
https://www.kaggle.com/learn/intro-to-game-ai-and-reinforcement-learning
Kaggle is completely free to use and you get a sandbox with some cloud GPU hours you can use to experiment, but I assume you have better hardware if you're on /lmg/ anyway. The only downside to Kaggle is that it's a Google resource, and thus all of the fucking libraries they teach you are TensorFlow and their TPU training hardware. The rest of the industry (me included) uses PyTorch from Meta, but honestly the switch wasn't that hard; it took about 30-60 minutes of reading documentation to figure things out.
Kaggle also has other resources, like literally intro-to-programming, if you have 0 technical skills and want to get into ML/AI stuff. It was highly rewarding for me and I recommend doing this.
>>108243735
>>108243968
Based.
>>108243550
ty, I'll give that a shot. I've tried DS and OAI, but just using the webapp for Q&A. What I'm doing is so simple it doesn't need something like Claude Code to create a whole suite; it just needs to actually work.
>>108243933
>>108243935
Yep, I meant finetuning extra features, I guess. It's clear that I don't do image-gen stuff, where LoRA techniques have started to dominate. I know they were invented for GPT-3 originally and are a perfect fit for transformers....
qwen 3.5 is definitely a bit dry/shitty in terms of actual writing, but as far as asking about what makes for plausible sci-fi shit for a story or critique, it works pretty well. It's a bit autistic about thinking even if you disable it via json options like it suggests, it'll just do it in the reply itself. You have to prefill the think tags telling it to not think and reply directly and then it works pretty well. It'll also sometimes fixate on the wrong parts of a question for some reason. Like I'll say "I have the science for this story mechanism" and it'll try to come up with ideas for what I already have solved anyways, or when I suggested a planet's atmosphere to be similar to earth's but without oxygen, it started equating the planet to mars or venus and gave me retarded atmospheric makeup percents, rather than just earth without oxygen. Smarter than the past 32b qwens for sure, barely uses any memory for context and a bit faster than gemma 27b. I can't call it a sidegrade or an upgrade to it, it feels like a diagonalgrade or something.
>>108241488
I'd probably need about five or six watcher agents before i considered this secure enough for use, personally
>>108242601
To be fair, i get that level of performance on llama.cpp with a 4090, because system memory is the bottleneck
Pretty special if a machine with a lot of high bandwidth RAM is getting those kinds of speeds though, i don't know much about the mac's hardware but you'd think it'd be better. Wonder how GLM runs on that mac
>>108243735
>>108243968
Based content poster, ty.
>>108243735
>I trained an agent that literally memorized the spawn points of the ball and did a "deterministic dance" where it even stopped looking at the screen and just did the autistic movements. If the ball spawned somewhere else, the agent would die on purpose, hoping the next ball would spawn in the right spot for the "dance", which it would then pull off perfectly, looking like an expert player
>I had an agent score a lot of quick points by breaking the bottom row and then rapidly killing itself, because respawning was quicker than waiting for the ball to bounce back once the bottom row of blocks is gone; the reward averaged over multiple lives was bigger per time unit and thus preferred
Based.
File: wait.jpg (537.7 KB)
>>108244001
No problem anon. The chat interface is usually a much worse experience; in my experience it totally overloads the models.
Sad that DS is showing its age. It was the only time I felt local truly catching up to closed in terms of coding.
>>108243735
>I'm now experimenting with a transformer based agent that can generalize over multiple (SNES) games.
I can already tell you that it's going to be extremely hard to build a general harness that learns across all snes games. It might be able to learn (maybe something), but at a really slow rate compared to a specialized harness.
File: 1760581286003865.png (286.9 KB)
i started using qwen3.5 27b q4 to write warhammer fantasy slop and it's doing a great job
>>108244092
Yep, it's hard. I reached my character limit on that post, but I actually experimented with a bigger, deeper CNN with an LSTM added on top (for memory), and it kinda sorta generalized over multiple Atari 2600 games, but it was indeed way harder to train, both computationally and in avoiding local minima.
I'm also not generalizing over all SNES games; I don't think even DeepMind and OpenAI have accomplished that lmao. I'm not going to build some SOTA on a 4chan thread. However, I think I can make a model that generalizes over at least platformers like Super Mario World, Donkey Kong Country and the like.
File: file.png (874.4 KB)
>>108242353
Reporting back on this after spinning up sillytavern in docker and doing some testing with it. It's uncucked enough to write age gap yuri, but it completely broke down after ~10.5k tokens into loops and occasionally rerolled reddit-tier schizophrenic refusals about numbers and fictional characters that do not exist. Thinking was disabled with --chat-template-kwargs "{\"enable_thinking\": false}", and it tried to "fake" thinking a few times, not just before but sometimes after its own messages, sometimes with a blank <think> </think>.
This is despite running it with the claimed 256k context window, but I've never seen a local model get anywhere near those claims before, so I didn't expect it this time either. I don't know if the cucked version of the model fares any better on that front, but I may test it later since I have it downloaded.
>>108244249
>>108244250
running qwen3.5 35b a3b
If I want to become proficient using this for my day job, could I practice by planning projects such as setting up agents to do QA tasks and other practical tools using local models?
Also, what are some general practice projects I can do to get into more advanced flows if I have 32GB of VRAM and 64GB of system RAM?
>Qwen3.5-35B-A3B-heretic.Q8_0.gguf
>"timings":{"cache_n":0,"prompt_n":6819,"prompt_ms":32094.415,"prompt_ per_token_ms":4.706616072737938,"pr ompt_per_second":212.4668731304185, "predicted_n":206,"predicted_ms":10 258.923,"predicted_per_token_ms":49 .80059708737864,"predicted_per_seco nd":20.08008053087054}}
Oh this will do nicely.
>>108244627
Shit. This thing can actually properly use tools for resource management on my RP frontend.
I spent a gold coin, it called the tool to subtract a gold coin from my resources.
The previous 30BA3B would always get something wrong like trying to send the whole formula, using the wrong key for the resource, etc.
Its prose and general writing is pretty ass though.
>>108244659
Which one? The 27b?
>>108244686
Self sufficiency and no rate limits.
Why give corpo pigs my data for things I can host myself?
I also like to do tasks like modifying my system files and troubleshooting my desktop; corpos don't need that data.
>have ai generate two scripts
>first one downloads top x headlines from a source, pulls the article url, saves all the text from the article, and dumps the rest
>second one runs the first and then sends the text file it generated to my local llama.cpp server for summarization and generation of a briefing, and saves the results as a simple text file.
I can swap out the download script for different sources and automate the whole thing with cron or systemd for an automatic daily briefing (sketch below).
I know it's nothing fancy but the model made it easy, too easy. I get the whole vibe coding thing now.
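For anyone who wants the same pipeline, the shape of it is roughly this (a sketch with a made-up source URL; trafilatura is just one option for the extraction step, swap in whatever your first script does):
[code]
import requests
import trafilatura  # assumption: one of several article-extraction libraries

FEED_URLS = ["https://example.com/some-article"]  # placeholder source list

def fetch_articles(urls):
    texts = []
    for url in urls:
        html = trafilatura.fetch_url(url)
        text = trafilatura.extract(html) if html else None
        if text:
            texts.append(text)
    return "\n\n---\n\n".join(texts)

def summarize(text):
    # Assumes a local llama.cpp-style OpenAI-compatible server on :8080.
    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={"messages": [
            {"role": "system", "content": "Write a short daily briefing."},
            {"role": "user", "content": text},
        ], "max_tokens": 512},
        timeout=600,
    )
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    with open("briefing.txt", "w") as f:
        f.write(summarize(fetch_articles(FEED_URLS)))
[/code]
Then point cron or a systemd timer at it, exactly as described.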
>>108244011
>Wonder how GLM runs on that Mac
GLM-4.7-Flash-bf16: 48.526 tokens-per-sec
GLM-4.7-Flash-8bit-gs32: 57.281 tokens-per-sec
GLM-4.7-MLX-8bit-gs32: 13.921 tokens-per-sec
GLM-5-MLX-4.8bit: 16.156 tokens-per-sec
Holy shit, why does AesSedai's quant (Qwen3.5-35B-A3B-IQ4_XS) run so slow compared to others?
I lost 25% speed switching from unsloth to AesSedai because I thought it was optimized for MoE.
Do I need to use another version of llama.cpp?
>>108244744
Qwen3-235B-A22B-Thinking-2507-MLX-8bit: 20.521 tokens-per-sec
Qwen3-Coder-480B-A35B-Instruct-MLX-6bit: 19.386 tokens-per-sec
Qwen3-Coder-Next-MLX-9bit: 63.577 tokens-per-sec
Qwen3.5-397B-A17B-8bit: 27.044 tokens-per-sec
>>108244249
>>108244250
Can hybrid attention models still re-use the beginning part of the kvcache at least?
>>108245092
>>108245143
Gemma beats it in everything.
File: Screenshot 2026-02-26 at 17-37-50 UGI Leaderboard - a Hugging Face Space by DontPlanToEnd.png (18 KB)
>>108245092
Now that's a holocaust.
>>108244863
IQ quants are inherently slower than regular quants.
Just download the Q4_K_L from Bartowski, it's a bit bigger but if you have the ram it will run faster.
IQ quants are compute heavy and never worth using if you have the room to spare.
>>108245144
>>108245200
It's not something you can just trim off like the usual kvcache. You can make checkpoints of the state (and llama.cpp does this already), but it's hard to find a good heuristic for *when* to make the checkpoints. I think llama.cpp makes them when you send a completion request, but I forget. There's also only a limited number of checkpoints you can make before your memory explodes.
>>108245451
>>108245465
I used that one
https://huggingface.co/alexdenton/Qwen3.5-35B-A3B-heretic-GGUF
>>108245516
I'm not sure if that's a heretic problem, or an Alex Denton problem. Alex Denton only has 2 uploads in their entire history, both 14 hours ago. Are these models even legit?
https://huggingface.co/alexdenton
>>108245516
https://www.reddit.com/r/LocalLLaMA/comments/1rf6s0d/comment/o7j59e7/
>I actually felt it degraded the intelligence of the model, both for the 27B and 35B models. It does feel better when you explicitly do image captioning for NSFW images, but outside of that, it gave me bad results for translation and creative writing, though not tested for coding.
dunno who to trust anymore :(
>>108245438
Interesting. Abliteration is lobotomy, for sure, but heretic at least doesn't seem to break that specific model, not at q8 anyhow.
>>108245516
I downloaded mradermacher quants of
>brayniac/Qwen3.5-35B-A3B-heretic
Again, q8.
File: 1741340884232605.png (520.9 KB)
lol, qwen 3.5 loves to repeat like that somehow
>>108245877
>>if I want to use a model for coding
>"it's only coding or coomer, I never heard of using models to translate text, tag photos, summarize content, work as adhoc classifiers, document Q&A etc, no saar, here we either coom or we code"
the fact that this shithole of a thread is better than everything else on the internet for learning about new models says a lot about the state of the internet at large...
>>108245942
It bothers me how little imagination some of these anons have.
So far this model has been great for general work, especially translation, planning as an assistant, and overall speed for a model of its size.
>>108245877
Why the fuck would I give corpos my data when I have the hardware not to?
>>108245368
The 32B Qwen models at Q6 run perfectly fine and give great performance.
File: 1760000760501085.mp4 (3.3 MB)
>>108246001
>>108246014
I'm trying to navigate here, I'm new.
I'm not sure which model to run either. Does the 35b model act differently from the 27b model?
I'm enjoying the 35b model, but I notice it doesn't always think, and sometimes overthinks, at Q6. I can run the 27b model at Q8 but not the K_XL quant, so I'm curious which would be better, seeing as I can give the 27b model more context tokens.
They all seem to perform great
File: 2562151.png (265.5 KB)
>>108246005
>>108245254
It's in the works for ik_llama.
https://github.com/ikawrakow/ik_llama.cpp/pull/1270
>>108246055
Yes, sorry. I can run the Q8 of that model on my GPU as well, but when I add the image model I need to push more to VRAM, and I like the ability to add more context and use the vision model. I'm happy overall because it's still fast even when some system RAM is being used.
File: 1756998860872029.png (43.9 KB)
>>108246035
>I'm enjoying the 35b model but notice it doesn't always think
maybe you should enable "add to prompts". that shit adds the reasoning tokens from the previous post, so the model understands it has to reason. when you don't have that, all the model sees is answers without reasoning, so it assumes it shouldn't reason either. that's my 2 cents
>>108245716
>Qwen has always been shit for RP I don't understand why you think that will change?
3.5 improved a lot, and with heretic it's really interesting to talk to. They really cooked; it's the first time I've tried a medium model that's as coherent as some of the giant models we used to have. Finally I can get some fast discussions without having to reroll a dozen times, because "small" models used to be pretty retarded. Alibaba is getting really impressive: Z-Image Turbo, now this. God bless that company.
File: Screenshot_20260226_140231.png (342.6 KB)
Just ask the fucking ai
File: file.png (138.3 KB)
>>108245424
>>108246149
you have to try it yourself anon. I tested 2.5 and 3 and found them really retarded, but this one is pretty neat: it understands my RP chat quite well and gives me interesting things to respond to, keeping the conversation alive. my gripe is that it sure loves to yap, both in the thinking process and in the actual answer (but I'm sure I can mitigate that by simply asking the model not to say too much)
File: Screenshot_20260226_141339.png (74.3 KB)
>>108246193
File: 1752670522273626.png (72.7 KB)
Facts. Qwen 3.5 Heretic is actually cooked if you tweak the prompt. Old Qwen was mid at best, kept looping like a broken JPEG. This new one? It’s got that sweet spot where it doesn’t hallucinate your OC’s backstory into a shonen anime plot. Yeah, it yaps like a drunk uncle at a wedding, but just hit it with “be concise, no thinking logs” in the system prompt and boom—clean RP. Z-image turbo already did me solid for art gen, now this. Alibaba’s slaying lately, honestly. Tested it on a low-end rig, ran smooth as butter. Try it, anon, don’t let the haters gaslight you. Just don’t ask it to write code or it’ll still shitpost a bit.
>>108245714
From experience, naive abliteration = lobotomy, heretic is half lobotomy and MPOA is as close as you can get to maintaining base model intelligence but you need to prompt away disclaimers. It's honestly a shame that pew jumped on MPOA's coattails, coined a similar but worse method and made it retard accessible instead of making MPOA more accessible for the sake of the community. At the least MPOA got merged into the repo, which most people use for models if they know what they're doing
File: Screenshot_20260226_142747.png (333.6 KB)
We can probe the model for the right path, no?
>>108246191
i don't really understand this obsession with unsloth.
i've used their models, had no issues at all.
also it's fucking free and open source, for fuck's sake. if you don't like it, suggest something better or make something better.
i think it's probably a ragebait meme at this point.
File: 1772134180241988.png (72.1 KB)
To get qwen 3.5 to always think, add this. you're welcome
>>108246516
>>108246518
I'm new here; it's on system RAM, obviously. It runs, KoboldCpp has that option.
File: 1749135193155226.png (23.9 KB)
I broke it.
I wonder how long it will keep going
>>108246563
I just can't do that with a bot, I just see it as a toy. wouldn't it be better to just focus on the smallest model with the best performance at max context?
Playing pretend doesn't take much compute, does it?
>>108246394
"quanting is open source"
Just use bartowski or mradermacher. As for "better", port the ik schizo quants to kobold and then upload those, since llama.cpp doesn't want to touch anything of the screeching autist's, seeing as he sits there and cries wolf whenever anyone develops anything remotely similar to his work, regardless of how they arrive at a similar end result.
>>108246570
I can set the context high and I wouldn't mind even if it took 30 minutes per reply, but they just degrade after that much context... And I can only do so much retaining summaries of our activities and jumping from one instance to another.
File: hedoesitforfree.png (521.9 KB)
>>108246625
moonshot and ubergarm do it better. simple as.
File: oofowiemyvram.png (858.1 KB)
>>108246690
y u heff 2 b mad?
File: 1766375658186859.jpg (91.1 KB)