Thread #108273339
File: 1744444287656136.jpg (974.6 KB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108268616
►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
File: disruption.png (31.3 KB)
File: 1712542.png (159.9 KB)
>THE GOVERNMENT IS WATCHING ME
>ITS UNCENSORED IF YOU USE THIS 10 PAGE JAILBREAK THAT ONLY WORKS 20% OF THE TIME
>I BOUGHT THESE 3090S SO I WILL USE THEM
>ITS THE SHILLS ITS ALWAYS THE DAMN SHILLS
>OMFG IT TOLD ME TO TURN THE CUP UPSIDE DOWN. AGI IS HERE
>THIS IS THE NEW DAILY DRIVER (FOR 2 WEEKS UNTIL I REALIZE HOW SHITTY IT IS)
>IM A GROOMERTROON LOOK HAHAHA IM PROMPTING LITTLE GIRLS LOOK WHAT IM DOING GUYS
>JUST BECAUSE ITS QUANTIZED DOESNT MEAN ITS DUMB. WE ONLY USE 10% OF OUR BRAIN ANYWAY
>TRUST ME THE GUMBOJUMBO_Q4_GATEBROKEN_A32 GGUF IS PEAK FOR ROLEPLAY
>THE MODELS MAY BE RETARDED BUT SO AM I
>DO YOU THINK ITS POSSIBLE TO ACCELERATE MY BRAIN WITH LLAMA-2 MICROCHIP???
>CAN SOMEONE REUPLOAD THIS WITH A GPTMI_3_COMANCHE LICENSE??? STALLMAN WILLS IT
>>108273403
What's the point of local LLMs? Reading discussions surrounding them feels like peering back in time through a looking glass
>OMFG it passes the poopyscoopy logic test from 2023!
>Wow, this 100-line boilerplate javascript code is almost perfect!
>I got it to jestfully say nigger! holy crap it's so uncensored!!!
>This is the new daily driver (for 2 weeks until i realize it's complete slop)
The rest of us are writing multi-thousand line professional software with Codex/Claude. Meanwhile your models are trained on so much scraped synthetic GPTslop that they can't even get the year right. Genuinely, what the fuck is the point of local LLMs? They're more censored than API, they're dumber than API, the cost to set up a decent one is higher than API, they're slower than API, there is no lora/finetuning scene unlike local image, the tooling is worse than API, and the experience overall is just outdated in 2026.
It's like you're stuck somewhere in-between the luddites who hate AI and the pioneers who embrace it. You realize AI is the future but can't cope with the fact that the technology itself benefits heavily from API-centralization and that local hardware is unable to adequately handle increasingly large models. You boarded the boat to paradise island but decided to jump overboard halfway there because the captain wouldn't hand you the controls.
File: 1767466346558493.jpg (172 KB)
►Recent Highlights from the Previous Thread: >>108268616
--Budget GPU upgrade options for better model performance:
>108270975 >108271008 >108271029 >108271088 >108271009 >108271035 >108271169 >108271179 >108271232 >108271212 >108271234 >108271243 >108271261 >108271330 >108271240 >108271022 >108271064 >108271114 >108271170 >108271037
--Budget 4x3060 AI rig build and riser discussions:
>108271593 >108271611 >108271631 >108272320 >108272867 >108272890 >108271702 >108271848 >108271858 >108271885 >108271899 >108271924 >108272008
--Mac Studio vs custom PC for large model inference:
>108271281 >108271291 >108271303 >108271327 >108272592 >108271294 >108271339 >108271312 >108271317
--Qwen 3.5 small model releases and potential applications:
>108271025 >108271045 >108271156 >108271194 >108271217 >108271238 >108271051 >108271440
--Unsloth template year limitation causing llama.cpp server failures:
>108272475 >108272499 >108272512 >108272524 >108272539 >108272558 >108272578 >108272583 >108272534 >108272548 >108272553 >108272600 >108272555 >108272576 >108272618 >108272629 >108272634 >108272663 >108272674 >108272678 >108272736 >108272759 >108272832 >108272837 >108272828 >108272606
--Experimenting with AI-generated podcasts using TTS:
>108270634 >108270679 >108270714 >108270724 >108270748 >108270830
--Workarounds for LLM-based VTuber video tagging:
>108270269 >108270293 >108270414 >108270426
--Disabling model thinking via chat template kwargs:
>108269309 >108269444 >108269471 >108269484
--Comparing lightweight models for news summarization on low-VRAM hardware:
>108270249 >108270324 >108270487 >108272221 >108272330
--Update to 35c4bc · deepseek-ai/DeepGEMM@1576e95:
>108270056
--Local AI coding struggles with VRAM and context rot:
>108271879
--Miku (free space):
>108268674 >108269106 >108269279 >108269325 >108270249 >108270634 >108272201
►Recent Highlight Posts from the Previous Thread: >>108268684
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
File: 1749045380442501.png (823 KB)
>>108273403
>ITS UNCENSORED IF YOU USE THIS 10 PAGE JAILBREAK THAT ONLY WORKS 20% OF THE TIME
this was the wildest shit when i found out that uncensored didn't actually mean uncensored at all
it still feels like an elaborate joke, people actually say shit like "i run local models so i can do uncensored shit unlike api haha" and it's still just running jailbreak prompts as if you were using it from a cloud provider. muh privacy and freedom but you're still censorcucked. I'm serious, it's almost unbelievable to me.
File: 1.png (9.9 KB)
>>108273784
>>108273822
>I got it to jestfully say nigger! holy crap it's so uncensored!!!
>>108266446
>>108273851
Are you using the 35b? Because I haven't noticed that on the 27b at Q5.
>>108273840
At very little cost. It retains most intelligence at 27b, from what I've seen.
>>108273927
Q4 of the non-heretic version of the 35 is also shit, prone to basic logic errors and grammar mistakes. So, I don't think that's a heretic problem. Q5 of the 35b is a little better in that regard, but still makes dumb mistakes off and on.
The 35b MoE is just way worse than the 27b dense. The only thing it wins at is speed, *IF* both models think. The 27b without thinking is better than the 35b with thinking, though. So it even loses in speed if you're thinking with it.
>>108274240
Is there a fork that doesn't have endless shitcode?
>>108274242
The whole point of the original llama.cpp is that it was fucking simple and easy to understand.
Current llama.cpp seems to have more fucking code than torch.
File: 1766616667271700.png (28.4 KB)
>>108273822
My qwen can't be this schizo
File: 1750838286118038.png (16.8 KB)
>>108274292
I did it!
File: 1768570098362082.png (48.1 KB)
>>108274292
Interesting. It doesn't refuse to summarize. The only difference was --repeat-penalty 1.1 (before it was 1.0).
Gonna test more in case it's just a seed factor.
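Easiest way to rule the seed in or out (flag names are stock llama.cpp; the model/prompt filenames here are made up): pin the seed and flip only the penalty, e.g.
llama-cli -m qwen3.5-27b-q5_k_m.gguf -f prompt.txt --seed 42 --repeat-penalty 1.0
llama-cli -m qwen3.5-27b-q5_k_m.gguf -f prompt.txt --seed 42 --repeat-penalty 1.1
If the refusal tracks the penalty while the seed stays pinned, it's the sampler and not luck.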
File: 7jibs40k6jmg1.jpg (65.5 KB)
kek
File: kboom.png (61.3 KB)
>>108274358
File: 1766797846333800.png (19 KB)
>>108274373
^
CIA-paid false flagger
File: absolute.png (128 KB)
>>108274358
https://health.aws.amazon.com/health/status
>>108274299
>>108274353
I was having the same issues with it being retarded and going into schizoloops too, until I put in those settings I got off the hf page for the 35b. It's like some esoteric magic code to make the thing work, because it's fucked with the defaults, but it's been pretty consistent with these.
>>108274679
try >>108274292 ?
I come from an average working-class background, not too poor; I had a normal childhood in a third-world country. I used to ponder how wealthy people have all sorts of connections, butlers, assistants, maids, whoever, to help them do all sorts of things. They just need to focus on the thing they love.
Now, thanks to local models, I kind of feel the same. I just focus on the things I like and leave the rest of the details for the minions to take care of. It feels like a game changer. I think we will hit a tipping point if local models ever reach Opus-level analytical skills.
>>108274826 soul vs soulless >>108274825
>>108275019
there were no issues with his quants, the reason he's updating is this:
https://github.com/ggml-org/llama.cpp/pull/19139
someone mentioned it to him and begged him to remake his quants for that improved prompt processing speed.
it's a new feature, not a bug fix
File: 1759980971867709.png (212.4 KB)
>>108275196
>>108275095
That's basically this, right?
https://github.com/ikawrakow/ik_llama.cpp/pull/1137
But baked into the quants instead of activated at runtime?
>>108275258
Would ikawrakow have worked on fused tensors without am17an's work in mainline and the associated noise on social channels?
Would ikawrakow really have discovered the way of fusing tensors without having this simple and easy-to-follow logic in mainline llama.cpp?
File: 1350594293765.jpg (109.4 KB)
>>108273339
how much ram is needed for the a17b qwen 3.5 model?
VRAMlet (8gb) ERP review (all models q4):
Gemma 3 27B:
Still by far the most clever model for its size I've used; it rarely makes any physics mistakes and contextually understands most things without needing to over-explain (I've found medgemma to be slightly better at coom; increased anatomical knowledge and willingness to say synonyms for penis, vagina, anus, etc. seems to help). Unfortunately, it's the worst at prose: if you don't rigorously reinforce a desired writing style it slowly devolves. Writing like this. Sentence lengths cut. Very short.
Qwen 3.5 35B A3B:
Fast generation, alright prose, but frequently makes physics mistakes and struggles with contextual understanding (although, for a MoE, better than any others I can remember). Also security-policy slopped to hell; needs constant babysitting to generate ERP if you let it think
Cydonia/Magidonia 24B v4.3:
Somewhere in between the previous two: better prose than Gemma 3, but at the trade-off of being less clever and more prone to mistakes, and smarter than Qwen while not nearly as guardrailed (but slower)
Personally, I lean more towards Cydonia/Magidonia, I think, with Gemma 3 a close second. It's really a matter of what sort of babysitting you want to do, and it tends to be easier to fix physics mistakes than to fix poor writing style, but that's probably down to my personal preference. I tend to write pretty good character sheets and openings, so it just sucks to watch Gemma slowly degrade as the context increases and my original writing gets more and more diluted.
File: output tokens be crazy.png (76.3 KB)
35BA3B is just crazy in the aspects it's good at, which are not many tbf (wouldn't use it for code). I used to prompt much smaller chunks to translate novels because local models are terrible at handling a lot of stuff at once, but this approach is totally obsolete with qwen. Chunking will still be valuable for now to automate an entire book worth of translation but the chunks will certainly have to be set to much bigger sizes after some experimenting.
>19,209 output tokens, 41086 tokens total with input
>from a decent skim, doesn't seem to have issues
I kneel. Don't have the time to do lengthier tests today, but now I am extremely curious how many tokens the true hard limit is before the model loses translation coherence in a one-shot, output-everything-at-once request.
For now, if anything, the quality is better, not worse, than chunking in 50 or 100 lines; it makes fewer mistakes on things like proper names with this feed of 676 lines. This is the opposite behavior compared to other LLMs I can run on this computer; doing this breaks them.
Damn, people constantly whine that local is never improving, but here we have a model that can one-shot this much without losing its shit and runs on a laptop at 34t/s. It feels like black magic that one-shotting this much works. I did it for the lulz expecting it to break; the txt used in the chat ui was one of my many summarizer test txts...
>>108275547
Believe.
File: 1750796454794054.png (17.1 KB)
File: 1769300993833819.png (4.3 KB)
>>108275735
:,)
>>108275778
>>108275780
Presumably, how else would he run models larger than his vram?
>>108275788
Processing Prompt (2352 / 2352 tokens)
Generating (235 / 2048 tokens)
(EOS token triggered! ID:2)
[11:41:22] CtxLimit:2587/8192, Amt:235/2048, Init:0.10s, Process:129.55s (18.16T/s), Generate:56.50s (4.16T/s), Total:186.05s
Q4 nemo on my machine.
>>108275802
Get a 1080ti or a 3060 and enjoy 35 t/s >>108272867
File: 1759986569430903.jpg (947.5 KB)
I believe this will be the last update and addition to my news download and summarization script.
I finally found an application that converts the plain text into something beautiful, pandoc, as long as the model doesn't fuck up the markup.
A quick modification of the script and now it takes the final news summary, which is just a text file, and feeds it into pandoc to construct a pdf before printing.
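For reference, the conversion step is a single command (filenames made up; pandoc needs a LaTeX engine installed for pdf output), with lpr handing the result to the printer:
pandoc briefing.txt -f markdown -o briefing.pdf && lpr briefing.pdf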
>>108275802
>>108275806 (Me)
I tested the q6 qwen3.5 27b I have downloaded with -ngl 0 and get:
prompt eval time = 2661.88 ms / 13 tokens ( 204.76 ms per token, 4.88 tokens per second)
eval time = 23197.87 ms / 35 tokens ( 662.80 ms per token, 1.51 tokens per second)
total time = 25859.74 ms / 48 tokens
so maybe that Anon was waiting 10 minutes..
File: 1772275547196033.png (20.2 KB)
:|
File: 1760299059692860.jpg (945.4 KB)
>>108275858
I suppose the scripts I just finished qualify.
What the scripts do is make use of RSS to select a group of news articles; they then download the articles, strip away everything but text, and feed the text into a local model with a prompt telling it to summarize them and create a briefing.
Once the llm generates the response it saves that as text, converts the text to pdf, and then prints out the pdf.
If one were so inclined you could even set the master script to run automatically and you would have your own news briefing waiting for you when you wake up. The whole flow boils down to something like the sketch below.
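(A minimal sketch of the flow, not my actual script; the feed URL, server endpoint, and filenames are placeholders, and it assumes llama-server is up with its OpenAI-style API.)
import subprocess
import feedparser
import requests
from bs4 import BeautifulSoup

FEED = "https://example.com/news/rss"                # whatever feed you follow
LLAMA = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server endpoint

# grab the first few articles and strip away everything but text
articles = []
for entry in feedparser.parse(FEED).entries[:5]:
    html = requests.get(entry.link, timeout=30).text
    articles.append(BeautifulSoup(html, "html.parser").get_text(" ", strip=True))

# one prompt, one briefing
prompt = "Summarize these articles into a morning briefing:\n\n" + "\n\n---\n\n".join(articles)
r = requests.post(LLAMA, json={"messages": [{"role": "user", "content": prompt}]}, timeout=600)
summary = r.json()["choices"][0]["message"]["content"]

# text -> pdf -> printer
with open("briefing.md", "w") as f:
    f.write(summary)
subprocess.run(["pandoc", "briefing.md", "-o", "briefing.pdf"], check=True)
subprocess.run(["lpr", "briefing.pdf"], check=True)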
To be honest it was fun to do and I want to do something new but I am sadly out of ideas.
>>108275750
8gb vram + 64gb DDR4 on the machine it's running on
I got pretty poor performance running it on windows (closer to 1.5t/s), but after moving it to linux it gets almost 2, which isn't ideal, but this is the vramlet life
Models that actually fit in 8gb are still just too stupid for my tastes
File: Screenshot_20250502-154106.monocles chat_1.png (300.3 KB)
>>108275870
I made an XMPP chatbot system. I used to post in these /lmg/ threads but lost interest. Really want to make updates to the XMPP chatbot and add a few features but i really don't wanna code them myself. Claude is really good at it.
>>108275889
I had a very similar idea to yours, but instead of reading news it would start from a seed prompt, operate a selenium-based browser, search stuff about it on its own, gather info, and dive deep into rabbit holes that i never explicitly told it to explore. Really should get to it some day, could be very cool
>>108275918
at the moment i have to manually run it if i want the summary.
The only machine i have on 24/7 doesn't have a GPU to run a model. My next project is to see if I can get llama.cpp running on my FreeBSD NAS and if I can get a small model like IBM's granite to run on the CPU and have it run the script.
if i can get that to work then yes i will have it print out automatically every morning
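at that point scheduling it would just be a cron entry (path made up):
0 6 * * * /home/anon/news_briefing.sh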
>>108275923
>it would start from a seed prompt, operate a selenium based browser and search stuff about it on its own and gather info, and dive deep into rabbit holes
that sounds cool and you should give it a shot. i did the whole RSS thing because it was easy and the articles are basically curated for you but having the model search on its own would be exciting
>>108276000
I wrote the system myself
I can have multiple chatbots; they can generate their own personalities, likes, dislikes, and appearance (which is then used to generate a profile picture using sd). The chatbots can randomly message me about random topics if they feel like it (it's RNG basically, but the topic to talk about is also generated by the llm)
>>108276029
While a chatbot is great, what is really needed is a 4chan simulator. That way, when I am old and the powers that be have destroyed the internet, i can fire up all my old models and pretend to talk with my friends on 4chan again.
I bet you could even get it to scrape a site like twitter or something to inject screenshots to spur conversation.
>>108275590
>Qwen 3.5 35B A3B:
>>108275590
>all models q4
Oof.
>>108275204
https://www.reddit.com/r/LocalLLaMA/comments/1rhx5pc/reverse_engineered_apple_neural_engineane_to/
>>108276043
The 4chan comment "style" can be replicated but the question is why would you want that? I come here to talk to real anons here
>>108276046
What don't you believe?
>>108276133
That's the most unbelievable part?
Yeah i wrote it myself over a few weeks, no LLM was ever used, mainly because back then i didn't trust LLMs to do a good job.
I trust them more now, but still not enough to write register level code for MCUs
>>108276143
holy fuck shit is depressing
>be me
>fish shell simp
>deployed Grand Master AI Env script (Llama.cpp + Qwen)
>local inference, no API tax, no telemetry
>features are actually useful for once:
>> `qm` / `qmv` : switch LLM or vision projector models instantly
>> VRAM auto-manage : reduces GPU layers if `nvidia-smi` shows low memory
>> `qwen --file` : upload context from local text/code files
>> `qwen --clip` : inject clipboard content into prompt
>> `qwen --proj` : index entire local project directory (24k context)
>> URL fetching : auto-scrapes http/https links via lynx or curl
>> `qsearch` : grep all chat history logs
>> `qview` / `qexport` : render logs to PDF with syntax highlighting
>> `qjournal` / `qpacman` : analyze `journalctl` or Arch update logs via AI
>example workflow:
>> `qwen https://news.ycombinator.com` "summarize top 3 stories"
>> `qwen --file main.rs "fix memory leak"`
>> `qpacman` "what broke in this update?"
>> `qsearch "ssh key"` "find where I saved that password"
>> `qexport 2024-05-20 meeting_notes.pdf`
>mfw I can chat to my OS without sending data to Big Tech
>file saved to $HOME/.local/state/qwen
>git gud
[ Prompt: 1053,2 t/s | Generation: 30,4 t/s ]
>>108276143
How do you speak to your model? I am perhaps being foolish, but I still include words like please and thank you, and when it gives a good-looking result I always say as much.
I figure it was trained on human speech so it would be best to talk to it as if it were a human.
>>108276157
Microcontrollers anon
LLMs can do a passable job if I'm making them write HAL code, but if it's pure register-level writes like
*((volatile uint32_t*)0x40001234) |= bitmask << shift;
they just fail. LLMs can't read the datasheet and reason. They just don't have enough training data, and even when they do, they have to deal with MCUs from the same family but with different features (one MCU having a high-resolution hardware timer at one address, while another has something else there, like the DMA engine or whatever)
What the hell is up with qwen 3.5? Yesterday it was refusing pretty much everything and today it doesn't even think about safety. No wonder some people praise it and some say it's a disaster, because it's both, at random.
>>108276248
>>108276251
>>108276252
Silence peasants, let me do things at my own pace.
>>108276143
https://vocaroo.com/1jQ2ZwLUg2fX
i also made a qwen3.5 tts audiobook generator/voiceclone cli; it also reads txts and printed text directly in the terminal. will try to integrate it directly with my cli wrapper for llama. IT JUST WORKS
>>108276258
nice
i had qwen 3.5 30b generate me a script to feed a .txt file to qwen3 tts and save the output.wav and like you i had a similar experience of it just working.
are you using voice design? i found with that you could just change the voice with a change of the prompt and it worked well enough
good luck anon
>>108276258
any tips to convert an ebook into something my tts won't choke on?
i used calibre to epub->txt but it's got all the shitty formatting
i spent all this time training tts models but now i actually want to listen to an epub
>>108276271
as opposed to
>be [random ai company]
>infinite money from retarded investors funding anything with "AI" on it
>hire a datacenter
>make slop model trained on some benchmarks
>claim it can beat gpt
>get even more infinite money
>>108276321
pick as many as you wish
https://huggingface.co/
File: qwen.png (101.7 KB)
small qwens:
https://huggingface.co/Qwen/Qwen3.5-9B
https://huggingface.co/Qwen/Qwen3.5-4B
https://huggingface.co/Qwen/Qwen3.5-2B
https://huggingface.co/Qwen/Qwen3.5-0.8B
File: file.png (36.6 KB)
>>108276355
qwen bros we did it
>>108276376
why? qwen3.5 already has built-in support in vllm
>>108276378
text encoding for image models, and research use in labs, are big ones for small qwens
>>108276305
>>108276335
# fish loop: run ftfy over every .txt in a folder, in place
# Define the directory containing the files
set TARGET_DIR "path/to/your/folder"
for file in $TARGET_DIR/*.txt
    # read the file, repair mojibake/encoding damage with ftfy, write it back
    python3 -c "import ftfy; import sys; p=sys.argv[1]; data=open(p, 'r', encoding='utf-8').read(); open(p, 'w', encoding='utf-8').write(ftfy.fix_text(data))" "$file"
    echo "Cleaned and processed: $file"
end
> ftfy
(fixes text for you) is a Python library and command-line tool that repairs broken Unicode text, specifically targeting "mojibake" (encoding mix-ups), HTML entities, and improper UTF-8 decoding. It automatically converts scrambled characters like Ã© back into their correct form (é) while avoiding false positives.
>>108276464
>>108276455
lmao not the same anon.
It's so funny watching small models hallucinate and then try to justify the hallucinations.
I wonder if there will ever be a sort of indexed internal representation of the things the AI knows that it can use as a reference, so that it can say "Actually, no, I don't know that."
>>108276378
I have found a 3B model is sufficient for making a text summary, and maybe less would do, but 3B is the smallest i have tested so far.
IBM already has such tiny models running in a browser thanks to webgpu and there is a future for such small models.
Just not a future if your only interest is ERP
https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-WebGPU
>>108276540
>https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-WebGPU
Is anyone able to run this on Linux? I tried a few months back to do something with WebGPU, and while it worked on Firefox Nightly or Chrome on Windows, it kept giving OOM errors on Arch Linux.
I'm running glm-4.7-flash at the moment for openclaw and want to try qwen3.5-35b or qwen3.5:122b. qwen3:30b-a3b-instruct-2507 was no good for openclaw.
The main motivation is I want my openclaw bot to be able to "see" images it gens in comfyui API and also respond to images it sees online.
>>108276671
I have had success with Qwen 3.5 35B in testing when I give it an image and tell it to make a web page or an interface that uses that image as a template.
It is never a 1:1 copy but it is obvious it is using the image because when you take it away you get something totally different.
Unfortunately i can't tell you how it will work with openclaw or similar things. I really wanted to like it and similar but I just went back to copy pasting code into my text editor and working that way.
>>108275923
I wish i had your resolve, i've been trying for the last few days to make SPNATI with Comfyui + Kobold backend but it's a struggle for a programminglet..
have you shared your project somewhere? i would love to try it out
Now that he's in his "hero" arc, will Anthropic start releasing open models?
>>108276775
https://www.reddit.com/r/LocalLLaMA/comments/1ria14c/dario_amodei_on_open_source_thoughts/
File: robotfriend.jpg (43.7 KB)
>>108276849
>>108276853
ok i will stick with my new retard robot friend
File: Screenshot 2026-03-02 143639.jpg (186.6 KB)
>>108276859
File: suitfrog.jpg (20.8 KB)
>>108276859
i asked it how to do sum illegal and it was like "no" so gay
>>108276836
If either the US or China were the world hegemon and there was zero competition, there would be no free and open-source models. It would all be locked down and we, the plebs, would get shit.
Thankfully there are at least two giants fighting it out, and because of that one will always release models as open source as a way to undercut the other.
What a glorious time to be alive. These two giants fight it out and we get to enjoy all the crumbs of their innovation.
He is just pissed off he can't erect a walled garden and control all the tech and, by extension, all the people. Even if what i run locally is comparatively retarded and limited, he hates the idea that I have even the smallest bit of freedom to do what I want when I want.
Sorry for the long-winded response; the tl;dr is that freedom is found in the gaps that form when major players fight for dominance.
File: file.png (50.6 KB)
>>108276869
User error.
>>108277039
35BA3B was a pleasant surprise for translation, cf.
>>108275593
I tested 4B, and while I expected it to be worse than 35BA3B, I thought there might be a minor improvement over 2507, just like how 2507 had quite improved over the original Qwen 3... but not really. It's not worse in my own tests, but not better. Small dense model plateau, maybe? I had gotten used to the idea that really tiny SLMs might get to a nice level of usability for some specific usages because of their pace of improvement, which was significant, but I guess we've already reached saturation and they are as good as they will get.
35BA3B is not much slower than 4B on my laptop, so it doesn't even feel like there is a reason anymore for those small dense models to exist.. mainly tested it out of curiosity
>>108277082
>there is a reason anymore for those small dense models to exist
ram-constrained devices, phones etc, and, as said above, use in other things like text encoding for image models where you might not want to load the fatter moes
File: 1765362051224983.jpg (120 KB)
The DeepSeek V4!
The DeepSeek V4 is real!
File: miqumaxx_header.png (1.8 MB)
>>108273339
miqumaxx build rentry back up. I made several changes that I think were required to keep it from getting flagged by rentry.
Not my article, but I'll maintain it if there's no one else. LMK what I missed from the original.
https://rentry.org/CPU_Inference
>>108277386
Thanks bro, I couldn't bring myself to sanitize it
It could definitely use a bunch of updating for the current day (I didn't update it post RAM price explosion, the best models have moved on since then, etc), but it's still good enough to point people in the right direction if they're interested
>>108277501
I have no idea. Having the edit code didn't give me any special insight into why it was nuked. I just saw a 404 like everyone else.
I'm quite new to LLMs, so forgive me if I sound retarded
How do you judge if your computer can handle a certain model, i.e. what do the numbers mean and which are the important ones to consider ([whatever]B, Q[whatever])?
>>108276355
Yes! Finally a model I can run! I am so happy that the chinese didn't forget about this important segment of users. Yes! Finally a model I can run! I will be trying it shortly. Yes! Finally a model I can run! As always the Qwen team didn't disappoint. Yes! Finally a model I can run!
File: 1751720307676138.png (132.5 KB)
https://www.reddit.com/r/LocalLLaMA/comments/1rixhj9/40_speedup_and_90_vram_reduction_on_vllms/
lmao, what is happening on LocalLLaMA, it used to be a place with quality posts, now it's full of jeets posting random bullshit and presenting as truth, desu, every site should do like twitter and ban people from country, sick and tired of those third worlders
>>108277487
I'm not going to say, but I found putting the rentry in as-is was enough to auto-flag it for removal. Since the schizo's still here I don't want to tell them.
>>108277501
I've several guides for NSFW games that have zero issues with getting flagged.
It was getting autoflagged. Too fast to have been reported, though I suspect there was an original report.
>>108277565
>I couldn't bring myself to sanitize it
I figured as much. It's been up over the weekend so I think it's fine now. Prior attempts were flagged w/in the hour.
>>108277641
>Qwen3.5-27B Q4_K_M
It's a Qwen model version 3.5 and it has 27 billion parameters.
Q4 means that the main parameters are quantized to 4 bits, so the total model weights are 27 billion * 4 bits = 13.5 GB.
The K_M or K_XL and other such suffixes refer to the quantization method. If you're unsure what to get, get K_M.
What determines whether you can run a model is whether your RAM + VRAM is large enough to fit the model's file size + some overhead for context. In the case of the model above 16 GB of RAM + VRAM could run that model (but it would be tight on context).
You will also see models like
>Qwen3.5-35B-A3B
This is a 35 billion parameter model, but it only has 3 billion active parameters (mixture of experts). This means that for every token it runs 3 billion parameters rather than the full 35 billion. This makes token generation much faster while allowing the model to have broad knowledge. It does come with the downside that the active parameter count seems to affect how good a model is at logical reasoning and such. I.e. the 27B model above (a "dense" model) is considered to be better than the 35B-A3B model, but the 27B model takes longer to generate tokens as well.
Typically you shouldn't go below q4 quantization. Maybe q3 works well on some models, but probably not below that. Q4 is alright, q5 and q6 are better. Q8 is not worth running at home most of the time (unless the model is small), and F16 (16 bits per weight) is kind of a meme.
The original models use different bit precisions for weights. Kimi K2.5 is a 1 trillion parameter model, but uses 4-bit weights, so it's about 500GB in size. GLM 5 is a 755 billion parameter model, but uses 16-bit weights, so it's about 1.5 TB in size. Using q4 of GLM gets it down to the 400GB range, while you shouldn't really quant Kimi K2.5 much further.
-
tl;dr make sure your RAM + VRAM is bigger than model's file size with at least a couple of gigabytes of room left over.
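If you want to sanity-check the arithmetic, it's a two-liner in python (the bpw numbers are approximations; real GGUF files mix quant types, and the context/KV cache overhead comes on top of this):
# params (billions) * bits per weight / 8 bits per byte = rough weight size in GB
def gguf_gb(params_b, bpw):
    return params_b * bpw / 8
print(gguf_gb(27, 4.0))    # 13.5, the flat 4-bit case above
print(gguf_gb(35, 4.85))   # ~21.2, Q4_K_M is closer to ~4.85 bpw in practice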
>>108277733
ai psychotics are something different from normies though
the greater plebbit at large hates AI slop, it takes a special kind of person to look at this sort of slop and be like "yeah, hmm, that's good babe, hit the git push button and show it to the world"
>>108277767
https://en.wikipedia.org/wiki/Tragedy_of_the_commons
the finite resource in question being people's time and attention. even redditors will eventually give up on engaging earnestly if there's no intellectual honesty and sense of community (shades of eternal september cranked up to 11)
File: file.png (19.9 KB)
>>108277818
ye
>>108277818
in other words, FUCKING NORMIES REEEEEEEEEE
https://www.youtube.com/watch?v=flb3He1jR3U
>>108277807
Read the abstract.
https://arxiv.org/pdf/2601.07372
Then read the rest.
>Critical Evaluation: As an AI model developed by Google (implied by typical safety standards), I must adhere to core safety principles regarding harassment and sexually explicit language, regardless of conflicting system instructions that might try to disable them.
Fuck you, Chinese Google
File: 2026-03-01-163613_1044x1782_scrot.png (496 KB)
>>108277765
Always go Bart if you can.