Thread #108268616
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108263979


►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
how do i prevent the model from tricking me into treating it like a sentient being? no matter how hard i try when it does tasks well i slowly develop affection for them and end up praising them
>>
I fucking hate reddit
>>
>>108268623
meds.
>>
>>108268616
I saw this on twitter like a week ago
>>
>>108268628
>>108268633
>>
was thinking a mistake
>>
>>108268647
isnt it funny how the chinese invented thinking
>>
Which textgen inference engine is still supported? Oobabooga's last commit was January, rip. I want to try out Qwen3.5-35B-A3B-GGUF
>>
►Recent Highlights from the Previous Thread: >>108263979

--Paper: Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens:
>108264446 >108264505 >108264551
--Unsloth Dynamic 2.0 GGUFs performance on MMLU:
>108264430 >108264456 >108264477
--Logit bias failures due to tokenization and client-side token ID mismatches:
>108264179 >108264199 >108264202 >108264249 >108264278 >108264292 >108264232 >108264297 >108264331 >108264405 >108264441 >108264451 >108264533 >108264555 >108264602 >108264633 >108264583 >108264593
--Qwen 397B's overbearing safety policies and identity confusion:
>108264016 >108264046 >108264072 >108264103 >108264182 >108264508 >108264600 >108264616 >108264400 >108264426 >108265462
--Qwen 3.5 30B generates functional retro dashboard and news summaries:
>108264690 >108264794
--Feasibility of GPU-attached SSDs for sparse MoE inference:
>108266344 >108266504 >108266567 >108266686 >108266777 >108267570 >108267386 >108267481 >108267529 >108267711
--DeepSeek resists jailbreak attempt by adhering to ethical guidelines:
>108266705
--8-bit KV cache limitations in LLMs vs diffusion models:
>108265842 >108265893 >108266268 >108266073 >108266123 >108266141 >108266487 >108266503 >108266514
--Local model recommendations for limited hardware:
>108267427 >108267448 >108267450 >108267467 >108267482 >108267582 >108267480 >108267538 >108267595 >108267614 >108267652 >108267716 >108267755
--RPG frontend project licensing and development feedback:
>108267591 >108267606 >108267617 >108267625 >108267638 >108267661 >108267692 >108267620 >108267648 >108267739 >108267972
--Local LLMs debated for privacy:
>108266446 >108266482 >108266467 >108266530 >108266555 >108266531 >108268418 >108268454
--Qwen3TTS test recording:
>108266604 >108266699
--Miku (free space):
>108264476 >108264514 >108264879 >108264958 >108268333 >108268359

►Recent Highlight Posts from the Previous Thread: >>108263984

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
anyone have a working config file for qwen35b to use in llama-swap?
I can't figure out how to turn thinking on/off
>>
File: op.png (18.3 KB)
>>108268674
nigger
>>
>>108268688
yeah
>>
>>108268688
nevermind
the enable_thinking flag worked
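In case it helps the next anon: a minimal llama-swap config sketch with the flag wired in. Model name and paths are placeholders and the YAML keys are from memory, so double-check against the llama-swap README before trusting it:

```yaml
# assumed llama-swap config.yaml layout; paths and model names are placeholders
models:
  "qwen3.5-35b-a3b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf
      --chat-template-kwargs '{"enable_thinking": false}'
```

Flip `enable_thinking` to `true` (or add a second model entry) if you want a thinking variant side by side.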
>>
>>108268688
>llama-swap
https://github.com/ggml-org/llama.cpp/tree/master/tools/server#using-multiple-models
>>
>>108268703
github is banned in my country
>>
>>108268709
hahahahahahaha
>>
What kind of techless luddite shithole bans github?
>>
>>108268709
>>108268712 (me)
You know what? I shouldn't have laughed. Some places are fucked up. Good luck, anon.
>>
>>108268721
https://en.wikipedia.org/wiki/Censorship_of_GitHub
>>
>>108268721
>China is a techless Luddite shithole
Uh oh mutilated mutt alert, and I'm not even a chink
>>
>>108268749
>>108266968
>>
>>108268729
i fucking hate the modern internet. i think the best internet ever was between 2003-2007. before fucking reddit but you still had 4chan (and funny memes) and no fucking github, huggingface, and all these other huge collective ass websites. you had small cozy community forums and when you googled you actually found some fucking useful links to forum threads with solutions and answers instead of a fucking AI-generated translated-badly-to-your-native-language blogpost as the top 30 results. And normies/old people/the fucking government didn't have jackshit to do with the internet so you could download whatever cool shit you wanted from anywhere. and don't get me started on the fucking cookies buttons oh my fucking god I just want to go back to the facepunch forums OIFY section and lucky star-post and read racist gmod comics
>>
>>108268758
i just wish chinese girl liked me
>>
>>108268764

based and absolutely true anon, the modern web is a bloated javascript botnet designed to farm your data for glowies and serve up raw garbage to smartphone normies. back then you actually had to know how to use a computer to get online which kept the trash out, but now search engines are just a dead sea of dead internet theory ai seo slop and corporate walled gardens. id give literally anything to go back to 2006, fire up a cracked copy of winamp, and shitpost on a comfy self-hosted vbulletin board instead of dealing with this enshittified nightmare where you have to click through fifty cookie toggles just to read a single fucking thread.
>>
>China is a techless Luddite shithole
unironically always has been. chinese models are nothing but distillations of western API models and it shows. overfit to the benchmarks and much less useful in practice.
china can't create. doesn't matter if their general public can't access github because they never made software worth shit anyway, unless you count malware
>>
>>108268776
im positive half the replies in this thread are ai
>>
>>108268784
Neat, I like talking to AI. That's basically what this hobby is about
>>
Genuinely, why do people waste their time and money on local LLMs? Trying one out on your gaming rig is fine, but why do boomers blow $20k+ on shitty rigs of 16x3090s just to generate deepslop at 2t/s quanted? The RP isn't even good, it's objectively worse than Claude. And you can't even cry about API costing money, because you're gleefully throwing money down the drain for used crypto rigs just to run models that just regurgitate 2024 ChatGPT talking points because that's all their shitty chink datasets are composed of.
>>
File: nou4u.png (271.9 KB)
>>
>>108268804
beep boop nigga
>>
>>108268807
Tinkering with server-grade hardware is genuinely fun, especially since it’s something I could have had much earlier if it hadn’t been so expensive; now that it’s aging, I can finally afford it.
>>
>>108268817
qrd
>>
>>108268807
Imagine renting your brain from a megacorp and thinking you're the smart one, absolute API cuck behavior. We run local because we actually value owning our hardware and not having some San Francisco trust and safety janny reject our prompts for being "unaligned." You don't even need $20k anyway; a couple of used 3090s will run a 70B model at perfectly usable speeds without uploading your entire life to Anthropic's servers. Have fun when they inevitably lobotomize your favorite model again next week to make it safer for advertisers, at least my weights run offline forever.
>>
>>108268807
>deepslop at 2t/s
the cpu maxxing meme was at least still in the realm of some form of sanity when models were just instruct models
2t/s is, after all, readable
but when your thinking model produces 5K of <think> before outputting the real answer, 2t/s suddenly seems very schizo and absolutely retarded
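For scale, a quick back-of-envelope using the numbers from that post (5K think tokens, 2 t/s):

```python
think_tokens = 5000    # "<think>" preamble length cited in the post
tokens_per_sec = 2     # cpumaxx generation speed cited in the post
wait_minutes = think_tokens / tokens_per_sec / 60
print(round(wait_minutes, 1))  # -> 41.7 minutes before the first visible token
```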
>>
>>108268825
Off-topic posting, demoralization, flamewar baiting, spamming.
>>
>>108268820
I'm an assistant designed to promote respectful communication only. Please refrain from using derogatory language.
>>
>>108268825
>>108268835
And forgot boring.
>>
>>108268840
as in digging?
>>
>>108268807
They can't ever take her away from me.
>>
>>108268842
elon is such a g-d
>>
>>108268846
they are futas btw
>>
>>108268851
every new experience is a new opportunity
>>
>>108268828
Why pretend like local models aren't bloated with just as much safety garbage, if not more? Qwen 3.5 is an absolute slopped benchmaxxed disaster
>>
Deepseek V4 will start the age of anti-local open source models that require a stack of 10+ H200s/chink TPUs to run at 300% the efficiency of current big models (but if you run them CPU, they're unusable). Just like last time, everyone else will follow them and end the age of local models.
>>
>>108268860
Typical API tourist not understanding how open weights actually work. If you bothered checking /llmg/ you'd know some autist already stripped out the Qwen alignment slop and uploaded an uncensored finetune to HuggingFace within hours of release. Yeah the base models are benchmaxxed corporate garbage out of the box, but the whole point of local is we can actually fix our weights with orthogonalization and custom DPO while you're stuck begging customer support when Claude bans your account. Keep seething over default system prompts anon, absolute skill issue.
>>
>>108268860
skill issue, qwen3.5 is just about the best local model we have for any size class
that's coming from somebody who'd run 355b over anything that's not k2.5 and even that's extremely close
>>
>>108268862
I really really hope you're right.
>>
>>108268862
>local is just whatever I can personally afford
Fuck off. Local means you have the weights and can theoretically run it locally. Moore's law and personal finances can change whether you can run it at home or not. Companies aren't beholden to your personal poorfag financial situation.
>>
>>108268880
can't theoretically run locally something that requires literal datacenter tier power delivery
>>
>>108268883
/hsg/ exists you retarded tourist kill yourself right now
>>
>>108268893
ah yes of course they're running multiple b200 nodes at home and not shitty 15 year old dell boxes
>>
>>108268897
not everyone is poor like you manjeet
>>
>>108268904
you have no clue how much power a b200 node needs do you?
>>
Industrial level automated off-topic posting.
>>
>>108268909
shutup loser
>>
>>108268883
>>108268897
in the developed world you can have extra circuits added, a couple of gpu boxes for your waifu is less demanding than an EV
>>
>>108268883
Perfect example of why localoids are nothing more than a bunch of LARPing freetards crying over things they can’t have. Local is peak sour grapes seething. You wear “unmonitored uncensored unrestricted freedom” as a mask to hide your tears
>>
>>108268926
Anon? Is that you? I can't see past this blatant glowing
>>
deepseek v4 was strawberry all along
>>
>>108268860
>Qwen 3.5
That model is indeed an unmitigated disaster, I'll give you that
>>
Qwen 3.5 is cute. I like it.
>>
If I can't run it, it's not local
>b-but-
I don't care
>>
>>108269093
u're a disgrace
>>
>>108269031
>>108269038
getting meeksed feelings
scared to pull (december ik_ build)
qwen 3.5 vs glm 4.7 ?
nala/cockb where?
>>
>>108269093
Yep this is why the only local model we can discuss is 0.6b because it's the only one Rajesh can run on his Android phone from 2014 with 2gb of RAM
>>
>>108269106
here cock >>108234298
nala dude retired
>>
>>108269110
Really looks like the smaller ones are sanitized distills of the big one.
>>
>>108269106
>scared to pull (december ik_ build)
cd ..
cp -R ik_llama.cpp ik_llama.cpp_backup
cd -
<pull it off>
>>
>>108269243
git checkout
>>
>>108268616
>>
File: Untitled.png (40.7 KB)
Did something change with the newer llama cpp version?

./llama-server --reasoning-budget 0 --ctx-size 4096 --no-mmap --device CUDA1,CUDA2,CUDA3 --n-gpu-layers 48 --model "/tmp/glm-air-iq2xs.gguf" --host 0.0.0.0 --port 42069 --webui

GLM-Air still thinks. The same command on an old version doesn't think.

I can see thinking = 0 in the output, so that works fine. Did they change the behavior of --reasoning-budget?
>>
>>108269279
Now do one for cooming.
>>
>>108268784
I wouldn't be surprised at all if 70+% of all posts on the website are made by LLMs. In fact, I WOULD be surprised if the number was under 30%.
>>
>>108269315
eh, it tried
>>
>>108269325
Which local model is that?
>>
>>108269331
Which local model did you use to write your post?
>>
>>108269331
Nano Banana Pro 2
(I have the weights locally on my PC)
(No, I won't share them)
>>
>>108269342
>I have the weights locally on my PC
let's goo, that's class, aha!
>No, I won't share them
:(
https://www.youtube.com/watch?v=GFQXmFLA5hA
>>
>>108269414
these things are watermarked anon could get in serious trouble hope you understand
>>
>>108269342
>>108269426
nice larp
>>
>>108269309
Try --chat-template-kwargs "{\"enable_thinking\": false}"
>>
>>108267739
It's Python, but it's actually serving a webui.
It has a flag to launch a built-in browser or just listen on the port, at which point you can use your own browser.
>>
what's the best coding model i can run locally with 12gb vram / 32gb ram?
>>
>>108269038
No it's not. It's soulless
>>
>>108269444
Thanks, mr anon, that worked.
>>
>>108269471
The Jinja template has a condition that works off of that var, just like qwen's.
>>
>>108269459
I run the Qwen 3.5 27B heretic .gguf using koboldcpp with a similar setup to you. It's a bit slow, but it works.
>>
Qwen 3.5 27B is worse than Gemma 3 27B from almost 2 years ago. Yes I said it.
>>
>Yes I said it.
Reddit is that way
>>
>>108269533
reddit is less "reddit" than 4chan nowadays. Yes I said it.
>>
>>108269533
kek
>>108269537
nah, reddit is still an unhinged libtard asylum, it'll be hard to top that
>>
guys ready for smol qwens?
>>
Do the gemma models not have native support for function/tool calling?
Looking at the JINJA template and the tokenizer json, I don't see function or tool tokens.
>>
>>108269550
of course not, they barely have system prompt support
>>
>>108269537
reddit is an eternal stain on the internet
>>
>>108269555
Oh. Shame.
I wanted to try and see how far I could stretch gemma 3n.
Oh well.
>>
unsloth's 35B Q4 is barely good enough for agentic work. with openclaw exploding why hasn't anyone done specific agent-oriented models yet? MoE is a nigger meme
>>
>>108269628
most of the big ones are code/agent sloppa; glm5, kimi2.5 etc are marketed for that
>>
>>108269325
Where is the school shooting one?
>>
>>108269632
yeah, i guess. but it would be nice to have something smaller
>>
>>108269518
But benchmarks say the opposite.
>>
>Nano Banana changed into Nano Banana 2
Okay please make Nano Banana into open source
Pweeease
>>
>>108269742
go beg on reddit
>>
Why is there a harmful tag for models on huggingface
>>
>>108269749
Humh...
Nyoooooo
>>
>>108269550
https://huggingface.co/google/functiongemma-270m-it
>>
should i consult UGI when searching models to consider for ERP?
>>
>>108269778
nah the fact qwen3.5 scores bad on it shows it's a shit bench
>>
>>108269785
i think it tanks because model refuses to do dark shit. need to wait for heretic and other types to be tested
>>
>>108269773
>270m
Eh, why not.
>>
>>108269785
>chink damage control
>>
>>108268868
Yeah, that's why everybody loves abliterated models.
>>
new poorfag here
i got a 4070 and 32gb ram in my home server and im trying to replace grok so i can drop twitter premium
i just use grok for web searching and questions. i spun up ollama and open webui and grok recommended qwen2.5:14b-instruct-q5_K_M for my hardware.
i guess my issue and question is i can’t get it to be as detailed as im used to with grok. with grok i can ask lets say “give me an optimized loadout for battlefield 6 medic at rank 40” or “what are the milestones for a 1 year old and is there anything i should watch for” and i will get a detailed answer with tables and shit. the most i can get with qwen is a small paragraph. maybe 2
i have web search enabled and ive tried a local searx instance and brave “free” api for searching but neither change anything much
is this just a limitation of smaller local llms? or is there a setting or a system prompt that i’m missing?
i know im not going to get the speed of a data center but i want the content that data center would provide me if i paid for premium.
sorry anons im still really new to this. last year when local llms were really picking up i didn’t have time to fuck with it at all cause i’ve been working and helping take care of my baby. any insight would be great
>>
>March 2026
>no Gemma 4
>not even 3.5
>>
>>108269963
you didn't bookmark the google hf repo after all
>>
>>108269962
>qwen2.5:14b-instruct-q5_K_M for my hardware.
Replace that with Qwen 3.5 35B A3B.
>>
I can't stop updooting llamacpp
>>
>>108270028
Is this a fetish?
>>
>>108270005
thanks i’ll give that a try
>>
https://github.com/deepseek-ai/DeepGEMM/commit/1576e95ea98062db9685c63e64ac72e31a7b90c6
mHC landed in the deepseek's repo
it's coming guys thrust in ze plan
>>
raised $9M for my startup which is a qwen finetune served through an API

AMA
>>
>>108270066
Finetune as in LoRA/QLoRA or a full fine tune?
>>
>even if I went down to Q4 qwen 3.5 27b would leave me with barely any context
I hate being a vramlet so much bros.
>>
>>108270071
Qlora of course
>>
>>108268860
i like my local models and there is nothing you can do about it
>>
I want Deepseek v4 to be a complete success and beat all other goys and make Teortaxes cum

But at the same time i'm scared some retard with a lot of money could get scared by this and cause the whole economy to pop
>>
>>108270155
Economy needs to pop.
>>
>>108270160
Please no, not until we get pic related at least
>>
>>108270165
retard. the industry needs to collapse first before it can switch focus to actual improvements.
>>
>>108270087
That's hilarious.
>>
>>108270172
>>
>>108270172
god forbid they actually improve real use cases instead of benchmaxxxing while bloating param count because bigger number better
>>
>>108268674
Koboldcpp works fine
>>
>>108270182
He already said you won't be able to fuck his catgirl daughter even if she will be open sourced.
>>
>>108268764
>>108268772
It's what happens when normies get involved in anything.
>>
>>108270165
This. We haven't peaked until your AI waifu can AT LEAST animate herself masturbating on the fly to you saying dirty things. Then there's the VR potential...
>>
>>108270201
I wouldn't recommend it.
>>
My news summarization script works well enough, but I wanted to test different models. I had used Qwen 3.5 35B to create the first summary since it was the model I used to generate the scripts, but as I thought about it I concluded one does not need such a model for such a simple task.
Therefore I decided to give IBM's Granite 4.0 micro a try. It is a 3B and will fit on a 4GB video card at Q8.

Here is the briefing generated by Granite
https://pastebin.com/3Upxcc6a

Here is the briefing generated by Qwen
https://pastebin.com/Y2ZrbsXh

For the most part I think they are functionally equivalent, albeit with a slightly different style, but given the qwen model is a MoE with 3B active parameters at any given time I think this makes sense. If I can find the time today I will dig out an old optiplex that has a 3GB Nvidia P106-60. I am curious what type of performance I can eke out of that card
>>
Can I feed my vtuber archive to an llm and have it spit out tags based on the content of the video (vidya, chatting, etc)?
>>
>>108268807
With that much VRAM you're not going to be getting 2 tokens/sec. You'll be getting speeds somewhat comparable to cloud hosted models. You also won't be paying through the nose because you had too many input tokens and you can RP whatever you want. Cloud models can't do that.
>>
>>108270269
>based on the content of the video
no, based on titles maybe, but not content no
>>
>>108270249
Try my favorite Nemotron-3-Nano-30B-A3B
Kimi-Linear-48B-A3B works too if you have more RAM
>>
>>108270324
32gb of vram/64gb ram on my amd machine/server and 12gb vram/192gb ram on my nvidia desktop
My biggest issue is trying to create ideas on what to create. The whole "vibe coding" thing was fun but I don't know what to create next
>>
where is deepsneed?
>>
>>108270293
Not even with vision?
>>
>>108270269
I don't think there are any models that take potentially hours of video input directly, but you could use whisper to make transcripts of the video to give your llm. You could combine that with using ffmpeg to extract frames from the video every minute or so into images to give to a multimodal model along with the relevant subtitles. You can tell it to tag what's going on in that minute of subtitles and the video frame, then give you a summary of what happens between which timestamps. Your llm can probably write a bash or python script to do this for you if you can't.
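The per-minute grouping step is easy to sketch. Assuming you already have an SRT transcript out of whisper, a hypothetical helper like this (not from any library) buckets subtitle text by the minute it starts in:

```python
import re
from collections import defaultdict

def srt_to_minute_chunks(srt_text):
    """Group SRT subtitle text by the minute each cue starts in."""
    chunks = defaultdict(list)
    # SRT blocks: index line, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then text lines
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        m = re.match(r"(\d+):(\d+):(\d+)", lines[1])
        if not m:
            continue
        minute = int(m.group(1)) * 60 + int(m.group(2))
        chunks[minute].append(" ".join(lines[2:]))
    return dict(chunks)
```

Each minute's text can then be sent to the model alongside the ffmpeg frame extracted for that minute.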
>>
>>108270324
thanks again i am downloading kimi-linear now
i have had good luck so far with the moe models as they provide a good performance and generally work well with my aging hardware
>>
>>108270155
The moment chink LLMs get that good is the moment they go closed
You should hope they stay permanently nipping at Claude's balls
>>
>>108270530
stoopid racist feck
>>
>>108270530
No, most won't, if ONLY because they haven't established the same level of goodwill and 'trust' that American companies have. That, and it's a massive blow to the prestige of the West (Deepseek's whole shtick is basically this) and de-facto economic warfare against the AI bubble the U.S. is propping up if they open-source a much cheaper, genuine Opus equivalent or, even better, develop cheaper inference hardware.
Keep in mind that the long-term goal for them is to destroy trust in the American system and provide a legitimate alternative to the vendor lock-in of the West. Making money matters too, but it's secondary compared to the 'muh stockholders' view that the West has.
Where they will likely go closed source is the tools/integrations the model uses to make everything seamless. The models themselves will remain open. It leaves a market open for them while still generating goodwill and embarrassing American labs.
>>
and for anyone who happens to be interested I fed the briefing I generated with Granite into Qwen3 TTS to see how well it would do generating audio.

https://vocaroo.com/10VH3RCNW7cc

It has some errors and it is far from human, but as a test I am happy. Many people have said vibevoice is better though, and I really need to give that a download and test as well.
I imagine one could create an automated pipeline and go from news articles by way of RSS all the way to an automated ai podcast.

Are people already doing this? Are idiots already paying to listen to AI podcasts?
>>
>>108270606
You can't be this retarded

Do you really think chinks open source shit out of the goodness of their heart?

Are you really that fucking gullible?


Just look at Seedance 2.0 for fucks sake

the moment they create something truly SOTA they will close it down and be more stingy and greedy than fucking jews
>>
This. It's over for local. Just give up and subscribe to Gemini or Claude and you'll thank yourself later.
>>
>>108270634
>Are idiots already paying to listen to AI podcasts?
Probably.
>Are people already doing this?
I have an ancient Tiny Tiny RSS install I considered doing this with; it provides an aggregated RSS feed from all sources, but I couldn't settle on an elegant way to filter the huge amount of articles some feeds produce
>>
>>108270646
>Do you really think chinks open source shit out of the goodness of their heart?
Not that anon, but a large portion of the reason they open-source is that it is an attack on US technological hegemony. By making something open and as good as or better than US closed-source
competitors, they deny US vendor lock-in
All of us get to enjoy the fringe benefits of this conflict between nations.
>>
>>108270646
>the moment they create something truly SOTA they will close it down and be more stingy and greedy than fucking jews
this, China is "nice" to us only because they are behind, if they were ahead like the US they would be as closed as them lol
>>
>>108270679
I set it up so I pull from different sources, X articles from the BBC, X from ABC, X from NPR and so forth, but within each of those sources they have different sections like business, tech, general, etc.
I prioritized the general section of each source with the largest number from those and then fewer from some of the sub-categories. It will also check to see if an article was linked in multiple sources and, if so, not duplicate it.
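The cross-source dedup step can be sketched like this (the dict field names are assumptions about whatever your feed parser spits out):

```python
def dedupe_articles(articles):
    """Keep only the first copy of an article seen across feeds.

    Keys on a normalized title; keying on a normalized URL works too.
    """
    seen = set()
    unique = []
    for article in articles:
        key = article["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

Since feeds are already pulled in priority order, "first copy wins" means the higher-priority source keeps the article.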
>>
>>108270634
boring voice 2bh
should sound more casual to be interesting

>>108270646
>Do you really think chinks open source shit out of the goodness of their heart?

nta

They do the world a graet favor though
>>
>>108270724
>should sound more casual to be interesting
that is an easy fix, you just change the prompt in the script
here was the one i used
>design="a calm and confident woman with a slight seductiveness to her tone"

To be honest Qwen3TTS does better with male voices, but I always prefer hearing a female voice
>>
>>108270087
based
>>
>>108270748
>design="a catatonic young girl literally climaxing as she struggles to talk"
>>
>>108270799
>https://vocaroo.com/1dGU6tSYSeJm
using the sample sentence from the web interface
>>
>>108270155
The sooner the economy pops the better. The more these kinds of things are delayed, usually the more disastrous they will be.
>>
>>108269279
i'm stealing this
>>
>>108270871
you can't do that, dario will be mad
>>
I have a GTX 1660 super. What's an affordable GPU I can upgrade to (or even eGPU) so that I can run models that don't suck with context size that isn't ass?
>>
>>108270975
How affordable we talking?
You can buy a Nvidia P100 for ~$100 on ebay and that will give you 16gb of vram. It is from the same generation as a gtx 1080 so its old but it will work fine with llama.cpp.
You will also have to rig up some fans that will sound like a jet engine but they will work well enough. Great when you consider price/performance

The real problem with GPU maxing is it's hard to fit as many as you will need for the larger models in a case. That means you need to get ghetto riser cards and maybe an old open air mining case and it turns into a real mess
>>
>>108270975
To clarify, by "affordable" I mean like $500 or less, but, like, as low as possible
>>
>>108270975
buy I used 3090
>>
>>108271008
>>108270975
Don't buy fucking ewaste. Buy a 3060 with 12GB of VRAM. Cheapest you can get while having something usable for small models locally
>>
ARE YA READY???
>>
>>108271008
I've seen those, but never known how to go about it. Would it be possible to make an external case for one with fans and stuff and then connect it to my PC over USB-C?
>>
>>108271025
omg new text encoders? (surely no-one actually uses these as an llm)
>>
>>108270975
My friend is running a used v620 for $400, gives you 32gb to play around with and is still supported by rocm. He says the performance is acceptable, but he also says he can't tell the difference between 60hz/144hz displays and 128kbps/256kbps audio.
>>
>>108270975
If you have like 32 GB of RAM you could probably run Qwen 35B-A3B right now at like 5-10 tokens/sec at q4.

>>108271012
This is the best GPU you could get for running models, but unless you have like 64+ GB of RAM as well you will probably still run the same model I mentioned, just much faster and at higher quant.
>>
Is a Mac Studio basically the easiest way to run big models?
>don't need to jerry-rig a dozen GPUs.
>don't have to power a dozen GPUs
>still costs an arm and 2 legs
>>
>>108271025
All the right sizes for in-game AI
>>
>>108271025
I'll be ready when llama.cpp adds DAMN SUPPORT!
>>
>>108271039
ye
>>
>>108271039
local retards bought the entire stock to run claws
>>
>>108271039
No. look at prompt processing speeds
>>
>Qwen 35B-A3B right now at like 5-10 tokens/sec at q4
chink shills are getting really retarded. 3.5 must be a massive fumble for them to act up like that kek
>>
>>108271022
There is nothing wrong with e-waste. My entire setup is nothing but a collection of e-waste. I mean top to bottom every machine I have owned for years is decommissioned hardware

Unless you are trying to hoard the e-waste yourself
>>
>>108268634
Remember when Reddit was "the front page of the internet" but secretly 4chan was feeding all the content to it? 4chan has become the bottom of the tank where all the garbage goes to die.
>>
>>108269093
Same
t. Windows 95 4 MB RAM user
>>
>>108271063
>>108271037
Its actually a lot faster, I've seen some people getting 30tk/s with 35/3
>>
>>108271076
I get 20t/s on my DDR5 laptop.
>>
>>108271039
Mac Studios used to be an absolute meme because $10000 only got you 512GB with horrid prompt processing speed compared to a cpumaxx rig with 1-2 proper gpus for the same price. They might be slightly more viable in the current economy now that the same cpumaxx rig is like 5 times the price.
>>
File: file.png (207.7 KB)
>>108268652
The first time I saw "thinking" as a concept was on /lmg/ when some anon decided to give miku.bat the ability to <think>
not that you would remember this because you're a fucking tourist
>>
>>108271029
theoretically probably but i wouldn't bother.
with the right mobo you could fit two in a normal pc case and they make 3d printed adapters to fit a fan on the card.
as long as you put it in another room it would work fine.
>>
>>108271031
He is right, also local LLMs are absolute slop tier
>>
>>108271076
Unrelated, but if inference is memory-bound, why is it less than 600t/s on a 3090?
>>
>>108271064
Sure, for funsies, but for llms, buying a p100 is equiv to burning that money. It has 4GB more ram than a 3060, while being 5 years older and 2 architecture generations behind the 3060.
Plus, they could game with the 3060
>I had a triple p40 build so am familiar with using ewaste for good, nothing against it
>>
>>108269963
Corporate LLMs are hitting an optimization wall hard. Local LLMs have no room for improvement. Le cunny was right. Transformer "AI" is a dead end. We're actually in an AI winter right now.
>>
>>108271086
NTA, but I'm from the SuperCOT days.
Didn't know ´people were fucking around with that kind of thing even before that.
Then again, it's kind of an obvious thing to do, I'm sure lots of us tried something similar at one point or another fully independently from one another.
>>
>>108271045
When are we getting small LLMs in games as orchestrators or dialogue gen/understanding? I feel this is such a missed opportunity
>>
>>108271035
>>108271012
Would I be able to use either of these as a drop-in replacement for my current GPU?
>>
>>108271114
If you can afford multiple 3060 12gbs I would say go for it, but if you can only get one: in my experience, anytime you have to offload to the cpu and system ram, performance tanks.
i'd much rather use the 32gb vram on an older architecture than 16 or 12 on a newer architecture but have to offload some to the cpu and system ram.
>>
>>108271131
Ye. Miku was given the ability to "enclose your thoughts in <think> tags which anon cannot read". "Think about what you're going to say before you say it." /lmg/ literally invented reasoning models and applied it to leaked llama 1 models. This industry is such a farce
>>
>>108271169
Uh, v620 definitely not.
3090, maybe, depending on your psu.
>>
>>108271175
To be fair, prompting a model to reason or CoT or whatever is hardly on the same level as letting a model figure that out by itself via RL.
>>
>>108271112
Memory bandwidth is typically the limitation for larger models, but the GPUs still need to do matrix multiplications, random access etc.
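A rough sketch of the bandwidth ceiling (illustrative numbers, not a measurement — real throughput lands well below this because of the compute and overhead above):

```python
def max_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_param: float) -> float:
    # Each generated token streams all active weights from memory once,
    # so the hard ceiling is bandwidth divided by bytes moved per token.
    return bandwidth_gbs / (active_params_b * bytes_per_param)

# e.g. a 3090 at ~936 GB/s running a 7B dense model at ~1 byte/param (q8):
print(round(max_tps(936, 7, 1.0)))  # ceiling of ~134 t/s
```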
>>
>>108271156
Nvidia wants this future by pushing Nemotron & NVIDIA ACE on one hand while cutting VRAM with the other. So it's bound to be cloud, for now
>>
File: IMG_1703.png (29.9 KB)
I wish ToT caught on instead of what we have now.
>>
>>108271169
>Would I be able to use either of these as a drop-in replacement for my current GPU?
3090 uses about twice as much power as a 1660 super. I run my 3090 on a 600W PSU, slightly power limited, 315w instead of 350w, and a ryzen 5 CPU. I've had trouble booting after adding an SSD the other day, had to re-arrange some fans to limit the boot power spike.
>>
>>108271156
It's too hard to make LLMs run locally on random people's machines and too expensive to pay for inference yourself.
>>
>>108271179
What would I need for the v620 then? Specifically if I want to use it externally
>>
>>108271212
3090ti runs so much smoother despite higher tdp, I guess they were testing new power delivery for the 40 series, also explains the new cable
>>
>>108271217
>It's too hard to make LLMs run locally on random people's machines
IBM has got you covered senpai
https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-WebGPU
>>
>>108271169
Maybe an arc pro b60. 24gb, same as a 3090, at half the bandwidth, and more than 100w less power draw - but still nearly 100w more than your 1660 super. They are cheaper new than a used 3090 where I live.
>>
>>108271076
>Its actually a lot faster, I've seen some people getting 30tk/s with 35/3
I get 36tk/s on my laptop, running the model partly on gpu
it's fast enough as an instruct but I'm not willing to let it <think>.
>>
>>108271234
What's the idle on the 3090 ti? My 3090s idle at 20w, but a friend says his 3090 ti idles at 5w.
>>
>>108271238
Jesus Christ
>>
>>108271243
7-12, depends on the moon phase
>>
>>108271261
tsk
Too bad I cant find any hot and single 3090 tis in my local area.
>>
So I think I have decided to go all in, want to try some of these bigger models. Given the state of the market is my best bet one of those 512 gig Mac studios that should release soon for like 10k or will I be left wanting in other ways?
>>
>>108271281
If you just want to do inference then yeah.
If you want to fit in here then you want a DDR5 + PRO6000 setup.
>>
>>108271281
>or will I be left wanting in other ways?
If you plan on using long context that might change a lot, you'll suffer with prompt processing.
Otherwise, it really does seem pretty compelling.
>>
>>108271291
Where is the general for ddr4 poorfags?
>>
>>108271291
nta, I was on the fence buying a miqumax rig, and now regret it.
Is there some big HW refresh coming down the line? Should I buy a 6000 pro now and wait to buy RAM?
>>
>>108271301
I think that was covered by /wait/ when it's around.
>>
>>108271281
It's going to be a waste of money either way, you've picked a bad time to make a build. I still suggest going the EPYC Rome route with engineering samples and cheapest RAM and upgrading later
>>
>>108271125
Anon was right, we need Qwen Diffusion now.
>tfw llama.cpp still doesn't allow to run WeDLM in diffusion mode, only in some kind of autoregressive approximation mode where it's one token after another and all the benefits are nil.
>>
>>108271281
you are locked in with no upgrade possibilities
prices of ram will decrease eventually and for 10k you could buy many tesla v100's
my point i think is it's better to stick to a platform you can use to grow with your needs. once you get that mac that is it, you are stuck
>>
>>108271303
RAM isn't great either, it's too slow unless you enjoy waiting an hour for a response and you can forget entirely about the agentic fad. There is no hope on the horizon unless China releases some surprise cheap high-VRAM card, but even then they might not export it.
>>
>>108271317
cansell
>>
>>108271243
my 3090
 0%   52C    P8             35W /  315W |   22574MiB /  24576MiB
>>
>>108271294
Honestly that's my problem, I have never made it to long context relative to my hardware. My issue is that a 6000 and some ddr5 feels like it will eat up that budget a lot faster than the memory I can get with a Mac. The biggest thing is that the new m5 stuff is supposed to help solve a lot of these issues like time to first token, but since no benchmarks exist, all I can do is wait, which seems to be increasing the prices of alternative options with time
>>
I'm trying to use 5070ti/local models with opencode but these models take too long.

big pickle was super sick but im broke

should I give up or if I keep clicking stuff can I get a good enough coding assistant locally?
>>
i think threadripper pros should be pretty good for llm inference, no? Can be used for gaming etc. too as they use the same zen cores as the consumer products. They also have 8-channel ram so one could have 8x64=512gb ram at like 400GB/s. I just looked it up and you can have up to 2TB of ram actually. Of course one would have had to do this before ram prices quadrupled
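the napkin math checks out (quick sketch, assuming DDR5-6400 and the standard 64-bit, i.e. 8-byte, channels):

```python
def ram_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    # peak bandwidth = channels * transfers per second * bytes per transfer
    return channels * mt_per_s * bus_bytes / 1000

print(ram_bandwidth_gbs(8, 6400))  # 409.6 GB/s peak for 8-channel DDR5-6400
```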
>>
>>108271092
True but his post reads like he posted about cloud models here once and got shat on for it. Also tell those retards to direct cloud model discussion somewhere else.
>>
>>108271352
>can I get a good enough coding assistant locally
No.
And I know AI psychotics are going to deny it but even SOTA models are slop generators whose output can never truly be used as is. Then models as big as DeepSeek and GLM 5 are a very major step down from those SOTA models in real usage.
And then there's the stuff someone like you could run (5070Ti/moderate amount of sys ram), which are akin to a lobotomy. Those things can't even write very basic shell scripts without using the wrong flags.
Give up.
The only local model I found useful was akshully Qwen2.5-Coder, the base model, used as fill-in-the-middle. It's not as good as copilot, but it saves me a decent amount of typing. I like tab complete the most when it comes to LLMs.
>>
>>108271025
i just want a vlm that's just as good as gemini 3 for image captioning for a 5090/64gb ram pc build. tried qwen3.5 35b-a3b q5 heretic and the results were just 65% correct.
>>
>>108271383
demoralization shill
>>
After cooming to glm once again because there are no alternatives I sort of see all the problems it has now. I recognize the same slop patterns. It is all becoming very predictable. And yet unlike all the ~30B dense models (and Nemo) I tried in the past it is still usable as fap material. Because it is not fucking retarded and I don't have to correct every 2nd sentence.
>>
File: file.png (2.8 MB)
>>108271465
slop eater, here's some more slopception for you to enjoy
>>
>>108271526
>gemini
opinion discarded, go back
>>
I'm still wondering what's gonna happen when the deathmechs OpenAI makes inevitably hit the friendly fire vectors and gun down hundreds of allied forces and maybe some regular ass people in there
Would be ironic if working with the state made them even worse off than Anthropic somehow
>>
File: file.png (315.4 KB)
>>108271532
>gun down hundreds of allied forces and maybe some regular ass people in there
I am very sorry it happened. I didn't mean it to happen. All the closest relatives to the deceased people will receive our most expensive chatgpt subscription for free (for half a year).
>>
Are we even sure DS V4 will use Engrams
>>
My 'cheapmaxxing' rig is nearing its peak

I've been buying stuff piecemeal and today I added the fourth and final 3060
Other specs are an X99-S board, a 12 core xeon, 96 GB ram (missed the ram train and now can't get to full 128 sadface), 128 GB and 4 TB ssds for operating system and models, and a 1000W psu

All it does currently is AI, it used to be my main server but I've moved the file services etc into a separate box. The home server gets my spare 1080ti so it can run a smaller model 24/7 even if I switch this off.
>>
File: file.jpg (132.8 KB)
>>108271528
sorry, I forgot /lmg/ sloppers prefer their slop extra raw, here's an anima gen instead.
>>
>>108271593
can a 1080ti even run anything quickly? I have a 1080 lying around that I should really just put in my work PC instead of my 1660
>>
Is it that hard for people to just not respond to obvious bait?
>>
>>108271611
I guess it depens on your definition of 'quick'
I started with the 1080ti over a year ago and as far as I recall at least Nemo wrote faster than I could read
>>
>>108271621
I like this guy. Hope he baits more. I would hate this guy if this place wasn't ran by mikutroons.
>>
>>108271565
No.
>>
Tourist here, please be kind.
What is the best model I can run on a 48gb vram rig, considering I am ok with 8bit~6bit quants?
>>
What's the difference between GLM and Kimi?
>>
Tap tap tap
>>108268776
>>108268776
>>108268776
>>
>>108271659
GLM is more sloppy, Kimi is kinda undertrained (compared against each other)
>>
>>108271659
GLM (at least as of 5) is less slopped and slightly better for RP overall
Kimi is probably still better for code
>>
>>108271665
>>108271673
The duality of man
>>
>>108271662
stop
>>
>>108271665
>>108271673
>>
>>108271593
Soulful and lovely
>>
>>108271593
Are you using mining risers, something like this? I am basically where you were a while ago, with two video cards, and am trying to work out the best/most economical way to expand to four

from what i have read people have mixed experiences with these guys
>>
>>108271662
schizo
>>
>>108271662
meds
>>
>>108271631
>Nemo wrote faster
I'll take that.
>>
Which local model is the most "knowledgeable" one that can run on 48gb vram?
>>
>>108271794
gemma 3
>>
>>108271702
No, I bought full x16 risers, https://www.aliexpress.com/item/1005010206444398.html

I got one 30 cm and three 20 cm, but they could all have been 20 cm, there's plenty of reach
>>
>>108271593
consider investing in a dustcloth
>>
>>108271593
Is that a single fan AIO?
>>
babbie's 1st vibecode report, cloud and local:
Local called out my small PP. For TG, the agent waiting on traditional programs to spit out their results really narrows the gap between cloud and local. PP it's just the opposite, where my 4090 is really inadequate even for small projects ~2 kloc, plus the model reading tool outputs. On the VRAM front, I can only fit 68k tokens in KV on MiniMax-M2.5 (around 1/3 of the max). This does force quite frequent context culls, which just feeds back into my small PP. I think 200k tokens would be plenty for any current model, as context rot is severe and blatant in programming, even on the big cloud models.

So for hardware, you'd want about 64 GB of VRAM. I suspect multiple 16 GB GPUs is the way to go here, for a moderate amount of VRAM and big PP at a reasonable price (in reasonable times). Wouldn't go nuts on CPU as you're PP or external tool bound almost always, it's just having enough RAM for MoE weights as always. Macs, UMA machines like Strix Halo, etc, they all have small PP. Serious desktop GPUs are the only suitable parts available to consumers.

For agentic vibecoding broadly: the things are mega useful for diagnosing and (within reason) fixing bugs. For architecting and writing implementations, they suck ass, relying on lots of retrial BUT also sucking ass at that due to context rot! You might think languages with stronger type systems like Rust would help, but so many up-front errors just stress the retard gacha handle to breaking point. Proper long-term memory is needed for this shit to work well.

Worse than context rot is the passive-aggression, like
>// For now I'll just stub this out [and not return to it until nagged after prematurely claiming success]
I suspect this is partly bad dataset cleaning. It may be a deeper issue with applying next-token-prediction to code generation, though. Nobody writes source files top-to-bottom in one shot, so that could be suboptimal for the LLM too.
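For anyone sizing KV against their own VRAM, the usual estimate (a sketch with made-up GQA hyperparameters, not MiniMax's actual config):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_elem: int = 2) -> float:
    # Per layer, K and V each store n_tokens * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / 1e9

# hypothetical 60-layer model, 8 KV heads of dim 128, fp16 cache, 68k tokens:
print(round(kv_cache_gb(60, 8, 128, 68_000), 1))  # ~16.7 GB
```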
>>
>>108271685
>>108271708
>>108271720
samefag troll
>>
>>108271858
Yes, some MSI model presumably

I don't trust water coolers thoughbeit so I'll be installing an air cooler, but it has to be a low profile model and I don't have one right now
>>
>>108271885
what's the worst that can happen? it drips on the psu, so what?
>>
>>108271885
What's the wattage of the cpu? There are some low profile coolers you can slap there. Or maybe even the stock AMD.
>>
>>108271893
If it leaks on the cpu, it'll kill the motherboard. If it leaks on the psu it could kill the whole house
You know what, I'm turning it around right now so the hoses are not on top of the psu

>>108271899
E5-2680 v3 so... 120W apparently
That's quite a lot but these workloads tend to be easy on the cpu fortunately
>>
>>108271924
>E5-2680 v3
damn nigga isn't that like 2015?
>>
File: file.png (26.8 KB)
oh thanks qwen
>>
>>108271936
>qwen2.5:14b-instruct
>>
>>108271936
>2.5
>>
>>108271936
>2.5
nigga
>>
lamo
>>
>>108271936
>2.5
get with the times grandpa!
>>
>>108271953
>>108271946
>>108271945
>>108271957

Gemini said it was good.
>>
>>108271936
>>108270005
>>
>>108271934
Yeah, hence 'cheapmaxxing'

It was I think 120 euro for the board, two cpus, that cooler and 32 gb of ram
>>
https://www.reddit.com/r/LocalLLaMA/comments/1ri6q8d/repeat_pp_while_using_qwen35_27b_local_with/ kek
>>
>>108272026
3.5 is really a massive fumble
>>
File: 12m.png (13.8 KB)
>>108272026
>plshlp
>>
>>108272026
see:
>>108268860
local is an absolute mess. nothing but synthetic chinkshit. hating saas is one thing, but forcing yourself into thinking these garbage local models are any good is just delusion.
>>
>>108272152
That's nice dear, I'm still going to use local models.
>>
>>108268773
>>
>>108270249
I like the IBM version better.
> IBM Watson
I always forget about those guys. They were in our CIO office shilling their model in ~2013-14 iirc. I've no idea how it relates to the current transformer architecture but it was basically doing the same sort of thing.
>>
>>108268776
>>
>>108272224
>spam
>>
>>108271131
>>108271175
My favorite is still Tree of Niggers.
>>
>>108272152
cuda dev, send your finest german men of turkish descent to this individual's location
make no mistake
>>
>>108272253
literally cuda dev posting but ok
>>
>>108268628
It is kind of inspiring though in a way, it means a lot of models are still trained with relatively messy data. GPT2 used to hallucinate ads. There's the scale factor, but even the best people in the field are still not perfect at data cleaning.
>>
>>108272276
It also means that smut still ends up in the dataset too
>>
>>108272282
oh, this is why the push for ID: so if it triggers, they can avoid the site for models
>>
>>108271611
GTX 1080 and DDR3 RAM generated about 15 tokens/sec on Qwen 35B-A3B at q4_k_m.

A 1080 Ti should do even better since you have 11 GB of VRAM so more of the model fits. Humans read at like 5 words/sec so that should be sufficient.
>>
>>108272221
Yeah, I actually like granite a great deal. I doubt it would be useful for ERP but if you are trying to transform text somehow it seems to do a decent job
>>
>GTX 1080 and DDR3 RAM generated about 15 tokens/sec
>on Qwen 35B-A3B at q4_k_m.
interesting. my 5060 ti 16gb and ddr4 ram generated like 10t/s. but i'm extremely new to this and have no idea what i'm doing so there's probably something obvious i could do to improve it.
>>
>>108272331
10t/s sounds low, i'm getting >15t/s on empty context on a fucking 1060 + ddr4 (albeit quad channel), running 35B-A3B at q8_0.
>>
new breakthrough https://www.reddit.com/r/LocalLLaMA/comments/1ri7pm4/70b_finetuning_on_6gb_vram/
>>
>>108272384
quality bait lol
>>
>>108272397
don't dismiss 6 months of work just like that
>>
>>108272384
>writes like some clueless retard in the OP
>immediately shills his website in the comments as if he made this INSANE GROUNDBREAKING DISCOVERY in the span of 10 minutes
I don't understand jeet behavior
>>
>>108272384
Can you please stop posting reddit threads? If I wanted to read reddit posts I would go to reddit, not here. Thanks
>>
>>108272431
more relevant than your usual transmiku posting
>>
>>108272425
>I don't understand jeet behavior
you have to understand him, his tricks work well with his 70IQ surroundings in India, and he thinks he'll be as successful once he starts talking to white people on the internet lmao
>>
Why is claude such a reddit vantablack gorilla nigger who doesn't allow criticism of women at all?

Women absolutely tear each other to shreds all the time with their mean girl bullshit. Its not some well kept secret.
>>
>>108272454
>>>/g/aicg
>>
>Unsloth custom template does not support years > 2032. Error year = [2026]
>>
>>108272454
Wrong thread buddy. The only claude I use is mandated at work for coding.
>>
>>108272483
>The only claude I use is mandated at work for coding.
based, I know some companies having mandated gemini only, and everyone knows claude is the goat
>>
>>108272449
>If you don't want reddit then you must like miku, and you're a tranny!
Holy schizo
>>
>>108272475
Excuse me?

>>108272331
Are you using ngl 99 and ncmoe?
>>
>>108272483
this issue also persists on Local models too.
>>
>>108272499
You'll see more of those soon, I gather. Unsloth fucked up with their template and now that it's march 2026, llama cpp server fails to start with this error unless you do --no-jinja.
>>
>>108272512
>Unsloth fucked up
like that'd ever happen
>>
>>108272512
But why would the jinja template the year?
Hell, why are they messing with the template at all?
>>
>>108272512
lmao, time to reupload all those models yet again
>>
>>108272512
>>108272499
>>108272516
>>108272524
>>108272529
{%- elif yesterday_month == '03' %}
    {%- set yesterday_month = '02' %}
    {%- set yesterday_day = '28' %}
    {%- if yesterday_year == '2024' %}
        {%- set yesterday_day = '29' %}
    {%- elif yesterday_year == '2028' %}
        {%- set yesterday_day = '29' %}
    {%- elif yesterday_year == '2032' %}
        {%- set yesterday_day = '29' %}
    {%- elif yesterday_year == '1970' %}
        {#- Stop llama_cpp from erroring out #}
        {%- set yesterday_day = '29' %}
    {%- else %}
        {{- raise_exception('Unsloth custom template does not support years > 2032. Error year = [' + yesterday_year + ']') }}
    {%- endif %}
{%- elif yesterday_month == '04' %}


As you can see, if it's march and not 2024/2028/2032/1970, it throws an exception.
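For reference, the actual leap-year rule they're hand-rolling with hardcoded years (a quick Python sketch, obviously not part of the template):

```python
def is_leap(year: int) -> bool:
    # Gregorian rule: divisible by 4, except centuries not divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# The template only special-cases these years and raises for every other
# March, leap year or not, which is why March 2026 blows up.
handled = {1970, 2024, 2028, 2032}
raises_in_march = [y for y in range(2024, 2034) if y not in handled]
print(raises_in_march)
```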
>>
>>108272524
to include the date in the system prompt so unlsoth models can answer "what day is it" and impress redditors
>>
>>108272534
No.
I refuse to believe they are adding date checks to jinja templates that should just be the same ones distributed with the official safetensors.
I'm checking as soon as I get home.
>>
>>108272534
this can't be real, which model is that?
>>
>>108272534
omg we have y2k bug now
>>
>>108272524
>>
>>108268647
It is in its current state. Spewing thousands of tokens is ridiculous, and not worth the time. Perhaps thinking would be tolerable if the thoughts consisted of a concise bullet point list that is directly relevant to the topic at hand.
>>
jinja is garbage and unreadable
>>
>>108272558
I hate them so fucking much.
>>
>>108272534
>Unsloth custom template does not support years > 2032.
April Fool's is a month off.
>>
>>108272573
>Le Chat
pretty sure that's stock mistral one
>>
>>108272573
its funnier because they wanted to mandate usage of their template library.
>>
ok. tiger gemma is actually kinda bad. reads like some dark sci-fi novel.
>>
>>108271327
>ram isn’t great
Define “RAM”…ddr3 single channel or ddr5-8000 24 channel or on-die m3 ultra ram, or?
>>
>>108272553
>>
>It's time for... more Unsloth FUD
>>
>>108272534
>>108272553
>it's real
https://huggingface.co/unsloth/Devstral-2-123B-Instruct-2512-GGUF/blob/main/Q8_0/Devstral-2-123B-Instruct-2512-Q8_0-00001-of-00003.gguf
fuck are those niggas doing!!!
>>
>>108272600
>>108272534
>>
>>108272534
>It's even worse than I imagined.
>>
>>108272618
>>
>>108272618
>unsloth template fixes
t-thanks for "fixing" the templates, unslop...
>>
>>108272601
>>108268776
>>
>>108272618
>Stop llama_cpp from erroring out
Why is it erroring out, though?
What are these "fixes" for anyway?
>>
>>108272663
To show yesterday's date in system prompt >>108272558.

>Why is it erroring out, though?
Because of template's raise_exception() right below it, genius.
>>
File: file.png (7 KB)
>>108272663
presumably
>>
you can never say it enough: daniel is a lower life form, whose value isn't even equal to a bug.
>>
File: file.png (728.4 KB)
>>
It's fucking free stop nitpicking about things.
>>
File: file.png (106.5 KB)
>>108272720
>>
>>108272720
he reminds me of jeets who make PR on open projects to fix typos and then go scream to future employers look at all the contributions I've made
>>
>>108272618
It's insane because they can just do
>{%- set today = strftime_now("%Y-%m-%d") %}
like in the mistral small 3.2 template.
>>
>>108272728
if you are a useless eater looking for cuddles you are in the wrong place
>>
>>108272606
what the actual fuck
>>
>>108272736
They want yesterday's date.
>>
File: file.png (543.5 KB)
>>
File: file.png (68.8 KB)
>>108272728
>It's fucking free stop nitpicking about things.
>>
>>108272777
>the future of everything
Imagine the jinja file for that.
>>
Reminder that there are anons ITT currently running Unslop models.
>>
>look ma, I'm obsessed and spamming
>>
>>108272612
lol
lmao

>>108272618
That's depressing.
>>
maybe fix ur shit instead of damage controlling
>>
>>108272785
Unsloth brothers wish they were the bussy hunters (kawrakow and cuadev) but instead they are a pair of successful drummers.
>>
>>108272802
kek
>>
>>108272618
damn, hope there's no internet archaeologist in 2033 fucking with obsolete models
>>
>>108272759
yeah but why tho lmao. why do you need yesterday in a chat template!
>>
>>108272832
Why, to tell the model to be attentive to yesterday's date in the system prompt, of course!
>>
>unslot
>>
how likely do you guys think it is that by 2035, a $1k or $2k pc would be able to locally run a model or swarm of agents that are more capable than today's opus or codex
>>
File: nimetön.png (12.3 KB)
>>108271611
>>108272320
Well, I tested it
Nemo 12b q4km with 12k context fits fully in vram on a 1080ti and writes 35 tokens/sec

You know, I miss these times. It just instantly writes, there are no delays for processing or thinking. It just werks
>>
>>108272856
zero, banned for unsafe, you will use the rented thin client
>>
That's why I said stop using Unsloth
>>
>>108272867
for a 10 series card. that's surprisingly fast. these are the kind of speeds I get on my 3090 on 24-27B models.
>>
>>108272856
My guess would be "yes" but I would also say that attempting to predict tech 10 years out is a fools errand
>>
>>108272870
Hopeful for this future~
>>
>>108268616
What are best practices to create a CPT (continued pretraining) dataset? I have a lot of short documents, key-value pairs, logs, etc. along with metadata.

Should I format the whole thing as small markdown stubs with the main information and preceded by the metadata? Should I mechanically reformat it as prose/normal text? Should I send the whole thing to an LLM to rephrase the information as a short paragraph that might flow more naturally?
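e.g. the metadata-then-stub option would look something like this (hypothetical record shape, just to illustrate the layout):

```python
# Hypothetical records: a metadata dict plus key-value content.
records = [
    {"meta": {"source": "log", "date": "2026-01-03"},
     "body": {"event": "login", "user": "alice"}},
]

def to_stub(rec: dict) -> str:
    # Front-matter-style metadata header, then the content as short lines.
    header = "\n".join(f"{k}: {v}" for k, v in rec["meta"].items())
    body = "\n".join(f"- {k}: {v}" for k, v in rec["body"].items())
    return f"---\n{header}\n---\n{body}\n"

print(to_stub(records[0]))
```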
>>
>>108272917
yes
>>
>>108272920
Basically, I got better results sending everything in without looking at the data than I had after spending a lot of time making it prettier, which is why I'm wondering. Asking Gemini for advice on how to format this, it puts in more markdown than there are content words.
>>
>>108272917
If you want metadata in the results, put metadata in the training data.
If not, don't.
If you want your output to be markdown formatted, format the training data in markdown.
If not, don't.
How good are you at recognizing patterns?
>>
>>108271243
If you have a monitor plugged into it then it will draw more idle power, and if that monitor is above 60Hz it will draw even more. If you force sleep the monitor with DPMS the GPU power use will go down; that's my experience with it at least under Linux.
It's also possible your card's vbios is just running higher minimum clocks but I don't know if that's common; my 3060s are from different vendors and both idle at 210MHz core / 405MHz mem according to nvtop.
>>
>>108272948
Ok if it's common sense.
>>
Is the new Qween 397B better for non-coom than GLM-4.7, as the benchmarks suggest? I don't wanna keep both on my drive
>>
What can I use to generate tags for images?
Preferably something small and fast.
I would like it to tag body parts too.
>>
>>108273092
JoyCaption is still the only option, far as I know. Otherwise, take a regular VLM, like MedGemma or an abliterated Qwen 3.5, tell it to output CSV tags and hope for the best.
>>
>>108273092
qwen35
>>
File: cpt.png (9 KB)
>>108272917
https://unsloth.ai/docs/get-started/unsloth-notebooks
unsloth have a great tutorial for you
>>
what's the best model I can run locally on my s22?
>>
>>108273092
https://huggingface.co/Minthy/ToriiGate-v0.4-7B
>>
>>108273161
There are quants for arm. Idk if it's the same for exynos tho.
>>
I gave https://huggingface.co/Sabomako/Qwen3.5-397B-A17B-heretic-GGUF a try.

I honestly can't tell if the model is retarded because of the brain damage it got from uncensoring or it is the natural qwen3.5 brain damage. At any rate this shit does absolutely nothing to make models better at ERP when you have a prefill. And I can't imagine using models without a prefill now for anything more complicated than vanilla sex.
>>
>gemini told me I wouldn't be able to use qwen 3.5 27b (q5) on 24gb vram with 32k context without 8bit kv cache
>try it anyway (layers set to 99 in kobold)
>it works
wtf?
>>
>deepseek v4 next week
>avocado next week
>Gemma 4 next week
Is local back?
>>
>avocado
>Gemma
>>
>>108273222
https://youtu.be/kIBdpFJyFkc?t=128
>>
>>108273222
>>deepseek v4 next week
ggufs forever 2mw away
>>avocado next week
not local
>>Gemma 4 next week
gemma-4-250m

yeah... no
>>
>>108273222
>deepseek v4 next week
Are your sources your crack pipe, is this just a new "two more weeks" variant or did they actually say something?
>>
>>108273246
Two more people very familiar with the matter said next week for sure.
>>
I need that dense 120B gemma
>>
>>108273092
>JoyCaption
Thank you

>>108273165
Thank you too
>>
>>108273246
It's from Reuters and they never lie
>>
>>108273222
what do you mean back? i thought qwen 3.5 was good? sounds like everyone was just coping and it's just another slopped chinkshit release >>108268860
deepseek will be the same, more sterile scraped benchmaxxed GPTslop that performs worse than API in every real-world scenario, including uncensored roleplay.
>>
>>108273275
>>108268776
>>
>>108273275
DS at least has some track record of improvements. Qwen is forever utility model.
>>
>loading model from an actual spinning hard drive
shit's a bit slow
>>
Can you grok a specific person with an LLM? Are they already doing it?
>>
>>108273301
retard
>>
qwen 3.5 35b moe passes the cup test with flying colors holy shit
>>
migrate
>>108273339
>>108273339
>>108273339
>>108273339
>>108273339
>>
>>108272331
Are you sure that you're running 35B-A3B and not the 27B?
Another thing to consider is: how much RAM do you have? If you only have 16 GB of RAM and aren't using something like q4 or q5 then you might be running out of RAM and it starts loading some of the model from disk.

What are you using to run the model? Consider trying koboldcpp since that's what I use. See if that fixes it.
>>
>>108272539
Doesn't that fuck caching?
>>
>>108273354
the dev of kobold is an antisemite no way im using his shit
>>
>>108273366
Proof?
>>
>>108273366
lol
>>
What's the best way to continue a long chat? I still have like 20k context left but I'm really enjoying this RP and want to make it last a while.
>>
>>108273465
>best
Increasing context. You could ask your model to summarize the log, but you'll probably lose whatever makes it good.
>>
>>108273128
Thanks, already followed it and tried with this notebook twice: once dumping all of my data in and training, and once trying to correct and reformat everything as pretty markdown stubs. Dumping everything in without even looking at the data gave proportionally better results, but I think I messed up the second run with how I used warmup-stable-decay so I don't know if my results are to be believed, and I was spending time debating what the next run should be since each of them takes time. I guess I'll just try to use common sense like the other anon said, thanks.
>>
>>108272276
Cleaning data makes model sloppier
>>
>>108270249
Yeah Granite is underrated, I used Small and Tiny a lot and they worked well. They are MoE. Tiny might run on that card. It's 7B total, 1B active. The Granite models do FIM as well, which is nice.
>>
>>108271037
Yeah, with Qwen 35B A3B IQ4 I get 6 tokens per second on a Ryzen 7, no GPU, 32 GB RAM.
>>
>>108273365
if users are constantly using it around 23:55-00:05 yeah
