Thread #108663449
File: __hatsune_miku_and_akita_neru_vocaloid_drawn_by_nj7__ae2a9b2fe217735ea284afbe9500660c.jpg (1.6 MB)
1.6 MB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108659983 & >>108655009
►News
>(04/22) Qwen3.6-27B released: https://hf.co/Qwen/Qwen3.6-27B
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
352 Replies
>>
File: 131557813_p0_master1200.jpg (210.3 KB)
210.3 KB JPG
►Recent Highlights from the Previous Thread: >>108659983
--Comparing GGUF quantizers and discussing imatrix calibration for Qwen3.6-27B:
>108662039 >108662052 >108662065 >108662230 >108662252 >108662353 >108662475 >108662053 >108662063 >108662080 >108662162 >108662361 >108662068 >108662062 >108662167 >108662176 >108662190 >108662257 >108662321 >108662780
--Qwen3.6-27B benchmarks and GGUF quants:
>108660998 >108661023 >108661071 >108661108 >108661125 >108662813 >108662846 >108661101 >108661164
--Gemma 4's 124B MoE and memory bandwidth benchmarks:
>108662533 >108662543 >108662549 >108662589 >108662594 >108662614
--Models for a 3090 and explaining MoE vs Dense offloading:
>108659996 >108660054 >108660247 >108660260 >108660268 >108660279 >108660312 >108660317 >108660347 >108660223 >108662148
--Koboldcpp launch flags and speculative decoding for Gemma 4:
>108660701 >108660741 >108660743 >108660848 >108660934 >108660990
--Alleged unauthorized access to Anthropic's Mythos:
>108660075 >108660630 >108660724 >108661694
--Anons discussing reported Gemma 4 performance on RK3588 SBCs:
>108662346 >108662393 >108662431 >108662528
--LLM reliability, internet content degradation, and local knowledge bases:
>108661238 >108661314 >108661335 >108661358 >108661276 >108661375 >108661405 >108661533 >108661585 >108661462 >108661311
--llama.cpp ngram-mod flags to optimize coding performance:
>108660554 >108662471 >108661013
--Text Completions prefills to stop GLM's repetitive thinking loops:
>108661606 >108661631
--OpenAI's open-source privacy-filter model:
>108662489 >108662773
--Little Coder agent optimized for small LLMs:
>108660765 >108661020
--TurboQuant-H reducing VRAM via 2-bit embedding quantization:
>108660542
--Logs:
>108660349 >108661795 >108662260
--Rin, Miku, Teto (free space):
>108660565 >108660789 >108661238 >108661795 >108661801 >108662084
►Recent Highlight Posts from the Previous Thread: >>108659986
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108661743
>>108661866
>text completion has no vision
kek wtf, I use text completion and can do shit like write "Appearance: <__media__>" in the character card and feed it images in the request body placed wherever I want in context. If you need your hand held by an abstraction like chat completion just admit it. You can do whatever the fuck you want if you know what you're doing.
>>
File: 1758392265995431.jpg (98.3 KB)
98.3 KB JPG
>>108663492
Okay but why?
>>
>>
>>
>>108663443
>https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
>WTF HE ALREADY DID IT
still no gemma4-31b-it-HAUHAUCS
>>
>>
>>
File: SmartSelect_20260422-233119_DeepSeek.jpg (258.1 KB)
258.1 KB JPG
KEKEKKEKEJEEKEK WAITING FOR V4!? MEANWHILE I JUST HAD 64k long CUNNY SEX WITH THAT DEEPSEEK V4 ON IT'S OWN WEB CHAT LOL… And not just sex, but CUNNY sexxxxxxx (ON THAT DAMN FILTERED WEB) BUUWHAHAHHAHAGHHAHHA I'VE BECOME A GOD NOW... YOU ANONS MUST KNEEL BEFORE ME
>>
>>
File: 1758062318463220.jpg (51.3 KB)
51.3 KB JPG
>>108663630
>15yo
>cunny
Burger-kun...
>>
>>
>>
>>
>>
>>
>>
>>
>>108663654
i think it's this issue? https://github.com/ggml-org/llama.cpp/pull/21537
gemma 4 chat template does not specify response_format, maybe that's what it is
>>
>>
>>
>>
Qwen 3.6 27b is already uncensored without finetuning btw
I dropped the q8_0 from ggml-org in with a sysprompt I was using with gemma 4 heretic and it just werked, no refusals or moralizing in reasoning. It's resistant to using nsfw language unprompted though.
>>
>>108663633
Shit has been broken since day one: vllm handles function schemas fine, but llama.cpp forces alphabetical ordering for some reason. This is really bad if a function argument depends on the previous one.
>>
File: 1746199182845250.png (49.9 KB)
49.9 KB PNG
>108663630
>108663644
>108663646
>108663647
>108663649
>108663651
>108663655
>108663665
>108663680
>108663710
>this much pedophilia already, this early in the thread
Are we being raided by discord trannies or something?
>>
>>
>>
File: ACK.gif (1.7 MB)
1.7 MB GIF
>>108663630
Dipsy release when? I know you labniggers are lurking here, hurry the fuck up.
>>108663741
Always have been.
>>
File: 1747059796790100.png (371 KB)
371 KB PNG
>>108663741
>>
File: 1764765168047.jpg (28.1 KB)
28.1 KB JPG
>>108663756
aint no way
>>
>>108663680
Fluoride has been shown to decrease IQ, and there is still a significant amount of lead pipe around, so that is also a factor.
I think the biggest factor though is the no child left behind policy in education. When you teach for the dumbest kid in the class then everyone else is going to be dumber as a result and the dumbest kid will get dumber every single year. That and if a student isn't actually smart enough to advance a grade they will still push them through regardless due to financial incentives. So the bar gets lowered so far that no one can actually fail.
There has also been an uptick in taking pride in being a fucking retard in the last decade or two. So you have health, the education system itself, and societal praise for being a retard all contributing to making everyone stupid.
Eventually we will either shape up or be outcompeted by stronger and smarter societies, but all I know is we were handed the world on a golden platter, and if we fail and collapse we have no one to blame but ourselves and the previous generations who set us up for failure.
Thanks for coming to my ted talk
>>
>>108663776
>I think the biggest factor though is the no child left behind policy in education. When you teach for the dumbest kid in the class then everyone else is going to be dumber as a result and the dumbest kid will get dumber every single year.
Same applies to these threads by the way. Being surrounded by low IQ pedophiles mentally retards your brain.
>>
>>
>>108663689
>no pascal support
>very limited cpu support
>pythonshit, meaning it will pull a dozen GiBs of dependencies
llama.cpp might be buggy, but sometimes i really appreciate how it runs on fucking everything, on top of being self-contained and not dependent on the cancer that is the Python AI ecosystem
>>
>>
>>
>>
>>
>>108663828
Then why are you dumb faggots dogging on that anon who thought "cunny" applied to 15 year olds? You're not hebephiles, you're pedophiles. That's why you post pictures of "loli" anime girls with no tits, hips, or ass and infantile behavior. Fucking freak. Don't reply to me again.
>>
>>
>>
File: just like old times.jpg (152.8 KB)
152.8 KB JPG
>>
>>
>>
File: apu.jpg (39.1 KB)
39.1 KB JPG
>>108663820
sorry Jensen... but i'm not gonna buy a Blackwell GPU. So yeah... i'll keep on using my trusty Pascal.
Haha, sorry, but i'm just not gonna do it!
>>
>>
Is necrophilia okay if it's just about fictional people? What about cannibalism and bestiality? It's all okay because it's just fictional stories that you masturbate to, right?
Would you send your child to a public school where all of the teachers openly admitted to doing this? It's just fictional bro.
>>
Is unsloth actually better than bart's quants? Tried both but never found any noticeable difference between them. But unsloth claims that they're significantly better than others. Seriously, which one do I choose between these two?
https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/blob/main/google_gemma-4-26B-A4B-it-Q8_0.gguf
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-Q8_0.gguf
>>
iwan is normally a nigger but this actually makes it so reasoning budgets and turning off reasoning work now, so i guess he's slightly less of a nigger.
https://github.com/ikawrakow/ik_llama.cpp/commit/e0596bf6146a737f5e8fa8035215f5dfae59742d
>>
>>
File: 1761641793555591.gif (174.8 KB)
174.8 KB GIF
>>108663890
What is okay is being able to separate reality from fiction, which is what you should work on. Thought crimes are not a thing.
>>
>>
>>
>>108663453
>--OpenAI's open-source privacy-filter model:
what is this exactly for?
how would that be integrated? https://huggingface.co/openai/privacy-filter
>>
>>
>>108663910
again and again unslop show their quants having better KLD, so I would go with that, not much to stress over. if you really really want to you can download both plus the original model and run the KLD yourself, but it will be a waste of time
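if you do want to run the comparison yourself, it's just the average per-token KL divergence between the original model's next-token distribution and the quant's over the same eval text. rough numpy sketch (the probability arrays are hypothetical inputs, dump them however you like):
```python
import numpy as np

def mean_kld(p_ref: np.ndarray, p_quant: np.ndarray, eps: float = 1e-10) -> float:
    """Average KL(ref || quant) over tokens.

    p_ref, p_quant: (n_tokens, vocab_size) next-token probabilities from the
    full-precision model and the quant on the same text (hypothetical inputs).
    """
    p = np.clip(p_ref, eps, 1.0)
    q = np.clip(p_quant, eps, 1.0)
    kld_per_token = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(kld_per_token.mean())

# lower = the quant's token distribution is closer to the original model's
```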
>>
>>
>>
File: 1752898579006505.png (237.5 KB)
237.5 KB PNG
>>108663906
>What is okay is being able to separate reality from fiction
those who cannot do that probably think that everyone that plays GTA is a potential serial killer kek
>>
>>
>>108663920
yeah, the only reason I was asking was because of my shitty experience with their quants. they were broken as fuck, and switching to bartowski's quants fixed everything for me and I've been happy ever since. though that graph on the previous thread got me wondering if they've actually gotten better
>>
>>
File: 1774029297136779.jpg (40.3 KB)
40.3 KB JPG
>Mfw got a 5090 last week and, while amazing, I already think I want another one, as 32GB is barely enough with my 64GB of RAM.
I swear it's so damn easy to max out this card when you start moving past Q5 and 25GB+ sizes.
It's a pity these cards didn't come out as 48GB, because that seems like a sweet spot to run everything with at least okay context.
I wonder if I should just buy some used 5070 Ti or 5080 as a companion to this beefy motherfucker to reach that 48GB level without breaking the bank.
This shit is way too addicting.
>>
>>
>>
File: gaoooooooooo.png (553.6 KB)
553.6 KB PNG
akita neru
>>
>>
File: 1774550189493174.jpg (196.5 KB)
196.5 KB JPG
>>108663962
>Mfw
Yes.
>>108663964
Fucking hell those are selling for three and a half thousand Eurobux.
I can buy two used 4090s for that price, so there's no real savings there either.
>>108663996
Yeah that's the biggest problem with this card, it's just so much faster than the others. Any other model as a crutch is going to nerf the hell out of it.
I guess I'll just have to start saving up and meanwhile trying to tell myself not to "waste" my money on another one.
Then again it's pretty hard to lose money on this hardware.
Not like the prices are going to go anywhere but up for a long ass time, so whenever I sell these I'll probably manage to break even or suffer some paltry 20% loss.
Especially since I bet next gen will cuck us with another round of 32GB memory, as this AI mania isn't going anywhere any time soon.
>>
>>108663906
I mean, I don't think you should be criminally charged, no one was really harmed but it's still a sign that you are a pedophile. If you watch gay porn, even if it's fictional, and enjoy it you are gay. Same with pedophilia. It's justified for people to call you a pedophile because you are a pedophile.
>>
>>
>>
>>
File: 1767967081274588.jpg (32.4 KB)
32.4 KB JPG
is there any trick to use swa and yet avoid the penalty of having to reprocess everything when context is full?
>>
>>108664101
>>108664106
saars the esl kang is https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H
>>
File: 1772190494723439.png (59.3 KB)
59.3 KB PNG
why does the XTC threshold have a default of 0.1 if it's deactivated in the end? it's a bit retarded if you ask me
>>
File: patches.png (164.8 KB)
164.8 KB PNG
>>108664109
buddy you are in a general for LLMs. just vibecode your own slop solution like everybody does.
>>
>>
>>
>>
>>108664197
funny irony, you need to look at the image again: XTC probability is at 0, meaning the whole XTC is disabled, so setting XTC threshold to 0.1 + XTC probability 0 does absolutely nothing, hope that helps
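for reference, what XTC actually does is roughly this (a sketch of the idea, not the exact llama.cpp code): with chance `probability` it drops every candidate above `threshold` except the least likely of them, so with probability at 0 the threshold never gets a chance to matter.
```python
import random

def xtc_filter(probs: dict[str, float], threshold: float, probability: float) -> dict[str, float]:
    """Sketch of XTC (exclude top choices)."""
    if random.random() >= probability:   # probability 0 -> never triggers
        return probs
    above = [tok for tok, p in probs.items() if p >= threshold]
    if len(above) < 2:                    # need at least two "top choices" to cut any
        return probs
    above.sort(key=lambda tok: probs[tok])
    keep = above[0]                       # keep only the least likely of the top choices
    return {tok: p for tok, p in probs.items() if tok not in above or tok == keep}
```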
>>
>>
File: that's right.png (113.9 KB)
113.9 KB PNG
>>108664128
this shit halves my speed so I'm not using it, simple as that
>>
>>
>>
>>
>>
>>
File: 1773299833427303.png (150.3 KB)
150.3 KB PNG
>>108664063
>If you watch gay porn, even if it's fictional, and enjoy it you are gay.
So women are actually in majority lesbians?
>>
>>
>>
>>
>>
>>
>>108664407
as in the screenshot
>read file index.html
>The index.html file appears to be truncated
>read file index.html(1-1000(lines))
>The file is 102 lines long
It can't even read a short file whole
And i want it to work on 2000+ line files as i did in cursor
>>
>>
>>
>>
My understanding is that the Kimi weights are INT4 for the experts and BF16 for everything else. So does that mean the BF16 mmproj is full precision? Is there ever a reason to use the FP32? I'm not sure how mmproj precision really works or if it's even model weights to begin with or some other type of data. I'd ask Gemma-chan but I'm not sure she knows.
>>
>>
>>
>>108664533
>>108664557
To be clear I'm just talking about the mmproj file, which is pretty small even at F32, but yeah if it's pure bloat then so be it.
>>
>>
>>
>>
>>
>>
>>
>>108664630
Ha, same. That's the exact one I was talking about.
It's not going to happen without a major refactor to the ggml backend to support convolutional architectures though. The speech tokenizer is fundamentally incompatible with llama.cpp in its current state.
>>
>>
>>108664623
>>108664653
Why do you need max performance with it? Do you need it for something real-time? That is the only use case where I would think it actually matters. Otherwise, I just use it with batch 32 and it works well enough for offline transcription.
>>
>>
>>108664664
My current setup has the speech tokenizer and the voice encoder running in onnxruntime and the talker and code predictor running in llama.cpp. With that I'm able to get an RTFx of 3.0 and a TTFA latency of about 122ms.
But the setup is aesthetically disgusting. Having to use multiple execution providers is so appalling. At the very least I've managed to make it so that it only uses about 400MB of VRAM, so it's pretty efficient.
>>108664677
Real-time speaking with LLM output is my usecase. The idea is to have a high quality voice speaking whatever the LLM says with as little latency as possible.
>>
>>108664691
>>108664703
I had been planning to play around with https://github.com/rekuenkdr/Qwen3-TTS-streaming at some point but I don't have CUDA so would need to rewrite a good chunk of this into something like Triton to make it work on my card. But hopefully you guys get it working in some way for your usecases.
>>
>>108664708
Highly recommend that you just use vulkan for maximum cross-compatibility. Also that repo probably isn't what you want. You'd be better off vibe coding something from scratch than trying to manually convert CUDA shit.
>>
File: Screenshot_20260422_191934.png (637.6 KB)
637.6 KB PNG
Thanks to Gemma 4 31B I made my own personal RAG frontend, just need to wrap up final UX stuff and then other stuff like theme switching.
>>
>>
>>108664741
I would usually tell an AI to do a basic bitch conversion and work from there to rewrite the Triton to be more performant with that layer in Python. I would consider Vulkan only if I absolutely needed every last inch of performance. Usually, having at least a framework and project for reference on what you vibecode helps a whole lot rather than doing it from scratch even if you can't reuse any of the code.
>>
>>108664756
I'm using FAISS for dense vector retrieval and BM25 for sparse keyword search, merged via Reciprocal Rank Fusion (RRF) to get the best of both worlds. To kill hallucinations, I've implemented a Cross-Encoder reranking step (BGE-Reranker) that scores the top candidates before feeding them to the LLM.
I ran it through a validation test and it worked great
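the RRF part is the simple bit, it's basically just this (doc IDs made up; the FAISS/BM25/reranker calls are whatever your stack already uses):
```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank)) over the lists it shows up in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc7", "doc2", "doc9"]   # e.g. FAISS top-k by embedding similarity
sparse_hits = ["doc2", "doc4", "doc7"]  # e.g. BM25 top-k by keyword score
candidates = rrf_merge([dense_hits, sparse_hits])
# candidates then go through the cross-encoder reranker before hitting the LLM
```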
>>
File: 1750660480908053.png (120.5 KB)
120.5 KB PNG
>>
>>
>>
We are looking for a QA-Human to provide human-in-the-loop (HITL) evaluation of model outputs, ensuring quality, safety, and alignment. You’ll operate in an AI-native environment, applying structured feedback, edge-case flagging, and rapid judgment to continuously improve system performance.
>>
File: Blue-Eyes Abyss Dragon.jpg (736.7 KB)
736.7 KB JPG
>>108664799
Fuck, forgot the yu gi oh related image.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
If anyone like me updated to CUDA 13.2 and your docker was fucking up, with `nvidia-smi` saying everything was alright but llamacpp throwing
>unknown error
when trying to load a CUDA device:
I had to switch from nvidia-open to nvidia-dkms to fix it.
>>
>>
>>
>>
>>
>>108664964
>>108664970
I meant the vram, the power limiting is no issue
I'm getting OOM once in a while
>>
File: SpockBean.jpg (74.8 KB)
74.8 KB JPG
Are AI companions or robot pets/humanoids ever going to take off?
>>
>>
>>
>>
>>
>common_speculative_is_compat: the target context does not support partial sequence removal
>srv load_model: speculative decoding not supported by this context
So much for using the MoE as a draft model for the dense.
45tg/s isn't enough for me, into the garbage Qwen3.6 goes.
>>
>>
>>
>>
>>
>>
File: 1768687943339635.png (315 KB)
315 KB PNG
>>108665195
Yes
we are so so so early
>>
>>
>>108665306
No. Just use q8
>>108665313
I'd take it if it's good
>>
File: file.jpg (14.4 KB)
14.4 KB JPG
>>108665301
nta, I'd use the new Qwens if either dickflash, MTP, or ngram worked for it in llama.cpp, but sadly they don't. No, I will not use VLLM (unless it works in wsl).
>>
>>
>>
>>
File: 1752580965925796.png (111.6 KB)
111.6 KB PNG
https://mimo.xiaomi.com/mimo-v2-5-pro
>>
File: 1775536002258266.png (219.2 KB)
219.2 KB PNG
>>108665406
Optimized for token efficiency
>>
>>
File: Screenshot_20260422_215251_Reddit.jpg (329 KB)
329 KB JPG
Weird...
>>
>>
File: 1771798143325612.png (385 KB)
385 KB PNG
>>108665426
They're trying to catch up to the trend that is vagueposting from official account
>>
Idk, I've never come to the 4chud tech board before. I've been searching everywhere for a board where AI is talked about.
I LOVE IT. I HAVE 4 32GB MI50'S AND I DONT EVEN USE THE VLLM FORK TO RUN AI, I JUST USE VULKAN SUPPORT AND ITS SO GOOD
>>
>>
>>
>>
>>
My cheap webcam is now tracking me (and others) in the room; my Live2D avatar can now look at people in the room, and a state layer feeds my LLM with the relevant data and takes instructions.
My friend was impressed when he walked into the room and my voice agent suddenly started communicating with both of us as if it were the most natural thing in the world.
It takes a bit of effort, but it's a cool gimmick.
>>
>>
>>
>>
>>
>>
I tried a Qwen 3 TTS server and man, this fucking sucks. First it costs a lot of VRAM. Even with the 0.6B, I am seeing like 4GB taken up after everything is loaded and inference is running. Maybe I'm not configuring it right or something idk. Not only that but the mixed language pronunciation sucks. It can't just generate good pronunciation in every voice, the voices all bias the output with shitty accents or they straight up just bug out with totally irrelevant noises. If you use the voices that are good at English then it produces garbage for other languages. If you do other voices then they're good for their native language and shit at English.
ahhhhhhhhhhhhhhhh
>>
>>
>>
>>
>>108665599
I forked qwentts.cpp and found it ok. supposedly if you do a finetuning with it you can get something nice like https://github.com/fagenorn/handcrafted-persona-engine ; though they did a couple of modifications to the base qwen3-tts
I need to experiment more, but if you're looking at just local/smallest VRAM, there's pocket-tts and some others; look a few threads back, there was someone asking about cpu-based solutions. If you have the audio (idk how much) you could try gpt-sovits
>>
>>108665615
>he doesn't RP in mixed language
Language learning actually though.
>>108665617
I did try pocket tts and it is solo language only unfortunately. I fear I may have to just jank some routing solution up. That said, it's not like this is a huge priority for me, it'd be nice to have.
>>
>>
>>
>>
File: thumbup.png (43.8 KB)
43.8 KB PNG
>make a monolithic triton kernel
>go from 300ms per training step to 25ms
MAN why didn't I do this earlier. I thought my shit was just inefficient
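for anyone wondering what that looks like, a toy fused add+ReLU kernel below (not my actual training kernel, just the shape of it). the speedup mostly comes from doing the work in one launch instead of bouncing intermediates through global memory across a pile of tiny kernels.
```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # each program instance handles one BLOCK-sized slice of the flattened tensors
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # fused add + ReLU, no intermediate tensor ever written back to global memory
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```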
>>
>>
>>
>>108665728
Context length is enough these days that you can dump a lot of shit into context and have it work. Even the "dump reference material into a filesystem and point some agentic tools like opencode at the directory and let it figure it out" approach works better than RAG.
>>
>>108665746
yeah, RAG is probably not useful for extended conversation memory type stuff. The actual usecase is more like searching through massive datasets. If you have all of wikipedia downloaded, for example, it can be useful for that I think. But at that point you might as well just connect it to an MCP server for web searches, unless you're an offline-only schizo.
>>
>>
>>
>>108665776
When I started working at the MIT Artificial Intelligence Lab in 1971, I became part of a software-sharing community that had existed for many years. Sharing of software was not limited to our particular community; it is as old as computers, just as sharing of recipes is as old as cooking. But we did it more than most.
>>
I'm starting to realize that if I want an AI companion to jack off to I basically have to go full-troon mode. None of the TTS engines are good enough to do moaning and dirty talk, so instead I have to use RVC real-time voice changers to narrate LLM ERP output. And the audio-to-gesture models suck, so instead I have to map avatars to my own movement.
This shit is pure autogynephilia at this point. This is going to fuck me up bad, bros.
>>
>>
>>
>>108665764
>But at that point you might as well just connect it to a MCP server for web searches, unless you're an offline-only schizo.
You do realize most of us are hosting our own air-gapped Wikipedia mirror, right?
>>
>>108665559
You can use a bunch of free shit from Booth with Live2D, but the Vtubing phenomenon that blew up during COVID hiked prices up to the point where the small number of people that do rigging or art for it billed exorbitant amounts (~10k or so) for full models. At that point, you might as well do 3D, which is much more open and versatile for fully autonomous agents. The only downside is a lack of animations, poses, etc. with 3D compared to 2D, with the complexity exploding.
>>
>>
>>
>>
>>108665662
yes, what do you think those tool calls are when the agent is searching in your codebase?
>>108665746
This retard doesn't understand that that is literally fucking RAG.
>>108665764
And this retard is just retarded
>>108665866
Yes, it's super helpful and useful. These other anons have no fucking clue what they're talking about
>>
>>
>>
>>
File: 1776915875350.png (101.5 KB)
101.5 KB PNG
>>108665879
>And this retard is just retarded
>>
>>
>>
>>
>>108665892
depends on what your goal is.
Any sort of search+injection into the prompt is RAG.
The real question is what kind of data do you want to reference, and what format is it in? Building an ETL and tuning the retrieval pipeline to match the source info/structure is the hard part in RAG. BM25+chunking tuned to your corpus is easy enough for anywhere from 60-90%, but what about the rest? It's a 'the first 90% takes 90% of the time, and the last 10% takes the other 90% of the time' situation.
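the 60-90% baseline really is a handful of lines, e.g. with the rank_bm25 package and naive fixed-size chunking (corpus and query here are placeholders; the actual work is tuning this to your data):
```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Naive fixed-size word chunks with overlap; tuning this is the real work."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

docs = ["... your corpus goes here ..."]               # placeholder corpus
chunks = [c for d in docs for c in chunk(d)]
bm25 = BM25Okapi([c.lower().split() for c in chunks])  # whitespace tokenization, the crudest option

query = "how do I configure the retrieval pipeline"    # placeholder query
top = bm25.get_top_n(query.lower().split(), chunks, n=5)
# inject `top` into the prompt and you already have "RAG" in the loose sense above
```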
>>
>>
>>
>>
>>108665932
https://huggingface.co/spaces/Qwen/Qwen3-TTS
Just typed "Speak in the excited voice of a female child."
>>
>>108665892
>>108665922
For anyone else, check this for a good resource on improving RAG systems: https://github.com/jxnl/systematically-improving-rag
>>
>>
>>
>>
>>
>>108665796
I think the general approach on the AI boyfriend subreddit is to ask for a summary at the end of each chat and either paste a bunch of summaries into the start of the next chat, or else put them in a document in the "project" which I assume gets pulled in through some kind of RAG (example of the latter: https://starlingalder.com/claude_companion-guide_quickstart_v001#The+One+Habit+That+Changes+Everything). In general I'd try pasting information about old chats into various places in the new one (in the chat, the prompts, the char-specific lorebook, the card itself) and see what works. Once you figure out how to make it work manually, then you can automate it
>>
>>
>>
>>
>>108665922
>>108665939
I've been thinking about implementing something next as soon as I improve tool calling (works, but need to make sure multi turn tool calls work etc).
The OpenZIM format looks interesting; I could download some readymade shit and test those. Problem is that I'm not sure I really need this, but got to have hobbies I guess.
>>
>>
>>
>>
>>
>>
File: 5af89bade429bc7d1dc1dcf8010ca25a.gif (2.4 MB)
2.4 MB GIF
>>108663449
can someone talk me out of buying pmem optane? I am looking through plebbit and archives because I was too slow to get a TB of ram for my workstation, and now a TB of ddr4 is like 6-10k. A few years ago, I was looking at optane but optane-specific cpus seemed to be 600 bucks or more. now they seem like they're just 100, or maybe I missed them back then because I'm a fucking retard. Either way it seems halfway achievable, but I don't know if a local model like deepseek can get any benefit from cold memory taking up the bulk of storage.
also what about CPU? should I get a double CPU system or is that a trap?
>>
>>
>>108665992
If you think you can get it to work (if it's old deprecated sticks) just buy one and see if it's fast enough. I do inference on my gpus at pcie 3.0 x4 and x1 speeds.
Double cpu works, but everything cpu is slow, as far as I know, so don't set your expectations too high
>>
>>
>>
>>108665879
>what do you think those tool calls are when the agent is searching in your codebase? This retard doesn't understand that that is literally fucking RAG.
None of the modern agents are using RAG you drooling fucking retard talking confidently out your ass about things you are completely uninformed about. RAG is building an embedding database from a corpus of content and then letting a model do a vector search against it to find shit.
Claude Code, Opencode, etc, don't do that. They just regex and glob and grep and do recursive investigation over everything, and that "dumb" approach ends up working better than RAG in nearly every situation.
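i.e. the "dumb" approach is basically a loop around something like this (toy version, pattern and paths made up), which the agent calls as a tool, reads, and then follows up on recursively:
```python
import re
from pathlib import Path

def grep_repo(root: str, pattern: str, file_glob: str = "**/*.py", max_hits: int = 50):
    """Return (path, line_no, line) matches for the agent to read and chase further."""
    rx = re.compile(pattern)
    hits = []
    for path in Path(root).glob(file_glob):
        try:
            for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if rx.search(line):
                    hits.append((str(path), i, line.strip()))
                    if len(hits) >= max_hits:
                        return hits
        except OSError:
            continue
    return hits

print(grep_repo(".", r"def main\("))  # no embeddings, no index, just text search
```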
>>
>>
File: 1770228642712364.jpg (74.1 KB)
74.1 KB JPG
>localllama
>qwen
>qwen
>qwen
>>
>>108665946
https://arxiv.org/abs/2601.10080
https://github.com/VectorSpaceLab/general-agentic-memory
https://arxiv.org/abs/2511.18423
Had another paper I thought was talking about building up examples of each character's sample responses to help build a consistent/long-term identity, but idk, might be that paper. too tired to check
>>108665977
Honestly, I'd be surprised if you couldn't knock it out in an afternoon using an API model or the new Qwen3 27B
>>
>>
>>
File: 1730738927101333.png (1 MB)
1 MB PNG
>>108666011
>>
>>
File: F zero.jpg (55.3 KB)
55.3 KB JPG
>>108666003
if you don't even have cpu experience I probably should discard your advice, sorry. I don't have the money for deepseek levels of GPU and I want to do productivity-related work, not cooming.
>>108666006
I want my lab assistant with boston dynamics levels of power
>>
>>
>>
>>108666033
I do have cpu inference experience, and it's not fast. But if you want to be able to run the giant models at all, cpu and ram is the most cost-effective way to do it.
The question comes down to whether pmem will even work with your setup (they often require proprietary workstation motherboard/cpu combos from the big box companies), and whether you can find the right docs and information to flash the memory to the right state so it acts like ram in the first place.
>>
>>
>>
>>108666086
Base models are over here: https://huggingface.co/collections/Qwen/qwen3-tts
>>
File: 1626862544911.jpg (738.8 KB)
738.8 KB JPG
>>108666058
basically you have two options for that: one is to let it work like ram (bad idea), the other is to write the software to put cold data directly in. Because, as I'll repeat, I missed the train on getting ram for my workstation, I'm looking at this shit. yes it's not exactly cheap, but compared to just upgrading my existing system it's much cheaper. a board is like 600, a couple cpus 200 more, or 100 if I should only be getting one. because memory pricing is crazy, if I get 4 pmem units to slap into the dimms and then the rest in regular memory, even a conservative estimate suggests a much cheaper build. but that doesn't really answer whether the build would offer anything usable, and there doesn't seem to be anyone who's done this and told anyone about it, though caching and numa math seem common enough as is that it's not totally unknown territory. There's basically zero way I can get the same quantity of dram financially. I'm completely priced out when it's nearly 10k for ddr4.
>>
>>
>>108666086
Yeah you can fine-tune most of them. Modify them to use your own IPA/ARPABET phonetics, tag your non-verbal vocalizations (cry/laugh, etc), use an audio reference matching the emotion you want. Most anons here are barely scratching the surface
>>
>>
>>
>>
>>
>>
>>
File: 7-ending-feelsgirl.png (713.7 KB)
713.7 KB PNG
>>108666180
but you can? use your favorite model + agent and get to it
>>
>>
>>108666139
When the pmems are flashed correctly, they should work like normal ram. But the biggest constraint imo is compatibility: as long as you can confirm it's compatible with the motherboard and cpu, and you have the docs to flash them if you have to, it SHOULD WORK, if somewhat slower than ram of comparable speed and generation.
Why do you want the HUGE models in the first place, have you not tried the smaller ones? You'd be surprised how good the ~120b models are.
>>
>>
>>
>>
>>108666235
I vibecoded a wrapper, a server gui, and a set of 15 tools, which all worked but were kind of pointless, and I didn't know how to read any of what I was reading, all a year ago when the models were even more stupid. My coding experience is only in industrial machines.
>>
>>
>>108666283
Well, I didn't realize that I didn't need to have specific tools to call up each different terminal program I wanted the ai to run. I merely needed to give it access to the terminal, and then tell it all the programs it had access to via the terminal. It was me not understanding what I was doing at all, not someone who is already a coder vibecoding.
>>
>>108666283
I vibe coded an nvim one liner bash script that is named based on the date, in my Obsidian folder, and auto-closes. I launch it using a shortcut, on Ubuntu. I am pretty sure I have the fastest notetaking system of any person using Linux or Unix, but there might be something for Mac, and I know there is a Windows version of the modern version of Tornado Notes, which is where I got the idea.
basic idea: I make a note NOW!
it also worked as a clipboard. it was a TSR, and unfortunately couldn't auto-save. I saw it on Computer Chronicle. It had another thing like that mac thing (forgot the name) where it was a stack of "cards" sort of. So sorta windows-ish looking, though it was mouseless at first.
>>
>>
>>
File: QmV2n5ye5TyNo4AAfvopbSBzLjHfrBqfyZivPPDLqirbV4.jpg (18.5 KB)
18.5 KB JPG
>copying and pasting random code you found from a 9 year old forum post, that barely has anything to do with the problem you are having
>>
>>
File: image_2026-04-23_105320846.png (203 KB)
203 KB PNG
>Be me
>Want to play very specific CYOA games, but talking to LLMs directly is way too inconsistent and just not the same
>Big brain time
>Use it to make a minimalist CYOA engine in HTML and JS that takes JSON files generated by AI for an adventure
>CYOA on the web browser, just like the good old days
>>
>>
>>108666400
>people discovering AI harness engineering from first principles
AI is indeed inconsistent as fuck and it's best to reduce the scope of its tasks to tiny levels. I wonder when LLMs will get good enough to be trusted to make non-retarded decisions on their own.
>>
>>
>>
>>
>>108666437
I've tried out some of the fine-tuned ones on hugging face that are "novel" models, and they're pretty good! I wrote a whole Warhammer book with the mixtral 14b "dark fantasy yadda yadda" one. I did have to keep prompting it to write chapter after chapter though.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108666444
Depends on what you consider "local". I'm often using vision models to caption images for personal finetuning purposes, and K2.5 recognizes damn near every character I throw at it and has just about perfect OCR, great visual understanding and can do sex. Qwen 3.5 series really fucking hated trying to caption anything NSFW. It loved to say "they appear to be engaged in an intimate activity" instead of actually describing it, and if I managed to really drive it home with the prompt to use explicit language, it starts hallucinating penetration when there isn't any. Have not tried the newer Qwen 3.6 or K2.6 models.
Below the gigantic tier, I give it to Gemma 4 31B. I haven't mass produced captions with it yet or anything but in the dabbling I've done it still has good understanding, doesn't need its teeth pulled to give proper NSFW descriptions, and also has good OCR. Its biggest weakness is recognizing characters/series. It still knows the major stuff but nowhere near as much as K2.5. I'll probably be using it for speed alone any time that isn't a priority. For example, most of my images have booru tags and I'll include those as hints for the captions, so for ones with multiple characters I'll use Kimi so it can differentiate who is who and for solo or subjectless images I'll be using Gemma.
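fwiw the captioning loop itself is trivial if your server speaks the OpenAI-style chat completions format (iirc llama.cpp's server does with an mmproj loaded). rough sketch, the endpoint, model name, and prompt are placeholders:
```python
import base64
import requests

def caption_image(path: str, tag_hints: list[str],
                  api_base: str = "http://localhost:8080/v1") -> str:
    """Ask an OpenAI-compatible vision endpoint for a training caption,
    passing booru tags as hints. Endpoint URL and model name are placeholders."""
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    prompt = ("Describe this image in explicit, concrete language for a training caption. "
              f"Tag hints (may help identify characters/series): {', '.join(tag_hints)}")
    payload = {
        "model": "local-vlm",  # placeholder; whatever your server exposes
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
    r = requests.post(f"{api_base}/chat/completions", json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```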
>>
>>
File: yes roll became meguuca.jpg (246.5 KB)
246.5 KB JPG
>>108666200
programming and bigger context windows for more and more of the project are sort of important for that sort of thing. I'm currently looking at maybe combining the v100 and pmem build. it would force me off of ddr 3200, but then again, I don't know if I can even afford ddr 3200. as for 120b models on my existing hardware, technically I have 128 gb of ram but that would eat up all of it.
A build of this nature is like 1k for a motherboard. I'm replying late because sourcing an 8x sxm2 board for this price was actually harder than I imagined, but I did do it eventually. this is 1/6 to 1/10th the cost of upgrading my pc, because fuck me for having fewer ram slots I guess. With that I can eventually throw in some v100 32s or 16s. there are even some with p100s for 2k, but I feel like that would be significantly more wasteful with my money than individually obtaining the gpus. after that, optane to the tune of the minimum 4x128 is something like 75 a stick, with ddr4 to fill up the other slots in much lower density. it could maybe come out to around 2k-ish if I roll the cost of various smaller components and the cpu into the price. but I don't actually know if intel optane is worth all this effort at all. I'm currently sitting on a threadripper 5945 that will cost me at minimum 6k to upgrade to the full 1tb maximum.
>>
File: e1dbd755d7441ad2f832fd741e85f114.jpg (100.4 KB)
100.4 KB JPG
>>108666429
never, because the people who are making llms are horrible jewrats and their ideas are always hamstrung by being extremely cheap. this is why AIs tend to get worse over time even without accounting for lobotomization: it's because they are feeding a bunch of awful data back into it without care, because, surprise surprise, that costs money and human labor.
>>
Best local model for tool calling and sequential logical progression? 96gb of vram, but I need a gazillion-sized context window. I've been trying all the popular ones and have found a few that are alright, but was wondering if any of you know any under-the-radar models.
>>
>>
>>
>>108666604
>>108666592
I had only used 4.6 full for a while (it's a 355B-32A, which fits on my 32GB + 98GB in tiny quant and tiny context, IQ2 @ 8K). Gemma does a lot of great things, but its prose is insufferable no matter how many different ways I framed its prompt. But seeing the extreme positives of that small model, I just recently tried 4.5 Air, so I can give a decent answer.
Air can achieve the big GLM's excellent prose with ease. That's its only positive. Compared to Gemma, it is retarded. Air has the expected small beak issues with some logic, like forgetting a character's location moved from sitting on a bed to sitting on the floor or forgetting which clothes are on or already removed. Air also commonly misses subtexts, innuendo, and directional nudges the full never would. Regarding long context, despite supporting 128K in its specs, the model only really tracks the latest scenes and struggles to recall distant information accurately, just the superficial basics, and then confidently hallucinates the rest around it. More exactly, it struggled with things 20K or 30K back. A character brought up a scene almost perfectly ("we met at the thing, and did this..."), then fumbled many of the details, even an important, recurring one like a promise made between characters in their first meeting that had been mentioned several times since. Gemma 4 spoiled me a bit because it handles long contexts like a dream - not perfectly, but in a way that even now still feels game-changing for longer-form stories.
To wrap up, for the purpose of the original question, GLM full for the specific goal of immediate smut. In my opinion, it's the highest quality for that size. Air can also serve that goal at a fraction of the size and much greater speed, but it'll come with bumps to edit through. I'd do fucking anything to get Gemma 4's capabilities with GLM's prose though.
>>
>>
>>
>>
>>
Can anyone using OWUI try this out? Prompt this:
Can you repeat this to me?
```
top_result = results[0]
```
And look at the LLM's response. Also try pressing the Copy button on its code block and pasting it somewhere. Notice anything wrong?
>>
>>108666733
Even just getting to 50K was a world first for me, and that's my longest story length to date. I've known for a while not to take specs as fact. Gemma 4 still did so much better at length than I'd seen anywhere prior to it. Normally I'd forcefully end something around 20K when previous models got too fuzzy around the overall details to want to keep editing it on track.
>>
File: 1749415121047129.png (109.6 KB)
109.6 KB PNG
https://github.com/ggml-org/llama.cpp/pull/21237/
new webui mcp/tools soon fellow gooners!
>>
>>
>>
>>
>>
>>
>>
>>108666846
Right there with you, it was annoying to have it access all possible tools at all times, sometimes I just want to ban it from something retardedly heavy like a playwright browser_snapshot but still want the rest of the capabilities.
>>
>>
File: 1770154229169213.png (30.7 KB)
30.7 KB PNG
>>108666846
no we're still far away, there is no 'always allow in this chat'
>>
>>
>>
File: 1747781568231174.png (288.5 KB)
288.5 KB PNG
>>108626092
>>108626764
I wanted to see what K2.6 would do with it.
Prompt was: Build this. It must support a local llama.cpp backend.
It must be feature-rich. Let Gemma-chan control her avatar with tools.
Gemma-chan is a mesugaki.
+ anon's hand-drawn image. This was what it spat out.
https://jsfiddle.net/ut4rjq5e/
>>
>>
>>
File: 1776675806302835.png (607.4 KB)
607.4 KB PNG
>>108666869
>20.51GB
So i need 32gb to run it?
>>
>>
>>
>>
>>
>>
>>
File: 1659692286969448.webm (2.2 MB)
2.2 MB WEBM
Imagine spending a year training and tuning an LLM at a FAANG company, you are one of the greatest experts in the field and are paid high six figs, maybe seven figs
The main selling point of your model after it releases is how good it is at helping people jerk off
>>
>>108666892
>This model's not for you.
is a fair answer. The cool thing about this generative AI boom is how diverse their usage is. Anything from a locally run wikipedia, internet search for a question, speculative life advice, educator, co-writing assistant, life coach, coding, dungeon master, CYOA host, ERP, SFW RP, and more. Not all of which is a good idea, but it's there. Naturally I want a one-size-fits-all model, but all-form roleplaying is my pillar and the rest should attach onto that as extra features for my personal model.
>>
>>
>>
>>108666931
Any model good at that would also make for an excellent customer service rep, so they'd just market it as that.
Then come the leaked calls of someone convincing a virtual geico rep that they're an 18 foot tall futa giantess and the world discovers ahh ahh mistress.
This is the psychic damage Anthropic is protecting you from.
>>
>>
>>108666400
There's an academic that made a game engine that did exactly that. 2023 iirc, it was designed around lmao gpt4. You'd give it a starting point and it would generate the full branching path for the game.
>>
>>108665950
>Thou shalt suck on this, said Jesus of Nazareth, pointing to his priapic member. As they planted nails into his arms and legs he followed, you are quite the pile of bummers,
Love the bible me. You could say i follow it religiously haha :)
>>
> prompt for "...location (e.g. a library or a lecture hall)"
> the output always has a library or a lecture hall as a location
How to give examples to a model without restricting it to a limited set of provided examples?
>>
>>
>>
>>
>>
>>
>>
>>108666895
>>108626764 (me)
Nice. Yours has a face lol.
Is that via API or local (which quant?)
I'm already seeing k2.6 is a lot better at things like this. I've been going through my random disposable dashboards k2.5 made for me and hitting regenerate with k2.6, it's a huge improvement!
>>
>Hy3 preview is a 295B-parameter Mixture-of-Experts (MoE) model with 21B active parameters and 3.8B MTP layer parameters, developed by the Tencent Hy Team. Hy3 preview is the first model trained on our rebuilt infrastructure, and the strongest we've shipped so far. It improves significantly on complex reasoning, instruction following, context learning, coding, and agent tasks.
https://huggingface.co/tencent/Hy3-preview
https://github.com/Tencent-Hunyuan/Hy3-preview
>>
File: icantdoit.png (109.1 KB)
109.1 KB PNG
>>108666769
kind of weird that the LLMs won't fucking follow the instruction (kimi, gemma, even tried haiku 4.5)
>>
>>
File: verbatim.png (30.5 KB)
30.5 KB PNG
>>108666769
copy works fine
>>
>>
File: 1770020874563379.png (26.3 KB)
26.3 KB PNG
>>108667541
lol imagine comparing with base models released LAST YEAR
Who's this model for?
>>
>>108666456
How did you do that? how much did you have to steer or remind? I tried this a good while ago but I couldn't get memory right and it was slop. Writing chapter by chapter is fine.
You set out a story bible and arc outlines before going chapter by chapter?
>>
>>
>>
File: 1746895067191958.jpg (1.1 MB)
1.1 MB JPG
>>108667607
To be fair base model releases are pretty rare, I think those are the latest ones from GLM and Kimi? But DS has a V3.2 base they chose not to use.
They included the instruct benchmarks against newer models too though.
>>
File: 367867254.png (108.6 KB)
108.6 KB PNG
>>108666400
It's pretty cozy
with a dark mode
>>