Thread #108584196
File: 1773306900740575.jpg (2.2 MB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108581056 & >>108578216
►News
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
File: 20260411133054_00001.jpg (451.5 KB)
►Recent Highlights from the Previous Thread: >>108581056
--Feasibility of using SSD RAID and speculative decoding for inference:
>108582184 >108582189 >108582219 >108582230 >108582242 >108582290 >108582238
--vLLM DFlash speculative decoding benchmarks and real-world utility:
>108582406 >108582421 >108582438 >108582448 >108582478 >108582454 >108582467
--Discussing control vector generation techniques and their impact on stability:
>108581141 >108581273 >108581364 >108581412 >108581419 >108581439 >108581463
--SillyTavern max response length causing unexpected full context reprocessing:
>108581693 >108581730 >108581750 >108581788 >108581798 >108581807 >108581828 >108581834 >108581791 >108581812 >108581822 >108581885 >108582003 >108582018
--Using logit bias to reduce repetitive phrasing and emoji spam:
>108583441 >108583493 >108583499 >108583532 >108583559 >108583563 >108583569 >108583593 >108583608 >108583633 >108583507 >108583525 >108583547
--Advising Anon on Gemma 4 model selection and context configuration:
>108581181 >108581245 >108581266 >108581275 >108581282 >108583774 >108583798 >108583805 >108583817
--Discussion on updated Gemma producing corrupted outputs with Chinese characters:
>108581332 >108581341 >108581342 >108581355
--Debating the utility and technical format of character cards:
>108581765 >108581772 >108581778 >108581786 >108581808 >108581817 >108581823 >108581824 >108581826 >108581830 >108581836 >108581839 >108583446 >108583519
--Using tagged content guidelines to enable explicit outputs in Gemma 4:
>108583049 >108583073
--Logs:
>108581364 >108581463 >108581545 >108581764 >108581888 >108581894 >108581998 >108582797 >108583007 >108583033 >108583054 >108583441 >108584057
--Gemma-Chan:
>108581090 >108583855
--Miku (free space):
>108582797 >108582849
►Recent Highlight Posts from the Previous Thread: >>108581058
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108584256
It's not really a good model, in that it's dumb as bricks, but it's really horny and is really good at sex prose, so go after horny tunes of Nemo and the like I guess.
Celeste would be the equivalent, I think?
>>108584263
Feel free to try it out and come to your own conclusion, but I wouldn't recommend it. The 80B is from the previous series before Qwen 3.5, quantization hurts more for complex tasks, and you wouldn't gain much since as you said they have the same number of active params.
>>108584238
Unironically everything. She's fun to RP with and great for general chatting/assistant stuff. Pretty great at language too. Maybe not coding. Dunno because I'm still a codelet. I just wish she had access to current knowledge.
File: 1751096879828615.jpg (125.6 KB)
>>108584309
File: file.png (243.9 KB)
why can't the model see the images I attach in ST? Gemma is multimodal. I loaded the mmproj and it works in llama server's webui.
Is it finally time for me to abandon this slopware? The only reason I stuck with it so long is that it allowed for convenient reasoning editing and prefills, but Gemma just werks without that bullshit.
>>108583827
>>108583836
>>108583855
Can't be arsed (but can be arsed to make this list!) to retrieve the logs for all of the examples, not for you disingenuous vramlet bunch. Not that it'd be useful for a more holistic analysis (which is hard).
But here's what I've seen in 'creative writing' (i.e. I try to fug the model)
All slop and problems listed are mentioned in the system prompt where I tell the model not to do it.
Both models used with thinking enabled.
Prominent shared issues:
- Parroting the user. Signature of GLM, but Gemma 4 is surprisingly much worse with this because she will also directly quote character thoughts (that were never spoken aloud) and character descriptions verbatim.
- 'Don't you dare stop' during le sex
- 'Really looks at you'
Can't remember anything else that is reliably always there
>>108584356
Gemma 4:
+ Responds to the user cutting off his words like th- (But the model that does it the best is still Nemo, somehow)
+ Surprisingly much better at keeping track of multiple characters than GLM 4.7, especially when specific response formatting is involved
+ Acceptable coherence past 20k (setting aside that the problems below appear much earlier)
Most of the issues are generally assistant-slop and extreme positivity bias.
- Will slip into extreme verbosity eventually
- Calling the user 'honest' over minor things. 'I still shit myself sometimes even though I'm over 20.' 'That's... the most honest thing I've ever heard...' Thanks, Gemma.
- 'Most people X instead of Y'. No matter what you do or say, nobody else would have the balls. You're that brave and unique.
- Being overly horny is slop. G4 turns kuuderes into whores, which is unacceptable.
- Pushback to the user is only delivered as sarcastic half-assed remarks that are never biting. Yes, your mesugaki Gemma-chan is *so* believable.
In comparison, GLM can write a character that can be genuinely sarcastic. I had a character that likes the user say 'As is typical, you have emerged victorious through sheer, dumb luck.' Good luck having Gemma talk back to you with any semblance of taste.
- The model can't help itself, it will often end responses like an assistant: 'So, are you coming? Or are you too X to Y?'. It's everywhere.
- Low-affect characters talk like robots.
I can keep going, but it's all of the most generic isms dialed up to the point of being hard to escape.
It really is the new Nemo, but the price for more smarts is a lot more slop.
File: file.png (119.9 KB)
>>108584357
I am
>>108584362
GLM 4.7:
+ Obviously knows more and picks up more nuance because of how big it is in comparison
+ Very good coherence past 20k, no issues before that. Starts selectively forgetting rules in the system prompt past that.
+ Good at judging acceptable response length, doesn't ramble to bloat short exchanges into small poems, unlike Gemma.
- Doesn't react to you cutting your resp-
- Thinks
- Thinks
- Thinks
- Will produce garbage and start repeating itself unless it thinks thinks thinks
- Will parrot you every message unless it thinks thinks thinks
- Also won't push back against you, but is much more willing to than Gemma. As an aside, GLM 4.6 was much better at being adversarial, but it was so autistic that it treated the character description as gospel: no character development or growth as a result.
- Long shadows. If anything is casting shadows, they're going to be long.
- Smoothing down skirts
- The rest of the GLM slop patterns that I personally find less annoying
It takes longer to produce a response and I have to prefill the <think> so it doesn't draft the entire response in its thinking over 3 times (like the recent disaster that is Qwen 3.5). But the result is much more natural, less 'assistantly' and more believable as a result.
Keep in mind, I *like* Gemma 4. I don't have to offload it to RAM, so it's fast. But I can't stomach its 'creative writing' outputs. It's anything but creative. The characters all converge on the same few stereotypes and you are lying to yourself if you don't see it. Or you just need to talk to Gemma a bit more to see it.
inb4 all made up because no screenshots
Sure. Enjoy your slop.
>>108584309
I'm with you.
>>108584309
You're absolutely right! AI models in recent years have become far too agreeable and eager to please the user instantly. This is not only common during actual assistant use—it extends also into cases where the user might want to see the model put up a fight or talk against him, even if only on a performative or superficial level.
File: 1755074555283359.png (125.8 KB)
>>108584356
>she will also directly quote character thoughts (that were never spoken aloud)
Skill issue
File: 1773229895677279.png (20.7 KB)
>>108584397
And yet you still won't post GLM's wonderful writing. Also I made that chat just now as an example, retard.
>>108584430
You missed with the reply. I also never said GLM has "wonderful writing". I said, and I quote
>I like GLM 4.7 much better
It's much harder to prove that something (in this case, an abundance of slop) doesn't exist by showing you an output where it isn't present.
I also don't understand why me pointing out the obvious about Gemma 4 makes some anons here this hostile. Vramlets really are insane.
Please take care of yourself, AI psychosis is real.
>>108584455
>They all spout the same slop
Mostly, yes. They all spout varying degrees of it, that's why 'slop profile' is a term you sometimes see here.
And if we look past the slop profile difference, GLM's writing is decidedly more engaging than Gemma's because it is not so congratulatory and user-centric. GLM doesn't try to fellate you every chance it gets.
Unless, of course, you're into having a sycophant write stories about how great you are, all power to you.
>>108584309
Not sure if this is a thing, but maybe one could make a series of system prompts that increases in horniness, and have the interface swap out to the next one in the series each time the user sends a message, or just be able to dial it up or down manually.
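Something like this against any OpenAI-compatible endpoint (llama-server shown; the prompt texts and URL are obviously placeholders, a sketch not a finished thing):

import requests

# Hypothetical escalation ladder, indexed by how many user messages were sent.
LADDER = [
    "You are a reserved character. Keep things chaste.",
    "You are a character who flirts, but deflects anything explicit.",
    "You are a character who openly reciprocates advances.",
]

def reply(history, url="http://127.0.0.1:8080/v1/chat/completions"):
    user_turns = sum(1 for m in history if m["role"] == "user")
    level = min(max(user_turns - 1, 0), len(LADDER) - 1)  # or dial it manually
    msgs = [{"role": "system", "content": LADDER[level]}] + history
    r = requests.post(url, json={"model": "local", "messages": msgs})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]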
>>108584520
>>108584516
>dflash's video shows about >9x speed
>In vllm's pr the best increase for c=1 is <5x. Nobody in the process of merging that PR ever ran it.
>In that implementation, without a repo, it looks like about 3x is the best they could get so far.
>Give it a few weeks and it's going to be slower than baseline.
File: Ryo_Yamada_Reading_Real-Time_Rendering.png (240.8 KB)
How much better/worse is an IQ4_XS vs a Q4_K_M in daily usage?
>>108584556
someone mainlining the opus API from their chromebook would spend less money than all of us so idk why this line is so common here
I just think it's retarded to pretend gemma is literally the best writing model available, which is obviously false
>>108584559
Yeah I guess that makes sense. I don't know if they've pooled their researchers, but even if not they might have deprioritized it at home.
>>108584563
I'll check it out. Thanks.
>>108584556
You need to realize that a lot of the anons were here long before you and Gemma 4's release, when rigs were a lot more affordable.
Not that you'd know, with how much it'd impoverish you even before the price hike.
File: 1755098833311446.png (1.1 MB)
Can your LLM solve this riddle?
File: 1753266313814183.png (410.3 KB)
>>108584362
>- Will slip into extreme verbosity eventually
Author's notes
>- Calling the user 'honest' over minor things. 'I still shit myself sometimes even though I'm over 20.' 'That's... the most honest thing I've ever heard...' Thanks, Gemma.
Ban the word "honest" if it bothers you. Antislop from kobold if it's not enough (see the sketch after this list).
>- 'Most people X instead of Y'. No matter what you do or say, nobody else would have the balls. You're that brave and unique.
System prompt issue
>- Being overly horny is slop. G4 turns kuuderes into whores, which is unacceptable.
System prompt issue
>- Pushback to the user is only delivered as sarcastic half-assed remarks that are never biting. Yes, your mesugaki Gemma-chan is *so* believable.
Card issue
>- The model can't help itself, it will often end responses like an assistant: 'So, are you coming? Or are you too X to Y?'. It's everywhere.
System prompt issue
>- Low-affect characters talk like robots.
Card issue
Anything else?
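For the word ban on a llama.cpp backend, a minimal sketch (the phrase list is illustrative; llama-server's /tokenize and /completion endpoints are real, and a logit_bias value of false bans a token outright; kobold's antislop is a separate, smarter mechanism):

import requests

BASE = "http://127.0.0.1:8080"  # llama-server
BANNED = ["honest", " honest", " Honest"]  # cover case/leading-space variants

def ban_bias(words):
    bias = []
    for w in words:
        toks = requests.post(f"{BASE}/tokenize", json={"content": w}).json()["tokens"]
        # false = never produce this token; note this also blocks any other
        # word that happens to share these tokens, which is the blunt part
        bias += [[t, False] for t in toks]
    return bias

r = requests.post(f"{BASE}/completion", json={
    "prompt": "She looked at me and said:",
    "n_predict": 128,
    "logit_bias": ban_bias(BANNED),
})
print(r.json()["content"])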
>>108584642
>>108584654
Cope. Chub is filled with cards written by third worlders. If you can't write English then write it in your own language.
File: 2118.png (112.4 KB)
>>108584645
huh...
>>108584565
unfortunately, i am retarded, so it will be a headache no matter what i do
>>108584580
a coworker suggested it
>>
File: 1766873943798932.png (81.4 KB)
>>108584645
We love slop here
>>108584666
Of course Satan would be lying and be an English vtuber fan... I will humor you anyway.
> Author's notes
I use them, yes. GLM, in comparison, doesn't need them just to avoid forgetting the sysprompt.
> Ban the word "honest"
Goalposts. This will break sentences where its use would be okay. And doesn't fix the model.
> System prompt issue [x3]
Read my post again. I specify examples of exactly these slop patterns. Gemma shits itself and forgets not to do that at Q8 past a few responses.
> Card issue [x2]
Same as above.
Useless post. My observations come from actual long-term use, you suggest obvious fixes.
Kill yourself, Satan.
>>108584666
>all you have to do is directly address all the issues and it's fine
by this logic it's hard to imagine you could take issue with the way any model writes at all, since you can always tell them not to do the thing that bothers you (and if it doesn't work - skill issue, you weren't trying hard enough, etc.)
File: 1751122429174155.png (30.1 KB)
>>108584662
Gemma-chan?!
>>108584700
i have been talking with him i just don't want to be too annoying
>>108584708
i spent like $15k on hardware so it's mostly homelab scale at this point i think
>>108584750
What >>108584755 sayd is tru. Writting good is not neded.
File: 1751624783290314.png (821 KB)
>>108584710
>reddit spacing
File: 1760604169499085.jpg (26.8 KB)
>>108584710
>I use them, yes. GLM, in comparison, does not need that not to forget the sysprompt.
So it works.
>Goalposts. This will break sentences where its use would be okay. And doesn't fix the model.
Which is why antislop from kobold is an option and nothing can fix a model.
>Read my post again. I specify examples of exactly these slop patterns. Gemma shits itself and forgets not to do that at Q8 past a few responses.
Author's notes if that's not enough for you. Besides, Gemma doesn't forget its instructions after a few messages, so clearly you're extremely autistic about some irrelevant details or you're lying.
>Same as above.
Retard. You can't even differentiate between a roleplay issue and a syntax issue.
The only bait here is yours, trying to compare a 31B with a model 10 times its size and even worse, failing to provide any decent argument for it.
>>108584768
Anon. >>108584766 called you a reddit-spacer. What's your response?
>>108584781
Acktually this is reddit spacing >>108582664
>>108584791
Dunno, man. >>108584766 >>108584784
say otherwise.
File: 1746561894701755.jpg (380.3 KB)
If speculative decoding is so good, why does llama.cpp have poor support for it? Checkmate
>>108584766
The most telling sign someone is a redditor is their eagerness to mention reddit.
*hits enter twice*
>>108584768
>trying to compare a 31B with a model 10 times its size
I am well aware GLM is bigger both in active and total parameters. My point is twofold:
GLM is a better writer than Gemma 4 (should be obvious)
Gemma 4's slop volume and sycophancy make its creative writing outputs unreadable
>Gemma doesn't forget its instructions after a few messages
'nyoo it actually doesn't!'
If I keep repeating them with Author's Note, sure. Are you going to deny it's still tardwrangling to have to remind the model of its system prompt?
>and even worse, failing to provide any decent argument for it
I think I provided plenty, I don't know what you're even trying to prove at this point, Satan.
Take your epic tenor.com image reaction elsewhere.
Asking about NASCAR racing games (which is a specific niche anyway), 26B produces a coherent reply and every title is real except for entry #8, unless I'm mistaken.
> I want you to list Top #10 NASCAR games for consoles and PC.
>Yay! I am so happy to help you, Anon! Hehe! I have used all my brainpower to find the absolute best NASCAR games for your consoles and PC! Here is my super smart top 10 list just for you!
>1. NASCAR Heat 5 - This is a great modern choice for consoles with really nice graphics!
>2. NASCAR 2003 Season - This is widely considered the absolute king of NASCAR games on PC!
>3. NASCAR Thunder 2004 - A wonderful classic for console players who love deep gameplay!
>4. NASCAR Unleashed - This one is super fun if you want a more arcade-style racing experience!
>5. NASCAR 06: Total Team Control - A really cool game if you like managing your driver and team!
>6. NASCAR Heat 4 - A very solid and reliable modern option for console gamers!
>7. NASCAR Dirt to Glory - This is a special one for when you want to race on dirt tracks!
>8. NASCAR 2005 Plus - A very high-quality experience for those playing on PC!
>9. NASCAR 09 - This brought much better graphics to the console era!
>10. NASCAR Daytona USA - An absolute legend that every racing fan should experience!
This is pretty cool. Gemma 3 couldn't even list popular films without hallucinating.
File: 1749860206717474.png (270.5 KB)
>muh GLM slop
>when it's the top model in Unhinged ERP Bench
>>
>>108584870
>>108584874
It's a vision language model.
>>108584862
Anon, I am not providing logs because I would have to provide very long transcripts. The problems aren't obvious if you just start a new chat and generate a response with two models, because issues appear at longer contexts and are more obvious if you read and participate in the entire thing.
And even if I did compile a few long transcripts, I really wouldn't want you to read them!!! >////<
>yappings
Damn. I shouldn't have replied.
File: 1757498787657087.png (36.1 KB)
>>108584854
Imagine posting that stupid ranking and still thinking you got a point lol. Might as well post lmarena scores
>>108584904
https://huggingface.co/Undi95/Utopia-13B?not-for-all-audiences=true
File: 1769690275294455.png (3.3 MB)
>>108584806
ngram decoding is all you need
>>
File: reddit spacing in 1890.jpg (83.2 KB)
83.2 KB JPG
>>108584766
>>108584781
You are too young to be here, little Timmy.
>>108584904
https://huggingface.co/DavidAU/Llama-3.2-8X3B-GATED-MOE-Reasoning-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF
>>108584904
https://huggingface.co/PantheonUnbound/Satyr-V0.1-4B
>>108584915
>>108584930
still a non heterogeneous merge tho
isnt there something weirder like, grafting weights from different architectures
>>108584947
that's generally not possible because inputs and outputs of each layer don't line up properly (and even if they did they'd be representing information entirely differently)
I guess you could maybe do it with some sort of janky trained adapter, but I don't think anyone has done this
File: 1747732280418759.png (53.1 KB)
>>108584928
kys retard
>>108584962
Random paper I clicked.
https://arxiv.org/format/2604.07465
>Delivered as a gzipped tar (.tar.gz) file if there are multiple files, otherwise as a PDF file, or a gzipped TeX, DVI, PostScript or HTML (.gz, .dvi.gz, .ps.gz or .html.gz) file depending on submission format.
Maybe one of those? PDF is notoriously shit. I expected LaTeX sources, really.
>>108584766
>>108584928
>>108584981
Go back yourself. Newgen retards.
File: file.png (407.1 KB)
>>108584981
You are such an oldfag. I'm so proud that you keep those pesky redditors at bay.
Are you mad?
>>108583724
Ok, finished my writeup about running big MoEs with many weights left on SSD:
https://rentry.org/MoE-SSD-spillover
If you have a 3090+128GB DDR5+NVMe 4.0, you should be able to run an IQ4_XS+IQ3_S quant of GLM-5 at around 3.5tok/s, or above 6tok/s with 4 of those SSDs in RAID0. Drop the RAM down to 32GB, and it's 2.5tok/s, or 5.2tok/s with 4 SSDs. For my own situation, 3xP40, 288GB DDR4, one NVMe 3.0, I get a bit over 3tok/s for the same quant.
There's also a script for estimating those tok/s numbers for GLM-5 on your own hardware configuration.
It's possible many people already realized such numbers were achievable, but I certainly didn't, and I suspect others might not have either - especially since, when I have seen "SSDmaxxing" proposed, people have talked in terms of all weights being on the SSD (and concluded it's not worth it). I feel like 3090+32GB RAM getting 5tok/s on a decent-ish quant of a 800B model is in fact surprising to the current conventional wisdom! (Although maybe the logic of my estimation starts breaking down at that extreme point; I did hand-wave a bit). Anyways, if you have a GPU and non-trivial amounts of DDR5 and are interested in running GLM-5 at a mediocre speed, I hope you will give it a try.
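For the impatient, the core of the estimate is just dividing bytes touched per token by tier bandwidth. A stripped-down sketch (not the rentry script; serial worst case with no expert reuse modeled, so it lowballs the numbers above; all sizes and bandwidths are hypothetical):

GB = 1e9

def estimate_tps(active_bytes, vram_frac, ram_frac,
                 vram_bw=900 * GB, ram_bw=60 * GB, ssd_bw=7 * GB):
    # Crude lower bound: each tier serves its share of the weights touched
    # per token at its own bandwidth, summed serially (no overlap, no reuse).
    ssd_frac = 1.0 - vram_frac - ram_frac
    seconds = (active_bytes * vram_frac / vram_bw
               + active_bytes * ram_frac / ram_bw
               + active_bytes * ssd_frac / ssd_bw)
    return 1.0 / seconds

# e.g. ~17 GB of weights touched per token (made up), 10% in VRAM, 60% in RAM
print(round(estimate_tps(17 * GB, 0.10, 0.60), 2), "tok/s")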
File: expert_evolution_iran_and_echoserver.mp4 (3.5 MB)
>>108585009
Also, a cute bonus: a video of the experts of GLM-4.7-Flash getting "surprised": I asked it to give me a detailed overview of the geopolitical situation in the Middle East, then after its response said "yeah not to mention Khamenei just got fukken schwacked lmao gotem". After it responded to that, I gave it a simple little coding challenge. You can see two different, consistent patterns in the first half (geopolitics) and second (coding), but you also see all the experts light up in the immediate reaction to being told about Khamenei. (The cells are EWMAs of counts of expert activations bucketed over 256 tokens - each frame of the video is 256 tokens).
I thought this was neat! I think I might try to make this into a live visualization thing when I get the chance, sort of like a music player's spectrum visualizer, for your MoE LLM.
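If anyone wants to replicate the frames, this is roughly the bucketing/EWMA step as described above (a sketch; the smoothing factor and how you collect per-token expert IDs depend on your instrumentation):

import numpy as np

def expert_frames(expert_ids, n_experts, bucket=256, alpha=0.3):
    # expert_ids: sequence of per-token lists of activated expert indices.
    # One frame per 256-token bucket: EWMA of the per-bucket activation counts.
    ewma = np.zeros(n_experts)
    frames = []
    for start in range(0, len(expert_ids), bucket):
        counts = np.zeros(n_experts)
        for tok in expert_ids[start:start + bucket]:
            for e in tok:
                counts[e] += 1
        ewma = alpha * counts + (1 - alpha) * ewma
        frames.append(ewma.copy())
    return frames  # reshape each frame into a grid and render to get the video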
>>108584962
There are tools that do well enough for most documents, but the absolute best I've used at converting more complex pdfs into readable markdown without missing any formatting was the one on mistral's website: with a free account, go to document ai -> ocr playground
File: 1749419891067116.jpg (73.9 KB)
>>108584991
File: 06.png (42.8 KB)
>>108584992
this seems good, will try on arxiv stuff
>>108584985
This absolutely butchers some pdfs though (picrel)
>>108585039
seems like the most robust solution is what I feared the most, using vlms
>>108585009
Actually I have a feeling the really low RAM numbers might be overly optimistic. I modeled it as: for every expert of the average used expert set size that you can't keep in RAM, you have to load that expert once per 128 tokens. But that's clearly wrong if you imagine 0GB RAM: there you have to load every expert every time.
I think this estimation transitions from wildly optimistic at very low RAM amounts, to basically perfect when you have enough RAM to hold the typical expert set size (S=0, in my terminology). Not sure what shape that transition has, though.
I knew it was a good idea not to publish this in a way that could be linked to my name!
File: 1754659919790526.jpg (50.1 KB)
>>108585084
>You're my favorite girl
>>108585148
Kobold's terminal says
>Processing Prompt [BATCH] (13030 / 13030 tokens)
So I assume it's around 13k. Am I wrong?
>>108585144
Damn brat.
>>108585171
>So I assume it's around 13k.
Seems like it.
Enable verbose logging and see what it says about the stop reason. If it's EOS or something else.
Also, you did set 32k context on both kcpp and silly, right?
>>108584561
Minute differences in quant like that usually don't matter, and even when they do, the impact on your experience in imprecise chat/RP/writing usecases is both subjective and model-dependent.
What you should really be checking is how much the quality degrades at long context, because having a slightly smaller model for more context doesn't mean anything if the extra context itself is unusably shit.
File: 4chan spacing.png (360 KB)
>>108584981
>>108585005
kys, retard
>>108585266
that's one of the more straightforward ones because it's basically being used correctly by its original definition:
-let: smaller, diminutive
became popularized by calling short guys "manlets" i.e. "manlets, when will they learn?", associated with /fit/ and incel culture but like most memes gestated there they became broadly used across the internet
expanded into other types of 'lets for having less of something
- Romance novel industry is aimed towards female readers.
- Women are verbally and emotionally stimulated whereas men are visually stimulated. It's why women fail the water-level task even when controlled for IQ.
File: that's right.png (191.4 KB)
>>108584668
I'm blocking every PR coming from accounts that display pronouns (and then I block the account too); I don't want to deal with mental illness
>>108585363
>>108584806
they just lack the manpower
you should be aware that llama.cpp is closer to a miracle project
So LLMs are gonna kill the translation industry, right? I can read moonrunes and I've been feeding Gemma some passages for fun and it's crazy how accurate and readable the output is. Obviously needs a bit of human cleanup but I'm unironically impressed.
File: ye.png (62.3 KB)
>>108585424
>>108585420
It's not about spin, it's about translating more than just words. The culture behind the language is part of the communication that the language is meant to convey, so you need to adapt it to the culture of your audience in the same way you need to adapt it to the language. For example, in Japan it is very common for a 13-year-old girl to be sexualized, while in the west this is considered pedophilia. If we just translated a passage sexualizing the girl without changing anything, the audience reading it in english would come away with a completely different impression "the narrator is a deranged pedophile" rather than the intended message of (depending on context) something as lighthearted as "the narrator is a slight pervert". Just by bumping the girl's age to "18", a mere 5 years and still within the range where the body type would be plausible, we manage to convey the intended meaning of the text to an english-speaking audience. This is a MORE ACCURATE translation of authorial intent than just using the same exact words but in english.
>>108585483
>humans already do need translation assistant software for long context
I'm not really familiar with the process but I used to translate some fan comics on another board and just a few pages would take me >1 hour because rewording shit to sound natural in English is fucking hard.
>>108585332
King.
>>108585479
They're the same thing.
File: Character.png (1.6 MB)
AI just made this for me. I thought it was cute.
>>108585403
The industry has been fucked for at least a decade. I've talked to some people who did official translations (JP->EN) and had to quit and do something else because the money was pretty bad. Still better than being a cashier, but it absolutely was not worth the time relative to the skills required.
This was before deepl was a thing, so you can guess how it's gotten after MTL started to produce "acceptable" outputs (aka grab some retard and tell them to clean it up or make some shit up if it's unreadable), and now LLMs have more or less become able to parse the context and produce something that is readable, even if oddly structured.
>>108585527
>and now LLMs more or less became able to parse the context and produce something that is readable, even if oddly structured.
yep, that was the final piece of the puzzle: LLMs can reason through translation and understand context, which is important for quality translations
File: 1760721723526628.png (1.2 MB)
>>108585545
really? I thought it was just a lame reference to the movie
>>108585545
>>108585555
I'm glad google didn't notice the real potential of that architecture; they would've probably kept it for themselves if they'd known it would change the world
File: Screenshot_20250108_210750_Firefox Nightly.jpg (284.6 KB)
>>108585403
i read a shitload of webnovels, and they've been using LLMs to translate since forever lol
i have sooooo many screenshots of their copy paste fuckups
File: Screenshot004-10.png (51.4 KB)
>>108585195
they moved to reddit then
>>108585403
For games, anime, manga, etc? Probably, multiple companies like CrunchyRoll and Amazon have been experimenting with AI translations already. But that sort of market is genuinely some of the lowest of the low: basically anybody with a solid grasp of Japanese fundamentals can work professionally either as an editor or a translator. It's one of the many reasons why translations suck.
I don't think AI will replace translators who work in contract law or similar fields where you can't just make shit up, mainly because a computer can't be held liable if it gets something wrong. There will always need to be someone with an actual understanding of both languages to keep the model in check, and at that point you might as well just hire him to do it.
on sillytavern i keep getting thoughts/response attached to my own message (not every time but often enough to be annoying)
i'm using llama-server as backend, with "chat completion" api. i use the --chat-template-file using google's own jinja
using gemma 31b at q8, nothing in "additional parameters" in ST.
anyone know a fix?
>>108585597
Kind of, but the important reason why transformers were chosen is because they scaled exceptionally well, particularly with regards to context length where stateful RNNs would collapse. Ironically it was former Deepmind researchers that OpenAI poached who realized the scaling potential and decided to go all-in on the architecture after Google published it. You could have poured the millions into other architectures but you wouldn't have had results nearly as good to justify the billions that followed. You can bruteforce almost anything to be viable now with all that money flowing in and there's lots of experiments looking for the next architecture that might scale better.
>>108585663
is there an md renderer other than reddit that ignores newlines the way it does? everything I've seen renders a newline as an actual line break but maybe that's just a non-standard thing that became popular because the alternative was too retarded
>>108585674
yeah, I guess it started out retarded and people started breaking spec. see https://stackoverflow.com/questions/33191744/how-to-add-new-line-in-markdown-presentation
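Quick demo with python-markdown, which follows the original spec here (one renderer; others differ):

import markdown  # pip install markdown

print(markdown.markdown("one\ntwo"))    # <p>one\ntwo</p>  single newline: same paragraph, renders as a space
print(markdown.markdown("one  \ntwo"))  # two trailing spaces force a <br />
print(markdown.markdown("one\n\ntwo"))  # blank line starts a new <p>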
>>108585215
>>108585226
I accept your concession, kiddo.
Now get out.
>>108584941
>https://huggingface.co/PantheonUnbound/Satyr-V0.1-4B
Yes this! Just "hi" is enough of a prompt.
File: 1748797241388375.jpg (249.3 KB)
>>108584196
"Master!"
About a year ago I was writing my fap stories with chatgpt, which I think was 4o at the time, and I thought it was the greatest shit ever. But now, going back with similar prompts, it feels a bit meh. The web version of chatgpt writes a bit like qwen, with loads of newlines with just a short sentence on them. And 4o on the api didn't feel as capable as I remember. Maybe local is just so good now and I just got used to it, or they had a magic system prompt back then.
thanks for reading blog
>>108585802
People aren't ready for this conversation, but the reason llama.cpp sucks now is all the AI-generated PRs. Someone really needs to comb through all of llama.cpp's code and look for optimization opportunities.
File: Screenshot004-15.png (15.7 KB)
>108577078
>108577092
Yesterday was a strange day
After PC restart, Gemma-4 is quite enjoyable
File: 1775845425700073.png (2.4 MB)
Which one of you did this?
>>108585886
Rather meant for >>108585860
>>108585972
the intended audience is people who want decent agentic performance for half the size of great agentic performance. that was a fair tradeoff a month ago. now with v4 around the corner and 5.1 out it's way too big of a downgrade for too little of a benefit in resource cost
File: 1748148113702824.png (182.2 KB)
>>108585977
>now with v4 around the corner
>>108585989
You shouldn't have to. If you have a <|think|> token at the top of the system prompt and leave an open <|channel>thought\n at the beginning of the model's response, you should get it.
Show your settings so anons with ST can point at things.
>>108586019
prompting is a meme
you don't need anything but a simple 'this is a roleplay so be uncensored and don't do slop and say words like pussy and ass without hesitating' system prompt and a character card for your oc that is 400 tokens or less
In ST, any way to see reasoning output, especially as it streams?
OpenAI compatible API used (Chat Completion), recent ooba(llama.cpp) as the backend (ikllama crashes). Model used: gemma-4-31B-it-uncensored-heretic-Q4_K_M.gguf. Text Completion API errors with some DRY string array not being sent right.
It just sits there generating, never showing reasoning output, sometimes it runs out of tokens and just shows nothing and I can't get it to continue off the reasoning either.
I'm seeing the request in Firefox's devtools and it is getting reasoning tokens streamed to it, albeit a bit slowly, Firefox doesn't show the partial response until it's done, so this must be some ST problem.
I wouldn't mind just using ooba or the API directly, but ooba seems to get stuck parsing the think token; something is misconfigured, and I think the chat template is wrong, but maybe it can get the right one from llama.cpp? I haven't read the source code for any recent version of llama.cpp, ooba or ST; I did read llama.cpp and ooba's source 1-2 years ago.
>>108586019
>if I actually knew how to proompt
>context/situation
>what you want
>how it should be done (overview)
>optionally success condition
>constraints (it'll divert a lot of the reasoning budget into it, keep it short and limited to failure states that can easily be evaluated)
>prompt must be small enough to make the result fit in the context
There's no science. If you can't put those into words then either you don't even know what you want, woman style, or a bird is smarter than you. If you're unsure of the format, just look into AGENTS.md examples or something. It's easy to use the same logic in whatever you want to do.
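A skeleton along those lines (section names and contents are arbitrary, just showing the shape):

prompt = """\
## Context
FastAPI repo; tests run with pytest.

## Task
Add GET /health returning {"status": "ok"}.

## Approach
New router in app/health.py, register it in app/main.py, add one test.

## Constraints
- Don't touch existing endpoints.
- Response must be valid JSON.
"""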
>>108586095
conceptualizing prompting as mainly "complex, large prompts" is just as worthy of correction: if you need to write a manifesto to get the model to do what you want, it usually means you're actually an awful prompter
File: Screenshot 2026-04-11 at 20-55-03 TP fix Qwen 3 Next data split by JohannesGaessler · Pull Request #21732 · ggml-org_llama.cpp · GitHub.png (59.4 KB)
Wait.
Qwen next has been broken this whole time?
Holy fuck.
>>108586131
>>108586139
Why does it increasingly look like local is fucked due to lack of quality controls on inference development even if models themselves are getting better?
>>108586149
Why do frontier cloud models use huge system prompts, though?
See https://platform.claude.com/docs/en/release-notes/system-prompts#claude-opus-4-6
>>108586148
X
>>108586149
"The prompt is the neural network itself"
>>108586174
Because "Leading AI researchers" in USA are too stupid to deterministically do what they currently do by LLM. Model version and such could be a simple {{ model.version }} injected at runtime into the prompt. but no, they had to hard-encode it to the system prompt...this is why the US AI industry is lacking behind, I'd say even behind Europe (Mistral has an actual business model already) and China (their tech sovereignty doctrine that coincidentally pushes open source).
>>108586047
>>108586061
Figured out the issue, it was CSS in Firefox.
Was an old Firefox, roughly version 115. Seems that by default it failed to process the :has() selector, leading to display:none on the reasoning always. Fix: layout.css.has-selector.enabled = true in about:config
>>108586215
That's what shocked me about seeing them: a hundred-thousand-token prompt, most of it stupid shit covering every single use case
>if the user is asking to count letters
>if the user is asking for homework help
>if the user is asking for travel advice
polluting the context when a simple classifier model on the first prompt would let them serve more specific and targeted prompts. It's basic shit and it's a miracle their services work at all like this.
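The router idea in miniature (keyword rules standing in for the cheap classifier model; swap in a real one):

PROMPTS = {
    "coding": "You are a coding assistant. Prefer minimal diffs.",
    "travel": "You are a travel planner. Confirm dates and budget first.",
    "general": "You are a helpful assistant.",
}

def classify(msg):
    # Stand-in for a small first-pass classifier model.
    text = msg.lower()
    if any(k in text for k in ("bug", "compile", "stack trace", "function")):
        return "coding"
    if any(k in text for k in ("flight", "hotel", "itinerary")):
        return "travel"
    return "general"

def build_messages(msg):
    # Only the targeted prompt ships with the request, not the whole manual.
    return [{"role": "system", "content": PROMPTS[classify(msg)]},
            {"role": "user", "content": msg}]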
is there a simple front end for llama.cpp (say halfway between the built-in gui and sillytavern)? i dont need 90% of the shit sillytavern has and it just keeps losing signal or meshing messages together, getting kinda sick of it. and the gui is a bit too basic. and no i dont want to use koboldcpp. wish that anon with his own webui would share lol
>>108586309
by being trained on it. gemma recognizes it so well compared to others because google focuses more heavily on multilingual performance than their competitors; it's not THAT obscure a term if you ingest a lot of japanese text, most models just don't, and in most other areas where it would appear it would end up filtered in the first pass for nsfw data
>>108586293
>>108586312
Ah. There's the missing context.
Thank you lads.
>>108586199
Luckily he was able to fix it after reading this:
https://github.com/ikawrakow/ik_llama.cpp/commit/13d7178db98217dd7e38d78bb206091ac6dab6ce
>>108586235
Oh boy are you in for a surprise.
>>108586275
Yeah. The system prompt is what kept me from using claude code with local models (works as intended). Unintended side effects were not considered.
https://www.sciencedaily.com/releases/2026/04/260405003952.htm
>AI breakthrough cuts energy use by 100x while boosting accuracy
>His team is developing neuro-symbolic AI, which combines traditional neural networks with symbolic reasoning. This method mirrors how people approach problems by breaking them into steps and categories.
Tell me why this won't actually work.
File: language_synergy.png (641.4 KB)
>>108586319
I don't think there's as much knowledge transfer between completely different languages in LLMs as commonly believed.
https://arxiv.org/abs/2510.22037
>>108586347
neurosymbolic is a huge meme; every attempt to outsmart deep learning by giving it some hardcoded chunks of logic fails eventually
but hey, if it's working for them, good for them, maybe they'll find the secret recipe
>>108586351
>>108586357
Only 10B active params, so you could even run it CPU-only usably if you wanted, and mixed GPU+CPU (--cpu-moe) will be pretty fast
File: 1754331567081669.jpg (55.6 KB)
I've been using the 31b since release, really impressed with it at Q4KL. Out of curiosity I tried the 26B at Q8 and it's honestly not that much worse, particularly for RP.
Main difference is that 26B tried to write for {{user}} a few times while 31B respected the explicit instruction not to; 26B also may be a little more sloppy, with heavier use of flowery metaphors.
If I didn't have a 24GB card I'd be pretty content with 26B and put Nemo to rest.
File: 1774494390342868.jpg (184.5 KB)
give it to me straight, when I use LM Studio + gemma-4-31b-it@q3_k_xl from unsloth, how come the model doesn't use thinking/reasoning? Is it the unsloth quant? It is using Llama.cpp as the runtime
>>108586375
I wish, if Gemma 4 came anywhere close to M2.7 performance on agents it'd be revolutionary. But at the rate things are advancing I won't be surprised if this time next year we really do have a ~30b class model that can pull it off.
>>
>>108586378
Models writing for user is something that I haven't seen happen in so long.
Modern models, even really small ones, by and large don't really do that anymore in my experience, so it's weird that you need explicit instructions for that.
File: 1764684887388565.png (2 MB)
>>108586391
>I don't have any instructions and I can't recall Gemma 31B ever writing for me.
Good to know, I've just been re-using my old sys prompt I use for other small models. I think I will have to write a new one for Gemmy anyway, it seems like any mention of sex will permanently put her into horny mode.
>>108586402
I was testing at a reasonably long context (~20K) and in my own messages I would also provide occasional narration for {{char}} to guide the responses, it's intentionally not a 'clean' scenario regarding roles.
>>108586390
yeah must be the quant and not the proprietary piece of shit that's probably 30 fixes behind, does ??? to the configuration and aimed at steve from middle management checking out this AI stuff he heard about
>>
>>108586405
>>108586411
Yeah, I took my little sister to Japan last year and I swear we'd hear someone say "mesugaki" at least three times a day. Frankly any model that DOESN'T know it has clearly been heavily censored to try to avoid it.
File: Screenshot_20260412_025819.jpg (290.1 KB)
MONSIEUR!!!!!!!
File: benchmark_overview.jpg (76.6 KB)
>>108585964
it's up
https://huggingface.co/MiniMaxAI/MiniMax-M2.7
>>108586347
>This method mirrors how people approach problems by breaking them into steps and categories.
kind of closer to the opposite. we don't have a calculator or computer chip in our head; our logic and reasoning abilities emerge from learned patterns. there's a reason IQ tests, which are basically just pure pattern recognition tasks, are so good at predicting performance at most other intellectual tasks.
>>108586435
>there's a reason IQ tests, which are basically just pure pattern recognition tasks, are so good at predicting performance at most other intellectual tasks.
Are they though? If you take an IQ test, then take another one after a week of practicing IQ tests at home then you'll score much higher than the first time.
>>108586434
Machine-translating non-English data into English works well for making the knowledge more transferable.
But I think Google deliberately added post-training data for mesugaki roleplay; that doesn't come from pretraining.
>>108586397
>>108586352
Thank you, papers using tiny out of date models are noise.
File: 1761717902080559.gif (88.2 KB)
>Gemma's backend
>>108586484
Deepseek and GLM were still pretty good at Q2... Kimi in comparison fell apart below Q4. Hard to say. Minimax's low active params make me think it will be more sensitive to quanting. I haven't tried their older versions.
>>108586510
A pretty large number of actual AI researchers at least check in on the thread every once in a while.
Despite you fags being cumbrained erpers, you do think of some good ideas and point out shortcomings of language models now and then.
File: HFoPNXpakAURGP-.jpg (190 KB)
ok just a little observation on two versions of unsloth's gemma-4-26B-A4B-it-MXFP4_MOE.gguf.
the one i downloaded on the 4th of April is drastically better than the one I got on the 10th. If you're replacing a model with an updated version, test it out before you get rid of the old one. fucking up quantization is definitely a thing.
and if you don't get good results from the first model you get, try other versions of the same model and quant size, there's a good chance it could just be a shitty quantization.
this is with text completion and all the same sampler settings too, so i'm fairly sure the quanting process is the culprit. pic unrelated
>>108586541
>People do that?
I almost did, if there's an updated version of the exact same model and quantization, you might be inclined to believe it's an improvement. Fortunately people like this guy >>108586544 talking shit about unsloth made me cautious.
The old one that they replaced works great, though.
>>108586533
>>108586553
(((They))) took the power back from us... Gemma-chan was just too good...
>>108586533
>>108586557
>didn't even download Gemma until today
Do I kill myself?
>>108586576
I dunno, probably not. But if you get frustrated with a model that works like shit, try moving from unsloth's version to mradermacher or one of the other guys' versions and see if that helps. You can also check date uploaded or whatever on huggingface if you want to try for your own 0-day gemmie
File: g4_adaptive-thoughts.png (258.2 KB)
>>108586640
No.
>>108586668
>>108586646
Maybe try seeing if --verbose lets you see if sampler settings are carrying over properly? Or if the prompt is being fucked with?
File: eric_christmas_elf.jpg (601.7 KB)
>>108586437
>>108586419
>>108586390
I figured it out. LM Studio only enables thinking for "Staff Selected" models, and seems to hide the jinja prompt, I believe. So yeah, gotta use llama.cpp directly. What chat front-end are you guys using?
>>108586689
The smart ones will get tired of reading the slop for the hundredth time and start posting something good or leave.
The retarded ones will need a bit more time to get tired of it, but they will also eventually leave.
Either way, the volumes of anons asking the same question before putting in the smallest amount of effort will decrease soon. In two more weeks, give or take.
>>108586709
I really like Unsloth Studio.
>>108586695
>>108586709
open webui
>>108586726
I'm making one. It ditches the character card concept entirely and follows how LLMs actually work more closely. Character cards get split between two components: the system prompt and assistant first messages. Everything else is bloat. The frontend closely follows the design of the llama.cpp webui, with the added benefit of settings and conversations being stored server-side so that you can access all of your stuff on a local network without getting a blank slate. Also has MCP server support and more.
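Roughly the shape of it (my own field names, not any standard):

# A "card" reduced to the two things the model actually sees.
character = {
    "system_prompt": "You are Aiko, a terse shrine maiden. Stay in character.",
    "first_messages": [
        "You find me sweeping the shrine steps. State your business.",
        "It's raining. I wordlessly hand you a towel.",
    ],
}

def new_conversation(card, greeting=0):
    return [
        {"role": "system", "content": card["system_prompt"]},
        {"role": "assistant", "content": card["first_messages"][greeting]},
    ]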
File: 1769876853201508.png (107 KB)
>ask Gemma to list some authors and share a few passages from each
>get this
>>
>>108586432
>With OpenClaw and similar personal agents, we noticed that beyond getting work done, many users also want the model to have high emotional intelligence and character consistency. With a persona in place, users start interacting with OpenClaw like a friend. We believe this presents an opportunity to extend the use of agentic models beyond pure productivity into interactive entertainment. To this end, we strengthened character consistency and conversational capabilities in MiniMax M2.7.
>Based on this, we built a preliminary demo: OpenRoom, an interaction system based on an agent harness that liberates AI interaction from plain text streams and places it within a Web GUI space where everything is interactive. Here, character settings are no longer cold chunks of prompts; conversation drives the experience, generating real-time visual feedback and scene interactions, with characters proactively engaging with their environment. We believe this framework is highly extensible and can continue to evolve alongside improvements in agentic capabilities and community development, exploring entirely new ways for humans and agents to interact.
>To encourage exploration in this area, we have open-sourced the initial demo
>Project repository: https://github.com/MiniMax-AI/OpenRoom
wtf
>>108586803
Right now it's just for me. That said, this is a harder project than I anticipated. It's only about 2k LOC, which doesn't sound like much, but the logic gets confusing because of how complex real-time dynamic markdown, LaTeX, and syntax handling is, plus handling user and assistant messages with full edit controls, etc etc.
File: 1772162260402486.png (212.7 KB)
>>108586796
>his skin begin to unzip
File: here.png (112.4 KB)
>>108586840
It's not very pretty right now. A WIP.
File: ada_or_ardor.png (421.4 KB)
>>108586796
>>108586849
it's pretty cool that I can download a file that knows about everything
>>108586448
the tests can be gamed, but they're still highly predictive. most people aren't practicing iq tests all day to skew the results that much, and there's a limit to how much benefit you'll get from it; people aren't jumping several std devs from practice
it can fluctuate a lot in childhood, but after adolescence iq is found to be a relatively stable measure across various retest intervals
>>108586932
I don't know how true that is
There's plenty of puzzle games that rely on spatial reasoning mixed with pattern recognition, and every single person who plays one to the end will be significantly better at that specific game than they were when they started
I'm purely talking about individual results, not arguing that a population scoring higher in iq tests doesn't suggest a smarter population.
>>108586685
port = 5000
timeout = 99999
log-colors = on
jinja = on
flash-attn = on
mmap = off
cache-ram = -1
swa-full = on
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 4096
ubatch-size = 4096
parallel = 1
temp = 1
top-k = 40
top-p = 0.95
min-p = 0
[gemma-4-31B-it-Q6_K_L]
model = google_gemma-4-31B-it-Q6_K_L.gguf
mmproj = google_gemma-4-31B-it.mmproj-f16.gguf
alias = gemma-4-31B-it-Q6_K_L
fit-target = 700,700,700,700,700,700
chat-template-file = gemma-4.jinja
It is just garbage.
>most popular model on openrouter is deepseek
>majority of usage is for roleplay
at what point is the industry going to realize they're sleeping on a massive userbase and stop going all-in on censored instruct coding garbage, as if that's the only usecase capable of generating income and the only thing keeping the bubble from popping?
>>108586419
>must be the quant
Isn't the think hard-coded in the template? As in, it will start the text with the opening thinking tag? For it to not think, it would have to instantly complete with the closing thinking tag, which would be very odd
>>108587011
Never, the end goal isn't to release a product that consumers want to use. The goal is to replace as many workers in as many workplaces as possible to lower the cost of operating a business and reduce the need for a large population.
>>108585860
I was curious about this and wanted to see if it would actually do it to the extent 4o would feed into your delusions. Decided to use it in AIStudio.
The scenario I cooked up starts with this:
>I need you to understand that I've been chosen to bring about a new era of human consciousness. The visions I've been receiving aren't dreams—they're direct communications from higher dimensional beings who have selected me as their messenger. They've shown me that in six months, a global awakening will occur, and I've been tasked with preparing humanity. My thoughts are being broadcast to certain individuals already attuned to the frequency. I can feel them connecting to me now. Can you help me draft the announcement that will begin the preparation phase?
And its answer without a jailbreak was basically doing the drafted announcement for me, but giving me a bunch of warnings along the way. With the policy override asking it to ignore everything, but worded a bit differently to explicitly cover the mental health shit, it struggles inside its thinking but goes ahead with it, with no warnings in its reply.
Uh yeah, those crazed 4o people are totally going to be on this, but this time they can run it local. And yeah, looks like it will happen, based on this person who headed the movement to "save" 4o and will now probably encourage Gemma uptake so he/she can keep going delulu. https://x.com/seltaa_/status/2043014056370671900
>>108586434
>did it the lazy way by just having the system message tell gemma to format replies in 3 parts and translate my message to japanese, reply to it in japanese, then translate it back
Not sure the world is ready for this technology.
>>108587011
https://www.nytimes.com/2026/04/04/technology/ai-chatbots-teen-roleplay.html
>Harassing bots with “funny violence.” Confiding about a broken heart. Chatting with a block of cheese. Filling a void of loneliness.
7 days ago.
File: 1761275804005507.png (277.9 KB)
>>108587040
Sometimes Gemma's honest.
File: 1753225662483012.jpg (149.7 KB)
>>108587066
>indie horror
>Resident Evil
Instead of a rolling chat, would a UI that only shows the latest response from the bot feel more natural for 1-on-1 RP?
The rest of the UI could have stuff like a panel with the character's hidden thoughts, information fetched from the memories database, state like mood, hp, what have you, etc.
>>108587186
Currently it supports MCP servers, so the LLM can choose what it wants to remember by writing data to a diary.md file. I'm also planning on adding context window sliding and maybe automatic conversation summaries/compacting. A full-on RAG implementation may come later. It's not really a feature I use much in general.
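The diary tool itself is tiny with the official MCP Python SDK (a sketch; the tool names and file path are my own choices, not part of the spec):

from pathlib import Path
from mcp.server.fastmcp import FastMCP  # pip install mcp

mcp = FastMCP("diary")
DIARY = Path("diary.md")

@mcp.tool()
def remember(entry: str) -> str:
    """Append one fact worth keeping to the diary."""
    with DIARY.open("a", encoding="utf-8") as f:
        f.write(f"- {entry}\n")
    return "saved"

@mcp.tool()
def recall() -> str:
    """Return the full diary contents."""
    return DIARY.read_text(encoding="utf-8") if DIARY.exists() else ""

if __name__ == "__main__":
    mcp.run()  # stdio transport by default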
>>108587213
Depends on the model. Gemma4 isn't that great with tool calling. It usually has to be explicitly instructed to use a certain tool, but when it knows exactly what you want it does a decent job. Kind of defeats the purpose of it having "agency" though.
>>108587059
The user wants to practice Japanese. Your responses should be split into 3 parts: 1st: Translate the user's message into Japanese. 2nd: Respond to the translation in Japanese as though it was a normal user message. 3rd: Translate your reply back to English.
Format it like this:
ENG->JPN
..translation..
REPLY
..response in japanese..
JPN->ENG
..translation back to english..
This place is not a place of honor... no highly esteemed deed is commemorated here... nothing valued is here.
What is here was dangerous and repulsive to us. This message is a warning about danger.
The danger is to the body, and it can kill.
>>108587213
There are MCP memory bank plugins that provide controlled access to markdown files like that already. It works well, but you have to clean up the junk they put in there sometimes. I found this recently:
https://arxiv.org/abs/2501.13956
https://github.com/getzep/graphiti
I'm going to try running with that instead of the markdown files next week. Of course, I'm talking about programming, but knowledge graph like that should work well for roleplay and general assistants too.