New Image Pipeline and Other Updates

I've made some big changes to my engine's image serving pipeline, and I'm pushing them to this game first since it is the easiest to implement here.

Up until now, my engine has selected images by mapping LLM output to Booru tags via an embedding model, then searching for tag matches in image metadata. This was easy to implement since all Gelbooru images have tag metadata. However, tags are semantically flat: they describe features of the image, but not the relationships between them. This isn't a huge problem if the image is of a single character in a straightforward scene, but when multiple characters interact, tags tell us very little about who's doing what to whom and who has which characteristics (hair color, eye color, etc.).

However, a recently released lightweight NSFW captioning model (https://huggingface.co/Minthy/ToriiGate-v0.4-2B) has allowed me to caption the images for all 3 characters in this game, opening the door to more sophisticated image selection. In addition to a list of relevant tags, I now have access to a plain-English description of each image that relates its features to one another in an organic way. Now, after the initial tag-based sort, the top 10 images have their captions embedded, and those images are re-sorted based on this more holistic assessment. This should yield better results in this game, and the benefits will be even more dramatic in future games I have planned where multiple characters frequently interact in an image.
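To make the two-stage idea concrete, here is a rough sketch of a tag-based sort followed by a caption-based re-sort of the top candidates. This is an illustration only, not the engine's actual code: the `Image` class, `select_image`, and the overlap-count tag score are all made-up names and assumptions, and `embed` is a toy word-count stand-in for a real embedding model (it ignores word order, which a real model would not).

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Image:
    # Hypothetical record: a name, its Booru-style tags, and its caption.
    name: str
    tags: set
    caption: str


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_image(query_tags: set, query_text: str, images: list, top_k: int = 10) -> list:
    # Stage 1: coarse sort by how many query tags each image shares.
    by_tags = sorted(images, key=lambda im: len(im.tags & query_tags), reverse=True)
    shortlist = by_tags[:top_k]
    # Stage 2: re-sort only the shortlist by caption similarity to the query text.
    query_vec = embed(query_text)
    return sorted(shortlist, key=lambda im: cosine(query_vec, embed(im.caption)), reverse=True)
```

Only the shortlist gets the more expensive caption comparison, so the second stage stays cheap no matter how large the library is, as long as every image in it has a caption.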

Unfortunately, I can't directly port this to games like Orgone Collector and Priestess Breeder, as they use a library of 5 million images and I don't have the compute budget to caption all of them.

However, I've also enhanced the image selection model by fine-tuning the embedding model on the specific tasks the engine uses it for. The new model has better comprehension of NSFW terms and text, things none of the current SOTA small embedding models are trained on. This helps the engine get the most out of the captions I've added, but also helps with tag mapping, and will be rolled out to all my games.

Finally, to take advantage of the improved app effect comprehension the game now has thanks to the last update, I've added more sophisticated app effect handling to the backend of this game. Now, instead of just knowing which effects are currently active, it knows which effects have just been added, which have been removed, and which settings have changed, making it better able to narrate how your changes to the app affect your subject.
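As an illustration of that kind of change tracking, here is a minimal sketch that compares the previous and current effect states and reports what was added, removed, or changed. It assumes effects are stored as a dict mapping an effect name to its settings; that layout, the `EffectDiff` class, `diff_effects`, and the effect names in the usage example are all hypothetical, not the game's real backend.

```python
from dataclasses import dataclass


@dataclass
class EffectDiff:
    added: dict    # effect name -> settings, for effects that are new this turn
    removed: dict  # effect name -> last known settings, for effects switched off
    changed: dict  # effect name -> (old settings, new settings)


def diff_effects(previous: dict, current: dict) -> EffectDiff:
    # Effects present now but not before.
    added = {name: s for name, s in current.items() if name not in previous}
    # Effects present before but not now.
    removed = {name: s for name, s in previous.items() if name not in current}
    # Effects present in both states whose settings differ.
    changed = {
        name: (previous[name], current[name])
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    }
    return EffectDiff(added, removed, changed)


# Usage with made-up effect names:
before = {"calm": {"intensity": 1}, "focus": {"intensity": 2}}
after = {"focus": {"intensity": 3}, "sleepy": {"intensity": 1}}
diff = diff_effects(before, after)
```

A diff like this can be passed to the narration prompt in place of the full list of active effects, so the model is told what just happened rather than having to infer it.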