Letting an LLM Direct the Scene

I've been building custom nodes for ComfyUI to make AI image generation less tedious and more surprising. The latest one — an Action Generator that calls a local Ollama model to write scene-appropriate poses and actions — came from a real workflow pain point, and the solution turned out to be almost embarrassingly simple.

The mad libs problem

My ComfyUI workflow is built around composable text nodes. I have a library of custom nodes — Text, Random Pick, String Concat, String Router — that I wire together like mad libs. Random Pick chooses a facial expression from a list: "deep in thought," "laughing," "looking to the side and smiling." Another picks a color for the outfit. String Concat glues everything into a prompt. Hit generate, get variety.
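To give a sense of how small these nodes are, here's a hypothetical sketch of what a Random Pick node might look like — this isn't my actual node, and the names and defaults are invented, but it has the shape of a minimal ComfyUI custom node:

```python
import random


class RandomPick:
    """Picks one line from a newline-separated list, seeded for reproducibility."""

    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "options": ("STRING", {"multiline": True, "default": "deep in thought\nlaughing"}),
                "seed": ("INT", {"default": 0, "min": 0, "max": 0xFFFFFFFF, "control_after_generate": True}),
            },
        }

    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("text",)
    FUNCTION = "execute"
    CATEGORY = "jxw/utils"

    def execute(self, options, seed):
        # A per-call RNG keyed on the seed keeps picks reproducible
        # without disturbing global random state.
        lines = [ln.strip() for ln in options.splitlines() if ln.strip()]
        if not lines:
            return ("",)
        return (random.Random(seed).choice(lines),)
```

Wire its string output into a String Concat node and every queue run draws a fresh expression.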

This works brilliantly for things that are context-free. A facial expression works in any scene. A color works on any clothing. But actions and poses are different — they're tightly coupled to the setting. "Sitting on a bench" makes sense in a park. It doesn't make sense in a kitchen. Every time I changed the scene, I had to go through and rewrite the action options: swap "sitting on a bench" for "sitting on the couch," "leaning on the railing" for "leaning against the counter." It was tedious and it killed the randomness I was going for, because I'd only write three or four actions per scene before getting bored.

The idea

The fix came to me while I was thinking about this exact problem: just ask an LLM. I have Ollama running locally with a 20-billion-parameter model. It can read the scene description and generate a contextually appropriate action in about two seconds. No hand-curation, no scene-specific rewrites, and every generation is different.

So I built an Action Generator node. It takes a scene description, a subject line ("a woman," "an old man"), and an optional direction hint ("reading a book," "eating ice cream"), then calls Ollama to produce two to four sentences describing only the pose, body language, and interaction with the environment. Not appearance, not clothing — other nodes handle those. The LLM just directs the scene.

Iterating on the design

The first version was too simple — it concatenated the scene and action into one output. Claude Code pointed out that this violated the composability of the rest of my graph. Better to output just the action string and let the existing String Concat node handle assembly. That was obviously right.

Then we added a subject input so the prompt wasn't hardcoded to "she." A seed input for reproducibility — ComfyUI users expect seed controls. A temperature slider for dialing creativity up or down. I added an ollama_host field so I could point it at Ollama running on a different machine on my network. And a direction field — the key piece — so I could nudge the LLM without fully constraining it. "Reading a book" becomes four sentences of cinematic description: how she's holding the book, how she's sitting, what her posture looks like, how she's interacting with the space around her.

The system prompt was the part that needed the most care. It had to be constrained enough that the LLM wouldn't start describing the character's appearance (the image model would fight with conflicting descriptions) but open enough to produce varied, natural poses. Present tense, two to four sentences, only pose and environment interaction.

What makes it work

The reason this works so well is that it plugs into the existing node graph without disrupting anything. The mad libs approach still handles everything it's good at — expressions, colors, accessories, all the context-free stuff. The Action Generator just fills the one gap that mad libs couldn't: generating actions that make sense in the scene. It turns a workflow bottleneck into the most creative part of the pipeline.

And it's fast. Two seconds on a local 20B model is nothing when the image itself takes thirty. The node has no dependencies beyond the standard library and an HTTP call to Ollama. It fits in a single Python file.

Just make it yourself

When I started with ComfyUI a few weeks ago, my instinct was to google around for custom nodes that did what I wanted. Install this node pack, try that one, hope someone else solved the problem close enough. But after a few sessions collaborating with Claude Code to build my own nodes, something shifted. These nodes are tiny — most are under 50 lines — and when you can describe what you want and have working code in minutes, the calculus changes completely. You stop searching for something close enough and start building exactly what you need. It's a surprisingly freeing way to work.

In case you're curious, here's the entire Action Generator node:

import hashlib
import struct

from .llm_utils import llm_chat

OUTTAKE_DIRECTION = (
    "Laughing outtake — the subject lost composure trying to strike the pose. "
    "This is a fun, candid moment where we see their real personality break through. "
    "They're mid-laugh or catching themselves, still in the context of what they "
    "were attempting but clearly not holding it together."
)

SYSTEM_PROMPT = """\
You are a scene action writer for a still photo. Given a setting and a subject, \
describe what the subject is doing in that setting. Focus ONLY on pose, body language, \
actions, and interaction with the environment. Do NOT describe the subject's \
appearance, clothing, hair, or physical features — other systems handle that.

Write 2-4 vivid sentences in present tense. Describe a single frozen moment — \
no motion, no sequences, no before/after. Be specific and photographic.\
"""


class ActionGenerator:
    """Calls an Ollama LLM to generate a pose/action description for a scene."""

    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "enabled": ("BOOLEAN", {"default": True}),
                "model": ("STRING", {"default": "gpt-oss:20b"}),
                "ollama_host": ("STRING", {"default": "192.168.1.51:11434"}),
                "subject": ("STRING", {"default": "a woman"}),
                "direction": ("STRING", {"default": ""}),
                "seed": ("INT", {"default": 0, "min": 0, "max": 0xFFFFFFFF, "control_after_generate": True}),
                "temperature": ("FLOAT", {"default": 0.7, "min": 0.0, "max": 2.0, "step": 0.05}),
                "setting": ("STRING", {"forceInput": True}),
                "outtake_chance": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.0, "step": 0.05}),
            },
        }

    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("action",)
    FUNCTION = "execute"
    CATEGORY = "jxw/utils"

    @classmethod
    def IS_CHANGED(s, **kwargs):
        return float("nan")

    async def execute(self, setting, model, seed, temperature, subject, ollama_host, direction, enabled, outtake_chance):
        if not enabled:
            return ("",)

        user_prompt = f"Setting: {setting}\nSubject: {subject}"
        if direction.strip():
            user_prompt += f"\nDirection from the user: {direction.strip()}"

        # Deterministic "dice roll" derived from the seed, so the same seed
        # always makes the same outtake decision and stays reproducible.
        if outtake_chance > 0:
            h = hashlib.sha256(struct.pack(">I", seed)).digest()
            roll = int.from_bytes(h[:4], "big") / 0xFFFFFFFF
            if roll < outtake_chance:
                user_prompt += f"\nOuttake twist: {OUTTAKE_DIRECTION}"

        try:
            action = await llm_chat(ollama_host, model, SYSTEM_PROMPT, user_prompt, seed, temperature)
        except Exception as e:
            action = f"[ActionGenerator error: {e}]"

        return (action,)


NODE_CLASS_MAPPINGS = {
    "JXW_ActionGenerator": ActionGenerator,
}

NODE_DISPLAY_NAME_MAPPINGS = {
    "JXW_ActionGenerator": "Action Generator",
}

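The one piece not shown is llm_chat. I haven't reproduced the real helper here, but a minimal sketch against Ollama's /api/chat endpoint (non-streaming, with seed and temperature passed through options) would look something like this — the signature matches the call above; everything else is an assumption:

```python
import asyncio
import json
import urllib.request


def build_chat_payload(model, system_prompt, user_prompt, seed, temperature):
    """Ollama /api/chat request body: non-streaming, seeded for reproducibility."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "options": {"seed": seed, "temperature": temperature},
    }


async def llm_chat(host, model, system_prompt, user_prompt, seed, temperature):
    """POST to Ollama and return the assistant message text."""
    payload = build_chat_payload(model, system_prompt, user_prompt, seed, temperature)

    def _post():
        req = urllib.request.Request(
            f"http://{host}/api/chat",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read())

    # urllib is blocking, so run it in a worker thread to stay async-friendly.
    data = await asyncio.to_thread(_post)
    return data["message"]["content"].strip()
```

Passing the seed through Ollama's options is what makes the node's seed input actually reproduce the same action text on re-runs.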
Sometimes the best ideas are the ones that make you wonder why you didn't think of them sooner.