For context, I was attempting to put a cup of hot chocolate into the hands of an anime character.
https://usercontent.irccloud-cdn.com/file/DkQ5SSdT/image.png
Pixelwave generates much better horse-people, in the event I wanted that. Admittedly it doesn't have an edit function; unfortunately the illustrations I want to edit all have people in them.
Realistic photos work better, though it still doesn't beat Flux.1-dev: https://usercontent.irccloud-cdn.com/file/ZsouXNpn/image.png
It seems like truly combining visuals at the level of generation capability means language understanding is grounded in a richer world model.
I am hoping for a step up in real-world common-sense reasoning of the kind covered by SimpleBench. These are static images, though, so there might still be room for improvement as far as physics understanding goes.
Also, if they can get it to the point of being really accurate (probably with larger models), this unlocks whole industries in terms of being able to do useful work.
E.g. "I suggest moving the boiler from point A to B on the below map of the factory to reduce piping costs and heat loss"
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Previous try with some interesting introspection:
https://drive.google.com/file/d/1SCBbpDo1dAJBAz7bFABk4yBZBuz..., https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...