On March 25, OpenAI released its new multimodal image-generation capability, integrated directly into the startup’s GPT-4o AI language model and set as the default image generator within the ChatGPT interface. The feature, called “4o Image Generation,” allows the model to follow prompts more accurately (with better text rendering than DALL-E 3) and to draw on chat context when given image-modification instructions.
The latest image generator, which has gone viral for its uncanny recreations of Studio Ghibli art (the distinctive visual style of the famous Japanese animation studio), is now available to everyone.
While the updated tool launched on March 25 and was always planned to span the ChatGPT Plus, Pro, Team, and Free subscription tiers, the ChatGPT creator had to delay the free rollout after demand spiked, with CEO Sam Altman joking that “our GPUs are melting.” At one point, ChatGPT picked up a million new users in a single hour.
The arrival of “4o Image Generation” can fairly be called the next big thing after OpenAI’s DALL-E 2, which launched in the spring of 2022. Back then, DALL-E 2 took the world by storm: text-to-image generation suddenly became accessible to a select group of users, creating a community of digital explorers who experienced both wonder and controversy as the technology automated the act of visual creation. The idea of a computer producing relatively photorealistic images on demand from mere text descriptions was unheard of until then.
However, like many early AI systems, DALL-E 2 struggled with consistent text rendering, often producing garbled words and phrases within images. It also had trouble following complex prompts with multiple elements, often missing key details or misinterpreting instructions. These shortcomings prompted OpenAI to introduce subsequent iterations, including the 2023 launch of DALL-E 3.
Using AI to redefine AI
“DALL-E 3, as an AI model, uses a technique called latent diffusion to pull images it recognises out of noise progressively, based on written prompts provided by a user—or, in this case, by ChatGPT. It works using the same underlying technique as other prominent image synthesis models like Stable Diffusion and Midjourney. ChatGPT and DALL-E 3 currently work hand in hand, making AI art generation an interactive and conversational experience. You tell ChatGPT (through the GPT-4 large language model) what you’d like it to generate, and it writes ideal prompts for you and submits them to the DALL-E backend. DALL-E returns the images (usually two at a time), and you see them appear through the ChatGPT interface, whether through the web or via the ChatGPT app,” states a report from Ars Technica.
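To see the DALL-E side of that pipeline in miniature, here is a minimal sketch using OpenAI’s public Python SDK. It calls the DALL-E 3 endpoint directly rather than going through ChatGPT’s prompt-rewriting layer; the prompt is illustrative, while the model name and parameters follow OpenAI’s documented Images API.

```python
# Minimal sketch: calling the DALL-E 3 backend directly through OpenAI's
# Images API (ChatGPT normally rewrites your prompt before doing this).
# Assumes the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,  # DALL-E 3 accepts only one image per request
)

print(result.data[0].url)  # URL of the generated image
```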
OpenAI likely used hundreds of millions of images found online and licensed from Shutterstock libraries to train DALL-E 3. To learn visual concepts, the AI training process typically associates words from descriptions of images found online (through captions, alt tags, and metadata) with the images themselves.
It then encodes that association in multidimensional vector form. However, the method faces one problem: the scraped captions, written by humans, aren’t always detailed or accurate, which leads to faulty associations that reduce an AI model’s ability to follow a written prompt.
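The association step can be illustrated with a toy contrastive-learning sketch in the spirit of OpenAI’s CLIP (not DALL-E’s actual training code): matching caption and image embeddings are pulled together in a shared vector space, so a bad caption drags its image toward the wrong neighbourhood. All names and dimensions below are illustrative.

```python
# Toy sketch of caption-image association via contrastive learning, in the
# spirit of OpenAI's CLIP (not DALL-E's actual training code). Matching
# (image, caption) pairs are pulled together in a shared vector space.
import torch
import torch.nn.functional as F

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    image_vecs = F.normalize(image_vecs, dim=-1)
    text_vecs = F.normalize(text_vecs, dim=-1)
    # Similarity of every image against every caption in the batch.
    logits = image_vecs @ text_vecs.T / temperature
    # The i-th image's true caption is the i-th caption; a bad scraped
    # caption makes this target noisy, weakening the learned association.
    targets = torch.arange(logits.shape[0])
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Fake batch: 8 images and 8 captions, each already encoded to 512 dims.
print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```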
To address that problem, OpenAI used AI to improve the AI model itself. As detailed in the DALL-E 3 research paper, the team trained the new model to surpass its predecessor using synthetic (AI-written) image captions generated by GPT-4V, the visual version of GPT-4. With GPT-4V writing the captions, the team produced far more accurate and detailed descriptions for the DALL-E model to learn from during training. That made a massive difference in DALL-E 3’s prompt fidelity: how accurately it renders what the written prompt describes.
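The recaptioning idea itself is easy to sketch with today’s public API: ask a vision-capable model to write a detailed caption for a training image. The snippet below is illustrative only (OpenAI’s actual captioner was a specially tuned GPT-4V, not an off-the-shelf chat call), and the image URL is a placeholder.

```python
# Sketch of the recaptioning idea: ask a vision-capable model to write a
# long, detailed synthetic caption for a training image. Illustrative only;
# OpenAI's actual captioner was a specially tuned GPT-4V, and the image
# URL here is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Write a long, detailed caption for this image, "
                     "including any prominent words visible in it."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/training-image.jpg"}},
        ],
    }],
)

print(response.choices[0].message.content)  # the synthetic caption
```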
In addition, DALL-E 3 renders accurate text far better than DALL-E 2 and some other image synthesis models, another effect of the highly detailed captioning created by GPT-4V.
“When building our captioner, we paid special attention to ensuring that it was able to include prominent words found in images in the captions it generated. As a result, DALL-E 3 can generate text when prompted,” stated the DALL-E 3 team in its paper.
And now we have “4o Image Generation,” which has taken the game to the next level. OpenAI told Ars Technica that selecting GPT-4.5 in the ChatGPT interface calls upon the same 4o-based image-generation model as selecting GPT-4o.
“Like DALL-E 2 before it, 4o IG is bound to provoke debate, as it brings sophisticated media manipulation capabilities that were once the domain of sci-fi and skilled human creators into an accessible AI tool that people can use through simple text prompts. It will also likely ignite a new round of controversy over artistic styles and copyright—but more on that below. Some users on social media initially reported confusion since there’s no UI indication of which image generator is active, but you’ll know it’s the new model if the generation is ultra slow and proceeds from top to bottom. The previous DALL-E model remains available through a dedicated DALL-E GPT interface, while API access to GPT-4o image generation is expected within weeks,” the media outlet stated.
Ensuring multimodal output
“4o Image Generation” also represents a shift to “native multimodal image generation,” where the large language model processes and outputs image data directly as tokens. That’s significant: image tokens and text tokens share the same neural network, which unlocks new flexibility in image creation and modification.
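OpenAI hasn’t published 4o’s architecture, but the token-sharing idea can be sketched conceptually. In the toy Python below (entirely illustrative, with a dummy stand-in for the model), one autoregressive loop emits tokens from a vocabulary containing both text tokens and image-codebook tokens; a separate decoder would turn the image tokens back into pixels.

```python
# Conceptual sketch of "native" multimodal generation: one autoregressive
# model emits text tokens and image tokens from a shared vocabulary, and a
# separate decoder turns image tokens back into pixels. Entirely
# illustrative; OpenAI has not published 4o's actual architecture.
import random

TEXT_VOCAB_SIZE = 50_000   # ordinary text tokens occupy ids [0, 50_000)
IMAGE_VOCAB_SIZE = 8_192   # image-patch codebook ids follow the text ids

class DummyModel:
    """Stand-in for the shared transformer; samples random image tokens."""
    def predict_next(self, tokens):
        return TEXT_VOCAB_SIZE + random.randrange(IMAGE_VOCAB_SIZE)

def generate_image_tokens(model, prompt_tokens, count=1024):
    tokens = list(prompt_tokens)  # text prompt and image share one stream
    image_tokens = []
    while len(image_tokens) < count:
        nxt = model.predict_next(tokens)  # the SAME network predicts both
        tokens.append(nxt)
        if nxt >= TEXT_VOCAB_SIZE:        # this one is an image token
            image_tokens.append(nxt)
    return image_tokens  # a VQ-style detokeniser would render these as pixels

print(len(generate_image_tokens(DummyModel(), prompt_tokens=[1, 2, 3])))
```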
GPT-4o, which launched in May 2024, was announced with these advanced multimodal image-generation capabilities; the “o” in GPT-4o was promoted as standing for “omni,” emphasising its ability to understand and create text, images, and audio. Even so, OpenAI took over 10 months to deliver the image-output functionality to users.
The wait arguably paid off: the model can now ostensibly converse using speech in real time, reading emotional cues and responding to visual input. GPT-4o also operates faster than OpenAI’s previous best model, GPT-4 Turbo, responding to audio inputs in about 320 milliseconds on average, similar to human response times in conversation and far shorter than the typical 2-3 second lag of previous models.
Before launching GPT-4o, OpenAI reportedly trained the model end to end on text, vision, and audio, so that all inputs and outputs are processed by the same neural network.
By uploading screenshots, documents containing text and images, or charts, GPT-4o users can hold conversations about the visual content and receive data analysis from the AI model. During a live media demo by OpenAI, the AI assistant also showed its ability to analyse selfies, detect emotions, and engage in light-hearted banter about the images.
Additionally, the tool exhibited improved speed and quality in more than 50 languages, which, as per OpenAI, covers 97% of the world’s population. The model also showcased its real-time translation capabilities, facilitating conversations between speakers of different languages with near-instantaneous translations.
OpenAI has met its match in Google’s Gemini 2.0 Flash, whose native image-generation capabilities are now available as an experimental feature to anyone using Google AI Studio. The multimodal technology integrates native text and image processing into one AI model. In fact, despite Studio Ghibli art stealing the headlines, Google’s model has made tech geeks take note of its ability to remove watermarks from images, albeit with artifacts and a reduction in image quality.
Gemini 2.0 Flash can add objects, remove objects, modify scenery, change lighting, attempt to change image angles, zoom in or out, and perform other transformations, all with varying levels of success depending on the subject matter, style, and image in question.
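For those who want to try Gemini’s native image output themselves, a minimal sketch using Google’s google-genai Python SDK follows; the experimental model name and the response-handling details reflect Google’s documentation at the time of writing and may change.

```python
# Minimal sketch of Gemini 2.0 Flash native image generation via the
# google-genai Python SDK. Model name and config follow Google's docs
# at the time of writing (experimental) and may change.
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental image-capable model
    contents="Add a lighthouse to this empty coastline at sunset",
    config=types.GenerateContentConfig(response_modalities=["Text", "Image"]),
)

# The response interleaves text parts and inline image parts.
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("gemini-image.png")
```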
Senior AI reporter Benj Edwards said, “To pull it off, Google trained Gemini 2.0 on a large dataset of images (converted into tokens) and text. The model’s knowledge about images occupies the same neural network space as its knowledge about world concepts from text sources, so it can directly output image tokens that get converted back into images and fed to the user.”
However, “4o Image Generation” has a drawback: it is “extremely slow,” taking anywhere from 30 seconds to one minute (or longer) to generate each image.
In the opinion of Edwards, “Even if it’s slow (for now), the ability to generate images using a purely autoregressive approach is arguably a major leap for OpenAI due to its flexibility. But it’s also very compute-intensive, since the model generates the image token by token, building it sequentially. This contrasts with diffusion-based methods like DALL-E 3, which start with random noise and gradually refine an entire image over many iterative steps.”
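To make that contrast concrete, here is a toy denoising loop in the diffusion style, refining a whole image tensor over many steps rather than emitting it token by token. The update rule is a crude stand-in; real samplers such as DDPM or DDIM use learned noise schedules and a trained denoising network.

```python
# Toy contrast: a diffusion-style sampler refines the ENTIRE image at once
# over many denoising steps, instead of emitting it token by token.
# Purely illustrative; real samplers (DDPM, DDIM) use learned noise
# schedules and a trained denoising network, not this crude update.
import torch

def diffusion_sample(denoiser, steps=50, shape=(3, 64, 64)):
    x = torch.randn(shape)            # start from pure random noise
    for t in reversed(range(steps)):  # every step touches every pixel
        predicted_noise = denoiser(x, t)
        x = x - predicted_noise / steps
    return x

dummy_denoiser = lambda x, t: 0.1 * x  # stand-in for a trained network
print(diffusion_sample(dummy_denoiser).shape)  # torch.Size([3, 64, 64])
```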
Conversational image editing
In a blog post, OpenAI positions “4o Image Generation” as moving beyond generating “surreal, breathtaking scenes” seen with earlier AI image generators and toward creating “workhorse imagery” like logos and diagrams used for communication. The AI startup further noted improved text rendering within images, a capability where previous text-to-image models often spectacularly failed, frequently turning “Happy Birthday” into something resembling alien hieroglyphics.
“OpenAI claims several key improvements: users can refine images through conversation while maintaining visual consistency; the system can analyse uploaded images and incorporate their details into new generations; and it offers stronger photorealism—although what constitutes photorealism (for example, imitations of HDR camera features, detail level, and image contrast) can be subjective. In its blog post, OpenAI provided examples of intended uses for the image generator, including creating diagrams, infographics, social media graphics using specific colour codes, logos, instruction posters, business cards, custom stock photos with transparent backgrounds, editing user photos, or visualising concepts discussed earlier in a chat conversation,” Edwards noted.
Shortly after OpenAI launched “4o Image Generation,” the AI community on X (formerly Twitter) put the feature through its paces, finding that it is quite capable of inserting someone’s face into an existing image, creating fake screenshots, and converting meme photos into the style of Studio Ghibli, South Park, felt, Muppets, Rick and Morty, Family Guy, and much more.
“It seems like we’re entering a completely fluid media reality courtesy of a tool that can effortlessly convert visual media between styles. The styles also potentially encroach upon protected intellectual property. Given what Studio Ghibli co-founder Hayao Miyazaki has previously said about AI-generated artwork (“I strongly feel that this is an insult to life itself”), it seems he’d be unlikely to appreciate the current AI-generated Ghibli fad on X at the moment,” Edwards continued.
To get a firsthand look at the true capabilities of “4o Image Generation,” Ars Technica ran some informal tests, including some of the usual CRT barbarians, queens of the universe, and beer-drinking cats.
The ChatGPT interface with the new 4o image model remains conversational (as before with DALL-E 3), and the user can suggest changes over time. Ars Technica took the author’s (Edwards’) EGA pixel bio and attempted to give it a full body; in the publication’s opinion, Google’s more limited image model did a far better job than “4o Image Generation.”
“While my pixel avatar was commissioned from the very human (and talented) Julia Minamata in 2020, I also tried to convert the inspiration image for my avatar (which features me and legendary video game engineer Ed Smith) into EGA pixel style to see what would happen. In my opinion, the result proves the continued superiority of human artistry and attention to detail,” Edwards stated.
Ars Technica also explored how many objects “4o Image Generation” could fit into an image, inspired by a 2023 tweet from animator Nathan Shipley, who evaluated DALL-E 3 shortly after its release.
“To take text generation a little further, we generated a poem about barbarians using ChatGPT, then fed it into an image prompt. The result feels roughly equivalent to diffusion-based Flux in capability, maybe slightly better, but there are still some obvious mistakes here and there, such as repeated letters. We also tested the model’s ability to create logos featuring our favourite fictional Moonshark brand. One of the logos not pictured here was delivered as a transparent PNG file with an alpha channel. This may be a useful capability for some people in a pinch, but to the extent that the model may produce ‘good enough’ (not exceptional, but looks OK at a glance) logos for the price of $0 (not including an OpenAI subscription), it may end up competing with some human logo designers, and that will likely cause some consternation among professional artists,” Ars Technica remarked, while adding, “Frankly, this model is so slow we didn’t have time to test everything before we needed to get this article out the door. It can do much more than we have shown here—such as adding items to scenes or removing them. We may explore more capabilities in a future article.”
Still, limitations exist
Ars Technica’s extensive tests with “4o Image Generation” show that, like previous AI image generators, the new model is far from flawless. While it is one of the most capable AI image generators ever created, OpenAI has acknowledged significant limitations. For example, the model sometimes crops images too tightly or includes inaccurate information (confabulations) when given vague prompts or asked to render topics it hasn’t encountered in its training data.
“The model also tends to fail when rendering more than 10-20 objects or concepts simultaneously (making tasks like generating an accurate periodic table currently impossible) and struggles with non-Latin text fonts. Image editing is currently unreliable over multiple passes, with a specific bug affecting face-editing consistency that OpenAI says it plans to fix soon. And it’s not great with dense charts or accurately rendering graphs or technical diagrams. In our testing, 4o Image Generation produced mostly accurate but flawed electronic circuit schematics,” Edwards observed.
“Even with those limitations, multimodal image generators are an early step into a much larger world of completely plastic media reality, where any pixel can be manipulated on demand with no particular photo-editing skill required. That brings with it potential benefits, ethical pitfalls, and the potential for terrible abuse. In a notable shift from DALL-E, OpenAI now allows 4o IG to generate adult public figures (not children) with certain safeguards, while letting public figures opt out if desired. Like DALL-E, the model still blocks policy-violating content requests,” he continued.
The ability of 4o Image Generation to imitate celebrity likenesses, brand logos, and Studio Ghibli films has also raised another problem for the startup: the prospect of lawsuits over copyright and consent.
On X, OpenAI CEO Sam Altman wrote, “This represents a new high-water mark for us in allowing creative freedom. People are going to create some really amazing stuff and some stuff that may offend people; what we’d like to aim for is that the tool doesn’t create offensive stuff unless you want it to, in which case, within reason, it does.”
“Zooming out, GPT-4o’s image-generation model (and the technology behind it, once open-sourced) feels like it further erodes trust in remotely produced media. While we’ve always needed to verify important media through context and trusted sources, these new tools may further expand the “deep doubt” media scepticism that’s become necessary in the age of AI. By opening up photorealistic image manipulation to the masses, more people than ever can create or alter visual media without specialised skills,” Edwards opined.
While OpenAI includes C2PA (Coalition for Content Provenance and Authenticity) metadata in all generated images, that data can be stripped away and might not matter much in the context of a deceptive social media post. But “4o Image Generation” doesn’t change what has always been true: people judge information primarily by the reputation of its messenger, not by the pixels themselves.
On the potential misuse of the new tool to generate deceptive social media posts, Altman says he is ready to take on the risks of releasing the technology into the world.
“As we talk about in our model spec, we think putting this intellectual freedom and control in the hands of users is the right thing to do, but we will observe how it goes and listen to society. We think respecting the very wide bounds society will eventually choose to set for AI is the right thing to do, and increasingly important as we get closer to AGI. Thanks in advance for the understanding as we work through this,” Sam Altman wrote on X.
