OpenAI Can Now Turn Words Into Ultra-Realistic Videos – CNET

AI startup OpenAI has unveiled a text-to-video model, called Sora, that could raise the bar for what’s possible in generative AI.

As with Google’s text-to-video tool Lumiere, Sora’s availability is limited. Unlike Lumiere, though, Sora can generate videos up to 1 minute long.

Piggybacking on the Sora news, AI voice generator ElevenLabs a few days later revealed that it is working on text-generated sound effects for videos.

Text-to-video has become the latest arms race in generative AI as OpenAI, Google, Microsoft and others look beyond text and image generation to cement their position in a sector projected to reach $1.3 trillion in revenue by 2032, and to win over consumers who’ve been intrigued by generative AI since ChatGPT arrived a little more than a year ago.

According to a post on Thursday from OpenAI, maker of both ChatGPT and Dall-E, Sora will be available to “red teamers,” or experts in areas like misinformation, hateful content and bias, who will be “adversarially testing the model,” as well as to visual artists, designers and filmmakers, so the company can gather feedback from creative professionals. That adversarial testing will be especially important to address the potential for convincing deepfakes, a major area of concern for the use of AI to create images and video.

In addition to garnering feedback from outside the organization, the AI startup said it wants to share its progress now to “give the public a sense of what AI capabilities are on the horizon.”


One thing that may set Sora apart is its ability to interpret long prompts — including one example that clocked in at 135 words. The sample videos OpenAI shared on Thursday demonstrate that Sora can create a variety of characters and scenes, from people and animals and fluffy monsters to cityscapes, landscapes, zen gardens and even New York City submerged underwater.

This is thanks in part to OpenAI’s past work with its Dall-E and GPT models. Text-to-image generator Dall-E 3 was released in September. CNET’s Stephen Shankland called it “a big step up from Dall-E 2 from 2022.” (OpenAI’s latest AI model, GPT-4 Turbo, arrived in November.)

In particular, Sora borrows Dall-E 3’s recaptioning technique, which OpenAI says generates “highly descriptive captions for the visual training data.”

“Sora is able to generate complex scenes with multiple characters, specific types of motion and accurate details of the subject and background,” the post said. “The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”

The sample videos OpenAI shared do appear remarkably realistic — except perhaps when a human face appears close up or when sea creatures are swimming. Otherwise, you might be hard-pressed to tell what is real and what isn’t.

The model also can generate video from still images and extend existing videos or fill in missing frames, much like Lumiere can do.

“Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI,” the post added.

AGI, or artificial general intelligence, is a more advanced form of AI that’s closer to human-like intelligence and includes the ability to perform a greater range of tasks. Meta and DeepMind have also expressed interest in reaching this benchmark.

OpenAI conceded Sora has weaknesses, like struggling to accurately depict the physics of a complex scene and to understand cause and effect.

“For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark,” the post said.

And anyone who still has to make an L with their hands to figure out which one is left can take heart: Sora mixes up left and right too.

OpenAI didn’t share when Sora will be widely available but noted it wants to take “several important safety steps” first. That includes meeting OpenAI’s existing safety standards, which prohibit extreme violence, sexual content, hateful imagery, celebrity likeness and the IP of others.

“Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it,” the post added. “That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.”

Sound effects

In a blog post about AI sound effects, ElevenLabs on Monday said it used prompts like “waves crashing,” “metal clanging,” “birds chirping” and “racing car engine” to create audio, which it overlaid on some of Sora’s AI-generated videos for added effect.

ElevenLabs did not share a release date for its text-to-sound generation tool, but the post said, “We’re thrilled by the excitement and support from the community and can’t wait to get it into your hands.”