A few years ago, the idea of generating video from text sounded like sci-fi. Today, it’s… sort of real. AI systems can now create short clips from scratch, based on nothing more than a sentence like “a panda dancing in the snow.” And while the results are often surreal or just plain weird, they’re also getting better fast.
Let’s be honest: AI video isn’t about to replace real cameras, directors, or VFX teams. But it’s moving in that direction. Whether you’re a filmmaker, content creator, or just curious about what’s next, the evolution of AI video is worth following: both for what it can do, and what it still can’t.
In this article, we’ll trace the key milestones in AI video generation, from early research experiments to the latest tools like Sora and Google’s Veo. We’ll show you real examples, note where each one impressed (or failed), and map out how we got to where we are today.
TL;DR
2016
- MIT CSAIL creates short clips from still images: blurry but groundbreaking.
- First step toward motion prediction, not storytelling.
2018-2019
- Deepfakes go mainstream: Jordan Peele’s Obama PSA (and, later, the “DeepTomCruise” clips) show AI can realistically alter existing footage.
- These tools manipulate real video, not generate from scratch.
2022
- (April): Latent Diffusion Models (LDMs) are introduced by the CompVis group at LMU Munich, dramatically reducing the compute cost of generating high-resolution images by operating in a compressed latent space. This became the basis for Stable Diffusion, released later that year with backing from Stability AI.
- (May): CogVideo, from Tsinghua University, is one of the first public models to generate video from text prompts. Results are low-res and jittery, but mark a technical milestone.
- (Late): Early experiments with image-to-video diffusion use frame interpolation and basic temporal consistency layers built on top of image LDMs. Researchers begin applying diffusion models to sequential frames, but results are rudimentary.
2023
- (Early): Launch of models like ModelScope’s Text2Video and ZeroScope, which adapt latent diffusion principles to create short video clips (2-4 seconds), adding basic temporal coherence but struggling with scene stability.
- “Will Smith eating spaghetti” goes viral. It’s surreal, glitchy, and oddly mesmerising, showing how weird and memeable AI video can be.
- Runway’s Gen-1 & Gen-2 bring creator-friendly AI video tools to the public. Focus shifts to style transfer, short video prototyping, and indie storytelling.
2024
- (Mid): Models like Pika Labs, Runway Gen-2, and Stability AI’s Stable Video Diffusion mature rapidly: fine-tuned latent diffusion models explicitly trained on video datasets. These improve motion dynamics, style consistency, and prompt fidelity.
- OpenAI Sora drops, stunning people with 60-second HD video clips. Some outputs come close to full cinematic scenes: a major leap in quality.
- Google Veo enters with strong cinematic structure, multi-shot continuity, camera control, and stylistic polish.
2025
- Emerging state-of-the-art models (e.g., OpenVEO, Vidu, and others) integrate 3D-aware latent spaces, long-sequence coherence, and controllable camera motion, bringing cinematic-quality video generation from text prompts within reach.
- Viral Moment: The AI-generated “kangaroo with a passport walking through an airport” becomes the first emotionally resonant AI video hit.
Where we are now
- AI video tools are still limited: clips are short (typically 4–60 seconds), and glitches are common.
- Not yet useful for full films, but excellent for moodboards, pre-vis, concepting, and short creative content.
- We’re moving from gimmick to prototype tool, and soon, maybe to real production workflows.
- Breakthrough commercials like Kalshi’s and Popeye’s (w)rap battle redefine how advertising is made.
We offer YouTube consultancy. If you’re thinking about how AI can fit into your channel, get in touch here.
2016-2017: The Research Phase
The earliest AI video experiments weren’t designed for art or storytelling. They were, as you’d expect, academic exercises in understanding motion.
MIT’s Predictive Video
In 2016, researchers at MIT’s CSAIL lab built a system that could take a still image and generate a 1-2 second clip predicting what might happen next. For example, a beach with waves beginning to move, or a baby starting to crawl.
See what it looked like here
See how it was being reported on at the time here
The outputs were fuzzy and pretty robotic, but they proved that AI could mimic basic motion using pattern recognition. It wasn’t storytelling yet, but it was movement.
2018–2019: The Deepfake Era
The first time AI video really broke into the mainstream was through controversy, not creativity.
Using deep learning, developers began creating realistic fake videos of celebrities or politicians saying and doing things they never did. One of the first major examples came in 2018, when director Jordan Peele worked with BuzzFeed to release a deepfake of Barack Obama delivering a public service announcement.
A few years later, TikTok and YouTube saw a wave of celebrity impersonations, most famously the DeepTomCruise videos. Created by VFX artist Chris Umé, with actor Miles Fisher as the on-camera stand-in, these face-swapped clips of “Tom Cruise” doing mundane things were shockingly convincing.
Deepfakes didn’t generate video from scratch; they manipulated existing footage. But they proved how far AI could go in blurring reality.
2022: The First Text-to-Video Models
With deepfakes making headlines, researchers quietly shifted toward something more constructive: teaching AI to create entire video clips from just text.
CogVideo: Tsinghua University
In 2022, a team at Tsinghua University released CogVideo, a model trained to generate short video clips (up to 4 seconds) based on natural language prompts.
You could type a prompt and get back a few seconds of flickering, low-res footage that vaguely resembled what you’d described.
The quality wasn’t great. Most clips were jittery, inconsistent, or distorted. But conceptually, it was huge: AI was starting to imagine video from scratch.
2023: AI Video Hits the Internet (and Gets Weird)
As models became more accessible, artists and creators began playing with them, often with strange and hilarious results.
🍝 The “Will Smith Eating Spaghetti” Moment
In early 2023, a surreal video of Will Smith eating spaghetti went viral. It was grotesque, fascinating, and totally AI-generated. The facial movements were distorted, the hands were glitchy, and yet… it worked. Sort of.
This clip became a meme and a milestone. It was proof that AI could generate video without a real person in front of a camera. It wasn’t good. But it was something.
Runway Gen-1 and Gen-2
In 2023, Runway released tools that let creators transform or generate short video clips. Gen-1 let you stylize real footage (e.g. make a person walking look like a claymation figure). Gen-2 took a leap into full text-to-video: type a sentence, get back a video.
These were the first tools aimed at creators, not researchers. Limitations still applied: clips were short (4-6 seconds), character consistency was weak, and quality varied wildly. But it was somewhat usable.
If you’re curious about what each video model has to offer, make sure to check out our guide to AI video!
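Runway’s own tools live behind its web app, but if you want a feel for the same text-to-video workflow, the open ModelScope model mentioned in the timeline can be run locally through Hugging Face’s diffusers library. Below is a minimal sketch, assuming a CUDA-capable GPU, a recent diffusers install, and the public “damo-vilab/text-to-video-ms-1.7b” checkpoint (exact argument names can vary slightly between diffusers versions):

```python
# Minimal text-to-video sketch using the open ModelScope model via diffusers.
# Assumes: pip install diffusers transformers accelerate torch, plus a CUDA GPU.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the ~1.7B-parameter ModelScope text-to-video checkpoint in half precision.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")

# A handful of denoising steps and 16 frames gives roughly two seconds of footage.
prompt = "a panda dancing in the snow"
result = pipe(prompt, num_inference_steps=25, num_frames=16)

# Newer diffusers versions return a batch of frame lists; older ones return the
# frames directly, so adjust the indexing if needed.
frames = result.frames[0]

# Write the frames out as an MP4 next to the script.
export_to_video(frames, output_video_path="panda.mp4")
```

Expect exactly the kind of output described above: a few seconds of flickering, low-res motion that loosely matches the prompt. It’s a quick way to see why the 2023-era clips looked the way they did, and how far the newer models have come.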
2024-2025: The Shift Toward Cinematic Potential
The biggest leap yet came in 2024, when OpenAI’s Sora and Google’s Veo were unveiled.
Sora:
Sora is a text-to-video model that can generate high-definition, realistic video clips up to 60 seconds long. Unlike earlier tools, Sora produces:
- Coherent motion
- Consistent subjects
- Cinematic framing and depth
- Somewhat believable physics and lighting
🔗 Watch Sora’s official demo videos
Some outputs look like they came from a high-end camera. Others still fall into uncanny territory. But compared to everything before it, Sora felt like a real step forward in video generation.
Google’s Veo:
In May 2024, Google DeepMind unveiled Veo, its entry in the AI video race. And it’s clearly not aiming for quick social clips or meme fodder: of the tools so far, Veo comes closest to genuine visual storytelling. What sets Veo apart?
- Smooth, controllable camera motion (dolly shots, pans, aerials)
- Stylistic fidelity across genres – from animation to nature documentary
- Prompt editing and revision: describe what you want changed, and it adapts
🎬 Watch the official Veo demo reel here:
🔗 https://deepmind.google/technologies/veo/
While still under limited access, Veo is clearly designed with filmmakers and creative professionals in mind. It’s not flawless, but if you’re looking for an AI video tool with editorial polish, this might be the one to watch.
So Where Are We, Really?
Let’s call it like it is:
- AI video isn’t “good” yet by film standards – Outputs are short, quality varies, and storytelling is tricky. But the tools are getting more coherent, and creators are learning how to work around the glitches.
- What’s changing is who gets to create – You don’t need a camera or crew to generate motion, texture, or mood. With a prompt and some patience, anyone can prototype visual ideas that used to require a full studio.
- The future isn’t here, but it’s nearby – AI won’t replace filmmakers, but it might reshape preproduction, visual design, concept development, and maybe even animation.
- YouTube has moved to demonetise mass-produced, inauthentic AI content, and there are still questions to answer about how much of your content can be produced with AI before it becomes AI slop.
A Bold Take: Our Stance on AI Video
At Bold Content, we’re keeping a close eye on AI, but we’re certainly not replacing our cameras any time soon.
Right now, AI video is best used for early-stage concepting, moodboards, and visual exploration. It’s a creative tool, not a production solution. The results can be fun, occasionally impressive, but they’re not consistent, controllable, or client-ready in the way professional video needs to be.
We believe real storytelling still happens on set, with directors, crews, talent, and craft. That’s where the nuance lives. That’s where the human part comes in.
AI might speed things up in the background, and we’re open to that. But for us, the heart of great video hasn’t changed. It’s still about meaningful stories, told well, by real people.
If you’re looking for fast, flexible content that actually works, there’s a better option already here: modular video. It’s a proven approach we use to create tailored, scalable content across multiple platforms, using real footage, real people, and real strategy.
In other words: AI seems easy. But this is easier (and it works).
If you’re curious to learn more, make sure to get in touch!
If it’s inspiration you’re looking for, head to our portfolio!
Author Bio
Adam Neale is the CEO and Creative Director of Bold Content Video, a London-based video production agency specialising in strategic, story-driven films for global brands. With over two decades of experience in the video industry, and more than 1100 videos filmed in 43 countries, Adam has led award-winning productions across branded content, documentary, and corporate storytelling.
His work has been recognised with a Vimeo Staff Pick, a Webby Award, and honours from international film festivals, reflecting his commitment to creative excellence and innovation in visual storytelling.
Under Adam’s leadership, Bold Content Video has produced campaigns for leading organisations including Coca-Cola, the Commonwealth Secretariat, and Google. He is passionate about helping brands communicate with authenticity and purpose through the power of film.