
Start-up of the Week: Twelve Labs’ AI-powered video analysis captures attention of tech giants

Through Twelve Labs, developers can create apps based on the start-up's models to search across video footage and more

San Francisco-based tech start-up Twelve Labs is working on AI models that understand videos as well as text, which could unlock powerful new applications. The company, established in 2021, believes that videos most closely resemble the sensory inputs from the real world. Based on this philosophy, the start-up “models the world” by shipping next-generation multimodal models that push the boundaries of video understanding.

According to Jae Lee, the co-founder of Twelve Labs, the company trains video-analysing models for a range of use cases. Using these models, users can search through videos for specific moments and generate summaries. Given the potentially disruptive nature of this innovation, major players in the tech sector, including Nvidia, Samsung, and Intel, have invested in the venture.

In today’s edition of the “Start-up of the Week,” International Finance will provide a detailed look at this disruptive venture.

What Exactly Does The Start-up Do?

Twelve Labs claims to be building an “AI that perceives reality the way its users do,” adding, “Video most closely resembles the sensory inputs from the real world. We model the world by shipping next-generation multimodal models, pushing the boundaries of video understanding.”

Jae Lee, along with his tech peers Aiden Lee, SJ Kim, Dave Chung, and Soyoung Lee, formed Twelve Labs to train large language models (LLMs) to map text to what’s happening inside a video, including actions, objects, and background sounds.

Models like Google’s Gemini can search through footage, and companies like Microsoft and Amazon offer video analytics services to spot objects in clips. However, Jae Lee argues that Twelve Labs’ products stand apart with their customisation options, which allow customers to tailor models using their own data.

Through Twelve Labs, developers can create apps based on the start-up’s models to search across video footage and more. The company’s technology can drive applications like ad insertion, content moderation, and auto-generating highlight reels from clips. According to Lee, Twelve Labs will develop its own model-ethics-related benchmarks and datasets, and will conduct bias tests on all of its models before releasing them.

Essentially, the company offers a machine learning model that combines various types of data, including images, text, speech, and numbers, with intelligent processing algorithms to yield sophisticated and accurate outputs. Based on this multimodal model, Twelve Labs analyses the images and sounds in a video and matches them to human language.

The model can also create text based on the video content, edit short-form videos, and categorise videos according to user-determined standards. The venture trains its models on a mix of public domain and licensed data, and does not source customer data for training.

While video analysis remains core to Twelve Labs’ operations, the company is also branching into areas like multimodal embeddings. One of Twelve Labs’ models, Marengo, can search across images and audio in addition to video, and can accept a reference audio recording, image, or video clip to guide a search.

Here’s The Product Line-Up

Twelve Labs’ growing product portfolio has helped the start-up secure clients in the enterprise, media, and entertainment sectors. Two major partners are Databricks and Snowflake, both of which are integrating Twelve Labs’ tooling into their offerings. Databricks developed an integration that allows customers to invoke Twelve Labs’ embedding service from existing data pipelines. Snowflake, meanwhile, is creating connectors to Twelve Labs’ models in Cortex AI, its fully managed AI service.

In fact, both Databricks and Snowflake invested in Twelve Labs in December 2024 through their respective venture arms. SK Telecom and HubSpot Ventures joined in, along with In-Q-Tel, an Arlington, Virginia-based nonprofit venture capital firm that invests in start-ups supporting U.S. intelligence capabilities.

The new investments amounted to USD 30 million, bringing Twelve Labs’ total funding to USD 107.1 million. Jae Lee says the proceeds will be used for product development and hiring.

Twelve Labs’ key product has been its “Search” tool, which offers users lightning-fast, context-aware searches that pinpoint the exact moment they need in any video. Users can search in natural language or with an image to uncover semantically related moments within videos quickly. This allows them to explore a new dimension of multimodal search while gaining a deeper understanding of their video data. They can search across speech, text, audio, and visuals, making video data searchable across modalities. Clients can further fine-tune Twelve Labs’ foundational models to learn the nuances of their business, enabling searches in language that’s natural for their internal teams.
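To give a sense of what such a moment-level search might look like from a developer’s side, here is a minimal Python sketch; the host, endpoint path, field names, and response shape are illustrative assumptions rather than Twelve Labs’ documented API.

```python
# Illustrative sketch of a natural-language, moment-level video search request.
# The host, endpoint, parameters, and response fields are assumptions for
# demonstration purposes and are not Twelve Labs' documented API.
import os
import requests

API_KEY = os.environ["VIDEO_SEARCH_API_KEY"]   # hypothetical credential
BASE_URL = "https://api.example-video-ai.com"  # placeholder host

def search_moments(index_id: str, query: str, limit: int = 5) -> list[dict]:
    """Ask the service for moments in indexed videos that match a text query."""
    response = requests.post(
        f"{BASE_URL}/search",
        headers={"x-api-key": API_KEY},
        json={
            "index_id": index_id,                   # which collection of indexed videos to search
            "query_text": query,                    # natural-language description of the moment
            "search_options": ["visual", "audio"],  # consider both what is seen and what is heard
            "page_limit": limit,
        },
        timeout=30,
    )
    response.raise_for_status()
    # Each hit is assumed to carry a video ID, start/end offsets, and a relevance score.
    return response.json().get("data", [])

if __name__ == "__main__":
    for hit in search_moments("newsroom-archive", "goalkeeper saves a penalty kick"):
        print(hit["video_id"], hit["start"], hit["end"], hit["score"])
```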

This tool has the potential to revolutionise video tagging, the process of adding metadata to videos to make them more searchable and accessible. This metadata can include keywords (descriptive words detailing the video’s content and purpose), categories (ways to organise videos), timestamps (marking specific points in a video), and locations (places referenced in the footage, noted for future reference). Video tagging is used in many applications, including video search engines, video hosting platforms, and video analytics, but the process can be time-consuming and hard to scale: videos are often logged manually, transcripts miss critical visual and audio elements, and object-level tags lack context.
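As a rough illustration of the kind of record such automated tagging could produce for a single clip, the sketch below lays out one possible structure in Python; the field names and sample values are hypothetical, not a Twelve Labs schema.

```python
# Illustrative record for automated video tagging.
# Field names and values are hypothetical examples, not a Twelve Labs schema.
from dataclasses import dataclass, field

@dataclass
class VideoTagRecord:
    video_id: str
    keywords: list[str] = field(default_factory=list)           # descriptive words about the content
    categories: list[str] = field(default_factory=list)         # organisational buckets
    timestamps: dict[str, float] = field(default_factory=dict)  # label -> offset in seconds
    locations: list[str] = field(default_factory=list)          # places referenced in the footage

example = VideoTagRecord(
    video_id="match_2024_final.mp4",
    keywords=["football", "penalty kick", "celebration"],
    categories=["sports", "highlights"],
    timestamps={"penalty_kick": 512.0, "trophy_lift": 5400.5},
    locations=["Wembley Stadium"],
)
print(example)
```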

Through its multimodal embedding model, “Marengo,” Twelve Labs helps developers around the world run millions of queries, enabling them to index videos in just a quarter of the videos’ running time using the start-up’s cutting-edge infrastructure.

Using Twelve Labs’ expertise, developers can generate open-ended text tailored to their needs, from Q&A and creative suggestions to content analysis, by leveraging the start-up’s “rich embeddings.” They do so by synthesising video content into concise descriptions, highlights, and chapters that are useful for content management, exploration, and understanding. Twelve Labs also helps developers create topic categories, video titles, and hashtags, all of which are helpful for marketing, advertising, and ideation.

The medium here is “Pegasus 1,” which understands videos in their entirety, comprehending content much as its users do. The tool lets users take advantage of the start-up’s out-of-the-box templates to kickstart their video understanding, or apply custom prompts tailored to their specific use cases.
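To illustrate how a developer might request this kind of generated text for an indexed video, here is a minimal sketch; the host, endpoint, and prompt wording are hypothetical stand-ins rather than Pegasus 1’s actual interface.

```python
# Illustrative sketch of prompting a video-understanding model for generated text.
# The host, endpoint, and field names are assumptions for demonstration only.
import os
import requests

API_KEY = os.environ["VIDEO_AI_API_KEY"]       # hypothetical credential
BASE_URL = "https://api.example-video-ai.com"  # placeholder host

def generate_video_text(video_id: str, prompt: str) -> str:
    """Request open-ended text (summary, chapters, hashtags, etc.) about an indexed video."""
    response = requests.post(
        f"{BASE_URL}/generate",
        headers={"x-api-key": API_KEY},
        json={"video_id": video_id, "prompt": prompt},
        timeout=60,
    )
    response.raise_for_status()
    return response.json().get("data", "")

if __name__ == "__main__":
    # A custom prompt used in place of an out-of-the-box template.
    print(generate_video_text(
        "product_demo_01",
        "Write three candidate titles and five hashtags for this video.",
    ))
```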

Among its other innovations, the start-up also offers its “Multimodal Embed API,” which makes it easier for software developers to build features like semantic search, hybrid search, recommender systems, anomaly detection, classification, and more. The API replaces the need for siloed solutions for image, text, audio, and video by converting “rich video data into vectors in the same space.”
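To make the idea of “vectors in the same space” concrete, the sketch below ranks items of any modality against a query embedding using cosine similarity; it assumes the embeddings have already been produced by a multimodal model such as Marengo, and uses random vectors as stand-ins.

```python
# Minimal sketch of semantic search over multimodal embeddings.
# Assumes text, image, audio, and video clips have already been embedded
# into the same vector space; random vectors stand in for real embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query_vec: np.ndarray, catalogue: dict[str, np.ndarray], top_k: int = 3):
    """Rank catalogue items (any modality) by similarity to the query vector."""
    scored = [(item_id, cosine_similarity(query_vec, vec)) for item_id, vec in catalogue.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings; in practice these would come from an embedding API.
    catalogue = {f"clip_{i}": rng.normal(size=512) for i in range(10)}
    query = rng.normal(size=512)  # e.g. the embedding of a text query or reference image
    for item_id, score in semantic_search(query, catalogue):
        print(item_id, round(score, 3))
```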

Software developers can also use the “Multimodal Embed API” when training large language models, transforming workflows with embeddings that generate training data, improve data quality, and reduce the need for manual labelling. Additionally, they can use the platform to identify unusual patterns or anomalies across diverse data types.
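One simple way such embeddings can surface unusual items is to flag anything that sits far from the centroid of its collection; the sketch below illustrates that idea with synthetic vectors standing in for real embeddings.

```python
# Sketch of a simple embedding-based anomaly check:
# items far from the centroid of the collection are flagged as unusual.
import numpy as np

def find_anomalies(embeddings: dict[str, np.ndarray], z_threshold: float = 2.5) -> list[str]:
    """Flag items whose distance from the centroid exceeds z_threshold standard deviations."""
    ids = list(embeddings)
    matrix = np.stack([embeddings[i] for i in ids])
    centroid = matrix.mean(axis=0)
    distances = np.linalg.norm(matrix - centroid, axis=1)
    z_scores = (distances - distances.mean()) / distances.std()
    return [item for item, z in zip(ids, z_scores) if z > z_threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = {f"video_{i}": rng.normal(size=256) for i in range(50)}
    data["video_outlier"] = rng.normal(loc=5.0, size=256)  # synthetic outlier
    print(find_anomalies(data))
```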

The Road Ahead

Twelve Labs launched Marengo 2.7 in December 2024, a new state-of-the-art multimodal embedding model that shows more than a 15% improvement over its predecessor, Marengo 2.6. The start-up also made a significant leadership change, adding a president to its C-suite: Yoon Kim, former CTO of SK Telecom and a key architect behind Apple’s Siri. Yoon will also serve as Twelve Labs’ Chief Strategy Officer, spearheading the start-up’s aggressive expansion plans.

Meanwhile, South Korea’s top mobile carrier, SK Telecom, will now invest USD 3 million in Twelve Labs to incorporate the venture’s technology into its AI agent services.

The two companies have also agreed to collaborate on developing technologies for implementing multimodal AI in security and public safety applications, such as AI surveillance systems.

Unlike traditional surveillance systems, in which a single operator monitors numerous CCTV feeds for hours at a stretch, Twelve Labs’ multimodal AI model will enable SK Telecom to rapidly search and summarise key incidents, movements, and individuals.

According to reports, Twelve Labs will also join the K-AI Alliance, a group of Korean companies promoting AI technology, to collaborate with other members in boosting Korea’s AI ecosystem.
