Pixtral 12B, a new multimodal language model by Mistral AI, launched on September 17, 2024, is accessible online through the Le Chat service. This model can process various data types, including images and text. It competes well with other models like Anthropic’s Claude-3 and Google’s Gemini Flash-8B, especially in multimodal reasoning. Users can input images for analysis, such as converting sketches into HTML code. Pixtral shows promise in generating functional outputs but has weaknesses in text comprehension tasks.
You no longer need to download nearly 24 GB of files through BitTorrent to explore the multimodality of Pixtral. Since September 17, 2024, Pixtral 12B, an advanced language model created by French startup Mistral AI, is accessible online. Just head to Le Chat, a web service that functions similarly to ChatGPT.
Le Chat acts as a chatbot or conversational agent, allowing users to interact with Mistral’s generative artificial intelligence. Since its launch in February, this tool has made three AI models available: Mistral Nemo, Codestral, and Mistral Large 2, each offering unique features and capabilities.
Pixtral is a multimodal model, enabling it to process various types of data, including text, images, audio, and video, based on its configuration. For instance, Pixtral can accept an image as input, analyze it, and then provide a written description of the content.
An example of this capability was shown recently when Emmanuel Macron shared an image on X (formerly known as Twitter) on September 14, celebrating the successful Parade of Champions for the French Olympic team. Although Pixtral struggled with the sporting context, it successfully identified the Arc de Triomphe, although its interpretation of the colored smoke was less accurate.
To experiment with Pixtral 12-B, simply visit Le Chat, create an account, and select Pixtral from the available templates next to the text entry field. Then, you can input your prompts, with a particular focus on image uploads to enhance your experience.
Comparing Pixtral 12-B with Other Models
The introduction of Pixtral 12B on Le Chat, along with its availability on La Plateforme—an internal AI hosting solution—was paired with a technical report showcasing its performance against several competitors, such as Anthropic’s Claude-3 Haiku and Google’s Gemini Flash-8B, both designed in 2024.
Based on the evaluation of selected competitors and chosen test scenarios, Pixtral 12B often matches or exceeds the performance of its rivals. However, in text comprehension tasks involving science, math, and coding, it slightly lags behind the competition.
Mistral asserts that Pixtral significantly outperforms other open models in its class, as well as closed models like Claude-3 Haiku. The company boasts that it competes favorably with larger models, including LLaVa OneVision 72B, though there can be minimal differences in performance.
Mistral highlighted that Pixtral-12B “provides first-rate multimodal reasoning without sacrificing essential text capabilities such as instruction tracking, coding, and math.” According to Mistral, Pixtral was specifically developed to succeed the Nemo 12B model introduced in July.
Converting Visuals into Code
One notable feature that Mistral emphasized is Pixtral’s ability to transform a visual sketch into functional computer code. For instance, they showcased a concept of a website layout drawn on a loose leaf, which included a title, a menu, a ‘next’ button, and a copyright notice.
By using the prompt, “Write HTML code to create a website like this,” Pixtral generated the necessary code and provided explanations covering HTML structure, CSS styling, and a JavaScript function. While the outcome may not be extravagant, it is indeed practical.