Mozilla has switched Firefox to a rapid release development cycle which means new versions will come more frequently. The non-profit organization has promised to push out a new stable build every six weeks.

The latest Mozilla Firefox Beta is now available for testing on Windows, Mac, Linux and Android. This beta includes performance enhancements that improve the browsing experience for users and enable developers to create faster Web apps and websites.

What's New

Firefox 130 is bringing a game-changing feature: automatic alt-text generation for images using a fully private on-device AI model. Initially available in the built-in PDF editor, our aim is to extend this to general browsing for screen reader users.

Why alt text?

Web pages have a fundamentally simple structure, with semantics that allow the browser to interpret the same content differently for different people based on their own needs and preferences. This is a big part of what we think makes the Web special, and what enables the browser to act as a user agent, responsible for making the Web work for people.

This is particularly useful for assistive technology such as screen readers, which are able to work alongside browser features to reduce obstacles for people to access and exchange information. For static web pages, this generally can be accomplished with very little interaction from the site, and this access has been enormously beneficial to many people.

But even for a simple static page there are certain types of information, like alternative text for images, that must be provided by the author to provide an understandable experience for people using assistive technology (as required by the spec). Unfortunately, many authors don't do this: the Web Almanac reported in 2022 that nearly half of images were missing alt text.

Until recently it's not been feasible for the browser to infer reasonably high quality alt text for images, without sending potentially sensitive data to a remote server. However, latest developments in AI have enabled this type of image analysis to happen efficiently, even on a CPU.

We are adding a feature within the PDF editor in Firefox Nightly to validate this approach. As we develop it further and learn from the deployment, our goal is to offer it for users who'd like to use it when browsing to help them better understand images which would otherwise be inaccessible.

Generating alt text with small open source models

We are using Transformer-based machine learning models to describe images. These models are getting good at describing the contents of the image, yet are compact enough to operate on devices with limited resources. While can't outperform a large language model like GPT-4 Turbo with Vision, or LLaVA, they are sufficiently accurate to provide valuable insights on-device across a diversity of hardware.

Model architectures like BLIP or even VIT that were trained on datasets like COCO (Common Object In Context) or Flickr30k are good at identifying objects in an image. When combined with a text decoder like OpenAI's GPT-2, they can produce alternative text with 200M or fewer parameters. Once quantized, these models can be under 200MB on disk, and run in a couple of seconds on a laptop – a big reduction compared to the gigabytes and resources an LLM requires.

Example Output

The image below (pulled from the COCO dataset) is described by:

  • FIREFOX – our 182M parameters model using a Distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder.
  • BASELINE MODEL – a slightly bigger ViT+GPT-2 model
  • HUMAN TEXT – the description provided by the dataset annotator.

Both small models lose accuracy compared to the description provided by a person, and the baseline model is confused by the hands position. The Firefox model is doing slightly better in that case, and captures what is important.

What matters can be suggestive in any case. Notice how the person did not write about the office settings or the cherries on the cake, and specified that the candles were long.

If we run the same image on a model like GPT-4o, the results are extremely detailed:

The image depicts a group of people gathered around a cake with lit candles. The focus is on the cake, which has a red jelly topping and a couple of cherries. There are several lit candles in the foreground. In the background, there is a woman smiling, wearing a gray turtleneck sweater, and a few other people can be seen, likely in an office or indoor setting. The image conveys a celebratory atmosphere, possibly a birthday or a special occasion.

But such level of detail in alt text is overwhelming and doesn't prioritize the most important information. Brevity is not the only goal, but it's a helpful starting point, and pithy accuracy in a first draft allows content creators to focus their edits on missing context and details.

So if we ask the LLM for a one-sentence description, we get:

A group of people in an office celebrates with a lit birthday cake in the foreground and a smiling woman in the background.

This has more detail than our small model, but can't be run locally without sending your image to a server.

Small is beautiful

Running inference locally with small models offers many advantages:

  • Privacy: All operations are contained within the device, ensuring data privacy. We won't have access to your images, PDF content, generated captions, or final captions. Your data will not be used to train the model.
  • Resource Efficiency: Small models eliminate the need for high-powered GPUs in the cloud, reducing resource consumption and making it more environmentally friendly.
  • Increased Transparency: In-house management of models allows for direct oversight of the training datasets, offering more transparency compared to some large language models (LLMs).
  • Carbon Footprint Monitoring: Training models in-house facilitates precise tracking of CO2 emissions using tools such as CodeCarbon.
  • Ease of Improvement: Since retraining can be completed in less than a day on a single piece of hardware, it allows for frequent updates and enhancements of the model.