Key Takeaways
- Transformers.js v4 now available on NPM after 11 months of development starting March 2025
- WebGPU runtime delivers 4x speedup for BERT models and supports 20B parameter models at 60 tokens/sec
- Build times dropped from 2 seconds to 200 milliseconds with 53% smaller bundle sizes
- Works across Node, Bun, Deno, and all major browsers with full offline capability
Hugging Face released Transformers.js v4 on February 9, 2026, fundamentally changing how developers deploy AI models in JavaScript environments. The library now runs state-of-the-art language models entirely in browsers without server dependencies, achieving performance that rivals desktop applications. Installation requires a single NPM command, marking the transition from GitHub-only distribution to mainstream accessibility.
WebGPU Runtime Rewrites Performance Standards
The most significant advancement centers on the complete C++ rewrite of the WebGPU Runtime developed in collaboration with Microsoft’s ONNX Runtime team. This architecture underwent testing across 200+ supported model architectures plus new v4-exclusive implementations. The runtime enables identical Transformers.js code to execute across browsers, Node.js, Bun, Deno, and desktop applications with consistent behavior.
Hugging Face engineers re-implemented models operation by operation using specialized ONNX Runtime Contrib Operators, including com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, and com.microsoft.QMoE. Adopting the com.microsoft.MultiHeadAttention operator produced a 4x speedup specifically for BERT-based embedding models. The system now supports full offline functionality by caching WASM files locally, allowing applications to run without internet connectivity after the initial download.
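The offline pattern looks roughly like the following sketch, assuming the env configuration flags from Transformers.js v3 (allowRemoteModels, allowLocalModels, localModelPath) carry over to v4; treat the exact flag names as an assumption to verify against the v4 documentation.

```javascript
import { env } from "@huggingface/transformers";

// Assumption: these env flags behave as they did in v3.
// After a first online run has populated the cache, disable remote fetches
// so the application keeps working without network access.
env.allowRemoteModels = false;    // never contact huggingface.co
env.allowLocalModels = true;      // resolve models from the local cache/path instead
env.localModelPath = "./models";  // hypothetical directory holding previously downloaded models

// Any subsequent pipeline() or model loads now resolve entirely from local files.
```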
What hardware acceleration does Transformers.js v4 support?
Transformers.js v4 leverages WebGPU for hardware acceleration across all supported JavaScript runtimes. The technology works in Chrome, Edge, Firefox, Safari, Node.js, Bun, and Deno environments. WebGPU delivers 10x faster performance compared to WebGL for transformer models.
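A minimal sketch of requesting the WebGPU backend, assuming the device option keeps the v3 convention of accepting 'webgpu', 'wasm', or 'cpu'; the model id is illustrative.

```javascript
import { pipeline } from "@huggingface/transformers";

// Example embedding model; the id is illustrative, not a v4 requirement.
const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2",
  { device: "webgpu" }  // request hardware-accelerated inference
);

// Mean-pooled, normalized sentence embedding computed on the GPU.
const embedding = await extractor("Transformers.js runs in the browser.", {
  pooling: "mean",
  normalize: true,
});
```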
Build System Migration Cuts Processing Time by 90%
Switching from Webpack to esbuild reduced build times from 2,000 milliseconds to 200 milliseconds, a 10x improvement that accelerates development iteration cycles. Bundle sizes decreased by an average of 10% across all builds, with the most dramatic change affecting transformers.web.js: this default export now measures 53% smaller than its v3 equivalent, translating to faster downloads and quicker application startup for end users.
The new build infrastructure supports the library’s transition to a pnpm workspace-based monorepo structure. This architecture allows shipping smaller, focused packages that depend on the core @huggingface/transformers library without maintaining separate repositories.
Expanded Model Support Includes 20B Parameter Systems
Transformers.js v4 introduces support for advanced architectural patterns previously unavailable in browser environments. New model architectures include:
- Mamba state-space models for efficient sequence processing
- Multi-head Latent Attention (MLA) systems
- Mixture of Experts (MoE) architectures with specialized routing
- GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE implementations
- HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM variants
Internal testing confirmed GPT-OSS 20B running at approximately 60 tokens per second on an M4 Pro Max processor using q4f16 quantization. All supported models maintain WebGPU compatibility for hardware-accelerated inference in browser and server-side JavaScript contexts.
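A sketch of what loading one of the larger models could look like, assuming the dtype option accepts 'q4f16' as in v3; the model id below is a placeholder for whatever ONNX export of GPT-OSS 20B is actually published on the Hub.

```javascript
import { pipeline } from "@huggingface/transformers";

// Placeholder id: substitute the real ONNX export of GPT-OSS 20B from the Hub.
const generator = await pipeline("text-generation", "onnx-community/gpt-oss-20b-ONNX", {
  device: "webgpu",
  dtype: "q4f16",  // 4-bit weights with fp16 activations, matching the reported benchmark
});

const output = await generator("Explain WebGPU in one sentence:", {
  max_new_tokens: 64,
});
console.log(output[0].generated_text);
```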
How large are the models that Transformers.js v4 can handle?
Transformers.js v4 supports models exceeding 8 billion parameters. Testing verified GPT-OSS 20B model performance at 60 tokens per second with q4f16 quantization on M4 Pro Max hardware. Developers should choose models under 2 billion parameters for broadest device compatibility across consumer hardware.
Repository Restructuring Improves Maintainability
The development team split the monolithic 8,000-line models.js file into modular, focused components with clear separation between utility functions, core logic, and model-specific implementations. This restructuring enables developers to add new models by focusing exclusively on model-specific code without navigating unrelated logic.
Example projects moved from the main repository to a dedicated transformers.js-examples repository, creating a cleaner codebase concentrated on core library functionality. Prettier configuration updates ensure consistent formatting across all files, with automated enforcement on future pull requests.
Standalone Tokenizers Library Offers Zero Dependencies
The @huggingface/tokenizers package represents a complete refactor of tokenization logic into a separate library measuring just 8.8kB gzipped. The implementation works seamlessly across browsers and server-side runtimes with zero external dependencies while maintaining full type safety.
This separation keeps Transformers.js core focused on model execution while providing a lightweight, versatile tool that WebML projects can integrate independently. Developers working exclusively with text preprocessing can now avoid loading the full Transformers.js library.
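A sketch of standalone tokenization, assuming @huggingface/tokenizers exposes a from_pretrained-style loader similar to the AutoTokenizer class in Transformers.js; the exact export names here are an assumption worth checking against the package's own documentation.

```javascript
// Hypothetical API: assumes the standalone package mirrors the familiar
// AutoTokenizer pattern (from_pretrained, encode, decode). Verify before use.
import { AutoTokenizer } from "@huggingface/tokenizers";

const tokenizer = await AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct");

const ids = tokenizer.encode("Hello, world!");  // token IDs for the input text
console.log(ids);
console.log(tokenizer.decode(ids));             // round-trips back to the original string
```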
Can Transformers.js v4 models run completely offline?
Yes. Transformers.js v4 caches model weights and WASM files locally after the initial download, so inference continues to work without internet connectivity. Because all computation happens on the user's device, no data leaves the client at any point.
Installation and Implementation
Developers install the preview release using the next tag on NPM until the stable version launches. The command is:

```bash
npm i @huggingface/transformers@next
```
Regular updates continue under the next tag with incremental improvements and bug fixes. The library maintains functional equivalence with Hugging Face’s Python transformers library, allowing developers to run identical pretrained models using similar API patterns.
The pipeline() function provides the fastest implementation path for pretrained model inference. Models download and cache in the browser, then execute in separate threads to prevent blocking main UI operations.
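A minimal sketch of that pattern: the pipeline loads inside a Web Worker so downloading and inference stay off the main thread. The task and model id are illustrative, and the worker/messaging code uses only standard browser APIs.

```javascript
// worker.js -- runs inside a Web Worker so inference never blocks the UI thread
import { pipeline } from "@huggingface/transformers";

// The model downloads on first use and is cached for later sessions.
const classify = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english"
);

self.onmessage = async (event) => {
  const result = await classify(event.data);  // e.g. [{ label: "POSITIVE", score: 0.99 }]
  self.postMessage(result);
};
```

```javascript
// main.js -- hand text to the worker and receive predictions asynchronously
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
worker.onmessage = (event) => console.log(event.data);
worker.postMessage("Transformers.js v4 feels fast.");
```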
Performance Comparison: v3 vs v4
| Metric | v3 | v4 | Improvement |
|---|---|---|---|
| Build Time | 2,000ms | 200ms | 10x faster |
| Bundle Size (web) | Baseline | -53% | 53% reduction |
| BERT Speedup | Baseline | 4x faster | 4x improvement |
| Model Size Support | <8B params | 20B+ params | 2.5x+ increase |
| Tokenizers Library | Integrated | 8.8kB standalone | Modular |
Privacy and Security Advantages
Running AI models locally in browsers eliminates data transmission to external servers. All inference executes on user devices, providing better privacy than cloud-based alternatives. This architecture reduces latency by removing network round trips and grants offline access after initial model download.
Organizations handling sensitive data benefit from the zero-server-communication model, as user inputs never leave the client environment. The approach also eliminates per-request API costs associated with cloud AI services.
Browser Compatibility and Requirements
Transformers.js v4 requires WebGPU support, available in:
- Chrome 113+ (May 2023 release)
- Edge 113+ (May 2023 release)
- Safari 18+ (September 2024 release)
- Firefox 134+ (January 2025 release)
Server-side JavaScript runtimes including Node.js, Bun, and Deno gained WebGPU support through the new runtime implementation. Developers should test on Chrome or Edge for the easiest initial setup due to mature WebGPU implementations.
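A small feature-detection sketch using the standard navigator.gpu check; the 'wasm' fallback device value is an assumption carried over from the v3 API.

```javascript
// navigator.gpu is only defined in browsers (and runtimes) that implement WebGPU.
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

// Pick a backend accordingly and pass it as the `device` option to pipeline().
const device = hasWebGPU ? "webgpu" : "wasm";
console.log(`Running inference on: ${device}`);
```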
What’s the recommended model size for web applications?
Models under 2 billion parameters provide optimal performance across consumer devices. Smaller models like SmolLM2-135M-Instruct offer faster loading times and broader hardware compatibility. Testing on target hardware determines practical limits based on available GPU memory and processing capability.
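For example, a small instruction-tuned model can be used in a chat-style call, assuming the text-generation pipeline accepts a messages array as it does in v3 and that ONNX weights for the model id below are available on the Hub.

```javascript
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "HuggingFaceTB/SmolLM2-135M-Instruct", {
  device: "webgpu",
  dtype: "q4",  // assumption: a small quantized variant keeps the download light
});

const messages = [{ role: "user", content: "Summarize WebGPU in two sentences." }];
const output = await generator(messages, { max_new_tokens: 96 });

// In the v3 chat form, generated_text is the full message list; the last entry is the reply.
console.log(output[0].generated_text.at(-1).content);
```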
Considerations and Trade-offs
WebGPU requires modern browser versions, limiting compatibility with devices running older operating systems. Initial model download sizes range from tens to hundreds of megabytes depending on model complexity and quantization level.
Users on limited bandwidth connections experience longer first-load times compared to API-based solutions. GPU memory constraints on lower-end devices may restrict which models run successfully, requiring developers to test across representative hardware configurations.
Development Roadmap and Future Updates
The Hugging Face team continues publishing v4 releases under the next NPM tag until the stable version launches. Regular updates address bug fixes, performance optimizations, and expanded model support.
The Examples repository receives ongoing contributions showcasing new model capabilities and implementation patterns. Developers can track progress through the official GitHub repository and participate in the open-source development process.
Frequently Asked Questions (FAQs)
How do I install Transformers.js v4?
Run npm i @huggingface/transformers@next in your project directory. The library installs under the next tag until the stable release. No additional configuration is required for basic usage.
Does Transformers.js v4 work in Node.js?
Yes, Transformers.js v4 supports Node.js, Bun, and Deno with WebGPU acceleration. The same code executes across browser and server-side environments without modification.
What performance can I expect from browser-based AI models?
Transformers.js v4 achieves 20-60 tokens per second for language models with WebGPU acceleration. BERT embedding models run 4x faster than the v3 implementation. Actual performance varies by model size and device capabilities.
Can I use Transformers.js v4 for production applications?
The library remains in preview status under the next NPM tag. Hugging Face publishes regular updates addressing stability and performance. Developers should monitor the changelog for breaking changes before stable release.
How does Transformers.js v4 compare to cloud AI APIs?
Transformers.js v4 eliminates server costs and provides offline functionality. Cloud APIs offer larger model access and no client-side resource consumption. The choice depends on privacy requirements, latency tolerance, and cost structure.
What’s the smallest usable AI model in Transformers.js v4?
SmolLM2-135M-Instruct represents one of the smallest functional language models. The 8.8kB tokenizers library handles text preprocessing independently. Developers balance model capability against download size and inference speed.
Does FP16 quantization affect model accuracy?
FP16 models run approximately 40% faster with minimal accuracy loss for most AI reasoning tasks. Transformers.js v4 supports multiple quantization levels, including the q4f16 format used when testing 20B parameter models.

