Key Takeaways
- Transformers.js v4 now available on NPM after 11 months of development starting March 2025
- WebGPU runtime delivers 4x speedup for BERT models and supports 20B parameter models at 60 tokens/sec
- Build times dropped from 2 seconds to 200 milliseconds with 53% smaller bundle sizes
- Works across Node, Bun, Deno, and all major browsers with full offline capability
Hugging Face released Transformers.js v4 on February 9, 2026, fundamentally changing how developers deploy AI models in JavaScript environments. The library now runs state-of-the-art language models entirely in browsers without server dependencies, achieving performance that rivals desktop applications. Installation requires a single NPM command, marking the transition from GitHub-only distribution to mainstream accessibility.
WebGPU Runtime Rewrites Performance Standards
The most significant advancement centers on the complete C++ rewrite of the WebGPU Runtime developed in collaboration with Microsoft’s ONNX Runtime team. This architecture underwent testing across 200+ supported model architectures plus new v4-exclusive implementations. The runtime enables identical Transformers.js code to execute across browsers, Node.js, Bun, Deno, and desktop applications with consistent behavior.
Hugging Face engineers re-implemented models operation by operation using specialized ONNX Runtime Contrib Operators, including com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, and com.microsoft.QMoE. Adopting the com.microsoft.MultiHeadAttention operator produced a 4x speedup specifically for BERT-based embedding models. The system now supports full offline functionality by caching WASM files locally, allowing applications to run without internet connectivity after the initial download.
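The offline pattern looks roughly like the following sketch, assuming the env configuration flags from Transformers.js v3 (allowRemoteModels, allowLocalModels, localModelPath) carry over to v4; treat the exact flag names as an assumption to verify against the v4 documentation.

```javascript
import { env } from "@huggingface/transformers";

// Assumption: these env flags behave as they did in v3.
// After a first online run has populated the cache, disable remote fetches
// so the application keeps working without network access.
env.allowRemoteModels = false;    // never contact huggingface.co
env.allowLocalModels = true;      // resolve models from the local cache/path instead
env.localModelPath = "./models";  // hypothetical directory holding previously downloaded models

// Any subsequent pipeline() or model loads now resolve entirely from local files.
```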
What hardware acceleration does Transformers.js v4 support?
Transformers.js v4 leverages WebGPU for hardware acceleration across all supported JavaScript runtimes. The technology works in Chrome, Edge, Firefox, Safari, Node.js, Bun, and Deno environments. WebGPU delivers 10x faster performance compared to WebGL for transformer models.
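A minimal sketch of requesting the WebGPU backend, assuming the device option keeps the v3 convention of accepting 'webgpu', 'wasm', or 'cpu'; the model id is illustrative.

```javascript
import { pipeline } from "@huggingface/transformers";

// Example embedding model; the id is illustrative, not a v4 requirement.
const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2",
  { device: "webgpu" }  // request hardware-accelerated inference
);

// Mean-pooled, normalized sentence embedding computed on the GPU.
const embedding = await extractor("Transformers.js runs in the browser.", {
  pooling: "mean",
  normalize: true,
});
```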
Build System Migration Cuts Processing Time by 90%
Switching from Webpack to esbuild reduced build times from 2,000 milliseconds to 200 milliseconds, a 10x improvement that accelerates development iteration cycles. Bundle sizes decreased by an average of 10% across all builds, with the most dramatic change affecting transformers.web.js: this default export now measures 53% smaller than its v3 equivalent, translating to faster downloads and quicker application startup for end users.
The new build infrastructure supports the library’s transition to a pnpm workspace-based monorepo structure. This architecture allows shipping smaller, focused packages that depend on the core @huggingface/transformers library without maintaining separate repositories.
Expanded Model Support Includes 20B Parameter Systems
Transformers.js v4 introduces support for advanced architectural patterns previously unavailable in browser environments. New model architectures include:
- Mamba state-space models for efficient sequence processing
- Multi-head Latent Attention (MLA) systems
- Mixture of Experts (MoE) architectures with specialized routing
- GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE implementations
- HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM variants
Internal testing confirmed GPT-OSS 20B running at approximately 60 tokens per second on an M4 Pro Max processor using q4f16 quantization. All supported models maintain WebGPU compatibility for hardware-accelerated inference in browser and server-side JavaScript contexts.
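A sketch of what loading one of the larger models could look like, assuming the dtype option accepts 'q4f16' as in v3; the model id below is a placeholder for whatever ONNX export of GPT-OSS 20B is actually published on the Hub.

```javascript
import { pipeline } from "@huggingface/transformers";

// Placeholder id: substitute the real ONNX export of GPT-OSS 20B from the Hub.
const generator = await pipeline("text-generation", "onnx-community/gpt-oss-20b-ONNX", {
  device: "webgpu",
  dtype: "q4f16",  // 4-bit weights with fp16 activations, matching the reported benchmark
});

const output = await generator("Explain WebGPU in one sentence:", {
  max_new_tokens: 64,
});
console.log(output[0].generated_text);
```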
How large are the models that Transformers.js v4 can handle?
Transformers.js v4 supports models exceeding 8 billion parameters. Testing verified GPT-OSS 20B model performance at 60 tokens per second with q4f16 quantization on M4 Pro Max hardware. Developers should choose models under 2 billion parameters for broadest device compatibility across consumer hardware.
Repository Restructuring Improves Maintainability
The development team split the monolithic 8,000-line models.js file into modular, focused components with clear separation between utility functions, core logic, and model-specific implementations. This restructuring enables developers to add new models by focusing exclusively on model-specific code without navigating unrelated logic.
Example projects moved from the main repository to a dedicated transformers.js-examples repository, creating a cleaner codebase concentrated on core library functionality. Prettier configuration updates ensure consistent formatting across all files, with automated enforcement on future pull requests.
Standalone Tokenizers Library Offers Zero Dependencies
The @huggingface/tokenizers package represents a complete refactor of tokenization logic into a separate library measuring just 8.8kB gzipped. The implementation works seamlessly across browsers and server-side runtimes with zero external dependencies while maintaining full type safety.
This separation keeps Transformers.js core focused on model execution while providing a lightweight, versatile tool that WebML projects can integrate independently. Developers working exclusively with text preprocessing can now avoid loading the full Transformers.js library.
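A sketch of standalone tokenization, assuming @huggingface/tokenizers exposes a from_pretrained-style loader similar to the AutoTokenizer class in Transformers.js; the exact export names here are an assumption worth checking against the package's own documentation.

```javascript
// Hypothetical API: assumes the standalone package mirrors the familiar
// AutoTokenizer pattern (from_pretrained, encode, decode). Verify before use.
import { AutoTokenizer } from "@huggingface/tokenizers";

const tokenizer = await AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct");

const ids = tokenizer.encode("Hello, world!");  // token IDs for the input text
console.log(ids);
console.log(tokenizer.decode(ids));             // round-trips back to the original string
```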
Can Transformers.js v4 models run completely offline?
Yes. Transformers.js v4 caches model weights and WASM files locally after the initial download, so inference continues to work without internet connectivity. Because all computation happens on the user's device, no data leaves the client at any point.
Installation and Implementation
Developers install the preview release using the next tag on NPM until the stable version launches. The command is:

```bash
npm i @huggingface/transformers@next
```
Regular updates continue under the next tag with incremental improvements and bug fixes. The library maintains functional equivalence with Hugging Face’s Python transformers library, allowing developers to run identical pretrained models using similar API patterns.
The pipeline() function provides the fastest implementation path for pretrained model inference. Models download and cache in the browser, then execute in separate threads to prevent blocking main UI operations.
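A minimal sketch of that pattern: the pipeline loads inside a Web Worker so downloading and inference stay off the main thread. The task and model id are illustrative, and the worker/messaging code uses only standard browser APIs.

```javascript
// worker.js -- runs inside a Web Worker so inference never blocks the UI thread
import { pipeline } from "@huggingface/transformers";

// The model downloads on first use and is cached for later sessions.
const classify = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english"
);

self.onmessage = async (event) => {
  const result = await classify(event.data);  // e.g. [{ label: "POSITIVE", score: 0.99 }]
  self.postMessage(result);
};
```

```javascript
// main.js -- hand text to the worker and receive predictions asynchronously
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
worker.onmessage = (event) => console.log(event.data);
worker.postMessage("Transformers.js v4 feels fast.");
```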
Performance Comparison: v3 vs v4
| Metric | v3 | v4 | Improvement |
|---|---|---|---|
| Build Time | 2,000ms | 200ms | 10x faster |
| Bundle Size (web) | Baseline | -53% | 53% reduction |
| BERT Speedup | Baseline | 4x faster | 4x improvement |
| Model Size Support | <8B params | 20B+ params | 2.5x+ increase |
| Tokenizers Library | Integrated | 8.8kB standalone | Modular |
Privacy and Security Advantages
Running AI models locally in browsers eliminates data transmission to external servers. All inference executes on user devices, providing better privacy than cloud-based alternatives. This architecture reduces latency by removing network round trips and grants offline access after initial model download.
Organizations handling sensitive data benefit from the zero-server-communication model, as user inputs never leave the client environment. The approach also eliminates per-request API costs associated with cloud AI services.
Browser Compatibility and Requirements
Transformers.js v4 requires WebGPU support, available in:
- Chrome 113+ (May 2023 release)
- Edge 113+ (May 2023 release)
- Safari 18+ (September 2024 release)
- Firefox 134+ (January 2025 release)
Server-side JavaScript runtimes including Node.js, Bun, and Deno gained WebGPU support through the new runtime implementation. Developers should test on Chrome or Edge for the easiest initial setup due to mature WebGPU implementations.
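A small feature-detection sketch using the standard navigator.gpu check; the 'wasm' fallback device value is an assumption carried over from the v3 API.

```javascript
// navigator.gpu is only defined in browsers (and runtimes) that implement WebGPU.
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

// Pick a backend accordingly and pass it as the `device` option to pipeline().
const device = hasWebGPU ? "webgpu" : "wasm";
console.log(`Running inference on: ${device}`);
```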
What’s the recommended model size for web applications?
Models under 2 billion parameters provide optimal performance across consumer devices. Smaller models like SmolLM2-135M-Instruct offer faster loading times and broader hardware compatibility. Testing on target hardware determines practical limits based on available GPU memory and processing capability.
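For example, a small instruction-tuned model can be used in a chat-style call, assuming the text-generation pipeline accepts a messages array as it does in v3 and that ONNX weights for the model id below are available on the Hub.

```javascript
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "HuggingFaceTB/SmolLM2-135M-Instruct", {
  device: "webgpu",
  dtype: "q4",  // assumption: a small quantized variant keeps the download light
});

const messages = [{ role: "user", content: "Summarize WebGPU in two sentences." }];
const output = await generator(messages, { max_new_tokens: 96 });

// In the v3 chat form, generated_text is the full message list; the last entry is the reply.
console.log(output[0].generated_text.at(-1).content);
```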
Considerations and Trade-offs
WebGPU requires modern browser versions, limiting compatibility with devices running older operating systems. Initial model download sizes range from tens to hundreds of megabytes depending on model complexity and quantization level.
Users on limited bandwidth connections experience longer first-load times compared to API-based solutions. GPU memory constraints on lower-end devices may restrict which models run successfully, requiring developers to test across representative hardware configurations.
Development Roadmap and Future Updates
The Hugging Face team continues publishing v4 releases under the next NPM tag until the stable version launches. Regular updates address bug fixes, performance optimizations, and expanded model support.
The Examples repository receives ongoing contributions showcasing new model capabilities and implementation patterns. Developers can track progress through the official GitHub repository and participate in the open-source development process.
Frequently Asked Questions (FAQs)
How do I install Transformers.js v4?
Run npm i @huggingface/transformers@next in your project directory. The library installs under the next tag until the stable release. No additional configuration is required for basic usage.
Does Transformers.js v4 work in Node.js?
Yes, Transformers.js v4 supports Node.js, Bun, and Deno with WebGPU acceleration. The same code executes across browser and server-side environments without modification.
What performance can I expect from browser-based AI models?
Transformers.js v4 achieves 20-60 tokens per second for language models with WebGPU acceleration. BERT embedding models run 4x faster than the v3 implementation. Actual performance varies by model size and device capabilities.
Can I use Transformers.js v4 for production applications?
The library remains in preview status under the next NPM tag. Hugging Face publishes regular updates addressing stability and performance. Developers should monitor the changelog for breaking changes before stable release.
How does Transformers.js v4 compare to cloud AI APIs?
Transformers.js v4 eliminates server costs and provides offline functionality. Cloud APIs offer larger model access and no client-side resource consumption. The choice depends on privacy requirements, latency tolerance, and cost structure.
What’s the smallest usable AI model in Transformers.js v4?
SmolLM2-135M-Instruct represents one of the smallest functional language models. The 8.8kB tokenizers library handles text preprocessing independently. Developers balance model capability against download size and inference speed.
Does FP16 quantization affect model accuracy?
FP16 models run approximately 40% faster with minimal accuracy loss for most AI reasoning tasks. Transformers.js v4 supports multiple quantization levels, including the q4f16 format used when testing 20B parameter models.

