In our previous discussion, Saving Your Machine Learning Model: The Good, The Bad, and the Overengineered, we dove into why the way you save your model matters—from convenience to security. Recent news from The Hacker News has further underscored that point. Let’s break down what Hugging Face is, why it’s become a standard hub for machine learning models, what security issues have arisen recently, and how to avoid similar pitfalls in your own work.
What is Hugging Face and Why Is It Important?
Hugging Face has quickly established itself as the premier repository for machine learning models thanks to several key features:
- Centralized Repository:
  Hugging Face hosts thousands of models spanning domains from natural language processing to computer vision. This makes it easier for researchers, developers, and companies to discover and reuse state-of-the-art models.
- Community-Driven Innovation:
  With an active community of contributors, the platform fosters collaboration, transparency, and rapid innovation. Users can share pre-trained models, fine-tune them, and provide feedback that helps improve the ecosystem.
- Ease of Integration:
  The hub offers seamless integration with popular frameworks like PyTorch and TensorFlow, and it provides simple APIs for dropping models into your applications (see the short sketch after this list). This ease of use is a significant factor in its growing adoption as a standard hub in the industry.
- Openness and Accessibility:
  Open sharing of models and code accelerates research and development. It democratizes access to powerful tools and techniques, ensuring that a broader audience can contribute to and benefit from the latest AI advancements.
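
To make the integration point concrete, here is a minimal sketch using the transformers library's high-level pipeline API. The task name is a standard example, and which model is actually downloaded depends on the library's defaults, so treat this as an illustration rather than a recommendation of a specific model.

```python
# A minimal sketch: pull a model from the Hugging Face Hub and run inference.
# The first call downloads the model and tokenizer, then caches them locally.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("Hugging Face makes reusing state-of-the-art models straightforward.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```

That same one-line convenience is exactly why supply-chain security matters: loading a model this easily also means trusting whatever is inside the files you just downloaded.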
The Security Issue: Malicious Models and Broken Pickle Files
The security problem recently highlighted by The Hacker News centers on malicious machine learning models hosted on Hugging Face that exploited a vulnerability in the Pickle serialization format.
What Went Wrong?
- Malicious Payloads in Models:
  Two models on Hugging Face were discovered to contain embedded malicious Python code. The models were stored in PyTorch's native format, which is essentially a compressed Pickle file. In this case, the payload was designed to launch a reverse shell, potentially giving attackers remote access to the loading system.
- Evasion Through "Broken" Pickle Files:
  The attackers used an unusual technique: they compressed the models using the 7z format rather than the default ZIP compression, which allowed them to insert the malicious payload at the beginning of the Pickle stream. Because Pickle executes opcodes sequentially during deserialization, the harmful code ran before the rest of the file was processed, evading detection by Hugging Face's security scanner, Picklescan.
- Partial Deserialization Issues:
  The way Pickle processes a stream means that even if deserialization later fails, any opcodes at the start of the stream have already executed. This gap between the scanner's expectations and the unpickler's actual behavior created a window for the malicious payload to operate undetected (a minimal sketch of this behavior follows this list).
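
To see why this class of attack works at all, here is a harmless, self-contained sketch (not the payload from the incident) of how unpickling alone can execute attacker-chosen code via an object's __reduce__ hook:

```python
# Demo: unpickling untrusted data executes code. Harmless here, but the
# callable could just as easily be os.system or a reverse-shell helper.
import pickle

class Payload:
    def __reduce__(self):
        # Pickle records "call print(...) to reconstruct this object";
        # the unpickler obligingly makes that call during loading.
        return (print, ("arbitrary code executed during pickle.loads()",))

malicious_bytes = pickle.dumps(Payload())

# The victim only loads the data -- no attribute access or method call needed.
pickle.loads(malicious_bytes)
```

Scanners like Picklescan look for dangerous opcodes before loading, but as this incident shows, a deliberately malformed stream can behave differently for the scanner than for the unpickler that actually runs it.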
How to Avoid These Security Pitfalls
This incident is a stark reminder that choosing the right model-saving format and keeping your security practices up to date are essential. Here are some recommendations to protect your models and deployments:
- Opt for Safer Serialization Formats:
  - ONNX:
    Designed for cross-platform use, ONNX does not allow arbitrary code execution, making it a much safer choice for deployment across different environments.
  - Framework-Specific Formats:
    Consider formats like TorchScript (for PyTorch) or TensorFlow's SavedModel. These formats are optimized for production and come with better security properties than Pickle (see the export sketch after these recommendations).
- Keep Security Tools Updated:
  - Ensure that any scanning tools (like Picklescan) are frequently updated to catch new evasion techniques.
  - Implement multiple layers of security testing to verify the integrity of any model files before deployment.
- Rigorous Model Validation:
  - Always perform your own testing and sandboxing of external models.
  - Validate that no hidden or malicious code is present in the model file before integrating it into your systems.
- Adopt and Advocate Best Practices:
  - Educate your team on the inherent risks of insecure serialization formats like Pickle.
  - Promote transparency and reproducibility in model saving and sharing to ensure everyone adheres to a secure workflow.
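
To make the first and third recommendations concrete, here is a minimal sketch of exporting a small PyTorch model to TorchScript and ONNX, and of loading plain checkpoints with torch.load's weights_only option. The model, shapes, and file names are illustrative assumptions, not details from the incident.

```python
# A minimal sketch of safer alternatives to pickling whole model objects.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()
example_input = torch.randn(1, 4)

# TorchScript: serializes an executable graph rather than arbitrary Python objects.
torch.jit.trace(model, example_input).save("model_scripted.pt")
reloaded = torch.jit.load("model_scripted.pt")

# ONNX: a framework-neutral graph format; loading it does not run arbitrary code.
torch.onnx.export(model, example_input, "model.onnx")

# If you exchange plain checkpoints, share only the state_dict and, on recent
# PyTorch versions, load it with weights_only=True to restrict unpickling to
# tensors and other primitive types.
torch.save(model.state_dict(), "weights.pt")
state_dict = torch.load("weights.pt", weights_only=True)
```

The exported ONNX file can then be served with a runtime such as ONNX Runtime on other platforms without ever invoking Python's unpickler.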
Conclusion
Hugging Face has undeniably become a critical hub for sharing and deploying machine learning models, empowering a new era of open and collaborative AI. However, as the platform grows and diversifies, so do the security challenges. The recent incident involving malicious models exploiting Pickle vulnerabilities serves as a crucial reminder that security must be a top priority.
By choosing safer serialization formats and keeping our security practices current, we can prevent similar incidents and ensure that the power of open models is harnessed safely. As open language models continue to evolve and become more diverse, this remains an important topic for ongoing research and development.
References
- The Hacker News: Malicious ML Models on Hugging Face Leverage Broken Pickle Format to Evade Detection
- Saving Your Machine Learning Model: The Good, The Bad, and the Overengineered
By staying informed and adopting secure practices, we can continue to innovate safely in this rapidly evolving field.