SOPPU

Scalable One PEFT per User

Decentralized LLM inference for scalable personalized AI using compressed PEFT (LoRA) adapters

  • Compresses each user's personal PEFT (LoRA) adapter client-side using Joint Compression (sketched after this list)
  • Serves the compressed adapters at scale through LoRAX
  • Accepted to DAI 2024 (International Conference on Distributed Artificial Intelligence)
  • Global Top 50 in Alibaba Global E-commerce Challenge
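
The repo's exact Joint Compression procedure is not reproduced here; the sketch below shows one standard way to compress many users' LoRA adapters jointly: fit a single pair of shared low-rank bases across all users' updates and keep only a small per-user core matrix. Function names and shapes are illustrative assumptions, not SOPPU's API.

```python
import numpy as np

def joint_compress(loras, k):
    """Jointly compress per-user LoRA factors (B_u, A_u) into shared bases
    plus one small k x k core matrix per user.

    loras: list of (B, A) pairs, B: (d_out, r), A: (r, d_in)
    k: shared-basis rank; must satisfy k <= min(d_out, d_in, n_users * r)
    """
    # Shared left basis: top-k left singular vectors of all B factors side by side.
    U, _, _ = np.linalg.svd(np.hstack([B for B, _ in loras]), full_matrices=False)
    U = U[:, :k]
    # Shared right basis: top-k right singular vectors of all A factors stacked.
    _, _, Vt = np.linalg.svd(np.vstack([A for _, A in loras]), full_matrices=False)
    Vt = Vt[:k, :]
    # Per-user core: project each user's full update onto the shared bases.
    cores = [U.T @ (B @ A) @ Vt.T for B, A in loras]  # each core is (k, k)
    return U, Vt, cores

def reconstruct(U, Vt, core):
    """Approximate one user's original update Delta W = B @ A."""
    return U @ core @ Vt
```

Because the shared bases are amortized over all users, each user's personal state shrinks from r * (d_out + d_in) LoRA parameters to a single k x k core.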

GitHub Repo

Tech Stack

  • Core Technologies: Large Language Models (LLMs), Transformers, PEFT
  • Infrastructure: LoRAX serving system
  • Compression: Joint Compression of LoRA adapters
  • Architecture: Client-side adapter processing + centralized inference server (request flow sketched below)
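
LoRAX exposes a text-generation-inference-compatible HTTP API in which a per-request adapter_id selects the LoRA applied on top of the shared base model; this is what makes one-adapter-per-user serving practical. A minimal client sketch follows; the URL, adapter ID, and generation parameters are placeholders rather than the project's actual configuration:

```python
import requests

LORAX_URL = "http://localhost:8080/generate"  # assumed local LoRAX deployment

def personalized_generate(prompt: str, user_adapter: str) -> str:
    """Send one request through the shared base model, applying the
    requesting user's own (compressed) LoRA via LoRAX's adapter_id."""
    resp = requests.post(
        LORAX_URL,
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 128,       # placeholder generation settings
                "adapter_id": user_adapter,  # e.g. a per-user adapter path or ID
            },
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Example: each user maps to their own adapter (hypothetical path).
# print(personalized_generate("Draft a reply to my landlord.", "adapters/user-42"))
```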

Experimental Results

Key Achievements:

  • Achieved a 1.99x compression ratio (1.7M → 856K parameters; arithmetic check below)
  • Demonstrated 49.7% average memory savings across adapters
  • Scaled to thousands of concurrent users
  • Maintained adaptation quality while reducing storage requirements
  • Layerwise analysis showed that early-layer information is better preserved
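
A quick self-consistency check of the headline numbers, using only the figures quoted above:

```python
# Figures quoted in the results above.
original_params = 1_700_000    # adapter parameters before compression
compressed_params = 856_000    # adapter parameters after Joint Compression

ratio = original_params / compressed_params
savings = 1 - compressed_params / original_params

print(f"compression ratio: {ratio:.2f}x")   # -> 1.99x
print(f"memory savings:    {savings:.1%}")  # -> 49.6%, in line with the 49.7% average
```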

Why I made it

  • Enable efficient personalization without giving up access to the shared base model
  • Provide a single serving path that scales to many user adapters
  • Protect user privacy through client-side processing of personal adapters while maintaining performance