LLM Serving Frameworks: LLMOps
The launch of GPT-3 and DALL-E ushered in the age of Generative AI and Large Language Models (LLMs). With 175 billion parameters and trained on 45 TB of text data, GPT-3 was over 100x larger than its 1.5-billion-parameter predecessor, GPT-2. The next 18 months saw a cascade of innovation, with ever larger models, capped by the launch of ChatGPT at the tail end of 2022.
The basic workflow is as follows.
To accelerate adoption, Generative AI needs an operationalized workflow, and this is where the term LLMOps comes into the picture.
Key Components of LLMOps
- Model Fine-Tuning: Adapting pre-trained LLMs to specific tasks by fine-tuning them on domain-specific data (a minimal sketch follows this list).
- Infrastructure Management: Handling the extensive computational resources needed for deploying and running LLMs, often involving GPUs or TPUs.
- Latency & Performance Optimization: Ensuring that LLMs respond within acceptable timeframes, especially when deployed in real-time applications.
- Scalability: Deploying LLMs across distributed systems to handle large-scale inference workloads.
- Security & Privacy: Managing risks related to the potential misuse of LLMs, ensuring data privacy, and protecting intellectual property.
- Bias & Fairness: Monitoring LLMs for biased outputs and implementing strategies to mitigate these biases.
- Ethical Considerations: Ensuring responsible AI practices are followed, especially considering the powerful capabilities of LLMs.
- Inference Cost Management: Optimizing the costs associated with running large models, including infrastructure and energy consumption.
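To make the fine-tuning component concrete, here is a minimal sketch of adapting a small pre-trained causal LM to domain data with Hugging Face Transformers. The base model (gpt2), the data file (domain_corpus.jsonl), and the hyperparameters are placeholders rather than recommendations; in practice you would likely pair this with parameter-efficient techniques such as LoRA and far more careful data preparation.

```python
# Minimal fine-tuning sketch with Hugging Face Transformers.
# Assumes a hypothetical JSONL file of domain examples with a "text" field.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical domain-specific corpus: one JSON object per line with a "text" key.
dataset = load_dataset("json", data_files="domain_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llm-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal (next-token) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("llm-finetuned")
```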
Considering all the above points, serving an LLM application means evaluating multiple frameworks and choosing the ones that meet our business needs.
Here is a comparison of those frameworks:
- Use vLLM when maximum speed is required for batched prompt delivery (see the sketch after this list).
- Opt for Text Generation Inference (TGI) if you need native Hugging Face support and don’t plan to use multiple adapters for the core model (a client example also follows this list).
- Consider CTranslate2 if speed is important to you and if you plan to run inference on the CPU.
- Choose OpenLLM if you want to connect adapters to the core model and utilize HuggingFace Agents, especially if you are not solely relying on PyTorch.
- Consider Ray Serve for a stable pipeline and flexible deployment. It is best suited for more mature projects.
- Utilize MLC LLM if you want to natively deploy LLMs on the client-side (edge computing), for instance, on Android or iPhone platforms.
- Use DeepSpeed-MII if you already have experience with the DeepSpeed library and wish to continue using it for deploying LLMs.
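As a concrete example of the first recommendation, below is a minimal sketch of batched offline inference with vLLM. The model name and prompts are placeholders; any Hugging Face causal model supported by vLLM can be substituted.

```python
# Minimal sketch of batched offline inference with vLLM.
from vllm import LLM, SamplingParams

prompts = [
    "Explain LLMOps in one sentence.",
    "List three risks of deploying LLMs in production.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(prompts, sampling_params)  # prompts are batched internally

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)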
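And here is a minimal sketch of querying a Text Generation Inference (TGI) server from Python via huggingface_hub.InferenceClient. It assumes a TGI server has already been started separately (for example with the official Docker image) and is listening on a placeholder local address.

```python
# Minimal sketch of calling a running TGI server.
from huggingface_hub import InferenceClient

# Placeholder endpoint; point this at wherever your TGI server is listening.
client = InferenceClient("http://localhost:8080")

response = client.text_generation(
    "What does LLMOps cover?",
    max_new_tokens=128,
)
print(response)
```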