Course Outline

Tencent Hunyuan Production Fundamentals

  • Overview of Tencent Hunyuan model serving scenarios
  • Production characteristics of large and MoE models
  • Common latency, throughput, and cost bottlenecks
  • Defining service-level objectives for inference workloads

Deployment Architecture and Serving Flow

  • Core components of a production inference stack
  • Choosing between containerized, on-premise, and cloud deployment models
  • Model loading, request routing, and GPU allocation basics
  • Designing for reliability and operational simplicity
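
As a toy illustration of the routing and GPU-allocation basics listed above, the sketch below routes each request to the least-loaded GPU worker. The class and method names are hypothetical, not part of any specific serving framework.

```python
# Illustrative least-loaded routing across GPU workers. Names are
# hypothetical; real stacks delegate this to the serving framework.
from dataclasses import dataclass

@dataclass
class GpuWorker:
    gpu_id: int
    active_requests: int = 0

class Router:
    """Route each incoming request to the least-loaded GPU worker."""
    def __init__(self, num_gpus):
        self.workers = [GpuWorker(i) for i in range(num_gpus)]

    def route(self):
        worker = min(self.workers, key=lambda w: w.active_requests)
        worker.active_requests += 1
        return worker.gpu_id

    def complete(self, gpu_id):
        self.workers[gpu_id].active_requests -= 1

router = Router(num_gpus=2)
print([router.route() for _ in range(4)])  # requests alternate across GPUs
```

Least-loaded routing is only one option; production routers often also account for KV-cache locality and expected sequence length.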

Latency Optimization in Practice

  • Using optimized inference engines such as TensorRT where applicable
  • KV-cache concepts and practical cache tuning
  • Reducing startup, warmup, and response overhead
  • Measuring time to first token and token generation speed
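
The two metrics in the last bullet can be measured with a simple timing loop around a streaming client. In this sketch `stream_tokens` is a simulated stand-in, not a real Hunyuan or engine API; only the timing logic is the point.

```python
# Minimal sketch of measuring time to first token (TTFT) and token
# generation speed. `stream_tokens` is simulated, not a real client.
import time

def stream_tokens(prompt):
    """Simulated token stream; replace with a real streaming client."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # stand-in for per-token generation latency
        yield token

def measure(prompt):
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_tokens(prompt):
        count += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttft, count / total  # seconds to first token, tokens/sec

ttft, tps = measure("ping")
print(f"TTFT: {ttft * 1000:.1f} ms, speed: {tps:.1f} tokens/s")
```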

Throughput, Batching, and GPU Efficiency

  • Continuous batching and request batching strategies
  • Managing concurrency and queue behavior
  • Improving GPU utilization without harming user experience
  • Handling long-context and mixed-workload requests
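
The core idea of continuous batching from the first bullet can be shown with a toy scheduler: new requests join the running batch as soon as earlier sequences finish, instead of waiting for the whole batch to drain. The request names and token counts are illustrative only.

```python
# Toy illustration of continuous batching: a finished sequence frees its
# batch slot immediately for the next queued request.
from collections import deque

def continuous_batch(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate). Returns a per-step
    log of which sequences were active after each decode step."""
    queue = deque(requests)
    active = {}  # name -> tokens remaining
    log = []
    while queue or active:
        # admit waiting requests into any free batch slots
        while queue and len(active) < max_batch:
            name, tokens = queue.popleft()
            active[name] = tokens
        # one decode step for every active sequence
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]  # slot freed mid-batch
        log.append(sorted(active))
    return log

print(continuous_batch([("a", 3), ("b", 1), ("c", 2)]))
```

With static batching, request "c" would have had to wait until both "a" and "b" finished; here it enters as soon as "b" completes.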

Quantization and Cost Control

  • Why quantization matters for production serving
  • Practical trade-offs of FP16, INT8, and other common precision options
  • Balancing model quality, latency, and infrastructure cost
  • Building a simple cost optimization checklist
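
The cost side of the precision trade-off comes down to simple arithmetic: weight memory scales with bytes per parameter. The 70B model size below is a hypothetical example, not a statement about Hunyuan model sizes.

```python
# Back-of-envelope weight-memory math behind the precision trade-off.
# The 70B parameter count is illustrative only.
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight memory in GB for a given precision."""
    return params_billion * 1e9 * bytes_per_param / 1e9

PARAMS_B = 70  # hypothetical 70B-parameter model
for name, bytes_pp in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS_B, bytes_pp):.0f} GB of weights")
```

Actual serving memory is higher once KV cache and activations are included, which is why cache sizing appears earlier in the outline.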

Operations, Monitoring, and Readiness Review

  • Autoscaling triggers for inference services
  • Monitoring latency, throughput, cache usage, and GPU health
  • Logging, alerting, and incident response basics
  • Reviewing a reference deployment and creating an improvement plan
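
An autoscaling trigger of the kind listed above can be sketched as a pure decision function over queue depth and GPU utilization. The thresholds are illustrative placeholders, not recommended production values.

```python
# Sketch of a simple autoscaling decision for an inference service.
# Thresholds are illustrative, not recommended values.
def scale_decision(queue_depth, gpu_util, replicas,
                   max_queue=10, high_util=0.85, low_util=0.30):
    """Return the desired replica count given current load signals."""
    if queue_depth > max_queue or gpu_util > high_util:
        return replicas + 1  # scale out under pressure
    if gpu_util < low_util and replicas > 1:
        return replicas - 1  # scale in when idle
    return replicas

print(scale_decision(queue_depth=15, gpu_util=0.9, replicas=2))  # scale out
```

In practice such a function would be fed by the monitored metrics from the second bullet and damped with cooldown windows to avoid flapping.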

Requirements

  • Basic understanding of large language model deployment and inference workflows
  • Experience with containers, cloud or on-premise infrastructure, and API-based services
  • Working knowledge of Python or system engineering tasks

Audience

  • ML engineers deploying LLMs into production
  • Platform engineers responsible for GPU-based inference services
  • Solution architects designing scalable AI serving platforms

Duration

  • 14 Hours

Upcoming Courses (minimum 5 participants)
