Pipelining

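OSDI: Elastic Resource Sharing for Distributed Deep Learning claims to be elastic, but its analysis considers only the overhead of scaling a task's own resources up or down, not the overhead of context switching. Active-standby trades memory space for lower switching overhead.

Sharing a GPU across different contexts: Multi-Instance GPU (parallel) is supported only on the NVIDIA H100, A100, and A30. Time slicing (concurrent) has been supported since the Pascal architecture. With MPS, the CUDA contexts of multiple processes are merged into a single CUDA context, so the streaming processors can be shared by different kernels, giving an effect similar to multi-process concurrency on a CPU. However, these approaches require all data to be preloaded into GPU memory. Parallel execution of kernels in multiple streams has always been supported.

Switching overhead consists of four parts: old task cleaning, new task initialization, GPU memory allocation, and model transmission via PCIe from CPU to GPU.

Observation: DNN models have a layered structure and a layer-by-layer computation pattern. The paper's idea is to transmit the model layer by layer and compute while transmitting: the core idea is pipelining model transmission over the PCIe and model computation in the GPU...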
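A rough timing model shows why layer-by-layer pipelining helps. This is only a sketch under simplifying assumptions, not the paper's implementation: the function names and the per-layer millisecond figures are hypothetical, PCIe transfers are assumed to run strictly one after another, and a layer is assumed to start computing only once it has fully arrived and the previous layer has finished.

```python
from itertools import accumulate


def sequential_time(transfer, compute):
    """Transmit the whole model first, then compute every layer."""
    return sum(transfer) + sum(compute)


def pipelined_time(transfer, compute):
    """Overlap transmission of later layers with computation of earlier ones."""
    finish = 0  # time at which the previous layer finished computing
    # accumulate(transfer) yields the arrival time of each layer on the GPU
    for arrived, cost in zip(accumulate(transfer), compute):
        # a layer needs its own weights and the previous layer's output
        finish = max(arrived, finish) + cost
    return finish


# Hypothetical per-layer costs in milliseconds (illustrative only)
transfer_ms = [4, 4, 4]
compute_ms = [2, 2, 2]

print(sequential_time(transfer_ms, compute_ms))  # 18
print(pipelined_time(transfer_ms, compute_ms))   # 14
```

In this transfer-bound example only the last layer's computation is left exposed after the final transfer, so the total drops from 18 ms to 14 ms. When computation dominates instead, the transfers are hidden almost entirely behind it.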