Factory-AI for Deep Learning Purposes
Tutorial, CEA List, SIALV Laboratory, 2022
Taught how to effectively use the internal HPC cluster to optimize Deep Learning experiments.
The course included:
- SLURM tutorial:
  - Principles of nodes, jobs, submissions, and queues
  - How to submit a job
  - Configuration to maximize usage of partition resources
  - Multi-node and multi-process settings
- PyTorch tutorial:
  - Dataloader
  - Distributed training
    - Combined with SLURM through the environment variables it sets
  - Avoiding CPU/GPU synchronization
After this course, a noticeable number of experiments ran faster on the cluster, and computational resources were put to better use.
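
As an illustration of the "distributed training with SLURM" topic, here is a minimal sketch of how a training script might translate the variables SLURM sets for each task into the names PyTorch's `env://` initialization reads. It is not the course material itself. The helper name `slurm_to_torch_env` and the fallback values are assumptions made for this example, and `MASTER_ADDR` is assumed to be exported by the job script.

```python
import os
from typing import Dict, Mapping


def slurm_to_torch_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map SLURM per-task variables to the names used by PyTorch's env:// init.

    Hypothetical helper: the fallbacks let the script also run outside SLURM
    as a single process.
    """
    return {
        # Global index of this task across all nodes of the job.
        "RANK": env.get("SLURM_PROCID", "0"),
        # Total number of tasks launched by the job.
        "WORLD_SIZE": env.get("SLURM_NTASKS", "1"),
        # Index of this task on its own node, typically used to pick a GPU.
        "LOCAL_RANK": env.get("SLURM_LOCALID", "0"),
        # Assumed to be exported by the job script (e.g. first host of the job).
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": env.get("MASTER_PORT", "29500"),
    }


if __name__ == "__main__":
    settings = slurm_to_torch_env()
    os.environ.update(settings)
    # With PyTorch installed, the process group could then be created with:
    #   torch.distributed.init_process_group(backend="nccl", init_method="env://")
    print(settings)
```

Launched with one task per GPU (for example through `srun`), each process would receive its own rank and local rank without any hand-written bookkeeping.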