Factory-AI for Deep Learning Purposes

Tutorial, CEA List, SIALV Laboratory, 2022

Taught how to use the internal HPC cluster effectively to run and optimize deep learning experiments.

The course included:

SLURM Tutorial:
- Principles of nodes, jobs, submissions, queue
- How to submit a job
  - configuration to maximize usage of partition resources
  - multi-node and multi-process setting
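
As an illustration of the submission settings above, here is a minimal sketch of a helper that renders a multi-node batch script. The helper name `render_sbatch`, the job name, and all resource values are hypothetical, since the cluster's actual partitions and limits are not described here. It assumes the common pattern of one task per GPU; the `#SBATCH` options used (`--nodes`, `--ntasks-per-node`, `--gres`, `--cpus-per-task`, `--time`) are standard SLURM options.

```python
def render_sbatch(job_name, nodes, gpus_per_node, cpus_per_task, time_limit, command):
    """Render a SLURM batch script as a string (hypothetical helper).

    Assumes one task (process) per GPU, so --ntasks-per-node equals the
    number of GPUs requested on each node.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun starts one copy of the command per task across all nodes.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values only: 2 nodes, 4 GPUs each, 8 CPU cores per task.
    print(render_sbatch("train", 2, 4, 8, "02:00:00", "python train.py"))
```

The rendered text can be saved to a file and submitted with `sbatch`. Reserving several CPU cores per task leaves room for data-loading workers next to each GPU process.
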
PyTorch Tutorial:
- DataLoader
- Distributed training
  - combined with SLURM, using the environment variables it sets
- Avoiding unnecessary CPU/GPU synchronization
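
To illustrate how distributed training can rely on the variables SLURM sets, here is a minimal sketch that maps them to the variables read by PyTorch's `env://` initialization. The helper name `slurm_to_torch_env` is hypothetical. It assumes the job was launched with `srun` (so `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` are defined) and that `MASTER_ADDR` is exported by the batch script; the fallback values describe a single local process.

```python
import os
from typing import Dict, Mapping


def slurm_to_torch_env(env: Mapping[str, str] = os.environ,
                       default_port: int = 29500) -> Dict[str, str]:
    """Derive torch.distributed variables from SLURM ones (hypothetical helper)."""
    return {
        # Global rank of this process across all nodes.
        "RANK": env.get("SLURM_PROCID", "0"),
        # Total number of processes in the job.
        "WORLD_SIZE": env.get("SLURM_NTASKS", "1"),
        # Rank of this process on its own node, usually the GPU index.
        "LOCAL_RANK": env.get("SLURM_LOCALID", "0"),
        # Assumption: the batch script exports MASTER_ADDR for multi-node jobs.
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": env.get("MASTER_PORT", str(default_port)),
    }


if __name__ == "__main__":
    # Export the derived values before initializing the process group.
    os.environ.update(slurm_to_torch_env())
    print({key: os.environ[key] for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK")})
```

Once these variables are set, a training script can call `torch.distributed.init_process_group` with `init_method="env://"` and select its GPU from `LOCAL_RANK`.
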

After this course, a noticeable number of experiments ran faster on the cluster, and computational resources were used more efficiently.