Trunks

Image

ResNet and ResNext

eztorch.models.trunks.create_resnet(name, num_classes=1000, progress=True, pretrained=False, small_input=False, **kwargs)[source]

Build a ResNet from torchvision for images.

Parameters:
  • name (str) – name of the resnet model (such as resnet18).

  • num_classes (int, optional) – If not \(0\), replace the last fully connected layer with one that has num_classes outputs; if \(0\), replace it with an identity.

    Default: 1000

  • pretrained (bool, optional) – If True, returns a model pre-trained on ImageNet.

    Default: False

  • progress (bool, optional) – If True, displays a progress bar of the download to stderr.

    Default: True

  • small_input (bool, optional) – If True, replace the first conv2d with one suited to small images and replace the first maxpool with an identity.

    Default: False

  • **kwargs – arguments specific to torchvision constructors for ResNet.

Return type:

Module

Returns:

Basic resnet.
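
To illustrate why small_input matters, the sketch below traces the spatial side length of a feature map through a ResNet trunk. The standard strides (stem conv 2, max-pool 2, stages 1, 2, 2, 2) are those of torchvision ResNets. The stride-1 replacement stem is an assumption, since the exact replacement convolution is not stated here, and the helper name feature_map_size is hypothetical.

```python
import math


def feature_map_size(input_size: int, small_input: bool = False) -> int:
    """Trace the spatial side length through a ResNet trunk.

    Assumes every strided layer is padded so that out = ceil(in / stride),
    which holds for torchvision's 7x7 / 3x3 convolutions and 3x3 max-pool.
    """
    # Assumption: with small_input=True the first conv uses stride 1 and
    # the max-pool is an identity (equivalent to stride 1).
    stem_conv_stride = 1 if small_input else 2
    stem_pool_stride = 1 if small_input else 2
    stage_strides = (1, 2, 2, 2)  # layer1..layer4 in torchvision ResNets

    size = input_size
    for stride in (stem_conv_stride, stem_pool_stride, *stage_strides):
        size = math.ceil(size / stride)
    return size


print(feature_map_size(224))                   # 7
print(feature_map_size(32))                    # 1
print(feature_map_size(32, small_input=True))  # 4
```

Under these assumptions, a 32x32 image collapses to a single spatial position with the standard stem, while the small-input variant keeps a 4x4 map before global pooling.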

Timm

Timm models such as ViTs and EfficientNets are accessible through Eztorch.

eztorch.models.trunks.create_model_timm(model_name, pretrained=False, pretrained_cfg=None, checkpoint_path='', scriptable=None, exportable=None, no_jit=None, **kwargs)

Create a model

Parameters:
  • model_name (str) – name of model to instantiate

  • pretrained (bool) – load pretrained ImageNet-1k weights if true

    Default: False

  • checkpoint_path (str) – path of checkpoint to load after model is initialized

    Default: ''

  • scriptable (bool) – set layer config so that model is jit scriptable (not working for all models yet)

    Default: None

  • exportable (bool) – set layer config so that model is traceable / ONNX exportable (not fully impl/obeyed yet)

    Default: None

  • no_jit (bool) – set layer config so that model doesn’t utilize jit scripted layers (so far activations only)

    Default: None

Keyword Arguments:
  • drop_rate (float) – dropout rate for training (default: 0.0)

  • global_pool (str) – global pool type (default: 'avg')

  • ** – other kwargs are model specific

Video

Pytorchvideo

Pytorchvideo models are accessible through Eztorch when the library is installed.

Video model and Head wrapper

eztorch.models.trunks.create_video_head_model(model, head)[source]

Build a video model.

Parameters:
  • model (DictConfig) – Config for the model.

  • head (DictConfig) – Config for the head.

ResNet 3D with basic blocks

eztorch.models.trunks.create_resnet3d_basic(*, input_channel=3, model_depth=50, model_num_class=400, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(1, 7, 7), stem_conv_stride=(1, 2, 2), stem_pool=<class 'torch.nn.modules.pooling.MaxPool3d'>, stem_pool_kernel_size=(1, 3, 3), stem_pool_stride=(1, 2, 2), stem=<function create_res_basic_stem>, stage1_pool=None, stage1_pool_kernel_size=(2, 1, 1), stage_conv_a_kernel_size=((1, 3, 3), (1, 3, 3), (3, 3, 3), (3, 3, 3)), stage_conv_b_kernel_size=((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)), stage_spatial_h_stride=(1, 2, 2, 2), stage_spatial_w_stride=(1, 2, 2, 2), stage_temporal_stride=(1, 1, 1, 1), basicblock=<function create_basic_block>, head=<function create_res_basic_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 7, 7), head_output_size=(1, 1, 1), head_activation=None, head_output_with_global_average=True)[source]

Build ResNet style models for video recognition. ResNet has three parts: Stem, Stages and Head. Stem is the first Convolution layer (Conv1) with an optional pooling layer. Stages are grouped residual blocks. There are usually multiple stages and each stage may include multiple residual blocks. Head may include pooling, dropout, a fully-connected layer and global spatial temporal averaging. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters:
  • input_channel (int, optional) – Number of channels for the input video clip.

    Default: 3

  • model_depth (int, optional) – The depth of the resnet. Options include: \(18, 50, 101, 152\).

    Default: 50

  • model_num_class (int, optional) – The number of classes for the video dataset.

    Default: 400

  • dropout_rate (float, optional) – Dropout rate.

    Default: 0.5

  • norm (Callable, optional) – A callable that constructs normalization layer.

    Default: <class 'torch.nn.modules.batchnorm.BatchNorm3d'>

  • activation (Callable, optional) – A callable that constructs activation layer.

    Default: <class 'torch.nn.modules.activation.ReLU'>

  • stem_activation (Optional[Callable], optional) – A callable that constructs activation layer of stem.

    Default: <class 'torch.nn.modules.activation.ReLU'>

  • stem_dim_out (int, optional) – Output channel size of the stem.

    Default: 64

  • stem_conv_kernel_size (Tuple[int], optional) – Convolutional kernel size(s) of stem.

    Default: (1, 7, 7)

  • stem_conv_stride (Tuple[int], optional) – Convolutional stride size(s) of stem.

    Default: (1, 2, 2)

  • stem_pool (Optional[Callable], optional) – A callable that constructs the stem pooling layer.

    Default: <class 'torch.nn.modules.pooling.MaxPool3d'>

  • stem_pool_kernel_size (Tuple[int], optional) – Pooling kernel size(s).

    Default: (1, 3, 3)

  • stem_pool_stride (Tuple[int], optional) – Pooling stride size(s).

    Default: (1, 2, 2)

  • stem (Optional[Callable], optional) – A callable that constructs stem layer. Examples include: create_res_video_stem().

    Default: <function create_res_basic_stem>

  • stage_conv_a_kernel_size (Union[Tuple[int], Tuple[Tuple[int]]], optional) – Convolutional kernel size(s) for conv_a.

    Default: ((1, 3, 3), (1, 3, 3), (3, 3, 3), (3, 3, 3))

  • stage_conv_b_kernel_size (Union[Tuple[int], Tuple[Tuple[int]]], optional) – Convolutional kernel size(s) for conv_b.

    Default: ((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3))

  • stage_spatial_h_stride (Tuple[int], optional) – The spatial height stride for each stage.

    Default: (1, 2, 2, 2)

  • stage_spatial_w_stride (Tuple[int], optional) – The spatial width stride for each stage.

    Default: (1, 2, 2, 2)

  • stage_temporal_stride (Tuple[int], optional) – The temporal stride for each stage.

    Default: (1, 1, 1, 1)

  • basicblock (Union[Tuple[Callable], Callable], optional) – A callable that constructs the basic block layer. Examples include: create_basicblock_block().

    Default: <function create_basic_block>

  • head (Callable, optional) – A callable that constructs the resnet-style head. Example: create_res_basic_head.

    Default: <function create_res_basic_head>

  • head_pool (Callable, optional) – A callable that constructs resnet head pooling layer.

    Default: <class 'torch.nn.modules.pooling.AvgPool3d'>

  • head_pool_kernel_size (Tuple[int], optional) – The pooling kernel size.

    Default: (4, 7, 7)

  • head_output_size (Tuple[int], optional) – The size of output tensor for head.

    Default: (1, 1, 1)

  • head_activation (Callable, optional) – A callable that constructs activation layer.

    Default: None

  • head_output_with_global_average (bool, optional) – If True, perform global averaging on the head output.

    Default: True

Return type:

Module

Returns:

Basic resnet.
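
As a rough illustration of how the stride parameters interact, the sketch below traces the (T, H, W) shape of a clip through the stem and the four stages using the default strides from the signature. It assumes each strided layer is padded so that the output size is ceil(input / stride). The function name trace_resnet3d_basic and the 8x224x224 clip are hypothetical choices for the example, not part of Eztorch.

```python
import math


def trace_resnet3d_basic(
    clip_shape,
    stem_conv_stride=(1, 2, 2),
    stem_pool_stride=(1, 2, 2),
    stage_temporal_stride=(1, 1, 1, 1),
    stage_spatial_h_stride=(1, 2, 2, 2),
    stage_spatial_w_stride=(1, 2, 2, 2),
):
    """Return the (T, H, W) shape after the stem and after each stage."""

    def down(shape, stride):
        # Assumption: padding is chosen so that out = ceil(in / stride).
        return tuple(math.ceil(s / k) for s, k in zip(shape, stride))

    shapes = {}
    shape = down(tuple(clip_shape), stem_conv_stride)
    shape = down(shape, stem_pool_stride)
    shapes["stem"] = shape

    per_stage = zip(
        stage_temporal_stride, stage_spatial_h_stride, stage_spatial_w_stride
    )
    for index, stride in enumerate(per_stage, start=1):
        shape = down(shape, stride)
        shapes[f"stage{index}"] = shape
    return shapes


shapes = trace_resnet3d_basic((8, 224, 224))
for name, shape in shapes.items():
    print(name, shape)
```

With these defaults the last stage yields an 8x7x7 map, so the 7x7 spatial extent matches the spatial part of the default head_pool_kernel_size of (4, 7, 7).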

R2+1D

General R2+1D

eztorch.models.trunks.create_r2plus1d(*, input_channel=3, model_depth=50, model_num_class=400, dropout_rate=0.0, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(1, 7, 7), stem_conv_stride=(1, 2, 2), stage_conv_a_kernel_size=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_conv_b_kernel_size=((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3)), stage_conv_b_num_groups=(1, 1, 1, 1), stage_conv_b_dilation=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_spatial_stride=(2, 2, 2, 2), stage_temporal_stride=(1, 1, 2, 2), stage_bottleneck=(<function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>), head=<function create_res_basic_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 7, 7), head_output_size=(1, 1, 1), head_activation=<class 'torch.nn.modules.activation.Softmax'>, head_output_with_global_average=True)[source]

Build the R(2+1)D network from: A closer look at spatiotemporal convolutions for action recognition. Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri. CVPR 2018.

R(2+1)D follows the ResNet style architecture including three parts: Stem, Stages and Head. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters:
  • input_channel (int, optional) – Number of channels for the input video clip.

    Default: 3

  • model_depth (int, optional) – The depth of the resnet.

    Default: 50

  • model_num_class (int, optional) – The number of classes for the video dataset.

    Default: 400

  • dropout_rate (float, optional) – Dropout rate.

    Default: 0.0

  • norm (Callable, optional) – A callable that constructs normalization layer.

    Default: <class 'torch.nn.modules.batchnorm.BatchNorm3d'>

  • norm_eps (float, optional) – Normalization epsilon.

    Default: 1e-05

  • norm_momentum (float, optional) – Normalization momentum.

    Default: 0.1

  • activation (Callable, optional) – A callable that constructs activation layer.

    Default: <class 'torch.nn.modules.activation.ReLU'>

  • stem_dim_out (int, optional) – Output channel size for stem.

    Default: 64

  • stem_conv_kernel_size (Tuple[int], optional) – Convolutional kernel size(s) of stem.

    Default: (1, 7, 7)

  • stem_conv_stride (Tuple[int], optional) – Convolutional stride size(s) of stem.

    Default: (1, 2, 2)

  • stage_conv_a_kernel_size (Tuple[Tuple[int]], optional) – Convolutional kernel size(s) for conv_a.

    Default: ((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1))

  • stage_conv_b_kernel_size (Tuple[Tuple[int]], optional) – Convolutional kernel size(s) for conv_b.

    Default: ((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3))

  • stage_conv_b_num_groups (Tuple[int], optional) – Number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.

    Default: (1, 1, 1, 1)

  • stage_conv_b_dilation (Tuple[Tuple[int]], optional) – Dilation for 3D convolution for conv_b.

    Default: ((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1))

  • stage_spatial_stride (Tuple[int], optional) – The spatial stride for each stage.

    Default: (2, 2, 2, 2)

  • stage_temporal_stride (Tuple[int], optional) – The temporal stride for each stage.

    Default: (1, 1, 2, 2)

  • stage_bottleneck (Tuple[Callable], optional) – A callable that constructs bottleneck block layer for each stage. Examples include: create_bottleneck_block(), create_2plus1d_bottleneck_block().

    Default: (<function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>)

  • head_pool (Callable, optional) – A callable that constructs resnet head pooling layer.

    Default: <class 'torch.nn.modules.pooling.AvgPool3d'>

  • head_pool_kernel_size (Tuple[int], optional) – The pooling kernel size.

    Default: (4, 7, 7)

  • head_output_size (Tuple[int], optional) – The size of output tensor for head.

    Default: (1, 1, 1)

  • head_activation (Callable, optional) – A callable that constructs activation layer.

    Default: <class 'torch.nn.modules.activation.Softmax'>

  • head_output_with_global_average (bool, optional) – If True, perform global averaging on the head output.

    Default: True

Return type:

Module

Returns:

The R(2+1)D network.
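
The default strides and head pooling kernel are related, and the sketch below makes that relation explicit. It multiplies the stem and stage strides from the signature to get the overall temporal and spatial downsampling, then derives a clip shape whose final feature map exactly matches head_pool_kernel_size. It assumes sizes divide evenly and that the stem has no pooling layer (the signature exposes none). The helper matching_clip_shape is hypothetical.

```python
import math

# Defaults copied from the create_r2plus1d signature.
stem_conv_stride = (1, 2, 2)
stage_spatial_stride = (2, 2, 2, 2)
stage_temporal_stride = (1, 1, 2, 2)
head_pool_kernel_size = (4, 7, 7)

# Overall downsampling is the product of the stem and stage strides.
temporal_factor = stem_conv_stride[0] * math.prod(stage_temporal_stride)
spatial_factor = stem_conv_stride[1] * math.prod(stage_spatial_stride)


def matching_clip_shape(pool_kernel, t_factor, s_factor):
    """Clip shape (T, H, W) whose final feature map equals the pool kernel."""
    t, h, w = pool_kernel
    return (t * t_factor, h * s_factor, w * s_factor)


clip_shape = matching_clip_shape(
    head_pool_kernel_size, temporal_factor, spatial_factor
)
print(temporal_factor, spatial_factor)  # 4 32
print(clip_shape)                       # (16, 224, 224)
```

Under these assumptions, the defaults line up with clips of 16 frames at 224x224. Other clip sizes would likely call for a different head_pool_kernel_size.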

R2+1D18 often used in papers

eztorch.models.trunks.create_r2plus1d_18(downsample=True, num_classes=101, layers=[1, 1, 1, 1], progress=True, pretrained=False, stem=<class 'eztorch.models.trunks.r2plus1d_18.LargeR2Plus1dStem'>, **kwargs)[source]

Build R2+1D_18 from torchvision for video.

Parameters:
  • num_classes (int, optional) – If not \(0\), replace the last fully connected layer with num_classes output, if \(0\) replace by identity.

    Default: 101

  • pretrained (bool, optional) – If True, returns a model pre-trained on ImageNet.

    Default: False

  • progress (bool, optional) – If True, displays a progress bar of the download to stderr.

    Default: True

  • layers (List[int], optional) – Number of layers per block.

    Default: [1, 1, 1, 1]

  • stem (Union[str, Module], optional) – Stem to use for input.

    Default: <class 'eztorch.models.trunks.r2plus1d_18.LargeR2Plus1dStem'>

  • **kwargs – arguments specific to torchvision constructors for ResNet.

Return type:

Module

Returns:

Basic resnet.

S3D

eztorch.models.trunks.create_s3d(num_classes=101, gating=False, slow=False)[source]

Build the S3D network.

Parameters:
  • num_classes (int, optional) – If not \(0\), replace the last fully connected layer with one that has num_classes outputs; if \(0\), replace it with an identity.

    Default: 101

  • gating (bool, optional) – If True, init S3D-G network.

    Default: False

  • slow (bool, optional) – If True, use slow S3D.

    Default: False

Return type:

Module

Returns:

The S3D network instantiated.

X3D

eztorch.models.trunks.create_x3d(*, input_channel=3, input_clip_length=13, input_crop_size=160, model_num_class=400, dropout_rate=0.5, width_factor=2.0, depth_factor=2.2, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_in=12, stem_conv_kernel_size=(5, 3, 3), stem_conv_stride=(1, 2, 2), stage_conv_kernel_size=((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3)), stage_spatial_stride=(2, 2, 2, 2), stage_temporal_stride=(1, 1, 1, 1), bottleneck=<function create_x3d_bottleneck_block>, bottleneck_factor=2.25, se_ratio=0.0625, inner_act=<class 'pytorchvideo.layers.swish.Swish'>, head=<function create_x3d_head>, head_dim_out=2048, head_pool_act=<class 'torch.nn.modules.activation.ReLU'>, head_bn_lin5_on=False, head_activation=<class 'torch.nn.modules.activation.Softmax'>, head_output_with_global_average=True)[source]

X3D model builder. It builds an X3D network backbone, which is a ResNet.

Christoph Feichtenhofer. “X3D: Expanding Architectures for Efficient Video Recognition.” https://arxiv.org/abs/2004.04730

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters:
  • input_channel (int, optional) – Number of channels for the input video clip.

    Default: 3

  • input_clip_length (int, optional) – Length of the input video clip. Value for different models: X3D-XS: 4; X3D-S: 13; X3D-M: 16; X3D-L: 16.

    Default: 13

  • input_crop_size (int, optional) – Spatial resolution of the input video clip. Value for different models: X3D-XS: 160; X3D-S: 160; X3D-M: 224; X3D-L: 312.

    Default: 160

  • model_num_class (int, optional) – The number of classes for the video dataset.

    Default: 400

  • dropout_rate (float, optional) – Dropout rate.

    Default: 0.5

  • width_factor (float, optional) – Width expansion factor.

    Default: 2.0

  • depth_factor (float, optional) – Depth expansion factor. Value for different models: X3D-XS: 2.2; X3D-S: 2.2; X3D-M: 2.2; X3D-L: 5.0.

    Default: 2.2

  • norm (Callable, optional) – A callable that constructs normalization layer.

    Default: <class 'torch.nn.modules.batchnorm.BatchNorm3d'>

  • norm_eps (float, optional) – Normalization epsilon.

    Default: 1e-05

  • norm_momentum (float, optional) – Normalization momentum.

    Default: 0.1

  • activation (Callable, optional) – A callable that constructs activation layer.

    Default: <class 'torch.nn.modules.activation.ReLU'>

  • stem_dim_in (int, optional) – Input channel size for stem before expansion.

    Default: 12

  • stem_conv_kernel_size (Tuple[int], optional) – Convolutional kernel size(s) of stem.

    Default: (5, 3, 3)

  • stem_conv_stride (Tuple[int], optional) – Convolutional stride size(s) of stem.

    Default: (1, 2, 2)

  • stage_conv_kernel_size (Tuple[Tuple[int]], optional) – Convolutional kernel size(s) for conv_b.

    Default: ((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3))

  • stage_spatial_stride (Tuple[int], optional) – The spatial stride for each stage.

    Default: (2, 2, 2, 2)

  • stage_temporal_stride (Tuple[int], optional) – The temporal stride for each stage.

    Default: (1, 1, 1, 1)

  • bottleneck_factor (float, optional) – Bottleneck expansion factor for the 3x3x3 conv.

    Default: 2.25

  • se_ratio (float, optional) – If > 0, apply SE to the 3x3x3 conv, with the SE channel dimensionality being se_ratio times the 3x3x3 conv dim.

    Default: 0.0625

  • inner_act (Callable, optional) – A callable that constructs the activation layer for act_b, such as Swish.

    Default: <class 'pytorchvideo.layers.swish.Swish'>

  • head_dim_out (int, optional) – Output channel size of the X3D head.

    Default: 2048

  • head_pool_act (Callable, optional) – A callable that constructs resnet pool activation layer such as ReLU.

    Default: <class 'torch.nn.modules.activation.ReLU'>

  • head_bn_lin5_on (bool, optional) – If True, perform normalization on the features before the classifier.

    Default: False

  • head_activation (Callable, optional) – A callable that constructs activation layer.

    Default: <class 'torch.nn.modules.activation.Softmax'>

  • head_output_with_global_average (bool, optional) – If True, perform global averaging on the head output.

    Default: True

Return type:

Module

Returns:

The X3D network.
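
The per-variant values listed for input_clip_length, input_crop_size and depth_factor can be gathered into presets. The sketch below is one possible way to do so: the numbers come from the parameter descriptions above, while the names X3D_PRESETS and x3d_kwargs are hypothetical helpers, not part of Eztorch.

```python
# Per-variant values taken from the parameter descriptions above.
X3D_PRESETS = {
    "X3D-XS": {"input_clip_length": 4, "input_crop_size": 160, "depth_factor": 2.2},
    "X3D-S": {"input_clip_length": 13, "input_crop_size": 160, "depth_factor": 2.2},
    "X3D-M": {"input_clip_length": 16, "input_crop_size": 224, "depth_factor": 2.2},
    "X3D-L": {"input_clip_length": 16, "input_crop_size": 312, "depth_factor": 5.0},
}


def x3d_kwargs(variant: str, **overrides) -> dict:
    """Return keyword arguments for create_x3d for a named variant."""
    if variant not in X3D_PRESETS:
        raise ValueError(f"Unknown X3D variant: {variant!r}")
    # Explicit overrides take precedence over the preset values.
    return {**X3D_PRESETS[variant], **overrides}


kwargs = x3d_kwargs("X3D-M", model_num_class=101)
print(kwargs)

# With Eztorch installed, the result would be passed along as:
#   from eztorch.models.trunks import create_x3d
#   model = create_x3d(**kwargs)
```

The defaults in the signature (clip length 13, crop size 160, depth factor 2.2) correspond to the X3D-S row.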