Trunks

Image

ResNet and ResNeXt

eztorch.models.trunks.create_resnet(name, num_classes=1000, progress=True, pretrained=False, small_input=False, **kwargs)

Build a torchvision ResNet for images.

Parameters:

name (str) – name of the resnet model (such as resnet18).
num_classes (int, optional) – If not \(0\), replace the last fully connected layer with one that has num_classes outputs; if \(0\), replace it with an identity.
Default: 1000
pretrained (bool, optional) – If True, returns a model pre-trained on ImageNet.
Default: False
progress (bool, optional) – If True, displays a progress bar of the download to stderr.
Default: True
small_input (bool, optional) – If True, replace the first conv2d with one suited to small images and replace the first maxpool with an identity.
Default: False
**kwargs – arguments specific to torchvision constructors for ResNet.

Return type:

Module

Returns:

Basic resnet.
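
The effect of small_input and num_classes can be sketched with plain arithmetic. This is a minimal illustration, not Eztorch's implementation: it assumes the torchvision ResNet stride layout (conv1, maxpool, then four stages) and, as a labeled assumption, that small_input=True uses a stride-1 first convolution. The helper names final_feature_size and output_dim are hypothetical.

```python
from math import ceil

# Spatial strides of a torchvision-style ResNet, in order:
# conv1, maxpool, layer1, layer2, layer3, layer4.
STANDARD_STRIDES = (2, 2, 1, 2, 2, 2)

# Assumption: with small_input=True the replacement conv1 has stride 1
# and the maxpool is an identity (stride 1).
SMALL_INPUT_STRIDES = (1, 1, 1, 2, 2, 2)


def final_feature_size(image_size, strides):
    """Side length of the last feature map, assuming each strided layer
    is padded so that its output is ceil(input / stride)."""
    size = image_size
    for stride in strides:
        size = ceil(size / stride)
    return size


def output_dim(feature_dim, num_classes):
    """Output width of the trunk: num_classes logits when num_classes != 0,
    otherwise the raw feature vector (last layer replaced by an identity)."""
    return num_classes if num_classes != 0 else feature_dim


if __name__ == "__main__":
    print(final_feature_size(224, STANDARD_STRIDES))
    print(final_feature_size(32, STANDARD_STRIDES))
    print(final_feature_size(32, SMALL_INPUT_STRIDES))
    print(output_dim(512, 0))
```

Under these assumptions a 32×32 image collapses to a 1×1 feature map with the standard stem but keeps a 4×4 map with the small-input stem, which is the usual motivation for such an option on small images.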

Timm

Timm models are accessible through Eztorch, for example to retrieve ViTs or EfficientNets.

eztorch.models.trunks.create_model_timm(model_name, pretrained=False, pretrained_cfg=None, checkpoint_path='', scriptable=None, exportable=None, no_jit=None, **kwargs)

Create a model.

Parameters:

model_name (str) – name of model to instantiate
pretrained (bool) – load pretrained ImageNet-1k weights if true
Default: False
checkpoint_path (str) – path of checkpoint to load after model is initialized
Default: ''
scriptable (bool) – set layer config so that model is jit scriptable (not working for all models yet)
Default: None
exportable (bool) – set layer config so that model is traceable / ONNX exportable (not fully impl/obeyed yet)
Default: None
no_jit (bool) – set layer config so that model doesn’t utilize jit scripted layers (so far activations only)
Default: None

Keyword Arguments:

drop_rate (float) – dropout rate for training (default: 0.0)
global_pool (str) – global pool type (default: 'avg')
**kwargs – other keyword arguments are model specific

Video

PyTorchVideo

If the PyTorchVideo library is installed, its model builders can be used through Eztorch to retrieve its models.

Video model and Head wrapper

eztorch.models.trunks.create_video_head_model(model, head)

Build a video model.

Parameters:

model (DictConfig) – Config for the model.
head (DictConfig) – Config for the head.

ResNet 3D with basic blocks

eztorch.models.trunks.create_resnet3d_basic(*, input_channel=3, model_depth=50, model_num_class=400, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(1, 7, 7), stem_conv_stride=(1, 2, 2), stem_pool=<class 'torch.nn.modules.pooling.MaxPool3d'>, stem_pool_kernel_size=(1, 3, 3), stem_pool_stride=(1, 2, 2), stem=<function create_res_basic_stem>, stage1_pool=None, stage1_pool_kernel_size=(2, 1, 1), stage_conv_a_kernel_size=((1, 3, 3), (1, 3, 3), (3, 3, 3), (3, 3, 3)), stage_conv_b_kernel_size=((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)), stage_spatial_h_stride=(1, 2, 2, 2), stage_spatial_w_stride=(1, 2, 2, 2), stage_temporal_stride=(1, 1, 1, 1), basicblock=<function create_basic_block>, head=<function create_res_basic_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 7, 7), head_output_size=(1, 1, 1), head_activation=None, head_output_with_global_average=True)

Build ResNet-style models for video recognition. A ResNet has three parts: Stem, Stages and Head. The Stem is the first convolution layer (Conv1) with an optional pooling layer. Stages are groups of residual blocks; there are usually multiple stages, and each stage may include multiple residual blocks. The Head may include pooling, dropout, a fully-connected layer and global spatiotemporal averaging. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters:

input_channel (int, optional) – Number of channels for the input video clip.
Default: 3
model_depth (int, optional) – The depth of the resnet. Options include: \(18, 50, 101, 152\).
Default: 50
model_num_class (int, optional) – The number of classes for the video dataset.
Default: 400
dropout_rate (float, optional) – Dropout rate.
Default: 0.5
norm (Callable, optional) – A callable that constructs normalization layer.
Default: <class 'torch.nn.modules.batchnorm.BatchNorm3d'>
activation (Callable, optional) – A callable that constructs activation layer.
Default: <class 'torch.nn.modules.activation.ReLU'>
stem_activation (Optional[Callable], optional) – A callable that constructs activation layer of stem.
Default: <class 'torch.nn.modules.activation.ReLU'>
stem_dim_out (int, optional) – Output channel size of the stem.
Default: 64
stem_conv_kernel_size (Tuple[int], optional) – Convolutional kernel size(s) of stem.
Default: (1, 7, 7)
stem_conv_stride (Tuple[int], optional) – Convolutional stride size(s) of stem.
Default: (1, 2, 2)
stem_pool (Optional[Callable], optional) – A callable that constructs the stem pooling layer.
Default: <class 'torch.nn.modules.pooling.MaxPool3d'>
stem_pool_kernel_size (Tuple[int], optional) – Pooling kernel size(s).
Default: (1, 3, 3)
stem_pool_stride (Tuple[int], optional) – Pooling stride size(s).
Default: (1, 2, 2)
stem (Optional[Callable], optional) – A callable that constructs stem layer. Examples include: create_res_video_stem().
Default: <function create_res_basic_stem>
stage_conv_a_kernel_size (Union[Tuple[int], Tuple[Tuple[int]]], optional) – Convolutional kernel size(s) for conv_a.
Default: ((1, 3, 3), (1, 3, 3), (3, 3, 3), (3, 3, 3))
stage_conv_b_kernel_size (Union[Tuple[int], Tuple[Tuple[int]]], optional) – Convolutional kernel size(s) for conv_b.
Default: ((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3))
stage_spatial_h_stride (Tuple[int], optional) – The spatial height stride for each stage.
Default: (1, 2, 2, 2)
stage_spatial_w_stride (Tuple[int], optional) – The spatial width stride for each stage.
Default: (1, 2, 2, 2)
stage_temporal_stride (Tuple[int], optional) – The temporal stride for each stage.
Default: (1, 1, 1, 1)
basicblock (Union[Tuple[Callable], Callable], optional) – A callable that constructs the basic block layer. Examples include: create_basicblock_block().
Default: <function create_basic_block>
head (Callable, optional) – A callable that constructs the resnet-style head. Example: create_res_basic_head.
Default: <function create_res_basic_head>
head_pool (Callable, optional) – A callable that constructs resnet head pooling layer.
Default: <class 'torch.nn.modules.pooling.AvgPool3d'>
head_pool_kernel_size (Tuple[int], optional) – The pooling kernel size.
Default: (4, 7, 7)
head_output_size (Tuple[int], optional) – The size of output tensor for head.
Default: (1, 1, 1)
head_activation (Callable, optional) – A callable that constructs activation layer.
Default: None
head_output_with_global_average (bool, optional) – if True, perform global averaging on the head output.
Default: True

Return type:

Module

Returns:

Basic resnet.
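
The stride defaults listed above determine the feature-map shape that reaches the head. The following sketch traces a clip through the stem and the four stages using only the documented default strides. It assumes every strided layer is padded so that its output is ceil(input / stride) and that stage1_pool stays at None; the function names are hypothetical and this is not the library's code.

```python
from math import ceil


def downsample(shape, stride):
    """Apply one strided layer to a (T, H, W) shape, assuming padding
    such that each output dimension is ceil(input / stride)."""
    return tuple(ceil(n / s) for n, s in zip(shape, stride))


def resnet3d_feature_shape(
    clip_shape,
    stem_conv_stride=(1, 2, 2),
    stem_pool_stride=(1, 2, 2),
    stage_spatial_h_stride=(1, 2, 2, 2),
    stage_spatial_w_stride=(1, 2, 2, 2),
    stage_temporal_stride=(1, 1, 1, 1),
):
    """(T, H, W) shape of the feature map entering the head."""
    shape = downsample(clip_shape, stem_conv_stride)
    shape = downsample(shape, stem_pool_stride)
    # One (temporal, height, width) stride triple per stage.
    for stride in zip(
        stage_temporal_stride, stage_spatial_h_stride, stage_spatial_w_stride
    ):
        shape = downsample(shape, stride)
    return shape


if __name__ == "__main__":
    print(resnet3d_feature_shape((4, 224, 224)))
```

With these defaults a 4-frame 224×224 clip yields a (4, 7, 7) feature map, which is exactly the default head_pool_kernel_size. Other clip sizes would likely call for a different head_pool_kernel_size.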

R2+1D

General R2+1D

eztorch.models.trunks.create_r2plus1d(*, input_channel=3, model_depth=50, model_num_class=400, dropout_rate=0.0, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(1, 7, 7), stem_conv_stride=(1, 2, 2), stage_conv_a_kernel_size=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_conv_b_kernel_size=((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3)), stage_conv_b_num_groups=(1, 1, 1, 1), stage_conv_b_dilation=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_spatial_stride=(2, 2, 2, 2), stage_temporal_stride=(1, 1, 2, 2), stage_bottleneck=(<function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>), head=<function create_res_basic_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 7, 7), head_output_size=(1, 1, 1), head_activation=<class 'torch.nn.modules.activation.Softmax'>, head_output_with_global_average=True)

Build the R(2+1)D network from: A closer look at spatiotemporal convolutions for action recognition. Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri. CVPR 2018.

R(2+1)D follows the ResNet style architecture including three parts: Stem, Stages and Head. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters:

input_channel (int, optional) – Number of channels for the input video clip.
Default: 3
model_depth (int, optional) – The depth of the resnet.
Default: 50
model_num_class (int, optional) – The number of classes for the video dataset.
Default: 400
dropout_rate (float, optional) – Dropout rate.
Default: 0.0
norm (Callable, optional) – A callable that constructs normalization layer.
Default: <class 'torch.nn.modules.batchnorm.BatchNorm3d'>
norm_eps (float, optional) – Normalization epsilon.
Default: 1e-05
norm_momentum (float, optional) – Normalization momentum.
Default: 0.1
activation (Callable, optional) – A callable that constructs activation layer.
Default: <class 'torch.nn.modules.activation.ReLU'>
stem_dim_out (int, optional) – Output channel size for stem.
Default: 64
stem_conv_kernel_size (Tuple[int], optional) – Convolutional kernel size(s) of stem.
Default: (1, 7, 7)
stem_conv_stride (Tuple[int], optional) – Convolutional stride size(s) of stem.
Default: (1, 2, 2)
stage_conv_a_kernel_size (Tuple[Tuple[int]], optional) – Convolutional kernel size(s) for conv_a.
Default: ((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1))
stage_conv_b_kernel_size (Tuple[Tuple[int]], optional) – Convolutional kernel size(s) for conv_b.
Default: ((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3))
stage_conv_b_num_groups (Tuple[int], optional) – Number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
Default: (1, 1, 1, 1)
stage_conv_b_dilation (Tuple[Tuple[int]], optional) – Dilation for 3D convolution for conv_b.
Default: ((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1))
stage_spatial_stride (Tuple[int], optional) – The spatial stride for each stage.
Default: (2, 2, 2, 2)
stage_temporal_stride (Tuple[int], optional) – The temporal stride for each stage.
Default: (1, 1, 2, 2)
stage_bottleneck (Tuple[Callable], optional) – Callables that construct the bottleneck block layer, one per stage. Examples include: create_bottleneck_block(), create_2plus1d_bottleneck_block().
Default: (<function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>)
head_pool (Callable, optional) – A callable that constructs resnet head pooling layer.
Default: <class 'torch.nn.modules.pooling.AvgPool3d'>
head_pool_kernel_size (Tuple[int], optional) – The pooling kernel size.
Default: (4, 7, 7)
head_output_size (Tuple[int], optional) – The size of output tensor for head.
Default: (1, 1, 1)
head_activation (Callable, optional) – A callable that constructs activation layer.
Default: <class 'torch.nn.modules.activation.Softmax'>
head_output_with_global_average (bool, optional) – If True, perform global averaging on the head output.
Default: True

Return type:

Module

Returns:

The R(2+1)D network.
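
The default strides and head_pool_kernel_size together imply an input clip size. The sketch below works backwards from the head pooling kernel to the clip shape whose final feature map the kernel covers exactly. It is an arithmetic illustration under the assumption that each strided layer divides its input evenly; required_clip_shape is a hypothetical helper, not part of Eztorch.

```python
from math import prod


def required_clip_shape(
    head_pool_kernel_size=(4, 7, 7),
    stem_conv_stride=(1, 2, 2),
    stage_spatial_stride=(2, 2, 2, 2),
    stage_temporal_stride=(1, 1, 2, 2),
):
    """(T, H, W) input shape whose final feature map has exactly the
    size of the head pooling kernel."""
    # Total downsampling factor along each axis: stem stride times
    # the product of the per-stage strides.
    temporal = stem_conv_stride[0] * prod(stage_temporal_stride)
    height = stem_conv_stride[1] * prod(stage_spatial_stride)
    width = stem_conv_stride[2] * prod(stage_spatial_stride)
    pool_t, pool_h, pool_w = head_pool_kernel_size
    return (pool_t * temporal, pool_h * height, pool_w * width)


if __name__ == "__main__":
    print(required_clip_shape())
```

With the documented defaults the total temporal stride is 4 and the total spatial stride is 32, so the default head pooling kernel corresponds to 16-frame clips at 224×224.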

R2+1D18 often used in papers

eztorch.models.trunks.create_r2plus1d_18(downsample=True, num_classes=101, layers=[1, 1, 1, 1], progress=True, pretrained=False, stem=<class 'eztorch.models.trunks.r2plus1d_18.LargeR2Plus1dStem'>, **kwargs)

Build R2+1D_18 from torchvision for video.

Parameters:

num_classes (int, optional) – If not \(0\), replace the last fully connected layer with one that has num_classes outputs; if \(0\), replace it with an identity.
Default: 101
pretrained (bool, optional) – If True, returns a model pre-trained on Kinetics-400.
Default: False
progress (bool, optional) – If True, displays a progress bar of the download to stderr.
Default: True
layers (List[int], optional) – Number of layers per block.
Default: [1, 1, 1, 1]
stem (Union[str, Module], optional) – Stem to use for input.
Default: <class 'eztorch.models.trunks.r2plus1d_18.LargeR2Plus1dStem'>
**kwargs – arguments specific to torchvision constructors for ResNet.

Return type:

Module

Returns:

Basic resnet.
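
The (2+1)D factorization behind this model splits a t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal one. In the R(2+1)D paper cited above, the number of intermediate channels is chosen so that the factorized block has roughly the same parameter count as the full 3D convolution. The sketch below reproduces that arithmetic; the function names are hypothetical, bias terms are ignored, and whether Eztorch applies exactly this rule internally is an assumption.

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    """Intermediate channel count that (approximately) matches the
    parameter count of a full t x d x d 3D convolution."""
    return (t * d * d * in_planes * out_planes) // (
        d * d * in_planes + t * out_planes
    )


def conv3d_params(in_planes, out_planes, t=3, d=3):
    """Weight count of a full 3D convolution (no bias)."""
    return t * d * d * in_planes * out_planes


def conv2plus1d_params(in_planes, out_planes, t=3, d=3):
    """Weight count of the spatial + temporal pair (no bias)."""
    mid = midplanes(in_planes, out_planes, t, d)
    spatial = d * d * in_planes * mid   # 1 x d x d convolution
    temporal = t * mid * out_planes     # t x 1 x 1 convolution
    return spatial + temporal


if __name__ == "__main__":
    for n_in, n_out in [(64, 64), (64, 128)]:
        print(
            n_in,
            n_out,
            midplanes(n_in, n_out),
            conv3d_params(n_in, n_out),
            conv2plus1d_params(n_in, n_out),
        )
```

Because of the integer division, the factorized block never has more weights than the 3D convolution it replaces, and it matches exactly when the division is exact, as for 64 input and 64 output channels.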

S3D

eztorch.models.trunks.create_s3d(num_classes=101, gating=False, slow=False)

Build the S3D network.

Parameters:

num_classes (int, optional) – If not \(0\), replace the last fully connected layer with one that has num_classes outputs; if \(0\), replace it with an identity.
Default: 101
gating (bool, optional) – If True, init S3D-G network.
Default: False
slow (bool, optional) – If True, use slow S3D.
Default: False

Return type:

Module

Returns:

The S3D network instantiated.

X3D

eztorch.models.trunks.create_x3d(*, input_channel=3, input_clip_length=13, input_crop_size=160, model_num_class=400, dropout_rate=0.5, width_factor=2.0, depth_factor=2.2, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_in=12, stem_conv_kernel_size=(5, 3, 3), stem_conv_stride=(1, 2, 2), stage_conv_kernel_size=((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3)), stage_spatial_stride=(2, 2, 2, 2), stage_temporal_stride=(1, 1, 1, 1), bottleneck=<function create_x3d_bottleneck_block>, bottleneck_factor=2.25, se_ratio=0.0625, inner_act=<class 'pytorchvideo.layers.swish.Swish'>, head=<function create_x3d_head>, head_dim_out=2048, head_pool_act=<class 'torch.nn.modules.activation.ReLU'>, head_bn_lin5_on=False, head_activation=<class 'torch.nn.modules.activation.Softmax'>, head_output_with_global_average=True)

X3D model builder. It builds an X3D network backbone, which is a ResNet.

Christoph Feichtenhofer. “X3D: Expanding Architectures for Efficient Video Recognition.” https://arxiv.org/abs/2004.04730

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters:

input_channel (int, optional) – Number of channels for the input video clip.
Default: 3
input_clip_length (int, optional) – Length of the input video clip. Value for different models: X3D-XS: 4; X3D-S: 13; X3D-M: 16; X3D-L: 16.
Default: 13
input_crop_size (int, optional) – Spatial resolution of the input video clip. Value for different models: X3D-XS: 160; X3D-S: 160; X3D-M: 224; X3D-L: 312.
Default: 160
model_num_class (int, optional) – The number of classes for the video dataset.
Default: 400
dropout_rate (float, optional) – Dropout rate.
Default: 0.5
width_factor (float, optional) – Width expansion factor.
Default: 2.0
depth_factor (float, optional) – Depth expansion factor. Value for different models: X3D-XS: 2.2; X3D-S: 2.2; X3D-M: 2.2; X3D-L: 5.0.
Default: 2.2
norm (Callable, optional) – A callable that constructs normalization layer.
Default: <class 'torch.nn.modules.batchnorm.BatchNorm3d'>
norm_eps (float, optional) – Normalization epsilon.
Default: 1e-05
norm_momentum (float, optional) – Normalization momentum.
Default: 0.1
activation (Callable, optional) – A callable that constructs activation layer.
Default: <class 'torch.nn.modules.activation.ReLU'>
stem_dim_in (int, optional) – Input channel size for stem before expansion.
Default: 12
stem_conv_kernel_size (Tuple[int], optional) – Convolutional kernel size(s) of stem.
Default: (5, 3, 3)
stem_conv_stride (Tuple[int], optional) – Convolutional stride size(s) of stem.
Default: (1, 2, 2)
stage_conv_kernel_size (Tuple[Tuple[int]], optional) – Convolutional kernel size(s) for conv_b.
Default: ((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3))
stage_spatial_stride (Tuple[int], optional) – The spatial stride for each stage.
Default: (2, 2, 2, 2)
stage_temporal_stride (Tuple[int], optional) – The temporal stride for each stage.
Default: (1, 1, 1, 1)
bottleneck_factor (float, optional) – Bottleneck expansion factor for the 3x3x3 conv.
Default: 2.25
se_ratio (float, optional) – If > 0, apply SE to the 3x3x3 conv, with the SE channel dimensionality being se_ratio times the 3x3x3 conv dim.
Default: 0.0625
inner_act (Callable, optional) – A callable that constructs the activation layer for act_b, such as Swish.
Default: <class 'pytorchvideo.layers.swish.Swish'>
head_dim_out (int, optional) – Output channel size of the X3D head.
Default: 2048
head_pool_act (Callable, optional) – A callable that constructs resnet pool activation layer such as ReLU.
Default: <class 'torch.nn.modules.activation.ReLU'>
head_bn_lin5_on (bool, optional) – If True, perform normalization on the features before the classifier.
Default: False
head_activation (Callable, optional) – A callable that constructs activation layer.
Default: <class 'torch.nn.modules.activation.Softmax'>
head_output_with_global_average (bool, optional) – If True, perform global averaging on the head output.
Default: True

Return type:

Module

Returns:

The X3D network.
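
The per-variant values listed in the parameter descriptions can be collected into presets, and depth_factor can be turned into a number of blocks per stage. The sketch below does both. The preset dictionary only restates values from the descriptions above; the base stage depths (1, 2, 5, 3) and the round-up rule are assumptions taken from the X3D paper rather than from Eztorch, and stage_depths is a hypothetical helper.

```python
from math import ceil

# Values restated from the parameter descriptions above.
X3D_PRESETS = {
    "X3D-XS": {"input_clip_length": 4, "input_crop_size": 160, "depth_factor": 2.2},
    "X3D-S": {"input_clip_length": 13, "input_crop_size": 160, "depth_factor": 2.2},
    "X3D-M": {"input_clip_length": 16, "input_crop_size": 224, "depth_factor": 2.2},
    "X3D-L": {"input_clip_length": 16, "input_crop_size": 312, "depth_factor": 5.0},
}

# Assumption: blocks per stage before depth expansion (X3D paper).
BASE_STAGE_DEPTHS = (1, 2, 5, 3)


def stage_depths(depth_factor, base=BASE_STAGE_DEPTHS):
    """Blocks per stage after depth expansion, rounding up.
    The inner round() guards against floating-point noise such as
    3 * 2.2 == 6.6000000000000005."""
    return tuple(ceil(round(depth * depth_factor, 6)) for depth in base)


if __name__ == "__main__":
    for name, preset in X3D_PRESETS.items():
        print(name, preset, stage_depths(preset["depth_factor"]))
```

A preset could then be passed on as keyword arguments, for example create_x3d(**X3D_PRESETS["X3D-M"]), since all three keys are parameters in the signature above.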