【Tech Tracking】SAM (Segment Anything Model) Code Analysis and Structure Diagrams: Mask Decoder

  Paper: Segment Anything
  Code: https://github.com/facebookresearch/segment-anything

  Previous posts in this series:
  (1) 【Tech Tracking】SAM (Segment Anything Model) Code Analysis and Structure Diagrams: Image Encoder
  (2) 【Tech Tracking】SAM (Segment Anything Model) Code Analysis and Structure Diagrams: Prompt Encoder

  This post again runs the code on the dog image used throughout the series; the prediction part is as follows:

input_point = np.array([[1300, 800]])   # coordinates of the input point
input_label = np.array([1])   # label=1 means foreground, label=0 means background
# coordinates of the input box: (700, 400) is the top-left corner, (1900, 1100) the bottom-right
input_box = np.array([[700, 400, 1900, 1100]])
# call the prediction function
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    box=input_box,
    multimask_output=True,
)
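
For completeness, here is a minimal sketch of how the predictor used above can be built, following the official README (the checkpoint and image paths are placeholders — substitute your own):

import cv2
from segment_anything import sam_model_registry, SamPredictor

# load a pretrained SAM model (vit_h variant) from a local checkpoint
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# read the image and convert BGR -> RGB before handing it to the predictor
image = cv2.cvtColor(cv2.imread("dog.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)   # runs the image encoder once; prompts can then vary cheaply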

  


1. Mask Decoder Code Analysis

(1) Input Parameters

  The mask_decoder is invoked for mask prediction in 【segment_anything/predictor.py --> SamPredictor class --> predict_torch function】, as shown below:

low_res_masks, iou_predictions = self.model.mask_decoder(
    image_embeddings=self.features,
    image_pe=self.model.prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse_embeddings,
    dense_prompt_embeddings=dense_embeddings,
    multimask_output=multimask_output,
)

  ① self.features is the embedding of input_image produced by the image_encoder; in this example its size is [1, 256, 64, 64];

  ② sparse_embeddings holds the prompt point and prompt box embeddings produced by the prompt_encoder; here its size is [1, 3, 256] (one token for the point plus two tokens for the box corners);

  ③ dense_embeddings, since no prompt mask is given in this example, is the predefined nn.Embedding vector, of size [1, 256, 64, 64];

  ④ multimask_output is a bool, True by default, enabling multi-mask output;

  ⑤ self.model.prompt_encoder.get_dense_pe() calls PositionEmbeddingRandom to build the positional encoding, of size [1, 256, 64, 64]:

def get_dense_pe(self) -> torch.Tensor:
    return self.pe_layer(self.image_embedding_size).unsqueeze(0)
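
As a self-contained shape check of what get_dense_pe returns, one can call the underlying PositionEmbeddingRandom layer directly (a sketch; the import path and num_pos_feats=128, i.e. embed_dim // 2, follow the prompt encoder's setup):

from segment_anything.modeling.prompt_encoder import PositionEmbeddingRandom

pe_layer = PositionEmbeddingRandom(num_pos_feats=128)  # sin+cos -> 2 * 128 = 256 channels
dense_pe = pe_layer((64, 64)).unsqueeze(0)             # same as get_dense_pe()
print(dense_pe.shape)  # torch.Size([1, 256, 64, 64])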

(2) The MaskDecoder Class

Location: 【segment_anything/modeling/mask_decoder.py --> MaskDecoder class】
Purpose: initialize the network structure, and call predict_masks to predict masks and IoU

  First, look at MaskDecoder's __init__ and forward functions:

class MaskDecoder(nn.Module):
    def __init__(
        self,
        *,
        transformer_dim: int,
        transformer: nn.Module,
        num_multimask_outputs: int = 3,
        activation: Type[nn.Module] = nn.GELU,
        iou_head_depth: int = 3,
        iou_head_hidden_dim: int = 256,
    ) -> None:
        super().__init__()
        self.transformer_dim = transformer_dim   # transformer channel dim = 256
        self.transformer = transformer  # transformer used for mask prediction = TwoWayTransformer
        self.num_multimask_outputs = num_multimask_outputs  # number of masks for disambiguation = 3
        self.iou_token = nn.Embedding(1, transformer_dim)  # (1, 256)
        self.num_mask_tokens = num_multimask_outputs + 1   # number of masks plus 1 = 4
        self.mask_tokens = nn.Embedding(self.num_mask_tokens, transformer_dim)  # (4, 256)
        # 4x upsampling via transposed convolutions
        self.output_upscaling = nn.Sequential(
            nn.ConvTranspose2d(transformer_dim, transformer_dim // 4, kernel_size=2, stride=2),
            LayerNorm2d(transformer_dim // 4),
            activation(),
            nn.ConvTranspose2d(transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2),
            activation(),
        )
        # one MLP for each of the 4 mask tokens
        self.output_hypernetworks_mlps = nn.ModuleList(
            [
                MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3)
                for i in range(self.num_mask_tokens)
            ]
        )
        # MLP for IoU prediction
        self.iou_prediction_head = MLP(
            transformer_dim, iou_head_hidden_dim, self.num_mask_tokens, iou_head_depth
        )

    def forward(
        self,
        image_embeddings: torch.Tensor,
        image_pe: torch.Tensor,
        sparse_prompt_embeddings: torch.Tensor,
        dense_prompt_embeddings: torch.Tensor,
        multimask_output: bool,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        masks, iou_pred = self.predict_masks(
            image_embeddings=image_embeddings,  # image encoder embedding [1, 256, 64, 64]
            image_pe=image_pe,  # positional encoding matching the image embedding size [1, 256, 64, 64]
            sparse_prompt_embeddings=sparse_prompt_embeddings,  # prompt point/box embeddings [1, 3, 256]
            dense_prompt_embeddings=dense_prompt_embeddings,  # prompt mask embedding [1, 256, 64, 64]
        )  # outputs: masks.size()=[1,4,256,256], iou_pred.size()=[1,4]

        # Select the correct mask or masks for output
        if multimask_output:
            mask_slice = slice(1, None)   # take everything from index 1 onward
        else:
            mask_slice = slice(0, 1)   # take index 0 only
        masks = masks[:, mask_slice, :, :]  # [1, 3, 256, 256]
        iou_pred = iou_pred[:, mask_slice]  # [1, 3]
        return masks, iou_pred

  Related: 【Python functions】usage of the built-in slice()
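
A tiny illustration of the two slice objects used in forward:

import torch

masks = torch.randn(1, 4, 256, 256)          # raw predict_masks output
multi = masks[:, slice(1, None), :, :]       # multimask_output=True  -> [1, 3, 256, 256]
single = masks[:, slice(0, 1), :, :]         # multimask_output=False -> [1, 1, 256, 256]
print(multi.shape, single.shape)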

  The forward pass mainly calls predict_masks, while __init__ defines four components: transformer, output_upscaling, output_hypernetworks_mlps, and iou_prediction_head. Let's take a look at what each of them is.


  ① transformer: In 【segment_anything/build_sam.py】 the transformer is defined as a TwoWayTransformer, with prompt_embed_dim = 256.

mask_decoder=MaskDecoder(
    num_multimask_outputs=3,
    transformer=TwoWayTransformer(
        depth=2,
        embedding_dim=prompt_embed_dim,  # 256
        mlp_dim=2048,
        num_heads=8,
    ),
    transformer_dim=prompt_embed_dim,
    iou_head_depth=3,
    iou_head_hidden_dim=256,
),

  The TwoWayTransformer structure is as follows:

class TwoWayTransformer(nn.Module):
    def __init__(
        self,
        depth: int,
        embedding_dim: int,
        num_heads: int,
        mlp_dim: int,
        activation: Type[nn.Module] = nn.ReLU,
        attention_downsample_rate: int = 2,
    ) -> None:
        super().__init__()
        self.depth = depth   # = 2
        self.embedding_dim = embedding_dim  # = 256
        self.num_heads = num_heads  # = 8
        self.mlp_dim = mlp_dim  # = 2048
        self.layers = nn.ModuleList()
        # 2 TwoWayAttentionBlock modules
        for i in range(depth):
            self.layers.append(
                TwoWayAttentionBlock(
                    embedding_dim=embedding_dim,  # 256
                    num_heads=num_heads,  # 8
                    mlp_dim=mlp_dim,  # 2048
                    activation=activation,  # nn.ReLU
                    attention_downsample_rate=attention_downsample_rate,  # downsample rate = 2
                    skip_first_layer_pe=(i == 0),  # True for the 1st TwoWayAttentionBlock, False for the 2nd
                )
            )
        # 1 final Attention module
        self.final_attn_token_to_image = Attention(
            embedding_dim, num_heads, downsample_rate=attention_downsample_rate
        )
        self.norm_final_attn = nn.LayerNorm(embedding_dim)

    def forward(
        self,
        image_embedding: Tensor,  # image embedding: [1,256,64,64]
        image_pe: Tensor,   # image positional encoding: [1,256,64,64]
        point_embedding: Tensor,   # concatenation of iou_token, mask_tokens and sparse_prompt_embeddings: [1,8,256]
    ) -> Tuple[Tensor, Tensor]:
        # BxCxHxW -> BxHWxC == B x N_image_tokens x C
        bs, c, h, w = image_embedding.shape  # [1, 256, 64, 64]
        image_embedding = image_embedding.flatten(2).permute(0, 2, 1)  # [1,4096,256]
        image_pe = image_pe.flatten(2).permute(0, 2, 1)   # [1,4096,256]

        # Prepare queries
        queries = point_embedding  # queries Q: [1,8,256]
        keys = image_embedding     # keys K: [1,4096,256]

        # Apply transformer blocks and final layernorm
        for layer in self.layers:
            queries, keys = layer(
                queries=queries,
                keys=keys,
                query_pe=point_embedding,
                key_pe=image_pe,
            )  # after the two TwoWayAttentionBlocks: queries [1,8,256], keys [1,4096,256]

        # Apply the final attention layer from the points to the image
        q = queries + point_embedding  # [1,8,256]
        k = keys + image_pe  # [1,4096,256]
        attn_out = self.final_attn_token_to_image(q=q, k=k, v=keys)  # [1,8,256]
        queries = queries + attn_out  # [1,8,256]
        queries = self.norm_final_attn(queries)  # [1,8,256]
        return queries, keys
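
To make the shape bookkeeping concrete, here is a hedged usage sketch of TwoWayTransformer on random stand-in tensors (assuming the class is importable from segment_anything.modeling.transformer):

import torch
from segment_anything.modeling.transformer import TwoWayTransformer

t = TwoWayTransformer(depth=2, embedding_dim=256, mlp_dim=2048, num_heads=8)
image_embedding = torch.randn(1, 256, 64, 64)   # stand-in for the image encoder output
image_pe = torch.randn(1, 256, 64, 64)          # stand-in for the dense positional encoding
tokens = torch.randn(1, 8, 256)                 # iou token + 4 mask tokens + 3 prompt tokens
queries, keys = t(image_embedding, image_pe, tokens)
print(queries.shape, keys.shape)  # torch.Size([1, 8, 256]) torch.Size([1, 4096, 256])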

  The Attention structure is as follows, taking the first Attention module of TwoWayAttentionBlock as an example:

# embedding_dim = 256, num_heads=8
self.self_attn = Attention(embedding_dim, num_heads) 

  The Attention module implements the basic attention mechanism from the Transformer; if downsample_rate is not 1, the dimensions are first projected down:

class Attention(nn.Module):
    def __init__(
        self,
        embedding_dim: int,   # 256
        num_heads: int,   # 8
        downsample_rate: int = 1,   # 1
    ) -> None:
        super().__init__()
        self.embedding_dim = embedding_dim   # 256
        self.internal_dim = embedding_dim // downsample_rate   # 256
        self.num_heads = num_heads   # 8
        assert self.internal_dim % num_heads == 0, "num_heads must divide embedding_dim."
        self.q_proj = nn.Linear(embedding_dim, self.internal_dim)   # (256,256)
        self.k_proj = nn.Linear(embedding_dim, self.internal_dim)   # (256,256)
        self.v_proj = nn.Linear(embedding_dim, self.internal_dim)   # (256,256)
        self.out_proj = nn.Linear(self.internal_dim, embedding_dim)   # (256,256)

    def _separate_heads(self, x: Tensor, num_heads: int) -> Tensor:
        b, n, c = x.shape
        x = x.reshape(b, n, num_heads, c // num_heads)
        return x.transpose(1, 2)  # B x N_heads x N_tokens x C_per_head

    def _recombine_heads(self, x: Tensor) -> Tensor:
        b, n_heads, n_tokens, c_per_head = x.shape
        x = x.transpose(1, 2)
        return x.reshape(b, n_tokens, n_heads * c_per_head)  # B x N_tokens x C

    def forward(self, q: Tensor, k: Tensor, v: Tensor) -> Tensor:
        # Input projections
        # inputs q: [1,8,256]; k: [1,8,256]; v: [1,8,256]
        q = self.q_proj(q)  # [1,8,256]
        k = self.k_proj(k)  # [1,8,256]
        v = self.v_proj(v)  # [1,8,256]

        # Separate into heads
        q = self._separate_heads(q, self.num_heads)  # [1,8,8,32]
        k = self._separate_heads(k, self.num_heads)  # [1,8,8,32]
        v = self._separate_heads(v, self.num_heads)  # [1,8,8,32]
        _, _, _, c_per_head = q.shape   # per-head dimension c_per_head = 32

        # Attention mechanism ---------------------------------------------------------
        # per head, q times k transposed: [1,8,8,32] @ [1,8,32,8] -> [1,8,8,8]
        attn = q @ k.permute(0, 1, 3, 2)  # B x N_heads x N_tokens x N_tokens
        attn = attn / math.sqrt(c_per_head)  # q @ k^T / sqrt(d)
        attn = torch.softmax(attn, dim=-1)  # [1,8,8,8]
        # -----------------------------------------------------------------------------

        # Get output
        out = attn @ v   # softmax(q @ k^T / sqrt(d)) @ v ---> [1,8,8,32]
        out = self._recombine_heads(out)  # [1,8,256]
        out = self.out_proj(out)  # [1,8,256]
        return out

  To keep the code from getting dizzying, here is a visualization of Attention — yes, it is just standard multi-head attention:
  
(figure: multi-head attention structure)
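
A quick shape check of the downsampling behavior (a sketch, assuming Attention is importable from segment_anything.modeling.transformer): with downsample_rate=2 the projections work in a 128-dim internal space, but out_proj maps back to 256, so input and output shapes are unchanged:

import torch
from segment_anything.modeling.transformer import Attention

attn = Attention(embedding_dim=256, num_heads=8, downsample_rate=2)
q = torch.randn(1, 8, 256)      # token queries
k = torch.randn(1, 4096, 256)   # flattened image keys (also used as values here)
print(attn(q=q, k=k, v=k).shape)  # torch.Size([1, 8, 256])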
  
  The TwoWayAttentionBlock structure is as follows, taking the first TwoWayAttentionBlock of TwoWayTransformer as an example:

TwoWayAttentionBlock(
    embedding_dim=embedding_dim,  # 256
    num_heads=num_heads,  # 8
    mlp_dim=mlp_dim,  # 2048
    activation=activation,  # nn.ReLU
    attention_downsample_rate=attention_downsample_rate,  # downsample rate = 2
    skip_first_layer_pe=(i == 0),  # True for the 1st TwoWayAttentionBlock
)

  The TwoWayAttentionBlock module:

class TwoWayAttentionBlock(nn.Module):
    def __init__(
        self,
        embedding_dim: int,
        num_heads: int,
        mlp_dim: int = 2048,
        activation: Type[nn.Module] = nn.ReLU,
        attention_downsample_rate: int = 2,
        skip_first_layer_pe: bool = False,
    ) -> None:
        super().__init__()
        self.self_attn = Attention(embedding_dim, num_heads)   # embedding_dim=256, num_heads=8
        self.norm1 = nn.LayerNorm(embedding_dim)  # 256
        self.cross_attn_token_to_image = Attention(
            embedding_dim, num_heads, downsample_rate=attention_downsample_rate
        )   # embedding_dim=256, num_heads=8, attention_downsample_rate=2
        self.norm2 = nn.LayerNorm(embedding_dim)  # 256
        # embedding_dim=256, mlp_dim=2048, activation=nn.ReLU
        self.mlp = MLPBlock(embedding_dim, mlp_dim, activation)
        self.norm3 = nn.LayerNorm(embedding_dim)  # 256
        self.norm4 = nn.LayerNorm(embedding_dim)  # 256
        self.cross_attn_image_to_token = Attention(
            embedding_dim, num_heads, downsample_rate=attention_downsample_rate
        )   # embedding_dim=256, num_heads=8, attention_downsample_rate=2
        self.skip_first_layer_pe = skip_first_layer_pe  # True

    def forward(
        self, queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor
    ) -> Tuple[Tensor, Tensor]:
        # inputs: queries [1,8,256], keys [1,4096,256], query_pe [1,8,256], key_pe [1,4096,256]
        # Self attention block
        if self.skip_first_layer_pe:
            queries = self.self_attn(q=queries, k=queries, v=queries)  # [1,8,256]
        else:
            q = queries + query_pe
            attn_out = self.self_attn(q=q, k=q, v=queries)
            queries = queries + attn_out
        queries = self.norm1(queries)  # [1,8,256]

        # Cross attention block, tokens attending to image embedding
        q = queries + query_pe  # [1,8,256]
        k = keys + key_pe  # [1,4096,256]
        attn_out = self.cross_attn_token_to_image(q=q, k=k, v=keys)  # [1,8,256]
        queries = queries + attn_out  # [1,8,256]
        queries = self.norm2(queries)  # [1,8,256]

        # MLP block
        mlp_out = self.mlp(queries)   # [1,8,256]
        queries = queries + mlp_out   # [1,8,256]
        queries = self.norm3(queries)  # [1,8,256]

        # Cross attention block, image embedding attending to tokens
        q = queries + query_pe    # [1,8,256]
        k = keys + key_pe   # [1,4096,256]
        attn_out = self.cross_attn_image_to_token(q=k, k=q, v=queries)  # [1,4096,256]
        keys = keys + attn_out  # [1,4096,256]
        keys = self.norm4(keys)  # [1,4096,256]
        return queries, keys

  As you can see, the TwoWayTransformer structure and its token dimension changes are not complicated, but the interleaved Q, K, and V are genuinely dizzying:
(figure: TwoWayTransformer / TwoWayAttentionBlock Q-K-V data flow)

  The MLP inside TwoWayTransformer (MLPBlock):

class MLPBlock(nn.Module):
    def __init__(
        self,
        embedding_dim: int,
        mlp_dim: int,
        act: Type[nn.Module] = nn.GELU,
    ) -> None:
        super().__init__()
        # embedding_dim=256, mlp_dim=2048
        self.lin1 = nn.Linear(embedding_dim, mlp_dim)
        self.lin2 = nn.Linear(mlp_dim, embedding_dim)
        self.act = act()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.lin2(self.act(self.lin1(x)))

  The MLPBlock is a simple linear → activation → linear structure:
(figure: MLPBlock structure)


  ② output_upscaling:

Sequential(
  (0): ConvTranspose2d(256, 64, kernel_size=(2, 2), stride=(2, 2))
  (1): LayerNorm2d()
  (2): GELU(approximate='none')
  (3): ConvTranspose2d(64, 32, kernel_size=(2, 2), stride=(2, 2))
  (4): GELU(approximate='none')
)

  The output_upscaling module consists of two transposed convolutions, two GELU activations, and one LayerNorm, and upsamples the feature map by 4x: in the predict_masks function it takes [1, 256, 64, 64] up to [1, 32, 256, 256]:

src = src.transpose(1, 2).view(b, c, h, w)   # reshape: [1,4096,256]-> [1,256,64,64]
upscaled_embedding = self.output_upscaling(src) # [1,32,256,256]
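
As a sanity check on the 4x claim, each kernel_size=2, stride=2 transposed convolution doubles H and W, so the two together take 64 -> 128 -> 256 (norm and activations omitted since they do not change shapes):

import torch
import torch.nn as nn

up1 = nn.ConvTranspose2d(256, 64, kernel_size=2, stride=2)
up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
x = torch.randn(1, 256, 64, 64)
print(up2(up1(x)).shape)  # torch.Size([1, 32, 256, 256])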

  ③ output_hypernetworks_mlps:

ModuleList(
  (0-3): 4 x MLP(
    (layers): ModuleList(
      (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
      (2): Linear(in_features=256, out_features=32, bias=True)
    )
  )
)

  output_hypernetworks_mlps consists of 4 MLPs; in the predict_masks function each maps a [1, 256] mask token down to [1, 32]. Unlike the MLPBlock in TwoWayAttentionBlock, its structure is a little richer:

class MLP(nn.Module):
    def __init__(
        self,
        input_dim: int,   # 256
        hidden_dim: int,  # 256
        output_dim: int,  # 32
        num_layers: int,  # 3
        sigmoid_output: bool = False,  # False
    ) -> None:
        super().__init__()
        self.num_layers = num_layers  # 3
        h = [hidden_dim] * (num_layers - 1)  # [256,256]
        self.layers = nn.ModuleList(
            # [input_dim] + h: [256,256,256], h + [output_dim]: [256,256,32]
            nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim])
        )
        self.sigmoid_output = sigmoid_output

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            # ReLU after each linear layer except the last (i < num_layers - 1)
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        if self.sigmoid_output:
            x = F.sigmoid(x)
        return x

(figure: MLP structure)
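
A hedged shape check of this MLP with the parameters used here (assuming it is importable from segment_anything.modeling.mask_decoder):

import torch
from segment_anything.modeling.mask_decoder import MLP

mlp = MLP(input_dim=256, hidden_dim=256, output_dim=32, num_layers=3)
token = torch.randn(1, 256)     # one mask token output from the transformer
print(mlp(token).shape)         # torch.Size([1, 32])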


  ④ iou_prediction_head:

MLP(
  (layers): ModuleList(
    (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=4, bias=True)
  )
)

  iou_prediction_head performs the IoU prediction with a single MLP; its structure matches the MLP in output_hypernetworks_mlps, except that it maps [1, 256] to [1, 4] — the IoU estimates for the 1 mask used when multimask_output=False and the 3 masks used when multimask_output=True.


(3) The predict_masks Function

Location: 【segment_anything/modeling/mask_decoder.py --> MaskDecoder class --> predict_masks function】
Purpose: use the four modules above — transformer, output_upscaling, output_hypernetworks_mlps, and iou_prediction_head — to predict masks and IoU
  
  At this point, let's first review what gets passed into predict_masks:

  ① image_embeddings: the image encoder embedding, of size [1, 256, 64, 64];
  ② image_pe: the positional encoding matching the image embedding size, also [1, 256, 64, 64];
  ③ sparse_prompt_embeddings: the prompt point and box embeddings, of size [1, 3, 256];
  ④ dense_prompt_embeddings: the prompt mask embedding, of size [1, 256, 64, 64].

def predict_masks(
    self,
    image_embeddings: torch.Tensor,  # [1, 256, 64, 64]
    image_pe: torch.Tensor,  # [1, 256, 64, 64]
    sparse_prompt_embeddings: torch.Tensor,  # [1, 3, 256]
    dense_prompt_embeddings: torch.Tensor,  # [1, 256, 64, 64]
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Predicts masks. See 'forward' for more details."""
    # Concatenate output tokens
    # concatenate the iou token and mask tokens: [1,256] + [4,256] -> [5,256]
    output_tokens = torch.cat([self.iou_token.weight, self.mask_tokens.weight], dim=0)
    output_tokens = output_tokens.unsqueeze(0).expand(
        sparse_prompt_embeddings.size(0), -1, -1
    )  # [1,5,256]
    # iou token and mask tokens + prompt point/box embeddings
    tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1)  # [1,8,256]

    # Expand per-image data in batch direction to be per-mask
    src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0)  # repeat along batch: [1,256,64,64]
    src = src + dense_prompt_embeddings  # [1,256,64,64]
    pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0)  # repeat along batch: [1,256,64,64]
    b, c, h, w = src.shape  # 1,256,64,64

    # Run the transformer
    # src is the image encoder embedding plus the prompt mask embedding
    # pos_src is the positional encoding matching the image embedding size
    # tokens is the iou token and mask tokens + prompt point/box embeddings
    hs, src = self.transformer(src, pos_src, tokens)  # hs: [1,8,256], src: [1,4096,256]
    iou_token_out = hs[:, 0, :]  # the 1st token is the iou token output [1,256]
    mask_tokens_out = hs[:, 1 : (1 + self.num_mask_tokens), :]  # the next 4 are the mask token outputs [1,4,256]

    # Upscale mask embeddings and predict masks using the mask tokens
    src = src.transpose(1, 2).view(b, c, h, w)   # reshape: [1,4096,256] -> [1,256,64,64]
    upscaled_embedding = self.output_upscaling(src)  # [1,32,256,256]
    hyper_in_list: List[torch.Tensor] = []
    for i in range(self.num_mask_tokens):
        hyper_in_list.append(self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :]))
    hyper_in = torch.stack(hyper_in_list, dim=1)  # [1,4,32]
    b, c, h, w = upscaled_embedding.shape  # 1,32,256,256
    masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w)  # [1,4,256,256]

    # Generate mask quality predictions
    iou_pred = self.iou_prediction_head(iou_token_out)  # [1,4]

    return masks, iou_pred

  As shown above, the TwoWayTransformer produces iou_token_out and mask_tokens_out. iou_token_out goes through iou_prediction_head (a single MLP) to predict IoU; the 4 mask_tokens_out each pass through their own MLP, their outputs are stacked, and the result is matrix-multiplied with the image embedding (containing the image encoder embedding and the prompt mask embedding) after it has been upsampled by output_upscaling, yielding the mask predictions.
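
The final matmul is easy to reproduce in isolation — each 32-dim hypernetwork output acts as a per-mask classifier over the 32-channel upscaled embedding (random tensors as stand-ins):

import torch

hyper_in = torch.randn(1, 4, 32)          # one 32-d weight vector per mask token
upscaled = torch.randn(1, 32, 256, 256)   # upsampled image embedding
b, c, h, w = upscaled.shape
masks = (hyper_in @ upscaled.view(b, c, h * w)).view(b, -1, h, w)
print(masks.shape)  # torch.Size([1, 4, 256, 256])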


2. Mask Decoder Structure Diagram

(1) Printed Structure
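
The listing below can be reproduced with a one-line print (assuming the sam model built earlier in this post):

print(sam.mask_decoder)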

MaskDecoder(
  (transformer): TwoWayTransformer(
    (layers): ModuleList(
      (0-1): 2 x TwoWayAttentionBlock(
        (self_attn): Attention(
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (k_proj): Linear(in_features=256, out_features=256, bias=True)
          (v_proj): Linear(in_features=256, out_features=256, bias=True)
          (out_proj): Linear(in_features=256, out_features=256, bias=True)
        )
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (cross_attn_token_to_image): Attention(
          (q_proj): Linear(in_features=256, out_features=128, bias=True)
          (k_proj): Linear(in_features=256, out_features=128, bias=True)
          (v_proj): Linear(in_features=256, out_features=128, bias=True)
          (out_proj): Linear(in_features=128, out_features=256, bias=True)
        )
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (mlp): MLPBlock(
          (lin1): Linear(in_features=256, out_features=2048, bias=True)
          (lin2): Linear(in_features=2048, out_features=256, bias=True)
          (act): ReLU()
        )
        (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (cross_attn_image_to_token): Attention(
          (q_proj): Linear(in_features=256, out_features=128, bias=True)
          (k_proj): Linear(in_features=256, out_features=128, bias=True)
          (v_proj): Linear(in_features=256, out_features=128, bias=True)
          (out_proj): Linear(in_features=128, out_features=256, bias=True)
        )
      )
    )
    (final_attn_token_to_image): Attention(
      (q_proj): Linear(in_features=256, out_features=128, bias=True)
      (k_proj): Linear(in_features=256, out_features=128, bias=True)
      (v_proj): Linear(in_features=256, out_features=128, bias=True)
      (out_proj): Linear(in_features=128, out_features=256, bias=True)
    )
    (norm_final_attn): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  )
  (iou_token): Embedding(1, 256)
  (mask_tokens): Embedding(4, 256)
  (output_upscaling): Sequential(
    (0): ConvTranspose2d(256, 64, kernel_size=(2, 2), stride=(2, 2))
    (1): LayerNorm2d()
    (2): GELU(approximate='none')
    (3): ConvTranspose2d(64, 32, kernel_size=(2, 2), stride=(2, 2))
    (4): GELU(approximate='none')
  )
  (output_hypernetworks_mlps): ModuleList(
    (0-3): 4 x MLP(
      (layers): ModuleList(
        (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=32, bias=True)
      )
    )
  )
  (iou_prediction_head): MLP(
    (layers): ModuleList(
      (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
      (2): Linear(in_features=256, out_features=4, bias=True)
    )
  )
)

(2) Structure Diagram

  And that's the overall structure. Done — confetti!
  
(figure: Mask Decoder overall structure diagram)
