Analysis of wtalc-pytorch source code
Paper: W-TALC: Weakly-supervised Temporal Activity Localization and Classification
Code link: https://github.com/sujoyp/wtalc-pytorch
The main structure of the code is as follows:
Python file | Purpose |
---|---|
main.py | Entry point |
options.py | Parameter configuration |
video_dataset.py | Dataset splitting and loading |
model.py | Weakly supervised model |
train.py | Training code |
test.py | Test code |
detectionMAP.py | Detection mAP |
classificationMAP.py | Classification mAP |
1. options.py is the parameter configuration.
parser = argparse.ArgumentParser(description='WTALC')
parser.add_argument('--lr', type=float, default=0.00001,help='learning rate (default: 0.0001)')
parser.add_argument('--batch-size', type=int, default=10, help='number of instances in a batch of data (default: 10)')
parser.add_argument('--model-name', default='weakloc', help='name to save model')
parser.add_argument('--pretrained-ckpt', default=None, help='ckpt for pretrained model')
parser.add_argument('--feature-size', default=2048, help='size of feature (default: 2048)')
parser.add_argument('--num-class', default=20, help='number of classes (default: )')
parser.add_argument('--dataset-name', default='Thumos14reduced', help='dataset to train on (default: )')
parser.add_argument('--max-seqlen', type=int, default=1200, help='maximum sequence length during training (default: 750)')
parser.add_argument('--Lambda', type=float, default=0.5, help='weight on Co-Activity Loss (default: 0.5)')
parser.add_argument('--num-similar', default=3, help='number of similar pairs in a batch of data (default: 3)')
parser.add_argument('--seed', type=int, default=1, help='random seed (default: 1)')
parser.add_argument('--max-iter', type=int, default=100000, help='maximum iteration to train (default: 50000)')
parser.add_argument('--feature-type', type=str, default='I3D', help='type of feature to be used I3D or UNT (default: I3D)')
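The options have the following meanings. Where a help string quotes a different default from the default= value (--lr, --max-seqlen, --max-iter), the default= value is the one that takes effect.
--lr: learning rate
--batch-size: number of videos in a batch
--model-name: name under which the model is saved
--pretrained-ckpt: checkpoint of a pre-trained model
--feature-size: feature dimension
--num-class: number of classes
--dataset-name: dataset name
--max-seqlen: maximum sequence length during training
--Lambda: weight of the Co-Activity Loss in the total loss
--num-similar: number of similar video pairs in a batch
--seed: random seed
--max-iter: maximum number of training iterations
--feature-type: type of pre-extracted features (I3D or UNT)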
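As a small illustration of how these options are read back, the sketch below rebuilds three of the arguments in a stand-alone parser (the name demo_parser and the command line passed to it are made up for the example). It shows two argparse behaviours worth knowing here: hyphens in option names become underscores in attribute names, and an option declared without type=, such as --feature-size, comes back as a string when it is given on the command line.

```python
import argparse

# Stand-alone copy of three options from options.py (demo_parser is a hypothetical name).
demo_parser = argparse.ArgumentParser(description='WTALC')
demo_parser.add_argument('--batch-size', type=int, default=10)
demo_parser.add_argument('--feature-size', default=2048)  # no type=, as in the original
demo_parser.add_argument('--Lambda', type=float, default=0.5)

# Defaults: '--batch-size' is exposed as args.batch_size.
defaults = demo_parser.parse_args([])
print(defaults.batch_size, defaults.feature_size, defaults.Lambda)

# Overrides: a value for an option without type= stays a string,
# so it needs an explicit int(...) before it can be used as a size.
args = demo_parser.parse_args(['--batch-size', '16', '--feature-size', '1024'])
print(args.batch_size, repr(args.feature_size), int(args.feature_size))
```

If the feature size is ever overridden on the command line, adding type=int to that argument would avoid the manual conversion.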
2. video_dataset.py handles dataset splitting and loading
2.1 __init__()
__init__() first reads the configuration of the dataset, then calls train_test_idx() and classwise_feature_mapping().
2.2 train_test_idx()
train_test_idx() splits the videos into a training set and a test set by index. For Thumos14, videos in the validation subset are used for training and all others for testing.
def train_test_idx(self):
    for i, s in enumerate(self.subset):
        if s.decode('utf-8') == 'validation':  # Specific to Thumos14
            self.trainidx.append(i)  # index of a training video
        else:
            self.testidx.append(i)  # index of a test video
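To make the split concrete, here is a minimal stand-alone sketch of the same logic on a toy subset array. The function name split_indices and the sample values are invented for illustration; the byte strings mimic the entries that the original code decodes with decode('utf-8').

```python
def split_indices(subset):
    """Return (train_idx, test_idx): 'validation' videos train, the rest test."""
    train_idx, test_idx = [], []
    for i, s in enumerate(subset):
        if s.decode('utf-8') == 'validation':
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

# Toy annotation: one byte string per video.
toy_subset = [b'validation', b'test', b'validation', b'test', b'test']
train_idx, test_idx = split_indices(toy_subset)
print(train_idx)  # [0, 2]
print(test_idx)   # [1, 3, 4]
```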
2.3 classwise_feature_mapping()
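classwise_feature_mapping() groups the training videos by class: for every class it builds the list of training video indices whose labels contain that class.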
def classwise_feature_mapping(self):
    for category in self.classlist:
        idx = []  # indices of the training videos that contain this class
        for i in self.trainidx:
            for label in self.labels[i]:
                if label == category.decode('utf-8'):
                    idx.append(i)
                    break
        self.classwiseidx.append(idx)
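The resulting structure is a list with one entry per class. A minimal sketch on toy data, assuming labels are plain lists of strings; classwise_mapping and all sample values are hypothetical and only illustrate the shape of the result:

```python
def classwise_mapping(classlist, labels, train_idx):
    """For each class, list the training videos whose labels contain it."""
    classwise_idx = []
    for category in classlist:
        name = category.decode('utf-8')
        classwise_idx.append([i for i in train_idx if name in labels[i]])
    return classwise_idx

# Toy data: videos 0..3 are training videos, video 1 contains two classes.
toy_classlist = [b'Diving', b'Shotput']
toy_labels = [['Diving'], ['Diving', 'Shotput'], ['Shotput'], ['Diving']]
mapping = classwise_mapping(toy_classlist, toy_labels, train_idx=[0, 1, 2, 3])
print(mapping)  # [[0, 1, 3], [1, 2]]
```

A video with several labels, like video 1 here, appears in the list of every class it contains.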
2.4 load_data()
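During training, load_data() samples n_similar pairs of videos that share a class, fills the rest of the batch with randomly chosen training videos, and returns the feature matrices and labels of all batch_size videos (10 by default: 3 similar pairs plus 4 random videos). During testing it returns one test video per call.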
def load_data(self, n_similar=3, is_training=True):
    if is_training==True:
        features = []
        labels = []
        idx = []
        # Load similar pairs: n_similar (3 by default) pairs of videos that share a class
        rand_classid = np.random.choice(len(self.classwiseidx), size=n_similar)
        # For each sampled class, pick two videos of that class
        # (sampled with replacement, so both picks may be the same video)
        for rid in rand_classid:
            rand_sampleid = np.random.choice(len(self.classwiseidx[rid]), size=2)
            idx.append(self.classwiseidx[rid][rand_sampleid[0]])
            idx.append(self.classwiseidx[rid][rand_sampleid[1]])
        # idx now holds 2*n_similar entries (6 by default) and grows to batch_size (10) below
        # Load rest pairs: batch_size-2*n_similar random training videos, not necessarily
        # similar; CASL() skips them, but they still contribute to the MILL() loss
        rand_sampleid = np.random.choice(len(self.trainidx), size=self.batch_size-2*n_similar)
        for r in rand_sampleid:
            idx.append(self.trainidx[r])
        # Return the feature matrices and multi-hot labels of all batch_size videos
        return np.array([utils.process_feat(self.features[i], self.t_max) for i in idx]), np.array([self.labels_multihot[i] for i in idx])
    else:
        labs = self.labels_multihot[self.testidx[self.currenttestidx]]
        feat = self.features[self.testidx[self.currenttestidx]]
        if self.currenttestidx == len(self.testidx)-1:
            done = True
            self.currenttestidx = 0
        else:
            done = False
            self.currenttestidx += 1
        return np.array(feat), np.array(labs), done
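The index selection in the training branch can be sketched without NumPy. This is an illustration under assumptions rather than the repository code: sample_batch_indices, the toy index lists and the use of the standard random module are all introduced here, and feature loading is left out.

```python
import random

def sample_batch_indices(classwise_idx, train_idx, batch_size=10, n_similar=3, rng=random):
    """Pick 2*n_similar indices as same-class pairs, then fill up with random videos."""
    idx = []
    for _ in range(n_similar):
        same_class = rng.choice(classwise_idx)   # videos of one randomly chosen class
        idx.append(rng.choice(same_class))       # drawn with replacement, so the
        idx.append(rng.choice(same_class))       # two picks may coincide
    for _ in range(batch_size - 2 * n_similar):
        idx.append(rng.choice(train_idx))        # remaining slots: any training video
    return idx

toy_classwise = [[0, 1, 3], [1, 2], [4, 5]]
toy_train = [0, 1, 2, 3, 4, 5]
batch = sample_batch_indices(toy_classwise, toy_train, rng=random.Random(1))
print(len(batch))  # 10
```

Positions (0, 1), (2, 3) and (4, 5) of the returned list always hold two videos of the same class, which is the layout CASL() relies on when it steps through the batch two entries at a time.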
3. model.py is the model part
model.py implements the weakly supervised model. It is simple: one fully connected layer with ReLU and dropout, followed by a linear classifier that outputs class scores for every temporal segment (compare the source code with the weakly supervised formulation in the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(torch.nn.Module):
    def __init__(self, n_feature, n_class):
        super(Model, self).__init__()
        self.fc = nn.Linear(n_feature, n_feature)
        self.fc1 = nn.Linear(n_feature, n_feature)  # only used by the commented-out lines in forward()
        self.classifier = nn.Linear(n_feature, n_class)
        self.dropout = nn.Dropout(0.7)
        self.apply(weights_init)  # weights_init is defined elsewhere in model.py
        #self.train()

    def forward(self, inputs, is_training=True):
        # Returns the hidden features and the per-segment class scores
        x = F.relu(self.fc(inputs))
        if is_training:
            x = self.dropout(x)
        #x = F.relu(self.fc1(x))
        #if is_training:
        #    x = self.dropout(x)
        return x, self.classifier(x)
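One consequence of the layer definitions can be checked with plain arithmetic: fc1 is never called in forward(), yet it is still registered as a layer, so its weights are still part of model.parameters() that main.py hands to the optimizer. The sketch below counts the weights and biases of each nn.Linear layer for the default sizes from options.py (feature size 2048, 20 classes); linear_params is a helper name introduced only for this example.

```python
def linear_params(n_in, n_out):
    """Number of parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

n_feature, n_class = 2048, 20  # defaults from options.py

counts = {
    'fc': linear_params(n_feature, n_feature),
    'fc1': linear_params(n_feature, n_feature),   # defined but unused in forward()
    'classifier': linear_params(n_feature, n_class),
}
total = sum(counts.values())
used = total - counts['fc1']
print(counts)
print(total, used)
```

With these defaults roughly half of the stored parameters belong to the unused fc1 layer.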
4. train.py is the training module
The main part of train.py computes the multiple-instance learning loss and the Co-Activity Similarity loss
4.1 MILL() is the multiple-instance learning loss function
import numpy as np
import torch
import torch.nn.functional as F
from torch.autograd import Variable

def MILL(element_logits, seq_len, batch_size, labels, device):
    ''' element_logits should be torch tensor of dimension (B, n_element, n_class),
         seq_len should be numpy array of dimension (B,) with the length of each video;
         k, the number of top locations to average over, is derived from it,
         labels should be a torch tensor of dimension (B, n_class) of 1 or 0
         return is a scalar torch tensor: the loss averaged over the batch '''
    # k = ceil(seq_len / 8) for every video, e.g. [18 68 20 43 68 22 16 37 42 37]
    k = np.ceil(seq_len/8).astype('int32')
    # Normalise each multi-hot label vector so that it sums to 1
    labels = labels / torch.sum(labels, dim=1, keepdim=True)
    instance_logits = torch.zeros(0).to(device)
    for i in range(batch_size):
        # Take the first seq_len[i] rows of video i and, for each class,
        # keep the k[i] largest scores along the time axis (dim 0)
        tmp, _ = torch.topk(element_logits[i][:seq_len[i]], k=int(k[i]), dim=0)  # [k[i], n_class]
        instance_logits = torch.cat([instance_logits, torch.mean(tmp, 0, keepdim=True)], dim=0)  # appends a [1, n_class] row
    # Cross-entropy between the normalised labels and the softmax of the video-level scores, as in the paper
    milloss = -torch.mean(torch.sum(Variable(labels) * F.log_softmax(instance_logits, dim=1), dim=1), dim=0)
    return milloss
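To see what the top-k pooling and the cross-entropy do numerically, the following sketch reproduces the computation for a single video in plain Python, without PyTorch. It is a simplified illustration, not the repository code: the function names and the toy scores are made up, and the averaging over the batch is omitted.

```python
import math

def video_level_scores(element_logits, k_divisor=8):
    """Average the top-k scores of each class over time, with k = ceil(T / 8)."""
    T = len(element_logits)
    n_class = len(element_logits[0])
    k = math.ceil(T / k_divisor)
    scores = []
    for c in range(n_class):
        column = sorted((row[c] for row in element_logits), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores

def mil_loss_single(element_logits, multi_hot):
    """Cross-entropy between the normalised label vector and softmax(scores)."""
    scores = video_level_scores(element_logits)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
    log_softmax = [s - log_z for s in scores]
    n_pos = sum(multi_hot)
    return -sum((y / n_pos) * lp for y, lp in zip(multi_hot, log_softmax))

# Toy video: T = 9 segments, 2 classes, so k = ceil(9 / 8) = 2.
toy_logits = [[0.0, 1.0]] * 7 + [[4.0, 1.0], [2.0, 1.0]]
print(video_level_scores(toy_logits))     # [3.0, 1.0]
print(mil_loss_single(toy_logits, [1, 0]))
```

Only the two strongest segments decide the score of class 0 here, which is why a short burst of activity in a long video can still drive the video-level prediction.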
4.2 CASL() is the Co-Activity Similarity loss function
def CASL(x, element_logits, seq_len, n_similar, labels, device):
    ''' x is the torch tensor of feature from the last layer of model of dimension (B, n_element, n_feature),
        element_logits should be torch tensor of dimension (B, n_element, n_class)
        seq_len should be numpy array of dimension (B,)
        labels should be a torch tensor of dimension (B, n_class) of 1 or 0
        only the first 2*n_similar videos of the batch, taken as consecutive pairs, are used '''
    sim_loss = 0.
    n_tmp = 0.
    for i in range(0, n_similar*2, 2):
        # Normalise the activation scores of each class along the time axis with a softmax
        atn1 = F.softmax(element_logits[i][:seq_len[i]], dim=0)
        atn2 = F.softmax(element_logits[i+1][:seq_len[i+1]], dim=0)
        n1 = torch.FloatTensor([np.maximum(seq_len[i]-1, 1)]).to(device)
        n2 = torch.FloatTensor([np.maximum(seq_len[i+1]-1, 1)]).to(device)
        # Class-wise feature vectors of the high-attention (Hf) and low-attention (Lf) regions
        Hf1 = torch.mm(torch.transpose(x[i][:seq_len[i]], 1, 0), atn1)
        Hf2 = torch.mm(torch.transpose(x[i+1][:seq_len[i+1]], 1, 0), atn2)
        Lf1 = torch.mm(torch.transpose(x[i][:seq_len[i]], 1, 0), (1 - atn1)/n1)
        Lf2 = torch.mm(torch.transpose(x[i+1][:seq_len[i+1]], 1, 0), (1 - atn2)/n2)
        # Cosine distance (1 - cosine similarity) between two feature vectors
        d1 = 1 - torch.sum(Hf1*Hf2, dim=0) / (torch.norm(Hf1, 2, dim=0) * torch.norm(Hf2, 2, dim=0))
        d2 = 1 - torch.sum(Hf1*Lf2, dim=0) / (torch.norm(Hf1, 2, dim=0) * torch.norm(Lf2, 2, dim=0))
        d3 = 1 - torch.sum(Hf2*Lf1, dim=0) / (torch.norm(Hf2, 2, dim=0) * torch.norm(Lf1, 2, dim=0))
        # Ranking hinge loss with margin 0.5: the two high-attention features should be closer
        # to each other than to the other video's low-attention feature (shared classes only)
        sim_loss = sim_loss + 0.5*torch.sum(torch.max(d1-d2+0.5, torch.FloatTensor([0.]).to(device))*Variable(labels[i,:])*Variable(labels[i+1,:]))
        sim_loss = sim_loss + 0.5*torch.sum(torch.max(d1-d3+0.5, torch.FloatTensor([0.]).to(device))*Variable(labels[i,:])*Variable(labels[i+1,:]))
        n_tmp = n_tmp + torch.sum(Variable(labels[i,:])*Variable(labels[i+1,:]))
    # Average over all classes shared within the similar pairs
    sim_loss = sim_loss / n_tmp
    return sim_loss
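Read as formulas, the code above computes the following for a similar pair of videos (m, n) and a class j. The symbols are introduced here for readability and are not names from the repository: X_m is the T_m x d feature matrix x[i][:seq_len[i]], a_m^j is the softmax over time of column j of the logits, y_m^j is the multi-hot label, and delta = 0.5 is the margin hard-coded in the code.

```latex
\begin{align*}
H_m^j &= X_m^{\top} a_m^j, &
L_m^j &= \frac{1}{\max(T_m - 1,\, 1)}\, X_m^{\top}\bigl(\mathbf{1} - a_m^j\bigr), \\
d(u, v) &= 1 - \frac{\langle u, v\rangle}{\lVert u\rVert_2\,\lVert v\rVert_2}, \\
\ell_{mn}^{j} &= \tfrac{1}{2}\max\bigl(0,\; d(H_m^j, H_n^j) - d(H_m^j, L_n^j) + \delta\bigr)
              + \tfrac{1}{2}\max\bigl(0,\; d(H_m^j, H_n^j) - d(L_m^j, H_n^j) + \delta\bigr), \\
\mathcal{L}_{\mathrm{CASL}} &= \frac{\sum_{(m,n)} \sum_{j} y_m^j\, y_n^j\, \ell_{mn}^{j}}
                                   {\sum_{(m,n)} \sum_{j} y_m^j\, y_n^j}.
\end{align*}
```

The denominator counts the classes shared within the pairs. Since load_data() draws both videos of a pair from the list of one class, each pair should share at least that class, so the division is not expected to hit zero.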
5. test.py is the test module
test.py mainly calls dmAP() and cmAP() to compute the detection mAP and the classification mAP respectively
For background on mAP, see: https://blog.csdn.net/better_boy/article/details/109334234
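Detection mAP counts a predicted segment as correct when it overlaps a ground-truth segment of the same class by at least a given temporal IoU threshold. As a rough illustration of that overlap measure only (this is not the code of detectionMAP.py, and temporal_iou is a name chosen for the example), segments can be treated as (start, end) pairs:

```python
def temporal_iou(seg_a, seg_b):
    """Intersection over union of two 1-D segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))   # overlap 2, union 6
print(temporal_iou((0.0, 1.0), (3.0, 4.0)))   # disjoint segments give 0.0
```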
6. main.py
Finally, the main function ties the classes and functions above together.
6.1 Get parameter configuration
args = options.parser.parse_args()
6.2 Load data set
dataset = Dataset(args)
6.3 Instantiate the model and parameters
model = Model(dataset.feature_size, dataset.num_class).to(device)
optimizer = optim.Adam(model.parameters(), lr=args.lr, weight_decay=0.0005)
6.4 Then run the training iterations: call the training function at every iteration and, every 500 iterations, save the model and call the test function
for itr in range(args.max_iter):
    train(itr, dataset, args, model, optimizer, logger, device)
    if itr % 500 == 0 and not itr == 0:
        torch.save(model.state_dict(), './ckpt/' + args.model_name + '.pkl')
        test(itr, dataset, args, model, logger, device)
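A guard of the form itr % N == 0 and not itr == 0 skips iteration 0 and then fires once every N iterations. A small sketch that lists the iterations at which the model would be saved and tested; checkpoint_iterations and the short max_iter are hypothetical and only serve the illustration:

```python
def checkpoint_iterations(max_iter, every):
    """Iterations in range(max_iter) that pass `itr % every == 0 and not itr == 0`."""
    return [itr for itr in range(max_iter) if itr % every == 0 and not itr == 0]

print(checkpoint_iterations(max_iter=2001, every=500))  # [500, 1000, 1500, 2000]
```

Because every save writes to the same path built from args.model_name, each checkpoint overwrites the previous one.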