Alibaba Cloud – NLP for Beginners [Deep-Learning-Based Text Classification 3: BERT]

Table of Contents

  • Learning Process
  • Understanding the Competition
  • Learning Goals
  • Competition Data
  • Data Labels
  • Evaluation Metric
  • Solution Approaches
  • BERT
    • Code


Learning Process

Back in 2020 my foundation was fairly weak (I could only do basic Python data analysis with the three scientific-computing packages), and when I first looked at this material I was quite lost. Whenever I got stuck I searched online or asked other people, and running into difficulties felt intimidating. From 2022 onward I started grinding LeetCode (before I ever wrote LeetCode posts on this blog I had already gone through the problem set several times: roughly five passes over the problems toward the front of the list and about three over those toward the back), and my coding foundation improved step by step. When I hit code I did not understand, I began adding print statements to work it out myself. Over time I needed to ask others less and less, and my ability to learn and solve problems independently grew noticeably.

Understanding the Competition

  • Competition name: NLP for Beginners – News Text Classification
  • Competition goal: the task is meant to guide newcomers into natural language processing, covering preprocessing, model building, and model training.
  • Competition task: against an NLP background, participants must classify news texts; this is a typical character-level text classification problem.

Learning Goals

  • Understand the competition background and the competition data
  • Register for the competition, download the data, and understand the overall solution approach

Competition Data

The competition data consists of anonymized news articles; the dataset becomes visible and downloadable after registration. The texts are anonymized at the character level and fall into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, current affairs, sports, horoscope, games, and entertainment.

The data is split into a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples. To prevent participants from labeling the test sets by hand, all text in the competition data has been anonymized at the character level.
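For a first look at the data, here is a minimal pandas sketch; the file names train_set.csv / test_a.csv and the tab separator are the same ones used by the training script later in this post:

import pandas as pd

# Tab-separated files: train_set.csv has 'label' and 'text' columns, test_a.csv only 'text'
train_df = pd.read_csv('train_set.csv', sep='\t', encoding='UTF-8')
test_df = pd.read_csv('test_a.csv', sep='\t', encoding='UTF-8')

print(train_df.shape)                     # expected: (200000, 2)
print(train_df['label'].value_counts())   # class distribution over the 14 categories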

Data Labels

The processed training data looks like the sample below:
(figure: preview of the processed training set, with a numeric label column and a text column of space-separated character ids)
The label-to-id mapping in the dataset is: {'科技' (technology): 0, '股票' (stocks): 1, '体育' (sports): 2, '娱乐' (entertainment): 3, '时政' (current affairs): 4, '社会' (society): 5, '教育' (education): 6, '财经' (finance): 7, '家居' (home): 8, '游戏' (games): 9, '房产' (real estate): 10, '时尚' (fashion): 11, '彩票' (lottery): 12, '星座' (horoscope): 13}

Evaluation Metric

The evaluation metric is the mean of the per-class f1_score (macro F1), computed by comparing the submitted predictions with the true test-set labels; higher is better.
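For local validation the same metric can be computed with scikit-learn; the labels below are toy values used only to show the call:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]   # toy ground-truth labels
y_pred = [0, 2, 2, 2, 1, 0]   # toy predictions

# average='macro' takes the unweighted mean of the per-class F1 scores,
# which matches the competition metric
print(f1_score(y_true, y_pred, average='macro'))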

Solution Approaches

Analysis: the task is essentially a text classification problem, where each document must be classified from its characters. Because the data is anonymized, Chinese word segmentation and similar preprocessing cannot be applied directly, and that is the main difficulty of this competition.

The core challenge is therefore to model the anonymized characters and then perform text classification on top of them. Since text is a typical form of unstructured data, a solution generally involves two parts: feature extraction and a classification model. To lower the entry barrier, the organizers suggest several approaches:

Approach 1: TF-IDF + machine-learning classifier
Extract TF-IDF features from the text and train a classifier on them, e.g. an SVM, logistic regression, or XGBoost (a short sketch is given after this list of approaches).

Approach 2: FastText
FastText is the entry-level word-vector option; with Facebook's FastText toolkit a classifier can be built very quickly, as in the sketch below.
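A minimal sketch of this approach, assuming the fasttext Python package and a file train.txt prepared in FastText's __label__ format (the file name and hyperparameters are illustrative, not tuned values from this post):

import fasttext

# each line of train.txt: "__label__<class id> <space-separated character ids>"
model = fasttext.train_supervised(
    input='train.txt', lr=1.0, wordNgrams=2, epoch=25, minCount=1, verbose=2)

# predict() returns a tuple of predicted labels and their probabilities
print(model.predict('2967 6758 339 2021'))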

Approach 3: Word2Vec + deep-learning classifier
Word2Vec is the intermediate word-vector option, combined with a deep-learning classifier such as TextCNN, TextRNN, or BiLSTM.

Approach 4: BERT word vectors
BERT is the high-end word-vector option, with strong representation-learning capacity.

This post uses Approach 1 (TF-IDF + machine-learning classifier) and Approach 4 (BERT word vectors).
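A minimal sketch of Approach 1 with scikit-learn; the sample size, n-gram range, and classifier settings are illustrative rather than tuned values. The rest of the post implements Approach 4.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# the anonymized text is a string of space-separated character ids,
# so word-level TF-IDF works out of the box
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = tfidf.fit_transform(df['text'])
y = df['label']

clf = LogisticRegression(C=4, max_iter=1000)
print(cross_val_score(clf, X, y, cv=3, scoring='f1_macro').mean())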

BERT

For fine-tuning, the last-layer hidden vector of the first token (i.e. [CLS]) is taken as the sentence representation and fed into a softmax layer for classification.
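A minimal sketch of this [CLS]-based classification using the Hugging Face transformers API (the bert_path and the 14 labels come from the pipeline below; this simplified classifier is only an illustration, while the full script wraps the BERT encoder into a hierarchical document model):

import torch.nn as nn
from transformers import BertModel

class BertClsClassifier(nn.Module):
    def __init__(self, bert_path, num_labels=14):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_path)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_rep = outputs[0][:, 0, :]   # hidden state of the first token, i.e. [CLS]
        return self.fc(cls_rep)         # logits; softmax is applied by CrossEntropyLoss during training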

The pre-trained BERT and the accompanying code can be downloaded from: https://pan.baidu.com/s/1zd6wN7elGgp1NyuzYKpvGQ (extraction code: tmp5)

Starting from the original baseline, I changed quite a few hyperparameters; after tuning, the settings below gave the best result relative to that baseline.

Code

print("程序开始")
import logging
import randomimport numpy as np
import torchlogging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')# set seed
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)# set cuda
gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available()
if use_cuda:torch.cuda.set_device(gpu)device = torch.device("cuda", gpu)
else:device = torch.device("cpu")
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)print("开始split data to 20 fold")
# split data to 20 fold
fold_num = 20
data_file = 'train_set.csv'
import pandas as pd


def all_data2fold(fold_num, num=20000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]

    total = len(labels)
    index = list(range(total))
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)

    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        # print(label, len(data))
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        for i in range(fold_num):
            cur_batch_size = batch_size + 1 if i < other else batch_size
            # print(cur_batch_size)
            batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
            all_index[i].extend(batch_data)

    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        if num > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # shuffle
        index = list(range(batch_size))
        np.random.shuffle(index)
        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)

    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))
    return fold_data


fold_data = all_data2fold(20)
print("结束split data to 20 fold")print("开始build train, dev, test data")
# build train, dev, test data
fold_id = 19# dev
dev_data = fold_data[fold_id]# train
train_texts = []
train_labels = []
for i in range(0, fold_id):data = fold_data[i]train_texts.extend(data['text'])train_labels.extend(data['label'])train_data = {'label': train_labels, 'text': train_texts}# test
test_data_file = 'test_a.csv'
f = pd.read_csv(test_data_file, sep='\t', encoding='UTF-8')
texts = f['text'].tolist()
test_data = {'label': [0] * len(texts), 'text': texts}
print("结束build train, dev, test data")print("开始build vocab")
# build vocab
from collections import Counter
from transformers import BasicTokenizer

basic_tokenizer = BasicTokenizer()


class Vocab():
    def __init__(self, train_data):
        self.min_count = 0
        self.pad = 0
        self.unk = 1
        self._id2word = ['[PAD]', '[UNK]']
        self._id2extword = ['[PAD]', '[UNK]']

        self._id2label = []
        self.target_names = []

        self.build_vocab(train_data)

        reverse = lambda x: dict(zip(x, range(len(x))))
        self._word2id = reverse(self._id2word)
        self._label2id = reverse(self._id2label)

        logging.info("Build vocab: words %d, labels %d." % (self.word_size, self.label_size))

    def build_vocab(self, data):
        self.word_counter = Counter()

        for text in data['text']:
            words = text.split()
            for word in words:
                self.word_counter[word] += 1

        for word, count in self.word_counter.most_common():
            if count >= self.min_count:
                self._id2word.append(word)

        label2name = {0: '科技', 1: '股票', 2: '体育', 3: '娱乐', 4: '时政', 5: '社会', 6: '教育', 7: '财经',
                      8: '家居', 9: '游戏', 10: '房产', 11: '时尚', 12: '彩票', 13: '星座'}

        self.label_counter = Counter(data['label'])

        for label in range(len(self.label_counter)):
            count = self.label_counter[label]
            self._id2label.append(label)
            self.target_names.append(label2name[label])

    def load_pretrained_embs(self, embfile):
        with open(embfile, encoding='utf-8') as f:
            lines = f.readlines()
            items = lines[0].split()
            word_count, embedding_dim = int(items[0]), int(items[1])

        index = len(self._id2extword)
        embeddings = np.zeros((word_count + index, embedding_dim))
        for line in lines[1:]:
            values = line.split()
            self._id2extword.append(values[0])
            vector = np.array(values[1:], dtype='float64')
            embeddings[self.unk] += vector
            embeddings[index] = vector
            index += 1

        embeddings[self.unk] = embeddings[self.unk] / word_count
        embeddings = embeddings / np.std(embeddings)

        reverse = lambda x: dict(zip(x, range(len(x))))
        self._extword2id = reverse(self._id2extword)

        assert len(set(self._id2extword)) == len(self._id2extword)

        return embeddings

    def word2id(self, xs):
        if isinstance(xs, list):
            return [self._word2id.get(x, self.unk) for x in xs]
        return self._word2id.get(xs, self.unk)

    def extword2id(self, xs):
        if isinstance(xs, list):
            return [self._extword2id.get(x, self.unk) for x in xs]
        return self._extword2id.get(xs, self.unk)

    def label2id(self, xs):
        if isinstance(xs, list):
            return [self._label2id.get(x, self.unk) for x in xs]
        return self._label2id.get(xs, self.unk)

    @property
    def word_size(self):
        return len(self._id2word)

    @property
    def extword_size(self):
        return len(self._id2extword)

    @property
    def label_size(self):
        return len(self._id2label)


vocab = Vocab(train_data)
print("结束build vocab")print("开始build module")
# build module
import torch.nn as nn
import torch.nn.functional as Fclass Attention(nn.Module):def __init__(self, hidden_size):super(Attention, self).__init__()self.weight = nn.Parameter(torch.Tensor(hidden_size, hidden_size))self.weight.data.normal_(mean=0.0, std=0.05)self.bias = nn.Parameter(torch.Tensor(hidden_size))b = np.zeros(hidden_size, dtype=np.float32)self.bias.data.copy_(torch.from_numpy(b))self.query = nn.Parameter(torch.Tensor(hidden_size))self.query.data.normal_(mean=0.0, std=0.05)def forward(self, batch_hidden, batch_masks):# batch_hidden: b x len x hidden_size (2 * hidden_size of lstm)# batch_masks:  b x len# linearkey = torch.matmul(batch_hidden, self.weight) + self.bias  # b x len x hidden# compute attentionoutputs = torch.matmul(key, self.query)  # b x lenmasked_outputs = outputs.masked_fill((1 - batch_masks).bool(), float(-1e32))attn_scores = F.softmax(masked_outputs, dim=1)  # b x len# 对于全零向量,-1e32的结果为 1/len, -inf为nan, 额外补0masked_attn_scores = attn_scores.masked_fill((1 - batch_masks).bool(), 0.0)# sum weighted sourcesbatch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1)  # b x hiddenreturn batch_outputs, attn_scores# build word encoder
csv_path = './bert/bert-mini/'
bert_path = '/students/julyedu_554294/NLP/news/bert/bert-mini'
dropout = 0.18from transformers import BertModelclass WordBertEncoder(nn.Module):def __init__(self):super(WordBertEncoder, self).__init__()self.dropout = nn.Dropout(dropout)self.tokenizer = WhitespaceTokenizer()self.bert = BertModel.from_pretrained(bert_path)self.pooled = Falselogging.info('Build Bert encoder with pooled {}.'.format(self.pooled))def encode(self, tokens):tokens = self.tokenizer.tokenize(tokens)return tokensdef get_bert_parameters(self):no_decay = ['bias', 'LayerNorm.weight']optimizer_parameters = [{'params': [p for n, p in self.bert.named_parameters() if not any(nd in n for nd in no_decay)],'weight_decay': 0.01},{'params': [p for n, p in self.bert.named_parameters() if any(nd in n for nd in no_decay)],'weight_decay': 0.0}]return optimizer_parametersdef forward(self, input_ids, token_type_ids):# input_ids: sen_num x bert_len# token_type_ids: sen_num  x bert_len# sen_num x bert_len x 256, sen_num x 256sequence_output, pooled_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids)#print('sequence_output:', sequence_output)#print('pooled_output:', pooled_output)if self.pooled:reps = pooled_outputelse:reps = sequence_output[:, 0, :]  # sen_num x 256if self.training:reps = self.dropout(reps)return repsclass WhitespaceTokenizer():"""WhitespaceTokenizer with vocab."""def __init__(self):vocab_file = csv_path + 'vocab.txt'self._token2id = self.load_vocab(vocab_file)self._id2token = {v: k for k, v in self._token2id.items()}self.max_len = 256self.unk = 1logging.info("Build Bert vocab with size %d." % (self.vocab_size))def load_vocab(self, vocab_file):f = open(vocab_file, 'r')lines = f.readlines()lines = list(map(lambda x: x.strip(), lines))vocab = dict(zip(lines, range(len(lines))))return vocabdef tokenize(self, tokens):assert len(tokens) <= self.max_len - 2tokens = ["[CLS]"] + tokens + ["[SEP]"]output_tokens = self.token2id(tokens)return output_tokensdef token2id(self, xs):if isinstance(xs, list):return [self._token2id.get(x, self.unk) for x in xs]return self._token2id.get(xs, self.unk)@propertydef vocab_size(self):return len(self._id2token)# build sent encoder
sent_hidden_size = 256
sent_num_layers = 2class SentEncoder(nn.Module):def __init__(self, sent_rep_size):super(SentEncoder, self).__init__()self.dropout = nn.Dropout(dropout)self.sent_lstm = nn.LSTM(input_size=sent_rep_size,hidden_size=sent_hidden_size,num_layers=sent_num_layers,batch_first=True,bidirectional=True)def forward(self, sent_reps, sent_masks):# sent_reps:  b x doc_len x sent_rep_size# sent_masks: b x doc_lensent_hiddens, _ = self.sent_lstm(sent_reps)  # b x doc_len x hidden*2sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)if self.training:sent_hiddens = self.dropout(sent_hiddens)return sent_hiddens
print("结束build module")print("开始build model")
# build model
class Model(nn.Module):def __init__(self, vocab):super(Model, self).__init__()self.sent_rep_size = 256self.doc_rep_size = sent_hidden_size * 2self.all_parameters = {}parameters = []self.word_encoder = WordBertEncoder()bert_parameters = self.word_encoder.get_bert_parameters()self.sent_encoder = SentEncoder(self.sent_rep_size)self.sent_attention = Attention(self.doc_rep_size)parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_encoder.parameters())))parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_attention.parameters())))self.out = nn.Linear(self.doc_rep_size, vocab.label_size, bias=True)parameters.extend(list(filter(lambda p: p.requires_grad, self.out.parameters())))if use_cuda:self.to(device)if len(parameters) > 0:self.all_parameters["basic_parameters"] = parametersself.all_parameters["bert_parameters"] = bert_parameterslogging.info('Build model with bert word encoder, lstm sent encoder.')para_num = sum([np.prod(list(p.size())) for p in self.parameters()])logging.info('Model param num: %.2f M.' % (para_num / 1e6))def forward(self, batch_inputs):# batch_inputs(batch_inputs1, batch_inputs2): b x doc_len x sent_len# batch_masks : b x doc_len x sent_lenbatch_inputs1, batch_inputs2, batch_masks = batch_inputsbatch_size, max_doc_len, max_sent_len = batch_inputs1.shape[0], batch_inputs1.shape[1], batch_inputs1.shape[2]batch_inputs1 = batch_inputs1.view(batch_size * max_doc_len, max_sent_len)  # sen_num x sent_lenbatch_inputs2 = batch_inputs2.view(batch_size * max_doc_len, max_sent_len)  # sen_num x sent_lenbatch_masks = batch_masks.view(batch_size * max_doc_len, max_sent_len)  # sen_num x sent_lensent_reps = self.word_encoder(batch_inputs1, batch_inputs2)  # sen_num x sent_rep_sizesent_reps = sent_reps.view(batch_size, max_doc_len, self.sent_rep_size)  # b x doc_len x sent_rep_sizebatch_masks = batch_masks.view(batch_size, max_doc_len, max_sent_len)  # b x doc_len x max_sent_lensent_masks = batch_masks.bool().any(2).float()  # b x doc_lensent_hiddens = self.sent_encoder(sent_reps, sent_masks)  # b x doc_len x doc_rep_sizedoc_reps, atten_scores = self.sent_attention(sent_hiddens, sent_masks)  # b x doc_rep_sizebatch_outputs = self.out(doc_reps)  # b x num_labelsreturn batch_outputsmodel = Model(vocab)
print("结束build model")print("开始build optimizer")
# build optimizer
learning_rate = 2e-4
bert_lr = 2.5e-5
decay = .95
decay_step = 1000
from transformers import AdamW, get_linear_schedule_with_warmupclass Optimizer:def __init__(self, model_parameters, steps):self.all_params = []self.optims = []self.schedulers = []for name, parameters in model_parameters.items():if name.startswith("basic"):optim = torch.optim.Adam(parameters, lr=learning_rate)self.optims.append(optim)l = lambda step: decay ** (step // decay_step)scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=l)self.schedulers.append(scheduler)self.all_params.extend(parameters)elif name.startswith("bert"):optim_bert = AdamW(parameters, bert_lr, eps=1e-8)self.optims.append(optim_bert)scheduler_bert = get_linear_schedule_with_warmup(optim_bert, 0, steps)self.schedulers.append(scheduler_bert)for group in parameters:for p in group['params']:self.all_params.append(p)else:Exception("no nameed parameters.")self.num = len(self.optims)def step(self):for optim, scheduler in zip(self.optims, self.schedulers):optim.step()scheduler.step()optim.zero_grad()def zero_grad(self):for optim in self.optims:optim.zero_grad()def get_lr(self):lrs = tuple(map(lambda x: x.get_lr()[-1], self.schedulers))lr = ' %.5f' * self.numres = lr % lrsreturn res
print("结束build optimizer")print("开始build dataset")
def sentence_split(text, vocab, max_sent_len=256, max_segment=16):words = text.strip().split()document_len = len(words)index = list(range(0, document_len, max_sent_len))index.append(document_len)segments = []for i in range(len(index) - 1):segment = words[index[i]: index[i + 1]]assert len(segment) > 0segment = [word if word in vocab._id2word else '<UNK>' for word in segment]segments.append([len(segment), segment])assert len(segments) > 0if len(segments) > max_segment:segment_ = int(max_segment / 2)return segments[:segment_] + segments[-segment_:]else:return segmentsdef get_examples(data, word_encoder, vocab, max_sent_len=256, max_segment=8):label2id = vocab.label2idexamples = []for text, label in zip(data['text'], data['label']):# labelid = label2id(label)# wordssents_words = sentence_split(text, vocab, max_sent_len-2, max_segment)doc = []for sent_len, sent_words in sents_words:token_ids = word_encoder.encode(sent_words)sent_len = len(token_ids)token_type_ids = [0] * sent_lendoc.append([sent_len, token_ids, token_type_ids])examples.append([id, len(doc), doc])logging.info('Total %d docs.' % len(examples))return examples
print("结束build dataset")print("开始build loader")
# build loaderdef batch_slice(data, batch_size):batch_num = int(np.ceil(len(data) / float(batch_size)))for i in range(batch_num):cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * idocs = [data[i * batch_size + b] for b in range(cur_batch_size)]yield docsdef data_iter(data, batch_size, shuffle=True, noise=1.0):"""randomly permute data, then sort by source length, and partition into batchesensure that the length of  sentences in each batch"""batched_data = []if shuffle:np.random.shuffle(data)lengths = [example[1] for example in data]noisy_lengths = [- (l + np.random.uniform(- noise, noise)) for l in lengths]sorted_indices = np.argsort(noisy_lengths).tolist()sorted_data = [data[i] for i in sorted_indices]else:sorted_data =databatched_data.extend(list(batch_slice(sorted_data, batch_size)))if shuffle:np.random.shuffle(batched_data)for batch in batched_data:yield batch
print("结束build loader")print("开始some function")
# some function
from sklearn.metrics import f1_score, precision_score, recall_scoredef get_score(y_ture, y_pred):y_ture = np.array(y_ture)y_pred = np.array(y_pred)f1 = f1_score(y_ture, y_pred, average='macro') * 100p = precision_score(y_ture, y_pred, average='macro') * 100r = recall_score(y_ture, y_pred, average='macro') * 100return str((reformat(p, 2), reformat(r, 2), reformat(f1, 2))), reformat(f1, 2)def reformat(num, n):return float(format(num, '0.' + str(n) + 'f'))
print("结束some function")print("开始build trainer")
# build trainerimport time
from sklearn.metrics import classification_reportclip = 5.0
epochs = 20
early_stops = 6
log_interval = 150test_batch_size = 64
train_batch_size = 64save_model = './bert_submission_0306_12_change_canshu_taidaxiugai_kuoda2.bin'
save_test = './bert_submission_0306_12_change_canshu_taidaxiugai_kuoda2.csv'class Trainer():def __init__(self, model, vocab):self.model = modelself.report = Trueself.train_data = get_examples(train_data, model.word_encoder, vocab)self.batch_num = int(np.ceil(len(self.train_data) / float(train_batch_size)))self.dev_data = get_examples(dev_data, model.word_encoder, vocab)self.test_data = get_examples(test_data, model.word_encoder, vocab)# criterionself.criterion = nn.CrossEntropyLoss()# label nameself.target_names = vocab.target_names# optimizerself.optimizer = Optimizer(model.all_parameters, steps=self.batch_num * epochs)# countself.step = 0self.early_stop = -1self.best_train_f1, self.best_dev_f1 = 0, 0self.last_epoch = epochsdef train(self):logging.info('Start training...')for epoch in range(1, epochs + 1):train_f1 = self._train(epoch)dev_f1 = self._eval(epoch)if self.best_dev_f1 <= dev_f1:logging.info("Exceed history dev = %.2f, current dev = %.2f" % (self.best_dev_f1, dev_f1))torch.save(self.model.state_dict(), save_model)self.best_train_f1 = train_f1self.best_dev_f1 = dev_f1self.early_stop = 0else:self.early_stop += 1if self.early_stop == early_stops:logging.info("Eearly stop in epoch %d, best train: %.2f, dev: %.2f" % (epoch - early_stops, self.best_train_f1, self.best_dev_f1))self.last_epoch = epochbreakdef test(self):self.model.load_state_dict(torch.load(save_model))self._eval(self.last_epoch + 1, test=True)def _train(self, epoch):self.optimizer.zero_grad()self.model.train()start_time = time.time()epoch_start_time = time.time()overall_losses = 0losses = 0batch_idx = 1y_pred = []y_true = []for batch_data in data_iter(self.train_data, train_batch_size, shuffle=True):torch.cuda.empty_cache()batch_inputs, batch_labels = self.batch2tensor(batch_data)batch_outputs = self.model(batch_inputs)loss = self.criterion(batch_outputs, batch_labels)loss.backward()loss_value = loss.detach().cpu().item()losses += loss_valueoverall_losses += loss_valuey_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())y_true.extend(batch_labels.cpu().numpy().tolist())nn.utils.clip_grad_norm_(self.optimizer.all_params, max_norm=clip)for optimizer, scheduler in zip(self.optimizer.optims, self.optimizer.schedulers):optimizer.step()scheduler.step()self.optimizer.zero_grad()self.step += 1if batch_idx % log_interval == 0:elapsed = time.time() - start_timelrs = self.optimizer.get_lr()logging.info('| epoch {:3d} | step {:3d} | batch {:3d}/{:3d} | lr{} | loss {:.4f} | s/batch {:.2f}'.format(epoch, self.step, batch_idx, self.batch_num, lrs,losses / log_interval,elapsed / log_interval))losses = 0start_time = time.time()batch_idx += 1overall_losses /= self.batch_numduring_time = time.time() - epoch_start_time# reformatoverall_losses = reformat(overall_losses, 4)score, f1 = get_score(y_true, y_pred)logging.info('| epoch {:3d} | score {} | f1 {} | loss {:.4f} | time {:.2f}'.format(epoch, score, f1,overall_losses,during_time))if set(y_true) == set(y_pred) and self.report:report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)logging.info('\n' + report)return f1def _eval(self, epoch, test=False):self.model.eval()start_time = time.time()data = self.test_data if test else self.dev_datay_pred = []y_true = []with torch.no_grad():for batch_data in data_iter(data, test_batch_size, shuffle=False):torch.cuda.empty_cache()batch_inputs, batch_labels = self.batch2tensor(batch_data)batch_outputs = self.model(batch_inputs)y_pred.extend(torch.max(batch_outputs, 
dim=1)[1].cpu().numpy().tolist())y_true.extend(batch_labels.cpu().numpy().tolist())score, f1 = get_score(y_true, y_pred)during_time = time.time() - start_timeif test:df = pd.DataFrame({'label': y_pred})df.to_csv(save_test, index=False, sep=',')else:logging.info('| epoch {:3d} | dev | score {} | f1 {} | time {:.2f}'.format(epoch, score, f1,during_time))if set(y_true) == set(y_pred) and self.report:report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)logging.info('\n' + report)return f1def batch2tensor(self, batch_data):'''[[label, doc_len, [[sent_len, [sent_id0, ...], [sent_id1, ...]], ...]]'''batch_size = len(batch_data)doc_labels = []doc_lens = []doc_max_sent_len = []for doc_data in batch_data:doc_labels.append(doc_data[0])doc_lens.append(doc_data[1])sent_lens = [sent_data[0] for sent_data in doc_data[2]]max_sent_len = max(sent_lens)doc_max_sent_len.append(max_sent_len)max_doc_len = max(doc_lens)max_sent_len = max(doc_max_sent_len)batch_inputs1 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)batch_inputs2 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)batch_masks = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.float32)batch_labels = torch.LongTensor(doc_labels)for b in range(batch_size):for sent_idx in range(doc_lens[b]):sent_data = batch_data[b][2][sent_idx]for word_idx in range(sent_data[0]):batch_inputs1[b, sent_idx, word_idx] = sent_data[1][word_idx]batch_inputs2[b, sent_idx, word_idx] = sent_data[2][word_idx]batch_masks[b, sent_idx, word_idx] = 1if use_cuda:batch_inputs1 = batch_inputs1.to(device)batch_inputs2 = batch_inputs2.to(device)batch_masks = batch_masks.to(device)batch_labels = batch_labels.to(device)return (batch_inputs1, batch_inputs2, batch_masks), batch_labels
print("结束build trainer")print("开始train")
# train
trainer = Trainer(model, vocab)
trainer.train()
print("结束train")print("开始test")
# test
trainer.test()
print("结束test")print("程序结束")logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)

Score: 0.9221

Competition source: Alibaba Cloud Tianchi competition – NLP for Beginners – News Text Classification

