Interview Question Answer
1. Model construction approach
- Data preprocessing: clean, tokenize, and vectorize the text so that the model can consume it.
- Transfer learning: use a model pretrained on large-scale general-purpose text (e.g., BERT, GPT), which has already learned rich language features, and fine-tune it on the new classification task (spam detection).
- Model architecture: add a classification layer on top of the pretrained model, for example a fully connected layer, that maps the pretrained model's output features to the target classes; a minimal sketch follows this list.
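The sketch below illustrates the "pretrained encoder + classification head" idea from the list above. It is a minimal example, assuming `bert-base-uncased` and a binary task (spam / not spam); the `BertForSequenceClassification` class used in the full code of section 5 packages the same structure for you, so this is only to make the architecture point explicit.

```python
import torch
from torch import nn
from transformers import BertModel

class SpamClassifier(nn.Module):
    """Pretrained BERT encoder with a small classification head (sketch only)."""

    def __init__(self, num_labels=2, hidden_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Classification head on top of the [CLS] representation.
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.head(cls_repr)
```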
2. How transfer learning is implemented
Taking the BERT model from Hugging Face's Transformers library as an example:
- Load the pretrained model: use the `transformers` library to load the pretrained BERT model and its tokenizer.
- Freeze parameters: early in fine-tuning, freeze some or all of the pretrained model's parameters so that only the newly added classification layer is updated, preventing the pretrained weights from being disturbed too much.
- Fine-tune: later in training, gradually unfreeze the pretrained parameters and train them jointly with the classification layer on the new dataset; a sketch of selective unfreezing follows this list.
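The full code in section 5 freezes and later unfreezes the entire BERT backbone. A common variant of the "gradually unfreeze" step is to unfreeze only the top few encoder layers; the snippet below is a sketch under that assumption (the number of layers to unfreeze is illustrative, not a tuned value).

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Freeze the whole BERT encoder first; the classification head
# (model.classifier) keeps requires_grad=True and is trained from scratch.
for param in model.bert.parameters():
    param.requires_grad = False

# Variant (assumption, not required by the code in section 5): unfreeze only
# the top N encoder layers instead of the entire backbone.
num_layers_to_unfreeze = 2
for layer in model.bert.encoder.layer[-num_layers_to_unfreeze:]:
    for param in layer.parameters():
        param.requires_grad = True
```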
3. Hyperparameter tuning strategy
- Learning rate: the initial learning rate should not be too large, typically between `1e-5` and `1e-3`; it can be adjusted dynamically during training with a learning-rate schedule (e.g., cosine annealing, StepLR).
- Batch size: usually chosen from 16, 32, or 64. Larger batches make better use of hardware parallelism but may run out of memory, so adjust to the available hardware.
- Hidden dimension: for the classification layer, try values such as 128, 256, or 512 and tune against validation-set performance; a scheduler sketch follows this list.
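To make the scheduling point concrete, the snippet below sketches cosine annealing with PyTorch's built-in scheduler. It assumes the `model` and `train_loader` built in section 5; the learning rate and epoch count are illustrative values, not tuned results.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Illustrative values only.
optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 3
steps_per_epoch = len(train_loader)  # DataLoader from section 5 (assumption)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs * steps_per_epoch)

# Inside the training loop, step the scheduler once per batch:
#   optimizer.step()
#   scheduler.step()
```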
4. Concrete data augmentation methods
- Back-translation: translate the text into another language and then back into the original language to generate text that is semantically similar but worded differently.
- Synonym replacement: use a synonym dictionary to replace some words in the text with synonyms.
- Random deletion: delete words from the text with a certain probability; a minimal sketch of the last two methods follows this list.
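Since the code in section 5 does not include augmentation, here is a minimal sketch of synonym replacement and random deletion. The synonym dictionary is a tiny hand-written placeholder; in practice it would be backed by a real thesaurus (e.g., WordNet) or an embedding-based lookup.

```python
import random

# Placeholder synonym table for illustration only.
SYNONYMS = {
    'free': ['complimentary', 'gratis'],
    'money': ['cash', 'funds'],
    'win': ['earn', 'receive'],
}

def synonym_replace(text, prob=0.1):
    """Replace each word with a random synonym with probability `prob`."""
    out = []
    for w in text.split():
        if w.lower() in SYNONYMS and random.random() < prob:
            out.append(random.choice(SYNONYMS[w.lower()]))
        else:
            out.append(w)
    return ' '.join(out)

def random_delete(text, prob=0.1):
    """Delete each word independently with probability `prob`."""
    words = text.split()
    kept = [w for w in words if random.random() > prob]
    return ' '.join(kept) if kept else random.choice(words)

# Example: augment the training texts before tokenization.
# aug_texts = [random_delete(synonym_replace(t)) for t in train_texts]
```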
5. Code implementation (using PyTorch and Hugging Face's `transformers` library)
```python
import torch
from torch.optim import AdamW  # AdamW now lives in torch.optim; the transformers copy is deprecated
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Data loading: one example per line, text and label separated by a tab.
def load_data(file_path):
    texts, labels = [], []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            texts.append(parts[0])
            labels.append(int(parts[1]))
    return texts, labels


texts, labels = load_data('your_data_file.txt')
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# Tokenization and tensor datasets.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

train_dataset = TensorDataset(
    torch.tensor(train_encodings['input_ids']),
    torch.tensor(train_encodings['attention_mask']),
    torch.tensor(train_labels))
val_dataset = TensorDataset(
    torch.tensor(val_encodings['input_ids']),
    torch.tensor(val_encodings['attention_mask']),
    torch.tensor(val_labels))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Model: pretrained BERT with a binary classification head.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)


def run_training(model, train_loader, val_loader, num_epochs, lr):
    """Train for num_epochs and report validation accuracy after each epoch."""
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = len(train_loader) * num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps)

    for epoch in range(num_epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()
            # Passing labels makes the model compute the cross-entropy loss itself.
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            outputs.loss.backward()
            optimizer.step()
            scheduler.step()

        # Validation
        model.eval()
        val_preds, val_true = [], []
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                outputs = model(input_ids, attention_mask=attention_mask)
                preds = torch.argmax(outputs.logits, dim=1)
                val_preds.extend(preds.cpu().numpy())
                val_true.extend(labels.numpy())
        val_accuracy = accuracy_score(val_true, val_preds)
        print(f'Epoch {epoch + 1}, Val Accuracy: {val_accuracy}')


# Phase 1: freeze the BERT backbone and train only the classification head.
for param in model.bert.parameters():
    param.requires_grad = False
run_training(model, train_loader, val_loader, num_epochs=3, lr=1e-5)

# Phase 2: unfreeze the BERT backbone and fine-tune everything jointly.
for param in model.bert.parameters():
    param.requires_grad = True
run_training(model, train_loader, val_loader, num_epochs=3, lr=1e-5)
```
Notes

The code above assumes the data file contains one text and its label per line, separated by a tab. It loads and preprocesses the data, builds the model, first trains the classification layer with the BERT parameters frozen, and then unfreezes BERT and continues fine-tuning. In practice the code can be further optimized according to the characteristics of the data and the requirements of the task; a short inference sketch follows.
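To round out the workflow, the snippet below sketches how the fine-tuned model could classify a new email. The sample message and the label mapping (0 = ham, 1 = spam) are assumptions for illustration.

```python
# Minimal inference sketch with the fine-tuned model from section 5.
model.eval()
sample = "Congratulations, you have won a free prize! Reply now to claim it."  # made-up example
encoding = tokenizer(sample, truncation=True, padding=True, return_tensors='pt').to(device)

with torch.no_grad():
    logits = model(**encoding).logits
pred = torch.argmax(logits, dim=1).item()
print('spam' if pred == 1 else 'ham')  # assumes label 1 = spam, 0 = ham
```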