Why Does the Transformer Need Multi-head Attention?

Mobile Development 2025-04-09 21:59

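Contents

2. Basic Concepts

2.2. Attention is all you need

2.3. Self-attention

2.3.1. Overview of Self-attention

2.3.2. Training Details

2.4. Multi-head Attention

2.4.1. Multi-head Theory in Detail

2.4.2. Multi-head Code Implementation

3. Viewpoints from the Discussion

3.1. Viewpoint 1

3.2. Viewpoint 2

3.4. Viewpoint 4

3.5. Viewpoint 5

3.6. Viewpoint 6

3.7. My Own View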

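1. Preface

This article was written for a Huawei Cloud co-creation task, and the topic caught my interest as soon as I saw it. The whole discussion took place on Zhihu; original link: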