4 AI Tech Stack Pillars to Watch: Data, Compute, Models, and Ops

Aayush Mittal, AI & ML Software Engineer

Fact Checked by Alexandra Pankratyeva

26 January 2024


The generative AI tech stack saw massive progress in 2023, with breakthroughs in systems like ChatGPT, DALL-E 3, and Google’s Gemini. However, as AI becomes more powerful and widespread, it’s clear we’re only beginning to tap into the possibilities.

The foundational pillars of the AI technology stack – data, compute, models, and AIOps – will continue advancing rapidly.

In this article, we compile the key developments to anticipate in each area.

Key Takeaways

  • The AI tech stack is based on four fundamental elements: training data, compute resources, AI models, and AIOps best practices.

  • As AI models become more powerful, adequate infrastructure to support them becomes even more critical.

  • Data markets will emerge to value, trade, and combine diverse data sources as AI models consume more data.

The Essential AI Tech Stack and Development Trends to Watch

Data

High-quality training data remains the fuel for increasingly powerful AI models. As models scale up into the trillion-parameter range, the data hunger only grows. However, not all data is created equal – variance, complexity, and alignment matter as much as scale.

Key data trends to track include:

  • Synthetic data generation will continue to improve, producing training sets that better mimic the complexity of the real world – tools like Mostly AI and AI21 Labs’ Jurassic-1 point the way.

  • Multimodal data integration will allow models like Google’s Imagen to tackle tasks that require connecting images, audio, video, and text. Models pre-trained on aligned multimodal datasets will power further breakthroughs.

  • Real-world data from users and companies will supplement synthetic data via federated learning and other techniques. This real-world grounding is key to avoiding AI hallucinations.

  • Low-data techniques like prompt engineering will enable highly sample-efficient fine-tuning. Models will adapt to new domains with only hundreds of examples rather than millions (see the prompt-construction sketch after this list).

  • Data markets will emerge to value, trade, and combine diverse data sources. As AI models consume more data, proper valuation and incentives become critical. In November 2023, OpenAI announced the launch of Data Partnerships, an initiative to work with organizations to produce public and private datasets for training AI models.
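
To ground the low-data point, here is a minimal sketch of few-shot prompt construction: a handful of labeled examples steers a general instruction-tuned model toward a new domain with no gradient updates at all. The task, labels, and reviews below are hypothetical placeholders.

```python
# Domain adaptation from a handful of examples: build a few-shot prompt
# instead of fine-tuning. The task, labels, and reviews are placeholders.
FEW_SHOT_EXAMPLES = [
    ("The delivery arrived two weeks late.", "negative"),
    ("Support resolved my issue in minutes.", "positive"),
    ("The product matches the description.", "neutral"),
]

def build_prompt(query: str) -> str:
    """Assemble a few-shot prompt from labeled examples plus the new input."""
    lines = [
        "Classify the sentiment of each review as positive, negative, or neutral.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]  # the model completes this line
    return "\n".join(lines)

if __name__ == "__main__":
    # The resulting string can be sent to any instruction-tuned LLM endpoint.
    print(build_prompt("Setup was confusing but the device works well."))
```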

Compute

Training the largest AI models already requires Google-scale infrastructure. Optimizing the AI compute stack will help democratize access to the development of AI-powered solutions:

  • Specialized hardware like tensor processing units (TPUs), Dojo, and Cerebras will offer order-of-magnitude speedups and power efficiencies vs GPUs.

  • Model parallelism, as shown in Megatron-LM, will efficiently scale model training beyond what fits on any one chip.

  • Inference optimization will reduce latency and costs. Approaches like mixture-of-experts, model quantization, and streaming inference will help (a quantization sketch follows this list).

  • Cloud marketplace competition from Amazon, Microsoft, Google, and startups will continue driving down model serving costs.

  • On-device inference will push AI compute to edge devices like smartphones, letting developers avoid cloud costs and latency.
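
To make the inference-optimization bullet concrete, here is a minimal sketch of post-training dynamic quantization with PyTorch's built-in quantize_dynamic: Linear-layer weights are converted to int8, which shrinks the model and often speeds up CPU inference. The toy model and tensor sizes are placeholders; a real deployment would benchmark accuracy before and after.

```python
import torch
import torch.nn as nn

# Toy float32 model standing in for a real network.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()  # dynamic quantization is a post-training, inference-only step

# Convert the weights of all Linear layers to 8-bit integers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface as the original model
```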

Researchers from MIT and the MIT-IBM Watson AI Lab developed a technique enabling deep-learning models to adapt to new sensor data directly on an edge device.

According to Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the MIT-IBM Watson AI Lab: “On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can.”

Models

Language, image, video, and multimodal models will continue to grow more powerful. However, scale is not all that matters: new architectures, training techniques, and evaluation metrics are also critical.

  • Multimodal architectures like Google’s Gemini fuse modalities into a single model, avoiding siloed AI. This enables richer applications like visual chatbots.

  • Improved training with techniques like Anthropic’s Constitutional AI will reduce harmful biases and improve safety. Models like Midjourney’s v6 show steady progress.

  • Better evaluation through benchmarks like HumanEval and AGIEval will surface real progress, avoiding vanity metrics (see the pass@k sketch after this list). Robust out-of-distribution (OOD) generalization is the goal.

  • Specialized models will tackle vertical domains like code, chemistry, and maths. Transfer learning from general models helps bootstrap these.
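
To show what evaluation on a benchmark like HumanEval looks like in practice, here is a minimal sketch of the unbiased pass@k estimator popularized by HumanEval: given n sampled solutions per problem of which c pass the unit tests, estimate the probability that at least one of k samples passes. The sample counts below are hypothetical, and the actual test-execution harness is omitted.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-subset must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical benchmark run: 200 samples per problem, 37 pass the unit tests.
print(round(pass_at_k(200, 37, 1), 4))   # 0.185
print(round(pass_at_k(200, 37, 10), 4))  # chance at least 1 of 10 samples passes
```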

Ops

The AIOps stack requires tooling for rapid experimentation, deployment, and monitoring to build real-world AI applications.

  • MLOps will become table stakes, allowing seamless model development and deployment lifecycles.

  • Experiment tracking through tools like Comet ML and Weights & Biases will accelerate research (a tracking sketch follows this list).

  • Infrastructure automation via Terraform and Kubernetes will simplify scaling.

  • Monitoring through WhyLabs, Robust Intelligence, and others will ensure reliable production AI.

  • Distribution platforms like HuggingFace, Render, and Causal will simplify model access.

  • Vertical solutions will hide complexity for non-experts. For example, Replicate and Runway ML focus on deploying generative models.
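
As a small taste of the experiment-tracking workflow, here is a minimal sketch using the Weights & Biases Python client, assuming the wandb package is installed and an API key is configured. The project name, hyperparameters, and loss values are stand-ins for a real training loop.

```python
import random
import wandb

# Record hyperparameters once per run so experiments are comparable later.
run = wandb.init(
    project="ai-stack-demo",                # hypothetical project name
    config={"lr": 3e-4, "batch_size": 32},  # placeholder hyperparameters
)

for step in range(100):
    # Stand-in for a real training step; wandb.log streams metrics to the run.
    loss = 1.0 / (step + 1) + random.random() * 0.01
    wandb.log({"train/loss": loss}, step=step)

run.finish()  # flush and close the run
```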

The Critical Role of AI Infrastructure

As AI models grow more powerful, the infrastructure to support them becomes even more crucial. Here’s why it’s so essential:

Data management

With AI models requiring vast amounts of high-quality data, infrastructure must provide secure and efficient data pipelines. This includes capabilities like data versioning, lineage tracking, access controls, and compliance monitoring.
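
Here is a minimal sketch of the data-versioning idea: fingerprint each file with a content hash so pipelines can detect changes and record lineage. The directory path is a placeholder, and production stacks would typically reach for a tool like DVC or lakeFS rather than hand-rolling this.

```python
import hashlib
import json
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(data_dir: str) -> dict:
    """Map each file under data_dir to its content hash: a dataset 'version'."""
    root = Path(data_dir)
    return {
        str(p.relative_to(root)): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

if __name__ == "__main__":
    # Diffing two snapshots shows exactly which files changed between versions.
    print(json.dumps(snapshot("data"), indent=2))  # "data" is a placeholder path
```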

Specialized hardware

AI workloads demand high-performance compute like GPUs and TPUs. Infrastructure must make these resources available on demand while optimizing cost and energy efficiency.

Model development

AI Infrastructure should enable iterative coding, rapid experimentation, and seamless model deployment to accelerate research. MLOps practices in areas like experiment tracking are essential.

Scaling

As model sizes and request volumes grow, infrastructure must scale smoothly via distribution and load balancing. Auto-scaling on serverless platforms helps match supply to demand.

Monitoring

Once in production, AI systems require robust monitoring of accuracy, latency, costs, and other metrics. This helps prevent harmful errors or degradation.
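
As one concrete monitoring technique, here is a minimal sketch of input-drift detection with the Population Stability Index (PSI), which compares a production feature distribution against its training-time baseline. The distributions are synthetic, and the 0.2 review threshold noted in the comment is a common rule of thumb rather than a standard; commercial tools such as those above automate this kind of check.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), binned on the baseline."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)    # training-time feature distribution
production = rng.normal(0.3, 1.2, 10_000)  # drifted production distribution

print(f"PSI = {psi(baseline, production):.3f}")
# A common rule of thumb flags PSI above ~0.2 for investigation.
```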

The Bottom Line

The trends in the AI stack point to a future where AI capabilities become far more powerful, robust, transparent, and accessible to all developers.

Significant work is still ahead in improving data quality and availability, specialized hardware, evaluation rigor, and productive tooling.

However, the progress of 2023 sets the stage for an exciting decade of AI innovation to come.


via:


Reprinted from blog.csdn.net/u013669912/article/details/142890849