1、Hadoop
- Hadoop is an open-source distributed computing framework for storing and processing large-scale data sets. It provides a scalable distributed file system (HDFS) and a distributed computing framework (MapReduce), which together enable parallel computation across clusters of inexpensive commodity hardware.
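The MapReduce model mentioned above can be sketched with the classic word-count example. This is a local, pure-Python illustration in the style of Hadoop Streaming (where the mapper and reducer are plain scripts), not Hadoop API code; the function names are my own.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum counts per word. Hadoop's shuffle delivers
    pairs sorted and grouped by key; sorted() simulates that here."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

# Local simulation of map -> shuffle/sort -> reduce over two input "splits".
splits = ["big data big", "data lake"]
counts = dict(reducer(mapper(splits)))
```

In a real cluster, many mapper and reducer processes run in parallel on different nodes, and the framework handles the shuffle between them.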
2、HDFS
- HDFS (Hadoop Distributed File System) is Hadoop's distributed file system, designed for storing and managing large-scale datasets in a cluster. HDFS partitions data into blocks and replicates each block across different nodes to provide fault tolerance and high availability.
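The two ideas in that bullet, fixed-size blocks and replica placement, can be sketched as follows. This is an illustration, not HDFS code; the tiny block size and round-robin placement are simplifying assumptions (HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement).

```python
BLOCK_SIZE = 4    # bytes; HDFS default is 128 MB, tiny here for illustration
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Partition a file's bytes into fixed-size blocks (last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.
    Real HDFS placement is rack-aware; this only shows the shape."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"0123456789")  # -> 3 blocks: 4 + 4 + 2 bytes
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single datanode still leaves two copies of every block, which is why clients can keep reading through node failures.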
- As far as I know, most companies generally save the data required by the model, such as files in csv/libsvm format, as Hive tables and store them on HDFS.
3、HIVE
- Hive is a Hadoop-based data warehouse infrastructure that provides a SQL-like query language (HiveQL) for querying and analyzing data stored on Hadoop. Hive maps structured data files on HDFS to tables and provides a high-level abstraction, enabling users to query and analyze them with SQL-like syntax.
- Built on top of HDFS, Hive is essentially a translator: it translates HiveQL statements into MapReduce or Spark programs.
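The "translator" idea can be illustrated with a toy example: a `GROUP BY` query conceptually becomes a map phase (emit the grouping key), a shuffle (bring equal keys together), and a reduce phase (aggregate). This Python sketch mirrors only the shape of that plan; real Hive generates far more elaborate MapReduce/Spark jobs, and the sample data is made up.

```python
from collections import defaultdict

# Toy translation of:  SELECT city, COUNT(*) FROM users GROUP BY city;
rows = [{"city": "SH"}, {"city": "BJ"}, {"city": "SH"}]

# map phase: emit (group-by key, 1) for each row
mapped = [(row["city"], 1) for row in rows]

# shuffle phase: group values by key
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# reduce phase: COUNT(*) per key
result = {key: sum(values) for key, values in shuffled.items()}
```

Every HiveQL construct (joins, filters, aggregations) ultimately lowers to stages of this map/shuffle/reduce form when the execution engine is MapReduce.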
- To read such Hive-table data from HDFS at scale with TensorFlow, the TFRecords format is generally used. TensorFlow provides a solution, spark-tensorflow-connector, which supports saving Spark DataFrame data directly as TFRecords. Next, I will walk through the principle and composition of the TFRecord format and how to generate TFRecords files.
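As a preview of TFRecord's composition: a TFRecord file is a sequence of framed records, each stored as a little-endian uint64 length, a uint32 checksum of the length, the payload bytes, and a uint32 checksum of the payload. The sketch below shows only this framing; real TFRecords use a *masked CRC32C* for the two checksum fields, and this toy writer stores 0 there, so TensorFlow readers would reject its output. It exists purely to make the layout concrete.

```python
import io
import struct

def write_record(stream, payload: bytes):
    """Write one TFRecord-style frame (with placeholder checksums)."""
    stream.write(struct.pack("<Q", len(payload)))  # uint64 length, little-endian
    stream.write(struct.pack("<I", 0))             # placeholder for masked CRC32C of length
    stream.write(payload)                          # the serialized data (usually a tf.train.Example)
    stream.write(struct.pack("<I", 0))             # placeholder for masked CRC32C of payload

def read_records(stream):
    """Yield payloads back out, skipping the checksum fields."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)                 # skip length checksum
        payload = stream.read(length)
        stream.read(4)                 # skip payload checksum
        yield payload

buf = io.BytesIO()
for example in (b"example-1", b"example-2"):
    write_record(buf, example)
buf.seek(0)
records = list(read_records(buf))
```

In practice the payload of each record is a serialized `tf.train.Example` protobuf; spark-tensorflow-connector handles that serialization when writing a DataFrame.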
4、HBase
- HBase is a distributed, scalable, column-oriented NoSQL database built on top of Hadoop. It provides real-time read and write access to large-scale data sets, and is characterized by high reliability and high performance. HBase is suitable for applications that require random, fast access to large-scale data.
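HBase's data model can be sketched in a few lines: each cell is addressed by a row key plus a `family:qualifier` column name, and row keys are kept sorted, which is what makes single-row gets and row-range scans fast. The class below is a toy model of that layout (not the HBase client API), with illustrative row keys.

```python
from bisect import bisect_left, insort

class TinyHTable:
    """Toy model of an HBase table: sorted row keys -> {"cf:qual": value}."""

    def __init__(self):
        self._rows = {}   # row_key -> {"family:qualifier": value}
        self._keys = []   # row keys kept in sorted order

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            self._rows[row_key] = {}
            insort(self._keys, row_key)
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access by row key -- HBase's primary access pattern."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Rows with start <= row_key < stop, in key order (a range scan)."""
        lo, hi = bisect_left(self._keys, start), bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

table = TinyHTable()
table.put("user#001", "info:name", "alice")
table.put("user#003", "info:name", "carol")
table.put("user#002", "info:name", "bob")
```

Because keys are sorted, prefix-style row-key design (e.g. `user#<id>`) turns "all rows for this entity" into a cheap contiguous scan, which is a common HBase schema-design technique.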
5、Spark
- Spark is a fast, general-purpose big data processing engine for distributed data processing and analysis. Compared with Hadoop's MapReduce, Spark offers higher performance and richer functionality. Spark supports multiple programming languages (Scala, Java, and Python via PySpark) and provides a rich set of APIs, including libraries for data processing, machine learning, and graph computing.
- As far as I know, most companies use pyspark for distributed data preprocessing and model inference, e.g. distributed batch prediction (TensorFlow and PyTorch natively support distributed training, but not distributed prediction out of the box).
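The usual pattern for that kind of batch inference is partition-wise prediction: in PySpark this is `rdd.mapPartitions(...)` (or a pandas UDF), where each executor loads the model once per partition and then predicts row by row. The sketch below shows the control flow with plain Python lists standing in for Spark partitions and a stub function standing in for a real TF/Torch model, so it runs without a cluster.

```python
def load_model():
    # Stand-in for loading TensorFlow/PyTorch weights; in the real pattern
    # this (expensive) call happens once per partition, not once per row.
    return lambda features: features * 2.0

def predict_partition(rows):
    """What you would pass to rdd.mapPartitions: iterator in, iterator out."""
    model = load_model()          # loaded once for the whole partition
    for features in rows:
        yield model(features)

# What Spark would hand to two executors, simulated locally:
partitions = [[1.0, 2.0], [3.0]]
predictions = [y for part in partitions for y in predict_partition(part)]
```

Loading the model inside the partition function (rather than on the driver) matters: it avoids shipping large weight objects through serialization and amortizes the load cost over every row in the partition.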