Li Auto's Technical Architecture in the Hadoop Era
First, let me briefly review the development of big data technology. Based on my personal understanding, it can be divided into four periods:
The first period: 2006 to 2008. Around 2008, Hadoop became a top-level Apache project. Its foundation was largely defined by Google's "troika" of papers: GFS, MapReduce, and BigTable.
The second period: 2009 to 2013. Enterprises such as Yahoo, Alibaba, and Facebook made increasing use of big data, and at the end of 2013 Hadoop officially released version 2.0. I was fortunate to start working with big data in 2012, on Hadoop 1.0 plus Hive. It felt amazing at the time: with just a few machines, big data could quickly solve problems that SQL Server or MySQL could not.
The third period: 2014 to 2019. Development was very fast during this stage, and Spark and Flink both became top-level Apache projects. During this rapid climb we also tried Storm, which was later replaced by Flink.
The fourth period: 2020 to the present. After Hudi graduated to become a top-level Apache project in 2020, my personal understanding is that the data lake entered its maturity stage, what I would call the Data Lake 2.0 stage of big data. The data lake has three main characteristics: unified and open storage, open formats, and rich computing engines.
Throughout this development, big data has shown the characteristics captured by the four "Vs" everyone talks about: Volume, Velocity, Variety, and Value. There is now a fifth "V", Veracity, the accuracy and trustworthiness of the data. Data quality has always been criticized, and I hope the industry will converge on a set of standards that improve the quality of the data lake. This may be what defines Data Lake 2.0, because projects such as Hudi and Iceberg are all intended to improve the quality of the entire data lake and to manage it well.
Personally, I think Hadoop has been a synonym for big data, but big data is not only Hadoop. Big data is a set of solutions for processing and using massive amounts of data, formed by integrating many components over the course of its development. In the past few years, the general view has been that Hadoop is going downhill: the merger of the Hadoop commercialization companies Cloudera and Hortonworks and their subsequent delisting showed that the original business model could not continue, and the Hadoop ecosystem itself faces usability challenges and ever-growing complexity.
Current Architecture of Li Auto's Big Data Platform
The current big data platform of Li Auto is shown in the figure above. Li Auto uses many open source components.
- Transport layer: Kafka and Pulsar. We used Kafka exclusively in the initial stage of platform construction, but Kafka's cloud-native capabilities are relatively weak. Pulsar was designed for a cloud-native architecture from the start and has capabilities well suited to IoT scenarios, which matches our business, so we recently introduced Pulsar.
- The storage layer is HDFS + JuiceFS.
- Computing layer: the main computing engines are currently Spark and Flink, running on YARN. The engines are managed through Apache Linkis, open sourced by WeBank, and we currently use Linkis heavily.
- On the right are three databases. MatrixDB is a commercial time-series database. TiDB is aimed at mixed OLTP/OLAP scenarios, though we currently use it mainly for TP. StarRocks handles the OLAP scenarios.
- With ShardingSphere, we want to use its Database Plus concept to unify the underlying databases under gateway-level management. This is still in the exploratory stage, and it has many new features we are very interested in.
- Further to the right, Thanos is a cloud-native monitoring solution. We have integrated the monitoring of components, engines, and machines into Thanos.
- The application layer consists of our four main data platform products: data application, data development, data integration, and data governance.
Features
From the current state of the big data platform, you can see a few characteristics:
- First, the solution involves many components, users depend heavily on these components, and the components depend heavily on one another. My recommendation is to prefer mature, cloud-native components when making selections in the future.
- Second, our data has clear peaks and valleys. Driving scenarios concentrate in the morning and evening rush hours, and there are more users on Saturdays and Sundays.
- Third, our data is hottest when it is fresh: we generally only access data from the last few days or the last week. However, large volumes of data are generated and occasionally need to be backtracked over, so the data must also be stored for a long time, which leaves overall data utilization quite low.
- Finally, the whole data system currently lacks effective file-level management. Since the platform was built it has been based on HDFS, and a large amount of useless data has accumulated, wasting resources. This is a problem we urgently need to solve.
Pain Points of Big Data Platforms
- First, there are many components, which makes deployment difficult and inefficient. There are more than 30 big data components around Hadoop, more than 10 of which are commonly used, and with strong and weak dependencies among them, unified configuration and management becomes very complicated.
- Second, machine costs and maintenance costs are relatively high. For stable business operation, offline and real-time clusters are deployed separately, but because of the business characteristics mentioned above, our peaks and valleys are obvious and overall utilization is not high. The large number of cluster components also requires dedicated personnel to manage and maintain.
- Third, limited cross-platform data sharing. Currently, data shared across clusters can only be synchronized to other Hadoop clusters through DistCp; it cannot be easily and quickly synchronized to other platforms and servers.
- Fourth, data security and privacy compliance. Ordinary users are managed through Ranger, but special security requirements can only be met by building separate clusters with separate VPC policies, which creates many data silos and high maintenance costs.
Li Auto's Evolution and Thinking on Cloud Native
First of all, let me briefly share my personal understanding of cloud native:
First, cloud native derives from cloud computing. Everyone now uses cloud vendors such as Alibaba Cloud, AWS, Tencent Cloud, and Baidu Cloud. They initially provided IaaS-layer services, packaging basic resources such as storage, compute, and networking for unified management, so an enterprise only needs to apply for servers on the cloud. Those servers are still operated by the cloud vendors; this is the traditional way of using the cloud.
Cloud native is inseparable from cloud computing. Broadly speaking, cloud native belongs to cloud computing's PaaS layer and is mainly aimed at developers. Cloud-native applications must run on the cloud; it is a cloud-based way of developing and running software. "Cloud" is cloud computing, and "native" means abandoning the traditional development and operations framework and using containerization, DevOps, and microservice architecture to achieve elastic scaling and automatic deployment, making full use of cloud resources to do the most with the least. It can also solve some pain points of our current big data system, such as poor scalability and maintainability that consume a lot of manpower and time.
The figure above briefly lists several milestones in the development of cloud native:
- In the first stage, AWS proposed the concept of cloud native and launched EC2 in 2006. This is the server stage, the cloud computing stage mentioned above.
- The second stage is the cloudization stage, which mainly followed the open sourcing of Docker and Google's open sourcing of Kubernetes. Kubernetes is a lightweight, extensible open source platform for managing containerized applications and services; it enables automatic deployment and scaling of applications.
- In the third stage, the CNCF was founded in 2015 to promote cloud-native concepts and support the development of cloud native as a whole. Then came the open sourcing of Knative, one of whose important goals is to define cloud-native, cross-platform serverless orchestration standards. This has evolved into what is now the cloud native 2.0 stage, that is, the serverless stage. My personal view is that big data should also develop in the direction of serverless; for example, AWS's entire online service line is basically serverless.
Big Data Cloud Native Architecture
Next, let me introduce how the components of Li Auto's big data platform change after going cloud native:
- At the storage layer, storage after going cloud native is basically all object storage. The architecture diagram above also shows Lustre, which is described in detail below. You can think of the "cloud storage" layer as mainly using JuiceFS to manage object storage and the Lustre parallel file system. (Note: because of Lustre's single-copy problem, we are currently considering the parallel file system products provided by cloud vendors.)
- The container layer, sitting on top of compute, storage, and networking, is replaced entirely by Kubernetes plus Docker, and all components run on top of it.
- For the components, the first part is the big data computing frameworks. We may abandon Hive, use Spark and Flink directly, use Hudi as the underlying capability supporting Data Lake 2.0, and gradually replace HDFS.
- In the middleware part, besides Pulsar there is still Kafka. Kafka's cloud-native support is currently not particularly good, and I personally lean toward replacing Kafka with Pulsar. Linkis is already used online to adapt all Spark engines, and Flink will be adapted and integrated later. ShardingSphere just added cloud-native support in version 5.1.2, and we will carry out scenario verification and capability exploration as planned.
- The database layer is still TiDB, StarRocks, and MatrixDB. All three now have cloud-native capabilities and support object storage, but we have not tested this separately and are still running them on physical machines. For databases, the IO capability of today's object storage cannot meet database performance requirements and would greatly reduce overall performance.
- For operations, we add Loki to the Thanos stack, mainly for cloud-native log collection. But Loki and Thanos are just two pieces of the picture; going forward, my understanding is that we will align with Alibaba's open source SREWorks and fold quality, cost, efficiency, and security into a comprehensive operations capability, so that the whole cloud-native stack can be managed properly.
- Observability is a recently popular concept in the cloud-native field. Some components being built today only started moving toward cloud native after they became popular; they were not born on the cloud, they just hope to grow onto it later. Such components run into problems, the first being the lack of comprehensive, visible monitoring. We are considering how to develop an overall plan for these components so that they can be effectively monitored once they are cloud native.
To sum up, I personally think the cloud-native future of big data basically means:
- Unified use of cloud-native storage as the underlying storage for all components (including databases)
- All components run in containers
- Use Serverless Architecture to Serve Upper-Layer Applications
But this also brings a challenge to the current data platform products: how to design products with serverless capabilities for users.
Advantages of Big Data Cloud Native
The first point is storage-compute separation and elastic scaling. When Hadoop is deployed on physical machines, expanding or shrinking capacity means going through the vendor, and the cycle can be long; storage-compute separation solves this well. The second is pay-as-you-go, with no need to purchase idle resources. Our business data has clear peaks and valleys: ideally we would add machines for the peaks and return them during the valleys, but that is not possible today. We basically provision all machines for the peak, which keeps the peak stable and failure-free, but leaves the machines idle for at least 12 hours a day during the valley while those resources are still being paid for. With cloud native and pay-as-you-go, we no longer have to pay for that idle time.
The second point is automated deployment and operability. Kubernetes supports DevOps-style integrated deployment, so our components can be deployed quickly (for example, through a Helm chart), and component operation and maintenance is pushed down to the cloud-native platform, so the big data team does not have to worry about component operations.
The third point is object storage. Object storage is the core, most important product of cloud computing. Its benefits are self-evident: easy expansion, essentially unlimited storage space, and a relatively low unit price. It is also offered in tiers such as infrequent-access and archive storage, which further reduce costs and let data be kept longer. Controllable cost, high reliability, and low operational complexity are also advantages of object storage.
The fourth point is security and compliance. With cloud native we can achieve dedicated namespaces, multi-tenant isolation, and remote authentication. At present what we have is basically isolation at the network level. The widely recognized solution for HDFS file management is Ranger: it can manage HDFS directory permissions and also permissions for Hive Server, HBase, Kafka, and so on, but these controls are relatively weak.
Another solution is Kerberos, which greatly improves the security of the whole big data stack but comes at a significant cost: every request must be authenticated. We have not adopted it so far, which has to do with our cluster environment and scenarios: we are basically on an intranet and do not serve the public network. If your big data platform needs to provide services to the external network, you do need strong authentication, otherwise data can easily be leaked.
Difficulties of Cloud Native Big Data
Big data cloud native also has its difficulties.
First, there are many big data components, and Kubernetes itself is updated relatively quickly, so when the two are combined there are problems of compatibility, complexity, and scalability.
Second, the allocation and re-allocation of resources. Kubernetes is a general-purpose container scheduler and struggles to meet the resource usage patterns of different big data components. In big data scenarios, resource usage is large, request frequency is high, and the number of pods launched each time is large, and there is currently no good solution for this. We are looking at Fluid, which also implements a runtime for JuiceFS; that is something we will investigate later. Fluid claims to support not only AI but also big data scenarios, which makes sense because the two are similar: both are data-intensive workloads. Fluid has made some breakthroughs in computing efficiency and data abstraction management.
Third, object storage has its own disadvantages: low metadata performance, poor compatibility with big data components, and eventual consistency.
Finally, data-intensive applications. The storage-compute separation model cannot fully meet the needs of data-intensive applications such as big data and AI in terms of compute efficiency and data abstraction management.
The exploration and implementation of JuiceFS in big data cloud native solutions
We had been following JuiceFS and ran some tests before it was open sourced, and we adopted it as soon as the open source version was released. During rollout we ran into some permission issues and a few small bugs, but the community was very supportive and helped us solve them quickly.
We are taking HDFS offline because of its poor scalability; at the same time, our data volume is large and HDFS storage costs are high. After storing several batches of data we ran out of space on the physical machines, while a lot of computation still had to run. At that time our business was still in its infancy, and to extract as much value as possible from the data we wanted to keep as much of it as possible. Moreover, HDFS requires three replicas; we later reduced it to two, but two replicas are still risky.
On this basis, we tested JuiceFS in depth. Once testing was complete, we quickly introduced JuiceFS into our online environment and migrated some of the larger tables from HDFS to JuiceFS, which relieved the immediate pressure.
There are three things we value about JuiceFS:
- First, JuiceFS is multi-protocol compatible. It is fully compatible with the POSIX, HDFS, and S3 protocols, and in our current use it has been 100% compatible without any problems.
- Second, cross-cloud capability. Once an enterprise reaches a certain scale it will not rely on a single cloud provider, in order to avoid systemic risk; it will not be tied to one cloud and will operate multi-cloud. In this case, JuiceFS's ability to synchronize data across clouds comes into play.
- Third, cloud-native scenarios. JuiceFS supports CSI. We have not used CSI in this scenario yet; we basically mount via POSIX, but CSI would be simpler and better integrated. We are moving toward cloud native, but the components as a whole have not really landed on Kubernetes yet.
Application of JuiceFS in Li Auto
Persist data from HDFS to object storage
After JuiceFS was open sourced, we began synchronizing data from HDFS to JuiceFS. We started with DistCp, which works very conveniently with the JuiceFS Hadoop SDK, so the overall migration went smoothly. The reason for migrating data from HDFS to JuiceFS comes down to a few problems (a sketch of the migration command follows the list below).
The first is that HDFS's storage-compute coupling gives it poor scalability, and there is no way around that. My understanding from the very beginning was that big data had to be deployed on physical machines rather than cloud hosts. The various EMR systems that cloud vendors launched later were really just packaging Hadoop, and in the past year or two those EMR systems have been gradually de-Hadoopized.
The second is that HDFS is hard to adapt to cloud native because it is relatively heavy. Although the community keeps working on cloud native, I personally think Hadoop's trend is downhill, and the future should be built on object storage.
Third, object storage has its own shortcomings: it cannot adapt well to the HDFS API, its performance is far from that of local disks due to the network and other factors, and metadata operations such as listing directories are very slow. We use JuiceFS for acceleration, and the measured performance is very impressive, basically comparable to local disk when the cache is warm. Based on this, we quickly switched the current scenario directly to JuiceFS.
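For reference, here is a minimal sketch of the kind of migration command mentioned above, wrapped in Python only to keep all examples in one language. It assumes a JuiceFS volume named `myjfs`, that the JuiceFS Hadoop SDK is on the Hadoop classpath, and illustrative paths; it is not our exact production command.

```python
import subprocess

# Copy a warehouse directory from HDFS to JuiceFS. "jfs://" is the scheme
# used by the JuiceFS Hadoop SDK; the nameservice, volume name, and paths
# below are illustrative placeholders.
subprocess.run(
    [
        "hadoop", "distcp",
        "-update",  # only copy files that are missing or changed
        "hdfs://nameservice1/warehouse/ods/events",
        "jfs://myjfs/warehouse/ods/events",
    ],
    check=True,
)
```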
Platform-level file sharing
The second scenario is platform-level file sharing. The shared files of our scheduling system, real-time system, and development platform are all currently stored on HDFS, and if we stop using HDFS we need to migrate that data. The current solution is to use JuiceFS on top of object storage: through an application-layer service, everything is mounted in POSIX mode, and users can access files in JuiceFS transparently.
JuiceFS meets most of our needs in this scenario, but a few small cases are still problematic. The original idea was to put things like the whole Python environment into it, but in practice that turned out to be very difficult: a Python environment contains a huge number of small files, and loading them is still problematic. Scenarios like this, with large numbers of small fragmented files, still need local disk; in the future we plan to attach a dedicated piece of block storage for this.
Let me share a few problems we encountered with HDFS before:
First, when the NameNode is under heavy pressure or in a Full GC, package downloads fail, and there is currently no perfect solution. Our approach is to give it as much memory as possible and to add retries when downloading packages so as to avoid its peak periods, but it is hard to completely solve this with HDFS: it is written in Java, and GC pauses cannot be avoided.
Second, using HDFS across systems is hard. For example, with two clusters, sharing files from a single cluster is basically unrealistic, because connecting the two clusters requires opening up the network or bridging at the application level, and security cannot be guaranteed. So at present our two clusters each maintain their own shared files independently. The real-time platform (the Flink platform) has now switched to JuiceFS, and it has been very smooth, with no problems encountered.
Third, we currently have a large number of physical-machine deployments, all single-cluster, with no disaster recovery strategy; if a catastrophic problem hits the data center, our entire service becomes unavailable. Object storage, on the other hand, spans data centers within a region and keeps at least three copies, with the cloud vendor handling the backups. In the future we may go multi-cloud, and we hope to use JuiceFS to share high-value files, core databases, and core backup files, keeping backups across multiple clouds. That gives us multi-cloud, multi-region redundancy and solves the single-point disaster recovery problem.
Cross-platform use of massive data
In another scenario, all platforms share massive data through JuiceFS. The first type of shared data is road test data: large volumes of video, audio, and image data are uploaded from road tests, and after upload they go directly into JuiceFS, which makes downstream synchronization and sharing convenient. After some data filtering, the selected data lands in PFS, a parallel file system backed by SSDs. This keeps GPU utilization high; object storage throughput is relatively weak, and the GPUs would otherwise be largely wasted.
Other data types include logs reported by vehicles for analysis, event tracking data, and vehicle signal data required by some national platforms. This data goes into the data warehouse for analysis. We also extract feature data from it for the algorithm team's model training, and for NLP retrieval and other scenarios.
Cloud Native Storage Acceleration - Lustre as a Read Cache (Under Test)
We are now testing another scenario: placing Lustre in front of the object storage layer to serve as a read cache for JuiceFS, using Lustre's cache to improve JuiceFS's read speed and cache hit rate.
One starting point is that we currently run on physical machines, which have local disks that can cache data. But because compute tasks run across many nodes, the cache hit rate is not high: the community edition of JuiceFS does not yet support P2P distributed caching, only single-node local caching, and each node may read a lot of data. This also puts disk pressure on the compute nodes, because the cache takes up a certain amount of disk space.
Our current solution is to use Lustre as JuiceFS's read cache. Specifically, based on the volume of data to be cached, we mount a Lustre file system with a capacity of roughly 20-30 TB on the compute nodes and point JuiceFS's cache directory at that Lustre mount point. After JuiceFS reads data, it asynchronously caches it into Lustre. This effectively solves the low cache hit rate problem and greatly improves read performance.
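A minimal sketch of what this looks like on a compute node, assuming a Redis metadata engine and that the Lustre file system is already mounted at /mnt/lustre; the addresses, mount points, and cache size are illustrative, not our exact setup.

```python
import subprocess

# Mount JuiceFS and point its local cache directory at the Lustre mount,
# so data read from object storage is cached on Lustre instead of the
# node's local disk. All values below are illustrative placeholders.
subprocess.run(
    [
        "juicefs", "mount",
        "redis://meta-redis:6379/1",           # metadata engine URL
        "/jfs",                                # JuiceFS mount point
        "--cache-dir", "/mnt/lustre/jfs-cache",
        "--cache-size", "20971520",            # cache size in MiB (~20 TiB)
        "-d",                                  # run in the background
    ],
    check=True,
)
```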
If Spark writes data directly to object storage, it hits bandwidth and QPS limits, and if writes are too slow, upstream tasks may jitter. In that case JuiceFS's write cache can be used: data is written to Lustre first and then asynchronously written to object storage, which works in some scenarios. The problem is that Lustre is not a cloud-native solution; users perceive it and must explicitly run a mount command when starting a pod. We therefore hope to modify JuiceFS in the future so that it automatically recognizes the object storage and Lustre and handles the caching mechanism itself, without users needing to know Lustre exists.
The PoC for this solution has been completed and has passed basic testing. Next we will run extensive stress tests in the production environment; it is expected to go live in Q3 of this year and cover some edge services.
The overall solution of JuiceFS in big data cloud native
As can be seen from the overall architecture diagram, we currently use all three access methods provided by the JuiceFS client.
As shown in the left half of the figure above, we have independent Spark and Flink clusters, and we mount JuiceFS across the whole cluster through the CSI Driver, so that when users launch Spark or Flink they are completely unaware of JuiceFS, and all reads and writes of compute tasks go through object storage.
One open question here is shuffle. Spark tasks write large amounts of data to disk during the shuffle phase, and the many file reads and writes generated at that point place high performance demands on the underlying storage. Flink is better off because it is streaming and does not rely on disk as heavily. In the future we hope JuiceFS can write directly to Lustre, which would require some modification inside JuiceFS: with client-side integration, JuiceFS could read and write Lustre directly, invisibly to users, and improve shuffle-stage read and write performance.
The applications in the right half of the figure cover two scenarios. One is simply querying JuiceFS data, such as data preview through Hive JDBC; in this scenario, JuiceFS can be accessed through the S3 gateway.
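As an illustration of this access path, here is a hedged sketch of reading a file through the JuiceFS S3 gateway with boto3; the gateway endpoint, credentials, bucket (file system) name, and object key are all illustrative assumptions.

```python
import boto3

# The JuiceFS S3 gateway exposes the file system as an S3-compatible bucket,
# so standard S3 clients can read files without mounting anything.
# Endpoint, credentials, bucket, and key below are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="http://jfs-gateway.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

obj = s3.get_object(Bucket="myjfs", Key="warehouse/ods/events/part-00000.parquet")
print(obj["ContentLength"])  # preview-style access: inspect size, then stream the body
```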
The second is linking the big data platform with the AI platform. Colleagues on the AI platform routinely need sample data, feature data, and so on, which are usually produced by Spark or Flink jobs on the big data platform and already stored in JuiceFS. To share data across platforms, when an AI-platform pod starts, JuiceFS is mounted into the pod directly through FUSE, so AI colleagues can access the data in JuiceFS directly from Jupyter for model training, instead of repeatedly copying data between platforms as in the traditional architecture. This improves the efficiency of cross-team collaboration.
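A hedged sketch of what this looks like from the AI side, assuming the FUSE mount point is /jfs and an illustrative feature path produced by an upstream Spark job:

```python
import pandas as pd

# Inside an AI-platform pod, JuiceFS is mounted via FUSE (here assumed at /jfs),
# so feature data written by Spark/Flink jobs can be read like local files.
# The mount point and dataset path are illustrative.
features = pd.read_parquet("/jfs/warehouse/features/driver_behavior/dt=2022-08-01")
print(features.shape)
```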
Because JuiceFS controls permissions using standard POSIX users and groups, and containers start as the root user by default, permissions are hard to control. We therefore modified JuiceFS to mount the file system with an authentication token, which carries the metadata engine's connection information and other access-control information. In scenarios where several JuiceFS file systems need to be accessed at the same time, we use the JuiceFS S3 gateway combined with IAM policies for unified permission management.
Some difficulties encountered in using JuiceFS at present
The first point is that permission management based on users and groups is relatively simple. In some scenarios, the container starts as root by default, and permissions are hard to control.
The second point concerns configuration tuning of the JuiceFS Hadoop SDK. We mainly tune three settings: juicefs.prefetch, juicefs.max-uploads, and juicefs.memory-size. We ran into some problems while tuning juicefs.memory-size. Its default value is 300 MB, and the official suggestion is to set off-heap memory to four times that, i.e. 1.2 GB. Most of our tasks are currently configured with 2 GB of off-heap memory, yet some tasks still occasionally fail to write even with more than 2 GB configured (the same writes are stable on HDFS). This is not necessarily a JuiceFS problem; it may also be caused by Spark or the object storage. We are therefore planning to adapt Spark and JuiceFS more deeply, track down the cause step by step, and work through these pitfalls so we can reduce memory while keeping tasks stable.
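For reference, a minimal sketch of how these settings can be passed to a Spark job through the standard spark.hadoop.* prefix; the values and the data path are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

# JuiceFS Hadoop SDK settings are ordinary Hadoop configuration keys, so in
# Spark they can be passed with the spark.hadoop. prefix. Values are examples.
spark = (
    SparkSession.builder
    .appName("juicefs-tuning-example")
    .config("spark.hadoop.juicefs.prefetch", "4")        # prefetch concurrency
    .config("spark.hadoop.juicefs.max-uploads", "40")    # concurrent uploads
    .config("spark.hadoop.juicefs.memory-size", "1024")  # read/write buffer, MiB
    .getOrCreate()
)

# Illustrative path on a JuiceFS volume.
df = spark.read.parquet("jfs://myjfs/warehouse/ods/events")
df.show(10)
```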
The third point is that as the overall architecture (JuiceFS + object storage + Lustre) becomes more complex, there are more potential points of failure and task stability may drop, so additional fault-tolerance mechanisms are needed. For example, during the shuffle write phase of a Spark task, errors like "lost task" may be reported, and we have not yet located the exact cause.
As mentioned above, the JuiceFS + object storage + Lustre combination improves read and write performance to a certain extent, but it also makes the architecture more complex and adds possible points of failure. For example, Lustre does not have strong replica-based disaster recovery: if a Lustre node suddenly goes down, can running tasks keep reading and writing the data in Lustre stably, or could the data in Lustre be lost unexpectedly, and can JuiceFS then fall back and re-fetch it from object storage? That is still uncertain, and we are currently running this kind of failure testing.
Future and Outlook
Real-time data lake solution based on Flink + Hudi + JuiceFS
One of the things we will do in the near future is a real-time data lake solution based on Flink + Hudi + JuiceFS. On the left of the figure above are the data sources; through Flink and Kafka/Pulsar, the data is written into Hudi in real time, and the Hudi data lands in JuiceFS, replacing our current real-time data warehouse.
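To illustrate the target layout, here is a hedged sketch that writes a Hudi table whose base path sits on JuiceFS. It uses Spark's Hudi datasource purely because it is the shortest way to show the idea in one language; the production pipeline described above writes through Flink, and the table, field names, and paths are illustrative.

```python
from pyspark.sql import SparkSession

# Write a batch of records into a Hudi table whose base path is on JuiceFS
# (jfs:// via the JuiceFS Hadoop SDK). Table name, key fields, and paths
# are illustrative placeholders.
spark = SparkSession.builder.appName("hudi-on-juicefs-example").getOrCreate()

df = spark.read.json("jfs://myjfs/raw/vehicle_signals/2022-08-01")

(
    df.write.format("hudi")
    .option("hoodie.table.name", "vehicle_signals")
    .option("hoodie.datasource.write.recordkey.field", "signal_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .mode("append")
    .save("jfs://myjfs/lakehouse/vehicle_signals")
)
```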
Long-term planning of big data cloud native
Finally, I would like to introduce Li Auto's long-term plan for big data cloud native, which is also an outlook.
The first point is a unified data management and governance system. We believe the biggest problem the Data Lake 2.0 era has to solve is the data swamp problem of the Data Lake 1.0 era. But so far there does not seem to be a good open source product for unified metadata management, data catalog management, and data security control, comparable to AWS Glue or AWS Lake Formation. We are currently working on an internal "origin system" project; its first step is to bring all the metadata from the databases and object storage mentioned above under unified catalog management, unified security control, and unified data management. We are feeling our way forward.
The second point is faster, more stable, and lower-cost underlying storage. The biggest difficulty in all of our scenarios today is object storage. Its advantages are stability and low cost, and it keeps iterating; for now, my view is that for big data cloud native to develop, object storage must provide better performance while maintaining stability.
At the same time, S3 may claim to support strong consistency, but my current understanding is that an architecture based on object storage finds strong consistency difficult, or must sacrifice something to achieve it; it is a matter of trade-offs. JuiceFS natively supports strong consistency, which is very friendly to big data platforms.
The third point is a smarter, more efficient, and easier-to-use query engine. Extending the lakehouse thinking mentioned above: the lakehouse is still in the early stages of development and may take another 5 to 10 years to mature. Databricks and Microsoft are trying to build vectorized MPP engines on top of the data lake in the hope of pushing the lakehouse architecture forward. This may be a future direction, but in the short term it seems no single engine can satisfy all scenarios.
Our current architecture essentially keeps every kind of query engine on hand: Spark, Flink, relational databases (for OLTP scenarios), time-series databases, and OLAP databases. The principle is to use whichever is best for the job, with unified middleware in the layer above managing them. Take Snowflake as another example: although it already supports querying structured and semi-structured data, it is still unclear how it will support the unstructured data involved in AI (such as images, audio, and video) in the future. Still, I think that is clearly a future direction. Li Auto has similar AI scenarios, so we will explore and build together with the various business teams.
Finally, the ultimate goal of big data development is to deliver data analysis at the lowest cost and the highest performance, and thereby realize real business value.
If you find this helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)