Data exploration tool: a look inside Volcano Engine DataLeap Notebook


Background

Problems solved by Notebook

  1. Some task types (Python, Spark, etc.) require step-by-step debugging during the creation and configuration phase;
  2. Because the platform's exploration and query capabilities were weak, some users could only develop and debug through other platforms or by other means. When those jobs were later deployed to Dorado, they ran into problems such as inconsistent behavior (runtime environment differences), and the overall experience was poor, so the exploration and query capabilities needed to be strengthened;
  3. Exploration and query currently only supports SQL; supporting more language types would broaden the available ways of developing data.

Overall architecture introduction

Volcano Engine DataLeap Notebook is mainly built on open source projects such as JupyterHub, Notebook, JupyterLab, and Enterprise Kernel Gateway, which we have deeply modified and customized to meet the needs of Volcano Engine DataLeap users.

In terms of basic components, it mainly relies on TCE, YARN, MySQL, TLB, and TOS.

The core goal is to provide a notebook service that supports large numbers of users, is stable, and is easy to scale.

The overall system architecture is shown in the figure below. It mainly includes the Hub, the notebook server (nbsvr), the kernel gateway (EG), and other components.

[Figure: overall system architecture]

Multi-user management

Hub

JupyterHub is a server that supports multi-user notebooks. It does so by managing and proxying multiple single-user notebook servers.

The JupyterHub service consists of the following main components:

  • a Hub (a Tornado process), which is the heart of JupyterHub;
  • a configurable HTTP proxy (node-http-proxy), which dynamically routes user requests to the Hub or to a notebook server;
  • multiple single-user Jupyter notebook servers (Python/IPython/Tornado) that are monitored by Spawners;
  • an Authenticator class that manages how users can access the system.

The entire system architecture diagram is as follows:

[Figure: JupyterHub architecture]

Users access JupyterHub through an IP address or domain name. The basic flow is as follows (see the configuration sketch after the list):

  • The Hub service starts, and the Hub starts the proxy process;
  • When a user request arrives, it first goes to the proxy. The proxy maintains a routing table in which each record maps a user request to a target IP or domain name. When the routing table does not contain a mapping for the current request, the proxy forwards the request to the Hub by default;
  • The Hub handles authentication and authorization, and the Hub's spawner starts a notebook server;
  • The Hub configures the proxy to route the user's requests to the newly created notebook server.
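As a point of reference, these pieces are wired together through JupyterHub's configuration file. The sketch below is a minimal, hypothetical jupyterhub_config.py: the database URL is a placeholder, and the authenticator/spawner classes shown are stock JupyterHub ones, not the DataLeap implementations described later.

```python
# jupyterhub_config.py -- minimal sketch of how the Hub, proxy, authenticator
# and spawner fit together (values below are placeholders, not DataLeap's).
c = get_config()  # noqa: F821  (injected by JupyterHub when loading config)

# Where the proxy listens for user traffic and where the Hub itself binds.
c.JupyterHub.bind_url = "http://0.0.0.0:8000"
c.JupyterHub.hub_ip = "0.0.0.0"

# Authentication: a custom Authenticator subclass would replace this in production.
c.JupyterHub.authenticator_class = "jupyterhub.auth.DummyAuthenticator"

# Spawner: the class that starts one single-user notebook server per user.
c.JupyterHub.spawner_class = "jupyterhub.spawner.LocalProcessSpawner"

# Persist Hub state (users, tokens, spawner state) to a database such as MySQL.
c.JupyterHub.db_url = "mysql+pymysql://hub:password@mysql-host/jupyterhub"
```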

1. Volcano Engine DataLeap authentication

The Hub natively supports authentication, which is mainly used to handle multi-tenancy. The Hub authenticates primarily through Authenticator classes.

The main authenticators supported natively by the Hub include the following:

  • LocalAuthenticator: works with local Linux/UNIX users
  • PAMAuthenticator: authenticates local UNIX users with PAM
  • DummyAuthenticator: allows any username and password, for testing

Considering that the first option would have required a large amount of development and had high maintenance costs, we adopted the second.

The entire authentication & authorization flow with the second option is as follows (an illustrative sketch follows the list):

  1. The user opens a Volcano Engine DataLeap notebook in the web page, and the frontend sends the session information with a request to the Hub's POST /api/users/{name}/tokens API to obtain a token. This step requires authentication & authorization, including:
  2. authenticating the user corresponding to the session ID through Titan;
  3. verifying that the user has project permissions through the Volcano Engine DataLeap backend ProjectControl /project/canedit API.
  4. Subsequent requests from the user carry the token, and the Hub uses the token to authenticate the user.
  5. Each generated token is saved to the database;
  6. during authentication, the token is matched against the database;
  7. tokens have an expiration time, and expired tokens are cleared from the database.
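For illustration only, a custom Authenticator implementing this flow could look roughly like the sketch below. The titan_verify_session and project_can_edit helpers, their arguments, and the cookie name are hypothetical stand-ins for the internal Titan and ProjectControl calls described above, not real APIs.

```python
# Hypothetical sketch of a session-based authenticator; not the actual
# DataLeap implementation. The two helper coroutines are stubs.
from jupyterhub.auth import Authenticator
from tornado import web


async def titan_verify_session(session_id):
    """Placeholder: resolve a session id to a username via Titan."""
    raise NotImplementedError


async def project_can_edit(username, project_id):
    """Placeholder: call the DataLeap backend ProjectControl /project/canedit API."""
    raise NotImplementedError


class SessionAuthenticator(Authenticator):
    async def authenticate(self, handler, data=None):
        # The frontend carries its session information when it requests a
        # token from POST /hub/api/users/{name}/tokens.
        session_id = handler.get_cookie("sessionid")
        if not session_id:
            return None

        username = await titan_verify_session(session_id)
        if username is None:
            return None  # authentication failed

        # Authorization: reject users without edit permission on the project.
        project_id = handler.get_argument("project_id", default=None)
        if project_id and not await project_can_edit(username, project_id):
            raise web.HTTPError(403, "no project permission")

        return username
```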

2. TCE Spawner

The Spawner is responsible for starting the single-user notebook server; in essence, it is an abstract representation of a process. A customized spawner implements three methods: start(), poll(), and stop().

Currently, our services do not run on physical machines, and we do not manage servers & kernels through Kubernetes. Considering operations, maintenance, and scaling, we chose TCE as the carrier for the notebook server, so we needed to implement a TCE Spawner.

The following points were considered when designing the TCE spawner (see the sketch after this list):

  1. Spawner.state needs to contain information such as the service ID, cluster ID, PSM, and API token. This information is persisted in the database so that after the Hub restarts or the server is shut down, restarting the notebook server still maps the same user to the server they started before (same user, same server);
  2. To speed up startup, the spawner confirms that the TCE instance has started by performing a service-discovery (SD) lookup of the PSM as soon as the TCE cluster deployment is initiated, rather than polling the deployment status until it completes. This shortens startup because the TCE deployment process includes steps such as health checks, which take a long time;
  3. stop() does not actually kill the TCE instance, so the next startup consumes almost no time;
  4. When polling the server status, status changes caused by upgrades & migrations need to be considered. Once such a change is detected, an abnormal status is returned immediately; the Hub then treats the notebook server as not running and marks the spawner abnormal. When a subsequent request arrives, the spawner is restarted, and since this is not a cold start the process is very fast and the user does not notice it.
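As a rough sketch of these design points (not the actual implementation), a TCE spawner might subclass JupyterHub's Spawner as below. The tce_deploy and sd_lookup helpers and the fields kept in state are hypothetical placeholders for internal TCE and service-discovery APIs.

```python
# Hypothetical skeleton of a TCE-backed spawner. tce_deploy / sd_lookup stand
# in for internal TCE and service-discovery APIs; they are not real libraries.
from jupyterhub.spawner import Spawner


class TCESpawner(Spawner):
    def load_state(self, state):
        # Restore the persisted mapping so the same user always gets the
        # same TCE-backed notebook server (same user, same server).
        super().load_state(state)
        self.service_id = state.get("service_id")
        self.cluster_id = state.get("cluster_id")
        self.psm = state.get("psm")
        self.api_token = state.get("api_token")

    def get_state(self):
        state = super().get_state()
        state.update(
            service_id=self.service_id,
            cluster_id=self.cluster_id,
            psm=self.psm,
            api_token=self.api_token,
        )
        return state

    async def start(self):
        # Kick off the TCE deployment, then confirm startup through a
        # service-discovery lookup of the PSM instead of waiting for the
        # full deployment (health checks, etc.) to finish.
        await tce_deploy(self.service_id, self.cluster_id)  # noqa: F821 (stub)
        ip, port = await sd_lookup(self.psm)                 # noqa: F821 (stub)
        return ip, port

    async def poll(self):
        # None means "running"; any integer is treated as an exit status.
        ip, port = await sd_lookup(self.psm)                 # noqa: F821 (stub)
        if (ip, port) != (self.server.ip, self.server.port):
            return 1  # instance was upgraded/migrated -> report abnormal
        return None

    async def stop(self, now=False):
        # Deliberately do not kill the TCE instance, so the next start is fast.
        pass
```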

The TCE spawner mainly relies on the following TCE features:

  • a PSM corresponds to exactly one service;
  • IP & port can be discovered through the PSM;
  • server status can be obtained through TCE's API;
  • convenient operations and maintenance (upgrade & migration).

Off topic: we recently investigated running the server on YARN, which feels a bit like Kubernetes in that it essentially relies on resource scheduling. However, YARN scheduling has a drawback: every application scheduled onto YARN must be accompanied by an Application Master. Although the AM mostly just maintains a heartbeat with the RM and only needs 0.5 cores, it still feels awkward, or at least adds a potential source of instability.

3. State isolation

(1) Hub migration

With native JupyterHub, upgrading the Hub or migrating its instances requires closing all spawners & servers. This means that after the Hub instance changes, the previous servers & kernels are shut down.

Since our system uses remote servers + remote kernels and does not actively shut down kernels, the server & kernel instances are not shut down when the Hub instance changes. However, after the new Hub instance starts, none of the existing servers can connect to it, leaving ghost servers & kernels behind.

We provide the following solutions:

  1. Add a periodic check thread in the notebook server that uses the Hub's PSM to check whether the Hub's IP & port have changed;
  2. If they have changed, switch hub_activity_url & hub_api_url so that the notebook server can connect to the new Hub instance (see the sketch below).
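A minimal sketch of such a check thread is shown below; resolve_psm is a hypothetical service-discovery helper, and the exact way the Hub URLs are swapped on a live server is an assumption for illustration.

```python
# Illustrative only: periodically re-resolve the Hub's PSM and repoint the
# notebook server at the new Hub instance. resolve_psm is a placeholder.
import threading
import time


def watch_hub(server_app, hub_psm, interval_seconds=30):
    current = None

    def loop():
        nonlocal current
        while True:
            ip, port = resolve_psm(hub_psm)  # noqa: F821 (stub)
            if (ip, port) != current:
                current = (ip, port)
                base = f"http://{ip}:{port}/hub/api"
                # Switch the URLs the single-user server uses to talk to the Hub.
                server_app.hub_api_url = base
                server_app.hub_activity_url = base + "/users/me/activity"
            time.sleep(interval_seconds)

    threading.Thread(target=loop, daemon=True).start()
```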

(2) Notebook server migration

If a notebook server instance is upgraded or migrated, the Hub also needs to detect it in time and shut down the spawner correctly.

This is currently implemented through the TCE spawner's poll(). The poll checks whether the IP & port of the corresponding notebook server have changed; if they have, it returns a non-zero status, indicating that the server is abnormal. The Hub detects this and closes the spawner. When the user's next request arrives, the spawner is re-created and connects to the same notebook server.

Resource pool

The pool was designed with two considerations in mind:

  • TCE resources cannot be exclusively occupied;
  • server startup is slow.

Since the notebook server runs on TCE, starting a server on TCE involves the following key stages: new service -> new cluster -> deployment (image build, deployment) -> various checks. The whole process takes a long time, typically 3-5 minutes. If every server startup took this long, it would obviously be unacceptable.

So we created a batch of TCE instances in advance to build a TCE resource pool. Every time a new project is onboarded, the Hub spawner handles it according to the following process (see the sketch after this list):

  • check the TCE resource pool for an unoccupied instance, and pick one if available;
  • otherwise, follow the original creation process.
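A simplified sketch of this allocation step is shown below; pool_acquire and tce_create_service are hypothetical helpers standing in for the internal pool and TCE APIs.

```python
# Illustrative only: take a warm instance from the pool when one is free,
# otherwise fall back to the slow cold-start path on TCE.
async def acquire_tce_instance(project_id):
    instance = await pool_acquire()                 # noqa: F821 (stub)
    if instance is not None:
        return instance                             # warm instance: fast path
    # Cold path: new service -> new cluster -> deployment -> checks (3-5 min).
    return await tce_create_service(project_id)     # noqa: F821 (stub)
```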

At present, the pool is populated manually; automatic detection and expansion will be supported in the future:

  • a timer thread checks whether the current pool capacity has dropped below a threshold (for example, 30);
  • if it has, new instances are created and added to the pool.

Another question is: each instance in the pool needs to support PSM service discovery, so what state are the instances in before a server is allocated? And after an instance is assigned, how is the server started according to the configuration of the corresponding user? Each instance in the pool starts an idle server (a native notebook server), which allows the instance to start successfully and be discovered by the service. Meanwhile, a timer thread continuously checks whether the corresponding configuration file on TOS is ready; once it is, the idle server is shut down and the single-user notebook server is started according to the TOS configuration file (see the sketch below).
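The idle-server handoff could be sketched roughly as below; tos_config_ready, load_tos_config, shutdown_idle_server, and start_single_user_server are hypothetical placeholders for the TOS and server-startup logic on a pooled instance.

```python
# Illustrative sketch: keep an idle native notebook server running so the
# pooled instance is discoverable, and swap it for the real single-user
# server once the user's configuration file appears on TOS.
import threading
import time


def wait_for_assignment(instance_id, poll_seconds=10):
    def loop():
        while not tos_config_ready(instance_id):      # noqa: F821 (stub)
            time.sleep(poll_seconds)
        config = load_tos_config(instance_id)          # noqa: F821 (stub)
        shutdown_idle_server()                         # noqa: F821 (stub)
        start_single_user_server(config)               # noqa: F821 (stub)

    threading.Thread(target=loop, daemon=True).start()
```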

With this approach, the startup time drops from 3+ minutes to about 8 seconds, which is the time for the single-user notebook server to start and stably provide service.

Kernel management

Notebook storage

The code and output in a notebook are stored as a JSON file with the .ipynb suffix, so the notebook server is responsible for creating and deleting ipynb files.

The notebook server stores notebooks through the FileManager, which is mainly responsible for file operations such as creating, saving, deleting, and renaming ipynb files. It also performs format checks on ipynb files to ensure that they are well-formed.

FileManager saves files through the local filesystem. To persist ipynb files, we embedded TOS file storage into the FileManager. The specific process is as follows (see the sketch after this list):

  1. When a notebook is created for the first time, the ipynb is generated locally and a copy is put to TOS;
  2. Each time the notebook is saved, a copy is put to TOS after the local update;
  3. Each time an ipynb is opened, the server first checks whether the file exists locally; if not, it is pulled from TOS, otherwise no pull is performed;
  4. The delete operation only deletes the local file and does not delete the TOS copy.
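One way to picture this mirroring (illustrative only; the real DataLeap change is embedded in its own FileManager, and tos_put/tos_get are placeholder TOS client calls) is a ContentsManager wrapping the classic notebook server's FileContentsManager:

```python
# Hypothetical sketch of mirroring ipynb files to TOS around the classic
# notebook server's FileContentsManager. tos_put / tos_get are placeholders.
from notebook.services.contents.filemanager import FileContentsManager


class TOSBackedContentsManager(FileContentsManager):
    def save(self, model, path=""):
        # Save locally first, then push a copy of the ipynb to TOS.
        result = super().save(model, path)
        if path.endswith(".ipynb"):
            local_path = self._get_os_path(path)
            with open(local_path, "rb") as f:
                tos_put(path, f.read())              # noqa: F821 (stub)
        return result

    def get(self, path, content=True, type=None, format=None):
        # Pull from TOS only if the ipynb is missing locally.
        if path.endswith(".ipynb") and not self.file_exists(path):
            local_path = self._get_os_path(path)
            with open(local_path, "wb") as f:
                f.write(tos_get(path))               # noqa: F821 (stub)
        return super().get(path, content=content, type=type, format=format)

    def delete_file(self, path):
        # Local delete only; the TOS copy is intentionally kept.
        super().delete_file(path)
```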

Kernel management

When a notebook task is opened on the page, the notebook server tries to start a kernel to execute the code the user runs. Each task on Volcano Engine DataLeap corresponds to a kernel, and the notebook server is responsible for maintaining the kernel of each task.

The notebook server maintains kernel information through the KernelManager. The KernelManager is responsible for kernel startup, restart, deletion, and other operations.

By default, the kernel is started in the same container as the notebook server. In this mode, a single server cannot support a large number of kernels.

Proxy

As mentioned in the previous section, the notebook server's local mode does not scale to a large number of kernels and is only suitable for small-scale use, mainly because:

  • kernels are started on the notebook server host, so a single machine cannot accommodate many kernels;
  • there is no isolation between kernels other than process isolation, so resources & execution environments cannot be well isolated or customized.

Enterprise Kernel Gateway (EG for short) is mainly designed to solve the above problems. The system architecture with EG is as follows:

[Figure: system architecture with Enterprise Gateway]

Technically, EG reuses part of the notebook server's functionality and then makes the following changes:

  • reuses the kernel-management APIs of the notebook server;
  • provides WebSocket (WS) management;
  • extends the notebook server's MultiKernelManager & KernelManager & SessionManager into a RemoteMappingKernelManager.

As can be seen from the figure, the client does not have to be a notebook-related system; it can be any other system. This means that EG can be used directly as a code execution server, as long as its WS client follows the Jupyter message protocol.

Proxy architecture

In the Volcano Engine DataLeap notebook system, the client in the figure above is the notebook server. The notebook server is only responsible for managing notebook files (creation, reading, writing, saving, and deletion); all kernel operations are forwarded to EG for processing (note that this forwarding includes both HTTP forwarding and WS forwarding). Details are shown in the figure below:

[Figure: notebook server forwarding kernel operations to EG]
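For reference, the stock Jupyter notebook server can already delegate kernel management in this way through its built-in gateway client; a hypothetical configuration (the URL is a placeholder, and DataLeap's actual forwarding layer is customized beyond this) might look like:

```python
# jupyter_notebook_config.py -- sketch: hand all kernel operations (HTTP and
# WebSocket) over to an Enterprise Gateway instance. URL is a placeholder.
c = get_config()  # noqa: F821  (injected by Jupyter when loading config)

c.GatewayClient.url = "http://eg.example.internal:8888"
c.GatewayClient.request_timeout = 120.0  # remote kernels on YARN start slowly
```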

When a user runs a piece of code in the browser, the entire interaction process is shown in the figure below:

[Figure: interaction flow when a user runs code in the browser]

For the detailed flow of the EG proxy, see the figure below:

[Figure: EG proxy flow]

Currently, EG supports submitting kernels to resource management systems commonly used in the industry, such as YARN and Kubernetes. We currently only support remote kernels on YARN, and will consider supporting Kubernetes in the future.

Remote Kernel

1. Remote kernel on YARN

The open source EG mainly uses yarn_client to submit tasks to YARN, relying on the YARN RM RESTful API for resource probing, task submission, status polling, kill, and other operations. The company does not expose the corresponding REST API, so this had to be reworked on top of YAOP.

2. Kernel configuration

Open source EG does not currently support specifying dynamic YARN parameters, such as queue selection and image selection, when submitting tasks to YARN. We made a simple modification so that users can set richer YARN parameters and customize a personalized execution environment.
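As an illustration, an EG kernelspec is described by a kernel.json; the snippet below writes one from Python. The process-proxy class is the one documented for EG's YARN cluster mode, while the KERNEL_YARN_QUEUE and KERNEL_IMAGE fields are hypothetical examples of the extra per-user parameters described here, and the argv is simplified.

```python
# Write an illustrative kernel.json for a remote kernel on YARN. The two
# KERNEL_* env entries are hypothetical extensions, not stock EG options.
import json

kernelspec = {
    "display_name": "PySpark on YARN (cluster mode)",
    "language": "python",
    "metadata": {
        "process_proxy": {
            "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
        }
    },
    "env": {
        "KERNEL_YARN_QUEUE": "root.dataleap.adhoc",             # hypothetical
        "KERNEL_IMAGE": "hub.example.internal/pyspark:latest",  # hypothetical
    },
    # Simplified for illustration; real YARN-cluster specs launch via EG's scripts.
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
}

with open("kernel.json", "w") as f:
    json.dump(kernelspec, f, indent=2)
```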

3. Async

The open source community version is not fully asynchronous. So that a single EG server can support more kernels, we made it fully asynchronous. Before the optimization it could only support 10+ kernels; after the optimization it supports 100+ kernels (the upper limit has not been specifically tested).

4. Image

Users can choose a custom image to start the kernel. This lets users install whatever environment they need in the kernel, which greatly broadens the kernel's usage scenarios.

Scheduled execution

Scheduling principle

Scheduled execution of a notebook is different from manually debugging each cell: it runs automatically on a schedule, executes all cells in one pass each time, and saves the execution results back into the notebook.

Jupyter provides a tool, nbconvert, that can execute an ipynb file directly. nbconvert starts the kernel specified in the ipynb's kernel information and executes every cell in the notebook; essentially it performs kernel startup plus running each cell.

However, nbconvert can only start a local kernel, while the current system uses remote kernels on YARN. This is handled by submitting nbconvert itself to YARN and running the above process there, which also involves the submission mechanism of PySpark tasks. In general, notebook tasks have the same scheduled-execution capabilities as other tasks on Dorado.
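For reference, executing every cell of a notebook and writing the results back, which is what `jupyter nbconvert --to notebook --execute` does, can also be done programmatically; the file name below is a placeholder, and in our setup this step is what gets submitted to YARN.

```python
# Run all cells of an ipynb with the kernel named in its metadata and save
# the executed notebook back in place. The path is a placeholder.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("scheduled_task.ipynb", as_version=4)

ep = ExecutePreprocessor(timeout=3600)          # execution timeout in seconds
ep.preprocess(nb, {"metadata": {"path": "."}})  # execute cells in order

nbformat.write(nb, "scheduled_task.ipynb")
```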

More features

1. Version control: notebook version control is supported.

2. Workflow debugging: workflows support notebook tasks and can be debugged as a whole.

3. Parameterization: notebooks can be parameterized.

4. Executed notebook view: the results of scheduled executions can be displayed.

Conclusion

Many years have passed since Jupyter Notebook was born, and Zeppelin, PolyNote, and Deepnote have appeared one after another during that time. Even so, Jupyter Notebook still has the largest user base and a relatively complete technical ecosystem, so we chose Jupyter Notebook for in-depth customization and adaptation to serve our users.

Currently, Volcano Engine DataLeap Notebook has the basic capabilities for offline data exploration, and these capabilities have helped many users with data exploration, task development and debugging, visualization, and more. As the platform adds support for streaming data development, we also hope to use Notebook to meet users' needs for streaming data exploration, streaming task debugging, visualization, and other features. We believe that in the near future, Notebook will be able to unify streaming and batch and serve an even broader group of users.

