Meituan's distributed ID generation in practice

In complex distributed systems, large volumes of data and messages often need to be uniquely identified. Across Meituan-Dianping's finance, payment, catering, hotel, Maoyan Movies, and other products, data keeps growing. Once data is sharded across databases and tables, each record or message needs a unique ID, and a database auto-increment ID clearly cannot meet that need. Orders, riders, and coupons in particular also need unique IDs. A system that can generate globally unique IDs therefore becomes necessary. To summarize, what do business systems require of ID numbers?

  1. Globally unique: no duplicate ID may ever appear. Since the ID is a unique identifier, this is the most basic requirement.
  2. Trend-increasing: MySQL's InnoDB engine uses a clustered index, and most RDBMSs store index data in a B-tree, so primary keys should be as ordered as possible to preserve write performance.
  3. Monotonically increasing: the next ID must always be greater than the previous one, for special needs such as transaction version numbers, incremental IM messages, and sorting.
  4. Information security: if IDs are consecutive, malicious scraping becomes very easy, because a crawler can simply download the target URLs in sequence. With order numbers it is even more dangerous: a competitor can directly learn our daily order volume. Some scenarios therefore need IDs that are irregular and follow no visible pattern.

Requirements 1, 2, and 3 above correspond to three different kinds of scenarios. Requirements 3 and 4 are mutually exclusive and cannot be satisfied by the same scheme.

Beyond the requirements on the ID numbers themselves, the business also demands very high availability from the ID generation system. Imagine the ID generation system going down: payments, coupon issuing, rider dispatch, and other key actions across Meituan-Dianping could not be executed, which would be a disaster.

In summary, an ID generation system should achieve the following:

  1. Average latency and TP999 latency as low as possible;
  2. Availability of five nines (99.999%);
  3. High QPS.

UUID

A UUID (Universally Unique Identifier) in its standard form contains 32 hexadecimal digits, split by hyphens into five groups in the pattern 8-4-4-4-12, for 36 characters in total. Example: 550e8400-e29b-41d4-a716-446655440000. The industry has five ways of generating UUIDs so far; for details see the IETF specification "A Universally Unique IDentifier (UUID) URN Namespace" (RFC 4122).
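As a quick illustration of the format described above, the sketch below uses Java's standard `java.util.UUID` to generate a random (version 4) UUID and check its shape. The class and method names (`UuidShape`, `randomUuidText`, `hexDigitCount`) are made up for this example.

```java
import java.util.UUID;

class UuidShape {
    /** Returns the canonical 36-character text form of a fresh random UUID. */
    static String randomUuidText() {
        return UUID.randomUUID().toString();
    }

    /** Counts the hexadecimal digits, ignoring the four hyphens. */
    static int hexDigitCount(String uuidText) {
        return uuidText.replace("-", "").length();
    }

    public static void main(String[] args) {
        String text = randomUuidText();
        System.out.println(text + " length=" + text.length()
                + " hexDigits=" + hexDigitCount(text));
    }
}
```

The 32 hexadecimal digits carry the 128 bits (16 bytes) of the UUID; the 36-character length only applies to the text form.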

Advantages:

  • Very high performance: generated locally, with no network overhead.

Disadvantages:

  • Hard to store: a UUID is long, 16 bytes (128 bits), and is usually represented as a 36-character string, which makes it unsuitable in many scenarios.
  • Information leakage: UUID algorithms based on the MAC address can leak the MAC address. This flaw was once used to locate the creator of the Melissa virus.
  • IDs used as primary keys cause problems in certain environments. As a DB primary key, for example, a UUID is a poor fit:

    ① MySQL officially recommends keeping the primary key as short as possible [4]; a 36-character UUID does not meet this.

    All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index. If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

    ② Bad for MySQL indexes: when used as a primary key in the InnoDB engine, the unordered nature of UUIDs can cause frequent changes in data location, seriously hurting performance.

Snowflake-like schemes

Broadly speaking, this kind of scheme generates IDs by partitioning a namespace (UUID also counts, but it is common enough that it was analyzed separately above). It splits the 64 bits into several segments that separately encode the machine, the time, and so on. For example, the 64 bits in snowflake are laid out as shown in the figure below (image from the web):

image


The 41-bit timestamp can represent (1L << 41) / (1000L * 3600 * 24 * 365) = 69 years, and the 10-bit machine field can represent 1024 machines. If IDCs need to be distinguished, the 10 bits can be split into 5 bits for the IDC and 5 bits for the worker machine, giving 32 IDCs with 32 machines each; define this according to your own needs. The 12-bit auto-increment sequence can represent 2^12 IDs per millisecond, so the theoretical QPS of the snowflake scheme is about 4.096 million/s (409.6w/s). This layout guarantees that the IDs generated by any machine in any IDC in any millisecond are all different.
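The arithmetic above can be checked with a small sketch. It assumes the plain 1+41+10+12 layout with no custom epoch; the class and method names (`SnowflakeLayout`, `compose`, and so on) are illustrative and not taken from any particular implementation.

```java
class SnowflakeLayout {
    static final int TIMESTAMP_BITS = 41;
    static final int WORKER_BITS = 10;
    static final int SEQUENCE_BITS = 12;

    static final long MAX_WORKER = (1L << WORKER_BITS) - 1;     // 1023
    static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1; // 4095

    /** Whole years representable by a 41-bit millisecond timestamp. */
    static long yearsOfTimestamps() {
        return (1L << TIMESTAMP_BITS) / (1000L * 3600 * 24 * 365);
    }

    /** Packs timestamp | worker | sequence into 64 bits; the sign bit stays 0. */
    static long compose(long timestampMs, long workerId, long sequence) {
        if (timestampMs < 0 || timestampMs >= (1L << TIMESTAMP_BITS)
                || workerId < 0 || workerId > MAX_WORKER
                || sequence < 0 || sequence > MAX_SEQUENCE) {
            throw new IllegalArgumentException("field out of range");
        }
        return (timestampMs << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    static long timestampOf(long id) { return id >>> (WORKER_BITS + SEQUENCE_BITS); }
    static long workerOf(long id)    { return (id >>> SEQUENCE_BITS) & MAX_WORKER; }
    static long sequenceOf(long id)  { return id & MAX_SEQUENCE; }
}
```

Because the timestamp occupies the highest bits, any ID from a later millisecond compares greater than every ID from an earlier one, regardless of worker or sequence. That is where the trend-increasing property comes from.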

The advantages and disadvantages of this approach are:

Advantages:

  • The millisecond timestamp sits in the high bits and the auto-increment sequence in the low bits, so the ID as a whole is trend-increasing.
  • It does not depend on third-party systems such as a database. It is deployed as a service, which gives higher stability, and ID generation performance is very high.
  • Bits can be allocated according to the characteristics of the business, which is very flexible.

Disadvantages:

  • Strong dependence on the machine clock: if the clock on a machine is rolled back, duplicate IDs may be issued or the service may become unavailable.

Application example: MongoDB ObjectID

According to the official MongoDB documentation, ObjectID can be counted as snowflake-like. It consists of 12 bytes made up of "time + machine code + pid + inc", in a 4 + 3 + 2 + 3 byte split, and is finally represented as a 24-character hexadecimal string.

Database generation

Taking MySQL as an example, set auto_increment_increment and auto_increment_offset to guarantee auto-incrementing IDs. Each business then obtains an ID by reading and writing MySQL with the following SQL.

begin;
REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
commit;

image


The advantages and disadvantages of this approach are as follows:

Advantages:

  • Very simple: it is built on features of the existing database system, costs little, and is maintained by professional DBAs.
  • IDs increase monotonically, which can satisfy businesses with special ID requirements.

Disadvantages:

  • Strong dependence on the DB: when the DB is abnormal the whole system is unavailable, which is fatal. Master-slave replication can raise availability as far as possible, but data consistency is hard to guarantee in exceptional cases, and an inconsistency during a master-slave switch may cause duplicate IDs to be issued.
  • The ID issuing bottleneck is capped by the read/write performance of a single MySQL instance.

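The MySQL performance problem can be addressed as follows: in a distributed system we can deploy several machines, each with a different initial value and a step equal to the number of machines. For example, with two machines, set the step to 2; TicketServer1 starts at 1 (1, 3, 5, 7, 9, 11, …) and TicketServer2 starts at 2 (2, 4, 6, 8, 10, …). This is a primary key generation strategy that the Flickr team described in a 2010 article (Ticket Servers: Distributed Unique Primary Keys on the Cheap). As shown below, the parameters are set on each of the two machines so that TicketServer1 issues from 1, TicketServer2 issues from 2, and both increment by 2 after each issue.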

TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1

TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2
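To make the effect of these two settings concrete, here is a minimal in-memory simulation. It is only a sketch: `TicketServerSim` is a hypothetical stand-in for a MySQL instance, with the constructor arguments playing the roles of `auto-increment-offset` and `auto-increment-increment`.

```java
import java.util.concurrent.atomic.AtomicLong;

class TicketServerSim {
    private final AtomicLong next; // next value this server will issue
    private final long step;       // plays the role of auto-increment-increment

    /** offset plays the role of auto-increment-offset (the first value issued). */
    TicketServerSim(long offset, long step) {
        if (step <= 0) throw new IllegalArgumentException("step must be positive");
        this.next = new AtomicLong(offset);
        this.step = step;
    }

    /** Issues the current value, then advances by the step. */
    long nextId() {
        return next.getAndAdd(step);
    }
}
```

With `new TicketServerSim(1, 2)` and `new TicketServerSim(2, 2)`, the first server only ever issues odd numbers and the second only even numbers, so the two never collide. Interleaving their output shows why the combined stream is trend-increasing rather than strictly monotonic.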

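Suppose we deploy N machines: the step must be set to N, and the initial values are 0, 1, 2, …, N-1 in turn. The whole architecture then looks like the figure below: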

image


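This architecture seems to meet the performance requirement, but it has the following drawbacks:

  • Horizontal scaling is hard. Once the step and machine count are fixed, how do you add a machine? Suppose a single machine is issuing 1, 2, 3, 4, 5 (step 1) and one more machine must be added. We could set the second machine's initial value far ahead of the first, say 14 (assuming the first cannot reach 14 during the expansion), and set its step to 2, so that every number it issues is an even number from 14 onward. Then take the first machine offline, keep its ID value odd, say 7, and change its step to 2, so that it conforms to our numbering standard; in this example that means it can only produce odd numbers from then on. Does this look complicated? It seems manageable, but now imagine 100 machines online: expansion would be a nightmare. Horizontal scaling is therefore complex and hard to implement.
  • IDs are no longer monotonically increasing, only trend-increasing. This matters little for typical business needs and can be tolerated.
  • Database pressure is still high: every ID requires a database read and write, so performance can only be raised by adding machines.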

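The name Leaf comes from a saying by the German philosopher and mathematician Leibniz: "There are no two identical leaves in the world."

Comparing the schemes above, none of them fully meets our requirements. Leaf therefore optimizes the second and third schemes above, producing the Leaf-segment and Leaf-snowflake schemes.

Leaf-segment database scheme

The first scheme, Leaf-segment, builds on the database scheme with the following changes:

  • The original scheme reads and writes the database for every ID, which puts heavy pressure on it. Instead, a proxy server fetches in batches, one segment at a time (step determines its size), and goes back to the database for a new segment only when the current one is used up. This greatly reduces the pressure on the database.
  • The ID needs of different businesses are distinguished by the biz_tag field. ID allocation for each biz_tag is isolated and does not affect the others. If the database later needs to scale for performance, the complex expansion described above is unnecessary; sharding by biz_tag is enough.

The database table is designed as follows: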

+-------------+--------------+------+-----+-------------------+-----------------------------+
| Field       | Type         | Null | Key | Default           | Extra                       |
+-------------+--------------+------+-----+-------------------+-----------------------------+
| biz_tag     | varchar(128) | NO   | PRI |                   |                             |
| max_id      | bigint(20)   | NO   |     | 1                 |                             |
| step        | int(11)      | NO   |     | NULL              |                             |
| desc        | varchar(256) | YES  |     | NULL              |                             |
| update_time | timestamp    | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+-------------+--------------+------+-----+-------------------+-----------------------------+ 

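Key fields: biz_tag distinguishes businesses; max_id is the maximum ID of the segment currently allocated for that biz_tag; step is the length of each allocated segment. Previously every ID required a database write. Now, with step set large enough, say 1000, the database is read and written again only after 1000 IDs have been consumed. The database read/write frequency drops from 1 to 1/step. The rough architecture is shown in the figure below: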

image


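On the first Leaf machine, test_tag holds the segment 1~1000. When this segment is used up, the machine loads another segment of length step=1000. Assuming the other two machines have not refreshed their segments, the first machine's newly loaded segment should be 3001~4000, and max_id of the corresponding biz_tag row in the database is updated from 3000 to 4000. The SQL that updates the segment is as follows:

Begin
UPDATE table SET max_id=max_id+step WHERE biz_tag=xxx
SELECT tag, max_id, step FROM table WHERE biz_tag=xxx
Commit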
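The segment mechanism can be sketched in memory as follows. This is an illustration under stated assumptions, not Leaf's actual code: `FakeDb` is a hypothetical stand-in for the table above, its `allocate` method mimics the UPDATE plus SELECT transaction, and max_id is assumed to start at 0 so that the first segment is 1~1000 as in the example.

```java
import java.util.HashMap;
import java.util.Map;

class SegmentAllocator {
    /** Stand-in for the database table: biz_tag -> max_id (assumed to start at 0). */
    static class FakeDb {
        private final Map<String, Long> maxIds = new HashMap<>();

        /** Mimics "UPDATE ... SET max_id = max_id + step" followed by the SELECT. */
        synchronized long allocate(String bizTag, int step) {
            long newMax = maxIds.getOrDefault(bizTag, 0L) + step;
            maxIds.put(bizTag, newMax);
            return newMax;
        }
    }

    private final FakeDb db;
    private final String bizTag;
    private final int step;
    private long next = 1; // next ID to hand out
    private long max = 0;  // last ID of the current segment (inclusive)
    private int dbTrips = 0;

    SegmentAllocator(FakeDb db, String bizTag, int step) {
        this.db = db;
        this.bizTag = bizTag;
        this.step = step;
    }

    /** Hands out IDs from the cached segment; touches the DB once per step IDs. */
    synchronized long nextId() {
        if (next > max) {
            max = db.allocate(bizTag, step);
            next = max - step + 1;
            dbTrips++;
        }
        return next++;
    }

    synchronized int dbTrips() {
        return dbTrips;
    }
}
```

With three allocators sharing one `FakeDb` and step 1000, they receive 1~1000, 1001~2000, and 2001~3000; when the first one runs out, its next segment is 3001~4000, which matches the scenario described above. Note that this version still blocks the caller while a new segment is fetched.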

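This mode has the following advantages and disadvantages:

Advantages:

  • The Leaf service scales linearly with ease, and its performance is enough for most business scenarios.
  • IDs are trend-increasing 8-byte (64-bit) numbers, which meets the primary key requirements for database storage described above.
  • High fault tolerance: the Leaf service caches segments internally, so even if the DB goes down, Leaf can keep serving normally for a short time.
  • max_id can be customized, which makes it easy to migrate a business from its existing ID scheme.

Disadvantages:

  • IDs are not random enough and can leak how many IDs have been issued, which is not very secure.
  • TP999 fluctuates: when a segment is used up, requests still hang on the I/O of updating the database, so TP999 shows occasional spikes.
  • DB downtime can make the whole system unavailable.

Double-buffer optimization

For the second drawback, Leaf-segment makes some optimizations. In short:

Leaf fetches a new segment only when the current one is exhausted. This means that the time to issue an ID at the segment boundary depends on how long the next fetch from the DB takes, and requests arriving in the meantime are blocked because the segment has not come back yet. If the network to the DB and the DB itself are stable, the impact is small. But if the network jitters during the fetch, or the DB hits a slow query, the response time of the whole system degrades.

We therefore want fetching a segment from the DB to be non-blocking: request threads should not be blocked while the segment is fetched. When the current segment has been consumed to a certain point, the next segment is loaded into memory asynchronously, instead of waiting until the segment runs out. This greatly reduces the system's TP999. The detailed implementation is shown in the figure below: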

image


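With the double-buffer approach, the Leaf service keeps two segment buffers internally. When 10% of the current segment has been issued and the next segment has not yet been loaded, a separate update thread is started to load it. When the current segment is fully issued and the next segment is ready, the service switches to the next segment as the current one and keeps issuing, and the cycle repeats.

  • Each biz_tag has consumption-rate monitoring. The usual recommendation is to set the segment length to 600 times the peak-hour issuing QPS (10 minutes' worth), so that even if the DB goes down, Leaf can keep issuing IDs for 10 to 20 minutes unaffected.

  • Every incoming request checks the state of the next segment and triggers its update if needed, so occasional network jitter does not prevent the next segment from being loaded.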
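A minimal sketch of the double-buffer idea follows. It makes several simplifying assumptions that are not from the article: the database is replaced by an in-memory counter (`fakeDbMaxId`), the background load uses `CompletableFuture.supplyAsync` on the common pool, and there is no retry or error handling. The class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

class DoubleBufferAllocator {
    /** A contiguous range [start, end] of IDs plus a cursor. */
    static final class Segment {
        final long start;
        final long end;
        long next;
        Segment(long start, long end) { this.start = start; this.end = end; this.next = start; }
        long size() { return end - start + 1; }
        long used() { return next - start; }
        boolean exhausted() { return next > end; }
    }

    private final AtomicLong fakeDbMaxId = new AtomicLong(0); // stand-in for max_id in the DB
    private final int step;
    private Segment current;
    private CompletableFuture<Segment> pending; // the second buffer, loading or loaded

    DoubleBufferAllocator(int step) {
        if (step <= 0) throw new IllegalArgumentException("step must be positive");
        this.step = step;
        this.current = fetchSegment();
    }

    /** Simulates the UPDATE + SELECT round trip to the database. */
    private Segment fetchSegment() {
        long max = fakeDbMaxId.addAndGet(step);
        return new Segment(max - step + 1, max);
    }

    synchronized long nextId() {
        // Once 10% of the current segment is used, load the next one in the background.
        if (pending == null && current.used() * 10 >= current.size()) {
            pending = CompletableFuture.supplyAsync(this::fetchSegment);
        }
        if (current.exhausted()) {
            // Switch buffers; join() blocks only if the background load is still running.
            current = pending.join();
            pending = null;
        }
        return current.next++;
    }

    synchronized boolean prefetchStarted() {
        return pending != null;
    }
}
```

In the common case the background load finishes long before the current segment runs out, so the switch at the boundary does not wait on the database. Sizing the step at roughly 600 times the peak QPS, as recommended above, gives that load about ten minutes of headroom.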

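Leaf high availability and disaster recovery

For the third point, DB availability, we currently use one master and two slaves deployed across data centers, with semi-synchronous replication [5] between master and slaves. We also use the company's Atlas database middleware (now open source and renamed DBProxy) for master-slave switching. This setup can degrade to asynchronous mode in some situations, and in very extreme cases can still lead to data inconsistency, but the probability is very small. If your system must guarantee 100% strong consistency, you can choose a strongly consistent MySQL solution built on a Paxos-like algorithm, such as MySQL Group Replication, which recently reached GA in MySQL 5.7. Operations cost and effort rise accordingly, so choose based on your actual situation.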

image


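The Leaf service is also deployed across IDCs, and the internal service framework is "MTthrift RPC". When the service is called, the load-balancing algorithm prefers Leaf instances in the same data center, and only picks Leaf instances in other data centers when Leaf is unavailable in the local IDC. The service governance platform OCTO additionally provides protective measures such as overload protection, one-click traffic cutoff, and dynamic traffic allocation.

The Leaf-segment scheme generates trend-increasing IDs, but the IDs are computable, so it is not suitable for generating order IDs. For example, a competitor who places an order at noon on two consecutive days can subtract one order ID from the other and roughly work out the company's order volume for a day, which is unacceptable. To address this, we provide the Leaf-snowflake scheme.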

image


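The Leaf-snowflake scheme fully follows the snowflake bit layout, assembling IDs as "1+41+10+12". For workerID assignment, manual configuration is perfectly fine when the service cluster is small. The Leaf service is large, and configuring it by hand costs too much, so it uses ZooKeeper persistent sequential nodes to assign workerIDs to snowflake nodes automatically. Leaf-snowflake starts up in the following steps:

  1. Start the Leaf-snowflake service, connect to ZooKeeper, and check under the leaf_forever parent node whether this instance has already registered (whether its sequential child node exists).
  2. If it has registered, fetch its own workerID (the int ID generated by the zk sequential node) and start the service.
  3. If it has not, create a persistent sequential node under the parent node, take the returned sequence number as its workerID, and start the service.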
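The registration steps above can be sketched without a real ZooKeeper. In the following illustration, `WorkerIdRegistry` is a hypothetical in-memory stand-in for the children of leaf_forever, and the node name format `ip:port-sequence` is an assumption made for this example. The 10-digit zero-padded suffix mirrors how ZooKeeper names sequential nodes.

```java
import java.util.HashMap;
import java.util.Map;

class WorkerIdRegistry {
    static final int MAX_WORKER_ID = 1023; // 10 bits in the snowflake layout

    // In-memory stand-in for the sequential children of the leaf_forever node.
    private final Map<String, Integer> nodes = new HashMap<>();
    private int counter = 0;

    /** Returns the existing workerID for this instance, or registers a new one. */
    synchronized int register(String ipPort) {
        Integer existing = nodes.get(ipPort);
        if (existing != null) {
            return existing; // step 2: already registered, reuse the workerID
        }
        if (counter > MAX_WORKER_ID) {
            throw new IllegalStateException("no workerID left");
        }
        int workerId = counter++; // step 3: next sequence number becomes the workerID
        nodes.put(ipPort, workerId);
        return workerId;
    }

    /** Builds a node name with a 10-digit zero-padded sequence suffix. */
    static String nodeName(String ipPort, int workerId) {
        return String.format("%s-%010d", ipPort, workerId);
    }

    /** Recovers the workerID from a node name built by nodeName. */
    static int parseWorkerId(String nodeName) {
        return Integer.parseInt(nodeName.substring(nodeName.lastIndexOf('-') + 1));
    }
}
```

Because the node is persistent, a restarted instance finds its earlier registration and keeps the same workerID instead of consuming a new one from the 1024 available.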

image


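Weak dependence on ZooKeeper

Besides fetching data from ZK each time, the service also caches a workerID file on the local file system. If ZooKeeper has a problem at the very moment a machine fails and needs to restart, the service can still start normally. This keeps the dependence on the third-party component weak and improves the SLA to some degree.

Solving the clock problem

Because this scheme depends on time, a rolled-back machine clock can produce duplicate IDs, so the clock rollback problem has to be solved.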

image


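See the startup flowchart above. On startup the service first checks whether it has ever written the ZooKeeper leaf_forever node:

  1. If it has, it compares its own system time with the time recorded in leaf_forever/${self}. If the system time is earlier than the recorded time, the machine clock is considered to have rolled back by a large step; startup fails and an alarm is raised.
  2. If it has not, this is a new service node. It creates the persistent node leaf_forever/${self} and writes its own system time, then judges whether its system time is accurate by comparing it with the system time of the other Leaf nodes. Specifically, it takes the service IP:Port of every ephemeral node under leaf_temporary (all running Leaf-snowflake nodes), obtains each node's system time via RPC, and computes sum(time)/nodeSize.
  3. If abs(system time - sum(time)/nodeSize) < threshold, the current system time is considered accurate; the service starts normally and writes the ephemeral node leaf_temporary/${self} to hold its lease.
  4. Otherwise the local system time is considered to have drifted by a large step; startup fails and an alarm is raised.
  5. At a fixed interval (3s), it reports its own system time by writing it to leaf_forever/${self}.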
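The two time comparisons in these steps can be sketched as pure functions. This is only an illustration: the names `StartupClockCheck`, `rolledBackSinceLastReport`, and `withinThreshold` are hypothetical, the peer times would come from RPC calls in practice, and treating an empty peer list as acceptable is an assumption made here.

```java
class StartupClockCheck {
    /** Step 1: the local clock must not be behind the time last written to leaf_forever/${self}. */
    static boolean rolledBackSinceLastReport(long nowMs, long lastReportedMs) {
        return nowMs < lastReportedMs;
    }

    /** Steps 2 to 4: compare local time with the average system time of the running peers. */
    static boolean withinThreshold(long nowMs, long[] peerTimesMs, long thresholdMs) {
        if (peerTimesMs.length == 0) {
            return true; // assumption: with no peers there is nothing to compare against
        }
        long sum = 0;
        for (long t : peerTimesMs) {
            sum += t;
        }
        long average = sum / peerTimesMs.length; // sum(time) / nodeSize
        return Math.abs(nowMs - average) < thresholdMs;
    }
}
```

A node would refuse to start when `rolledBackSinceLastReport` returns true or `withinThreshold` returns false, and raise an alarm in both cases.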

Because of the strong dependence on the clock, the scheme is sensitive to time. NTP synchronization while the machine is running can also cause second-level rollbacks, so we suggest simply turning NTP synchronization off. Alternatively, refuse service while the clock is rolled back and return ERROR_CODE until the clock catches up; or retry once and then report to the alarm system; or automatically remove the node and raise an alarm once a rollback is detected, as follows:

 // A rollback occurred: the current time is earlier than the last issue time
 if (timestamp < lastTimestamp) {
     long offset = lastTimestamp - timestamp;
     if (offset <= 5) {
         try {
             // The offset is at most 5 ms: wait for twice the offset
             wait(offset << 1);
             timestamp = timeGen();
             if (timestamp < lastTimestamp) {
                 // Still behind: throw an exception and report
                 throwClockBackwardsEx(timestamp);
             }
         } catch (InterruptedException e) {
             throw e;
         }
     } else {
         // The offset is too large: throw immediately
         throwClockBackwardsEx(timestamp);
     }
 }
 // Allocate the ID
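For readers who want to try the check in isolation, here is a self-contained sketch of a generator with the same rollback handling. It departs from the fragment above in several labelled ways: the clock is injected as a `LongSupplier` so it can be controlled, `Thread.sleep` stands in for `wait`, a plain `IllegalStateException` replaces `throwClockBackwardsEx`, and no custom epoch is subtracted from the timestamp. All names are hypothetical.

```java
import java.util.function.LongSupplier;

class RollbackAwareGenerator {
    private static final int WORKER_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;
    private static final long MAX_TOLERATED_ROLLBACK_MS = 5;

    private final long workerId;
    private final LongSupplier clock; // injected millisecond clock (assumption, for testability)
    private long lastTimestamp = -1;
    private long sequence = 0;

    RollbackAwareGenerator(long workerId, LongSupplier clock) {
        this.workerId = workerId;
        this.clock = clock;
    }

    synchronized long nextId() {
        long timestamp = clock.getAsLong();
        if (timestamp < lastTimestamp) {
            long offset = lastTimestamp - timestamp;
            if (offset > MAX_TOLERATED_ROLLBACK_MS) {
                throw new IllegalStateException("clock moved backwards by " + offset + " ms");
            }
            try {
                Thread.sleep(offset << 1); // small rollback: wait twice the offset
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for the clock", e);
            }
            timestamp = clock.getAsLong();
            if (timestamp < lastTimestamp) {
                throw new IllegalStateException("clock is still behind after waiting");
            }
        }
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // 4096 IDs already issued in this millisecond: spin until the clock advances.
                while ((timestamp = clock.getAsLong()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = timestamp;
        return (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

In production the clock would be `System::currentTimeMillis`; passing a controllable supplier instead makes it possible to simulate a rollback and confirm that the generator refuses to issue IDs rather than risk duplicates.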

Judging from production, some machines did have their clocks rolled back during the 2017 leap second. Thanks to the Leaf-snowflake strategy, the business was not affected.

Leaf serves many internal lines of business at Meituan-Dianping, including finance, payment and transactions, catering, food delivery, hotels and travel, and Maoyan Movies. On a 4C8G machine, Leaf stress-tests at close to 50,000 QPS (5w/s) with a TP999 of 1 ms, which is enough for most businesses. It handles call volume on the order of a hundred million per day. As shared technical infrastructure inside the company, it must guarantee a high SLA and high performance. So far we have only reached a passing grade, and there is still a lot of room for improvement.

Cheow Tong is a member of the Meituan-Dianping infrastructure team, mainly involved in developing Mtrace, Meituan's large-scale distributed tracing system, and Leaf, Meituan-Dianping's distributed ID generation system. He previously worked at Alibaba and joined Meituan in July 2016.


Origin www.cnblogs.com/zhangfengshi/p/11571878.html