Whether to adopt the read-write separation scheme

How do we decide whether to adopt a read-write separation architecture or a sharding architecture?

  1. In general, the DBA team prefers sharding over relying heavily on replication-based read/write splitting;
  2. Existing applications that use read-write separation need to be reviewed and sorted out;
  3. New read-write separation solutions must be reviewed by the Architecture Review Committee, or confirmed by the Director of Development and the Director of DBA.
               

Benefits of read-write separation:

  1. With relatively little effort (only read-write separation is needed, which is much simpler to develop than sharding), the system's scalability problem can be solved;
  2. When the high-availability requirement for writes is relatively low but the high-availability and QPS requirements for reads are very high (such as user login or mobile configuration data), read capacity can be expanded easily;
  3. It is easy to achieve 99.999% availability for the read library;
  4. To some extent, it can reduce the pressure that large query/report applications put on online systems (such as wms/tms, but this practice is not encouraged).


Disadvantages of read-write separation:

  1. Application developers need to know clearly when to read from the master library and when to read from the slave library; an exception caused by master-slave delay should be treated as a bug;
  2. It raises the bar for application development: you cannot write large jobs (the slave library will inevitably lag), and you cannot write large SQL (which also easily makes the slave library lag);
  3. Slave delay has to be handled as an exceptional case, including deciding when to fall back to the master library; most reads cannot/should not fall back, because a concentrated fallback can easily overwhelm the master library;
  4. It makes it easy for developers to abuse the system and skip optimization;
  5. It reduces the usable write capacity of the master library. When TPS exceeds 6k, slave delay gradually grows, whereas the master library on its own can actually reach 15k or even 20k TPS.
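
Disadvantage 3 above (fall back to the master when the slave lags, without overwhelming the master) can be sketched as a small router. This is only an illustration under stated assumptions: the class name `ReadRouter`, the 5-second lag threshold, and the per-second fallback budget are hypothetical, not part of the original guideline.

```python
import time


class ReadRouter:
    """Send reads to the slave unless it lags; cap fallbacks to the master."""

    def __init__(self, max_lag_s=5.0, max_fallbacks_per_s=100):
        # Both limits are assumed values for illustration only.
        self.max_lag_s = max_lag_s
        self.max_fallbacks_per_s = max_fallbacks_per_s
        self._window_start = 0.0
        self._fallbacks = 0

    def route(self, slave_lag_s, now=None):
        """Return "slave", "master", or "reject" for one read request."""
        now = time.monotonic() if now is None else now
        if slave_lag_s <= self.max_lag_s:
            return "slave"
        # Start a new one-second budget window once the old one has expired.
        if now - self._window_start >= 1.0:
            self._window_start = now
            self._fallbacks = 0
        if self._fallbacks < self.max_fallbacks_per_s:
            self._fallbacks += 1
            return "master"
        # Budget exhausted: fail fast instead of stampeding the master.
        return "reject"
```

The measured lag (`slave_lag_s`) is passed in by the caller; how it is obtained is left open here. Rejecting reads once the budget is used up is one possible design choice; serving stale data or a cached value would be another.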

 

Overall guidelines for the decision:

  1. The read-write ratio must be above 10:1 (above 20:1 is recommended) before read-write separation can be considered; otherwise, in principle, we do not agree to it. Delay still has to be avoided in any case. For example, if a failover happens during a sudden delay, you must either discard the delayed data (on a hardware failure, the large volume of binlog generated by large transactions may not have been transmitted in time, so that data is lost) or wait for the slave to catch up, which lengthens the failure recovery time;
  2. For a database with read-write separation, there must be LVS in front of the read library to virtualize its IP. In principle, at a 10:1 read-write ratio the load on the read library is greater than on the write library, so multiple database servers are needed to support it; LVS is also needed to hide and handle maintenance of the back-end database servers (so that, logically, the read library can always stay online);
  3. For applications with read-write separation, the program must handle delay and be able to tolerate a read library delay of 3~5s; if that cannot be tolerated, do not connect to the slave library;
  4. For a main library with read-write separation, the system design should ensure that the write library will not need to be sharded for capacity within 3 years; otherwise, the complexity of doing read-write separation first and then sharding the main library is introduced for nothing;
  5. Applications must not abuse the system just because read-write separation is in place; large transactions in the main library and large SQL in the slave library both lead to large-scale slave delay;
  6. For operations inside a transaction or with strong consistency dependencies, you cannot read from the slave library one moment and the main library the next; the data seen within a transaction must be consistent. Even if it is only at the millisecond level, replication from the main library to the slave always has some delay;
  7. In principle, for databases with read-write separation, try not to let the main database grow too large (2 TB);
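  8. Read-write separation should not be the standard solution for reporting. Reporting applications (mostly back-end systems such as vis, tps, ods, tms, wms) should in principle be handled through hadoop, because MySQL is inherently unsuited to reporting workloads; online systems and reporting systems should naturally be kept separate;
  9. Read-write separation applications: for applications whose cache layer may fall back to the database, read-write separation can be considered; all such read fallbacks must go to the LVS vip.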
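
The numeric guidelines above can be read as a checklist. The sketch below encodes a few of them; the `Workload` fields and the function name are hypothetical, and using 5 seconds (the upper end of the 3~5s range) as the required tolerance is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    reads_per_s: float
    writes_per_s: float
    tolerated_lag_s: float          # staleness the application accepts on reads
    needs_sharding_within_3y: bool
    master_size_tb: float
    is_reporting: bool


def read_write_split_objections(w: Workload) -> List[str]:
    """Return the guideline violations; an empty list means 'may be considered'."""
    objections = []
    # Guideline 1: read-write ratio must be above 10:1.
    if w.writes_per_s > 0 and w.reads_per_s / w.writes_per_s < 10:
        objections.append("read-write ratio is below 10:1")
    # Guideline 3: the application must tolerate 3~5s of slave delay.
    if w.tolerated_lag_s < 5.0:
        objections.append("application cannot tolerate 3~5s of slave delay")
    # Guideline 4: no sharding of the write library expected within 3 years.
    if w.needs_sharding_within_3y:
        objections.append("write library will need sharding within 3 years")
    # Guideline 7: keep the main database from growing too large.
    if w.master_size_tb > 2.0:
        objections.append("main database is larger than 2 TB")
    # Guideline 8: reporting workloads belong elsewhere.
    if w.is_reporting:
        objections.append("reporting workload; not a fit for read-write separation")
    return objections
```

An empty result only means the workload passes these mechanical checks; the review by the Architecture Review Committee described earlier still applies.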

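Past examples of problems caused by application design after adopting read-write separation:

  1. Order/Coupon problems: within one transaction, reads go to the slave library one moment and to the main library the next; in scenarios with a delay of a few seconds to more than ten seconds, exceptions occur;
    1. Order example
      Business scenario: in the callback flow from the third-party payment platform after a user's order is paid successfully, the pay domain calls the order interface to update the order's payment status. The main library is updated successfully, but replication to the slave library is delayed. The pay domain then calls the order review interface (the review data includes a check of the payment status), which reads the order from the slave library and still sees it as unpaid, so the order review flow fails.
      1. Original order data <order_sn,pay_status> = <15032483118418,0> (pay_status 0: unpaid, 1: paid successfully)
      2. After the order is paid successfully, the main library holds <15032483118418,1> while the slave library holds <15032483118418,0>
      3. During review, the order data <15032483118418,0> is read from the slave library, and the payment status check fails
    2. Coupon example
      Business scenario: before a major promotion, to reduce pressure on the main library, coupon.api moved the balance-check query performed when changing a balance from the main library to the slave library, and added a master-slave consistency check: when the main and slave libraries are inconsistent, the member account balance cannot be changed. However, the other operations in an order refund did not check master-slave consistency. As a result, when a refund ran while the main and slave libraries were out of sync, the other amounts were refunded successfully but the refund to the gift card and 唯品卡 (Vipshop card) accounts failed. Master-slave consistency was unstable during that period, which produced abnormal orders.

      Solution:
      When refunding an order, read the gift card and 唯品卡 account data from the main library.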
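
Both incidents come down to reading from the slave right after a write to the main library. One common mitigation is to pin reads of a recently written key to the main library for a short window. A minimal sketch, assuming a hypothetical `StickyRouter` class and a 5-second pin window; neither is taken from the original incident reports.

```python
import time


class StickyRouter:
    """After a write to a key, route reads of that key to the master for a while."""

    def __init__(self, pin_seconds=5.0):
        self.pin_seconds = pin_seconds  # assumed window, roughly the tolerated delay
        self._last_write = {}           # key -> time of the most recent write

    def record_write(self, key, now=None):
        self._last_write[key] = time.monotonic() if now is None else now

    def target_for_read(self, key, in_transaction=False, now=None):
        """Return "master" or "slave" for a read of `key`."""
        now = time.monotonic() if now is None else now
        # Reads inside a transaction always stay on the master.
        if in_transaction:
            return "master"
        written = self._last_write.get(key)
        if written is not None and now - written < self.pin_seconds:
            return "master"
        return "slave"
```

In the order example, calling `record_write("15032483118418")` when pay_status is updated would make the following review read go to the main library. Note that this in-memory map only works inside one process; when the write and the read happen in different services, the marker would have to live in shared storage or be passed along with the request.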

 

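Common causes of slave delay:

  1. Large jobs, such as a single update touching tens of thousands or hundreds of thousands of rows;
  2. An update that touches many rows (100+) and is a full table scan, with no suitable WHERE condition that can use an index. Because we use row-based replication, every row has to run that full table scan or inefficient index scan again, so the problem is amplified on the slave;
  3. A table has no primary key, so on the slave library the DML for every replicated row performs a full table scan;
  4. Large SQL on the slave, such as frequent, heavy, complex joins that exhaust disk IO capacity or CPU resources and cause slave delay;
  5. DBA maintenance operations, such as online DDL on large tables or the DBA's nightly archiving jobs (this should be prohibited); our online DDL is required to throttle its speed, and archiving must also limit speed and concurrency. The trouble comes when the two kick off at the same time;
  6. Network or system problems, such as cross-datacenter networking, unstable networking within a datacenter, or unstable systems (machine failure);
  7. MySQL bugs;
  8. Frequent reads and writes on tables with Text columns, which exhaust IO resources even on flash cards;
  9. Some frameworks open a transaction even for queries (for example set autocommit=0) and do not close the connection; when a DDL change is needed at release time, this leads to table-lock waits and delay.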
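
For cause 1 (one update touching tens or hundreds of thousands of rows), a usual remedy is to split the job into small primary-key ranges. The sketch below only generates the ranges and statements; the table name, column, and batch size of 1000 are hypothetical, and the string-built SQL is for illustration rather than for direct use with untrusted input.

```python
def id_batches(min_id, max_id, batch_size):
    """Yield inclusive (start, end) primary-key ranges covering [min_id, max_id]."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def batched_statements(table, set_clause, min_id, max_id, batch_size=1000):
    """Build one small UPDATE per primary-key range instead of one huge UPDATE."""
    return [
        f"UPDATE {table} SET {set_clause} WHERE id BETWEEN {start} AND {end}"
        for start, end in id_batches(min_id, max_id, batch_size)
    ]
```

Each statement then runs as its own short transaction that uses the primary key. Pausing between batches, or checking slave delay before sending the next one, keeps the job from building up replication lag.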
