How to design a short-link service (system design series)

The originator of short-link services is TinyURL, the first website to offer one. Today there are many short-link services in China, such as Sina (t.cn), Baidu (dwz.cn), and Tencent (url.cn).

An obvious question: why use short links at all? Put another way, do short-link services need to exist?

To borrow a catch-all answer: what exists is reasonable.

Why short-link services exist

Let us first look at why short-link services make sense.

The defining feature of a short link is that it is short.

Early Weibo users will remember that each post was limited to 140 characters, so sharing a link meant cutting down the text that described it.

Similarly, a link in a marketing text message has a cost. On early mobile phones, a long message could also arrive as three separate, disconnected messages, which seriously hurt reach and click-through rates.

In both cases, the shorter the link, the more room there is for other content. In practice, though, link length varies from business to business, and ordinary links pick up extra parameters to serve other needs (marketing statistics, for example). That is why the short link was born: through a redirect, a short link can stand in for another link. For example, the 20-character link http://t.cn/A6ULvJho redirects to an original link of 146 characters: https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html?spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUdv&scm=20140722.2007.2.1989 .

The two examples above show the value of short links. Their practical uses can be summarized as follows:

  1. Cheaper marketing text messages: a shorter link means a shorter message and a lower cost per message. The short link above has 20 characters and the original has 146; the difference is all money.

  2. More readable QR codes. The two QR codes below are the same size, but because they encode different amounts of content, their module density differs.

    http://t.cn/A6ULvJho
    https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html??spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUdv&scm=20140722.2007.2.1989

  3. Flexible and configurable. Because a short link reaches the original link through a redirect, you can change the redirect target if the original link turns out to have a problem or you need to send users somewhere else. This is very useful for offline materials. For example, if you have already distributed printed QR codes and then want to point them at a different site or campaign, you only need to change the short link's target address instead of replacing all the materials already in circulation.

How short links work

As mentioned earlier, a short link works by having the server redirect to the original link. Take a Sina Weibo short link as an example. Running curl -i http://t.cn/A6ULvJho in a console gives the following result:

HTTP/1.1 302 Found
Date: Thu, 30 Jul 2020 13:59:13 GMT
Content-Type: text/html;charset=UTF-8
Content-Length: 328
Connection: keep-alive
Set-Cookie: aliyungf_tc=AQAAAJuaDFpOdQYARlNadFi502DO2kaj; Path=/; HttpOnly
Server: nginx
Location: https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html??spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUdv&scm=20140722.2007.2.1989

<HTML>
<HEAD>
<TITLE>Moved Temporarily</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Temporarily</H1>
The document has moved <A HREF="https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html??spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUdv&scm=20140722.2007.2.1989">here</A>.
</BODY>
</HTML>

As the response shows, Sina returns a 302 redirect and, for compatibility, also includes an HTML body with a link that can be followed manually. The whole interaction is as follows:

(Figure: short-link redirect process)

How to generate short links

According to web page count statistics, there are currently about 5.8 billion web pages in the world. A Java int can represent at most 2^32 = 4,294,967,296 distinct values, which is under 4.3 billion and therefore less than 5.8 billion, while a long can represent 2^64 values, far more than 5.8 billion. So if numeric IDs are used, int is barely enough (after all, not every URL will be submitted to a short-link service). long is safer but wastes some space. Which type to use depends on the business.

Sina Weibo uses an 8-character string to represent the original link. This string can be read as a number written in base 62: 62^8 = 218,340,105,584,896, roughly 218 trillion, which is far more than 5.8 billion, so it can cover every currently known URL in the world. Base 62 uses 62 symbols: the 10 digits (0-9), the 26 lowercase letters (a-z), and the 26 uppercase letters (A-Z).
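The base-62 conversion can be sketched in a few lines of Java. This is only an illustration: the class name Base62 and its method names are made up for this example, and the alphabet order (digits, then lowercase, then uppercase) is assumed from the description of base 62 in this section.

```java
// Minimal base-62 codec sketch. Class and method names are hypothetical.
final class Base62 {
    // Digits first, then lowercase, then uppercase: 10 + 26 + 26 = 62 symbols.
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int BASE = ALPHABET.length();

    private Base62() {}

    /** Encodes a non-negative number as a base-62 string. */
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % BASE))); // least significant digit first
            value /= BASE;
        }
        return sb.reverse().toString();
    }

    /** Decodes a base-62 string back to the number it represents. */
    static long decode(String text) {
        long value = 0;
        for (char c : text.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("not a base-62 character: " + c);
            }
            // The *Exact methods throw instead of silently overflowing a long.
            value = Math.addExact(Math.multiplyExact(value, (long) BASE), (long) digit);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(1810437348L)); // 1Ywpso
        System.out.println(decode("1Ywpso"));    // 1810437348
    }
}
```

The largest 8-character value, ZZZZZZZZ, decodes to 62^8 - 1, which still fits comfortably in a long.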

Generation method 1: Hash

Hashing the original link is a relatively simple approach, and there are many ready-made algorithms to choose from. One problem cannot be avoided, though: hash collisions. It is therefore important to pick an algorithm with a low collision rate.

MurmurHash is recommended. It is a non-cryptographic hash function suited to general hash-based lookup, and it is currently used in Redis, Memcached, Cassandra, HBase, and Lucene.

With the help of MurmurHash in Guava:

import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

final String url = "https://www.howardliu.cn/how-to-use-branch-efficiently-in-git/index.html?spm=5176.12825654.gzwmvexct.d118.e9392c4aP1UUdv&scm=20140722.2007.2.1989";
final HashFunction hf = Hashing.murmur3_128();
final HashCode hashCode = hf.newHasher().putString(url, StandardCharsets.UTF_8).hash();
final int hashCodeAsInt = hashCode.asInt(); // an int is returned here; asLong() could be used instead
System.out.println(hashCodeAsInt); // prints 1810437348, which is 1Ywpso in base 62

The simplest way to deal with collisions is this: when a collision occurs, append a special string to the original URL and hash again, repeating until there is no collision. The process is as follows:

(Figure: Hash + Bloom filter)
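The collision handling described here might look roughly like the following sketch. Several things in it are assumptions rather than the author's implementation: the class name HashShortener, the "#DUP" salt string, the tiny hand-rolled Bloom filter, the in-memory map standing in for the database, and CRC32 standing in for MurmurHash so that the example needs nothing beyond the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;
import java.util.zip.CRC32;

final class HashShortener {
    private static final String SALT = "#DUP";      // hypothetical marker appended on collision
    private static final int BLOOM_BITS = 1 << 20;  // Bloom filter size, chosen arbitrarily

    private final ToLongFunction<String> hash;
    private final BitSet bloom = new BitSet(BLOOM_BITS);
    private final Map<Long, String> store = new HashMap<>(); // stand-in for the database

    HashShortener(ToLongFunction<String> hash) {
        this.hash = hash;
    }

    /** Stand-in hash (CRC32, not MurmurHash) so the sketch needs no third-party library. */
    static long crc32(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Returns the code under which the URL is stored, salting and rehashing on collision. */
    long shorten(String url) {
        String candidate = url;
        while (true) {
            long code = hash.applyAsLong(candidate);
            // A Bloom filter has no false negatives, so "absent" means the code is free.
            // "Present" may be a false positive, so it is confirmed against the store.
            String existing = mightContain(code) ? store.get(code) : null;
            if (existing == null) {
                remember(code);
                store.put(code, url);
                return code;
            }
            if (existing.equals(url)) {
                return code; // this URL was shortened before
            }
            candidate += SALT; // real collision with a different URL: salt and retry
        }
    }

    private boolean mightContain(long code) {
        for (int i = 0; i < 3; i++) {
            if (!bloom.get(position(code, i))) {
                return false;
            }
        }
        return true;
    }

    private void remember(long code) {
        for (int i = 0; i < 3; i++) {
            bloom.set(position(code, i));
        }
    }

    // Derives the i-th bit position from the code by double hashing.
    private static int position(long code, int i) {
        long h1 = code ^ (code >>> 32);
        long h2 = ((code * 0x9E3779B97F4A7C15L) >>> 32) | 1L;
        return (int) Math.floorMod(h1 + i * h2, (long) BLOOM_BITS);
    }
}
```

Passing the hash function in as a parameter makes collisions easy to provoke in a test: with a deliberately weak hash such as the string length, two different URLs of the same length collide and the second one is stored under the hash of its salted form.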

Generation method 2: Centralized ID issuer

Here, whatever the original link is, a centralized issuer assigns it an ID, and that ID becomes the content of the short link. For example, the first one is https://tinyurl.com/1, the second is https://tinyurl.com/2, and so on. Some distributed ID algorithms produce very long numbers; to get a shorter link, the ID can be converted to a base-62 string.

  1. Redis auto-increment: Redis performs well, and a single instance can handle more than 100,000 requests per second. If it is used as the issuer, Redis persistence and disaster recovery need to be considered.
  2. MySQL auto-increment primary key: similar to the Redis approach. It relies on the database's auto-increment primary key to guarantee that IDs are unique and are generated continuously and automatically.
  3. Snowflake: a widely used ID generation algorithm. Meituan's Leaf is a service that wraps and extends it. The algorithm depends on the server clock, however, so if the clock is rolled back, IDs may conflict. (Some people consider the per-millisecond sequence number the bottleneck of this algorithm. But the algorithm is only an idea: if the sequence is too short for you, make it longer. That said, does your service really handle millions of requests per second?)
  4. And so on.
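To make the Snowflake option more concrete, here is a minimal single-process sketch. It assumes the commonly cited layout of a millisecond timestamp, a 10-bit worker ID, and a 12-bit sequence; the custom epoch, the class name, and the choice to throw on clock rollback are all illustrative assumptions, not a description of Leaf or of any particular production implementation.

```java
final class SnowflakeIdGenerator {
    private static final long EPOCH = 1577836800000L; // 2020-01-01T00:00:00Z, an arbitrary custom epoch
    private static final int WORKER_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_WORKER = (1L << WORKER_BITS) - 1;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER) {
            throw new IllegalArgumentException("workerId must be between 0 and " + MAX_WORKER);
        }
        this.workerId = workerId;
    }

    /** Returns the next ID; IDs from one generator are strictly increasing. */
    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock rollback: refuse to issue IDs rather than risk duplicates.
            throw new IllegalStateException(
                    "clock moved backwards by " + (lastTimestamp - now) + " ms");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: busy-wait for the next one.
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

With a 12-bit sequence, one worker can issue 4,096 IDs per millisecond. The resulting long can then be converted to base 62 to shorten it.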

A separate follow-up article will cover the centralized ID issuer. When it is published, I will update this post with a link; you can also follow me (WeChat ID: Watching the Mountain Lodge) to hear about it first.

The ID issuer raises one more question: should the same original link always return the same short link, or a different one?

The answer is that the same original link returns different short links depending on dimensions such as user and location; only when all dimensions match is the same short link returned. The benefit is that clicks and request information can then be analyzed per short link. All we give up is some storage and computation, while the information collected is invaluable.
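One way to picture this rule is a lookup keyed by the original link plus the chosen dimensions. The sketch below is an assumption-laden illustration: the class name ShortLinkRegistry, the two dimensions (userId and location), the in-memory map, and the AtomicLong standing in for the ID issuer are all hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class ShortLinkRegistry {
    /** The dimensions that decide whether two requests share a short link (a hypothetical choice). */
    private static final class Key {
        final String originalUrl;
        final String userId;
        final String location;

        Key(String originalUrl, String userId, String location) {
            this.originalUrl = originalUrl;
            this.userId = userId;
            this.location = location;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return originalUrl.equals(other.originalUrl)
                    && userId.equals(other.userId)
                    && location.equals(other.location);
        }

        @Override
        public int hashCode() {
            return Objects.hash(originalUrl, userId, location);
        }
    }

    private final AtomicLong issuer = new AtomicLong();            // stand-in for the ID issuer
    private final Map<Key, Long> ids = new ConcurrentHashMap<>();  // stand-in for storage

    /** Returns the existing ID when every dimension matches, otherwise issues a new one. */
    long idFor(String originalUrl, String userId, String location) {
        return ids.computeIfAbsent(
                new Key(originalUrl, userId, location),
                key -> issuer.incrementAndGet());
    }
}
```

Calling idFor twice with identical arguments yields the same ID, while changing the user or the location yields a new one, so clicks can later be attributed to each combination.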

Storing short links

Data is generally stored in one of two kinds of database: relational or NoSQL. With the creation logic above in place, storage follows naturally. Here is the table definition for MySQL:

CREATE TABLE IF NOT EXISTS tiny_url
(
    sid                INT AUTO_INCREMENT PRIMARY KEY,
    create_time        DATETIME  DEFAULT CURRENT_TIMESTAMP NULL,
    update_time        TIMESTAMP DEFAULT CURRENT_TIMESTAMP NULL ON UPDATE CURRENT_TIMESTAMP,
    version            INT       DEFAULT 0                 NULL COMMENT 'version number',
    tiny_url           VARCHAR(10)                         NULL COMMENT 'short link',
    original_url       TEXT                                NOT NULL COMMENT 'original link',
    # Other additional information
    creator_ip         INT       DEFAULT 0                 NOT NULL,
    creator_user_agent TEXT                                NOT NULL,
    # Other user information for later statistics. Only the fields that affect short-link creation need to be stored here; everything else can be sent straight to the data service.
    instance_id        INT       DEFAULT 0                 NOT NULL,
    # ID of the service instance that created the short link
    state              TINYINT   DEFAULT 1                 NULL COMMENT '-1 invalid, 1 valid'
);

One more reminder: when designing storage, consider the data volume and plan in advance whether the data needs to be sharded across tables and databases.

Handling short-link requests

With storage in place, the next step is to serve requests.

The usual approach is to look up the stored record for the requested short-link string and return an HTTP redirect to the original address. With a relational database, you generally need an index on the short-link column, and to keep the database from becoming a bottleneck, a cache is placed in front of it. To use cache space well, short links that are not hot are usually evicted with the LRU algorithm. The process is as follows:

(Figure: short-link request process)

The Bloom filter in the figure guards against cache penetration, where requests for nonexistent short links bypass the cache and put excessive pressure on the server.
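The lookup path (Bloom filter first, then the LRU cache, then the database) could be sketched as follows. Everything here is a simplified assumption: the class name ShortLinkResolver, the hand-rolled three-probe Bloom filter, the LinkedHashMap used as an LRU cache, and the plain Map that stands in for the database.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class ShortLinkResolver {
    private static final int BLOOM_BITS = 1 << 20; // Bloom filter size, chosen arbitrarily

    private final BitSet bloom = new BitSet(BLOOM_BITS);
    private final Map<String, String> database; // stand-in for the real database
    private final Map<String, String> cache;    // LRU cache in front of the database
    int databaseReads = 0;                      // exposed only so the demo can observe cache hits

    ShortLinkResolver(Map<String, String> database, final int cacheCapacity) {
        this.database = database;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > cacheCapacity;
            }
        };
        for (String code : database.keySet()) {
            for (int i = 0; i < 3; i++) {
                bloom.set(position(code, i));
            }
        }
    }

    /** Resolves a short code to its original URL, or empty if the code is unknown. */
    synchronized Optional<String> resolve(String code) {
        if (!mightExist(code)) {
            return Optional.empty(); // definitely unknown: touch neither cache nor database
        }
        String url = cache.get(code);
        if (url == null) {
            databaseReads++;
            url = database.get(code); // may still be null on a Bloom false positive
            if (url != null) {
                cache.put(code, url);
            }
        }
        return Optional.ofNullable(url);
    }

    private boolean mightExist(String code) {
        for (int i = 0; i < 3; i++) {
            if (!bloom.get(position(code, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th bit position from the code by double hashing.
    private static int position(String code, int i) {
        int h1 = code.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, BLOOM_BITS);
    }
}
```

With a cache capacity of 2, resolving a1, b2, a1, c3 in that order evicts b2, because it is the least recently used entry when c3 is added.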

One more question: should the HTTP redirect use 301 or 302? Why does Sina Weibo return 302 rather than the more semantically fitting 301? (Readers who are not familiar with HTTP status codes can find more in "HTTP Status Code Summary".)

  • 301 means a permanent redirect. After the browser gets the redirect target on the first request, later requests read it straight from the browser cache and never reach the short-link service again. This reduces the number of requests and the load on the server, but because the browser no longer contacts the backend, the real click count cannot be collected.
  • 302 means a temporary redirect. The browser asks the server for the target address on every request. This puts more pressure on the server, but with hardware as plentiful as it is today, that cost is negligible compared with the value of the data. A 302 redirect is therefore the first choice for short-link services.
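For illustration, a 302 response can be produced with the HTTP server that ships with the JDK (com.sun.net.httpserver). This is a minimal sketch, not how Sina's service is built: the class name RedirectServer, the in-memory map of links, and the 404 for unknown codes are assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Map;

final class RedirectServer {
    private RedirectServer() {}

    /** Starts a local HTTP server that answers /{code} with a 302 to the mapped URL. */
    static HttpServer start(int port, Map<String, String> links) throws IOException {
        // Port 0 asks the operating system for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/", exchange -> {
            String code = exchange.getRequestURI().getPath().substring(1); // strip leading "/"
            String target = links.get(code);
            if (target == null) {
                exchange.sendResponseHeaders(404, -1); // -1 means no response body
            } else {
                // A click counter would go here: with 302 every click reaches the server.
                exchange.getResponseHeaders().set("Location", target);
                exchange.sendResponseHeaders(302, -1);
            }
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

After starting the server with a map such as abc to https://example.com/long, a request for /abc made with redirects disabled returns status 302 and a Location header holding the long URL. Switching the status to 301 would let browsers cache the redirect and skip the server on later clicks.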

Summary

A short-link service is actually fairly simple and has little business logic. It mainly tests your understanding of common distributed-system design, and it comes up often in interviews. This article only offers some design ideas. The ID issuer (distributed IDs), Bloom filter, MurmurHash, and other topics mentioned here are not covered in depth, because none of them can be explained in a few words; you will need to explore them on your own.

Personal homepage: https://www.howardliu.cn
Personal blog post: How to design a short-link service (system design series)
CSDN Homepage: http://blog.csdn.net/liuxinghao
CSDN blog post: How to design a short-link service (system design series)

WeChat official account: Watching the Mountain Lodge


Origin blog.csdn.net/conansix/article/details/107754046