How to design a web crawler

As a programmer who has worked on crawlers for a long time, designing a web crawler is familiar territory. Here are some ideas about web crawler design; take a look.

Step 1: Briefly describe use cases and constraints

Gather the requirements and scope the problem. Keep asking questions to clarify use cases and constraints, and discuss assumptions.

Since there is no interviewer here to state the problem explicitly, we will define some use cases and constraints ourselves.

Example

We limit the problem to only address the following use cases.

Service

Crawls a list of links:
Generates an inverted index of the web pages containing each search term
Generates page titles and snippets
Titles and snippets are static; they do not change based on the search term

User

The user enters a search term and sees a list of relevant search results; each item in the list includes the page title and snippet generated by the web crawler.
For this use case, we only sketch the components and their interactions without discussing the details.

Service
Has high availability

Out of scope

Search Analytics
Personalized Search Results
Page Rank

Constraints and assumptions

State assumptions:

Search traffic is unevenly distributed
Some search terms are very popular, while others are rarely searched
Only anonymous users are supported
Users should see search results quickly
The web crawler should not get stuck in an infinite loop
The crawler gets stuck in an infinite loop when the crawl graph contains a cycle
1 billion links to crawl
Pages need to be recrawled regularly to ensure freshness
Recrawls happen about once a week on average; the more popular the site, the more frequently it is recrawled
4 billion links crawled per month
Average storage size per page: 500 KB
For simplicity, recrawled pages are counted as new pages
100 billion searches per month
Exercise designing a more traditional system - don't use off-the-shelf systems such as Solr or Nutch

Calculate usage

Clarify with your interviewer whether you are expected to run rough usage calculations.

2 PB of page content stored per month
4 billion pages crawled per month at 500 KB per page
72 PB of page content stored over 3 years
1,600 write requests per second
40,000 search requests per second

Easy conversion guide:

2.5 million seconds in a month
1 request per second is 2.5 million requests per month
40 requests per second is 100 million requests per month
400 requests per second is 1 billion requests per month
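
As a sanity check, the numbers above can be reproduced with simple arithmetic. Here is a minimal Python sketch; the constants come directly from the stated assumptions, everything else is derived:

# Back-of-the-envelope check of the usage numbers above.
SECONDS_PER_MONTH = 2.5 * 10**6           # ~2.5 million seconds in a month
PAGES_PER_MONTH = 4 * 10**9               # 4 billion pages crawled per month
PAGE_SIZE_KB = 500                        # average page size
SEARCHES_PER_MONTH = 100 * 10**9          # 100 billion searches per month

storage_per_month_pb = PAGES_PER_MONTH * PAGE_SIZE_KB / 10**12   # KB -> PB
storage_3_years_pb = storage_per_month_pb * 36
writes_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
searches_per_second = SEARCHES_PER_MONTH / SECONDS_PER_MONTH

print(f"{storage_per_month_pb:.0f} PB stored per month")        # ~2 PB
print(f"{storage_3_years_pb:.0f} PB stored over 3 years")       # ~72 PB
print(f"{writes_per_second:.0f} write requests per second")     # ~1,600
print(f"{searches_per_second:.0f} search requests per second")  # ~40,000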

Step 2: Outline Design

List all important components to plan a high-level design.

[High-level design diagram]

Step 3: Design Core Components

Detailed and in-depth analysis of each core component.

Use case: Crawler service crawls a series of web pages

Suppose we have an initial list links_to_crawl (links to crawl), initially ranked by overall site popularity. If this assumption is unreasonable, we can seed the crawler with well-known portal sites such as Yahoo and DMOZ and expand outward from there.

We will use the table crawled_links (crawled links) to record the links that have been processed and the corresponding page signatures.

We can store links_to_crawl and crawled_links in a key-value NoSQL database. For the ranked links in links_to_crawl, we can use Redis sorted sets to maintain the ranking of page links. We should discuss the use cases and the trade-offs between choosing SQL or NoSQL.
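
For illustration, here is a minimal sketch of keeping links_to_crawl in a Redis sorted set with the redis-py client. The key name, scoring scheme, and function names are assumptions for this example, not part of the design above:

import redis

# Assumed local Redis instance; 'links_to_crawl' is a hypothetical key name.
r = redis.Redis(host='localhost', port=6379)

def add_link_to_crawl(url, priority):
    """Add a link and its priority score to the `links_to_crawl` sorted set."""
    r.zadd('links_to_crawl', {url: priority})

def extract_max_priority_link():
    """Return the highest-priority link without removing it, or None if empty."""
    top = r.zrange('links_to_crawl', 0, 0, desc=True)
    return top[0].decode() if top else None

def remove_link_to_crawl(url):
    """Remove a link once it has been crawled."""
    r.zrem('links_to_crawl', url)

def reduce_priority_link_to_crawl(url, amount=1.0):
    """Lower a link's priority so near-duplicates are not re-selected immediately."""
    r.zincrby('links_to_crawl', -amount, url)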

The crawler service loops through each page link according to the following process:

Take the highest-ranked link to crawl

Check crawled_links in the NoSQL database for an entry with a page signature similar to that of the page to be crawled. If a similar page exists:
Reduce the priority of the page link.
Doing so avoids getting stuck in an infinite loop.
Continue (move on to the next iteration).
Otherwise, crawl the link:
Add a task to the inverted index service task queue to generate the inverted index.
Add a task to the document service task queue to generate a static title and snippet.
Generate the page signature.
Remove the link from links_to_crawl in the NoSQL database.
Insert the link and page signature into crawled_links in the NoSQL database.

Ask the interviewer how much code you need to write.

PagesDataStore is an abstract class in the crawler service, which uses a NoSQL database for storage.

class PagesDataStore(object):

    def __init__(self, db):
        self.db = db
        ...

    def add_link_to_crawl(self, url):
        """Add the given link to `links_to_crawl`."""
        ...

    def remove_link_to_crawl(self, url):
        """Remove the given link from `links_to_crawl`."""
        ...

    def reduce_priority_link_to_crawl(self, url):
        """Reduce the priority of a link in `links_to_crawl` to avoid infinite loops."""
        ...

    def extract_max_priority_page(self):
        """Return the highest-priority link in `links_to_crawl`."""
        ...

    def insert_crawled_link(self, url, signature):
        """Add the given link and its signature to `crawled_links`."""
        ...

    def crawled_similar(self, signature):
        """Determine whether this signature is similar to that of an already crawled page."""
        ...

Page is an abstraction in the crawler service that encapsulates a web page, consisting of the page URL, page contents, child URLs, and page signature.

class Page(object):

    def __init__(self, url, contents, child_urls, signature):
        self.url = url
        self.contents = contents
        self.child_urls = child_urls
        self.signature = signature

Crawler is the main class of the crawler service, built on top of Page and PagesDataStore.

class Crawler(object):

    def __init__(self, data_store, reverse_index_queue, doc_index_queue):
        self.data_store = data_store
        self.reverse_index_queue = reverse_index_queue
        self.doc_index_queue = doc_index_queue

    def create_signature(self, page):
        """Generate a signature based on the page URL and contents."""
        ...

    def crawl_page(self, page):
        # Queue every child link for future crawling
        for url in page.child_urls:
            self.data_store.add_link_to_crawl(url)
        page.signature = self.create_signature(page)
        # Mark this link as processed: move it from links_to_crawl to crawled_links
        self.data_store.remove_link_to_crawl(page.url)
        self.data_store.insert_crawled_link(page.url, page.signature)

    def crawl(self):
        while True:
            page = self.data_store.extract_max_priority_page()
            if page is None:
                break
            # Skip near-duplicate pages and lower their priority to avoid cycles
            if self.data_store.crawled_similar(page.signature):
                self.data_store.reduce_priority_link_to_crawl(page.url)
            else:
                self.crawl_page(page)

Handle Duplicate Content

We need to make sure the web crawler doesn't get stuck in an infinite loop, which usually happens when the crawl graph contains a cycle.

Ask the interviewer how much code you need to write.

Remove duplicate links:

Assuming the amount of data is small, we can use something like sort | uniq (i.e., sort the list, then remove duplicates).
Assuming there are 1 billion links to process, we can use MapReduce to output only the entries that appear exactly once.

from mrjob.job import MRJob


class RemoveDuplicateUrls(MRJob):

    def mapper(self, _, line):
        yield line, 1

    def reducer(self, key, values):
        total = sum(values)
        if total == 1:
            yield key, total

Detecting duplicate content is more complex than removing duplicate links. We can generate a signature based on the contents of a page and then compare the similarity of two signatures. Algorithms that may be useful here include the Jaccard index and cosine similarity.
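
As an illustration, here is a minimal sketch of signature comparison using word shingles and the Jaccard index. The shingle size and similarity threshold are arbitrary choices for this example:

def shingles(text, k=3):
    """Break a page's text into a set of k-word shingles (its signature)."""
    words = text.lower().split()
    return {' '.join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(sig_a, sig_b):
    """Jaccard index: |A n B| / |A u B|, between 0.0 and 1.0."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)

def crawled_similar(signature, crawled_signatures, threshold=0.9):
    """Return True if the new page looks like a near-duplicate of a crawled page."""
    return any(jaccard(signature, s) >= threshold for s in crawled_signatures)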

Crawl result update strategy

Pages should be recrawled periodically to ensure freshness. Each crawl result should have a timestamp field recording when the page was last crawled. Every so often, say once a week, all pages should be refreshed. For popular websites or sites whose content changes frequently, the crawl interval can be shortened.

Although we won't go into the details of analyzing the crawled data, we could do some data mining to determine the average time between updates for a page, and use that statistic to decide how often to recrawl it.

Of course, we should also respect the crawl limits in the robots.txt provided by webmasters.
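
Python's standard library ships a robots.txt parser. Below is a minimal sketch of checking permission and the crawl delay before fetching; the user agent string is just an example, and in practice the parser would be cached per host rather than re-fetched for every URL:

from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = 'ExampleCrawler'  # hypothetical user agent name

def can_fetch(url):
    """Check robots.txt for the URL's host before crawling it."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f'{parsed.scheme}://{parsed.netloc}/robots.txt')
    rp.read()
    delay = rp.crawl_delay(USER_AGENT)  # None if no Crawl-delay directive
    return rp.can_fetch(USER_AGENT, url), delay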

Use case: the user enters a search term and sees a list of relevant search results, each including the page title and snippet generated by the web crawler

1. The client sends a request to the web server running the reverse proxy
2. The web server sends the request to the Query API server
3. The Query API server does the following (see the sketch after this list):

Parse the query parameters
Remove HTML markup
Split the text into terms
Fix typos
Normalize capitalization
Convert the search query into boolean operations

Use the inverted index service to find documents matching the query

The inverted index service ranks the matching results and returns the top-ranked ones

Use the document service to return page titles and snippets
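
Here is a minimal sketch of how the Query API might normalize a query before hitting the inverted index service. The service objects and their methods are placeholders for this example, not a real API:

import re

def normalize_query(query):
    """Clean up the raw search query: strip markup, fix case, split into terms."""
    text = re.sub(r'<[^>]+>', '', query)       # remove HTML markup
    text = text.lower()                        # normalize capitalization
    return re.findall(r'[a-z0-9]+', text)      # split into terms

def search(query, inverted_index_service, document_service):
    """Query pipeline: normalize, look up matching docs, attach titles/snippets."""
    terms = normalize_query(query)
    boolean_query = ' AND '.join(terms)        # naive conversion to boolean operations
    doc_ids = inverted_index_service.search(boolean_query)   # ranked matches
    return [document_service.get_title_and_snippet(doc_id) for doc_id in doc_ids]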

We communicate with the client using a REST API:

$ curl https://search.com/api/v1/search?query=hello+world

Response content:

{
    "title": "foo's title",
    "snippet": "foo's snippet",
    "link": "https://foo.com",
},
{
    "title": "bar's title",
    "snippet": "bar's snippet",
    "link": "https://bar.com",
},
{
    "title": "baz's title",
    "snippet": "baz's snippet",
    "link": "https://baz.com",
},

For internal server-to-server communication, we can use Remote Procedure Calls (RPC).

Step 4: Scale the Design

Based on constraints, find and resolve bottlenecks.

[Scaled design diagram]

IMPORTANT NOTE: Do not jump straight from initial design to final design!

Now you should: 1) benchmark and load test; 2) analyze and describe performance bottlenecks; 3) evaluate alternatives and weigh trade-offs while addressing the bottlenecks; 4) repeat. Read Designing a System and Scaling It to Serve Millions of AWS Users to learn how to gradually scale up an initial design.

It is important to discuss possible bottlenecks encountered in the initial design and related solutions. For example, would adding a load balancer with multiple web servers solve the problem? What about CDNs? What about master-slave replication? What are their respective alternatives and trade-offs?

We'll introduce some components to complete the design and address scaling. Internal load balancers are not shown, to reduce clutter.

To avoid duplication of discussion, refer to the relevant section of the System Design Topic Index for key points, trade-offs, and alternatives.

DNS
Load Balancer
Horizontal Scaling
Web Server (Reverse Proxy)
API Server (Application Layer)
Caching
NoSQL
Consistency Patterns
Availability Patterns

Some search terms are very popular, while others are rarely searched. Popular search terms can be cached in memory, for example with Redis or Memcached, to shorten response times and avoid overloading the inverted index service and document service. An in-memory cache is also useful for handling unevenly distributed traffic and short-term traffic spikes. Reading 1 MB of sequential data from memory takes about 250 microseconds; reading the same amount from an SSD takes about 4 times as long, and from a mechanical disk more than 80 times as long.
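
A minimal sketch of a read-through cache for popular queries, assuming Redis via the redis-py client; the key naming, TTL, and the search_fn callable are illustrative assumptions:

import json
import redis

cache = redis.Redis(host='localhost', port=6379)
CACHE_TTL_SECONDS = 60  # a short TTL keeps hot results reasonably fresh

def cached_search(query, search_fn):
    """Return cached results for a query, falling back to the search service."""
    key = f'search:{query.lower().strip()}'
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                 # cache hit: skip the backend services
    results = search_fn(query)                 # cache miss: hit the query/index services
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(results))
    return results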

Here are other suggestions for optimizing your crawler service:

To handle the data size and network request load, the inverted index service and document service will likely need to make heavy use of sharding and replication.
DNS lookups can become a bottleneck; the crawler service can keep its own DNS lookup cache that is refreshed periodically.
Connection pooling, i.e. keeping multiple network connections open at the same time, can improve crawler performance and reduce memory usage (see the sketch after this list).
Switching to the UDP protocol where possible can also improve performance.
Web crawling is bandwidth-bound; make sure there is enough bandwidth to sustain high throughput.
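
For the connection pooling suggestion above, here is a minimal sketch using requests.Session, which reuses open connections per host; the pool sizes and timeout are illustrative:

import requests
from requests.adapters import HTTPAdapter

def make_session(pool_size=100):
    """Create a session whose connection pool is shared by all fetches."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

session = make_session()

def fetch(url, timeout=5):
    """Fetch a page over a pooled connection; returns None on failure."""
    try:
        return session.get(url, timeout=timeout).text
    except requests.RequestException:
        return None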

Additional topics

Whether to delve into these additional topics depends on the scope of your problem and the time you have left.


Source: blog.csdn.net/weixin_44617651/article/details/129684305