[CTF knowledge base] robots.txt

  • robots.txt (always lowercase) is an ASCII-encoded text file stored in the root directory of a website. It usually tells search engine robots (also known as web spiders) which content on the site they should not fetch and which content they may fetch.
  • Because URLs are case-sensitive on some systems, the file name robots.txt should always be lowercase, and the file should be placed in the root directory of the website. To define separate behavior for search engine robots when they access subdirectories, merge the custom settings into the robots.txt in the root directory, or use robots metadata (the robots META tag).
  • robots.txt is a convention that crawlers follow voluntarily, not an enforced access control, so it cannot guarantee the privacy of a website. Note that robots.txt decides whether a URL may be fetched by string comparison, so a directory path with a trailing slash "/" and the same path without it are treated as different URLs.
  • robots.txt rules may contain wildcards such as "Disallow: *.gif", although wildcard support is an extension and varies between crawlers.
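
The trailing-slash behavior described above can be checked with Python's standard urllib.robotparser module. This is only a minimal sketch: the rule set, the agent name AnyBot, and the helper is_allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set used only for this illustration.
RULES = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

# "AnyBot" is a placeholder crawler name.
print(is_allowed(RULES, "AnyBot", "/private/notes.txt"))  # False: path starts with /private/
print(is_allowed(RULES, "AnyBot", "/private"))            # True: no trailing slash, so no prefix match
print(is_allowed(RULES, "AnyBot", "/public/index.html"))  # True: no rule applies
```

Because the comparison is a plain prefix match, "/private" and "/private/" really are treated as two different paths here.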

Allow all bots

User-agent: *
Disallow:

# Or, alternatively
User-agent: *
Allow: /

Only allow specific bots

# Replace name_spider with the crawler's real name
User-agent: name_spider
Allow: /

User-agent: *
Disallow: /
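
A rule group for one crawler only restricts the others if a catch-all group blocks them. The sketch below assumes such a catch-all "User-agent: *" group and checks the result with Python's urllib.robotparser; OtherBot and the helper allowed_for are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# name_spider is the placeholder from the example above; the catch-all
# group is an assumption added so that every other bot is blocked.
RULES = """\
User-agent: name_spider
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed_for(agent: str, path: str = "/index.html") -> bool:
    """Return True if `agent` may fetch `path` under RULES."""
    return parser.can_fetch(agent, path)

print(allowed_for("name_spider"))  # True: matched by its own group
print(allowed_for("OtherBot"))     # False: falls through to the catch-all group
```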

Block all bots

User-agent: *
Disallow: /

Block all bots from accessing specific directories

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
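
Since robots.txt is a public file, the paths it lists can be read by anyone, which is why it is worth checking in a CTF. As a rough sketch, the hypothetical helper disallowed_paths below collects every non-empty Disallow value from the text of a robots.txt file.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Collect the non-empty Disallow values from robots.txt text."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # Field names are compared case-insensitively.
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths

SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

print(disallowed_paths(SAMPLE))  # ['/cgi-bin/', '/images/', '/tmp/', '/private/']
```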

Block only a bad crawler from accessing a specific directory

# Replace BadBot with the crawler's real name
User-agent: BadBot
Disallow: /private/

Block all bots from accessing specific file types

User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
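
In these rules "*" stands for any sequence of characters and a trailing "$" anchors the match at the end of the URL. Python's urllib.robotparser does not interpret these wildcards, so the sketch below translates a rule into a regular expression instead. The helper name rule_matches is hypothetical, and real crawlers may differ in details.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt rule using '*' and a trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"  # the rule must match up to the end of the path
    # re.match anchors at the start, so unanchored rules act as prefix matches.
    return re.match(regex, path) is not None

print(rule_matches("/*.php$", "/index.php"))      # True
print(rule_matches("/*.php$", "/index.php?x=1"))  # False: does not end in .php
print(rule_matches("/private/", "/private/a"))    # True: plain prefix match
```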

Alternatives

  • Although robots.txt is the most widely accepted method, it can also be used together with the robots META tag.
  • The robots META tag mainly applies settings to an individual page. Like other META tags (such as the language used, the page description, keywords, etc.), the robots META tag is placed in the HEAD of the page and tells search engine robots how to crawl the content of that page.
<head>
	<meta name="robots" content="noindex,nofollow" />
</head>
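
To see which directives a page declares, the robots META tag can be read with Python's standard html.parser module. This is a minimal sketch; the class RobotsMetaParser and the function robots_meta_directives are hypothetical names.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.append(directive)

def robots_meta_directives(html: str) -> list:
    """Return the robots META directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

PAGE = '<head>\n\t<meta name="robots" content="noindex,nofollow" />\n</head>'
print(robots_meta_directives(PAGE))  # ['noindex', 'nofollow']
```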

[Note] Reference link for this article: https://zh.wikipedia.org/wiki/Robots.txt

Origin blog.csdn.net/weixin_45489658/article/details/131187053