HTML basics to know before writing a Python crawler

Web page source code

Open a web page and press the shortcut [Ctrl + U] to view its source code.

HTML
HTML describes the structure of a web page; it is the skeleton of the entire site. Anything enclosed in the "<" and ">" symbols is an HTML tag, and most tags come in pairs: an opening tag and a closing tag.

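Common tags are as follows:

<html>..</html> marks everything between the tags as the web page
<body>..</body> contains the content visible to the user
<div>..</div> defines a block container (a section of the layout)
<p>..</p> defines a paragraph
<li>..</li> defines a list item
<img> defines an image (it has no closing tag)
<h1>..</h1> defines a heading
<a href="">..</a> defines a hyperlink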
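A crawler sees these tags as a stream of opening tags, text, and closing tags. As a minimal sketch, the standard-library `html.parser` module can list the opening tags it meets in a snippet. The `TagCollector` class and the `sample` string are made up for illustration and are not part of the original post.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag such as <p> or <a href="...">
        self.tags.append(tag)


# Illustrative snippet (an assumption, not taken from a real site)
sample = (
    "<html><body><div><h1>Title</h1><p>Text</p>"
    "<a href='1.html'>link</a></div></body></html>"
)

collector = TagCollector()
collector.feed(sample)
collector.close()
print(collector.tags)  # ['html', 'body', 'div', 'h1', 'p', 'a']
```

The tag names come out in the order in which they open, which mirrors the nesting of the page.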

HTML example
The address of a local hyperlink can be a relative path or an absolute path.
The address of an image can also be a relative path or an absolute path.

    <html>
    <head>
          <title>This is the title of the HTML test page</title>
    </head>
    <body>
          <div>
              <h1>This is a heading</h1>
              <p>This is body text</p>
          </div>
          <div>
              <ul>
                  <li>This is a list item</li>
                  <li><a href='https://www.dytt8.net/index0.html'>This is a web hyperlink</a></li>
                  <li><a href='1.html'>This is a local hyperlink</a></li>
                  <li>Below is an image</li>
                  <li><img src="20120830173930_PBfJE.jpeg" alt="This text is shown if the image cannot be displayed" /></li>
              </ul>
          </div>
    </body>
    </html>

Type the code into Notepad and save it, then change the file name and extension to "HTML.html". Opening the file in a browser shows the rendered page.
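When a crawler reads a page like this one, relative addresses such as `1.html` have to be resolved against the address of the page itself before they can be fetched. The sketch below uses the standard-library `html.parser` and `urllib.parse.urljoin` to do that. The `LinkCollector` class and the base address `http://example.com/pages/HTML.html` are assumptions made for illustration; the original post does not say where the page is hosted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect hyperlink and image addresses as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            self.urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.urls.append(urljoin(self.base_url, attrs["src"]))


# Hypothetical location of the saved page (an assumption for this sketch)
base_url = "http://example.com/pages/HTML.html"

page = """
<ul>
  <li><a href='https://www.dytt8.net/index0.html'>web hyperlink</a></li>
  <li><a href='1.html'>local hyperlink</a></li>
  <li><img src="20120830173930_PBfJE.jpeg" alt="image" /></li>
</ul>
"""

links = LinkCollector(base_url)
links.feed(page)
links.close()
for url in links.urls:
    print(url)
```

The absolute hyperlink is kept as written, while `1.html` and the image file name are resolved relative to the directory of the base address.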

The legality of crawlers

Most sites publish a file called robots.txt, although some sites do not set one. If a site has no robots.txt, any data that is not password-protected or encrypted can be fetched by a web crawler; in other words, all of the site's page data can be crawled. If the site does have a robots.txt file, check it to see which data visitors are prohibited from fetching.

A robots.txt file allows crawlers to access certain paths; every path that is not allowed is completely off limits to crawling.
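A crawler can check these rules in code before it requests a page. The following is a minimal sketch using the standard-library `urllib.robotparser`. The rules, the user-agent name `MyCrawler`, and the `example.com` addresses are invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (an assumption): one path is allowed, everything else is banned
rules = """\
User-agent: *
Allow: /public/
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly, no network request needed

allowed = parser.can_fetch("MyCrawler", "http://example.com/public/page.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed)  # True
print(blocked)  # False
```

For a live site, `set_url()` followed by `read()` would download the file instead of `parse()`; that variant is left out here so the sketch runs without network access.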

Origin blog.51cto.com/13689359/2456585