A summary of basic web-crawler knowledge (HTML file basics plus four commonly used libraries; super detailed long post warning)

A summary of basic web-crawler knowledge (covering HTML file basics and the Selenium, Requests, BeautifulSoup and Scrapy libraries)

10,000+ word long-post warning!!! Detailed and rich knowledge points.

(ps: As a crawler novice of half a year, I found the answers to more than 90% of my questions on CSDN while learning to crawl; CSDN has really been a lifesaver, hahaha. Now that I have mastered some crawler basics, I want to share them with students who are just getting started, so that everyone can avoid some detours~)

This summary covers five topics: HTML file basics and the Selenium, Requests, BeautifulSoup and Scrapy libraries. Each topic is presented under its own top-level heading, and each heading was originally followed by a mind-map of the ideas.

1. Basics of HTML files

(Mind map: HTML file basics)
Hypertext Markup Language (abbreviated HTML, full name HyperText Markup Language) is by far the most widely used language on the Internet and the main language that web documents are written in. HTML text is descriptive text composed of HTML markup, which can describe text, graphics, animation, sound, tables, links and so on. The structure of HTML includes two parts: the head and the body. The head describes information required by the browser, while the body contains the actual content to be displayed.

HTML elements are marked up with tags, which come in two standard forms: single tags and double tags. A double tag is a matching pair, an opening tag <tag> and a closing tag </tag>, while a single tag stands alone and is usually written in the self-closing form <tag />. Tag names are conventionally written in lowercase. Tags can carry attributes, which describe the characteristics of the tag.
For example:

<p>This is a paragraph</p>      <img src="path of the image goes here" />

1. Basic structure

The basic structure of an HTML document has three parts; in addition to the head and body mentioned above, there is also the root tag. First, html is the starting tag of the web page, also called the root tag, and all other tags sit inside it. Second, the head tag defines the document header and is the container for all head elements, such as title, script, style, link and meta. Third, the content between the body tags is the main content of the web page, written with tags such as h1, p and a; it is this content that the browser displays.

Here we focus on the head tags of a web page, because the head specifies all kinds of basic information about the page, and when crawling we often need to inspect and modify what is in the head. The HTML head element starts with <head> and ends with </head>. It contains information about the current document and can include elements such as title and meta, which define the page title and the page's metadata respectively; the head element separates this basic information from the main content of the page. The page title element, title, generally explains the purpose of the page and is shown in the browser's title bar. In an HTML document the title is set in the head of the page, that is, between the head tags. The page information element, meta, is generally used to define additional information about the page, such as its author, copyright and keywords. The meta tag carries attributes such as name, content and http-equiv: the name attribute gives the name of a piece of additional information and the content attribute gives its value. The http-equiv attribute is similar to the name attribute but names an HTTP header equivalent; before the browser loads the page, the server sends the information defined by http-equiv to the browser so that the page can be displayed correctly.

2. Hyperlink tag a

The a tag defines a hyperlink, used to link from one page to another. It is precisely because of hyperlinks that HTML can be called a hypertext markup language. The most important attribute of the a element is href, which gives the target of the link, i.e. the URL of the page; it can link to https and http addresses, custom pages, objects such as pictures, and # anchor links. The target attribute specifies where to open the linked document and has five options: _blank opens the linked document in a new window; _self (the default) opens it in the same frame; _parent opens it in the parent frameset; _top opens it in the whole window; and framename opens it in the named frame. The name attribute specifies the name of an anchor. When writing a crawler we often need to read the URL in an a tag's href attribute, or use it to build a list of URLs to crawl.
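As a quick illustration of how a crawler uses the href attribute, here is a minimal sketch (it uses the requests and BeautifulSoup libraries covered later in this post, and the example.com URL is just a placeholder) that collects every link on a page into a crawl list:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=30).text
soup = BeautifulSoup(html, "html.parser")
# collect the href attribute of every <a> tag into a list of URLs to crawl
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)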

3. CSS selectors

Cascading Style Sheets (CSS) is a computer language used to describe the presentation of documents written in HTML (an application of the Standard Generalized Markup Language) or XML (a subset of the Standard Generalized Markup Language). CSS can not only style web pages statically but also, together with scripting languages, format page elements dynamically. CSS gives pixel-level control over the layout of elements on the page and can edit the styles of page objects and models.

There are three ways to apply CSS to the layout of page elements: inline styles, internal style sheets and external style sheets. External style sheets are currently the most popular and efficient, but in terms of priority, inline styles override internal style sheets, which in turn override external style sheets; when analysing the structure of a page we can use this to work out which rules actually take effect. CSS is powerful: it can set properties such as font, background, border, margin, padding and text-align for elements. However, when the same tag appears repeatedly, the tag name alone cannot identify a specific element, so CSS uses selectors to locate elements.

①ID selector: the highest priority; the same id must not appear more than once on a page. Names must start with a letter or underscore and cannot start with a digit. Add the attribute id="n" to the tag to be styled.

②Class selector: priority second only to the id selector. The same class name may appear more than once on a page. Names cannot start with a digit; add class="n" to the tag to be styled.

③Tag name selector: styles all tags with the same name at the same time.

④Group selector: lists several ids or class names together, separated by English commas, so that they share the same rules.

⑤Descendant selector, also called the inclusion selector: selects the descendants of an element or group of elements. Write the outer tag first and the inner tag after it, separated by a space. When tags are nested, the inner tag becomes a descendant of the outer tag.

⑥Child selector: selects only elements that are direct children of an element. Write the parent tag first and the child tag after it, connected by a ">" with a space on each side, e.g. div > p{}. It matches children only, not grandchildren or deeper descendants. (Do not confuse this with pseudo-class selectors, which use a colon, as in a:link{}.)

⑦Union selector (CSS selector grouping): formed by joining selectors with commas. Any kind of selector (tag selectors, class selectors, id selectors and so on) can be part of a union selector. If several selectors have exactly or partly the same style rules, a union selector lets you define that shared CSS for all of them at once.
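The selector types above can also be exercised from Python; here is a small sketch (using the BeautifulSoup library covered later in this post, on a made-up HTML snippet) showing the same selector syntax matching elements:

from bs4 import BeautifulSoup

html = """
<div id="content">
  <p class="intro">first</p>
  <p class="intro">second</p>
  <span>nested <a href="#">link</a></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select("#content"))    # id selector
print(soup.select(".intro"))      # class selector
print(soup.select("p"))           # tag name selector
print(soup.select("p, span"))     # group / union selector
print(soup.select("div span a"))  # descendant selector
print(soup.select("div > p"))     # child selector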

4. XPath selector

XPath is the XML Path Language, a language for locating a particular part of an XML (a subset of the Standard Generalized Markup Language) document. XPath is based on the tree structure of XML, which has different kinds of nodes, including element nodes, attribute nodes and text nodes, and it provides the ability to find nodes within that tree.

XPath uses path expressions to select nodes or sets of nodes in an XML document. A path expression is a sequence of steps, separated by "/" characters, from one XML node (the current context node) to another node or set of nodes. In XPath there are seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment and the document (root) node. An XML document is treated as a node tree whose root is called the document node or root node. Every element and attribute has a parent; an element node can have zero, one or more children.

Siblings are nodes that share the same parent. Ancestors are a node's parent, the parent's parent, and so on. Descendants are a node's children, the children's children, and so on. With XPath's node axes, its more than one hundred built-in functions and its operators, we can easily locate the content we want to crawl when writing a crawler.
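To show how XPath is used in practice, here is a small sketch based on the lxml library (an assumption; the HTML snippet and the class names are made up) that selects nodes with path expressions:

from lxml import etree

html = """
<html><body>
  <div class="news"><a href="/a.html">First</a></div>
  <div class="news"><a href="/b.html">Second</a></div>
</body></html>
"""
tree = etree.HTML(html)
# the href attribute of every <a> inside a div whose class is "news"
hrefs = tree.xpath('//div[@class="news"]/a/@href')
# the text nodes of those same links
titles = tree.xpath('//div[@class="news"]/a/text()')
print(hrefs, titles)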

5. Commonly used tags in HTML documents

By understanding the common tags of HTML documents, we can analyze the structure of web pages faster and more accurately, so as to write the correct crawler framework and efficient crawling methods.

Tag  Meaning
<h1>…</h1>  Heading (sizes h1 to h6)
<p>…</p>  Paragraph
<ul>…</ul>  Unordered list
<ol>…</ol>  Ordered list
<li>…</li>  List item
<a href="…">…</a>  Hyperlink
<font>  Font settings
<sub>  Subscript
<sup>  Superscript
<br>  Line break
<img src="…" />  Image
<hr>  Horizontal rule
<del>  Strikethrough
<frame>  Frameset windows and frames

2. The Selenium library

(Mind map: Selenium crawling essentials)
Selenium is a tool for testing web applications; supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera and others. Its main functions include testing compatibility with browsers, that is, testing whether your application works well on different browsers and operating systems. Selenium supports recording actions automatically and generating test scripts in languages such as .Net, Java and Perl. It supports multiple operating systems such as Windows, Linux, iOS and Android. To drive a browser, Selenium 3.x needs the browser's WebDriver driver file, and the driver must be configured in the environment variables.

1. Selenium provides the following 8 ways to locate elements

Locate one element  Locate multiple elements  Meaning
find_element_by_id  find_elements_by_id  Locate by element id
find_element_by_name  find_elements_by_name  Locate by element name
find_element_by_xpath  find_elements_by_xpath  Locate by XPath expression
find_element_by_link_text  find_elements_by_link_text  Locate by full link text
find_element_by_partial_link_text  find_elements_by_partial_link_text  Locate by partial link text
find_element_by_tag_name  find_elements_by_tag_name  Locate by tag name
find_element_by_class_name  find_elements_by_class_name  Locate by class name
find_element_by_css_selector  find_elements_by_css_selector  Locate by CSS selector

For example:

Locate by id: dr.find_element_by_id("kw");

Locate by name: dr.find_element_by_name("wd");

Locate by XPath: dr.find_element_by_xpath("/html/body/form/span/input");

Locate by CSS selector: dr.find_element_by_css_selector("[name=wd]"); and so on.
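Putting these locators together, here is a minimal sketch; it assumes a local chromedriver and uses the Selenium 3.x find_element_by_* style shown above (Selenium 4 replaces it with driver.find_element(By.ID, ...)):

from selenium import webdriver

driver = webdriver.Chrome()              # assumes chromedriver is available on the PATH
driver.get("https://www.baidu.com")      # the page whose search box has id "kw" and name "wd"

box = driver.find_element_by_id("kw")                    # locate by id
box = driver.find_element_by_name("wd")                  # locate by name
box = driver.find_element_by_css_selector("[name=wd]")   # locate by CSS selector
box.send_keys("selenium")
driver.quit()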

2. Control the operation of the browser

The webdriver module imported from the selenium library can be used to control the browser. Here are some of the operation methods:

Method  Description
set_window_size()  Set the size of the browser window
back()  Go back in the browser history
forward()  Go forward in the browser history
refresh()  Refresh the current page
clear()  Clear the text of an input element
send_keys(value)  Simulate typing into an element
click()  Click an element
submit()  Submit a form
get_attribute(name)  Get the value of an element attribute
is_displayed()  Return whether the element is visible to the user
size  Return the size of the element
text  Get the text of the element

For example:

Refresh the browser, browser.refresh();

Set the window size of the browser, browser.set_window_size(1400,800);

Click a link by its text, browser.find_element_by_link_text("News").click(); and so on.
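A short sketch of the browser-control methods above (same assumptions as before: a local chromedriver and the Selenium 3.x API; the link text "News" is only illustrative):

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.baidu.com")
browser.set_window_size(1400, 800)                   # resize the browser window
browser.refresh()                                    # reload the current page
browser.find_element_by_link_text("News").click()    # click a link by its visible text
browser.back()                                       # go back to the previous page
browser.quit()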

3. Mouse events

In WebDriver, mouse operations are encapsulated in the ActionChains class. The following are the mouse events available in WebDriver:

Method  Description
ActionChains(driver)  Construct an ActionChains object
context_click()  Simulate a right-click; the element to act on must be located and passed in when calling
move_to_element(above)  Perform a mouse-hover over an element
double_click()  Double-click
drag_and_drop()  Drag and drop
perform()  Execute all actions stored in the ActionChains object; this can be thought of as submitting the whole chain of operations

For example:
Locate the element to hover over,
element = driver.find_element_by_link_text("Settings");

Perform a mouse-hover on the located element, ActionChains(driver).move_to_element(element).perform(); and so on.
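A runnable version of this hover example might look like the following sketch (the URL and the "Settings" link text are placeholders):

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://example.com")                        # placeholder URL

element = driver.find_element_by_link_text("Settings")   # the element to hover over
ActionChains(driver).move_to_element(element).perform()  # mouse-hover on the element
ActionChains(driver).context_click(element).perform()    # right-click the same element
driver.quit()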

4. Simulating keyboard operations

The Keys module in Selenium provides simulated keyboard key presses, used through the send_keys() method. send_keys() can simulate not only typing text but also pressing keys on the keyboard. Below are the commonly used keyboard simulations in Selenium:

Simulated key press  Description
send_keys(Keys.BACK_SPACE)  Backspace
send_keys(Keys.SPACE)  Space
send_keys(Keys.TAB)  Tab
send_keys(Keys.ESCAPE)  Escape (Esc)
send_keys(Keys.ENTER)  Enter
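For instance, a small sketch that combines typed text with simulated key presses (reusing the search-box id "kw" from the earlier examples):

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.baidu.com")

box = driver.find_element_by_id("kw")
box.send_keys("selenium")          # type some text
box.send_keys(Keys.BACK_SPACE)     # delete the last character
box.send_keys(Keys.ENTER)          # press Enter to submit the search
driver.quit()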

5. Assertion properties

When using the selenium library, whether for functional testing or automated testing, the last step is to compare the actual result with the expected result. This comparison is called an assertion. We usually assert on information such as the page title, the URL and element text. Below are the properties used for assertions:

Property  Description
title  Gets the title of the current page
current_url  Gets the URL of the current page
text  Gets the text of a located element (for example a search result entry)

For example:
Get the number of results, user = driver.find_element_by_class_name('nums').text.

The methods for locating a group of elements are similar to those for locating a single element; the only difference is the extra "s" after the word element, indicating the plural.
For example:
Locate a group of elements, elements = driver.find_elements_by_xpath('//div/h3/a'), and then a for loop can iterate over the title of each search result.

6. Handling alert boxes

Handling the alert, confirm and prompt boxes generated by JavaScript is very simple in WebDriver: use switch_to.alert to switch to the alert/confirm/prompt, then operate on it with the text, accept(), dismiss() or send_keys() methods. Below is the alert-handling API in WebDriver:

Method  Description
text  Returns the text of the alert/confirm/prompt
accept()  Accepts the existing alert box
dismiss()  Dismisses the existing alert box
send_keys(keysToSend)  Sends text to the alert box; keysToSend is the text to send
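A sketch of the alert workflow (the page URL is hypothetical; it only assumes the page pops up a JavaScript alert):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/page-with-alert")   # hypothetical page that triggers an alert

alert = driver.switch_to.alert     # switch to the alert/confirm/prompt box
print(alert.text)                  # read its message
alert.accept()                     # click OK (alert.dismiss() would cancel instead)
driver.quit()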

7. Drop-down box operations

When using selenium to crawl a page we sometimes run into drop-down boxes. To operate on a drop-down box, import the Select class and use it to handle the selection. Below are the methods of the Select class:

Method  Description
select_by_value("value")  Selects by the value attribute of the option tag
select_by_index("index")  Selects by the index of the option in the drop-down box
select_by_visible_text("text")  Selects by the visible text of the option
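A sketch of the Select class in use (the page URL, the element id "city" and the option values are all made up):

from selenium import webdriver
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome()
driver.get("https://example.com/form")          # hypothetical page containing a <select> element

dropdown = Select(driver.find_element_by_id("city"))
dropdown.select_by_value("bj")                  # by the value attribute of an option
dropdown.select_by_index(2)                     # by index
dropdown.select_by_visible_text("Beijing")      # by the visible option text
driver.quit()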

8. Cookie operations

When crawling pages we sometimes need to verify whether the cookies in the browser are correct, because tests that rely on real cookies cannot be done through white-box or integration testing. WebDriver provides methods for working with cookies, allowing cookie information to be read, added and deleted. Below are WebDriver's cookie methods:

Method  Description
get_cookies()  Gets all cookie information
get_cookie(name)  Returns the cookie whose name matches the given "name", as a dictionary
add_cookie(cookie_dict)  Adds a cookie; "cookie_dict" is a dictionary object that must contain name and value
delete_cookie(name, optionsString)  Deletes a cookie; "name" is the name of the cookie to delete and "optionsString" gives the cookie's options, currently "path" and "domain"
delete_all_cookies()  Deletes all cookie information
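A sketch of the cookie methods (the URL and the cookie name/value are placeholders):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

print(driver.get_cookies())                                # all cookies of the current session
driver.add_cookie({"name": "token", "value": "123456"})    # add a cookie; name and value are required
print(driver.get_cookie("token"))                          # read back a single cookie by name
driver.delete_cookie("token")                              # delete that cookie
driver.delete_all_cookies()                                # delete every cookie
driver.quit()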

9. Scroll bar control

JavaScript can be used to control the browser's scroll bar, and WebDriver provides the execute_script() method for running JavaScript code. The window.scrollTo() method sets the horizontal and vertical position of the browser window's scroll bar: the first argument is the offset from the left and the second is the offset from the top. The code looks like this: js = "window.scrollTo(100,450);" followed by driver.execute_script(js).

10. Closing browser windows

WebDriver provides the close() method to close the current window; when handling multiple windows we use close() to shut them one at a time, while quit() closes all windows. Below are the window-closing methods:

Method  Description
close()  Closes the current (single) window
quit()  Closes all windows and quits the driver

3. The Requests library

(Mind map: the Requests library)
Requests is an HTTP library written in Python, based on urllib and released under the Apache2 License; it fully meets the requirements of HTTP testing. HTTP, the HyperText Transfer Protocol, is a stateless application-layer protocol based on a "request and response" model. It uses a URL as the identifier for locating a network resource. The URL format is as follows:

http://host[:port][path]
host  the domain name or IP address of the host
port  the port number, 80 by default
path  the path of the requested resource

1. The seven methods of the requests library

Method  Description
requests.request()  Constructs a request; the base method that supports all the methods below
requests.get()  The main method for fetching an HTML page, corresponding to HTTP GET
requests.head()  Gets the header information of an HTML page, corresponding to HTTP HEAD
requests.post()  Submits a POST request to an HTML page, corresponding to HTTP POST
requests.put()  Submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch()  Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete()  Submits a delete request to an HTML page, corresponding to HTTP DELETE
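A few of these methods in action; this sketch assumes network access to httpbin.org, a public echo service used here purely as an example endpoint:

import requests

r = requests.get("https://httpbin.org/get")        # GET: fetch a page
print(r.status_code)

h = requests.head("https://httpbin.org/get")       # HEAD: headers only, no body
print(h.headers)

p = requests.post("https://httpbin.org/post", data={"key": "value"})   # POST: submit form data
print(p.text)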

2. The 13 access-control parameters

Each method takes two or more parameters; some have defaults and need not be set, while others must be set manually. The requests library has 13 access-control parameters in total; their forms and descriptions are listed below:

Parameter  Description
params  Dictionary or byte sequence, added to the URL as query parameters
data  Dictionary, byte sequence or file object, used as the body of the request
json  Data in JSON format, used as the body of the request
headers  Dictionary of custom HTTP headers
cookies  Dictionary or CookieJar, the cookies sent with the request
auth  Tuple, supports HTTP authentication
files  Dictionary, for uploading files
timeout  Sets the timeout, in seconds
proxies  Dictionary, sets proxy servers for the request and can include login credentials
allow_redirects  True/False, default True; switch for following redirects
stream  True/False, default False; switch for streaming the content instead of downloading it immediately
verify  True/False, default True; switch for verifying the SSL certificate
cert  Path to a local SSL client certificate
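A sketch showing several of these parameters on one request (again using httpbin.org as a stand-in URL; the proxy address would be hypothetical and is therefore omitted):

import requests

payload = {"wd": "python"}                 # appended to the URL as ?wd=python
headers = {"User-Agent": "Mozilla/5.0"}    # a custom request header

r = requests.get(
    "https://httpbin.org/get",
    params=payload,
    headers=headers,
    timeout=10,              # seconds
    allow_redirects=True,
)
print(r.url)                 # shows the params merged into the URL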

3. Properties of the Response object

A requests method returns a Response object, which stores the content of the server's response. The Response object has the following properties:

Property  Description
r.status_code  The HTTP status code of the request; 200 means the connection succeeded, 404 or another code means failure
r.text  The string form of the HTTP response body, i.e. the page content corresponding to the URL
r.encoding  The response encoding guessed from the HTTP headers
r.apparent_encoding  The response encoding inferred from the content (a fallback encoding)
r.content  The binary form of the HTTP response body

4. Exceptions in the requests library

When writing a crawler with the requests library we often meet exceptional situations, and the library's exceptions can be used to tell whether the program has run into a problem. Below are the exceptions and their meanings:

Exception  Description
requests.ConnectionError  Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError  HTTP error (an invalid HTTP response)
requests.URLRequired  A URL is missing
requests.TooManyRedirects  The maximum number of redirects was exceeded
requests.ConnectTimeout  Connecting to the remote server timed out
requests.Timeout  The request for the URL timed out
r.raise_for_status()  Raises requests.HTTPError if the status code is not 200
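For example, a sketch that catches these exceptions separately (the URL deliberately returns a 404 so that raise_for_status() triggers requests.HTTPError):

import requests

try:
    r = requests.get("https://httpbin.org/status/404", timeout=5)
    r.raise_for_status()                 # turn a non-200 status code into requests.HTTPError
except requests.ConnectionError:
    print("network problem (DNS failure, refused connection, ...)")
except requests.Timeout:
    print("the request timed out")
except requests.HTTPError as err:
    print("bad HTTP status:", err)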

5. A general framework for crawling

When crawling a page with requests we first fetch the content of the page and only then filter it, so requests.get() is always needed. Below is a general framework for fetching a page with the requests library:

import requests

def getHTMLtext(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise requests.HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding  # use the encoding inferred from the page content
        return r.text
    except requests.RequestException:
        return "an exception occurred"

(url is the link of the web page you want to crawl; printing the return value shows the fetched page content, which still needs to be parsed with the BeautifulSoup library afterwards.)

4. The BeautifulSoup library

(Mind map: the BeautifulSoup library)
The BeautifulSoup library is a third-party library for parsing HTML or XML files. It provides simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that, by parsing the document, gives the user the data to be scraped. Beautiful Soup automatically converts the input document to Unicode and the output document to UTF-8, so you do not need to think about encodings unless the document does not specify one; in that case Beautiful Soup cannot detect the encoding automatically and you merely need to state the original encoding.

1. Supported parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, four in total:

Parser  Usage
Python standard library  BeautifulSoup(markup, "html.parser")
lxml HTML parser  BeautifulSoup(markup, "lxml")
lxml XML parser  BeautifulSoup(markup, "xml")
html5lib  BeautifulSoup(markup, "html5lib")
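Choosing a parser simply means passing its name as the second argument; a minimal sketch (the lxml line assumes the lxml package is installed):

from bs4 import BeautifulSoup

markup = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

soup1 = BeautifulSoup(markup, "html.parser")   # standard-library parser, no extra installation
soup2 = BeautifulSoup(markup, "lxml")          # third-party lxml parser, usually faster
print(soup1.title.string, soup2.p.string)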

2. Basic elements of the BeautifulSoup class

A document parsed by BeautifulSoup generally contains three kinds of nodes: element nodes (usually the HTML or XML tags), text nodes (the text inside a tag) and attribute nodes (the attributes of each tag). The BeautifulSoup library finds one or more tag elements in the document and extracts the text and attributes of each. The basic elements of the BeautifulSoup class are:

Basic element  Description
Tag  A tag, the most basic unit of information, delimited by <> and </>
Name  The name of the tag; the name of <p>…</p> is 'p'; format: tag.name
Attributes  The attributes of the tag, organised as a dictionary; format: tag.attrs
NavigableString  The non-attribute string inside a tag, i.e. the string between <>…</>; format: tag.string
Comment  The comment part of a string inside a tag, a special type

Examples of the basic elements above:
With BeautifulSoup it is easy to get a Tag, e.g. soup.a;
get its name, e.g. soup.a.name;
get its attributes, e.g. soup.p.attrs;
get its NavigableString, e.g. soup.p.string;
a Comment object is a special type of NavigableString, which can be checked with type(soup.a.string).

3. The find and find_all methods

find(name, attrs, recursive, string, **kwargs): returns a single element; the meanings of the parameters are given in the table below.
find_all(name, attrs, recursive, string, **kwargs): searches for tag elements and returns a list containing all elements that match the conditions.

Parameter  Description
name  Search string for the tag name
attrs  Search string for tag attribute values; a specific attribute can be given for the search
recursive  Whether to search all descendants; default True
string  Search string for the string region between <>…</>
**kwargs  Control parameters, as already described above

For example, soup.find_all(id='link1') finds all tags whose id is link1 and returns them as a list.
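A short sketch of find and find_all on a made-up snippet:

from bs4 import BeautifulSoup

html = '<p class="title"><a id="link1" href="/1.html">one</a><a id="link2" href="/2.html">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("a"))                                   # the first <a> tag only
print(soup.find_all("a"))                               # every <a> tag, as a list
print(soup.find_all(id="link1"))                        # keyword-argument search on an attribute
print(soup.find_all("a", attrs={"href": "/2.html"}))    # attribute-dictionary search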

4. lambda expressions

soup.findAll(lambda tag: len(tag.attrs) == 2) gets the elements that have exactly two attributes.

5. Getting other nodes

Getting child and descendant nodes:

Code  Meaning
tag (or soup).children  Iterator over the child tag nodes
tag (or soup).descendants  Iterator over the descendant tag nodes
tag (or soup).contents  List of the child tag nodes

Getting parent and ancestor nodes:

Code  Meaning
tag.parent  The parent node (a tag)
tag.parents  Iterator over all ancestor nodes

Getting sibling nodes:

Code  Meaning
tag.previous_sibling  The previous sibling node (a tag)
tag.next_sibling  The next sibling node (a tag)
tag.previous_siblings  Iterator over all preceding sibling nodes
tag.next_siblings  Iterator over all following sibling nodes

For example, for sibling in soup.a.next_siblings iterates over the parallel (sibling) nodes that follow the a tag.
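A sketch of these navigation attributes on a made-up snippet:

from bs4 import BeautifulSoup

html = "<div><p>first</p><p>second</p><p>third</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p
print(list(first_p.children))        # the children of the first <p> (here just its text node)
print(first_p.parent.name)           # the parent tag: div
print(first_p.next_sibling)          # the next sibling: <p>second</p>
for sib in first_p.next_siblings:    # iterate over all following siblings
    print(sib)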

6. CSS selectors

We can also use CSS selectors to filter elements, through the soup.select() method, which returns a list. There are generally five ways to search:

Search method  Example
By tag name  soup.select('title')
By class name  soup.select('.sister')
By id  soup.select('#link1')
Combined search  soup.select('p #link1') (finds the content with id link1 inside p tags)
By attribute  soup.select('a[href="http://example.com/elsie"]')

The select method always returns its results as a list, which we can iterate over, and the get_text() method can then be used to get the content of each element.

7. Differences between the standard find and select methods

1. The find method returns a single element and find_all returns a list of elements, while select always returns a list. If select is used and only a single element is found, you must first index the list with [0] before you can call get_text() to get the text.

2. The find method also supports searching by a method (callable) argument, which makes it more powerful than select.

5. The Scrapy library

(Mind map: the Scrapy library)
Scrapy is a fast, high-level screen-scraping and web-crawling framework developed in Python, used to crawl web sites and extract structured data from their pages. Scrapy is versatile and can be used for data mining, monitoring and automated testing.

1. Overall architecture

Scrapy uses the Twisted asynchronous networking library to handle network communication and consists of the following components:

1. Engine (Scrapy): handles the data flow of the whole system and triggers events (the core of the framework).

2. Scheduler: accepts requests sent by the engine, pushes them into a queue and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the addresses, or links, of the pages to crawl); it decides what the next URL to crawl is and removes duplicate URLs.

3. Downloader: downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).

4. Spiders: do the actual crawling, extracting the information they need, the so-called items, from specific pages. They can also extract links from pages so that Scrapy keeps crawling the next page.

5. Item Pipeline: responsible for processing the items extracted by the spiders; its main tasks are persisting items, validating them and cleaning out unwanted information. After a page has been parsed by a spider, the items are sent to the pipeline and processed through several specific steps in order.

6. Downloader Middlewares: a framework sitting between the Scrapy engine and the downloader, mainly handling the requests and responses passed between them.

7. Spider Middlewares: a framework sitting between the Scrapy engine and the spiders, mainly handling the spiders' response input and request output.

8. Scheduler Middlewares: middleware between the Scrapy engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.

When we use the Scrapy framework, many parts are already built automatically. Normally we only need to write the content of the spiders, which describe how the information is crawled, and of the item pipelines, which apply suitable operations to the collected data, such as cleaning, validation and de-duplication. After these two main parts are written, the other parts only need small adjustments for the actual site being crawled.

2. Scrapy's running flow

1. The engine takes a link (URL) from the scheduler for the next crawl.

2. The engine wraps the URL in a Request and passes it to the downloader.

3. The downloader downloads the resource and wraps it in a Response.

4. The Response is passed back through the engine to the spider, which parses it.

5. If the spider parses out an item, the item is handed to the item pipeline for further processing.

6. If the spider parses out a link (URL), the URL is handed to the scheduler to wait to be crawled.
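To make this flow concrete, here is a minimal spider sketch; the names are hypothetical and quotes.toscrape.com is a public practice site. The spider yields items (sent on to the item pipeline) and follow-up requests (sent back to the scheduler):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item (a plain dict) per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the "next page" link, if there is one, and let Scrapy schedule it
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)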

3. Common Scrapy commands

Command  Meaning
scrapy --help  Show Scrapy's basic commands
scrapy version -v  Show the Scrapy version and information about its components
scrapy startproject xx  Create a crawler project
scrapy genspider name site.com  Generate a new spider inside the project directory
scrapy parse url  Fetch a page and parse it with the project's parse function
scrapy runspider xx.py  Run a single spider file (include the .py extension)
scrapy crawl name  Run a spider inside the project (without the file extension)
scrapy bench  Run a quick benchmark (also handy for checking that Scrapy is installed correctly)
scrapy list  List the spiders in the project

4. Description of the files in a Scrapy project

scrapy.cfg: the project's configuration information, mainly providing basic configuration for the Scrapy command-line tool (the settings that actually affect the crawler are in settings.py).

items.py: defines the data model (storage template) for the structured data, similar to a Django Model.

pipelines.py: the data-processing behaviour, e.g. persisting the structured data.

settings.py: the configuration file, e.g. recursion depth, concurrency, download delay, and so on.

spiders/: the spider directory, where the spider files and crawling rules are written.

5. The yield keyword, often used with the Scrapy framework

yield is a keyword similar to return, except that the function returns a generator, which is itself an iterable object. When such a function is called, the code inside it is not executed immediately; a generator object is returned instead. It can be iterated with a for loop: on the first iteration the function runs from the beginning up to the yield statement, and the value after yield becomes the result of that iteration. Each subsequent iteration resumes the function where it left off, runs the next pass of its internal loop and returns the next value, until there are no more values to return.
For example:

def gen(n):
    for i in range(n):
        yield i ** 2

Using the yield keyword helps save memory and other resources and improves the crawler's running efficiency.
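To make the behaviour described above concrete, here is a small self-contained sketch that drives such a generator first with next() and then with a for loop:

def gen(n):
    for i in range(n):
        yield i ** 2

g = gen(4)
print(next(g))      # 0: the body runs up to the first yield
print(next(g))      # 1: execution resumes right after the yield
for value in g:     # the for loop keeps requesting values until the generator is exhausted
    print(value)    # 4, then 9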

This is my first blog post: it took me several days to gather the material and another morning to edit the formatting, but I still feel very happy now that it is finished!

Since you have read all the way to this sentence, please give it a thumbs up and leave a comment!!!


Source: blog.csdn.net/golden_knife/article/details/107013394