Crawler Networking Basics (Part 2)

Category: Other | 2021-01-30 22:28:45

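4. Session and Cookies

4.1 Static and dynamic web pages

Static web page: the page content is written directly in HTML; text, images, and other content are all specified by hand-written HTML code.
It loads quickly and is simple to write, but it has significant drawbacks, such as poor maintainability and no ability to vary the displayed content according to the URL.
Dynamic web page: the server parses the parameters in the URL, queries a database as needed, and renders different page content accordingly, which makes it very flexible.
Stateless HTTP: the HTTP protocol keeps no memory of earlier transactions, so on its own the server does not know what state the client is in; each request is handled independently.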
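
To make the idea of a dynamic page concrete, here is a minimal sketch of a handler that picks its content from a URL parameter. The `ARTICLES` dictionary, the `render` function, and the `id` parameter are hypothetical names used only for illustration; a real site would query an actual database.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical in-memory "database" standing in for a real one.
ARTICLES = {"1": "Intro to crawlers", "2": "Sessions and cookies"}


def render(url: str) -> str:
    """Return page content chosen by the `id` query parameter of the URL."""
    query = parse_qs(urlparse(url).query)
    article_id = query.get("id", [""])[0]
    title = ARTICLES.get(article_id, "Not found")
    return f"<h1>{title}</h1>"


print(render("https://example.com/article?id=2"))
```

The same handler yields different pages for different URLs, which is the property a static HTML file cannot offer.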

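4.2 Session and Cookies

Session: kept on the server side, that is, on the website's server, which stores each user's session information.
A session is a bounded series of actions or messages with a beginning and an end; on the Web, a Session object stores the attributes and configuration that a specific user's session needs.
Cookies: kept on the client side, that is, in the browser; cookies hold the login credential (typically a session ID that lets the server find the matching session).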
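
The interplay between the two can be sketched roughly as follows. This is a simplified illustration, not how any particular framework implements sessions; `SESSIONS`, `login`, `current_user`, and the cookie name `sessionid` are all assumed names.

```python
import secrets

# Server-side session store: session ID -> that user's session data.
SESSIONS = {}


def login(username: str) -> str:
    """Create a session on the server and return the cookie to send to the client."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username}
    return f"sessionid={session_id}"


def current_user(cookie_header: str):
    """Look up the session named by the cookie the client sent back."""
    name, _, session_id = cookie_header.partition("=")
    if name != "sessionid":
        return None
    session = SESSIONS.get(session_id)
    return session["user"] if session else None


cookie = login("alice")                  # server -> client (Set-Cookie)
print(current_user(cookie))              # client -> server (Cookie)
print(current_user("sessionid=forged"))  # unknown ID, so no session is found
```

Because HTTP itself is stateless, the cookie is what lets the server tie each new request back to the stored session.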

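Attribute structure:
Name: the name of the cookie. Once a cookie is created, its name cannot be changed.
Value: the value of the cookie. Unicode characters must be encoded (for example, percent-encoded), and binary data must be Base64-encoded.
Max Age: the cookie's lifetime in seconds; it is often used together with Expires, and the expiry time can be computed from it. If Max Age is positive, the cookie expires after that many seconds. In the HTTP Set-Cookie header, a zero or negative Max-Age expires the cookie immediately, while a cookie with neither Max-Age nor Expires is a session cookie: it is discarded when the browser closes and is not persisted in any form (some server APIs, such as Java's Cookie.setMaxAge, use a negative value to request this behavior).
Path: the path the cookie applies to. If set to /path/, only pages under /path/ can access the cookie; if set to /, every page on the domain can access it.
Domain: the domain that can access the cookie.
Size: the size of the cookie.
Http: the cookie's HttpOnly attribute. If it is true, the cookie is sent only in HTTP headers and cannot be read through document.cookie.
Secure: whether the cookie may be sent only over a secure protocol such as HTTPS (SSL/TLS), which encrypts data before it is transmitted over the network. Defaults to false.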
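
As an illustration of these attributes, the standard library's `http.cookies.SimpleCookie` can build and parse cookie headers. The cookie name `sessionid`, its value, and the domain `example.com` below are placeholder assumptions.

```python
from http.cookies import SimpleCookie

# Build a cookie and set the attributes described above.
cookie = SimpleCookie()
cookie["sessionid"] = "abc123"
morsel = cookie["sessionid"]
morsel["max-age"] = 3600          # expires one hour from now
morsel["path"] = "/"              # valid for every page on the domain
morsel["domain"] = "example.com"
morsel["httponly"] = True         # hidden from document.cookie
morsel["secure"] = True           # sent only over HTTPS

header = cookie.output(header="Set-Cookie:")
print(header)

# Parse a Cookie request header the way a server (or a crawler) might.
parsed = SimpleCookie()
parsed.load("sessionid=abc123; theme=dark")
print({name: m.value for name, m in parsed.items()})
```

A crawler usually only needs the parsing half: it reads the cookies a site sets at login and sends them back on later requests.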

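5. Multithreading and multiprocessing

5.1 Threads

A thread is the smallest unit the operating system schedules for execution, and the smallest unit of execution within a process.
Each task a process handles corresponds to the execution of one thread; running these threads concurrently or in parallel is what lets a process carry out many tasks at once.

Concurrency and parallelism

Concurrency: only one instruction executes at any given instant, but the instructions of multiple threads are executed in rapid rotation. At the macro level the threads appear to run simultaneously; at the micro level the processor is just switching continuously among them.
Parallelism: at the same instant, multiple instructions execute simultaneously on multiple processors, so parallelism depends on having multiple processors (or cores). At both the macro and the micro level, the threads really do execute at the same time.
Parallelism can exist only on multiprocessor systems, whereas concurrency can exist on both single-processor and multiprocessor systems.

5.2 Multithreading

Multithreading means running multiple threads within one process at the same time. It suits IO-bound tasks, where it improves the program's overall throughput, and web crawlers in particular.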
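
For an IO-bound workload such as crawling, the benefit can be sketched with `concurrent.futures.ThreadPoolExecutor`. The `fetch` function here is a hypothetical stand-in that sleeps instead of making a real request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Stand-in for a network request: sleeping simulates waiting on IO."""
    time.sleep(0.2)
    return f"content of {url}"


urls = [f"https://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(fetch, urls))  # results keep the input order
elapsed = time.perf_counter() - start

print(pages[0])
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Fetching the five pages one after another would take about one second; with five threads the waits overlap, so the total should be close to a single request's delay.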

Creating child threads directly with Thread

import threading
import time

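# Define the target function; its argument is the number of seconds to sleep
def target(second):
    # threading.current_thread().name returns the name of the running thread
    # The f'' prefix marks a formatted string literal, which can embed variables and expressions in curly braces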
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

# Create the child threads directly (without join)
# for i in [1, 5]:
#     thread = threading.Thread(target=target, args=[i])
#     thread.start()
# print(f'Threading {threading.current_thread().name} is ended')

# join: make the main thread wait until every child thread has finished
threads = []
for i in [1, 5]:
    thread = threading.Thread(target=target, args=[i])
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
print(f'Threading {threading.current_thread().name} is ended')

Creating child threads by subclassing Thread

import threading
import time

class MyThread(threading.Thread):
    def __init__(self, second):
        threading.Thread.__init__(self)
        self.second = second
    def run(self):
        print(f'Threading {threading.current_thread().name} is running')
        print(f'Threading {threading.current_thread().name} sleep {self.second}s')
        time.sleep(self.second)
        print(f'Threading {threading.current_thread().name} is ended')

print(f'Threading {threading.current_thread().name} is running')
threads = []
for i in [1, 5]:
    thread = MyThread(i)
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
print(f'Threading {threading.current_thread().name} is ended')

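5.2.1 Daemon threads

A thread marked as a daemon thread is treated as "unimportant": once the main thread and all other non-daemon threads have ended, a daemon thread that is still running is forcibly terminated.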

import threading
import time

def target(second):
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')

print(f'Threading {threading.current_thread().name} is running')
t1 = threading.Thread(target=target, args=[2])
t1.start()
t2 = threading.Thread(target=target, args=[5])
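# daemon attribute: marks a thread as a daemon thread (setDaemon() is deprecated since Python 3.10)
t2.daemon = True
t2.start()
print(f'Threading {threading.current_thread().name} is ended')
# If both t1 and t2 call join here, the main thread still waits for each child thread to finish before exiting, whether or not it is a daemon thread.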

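5.2.2 Mutex locks

Threads within a process share resources, so when multiple threads read or modify the same data at the same time, the result can be unpredictable.
To avoid this, the threads must be synchronized, which can be done by protecting the data they operate on with a lock.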

import threading
import time

count = 0

class MyThread(threading.Thread):

    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
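        # Without the lock, updates are lost and the final count varies from run to run (one run gave 8):
        # global count
        # temp = count + 1
        # time.sleep(0.001)
        # count = temp

        # With the lock: Final count: 1000
        global count
        with lock:  # acquires the lock and releases it even if an exception is raised
            temp = count + 1
            time.sleep(0.001)
            count = temp


# Create the lock as a threading.Lock instance
lock = threading.Lock()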
threads = []
for _ in range(1000):
    thread = MyThread()
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
print(f'Final count: {count}')

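5.2.3 GIL

GIL (Global Interpreter Lock): in CPython, whether on a single core or on multiple cores, only one thread can execute Python bytecode at any given moment.
The GIL can be seen as a pass for thread execution: only the thread holding the pass may run, and a Python (CPython) process has only one GIL. A thread releases the GIL while it waits on IO, which is why multithreading still helps IO-bound tasks.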
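
The effect of the GIL on CPU-bound code can be observed with a rough timing sketch like the one below. The function `count_down` and the constant `N` are illustrative assumptions, and the timings depend on the machine and interpreter build, so no specific numbers are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n: int) -> int:
    """Pure-Python CPU-bound loop; the running thread holds the GIL."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000

# Run the work twice in a row on the main thread.
start = time.perf_counter()
count_down(N)
count_down(N)
serial = time.perf_counter() - start

# Run the same two jobs on two threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(count_down, [N, N]))
threaded = time.perf_counter() - start

print(f"serial: {serial:.2f}s, two threads: {threaded:.2f}s")
```

On a standard CPython build the two timings are typically similar, because the threads take turns holding the GIL rather than running in parallel.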

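5.3 Processes

Process: one execution of a program with some independent function over a given set of data; it is an independent unit of resource allocation and scheduling in the system.
Processes do not share data with one another by default; each has its own memory space.
A process can be thought of as an independently runnable unit of a program that can handle many things at once; it can be seen as a collection of threads, made up of one or more threads.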
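
That processes do not share data can be checked with a small sketch. The names `counter` and `increment` are hypothetical; the point is only that a change made in the child process is invisible to the parent.

```python
import multiprocessing

counter = 0


def increment():
    """Runs in the child process and changes only the child's copy of counter."""
    global counter
    counter += 1
    print(f"child sees counter = {counter}")


if __name__ == "__main__":
    p = multiprocessing.Process(target=increment)
    p.start()
    p.join()
    # The parent's copy is untouched by the child's update.
    print(f"parent sees counter = {counter}")
```

Sharing state across processes therefore needs an explicit mechanism such as a Queue, a Pipe, or shared memory.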

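5.4 Multiprocessing

It is recommended to run this part on Linux or Mac. On Windows, shared objects must be passed to the Process constructor in its argument list; otherwise the child process gets a brand-new object rather than the parent's. The same applies wherever the spawn start method is used, which has also been the default on macOS since Python 3.8.

Multiprocessing means running multiple processes at the same time. Since each process consists of one or more threads, running N processes means at least N threads are running.
Each process has its own GIL, so on a multi-core processor multiprocessing is not limited by the GIL.
By comparison, multiprocessing takes better advantage of multiple cores; when conditions allow, prefer multiprocessing, particularly for CPU-bound work.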

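5.4.1 Implementing multiprocessing

Using the Process class directly

# multiprocessing provides a set of components, such as Process, Queue, Semaphore, Pipe, Lock, and Pool
# Each process is represented by a Process object; API: Process([group [, target [, name [, args [, kwargs]]]]])
# target: the callable to invoke; pass the function itself
# args: tuple of positional arguments for the callable; with a single argument, add a trailing comma after it so it is parsed as a tuple
# kwargs: dict of keyword arguments for the callable
# name: an alias, that is, a name for the process
# group: reserved; should always be None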

import multiprocessing

def process(index):

    print(f'Process: {index}')


if __name__ == '__main__':

    for i in range(5):
        p = multiprocessing.Process(target=process, args=(i,))
        p.start()

# Use multiprocessing to get the number of CPU cores on the current machine and list the active child processes
import multiprocessing
import time

def process(index):
    time.sleep(index)
    print(f'Process: {index}')


if __name__ == '__main__':
    for i in range(5):
        p = multiprocessing.Process(target=process, args=[i])
        p.start()
    print(f'CPU number: {multiprocessing.cpu_count()}')
    for p in multiprocessing.active_children():
        print(f'Child process name: {p.name} id: {p.pid}')
    print('Process Ended')

Subclassing the Process class

from multiprocessing import Process
import time


class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid} LoopCount: {count}')


if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.start()

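5.4.2 Daemon processes

If a process is marked as a daemon, it is terminated automatically when its parent process ends.
This effectively prevents child processes from being spawned without control.
It also means that once the main process finishes there is no need to worry about whether the child processes have been shut down, since no child is left running on its own.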

from multiprocessing import Process
import time


class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid} LoopCount: {count}')


if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.daemon = True
        p.start()
    print('Main Process ended')

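5.4.3 Waiting for processes

Calling join makes the main process wait for each child process to finish, so the children are not cut off by the main process ending first.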

from multiprocessing import Process
import time


class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print(f'Pid: {self.pid} LoopCount: {count}')


if __name__ == '__main__':
    processes = []
    for i in range(2, 5):
        p = MyProcess(i)
        processes.append(p)
        p.daemon = True
        p.start()
    for p in processes:
        p.join()

    print('Main Process ended')

5.4.4 Terminating processes

Use the terminate method to terminate a child process, and the is_alive method to check whether a process is still running.

import multiprocessing
import time

def process():
    print('Starting')
    time.sleep(5)
    print('Finished')


if __name__ == '__main__':
    p = multiprocessing.Process(target=process)
    print('Before:', p, p.is_alive())
    p.start()
    print('During:', p, p.is_alive())
    p.terminate()
    print('Terminate:', p, p.is_alive())
    p.join()
    print('Joined:', p, p.is_alive())

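5.4.5 Process mutex locks

A lock prevents multiple processes from competing for a critical-section resource (here, the output) at the same time: while one process is printing it holds the lock and the others wait; once it has finished and released the lock, another process can print.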

from multiprocessing import Process, Lock
import time

class MyProcess(Process):
    def __init__(self, loop, lock):
        Process.__init__(self)
        self.loop = loop
        self.lock = lock

    def run(self):
        for count in range(self.loop):
            time.sleep(0.1)
            self.lock.acquire()
            print(f'Pid: {self.pid} LoopCount: {count}')
            self.lock.release()

if __name__ == '__main__':
    lock = Lock()
    for i in range(10, 15):
        p = MyProcess(i, lock)
        p.start()

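5.4.6 Semaphores

Semaphores play an important role in process synchronization: a semaphore tracks how many units of a critical resource are available, lets multiple processes access a shared resource at the same time, and limits how many of them do so concurrently.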

from multiprocessing import Process, Semaphore, Lock, Queue
import time

buffer = Queue(10)
empty = Semaphore(2)
full = Semaphore(0)
lock = Lock()

class Consumer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            full.acquire()
            lock.acquire()
            buffer.get()
            print('Consumer pop an element')
            time.sleep(1)
            lock.release()
            empty.release()

class Producer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            empty.acquire()
            lock.acquire()
            buffer.put(1)
            print('Producer append an element')
            time.sleep(1)
            lock.release()
            full.release()

if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Main Process Ended')

5.4.7 Queues

Queue serves as a shared queue for communication between processes.

from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

buffer = Queue(10)
empty = Semaphore(2)
full = Semaphore(0)
lock = Lock()

class Consumer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            full.acquire()
            lock.acquire()
            print(f'Consumer get {buffer.get()}')
            time.sleep(1)
            lock.release()
            empty.release()

class Producer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            empty.acquire()
            lock.acquire()
            num = random()
            print(f'Producer put {num}')
            buffer.put(num)
            time.sleep(1)
            lock.release()
            full.release()

if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Main Process Ended')

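5.4.8 Pipes

A pipe is a channel for communication between two processes.
A pipe can be one-way (half-duplex), with one process sending messages and the other receiving them, or two-way (duplex), with both ends sending and receiving.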

from multiprocessing import Process, Pipe

class Consumer(Process):
    def __init__(self, pipe):
        Process.__init__(self)
        self.pipe = pipe

    def run(self):
        self.pipe.send('Consumer Words')
        print(f'Consumer Received: {self.pipe.recv()}')

class Producer(Process):
    def __init__(self, pipe):
        Process.__init__(self)
        self.pipe = pipe

    def run(self):
        print(f'Producer Received: {self.pipe.recv()}')
        self.pipe.send('Producer Words')

if __name__ == '__main__':
    pipe = Pipe()
    p = Producer(pipe[0])
    c = Consumer(pipe[1])
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Main Process Ended')
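
The example above uses the default two-way pipe. As a complement, a one-way pipe can be sketched with `Pipe(duplex=False)`; to keep it short, both ends are used inside a single process here, and the message content is a made-up placeholder.

```python
from multiprocessing import Pipe

# With duplex=False, the first connection can only receive
# and the second connection can only send.
receiver, sender = Pipe(duplex=False)

sender.send({"url": "https://example.com", "status": 200})
message = receiver.recv()
print(message)

sender.close()
receiver.close()
```

In a real program the `sender` end would be handed to one process and the `receiver` end to another, in the same way `pipe[0]` and `pipe[1]` are passed above.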

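5.4.9 Process pools

Pool provides a specified number of worker processes for the caller to use.
When a new task is submitted to the pool and a worker process is idle, that worker runs the task.
If every worker is busy, the task waits in the pool's queue until a worker becomes free and picks it up.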

from multiprocessing import Pool
import time

def function(index):
    print(f'Start process: {index}')
    time.sleep(3)
    print(f'End process {index}')

if __name__ == '__main__':
    pool = Pool(processes=3)
    for i in range(4):
        pool.apply_async(function, args=(i,))

    print('Main Process started')
    pool.close()
    pool.join()
    print('Main Process ended')
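
When the tasks return values, `Pool.map` is a convenient alternative to `apply_async`. The following is a minimal sketch; `square` is a hypothetical worker function chosen only for illustration.

```python
from multiprocessing import Pool


def square(x: int) -> int:
    """CPU-bound stand-in for real work done in a worker process."""
    return x * x


if __name__ == "__main__":
    # The with-block shuts the pool down once the results are collected.
    with Pool(processes=3) as pool:
        results = pool.map(square, range(6))  # blocks until every task is done
    print(results)  # [0, 1, 4, 9, 16, 25]
```

`map` splits the inputs across the three workers and returns the results in input order, so the caller does not have to track individual tasks.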

May your code be free of ERRORs and your keyboard free of BUGs!!

Reposted from blog.csdn.net/czyying123/article/details/113062738
