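Let's start with a concrete scenario:
A crawler fetches Taobao's shop list pages to collect every shop's URL, then visits each shop URL to scrape that shop's data.
This scenario suits multithreading well. Fetching the list pages and fetching the detail pages are largely independent of each other, so the only part that needs care is communication between the threads.
Here is some illustrative code: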
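# -*- coding: utf-8 -*-
import threading
import time


def get_detail_html(url):
    # Stand-in for fetching a shop detail page; sleep simulates the I/O.
    print("get detail html started")
    time.sleep(2)
    print("get detail html end")


def get_detail_url(url):
    # Stand-in for collecting detail URLs from the list page.
    print("get detail url started")
    time.sleep(4)
    print("get detail url end")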
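Below, the same two functions are started in several different ways so the results can be compared. First, two pieces of background.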
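thread.setDaemon(True):
setDaemon() marks a thread as a daemon thread. Calling thread.setDaemon(True) before the thread is started declares the thread non-essential: the process does not have to wait for it to finish before exiting. The practical benefit is that a child thread stuck in an endless loop cannot keep the program from exiting. With the flag set to True, the program exits once the main thread (and any other non-daemon thread) has finished, and daemon threads that are still running are stopped abruptly. With the flag set to False, which is what a thread created from the main thread gets by default, the program keeps running after the main thread ends until the child thread has finished. Since Python 3.10, setDaemon() is deprecated in favor of setting the thread.daemon attribute directly.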
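As a minimal sketch of the daemon flag described above, the snippet below sets it in two ways: through the daemon= argument of threading.Thread and through the daemon attribute. It also shows that changing the flag on a thread that has already been started raises RuntimeError. The worker function and the variable names are hypothetical and exist only for illustration.

```python
import threading
import time


def worker():
    # Hypothetical task; the short sleep stands in for real work.
    time.sleep(0.1)


# Option 1: pass daemon=True to the constructor.
t1 = threading.Thread(target=worker, daemon=True)

# Option 2: set the attribute; this must happen before start().
t2 = threading.Thread(target=worker)
t2.daemon = True

t1.start()
t2.start()

# Changing the flag on a thread that has been started raises RuntimeError.
try:
    t1.daemon = False
    changed_after_start = True
except RuntimeError:
    changed_after_start = False

t1.join()
t2.join()
print(t1.daemon, t2.daemon, changed_after_start)
```

Both forms have the same effect as setDaemon(True); the attribute form is the one that does not trigger a deprecation warning on Python 3.10 and later.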
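thread.join():
join() is how threads are synchronized here: the thread that calls join() (in these examples, the main thread) blocks at that call until the joined thread has finished, and only then continues with its own remaining code.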
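join() also accepts an optional timeout in seconds. The sketch below, which uses a hypothetical slow_task function, waits briefly first and then without a limit. Because join() always returns None, is_alive() is the way to tell whether the thread actually finished within the timeout.

```python
import threading
import time


def slow_task():
    # Hypothetical task that takes about half a second.
    time.sleep(0.5)


t = threading.Thread(target=slow_task)
t.start()

# Wait at most 0.1 s; the thread needs longer than that.
t.join(timeout=0.1)
still_running = t.is_alive()

# Wait with no time limit; this returns once the thread has finished.
t.join()
finished = not t.is_alive()

print(still_running, finished)
```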
if __name__ == "__main__":
    thread1 = threading.Thread(target=get_detail_html, args=("",))
    thread2 = threading.Thread(target=get_detail_url, args=("",))
    start_time = time.time()
    thread1.start()
    thread2.start()
    print("last time:{}".format(time.time() - start_time))
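Output:
get detail html started
get detail url started
last time:0.002000093460083008
get detail html end
get detail url end
start() returns immediately, so the main thread prints the elapsed time right away. Both threads are non-daemon, so the process stays alive until they have finished.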
if __name__ == "__main__":
    thread1 = threading.Thread(target=get_detail_html, args=("",))
    thread2 = threading.Thread(target=get_detail_url, args=("",))
    start_time = time.time()
    thread1.setDaemon(True)
    thread2.setDaemon(True)
    thread1.start()
    thread2.start()
    print("last time:{}".format(time.time() - start_time))
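Output:
get detail html started
get detail url started
last time:0.0
The main thread finishes almost immediately, so the process exits and the daemon threads are stopped while they are still inside time.sleep(). Neither "end" line is printed.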
if __name__ == "__main__":
    thread1 = threading.Thread(target=get_detail_html, args=("",))
    thread2 = threading.Thread(target=get_detail_url, args=("",))
    start_time = time.time()
    # Daemon mode is deliberately left off in this variant.
    # thread1.setDaemon(True)
    # thread2.setDaemon(True)
    thread1.start()
    thread2.start()
    # Block the main thread until both child threads have finished.
    thread1.join()
    thread2.join()
    print("last time:{}".format(time.time() - start_time))
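Output:
get detail html started
get detail url started
get detail html end
get detail url end
last time:4.0012288093566895
The main thread now waits at the join() calls, so the elapsed time is printed last. It is about 4 seconds rather than 6 because the two threads sleep concurrently, and the total is set by the slower one.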
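The same start-then-join pattern extends to any number of threads. The following is a small sketch under assumed values (the sleep_for helper and the 0.3 s and 0.6 s durations are made up for illustration) that starts every thread first and joins them afterwards, so the total wait tracks the longest task rather than the sum of all tasks.

```python
import threading
import time


def sleep_for(seconds):
    # Hypothetical task whose duration is passed in by the caller.
    time.sleep(seconds)


threads = [threading.Thread(target=sleep_for, args=(s,)) for s in (0.3, 0.6)]

start_time = time.perf_counter()
for t in threads:
    t.start()   # start all threads before waiting on any of them
for t in threads:
    t.join()    # then wait for each one in turn
elapsed = time.perf_counter() - start_time

print("elapsed: {:.2f}s".format(elapsed))
```

Calling join() right after each start() inside a single loop would instead run the tasks one after another, and the elapsed time would approach the sum of the durations.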
if __name__ == "__main__":
    thread1 = threading.Thread(target=get_detail_html, args=("",))
    thread2 = threading.Thread(target=get_detail_url, args=("",))
    start_time = time.time()
    thread1.setDaemon(True)
    thread2.setDaemon(True)
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
    print("last time:{}".format(time.time() - start_time))
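Output:
get detail html started
get detail url started
get detail html end
get detail url end
last time:4.0012290477752686
Even though both threads are daemons, the join() calls make the main thread wait for them, so they run to completion and the result matches the previous run.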
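Returning to the crawling scenario from the beginning, one common way to handle communication between the two kinds of threads is a queue.Queue from the standard library. The sketch below is only an illustration under stated assumptions: produce_shop_urls, fetch_shop_pages, the example.com URLs, and the None sentinel are all hypothetical, and no real network request is made.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker meaning "no more URLs will arrive"


def produce_shop_urls(url_queue, count):
    # Stands in for parsing the shop list page and emitting shop URLs.
    for i in range(count):
        url_queue.put("https://example.com/shop/{}".format(i))
    url_queue.put(SENTINEL)


def fetch_shop_pages(url_queue, results):
    # Stands in for fetching each shop page; get() blocks until a URL is ready.
    while True:
        url = url_queue.get()
        if url is SENTINEL:
            break
        results.append("html of " + url)


url_queue = queue.Queue()
results = []
producer = threading.Thread(target=produce_shop_urls, args=(url_queue, 3))
consumer = threading.Thread(target=fetch_shop_pages, args=(url_queue, results))

producer.start()
consumer.start()
producer.join()
consumer.join()

print(results)
```

queue.Queue does its own locking, so the producer and the consumer need no explicit lock around put() and get(). With several consumer threads, each one would need its own sentinel, or a different shutdown signal.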