Scraping all the problems from HDU OJ

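HDU OJ does not have any anti-scraping measures,

so you can simply fetch the pages directly.

Here is the source code (the request parameters, the loop range, and the storage path can all be changed):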

import time

import requests
from bs4 import BeautifulSoup

# Output directory; change this to suit your machine.
OUTPUT_DIR = 'D:\\python\\python_code\\hdoj\\'


def write_in_file(number, string):
    """Append a string to the text file for the given problem number."""
    # The with-statement closes the file, so no explicit close() is needed.
    with open(OUTPUT_DIR + str(number) + ".txt", "a+", encoding='utf-8') as f:
        f.write(string)


link = "http://acm.hdu.edu.cn/showproblem.php?pid="
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
}
for i in range(1503, 1900):  # problem IDs to fetch; adjust as needed
    print("sending the request now")
    r = requests.get(link + str(i), headers=headers, timeout=10)
    print("request completed")
    soup = BeautifulSoup(r.text, "lxml")
    heading = soup.find("h1")  # the problem title
    if heading is None:  # no title on the page, e.g. the problem ID does not exist
        print("no problem found for NO. " + str(i) + ", skipping")
        time.sleep(1)
        continue
    write_in_file(i, "question: " + heading.text + "\n")
    section_titles = soup.find_all("div", class_="panel_title")
    section_bodies = soup.find_all("div", class_="panel_content")
    print("now writing the NO. " + str(i) + " file")
    # Pair each section heading with its content; zip stops at the shorter list.
    for title, body in zip(section_titles, section_bodies):
        write_in_file(i, title.text + ": " + body.text + "\n")
    time.sleep(1)  # pause for one second between requests
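
Since the storage path is one of the things you are expected to change, here is a minimal sketch of how the file-writing part could be pulled out into helpers that take the output directory as an argument. The names `problem_path`, `format_problem`, and `save_problem` are hypothetical and not part of the script above. The sketch also assumes you would rather overwrite a file than append to it, so re-running the crawl does not duplicate content, which differs from the "a+" mode used above.

```python
from pathlib import Path


def problem_path(out_dir, pid):
    """Return the output path for one problem ID, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{pid}.txt"


def format_problem(title, sections):
    """Render a title plus (heading, body) pairs, one section per line.

    Unlike the script above, surrounding whitespace is stripped
    from each heading and body (an assumption of this sketch).
    """
    lines = [f"question: {title}"]
    for heading, body in sections:
        lines.append(f"{heading.strip()}: {body.strip()}")
    return "\n".join(lines) + "\n"


def save_problem(out_dir, pid, title, sections):
    """Write one problem, overwriting any earlier file for the same ID."""
    path = problem_path(out_dir, pid)
    path.write_text(format_problem(title, sections), encoding="utf-8")
    return path
```

In the main loop you could then collect `(title.text, body.text)` pairs into a list and call something like `save_problem("hdoj", i, heading.text, pairs)` once per problem, instead of calling `write_in_file` several times.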

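Note: this scraping was done purely out of interest, with no commercial use. If it infringes on anything, I will take it down on request.

I hope this is helpful.

That's all.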

Reposted from www.cnblogs.com/lavender-pansy/p/12118004.html