I have been learning Python for a while now, and I have picked up a lot of useful knowledge from CSDN blogs; the experts there have helped me through quite a few problems. Today I would like to share a hands-on project from my own learning: scraping the seven-day weather forecast for every city in Anhui province using the Scrapy framework.
First, we create a Scrapy project. The standard workflow is:
- Create a project (Project): start a new spider project
- Define the targets (Items): specify exactly what you want to scrape
- Write the spider (Spider): write the spider and start crawling pages
- Store the content (Pipeline): design a pipeline that stores the scraped data
What each file does:
- scrapy.cfg: the project's configuration file
- tutorial/: the project's Python module; code is imported from here
- tutorial/items.py: the project's items file
- tutorial/pipelines.py: the project's pipelines file
- tutorial/settings.py: the project's settings file
- tutorial/spiders/: the directory that holds the spiders
Create the project: scrapy startproject Anhuispider
Enter the project directory as prompted: cd Anhuispider
Then generate the spider file: scrapy genspider Wcity www.weather.com.cn
Wcity is the name you choose for the spider (it is also the name used to run the crawl at the end), and www.weather.com.cn is the domain of the site being scraped.
Once all the files have been edited, run the crawl with: scrapy crawl Wcity
(Figure: the file layout of the Anhuispider project after creation.)
(Figure: the page source used to extract the weather information.)
Item is the container that holds the scraped data. It is used much like a Python dict, but adds an extra protection mechanism against undefined-field errors caused by typos.
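That protection can be illustrated with a small piece of plain Python. This is only a minimal sketch of the idea, not Scrapy's actual implementation, and the class names here are made up:

```python
# Minimal sketch of the "undefined field" protection that scrapy.Item
# provides: assignments to a field name that was never declared fail
# loudly instead of silently storing a typo.
class StrictItem:
    fields = ()  # subclasses declare their allowed field names here

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(repr(key) + ' is not a declared field')
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class WeatherItem(StrictItem):
    fields = ('city', 'Date', 'weather')


item = WeatherItem()
item['city'] = 'Hefei'       # fine: 'city' is a declared field
try:
    item['ctiy'] = 'Hefei'   # typo: raises KeyError instead of storing it
except KeyError as e:
    print('rejected:', e)
```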
Here is the code in items.py; it declares the fields we want to scrape:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class AnhuispiderItem(scrapy.Item):
    city = scrapy.Field()            # city name
    Date = scrapy.Field()            # date
    weather = scrapy.Field()         # weather conditions
    maxtemperature = scrapy.Field()  # high temperature
    mintemperature = scrapy.Field()  # low temperature
    wind = scrapy.Field()            # wind direction
    winds = scrapy.Field()           # wind force
Now the core of the crawl, the spider itself:
# -*- coding: utf-8 -*-
import scrapy
from Anhuispider.items import AnhuispiderItem


class WcitySpider(scrapy.Spider):
    name = 'Wcity'
    allowed_domains = ['weather.com.cn']  # scraping the China Weather Network
    # Seven-day forecast pages for the cities of Anhui, found by inspecting
    # the site; you could also start from an index page and follow the links
    citys = ['/weather/101220101.shtml',
             '/weather/101220201.shtml', '/weather/101220301.shtml',
             '/weather/101220401.shtml', '/weather/101220501.shtml',
             '/weather/101220601.shtml', '/weather/101220701.shtml',
             '/weather/101220801.shtml', '/weather/101220901.shtml',
             '/weather/101221001.shtml', '/weather/101221101.shtml',
             '/weather/101221201.shtml', '/weather/101221301.shtml',
             '/weather/101221401.shtml', '/weather/101221501.shtml',
             '/weather/101221701.shtml', ]
    start_urls = ['http://www.weather.com.cn' + city for city in citys]

    def parse(self, response):
        # Scrapy requests every URL in start_urls and calls parse() on each
        # response, so no explicit loop over the URLs is needed here
        item = AnhuispiderItem()
        item['city'] = response.xpath('//div[@class="crumbs fl"]//a[2]//text()').extract()[0]
        item['Date'] = response.xpath('//ul[@class="t clearfix"]//h1//text()').extract()
        item['weather'] = response.xpath('//p[@class="wea"]//text()').extract()
        item['maxtemperature'] = response.xpath('//p[@class="tem"]//span/text()').extract()
        item['mintemperature'] = response.xpath('//p[@class="tem"]//i/text()').extract()
        # the span's title attribute holds the wind direction and the <i>
        # text holds the wind force, matching the field comments in items.py
        item['wind'] = response.xpath('//p[@class="win"]//em//span[1]/@title').extract()
        item['winds'] = response.xpath('//p[@class="win"]//i/text()').extract()
        return item
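Each XPath call in parse() returns a list with one entry per forecast day, and the lists are parallel, so the days can be lined up with zip(). A small sketch with made-up sample values (not real scrape output):

```python
# Hypothetical sample of what parse() collects for one city: each field is
# a list with one entry per forecast day, all in the same order.
Date = ['26th (today)', '27th (tomorrow)', '28th']
weather = ['cloudy', 'sunny', 'light rain']
maxtemperature = ['25℃', '27℃', '22℃']

# Because the lists are parallel, zipping them yields one row per day
for day, wea, high in zip(Date, weather, maxtemperature):
    print(day, wea, high)
```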
Next, modify the pipelines file:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class AnhuispiderPipeline(object):
    def process_item(self, item, spider):
        # one output file per day, named after today's date
        today = time.strftime('%Y-%m-%d', time.localtime())
        fileName = today + '.txt'
        with open(fileName, 'a', encoding='utf-8') as fp:
            fp.write(item['city'] + '\n')
            # format(a, '<15') left-aligns each value in a 15-character
            # column; pick whatever alignment and width you like. Writing
            # the list element by element avoids type errors when a field
            # is a list of strings rather than a single string.
            for a in item['Date']:
                fp.write(format(a, '<15') + '\t')
            fp.write('\n')
            for b in item['weather']:
                fp.write(format(b, '<15') + '\t')
            fp.write('\n')
            for c in item['maxtemperature']:
                fp.write(format(c, '<18') + '\t')
            fp.write('\n')
            for d in item['mintemperature']:
                fp.write(format(d, '<15') + '\t')
            fp.write('\n')
            for e in item['winds']:
                fp.write(format(e, '<15') + '\t')
            fp.write('\n')
            for f in item['wind']:
                fp.write(format(f, '<15') + '\t')
            fp.write('\n\n')
        time.sleep(1)
        return item
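The format(value, '<15') calls are ordinary Python format-spec syntax: '<' means left-align and 15 is the column width, which is what lines the fields up into columns in the text file:

```python
# '<15' pads each value on the right out to a 15-character column, so
# values written on successive lines line up under one another
row = ''
for value in ['cloudy', '25/18', 'NE 3-4']:
    row += format(value, '<15') + '\t'
print(row)
```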
Finally, modify the settings file:
# -*- coding: utf-8 -*-
# Scrapy settings for Anhuispider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Anhuispider'
SPIDER_MODULES = ['Anhuispider.spiders']
NEWSPIDER_MODULE = 'Anhuispider.spiders'
ITEM_PIPELINES = {
    'Anhuispider.pipelines.AnhuispiderPipeline': 1,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Anhuispider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
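The integer assigned to the pipeline in ITEM_PIPELINES is its priority: Scrapy accepts values from 0 to 1000, and pipelines with lower numbers run earlier. That only matters once more than one pipeline is registered, for example (the CSV pipeline below is hypothetical):

```python
# Items pass through pipelines in ascending priority order, so the
# text-file pipeline (1) would run before the hypothetical CSV one (300)
ITEM_PIPELINES = {
    'Anhuispider.pipelines.AnhuispiderPipeline': 1,
    'Anhuispider.pipelines.CsvExportPipeline': 300,  # hypothetical
}
```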
The final results look like this:
Only part of the output is shown here; the tab alignment lays it out like a table. I am still new to Python and there is plenty of room for improvement, so anyone interested is welcome to get in touch and learn together.
This post also drew on the following articles:
http://www.cnblogs.com/wuxl360/p/5567631.html
https://blog.csdn.net/djd1234567/article/details/45642375