Web Crawler: A Small Example Showing How to Open-Source Your Own Project on GitHub

1. Selected Code Explained

test.py
The complete project is at https://github.com/Narutoooooooo/Spider

import scrapy
import re
import json

# Define the spider class
class ItcastSpider(scrapy.Spider):
    # Every spider must have a name; "scrapy crawl <name>" uses it
    name = "test"
    # Request headers that imitate a real browser request
    heads = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Host": "club.jd.com",
        "Referer": "https://item.jd.com/100011336064.html",
        "Sec-Fetch-Dest": "script",
        "Sec-Fetch-Mode": "no-cors",
        "Sec-Fetch-Site": "same-site",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36",
    }
    url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100011336064&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"

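    # Entry point: Scrapy calls this method first when the spider starts
    def start_requests(self):
        yield scrapy.Request(url=self.url, headers=self.heads, callback=self.parse)

    def parse(self, response):
        # The response is JSONP, i.e. fetchJSON_comment98({...});
        # Greedy regex: capture everything between the first "(" and the last ")"
        p = re.compile(r'[(](.*)[)]', re.S)
        # findall returns a list of the captured strings
        r = re.findall(p, response.text)
        jsonstr = r[0]
        # Parse the JSON string into a Python dict
        jsonobj = json.loads(jsonstr)
        for line in jsonobj["comments"]:
            comment_id = line["id"]  # renamed so it does not shadow the built-in id()
            nickname = line["nickname"]
            productColor = line["productColor"]
            productSize = line["productSize"]
            print("%s\t%s\t%s\t%s" % (comment_id, nickname, productColor, productSize))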

Output:
14056395329	羊男神	钛银黑	8GB+128GB
14017180543	羊男神	钛银黑	8GB+128GB
13911871287	Warm_Tiger	蜜桃金	8GB+128GB
13817760263	he186*****923	钛银黑	8GB+256GB
13811369283	小峰c	蜜桃金	12GB+256GB
13809538172	asxks	钛银黑	8GB+128GB
13826227687	z***h	蜜桃金	12GB+256GB
14007819114	就差半步丶	冰海蓝	8GB+128GB
13994655256	大***涂	钛银黑	8GB+128GB
13942676205	月丅丶	冰海蓝	8GB+256GB
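
The JSONP-unwrapping step in parse can be tried on its own, without Scrapy or network access. The sketch below assumes a made-up payload shaped like the real response; the helper name unwrap_jsonp and the sample values are illustrative and do not come from the project.

```python
import json
import re

# Hypothetical JSONP payload shaped like the response handled in parse();
# the values are invented for illustration.
SAMPLE = 'fetchJSON_comment98({"comments": [{"id": 1, "nickname": "user_a (test)"}]});'

# Greedy match from the first "(" to the last ")", so parentheses inside the
# JSON body do not cut the match short. re.S lets "." span newlines.
JSONP_BODY = re.compile(r"\((.*)\)", re.S)


def unwrap_jsonp(text):
    """Return the decoded JSON object wrapped by a JSONP callback call."""
    match = JSONP_BODY.search(text)
    if match is None:
        raise ValueError("no JSONP wrapper found")
    return json.loads(match.group(1))


data = unwrap_jsonp(SAMPLE)
for comment in data["comments"]:
    print(comment["id"], comment["nickname"])
```

Raising an error when no wrapper is found gives a clearer failure than indexing an empty findall result, which is what happens in parse if the site returns something unexpected.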

mySpider/init.py

from scrapy import cmdline

# Start the spider by name
cmdline.execute("scrapy crawl test".split())

2. Uploading the Project to GitHub

Sign up for a GitHub account
Create a repository on GitHub
Download and install Git
Download page: https://git-scm.com/

Open Git Bash and set the email and name that Git records in your commits

git config --global user.email "you@example.com"
git config --global user.name "Your Name"

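Follow the steps marked by the red box in the screenshot
Notes:
Before running these steps, cd in Git Bash to the folder you want to upload
git add .
stages every file under the current directory for the next commit; the upload itself happens at git push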
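
For reference, a typical first push to a new, empty GitHub repository can be sketched as below. This is an assumption about the usual command sequence, not a transcription of the screenshot. The remote URL is a placeholder, and the helper names build_push_commands and publish are hypothetical. The script only prints the commands unless dry_run is set to False.

```python
import shlex
import subprocess


def build_push_commands(remote_url, branch="master", message="first commit"):
    """Return the git commands, as argument lists, for a typical first push."""
    return [
        ["git", "init"],
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "remote", "add", "origin", remote_url],
        # The branch must match your local branch name ("master" or "main").
        ["git", "push", "-u", "origin", branch],
    ]


def publish(remote_url, dry_run=True, **kwargs):
    """Print each command; also execute it when dry_run is False."""
    for cmd in build_push_commands(remote_url, **kwargs):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder URL: replace with the address of your own repository.
publish("https://github.com/your-name/your-repo.git")
```

Run it from inside the project folder. Keeping dry_run=True first lets you check the commands before anything touches the repository.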

For detailed steps, see https://blog.csdn.net/m0_37725003/article/details/80904824
