Pitfalls encountered with Scrapy

0. Installing Scrapy on Windows

1. Install wheel:
      Enter pip install wheel in the console; the installation completes automatically.
2. Install lxml:
      Go to https://www.lfd.uci.edu/~gohlke/pythonlibs/, scroll down to lxml, and download the .whl file that
      matches your operating system and Python version: cp27, cp35, etc. indicate Python 2.7, 3.5, and so on;
      win32 means a 32-bit Windows build and win_amd64 a 64-bit one (the snippet after this list shows how to
      check your own interpreter's version and bitness).
      After the download finishes, right-click the file -> Properties -> Security -> Object name and copy the full file path.
      Back in the console, type "pip install ", right-click to paste the path, and press Enter to complete the installation.
3. Install PyOpenssl
      Go to https://pypi.python.org/pypi/pyOpenSSL#downloads, scroll down to the .whl file and download it.
      After the download finishes, install the pyOpenSSL .whl the same way you installed lxml.
4. Install Twisted
      Go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#Twisted, scroll down to Twisted, and download the .whl file
      that matches your operating system and Python version. Install it the same way as lxml.
5. Install Pywin32
      Go to https://sourceforge.net/projects/pywin32/files/pywin32/Build 220/, download the installer that matches
      your operating system and Python version, then double-click it to start the installation. The installer locates
      the Python directory automatically, so you don't need to adjust any settings; just click through to the next step.
6. Install scrapy
      After completing steps 1-5, installing Scrapy itself is simple: enter pip install scrapy in the console.
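
To pick the right .whl in steps 2-4, one quick way to confirm the interpreter's version and bitness from Python itself:

import platform
import struct

print(platform.python_version())   # e.g. 3.5.4 -> choose a cp35 wheel
print(struct.calcsize("P") * 8)    # 64 -> win_amd64 wheel, 32 -> win32 wheel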

 

1. No module named win32api

pip install pypiwin32

 

2. Files inside the project folder cannot be found, or No module named 'scrapy.pipelines', or No module named ×××.items?

The Scrapy project sits as a subdirectory of the PyCharm project, so PyCharm cannot resolve the items module.
My solution: right-click the Scrapy project folder -> Mark Directory as -> Sources Root.
Once the folder changes color (PyCharm highlights source roots), the imports resolve.

 

3. No module named PIL

pip install pillow

 

4. Downloading images locally and extracting the local save path

1) Uncomment ITEM_PIPELINES in settings.py and add the images pipeline to it:
ITEM_PIPELINES = {
   'spider_first.pipelines.SpiderFirstPipeline': 300,
   'scrapy.pipelines.images.ImagesPipeline': 5,  # the number is the execution priority; pipelines run from the smallest number to the largest
}
2) Also add the following to settings.py (note the import of os at the top of the file):
import os

IMAGES_URLS_FIELD = "image_url"  # image_url is the item field defined in items.py that holds the scraped image URLs
# configure the local save directory
project_dir = os.path.abspath(os.path.dirname(__file__))  # absolute path of the current crawler project
IMAGES_STORE = os.path.join(project_dir, 'images')  # assemble the image save path
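
To actually pull the saved local path back into the item (the "extract the local save address" part), a common approach is to subclass ImagesPipeline and read the path in item_completed. A minimal sketch, assuming the item has an img_path field (the class and field names here are illustrative, not necessarily the original project's):

from scrapy.pipelines.images import ImagesPipeline


class SaveImagePathPipeline(ImagesPipeline):
    # copy the path Scrapy saved the image under into the item
    def item_completed(self, results, item, info):
        # results is a list of (ok, value) tuples; value["path"] is relative to IMAGES_STORE
        for ok, value in results:
            if ok:
                item["img_path"] = value["path"]
        return item

Register this class in ITEM_PIPELINES in place of scrapy.pipelines.images.ImagesPipeline; it behaves the same, just with the extra field filled in.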

 

5. Installing the Python MySQL module

Windows:

pip install mysqlclient

Ubuntu:

sudo apt-get install libmysqlclient-dev
pip install mysqlclient

CentOS:

sudo yum install python-devel mysql-devel
pip install mysqlclient
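
A quick way to confirm the module is importable afterwards (mysqlclient installs under the MySQLdb name):

import MySQLdb

# prints the version tuple if the install succeeded
print(MySQLdb.version_info)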

 

 

6. IndentationError: unindent does not match any outer indentation level

The indentation inside the block is inconsistent, usually because tabs and spaces are mixed. Pick one style for the method body (the original note used tab characters) and apply it consistently.

 

7. The connection-pool pipeline code

from twisted.enterprise import adbapi
import MySQLdb.cursors


class MysqlTwistedPipline(object):
    # insert via a twisted adbapi connection pool
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset=settings["MYSQL_CHARSET"],
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=settings["MYSQL_USE_UNICODE"],
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to make the MySQL insert asynchronous
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into jobbole(post_url_id,post_url,re_selector,img_url,img_path,zan,shoucang,pinglun,zhengwen,riqi,fenlei)
            VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(insert_sql, (
            item["post_url_id"], item["post_url"], item["re_selector"], item["img_url"][0], item["img_path"],
            item["zan"], item["shoucang"], item["pinglun"], item["zhengwen"], item["riqi"], item["fenlei"]))

 

8. Handling CAPTCHAs when crawling

 

1. If the site's CAPTCHA image can be downloaded, then for someone like me with no deep background in machine learning, the only option is to handle it manually: download the CAPTCHA image, look at it by hand, and type in the characters or coordinates it asks for. The essential goal is simply to complete the login and get inside to harvest the data.
2. For a CAPTCHA URL such as https://www.zhihu.com/captcha.gif?r=1514042860066&type=login&lang=cn,
remember to request it through the same session as the logged-in user.
There are plenty of header examples on GitHub, so I won't paste mine here.

session = requests.session()

response = session.get("https://www.zhihu.com/",headers=header)

The pit I hit when downloading the image itself:
  with open(file_name, 'wb') as f:
      f.write(response.content)

Remember: if the URL opens directly as an image, use response.content, not response.text.encode().
Some sites return Unicode-escaped JSON, which is painful to read; the fix is:
 print(response.text.encode('latin-1').decode('unicode_escape'))
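
Putting the two points together, a minimal sketch of fetching the CAPTCHA image through the logged-in session (the header contents and file name below are placeholders):

import requests

header = {"User-Agent": "Mozilla/5.0"}  # placeholder; use the real login headers

session = requests.session()
session.get("https://www.zhihu.com/", headers=header)

# the CAPTCHA URL opens directly as an image, so write response.content (bytes)
response = session.get(
    "https://www.zhihu.com/captcha.gif?r=1514042860066&type=login&lang=cn",
    headers=header)
with open("captcha.gif", "wb") as f:
    f.write(response.content)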

 

9. Use ON DUPLICATE KEY UPDATE to resolve primary-key conflicts (MySQL only)

 insert into zhihu_question
                      (zhihu_id,topics,url,title,content,creat_time,update_time,answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
                       VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                       ON DUPLICATE KEY UPDATE comments_num=VALUES(comments_num),watch_user_num=VALUES(watch_user_num),click_num=VALUES(click_num)

 

10. Scrapy Items are powerful: by having each item build its own SQL (the pipeline calls back into the item), the insert can be controlled dynamically

items.py

 

import datetime

import scrapy

# SQL_DATETIME_FORMAT and get_nums are helpers defined elsewhere in the project
# (settings.py / a utils module); they are used in get_insert_sql() below.


class ZhihuQuestionItem(scrapy.Item):
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    creat_time = scrapy.Field()
    update_time = scrapy.Field()
    answer_num = scrapy.Field()
    comments_num = scrapy.Field()
    watch_user_num = scrapy.Field()
    click_num = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
                      insert into zhihu_question
                      (zhihu_id,topics,url,title,content,creat_time,update_time,answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
                       VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                       ON DUPLICATE KEY UPDATE comments_num=VALUES(comments_num),watch_user_num=VALUES(watch_user_num),click_num=VALUES(click_num)
                  """
        zhihu_id = self["zhihu_id"][0]
        topics = ",".join(self["topics"])
        url = "".join(self["url"])
        title = "".join(self["title"])
        content = "".join(self["content"])
        creat_time =  datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        update_time =  datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        answer_num = self["answer_num"][0]
        comments_num = get_nums(self["comments_num"][0])
        watch_user_num =  self["watch_user_num"][0]
        click_num = self["watch_user_num"][1]
        crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        crawl_update_time =datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        params = (zhihu_id,topics,url,title,content,creat_time,update_time,answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
        return insert_sql,params

 pipelines.py 

class MysqlTwistedZhihuPipline(object):
    # insert via a twisted adbapi connection pool
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset=settings["MYSQL_CHARSET"],
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=settings["MYSQL_USE_UNICODE"],
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to make the MySQL insert asynchronous
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # the item itself knows how to build its insert statement
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
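
Because do_insert just calls item.get_insert_sql(), any item class that implements that method can reuse this single pipeline; it only needs to be registered once in settings.py (the module path and priority below are examples):

ITEM_PIPELINES = {
    'spider_first.pipelines.MysqlTwistedZhihuPipline': 10,  # replace spider_first with your own project name
}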

 

 

11. 'dict' object has no attribute 'has_key' (Python 3 removed the has_key() method)

 

if adict.has_key(key1):  

Change it to:

if key1 in adict:  

 

12. Crawler error: Max retries exceeded with url

 

Don't scatter bare requests.post calls across the page-fetching code; use a single requests.session() for everything, disable keep-alive, and close the session when the run finishes (see the sketch below).

s = requests.session()
s.keep_alive = False
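
A minimal sketch of that pattern (the URL and payload are placeholders):

import requests

s = requests.session()
s.keep_alive = False  # avoid holding connections open between requests

try:
    # reuse the same session for every request instead of bare requests.post
    r = s.post("https://example.com/login", data={"user": "me"})
    print(r.status_code)
finally:
    s.close()  # release pooled connections when the run finishes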

 

13. The smoothest way to install Python 3 on Aliyun CentOS (compiling from source with make is a deep pit; use yum instead)

 

sudo yum install epel-release
sudo yum install python34
wget --no-check-certificate https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
pip3 -V

 
