Update the database content

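More than 20,000 records have already been scraped, and only now does it turn out that their format is wrong. What can be done?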

A scraped record looks like this:

[{"title": "工艺：", "content": ["油爆"]}, {"title": "口味：", "content": ["咸鲜味"]}, {"title": "菜系：", "content": ["福建菜"]}, {"title": "功效：", "content": ["福建菜", "通乳调理", "气血双补调理", "营养不良调理"]}, {"title": "主料：", "content": ["河虾250克"]}, {"title": "辅料：", "content": ["竹笋35克", "香菇（鲜）10克", "青椒20克", "红萝卜25克"]}, {"title": "调料：", "content": ["大葱10克 鸡蛋清10克 大蒜5克 淀粉（豌豆）12克 白砂糖5克 盐3克 味精1克 料酒3克 胡椒1克 植物油75克 各适量"]}]

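Running this through an online JSON parser shows a list of objects, each with a title string and a content list of strings.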

Looking at a single entry:

{"title": "功效：", "content": ["福建菜", "通乳调理", "气血双补调理", "营养不良调理"]}
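
To make the problem concrete, here is a minimal sketch that uses only the standard library. It parses the entry above and shows that content comes back as a Python list, next to the single string we would rather store. The variable names are illustrative, and joining the items with a space is an assumption about the desired target format.

```python
import json

# The entry from above, as the JSON text stored in the database.
raw = '{"title": "功效：", "content": ["福建菜", "通乳调理", "气血双补调理", "营养不良调理"]}'

entry = json.loads(raw)
print(type(entry["content"]).__name__)  # content is a list, not a string

# Target shape: the same entry with content collapsed into one string.
flat = {"title": entry["title"], "content": " ".join(entry["content"])}
print(json.dumps(flat, ensure_ascii=False))
```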

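The values in content are still wrapped in a list, and we need to pull them out into a plain string. There are two ways to handle this:

(1) The data should have been tidied up when the spider code was written. With a small data set you can fix the code and run the crawl again, but with this many records a re-run is not practical.

(2) Use pipelines to tidy the data afterwards, which is also a routine part of data cleaning. The script looks like this: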

# This script can be run from any directory, as long as the environment is set up correctly.
import json

import pymysql.cursors

# Connect to the database; autocommit=True commits each UPDATE immediately.
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikemy.com',
                             cursorclass=pymysql.cursors.DictCursor,
                             autocommit=True)
try:
    with connection.cursor() as cursor:

        sql = "SELECT `id`, `gongyi` FROM `total_copy1`"
        cursor.execute(sql)  # run the query
        result = cursor.fetchall()  # fetch every row
        for r in result:
            g = json.loads(r['gongyi'])  # parse the JSON string into a list of dicts
            # once parsed, each value is available by key, e.g. g2['title']
            d = []  # the cleaned gongyi entries for this row
            for g2 in g:
                a = g2['title']  # the title string
                b = g2['content']  # a list of strings
                strx = ' '.join(b)  # join the list into one space-separated string
                d.append({"title": a, "content": strx})
            with connection.cursor() as cursor2:
                # Serialize with json.dumps so quotes inside values are escaped;
                # ensure_ascii=False keeps the Chinese text readable.
                d2 = json.dumps(d, ensure_ascii=False)
                print(d2)
                sql = "UPDATE `total_copy1` SET `gongyi`=%s WHERE `id`=%s"  # write the cleaned value back
                cursor2.execute(sql, (d2, r['id']))  # match the row by id
                print(r['id'])
finally:
    connection.close()
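
The same transformation can be pulled out into a small pure function, which makes it easy to test without a database and to reuse at crawl time, as approach (1) suggests. The sketch below is an illustration under stated assumptions, not the original author's code: flatten_gongyi and FlattenGongyiPipeline are hypothetical names, the item is assumed to be dict-like with a gongyi field holding the JSON text, and the class only follows the process_item(item, spider) shape that Scrapy item pipelines use. It serializes with json.dumps so that quotes inside a value are escaped properly, and it leaves already-flattened content alone so the clean-up can be re-run safely.

```python
import json


def flatten_gongyi(raw):
    """Return the gongyi JSON text with every content list joined into one string."""
    entries = json.loads(raw)
    flattened = []
    for entry in entries:
        content = entry["content"]
        # Only join lists; a string means the entry was already cleaned.
        if isinstance(content, list):
            content = " ".join(content)
        flattened.append({"title": entry["title"], "content": content})
    # json.dumps escapes quotes; ensure_ascii=False keeps Chinese text readable.
    return json.dumps(flattened, ensure_ascii=False)


class FlattenGongyiPipeline:
    """Hypothetical crawl-time hook, shaped like a Scrapy item pipeline."""

    def process_item(self, item, spider):
        item["gongyi"] = flatten_gongyi(item["gongyi"])
        return item


# Usage: no database or crawler is needed to try it out.
raw = '[{"title": "辅料：", "content": ["竹笋35克", "青椒20克"]}]'
cleaned = flatten_gongyi(raw)
print(cleaned)
# Running it a second time changes nothing, so a re-run is harmless.
print(flatten_gongyi(cleaned) == cleaned)
```

In a Scrapy project a class like this would be enabled through the ITEM_PIPELINES setting, and the update script could call flatten_gongyi(r['gongyi']) instead of building the list by hand.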

Reposted from blog.csdn.net/u014229742/article/details/82348287