@create_date: 2017-05-01
@author: huangyongye

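Preface: I suspect many people, like me, first learned Python in order to write a crawler script that scrapes data from the web. Pulling information off a web page for the first time felt great. Back then I relied mostly on regular expressions, but after a long stretch without using them I have forgotten most of it. Later I learned XPath, which is remarkably powerful and far more convenient than regex. For text data, storing what you scrape in MongoDB is a good choice; at least that is what I did. Recently I needed to process XML data and found plenty of introductions online, but few of them seem to cover extracting hierarchical information, so I wrote these notes from my own experiments so that I don't forget.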

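The most important principle is still this: grab the big pieces first, then the small ones. That is, first select the enclosing nodes, then extract the fields inside each of them.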

import re
import pymongo
from pymongo import MongoClient

Getting started with MongoDB

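To make the data easier to manage, you can add a modification-time field to every document. Then, if you ever want to delete some recently added documents, you can do so easily by filtering on that field.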

HOST = '10.103.**'
PORT = 27017
DB_NAME = 'db name'
TB_NAME = 'table name'
client = MongoClient(HOST, PORT)
db = client[DB_NAME]
tb = db[TB_NAME]

# Insert a single document
tb.insert_one({'journal': 'abc', 'title':'bbb', 'count':222})

<pymongo.results.InsertOneResult at 0x7fe8b31f5d20>

# Insert multiple documents
tb.insert_many([{'journal':'1', 'title':'2', 'count':'2'}, {'jouranl':'3', 'title':'3'}])

<pymongo.results.InsertManyResult at 0x7fe8ec754500>

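# List all collections in the database
# (collection_names() is deprecated in newer PyMongo releases in favor of list_collection_names())
db.collection_names()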

[u'system.indexes', u'guke_journal']

# Iterate over the documents in the collection
result = tb.find()
for item in result:
    print item

{u'count': 222, u'journal': u'abc', u'_id': ObjectId('590e8f286bba81656e28b9c5'), u'title': u'bbb'}
{u'count': u'2', u'journal': u'1', u'_id': ObjectId('590e8f2a6bba81656e28b9c6'), u'title': u'2'}
{u'_id': ObjectId('590e8f2a6bba81656e28b9c7'), u'jouranl': u'3', u'title': u'3'}

# Count the documents that match a filter
result = tb.find({u'count': 222})
print 'find num', result.count()
# Delete every document in the collection that matches the filter
result = tb.delete_many({u'count': 222})
print 'delete num', result.deleted_count

find num 1
delete num 1

# Delete all documents in the collection; an empty filter matches every document
result = tb.delete_many({})
print result.deleted_count
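
The modification-time idea mentioned at the start of this section could look roughly like the sketch below. It is written for Python 3, and the helper names `with_timestamp` and `added_since` as well as the field name `modified_at` are assumptions made for illustration, not part of pymongo. The helpers only build plain dictionaries, so they run without a database; the actual `insert_one` and `delete_many` calls appear as comments because they need a reachable server.

```python
from datetime import datetime, timedelta, timezone


def with_timestamp(doc, now=None):
    """Return a copy of doc with a 'modified_at' field (hypothetical field name)."""
    stamped = dict(doc)
    stamped['modified_at'] = now if now is not None else datetime.now(timezone.utc)
    return stamped


def added_since(minutes, now=None):
    """Build a filter that matches documents stamped within the last `minutes` minutes."""
    now = now if now is not None else datetime.now(timezone.utc)
    return {'modified_at': {'$gte': now - timedelta(minutes=minutes)}}


doc = with_timestamp({'journal': 'abc', 'title': 'bbb', 'count': 222})
print(doc)
print(added_since(10))

# With the `tb` collection defined earlier and a running MongoDB server:
#   tb.insert_one(doc)
#   tb.delete_many(added_since(10))   # removes documents added in the last 10 minutes
```

Stamping every document on insert keeps the cleanup a single `delete_many` call with a `$gte` filter instead of hunting down individual `_id` values.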

Extracting hierarchical information from XML with XPath

XPath syntax
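
As a quick reminder of the most common XPath constructs, here is a small sketch written for Python 3 with lxml. The mini-document is a hypothetical, cut-down version of the bookstore example and exists only so the expressions have something to match.

```python
from lxml import etree

# Hypothetical mini-document, used only to try out the expressions below.
doc = etree.XML(
    '<bookstore>'
    '<book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>'
    '<book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)

# "/" walks down from the root one level at a time
print(doc.xpath('/bookstore/book/title/text()'))
# "//" matches nodes at any depth in the document
print(doc.xpath('//price/text()'))
# "@" selects attribute values
print(doc.xpath('//book/@category'))
# "[...]" filters nodes with a predicate, here on an attribute
print(doc.xpath('//book[@category="WEB"]/title/text()'))
# A predicate can also test a child value (compared as a number)
print(doc.xpath('//book[price > 35]/title/text()'))
# "[1]" picks the first matching child; XPath positions start at 1
print(doc.xpath('/bookstore/book[1]/title/text()'))
```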

from lxml import etree

str_XML = """<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>"""

# XPath navigation in XML works like file lookup in an operating system: "." refers to the current node (directory).
# To find the text of every title under the current node, start the path with ".": ".//title/text()"
# Using "//title/text()" instead still searches the whole XML document,
# not just the subtree below the node you have already reached.
tree = etree.XML(str_XML)
books = tree.xpath('//book')
# Number of book nodes
print 'num of book node:', len(books)
# Number of direct children of the first book node
book0 = books[0]
print 'menus num inside the first book node:', len(books[0])
# Text of the title nodes under the first book node
title = book0.xpath('.//title/text()')
print 'inside the first book node:', len(title)
print title
# Without the leading ".", the query returns the title text from the whole document
title = book0.xpath('//title/text()')
print 'find in the whole file:', len(title)
print title

num of book node: 4
menus num inside the first book node: 4
inside the first book node: 1
['Everyday Italian']
find in the whole file: 4
['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Learning XML']
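
Putting the "big first, then small" principle together with the relative-path rule above, one way to turn each book into a flat record might look like the following. This is only a sketch for Python 3: `first_text` and `parse_books` are hypothetical helper names, the XML is a shortened copy of the bookstore document, and the code assumes every book has a year and a price.

```python
from lxml import etree

# Shortened copy of the bookstore document. A bytes literal is used because
# lxml rejects unicode strings that carry an encoding declaration.
XML_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>"""


def first_text(node, path):
    """Return the first text result of a relative XPath query as a plain str, or None."""
    found = node.xpath(path)
    return str(found[0]) if found else None


def parse_books(xml_bytes):
    """Select the book nodes first, then pull the fields out of each one."""
    tree = etree.XML(xml_bytes)
    records = []
    for book in tree.xpath('//book'):          # the big pieces
        records.append({                       # the small pieces, relative to this book
            'category': book.get('category'),
            'title': first_text(book, './title/text()'),
            'authors': [str(a) for a in book.xpath('./author/text()')],
            'year': int(first_text(book, './year/text()')),     # assumes <year> exists
            'price': float(first_text(book, './price/text()')), # assumes <price> exists
        })
    return records


records = parse_books(XML_BYTES)
for record in records:
    print(record)
```

Because every query inside the loop starts with ".", the authors of one book never leak into another record. The resulting list of dictionaries has the shape that `tb.insert_many(records)` from the MongoDB section expects.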
