Processing CSV, JSON, and XML Text with Python

1. CSV

Introduction:

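CSV (Comma-Separated Values) is a format in which fields are separated by commas. CSV is not truly structured data: the contents of a CSV file are just raw strings delimited by commas. Although str.split(',') can split simple CSV lines, some field values may themselves contain commas, so Python provides the csv module in its standard library specifically for parsing and generating CSV.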
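
To illustrate why naive splitting is not enough, here is a minimal sketch that compares str.split(',') with csv.reader on a single line whose last field contains a comma. The sample line is made up for illustration.

```python
import csv
import io

# Hypothetical sample line: the third field is quoted because it contains a comma.
line = '9,Web Client and Server,"base64,urllib"\n'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive = line.strip().split(',')

# csv.reader understands the quoting rules and returns three clean fields.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

The naive split yields four pieces instead of three, which is the kind of error the csv module is meant to prevent.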

Example: this script writes data out in CSV format and then reads it back in.

Input:

import csv

# Note: distutils (the former source of printf) was removed in Python 3.12,
# so the built-in print() is used instead.

DATA = (
    (9, 'Web Client and Server', 'base64,urllib'),
    (10, 'Web Programming:CGI & WSGI', 'cgi,time,wsgiref'),
    (11, 'Web Services', 'urllib, twython'),
)

print('***WRITING CSV DATA')
# newline='' lets the csv module control line endings itself
with open('bookdata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for record in DATA:
        writer.writerow(record)

print('***REVIEW OF SAVED DATA')
with open('bookdata.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for chap, title, modpkgs in reader:
        print('Chapter %s: %r (featuring %s)' % (chap, title, modpkgs))

Output:

***WRITING CSV DATA

***REVIEW OF SAVED DATA
Chapter 9: 'Web Client and Server' (featuring base64,urllib)
Chapter 10: 'Web Programming:CGI & WSGI' (featuring cgi,time,wsgiref)
Chapter 11: 'Web Services' (featuring urllib, twython)
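
The round trip above works because csv.writer quotes any field that contains a comma. As a rough sketch, the raw text it produces can be inspected by writing to an in-memory buffer (io.StringIO is used here as a stand-in for bookdata.csv):

```python
import csv
import io

# Same records as in the example above.
DATA = (
    (9, 'Web Client and Server', 'base64,urllib'),
    (10, 'Web Programming:CGI & WSGI', 'cgi,time,wsgiref'),
    (11, 'Web Services', 'urllib, twython'),
)

buf = io.StringIO()  # in-memory stand-in for the file on disk
csv.writer(buf).writerows(DATA)
raw = buf.getvalue()

# Only the fields that contain commas are wrapped in double quotes.
print(raw)
```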

2. JSON

Introduction:

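JSON stands for JavaScript Object Notation. As the name suggests, it comes from the JavaScript world: JSON is based on a subset of JavaScript syntax and is designed specifically for describing structured data, in a form that is easier for humans to read. For more information about JSON, visit http://json.org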

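Since Python 2.6, the standard library has supported JSON through the json module. It provides dump() and load() for writing to and reading from file objects, and dumps() and loads() for converting to and from strings.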
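
The following is a minimal sketch of the json module's interfaces: the string-based dumps()/loads() pair and the file-based dump()/load() pair. The sample record is made up, and io.StringIO stands in for a real file opened with open().

```python
import io
import json

data = {'title': 'core python', 'year': 2009}  # made-up sample record

# dumps()/loads() convert to and from a str.
text = json.dumps(data)
from_text = json.loads(text)

# dump()/load() write to and read from a file-like object.
buf = io.StringIO()  # stand-in for a real file
json.dump(data, buf)
buf.seek(0)          # rewind before reading back
from_file = json.load(buf)

print(text)
print(from_text == from_file == data)
```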

Example 1: JSON objects look a lot like Python dictionaries. The following example shows how to convert between JSON and dictionary objects.

Input:

import json
dictionary = dict(zip('abcde', range(5)))  # a Python dict
print(dictionary)

dict2json = json.dumps(dictionary)  # convert the dict to JSON text (a str)
print(dict2json)

json2dict = json.loads(dict2json)  # the reverse of dumps: convert JSON text back to a dict
print(json2dict)

Output:

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
{"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
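
The round trip above is lossless because the dictionary uses string keys and integer values. As a small sketch of where JSON and Python dictionaries differ, the made-up data below uses an integer key and a tuple value: JSON object keys are always strings, and JSON has arrays but no tuples.

```python
import json

# Made-up data with an int key and a tuple value.
original = {1: ('a', 'b')}

restored = json.loads(json.dumps(original))

print(restored)              # the key is now a str and the tuple is now a list
print(restored == original)  # so the round trip is not an exact copy
```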

Example 2: convert a Python dictionary to JSON and display it in several formats.

Input:

from json import dumps
from pprint import pprint

# Note: distutils (the former source of printf) was removed in Python 3.12,
# so the built-in print() is used throughout.

BOOKS = {
    '001':{
        'title':'core python',
        'edition':'2',
        'year':'7',
    },
    '002':{
        'title':'python web',
        'authors':['jeff','paul','wesley'],
        'year':'2009',
    },
    '003':{
        'title':'python fundamentals',
        'year':'2009',
    },
}

print('*** raw dict ***')
print(BOOKS)

print('\n*** pretty_printed dict ***')
pprint(BOOKS)

print('\n*** raw json ***')
print(dumps(BOOKS))

print('\n*** pretty_printed json ***')
print(dumps(BOOKS, indent=4))

Output:

*** raw dict ***
{'001': {'title': 'core python', 'edition': '2', 'year': '7'}, '002': {'title': 'python web', 'authors': ['jeff', 'paul', 'wesley'], 'year': '2009'}, '003': {'title': 'python fundamentals', 'year': '2009'}}

*** pretty_printed dict ***
{'001': {'edition': '2', 'title': 'core python', 'year': '7'},
 '002': {'authors': ['jeff', 'paul', 'wesley'],
         'title': 'python web',
         'year': '2009'},
 '003': {'title': 'python fundamentals', 'year': '2009'}}

*** raw json ***
{"001": {"title": "core python", "edition": "2", "year": "7"}, "002": {"title": "python web", "authors": ["jeff", "paul", "wesley"], "year": "2009"}, "003": {"title": "python fundamentals", "year": "2009"}}

*** pretty_printed json ***
{
    "001": {
        "title": "core python",
        "edition": "2",
        "year": "7"
    },
    "002": {
        "title": "python web",
        "authors": [
            "jeff",
            "paul",
            "wesley"
        ],
        "year": "2009"
    },
    "003": {
        "title": "python fundamentals",
        "year": "2009"
    }
}
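
Besides indent, dumps() accepts a few other formatting options. The sketch below, using one record from the example above, shows sort_keys (alphabetical key order, similar to what pprint does for dicts) and separators (for the most compact output).

```python
from json import dumps

book = {'title': 'python web', 'authors': ['jeff', 'paul', 'wesley'], 'year': '2009'}

# sort_keys=True orders the keys alphabetically.
pretty = dumps(book, indent=4, sort_keys=True)

# Separators without spaces give the most compact form.
compact = dumps(book, separators=(',', ':'))

print(pretty)
print(compact)
```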

3. XML

Introduction:

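XML is also used to represent structured data. Although XML data is plain text, it is arguably not human-readable; it only becomes readable with the help of a parser. XML predates JSON and has historically been the more widely used of the two.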

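Python first supported XML in version 1.5 through the xmllib module, which was eventually absorbed into the xml package. Version 2.5 matured this support further by adding ElementTree to the standard library: a widely used, fast, and Pythonic XML document parser and generator.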
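
As a quick orientation, here is a minimal sketch of parsing with ElementTree. The XML string is made up for illustration; fromstring() parses it into an Element, and find() returns the first matching child.

```python
from xml.etree import ElementTree

# Made-up XML document for illustration.
xml_text = '<books><book id="001"><title>core python</title></book></books>'

root = ElementTree.fromstring(xml_text)  # parse a string into an Element
book = root.find('book')                 # first matching child element

print(root.tag)
print(book.get('id'))           # read an attribute
print(book.find('title').text)  # read an element's text
```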

Example 1: convert a Python dictionary to XML and display it in several formats.

Input:

from xml.etree.ElementTree import Element,SubElement,tostring  # cElementTree was removed in Python 3.9
from xml.dom.minidom import parseString

BOOKS = {
    '001':{
        'title':'core python',
        'edition':'2',
        'year':'7',
    },
    '002':{
        'title':'python web',
        'authors':'jeff:paul:wesley',
        'year':'2009',
    },
    '003':{
        'title':'python fundamentals',
        'year':'2009',
    },
}

books = Element('books')
for isbn, info in BOOKS.items():
    book = SubElement(books,'book')
    info.setdefault('authors','wesley chun')
    info.setdefault('edition',1)
    for key, val in info.items():
        SubElement(book,key).text = ','.join(str(val).split(':'))
xml = tostring(books)

print('*** raw xml ***')
print(xml)

print('\n*** pretty-printed xml ***')
dom = parseString(xml)
print(dom.toprettyxml(' '))

print('*** flat structure ***')
for elmt in books.iter():
    print(elmt.tag,'-',elmt.text)

print('\n*** titles only ***')
for book in books.findall('.//title'):
    print(book.text)

Output:

*** raw xml ***
b'<books><book><title>core python</title><edition>2</edition><year>7</year><authors>wesley chun</authors></book><book><title>python web</title><authors>jeff,paul,wesley</authors><year>2009</year><edition>1</edition></book><book><title>python fundamentals</title><year>2009</year><authors>wesley chun</authors><edition>1</edition></book></books>'

*** pretty-printed xml ***
<?xml version="1.0" ?>
<books>
 <book>
  <title>core python</title>
  <edition>2</edition>
  <year>7</year>
  <authors>wesley chun</authors>
 </book>
 <book>
  <title>python web</title>
  <authors>jeff,paul,wesley</authors>
  <year>2009</year>
  <edition>1</edition>
 </book>
 <book>
  <title>python fundamentals</title>
  <year>2009</year>
  <authors>wesley chun</authors>
  <edition>1</edition>
 </book>
</books>

*** flat structure ***
books - None
book - None
title - core python
edition - 2
year - 7
authors - wesley chun
book - None
title - python web
authors - jeff,paul,wesley
year - 2009
edition - 1
book - None
title - python fundamentals
year - 2009
authors - wesley chun
edition - 1

*** titles only ***
core python
python web
python fundamentals
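
The raw XML above is printed with a b'...' prefix because tostring() returns bytes by default in Python 3. A small sketch of getting a str instead, by passing encoding='unicode' (the one-book tree here is a cut-down version of the example):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

books = Element('books')
book = SubElement(books, 'book')
SubElement(book, 'title').text = 'core python'

as_bytes = tostring(books)                     # bytes, printed with a b'...' prefix
as_text = tostring(books, encoding='unicode')  # str, no prefix

print(as_bytes)
print(as_text)
```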

Example 2: display the current top headline news stories (5 by default) together with their corresponding links from the Google News service.

Input:

from io import BytesIO
from urllib.request import urlopen
from pprint import pprint
from xml.etree import ElementTree

g = urlopen('https://news.google.com/news?topic=h&output=rss')  # h stands for headline news
f = BytesIO(g.read())
g.close()
tree = ElementTree.parse(f)  # parse the XML with ElementTree
f.close()

def topnews(count=5):  # yield 5 news items by default
    pair = [None, None]
    skip = False  # defined up front in case a link precedes the first title
    for elmt in tree.iter():  # getiterator() was removed in Python 3.9
        if elmt.tag == 'title':
            # The feed begins with a section title, so tell it apart from real headline titles
            skip = (elmt.text or '').startswith('Top Stories')
            if skip:
                continue
            pair[0] = elmt.text
        if elmt.tag == 'link':
            if skip:
                continue
            pair[1] = elmt.text
            if pair[0] and pair[1]:  # only yield when both title and link are present
                count -= 1
                yield tuple(pair)
                if not count:
                    return
                pair = [None, None]

for news in topnews():
    pprint(news)

Output:

('This RSS feed URL is deprecated', 'https://news.google.com/news')
('Keeping Summit Hopes Alive Suggests Kim Jong-un May Need a Deal - New York '
 'Times',
 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEp8ER18RwtH8pZKJBxzAXKyAUMHA&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779907582906&ei=RSAJW4i4JIK94gKYwp24Dg&url=https://www.nytimes.com/2018/05/26/world/asia/kim-summit-trump.html')
('Science teacher who tackled student gunman among 2 wounded at Indiana middle '
 'school - Chicago Tribune',
 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNF_qIL2IA4IJRhgVGx6jDWacZIeOg&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779910889936&ei=RSAJW4i4JIK94gKYwp24Dg&url=http://www.chicagotribune.com/news/nationworld/midwest/ct-noblesville-west-middle-school-20180525-story.html')
("Trump says he'll spare Chinese telecom firm ZTE from collapse, defying "
 'lawmakers - Washington Post',
 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEF9XmwZpRtbcM7yq9Fw_ieMmYv6g&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779911079543&ei=RSAJW4i4JIK94gKYwp24Dg&url=https://www.washingtonpost.com/business/economy/congress-threatens-to-block-deal-between-white-house-china-to-save-telecom-giant-zte/2018/05/25/1db326ba-604a-11e8-9ee3-49d6d4814c4c_story.html')
('USC President CL Max Nikias to step down - Los Angeles Times',
 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEFRNpZDmQoOJGsesa5yUSgga0fbA&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779910766807&ei=RSAJW4i4JIK94gKYwp24Dg&url=http://www.latimes.com/local/lanow/la-me-max-nikias-usc-20180525-story.html')
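
Since the first result reports that the feed URL is deprecated, the same idea can be tried offline. The sketch below is a variant, not the original approach: it assumes a standard RSS 2.0 layout where each story is an item element, and it pairs title and link per item instead of walking every element. The feed text and example.com URLs are made up.

```python
from xml.etree import ElementTree

# Made-up RSS fragment; a real feed has more elements per item.
RSS = """<rss version="2.0"><channel>
<title>Top Stories</title>
<link>https://example.com/</link>
<item><title>First headline</title><link>https://example.com/1</link></item>
<item><title>Second headline</title><link>https://example.com/2</link></item>
<item><title>Third headline</title><link>https://example.com/3</link></item>
</channel></rss>"""

def topnews(root, count=5):
    """Yield up to count (title, link) pairs, one per <item> element."""
    for item in root.iter('item'):
        if count <= 0:
            return
        title = item.findtext('title')
        link = item.findtext('link')
        if title and link:  # only yield complete pairs
            count -= 1
            yield (title, link)

root = ElementTree.fromstring(RSS)
headlines = list(topnews(root, count=2))
print(headlines)
```

Iterating over item elements means the channel-level title ("Top Stories" here) never needs to be skipped explicitly.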

4. References

Wesley Chun. Python核心编程: 第3版 (Core Python Applications Programming, 3rd edition) [M]. 人民邮电出版社 (Posts & Telecom Press), 2016.
