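30.5 VADER Sentiment Analysis
Next we turn to VADER, an unsupervised, rule-based technique for scoring the sentiment of text.
30.5.1 Configuring the Lexicon
First we need to define the lexicon, because we have to update it to suit finance-related text. A lexicon is a dictionary of terms that carries sentiment values for the different parts of a sentence. Lexicons are generated with a variety of techniques, including word position, surrounding words, context, and part of speech (POS). Each part is assigned a sentiment value, and the values are aggregated into a sentiment score for each sentence.
This project uses NLTK's VADER module. It ships with its own lexicon, but that lexicon is not well suited to financial tasks, so we need to update it. We will use the lexicon from Oliveira et al.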
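To make the idea of aggregating per-word values concrete, here is a minimal sketch of lexicon-based scoring. The words and values in `toy_lexicon` are made up for illustration, and `toy_compound` is a hypothetical helper: it only sums word valences and squashes the total into (-1, 1) with the x / sqrt(x² + 15) normalization that VADER applies to its compound score. The real analyzer also applies rules for negation, intensifiers, punctuation, and capitalization, none of which are modelled here.

```python
import math

# Made-up valences for illustration only (not taken from any real lexicon)
toy_lexicon = {"gains": 2.0, "upgrade": 1.5, "slump": -2.5, "downgrade": -1.5}

def toy_compound(text, lexicon, alpha=15):
    """Sum the valence of every known word, then normalize into (-1, 1)."""
    words = text.lower().split()
    total = sum(lexicon.get(w, 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)

positive = toy_compound("shares post early gains after upgrade", toy_lexicon)
negative = toy_compound("shares slump after downgrade", toy_lexicon)
neutral = toy_compound("the company reports tomorrow", toy_lexicon)
print(positive, negative, neutral)
```

Words missing from the lexicon contribute nothing, which is why a general-purpose lexicon tends to score finance-specific headlines close to zero.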
(1) The following command lists the files and folders in the /input/news-trading/ directory. Make sure you run it in an environment where that path exists.
!ls /input/news-trading/
Running it prints:
combined.csv lexicon_sentiment.csv return_data.csv
headlines.csv newsdata_labelled_v2.csv sent_label_added_corpus.csv
headlines_archive newsdata_labelled_v3.csv
(2) The following code reads lexicon_sentiment.csv from that directory into a Pandas DataFrame named lexicon for use in the sentiment analysis that follows.
import pandas as pd
# Read the sentiment lexicon from the given path into a DataFrame
lexicon = pd.read_csv('/input/news-trading/lexicon_sentiment.csv')
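(3) The following code initializes the VADER sentiment analyzer and merges the custom lexicon with VADER's default lexicon. It then computes a sentiment score for every headline in the input DataFrame ldf and stores the result in a new column, vader_sent.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np

def set_analyser(lex, df, corpus):
    # Prepare the VADER sentiment analyzer
    # (requires the 'vader_lexicon' resource: nltk.download('vader_lexicon'))
    sia = SentimentIntensityAnalyzer()

    # Score each custom-lexicon entry as the mean of its two score columns
    lex['sentiment'] = (lex['Aff_Score'] + lex['Neg_Score']) / 2
    lex = dict(zip(lex.Item, lex.sentiment))
    # Keep single-word entries only
    lex = {k: v for k, v in lex.items() if len(k.split(' ')) == 1}

    # Rescale the scores into [-4, 4], the range used by VADER's own lexicon
    max_score, min_score = max(lex.values()), min(lex.values())
    slex = {}
    for k, v in lex.items():
        if v > 0:
            slex[k] = v / max_score * 4
        else:
            slex[k] = v / min_score * -4

    # Merge the lexicons; VADER's default entries are applied last,
    # so they take precedence for words that appear in both
    flex = {}
    flex.update(slex)
    flex.update(sia.lexicon)
    sia.lexicon = flex

    # Score the sentiment of each text
    sentiments = np.array([sia.polarity_scores(s)['compound'] for s in df[corpus]])
    return sia, sentiments

# Initialize the analyzer and compute the sentiment scores
analyser, sentiments = set_analyser(lexicon, ldf, 'headline')
ldf['vader_sent'] = sentiments  # store the scores in a new column
ldf.head()  # show the first rows of the DataFrame
Running it prints: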
   ticker  headline                                           date         eventRet  perc_sent  lstm  vader_sent
0  AMZN    Whole Foods (WFMI) -5.2% following a downgrade...  2011-05-02   0.031269  0          0      0.2115
1  NFLX    Netflix (NFLX +1.1%) shares post early gains a...  2011-05-02   0.012173  1          1      0.8575
2  MSFT    The likely winners in Microsoft's (MSFT -1.4%)...  2011-05-10  -0.007741  1          1      0.6971
3  MSFT    Microsoft (MSFT -1.2%) and Skype signed their ...  2011-05-10  -0.007741  0          0      0.7751
5  AMZN    Amazon.com (AMZN -1.7%) shares slip as comment...  2011-05-12   0.010426  0          1     -0.0413
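The rescaling and merging steps inside set_analyser can be tried in isolation. The sketch below uses made-up words and scores (they are not entries from lexicon_sentiment.csv or from VADER's lexicon), and `rescale_lexicon` is a hypothetical helper that mirrors the loop above, except that zero scores are handled explicitly. It also shows that the order of the two `update` calls decides which lexicon wins when a word appears in both.

```python
def rescale_lexicon(scores, limit=4.0):
    """Scale positive scores by the largest value and negative scores by the
    most negative value, so every result lies within [-limit, limit]."""
    hi = max(scores.values())
    lo = min(scores.values())
    scaled = {}
    for word, value in scores.items():
        if value > 0:
            scaled[word] = value / hi * limit
        elif value < 0:
            scaled[word] = value / lo * -limit
        else:
            scaled[word] = 0.0
    return scaled

# Made-up custom scores, for illustration only
custom = {"bullish": 0.8, "rally": 0.4, "bearish": -0.5, "crash": -1.0}
scaled = rescale_lexicon(custom)

# Made-up stand-in for the analyzer's default lexicon
default = {"crash": -1.4, "good": 1.9}

# Same order as in set_analyser: custom first, defaults last (defaults win)
defaults_win = {}
defaults_win.update(scaled)
defaults_win.update(default)

# Reversed order: defaults first, custom last (custom wins)
custom_wins = {}
custom_wins.update(default)
custom_wins.update(scaled)

print(scaled)
print(defaults_win["crash"], custom_wins["crash"])
```

If the finance-specific values should override VADER's defaults for shared words, the reversed order is the one to use. The code in this section keeps the defaults for those words.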
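(4) The following code saves the DataFrame ldf as a CSV file named sent_label_added_corpus.csv, omitting the index. Because the file name is a relative path, the file is written to the current working directory. The command that follows lists the files in /input/news-trading/, which already contains a file of the same name.
# Save ldf as 'sent_label_added_corpus.csv' without the index
ldf.to_csv('sent_label_added_corpus.csv', index=False)
# List the files in /input/news-trading/
!ls /input/news-trading/
Running it prints: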
combined.csv lexicon_sentiment.csv return_data.csv
headlines.csv newsdata_labelled_v2.csv sent_label_added_corpus.csv
headlines_archive newsdata_labelled_v3.csv
(5) The following code reads /input/news-trading/sent_label_added_corpus.csv and loads its contents into the DataFrame ldf.
# Read the CSV file from the given path into ldf
ldf = pd.read_csv('/input/news-trading/sent_label_added_corpus.csv')
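30.5.2 Sentiment and Event Returns
Let us examine how sentiment polarity from the updated VADER lexicon relates to event returns. We will also look at the correlation among all the methods used (TextBlob, deep learning, and VADER).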
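(1) The following code draws a scatter plot of sentiment polarity, computed with the updated VADER lexicon, against event returns. The plot is faceted by ticker so that the relationship can be inspected stock by stock.
import plotly.express as px

# Scatter plot of VADER sentiment polarity against event return
fig = px.scatter(ldf, x='vader_sent', y='eventRet',     # x: VADER sentiment, y: event return
                 facet_col='ticker', facet_col_wrap=3,  # one facet per ticker, at most three per row
                 template='plotly_white',               # white template
                 width=None, height=1000,               # figure height
                 title='event return & vader sentiment scatter plot')  # figure title

# Adjust the marker styling
fig.update_traces(marker=dict(size=4,                   # marker size
                              line=dict(width=1,        # marker outline width
                                        color='DarkSlateGrey')),  # marker outline colour
                  opacity=0.3,                          # marker opacity
                  selector=dict(mode='markers'))        # apply to marker traces only

# Show the figure
fig.show(renderer='iframe')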
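The result is shown in Figure 30-8. The plot is faceted by ticker and drawn with Plotly so that the relationship between sentiment and return can be seen for each stock. It uses a light template, and the marker size and opacity are adjusted to improve readability.
Figure 30-8 Sentiment polarity from the updated VADER lexicon versus event returns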
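A faceted scatter plot is easier to read alongside one number per panel. The sketch below computes the Pearson correlation between sentiment and event return for each ticker in plain Python. The tickers `AAA` and `BBB` and their values are invented for illustration and are not taken from ldf; `pearson` and `per_ticker_corr` are hypothetical helper names.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

def per_ticker_corr(rows):
    """rows: iterable of (ticker, sentiment, event_return) tuples."""
    groups = defaultdict(lambda: ([], []))
    for ticker, sent, ret in rows:
        groups[ticker][0].append(sent)
        groups[ticker][1].append(ret)
    return {t: pearson(xs, ys) for t, (xs, ys) in groups.items()}

# Invented rows: AAA rises with sentiment, BBB falls with it
toy_rows = [
    ("AAA", 0.1, 0.01), ("AAA", 0.5, 0.03), ("AAA", 0.9, 0.05),
    ("BBB", -0.4, 0.02), ("BBB", 0.0, 0.00), ("BBB", 0.4, -0.02),
]
result = per_ticker_corr(toy_rows)
print(result)
```

With the real data, grouping ldf by ticker and calling pandas' `corr` on the vader_sent and eventRet columns should give the same kind of per-ticker figure, since pandas also defaults to the Pearson coefficient.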
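(2) The following code computes the correlations between the numeric columns of ldf and renders the correlation matrix with in-cell bars, using one colour for negative values and another for positive values.
# Correlation matrix of the numeric columns of ldf
# (numeric_only=True, available from pandas 1.5, skips text columns such as ticker and headline)
corr = ldf.corr(numeric_only=True)
# Render the correlation matrix with in-cell bars
corr.style\
    .bar(align='mid',                   # bar alignment: 'mid'
         color=['#15C3BA', '#CDE10F'])  # [negative, positive] bar colours
The result is shown in Figure 30-9.
Figure 30-9 Correlations between the columns of ldf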
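(3) The following code displays the contents of ldf so that the structure of the data and individual rows can be inspected. show_panel is not a pandas function; it is a helper that must already be defined in the session.
# Display the contents of ldf
show_panel(ldf)
Running it prints the following, where each record lists ticker, headline, date, eventRet, perc_sent, lstm, and vader_sent in that order: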
AMZN
Whole Foods (WFMI) -5.2% following a downgrade to "hold" from "buy" at Jefferies. Gas prices are the culprit, writes Scott Mushkin, claiming at $4/gallon, even the affluent who patronize the upscale grocery begin to take notice. Whole Foods announces earnings after Wednesday's close.
2011-05-02
0.031269
0
1
0.2115
NFLX
Netflix (NFLX +1.1%) shares post early gains after Citigroup ups its rating to Buy and lifts its price target to $300 from $245. U.S. revenue growth is sustainable, Citi says, "with a path to 50M subscribers by 2013," adding that NFLX has little competition in price, selection and convenience; mass market adoption of tablets will help, and the mass-market adoption phase is still to come.
2011-05-02
0.012173
1
1
0.8575
MSFT
The likely winners in Microsoft's (MSFT -1.4%) Skype buy: a hodgepodge of private-equity firms, pension funds and VCs (and eBay (EBAY +2.4%), whose $620M stake goes to $2.3B - making for a total eBay payday of more than $4B).
2011-05-10
-0.007741
1
1
0.6971
MSFT
Microsoft (MSFT -1.2%) and Skype signed their deal last night after finalizing the price in mid-April, the CEOs say in their webcast, and the product will be focused on mobile devices, video chat and social features as it gets integrated into Outlook mail and other products, including the Xbox console.
2011-05-10
-0.007741
0
1
0.7751
AMZN
Amazon.com (AMZN -1.7%) shares slip as comments circulate from CEO Jeff Bezos that the company will cease affiliate operations in states that force it to collect sales tax. Amazon has already announced plans to cancel its affiliate program in Illinois, and Bezos says "we will continue to drop states who pass those affiliate laws."
2011-05-12
0.010426
0
1
-0.0413
NVDA
Nvidia (NVDA -8.2%) shares slump after Needham downgrades the stock to Hold from Buy despite its solid Q1 beat and positive outlook, noting concerns about the chip company's core graphics processor business that are not fully offset by its momentum in tablets and smartphones. Citigroup and Bank of America also cut price targets.
2011-05-13
-0.077562
0
0
0.8886
GOOG
It's been some time coming, but Google (GOOG -1.9%) makes its first trip into the bond market with a planned $3B sale that should be "scooped up like nobody's business." The company had $35B in cash and marketable securities at 2010's end, but will pay back short-term borrowings as investment-grade borrowing costs are about the lowest since November.
2011-05-16
-0.031297
0
1
0.611
BA
In "a big win for Europe," the WTO partly overturns the ruling that found the EU had given illegal subsidies to Airbus (EADSY.PK +2%). While finding the subsidies still hurt U.S. interests - i.e., Boeing (BA +0.6%) - the panel didn't agree with reversing some subsidies.
2011-05-18
-0.008017
1
0
0.1511
MSFT
If you bought LinkedIn (LNKD, now legging higher again, +155% to $115) at $45, give yourself a high-five - though that probably means you're MS, BAC and other underwriters who took advantage of the company for serious underpricing of the IPO, Henry Blodget says. LNKD may have left well over $100M on the table.
2011-05-19
0.008149
1
1
0.7904
MSFT
From Jens Heycke, the top 5 things you could buy with $4B (um, make that $7B) instead of LinkedIn (LNKD +90%).
2011-05-19
0.008149
1
1
0.4896