Visual Analysis of Lagou Job Data with Python 3: A Detailed Walkthrough

This article shows how to analyze and visualize job-posting data from Lagou using Python 3. The sample code is described in detail and should be a useful reference for anyone learning or working with Python 3.
Preface

The previous article covered how to crawl job data from Lagou. Now that we have the data, we should not leave it sitting idle; let's take it out, analyze it, and see what information it contains.
Without further ado, let's go through the details.

First, preparation

The data crawled last time contains fields such as ID that we do not need, so we drop them first. We then look at the descriptive statistics to check for outliers or missing values.

import pandas as pd

read_file = "analyst.csv"
# Read the file to load the data
data = pd.read_csv(read_file, encoding="gbk")
# Drop the column that is irrelevant to the analysis
data = data.drop(['ID'], axis=1)
# Descriptive statistics
data.describe()

[Figure: output of data.describe()]

In the results, unique is the number of distinct values in a column. Taking 学历要求 (education requirement) as an example, it has four distinct values, including undergraduate, junior college and "any". top is the most frequent value, here the undergraduate degree, and freq is how many times that value occurs, 387. Since 薪资 (salary) has many more unique values, let's see which values it takes.

print(data['学历要求'].unique())
print(data['工作经验'].unique())
print(data['薪资'].unique())

[Figure: unique values of the 学历要求, 工作经验 and 薪资 columns]
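
To make the count, unique, top and freq rows of describe() concrete, here is a minimal sketch on a small made-up DataFrame. The rows are hypothetical and only stand in for analyst.csv, which is not reproduced here.

```python
import pandas as pd

# Hypothetical toy data; these rows are NOT taken from analyst.csv
toy = pd.DataFrame({
    "学历要求": ["本科", "本科", "大专", "本科", "不限"],
    "工作经验": ["1-3年", "3-5年", "1-3年", "1-3年", "不限"],
})

# For text columns, describe() reports four rows:
#   count  - number of non-null entries
#   unique - number of distinct values
#   top    - the most frequent value
#   freq   - how many times `top` occurs
summary = toy.describe()
print(summary)

# Pull out the statistics for one column
edu = summary["学历要求"]
```

Reading the column this way is a quick check on whether a field has few enough distinct values to be plotted directly.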

Second, preprocessing

The two figures above show that education requirement and work experience take only a few values and have no missing values or outliers, so they can be analyzed directly. Salary, however, takes 75 distinct values, so to analyze it more effectively we preprocess it first. Based on its distribution, we divide it into six ranges: below 5k, 5k-10k, 10k-20k, 20k-30k, 30k-40k, and 40k or above. To do this, we take the midpoint of each posted salary range and assign it to the matching range.

# Preprocess the salary column
def pre_salary(data):
    salarys = data['薪资'].values
    salary_dic = {}
    for salary in salarys:
        # Split on '-' and drop the trailing 'k', then convert both ends to integers
        min_sa = int(salary.split('-')[0][:-1])
        max_sa = int(salary.split('-')[1][:-1])
        # Midpoint of the posted range
        median_sa = (min_sa + max_sa) / 2
        # Assign the midpoint to a range. Each lower bound is inclusive, so a
        # midpoint of exactly 5, 10, 20 or 30 does not fall through to the last range
        if median_sa < 5:
            salary_dic[u'5k以下'] = salary_dic.get(u'5k以下', 0) + 1
        elif median_sa < 10:
            salary_dic[u'5k-10k'] = salary_dic.get(u'5k-10k', 0) + 1
        elif median_sa < 20:
            salary_dic[u'10k-20k'] = salary_dic.get(u'10k-20k', 0) + 1
        elif median_sa < 30:
            salary_dic[u'20k-30k'] = salary_dic.get(u'20k-30k', 0) + 1
        elif median_sa < 40:
            salary_dic[u'30k-40k'] = salary_dic.get(u'30k-40k', 0) + 1
        else:
            salary_dic[u'40以上'] = salary_dic.get(u'40以上', 0) + 1
    print(salary_dic)
    return salary_dic
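
As an alternative sketch, the same bucketing can be written with pandas.cut instead of a chain of if/elif branches. The sample salary strings are hypothetical, and the choice of right=False (a midpoint lying exactly on a boundary goes into the upper range) is an assumption made for this example, not something the original data dictates.

```python
import pandas as pd

# Hypothetical salary strings in the "<min>k-<max>k" form
salaries = pd.Series(["3k-5k", "8k-12k", "10k-20k", "15k-25k", "25k-35k", "40k-60k"])

# Extract the two numbers and take the midpoint of each range
bounds = salaries.str.extract(r"(\d+)[kK]-(\d+)[kK]").astype(int)
midpoint = bounds.mean(axis=1)

# right=False makes every bin include its lower edge: [0, 5), [5, 10), ...
edges = [0, 5, 10, 20, 30, 40, float("inf")]
labels = ["5k以下", "5k-10k", "10k-20k", "20k-30k", "30k-40k", "40以上"]
buckets = pd.cut(midpoint, bins=edges, labels=labels, right=False)

# Turn the result into a {label: count} dictionary
salary_counts = buckets.value_counts(sort=False).to_dict()
print(salary_counts)
```

The bin edges and labels live in two plain lists here, so changing the ranges later only means editing those lists.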

With salary preprocessed, we also need to preprocess the job requirement text. To make a word cloud, the text has to be segmented into words, and words that occur frequently but carry little meaning, known as stop words, have to be removed. We use the jieba library for this. jieba is a Python library for Chinese word segmentation, and it is very capable.

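import jieba

def cut_text(text):
    stopwords = ['熟悉','技术','职位','相关','工作','开发','使用','能力',
                 '优先','描述','任职','经验','经验者','具有','具备','以上','善于',
                 '一种','以及','一定','进行','能够','我们']
    # Segment the text, then drop the stop words from the result.
    # (jieba.del_word only edits the dictionary; it does not remove words from the output.)
    words = [word for word in jieba.lcut(text) if word not in stopwords]
    content = " ".join(words)
    return content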
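
Since a word cloud sizes words by how often they occur, it can help to check the most frequent words before drawing. The sketch below counts tokens with collections.Counter. The content string is a hypothetical stand-in for the space-separated text that cut_text returns, and skipping single-character tokens is a choice made for this example.

```python
from collections import Counter

# Hypothetical space-separated text, in the shape that cut_text returns
content = "数据 分析 python 数据 挖掘 统计 python 数据"

# Count tokens, skipping single-character ones, which rarely carry meaning
tokens = [word for word in content.split() if len(word) > 1]
word_freq = Counter(tokens)

# The two most frequent words and their counts
top_words = word_freq.most_common(2)
print(top_words)
```

If a meaningless word shows up near the top of this list, it is a candidate for the stop word list.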

With preprocessing done, we can move on to the visual analysis.

Third, visual analysis

We first write functions that draw a donut chart and a bar chart, and then simply pass the data in. The donut chart code is as follows:

import numpy as np
import matplotlib.pyplot as plt

def draw_pie(dic):
    labels = []
    count = []

    for key, value in dic.items():
        labels.append(key)
        count.append(value)

    fig, ax = plt.subplots(figsize=(8, 6), subplot_kw=dict(aspect="equal"))

    # Draw the pie; the width in wedgeprops sets the thickness of the ring
    wedges, texts = ax.pie(count, wedgeprops=dict(width=0.5), startangle=0)
    # Text box settings
    bbox_props = dict(boxstyle="square,pad=0.9", fc="w", ec="k", lw=0)
    # Connector line and arrow settings
    kw = dict(xycoords='data', textcoords='data', arrowprops=dict(arrowstyle="-"),
              bbox=bbox_props, zorder=0, va="center")

    for i, p in enumerate(wedges):
        ang = (p.theta2 - p.theta1)/2. + p.theta1
        y = np.sin(np.deg2rad(ang))
        x = np.cos(np.deg2rad(ang))
        # Decide on which side of the wedge the text box goes
        horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
        # Controls how the connector line bends
        connectionstyle = "angle,angleA=0,angleB={}".format(ang)
        kw["arrowprops"].update({"connectionstyle": connectionstyle})
        # annotate() labels the drawn figure; text is the label, and the
        # parameters containing 'xy' are coordinates
        text = labels[i] + ": " + str('%.2f' % ((count[i])/sum(count)*100)) + "%"
        ax.annotate(text, size=13, xy=(x, y), xytext=(1.35*np.sign(x), 1.4*y),
                    horizontalalignment=horizontalalignment, **kw)
    plt.show()

The bar chart code is as follows:

def draw_workYear(data):
    workyears = list(data[u'工作经验'].values)
    wy_dic = {}
    labels = []
    count = []
    # Count the postings for each work experience value
    for workyear in workyears:
        wy_dic[workyear] = wy_dic.get(workyear, 0) + 1
    print(wy_dic)
    # wy_series = pd.Series(wy_dic)
    # Split the dictionary into labels (keys) and counts (values)
    for key, value in wy_dic.items():
        labels.append(key)
        count.append(value)
    # One x position per label
    x = np.arange(len(labels)) + 1
    # Convert the counts to an array
    y = np.array(count)

    fig, axes = plt.subplots(figsize=(10, 8))
    axes.bar(x, y, color="#1195d0")
    plt.xticks(x, labels, size=13, rotation=0)
    plt.xlabel(u'工作经验', fontsize=15)
    plt.ylabel(u'数量', fontsize=15)

    # Write each count above its bar; ha and va set the alignment
    for a, b in zip(x, y):
        plt.text(a, b+1, '%.0f' % b, ha='center', va='bottom', fontsize=12)
    plt.show()
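
draw_pie expects a {label: count} dictionary. As a sketch, such a dictionary can be built from a DataFrame column with pandas value_counts. The rows below are hypothetical and only stand in for the real data, and edu_dic is a name chosen for this example.

```python
import pandas as pd

# Hypothetical rows standing in for the real DataFrame
data = pd.DataFrame({"学历要求": ["本科", "本科", "大专", "本科", "硕士"]})

# value_counts() tallies each distinct value; to_dict() gives {label: count}
edu_dic = data["学历要求"].value_counts().to_dict()
print(edu_dic)

# edu_dic has the shape draw_pie expects, e.g. draw_pie(edu_dic).
# The dictionary returned by pre_salary(data) has the same shape.
```

This replaces the manual counting loop with a single call and gives the same kind of mapping.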

We then turn the education requirement and salary data into dictionaries and pass them to the donut chart function written above. In addition, we also want to visualize the job requirement text.

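from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Draw the word cloud
def draw_wordcloud(content):

    wc = WordCloud(
        font_path=r'c:\Windows\Fonts\msyh.ttf',  # a font with Chinese glyphs
        background_color='white',
        max_font_size=150,  # largest font size
        min_font_size=24,  # smallest font size
        random_state=800,  # random seed, for a reproducible layout
        collocations=False,  # do not count two-word collocations, which avoids repeated words
        width=1600, height=1200, margin=35,  # image width and height, word spacing
    )
    wc.generate(content)

    plt.figure(dpi=160)  # scale the figure up or down
    plt.imshow(wc, interpolation='catrom', vmax=1000)
    plt.axis("off")  # hide the axes
    plt.show()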

Fourth, results and summary

[Figure: distribution of education requirements]
Most Python data analyst positions require an undergraduate degree, accounting for 86%.
[Figure: distribution of work experience requirements]
The chart shows that most Python data analyst positions ask for 1-5 years of work experience.
Salaries for Python data analysis are mostly in the 10k-20k range, and quite a few are above 40k. Higher salaries presumably come with higher requirements, so let's look at the job requirements.
[Figure: word cloud of job requirements]
The word cloud shows that data analysis calls for sensitivity to data, and that statistics, Excel, Python, data mining and Hadoop are also commonly required. Beyond that, positions ask for some resilience under pressure, problem-solving skills, good communication skills and the ability to think clearly.

Summary

That's all for this article. I hope it is a useful reference for your study or work.

Origin blog.csdn.net/haoxun03/article/details/104270008