Simple data analysis and data visualization with Python

This article is a first exploration of data analysis and a brief walk through its general process.
Data source: a project on the Kaggle platform, "Explore San Francisco city employee salary data".
Source code and original data: https://github.com/yb705/SF-Salaries
First, we import the third-party libraries (numpy, pandas, and so on), apply some initial settings for data visualization, and load the original data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
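# Show figures inline in IPython or Jupyter Notebook
%matplotlib inline
plt.style.use("fivethirtyeight")
# Prefer the SimHei font so that Chinese text renders correctly in figures
sns.set_style({'font.sans-serif': ['SimHei', 'Arial']})
# Adjust the path to wherever Salaries.csv is stored on your machine
original = pd.read_csv('C:\\Users\\1994y\\Desktop\\Salaries.csv')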

Preliminary data exploration
First, we inspect the data to check for missing values and to see the format of each column:

original.info()

[Image: output of original.info()]
You can see that some columns have missing values, so take care with those columns in the following steps.
The Dtype column shows the data format. Note that some columns look like numbers but are actually stored as strings (object). Before doing arithmetic or other numeric operations on them, we need to convert them to an int or float type:

objlist = ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']
for obj in objlist:
    original[obj] = pd.to_numeric(original[obj], errors = 'coerce')
original.info()

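[Image: output of original.info() after the conversion]
That's it. Besides to_numeric(), you can also use the astype() method to convert data formats. Note that astype() raises an error when a value cannot be parsed, whereas errors='coerce' turns such values into NaN.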
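The difference between the two conversions shows up as soon as a column contains a value that is not a number. The sketch below uses a small made-up Series (the values are illustrative, not taken from Salaries.csv).

```python
import pandas as pd

# Made-up sample: numbers stored as strings, plus one non-numeric entry.
raw = pd.Series(["100.5", "200", "Not Provided"])

# errors="coerce" turns anything unparseable into NaN instead of failing.
converted = pd.to_numeric(raw, errors="coerce")
print(converted.dtype)
print(converted.isna().sum())  # number of values that became NaN

# astype(float) is stricter: it raises ValueError on the same input.
try:
    raw.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True
print(astype_failed)
```

Counting the NaN values right after the conversion is a cheap way to see how many entries were silently discarded.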
Next, we can get a rough picture of the data, for example which years it covers:

original['Year'].unique()

[Image: output of original['Year'].unique()]
Average base pay per year:

original.groupby(["Year"])[["BasePay"]].mean()

[Image: average BasePay for each year]
You can also find the person with the lowest total pay:

original[original['TotalPayBenefits'] == original['TotalPayBenefits'].min()]

[Image: the row with the lowest TotalPayBenefits]
Why is the value negative? Does this person actually owe money?
The person with the most overtime pay:

original[original['OvertimePay'] == original['OvertimePay'].max()]

[Image: the row with the highest OvertimePay]
Alas, this overtime pay alone is higher than the blogger's regular salary, which is a little painful.
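The boolean-mask lookups above return every row that ties for the extreme value. When one row is enough, or when the top few are wanted, idxmin(), idxmax() and nlargest() are a possible alternative. This is a minimal sketch on made-up rows; only the column names are borrowed from the article's data.

```python
import pandas as pd

# Made-up rows for illustration only.
pay = pd.DataFrame({
    "EmployeeName": ["A", "B", "C", "D"],
    "OvertimePay": [0.0, 1200.0, 5300.0, 800.0],
    "TotalPayBenefits": [-50.0, 61000.0, 98000.0, 45000.0],
})

# idxmin/idxmax return the index label of the first extreme value.
lowest = pay.loc[pay["TotalPayBenefits"].idxmin()]
most_overtime = pay.loc[pay["OvertimePay"].idxmax()]

# nlargest returns the top-n rows, sorted in descending order.
top2_overtime = pay.nlargest(2, "OvertimePay")

print(lowest["EmployeeName"])
print(most_overtime["EmployeeName"])
print(top2_overtime["EmployeeName"].tolist())
```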

You can also use the groupby mechanism of pandas for grouped statistics, for example to view the average base pay of PT (part-time) and FT (full-time) employees:

original.groupby('Status')['BasePay'].mean()

[Image: average BasePay by Status]
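A grouped mean is easier to trust when it is shown next to the number of values behind it. The sketch below, on made-up rows, asks for both at once with agg(). By default, groupby leaves out rows whose grouping key is missing, and mean() skips missing values.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN stands in for a missing Status or BasePay.
staff = pd.DataFrame({
    "Status": ["FT", "FT", "PT", "PT", np.nan],
    "BasePay": [90000.0, 110000.0, 30000.0, np.nan, 50000.0],
})

# mean() skips NaN values; count() reports how many values were actually used.
summary = staff.groupby("Status")["BasePay"].agg(["mean", "count"])
print(summary)
```

Here the row without a Status does not appear in the result at all, and the PT average is based on a single value.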
Data visualization
Data visualization is one of the strengths of doing data analysis with Python. A few lines of code can produce a wide variety of charts that present the data intuitively, relying mainly on the third-party libraries matplotlib and seaborn. (The rich ecosystem of third-party libraries is itself a feature of Python.)
Use a bar chart to view the five job titles with the highest average pay in each of the four years:
Because we plan to present the result as a grid of subplots, we first filter the data for each of the four years:

a=original.loc[original["Year"]==2011]
b=original.loc[original["Year"]==2012]
c=original.loc[original["Year"]==2013]
d=original.loc[original["Year"]==2014]

The next step is to group the four sets of data and find the average salary for each:

a_2011 = a.groupby(["JobTitle"])[["TotalPay"]].mean().sort_values(by="TotalPay", ascending=False).reset_index()
# Intended to make the 2011 job titles consistent with the other three years.
# Note: capitalize() upper-cases only the first character and lower-cases the rest.
a_2011['JobTitle'] = a_2011['JobTitle'].str.capitalize()
b_2012 = b.groupby(["JobTitle"])[["TotalPay"]].mean().sort_values(by="TotalPay", ascending=False).reset_index()
c_2013 = c.groupby(["JobTitle"])[["TotalPay"]].mean().sort_values(by="TotalPay", ascending=False).reset_index()
d_2014 = d.groupby(["JobTitle"])[["TotalPay"]].mean().sort_values(by="TotalPay", ascending=False).reset_index()
# 2x2 grid of subplots, overall figure size 20x15
f, axs = plt.subplots(2, 2, figsize=(20, 15))

# Data, palette and target subplot for each chart
sns.barplot(x='JobTitle', y='TotalPay', palette="Greens_d", data=a_2011.head(5), ax=axs[0, 0])
axs[0, 0].set_title('SF top 5 salaries, 2011', fontsize=15)  # title
axs[0, 0].set_xlabel('Job')
axs[0, 0].set_ylabel('Average pay')

sns.barplot(x='JobTitle', y='TotalPay', palette="Greens_d", data=b_2012.head(5), ax=axs[0, 1])
axs[0, 1].set_title('SF top 5 salaries, 2012', fontsize=15)
axs[0, 1].set_xlabel('Job')
axs[0, 1].set_ylabel('Average pay')

sns.barplot(x='JobTitle', y='TotalPay', palette="Greens_d", data=c_2013.head(5), ax=axs[1, 0])
axs[1, 0].set_title('SF top 5 salaries, 2013', fontsize=15)
axs[1, 0].set_xlabel('Job')
axs[1, 0].set_ylabel('Average pay')

sns.barplot(x='JobTitle', y='TotalPay', palette="Greens_d", data=d_2014.head(5), ax=axs[1, 1])
axs[1, 1].set_title('SF top 5 salaries, 2014', fontsize=15)
axs[1, 1].set_xlabel('Job')
axs[1, 1].set_ylabel('Average pay')

[Image: 2x2 grid of bar charts showing the top 5 job titles by average pay for 2011 to 2014]
The results are shown above. The highest-paid occupations of each year and their average pay are easy to read off.
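The per-year filtering and grouping above repeats the same steps four times. One possible way to get the same kind of table in a single pass is to group by year and job title together. The helper name top_jobs_per_year and the sample rows are made up for this sketch; only the column names come from the article's data.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the article's data.
df = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012, 2012, 2012],
    "JobTitle": ["Clerk", "Clerk", "Engineer", "Clerk", "Engineer", "Engineer"],
    "TotalPay": [40000.0, 50000.0, 90000.0, 48000.0, 95000.0, 105000.0],
})

def top_jobs_per_year(data, n=5):
    """Return the n job titles with the highest mean TotalPay for each year."""
    means = (
        data.groupby(["Year", "JobTitle"], as_index=False)["TotalPay"]
        .mean()
        .sort_values(["Year", "TotalPay"], ascending=[True, False])
    )
    # After sorting, the first n rows of each year are that year's top n.
    return means.groupby("Year").head(n).reset_index(drop=True)

top = top_jobs_per_year(df, n=1)
print(top)
```

Each year's slice of the result can then be passed to a separate subplot, in the same way as a_2011 to d_2014.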
Use a line chart to view how the average pay of several occupations changed over three years.
I chose the five occupations with the highest average pay in 2012:

job_list=["Chief of Police","Chief, Fire Department","Gen Mgr, Public Trnsp Dept","Executive Contract Employee","Asst Chf of Dept (Fire Dept)"]

Next, I wrote a function that picks out the average pay of these occupations for each year and returns it as a dictionary:

def check_job(x):
    salary_dict = {}
    for i in range(len(x["JobTitle"])):
        if x.loc[i,'JobTitle'] in job_list:
            salary_dict[x.loc[i,'JobTitle']]=x.loc[i,'TotalPay']
    return salary_dict
d1=check_job(a_2011)
d2=check_job(b_2012)
d3=check_job(c_2013)
d4=check_job(d_2014)
d4

One of the results (d4) is as follows:
[Image: the dictionary returned for 2014]
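The row-by-row loop in check_job can also be written with isin() and to_dict(). This is only a sketch of an equivalent approach: the name check_job_vectorized and the sample values are made up, and the frame is assumed to have the same two columns as a_2011 to d_2014.

```python
import pandas as pd

# Made-up per-year averages, shaped like a_2011 ... d_2014 in the article.
yearly = pd.DataFrame({
    "JobTitle": ["Chief of Police", "Clerk", "Chief, Fire Department"],
    "TotalPay": [300000.0, 50000.0, 290000.0],
})
wanted = ["Chief of Police", "Chief, Fire Department"]

def check_job_vectorized(x, jobs):
    """Map each job title in `jobs` that appears in x to its TotalPay."""
    selected = x[x["JobTitle"].isin(jobs)]
    return selected.set_index("JobTitle")["TotalPay"].to_dict()

result = check_job_vectorized(yearly, wanted)
print(result)
```

Passing the job list as an argument, instead of reading the global job_list, makes the function easier to reuse with a different selection.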
Finally, the generated dictionaries are organized into a single dictionary, which looks like this:

salary = {
    'Chief of Police': [321552.11, 339282.07, 326716.76],
    'Chief, Fire Department': [314759.6, 336922.01, 326233.44],
    'Gen Mgr, Public Trnsp Dept': [294000.17, 305307.89, 294000.18],
    'Executive Contract Employee': [273776.24, 207269.5166666667, 278544.71],
    'Asst Chf of Dept (Fire Dept)': [270674.81666666665, 294846.6766666667, 279768.9583333334],
}

Next, the dictionary is converted into a DataFrame and plotted:

df=pd.DataFrame(salary,index=["2012","2013","2014"])
df.plot()  # draw a line chart

[Image: line chart of average pay for the five occupations, 2012 to 2014]
Of course, the choice of data here is a little problematic. If the time span were longer, this chart would be more informative.
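The year-by-occupation table used for the line chart can also be built directly from the raw rows with pivot_table(), which avoids copying values into a dictionary by hand. The rows and job names below are made up for this sketch; only the column names are taken from the article's data.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the article's data.
rows = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013, 2014, 2014],
    "JobTitle": ["Job A", "Job B", "Job A", "Job B", "Job A", "Job B"],
    "TotalPay": [100.0, 80.0, 110.0, 85.0, 105.0, 90.0],
})
jobs = ["Job A", "Job B"]

# One row per year, one column per job, each cell the mean TotalPay.
trend = rows[rows["JobTitle"].isin(jobs)].pivot_table(
    index="Year", columns="JobTitle", values="TotalPay", aggfunc="mean"
)
print(trend)
# trend.plot() would then draw one line per job.
```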
Use a heat map to view the correlation between features
Correlation refers to a regular relationship between the values of two or more variables, and studying it helps uncover hidden relationships in a data set. The correlation between two sets of data is usually described by the correlation coefficient, which is obtained by dividing their covariance by the product of the standard deviations of the two variables. The coefficient lies in [-1, 1]: -1 means a perfect negative correlation and 1 means a perfect positive correlation. Correlation plays a very important role in data analysis and data mining, and a heat map is a very intuitive way to display it.

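del original["Notes"]  # drop Notes, a column that is not relevant here
plt.figure(figsize=(20, 20))
plt.rcParams['font.sans-serif'] = ['SimHei']  # set the font to avoid garbled characters
plt.rcParams['axes.unicode_minus'] = False
# corr() computes the pairwise correlation coefficients; only numeric columns are
# passed in, because recent pandas versions raise an error on text columns
sns.heatmap(original.select_dtypes(include='number').corr(), linewidths=0.1, vmax=1.0, square=True, linecolor='white', annot=True)
plt.show()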

[Image: heat map of the correlation coefficients]
From the colors and the annotated values, we can read off the correlation coefficient between each pair of columns, that is, how strongly they vary together. The lighter the color, the closer the coefficient is to 1 and the stronger the positive correlation.
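The definition given earlier, covariance divided by the product of the two standard deviations, can be checked by hand on a small example. The sketch below uses made-up numbers and a hypothetical helper named pearson, and compares the result with NumPy's own np.corrcoef.

```python
import numpy as np

# Made-up paired observations for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / (a.std() * b.std())

r = pearson(x, y)
print(round(r, 4))
# The hand-computed value agrees with NumPy's built-in implementation.
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))
# A perfectly reversed series gives a coefficient of -1.
print(np.isclose(pearson(x, -x), -1.0))
```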

Summary

The above is a simple process of data analysis and data visualization, and I hope it gives everyone a basic understanding. There is still a lot in this data set that can be analyzed, and interested readers can explore it on their own. If you want to learn more about the process of data analysis and mining, you can read another article of mine: "Data analysis, mining and machine learning process items in python-Tianjin renting a house".
The blogger has not been working with data analysis for long, so if anything here is wrong, corrections are welcome.
Thanks for reading.

Origin blog.csdn.net/weixin_43580339/article/details/105975813