Internet Employment Analysis Based on 51job.com ("Worry-Free Future") Data

by wooden Ye

This article is the final assignment for the public elective course "Introduction to Data Science". As a sophomore not majoring in computer science, I completed the whole project during exam week under tight time pressure, so I ask the reader's understanding for any lack of rigor.

This article is organized into five parts: a brief introduction, data crawling, data cleaning, visualization analysis, and presentation of results.

I. Brief Introduction


Background
The Internet industry is widely regarded as a high-paying field, and Internet jobs are in large market demand; many graduates enter the industry every year.
Questions
Which cities have the greatest demand for Internet jobs? Which offer the most attractive salaries?

Which programming languages give an advantage in employment and command higher pay?

What effects do education and work experience have on employment compensation?

Starting from these practical questions, this article analyzes Internet employment.

Data collection: data were fetched from 51job.com and analyzed statistically.

Main tools and methods: scrapy, regular expressions, SVM and random-forest prediction, KMeans clustering, BIRCH clustering, word clouds, and other visualization methods.
 

II. Data Crawling

A crawler was used to collect relevant information from the 51job.com ("Worry-Free Future") website (URL: https://search.51job.com/).

I tried crawling three large domestic recruitment websites: Lagou, BOSS Zhipin, and 51job.com.

After several failed attempts, I found that on Lagou and BOSS Zhipin the desired information is either rendered by JavaScript or guarded by anti-crawler mechanisms strong enough to easily get an IP banned. 51job.com's anti-crawler defenses are much weaker, so data collection there is comparatively easy; the trade-off is that the crawled data is less tidy, so data cleaning requires more work.


The crawler works in the following steps:

1. First crawl https://search.51job.com/ to obtain spider1.csv. The search page itself contains only a small amount of information; the rest is hidden in each position's own detail page. So this step collects, for every posting, the URL of the detail page describing it.

2. Crawl the detail pages to obtain spider2.csv. An example detail-page URL: https://jobs.51job.com/beijing/119317470.html?s=01&t=0

3. Merge the two csv files into the final data set, data.csv.
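The scrapy spiders themselves are not shown in the article. As an illustration of step 1, a regular expression like the following can pull the detail-page URLs out of a search-results page (the helper name and the sample HTML snippet are made up for illustration):

```python
import re

def extract_job_urls(html):
    """Collect detail-page URLs (jobs.51job.com/<city>/<id>.html) from a
    search-results page; these are the pages spider2 then visits."""
    pattern = re.compile(r'https?://jobs\.51job\.com/[a-z]+/\d+\.html')
    # Deduplicate and sort so repeated links on the page appear once
    return sorted(set(pattern.findall(html)))

sample = '''<a href="https://jobs.51job.com/beijing/119317470.html?s=01&t=0">dev</a>
            <a href="https://jobs.51job.com/shanghai/119300001.html">qa</a>'''
print(extract_job_urls(sample))
```

The pattern deliberately stops at `.html`, dropping query strings like `?s=01&t=0` so the same job is not fetched twice under different URLs.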

Roughly one hundred thousand records were crawled, covering ten cities: Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, Chengdu, Chongqing, Harbin, Xi'an, and Wuhan, as shown below:

III. Data Cleansing


   1. The raw crawled data has the fields job (job title), web (posting URL), company (recruiting company), companyweb (company detail URL), place (place of work), salary (compensation), time (publication date), jtap (a combined entry containing location, work experience, academic requirements, number of recruits, etc.), text (employee benefits), key (keywords), and duty (duties and job description).

   2. Because of how the original site lays out the page, work experience, academic requirements, and the number of recruits are all packed into jtap, separated by "|"; regular expressions are used to extract each field.
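A sketch of that extraction. The exact jtap layout and the output field names are assumptions based on the description above:

```python
import re

def parse_jtap(jtap):
    """Split the jtap field ('place | experience | education | headcount')
    into named parts; field order after the place is not relied on."""
    parts = [p.strip() for p in jtap.split('|')]
    info = {'place': parts[0] if parts else ''}
    for p in parts[1:]:
        m = re.search(r'招(\d+)人', p)          # e.g. 招2人 -> 2 recruits
        if '经验' in p:                          # e.g. 3-4年经验
            info['exp'] = p
        elif p in ('博士', '硕士', '本科', '大专', '高中', '中专'):
            info['education'] = p
        elif m:
            info['num'] = int(m.group(1))
    return info

print(parse_jtap('北京 | 3-4年经验 | 本科 | 招2人'))
```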

   3. The irregular raw salary data is converted into a uniform format: the numbers are extracted, and 万/year, yuan/day, and 千/month figures are all converted to thousand RMB per month.
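One way to sketch that unit conversion. The set of formats handled, and the 21.75 working days per month used for daily wages, are my assumptions rather than the article's exact rules:

```python
import re

def salary_to_k_per_month(s):
    """Convert salary strings like '1-1.5万/月', '10-20万/年' or '5千/月'
    to a (low, high) range in thousand RMB per month."""
    m = re.match(r'([\d.]+)(?:-([\d.]+))?(千|万|元)/(月|年|天)', s)
    if not m:
        return None
    lo = float(m.group(1))
    hi = float(m.group(2)) if m.group(2) else lo
    unit, per = m.group(3), m.group(4)
    scale = {'千': 1, '万': 10, '元': 0.001}[unit]   # to thousand RMB
    lo, hi = lo * scale, hi * scale
    if per == '年':
        lo, hi = lo / 12, hi / 12
    elif per == '天':
        lo, hi = lo * 21.75, hi * 21.75              # approx. working days/month
    return round(lo, 2), round(hi, 2)

print(salary_to_k_per_month('1-1.5万/月'))
```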

   4. The required programming languages are matched out of job (job title), key (keywords), and duty (job description) and stored in separate columns.

   5. Redundant columns such as web and companyweb are deleted, and some format adjustments are made.

After cleaning, the data are as follows: education is the academic requirement, money the pay (in the uniform thousand-RMB-per-month format), city the city, num the number of recruits, and exp the required work experience. There are also 19 language columns forming an adjacency matrix, where a 1 means the posting demands that language.

The languages matched are: 'c++', 'java', 'c#', 'python', 'go', 'php', 'matlab', 'swift', 'lua', 'perl', 'delphi', 'kotlin', 'ruby', 'typescript', 'vba', 'rust', 'haskell', 'visual basic', 'sql'
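A sketch of the matching, with word boundaries to avoid false positives; the helper name is made up, and the article's actual regexes may differ:

```python
import re

LANGS = ['c++', 'java', 'c#', 'python', 'go', 'php', 'matlab', 'swift',
         'lua', 'perl', 'delphi', 'kotlin', 'ruby', 'typescript', 'vba',
         'rust', 'haskell', 'visual basic', 'sql']

def language_flags(text):
    """Return a 0/1 indicator per language, i.e. one row of the
    19-language adjacency matrix. Lookarounds stop 'go' from matching
    inside 'good'; re.escape handles 'c++' and 'c#'."""
    text = text.lower()
    flags = {}
    for lang in LANGS:
        pattern = r'(?<![a-z0-9])' + re.escape(lang) + r'(?![a-z0-9+#])'
        flags[lang] = 1 if re.search(pattern, text) else 0
    return flags

row = language_flags('Requires Java and Python; Go is a plus. Good SQL skills.')
print([k for k, v in row.items() if v])
```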

IV. Visualization Analysis


1. Demand for Internet professionals in different cities

As the histogram below shows, Shanghai, Shenzhen, and Guangzhou have the largest demand for Internet professionals, while Harbin's demand is only about 1/20 of theirs. These three big cities do indeed offer more opportunities.
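The histogram behind this observation is just a per-city count of postings. With the cleaned data loaded, it reduces to something like the following (the city values here are toy stand-ins for the real 'city' column):

```python
from collections import Counter

# Toy stand-in for the 'city' column of the cleaned data.csv
cities = ['上海', '深圳', '上海', '广州', '哈尔滨', '深圳', '上海']

demand = Counter(cities)
for city, n in demand.most_common():
    print(city, n)
# These (city, count) pairs are what a bar chart would plot.
```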

2. Academic requirements for employment

It is clear that very few postings require a doctorate, and only about 3% explicitly require a master's degree, so a bachelor's degree is enough to step into most positions.

Candidates with junior college education or below (high school, technical secondary school) have a harder time in Internet occupations.


3. Employment compensation in different cities
Beijing has the highest average salary, followed by Shanghai, Shenzhen, Hangzhou, and Guangzhou.


Harbin's pay is significantly lower than the other cities'.

4. The relationship between pay and education
Doctorate holders' salaries are significantly higher than everyone else's.


 

5. Experience and compensation
As can be seen, once the zero-experience data is set aside, the remainder is roughly linearly distributed.

Zero experience generally means fresh university graduates, so those points do not follow this trend.
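The near-linear trend can be illustrated with a hand-rolled least-squares fit. The numbers below are invented for illustration, not the article's data:

```python
# Toy experience (years) vs salary (thousand RMB/month) pairs, with the
# zero-experience fresh-graduate rows already set aside.
exp =    [1, 2, 3, 4, 5, 6, 8, 10]
salary = [8.1, 9.8, 12.2, 13.9, 16.0, 18.1, 22.0, 26.2]

n = len(exp)
mx = sum(exp) / n
my = sum(salary) / n
# Ordinary least squares: slope = S_xy / S_xx
slope = sum((x - mx) * (y - my) for x, y in zip(exp, salary)) / \
        sum((x - mx) ** 2 for x in exp)
intercept = my - slope * mx
print(f"salary ≈ {slope:.2f} * years + {intercept:.2f}")
```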

6. A word cloud of corporate benefits
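A word cloud is a frequency table rendered graphically. The benefit strings below are invented stand-ins for the 'text' (welfare) column; the article presumably fed frequencies like these to a word-cloud renderer:

```python
from collections import Counter

# Toy stand-ins for the 'text' (employee benefits) column
benefits = ['五险一金 年终奖金 弹性工作',
            '五险一金 带薪年假',
            '年终奖金 五险一金']

freq = Counter(word for row in benefits for word in row.split())
print(freq.most_common(3))
# The wordcloud library can then render these counts, e.g. via
# WordCloud(font_path=...).generate_from_frequencies(freq)
```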


7. Demand counts for different languages

Java is the most popular language, appearing in most postings' programming-language requirements.

8. Average salary by language

The 19 languages' average salaries are similar; which language you choose matters less than how well you master it.

V. Presentation of Results

First, only the salary, education, experience, city, and programming-language columns are kept, with the education levels and cities replaced by numeric codes to simplify later processing.

In the first part, I trained on 75% of the data and tested on the remaining 25%, using three prediction methods (linear regression, SVM, and random forest) to try to predict salary from education, experience, city, and programming language. The fits were poor, probably because salary is hard to predict directly from these few factors alone, so those results are not shown.
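A minimal sketch of that comparison, run on synthetic data since the real data.csv is not reproduced here; the feature encoding and model settings are my assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the encoded (education, experience, city,
# language) columns described above.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 4)).astype(float)
y = X @ np.array([1.5, 2.0, 0.8, 0.3]) + rng.normal(0, 2.0, 200)

# 75% train / 25% test, matching the split in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in [('linear', LinearRegression()),
                    ('svm', SVR()),
                    ('random forest', RandomForestRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)   # R^2 on the held-out 25%

for name, r2 in scores.items():
    print(name, round(r2, 3))
```

On real salary data a low R^2 from all three models is exactly the "poor fit" the article reports.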

The second part clusters the 19 languages using two methods, KMeans and BIRCH.

KMeans produced a silhouette coefficient of about 0.62, while BIRCH produced about 0.95, so the BIRCH result was used, grouping the languages into six clusters. Since the data set comes from a recruitment site, these six clusters can be read as "bundled requirements", and can offer some guidance to programmers on which language to learn next. The output is as follows:
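A sketch of that clustering comparison on a toy matrix; representing each language by its 0/1 demand vector over postings (a transposed slice of the adjacency matrix) is my reading of the setup, and the cluster count and seeds are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, Birch
from sklearn.metrics import silhouette_score

# 19 "languages" described over 12 postings, with two obvious bundles
# baked in so the clustering has structure to find.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(19, 12)).astype(float)
X[0:4] = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # bundle A: identical demand
X[4:8] = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]   # bundle B

results = {}
for name, model in [('kmeans', KMeans(n_clusters=6, n_init=10, random_state=1)),
                    ('birch', Birch(n_clusters=6))]:
    labels = model.fit_predict(X)
    results[name] = labels
    # Silhouette near 1 means tight, well-separated clusters
    print(name, round(silhouette_score(X, labels), 2))
```

Languages with identical demand vectors (a "bundle") necessarily land in the same cluster, which is the behavior the article exploits.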

VI. Summary

Overall, the project basically met its goals:

1. Crawled the 51job.com website, obtained a sufficiently large data set, and cleaned it.

2. Used eight visualizations to reflect some real-world questions.

3. The attempt to predict salary from the posting parameters was unsuccessful: salary is not simply determined by a few recruitment factors, and actual pay may differ from the range written in the posting. For example, some high-paying positions state no degree requirement, yet in the actual competition for them candidates with a master's degree or above are usually chosen.

4. Clustered the 19 languages, providing a way to judge, from the one or two languages you already know, which language to learn next to benefit your employment prospects.
 

 

 

Origin blog.csdn.net/qq_32088519/article/details/103959078