Wanfang: collecting medical journal articles + author information (tens of millions of records)

  Recently I rebuilt the crawler code for Wanfang Data; it now runs at roughly 100,000 records per hour. The project belongs to my company, so the code can't be open-sourced. Instead I'll walk through the approach and a few notes here, and take the chance to gripe a little about Wanfang along the way.

  First, a picture of the overall flow (screenshot in the original post):

  

  The logic is actually quite simple. Medical journals on Wanfang are split into 16 big categories. First, take the unique id of each of those 16 categories and splice it into that category's listing URL; then page through the listing to get the request information for every journal under that category.
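  A minimal sketch of that step, assuming a hypothetical listing endpoint, parameter names, and category ids (the post doesn't show the real category URL):

```python
import requests

# Hypothetical ids for the 16 big categories; the real values come from the site.
CATEGORY_IDS = ["R1", "R2", "R3"]  # ...16 in total

def iter_category_pages(category_id, max_pages=50):
    """Page through one category listing and yield each page's HTML."""
    for page in range(1, max_pages + 1):
        # Hypothetical URL pattern spliced together from the category id.
        url = ("http://www.wanfangdata.com.cn/perio/"
               f"list.do?category_id={category_id}&page={page}")
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        yield resp.text  # the journal ids get parsed out of this page

for category_id in CATEGORY_IDS:
    for page_html in iter_category_pages(category_id, max_pages=2):
        pass  # parse the journal ids / request info from page_html here
```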

  With the id of each journal in hand, we can splice together each journal's home page URL. At this point, though, you'll notice that Wanfang journal home pages come in two flavours, which I'll call the new and the old version:

  New version: http://www.wanfangdata.com.cn/sns/user/qkzgf4

  Old version: http://www.wanfangdata.com.cn/perio/detail.do?perio_id=zgjhmy

  The two versions use different URLs, so how do we tell whether a given journal is new-style or old-style? After all, all we know at this point is the journal's id. My approach is to treat every journal as the old version by default and splice the old-version URL from its id. If the journal really is the old version, the request lands on the journal's home page; if it's actually the new version, the request gets redirected to the new home page. In the end we only need to look at response.url to tell which is which.
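  A sketch of that check with requests; the old-version URL pattern matches the example above, and the "/sns/user/" test for the new version is inferred from the other example URL:

```python
import requests

def detect_version(journal_id):
    """Assume the old version first; a redirect reveals the new one."""
    old_url = ("http://www.wanfangdata.com.cn/perio/detail.do"
               f"?perio_id={journal_id}")
    resp = requests.get(old_url, allow_redirects=True, timeout=10)
    if "/sns/user/" in resp.url:   # redirected to the new-style home page
        return "new", resp.url
    return "old", resp.url
```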

  Why does it matter whether a journal is new or old? Because everything about requesting the articles in that journal depends on it.

  The articles in every journal follow a pattern: there is a time tree that records which journal an article belongs to, which year, and which issue. So to fetch all of a journal's articles we first have to parse its time tree, and the time tree of the new version is not the same as the old version's.

  (Screenshots of the new- and old-version time trees were shown here in the original post.)

  

  

  And that is exactly why we needed the earlier step of identifying whether a journal is the new or the old version.

  Once we know which version a journal is, the next step is to request its time tree to get every year and the number of issues in each year; that information is needed for the article requests that follow.
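  The actual time-tree endpoints and response formats differ between the two versions and aren't shown in the post, so the parsed shape below is only an assumed illustration of what the crawler needs out of this step:

```python
# Hypothetical shape of a parsed time tree: year -> list of issue labels.
time_tree = {
    "2019": ["01", "02", "03", "04", "05", "06", "07"],
    "2018": ["01", "02", "03", "04", "05", "06", "07", "08"],
}

def iter_issues(time_tree):
    """Yield (year, issue) pairs so each issue can be requested in turn."""
    for year, issues in time_tree.items():
        for issue in issues:
            yield year, issue
```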

  After requesting and parsing the time tree, asking for the JSON of each issue's articles should have been the easy part. The results, however, were not great: in the end only a small fraction of issues returned articles, and the rest came back as empty JSON.

  I was baffled and started fiddling with everything: rotating proxies, swapping the User-Agent, piling on cookies, the whole routine. The operation was, as the saying goes, fierce as a tiger, and the result was a joke. After a long struggle I finally found the key to the problem, which is this:

  

  There it was: the time tree for 2019 parsed out seven issues in total, labelled 01, 02 up to 07, but the article request expects 1 to 7, with the leading zero gone. So those requests came back empty.
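  The fix is a one-liner: strip the leading zero from the issue label before putting it into the article request (the parameter names below are assumptions):

```python
def normalize_issue(issue_label):
    """Turn a zero-padded issue label like '07' into the '7' the request expects."""
    return str(int(issue_label))

# Hypothetical request parameters built from the time tree:
params = {"year": "2019", "issue": normalize_issue("07")}  # -> {'year': '2019', 'issue': '7'}
```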

  Even with that fixed, there was still a problem when requesting each issue's articles, and it only shows up on the old version. After parsing the time tree, an article request should by rights return JSON. On the old version, though, it would sometimes return HTML instead, which crashed my program because I was handling every response as JSON. To this day I don't understand why it suddenly returns HTML instead of JSON; my guess is that I was simply requesting too fast.

  So I added one more step to the pipeline: whenever a response isn't JSON, I push the journal id + year + issue into Redis. A separate job then takes it out of Redis and requests it again; if JSON comes back, the entry is removed from the set, otherwise it goes back into the set, and the requests cycle round like that.
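  A minimal sketch of that retry loop using a Redis set; the key name, the payload format, the article endpoint, and its parameters are all assumptions rather than the original code:

```python
import time

import redis
import requests

r = redis.Redis()
RETRY_KEY = "wanfang:retry"  # hypothetical Redis set name

def fetch_issue_json(journal_id, year, issue):
    """Request one issue's article list; return parsed JSON, or None if HTML came back."""
    # Hypothetical endpoint and parameter names; the real article request isn't shown.
    url = "http://www.wanfangdata.com.cn/perio/articles.do"
    resp = requests.get(url,
                        params={"perio_id": journal_id, "year": year, "issue": issue},
                        timeout=10)
    try:
        return resp.json()
    except ValueError:  # the old version sometimes answers with HTML instead of JSON
        return None

# The main crawl pushes failures like this:
#   r.sadd(RETRY_KEY, f"{journal_id}|{year}|{issue}")

def retry_worker():
    """Keep retrying failed (journal, year, issue) triples until JSON is returned."""
    while True:
        raw = r.spop(RETRY_KEY)
        if raw is None:
            time.sleep(5)           # nothing queued right now
            continue
        journal_id, year, issue = raw.decode().split("|")
        data = fetch_issue_json(journal_id, year, issue)
        if data is None:
            r.sadd(RETRY_KEY, raw)  # still no JSON: put it back for another round
        else:
            print(journal_id, year, issue, "ok")  # parse articles + author info from `data` here
```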

  Once the retried requests finally return JSON, parsing it gives the content of the articles in that issue, and the author information sits in the same JSON.

  That's the whole process. Now for the griping:

  I have to say, Wanfang really does love redesigning itself. It started out with "Medical" in its name, later became Wanfang Data, and then, as luck would have it, my crawl coincided with yet another redesign, and it was exactly this redesign that led me to discover an interface:

  http://www.wanfangdata.com.cn/perio/page.do   

  This interface appeared only during the Wanfang redesign: it didn't exist before the redesign and it doesn't exist after it. It showed up for just a short while, and you can't see it anywhere in the current pages.

  This interface is used to request which journals fall under each of the 16 big categories. The JSON it returns contains all the information about each journal, far more complete than what the journal home page shows, and two things in it were a huge help to my work.

  Normally each journal's time tree has to be requested separately, which inevitably slows the crawler down and sometimes the request simply fails; this interface, however, already includes the journal's time tree. The other thing is the impact factor: normally you'd have to request the journal home page and parse it to get a journal's impact factor, but now that's unnecessary because it's in this JSON too, which saves a lot of trouble. That said, the interface isn't used anywhere on the Wanfang official site, which means it's not what they use for display now; it was only a temporary thing, and whether it will disappear in the future I have no idea.

  This is the form data for requesting that interface (code_name is the unique identifier of one of the 16 big categories):
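  The screenshot of the form data isn't reproduced here; only code_name is confirmed by the post, so the other fields below (paging, page size) and the example value are assumptions:

```python
import requests

url = "http://www.wanfangdata.com.cn/perio/page.do"

formdata = {
    "code_name": "R3",  # hypothetical category id; the real 16 ids come from the site
    "page": 1,          # assumed paging fields, not confirmed by the post
    "pageSize": 20,
}

resp = requests.post(url, data=formdata, timeout=10)
journals = resp.json()  # per the post: each journal's info, its time tree, impact factor, ...
```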

  

 

 

   Finally, a plug: if you want more Python content on crawlers and data analysis, or large amounts of source data collected by crawlers, feel free to follow my WeChat official account: 悟道Python

  

 


Origin www.cnblogs.com/ss-py/p/11569976.html