Xiaohongshu hard to crawl? Let me teach you the latest crawling method~


Preface

Hi everyone, my name is Kuls.

In the article I wrote earlier on configuring the app packet-capture tool Charles, I said that once the "looking" count passed 30 I would publish the follow-up right away. Some readers have asked me how to crawl an app, so here is the step-by-step walkthrough (warning: lots of screenshots).

So I worked overtime and wrote today's article for everyone.

This article will walk you through the entire process of crawling Xiaohongshu.

Xiaohongshu (Little Red Book)

The prerequisite is to install mitmproxy.

For the specific configuration process, I suggest following the installation guide written by Cui:

https://zhuanlan.zhihu.com/p/33747453
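If you just want the short version, mitmproxy can usually be installed straight from PyPI; the guide above covers the certificate setup needed for capturing phone/mini-program traffic:

    pip install mitmproxy
    mitmdump --version   # quick check that the install worked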

First, open the Charles instance you configured earlier.

Let's simply capture the Xiaohongshu mini program (note that this is the WeChat mini program, not the app).

The reason for not choosing the app is that Xiaohongshu's app is a bit harder to crack; after looking at some ideas online, I went with the mini program instead.

1. Analyze the mini program through Charles packet capture

We open the Xiaohongshu mini program and search for any keyword.

Follow the same path in Charles and you will see that the list data has been captured.

But do you think this is over?

No no no

From this capture we know that the data can be obtained through this API endpoint.

But once we sit down to write the crawler, we find two troublesome parameters in the headers:

"authorization"和"x-sign"

These two values keep changing, and it is not obvious where they come from.

So...

2. Use mitmproxy to capture packets

In fact, the Charles capture has already made the overall crawling strategy clear:

grab the "authorization" and "x-sign" parameters, then send a GET request to the URL.

The mitmproxy used here is actually similar to Charles, both of which are packet capture tools.

But mitmproxy can be driven with Python scripts,

which makes things a lot more comfortable.

Here is a simple example:

    def request(flow):
        # mitmproxy calls this hook for every request that passes through the proxy
        print(flow.request.headers)

mitmproxy provides this hook for us; through the request object we can intercept the request's headers as well as its url, cookies, host, method, port, scheme and other attributes.
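For reference, here is a quick sketch of reading a few of those attributes inside the same hook (these are standard attributes of mitmproxy's request object):

    def request(flow):
        # a few of the attributes mentioned above, all available on flow.request
        print(flow.request.url)
        print(flow.request.method)
        print(flow.request.host)
        print(flow.request.scheme)
        print(flow.request.cookies)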

Isn't this exactly what we want?

We simply intercept the two parameters "authorization" and "x-sign",

then fill them into our own request headers,

and the whole thing is done.

That is our entire crawling idea. Now let me explain how to write the code.

In fact, the code is not difficult to write

First, we need to intercept the flow for the search API so that we can pull information out of it:

if 'https://www.xiaohongshu.com/fe_api/burdock/weixin/v2/search/notes' in flow.request.url:

Here we check whether the search API URL appears in the flow's request URL,

which tells us this is the request we want to capture.

# requires `import re` at the top of the script
authorization = re.findall(r"authorization',.*?'(.*?)'\)", str(flow.request.headers))[0]
x_sign = re.findall(r"x-sign',.*?'(.*?)'\)", str(flow.request.headers))[0]
url = flow.request.url

With the code above we get the three most critical pieces of information; after that it is just ordinary JSON parsing.

Finally, we get the data we want.
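To make the whole flow concrete, here is a minimal sketch of how the pieces might fit together in one mitmproxy script. The request is replayed right inside the hook with the requests library, and the output handling and file names are my own assumptions rather than the original code; run it with `mitmdump -s xiaohongshu_sniffer.py` and then search in the mini program:

    # xiaohongshu_sniffer.py -- a sketch; run with: mitmdump -s xiaohongshu_sniffer.py
    import json
    import re

    import requests

    SEARCH_API = 'https://www.xiaohongshu.com/fe_api/burdock/weixin/v2/search/notes'

    def request(flow):
        # only react to the mini program's search API calls
        if SEARCH_API not in flow.request.url:
            return
        headers_str = str(flow.request.headers)
        authorization = re.findall(r"authorization',.*?'(.*?)'\)", headers_str)[0]
        x_sign = re.findall(r"x-sign',.*?'(.*?)'\)", headers_str)[0]
        url = flow.request.url
        # replay the same request ourselves with the captured signing headers
        resp = requests.get(url, headers={"authorization": authorization, "x-sign": x_sign})
        # the exact JSON structure depends on the API, so just dump it for inspection
        with open("search_result.json", "w", encoding="utf-8") as f:
            json.dump(resp.json(), f, ensure_ascii=False, indent=2)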

If you want a single note's detail page, you can fetch it once you have the note's id:

"https://www.xiaohongshu.com/discovery/item/" + str(id)

The headers for this page need to include a cookie. You can grab the cookie by visiting the site in a browser; it seems to be fixed for now.
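A minimal sketch of that detail request (the function name is mine, the cookie string is a placeholder you copy from your own browser session, and the id comes from the search results above):

    import requests

    def fetch_note(note_id, cookie):
        # detail page for a single note; the cookie is copied from a normal browser visit
        url = "https://www.xiaohongshu.com/discovery/item/" + str(note_id)
        resp = requests.get(url, headers={"cookie": cookie})
        return resp.text  # the note's HTML, ready for whatever parsing you prefer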

Finally, you can write the data out to a CSV file.
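For example, something like this with the standard csv module (the column names are only illustrative; use whichever fields you pulled out of the JSON):

    import csv

    # `rows` stands in for the records you built while parsing the JSON
    rows = [{"id": "123", "title": "example note", "likes": 10}]

    with open("xiaohongshu.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title", "likes"])
        writer.writeheader()
        writer.writerows(rows)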


Summary

In fact, crawling Xiaohongshu is not particularly difficult. The key lies in the overall idea and the method you use.

That's it for this issue. If this issue gets more than 40 "looking" reactions, the next article will be published right away!

See you in the next issue~

------------------- End -------------------

