Facebook Crawler

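Crawling with requests:

Crawling with selenium is really slow, so I had no choice but to reimplement it with requests. Working out the approach took a long time; I won't go into that here.

This crawls content from m.facebook.com rather than using Facebook's Graph API. I also tried the regular web pages, and they can be crawled too, because the m site and the www site share cookies, so you can switch back and forth between them.

Features:

Given any user, crawl all of that user's public information, including:
post history / photos / friends / videos (links) / profile / …

The crawling part is done; the remaining data-extraction part is in progress…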

1. Anti-crawling analysis:

https://m.facebook.com/profile/timeline/stream/?cursor=tmln_strm%3A1341235186%3A4123521292106084490%3A0&profile_id=100003102976600&replace_id=u_z_0
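The URL above carries only three query parameters: cursor, profile_id and replace_id. As a minimal sketch (the function and constant names here are my own, not from the original code), it can be rebuilt and taken apart with the standard library:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the captured URL above.
TIMELINE_ENDPOINT = "https://m.facebook.com/profile/timeline/stream/"


def build_timeline_url(cursor, profile_id, replace_id):
    """Assemble the timeline stream URL from its three query parameters."""
    query = urlencode({
        "cursor": cursor,          # ':' is percent-encoded as %3A
        "profile_id": profile_id,
        "replace_id": replace_id,
    })
    return f"{TIMELINE_ENDPOINT}?{query}"


def split_timeline_url(url):
    """Return the decoded query parameters of a timeline stream URL."""
    parts = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in parts.items()}


url = build_timeline_url(
    "tmln_strm:1341235186:4123521292106084490:0",
    "100003102976600",
    "u_z_0",
)
print(url)
print(split_timeline_url(url))
```

With the values from the capture, this reproduces the URL shown above. How the cursor value for the next page is obtained is not covered by this sketch.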

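Packet capture shows this is sent as a POST request with very complicated parameters. After many attempts, though, I found the same data can be fetched with a GET request, provided you have the replace_id. The replace_id comes from the __req parameter of the previous data request, and __req in turn appears in the heartbeat packets and keeps changing. The requests that fire on each fetch are: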

get:
https://edge-chat.facebook.com/pull?channel=p_100003102976600&seq=1&clientid=4166e2a6&profile=mobile&partition=-2&sticky_token=588&msgs_recv=1&qp=y&cb=2838170899&state=active&sticky_pool=ash4c09_chat-proxy&uid=100003102976600&viewer_uid=100003102976600&m_sess=&__dyn=1KQdAmm1gxu4U4ifGh28sBBgS5UqxKcwRwAxu3-UcodUbE6u7HzE4p0Yxm6Uhx6484G58O0PEhxm3O3q1rwxwdC2O1gCwSxu0BU7W1KxO1ZxO3W3G1uxmcG1lwf-68WUS2G2DxK18wXCwn8mw&__req=13&__ajax__=AYnvmks18JXzR0XmAgzkyTe1jE_EqXv8w1Gy89AKwm_kyMYEQzG4asGXoRwYbKNBTc6nKql4LCx3320Uy4Y66xytbvwlhkY_SE6Qzt5UTHx3XQ&__user=100003102976600

response: for (;;); {"t":"fullReload","seq":1}

post, form_data is empty
https://edge-chat.facebook.com/sub?cb=lfnh&sticky_token=588&uid=100003102976600&viewer_uid=100003102976600&sticky_pool=ash4c09_chat-proxy&profile=mobile&clientid=4166e2a6&cap=0

response: for (;;); {"t":"pong"}
post, form_data is empty
https://edge-chat.facebook.com/sub?cb=iif3&sticky_token=588&uid=100003102976600&viewer_uid=100003102976600&sticky_pool=ash4c09_chat-proxy&profile=mobile&clientid=4166e2a6&cap=0
# differs from the one above only in the cb parameter
response: for (;;); {"t":"pong"} 
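Every response captured above starts with a `for (;;);` guard in front of the JSON body, so it has to be stripped before parsing. A minimal sketch, assuming the guard is always exactly that string (the helper name is my own):

```python
import json

# Guard prefix seen in front of every captured response body.
GUARD_PREFIX = "for (;;);"


def parse_guarded_json(text):
    """Strip the 'for (;;);' guard, if present, and parse the rest as JSON."""
    body = text.strip()
    if body.startswith(GUARD_PREFIX):
        body = body[len(GUARD_PREFIX):]
    return json.loads(body)


pong = parse_guarded_json('for (;;); {"t":"pong"}')
reload_msg = parse_guarded_json('for (;;); {"t":"fullReload","seq":1}')
print(pong["t"], reload_msg["t"], reload_msg["seq"])
```

The "t" field then tells the two message kinds seen in the capture apart: "pong" for the sub requests and "fullReload" for the pull request.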

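In other words, you have to simulate the heartbeat packets sent to Facebook's servers. The heartbeat is probably triggered by a local setInterval function. While debugging the JS locally, I found the code that produces the link: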

    Q.prototype.getURI = function() {
        "use strict";
        return this._uri
    }

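The code below can be used to verify this: the outermost numbers increase but are not consecutive, while the inner keys are fixed at 13 and 16, indicating whether it is a data request or a heartbeat packet.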

(function anonymous() {
    (new (require("ServerJS"))()).handle({
    // ...... omitted
			"4": {
                "16": {
                    "sprited": true,
                    "spriteCssClass": "sx_944ae1",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                },
                "13": {
                    "sprited": true,
                    "spriteCssClass": "sx_e5986d",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                }
            },
            "5": {
                "16": {
                    "sprited": true,
                    "spriteCssClass": "sx_2d2185",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                },
                "13": {
                    "sprited": true,
                    "spriteCssClass": "sx_15dd69",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                }
            },
            "3": {
                "16": {
                    "sprited": true,
                    "spriteCssClass": "sx_5b32bd",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                },
                "13": {
                    "sprited": true,
                    "spriteCssClass": "sx_831a68",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                }
            },
            "10": {
                "16": {
                    "sprited": true,
                    "spriteCssClass": "sx_172a06",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                },
                "13": {
                    "sprited": true,
                    "spriteCssClass": "sx_c84a14",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                }
            },
            "7": {
                "16": {
                    "sprited": true,
                    "spriteCssClass": "sx_d22440",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                },
                "13": {
                    "sprited": true,
                    "spriteCssClass": "sx_65f560",
                    "spriteMapCssClass": "sp_MIDvfupbSW_"
                }
            },

How to get around it: simulate sending the requests, incrementing the heartbeat packet parameter each time.
Full code: [email protected]
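
As a rough illustration of that approach (not the full code referenced above), here is a sketch that builds the sub heartbeat URL from the capture and hands out increasing __req values. Several things are assumptions of mine: that cb is just a random short token, that __req is a plain integer going up by a fixed step, and all function and class names. `session` is expected to be a logged-in requests.Session.

```python
import itertools
import random
import string
from urllib.parse import urlencode

# Heartbeat endpoint taken from the captured requests.
SUB_ENDPOINT = "https://edge-chat.facebook.com/sub"


def random_cb(length=4):
    """Short token like 'lfnh' / 'iif3' (assumption: cb is simply random)."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def build_sub_url(uid, clientid, sticky_token, sticky_pool, cb=None):
    """Build the sub heartbeat URL with the parameters seen in the capture."""
    params = {
        "cb": cb or random_cb(),
        "sticky_token": sticky_token,
        "uid": uid,
        "viewer_uid": uid,
        "sticky_pool": sticky_pool,
        "profile": "mobile",
        "clientid": clientid,
        "cap": 0,
    }
    return f"{SUB_ENDPOINT}?{urlencode(params)}"


class RequestCounter:
    """Hands out increasing __req values (assumption: integers, fixed step)."""

    def __init__(self, start=1, step=1):
        self._counter = itertools.count(start, step)

    def next_req(self):
        return next(self._counter)


def send_heartbeat(session, sub_url):
    """POST an empty form to the sub endpoint and return the raw response text."""
    response = session.post(sub_url, data={}, timeout=10)
    return response.text
```

A crawler loop would call send_heartbeat between data requests and take the next value from RequestCounter for each request it builds; the right starting value and step would have to be read off a real capture.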

Reposted from blog.csdn.net/wu0che28/article/details/83110192