Facebook Content Collection / User Profile Data Scraping API Crawler

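Last month an old classmate introduced me to iDataAPI, a third-party API crawler provider that can collect Facebook content and user information. After a week of continuous testing at the scale of millions of records, its stability looked good, and we plan to include it in the overseas information monitoring project that we will submit to the provincial government in the second half of the year.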

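Facebook does provide the official Graph API, but much information is not available in its latest version, for example searching users' posts by keyword. Scraping the desktop web pages is also difficult, because Facebook's pages load data dynamically with a large amount of JavaScript. So when I heard that they had built a crawler that combines cloud-side distributed scraping with mobile-side scraping, I was keen to try it, and the test results met expectations.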

The functions we mainly use are:

  1. Fetch user timelines by keyword
  2. Fetch a given user's profile
  3. Fetch a given user's friends
  4. Fetch a given user's posts

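Together these functions can produce a relationship network, which lays the groundwork for the subsequent NLP work. On top of this API crawler, we plan to add public opinion monitoring, sentiment analysis, opinion mining, and so on.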
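
As a rough sketch of how friend lists could be turned into such a relationship network, the snippet below builds an undirected adjacency map in plain Python. The input shape (a mapping from a user id to a list of friend ids) and the function names are assumptions for illustration; the actual response format of the friends endpoint is not shown in this post.

```python
from collections import defaultdict


def build_friend_graph(friend_lists):
    """Build an undirected adjacency map from {user_id: [friend_id, ...]}.

    The input shape is hypothetical; adapt it to the real friends response.
    """
    graph = defaultdict(set)
    for user_id, friend_ids in friend_lists.items():
        graph[user_id]  # make sure users without friends still appear
        for friend_id in friend_ids:
            if friend_id == user_id:
                continue  # ignore self-loops
            graph[user_id].add(friend_id)
            graph[friend_id].add(user_id)
    return dict(graph)


def mutual_friends(graph, a, b):
    """Return the ids adjacent to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())
```

An adjacency map like this is enough for simple queries such as mutual friends; for heavier graph analysis a dedicated graph library would probably be a better fit.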

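iDataAPI includes a complete web console where the crawlers can be tested in the browser (Facebook, Twitter, Weibo, YouTube... the ones I tested all worked quite well, which is rare). The backend supports creating tasks, viewing logs, and viewing data, and ordinary developers get free credit for testing when they register. For now the project is still in the application stage and our research institute needs a fairly large volume; once we sign an annual contract and add Twitter and the others, we estimate collecting on the order of ten million records per day.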

Sample response (Facebook post)
{
    "hasNext": true,
    "retcode": "000000",
    "appCode": "facebook",
    "dataType": "post",
    "pageToken": "enc_AdBhgxzOwy0fZBFjW6GXwbjJDRUca1SS5ccSTKp4TvchMAF3De0qdfVEC8sZAcCQZCw1CtORi9eLls3iJvJJk8PlNIQ|1493425239",
    "data": [
        {
            "posterId": "4",
            "commentCount": 76221,
            "posterScreenName": "Mark Zuckerberg",
            "title": null,
            "url": "https:\/\/www.facebook.com\/4_10103685865597591",
            "imageUrls": [
                "https:\/\/fb-s-d-a.akamaihd.net\/h-ak-xtp1\/v\/t15.0-10\/s720x720\/18223192_10103685908017581_8465195272706719744_n.jpg?_nc_ad=z-m&oh=e0736750f4882bed329ad89749849443&oe=59C22C86&__gda__=1505738387_e11048d689aba9e12e3fef771eab44f5"
            ],
            "originUrl": "https:\/\/www.facebook.com\/zuck\/videos\/10103685865597591\/",
            "geoPoint": "37.484, -122.149",
            "mediaType": "video",
            "publishDate": 1493494203,
            "likeCount": 173477,
            "content": "Part II of driving through South Bend, Indiana with Mayor Pete Buttigieg.",
            "parentPostId": "3791568f35f4c067d6403a5c344136cc",
            "shareCount": 9506,
            "parentAppCode": "facebook",
            "publishDateStr": "2017-04-29T19:30:03",
            "id": "4_10103685865597591",
            "origin": false,
            "originContent": null
        }
    ]
}
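
The `hasNext` and `pageToken` fields in the sample suggest token-based pagination. A minimal sketch of a client loop follows, assuming that a `retcode` of `"000000"` means success and that the next page is requested by passing `pageToken` back; neither is confirmed by this post. `fetch_page` is a hypothetical callable that you would implement against the provider's real endpoint.

```python
from datetime import datetime, timezone


def iter_posts(fetch_page, max_pages=100):
    """Yield post dicts page by page until hasNext is false.

    fetch_page(page_token) is a hypothetical callable returning one parsed
    JSON response shaped like the sample above; page_token is None for the
    first page.
    """
    page_token = None
    for _ in range(max_pages):  # hard cap so a bad token cannot loop forever
        page = fetch_page(page_token)
        if page.get("retcode") != "000000":  # assumed success code
            raise RuntimeError(f"unexpected retcode: {page.get('retcode')!r}")
        yield from page.get("data") or []
        page_token = page.get("pageToken")
        if not page.get("hasNext") or not page_token:
            break


def publish_time_utc(post):
    """Convert the integer publishDate (Unix seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(post["publishDate"], tz=timezone.utc)
```

For the sample post, `publish_time_utc` gives 2017-04-29 19:30:03 UTC, which matches `publishDateStr`, so that string appears to be expressed in UTC.
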
Sample response (Facebook user profile)
{
    "hasNext": false,
    "retcode": "000000",
    "appCode": "facebook",
    "dataType": "profile",
    "pageToken": null,
    "data": [
        {
            "userName": "zuck",
            "idType": "user",
            "educations": [
                {
                    "schoolName": "Ardsley High School"
                },
                {
                    "schoolName": "Phillips Exeter Academy"
                },
                {
                    "schoolName": "Harvard University"
                }
            ],
            "works": [
                {
                    "employer": "Chan Zuckerberg Initiative"
                },
                {
                    "employer": "Facebook"
                }
            ],
            "idVerified": null,
            "friendCount": null,
            "idVerifiedInfo": null,
            "url": "https:\/\/www.facebook.com\/4",
            "gender": "m",
            "fansCount": null,
            "avatarUrl": "https:\/\/fb-s-c-a.akamaihd.net\/h-ak-fbx\/v\/t34.0-1\/p50x50\/16176889_112685309244626_578204711_n.jpg?efg=eyJkdHciOiIifQ==&_nc_ad=z-m&oh=1d19d2bcf1881ee7deaaf7cf777cb194&oe=597DA91E&__gda__=1501340296_20445fee97f7852820dbda04f427e5d8",
            "followCount": null,
            "viewCount": null,
            "postCount": null,
            "birthday": null,
            "location": "Palo Alto, California",
            "likeCount": null,
            "id": "4",
            "biography": "I'm trying to make the world a more open place.",
            "screenName": "Mark Zuckerberg"
        }
    ]
}
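
Many profile fields in the sample are `null`, so downstream code should not assume they are present. Below is a small sketch of flattening one profile record into a few fields that could feed the relationship network. The input field names come from the sample above; the function name and the output layout are just one possible choice.

```python
def summarize_profile(profile):
    """Flatten one item of a profile response's "data" list."""
    return {
        "id": profile.get("id"),
        "screen_name": profile.get("screenName"),
        "location": profile.get("location"),
        # "educations" and "works" may be missing or null, so default to [].
        "schools": [e["schoolName"] for e in profile.get("educations") or []
                    if e.get("schoolName")],
        "employers": [w["employer"] for w in profile.get("works") or []
                      if w.get("employer")],
        # Keep None for unknown counts rather than treating them as zero.
        "friend_count": profile.get("friendCount"),
    }
```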

[Screenshot: API platform]

[Screenshot: response parameters]

[Screenshot: backend console]


Reposted from blog.csdn.net/littermaker/article/details/81506016