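Last month an old classmate introduced me to iDataAPI, a third-party crawler API provider that can collect Facebook content and user information. A week of continuous testing at the scale of millions of records showed it to be quite stable, so we plan to include it in the overseas information monitoring project that we will submit to the provincial government in the second half of the year.
Facebook does offer the official Graph API, but the latest version no longer exposes a lot of information, for example searching users' posts by keyword. Scraping the desktop web pages is also very hard, because Facebook pages load their data dynamically with a large amount of JavaScript. So when I heard that they had built a cloud-based, distributed crawler that also collects data through the mobile client, I was keen to try it, and the test results matched expectations.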
The features we mainly use are:
- Fetch users' timelines by keyword
- Fetch a user's profile
- Fetch a user's friends
- Fetch a user's posts
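Together these features let us build a relationship network, which lays the groundwork for the subsequent NLP work. On top of this crawler API we will later add public opinion monitoring, sentiment analysis, opinion mining, and so on.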
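To make the "relationship network" idea concrete, here is a minimal sketch of turning friend lists into an undirected graph. It assumes the friend data has already been collected into a plain mapping from a user id to that user's friend ids; the function names `build_friend_graph` and `mutual_friends` are hypothetical and are not part of iDataAPI.

```python
from collections import defaultdict


def build_friend_graph(friend_lists):
    """Build an undirected adjacency map from {user_id: [friend_id, ...]}.

    The input shape is an assumption made for illustration; adapt it to
    whatever the friends endpoint actually returns.
    """
    graph = defaultdict(set)
    for user_id, friends in friend_lists.items():
        for friend_id in friends:
            if friend_id == user_id:
                continue  # ignore self-loops
            # Friendship is symmetric, so record the edge in both directions.
            graph[user_id].add(friend_id)
            graph[friend_id].add(user_id)
    return dict(graph)


def mutual_friends(graph, a, b):
    """Return the ids that are adjacent to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())
```

A graph in this form can be handed to a graph library later; a plain dict of sets keeps the sketch dependency-free.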
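iDataAPI comes with a complete web console: you can test the crawlers in the browser (I tried Facebook, Twitter, Weibo, YouTube... and all of them worked quite well, which is rare), and the back end lets you create tasks, view logs, and view data. Ordinary developers get free credit for testing when they sign up. Our project is still in the application phase, though, and our institute needs a fairly large volume; once we sign an annual contract later and add Twitter and other sources, we expect to collect roughly ten million records a day.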
Sample response (Facebook posts)
{
"hasNext": true,
"retcode": "000000",
"appCode": "facebook",
"dataType": "post",
"pageToken": "enc_AdBhgxzOwy0fZBFjW6GXwbjJDRUca1SS5ccSTKp4TvchMAF3De0qdfVEC8sZAcCQZCw1CtORi9eLls3iJvJJk8PlNIQ|1493425239",
"data": [
{
"posterId": "4",
"commentCount": 76221,
"posterScreenName": "Mark Zuckerberg",
"title": null,
"url": "https:\/\/www.facebook.com\/4_10103685865597591",
"imageUrls": [
"https:\/\/fb-s-d-a.akamaihd.net\/h-ak-xtp1\/v\/t15.0-10\/s720x720\/18223192_10103685908017581_8465195272706719744_n.jpg?_nc_ad=z-m&oh=e0736750f4882bed329ad89749849443&oe=59C22C86&__gda__=1505738387_e11048d689aba9e12e3fef771eab44f5"
],
"originUrl": "https:\/\/www.facebook.com\/zuck\/videos\/10103685865597591\/",
"geoPoint": "37.484, -122.149",
"mediaType": "video",
"publishDate": 1493494203,
"likeCount": 173477,
"content": "Part II of driving through South Bend, Indiana with Mayor Pete Buttigieg.",
"parentPostId": "3791568f35f4c067d6403a5c344136cc",
"shareCount": 9506,
"parentAppCode": "facebook",
"publishDateStr": "2017-04-29T19:30:03",
"id": "4_10103685865597591",
"origin": false,
"originContent": null
}
]
}
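The `hasNext` and `pageToken` fields in the sample above suggest token-based pagination. The following is a sketch of how a client might walk through the pages, under several assumptions: `fetch_page` is a hypothetical caller-supplied function that performs the actual request (the endpoint and the way the token is passed are not shown in this post), and `retcode` `"000000"` is taken to mean success because that is the only value seen in the samples. The `post_time` helper relies on the observation that, in the sample, `publishDateStr` equals `publishDate` interpreted as a UTC Unix timestamp.

```python
from datetime import datetime, timezone


def iter_posts(fetch_page, max_pages=100):
    """Yield post dicts page by page.

    fetch_page(page_token) is a hypothetical callable returning one parsed
    response dict; pass None for the first page.
    """
    page_token = None
    for _ in range(max_pages):  # hard cap so a bad token cannot loop forever
        resp = fetch_page(page_token)
        # Assumption: "000000" is the success code, as in the samples.
        if resp.get("retcode") != "000000":
            raise RuntimeError(f"unexpected retcode: {resp.get('retcode')!r}")
        yield from resp.get("data") or []
        page_token = resp.get("pageToken")
        if not resp.get("hasNext") or not page_token:
            break


def post_time(post):
    """Convert the numeric publishDate field to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(post["publishDate"], tz=timezone.utc)
```

Keeping the HTTP call outside `iter_posts` makes the loop easy to test with canned responses and leaves authentication details to the caller.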
Sample response (Facebook user profile)
{
"hasNext": false,
"retcode": "000000",
"appCode": "facebook",
"dataType": "profile",
"pageToken": null,
"data": [
{
"userName": "zuck",
"idType": "user",
"educations": [
{
"schoolName": "Ardsley High School"
},
{
"schoolName": "Phillips Exeter Academy"
},
{
"schoolName": "Harvard University"
}
],
"works": [
{
"employer": "Chan Zuckerberg Initiative"
},
{
"employer": "Facebook"
}
],
"idVerified": null,
"friendCount": null,
"idVerifiedInfo": null,
"url": "https:\/\/www.facebook.com\/4",
"gender": "m",
"fansCount": null,
"avatarUrl": "https:\/\/fb-s-c-a.akamaihd.net\/h-ak-fbx\/v\/t34.0-1\/p50x50\/16176889_112685309244626_578204711_n.jpg?efg=eyJkdHciOiIifQ==&_nc_ad=z-m&oh=1d19d2bcf1881ee7deaaf7cf777cb194&oe=597DA91E&__gda__=1501340296_20445fee97f7852820dbda04f427e5d8",
"followCount": null,
"viewCount": null,
"postCount": null,
"birthday": null,
"location": "Palo Alto, California",
"likeCount": null,
"id": "4",
"biography": "I'm trying to make the world a more open place.",
"screenName": "Mark Zuckerberg"
}
]
}
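Many profile fields in the sample are `null`, so downstream code should treat them as optional. Below is a minimal sketch of flattening such a response into a small record type. The `Profile` class and `parse_profiles` function are hypothetical names introduced here; only the JSON field names come from the sample above.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Profile:
    user_id: str
    screen_name: Optional[str] = None
    location: Optional[str] = None
    schools: List[str] = field(default_factory=list)
    employers: List[str] = field(default_factory=list)
    friend_count: Optional[int] = None  # null in the sample, so keep it optional


def parse_profiles(raw_json: str) -> List[Profile]:
    """Parse a profile response string into Profile records."""
    resp = json.loads(raw_json)
    profiles = []
    for item in resp.get("data") or []:
        profiles.append(Profile(
            user_id=item["id"],
            screen_name=item.get("screenName"),
            location=item.get("location"),
            # "educations" and "works" may be missing or null; fall back to [].
            schools=[e["schoolName"] for e in item.get("educations") or []
                     if e.get("schoolName")],
            employers=[w["employer"] for w in item.get("works") or []
                       if w.get("employer")],
            friend_count=item.get("friendCount"),
        ))
    return profiles
```

The escaped slashes (`\/`) in the sample URLs are valid JSON and are decoded to plain `/` by `json.loads`, so no extra cleanup is needed.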
API platform:
Response parameters:
Admin console: