The Facebook Crawler: The Bot I've Poured the Most Effort into over the Years (with Source Code Walkthrough)

Preface

A good crawler is like a pet: every engineer wants to carefully feed it, look after it, and help it grow.

Crawler engineers who have worked on public-opinion monitoring projects know that the job often involves scraping hundreds or thousands of websites, mostly social media and news. Since FB and Twitter (source walkthrough in the next issue) are the most powerful social giants overseas, I was bound to cross paths with them. Over that time I iterated this crawler through several versions, including Requests, WebDriver, and API implementations.

Facebook's (hereafter FB) anti-crawling mechanisms got me stuck many times. Growing this crawler meant feeding it huge numbers of accounts, and large-scale account bans forced me to write an account-registration bot. As it happened, while writing the API version of the crawler in 2018, I stumbled on a privacy flaw in the Facebook API (hardly even a flaw, really): certain hidden fields let you view users' private information, including their email addresses.

In my other role as a network-security enthusiast, I reported this barely-a-problem problem to their security team anyway. Coincidentally, two months later the FB data-leak story hit the news, which led to a sweeping API overhaul. In November of that year I received an invitation from FB to attend a meeting in Beijing under the title of a security researcher for the China region (photo from 2018).
At that meeting I also learned how FB collects device fingerprints to block automated registration, which left me no choice but to deploy the account-production bot across distributed virtual nodes on K8s.

Speaking of which, plenty of readers have asked me: FB registration requires a phone number, so what do you do? That is exactly where the effort went. After scouring the web I ended up writing an SMS-code bot that integrates a third-party API from overseas; the selling point was its support for phone numbers worldwide. Another weapon in my arsenal (a rough sketch of the API polling follows below).

Partial source code:

The SMS-code bot in action:
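For a concrete picture, here is a minimal sketch of polling such a third-party SMS-code API; the base URL, token parameter, and response fields are hypothetical stand-ins rather than the real provider:

# A minimal sketch of polling a third-party SMS-code API.
# The base URL, token parameter, and JSON fields are hypothetical
# stand-ins for the real provider.
import time
import requests

API_BASE = "https://api.sms-provider.example.com"  # hypothetical endpoint

def rent_number(token, country="US"):
    # Rent a phone number usable for a Facebook registration
    resp = requests.get(API_BASE + "/number",
                        params={"token": token, "service": "facebook",
                                "country": country})
    data = resp.json()
    return data["number"], data["request_id"]

def wait_for_code(token, request_id, timeout=180):
    # Poll until the provider relays the SMS verification code
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(API_BASE + "/sms",
                            params={"token": token, "request_id": request_id})
        data = resp.json()
        if data.get("code"):
            return data["code"]
        time.sleep(5)
    return None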

All the resources generated by the registration bot are stored and displayed in an AC database.

Page view:

The crawler as a whole is a sizeable project, covering profiles, friends, posts, comments, likes, follows, shares, groups, and more… roughly 4000+ lines of code.

Partial source code:

For data storage I used PostgreSQL (PG); a minimal insert sketch follows below.
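As a minimal sketch (my own illustration; the table and column names are assumptions, not the project's actual schema), persisting a scraped profile into PG with psycopg2 might look like this:

# Minimal sketch of persisting a scraped profile into PostgreSQL.
# Connection settings, table, and column names are assumptions.
import json
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", dbname="facebook",
                        user="crawler", password="secret")

def save_account(item):
    # Upsert keyed on account_id (assumes a unique constraint on that column)
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO fb_account (account_id, account_name, about)
            VALUES (%s, %s, %s)
            ON CONFLICT (account_id) DO UPDATE
                SET account_name = EXCLUDED.account_name,
                    about = EXCLUDED.about
            """,
            (item["account_id"], item["account_name"],
             json.dumps(item.get("about", {}), ensure_ascii=False)),
        )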

Below I've picked a few functions as examples; for related questions, you can reach the author via my WeChat official account.

1. Account Check

As I said above, FB's account-ban mechanisms are particularly harsh, so no crawler engineer can write a complete crawler in one pass, ship it, and expect zero problems! Anti-crawling mechanisms take time and effort to map out. FB was no different: up-front research had to cover every anti-crawling situation that might occur, so that my crawler could detect it in real time and raise alerts.

# Check whether the account has lost access
    def _check_page_source(self, driver):
        # Being bounced back to the mobile home page usually means the target
        # page refused to load; three bounces in a row are treated as a
        # temporary restriction ("seven_days")
        if "https://m.facebook.com/home.php?_rdr" == driver.current_url:
            self.home_url_time += 1
            if self.home_url_time == 3:
                self.home_url_time = 0
                return False, "seven_days"
            return True, "good"
        # Map known error banners (Chinese and English UI) to a status label
        error_dict = {
            u"登录 Facebook 即可浏览个人主页": "cookies_error",
            u"你必须先登录": "cookies_error",
            u"安全验证码": "code_error",
            u"我们需要验证你的身份": "upload_photo",
            u"请上传一张您本人的照片": "upload_photo",
            u"你的帐户已被停用": "useless",
            u"使用手机验证你的帐户": "phone_number",
            #u"你要求的页面无法显示": "seven_days",
            u"今天就加入 Facebook 吧。": "account_failure",
            u"We Need You To Confirm Your Identity": "upload_photo",
            u"我们最近发现您的帐户在开展可疑活动": "upload_photo",
            u"Your account has been disabled": "useless",
            u"Upload A Photo Of Yourself": "upload_photo",
            u"Please enter your phone number": "phone_number",
            u"Please enter the text below": "code_error",
            u"You must log in first": "cookies_error",
        }
        for ele in error_dict:
            if ele in driver.page_source:
                print "error found in html", ele
                return False, error_dict[ele]
        return True, "good"
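For context, here is a hedged sketch of how this check can be wired into the crawl loop; ensure_account_ok, alert, and switch_account are hypothetical helpers, not part of the original source:

    # Hypothetical wiring of the check into the crawl loop: rotate to a
    # fresh account and raise an alert when the current one goes bad.
    def ensure_account_ok(self, driver, account_id):
        ok, reason = self._check_page_source(driver)
        if not ok:
            self.logger.warning(u"account %s unusable: %s" % (account_id, reason))
            self.alert(account_id, reason)        # hypothetical alerting hook
            driver = self.switch_account(driver)  # hypothetical account rotation
        return driver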

2. Fetching the Profile

A competent crawler engineer never writes any crawler off the cuff. For common crawlers there's no need to reinvent the wheel, and for some sites it pays to look for an API or check the mobile (M) site first. For FB, to get maximum throughput I don't recommend scraping the PC site (I can tell you responsibly: driving the PC site through a simulated browser is slow enough to make you cry, and don't even think about hand-crafting the PC site's HTTP requests). And it's not just FB: on many sites the PC site's JS loading wastes far more time than the M site does. Before committing, though, weigh your own requirements, because the M site is stripped down and may not carry all the information you need (see the small URL-rewriting sketch below).
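As a quick illustration (my own sketch, not from the original project), pointing a profile URL at the M site is just a host swap:

# Rewrite a desktop Facebook URL to its mobile (M site) equivalent.
def to_mobile(url):
    return url.replace("https://www.facebook.com",
                       "https://m.facebook.com", 1)

print(to_mobile("https://www.facebook.com/zuck"))
# -> https://m.facebook.com/zuck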

# Fetch the profile ("About" page)
# Assumed imports for these excerpts (class context not shown):
#   import re, json, base64, requests
#   from io import BytesIO
#   from collections import OrderedDict
#   from scrapy.selector import Selector
#   AccountItem: the project's Item class (definition not shown)
    def getAbout(self, params, _driver, facebookId, url):
        if self.fail_num >= 3:
            _driver.quit()
        proxies = {"http": "http://127.0.0.1:8118", "https": "http://127.0.0.1:8118"}
        ### The mobile site comes in two flavors, firefox and chrome; firefox is the one adapted here
        print "==getAbout=="
        # Holds the profile fields
        key_dict = {}
        item = AccountItem()
        item["account_id"] = facebookId
        item["account_url"] = url
        # Profile picture: look for the 74x74 avatar hosted on scontent
        image_link_ele = _driver.find_elements_by_xpath('//a/img[contains(@src,"https://scontent") and contains(@src,"p74x74")]') or \
            _driver.find_elements_by_xpath('//img[contains(@src,"https://scontent") and contains(@src,"p74x74")]')
        if image_link_ele:
            image_link = image_link_ele[0].get_attribute('src')
            response = requests.get(image_link, proxies=proxies, verify=False)
            ls_f = base64.b64encode(BytesIO(response.content).read())
            pic_format = image_link.split('?', 1)[0].split('.')[-1]
            ext_name = 'data:image/%s;base64,' % pic_format

        # Parse the rendered page
        about_selector = Selector(text=_driver.page_source)
        # For alias-style FB IDs, fetch the numeric unique ID a second time
        if not facebookId.isdigit():
            unique_id_ele = about_selector.xpath('//div[@id="objects_container"]/div/div/div/div[2]/div/div/div/a/@href').extract_first(default="")
            if 'profile_id' in unique_id_ele:
                unique_id = re.findall(r'profile_id=(\d+)', unique_id_ele)
            else:
                unique_id = re.findall(r'&id=(\d+)&', unique_id_ele)
            item['account_unique_id'] = unique_id[0] if unique_id else ''
        item['account_name'] = about_selector.xpath('//div/span/strong/text()').extract_first(default="")
        if not item['account_name']:
            self.fail_num += 1
        # Friend count, if the friends link is present
        friend = _driver.find_elements_by_xpath("//div[@id='root']/div[1]/div[2]/div[2]/div[1]/a")
        friends_num = 0
        if len(friend) != 0:
            friend_nums = friend[0].text if friend else ''
            friend_num = re.findall(r'\d+', friend_nums)
            friends_num = friend_num[0] if friend_num else 0
        # Section IDs of the profile blocks we care about
        about_list = [
            "work",
            "education",
            "skills",
            "living",
            "contact-info",
            "basic-info",
            "nicknames",
            "relationship",
            "quote",
        ]
        # Walk each section and collect its key/value rows
        for key in about_list:
            elements = about_selector.xpath("//div[@id='%s']/div/div[2]//table/tbody/tr" % key)
            if elements:
                ele_dict = {}
                for ele in elements:
                    data_key, value = tuple(ele.xpath("td").xpath("string(.)").extract())
                    if ele_dict.has_key(data_key):
                        # Repeated keys (e.g. several jobs) collapse into a list
                        new_value = ele_dict[data_key]
                        new_value = (new_value + [value]) if isinstance(new_value, list) else [new_value, value]
                        ele_dict[data_key] = new_value
                    else:
                        ele_dict[data_key] = value
                key_dict[key] = ele_dict
            else:
                # Sections without a table are stored as plain text
                key_dict[key] = about_selector.xpath("//div[@id='%s']/div/div[2]" % key).xpath("string(.)").extract_first(default="")
        # Family members: profile link, avatar, name, and relationship
        div_ele = about_selector.xpath("//div[@id='family']/div/div[2]/div/div")
        families = []
        get_id_func = lambda tag: "".join([x for i in re.compile(u"/profile.php\?id=(\d+)|/(.*)\?refid=|/(.*)").findall(tag) for x in i])
        for ele in div_ele:
            ele_data = ele.xpath('h3').xpath('string(.)').extract()
            if ele_data:
                href = ele.xpath('h3/a/@href').extract_first(default="")
                relation = OrderedDict()
                relation['fb_id'] = get_id_func(href)
                relation['img_link'] = ele.xpath('parent::*/a/img[contains(@src,"https://scontent")]/@src').extract_first(default="")
                relation['name'], relation['relation'] = tuple(ele_data)
                families.append(json.dumps(relation, ensure_ascii=False))

3. Fetching Posts

Page redesigns are a perennial pain for every crawler engineer. How do we deal with them? We'll dig into that properly in the next issue, since I'm writing this one in spare moments while working remotely and time is tight.

# Fetch posts
    def getPost(self, _driver, facebookId, postUrl):
        is_public = None
        print "==getPost=="
        count = 0
        # Retry the post URL up to three times before bailing out
        while True:
            if count >= 3:
                os.system('./fb_stop.sh 1')
            try:
                _driver.get(postUrl)
                print u'get post url ok...'
                break
            except Exception as e:
                count += 1
                exc_type, exc_obj, exc_tb = sys.exc_info()
                fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
                self.logger.warning(u"get post Error: %s, %s, %s, %s" % (exc_type, e, fname, exc_tb.tb_lineno))

        page_source = _driver.page_source
        page_response = Selector(text=page_source)
        # A canonical <link> in the head indicates a public page
        head = page_response.xpath("//head/link[@rel]").extract_first()
        if head and "canonical" in head:
            is_public = True
        # Inspect the page to decide whether the account is still usable
        check, reason = self._check_page_source(_driver)
        if check:
            # If this element can be found directly, the target is a public
            # page and can be clicked straight away
            _public = _driver.find_elements_by_id("m-timeline-cover-section")
            # Otherwise probe the candidate positions for a clickable element
            for num in range(3, 6):
                _private = _driver.find_elements_by_xpath("/html/body/div/div/div[2]/div/div[1]/div[1]/div[%s]/a[1]" % num)
                if _private: break
            # If neither approach works, fall back to building the timeline
            # URL by hand and fetching it with a plain GET
            time_line = page_response.xpath("//div[@id='objects_container']/div/div/div/div[4]/a[contains(@href,'v=timeline')]/@href").extract_first(default="")
            if not time_line:
                time_line = postUrl + "&v=timeline" if "?" in postUrl else postUrl + "/?v=timeline"
            if time_line or _private or _public:
                if time_line:
                    if 'http' in time_line:
                        _driver.get(time_line)
                    else:
                        # Request the URL directly to avoid "element is not clickable" errors
                        _driver.get('https://m.facebook.com' + time_line)
                else:
                    _private[0].click()
                # Running total of the user's posts
                posts_count = 0

                # Index of the current year link
                current_year_subscript = 0

                # Loop over the timeline, page by page
                while True:
                    breakpoint_post_url = _driver.current_url
                    time.sleep(random.randrange(5, 10))
                    # Extract the post list from the rendered page
                    selector = Selector(text=_driver.page_source)
                    posts = selector.xpath('//div[@role="article" and contains(@data-ft,"top_level_post_id")]').extract() #or selector.xpath('//div[@data-ft]').extract()
                    posts_count += len(posts)
                    # Walk the extracted posts
                    for post in posts:
                        if not isinstance(post, unicode):
                            post = post.decode("utf-8")
                        _post, _comments, _reaction = self.getComment(_driver, facebookId=facebookId, _post=post, post_breakpoint_url=breakpoint_post_url, is_public=is_public)
                        yield _post, _comments, _reaction

                    # Locate the "more" button
                    more_content = _driver.find_elements_by_xpath('//div[@id="structured_composer_async_container"]/div[2]/a')
                    # Load more posts
                    if more_content:
                        _driver.execute_script(
                            "window.scrollBy(0,%s)" % (more_content[0].location_once_scrolled_into_view['y'] - 200))
                        if more_content[0].text in [u"更多", u"更多动态", "More"]:
                            more_content[0].click()
                            time.sleep(random.randrange(2, 5))
                            if not posts:
                                continue
                            selector = Selector(text=_driver.page_source)
                            new_post = selector.xpath('//div[@role="article"]').extract() or selector.xpath('//div[@data-ft]').extract()
                            if new_post and Selector(text=sorted(new_post)[0]).xpath("string(.)").extract_first() != Selector(text=sorted(posts)[0]).xpath("string(.)").extract_first():
                                continue
                    handles = _driver.window_handles
                    _driver.switch_to_window(handles[0])
                    # Collect the links for jumping to earlier years
                    next_year_more_content = _driver.find_elements_by_xpath('//div[@id="structured_composer_async_container"]/div[last()]/div/a')
                    # Year indices start at [0] and run from the most recent year downwards
                    if next_year_more_content and current_year_subscript < len(next_year_more_content[1:]):
                        more_content = next_year_more_content[current_year_subscript]
                        more_content.click()
                        current_year_subscript += 1
                        continue
                    break
Thank you for taking your valuable time to read this. Writing isn't easy; if you enjoyed the article, please give me a follow before you go. Your support is what keeps me writing, and I hope to bring you more quality articles in the future.


Reposted from blog.csdn.net/qiulin_wu/article/details/104310709