How to scrape a website that requires a login

Recently, for a project, I had to scrape some pages from a website that requires a login. It was not as simple as I expected, so I decided to write a tutorial about it.

In this tutorial, we will scrape the list of projects in our Bitbucket account.

The tutorial code can be found on my GitHub.

We will follow these steps:

1. Extract the details needed for the login
2. Log in to the site
3. Scrape the required data

In this tutorial, I used the following packages (listed in requirements.txt):

requests

lxml

# Step 1: Study the website

Open the login page

Go to "bitbucket.org/account/signin". You will see the login page (log out first if you are already logged in).

Inspect the details we need to extract for the login

In this section, we will build a dictionary that holds the details needed to perform the login:

1. Right-click the "Username or email" field and select "Inspect element". We will use the value of the "name" attribute of this input box, which is "username". "username" will be the key, and our username / email will be the corresponding value (on other sites the key might be "email", "user_name", "login", and so on).

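2. Right-click the "Password" field and select "Inspect element". In the script we need to use the value of the "name" attribute of this input box, which is "password". "password" will be the dictionary key, and the password we enter will be the corresponding value (on other sites the key might be "userpassword", "loginpassword", "pwd", and so on).

3. In the page source, look for a hidden input tag named "csrfmiddlewaretoken". "csrfmiddlewaretoken" will be the key, and the corresponding value will be the value of this hidden input (on other sites it might be a hidden input named "csrftoken", "authenticationtoken", and so on). For example: "Vy00PE3Ra6aISwKBrPn72SFml00IcUV8".

In the end, we will have a dictionary like this: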

payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}

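Keep in mind that this is a specific case for this website. Although this login form is simple, other sites may require us to check the browser's request log and find the relevant keys and values to use in the login step.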
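When the field names are not obvious, it can help to list every named input on the login page instead of inspecting them one by one. The following is a minimal sketch of that idea using only Python's standard-library HTML parser (not the lxml approach used later in this tutorial). The sample markup, the form action, and the helper names `FormFieldCollector` and `list_form_fields` are assumptions made for illustration, not Bitbucket's real page.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name and value of every named <input> tag on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are stored as empty strings.
            self.fields[name] = attributes.get("value") or ""


def list_form_fields(page_html):
    """Return a dict mapping input names to their current values."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    return collector.fields


# Hypothetical login form, modeled on the fields described above.
SAMPLE_LOGIN_PAGE = """
<form method="post" action="/account/signin/">
  <input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

fields = list_form_fields(SAMPLE_LOGIN_PAGE)
print(sorted(fields))
# → ['csrfmiddlewaretoken', 'password', 'username']
```

The printed names are the candidate keys for the payload dictionary. Hidden inputs that already carry a value, such as the CSRF token here, usually have to be sent back unchanged.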

# Step 2: Log in to the website

For this script, we only need to import the following:

import requests

from lxml import html

## First, we create a session object. This object lets us persist the login session across all of our requests.

session_requests = requests.session()

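## Second, we extract the CSRF token that the page uses during login. In this example we use lxml and XPath to extract it; we could also use a regular expression or some other method to extract this data.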

login_url = "https://bitbucket.org/account/signin/?next=/"

result = session_requests.get(login_url)

tree = html.fromstring(result.text)

authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
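The extracted token still has to replace the "<CSRF_TOKEN>" placeholder in the payload before logging in. As a rough sketch of that step, and of the regular-expression alternative mentioned above, the snippet below pulls the token out of a sample page and builds the payload. The helper names `extract_csrf_token` and `build_payload` and the sample markup are hypothetical, and the pattern assumes quoted attribute values, so treat it as an illustration rather than a robust HTML parser.

```python
import re


def extract_csrf_token(page_html, field_name="csrfmiddlewaretoken"):
    """Return the value of the input named field_name, or None if absent."""
    # Look at each <input ...> tag and read its name and value attributes.
    for tag in re.findall(r"<input\b[^>]*>", page_html, flags=re.IGNORECASE):
        name = re.search(r"""\bname\s*=\s*["']([^"']*)["']""", tag, flags=re.IGNORECASE)
        value = re.search(r"""\bvalue\s*=\s*["']([^"']*)["']""", tag, flags=re.IGNORECASE)
        if name and value and name.group(1) == field_name:
            return value.group(1)
    return None


def build_payload(username, password, token):
    """Assemble the login form data using the keys found in step 1."""
    return {
        "username": username,
        "password": password,
        "csrfmiddlewaretoken": token,
    }


# Hypothetical page fragment standing in for the real login page HTML.
SAMPLE_PAGE = (
    '<form><input type="hidden" name="csrfmiddlewaretoken" '
    'value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8"></form>'
)

token = extract_csrf_token(SAMPLE_PAGE)
if token is None:
    raise ValueError("CSRF token not found; the form markup may have changed")

payload = build_payload("<USER NAME>", "<PASSWORD>", token)
print(payload["csrfmiddlewaretoken"])
# → Vy00PE3Ra6aISwKBrPn72SFml00IcUV8
```

In the tutorial's flow, the page text would come from the session's GET response (result.text) instead of the sample string, and the token could just as well be the authenticity_token value obtained with XPath.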

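More information about XPath and lxml can be found in the lxml documentation.

Next, we perform the login. In this phase, we send a POST request to the login URL, using the payload created in the previous step as the data. We can also send a header with the request, adding a referer key set to the same URL.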

result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)

# Step 3: Scrape the content

Now that we have logged in successfully, we will perform the actual scraping on the Bitbucket dashboard page.

url = 'https://bitbucket.org/dashboard/overview'

result = session_requests.get(
    url,
    headers=dict(referer=url)
)

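To test this, we scrape the list of projects from the Bitbucket dashboard page. We again use XPath to find the target elements, strip newlines and surrounding whitespace from the text, and print the result. If everything went OK, the output should be the list of buckets / projects in your Bitbucket account.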

tree = html.fromstring(result.content)

bucket_elems = tree.findall(".//span[@class='repo-name']")

bucket_names = [bucket.text_content().replace("\n", "").strip() for bucket in bucket_elems]

print(bucket_names)
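The text cleanup can be tried on its own, without a live page. The sketch below is a small variation on the line above: instead of removing newlines and trimming the ends, it collapses every run of whitespace into a single space, which also handles names that wrap across lines. The helper name `clean_name` and the sample strings are made up for illustration.

```python
def clean_name(raw_text):
    """Collapse newlines and repeated whitespace, then trim both ends."""
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(raw_text.split())


# Hypothetical raw text, as an element's text content might look.
raw_names = ["\n    my-first-repo\n  ", "\n  team /\n  shared-repo "]

print([clean_name(name) for name in raw_names])
# → ['my-first-repo', 'team / shared-repo']
```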

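You can also validate the results of these requests by checking the status code returned by each one. It won't always tell you whether the login phase succeeded, but it can serve as an indicator.

For example:

result.ok # Tells us whether the last request succeeded

result.status_code # Returns the status code of the last request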
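Because a failed login can still come back with a 200 status, one extra check is to look at where the response ended up. The sketch below is a heuristic only: it assumes that a failed login leaves us on the sign-in page, which is an assumption about the site's behavior rather than something this tutorial verified. The helper name `login_succeeded` is hypothetical, and the stand-in objects mimic only the `ok` and `url` attributes of a requests response so the example runs offline.

```python
from types import SimpleNamespace


def login_succeeded(response, login_path="/account/signin"):
    """Heuristic: the request succeeded and did not end on the sign-in page."""
    return bool(response.ok) and login_path not in response.url


# Stand-ins for response objects (assumed final URLs, for illustration).
redirected_home = SimpleNamespace(ok=True, url="https://bitbucket.org/")
back_at_login = SimpleNamespace(
    ok=True, url="https://bitbucket.org/account/signin/?next=/"
)

print(login_succeeded(redirected_home))
# → True
print(login_succeeded(back_at_login))
# → False
```

With a real session, the object passed in would be the result of the login POST.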

That's it.
