A detailed explanation of cookies and how to obtain them automatically with Selenium

Preface

Data acquisition methods and data assets will be among the core production tools and resources of the future. Every large model depends on further feeding and training with more refined data. Although there are many ways to collect data in bulk, professional data in some vertical fields is difficult to obtain: searching for it manually is time-consuming and laborious, processing it is troublesome, and, most importantly, the data that is actually relevant often cannot be retrieved accurately, even though it would be a great help to project development. I previously worked on big data algorithms and also have solid experience in data acquisition and collection, so I am starting a new technical column to review the data collection techniques I have worked with in depth.

1. What are cookies

You have probably noticed that after logging in to a website once, you can often return to it later without entering your account name and password again. Imagine going to a coffee shop. On your first visit, you tell the clerk your name and the kind of coffee you like, and the clerk remembers it. On your next visit, the clerk recognizes you, already knows your name and your usual order, and prepares your coffee right away.

In this example, the coffee shop is like a website, and you are like the user visiting the website. The name and coffee preference you provide are just like the information you enter on the website. The coffee shop clerk remembers your information, much like a website stores a cookie on your computer. Therefore, cookies are small data files stored on the user's computer by the website in order to remember the user's preferences or identity information. In this way, the next time you visit the same website, it can quickly identify you and customize the content based on the stored information, without having to log in again.
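The coffee shop exchange can be sketched in a few lines of Python using the standard library's http.cookies module. This is only an illustration under stated assumptions: the function names, the cookie name customer_name, and the dictionary standing in for the browser's cookie store are all hypothetical and not part of any real site.

```python
from http.cookies import SimpleCookie


def server_issue_cookie(name: str) -> str:
    """Server side: build the Set-Cookie header value on the first visit."""
    cookie = SimpleCookie()
    cookie["customer_name"] = name  # hypothetical cookie name
    return cookie["customer_name"].OutputString()


def browser_store(jar: dict, set_cookie_value: str) -> None:
    """Browser side: remember the cookie the server sent (jar is a plain dict)."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    for key, morsel in parsed.items():
        jar[key] = morsel.value


def browser_cookie_header(jar: dict) -> str:
    """Browser side: build the Cookie request header for the next visit."""
    return "; ".join(f"{key}={value}" for key, value in jar.items())


def server_greet(cookie_header: str) -> str:
    """Server side: recognize a returning visitor from the Cookie header."""
    parsed = SimpleCookie()
    parsed.load(cookie_header)
    if "customer_name" in parsed:
        return f"Welcome back, {parsed['customer_name'].value}!"
    return "Hello, new customer!"


if __name__ == "__main__":
    jar = {}
    print(server_greet(browser_cookie_header(jar)))  # first visit, no cookie yet
    browser_store(jar, server_issue_cookie("Alice"))
    print(server_greet(browser_cookie_header(jar)))  # second visit, cookie sent back
```

The point of the sketch is the direction of travel: the server hands the data out once, the browser keeps it, and the browser sends it back with every later request.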

Now that we know what cookies are for, let's look at them in more depth.

2. The function and data form of cookies

When browsing a website, you will sometimes see a cookie consent pop-up.

The text of such a pop-up already reveals some of the functions of cookies: they help the site work out what we need, and they are used to analyze traffic and site usage, that is, to monitor users' browsing habits and activities. Advertisers can also use cookies to collect information about us, based on browsing history and other online behavior, in order to display more relevant advertising. This is why, after we search for a product, that product starts being recommended to us, and why video sites and other websites keep pushing related videos. Note that in incognito (private browsing) mode the browser still accepts cookies during the session, but discards them when the private window is closed.

To summarize the functions of cookies, there are the following points:

  • Authentication and session management: When you log into a website, the website will use cookies to remember your login status so that you do not have to log in again every time you visit a new page.
  • Personalized settings: The website uses cookies to store personalized settings, such as language preferences, theme selection, etc., so as to provide the same customized experience the next time you visit.
  • Tracking and analytics: The website uses cookies to track users' browsing habits and activities. This is useful for websites to improve their content and structure, providing a more personalized experience.
  • Ad targeting: Advertisers use cookies to collect information about you to display more relevant ads. This is based on your browsing history and other online behavior.
  • Expiration time: Cookies can be set to different expiration times. Some disappear when you close your browser (session cookies), while others disappear after a specific date (persistent cookies).
  • Privacy and security: While cookies are important for improving the website experience, they also raise privacy and security concerns. Users can usually manage cookies, including deleting and disabling them, in their browser settings.
  • Third-party cookies: In addition to cookies set directly by the website (first-party cookies), there are also third-party cookies, usually set by advertisers and analytics service providers, to track user behavior across websites.
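The first-party versus third-party distinction in the last point can be illustrated with a rough check. This is a simplified sketch, not how browsers actually decide: it assumes that the "site" of a host is simply its last two labels, which is wrong for suffixes such as .co.uk, and the function names and hosts are hypothetical.

```python
def site_of(host: str) -> str:
    """Naive 'site' of a host: its last two labels.

    Assumption for illustration only; real browsers consult the
    Public Suffix List, so this is wrong for hosts like shop.example.co.uk.
    """
    labels = host.lstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def is_third_party(cookie_domain: str, page_host: str) -> bool:
    """True when the cookie belongs to a different site than the page being viewed."""
    return site_of(cookie_domain) != site_of(page_host)


if __name__ == "__main__":
    # Cookie set by the site itself
    print(is_third_party(".example.com", "www.example.com"))
    # Cookie set by an embedded advertising host
    print(is_third_party("ads.tracker.net", "www.example.com"))
```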

With these points in mind, let's look at the data that cookies actually store. Each browser stores and manages cookies differently. Taking Firefox as an example, open the browser's settings. In the cookies and site data section you can see how much data the browser has stored. Click "Manage Data" to list it site by site.

If you visit a site frequently (bilibili, in my case), you will find that its stored cookies and site data take up a lot of storage space, which means a lot of your personal behavior has been recorded. So how do you inspect an individual cookie?

Taking CSDN as an example, press F12 to open the developer tools and click the Storage tab. The stored cookies are listed on the left side of the panel.

Generally, a cookie is a small piece of text data of no more than about 4 KB, consisting of a name (Name), a value (Value), and several optional attributes that control its validity period, security, and scope of use. Some servers set very elaborate cookies with many key fields, while others keep them very simple.
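To make the name, value, and optional attributes concrete, here is a small sketch that splits a Set-Cookie header value into its parts with Python's standard http.cookies module. The header string and the function name parse_set_cookie are made up for illustration, and the 4096-byte check is only a rough stand-in for the size limit mentioned above.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value: str) -> dict:
    """Split one Set-Cookie header value into name, value, and attributes."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    name, morsel = next(iter(cookie.items()))
    # Morsel lists every known attribute; keep only the ones that were set.
    attributes = {key: val for key, val in morsel.items() if val}
    return {
        "name": name,
        "value": morsel.value,
        "attributes": attributes,
        "within_4kb": len(header_value.encode("utf-8")) <= 4096,
    }


if __name__ == "__main__":
    # Hypothetical header, not taken from any real site
    sample = "session_id=abc123; Domain=example.com; Path=/; Max-Age=3600; Secure; HttpOnly"
    print(parse_set_cookie(sample))
```

Attribute keys come back in lower case, and flag attributes such as Secure and HttpOnly appear with the value True.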

3. Cookie attributes

Let's now look at the attributes stored with each cookie.

(1) Name/Value: The name identifies the cookie and the value holds its data. For authentication cookies, the value typically contains the session identifier or access token issued by the web server.

(2) Domain attribute: Specifies the website or domain that can access the cookie. Cookies do not follow the strict same-origin policy: a subdomain may set or read cookies of its parent domain. This is very useful for implementing single sign-on, but it also increases the attack surface; for example, attackers can exploit it to launch session fixation attacks. To limit the scope of such attacks, browsers refuse a Domain attribute that names a generic top-level domain such as .org or .com, or a second-level domain registered under a country or regional top-level domain (such as .co.uk).
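The Domain rules described above can be approximated as follows. This is a minimal sketch under stated assumptions: PUBLIC_SUFFIXES is a tiny hypothetical stand-in for the real Public Suffix List that browsers use, and can_set_domain is an illustrative name, not a library function.

```python
# Hypothetical, tiny stand-in for the real Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk", "com.cn"}


def can_set_domain(request_host: str, domain_attr: str) -> bool:
    """Decide whether a response from request_host may set Domain=domain_attr."""
    host = request_host.lower()
    domain = domain_attr.lower().lstrip(".")
    # Reject overly broad domains such as ".com" or ".co.uk".
    if domain in PUBLIC_SUFFIXES:
        return False
    # The host must be the domain itself or one of its subdomains.
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    print(can_set_domain("blog.example.com", "example.com"))  # parent domain
    print(can_set_domain("blog.example.com", ".com"))         # too broad
    print(can_set_domain("blog.example.com", "other.com"))    # unrelated site
```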

(3) Path attribute: Defines the URL path on the website under which the cookie can be accessed. Generally, a cookie such as csrfToken carries this attribute.
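How a browser decides whether a request path falls under a cookie's Path can be sketched like this. The logic follows the path-matching rule from the HTTP cookie specification (RFC 6265) as I understand it; the function name and the sample paths are illustrative.

```python
def path_matches(request_path: str, cookie_path: str) -> bool:
    """Return True if a cookie with cookie_path should be sent for request_path."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # The match must end on a path segment boundary, so a cookie for
        # "/account" is not sent to "/accounting".
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


if __name__ == "__main__":
    print(path_matches("/account/settings", "/account"))
    print(path_matches("/accounting", "/account"))
    print(path_matches("/blog", "/"))
```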

(4) Expires attribute: Sets the lifetime of the cookie. There are two storage types: session and persistent. When the Expires attribute is omitted, the cookie is a session cookie, which is kept only in the client's memory and expires when the user closes the browser. A persistent cookie is stored on the user's hard drive and becomes invalid only when its lifetime expires or when the user ends the session by clicking the "Logout" button on the page.
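The session versus persistent distinction is easy to check on cookies returned by Selenium, where each cookie is a dictionary and persistent cookies carry an expiry timestamp in seconds. The sketch below assumes that shape; the function name split_by_lifetime and the sample cookies are hypothetical.

```python
import time


def split_by_lifetime(cookies, now=None):
    """Group cookie names into session, persistent, and already expired."""
    now = time.time() if now is None else now
    groups = {"session": [], "persistent": [], "expired": []}
    for cookie in cookies:
        expiry = cookie.get("expiry")  # missing for session cookies
        if expiry is None:
            groups["session"].append(cookie["name"])
        elif expiry <= now:
            groups["expired"].append(cookie["name"])
        else:
            groups["persistent"].append(cookie["name"])
    return groups


if __name__ == "__main__":
    # Hypothetical cookies in the dictionary shape Selenium returns
    sample = [
        {"name": "session_id", "value": "abc"},
        {"name": "remember_me", "value": "1", "expiry": 1_900_000_000},
        {"name": "old_promo", "value": "x", "expiry": 1_000_000_000},
    ]
    print(split_by_lifetime(sample, now=1_700_000_000))
```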

(5) Secure attribute: Specifies that the cookie is sent only over HTTPS. HTTPS protects cookies from being stolen or tampered with in transit between the browser and the web server. It also authenticates the website: while establishing the HTTPS connection, the browser checks the validity of the site's certificate. However, for compatibility reasons (for example, some websites use self-signed certificates), the browser does not immediately terminate the connection when it detects an invalid certificate; it displays a security warning and the user can still choose to continue to the site. Because many users lack security awareness, they may still connect to sites forged through pharming attacks.

(6) HttpOnly attribute: Prevents client-side scripts from accessing the cookie through the document.cookie property, which helps protect it from being stolen or tampered with by cross-site scripting (XSS) attacks. HttpOnly still has limitations. Historically, some browsers blocked scripts from reading such cookies but still allowed them to be overwritten, and older browsers let scripts read the Set-Cookie response header through the XMLHttpRequest object; current browsers block that header from script access.
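On the server side, the Secure and HttpOnly flags are simply added to the Set-Cookie header. A minimal sketch with Python's standard http.cookies module might look like this; the cookie name auth_token and the function name build_auth_cookie are chosen only for illustration.

```python
from http.cookies import SimpleCookie


def build_auth_cookie(token: str) -> str:
    """Build a Set-Cookie header value with Path, Secure, and HttpOnly set."""
    cookie = SimpleCookie()
    cookie["auth_token"] = token             # hypothetical cookie name
    cookie["auth_token"]["path"] = "/"       # valid for the whole site
    cookie["auth_token"]["secure"] = True    # only sent over HTTPS
    cookie["auth_token"]["httponly"] = True  # hidden from document.cookie
    return cookie["auth_token"].OutputString()


if __name__ == "__main__":
    print("Set-Cookie:", build_auth_cookie("abc123"))
```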

These attributes accompany each cookie. Next, let's look at what common cookie names usually mean.

4. Cookie name

A cookie's name distinguishes it from other cookies on the same site and is usually chosen to reflect its purpose. The following are some common cookie names and their purposes:

Name                        Purpose
session_id / PHPSESSID      Used to identify the user's session. This type of cookie is typically used to maintain user status after logging in.
user_id / uid               Used to identify a specific user, possibly for tracking or personalization.
remember_me                 Usually related to the long-term login function, used to remember the user's login status.
token / auth_token          Used to store authentication tokens, typically used for API calls or maintaining login status.
preferences / settings      Saves user settings and preferences, such as interface themes, language settings, etc.
cart / shopping_cart        For e-commerce websites, used to track the contents of users' shopping carts.
analytics / tracking_id     Used for website analysis and user tracking, and may be used to count user access behavior.
csrftoken / XSRF-TOKEN      For cross-site request forgery (CSRF) protection.
ads / ad_id                 Advertising-related tracking to personalize ad display.
locale / language           Stores the user's language preference.
cookie_consent / consent    Records the user's consent to the use of cookies.
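When inspecting a long list of cookies, a lookup built from the table above gives a quick first guess at what each one is for. This is only a heuristic sketch: the category labels and the function name guess_purpose are my own, and real sites are free to name cookies however they like.

```python
# Lookup built from the common names listed in the table (lower-cased).
KNOWN_PURPOSES = {
    "session_id": "session", "phpsessid": "session",
    "user_id": "user identification", "uid": "user identification",
    "remember_me": "long-term login",
    "token": "authentication", "auth_token": "authentication",
    "preferences": "settings", "settings": "settings",
    "cart": "shopping cart", "shopping_cart": "shopping cart",
    "analytics": "analytics", "tracking_id": "analytics",
    "csrftoken": "CSRF protection", "xsrf-token": "CSRF protection",
    "ads": "advertising", "ad_id": "advertising",
    "locale": "language", "language": "language",
    "cookie_consent": "consent", "consent": "consent",
}


def guess_purpose(cookie_name: str) -> str:
    """Return a best-guess purpose for a cookie name, or 'unknown'."""
    return KNOWN_PURPOSES.get(cookie_name.lower(), "unknown")


if __name__ == "__main__":
    for name in ["PHPSESSID", "XSRF-TOKEN", "some_site_specific_cookie"]:
        print(name, "->", guess_purpose(name))
```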

The names in the table cover most of the common identifiers found in cookies. Many websites also set additional cookies for their own business logic or for anti-crawler and similar mechanisms. Next, let's use Python and Selenium to obtain the cookies of the current session.

5. Get cookies

There are many ways to obtain cookies. In the browser, JavaScript can access the cookies of the current page through the document.cookie property. On the server, cookies arrive in the HTTP request headers: in PHP they are available through the $_COOKIE global array, and in Node.js through the headers.cookie property of the HTTP request object. HTTP client libraries such as Python's Requests and Node.js's Axios can also read them. This section shows how to extract browser cookies with the browser automation tool Selenium. If you are not familiar with Selenium, I recommend reading my earlier post that introduces it in detail.

First, import the libraries:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

The cookies stored before and after login differ, so we can fetch them twice and see which values have changed. This time we fetch the cookies of the CSDN blog.

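def password_login(self):
    self.driver = webdriver.Firefox()
    self.driver.get("https://blog.csdn.net/")
    # Cookies before logging in
    cookiesBefore = self.driver.get_cookies()
    print("cookiesBefore:")
    print(cookiesBefore)
    time.sleep(2)
    # "登录" is the text of the "Log in" link on the page
    self.driver.find_element(By.LINK_TEXT, "登录").click()
    time.sleep(2)
    # Wait for the user to scan the QR code and finish logging in
    time.sleep(20)
    # Fetch the cookies again after logging in
    print("After login!")
    cookiesAfter = self.driver.get_cookies()
    print("cookiesAfter:")
    print(cookiesAfter)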
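Reading two long printed lists side by side is tedious, so a small helper can report what changed between the two snapshots. The sketch below assumes the list-of-dictionaries shape returned by get_cookies(); the function name diff_cookies and the sample cookies are hypothetical.

```python
def diff_cookies(before, after):
    """Compare two cookie snapshots by name and report what changed."""
    old = {cookie["name"]: cookie["value"] for cookie in before}
    new = {cookie["name"]: cookie["value"] for cookie in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in new.keys() & old.keys() if new[name] != old[name]
        ),
    }


if __name__ == "__main__":
    # Hypothetical snapshots; real values should be kept private
    before = [{"name": "locale", "value": "zh"}, {"name": "visitor", "value": "v1"}]
    after = [
        {"name": "locale", "value": "zh"},
        {"name": "visitor", "value": "v2"},
        {"name": "auth_token", "value": "secret"},
    ]
    print(diff_cookies(before, after))
```

Inside password_login, the same helper could be called as diff_cookies(cookiesBefore, cookiesAfter) once both snapshots exist.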

You can run it yourself; because cookies are private data, the output is not shown here.

Follow me so you don't lose track of this column. If you find any mistakes, please leave a comment to point them out; I would be very grateful.

That’s all for this issue. My name is fanstuck. If you have any questions, feel free to leave a message for discussion. See you in the next issue.

Origin blog.csdn.net/master_hunter/article/details/135291351