Learn Python from scratch (13): Automation and packet capture for crawler engineers

Foreword

Looking back, this series has already covered the Python essentials: programming grammar, network programming, multi-threading/multi-processing/coroutines, database programming (MySQL, Redis, MongoDB), machine learning, full-stack development, and data analysis. If you have not read the earlier articles before starting on crawler data collection, there is no need to scroll back through the blog; the whole series is organized here:

1. Learn Python from scratch with me (1) Compulsory programming grammar
2. Learn Python from scratch with me (2) Network programming
3. Learn Python from scratch with me (3) Multi-threading/multi-processing/coroutines
4. Learn Python from scratch with me (4) Database programming: MySQL database
5. Learn Python from scratch with me (5) Database programming: Redis database
6. Learn Python from scratch with me (6) Database programming: MongoDB database
7. Learn Python from scratch with me (7) Machine learning
8. Learn Python from scratch with me (8) Full stack development
9. Learn Python from scratch with me (9) Data analysis
10. Learn Python from scratch with me (10) Getting started with Hadoop from scratch
11. Learn Python from scratch with me (11) A brief introduction to Spark
12. Learn Python from scratch with me (12) How to become an excellent crawler engineer

This series of articles follows the learning roadmap below; because the material is extensive, it has been split into separate articles:

Roadmap: learn Python from scratch through to advanced topics

Python resources suitable for beginners and advanced learners:

① Tencent-certified Python complete practical project tutorial notes (PDF)
② Python interview question collections from a dozen major companies (PDF)
③ A full set of Python video tutorials (from zero basics to advanced JS reverse engineering)
④ Hundreds of practical projects with source code and notes
⑤ Complete projects and documents covering programming grammar, machine learning, full-stack development, data analysis, crawlers, APP reverse engineering, and more

Automation and packet capture topics

1. Collecting data with Selenium

I have previously used Selenium to grab concert tickets; if you are interested, see the earlier article on automating concert ticket grabbing with Selenium.

Selenium is a popular automated web testing tool that can simulate the behavior of human users in a browser and is suitable for building automated crawlers. This topic will introduce how to use Selenium for automated crawling, and explain how to obtain web page data through packet capture.

1. Introduction to Selenium

Selenium is an open-source automated testing tool that simulates the actions of a human user in the browser, such as clicking links, filling in forms, and submitting data. It provides bindings for multiple programming languages, including Python, Java, and JavaScript. For crawler engineers, Selenium makes it easy to build automated crawlers that mimic a real user's browser behavior in order to collect data.

2. Selenium installation

Selenium can be installed via pip, using the following command:

pip install selenium

You also need to download the driver that matches your browser; for example, the Chrome browser requires ChromeDriver, which can be downloaded from:

https://sites.google.com/a/chromium.org/chromedriver/downloads

After the download is complete, place ChromeDriver in a directory on the system's PATH environment variable so that your code can use it to start the Chrome browser.
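
As a quick sanity check, here is a minimal sketch, assuming ChromeDriver is already on the PATH (the page visited is just an example); with the driver on the PATH, no explicit path needs to be passed:

from selenium import webdriver

# with chromedriver on the PATH, Selenium locates it automatically
driver = webdriver.Chrome()
driver.get('https://www.baidu.com')   # open any page to confirm the driver works
print(driver.title)                   # print the page title as a simple check
driver.quit()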

3. Basic use of Selenium

To use Selenium for automated crawling, the following steps are required:

1. Import the selenium library:

When writing code in Python, you first need to import the selenium library.

from selenium import webdriver

2. Initialize the browser:

Selenium supports a variety of browsers, including Chrome, Firefox, and more. You need to initialize the browser through the API provided by Selenium and specify the location of the browser driver.

driver = webdriver.Chrome('D:/chromedriver.exe')

3. Open the web page:

Use Selenium's get method to open a web page.

driver.get('http://www.baidu.com')

4. Operate page elements:

Use Selenium to simulate user actions in a browser by locating page elements. Examples include clicking a link, filling out a form, submitting data, and so on.

driver.find_element_by_id('kw').send_keys('Python')   # enter the search keyword
driver.find_element_by_id('su').click()               # click the search button

5. Get data:

Use Selenium's API to get the data of the page, such as getting the HTML source code of the page, screenshots, etc.

html = driver.page_source             # get the HTML source of the page
driver.save_screenshot('baidu.png')   # take a screenshot and save it

6. Close the browser:

When all operations are completed, the browser needs to be closed.

driver.quit()

4. Advanced usage of Selenium

In addition to basic usage, Selenium also provides a variety of advanced usage, such as using a proxy, using a headless browser, and more.

1. Use a proxy:

When performing automated crawling, sometimes it is necessary to use a proxy. In this case, the proxy can be set through Selenium.

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--proxy-server=http://127.0.0.1:8080')    # set the proxy
driver = webdriver.Chrome('D:/chromedriver.exe', chrome_options=chrome_options)

2. Use a headless browser:

Some websites detect whether they are being visited by a real browser. In such cases, a headless browser can be used to simulate real browser behavior without opening a visible window.

chrome_options = Options()
chrome_options.add_argument('--headless')    # run the browser in headless mode
driver = webdriver.Chrome('D:/chromedriver.exe', chrome_options=chrome_options)

3. Use waits:

When performing automated crawling, you often need to wait for the page to finish loading before carrying out subsequent operations. Selenium's wait utilities can handle this for you.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('http://www.baidu.com')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'kw'))
)    # wait until the element with id 'kw' has loaded
element.send_keys('Python')
driver.quit()

5. Capture packets to get data

In addition to using Selenium for automated crawling, you can also use packet capture to obtain web page data. Packet capture can help us analyze web page requests and responses and obtain the required data.

Commonly used packet capture tools include Fiddler and Charles. Here we take Fiddler as an example to show how to capture packets.

1. Download Fiddler:

Fiddler can be downloaded from the official website; the download link is:
https://www.telerik.com/download/fiddler/fiddler4

After the download is complete, install Fiddler and start Fiddler.

2. Set the proxy:

When capturing packets, you need to set the browser's proxy to Fiddler's proxy. In the Fiddler interface, find Tools -> Options -> Connections, check "Allow remote computers to connect", and write down the IP address and port number of Fiddler.

Open the proxy setting interface in the browser, and set the IP address and port number of the proxy server to the address and port number of Fiddler. For example, in the Chrome browser, the proxy setting interface can be opened through the following address:

chrome://settings/system

Check "Use a proxy server", and set "Address" and "Port" to Fiddler's address and port number respectively.

3. Capture packets to obtain data:

Open the browser and visit the webpage that needs to be crawled. In the Fiddler interface, you can see the communication data between the browser and the website, which includes request and response data. By analyzing these data, the required data can be found and extracted.
The data obtained through packet capture is usually in JSON or XML format. You can use Python's built-in libraries or third-party libraries to process these data to obtain the required content.
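
For example, if a captured response body turns out to be JSON, Python's built-in json module is enough to pull out the fields you need. The following is a minimal sketch; the response body and field names are made up for illustration:

import json

# response_body stands in for a response body copied out of Fiddler (hypothetical example)
response_body = '{"code": 0, "data": {"title": "Example", "items": [1, 2, 3]}}'

data = json.loads(response_body)   # parse the JSON string into a Python dict
print(data['data']['title'])       # extract the fields you are interested in
print(data['data']['items'])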

Summary

When performing automated crawling, using Selenium can easily simulate the operations of human users to obtain data. At the same time, by capturing packets, you can analyze web page requests and responses, so as to obtain the required data. It should be noted that when performing automated crawling and packet capture, it is necessary to abide by relevant laws and regulations, and illegal operations are not allowed.

2. Collecting data with Pyppeteer

Pyppeteer is a Python library based on the Chrome DevTools Protocol that can drive Headless Chrome to automate browser operations and web crawling. This section describes in detail how to use Pyppeteer for crawling and automation tasks.

Pyppeteer can be installed via pip:

pip install pyppeteer

After the installation is complete, you can create a new Python script, import the pyppeteer library and create an asynchronous browser object:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    # the main program logic goes here

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Once you have the browser object, you can start performing various browser operations. Here are some examples of common operations:

1. Open the specified URL:

await page.goto('https://www.example.com')

2. Get the HTML content of the page:

html = await page.evaluate('document.documentElement.outerHTML')

3. Fill in a form and submit it:

await page.type('#username', 'my_username')
await page.type('#password', 'my_password')
await page.click('#submit')

4. Wait for the page to load:

await page.waitForNavigation()

5. Take a screenshot and save a snapshot of the page:

await page.screenshot({'path': 'screenshot.png'})

When web scraping, sometimes it is necessary to simulate user interaction, such as clicking a button, scrolling a page, getting the text content of a specific element, and so on. Pyppeteer also provides a rich API to realize these functions. For example, to click a button on a page:

await page.click('#my_button')

To get the text content of a specific element:

element = await page.querySelector('#my_element')
text_content = await page.evaluate('(element) => element.textContent', element)

Pyppeteer also supports advanced functions such as handling file uploads, handling JavaScript popups, and executing custom JavaScript code. For specific usage methods, please refer to the official documentation of Pyppeteer.
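
As an illustration of executing custom JavaScript and handling a popup, here is a hedged sketch that reuses the page object and the asyncio import from the earlier example; the scroll and the dialog handling are generic examples rather than part of any particular site:

# scroll the page and read the document title via custom JavaScript
await page.evaluate('() => window.scrollBy(0, document.body.scrollHeight)')
title = await page.evaluate('() => document.title')
print(title)

# automatically accept any JavaScript alert/confirm dialog that appears
page.on('dialog', lambda dialog: asyncio.ensure_future(dialog.accept()))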

At the same time, since Pyppeteer is developed based on the Chrome DevTools Protocol, it is possible to analyze the browser's network request through the packet capture tool, so as to obtain the request and response information of the web page. A commonly used packet capture tool is Fiddler, which can be used to monitor the communication between the browser and the website, and view the details of the request and response.

Pyppeteer provides functionality to work with packet capture tools such as Fiddler. Communication between Pyppeteer and Fiddler can be set up by calling the from_browser_wse method of the pyppeteer.connection.Connection class and passing in Fiddler's WebSocket proxy address.

import pyppeteer.connection

pyppeteer.connection.Connection.from_browser_wse(f'ws://127.0.0.1:8866/devtools/browser/{browser.browserWSEndpoint.lstrip("ws://")}')

In this way, Fiddler can view the communication between Pyppeteer and the browser, and capture request and response data.
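
A simpler way to let Fiddler (or any HTTP proxy) see Pyppeteer's traffic is to launch Chromium with a proxy argument. The following is a minimal sketch under the assumption that Fiddler is listening on its default port 8888; for HTTPS you would also need to trust Fiddler's root certificate or ignore certificate errors:

import asyncio
from pyppeteer import launch

async def main():
    # route all browser traffic through the local Fiddler proxy (8888 is Fiddler's default port)
    browser = await launch(args=['--proxy-server=http://127.0.0.1:8888',
                                 '--ignore-certificate-errors'])
    page = await browser.newPage()
    await page.goto('https://www.example.com')
    print(await page.title())
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())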

To summarize, Pyppeteer is a handy Python library for automating browser operations and web scraping. It provides a rich API that can simulate user interactions, handle advanced features such as file uploads and JavaScript popup windows, and, combined with packet capture tools, lets you analyze network request and response data, making it easier to access web page content and carry out further processing.

Pyppeteer asynchronous requests: a practical example

When using Pyppeteer for asynchronous requests, you can use Python's asynchronous programming features (such as asyncio and await) to build efficient automation and packet capture workflows. In this walkthrough, we explain in detail how to use Pyppeteer to make asynchronous requests and provide a sample project for reference.

1. Install Pyppeteer:
First, make sure you have Python and pip installed. Then run the following command on the command line to install Pyppeteer:

pip install pyppeteer

2. Import required modules and libraries:
Create a new Python script and import required modules and libraries:

import asyncio
from pyppeteer import launch

3. Define an asynchronous function:
Create an asynchronous function in the script to perform the asynchronous request:

async def fetch_data(url):
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()
    await browser.close()
    return content

In this example, we define an asynchronous function called fetch_data that takes a URL parameter and uses Pyppeteer to open and render that page. We use the await keyword to wait for the page to finish loading and rendering, and call the page.content() method to get the page's content. Finally, we close the browser and return the content of the page.

4. Call the asynchronous function:
Create a main function and call the asynchronous function in it:

async def main():
    url = 'https://example.com'
    content = await fetch_data(url)
    print(content)

asyncio.get_event_loop().run_until_complete(main())

In this example, we define an asynchronous function named main, set the URL to be requested in it, and call the fetch_data function through the await keyword to get the page content. Finally, we use the print statement to print the content on the console.

5. Run the script:

Run the script on the command line, and you will see the fetched page content printed to the console.

This is a simple walkthrough of using Pyppeteer for asynchronous requests. By making flexible use of Pyppeteer's asynchronous features, you can build more complex and efficient automation tasks and packet capture workflows, and you can extend and optimize this example according to your specific needs. Note that Pyppeteer also provides other functions and APIs, such as simulating user interaction, processing page elements, and executing JavaScript code; in real projects you can use these features to complete more tasks as needed.
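
For instance, building on the fetch_data function above, several pages can be fetched concurrently with asyncio.gather. This is a minimal sketch with placeholder URLs; note that each call to fetch_data launches its own browser, so for large jobs you would want to reuse a single browser instance:

async def main():
    urls = ['https://example.com', 'https://example.org', 'https://example.net']
    # run the fetch_data coroutines concurrently instead of one after another
    pages = await asyncio.gather(*(fetch_data(url) for url in urls))
    for url, content in zip(urls, pages):
        print(url, len(content))   # print the length of each fetched page

asyncio.get_event_loop().run_until_complete(main())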

3. Packet capture with Charles

Charles is a popular network packet capture tool that is widely used by PC-side crawler engineers for automation and packet capture tasks. It can intercept network request and response data to help developers analyze and debug network communication. The following is a detailed walkthrough of capturing packets with Charles:

1. Download and install Charles:

First, you need to download and install Charles from the official website. The official website address is: https://www.charlesproxy.com/

2. Configure the network proxy:

Once installed, open Charles and configure the web proxy. Select "Proxy" from the Charles menu bar, then click "Proxy Settings". In the window that pops up, select the "HTTP Proxy" option, set the port to 8888 (the default port), and make sure the "Enable macOS proxy" or "Enable Windows proxy" option is checked. Save the settings and close the window.

3. SSL proxy configuration (optional):

If you need to fetch data from HTTPS requests, you also need to configure an SSL proxy in Charles. From the Charles menu bar, select "Help", then click "SSL Proxying", and select "Install Charles Root Certificate". Depending on the operating system, you may need to enter an administrator password to complete the certificate installation. Once complete, Charles can parse and display the data from the HTTPS request.

4. Start the proxy:

After the configuration is complete, click the "Proxy" menu of Charles and turn on the "Proxy" option, which means that the proxy function of Charles has been activated.

5. Mobile proxy settings (optional):

If you need to grab the request data from the mobile phone, you also need to set the proxy to Charles's IP address and port in the mobile phone settings. The IP address is usually the IP address of the computer running Charles, on port 8888 (the default port). Open the settings on the phone, enter the Wi-Fi or mobile network option, find the currently connected network, click to enter the advanced settings, and then set the proxy to the IP address and port of Charles.

6. Start capturing packets:

In Charles, you will see a list of request and response data. To start capturing packets, you can click on the "Record" button in the upper left of the Charles window. When clicked, Charles will start logging network request and response data.

7. Analyze the packet capture results:

Charles displays network requests and responses in its window as they occur. You can click on each request to view its details, including request headers, response headers, the request body, the response body, and so on.

8. Filter and locate requests:

If your packet capture results are very large, you can use Charles' filtering function to find specific requests. You can filter by URL, domain name, request method, and so on, making it easier to locate the requests you are interested in.

9. Modify the request and response:

Charles also allows you to modify request and response data for debugging and testing purposes. You can modify the request and response data for a particular request by right-clicking on it and selecting "Edit".

10. Save and export the packet capture results:

Charles allows you to save the packet capture results of the entire session and export them as HAR files or other formats for further analysis and processing.
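
Since HAR is simply JSON, an exported session can be post-processed with a few lines of Python. This is a minimal sketch; the file name session.har is only an example:

import json

# load a HAR file exported from Charles (the file name is an example)
with open('session.har', 'r', encoding='utf-8') as f:
    har = json.load(f)

# list the URL and status code of every captured request
for entry in har['log']['entries']:
    print(entry['request']['url'], entry['response']['status'])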

Through the above steps, you can start using Charles to capture packets and analyze network request and response data. It provides a rich set of tools and functions that help PC-side crawler engineers with automation tasks and network debugging. Please note that it is important to use Charles legally and to respect each site's Terms of Use and Privacy Policy.

Replacing CSS and JS files with Charles

In the PC-side crawler project, Charles is a very powerful packet capture tool that can help developers perform operations such as capturing, modifying, and replaying network requests. In the process of packet capture, sometimes we need to replace CSS and JS files in order to perform some customized operations in analysis and testing. Here are the detailed steps:

Step 1: Install and configure Charles

  • Download and install Charles. You can download the version suitable for your operating system from the official website (https://www.charlesproxy.com/), and install it according to the installation wizard.
  • Configure browser proxy. After starting Charles, configure the proxy in the browser, set the proxy to the IP address and port where Charles is located (the default port is 8888). This ensures that all browser requests are man-in-the-middle proxied through Charles.

Step 2: Set up replacement rules

  • After opening the main interface of Charles, click the "Tools" menu and select "Map Local...".
  • In the "Map Local" dialog box, click the "Add" button to add a new mapping rule.
  • In the mapping rule, fill in the path of the remote URL and local file to be mapped. The remote URL refers to the network path of the CSS or JS file to be replaced, and the local file path refers to the local file path you intend to replace. Click the "Browse" button to select a local file.

Step 3: Apply changes and verify

  • Make sure you have completed all replacement rules and saved them.
  • Visit the webpage containing the replaced file in your browser. Charles will proxy the request and return the replaced local file.
  • In the main interface of Charles, you can click the "Sequence" tab to view the ongoing request. Find the corresponding request and save the response file by right-clicking and selecting "Save Response...".
  • Check the webpage to confirm that the CSS or JS files have been successfully replaced. You can verify the result of the replacement by viewing the source code, developer tools, or the changes in web page performance.

It should be noted that Charles will only work if the request is redirected to the proxy, so you may need to configure the proxy in the browser to ensure that the request can pass through Charles.

Using Charles to replace CSS and JS files can help you perform some customized operations during development and testing, such as replacing specific CSS styles or JS codes, so as to achieve some specific testing requirements.

4. mitmproxy

mitmproxy scripting: data interception and proxy responses

When it comes to automation and packet capture for PC-side crawler engineers, mitmproxy is a commonly used tool. It acts as a man-in-the-middle proxy server that can intercept and modify the communication data between the client and the server. On top of that, you can use mitmproxy's scripting capabilities to automate data interception and proxy responses.

Here are some basic steps to script mitmproxy to automate data interception and proxy responses:

1. Install mitmproxy: download and install mitmproxy from the official website (https://mitmproxy.org/), choosing the version appropriate for your operating system.

2. Create a script file: use a text editor to create a Python script file, e.g. mitmproxy_script.py.

3. Import the necessary modules: your script needs to import the relevant mitmproxy modules, such as ctx, http, and mitmproxy.net, which can be imported with code similar to the following:

from mitmproxy import ctx, http
from mitmproxy.net import encoding

4. Write the script logic: in the script file, you can define various callback functions to process requests and responses. The following are the most commonly used ones:

  • request(flow: http.HTTPFlow): called for each request flow; you can access and modify the request's information in this function.
  • response(flow: http.HTTPFlow): called for each response flow; you can access and modify the response's information in this function.

5. Implement the data interception and proxy response logic: in the corresponding callback functions, write the code that intercepts data and proxies responses. Here are some sample code snippets:

def request(flow: http.HTTPFlow):
    # check whether the request matches the condition
    if 'example.com' in flow.request.host:
        # get the request data
        request_data = flow.request.content
        # process it as needed
        ...

def response(flow: http.HTTPFlow):
    # check whether the response matches the condition
    if 'example.com' in flow.request.host:
        # get the response data
        response_data = flow.response.content
        # process it as needed
        ...

In the above code snippet, you can use the flow.request and flow.response objects to obtain request and response related information, such as URL, request header, request body, response status code, response header, and response body, etc.
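
To illustrate the "proxy response" side, here is a hedged sketch of a response hook that rewrites the returned body before it reaches the browser; the host check and the replacement text are invented for the example:

def response(flow: http.HTTPFlow):
    # only touch responses from the target host (example condition)
    if 'example.com' in flow.request.host:
        # modify part of the returned body before it reaches the client
        body = flow.response.get_text()
        flow.response.set_text(body.replace('Example Domain', 'Intercepted by mitmproxy'))
        # headers and the status code can be overridden here as well
        flow.response.headers['X-Proxied'] = 'true'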

6. Run the mitmproxy script: from the command line, change into the directory containing the script file and run the following command to start mitmproxy and load your script:

mitmproxy -s mitmproxy_script.py

This command will start mitmproxy, passing it your script as an argument. mitmproxy will start listening to network traffic and call the corresponding callback function on request and response.

Through the above steps, you can write mitmproxy scripts to automate data interception and proxy responses. You can further customize and optimize script logic according to specific needs and scenarios.

