Asynchronous File Downloads in Python: Concurrency, Progress Bars, Logging, Proxies, and Integrity Verification

Downloading Files Asynchronously

To download files asynchronously in Python, you can use the aiohttp library. First, make sure aiohttp is installed:
pip install aiohttp

Next, create a Python script that uses aiohttp to download a file asynchronously:

import aiohttp
import asyncio
import os

async def download_file(url, save_path):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            with open(save_path, 'wb') as f:
                while True:
                    chunk = await response.content.read(1024)
                    if not chunk:
                        break
                    f.write(chunk)
    return save_path

async def main():
    url = 'https://example.com/path/to/your/file'
    save_path = os.path.join(os.getcwd(), 'downloaded_file.ext')
    downloaded_file = await download_file(url, save_path)
    print(f'File downloaded to: {downloaded_file}')

if __name__ == '__main__':
    asyncio.run(main())

In this example, replace the value of the url variable with the URL of the file you want to download; the save_path variable defines where the downloaded file is saved. The download_file() function downloads the file asynchronously with aiohttp and writes it to the given path, and main() is an async function that runs download_file().

Note that this example uses asyncio.run(), which requires Python 3.7 or later. On earlier Python versions, you would start the event loop with asyncio.get_event_loop().run_until_complete() instead.
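A minimal sketch of that legacy pattern (the coroutine here is just a stand-in for the downloader's main(); new_event_loop() is used so the snippet also runs on current Pythons, where calling get_event_loop() outside a running loop is deprecated):

```python
import asyncio

async def main():
    # Stand-in coroutine; in the downloader this would await download_file().
    return 'done'

# Legacy pattern for Python < 3.7, where asyncio.run() does not exist:
# create a loop, drive the coroutine to completion, then close the loop.
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()

print(result)
```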

Concurrent Asynchronous Downloads

If you want to download multiple files concurrently, you can use asyncio.gather(). The following example downloads several files at the same time with aiohttp and asyncio.gather():

import aiohttp
import asyncio
import os

async def download_file(url, save_path):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            with open(save_path, 'wb') as f:
                while True:
                    chunk = await response.content.read(1024)
                    if not chunk:
                        break
                    f.write(chunk)
    return save_path

async def main():
    files_to_download = [
        {
            'url': 'https://example.com/path/to/your/file1',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file1.ext')
        },
        {
            'url': 'https://example.com/path/to/your/file2',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file2.ext')
        },
        # more files...
    ]

    tasks = [
        download_file(file['url'], file['save_path']) for file in files_to_download
    ]

    downloaded_files = await asyncio.gather(*tasks)
    for downloaded_file in downloaded_files:
        print(f'File downloaded to: {downloaded_file}')

if __name__ == '__main__':
    asyncio.run(main())

In this example, we build a files_to_download list containing the URL and save path for each file. We then create a download_file() coroutine for each entry, collect them in the tasks list, and finally run all the download tasks concurrently with asyncio.gather().

You can, of course, add more files to the files_to_download list as needed. This version starts every download at once, maximizing throughput.

Limiting the Number of Concurrent Downloads

If you want to limit how many files are downloaded at the same time, you can use asyncio.Semaphore. The following example shows how to set a concurrency limit while downloading multiple files asynchronously:

import aiohttp
import asyncio
import os

# Maximum number of files to download at once
concurrent_downloads = 3
semaphore = asyncio.Semaphore(concurrent_downloads)

async def download_file(url, save_path):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                with open(save_path, 'wb') as f:
                    while True:
                        chunk = await response.content.read(1024)
                        if not chunk:
                            break
                        f.write(chunk)
        return save_path

async def main():
    files_to_download = [
        {
            'url': 'https://example.com/path/to/your/file1',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file1.ext')
        },
        {
            'url': 'https://example.com/path/to/your/file2',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file2.ext')
        },
        # more files...
    ]

    tasks = [
        download_file(file['url'], file['save_path']) for file in files_to_download
    ]

    downloaded_files = await asyncio.gather(*tasks)
    for downloaded_file in downloaded_files:
        print(f'File downloaded to: {downloaded_file}')

if __name__ == '__main__':
    asyncio.run(main())

In this example, we first define a concurrent_downloads variable that sets the maximum number of simultaneous downloads, then create an asyncio.Semaphore instance to enforce that limit on concurrent tasks.

Inside the download_file function, async with semaphore: acquires the semaphore. Once concurrent_downloads coroutines hold it, any further coroutine that tries to acquire it blocks until another coroutine releases it. This caps the number of downloads running at once and keeps excessive concurrency from exhausting system resources.
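The effect is easy to see in isolation. This small, self-contained sketch (the worker and demo names are illustrative, not part of the downloader) runs ten coroutines through a Semaphore(3) and records the peak number running at once:

```python
import asyncio

async def worker(sem, state):
    async with sem:                      # blocks while 3 workers hold the semaphore
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)        # simulate I/O work
        state['active'] -= 1

async def demo():
    sem = asyncio.Semaphore(3)
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(10)))
    return state['peak']

print(asyncio.run(demo()))  # peak concurrency never exceeds 3
```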

The rest of the code is the same as in the previous example; adjust the value of concurrent_downloads as needed to control how many files are downloaded at once.

Displaying a Progress Bar During Downloads

If you want to display a progress bar while downloading, you can use the tqdm library. First, make sure tqdm is installed:

pip install tqdm

Next, combine asyncio, aiohttp, and tqdm to download files asynchronously while displaying a progress bar:

import aiohttp
import asyncio
import os
from tqdm.asyncio import tqdm

# Maximum number of files to download at once
concurrent_downloads = 3
semaphore = asyncio.Semaphore(concurrent_downloads)

async def download_file(url, save_path):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                total_size = int(response.headers.get('Content-Length', 0))
                with open(save_path, 'wb') as f:
                    with tqdm(total=total_size, desc=save_path, unit='B', unit_scale=True) as pbar:
                        while True:
                            chunk = await response.content.read(1024)
                            if not chunk:
                                break
                            f.write(chunk)
                            pbar.update(len(chunk))
        return save_path

async def main():
    files_to_download = [
        {
            'url': 'https://example.com/path/to/your/file1',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file1.ext')
        },
        {
            'url': 'https://example.com/path/to/your/file2',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file2.ext')
        },
        # more files...
    ]

    tasks = [
        download_file(file['url'], file['save_path']) for file in files_to_download
    ]

    downloaded_files = await asyncio.gather(*tasks)
    for downloaded_file in downloaded_files:
        print(f'File downloaded to: {downloaded_file}')

if __name__ == '__main__':
    asyncio.run(main())

In this example, we import the tqdm.asyncio module and create a progress bar with the tqdm context manager inside the download_file function. The bar's total is taken from the Content-Length response header, and each time a chunk is written we advance it with pbar.update(len(chunk)).

While the download runs, you will see a live progress bar in the console for each file. This is especially useful for large files or long-running downloads, since it gives an at-a-glance view of progress.

Error Handling

Next, we add exception handling to make the program more robust. When something goes wrong during a download, this provides detailed information about the error.

import aiohttp
import asyncio
import os
from tqdm.asyncio import tqdm

# Maximum number of files to download at once
concurrent_downloads = 3
semaphore = asyncio.Semaphore(concurrent_downloads)

async def download_file(url, save_path):
    try:
        async with semaphore:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    if response.status != 200:
                        raise Exception(f"Failed to download file: {url}, status: {response.status}")
                    total_size = int(response.headers.get('Content-Length', 0))
                    with open(save_path, 'wb') as f:
                        with tqdm(total=total_size, desc=save_path, unit='B', unit_scale=True) as pbar:
                            while True:
                                chunk = await response.content.read(1024)
                                if not chunk:
                                    break
                                f.write(chunk)
                                pbar.update(len(chunk))
            return save_path
    except Exception as e:
        print(f"Error downloading {url}: {e}")
        return None

async def main():
    files_to_download = [
        {
            'url': 'https://example.com/path/to/your/file1',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file1.ext')
        },
        {
            'url': 'https://example.com/path/to/your/file2',
            'save_path': os.path.join(os.getcwd(), 'downloaded_file2.ext')
        },
        # more files...
    ]

    tasks = [
        download_file(file['url'], file['save_path']) for file in files_to_download
    ]

    downloaded_files = await asyncio.gather(*tasks, return_exceptions=True)
    for downloaded_file in downloaded_files:
        if downloaded_file is not None:
            print(f'File downloaded to: {downloaded_file}')
        else:
            print('File failed to download.')

if __name__ == '__main__':
    asyncio.run(main())

In this example, we modified the download_file function by wrapping its body in a try-except statement: if an exception occurs during the download, it is caught and an error message is printed. We also pass return_exceptions=True to asyncio.gather() so that a single failure cannot abort the whole program. When download_file returns None, the loop prints a message indicating that the file failed to download.
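The difference return_exceptions makes can be shown with a tiny standalone example (the coroutine names here are illustrative):

```python
import asyncio

async def ok():
    return 'ok'

async def boom():
    raise ValueError('boom')

async def demo():
    # With return_exceptions=True, the failed task's exception is returned
    # as a result in the list instead of propagating out of gather().
    return await asyncio.gather(ok(), boom(), return_exceptions=True)

results = asyncio.run(demo())
print(results)  # ['ok', ValueError('boom')]
```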

By adding exception handling, you make the program more stable when problems occur and get detailed error information, which makes debugging and fixing issues easier.

Command-Line Arguments

If you want to read the list of files to download and related settings from the command line, you can use Python's argparse library. The following example shows how to integrate argparse into our async file downloader:

import aiohttp
import asyncio
import os
import argparse
from tqdm.asyncio import tqdm

def parse_args():
    parser = argparse.ArgumentParser(description="Async file downloader")
    parser.add_argument("-f", "--file", nargs=2, action="append", metavar=("URL", "SAVE_PATH"),
                        help="The URL of the file to download and the path where it should be saved")
    parser.add_argument("-c", "--concurrency", type=int, default=3,
                        help="Maximum number of concurrent downloads (default: 3)")

    args = parser.parse_args()

    if args.file is None:
        parser.error("At least one file is required for downloading")

    return args

async def download_file(url, save_path):
    try:
        async with semaphore:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    if response.status != 200:
                        raise Exception(f"Failed to download file: {url}, status: {response.status}")
                    total_size = int(response.headers.get('Content-Length', 0))
                    with open(save_path, 'wb') as f:
                        with tqdm(total=total_size, desc=save_path, unit='B', unit_scale=True) as pbar:
                            while True:
                                chunk = await response.content.read(1024)
                                if not chunk:
                                    break
                                f.write(chunk)
                                pbar.update(len(chunk))
            return save_path
    except Exception as e:
        print(f"Error downloading {url}: {e}")
        return None

async def main(args):
    global semaphore
    semaphore = asyncio.Semaphore(args.concurrency)

    tasks = [
        download_file(file[0], file[1]) for file in args.file
    ]

    downloaded_files = await asyncio.gather(*tasks, return_exceptions=True)
    for downloaded_file in downloaded_files:
        if downloaded_file is not None:
            print(f'File downloaded to: {downloaded_file}')
        else:
            print('File failed to download.')

if __name__ == '__main__':
    args = parse_args()
    asyncio.run(main(args))

In this example, we added a parse_args function that parses the command-line arguments. There are two options: -f or --file, which takes a URL and a save path as arguments, and -c or --concurrency, an optional setting for the maximum number of concurrent downloads (default 3).

We also updated main to accept an args parameter, which the entry point passes in. You can now run the program from the command line, supplying the list of files to download and the concurrency setting. For example:

python downloader.py -f https://example.com/file1.ext /path/to/save/file1.ext \
                     -f https://example.com/file2.ext /path/to/save/file2.ext \
                     -c 5

This downloads the two files from the given URLs to the given save paths, allowing up to 5 concurrent downloads. With argparse, you can configure and launch download jobs directly from the command line, making the program more flexible and customizable.

Logging

Next, we add logging so that details and errors are recorded during the download.

Here is an example that integrates logging using the logging library:

import aiohttp
import asyncio
import os
import argparse
import logging
from tqdm.asyncio import tqdm

def setup_logging():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def parse_args():
    parser = argparse.ArgumentParser(description="Async file downloader")
    parser.add_argument("-f", "--file", nargs=2, action="append", metavar=("URL", "SAVE_PATH"),
                        help="The URL of the file to download and the path where it should be saved")
    parser.add_argument("-c", "--concurrency", type=int, default=3,
                        help="Maximum number of concurrent downloads (default: 3)")

    args = parser.parse_args()

    if args.file is None:
        parser.error("At least one file is required for downloading")

    return args

async def download_file(url, save_path):
    try:
        async with semaphore:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    if response.status != 200:
                        raise Exception(f"Failed to download file: {url}, status: {response.status}")
                    total_size = int(response.headers.get('Content-Length', 0))
                    with open(save_path, 'wb') as f:
                        with tqdm(total=total_size, desc=save_path, unit='B', unit_scale=True) as pbar:
                            while True:
                                chunk = await response.content.read(1024)
                                if not chunk:
                                    break
                                f.write(chunk)
                                pbar.update(len(chunk))
            logging.info(f"File downloaded to: {save_path}")
            return save_path
    except Exception as e:
        logging.error(f"Error downloading {url}: {e}")
        return None

async def main(args):
    global semaphore
    semaphore = asyncio.Semaphore(args.concurrency)

    tasks = [
        download_file(file[0], file[1]) for file in args.file
    ]

    downloaded_files = await asyncio.gather(*tasks, return_exceptions=True)
    for downloaded_file in downloaded_files:
        if downloaded_file is None:
            logging.error("File failed to download.")

if __name__ == '__main__':
    setup_logging()
    args = parse_args()
    asyncio.run(main(args))

In this example, we first added a setup_logging function that applies the basic logging configuration: the log level is set to INFO, and a format is defined for log messages.

We also replaced the print statements with logging.info() and logging.error() so that details and errors are recorded while downloads run, which makes it easier to debug and analyze the program's behavior. If you would rather log to a file than to the console, update the basicConfig call. For example, to log to a file named downloader.log:

logging.basicConfig(filename="downloader.log", level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

This makes the async file downloader more robust and easier to debug, and gives you a detailed record of what the program did.
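If you want both at once, console output plus a log file, one common variant is to attach two handlers to the root logger instead of calling basicConfig; a sketch (this dual-handler setup is an addition, not part of the original program):

```python
import logging

def setup_logging(log_file="downloader.log"):
    # Variant of setup_logging: send every record to the console AND a file
    # by attaching two handlers with the same format to the root logger.
    fmt = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(fmt)
        root.addHandler(handler)

setup_logging()
logging.info("logging to console and downloader.log")
```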

Proxy Support

To refine the program further, we can add proxy support, which is useful when you need to bypass regional restrictions or meet stricter network-security requirements.

Here is an example that adds proxy support to the program:

import aiohttp
import asyncio
import os
import argparse
import logging
from tqdm.asyncio import tqdm

def setup_logging():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def parse_args():
    parser = argparse.ArgumentParser(description="Async file downloader")
    parser.add_argument("-f", "--file", nargs=2, action="append", metavar=("URL", "SAVE_PATH"),
                        help="The URL of the file to download and the path where it should be saved")
    parser.add_argument("-c", "--concurrency", type=int, default=3,
                        help="Maximum number of concurrent downloads (default: 3)")
    parser.add_argument("-p", "--proxy", type=str, help="Proxy URL to use for downloads")

    args = parser.parse_args()

    if args.file is None:
        parser.error("At least one file is required for downloading")

    return args

async def download_file(url, save_path, proxy=None):
    try:
        async with semaphore:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, proxy=proxy) as response:
                    if response.status != 200:
                        raise Exception(f"Failed to download file: {url}, status: {response.status}")
                    total_size = int(response.headers.get('Content-Length', 0))
                    with open(save_path, 'wb') as f:
                        with tqdm(total=total_size, desc=save_path, unit='B', unit_scale=True) as pbar:
                            while True:
                                chunk = await response.content.read(1024)
                                if not chunk:
                                    break
                                f.write(chunk)
                                pbar.update(len(chunk))
            logging.info(f"File downloaded to: {save_path}")
            return save_path
    except Exception as e:
        logging.error(f"Error downloading {url}: {e}")
        return None

async def main(args):
    global semaphore
    semaphore = asyncio.Semaphore(args.concurrency)

    tasks = [
        download_file(file[0], file[1], proxy=args.proxy) for file in args.file
    ]

    downloaded_files = await asyncio.gather(*tasks, return_exceptions=True)
    for downloaded_file in downloaded_files:
        if downloaded_file is None:
            logging.error("File failed to download.")

if __name__ == '__main__':
    setup_logging()
    args = parse_args()
    asyncio.run(main(args))

In this example, we added a new command-line option, -p or --proxy, that lets you specify a proxy URL. We also gave download_file a proxy parameter, which is passed through to session.get() via its proxy argument.

You can now specify a proxy URL on the command line so that downloads go through the proxy. For example:

python downloader.py -f https://example.com/file1.ext /path/to/save/file1.ext \
                     -f https://example.com/file2.ext /path/to/save/file2.ext \
                     -c 5 -p http://proxy.example.com:8080

This downloads the files through the specified proxy server. With proxy support, you can adapt the download process to particular network environments or security requirements.

Verifying File Integrity

Next, we add a feature that verifies each file's integrity once its download completes, using the file's hash. This example uses the SHA-256 algorithm, but you can choose another hash algorithm if needed.

Here is an example that adds file-hash verification:

import aiohttp
import asyncio
import os
import argparse
import logging
import hashlib
from tqdm.asyncio import tqdm

def setup_logging():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def parse_args():
    parser = argparse.ArgumentParser(description="Async file downloader")
    parser.add_argument("-f", "--file", nargs=3, action="append", metavar=("URL", "SAVE_PATH", "SHA256"),
                        help="The URL of the file to download, the path where it should be saved, and the SHA-256 hash")
    parser.add_argument("-c", "--concurrency", type=int, default=3,
                        help="Maximum number of concurrent downloads (default: 3)")
    parser.add_argument("-p", "--proxy", type=str, help="Proxy URL to use for downloads")

    args = parser.parse_args()

    if args.file is None:
        parser.error("At least one file is required for downloading")

    return args

def calculate_sha256(file_path):
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

async def download_file(url, save_path, sha256_hash, proxy=None):
    try:
        async with semaphore:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, proxy=proxy) as response:
                    if response.status != 200:
                        raise Exception(f"Failed to download file: {url}, status: {response.status}")
                    total_size = int(response.headers.get('Content-Length', 0))
                    with open(save_path, 'wb') as f:
                        with tqdm(total=total_size, desc=save_path, unit='B', unit_scale=True) as pbar:
                            while True:
                                chunk = await response.content.read(1024)
                                if not chunk:
                                    break
                                f.write(chunk)
                                pbar.update(len(chunk))
            downloaded_sha256 = calculate_sha256(save_path)
            if downloaded_sha256 == sha256_hash:
                logging.info(f"File downloaded and verified: {save_path}")
                return save_path
            else:
                logging.error(f"File verification failed: {save_path}")
                os.remove(save_path)
                return None
    except Exception as e:
        logging.error(f"Error downloading {url}: {e}")
        return None

async def main(args):
    global semaphore
    semaphore = asyncio.Semaphore(args.concurrency)

    tasks = [
        download_file(file[0], file[1], file[2], proxy=args.proxy) for file in args.file
    ]

    downloaded_files = await asyncio.gather(*tasks, return_exceptions=True)
    for downloaded_file in downloaded_files:
        if downloaded_file is None:
            logging.error("File failed to download.")

if __name__ == '__main__':
    setup_logging()
    args = parse_args()
    asyncio.run(main(args))

In this example, we first added a calculate_sha256 function that computes a file's SHA-256 hash. We also updated parse_args so that a SHA-256 hash must be supplied for each file: every -f option now takes three arguments, the URL, the save path, and the SHA-256 hash.

We then added the verification step to download_file. After the download finishes, we compute the downloaded file's SHA-256 hash and compare it with the expected value. If they match, we log that the file was downloaded and verified; if not, we log a verification failure and delete the downloaded file.

You can now supply each file's SHA-256 hash on the command line so that its integrity is verified after the download. For example:

python downloader.py -f https://example.com/file1.ext /path/to/save/file1.ext <file1_sha256> \
                     -f https://example.com/file2.ext /path/to/save/file2.ext <file2_sha256> \
                     -c 5 -p http://proxy.example.com:8080

By verifying file integrity, you can be confident that each downloaded file matches what you expected, improving the program's reliability and security.
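To obtain the file hash values in the first place, hash a trusted local copy of each file before publishing it. A minimal helper script (the expected_sha256 name is illustrative) that mirrors calculate_sha256 above:

```python
import hashlib
import sys

def expected_sha256(path, chunk_size=4096):
    # Incrementally hash the file so large files need not fit in memory,
    # matching the chunked approach used by calculate_sha256.
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python hash_file.py /path/to/trusted/copy.ext
    print(expected_sha256(sys.argv[1]))
```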

This walkthrough showed how to refine and extend an async file downloader step by step, adding command-line parsing, logging, proxy support, and integrity verification. You can keep adjusting the program to fit your own use cases and requirements.

Reposted from blog.csdn.net/lilongsy/article/details/129713657