Interview question: Python network programming, combining asynchronous I/O with multithreading (basics)

How to use asynchronous I/O and multithreading to make a crawler more efficient

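Fetch pages with asynchronous I/O: use the asyncio library to issue network requests concurrently. asyncio is built on an event loop: while one task waits for an I/O operation (such as a network request) to finish, the loop switches to another task, so a single thread is never left blocked on a pending response and I/O throughput improves.
Process the content with multiple threads: once the pages are downloaded, work such as HTML parsing is handed to worker threads. Parsing is largely CPU-bound, and in standard CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so threads alone give limited speedup across cores. They help most when the processing includes blocking calls or C extensions that release the GIL. For real multi-core parallelism, a process pool such as concurrent.futures.ProcessPoolExecutor is the usual alternative.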
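
To make the event-loop behaviour concrete, here is a minimal sketch that uses only the standard library. The function names `fake_fetch` and `fetch_concurrently` are hypothetical, and `asyncio.sleep` stands in for a real network request, so the timing only illustrates the idea: three 0.2-second waits overlap instead of adding up.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.2) -> str:
    # Hypothetical stand-in for a network request: asyncio.sleep yields
    # control to the event loop, just as awaiting a real response would.
    await asyncio.sleep(delay)
    return f"<html><title>{url}</title></html>"


async def fetch_concurrently(urls: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather schedules all coroutines at once; results keep the input order.
    pages = await asyncio.gather(*(fake_fetch(u) for u in urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = asyncio.run(fetch_concurrently(["a", "b", "c"]))
    print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Run sequentially, the three waits would take about 0.6 seconds. Under `asyncio.gather` the total should stay close to the longest single wait.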

Key code snippet

import asyncio
import aiohttp
import threading
from bs4 import BeautifulSoup


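async def fetch(session, url):
    # Awaiting the response hands control back to the event loop,
    # so other requests can make progress in the meantime.
    async with session.get(url) as response:
        return await response.text()


async def fetch_all(urls):
    # One ClientSession is shared by all requests so connections can be reused.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # gather runs the coroutines concurrently and returns results in input order.
        return await asyncio.gather(*tasks)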


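def process_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Do the actual HTML parsing here; as an example, extract the page title.
    # soup.title.string is None when <title> is empty or has nested tags.
    title = soup.title.string if soup.title and soup.title.string else 'No title'
    print(f'Processed title: {title}')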


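def process_in_threads(html_list):
    threads = []
    # Start one worker thread per page (fine for a handful of pages).
    for html in html_list:
        t = threading.Thread(target=process_html, args=(html,))
        threads.append(t)
        t.start()
    # Wait until every page has been processed.
    for t in threads:
        t.join()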


urls = [
    'http://example.com',
    'http://example.org',
    'http://example.net'
]

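if __name__ == '__main__':
    # asyncio.run (Python 3.7+) creates the event loop, runs the coroutine,
    # and closes the loop; calling asyncio.get_event_loop() without a
    # running loop is deprecated in recent Python versions.
    htmls = asyncio.run(fetch_all(urls))
    process_in_threads(htmls)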

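In the code above:

The fetch function uses aiohttp to fetch the content of a single URL asynchronously.
The fetch_all function fetches the content of multiple URLs concurrently.
The process_html function parses the HTML; in this simple example it extracts the title.
The process_in_threads function uses multiple threads to process the downloaded HTML.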
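
The snippet above waits for every download to finish before any parsing starts. One possible variation, sketched below under stated assumptions, hands each page to a worker thread as soon as it arrives and returns the titles instead of printing them. To keep the sketch runnable without a network or third-party packages, `fake_fetch` is a hypothetical stand-in for the aiohttp call, and the standard library's `html.parser` replaces BeautifulSoup. `TitleParser`, `extract_title`, `fetch_and_parse`, and `crawl` are illustrative names, not part of the original answer.

```python
import asyncio
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html: str) -> str:
    # Blocking, synchronous parsing: this is the part run in a worker thread.
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title else "No title"


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for an aiohttp request.
    await asyncio.sleep(0.05)
    return f"<html><head><title>Page for {url}</title></head></html>"


async def fetch_and_parse(url: str) -> str:
    html = await fake_fetch(url)
    # asyncio.to_thread (Python 3.9+) runs the parser in a worker thread,
    # so the event loop stays free to service the other downloads.
    return await asyncio.to_thread(extract_title, html)


async def crawl(urls: list[str]) -> list[str]:
    # Each URL is fetched and parsed as an independent task.
    return await asyncio.gather(*(fetch_and_parse(u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(crawl(["http://example.com", "http://example.org"])))
```

Because of the GIL, this arrangement mainly keeps the event loop responsive. It does not make pure-Python parsing run on several cores at once.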
