Problem Description
Problem: Check a list of over 1000 urls and get each url's return code (status_code).
The script I have works, but it is very slow.
I am thinking there has to be a better, more pythonic (more beautiful) way of doing this, where I can spawn 10 or 20 threads to check the urls and collect the responses, i.e.:
200 -> www.yahoo.com
404 -> www.badurl.com
...
Input file: Url10.txt
www.example.com
www.yahoo.com
www.testsite.com
....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenge: Improve speed with multiprocessing.
With multiprocessing
But it is not working. I get the following error (note: I am not sure if I have even implemented this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In this case your task is I/O bound and not processor bound - it takes longer for a website to reply than it does for your CPU to loop once through your script (not including the TCP request). What this means is that you won't get any speedup from doing this task in parallel (which is what multiprocessing does). What you want is multi-threading. The way this is achieved is by using the little documented, perhaps poorly named, multiprocessing.dummy:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)                    # Make the Pool of workers
    results = pool.map(get_status, urls)    # Open the urls in their own threads
    pool.close()                            # Close the pool and wait for the work to finish
    pool.join()
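Applied to the url10.txt workflow from the question, the same pattern might look like the following sketch. This is an adaptation, not part of the original answer; the 1-second timeout and 20-thread pool size are simply the figures mentioned in the question.

import requests
from multiprocessing.dummy import Pool as ThreadPool

def check_url(url):
    url = 'http://' + url                  # prefix a scheme so requests can handle bare hostnames
    try:
        resp = requests.get(url, timeout=1)
        return url, resp.status_code
    except requests.RequestException:      # any timeout / connection problem
        return url, 'Error'

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()

    pool = ThreadPool(20)                  # 20 worker threads, as suggested in the question
    results = pool.map(check_url, urls)    # each url is fetched in its own thread
    pool.close()
    pool.join()

    for url, status in results:
        print(status, '->', url)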
See here for examples of multiprocessing vs multithreading in Python.
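As an aside not covered in the answer, the standard library's concurrent.futures module provides the same thread-pool pattern under a clearer name. A minimal sketch using the answer's two example urls:

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    try:
        return url, requests.get(url, timeout=1).status_code
    except requests.RequestException:
        return url, 'Error'

if __name__ == "__main__":
    # map runs get_status over the urls on up to 4 worker threads
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, status in pool.map(get_status, urls):
            print(status, '->', url)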