Python multiprocessing performance only improves with the square root of the number of cores used


Problem description



                  I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.


                  I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].

                  However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36 (the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.


Here is a plot of my performance test results:
                  ( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).

                  Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?


import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time

def check_prime(num):

    # Assert positive integer value
    if num != m.floor(num) or num < 1:
        print("Input must be a positive integer")
        return None

    # Check divisibility for all possible factors
    prime = True
    for i in range(2, num):
        if num % i == 0:
            prime = False
    return prime

def cp_worker(num, L):
    prime = check_prime(num)
    L.append((num, prime))


def mp_primes(omag, mp=cpu_count()):
    with Manager() as manager:
        np.random.seed(0)
        numlist = np.random.randint(10**omag, 10**(omag+1), 100)

        L = manager.list()
        cp_worker_ptl = partial(cp_worker, L=L)

        pool = Pool(processes=mp)  # created before try, so finally cannot hit an unbound name
        try:
            list(pool.imap(cp_worker_ptl, numlist))
        except Exception as e:
            print(e)
        finally:
            pool.close()  # no more tasks
            pool.join()

        # Materialize the proxy: the manager (and its list) shuts down on context exit
        return list(L)


if __name__ == '__main__':
    rt = []
    for i in range(cpu_count()):
        t0 = time()
        mp_result = mp_primes(6, mp=i+1)
        t1 = time()
        rt.append(t1 - t0)
        print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))
                  

Note: I am aware that for this task it would likely be more efficient to use multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to the GIL.

Solution

                  @KellanM deserved [+1] for quantitative performance monitoring

                  am I missing something with my implementation?

Yes: your expectation abstracts away all the add-on costs of process management.

While you have expressed an expectation of "a linear improvement with additional cores", this would hardly ever appear in practice, for several reasons (even the hype of communism failed to deliver anything for free).

Gene AMDAHL formulated the initial law of diminishing returns.
A more recent, re-formulated version also takes into account the effects of process-management {setup|terminate} add-on overhead costs, and tries to cope with the atomicity of processing (large work-package payloads cannot easily be re-located / re-distributed over the available pool of free CPU-cores in most common programming systems, except for some indeed specific micro-scheduling art, like the one demonstrated so colourfully in Semantic Designs' PARLANSE or LLNL's SISAL in the past).
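Both forms can be sketched numerically. The following is a minimal illustration only: the function names and the shape of the per-process overhead term are my own simplification, not the exact re-formulated law.

```python
def amdahl_speedup(p, n):
    """Classical Amdahl's Law: p is the parallelizable fraction
    of the work, n is the number of processing units."""
    return 1.0 / ((1.0 - p) + p / n)

def overhead_aware_speedup(p, n, setup=0.0, terminate=0.0):
    """Overhead-aware variant (simplified): each of the n processes
    adds setup/terminate costs, expressed here as fractions of the
    single-core run time, which the classical law ignores."""
    return 1.0 / ((1.0 - p) + p / n + n * (setup + terminate))

# A fully parallel task with zero overhead scales linearly ...
print(amdahl_speedup(1.0, 36))  # → 36.0
# ... but even a tiny per-process overhead caps the speedup well below that:
print(overhead_aware_speedup(1.0, 36, setup=0.001))
```

The second call shows why a "fully parallelizable" workload can still fall far short of linear scaling once process setup costs enter the denominator.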


                  A best next step?

If you are indeed interested in this domain, you can always experimentally measure and compare the real costs of process management (plus data-flow costs, plus memory-allocation costs, ... up until the process termination and results re-assembly in the main process), so as to record and evaluate, quantitatively and fairly, the add-on cost/benefit ratio of using more CPU-cores (which will, in python, re-instantiate the whole python-interpreter state, including all its memory state, before a first useful operation is carried out in the first spawned and set-up process).
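One way to take such a measurement is to time a pool that does no useful work at all, so that only the management overhead remains. This is a rough sketch under my own naming, not a rigorous benchmark:

```python
from multiprocessing import Pool
from time import time

def _noop(x):
    # no useful work at all: any time measured is management overhead
    return x

def pool_overhead(n_procs, n_tasks=100):
    """Wall-clock cost of creating a Pool, pushing trivial tasks
    through it, and tearing it down again."""
    t0 = time()
    with Pool(processes=n_procs) as pool:
        pool.map(_noop, range(n_tasks))
    return time() - t0

if __name__ == "__main__":
    for n in (1, 2, 4):
        print("%d process(es): %.3fs of pure overhead" % (n, pool_overhead(n)))
```

Subtracting such a baseline from the timings of the real workload gives a first quantitative estimate of how much of each run is spent on process management rather than on computing.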

Underperformance (in the former case below), if not disastrous effects (in the latter case below), of an ill-engineered resource-mapping policy, be it an "under-booking" of resources from a pool of CPU-cores or an "over-booking" of resources from a pool of RAM-space, is also discussed here.

The link to the re-formulated Amdahl's Law above will help you evaluate the point of diminishing returns, so that you do not pay more than you will ever receive.
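To see what such an evaluation might look like, here is a toy sweep that locates the core count after which adding one more core stops paying off. The overhead numbers are invented purely for illustration:

```python
def net_speedup(p, n, per_proc_overhead):
    # Amdahl's Law extended with a per-process add-on cost,
    # expressed as a fraction of the single-core run time
    return 1.0 / ((1.0 - p) + p / n + n * per_proc_overhead)

def point_of_diminishing_returns(p, per_proc_overhead, max_n=128):
    """First core count after which adding one more core
    no longer increases the net speedup."""
    best_n = 1
    for n in range(2, max_n + 1):
        if net_speedup(p, n, per_proc_overhead) <= net_speedup(p, best_n, per_proc_overhead):
            break
        best_n = n
    return best_n

# With zero overhead there is no such point inside the swept range ...
print(point_of_diminishing_returns(1.0, 0.0))    # → 128 (hits max_n)
# ... while a small per-process cost makes the optimum finite:
print(point_of_diminishing_returns(1.0, 0.002))
```

For a fully parallel task the optimum sits near 1/sqrt(overhead), which is why even modest per-process costs pull the break-even point down into the range of a few dozen cores.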

The experiments of Hoefinger and Haunschmid may serve as good practical evidence of how a growing number of processing nodes (be it a local O/S-managed CPU-core or a node in a distributed NUMA architecture) will start decreasing the resulting performance,
where the point of diminishing returns (demonstrated in the overhead-agnostic Amdahl's Law)
actually becomes the point after which you pay more than you receive.

Good luck in this interesting field!


                  Last, but not least,

NUMA / non-locality issues get their voice heard in discussions of scaling for HPC-grade tuning (in-Cache / in-RAM computing strategies), and may, as a side-effect, help detect flaws (as reported by @eryksun above). You can review your platform's actual NUMA topology with the lstopo tool, to see the abstraction that your operating system tries to work with when scheduling the "just"-[CONCURRENT] task execution over such a NUMA resources topology.
