Parallel external memory
In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.[1] It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

== Model ==

=== Definition ===
The PEM model[1] is a combination of the EM model and the PRAM model. It is a computation model which consists of P processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size N and P small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor; a processor cannot access another processor's cache. The caches have a size M which is partitioned into blocks of size B. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the caches in blocks of size B.

=== I/O complexity ===
The complexity measure of the PEM model is the I/O complexity,[1] which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer each processor can transfer one block. So if P processors each load a block of size B from the main memory into their caches in parallel, it is considered as an I/O complexity of O(1), not O(P). A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data already in the caches.
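The following Python sketch illustrates how this cost measure is counted: a toy machine with P private caches charges one unit of I/O for every parallel round of block transfers, regardless of how many of the P processors move a block in that round. This is a minimal illustration of the definition above, not an implementation from the literature; the names PEMMachine and parallel_read are invented for this sketch.

<syntaxhighlight lang="python">
# Toy model of the PEM cost measure: one parallel round of block transfers
# costs 1 I/O, regardless of how many of the P processors participate.
class PEMMachine:
    def __init__(self, P, B, main_memory):
        self.P = P                  # number of processors
        self.B = B                  # block size
        self.main = main_memory     # shared main memory (list of items)
        self.caches = [dict() for _ in range(P)]  # private caches: block index -> block contents
        self.io_rounds = 0          # I/O complexity, counted in parallel rounds

    def parallel_read(self, block_requests):
        """block_requests[i] is the block index processor i loads (or None).
        All transfers of one round together cost a single I/O."""
        assert len(block_requests) == self.P
        for i, b in enumerate(block_requests):
            if b is not None:
                start = b * self.B
                self.caches[i][b] = self.main[start:start + self.B]
        if any(b is not None for b in block_requests):
            self.io_rounds += 1     # one parallel block transfer == 1 I/O


# Example: P = 4 processors each load a different block of size B = 8
# in the same round; the I/O complexity is 1, not 4.
mem = list(range(64))
pem = PEMMachine(P=4, B=8, main_memory=mem)
pem.parallel_read([0, 1, 2, 3])
print(pem.io_rounds)  # -> 1
</syntaxhighlight>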
=== Read/write conflicts ===
In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts[1] occur. As in the PRAM model, three different variations of this problem are considered:

* Concurrent read, concurrent write (CRCW): multiple processors may read from and write to the same block at the same time.
* Concurrent read, exclusive write (CREW): multiple processors may read the same block at the same time, but only one processor may write to it.
* Exclusive read, exclusive write (EREW): a block may be read and written by only one processor at a time.

The following two algorithms[1] solve the CREW and EREW problem if P ≤ B processors write to the same block simultaneously. A first approach is to serialize the write operations: the processors write to the block one after the other, which results in a total of P parallel block transfers. A second approach needs only O(log P) parallel block transfers and one additional block for each processor. The main idea is to schedule the write operations in a binary-tree fashion and gradually combine the data into a single block: in the first round P/2 processors combine the P blocks into P/2 blocks, then P/4 processors combine those into P/4 blocks, and this procedure is continued until all the data is combined in one block after O(log P) rounds.
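A minimal Python sketch of the second approach, assuming each of the P processors holds its partial result in its own block and that two partial blocks can be merged by a single processor in one step. The function name combine_tree and the dictionary-based merge are illustrative choices of this sketch; each iteration of the loop corresponds to one parallel round of block transfers, so P contributions are combined in ⌈log₂ P⌉ rounds.

<syntaxhighlight lang="python">
import math

def combine_tree(contributions, merge):
    """Combine P per-processor blocks into one block in ceil(log2 P) rounds.

    contributions: list of P partial blocks (one per processor).
    merge: function combining two partial blocks into one.
    Returns (combined_block, number_of_rounds); each round models one
    parallel block transfer in which the surviving processors work pairwise.
    """
    blocks = list(contributions)
    rounds = 0
    while len(blocks) > 1:
        nxt = []
        # Pair up blocks; each pair is merged by one processor in this round.
        for j in range(0, len(blocks) - 1, 2):
            nxt.append(merge(blocks[j], blocks[j + 1]))
        if len(blocks) % 2 == 1:        # odd block is carried to the next round
            nxt.append(blocks[-1])
        blocks = nxt
        rounds += 1
    return blocks[0], rounds


# Example: 8 processors each contribute a disjoint slice of the target block.
P = 8
parts = [{i: i * 10} for i in range(P)]           # processor i's slice
merged, rounds = combine_tree(parts, lambda a, b: {**a, **b})
print(rounds == math.ceil(math.log2(P)))          # -> True (3 rounds)
print(len(merged) == P)                           # -> True: all 8 contributions present
</syntaxhighlight>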
== Comparison to other models ==

{| class="wikitable"
! Model !! Multi-core !! Cache-aware
|-
| Random-access machine (RAM) || No || No
|-
| Parallel random-access machine (PRAM) || Yes || No
|-
| External memory (EM) || No || Yes
|-
| Parallel external memory (PEM) || Yes || Yes
|}

== Examples ==

=== Multiway partitioning ===
Let M = {m_1, ..., m_{d−1}} be a vector of d−1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition[1] of A is a set Π = {A_1, ..., A_d}, where A_1 ∪ ... ∪ A_d = A and A_i ∩ A_j = ∅ for 1 ≤ i < j ≤ d. A_i is called the i-th bucket. Every element of A_i is greater than pivot m_{i−1} and smaller than pivot m_i.

In the following algorithm[1] the input is partitioned into P contiguous segments S_1, ..., S_P of size N/P in main memory. Processor i primarily works on the segment S_i. The multiway partitioning algorithm:

 // Compute in parallel a d-way partition on the data segments S_i
 '''for each''' processor i '''in parallel do'''
     Read the vector of pivots M into the cache.
     Partition S_i into d buckets and let the vector J_i = {j_1^i, ..., j_d^i} be the number of items in each bucket.
 '''end for'''
 Run PEM prefix sum on the set of vectors {J_1, ..., J_P} simultaneously.
 // Use the prefix sum vector to compute the final partition
 '''for each''' processor i '''in parallel do'''
     Write the elements of S_i into memory locations offset appropriately by J_{i−1} and J_i.
 '''end for'''
 Using the prefix sums stored in J_P, the last processor P calculates the vector B of bucket sizes and returns it.

If the vector of pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with an I/O complexity of O(N/(PB) + ⌈d/B⌉·log P + d·log B). The contents of the final buckets are again located in contiguous memory. (A sequential sketch of this partitioning primitive is given below, after the examples.)

=== Selection ===
The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code[1] makes use of PRAMSORT, a PRAM-optimal sorting algorithm, and SELECT, a cache-optimal single-processor selection algorithm.

 PEMSELECT(A[1 : N], P, k):
 '''if''' N ≤ P '''then'''
     PRAMSORT(A, P)
     '''return''' A[k]
 '''end if'''
 // Find the median of each segment S_i
 '''for each''' processor i '''in parallel do'''
     m_i = SELECT(S_i, N / (2P))
 '''end for'''
 // Sort the medians
 PRAMSORT({m_1, ..., m_P}, P)
 // Partition around the median of medians
 t = PEMPARTITION(A, m_{P/2}, P)
 '''if''' k ≤ t '''then'''
     '''return''' PEMSELECT(A[1 : t], P, k)
 '''else'''
     '''return''' PEMSELECT(A[t + 1 : N], P, k − t)
 '''end if'''

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of O(N/(PB) + log(PB)·log(N/P)).

=== Distribution sort ===
Distribution sort partitions an input list A of size N into disjoint buckets of similar size. Every bucket is then sorted recursively, and the results are combined into a fully sorted list.

If P = 1, the task is delegated to a cache-optimal single-processor sorting algorithm. Otherwise the following algorithm[1] is used:

 PEMDISTSORT(A[1 : N], P):
 // Sample elements from A
 '''for each''' processor i '''in parallel do'''
     '''if''' M < |S_i| '''then'''
         Load S_i in M-sized pages and sort the pages individually
     '''else'''
         Load and sort S_i as a single page
     '''end if'''
     Pick evenly spaced sample elements from each sorted memory page into a contiguous vector R_i of samples
 '''end for'''
 '''in parallel do'''
     Combine the vectors R_1, ..., R_P into a single contiguous vector R
     Make √N copies of R
 '''end do'''
 // Find √N pivots
 '''for''' j = 1 '''to''' √N '''in parallel do'''
     Compute the j-th pivot M[j] from the j-th copy of R using PEMSELECT
 '''end for'''
 Pack the pivots into a contiguous array M
 // Partition A around the pivots into buckets B_1, ..., B_{√N + 1}
 B = PEMMULTIPARTITION(A[1 : N], M, √N, P)
 // Recursively sort the buckets
 '''for''' j = 1 '''to''' √N + 1 '''in parallel do'''
     Recursively call PEMDISTSORT on bucket j of size B[j], using the O(⌈B[j] / (N/P)⌉) processors responsible for the elements of that bucket
 '''end for'''

The I/O complexity of PEMDISTSORT is O( ⌈N/(PB)⌉·(log_d P + log_{M/B}(N/B)) + f(N, P, d) ), where f(N, P, d) accounts for the additional cost of the sampling and pivot-finding phases. If the number of processors is chosen such that P ≤ N/B² and M = B^{O(1)}, the I/O complexity is then O( (N/(PB))·log_{M/B}(N/B) ).
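The d-way partitioning primitive described above underlies both PEMSELECT and PEMDISTSORT. The following sequential Python sketch is illustrative only: it simulates the per-segment bucket counts J_i and the prefix-sum offsets in ordinary Python instead of modelling block transfers, and it breaks ties by placing elements equal to a pivot into the lower bucket. The function name multiway_partition and the example data are invented for this sketch.

<syntaxhighlight lang="python">
from bisect import bisect_left
from itertools import accumulate

def multiway_partition(A, pivots, P):
    """Simulate the PEM d-way partition: split A into P segments S_1..S_P,
    count per-segment bucket sizes J_i, prefix-sum them, and write every
    element to its bucket at an offset derived from the prefix sums."""
    d = len(pivots) + 1                      # number of buckets
    n = len(A)
    seg = (n + P - 1) // P                   # segment size N/P (rounded up)
    segments = [A[i * seg:(i + 1) * seg] for i in range(P)]

    # Phase 1: each "processor" i partitions its segment locally and
    # records J_i, the number of its items falling into each bucket.
    local = []                               # local[i][b] = items of S_i in bucket b
    J = []                                   # J[i][b] = len(local[i][b])
    for S in segments:
        buckets = [[] for _ in range(d)]
        for x in S:
            # ties on a pivot go to the lower bucket (bucket b holds m_b < x <= m_{b+1})
            buckets[bisect_left(pivots, x)].append(x)
        local.append(buckets)
        J.append([len(b) for b in buckets])

    # Phase 2: prefix sums over the J_i give, for each bucket b, the offset
    # at which processor i writes (all processors could do this in parallel).
    bucket_sizes = [sum(J[i][b] for i in range(P)) for b in range(d)]
    bucket_starts = [0] + list(accumulate(bucket_sizes))[:-1]

    out = [None] * n
    for b in range(d):
        offset = bucket_starts[b]
        for i in range(P):                   # processor i's slice of bucket b
            for x in local[i][b]:
                out[offset] = x
                offset += 1
    return out, bucket_sizes

# Example: 3 pivots give a 4-way partition computed by P = 4 "processors".
A = [9, 2, 7, 4, 11, 5, 1, 14, 8, 3, 12, 6]
out, sizes = multiway_partition(A, pivots=[4, 8, 11], P=4)
print(sizes)   # -> [4, 4, 2, 2], the vector of bucket sizes
print(out)     # buckets laid out contiguously: x <= 4, then 4 < x <= 8, 8 < x <= 11, x > 11
</syntaxhighlight>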
=== Other PEM algorithms ===
Further PEM algorithms have been developed, for example for mergesort[1] and for graph problems such as list ranking and computing Euler tours;[2] their I/O complexities are typically expressed in terms of sort_P(N), where sort_P(N) is the time it takes to sort N items with P processors in the PEM model.

== See also ==
== References ==
# {{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}
# {{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Sitchinava|first3=Nodari|date=2010|title=Parallel external memory graph algorithms|url=http://dx.doi.org/10.1109/ipdps.2010.5470440|journal=2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)|publisher=IEEE|doi=10.1109/ipdps.2010.5470440|isbn=9781424464425}}

{{Parallel Computing}}

Categories: Algorithms | Models of computation | Analysis of parallel algorithms | External memory algorithms | Cache (computing)