OpenMP is a standard API for shared memory parallel programming. This document uses the C binding, but C++ and Fortran bindings also exist. It is written for OpenMP 3.1. The blue parts describe features from relatively new versions of the standard, so they may not be implemented in all compilers.
OpenMP is not automatic parallelization. Programmers are responsible for analyzing dependencies and resolving them. If it is used incorrectly, the program produces wrong results.
#include <omp.h>
Defines what is needed to use OpenMP.
_OPENMP
Defined by the compiler when it compiles the source with OpenMP enabled. Its value is an integer of the form yyyymm that identifies the version of the OpenMP specification.
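For example, a program can test at compile time whether it is being compiled with OpenMP enabled (a minimal sketch):

    #include <stdio.h>

    int main(void)
    {
    #ifdef _OPENMP
        printf("compiled with OpenMP, _OPENMP = %d\n", _OPENMP);
    #else
        printf("compiled without OpenMP\n");
    #endif
        return 0;
    }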
#pragma omp parallel clause
The next statement (or block) is executed by all threads in duplicate, i.e., in SPMD style.
The number of threads can be specified by the num_threads clause, the omp_set_num_threads function, or the environment variable OMP_NUM_THREADS.
The variables defined within the block are private to each thread. The variables declared before the block and the static variables are shared. You can specify this explicitly with shared(variables) or private(variables).
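A minimal sketch of a parallel region (the variable names and the thread count of 4 are only for illustration):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int n = 100;                          /* declared before the block: shared */
        #pragma omp parallel num_threads(4)
        {
            int id = omp_get_thread_num();    /* declared inside the block: private */
            printf("thread %d sees n = %d\n", id, n);
        }
        return 0;
    }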
#pragma omp for clause
Distributes the iterations of the next loop over the threads. Five scheduling methods can be specified with the schedule clause: static, dynamic, guided, auto, and runtime.
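For example, a parallel loop with a dynamic schedule might look like the following sketch (the chunk size 4 and the loop body are arbitrary choices for illustration):

    #include <omp.h>

    void scale(double *a, const double *b, int n)
    {
        int i;
        /* iterations are handed out to threads in chunks of 4 as threads become free */
        #pragma omp parallel for schedule(dynamic, 4)
        for (i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }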
#pragma omp single clause: Executed by a single thread.
#pragma omp master: Executed by the master thread, i.e. thread 0.
#pragma omp critical (name): Critical section. Name can be omitted (unnamed critical section).
#pragma omp atomic [read|write|update|capture]: Atomic operation. If no clause is attached, "update" is assumed. The following statement is limited to forms such as x++ and similar simple updates.
#pragma omp barrier: Barrier synchronization.
#pragma omp flush (variables): Construct for memory consistency. If one wants thread 0 to write a variable x and thread 1 to read the written value, one must guarantee that the following four things happen in this order: (1) thread 0 writes to x; (2) thread 0 flushes x; (3) thread 1 flushes x; (4) thread 1 reads x.
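A minimal sketch of this producer/consumer ordering using flush-based signaling (the flag variable and the value 42 are only for illustration; in many programs a barrier is simpler):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int x = 0, flag = 0;                  /* both shared */
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                x = 42;                       /* (1) thread 0 writes x */
                #pragma omp flush(x, flag)    /* (2) thread 0 flushes x */
                flag = 1;                     /* signal that x is ready */
                #pragma omp flush(flag)
            } else {
                int ready = 0;
                while (!ready) {              /* spin until the producer signals */
                    #pragma omp flush(flag)
                    ready = flag;
                }
                #pragma omp flush(x, flag)    /* (3) thread 1 flushes x */
                printf("got %d\n", x);        /* (4) thread 1 reads x */
            }
        }
        return 0;
    }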
int omp_get_num_threads(void); returns the number of threads.
int omp_get_thread_num(void); returns the thread number of the calling thread.
double omp_get_wtime(void); returns the wall-clock time in seconds.
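A minimal sketch that uses these three functions to time a parallel region:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        printf("elapsed: %f s\n", omp_get_wtime() - t0);
        return 0;
    }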
Other useful interfaces include the lock functions and the task constructs.
OpenMP assumes simple parallelism such as the parallel for construct. Fuzzy barriers and pairwise synchronization are not supported by the API. Reduction on an array is not supported in C/C++.
One may be able to write an OpenMP program that can also be compiled as a sequential program when the OpenMP directives are ignored. It is strongly recommended to do so, since it is very useful for debugging. Note that debugging in OpenMP is much harder than debugging in MPI.
One must be aware, for each variable, whether it is shared or private. Many bugs are created when one uses a shared variable as if it were private.
I recommend making the block to be parallelized into a function, as in

    #pragma omp parallel
    f(a, b, c);

or

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        f(a[i], b[i], c[i]);
There are private variables and threadprivate variables. The names, the slight differences between the concepts, and the modifications from previous versions of OpenMP are too delicate. I would not use threadprivate variables.
If you want to send data held in a private variable to other threads, you need a shared variable for "communication". Alternatively, you can use the copyprivate clause.
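A minimal sketch of broadcasting a value held in a private variable with the copyprivate clause on a single construct (the variable x and the value 42 are only for illustration):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int x;                         /* private to each thread */
            #pragma omp single copyprivate(x)
            x = 42;                        /* one thread sets x; the value is copied
                                              to every other thread's private x */
            printf("thread %d: x = %d\n", omp_get_thread_num(), x);
        }
        return 0;
    }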
If you program communication between threads on shared memory, the consumer thread must not start reading before the producer completes writing the data into the shared memory. This is achieved by synchronization.
In data parallel programming, it can be done by barrier synchronization, as in the sketch below.
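A minimal sketch, assuming a data-parallel update where each thread produces part of a shared array a and then reads parts written by other threads (the arrays and the computation are placeholders):

    #include <omp.h>

    void step(double *a, double *b, int n)
    {
        #pragma omp parallel
        {
            int id  = omp_get_thread_num();
            int nth = omp_get_num_threads();
            int i;
            for (i = id; i < n; i += nth)
                a[i] = 2.0 * b[i];          /* produce: write own part of a */
            #pragma omp barrier             /* all writes to a are complete here */
            for (i = id; i < n; i += nth)
                b[i] = a[(i + 1) % n];      /* consume: read data written by others */
        }
    }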
Dynamic load balancing can be done by using the critical or atomic constructs. Slightly more complex things can be done with locks. But note that the lock functions are not accompanied by an implicit flush (except for the lock variables).
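A minimal sketch of dynamic load balancing with a shared work counter and atomic capture (process_item and n_items are placeholders):

    #include <omp.h>

    void process_item(int i);                /* placeholder for the real work */

    void run(int n_items)
    {
        int next = 0;                        /* shared counter of the next item */
        #pragma omp parallel
        {
            for (;;) {
                int i;
                /* atomically grab the next item index */
                #pragma omp atomic capture
                i = next++;
                if (i >= n_items)
                    break;
                process_item(i);
            }
        }
    }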
There has been much effort to give parallel programs the same memory semantics as sequential programs. But every such method requires high overhead. In the end, weak consistency was adopted, where the order of memory accesses is guaranteed to be as programmed only with respect to a special synchronization operation, i.e. the flush.
All the memory accesses before a flush are guaranteed to be done before the flush completes. Any memory access after the flush is guaranteed not to be started before the flush completes.
Debugging a shared memory parallel program is much harder than debugging a message passing program. A message passing parallel program is just a sequential program with message passing, but a shared memory parallel program is something very different from a sequential program. In shared memory parallel programs, shared variables can be overwritten by any other thread at an unexpected time. Worse, such unexpected executions depend on timing and are not reproducible. There are even proposals of parallel programming languages that use message passing although they assume shared memory hardware!
Locality: Shared memory parallel programs tend to have worse locality than sequential programs. To recover the performance lost in parallelization, one needs to change data layouts and to make some data private. This is almost equivalent to data distribution on a distributed memory parallel system.
Memory contention: If the algorithm's performance is bounded by memory bandwidth, the parallel performance will be limited by the total memory bandwidth. To reduce such performance loss, enhancing cache utilization and hiding latency can be effective.
Communication overhead: Memory is far away and memory is slow. Since data are communicated through shared memory, access takes longer than access to data in cache. It is better to reduce "communication", similarly to distributed memory parallelism.
False sharing: Caches have a coherence mechanism, and the cache line is the unit of coherence. Coherence overheads occur when two threads access data on the same cache line, even if unrelated data merely happen to be placed on the same cache line by the compiler. This is false sharing. It can be reduced by changing the data layout.
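A minimal sketch of such a layout change: padding per-thread counters so that each one occupies its own cache line (the 64-byte line size and the 64-thread limit are assumptions):

    #include <omp.h>

    #define CACHE_LINE  64                     /* assumed cache line size in bytes */
    #define MAX_THREADS 64                     /* assumed upper bound on threads */

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];   /* keep each counter on its own line */
    };

    long count_even(const int *a, int n)
    {
        struct padded_counter c[MAX_THREADS] = {{0}};
        long total = 0;
        int i;
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            #pragma omp for
            for (i = 0; i < n; i++)
                if (a[i] % 2 == 0)
                    c[id].value++;             /* each thread updates its own line */
        }
        for (i = 0; i < MAX_THREADS; i++)
            total += c[i].value;
        return total;
    }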
Synchronization overhead: Functions such as barrier synchronization, flush and thread creation are necessary only for parallel processing, and thus are parallelization overheads. However, when minimizing them, one easily introduces a new bug, which is hard to debug. Sometimes bugs are resolved, or become invisible, by introducing more synchronization than required.
Load balancing: Load balancing is a general problem in parallel processing. In shared memory parallelization, execution times fluctuate because of such factors as false sharing and memory access contention, in a way that is hard to predict. Thus the execution times do not become equal even when the "load" is balanced. Dynamic load balancing can help.
Demand-driven communication: With message passing, the producer can send data to the consumer as soon as they are generated. In shared memory parallel programming, data transfer happens only after the consumer requests the data. Prefetching is needed to mitigate this. Also, all data on the same cache line will be transferred. Compare this with the message passing model, where only the designated data are sent.
Difficulty of understanding performance: In the message passing model, the major parallelization overheads come from load imbalance and message communication. We can predict them relatively accurately. Sometimes there is even a performance gain from communication latency hiding and a lower cache miss ratio. On the contrary, on shared memory processors, performance loss comes not only from load imbalance and synchronization overhead, but also from a higher cache miss ratio and memory access contention. The latter are hard to predict.