Thursday 23 May 2019

M.Tech 2nd Sem MTCS201 Multi Core Architecture and Programming


Parallel Computer Architecture - Introduction
In the last 50 years, there have been huge developments in the performance and capability of computer systems. This has been possible with the help of Very Large Scale Integration (VLSI) technology. VLSI technology allows a large number of components to be accommodated on a single chip and clock rates to increase. Therefore, more operations can be performed at a time, in parallel.
Parallel processing is also associated with data locality and data communication. Parallel computer architecture is the method of organizing all the resources to maximize performance and programmability within the limits given by technology and cost at any given point of time.

Why Parallel Architecture?

Parallel computer architecture adds a new dimension to the development of computer systems by using an increasing number of processors. In principle, the performance achieved by utilizing a large number of processors is higher than the performance of a single processor at a given point of time.

Application Trends

With the advancement of hardware capacity, the demand for well-performing applications also increased, which in turn placed a demand on the development of computer architecture.
Before the microprocessor era, high-performance computer systems were obtained by exotic circuit technology and machine organization, which made them expensive. Now, high-performance computer systems are obtained by using multiple processors, and the most important and demanding applications are written as parallel programs. Thus, for higher performance both parallel architectures and parallel applications need to be developed.
To increase the performance of an application, speedup is the key factor to be considered. Speedup on p processors is defined as −
Speedup(p processors) ≡ Performance(p processors) / Performance(1 processor)
For a single fixed problem,
Performance of a computer system = 1 / Time needed to complete the problem
Speedup fixed problem(p processors) = Time(1 processor) / Time(p processors)
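For example (the numbers are purely illustrative): if a fixed problem takes 100 seconds on one processor and 25 seconds on four processors, then
Speedup fixed problem(4 processors) = 100 s / 25 s = 4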

Scientific and Engineering Computing

Parallel architecture has become indispensable in scientific computing (like physics, chemistry, biology, astronomy, etc.) and engineering applications (like reservoir modeling, airflow analysis, combustion efficiency, etc.). In almost all applications, there is a huge demand for visualization of computational output resulting in the demand for development of parallel computing to increase the computational speed.

Commercial Computing

In commercial computing (like video, graphics, databases, OLTP, etc.), high-speed computers are also needed to process huge amounts of data within a specified time. Desktop applications use multithreaded programs that are almost like parallel programs. This in turn creates a demand for developing parallel architecture.

Technology Trends

With the development of technology and architecture, there is a strong demand for the development of high-performing applications. Experiments show that parallel computers can work much faster than the most highly developed single processor. Moreover, parallel computers can be developed within the limits of technology and cost.
The primary technology used here is VLSI technology. Therefore, nowadays more and more transistors, gates and circuits can be fitted in the same area. With the reduction of the basic VLSI feature size, the clock rate improves in proportion to it, while the number of transistors grows as the square. Using many transistors at once (parallelism) can therefore be expected to perform much better than increasing the clock rate.
Technology trends suggest that the basic single chip building block will give increasingly large capacity. Therefore, the possibility of placing multiple processors on a single chip increases.

Architectural Trends

Development in technology decides what is feasible; architecture converts the potential of the technology into performance and capability. Parallelism and locality are two methods by which larger volumes of resources and more transistors enhance performance. However, these two methods compete for the same resources. When multiple operations are executed in parallel, the number of cycles needed to execute the program is reduced.
However, resources are needed to support each of the concurrent activities. Resources are also needed to allocate local storage. The best performance is achieved by an intermediate action plan that uses resources to utilize a degree of parallelism and a degree of locality.
Generally, the history of computer architecture has been divided into four generations having following basic technologies −
  • Vacuum tubes
  • Transistors
  • Integrated circuits
  • VLSI
Until 1985, development was dominated by growth in bit-level parallelism: 4-bit microprocessors were followed by 8-bit, 16-bit, and so on. To reduce the number of cycles needed to perform a full 32-bit operation, the width of the data path was doubled. Later on, 64-bit operations were introduced.
The growth in instruction-level parallelism dominated the mid-80s to mid-90s. The RISC approach showed that it was simple to pipeline the steps of instruction processing so that, on average, an instruction is executed in almost every cycle. Growth in compiler technology has made instruction pipelines more productive.
In mid-80s, microprocessor-based computers consisted of
  • An integer processing unit
  • A floating-point unit
  • A cache controller
  • SRAMs for the cache data
  • Tag storage
As chip capacity increased, all these components were merged into a single chip. Thus, a single chip consisted of separate hardware for integer arithmetic, floating-point operations, memory operations and branch operations. Besides pipelining individual instructions, such a processor fetches multiple instructions at a time and sends them in parallel to different functional units whenever possible. This type of instruction-level parallelism is called superscalar execution.

Convergence of Parallel Architectures

Parallel machines have been developed with several distinct architectures. In this section, we will discuss the different parallel computer architectures and the nature of their convergence.

Communication Architecture

Parallel architecture enhances the conventional concepts of computer architecture with communication architecture. Computer architecture defines critical abstractions (like user-system boundary and hardware-software boundary) and organizational structure, whereas communication architecture defines the basic communication and synchronization operations. It also addresses the organizational structure.
Programming model is the top layer. Applications are written in programming model. Parallel programming models include −
  • Shared address space
  • Message passing
  • Data parallel programming
Shared address programming is just like using a bulletin board, where one can communicate with one or many individuals by posting information at a particular location, which is shared by all other individuals. Individual activity is coordinated by noting who is doing what task.
Message passing is like a telephone call or letters where a specific receiver receives information from a specific sender.
Data parallel programming is an organized form of cooperation. Here, several individuals perform an action on separate elements of a data set concurrently and share information globally.

Shared Memory

Shared memory multiprocessors are one of the most important classes of parallel machines. It gives better throughput on multiprogramming workloads and supports parallel programs.
In this case, all the computer systems allow a processor and a set of I/O controllers to access a collection of memory modules through some hardware interconnection. The memory capacity is increased by adding memory modules, and I/O capacity is increased by adding devices to an I/O controller or by adding additional I/O controllers. Processing capacity can be increased by waiting for a faster processor to become available or by adding more processors.
All the resources are organized around a central memory bus. Through the bus access mechanism, any processor can access any physical address in the system. As all the processors are equidistant from all the memory locations, the access time or latency of all the processors on a memory location is the same. Such a system is called a symmetric multiprocessor.

Message-Passing Architecture

Message passing architecture is also an important class of parallel machines. It provides communication among processors as explicit I/O operations. In this case, the communication is combined at the I/O level, instead of the memory system.
In message-passing architecture, user communication is executed using operating system or library calls that perform many lower-level actions, which include the actual communication operation. As a result, there is a distance between the programming model and the communication operations at the physical hardware level.
Send and receive are the most common user-level communication operations in a message-passing system. Send specifies a local data buffer (which is to be transmitted) and a receiving remote processor. Receive specifies a sending process and a local data buffer in which the transmitted data will be placed. In the send operation, an identifier or a tag is attached to the message, and the receiving operation specifies a matching rule, such as a specific tag from a specific processor or any tag from any processor.
The combination of a send and a matching receive completes a memory-to-memory copy. Each end specifies its local data address and a pairwise synchronization event.
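As a minimal sketch of this send/receive pairing in C using the standard MPI library (the ranks, the tag value 99 and the integer payload are only illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* local data buffer to be transmitted */
        MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);                     /* tag 99, receiver rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* matching tag and sender */
        printf("received %d\n", value);   /* the memory-to-memory copy is now complete */
    }

    MPI_Finalize();
    return 0;
}

The matching rule here is the pair (source rank, tag); the receive completes only when a message with the matching tag arrives from the specified sender.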

Convergence

Development of hardware and software has blurred the clear boundary between the shared memory and message passing camps. Message passing and a shared address space represent two distinct programming models; each gives a transparent paradigm for sharing, synchronization and communication. However, the basic machine structures have converged towards a common organization.

Data Parallel Processing

Another important class of parallel machine is variously called − processor arrays, data parallel architecture and single-instruction-multiple-data machines. The main feature of the programming model is that operations can be executed in parallel on each element of a large regular data structure (like array or matrix).
Data parallel programming languages are usually implemented by viewing the local address spaces of a group of processes, one per processor, as forming an explicit global address space. As all the processors communicate together and there is a global view of all the operations, either a shared address space or message passing can be used.

Fundamental Design Issues

Development of the programming model alone cannot increase the efficiency of the computer, nor can development of the hardware alone do it. However, development in computer architecture can make a difference in the performance of the computer. We can understand the design problem by focusing on how programs use a machine and which basic technologies are provided.
In this section, we will discuss the communication abstraction and the basic requirements of the programming model.

Communication Abstraction

Communication abstraction is the main interface between the programming model and the system implementation. It is like the instruction set that provides a platform so that the same program can run correctly on many implementations. Operations at this level must be simple.
Communication abstraction is like a contract between the hardware and the software, which gives each side the flexibility to improve without affecting the other.

Programming Model Requirements

A parallel program has one or more threads operating on data. A parallel programming model defines what data the threads can name, which operations can be performed on the named data, and which order is followed by the operations.
To ensure that the dependencies within the program are enforced, a parallel program must coordinate the activity of its threads.

Parallel Computer Architecture - Models

Parallel processing has been developed as an effective technology in modern computers to meet the demand for higher performance, lower cost and accurate results in real-life applications. Concurrent events are common in today’s computers due to the practice of multiprogramming, multiprocessing, or multicomputing.
Modern computers have powerful and extensive software packages. To analyze the development of the performance of computers, first we have to understand the basic development of hardware and software.
·        Computer Development Milestones − There are two major stages in the development of computers: machines built from mechanical or electromechanical parts, and machines built from electronic components. Modern computers evolved after the introduction of electronic components. High-mobility electrons in electronic computers replaced the operational parts of mechanical computers. For information transmission, electric signals, which travel almost at the speed of light, replaced mechanical gears and levers.
·        Elements of Modern computers − A modern computer system consists of computer hardware, instruction sets, application programs, system software and user interface.
The computing problems are categorized as numerical computing, logical reasoning, and transaction processing. Some complex problems may need the combination of all the three processing modes.
·        Evolution of Computer Architecture − In the last four decades, computer architecture has gone through revolutionary changes. We started with the Von Neumann architecture and now we have multicomputers and multiprocessors.
·        Performance of a computer system − The performance of a computer system depends both on machine capability and program behavior. Machine capability can be improved with better hardware technology, advanced architectural features and efficient resource management. Program behavior is unpredictable as it is dependent on the application and run-time conditions.

Multiprocessors and Multicomputers

In this section, we will discuss two types of parallel computers −
  • Multiprocessors
  • Multicomputers

Shared-Memory Multiprocessors

The three most common shared-memory multiprocessor models are −

Uniform Memory Access (UMA)

In this model, all the processors share the physical memory uniformly. All the processors have equal access time to all the memory words. Each processor may have a private cache memory. Same rule is followed for peripheral devices.
When all the processors have equal access to all the peripheral devices, the system is called a symmetric multiprocessor. When only one or a few processors can access the peripheral devices, the system is called an asymmetric multiprocessor.

Non-uniform Memory Access (NUMA)

In NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the shared memory is physically distributed among all the processors, called local memories. The collection of all local memories forms a global address space which can be accessed by all the processors.

Cache Only Memory Architecture (COMA)

The COMA model is a special case of the NUMA model. Here, all the distributed main memories are converted to cache memories.

Distributed-Memory Multicomputers

A distributed-memory multicomputer system consists of multiple computers, known as nodes, inter-connected by a message-passing network. Each node acts as an autonomous computer having a processor, a local memory and sometimes I/O devices. In this case, all local memories are private and are accessible only to the local processors. This is why traditional machines of this type are called no-remote-memory-access (NORMA) machines.

Multivector and SIMD Computers

In this section, we will discuss supercomputers and parallel processors for vector processing and data parallelism.

Vector Supercomputers

In a vector computer, a vector processor is attached to the scalar processor as an optional feature. The host computer first loads program and data to the main memory. Then the scalar control unit decodes all the instructions. If the decoded instructions are scalar operations or program operations, the scalar processor executes those operations using scalar functional pipelines.
On the other hand, if the decoded instructions are vector operations then the instructions will be sent to vector control unit.

SIMD Supercomputers

In SIMD computers, ‘N’ number of processors are connected to a control unit and all the processors have their individual memory units. All the processors are connected by an interconnection network.

PRAM and VLSI Models

The ideal model gives a suitable framework for developing parallel algorithms without considering the physical constraints or implementation details.
The models can be applied to obtain theoretical performance bounds on parallel computers or to evaluate VLSI complexity in terms of chip area and operational time before the chip is fabricated.

Parallel Random-Access Machines

Shepherdson and Sturgis (1963) modeled conventional uniprocessor computers as random-access machines (RAM). Fortune and Wyllie (1978) developed a parallel random-access machine (PRAM) model for modeling an idealized parallel computer with zero memory access overhead and synchronization.
An N-processor PRAM has a shared memory unit. This shared memory can be centralized or distributed among the processors. These processors operate on a synchronized read-memory, write-memory and compute cycle. So, these models specify how concurrent read and write operations are handled.
Following are the possible memory update operations −
·        Exclusive read (ER) − In this method, in each cycle only one processor is allowed to read from any memory location.
·        Exclusive write (EW) − In this method, at most one processor is allowed to write into a memory location at a time.
·        Concurrent read (CR) − It allows multiple processors to read the same information from the same memory location in the same cycle.
·        Concurrent write (CW) − It allows simultaneous write operations to the same memory location. To avoid write conflict some policies are set up.

VLSI Complexity Model

Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale switching networks.
Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of storage (memory) space available in that chip.
We can calculate the space complexity of an algorithm from the chip area (A) of the VLSI implementation of that algorithm. If T is the time (latency) needed to execute the algorithm, then A·T gives an upper bound on the total number of bits processed through the chip (or I/O). For certain computations, there exists a lower bound, f(s), such that
A·T² ≥ O(f(s))
where A = chip area and T = time.

Architectural Development Tracks

The evolution of parallel computers has spread along the following tracks −
  • Multiple Processor Tracks
    • Multiprocessor track
    • Multicomputer track
  • Multiple data track
    • Vector track
    • SIMD track
  • Multiple threads track
    • Multithreaded track
    • Dataflow track
In the multiple processor track, it is assumed that different threads execute concurrently on different processors and communicate through a shared memory (multiprocessor track) or message passing (multicomputer track) system.
In the multiple data track, it is assumed that the same code is executed on a massive amount of data. It is done either by executing the same instructions on a sequence of data elements (vector track) or through the execution of the same sequence of instructions on a similar set of data (SIMD track).
In the multiple threads track, it is assumed that various threads are executed in an interleaved fashion on the same processor to hide synchronization delays among threads executing on different processors. Thread interleaving can be coarse (multithreaded track) or fine (dataflow track).

Hyper-Threading

Hyper-Threading is Intel's simultaneous multi-threading design. It allows a single processor to manage data as if it were two processors by handling data instructions in parallel rather than one at a time. Hyper-Threading Technology is designed to improve system performance and efficiency.

Amdahl's Law Defined

A program (or algorithm) which can be parallelized can be split up into two parts:
  • A part which cannot be parallelized
  • A part which can be parallelized
Imagine a program that processes files from disk. A small part of that program may scan the directory and create a list of files internally in memory. After that, each file is passed to a separate thread for processing. The part that scans the directory and creates the file list cannot be parallelized, but processing the files can.
The total time taken to execute the program in serial (not in parallel) is called T. The time T includes the time of both the non-parallelizable and parallelizable parts. The non-parallelizable part is called B. The parallelizable part is referred to as T - B. The following list sums up these definitions:
  • T = Total time of serial execution
  • B = Total time of the non-parallelizable part
  • T - B = Total time of the parallelizable part (when executed serially, not in parallel)
From this follows that:
T = B + (T-B)
It may look a bit strange at first that the parallelizable part of the program does not have its own symbol in the equation. However, since the parallelizable part can be expressed using the total time T and B (the non-parallelizable part), the equation has actually been reduced conceptually, meaning that it contains fewer variables in this form.
It is the parallelizable part, T - B, that can be sped up by executing it in parallel. How much it can be sped up depends on how many threads or CPUs you apply to execute it. The number of threads or CPUs is called N. The fastest the parallelizable part can be executed is thus:
(T - B) / N
Another way to write this is:
(1/N) * (T - B)
Wikipedia uses this version in case you read about Amdahl's law there.
According to Amdahl's law, the total execution time of the program when the parallelizable part is executed using N threads or CPUs is thus:
T(N) = B + (T - B) / N
T(N) means the total execution time with a parallelization factor of N. Thus, T could be written T(1), meaning the total execution time with a parallelization factor of 1. Using T(1) instead of T, Amdahl's law looks like this:
T(N) = B + ( T(1) - B ) / N
It still means the same though.
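A small sketch of this formula in C (the function name and the printed values are illustrative; the sample values match the calculation example in the next section):

#include <stdio.h>

/* Amdahl's law: total execution time with N threads/CPUs,
   given total serial time T and non-parallelizable time B. */
double amdahl_time(double T, double B, int N) {
    return B + (T - B) / N;
}

int main(void) {
    double T = 1.0, B = 0.4;
    printf("T(2) = %.2f\n", amdahl_time(T, B, 2));   /* prints T(2) = 0.70 */
    printf("T(5) = %.2f\n", amdahl_time(T, B, 5));   /* prints T(5) = 0.52 */
    return 0;
}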

A Calculation Example

To better understand Amdahl's law, let's go through a calculation example. The total time to execute a program is set to 1. The non-parallelizable part of the program is 40%, which out of a total time of 1 is equal to 0.4. The parallelizable part is thus equal to 1 - 0.4 = 0.6.
The execution time of the program with a parallelization factor of 2 (2 threads or CPUs executing the parallelizable part, so N is 2) would be:
T(2) = 0.4 + ( 1 - 0.4 ) / 2
     = 0.4 + 0.6 / 2
     = 0.4 + 0.3
     = 0.7
Making the same calculation with a parallelization factor of 5 instead of 2 would look like this:
T(5) = 0.4 + ( 1 - 0.4 ) / 5
     = 0.4 + 0.6 / 5
     = 0.4 + 0.12
     = 0.52

Amdahl's Law Illustrated

To better understand Amdahl's law, it helps to look at how the law is derived.
First of all, a program can be broken up into a non-parallelizable part B and a parallelizable part 1 - B. The two parts together make up the total time T(1). With a parallelization factor of 2, only the parallelizable part is cut in half; with a factor of 3 it shrinks to a third; the non-parallelizable part B keeps the same length in every case.

Optimizing Algorithms

From Amdahl's law it follows naturally that the parallelizable part can be executed faster by throwing hardware at it, that is, more threads or CPUs. The non-parallelizable part, however, can only be executed faster by optimizing the code. Thus, you can increase the speed and parallelizability of your program by optimizing the non-parallelizable part. You might even change the algorithm to have a smaller non-parallelizable part in general, by moving some of the work into the parallelizable part (if possible).

Optimizing the Sequential Part

If you optimize the sequential part of a program you can also use Amdahl's law to calculate the execution time of the program after the optimization. If the non-parallelizable part B is optimized by a factor of O, then Amdahl's law looks like this:
T(O,N) = B / O + (1 - B / O) / N
Remember, the non-parallelizable part of the program now takes B / O time, so the parallelizable part takes 1 - B / O time.
If B is 0.4, O is 2 and N is 5, then the calculation looks like this:
T(2,5) = 0.4 / 2 + (1 - 0.4 / 2) / 5
       = 0.2 + (1 - 0.4 / 2) / 5
       = 0.2 + (1 - 0.2) / 5
       = 0.2 + 0.8 / 5
       = 0.2 + 0.16
       = 0.36

Execution Time vs. Speedup

So far we have only used Amdahl's law to calculate the execution time of a program or algorithm after optimization or parallelization. We can also use Amdahl's law to calculate the speedup, meaning how much faster the new algorithm or program is than the old version.
If the time of the old version of the program or algorithm is T, then the speedup will be
Speedup = T / T(O,N)
We often set T to 1 just to calculate the execution time and speedup as a fraction of the old time. The equation then looks like this:
Speedup = 1 / T(O,N)
If we insert the Amdahl's law calculation instead of T(O,N), we get the following formula:
Speedup = 1 / ( B / O + (1 - B / O) / N )
With B = 0.4, O = 2 and N = 5, the calculation becomes:
Speedup = 1 / ( 0.4 / 2 + (1 - 0.4 / 2) / 5)
        = 1 / ( 0.2 + (1 - 0.4 / 2) / 5)
        = 1 / ( 0.2 + (1 - 0.2) / 5 )
        = 1 / ( 0.2 + 0.8 / 5 )
        = 1 / ( 0.2 + 0.16 )
        = 1 / 0.36
        = 2.77777 ...
That means that if you optimize the non-parallelizable (sequential) part by a factor of 2, and parallelize the parallelizable part by a factor of 5, the new optimized version of the program or algorithm will run at most 2.77777 times faster than the old version.
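The same computation as a small C sketch (the function name is illustrative; T(1) is normalized to 1 as in the text):

#include <stdio.h>

/* Speedup when the sequential part B is optimized by a factor O
   and the parallelizable part runs on N threads/CPUs. */
double amdahl_speedup(double B, double O, int N) {
    return 1.0 / (B / O + (1.0 - B / O) / N);
}

int main(void) {
    printf("speedup = %.5f\n", amdahl_speedup(0.4, 2.0, 5));   /* about 2.77778 */
    return 0;
}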

Measure, Don't Just Calculate

While Amdahl's law enables you to calculate the theoretic speedup of parallelization of an algorithm, don't rely too heavily on such calculations. In practice, many other factors may come into play when you optimize or parallelize an algorithm.
The speed of memory, CPU cache memory, disks, network cards etc. (if disk or network are used) may be a limiting factor too. If a new version of the algorithm is parallelized but leads to a lot more CPU cache misses, you may not get the full N times speedup from using N CPUs. The same is true if you end up saturating the memory bus, disk, network card or network connection.
My recommendation would be to use Amdahl's law to get an idea about where to optimize, but use a measurement to determine the real speedup of the optimization. Remember, sometimes a highly serialized sequential (single CPU) algorithm may outperform a parallel algorithm, simply because the sequential version has no coordination overhead (breaking down work and building the total again), and because a single CPU algorithm may conform better with how the underlying hardware works (CPU pipelines, CPU cache etc).

Gustafson's Law

In computer architecture, Gustafson's law (or Gustafson–Barsis's law) gives the theoretical speedup in latency of the execution of a task at fixed execution time that can be expected of a system whose resources are improved.

System Overview of Threading

Operating System - Multi-Threading

What is Thread?

A thread is a flow of execution through the process code, with its own program counter that keeps track of which instruction to execute next, system registers which hold its current working variables, and a stack which contains the execution history.
A thread shares with its peer threads information such as the code segment, data segment and open files. When one thread alters a code segment memory item, all other threads see that change.
A thread is also called a lightweight process. Threads provide a way to improve application performance through parallelism. Threads represent a software approach to improving operating system performance by reducing the overhead of full processes; in this sense, a thread is a lightweight equivalent of a classical process.
Each thread belongs to exactly one process and no thread can exist outside a process. Each thread represents a separate flow of control. Threads have been successfully used in implementing network servers and web servers. They also provide a suitable foundation for parallel execution of applications on shared-memory multiprocessors. The following figure shows the working of a single-threaded and a multithreaded process.
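In code form, a minimal sketch of a multithreaded process using the POSIX thread (Pthreads) library (the thread count and the worker function are illustrative):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each thread runs this function; all threads share the process's
   code segment, data segment and open files. */
void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);   /* create each thread */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);                          /* wait for each thread to finish */
    return 0;
}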

Difference between Process and Thread

1. Process: Process is heavy weight or resource intensive.
   Thread: Thread is light weight, taking lesser resources than a process.
2. Process: Process switching needs interaction with operating system.
   Thread: Thread switching does not need to interact with operating system.
3. Process: In multiple processing environments, each process executes the same code but has its own memory and file resources.
   Thread: All threads can share same set of open files, child processes.
4. Process: If one process is blocked, then no other process can execute until the first process is unblocked.
   Thread: While one thread is blocked and waiting, a second thread in the same task can run.
5. Process: Multiple processes without using threads use more resources.
   Thread: Multiple threaded processes use fewer resources.
6. Process: In multiple processes each process operates independently of the others.
   Thread: One thread can read, write or change another thread's data.

Advantages of Thread

  • Threads minimize the context switching time.
  • Use of threads provides concurrency within a process.
  • Efficient communication.
  • It is more economical to create and context switch threads.
  • Threads allow utilization of multiprocessor architectures to a greater scale and efficiency.

Types of Thread

Threads are implemented in following two ways −
·        User Level Threads − User managed threads.
·        Kernel Level Threads − Operating System managed threads acting on kernel, an operating system core.

User Level Threads

In this case, the thread management kernel is not aware of the existence of threads. The thread library contains code for creating and destroying threads, for passing message and data between threads, for scheduling thread execution and for saving and restoring thread contexts. The application starts with a single thread.

Advantages

  • Thread switching does not require Kernel mode privileges.
  • User level thread can run on any operating system.
  • Scheduling can be application specific in the user level thread.
  • User level threads are fast to create and manage.

Disadvantages

  • In a typical operating system, most system calls are blocking.
  • Multithreaded application cannot take advantage of multiprocessing.

Kernel Level Threads

In this case, thread management is done by the Kernel. There is no thread management code in the application area. Kernel threads are supported directly by the operating system. Any application can be programmed to be multithreaded. All of the threads within an application are supported within a single process.
The Kernel maintains context information for the process as a whole and for individual threads within the process. Scheduling by the Kernel is done on a thread basis. The Kernel performs thread creation, scheduling and management in Kernel space. Kernel threads are generally slower to create and manage than user threads.

Advantages

  • Kernel can simultaneously schedule multiple threads from the same process on multiple processors.
  • If one thread in a process is blocked, the Kernel can schedule another thread of the same process.
  • Kernel routines themselves can be multithreaded.

Disadvantages

  • Kernel threads are generally slower to create and manage than the user threads.
  • Transfer of control from one thread to another within the same process requires a mode switch to the Kernel.

Multithreading Models

Some operating systems provide a combined user-level thread and kernel-level thread facility. Solaris is a good example of this combined approach. In a combined system, multiple threads within the same application can run in parallel on multiple processors, and a blocking system call need not block the entire process. Multithreading models are of three types −
  • Many to many relationship.
  • Many to one relationship.
  • One to one relationship.

Many to Many Model

The many-to-many model multiplexes any number of user threads onto an equal or smaller number of kernel threads.
The following diagram shows the many-to-many threading model where 6 user-level threads are multiplexed onto 6 kernel-level threads. In this model, developers can create as many user threads as necessary, and the corresponding kernel threads can run in parallel on a multiprocessor machine. This model provides the best level of concurrency, and when a thread performs a blocking system call, the kernel can schedule another thread for execution.

Many to One Model

The many-to-one model maps many user-level threads to one kernel-level thread. Thread management is done in user space by the thread library. When a thread makes a blocking system call, the entire process is blocked. Only one thread can access the kernel at a time, so multiple threads are unable to run in parallel on multiprocessors.
If the user-level thread libraries are implemented on an operating system whose kernel does not support them, the many-to-one relationship mode is used.

One to One Model

There is a one-to-one relationship of user-level threads to kernel-level threads. This model provides more concurrency than the many-to-one model. It also allows another thread to run when a thread makes a blocking system call. It supports multiple threads executing in parallel on multiprocessors.
The disadvantage of this model is that creating a user thread requires creating the corresponding kernel thread. OS/2, Windows NT and Windows 2000 use the one-to-one relationship model.

Difference between User-Level & Kernel-Level Thread

1. User-Level Threads: User-level threads are faster to create and manage.
   Kernel-Level Threads: Kernel-level threads are slower to create and manage.
2. User-Level Threads: Implementation is by a thread library at the user level.
   Kernel-Level Threads: Operating system supports creation of Kernel threads.
3. User-Level Threads: User-level thread is generic and can run on any operating system.
   Kernel-Level Threads: Kernel-level thread is specific to the operating system.
4. User-Level Threads: Multi-threaded applications cannot take advantage of multiprocessing.
   Kernel-Level Threads: Kernel routines themselves can be multithreaded.

What are Threads?

A thread is an execution unit which consists of its own program counter, a stack, and a set of registers. Threads are also known as lightweight processes. Threads are a popular way to improve application performance through parallelism. The CPU switches rapidly back and forth among the threads, giving the illusion that the threads are running in parallel.
As each thread has its own independent resources for execution, multiple tasks can be executed in parallel by increasing the number of threads.

Types of Thread

There are two types of threads:
1.    User Threads
2.    Kernel Threads
User threads are implemented above the kernel, without kernel support. These are the threads that application programmers use in their programs.
Kernel threads are supported within the kernel of the OS itself. All modern OSs support kernel level threads, allowing the kernel to perform multiple simultaneous tasks and/or to service multiple kernel system calls simultaneously.

Multithreading Models

The user threads must be mapped to kernel threads, by one of the following strategies:
·         Many to One Model
·         One to One Model
·         Many to Many Model

Many to One Model

·         In the many to one model, many user-level threads are all mapped onto a single kernel thread.
·         Thread management is handled by the thread library in user space, which is efficient in nature.

One to One Model

·         The one to one model creates a separate kernel thread to handle each and every user thread.
·         Most implementations of this model place a limit on how many threads can be created.
·         Linux and Windows from 95 to XP implement the one-to-one model for threads.

Many to Many Model

·         The many to many model multiplexes any number of user threads onto an equal or smaller number of kernel threads, combining the best features of the one-to-one and many-to-one models.
·         Users can create any number of the threads.
·         Blocking the kernel system calls does not block the entire process.
·         Processes can be split across multiple processors.

What are Thread Libraries?

Thread libraries provide programmers with an API for the creation and management of threads.
Thread libraries may be implemented either in user space or in kernel space. The user space involves API functions implemented solely within the user space, with no kernel support. The kernel space involves system calls, and requires a kernel with thread library support.

Three Types of Thread Libraries

1.    POSIX Pthreads, which may be provided as either a user-level or kernel-level library, as an extension to the POSIX standard.
2.    Win32 threads, which are provided as a kernel-level library on Windows systems.
3.    Java threads: Since Java generally runs on a Java Virtual Machine, the implementation of threads is based upon whatever OS and hardware the JVM is running on, i.e. either Pthreads or Win32 threads depending on the system.

Benefits of Multithreading

1.    Responsiveness
2.    Resource sharing, hence allowing better utilization of resources.
3.    Economy. Creating and managing threads becomes easier.
4.    Scalability. One thread runs on one CPU. In Multithreaded processes, threads can be distributed over a series of processors to scale.
5.    Context Switching is smooth. Context switching refers to the procedure followed by CPU to change from one task to another.

Multithreading Issues

Below we have mentioned a few issues related to multithreading. As the old saying goes, all good things come at a price.

Thread Cancellation

Thread cancellation means terminating a thread before it has finished working. There can be two approaches for this: one is asynchronous cancellation, which terminates the target thread immediately; the other is deferred cancellation, which allows the target thread to periodically check whether it should be cancelled.
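A small POSIX sketch of deferred cancellation (the worker loop is illustrative; pthread_testcancel marks a point at which a pending cancel is acted on):

#include <pthread.h>
#include <unistd.h>

void *worker(void *arg) {
    int oldtype;
    /* Deferred cancellation: the thread is cancelled only at cancellation points. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &oldtype);
    for (;;) {
        /* ... do one unit of work ... */
        pthread_testcancel();            /* periodically check for a pending cancel */
    }
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);
    pthread_cancel(t);                   /* request cancellation of the target thread */
    pthread_join(t, NULL);
    return 0;
}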

Signal Handling

Signals are used in UNIX systems to notify a process that a particular event has occurred. Now, when a multithreaded process receives a signal, to which thread should it be delivered? It can be delivered to all threads, or to a single thread.

fork() System Call

fork() is a system call executed in the kernel through which a process creates a copy of itself. Now the problem in a multithreaded process is: if one thread forks, will the entire process be copied, or not?

Security Issues

Yes, there can be security issues because of extensive sharing of resources between multiple threads.
There are many other issues that you might face in a multithreaded process, but there are appropriate solutions available for them. Pointing out some issues here was just to show both sides of the coin.

UNIT-II:

Design Process & Task Analysis

Task Analysis

Task Analysis plays an important part in User Requirements Analysis.
Task analysis is the procedure of learning about the users, the abstract frameworks and patterns used in their workflows, and the chronological order of their interaction with the GUI. It analyzes the ways in which the user partitions the tasks and sequences them.

What is a TASK?

Human actions that contribute to a useful objective, aimed at the system, constitute a task. Task analysis defines the performance of users, not computers.

Hierarchical Task Analysis

Hierarchical Task Analysis is the procedure of decomposing tasks into subtasks that can be analyzed using a logical sequence for execution. This helps in achieving the goal in the best possible way.
"A hierarchy is an organization of elements that, according to prerequisite relationships, describes the path of experiences a learner must take to achieve any single behavior that appears higher in the hierarchy. (Seels & Glasgow, 1990, p. 94)".

Techniques for Analysis

·        Task decomposition − Splitting tasks into sub-tasks and in sequence.
·        Knowledge-based techniques − Any instructions that users need to know. ‘User’ is always the beginning point for a task.
·        Ethnography − Observation of users’ behavior in the use context.
·        Protocol analysis − Observation and documentation of actions of the user. This is achieved by authenticating the user’s thinking. The user is made to think aloud so that the user’s mental logic can be understood.

What is Functional Decomposition?

Functional decomposition breaks a complex business function down into its various functional relationships, showing how the original function was built up. It mainly focuses on how the overall functionality is developed and on the interaction between its various components.
Large or complex functionalities are more easily understood when broken down into pieces using functional decomposition.

When and How?

·        Functional decomposition is mostly used during the project analysis phase in order to produce functional decomposition diagrams as part of the functional requirements document.
·        Functional decomposition is done after meeting with business analysts and subject matter experts.
·        Decompose the first-level components with their functions and continue to decompose to lower levels until a sufficient level of detail is achieved.
·        Perform an end-to-end walk-through of the business operation and check each function to confirm that it is correct.

Example:

The following example illustrates functional decomposition:
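A minimal sketch in C of a hypothetical "process customer order" function decomposed into sub-functions (all names are illustrative, not taken from any real system):

#include <stdio.h>

/* Leaf functions: the lowest level of the decomposition. */
void check_availability(void) { printf("check product availability\n"); }
void verify_credit(void)      { printf("verify customer credit\n"); }
void reserve_stock(void)      { printf("reserve stock\n"); }
void schedule_shipment(void)  { printf("schedule shipment\n"); }
void generate_invoice(void)   { printf("generate invoice\n"); }

/* First-level components, each built from the leaf functions. */
void validate_order(void) { check_availability(); verify_credit(); }
void fulfil_order(void)   { reserve_stock(); schedule_shipment(); }
void bill_customer(void)  { generate_invoice(); }

/* The original complex business function, expressed in terms of its components. */
void process_order(void)  { validate_order(); fulfil_order(); bill_customer(); }

int main(void) {
    process_order();
    return 0;
}

Reading the call tree from process_order downward reproduces the functional decomposition level by level.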

What is Data Flow Testing?

Data flow testing is a family of test strategies based on selecting paths through the program's control flow in order to explore sequences of events related to the status of variables or data objects. Dataflow Testing focuses on the points at which variables receive values and the points at which these values are used.

Advantages of Data Flow Testing:

Data Flow testing helps us to pinpoint any of the following issues:
·        A variable that is declared but never used within the program.
·        A variable that is used but never declared.
·        A variable that is defined multiple times before it is used.
·        Deallocating a variable before it is used.
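A small C fragment showing two of the anomalies listed above (identifiers are illustrative; the "used but never declared" case would not even compile and is therefore only noted in a comment):

#include <stdio.h>

int compute(int a) {
    int unused = 10;          /* defined but never used within the function */
    int result;
    result = a * 2;           /* first definition of result ...                    */
    result = a * 3;           /* ... redefined before the first value is ever used */
    printf("%d\n", result);   /* use of result */
    /* printf("%d\n", undeclared);  -- used but never declared: a compile-time error */
    return result;
}

int main(void) {
    compute(5);
    return 0;
}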


Design and Analysis of Algorithms Tutorial
An algorithm is a sequence of steps to solve a problem. Design and analysis of algorithms is very important for designing algorithms to solve different types of problems in computer science and information technology. This tutorial introduces the fundamental concepts of design strategies and complexity analysis of algorithms, followed by problems on graph theory and sorting methods. It also includes the basic concepts of complexity theory.
Parallel Algorithm - Analysis
Analysis of an algorithm helps us determine whether the algorithm is useful or not. Generally, an algorithm is analyzed based on its execution time (Time Complexity) and the amount of space (Space Complexity) it requires.
Since sophisticated memory devices are available at reasonable cost, storage space is no longer an issue. Hence, space complexity is not given as much importance.
Parallel algorithms are designed to improve the computation speed of a computer. For analyzing a Parallel Algorithm, we normally consider the following parameters −
  • Time complexity (Execution Time),
  • Total number of processors used, and
  • Total cost.

Time Complexity

The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the execution time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated from the moment the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the same time, then the total execution time of the algorithm is measured from the moment the first processor starts its execution to the moment the last processor stops its execution.
Time complexity of an algorithm can be classified into three categories−
·        Worst-case complexity − When the amount of time required by an algorithm for a given input is maximum.
·        Average-case complexity − When the amount of time required by an algorithm for a given input is average.
·        Best-case complexity − When the amount of time required by an algorithm for a given input is minimum.

Asymptotic Analysis

The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large length of input is used to calculate the complexity function of the algorithm.
Note − Asymptotic is a condition where a line tends to meet a curve, but they do not intersect; the line and the curve are asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution times of an algorithm using upper and lower bounds on speed. For this, we use the following notations −
  • Big O notation
  • Omega notation
  • Theta notation

Big O notation

In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm’s execution time, that is, the longest amount of time the algorithm could take to complete its execution. The function −
f(n) = O(g(n))
iff there exist positive constants c and n0 such that f(n) ≤ c * g(n) for all n where n ≥ n0. For example, 3n + 2 = O(n), since 3n + 2 ≤ 4n for all n ≥ 2.

Omega notation

Omega notation is a method of representing the lower bound of an algorithm’s execution time. The function −
f(n) = Ω (g(n))
iff there exist positive constants c and n0 such that f(n) ≥ c * g(n) for all n where n ≥ n0.

Theta Notation

Theta notation is a method of representing both the lower bound and the upper bound of an algorithm’s execution time. The function −
f(n) = θ(g(n))
iff there exist positive constants c1, c2, and n0 such that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥ n0.

Speedup of an Algorithm

The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel algorithm.
Speedup = (Worst-case execution time of the fastest known sequential algorithm for a particular problem) / (Worst-case execution time of the parallel algorithm)

Number of Processors Used

The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain, and run the computers is calculated. The larger the number of processors used by an algorithm to solve a problem, the more costly the obtained result becomes.

Total Cost

Total cost of a parallel algorithm is the product of time complexity and the number of processors used in that particular algorithm.
Total Cost = Time complexity × Number of processors used
Therefore, the efficiency of a parallel algorithm is −
Efficiency = (Worst-case execution time of the sequential algorithm) / (Number of processors used × Worst-case execution time of the parallel algorithm)
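A quick worked illustration (the numbers are invented purely for the example): suppose the best sequential algorithm takes 1000 time units and a parallel algorithm takes 125 time units on 10 processors.
Total Cost = 125 × 10 = 1250
Speedup = 1000 / 125 = 8
Efficiency = 1000 / (10 × 125) = 0.8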
UNIT-V:
Solutions to Common Parallel Programming Problems
A deadlock happens in an operating system when two or more processes each need a resource held by another process in order to complete their execution.
For example, suppose process 1 holds resource 1 and needs to acquire resource 2, while process 2 holds resource 2 and needs to acquire resource 1. Process 1 and process 2 are then in deadlock, as each of them needs the other's resource to complete its execution but neither of them is willing to relinquish its resource.

Coffman Conditions

A deadlock can arise only if the four Coffman conditions hold simultaneously. Note that these conditions are not mutually exclusive.
The Coffman conditions are given as follows:
1.      Mutual Exclusion
There must be a resource that can only be held by one process at a time. For example, there may be a single instance of Resource 1, held by Process 1 only.
2.      Hold and Wait
A process can hold multiple resources and still request more resources that are held by other processes. For example, Process 2 may hold Resource 2 and Resource 3 while requesting Resource 1, which is held by Process 1.
3.      No Preemption
A resource cannot be preempted from a process by force; a process can only release a resource voluntarily. For example, Process 2 cannot forcibly take Resource 1 from Process 1; it is released only when Process 1 relinquishes it voluntarily after its execution is complete.
4.      Circular Wait
A process is waiting for the resource held by a second process, which is waiting for the resource held by a third process, and so on, until the last process is waiting for a resource held by the first process. This forms a circular chain. For example, Process 1 is allocated Resource 2 and requests Resource 1, while Process 2 is allocated Resource 1 and requests Resource 2; this forms a circular wait loop. The sketch below shows a small two-thread program in which all four conditions hold.
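The following minimal C++ sketch makes the four conditions concrete; the thread and mutex names are invented for this illustration and are not part of the notes. Each mutex plays the role of a single-instance resource, each thread holds one resource while waiting for the other, the locks cannot be preempted, and the waits form a cycle, so this program is deliberately deadlock-prone and may never terminate if run.

#include <mutex>
#include <thread>

std::mutex resource1, resource2;   // two single-instance resources (mutual exclusion)

void process1() {
    std::lock_guard<std::mutex> hold1(resource1);   // holds Resource 1
    std::lock_guard<std::mutex> want2(resource2);   // ...and waits for Resource 2 (hold and wait)
}

void process2() {
    std::lock_guard<std::mutex> hold2(resource2);   // holds Resource 2
    std::lock_guard<std::mutex> want1(resource1);   // ...and waits for Resource 1 (circular wait)
}

int main() {
    // Neither lock_guard releases its mutex until its function returns
    // (no preemption), so a bad interleaving of the two threads hangs forever.
    std::thread p1(process1), p2(process2);
    p1.join();
    p2.join();
}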

Deadlock Detection

A deadlock can be detected by a resource scheduler, since it keeps track of all the resources allocated to the different processes, typically by searching for a cycle in a wait-for graph (see the sketch after the list below). After a deadlock is detected, it can be resolved using the following methods:
1.      All the processes that are involved in the deadlock are terminated. This is not a good approach as all the progress made by the processes is destroyed.
2.      Resources can be preempted from some processes and given to others till the deadlock is resolved.
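The following minimal C++ sketch shows the cycle search mentioned above. The graph representation (an adjacency list in which an edge u → v means process u is waiting for a resource held by process v) and the function names are assumptions of this example, not part of the notes.

#include <cassert>
#include <vector>

// state: 0 = unvisited, 1 = currently on the DFS stack, 2 = finished
static bool dfsFindsCycle(int u, const std::vector<std::vector<int>>& waitsFor,
                          std::vector<int>& state) {
    state[u] = 1;
    for (int v : waitsFor[u]) {
        if (state[v] == 1) return true;                          // back edge: cycle found
        if (state[v] == 0 && dfsFindsCycle(v, waitsFor, state)) return true;
    }
    state[u] = 2;
    return false;
}

// Returns true if the wait-for graph contains a cycle, i.e. a deadlock exists.
bool deadlockDetected(const std::vector<std::vector<int>>& waitsFor) {
    std::vector<int> state(waitsFor.size(), 0);
    for (int u = 0; u < static_cast<int>(waitsFor.size()); ++u)
        if (state[u] == 0 && dfsFindsCycle(u, waitsFor, state))
            return true;
    return false;
}

int main() {
    assert(deadlockDetected({{1}, {0}}));   // process 0 waits for 1, 1 waits for 0: deadlock
    assert(!deadlockDetected({{1}, {}}));   // process 0 waits for 1, which waits for nothing
}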

Deadlock Prevention

It is very important to prevent a deadlock before it can occur. So, the system checks each transaction before it is executed to make sure it does not lead to deadlock. If there is even a slight chance that a transaction may lead to deadlock in the future, it is never allowed to execute.

Deadlock Avoidance

It is better to avoid a deadlock than to take measures after the deadlock has occurred. The wait-for graph can be used for deadlock avoidance, but this is only practical for smaller databases, as the graph can become quite complex for larger ones.
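For multithreaded code (as opposed to database transactions), one widely used way to avoid the deadlock shown in the earlier sketch is to acquire all the locks a thread needs in a single step. The following is only a hedged illustration built on the same invented mutex names: std::scoped_lock (equivalently std::lock) acquires several mutexes together using a deadlock-avoidance algorithm, so the textual order of its arguments no longer matters.

#include <mutex>
#include <thread>

std::mutex resource1, resource2;

void process1() {
    std::scoped_lock both(resource1, resource2);   // both locks acquired together
    // ... use Resource 1 and Resource 2 ...
}

void process2() {
    std::scoped_lock both(resource2, resource1);   // different order, still deadlock-free
    // ... use Resource 2 and Resource 1 ...
}

int main() {
    std::thread p1(process1), p2(process2);
    p1.join();
    p2.join();
}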

Synchronization

There are two types of synchronization: data synchronization and process synchronization:
  • Process Synchronization: The simultaneous execution of multiple threads or processes to reach a handshake such that they commit a certain sequence of actions. Lock, mutex, and semaphores are examples of process synchronization.
  • Data Synchronization: Involves the maintenance of data to keep multiple copies of data coherent with each other, or to maintain data integrity. For example, database replication is used to keep multiple copies of data synchronized with database servers that store data in different locations.
Synchronization underpins the execution of multiple threads running asynchronously in a multithreaded application. It provides the means to share resources such as file handles, network connections and memory by coordinating threads and processes so as to avoid data corruption.

The term is used in the context of multithreaded applications, where access to resources shared across multiple threads has to be controlled; uncontrolled access can lead to unpredictable and undesirable outcomes. The .NET framework provides synchronization primitives that keep multi-threaded applications controlled and free of race conditions.

Synchronization is designed to be cooperative: every thread must follow the synchronization mechanism before accessing protected resources for the results to be consistent. Locking, signaling, lightweight synchronization types, SpinWait and interlocked operations are the mechanisms related to synchronization in .NET.

Process Synchronization

Process synchronization means sharing system resources among processes in such a way that concurrent access to shared data is handled and the chance of inconsistent data is minimized. Maintaining data consistency demands mechanisms that ensure the synchronized execution of cooperating processes.
Process synchronization was introduced to handle problems that arise when multiple processes execute concurrently. Some of these problems are discussed below.

Critical Section Problem

A Critical Section is a code segment that accesses shared variables and has to be executed as an atomic action. It means that in a group of cooperating processes, at a given point of time, only one process must be executing its critical section. If any other process also wants to execute its critical section, it must wait until the first one finishes.

Solution to Critical Section Problem

A solution to the critical section problem must satisfy the following three conditions:

1. Mutual Exclusion

Out of a group of cooperating processes, only one process can be in its critical section at a given point of time.

2. Progress

If no process is in its critical section and one or more processes want to execute their critical sections, then one of those processes must be allowed to enter its critical section.



3. Bounded Waiting

After a process makes a request to enter its critical section, there is a bound on how many other processes may enter their critical sections before this process's request is granted. Once that bound is reached, the system must grant the process permission to enter its critical section. A classical two-process software solution that satisfies all three conditions is sketched below.
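The classical software solution for two processes is Peterson's algorithm. The notes do not give code for it, so the following C++ sketch is only an illustration; it uses sequentially consistent atomics for the flag and turn variables, and the names worker and shared_counter are invented for the example.

#include <atomic>
#include <iostream>
#include <thread>

// Peterson's algorithm for two threads, with ids 0 and 1.
std::atomic<bool> flag[2] = {{false}, {false}};  // flag[i]: thread i wants to enter
std::atomic<int> turn{0};                        // whose turn it is to wait
int shared_counter = 0;                          // the shared data accessed in the critical section

void enter(int id) {
    int other = 1 - id;
    flag[id].store(true);          // announce intent to enter
    turn.store(other);             // give priority to the other thread
    while (flag[other].load() && turn.load() == other) {
        // busy-wait: the other thread is in (or entering) its critical section
    }
}

void leave(int id) {
    flag[id].store(false);         // allow the other thread to proceed
}

void worker(int id) {
    for (int i = 0; i < 100000; ++i) {
        enter(id);
        ++shared_counter;          // critical section
        leave(id);
    }
}

int main() {
    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();
    std::cout << "counter = " << shared_counter << "\n";  // expected 200000
}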

Synchronization Hardware

Many systems provide hardware support for critical-section code. In a single-processor environment, the critical section problem could be solved easily if we could prevent interrupts from occurring while a shared variable or resource is being modified.
In this manner, we could be sure that the current sequence of instructions would be allowed to execute in order without pre-emption. Unfortunately, this solution is not feasible in a multiprocessor environment.
Disabling interrupts in a multiprocessor environment can be time-consuming, because the message has to be passed to all the processors. This message-transmission lag delays the entry of threads into their critical sections and decreases system efficiency.
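On multiprocessors, the usual hardware support is instead an atomic read-modify-write instruction such as test-and-set or compare-and-swap. As a hedged sketch not taken from the notes, C++'s std::atomic_flag::test_and_set exposes such an instruction and can be used to build a simple spinlock:

#include <atomic>

// A minimal test-and-set spinlock built on std::atomic_flag.
std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;   // initially clear (unlocked)

void spin_lock() {
    // test_and_set atomically sets the flag and returns its previous value;
    // if it was already set, another thread holds the lock, so keep spinning.
    while (lock_flag.test_and_set(std::memory_order_acquire)) { /* busy-wait */ }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);  // release the lock
}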

Mutex Locks

As the synchronization hardware solution is not easy to implement for everyone, a strict software approach called Mutex Locks was introduced. In this approach, in the entry section of code, a LOCK is acquired over the critical resources modified and used inside critical section, and in the exit section that LOCK is released.
Because the resource is locked while a process executes its critical section, no other process can access it.
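A minimal C++ sketch of this pattern follows; the counter and thread names are invented for the example and are not taken from the notes. std::lock_guard acquires the lock in the entry section and releases it automatically in the exit section, when it goes out of scope.

#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;              // the LOCK protecting the critical section
long counter = 0;          // the shared resource

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> guard(m);  // entry section: acquire the lock
        ++counter;                             // critical section
    }                                          // exit section: lock released automatically
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::cout << counter << "\n";   // always 200000 with the lock in place
}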
