Parallel Computer Architecture - Introduction
In the last 50 years, there has been huge developments in the
performance and capability of a computer system. This has been possible with
the help of Very Large Scale Integration (VLSI) technology. VLSI technology
allows a large number of components to be accommodated on a single chip and
clock rates to increase. Therefore, more operations can be performed at a time,
in parallel.
Parallel processing is also associated with data locality and data
communication. Parallel Computer Architecture is the method of
organizing all the resources to maximize the performance and the
programmability within the limits given by technology and the cost at any
instance of time.
Why Parallel
Architecture?
Parallel computer architecture adds a new dimension in the
development of computer system by using more and more number of processors. In
principle, performance achieved by utilizing large number of processors is
higher than the performance of a single processor at a given point of time.
Application Trends
With the advancement of hardware capacity, the demand for a
well-performing application also increased, which in turn placed a demand on
the development of the computer architecture.
Before the microprocessor era, high-performing computer system was
obtained by exotic circuit technology and machine organization, which made them
expensive. Now, highly performing computer system is obtained by using multiple
processors, and most important and demanding applications are written as
parallel programs. Thus, for higher performance both parallel architectures and
parallel applications are needed to be developed.
To increase the performance of an application Speedup is the key
factor to be considered. Speedup on p processors is defined as
−
Speedup(p processors)≡Performance(p processors)Performance(1 processor)Speedup(p processors)≡Performance(p processors)Performance(1 processor)
For the single fixed problem,
performance of a computer system=1Time needed to complete the problemperformance of a computer system=1Time needed to complete the problem
Speedup fixed problem(p processors)=Time(1 processor)Time(p processor)Speedup fixed problem(p processors)=Time(1 processor)Time(p processor)
Scientific and
Engineering Computing
Parallel architecture has become indispensable in scientific
computing (like physics, chemistry, biology, astronomy, etc.) and engineering
applications (like reservoir modeling, airflow analysis, combustion efficiency,
etc.). In almost all applications, there is a huge demand for visualization of
computational output resulting in the demand for development of parallel
computing to increase the computational speed.
Commercial Computing
In commercial computing (like video, graphics, databases, OLTP,
etc.) also high speed computers are needed to process huge amount of data
within a specified time. Desktop uses multithreaded programs that are almost
like the parallel programs. This in turn demands to develop parallel
architecture.
Technology Trends
With the development of technology and architecture, there is a
strong demand for the development of high-performing applications. Experiments
show that parallel computers can work much faster than utmost developed single
processor. Moreover, parallel computers can be developed within the limit of
technology and the cost.
The primary technology used here is VLSI technology. Therefore,
nowadays more and more transistors, gates and circuits can be fitted in the
same area. With the reduction of the basic VLSI feature size, clock rate also
improves in proportion to it, while the number of transistors grows as the
square. The use of many transistors at once (parallelism) can be expected to
perform much better than by increasing the clock rate
Technology trends suggest that the basic single chip building
block will give increasingly large capacity. Therefore, the possibility of
placing multiple processors on a single chip increases.
Architectural Trends
Development in technology decides what is feasible; architecture
converts the potential of the technology into performance and capability. Parallelism and locality are
two methods where larger volumes of resources and more transistors enhance the
performance. However, these two methods compete for the same resources. When
multiple operations are executed in parallel, the number of cycles needed to
execute the program is reduced.
However, resources are needed to support each of the concurrent
activities. Resources are also needed to allocate local storage. The best
performance is achieved by an intermediate action plan that uses resources to
utilize a degree of parallelism and a degree of locality.
Generally, the history of computer architecture has been divided
into four generations having following basic technologies −
- Vacuum tubes
- Transistors
- Integrated circuits
- VLSI
Till 1985, the duration was dominated by the growth in bit-level
parallelism. 4-bit microprocessors followed by 8-bit, 16-bit, and so on. To
reduce the number of cycles needed to perform a full 32-bit operation, the
width of the data path was doubled. Later on, 64-bit operations were introduced.
The growth in instruction-level-parallelism dominated
the mid-80s to mid-90s. The RISC approach showed that it was simple to pipeline
the steps of instruction processing so that on an average an instruction is
executed in almost every cycle. Growth in compiler technology has made
instruction pipelines more productive.
In mid-80s, microprocessor-based computers consisted of
- An integer processing unit
- A floating-point unit
- A cache controller
- SRAMs for the cache data
- Tag storage
As chip capacity increased, all these components were merged into
a single chip. Thus, a single chip consisted of separate hardware for integer
arithmetic, floating point operations, memory operations and branch operations.
Other than pipelining individual instructions, it fetches multiple instructions
at a time and sends them in parallel to different functional units whenever
possible. This type of instruction level parallelism is called superscalar
execution.
Convergence
of Parallel Architectures
Parallel machines have been developed with several distinct
architecture. In this section, we will discuss different parallel computer
architecture and the nature of their convergence.
Communication
Architecture
Parallel architecture enhances the conventional concepts of
computer architecture with communication architecture. Computer architecture
defines critical abstractions (like user-system boundary and hardware-software
boundary) and organizational structure, whereas communication architecture
defines the basic communication and synchronization operations. It also
addresses the organizational structure.
Programming model is the top layer. Applications are written in
programming model. Parallel programming models include −
- Shared address space
- Message passing
- Data parallel programming
Shared address programming is just like using a bulletin board, where one
can communicate with one or many individuals by posting information at a
particular location, which is shared by all other individuals. Individual
activity is coordinated by noting who is doing what task.
Message passing is like a telephone call or letters where a specific
receiver receives information from a specific sender.
Data parallel programming is an organized form of cooperation. Here,
several individuals perform an action on separate elements of a data set
concurrently and share information globally.
Shared Memory
Shared memory multiprocessors are one of the most important
classes of parallel machines. It gives better throughput on multiprogramming
workloads and supports parallel programs.
In this case, all the computer systems allow a processor and a set
of I/O controller to access a collection of memory modules by some hardware
interconnection. The memory capacity is increased by adding memory modules and
I/O capacity is increased by adding devices to I/O controller or by adding
additional I/O controller. Processing capacity can be increased by waiting for
a faster processor to be available or by adding more processors.
All the resources are organized around a central memory bus.
Through the bus access mechanism, any processor can access any physical address
in the system. As all the processors are equidistant from all the memory
locations, the access time or latency of all the processors is same on a memory
location. This is called symmetric multiprocessor.
Message-Passing
Architecture
Message passing architecture is also an important class of
parallel machines. It provides communication among processors as explicit I/O
operations. In this case, the communication is combined at the I/O level,
instead of the memory system.
In message passing architecture, user communication executed by
using operating system or library calls that perform many lower level actions,
which includes the actual communication operation. As a result, there is a
distance between the programming model and the communication operations at the
physical hardware level.
Send and receive is the most common user level
communication operations in message passing system. Send specifies a local data
buffer (which is to be transmitted) and a receiving remote processor. Receive
specifies a sending process and a local data buffer in which the transmitted
data will be placed. In send operation, an identifier or
a tag is attached to the message and the receiving operation specifies
the matching rule like a specific tag from a specific processor or any tag from
any processor.
The combination of a send and a matching receive completes a
memory-to-memory copy. Each end specifies its local data address and a pair
wise synchronization event.
Convergence
Development of the hardware and software has faded the clear
boundary between the shared memory and message passing camps. Message passing
and a shared address space represents two distinct programming models; each
gives a transparent paradigm for sharing, synchronization and communication.
However, the basic machine structures have converged towards a common
organization.
Data Parallel Processing
Another important class of parallel machine is variously called −
processor arrays, data parallel architecture and
single-instruction-multiple-data machines. The main feature of the programming
model is that operations can be executed in parallel on each element of a large
regular data structure (like array or matrix).
Data parallel programming languages are usually enforced by
viewing the local address space of a group of processes, one per processor,
forming an explicit global space. As all the processors communicate together
and there is a global view of all the operations, so either a shared address
space or message passing can be used.
Fundamental Design
Issues
Development of programming model only cannot increase the
efficiency of the computer nor can the development of hardware alone do it.
However, development in computer architecture can make the difference in the
performance of the computer. We can understand the design problem by focusing
on how programs use a machine and which basic technologies are provided.
In this section, we will discuss about the communication abstraction
and the basic requirements of the programming model.
Communication Abstraction
Communication abstraction is the main interface between the
programming model and the system implementation. It is like the instruction set
that provides a platform so that the same program can run correctly on many
implementations. Operations at this level must be simple.
Communication abstraction is like a contract between the hardware
and software, which allows each other the flexibility to improve without
affecting the work.
Programming Model Requirements
A parallel program has one or more threads operating on data. A
parallel programming model defines what data the threads can name,
which operations can be performed on the named data, and which
order is followed by the operations.
To confirm that the dependencies between the programs are
enforced, a parallel program must coordinate the activity of its threads.
Parallel
Computer Architecture - Models
Parallel processing has been developed as an effective technology
in modern computers to meet the demand for higher performance, lower cost and
accurate results in real-life applications. Concurrent events are common in
today’s computers due to the practice of multiprogramming, multiprocessing, or
multicomputing.
Modern computers have powerful and extensive software packages. To
analyze the development of the performance of computers, first we have to
understand the basic development of hardware and software.
·
Computer Development Milestones −
There is two major stages of development of computer - mechanical or electromechanical parts.
Modern computers evolved after the introduction of electronic components. High
mobility electrons in electronic computers replaced the operational parts in
mechanical computers. For information transmission, electric signal which
travels almost at the speed of a light replaced mechanical gears or levers.
·
Elements of Modern computers −
A modern computer system consists of computer hardware, instruction sets,
application programs, system software and user interface.
The computing problems are categorized as numerical computing,
logical reasoning, and transaction processing. Some complex problems may need
the combination of all the three processing modes.
·
Evolution of Computer Architecture −
In last four decades, computer architecture has gone through revolutionary
changes. We started with Von Neumann architecture and now we have
multicomputers and multiprocessors.
·
Performance of a computer system −
Performance of a computer system depends both on machine capability and program
behavior. Machine capability can be improved with better hardware technology,
advanced architectural features and efficient resource management. Program
behavior is unpredictable as it is dependent on application and run-time
conditions
Multiprocessors and
Multicomputers
In this section, we will discuss two types of parallel computers −
- Multiprocessors
- Multicomputers
Shared-Memory
Multicomputers
Three most common shared memory multiprocessors models are −
Uniform Memory Access (UMA)
In this model, all the processors share the physical memory
uniformly. All the processors have equal access time to all the memory words.
Each processor may have a private cache memory. Same rule is followed for
peripheral devices.
When all the processors have equal access to all the peripheral
devices, the system is called a symmetric multiprocessor. When only
one or a few processors can access the peripheral devices, the system is called
an asymmetric multiprocessor.
Non-uniform Memory Access (NUMA)
In NUMA multiprocessor model, the access time varies with the
location of the memory word. Here, the shared memory is physically distributed
among all the processors, called local memories. The collection of all local
memories forms a global address space which can be accessed by all the
processors.
Cache Only Memory Architecture (COMA)
The COMA model is a special case of the NUMA model. Here, all the
distributed main memories are converted to cache memories.
·
Distributed - Memory Multicomputers −
A distributed memory multicomputer system consists of multiple computers, known
as nodes, inter-connected by message passing network. Each node acts as an
autonomous computer having a processor, a local memory and sometimes I/O
devices. In this case, all local memories are private and are accessible only
to the local processors. This is why, the traditional machines are called no-remote-memory-access
(NORMA)machines.
Multivector and SIMD
Computers
In this section, we will discuss supercomputers and parallel
processors for vector processing and data parallelism.
Vector Supercomputers
In a vector computer, a vector processor is attached to the scalar
processor as an optional feature. The host computer first loads program and
data to the main memory. Then the scalar control unit decodes all the
instructions. If the decoded instructions are scalar operations or program
operations, the scalar processor executes those operations using scalar
functional pipelines.
On the other hand, if the decoded instructions are vector
operations then the instructions will be sent to vector control unit.
SIMD Supercomputers
In SIMD computers, ‘N’ number of processors are connected to a
control unit and all the processors have their individual memory units. All the
processors are connected by an interconnection network.
PRAM and VLSI Models
The ideal model gives a suitable framework for developing parallel
algorithms without considering the physical constraints or implementation
details.
The models can be enforced to obtain theoretical performance
bounds on parallel computers or to evaluate VLSI complexity on chip area and
operational time before the chip is fabricated.
Parallel Random-Access Machines
Sheperdson and Sturgis (1963) modeled the conventional Uniprocessor
computers as random-access-machines (RAM). Fortune and Wyllie (1978) developed
a parallel random-access-machine (PRAM) model for modeling an idealized
parallel computer with zero memory access overhead and synchronization.
An N-processor PRAM has a shared memory unit. This shared memory
can be centralized or distributed among the processors. These processors
operate on a synchronized read-memory, write-memory and compute cycle. So,
these models specify how concurrent read and write operations are handled.
Following are the possible memory update operations −
·
Exclusive read (ER) −
In this method, in each cycle only one processor is allowed to read from any
memory location.
·
Exclusive write (EW) −
In this method, at least one processor is allowed to write into a memory
location at a time.
·
Concurrent read (CR) −
It allows multiple processors to read the same information from the same memory
location in the same cycle.
·
Concurrent write (CW) −
It allows simultaneous write operations to the same memory location. To avoid
write conflict some policies are set up.
VLSI Complexity Model
Parallel computers use VLSI chips to fabricate processor arrays,
memory arrays and large-scale switching networks.
Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI
chip is proportional to the amount of storage (memory) space available in that
chip.
We can calculate the space complexity of an algorithm by the chip
area (A) of the VLSI chip implementation of that algorithm. If T is the time
(latency) needed to execute the algorithm, then A.T gives an upper bound on the
total number of bits processed through the chip (or I/O). For certain
computing, there exists a lower bound, f(s), such that
A.T2 >= O (f(s))
Where A=chip area and T=time
Architectural
Development Tracks
The evolution of parallel computers I spread along the following
tracks −
- Multiple Processor Tracks
- Multiprocessor track
- Multicomputer track
- Multiple data track
- Vector track
- SIMD track
- Multiple threads track
- Multithreaded track
- Dataflow track
In multiple processor track, it is assumed that
different threads execute concurrently on different processors and communicate
through shared memory (multiprocessor track) or message passing (multicomputer
track) system.
In multiple data track, it is assumed that the same
code is executed on the massive amount of data. It is done by executing same
instructions on a sequence of data elements (vector track) or through the
execution of same sequence of instructions on a similar set of data (SIMD
track).
In multiple threads track, it is assumed that the
interleaved execution of various threads on the same processor to hide
synchronization delays among threads executing on different processors. Thread
interleaving can be coarse (multithreaded track) or fine (dataflow track).
…………………………………………………………………………..Hyper-threading is Intel's simultaneous multi-threading design.
It allows a singleprocessor to manage data as if it were two processors by
handling data instructions in parallel rather than one at a
time. Hyper-Threading Technology is designed to improve system
performance and efficiency
………………..
Amdahl's Law Defined
A program (or algorithm) which can be parallelized can be split up
into two parts:
- A
part which cannot be parallelized
- A
part which can be parallelized
Imagine a program that processes files from disk. A small part of
that program may scan the directory and create a list of files internally in
memory. After that, each file is passed to a separate thread for processing.
The part that scans the directory and creates the file list cannot be
parallelized, but processing the files can.
The total time taken to execute the program in serial (not in
parallel) is called T. The time T includes the time of both the non-parallelizable
and parallelizable parts. The non-parallelizable part is called B. The
parallizable part is referred to as T - B. The following list sums up these
definitions:
- T
= Total time of serial execution
- B
= Total time of non-parallizable part
- T
- B = Total time of parallizable part (when executed serially, not in
parallel)
From this follows that:
T = B + (T-B)
It may look a bit strange at first that the parallelizable part of
the program does not have its own symbol in the equation. However, since the parallelizable
part of the equation can be expressed using the total time T and B (the
non-parallelizable part), the equation has actually been reduced conceptually,
meaning that it contains less different variables in this form.
It is the parallelizable part,
T - B
, that can be sped up by
executing it in parallel. How much it can be sped up depends on how many
threads or CPUs you apply to execute it. The number of threads or CPUs is
called N. The fastest the the parallelizable part can be executed is thus:(T - B) / N
Another way to write this is:
(1/N) * (T - B)
Wikipedia uses this version in case you read about Amdahl's law
there.
According to Amdahl's law, the total execution time of the program
when the parallelizable part is executed using N threads or CPUs is thus:
T(N) = B + (T - B) / N
T(N) means total execution with with a parallelization factor of
N. Thus, T could be written T(1) , meaning the total execution time with a
parallelization factor of 1. Using T(1) instead of T, Amdahl's law looks like
this:
T(N) = B + ( T(1) - B ) / N
It still means the same though.
A Calculation Example
To better understand Amdahl's law, let's go through a calculation
example. The total time to execute a program is set to 1. The
non-parallelizable part of the programs is 40% which out of a total time of 1
is equal to 0.4 . The parallelizable part is thus equal to
1 - 0.4 = 0.6
.
The execution time of the program with a parallelization factor of
2 (2 threads or CPUs executing the parallelizable part, so N is 2) would be:
T(2) = 0.4 + ( 1 - 0.4 ) / 2
= 0.4 + 0.6 / 2
= 0.4 + 0.3
= 0.7
Making the same calculation with a parallelization factor of 5
instead of 2 would look like this:
T(5) = 0.4 + ( 1 - 0.4 ) / 5
= 0.4 + 0.6 / 5
= 0.4 + 0.12
= 0.52
Amdahl's Law Illustrated
To better understand Amdahl's law I will try to illustrate how the
law is derived.
First of all, a program can be broken up into a non-parallelizable
part B, and a parallelizable part 1-B, as illustrated by this diagram:
The line with the delimiters on at the top is the total time T(1).
Here you see the execution time with a parallelization factor of
2:
Here you see the execution time with a parallelization factor of
3:
Optimizing Algorithms
From Amdahl's law it follows naturally, that the parallelizable
part can be executed faster by throwing hardware at it. More threads / CPUs.
The non-parallelizable part, however, can only be executed faster by optimizing
the code. Thus, you can increase the speed and parallelizability of your
program by optimizing the non-parallelizable part. You might even change the
algorithm to have a smaller non-parallelizable part in general, by moving some
of the work into the parallelizable part (if possible).
Optimizing the Sequential
Part
If you optimize the sequential part of a program you can also use
Amdahl's law to calculate the execution time of the program after the
optimization. If the non-parallelizable part B is optimized by a factor of O,
then Amdahl's law looks like this:
T(O,N) = B / O + (1 - B / O) / N
Remember, the non-parallelizable part of the program now
takes
B / O
time,
so the parallelizable part takes 1 - B / O
time.
If B is 0.4, O is 2 and N is 5, then the calculation looks like
this:
T(2,5) = 0.4 / 2 + (1 - 0.4 / 2) / 5
= 0.2 + (1 - 0.4 / 2) / 5
= 0.2 + (1 - 0.2) / 5
= 0.2 + 0.8 / 5
= 0.2 + 0.16
= 0.36
Execution Time vs. Speedup
So far we have only used Amdahl's law to calculate the execution
time of a program or algorithm after optimization or parallelization. We can
also use Amdahl's law to calculate the speedup, meaning how much
faster the new algorithm or program is than the old version.
If the time of the old version of the program or algorithm is T,
then the speedup will be
Speedup = T / T(O,N)
We often set T to 1 just to calculate the execution time and
speedup as a fraction of the old time. The equation then looks like this:
Speedup = 1 / T(O,N)
If we insert the Amdahl's law calculation instead of T(O,N), we
get the following formula:
Speedup = 1 / ( B / O + (1 - B / O) / N )
With B = 0.4, O = 2 and N = 5, the calculation becomes:
Speedup = 1 / ( 0.4 / 2 + (1 - 0.4 / 2) / 5)
= 1 / ( 0.2 + (1 - 0.4 / 2) / 5)
= 1 / ( 0.2 + (1 - 0.2) / 5 )
= 1 / ( 0.2 + 0.8 / 5 )
= 1 / ( 0.2 + 0.16 )
= 1 / 0.36
= 2.77777 ...
That means, that if you optimize the non-parallelizable
(sequential) part by a factor of 2, and paralellize the parallelizable part by
a factor of 5, the new optimized version of the program or algorithm would run
a maximum of 2.77777 times faster than the old version.
Measure, Don't Just Calculate
While Amdahl's law enables you to calculate the theoretic speedup
of parallelization of an algorithm, don't rely too heavily on such
calculations. In practice, many other factors may come into play when you
optimize or parallelize an algorithm.
The speed of memory, CPU cache memory, disks, network cards etc.
(if disk or network are used) may be a limiting factor too. If a new version of
the algorithm is parallelized, but leads to a lot more CPU cache misses, you
may not even get the desired x N speedup of using x N CPUs. The same is true if
you end up saturating the memory bus, disk or network card or network
connection.
My recommendation would be to use Amdahl's law to get an idea
about where to optimize, but use a measurement to determine the real speedup of
the optimization. Remember, sometimes a highly serialized sequential (single
CPU) algorithm may outperform a parallel algorithm, simply because the
sequential version has no coordination overhead (breaking down work and
building the total again), and because a single CPU algorithm may conform
better with how the underlying hardware works (CPU pipelines, CPU cache etc).
computer
architecture, Gustafson's law (or Gustafson–Barsis's law)
gives the theoretical speedup in latency of the execution of a task at fixed
execution time that can be expected of a system whose resources are improved.
System Overview of
Threading
Operating
System - Multi-Threading
What is Thread?
A thread is a flow of execution through the process code, with its
own program counter that keeps track of which instruction to execute next,
system registers which hold its current working variables, and a stack which
contains the execution history.
A thread shares with its peer threads few information like code
segment, data segment and open files. When one thread alters a code segment
memory item, all other threads see that.
A thread is also called a lightweight process. Threads
provide a way to improve application performance through parallelism. Threads
represent a software approach to improving performance of operating system by
reducing the overhead thread is equivalent to a classical process.
Each thread belongs to exactly one process and no thread can exist
outside a process. Each thread represents a separate flow of control. Threads
have been successfully used in implementing network servers and web server.
They also provide a suitable foundation for parallel execution of applications
on shared memory multiprocessors. The following figure shows the working of a
single-threaded and a multithreaded process.
Difference between
Process and Thread
S.N.
|
Process
|
Thread
|
1
|
Process is heavy weight or resource
intensive.
|
Thread is light weight, taking lesser
resources than a process.
|
2
|
Process switching needs interaction
with operating system.
|
Thread switching does not need to
interact with operating system.
|
3
|
In multiple processing environments,
each process executes the same code but has its own memory and file
resources.
|
All threads can share same set of open
files, child processes.
|
4
|
If one process is blocked, then no
other process can execute until the first process is unblocked.
|
While one thread is blocked and
waiting, a second thread in the same task can run.
|
5
|
Multiple processes without using
threads use more resources.
|
Multiple threaded processes use fewer
resources.
|
6
|
In multiple processes each process
operates independently of the others.
|
One thread can read, write or change
another thread's data.
|
Advantages of Thread
- Threads minimize the context
switching time.
- Use of threads provides
concurrency within a process.
- Efficient communication.
- It is more economical to create
and context switch threads.
- Threads allow utilization of
multiprocessor architectures to a greater scale and efficiency.
Types of Thread
Threads are implemented in following two ways −
·
User Level Threads −
User managed threads.
·
Kernel Level Threads −
Operating System managed threads acting on kernel, an operating system core.
User Level Threads
In this case, the thread management kernel is not aware of the
existence of threads. The thread library contains code for creating and
destroying threads, for passing message and data between threads, for
scheduling thread execution and for saving and restoring thread contexts. The
application starts with a single thread.
Advantages
- Thread switching does not
require Kernel mode privileges.
- User level thread can run on
any operating system.
- Scheduling can be application
specific in the user level thread.
- User level threads are fast to
create and manage.
Disadvantages
- In a typical operating system,
most system calls are blocking.
- Multithreaded application
cannot take advantage of multiprocessing.
Kernel Level Threads
In this case, thread management is done by the Kernel. There is no
thread management code in the application area. Kernel threads are supported
directly by the operating system. Any application can be programmed to be
multithreaded. All of the threads within an application are supported within a
single process.
The Kernel maintains context information for the process as a
whole and for individuals threads within the process. Scheduling by the Kernel
is done on a thread basis. The Kernel performs thread creation, scheduling and
management in Kernel space. Kernel threads are generally slower to create and
manage than the user threads.
Advantages
- Kernel can simultaneously
schedule multiple threads from the same process on multiple processes.
- If one thread in a process is
blocked, the Kernel can schedule another thread of the same process.
- Kernel routines themselves can
be multithreaded.
Disadvantages
- Kernel threads are generally
slower to create and manage than the user threads.
- Transfer of control from one
thread to another within the same process requires a mode switch to the
Kernel.
Multithreading Models
Some operating system provide a combined user level thread and
Kernel level thread facility. Solaris is a good example of this combined
approach. In a combined system, multiple threads within the same application
can run in parallel on multiple processors and a blocking system call need not
block the entire process. Multithreading models are three types
- Many to many relationship.
- Many to one relationship.
- One to one relationship.
Many to Many Model
The many-to-many model multiplexes any number of user threads onto
an equal or smaller number of kernel threads.
The following diagram shows the many-to-many threading model where
6 user level threads are multiplexing with 6 kernel level threads. In this
model, developers can create as many user threads as necessary and the
corresponding Kernel threads can run in parallel on a multiprocessor machine.
This model provides the best accuracy on concurrency and when a thread performs
a blocking system call, the kernel can schedule another thread for execution.
Many to One Model
Many-to-one model maps many user level threads to one Kernel-level
thread. Thread management is done in user space by the thread library. When
thread makes a blocking system call, the entire process will be blocked. Only
one thread can access the Kernel at a time, so multiple threads are unable to
run in parallel on multiprocessors.
If the user-level thread libraries are implemented in the
operating system in such a way that the system does not support them, then the
Kernel threads use the many-to-one relationship modes.
One to One Model
There is one-to-one relationship of user-level thread to the
kernel-level thread. This model provides more concurrency than the many-to-one
model. It also allows another thread to run when a thread makes a blocking
system call. It supports multiple threads to execute in parallel on
microprocessors.
Disadvantage of this model is that creating user thread requires
the corresponding Kernel thread. OS/2, windows NT and windows 2000 use one to
one relationship model.
Difference between
User-Level & Kernel-Level Thread
S.N.
|
User-Level Threads
|
Kernel-Level Thread
|
1
|
User-level threads are faster to create
and manage.
|
Kernel-level threads are slower to
create and manage.
|
2
|
Implementation is by a thread library
at the user level.
|
Operating system supports creation of
Kernel threads.
|
3
|
User-level thread is generic and can
run on any operating system.
|
Kernel-level thread is specific to the
operating system.
|
4
|
Multi-threaded applications cannot take
advantage of multiprocessing.
|
Kernel routines themselves can be
multithreaded.
|
What are Threads?
Thread is an execution unit
which consists of its own program counter, a stack, and a set of registers.
Threads are also known as Lightweight processes. Threads are popular way to
improve application through parallelism. The CPU switches rapidly back and
forth among the threads giving illusion that the threads are running in
parallel.
As
each thread has its own independent resource for process execution, multpile
processes can be executed parallely by increasing number of threads.
Types of Thread
There
are two types of threads:
1.
User Threads
2.
Kernel Threads
User threads, are above the kernel and without kernel support. These are the
threads that application programmers use in their programs.
Kernel threads are supported within the kernel of the OS itself. All modern
OSs support kernel level threads, allowing the kernel to perform multiple
simultaneous tasks and/or to service multiple kernel system calls
simultaneously.
Multithreading Models
The
user threads must be mapped to kernel threads, by one of the following
strategies:
·
Many to One Model
·
One to One Model
·
Many to Many Model
Many to One Model
·
In the many to one model, many user-level threads
are all mapped onto a single kernel thread.
·
Thread management is handled by the thread library in user space,
which is efficient in nature.
One to One Model
·
The one to one model creates a separate kernel
thread to handle each and every user thread.
·
Most implementations of this model place a limit on how many
threads can be created.
·
Linux and Windows from 95 to XP implement the one-to-one model for
threads.
Many to Many Model
·
The many to many model multiplexes any number of
user threads onto an equal or smaller number of kernel threads, combining the
best features of the one-to-one and many-to-one models.
·
Users can create any number of the threads.
·
Blocking the kernel system calls does not block the entire
process.
·
Processes can be split across multiple processors.
What are Thread Libraries?
Thread
libraries provide programmers with API for creation and management of threads.
Thread
libraries may be implemented either in user space or in kernel space. The user
space involves API functions implemented solely within the user space, with no
kernel support. The kernel space involves system calls, and requires a kernel
with thread library support.
Three types of Thread
1.
POSIX Pitheads, may be provided as either a user or kernel library, as an
extension to the POSIX standard.
2.
Win32 threads, are provided as a kernel-level library on Windows systems.
3.
Java threads: Since Java generally runs on a Java Virtual Machine, the
implementation of threads is based upon whatever OS and hardware the JVM is
running on, i.e. either Pitheads or Win32 threads depending on the system.
Benefits of Multithreading
1.
Responsiveness
2.
Resource sharing, hence allowing better utilization of resources.
3.
Economy. Creating and managing threads becomes easier.
4.
Scalability. One thread runs on one CPU. In Multithreaded
processes, threads can be distributed over a series of processors to scale.
5.
Context Switching is smooth. Context switching refers to the
procedure followed by CPU to change from one task to another.
Multithreading Issues
Below
we have mentioned a few issues related to multithreading. Well, it's an old
saying, All good things, come at a price.
Thread Cancellation
Thread
cancellation means terminating a thread before it has finished working. There
can be two approaches for this, one is Asynchronous cancellation,
which terminates the target thread immediately. The other is Deferred
cancellation allows the target thread to periodically check if it
should be cancelled.
Signal Handling
Signals
are used in UNIX systems to notify a process that a particular event has
occurred. Now in when a Multithreaded process receives a signal, to which
thread it must be delivered? It can be delivered to all, or a single thread.
fork() System Call
fork()
is a system call executed in the kernel through which a process creates a copy
of itself. Now the problem in Multithreaded process is, if one thread forks,
will the entire process be copied or not?
Security Issues
Yes,
there can be security issues because of extensive sharing of resources between
multiple threads.
There
are many other issues that you might face in a multithreaded process, but there
are appropriate solutions available for them. Pointing out some issues here was
just to study both sides of the coin
UNIT-II:
What is Thread?
A thread is a flow of execution through the process code, with its
own program counter that keeps track of which instruction to execute next,
system registers which hold its current working variables, and a stack which
contains the execution history.
A thread shares with its peer threads few information like code
segment, data segment and open files. When one thread alters a code segment
memory item, all other threads see that.
A thread is also called a lightweight process. Threads
provide a way to improve application performance through parallelism. Threads
represent a software approach to improving performance of operating system by
reducing the overhead thread is equivalent to a classical process.
Each thread belongs to exactly one process and no thread can exist
outside a process. Each thread represents a separate flow of control. Threads
have been successfully used in implementing network servers and web server. They
also provide a suitable foundation for parallel execution of applications on
shared memory multiprocessors. The following figure shows the working of a
single-threaded and a multithreaded process.
Difference between
Process and Thread
S.N.
|
Process
|
Thread
|
1
|
Process is heavy weight or resource
intensive.
|
Thread is light weight, taking lesser
resources than a process.
|
2
|
Process switching needs interaction
with operating system.
|
Thread switching does not need to
interact with operating system.
|
3
|
In multiple processing environments,
each process executes the same code but has its own memory and file
resources.
|
All threads can share same set of open
files, child processes.
|
4
|
If one process is blocked, then no
other process can execute until the first process is unblocked.
|
While one thread is blocked and
waiting, a second thread in the same task can run.
|
5
|
Multiple processes without using
threads use more resources.
|
Multiple threaded processes use fewer
resources.
|
6
|
In multiple processes each process operates
independently of the others.
|
One thread can read, write or change
another thread's data.
|
Advantages of Thread
- Threads minimize the context
switching time.
- Use of threads provides
concurrency within a process.
- Efficient communication.
- It is more economical to create
and context switch threads.
- Threads allow utilization of
multiprocessor architectures to a greater scale and efficiency.
Types of Thread
Threads are implemented in following two ways −
·
User Level Threads −
User managed threads.
·
Kernel Level Threads −
Operating System managed threads acting on kernel, an operating system core.
User Level Threads
In this case, the thread management kernel is not aware of the
existence of threads. The thread library contains code for creating and
destroying threads, for passing message and data between threads, for
scheduling thread execution and for saving and restoring thread contexts. The
application starts with a single thread.
Advantages
- Thread switching does not
require Kernel mode privileges.
- User level thread can run on
any operating system.
- Scheduling can be application
specific in the user level thread.
- User level threads are fast to
create and manage.
Disadvantages
- In a typical operating system,
most system calls are blocking.
- Multithreaded application
cannot take advantage of multiprocessing.
Kernel Level Threads
In this case, thread management is done by the Kernel. There is no
thread management code in the application area. Kernel threads are supported
directly by the operating system. Any application can be programmed to be
multithreaded. All of the threads within an application are supported within a
single process.
The Kernel maintains context information for the process as a
whole and for individuals threads within the process. Scheduling by the Kernel
is done on a thread basis. The Kernel performs thread creation, scheduling and
management in Kernel space. Kernel threads are generally slower to create and
manage than the user threads.
Advantages
- Kernel can simultaneously
schedule multiple threads from the same process on multiple processes.
- If one thread in a process is
blocked, the Kernel can schedule another thread of the same process.
- Kernel routines themselves can
be multithreaded.
Disadvantages
- Kernel threads are generally
slower to create and manage than the user threads.
- Transfer of control from one
thread to another within the same process requires a mode switch to the
Kernel.
Multithreading Models
Some operating system provide a combined user level thread and
Kernel level thread facility. Solaris is a good example of this combined
approach. In a combined system, multiple threads within the same application
can run in parallel on multiple processors and a blocking system call need not
block the entire process. Multithreading models are three types
- Many to many relationship.
- Many to one relationship.
- One to one relationship.
Many to Many Model
The many-to-many model multiplexes any number of user threads onto
an equal or smaller number of kernel threads.
The following diagram shows the many-to-many threading model where
6 user level threads are multiplexing with 6 kernel level threads. In this
model, developers can create as many user threads as necessary and the
corresponding Kernel threads can run in parallel on a multiprocessor machine.
This model provides the best accuracy on concurrency and when a thread performs
a blocking system call, the kernel can schedule another thread for execution.
Many to One Model
Many-to-one model maps many user level threads to one Kernel-level
thread. Thread management is done in user space by the thread library. When
thread makes a blocking system call, the entire process will be blocked. Only
one thread can access the Kernel at a time, so multiple threads are unable to
run in parallel on multiprocessors.
If the user-level thread libraries are implemented in the
operating system in such a way that the system does not support them, then the
Kernel threads use the many-to-one relationship modes.
One to One Model
There is one-to-one relationship of user-level thread to the
kernel-level thread. This model provides more concurrency than the many-to-one
model. It also allows another thread to run when a thread makes a blocking
system call. It supports multiple threads to execute in parallel on microprocessors.
Disadvantage of this model is that creating user thread requires
the corresponding Kernel thread. OS/2, windows NT and windows 2000 use one to
one relationship model.
Difference between
User-Level & Kernel-Level Thread
S.N.
|
User-Level Threads
|
Kernel-Level Thread
|
1
|
User-level threads are faster to create
and manage.
|
Kernel-level threads are slower to
create and manage.
|
2
|
Implementation is by a thread library
at the user level.
|
Operating system supports creation of
Kernel threads.
|
3
|
User-level thread is generic and can
run on any operating system.
|
Kernel-level thread is specific to the
operating system.
|
4
|
Multi-threaded applications cannot take
advantage of multiprocessing.
|
Kernel routines themselves can be
multithreaded.
|
Design Process & Task Analysis
Task Analysis
Task Analysis plays an important part in User Requirements
Analysis.
Task analysis is the procedure to learn the users and abstract
frameworks, the patterns used in workflows, and the chronological
implementation of interaction with the GUI. It analyzes the ways in which the
user partitions the tasks and sequence them.
What is a TASK?
Human actions that contributes to a useful objective, aiming at
the system, is a task. Task analysis defines performance of users, not computers.
Hierarchical Task Analysis
Hierarchical Task Analysis is the procedure of disintegrating
tasks into subtasks that could be analyzed using the logical sequence for
execution. This would help in achieving the goal in the best possible way.
"A hierarchy is an organization of elements that, according
to prerequisite relationships, describes the path of experiences a learner must
take to achieve any single behavior that appears higher in the hierarchy.
(Seels & Glasgow, 1990, p. 94)".
Techniques for Analysis
·
Task decomposition −
Splitting tasks into sub-tasks and in sequence.
·
Knowledge-based techniques −
Any instructions that users need to know.
‘User’ is always the beginning point for a task.
·
Ethnography −
Observation of users’ behavior in the use context.
·
Protocol analysis −
Observation and documentation of actions of the user. This is achieved by
authenticating the user’s thinking. The user is made to think aloud so that the
user’s mental logic can be understood.
What is Functional
Decomposition?
Functional decomposition corresponds to the various functional
relationships as how the original complex business function was developed. It
mainly focusses on how the overall functionality is developed and its
interaction between various components.
Large or complex functionalities are more easily understood when
broken down into pieces using functional decomposition.
When and How?
·
Functional decomposition is mostly used during the project
analysis phase in order to produce functional decomposition diagrams as part of
the functional requirements document.
·
Functional Decomposition is done after meeting with business
analysts and subject matter expertise.
·
Decompose the first level components with their functions and
continue to decompose to lower levels until sufficient level of detail is
achieved
·
Perform an end-to-end walk-through of the business operation and
check each function to confirm that it is correct.
Example:
The below example, best describes the Functional Decomposition:
What is Data Flow Testing?
Data flow testing is a family of test strategies based on
selecting paths through the program's control flow in order to explore
sequences of events related to the status of variables or data objects.
Dataflow Testing focuses on the points at which variables receive values and
the points at which these values are used.
Advantages of Data Flow
Testing:
Data Flow testing helps us to pinpoint any of the following
issues:
·
A variable that is declared but never used within the program.
·
A variable that is used but never declared.
·
A variable that is defined multiple times before it is used.
·
Deallocating a variable before it is used.
What is Functional
Decomposition?
Functional decomposition corresponds to the various functional
relationships as how the original complex business function was developed. It
mainly focusses on how the overall functionality is developed and its
interaction between various components.
Large or complex functionalities are more easily understood when
broken down into pieces using functional decomposition.
What is Functional
Decomposition?
Functional decomposition corresponds to the various functional
relationships as how the original complex business function was developed. It
mainly focusses on how the overall functionality is developed and its
interaction between various components.
Large or complex functionalities are more easily understood when
broken down into pieces using functional decomposition.
Example:
The below example, best describes the Functional Decomposition:
Design and Analysis of Algorithms Tutorial
An Algorithm is a sequence of steps to solve a problem.
Design and Analysis of Algorithm is very important for designing algorithm to
solve different types of problems in the branch of computer science and
information technology. This tutorial introduces the fundamental concepts of
Designing Strategies, Complexity analysis of Algorithms, followed by problems
on Graph Theory and Sorting methods. This tutorial also includes the basic
concepts on Complexity theory.
Parallel Algorithm - Analysis
Analysis of an algorithm helps us determine whether the algorithm
is useful or not. Generally, an algorithm is analyzed based on its execution
time (Time Complexity) and the amount of space (Space
Complexity) it requires.
Since we have sophisticated memory devices available at reasonable
cost, storage space is no longer an issue. Hence, space complexity is not given
so much of importance.
Parallel algorithms are designed to improve the computation speed
of a computer. For analyzing a Parallel Algorithm, we normally consider the
following parameters −
- Time complexity (Execution
Time),
- Total number of processors
used, and
- Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to
reduce the computation time of an algorithm. Thus, evaluating the execution
time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the
algorithm to solve a problem. The total execution time is calculated from the
moment when the algorithm starts executing to the moment it stops. If all the
processors do not start or end execution at the same time, then the total
execution time of the algorithm is the moment when the first processor started
its execution to the moment when the last processor stops its execution.
Time complexity of an algorithm can be classified into three
categories−
·
Worst-case complexity −
When the amount of time required by an algorithm for a given input is maximum.
·
Average-case complexity −
When the amount of time required by an algorithm for a given input is average.
·
Best-case complexity −
When the amount of time required by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of
steps executed by the algorithm to get the desired output. Asymptotic analysis
is done to calculate the complexity of an algorithm in its theoretical
analysis. In asymptotic analysis, a large length of input is used to calculate
the complexity function of the algorithm.
Note − Asymptotic is a condition where a line
tends to meet a curve, but they do not intersect. Here the line and the curve
is asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and
slowest possible execution time for an algorithm using high bounds and low
bounds on speed. For this, we use the following notations −
- Big O notation
- Omega notation
- Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic
characteristics of functions. It represents the behavior of a function for
large inputs in a simple and accurate method. It is a method of representing
the upper bound of an algorithm’s execution time. It represents the longest
amount of time that the algorithm could take to complete its execution. The
function −
f(n) = O(g(n))
iff there exists positive constants c and n0 such
that f(n) ≤ c * g(n) for all nwhere n ≥ n0.
Omega notation
Omega notation is a method of representing the lower bound of an
algorithm’s execution time. The function −
f(n) = Ω (g(n))
iff there exists positive constants c and n0 such
that f(n) ≥ c * g(n) for all nwhere n ≥ n0.
Theta Notation
Theta notation is a method of representing both the lower bound
and the upper bound of an algorithm’s execution time. The function −
f(n) = θ(g(n))
iff there exists positive constants c1, c2, and n0 such
that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥
n0.
Speedup of an Algorithm
The performance of a parallel algorithm is determined by
calculating its speedup. Speedup is defined as the ratio of the
worst-case execution time of the fastest known sequential algorithm for a
particular problem to the worst-case execution time of the parallel algorithm.
speedup
=
Worst
case execution time of the fastest known sequential for a particular problemWorst
case execution time of the parallel algorithm
Number of Processors
Used
The number of processors used is an important factor in analyzing
the efficiency of a parallel algorithm. The cost to buy, maintain, and run the
computers are calculated. Larger the number of processors used by an algorithm
to solve a problem, more costly becomes the obtained result.
Total Cost
Total cost of a parallel algorithm is the product of time
complexity and the number of processors used in that particular algorithm.
Total Cost = Time complexity × Number of
processors used
Therefore, the efficiency of a parallel algorithm
is −
Efficiency =
Worst
case execution time of sequential algorithmWorst case execution time of
the parallel algorithm
UNIT-V:
Solutions to Common Parallel Programming Problems
A deadlock happens in operating system
when two or more processes need some resource to complete their execution that
is held by the other process.
In the above diagram, the process 1 has
resource 1 and needs to acquire resource 2. Similarly process 2 has resource 2
and needs to acquire resource 1. Process 1 and process 2 are in deadlock as
each of them needs the other’s resource to complete their execution but neither
of them is willing to relinquish their resources.
Coffman Conditions
A deadlock occurs if the four Coffman
conditions hold true. But these conditions are not mutually exclusive.
The Coffman conditions are given as
follows:
1.
Mutual
Exclusion
There should be a
resource that can only be held by one process at a time. In the diagram below,
there is a single instance of Resource 1 and it is held by Process 1 only.
2.
Hold
and Wait
A process can
hold multiple resources and still request more resources from other processes
which are holding them. In the diagram given below, Process 2 holds Resource 2
and Resource 3 and is requesting the Resource 1 which is held by Process 1.
3.
No
Preemption
A resource cannot
be preempted from a process by force. A process can only release a resource
voluntarily. In the diagram below, Process 2 cannot preempt Resource 1 from
Process 1. It will only be released when Process 1 relinquishes it voluntarily
after its execution is complete.
4.
Circular
Wait
A process is
waiting for the resource held by the second process, which is waiting for the
resource held by the third process and so on, till the last process is waiting
for a resource held by the first process. This forms a circular chain. For
example: Process 1 is allocated Resource2 and it is requesting Resource 1.
Similarly, Process 2 is allocated Resource 1 and it is requesting Resource 2.
This forms a circular wait loop.
Deadlock Detection
A deadlock can be detected by a resource
scheduler as it keeps track of all the resources that are allocated to
different processes. After a deadlock is detected, it can be resolved using the
following methods:
1.
All
the processes that are involved in the deadlock are terminated. This is not a
good approach as all the progress made by the processes is destroyed.
2.
Resources
can be preempted from some processes and given to others till the deadlock is
resolved.
Deadlock Prevention
It is very important to prevent a
deadlock before it can occur. So, the system checks each transaction before it
is executed to make sure it does not lead to deadlock. If there is even a
slight chance that a transaction may lead to deadlock in the future, it is
never allowed to execute.
Deadlock Avoidance
It is better to avoid a deadlock rather
than take measures after the deadlock has occurred. The wait for graph can be
used for deadlock avoidance. This is however only useful for smaller databases
as it can get quite complex in larger databases.