Chapter Three

Data Compression and Encryption


This chapter is about various compression algorithms used in data communications. The purpose of introducing data compression is to speed up data transmission. It is because some of the data are redundant and it is more effective to filter out these redundant data or use a method to pack them together. The compressed data after transmission should be decompressed by the receiver. Upon completion of this chapter, you should be able:

Understand reversible and non-reversible data compression.

Identify Huffman code and FAX compression algorithm

Understand the encryption and decryption

Introduction


Data compression is developed to reduce the number of characters required in a transmission to increase transmission efficiency. Messages are composed of fixed length of bits coded to represent characters. The codes such as ASCII was designed to facilitate the procession by computers. These fixed-length characters are possibly compressed into variable-length characters smaller than the conventional code sets during transmission in order to save transmission bandwidth. Figure shows the data transmission with/without data compression. Assume the time of transmission without data compression is T. The time of transmission with data compression is T2 and the data compression and data decompression is T2 and T2 respectively. Whether it is worth to perform data compression is whether the following expression holds true:

That is to say if the time required to perform data compression by the sender, data transmission, data decompression by the receiver is longer than simply data transmission without performing data compression, it is not cost effective to perform data compression. Data compression techniques are generally classified into two types, namely, reversible and non-reversible. The description for each of them is given below:

Reversible: the data by the receiver can be recovered after data compression. For instance, the sender sends a string of “abcde”, which can be re-produced by the receiver as “abcde”.
Nonreversible: the data cannot be recovered by the receiver. However, the unrecovered data is less important to the receiver. For instance, the sender sends “23.911", which can be re-produced by the receiver as ”23.91". The “0.01" is discarded after data compression to reduce transmission bandwidth.

Data Compression - Reversible

This technique provides for temporary elimination of portions of the data. It uses either semantic dependent procedures which depend on the content and context of the data for data reduction or semantic independent procedures which do not depend on the the context.

The most commonly used Reversible codes include:

Huffman code

This code is widely used in data compression. This assigns fewer bits to the most commonly used characters and allocates fewest bits to the less commonly used characters. It has been used to achieve 50% of reduction of original codes. Assume that two bits are assigned to represent a grade of subject, namely, excellent, good, fair and poor as follows:

Grade

Block Code

Excellent

00

Good

01

Fair

10

Poor

11


A string of data using the above table is sending as “11111001111100000", can you decode the string?

Can you fill in the occurrence of each grade?

Grade

Occurrence

Assume that the occurrence in probability of a particular string pattern are: Excellent (0.1), Good(0.3), Fair (0.4) and Poor (0.2). The average occurrence for each coded word is two bits in length. If the coded word is now changed to a variable length together with the same probability occurrence, the average coded length is now reduced 3 x 0.05 + 2 x 0.3 + 1 x 0.6 + 3 x 0.05 = 1.5. This is less than a fixed length of two bits, which is the idea of Huffman code compression.

Grade

Coded word

Probability

Excellent

111

0.05

Good

10

0.3

Fair

0

0.6

Poor

110

0.05


Using the Huffman code compression, can you decode the string of “01100111100".

If two sets of coded words are modified as follows

Grade

Coded word 3

Coded word 4

Excellent

111

110

Good

01

11

Fair

0

0

Poor

011

011


Using the same string of “01100111100", can you decode it using the code word 3 and code 4.



You will find that the using either of them will produce an unclear result as the word length is not fixed to two bits but is a variable length. Using the code word 3, the string will produce poor, fair, fair, excellent...

To be reasonable, codes have to be uniquely decoded. Otherwise, it will produce an unclear decoded result. To code a pattern, a coding tree is used as shown in Figure . In this diagram, the highly occurred grade will be assigned first.  

A code word consisting of five states are: A, B, C, D, and E. Their probabilities of occurrence (possible states) for each of them are: 0.1, 0.3, 0.4, 0.1.5, 0.5. The coding tress are . Determine the average length of coded word in terms of number of bits for each coding tree.

Example of Simple Block Code

To represent deneary digits 0, 1, 2, .... in computer, Binary Coded Decimal entitled BCD is used. It uses four digits to represent a number. As the denary number counts from 0 to 9, while four bits can produce 24 = 16 possible numbers, six of them are wasted. However, we cannot use three digits to represent the denary numbers as it represent 23 = 8 possible numbers only. The average length in BCD is therefore four bits. Since we have learnt the coding tree to simplify the coded word, can we apply to this case. The answer is YES as shown Figure . The average coded length is now reduced 3.4 instead of 4 bits.

Calculate the average word length using the figure and its improvement.


The probability for each denary is 1/10, the average length is therefore (6 (number of pattern using 3 bits) x 3 (bits) x 0.1 (probability of occurrence) + 4 x (4 bits) x 0.1 (probability) = 3.4. The improvement is (4 - 3.4)/4 = 0.15 or 15%.

Can we improve the efficiency further such as sending a numerical value of 3456234517?



We could propose a coded word by grouping the above numerical value into 34, 56, 23, 45, and 17. The new pattern would be the hundred denary numbers from 00 to 99 instead of 0 to 9. This could improve it further. However, the coding tree becomes more complex and difficult to manage. Can you imagine how big the tree will be.

Example of English Language Symbol

When we read a paragraph, we noticed that SPACE is frequently encountered than symbol Z. That is true. According to the Reza in 1961, the typical symbol probabilities for English text are as follows. Of course, it varies as it depends on the text. However, it is sufficient to provide us a clear picture on the frequently symbols.

Symbol

Probability

Symbol

Probability

Symbol

Probability

Symbol

Probability

Symbol

Probability

SPACE

0.187

E

0.103

J

0.008

O

0.063

T

0.084

A

0.064

F

0.02

K

0.005

P

0.015

U

0.023

B

0.012

G

0.15

L

0.032

Q

0.008

V

0.008

C

0.02

H

0.047

M

0.2

R

0.0487

W

0.0076

D

0.031

I

0.06

N

0.058

S

0.052

X

0.0013

Using the above probabilities, we can then construct a coding tree to improve the coding efficiency. Of course, we haven’t considered the punctuation. It is found that without taking the punctuations, the average number of bits used to represent the 27 symbols (SPACE, A, B.... Z) assuming that the probability of occurrence is equal (1/27), is 4.75 bits. This is already a significant improvement when compared to the average of seven bits used to represent the symbols. (7 - 4.75)/7 = 32%. The actual word length using the above table is 3.3 bits per symbol, which is further improvement. This implies that, on average, 3.3 bits are sufficient to represent the 27 symbols. That is to say that fewer bits will be used for SPACE and more bits for symbol Z.

Can we further improve it by group the symbols together such as AA, AB, AC etc.


yes, we can. It is revealed that further improvement can be achieved by grouping the symbols together. It is mathematically shown that the average bit length for each symbol by grouping two consecutive symbols together is 2.1 bits per symbol.

Information

Information, I, is associated with the source state having probability p. It is found that it is related to

I = - log p

where p is the probability of occurrence. The number of bits, n, to carry out the information is related to I/log 2. In a BCD, assume the probability for each digit is equal to 1/10, the information is 1 and the number of bits is 1/log 2 = 1/0.301 = 3.3 bits. Hence, information is associated with the minimum number of bits to represent a coded word. Here, it means that 4 binary bits are used to carry 3.3 bits of information. It is further concluded that BCD is not an efficient coding as some of the information are redundant.

Adaptive scanning

It uses a dictionary to store frequently occurring strings of characters. The common character string will be replaced by a shorter code during transmission and will be reversed after finished. Consider the following hypothetical coded word for frequently used string.

String

Coded Word

String

Coded Word

Hong Kong

@1

School

@2

Boy

@3

House

@4

University

@5

City

@6

Tat Chee

@7

Avenue

@8


al character used in this case. If a paragraph contains this special character, it uses other symbol to represent that it is a coded word.

Calculate the number of bytes required if a sender sends out the above strings and also the number of bytes for the corresponding coded word. Find out the percentage of improvement if coded words are used.


The number of bytes for the string is 51 bytes and the number of coded words is 16 bytes. The percentage of improvement is over 300%. A really significant improvement.


Using the above coded word to substitute the strings of “A school boy pays a visit to the City University of Hong Kong at Tat Chee Avenue. and find out the improvement.

As the occurrence of string varies from paragraph, it is advisable by first determining the occurrence of strings and send it to the receiver with the corresponding coded word, then followed by the contents of paragraph. Sending the first string and coded word also occupies the Bandwidth, which might not be worthwhile if that string only occurs once in the paragraph.

FAX compression

FAX treats each facsimile line as a series of white and black runs. The runs are coded based on their length, and the code is transmitted instead of the full bit map. It is also termed run length encoding. For instance, if the FAX detects 40 black spots, instead of sending 40 “dot”s, it will send “40b”. This will save significantly. This applies to the white spots. As it will send the whites and blacks alternatively, the code consists of a series of numbers on whites and blocks.

Example

Assume a A4 paper (8.5" x 11") using 300 DPI (dot per inch) is used for its pictures and is further assumed that these pcitures occupy 30% of that page. Without using run length encoding for FAX, how long does it take to transmit this page using 9600 bps (standard rate) and 56K bps?

The total number of dots is: 300 (DPI) x 8.5 x 300 (DPI) x 11 x 30% = 8.4 x 10 6 dots

The time it takes using 9600 bps is 8.4 x 10 6 / 9600 = 875 s

The time it takes using 56k bps is 8.4 x 10 6 / 56000 = 150 s

You might surpursie with the figures as they are over 2 minutes.

Run length Encoding: Consider the following Black/White pattern


0000 0000 0000 0000



0000 0011 0000 0000



0000 0011 0000 0000



0000 1111 0000 0000



0000 1111 0000 0000



0000 1111 0000 0000



0000 1111 0000 0000



1111 1111 1111 1111



1111 1111 1111 1111



1111 1111 1111 1111

Without using the run length code for this pattern, the number of bytes are 10 x 2 = 20 bytes (0 or 1 refers to 1 bit). With the run length encoding, the transmit pattern is: 22W 2B 14W 2B 12W 4B 12W 4B 12W 4B 12W 4B 8W 48B. As it transmits the pattern alternatively, there is no need to specify the B (black) or W (white). As a result, the number is 22 2 14 2 12 4 12 4 12 4 12 4 8 48 and the number of bytes becomes 14. (20 - 14)/20 = 30% improved.  

Instead of sending the word Tsim Sha Tsui, you program the computer to send a coded word called TST. Can you point out what compression method it is? also describes the software requirement in the receiver.

Data Compression - Nonreversible

It is often called data compression, which permanently eliminates the irrelevant portions of the data such as rounding the figure to a meaningful decimal values. For instance, the sender is sending a data of having five decimal places 23.45612, the sender can chop the decimal places to four, 23.4561 before sending it out. In this example, the bandwidth (8-7)/8 = 12.5% is saved. This also applies to voice and video signals.

Voice Compression

As mentioned in previous chapter, an authentic voice signal is at least sampled at 8K data rate and is assigned 8 bits leading to a signalling rate for a voice signal operating at 64K bps. This does not include the data compression. Data compression can be applied to reduce to 8K bps for a single voice with acceptable quality. That voice signal is being used in most mobile phones. That is why, nowadays, mobile phone network can support more mobile phone network in view of situation that the bandwidth is of no further expansion. How can we compress the voice, observe the way of speaking, within one second, you might perhaps pronounce the same pattern. Instead of sending the same pattern for 8000 times (this is the sampling rate for ordinary voice), it can send the first pattern together with information indicating the number of repetitive patterns, in order to save the transmission bandwidth.



The mobile phone you are using can carry data at the rate of 9600 bps. It is expected that this data rate can be increased to 20M bps by 2003.


A 56K bps modem is used to connect the Internet. If voice signal is multiplexed in this modem, calculate theoretically the maximum number of voice signal it can support.



56 kbs/s / 8k bps = 7 voice channels

Other on-line audio

On-line video works by converting sound into digital information and storing that information in audio computer files so that it can be transferred over the network such as the Internet. It can be played by the remote through a set of computer. The use of on-line audio is increasing rapidly. However, there is no uniform standard for digital audio formats. Below are a few standards:

WAV (common sound blaster type audio files. It is also the native sound format for MS Widnows. The disadvantage of this format is less efficient when compared to other two. The advantage is that it has many sound clips.)
AU (common Internet audio files. It is also called uLaw, NeXT or SUN Audio format. Its advantaage is the use little bandwidth for transfer.)
RA (Real Audio files, ideal for voice. It is developed by RealAudio that company and is almost the most popular standard in the Internet.)

Access Real Audio Web site and test this format. www.realaudio.com and also try the GoldWave site at web.cs.mun.ca/~chris3/goldwave to show the quality of a digital sound recorder.


Video Compression

Video after compression is a variable bit rate (VBR), which means that the bandwidth required to transmit the data varies with time as shown in Figure . Here, the bandwidth at t0 is less than the bandwidth at t1. For this type of signal, it imposes a headache to the network designer as the bandwidth is unpredictable. Compression using commercial compression algorithm can achieve up to 26-to-1. To determine the bandwidth required for a TV type signal, let us find out the requirement as follows:

Number of frames per second 50

Screen width: 1024 dots

Screen height: 768 dots

Level of colors per dot: 256 = 8 bits (in fact it is more than that)

Level of intensity per dot: 8 levels = 3 bits

The amount of information to be transmitted per second is 50 x 1024 x 768 x 8 x 3 = 37.7 Mbs/s

That is why we cannot use a 56k bps modem to watch a full screen TV programme. However, if we restrict the display screen to 100 x 76 and the number of colors to 16 (4 bits), the modified data rate becomes 50 x 100 x 76 x 4 x 3 = 4560 k bps. A commercial compression algorithm can be up to 100:1. The resultant transmission rate after compression becomes 4560/100 = 45.6 K bps which can be played by a 56 K bps modem. That is why we can use Real-video to watch video programme through the Internet. The Moving Pictures Experts Group (MPEG) is a standards group that is under the International Standards Organisation (ISO). The group has developed a set of standards for the transmission of compressed variable bit rate video. The goal of the MPEG group is to develop a scheme for compressing and storing digital video (with its audio) and also the encoding scheme for transmission over the networks. MPEG1 currently being used in ITV is defined as operating at 1.5M bps. MPEG uses three frame formats, namely, I, and B. I (Image) frames are complete bit images of a single frame and are sent periodically to bound the effect of compression errors. P (Predicted) frames update the image using a predictive algorithm from the last I or P frame. B (Bi-directional) frames describe a frame in terms of the differences between either the last and next I or P frames. Figure shows a pattern of compressed video. It starts with an I frame followed by two B frames and one P frame. It then repeat until twelve frames, then a new I frame is resumed. You can imagine without the compression, each frame is the same as I frame. How can we compress the image? Look at the difference between two images in Figure . There is only a slight different as within a second, it has to send 50 frames to maintain a continuous changes. Instead of sending the whole images again, it can send out the first whole image together with the changes in second frame to reduce the transmission bandwidth. By using this method, it can save a lot of redundant information that is contained in previous frames.

Other on-line Video

Apart from MPEG, there are three more formats, namely, MOV, AVI and HQX files. HQX is for Macintosh. MOV is the format developed by QuickTime video, to play around it go to www.kwanza.com/~embleton/service.html and download it.


Data Encryption

The aim of data encryption is to ensure the data transmission in the communication network is secure and confidential. That is to say only the user who knows the method to revert (decrypt) to data. The process of converting the original data into another message is called encryption and the reverse process to convert back into the original message is called decryption. The original message is therefore called plaintext while the encrypted message is called ciphertext. The most widely used encryption method is defined by US National Bureau of Standards and is sometimes called data encryption standard (DES). Before studying the details of operation of method available at DES, consider the following message and suggest means to encrypt it.


I am a student of Computer Studies.

Conversion Table

Use the following table and write down the ciphertext.

Plaintext

Ciphertext

Plaintext

Ciphertext

Plaintext

Ciphertext

a

1

i

a

q

%

b

4

j

s

r

@

c

r

k

x

s

s

d

y

l

z

t

p

e

i

m

e

u

]

f

&

n

6

v

g

g

d

o

7

w

j

h

q

p

f

x

z

y

t

z

;

SPACE

\

Similarly, given the following ciphertext, what is the plaintext

78\\@%%dqt

Looking at this method, it is quite easy to determine the rule of conversion algorithm as the translation is quite straightforward. Each character is correspondingly converted to a character. Can you recall that Egyptian character is not to difficult to interpret as each character represents a single meaning. If the conversion is based on word rather than character it becomes more difficult as follows:  

Plainword

Cipherword

Plainword

Cipherwod

the

iuyoo

school

67%

I

po00

am

23we

student

-0pi

space

ivcd

CityU’s

54t

   

If the plainword is “I am a student”, the cipherword is “po00ivcd23weivcdoivcd-0pi”. If is more difficult as the conversion is not one to one mapping. One to one mapping means one character maps one character. The mapping, in terms of, character length is a variable. However, if you look at the cipherword again, you will find that some patterns are simply repeated such as SPACE, the “ivcd”. The conversion is the same for every SPACE, it is quite logical for others to decipher that “ivcd” is the SPACE. Therefore, more complex DES algorithm was developed.

Web Security

The word cryptography comes from the Greek, which means secret writing. It is popular due to the publicity of the Internet and the convenience of Wen-shopping using credit card. All cryptographic systems, regadless of complexity and types, consists of four basic elements as given below:

Plaintext: The is the original message to be sent and is usually human-readable. This can include video or voice signals not just text messages.
Ciphertext: This is the plaintext after encryption.
Cryptographic algorthim: This is the mathematical operations used to encrypt/decrypt the plaintext and vice versa.
Key: This is a secret key used to encrypt/decrypt the message. Only people who knows the correct key can decrypt the ciphertext. Figure shows the relationship amongst the four basic elements.

Symmetric Cryptography

Symmetric Cryptography means the same key is used to encrypt the plaintext and decrypt the ciphertext. To enable it works, both the sender and receiver must hold the same key and the receiver must know the key in advance. Figure shows how it works

Asymmetric Cryptography

This algorithm can be achieved through a set of keys. This key, usually termed as private key, must be well informed to the remote before it can be converted into the plaintext. This method seems to work fine if the key does not change regularly. It is however not the case as the key should be changed daily so that the hacker cannot decrypt easily. It leads to the development of public key. The sender will use its own key to encrypt the message while the remote will use the public key (not the same key of sender) to decrypt the received message. Figure shows how it works, the public key of the receipents is widely distributed to the sender. To send a message to a someone, we should pick up the receipent’s public key and use it encrypt the message. The message will be transmitted without fear of interception. The ciphertext is then decrypted using the recipient’s private key.

DES algorithm

The DES algorithm is a block cipher. Block cipher means it work on a fixed length of transmitted data in the network. The message is first segmented (broken) into a number of fixed data blocks of plaintext. The length is 64 bits (8 bytes). The plaintext is then encrypted by a 56-bit key (SET standard, being used in Banking in the Internet uses 128-bit key). This makes it difficult to decrypt as 256 = 1017 combination. The encrypted message is then transmitted through the network and is inverse by the remote. The conversion is very complex and mainly go through two major conversion, transposition and substitution.

Transposition

Transposition means to transpose the original order of pattern into different order. For instance, a byte of 1100 1000 after transposition becomes 10100100 as shown in Figure .


Using the same transposition rule, can you determine the conversion of plaintext 01001101.

The conversion can be achieved through the transposition of ciphertext into plaintext as shown in Figure. Here, the first bit is right-hand bit and the flow of data is from right to left in this diagram. You can convert it from left to right if you like.

Substitution

Substitution means to substitute the complete set of bits by a different set of bits. For instance, the original plaintext is 1011 0011 and the key is 11000011. After substitution the ciphertext becomes 01110000 using exclusive-or operation as shown in Figure . To convert it back to the original test, use the same key and perform the conversion using exclusive or again.


Assume the plaintext is 10101100 and the key is 10101010, find out the ciphertext.



Using the above result, determine the plaintext using the same key of 10101010.

The DES using symmertric cryptography is a very complex algorithm as it involves 19 distinct steps. It is so complex so that it is not used to be decrypted by others. The first step is a simple transposition of the 64-bit block of plaintext using the transposition rule then followed by 16 identical iterations of substitution processing, and a few more transpositions. The 56-bit key will generate 16 subkeys each of 48 bits for substitution. The transmitted data can be converted using the same key by the remote machine For details of operation, please refer to Data Communications, Computer Networks and OSI by Fred Halsall on Data Encryption.



Self-examined Questions

Short Questions

Find out the transmission bandwidth for a video signal, the width and height are 200 and 150 pixels respectively with 128 color patterns and 16 level intensities. Further assume that 25 frames per second is acceptable to the user.

 
 
 
Find out the above transmission bandwidth using the Real Video in the Netscape if the signal is now compressed using 50:1 algorithm. Determine the transmission bandwidth saved.

 
 
 
If the speed of a modem operating at 28800 bps is used to support a compressed voice channel, how many it can support?

If the above problem support a voice channel and graphics/text using IE Explorer, determine the remaining bandwidth for graphics/test usage.

 
 
 

True or False

If KLN is used to replace Kowloon, the percentage save is 5/8.

Private key is changed on regular basic.