Devanagari Font Design for Optical
Character Recognition
Dual Degree Dissertation
Submitted in partial fulfillment of the requirements
of the degree of
(Bachelor of Technology & Master of Technology)
By
Mustafa Saifee
07D07004
Supervisor:
Prof. Ravi Poovaiah
Department of Electrical Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
May 2012
Approval Sheet
Devanagari Font Design for Optical Character
Recognition by Mustafa Saifee is approved for the degree of Bachelor of Technology in Electrical
Engineering & Master of Technology in Communication and Signal Processing.
Examiner
Examiner
Supervisor
Chairman
Date
Place
Declaration
I declare that this written submission represents my ideas in my own words and where others’
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the institute and can
also evoke penal action from the sources which have thus not been properly cited or from whom
proper permission has not been taken when needed.
(Signature)
(Name)
(Roll No.)
Date
Acknowledgements
I express my sincere gratitude to Prof. Ravi Poovaiah for his support, guidance and constant encouragement. I would like to thank Manoj G. from CDAC for his guidance and support. This time I
remember my parents and brother with great reverence whose support and prayers are always
my strength. I would also like to thank my faculty advisor and HoD Prof. Abhay Karandikar for
his support.
Mustafa Saifee
07D07004
Abstract
The original motivations for developing optical character recognition technologies were modest:
to convert printed text on flat physical media to digital form, producing machine-readable digital
content. By doing this, words that had been inert and bound to physical material would be brought
into the digital realm and thus gain new and powerful functionalities and analytical possibilities.
It is crucial to the computerization of printed texts so that they can be electronically searched,
stored more compactly, displayed on-line, and used in machine processes such as machine translation, text to speech and text mining.
We design a Devanagari script font optimized for OCR. In this report we first study the basics
of OCR systems and the working of Devanagari OCR. We also study Latin fonts designed for OCR
and the precautions that need to be taken while designing a Devanagari script font for OCR. Then
we apply this knowledge to design the font which is showcased in the report. In this report we also
discuss the different features of letterforms which help in recognition. In conclusion, we tested the
font and found the results encouraging.
Contents
1. Introduction
2. Devanagari Script
2.1 Alphabets
2.2 Anatomy
2.3 Character Frequency in Hindi
3. Optical Character Recognition
3.1 History of OCR
3.2 Applications of OCR
3.3 Recognition of Devanagari Script
4. Typefaces for OCR
4.1 Latin Fonts for OCR
4.3 Devanagari Font for OCR
5. Font Design for OCRs
5.1 Design Features Important for OCR
5.2 Special Care for Devanagari
6. Proposed Font Designed for OCRs
6.1 Designing Characters
6.2 Evolution of Typeface
6.3 Anatomy of the Typeface
6.4 Testing of the Typefaces
7. Future Work
References
List of Figures
Figure 2.1 Bhagwat’s grouping of letters on the basis of graphical similarities
Figure 2.2 Bhagwat’s guidelines
Figure 2.3 Naik’s grouping of letters on the basis of the position of the vertical bar or the kana
Figure 3.1 OCR-A (top); OCR-B (Bottom)
Figure 3.2 Procedure of Devanagari Recognition
Figure 3.3 Image before binarization (left); Image after binarization (right)
Figure 3.4 Horizontal Projection Profiles of a document for line segmentation
Figure 3.5 Vertical Projection Profiles of a document for word segmentation
Figure 3.6 Three parts of a Devanagari word
Figure 3.7 The procedure of Hindi character segmentation
Figure 4.1 Characters of OCR-A
Figure 4.2 Characters of OCR-B
Figure 4.3 Optimal Overlapping of characters
Figure 4.4 More coverage ratio because of serifs
Figure 4.5 First test version of OCR-B
Figure 4.6 First published version of OCR-B
Figure 4.7 OCR-B (top), OCRBczyk (bottom)
Figure 5.1 Addition of elements in OCR-B to differentiate two characters
Figure 5.2 Serifs in a typeface (grey serifs)
Figure 5.3 Shadow Characters
Figure 5.4 Counter is the circular negative space (grey)
Figure 5.5 Example of few characters after the removal of Shiro Rekha
Figure 5.6 Similarities between different characters once the Shiro Rekha is removed
Figure 5.7 Similarities between different characters once the bottom strip is removed
Figure 5.8 Similarities between descender and Halanta
Figure 5.9 Example of open counter (grey)
Figure 6.1 Extension in the diagonal stem of र to differentiate it from स
Figure 6.2 Difference in the shape of the bowl
Figure 6.3 Final design of letters र स and ख
Figure 6.4 Common element of ग न भ and म
Figure 6.5 Final design of letters ग न भ and म
Figure 6.6 Difference in the vertical and horizontal distance of the vertical and horizontal bar
Figure 6.7 Difference in भ and म once Shiro Rekha is removed
Figure 6.8 Closed counter in the letters क ब and व
Figure 6.9 Final design of the letters क ब and व
Figure 6.10 Small diagonal stroke and the openness of the counter in ल
Figure 6.11 Overlapping of the letters ल and त
Figure 6.12 Final design of the letters ल and त
Figure 6.13 Distance between horizontal and vertical bar of the letters
Figure 6.14 Difference in ध and घ once Shiro Rekha is removed
Figure 6.15 Final design of the letters च ज छ ध and घ
Figure 6.17 Final design of the letters प फ ण and ष
Figure 6.18 Final design of the letters ट ठ ढ and द
Figure 6.19 Equal character height of all the letters
Figure 6.20 Final design of the letters इ ई ङ ड झ and ह
Figure 6.21 Overlap of the letters थ and य without the Shiro Rekha
Figure 6.22 Final design of the letters थ and य
Figure 6.23 Overlap of the letters अ and उ
Figure 6.24 Final design of the letters अ आ उ and ऊ
Figure 6.25 Final design of the letters ए ऐ ञ and श
Figure 6.26 Small white space between the letters and lower matras
Figure 6.27 Final design of matras
Figure 6.28 First version of OCR-D (top); final version of OCR-D (bottom)
Figure 6.29 Grid used in OCR-D
Figure 6.30 H:V Ratio
Figure 6.31 Test document typed in OCR-D
Figure 6.32 Test document after binarization
Figure 6.33 Extracted text from the image
Figure 6.34 Scanned document typeset in OCR-D
Figure 6.35 Extracted text when OCR algorithm is executed on document typed in OCR-D
Figure 6.36 Scanned document typeset in Yogesh
Figure 6.37 Extracted text when OCR algorithm is executed on document typed in Yogesh
Figure 6.38 Scanned document typeset in Surekh
Figure 6.39 Extracted text when OCR algorithm is executed on document typed in Surekh
List of Tables
Table 2.1 Vowels in Devanagari
Table 2.2 Consonants in Devanagari
Table 2.3 Character Frequency in Hindi
Chapter 1: Introduction
Having machines replicate human functions such as reading is an old dream. Over the last five
decades, however, machine reading has grown from a dream to reality. Machine reading uses the principles
of Optical Character Recognition (OCR). OCR has also become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. Since the mid
1950s, OCR has been a very active field of research and development. While OCR technology
for some scripts like Latin is fairly mature, and commercial OCR systems like Nuance OmniPage
Pro or ABBYY FineReader can perform with high accuracy, it is still under
development for other scripts like Chinese and Devanagari.
Although a great deal of research has been done on OCR applications for Latin script, even these
OCR-based machines are still not able to compete with human reading capabilities. This problem is more prominent for other scripts for which OCR technology is relatively newer. Typefaces
are very important in determining the performance of the OCR technology. Hence in order to
improve the accuracy of the OCR system, typefaces which are specially designed for OCR are
required. For Latin script, quite a few typefaces have been designed which are optimized for OCR.
These specially designed typefaces have a unique and well defined character set which allows for
greater accuracy in recognition. This in turn helps in building low cost systems which can recognize characters using simple algorithms. However, no Devanagari script font is available which is
designed specifically for machine reading and we address this problem in this report.
In general, documents contain text, graphics, and images. The procedure of reading the text component in such a document can be divided into three steps:
1. Document layout analysis in which the text component of the document is extracted.
2. Segmentation, i.e. extraction of characters from the text component of the document.
3. Recognition of the segmented characters.
Typically, the OCR character segmentation stage needs to be redesigned for each new script, while
the other stages are easier to port from one script to another and can be generalized over large
classes of languages. There is a great need for OCR related research in Indian languages as there
are many technical challenges which are specific to Devanagari script. With the spread of computers in organizations and offices, automatic processing and machine reading of paper documents
is gaining importance in India. Although a lot of research is going on in Devanagari script recognition, there are no commercial OCR systems focusing on Devanagari-based languages. OCR for
Devanagari is still in the research and development stage.
In chapter 2, we give an overview of the Devanagari script. We discuss the alphabets in the
Devanagari script and how they are grouped. Thereafter we discuss the anatomy of the script and
the graphical grouping of the alphabets.
In chapter 3, we have a look at the basics of OCR systems. First, we discuss the history and the applications of OCR system and then we look at one of the algorithms used in OCR systems for Devanagari
script. We analyse this algorithm discussing all the steps that are involved in character recognition.
In chapter 4, we discuss the features of typefaces which are designed specifically for OCR systems.
We discuss the need for a specially designed typeface for OCR and perform an in-depth analysis of
one of the most commonly used Latin typefaces for OCR systems, OCR-B. We discuss the precautions taken by its designer while designing the typeface. Finally, we look at the lack of Devanagari
typefaces designed for OCR systems.
In chapter 5, we discuss the precautions that need to be taken while designing Devanagari typefaces for OCR systems. We also look at the design decisions taken specifically for Devanagari
script because of the difference in the recognition algorithm of Devanagari and Latin script.
In chapter 6, we design a new Devanagari font optimized for OCR systems. We discuss each character
in detail and how the features of the letters are designed for improved performance in OCR systems.
We also present the evolution of our font from the first version to the final version. Thereafter, we test
this font on an OCR system which is available for free download on the internet.
Finally, we discuss the scope of future work and the improvements in the design which can further enhance the performance of the font.
Chapter 2: Devanagari Script
Devanagari script is the most important and widely used script in India. It is the script used by
many Indian languages like Hindi, Marathi, Nepali and Sanskrit. Several other languages like
Punjabi and Kashmiri use close variations of this script. It was also formerly used to write Gujarati.
Devanagari is a part of the Brahmi script family. An evolutionary transition can be seen from
Brahmi script to the Gupta script to the Nagari script to Devanagari script. It was first seen in the 7th
century A.D. and the transition to a more stable form can be seen from the 11th century onwards.
The current appearance of Devanagari was reached sometime around the 12th century.
Etymologically, the word Devanagari is considered to be a combination of two Sanskrit words: ‘Deva’,
meaning God, Brahma or sometimes the king, and ‘nagara’, meaning city. Together they literally
form the ‘city of god’, or the script used in the city of god. The use of the name Devanagari is
relatively recent; the older term Nagari is also used.
The Devanagari script represents sounds consistently. Unlike letters of the English
alphabet, which can be pronounced in different ways, the letters of the Devanagari script always have the
same pronunciation (with a few minor exceptions). Some of the conceptual differences between the Latin
and Devanagari scripts are as follows:
• In Devanagari script each character has a horizontal bar (Shiro Rekha) at the top. In contemporary usage the Shiro Rekha is broken between words, to differentiate between two words.
• Devanagari alphabets do not have distinct letter cases, i.e. upper and lower case characters.
• The concept of matras is not present in Latin script. They can occur as standalone characters
or with other alphabets to modify their sound.
2.1 Alphabets
There are around 50 basic characters in the script. Vowels and consonants are called
Swaras and Vyanjanas respectively. The grouping of vowels and consonants in Devanagari is done
on the basis of the phonetic point of articulation. Within a word, vowels often take modified shapes
called modifiers or matras. Consonant modifiers are also possible. Moreover, 2 to 5 consonants
can combine to form compound characters called conjuncts, which may partly retain the shape
of the constituent consonants. Along with these there also exists a set of signs or diacritical marks
which indicate the nasalization of vowels, the use of Persian sounds, etc.
Vowels
Devanagari in its most elaborate form has 18 vowels out of which 11 are frequently used. Others
can be seen in the Vedic and non-Vedic Sanskrit text. Vowels in Devanagari are transcribed in two
distinct forms: the independent form, and the dependent (matra) form. The independent form is
used when the vowel letter appears alone, at the beginning of a word, or immediately following
another vowel letter. Matras are used when the vowel follows a consonant.
Independent form    Modifier or Matra    Independent form    Modifier or Matra
अ                   None                 ए                   ◌े
आ                   ◌ा                   ऐ                   ◌ै
इ                   ि◌                   ऎ                   ◌ॆ
ई                   ◌ी                   ओ                   ◌ो
उ                   ◌ु                   औ                   ◌ौ
ऊ                   ◌ू                   ऒ                   ◌ॊ
ऋ                   ◌ृ                   ऌ                   ◌ॢ
ॠ                   ◌ॄ                   ॡ                   ◌ॣ
Table 2.1 Vowels in Devanagari
Apart from these, there also exists another set of vowels which has been added to traditional
Devanagari to expand its range. For example, ऑ is used in transliterations and in English loan
words like ball (बॉल).
Consonants
There are around 33 consonants in Devanagari script, which are grouped phonetically. The first set
of 25 consonants are called occlusive, and the remaining 8 are called non-occlusive. The occlusive consonants
are further divided into five groups: gutturals, palatals, cerebrals or retroflex, dentals and labials.
The first four consonants in each of these groups are divided into two classes, plosive and voiced
plosive, and the last consonant is the nasal consonant. The plosive and voiced plosive are again
divided into unaspirated and aspirated versions (each having one character).
The 8 non-occlusive consonants are divided into three groups, semivowels or approximants, sibilants
and aspirate, which have four, three and one character respectively.
Occlusive Consonants
             Plosive                     Voiced Plosive              Nasal
             Unaspirated   Aspirated     Unaspirated   Aspirated
Gutturals    क             ख             ग             घ             ङ
Palatals     च             छ             ज             झ             ञ
Cerebrals    ट             ठ             ड             ढ             ण
Dentals      त             थ             द             ध             न
Labials      प             फ             ब             भ             म
Non Occlusive Consonants
Semivowels   य   र   ल   व
Sibilants    श   ष   स
Aspirate     ह
Table 2.2 Consonants in Devanagari
Conjunct
Conjuncts are combinations of two to five consonants. There are about a thousand conjuncts in
Devanagari script. Some of these conjuncts partly retain the shape of the constituent consonants,
while there are others like द्य (द् + य) which are not clearly derived from the letters making up
their components.
Diacritics
Diacritics are glyphs added to a letter or basic glyph to change the sound of the letter. Some of the
commonly used diacritics in Devanagari are Visarga, Chandra, Chandrabindu, Halanta and Nukta. Visarga (◌ः)
is an unvoiced variation of ह. Chandra is an open mid front rounded independent vowel ऍ. In its
dependent form it is placed on top of the consonant (◌ॅ). Chandrabindu (◌ँ) is used to represent the
inherent nasalization of the vowel.
Halanta (◌्) is used to represent a lone consonant without a vowel. It kills the vowel अ and reduces
the consonant to its base form. Nukta (◌़) is used to represent the Persian sounds encountered in
some of the borrowed Urdu words, like ज़ for ظ ز ذ or ग़ for غ.
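The role of the Halanta in forming conjuncts can be illustrated with a small sketch. It assumes the text is stored in Unicode, which is our assumption and is not discussed above; in that encoding the Halanta is the code point U+094D (named VIRAMA), and a conjunct such as द्य is stored as the sequence consonant, Halanta, consonant. The helper name make_conjunct is hypothetical.

```python
import unicodedata

# Assumption: text is encoded in Unicode, where the Halanta is U+094D.
HALANTA = "\u094d"

def make_conjunct(*consonants):
    """Join consonants with a Halanta so that all but the last lose the inherent vowel."""
    return HALANTA.join(consonants)

if __name__ == "__main__":
    conjunct = make_conjunct("द", "य")
    print(conjunct)
    for ch in conjunct:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Whether the sequence is drawn as a single conjunct glyph depends on the font and the text shaping engine, not on the encoding itself.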
2.2 Anatomy
The anatomy of a letter can be defined as a system which depicts the structural form of a letter,
describing the key features of a letter in a typeface. The first attempt at a graphical classification of
Devanagari script was made by S. V. Bhagwat. He grouped letters on the basis of graphical similarities as shown in figure 2.1.
Figure 2.1 Bhagwat’s grouping of letters on the basis of graphical similarities[1]
He has also defined guidelines for the letters and terminology for some of the graphical elements
present in the letters which are shown in figure 2.2.
Figure 2.2 Bhagwat’s guidelines[1]
The topmost line is the Rafar Line, which is followed by the Matra Line. The Matra Line denotes the
top of the upper matras. After the Matra Line comes the Head Line, which is the top of the
Shiro Rekha. The Head Line is followed by the Upper Mean Line and the Lower Mean Line. The Upper
Mean Line indicates the point where the actual letter starts, for example the upper part of the
counter of ‘ब’ or ‘व’. The Lower Mean Line denotes the point where the characteristic feature of the
letter comes to an end, for example the lower part of the counter of ‘ब’ or ‘व’. This is followed by the
Base Line, which denotes the end of the character and the point where the lower matra starts. The
lowermost line is the Rukar Line, which denotes the end of the lowest portion of the Rukar.
Bapurao Naik also attempted the graphical grouping of letters. Naik organized letters graphically
in five groups on the basis of the position of the kana or the vertical bar. The important aspect of
this grouping is that ए and ऐ are missing from the group. Naik's grouping of letters is shown in
figure 2.3.
Figure 2.3 Naik’s grouping of letters on the basis of the position of the vertical bar or the kana[1]
A few others, such as M. W. Gokhale and Mahendra Patel, have also proposed different methods to
group the letters and create the vocabulary of Devanagari script. A comprehensive study of the anatomy of Devanagari script can be found in [2].
2.3 Character Frequency in Hindi
Table 2.3 shows the frequencies of letters in the Hindi language. The list is based on a collection of Hindi texts
totalling 978,430 characters (238,604 words), of which 736,216 characters were used for the counting.
The texts consist of a good mix of different literary genres. If other texts were used as a
basis, the result would of course be slightly different.
Character   Frequency    Character   Frequency    Character   Frequency
◌ा          8.22%        व           1.62%        फ           0.35%
क           7.14%        ◌ु          1.45%        इ           0.31%
◌े          6.85%        ज           1.39%        ◌ँ          0.30%
र           5.91%        ए           1.34%        ष           0.27%
ह           4.82%        ग           1.31%        घ           0.20%
स           3.78%        च           1.16%        ई           0.20%
न           3.48%        थ           1.15%        झ           0.19%
◌ी          3.47%        अ           1.01%        ठ           0.17%
◌ं          3.44%        औ           0.94%        ◌ौ          0.15%
म           3.28%        ◌ू          0.81%        ण           0.13%
ि◌          3.20%        उ           0.78%        ◌ृ          0.10%
◌्          3.02%        श           0.76%        ओ           0.10%
त           2.89%        ड           0.75%        ◌ॉ          0.10%
प           2.66%        ख           0.70%        ढ           0.09%
ल           2.45%        ◌़          0.67%        ऊ           0.05%
◌ो          2.21%        भ           0.67%        ऐ           0.03%
य           2.20%        आ           0.66%        ऑ           0.03%
◌ै          1.96%        ट           0.57%        ञ           0.01%
ब           1.78%        छ           0.45%        ◌ः          0.01%
द           1.68%        ध           0.36%        ◌ॅ          0.00%
Table 2.3 Character Frequency in Hindi[3]
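A table of this kind can be produced with a simple counting routine. The sketch below is only an illustration and not the procedure used in [3]: it assumes Unicode text and counts only letters, matras and signs from the Devanagari block, skipping spaces, punctuation and digits. That filter is our assumption about why fewer characters were counted than were present in the texts.

```python
import unicodedata
from collections import Counter

def devanagari_frequencies(text):
    """Return {character: percentage}, most frequent first.

    Assumption: only letters (Lo) and combining marks (Mn, Mc) in the
    Devanagari block U+0900..U+097F are counted.
    """
    counted = [
        ch for ch in text
        if "\u0900" <= ch <= "\u097f"
        and unicodedata.category(ch) in ("Lo", "Mn", "Mc")
    ]
    total = len(counted)
    if total == 0:
        return {}
    return {ch: 100.0 * n / total for ch, n in Counter(counted).most_common()}

if __name__ == "__main__":
    for ch, percent in devanagari_frequencies("कमल का घर।").items():
        print(f"U+{ord(ch):04X} {percent:.2f}%")
```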
Chapter 3: Optical Character Recognition
With the recent emergence and widespread application of multimedia technologies, there is
increasing demand to create a paperless environment in our daily life. A wide variety of information which has conventionally been stored on paper is now converted to electronic form for better
storage and intelligent processing. The primary purpose of such systems is to facilitate the retrieval
of information based on a given query. Representation of documents as images is undesirable
because it does not allow the user to edit or search the document. These limitations can be overcome by representing the data as text, which takes less storage space and is also easier to process.
This kind of conversion can be achieved by Optical Character Recognition.
Optical Character Recognition or OCR is a technology which allows a machine to recognize text
from an image. It is the conversion of a scanned image of printed or hand-written text to machine
encoded text. It is important for computerizing printed texts so that they can be searched electronically, stored compactly or used for machine processing like translation or text to speech
conversion.
3.1 History of OCR
The dream of making machines perform human tasks like reading is not new. The first attempt was
in 1870, when C. R. Carey invented an image transmission system. During the first decades of the 20th
century many attempts were made, but the modern version of OCR came into existence in the 1940s.
The First OCR
The first OCR was installed at Reader’s Digest in 1954. It was used to convert typewritten reports to
punched cards so that they could be input into the computer.
First Generation OCR
The first commercial OCR systems appeared from 1960 to 1965. These systems could read only constrained letter
shapes. The characters were specifically designed for machine recognition and were not very natural.
With time these systems were able to recognize up to 10 different fonts.
Second Generation OCR
The reading machines of the second generation appeared in the middle of the 1960s and early
1970s. These systems were able to recognize regular machine-printed characters and also had
hand-printed character recognition capabilities. The first one of this kind was the IBM 1287. In this
period, characters in Latin script were also standardized. OCR-A and OCR-B were designed
in this period. These fonts were designed so that they could be recognized by a machine but were
still readable by a human.
Figure 3.1 OCR-A (top); OCR-B (Bottom)
Third Generation OCR
These first appeared in the middle of the 1970s. The challenge was to recognize poorly scanned documents and hand-written character sets. Low cost and high accuracy were also main objectives, which
were achieved in part because of advances in technology.
Present OCR
Today OCRs are available at a very low cost, and OCR systems are also available as software packages. Omnifont OCRs are available for Latin script. Although systems are available for Latin,
Cyrillic, Far Eastern and many Middle Eastern scripts, such systems for Devanagari are still in the
research and development stage. This is mainly due to the lack of a commercial market.
3.2 Applications of OCR
OCR has been used to computerize data for dissemination and processing. The first major use of
OCR was in the banking industry, where it was used to read credit card numbers. Nowadays
OCRs are widely used for automated data entry, especially in banks, where they are used to read account
numbers, customer identification, amounts of money, etc.
OCR is also used for text entry, i.e. extracting text out of a scanned document. The reading machine
is used to process large amounts of text, which can then be used for several other purposes such as
searching within the document.
OCRs also have a major application for the blind. This was one of the earliest envisioned applications
of OCRs. Combined with text to speech conversion, OCRs would enable blind people to read
printed documents. OCR can also be used for automated license plate reading and can help in
reading specially designed forms automatically. Once the text is computerized it can be used for
machine processes like text to speech conversion, language translation and text mining.
3.3 Recognition of Devanagari Script
The most important principle of automatic pattern recognition is teaching the machine what kinds
of patterns may be present and what they look like. In OCR the patterns are letters, numbers and
punctuation marks. The machine is trained to recognize the patterns by showing it all the kinds of characters
present in the script. This period is referred to as the training period. On the basis of these examples the machine builds a prototype of each character. Then, during recognition, the machine
compares the unknown character to the prototypes and assigns the character which is the closest
match. The four steps in recognition, shown in figure 3.2, are as follows:
1. Preprocessing
2. Segmentation
3. Recognition
4. Post Processing
Preprocessing
The text document is generally scanned at 300 or 400 DPI. Preprocessing is then done to improve
the accuracy of the recognition algorithm. The main steps in preprocessing are noise removal, binarization and skew correction.
Noise Removal or De-Noising
The main sources of noise in the input image are as follows:
• Noise due to the quality of the paper on which the printing is done.
• Noise induced by printing on both sides of the paper or by the quality of printing.
• Noise added by the scanner source brightness and sensors.
All this noise reduces the accuracy of the OCR system, so a noise
correction routine becomes necessary. To reduce the amount of noise, the image is passed
through a mean filter, in which the intensity of each pixel is replaced by the average intensity
of the pixels surrounding it. After de-noising, the image is subjected to binarization and skew (or tilt)
correction.
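The mean filter described above can be sketched as follows. This is a minimal illustration and not the implementation used in any particular OCR system. We assume a grayscale image stored as a list of rows of integer intensities, a 3×3 window that includes the pixel itself, and a window that is simply clipped at the image border; all three choices are our assumptions.

```python
def mean_filter(image):
    """Apply a 3x3 mean filter to a grayscale image (list of rows of ints)."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = count = 0
            # Visit the 3x3 neighbourhood, clipped at the image border.
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    total += image[ny][nx]
                    count += 1
            result[y][x] = round(total / count)
    return result

if __name__ == "__main__":
    speckled = [[255, 255, 255],
                [255, 0, 255],
                [255, 255, 255]]
    for row in mean_filter(speckled):
        print(row)
```

A mean filter also softens character edges, which is one reason why binarization follows de-noising.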
Figure 3.2 Procedure of Devanagari Recognition: Preprocessing (noise removal, binarization and skew or tilt correction), Segmentation (line, word and character segmentation), Recognition, Post Processing, Output Text
Binarization
Printed documents generally have black text on a white background. Hence most OCR algorithms are designed to interpret bi-level images (images that have only two possible pixel values,
i.e. black and white). The process of converting colored or grayscale images to bi-level images is
often known as binarization or thresholding. A comprehensive study of binarization methods for OCR can be found in [4].
Figure 3.3 Image before binarization (left); Image after binarization (right)
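As an illustration of thresholding, the sketch below picks a global threshold with Otsu's method and marks every pixel at or below it as ink. Otsu's method is our choice for the example; the methods surveyed in [4] may differ. The image is assumed to be a list of rows of 8-bit intensities, and the convention that 1 means black (ink) is also an assumption.

```python
def otsu_threshold(image):
    """Return the grey level that maximises the between-class variance."""
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * n for level, n in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_dark = sum_dark = 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue          # no dark pixels yet
        if weight_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold

def binarize(image, threshold=None):
    """Return a bi-level image: 1 for ink (dark), 0 for background."""
    if threshold is None:
        threshold = otsu_threshold(image)
    return [[1 if value <= threshold else 0 for value in row] for row in image]
```

A single global threshold works well for evenly lit scans; unevenly lit pages usually need a locally adaptive threshold instead.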
Skew (or Tilt) Correction
When a document is scanned a small amount of skew (or tilt) is unavoidable. Skew angle is the angle
that the text lines make with the horizontal line. Skew estimation and correction are important preprocessing steps of document layout analysis and character recognition. One of the popular skew estimation techniques is based on projection profile of the documents. The horizontal/vertical projection
profile is a histogram of the number of black pixels along horizontal/vertical scan lines. In Devanagari, the Shiro Rekha is used to find the skew angle. The algorithm for skew correction can be found in [5].
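One common way to use projection profiles for skew estimation is to try a range of candidate angles and keep the one that gives the sharpest horizontal profile; in Devanagari the long Shiro Rekha produces a very pronounced peak once the text is level. The sketch below follows that idea and is not the algorithm of [5]. The search range, the step size and the sum-of-squares sharpness score are all assumptions, and a positive angle here means the text line runs downwards to the right in image coordinates.

```python
import math

def profile_score(ink_pixels, angle_degrees):
    """Sharpness of the horizontal profile after undoing a skew of the given angle."""
    slope = math.tan(math.radians(angle_degrees))
    rows = {}
    for x, y in ink_pixels:
        row = round(y - x * slope)          # row the pixel would fall in if levelled
        rows[row] = rows.get(row, 0) + 1
    # Concentrated profiles (a few tall peaks) give a larger sum of squares.
    return sum(count * count for count in rows.values())

def estimate_skew(binary, max_angle=5.0, step=0.5):
    """Return the candidate angle in degrees with the sharpest profile."""
    ink_pixels = [(x, y) for y, row in enumerate(binary)
                  for x, value in enumerate(row) if value]
    steps = int(max_angle / step)
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_score(ink_pixels, angle))
```

The accuracy of the estimate is limited by the step size and by the width of the text line, since a short line shifts by less than one pixel for small angles.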
Segmentation
Segmentation is the process of dividing the page into its constituent elements. The aim of segmentation is to extract all the characters from the text in the image so that they can be
recognized.
The segmentation phase is a very crucial stage, since this is where most of the errors occur. Even in
good quality documents, adjacent characters sometimes touch each other due to inappropriate scanning resolution or the design of the characters. This can create problems in segmentation.
Incorrect segmentation leads to incorrect recognition. Segmentation in OCR occurs in three steps: line segmentation, word
segmentation and character segmentation. While the precise algorithms for segmentation can be
found in [6] and [7], an overview of the segmentation process is given below.
Line Segmentation
In line segmentation our aim is to separate the lines of text from the image. For this the global
horizontal projection profile method is used, which constructs a histogram of the black pixels
in every row, as shown in figure 3.4. Based on the peak/valley points of the histogram, individual
lines are separated. The steps for line segmentation are as follows:
1. The horizontal projection profile for the image is created.
2. Using the projection profile, the points at which each line starts and ends are found.
3. For a line of text, the upper line is drawn at the point where we start finding black pixels and the lower
line is drawn where we start finding an absence of black pixels. The process continues for the next
line and so on.
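The three steps above can be sketched as follows, assuming a binarized and deskewed image stored as a list of rows in which 1 marks a black pixel. The function names are ours, and the sketch treats any row without black pixels as a gap between lines, which is a simplification for noisy scans.

```python
def horizontal_profile(binary):
    """Step 1: number of black pixels in every row."""
    return [sum(row) for row in binary]

def segment_lines(binary):
    """Steps 2 and 3: return (top, bottom) row pairs, bottom exclusive, one per text line."""
    lines, top = [], None
    for y, count in enumerate(horizontal_profile(binary)):
        if count and top is None:
            top = y                      # first row containing black pixels
        elif not count and top is not None:
            lines.append((top, y))       # first row without black pixels
            top = None
    if top is not None:                  # text touches the bottom edge of the image
        lines.append((top, len(binary)))
    return lines

if __name__ == "__main__":
    page = [[0, 0, 0],
            [1, 1, 0],
            [0, 1, 0],
            [0, 0, 0],
            [1, 0, 1]]
    print(segment_lines(page))
```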
Figure 3.4 Horizontal Projection Profiles of a document for line segmentation[8]
Word Segmentation
After line segmentation the boundary of the line (i.e. the top and bottom of the line) is known. Word
segmentation extracts the boundaries of the words from these segmented lines. It is done in the same way as line segmentation, but in place of horizontal projection profiling, vertical
projection profiling is done, as shown in figure 3.5. The steps for word segmentation are as follows:
1. The vertical projection profile for the line is created.
2. Using the projection profile, the points at which each word starts and ends are found.
3. Vertical lines are then drawn at the start and end of each word. The process continues for the
next word and so on.
Figure 3.5 Vertical Projection Profiles of a document for word segmentation[8]
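Word segmentation can be sketched in the same way using the vertical projection profile of one segmented line. Because the Shiro Rekha joins the characters of a word, the profile of a Devanagari word normally has no empty columns, so empty columns indicate the space between words. The min_gap parameter, which ignores gaps narrower than a given number of columns, is a hypothetical addition of ours for broken head lines.

```python
def vertical_profile(line):
    """Number of black pixels in every column of a line image (list of rows)."""
    return [sum(column) for column in zip(*line)]

def segment_words(line, min_gap=1):
    """Return (left, right) column pairs, right exclusive, one per word."""
    profile = vertical_profile(line)
    words, left, gap = [], None, 0
    for x, count in enumerate(profile):
        if count:
            if left is None:
                left = x                 # first inked column of a new word
            gap = 0
        elif left is not None:
            gap += 1
            if gap >= min_gap:           # gap is wide enough to end the word
                words.append((left, x - gap + 1))
                left, gap = None, 0
    if left is not None:                 # word runs up to the right edge
        words.append((left, len(profile) - gap))
    return words

if __name__ == "__main__":
    line = [[1, 1, 1, 0, 0, 1, 1],
            [0, 1, 0, 0, 0, 1, 0]]
    print(segment_words(line))
```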
Character Segmentation
Once the words are segmented, the next step is to extract the characters from these words. A
word in Devanagari script is further divided into three parts, as shown in figure 3.6:
1. Top
2. Core (or Middle)
3. Bottom
The top strip and the core strip are separated by the Head Line or the Shiro Rekha, but there is
no separation between the core strip and the bottom strip. The top strip contains the top matras
and the bottom strip contains the bottom matras or the descenders of some of the characters. The
Shiro Rekha is a unique feature of Devanagari script and helps to identify Devanagari in a multilingual document. It also helps in the identification of the baseline of the text.
Figure 3.6 Three parts of a Devanagari word (अकुलीन): Top Strip, Head Line (Shiro Rekha), Core Strip and Bottom Strip
The steps of character segmentation shown in figure 3.7 are as follows:
1. The Shiro Rekha is identified and the top strip is separated from the core and bottom strips. The text
is now divided into two parts: a) the Shiro Rekha and the top matras, and b) the core-bottom part of the text.
2. The core strip and bottom strip are identified within the core-bottom part of the text, and the lower
matras are extracted.
3. The core strip is segmented into different letters or characters, which may include conjuncts, punctuation or numerals.
4. Conjuncts are segmented into single characters.
5. The Shiro Rekha is removed from the extracted top strip and the top matras are extracted.
6. Once the segmentation of the core characters is done, the Shiro Rekha is put back on top of the individual characters.
Figure 3.7 The procedure of Hindi character segmentation[6]
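Steps 1 and 3 of this procedure can be sketched as follows. The sketch assumes that the Shiro Rekha is the band of rows with the most black pixels in the word image, which is a common heuristic and our assumption rather than the method of [6]. The tolerance parameter is hypothetical. Separating the bottom strip, splitting conjuncts and extracting matras (steps 2, 4 and 5) are not covered.

```python
def find_head_line(word, tolerance=0.8):
    """Return (top, bottom) rows of the Shiro Rekha band, bottom exclusive."""
    profile = [sum(row) for row in word]
    peak_row = profile.index(max(profile))
    limit = tolerance * profile[peak_row]
    top = bottom = peak_row
    # Grow the band while neighbouring rows are almost as dense as the peak.
    while top > 0 and profile[top - 1] >= limit:
        top -= 1
    while bottom + 1 < len(profile) and profile[bottom + 1] >= limit:
        bottom += 1
    return top, bottom + 1

def segment_characters(word):
    """Remove the head line and split the rest of the word at empty columns."""
    top, bottom = find_head_line(word)
    body = word[bottom:]                 # core and bottom strips without the head line
    profile = [sum(column) for column in zip(*body)]
    boxes, left = [], None
    for x, count in enumerate(profile + [0]):   # trailing 0 closes the last box
        if count and left is None:
            left = x
        elif not count and left is not None:
            boxes.append((left, x))
            left = None
    return (top, bottom), boxes
```

For a word image whose second row is a full-width head line, segment_characters returns the band of that row together with one column range per character below it.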
Recognition
Segmentation is followed by recognition of the characters. The two main methods used for recognizing characters are as follows:
• Template Matching
• Feature Based Recognition
Template Matching
In this method a matrix containing the image of the input character is matched against the set of
prototypes created during training. The distance between the pattern and each prototype is
computed, and the pattern is assigned to the character whose prototype matches best.
The technique is simple and easy to implement in hardware. However, it is sensitive to
noise and style variations and has no way of handling rotated characters.
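A minimal sketch of template matching is given below. It assumes equally sized binary images and uses the Hamming distance (number of differing pixels) as the distance measure; the text does not specify the distance, so this choice and the toy prototypes are assumptions made only for illustration.

```python
def hamming_distance(a, b):
    """Number of pixels at which two equally sized binary images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(pattern, prototypes):
    """Return the label of the prototype closest to the pattern."""
    return min(prototypes,
               key=lambda label: hamming_distance(pattern, prototypes[label]))


# Hypothetical 3x3 prototypes created "in training".
prototypes = {
    "bar": [[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]],
    "box": [[1, 1, 1],
            [1, 0, 1],
            [1, 1, 1]],
}

# A bar with one noisy pixel is still closest to the "bar" prototype.
noisy_bar = [[0, 1, 0],
             [0, 1, 1],
             [0, 1, 0]]
label = classify(noisy_bar, prototypes)
```

Because every pixel is compared position by position, a shifted or rotated input quickly accumulates distance, which is the sensitivity noted above.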
Feature Based Recognition
In this method significant features of the pattern are measured and examined. These features are
then compared with the prototypes developed in the training phase, and the prototype whose
description matches most closely gives the recognized character. Typical features are the presence
of a vertical bar or the number of stroke junctions.
The recognition algorithms are described in detail in [9].
Post Processing
The result of recognition is a set of individual characters. On their own, however, these characters do not carry the
complete information, so they are combined to form strings. This
process is called grouping. Grouping depends on the location of the characters in the document:
characters which are close to each other are grouped together to form a word, since the distance between
two words is larger than the distance between the letters of a word.
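The grouping rule can be sketched as follows, assuming that each recognized character on a text line is described by the left and right edge of its bounding box and that a fixed gap threshold separates letter spacing from word spacing. Both the box representation and the threshold value are illustrative assumptions.

```python
def group_into_words(boxes, gap_threshold):
    """Group character boxes (left, right) of one text line into words.

    A new word starts whenever the horizontal gap to the previous
    character is larger than gap_threshold.
    """
    words = []
    for left, right in sorted(boxes):
        if words and left - words[-1][-1][1] <= gap_threshold:
            words[-1].append((left, right))
        else:
            words.append([(left, right)])
    return words


# Five characters: gaps of 2 pixels inside words, 13 pixels between words.
boxes = [(0, 8), (10, 18), (20, 27), (40, 48), (50, 59)]
words = group_into_words(boxes, gap_threshold=5)
```

With these values the first three boxes form one word and the last two form a second word.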
Chapter 4: Typefaces for OCR
A typeface is the design of a collection of alphanumeric symbols. A typeface may include letters, numerals, punctuation, various symbols, and more, often for multiple languages. It is usually grouped
into a family containing individual fonts for italic, bold, and other variations of the primary
design. Although the terms typeface and font are often used interchangeably, a font is the physical embodiment of a typeface (i.e. a computer file or a piece of metal type in letterpress). The typeface is what we see,
whereas the font is what we use. In the rest of this thesis, font and typeface are used interchangeably.
4.1 Latin Fonts for OCR
Typefaces are designed specifically for OCR so that they can be read by low-cost systems. These fonts have
unique and well-defined character sets, which allow for greater accuracy in recognition. The most
popular Latin script fonts designed for OCR are as follows:
• OCR-A
• OCR-B
OCR-A
OCR-A is a monospaced font designed by American Type Founders. It was developed to meet the
standards set by the American National Standards Institute for the processing of documents by
banks, credit card companies and similar businesses. The design was kept simple so that it can be read by
machines, but it is very difficult for human eyes to read.
ABCDEFGHIJKLM
NOPQRSTUVWXYZ
abcdefghijklm
nopqrstuvwxyz
0123456789
&$£%.!?
Figure 4.1 Characters of OCR-A
OCR-B
OCR-B was designed by Adrian Frutiger for the European Computer Manufacturers Association
(ECMA). It is a monospaced font and was designed following the standards of ECMA. The first
version contained 109 characters. The main objective was to create an international standard for
optical recognition. ECMA also wanted to avoid the wider acceptance of OCR-A because of its
unnatural look. Therefore, OCR-B was designed to be pleasant to human eyes.
ABCDEFGHIJKLM
NOPQRSTUVWXYZ
abcdefghijklm
nopqrstuvwxyz
0123456789
&$£%.!?
Figure 4.2 Characters of OCR-B
It pushed the limits of optical recognition. Among machine-readable typefaces, it was the first
to give consideration to aesthetics. OCR-B was declared a worldwide standard in 1973.
The principle of OCR-B was that all characters must differ from each other
by at least 7% in the worst possible case. To check this, two characters were overlapped in such a
way that they overlapped optimally, as shown in figure 4.3. This test was also carried out using two
different printing weights: a fine weight, caused by a lack of ink in the typewriter ribbon, and a fat weight,
caused by ink blots.
Figure 4.3 Optimal Overlapping of characters
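The overlap test can be approximated on bitmaps as in the sketch below. The exact way in which the 7% was measured is not stated here, so the sketch assumes one plausible definition: the share of the combined ink of two glyphs that is not common to both, minimized over small shifts of one glyph (the "optimal" overlap). The glyphs, the shift range and the function names are hypothetical, and both glyphs are assumed to contain at least one ink pixel.

```python
def ink(glyph):
    """Set of (row, col) coordinates of the ink pixels of a binary glyph."""
    return {(r, c) for r, row in enumerate(glyph) for c, v in enumerate(row) if v}


def difference_ratio(a, b):
    """Share of the combined ink that is not common to both pixel sets."""
    return len(a ^ b) / len(a | b)


def worst_case_difference(glyph_a, glyph_b, max_shift=2):
    """Smallest difference ratio over all small shifts of glyph_b."""
    a, b = ink(glyph_a), ink(glyph_b)
    shifts = range(-max_shift, max_shift + 1)
    return min(
        difference_ratio(a, {(r + dr, c + dc) for r, c in b})
        for dr in shifts
        for dc in shifts
    )


# Two toy glyphs: a ring, and the same ring with one extra "tail" pixel.
ring = [[1, 1, 1, 1],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1]]
ring_with_tail = [[1, 1, 1, 1, 0],
                  [1, 0, 0, 1, 0],
                  [1, 0, 0, 1, 0],
                  [1, 1, 1, 1, 1]]
difference = worst_case_difference(ring, ring_with_tail)
passes = difference >= 0.07
```

For these toy glyphs the best overlap leaves 1 of 13 pixels unshared, about 7.7%, so the pair just satisfies a 7% threshold under this definition.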
Generous character spacing was provided since it is important for correct recognition, whereas serifs
were avoided because they increase the common coverage area of the characters and therefore
the similarity between characters, as shown in figure 4.4.
Figure 4.4 More coverage ratio because of serifs
First Test Version
The first test version was designed in 1963 and had 109 characters; it is shown in figure 4.5. In
this version the bowl shape in the uppercase letters was constant, whereas there
were two types of bowl shapes in the lowercase letters: a round bowl, for example in c d p, and a flat
bowl in b g q. Initially the height of the numerals and the uppercase letters was kept the same,
which was then changed before the first test version. All the numerals also had a dynamic shape, but
the curves were different. The uppercase O was very similar to the numeral 0.
Figure 4.5 First test version of OCR-B[10]
First Published Version
Figure 4.6 First published version of OCR-B[10]
OCR-B was first published as Standard ECMA-11 in 1965, containing 112 characters, and is shown
in figure 4.6. Some characters underwent considerable changes. The flat dynamic bowl of b g q
was converted to a static bowl shape to match the rest of the characters. The
height of the uppercase characters was also reduced to differentiate them more from the numerals.
W's outer diagonals were curved in the new version. Also, the
numeral 0 received a more oval shape to differentiate it from the character O, and the dot of j was
now normally placed. Altogether the typeface now had a more consistent look and feel to it.
Some further corrections took place from 1969 onwards. The British pound sign was considerably
changed. There were still problems in differentiating D O 0 and B 8 &. With D the curve stroke now
started directly at the stem; O was given a much more oval shape and 0 became more angular. B
was made wider, which resolved the problem of the B-8 pair, and the upper bowl of & was made smaller. A
horizontal bar was added to j (just like i) and the descender of y was curved. None of these changes was beneficial in terms of shape or aesthetics, but they were very important for differentiating the characters.
Even with the international standard in 1976 the OCR-B project was far from over. The number of
characters increased constantly, from 121 characters in 1976 to 147 in 1994. Also in 1994, a designer
named Alexander Branczyk designed a proportional version of OCR-B called OCRBczyk. It
featured much finer visual features but remained true to the design of OCR-B.
Figure 4.7 OCR-B (top), OCRBczyk (bottom)
Application of OCR-B
Since the 1960s machine-readable typefaces have been used for data recognition. They can be found on
cheques, bank statements, credit cards and postal forms. OCR-B can also be found on paying-in forms
and identity cards in many countries. Most barcode numbers are also set in OCR-B.
4.3 Devanagari Font for OCR
Although the development of OCR systems for Indian scripts is an active area of research today, not much
work has been done on designing a Devanagari font optimized for OCR. Unlike for the Latin script,
there is not even one commercially available Devanagari font which is optimized for Devanagari
OCR systems.
A few of the most common Hindi fonts are KrutiDev, Mangal, Surekh and Yogesh, but none of
them is designed for OCR. All these fonts have some letters with parts above the Shiro Rekha.
KrutiDev and Yogesh also have some letters, like स, which are not connected horizontally. The
stroke width of Yogesh is also too thin for an OCR font. Surekh is not a monolinear font, which is why
it cannot be used for OCR. Therefore there is a need for a Devanagari font designed for OCR.
Chapter 5: Font Design for OCRs
Font design is the art and process of designing typefaces. Regardless of the method used to specify
a type design, all the characters of a typeface should have artistic consistency. No character should look
small or large compared to the other characters in the font. When designing type for
OCR systems, however, special precautions have to be taken for better accuracy. Many of the principles of type
design for Latin OCR fonts apply directly to Devanagari fonts, but because of the difference in
the segmentation algorithm extra care needs to be taken while designing for Devanagari OCR systems.
5.1 Design Features Important for OCR
Every letter has to be more strongly differentiated than is customary in type design. Most of the
principles for designing type for OCR remain the same as for Latin, while special care needs to be taken
for Devanagari because of the difference in character segmentation. However, many constraints
which were present while designing OCR-B no longer apply because of advances
in technology. For example, earlier OCR systems were only able to detect monospaced fonts, but
current systems can also recognize proportional fonts accurately. Some of the things that should be taken care of while designing type for OCR are as follows.
One Character should Never be Contained in Another Character
No character, when overlapped with another, should lie completely inside the other letter. This is
very important for correct recognition. To achieve this, additional features or elements are added
to a character to differentiate it from the others, as shown in figure 5.1. Similar-looking
characters like प and ष can also be given different counter sizes.
Figure 5.1 Addition of elements in OCR-B to differentiate two characters
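The containment rule can be checked mechanically on glyph bitmaps. The sketch below assumes binary glyphs and tests whether some small shift places every ink pixel of one glyph on ink of the other; the toy shapes and the shift range are hypothetical and only illustrate the idea of adding an element to break containment.

```python
def ink(glyph):
    """Set of (row, col) coordinates of the ink pixels of a binary glyph."""
    return {(r, c) for r, row in enumerate(glyph) for c, v in enumerate(row) if v}


def is_contained(inner, outer, max_shift=2):
    """True if some small shift puts all ink of inner onto ink of outer."""
    a, b = ink(inner), ink(outer)
    shifts = range(-max_shift, max_shift + 1)
    return any(
        {(r + dr, c + dc) for r, c in a} <= b
        for dr in shifts
        for dc in shifts
    )


bar = [[1],
       [1],
       [1]]
box = [[1, 1, 1],
       [1, 0, 1],
       [1, 1, 1]]
# The same bar with an added foot, analogous to the elements added in OCR-B.
bar_with_foot = [[1, 0],
                 [1, 0],
                 [1, 0],
                 [1, 1]]

plain_bar_hidden = is_contained(bar, box)
modified_bar_hidden = is_contained(bar_with_foot, box)
```

The plain bar fits entirely inside the left side of the box, while the bar with the added foot no longer fits at any shift.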
Font should be Monolinear
Monolinear fonts are fonts whose vertical and horizontal strokes have the same visual weight.
If a font has varying stroke widths, the thin strokes may break at
small point sizes during scanning or binarization, which creates problems in
recognition.
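Whether a glyph is roughly monolinear can be estimated from a bitmap. The sketch below is a heuristic under stated assumptions: the thickness of vertical strokes is taken as the most frequent length of horizontal ink runs, and the thickness of horizontal strokes as the most frequent length of vertical ink runs. This works for simple shapes like the toy glyphs shown and is not a general stroke-width measurement.

```python
from collections import Counter


def run_lengths(line):
    """Lengths of the consecutive runs of ink (1s) in one row or column."""
    runs, current = [], 0
    for v in line:
        if v:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def typical_run(lines):
    """Most frequent ink run length over the given rows or columns."""
    counts = Counter(r for line in lines for r in run_lengths(line))
    return counts.most_common(1)[0][0]


def stroke_widths(glyph):
    """Estimated (vertical stroke thickness, horizontal stroke thickness)."""
    return typical_run(glyph), typical_run(zip(*glyph))


# Stem 2 pixels thick, bar 1 pixel thick: not monolinear.
contrast = [[1, 1, 0, 0, 0, 0]] * 5 + [[1, 1, 1, 1, 1, 1]]
# Stem and bar both 2 pixels thick: monolinear.
monolinear = [[1, 1, 0, 0, 0, 0]] * 4 + [[1, 1, 1, 1, 1, 1]] * 2
```

A large ratio between the two estimates indicates stroke contrast, and the thinner of the two strokes is the one at risk of breaking during binarization.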
Font should be Sans Serif
A serif is a small decorative line added at the end of some of the strokes that make up the basic form
of a character, as shown in figure 5.2. A typeface with serifs is called a serif typeface and a typeface without
serifs is called sans serif. Sans serif typefaces are preferred for OCR because serifs increase the
common coverage area of the characters and therefore the similarity between characters.
Figure 5.2 Serifs in a typeface (grey serifs)
Generous Character Spacing
Character spacing is the distance between two characters. White space between two characters
helps in character segmentation, but it should not be comparable to the width of the space character (' '). If the character spacing is insufficient, noise added while scanning can make the characters touch each other,
which creates problems in character segmentation.
Shadow characters should also be avoided. A character is said to be under the shadow of another
character if they do not physically touch each other but it is not possible to separate them merely
by drawing a vertical line. An example of shadow characters is shown in figure 5.3.
Although the segmentation algorithm handles shadow characters, they reduce the accuracy in some cases.
Figure 5.3 Shadow Characters[6]
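The definition of shadow characters translates directly into a test on column ranges. The sketch below assumes that each character is given as a list of (row, column) ink pixels of one connected component and that the two components do not touch; the pixel lists are hypothetical examples.

```python
def column_span(pixels):
    """Leftmost and rightmost column occupied by a character."""
    cols = [c for _, c in pixels]
    return min(cols), max(cols)


def is_shadowed(char_a, char_b):
    """True if no vertical line can be drawn between the two characters."""
    a_lo, a_hi = column_span(char_a)
    b_lo, b_hi = column_span(char_b)
    return a_lo <= b_hi and b_lo <= a_hi


# A character with a stem in column 0 and an overhang across columns 0-4.
overhang = [(0, c) for c in range(5)] + [(r, 0) for r in range(1, 4)]
# A short stroke below the overhang (column 3), not touching it.
under = [(2, 3), (3, 3)]
# A stroke well to the right (column 6).
apart = [(r, 6) for r in range(4)]

shadow_case = is_shadowed(overhang, under)
clear_case = is_shadowed(overhang, apart)
```

The first pair shares column 3 and therefore cannot be split by a vertical cut, whereas the second pair can.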
Big Closed Counters
The counter is the enclosed or partially enclosed circular or curved negative space (white space) of some letters
such as d, o, and s, as shown in figure 5.4.
Figure 5.4 Counters are the circular negative spaces (grey)
While designing for OCR, counters need to be kept large so that they do not get completely
filled by noise while scanning or by ink while printing. A filled counter can result in
faulty recognition, as the character can be confused with other characters. For example, if the counter
of ढ is filled, the OCR can confuse it with द.
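The size of closed counters can be measured on a bitmap with a flood fill. The sketch below assumes a binary glyph with a white margin around it and treats every white region that does not reach the image border as a closed counter; the function name and the toy glyph are illustrative.

```python
def closed_counters(glyph):
    """Pixel areas of all enclosed white regions of a binary glyph."""
    rows, cols = len(glyph), len(glyph[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r0 in range(rows):
        for c0 in range(cols):
            if glyph[r0][c0] or seen[r0][c0]:
                continue
            # Flood-fill one 4-connected white region.
            stack, area, touches_border = [(r0, c0)], 0, False
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                area += 1
                if r in (0, rows - 1) or c in (0, cols - 1):
                    touches_border = True
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not glyph[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            if not touches_border:
                areas.append(area)
    return areas


ring = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
# The same glyph with its counter filled by noise or ink.
filled = [row[:] for row in ring]
for r in (2, 3):
    for c in (2, 3):
        filled[r][c] = 1
```

The ring has one counter of 4 pixels, while the filled glyph has none; a counter whose area is only a few pixels at the target scanning resolution is a candidate for enlargement.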
Bold Strokes
Stroke width is another feature of a font which is very important for recognition, as thin strokes
can get smudged or broken because of poor quality printing and scanning. A bold stroke is
also helpful in the process of binarization.
5.2 Special Care for Devanagari
Apart from the precautions stated above, some special care has to be taken for Devanagari because
of its complicated segmentation process. For character segmentation the script is divided into three
parts: top, core (or middle) and bottom, and all these parts are recognized separately. This complicates
the design because, unlike in the Latin script, the ascenders and descenders of the characters in the core
strip are not treated as part of the character. Therefore no differentiating feature can be placed in the ascender or descender of a character. The special precautions that
need to be taken are discussed below.
Removal of Shiro Rekha and the Top Strip
Removal of the Shiro Rekha is the second step in character segmentation (as shown in figure 3.7).
When the Shiro Rekha is removed, all the features of the character at the level of the Shiro Rekha or above
it are also removed from the core strip, as shown in figure 5.5.
Figure 5.5 Example of few characters after the removal of Shiro Rekha
When important features of a character lie at the level of the Shiro Rekha or above it, they are
removed as well, so that the character is either not recognized or recognized as a different character. For example, भ has a
curve at the level of the Shiro Rekha; when it is removed, भ looks like म. Similarly, ध looks
like घ when the Shiro Rekha is removed. This can be seen in figure 5.6.
Figure 5.6 Similarities between different characters once the Shiro Rekha is removed
Also, the differentiating characteristic between the kana (ा) and the purna viraam (।) is the presence of the Shiro
Rekha above the kana. Once the Shiro Rekha is removed, there is no differentiating feature between
these two characters and one can be confused with the other. So while designing, some differentiating feature has to be added to either of the two characters so that they can be recognized accurately.
Removal of Bottom Strip
The step after the removal of the top strip in character segmentation is the removal of the bottom strip.
The bottom strip contains the lower matras, the halanta and the descenders of the letters in
the core strip. The most difficult part of this step is to determine where the core strip ends and the
bottom strip begins, because in the Devanagari script the lower matras are connected to the characters
in the core strip.
Also, a few characters like इ and झ have characteristic features extending into the bottom strip. When these
features are removed, the character might closely resemble other characters, as shown in figure 5.7.
Figure 5.7 Similarities between different characters once the bottom strip is removed
In some cases the descender also resembles a particular lower matra or a diacritical mark. While
recognizing the lower matras in the bottom strip, the descender can be confused with a lower
matra, which results in incorrect recognition of both the character and the lower matra, as
shown in figure 5.8.
Figure 5.8 Similarities between descender and Halanta
Recognition of Characters
Recognition of characters is much more complicated in Devanagari than in Latin because the
graphical similarities between letters are much greater. Some letters differ by just one stroke;
for example, ष has only an additional diagonal stroke compared to प. Others differ from each other
only in the presence of a vertical line, like न and म.
Also, unlike the Latin script, Devanagari has letters which are disjoint horizontally. Disjoint letters
result in inaccurate recognition and should be avoided wherever possible; for example, ख, whose
disjoint form looks like रव, can be designed as a single connected shape.
The open counters in the letters should also be designed carefully. An open counter is a partially
enclosed negative space of a letter, as shown in figure 5.9.
Figure 5.9 Example of open counter (grey)
While designing open counters, special care needs to be taken so that the strokes forming them
do not get connected because of noise or smudging, as this leads the algorithm to confuse
two letters. For example, if the strokes of ल connect, it can be recognized as न.
Chapter 6: Proposed Font Designed for OCRs
The proposed version contains 53 characters including letters and matras. The font is Unicode
based. A reduction in the calligraphic strokes can be seen. All the characters are designed to have
the same height, i.e. no part of a character goes above the Shiro Rekha or extends into the bottom
strip. Characters which were similar in design were given additional distinguishing features. A small gap is
also left between the lower matras and the core strip, which helps in segmentation. The font is
monolinear and has a bold stroke so that the strokes are not broken in the process of binarization.
अ आ इ ई उ
ऊ ए ऐ ओ औ
क ख ग घ ङ
च छ ज झ ञ
ट ठ ड ढ ण
त थ द ध न
प फ ब भ म
य र ल व
श ष स
ह
Specimen of the proposed font: the sample line किसी जाती का जीवन तथा इन set at 6, 8, 9, 10, 11, 12, 14, 18, 24, 36, 48, 60, 72 and 96 pts.
Specimen of the proposed font: the following paragraph set at 8, 10, 11, 12, 14, 18 and 24 pts.
हालाकि सूर के जीवन के बारे मे कई जनश्रुतिया प्रचलित है पर इन मे कितनी सच्चाई है यह कहना कठिन है। कहा जाता है उनका जन्म दिल्ली के पास एक गरीब ब्राह्मीण परिवार मे हुआ। जनश्रुति के अनुसार सूरदास जन्म से ही अन्धे थे। आजकल थी अन्धे आदमी अक्सर सूरदास कहलाते है। कई लोगो ने उन्हे गुरु के रूप मे अपनाया और उनकी पूजा करना शुरु कर दिया।
6.1 Designing Characters
रसख
The characters र, स, and ख have similar features. Hence, while designing these characters, care
must be taken so that the OCR is able to differentiate between them. In order to incorporate differences in the features of these characters, the following steps are taken:
• The diagonal bar of र is elongated so that it can be differentiated from स. The elongated part of
र is shown in Figure 6.1.
• The bowls of र and ख are designed differently so that even if there is smudging and the
horizontal bar in स breaks, the characters can still be differentiated by the shapes of their bowls.
The bowls of र and स are kept the same, as these two letters can be differentiated using the elongated diagonal
bar. The difference in the shape of the bowls of र and ख is
shown in Figure 6.2 by overlapping these letters.
Figure 6.1 Extension in the diagonal stem of र to differentiate it from स
Figure 6.2 Difference in the shape of the bowl
In the first version, the width of ख was also compressed on the assumption that this would give better results, but
tests showed that the width did not have a prominent impact, and hence the width of ख was
changed back to the standard in the final design. The final design of र, स, and ख is shown in Figure 6.3.
Figure 6.3 Final design of letters र, स and ख
गनभम
The common element in this group is the filled counter shown in figure 6.4,
which is the most distinguishing feature of these characters. The final design of the letters can be seen
in figure 6.5.
Figure 6.4 Common element of ग न भ and म
Figure 6.5 Final design of letters ग न भ and म
To differentiate the letters in this group, the horizontal and vertical distances of the horizontal
and vertical bars are not kept the same, as shown in figure 6.6.
Figure 6.6 Difference in the vertical and horizontal distance of the vertical and horizontal bar
Also the top part of भ doesn't go above the Shiro Rekha so that it doesn't look like म once the Shiro
Rekha is removed as shown in figure 6.7.
Figure 6.7 Difference in भ and म once Shiro Rekha is removed
कबव
The counters of these letters should be large so that they do not get filled with noise. A closed
counter was also designed so that if a joint is broken at one end, the letter still does not look like its half
form, as shown in figure 6.8. The diagonal bar of ब also has to be bold so that it is
not broken while printing, scanning or in the process of binarization.
Figure 6.8 Closed counter in the letters क ब व
The final design of the letters can be seen in the figure 6.9.
Figure 6.9 Final design of the letters क ब and व
तल
The challenge while designing this group was that ल should not look like �. For this, the length of
the diagonal stroke was reduced, which also made the counter more open, as shown in figure 6.10.
Also, ल should not look like त if the lower counter is filled. The difference in the shapes of त and ल can
be seen in figure 6.11.
Figure 6.10 Small diagonal stroke and the openness of the counter in ल
Figure 6.11 Overlapping of the letters ल and त
The final design of the letters can be seen in the figure 6.12.
Figure 6.12 Final design of the letters ल and त
चजछधघ
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• The horizontal bar shown in figure 6.13 should not touch the vertical bar at the left even in
the presence of noise; to ensure this, the distance between them is kept large.
• The closed counter of छ should be large so that it is not filled at small sizes or in the presence
of noise.
• The letter ध should not look like घ after the Shiro Rekha is removed, as shown in figure 6.14.
Figure 6.13 Distance between horizontal and vertical bar of the letters
Figure 6.14 Difference in ध and घ once Shiro Rekha is removed
The final design of the letters can be seen in the figure 6.15.
Figure 6.15 Final design of the letters च ज छ ध and घ
पफणष
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• No character, when overlapped with another, should lie completely inside the other letter.
• The diagonal bar of ष has to be bold so that it is not broken while printing, scanning or in the
process of binarization.
The final design of the letters can be seen in figure 6.17.
Figure 6.17 Final design of the letters प फ ण and ष
टठढद
The counter of ढ had to be large so that it does not get filled with noise, because if it is filled
the OCR system confuses ढ with द. The final design of the letters can be seen in figure 6.18.
Figure 6.18 Final design of the letters ट ठ ढ and द
इईङडझह
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• The letters इ and झ are designed to have the same character height as the other letters, as
shown in figure 6.19.
• The ending of the stroke of ड had to be extended more than required for the normal design so
that it is not confused with इ.
Figure 6.19 Equal character height of all the letters
The final design of the letters can be seen in the figure 6.20.
Figure 6.20 Final design of the letters इ ई ङ ड झ and ह
थय
The main concern while designing these letters is that थ should not look like य after the Shiro
Rekha is removed. An overlap of these letters without the Shiro Rekha is shown in the figure 6.21
and the final design of the letters is shown in the figure 6.22.
Figure 6.21 Overlap of the letters थ and य without the Shiro Rekha
Figure 6.22 Final design of the letters थ and य
अआउऊ
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• The horizontal bar of अ should be bold so that it is not broken while scanning or printing.
• The top of अ should not go above the Shiro Rekha.
• The letter अ should not completely overlap उ. The difference is shown in figure 6.23.
Figure 6.23 Overlap of the letters अ and उ
The final design of the letters can be seen in the figure 6.24.
Figure 6.24 Final design of the letters अ आ उ and ऊ
एऐञश
These characters don't have any common element. The final design of the letters is shown in figure 6.25.
Figure 6.25 Final design of the letters ए ऐ ञ and श
Matras
While designing the lower matras, a small white space was left between the lower matras and the
bottom of the letters, as shown in figure 6.26. The final design is shown in figure 6.27.
Figure 6.26 Small white space between the letters and lower matras (examples: पु पू)
Figure 6.27 Final design of matras (का कि की कु कू के कै को कौ क्)
6.2 Evolution of the Typefaces
The first version contained 52 characters. All the characters had the same character height. The stroke
of the first version was very thin. The bowls of र and स had a dynamic shape. Some letters,
like ख, were compressed so that the character widths of all the letters were comparable. The lower mean
line was also kept higher.
The final version, on the other hand, has a much bolder stroke so that the strokes are not broken while
printing or scanning. Some characters underwent considerable corrections. The horizontal bars of न and
त were brought closer to the Shiro Rekha. The knot at the bottom of इ, ई, झ and द was removed to
make the characters more open, and the diagonal stroke at the bottom was converted to a straight
stroke. क was given a more calligraphic look. Altogether the typeface now has a more natural look and
better stroke consistency than the first version. A comparison of the first and final versions is shown in figure 6.28.
अआइउऊएऐओऔकख
गघङचछजझटठडढण
तथदधनपफबभमयर
लवशषसह
अआइईउऊएऐओऔ
कखगघङचछजझञ
टठडढणतथदधनप
फबभमयरलवशषस
ह
Figure 6.28 First version of OCR-D (top); final version of OCR-D (bottom)
6.3 Anatomy of the Typefaces
The grid used is shown in figure 6.29.
Figure 6.29 Grid used in OCR-D (sample: बुटे; labelled regions: Upper Matra, Shiro Rekha, Character Height, Lower Matra)
The H:V ratio used is 1.1 as shown in figure 6.30.
Figure 6.30 H:V Ratio
6.4 Testing of the Typefaces
Once the font is designed, the next step is to test its accuracy on an OCR system. For this, an OCR
system called HindiOCR is used. HindiOCR was designed by Oliver Hellwig of the Department for Languages and Cultures of
Southern Asia, Freie Universität Berlin. It converts printed Hindi
texts into rich-text documents (RTF) in Devanagari Unicode encoding and processes standard
image formats, i.e. *.jpeg, *.png and *.bmp. A free demo version of HindiOCR can be found at
[11]. The test document can be seen in figure 6.31. Figure 6.32 shows the text after
binarization, and figure 6.33 shows the output of HindiOCR when it was run on the test
document in figure 6.31.
Figure 6.31 Test document typed in OCR-D
Figure 6.32 Test document after binarization
Figure 6.33 Extracted text from the image
Comparison with Other Fonts
The performance of the font was compared with that of other fonts. For testing purposes the fonts Surekh and
Yogesh were used. The same text was typeset in all three fonts and the results were compared.
OCR-D gave the best result and Surekh had the most errors. Although Yogesh performed
much better than Surekh, some errors occurred consistently: 'रु' was recognized as 'दु' most of
the time, and 'स' was sometimes recognized as a combination of र and न because
of the disjoint in स. Figure 6.34 shows the scanned test document typeset in OCR-D and figure 6.35
shows the result when the OCR algorithm was executed on this scanned document. Figures 6.36 and
6.38 show the scanned test documents typeset in Yogesh and Surekh respectively, and figures 6.37
and 6.39 show their results.
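A comparison of this kind can be quantified with a character error rate. The sketch below is one possible way to do so and is not the procedure used for the tests above: it computes the Levenshtein edit distance between the typeset text and the OCR output and divides it by the length of the typeset text. It counts Unicode code points, so a consonant and its matra count as two characters; the example strings are hypothetical and merely mirror the र/द confusion mentioned above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(truth, ocr_output):
    """Edit distance normalized by the length of the ground-truth text."""
    return edit_distance(truth, ocr_output) / len(truth)


truth = "गुरु"
ocr_output = "गुदु"   # hypothetical output with र misread as द
cer = character_error_rate(truth, ocr_output)
```

One substituted code point out of four gives an error rate of 0.25; computing this value for the same text in each font would put the comparison on a common scale.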
Figure 6.34 Scanned document typeset in OCR-D
Figure 6.35 Extracted text when OCR algorithm is executed on document in figure 6.34
Figure 6.36 Scanned document typeset in Yogesh
Figure 6.37 Extracted text when OCR algorithm is executed on document in figure 6.36
Figure 6.38 Scanned document typeset in Surekh
Figure 6.39 Extracted text when OCR algorithm is executed on document in figure 6.38
Chapter 7: Future Work
All the letters of the Devanagari script have been designed and tested on an OCR system.
Although all the vowels (independent and dependent forms) and the consonants have been
designed, numerals and some of the diacritics still have to be designed.
Recognition of conjuncts in Devanagari has to be studied. This includes the algorithm for recognition and the algorithm for separating the half form of a letter from the full form. The design
of the conjuncts has to be completed. The kerning table also has to be decided upon, keeping in mind
generous character spacing.
Comprehensive testing of the font also needs to be done and, based on the results, the design
of the characters has to be tweaked appropriately.
References
[1] Bapurao S. Naik. Typography of Devanagari
[2] Girish Dalvi. Anatomy of Devanagari Typefaces
[3] www.stefantrost.de
[4] Tushar Patnaik, Shalu Gupta, Deepak Arya. Comparison of Binarization Algorithm in Indian Language OCR
[5] B.B. Chaudhuri and U. Pal. Skew Angle Detection of Digitized Indian Script Documents
[6] Huanfeng Ma, David Doermann. Adaptive Hindi OCR Using Generalized Hausdorff Image Comparison
[7] Vijay Kumar, Pankaj K. Sengar. Segmentation of Printed Text in Devanagari Script and Gurmukhi Script
[8] Mudit Agrawal, M. N. S. S. K. Pavan Kumar, C. V. Jawahar. Indexing and Retrieval of Devanagari Text in Printed Documents
[9] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal. Offline Recognition of Devanagari Script: A Survey
[10] Heidrun Osterer, Philipp Stamm. Adrian Frutiger Typefaces. The Complete Work
[11] http://www.indsenz.com