Offline Urdu OCR using Ligature based Segmentation for Nastaliq Script

Indian Journal of Science and Technology, Vol 8(35), DOI: 10.17485/ijst/2015/v8i35/86807, December 2015
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645

Ankur Rana1* and Gurpreet Singh Lehal2
1 Department of Computer Engineering, Punjabi University, Patiala - 147002, Punjab, India; ankurrana628@gmail.com
2 Department of Computer Science, Punjabi University, Patiala - 147002, Punjab, India; gslehal@gmail.com
Abstract
Urdu has two popular writing styles, Naskh and Nastaliq. In Arabic OCR research, an ample amount of work has been done on the Naskh writing style, whereas Urdu, which uses the Arabic character set, is commonly written in the Nastaliq writing style. Because of the Nastaliq writing style, Urdu OCR poses many distinct challenges, such as compactness, diagonal orientation and context-sensitive character shapes, for an OCR system that has to correctly recognize an Urdu text image. Due to the compactness and slanting nature of the Nastaliq writing style, existing methods for the Naskh style would not give desirable results. Therefore, in this paper, we present a ligature based segmentation OCR system for the Urdu Nastaliq script. We discuss in detail various unique challenges for Urdu OCR and different feature extraction techniques for ligature recognition using SVM and kNN classifiers. The system is trained to recognize 11,000 Urdu ligatures. We have achieved an overall accuracy of 90.29% when tested on Urdu text images.
Keywords: Feature Extraction (DCT, Directional, Gabor and Gradient), K-Nearest Neighbor, SVM, Urdu OCR
1. Introduction
Optical character recognition is a technique used to convert image data into an editable text format. It is used for the digitization of printed text material such as books, newspapers and old literature, and in applications such as book readers for the blind and banking. A lot of research is being done for Arabic-script languages. Urdu shares its character set with the Arabic character set. Generally, Arabic is written from right to left in the Naskh writing style, whereas Urdu is written from right to left in the Nastaliq writing style. Nastaliq is a cursive writing style, and because of it very little work has been done for the Urdu language. For a cursive script like Urdu, researchers have used both segmentation based and segmentation free approaches. Most researchers have worked either on isolated Urdu character recognition or on a small set of Urdu ligatures.
Shamsher et al.1 and Ahmad et al.2 developed OCR for printed Urdu script. They used a feed forward neural network for training and classification of Urdu characters. Shamsher et al.1 extracted the extreme points of all characters as the feature vector for the neural network. They reported 98.3% accuracy in recognizing individual Urdu characters. Ahmad et al.2 experimented with the structural features of machine printed ligatures. They reported 70% accuracy. Pathan3 worked on handwritten isolated Urdu characters.
*Author for correspondence
Al Muhtaseb et al.4 used HMM for developing Urdu OCR. They used a vertical sliding window three pixels wide and extracted 16 features by dividing the window into 8 equal sub-windows and summing the number of black pixels in each sub-window. They used 2500 lines for training and 266 lines for testing. They achieved 99% accuracy for the Arial font. Their system is font style and font size dependent.
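
The window-based feature extraction described for Al Muhtaseb et al. can be sketched as follows. This is a minimal illustration and not their implementation: the function name `sliding_window_features`, the non-overlapping left-to-right window step, and the requirement that the image has at least 8 rows are assumptions made here. The sketch produces only the 8 black-pixel counts per window; how the reported 16 features are obtained from them is not stated in the text.

```python
def sliding_window_features(image, window_width=3, bands=8):
    """Compute black-pixel counts per sub-window for each vertical window.

    image: list of rows, each row a list of 0/1 ints (1 = black pixel).
    Returns a list with one feature vector (length `bands`) per window.
    Assumption: windows do not overlap and move left to right.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    band_height = height // bands
    if band_height == 0:
        raise ValueError("image needs at least `bands` rows")
    features = []
    # Trailing columns that do not fill a whole window are ignored.
    for left in range(0, width - window_width + 1, window_width):
        vector = []
        for band in range(bands):
            top = band * band_height
            # Sum the black pixels inside this sub-window.
            count = sum(
                image[y][x]
                for y in range(top, top + band_height)
                for x in range(left, left + window_width)
            )
            vector.append(count)
        features.append(vector)
    return features


# Toy 16x6 binary image: the same row pattern repeated 16 times.
toy_image = [[1, 0, 0, 1, 1, 1] for _ in range(16)]
toy_features = sliding_window_features(toy_image)
print(len(toy_features), toy_features[0], toy_features[1])
```

Each vector here plays the role of one observation in a sequence model such as an HMM; the order of the windows is what carries the left-to-right (or right-to-left) structure of the text line.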
Hasan et al.5 used an LSTM (Long Short-Term Memory) neural network to recognize printed Urdu Nastaliq text. They also used a sliding window, of width 30, for feature extraction. They took the pixel values as the feature vector. They reported 94.85% character recognition accuracy.
Lehal and Rana6 explored ligature recognition for Urdu OCR. They divided each ligature into two components, i.e. a primary component (the main shape of the ligature without diacritics) and a secondary component (the diacritics). They achieved a 98% recognition rate for the primary component and a 99.9% recognition rate for the diacritics component using SVM as the classifier. Lehal7 and Hussain8 worked on segmentation for Nastaliq Urdu OCR.
We have considered 10082 ligatures for the development of our Urdu OCR. Our approach is a segmentation based approach. We consider pre-segmented text for recognition. We have divided our ligatures into two sub-components, i.e. i) Primary Component and ii) Secondary Component, so the total number of classes is reduced to 1845 classes for the primary component and 19 classes for the secondary component. We have achieved 98.15% accuracy for the primary component and 99.91% accuracy for the secondary component. The overall accuracy achieved by the Urdu OCR on input Urdu text images (having 500 ligatures on average) is 90.29%.
2. Overview of Urdu
Urdu is the national language of Pakistan and an official language of six Indian states: Delhi, Jammu and Kashmir, Uttar Pradesh, Bihar, Andhra Pradesh and Telangana. The Urdu language is derived from the Farsi script. Urdu is particularly important as it has been widely used by poets for composing their poetry. Urdu is written in the Nastaliq style whereas Arabic is written in the Naskh style. Figure 1 shows the same Urdu text in the two different writing styles.
Figure 1. Red color text and blue color text are in the Nastaliq and Naskh writing styles respectively.
Urdu has 38 basic letters. Figure 2 shows all the basic letters of the Urdu script. It is written from right to left, whereas Urdu numerals are written like Roman numerals, i.e. from left to right.
Figure 2. Urdu alphabets and number.
Urdu characters are classified as joiners and non-joiners. Urdu characters which join with the preceding character but not with the succeeding character are termed non-joiners. There are 12 non-joiners in Urdu, as shown in Figure 3. The rest of the characters, except these 12 non-joiners, connect with both the succeeding and the preceding character and change their shape.
Figure 3. Non-joiner in Urdu.
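
Because a non-joiner never connects to the character after it, a word breaks into separately connected pieces after each non-joiner. The sketch below illustrates that rule on the word badshah. It is only an illustration under stated assumptions: the `NON_JOINERS` set is a hypothetical list of 12 characters chosen here (the figure listing the paper's 12 non-joiners is not available in the text), the function name `split_into_ligatures` is made up, and diacritics and other combining marks are not handled.

```python
# Hypothetical non-joiner set, given as code points to avoid
# bidirectional display issues. Assumption: not taken from the paper.
NON_JOINERS = {
    "\u0627",  # alef
    "\u0622",  # alef with madda
    "\u062F",  # dal
    "\u0688",  # ddal
    "\u0630",  # zal
    "\u0631",  # re
    "\u0691",  # rre
    "\u0632",  # ze
    "\u0698",  # zhe
    "\u0648",  # wao
    "\u06D2",  # bari ye
    "\u0621",  # hamza
}


def split_into_ligatures(text):
    """Split text into runs that end at a non-joiner or at whitespace."""
    ligatures = []
    current = ""
    for ch in text:
        if ch.isspace():
            # A space always closes the current run.
            if current:
                ligatures.append(current)
                current = ""
            continue
        current += ch
        if ch in NON_JOINERS:
            # A non-joiner does not connect to what follows it.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


# beh, alef, dal, sheen, alef, heh: the word "badshah"
word = "\u0628\u0627\u062F\u0634\u0627\u0647"
pieces = split_into_ligatures(word)
print(len(pieces), pieces)
```

With this assumed set, the word splits after each alef and after the dal, giving two two-character pieces and two single-character pieces.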
3. Challenges in Urdu Recognition
The development of OCR for Urdu script involves many unique complexities, which are as follows:
• Urdu is written diagonally from right to left and top to bottom. All ligatures look tilted at some angle, from the top right to the bottom left, as the different joiners are written. Numerals add another level of complexity to Urdu OCR. In books containing both Roman and Arabic numerals, it is found that Urdu and Roman numerals are written left to right within Urdu text, as shown in Figure 4.
Figure 4. Urdu writing style in Nastaliq from right to left and number written from left to right.
• As Urdu is mostly written diagonally from top right to bottom left, there is a problem of segmentation. This poses a problem for both character and word segmentation. Figure 5 shows overlapped ligatures marked in red, green and blue.
Figure 5. Ligature overlapping.
• Urdu is a context sensitive script as well. Context sensitive means that the shape of a character depends on the succeeding character. When different characters are written together, the shape of a character depends on the shape of the character it follows, as shown in Table 1. The character in red on the left hand side of the equals sign is the character which, when written with other characters, changes its shape on the right hand side because of the context sensitive nature of the script. In Naskh each character has four shapes, i.e. isolated, middle, left and right, as shown in the table. But Urdu in the Nastaliq script has more than four shapes for its characters. Naz9 reported 32 different shapes of one character in the Nastaliq script.
Table 1. Different shapes of tey, tay and meem with different other characters
ت + و = تو
ت + ب = تب
ت + ن = تن
ٹ + ب = ٹب
ٹ + ن = ٹن
ٹ + ڈ = ٹڈ
م + ل = مل
م + ن = من
م + ر = مر
• One of the major problems of Urdu OCR is broken characters in the image. If one of the diacritics is not printed or not correctly scanned, then the OCR cannot correctly identify it. In Figure 6 the ligature is broken at the end. Consequently, the OCR will get confused whether to treat the broken ligature as one component or more than one.
Figure 6. Broken ligatures in Urdu text.
• Another problem with Urdu OCR is the case of merged characters. In Urdu, different ligatures have the same basic shape but different diacritics. From the diacritics of the ligature, we can identify the correct meaning of the ligature. But sometimes the diacritics get merged with the primary shape of the ligature, and it is difficult to segment even the primary shape and the diacritics of the ligature, as shown in Figure 7.
Figure 7. Merged diacritics with ligature in Urdu text.
• There is very little space in between words in Urdu. Different ligatures either have a space or a non-joiner at the end as the last character. As shown in Figure 8, the bars in green show the boundaries between different words, whereas the bars in red represent the space between two ligatures in one word. From the space between ligatures and words, we cannot find the boundary between different words.
Figure 8. Urdu text with very small space between ligatures.
• Line segmentation also poses a problem in Urdu OCR. Because of the diagonal writing style of the script, two ligatures in two different lines get merged, as shown in Figure 9.
Figure 9. Two ligatures in different lines get merged.
• It is observed that in some Urdu books the footer of the page is written in the Naskh style, which adds another complexity to the recognition. Having both Naskh and Nastaliq writing styles on one page adds another level of challenge to the correct recognition of the page. As shown in Figure 10, the Urdu text in green is in the Nastaliq writing style and the Urdu text in red is in the Naskh writing style. Having both types of script on a page is a type of multiscript recognition, which further increases the complexity of the system.
Figure 10. Urdu text written in two different styles (multiscripts on one page).
4. Urdu Data Preparation
Urdu text printed in books and newspapers can be divided into two generations. The books printed before 1995 are all hand written, while the majority of the books published after 1995 use computer generated Nastaliq fonts such as Alvi Nastaliq or Noori Nastaliq. The shape of a character in Urdu depends on the shape of the character it follows. That is why we cannot take the character as the classification unit for recognition. Ligature segmentation at character level is shown in Figure 11.
Figure 11. Urdu word with character segmentation.
For the development of Urdu OCR we have taken ligatures as the classification unit. A ligature is a connected component made up of Urdu characters, and it ends either with a non-joiner or at a space. Urdu words can have more than one ligature. For example, the word بادشاه (badshah) is composed of four ligatures: two ligatures having multiple characters (با and شا) and two single character ligatures (د and ه). The two ligatures with multiple characters are each composed of two characters (ب + ا = با and ش + ا = شا).
Lehal10 did a statistical analysis of the recognizable unit for Urdu OCR. He took a corpus of 6,533,057 words and identified 25,957 unique ligatures. He identified nearly 10082 ligatures which are used 99% of the time in the whole corpus. We have developed our proposed system with these 10082 ligatures. But collecting the training data for 10082 ligatures from books is a very cumbersome task. To decrease the number of recognition classes, Lehal10 separated each ligature into primary and secondary components, as shown in Figure 12.
We have gathered training data from various scanned books. Some of the least frequent ligatures, as mentioned by Lehal10, rarely occur in some books. To generate the training data for the primary components of those ligatures, we made synthetic images with different font sizes, i.e. 35, 38, 40, 45, 50 and 55, and different format options like bold or regular. We removed the diacritic marks from these ligatures to get the primary component. We gathered a total of 1200 primary components from the scanned books, and the rest of the primary components were generated synthetically. Even among the 1200 primary components, some have fewer samples than the required number. To complete the samples we mixed synthetic and scanned samples. We have also trained Urdu numerals and Roman numerals as primary components. Samples of the primary components are shown in Figure 13.
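
The idea behind restricting the system to the ligatures that cover 99% of the corpus can be sketched with a frequency table. This is a hedged illustration only: the function name `most_frequent_covering` and the toy counts are made up here, and nothing in it reproduces the corpus or the procedure of the cited analysis. Picking ligatures in descending order of frequency yields the smallest set that reaches a given share of all occurrences.

```python
from collections import Counter


def most_frequent_covering(counts, target_percent=99):
    """Return the most frequent items needed to reach the target share.

    counts: Counter mapping ligature -> number of occurrences.
    Integer arithmetic is used for the threshold to avoid float rounding.
    """
    total = sum(counts.values())
    chosen = []
    covered = 0
    for ligature, count in counts.most_common():
        if covered * 100 >= target_percent * total:
            break
        chosen.append(ligature)
        covered += count
    coverage = covered / total if total else 0.0
    return chosen, coverage


# Toy frequency table with hypothetical ligature labels.
toy_counts = Counter({"lig_a": 90, "lig_b": 9, "lig_c": 1})
chosen, coverage = most_frequent_covering(toy_counts)
print(chosen, coverage)
```

In the toy table two of the three labels already account for 99 of 100 occurrences, which mirrors, on a tiny scale, why a long tail of rare ligatures can be left out at a small cost in coverage.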
We have also collected samples of merged ligatures. Merged ligatures are those ligatures in which the diacritics merge with the primary component shape. Merged or touching characters pose a problem for recognition in any OCR system.
ollows. That’s why we cannot take the character unit as the classification unit for the recognition. After analyzing scanned images from
10
10
books, we have found some diacritics components merged
Wenumber
have
our
proposed
system
segmentation
at
character
level
is
shown
in
ation
at decrease
character
level
isFigure
shown
in
Figure
11.Figure
haswith these
ask.
To
the
of12.
recognition
classes,
component
shown
in
mentation
at as
character
level
isdeveloped
shown
in Figure
11. 11.Lehal
with the primary component shape. We have collected
10082
ligatures.
But
to
collect
the
training
data
for
10082
component as shown in Figure 12.
=
=
=
total 41 such primary components merged with the existligatures from the books is very cumbersome task. To
ing primary components. Some of the diacritics touching
decrease the number of recognition classes, Lehal10 has
primary components are shown in the Figure 14.
separated ligature into primary and secondary compoUrdu Text
in two
different
styles
(multiscripts
on one page).
(b) written
(c)11.
Figure
11.
Urdu
word
with
character
segmentation.
Figure
Urdu
word
with
character
segmentation.
(a)
(b)
(c)
11. Urdu
word with
We have collected150 samples for each secondnentFigure
as shown
in Figure
12.character segmentation.
gature
Ligature
secondary
component.
Figure
12. (a)
Urdu Ligature
(b) Ligature primary component (c) Ligaturecomponents
secondary component.
(b) primary component
(c) After (c)
Even of
in the
1200 primary com
ary
component
fromwere
thegenerated
scanned synthetically.
books. Sample
removing
diacritics
from
the
10082ligatures
and
Data
development
of(c)Urdu
OCR
we
have
taken
ligatures
as
a classification
unit.
Ligature
is connected
the connected
ment
ofPreparation
Urdu
OCR
we(c)have
taken
ligatures
as a as
classification
unit.unit.
Ligature
is
theis connected
lopment
of
Urdu
OCR
we
have
taken
ligatures
a
classification
Ligature
the
e(b)
primary
component
Ligature
secondary
component.
(b)
samples
are
less
than
the
required
number
of
samples.
To complete
(c)
s
­
econdary
component
is
shown
in
the
Figure
15:
grouping
ligatures
having
same
we ligatures
gatures
and
grouping
ligatures
having
sameend
primary
component,
ent
anddifferent
has
different
characters
and
character
iscomponent,
either
non
or
space.
Urdu
can
After
removing
diacritics
the
10082ligatures
andwe
grouping
having
same
primary component, we
asprimary
different
Urdu
characters
and
end
character
isprimary
either
non
joiner
or joiner
space.
Urdu
words
canwords
has
Urdu
characters
and from
end
character
is either
non
joiner
or space.
Urdu
words
canbooks
component
(c)Urdu
Ligature
secondary
component.
eendprimary
component
(c)
Ligature
secondary
component.
scanned
samples.
We
have
also
trained
Urdu
numerals
and roma
Complete
statistics
of training
data is given in
1845
classes
of
theword
primary
component
and
16
secprinted
in books
and
newspapers
can
be
divided
into (badshah)
two
generations.
The
books
printed
before
ponent
and
16have
secondary
components.
When
these
secondary
re
than
one
ligature.
As
for
example,
the
word
(‫)ﺑﺎﺩﺷﺎﻩ‬
is16
composed
of four
ligatures:
two these
one
ligature.
As
for
example,
theof
word
(‫)ﺑﺎﺩﺷﺎﻩ‬
(badshah)
is composed
of
four
ligatures:
two two
res
and
grouping
ligatures
having
same
primary
component,
we
have
1845
classes
the
primary
component
secondary
components.
When
secondary
han
one
ligature.
As
for
example,
the
(‫)ﺑﺎﺩﺷﺎﻩ‬
(badshah)
isand
composed
of four
ligatures:
Sample
of the primary components are shown in Figure 13.
all
hand
written
while
of
the
books
published
after
1995
use ligature
computer
generated
­ftotal
ollowing
Table
2. two
ondary
components.
When
these
secondary
components
having
multiple
characters
are
(‫ﺑﺎ‬
and
‫)ﺷﺎ‬
have
two
single
character
(‫ﺩ‬The
and
‫)ﻩ‬.Nastaliq
The
multiple
characters
are
(‫ﺑﺎ‬
and
‫)ﺷﺎ‬
and
have
two
single
character
ligature
(‫ﺩ‬ligature
and
‫)ﻩ‬.
two
mary
component
we
getmajority
aare
total
of
10082ligatures.
ing
multiple
characters
(‫ﺑﺎ‬
and
‫)ﺷﺎ‬
and
have
two
single
character
and
The
two
ent
and
16 components
secondary
components.
When
these
secondary
res
and
grouping
ligatures
having
same
primary
component,
we
are
used
along
withand
1845
primary
component
we
get
a(‫ﺩ‬
of‫)ﻩ‬.
10082ligatures.
h
as
Alvi
Nastaliq
or
Noori
Nastaliq.
Shape
of
the
characters
in
Urdu
depends
on
the
shape
of
the
with
multiple
character
are
further
of characters
twocomponent
characters
(‫ﺏ‬+
‫ﺵﺍ‬+
= ‫ﺍ ﺑﺎ‬and
tiple
character
are
composed
ofWhen
two
characters
eacheach
(‫ﺏ‬+each
‫ﺏ(ﺍ‬+
= ‫ﺑﺎ‬get
= ‫ﺵ ﺍ)ﺷﺎ‬+
are
used
along
withcomposed
1845
primary
we
hent
multiple
character
further
composed
of
two
‫ﺍ‬and
= ‫ﺑﺎ‬a and
‫ﺵ‬+
= ‫)ﺷﺎ = ﺍ)ﺷﺎ‬
and
16
secondary
components.
these
secondary
component
we
get further
aare
total
of
10082ligatures.
it
follows.
That’s
why
we
cannot
take
the
character
unit
as
the
classification
unit
for
the
recognition.
arious scanned
books.
Some
oftraining
the least
ligatures,
as books. Some of the least frequent ligatures, as
10082ligatures.
We
have
gathered
datafrequent
from various
scanned
component
wetotal
get
aoftotal
of 10082ligatures.
Urdu word with character segmentation.
component
we
a total
ofthe
10082ligatures.
the statistical
analysis
oflevel
unit
for
Urdu
OCR.
Hetaken
has
taken
6,533,057
words
corpus
atistical
analysis
of
the
recognizable
unit
for
Urdu
OCR.
He
has
taken
6,533,057
words
corpus
egmentation
at get
character
isrecognizable
shown
inthose
Figure
11.
10
eid
statistical
analysis
of
the
recognizable
for
Urdu
OCR.
Heas
has
6,533,057
words
corpus
books.
To generate
the
training
data
forunit
primary
components
us
scanned
books.
Some
of the
least
frequent
ligatures,
, rarely
occur
in
some
books.
To
generate
the
training
data
for those
mentioned
by
Lehal
ntified
25,957
unique
ligatures.
He
identified
nearly
10082
ligatures
which
are
used
99%
time
in theprimary components
,957
unique
ligatures.
He
identified
nearly
10082
ligatures
which
are
used
99%
time
in
the
d 25,957
unique
ligatures.
Hei.e.
identified
nearly
10082
ligatures
which are used 99% time in the
mages
with
different
font size
35,
38,
40,
45, 50,
55images
and
different
oks.
ks.
To
generate
the
training
data
for
those
primary
components
us
scanned
books.
Some
of
the
least
frequent
ligatures,
as
of
the
ligature,
we
made
some
synthetic
with
different
font
size
i.e.
35,
38,
40,
45, 50, 55 and different
=
orpus.
.
ve
removed
diacritics
marks
from
these
ligatures
to
get
the
primary
ks.
To
generate
the
training
data
for
those
primary
components
ess with different
fontoption
size i.e.
35,bold
38, or
40,regular.
45, 50, We
55 and
different
format
like
have
removed diacritics marks from these ligatures to get the primary
our
proposed
system
with
these
10082
ligatures.
But
to
collect
thedata
training
data
for
10082
ed
our different
proposed
system
with
these
10082
ligatures.
But
to
the
training
for
10082
primary
component
from
the
scanned
and
rest
of
the
primary
these
moved
diacritics
marks
from
ligatures
toligatures.
get
the
primary
sdeveloped
with
font
size
i.e.
35,
38,
40,books
45,
50,
55
and
different
eloped
our
proposed
system
with
these
10082
Butcollect
to
collect
the training
for 10082
component.
We
have
gathered
total
1200
primary
component
from
thedata
scanned
books
and rest of the primary
10
10
10
has
from
the
books
is
very
cumbersome
task.
To
decrease
the
number
of
recognition
classes,
Lehal
has
e
books
is
very
cumbersome
task.
To
decrease
the
number
of
recognition
classes,
Lehal
moved
diacritics
marks
from
these
ligatures
to
get
the
primary
thenumber
primaryof recognition classes, Lehal has
from
the
scanned
books
rest
ofthe
mary
thecomponent
books is very
cumbersome
task.
Toand
decrease
Figure
11. Urdu
word
with
character
segmentation.
Figure
11. Urdu
wordcomponent
with
character
segmentation.
dinto
ligature
intoand
primary
and
secondary
as
shown
in
Figure
eature
primary
secondary
component
as shown
inthe
Figure
12.
ary
component
from
the
scanned
books
and
rest
of
primary
into
primary
and
secondary
component
as
shown
in
Figure
12.5 12.
Page
Journal
Page 5
evelopment of Urdu OCR we have taken ligatures as a classification unit. Ligature is the connected
Page
5
nt and has different Urdu characters and end character is either non joiner or space. Urdu words can
Page 5 is composed of four ligatures: two
e than one ligature. As for example, the word (‫( )ﺑﺎﺩﺷﺎﻩ‬badshah)
having multiple characters are (‫ ﺑﺎ‬and ‫ )ﺷﺎ‬and have two single character ligature (‫ ﺩ‬and ‫)ﻩ‬. The two
with multiple character are further composed of two characters each (‫ﺏ‬+ ‫ ﺑﺎ = ﺍ‬and ‫ﺵ‬+ ‫)ﺷﺎ = ﺍ‬
(a) (a) (b)
(b)
(c)
(a) (a)
(b) (b) (c) (c) (c)
d12.
theFigure
statistical
analysis
of the(b)
recognizable
unitcomponent
for
Urdu
OCR.
He hassecondary
taken 6,533,057
words corpus
12.
(a)
Urdu
Ligature
Ligature
primary
(c)
Ligature
component.
e
(a)
Urdu
Ligature
(b)
Ligature
primary
component
(c)
Ligature
secondary
component.
Figure 12. (a) Urdu Ligature (b) Ligature primary component (c) Ligature
secondary
component.
Figureligatures.
12. (a)He Urdu
Ligature
Ligature
primary
ified 25,957 unique
identified
nearly(b)
10082
ligatures
which are used 99% time in the
component
(c)
Ligature
secondary
component.
Figure
13. wePrimary
pus.
moving
diacritics
from
the
10082ligatures
and
grouping
ligatures
having
same
primary
component,
acritics
from
the
10082ligatures
and
grouping
ligatures
having
same
primary
component,
ng diacritics from the 10082ligatures and grouping ligatures having same
primary
component,
we wecomponent training sample data.
Figure
13. Primary component training sample data.
45
classes
of
the
primary
component
and
16
secondary
components.
When
these
secondary
es
of
the
primary
component
and
16
secondary
components.
When
these
secondary
classes
of our
the proposed
primary system
component
and 16
secondary
these data
secondary
developed
with these
10082
ligatures.components.
But to collectWhen
the training
for 10082
ents
are
used
along
with
1845
primary
component
we
get
a
total
of
10082ligatures.
used
along
with
1845
primary
component
we
get
a
total
of
10082ligatures.
We
have10also
samples of the merged ligatures. Merged ligatu
are
used
along
with
1845
primary
component
we
get
a
total
of
10082ligatures.
hascollected
rom the books
is very
To decrease the number of recognition classes, Lehal
4
Vol
8 (35)cumbersome
| December 2015task.
| www.indjst.org
Indian Journal of Science and Technology
merge with the primary component shape. As merged characters or to
ligature
intodata
primary
and
secondary
component
as
shown
inthe
Figure
ethered
gathered
training
data
from
various
scanned
books.
of 12.
the
least
frequent
ligatures,
ed
training
from
various
scanned
books.
Some
of Some
least
frequent
ligatures,
as as inasany OCR system. After analyzing scanned images from
training
data
from
various
scanned
books.
Some
of the
least
frequent
ligatures,
recognition
10
10
10
,
rarely
occur
in
some
books.
To
generate
the
training
data
for
those
primary
components
ed
by
Lehal
,
rarely
occur
in
some
books.
To
generate
the
training
data
for
those
primary
components
al
components merged with the primary component shape. We ha
y Lehal , rarely occur in some books. To generate the training data for those primary components
components merged with the primary component shape. We have collected total 41 such primary
merge with the primary component shape. As merged characters or touching character pose problem for the
components merged with the existing primary components. Some of the diacritics touching primary
recognition in any OCR system. After analyzing scanned images from books, we have found some diacritics
components are shown in the Figure 14.
components merged with the primary component shape. We have collected total 41 such primary
components merged with the existing primary components. Some of the diacritics
touching
primary
Ankur
Rana and
Gurpreet Singh Lehal
components are shown in the Figure 14.



Figure 14. Sample of merged primary component with secondary component.
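The ligature definition used in this section (a run of characters that ends at a non-joiner or at a space) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the segmentation method of the paper, which works on connected components in the image: the `NON_JOINERS` set is an assumed list of Urdu letters that do not connect to the following letter, the function name is hypothetical, and diacritics are ignored.

```python
# Assumed set of letters that never join to the next letter:
# alef, alef madda, dal, ddal, thal, reh, rreh, zain, jeh, waw, yeh barree.
NON_JOINERS = set("\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2")


def split_into_ligatures(text):
    """Split text into ligatures: a ligature ends at a non-joiner or at a space."""
    ligatures = []
    current = ""
    for ch in text:
        if ch.isspace():
            # A space always closes the ligature being built.
            if current:
                ligatures.append(current)
            current = ""
            continue
        current += ch
        if ch in NON_JOINERS:
            # A non-joiner is the last character of its ligature.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


# The word "badshah" from the text: beh, alef, dal, sheen, alef, heh.
BADSHAH = "\u0628\u0627\u062f\u0634\u0627\u0647"
print(len(split_into_ligatures(BADSHAH)))
```

On the example word this yields four ligatures, two of two characters and two of one character, which matches the decomposition described in the text.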
We have collected 150 samples for each secondary component from the scanned books. A sample of the secondary components is shown in Figure 15.

Figure 15. Secondary component training sample data.

Complete statistics of the training data are given in Table 2.

Table 2. Training data statistics
| Category | Count |
| Total ligatures | 10082 |
| Primary components (after removing diacritics from the ligatures) | 1845 |
| English numerals | 10 |
| Urdu numerals | 10 |
| Secondary components | 10 |
| Punctuation marks as primary component | 9 |
| Punctuation marks as secondary component | 6 |

5. Features Extraction
For the classification of any pattern, relevant features have to be extracted. Researchers have used many different features for Urdu OCR. We recognize the primary and secondary components separately. For classification of the primary component of a ligature, we calculated DCT, Gabor, directional and gradient features.

5.1 Discrete Cosine Transformation (DCT)
DCT is a statistical feature extraction technique. It maps the ligature image from the spatial domain to the frequency domain. DCT concentrates the low-frequency components, which carry most of the image energy, in the upper-left corner of the coefficient matrix, while the high-frequency components fall toward the bottom-right corner. The DCT coefficients f(p, q) of an image I(m, n) are computed by equation (1):

\[ f(p,q) = a(p)\,a(q) \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I(m,n) \cos\left[\frac{(2m+1)\pi p}{2M}\right] \cos\left[\frac{(2n+1)\pi q}{2N}\right] \qquad (1) \]

where

\[ a(p) = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \le p \le M-1 \end{cases} \qquad a(q) = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \le q \le N-1 \end{cases} \]

We have scaled all the images to 32 × 32 for the DCT. The DCT of the whole image gives 1024 frequency components. An important property of the DCT is that the largest coefficients lie in the top-left corner of the DCT matrix, so most of the information about the image is stored there. We have therefore extracted a total of 100 features from the 1024 values in zigzag order starting from the top-left corner, as shown in Figure 16.

Figure 16. DCT coefficients of the image selected in zigzag direction into one vector.

5.2 Gabor Features
The Gabor function G(m, n) is a linear filter. Rani et al. [11] used Gabor features for script identification. It is also used for edge detection in image processing. It is the product of a harmonic function and a Gaussian function:

\[ G(m,n) = P(m,n) \cdot C(m,n) \]

The filter is selective in both orientation and spatial frequency. A Gabor filter is defined as

\[ G(m,n,\theta,\sigma_m,\sigma_n) = \exp\left[-\frac{1}{2}\left(\frac{R_1^2}{\sigma_m^2} + \frac{R_2^2}{\sigma_n^2}\right)\right] \exp\left[\frac{j\,2\pi R_1}{\lambda}\right] \]

where \(R_1 = m\cos\theta + n\sin\theta\) and \(R_2 = -m\sin\theta + n\cos\theta\). Here θ is the orientation of the sinusoidal plane wave, λ is its wavelength, and σ_m and σ_n are the standard deviations. We have taken both standard deviations to be equal for the feature extraction.

To calculate the features of an input, the image is first scaled to 32 × 32 pixels. It is then partitioned into four equal non-overlapping sub-regions of size 16 × 16. Each of these sub-regions is further partitioned into four non-overlapping sub-sub-regions of size 8 × 8, which gives 16 small regions in total. These 21 images (the whole image, the 4 sub-regions and the 16 sub-sub-regions) are then convolved with odd-symmetric and even-symmetric Gabor filters at nine orientations θ, spaced 20 degrees apart, to obtain a feature vector of 189 values (21 regions × 9 orientations).
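The region count and feature length stated for the Gabor features can be checked with a short sketch. It only enumerates the regions and orientations; it does not perform the filtering. The function name and the choice of 0 degrees as the first orientation are assumptions made for illustration.

```python
def gabor_regions(size=32):
    """Return (row, col, side) boxes: the whole image, 4 quadrants, 16 sub-quadrants."""
    regions = []
    for side in (size, size // 2, size // 4):
        # Tile the image with non-overlapping squares of the current side length.
        for row in range(0, size, side):
            for col in range(0, size, side):
                regions.append((row, col, side))
    return regions


# Nine orientations spaced 20 degrees apart (starting angle assumed to be 0).
ORIENTATIONS = [20 * k for k in range(9)]

regions = gabor_regions()
feature_length = len(regions) * len(ORIENTATIONS)
print(len(regions), feature_length)
```

The sketch gives 1 + 4 + 16 = 21 regions and, with one value per region and orientation, the 189-value vector quoted in the text.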
5.3 Directional Features
Directional features [12] calculate the distance of black and white pixels in eight different directions for each pixel. We have scaled the input image to 36 × 36 pixels. After scaling, the directional feature vector is calculated in eight directions, i.e. 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°. This gives a directional feature vector of length 16 for each pixel. To down sample the 20736 (16 × 36 × 36) directional feature values, we divided the image into 9 (3 × 3) zones. We then generated a feature vector of length 16 for each zone by adding the corresponding directional feature values of all the pixels in that zone. In this way we obtained a directional feature vector of length 144, i.e. 16 feature values × 9 zones.
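The zoning step described above, which reduces 20736 per-pixel values to 144, can be sketched as follows. The sketch takes the per-pixel 16-value vectors as given input, since the exact distance computation is not specified here; the function name and the dummy data are assumptions made for illustration.

```python
def zone_features(pixel_features, size=36, zones=3):
    """Sum per-pixel feature vectors inside each cell of a zones x zones grid.

    pixel_features[r][c] is the feature list (length 16 in the text) of pixel (r, c).
    """
    step = size // zones                      # 12 pixels per zone side for 36 / 3
    depth = len(pixel_features[0][0])
    vector = []
    for zone_row in range(zones):
        for zone_col in range(zones):
            totals = [0] * depth
            for r in range(zone_row * step, (zone_row + 1) * step):
                for c in range(zone_col * step, (zone_col + 1) * step):
                    for k, value in enumerate(pixel_features[r][c]):
                        totals[k] += value
            vector.extend(totals)             # 16 summed values per zone
    return vector


# Dummy input: every pixel carries sixteen 1s, so each zone sums to 12 * 12 = 144.
dummy = [[[1] * 16 for _ in range(36)] for _ in range(36)]
vector = zone_features(dummy)
print(len(vector), vector[0])
```

Summing rather than averaging follows the wording of the text ("adding corresponding directional feature vector values"); any normalization would be a separate design choice.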
5.4 Gradient Features
A gradient feature [12,13] calculates the magnitude and direction of the maximum change in intensity in the neighborhood of a pixel. For gradient feature extraction, the image is first normalized to 63 × 63 pixels. After normalizing the image, the gradient vector is calculated in both the x and y directions at each pixel position using the Sobel operator, as shown in Figure 17:

\[ g_x(x,y) = z(x+1,y-1) + 2z(x+1,y) + z(x+1,y+1) - z(x-1,y-1) - 2z(x-1,y) - z(x-1,y+1) \]
\[ g_y(x,y) = z(x-1,y+1) + 2z(x,y+1) + z(x+1,y+1) - z(x-1,y-1) - 2z(x,y-1) - z(x+1,y-1) \]

Figure 17. Sobel operator: the two 3 × 3 masks [-1 -2 -1; 0 0 0; 1 2 1] and [1 0 -1; 2 0 -2; 1 0 -1].

After calculating the gradient vector, we calculated the magnitude and direction as

\[ \text{Magnitude} = \sqrt{g_x(x,y)^2 + g_y(x,y)^2} \]
\[ \text{direction} = \tan^{-1}\left(\frac{g_x(x,y)}{g_y(x,y)}\right) \]

The direction of the gradient vector is then decomposed along 8 chain code directions (D0, D1, D2, D3, D4, D5, D6 and D7), as shown in Figure 18. After that, the character image is divided into 9 × 9 blocks (81 blocks). If the gradient vector lies between two directions, it is decomposed into components along those two directions; otherwise, the magnitude of the vector is retained. This results in 63 × 63 × 8 values. Next, the spatial resolution is reduced from 9 × 9 to 5 × 5 by down sampling every two horizontal and vertical blocks with a 5 × 5 Gaussian filter, which gives 200 feature values (5 × 5 × 8) per image.

Figure 18. (a) Eight (D0–D7) chain code directions. (b) Breakdown of gradient vector.

6. Experiments
6.1 Primary Component Recognition
Lehal and Rana [6] reported 98.01% and 96.78% accuracy on 2190 primary component classes using Support Vector Machine [14] (SVM) and k nearest neighbor classifiers respectively. We found that, among the 2190 primary components, many primary shapes of the ligatures look the same. Therefore, our primary component count is reduced to 1873. We have used DCT, Gabor, directional and gradient features for the feature extraction of the primary component. We have used SVM (linear and polynomial kernels) and kNN classifiers for primary component recognition. The results of primary component recognition with the SVM classifier using linear and polynomial kernels of degree 3 and 4 are given in Table 3.
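As a concrete illustration of the DCT features used in these experiments (Section 5.1), the following sketch computes the orthonormal 2-D DCT of equation (1) and reads the first 100 coefficients in zigzag order. It is a minimal pure-Python sketch, not the authors' implementation: the function names are hypothetical, and the zigzag is assumed to start at the top-left corner and step right first, which may differ in detail from the traversal drawn in Figure 16.

```python
import math


def dct_2d(image):
    """Orthonormal 2-D DCT-II of a list-of-lists image, following equation (1)."""
    M, N = len(image), len(image[0])

    def alpha(k, size):
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    # Cosine tables for the row index m and the column index n.
    cos_m = [[math.cos((2 * m + 1) * math.pi * p / (2 * M)) for m in range(M)]
             for p in range(M)]
    cos_n = [[math.cos((2 * n + 1) * math.pi * q / (2 * N)) for n in range(N)]
             for q in range(N)]
    # Separable evaluation: transform along n first, then along m.
    partial = [[sum(image[m][n] * cos_n[q][n] for n in range(N)) for q in range(N)]
               for m in range(M)]
    return [[alpha(p, M) * alpha(q, N) * sum(partial[m][q] * cos_m[p][m] for m in range(M))
             for q in range(N)]
            for p in range(M)]


def zigzag_indices(rows, cols):
    """(row, col) pairs in zigzag order, starting at the top-left corner."""
    order = []
    for s in range(rows + cols - 1):
        diagonal = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        # Alternate the direction of travel on successive anti-diagonals.
        order.extend(diagonal if s % 2 else reversed(diagonal))
    return order


def dct_features(image, count=100):
    """First `count` DCT coefficients of the image in zigzag order."""
    coeffs = dct_2d(image)
    picks = zigzag_indices(len(coeffs), len(coeffs[0]))[:count]
    return [coeffs[r][c] for r, c in picks]


# A constant 32 x 32 image: all energy ends up in the first (top-left) coefficient.
flat = [[1.0] * 32 for _ in range(32)]
features = dct_features(flat)
print(len(features))
```

For the constant test image only the first coefficient is non-zero, which illustrates why reading the matrix from the top-left corner keeps most of the image information in a short vector.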
We also experimented with the k nearest neighbor classifier with different values of k (1, 3 and 5). The results of primary component recognition using the kNN classifier are shown in Table 4. We observed that the accuracy rises as the value of k increases, reaching 98.30%.
We observed that, owing to the large number of classes, SVM takes on average 559 seconds to recognize 3746 primary components of ligatures, whereas the kNN classifier takes only 170 seconds to recognize the same set.
Table 3. Primary component of ligature recognition using SVM classifier (accuracy, %)
| Features | Feature Vector Length | Linear SVM | Polynomial SVM (Degree 3) | Polynomial SVM (Degree 4) |
| Discrete Cosine Transformation (DCT) | 100 | 98.15 | 94.62 | |
| Discrete Cosine Transformation (DCT) | 32 | 95.93 | 92.97 | |
| Discrete Cosine Transformation (DCT) | 64 | 97.03 | 94.12 | |
| Gabor | 189 | 94.68 | 94.55 | |
| Directional Feature | 144 | 90.10 | 89.53 | |
| Gradient Features | 200 | 95.14 | 91.56 | |
The values 92.87, 90.57 and 92.08 also appear with this table in the source; their cells cannot be determined from the extracted layout.

Table 4. Primary component of ligature recognition using kNN classifier (accuracy, %)
| Features | Feature Vector Length | K=1 | K=3 | K=5 |
| Discrete Cosine Transformation (DCT) | 100 | 95.17 | 97.96 | |
| Discrete Cosine Transformation (DCT) | 32 | 93.34 | 97.23 | |
| Discrete Cosine Transformation (DCT) | 64 | 94.90 | 97.78 | |
| Gabor | 189 | 88.56 | 94.93 | 96.39 |
| Directional Feature | 144 | 86.32 | 94.07 | |
| Gradient Features | 200 | 93.60 | 97.25 | |
The remaining K=5 values in the source are 95.79, 97.83, 98.30, 97.88 and 98.27; their rows cannot be determined from the extracted layout.

6.2 Secondary Component Recognition
We have 19 secondary component classes, for which we have extracted DCT, Gabor and zoning features. For training, we have 100 samples for each class. Table 5 shows the results of secondary component recognition with different features and classifiers. As can be seen, the combination of DCT features and a polynomial SVM (degree = 3) classifier attains 99.50% recognition accuracy.
We also used the kNN classifier for secondary component recognition with different values of k (1 and 3). The experimental results of secondary component recognition with kNN are given in Table 6.

Table 5. Secondary component of ligature recognition using SVM classifier (accuracy, %)
| Features | Feature Vector Length | Linear SVM | Polynomial SVM (Degree = 3) |
| DCT | 100 | 93.96 | 99.50 |
| Gabor | 189 | | |
| Directional Feature | 144 | 95.16 | 95.69 |
| Gradient Features | 200 | 96.77 | 93.01 |
The values 90.15, 96.87, 94.37, 95% and 89.50 also appear with this table in the source; their cells cannot be determined from the extracted layout.

Table 6. Secondary component of ligature recognition using kNN classifier (accuracy, %)
| Features | Feature Vector Length | K=1 | K=3 |
| Discrete Cosine Transformation (DCT) | 100 | 94.11 | 98.93 |
| Discrete Cosine Transformation (DCT) | 32 | 93.58 | 98.93 |
| Discrete Cosine Transformation (DCT) | 64 | 94.11 | 98.93 |
| Gabor | 189 | 93.04 | 97.32 |
| Directional Feature | 144 | 94.11 | 99.99 |
| Gradient Features | 200 | 93.04 | 99.46 |

7. Formation of Ligature
After getting the primary and secondary components, we form the ligature by grouping these two codes (the primary and secondary component codes).
We manually crafted a code book which comprises the primary component code and the secondary string code. From the combination of the primary component code and the secondary string code, we extract the ligature. To search for the combination of primary and secondary codes, we implemented a Binary Search Tree (BST). Each node of the binary search tree contains a primary code and a linked list of nodes holding the secondary codes and their ligature codes in Unicode. The binary search tree and the linked-list nodes of our code book structure are shown in Figure 19.
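The code book lookup described above can be sketched as a binary search tree keyed on the primary code, where each node carries its (secondary code, ligature) entries. This is a minimal sketch under assumptions, not the authors' code: the class and method names are hypothetical, a Python list stands in for the linked list, the primary codes are taken loosely from the node labels of Figure 19, and the ligature strings are placeholders rather than real code book entries.

```python
class Node:
    def __init__(self, primary_code):
        self.primary_code = primary_code
        self.entries = []        # (secondary string code, ligature) pairs
        self.left = None
        self.right = None


class CodeBook:
    def __init__(self):
        self.root = None

    def insert(self, primary_code, secondary_code, ligature):
        """Add one code book entry, creating the BST node if it does not exist."""
        if self.root is None:
            self.root = Node(primary_code)
        node = self.root
        while node.primary_code != primary_code:
            side = "left" if primary_code < node.primary_code else "right"
            if getattr(node, side) is None:
                setattr(node, side, Node(primary_code))
            node = getattr(node, side)
        node.entries.append((secondary_code, ligature))

    def lookup(self, primary_code, secondary_code):
        """Return the ligature for the code pair, or None if it is not in the book."""
        node = self.root
        while node is not None and node.primary_code != primary_code:
            node = node.left if primary_code < node.primary_code else node.right
        if node is None:
            return None
        for code, ligature in node.entries:
            if code == secondary_code:
                return ligature
        return None


book = CodeBook()
# Placeholder entries; insertion order 45, 40, 53, 38, 43, 46, 56 gives a balanced tree.
for primary in (45, 40, 53, 38, 43, 46, 56):
    book.insert(primary, "hB", "ligature-%d-hB" % primary)
book.insert(46, "hA", "ligature-46-hA")
print(book.lookup(46, "hB"))
```

Keying the tree on the primary code keeps the search over the large set of primary classes logarithmic for a balanced tree, while the short per-node list is scanned linearly because only a few secondary codes attach to any one primary component.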
Figure 19. A sample Binary Search Tree depiction of the code book.

Suppose the primary component classifier gives a code of 46 and the secondary string code is hB. The BST search then gives ‫ﮔﯩﻮ‬ as the identified ligature.

8. Conclusions and Future Scope
We have used DCT with a linear kernel SVM for the primary component. For secondary component recognition we used DCT features (feature vector length 100) and a polynomial kernel SVM (degree 3). We tested our system on 110 pages of Urdu text and obtained 83% accuracy. Urdu images having no broken or merged primary or secondary components and no Naskh style Urdu text have an accuracy of nearly 90.29%. A new methodology needs to be devised to handle broken or merged ligatures. Also, to recognize Naskh writing style on the same Urdu text page, Naskh recognition OCR and font identification need to be developed.

9. References
1. Shamsher I, Ahmad Z, Orakzai JK, Adnan A. OCR for printed Urdu script using feed forward neural network. Proceedings of World Academy of Science, Engineering and Technology. 2007; 172–5. ISSN: 1307-6884.
2. Ahmad Z, Orakzai JK, Shamsher I, Adnan A. Urdu Nastaleeq optical character recognition. Proceedings of World Academy of Science, Engineering and Technology. 2007; 26:249–52.
3. Pathan IK, Ahmed Ali A, Ramteke RJ. Recognition of offline handwritten isolated Urdu character. Journal of Advances in Computational Research. 2012; 4(1):117–21.
4. Al-Muhtaseb HA, Mahmoud SA, Qahwaji RS. Recognition of off-line printed Arabic text using Hidden Markov models. Journal on Signal Processing. 2008; 2902–12. DOI: 10.1016/j.sigpro.2008.06.013.
5. Ul-Hasan A, Ahmed SB, Rashid SF, Shafait F, Breuel TM. Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. International Conference on Document Analysis and Recognition; 2013. p. 1061–5. DOI: 10.1109/ICDAR.2013.212.
6. Lehal GS, Rana A. Recognition of Nastalique Urdu Ligatures. Proceedings of 4th International Workshop of Multilingual OCR, ACM; Washington DC, USA. 2013. DOI: 10.1145/2505377.2505379.
7. Lehal GS. Ligature Segmentation for Urdu OCR. Proceedings 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE. 2013. p. 1130–4. DOI: 10.1109/ICDAR.2013.229.
8. Sarmad H, Salman A, Qurat ul Ain Akram A. Nastalique segmentation-based approach for Urdu OCR. International Journal on Document Analysis and Recognition. 2015. DOI: 10.1007/s10032-015-0250-2.
9. Saeeda N, Khizar H, Muhammad Imran R, Mohammad Waqas A, Madani Sajjat A, Same KU. The Optical Character recognition of Urdu-like cursive script. Journal of Pattern Recognition. 2013 Oct; 11:1229–48.
10. Lehal GS. Choice of recognizable units for Urdu OCR. Proceedings of the Workshop on Document Analysis and Recognition (Mumbai, India, DAR 2012), Publisher ACM, USA. 2012; 79–85.
11. Rani R, Dhir R, Lehal GS. Gabor features based script identification of lines within a bilingual/trilingual document. International Journal of Advanced Science and Technology. 2014; 66(2014):1–12. ISSN: 2005-4238. Available from: http://dx.doi.org/10.14257/ijast.2014.66.01
12. Singh J, Lehal GS. Comparative Performance Analysis of Feature(S)-Classifier Combination for Devanagari Optical Character Recognition System. International Journal of Advanced Computer Science and Applications (IJACSA). 2014; 5(6). Available from: http://dx.doi.org/10.14569/IJACSA.2014.050608
13. Aggarwal A, Rani R, Dhir R. Recognition of devanagari handwritten numerals using gradient features and SVM. International Journal of Computer Applications. 2012. DOI: 10.5120/7371-0151.
14. Hsu C-W, Chang C-C, Lin C-J. A Practical Guide to Support Vector Classification. 2010. Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
15. Akram Q, Hussain S, Adeeba F, Rehman S, Saeed M. Framework of Urdu Nastalique Optical Character Recognition System. In the Proceedings of Conference on Language and Technology (CLT 14), Karachi, Pakistan. 2014.
16. Sabbour N, Shafait F. A segmentation-free approach to Arabic and Urdu OCR. SPIE Document Recognition and Retrieval XX, DRR'13; San Francisco, CA, USA. 2013 Feb.