Urdu fonts in xd

#Urdu fonts in xd windows#

With the tremendous advancements in computation and communication technologies, the amount of information available in digital form has increased manifold in recent years. Consequently, an increased tendency to digitize existing paper documents such as books, magazines, newspapers, and notes has been observed over the last decade. With this, the need for efficient Optical Character Recognizers (OCRs) that convert digitized images into text has grown. OCR is one of the most researched pattern classification problems. Today, commercially mature OCRs achieve high recognition rates on a number of scripts, for instance those based on the Latin and Chinese alphabets. Despite these developments, OCRs for many languages are either yet to be developed or are in very early stages; cursive Urdu, one such example, is investigated in our study. The alphabet of Urdu is a superset of the Arabic alphabet, borrows some characters from Pashto, and comprises a total of 39 characters. Unlike Pashto and Arabic, which are mostly written in the Naskh style, Urdu generally employs the Nastaliq script, which runs diagonally from right to left. The major challenges posed by Urdu document images include non-uniform inter- and intra-word spacing, overlapping of neighboring and partial words, filled or false loops, and the absence of a fixed baseline.

Given the query text, the primary and secondary ligatures are recognized separately and then associated using a set of heuristics to recognize the complete ligature. Evaluated on the standard UPTI Urdu database, the system achieved a ligature recognition rate of 92% on more than 6000 query ligatures.
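The text does not say which heuristics associate the secondary ligatures with the primary ones. As an illustration only, the sketch below assumes one plausible rule: each secondary ligature (dot or diacritic) is attached to the primary ligature whose horizontal extent it overlaps most, with ties broken by the distance between horizontal centres. The names `Box`, `horizontal_overlap`, and `associate` are hypothetical and do not come from the described system.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Horizontal extent of a recognized component (hypothetical structure)."""
    label: str   # label produced by the recognizer
    x_min: int   # left edge in pixels
    x_max: int   # right edge in pixels


def horizontal_overlap(a: Box, b: Box) -> int:
    """Width in pixels shared by the horizontal extents of a and b."""
    return max(0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))


def centre_distance(a: Box, b: Box) -> float:
    """Distance between the horizontal centres of a and b."""
    return abs((a.x_min + a.x_max) / 2 - (b.x_min + b.x_max) / 2)


def associate(primaries: List[Box], secondaries: List[Box]) -> Dict[int, List[str]]:
    """Map each primary index to the labels of its attached secondaries.

    Assumed rule: largest horizontal overlap wins; ties go to the
    primary with the nearest horizontal centre.
    """
    groups: Dict[int, List[str]] = {i: [] for i in range(len(primaries))}
    if not primaries:
        return groups
    for sec in secondaries:
        best = max(
            range(len(primaries)),
            key=lambda i: (
                horizontal_overlap(primaries[i], sec),
                -centre_distance(primaries[i], sec),
            ),
        )
        groups[best].append(sec.label)
    return groups
```

A complete ligature would then be read off as a primary label together with the secondary labels attached to it. A real system would likely also use vertical position (above or below the main body), which this sketch leaves out.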


The proposed technique relies on statistical features and employs Hidden Markov Models (HMMs) for classification. A total of 1525 unique high-frequency Urdu ligatures from the standard Urdu Printed Text Images (UPTI) database are considered in our study. Ligatures extracted from text lines are first split into primary (main body) and secondary (dots and diacritics) ligatures, and multiple instances of the same ligature are grouped into clusters using a sequential clustering algorithm. A separate HMM is then trained for each ligature on the examples in its cluster, by sliding overlapping windows from right to left over each image and extracting a set of statistical features from every window.
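The right-to-left sliding-window step can be sketched as follows. This is a minimal illustration, not the system's actual feature set: the text only says "a set of statistical features", so ink density and the normalized vertical centre of mass are assumed stand-ins. The function name `sliding_window_features` and the default window width and overlap are likewise hypothetical.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # rows of 0/1 pixels, where 1 means ink


def sliding_window_features(image: Image, width: int = 4,
                            overlap: int = 2) -> List[List[float]]:
    """Return one feature vector per window, scanning right to left.

    Each vector holds two assumed features:
      - ink density: fraction of ink pixels in the window
      - vertical centre of mass of the ink, scaled to [0, 1]
        (0.5 is used when the window contains no ink)
    """
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    rows = len(image)
    cols = len(image[0]) if rows else 0
    step = width - overlap
    features: List[List[float]] = []
    right = cols  # exclusive right edge of the current window
    while right > 0:
        left = max(0, right - width)
        ink = 0
        weighted_rows = 0
        for r in range(rows):
            row_ink = sum(image[r][left:right])
            ink += row_ink
            weighted_rows += r * row_ink
        density = ink / (rows * (right - left))
        centre = (weighted_rows / ink) / max(rows - 1, 1) if ink else 0.5
        features.append([density, centre])
        if left == 0:
            break  # the window has reached the left edge of the image
        right -= step
    return features
```

The resulting sequence of vectors, ordered from the right edge of the ligature to the left, is the kind of observation sequence on which a per-ligature HMM could be trained. At recognition time, a query ligature would be assigned to the model that gives its sequence the highest likelihood.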

This paper presents a segmentation-free optical character recognition system for printed Urdu text in the Nastaliq font, using ligatures as the units of recognition.