Monday, June 1, 2020
OPTICAL CHARACTER RECOGNITION SOFTWARE
CHAPTER 1 ABSTRACT Suppose we wanted to digitize a magazine article or a printed contract. We could spend hours retyping and then correcting misprints. Or we could convert all the required materials into digital format in several minutes using a scanner (or a digital camera) Obviously, a scanner is not enough to make this information available for editing, say in Microsoft Word. All a scanner can do is create an image or a snapshot of the document that is nothing more than a collection of black and white or colour dots, known as a raster image. In order to extract and repurpose data from scanned documents, camera images or image-only PDFs, we need an Optical Character Recognition software that would single out letters on the image, put them into words and then words into sentences, thus enabling us to access and edit the content of the original document. Optical Character Recognition or OCR, is a technology long used by libraries and government agencies to make lengthy documents quickly available electronically. Advances in OCR technology have spurred its increasing use by enterprises. For many document-input tasks, OCR is the most cost-effective and speedy method available. And each year, the technology frees acres of storage space once given over to file cabinets and boxes full of paper documents. This project is aimed at designing and developing a C/C++ based basic optical character recognition system capable of converting a preprocessed image file containing printed text into an editable text file. This will enable fast conversion of images into text which can be later edited if required. CHAPTER 2 INTRODUCTION WHAT IS OCR? Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. 
Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the OCR term has now been broadened to include digital image processing as well. Early systems required training (the provision of known samples of each character) to read a specific font. Intelligent systems with a high degree of recognition accuracy for most fonts are now common. Some systems are even capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components. All OCR systems include an optical scanner for reading text, and sophisticated software for analyzing images. Most OCR systems use a combination of hardware (specialized circuit boards) and software to recognize characters, although some inexpensive systems do it entirely through software. Advanced OCR systems can read text in large variety of fonts, but they still have difficulty with handwritten text. HISTORY In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Handel who obtained a US patent on OCR in USA in 1933. In 1950, David H. Shepard decided it must be possible to build a machine to do this, and, with the help of Harvey Cook, a friend, built Gismo in his attic. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the worlds first several OCR systems used in commercial operation. While both Gismo and the later IMR systems used image analysis, as opposed to character matching, and could accept some font variation, Gismo was limited to reasonably close vertical registration, whereas the following commercial IMR scanners analyzed characters anywhere in the scanned field, a practical necessity on real world documents. 
The first commercial system was installed at Reader's Digest in 1955; many years later, Reader's Digest donated it to the Smithsonian, where it was put on display. In about 1965, Reader's Digest and RCA collaborated to build an OCR document reader designed to digitize the serial numbers on Reader's Digest coupons returned from advertisements. The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. In 1974, Ray Kurzweil developed the first omni-font optical character recognition system, a computer program capable of recognizing text printed in any normal font. However, this device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled; it covered an entire tabletop but functioned exactly as intended. Today, OCR technology incorporates high-speed scanners and complex computer algorithms to increase speed and data accuracy. OCR systems no longer require training to read a specific font. Current systems can recognize most fonts with a high degree of accuracy and some are capable of outputting formatted text that closely approximates the printed page.

APPLICATIONS

ACCESS
An entire company with multiple sites can access documents on a central server. Robust database applications can manage electronic documents, performing searches based on document location or content.
A content management program can add value to an electronic storage system, allowing users to store additional information with the document. Finding information within a long document is easier and faster. Multiple users can access an electronic document simultaneously. Users can easily and instantly distribute documents to a number of people at once via email. CONTROL Electronic document storage systems prevent documents from being misfiled or erroneously deleted. A central content management system serves as a single source for up-to-date information. The system can keep track of document revisions maintaining a record of the who, when, and what of every change made. Security is easier to maintain in an electronic environment. Administrators can Control who can see, read, modify, or destroy a particular document. Electronic documents can easily be stored offsite as part of a disaster recovery program. RESOURCE EFFICIENCY Electronic documents use less office space than traditional paper files Paper documents that need to be retained can be moved off-site to storage facilities. Electronic filing systems save human resources because users can access files on their own rather than requiring the help of support staff. LIMITATIONS OCR has never achieved a read rate that is 100% perfect. Because of this, a system which permits rapid and accurate correction of rejects is a major requirement. A great concern is the problem of misreading a character (substitutions). Through the years, the desire has been: to increase the accuracy of reading, that is, to reduce rejects and substitutions to reduce the sensitivity of scanning to read less-controlled input to eliminate the need for specially designed fonts (characters), and to read handwritten characters efficiently. 
However, today's systems, while much more forgiving of printing quality and more accurate than earlier equipment, still work best when specially designed characters are used and attention to printing quality is maintained. These limits are not objectionable for most applications, and the number of dedicated users of OCR systems is growing each year.

OUR PROJECT
Our project will focus on creating efficient core components of an OCR system, such as text detection, segmentation, character extraction from the image and character recognition. The OCR software is divided into modules and sub-modules, each with a specific task. This enables easy updating, easy debugging, and streamlined distribution of work among team members. All code is written in C/C++ due to the wide availability of supporting libraries and IDEs. The three main modules of our software are:

1. Image input and preprocessing
Read an image file (*.bmp format).
Recognize and extract text from the image (apply image binarization and thinning).
Apply segmentation to isolate single characters.
Send the image to the recognition module.

2. Character recognition
Use basic pattern recognition/feature extraction to determine which character it is.
Find a match in a look-up table of characters to identify the exact letter.
Use a dictionary to correct spelling errors.
Send the processed data to the output module.

3. Output
Create a basic text file (*.txt), set its attributes (name, location, access permissions etc.), write the output of the OCR into the file, then save and close the file.

CHAPTER 3 HARDWARE AND SOFTWARE REQUIREMENTS

3.1 HARDWARE USED:
General-purpose PC running Windows XP 32-bit/64-bit (the software may not run properly on older Windows versions).
Flatbed Scanner (optional) Digital Camera(optional) SOFTWARE USED: Bloodshed Dev-C++ v4.9.9.2 (C/C++ IDE) Microsoft Visual C++ Express 2008(C/C++ IDE) Borland Turbo C++ (C/C++ IDE) XVI32 (hex editor, binary file analysis) Adobe Photoshop(image editing, preprocessing and analysis) Microsoft Paint(image editing, preprocessing and analysis) CHAPTER 4 DATA FLOW DIAGRAMS LEVEL 0 LEVEL 1 LEVEL 2 preprocessImage function: OCRprocess function: textOutput function: LEVEL 3 CHAPTER 5 FUNCTIONS AND METHODOLOGIES USED 5.1 BMP IMAGES BMP images can range from black and white (1 bit per pixel) up to 24 bit colour (16.7 million colours). The input file that we are going to use for our software is an 8 bit bmp image, i.e., pixels are stored with a color depth of 8 bits per pixel. The color resolution of each pixel can be from 0 to 255(2^8=256). So any file that has to be processed must be first converted to it. Microsoft has defined a particular representation of color bitmaps of different color depths, as an aid to exchanging bitmaps between devices and applications with a variety of internal representations. They called these device-independent bitmaps or DIBs, and the file format for them is called DIB file format or BMP file format. According to Microsoft support: A device-independent bitmap (DIB) is a format used to define device-independent bitmaps in various color resolutions. The main purpose of DIBs is to allow bitmaps to be moved from one device to another (hence, the device-independent part of the name). A DIB is an external format, in contrast to a device-dependent bitmap, which appears in the system as a bitmap object. A typical BMP file usually contains the following blocks of data: Header Info Header Color Palette Image Data BMP File Header Stores general information about the BMP file. Bitmap Information (DIB header) Stores detailed information about the bitmap image. Color Palette Stores the definition of the colors being used for indexed color bitmaps. 
Bitmap Data Stores the actual image, pixel by pixel.

BMP HEADER
This block of bytes is at the start of the file and is used to identify the file. A typical application reads this block first to ensure that the file is actually a BMP file and that it is not damaged. The first two bytes of the BMP file format are the character B then the character M in 1-byte ASCII encoding. All of the integer values are stored in little-endian format (i.e. least-significant byte first).

Offset   Size      Purpose
0000h    2 bytes   The magic number used to identify the BMP file: 0x42 0x4D (the hex code points for B and M; 19778 in decimal when read as a little-endian 16-bit value). The following entries are possible:
                     BM  Windows 3.1x, 95, NT, etc.
                     BA  OS/2 Bitmap Array
                     CI  OS/2 Color Icon
                     CP  OS/2 Color Pointer
                     IC  OS/2 Icon
                     PT  OS/2 Pointer
0002h    4 bytes   The size of the BMP file in bytes.
0006h    2 bytes   Reserved; actual value depends on the application that creates the image.
0008h    2 bytes   Reserved; actual value depends on the application that creates the image.
000Ah    4 bytes   The offset, i.e. starting address, of the byte where the bitmap data can be found.

BITMAP INFORMATION (DIB HEADER)
This block of bytes gives detailed information about the image. All values are stored as unsigned integers, unless explicitly noted.

Offset   Size      Purpose
0Eh      4 bytes   The size of this header (40 bytes).
12h      4 bytes   The bitmap width in pixels (signed integer).
16h      4 bytes   The bitmap height in pixels (signed integer).
1Ah      2 bytes   The number of color planes being used. Must be set to 1.
1Ch      2 bytes   The number of bits per pixel, which is the color depth of the image. Typical values are 1, 4, 8, 16, 24 and 32.
1Eh      4 bytes   The compression method being used.
22h      4 bytes   The image size. This is the size of the raw bitmap data (see below), and should not be confused with the file size.
26h      4 bytes   The horizontal resolution of the image (pixels per meter, signed integer).
2Ah      4 bytes   The vertical resolution of the image (pixels per meter, signed integer).
2Eh      4 bytes   The number of colors in the color palette, or 0 to default to 2^n, where n is the number of bits per pixel.
32h      4 bytes   The number of important colors used; generally ignored.

COLOR PALETTE
The palette occurs in the BMP file directly after the BMP header and the DIB header. Therefore, its offset is the sum of the size of the BMP header and the size of the DIB header. The palette is a block of bytes (a table) listing the colors available for use in a particular indexed-color image. Each pixel in the image is described by a number of bits (1, 4, or 8) which index a single color in this table. The purpose of the color palette in indexed-color bitmaps is to tell the application the actual color that each of these index values corresponds to. A DIB always uses the RGB color model. In this model, a color is expressed in terms of different intensities (from 0 to 255) of the additive primary colors red (R), green (G), and blue (B). A color is thus defined using the 3 values for R, G and B (though stored in backwards order in each palette entry).

BITMAP DATA
This block of bytes describes the image, pixel by pixel. Pixels are stored upside-down with respect to normal image raster scan order, starting in the lower left corner, going from left to right, and then row by row from the bottom to the top of the image. Uncompressed Windows bitmaps can also be stored from the top row to the bottom, if the image height value is negative. In the original DIB, the only four legal numbers of bits per pixel are 1, 4, 8, and 24. In all cases, each row of pixels is extended to a 32-bit (4-byte) boundary, filling with an unspecified value (not necessarily 0) so that the next row will start on a multiple-of-four byte location in memory or in the file. The total number of bytes in a row can be calculated as the image size divided by the bitmap height in pixels.
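The little-endian layout and the 4-byte row padding described above can be illustrated with a short sketch. The helper names (readLE16, readLE32, hasBmpMagic, rowStride) are hypothetical and are not taken from the project's source; the sketch only assumes that the header bytes have already been read into a buffer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: read a little-endian unsigned 16-bit value
// (least-significant byte first) from a byte buffer.
std::uint16_t readLE16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Hypothetical helper: read a little-endian unsigned 32-bit value.
std::uint32_t readLE32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Check the two-byte magic number at offset 0000h ('B' then 'M').
bool hasBmpMagic(const unsigned char* fileHeader) {
    return fileHeader[0] == 0x42 && fileHeader[1] == 0x4D;
}

// Number of bytes occupied by one stored row of pixels, including the
// padding that extends each row to a multiple of 4 bytes.
std::uint32_t rowStride(std::uint32_t widthPixels, std::uint16_t bitsPerPixel) {
    assert(bitsPerPixel > 0);
    return ((widthPixels * bitsPerPixel + 31u) / 32u) * 4u;
}
```

For example, an 8-bit image that is 10 pixels wide needs 10 bytes of pixel data per row but occupies 12 bytes in the file, so a reader that ignores the padding would drift out of alignment after the first row.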
Following these rules there are several ways to store the pixel data depending on the color depth and the compression type of the bitmap. One-bit (two-color, for example, black and white) pixel values are stored in each bit, with the first (left-most) pixel in the most-significant bit of the first byte. An unset bit will refer to the first color table entry, and a set bit will refer to the last (second) table entry. Four-bit color (16 colors) is stored with two pixels per byte, the left-most pixel being in the more significant nibble. Each pixel value is an index into a table of up to 16 colors. Eight-bit color (256 colors) is stored one pixel value per byte. Each byte is an index into a table of up to 256 colors. RGB color (24-bit) pixel values are stored with bytes as BGR (blue, green, red). int main() This function asks for an input filename and then reads the BMP image. The header of the image is stored in with the help of the class BMPHEADER and the pixel values are retrieved and stored in a dynamically allocated unsigned int array, bmpImage which is used by all other functions. It does not use predefined bmp libraries and only relies on standard file handling libraries. However since the pixels are stored starting from the bottom left, a function called bmpImageCorrectFunc() is called after reading. It changes the order and the bmpImage array now stores the pixels starting from top left. Then main calls these three functions:- void preProcessImage(unsigned int * bmpImage, BMPHEADER bmpHeader, BOUNDINGPIXEL boundingPixelCoord); void OCRprocess(char * text, unsigned int *bmpImage, BMPHEADER bmpHeader, BOUNDINGPIXEL boundingPixelCoord); void textOutput(char arr[],char *filename); preprocessImage() This function calls other functions and performs pre processing on the image. First the bmp image is read and then it is stored in an array and a Header structure. These are used to access various attributes (pixel colour values, height width size bits per pixel etc.) 
of the image. The image is then converted into a pure black and white 8-bit image and a modified thinning algorithm is applied to obtain single-pixel-wide lines. A projection histogram based segmentation algorithm is applied to segment individual lines, words and characters. The following functions are called from this function, in this order:

void binarization(unsigned int * bmpImage, BMPHEADER bmpHeader);
void thin(unsigned int * bImage, BMPHEADER bmph);
void segmentation(unsigned int *bmpImage, BMPHEADER bmpHeader, BOUNDINGPIXEL boundingPixelCoord);

binarization()
This function uses a technique known as thresholding to convert the BMP image into a completely black and white image. Thresholding often provides an easy and convenient way to perform a segmentation on the basis of the different intensities or colors in the foreground and background regions of an image. In addition, it is often useful to be able to see what areas of an image consist of pixels whose values lie within a specified range, or band, of intensities (or colours). The input to a thresholding operation is a grayscale or color image, or a black and white image having some coloured pixels due to noise. The output is a binary image representing the segmentation. Black pixels correspond to background and white pixels correspond to foreground (or vice versa). The segmentation is determined by a single parameter known as the intensity threshold. In a single pass, each pixel in the image is compared with this threshold. If the pixel's intensity is higher than the threshold, the pixel is set to, say, white in the output. If it is less than the threshold, it is set to black. Multiple thresholds can also be specified, so that a band of intensity values can be set to white while everything else is set to black.

SHADE CARD
Using the above shade card, the threshold was set at a colour value of 127. All pixels of the input image below a colour value of 127 are changed to black and those above it to white.
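The single-threshold rule described above can be sketched as follows. This is a minimal illustration, not the project's binarization() function: the name binarizeSketch, the use of 0 for black and 255 for white, and the choice to map a value of exactly 127 to black are all assumptions, since the text only specifies what happens below and above the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const unsigned int kBlack = 0;    // assumed output value for black
const unsigned int kWhite = 255;  // assumed output value for white

// Replace every 8-bit pixel value with pure black or pure white.
// Values strictly above the threshold become white; all others
// (including the threshold itself, by assumption) become black.
void binarizeSketch(std::vector<unsigned int>& pixels,
                    unsigned int threshold = 127) {
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        assert(pixels[i] <= 255);  // 8 bits per pixel
        pixels[i] = (pixels[i] > threshold) ? kWhite : kBlack;
    }
}
```

Because each pixel is examined exactly once and independently of its neighbours, the operation is a single pass over the image array.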
However, on studying the shades carefully, we noticed that pixel values 7, 8 and 9 are closer to white than to black. So they were included as exceptions and changed to white. The following images show an example of binarization:

INPUT   OUTPUT

Now the black and white image is sent to the thinning function.

thin()
The basic idea of thinning is to repeatedly delete object boundary pixels so as to reduce the line width to one pixel. This must be done without locally disconnecting the object (splitting the object into two parts) or deleting line end points. The result is like a skeleton of the image. We used the modified Rutovitz parallel processing algorithm. In parallel processing, the value of a pixel at the n-th iteration depends on the values of the pixel and its neighbours at the (n-1)-th iteration. Thus all the pixels of the image can be processed simultaneously.

MODIFIED RUTOVITZ ALGORITHM
Definitions: To decide whether a pixel P1 should be deleted, a 3*3 window is used for each pixel. That is, the values of the eight neighbours of the central pixel (P1) are used in the calculation of its value for the next iteration. The eight neighbouring values are denoted in the following way:

P9 P2 P3
P8 P1 P4
P7 P6 P5

We define the following:
0 represents WHITE and 1 represents BLACK.
N(P1): number of non-zero neighbours: N(P1) = P2 + P3 + ... + P9
S(P1): number of 0 to 1 transitions in the sequence (P2, P3, ..., P9, P2)

Algorithm: Repeat until no more changes can be made.
A pixel P1 is marked if all of the following conditions are true:
P1 = 1
S(P1) = 1
2 ≤ N(P1) ≤ 6
P2 or P4 or P8 = 0, or S(P2) != 0
P2 or P4 or P6 = 0, or S(P4) != 0
Delete the marked pixels.

This function makes permanent changes in the image itself. An example of thinning:

INPUT   OUTPUT

segmentation()
Segmentation refers to the process of partitioning a digital image into multiple segments (sets of pixels).
The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. Several general-purpose algorithms and techniques have been developed for image segmentation. Since there is no general solution to the image segmentation problem, these techniques often have to be combined with domain knowledge in order to effectively solve an image segmentation problem for a problem domain. Some methods are described below:

Clustering methods: Using the K-means algorithm, the image is iteratively divided into clusters.
Histogram-based methods: Horizontal and vertical projection histograms are used to partition the image.
Edge detection methods: Region boundaries and edges are segmented, which is very useful in matching shapes.
Region growing methods: A specific region is segmented by comparing each pixel's intensity with that of a starting point (seed point). Thus all pixels of similar intensity are covered by the region as it grows.
Level set methods: Motion equations on a curve/surface are used to calculate the propagation of the contour.
Graph partitioning methods: Graphs can effectively be used for image segmentation. Usually a pixel or a group of pixels are vertices, and edges define the (dis)similarity among the neighbourhood pixels.
Watershed transformation: It considers the gradient magnitude of an image as a topographic surface. Pixels having the highest gradient magnitude intensities correspond to watershed lines, which represent the region boundaries. Water placed on any pixel enclosed by a common watershed line flows downhill to a common local intensity minimum. Pixels draining to a common minimum form a catch basin, which represents a segment.
Model based segmentation: The central assumption of such an approach is that structures of interest/organs have a repetitive form of geometry.
Multi-scale segmentation: Image segmentations are computed at multiple scales in scale-space and sometimes propagated from coarse to fine scales, e.g. 1D, 2D etc.
Semi-automatic segmentation: The user outlines the region of interest with mouse clicks and algorithms are applied so that the path that best fits the edge of the image is shown.
Neural networks segmentation: Using artificial neural networks, segmentation is performed on small areas. Each pixel is treated as a neuron which contains its colour/intensity values and is connected to neighbouring neurons. These connections can be evaluated to segment the region.

This project utilizes the projection histogram method because:
It is the most efficient in the case of pure black and white images.
It is well suited for OCR segmentation, as several iterations of the same algorithm can be applied to segment a line, a word or a single character.
It has very low processing and memory requirements.

PROJECTION HISTOGRAM SEGMENTATION METHOD
A projection histogram refers to a graph representing the horizontal/vertical projection of an image. The projection of a black and white image is simply the total number of black pixels in a row or column. When a horizontal projection is performed on an image, an array stores the number of black pixels in each row of the image; when a vertical projection is performed, an array stores the number of black pixels in each column of the image. These steps are done by hProjFunc() and vProjFunc(), which store the data in hProj[ ] and vProj[ ] respectively. The data from these two arrays is utilized to segment the image into lines, words and individual characters. To segment the lines in an image containing text, hProjFunc() is applied to the whole image. Then the array containing the horizontal projection data, hProj[ ], is analyzed and empty (0) values are identified. These 0 values in the array indicate an empty line i.e.
the portion between two lines of text. Thus the top and bottom positions of each line can be easily calculated. Once all lines are segmented, vProjFunc() is applied to the images of single lines. The vertical projection of each line, contained in vProj[ ], can then be used to determine the positions of words and individual characters in that line by analyzing the empty portions of the array. hProjFunc() is applied again to the images of individual characters to obtain the coordinates of the top-left and bottom-right corners of an imaginary box tightly enclosing each character. The red and green pixels in the above image indicate the top-left (red) and bottom-right (green) corners of the boxes enclosing individual characters. The final output of segmentation is stored in a boundingPixelCoord structure which contains 4 arrays holding the x and y coordinates of the top-left and bottom-right corners of all characters (x1, y1, x2, y2). It also stores the top and bottom values for each line as well as the average width of characters for each line (this is utilized to determine spacing between words).

OCRprocess()
The OCRprocess() function is the main optical character recognition function. It converts the image data into text. Once called by int main(), it analyzes the segmentation data and performs the following steps:
Line identification/changing
Creating appropriate spacing between words
Finding the start character and end character of each word
Determining the final word by using OCRcontrol(), which calls extraction1() and dictionary()
Returning the whole text in a dynamic array containing the extracted words, spaces and lines
It uses the segmentation data to identify character positions and the start and end of each line, and inserts \n at the end of each line to change lines.
It calculates the start and end of each word, and the spacing, by matching the average character width against the positions of two consecutive characters: if they are far apart, a space is inserted and the start and end of the word are stored. The following function is called by this function:

void OCRcontrol(char * wordArr,unsigned int start, unsigned int end, unsigned int *bmpImage, BMPHEADER bmpHeader, BOUNDINGPIXEL boundingPixelCoord, int count, unsigned int firstWordFlag);

Once the start and end of a single word are known, they are passed on to the OCRcontrol() function.

OCRcontrol()
The OCRcontrol() function performs the actual conversion from image to text. It calls two functions, extraction1() and dictionary(). The extracted word returned by extraction1() is passed on to dictionary(), which finds the closest possible match. If no discrepancies are found between the dictionary output and the extraction output, the word is returned to OCRprocess() to be appended to the final dynamic text array. These are the two functions called:

void extraction1(char * wordArr,unsigned int start, unsigned int end, unsigned int *bmpImage, BMPHEADER bmpHeader, BOUNDINGPIXEL boundingPixelCoord);
void dictionary(char *input,char ** equalStore, int eqValue, char ** unequalStore, int uneqValue);

extraction()
When the input data to an algorithm is too large to be processed and is suspected to be highly redundant (much data, but not much information), the input data is transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction. If the extracted features are carefully chosen, the feature set is expected to capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input.
Feature extraction is an essential part of any OCR system. It utilizes various algorithms to accurately match each character against a number of reference files. It extracts various features such as edges, boundaries, loops and extreme points to correctly identify the character. Various methods used for feature extraction:

Zoning: The image of each character is divided into a number of zones and the average gray level or binary level in each zone is calculated.
Center of gravity: Each black pixel is treated as a unit mass and the center of gravity of each character is calculated.
Tips: A character is identified using the number and positions of the tips in the character.
Number of branches or intersections: A character is identified using the number and positions of the intersections in the character.
Projection histograms: Horizontal and vertical projection histograms are used to accurately match the character.
Extreme points: The positions of various extreme points are used to match characters.
Moments: Similar to the center of gravity method; the moments of each character are calculated.

The software utilizes multiple methods to efficiently and accurately match each individual character. It utilizes two main functions; extraction1() is described here. Features used in extraction function 1 to match individual characters:
Projection histograms (primary)
Top and bottom gap detection (secondary)
Closed loop detection (secondary)
Character height calculation (secondary)

PROJECTION HISTOGRAMS
The projection histogram extraction method utilizes techniques similar to segmentation. A horizontal and a vertical projection of each character are calculated by hProjFunc() and vProjFunc() and stored in hProj[ ] and vProj[ ] respectively. Accurate matching of these arrays (hProj and vProj) to stored data is done using two methods. The first method directly matches the values at various indices of the array. The second method matches the peaks in the graph, i.e. the positions of the maximum and minimum values in the arrays.
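The projections used for both segmentation and matching can be sketched as below. This is a hedged illustration rather than the project's hProjFunc() and vProjFunc(): the BinaryImage struct, the function names and the convention that 1 means black are assumptions made for the example. The helper nonEmptyRuns shows how runs of non-zero projection values separate one text line (or character) from the next.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed layout: row-major, top-left origin, 1 = black (ink), 0 = white.
struct BinaryImage {
    std::size_t width;
    std::size_t height;
    std::vector<unsigned char> pixels;  // size must equal width * height
};

// Number of black pixels in each row.
std::vector<unsigned int> horizontalProjection(const BinaryImage& img) {
    assert(img.pixels.size() == img.width * img.height);
    std::vector<unsigned int> hProj(img.height, 0);
    for (std::size_t y = 0; y < img.height; ++y)
        for (std::size_t x = 0; x < img.width; ++x)
            if (img.pixels[y * img.width + x]) ++hProj[y];
    return hProj;
}

// Number of black pixels in each column.
std::vector<unsigned int> verticalProjection(const BinaryImage& img) {
    assert(img.pixels.size() == img.width * img.height);
    std::vector<unsigned int> vProj(img.width, 0);
    for (std::size_t y = 0; y < img.height; ++y)
        for (std::size_t x = 0; x < img.width; ++x)
            if (img.pixels[y * img.width + x]) ++vProj[x];
    return vProj;
}

// Inclusive [first, last] index pairs of consecutive non-zero entries.
// Applied to a horizontal projection, each pair is the top and bottom
// row of one text line; zero entries are the gaps between lines.
std::vector<std::pair<std::size_t, std::size_t> >
nonEmptyRuns(const std::vector<unsigned int>& proj) {
    std::vector<std::pair<std::size_t, std::size_t> > runs;
    std::size_t i = 0;
    while (i < proj.size()) {
        if (proj[i] == 0) { ++i; continue; }
        std::size_t first = i;
        while (i < proj.size() && proj[i] != 0) ++i;
        runs.push_back(std::make_pair(first, i - 1));
    }
    return runs;
}
```

Running nonEmptyRuns on the vertical projection of a single line gives column ranges in the same way, which is why one routine can serve for lines, words and characters.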
Before the above matching schemes are applied normalization is done to account for varying font sizes. The positions of various indices of hProj and vProj arrays are normalized over a range of 0 to 5 where the position of the 0th value is represented by 0, last position by 5, the mid-position by 2.5. Thus depending on the height and width a normalizing factor is calculated and used to match characters of different font sizes. This method requires the use of 4 reference files to store data for both horizontal/vertical projection for both types of matching schemes. The reference files store Courier New font characters, size 12- all lowercase alphabets, uppercase alphabets and numbers. TOP AND BOTTOM GAP DETECTION This method analyzes the top two and bottom two lines to identify gaps. The gaps are identified by finding a set of white pixels enclosed between sets of black pixels. It is performed by the lineGaps() function. This method helps to differentiate between characters such as h and b. CLOSED LOOP DETECTION This method iteratively employs the use of gap detection to identify a closed loop. It is performed by loopCalc() function. If a set of gap lines is found to be enclosed between two sets of no gap lines then a closed loop is found. This helps to differentiate between a, e, o, B and h, c, t. CHARACTER HEIGHT CALCULATION This performed by analyzing the position of the enclosing box coordinates calculated by segmentation relative to the top and bottom of the line. Thus differentiating between small alphabets such as a, e, v and t, h, g etc. The secondary methods are matched using another reference file which stores data for all secondary features. This is done in the setXtraParam() function. Data of a total set of 62 characters( 26 lowercase, 26 uppercase, 10 numbers) stored for all reference files. Once the features are calculated/identified each character of the reference file is assigned a matching value. 
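The gap and closed-loop features can be sketched roughly as follows. This is one possible reading of the description, not the project's lineGaps() or loopCalc(): the names countGaps and hasClosedLoop are hypothetical, rows are assumed to hold 1 for black and 0 for white, and a "no gap line" is assumed to mean a row that contains ink but no enclosed white run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<unsigned char> Row;  // 1 = black, 0 = white (assumed)

// Count runs of white pixels that have black pixels on both sides.
int countGaps(const Row& row) {
    int gaps = 0;
    bool seenBlack = false;   // at least one black pixel passed so far
    bool inWhiteRun = false;  // currently in white pixels after a black one
    for (std::size_t x = 0; x < row.size(); ++x) {
        if (row[x]) {
            if (seenBlack && inWhiteRun) ++gaps;  // white run closed by black
            seenBlack = true;
            inWhiteRun = false;
        } else if (seenBlack) {
            inWhiteRun = true;
        }
    }
    return gaps;
}

// True if some rows with gaps are enclosed between an inked no-gap row
// above and an inked no-gap row below.
bool hasClosedLoop(const std::vector<Row>& rows) {
    enum State { kSearching, kAfterSolid, kInGaps };
    State state = kSearching;
    for (std::size_t y = 0; y < rows.size(); ++y) {
        bool ink = false;
        for (std::size_t x = 0; x < rows[y].size(); ++x)
            if (rows[y][x]) ink = true;
        int gaps = countGaps(rows[y]);
        if (ink && gaps == 0) {            // solid row
            if (state == kInGaps) return true;
            state = kAfterSolid;
        } else if (gaps > 0) {             // row with an enclosed gap
            if (state == kAfterSolid) state = kInGaps;
        } else {                           // blank row resets the search
            state = kSearching;
        }
    }
    return false;
}
```

On a small "o"-like shape the gap rows are closed at the bottom, so a loop is reported; on an "h"-like shape the gap rows run to the bottom edge and no loop is reported. Real glyphs after thinning may need a more tolerant rule than this sketch applies.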
At the end of all of the above operations, the maxCharXtraParam() function identifies the character with the highest matching value. Finally, the matching character is appended to the word array. Once all iterations are complete, the word array contains the extracted word.

dictionary()

The dictionary file contains an extensive list of approximately 135,000 words from both US and UK English. It also contains common acronyms, abbreviations, names of countries and cities, and common names.

First, a function char *filenameGen() creates separate reference files for the different word lengths and returns them. Each length file holds all the words of that length in ascending order. This function thus splits the original dictionary into 23 reference files.

The extracted words are then sent as input to the dictionary function. The length of the word is calculated and, using file handling, the appropriate file is opened. The void equal() function searches for the word in the file. If an exact match is found, that word is stored in a dynamic array equalstored[] to be returned. Words with at most one error are also stored in the array.

Since extraction can lead to two characters being combined into one, the dictionary function also calls a void unequal() function, which opens the file containing words that are one character longer than the input. It finds the closest match for the input word by comparing relative positions. For example, aple would give the correct output apple. The words with zero or one error are stored in a dynamic array unequalstored[], which is returned. The OCRcontrol() function chooses the best match for the input word.
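The two kinds of dictionary comparison can be sketched in C as below. This is a simplified illustration, not the original equal(), unequal() or OCRcontrol(): all function names are hypothetical, a small in-memory word list stands in for the length-specific reference files, and the order of preference (exact match, then one wrong letter, then one missing letter) is an assumption, since the text does not say how the best match is chosen.

```c
#include <string.h>

/* Number of positions at which two equal-length words differ;
   -1 if the lengths differ. */
int mismatchCount(const char *a, const char *b)
{
    size_t n = strlen(a);
    if (strlen(b) != n)
        return -1;
    int diff = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            diff++;
    return diff;
}

/* Returns 1 if `longer` equals `shorter` with exactly one extra
   character inserted somewhere, e.g. "aple" -> "apple". */
int isOneInsertion(const char *shorter, const char *longer)
{
    size_t ls = strlen(shorter);
    if (strlen(longer) != ls + 1)
        return 0;
    size_t i = 0;
    while (i < ls && shorter[i] == longer[i])
        i++;
    /* Skip one character of `longer`; the remainders must be identical. */
    return strcmp(shorter + i, longer + i + 1) == 0;
}

/* Assumed preference order: exact match, then the first word with one
   wrong letter, then the first word with one missing letter.
   Returns NULL if no candidate is found. */
const char *bestMatchSketch(const char *word, const char *const *dict, size_t n)
{
    const char *oneError = NULL;
    const char *oneMissing = NULL;
    for (size_t i = 0; i < n; i++) {
        int diff = mismatchCount(word, dict[i]);
        if (diff == 0)
            return dict[i];
        if (diff == 1 && !oneError)
            oneError = dict[i];
        if (!oneMissing && isOneInsertion(word, dict[i]))
            oneMissing = dict[i];
    }
    return oneError ? oneError : oneMissing;
}
```

Given the word list word, world, apple, would, the input werld resolves to world (one wrong letter) and aple resolves to apple (one missing letter). The original program narrows the search by opening only the reference file for the relevant word length, whereas this sketch scans the whole list.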
CHAPTER 6 OUTPUTS AND RESULTS

Main menu: Enter filename and run program
Original 8bit .BMP image of a sample word:
Black and white/binarization output sample:
Thinning output sample:
Segmentation output sample:
Feature Extraction output sample: werld

However, the dictionary changes the word to the closest possible match, resulting in the word: world

Actual 8bit image file textpaint.bmp read:
Output textpaint.txt file created:
Another example:

Accuracy: The accuracy may be affected by a number of factors such as:

Noise
Rotation/skew error
Poor image quality
Reference data that does not match the font style of the text
Unsupported special characters, e.g. ; @ ?

                      Courier New 12   Courier New 18   Times New Roman 14
Total Characters      434              434              434
Incorrect Characters  27               45               98
Correct Characters    407              389              336
% Accuracy            93.77%           89.63%           77.42%

The above observations show that the software is capable of handling Courier New and similar font styles, for which it is optimized. Accuracy drops considerably for Times New Roman (for which the software is not optimized), but it is still well above 50%.

Accuracy may also differ slightly with the size of the image, as the probability of error varies with the character count.

Courier New 12
Total Characters      434      867      1465
Incorrect Characters  27       65       44
Correct Characters    407      802      1421
% Accuracy            93.77%   92.50%   96.99%

Thus the software is capable of achieving high accuracy for the font style and size for which it is optimized.

RESULT

The software is designed to read an 8bit/256 colour bitmap image that contains only text. It is optimized to read the Courier New font style at font sizes of 10 to 16 points. Other font styles and sizes will show reduced accuracy.

CHAPTER 7 CONCLUSION AND LIMITATIONS

This software functions as a basic concept demonstrator and performs all the basic functions required of optical character recognition software.
It is capable of very accurately identifying Courier New characters and also shows considerable accuracy with other similar font styles. The software runs most efficiently on Windows XP and is capable of reading 8bit .BMP image files.

The software has a few inherent limitations which may be overcome with further development. These are:

Support only for the 8bit/256 colour .BMP image file format and no other format
Optimization only for the Courier New font style
Support only for images containing text; images containing both text and pictures in a single file are not supported
Optimization for grayscale images with a white background only
Inability to identify special characters such as @, !, ?
A command line interface, which is archaic and must be replaced by a GUI
Preprocessing that does not correct noise, rotation or skew error
A design aimed mainly at Windows XP, with no support for Mac OS X, Linux, etc.

CHAPTER 8 FUTURE SCOPE

As shown above, the current implementation of the software has various limitations. These limitations can be overcome by adding features such as:

Support for all types of image formats, such as JPEG, TIFF and GIF.
Inclusion of rotation/skew correction and noise removal in the preprocessing module.
A user-friendly graphical user interface (GUI) capable of running on various systems.
Support for more operating systems, such as Linux, Windows Vista and Mac OS X.
Improved support for other font styles through the addition of more reference files containing data for those fonts.
Use of more complex and accurate extraction methods such as artificial neural networks (ANN), which require a considerable amount of processing but provide better results.
Support for more languages and scripts, such as Hindi, Urdu and Mandarin.

The capability to decipher cursive handwriting would greatly improve the scope of the software. It would require the use of heuristic analysis, support for different handwriting styles, improved segmentation capability, pattern matching and ANN.
The basic OCR software can also be used along with a video analysis library to apply optical character recognition to text contained in video streams. This would create various new possibilities such as real-time in-video translation and real-time vehicle tracking using number plates.