Improving recognition of handwriting with component samples

Yet another post on handwriting recognition of Japanese and Chinese characters with Tegaki. This time I want to improve recognition rates of existing models.

Some weeks ago I proposed using component data for bootstrapping a completely new model. Characters are broken down into their components, for which handwriting models already exist, and the character's model is then built from those of its components.

This time I want to improve an existing model, using the same approach. Many characters occur multiple times in the existing models, mostly as components of other characters. These occurrences provide distinct instances of handwriting that can be used to increase the number of training samples. More samples add greater variation, in the hope of improving recognition accuracy.
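
To make the idea concrete, here is a toy sketch of the counting step. The decomposition table below is a made-up stand-in for the data cjklib provides; it only illustrates how component occurrences multiply the available samples:

    # Toy sketch: count how often each character occurs as a component.
    # The decomposition table is made up; cjklib provides the real data.
    from collections import Counter

    decompositions = {
        u'好': [u'女', u'子'],
        u'姉': [u'女', u'市'],
        u'字': [u'宀', u'子'],
    }

    occurrences = Counter(c for parts in decompositions.values() for c in parts)
    print(occurrences[u'子'])   # 2: two extra handwriting samples for 子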

First of all we need to extract the handwriting data "hidden" in components. Let's use the Japanese model here:

$ python hwr/tegaki-tools/src/tegaki-extractcomponents \
    -t hwr/tegaki-models/data/train/japanese/handwriting-ja.xml -m 5 \
    components.xml

At most 5 occurrences of each character will be stored in components.xml. You can easily view them with tegaki-train. Some extracted characters will be wrong: either the handwriting samples were incorrect, or cjklib has false decomposition information. I didn't correct or delete any of them, leaving that for later.
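
To get a quick overview of what was extracted, the collection can be inspected with a few lines of Python. I'm assuming here that each sample stores its character in a <utf8> element, as in Tegaki's character collection XML; check the file itself if in doubt:

    # Count extracted samples per character in components.xml.
    # Assumes Tegaki's collection XML keeps the character in <utf8> elements.
    import xml.etree.ElementTree as ET
    from collections import Counter

    root = ET.parse('components.xml').getroot()
    counts = Counter(e.text.strip() for e in root.iter('utf8') if e.text)
    for char, n in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
        print('%s: %d' % (char, n))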

While we now have 5 instances for basic characters, complex characters with components still need data. We can generate it using the bootstrapping process. We add full character versions to the component set, so that a character is not broken down further than necessary. Since building from components isn't perfect, we minimize unnecessary build steps by using a parent component where one is available (see the sketch after the commands below). We then bootstrap a character collection for the characters of JIS X 0208, with at most 5 instances per character. The option "-x" makes sure only component transformations are used, bypassing exact matches.

$ python hwr/tegaki-tools/src/tegaki-convert -m 5 \
    -t hwr/tegaki-models/data/train/japanese/handwriting-ja.xml \
    -c components.xml full_components.xml
$ python hwr/tegaki-tools/src/tegaki-bootstrap -x -l J -m 5 --domain=JISX0208 \
    -c full_components.xml handwriting_complex.xml
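
The "parent component" rule boils down to: prefer the shallowest decomposition whose parts all have handwriting data, and only recurse into a component when no data exists for it directly. A sketch of that selection logic, with a made-up decomposition table:

    # Prefer the shallowest decomposition whose parts all have data;
    # only break a component down further when necessary.
    decompositions = {u'語': [u'言', u'吾'], u'吾': [u'五', u'口']}  # made up
    have_data = {u'言', u'五', u'口'}

    def resolve(char):
        if char in have_data:
            return [char]                  # data available: stop decomposing
        parts = decompositions.get(char)
        if parts is None:
            return None                    # no data, no decomposition: dead end
        resolved = [resolve(p) for p in parts]
        if any(r is None for r in resolved):
            return None
        return [c for r in resolved for c in r]

    print(resolve(u'語'))   # [言, 五, 口]: only 吾 had to be broken down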

The results show that 81% of the characters can be built from components. The others are either basic characters, cannot be composed due to their structure, or simply lack component data. On average 36 instances per character could be provided (using the cross product), of which we only use 5.
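
The cross product works as follows: if a character splits into two components with 6 extracted instances each, every pairing of one instance per component yields a distinct synthetic writing, so 36 candidates in total. A minimal sketch, with invented sample IDs:

    # Every combination of one sample per component gives one synthetic
    # writing of the whole character; -m 5 keeps at most 5 of them.
    from itertools import product, islice

    left  = ['女-%d' % i for i in range(6)]   # invented sample IDs
    right = ['子-%d' % i for i in range(6)]

    combos = list(product(left, right))
    print(len(combos))                  # 36 candidate writings
    selected = list(islice(combos, 5))  # keep 5, as with -m 5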

We add the handwriting samples built this way to the existing Japanese model, extending the instance count. Finally, we train the model:

$ python hwr/tegaki-tools/src/tegaki-convert -m 5 \
    -c full_components.xml -c handwriting_complex.xml \
    handwriting_enhanced-ja.xml
$ python hwr/tegaki-tools/src/tegaki-build -c handwriting_enhanced-ja.xml \
    zinnia handwriting_enhanced-ja.meta
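
tegaki-build reads the model description from the .meta file, a plain key/value text file. The field names below are quoted from memory, so treat them as an assumption and compare with an existing file such as handwriting-ja.meta:

    name = Japanese Enhanced
    shortname = JapaneseEnhanced
    language = ja

The evaluation below refers to the model by this name.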

We now have a new "enhanced" Japanese model that we want to evaluate. I decided to use the KanjiVG data, which does not share a common source with the Tegaki data. A character collection can be built using Roger Braun's KVG-Tools; integrated support in Tegaki is currently being worked on. To get meaningful results we should limit the testing set to the same character domain, which can be done by (ab)using tegaki-bootstrap:

$ python hwr/tegaki-tools/src/tegaki-bootstrap -l J --domain=JISX0208 \
    -c kanjivg.xml kanjivg_jis.xml
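
If you want to sanity-check the domain restriction without cjklib at hand, a rough approximation of JIS X 0208 membership is a round trip through Python's iso2022_jp codec (which implements JIS X 0208, plus a little more):

    # Rough JIS X 0208 membership test; approximates --domain=JISX0208.
    def in_jisx0208(char):
        try:
            char.encode('iso2022_jp')
            return True
        except UnicodeEncodeError:
            return False

    print(in_jisx0208(u'字'))    # True
    print(in_jisx0208(u'𠀋'))    # False: outside JIS X 0208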

First we run the evaluation on the old model:

$ python hwr/tegaki-tools/src/tegaki-eval zinnia Japanese \
    -c kanjivg_jis.xml
Overall results                                                                 
        Recognizer: zinnia                                                      
        Number of characters evaluated: 6377                                    

        Total time: 118.09 sec
        Average time per character: 0.02 sec
        Recognition speed: 53.82 char/sec

        match1
                Accuracy/Recall: 86.28
                Precision: 80.92
                F1 score: 83.51

        match5
                Accuracy/Recall: 93.84

        match10
                Accuracy/Recall: 94.92

We have an F1 score of 83.51 and a recall of 86.28. These numbers go up when considering the first 5 or first 10 candidates in the result set.
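
For reference, the match-N numbers are top-N recall: the share of test samples whose true character appears among the recognizer's first N candidates. A minimal sketch of that computation, with invented candidate lists:

    # Top-N recall as reported by match1/match5/match10.
    def match_n(results, n):
        hits = sum(1 for truth, cands in results if truth in cands[:n])
        return 100.0 * hits / len(results)

    # (true character, ranked candidates) pairs, invented for illustration
    results = [(u'日', [u'日', u'曰', u'目']),
               (u'木', [u'本', u'木', u'未']),
               (u'人', [u'入', u'八', u'人'])]
    print(match_n(results, 1))   # 33.3...
    print(match_n(results, 3))   # 100.0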

Now the moment of truth:

$ python hwr/tegaki-tools/src/tegaki-eval zinnia JapaneseEnhanced \
    -c kanjivg_jis.xml
Overall results                
        Recognizer: zinnia     
        Number of characters evaluated: 6356

        Total time: 114.82 sec
        Average time per character: 0.02 sec
        Recognition speed: 55.36 char/sec   

        match1
                Accuracy/Recall: 91.96
                Precision: 88.96      
                F1 score: 90.44       

        match5
                Accuracy/Recall: 96.27

        match10
                Accuracy/Recall: 96.93

With delight we see that the F1 score goes up to 90.44 and recall to 91.96. Even match5 and match10 improve, moving towards 100%. It seems that reusing handwriting data found inside the model itself can help increase recognition rates. I think this eases the pressure to get more data by manually drawing characters. The sources can be found in my branch on GitHub.