Thursday, January 3, 2008

Tsume Shogi Parser (or "How to store 28k of data into 802k file and be happy about it")

What is Tsume Shogi Parser?

I am a huge shogi patzer. I am also a poor Java programmer. Well, all in all, I am a miserable creature. But even though I am so old that I remember the era of black&white TV, I think I can improve. (Not in being miserable, I mean. The other way.)

I wanted to strengthen my shogi skills by solving mating problems. You would be surprised how hard it is to find some good materials on the net. If you have some experience with Go or Western Chess you think that the Internet is the ultimate source for this kind of stuff. You couldn't be more mistaken -- at least when you don't read Japanese.

Fortunately, there are good folks out there. Shogi-l email discussion list is a place where people like these meet. Reijer Grimbergen, i.e., posted series of 300 tsume shogi problems. Roger Hare was so kind to collect some of them into a single file:

That's how I got my learning materials. The only problem left was it's form:
...
4. Black: R4d, +B4c, +P2a In hand: G
White: K2c, G3c, S4b, L1b, P4a

5. Black: R2b, G3e, S3a, L2e In hand: None
White: K3c, G4c, S1d, P4e
...

A beginner like me could use some more "visually attractive" format. By the way, it's a human nature I think: people always want more. First I complained that there are no materials at all, and then (after I found them) I wanted them to be full of diagrams and nicely formated.
Anyway, that's how I decided to write a converter for the problems. I called it Tsume Shogi Parser.

The process

Tsume Shogi Parser is intended to be a building block to use in other applications (these applications are meant to provide user interface for user convenience). In the near future I plan to provide a web app that would allow the user to upload a text file and get the resulting pdf. The parser could also be used in shogi game database to produce handouts on the fly.

Ok, let's face it. The parser will be not for much use. After all, there is only one source of data stored in this format. I treat this thing as an exercise before more important challenge: writing a parser for "This week in Shukan Shogi" posted by Reijer Grimbergen on shogi-l. The second benefit from it is trying out the part that generates pdf. That's IS going to be useful in other projects.

Let's go back to the parser. For know I’ll describe the inner mechanics of the Tsume Shogi Parser.

Technologies

For this piece of software I chose to use:
  • Java -- of course ;-),
  • JavaCC -- for defining the input file grammar and automatically generate a parser,
  • XML -- for storing the data into format that can be converted into virtually anything else,
  • JiBX -- for translating Java objects to XML (marshaling)
  • XLS-FO -- for storing information about expected layout of the handout,
  • XSLT 2.0 -- for defining a way of injecting tsume data to layout template,
  • Saxon -- the only XSLT processor I know that supports XSLT 2.0,
  • Apache FOP -- for generating PDF.
So, the process can be diagrammed as follows:

Shogi-l file->(.jj + javacc + jibix)->xml file->(xslt 2.0, saxon)->(fo, fo)->pdf

and in the future

-> spd, psn
-> database

Reading tsume

I have put all the tsume problems I found on shogi-l (most of them posted by Reijer Grimbergen) to a single file. If you're interested in it, you can download it from Tsume Shogi Project download pages. I also collected in a single file the answers for the tsume problems.

Then I wrote a JavaCC grammar for the file -- I made the computer understand what which field in the file meant. You can see the grammar on the ShogiTools project's wiki.

It was my first contact with parsers and it cost me a lot of sweat. I am still no expert here but I manage to write a simple grammar and generate a parser from it. JavaCC is of great help here. It automates most of the work so you can focus on the merit, although I had lots of problems with it. Partly because I was a beginner in the matter of parsers, partly because of lousy project documentation and community support. I've read that similar program, ANTLR, is much better in these matters. Next time I should try it so I could make an honest comparison. (If you can share your experience in the subject, please, drop me a note.

The thing is, that with the help of JavaCC and JavaCC Eclipse Plugin I generated a program that read tsume problem file and stored it in memory in Java objects. (The class diagrams of the value object used for it should be on ShogiTools pages.)
It turned out that the input data was a bit inconsistent in form. The parser showed few places and I had to do some corrections by hand. But this was not a big problem.

Producing XML

Now I wanted the data stored into an XML file.

Why XML? It is license-free, platform-independent and well-supported. Also because of the promise of easy transformations to presentation formats (without coding).

Anyway that's what the theory says. I wanted to see this in practice.

The process of transforming the memory representation of an object to a data format suitable for storage or transmission is called Marshaling. So I typed "fastet marshaller" into google search engine and got JiBX on pole position. And I used JiBX.

The interesting thing about JiBX is it's byte-code injection mechanism. In short: you define the mapping between classes and the XML. Then the mapping information is injected into your existing classes' bytecode. This means you can structure the Java classes the way you want -- they doesn't need to match the structure of the XML. It also turns out that this is a fast way do the conversion.

This part was quite quick. The program read 11k of tsume definitions and spat out 802 kB XML file. The description of XML structure can be found on ShogiTools pages.

The output

So, what is so good about having such a huge file instead of small one? And why it is so big?

Let's compare the entry for a single position in input text file and output xml file.
The text entry goes like this:
1. Black:+B3d, +B1c In hand: G
White: K2a, +R3c, L1a

Equivalent XML fragment is:
<Position>
<number>1</number>
<BlackDescription>
<PiecesPositions>
<PiecePosition>
<Piece>
<isPromoted>true</isPromoted>
<rank>B</rank>
</Piece>
<col>3</col>
<row>d</row>
</PiecePosition>
<PiecePosition>
<Piece>
<isPromoted>true</isPromoted>
<rank>B</rank>
</Piece>
<col>1</col>
<row>c</row>
</PiecePosition>
</PiecesPositions>
<Hand>
<Piece>
<isPromoted>false</isPromoted>
<rank>G</rank>
</Piece>
</Hand>
</BlackDescription>
<WhiteDescription>
<PiecesPositions>
<PiecePosition>
<Piece>
<isPromoted>false</isPromoted>
<rank>K</rank>
</Piece>
<col>2</col>
<row>a</row>
</PiecePosition>
<PiecePosition>
<Piece>
<isPromoted>true</isPromoted>
<rank>R</rank>
</Piece>
<col>3</col>
<row>c</row>
</PiecePosition>
<PiecePosition>
<Piece>
<isPromoted>false</isPromoted>
<rank>L</rank>
</Piece>
<col>1</col>
<row>a</row>
</PiecePosition>
</PiecesPositions>
</WhiteDescription>
</Position>

The second fragment is huge. It is well structured, data are organized in logical chunks. You may ask here: So what!?

And I agree. It's is no good to have kosher data structures for the sake of kosherness only. The real issue is what you can do with the data. There are the benefits of data conversion. Let me show you how our XML can be transformed.

My first try was to write XSL transformation to get an ASCII diagrams of the positions. For the given position (XML fragment above) I got this:

1.

White in hand:
9   8   7   6   5   4   3   2   1
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   | wK| wL|a
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |b
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |+wR|   |+bB|c
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |+bB|   |   |d
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |e
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |f
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |g
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |h
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |i
+---+---+---+---+---+---+---+---+---+
Black in hand:G

The whole file can be downloaded from here.
Surly, we can do better than this! Well, we're almost there :-) This ASCII diagram thing was just a warm up before writing XSLT to produce PDF file. First I had to do xsl-fo my homework. For the PDF renderer I chose apache FOP which is quite well documented. So, I was able to learn some xsl-fo from apache documentation. I also used w3schools tutorials (http://www.w3schools.com/xslfo/default.asp).
I wanted my tsume shogi handout to have 6 problems per page and some tsume definition as an introduction. It was quite a bit of tedious work but I think it was worth it.
Judge for yourself. Here is the output pdf file.


Future development


I have my handouts for doing some shogi exercises off-line. My goal is reached so I stop my experiments for now. Instead I will list possible improvements below.

The pdf I managed to produce has English description of pieces on the board (i.e. wG for white gold, bK for black king). I could use Kanji characters instead. But first I need to know the unicode for each piece character and own some unicode font.

The whole thing is not a big deal. But during the process of converting my tsume text files into pdf I managed to learn few things. I think I also broke fresh ground for future conversion component that could be used in shogi programs. Imagine, you have your shogi database and you are able to add ability to render you games in the format you need just by supplying suitable transformation (XSLT) file.



See You next time...

4 comments:

  1. Small update note.
    1. Reijer Grimbergen moved his page to http://www.teu.ac.jp/gamelab/SHOGI/shogipage.htm so I updated the proper link in the post.
    2. Shogi List is now hosted on Google groups pages (http://groups.google.com/group/shogi-l) so I corrected the link in the post.
    3. Some time after the post I managed to produce Tsume Shogi handout with Kanji characters. Looks much better now, I think. You can find the new file here.

    ReplyDelete
  2. Awesome! thank you!

    I'm enjoying the problems.

    That said... is there a possibility for the problems to have different solutions than those listed? I've only solved the first 4, but found different solutions for 2 of them. I'm new to tsume problems, so I might be missunderstanding a rule or something.

    ReplyDelete
  3. I remember (when collecting the problems) some answers were missing, some were wrong. I also copied them by hand so there is a possibility of mistake, too.

    I planned to do automatic "testing" of the solutions with Spear (http://www.teu.ac.jp/gamelab/SHOGI/SPEAR/spearmain.html) which has a good tsume solver. But it didn't work out. I had both problem with Japanese ;-) and the program being Windows application.

    Now I've read that Spear is an engine that can be "plugged" into other programs (like BCMShogi - http://home.arcor.de/Bernhard.Maerz/BCMShogi/).

    If you decide to use some program to check the solution, please give me a note. I could prepare the positions in format acceptable by the software you choose.

    Best wishes,
    fbc

    ReplyDelete