YAML syntax processing in C with libyaml and lemon parser generator (part 2)
In previous part of the blog we have created the basis of the program. In this part we will work on lemon parser grammar file "parser.y".
We can simply run "make" command in our working directory and it will generate the parser .c and .h files.
Libyaml wiki has a grammar explained for events: http://pyyaml.org/wiki/LibYAML#Events
This is basically the same for tokens.
Let's start with a very minimal (minimum minimorum) YAML syntax (file minimum.yaml):
If we run now our program, we will have this in output:
Here, the "key" and the "value" tokens are followed by "scalar". They are wrapped into the "block" and the "stream" start/end tokens.
Here is bootstrap grammar file to parse this minimum YAML (parser.y):
Lemon wiki has a brief but good explanation of grammar rules. I will not go into the details how to build lemon grammar in general, but will focus here on YAML tokens parsing.
There is "parser_failure" and "syntax_error" directives to print error message for debugging purposes. There is for now only single rule: stream, with two tokens. We expect when parsing is ok, program will print "Parsing is OK." string.
When we run "make", it will produce "parser.c" and "parser.h" parser files from "parser.y".
Now we add lemon functions to "main.c" file. First step is to declare lemon function on the top of the file, right after "include":
This is standard way to reference functionality from "parser.c".
Then we declare two new variables in "main" function:
Variable "val" will contain scalars (char *) that we will pass to "Parser" function.
In main loop function (while), we add following code, right before "/* scan finished */" comment:
Note, how the structure "token" is used. Whenever token "YAML_SCALAR_TOKEN" is received, its text value can be retrieved as following: token.data.scalar.value
Then, scalar value and token type are passed to "Parse" function.
When document is finally parsed, we need to call "Parse (scanner, 0, 0);" after main loop to let the parser know that parsing is done. To free memory, call "ParseFree (scanner, free);".
But, when we run "make" and run the program, we have error parsing:
So, what's wrong? The answer is in the file "parser.h" generated by lemon. Lemon defines its own token IDs and they do not match those in "yaml.h" file. For example, YAML_VALUE_TOKEN from "yaml.h" is 17, while lemon defines it as 5:
Well, it is easy to fix. For that we need a new function that will convert libyaml tokens to the tokens from "parser.h". First of all, in "parser.y", we need to change tokens to remove "YAML_" prefix and "_TOKEN" suffix. For example: YAML_STREAM_START_TOKEN will become STREAM_START. Then run make command again.
Include file "parser.h" in "main.c" and add new static function just above "main" function:
Then, parse function in main loop should use the one we've just created:
Parse(scanner, token_type(token.type), val);
And finally, when we make and run the program, everything looks good:
Well, it looks good only for that very minimum example. When more key/value pairs are added this parser will fail:
Let us improve grammar to parse any number of key/value pairs. Each member, key and value, will produce two tokens: YAML_KEY_TOKEN or YAML_VALUE_TOKEN and YAML_SCALAR_TOKEN which brings the actual value for key or token. And the number of these pairs is arbitrary. Using lemon grammar let's split our single rule with terminal symbols to several non-terminal rules. We start from key and value:
key ::= KEY SCALAR .
value ::= VALUE SCALAR .
The "key" and "value" non-terminals can be joined to another non-terminal:
field ::= key value .
The "field" non-terminal represents YAML key/value pair. And there can be any number of the fields which are nested in the BLOCK_MAPPING. It can be implemented as following:
block ::= block field .
block ::= .
With all these changes, grammar rules now looks like this:
The grammar rules do only parsing now. It also should do something useful. At lease, print the found key/value pairs. We have already set token_type to char* and also pass the scalar value to parser. Let's print it from grammar rules:
This will print key:value pairs. It is not quite useful, but helps to demonstrate connections between libyaml and lemon.
There is also forth argument in Parse() function that should be used to pass a pointer to some structure to parser. Here what lemon wiki says:
If we run our program now with this YAML file, we will see parsing error messages:
Let's update grammar rules to fix it. First of all, note, that nested list is just another BLOCK_MAPPING block. Basically, it is another key/value pair, but value is another block.
Let's simply add new value like this:
value ::= VALUE BLOCK_MAPPING_START block BLOCK_END .
Now, when we re-build and re-run the program with nested file as argument everything will be ok:
Here is the current parsing grammar:
The current implementation covers only around 40% of YAML language syntax. I am not going to do everything as it will take much more time. For more complete implementation it would take to create grammar for YAML sequences, documents, flows, tags and more.
I have played a bit with implementing other feature and published it on github:
https://github.com/staskobzar/libyaml_lemon_blog
That's it! Thanks for reading.
Lemon is an application that can be installed from packages (comes almost with all Linux distributions), with Homebrew on Mac OS. Or you can build it from the sources. It is to find in Internet how to do that.
When lemon is installed, we can run command "lemon" with a path to the grammar file as an argument.
Using Makefile will much improve our build process:
We can simply run "make" command in our working directory and it will generate the parser .c and .h files.
Libyaml wiki has a grammar explained for events: http://pyyaml.org/wiki/LibYAML#Events
This is basically the same for tokens.
Let's start with a very minimal (minimum minimorum) YAML syntax (file minimum.yaml):
key: value
If we run now our program, we will have this in output:
Here, the "key" and the "value" tokens are followed by "scalar". They are wrapped into the "block" and the "stream" start/end tokens.
Here is bootstrap grammar file to parse this minimum YAML (parser.y):
1 %include {
2 #include <assert.h>
2 #include <stdio.h>
3 }
4
5 %parse_failure {
6 printf("ERROR: Failed to parse.\n");
7 }
8 %syntax_error {
9 printf("ERROR: Syntax error.\n");
10 }
11
12 %token_type { const char * }
13
14 /* grammar rules */
15 stream ::= YAML_STREAM_START_TOKEN
16 YAML_BLOCK_MAPPING_START_TOKEN
17 YAML_KEY_TOKEN
18 YAML_SCALAR_TOKEN
19 YAML_VALUE_TOKEN
20 YAML_SCALAR_TOKEN
21 YAML_BLOCK_END_TOKEN
22 YAML_STREAM_END_TOKEN . {
23 printf("Parsing is OK.\n");
24 }
25
2 #include <assert.h>
2 #include <stdio.h>
3 }
4
5 %parse_failure {
6 printf("ERROR: Failed to parse.\n");
7 }
8 %syntax_error {
9 printf("ERROR: Syntax error.\n");
10 }
11
12 %token_type { const char * }
13
14 /* grammar rules */
15 stream ::= YAML_STREAM_START_TOKEN
16 YAML_BLOCK_MAPPING_START_TOKEN
17 YAML_KEY_TOKEN
18 YAML_SCALAR_TOKEN
19 YAML_VALUE_TOKEN
20 YAML_SCALAR_TOKEN
21 YAML_BLOCK_END_TOKEN
22 YAML_STREAM_END_TOKEN . {
23 printf("Parsing is OK.\n");
24 }
25
Lemon wiki has a brief but good explanation of grammar rules. I will not go into the details how to build lemon grammar in general, but will focus here on YAML tokens parsing.
There is "parser_failure" and "syntax_error" directives to print error message for debugging purposes. There is for now only single rule: stream, with two tokens. We expect when parsing is ok, program will print "Parsing is OK." string.
When we run "make", it will produce "parser.c" and "parser.h" parser files from "parser.y".
Now we add lemon functions to "main.c" file. First step is to declare lemon function on the top of the file, right after "include":
5 void* ParseAlloc(void* (*allocProc)(size_t));
6 void* Parse(void*, int, const char*);
7 void* ParseFree(void*, void(*freeProc)(void*));
6 void* Parse(void*, int, const char*);
7 void* ParseFree(void*, void(*freeProc)(void*));
This is standard way to reference functionality from "parser.c".
Then we declare two new variables in "main" function:
68 void *scanner = ParseAlloc (malloc);
69 char *val;
69 char *val;
Variable "val" will contain scalars (char *) that we will pass to "Parser" function.
In main loop function (while), we add following code, right before "/* scan finished */" comment:
109 if ( token.type == YAML_SCALAR_TOKEN ) {
110 val = (char *)strndup((char *)token.data.scalar.value, token.data.scalar.length);
111 } else if ( token.type == YAML_TAG_TOKEN ) {
112 val = (char *)strdup((char *)token.data.tag.suffix);
113 } else { val = NULL; }
114 Parse(scanner, token.type, val);
110 val = (char *)strndup((char *)token.data.scalar.value, token.data.scalar.length);
111 } else if ( token.type == YAML_TAG_TOKEN ) {
112 val = (char *)strdup((char *)token.data.tag.suffix);
113 } else { val = NULL; }
114 Parse(scanner, token.type, val);
Note, how the structure "token" is used. Whenever token "YAML_SCALAR_TOKEN" is received, its text value can be retrieved as following: token.data.scalar.value
Then, scalar value and token type are passed to "Parse" function.
When document is finally parsed, we need to call "Parse (scanner, 0, 0);" after main loop to let the parser know that parsing is done. To free memory, call "ParseFree (scanner, free);".
But, when we run "make" and run the program, we have error parsing:
So, what's wrong? The answer is in the file "parser.h" generated by lemon. Lemon defines its own token IDs and they do not match those in "yaml.h" file. For example, YAML_VALUE_TOKEN from "yaml.h" is 17, while lemon defines it as 5:
Well, it is easy to fix. For that we need a new function that will convert libyaml tokens to the tokens from "parser.h". First of all, in "parser.y", we need to change tokens to remove "YAML_" prefix and "_TOKEN" suffix. For example: YAML_STREAM_START_TOKEN will become STREAM_START. Then run make command again.
Include file "parser.h" in "main.c" and add new static function just above "main" function:
Then, parse function in main loop should use the one we've just created:
Parse(scanner, token_type(token.type), val);
And finally, when we make and run the program, everything looks good:
Well, it looks good only for that very minimum example. When more key/value pairs are added this parser will fail:
Let us improve grammar to parse any number of key/value pairs. Each member, key and value, will produce two tokens: YAML_KEY_TOKEN or YAML_VALUE_TOKEN and YAML_SCALAR_TOKEN which brings the actual value for key or token. And the number of these pairs is arbitrary. Using lemon grammar let's split our single rule with terminal symbols to several non-terminal rules. We start from key and value:
key ::= KEY SCALAR .
value ::= VALUE SCALAR .
The "key" and "value" non-terminals can be joined to another non-terminal:
field ::= key value .
The "field" non-terminal represents YAML key/value pair. And there can be any number of the fields which are nested in the BLOCK_MAPPING. It can be implemented as following:
block ::= block field .
block ::= .
With all these changes, grammar rules now looks like this:
15 /* grammar rules */
16 stream ::= STREAM_START
17 BLOCK_MAPPING_START
18 block
19 BLOCK_END
20 STREAM_END . {
21 printf("Parsing is OK.\n");
22 }
23
24 block ::= block field .
25 block ::= .
26
27 field ::= key value .
28
29 key ::= KEY SCALAR .
30 value ::= VALUE SCALAR .
16 stream ::= STREAM_START
17 BLOCK_MAPPING_START
18 block
19 BLOCK_END
20 STREAM_END . {
21 printf("Parsing is OK.\n");
22 }
23
24 block ::= block field .
25 block ::= .
26
27 field ::= key value .
28
29 key ::= KEY SCALAR .
30 value ::= VALUE SCALAR .
The grammar rules do only parsing now. It also should do something useful. At lease, print the found key/value pairs. We have already set token_type to char* and also pass the scalar value to parser. Let's print it from grammar rules:
16 stream ::= STREAM_START
17 BLOCK_MAPPING_START
18 block
19 BLOCK_END
20 STREAM_END . {
21 printf("Parsing is OK.\n");
22 }
23
24 block ::= block field .
25 block ::= .
26
27 field ::= key(K) value(V) . {
28 printf(" %s: %s\n", K, V);
29 }
30
31 key(V) ::= KEY SCALAR(S) . { V = S; }
32 value(V) ::= VALUE SCALAR(S) . { V = S; }
17 BLOCK_MAPPING_START
18 block
19 BLOCK_END
20 STREAM_END . {
21 printf("Parsing is OK.\n");
22 }
23
24 block ::= block field .
25 block ::= .
26
27 field ::= key(K) value(V) . {
28 printf(" %s: %s\n", K, V);
29 }
30
31 key(V) ::= KEY SCALAR(S) . { V = S; }
32 value(V) ::= VALUE SCALAR(S) . { V = S; }
This will print key:value pairs. It is not quite useful, but helps to demonstrate connections between libyaml and lemon.
There is also forth argument in Parse() function that should be used to pass a pointer to some structure to parser. Here what lemon wiki says:
The Parse() function may have either three or four arguments, depending on the grammar. If the grammar specification file request it, the Parse() function will have a fourth parameter that can be of any type chosen by the programmer. The parser doesn't do anything with this argument except to pass it through to action routines. This is a convenient mechanism for passing state information down to the action routines without having to use global variables.Our grammar rule works only for one level of key/value pairs. But, YAML supports nested list with arbitrary depth. Here is valid YAML nested example:
If we run our program now with this YAML file, we will see parsing error messages:
Let's update grammar rules to fix it. First of all, note, that nested list is just another BLOCK_MAPPING block. Basically, it is another key/value pair, but value is another block.
Let's simply add new value like this:
value ::= VALUE BLOCK_MAPPING_START block BLOCK_END .
Now, when we re-build and re-run the program with nested file as argument everything will be ok:
Here is the current parsing grammar:
16 stream ::= STREAM_START
17 BLOCK_MAPPING_START
18 block
19 BLOCK_END
20 STREAM_END .
21
22 block ::= block field .
23 block ::= .
24
25 field ::= key(K) value(V) . {
26 printf(" %s: %s\n", K, V);
27 }
28
29 key(V) ::= KEY SCALAR(S) . { V = S; }
30
31 value(V) ::= VALUE SCALAR(S) . { V = S; }
32 value ::= VALUE BLOCK_MAPPING_START block BLOCK_END .
17 BLOCK_MAPPING_START
18 block
19 BLOCK_END
20 STREAM_END .
21
22 block ::= block field .
23 block ::= .
24
25 field ::= key(K) value(V) . {
26 printf(" %s: %s\n", K, V);
27 }
28
29 key(V) ::= KEY SCALAR(S) . { V = S; }
30
31 value(V) ::= VALUE SCALAR(S) . { V = S; }
32 value ::= VALUE BLOCK_MAPPING_START block BLOCK_END .
The current implementation covers only around 40% of YAML language syntax. I am not going to do everything as it will take much more time. For more complete implementation it would take to create grammar for YAML sequences, documents, flows, tags and more.
I have played a bit with implementing other feature and published it on github:
https://github.com/staskobzar/libyaml_lemon_blog
That's it! Thanks for reading.
Comments
Post a Comment