Skip to content

Formatter Serialize#67

Open
gulugulubing wants to merge 9 commits intoschveiguy:masterfrom
gulugulubing:JSONFormatter
Open

Formatter Serialize#67
gulugulubing wants to merge 9 commits intoschveiguy:masterfrom
gulugulubing:JSONFormatter

Conversation

@gulugulubing
Copy link
Contributor

Reimplemented the formatjson example program with new Formatter and tested some json files locally.

Copy link
Owner

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start.

@schveiguy
Copy link
Owner

Hi @gulugulubing how much time do you have to finish this up? I want to get the serialization stuff in the books for release. If you have time, I can do some more review. If you don't have time, I can take what you have so far and get it to a mergeable state. No worries if you don't have time, I would completely understand.

@gulugulubing
Copy link
Contributor Author

Sorry @schveiguy, I just checked the conservations but I could't recall what still need to do in this PR.

@schveiguy
Copy link
Owner

Yes, and that's my fault. I was going to go through this again, but wanted to first make sure you still are able to work on it. I'll review this again and get more feedback for you. Thanks!

private bool isValidJSONNumber(string str) {
import std.regex;
static auto numberRegex = regex(r"^-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?$");
return !str.matchFirst(numberRegex).empty;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love using regex here. We do have a number validator in the parser, but it's awkward to use. Probably should factor out the validation.

This is OK for now.

@schveiguy
Copy link
Owner

I had a whole thing written up for how I want this to look, and I'm still mulling over it in my head, so I erased it all.

Let me get this all written down, and I will get back to you. I think this is the right path, but maybe tweak some things. The formatter needs either options for everything, or we need specific formatter types (this is what I'm leaning towards).

Will work on a review this weekend.

Copy link
Owner

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to all the comments below, we need a state system.

Many of these functions are only valid if you are in the right state. For example, it only makes sense to accept string data inside a string. It only makes sense to accept an object key if you are in an object, etc.

If we add the comma function, there's another state that needs handling.


void addMember(T)(T key, MemberOptions options) {

if (key.length == 0 ) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this requirement to call addMember for arrays. I had expected the different value starting functions (beginObject, beginArray, beginString, addNumericData, addKeywordValue) to handle the comma.

Thinking about comments, I think actually we probably want to add a specific function to handle adding the comma. Because comments can go anywhere.

@schveiguy
Copy link
Owner

Needs a rebase and resolve the conflict.

@gulugulubing
Copy link
Contributor Author

In addition to all the comments below, we need a state system.

Many of these functions are only valid if you are in the right state. For example, it only makes sense to accept string data inside a string. It only makes sense to accept an object key if you are in an object, etc.

If we add the comma function, there's another state that needs handling.

It makes sense, but I haven't thought previously. Is this something like below in struct JSONFormatter:

enum FormatterState {
        TopLevel,
        InObjectKey,    // expecting a key (addMember)
        InObjectValue,  // expecting a value after colon
        InArray,        // inside an array, expecting a value
        InString,       // inside string data
    }

    FormatterState state = FormatterState.TopLevel;

    // track nested containers: Obj/Array
    enum ContainerKind { Obj, Arr }
    ContainerKind[] containerStack;

@schveiguy
Copy link
Owner

That's close I think. You can determine the current aggregate type (object or array) by the containerStack.

You can actually probably just reuse the state enum from the parser, though there is no "colon" state as that's already handled. At least you can reuse the state names that do apply. Note the BitArray stack as well, please copy that design.

Oh, one more thing, the formatter needs to go in its own file. This is not part of the serializer, just like the parser isn't. I think an iopipe.json.formatter module.

@gulugulubing
Copy link
Contributor Author

You can actually probably just reuse the state enum from the parser, though there is no "colon" state as that's already handled. At least you can reuse the state names that do apply. Note the BitArray stack as well, please copy that design.

In the latest commit, I find comma state is not needed neither.

Copy link
Owner

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopped looking, the state machine seems wrong. Each thing that adds data to the stream should change the state.

Maybe diagram it out? I'm not sure how it's supposed to work.

// Parse array elements
size_t elementCount = 0;
while(true) {
while(tokenizer.peekSkipComma() != JSONToken.ArrayEnd) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure why this is here. Maybe you need a rebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since here was a conflict between this branch and master, I guess I need to accept the master version before rebase or merge.

// subsequent member: comma + indent
putStr(",");
putIndent();
// stay in Member
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems incorrect:

startObjectMember();
startObjectMember();

will output 2 consecutive commas.

@gulugulubing
Copy link
Contributor Author

gulugulubing commented Nov 30, 2025

Stopped looking, the state machine seems wrong. Each thing that adds data to the stream should change the state.

Maybe diagram it out? I'm not sure how it's supposed to work.

Yeah, the process is a little bit subtle. Some functions of adding data seems no need to change state.

Let's check a simple example:

{
  "name": "John",
  "age": 30,
  "hobbies": ["reading", "swimming"]
}
  1. Begin → beginObject(), output"{" → First
  2. First → addMember() → Value
  • in addMember(): startObjectMember(), don't add comma since now is First ->Member->beginString(), addStringData(), endStringRaw(), output "name:"
  1. Value → beginString(), addStringData(), endString(), output "John"→ Member
  2. Member → addMember() -> Value
  • in addMember(): → startObjectMember(), add comma since now is Member→Member->beginString(), addStringData(), endStringRaw(), output "age:"
  1. Value → addNumericData(), output 30 → Member
  2. Member→ addMember() -> Value
  • in addMember(): → startObjectMember(), add comma since now is Member->Member->beginString(), addStringData(), endStringRaw(), output "hobbies:"
  1. Value → beginArray(), output"["→ First
  2. First → beginArrayValue(), don't add comma since now is First → Member
  3. Member → beginString(), addStringData(), endString(), output "reading" → Member
  4. Member->beginArrayValue(), add comma since now is Member→ Member
  5. Member → beginString(), addStringData(), endString(), output "swimming" → Member
  6. Member → endArray(), output "]"→ stack is not empty, so Member
  7. Member → endObject() , output "}"→ stack is empty, so End

Copy link
Owner

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are a few states which allow adding a "value":

  • Begin
  • First for arrays
  • Member for arrays
  • Value for objects

And a "value" can either be the start of a string, a number, a keyword, an object start, an array start.

I suggest to write a helper method called canAddValue which returns true if the next thing can be a value. Then use this in all your code where you are validating state. This will help make the code cleaner also.

In fact, helper methods to validate the state in all cases, named appropriately, would make things a lot clearer. Methods like:

  • canAddValue -> adding a value is allowed
  • canAddStringData -> adding string data (or ending string) is allowed
  • canAddComment -> adding a comment is allowed
  • canCloseAggregate -> closing the aggregate is allowed

These can be private methods. They will just contain state validation and return a boolean value. The code that uses it will read a lot cleaner. It also allows the same validation code to be used for both throwing an exception and asserting.

// // add a comment (JSON5 only), must be a complete comment (validated)
void addComment(T)(T commentData) {
static if (is(T == string)) {
if (!commentData.startsWith("//") && !commentData.startsWith("/*") && !commentData.endsWith("*/")) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// comments need to end with a newline

This also requires // comments to end with */

The logic is hard to parse as well. I suggest:

bool isValidComment = 
    (commentData.startsWith("//") && commentData.endsWith("\n")) ||
    (commentData.startsWith("/*") && commentData.endsWith("*/"));
if(!isValidComment)
    throw new ...

@schveiguy
Copy link
Owner

One thing here is that there are no unittests.

The serialize code has formatting unittests, but we haven't hooked up serialization yet. So maybe a few simple ones, and maybe a few that validate the state handling is correct (i.e. you get an exception if you try doing things in the wrong order).

Copy link
Owner

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I've been super busy. More changes. I haven't reviewed the unittests yet.

static if (validate) {
// Validate that the string doesn't contain invalid characters
foreach(char c; value) {
if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't exactly correct, it needs to check for quote characters and lone backslashes.

In addition, I would expect both add escapes and validate as true should work with characters like newline, which should replace with a literal \n

I know I made this API in the discussion, I think it's wrong now. You can validate, add escapes, or do neither. You can't do both (adding escapes would ensure a valid string).

Maybe change to an enum?

passThru -> don't do anything, assume the string data is correct
addEscapes -> any invalid character is escaped into a valid sequence (this should be the default)
validate -> verify escapes are correctly used in the data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants