James John – Software Engineer

How I Append Data Into JSON in Constant Time – PHP

php-jsondb is a lightweight library that provides an object and methods to perform CRUD operations on a JSON file, using an interface similar to that of an RDBMS. For a small-scale project, you can use it to read and store data in a JSON file, which also keeps the data easy to consume from other languages like JavaScript.
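To give an idea of the interface, here is a rough usage sketch. It is illustrative only: the namespace and exact method signatures may differ from the current release of php-jsondb.

```php
<?php
// Illustrative usage only – the namespace and method signatures here
// may not match the library's exact API.
require 'vendor/autoload.php';

use Jajo\JSONDB;

$db = new JSONDB(__DIR__);          // directory that holds the JSON "tables"

// Insert a row into users.json
$db->insert('users.json', [
    'name' => 'James',
    'role' => 'Software Engineer',
]);

// Query it back, RDBMS style
$users = $db->select('*')
            ->from('users.json')
            ->get();

print_r($users);
```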

I built this library when I did not have much software engineering experience. On 15th August, lizhaoyao raised an issue showing how catastrophic this library can be as the data grows. Following that GitHub issue, I was able to replicate the problem. Below, I will show what it looks like to insert just 5,000 records of similar data.

The Test/Problem

You can also replicate this test by cloning the repository and checking out the commit. A quick rundown on how to replicate it:
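Roughly, the benchmark boils down to something like this (a sketch, not the repository's actual test script):

```php
<?php
// Sketch of the benchmark: insert 5,000 records in batches of 1,000,
// timing each batch. The repository's real test script may differ.
require 'vendor/autoload.php';

use Jajo\JSONDB;

$db = new JSONDB(__DIR__);

for ($batch = 1; $batch <= 5; $batch++) {
    $start = microtime(true);

    for ($i = 0; $i < 1000; $i++) {
        $db->insert('users.json', [
            'name' => 'user_' . $i,
            'age'  => rand(18, 60),
        ]);
    }

    printf("Batch %d: %.2f seconds\n", $batch, microtime(true) - $start);
}
```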

On my system, inserting 5,000 records in batches of 1,000 gave the result below:

The result shows that the insertion time increases with every additional thousand records. With 50,000 records, inserting new ones would take about a minute, which is very bad. It makes me wonder how people are still using the library 😀

Why does this happen?

Parsing JSON into a PHP data type is a linear process: the parser walks through every character, making sure it follows the standard. Other optimizations may be applied along the way, but the time spent reading and converting still grows as the data grows. php-jsondb reads all the data into a buffer before performing each insert. Crazy, right? But why did I do that?
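You can get a feel for that linear cost with a quick, illustrative measurement using nothing but json_encode() and json_decode():

```php
<?php
// Illustrative: json_decode() time grows with the size of the input,
// so decoding the whole file before every insert gets slower as the data grows.
foreach ([1000, 10000, 50000] as $count) {
    $rows = array_fill(0, $count, ['name' => 'James', 'age' => 30]);
    $json = json_encode($rows);

    $start = microtime(true);
    json_decode($json, true);
    printf("%6d rows: %.4f seconds to decode\n", $count, microtime(true) - $start);
}
```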

I was trying to maintain uniform columns across each row of JSON data, keeping to the standard of an RDBMS by judging from the first row inserted, and the insert() method is where that happens.
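In simplified form, the old insert path boiled down to the pattern below. This is a sketch of the idea, not the library's code verbatim.

```php
<?php
// Simplified sketch (not the actual library code) of the old insert path:
// every insert decodes the whole file, appends one row, then re-encodes
// and rewrites the whole file.
function old_insert(string $file, array $row): void
{
    // O(n): read and decode every existing row
    $rows = file_exists($file)
        ? (json_decode(file_get_contents($file), true) ?: [])
        : [];

    // ... column-uniformity checks against the first row happen here ...

    $rows[] = $row;

    // O(n): encode and rewrite every row, old and new
    file_put_contents($file, json_encode($rows));
}
```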

Following the code, let us look at the from() method, which loads the entire file content into a buffer and decodes it. That method in turn calls check_file(), and only after check_file() returns true does from() decode the file.
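The shape of that pair was roughly this, again a simplified sketch rather than the exact code:

```php
<?php
// Rough shape of the old from()/check_file() pair (not the exact code):
// check_file() decodes the whole file just to validate its structure,
// and from() then decodes the very same file a second time.
function check_file(string $file): bool
{
    $data = json_decode(file_get_contents($file), true);    // full decode #1
    if (!is_array($data)) {
        return false;
    }
    foreach ($data as $row) {
        if (!is_array($row)) {                               // must be an array of objects
            return false;
        }
    }
    return true;
}

function from(string $file): array
{
    if (!check_file($file)) {
        throw new Exception('Invalid JSON table file');
    }
    return json_decode(file_get_contents($file), true);     // full decode #2
}
```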

After seeing that eyesore, I wept for my past self. First, the overall design is not a nice one; then, the implementation was so costly. I decoded the whole JSON data, which could be a 50,000-record file, just to check that it is an array of JSON objects. Then, when the check returned true, I decoded the whole thing again in the from() method! Damn! What a design!

One last one, let me show you the chain of array operations I used just to keep the columns uniform (a sketch of it follows the list):

  1. I used array_diff_key() – O(n^2)
  2. I then used array_map() – O(n)
  3. array_merge() – O(n)
  4. Then another array_diff_key()!
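Put together, the chain looked roughly like this; it is a reconstruction for illustration, not the exact code:

```php
<?php
// Rough reconstruction of the column-uniformity chain (not the exact code).
// $first is the first existing row, $new is the row being inserted.
$first = ['name' => 'James', 'age' => 30, 'role' => 'Engineer'];
$new   = ['name' => 'John', 'city' => 'Lagos'];

// 1. Columns present in the first row but missing from the new row
$missing = array_diff_key($first, $new);

// 2. Blank out their values so the new row ends up with the same columns
$missing = array_map(fn ($value) => null, $missing);

// 3. Merge the blanks into the new row
$new = array_merge($new, $missing);

// 4. Another diff to catch columns the new row has that the table does not
$extra = array_diff_key($new, $first);
if (!empty($extra)) {
    echo 'Unknown column(s): ' . implode(', ', array_keys($extra)) . PHP_EOL;
}

print_r($new);
```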

Yes, that was me 🙁 I was just trying to maintain uniform columns across each row. But I did not know of any library or PHP native function that lets you read the first row alone without decoding all the JSON data into a buffer and calling current() on it, so I came up with a solution. As I type this post, I am still preparing a standalone library that does this exact thing, which I will publish. I forked it from this solution.

The Solution

Well, it was an eyesore for me to see such test results. I took it upon myself to make sure insertion happens in constant time, no matter the size of the data.

The issue author offered help on this by appending data directly to the JSON file. But, from my experience, it was insufficient on its own. There were still checks to be made before appending to a JSON file (e.g. what if it is not an array of objects? What if the file already contains invalid JSON?), and I still needed to maintain uniform columns across every row. Still, that approach gave me an idea of how to do this, and I dived deep into IO operations, my favourite part of programming.

I improved on the user’s contribution by first creating a function that retrieves the array’s objects chunk by chunk, without reading all the JSON data into a buffer. With this, I can get just the first object of the array and use it to validate the columns of the new row. This helper function can be found in helpers/json.php.
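The idea is to stream the file a small chunk at a time and stop as soon as the first complete object has been read, instead of decoding everything. The sketch below shows the approach; the actual helper in helpers/json.php may differ in its details.

```php
<?php
// Sketch: read only the first object of a JSON array file without loading
// the whole file into memory (not the exact helpers/json.php code).
function first_object(string $file): ?array
{
    $fp = fopen($file, 'rb');
    if ($fp === false) {
        return null;
    }

    $buffer   = '';
    $depth    = 0;       // current {} nesting depth
    $inString = false;   // are we inside a JSON string?
    $escaped  = false;   // was the previous character a backslash?
    $started  = false;   // have we seen the first '{' yet?

    while (($chunk = fread($fp, 1024)) !== false && $chunk !== '') {
        foreach (str_split($chunk) as $char) {
            if ($started) {
                $buffer .= $char;
            }

            if ($inString) {
                if ($escaped) {
                    $escaped = false;
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                }
                continue;
            }

            if ($char === '"') {
                $inString = true;
            } elseif ($char === '{') {
                if (!$started) {
                    $started = true;
                    $buffer  = '{';
                }
                $depth++;
            } elseif ($char === '}') {
                $depth--;
                if ($started && $depth === 0) {
                    fclose($fp);
                    return json_decode($buffer, true);   // only the first object is decoded
                }
            }
        }
    }

    fclose($fp);
    return null;
}
```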

For insertion, I created a memory-safe method in src/JSONDB.php that performs the append in O(1), relative to the number of existing rows, by reading and parsing the file from the back. Everything happened in this commit.
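In spirit, the append seeks to the end of the file, walks backwards to the closing bracket, and splices the new row in just before it, so the cost does not depend on how many rows already exist. Below is a sketch of that idea, not the exact src/JSONDB.php implementation; it assumes the file is a well-formed JSON array.

```php
<?php
// Sketch of a constant-time append: seek to the end of the file, walk
// backwards to the closing ']', and write the new row just before it.
// Assumes a well-formed JSON array; not the exact src/JSONDB.php code.
function append_row(string $file, array $row): void
{
    $json = json_encode($row);
    $fp   = fopen($file, 'c+b');     // read/write, create if missing, no truncation
    if ($fp === false) {
        throw new RuntimeException("Cannot open $file");
    }

    fseek($fp, 0, SEEK_END);
    $size = ftell($fp);

    if ($size === 0) {
        fwrite($fp, '[' . $json . ']');   // empty file: start a new array
        fclose($fp);
        return;
    }

    // Walk backwards past trailing whitespace to find the closing ']'
    $pos = $size;
    do {
        $pos--;
        fseek($fp, $pos, SEEK_SET);
        $char = fgetc($fp);
    } while ($pos > 0 && ctype_space($char));

    if ($char !== ']') {
        fclose($fp);
        throw new RuntimeException("$file is not a JSON array");
    }

    // Is the array empty ("[]") or does it already have rows?
    $before = $pos;
    do {
        $before--;
        fseek($fp, $before, SEEK_SET);
        $prev = fgetc($fp);
    } while ($before > 0 && ctype_space($prev));

    $prefix = ($prev === '[') ? '' : ',';

    // Overwrite the ']' with ",<row>]" (or "<row>]" for an empty array)
    fseek($fp, $pos, SEEK_SET);
    fwrite($fp, $prefix . $json . ']');
    fclose($fp);
}
```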

The Result

This is an improvement of over 90%. It is the exact same test we performed above, but with the modifications made. Insertion now takes a constant 20–30 milliseconds, depending on your OS and hardware, and that is the average I get on my machine no matter the size of the data.

There are other optimizations I still need to make to the library, which I will work on in my free time. You can download the library and run the tests here. I would really love to hear your criticism and feedback on my approach. I believe there must be a better and faster approach out there.
