How I Append Data Into JSON in Constant Time – PHP
php-jsondb is a lightweight library that provides an object with methods to perform CRUD operations on a JSON file, using an interface similar to that of an RDBMS. In a small-scale project, you can use it to read and store data in a JSON file, which keeps the data easy to consume from other languages such as JavaScript.
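If you have never used it, a typical interaction looks roughly like the sketch below. This is based on my reading of the README, so treat the Jajo namespace and the select()/where()/get() chain as assumptions rather than gospel:

<?php
// Minimal usage sketch (namespace and query methods assumed from the README)
require 'vendor/autoload.php';

use Jajo\JSONDB;

$db = new JSONDB( __DIR__ ); // directory holding the .json files

// Insert a row into users.json
$db->insert( 'users.json', [ 'name' => 'Jane', 'age' => 25, 'state' => 'Lagos' ] );

// Query it back, RDBMS style
$users = $db->select( 'name, age' )
            ->from( 'users.json' )
            ->where( [ 'state' => 'Lagos' ] )
            ->get();

print_r( $users );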
I built this library when I did not have much software engineering experience. On 15th August, lizhaoyao raised an issue that shows how catastrophic this library can be as the data grows. Following the GitHub issue, I was able to reproduce the problem. Below is what it looks like to insert just 5,000 records of similar data.
The Test/Problem
You can also replicate this test by cloning the repository and checking out the commit. A quick rundown on how to replicate it:
$ git clone https://github.com/donjajo/php-jsondb.git
$ cd php-jsondb
$ git checkout 59913f99dcc20
$ composer update
$ ./vendor/bin/phpunit
On my system, I get the result below for inserting 5,000 records, 1,000 per batch:
Took average of 0.534072s to insert batch 1
Took average of 1.596439s to insert batch 2
Took average of 2.649176s to insert batch 3
Took average of 3.877814s to insert batch 4
Took average of 5.202599s to insert batch 5
The result shows that the insertion time increases with every thousand records. With 50,000 records already in the file, inserting a new batch of records would take about a minute, which is very bad. It makes me wonder how people are still using the library 😀
Why does this happen?
Parsing JSON into a PHP data type is a linear process: the parser walks through every character to make sure it follows the standard. Other algorithms can be applied in between for optimization, but this reading and converting process still takes longer as the data grows. php-jsondb reads all the data into the buffer before performing each insert. Crazy, right? But why did I do that?
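Before I answer that, here is what every insert effectively boiled down to in the old design. This is a simplified sketch for illustration, not the library's actual code:

<?php
// Simplified sketch of the old insert path: every call reads and decodes
// ALL existing rows, so a single insert costs O(existing rows) and a run
// of inserts grows quadratically.
function naive_insert( string $file, array $row ) : void {
    $rows = [];

    if( file_exists( $file ) ) {
        // The whole file is loaded and parsed just to append one row
        $rows = json_decode( file_get_contents( $file ), true ) ?: [];
    }

    $rows[] = $row;

    // ...and everything is encoded and written back out again
    file_put_contents( $file, json_encode( $rows ) );
}

Reading, decoding, re-encoding and rewriting everything on every call is why each batch above is slower than the one before it.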
I was trying to maintain uniform columns across each row of JSON data, keeping to the standard of an RDBMS by judging from the first row inserted, as we can see below in the insert method:
public function insert( $file, array $values ) : array {
    $this->from( $file );
Following the code, let us look at the from() method, which loads all the file content into a buffer and decodes it:
public function from( $file ) {
    /**
     * Loads the jSON file
     *
     * @param type $file. Accepts file path to jSON file
     * @return type object
     */

    $this->file = sprintf( '%s/%s.json', $this->dir, str_replace( '.json', '', $file ) ); // Adding .json extension is no longer necessary

    // Reset where
    $this->where( [] );
    $this->content = '';

    // Reset order by
    $this->order_by = [];

    if( $this->check_file() ) {
        $this->content = ( array ) json_decode( file_get_contents( $this->file ) );
    }

    return $this;
}
This method calls yet another method, check_file(), and once it returns true, decodes the file. Let's look at check_file() here:
private function check_file() {
    /**
     * Checks and validates if JSON file exists
     *
     * @return bool
     */

    // Checks if JSON file exists, if not create
    if( !file_exists( $this->file ) ) {
        $this->commit();
    }

    // Read content of JSON file
    $content = file_get_contents( $this->file );
    $content = json_decode( $content );

    // Check if its arrays of jSON
    if( !is_array( $content ) && is_object( $content ) ) {
        throw new \Exception( 'An array of json is required: Json data enclosed with []' );
        return false;
    }
    // An invalid jSON file
    elseif( !is_array( $content ) && !is_object( $content ) ) {
        throw new \Exception( 'json is invalid' );
        return false;
    }
    else
        return true;
}
After seeing that eyesore, I wept for my past self. First, the whole design is not a nice one, and then the implementation is so costly. I decoded the whole JSON data just to check whether it is an array of JSON objects; it could be a 50,000-record file. Then, when it returns true, I decode it all over again in the from() method! Damn! What a design!
Please, one last one, let me show you:
if( !empty( $this->content[ 0 ] ) ) {
    $nulls = array_diff_key( ( array ) $this->content[ 0 ], $values );

    if( $nulls ) {
        $nulls = array_map( function() {
            return '';
        }, $nulls );

        $values = array_merge( $values, $nulls );
    }
}

if( !empty( $this->content ) && array_diff_key( $values, ( array ) $this->content[ 0 ] ) ) {
    throw new Exception( 'Columns must match as of the first row' );
}
- I used array_diff_key() – O(n^2)
- I then used array_map() – O(n)
- array_merge() – O(n)
- Then another array_diff_key()!
Please, that was me 🙁 I was just trying to maintain uniform columns across each row. But I do not know of any library or native PHP function that lets you read the first row alone, without decoding all the JSON data into the buffer and calling current() on it. So I came up with a solution. As I type this post, I am still preparing a standalone library that does this exact thing, forked from this solution, and I will publish it.
The Solution
Well, it was an eyesore for me to see such test results. I took it upon myself to make sure insertion happens in constant time, no matter the size of the data.
The issue author offered help on this by appending data directly into the JSON file. But, from my experience, it was still inefficient. There were still checks to be made before appending to a JSON file (e.g. what if it is not an array of objects? What if the JSON is already invalid?), and I still needed to maintain uniform columns across every row. That approach did give me an idea of how to do it, though. I dived deep into I/O operations; this is my favourite part of programming.
I improved on the user's contribution by first creating a function to retrieve array objects chunk by chunk, without reading all the JSON data into the buffer. With this, I was able to fetch just the first object of the array and use it to compare the new row's columns. This helper function can be found in helpers/json.php.
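I will not reproduce the helper here, but the idea is roughly the sketch below: read the file in small chunks, track brace depth while ignoring braces inside strings, and stop as soon as the first object is complete. The function name, chunk size and return shape here are mine, not the library's:

<?php
// Hypothetical sketch: return the first object of a JSON array file
// ([{...}, {...}, ...]) without loading the whole file into memory.
function read_first_object( string $file, int $chunk_size = 1024 ) : ?array {
    $handle = fopen( $file, 'rb' );
    if( $handle === false ) {
        return null;
    }

    $depth     = 0;      // current {} nesting depth
    $in_string = false;  // are we inside a JSON string?
    $escaped   = false;  // was the previous character a backslash?
    $started   = false;  // have we seen the first '{' yet?
    $buffer    = '';

    while( !feof( $handle ) ) {
        $chunk = fread( $handle, $chunk_size );

        for( $i = 0, $len = strlen( $chunk ); $i < $len; $i++ ) {
            $char = $chunk[ $i ];

            if( $started ) {
                $buffer .= $char;
            }

            if( $in_string ) {
                // Inside a string: only look for the closing quote
                if( $escaped ) {
                    $escaped = false;
                } elseif( $char === '\\' ) {
                    $escaped = true;
                } elseif( $char === '"' ) {
                    $in_string = false;
                }
                continue;
            }

            if( $char === '"' ) {
                $in_string = true;
            } elseif( $char === '{' ) {
                if( !$started ) {
                    $started = true;
                    $buffer  = '{';
                }
                $depth++;
            } elseif( $char === '}' ) {
                $depth--;
                if( $started && $depth === 0 ) {
                    // First object is complete; decode only this piece
                    fclose( $handle );
                    return json_decode( $buffer, true );
                }
            }
        }
    }

    fclose( $handle );
    return null;
}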
On insertion, I created a memory-safe method in src/JSONDB.php that performs the append in O(1) by reading and parsing from the back of the file. Everything happened in this commit.
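Stripped of validation and the column checks, the append idea looks roughly like this hypothetical sketch; the real method in src/JSONDB.php differs, so treat the names and details as illustrative only:

<?php
// Hypothetical sketch: append one row to a JSON array file in O(1) with
// respect to the number of existing rows, by only touching the end of the file.
function append_row( string $file, array $row ) : void {
    $encoded = json_encode( $row );

    // New or empty file: just write a one-element array
    if( !file_exists( $file ) || filesize( $file ) === 0 ) {
        file_put_contents( $file, '[' . $encoded . ']' );
        return;
    }

    $handle = fopen( $file, 'r+b' );

    // Scan backwards from the end of the file for the closing ']'
    $pos = filesize( $file ) - 1;
    while( $pos >= 0 ) {
        fseek( $handle, $pos );
        if( fgetc( $handle ) === ']' ) {
            break;
        }
        $pos--;
    }

    if( $pos < 0 ) {
        fclose( $handle );
        throw new \Exception( 'Not a JSON array file' );
    }

    // Peek at the previous non-whitespace character to know if the array is empty
    $prev = $pos - 1;
    while( $prev >= 0 ) {
        fseek( $handle, $prev );
        if( !ctype_space( fgetc( $handle ) ) ) {
            break;
        }
        $prev--;
    }
    fseek( $handle, $prev );
    $is_empty = fgetc( $handle ) === '[';

    // Drop the old closing ']' and anything after it, then append the new row
    ftruncate( $handle, $pos );
    fseek( $handle, $pos );
    fwrite( $handle, ( $is_empty ? '' : ',' ) . $encoded . ']' );
    fclose( $handle );
}

Because only the last few bytes of the file are ever read and rewritten, the cost of an insert no longer depends on how many rows already exist.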
The Result
Took average of 29.922322ms to insert 1000 records - BATCH 1
Took average of 27.482164ms to insert 1000 records - BATCH 2
Took average of 30.244147ms to insert 1000 records - BATCH 3
Took average of 29.854274ms to insert 1000 records - BATCH 4
Took average of 34.284435ms to insert 1000 records - BATCH 5
This is an improvement of over 90%. It is the exact same test we performed above, but with the modifications in place. Insertion now takes a roughly constant time of around 30 milliseconds per batch, depending on your OS and hardware, and that is the average I get on my machine no matter the size of the data.
There are other optimizations I still need to make to the library, which I will work on in my free time. You can download the library and run the tests here. I would really love to hear your criticism and feedback on my approach; I believe there must be a better and faster approach out there.