| 1 | +# json-stream |
| 2 | + |
| 3 | +Simple streaming JSON parser. |
| 4 | + |
| 5 | +`json-stream` is a JSON parser just like the standard library's |
| 6 | + [`json.load()`](https://docs.python.org/3/library/json.html#json.load). It |
| 7 | + will read a JSON document and convert it into native Python types. |
| 8 | + |
| 9 | +Features: |
| 10 | +* stream all JSON data types (objects or lists) |
| 11 | +* stream nested data |
| 12 | +* simple pythonic `list`-like/`dict`-like interface |
| 13 | + |
| 14 | +Unlike `json.load()`, `json-stream` can _stream_ JSON data from a file-like |
| 15 | +object. This has the following benefits: |
| 16 | + |
| 17 | +* It does not require the whole JSON document to be read into memory up front |
| 18 | +* It can start producing data before the entire document has finished loading |
| 19 | +* It only requires enough memory to hold the data currently being parsed |
| 20 | + |
| 21 | +## What are the problems with standard `json.load()`? |
| 22 | + |
| 23 | +The problems with `json.load()` stem from the fact that it must read |
| 24 | +the whole JSON document into memory before parsing it. |
| 25 | + |
| 26 | +### Memory usage |
| 27 | + |
| 28 | +`json.load()` first reads the whole document into memory as a string. It |
| 29 | +then parses that string and converts the whole document into Python types, |
| 30 | +which are also stored in memory. For a very large document, this could be more |
| 31 | +memory than your system has available. |
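
As a rough illustration of the first step, the sketch below wraps an in-memory stream in a hypothetical `RecordingReader` class (not part of `json-stream` or the standard library) that logs the size passed to each `read()` call. In CPython, `json.load()` issues a single unbounded read, so the whole document is pulled into memory before parsing starts.

```python
import io
import json


class RecordingReader(io.StringIO):
    """Hypothetical helper: records the size argument of every read() call."""

    def __init__(self, text):
        super().__init__(text)
        self.read_sizes = []

    def read(self, size=-1):
        # A size of -1 means "read everything that is left".
        self.read_sizes.append(size)
        return super().read(size)


f = RecordingReader('{"x": 1, "y": ["a", "b", "c"]}')
data = json.load(f)

# One entry with size -1: the document was fetched in a single unbounded read.
print(f.read_sizes)
```

A streaming parser would instead show many small reads, each made only when more input is needed.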
| 32 | + |
| 33 | +`json-stream` does not read the whole document into memory; it only buffers |
| 34 | +enough from the stream to produce the next item of data. |
| 35 | + |
| 36 | +In transient mode (see below), `json-stream` also doesn't store all of |
| 37 | +the parsed data in memory. |
| 38 | + |
| 39 | +### Latency |
| 40 | + |
| 41 | +`json.load()` produces all the data after parsing the whole document. If you |
| 42 | +only care about the first 10 items in a list of 2 million items, then you |
| 43 | +have to wait until all 2 million items have been parsed first. |
| 44 | + |
| 45 | +`json-stream` produces data as soon as it is available in the stream. |
| 46 | + |
| 47 | +## Usage |
| 48 | + |
| 49 | +`json_stream.load()` has two modes of operation, controlled by |
| 50 | +the `persistent` argument (default `False`). |
| 51 | + |
| 52 | +### Transient mode (default) |
| 53 | + |
| 54 | +This mode is appropriate if you can consume the data iteratively. It is also |
| 55 | +the mode you must use if you do not want to use all the memory required to store |
| 56 | +the entire parsed result. |
| 57 | + |
| 58 | +In transient mode, only the data currently being read is stored in memory. Any |
| 59 | +data previously read from the stream is discarded (it's up to you what to do |
| 60 | +with it) and attempting to access this data results in a `TransientAccessException`. |
| 61 | + |
| 62 | +```python |
| 63 | +import json_stream |
| 64 | + |
| 65 | +# f is a file-like object containing the JSON: {"x": 1, "y": ["a", "b", "c"]} |
| 66 | +data = json_stream.load(f) # {"x": 1, "y": ['a', 'b', 'c']} |
| 67 | + |
| 68 | +# use data like a list or dict |
| 69 | +y = data["y"] |
| 70 | + |
| 71 | +# already read past "x" in the stream -> raises TransientAccessException |
| 72 | +x = data["x"] |
| 73 | + |
| 74 | +# iterate |
| 75 | +for c in y: |
| 76 | + print(c) # prints a, b, c |
| 77 | + |
| 78 | +# list already consumed from the stream -> exception |
| 79 | +for c in y: pass |
| 80 | +``` |
| 81 | + |
| 82 | +### Persistent mode |
| 83 | + |
| 84 | +In persistent mode all previously read data is stored in memory as |
| 85 | +it is parsed. The returned `dict`-like or `list`-like objects |
| 86 | +can be used just like normal data structures. |
| 87 | + |
| 88 | +If you request an index or key that has already been read from the stream |
| 89 | +then it is retrieved from memory. If you request an index or key that has |
| 90 | +not yet been read from the stream, then the request blocks until that item |
| 91 | +is found in the stream. |
| 92 | + |
| 93 | +```python |
| 94 | +import json_stream |
| 95 | + |
| 96 | +# f is a file-like object containing the JSON: {"x": 1, "y": ["a", "b", "c"]} |
| 97 | +data = json_stream.load(f, persistent=True) |
| 98 | + |
| 99 | +# use data like a list or dict |
| 100 | +# the stream is read up to the middle of the list |
| 101 | +b = data["y"][1] # b = "b" |
| 102 | + |
| 103 | +# read from memory |
| 104 | +x = data["x"] # x = 1 |
| 105 | +``` |
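
The read-ahead-and-remember behaviour described above can be modelled with a small toy class. This is only a sketch of the idea using the standard library: `LazyList` and `items()` are hypothetical names, and this is not how `json-stream` is implemented. It pulls items from an iterator only as far as needed and keeps everything it has seen.

```python
class LazyList:
    """Toy model: reads from an iterator on demand and remembers every item."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []

    def __getitem__(self, index):
        # Only non-negative indexes are supported in this sketch.
        # Read forward just far enough to reach `index`.
        while len(self._cache) <= index:
            try:
                self._cache.append(next(self._source))
            except StopIteration:
                raise IndexError(index) from None
        return self._cache[index]


consumed = []


def items():
    """Stands in for the stream; records which items have been read."""
    for value in ["a", "b", "c"]:
        consumed.append(value)
        yield value


y = LazyList(items())
b = y[1]                       # reads "a" and "b" from the source
after_first = list(consumed)
a = y[0]                       # served from memory, nothing new is read
after_second = list(consumed)
```

The trade-off is the one noted for persistent mode: every item read so far stays in memory.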
| 106 | + |
| 107 | +Persistent mode is not appropriate if you care about memory consumption, but |
| 108 | +provides an experience identical to `json.load()`. |
| 109 | + |
| 110 | +## Visitor pattern |
| 111 | + |
| 112 | +You can also parse using a visitor-style approach where a function you supply |
| 113 | +is called for each data item as it is parsed (depth-first). |
| 114 | + |
| 115 | +This uses a transient parser under the hood, so it does not consume memory for |
| 116 | +the whole document. |
| 117 | + |
| 118 | +```python |
| 119 | +import json_stream |
| 120 | + |
| 121 | +# JSON: {"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]} |
| 122 | + |
| 123 | +def visitor(path, data): |
| 124 | + print(f"{path}: {data}") |
| 125 | + |
| 126 | +json_stream.visit(f, visitor) |
| 127 | +``` |
| 128 | + |
| 129 | +Output: |
| 130 | +``` |
| 131 | +('x',): 1 |
| 132 | +('y',): {} |
| 133 | +('xxxx', 0): 1 |
| 134 | +('xxxx', 1): 2 |
| 135 | +('xxxx', 2, 'yyyy'): 1 |
| 136 | +('xxxx', 3): z |
| 137 | +('xxxx', 4): 1 |
| 138 | +('xxxx', 5): [] |
| 139 | +``` |
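
To make the path convention in that output concrete, here is a non-streaming sketch that walks an already-parsed document and calls a visitor with the same kind of path tuples. The `walk` function is a hypothetical helper written for illustration; it uses the standard `json` module and is not how `json_stream.visit` works internally. It assumes, based on the output above, that empty objects and lists are reported as values.

```python
import json


def walk(data, visitor, path=()):
    """Depth-first walk: call visitor(path, value) for leaves and empty containers."""
    if isinstance(data, dict) and data:
        for key, value in data.items():
            walk(value, visitor, path + (key,))
    elif isinstance(data, list) and data:
        for index, value in enumerate(data):
            walk(value, visitor, path + (index,))
    else:
        # Scalars, {} and [] are handed to the visitor as-is.
        visitor(path, data)


doc = json.loads('{"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}')

seen = []
walk(doc, lambda path, value: seen.append((path, value)))

for path, value in seen:
    print(f"{path}: {value}")
```

Unlike the streaming visitor, this version needs the whole parsed document in memory first, so it only illustrates the traversal order and the shape of `path`.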
| 140 | + |
| 141 | +# Future improvements |
| 142 | + |
| 143 | +* Allow long strings in the JSON to be read as streams themselves |
| 144 | +* Allow transient mode on seekable streams to seek to data earlier in |
| 145 | +the stream instead of raising a `TransientAccessException` |
| 146 | +* A more efficient tokenizer? |
| 147 | + |
| 148 | +# Alternatives |
| 149 | + |
| 150 | +## NAYA |
| 151 | + |
| 152 | +[NAYA](https://github.com/danielyule/naya) is a pure Python JSON parser for |
| 153 | +parsing a simple JSON list as a stream. |
| 154 | + |
| 155 | +### Why not NAYA? |
| 156 | + |
| 157 | +* It can only stream JSON containing a top-level list |
| 158 | +* It does not provide a pythonic `dict`/`list`-like interface |
| 159 | + |
| 160 | +## Yajl-Py |
| 161 | + |
| 162 | +[Yajl-Py]() is a wrapper around the Yajl JSON library that can be used to |
| 163 | +generate SAX style events while parsing JSON. |
| 164 | + |
| 165 | +### Why not Yajl-Py? |
| 166 | + |
| 167 | +* It's not pure Python |
| 168 | +* It does not provide a pythonic `dict`/`list`-like interface |
| 169 | + |
| 170 | +# Acknowledgements |
| 171 | + |
| 172 | +The JSON tokenizer used in the project was taken from the [NAYA](https://github.com/danielyule/naya) project. |