Commit 79d1d0f

Initial commit
0 parents  commit 79d1d0f

9 files changed

Lines changed: 787 additions & 0 deletions

LICENSE.txt

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Copyright (c) 2020 Jamie Cockburn

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# json-stream

Simple streaming JSON parser.

`json-stream` is a JSON parser just like the standard library's
[`json.load()`](https://docs.python.org/3/library/json.html#json.load). It
will read a JSON document and convert it into native python types.

Features:
* stream all JSON data types (objects or lists)
* stream nested data
* simple pythonic `list`-like/`dict`-like interface

Unlike `json.load()`, `json-stream` can _stream_ JSON data from a file-like
object. This has the following benefits:

* It does not require the whole JSON document to be read into memory up-front
* It can start producing data before the entire document has finished loading
* It only requires enough memory to hold the data currently being parsed

## What are the problems with standard `json.load()`?

The problems with `json.load()` stem from the fact that it must read
the whole JSON document into memory before parsing it.

### Memory usage

`json.load()` first reads the whole document into memory as a string. It
then parses that string and converts the whole document into python types,
which are also stored in memory. For a very large document, this could be
more memory than your system has available.

`json-stream` does not read the whole document into memory; it only buffers
enough from the stream to produce the next item of data.

In transient mode (see below) `json-stream` also doesn't store up all of
the parsed data in memory.

### Latency

`json.load()` produces data only after parsing the whole document. If you
only care about the first 10 items in a list of 2 million items, then you
have to wait until all 2 million items have been parsed first.

`json-stream` produces data as soon as it is available in the stream.

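To make the latency point concrete, here is a minimal standard-library sketch of incremental parsing. It is not how `json-stream` is implemented; `iter_list_items` and `chunk_size` are hypothetical names, and the sketch assumes a text stream whose top-level value is a list. It reads small chunks, tries `json.JSONDecoder.raw_decode` on the buffer, and yields an item only once the following `,` or `]` has been seen.

```python
import io
import json
from itertools import islice


def iter_list_items(stream, chunk_size=64):
    """Yield items of a top-level JSON list as soon as each one is complete.

    Illustrative only: re-parses the buffer on every attempt, so it is slow
    for very large individual items.
    """
    decoder = json.JSONDecoder()
    buffer = ""

    def read_more():
        nonlocal buffer
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("unexpected end of JSON stream")
        buffer += chunk

    # Locate the opening bracket of the list.
    while not buffer.strip():
        read_more()
    buffer = buffer.lstrip()
    if buffer[0] != "[":
        raise ValueError("expected a top-level JSON list")
    buffer = buffer[1:]

    first = True
    while True:
        buffer = buffer.lstrip()
        if not buffer:
            read_more()
            continue
        if first and buffer[0] == "]":
            return  # empty list
        try:
            item, end = decoder.raw_decode(buffer)
        except json.JSONDecodeError:
            read_more()  # the item is still incomplete
            continue
        rest = buffer[end:].lstrip()
        if not rest or rest[0] not in ",]":
            read_more()  # e.g. "12" might be the start of "123"
            continue
        yield item
        first = False
        if rest[0] == "]":
            return
        buffer = rest[1:]


doc = io.StringIO(json.dumps(list(range(100_000))))
first_ten = list(islice(iter_list_items(doc), 10))
print(first_ten)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(doc.tell())  # how far into the stream the reader got
```

Only the part of the stream needed for the first ten items is read; the remaining items are never parsed.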
## Usage

`json_stream.load()` has two modes of operation, controlled by
the `persistent` argument (default `False`).

### Transient mode (default)

This mode is appropriate if you can consume the data iteratively. It is also
the mode you must use if you do not want to use all the memory required to store
the entire parsed result.

In transient mode, only the data currently being read is stored in memory. Any
data previously read from the stream is discarded (it's up to you what to do
with it) and attempting to access this data results in a `TransientAccessException`.

```python
import io
import json_stream

# JSON: {"x": 1, "y": ["a", "b", "c"]}
# f can be any file-like object; an in-memory stream stands in for a file here
f = io.StringIO('{"x": 1, "y": ["a", "b", "c"]}')
data = json_stream.load(f)  # {"x": 1, "y": ['a', 'b', 'c']}

# use data like a list or dict
y = data["y"]

# already read past "x" in stream -> exception
x = data["x"]

# iterate
for c in y:
    print(c)  # prints a, b, c

# already read from list -> exception
for c in y: pass
```

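The one-pass behaviour described above can be mimicked without the library. The sketch below is an illustration under stated assumptions, not `json-stream`'s code: `OnePassList` is a hypothetical class, and `TransientAccessException` is redefined locally as a stand-in for the exception the README names.

```python
class TransientAccessException(Exception):
    """Local stand-in for the exception named in the README."""


class OnePassList:
    """Hypothetical list-like wrapper that can be iterated exactly once."""

    def __init__(self, items):
        # Keep only an iterator, so items that were read are not retained.
        self._items = iter(items)
        self._started = False

    def __iter__(self):
        if self._started:
            raise TransientAccessException("data already read from stream")
        self._started = True
        return self._items


y = OnePassList(["a", "b", "c"])
seen = [c for c in y]  # first pass consumes the items

try:
    for c in y:  # second pass is refused
        pass
except TransientAccessException as exc:
    second_pass_error = exc

print(seen)                     # ['a', 'b', 'c']
print(repr(second_pass_error))
```

Refusing the second pass, rather than silently yielding nothing, makes accidental re-reads visible to the caller.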
### Persistent mode

In persistent mode all previously read data is stored in memory as
it is parsed. The returned `dict`-like or `list`-like objects
can be used just like normal data structures.

If you request an index or key that has already been read from the stream
then it is retrieved from memory. If you request an index or key that has
not yet been read from the stream, then the request blocks until that item
is found in the stream.

```python
import io
import json_stream

# JSON: {"x": 1, "y": ["a", "b", "c"]}
# f can be any file-like object; an in-memory stream stands in for a file here
f = io.StringIO('{"x": 1, "y": ["a", "b", "c"]}')
data = json_stream.load(f, persistent=True)

# use data like a list or dict
# stream is read up to the middle of the list
b = data["y"][1]  # b = "b"

# read from memory
x = data["x"]  # x = 1
```

Persistent mode is not appropriate if you care about memory consumption, but
provides an experience identical to `json.load()`.

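The read-forward-then-cache behaviour can be sketched in a few lines. This is an illustration only, not the library's implementation: `CachingList` is a hypothetical name, and a plain generator stands in for the JSON stream.

```python
class CachingList:
    """Hypothetical list-like view over an iterator that remembers what it read."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes are not supported in this sketch")
        # Read forward only as far as needed to reach the requested index.
        while len(self._cache) <= index:
            try:
                self._cache.append(next(self._source))
            except StopIteration:
                raise IndexError(index) from None
        return self._cache[index]


consumed = []  # records what has been pulled from the underlying source


def source():
    for item in ["a", "b", "c"]:
        consumed.append(item)
        yield item


y = CachingList(source())
b = y[1]  # pulls "a" and "b" from the source
a = y[0]  # served from the cache, nothing new is read

print(a, b)      # a b
print(consumed)  # ['a', 'b']
```

The cache is what costs memory: every item read so far stays alive, which is the trade-off the paragraph above describes.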
## Visitor pattern

You can also parse using a visitor-style approach where a function you supply
is called for each data item as it is parsed (depth-first).

This uses a transient parser under the hood, so it does not consume memory for
the whole document.

```python
import io
import json_stream

# f can be any file-like object; an in-memory stream stands in for a file here
f = io.StringIO('{"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}')

def visitor(path, data):
    print(f"{path}: {data}")

json_stream.visit(f, visitor)
```

Output:
```
('x',): 1
('y',): {}
('xxxx', 0): 1
('xxxx', 1): 2
('xxxx', 2, 'yyyy'): 1
('xxxx', 3): z
('xxxx', 4): 1
('xxxx', 5): []
```

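To show where the path tuples and the call order come from, here is a small depth-first walk over an already-parsed document. It is a sketch for illustration, not `json_stream.visit`: the `walk` helper is hypothetical, and it loads the whole document with `json.loads`, so it has none of the memory benefits. Its treatment of empty containers as leaf values is inferred from the output above.

```python
import json


def walk(data, visitor, path=()):
    """Call visitor(path, value) for every leaf value, depth-first."""
    if isinstance(data, dict) and data:
        for key, value in data.items():
            walk(value, visitor, path + (key,))
    elif isinstance(data, list) and data:
        for index, value in enumerate(data):
            walk(value, visitor, path + (index,))
    else:
        # Scalars and empty containers are reported as values.
        visitor(path, data)


calls = []


def visitor(path, data):
    calls.append((path, data))
    print(f"{path}: {data}")


document = json.loads('{"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}')
walk(document, visitor)
```

Each path is a tuple of dictionary keys and list indexes leading from the root to the value.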
# Future improvements

* Allow long strings in the JSON to be read as streams themselves
* Allow transient mode on seekable streams to seek to data earlier in
  the stream instead of raising a `TransientAccessException`
* A more efficient tokenizer?

# Alternatives

## NAYA

[NAYA](https://github.com/danielyule/naya) is a pure python JSON parser for
parsing a simple JSON list as a stream.

### Why not NAYA?

* It can only stream JSON containing a top-level list
* It does not provide a pythonic `dict`/`list`-like interface

## Yajl-Py

[Yajl-Py]() is a wrapper around the Yajl JSON library that can be used to
generate SAX-style events while parsing JSON.

### Why not Yajl-Py?

* It's not pure python
* It does not provide a pythonic `dict`/`list`-like interface

# Acknowledgements

The JSON tokenizer used in this project was taken from the [NAYA](https://github.com/danielyule/naya) project.

setup.py

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
from setuptools import setup, find_packages
import pathlib

here = pathlib.Path(__file__).parent.resolve()

# Get the long description from the README file
long_description = (here / 'README.md').read_text(encoding='utf-8')

setup(
    name='json-stream',
    version='1.0.0',
    description='Streaming JSON decoder',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/daggaz/json-stream',
    author='Jamie Cockburn',
    author_email='[email protected]',  # Optional
    classifiers=[  # Optional
        # How mature is this project? Common values are
        #   3 - Alpha
        #   4 - Beta
        #   5 - Production/Stable
        'Development Status :: 3 - Alpha',

        # Indicate who your project is intended for
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Libraries',

        # Pick your license as you wish
        'License :: OSI Approved :: MIT License',

        # Specify the Python versions you support here. In particular, ensure
        # that you indicate you support Python 3. These classifiers are *not*
        # checked by 'pip install'. See instead 'python_requires' below.
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: 3 :: Only',
    ],
    keywords='sample, setuptools, development',
    package_dir={'': 'src'},
    packages=find_packages(where='src'),
    python_requires='>=3.5, <4',
    project_urls={
        'Bug Reports': 'https://github.com/daggaz/json-stream/issues',
        'Source': 'https://github.com/daggaz/json-stream/',
    },
)

src/json_stream/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
from json_stream.loader import load
from json_stream.visitor import visit
