-
Interesting! I've been storing my data in S3, but never got around to setting up automatic imports. I saw a bunch of comments about problems with

Anyway: the better interface for reading nontrivial amounts of data into the server is the stream-based Provider interface. This is what the server uses internally: it pipes data from a source such as a serial connection, a file, or a TCP or WebSocket connection through other pipe elements that transform the data, eventually producing SK delta messages that are fed into the server as is, with the source information attached.

The biggest functional difference between a Provider and a plugin's handleMessage call is that handleMessage does not provide any backpressure mechanism. My original use case was importing log files from S3 via the server into InfluxDB. With no backpressure, the input data overflows what can be written to Influx. The Provider mechanism uses Node streams' backpressure, throttling the S3 input stream when internal buffers fill up. The code is at https://github.com/SignalK/signalk-server/blob/master/packages/streams/s3.js

From a quick look, you could have created just a single huge delta, with the updates from multiple real-time updates concatenated into one. Did you consider that instead of rolling your own format?

One problem with JSON for larger data volumes is that parsing usually happens in memory and not in a streaming fashion. The same applies to generation: unless you jump through hoops, you first need to build the data in memory and then serialise it in one go. One way around this would be a hybrid approach: "line-oriented JSON", one large but not overly large JSON message per line. This way you write and read in chunks instead of a single huge JSON object.

I did not quite understand the special handling for object-valued properties. Why are they different in your format? As in deltas, they are just path-value pairs.
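To make the line-oriented JSON idea concrete, here is a minimal sketch using Node's built-in fs and readline (the function names are illustrative, not part of the Signal K server API):

```js
const fs = require('fs')
const readline = require('readline')

// Writing: serialise one modestly sized batch per line instead of one huge object.
// write() returns false when the internal buffer is full; a real producer would
// wait for the 'drain' event before writing more, which is where backpressure
// comes from.
function writeBatches (batches, path) {
  const out = fs.createWriteStream(path)
  for (const batch of batches) {
    out.write(JSON.stringify(batch) + '\n')
  }
  out.end()
}

// Reading: parse one line at a time, so memory use stays bounded by the size of
// a single line rather than the whole file.
async function readBatches (path, onBatch) {
  const rl = readline.createInterface({ input: fs.createReadStream(path) })
  for await (const line of rl) {
    if (line.trim()) {
      onBatch(JSON.parse(line))
    }
  }
}
```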
-
I'd like to fork this discussion into the write side first, then the read side. I think they're different, and the write side is the driver for how the read side ended up.

Write side: Yes, I tried a single delta and it's too big. This is kind of the "signalk to signalk" rationale in my README. I took a full dump every 5s from the signalk on my boat, concatenated them, and gzipped the result. It came out to over 450KB per minute. My format is about 7KB per minute. The main reason for the difference is that my format only lists the structure of the tree and the paths once. Concatenated updates duplicate the structure and the paths. When you're counting bytes, the structure and the paths dominate the values themselves. Gzip helps, but not enough.

FWIW, my first approach was to use csv instead of json. Those files were 3-4KB per minute. But json is much easier to work with, feels more like signalk, handles sparse data better, and 7KB a minute still gives me plenty of headroom on my monthly LTE data plan.

I also don't think that this approach produces overly large json objects. It's more or less equivalent in size to the full signalk state. Again, that's because each file only contains the structure and paths once. I think of the file size as O(n) in the number of paths, not O(n) in the number of data points. That's not quite true; the file is O(n) in both, but the constant factor for paths is much larger than the constant factor for data points.

The special object handling comes from paths like navigation.position. The value is:
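something like this (coordinates made up for illustration):

```json
{
  "longitude": -122.33,
  "latitude": 47.61
}
```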
This object value is hard to put into a time series db, like Timestream. The only option is to record the value as a VARCHAR and then parse it when reading it back. (Coming up with this example made me realize I have a bug in the reading back in of these values. Sigh. Filed a bug against myself.)

Now, I could leave the object in the json format and do this at the last second before writing to Timestream, but that bloats the json object. I'm storing a value every 5s and writing a file every 60s. If I left the value in my file as the json object, I'd store the strings "longitude" and "latitude" 6 times. By flattening the structure, I only store those strings once.

To be honest, I've never understood why there are object values. I read in the spec that it's because the values aren't useful on their own, but I'm not sure I buy that as sufficient rationale. Probably a topic for a separate thread. I'm sure I'm missing something. :-)

This avoidance of repeating strings is why I made the optimization to not repeat the same value twice. I have some string values in my signalk dataset that are long and change very rarely (one is the last text message received by my LTE modem, which is currently 154 bytes). I could omit the entire thing, but it's important to me to know that the data is fresh, even if it hasn't changed.

I could have modified the delta format in a similar style to what I did for the full state, but the thing I like about the current approach is that a reader can start at any point in time. With deltas, I'd have to emit periodic snapshots and a reader could only start at a snapshot. This seems simpler, as there's only one writer and one reader.

Read side: Re: streams, it never occurred to me. I'll look at that. Thanks.

I'm interested in your comment about backpressure. I don't (think I) want backpressure when reading from the NMEA bus. The bus isn't going to slow down giving updates on the depth, and I sure don't want it to. The latest data is always the most important data to me. If signalk gets behind reading from the bus, I'd much rather old metrics get dropped than readers slow down.
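For what it's worth, the flattening is conceptually just this (an illustrative sketch, not the plugin's actual code):

```js
// Turn an object-valued Signal K path into scalar paths, so each number can be
// written as its own time series measure instead of a VARCHAR blob.
function flatten (path, value) {
  if (value !== null && typeof value === 'object') {
    return Object.entries(value).flatMap(([key, child]) =>
      flatten(`${path}.${key}`, child)
    )
  }
  return [[path, value]]
}

// flatten('navigation.position', { longitude: -122.33, latitude: 47.61 })
// => [ ['navigation.position.longitude', -122.33],
//      ['navigation.position.latitude', 47.61] ]
```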
-
(Cross-posted from the email list)
I've created a batch json format that includes multiple data points, each at a time offset, for each signalk path. I'm compressing those and uploading to S3. I've got triggers that run a Lambda to write to Timestream, and that publish via SNS for a signalk instance in the cloud to consume. This has been running for about a month now and is much more reliable and bandwidth efficient than anything else I've come up with so far.
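Roughly, each file lists the paths once and carries the samples as offsets from a base timestamp; something like this (the field names here are illustrative, not the exact schema; the README linked below has the real format):

```json
{
  "timestamp": "2021-03-01T00:00:00Z",
  "paths": {
    "navigation.speedOverGround": { "0": 3.1, "5": 3.2, "10": 3.2 },
    "navigation.position.latitude": { "0": 47.61, "5": 47.61, "10": 47.62 }
  }
}
```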
I've got much more detail in the readme here:
https://github.com/c33howard/signalk-to-batch-format
The other relevant components are:
https://github.com/c33howard/signalk-batcher
https://github.com/c33howard/signalk-from-batch-format
https://github.com/c33howard/signalk-to-timestream-lambda
Feedback welcome on the approach, code, readme, etc.