-
Interesting! I've been storing my data in S3, but never got around to setting up automatic imports. I saw a bunch of comments about problems with

Anyway: the better interface for reading nontrivial amounts of data into the server is the stream-based Provider interface. This is what the server uses internally: it pipes data from a source such as a serial connection, a file, or a TCP or WebSocket connection through other pipe elements that transform the data, eventually producing SK delta messages that are fed into the server as is, with the source information attached.

The biggest functional difference between a Provider and a plugin's handleMessage call is that handleMessage does not provide any backpressure mechanism. My original use case was importing log files from S3 via the server into InfluxDB. With no backpressure, the input data overflows what can be written to Influx. The Provider mechanism uses Node streams' backpressure, throttling the S3 input stream when internal buffers fill up. The code is at https://github.com/SignalK/signalk-server/blob/master/packages/streams/s3.js

From a quick look, you could have created just a single huge delta, with the updates from multiple real-time updates concatenated into one. Did you consider that instead of rolling your own format?

One problem with JSON for larger data volumes is that parsing usually happens in memory and not in a streaming fashion. The same applies to generation: unless you jump through hoops, you first need to build the data in memory and then serialise it in one go. One way around this would be a hybrid approach: "line-oriented JSON", one large but not overly large JSON message per line. This way you write and read in chunks instead of a single huge JSON object.

I did not quite understand the special handling for object-valued properties. Why are they different in your format? As in deltas, they are just path-value pairs.
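To make the line-oriented JSON idea concrete, here is a minimal sketch using Node's built-in fs and readline (the function names are illustrative, not part of the Signal K server API):

```js
const fs = require('fs')
const readline = require('readline')

// Writing: serialise one modestly sized batch per line instead of one huge object.
// write() returns false when the internal buffer is full; a real producer would
// wait for the 'drain' event before writing more, which is where backpressure
// comes from.
function writeBatches (batches, path) {
  const out = fs.createWriteStream(path)
  for (const batch of batches) {
    out.write(JSON.stringify(batch) + '\n')
  }
  out.end()
}

// Reading: parse one line at a time, so memory use stays bounded by the size of
// a single line rather than the whole file.
async function readBatches (path, onBatch) {
  const rl = readline.createInterface({ input: fs.createReadStream(path) })
  for await (const line of rl) {
    if (line.trim()) {
      onBatch(JSON.parse(line))
    }
  }
}
```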
-
I'd like to fork this discussion into the write side first, then the read side. I think they're different, and the write side is the driver for how the read side ended up.

Write side: Yes, I tried a single delta and it's too big. This is kind of the "signalk to signalk" rationale in my README. I took a full dump every 5s from the signalk on my boat, concatenated them, and gzipped the result. It came out to over 450KB per minute. My format is about 7KB per minute. The main reason for the difference is that my format only lists the structure of the tree and the paths once. Concatenated updates duplicate the structure and the paths. When you're counting bytes, the structure and the paths dominate the values themselves. Gzip helps, but not enough.

FWIW, my first approach was to use csv instead of json. Those files were 3-4KB per minute. But json is much easier to work with, feels more like signalk, handles sparse data better, and 7KB a minute still gives me plenty of headroom on my monthly LTE data plan.

I also don't think that this approach produces overly large json objects. It's more or less equivalent in size to the full signalk state. Again, that's because each file only contains the structure and paths once. I think of the file size as O(n) in the number of paths, not O(n) in the number of data points. That's not quite true; the file is O(n) in both, but the constant factor for paths is much larger than the constant factor for data points.

The special object handling comes from paths like navigation.position. The value is:
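something like this (coordinates made up for illustration):

```json
{
  "longitude": -122.33,
  "latitude": 47.61
}
```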
This object value is hard to put into a time series db, like Timestream. The only option is to record the value as a VARCHAR and then parse it when reading it back. (Coming up with this example made me realize I have a bug in the reading back in of these values. Sigh. Filed a bug against myself.)

Now, I could leave the object in the json format and do this at the last second before writing to Timestream, but that bloats the json object. I'm storing a value every 5s and writing a file every 60s. If I left the value in my file as the json object, I'd store the strings "longitude" and "latitude" 6 times. By flattening the structure, I only store those strings once.

To be honest, I've never understood why there are object values. I read in the spec that it's because the values aren't useful on their own, but I'm not sure I buy that as sufficient rationale. Probably a topic for a separate thread. I'm sure I'm missing something. :-)

This avoidance of repeating strings is why I made the optimization to not repeat the same value twice. I have some string values in my signalk dataset that are long and change very rarely (one is the last text message received by my LTE modem, which is currently 154 bytes). I could omit the entire thing, but it's important to me to know that the data is fresh, even if it hasn't changed.

I could have modified the delta format in a similar style to what I did for the full state, but the thing I like about the current approach is that a reader can start at any point in time. With deltas, I'd have to emit periodic snapshots and a reader could only start at a snapshot. This seems simpler, as there's only one writer and one reader.

Read side: Re: streams, it never occurred to me. I'll look at that. Thanks.

I'm interested in your comment about backpressure. I don't (think I) want backpressure when reading from the NMEA bus. The bus isn't going to slow down giving updates on the depth, and I sure don't want it to. The latest data is always the most important data to me. If signalk gets behind reading from the bus, I'd much rather old metrics get dropped than readers slow down.
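For what it's worth, the flattening is conceptually just this (an illustrative sketch, not the plugin's actual code):

```js
// Turn an object-valued Signal K path into scalar paths, so each number can be
// written as its own time series measure instead of a VARCHAR blob.
function flatten (path, value) {
  if (value !== null && typeof value === 'object') {
    return Object.entries(value).flatMap(([key, child]) =>
      flatten(`${path}.${key}`, child)
    )
  }
  return [[path, value]]
}

// flatten('navigation.position', { longitude: -122.33, latitude: 47.61 })
// => [ ['navigation.position.longitude', -122.33],
//      ['navigation.position.latitude', 47.61] ]
```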
-
(Cross-posted from the email list)
I've created a batch json format that includes multiple data points, each at a time offset, for each signalk path. I'm compressing those and uploading to S3. I've got triggers that run a Lambda to write to Timestream, and that publish via SNS for a signalk instance in the cloud to consume. This has been running for about a month now and is much more reliable and bandwidth efficient than anything else I've come up with so far.
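Roughly, each file lists the paths once and carries the samples as offsets from a base timestamp; something like this (the field names here are illustrative, not the exact schema; the README linked below has the real format):

```json
{
  "timestamp": "2021-03-01T00:00:00Z",
  "paths": {
    "navigation.speedOverGround": { "0": 3.1, "5": 3.2, "10": 3.2 },
    "navigation.position.latitude": { "0": 47.61, "5": 47.61, "10": 47.62 }
  }
}
```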
I've got much more detail in the readme here:
https://github.com/c33howard/signalk-to-batch-format
The other relevant components are:
https://github.com/c33howard/signalk-batcher
https://github.com/c33howard/signalk-from-batch-format
https://github.com/c33howard/signalk-to-timestream-lambda
Feedback welcome on the approach, code, readme, etc.