This article shares my first adventures exploring a real science, open access dataset with freely available open source software.
- About the BeesBook 2015 dataset sample
- About Apache Solr 7.5
- Cloning this repo
- Downloading the dataset
- Preprocessing the dataset
- Downloading the software
- Indexing the dataset
- Faceting with bees
- Streaming expressions with bees
- Bee plots
- Wrap up
In their open access Frontiers in Robotics and AI journal paper entitled "Tracking All Members of a Honey Bee Colony Over Their Lifetime Using Learned Models of Correspondence", Boenisch et al. [1] present an in-depth description of a multi-step algorithm which produces motion paths of automatically tracked marked honey bees. Alongside their paper, the team at Freie Universität Berlin published the first trajectory dataset for all bees in a colony, extracted from ∼3 million images covering 3 days [2].
The sample dataset entitled "BeesBook Recording Season 2015 Sample Release" is available online on Synapse (syn11737848) or via its Digital Object Identifier doi: 10.7303/syn11737848.1. The full dataset comprises 71 days of continuous positional data (3 Hz sample rate); in total, 2,775 bees in a one-frame observation hive were marked and recorded with four cameras.
[1] Boenisch F, Rosemann B, Wild B, Dormagen D, Wario F and Landgraf T (2018) Tracking All Members of a Honey Bee Colony Over Their Lifetime Using Learned Models of Correspondence. Front. Robot. AI 5:35. doi: 10.3389/frobt.2018.00035
[2] Boenisch, F., Rosemann, B., Wild, B., Wario, F., Dormagen, D., and Landgraf, T. (2018). BeesBook Recording Season 2015 Sample. doi: 10.7303/syn11737848.1
Apache Solr is an open source search platform based on the Apache Lucene search engine library. As I'm writing this in late September 2018 the latest release is version 7.5.0. Lucene and Solr are both written in Java and are two of many open source software projects hosted by the not-for-profit Apache Software Foundation.
In this article I am using Solr mainly because I'm already familiar with it and wish to learn more about its many features. As such, what follows is not an introductory Solr tutorial (those can be found elsewhere), and Solr is quite possibly not even the best tool for a dataset exploration job, but well, let's see how it goes.
git clone
cd bee-informatics/BeesBook2015-sample-with-ApacheSolr750
When I downloaded it, the BeesBook Recording Season 2015 Sample Release dataset was called data_sample_release.csv, contained about 200 million rows and was about 19GB in size. Let's assume this large file is downloaded into the data-sets directory.
$ pwd
$ ls ./data_sample_release.csv
$ wc -l ./data_sample_release.csv
200145135 ./data_sample_release.csv
Working with such a large dataset is bound to take time. So let's try all steps out first with a partial dataset and if that works well the step can be repeated with the full dataset.
$ head -1000001 ./data_sample_release.csv > ./partial_data_sample_release.csv
$ wc -l ./partial_data_sample_release.csv
1000001 ./partial_data_sample_release.csv
The dataset is conveniently in comma separated values (csv) format but we need to mildly pre-process the timestamp, x_pos and y_pos columns' format to suit our subsequent steps.
The sed stream editor utility can be used to adjust the timestamp format. Illustration:
$ echo "2018-07-06 05:43:21.012345+00"
2018-07-06 05:43:21.012345+00
$ echo "2018-07-06 05:43:21.012345+00" | sed 's/\(....\-..\-..\) \(..:..:...*\)+00/\1T\2Z/g'
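For readers more at home in Python than in sed, the same timestamp rewrite can be sketched with the re module. This is only an illustration of the transformation, not the code used by the pre-processing script; the function name to_solr_timestamp is made up for this example.

```python
import re

# Matches e.g. "2018-07-06 05:43:21.012345+00" and captures the date and time parts.
_TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}(?:\.\d+)?)\+00$")


def to_solr_timestamp(value):
    """Rewrite 'YYYY-MM-DD hh:mm:ss[.ffffff]+00' as 'YYYY-MM-DDThh:mm:ss[.ffffff]Z'."""
    match = _TIMESTAMP.match(value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    date_part, time_part = match.groups()
    return f"{date_part}T{time_part}Z"


print(to_solr_timestamp("2018-07-06 05:43:21.012345+00"))
# prints 2018-07-06T05:43:21.012345Z
```

Unlike the sed one-liner, this version raises an error on a value it does not recognise instead of passing it through unchanged.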
The sed utility can also be used to merge and repeat the position columns. Illustration:
$ echo "1234,5678"
$ echo "1234,5678" | sed 's/\(.*\),\(.*\)/\1,\2,\1 \2/g'
1234,5678,1234 5678
If we assume that a coordinate is in the 0 .. 9999 range we can also, somewhat hackily perhaps, use the sed utility to map that down to the 0.0000 .. 0.9999 scale. Illustration:
$ echo "1234,5678
987,654
3,21"
1234,5678
987,654
3,21
$ echo "1234,5678
987,654
3,21" | sed 's/\(.*\),\(.*\)/posX0000\1 posY0000\2/g'
posX00001234 posY00005678
posX0000987 posY0000654
posX00003 posY000021
$ echo "1234,5678
987,654
3,21" | sed 's/\(.*\),\(.*\)/posX0000\1 posY0000\2/g' | sed 's/posX.*\(....\) posY.*\(....\)/0\.\1 0\.\2/g'
0.1234 0.5678
0.0987 0.0654
0.0003 0.0021
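The pad-then-truncate trick above amounts to zero-padding each coordinate to four digits and putting "0." in front. A minimal Python sketch of the same mapping follows, assuming (as the text does) that every coordinate is an integer in the 0 .. 9999 range; the function names are hypothetical.

```python
def scale_coordinate(value):
    """Map an integer coordinate in 0..9999 to a '0.dddd' string."""
    number = int(value)
    if not 0 <= number <= 9999:
        raise ValueError(f"coordinate out of assumed range: {value!r}")
    return f"0.{number:04d}"


def merge_position(x_pos, y_pos):
    """Join the two scaled coordinates with a single space."""
    return f"{scale_coordinate(x_pos)} {scale_coordinate(y_pos)}"


for line in ["1234,5678", "987,654", "3,21"]:
    x_pos, y_pos = line.split(",")
    print(merge_position(x_pos, y_pos))
# prints:
# 0.1234 0.5678
# 0.0987 0.0654
# 0.0003 0.0021
```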
The script combines the above pre-processing steps into one convenient script.
$ pwd
$ ls ./
$ ./ ./partial_data_sample_release.csv ./prepared-partial-dataset.csv
Input file: ./partial_data_sample_release.csv
Output file: ./prepared-partial-dataset.csv
$ ./ ./data_sample_release.csv ./prepared-dataset.csv
Input file: ./data_sample_release.csv
Output file: ./prepared-dataset.csv
The script uses sed three times. Alternatives to this no doubt exist, e.g. a custom program or script that does just one pass through the dataset. However, we must balance any reduction in pre-processing time against the time required to develop such a custom tool. And well, if pre-processing is just a one-off job here then sed taking 40 minutes or so to do it will be just fine and provide an opportunity for a leisurely lunch or tea break I'd say.
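For the curious, here is roughly what the skeleton of a one-pass alternative could look like in Python. It is a sketch under assumptions, not a replacement for the script: I assume the file starts with a header row and that the columns are called timestamp, x_pos and y_pos, and the preprocess function and the sample row are made up for illustration.

```python
import csv
import io


def preprocess(source, target, transforms):
    """Copy CSV rows from source to target, rewriting selected columns in one pass."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column, transform in transforms.items():
            row[column] = transform(row[column])
        writer.writerow(row)


# Tiny in-memory stand-in for the real file; column names and values are assumptions.
sample = io.StringIO("timestamp,x_pos,y_pos\n2015-09-08 00:00:00.000000+00,1234,5678\n")
result = io.StringIO()
preprocess(sample, result, {
    "timestamp": lambda value: value.replace(" ", "T").replace("+00", "Z"),
    "x_pos": lambda value: f"0.{int(value):04d}",
    "y_pos": lambda value: f"0.{int(value):04d}",
})
print(result.getvalue(), end="")
```

Because rows are streamed one at a time, memory use stays flat regardless of file size; whether it would actually beat three sed passes on 200 million rows is something I have not measured.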
- You can download the latest version of Apache Solr from the Solr website or find older releases in the Apache archives.
- solr-7.5.0.tgz is about 161MB in size.
$ pwd
$ ls ./solr-7.5.0.tgz
$ tar xf ./solr-7.5.0.tgz
$ ls ./solr-7.5.0
CHANGES.txt NOTICE.txt contrib example
LICENSE.txt README.txt dist licenses
LUCENE_CHANGES.txt bin docs server
We've got two choices here: a Solr Cloud or a standalone Solr instance. For the purposes of this article, choosing the cloud option is preferred but if you wish to experiment on an offline computer (e.g. without wifi in flight-safe mode) then only the standalone Solr instance might work.
$ pwd
$ ./solr-7.5.0/bin/solr start -m 16g -cloud -noprompt
$ ./solr-7.5.0/bin/solr create -c bee-hive -d _default -shards 1 -replicationFactor 1
$ pwd
$ ./solr-7.5.0/bin/solr start -m 16g
$ ./solr-7.5.0/bin/solr create_core -c bee-hive -d _default
The following commands can be used to delete an existing Solr collection and to stop the Solr instance when you're finished experimenting.
$ pwd
$ ./solr-7.5.0/bin/solr delete -c bee-hive
$ ./solr-7.5.0/bin/solr stop -all
The script is a simple wrapper around Solr's Post Tool.
$ pwd
$ ls ../data-sets/prepared-partial-dataset.csv
$ ls ./
$ ./ ../data-sets/prepared-partial-dataset.csv bee-hive
We have chosen the name bee-hive for our Solr collection and created it with the _default configset. This configset contains a number of default fieldType and dynamicField definitions. This means, for example, that simply the _i and _f endings of the bee_id_i and bee_id_confidence_f field names indicate that those field values are of integer and floating point type.
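To make the suffix convention concrete, here is a small Python sketch that guesses a field's type from its name. The suffix table reflects my understanding of the _default configset for the field names used in this article; please treat it as an assumption and check the collection's managed-schema for the authoritative list. The helper name guess_field_type is made up.

```python
# Suffix -> fieldType pairs as I understand the _default configset.
# This table is an assumption for illustration, not an exhaustive or verified list.
SUFFIX_TYPES = {
    "_i": "pint",
    "_f": "pfloat",
    "_s": "string",
    "_dt": "pdate",
    "_srpt": "location_rpt",
}


def guess_field_type(field_name):
    """Return the assumed fieldType for a dynamic field name, or None if unknown."""
    # Try longer suffixes first so that "_srpt" is checked before shorter ones.
    for suffix in sorted(SUFFIX_TYPES, key=len, reverse=True):
        if field_name.endswith(suffix):
            return SUFFIX_TYPES[suffix]
    return None


for name in ["bee_id_i", "bee_id_confidence_f", "cam_id_s", "timestamp_dt", "pos_srpt"]:
    print(name, "->", guess_field_type(name))
```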
The partial dataset of just 1 million rows can be indexed relatively quickly, about 2 minutes on my computer. Indexing the full 200 million rows dataset will take longer.
$ pwd
$ ls ../data-sets/prepared-dataset.csv
$ ls ./
$ ./ ../data-sets/prepared-dataset.csv bee-hive
You can see indexing progress via the http://localhost:8983/solr/bee-hive/select?q=*:*&rows=0 search in your browser or on the command line.
curl "http://localhost:8983/solr/bee-hive/select?q=*:*&rows=0"
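If you would rather watch progress from Python, the sketch below builds the same URL and reads the document count out of a response. The helper names are hypothetical and the sample response is hand-written (its numFound is a placeholder), so that the example runs without a Solr instance.

```python
import json
from urllib.parse import urlencode


def count_url(collection, base="http://localhost:8983/solr"):
    """Build a match-all query URL that returns no documents, only the count."""
    return f"{base}/{collection}/select?" + urlencode({"q": "*:*", "rows": 0})


def num_found(response_text):
    """Extract the number of matching documents from a Solr JSON response."""
    return json.loads(response_text)["response"]["numFound"]


print(count_url("bee-hive"))

# Hand-written stand-in for a real response; the numFound value is a placeholder.
sample_response = json.dumps(
    {"responseHeader": {"status": 0},
     "response": {"numFound": 1000000, "start": 0, "docs": []}}
)
print(num_found(sample_response))
```

Against a running instance, the response text could be fetched with urllib.request.urlopen(count_url("bee-hive")).read().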
In this section we explore the content of the dataset in a very broad sense using some of Apache Solr's faceting functionality.
The year 2015 is part of the BeesBook Recording Season 2015 Sample Release name and the Tracking All Members of a Honey Bee Colony Over Their Lifetime Using Learned Models of Correspondence article mentioned that the sample covers 3 days. Let's find out exactly which days we have data for.
# request
curl "http://localhost:8983/solr/bee-hive/select?q=*:*&rows=0&facet=true\
# response
# request
curl "http://localhost:8983/solr/bee-hive/select" -d 'q=*:*&rows=0&json.facet={
timestamp_by_day : {
type : range,
field : timestamp_dt,
mincount : 1,
gap : "%2B1DAY",
start : "2015-01-01T00:00:00.000Z",
end : "2015-12-31T23:59:59.999Z",
# response
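One detail in the JSON facet request is easy to miss: curl's -d option does not URL-encode the body, which is why the gap is written as %2B1DAY rather than +1DAY. A minimal Python sketch of building the same form body is shown here; urlencode takes care of the encoding. The function name is made up and the facet is the one from the request above.

```python
import json
from urllib.parse import urlencode


def day_facet_body(field="timestamp_dt"):
    """Build the form-encoded body for a per-day range facet over the year 2015."""
    facet = {
        "timestamp_by_day": {
            "type": "range",
            "field": field,
            "mincount": 1,
            "gap": "+1DAY",  # plain "+" here; urlencode turns it into %2B
            "start": "2015-01-01T00:00:00.000Z",
            "end": "2015-12-31T23:59:59.999Z",
        }
    }
    return urlencode({"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)})


body = day_facet_body()
print("%2B1DAY" in body)
# prints True
```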
The Tracking All Members of a Honey Bee Colony Over Their Lifetime Using Learned Models of Correspondence article mentioned that in total 2,775 bees were marked. Let's count how many bees we have data for in the dataset sample.
# request
curl "http://localhost:8983/solr/bee-hive/select" \
-d 'q=*:*&rows=0&json.facet={
most_seen_bee_ids : {
type : terms,
field : bee_id_i,
numBuckets : true,
sort : count,
limit : 3
# response
Hmm, clearly 4,096 is more than 2,775, how can that be? Let's restrict our search with a bee_id_confidence_f filter query and count again.
# request
curl "http://localhost:8983/solr/bee-hive/select?fq=bee_id_confidence_f:\[0.5+TO+1.0\]" \
-d 'q=*:*&rows=0&json.facet={
most_seen_bee_ids : {
type : terms,
field : bee_id_i,
numBuckets : true,
sort : count,
limit : 3
# response
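When reading such terms facet responses programmatically, the number of distinct bees is the numBuckets value and the most seen bees are in the buckets list. Here is a small Python sketch of pulling both out; the sample response is hand-written with placeholder values (not real results) and the function name is hypothetical.

```python
import json


def summarise_terms_facet(response_text, facet_name):
    """Return (number of distinct terms, [(term, count), ...]) for a JSON terms facet."""
    facet = json.loads(response_text)["facets"][facet_name]
    top = [(bucket["val"], bucket["count"]) for bucket in facet["buckets"]]
    return facet["numBuckets"], top


# Hand-written sample in the shape of a JSON Facet API response.
# Every number below is a placeholder, not a value from the dataset.
sample = json.dumps({
    "facets": {
        "count": 6,
        "most_seen_bee_ids": {
            "numBuckets": 3,
            "buckets": [
                {"val": 1, "count": 3},
                {"val": 2, "count": 2},
                {"val": 3, "count": 1},
            ],
        },
    }
})

distinct, top = summarise_terms_facet(sample, "most_seen_bee_ids")
print(distinct, top)
```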
The Tracking All Members of a Honey Bee Colony Over Their Lifetime Using Learned Models of Correspondence article mentioned that four cameras were used to record the bees in a one-frame observation hive. Let's find the (x,y) coordinate range for one of the cameras.
# request
curl "http://localhost:8983/solr/bee-hive/select" -d 'q=*:*&rows=0&json.facet={
cam_id_2_x_y : {
type : query,
q : "cam_id_s:2",
facet : {
min_x_pos_i : "min(x_pos_i)", max_x_pos_i : "max(x_pos_i)",
min_y_pos_i : "min(y_pos_i)", max_y_pos_i : "max(y_pos_i)"
# response
We indexed bee coordinates as individual x_pos_i and y_pos_i integer fields and also in a joint pos_srpt field of SpatialRecursivePrefixTreeFieldType type for Spatial Search.
Let's start to create a heatmap of bee positions via the addition of the facet.heatmap=pos_srpt search parameter
- in your browser e.g. http://localhost:8983/solr/bee-hive/select?q=cam_id_s:2&rows=0&facet=true&facet.heatmap=pos_srpt or
- on the command line.
curl "http://localhost:8983/solr/bee-hive/select?q=cam_id_s:2&rows=0&facet=true&facet.heatmap=pos_srpt\
As you can see, by default facet.heatmap=pos_srpt heatmap results are returned in counts_ints2D matrix form. However, if we also add the facet.heatmap.format=png search parameter
- in your browser e.g. http://localhost:8983/solr/bee-hive/select?q=cam_id_s:2&rows=0&facet=true&facet.heatmap=pos_srpt&facet.heatmap.format=png
then the heatmap results will be in counts_png string form.
A few lines of Python code can easily save that string as a .png format image file.
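import base64
fileName = ...     # path of the .png file to write
fileContent = ...  # the counts_png string from the Solr response
with open(fileName, "wb") as file:
    # counts_png is base64 text; decode it back into the raw image bytes
    file.write(base64.b64decode(fileContent))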
The small wrapper script can be used to generate different heatmaps.
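As a rough idea of what such a wrapper has to do, the Python sketch below digs the counts_png string out of a JSON response and decodes it. The response shape is an assumption on my part: I expect the heatmap under facet_counts and facet_heatmaps, rendered either as a flat name/value list or as an object, so the sketch accepts both. The function name is made up and the sample holds only the 8-byte PNG signature, not a real heatmap.

```python
import base64
import json


def heatmap_png_bytes(response_text, field="pos_srpt"):
    """Return the decoded PNG bytes of a heatmap facet from a Solr JSON response."""
    heatmap = json.loads(response_text)["facet_counts"]["facet_heatmaps"][field]
    if isinstance(heatmap, list):
        # Assumed flat form: [name, value, name, value, ...]
        heatmap = dict(zip(heatmap[0::2], heatmap[1::2]))
    return base64.b64decode(heatmap["counts_png"])


# Stand-in payload: just the PNG file signature, encoded the way Solr would encode an image.
signature = b"\x89PNG\r\n\x1a\n"
sample = json.dumps({
    "facet_counts": {
        "facet_heatmaps": {
            "pos_srpt": ["gridLevel", 2, "counts_png", base64.b64encode(signature).decode("ascii")]
        }
    }
})

print(heatmap_png_bytes(sample) == signature)
# prints True
```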
In this section we search for and find one or more (presumed to be) forager bees using some of Apache Solr's Streaming Expressions functionality.
(Streaming expressions are specific to Solr Cloud. If earlier on you chose to start a standalone Solr instance you can now skip ahead to the Mid-day bees section for a more manual approach of finding bees.)
# request
curl --data-urlencode 'expr=
q="timestamp_dt:[2015-09-08T00:00:00Z-1MINUTE TO 2015-09-08T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
' "http://localhost:8983/solr/bee-hive/stream"
# response
As mentioned in the search stream source documentation, the default qt query type is /select and with it the rows parameter is mandatory; the /export query type on the other hand always returns all rows.
# request
curl --data-urlencode 'expr=
q="timestamp_dt:[2015-09-08T00:00:00Z-1MINUTE TO 2015-09-08T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
' "http://localhost:8983/solr/bee-hive/stream"
# response
The dataset contains data from three days. The above expression finds data points around midnight between the first and the second day -- note that there is more than one data point for an individual bee.
We can use the unique stream decorator to reduce the many data points down to a list of individual bees seen around midnight.
# request
curl --data-urlencode 'expr=
q="timestamp_dt:[2015-09-08T00:00:00Z-1MINUTE TO 2015-09-08T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
' "http://localhost:8983/solr/bee-hive/stream"
# response
We can use the intersect stream decorator to find a list of individual bees seen on consecutive days around midnight.
# request
curl --data-urlencode 'expr=
q="timestamp_dt:[2015-09-08T00:00:00Z-1MINUTE TO 2015-09-08T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
q="timestamp_dt:[2015-09-09T00:00:00Z-1MINUTE TO 2015-09-09T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
' "http://localhost:8983/solr/bee-hive/stream"
# response
We can use the complement stream decorator to find a list of individual bees seen on consecutive days around midnight but not seen around noon (mid-day).
# request
curl --data-urlencode 'expr=
q="timestamp_dt:[2015-09-08T00:00:00Z-1MINUTE TO 2015-09-08T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
q="timestamp_dt:[2015-09-09T00:00:00Z-1MINUTE TO 2015-09-09T00:00:00Z+1MINUTE]",
sort="bee_id_i asc",
q="timestamp_dt:[2015-09-08T12:00:00Z-10MINUTE TO 2015-09-08T12:00:00Z+10MINUTE]",
sort="bee_id_i asc",
' "http://localhost:8983/solr/bee-hive/stream"
# response
Via the streaming expression above we have (with one /stream request in about two seconds) found the four bees not seen for (at least) the 20 minutes around noon on 2015-09-08 but previously and subsequently seen around midnight.
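Stripped of the streaming machinery, the logic is plain set arithmetic: intersect the two midnight lists, then take away everything seen around midday. A tiny Python sketch with placeholder bee ids (not values from the dataset) and a made-up function name:

```python
def home_at_midnight_but_away_at_midday(midnight_first, midnight_second, midday):
    """Bees in both midnight lists (intersect) minus those seen at midday (complement)."""
    seen_both_nights = set(midnight_first) & set(midnight_second)
    return sorted(seen_both_nights - set(midday))


# Placeholder ids purely for illustration.
print(home_at_midnight_but_away_at_midday([1, 2, 3, 4], [2, 3, 4, 5], [3]))
# prints [2, 4]
```

The streaming expression does the same thing without materialising the sets, which is why each input stream needs to be sorted on bee_id_i.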
Streaming expressions are specific to Solr Cloud. Here's another, more manual approach of searching for the same thing with multiple queries in a shell script.
$ ./
Searching for bees home at midnight on the 8th
Searching for bees home at midday on the 8th
Searching for bees home at midnight on the 9th
Calculating bees home at midnight on the 8th and 9th
Calculating bees home at midnight but away at midday
Displaying number of bees per file
655 midnight8.log
2204 midday8.log
484 midnight9.log
270 midnight.log
4 not-midday8.log
=== Bees home at midnight but away at midday ===
bee: 104
bee: 2483
bee: 3406
bee: 3477
In this section we use matplotlib.pyplot to visualise parts of the dataset, specifically focusing on four individuals i.e. the mid-day bees identified above.
The small wrapper script can be used to visualise the results of time range faceting queries: black bars correspond to times the bee was seen and white bars indicate times when the bee was not seen.
for bee_id_i in 104 2483 3406 3477
do
  ./ ./bee-${bee_id_i}-in-out-8.png "bee_id_i:${bee_id_i}" \
    --facet-range-start "2015-09-08T00:00:00.000Z" \
    --facet-range-end "2015-09-08T23:59:59.999Z"
done
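The step from facet counts to black and white bars is simple: a time bucket with at least one data point means the bee was seen. The Python sketch below shows that step in isolation. It assumes the range facet counts arrive as a flat list of alternating bucket start and count values; the function name and the sample counts are placeholders, not the wrapper script's actual code or real results.

```python
def seen_flags(counts):
    """Turn a flat [bucket, count, bucket, count, ...] list into (bucket, seen) pairs."""
    return [(bucket, count > 0) for bucket, count in zip(counts[0::2], counts[1::2])]


# Placeholder counts purely for illustration.
sample_counts = [
    "2015-09-08T00:00:00Z", 12,
    "2015-09-08T01:00:00Z", 0,
    "2015-09-08T02:00:00Z", 5,
]

for bucket, seen in seen_flags(sample_counts):
    print(bucket, "seen" if seen else "not seen")
```

Each (bucket, seen) pair then maps onto one bar of the plot: black for seen, white for not seen.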
Comparison of the four 'home or not?' visualisations clearly shows that bee 3477 is 'the odd one out' in our quartet of mid-day bees.
Earlier in this article we used terms faceting and (time) range faceting. Now let's combine them and see if nested faceting can help us to better understand the data.
# request
curl "http://localhost:8983/solr/bee-hive/select?fq=bee_id_i:(104+2483+3406+3477)" \
-d 'q=*:*&rows=0&json.facet={
most_seen_bee_ids : {
type : terms,
field : bee_id_i,
facet : {
bee_id_confidence_quartiles : {
type : range,
field : bee_id_confidence_f,
include : edge,
gap : "0.25",
start : "0.0",
end : "1.0",
# response
The nested faceting query results above clearly show
- that we have very few data points for the fourth bee ("val":3477), and
- that the bee id confidence values for those few data points are almost all in the lowest quartile.
So it's probably fair to assume that 'bee 3477' is not actually real and that a few times some other bee or bees were mistakenly considered to be 'bee 3477'.
The small wrapper script can be used to visualise the results of queries such as http://localhost:8983/solr/bee-hive/select?q=bee_id_i:104&fq=cam_id_s:1&fl=x_pos_i,y_pos_i,timestamp_dt&sort=timestamp_dt+asc&rows=10.
./ ./bee-104-cam-1-plot.png 'bee_id_i:104' 'cam_id_s:1' \
'timestamp_dt:[2015-09-08T12:00:00Z+TO+2015-09-08T12:00:00Z%2B30MINUTES]' \
--rows=207 --axis 0 4000 0 3000
./ ./bee-104-cam-1-plot-zoom.png 'bee_id_i:104' 'cam_id_s:1' \
'timestamp_dt:[2015-09-08T12:00:00Z+TO+2015-09-08T12:00:00Z%2B30MINUTES]' \
--rows=207 --axis 1800 4000 0 1650
Looking at the full scale plot on the left and then zooming in via the plot on the right it seems that 'bee 104' is perhaps running around in circles in the (2500,200) coordinate area.
The dataset includes an orientation column and via the --orientation orientation_f argument we can use the wrapper script to plot that column and see that bee 104's orientation does vary in the (-pi, +pi) range.
./ ./bee-104-cam-1-plot-angles.png 'bee_id_i:104' 'cam_id_s:1' \
'timestamp_dt:[2015-09-08T12:00:00Z+TO+2015-09-08T12:00:00Z%2B30MINUTES]' \
--rows=207 --axis 0 208 -4 +4 \
--orientation orientation_f
In this article I shared my first adventures exploring the BeesBook Recording Season 2015 Sample Release with the open source search platform Apache Solr and a small number of short Python scripts.
- The open access nature of the dataset and the availability of open source software make it possible for interested citizens to explore scientific datasets.
- Questions such as 'which bees were seen at this time but not at that time?' can be asked and answered with relatively little computer programming expertise and 'home or not?' visualisations can provide a condensed and motivating view of slices of the dataset.
- Full colony heatmaps as well as short plots of an individual bee's movements offer a brief taste of the fantastic scope and resolution of this BeesBook 2015 dataset.
And that then leaves just one last question, what to pursue as the next adventure: 2,775 marked bees, 200 million data points over 3 days -- what would you wanna ask those bees?