This repository provides a detailed description of the RealSmoking dataset, along with step-by-step instructions for replicating the results presented in Tang's paper published in the proceedings of PervasiveHealth '14.
Real-time, automatic detection of smoking behavior could lead to novel measurement tools for smoking research and “just-in-time” interventions that may help people quit, reducing preventable deaths. This paper discusses the use of machine learning with wrist accelerometer data for automatic puffing and smoking detection. A two-layer smoking detection model is proposed that incorporates both low-level time domain features and high-level smoking topography such as inter-puff intervals and puff frequency to detect puffing then smoking. On a pilot dataset of 6 individuals observed for 11.8 total hours in real-life settings performing complex tasks while smoking, the model obtains a cross validation F1-score of 0.70 for puffing detection and 0.79 for smoking detection over all participants, and a mean F1-score of 0.75 for puffing detection with user-specific training data. Unresolved challenges that must still be addressed in this activity detection domain are discussed.
Please cite the paper in any presentations or publications if you use this dataset or source code.
Tang, Q., Vidrine, D., Crowder, E., and Intille, S. 2014. Automated Detection of Puffing and Smoking with Wrist Accelerometers. In Proceedings of the 8th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth '14). ICST.
* Root Path
  * data
    * original
    * featureset (very large, ~3.0 GB)
    * pkl
  * src
  * results
  * supplements
    * publication_plots
    * consolidate_plots
    * posture_example_plots
    * puffing_example_plots
- `data/original` contains the raw sensor data and annotation files. These data may be downloaded here.
- `data/featureset` contains all generated feature sets used by the detection model. These data may be downloaded here. Note that this folder is not normally included in the version control repository, as its files are intermediate results and can be very large.
- `data/pkl` contains all binary intermediate files, which can be loaded with the Python serialization package pickle. These data may be downloaded here. Note that these intermediate files are the feature set files in binary format, roughly 50% smaller in size. They are not committed to the version control system.
- `src` contains the Python source code used in the paper.
- `results` stores all results generated by the source code in `.csv` format.
- `supplements` stores all plots.
  - `publication_plots` stores all plots used in the final publication.
  - `consolidate_plots` stores supplementary plots comparing model predictions against visual inspection, along with annotations.
  - `posture_example_plots` stores raw sensor data plots along with annotations for different posture examples.
  - `puffing_example_plots` stores raw sensor data plots along with annotations for different puffing examples.
Each session represents a single subject. There are seven sessions in total; please refer to Table 2 in Tang's paper for the statistical details of each session, and to `data/statistics_dataset` for the original results.
Inside the `data/original` folder you will find a list of `.csv` files. All files follow a common filename pattern that distinguishes the session, sensor location, and dataset type:
[session name]_[sensor location].[dataset_type].csv
For example, the file `session1_DAK.raw.csv` is the raw 3-axis accelerometer data file for session 1 on the dominant ankle, and the file `session2.annotation.csv` is the annotation file for session 2.
Note that there is only one annotation file per session, so the sensor location part is omitted for annotation files.
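For scripting convenience, the pattern can be split programmatically. Below is a small illustrative helper (not part of the repository's source code; the function name is made up for this example) that separates a filename into its session, sensor location, and dataset type parts.

```python
# Hypothetical helper: split a dataset filename into its parts.
# Annotation files such as "session2.annotation.csv" carry no sensor location.
def parse_filename(name: str):
    stem, dataset_type, _ext = name.rsplit(".", 2)
    if "_" in stem:
        session, location = stem.split("_", 1)
    else:
        session, location = stem, None
    return session, location, dataset_type

print(parse_filename("session1_DAK.raw.csv"))     # ('session1', 'DAK', 'raw')
print(parse_filename("session2.annotation.csv"))  # ('session2', None, 'annotation')
```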
There are four raw accelerometer data files for each session, one for each of the following sensor locations:
- dominant wrist (DW)
- dominant ankle (DAK)
- nondominant wrist (NDW)
- dominant arm (DAR)
Note that only the wrist accelerometers are used for the experiments reported in the paper; please refer to the paper for a detailed explanation.
The accelerometer used is the ActiGraph GT3X, a 3-axis linear accelerometer with a dynamic range of ±4g that stores values using 10 bits. Data are stored in `csv` format, with the first column being the Unix timestamp (in milliseconds) and columns 2 to 4 being the x, y, and z axes:
1319714602650,0.15625,-0.4375,-0.109375
There is no header row. The data have already been converted from voltage to the -4g to +4g range.
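As a quick-start sketch, a raw file can be loaded with pandas; since the files have no header, the column names used below are chosen purely for illustration.

```python
# A minimal sketch, assuming pandas is available; "timestamp", "x", "y", "z"
# are illustrative names, not defined by the dataset.
import pandas as pd

raw = pd.read_csv(
    "data/original/session1_DW.raw.csv",
    header=None,
    names=["timestamp", "x", "y", "z"],
)
# The first column is a Unix timestamp in milliseconds.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], unit="ms")
print(raw.head())
```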
The annotation file is a standard csv file with a header row describing the meaning of each column.
Column name | Meaning |
---|---|
STARTTIME | start time of current annotation |
ENDTIME | end time of current annotation |
COLOR | unused |
posture | annotation for body posture; can be walking, sitting, standing, or lying |
activity | annotation for the subject's activity |
smoking | annotation for smoking status; either smoking or not-smoking |
puffing | annotation for puffing status; can be no-puff, left-puff, or right-puff |
puff index | the index (starting from 0) of the current puff; if two annotations share the same puff index, they belong to the same puff. For a no-puff annotation the index is -1 |
prototypical? | a binary value (0 or 1) indicating whether this puff is a prototypical puff; for no-puff the value is -1 |
potential error? | a binary value indicating whether this annotation might be wrong due to human error |
note | additional comments on this annotation |
link | filename of the corresponding puff example plot in the puffing_example_plots folder |
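As an example of working with these columns, the snippet below reads one session's annotation file and counts distinct puffs via their shared puff index; the file path and the use of pandas are assumptions made only for this sketch.

```python
# A hedged sketch: count distinct puffs in one session's annotation file.
import pandas as pd

ann = pd.read_csv("data/original/session2.annotation.csv")

# Keep only rows annotated as puffs (left-puff or right-puff).
puffs = ann[ann["puffing"].isin(["left-puff", "right-puff"])]

# Annotations sharing a "puff index" belong to the same puff, so count unique indices.
print("distinct puffs:", puffs["puff index"].nunique())
```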
Inside the `data/featureset` and `data/pkl` folders you will find the feature set data files, which can also be reproduced by running the code. Files in the `data/featureset` folder are in `.csv` format and files in the `data/pkl` folder are in binary `.pkl` format. Both share the same filenames and contain the same content/values once loaded in the program.
[session name]_[feature set type].[window size].[data type].csv
[session name]_[feature set type].[window size].[data type].pkl
There are two data types: one for feature vectors (`.data.csv` or `.data.pkl`) and the other for class labels (`.class.csv` or `.class.pkl`). The two are associated with each other.
For example, `session1_BFW.40.data.csv` contains feature vectors for session 1, computed from data from both wrists with a window size of 40 samples (1 s at the 40 Hz sampling rate).
As another example, `session_all_DW.600.class.csv` contains class labels for all sessions; the corresponding `.data.csv` is computed from dominant wrist accelerometer data with a window size of 600 samples (15 s at the 40 Hz sampling rate).
The feature vector file has a header row. The first column is the index of the current segment (starting from 0), and the remaining columns are the features. The feature names are largely self-explanatory, or you may refer to the documentation in the source code.
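To load an associated pair of files, either format can be used; the sketch below unpickles the binary versions. The exact Python object types stored in the `.pkl` files are not documented here, so the snippet only loads and inspects them, and the filenames are simply the example above.

```python
# A minimal sketch of loading one feature-set/label pair from the .pkl files.
import pickle

with open("data/pkl/session1_BFW.40.data.pkl", "rb") as f:
    features = pickle.load(f)
with open("data/pkl/session1_BFW.40.class.pkl", "rb") as f:
    labels = pickle.load(f)

print(type(features), type(labels))
```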
The class label file also has a header row. It contains the following columns, not all of which are necessary; you may choose to use any of them.
Column name | Meaning |
---|---|
segment | index of the current segment (starting from 0) |
STARTTIME | start time of the current segment |
ENDTIME | end time of the current segment |
seg duration | the duration of the current segment in seconds |
session | the number of the session the current segment belongs to |
sensor | the sensor the current segment row belongs to (DW, NDW, or BFW); refer to footnote 5 for an explanation |
inside puff/puff duration | the duration of puffing that occurs within the current segment divided by the duration of the current puff |
inside puff/segment duration | the duration of puffing that occurs within the current segment divided by the duration of the current segment |
prototypical? | whether the puff is prototypical (0 or 1); an empty string if no-puff |
puff index | the index of the puff for the current segment from the current sensor |
puff side | left or right |
car percentage | the percentage of the current segment occupied by the in-car activity |
talk percentage | the percentage of the current segment occupied by the talking activity |
unknown percentage | the percentage of the current segment occupied by unknown activity |
drinking percentage | the percentage of the current segment occupied by the drinking-beverage activity |
reading percentage | the percentage of the current segment occupied by the reading-paper activity |
eating percentage | the percentage of the current segment occupied by the eating-a-meal activity |
smoking percentage | the percentage of the current segment occupied by the smoking activity |
computer percentage | the percentage of the current segment occupied by the using-computer activity |
phone percentage | the percentage of the current segment occupied by the using-phone activity |
walking percentage | the percentage of the current segment occupied by the walking posture |
superposition percentage | the percentage of the current segment during which activity superposition happens |
name | The name of the class assigned to this segment |
target | The index of the class assigned to this segment |
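To illustrate how a data/class pair can be used together, the sketch below fits a generic off-the-shelf classifier on the `target` column. This is only an illustration of the file layout, not the two-layer model described in the paper; the filenames, the scikit-learn classifier choice, and the cross-validation setup are assumptions made for this example.

```python
# A generic sketch, not the paper's method: train a classifier on one feature set.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = pd.read_csv("data/featureset/session1_BFW.40.data.csv")
labels = pd.read_csv("data/featureset/session1_BFW.40.class.csv")

X = features.iloc[:, 1:]   # drop the segment-index column; the rest are features
y = labels["target"]       # class index assigned to each segment

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean())
```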
Coming soon...
These videos were shot to show real-world cases of smoking behavior. Each video and the corresponding accelerometer signal are placed side by side to give a direct impression of how movement affects and interferes with the underlying signal.
Separable concurrent activities
In this video, the tester was asked to smoke while walking. As the video shows, the signal contains additive components from both hand movement (puffing) and body movement (walking). These two components are independent and additive because they are produced by different parts of the body. This observation offers some inspiration when dealing with concurrent activities.
Ambiguous hand gestures
In this video, the tester was asked to puff, eat, and drink during smoking in a natural way. The signal contains several rises and falls, but none of them shows characteristics distinctive of puffing alone. In fact, these activities are all hand-to-mouth gestures, and the differences between them are quite subtle from the signal's point of view. This exposes one of the challenges in activity recognition: classifying similar movements.
A comprehensive episode
This video shows a comprehensive episode of natural smoking behavior, including a series of complex activities occurring at the same time. The signals, shown on the right side, appear heavily intertwined and complex, lacking visibly distinctive characteristics for each type of activity. The activities also change relatively fast, which makes them even more difficult to capture and recognize in real time.
If you have any questions about the dataset or source code, please create a GitHub issue.