importing vandam data for checking CVC and AVC #211
-
Just to clarify, how did you conclude that all the sections come from _1?
-
For the record, here is one way of comparing the word counts produced by LENA with those from the manual transcriptions.

```python
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
from matplotlib import pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

# Load the dataset and its annotation index.
project = ChildProject('.')
am = AnnotationManager(project)
am.read()

# Keep only the time ranges covered by every annotation set.
intersection = AnnotationManager.intersection(am.annotations)
segments = am.get_collapsed_segments(intersection)
segments = segments[segments['speaker_type'] != 'CHI']  # smarter filter needed

# Sum word counts per clip and per annotation set.
words = segments.groupby(['position', 'set']).agg(
    words=('words', 'sum')
).reset_index()
words['set'] = words['set'].str.replace('cha/an1', 'an1')
words = words.pivot(index='position', columns='set', values='words')

# Regress the LENA (its) word counts on the manual (an1) word counts.
result = smf.ols(formula="its ~ an1", data=words).fit()
print(result.summary())

plt.scatter(words['an1'], words['its'])
plt.show()
```
-
We are trying to assess how accurate the CVC and AVC counts reported by LENA are in the VanDam corpora. We can get an idea by using the annotations and transcriptions in VanDam-5minute. This discussion documents how I did this.
Getting the current version of the database if any
EL1000 has a vandam corpus, and I checked some of the filenames, confirming that it is the same one. Therefore, I follow the instructions from the EL1000 README.
I've done all of that before, so I start by adapting the instructions to install the datasets.
I don't need to install the superdataset because I already have it, but my copy doesn't contain lyon (or many other datasets that are now available), probably because it is a bit old. So I navigate to it and update it as follows:
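The README commands themselves aren't quoted here; as a sketch, the update can also be run through DataLad's Python API (the path below is a placeholder, and newer DataLad versions spell the merge option as how='merge'):

```python
import datalad.api as dl

# Placeholder path to the local clone of the EL1000 superdataset.
superdataset = "/path/to/EL1000"

# Fetch and merge changes from the default sibling, recursing into the
# subdatasets that are already installed so their listings get refreshed too.
dl.update(dataset=superdataset, merge=True, recursive=True)
```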
Next, I navigate into vandam and set it up, as per the same instructions -- except that it looks like it's not yet in the superdataset!
I exited the superdataset and installed only vandam, following these instructions:
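The actual instructions aren't reproduced here either; a minimal sketch of installing a single subdataset with DataLad, with a placeholder source URL (the real address is the one given in the EL1000 README):

```python
import datalad.api as dl

# Placeholder URL: substitute the address given in the EL1000 README.
VANDAM_URL = "git@example.org:EL1000/vandam.git"

# Clone the vandam dataset on its own, outside the superdataset.
dl.install(path="vandam", source=VANDAM_URL)
```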
The contents & structure look ok. I read the warning in the instructions and checked, but there is no readme in vandam, so I think I'm ready to go.
Preparation work
I've retrieved the annotations from HomeBank, and noticed that:
Z:\Rembrandt\LENA\AR31\2010_03
Z:\Rembrandt\LENA\AR31\2010_03
Z:\Rembrandt\LENA\AR31\2010_03
correspond to:
AR31_021108a.cha
AR31_021108b.cha
AR31_021108c.cha
So what I did in the end is start from 0manifest2.xlsx and try to make it fit the annotation import CSV format recommendations by adding the columns that format requires (see the sketch below).
I also notice that in metadata/recordings.csv some files are split into _1 and _2. So I checked by hand whether this affects any of the files, and it does not, because all of the annotated sections come from _1.
So I modified the columns so that there is an "original_recording_filename" column holding the name without the _1 suffix, and a "recording_filename" column with the _1 suffix.
Finally, I saved this file as extra/0manifest2.csv.
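Roughly, the preparation step amounts to something like this (a sketch: the ChildProject annotation input columns should be double-checked against the documentation of the version in use, and the 0manifest2 column names on the right-hand side are hypothetical):

```python
import pandas as pd

# Start from the manifest shipped with the corpus (path assumed).
manifest = pd.read_excel("0manifest2.xlsx")

# Columns expected by ChildProject's annotation importation; names and units
# to be checked against the installed version of the package.
input_df = pd.DataFrame({
    "set": "cha",
    "format": "cha",
    "raw_filename": manifest["cha_filename"],   # hypothetical manifest column
    "time_seek": 0,
    "range_onset": manifest["clip_onset"],      # hypothetical manifest column
    "range_offset": manifest["clip_offset"],    # hypothetical manifest column
})

# recordings.csv splits some recordings into _1 and _2; all annotated sections
# come from _1, so keep the original name around and point recording_filename
# at the _1 file (the filename pattern is assumed).
input_df["original_recording_filename"] = manifest["recording"]
input_df["recording_filename"] = input_df["original_recording_filename"].str.replace(
    ".wav", "_1.wav", regex=False
)

input_df.to_csv("extra/0manifest2.csv", index=False)
```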
I can now adapt the instructions from the daylong vandam tutorial to import these annotations.
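The exact command from the tutorial isn't reproduced in this thread; roughly, the import step amounts to feeding the prepared CSV to AnnotationManager.import_annotations (a sketch, with the column names and units assumed above; the tutorial may use the CLI instead):

```python
import pandas as pd
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

project = ChildProject(".")
am = AnnotationManager(project)

# Feed the prepared manifest to the importer; it indexes each .cha file
# and fills in annotation_filename for the imported segments.
input_df = pd.read_csv("extra/0manifest2.csv")
am.import_annotations(input_df)
```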
That failed with the following error:
Why would it be looking for the children.csv metadata inside there?
AHA! Because of how the command is actually interpreted.
What probably happened is that, since the .cha files are not available yet, nothing got added, so annotation_filename was never created.
I'm not sure, so I'll first validate and push my changes, so that I can get help:
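Something along these lines (a sketch: I'm assuming project.validate() returns lists of error and warning messages, that the DataLad sibling is called origin, and the commit message is mine):

```python
import datalad.api as dl
from ChildProject.projects import ChildProject

# Validate the dataset so the messages can be shared when asking for help.
project = ChildProject(".")
errors, warnings = project.validate()  # return value assumed: (errors, warnings)
for message in errors + warnings:
    print(message)

# Record and publish the changes (sibling name "origin" is an assumption).
dl.save(dataset=".", message="prepare 0manifest2.csv for annotation import")
dl.push(dataset=".", to="origin")
```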
-- time spent: 2h