importing vandam data for checking CVC and AVC #211
-
Just to clarify, how did you conclude that all the sections come from _1?
-
For the record, here is one way of comparing the word counts produced by LENA with those from the manual transcriptions.

```python
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
from matplotlib import pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

# Load the dataset and its annotation index.
project = ChildProject('.')
am = AnnotationManager(project)
am.read()

# Keep only the time ranges covered by every annotation set.
intersection = AnnotationManager.intersection(am.annotations)
segments = am.get_collapsed_segments(intersection)
segments = segments[segments['speaker_type'] != 'CHI']  # smarter filter needed

# Sum word counts per clip and per annotation set.
words = segments.groupby(['position', 'set']).agg(
    words=('words', 'sum')
).reset_index()
words['set'] = words['set'].str.replace('cha/an1', 'an1')
words = words.pivot(index='position', columns='set', values='words')

# Regress the LENA (its) word counts on the manual (an1) word counts.
result = smf.ols(formula="its ~ an1", data=words).fit()
print(result.summary())

plt.scatter(words['an1'], words['its'])
plt.show()
```
-
We are trying to assess how accurate the CVC and AVC counts reported by LENA are in the VanDam corpora. We can get an idea by using the annotations and transcriptions in VanDam-5minute. This discussion documents how I did this.
Getting the current version of the database if any
EL1000 has a vandam corpus, and I checked some of the filenames, confirming that it is the same one. Therefore, I follow the instructions from the EL1000 README.
I've done all of that before, so I start by adapting the instructions to install the datasets.
I don't need to install the superdataset because I already have it, but my copy doesn't contain lyon (or many other datasets that are now available), probably because it is a bit old. So I navigate to it and update it as follows:
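The README commands themselves aren't quoted here; as a sketch, the update can also be run through DataLad's Python API (the path below is a placeholder, and newer DataLad versions spell the merge option as how='merge'):

```python
import datalad.api as dl

# Placeholder path to the local clone of the EL1000 superdataset.
superdataset = "/path/to/EL1000"

# Fetch and merge changes from the default sibling, recursing into the
# subdatasets that are already installed so their listings get refreshed too.
dl.update(dataset=superdataset, merge=True, recursive=True)
```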
Next, I navigate into vandam and set it up, as per the same instructions -- except that it looks like it's not yet in the superdataset!
I exited the superdataset and installed only vandam, following these instructions:
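The actual instructions aren't reproduced here either; a minimal sketch of installing a single subdataset with DataLad, with a placeholder source URL (the real address is the one given in the EL1000 README):

```python
import datalad.api as dl

# Placeholder URL: substitute the address given in the EL1000 README.
VANDAM_URL = "git@example.org:EL1000/vandam.git"

# Clone the vandam dataset on its own, outside the superdataset.
dl.install(path="vandam", source=VANDAM_URL)
```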
The contents & structure look ok. I read the warning in the instructions and checked, but there is no readme in vandam, so I think I'm ready to go.
Preparation work
I've retrieved the annotations from HomeBank, and noticed that:
Z:\Rembrandt\LENA\AR31\2010_03
Z:\Rembrandt\LENA\AR31\2010_03
Z:\Rembrandt\LENA\AR31\2010_03
correspond to:
AR31_021108a.cha
AR31_021108b.cha
AR31_021108c.cha
So what I did in the end is start from 0manifest2.xlsx and try to make it fit the annotation import CSV format recommendations by adding the columns that format requires (see the sketch below).
I also notice that in metadata/recordings.csv some files are split into _1 and _2. So I checked by hand whether this affects any of the files, and it does not, because all of the annotated sections come from _1.
So I modified the columns so that there is an "original_recording_filename" column holding the name without the _1 suffix, and a "recording_filename" column with the _1 suffix.
Finally, I saved this file as extra/0manifest2.csv.
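Roughly, the preparation step amounts to something like this (a sketch: the ChildProject annotation input columns should be double-checked against the documentation of the version in use, and the 0manifest2 column names on the right-hand side are hypothetical):

```python
import pandas as pd

# Start from the manifest shipped with the corpus (path assumed).
manifest = pd.read_excel("0manifest2.xlsx")

# Columns expected by ChildProject's annotation importation; names and units
# to be checked against the installed version of the package.
input_df = pd.DataFrame({
    "set": "cha",
    "format": "cha",
    "raw_filename": manifest["cha_filename"],   # hypothetical manifest column
    "time_seek": 0,
    "range_onset": manifest["clip_onset"],      # hypothetical manifest column
    "range_offset": manifest["clip_offset"],    # hypothetical manifest column
})

# recordings.csv splits some recordings into _1 and _2; all annotated sections
# come from _1, so keep the original name around and point recording_filename
# at the _1 file (the filename pattern is assumed).
input_df["original_recording_filename"] = manifest["recording"]
input_df["recording_filename"] = input_df["original_recording_filename"].str.replace(
    ".wav", "_1.wav", regex=False
)

input_df.to_csv("extra/0manifest2.csv", index=False)
```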
I can now adapt the instructions from the daylong vandam tutorial to import these annotations.
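The exact command from the tutorial isn't reproduced in this thread; roughly, the import step amounts to feeding the prepared CSV to AnnotationManager.import_annotations (a sketch, with the column names and units assumed above; the tutorial may use the CLI instead):

```python
import pandas as pd
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

project = ChildProject(".")
am = AnnotationManager(project)

# Feed the prepared manifest to the importer; it indexes each .cha file
# and fills in annotation_filename for the imported segments.
input_df = pd.read_csv("extra/0manifest2.csv")
am.import_annotations(input_df)
```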
That failed with the following error:
Why would it be looking for the children.csv metadata inside there?
AHA! Because of how the command is actually interpreted.
What probably happened is that, since the .cha files are not available yet, nothing got added, so annotation_filename was never created.
I'm not sure, so I'll first validate and push my changes, so that I can get help:
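Something along these lines (a sketch: I'm assuming project.validate() returns lists of error and warning messages, that the DataLad sibling is called origin, and the commit message is mine):

```python
import datalad.api as dl
from ChildProject.projects import ChildProject

# Validate the dataset so the messages can be shared when asking for help.
project = ChildProject(".")
errors, warnings = project.validate()  # return value assumed: (errors, warnings)
for message in errors + warnings:
    print(message)

# Record and publish the changes (sibling name "origin" is an assumption).
dl.save(dataset=".", message="prepare 0manifest2.csv for annotation import")
dl.push(dataset=".", to="origin")
```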
-- time spent: 2h