--cv failed #21

Open
joqb opened this issue May 26, 2015 · 8 comments


@joqb

joqb commented May 26, 2015

Hi there,

I'm trying fastStructure on a dataset with relatively few individuals (25) but a large number of markers (10,000 SNPs from GBS), in .str format.
When I ran it with --cv=5, thinking it would be equivalent to running replicates in regular Structure, I only got "Failed" printed repeatedly to the screen while the program kept running.
When I tried the same with the test data, it worked fine.
Running on my data without --cv also works fine, but it is surprisingly fast with the simple prior (4 seconds, which leaves me wondering...), whereas with the logistic prior it is much slower (the log file was not updated in an hour...).

Any suggestions?

Thanks,
Nath

@LaureneAlicia

Hi Nath,

I ran into the exact same problem as you while using fastStructure. I have a dataset with 48 individuals and 800 SNPs in a .str file. When I use the --cv option, I get a "Failed" message, and without it the run takes only 2-3 seconds. Did you ever find out what the issue was?

Thanks,
Laurène

@rajanil
Owner

rajanil commented Apr 27, 2016

Hi Laurene,
The --cv option makes the software run slower (e.g., --cv=5 makes it run about 5 times slower, since it runs 5-fold cross-validation and reports ancestry proportions obtained by aggregating these 5 runs).
However, I have not encountered the "Failed" error message before. Could you please copy-paste the error or provide a screenshot of it? If you could share the dataset so I can replicate and fix the error, that would be really helpful!

thanks!
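
For readers unfamiliar with the term, k-fold cross-validation over genotype entries can be sketched roughly as below. This is a generic illustration and not fastStructure's actual code: the function name `k_fold_entry_masks` is hypothetical, and holding out individual (individual, locus) entries at random is an assumption. The model would be fitted once per fold, which is where the roughly k-fold slowdown comes from.

```python
import random


def k_fold_entry_masks(n_individuals, n_loci, k, seed=0):
    """Split all (individual, locus) entries into k disjoint held-out sets.

    Hypothetical helper for illustration only; not part of fastStructure.
    """
    entries = [(i, l) for i in range(n_individuals) for l in range(n_loci)]
    rng = random.Random(seed)
    rng.shuffle(entries)
    # Fold j takes every k-th shuffled entry, starting at position j.
    return [set(entries[j::k]) for j in range(k)]


# Dataset size taken from the thread: 48 individuals, 800 SNPs, --cv=3.
masks = k_fold_entry_masks(n_individuals=48, n_loci=800, k=3)
for fold, mask in enumerate(masks):
    print(f"fold {fold}: {len(mask)} held-out genotypes")
```

Each fold hides a different third of the genotype matrix, so every entry is held out exactly once across the three fits.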

@LaureneAlicia

Hi Anil,

Thank you very much for your answer!

Since the software takes only 2 or 3 seconds per K on my dataset (48 individuals, 800 SNPs), it would be no problem if the --cv option made it run several times slower. My understanding is that this option sets the number of replicates for each K; correct me if I'm wrong. The runs produce the same results (same output files) with and without the --cv option, except that with it the last line of the .log file says "CV error = 0.2362436, 0.0097023" and the terminal shows several "Failed" messages:
python ./structure.py -K 2 --input=structure --output=output/test --cv=3 --full --format=str
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed

Other people have reported the same problem before, but I haven't seen any explanation or solution so far:
https://groups.google.com/forum/#!search/faststructure$20cv/structure-software/cXyfoWXsOe4/Mix0Fo4nDAAJ

I carried out the analysis (without the --cv option) using chooseK.py and distruct.py, and the final plot gives meaningful results that are nearly identical to those I got from the classic Structure software. fastStructure is much faster (which is the whole point), but I would like to have replicates for each K (as in Structure), which chooseK.py could then use to choose K more reliably.

I attached my input file so that you can have a look at the issue (I had to zip it, since GitHub wouldn't accept a file with a .str extension):
structure.str.zip

Thank you very much for your time,
Laurène

@elinck

elinck commented Jul 7, 2016

Hi Anil (and others),

I encountered the same error today using fastStructure v1.0 and the following command:

python /home/elinck/bin/fastStructure/structure.py -K 2 --input /home/elinck/atlapetes/atlapetes --output /home/elinck/atlapetes/atlapetes_output --format str --cv 3

My .str file is zipped and attached. Curious if you ever figured out what was causing the issue. Thanks in advance!

atlapetes.str.zip

@atcg

atcg commented Jul 13, 2016

I'm also getting these errors. It looks like they could come from lines 293-305 of fastStructure.pyx:

        # test to ensure that for all partitions, the loci are all variant
        newmasks = []
        for mask in masks:
            G = Gtrue.copy()
            Gmask = -1*np.ones((N,L), dtype='int8')
            Gmask[mask[0],mask[1]] = G[mask[0],mask[1]]
            G[mask[0],mask[1]] = 3
            if not (((G==1)+(G==2)).sum(0)==0).any():
                newmasks.append(mask)

        if not len(newmasks)>=cv:
            wellmasked = False
            print "Failed"

I do not have any invariant columns in my dataset, and I get the error even if I remove all tri-allelic sites from my input. I'm calling fastStructure as follows:

python fastStructure/structure.py -K 2 --input=inputFile --output=outputFile --cv=5 --format=str

I'm using Ubuntu 14.04.04 LTS, 64 bit.
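
The check quoted above can be reproduced in isolation to see when a mask is rejected. The sketch below uses plain Python lists instead of NumPy so that it runs without dependencies; the function name `mask_is_acceptable` and the toy data are hypothetical, and the coding (0/1/2 for genotypes, 3 for a held-out entry) is inferred from the quoted snippet rather than confirmed by the maintainer. It mirrors the quoted condition: a mask passes only if every locus still has at least one unmasked genotype coded 1 or 2.

```python
def mask_is_acceptable(genotypes, held_out):
    """Return True if every locus keeps at least one genotype coded 1 or 2.

    genotypes: list of rows (one per individual) of 0/1/2 codes.
    held_out:  set of (individual, locus) pairs hidden by the mask.
    Hypothetical re-statement of the quoted check, for illustration only.
    """
    n_loci = len(genotypes[0])
    for locus in range(n_loci):
        carriers = sum(
            1
            for ind, row in enumerate(genotypes)
            if (ind, locus) not in held_out and row[locus] in (1, 2)
        )
        if carriers == 0:
            # Equivalent to ((G==1)+(G==2)).sum(0)==0 being true for this locus.
            return False
    return True


# Toy data: 4 individuals, 2 loci. Locus 1 has a single carrier (individual 2).
genotypes = [
    [1, 0],
    [2, 0],
    [0, 1],
    [0, 0],
]

print(mask_is_acceptable(genotypes, {(0, 0)}))  # True: locus 0 still has a carrier
print(mask_is_acceptable(genotypes, {(2, 1)}))  # False: the only carrier at locus 1 is hidden
```

One possible reading, offered as a guess only: a locus whose non-reference allele is carried by very few individuals can lose all of its carriers to a single mask, so such a locus could trigger "Failed" even though the full dataset has no invariant columns.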

@atcg

atcg commented Aug 6, 2016

I can confirm that I no longer get these errors if I convert my data to PLINK .bed format and remove any sites with more than 90% missing data or a minor allele frequency below 1% (that is, an allele frequency above 99% or below 1%).
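
A filter of this kind can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure actually used in this thread: the function name `filter_loci` is hypothetical, genotypes are assumed to be coded as alternate-allele counts 0/1/2 with 3 meaning missing, and the thresholds are the ones mentioned above.

```python
def filter_loci(genotypes, max_missing=0.9, min_maf=0.01, missing=3):
    """Return the indices of loci that pass the missingness and MAF thresholds.

    genotypes: list of rows (one per individual); values 0/1/2 are
               alternate-allele counts and `missing` marks a missing call.
    Hypothetical helper for illustration only.
    """
    n_individuals = len(genotypes)
    keep = []
    for locus in range(len(genotypes[0])):
        calls = [row[locus] for row in genotypes if row[locus] != missing]
        missing_rate = 1 - len(calls) / n_individuals
        if not calls or missing_rate > max_missing:
            continue  # too little data at this site
        alt_freq = sum(calls) / (2 * len(calls))
        maf = min(alt_freq, 1 - alt_freq)
        if maf >= min_maf:
            keep.append(locus)
    return keep


# Toy data: locus 0 is invariant, locus 2 is entirely missing.
toy = [
    [0, 0, 3, 1],
    [0, 1, 3, 1],
    [0, 2, 3, 0],
]
print(filter_loci(toy))  # [1, 3]
```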

@xiekunwhy

Hi @atcg,
I get the same error even when I use PLINK .bed format as input!
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed

@vmkalbskopf

Has this been addressed? I am running into the same error. I'm using a PLINK .bed file as the input file.
