Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Problem opening file: '-fold'] --> angst-wrapper SFS ./Site_Frequency_Spectrum_Config in online tutorial #11

Open
squisquater opened this issue Nov 1, 2021 · 4 comments

Comments

@squisquater
Copy link

I followed the installation instructions and the tutorial instructions but am running into an error when I try to run the Site Frequency Spectrum.

angsd-wrapper SFS ./Site_Frequency_Spectrum_Config

This is my output and it appears it's failing when trying to find the file needed to fold (or not fold) the spectrum, but it can't.

WRAPPER: Zipping advanced arguments onto basic ones

        -> angsd version: 0.911-44-g1c0ebb6 (htslib: 1.3.1-30-gbb03b02) build(Oct 31 2021 11:04:52)
        -> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
        -> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
        -> (Using Filipe G Vieira modification of: abcSaf.cpp)
        -> Parsing 11 number of samples
        -> Region lookup 1/1

        -> We have now allocated approximately 10 Megabytes of raw nodes to the nodepool
        -> Printing at chr: 10 pos:17551496 chunknumber 1100
        -> We have now allocated approximately 20 Megabytes of raw nodes to the nodepool
        -> Printing at chr: 10 pos:19386992 chunknumber 2000 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
        -> Printing at chr: 10 pos:22395913 chunknumber 3200 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
        -> Printing at chr: 10 pos:24004662 chunknumber 3600 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
        -> Printing at chr: 10 pos:24908040 chunknumber 4000
        -> Done reading data waiting for calculations to finish
        -> Done waiting for threads

        -> npools:26 unfreed tnodes before clean:0
        -> Output filenames:
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.arg"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.mafs.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.geno.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx"
        -> Sun Oct 31 12:08:56 2021
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  199.08 sec
        [ALL done] walltime used =  130.00 sec
        -> Version of fname:/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx is:2
        -> Assuming .saf.gz file: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz
        -> Assuming .saf.pos.gz: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz
        -> Problem opening file: '-fold'

Looking at the wrapper shell script (Site_Frequency_Spectrum.sh) it appears that is failing in the final section of the script in the middle of a series of pipes to the final file which does get output in my scratch directory, it's just empty.

#!/usr/bin/env bash

set -e
set -o pipefail

#   Load variables from supplied config file
source "$1"

#   Are we using Common_Config? If so, source it
if [[ -f "${COMMON}" ]]
then
    source "${COMMON}"
fi

#   Where is angsd-wrapper located?
SOURCE=$2

#   Where is ANGSD?
ANGSD_DIR=${SOURCE}/dependencies/angsd

#   Variables created from transforming other variables
#       The number of individuals in the taxon we are analyzing
N_IND=$(wc -l < "${SAMPLE_LIST}")
#       How many inbreeding coefficients are supplied?
N_F=$(wc -l < "${SAMPLE_INBREEDING}")
#       For ANGSD, the actual sample size is twice the number of individuals, since each individual has two chromosomes.
#       The individual inbreeding coefficents take care of the mismatch between these two numbers

#   Perform a check to see if number of individuals matches number of inbreeding coefficients
if [ "${N_IND}" -ne "${N_F}" ]
then
    echo "Mismatch between number of samples in ${SAMPLE_LIST} and ${SAMPLE_INBREEDING}"
    exit 1
fi

#   Check to see if ancestral state is supplied: If not, polarize samples using
#   the reference sequence and generate folded saf.
if [ ! -f "${ANC_SEQ}" ]
then
    echo "Ancestral state data not found, using reference sequence to polarize alignment data. BAQ will likewise not be calculated."
    if [ ! -f "${REF_SEQ}" ]
    then
        echo "No reference sequence supplied, unable to perform calculations."
        exit 2
    else
        ANC_SEQ=$REF_SEQ
        REF_SEQ=
        BAQ=0
        FOLD=1
    fi
else
    FOLD=0
fi

#   Create outdirectory
OUT="${SCRATCH}"/"${PROJECT}"/SFS
mkdir -p "${OUT}"

#   Now we actually run the command, this creates a binary file that contains the prior SFS
if [[ -f "${OUT}"/"${PROJECT}"_SFSOut.mafs.gz ]] && [ "$OVERRIDE" = "false" ]
then
    echo "WRAPPER:maf already exists and OVERRIDE=false, skipping angsd -bam..."
else
    #   Do we have a regions file?
    if [[ -f "${REGIONS}" ]]
    then
	WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}"\
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -rf "${REGIONS}" \
            -doPost "${DO_POST}")
    #   Are we missing a definiton for regions?
    elif [[ -z "${REGIONS}" ]]
    then
	WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}"\
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}")
    #   Assuming a single region was defined in config file
    else
	WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -folded "${FOLD}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}" \
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}" \
            -r "${REGIONS}")
    fi
fi
# Check for advanced arguments, and overwrite any overlapping definitions
FINAL_ARGS=($(source "${SOURCE}/Wrappers/Arg_Zipper.sh" "${WRAPPER_ARGS}" "${ADVANCED_ARGS}"))
# DEBUGGING
# echo "Wrapper arguments: ${WRAPPER_ARGS}" 1<&2
# echo -e "Final arguments:" ${FINAL_ARGS} 1<&2

"${ANGSD_DIR}"/angsd "${FINAL_ARGS[@]}"

"${ANGSD_DIR}"/misc/realSFS \
    "${OUT}"/"${PROJECT}"_SFSOut.saf.idx \
    -P "${N_CORES}" \
    -fold "${FOLD}" \
    > "${OUT}"/"${PROJECT}"_DerivedSFS.graph.me`

I can also include my configuration file if helpful (Site_Frequency_Spectrum_Config) which also directs the script to another configuration file in the same directory (Common_Config), but I'm wondering whether anyone else has run into this error while trying to move through this tutorial before. I am trying to figure out if this is a file path issue or if the SFS is not running correctly and there is some other error in the output file I am not identifying correctly.

@squisquater
Copy link
Author

UPDATE: I think there is something wrong with my Maize_SFSOut.saf.idx file but I'm not sure what the issue is. Specifically "Only read nSites: 0 will therefore prepare next chromosome (or exit)"

I found some related threads that suggest filtering is too stringent but since this is all based on the angsd-wrapper tutorial that doesn't seem the likely culprit.

/mnt/steelhead/remote/Sophie/scratch/Maize/SFS$ /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/dependencies/angsd/misc/realSFS ./Maize_SFSOut.saf.idx
        -> Version of fname:./Maize_SFSOut.saf.idx is:2
        -> Assuming .saf.gz file: ./Maize_SFSOut.saf.gz
        -> Assuming .saf.pos.gz: ./Maize_SFSOut.saf.pos.gz
        -> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:./Maize_SFSOut.saf.idx fstout:(null) oldout:0 seed:1635892830 bootstrap:0
        -> You are printing the optimized SFS to the terminal consider dumping into a file
        -> E.g.: './realSFS ./Maize_SFSOut.saf.idx >sfs.ml.txt'
        -> nSites: 873481
        -> The choice of -nSites will require atleast: 83.301636 megabyte memory, that is at least: 1.05% of total memory
        -> dim(./Maize_SFSOut.saf.idx):23
        -> Dimension of parameter space: 23
        -> Done reading data from chromosome will prepare next chromosome
        -> Is in multi sfs, will now read data from chr:10
        -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
        -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
        -> Will run optimization on nSites: 873481
------------
startlik=-3393452.567606
lik[2]=-225047.521432 diff=3.168405e+06 alpha:1.000000 sr2:9.277772e-01
lik[5]=-225010.243117 diff=3.727831e+01 alpha:1.284184 sr2:1.180093e-07
lik[8]=-225008.303606 diff=1.939511e+00 alpha:3.949081 sr2:1.254134e-09
lik[11]=-225008.300234 diff=3.372127e-03 alpha:1.904168 sr2:4.637274e-12
        -> Breaking EM(sr2) at iter:12, sqrt(sr2):2.292394e-07
likelihood: -225008.300234
------------
837794.500918 0.000000 5871.665062 0.000000 3235.764685 0.000000 1832.887050 0.000000 1379.652548 0.000000 1391.708114 0.000000 1219.220259 0.000000 938.974705 0.000000 881.494241 0.000000 1172.368006 0.000000 1678.848464 0.000000 16083.915949
        -> Only read nSites: 0 will therefore prepare next chromosome (or exit)

        -> NB NB output is no longer log probs of the frequency spectrum!
        -> Output is now simply the expected values!
        -> You can convert to the old format simply with log(norm(x))

@carte731
Copy link
Member

I'll look into it - Your config file might help with troubleshooting.

@jrmandel
Copy link

jrmandel commented Mar 10, 2022

Hi all, I also have the same issue with my own data: Problem opening file: '-fold' after running SFS.

My _SFSOut.saf.idx file also contains:

-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
	-> NB NB output is no longer log probs of the frequency spectrum!
	-> Output is now simply the expected values! 
	-> You can convert to the old format simply with log(norm(x))

@wul88
Copy link

wul88 commented Sep 14, 2022

I also encounter the same problem when running the tutorial (Example_Data/Maize). Here is output message:
angsd-wrapper running from /gpfs/group/restricted/drc9/restricted/Lan/dotfiles/angsd-wrapper


WRAPPER: Zipping advanced arguments onto basic ones

-> angsd version: 0.911-44-g1c0ebb6 (htslib: 1.3.1-30-gbb03b02) build(Sep 14 2022 09:44:04)
-> Reading fasta: /gpfs/group/restricted/drc9/restricted/Lan/dotfiles/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
-> Reading fasta: /gpfs/group/restricted/drc9/restricted/Lan/dotfiles/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
-> (Using Filipe G Vieira modification of: abcSaf.cpp)
-> Parsing 11 number of samples
-> Region lookup 1/1

-> We have now allocated approximately 10 Megabytes of raw nodes to the nodepool

-> We have now allocated approximately 20 Megabytes of raw nodes to the nodepool
-> Printing at chr: 10 pos:15302669 chunknumber 100
-> Printing at chr: 10 pos:15780341 chunknumber 200
-> Printing at chr: 10 pos:16322598 chunknumber 300
-> Printing at chr: 10 pos:16492300 chunknumber 400
-> Printing at chr: 10 pos:16660947 chunknumber 500
-> Printing at chr: 10 pos:16771873 chunknumber 600
-> Printing at chr: 10 pos:16943362 chunknumber 700
-> Printing at chr: 10 pos:17114919 chunknumber 800
-> Printing at chr: 10 pos:17264910 chunknumber 900
-> Printing at chr: 10 pos:17405281 chunknumber 1000
-> Printing at chr: 10 pos:17551496 chunknumber 1100
-> Printing at chr: 10 pos:17759315 chunknumber 1200
-> Printing at chr: 10 pos:17906277 chunknumber 1300
-> Printing at chr: 10 pos:18074909 chunknumber 1400
-> Printing at chr: 10 pos:18231910 chunknumber 1500
-> Printing at chr: 10 pos:18354120 chunknumber 1600
-> Printing at chr: 10 pos:18713498 chunknumber 1700
-> Printing at chr: 10 pos:18966706 chunknumber 1800
-> Printing at chr: 10 pos:19170751 chunknumber 1900
-> Printing at chr: 10 pos:19386992 chunknumber 2000

[emFrequency_F] caught nan will not exit
logLike (3nInd). nInd=11
keepList (nInd)
used logLike (3
length(keep))=11
-> Printing at chr: 10 pos:19526855 chunknumber 2100
-> Printing at chr: 10 pos:19622754 chunknumber 2200
-> Printing at chr: 10 pos:19787284 chunknumber 2300
-> Printing at chr: 10 pos:19985193 chunknumber 2400
-> Printing at chr: 10 pos:20417329 chunknumber 2500
-> Printing at chr: 10 pos:20706518 chunknumber 2600
-> Printing at chr: 10 pos:20898012 chunknumber 2700
-> Printing at chr: 10 pos:21081629 chunknumber 2800
-> Printing at chr: 10 pos:21574904 chunknumber 2900
-> Printing at chr: 10 pos:21732294 chunknumber 3000
-> Printing at chr: 10 pos:22185942 chunknumber 3100
-> Printing at chr: 10 pos:22395913 chunknumber 3200
[emFrequency_F] caught nan will not exit
logLike (3nInd). nInd=11
keepList (nInd)
used logLike (3
length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3nInd). nInd=11
keepList (nInd)
used logLike (3
length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3nInd). nInd=11
keepList (nInd)
used logLike (3
length(keep))=10
-> Printing at chr: 10 pos:22763312 chunknumber 3300
-> Printing at chr: 10 pos:23074752 chunknumber 3400
-> Printing at chr: 10 pos:23410690 chunknumber 3500
-> Printing at chr: 10 pos:24004662 chunknumber 3600
[emFrequency_F] caught nan will not exit
logLike (3nInd). nInd=11
keepList (nInd)
used logLike (3
length(keep))=11
-> Printing at chr: 10 pos:24104091 chunknumber 3700
-> Printing at chr: 10 pos:24449175 chunknumber 3800
-> Printing at chr: 10 pos:24798701 chunknumber 3900
-> Printing at chr: 10 pos:24908040 chunknumber 4000

-> Done reading data waiting for calculations to finish
-> Done waiting for threads

-> npools:27 unfreed tnodes before clean:0
-> Output filenames:
	->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.arg"
	->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.mafs.gz"
	->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.geno.gz"
	->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.gz"
	->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.pos.gz"
	->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.idx"
-> Wed Sep 14 17:12:37 2022
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used =  344.06 sec
[ALL done] walltime used =  217.00 sec
-> Version of fname:/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.idx is:2
-> Assuming .saf.gz file: /gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.gz
-> Assuming .saf.pos.gz: /gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.pos.gz
-> Problem opening file: '-fold'

SFS/Maize_SFSOut.saf.idx is not right. It only contains one line as follows:
safv3

SFS/Maize_DerivedSFS.graph.me is of size of 0.
I don't know how to check the other binary files.

Has anyone resolved this problem yet? Please help.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants