QE Blow Assertion #9

Open
BradyTaylor1996 opened this issue Jun 22, 2022 · 4 comments

Comments

BradyTaylor1996 commented Jun 22, 2022

I'm attempting to look at the effects of certain hardware parameters (cellBit, ADCPrecision, etc.) on accuracy and energy. I set "--inference 1" on a relatively unchanged clone of the repository and my GPU ran out of memory. After reducing the size of the layers but otherwise leaving everything generally unchanged (aside from fixing a few errors), I keep getting a "QE Blow" assertion error. I've used print statements to find that the assertion error occurs during the second run of "backward" for WAGERounding. Changing grad_scale hasn't helped, nor has adjusting the network architecture. Adding a small value to "x" (since it is zero) doesn't help either. Is there a possible explanation for why this error is occurring?
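For reference, the assertion that fires appears to be in QE() in wage_quantizer.py; as far as I can tell it looks roughly like the sketch below (paraphrased rather than copied verbatim; shift, C, and Q are helpers defined in the same file):

def QE(x, bits):
    max_entry = x.abs().max()
    assert max_entry != 0, "QE blow"   # fires when the incoming gradient tensor is all zeros
    x /= shift(max_entry)              # normalize by a power-of-two scale
    return Q(C(x, bits), bits)         # clip and quantize to the target bit width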

BradyTaylor1996 (Author)

I have an update, but the problem remains. The output of the network is all NaNs. I can change the activation functions to tanh() to get better network outputs, but during the backward pass the max_entry when QE is called is still always 0. I suspect this may be a quantization error, but I can't pinpoint where it is occurring.
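In case it helps with debugging, here is a generic PyTorch sketch (not from this repo) that registers backward hooks on every module and reports where gradients first become NaN or collapse to zero; it needs PyTorch 1.8+ for register_full_backward_hook:

import torch

def attach_grad_probes(model):
    # Report any module whose outgoing gradient is NaN or all zeros during backward.
    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is None:
                    continue
                if torch.isnan(g).any():
                    print(name, 'has NaN in grad_output')
                elif g.abs().max() == 0:
                    print(name, 'has an all-zero grad_output')
        return hook
    for name, module in model.named_modules():
        module.register_full_backward_hook(make_hook(name))

Call attach_grad_probes(model) once before loss.backward(); the first module reported is usually where the NaNs or zeros originate.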

neurosim (Owner)

You can try changing the "beta" of the scale_limit function in wage_initializer.py. It may work.
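For context (this is my reading of the WAGE scheme, not the exact code in wage_initializer.py): beta sets a floor on the weight clipping limit, so a larger beta keeps small weights from being quantized to zero. Roughly:

import math

# Illustrative WAGE-style scale limit; the names and exact formula are an approximation.
def scale_limit_sketch(fan_in, bits_W, beta=1.5):
    sigma = 2.0 ** (1 - bits_W)                       # quantization step for bits_W-bit weights
    limit_min = beta * sigma                          # minimum clipping limit, set by beta
    return max(math.sqrt(6.0 / fan_in), limit_min)    # Xavier-style limit, floored at limit_min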

rafaelfmoura

Change the method QE inside wage_quantizer.py to:
def QE(x, bits):
    max_entry = x.abs().max()
    if max_entry == 0:
        # Nudge the scale away from zero so shift() does not divide by zero.
        max_entry = max_entry + 1e-9
    x /= shift(max_entry)
    return Q(C(x, bits), bits)

This prevents the division by zero in the quantization step. (When max_entry is 0 the tensor is all zeros anyway, so the small epsilon only stabilizes the normalization and x stays at zero.)

SenFFF commented Jul 21, 2022

BTW, is there any way to solve the CUDA out-of-memory issue without changing the network topology? I tried train.py with --inference=1, but the memory-insufficiency message keeps being reported even when I set batch_size=1.
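One generic PyTorch thing to check (independent of this repo's code): if the memory runs out during the evaluation passes, make sure they run under torch.no_grad(), which avoids keeping activations for a backward pass. A minimal sketch:

import torch

def evaluate(model, loader, device):
    model.eval()
    correct = 0
    with torch.no_grad():                  # no autograd graph, so much less GPU memory
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            pred = model(data).argmax(dim=1)
            correct += (pred == target).sum().item()
    return correct / len(loader.dataset)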
