This repository has been archived by the owner on Oct 23, 2024. It is now read-only.
forked from Bdot42/mfakto
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtodo.txt
151 lines (142 loc) · 7.82 KB
/
todo.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
Test-DONE
=========
+ sub_if_gte (avoid branches/if ...) +small enh+ (also needed for vectors)
- __local tmps instead of res->... intermediates? - no effect -
- inline small functions (automatically done)
- divergent branches (bad, bad, bad)
- select vs. "... ? ... : ..."
+ 3x24-bit kernel
- ((x<<8)>>8) instead of (x&0xFFFFFF) (found in the forum) : no difference at all!
+ loop-unroll: OK, 64-loop in mod: unrolled 10% faster on GPU, no (big) change on CPU
+ vectorize (4 FC's at once?)
+ mad(ulong, ...) runs implicit I_TO_F + MULADD_e (float) ... SLOW!!!
+ hi= (hi << 1) + ((lo & 0x8000000000000000) ? 1 : 0) faster than (hi<<1)+(lo>>63ULL)
Test-TODO
=========
- 4x24-bit kernel
- in mod144_72: if q.d5!=0 ... will most likely skip the first block for one in 19 executions. Is that worth it?
TODO
====
+DONE+ - only set new params for subsequent kernel-runs
+DONE+ - cleanup mystuff
+DONE+ - cleanup cuda
+DONE+ - cleanup kernel file => test-kernel
+DONE+ - cleanup AMD licensed stuff
+DONE+ - KERNEL trace only the first thread
+DONE+ - check limits for 95-bit kernel
+DONE+ - add 64-bit kernel
+DONE+ - port 72 (3x24 bit) kernel
+DONE+ - call 72-bit kernel correctly
+DONE+ - port barrett kernels (div, mul, mod)
+DONE+ - create a kernel struct: name, cl_kernel_id, bit_min, bit_max
+DONE+ - save a copy / rev control
+DONE+ - re-enable auto-adjust of sieve-limit (measure wait-time)
+DONE+ - signal handler
+DONE+ - max 128 threads per block ... needed at all?
+DONE+ - --help
+DONE+ - complain about bad args
+DONE+ - calibrate auto-adjust
+DONE+ - cleanup unused kernels and kernel functions
+DONE+ - make it run on 11.10 / 11.11
+DONE+ - save copy of cpt file
+DONE+ - use old copy of cpt file if necessary
+DONE+ - allow running multiple instances in the same dir (all parameters as options)
+DONE+ - allow cpts only after some classes
+DONE+ - -d g
+DONE+ - "restrict" "const" wherever possible
+DONE+ - merge mfaktc 0.18
+DONE+ - allow reading ckp's of other versions (now even mfaktc)
+DONE+ - optimize 72-bit kernel for small vs. large factors (skip some converts depending on the size)
+DONE+ - result format changes
+DONE+ - versioning for txt files
+DONE+ - bugfix for "Only 1 platforms found, cannot use platform 1 ..."
+DONE+ - lock results file
+DONE+ - bugfix for total runtime calculation if a factor was found in a restarted run
+DONE+ - test optimisation options on Linux
+DONE+ - fast 64-bit kernel (mul24_63)
+DONE+ - 5x15bit based on mul24 (barrett15_75)
+DONE+ - display GHz-days/day: 0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent
// example using M50,000,000 from 2^69-2^70:
= 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
= 2.3912767291392 GHz-days
// magic constant is 0.016968 for TF to 65-bit and above
// magic constant is 0.017832 for 63-and 64-bit
// magic constant is 0.01116 for 62-bit and below
Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment
+ %C - class ID (n/4620) "%4d"
+ %c - class number (n/960) "%3d"
+ %p - percent complete (%) "%6.2f"
+ %g - GHz-days/day (GHz) "%7.2f"
+ %t - time per class (s) "%6G"
+ %e - eta (d/h/m/s) "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
+ %n - number of candidates (M/G) "%6.2fM"/"%6.2fG"
+ %r - rate (M/s) "%6.2f"
+ %s - SievePrimes "%7d"
+ %w - CPU wait time for GPU (us) "%6lld"
+ %W - CPU wait % (%) "6.2f"
+ %d - date (Mon nn) "%b %d"
+ %T - time (HH:MM) "%H:%M"
+ %U - username (as configured) "%15s"
+ %H - hostname (as configured) "%15s"
+ %M - the exponent being worked on "%d" !! no fixed width !!
+ %l - the lower bit-limit "%2d"
+ %u - the upper bit-limit "%2d"
+DONE+ - run all tests with SievePrimes 256 and 1000
+DONE+ - SievePrimesMin
+DONE+ - two optional fields in mfaktc.ini for username and computerid, so that if the user fills them out then it puts
the Primenet-style "UID: username/computername, " at the beginning of results lines
+DONE+ - output datestamp lines in results.txt [Sat Dec 31 21:29:40 2011]
+DONE+ - decode GPU-names
+DONE+ - bugfix for high SievePrimes
+DONE+ - reordering kernels dynamically
+DONE+ - worktodo.add
+DONE+ - GHz-days/day in the "total time ..." line
+DONE+ - better estimate of compute cores (CPU!)
+DONE+ - 6x15bit based on mul24
+DONE+ - test a montgomery kernel (1x64-bit, 6x15-bit: ~10% slower than barretts)
+NO+ - transfer 4 FCs in 3 ints +not needed due to GPU sieving+
+NO+ - dynamic sieve size, adjusting as SievePrimes changes +not needed due to GPU sieving+
+NO+ - test if fix SieveSizeLimit is faster : yes, is. Provide 3 fix size sieve functions and dynamically use them. +not needed due to GPU sieving+
+DONE+ - GPU-sieve (for 15-bit and 32-bit based barrett kernels)
+DONE+ - set affinity for main thread
+NO+ - mu = 2^(bitlevel+1) / FC on host? + not needed due to GPU sieving +
+NO+ - evaluate max mem alloc (device info) + not needed +
+NO+ - retrieve L1/L2-cache-size and optimize sieve accordingly at runtime + not needed due to GPU sieving +
+NO+ - verify factors using the modulo-kernels and clEnqueueTask + not needed +
+NO+ - 4 threads - 4 classes at once - also in host code to maximize GPU-utilization in a single program + not needed due to GPU sieving +
+NO+ - Llano: zero-copy-path (BeaverCreek) + not needed due to GPU sieving +
+DONE+ - _getcwd / _chdir to avoid problems on temporary loss of the device
+NO+ - CPU-sieving: multiple queues + not needed due to GPU sieving +
+DONE+ - save and reload binary kernels
+DONE+ - migrate to newer VS version (to VS12 - +2% CPU sieving speed)
+NO+ - AllowSleep + not needed due to GPU sieving +
+DONE+ - _kbhit + getchar to allow config changes at runtime
+DONE+ perftest modes for sieving (SIEVE_SIZE), data copy and kernel speed.
+DONE+ allow for kernel reload (e.g. different vector size, or -DSMALL_EXP)
+DONE+ Split device detection from kernel compilation to make use of device's capabilities
+DONE+ correct GHzDays prediction for factors found
+DONE+ Complete instructions on building on Windows (VS and MinGW)
1 copy small exp handling from mfaktc
1 find out why 0.13/0.14 were faster for some kernels
2 build test kernels to auto-detect DP vs. SP and mul24 vs. mul32 performance
2 detect microarchitecture based on feature set rather than device name
2 MODBASECASE errors in 6x15bit-kernels
2 Try to build an assembler kernel for GCN (find possible optimizations like carry, mul24_hi, scalar unit etc)
2 Auto-adjust VectorSize per kernel
3 add James' PM1-F-vs-TF-NF list to selftest
3 serialize memory access for vector kernels
3 make/allow for small exps/bit-levels
3 SDK 2.6: __attribute__((always_inline)), set environment variable GPU_ASYNC_MEM_COPY=2, Profiler, __kernel __attribute__((vec_type_hint(uintn)))
3 test different local workgroup sizes, also in combination with the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier
3 test speed without event sync (just in-order-queue)
3 Performance: vectorize GPU-sieve / reading from sieve
3 auto-optimizer: parameterize possible uses of mad_hi, amd_bitalign and others, let a test mode find out about the best config
4 dummy kernel for performance testing of sieve and transfer
4 kernel trace more specific (bitmask)
4 error handling: RES[0] -= 100; save up to 3 ints in RES; only in single-vec
4 clGetKernelWorkGroupInfo in load kernels: preferred size, local mem
5 div/mul-modulo for 96 bits
5 cleanup: separate GPU- and CPU-sieve based kernels, compile and load only one set of them
5 repunit
OpenCL 2.0 (per #ifdef): clCreateCommandQueueWithProperties instead of clCreateCommandQueue: use device queue: requires out-of-order queues: need events
merge mfaktc changes: small exps: should correct failing test