####################################################################################################
# This is a template configuration file for the DataStax Bulk Loader (DSBulk).
#
# This file is written in HOCON format; see
# https://github.com/typesafehub/config/blob/master/HOCON.md
# for more information on its syntax.
#
# Please make sure you've read the DataStax Bulk Loader documentation included in this binary
# distribution:
# ../manual/README.md
#
# An exhaustive list of available settings can be found here:
# ../manual/settings.md
#
# Also, two template configuration files meant to be used together can be found here:
# ../manual/application.template.conf
# ../manual/driver.template.conf
#
# We recommend that this file be named application.conf and placed in the /conf directory; these
# are indeed the default file name and path where DSBulk looks for configuration files.
#
# To use other file names, or another folder, you can use the -f command line switch; consult the
# DataStax Bulk Loader online documentation for more information:
# https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkLoadConfigFile.html
####################################################################################################
####################################################################################################
# DataStax Java Driver settings.
#
# You can declare any Java Driver settings directly in this file, but for maintainability's sake, we
# placed them in a separate file, which is expected to be named driver.conf and located in the same
# /conf directory.
# Use that file, for example, to define contact points, provide authentication and encryption
# settings, modify timeouts, consistency levels, page sizes, policies, and much more.
# If you decide to declare the driver settings in a different way, or in a file named differently,
# make sure to test your setup to ensure that all settings are correctly detected.
#
# You can also consult the Java Driver online documentation for more details:
# https://docs.datastax.com/en/developer/java-driver/latest/
# https://docs.datastax.com/en/developer/java-driver-dse/latest/
include classpath("driver.conf")
####################################################################################################
####################################################################################################
# DataStax Bulk Loader settings.
#
# Settings for the DataStax Bulk Loader (DSBulk) are declared below. Use this section, for
# example, to define which connector to use and how, to customize logging, monitoring, codecs, to
# specify schema settings and mappings, and much more.
#
# You can also consult the DataStax Bulk Loader online documentation for more details:
# https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html
####################################################################################################
dsbulk {
################################################################################################
# Connector-specific settings. This section contains settings for the connector to use; it also
# contains sub-sections, one for each available connector.
#
# These settings are ignored when counting.
################################################################################################
# The name of the connector to use.
# Type: string
# Default value: "csv"
#connector.name = "csv"
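# As an illustration of the HOCON syntax used throughout this file (a sketch, not a recommended
# configuration): a dotted path and nested blocks are equivalent ways of declaring the same
# setting, so the two fragments below both set `connector.name` to select the JSON connector.
#
# ```
# connector.name = "json"
# ```
#
# ```
# connector {
#   name = "json"
# }
# ```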
################################################################################################
# CSV Connector configuration.
################################################################################################
# The URL or path of the resource(s) to read from or write to.
#
# Which URL protocols are available depends on which URL stream handlers have been installed, but
# at least the **file** protocol is guaranteed to be supported for reads and writes, and the
# **http** and **https** protocols are guaranteed to be supported for reads.
#
# The file protocol can be used with all supported file systems, local or not.
# - When reading: the URL can point to a single file, or to an existing directory; in case of a
# directory, the *fileNamePattern* setting can be used to filter files to read, and the
# *recursive* setting can be used to control whether or not the connector should look for files
# in subdirectories as well.
# - When writing: the URL will be treated as a directory; if it doesn't exist, the loader will
# attempt to create it; CSV files will be created inside this directory, and their names can be
# controlled with the *fileNameFormat* setting.
#
# Note that if the value specified here does not have a protocol, then it is assumed to be a
# file protocol. Relative URLs will be resolved against the current working directory. Also, for
# convenience, if the path begins with a tilde (`~`), that symbol will be expanded to the
# current user's home directory.
#
# In addition, the value `-` indicates `stdin` when loading and `stdout` when unloading. This is
# in line with Unix tools such as tar, which uses `-` to represent stdin/stdout when
# reading/writing an archive.
#
# Examples:
#
# url = "/path/to/dir/or/file" # without protocol
# url = "./path/to/dir/or/file" # without protocol, relative to working directory
# url = "~/path/to/dir/or/file" # without protocol, relative to the user's home
# directory
# url = "file:///path/to/dir/or/file" # with file protocol
# url = "http://acme.com/file.csv" # with HTTP protocol
# url = "-" # stdin when loading, stdout when unloading (CSV data)
#
# For other URLs: the URL will be read or written directly; settings like *fileNamePattern*,
# *recursive*, and *fileNameFormat* will have no effect.
#
# The default value is `-` (read from `stdin` / write to `stdout`).
# Type: string
# Default value: "-"
#connector.csv.url = "-"
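# For instance, a minimal sketch of `connector.csv.url` pointing at a directory, so that every
# CSV file found under it, including its subdirectories, is loaded. The path /data/exports is
# hypothetical; the two other settings are described further down in this section.
#
# ```
# connector.csv.url = "/data/exports"
# connector.csv.recursive = true
# connector.csv.fileNamePattern = "**/*.csv"
# ```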
# The character(s) to use as field delimiter. Field delimiters containing more than one
# character are accepted.
# Type: string
# Default value: ","
#connector.csv.delimiter = ","
# Enable or disable whether the files to read or write begin with a header line. If enabled for
# loading, the first non-empty line in every file will assign field names for each record
# column, in lieu of `schema.mapping`, `fieldA = col1, fieldB = col2, fieldC = col3`. If
# disabled for loading, records will not contain field names, only field indexes, `0 = col1, 1
# = col2, 2 = col3`. For unloading, if this setting is enabled, each file will begin with a
# header line, and if disabled, each file will not contain a header line.
#
# Note: This option will apply to all files loaded or unloaded.
# Type: boolean
# Default value: true
#connector.csv.header = true
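# As a sketch of the headerless case for `connector.csv.header`: with the header disabled,
# fields are identified by their zero-based index, so an explicit mapping would use indexes on
# the left-hand side. The column names col1, col2 and col3 are placeholders.
#
# ```
# connector.csv.header = false
# schema.mapping = "0 = col1, 1 = col2, 2 = col3"
# ```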
# The number of records to skip from each input file before the parser can begin to execute.
# Note that if the file contains a header line, that line is not counted as a valid record. This
# setting is ignored when writing.
# Type: number
# Default value: 0
#connector.csv.skipRecords = 0
# The maximum number of records to read from or write to each file. When reading, all records
# past this number will be discarded. When writing, a file will contain at most this number of
# records; if more records remain to be written, a new file will be created using the
# *fileNameFormat* setting. Note that when writing to anything other than a directory, this
# setting is ignored. This setting takes into account the *header* setting: if a file begins
# with a header line, that line is not counted as a record. This feature is disabled by default
# (indicated by its `-1` value).
# Type: number
# Default value: -1
#connector.csv.maxRecords = -1
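# Worked example for `connector.csv.maxRecords` (values are illustrative): when unloading 2500
# records to a directory with the setting below, no file may hold more than 1000 records, so at
# least 3 files are written (1000 + 1000 + 500 at best; more files are possible if several
# files are written in parallel).
#
# ```
# connector.csv.maxRecords = 1000
# ```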
# The character used for quoting fields when the field delimiter is part of the field value.
# Only one character can be specified. Note that this setting applies to all files to be read or
# written.
# Type: string
# Default value: "\""
#connector.csv.quote = "\""
# The character that represents a line comment when found in the beginning of a line of text.
# Only one character can be specified. Note that this setting applies to all files to be read or
# written. This feature is disabled by default (indicated by its `null` character value).
# Type: string
# Default value: "\u0000"
#connector.csv.comment = "\u0000"
# The compression that will be used for writing or reading files. Supported values are (for both
# reading and writing): `none`, `xz`, `gzip`, `bzip2`, `zstd`, `lz4`, `lzma`, `snappy`,
# `deflate`. For reading only, supported values are: `brotli`, `z`, `deflate64`.
# Type: string
# Default value: "none"
#connector.csv.compression = "none"
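# For example, assuming gzip-compressed output is wanted when unloading, setting
# `connector.csv.compression` as below should be enough: as described under *fileNameFormat*,
# the default file name format then becomes `output-%06d.csv.gz`.
#
# ```
# connector.csv.compression = "gzip"
# ```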
# Sets the String representation of an empty value. When reading, if the parser does not read
# any character from the input, and the input is within quotes, this value will be used instead.
# When writing, if the writer has an empty string to write to the output, this value will be
# used instead. The default value is `AUTO`, which means that, when reading, the parser will
# emit an empty string, and when writing, the writer will write a quoted empty field to the
# output.
# Type: string
# Default value: "AUTO"
#connector.csv.emptyValue = "AUTO"
# The file encoding to use for all read or written files.
# Type: string
# Default value: "UTF-8"
#connector.csv.encoding = "UTF-8"
# The character used for escaping quotes inside an already quoted value. Only one character can
# be specified. Note that this setting applies to all files to be read or written.
# Type: string
# Default value: "\\"
#connector.csv.escape = "\\"
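# A sketch of how the default `connector.csv.quote` and `connector.csv.escape` characters
# interact when reading; the sample record is made up for illustration:
#
# ```
# 1,"She said \"hello\", then left"
# ```
#
# With the default quote (`"`) and escape (`\`) characters, this line is expected to yield two
# fields: `1` and `She said "hello", then left`. The delimiter inside the quoted value is kept
# as data.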
# The file name format to use when writing. This setting is ignored when reading and for
# non-file URLs. The file name must comply with the formatting rules of `String.format()`, and
# must contain a `%d` format specifier that will be used to increment file name counters.
#
# If compression is enabled, the default value for this setting will be modified to include the
# default suffix for the selected compression method. For example, if compression is `gzip`, the
# default file name format will be `output-%06d.csv.gz`.
# Type: string
# Default value: "output-%06d.csv"
#connector.csv.fileNameFormat = "output-%06d.csv"
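# Worked example of the `%d` specifier in `connector.csv.fileNameFormat`, following the rules
# of `String.format()`: `%06d` pads the counter with zeros to six digits, so a counter value of
# 42 gives `output-000042.csv`. The alternative format below (the `export-` prefix is
# arbitrary) would give `export-0042.csv` for the same counter value.
#
# ```
# connector.csv.fileNameFormat = "export-%04d.csv"
# ```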
# The glob pattern to use when searching for files to read. The syntax to use is the glob
# syntax, as described in `java.nio.file.FileSystem.getPathMatcher()`. This setting is ignored
# when writing and for non-file URLs. Only applicable when the *url* setting points to a
# directory on a known filesystem, ignored otherwise.
#
# If compression is enabled, the default value for this setting will be modified to include the
# default suffix for the selected compression method. For example, if compression is `gzip`, the
# default glob pattern will be `**/*.csv.gz`.
# Type: string
# Default value: "**/*.csv"
#connector.csv.fileNamePattern = "**/*.csv"
# Defines whether or not leading whitespaces from values being read/written should be skipped.
# This setting is honored when reading and writing. Default value is false.
# Type: boolean
# Default value: false
#connector.csv.ignoreLeadingWhitespaces = false
# Defines whether or not leading whitespaces from quoted values should be skipped. This setting
# is only honored when reading; it is ignored when writing. Default value is false.
# Type: boolean
# Default value: false
#connector.csv.ignoreLeadingWhitespacesInQuotes = false
# Defines whether or not trailing whitespaces from values being read/written should be skipped.
# This setting is honored when reading and writing. Default value is false.
# Type: boolean
# Default value: false
#connector.csv.ignoreTrailingWhitespaces = false
# Defines whether or not trailing whitespaces from quoted values should be skipped. This setting
# is only honored when reading; it is ignored when writing. Default value is false.
# Type: boolean
# Default value: false
#connector.csv.ignoreTrailingWhitespacesInQuotes = false
# The maximum number of characters that a field can contain. This setting is used to size
# internal buffers and to avoid out-of-memory problems. If set to -1, internal buffers will be
# resized dynamically. While convenient, this can lead to memory problems. It could also hurt
# throughput, if some large fields require constant resizing; if this is the case, set this
# value to a fixed positive number that is big enough to contain all field values.
# Type: number
# Default value: 4096
#connector.csv.maxCharsPerColumn = 4096
# The maximum number of columns that a record can contain. This setting is used to size internal
# buffers and to avoid out-of-memory problems.
# Type: number
# Default value: 512
#connector.csv.maxColumns = 512
# The maximum number of files that can be read or written simultaneously. This setting is
# effective only when reading from or writing to many resources in parallel, such as a
# collection of files in a root directory; it is ignored otherwise. The special syntax `NC` can
# be used to specify a number of threads that is a multiple of the number of available cores,
# e.g. if the number of cores is 8, then 0.5C = 0.5 * 8 = 4 threads.
#
# The default value is the special value AUTO; with this value, the connector will decide the
# best number of files.
# Type: string
# Default value: "AUTO"
#connector.csv.maxConcurrentFiles = "AUTO"
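# Worked example of the `NC` syntax for `connector.csv.maxConcurrentFiles`, assuming a machine
# with 8 available cores: the value below would resolve to 2 * 8 = 16, in the same way that
# 0.5C resolves to 4.
#
# ```
# connector.csv.maxConcurrentFiles = "2C"
# ```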
# The character(s) that represent a line ending. When set to the special value `auto` (default),
# the system's line separator, as determined by `System.lineSeparator()`, will be used when
# writing, and auto-detection of line endings will be enabled when reading. Only one or two
# characters can be specified; beware that most typical line separator characters need to be
# escaped, e.g. one should specify `\r\n` for the typical line ending on Windows systems
# (carriage return followed by a new line).
# Type: string
# Default value: "auto"
#connector.csv.newline = "auto"
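# For example, to force Windows-style line endings with `connector.csv.newline`, regardless of
# the platform DSBulk runs on, the separator can be written with escape sequences inside a
# quoted HOCON string:
#
# ```
# connector.csv.newline = "\r\n"
# ```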
# Defines whether or not line separators should be replaced by a normalized line separator '\n'
# inside quoted values. This setting is honored when reading and writing. Note: due to a bug in
# the CSV parsing library, on Windows systems, the line ending detection mechanism may not
# function properly when this setting is false; if problems occur, set this to true. Default
# value is false.
# Type: boolean
# Default value: false
#connector.csv.normalizeLineEndingsInQuotes = false
# Sets the String representation of a null value. When reading, if the parser does not read any
# character from the input, this value will be used instead. When writing, if the writer has a
# null object to write to the output, this value will be used instead. The default value is
# `AUTO`, which means that, when reading, the parser will emit a `null`, and when writing, the
# writer won't write any character at all to the output.
# Type: string
# Default value: "AUTO"
#connector.csv.nullValue = "AUTO"
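# As an illustration of `connector.csv.nullValue` (the marker NULL is an arbitrary choice):
# with the setting below, a row whose second column is null would be unloaded as `1,NULL,3`
# instead of `1,,3`, and, when loading, a field from which no character was read would be
# replaced by the string `NULL`.
#
# ```
# connector.csv.nullValue = "NULL"
# ```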
# Enable or disable scanning for files in the root's subdirectories. Only applicable when *url*
# is set to a directory on a known filesystem. Used for loading only.
# Type: boolean
# Default value: false
#connector.csv.recursive = false
# The URL or path of the file that contains the list of resources to read from.
#
# The file specified here should be located on the local filesystem.
#
# This setting and `connector.csv.url` are mutually exclusive. If both are defined and
# non-empty, this setting takes precedence over `connector.csv.url`.
#
# This setting applies only when loading. When unloading, this setting should be left empty or
# set to null; any non-empty value will trigger a fatal error.
#
# The file with URLs should follow this format:
#
# ```
# /path/to/file/file.csv
# /path/to.dir/
# ```
#
# Every line should contain one path. You don't need to escape paths in this file.
#
# All the remarks for `connector.csv.url` apply for each line in the file, and especially,
# settings like `fileNamePattern`, `recursive`, and `fileNameFormat` all apply to each line
# individually.
#
# You can comment out a line in the URL file by making it start with a # sign:
#
# ```
# #/path/that/will/be/ignored
# ```
#
# Such a line will be ignored.
#
# For your convenience, every line in the urlfile will be trimmed - that is, any leading and
# trailing white space will be removed.
#
# The file should be encoded in UTF-8, and each line should be a valid URL to load.
#
# The default value is "" - which means that this property is ignored.
# Type: string
# Default value: ""
#connector.csv.urlfile = ""
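# A minimal sketch of `connector.csv.urlfile`, assuming the list of resources to load is kept
# in a hypothetical local file named /path/to/urls.txt. Since this setting takes precedence,
# `connector.csv.url` can be left at its default value.
#
# ```
# connector.csv.urlfile = "/path/to/urls.txt"
# ```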
################################################################################################
# JSON Connector configuration.
################################################################################################
# The URL or path of the resource(s) to read from or write to.
#
# Which URL protocols are available depends on which URL stream handlers have been installed, but
# at least the **file** protocol is guaranteed to be supported for reads and writes, and the
# **http** and **https** protocols are guaranteed to be supported for reads.
#
# The file protocol can be used with all supported file systems, local or not.
# - When reading: the URL can point to a single file, or to an existing directory; in case of a
# directory, the *fileNamePattern* setting can be used to filter files to read, and the
# *recursive* setting can be used to control whether or not the connector should look for files
# in subdirectories as well.
# - When writing: the URL will be treated as a directory; if it doesn't exist, the loader will
# attempt to create it; JSON files will be created inside this directory, and their names can be
# controlled with the *fileNameFormat* setting.
#
# Note that if the value specified here does not have a protocol, then it is assumed to be a
# file protocol. Relative URLs will be resolved against the current working directory. Also, for
# convenience, if the path begins with a tilde (`~`), that symbol will be expanded to the
# current user's home directory.
#
# In addition, the value `-` indicates `stdin` when loading and `stdout` when unloading. This is
# in line with Unix tools such as tar, which uses `-` to represent stdin/stdout when
# reading/writing an archive.
#
# Examples:
#
# url = "/path/to/dir/or/file" # without protocol
# url = "./path/to/dir/or/file" # without protocol, relative to working directory
# url = "~/path/to/dir/or/file" # without protocol, relative to the user's home
# directory
# url = "file:///path/to/dir/or/file" # with file protocol
# url = "http://acme.com/file.json" # with HTTP protocol
# url = "-" # stdin when loading, stdout when unloading (JSON data)
#
# For other URLs: the URL will be read or written directly; settings like *fileNamePattern*,
# *recursive*, and *fileNameFormat* will have no effect.
#
# The default value is `-` (read from `stdin` / write to `stdout`).
# Type: string
# Default value: "-"
#connector.json.url = "-"
# The number of JSON records to skip from each input file before the parser can begin to
# execute. This setting is ignored when writing.
# Type: number
# Default value: 0
#connector.json.skipRecords = 0
# The maximum number of records to read from or write to each file. When reading, all records
# past this number will be discarded. When writing, a file will contain at most this number of
# records; if more records remain to be written, a new file will be created using the
# *fileNameFormat* setting. Note that when writing to anything other than a directory, this
# setting is ignored. This feature is disabled by default (indicated by its `-1` value).
# Type: number
# Default value: -1
#connector.json.maxRecords = -1
# The mode for loading and unloading JSON documents. Valid values are:
#
# * MULTI_DOCUMENT: Each resource may contain an arbitrary number of successive JSON documents
# to be mapped to records. For example, each JSON document is a single document of the form
# `{doc1}`. The root directory for the JSON documents can be specified with `url`, and the
# documents can be read recursively by setting `connector.json.recursive` to true.
# * SINGLE_DOCUMENT: Each resource contains a root array whose elements are JSON documents to be
# mapped to records. For example, the format of the JSON document is an array with embedded JSON
# documents: `[ {doc1}, {doc2}, {doc3} ]`.
# Type: string
# Default value: "MULTI_DOCUMENT"
#connector.json.mode = "MULTI_DOCUMENT"
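# To make the two values of `connector.json.mode` concrete, here is a sketch with made-up
# documents. A resource suitable for MULTI_DOCUMENT contains successive documents:
#
# ```
# {"id": 1, "name": "Alice"}
# {"id": 2, "name": "Bob"}
# ```
#
# whereas a resource suitable for SINGLE_DOCUMENT wraps the same documents in one root array:
#
# ```
# [ {"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"} ]
# ```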
# The compression that will be used for writing or reading files. Supported values are (for both
# reading and writing): `none`, `xz`, `gzip`, `bzip2`, `zstd`, `lz4`, `lzma`, `snappy`,
# `deflate`. For reading only, supported values are: `brotli`, `z`, `deflate64`.
# Type: string
# Default value: "none"
#connector.json.compression = "none"
# A map of JSON deserialization features to set. Map keys should be enum constants defined in
# `com.fasterxml.jackson.databind.DeserializationFeature`. The default value is the only way to
# guarantee that floating point numbers will not have their precision truncated when parsed, but
# can result in slightly slower parsing. Used for loading only.
#
# Note that some Jackson features might not be supported, in particular features that operate on
# the resulting JSON tree by filtering elements or altering their contents, since such features
# conflict with DSBulk's own filtering and formatting capabilities. Instead of trying to modify
# the resulting tree using Jackson features, you should try to achieve the same result using the
# settings available under the `codec` and `schema` sections.
# Type: map<string,boolean>
# Default value: {"USE_BIG_DECIMAL_FOR_FLOATS":true}
#connector.json.deserializationFeatures = {"USE_BIG_DECIMAL_FOR_FLOATS":true}
# The file encoding to use for all read or written files.
# Type: string
# Default value: "UTF-8"
#connector.json.encoding = "UTF-8"
# The file name format to use when writing. This setting is ignored when reading and for
# non-file URLs. The file name must comply with the formatting rules of `String.format()`, and
# must contain a `%d` format specifier that will be used to increment file name counters.
#
# If compression is enabled, the default value for this setting will be modified to include the
# default suffix for the selected compression method. For example, if compression is `gzip`, the
# default file name format will be `output-%06d.json.gz`.
# Type: string
# Default value: "output-%06d.json"
#connector.json.fileNameFormat = "output-%06d.json"
# The glob pattern to use when searching for files to read. The syntax to use is the glob
# syntax, as described in `java.nio.file.FileSystem.getPathMatcher()`. This setting is ignored
# when writing and for non-file URLs. Only applicable when the *url* setting points to a
# directory on a known filesystem, ignored otherwise.
#
# If compression is enabled, the default value for this setting will be modified to include the
# default suffix for the selected compression method. For example, if compression is `gzip`, the
# default glob pattern will be `**/*.json.gz`.
# Type: string
# Default value: "**/*.json"
#connector.json.fileNamePattern = "**/*.json"
# JSON generator features to enable. Valid values are all the enum constants defined in
# `com.fasterxml.jackson.core.JsonGenerator.Feature`. For example, a value of `{
# ESCAPE_NON_ASCII : true, QUOTE_FIELD_NAMES : true }` will configure the generator to escape
# all characters beyond 7-bit ASCII and quote field names when writing JSON output. Used for
# unloading only.
#
# Note that some Jackson features might not be supported, in particular features that operate on
# the resulting JSON tree by filtering elements or altering their contents, since such features
# conflict with DSBulk's own filtering and formatting capabilities. Instead of trying to modify
# the resulting tree using Jackson features, you should try to achieve the same result using the
# settings available under the `codec` and `schema` sections.
# Type: map<string,boolean>
# Default value: {}
#connector.json.generatorFeatures = {}
# The maximum number of files that can be read or written simultaneously. This setting is
# effective only when reading from or writing to many resources in parallel, such as a
# collection of files in a root directory; it is ignored otherwise. The special syntax `NC` can
# be used to specify a number of threads that is a multiple of the number of available cores,
# e.g. if the number of cores is 8, then 0.5C = 0.5 * 8 = 4 threads.
#
# The default value is the special value AUTO; with this value, the connector will decide the
# best number of files.
# Type: string
# Default value: "AUTO"
#connector.json.maxConcurrentFiles = "AUTO"
# JSON parser features to enable. Valid values are all the enum constants defined in
# `com.fasterxml.jackson.core.JsonParser.Feature`. For example, a value of `{ ALLOW_COMMENTS :
# true, ALLOW_SINGLE_QUOTES : true }` will configure the parser to allow the use of comments and
# single-quoted strings in JSON data. Used for loading only.
#
# Note that some Jackson features might not be supported, in particular features that operate on
# the resulting JSON tree by filtering elements or altering their contents, since such features
# conflict with DSBulk's own filtering and formatting capabilities. Instead of trying to modify
# the resulting tree using Jackson features, you should try to achieve the same result using the
# settings available under the `codec` and `schema` sections.
# Type: map<string,boolean>
# Default value: {}
#connector.json.parserFeatures = {}
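# For instance, with `connector.json.parserFeatures` set to the example value from its
# description, input such as the fragment shown second (made up for illustration) should be
# accepted when loading, although it is not strictly valid JSON:
#
# ```
# connector.json.parserFeatures = { ALLOW_COMMENTS : true, ALLOW_SINGLE_QUOTES : true }
# ```
#
# ```
# // a comment preceding the first document
# {'id': 1, 'name': 'Alice'}
# ```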
# Enable or disable pretty printing. When enabled, JSON records are written with indents. Used
# for unloading only.
#
# Note: Can result in much bigger records.
# Type: boolean
# Default value: false
#connector.json.prettyPrint = false
# Enable or disable scanning for files in the root's subdirectories. Only applicable when *url*
# is set to a directory on a known filesystem. Used for loading only.
# Type: boolean
# Default value: false
#connector.json.recursive = false
# A map of JSON serialization features to set. Map keys should be enum constants defined in
# `com.fasterxml.jackson.databind.SerializationFeature`. Used for unloading only.
#
# Note that some Jackson features might not be supported, in particular features that operate on
# the resulting JSON tree by filtering elements or altering their contents, since such features
# conflict with DSBulk's own filtering and formatting capabilities. Instead of trying to modify
# the resulting tree using Jackson features, you should try to achieve the same result using the
# settings available under the `codec` and `schema` sections.
# Type: map<string,boolean>
# Default value: {}
#connector.json.serializationFeatures = {}
# The strategy to use for filtering out entries when formatting output. Valid values are enum
# constants defined in `com.fasterxml.jackson.annotation.JsonInclude.Include` (but beware that
# the `CUSTOM` strategy cannot be honored). Used for unloading only.
# Type: string
# Default value: "ALWAYS"
#connector.json.serializationStrategy = "ALWAYS"
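# For example, assuming `connector.json.serializationStrategy` behaves as Jackson documents
# for `JsonInclude.Include`, the value below would leave out entries whose value is null when
# unloading, rather than writing them as `"field": null`:
#
# ```
# connector.json.serializationStrategy = "NON_NULL"
# ```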
# The URL or path of the file that contains the list of resources to read from.
#
# The file specified here should be located on the local filesystem.
#
# This setting and `connector.json.url` are mutually exclusive. If both are defined and
# non-empty, this setting takes precedence over `connector.json.url`.
#
# This setting applies only when loading. When unloading, this setting should be left empty or
# set to null; any non-empty value will trigger a fatal error.
#
# The file with URLs should follow this format:
#
# ```
# /path/to/file/file.json
# /path/to.dir/
# ```
#
# Every line should contain one path. You don't need to escape paths in this file.
#
# All the remarks for `connector.json.url` apply for each line in the file, and especially,
# settings like `fileNamePattern`, `recursive`, and `fileNameFormat` all apply to each line
# individually.
#
# You can comment out a line in the URL file by making it start with a # sign:
#
# ```
# #/path/that/will/be/ignored
# ```
#
# Such a line will be ignored.
#
# For your convenience, every line in the urlfile will be trimmed - that is, any leading and
# trailing white space will be removed.
#
# The file should be encoded in UTF-8, and each line should be a valid URL to load.
#
# The default value is "" - which means that this property is ignored.
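#
# As an illustration, with a hypothetical file `/path/to/urlfile.txt` that follows the format
# described above, the setting could be written as:
#
# ```
# connector.json.urlfile = "/path/to/urlfile.txt"
# ```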
# Type: string
# Default value: ""
#connector.json.urlfile = ""
################################################################################################
# Schema-specific settings.
################################################################################################
# Keyspace used for loading or unloading data. Keyspace names should not be quoted and are
# case-sensitive. `MyKeyspace` will match a keyspace named `MyKeyspace` but not `mykeyspace`.
# Either `keyspace` or `graph` is required if `query` is not specified or is not qualified with
# a keyspace name.
# Type: string
# Default value: null
#schema.keyspace = null
# Table used for loading or unloading data. Table names should not be quoted and are
# case-sensitive. `MyTable` will match a table named `MyTable` but not `mytable`. Either
# `table`, `vertex` or `edge` is required if `query` is not specified.
# Type: string
# Default value: null
#schema.table = null
# The field-to-column mapping to use, that applies to both loading and unloading; ignored when
# counting. If not specified, the loader will apply a strict one-to-one mapping between the
# source fields and the database table. If that is not what you want, then you must supply an
# explicit mapping. Mappings should be specified as a map of the following form:
#
# - Indexed data sources: `0 = col1, 1 = col2, 2 = col3`, where `0`, `1`, `2` are the
# zero-based indices of fields in the source data; and `col1`, `col2`, `col3` are bound variable
# names in the insert statement.
# - A shortcut to map the first `n` fields is to simply specify the destination columns: `col1,
# col2, col3`.
# - Mapped data sources: `fieldA = col1, fieldB = col2, fieldC = col3`, where `fieldA`,
# `fieldB`, `fieldC` are field names in the source data; and `col1`, `col2`, `col3` are bound
# variable names in the insert statement.
# - A shortcut to map fields named like columns is to simply specify the destination columns:
# `col1, col2, col3`.
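#
# The forms above could be written as follows in a configuration file; this is only a sketch,
# the field and column names are placeholders, and only one `schema.mapping` entry should be
# kept:
#
# ```
# # Indexed data source, explicit indices
# schema.mapping = "0 = col1, 1 = col2, 2 = col3"
# # Mapped data source, explicit field names
# schema.mapping = "fieldA = col1, fieldB = col2, fieldC = col3"
# # Shortcut: destination columns only
# schema.mapping = "col1, col2, col3"
# ```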
#
# To specify that a field should be used as the timestamp (a.k.a. write-time) or TTL (a.k.a.
# time-to-live) of the inserted row, use the specially named fake columns `__timestamp` and
# `__ttl`: `fieldA = __timestamp, fieldB = __ttl`. Note that timestamp fields are parsed
# as regular CQL timestamp columns and must comply with either `codec.timestamp`, or
# alternatively, with `codec.unit` + `codec.epoch`. TTL fields are parsed as integers
# representing durations in seconds, and must comply with `codec.number`.
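#
# For instance, assuming hypothetical fields `fieldA`, `fieldB` and `fieldC`, where `fieldB`
# holds the write time and `fieldC` the TTL in seconds, a mapping could look like this:
#
# ```
# schema.mapping = "fieldA = col1, fieldB = __timestamp, fieldC = __ttl"
# ```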
#
# To specify that a column should be populated with the result of a function call, specify the
# function call as the input field (e.g. `now() = c4`). Note, this is only relevant for load
# operations. Similarly, to specify that a field should be populated with the result of a
# function call, specify the function call as the input column (e.g. `field1 = now()`). This is
# only relevant for unload operations. Function calls can also be qualified by a keyspace name:
# `field1 = ks1.max(c1,c2)`.
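#
# A sketch of both cases, with placeholder field, column and keyspace names (keep only the
# entry matching the operation being run):
#
# ```
# # Loading: column c4 receives the result of now()
# schema.mapping = "fieldA = c1, now() = c4"
# # Unloading: field1 receives the result of ks1.max(c1,c2)
# schema.mapping = "field1 = ks1.max(c1,c2)"
# ```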
#
# In addition, for mapped data sources, it is also possible to specify that the mapping be
# partly auto-generated and partly explicitly specified. For example, if a source row has fields
# `c1`, `c2`, `c3`, and `c5`, and the table has columns `c1`, `c2`, `c3`, `c4`, one can map all
# like-named columns and specify that `c5` in the source maps to `c4` in the table as follows:
# `* = *, c5 = c4`.
#
# One can specify that all like-named fields be mapped, except for `c2`: `* = -c2`. To skip `c2`
# and `c3`: `* = [-c2, -c3]`.
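#
# In a configuration file, these wildcard mappings could be written as in the following
# sketch (again, only one entry should be kept):
#
# ```
# # Map like-named fields, and map field c5 to column c4
# schema.mapping = "* = *, c5 = c4"
# # Map like-named fields, except c2 and c3
# schema.mapping = "* = [-c2, -c3]"
# ```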
#
# Any identifier, field or column, that is not strictly alphanumeric (i.e. not matching
# `[a-zA-Z0-9_]+`) must be surrounded by double-quotes, just like you would do in CQL: `"Field
# ""A""" = "Column 2"` (to escape a double-quote, simply double it). Note that, contrary to the
# CQL grammar, unquoted identifiers will not be lower-cased: an identifier such as `MyColumn1`
# will match a column named `"MyColumn1"` and not `mycolumn1`.
#
# The exact type of mapping to use depends on the connector being used. Some connectors can only
# produce indexed records; others can only produce mapped ones, while others are capable of
# producing both indexed and mapped records at the same time. Refer to the connector's
# documentation to know which kinds of mapping it supports.
# Type: string
# Default value: null
#schema.mapping = null
# Specify whether or not to accept records that contain extra fields that are not declared in
# the mapping. For example, if a record contains three fields A, B, and C, but the mapping only
# declares fields A and B, then if this option is true, C will be silently ignored and the
# record will be considered valid, and if false, the record will be rejected. This setting also
# applies to user-defined types and tuples. Only applicable for loading, ignored otherwise.
#
# This setting is ignored when counting.
# Type: boolean
# Default value: true
#schema.allowExtraFields = true
# Specify whether or not to accept records that are missing fields declared in the mapping. For
# example, if the mapping declares three fields A, B, and C, but a record contains only fields A
# and B, then if this option is true, C will be silently assigned null and the record will be
# considered valid, and if false, the record will be rejected. If the missing field is mapped to
# a primary key column, the record will always be rejected, since the database will reject the
# record. This setting also applies to user-defined types and tuples. Only applicable for
# loading, ignored otherwise.
#
# This setting is ignored when counting.
# Type: boolean
# Default value: false
#schema.allowMissingFields = false
# Edge label used for loading or unloading graph data. This option can only be used for modern
# graphs created with the Native engine (DSE 6.8+). The edge label must correspond to an
# existing table created with the `WITH EDGE LABEL` option; also, when `edge` is specified, then
# `from` and `to` must be specified as well. Edge labels should not be quoted and are
# case-sensitive. `MyEdge` will match a label named `MyEdge` but not `myedge`. Either `table`,
# `vertex` or `edge` is required if `query` is not specified.
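#
# As an illustration, with hypothetical graph and label names, an edge could be targeted
# with the following settings:
#
# ```
# schema.graph = "MyGraph"
# schema.edge = "MyEdge"
# schema.from = "MyVertexA"
# schema.to = "MyVertexB"
# ```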
# Type: string
# Default value: null
#schema.edge = null
# The name of the edge's incoming vertex label, for loading or unloading graph data. This option
# can only be used for modern graphs created with the Native engine (DSE 6.8+). This option is
# mandatory when `edge` is specified; ignored otherwise. Vertex labels should not be quoted and
# are case-sensitive. `MyVertex` will match a label named `MyVertex` but not `myvertex`.
# Type: string
# Default value: null
#schema.from = null
# Graph name used for loading or unloading graph data. This option can only be used for modern
# graphs created with the Native engine (DSE 6.8+). Graph names should not be quoted and are
# case-sensitive. `MyGraph` will match a graph named `MyGraph` but not `mygraph`. Either
# `keyspace` or `graph` is required if `query` is not specified or is not qualified with a
# keyspace name.
# Type: string
# Default value: null
#schema.graph = null
# Specify whether to map `null` input values to "unset" in the database, i.e., don't modify a
# potentially pre-existing value of this field for this row. Only applicable for loading;
# ignored otherwise. Note that setting this to false creates tombstones to represent `null`.
#
# Note that this setting is applied after the *codec.nullStrings* setting, and may intercept
# `null`s produced by that setting.
#
# This setting is ignored when counting. When set to true but the protocol version in use does
# not support unset values (i.e., all protocol versions lower than 4), this setting will be
# forced to false and a warning will be logged.
# Type: boolean
# Default value: true
#schema.nullToUnset = true
# Whether to preserve cell timestamps when loading and unloading. Ignored when `schema.query` is
# provided, or when the target table is a counter table. If true, the following rules will be
# applied to generated queries:
#
# - When loading, instead of a single INSERT statement, the generated query will be a BATCH
# query; this is required in order to preserve individual column timestamps for each row.
# - When unloading, the generated SELECT statement will export each column along with its
# individual timestamp.
#
# For both loading and unloading, DSBulk will import and export timestamps using field names
# such as `"writetime(<column>)"`, where `<column>` is the column's internal CQL name; for
# example, if the table has a column named `"MyCol"`, its corresponding timestamp would be
# exported as `"writetime(MyCol)"` in the generated query and in the resulting connector record.
# If you intend to use this feature to export and import tables letting DSBulk generate the
# appropriate queries, these names are fine and need not be changed. If, however, you would like
# to export or import data to or from external sources that use different field names, you could
# do so by using the function `writetime` in a schema.mapping entry; for example, the following
# mapping would map `col1` along with its timestamp to two distinct fields, `field1` and
# `field1_writetime`: `field1 = col1, field1_writetime = writetime(col1)`.
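#
# A sketch of such a configuration for an unload operation, assuming a hypothetical table
# `ks1.table1` with a regular column `col1`:
#
# ```
# schema.keyspace = "ks1"
# schema.table = "table1"
# schema.mapping = "field1 = col1, field1_writetime = writetime(col1)"
# ```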
# Type: boolean
# Default value: false
#schema.preserveTimestamp = false
# Whether to preserve cell TTLs when loading and unloading. Ignored when `schema.query` is
# provided, or when the target table is a counter table. If true, the following rules will be
# applied to generated queries:
#
# - When loading, instead of a single INSERT statement, the generated query will be a BATCH
# query; this is required in order to preserve individual column TTLs for each row.
# - When unloading, the generated SELECT statement will export each column along with its
# individual TTL.
#
# For both loading and unloading, DSBulk will import and export TTLs using field names such as
# `"ttl(<column>)"`, where `<column>` is the column's internal CQL name; for example, if the
# table has a column named `"MyCol"`, its corresponding TTL would be exported as `"ttl(MyCol)"`
# in the generated query and in the resulting connector record. If you intend to use this
# feature to export and import tables letting DSBulk generate the appropriate queries, these
# names are fine and need not be changed. If, however, you would like to export or import data
# to or from external sources that use different field names, you could do so by using the
# function `ttl` in a schema.mapping entry; for example, the following mapping would map `col1`
# along with its TTL to two distinct fields, `field1` and `field1_ttl`: `field1 = col1,
# field1_ttl = ttl(col1)`.
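#
# A sketch of such a configuration for an unload operation, assuming a hypothetical table
# `ks1.table1` with a regular column `col1`:
#
# ```
# schema.keyspace = "ks1"
# schema.table = "table1"
# schema.mapping = "field1 = col1, field1_ttl = ttl(col1)"
# ```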
# Type: boolean
# Default value: false
#schema.preserveTtl = false
# The query to use. If not specified, then *schema.keyspace* and *schema.table* must be
# specified, and dsbulk will infer the appropriate statement based on the table's metadata,
# using all available columns. If `schema.keyspace` is provided, the query need not include the
# keyspace to qualify the table reference.
#
# For loading, the statement can be any `INSERT`, `UPDATE` or `DELETE` statement. `INSERT`
# statements are preferred for most load operations, and bound variables should correspond to
# mapped fields; for example, `INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB,
# :fieldC)`. `UPDATE` statements are required if the target table is a counter table, and the
# columns are updated with incremental operations (`SET col1 = col1 + :fieldA` where `fieldA` is
# a field in the input data). A `DELETE` statement will remove existing data during the load
# operation.
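#
# For example, with a hypothetical table `ks1.table1` and a hypothetical counter table
# `ks1.counters` whose primary key is `pk`, load queries could be written as follows (keep
# only one entry):
#
# ```
# # Regular table
# schema.query = "INSERT INTO ks1.table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC)"
# # Counter table
# schema.query = "UPDATE ks1.counters SET col1 = col1 + :fieldA WHERE pk = :fieldB"
# ```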
#
# For unloading and counting, the statement can be any regular `SELECT` statement. If the
# statement does not contain any WHERE, ORDER BY, GROUP BY, or LIMIT clause, the engine will
# generate a token range restriction clause of the form: `WHERE token(...) > :start and
# token(...) <= :end` and will generate range read statements, thus allowing parallelization of
# reads while at the same time targeting coordinators that are also replicas (see
# schema.splits). If the statement does contain WHERE, ORDER BY, GROUP BY or LIMIT clauses
# however, the query will be executed as is; the engine will only be able to parallelize the
# operation if the query includes a WHERE clause including the following relations: `token(...)
# > :start AND token(...) <= :end` (the bound variables can have any name). Note that, unlike
# LIMIT clauses, PER PARTITION LIMIT clauses can be parallelized.
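#
# As a sketch, assuming a hypothetical table `ks1.table1` with partition key `pk`, the
# following query has a WHERE clause yet remains parallelizable, since it includes the token
# range relations:
#
# ```
# schema.query = "SELECT pk, c1 FROM ks1.table1 WHERE token(pk) > :start AND token(pk) <= :end"
# ```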
#
# Statements can use both named and positional bound variables. Named bound variables should be
# preferred, unless the protocol version in use does not allow them; they usually have names
# matching those of the columns in the destination table, but this is not a strict requirement;
# it is, however, required that their names match those of fields specified in the mapping.
# Positional variables can also be used, and will be named after their corresponding column in
# the destination table.
#
# When loading and unloading graph data, the query must be provided in plain CQL; Gremlin
# queries are not supported.
#
# Note: The query is parsed to discover which bound variables are present, and to map the
# variables correctly to fields.
#
# See *mapping* setting for more information.
# Type: string
# Default value: null
#schema.query = null
# The timestamp of inserted/updated cells during load; otherwise, the current time of the system
# running the tool is used. Not applicable to unloading or counting. Ignored when
# `schema.query` is provided. The value must be expressed in the timestamp format specified by
# the `codec.timestamp` setting.
#
# Query timestamps for Cassandra have microsecond resolution; any sub-microsecond information
# specified is lost. For more information, see the [CQL
# Reference](https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/cql_commands/cqlInsert.html#cqlInsert__timestamp-value).
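#
# For example, assuming `codec.timestamp` is left at its default and that this default
# accepts ISO-8601 instants, a fixed write time could be set as:
#
# ```
# schema.queryTimestamp = "2020-01-01T00:00:00Z"
# ```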
# Type: string
# Default value: null
#schema.queryTimestamp = null
# The Time-To-Live (TTL) of inserted/updated cells during load (seconds); a value of -1 means
# there is no TTL. Not applicable to unloading or counting. Ignored when `schema.query` is
# provided. For more information, see the [CQL
# Reference](https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/cql_commands/cqlInsert.html#cqlInsert__ime-value),
# [Setting the time-to-live (TTL) for
# value](http://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useTTL.html), and [Expiring data
# with time-to-live](http://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useExpire.html).
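#
# For example, to expire loaded cells after one day (24 * 60 * 60 = 86400 seconds):
#
# ```
# schema.queryTtl = 86400
# ```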
# Type: number
# Default value: -1
#schema.queryTtl = -1
# The number of token range splits in which to divide the token ring. In other words, this
# setting determines how many read requests will be generated in order to read an entire table.
# Only used when unloading and counting; ignored otherwise. Note that the actual number of
# splits may be slightly higher or lower than the number specified here, depending on the
# actual cluster topology and token ownership. Also, it is not possible to generate fewer splits
# than the total number of primary token ranges in the cluster, so the actual number of splits
# is always equal to or greater than that number. Set this to higher values if you experience
# timeouts when reading from the database, especially if paging is disabled. This setting should
# also be greater than `engine.maxConcurrentQueries`. The special syntax `NC` can be used to
# specify a number that is a multiple of the number of available cores, e.g. if the number of
# cores is 8, then 0.5C = 0.5 * 8 = 4 splits.
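#
# As a worked example, on a hypothetical machine with 8 cores (keep only one entry):
#
# ```
# # 8C = 8 * 8 = 64 splits requested (the default)
# schema.splits = "8C"
# # A plain number is assumed to request that many splits regardless of the number of cores
# schema.splits = "64"
# ```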
# Type: string
# Default value: "8C"
#schema.splits = "8C"
# The name of the edge's outgoing vertex label, for loading or unloading graph data. This option
# can only be used for modern graphs created with the Native engine (DSE 6.8+). This option is
# mandatory when `edge` is specified; ignored otherwise. Vertex labels should not be quoted and
# are case-sensitive. `MyVertex` will match a label named `MyVertex` but not `myvertex`.
# Type: string
# Default value: null
#schema.to = null
# Vertex label used for loading or unloading graph data. This option can only be used for modern
# graphs created with the Native engine (DSE 6.8+). The vertex label must correspond to an
# existing table created with the `WITH VERTEX LABEL` option. Vertex labels should not be quoted
# and are case-sensitive. `MyVertex` will match a label named `MyVertex` but not `myvertex`.
# Either `table`, `vertex` or `edge` is required if `query` is not specified.
# Type: string
# Default value: null
#schema.vertex = null
################################################################################################
# Batch-specific settings.
#
# These settings control how the workflow engine groups together statements before writing them.
#
# Only applicable for loading.
################################################################################################
# The buffer size to use for flushing batched statements. Should be set to a multiple of
# `maxBatchStatements`, e.g. 2 or 4 times that value; higher values consume more memory and
# usually do not yield any noticeable performance gain. When set to a value less than or
# equal to zero, the buffer size is implicitly set to 4 times `maxBatchStatements`.
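#
# For example, to set the buffer explicitly to 4 times the default batch size (4 * 32 = 128):
#
# ```
# batch.maxBatchStatements = 32
# batch.bufferSize = 128
# ```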
# Type: number
# Default value: -1
#batch.bufferSize = -1
# **DEPRECATED**. Use `maxBatchStatements` instead.
# Type: number
# Default value: null
#batch.maxBatchSize = null
# The maximum number of statements that a batch can contain. The ideal value depends on two
# factors:
# - The data being loaded: the larger the data, the smaller the batches should be.
# - The batch mode: when `PARTITION_KEY` is used, larger batches are acceptable, whereas when
# `REPLICA_SET` is used, smaller batches usually perform better. Also, when using `REPLICA_SET`,
# it is preferable to keep this number below the threshold configured server-side for the
# setting `unlogged_batch_across_partitions_warn_threshold` (the default is 10); failing to do
# so is likely to trigger query warnings (see `log.maxQueryWarnings` for more information).
# When set to a value less than or equal to zero, the maximum number of statements is
# considered unlimited. At least one of `maxBatchStatements` or `maxSizeInBytes` must be set to
# a positive value when batching is enabled.
# Type: number
# Default value: 32
#batch.maxBatchStatements = 32
# The maximum data size that a batch can hold. This is the number of bytes required to encode
# all the data to be persisted, without counting the overhead generated by the native protocol
# (headers, frames, etc.). The value specified here should be less than or equal to the value
# that has been configured server-side for the option `batch_size_fail_threshold_in_kb` in
# cassandra.yaml, but note that the heuristic used to compute data sizes is not 100% accurate
# and sometimes underestimates the actual size. See the documentation for the [cassandra.yaml
# configuration
# file](https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/config/configCassandra_yaml.html#configCassandra_yaml__advProps)
# for more information. When set to a value less than or equal to zero, the maximum data size
# is considered unlimited. At least one of `maxBatchStatements` or `maxSizeInBytes` must be set
# to a positive value when batching is enabled.
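#
# As a sketch, assuming the server-side `batch_size_fail_threshold_in_kb` is 50 (an assumed
# value; check your own cassandra.yaml), a limit with some safety margin could be:
#
# ```
# # 40 KB = 40 * 1024 = 40960 bytes
# batch.maxSizeInBytes = 40960
# ```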
# Type: number
# Default value: -1
#batch.maxSizeInBytes = -1
# The grouping mode. Valid values are:
# - `DISABLED`: batching is disabled.
# - `PARTITION_KEY`: groups together statements that share the same partition key. This is
# usually the most performant mode; however it may not work at all if the dataset is unordered,
# i.e., if partition keys appear randomly and cannot be grouped together.
# - `REPLICA_SET`: groups together statements that share the same replica set. This mode works
# in all cases, but may incur some throughput and latency degradation, especially with large
# clusters or high replication factors.
# When tuning DSBulk for batching, the recommended approach is as follows:
# 1. Start with `PARTITION_KEY`;
# 2. If the average batch size is close to 1, try increasing `bufferSize`;
# 3. If increasing `bufferSize` doesn't help, switch to `REPLICA_SET` and set
# `maxBatchStatements` or `maxSizeInBytes` to low values to avoid timeouts or errors;
# 4. Increase `maxBatchStatements` or `maxSizeInBytes` to get the best throughput while keeping
# latencies acceptable.
# The default is `PARTITION_KEY`.
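#
# Step 3 of the tuning approach could translate into settings such as these; the value 8 is
# only an illustrative starting point, chosen to stay below a server-side warn threshold of 10:
#
# ```
# batch.mode = "REPLICA_SET"
# batch.maxBatchStatements = 8
# ```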
# Type: string
# Default value: "PARTITION_KEY"
#batch.mode = "PARTITION_KEY"
################################################################################################
# Conversion-specific settings. These settings apply for both load and unload workflows.
#
# When loading, these settings determine how record fields emitted by connectors are parsed.
#
# When unloading, these settings determine how row cells emitted by DSE are formatted.
#
# When counting, these settings are ignored.
################################################################################################
# Strategy to use when converting binary data to strings. Only applicable when unloading columns
# of CQL type `blob`, or columns of geometry types, if the value of `codec.geo` is `WKB`; and
# only if the connector in use requires stringification. Valid values are:
#
# - BASE64: Encode the binary data into a Base-64 string. This is the default strategy.
# - HEX: Encode the binary data as CQL blob literals. CQL blob literals follow the general
# syntax: `0[xX][0-9a-fA-F]+`, that is, `0x` followed by hexadecimal characters, for example:
# `0xcafebabe`. This format produces lengthier strings than BASE64, but is also the only format
# compatible with CQLSH.
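#
# For example, to unload binary data as CQL blob literals that CQLSH can read:
#
# ```
# codec.binary = "HEX"
# ```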
# Type: string
# Default value: "BASE64"
#codec.binary = "BASE64"
# Set how true and false representations of numbers are interpreted. The representation is of
# the form `true_value,false_value`. The mapping is reciprocal, so that numbers are mapped to
# booleans and vice versa. All numbers unspecified in this setting are rejected.
# Type: list<number>
# Default value: [1,0]
#codec.booleanNumbers = [1,0]
# Specify how true and false representations can be used by dsbulk. Each representation is of
# the form `true_value:false_value`, case-insensitive. For loading, all representations are
# honored: when a record field value exactly matches one of the specified strings, the value is
# replaced with `true` or `false` before writing to the database. For unloading, this setting is
# only applicable for string-based connectors, such as the CSV connector: the first
# representation will be used to format booleans before they are written out, and all others are
# ignored.
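#
# For example, with the hypothetical setting below, `Y`, `N`, `TRUE` and `FALSE` would be
# accepted when loading, and booleans would be written as `Y` or `N` when unloading to a
# string-based connector:
#
# ```
# codec.booleanStrings = ["Y:N", "TRUE:FALSE"]
# ```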
# Type: list<string>
# Default value: ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]
#codec.booleanStrings = ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]
# The temporal pattern to use for `String` to CQL `date` conversion. Valid choices:
#
# - A date-time pattern such as `yyyy-MM-dd`.
# - A pre-defined formatter such as `ISO_LOCAL_DATE`. Any public static field in
# `java.time.format.DateTimeFormatter` can be used.
# - The special formatter `UNITS_SINCE_EPOCH`, which is a special parser that reads and writes
# local dates as numbers representing time units since a given epoch; the unit and the epoch to
# use can be specified with `codec.unit` and `codec.epoch`.
#
# For more information on patterns and pre-defined formatters, see [Patterns for Formatting and
# Parsing](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#patterns)
# in Oracle Java documentation.
#
# For more information about CQL date, time and timestamp literals, see [Date, time, and
# timestamp
# format](https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/refDateTimeFormats.html?hl=timestamp).
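#
# Two possible settings, shown as a sketch (keep only one `codec.date` entry); `DAYS` is
# assumed here to be a valid value for `codec.unit`:
#
# ```
# # Dates written as 20240131
# codec.date = "yyyyMMdd"
# # Dates written as a number of days since the Unix epoch
# codec.date = "UNITS_SINCE_EPOCH"
# codec.unit = "DAYS"
# codec.epoch = "1970-01-01T00:00:00Z"
# ```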
# Type: string
# Default value: "ISO_LOCAL_DATE"
#codec.date = "ISO_LOCAL_DATE"
# This setting is used in the following situations:
#
# - When the target column is of CQL `timestamp` type, or when loading to a `USING TIMESTAMP`
# clause, or when unloading from a `writetime()` function call, and if `codec.timestamp` is set
# to `UNITS_SINCE_EPOCH`, then the epoch specified here determines the relative point in time to
# use to convert numeric data to and from temporals. For example, if the input is 123 and the
# epoch specified here is `2000-01-01T00:00:00Z`, then the input will be interpreted as 123
# `codec.unit`s since January 1st 2000.