-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Counted Set #132
base: main
Are you sure you want to change the base?
Counted Set #132
Conversation
This module will contain CountedSet and potentially other multiset implementations.
Similar to other collections in the package, CountedSet is backed by an internal _storage property, and exposes an initializer for preallocating storage.
Sequence conformance is implemented by iterating over the stored elements, repeating according to each stored count.
Collection conformance is implemented by using the index of an element in the underlying storage along with a position to distinguish between copies of the same element.
The implementation requires all elements to be unique and all multiplicities to be positive. Tests have been added for initializing empty CountedSets with and without a minimum capacity.
Using unsigned integers for multiplicity of an element in a counted set makes it much clearer that negative values are not allowed. Initializing a counted set with a dictionary literal can now be done with any given multiplicity. Negative values are now barred by the compiler, and multiplicities of zero can simply be discarded. This change necessitated a slight tweak to the way iteration is performed: since Swift.repeatElement(_:count:) requires Int counts, multiple calls are used for multiplicities larger than Int.max. CountedSet.count will trap on cardinalities larger than Int.max, as with most Sequences.
The custom implementation of count adds the multiplicities of the elements together, which is far more efficient than the default approach of iterating through the set manually. underestimatedCount has also been customized: it exposes the count of the underlying dictionary, which is guaranteed to be less than or equal to the cardinality of the counted set itself. This may be used to avoid trapping for sets with very large multiplicities.
Each requirement has been implemented according to the mathematical definitions, complete with careful documentation to ensure users understand how that may diverge from expectations. Because Swift.Dictionary does not specify time complexity for every method being used, part of the implementation cannot either. Tests have been added to exercise all of the SetAlgebra implementation, including each of the axioms required by SetAlgebra.
This operation adds the multiplicities of each element together. A test has been added to cover it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good approach, thanks!
I did an initial review pass and added a number of comments that we should address before merging.
-
As mentioned below, I think
CountedMultiset
would be a better name for this type, Foundation's precedent nothwithstanding. -
As mentioned below, I'm vetoing the
SetAlgebra
conformance (sorry!) -- we'll need to go through the features of that protocol and individually consider each one, adapting/removing ones that don't make sense in a multiset. This means that we'll lose all the functionality thatSetAlgebra
provides by default -- which isn't a big deal, as the default implementations are pretty bad anyway. -
A proper multiset should provide a direct operation to query the multiplicity of a particular member. I.e., we need a generalization of
contains
that returns an integer instead of a boolean value. -
Like the standard
Set
, I expectCountedMultiset
to conform toHashable
,ExpressibleByArrayLiteral
andCustomStringConvertible
, and to conditionally conform toCodable
.I'm not sure what format would be best for the description of this type -- but given that this is a counted multiset, perhaps it makes sense to use a dictionary-like printout instead of repeating duplicates, unlike a dupes-preserving
SortedMultiset
. I recommend investigating existing precedents in other languages. -
We do not need the
RawRepresentable
conformance; it seems far more straightforward to expose the underlying dictionary through a custom property, like withOrderedSet
. -
Operations that return/initialize multisets (such as
union
,intersection
,subtracting
etc) need to document where the representative values in their results come from. (E.g., an initializer that takes a sequence could specify that representative values are the first instances that appear on the input, whileunion
could say that for values that appear in both inputs, the representative value is taken from the first set (a.k.a.self
).
// | ||
//===----------------------------------------------------------------------===// | ||
|
||
extension CountedSet: SetAlgebra { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't think it's feasible to conform this type to SetAlgebra
-- that protocol isn't designed to support multisets, and I expect that some algorithms that are generic over it will be ill-prepared to work on them. 😞
That said, I think it does make sense to provide at least some of the SetAlgebra
operations; union
/formUnion
/intersection
/formIntersection
/subtracting
/subtract
in particular seem like desirable operations, with the semantics implemented here.
We will need to manually provide subtract
and formSubtract
, as we'll no longer be able to rely on SetAlgebra
implementing them for us. (And also because SetAlgebra
implements them using intersection
& symmetricDifference
which is very wasteful.)
We'll also need to provide explicit implementations for the full suite of isSubset(of:)
etc predicates, as well as overloads for all of these that take a generic sequence instead of a CountedSet
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Painful as it is, I think you’re right about SetAlgebra
. The actual requirements of the protocol clearly assume an uncounted set, and all the documented warnings in the world won’t change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, SetAlgebra
isn't the best protocol in the stdlib -- it supports the standard unordered Set
and OptionSet
, and sadly that's about it; other set-like things tend to have a difficult time conforming to it, and we can't really define interesting generic algorithms over it. (At least, not very efficiently.)
(There is still a huge benefit in having it in the stdlib, in that it standardizes names for the most common set operations. 😀)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s a good example of premature specialization: it adds a lot of requirements that should be attached to Set
directly; including ExpressibleByArrayLiteral
, for no discernible reason.
Still, it’s not the most frustrating protocol of the bunch. I’d say SignedNumeric
deserves that title.
Anyway, unless someone thinks it is worth having MultisetAlgebra
as a protocol, I’ll just remove it. Maybe a future major version of Swift can fix SetAlgebra
.
@inlinable | ||
@discardableResult | ||
public mutating func insert(_ newMember: __owned Element) | ||
-> (inserted: Bool, memberAfterInsert: Element) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we aren't going to be constrained by SetAlgebra
's questionable design choices, we should change the signature of this method to return something that makes more practical sense, such as the index of the newly inserted item.
(OrderedSet.insert
is one precedent for this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have a better suggestion: returning the new multiplicity of the inserted element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good! We may need to provide both though, because people may need the index to figure out the exact variant that's in the set. (This is relevant for element types where equal values can be still distinguished.)
This can always be reconstructed using firstIndex(of:)
, but returning it directly in insert
helps eliminate an extra hash table lookup.
(Oh, this reminds me, CountedMultiset will need to implement the (underscored) _customIndexOfEquatableElement
and _customLastIndexOfEquatableElement
requirements of Collection to speed up firstIndex(of:)
/lastIndex(of:)
. The OrderedSet codebase has an example on how to do this, as well as other tricks in the same vein.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I literally didn’t know those existed. I’m guessing they have default implementations so most people don’t know about them?
/// This is used to distinguish between indices that point to elements with | ||
/// different values. | ||
@usableFromInline | ||
let storageIndex: Dictionary<Element, UInt>.Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please follow The Leading Underscore Rule when naming non-public members and types.
(Throughout this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sorry. In my defense, that was discouraged for normal (that is, not Standard Library) Swift code until around three months ago.
I actually didn’t realize the guidance changed until I tried to pull it just now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting -- where was it discouraged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can’t find it now. Maybe I was thinking of Google’s Swift style guide? Never mind then.
/// set. | ||
@inlinable | ||
public var count: Int { | ||
Int(rawValue.values.reduce(.zero, +)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine per the Collection protocol, but it seems like a missed opportunity.
What if we kept a running count of items in this set, updating it on every mutation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing the count would mean this is no longer an implicit data structure. I’m not really sure that’s a worthwhile tradeoff.
Is there precedent for doing that with similar data structures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh definitely -- Set
and Dictionary
do it, for example, to prevent having to scan their bitmap to figure out their element count.
I don't think having to store one extra integer matters much in this case -- the hash table in the underlying dictionary already comes with way too much memory overhead to call this a truly implicit data structure. 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it was an implicit data structure for a dictionary, but okay then.
/// set. | ||
@inlinable | ||
public var underestimatedCount: Int { | ||
rawValue.count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is neat, but if we decide to add a running total instead of using reduce
above, then we should return it here, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
underestimatedCount
has always been incredibly unclear to me, in terms of its intended role in Collection
s. I don’t believe anything actually says it should be equivalent to count
in such cases, so I figured it’d be a useful way to express the (less computationally expensive) number of unique elements.
If we change it to reference count
, we should add another property with this value.
By the way, would it be possible to add a more efficient conditional implementation of Swift Algorithms’ Unique methods for CountedSet
, in case anyone ends up using them together? I’m guessing the answer is no, but it’d be a nice-to-have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
underestimatedCount
has always been incredibly unclear to me, in terms of its intended role in Collections. I don’t believe anything actually says it should be equivalent to count in such cases, so I figured it’d be a useful way to express the (less computationally expensive) number of unique elements.
This is a good idea in general, and Collection
does allow this! underestimatedCount
is typically only used in contexts that are generic over Sequence
, but that doesn't mean it's a bad idea to make it diverge from count
-- it can still be quite beneficial.
By the way, would it be possible to add a more efficient conditional implementation of Swift Algorithms’ Unique methods for CountedSet, in case anyone ends up using them together? I’m guessing the answer is no, but it’d be a nice-to-have.
The way to do this right now would require Algorithms to import Collections, which would be a bad idea.
However, Swift does have support for cross-module overlays, which are modules that automatically get loaded whenever some module A and module B are both imported to a source file. (Apple's SDKs use this to add, e.g., MapKit-related functionality to SwiftUI when a file imports both of these modules.) If this functionality was cleaned up & promoted to a full-blown language feature, including SwiftPM support for defining cross-import modules, then we could use that to provide uniquing overloads that take CountedSets.
Another alternative is to define a set of additional collection protocols in a separate package, and have swift-collections and swift-algorithms both depend on that. This would let swift-algorithms provide generic overloads for specific algorithms to speed them up when possible.
I don't think we need to do this for CountedMultiset right now, but it's something to think about for later. (FWIW, I think we'll eventually want to define some sort of a dictionary protocol at least, and possibly a UniqueCollection protocol.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the easiest way is to simply provide the uniquing methods directly on the counted multiset type, without importing Algorithms. This would work, but perhaps it would be even better to provide a uniqueMembers
view instead, exposing the underlying dictionary's keys
view.
(I'm not sure we need to do this though -- we are already exposing the storage dictionary, after all, so folks can simply access keys
directly.)
internal var _storage = RawValue() | ||
|
||
@inlinable @inline(__always) | ||
public var rawValue: [Element: UInt] { _storage } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not have a rawValue
property -- I think it would be better to use some other name, such as OrderedSet
's elements
.
Swift practice generally dislikes the use of UInt
for anything other than bit patterns, so while UInt
makes sense from a validation/performance sense, for usability/consistency we should probably stick to using Int
for these counters.
(E.g., note that count
returns a signed Int
, not an unsigned one.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally used a Dictionary<Element, Int>
as a backing store, but decided to change it purely to eliminate confusion if a signed multiset was ever added. No one can look at a UInt
and think it can hold a negative value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts about whether that is sufficient justification for using UInt
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't disagree with that, but I feel using UInt here would be diverging too much from established API design precedents. The dice have fallen this way long ago, and it's too late to change our minds about it.
(FWIW, I do like that in Swift we tend not to do arithmetic on UInt values, or deal with signed/unsigned conversions in regular code.)
public struct CountedSet<Element: Hashable>: RawRepresentable { | ||
// Allows internal setter to be referenced from inlined code | ||
@usableFromInline | ||
internal var _storage = RawValue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please spell out the underlying type here, or use an explicit typealias. Do not initialize the stored property here.
internal var _storage: [Element: UInt]
} | ||
|
||
@inlinable | ||
public init?(rawValue: [Element: UInt]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should replace this with a simple unlabeled initializer that takes an [Element: Int]
value.
I'm not sure the initializer needs to be failable -- it seems fine to simply filter out items with invalid counts. (Although items with negative counts could be worth signaling somehow...)
Given that I'm vetoing the SetAlgebra
conformance, we'll definitely need to add additional initializer(s) to replace SetAlgebra
's generic init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the initializer required by RawRepresentable
: filtering out items would mean that the input wasn’t used as a rawValue
, which would violate the protocol.
I don’t intend for this to be used often; it’s mainly important for deserialization and similar tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to say I think we should get rid of the RawRepresentable
conformance! (It comes with interoperability overtones and we don't need any of the functionality it provides.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree about removing RawRepresentable
, but I’ll acquiesce.
In that case, do we even need a dictionary initializer? I definitely don’t think a nonfailable unlabeled initializer is appropriate, as it would not be value preserving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I said I’d acquiesce, but upon reflection a lot of the desired functionality is replicating RawRepresentable
, particularly with regards to serialization and equality-checking. If I dropped RawRepresentable
, I’d just have to replicate its implementations explicitly.
Moreover,CountedMultiset
has a similar role to OptionSet
: most operations are actually calling a different operation on the raw value, and it implements set operations backed by a non-set type. I’d argue that’s a significant precedent.
RawRepresentable
is an extremely simple protocol, and literally every part of it is relevant. Moreover, guaranteeing lossless conversion to and from a dictionary makes it much easier for users to implement exotic forms of serialization for.
You keep saying it isn’t needed, but I don’t see how that justifies its exclusion. Is there some undesired effect of conformance that I’m unaware of?
Note that CountedMultiset
could expose the underlying value as both rawValue
and elements
(they both point to _storage
anyway), and add other initializers. For instance, I think there should be an unsafe variant of dictionary initializer that performs no checks on the dictionary whatsoever as a performance optimization.
/// multiplicity is decremented by one. | ||
/// - Parameter member: An element to remove from the set. | ||
/// - Complexity: O(*k*), where *k* is the number of unique elements in the | ||
/// set, if the multiplicity of the given element was one. Otherwise, O(1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Hm, this complexity guarantee doesn't look quite right to me -- ignoring non-unique storage, removing a single item from a dictionary is expected to take O(1) hashing/equality checks on average, as long as Element
properly implements hashing.
It looks like the O(k) part comes from the Dictionary.removeValue(forKey:)
docs, which seem to be bogus -- they aren't consistent with how we document the complexity of other dictionary operations.
(Strictly speaking, removeValue(forKey:)
is indeed an O(n) operation, but so is Dictionary.subscript(key:).getter
, Set.contains(_:)
etc. -- hash tables have terrible worst-case performance. The same reasoning that allows us to say that updating a dictionary value is expected to be an "O(1)" operation also applies to removals. There is a bit of handwaving involved.)
I recommend changing this (and every other primitive operation, from contains(_:)
to insert(_:)
) to say something like:
/// - Complexity: This operation is expected to perform O(1) hashing/equality checks on
/// average, provided that `Element` implements high-quality hashing.
(Note the careful wording -- Equatable
doesn't guarantee that ==
takes O(1) time, so we cannot promise anything about how long dictionary operations will take in real time -- we can only put an upper bound on how many times they are expected to compare items. And of course random hash collisions make the worst case performance linear, so we need to talk about "expected" number of operations "on average". 😛)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent a while pondering the time complexity here, and ultimately decided it was better to underpromise than overpromise. You wouldn’t want to do this in the middle of a loop, for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's perfectly fine to remove items from a Dictionary in a loop! Removals are slower than insertions, but only by a (small) constant factor -- the two operations have exactly the same algorithmic complexity.
(In fact, removals can sometimes win by never having to resize the hash table -- in the current implementation at least. Shrinking dictionaries on removals is still a wishlist item for the stdlib.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if Dictionary.removeValue(forKey:)
has a documented time complexity of O(n), I’m going to assume a future non-breaking update can actually make it O(n).
|
||
|
||
private let x: CountedSet<Character> = ["a": 1, "b": 2, "c": 3, "d": 4] | ||
private let y: CountedSet<Character> = ["e", "f", "a", "f"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of hardwiring these particular values, these tests should be using the combinatorial testing features provided in the _CollectionsTestSupport
package to exhaustively exercise the implementations over many different combinations of inputs, generated automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was pretty sure I something like that existed, but decided to get the tests hammered out for an initial pull request before looking into it.
Thank you for the detailed feedback, I really appreciate it! |
I’ve come up with a few different candidates for how this should be expressed, and I’d like feedback on which to use. A. I’m favoring A, personally. |
It's nice to use the same term consistently where possible. Since this type is being called a counted multiset, it makes sense to me that we should be able to ask about the count of something. |
That’s true, but Swift usually uses You raise an excellent point about the name, though: I was using I don’t think there is any precedent for the name “counted multiset", and even “counted set" doesn’t seem to show up outside the context of Foundation and derivations upon it. The most common term in academia and other programming languages, by far, is “multiset”, followed by “bag”. @lorentey: Instead of |
Hello all 👋 I've recently been on the LeetCode grind and multisets/multi-set-counts are frequently used, so a lack of this data structure is fresh on the mind. Is there any interest in continuing the work in this PR? Having read all previous forum posts, issue threads, and comments on this review I feel confident on taking on the rest of the work required. Some response to/support of comments made in reviews:
Would there also be interest in an alternate implementation that stores the elements instead of purely keeping the count? I've seen comments here and there on other language implementations that some people do advocate an This would of course bring other implementation questions of the set operations and enumeration. I have some ideas but will defer to an actual implementation if interested. |
Description
A new implicit data structure has been added: an unsigned multiset. This has been discussed extensively on the Swift Forums, most notably here and here.
Detailed Design
Since this is an implicit data structure, I have introduced conformance to
RawRepresentable
, where theRawValue
isDictionary<Element, UInt>
. I have usedUInt
instead ofInt
to make it clearer that this is not a signed multiset, which may be added in the future.A potential point of controversy: I have implemented
CountedSet.underestimatedCount
as the number of unique elements, since that is indeed guaranteed to be less than or equal to its cardinality.CountedSet
implements bothCollection
andSetAlgebra
, among other protocols.This is not quite ready for release yet, in my estimation, but the contribution guidelines encourage early pull requests, and the core functionality has been implemented.
Work needs to be done on the
Codable
implementation: in particular, it should be possible to decode a flat list of objects into aCountedSet
, as one of the intended uses of the structure is to act as an unorderedArray
.Representations for Mirror, String, and debugging also need to be customized.
Additional API may be warranted as well, since a lot of functionality (like the count of a given element) is currently accessible only by reading the underlying dictionary directly.
Documentation
I’ve taken care to document all new API at the symbol-level, particularly where it may be confusing to users. However, some of it may benefit from additional usage examples and guidance.
I have not updated the Documentation folder guides yet.
Testing
There is complete test coverage for the
Sequence
andSetAlgebra
implementations, including dedicated tests for each axiom mandated bySetAlgebra
. Coverage for other elements is more spotty, but does exist.Performance
I have not implemented performance benchmarks yet.
Source Impact
This is a purely additive change, so it does not break any existing API.
Checklist