Francais | English | Espanõl

Multiset

From Wikipedia, the free encyclopedia

(Redirected from Bag (mathematics))
Jump to: navigation, search

In mathematics, a multiset (sometimes also called a bag) differs from a set in that each member has a multiplicity, which is a natural number indicating (loosely speaking) how many times it is a member, or perhaps how many memberships it has in the multiset. For example, in the multiset { a, a, b, b, b, c }, the multiplicities of the members a, b, and c are respectively 2, 3, and 1.

Contents

[edit] Formal definition

Within set theory, a multiset can be formally defined as a pair (A, m) where A is some set and m : AN is a function from A to the set N = {1, 2, 3, ...} of (positive) natural numbers. The set A is called the underlying set of elements. For each a in A the multiplicity (that is, number of occurrences) of a is the number m(a).

Any set A may be identified with the multiset (A,1A), where 1A is the indicator function: 1A(x) = 1 for x in A. So the concept of a multiset is a generalization of the concept of a set. A multiset is a set if the multiplicity of every element is one.

The function m is a set of ordered pairs { (a, m(a)) : a in A }. For example, the multiset written as { a, a, b } is defined as { (a, 2), (b, 1) }, and the multiset { a, b } is defined as { (a, 1), (b, 1) }.

An indexed family, ( ai ), where i is in some index-set, defines the multiset { ai } , provided no element occurs more than a finite number of times in the family. Even in an infinite multiset, the multiplicities must be finite numbers.

If A is a finite set, the size of the multiset (A, m) is counted with multiplicity:

<math>|(A,m)|=\sum_{a\in A}m(a).</math>

while the size of the underlying set A is counted without multiplicity:

<math>|A|=\sum_{a\in A}1.</math>

A submultiset (B, n) of a multiset (A, m) is a subset BA and a function n : BN such that n(a) ≤ m(a).

[edit] Examples

One of the most natural and simple examples is the multiset of prime factors of a number n. Here the underlying set of elements is the set of prime divisors of n. For example the number 120 has the prime factorization

<math>120 = 2^3 3^1 5^1</math>

which gives the multiset {2, 2, 2, 3, 5}.

Another is the multiset of solutions of an algebraic equation. A quadratic equation, for example, has two solutions. However, in some cases they are both the same number. Thus the multiset of solutions of the equation could be { 3, 5 }, or it could be { 4, 4 }. In the latter case it has a solution of multiplicity 2.

[edit] Operations

The usual set operations such as union, intersection and Cartesian product can be easily generalized for multisets.

Suppose (A, m) and (B, n) are multisets

  • The union can be defined as (AB, f) where f(x) = max{m(x), n(x)}.
  • The intersection can be defined as (AB, f) where f(x) = min{m(x), n(x)}.
  • The cartesian product can be defined as (A × B, f) where f((x,y)) = m(x)n(y).
  • The join (sometimes called the sum) can be defined as <math>(A,m) \biguplus (B,n) = (A \cup B, f) </math> where f(x) = m(x)+n(x).

[edit] Multiset coefficients

The number of submultisets of size k in a set of size n is the multiset coefficient

<math>\left\langle \begin{matrix}n \\ k \end{matrix}\right\rangle = {n + k - 1 \choose n-1}={n+k-1 \choose k},</math>

where the expressions to the right of "=" are binomial coefficients, i.e., the number of such multisets is the same as the number of subsets of size k in a set of size n + k − 1. Unlike the situation with sets, this cardinality will not be 0 when k > n. One simple way to prove this involves representing multisets in the following way. First, consider the notation for multisets that would represent { a, a, a, a, a, a, b, b, c, c, c, d, d, d, d, d, d, d } (6 as, 2 bs, 3 cs, 7 ds) in this form:

<math>\bullet \bullet \bullet \bullet \bullet \bullet \mid \bullet \bullet \mid \bullet \bullet \bullet \mid \bullet \bullet \bullet \bullet \bullet \bullet \bullet </math>

This is a multiset of size 18 made of elements of a set of size 4. The number of characters including both dots and vertical lines used in this notation is 18 + 4 − 1. The number of vertical lines is 4 − 1. The number of multisets of size 18 is then the number of ways to arrange the 4 − 1 vertical lines among the 18 + 4 − 1 characters, and is thus the number of subsets of size 4 − 1 in a set of size 18 + 4 − 1. Equivalently, it is the number of ways to arrange the 18 dots among the 18 + 4 − 1 characters, which is the number of subsets of size 18 of a set of size 18 + 4 − 1. This is

<math>{18+4-1 \choose 4-1}={18+4-1 \choose 18} = 1330,</math>

so that is the value of the multiset coefficient

<math>\left\langle\begin{matrix} 4 \\ 18 \end{matrix}\right\rangle.</math>

One may define a generalized binomial coefficient

<math>{n \choose k}={n(n-1)(n-2)\cdots(n-k+1) \over k!}</math>

in which n is not required to be a nonnegative integer, but may be negative or a non-integer, or a non-real complex number. (If k = 0, then the value of this coefficient is 1 because it is the product of no numbers.) Then the number of multisets of size k in a set of size n is

<math>\left\langle\begin{matrix} n \\ k \end{matrix}\right\rangle=(-1)^k{-n \choose k}.</math>

This fact led Gian-Carlo Rota to ask "Why are negative sets multisets?". He considered that question worthy of the attention of philosophers of mathematics.

[edit] Polynomial notation

The set {x} may be represented by the monomial x. The set of subsets, { {}, {x} } , is represented by the binomial 1 + x.

The set {x,y} may be represented by the monomial xy. The set of subsets, { {}, {x}, {y}, {x,y} } , is represented by the polynomial

(1 + x)(1 + y) = 1 + (x + y) + xy.

The multiset {x,x} may be represented by the monomial xx = x2. The multiset of submultisets, { {}, {x}, {x}, {x,x} } , is represented by the polynomial

(1 + x)2 = 1 + 2x + x2.

The multiset of submultisets of the multiset xn is

<math>(1+x)^n=\sum_{k=0}^n{n \choose k} x^k.</math>

That is why the binomial coefficient count the number of k-combinations of an n-set.

The multiset xKyN−K , containing N elements, K of which are hits, is called a statistical population, and a submultiset is called a statistical sample. The set of samples is

(1 + x)K(1 + y)N−K

which by the binomial theorem equals

<math>\sum_{n=0}^N\sum_{k=0}^K{K \choose k}{N-K \choose n-k}x^ky^{n-k}.</math>

So the number of n-samples with k hits is

<math>{K \choose k}{N-K \choose n-k}</math>

See hypergeometric distribution and inferential statistics for further on the distribution of hits.

The infinite set of finite multisets of elements from the set {x}:

{ {}, {x}, {x,x}, {x,x,x}, ... }

may be represented by the formal power series

S = 1 + x + x2 + x3 + ... = 1 + xS .

The formal solution,

S = (1 − x)−1,

makes sense as a set of multisets, but the intermediate result, 1−x, does not make sense as a set of multisets.

The infinite set of finite multisets of elements from the set xy is

(1 − x)−1(1 − y)−1 = 1 + (x + y) + (x2 + xy + y2) + ...

The special case y=x : The infinite multiset of finite multisets of elements from the multiset x2 is

(1 − x)−2 =  1 + 2x + 3x2 + ...

The general case: The infinite multiset of finite multisets of elements from the multiset xn is

<math>(1-x)^{-n}=\sum_{k=0}^\infty{-n \choose k} (-x)^k</math> .

This explains why "multisets are negative sets". The negative binomial coefficients count the number of k-multisets of elements from an n-set.

[edit] Cumulant generating function

A non-negative integer, n, can be represented by the monomial xn .

So a finite multiset of non-negative integers, say {2, 2, 2, 3, 5}, can be represented by a polynomial f(x), say f(x) = 3x2 + x3 + x5 .

It is convenient to consider the cumulant generating function g(t) = log(f(et)), say g(t) = log(3e2t+e3t+e5t) .

The number of elements in the multiset is eg(0) = f(1), say 3 + 1 + 1 = 5.

The derivative of the g function is g '(t) = f(et)−1f '(et)et, say g '(t) = (3e2t+e3t+e5t)−1(6e2t+3e3t+5e5t).

The mean value of the multiset is μ = g '(0) = f(1)−1f '(1), say μ = (3+1+1)−1(6 + 3 + 5) = 2.8.

The second derivative of the g function is g ' '(t).

The variance of the multiset is σ2 = g ' '(0).

The numbers ( μ, σ2, ... ) = ( g '(0), g ' '(0), ... ) are called cumulants, and that's why g is called the cumulant generating function.

The infinite set of non-negative integers {0, 1, 2, ...} is represented by the power series 1 + x + x2 + ... = (1−x)−1. The mean value and standard deviation are undefined. Nevertheless it has a cumulant-generating function, g(t) = −log(1−et). The derivative of this cumulant-generating function is g '(t) = (et−1)−1.

A finite multiset of real numbers , A = { Ai }, is represented by the cumulant generating function

<math>g_{A}(t) = \log \sum_{i} e^{tA_i} . </math>

This representation is unique: different multisets have different cumulant generating functions. See partition function (statistical mechanics) for the case where the numbers in question are the energy levels of a physical system.

The cumulant-generating function of a multiset of n real numbers having mean μ and standard deviation σ is: g(t) = log(n) + μt + 2−1t)2 + ... , and the derivative is simply: g '(t) = μ + σ2t + ...

The cumulant-generating function of set, {k}, consisting of a single real number, k, is g(t) = kt , and the derivative is the number itself: g '(t) = k . So the concept of the derivative of the cumulant generating function of a multiset of real numbers is a generalization of the concept of a real number.

The cumulant-generating function of a constant multiset, {k , k, k, k, ..., k} of n elements all equal to the same real number k, is g(t) = log(n)+kt , and the derivative is the number itself: g '(t) = k , irrespective of n.

The cumulant-generating function of the multiset of sums of elements of two multisets of numbers is the sum of the two cumulant-generating functions:

<math>g_{A+B}(t) = \log \sum_{i,j} e^{t(A_i+B_j)} = \log \sum_{i,j} e^{tA_i}e^{tB_j} \, </math>
<math> = \log \sum_i e^{tA_i}\sum_je^{tB_j} = \log \sum_i e^{tA_i}+ \log\sum_je^{tB_j} = g_{A}(t)+g_{B}(t).</math>

There is unfortunately no general formula for computing the cumulant generating function of a product

<math>g_{AB}(t) = \log \sum_{i,j} e^{tA_iB_j} \, </math>

but the special case of a constant times a multiset of numbers is:

<math>g_{kA}(t) = \log \sum_{i} e^{t(kA_i)} = \log \sum_{i} e^{(tk)A_i} = g_{A}(kt).</math>

The multiset 2A = {2Ai} is not the same multiset as 2×A =A+A = {Ai+Aj}. For example, 2{+1,−1} = {+2,−2} while 2×{+1,−1} = {+1,−1} + {+1,−1} = {+1+1,+1−1,−1+1,−1−1} = {+2,0,0,−2}.

<math>g_{k\times A}(t) = k g_{A}(t).</math>

The standard normal distribution is like a limit of big multisets of numbers.

<math>\lim_{k \rarr \infty}k^{-1}(k^2\times \{+1,-1\}).</math>

This limit does not make sense as a multiset of numbers, but the derivative of the cumulant generating functions of the multisets in question makes sense, and the limit is well defined.

<math>\lim_{k \rarr \infty} g'_{k^{-1}(k^2\times \{+1,-1\})}(t)=\lim_{k \rarr \infty} \frac{d(k^2\log(e^{+tk^{-1}}+e^{-tk^{-1}}))}{dt}=\lim_{k \rarr \infty} \frac{d(k^2\log(2)+2^{-1}t^2+\cdots)}{dt}=t.</math>

The constant term k2log(2) vanishes by differentiation. The terms ... vanish in the limit. So for the standard normal distribution, having mean 0 and standard deviation 1, the derivative of the cumulant generating function is simply g '(t) = t . For the normal distribution having mean μ and standard deviation σ, the derivative of the cumulant generating function is g '(t) = μ+σ2t .

See also random variable.

[edit] Free commutative monoids

There is a connection with the free object concept: the free commutative monoid on a set X can be taken to be the set of finite multisets with elements drawn from X, with the obvious addition operation. Such monoids are also known as (finite) formal sums of elements of X with natural coefficents. Compare free abelian group.

[edit] References

es:Multiconjunto fr:Multiensemble ko:중복집합 it:Multiinsieme nl:Multiset pl:Multizbiór sl:Večkratna množica uk:Мультимножина

Personal tools