A Belief (Bayesian) network is "defined" to be a DAG. A better question would be
Why a distribution needs to be represented by a Belief network, i.e.
a DAG?
Dependency relationships can be modeled with both directed-acyclic and undirected-cyclic representations. Therefore, a distribution may be represented in either ways.
Why acyclic and directed?
Every distribution that can be sequentially factorized into conditional probabilities as follows can be represented by a DAG:
$P(X_1,...,X_n)=P(X_n|X_1,...,X_{n-1})\color{blue}{P(X_1,...,X_{n-1})}$,
$\color{blue}{P(X_1,...,X_{n-1})}=P(X_{n-1}|X_1,...,X_{n-2})P(X_1,...,X_{n-2})$,
...,
$P(X_1, X_2, X_3)=P(X_3|X_1, X_2)\color{brown}{P(X_1, X_2)}$,
$\color{brown}{P(X_1, X_2)}=P(X_2|X_1)P(X_1)$.
Belief network $G$ is built as follows: $X_n$ has no outlink, $X_{n-1}$ links to $X_n$, $X_{n-2}$ links to $X_{n-1}$ and $X_{n}$, and finally $X_1$ links to all $X_2$ to $X_n$.
Generally, there is no unique order for factorization, thus multiple networks can represent the same distribution. For example $P(A, B)$ can be factorized as $P(A|B)P(B)$ or $P(B|A)P(A)$ which are represented with two different Bayesian networks.
Every Bayesian network represents a sub-graph of $G$. Some directed links from $G$ are removed to introduce "independence assumptions between variables".
How about cyclic and undirected?
However, there is also Markov networks. Some distributions can be represented by Markov networks but not Bayesian networks [Example 3.6, Page 82, Probabilistic Graphical Models Principles and Techniques - Daphne Koller]. These networks use undirected lines to represent dependencies between random variables. A line between A and B represents [part of] a factor (a function).
For example, $\phi_1(A, B)=A.B/2=B.A/2$ is a factor from factorization $$P_{\theta}(A, B, C, D) = \frac{1}{Z(\theta)}\phi_1(A, B)\phi_2(B, C, D).$$ Note that we cannot unambiguously assign a direction to relation $(A, B)$, since both directions are justified for $\phi_1(A, B)$.
As an example for cyclic relations, factor $\phi(B, C, D) = B.(C+D)$ would be represented with a triangle among $B$, $C$, and $D$.
Generally speaking, cycles tend to complicate the learning and inference on Markov networks, so fewer cycles is favorable.