Perplexity is a measurement of how well a probability model predicts a sample.
Intuitively, perplexity can be understood as a measure of uncertainty. Say the thing you want to predict is a sequence of numbers from one to six. If you predict each number in turn by rolling a six-sided die, you will be right about one-sixth of the time. The die's perplexity on this data (and really on any data it could possibly be asked to predict - dice are dumb) is six. The perplexity of whatever you're evaluating, on the data you're evaluating it on, roughly tells you "this thing is right about as often as an x-sided die would be."
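To make that concrete, here is a minimal Python sketch (my own illustration, not from any of the sources quoted below) checking that a fair six-sided die really does have perplexity six:

```python
import math

# A fair six-sided die assigns probability 1/6 to each face.
die = [1 / 6] * 6

# Perplexity = exp(entropy). A uniform distribution over six outcomes
# has entropy log(6), so its perplexity is exactly 6.
entropy = -sum(p * math.log(p) for p in die)
print(math.exp(entropy))  # 6.0 (up to floating-point error)
```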
Wikipedia defines perplexity as:
“a measurement of how well a probability distribution or probability model predicts a sample.”
What is perplexity?, from stats.stackexchange.com by Henry
It gives the perplexity of a discrete distribution as
$$2^{-\sum_x p(x)\log_2 p(x)}$$
which could also be written as
$$\exp\left({\sum_x p(x)\log_e \frac{1}{p(x)}}\right)$$
i.e. as a weighted geometric average of the inverses of the probabilities. For a continuous distribution, the sum would turn into an integral.
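As a sanity check of those two forms, here is a small Python sketch (my own illustration, using a made-up distribution) that computes the perplexity both ways and confirms they agree:

```python
import math

def perplexity_base2(p):
    # 2 raised to the Shannon entropy of the distribution, in bits.
    return 2 ** (-sum(pi * math.log2(pi) for pi in p if pi > 0))

def perplexity_geometric(p):
    # The same quantity as a weighted geometric average of inverse probabilities.
    return math.prod((1 / pi) ** pi for pi in p if pi > 0)

biased = [0.5, 0.25, 0.125, 0.125]   # entropy = 1.75 bits
print(perplexity_base2(biased))      # 2**1.75 ≈ 3.364
print(perplexity_geometric(biased))  # same value
```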
The article also gives a way of estimating the perplexity of a model $q$ using $N$ pieces of test data
$$2^{-\sum_{i=1}^N \frac{1}{N} \log_2 q(x_i)}$$
which could also be written
$$\exp\left(\frac{{\sum_{i=1}^N \log_e \left(\dfrac{1}{q(x_i)}\right)}}{N}\right) \text{ or } \sqrt[N]{\prod_{i=1}^N \frac{1}{q(x_i)}}$$
or in a variety of other ways, and this should make it even clearer where "log-average inverse probability" comes from.
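For example (a toy illustration with invented model probabilities, not taken from the answer), the three forms really are the same number:

```python
import math

# q[i] is the probability the model assigned to the i-th test item;
# these values are made up purely for illustration.
q = [0.1, 0.25, 0.05, 0.5, 0.2]
N = len(q)

form_base2 = 2 ** (-sum(math.log2(qi) for qi in q) / N)
form_exp = math.exp(sum(math.log(1 / qi) for qi in q) / N)
form_root = math.prod(1 / qi for qi in q) ** (1 / N)

print(form_base2, form_exp, form_root)  # all three agree (≈ 6.03)
```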
Bits-per-character and bits-per-word
Bits-per-character (BPC) is another metric often reported for recent language models. It measures exactly the quantity it is named after: the average number of bits needed to encode one character. This leads us back to Shannon's explanation of the entropy of a language:
“if the language is translated into binary digits (0 or 1) in the most efficient way, the entropy is the average number of binary digits required per letter of the original language.”
By this definition, entropy is the average number of bits per character. The reason that some language models report both cross-entropy loss and BPC is purely technical.
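Presumably the technical point is just the base of the logarithm: training frameworks typically report cross-entropy loss in nats (natural log), while BPC is measured in bits (base-2 log), so the two differ only by a constant factor:
$$\text{BPC} = \frac{\text{cross-entropy in nats}}{\ln 2} \approx \frac{\text{cross-entropy in nats}}{0.693}.$$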
In the case of Alex Graves' papers, the aim of the model is to approximate the probability distribution of the next character given past characters. At each time step $t$, let's call this (approximate) distribution $\hat{P}_t$ and let $P_t$ be the true distribution. These discrete probability distributions can be represented by a vector of size $n$, where $n$ is the number of possible characters in your alphabet.
So the BPC or average cross-entropy can be calculated as follows:
\begin{align}
\text{bpc}(\text{string}) = \frac{1}{T}\sum_{t=1}^T H(P_t, \hat{P}_t) &= -\frac{1}{T}\sum_{t=1}^T \sum_{c=1}^n P_t(c) \log_2 \hat{P}_t(c), \\
&= -\frac{1}{T}\sum_{t=1}^T \log_2 \hat{P}_t(x_t),
\end{align}
where $T$ is the length of your input string.
The equality in the second line comes from the fact that the true distribution $P_t$ is zero everywhere except at the index corresponding to the true character $x_t$ in the input string at location $t$.
How to compute bits per character (BPC)?, from stats.stackexchange.com by user6952886
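To tie the formula to code, here is a small Python sketch (my own illustration; the per-step probabilities are invented) that computes BPC from the second line of the derivation above, i.e. from the probabilities $\hat{P}_t(x_t)$ the model assigned to the characters that actually occurred:

```python
import math

def bits_per_character(per_step_probs):
    # Average of -log2(P_hat_t(x_t)) over the T positions of the string.
    T = len(per_step_probs)
    return -sum(math.log2(p) for p in per_step_probs) / T

# Probability the model gave to the correct character at each time step
# of a toy four-character string (values made up for illustration).
per_step_probs = [0.9, 0.6, 0.05, 0.7]
print(bits_per_character(per_step_probs))  # ≈ 1.43 bits per character
```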