In machine learning, a model $M$ with parameters and hyper-parameters looks like,
$Y \approx M_{\mathcal{H}}(\Phi | D)$
where $\Phi$ are the parameters and $\mathcal{H}$ are the hyper-parameters. $D$ is the training data and $Y$ is the output data (class labels in the case of a classification task).
The objective during training is to find an estimate of the parameters, $\hat{\Phi}$, that optimizes some loss function $\mathcal{L}$ we have specified. Since the model $M$ and the loss function $\mathcal{L}$ are based on $\mathcal{H}$, the resulting parameter estimate $\hat{\Phi}$ is also dependent on the hyper-parameters $\mathcal{H}$.
The hyper-parameters $\mathcal{H}$ are not 'learnt' during training, but that does not mean their values are immutable. Typically, the hyper-parameters are fixed, and we think simply of the model $M$ instead of $M_{\mathcal{H}}$. In this sense, the hyper-parameters can also be considered a priori parameters.
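A minimal sketch of this setup, using scikit-learn's `Ridge` purely as an illustrative model (not something prescribed above): the hyper-parameter `alpha` is fixed before training, while the parameters (the coefficients) are estimated from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # training data D
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0                      # hyper-parameter in H, chosen a priori
model = Ridge(alpha=alpha)       # M_H: the model is defined given H
model.fit(X, y)                  # training produces the parameter estimate Phi-hat

print(model.coef_, model.intercept_)   # the learned parameters Phi-hat
```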
The source of confusion stems from the use of $M_{\mathcal{H}}$ and the modification of the hyper-parameters $\mathcal{H}$ during the training routine, in addition to, obviously, the parameters $\hat{\Phi}$. There are several potential motivations for modifying $\mathcal{H}$ during training. An example would be changing the learning rate during training to improve the speed and/or stability of the optimization routine.
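As a minimal sketch of that example, here is plain gradient descent on a squared loss where the learning rate (a hyper-parameter) is decayed mid-training; the halving schedule is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=200)

phi = np.zeros(2)    # parameters Phi, learned from the data
lr = 0.1             # learning rate: a hyper-parameter in H, yet changed during training
for step in range(500):
    grad = 2 * X.T @ (X @ phi - y) / len(y)   # gradient of the mean squared loss
    phi -= lr * grad                          # parameter update uses the current lr
    if step % 100 == 99:
        lr *= 0.5                             # decay schedule: H is modified mid-training

print(phi)           # predictions will be made with phi, not with the final value of lr
```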
The important point of distinction is that the result, say a label prediction $Y_{pred}$, is based on the model parameters $\Phi$ and not on the hyper-parameters $\mathcal{H}$.
The distinction, however, has caveats, and consequently the lines are blurred. Consider, for example, the task of clustering, specifically Gaussian Mixture Modelling (GMM). The parameter set here is $\Phi = \{\bar{\mu}, \bar{\sigma} \}$, where $\bar{\mu}$ is the set of $N$ cluster means and $\bar{\sigma}$ is the set of $N$ standard deviations, for $N$ Gaussian kernels.
You may have intuitively recognized the hyper-parameter here: it is the number of clusters $N$, so $\mathcal{H} = \{N \}$. Typically, cluster validation is used to determine $N$ a priori, using a small sub-sample of the data $D$. However, I could also modify my learning algorithm for Gaussian Mixture Models to adapt the number of kernels $N$ during training, based on some criterion. In that scenario, the hyper-parameter $N$ becomes part of the set of parameters $\Phi = \{\bar{\mu}, \bar{\sigma}, N \}$.
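A minimal sketch of the "fix $N$ a priori" route, using scikit-learn's `GaussianMixture` and BIC on a sub-sample as one possible validation criterion (an illustrative choice on my part, not the only one):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-3, 1, size=(300, 1)),    # toy 1-D data with two clusters
                    rng.normal(+3, 1, size=(300, 1))])

# Choose N on a small sub-sample of D, before the actual training run.
subsample = D[rng.choice(len(D), size=200, replace=False)]
candidate_N = list(range(1, 6))
bics = [GaussianMixture(n_components=k, random_state=0).fit(subsample).bic(subsample)
        for k in candidate_N]
N = candidate_N[int(np.argmin(bics))]     # hyper-parameter H = {N}, now fixed

gmm = GaussianMixture(n_components=N, random_state=0).fit(D)   # training estimates Phi
print(N, gmm.means_.ravel())              # Phi: the fitted means (and covariances)
```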
Nevertheless, it should be pointed out that the result, or predicted value, for a data point $d$ in the data $D$ is based on $GMM(\bar{\mu}, \bar{\sigma})$ and not on $N$. That is, each of the $N$ Gaussian kernels will contribute some likelihood value for $d$, based on the distance of $d$ from its respective $\mu$ and on its own $\sigma$. The 'parameter' $N$ is not explicitly involved here, so it's arguably not 'really' a parameter of the model.
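To make that concrete, here is a minimal sketch with illustrative (made-up) fitted values for a 1-D mixture; I include mixing weights for completeness, even though the answer above lists only $\bar{\mu}$ and $\bar{\sigma}$ in $\Phi$. Note that $N$ enters only implicitly, as the length of the parameter arrays.

```python
import numpy as np
from scipy.stats import norm

# Illustrative fitted parameters Phi from some GMM fit (not real results):
mus = np.array([-3.0, 3.0])
sigmas = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

d = 0.5
per_kernel = weights * norm.pdf(d, loc=mus, scale=sigmas)  # each kernel's contribution to d
likelihood = per_kernel.sum()                              # N appears only as the array length
print(per_kernel, likelihood)
```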
Summary: the distinction between parameters and hyper-parameters is nuanced because of the way practitioners use them when designing the model $M$ and the loss function $\mathcal{L}$. I hope this helps disambiguate the two terms.