In statistical inference, one often encounters a dataset $X = \{x_1, \dots, x_N\}$ and seeks to characterize it by estimating the parameters $\theta$ of a chosen probability distribution $p(x \mid \theta)$. A prevalent technique for achieving this is Maximum Likelihood Estimation (MLE). At its core, MLE is the method of finding the parameter values that make the observed data most probable under a specified statistical model. To do this, we construct a likelihood function for the dataset by assuming each data point contributes independently to the total likelihood. Thus, the overall likelihood is the product of individual probabilities, which we aim to maximize with respect to $\theta$:

$$L(\theta; X) = \prod_{i=1}^{N} p(x_i \mid \theta).$$

The goal of MLE is to find the parameter values, $\hat{\theta}$, that maximize this likelihood function:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; L(\theta; X).$$
However, directly maximizing the likelihood can be computationally challenging, especially with a large number of observations. Instead, it is often more practical to maximize the logarithm of the likelihood, known as the log-likelihood:

$$\ell(\theta; X) = \log L(\theta; X) = \sum_{i=1}^{N} \log p(x_i \mid \theta).$$

This transformation is helpful because the log function is monotonically increasing: the parameters that maximize the likelihood also maximize the log-likelihood, and it allows us to work with sums instead of products, which is mathematically and computationally more convenient.
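To make the computational point concrete, here is a minimal sketch (assuming NumPy and SciPy are available, with a hypothetical univariate normal sample) comparing the raw likelihood, which underflows to zero for even a modest number of observations, with the log-likelihood, which remains well-behaved:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=2000)  # hypothetical sample

# Raw likelihood: the product of densities underflows to 0.0 in float64.
likelihood = np.prod(norm.pdf(x, loc=2.0, scale=1.0))

# Log-likelihood: the sum of log-densities stays in a comfortable range.
log_likelihood = np.sum(norm.logpdf(x, loc=2.0, scale=1.0))

print(likelihood)      # 0.0 (underflow)
print(log_likelihood)  # a finite negative number, roughly -2800 here
```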
Once we have the log-likelihood function, we can find the maximum likelihood estimates of the parameters by taking the derivative of the log-likelihood with respect to the parameters, setting these derivatives equal to zero, and solving for the parameters. Let's illustrate this concept with an example. Imagine we're dealing with a multivariate Gaussian distribution over $\mathbb{R}^D$. Our objective here is to estimate the mean $\mu$ and covariance $\Sigma$ that most effectively represent the characteristics of the data.
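For reference in the derivations that follow, recall the density of the $D$-dimensional Gaussian:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\!\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right).$$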
Estimating the Mean

Having laid the foundational concepts, and again assuming $x_i \sim \mathcal{N}(\mu, \Sigma)$ for all $i$, we can proceed first with a best estimate of the mean, $\hat{\mu}$:

$$\hat{\mu} = \underset{\mu}{\operatorname{argmax}} \sum_{i=1}^{N} \log \mathcal{N}(x_i \mid \mu, \Sigma) = \underset{\mu}{\operatorname{argmax}} \left[ \sum_{i=1}^{N} \log \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} - \frac{1}{2}\sum_{i=1}^{N} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \right].$$

Considering that the first term, $\sum_{i=1}^{N} \log \frac{1}{\sqrt{(2\pi)^D |\Sigma|}}$, is independent of $\mu$, we may disregard it for our present purposes, thus simplifying our expression to

$$\hat{\mu} = \underset{\mu}{\operatorname{argmax}} \left[ -\frac{1}{2}\sum_{i=1}^{N} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \right],$$

and, noting the presence of the negative factor $-\frac{1}{2}$, we find it expedient to replace finding the argument of the maximum with that of the minimum, thereby simplifying our endeavor:

$$\hat{\mu} = \underset{\mu}{\operatorname{argmin}} \sum_{i=1}^{N} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu).$$

The optimal solution for this expression can be unveiled by equating its gradient to naught. It is at this value, $\hat{\mu}$, where the function's gradient vanishes, that we find its minimum argument:

$$\nabla_{\mu} \sum_{i=1}^{N} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) = -2\,\Sigma^{-1} \sum_{i=1}^{N} (x_i - \mu) = 0.$$

Since $\Sigma$ is positive-definite (and hence invertible), we now have

$$\sum_{i=1}^{N} (x_i - \hat{\mu}) = 0,$$

or

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$
The observant reader will, no doubt, recognize this formula as none other than the arithmetic mean of the data $x_1, \dots, x_N$. It is quite natural, then, to question the necessity of this extensive exposition for a result so well-trodden within the annals of statistics. To this, I offer twofold reasoning. Firstly, the MLE serves to illuminate that the arithmetic mean does not merely emerge by happenstance; rather, it is the very culmination of an optimization process, aiming to maximize the likelihood across our dataset. Furthermore, the framework we've now established paves the way for our ensuing explorations. For example, when dealing with Maximum A Posteriori methods, we'll find ourselves replacing our familiar likelihood function with the posterior, proportional to the product of the likelihood and a prior distribution.
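As a quick sanity check, here is a small sketch (assuming NumPy and SciPy, with hypothetical synthetic data and a known covariance) that maximizes the Gaussian log-likelihood over $\mu$ numerically and confirms it lands on the arithmetic mean:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
D, N = 3, 500
true_mu = np.array([1.0, -2.0, 0.5])
Sigma = np.diag([1.0, 2.0, 0.5])                     # covariance held fixed for this check
X = rng.multivariate_normal(true_mu, Sigma, size=N)  # hypothetical dataset

# Negative log-likelihood as a function of mu (Sigma fixed).
def neg_log_likelihood(mu):
    return -np.sum(multivariate_normal.logpdf(X, mean=mu, cov=Sigma))

mu_hat = minimize(neg_log_likelihood, x0=np.zeros(D)).x

print(mu_hat)          # numerical maximizer of the likelihood
print(X.mean(axis=0))  # arithmetic mean; the two agree to optimizer tolerance
```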
Estimating Covariance

Applying the same reasoning we did for finding $\hat{\mu}$, we can estimate $\hat{\Sigma}$:

$$\hat{\Sigma} = \underset{\Sigma}{\operatorname{argmax}} \left[ \sum_{i=1}^{N} \log \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} - \frac{1}{2}\sum_{i=1}^{N} (x_i - \hat{\mu})^\top \Sigma^{-1} (x_i - \hat{\mu}) \right].$$
Let's observe the second term here. A nice property of the trace is that for any scalar $a$, we have $a = \operatorname{tr}(a)$. We can expand on this by saying that for any two vectors $u, v \in \mathbb{R}^D$ and any matrix $A \in \mathbb{R}^{D \times D}$, it is clear that

$$u^\top A v = \operatorname{tr}\!\left(u^\top A v\right) = \operatorname{tr}\!\left(A v u^\top\right),$$

where the last equality follows from the cyclic property of the trace.
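To convince ourselves numerically, here is a tiny sketch (assuming NumPy, with arbitrary made-up vectors and matrix) verifying $u^\top A v = \operatorname{tr}(A v u^\top)$:

```python
import numpy as np

rng = np.random.default_rng(7)
u, v = rng.normal(size=3), rng.normal(size=3)  # arbitrary vectors
A = rng.normal(size=(3, 3))                    # arbitrary matrix

lhs = u @ A @ v                     # scalar u^T A v
rhs = np.trace(A @ np.outer(v, u))  # tr(A v u^T), via the cyclic property

print(np.isclose(lhs, rhs))  # True
```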
Thus, we can rewrite our equation in the following manner:

$$\hat{\Sigma} = \underset{\Sigma}{\operatorname{argmax}} \left[ \sum_{i=1}^{N} \log \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} - \frac{1}{2}\sum_{i=1}^{N} \operatorname{tr}\!\left(\Sigma^{-1} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top\right) \right],$$

and since the trace is linear, we can move the sum inside the trace in the second term:

$$\hat{\Sigma} = \underset{\Sigma}{\operatorname{argmax}} \left[ \sum_{i=1}^{N} \log \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} - \frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1} \sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top\right) \right].$$
Now we'll expand the first term. Observe that there is a fraction inside of the logarithm that can be simplified:

$$\log \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} = -\frac{1}{2}\log\left((2\pi)^D\right) - \frac{1}{2}\log |\Sigma|,$$

therefore, by substitution, it is clear that

$$\hat{\Sigma} = \underset{\Sigma}{\operatorname{argmax}} \left[ -\frac{N}{2}\log\left((2\pi)^D\right) - \frac{N}{2}\log |\Sigma| - \frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1} \sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top\right) \right].$$

Again, the first term is not a function of $\Sigma$, and can therefore be eliminated:

$$\hat{\Sigma} = \underset{\Sigma}{\operatorname{argmax}} \left[ -\frac{N}{2}\log |\Sigma| - \frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1} \sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top\right) \right].$$
Finally, we have simplified it enough to solve for $\hat{\Sigma}$. Setting the gradient with respect to $\Sigma$ to zero, and using the matrix-calculus identities $\frac{\partial}{\partial \Sigma}\log|\Sigma| = \Sigma^{-1}$ and $\frac{\partial}{\partial \Sigma}\operatorname{tr}\!\left(\Sigma^{-1} S\right) = -\Sigma^{-1} S \Sigma^{-1}$ (for symmetric $\Sigma$), we obtain

$$-\frac{N}{2}\hat{\Sigma}^{-1} + \frac{1}{2}\hat{\Sigma}^{-1}\left(\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top\right)\hat{\Sigma}^{-1} = 0,$$

which implies

$$N\hat{\Sigma} = \sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top,$$

or

$$\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top.$$
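Here is a brief numerical sketch (assuming NumPy, with hypothetical synthetic data) confirming that this estimator coincides with the biased sample covariance, i.e. the one normalized by $N$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.3], [0.3, 0.5]], size=1000)

mu_hat = X.mean(axis=0)
centered = X - mu_hat

# MLE covariance: (1/N) * sum of outer products of centered points.
Sigma_hat = centered.T @ centered / len(X)

# Matches NumPy's covariance with the 1/N normalization (bias=True).
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # True
```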
One can't help but discern that just as $\hat{\mu}$ corresponds to the arithmetic mean of the data, the matrix $\hat{\Sigma}$ is the conventional formula for the (biased) sample covariance of the data. This is important, particularly when navigating situations devoid of any prior assumptions regarding $\mu$ and $\Sigma$. As we progress to MAP estimation, it will become evident that such prior-free estimates are not always adequate… but where might maximum likelihood be useful?
There are many potential realms of applicability for the MLE technique. While the standard pedagogical trajectory often couples MLE with linear regression, its scope stretches well beyond, encompassing many other estimation problems and applications.
Allow me to put forth a thought experiment (one both intriguing and yet grounded in historical trajectory). Picture, if you will, discerning patterns in the terminal phases of Intercontinental Ballistic Missile (ICBM) launches, tracing the arc of history back to their advent in 1957. Such a vast repository of data, with each entry denoted by altitude-latitude-longitude coordinates, naturally adopts the guise of spherical coordinates, symbolically represented as $(r, \theta, \phi)$.

For the purposes of a more tractable analysis, a transition to Cartesian coordinates, represented as $(x, y, z)$, would prove invaluable. Here's where the underlying mathematical beauty unveils itself: one could postulate that a sample of the data adheres to a von Mises-Fisher distribution (otherwise known as the "spherical normal"), succinctly denoted as $\mathrm{vMF}(\mu, \kappa)$. Within this distribution, $\mu$ embodies the mean direction while $\kappa$ serves as the concentration parameter.

The challenge, thus, metamorphoses into leveraging the formidable prowess of MLE to discern the optimal parameters, $\mu$ and $\kappa$, that best encapsulate the historical distribution of these launches. Not only would this endeavor provide a mathematical framework for understanding past trajectories, but it would also lay the groundwork for predictive insights into future launch patterns.
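To make the thought experiment concrete, here is a minimal sketch (assuming NumPy, with made-up unit vectors standing in for launch directions) of the von Mises-Fisher MLE: the mean direction has a closed form, while the concentration $\kappa$ is obtained via the widely used approximation of Banerjee et al., since its exact MLE has no closed-form solution:

```python
import numpy as np

def vmf_mle(X):
    """MLE for a von Mises-Fisher distribution from unit vectors X (N x D).

    Returns the mean direction mu_hat and an approximate concentration
    kappa_hat (Banerjee et al. approximation; the exact MLE for kappa
    has no closed form).
    """
    N, D = X.shape
    resultant = X.sum(axis=0)
    R = np.linalg.norm(resultant)
    mu_hat = resultant / R   # closed-form MLE of the mean direction
    R_bar = R / N            # mean resultant length
    kappa_hat = R_bar * (D - R_bar**2) / (1.0 - R_bar**2)
    return mu_hat, kappa_hat

# Hypothetical data: noisy directions near a "true" direction, normalized to the unit sphere.
rng = np.random.default_rng(0)
true_dir = np.array([0.0, 0.0, 1.0])
samples = true_dir + 0.2 * rng.normal(size=(500, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)

mu_hat, kappa_hat = vmf_mle(samples)
print(mu_hat)     # close to (0, 0, 1)
print(kappa_hat)  # larger values indicate tighter concentration about mu_hat
```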
In synthesizing such abstract mathematical principles with tangible, historical datasets, one witnesses the confluence of theory and application. As our journey through this realm progresses, we shall uncover even more intricate tapestries woven from the threads of mathematical abstraction and real-world applicability.