Andrew Murray and Ken Peters recently presented an interesting paper on mixing of oils and gas condensates, and the use of Alternating Least Squares (ALS) and Hierarchical Cluster Analysis (HCA) for interpreting such mixtures.
The slides can be found at: https://lnkd.in/gn2uEsS.
This short note offers some insight, from a mathematical / statistical perspective, into why some of their conclusions are valid, and in doing so formalises the results of Appendix A in Peters et al. (2008). It does not significantly address the issue of unmixing gas condensates and oils, which is a different problem.
The conclusion on the application of ALS (or any other linear deconvolution) using compound concentrations versus ratios in Peters and Murray (2020) and Peters et al. (2008) relates to some quite simple maths.
Imagine we have two source oils (end members), e1 and e2. For simplicity of presentation, assume we measure just two peak concentrations, [p_1] and [p_2], in each oil. Denote the first peak measured in e1 as [p_e1,1] and the second as [p_e1,2] (i.e. [p_ei,j] is the concentration of the j'th peak in the i'th sample). Assume a perfect linear mixture of the fluids, m = w1*e1 + w2*e2, where m is the mixed oil sample and w1 and w2 are the mixing coefficients (weights) of the two end members in the mixture. The peak concentrations will then also mix linearly, so:
[p_m,1] = w1*[p_e1,1] + w2*[p_e2,1]
that is, the first peak of the mixture is a weighted linear combination of the first peaks of the end members, and similarly for the second peak.
We can solve this directly because, for a two-component mixture, w2 = 1 - w1. Substituting this in, we end up with:
w1 = ([p_m,1] - [p_e2,1]) / ([p_e1,1] - [p_e2,1])
This generalises to mixtures of more than two end members but requires more peaks to be measured (which are present in different proportions in each end member).
Of course, this also (inappropriately) assumes no noise - we'll come back to that.
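As a minimal sketch of the noise-free, two-end-member case above, the expression for w1 can be checked numerically. The concentrations and the function name `unmix_from_concentration` are made up for illustration; they are not taken from the slides or the paper.

```python
def unmix_from_concentration(p_m, p_e1, p_e2):
    """Return w1 from a single peak concentration, assuming w2 = 1 - w1 and no noise."""
    if p_e1 == p_e2:
        raise ValueError("the peak must differ between the two end members")
    return (p_m - p_e2) / (p_e1 - p_e2)


# Hypothetical concentrations of peak 1 in the two end members (arbitrary units).
p_e1_1, p_e2_1 = 10.0, 4.0
w1_true = 0.3

# Forward model: the peak concentration mixes linearly.
p_m_1 = w1_true * p_e1_1 + (1.0 - w1_true) * p_e2_1

# Invert the forward model to recover the mixing weight.
w1_est = unmix_from_concentration(p_m_1, p_e1_1, p_e2_1)
print(round(w1_est, 6))
```

The guard on `p_e1 == p_e2` reflects the requirement that the peak is present in different proportions in each end member; otherwise the mixture carries no information about w1.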
Now let's use a ratio, r_e1 = [p_e1,1] / [p_e1,2] - that is we'll use the ratio of the two peaks. We can write the ratio of the mixture as the ratio of the peaks in the mixture:
r_m = (w1*[p_e1,1] + w2*[p_e2,1]) / (w1*[p_e1,2] + w2*[p_e2,2]).
After a bit of algebra, using w2 = 1 - w1, we get:
w1 = (r_m*[p_e2,2] - [p_e2,1]) / ([p_e1,1] - [p_e2,1] - r_m*([p_e1,2] - [p_e2,2]))
What this means is that we can (in the noise-free case) deconvolve a mixture using ratios only if we also know the peak concentrations in the end members. This is because r_m is not equal to w1*r_e1 + w2*r_e2; that is, ratios don't mix linearly, as Murray & Peters note. If we only have ratios, we will always be limited in what we can do in terms of deconvolution, although (see the conjecture later) with more (independent) ratios and the right tools I suspect you could solve this.
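The ratio result can be checked in the same way. The sketch below uses made-up end-member concentrations and illustrative function names. It shows that the expression for w1 recovers the true weight when the end-member concentrations are known, and that the mixture ratio is not the weighted average of the end-member ratios.

```python
def mixture_ratio(w1, e1, e2):
    """Ratio peak1/peak2 in a linear mixture; e1 and e2 are (peak1, peak2) tuples."""
    w2 = 1.0 - w1
    return (w1 * e1[0] + w2 * e2[0]) / (w1 * e1[1] + w2 * e2[1])


def unmix_from_ratio(r_m, e1, e2):
    """Return w1 from the mixture ratio, given the end-member peak concentrations."""
    denom = (e1[0] - e2[0]) - r_m * (e1[1] - e2[1])
    if denom == 0:
        raise ValueError("end members are indistinguishable from this ratio")
    return (r_m * e2[1] - e2[0]) / denom


# Hypothetical (peak1, peak2) concentrations for the two end members.
e1 = (10.0, 5.0)
e2 = (4.0, 20.0)
w1_true = 0.3

r_m = mixture_ratio(w1_true, e1, e2)
w1_est = unmix_from_ratio(r_m, e1, e2)

# What the mixture ratio would be if ratios mixed linearly (they do not).
r_linear = w1_true * (e1[0] / e1[1]) + (1.0 - w1_true) * (e2[0] / e2[1])

print(w1_est, r_m, r_linear)
```

Note that `unmix_from_ratio` needs all four end-member concentrations, not just the two end-member ratios, which is the point made above.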
So much for the maths. You could claim this is useless arithmetic: there is no consideration of noise, and all real data are noisy. This is where ALS comes in. It is just one of several available techniques that can deconvolve a mixture in the presence of noise.
So what happens in the presence of noise? Well things get a bit messy. If you are using peaks, then because we are assuming linear mixing, ALS should still work fine, if the noise is Gaussian (and I suspect also it will remain unbiased so long as the noise is symmetric). As an aside I rather doubt the noise is Gaussian here (we are dealing with concentration data, so at low values I’d expect something quite non-Gaussian, and even non-symmetric), so it could be that other methods would provide better estimates. But for anything measured at ‘reasonable’ concentrations any effect would be small in practice.
For now let’s assume Gaussian noise on the peak measurements. ALS will work fine on concentrations.
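To illustrate the Gaussian case, here is a small simulation sketch. It is not ALS itself: ALS, as generally described, alternates between estimating the end members and the weights, whereas this shows only the least-squares step for w1 with the end members assumed known and with no non-negativity constraint. The end-member values, noise level and function names are assumptions chosen purely for illustration.

```python
import random
import statistics


def least_squares_w1(mix, e1, e2):
    """Least-squares estimate of w1 over several peaks, with w2 = 1 - w1."""
    num = sum((m - b) * (a - b) for m, a, b in zip(mix, e1, e2))
    den = sum((a - b) ** 2 for a, b in zip(e1, e2))
    return num / den


def simulate_estimates(w1, e1, e2, sigma, n_trials, seed=0):
    """Repeatedly add Gaussian noise to the mixed peaks and re-estimate w1."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        noisy = [w1 * a + (1.0 - w1) * b + rng.gauss(0.0, sigma)
                 for a, b in zip(e1, e2)]
        estimates.append(least_squares_w1(noisy, e1, e2))
    return estimates


# Hypothetical concentrations of four peaks in each end member.
e1 = [10.0, 5.0, 8.0, 2.0]
e2 = [4.0, 20.0, 6.0, 9.0]

estimates = simulate_estimates(0.3, e1, e2, sigma=0.5, n_trials=5000)
mean_w1 = statistics.mean(estimates)
sd_w1 = statistics.stdev(estimates)
print(mean_w1, sd_w1)
```

Under these assumptions the average estimate sits very close to the true value of 0.3, and the spread is close to sigma divided by the square root of the summed squared end-member differences, which is what least-squares theory predicts for this simple model.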
But what about ratios? Here the noise will definitely be non-Gaussian: if the peaks follow Gaussian distributions, their ratio follows a heavy-tailed ratio distribution (exactly Cauchy in the special case of two independent zero-mean Gaussians), so ALS will struggle regardless. In addition, as we have seen above, you cannot use independent ratios alone to deconvolve a linear mixture; you will need other constraints if you want to ascertain the relative concentrations of the end members in the mixture.
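A quick simulation sketch makes the heavy tails visible. It assumes two independent peaks with Gaussian noise at a deliberately large 30% relative level (roughly the low-concentration regime), and compares the ratio with a plain Gaussian sample of similar spread. The tail measure and all the numbers are illustrative choices, not anything from the original papers.

```python
import random
import statistics


def tail_fraction(samples, k=5.0):
    """Fraction of samples more than k robust standard deviations from the median."""
    q1, med, q3 = statistics.quantiles(samples, n=4)
    robust_sd = (q3 - q1) / 1.349  # the IQR of a Gaussian is about 1.349 sd
    return sum(abs(x - med) > k * robust_sd for x in samples) / len(samples)


rng = random.Random(1)
n = 20000

# Ratio of two noisy peaks, each Gaussian with mean 1 and sd 0.3 (assumed values).
ratios = [rng.gauss(1.0, 0.3) / rng.gauss(1.0, 0.3) for _ in range(n)]

# A Gaussian sample for comparison.
gaussians = [rng.gauss(1.0, 0.42) for _ in range(n)]

ratio_tail = tail_fraction(ratios)
gauss_tail = tail_fraction(gaussians)
print(ratio_tail, gauss_tail)
```

For a Gaussian, essentially no samples should fall beyond five standard deviations, whereas the ratio produces a noticeable fraction of extreme values. A least-squares fit gives those extreme values a great deal of weight.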
Conjecture: Interestingly, I suspect that if you used ratios that ‘overlap’ (share common peaks), or used a sufficient number of ‘independent’ ratios, there could be enough information content to deconvolve the mixture (subject to noise issues). I might take a look at that on a rainy day - it would need a quite specific model (maybe a latent variable model?) that considers both the non-linear mixing of ratios and the non-Gaussian nature of the observational uncertainty. This would produce a non-linear optimisation problem, which could be complex to solve numerically and might admit multiple solutions. But if possible, it could have the benefit of not requiring the careful pre-processing of concentrations.
I think the use of HCA with ratios (or concentrations) is a rather different issue. HCA is not attempting to deconvolve the mixture, and HCA does not make assumptions about linearity directly. The main comment I would make is that one should not expect HCA, or any clustering, to separate mixtures well, using either ratios or concentrations.
Peters and Murray (2020) directly raise the issue of the use of concentrations versus ratios in the context of separating mixtures of oils and gas condensates, which I think is a known problem (in essence gas condensates will carry far lower proportions of biomarkers and other heavier molecules, hence the pre-processing applied to ‘renormalise’ the concentrations). I assume that in an ideal world we would obtain measurements of properties (or ratios, but see above) across compounds that would help identify differences in composition between oils and condensates (i.e. in the gasoline range assuming these were not too badly affected by loss due to evaporation or biodegradation / fractionation).
No statistical modelling technique is magic … your data needs to have the information content necessary to identify the parameters you are estimating, and your estimator needs to be unbiased and efficient. Measuring concentrations across more peaks will not help if the data are strongly correlated as you add very little information doing this, so the choice of concentrations (or ratios) used will also be important here.
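The point about correlated peaks can be illustrated with a three-end-member sketch, under made-up numbers. Two peaks plus the closure condition w1 + w2 + w3 = 1 give three equations for three weights, solved here with Cramer's rule. When the second peak is almost proportional to the first across the end members, a tiny measurement error in the mixture moves the estimated weights enormously. All names and values are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_weights(peak1, peak2, mix):
    """Solve for three weights from two peaks plus closure, using Cramer's rule."""
    A = [list(peak1), list(peak2), [1.0, 1.0, 1.0]]
    b = [mix[0], mix[1], 1.0]
    d = det3(A)
    weights = []
    for col in range(3):
        Ac = [row[:] for row in A]
        for r in range(3):
            Ac[r][col] = b[r]
        weights.append(det3(Ac) / d)
    return weights


def max_weight_shift(peak1, peak2, w_true, bump=0.01):
    """Largest change in any weight when the second mixture peak is off by `bump`."""
    mix = [sum(w * p for w, p in zip(w_true, peak)) for peak in (peak1, peak2)]
    clean = solve_weights(peak1, peak2, mix)
    noisy = solve_weights(peak1, peak2, [mix[0], mix[1] + bump])
    return max(abs(c - n) for c, n in zip(clean, noisy))


w_true = [0.5, 0.3, 0.2]
peak1 = [10.0, 4.0, 1.0]             # peak 1 in end members 1, 2, 3
peak2_independent = [2.0, 8.0, 5.0]  # carries new information
peak2_correlated = [20.01, 8.0, 2.0]  # almost exactly 2 * peak1

mix_independent = [sum(w * p for w, p in zip(w_true, peak))
                   for peak in (peak1, peak2_independent)]
recovered = solve_weights(peak1, peak2_independent, mix_independent)

shift_independent = max_weight_shift(peak1, peak2_independent, w_true)
shift_correlated = max_weight_shift(peak1, peak2_correlated, w_true)
print(recovered, shift_independent, shift_correlated)
```

In the well-conditioned case a 0.01 error in one mixture peak changes the weights by a fraction of a percent; in the near-proportional case the same error shifts the weights by more than their entire valid range.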
In this short technical note I have tried to show mathematically that applying ALS to ratios is not just practically wrong, it is theoretically wrong: we should not use a linear unmixing method on ratios because they mix non-linearly. The fact that ALS on ratios works to a degree is down to the degree of non-linearity, which depends on the ranges over which the concentrations vary; if this range is small, a linear approximation will be good enough.
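A small sketch, with made-up concentrations, shows how the error of a naive linear unmix on ratios grows with the spread in concentrations. In this two-peak setting the naive estimate is exact when the denominator peak has the same concentration in both end members, and it degrades as the two denominators move apart. The function names are illustrative only.

```python
def mixture_ratio(w1, e1, e2):
    """Ratio peak1/peak2 in a linear mixture; e1 and e2 are (peak1, peak2) tuples."""
    w2 = 1.0 - w1
    return (w1 * e1[0] + w2 * e2[0]) / (w1 * e1[1] + w2 * e2[1])


def naive_linear_w1(r_m, e1, e2):
    """Unmix as if ratios mixed linearly: r_m = w1*r_e1 + (1 - w1)*r_e2."""
    r_e1 = e1[0] / e1[1]
    r_e2 = e2[0] / e2[1]
    return (r_m - r_e2) / (r_e1 - r_e2)


def naive_error(w1, e1, e2):
    """Absolute error in w1 when the naive linear unmix is applied to the true ratio."""
    return abs(naive_linear_w1(mixture_ratio(w1, e1, e2), e1, e2) - w1)


w1_true = 0.3
e1 = (10.0, 5.0)

err_equal = naive_error(w1_true, e1, (4.0, 5.0))       # same denominator peak
err_similar = naive_error(w1_true, e1, (4.0, 5.5))     # denominators differ by 10%
err_different = naive_error(w1_true, e1, (4.0, 20.0))  # denominators differ fourfold

print(err_equal, err_similar, err_different)
```

With these numbers the naive estimate is essentially exact in the first case, off by about 0.02 in the second, and off by about 0.2 (for a true weight of 0.3) in the third.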
In the presence of noise the situation is likely to be worse, as the least squares element of ALS implies a Gaussian distribution on the (observation + model) errors, which cannot be true for both concentrations and ratios at the same time!
A future challenge would be to develop an unmixing method which could provide an unbiased estimate of the mixing proportions when using ratios. In principle such a model could be developed, although estimation or inference in that model could be computationally challenging.
Dan Cornford, Nov 2020.
References
Peters, K. E. and Murray, A. 2020. Deconvoluting Mixed Petroleum and the Effect of Oil and Gas-Condensate Mixtures on Identifying Petroleum Systems, Search and Discovery Article #42564. http://www.searchanddiscovery.com/pdfz/documents/2020/42564peters/ndx_peters.pdf.html
Peters, K. E., Ramos, L. S., Zumberge, J. E., Valin, Z. C. and Bird, K. J. 2008. De-Convoluting Mixed Crude Oil in Prudhoe Bay Field, North Slope, Alaska, Organic Geochemistry, 39(6), 623-645