Wrangling through Dataland

A case study on the Portuguese bank marketing dataset

Classification algorithms are often able to output predicted probabilities. Sometimes these predicted probabilities are of interest themselves, such as when assessing betting odds. Predicted probabilities may also help with imbalanced data by giving us the option of adjusting the classification threshold to improve model predictions.
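To make the threshold idea concrete, here is a minimal sketch in plain Python. The labels and predicted probabilities are hypothetical and made up for illustration (they are not drawn from the bank marketing dataset), and the helper names `classify`, `recall` and `precision` are my own. It shows how lowering the threshold from the conventional 0.5 trades precision for recall on an imbalanced sample.

```python
def classify(probabilities, threshold=0.5):
    """Label a prediction 1 ("yes") when its probability meets the threshold, else 0 ("no")."""
    return [1 if p >= threshold else 0 for p in probabilities]


def recall(y_true, y_pred):
    """Share of actual positives that the model catches."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


def precision(y_true, y_pred):
    """Share of predicted positives that are actually positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    return true_pos / predicted_pos if predicted_pos else 0.0


# Hypothetical imbalanced sample: 7 "no" (0) and 3 "yes" (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.40, 0.55, 0.70]

for threshold in (0.5, 0.35):
    y_pred = classify(y_prob, threshold)
    print(
        f"threshold={threshold}: "
        f"recall={recall(y_true, y_pred):.2f}, "
        f"precision={precision(y_true, y_pred):.2f}"
    )
```

In this toy sample, the lower threshold catches every "yes" at the cost of more false alarms. Whether that trade is worthwhile depends on the relative costs of the two kinds of error.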

Through a case study, this article discusses how predicted probabilities differ across various machine learning algorithms, and how they may be used to boost the predictive power of these algorithms. The case study uses a Portuguese bank marketing dataset, where the target variable is a “yes” or “no” subscription to a term deposit.

I…


A case study with Bayesian probabilistic modeling

We have little to no control over the actual return from our financial investments. We hope for a positive return, but there is no guarantee. The only thing we can control throughout the process is the volatility in the value of those investments, and we do this through diversification. It is the age-old wisdom of not putting all your eggs in one basket. This is the key insight and goal of Harry Markowitz’s mean-variance portfolio optimization.
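The diversification effect can be illustrated with the standard two-asset portfolio variance formula, σₚ² = w₁²σ₁² + w₂²σ₂² + 2w₁w₂ρσ₁σ₂. The sketch below is a simple illustration rather than the Bayesian model of the case study. The function name `portfolio_volatility` and the volatility and correlation figures are hypothetical.

```python
import math


def portfolio_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights w1 and (1 - w1).

    sigma1, sigma2: volatilities (standard deviations) of the two assets.
    rho: correlation between the two assets' returns.
    """
    w2 = 1.0 - w1
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2 * w1 * w2 * rho * sigma1 * sigma2
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Hypothetical inputs: two assets, each with 20% volatility, held 50/50.
for rho in (1.0, 0.5, 0.0):
    vol = portfolio_volatility(0.5, 0.20, 0.20, rho)
    print(f"correlation={rho}: portfolio volatility={vol:.4f}")
```

With perfectly correlated assets the portfolio is as volatile as either asset alone. As the correlation falls, portfolio volatility drops below that of its components, which is the eggs-in-many-baskets effect in numbers.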

Portfolio optimization essentially seeks to provide a certain targeted return with the lowest volatility possible. It does so by combining various financial assets with…


Insights on decision-making under uncertainty from the HBO-Sky TV series

Chernobyl (2019) is a mesmerizing drama of human incompetence, ingenuity and courage in the face of disaster. The show’s analytical examination of the swirling confusion and haphazard response in the aftermath of an unprecedented catastrophe also offers valuable lessons for data scientists facing a dynamic situation amid uncertainty and incomplete information. Moreover, the show cast a scientist (a physicist no less!) as the main hero, and put the science and scientific reasoning front and center.

Discussions of data science all too often focus on the “how to do” of analytics and modelling, but neglect how we should make decisions when…


Using unsupervised machine learning to identify behavioral currency clusters

How might we deal with a jumble of jigsaw puzzle pieces? A tried-and-tested strategy is to start by spreading out the pieces so they may be taken in with a glance. Then we sort them into recognizable groups, first of edge pieces and then by color. Grouping the pieces according to some pre-defined shared attribute helps us perceive how they may relate to one another, and facilitates solving the puzzle. Clustering in data science follows a similar process.

Clustering seeks to find groups of objects such that the objects in a group are similar to one another, yet different from…


How relying on historical daily returns statistics may provide false inferences about drawdown risk in the S&P 500 Index

Financial markets are a cornucopia of delights for data scientists, disgorging endless reams of data. Yet that temptation can be a siren song for the unwary because financial market statistics are slippery and treacherous. Oftentimes they seem almost normal (as in distribution), but what appears as solid ground can dissolve swiftly into quicksand, as was seen earlier this year.

The most important stock market index in the world is the S&P 500 Index (SPX), which is widely regarded as the bellwether of the overall US equity market, and even of the state of the American economy. The SPX suffered a…


The gift of ultra-low interest rates in the pandemic

A few days ago investors lined up to pay the British government for the privilege of lending it money. Nearly $5 billion of UK Gilt bonds were issued at a yield of -0.003% on May 20. Let that sink in: a negative interest rate. If a person had even voiced this possibility out loud pre-2008, he/she would have been laughed out of town.

It’s still remarkable, but actually no longer a wholly novel situation given that the German and Japanese governments have been issuing negative-yielding debt for some time now. …


Crafting & testing a dynamic Recency-Frequency-Monetary (RFM) model

The 2012 news story that Target could predict its customers’ pregnancies was arguably a watershed moment in machine learning’s rise to mass consciousness. Apocryphal or not, the story went viral, swathed in those layers of wonder and anxiety that embody the popular view of AI. Most businesses do not require such intimate insights into customer behavior, but data science has revolutionized customer analytics.

The recency-frequency-monetary (RFM) customer segmentation model is one of the fundamental customer analytics frameworks. The three facets of the model are:
Recency: How recent was the last transaction (usually measured in days)?
Frequency: How frequent were the transactions (in…


Comparing models in a social media NLP challenge

Zen and the Art of Motorcycle Maintenance was one of my favorite books in college. Set amidst a father-son motorcycle journey across the United States, the book considers how to lead a meaningful life. Arguably, the key message expounded by the author, Robert Pirsig, is that we achieve excellence only when we are fully engaged, heart and mind, with the task at hand. If something is worth doing, then it is worth doing well.

At about the same time, research design and statistical inference courses drilled the importance of interpretability and parsimony into me. Communication is indeed widely stated (hoped?)…


A model combining prediction and statistical inference

The feature richness of the Ames housing dataset (2011) is alluring and bewildering in equal measure. It is easy to become entangled in its bountiful features while trying to uncover its patterns. It is first and foremost useful to understand that the Ames dataset fits into the long-established hedonic pricing approach to analyzing housing prices.

I had previously studied statistics/econometrics in some detail. Cognizant of the “big data” revolution and intrigued by its promise, I have immersed myself in coding and machine learning these past few months. It was in this process that I encountered the Ames housing dataset…

Alvin T. Tan

Financier by profession. Economist by training. Data scientist & essayist by inclination.
