Wrangling through Dataland

Image by Brandon Martinez from Pixabay

Finance workers have long relied on the trusty Excel spreadsheet. But times they are a-changin’, and there is a shift to Python among other coding languages. That said, I still use Excel on a daily basis myself, along with Python, and there are many things that Excel still does a lot easier and better, in my mind, such as making attractive multi-axes charts.

I’ve built a career in macro trading, and I will cover some of the top Python hacks I use regularly in my daily work. I will assume comfort with basic Python coding, and would orient this article…

Wrangling through Dataland

Picture by James Wainscoat from Unsplash

Much of what impels human behaviour is not directly observable. This is a common refrain in data analysis of social phenomenon, such as the socio-cultural drivers of education and income. Or in financial markets where many of the underlying factors that drive the buying or selling of particular securities at particular times are not observable. The metaphorical “bulls” and “bears” of the market are intangible yet perceptible at the same time, like the waft of a powerful scent.

Thankfully, much progress has been made to better estimate these unobservable or “latent” factors, though trying to understand what we find can…

Wrangling through Dataland

Image by Budarinphoto via Creative Market

Classification algorithms are often able to output predicted probabilities. Sometimes these predicted probabilities are of interest themselves, such as when assessing betting odds. Predicted probabilities may also help with imbalanced data by giving us the option of adjusting the classification threshold to improve model predictions.

This article discusses the differences in predicted probabilities across various machine learning algorithms, and how they may be used to boost the predictive power of these algorithms through a case study. The case study is a Portuguese bank marketing dataset, where the target variable is a “yes” or “no” subscription to a term deposit.


Wrangling through Dataland

Image by doroheine from Pixabay

We have little to no control over the actual return from our financial investments. We hope for a positive return, but there is no guarantee. The only thing we can control throughout the process is the volatility in the value of those investments, and we do this through diversification. It is the age-old wisdom of not putting all your eggs in one basket. This is the key insight and goal of Harry Markowitz’s mean-variance portfolio optimization.

Portfolio optimization essentially seeks to provide a certain targeted return with the lowest volatility possible. It does so by combining various financial assets with…

Wrangling through Dataland

Image by PublicDomainPictures from Pixabay

How might we deal with a jumble of jigsaw puzzle pieces? A tried-and-tested strategy is to start by spreading out the pieces so they may be taken in with a glance. Then we sort them into recognizable groups, first of edge pieces and then by color. Grouping the pieces according to some pre-defined shared attribute helps us perceive how they may relate to one another, and facilitates solving the puzzle. Clustering in data science follows a similar process.

Clustering seeks to find groups of objects such that the objects in a group are similar to one another, yet different from…

Wrangling through Dataland

Financial markets are a cornucopia of delights for data scientists, disgorging endless reams of data. Yet that temptation can be a siren song for the unwary because financial market statistics are slippery and treacherous. Oftentimes they seem almost normal (as in distribution), but what appears as solid ground can dissolve swiftly into quicksand, as was seen earlier this year.

The most important stock market index in the world is the S&P 500 Index (SPX), which is widely regarded as the bellweather of the overall US equity market, and even of the state of the American economy. The SPX suffered a…

The gift of ultra-low interest rates in the pandemic

A few days ago investors lined up to pay the British government for the privilege of lending it money. Nearly $5 billion of UK Gilt bonds were issued at a yield of -0.003% on May 20. Let that sink it: a negative interest rate. If a person had even voiced this possibility out loud pre-2008, he/she would have been laughed out of town.

It’s still remarkable, but actually no longer a wholly novel situation given that the German and Japanese governments have been issuing negative-yielding debt for some time now. …

Wrangling through Dataland

The 2012 news story that Target could predict its customers’ pregnancies was arguably a watershed moment in machine learning’s rise to mass consciousness. Apocryphal or not, the story went viral, swathed in those layers of wonder and anxiety that embody the popular view of AI. Most businesses do not require such intimate insights into customer behavior, but data science has revolutionized customer analytics.

The recency-frequency-monetary (RFM) customer segmentation model is one of the fundamental customer analytics frameworks. The three facets of the model are:
Recency: How recent was the last transaction (usually measured in days)?
Frequency: How frequent were the transactions (in…

Wrangling through Dataland

Zen and the Art of Motorcycle Maintenance was one of my favorite books in college. Set amidst a father-son motorcycle journey across the United States, the book considers how to lead a meaningful life. Arguably, the key message expounded by the author, Robert Pirsig, is that we achieve excellence only when we are fully engaged, heart and mind, with the task at hand. If something is worth doing, then it is worth doing well.

At about the same time, research design and statistical inference courses drilled the importance of interpretability and parsimony into me. Communication is indeed widely stated (hoped?)…

Wrangling through Dataland

Image of Ames, Iowa by Tim Kiser from Creative Commons

The feature richness of the Ames housing dataset (2011) is both alluring and bewildering in equal measure. It is easy to become entagled in its bountiful features while trying to uncover its patterns. It is first and foremost useful to understand that the Ames dataset fits into the long-established hedonic pricing method to analyzing housing prices. Some domain knowledge will go a long way.

I had previously studied statistics/econometrics in some detail. Cognizant of the “big data” revolution and intrigued by its promise, I have immersed myself in coding and machine learning these past few months. It was in this…

Alvin T. Tan

Financier by profession. Economist by training. Data scientist & essayist by inclination.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store