In this week’s Python Data Weekly Roundup:
This article provides a good overview of the Data Natives 2019 – Europe meeting and the main trends being discussed for 2020 and beyond. For example, topics such as “AI and its use in Healthcare” and “AI and Ethics” looked like good talks.
An excellent review of “Ray”, a distributed computing system for Python. Ray is:
an open-source system for scaling Python applications from single machines to large clusters. Its design is driven by the unique needs of next-generation ML/AI systems, which face several unique challenges, including diverse computational patterns, management of distributed, evolving state, and the desire to address all those needs with minimal programming effort.
As always, Jason Brownlee does a great job of explaining how to begin building an intuition for identifying imbalanced and skewed class distributions, and how to handle them. One of the most difficult things to do in data science and machine learning is to understand and manage data with different distributions. You can’t always apply a model to a dataset, because the distribution of that data can make the model invalid.
Figure: Scatter Plot of Binary Classification Dataset With a 1 to 10 Class Distribution (from the article above).
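To make the 1 to 10 class distribution concrete, here is a minimal sketch using only the Python standard library. It is not code from Jason Brownlee's article; the function names (`make_imbalanced_labels`, `class_distribution`, `majority_baseline_accuracy`) and the sample sizes are my own assumptions for illustration. It builds a label set with ten majority samples per minority sample and shows why plain accuracy can mislead on such data.

```python
import random
from collections import Counter


def make_imbalanced_labels(n_minority, ratio, seed=0):
    """Return shuffled 0/1 labels with `ratio` majority (0) samples per minority (1) sample.

    Hypothetical helper for illustration only.
    """
    labels = [1] * n_minority + [0] * (n_minority * ratio)
    random.Random(seed).shuffle(labels)
    return labels


def class_distribution(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# 100 minority samples and 1,000 majority samples: a 1 to 10 distribution.
labels = make_imbalanced_labels(n_minority=100, ratio=10)
dist = class_distribution(labels)
baseline = majority_baseline_accuracy(labels)

print(f"class fractions: {dist}")
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

A classifier that never predicts the minority class still scores about 91% accuracy here, which is one reason to look at the class distribution before choosing a model or a metric.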
Seven differences between academia and industry for building machine learning and deep learning models
It should be no surprise that academia and industry approach data science and machine learning differently. This article describes several of those differences, including accuracy, training vs. production, engineering focus (e.g., end-to-end pipeline development) and more.
A very good paper describing the challenges of technical debt in machine learning systems. The abstract:
Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
A recent post I wrote describing how to perform market basket analysis using Python and pandas. I provide a walk-through of using MLxtend’s apriori function as well as a ‘roll your own’ approach to market basket analysis.
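For a flavor of what a ‘roll your own’ approach can look like, here is a minimal sketch in plain Python. It is not the code from the post and it does not use MLxtend or pandas; the toy `transactions` data and the `pair_rules` helper are assumptions made up for this example. It computes the three standard association-rule metrics (support, confidence and lift) for item pairs only, rather than running the full Apriori algorithm.

```python
from collections import Counter
from itertools import combinations

# Toy baskets, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def pair_rules(transactions, min_support=0.4):
    """Return rules 'antecedent -> consequent' for item pairs meeting min_support.

    Hypothetical helper: pairs only, no larger itemsets.
    """
    n = len(transactions)
    # How many baskets contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many baskets contain each pair of items (sorted so (a, b) is canonical).
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            # P(consequent | antecedent)
            confidence = count / item_counts[antecedent]
            # Confidence relative to how common the consequent is overall.
            lift = confidence / (item_counts[consequent] / n)
            rules.append(
                {
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": support,
                    "confidence": confidence,
                    "lift": lift,
                }
            )
    return rules


rules = pair_rules(transactions, min_support=0.4)
for rule in sorted(rules, key=lambda r: r["lift"], reverse=True):
    print(
        f"{rule['antecedent']} -> {rule['consequent']}: "
        f"support={rule['support']:.2f}, "
        f"confidence={rule['confidence']:.2f}, lift={rule['lift']:.2f}"
    )
```

A lift above 1 suggests the two items appear together more often than they would if they were independent. On real data you would likely reach for a library implementation such as MLxtend’s, since counting every pair by hand does not scale to larger itemsets.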