Thoughts on data science, statistics and machine learning.
Notes on Optimizing Torch Models
ML researchers are from Mars and the ML engineers responsible for deploying models are from Venus. The two have vastly different motivations. The ML researcher’s job, given a dataset and some compute, is to find the lowest possible loss on a task. In this pursuit, no engineering cost is too high. No tech debt is too large. Worse still, if they get published, they must include their code in their paper. Reproducible research means only that everyone should be able to reproduce the benchmarks and charts from the original paper. It usually has very little to do with production.
The Bridge of Asses: Learning Coding with Novices
Over the last few years, I have been deeply involved with the IIT-M Programme in Data Science & Applications - as a student, a mentor and an analytics consultant. The programme provides diplomas and bachelor’s degrees in data science and applications. I’m often asked why I’m so invested in the programme - especially since I’m already an experienced data scientist.
At least three people are mad at me for being in the programme. One of them thinks that I’ve unfairly claimed a higher education seat I don’t need - which would be true if this were a conventional programme with limited seats. But it’s a MOOC.
Book Review: Invisible Women by Caroline Criado Perez
I learnt long ago that throwing data at people doesn’t change their opinions. After reading this book, I’m inclined to think that I might have been wrong. I wish I could carry around several hardbound editions of this book and throw them at anyone who says or does anything sexist. A well-aimed hardback to the bridge of the nose could work magic.
And it would count as throwing data, too. Of the 400 pages of this book, 70 are just the endnotes. Every other chapter has almost a hundred references. It’s an insanely well-researched book about the pervasive gender data gap (not the gender gap).
Bayesian Storytelling
I launched a newsletter yesterday. So far, the feedback has been good. A few readers said that they felt drawn in by the writing. In any case, the purpose of the first few posts is simply to get myself warmed up. Any extra flutter the posts generate is a bonus. Amit Varma recommends not looking at the stats for a couple of years.
It taught me quite a few things. Particularly that I need to be smart about the data analysis. It’s important to remember that this isn’t a work project. It’s meant for a general audience, so there is such a thing as too much detail. For example, detailed handwritten notes on every single table in the HCES aren’t important. You could have dealt with each table independently. Focusing only on the tables needed for the problem in question should have sufficed. On the other hand, cleaning and denormalizing the data and releasing it on GitHub was a good idea. The tweet announcing this has 10 reposts, 69 likes, 52 bookmarks, and has gained me 12 new followers. This is what Austin Kleon calls “showing your work” (Mahima Vashist says “your journal is your art”). Perhaps one needs to think carefully about what shareable and useful assets can be created in the service of a larger, transcending work.