Thoughts on data science, statistics and machine learning.
The PlotCaptions Dataset: Automating the Narration of Visual Analytics
I’ve always been interested in how we narrate visual analytics. The hardest task in dataviz is not analysis or visualization, but figuring out what to say about it. I used to believe that a well-designed chart does not need narration. That may be valid, but over the years I’ve realized that it is the narrative that turns something that is merely pretty and insightful into something that is viral. Imagine one of Hans Rosling’s talks without him in the picture.
Notes on Optimizing Torch Models
ML researchers are from Mars and the ML engineers responsible for deploying models are from Venus. The two have vastly different motivations. The ML researcher’s job, given a dataset and some compute, is to find the lowest possible loss on a task. In this pursuit, no engineering cost is too high. No tech debt is too large. Worse still, if they get published, they must include their code in their paper. Reproducible research only means that everyone should be able to reproduce benchmarks and charts from the original paper. It usually has very little to do with production.
The Bridge of Asses: Learning Coding with Novices
Over the last few years, I have been deeply involved with the IIT-M Programme in Data Science & Applications - as a student, a mentor and an analytics consultant. The programme provides diplomas and bachelor’s degrees in data science and applications. I’m often asked why I’m so invested in the programme - especially since I’m already an experienced data scientist.
At least three people are mad at me for being in the programme. One of them thinks that I’ve unfairly claimed a higher education seat I don’t need - which would be true if this were a conventional programme with limited seats. But it’s a MOOC.
Book Review: Invisible Women by Caroline Criado Perez
I learnt long ago that throwing data at people doesn’t change their opinions. After reading this book, I’m inclined to think that I might have been wrong. I wish I could carry around several hardbound editions of this book and throw them at anyone who says or does anything sexist. A well-aimed hardback to the bridge of the nose could work magic.
And it would count as throwing data, too. Of the 400 pages of this book, 70 are just the endnotes. Every other chapter has almost a hundred references. It’s an insanely well-researched book about the pervasive gender data gap (not the gender gap).