NormConf 2022 Notes

I spent part of my holiday going through the backlog of all the great NormConf talks. There were some awesome topics discussed by great folks in the data community. I compiled my notes and figured I’d post them here. I don’t have notes for every talk, so be sure to check out https://normconf.com/ for info on all the talks and speakers.

Two themes I found consistent throughout many of the talks:

  • Most people are working on very unsexy but super important data work within their companies
  • Get really good at the fundamentals; it will make you better at the advanced stuff

Group by Statements that Save the Day

by Vincent D. Warmerdam
  • it’s unlikely an ML finding will surprise you, but data visualization findings can
  • e.g. what if there are dead chickens in the dataset? A simple group-by can reveal them
  • he made a library called doubtlab for trying to find bad labels in your data
  • worry less about “must have skills”, focus on fundamentals
  • you can’t get a certificate in common sense and critical thinking
  • https://deon.drivendata.org/examples/
    • he mentions the above as a checklist to run through before pushing something to prod
  • write more TIL blogs

Five semesters of linear algebra and all I do is solve Python dependency problems

by Tim Hopper
  • reflection on his own interests in his career and how they’ve shifted
  • your career is unlikely to follow the path you think it will

NLP Tips and Tricks

by Lynn Cherny
  • recommends UMAP for data exploration (a minimal sketch follows this list)
  • the string_grouper library can be used for solving string similarity problems
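
To make the UMAP note concrete, here’s a minimal sketch using the umap-learn package on a toy scikit-learn dataset (the dataset and parameters are my own choices, not from the talk):

```python
# A minimal sketch, assuming the umap-learn package (pip install umap-learn).
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Project the 64-dimensional digit images down to 2D for visual exploration.
reducer = umap.UMAP(n_components=2, random_state=42)
embedding = reducer.fit_transform(X)  # shape (n_samples, 2), ready to scatter-plot
print(embedding.shape)
```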

How Small Can I get that Docker Container

by Matthijs Brouns
  • a .dockerignore file is used similarly to a .gitignore
  • dive is a tool for inspecting the layers that make up your Docker image
  • every RUN statement in your Dockerfile creates a new layer in the Docker image
    • each layer is essentially a diff of the previous layer
    • you can add and delete stuff in the same RUN statement to save space (see the sketch below)
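
Here’s a hypothetical Dockerfile fragment illustrating the layer point (the base image and packages are my own example, not from the talk):

```dockerfile
FROM python:3.11-slim

# Bad: two RUN statements. The apt cache gets baked into the first layer,
# and deleting it in a second layer does not shrink the image.
# RUN apt-get update && apt-get install -y build-essential
# RUN rm -rf /var/lib/apt/lists/*

# Better: add and delete in the same RUN statement, so the layer that ships
# never contains the cache at all.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
```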

Spark Horror Stories from the Field

by Guenia Izquierdo
  • modularize your code
  • write unit tests
  • remove unnecessary code from files that will live in prod
  • I’ve never had the chance to use Spark before, so not much else to add from this talk

Geriatric Data Science: Life after Senior

by Luca Belli
  • the talk is about the IC vs leadership ladder for data scientists
  • ICs don’t typically get the same level of leadership training as managers, but there’s still some expectation for them to lead
    • just because you’re an IC doesn’t mean you can ignore leadership development
  • the higher you climb the IC ladder, the more people will expect management-type skills from you
    • be mindful of this and spend time developing those skills

Hack Your Way to a Better API

by Zachary Blackwood
  • the internals of the software you’re using may be more accessible than you think
  • monkey-patching
    • updating or changing code of a piece of software at runtime
  • when something isn’t working in imported code, start debugging by looking at its docstring
    • in your IDE this should actually bring you to the file that is being run…you have the ability to change this file if you’d like
    • it’s not best practice to do this at the source level, because you’re unlikely to document it and it will disappear when you update your packages
    • the other option is to write your new, desired function in your own file and overwrite the old one at runtime (see the sketch after this list)
  • the developer tools you find in Chrome can also be found in many apps you use, including VS Code
  • he uses a website called curlconverter.com to convert cURL commands into the language/library of your choice
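
A minimal sketch of the monkey-patching idea, using Python’s json.dumps as a stand-in target (my example, not one from the talk):

```python
# Monkey-patching: swap a function out at runtime from your own file,
# instead of editing the installed source.
import json

_original_dumps = json.dumps

def patched_dumps(obj, **kwargs):
    kwargs.setdefault("indent", 2)  # change a default you don't like
    return _original_dumps(obj, **kwargs)  # delegate to the original

json.dumps = patched_dumps  # every caller from here on gets the patched version

print(json.dumps({"a": 1}))  # now pretty-printed by default
```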

What’s the simplest possible thing that might work, why didn’t you try that first?

by Joel Grus
  • his favorite question to ask
  • in 2022 implementing BERT models can be simple, even though the model is more sophisticated than logistic regression or Naive Bayes
  • as our tools get better the boundary between complex and simple changes
  • people create systems that abstract away complexity, this is different than dumbing something down
  • think about ways to abstract complexity away from things you are often doing
  • simplicity is something we only discover through experience and with confidence
    • simplicity is not the sign of a newbie

It’s all about cost: How to think about Machine Learning Products

by Peter Sobot
  • engineering is about building the best thing you can given the constraints of the problem

ML doesn’t always replace rules, sometimes they work together

by Jeremy Jordan
  • Traditional approach for ML
    • first deploy a heuristic approach
      • Rules based approach
      • e.g. for a spam filter you can give it specific words to look for. Then, as time goes on, you can monitor user behavior to see how they label messages as spam or pull messages out of the spam folder
    • once you have labelled data you can take a more ML approach
  • rules plus ML can give you much greater results than either approach used independently
  • combine the two systems in a policy layer (see the sketch after this list)
    • could be as simple as an OR statement, i.e. if either system evaluates to True then flag it
  • have an evaluation set that you can use to more easily test versions of your system against
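
A toy sketch of that OR-style policy layer (the blocklist, threshold, and function names here are hypothetical, not from the talk):

```python
from typing import Callable

def rule_says_spam(message: str) -> bool:
    """Rules-based check: flag messages containing blocklisted phrases."""
    blocklist = ("free money", "act now")  # hypothetical blocklist
    return any(phrase in message.lower() for phrase in blocklist)

def policy_layer(message: str, model_score: Callable[[str], float],
                 threshold: float = 0.8) -> bool:
    """OR policy: spam if the rules fire or the ML model is confident enough."""
    return rule_says_spam(message) or model_score(message) >= threshold

# Usage with a stub standing in for a trained classifier's score:
print(policy_layer("FREE MONEY inside!", model_score=lambda m: 0.1))  # True (rule fires)
print(policy_layer("hello there", model_score=lambda m: 0.95))        # True (model fires)
```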

All my machine learning problems are actually data management problems

by Shreya Shankar
  • most ML failures happen outside of the business logic that runs the ML algorithms
  • assumptions that exist in dev/training do not always translate to production
  • telemetry from prod systems is important

Ethan Rosenthal and the M1 misadventure

by Ethan Rosenthal
  • he has a Medium article about managing Python environments for data science
  • my takeaway: Python dependency and environment management are still a mess
  • reproducibility of an environment is important. If you need to revisit an analysis from 6 months ago, you should be able to get it up and running easily

Data is the new coffee

by Peter Baumgartner
  • talks about practices for annotating data for data science purposes
  • calibration and agreement
    • each of your annotators should be on the same page about what each criterion or label means
    • how often multiple people agree to give the same case the same label is important
    • want to have annotation guidelines that give people some structure on what to do
  • as you encounter more data it is normal to see your annotation task drift a little bit
    • don’t expect to get it right the first time
  • you don’t necessarily need to limit your annotation task to a subset of domain experts
  • it will take iterations until you get everyone on the same page and your annotated dataset becomes “gold” level
    • “it’s going to take longer than you think”
  • an inter-annotator correlation of 0.9 or above is a very rigorous bar (see the agreement sketch below)
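
One common way to quantify agreement between a pair of annotators is Cohen’s kappa from scikit-learn (my example, not from the talk; the labels below are made up):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same six examples.
annotator_a = ["spam", "ham", "spam", "ham", "spam", "spam"]
annotator_b = ["spam", "ham", "spam", "spam", "spam", "ham"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```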

How to Translate to PM speak and back

by Katie Bauer
  • early on she would give PMs too much detail, which they didn’t seem to like
  • “assume good intent”, believe that you and the person you are working with have each other’s best interests at heart
    • she later amended this to “assume good intent, but consider incentives”
  • product managers speak a language of progress
  • most questions PMs ask you are implicitly causal.
    • Focus your work on things they can control
    • prioritize logical consistency over being technically correct
    • describe your results as inputs and outputs
  • plan your work according to their positioning
    • while not all environments are going to be cutthroat, PMs are inherently competing with other PMs
  • accept that no translation will be perfect
  • your suggestions will be used more as guardrails than taken as gospel
  • pick your battles when applying a lot of rigor to a situation; save it for when the stakes are high
  • arming PMs with data they can take and apply in a different context is valuable for getting them bought into the idea of data
  • she likes a hub-and-spoke style data team structure

Tracer Bullets and Working Backwards: Simple Frameworks for Solving Problems

by Caitlin Hudon
  • premortems are a good way to tackle known unknowns
    • can do on your own or ask SMEs questions to fill out this framework
  • Tracer bullets can be used in the unknown unknown domain
    • they give real time feedback
    • use the minimum amount of code to get to the next step of the project
    • different from a prototype, because tracer bullets are more of an along-the-way, iterative process
    • expounded on further in the book “The Pragmatic Programmer”
  • overall goal is to use frameworks to increase the amount of feedback you are getting throughout all stages of development
  • also recommends the book “Thinking in Bets” by Annie Duke

Building an HTTPS Model API for Cheap

by Ben Labaschin
  • we do not have enough time
  • weigh the trade-offs of your tools before choosing software
  • normy software
    • reliable
    • an investment
    • easy to learn

Data’s Desire Paths

by James Kirk
  • the best way to think about recommender systems is as a desire path
    • a desire path is the phenomenon of a college campus not putting down a walkway until it sees the path people take to cut across the grass
  • a healthy recommender project has
    • clearly defined users
    • a measurable definition of success
    • a clear relationship between recommender success and business success
    • data and a tech stack ready to implement and iterate on recommendations
  • types of recommendations (a toy sketch of the first kind follows this list):
    • basic recommendations
      • you are on webpage X so we will point you toward webpage Y
    • personalization
    • omakase
      • “hey Alexa, play music”
  • don’t be afraid to pre-calculate recommendations as you scale up
  • recommends a book by Kim Falk (Practical Recommender Systems) as an intro to recommenders
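
A toy sketch of the “basic recommendations” idea: item-to-item co-occurrence over made-up browsing sessions (everything here is my own illustration, not from the talk):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sessions: which pages each visitor viewed together.
sessions = [["X", "Y"], ["X", "Y", "Z"], ["X", "Z"]]

# Count how often each pair of pages shows up in the same session.
co_counts = defaultdict(lambda: defaultdict(int))
for session in sessions:
    for a, b in combinations(sorted(set(session)), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def recommend(page, k=2):
    # Tables like this can be pre-calculated offline as you scale up,
    # per the talk's advice.
    ranked = sorted(co_counts[page].items(), key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:k]]

print(recommend("X"))  # e.g. ['Y', 'Z']
```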

The Zen of Tedium

by Brandon Rohrer
  • you only have so many hours to be productive within a day
  • there are trade-offs for doing things the hard way vs automating vs doing tedious work

How should I represent the intermediate thing?

by Brianna McHorse
  • data structures are clear at the beginning and end, but things are less clearly defined in the interim
  • care about performance later. You can only do so many things at once
  • Heuristics for choosing data structures (a toy sketch follows this list)
    • dict: I am a human with a human brain
    • defaultdict: I am a human and I want to add things as I go
    • class: I am a human and I’m very sure about what’s going into this object
    • list: I have several things, I want to sort them, and I don’t mind if they change
    • tuple: I only have a few things, they’re not going to change, I don’t need to access them
    • namedtuple: ??? maybe if things really need to be immutable
      • in most cases you can just use a dict
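
A toy sketch contrasting a few of those heuristics (all data here is made up):

```python
from collections import defaultdict, namedtuple

# dict: a small, human-readable mapping you fill in up front
config = {"host": "localhost", "port": 8080}

# defaultdict: accumulate as you go, no key-existence checks needed
counts = defaultdict(int)
for word in ["a", "b", "a"]:
    counts[word] += 1

# tuple: a few fixed things that won't change
point = (3, 4)

# namedtuple: immutable like a tuple, but with readable field access
Point = namedtuple("Point", ["x", "y"])
p = Point(x=3, y=4)

print(config["port"], dict(counts), point, p.x)
```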

I’d have written a shorter solution but I didn’t have the time

by JD Long
  • tells a story about a study where people’s only thought was to add something, until they were prompted that removing something is an option too
    • additive ideas come to mind quickly, but subtractive ideas require more cognitive effort
    • e.g. if you give someone a recipe and ask how to make it better, almost no one will attempt to remove something
  • the MVP model is a subtractive priming prompt
  • writing a reproducible example (reprex) is a critical tech skill (a toy reprex follows this list)
    • i.e. a minimal reproducible example
    • reprex debugging is akin to rubber duck debugging
    • it helps remove the noise of everything else involved and isolate just what your problem is
  • don’t try to boil the ocean, build things thrice
    • the first time there are bugs, the second time you can avoid them, the third time is when you make it pretty
  • on his team of analysts the top two success criteria are two sides of the same coin
    • ask a lot of questions
    • don’t try to fake it if you don’t know it
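
As a toy example of the reprex traits above: self-contained imports, minimal inline data, and the surprising behavior shown directly (the pandas question is my own invention):

```python
import pandas as pd

# Minimal inline data, no dependence on my real dataset or environment.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, None, 3]})

# Question: why does count() disagree with the number of rows per group?
print(df.groupby("group")["value"].count())  # counts non-null values: a=1, b=1
print(df.groupby("group")["value"].size())   # counts rows:            a=2, b=1
```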

Just use one big machine for model training and inference

by Josh Wills
  • be careful what you get good at
  • using one big machine is a strategy for keeping things simple
  • htop is a Unix tool to see the underlying processes running on a machine
    • can combine it with tail
  • he is a DuckDB enthusiast

Data Driven Promotions

by Rose Wiegley
  • using data to move up at your company
    • what matters?
      • make a responsibility matrix, if your company doesn’t already have anything like this
        • idea of proving you are already doing the next level’s job before being promoted
    • track progress
      • keep an ongoing brag document
        • organize it by the same categories that are in your responsibility matrix
        • should be your map of where you are and where you need to go
        • be proactive about this
    • frame your review

Don’t Do Invisible Work

by Chris Albon
  • record your work and tell people about it
  • if you don’t consciously spend time tracking your work, you won’t remember it and it will be forgotten
    • if it’s not remembered it’s like it never happened
  • no one is going to do this for you
  • work that tends to be invisible
    • mentorship
    • ad-hoc work
  • he just uses an activity log: keeps a text file open all day and writes a bunch of one-line entries
    • dump in anything that might be useful
  • the goal is that if someone asks your boss what you did/do, they have a deep well of concrete examples to choose from