A People Data Conundrum and Over-Dependence on Demographics

Tanushree Datta
4 min read · Jun 20, 2021

I was asked to work on people churn a couple of months ago, with a young team, and I’ve struggled more than I’d care to admit.

For one, when asked to use AI to solve a people churn problem, the whole team, me included of course, went on a googling spree. And yes, we found a ton of models — churn prediction, survival analysis, employee survey analysis, sentiment and topic models, logistic regression, random forests, boosting, GPT, community detection, LIME, $Value determination; everything that’s out there!

Only — all we had were demographic data, transactional data, and reference data.

If you’ve taken a statistics class, I’m sure the same alarm bells rang for you too. In school you’re taught the best practice of spending real time building your null and alternative hypotheses.

Predicting churn using demographic and personal attributes like location or marital status? What does it mean when someone tells you that 72% of the ‘reason’ your people are leaving is explained by the fact that they work in Tara Toma City? Why would someone churn because of that? There are two million other people working in that city. And what should you do now? Shut down operations?

BI & Reporting dashboards should be providing this kind of output, not predictive AI!

‘City’ is where it’s happening, not why. Nothing is wrong with the model, really. It still has 90% out-of-sample accuracy. Ironic? Not really. The model attributes group explainability to City because, of everything we fed it, City is the closest proxy for the real reasons. It is not a ‘reason’ for churn, but it may have captured the effect of more real, latent variables. People churn because of compensation, perks, growth, lack of creativity, trust in management, better market demand and external opportunities, work overload, autonomy, internal culture and transparency, company integrity, right fit, even tactical problems like childcare, distance to work, or health issues. Let’s not ignore the human facet either: people also quit because of envy, anger, unhealthy competitiveness, unfair treatment, isolation and so on.

Maybe this City embodies many of these real reasons? And maybe that 72% should really be divided among those latent variables.

The model isn’t wrong. The math still works. Automated feature selection still surfaces the best of whatever features we throw at it. Where we probably went wrong was our quick assumption that any data that happens to be available CAN and SHOULD be modelled. Demographics are certainly easy to source. I mean, which company has data on how an employee is ‘feeling’, or whether he or she feels ignored or undervalued, right? (My opinion on the statistical red flags of using those employee surveys is another post altogether.)

Then there is the very questionable assumption that this is even something that can be modelled. AI is not yet a sentient science; even cognitive AI is not particularly mature. And unfortunately churn is not necessarily a rational decision taken by a stable mind. A lot of churn is the outcome of emotional, potentially irrational reasons. Employees stay with a company because they are happy there, or because they have some compulsion. What variables define that happiness or those compulsions?

Unfortunately, model accuracy is my problem, not my customer’s. For her, the challenge is real-life actionability, not model accuracy. I can show her high accuracy, but she sees low actionability, and that makes her lose faith in models.

So that’s when we went back to the drawing board and spent time building the hypotheses. We found believable factors, then set up a data collection process, and only then started modelling.

There were many challenges, especially while selecting alternative data. When substituting a variable with an alternative, it is crucial to assess how strongly the two are correlated (futures traders from the F&O world would call this cross-hedging correlation). Alternative data can be extremely useful, but it can just as easily ruin the whole outcome.
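To make that concrete, here is a minimal sketch of the kind of proxy check this implies: before substituting, measure how strongly the candidate alternative moves with the true variable on whatever subsample has both. The variable names, synthetic data, and the 0.6 cut-off below are illustrative, not from our actual pipeline.

```python
# Minimal proxy-validation sketch: how well does an alternative variable
# track the true (scarce) variable on the overlap sample?
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_var = rng.normal(size=200)                            # scarce "true" signal
proxy = 0.8 * true_var + rng.normal(scale=0.5, size=200)   # candidate alternative

rho, p_value = spearmanr(true_var, proxy)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

# Arbitrary illustrative cut-off: a weak proxy can do more harm than good
if abs(rho) < 0.6:
    print("Weak proxy: substituting it may ruin the outcome.")
```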

Eventually we did get to a reliable model. We modelled churn using some of the actual reasons for churn, and where we could not find true, reliable data, we ran an extensive alternative-data analysis exercise. We looked at employee work hours, number of mails sent out per day, survey results (adjusted for a ‘lot’ of statistical biases), participation in non-work activities, rating discord and so on. We even went back to the customer and suggested they start collecting some of these new data points going forward.

We now have a set of 25 reliable features, on which we did a massive transformation and clean-up exercise: we adjusted class imbalances, dropped some variables, imputed others, and, for the unstructured part, reworked regexes and ran custom NER. Then we created the solution comprising 4–5 use cases:
1. Churn prediction (group and individual) — CatBoost and SHAP (first sketch after this list).
2. Sentiment and topic models — VADER and BERTopic, visualized using Jason Kessler’s Scattertext (second sketch below).
3. Business Value Indicator — Replacement Costs
4. High-risk cohorts (detecting Associative Purge) — simple graph-based community detection, on a different set of variables: mails exchanged, internal publications, peer awards, and other connectedness and popularity metrics (third sketch below).
5. Employee journey depiction — fairly complex because of the multitude of factors that can alleviate or aggravate feelings. Still a work in progress; I will post an update once I’ve built it.
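For use case 1, here is a minimal sketch of the CatBoost-plus-SHAP pattern on synthetic data. The feature names and the synthetic churn rule below are placeholders, not our actual 25 features.

```python
# Minimal churn-prediction sketch: CatBoost for the model, SHAP for
# group-level and individual-level explanations.
import numpy as np
import pandas as pd
import shap
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "hours_logged": rng.normal(45, 8, n),       # placeholder features
    "mails_per_day": rng.poisson(30, n),
    "survey_score": rng.uniform(1, 5, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Synthetic label: churn driven by overwork plus low survey scores
y = ((X["hours_logged"] > 50) & (X["survey_score"] < 2.5)).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = CatBoostClassifier(iterations=300, depth=4, eval_metric="AUC", verbose=False)
model.fit(X_train, y_train, cat_features=["city"], eval_set=(X_val, y_val))

# Group-level explainability: mean |SHAP| per feature across the sample;
# each row of shap_values explains one individual employee's score.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(Pool(X_val, y_val, cat_features=["city"]))
shap.summary_plot(shap_values, X_val, plot_type="bar")
```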
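For use case 2, a minimal sketch of sentiment scoring with VADER and topic extraction with BERTopic. Since I cannot share survey text, the 20 newsgroups corpus stands in for free-text responses here; the Scattertext visualization step is omitted.

```python
# Minimal sentiment + topic sketch: VADER for polarity, BERTopic for topics.
from sklearn.datasets import fetch_20newsgroups
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from bertopic import BERTopic

# Stand-in corpus; in practice, free-text survey responses or exit comments
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:500]

# Per-document sentiment: VADER's compound score lies in [-1, 1]
analyzer = SentimentIntensityAnalyzer()
sentiment = [analyzer.polarity_scores(d)["compound"] for d in docs]

# Unsupervised topics over the same corpus
topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```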
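And for use case 4, a minimal sketch of graph-based community detection over a mails-exchanged graph, using networkx’s greedy modularity algorithm; the names, weights, and specific algorithm choice here are illustrative.

```python
# Minimal associative-risk sketch: find tight communities in a mail graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are employees; edge weight = mails exchanged (hypothetical numbers)
edges = [("ana", "ben", 40), ("ben", "cruz", 35), ("ana", "cruz", 28),
         ("dev", "eli", 50), ("eli", "fay", 45), ("dev", "fay", 30),
         ("cruz", "dev", 2)]   # weak bridge between two tight groups
G = nx.Graph()
G.add_weighted_edges_from(edges)

for i, community in enumerate(greedy_modularity_communities(G, weight="weight")):
    print(f"community {i}: {sorted(community)}")

# If one member of a tight community has already resigned, the rest of that
# community becomes a candidate high-risk cohort.
```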

Survival analysis — we started down this path but dropped it because it didn’t look relevant.
(It is meant to estimate time to an event, classically death. The output variable should ideally be a random event outside the subject’s control, not a choice variable. Moreover, employee churn depends not just on an employee’s past tenure, but also on the very important question of when these people will find an alternative offer. Is there really a ‘true mean’ lead time to finding an alternative? You can derive an average from any bunch of numbers, but for it to mean something it needs to be representative of the modelled population, and that kind of homogeneity would require all employees to follow the same patterns of timely exits.)
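For readers unfamiliar with the technique, this is roughly where we had started before dropping it: a Kaplan-Meier fit of retention over tenure, sketched with the lifelines library on synthetic data.

```python
# Minimal survival-analysis sketch: Kaplan-Meier estimate of P(tenure > t).
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
tenure_months = rng.exponential(scale=36.0, size=300)  # observed tenure so far
churned = rng.random(300) < 0.4   # True where the exit event was observed

kmf = KaplanMeierFitter()
kmf.fit(tenure_months, event_observed=churned)
print(kmf.median_survival_time_)  # the "representative" lead time the text questions
kmf.plot_survival_function()      # estimated probability of still being employed at t
```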

And we’re still working with the customer to test it as we tweak and improve it.

My learnings from all this —
A. In a tearing hurry to model data, we often neglect the fact that we should be modelling ‘true’ relationships.
B. There is way too much glamour in what I’m going to call ‘Algorithmania’, and modellers end up paying too little attention to the basic hygiene of data engineering and feature selection.
