Einstein Studio occupies an interesting spot inside Salesforce. It is the supervised-learning surface where custom predictive models are built on top of Data Cloud objects, and the promise is that an administrator with a clear business question and reasonably clean data can train a model, evaluate it honestly, and deploy the scores back into records without ever leaving the platform. The promise largely holds. What teams skip are the unglamorous steps that make a model trustworthy, and a model nobody trusts is a model nobody uses.
STAGE 1: Frame the question
A useful model answers a specific business question that someone is willing to act on. "Predict churn" is not useful. "Score each customer's likelihood of cancelling in the next ninety days, so the retention team can prioritise outreach" is. Both halves of the better question are mandatory: the first names what the model predicts, the second names what someone does with the prediction. Without the second half, the model is decoration. Framing it is a half-day workshop with the team who will act on the output, where you write the action workflow first and derive the prediction it needs.
STAGE 2: Prepare the data
This is where most attempts die. The data has to live in a Data Cloud object, well defined, with a clear time boundary, because the model learns from labelled history. For every row you need the inputs as they were at a moment in time and the outcome as it eventually played out. The common killer is leakage: a field that gets populated after the outcome is known sneaks into training, the model looks miraculous in evaluation, and then performs no better than chance in production. Audit every input with one question. Was this value already known at the moment the prediction would be made? If not, exclude it. For most use cases you want at least a thousand labelled examples, and five thousand is comfortable.
The miraculous model in evaluation is almost always a leak. Audit every column against one question: was this known at prediction time?
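The audit question can be sketched as a simple date comparison. This is a minimal illustration, not an Einstein Studio feature: the `FieldValue` type, the field names, and the dates are all hypothetical, and it assumes you can establish when each candidate input was populated on the record.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FieldValue:
    """A candidate model input (hypothetical structure for illustration)."""
    name: str
    populated_on: date  # when the value was written to the record


def split_by_leakage(fields, prediction_date):
    """Separate inputs known at prediction time from those populated later."""
    safe = [f.name for f in fields if f.populated_on <= prediction_date]
    leaky = [f.name for f in fields if f.populated_on > prediction_date]
    return safe, leaky


# Hypothetical inputs for one customer, scored on 1 March 2024.
fields = [
    FieldValue("tenure_months", date(2024, 1, 10)),
    FieldValue("support_cases_90d", date(2024, 2, 28)),
    FieldValue("cancellation_reason", date(2024, 4, 15)),  # written after the outcome
]

safe, leaky = split_by_leakage(fields, prediction_date=date(2024, 3, 1))
print("keep:", safe)
print("exclude:", leaky)
```

Any field that lands in the exclude list was not known when the prediction would have been made, so it stays out of training however predictive it looks.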
STAGE 3: Train
Einstein Studio tries several algorithms automatically (logistic regression, gradient-boosted trees, and neural networks for some problem types) and returns a leaderboard. Pick by two criteria. First, performance on the held-out validation set. Second, interpretability, because a model that scores marginally better but is a black box is often the wrong choice in a Salesforce context, where the person receiving the score wants to know why it is high or low. Gradient-boosted trees with feature importance give you both the signal and the explanation, which is usually the right trade.
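One way to make the two criteria explicit is to prefer an interpretable model whenever it sits within a small tolerance of the best validation score. The sketch below is an assumption about how you might encode that rule yourself. The leaderboard rows, the scores, the `interpretable` flag, and the one-point tolerance are invented for illustration and do not come from Einstein Studio.

```python
def pick_model(leaderboard, tolerance=0.01):
    """Pick the best interpretable model within `tolerance` of the top score.

    Falls back to the top scorer if no interpretable model is close enough.
    """
    best = max(m["validation_score"] for m in leaderboard)
    close = [m for m in leaderboard if best - m["validation_score"] <= tolerance]
    # Tuples sort by interpretability first (True > False), then by score.
    return max(close, key=lambda m: (m["interpretable"], m["validation_score"]))


# Hypothetical leaderboard; every number here is made up.
leaderboard = [
    {"name": "neural_network", "validation_score": 0.874, "interpretable": False},
    {"name": "gradient_boosted_trees", "validation_score": 0.869, "interpretable": True},
    {"name": "logistic_regression", "validation_score": 0.841, "interpretable": True},
]

choice = pick_model(leaderboard)
print(choice["name"])
```

The tolerance is the judgment call: it states how much validation performance you are willing to give up for a score the recipient can understand.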
STAGE 4: Evaluate honestly
The headline metric is not enough. For a churn prediction, ninety-two percent accuracy sounds excellent until you notice that ninety-one percent of customers do not churn, so the model is barely better than always predicting "stays". Use a metric that reflects the class you care about, such as precision and recall on the churners, and read the confusion matrix, not the single number.
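The arithmetic is worth working through once. The confusion matrix below is hypothetical, chosen only so that it reproduces the figures above: 1,000 customers, 91 percent of whom stay, and a model with 92 percent accuracy.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive headline and per-class metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting the majority class.
        "baseline_accuracy": max(tp + fn, fp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 90 churners (20 caught, 70 missed),
# 910 stayers (900 correctly left alone, 10 wrongly flagged).
metrics = confusion_metrics(tp=20, fp=10, fn=70, tn=900)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the model beats the always-"stays" baseline by a single point of accuracy while missing 70 of the 90 customers who actually churn, which is the part the headline number hides.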
Evaluate across slices too. Does the model perform as well for new customers as for tenured ones, across regions, across account sizes? If performance swings wildly between slices, either the model is fitting one segment at the expense of the others or you need separate models per segment. A single global score that is excellent on average and useless for your largest accounts is a trap.
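A slice check can be as simple as computing the same metric per segment. The sketch below uses recall on churners and assumes you can export validation rows as (slice, actual outcome, prediction). The slice names and rows are invented for illustration.

```python
from collections import defaultdict


def recall_by_slice(rows):
    """Recall on the positive class, computed separately for each slice.

    rows: iterable of (slice_name, actual_churned, predicted_churn).
    Slices with no actual churners are omitted from the result.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for slice_name, actual, predicted in rows:
        if actual:
            positives[slice_name] += 1
            if predicted:
                hits[slice_name] += 1
    return {s: hits[s] / positives[s] for s in positives}


# Hypothetical validation rows.
rows = [
    ("new", True, True), ("new", True, True), ("new", True, True), ("new", True, False),
    ("tenured", True, True), ("tenured", True, False),
    ("tenured", True, False), ("tenured", True, False),
    ("tenured", False, False),
]

per_slice = recall_by_slice(rows)
print(per_slice)
```

Here the pooled recall is 0.5, which hides a model that catches most new-customer churn and little of the tenured churn.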
STAGE 5: Deploy
Deployment writes the score back to a record as a custom field, where it shows up in reports, dashboards, and list views and is reachable from Flow and Apex. The act is two clicks. The discipline around it is what matters: decide the cadence, whether daily, weekly, or real time on field change; decide the action layer, whether the score triggers automation or merely informs a human; and define a kill switch, a flag that removes the model's effect on automations within minutes if it misbehaves. Build the kill switch before launch, never after.
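The shape of a kill switch is easier to see in code than in prose. On the platform the gate would live in Flow or Apex. The Python below is only a sketch of the logic, and the `ModelControls` type, the threshold, and the action names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelControls:
    """Operator-owned settings, kept separate from the model itself."""
    automation_enabled: bool = True  # the kill switch
    action_threshold: float = 0.7    # hypothetical cut-off for automated outreach


def next_step(score, controls):
    """Decide what a score is allowed to do under the current controls."""
    if not controls.automation_enabled:
        # Kill switch thrown: the score stays visible but triggers nothing.
        return "inform_only"
    if score >= controls.action_threshold:
        return "trigger_outreach"
    return "no_action"


controls = ModelControls()
print(next_step(0.85, controls))

controls.automation_enabled = False  # model misbehaves, operator flips the flag
print(next_step(0.85, controls))
```

The point of the design is that the flag is checked on every decision, so flipping it takes effect immediately and does not require retraining or undeploying the model.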
Retrain on a fixed schedule, not only when performance degrades. Monthly for fast-moving domains, quarterly for slow ones. Predictable retraining catches drift early; reactive retraining catches it after the damage is done.
AFTER: The monitoring that keeps it honest
A deployed model is the start, not the end. Track three things weekly. Prediction drift, because the distribution of scores should not lurch week to week, and if it does the population is changing under the model. Ground-truth performance, comparing predictions against outcomes as they resolve, which gives you the real production accuracy that will sit below the validation number. And action follow-through, the share of flagged records actually acted on, because if that is below thirty percent the prediction is not landing where it should and the fix is in the workflow, not the model. Start with a churn-style problem for your first build, where labels are clear and the action layer is well understood. The models that earn their place are quiet ones, and the signal of success is the day someone refuses to do the outreach without the score in front of them.
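The three weekly checks can be sketched in a few functions. Two choices here are assumptions rather than anything the platform prescribes: drift is measured with a population stability index (PSI) over equal-width score buckets, and scores are assumed to lie between 0 and 1. All sample numbers are invented.

```python
import math


def score_histogram(scores, bins=5):
    """Share of scores in each equal-width bucket over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]


def psi(expected, actual, floor=1e-4):
    """Population stability index between two lists of bucket shares."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) on empty buckets
        total += (a - e) * math.log(a / e)
    return total


def resolved_accuracy(pairs):
    """Share of (predicted, actual) pairs that agree, once outcomes are known."""
    return sum(1 for p, a in pairs if p == a) / len(pairs)


def follow_through(flagged, acted_on):
    """Share of flagged records that someone actually acted on."""
    return acted_on / flagged if flagged else 0.0


# Hypothetical weekly inputs.
last_week = score_histogram([0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
this_week = score_histogram([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])

drift = psi(last_week, this_week)
accuracy = resolved_accuracy([(True, True), (True, False), (False, False), (False, False)])
rate = follow_through(flagged=200, acted_on=50)

print(f"drift (PSI): {drift:.3f}")
print(f"production accuracy: {accuracy:.2f}")
print(f"follow-through: {rate:.0%}", "(below 30%, look at the workflow)" if rate < 0.30 else "")
```

Identical score distributions give a PSI of zero, and the value grows as the population shifts. What counts as an alarming value is a threshold you would have to calibrate against your own history.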