This article shows you how to build a machine learning model with SynapseML in a Microsoft Fabric notebook. You create a training pipeline that uses text featurization and LightGBM regression to predict book ratings from review text. You also learn how to use Foundry Tools for prebuilt sentiment analysis.
In this article, you:
- Create a Fabric notebook and attach a lakehouse
- Import libraries and load data
- Build and train a text featurization and LightGBM regression pipeline
- Generate predictions
- (Optional) Run Foundry Tools sentiment analysis
Prerequisites
Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Switch to Fabric by using the experience switcher on the lower-left side of your home page.
- Create a new notebook in your Fabric workspace.
- Attach a lakehouse to the notebook. In the Explorer pane, expand Lakehouses, and then select Add.
- (Optional) To run the sentiment analysis step, you need:
- A Foundry Tools key. Follow the instructions in Quickstart: Create a multi-service resource for Foundry Tools.
- An Azure Key Vault instance with your Foundry Tools key stored as a secret.
Set up the environment
In your notebook, import SynapseML libraries and initialize your Spark session.
from pyspark.sql import SparkSession
from synapse.ml.core.platform import *
spark = SparkSession.builder.getOrCreate()
Verification: Run the following cell to confirm Spark is running:
print(f"Spark version: {spark.version}")
The output displays the Spark version number. Expect version 3.4 or later; the exact version depends on your Fabric runtime.
Load a dataset
Load the book reviews dataset and split it into training and test sets. The dataset contains two columns: rating (integer 1-5) and text (review content).
train, test = (
    spark.read.parquet(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/BookReviewsFromAmazon10K.parquet"
    )
    .limit(1000)
    .cache()
    .randomSplit([0.8, 0.2])
)
display(train)
Verification: Run the following cell to confirm the data loaded correctly:
print(f"Training rows: {train.count()}, Test rows: {test.count()}")
print(f"Columns: {train.columns}")
train.printSchema()
The output shows approximately 800 training rows and 200 test rows, with two columns: rating (integer) and text (string). The exact row counts vary because randomSplit is non-deterministic.
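To build intuition for why the counts vary, here's a minimal plain-Python sketch of a seeded 80/20 split. It's a hypothetical stand-in, not Spark's implementation: it shuffles row indices with a fixed seed and cuts at the 80% mark. Note that even a seeded Spark randomSplit samples each row independently, so seeded Spark counts are still only approximately 80/20.

```python
import random

# Hypothetical stand-in for an 80/20 split: shuffle 1,000 row indices with a
# fixed seed, then cut at the 80% mark. With a seed, the split is reproducible.
rows = list(range(1000))      # stand-in for the 1,000 cached review rows
rng = random.Random(42)       # fixed seed makes the shuffle deterministic
rng.shuffle(rows)

cut = int(len(rows) * 0.8)
train_rows, test_rows = rows[:cut], rows[cut:]

print(len(train_rows), len(test_rows))  # 800 200
```

Because the seed fixes the shuffle, rerunning this sketch always produces the same 800/200 partition with no overlap between the two sets.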
Create the training pipeline
Create a pipeline that featurizes the review text with TextFeaturizer and predicts the rating with LightGBMRegressor.
from pyspark.ml import Pipeline
from synapse.ml.featurize.text import TextFeaturizer
from synapse.ml.lightgbm import LightGBMRegressor
model = Pipeline(
    stages=[
        TextFeaturizer(inputCol="text", outputCol="features"),
        LightGBMRegressor(featuresCol="features", labelCol="rating", dataTransferMode="bulk"),
    ]
).fit(train)
Verification: Run the following cell to confirm the pipeline trained:
print(f"Pipeline stages: {len(model.stages)}")
print(f"Stage 1: {type(model.stages[0]).__name__}")
print(f"Stage 2: {type(model.stages[1]).__name__}")
The output shows two pipeline stages: TextFeaturizerModel and LightGBMRegressionModel.
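To see what the featurization stage conceptually produces, here's a minimal plain-Python sketch of the hashing trick that many text featurizers use. It isn't SynapseML's implementation (TextFeaturizer also supports tokenization options, n-grams, and IDF weighting), but it shows the core idea: free text becomes a fixed-length numeric vector that a regressor can consume.

```python
N_FEATURES = 16  # tiny for illustration; real featurizers use far larger spaces

def featurize(text: str) -> list[int]:
    """Hash each lowercase token into a fixed-size vector of term counts."""
    vec = [0] * N_FEATURES
    for token in text.lower().split():
        vec[hash(token) % N_FEATURES] += 1
    return vec

features = featurize("Great plot, great characters")
print(sum(features))  # 4 tokens counted
```

Every review maps to a vector of the same length regardless of how many words it contains, which is what lets a fixed-width model like LightGBM train on variable-length text.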
Predict the output of the test data
Call the transform method on the model to predict ratings for the test data and display the results.
predictions = model.transform(test)
display(predictions)
Verification: Run the following cell to confirm predictions were generated:
print(f"Prediction columns: {predictions.columns}")
print(f"Prediction count: {predictions.count()}")
predictions.select("rating", "prediction").show(5)
The output lists four columns (rating, text, features, prediction) and approximately 200 rows. The prediction column contains the model's predicted rating as a float. Compare it against the actual rating column to assess model performance.
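One common way to quantify that comparison is root-mean-square error (RMSE). The pairs below are hypothetical; in the notebook you would collect real values, for example from predictions.select("rating", "prediction").

```python
import math

# Hypothetical (rating, prediction) pairs; in the notebook, collect real pairs
# from predictions.select("rating", "prediction")
pairs = [(5, 4.6), (1, 1.8), (4, 3.9), (2, 2.5), (5, 4.2)]

# RMSE: square each error, average, then take the square root
rmse = math.sqrt(sum((rating - pred) ** 2 for rating, pred in pairs) / len(pairs))
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.583
```

Lower RMSE means predictions sit closer to the true ratings; on a 1-5 scale, an RMSE well under 1.0 indicates the model is usually within one rating point.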
(Optional) Use Foundry Tools for sentiment analysis
If you want to analyze the sentiment of your book reviews, you can use SynapseML's integration with Foundry Tools. This step uses the prebuilt TextSentiment model to classify text sentiment, which is a different task from the rating prediction in the previous steps.
Important
This step requires a Foundry Tools key stored in Azure Key Vault. If you skipped those prerequisites, complete them first or skip this section.
Run the following code with these replacements:
- Replace `<your-secret-name>` with the name of your Foundry Tools key secret in Key Vault.
- Replace `<your-key-vault-name>` with the name of your Azure Key Vault instance.
from synapse.ml.services import TextSentiment
from synapse.ml.core.platform import find_secret
sentiment_model = TextSentiment(
    textCol="text",
    outputCol="sentiment",
    subscriptionKey=find_secret("<your-secret-name>", "<your-key-vault-name>"),
).setLocation("eastus")
sentiment_results = sentiment_model.transform(test)
display(sentiment_results)
Note
Update the setLocation value if your Foundry Tools resource is in a different Azure region (for example, "westus2" or "westeurope").
Verification: Run the following cell to confirm sentiment analysis completed:
print(f"Sentiment columns: {sentiment_results.columns}")
sentiment_results.select("text", "sentiment").show(3, truncate=50)
The output shows three columns (rating, text, sentiment). The sentiment column contains structured results with labels like positive, negative, neutral, or mixed for each review.
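To summarize the labels across reviews, you can tally them. The sketch below uses hypothetical labels in plain Python; in the notebook, you would aggregate the sentiment column with Spark instead.

```python
from collections import Counter

# Hypothetical labels pulled from the sentiment column of a few reviews
labels = ["positive", "positive", "negative", "neutral", "positive", "mixed"]

summary = Counter(labels)
print(summary.most_common(1))  # [('positive', 3)]
```

A quick tally like this gives a first read on whether the review corpus skews positive or negative before any deeper analysis.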
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| `JAVA_GATEWAY_EXITED` error when creating SparkSession | Running code outside a Fabric notebook | Run this code in a Fabric notebook, where Spark is preconfigured. Don't run it locally without a Spark installation. |
| `Could not find <secret> in keyvault <vault>` | Key Vault name or secret name is incorrect, or the notebook identity lacks access | Verify that the names match exactly. In the Azure portal, confirm that your Fabric workspace identity has Get permission on Key Vault secrets. |
| `TextFeaturizer` returns empty features | Input text column is null or empty | Check for null values with `train.filter(train.text.isNull()).count()`, and remove nulls before training. |
| `randomSplit` returns unexpected row counts | Spark's random splitting is non-deterministic | This is expected behavior. Set a seed for reproducibility: `.randomSplit([0.8, 0.2], seed=42)`. |
| `AnalysisException: Path does not exist` | Network issue accessing the sample data blob | Verify network connectivity. In Fabric, confirm that your workspace can access external Azure Blob Storage URLs. |
| Foundry Tools returns 401 or 403 | Invalid or expired subscription key | Generate a new key in the Azure portal, in the Keys and Endpoint section of your Foundry Tools resource. Update the Key Vault secret. |
| `setLocation` returns 404 | Region mismatch | Set the location to match the Azure region where you created your Foundry Tools resource. |
Clean up resources
If you created Azure resources for the optional Foundry Tools step and no longer need them, delete them to avoid charges:
- In the Azure portal, delete the Foundry Tools multi-service resource.
- In the Azure portal, delete the Key Vault instance.
- In your Fabric workspace, delete the test notebook if you no longer need it.