Use Amazon SageMaker Studio to Build a Machine Learning Pipeline
Amazon SageMaker Studio is a machine learning integrated development environment (IDE) that AWS launched at re:Invent 2019. It lets users easily build, train, debug, deploy, and monitor machine learning models, so they can focus on developing models rather than on setting up environments or switching between development tools. In Amazon SageMaker Studio you can:
- Write and execute code in Jupyter notebooks
- Build and train machine learning models
- Deploy models and monitor the performance of their predictions
- Tune and improve the effectiveness of machine learning models
Amazon SageMaker Studio is currently only available in the US East (Ohio) Region (us-east-2).
Amazon SageMaker Studio provides the following features that can be used to effectively build machine learning models and reduce development time:
- Amazon SageMaker Studio Notebooks
- Amazon SageMaker Autopilot
- Amazon SageMaker Experiments
- Amazon SageMaker Debugger
- Amazon SageMaker Model Monitor
Amazon SageMaker Studio Notebooks
Amazon SageMaker Studio Notebooks are notebooks within Amazon SageMaker Studio that let multiple people collaborate simultaneously and that can quickly switch between different computing resources as needed. Even after the computing resources are shut down following training, you can still view the experimental results.
Amazon SageMaker Studio Notebooks provides the following main environments for users to choose from:
- Data Science (includes most common data science packages such as NumPy, SciKit Learn, and more)
- Base Python (a plain, vanilla Python environment)
- MXNet CPU optimized
- TensorFlow CPU optimized
- PyTorch CPU optimized
After this introduction to Amazon SageMaker Studio Notebooks, veterans who have used SageMaker will wonder how it differs from instance-based Amazon SageMaker notebooks. Amazon SageMaker Studio Notebooks has the following advantages over instance-based Amazon SageMaker notebooks:
- Starts 5-10 times faster than instance-based Amazon SageMaker notebooks.
- Lets you create snapshots and use URLs to share code, the development environment, and experiment results with team members.
- Has the latest version of the Amazon SageMaker SDK built in, so you can build, train, tune, and monitor models directly in the Amazon SageMaker Studio IDE.
- Uses AWS SSO for authentication; team members log in and run their notebooks through a unique URL provided by AWS, without having to log in to the AWS console.
- Preconfigures a set of environments for data science, so users can start data science tasks such as data preprocessing and modeling faster.
Amazon SageMaker Autopilot
Amazon SageMaker Autopilot makes it easy for users who are not familiar with machine learning to train their own models, including classification and regression models. Users only need to upload the dataset (which must be in CSV format) and select the field to be predicted; Autopilot then explores combinations of algorithms, hyperparameters, and data pre-processing methods to find the most accurate machine learning model.
Amazon SageMaker Autopilot can be applied to the following problem types:
- Linear regression : Uses a function and weights over multiple features to predict a target whose value is continuous. For example, in house price prediction, the floor area, the number of bedrooms, and whether there is a parking space are used to predict the selling price.
- Binary classification : Predicts one of two classes, the most common kind of classification problem, such as credit card fraud detection or cancer detection.
- Multi-class classification : Predicts one of several classes. For example, a movie classification model may have drama, action, romance, animation, and other categories. The model not only predicts the category but also returns a probability for each category.
- Automatic problem type detection : When setting up a job with the AutoML API, you can either define the problem type yourself or let Autopilot detect it automatically, as in the sketch below.
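To make the workflow concrete, here is a minimal sketch of launching an Autopilot job with the low-level boto3 create_auto_ml_job call; the bucket paths, job name, target column, and role ARN are placeholders, and omitting ProblemType lets Autopilot detect it automatically.
import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_auto_ml_job(
    AutoMLJobName="demo-autopilot-job",                        # placeholder job name
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/autopilot/train.csv",     # CSV training data (placeholder path)
        }},
        "TargetAttributeName": "label",                        # column to predict (placeholder)
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    ProblemType="BinaryClassification",                        # omit this field for automatic detection
    AutoMLJobObjective={"MetricName": "F1"},
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role ARN
)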
When users run Autopilot training, an experiment and multiple trials are created; each trial stores the results generated by a different combination of algorithm, hyperparameters, and data preprocessing method, giving users an efficient way to compare models.
When we create an Autopilot experiment, two notebooks are created and stored in S3: a Data Exploration Notebook and a Candidate Generation Notebook.
- Data Exploration Notebook : Presents an analysis of the dataset, for example whether there are missing values, whether they should be replaced with the mean or the median, and whether string categories should be one-hot encoded. If too many categories make the data dimensionality too high, PCA or another dimensionality reduction technique is suggested for data reduction.
- Candidate Generation Notebook : Provides suggested algorithms and performs Hyperparameter Optimization (HPO) within a given parameter range. Here you can choose to (1) only generate this notebook or (2) generate it and execute the AutoML job. With the first option you can decide to run only specific algorithm combinations before determining the optimal one; with the second, Autopilot directly reports the optimal algorithm. The best candidate can then also be inspected programmatically, as shown in the sketch below.
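To inspect the outcome without going through the notebooks, the best candidate can be read back with describe_auto_ml_job once the job finishes (a sketch reusing the placeholder job name above):
best = sm_client.describe_auto_ml_job(AutoMLJobName="demo-autopilot-job")["BestCandidate"]
print(best["CandidateName"])
print(best["FinalAutoMLJobObjectiveMetric"])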
Amazon SageMaker Experiments
In the process of building a machine learning model, it is cumbersome to use paper or Excel to record the impact of each parameter on the model. Amazon SageMaker Experiments tracks and evaluates the accuracy of each training run and records how changes to the data, algorithms, and hyperparameters affect the experiment.
Amazon SageMaker Experiments is mainly divided into experiments and trials: an experiment is the overall investigation to be observed, and trials are the runs, each with different results, produced every time the target variable is adjusted within the experiment.
Amazon SageMaker Experiments records and tracks experiments in two ways: Automated Tracking and Manual Tracking. Automated Tracking automatically tracks Amazon SageMaker Autopilot experiments, treating each training run with different algorithms, parameters, and model evaluation metrics as a trial, so users can track each experiment and manage machine learning models. Manual Tracking uses the API provided by the Amazon SageMaker Python SDK to record and track machine learning models and training jobs running in Amazon SageMaker notebooks and Amazon SageMaker Studio Notebooks.
Amazon SageMaker Experiments automatically captures the input parameters, settings, and results of each model and stores them in the experiment, and the results can be compared visually to find the best model. We will walk through the handwriting recognition example provided by AWS (mnist-handwritten-digits-classification-experiment.ipynb) directly.
- Install the Amazon SageMaker Experiments SDK in your notebook.
import sys
!{sys.executable} -m pip install sagemaker-experiments
- Import the Amazon SageMaker and Experiments packages into this notebook, and set up the SageMaker boto3 client and execution role used in the following steps.
import time
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# boto3 SageMaker client and IAM execution role used throughout the example
sm = boto3.Session().client('sagemaker')
role = get_execution_role()
- A Tracker is used to track the transformation of the dataset while recording the normalization parameters and the dataset's Amazon S3 URI.
# 'inputs' is the S3 URI of the uploaded MNIST dataset
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters({
        "normalization_mean": 0.1307,
        "normalization_std": 0.3081,
    })
    tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs)

# the tracker's trial component is attached to every trial in the loop below
preprocessing_trial_component = tracker.trial_component
- Create and track an experiment named mnist-hand-written-digits-classification-xxxxxxxx.
mnist_experiment = Experiment.create(
    experiment_name=f"mnist-hand-written-digits-classification-{int(time.time())}",
    description="Classification of mnist hand-written digits",
    sagemaker_boto_client=sm)
print(mnist_experiment)
- Then set up various trials to test the effect of using different numbers of hidden channels (2, 5, 10, 20, 32) on the final results of the experiment, while the other parameters remain the same, and use train loss, test loss, and test accuracy to measure the models.
from sagemaker.pytorch import PyTorch

hidden_channel_trial_name_map = {}

for i, num_hidden_channel in enumerate([2, 5, 10, 20, 32]):
    # create trial
    trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=mnist_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    hidden_channel_trial_name_map[num_hidden_channel] = trial_name

    # associate the preprocessing trial component with the current trial
    cnn_trial.add_trial_component(preprocessing_trial_component)

    # all input configurations, parameters, and metrics specified in the estimator
    # definition are automatically tracked
    estimator = PyTorch(
        entry_point='./mnist.py',
        role=role,
        sagemaker_session=sagemaker.Session(sagemaker_client=sm),
        framework_version='1.1.0',
        train_instance_count=1,
        train_instance_type='ml.c4.xlarge',
        hyperparameters={
            'epochs': 2,
            'backend': 'gloo',
            'hidden_channels': num_hidden_channel,
            'dropout': 0.2,
            'optimizer': 'sgd'
        },
        metric_definitions=[
            {'Name': 'train:loss', 'Regex': 'Train Loss: (.*?);'},
            {'Name': 'test:loss', 'Regex': 'Test Average loss: (.*?),'},
            {'Name': 'test:accuracy', 'Regex': 'Test Accuracy: (.*?)%;'}
        ],
        enable_sagemaker_metrics=True,
    )

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))

    # now associate the estimator with the Experiment and Trial
    estimator.fit(
        inputs={'training': inputs},
        job_name=cnn_training_job_name,
        experiment_config={
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=True,
    )

    # give it a while before dispatching the next training job
    time.sleep(2)
Next, we can use the Experiments list in SageMaker Studio to visualize the experimental results. Here we compare the test accuracy when the number of hidden channels is 2, 5, 10, 20, or 32. The right-hand panel offers Time Series and Summary Statistics views for drawing comparison charts.
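The same comparison can also be done programmatically with the ExperimentAnalytics class imported earlier. This is a sketch based on the AWS sample notebook; the metric and parameter names assume the estimator definition above.
# rank all trial components in the experiment by maximum test accuracy
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=Session(boto3.Session(), sm),
    experiment_name=mnist_experiment.experiment_name,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=["test:accuracy"],
    parameter_names=["hidden_channels", "epochs", "dropout", "optimizer"],
)
trial_component_analytics.dataframe()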
Amazon SageMaker Debugger
Amazon SageMaker Debugger monitors, records, and analyzes the state of tensors and the neural network in a machine learning model. Users can set different Rules to monitor the model; when SageMaker Debugger detects problems such as overfitting, vanishing gradients, or class imbalance, a warning is issued to the user in a timely manner. Users can use the visual interface of Amazon SageMaker Studio to check the state of tensors and the neural network during training and then debug and improve the model.
When a Rule is triggered, Amazon SageMaker Studio displays the log of the triggered Rule to analyze the cause of the training exception.
Amazon SageMaker Debugger has two types of Rule: Built-in Rules and Custom Rules:
- Built-in Rules cover four scenarios:
- Deep learning frameworks : such as dead ReLU, vanishing gradient, weight update ratio, whether a tensor contains infinite values or NaN, etc.
- Deep learning frameworks and the XGBoost algorithm : class imbalance, overfitting, overtraining, confusion, etc.
- Deep learning applications : checks whether image input values are normalized and the ratio of specific words in text.
- XGBoost algorithm : the depth of the classification tree.
Import the SageMaker Debugger package directly, and set the ExplodingTensor and VanishingGradient rules.
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

exploding_tensor_rule = Rule.sagemaker(
    base_config=rule_configs.exploding_tensor(),
    rule_parameters={"collection_names": "weights,losses"},
    collections_to_save=[
        CollectionConfig("weights"),
        CollectionConfig("losses")
    ]
)

vanishing_gradient_rule = Rule.sagemaker(
    base_config=rule_configs.vanishing_gradient()
)

import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[exploding_tensor_rule, vanishing_gradient_rule],
)

sagemaker_estimator.fit()
- Custom Rule : Users define their own monitoring rule. For example, set the metric to be monitored to gradients, get the gradients at each step, and average them to see whether the result is greater than a threshold (a sketch of such a rule file follows the configuration code below).
custom_rule = Rule.custom(
    name='MyCustomRule',  # used to identify the rule
    # rule evaluator container image
    image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
    instance_type='ml.t3.medium',  # instance type to run the rule evaluation on
    source='rules/my_custom_rule.py',  # path to the rule source file
    rule_to_invoke='CustomGradientRule',  # name of the class to invoke in the rule source file
    volume_size_in_gb=30,  # EBS volume size attached to the rule evaluation instance
    collections_to_save=[CollectionConfig("gradients")],  # collections to be analyzed by the rule; a first-party collection fetched as above
    rule_parameters={
        "threshold": "20.0"  # used to initialize the 'threshold' param in your constructor
    }
)

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='smdebug-custom-rule-demo-tf-keras',
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    entry_point=entrypoint_script,
    framework_version='1.15',
    py_version='py3',
    train_max_run=3600,
    script_mode=True,
    # New parameter
    rules=[custom_rule]
)

estimator.fit(wait=False)
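For reference, here is a minimal sketch of what rules/my_custom_rule.py could contain, following the smdebug custom rule interface; the mean-absolute-gradient check is an assumption chosen to match the description above.
from smdebug.rules.rule import Rule

class CustomGradientRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)  # initialized from rule_parameters in Rule.custom()

    def invoke_at_step(self, step):
        # check the mean absolute gradient of every tensor in the "gradients" collection
        for tname in self.base_trial.tensor_names(collection="gradients"):
            abs_mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True  # condition met: the rule is triggered
        return False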
Amazon SageMaker Model Monitor
After a machine learning model is deployed, its accuracy may decrease over time because of a gap between the training data and the real-world data. SageMaker Model Monitor continuously monitors and analyzes prediction requests and uses built-in rules to confirm whether the data in a request is abnormal. If it is, a warning is issued to remind the user, helping to maintain the accuracy of the model. The following is the flow chart of SageMaker Model Monitor.
SageMaker Model Monitor can be divided into four processes: capturing data, establishing a baseline, setting monitoring schedules, and interpreting results, as sketched in the example after the list below.
- Capturing Data : Retrieve the data of the requests passed into the machine learning model.
- Establishing Baseline : Create a baseline as the comparison target when new data is received, using Deequ to measure the quality of the data.
- Setting Monitoring Schedules : Monitor and analyze the data collected in a period at a specified time and frequency, compare it with the baseline, and configure which kinds of anomaly reports are generated. A cron expression, for example cron(0 17/12 ? * * *), defines the schedule.
- Interpreting Results : View the final results, compare the latest data with the baseline, and send the reported anomalies to Amazon CloudWatch.
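Here is a rough sketch of the four steps using the model_monitor module of the SageMaker Python SDK; the bucket paths, endpoint name, and the already-built model object are placeholders.
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capturing Data: enable request/response capture when deploying the endpoint
#    ('model' is an already-built SageMaker model object)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="demo-endpoint",
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,
        destination_s3_uri="s3://my-bucket/monitor/data-capture",
    ),
)

# 2. Establishing Baseline: profile the training data with the Deequ-based container
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# 3. Setting Monitoring Schedules: compare captured data against the baseline on a cron schedule
monitor.create_monitoring_schedule(
    monitor_schedule_name="demo-monitoring-schedule",
    endpoint_input="demo-endpoint",
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

# 4. Interpreting Results: inspect the latest execution and its status
executions = monitor.list_executions()
print(executions[-1].describe())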
Conclusion
In the past, developing machine learning, from installing the environment to building the model, often took a lot of time. Amazon SageMaker Studio removes the trouble of installing the environment, while its other features reduce the time needed to build an effective machine learning model. Amazon SageMaker Autopilot is a good starting point for machine learning novices: you only need to provide the data, let Autopilot set the problem type, and make predictions. For machine learning veterans, Amazon SageMaker Debugger makes it possible to understand problems encountered during training immediately and to adjust directly without waiting for training to complete. Compared with Amazon SageMaker Model Monitor, Amazon SageMaker Debugger monitors the training process of the model, while Amazon SageMaker Model Monitor monitors whether the model needs to be adjusted after deployment; both can be set to trigger alerts. In addition, AWS provides sample notebooks for these services; those who are interested can try them out at this link.