
TFX Pipelines in Colab

Colab is a lightweight development environment which differs significantly from a production environment. In production, you may have various pipeline components (data ingestion, transformation, model training, run histories, etc.) spread across multiple, distributed systems. For this tutorial, you should be aware that significant differences exist in orchestration and metadata storage – both are handled locally within Colab. Learn more about TFX in Colab here.


First, we install and import the necessary packages, set up paths, and download data.

Upgrade Pip

To avoid upgrading Pip on a system when running locally, check that we're running in Colab. Local systems can of course be upgraded separately.

  try:
    import colab
    !pip install --upgrade pip
  except:
    pass

Install and import TFX

pip install -q tfx

Import packages

Did you restart the runtime?

If you are using Google Colab, the first time that you run the cell above, you must restart the runtime by clicking the “RESTART RUNTIME” button above or using the “Runtime > Restart runtime …” menu. This is because of the way that Colab loads packages.

import os
import tempfile
import urllib
import pandas as pd

import tensorflow_model_analysis as tfma
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

Check the TFX and MLMD versions.

from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))
import ml_metadata as mlmd
print('MLMD version: {}'.format(mlmd.__version__))

TFX version: 1.13.0 MLMD version: 1.13.1

Download the dataset

In this colab, we use the Palmer Penguins dataset, which can be found on Github. We processed the dataset by leaving out any incomplete records, dropping the island and sex columns, and converting the labels to int32. The dataset contains 334 records of the body mass, the length and depth of penguins’ culmens, and the length of their flippers. You use this data to classify penguins into one of three species.

# Processed Palmer Penguins CSV used by the TFX penguin examples
DATA_PATH = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/labelled/penguins_processed.csv'

_data_root = tempfile.mkdtemp(prefix='tfx-data')
_data_filepath = os.path.join(_data_root, "penguins_processed.csv")
urllib.request.urlretrieve(DATA_PATH, _data_filepath)

(‘/tmpfs/tmp/tfx-datap4i8w56n/penguins_processed.csv’, <http.client.HTTPMessage at 0x7f3a76776370>)
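The rows in the processed CSV follow the schema described above. As a quick illustration, here is a tiny hand-made sample in pandas (the values and the 0/1/2 species encoding are illustrative assumptions, not actual rows from the file):

```python
import pandas as pd

# Hypothetical sample mirroring the processed schema: four numeric features
# plus an int32 'species' label. The 0/1/2 label encoding is an assumption.
sample = pd.DataFrame({
    'species': [0, 1, 2],
    'culmen_length_mm': [39.1, 46.5, 50.0],
    'culmen_depth_mm': [18.7, 17.9, 15.2],
    'flipper_length_mm': [181.0, 192.0, 218.0],
    'body_mass_g': [3750.0, 3500.0, 5700.0],
})
sample['species'] = sample['species'].astype('int32')
print(sample.dtypes)
```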

Create an InteractiveContext

To run TFX components interactively in this notebook, create an InteractiveContext. The InteractiveContext uses a temporary directory with an ephemeral MLMD database instance. Note that calls to InteractiveContext are no-ops outside the Colab environment.

In general, it is a good practice to group similar pipeline runs under a Context.

interactive_context = InteractiveContext()

WARNING:absl:InteractiveContext pipeline_root argument not provided: using temporary directory /tmpfs/tmp/tfx-interactive-2023-07-28T11_11_10.063419-p4royv0g as root for pipeline outputs.
WARNING:absl:InteractiveContext metadata_connection_config not provided: using SQLite ML Metadata database at /tmpfs/tmp/tfx-interactive-2023-07-28T11_11_10.063419-p4royv0g/metadata.sqlite.
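The second warning shows where the ephemeral SQLite database lives. Although the MLMD API is the proper way to query it, you can also inspect the file directly with Python's built-in sqlite3 module; the path below is a placeholder, so in Colab substitute the metadata.sqlite location printed by your InteractiveContext:

```python
import os
import sqlite3
import tempfile

# Placeholder path; in the notebook, use the metadata.sqlite location printed
# by the InteractiveContext warning instead.
db_path = os.path.join(tempfile.mkdtemp(), 'metadata.sqlite')

# Connecting to the placeholder creates an empty database, so this prints [];
# pointed at the real file, it lists the tables MLMD maintains.
conn = sqlite3.connect(db_path)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
conn.close()
print(tables)
```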

Construct the TFX Pipeline

A TFX pipeline consists of several components that perform different aspects of the ML workflow. In this notebook, you create and run the ExampleGen, StatisticsGen, SchemaGen, and Trainer components and use the Evaluator and Pusher components to evaluate and push the trained model.

Refer to the components tutorial for more information on TFX pipeline components.

Note: Constructing a TFX pipeline by setting up the individual components involves a lot of boilerplate code. For the purpose of this tutorial, it is alright if you do not fully understand every line of code in the pipeline setup.
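Each step below follows the same pattern: instantiate a component, wire an upstream component's outputs into its inputs, and hand it to the context to run. As a purely conceptual sketch (plain Python stand-ins, not the TFX API):

```python
# Toy stand-ins that only illustrate the instantiate -> wire -> run data flow.
# None of these classes are TFX APIs.
class ToyComponent:
    def __init__(self, name, **inputs):
        self.name = name
        self.inputs = inputs   # channels wired from upstream components
        self.outputs = {}      # populated when the component runs

class ToyContext:
    def run(self, component):
        # A real orchestrator would execute the component and record the run
        # in MLMD; here we just produce a placeholder artifact URI.
        component.outputs['result'] = f'/artifacts/{component.name}/1'
        return component.outputs

context = ToyContext()
example_gen = ToyComponent('ExampleGen')
context.run(example_gen)
# Downstream component consumes the upstream component's output channel.
statistics_gen = ToyComponent('StatisticsGen',
                              examples=example_gen.outputs['result'])
context.run(statistics_gen)
print(statistics_gen.inputs['examples'])  # /artifacts/ExampleGen/1
```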

Instantiate and run the ExampleGen Component

example_gen = tfx.components.CsvExampleGen(input_base=_data_root)
interactive_context.run(example_gen)

WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

Instantiate and run the StatisticsGen Component

statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
interactive_context.run(statistics_gen)

Instantiate and run the SchemaGen Component

infer_schema = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)
interactive_context.run(infer_schema)

Instantiate and run the Trainer Component

# Define the module file for the Trainer component
trainer_module_file = 'penguin_trainer.py'

%%writefile {trainer_module_file}

# Define the training algorithm for the Trainer module file
import os
from typing import List, Text

import tensorflow as tf
from tensorflow import keras

from tfx import v1 as tfx
from tfx_bsl.public import tfxio

from tensorflow_metadata.proto.v0 import schema_pb2

# Features used for classification - culmen length and depth, flipper length,
# and body mass; 'species' is the label.

_LABEL_KEY = 'species'

_FEATURE_KEYS = [
    'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'
]

def _input_fn(file_pattern: List[Text],
              data_accessor: tfx.components.DataAccessor,
              schema: schema_pb2.Schema, batch_size: int) -> tf.data.Dataset:
  """Generates a batched, repeated dataset of features and labels."""
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY), schema).repeat()

def _build_keras_model():
  inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]
  d = keras.layers.concatenate(inputs)
  d = keras.layers.Dense(8, activation='relu')(d)
  d = keras.layers.Dense(8, activation='relu')(d)
  outputs = keras.layers.Dense(3)(d)
  model = keras.Model(inputs=inputs, outputs=outputs)
  model.compile(
      optimizer=keras.optimizers.Adam(1e-2),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])
  return model

def run_fn(fn_args: tfx.components.FnArgs):
  schema = schema_pb2.Schema()
  tfx.utils.parse_pbtxt_file(fn_args.schema_path, schema)
  train_dataset = _input_fn(
      fn_args.train_files, fn_args.data_accessor, schema, batch_size=10)
  eval_dataset = _input_fn(
      fn_args.eval_files, fn_args.data_accessor, schema, batch_size=10)
  model = _build_keras_model()
  model.fit(
      train_dataset,
      steps_per_epoch=20,
      epochs=int(fn_args.train_steps / 20),
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps)
  model.save(fn_args.serving_model_dir, save_format='tf')


Run the Trainer component.

trainer = tfx.components.Trainer(
    module_file=os.path.abspath(trainer_module_file),
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=50))
interactive_context.run(trainer)

Epoch 1/5
20/20 [==============================] – 3s 19ms/step – loss: 0.9458 – sparse_categorical_accuracy: 0.7500 – val_loss: 0.8589 – val_sparse_categorical_accuracy: 0.7800
Epoch 2/5
20/20 [==============================] – 0s 11ms/step – loss: 0.6942 – sparse_categorical_accuracy: 0.8000 – val_loss: 0.5478 – val_sparse_categorical_accuracy: 0.7800
Epoch 3/5
20/20 [==============================] – 0s 11ms/step – loss: 0.4146 – sparse_categorical_accuracy: 0.8100 – val_loss: 0.3478 – val_sparse_categorical_accuracy: 0.7800
Epoch 4/5
20/20 [==============================] – 0s 11ms/step – loss: 0.2747 – sparse_categorical_accuracy: 0.9350 – val_loss: 0.2253 – val_sparse_categorical_accuracy: 0.9600
Epoch 5/5
20/20 [==============================] – 0s 11ms/step – loss: 0.1738 – sparse_categorical_accuracy: 0.9700 – val_loss: 0.1330 – val_sparse_categorical_accuracy: 0.9800
INFO:tensorflow:Assets written to: /tmpfs/tmp/tfx-interactive-2023-07-28T11_11_10.063419-p4royv0g/Trainer/model/4/Format-Serving/assets

Evaluate and push the model

Use the Evaluator component to evaluate and ‘bless’ the model before using the Pusher component to push the model to a serving directory.
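Conceptually, blessing compares the candidate model's metric against a configured threshold; only a blessed model is pushed. A minimal pure-Python sketch of the lower-bound check (not the TFMA implementation, and the 0.6 bound is the value configured below):

```python
def is_blessed(metric_value: float, lower_bound: float = 0.6) -> bool:
    # Mirrors the value-threshold idea: bless the candidate model only if
    # its SparseCategoricalAccuracy clears the configured lower bound.
    return metric_value >= lower_bound

print(is_blessed(0.98))  # True: the Pusher would receive this model
print(is_blessed(0.55))  # False: the model would not be pushed
```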

_serving_model_dir = os.path.join(tempfile.mkdtemp(),
                                  'serving_model/penguins_classification')

eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(label_key='species', signature_name='serving_default')
    ],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='SparseCategoricalAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.6})))
        ])
    ])

evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    schema=infer_schema.outputs['schema'],
    eval_config=eval_config)
interactive_context.run(evaluator)

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_model_analysis/writers/…: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version. Instructions for updating: Use eager execution and: `tf.data.TFRecordDataset(path)`

pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=_serving_model_dir)))
interactive_context.run(pusher)

Running the TFX pipeline populates the MLMD Database. In the next section, you use the MLMD API to query this database for metadata information.