Create a TFX pipeline using templates with Local orchestrator (3)

Introduction

This document provides instructions for creating a TensorFlow Extended (TFX) pipeline using the templates that ship with the TFX Python package. Most of the instructions are Linux shell commands; corresponding Jupyter Notebook code cells, which invoke those commands using !, are also provided.

You will build a pipeline using the Taxi Trips dataset released by the City of Chicago. We strongly encourage you to try building your own pipeline with your own dataset, using this pipeline as a baseline.

We will build a pipeline that runs in a local environment. If you are interested in using the Kubeflow orchestrator on Google Cloud, please see the TFX on Cloud AI Platform Pipelines tutorial.

Prerequisites

  • Linux / MacOS
  • Python >= 3.5.3

You can get all prerequisites easily by running this notebook on Google Colab.

Step 1. Set up your environment.

Throughout this document, we present each command twice: once as a copy-and-paste-ready shell command, and once as a Jupyter notebook cell. If you are using Colab, skip the shell script blocks and execute the notebook cells.

You should prepare a development environment to build a pipeline.

Install the tfx Python package. We recommend using virtualenv in a local environment. You can use the following shell script snippet to set up your environment.

# Create a virtualenv for tfx.
virtualenv -p python3 venv
source venv/bin/activate
# Install python packages.
python -m pip install --upgrade "tfx<2"

If you are using Colab:

import sys
!{sys.executable} -m pip install --upgrade "tfx<2"

Note: There might be some errors during package installation. For example,

ERROR: some-package 0.some_version.1 has requirement other-package!=2.0.,<3,>=1.15, but you’ll have other-package 2.0.0 which is incompatible.

You can ignore these errors for now.

# Set `PATH` to include user python binary directory.
HOME=%env HOME
PATH=%env PATH
%env PATH={PATH}:{HOME}/.local/bin

env: PATH=/tmpfs/src/tf_docs_env/bin:/usr/local/cuda/bin:/opt/android-sdk/current/cmdline-tools/tools/bin:/opt/android-sdk/current/bin:/usr/local/go/bin:/usr/local/go/packages/bin:/opt/kubernetes/client/bin:/usr/local/cuda/bin:/opt/android-sdk/current/cmdline-tools/tools/bin:/opt/android-sdk/current/bin:/usr/local/go/bin:/usr/local/go/packages/bin:/opt/kubernetes/client/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/kbuilder/.local/bin:/home/kbuilder/.local/bin
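The %env magic used above only exists inside a notebook. As a minimal sketch of the same adjustment in plain Python, assuming the user binary directory is ~/.local/bin as in the cell above, the hypothetical helper with_user_bin below appends that directory only when it is not already on PATH (the output above shows it ending up there twice).

```python
import os


def with_user_bin(path_value: str, home: str) -> str:
    """Return path_value with <home>/.local/bin appended, unless already present."""
    user_bin = os.path.join(home, ".local", "bin")
    # Split on the platform separator and drop empty entries.
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if user_bin not in entries:
        entries.append(user_bin)
    return os.pathsep.join(entries)


# Apply the change to the current process; child processes inherit it.
os.environ["PATH"] = with_user_bin(
    os.environ.get("PATH", ""), os.path.expanduser("~")
)
```

Calling the helper a second time leaves PATH unchanged, so the cell can be re-run safely.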

Let’s check the version of TFX.

python -c "from tfx import version ; print('TFX version: {}'.format(version.__version__))"

!python3 -c "from tfx import version ; print('TFX version: {}'.format(version.__version__))"
TFX version: 1.14.0
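If importing tfx itself fails, it can help to ask the packaging metadata whether the package is installed at all. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is hypothetical and not part of TFX.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Reports the version without importing the (large) tfx package itself.
print("TFX version:", installed_version("tfx") or "not installed")
```

This reads only the installed distribution metadata, so it does not confirm that the package imports cleanly.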

That completes the setup. We are ready to create a pipeline.

Step 2. Copy predefined template to your project directory.

In this step, we will create a working pipeline project directory and files by copying additional files from a predefined template.

You may give your pipeline a different name by changing the PIPELINE_NAME below. This will also become the name of the project directory where your files will be put.

export PIPELINE_NAME="my_pipeline"
export PROJECT_DIR=~/tfx/${PIPELINE_NAME}

import os
PIPELINE_NAME = "my_pipeline"
# Create a project directory under Colab content directory.
PROJECT_DIR = os.path.join(os.sep, "content", PIPELINE_NAME)

The TFX Python package includes the taxi template. If you are planning to solve a point-wise prediction problem, including classification and regression, this template can be used as a starting point.

The tfx template copy CLI command copies predefined template files into your project directory.

tfx template copy \
  --pipeline_name="${PIPELINE_NAME}" \
  --destination_path="${PROJECT_DIR}" \
  --model=taxi

!tfx template copy \
  --pipeline_name={PIPELINE_NAME} \
  --destination_path={PROJECT_DIR} \
  --model=taxi

CLI
Copying taxi pipeline template
Traceback (most recent call last):
  File "/tmpfs/src/tf_docs_env/bin/tfx", line 8, in <module>
    sys.exit(cli_group())
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx/tools/cli/commands/template.py", line 66, in copy
    template_handler.copy_template(ctx.flags_dict)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx/tools/cli/handler/template_handler.py", line 170, in copy_template
    _copy_and_replace_placeholder_dir(template_dir, destination_dir, ignore_paths,
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx/tools/cli/handler/template_handler.py", line 110, in _copy_and_replace_placeholder_dir
    fileio.makedirs(dst)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx/dsl/io/fileio.py", line 80, in makedirs
    _get_filesystem(path).makedirs(path)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx/dsl/io/plugins/tensorflow_gfile.py", line 71, in makedirs
    tf.io.gfile.makedirs(path)
  File "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 513, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.PermissionDeniedError: /content; Permission denied
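The traceback above ends with PermissionDeniedError for /content, which means the process could not create the Colab content directory in the environment where this output was captured. When running outside Colab, one option is to choose a base directory that the current user can write to. The sketch below is an illustration under that assumption: pick_project_dir is a hypothetical helper, and ~/tfx is the base directory used by the shell variant earlier in this step.

```python
import os


def pick_project_dir(pipeline_name, base_dirs):
    """Return <base>/<pipeline_name> for the first base directory that can be created or written."""
    for base in base_dirs:
        base = os.path.abspath(os.path.expanduser(base))
        # Walk up to the nearest existing ancestor; directory creation would start there.
        probe = base
        while not os.path.exists(probe):
            parent = os.path.dirname(probe)
            if parent == probe:
                break
            probe = parent
        if os.path.isdir(probe) and os.access(probe, os.W_OK | os.X_OK):
            return os.path.join(base, pipeline_name)
    raise PermissionError(
        "no writable base directory among: {}".format(list(base_dirs))
    )


# Usage: prefer the Colab directory, fall back to ~/tfx.
# PROJECT_DIR = pick_project_dir("my_pipeline", ["/content", "~/tfx"])
```

The helper only inspects permissions; it does not create any directory, so tfx template copy still creates the project directory itself.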

Change the working directory context in this notebook to the project directory.

cd ${PROJECT_DIR}

%cd {PROJECT_DIR}

[Errno 2] No such file or directory: '/content/my_pipeline'
/tmpfs/src/temp/docs/tutorials/tfx

Step 3. Browse your copied source files.

The TFX template provides basic scaffold files to build a pipeline, including Python source code, sample data, and Jupyter Notebooks to analyse the output of the pipeline. The taxi template uses the same Chicago Taxi dataset and ML model as the Airflow Tutorial.

In Google Colab, you can browse files by clicking the folder icon on the left. Files should be copied under the project directory, whose name is my_pipeline in this case. You can click directory names to see their contents, and double-click file names to open them.

Here is a brief introduction to each of the Python files.

  • pipeline – This directory contains the definition of the pipeline.
    • configs.py – defines common constants for pipeline runners
    • pipeline.py – defines TFX components and a pipeline
  • models – This directory contains ML model definitions.
    • features.py, features_test.py – defines features for the model
    • preprocessing.py, preprocessing_test.py – defines preprocessing jobs using tf::Transform
    • estimator – This directory contains an Estimator based model.
      • constants.py – defines constants of the model
      • model.py, model_test.py – defines DNN model using TF estimator
    • keras – This directory contains a Keras based model.
      • constants.py – defines constants of the model
      • model.py, model_test.py – defines DNN model using Keras
  • local_runner.py, kubeflow_runner.py – define runners for each orchestration engine

You might notice that some files have _test.py in their name. These are unit tests of the pipeline, and we recommend adding more unit tests as you implement your own pipelines. You can run a unit test by passing the module name of the test file with the -m flag. You can usually get the module name by deleting the .py extension and replacing each / with a dot (.). For example:

python -m models.features_test

!{sys.executable} -m models.features_test
!{sys.executable} -m models.keras_model.model_test
/tmpfs/src/tf_docs_env/bin/python: Error while finding module specification for 'models.features_test' (ModuleNotFoundError: No module named 'models')
/tmpfs/src/tf_docs_env/bin/python: Error while finding module specification for 'models.keras_model.model_test' (ModuleNotFoundError: No module named 'models')
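The path-to-module rule described above (drop the .py extension, replace / with .) can be sketched as a small function. The name module_name is hypothetical, and the sketch assumes the path is given relative to the project directory. Note that python -m resolves modules starting from the current working directory, so the ModuleNotFoundError in the output above is consistent with the command having been run outside the project directory.

```python
from pathlib import PurePosixPath


def module_name(test_file: str) -> str:
    """Convert a relative path such as 'models/features_test.py' into a module name."""
    path = PurePosixPath(test_file)
    if path.is_absolute() or path.suffix != ".py":
        raise ValueError("expected a relative .py path, got: {}".format(test_file))
    # Drop the extension, then join the path components with dots.
    return ".".join(path.with_suffix("").parts)


print(module_name("models/features_test.py"))           # models.features_test
print(module_name("models/keras_model/model_test.py"))  # models.keras_model.model_test
```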

Step 4. Run your first TFX pipeline

You can create a pipeline using the pipeline create command.

tfx pipeline create --engine=local --pipeline_path=local_runner.py

!tfx pipeline create --engine=local --pipeline_path=local_runner.py
CLI
Creating pipeline
Using TensorFlow backend
Invalid pipeline path: local_runner.py
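The message "Invalid pipeline path: local_runner.py" in the output above is consistent with the command being run from a directory that does not contain local_runner.py, because --pipeline_path is given as a relative path. As a minimal sketch, the hypothetical helper below checks for the runner file first and returns the command together with the directory to run it from. It uses only the flags shown above.

```python
import os


def pipeline_create_command(project_dir, runner="local_runner.py"):
    """Return (argv, cwd) for `tfx pipeline create`, or raise if the runner file is missing."""
    runner_path = os.path.join(project_dir, runner)
    if not os.path.isfile(runner_path):
        raise FileNotFoundError(
            "{} not found; copy the template first or check the project directory".format(
                runner_path
            )
        )
    argv = ["tfx", "pipeline", "create", "--engine=local", "--pipeline_path=" + runner]
    return argv, project_dir


# Usage (commented out because it invokes the tfx CLI):
# import subprocess
# argv, cwd = pipeline_create_command(os.getcwd())
# subprocess.run(argv, cwd=cwd, check=True)
```

Failing early with the full path makes it easier to tell a wrong working directory apart from a template copy that never succeeded.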

Then, you can run the created pipeline using the run create command.

tfx run create --engine=local --pipeline_name="${PIPELINE_NAME}"

!tfx run create --engine=local --pipeline_name={PIPELINE_NAME}
CLI
Creating a run for pipeline: my_pipeline
Using TensorFlow backend
Pipeline "my_pipeline" does not exist.

If successful, you’ll see Component CsvExampleGen is finished. When you copy the template, only one component, CsvExampleGen, is included in the pipeline.
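When the run is launched from a script rather than read by eye, the success message can be checked programmatically. This is a minimal sketch, assuming the captured output contains the phrase quoted above verbatim; component_finished is a hypothetical helper, not part of the TFX CLI.

```python
def component_finished(log_text: str, component: str) -> bool:
    """Return True if the captured output reports that the given component finished."""
    return "Component {} is finished".format(component) in log_text


# Example with captured output held in a string.
sample_output = "... Component CsvExampleGen is finished. ..."
print(component_finished(sample_output, "CsvExampleGen"))
```

As more components are enabled in the pipeline, the same check can be repeated once per component name.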