In this step, you will add components for data validation, including `StatisticsGen`, `SchemaGen`, and `ExampleValidator`. If you are interested in data validation, please see Get started with TensorFlow Data Validation.
We will modify the copied pipeline definition in `pipeline/pipeline.py`. If you are working on your local environment, use your favorite editor to edit the file. If you are working on Google Colab:

1. Click the folder icon on the left to open the *Files* view.
2. Click `my_pipeline` to open the directory, click the `pipeline` directory to open it, and double-click `pipeline.py` to open the file.
Find and uncomment the 3 lines which add `StatisticsGen`, `SchemaGen`, and `ExampleValidator` to the pipeline. (Tip: find comments containing `TODO(step 5):`.)
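Once uncommented, the three components are wired together roughly as follows. This is a sketch using the `tfx.v1` API; the exact variable names in your copy of the template may differ, and `example_gen` is assumed to be the `ExampleGen` instance defined earlier in the file:

```python
from tfx import v1 as tfx

# Computes statistics over the ingested examples.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])

# Infers a schema (feature types, ranges, vocabularies) from the statistics.
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])

# Flags anomalies by checking the statistics against the schema.
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```

Each component consumes the output channel of the previous one, which is how TFX infers the execution order within the pipeline.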
Your change will be saved automatically in a few seconds. Make sure that the `*` mark in front of `pipeline.py` in the tab title has disappeared. There is no save button or shortcut for the file editor in Colab. Python files in the file editor can be saved to the runtime environment even in `playground` mode.
You now need to update the existing pipeline with the modified pipeline definition. Use the `tfx pipeline update` command to update your pipeline, followed by the `tfx run create` command to create a new execution run of your updated pipeline.
In a local shell:

```sh
# Update the pipeline
tfx pipeline update --engine=local --pipeline_path=local_runner.py
# You can run the pipeline the same way.
tfx run create --engine local --pipeline_name "${PIPELINE_NAME}"
```

In a Jupyter/Colab notebook (where `PIPELINE_NAME` is a Python variable):

```sh
# Update the pipeline
!tfx pipeline update --engine=local --pipeline_path=local_runner.py
# You can run the pipeline the same way.
!tfx run create --engine local --pipeline_name {PIPELINE_NAME}
```
```
CLI
Updating pipeline
Using TensorFlow backend
Invalid pipeline path: local_runner.py
CLI
Creating a run for pipeline: my_pipeline
Using TensorFlow backend
Pipeline "my_pipeline" does not exist.
```
You should be able to see the output log from the added components. Our pipeline creates output artifacts in the `tfx_pipeline_output/my_pipeline` directory.
In this step, you will add components for training and model validation, including `Transform`, `Trainer`, `Resolver`, `Evaluator`, and `Pusher`.
Open `pipeline/pipeline.py`. Find and uncomment the 5 lines which add `Transform`, `Trainer`, `Resolver`, `Evaluator`, and `Pusher` to the pipeline. (Tip: find `TODO(step 6):`.)
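For orientation, the uncommented components are connected roughly like this. This is a sketch, not the template's exact code: `module_file`, `eval_config`, and `serving_model_dir` are assumed to be defined elsewhere in the file, and the step counts are placeholders:

```python
from tfx import v1 as tfx

# Feature engineering, using the preprocessing_fn in the module file.
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=module_file)

# Trains a model, using the run_fn in the module file.
trainer = tfx.components.Trainer(
    module_file=module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=5))

# Resolves the latest blessed model as a baseline for validation.
model_resolver = tfx.dsl.Resolver(
    strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing=tfx.dsl.Channel(
        type=tfx.types.standard_artifacts.ModelBlessing),
).with_id('latest_blessed_model_resolver')

# Compares the new model against the baseline and "blesses" it if it is better.
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config)

# Copies a blessed model to the serving directory.
pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=serving_model_dir)))
```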
As you did before, you now need to update the existing pipeline with the modified pipeline definition. The instructions are the same as in Step 5. Update the pipeline using `tfx pipeline update`, and create an execution run using `tfx run create`.
In a local shell:

```sh
tfx pipeline update --engine=local --pipeline_path=local_runner.py
tfx run create --engine local --pipeline_name "${PIPELINE_NAME}"
```

In a Jupyter/Colab notebook (where `PIPELINE_NAME` is a Python variable):

```sh
!tfx pipeline update --engine=local --pipeline_path=local_runner.py
!tfx run create --engine local --pipeline_name {PIPELINE_NAME}
```
```
CLI
Updating pipeline
Using TensorFlow backend
Invalid pipeline path: local_runner.py
CLI
Creating a run for pipeline: my_pipeline
Using TensorFlow backend
Pipeline "my_pipeline" does not exist.
```
When this execution run finishes successfully, you have now created and run your first TFX pipeline using the local orchestrator!

Note: You might have noticed that every time we create a pipeline run, every component runs again even though the inputs and parameters were not changed. This wastes time and resources; you can skip those executions with pipeline caching. Enable caching by specifying `enable_cache=True` for the `Pipeline` object in `pipeline.py`.
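For example, the `Pipeline` constructor call in `pipeline.py` would include the flag. This is a sketch: `pipeline_name`, `pipeline_root`, `components`, and `metadata_connection_config` stand in for whatever your template already passes:

```python
from tfx import v1 as tfx

pipeline = tfx.dsl.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    components=components,
    # Skip re-running a component when its inputs and parameters are unchanged.
    enable_cache=True,
    metadata_connection_config=metadata_connection_config)
```

With caching enabled, a second `tfx run create` after an unchanged `tfx pipeline update` reuses the cached artifacts instead of recomputing them.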
BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse. BigQuery can be used as a source of training examples in TFX. In this step, we will add `BigQueryExampleGen` to the pipeline.
You need a Google Cloud Platform account to use BigQuery. Please prepare a GCP project.
Log in to your project using the Colab auth library or the `gcloud` utility.
In a local shell:

```sh
# You need the `gcloud` tool to log in from a local shell environment.
gcloud auth login
```

In a Colab notebook:

```python
import sys

if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
    print('Authenticated')
```
You should specify your GCP project name to access BigQuery resources using TFX. Set the `GOOGLE_CLOUD_PROJECT` environment variable to your project name.
In a local shell:

```sh
export GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
```

In a Jupyter/Colab notebook:

```python
# Set your project name below.
# WARNING! ENTER your project name before running this cell.
%env GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
```

You should see output like:

```
env: GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
```
Open `pipeline/pipeline.py`. Comment out `CsvExampleGen` and uncomment the line which creates an instance of `BigQueryExampleGen`. You also need to uncomment the `query` argument of the `create_pipeline` function.
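After the edit, the relevant part of the pipeline definition looks roughly like the following. This is a sketch; the exact variable names in your copy of the template may differ, and `query` is the argument passed into `create_pipeline`:

```python
from tfx import v1 as tfx

# CsvExampleGen is replaced by BigQueryExampleGen, which ingests
# training examples directly from a BigQuery query result.
# example_gen = tfx.components.CsvExampleGen(input_base=data_path)
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query=query)
```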
We need to specify which GCP project to use for BigQuery again. This is done by setting `--project` in `beam_pipeline_args` when creating a pipeline.
Open `pipeline/configs.py`. Uncomment the definitions of `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` and `BIG_QUERY_QUERY`. You should replace the project id and the region value in this file with the correct values for your GCP project.
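Uncommented, the definitions have roughly the following shape. The project id, bucket, dataset, and query below are placeholders for illustration, not the template's actual values:

```python
# Placeholder values; substitute your own GCP project id and GCS bucket.
GOOGLE_CLOUD_PROJECT = 'my-gcp-project'

# Beam pipeline args that tell the BigQuery client which project (and
# temp location) to use when reading training examples.
BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=gs://my-gcs-bucket/tmp',  # hypothetical bucket
]

# Example query against a public dataset; the template ships its own
# query tailored to its sample data.
BIG_QUERY_QUERY = """
    SELECT *
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    LIMIT 10000
"""
```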