In this step, you will add components for data validation, including `StatisticsGen`, `SchemaGen`, and `ExampleValidator`. If you are interested in data validation, please see Get started with TensorFlow Data Validation.
We will modify the copied pipeline definition in `pipeline/pipeline.py`. If you are working on your local environment, use your favorite editor to edit the file. If you are working on Google Colab:

1. Click the folder icon on the left to open the *Files* view.
2. Click `my_pipeline` to open the directory, click the `pipeline` directory to open it, and double-click `pipeline.py` to open the file.
Find and uncomment the 3 lines which add `StatisticsGen`, `SchemaGen`, and `ExampleValidator` to the pipeline. (Tip: find comments containing `TODO(step 5):`.)
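Once uncommented, the three components are wired together roughly as follows. This is a sketch using the `tfx.v1` API; the exact variable names in your copy of the template may differ, and `example_gen` is assumed to be the `ExampleGen` instance defined earlier in the file:

```python
from tfx import v1 as tfx

# Computes statistics over the ingested examples.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])

# Infers a schema (feature types, ranges, vocabularies) from the statistics.
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])

# Flags anomalies by checking the statistics against the schema.
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```

Each component consumes the output channel of the previous one, which is how TFX infers the execution order within the pipeline.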
Your change will be saved automatically in a few seconds. Make sure that the `*` mark in front of `pipeline.py` in the tab title has disappeared. There is no save button or shortcut for the file editor in Colab. Python files in the file editor can be saved to the runtime environment even in `playground` mode.
You now need to update the existing pipeline with the modified pipeline definition. Use the `tfx pipeline update` command to update your pipeline, followed by the `tfx run create` command to create a new execution run of your updated pipeline.
In a local shell:

```sh
# Update the pipeline
tfx pipeline update --engine=local --pipeline_path=local_runner.py
# You can run the pipeline the same way.
tfx run create --engine local --pipeline_name "${PIPELINE_NAME}"
```

In a Jupyter/Colab notebook (where `PIPELINE_NAME` is a Python variable):

```sh
# Update the pipeline
!tfx pipeline update --engine=local --pipeline_path=local_runner.py
# You can run the pipeline the same way.
!tfx run create --engine local --pipeline_name {PIPELINE_NAME}
```
```
CLI
Updating pipeline
Using TensorFlow backend
Invalid pipeline path: local_runner.py
CLI
Creating a run for pipeline: my_pipeline
Using TensorFlow backend
Pipeline "my_pipeline" does not exist.
```
You should be able to see the output log from the added components. Our pipeline creates output artifacts in the `tfx_pipeline_output/my_pipeline` directory.
In this step, you will add components for training and model validation, including `Transform`, `Trainer`, `Resolver`, `Evaluator`, and `Pusher`.
Open `pipeline/pipeline.py`. Find and uncomment the 5 lines which add `Transform`, `Trainer`, `Resolver`, `Evaluator`, and `Pusher` to the pipeline. (Tip: find `TODO(step 6):`.)
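For orientation, the uncommented components are connected roughly like this. This is a sketch, not the template's exact code: `module_file`, `eval_config`, and `serving_model_dir` are assumed to be defined elsewhere in the file, and the step counts are placeholders:

```python
from tfx import v1 as tfx

# Feature engineering, using the preprocessing_fn in the module file.
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=module_file)

# Trains a model, using the run_fn in the module file.
trainer = tfx.components.Trainer(
    module_file=module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=5))

# Resolves the latest blessed model as a baseline for validation.
model_resolver = tfx.dsl.Resolver(
    strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing=tfx.dsl.Channel(
        type=tfx.types.standard_artifacts.ModelBlessing),
).with_id('latest_blessed_model_resolver')

# Compares the new model against the baseline and "blesses" it if it is better.
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config)

# Copies a blessed model to the serving directory.
pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=serving_model_dir)))
```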
As you did before, you now need to update the existing pipeline with the modified pipeline definition. The instructions are the same as in Step 5. Update the pipeline using `tfx pipeline update`, and create an execution run using `tfx run create`.
In a local shell:

```sh
tfx pipeline update --engine=local --pipeline_path=local_runner.py
tfx run create --engine local --pipeline_name "${PIPELINE_NAME}"
```

In a Jupyter/Colab notebook (where `PIPELINE_NAME` is a Python variable):

```sh
!tfx pipeline update --engine=local --pipeline_path=local_runner.py
!tfx run create --engine local --pipeline_name {PIPELINE_NAME}
```
```
CLI
Updating pipeline
Using TensorFlow backend
Invalid pipeline path: local_runner.py
CLI
Creating a run for pipeline: my_pipeline
Using TensorFlow backend
Pipeline "my_pipeline" does not exist.
```
When this execution run finishes successfully, you have now created and run your first TFX pipeline using the local orchestrator!

Note: You might have noticed that every time we create a pipeline run, every component runs again even though the inputs and parameters were not changed. This wastes time and resources; you can skip those executions with pipeline caching. Enable caching by specifying `enable_cache=True` for the `Pipeline` object in `pipeline.py`.
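For example, the `Pipeline` constructor call in `pipeline.py` would include the flag. This is a sketch: `pipeline_name`, `pipeline_root`, `components`, and `metadata_connection_config` stand in for whatever your template already passes:

```python
from tfx import v1 as tfx

pipeline = tfx.dsl.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    components=components,
    # Skip re-running a component when its inputs and parameters are unchanged.
    enable_cache=True,
    metadata_connection_config=metadata_connection_config)
```

With caching enabled, a second `tfx run create` after an unchanged `tfx pipeline update` reuses the cached artifacts instead of recomputing them.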
BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse. BigQuery can be used as a source of training examples in TFX. In this step, we will add `BigQueryExampleGen` to the pipeline.
You need a Google Cloud Platform account to use BigQuery. Please prepare a GCP project.
Log in to your project using the Colab auth library or the `gcloud` utility.
In a local shell:

```sh
# You need the `gcloud` tool to log in from a local shell environment.
gcloud auth login
```

In a Colab notebook:

```python
import sys

if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
    print('Authenticated')
```
You should specify your GCP project name to access BigQuery resources using TFX. Set the `GOOGLE_CLOUD_PROJECT` environment variable to your project name.
In a local shell:

```sh
export GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
```

In a Jupyter/Colab notebook:

```python
# Set your project name below.
# WARNING! ENTER your project name before running this cell.
%env GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
```

You should see output like:

```
env: GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
```
Open `pipeline/pipeline.py`. Comment out `CsvExampleGen` and uncomment the line which creates an instance of `BigQueryExampleGen`. You also need to uncomment the `query` argument of the `create_pipeline` function.
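After the edit, the relevant part of the pipeline definition looks roughly like the following. This is a sketch; the exact variable names in your copy of the template may differ, and `query` is the argument passed into `create_pipeline`:

```python
from tfx import v1 as tfx

# CsvExampleGen is replaced by BigQueryExampleGen, which ingests
# training examples directly from a BigQuery query result.
# example_gen = tfx.components.CsvExampleGen(input_base=data_path)
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query=query)
```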
We need to specify which GCP project to use for BigQuery again. This is done by setting `--project` in `beam_pipeline_args` when creating a pipeline.
Open `pipeline/configs.py`. Uncomment the definitions of `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` and `BIG_QUERY_QUERY`. You should replace the project id and the region value in this file with the correct values for your GCP project.
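Uncommented, the definitions have roughly the following shape. The project id, bucket, dataset, and query below are placeholders for illustration, not the template's actual values:

```python
# Placeholder values; substitute your own GCP project id and GCS bucket.
GOOGLE_CLOUD_PROJECT = 'my-gcp-project'

# Beam pipeline args that tell the BigQuery client which project (and
# temp location) to use when reading training examples.
BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=gs://my-gcs-bucket/tmp',  # hypothetical bucket
]

# Example query against a public dataset; the template ships its own
# query tailored to its sample data.
BIG_QUERY_QUERY = """
    SELECT *
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    LIMIT 10000
"""
```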