Track Your Data with DVC - DVC Tutorial

Automate your workflow with DVC Pipeline for MLOps

Data science projects can quickly become messy, with lots of data and code files to manage, and the potential for conflicts and errors as different team members work on the project. To keep everything organized and under control, you need a version control system that is specifically designed for data science projects.

That's where Data Version Control (DVC) comes in. DVC is an open-source tool that helps data scientists and machine learning engineers manage their data and code in a Git-like fashion. With DVC, you can track changes to your data and code, collaborate with team members, and reproduce your experiments with ease.

In this post, we will walk you through a simple tutorial on how to use DVC to manage your data science project.

Prerequisites:

Before we dive into the tutorial, you should have the following installed on your machine:

  • Git

  • Python 3

  • DVC

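Also, you should have some basic knowledge of Git and Python. The scripts in this tutorial import pandas, scikit-learn, and joblib, so install those packages with pip if they are missing.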
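
Before starting, it can help to confirm that the command-line tools listed above are actually on your PATH. The following is a minimal sketch using only the Python standard library. The helper name `missing_tools` is hypothetical and not part of DVC, and the executable names are assumptions (on some systems Python is installed as `python` rather than `python3`).

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names assumed to match the prerequisites list above.
    missing = missing_tools(["git", "python3", "dvc"])
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

If anything is reported as missing, install it before continuing with the steps below.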

Tutorial:

In this tutorial, we will create a simple data science project that predicts the price of a house based on its size. We will use DVC to manage our data and code.

  1. Create a new directory for your project:
mkdir house-price-prediction
cd house-price-prediction
  1. Initialize a Git repository:
git init
  1. Initialize a DVC repository:
dvc init
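  1. Create a new Python script called train.py:
import os

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load data
data = pd.read_csv('data/housing.csv')

# Split data into features and target.
# This dataset has no 'Size' or 'Price' column, so total_rooms stands in
# for size and median_house_value for price.
X = data[['total_rooms']]
y = data[['median_house_value']]

# Train model
model = LinearRegression()
model.fit(X, y)

# Save model (create the models directory first, since it does not exist yet)
os.makedirs('models', exist_ok=True)
joblib.dump(model, 'models/housing.pkl')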
  1. Create a new directory called data and download a dataset for house prices:
mkdir data
cd data
wget https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv
cd ..
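  1. Add the data directory to DVC. This creates a small data.dvc pointer file for Git to track and adds the data directory itself to .gitignore:
dvc add data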
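  1. Commit the changes to Git and push the data to DVC remote storage (dvc push requires a default remote, configured beforehand with dvc remote add -d):
git add .
git commit -m "Initial commit"
dvc push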
  1. Create a new Python script called predict.py:
import joblib

# Load model
model = joblib.load('models/housing.pkl')

# Predict house price
size = [[1650]]
price = model.predict(size)

print(price)
  1. Stage the predict.py script with Git. Code belongs in Git; DVC is meant for large data and model files:
git add predict.py
  1. Commit the changes to Git and DVC:
git add .
git commit -m "Added predict script"
dvc push
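  1. Update the train.py script to read the data through the DVC Python API. The model is still saved to a local path, because dvc.api.get_url only returns a file's storage location and cannot be used to write files:
import os

import dvc.api
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load data from the DVC-tracked path
data_path = 'data/housing.csv'
with dvc.api.open(data_path) as f:
    data = pd.read_csv(f)

# Split data into features and target.
# This dataset has no 'Size' or 'Price' column, so total_rooms stands in
# for size and median_house_value for price.
X = data[['total_rooms']]
y = data[['median_house_value']]

# Train model
model = LinearRegression()
model.fit(X, y)

# Save model to a local path so it can be tracked with dvc add
model_path = 'models/housing.pkl'
os.makedirs('models', exist_ok=True)
joblib.dump(model, model_path)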
  1. Commit the changes to Git and DVC (git push requires a configured Git remote):
git add .
git commit -m "Updated train script to use the DVC API"
git push
dvc push
  1. Now train the model, track the saved model file with DVC, and push it to remote storage:
python train.py
dvc add models/housing.pkl
dvc push
  1. Update the predict.py script to load the model through the DVC Python API:
import dvc.api
import joblib

# Load model from the DVC-tracked path (binary mode, since it is a pickle)
model_path = 'models/housing.pkl'
with dvc.api.open(model_path, mode='rb') as f:
    model = joblib.load(f)

# Predict house price
size = [[1650]]
price = model.predict(size)

print(price)
  1. Commit the changes to Git and DVC:
git add .
git commit -m "Updated predict script to use DVC to load model"
dvc push
  1. Now you can run the predict.py script to make predictions:
python predict.py
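
The train.py script in the steps above selects columns by name, so a mismatch between the script and the downloaded CSV surfaces as a KeyError at training time. As a rough sketch, the header row can be checked up front with the standard library alone. The helper names `csv_columns` and `missing_columns` are hypothetical, introduced only for illustration.

```python
import csv


def csv_columns(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


def missing_columns(path, required):
    """Return the names in `required` that are absent from the CSV header."""
    present = set(csv_columns(path))
    return [name for name in required if name not in present]
```

Calling `missing_columns('data/housing.csv', [...])` with the column names your script expects returns an empty list when every column is present.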

Congratulations!

That's it! You have successfully used DVC to manage your data and code for a simple data science project. With DVC, you can track changes to your data and code, collaborate with team members, and reproduce your experiments with ease.
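
The change tracking mentioned above rests on content hashes: the .dvc pointer files that Git tracks record an MD5 hash of the data they point to, so a change to the data shows up as a change to a small text file. The sketch below illustrates that idea with the standard library only. It is not DVC's implementation, and the helper names `file_md5` and `has_changed` are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    """Report whether the file's current hash differs from a recorded one."""
    return file_md5(path) != recorded_md5
```

Comparing a stored hash with a freshly computed one is a cheap way to tell whether a large file was modified, without keeping a second copy of it.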

Conclusion:

In this tutorial, we have covered the basics of using DVC to manage a simple data science project. DVC is a powerful tool that can help you keep your data and code organized and under control. By using DVC, you can make your data science projects more reproducible and easier to collaborate on with team members. We hope that this tutorial has helped you get started with using DVC for your data science projects.
