
Version Control for Large Data with DVC and OCI Object Storage

How to Manage Large Data with Git

I usually use Git for version control of my programs. Saving the state of the code with each change lets me track the history of modifications and easily revert to a previous state. Large data, however, is awkward to handle in Git. There is a dedicated tool, Git LFS, but I’d rather not use it because GitHub caps how much data you can upload.

So today, I’d like to try managing large files with DVC and OCI Object Storage.

What is DVC? What is Object Storage?

With DVC, the data itself is stored outside Git, and Git tracks only the hash values of that data. This significantly reduces the amount of data Git has to manage.

Relationship diagram

As the storage destination, you can configure almost anything, such as another directory or an on-premises server. Cloud storage works too. Amazon S3 is one option, but it costs a bit. OCI Object Storage provides an S3-compatible API and lets you store up to 20 GB for free.

Installing DVC

Here are the installation instructions:

https://dvc.org/doc/install

I’ll follow the tutorial using uv, which lets me install Python packages faster than with pip.

Create a working folder:

uv init dvc-oci
cd dvc-oci

uv init normally initializes a Git repository in the new folder for you. If your folder is not a Git repository for some reason, run git init yourself, because dvc init expects a Git-managed folder by default.
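
Before going further, it can be worth checking that the folder really is a Git repository. The following is a minimal sketch; the helper name in_git_repo is my own and not part of Git or DVC.

```shell
# Succeeds only when the given directory is inside a Git working tree.
# (in_git_repo is a hypothetical helper name used for this sketch.)
in_git_repo() {
  git -C "$1" rev-parse --is-inside-work-tree >/dev/null 2>&1
}

if in_git_repo .; then
  echo "Git repository found"
else
  echo "No Git repository here; run 'git init' before 'dvc init'"
fi
```

If the second message appears, running git init in the folder should be enough to continue.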

Create a virtual environment and start a shell:

uv venv
source .venv/bin/activate

Install DVC with S3 support:

uv add --dev 'dvc[s3]'

Verify that DVC is available:

dvc --help

Commit to git at this point:

git add -A
git commit -m 'install dvc'

Project Initialization

From here, I’ll proceed with the Get Started guide:

https://dvc.org/doc/start

Create a DVC environment:

dvc init

.dvcignore and .dvc were created. .dvcignore works like .gitignore: it lists the files you want DVC to ignore. .dvc is DVC’s working directory, and you rarely need to edit it by hand. Confirm that both were automatically staged in Git:

git status

Commit:

git commit -m "Initialize DVC"

Data Tracking

The tutorial uses an XML file, but I’d like to manage binary image data, so I’ll download an image from Lorem Picsum and save it as image.jpg.

mkdir data
curl -L -o data/image.jpg https://picsum.photos/id/237/200/300

Let’s put image.jpg under DVC management:

dvc add data/image.jpg

A file called image.jpg.dvc is generated in data, and image.jpg itself is excluded from Git by data/.gitignore. image.jpg.dvc is a small text file containing information, such as a hash value, that uniquely identifies image.jpg. Git tracks only this file.

outs:
- md5: b3a6da9d3fbd48339cb2982b5bf41e35
  size: 10839
  hash: md5
  path: image.jpg

Commit the generated files with git:

git add data/image.jpg.dvc data/.gitignore
git commit -m "Add raw data"

Cloud Setup

The actual image.jpg will be stored in OCI. First, create an Object Storage bucket. The cloud console has an exhausting number of settings, but you only need to name the bucket bucket-dvc and leave everything else at its default.

https://cloud.oracle.com/object-storage/buckets

Bucket creation screen

Also create a secret key for access (OCI calls this a Customer Secret Key). Creating it gives you an access_key_id and a secret_access_key, so note both down somewhere safe. They are secrets, so don’t commit them to Git.

Secret creation screen

https://cloud.oracle.com/identity/domains/my-profile/auth-tokens

Next, look up the connection details: the namespace and the OCI region. The namespace is shown on the bucket details screen. The region looks like ap-osaka-1 and appears at the end of the URL.

Namespace

Data Storage

Let’s actually save the data. First, configure the storage destination, replacing <namespace> and <region> with the values you confirmed earlier.

dvc remote add -d storage s3://bucket-dvc
dvc remote modify storage endpointurl https://<namespace>.compat.objectstorage.<region>.oci.customer-oci.com
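
If you would rather not edit the URL by hand, the endpoint can be assembled from two shell variables. This is only a sketch: examplenamespace is a made-up placeholder, and the variable names are my own.

```shell
# Hypothetical placeholder values; substitute your own namespace and region.
OCI_NAMESPACE="examplenamespace"
OCI_REGION="ap-osaka-1"

# Same URL pattern as the endpointurl setting above.
OCI_ENDPOINT="https://${OCI_NAMESPACE}.compat.objectstorage.${OCI_REGION}.oci.customer-oci.com"
echo "$OCI_ENDPOINT"
```

The result can then be passed along as dvc remote modify storage endpointurl "$OCI_ENDPOINT".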

Save the access_key_id and secret_access_key, replacing xxxxxx with your own values. Because of --local, these settings are tracked by neither Git nor DVC, so you’ll need to set them again in each new local work environment.

dvc remote modify --local storage access_key_id 'xxxxxx'
dvc remote modify --local storage secret_access_key 'xxxxxx'
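
With --local, the keys should end up in .dvc/config.local, which DVC’s own .dvc/.gitignore excludes, rather than in the Git-tracked .dvc/config. As a quick sanity check, here is a minimal sketch; has_credentials is a helper name I made up.

```shell
# Succeeds when the given config file mentions a credential setting.
# (has_credentials is a hypothetical helper name used for this sketch.)
has_credentials() {
  grep -Eq 'access_key_id|secret_access_key' "$1"
}

if [ -f .dvc/config ] && has_credentials .dvc/config; then
  echo "WARNING: credentials found in the Git-tracked .dvc/config"
else
  echo "No credentials in the Git-tracked .dvc/config"
fi
```

If the warning appears, the keys were probably added without --local and should be removed from .dvc/config before committing.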

This is the key point of this post: unless you set the following environment variables, the push won’t work properly. This seems to be a common requirement when using S3-compatible storage. Details are on the following page:

https://www.ateam-oracle.com/post/using-oci-os-s3-interface

export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
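
The exports above last only for the current shell session. As one alternative, here is a minimal sketch of a wrapper that applies both settings to a single command; with_oci_checksums is a name I made up.

```shell
# Runs any command with the two checksum settings applied to that command only.
# (with_oci_checksums is a hypothetical helper name used for this sketch.)
with_oci_checksums() {
  env AWS_REQUEST_CHECKSUM_CALCULATION=when_required \
      AWS_RESPONSE_CHECKSUM_VALIDATION=when_required \
      "$@"
}

# Example usage: with_oci_checksums dvc push
```

This keeps the variables out of the rest of the session, which may matter if you also talk to real AWS S3 from the same shell.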

Finally, let’s save:

dvc push

If it reports 1 file pushed, it worked. Let’s check the contents from the OCI console. Note that image.jpg is not stored under its own name: DVC stores each file under a name derived from its hash, so the bucket seems to hold strange files at first glance, but that is expected.

Saved files
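
The odd-looking object names can be predicted from the hash in image.jpg.dvc. The sketch below assumes the DVC 3.x remote layout, where objects live under files/md5/ followed by the first two hash characters as a folder; older DVC versions use a different layout.

```shell
# md5 value taken from data/image.jpg.dvc
md5="b3a6da9d3fbd48339cb2982b5bf41e35"

# Split the hash into a two-character folder name and the remainder.
prefix="$(printf '%s' "$md5" | cut -c1-2)"
rest="$(printf '%s' "$md5" | cut -c3-)"

object_key="files/md5/${prefix}/${rest}"
echo "$object_key"
```

Under that assumption, an object with this key should be visible in bucket-dvc after the push.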

From other work environments, you can get the latest version by running git pull and dvc pull.

Summary