Full migration of Document Engine to another S3 bucket
Migrating PSPDFKit/Document Engine Data Storage to Another S3 Bucket.
Problem:#
If you need to switch to a different S3 bucket, you will find that simply specifying the new bucket only makes the service store new files there, while old files are still retrieved from the old bucket.
This is because the service stores the file storage configuration for each file separately.
Document Engine stores a storage config per document/attachment in three database tables: pdfs, documents, and attachments.
Example:
insert into documents (id, source_pdf, original_file)
values ('7KV77777DKG',
'{"s3": {"bucket": "my-bucket-prod", "region": "us-east-1"}, ...',
'{"s3": {"bucket": "my-bucket-prod", "region": "us-east-1"}, ...');
If you have millions of documents in your PSPDFKit/Document Engine installation, you have no easy full migration options.
Possible solutions:#
1. Leave as is (AWS S3)#
This option works well if you can keep the old bucket, permanently or temporarily, and provide access to it. In that case, Document Engine keeps serving old files from the old bucket while storing new files in the new one.
To do this, your user or service must have a policy with read/write permissions to both buckets.
An IAM Policy Example:
{
  "Statement": [
    {
      "Action": [
        "s3:PutObject",
        "s3:List*",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<your-new-bucket>/*",
        "arn:aws:s3:::<your-new-bucket>",
        "arn:aws:s3:::<your-old-bucket>/*",
        "arn:aws:s3:::<your-old-bucket>"
      ],
      "Sid": "AccessToStorageBucket"
    }
  ],
  "Version": "2012-10-17"
}
In the future, you can use the Document Engine API to gradually migrate documents to the new bucket (see the Document Engine API documentation):
curl --request POST \
--url http://localhost:5000/api/documents/{documentId}/migrate_assets \
--header 'Authorization: Token token=secret' \
--header 'content-type: application/json' \
--header 'pspdfkit-pdf-password: password' \
--data '{"storage":{"backend":"s3","bucketName":"<your-new-bucket>","bucketRegion":"<your-new-bucket-region>"},"copy_remote_files_to_new_storage":true}'
2. Directly replace the S3 bucket name in the database#
This option is suitable for installations with thousands of documents. For installations with millions, it may take significant time.
IMPORTANT: Before or after this, you need to move all files to a new bucket.
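One possible way to copy the objects between buckets is the AWS CLI; this is a sketch, assuming the CLI is installed and your credentials can read the old bucket and write to the new one:
# Copy all objects from the old bucket to the new one
aws s3 sync s3://<your-old-bucket> s3://<your-new-bucket>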
UPDATE documents
SET source_pdf = source_pdf || '{"s3": {"bucket": "<your-new-bucket>", "region": "<your-new-bucket-region>"}}',
original_file = original_file || '{"s3": {"bucket": "<your-new-bucket>", "region": "<your-new-bucket-region>"}}';
UPDATE attachments
SET backend_info = backend_info || '{"s3": {"bucket": "<your-new-bucket>", "region": "<your-new-bucket-region>"}}';
UPDATE pdfs
SET backend_info = backend_info || '{"s3": {"bucket": "<your-new-bucket>", "region": "<your-new-bucket-region>"}}';
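After the updates, a quick sanity check can confirm that no rows still point at the old bucket. This is a sketch, assuming the storage columns are jsonb (as the || merge above implies) and psql access with the same placeholders as elsewhere in this post; each count should be 0.
psql -h 127.0.0.1 -U my_user -d my_database_name -c \
  "SELECT count(*) FROM documents WHERE source_pdf->'s3'->>'bucket' = '<your-old-bucket>'
      OR original_file->'s3'->>'bucket' = '<your-old-bucket>';"
psql -h 127.0.0.1 -U my_user -d my_database_name -c \
  "SELECT count(*) FROM attachments WHERE backend_info->'s3'->>'bucket' = '<your-old-bucket>';"
psql -h 127.0.0.1 -U my_user -d my_database_name -c \
  "SELECT count(*) FROM pdfs WHERE backend_info->'s3'->>'bucket' = '<your-old-bucket>';"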
Another variant is to use a batched update, but its performance degrades after roughly 500k records because of the ORDER BY/OFFSET pagination:
-- Example for the documents table.
-- Note: COMMIT inside a DO block requires PostgreSQL 11+ and must not be run
-- inside an explicit transaction block.
DO
$$
DECLARE
    batch_size   INTEGER := 500;
    off_set      INTEGER := 0;
    updated_rows INTEGER := 0;
BEGIN
    LOOP
        -- pick the next batch of document IDs
        WITH to_update AS (
            SELECT id
            FROM documents
            ORDER BY id
            OFFSET off_set
            LIMIT batch_size
        )
        UPDATE documents
        SET source_pdf    = source_pdf || '{"s3": {"bucket": "<your-new-bucket>", "region": "<your-new-bucket-region>"}}',
            original_file = original_file || '{"s3": {"bucket": "<your-new-bucket>", "region": "<your-new-bucket-region>"}}'
        FROM to_update
        WHERE documents.id = to_update.id;

        GET DIAGNOSTICS updated_rows = ROW_COUNT;
        COMMIT;

        EXIT WHEN updated_rows = 0;
        off_set := off_set + batch_size;
    END LOOP;
END
$$;
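If you save the block above to a file, you can run it with psql. The file name batch_update.sql is just an assumption here; remember that the block must not run inside an explicit transaction, so avoid psql's --single-transaction flag.
# Run the batched update (batch_update.sql is an assumed name for the block above)
psql -h 127.0.0.1 -U my_user -W -d my_database_name -f batch_update.sql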
3. Directly replace values in a database dump#
This option is a good solution for installations with millions of documents.
IMPORTANT:
- These actions may result in data corruption in the following three tables:
- pdfs
- documents
- attachments
- Before doing this, make a separate full backup of the database (or a snapshot); see the sketch after this list.
- Use tmux/screen or another such tool for these long-running tasks.
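For the separate safety backup (kept untouched, in addition to the dump you will edit in the next steps), a plain pg_dump in custom format is one option. This is a sketch using the same connection placeholders as the commands below; the output file name is an assumption.
# Separate safety backup in custom format; keep it aside for a possible rollback
pg_dump -h 127.0.0.1 -U my_user -W -d my_database_name -F c -f pre_migration_backup.dump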
1. Make a full database dump#
You can dump only the three tables in question (attachments, documents, and pdfs), but you may run into problems restoring them because of table constraints.
A full database backup guarantees data consistency and a trouble-free restore.
Also note that the records table can be very large.
# `-j 4`: parallelism; set it according to your database server resources
# (roughly CPU_CORE_NUM/2) and consider disk subsystem throughput
pg_dump -h 127.0.0.1 -U my_user -W -d my_database_name -F d -j 4 -Z 0 -f ./backup
2. Replace the storage configuration#
# Enter the directory with the dump data
cd backup
# Find which files contain the data of the pdfs, documents, and attachments tables
pg_restore --list ./ | grep -E 'TABLE DATA' | awk '{print $1 $7}'
# ...
# 5058;attachments
# ...
# 5067;documents
# ...
# 5060;pdfs
# Replace the storage config values in the attachments, documents, and pdfs tables
# OLD_BUCKET - current s3 bucket name
# OLD_REGION - current s3 bucket region
# NEW_BUCKET - a new s3 bucket name
# NEW_REGION - a new s3 bucket region
sed -e 's/OLD_BUCKET/NEW_BUCKET/g' -e 's/OLD_REGION/NEW_REGION/g' 5058.dat > out.5058.dat
sed -e 's/OLD_BUCKET/NEW_BUCKET/g' -e 's/OLD_REGION/NEW_REGION/g' 5060.dat > out.5060.dat
sed -e 's/OLD_BUCKET/NEW_BUCKET/g' -e 's/OLD_REGION/NEW_REGION/g' 5067.dat > out.5067.dat
# If you have compressed dump files, decompress them first or use `zcat`
# zcat 5058.dat.gz | sed -e 's/OLD_BUCKET/NEW_BUCKET/g' -e 's/OLD_REGION/NEW_REGION/g' | gzip > out.5058.dat.gz
# Replace the original files with the edited ones
mv out.5058.dat 5058.dat
mv out.5060.dat 5060.dat
mv out.5067.dat 5067.dat
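Before restoring, a quick check (a sketch, using the same placeholders as above) can confirm that no references to the old bucket name remain in the edited data files:
# Each file should report a count of 0
grep -c 'OLD_BUCKET' 5058.dat 5060.dat 5067.dat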
3. Restore the database dump#
pg_restore -h 127.0.0.1 -U my_user -W -d my_database_name -F d -j 4 ./backup
Finish#
Please contact me if you have a better solution. :)