I have a ton of images that I have hoarded over the years.
I have so many images that I have a really hard time finding images that I already know I saved.
To solve this problem, around 2020 I started using Hydrus Client, an open-source piece of software for storing and tagging images.
Back then, I would painstakingly go through all pictures that landed on my PC and give them some descriptive tags.
I did this every now and then for about 2-3 years and amassed approximately 2900 pictures.
If you are a friend of mine, it is likely that you are tagged somewhere in this database with person: <your-name>.
But I have come to realize that I never really used it for anything, and it was a waste of time for me.
I am not bashing Hydrus. It is a wonderful piece of software.
It is just not for me anymore…
Still, I want to archive these tags, such that I may sort them in another way in the future.
12 random pictures in my database. Heavily curated as I had to remove many personal pictures.
The problem is, Hydrus is a very active project and gets updates weekly.
My database is old and has not been migrated even once.
There is a migration guide, but I could not for the life of me follow it - no version of Hydrus I could compile could open my database.
So I decided to create a migration script which collects all the image files from the various Hydrus file buckets, and then creates a text file with all the tags.
It also retags all the images using TMSU, a very light-weight tagging utility for Linux.
I just wanted to share my migration script here, as I won’t be maintaining or tracking it.
When run, the output looks a bit like this:
$ python3 migrate.py
Loaded 16030 hashes
Found 2916 files on disk
Loaded 10655 tag mappings
Loaded 17 unique namespaces
Loaded 2481 unique subtags
Loaded 2436 unique tags
Copied 2916 files to migrated_files
Wrote overview file to migrated_files/hydrus_files.txt
Tagged files in TMSU database located at migrated_files
This takes about 20 seconds on my machine for my 2916 files.
It then produces some new files:
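The hydrus_files.txt overview is a simple tab-separated listing: one file per line, followed by its tags joined with |. A couple of lines might look like this (filenames and tags are made up):

```
migrated_files/<hash-1>.png	cheese|person=andreas
migrated_files/<hash-2>.jpg	person=andreas
```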
I started by trying to read the documentation, but it didn’t say much about the database schema.
So I just started playing around with a database viewer, looking at the SQLite table definitions.
Here is what I gathered:
Each file is identified by a unique hash and placed at client_files/f<xx>/<hash>.<extension>, where <hash> is a hex representation of the hash and <xx> is the first two characters of said representation. The <extension> does not seem to be tracked.
In the client.master.db database exists a table hashes which ties each of these hashes to a hash_id
In the client.mappings.db database exist tables current_mappings_8 and current_mappings_9 which map from hash_id to tag_id
Why the tables are named this, I don’t know. Perhaps it is a versioning thing, or my Hydrus database is really fucked.
In the client.master.db database exists a table tags which ties each tag_id to a namespace_id and subtag_id
In the client.master.db database exists a table namespaces which ties namespace_id to a string
In the client.master.db database exists a table subtags which ties subtag_id to a string
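Since I gathered these column names by poking around, here is a minimal in-memory sketch (made-up rows, schema as observed above) of how the three client.master.db tables join into readable tag names:

```python
import sqlite3

# In-memory stand-in for client.master.db (made-up rows; column names as observed)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE namespaces (namespace_id INTEGER PRIMARY KEY, namespace TEXT);
    CREATE TABLE subtags    (subtag_id INTEGER PRIMARY KEY, subtag TEXT);
    CREATE TABLE tags       (tag_id INTEGER PRIMARY KEY, namespace_id INTEGER, subtag_id INTEGER);
    INSERT INTO namespaces VALUES (1, ''), (2, 'person');
    INSERT INTO subtags    VALUES (1, 'cheese'), (2, 'andreas');
    INSERT INTO tags       VALUES (10, 1, 1), (11, 2, 2);
""")

# One join resolves each tag_id to its namespace string and subtag string
rows = conn.execute("""
    SELECT t.tag_id, n.namespace, s.subtag
    FROM tags t
    JOIN namespaces n ON n.namespace_id = t.namespace_id
    JOIN subtags s    ON s.subtag_id    = t.subtag_id
    ORDER BY t.tag_id
""").fetchall()

# An empty namespace means a bare tag; otherwise namespace=subtag
names = {tag_id: (f"{ns}={st}" if ns else st) for tag_id, ns, st in rows}
print(names)  # {10: 'cheese', 11: 'person=andreas'}
```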
So the general flow is this
Gather all hashes in the hashes table. Attempt to find them in the file system
For each hash, determine its associated tag_ids via the current_mappings_8 and current_mappings_9 tables
For each tag_id, determine the namespace_id and subtag_id via the tags table
For each tag_id, determine a tag name by concatenating the associated strings from the namespaces and subtags tables
You now have a list of hashes, the file paths and tags with names such as cheese and person=andreas.
Move all these files somewhere new and create a file that lists all the new files and their tags.
Loop over the files and tag them with TMSU.
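The on-disk lookup in step 1 is the only fiddly part, so here it is as a standalone sketch (the hash below is made up):

```python
from pathlib import Path

def hydrus_bucket_dir(client_files: Path, hash_hex: str) -> Path:
    # Hydrus buckets a file under f<xx>, where <xx> is the first
    # two hex characters of its hash
    return client_files / f"f{hash_hex[:2]}"

def find_file(client_files: Path, hash_hex: str):
    # The extension is not tracked, so match <hash>.* and take the first hit
    matches = sorted(hydrus_bucket_dir(client_files, hash_hex).glob(f"{hash_hex}.*"))
    return matches[0] if matches else None

h = "1a2b" + "0" * 60  # made-up 64-character hash
print(hydrus_bucket_dir(Path("client_files"), h))  # client_files/f1a
```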
Full script
Don’t judge me on this code. I am only ever running it once and never maintaining it.
import sqlite3
from pathlib import Path
from collections import defaultdict
from subprocess import run
import argparse

parser = argparse.ArgumentParser(description="Migrate Hydrus files and tags.")
parser.add_argument("hydrus_db_path", type=Path, help="Path to Hydrus database directory")
parser.add_argument("--output-dir", type=Path, default=Path("./migrated_files/"), help="Directory to store migrated files")
args = parser.parse_args()

HYDRUS_DB_PATH = args.hydrus_db_path
CLIENT_MASTER_PATH = HYDRUS_DB_PATH / "client.master.db"
CLIENT_MAPPINGS_PATH = HYDRUS_DB_PATH / "client.mappings.db"
CLIENT_FILES_PATH = HYDRUS_DB_PATH / "client_files"
OUTPUT_DIR = args.output_dir
OVERVIEW_FILE = OUTPUT_DIR / "hydrus_files.txt"
OUTPUT_DIR.mkdir(exist_ok=True)

# Open the databases
client_master_conn = sqlite3.connect(CLIENT_MASTER_PATH)
client_mappings_conn = sqlite3.connect(CLIENT_MAPPINGS_PATH)
client_master_cursor = client_master_conn.cursor()
client_mappings_cursor = client_mappings_conn.cursor()

def hydrus_tag_to_string(namespace, subtag):
    if namespace:
        return f"{namespace}={subtag}"
    else:
        return f"{subtag}"

# Maps from hash_id to dictionary of file info
files = defaultdict(dict)

# The 'hashes' table has hash_id (integer) and hash (blob bytes)
client_master_cursor.execute("SELECT hash_id, hash FROM hashes")
for hash_id, hash_blob in client_master_cursor.fetchall():
    hash_hex = hash_blob.hex()
    files[hash_id]['hash'] = hash_hex
print(f"Loaded {len(files)} hashes")

# Try to find each file in the client_files directory (it may have any extension)
for hash_id, info in files.items():
    # Hydrus places files in subdirectories named after the first two hex digits of the hash
    candidate_folder = CLIENT_FILES_PATH / f"f{info['hash'][:2]}"
    # Use rglob to find any file with this hash prefix
    matched_files = list(candidate_folder.rglob(f"{info['hash']}.*"))
    if matched_files:
        info['path'] = str(matched_files[0])
        info['new_path'] = str(OUTPUT_DIR / matched_files[0].name)

# Keep only files that were found
files = {k: v for k, v in files.items() if 'path' in v}
print(f"Found {len(files)} files on disk")

# The 'current_mappings_8' and 'current_mappings_9' tables map hash_id to tag_id
tag_count = 0
for table_name in ['current_mappings_8', 'current_mappings_9']:
    client_mappings_cursor.execute(f"SELECT hash_id, tag_id FROM {table_name}")
    for hash_id, tag_id in client_mappings_cursor.fetchall():
        if hash_id not in files:
            continue
        if 'tag_ids' not in files[hash_id]:
            files[hash_id]['tag_ids'] = set()
        files[hash_id]['tag_ids'].add(tag_id)
        tag_count += 1
print(f"Loaded {tag_count} tag mappings")

# Tags consist of a namespace and a subtag
client_master_cursor.execute("SELECT namespace_id, namespace FROM namespaces")
namespaces = {namespace_id: namespace for namespace_id, namespace in client_master_cursor.fetchall()}
print(f"Loaded {len(namespaces)} unique namespaces")

client_master_cursor.execute("SELECT subtag_id, subtag FROM subtags")
subtags = {subtag_id: subtag for subtag_id, subtag in client_master_cursor.fetchall()}
print(f"Loaded {len(subtags)} unique subtags")

tags = defaultdict(dict)
client_master_cursor.execute("SELECT tag_id, namespace_id, subtag_id FROM tags")
for tag_id, namespace_id, subtag_id in client_master_cursor.fetchall():
    tags[tag_id]['namespace_id'] = namespace_id
    tags[tag_id]['subtag_id'] = subtag_id
    tags[tag_id]['namespace'] = namespaces[namespace_id]
    tags[tag_id]['subtag'] = subtags[subtag_id]
    tags[tag_id]['name'] = hydrus_tag_to_string(namespaces[namespace_id], subtags[subtag_id])
print(f"Loaded {len(tags)} unique tags")

for hash_id, info in files.items():
    info['tags'] = [tags[tag_id]['name'] for tag_id in info.get('tag_ids', [])]

# Copy the files to the output directory
for hash_id, info in files.items():
    src_path = Path(info['path'])
    dest_path = Path(info['new_path'])
    if not dest_path.exists():
        dest_path.write_bytes(src_path.read_bytes())
print(f"Copied {len(files)} files to {OUTPUT_DIR}")

with OVERVIEW_FILE.open('w') as f:
    for hash_id, info in files.items():
        f.write(f"{info['new_path']}\t{'|'.join(info['tags'])}\n")
print(f"Wrote overview file to {OVERVIEW_FILE}")

# Migrate to TMSU
run(["tmsu", "init", str(OUTPUT_DIR)], capture_output=True)
for hash_id, info in files.items():
    dest_path = Path(info['new_path'])
    if info['tags']:
        tag_string = ' '.join(f'"{tag}"' for tag in info['tags'])
        run(f'tmsu tag "{dest_path}" {tag_string}', shell=True, capture_output=True)
print(f"Tagged files in TMSU database located at {OUTPUT_DIR}")