Development / Summer of Code / 2026 / MusicBrainz


MusicBrainz is a community-maintained open source music encyclopaedia that collects music metadata and makes it available to the public. Try it out.

Getting Started

(see also: Getting started with GSoC)

Modernize search storage format for the MusicBrainz database

Proposed mentor: lucifer
Proposed co-mentors: bitmap, reosarevok, yvanzo
Languages/skills: Solr, Python, Java
Forum for discussion
Estimated Project Length: 350 hours
Difficulty: medium

The MusicBrainz (MB) database has a Solr search engine that serves both the website search and the search API. It stores the data in search fields. Two output formats are supported: MB XML (returned directly to API clients) and MB JSON (returned both to API clients and to the website server). The MB JSON format is automatically generated from the MB XML format. A RELAX NG schema is used to generate bindings and check the MB XML output. However, storing this MB XML format in addition to the search fields is redundant and inefficient (in disk usage and indexing time). The current implementation has not been deeply revisited since the early versions of Solr.

It would be helpful to modernize the format used to store data for search in Solr. Minimum goals:

  • Upgrade the Solr schema version from 1.5 to 1.7
  • Complete fields (in configsets and indexer) to store all the data to be returned
  • Create two response writers to return data from fields in the MB XML and MB JSON formats (with automated validation tests)

Many extra goals can be added by the candidate if wanted and if time permits. See tickets.


Extend the mail service with template API

Proposed mentors: bitmap
Languages/skills: Rust
Forum for discussion
Estimated Project Length: 175 hours
Difficulty: medium

A new mail rendering service has gradually been adopted: by MusicBrainz since 2024, by ListenBrainz since 2025, and hopefully by other MetaBrainz projects in the future. The new service is written in Rust and is based on the MJML markup language (for proper rendering in mail clients) and MessageFormat 1 (for internationalization). However, the mail templates currently have to be written in the same repository as the mail service. This makes adding a new template more difficult (since both the project and the mail service have to be updated, released, and deployed), and it also complicates sharing translation resources.

Loading templates through an API would be a great extension to the mail service: templates could be maintained in the repositories of their respective projects and loaded or updated on demand without redeploying the mail service.

Many extra goals can be added by the candidate if wanted and if time permits. There are several mail templates in different projects to be adapted or reworked. There are also some cron jobs to send daily mails that would greatly benefit from a full rewrite in Rust.


Implement a daemon that corrects out-of-sync cover art and event art metadata on archive.org

Proposed mentor: bitmap
Proposed co-mentors: reosarevok, yvanzo
Languages/skills: Python, SQL
Forum for discussion
Estimated Project Length: 175 hours
Difficulty: medium

The Cover Art Archive and Event Art Archive store both metadata about the entity in question and metadata about the available images.

Historically there have been service issues that have introduced inconsistencies in these metadata files:

  • Outdated entity metadata (incorrect titles, artists, dates, etc.)
  • Outdated image metadata (types, comments, thumbnails, etc.)
  • Images that exist on archive.org but are missing from index.json (or the MusicBrainz database)
  • Images that are listed in index.json (or the MusicBrainz database) but are missing from archive.org
  • Malformed JSON (strings being used instead of integers, encoding issues, etc.)

Many such issues have been described as part of ROpdebee's excellent auditing work in IMG-129.

Recently a new artwork-indexer service has been deployed which manages the metadata in question. The task of this project would be to extend the artwork-indexer to monitor entities in the MusicBrainz database having images, and automatically check and repair the types of issues listed above.
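As a rough illustration of the kind of check such a daemon might run, the sketch below compares the image IDs listed in an index.json document against the IDs actually found on archive.org, and flags IDs that are not plain integers. The `AuditResult` type, the function names, and the assumed shape of index.json (an `images` list whose entries carry an `id`) are hypothetical and chosen for this example only; they are not taken from the artwork-indexer code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AuditResult:
    """Hypothetical container for the problems found for one entity."""
    missing_from_index: Set[int] = field(default_factory=set)
    missing_from_archive: Set[int] = field(default_factory=set)
    malformed_ids: List[object] = field(default_factory=list)


def normalize_id(raw: object) -> Optional[int]:
    """Return an integer image ID, or None if the value cannot be repaired."""
    if isinstance(raw, bool):  # bool is a subclass of int; reject it explicitly
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def audit_images(index_json: dict, archive_ids: Set[int]) -> AuditResult:
    """Compare the IDs in index.json (assumed layout) with the archived IDs."""
    result = AuditResult()
    indexed: Set[int] = set()
    for image in index_json.get("images", []):
        raw = image.get("id")
        image_id = normalize_id(raw)
        # Record anything that was not stored as a plain integer,
        # whether or not it could be repaired.
        if image_id is None or isinstance(raw, str):
            result.malformed_ids.append(raw)
        if image_id is not None:
            indexed.add(image_id)
    result.missing_from_index = archive_ids - indexed
    result.missing_from_archive = indexed - archive_ids
    return result
```

A real implementation would also need to compare entity and image metadata (titles, types, comments, thumbnails) and queue repair tasks rather than only report differences.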

Ideally we can use the auditing results in IMG-129 to generate queued tasks that prioritize checking the entities contained in the audit.

Note that some initial work on this idea was started by bitmap.

Metadata recognition from cover art

Proposed mentors: bitmap
Languages/skills: React.js, WebAssembly
Forum for discussion
Estimated Project Length: 175 hours (or 350 hours if machine learning)
Difficulty: medium (or hard if machine learning)

MusicBrainz gathers metadata about releases and their cover art through the Cover Art Archive. Very often editors have to type the data contained in the cover art images. A drastic boost for them would be to programmatically parse these images to extract as much metadata as possible: free text, title, artist credit, label code, barcode, tracklist…

The optical character recognition engine Tesseract can be used through either Naptha’s JavaScript port, Tesseract.js, or Knight’s WebAssembly build, tesseract-wasm. In either case, the web user interface has to be written in React.js to allow a future integration into the website.

Tesseract has a lot of parameters that allow tuning it for specific usage, or focusing on some selected areas. However, the main part of the project might be to turn its output into something useful. The parsing/mapping can potentially be achieved through machine learning, but that would likely double the project length. Regardless of the method chosen, we'd like the parsing to be general enough that it can be used outside of cover artwork; for example, pasting credits from a website or digital booklet text.
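To make "turning the output into something useful" more concrete, here is a minimal rule-based sketch that extracts a tracklist from plain text, whether it comes from OCR or is pasted from a website. It is written in Python purely for illustration (the project itself targets React.js), and the `Track` type, the `parse_tracklist` function, and the line format it accepts are assumptions made for this example, not an existing MusicBrainz parser.

```python
import re
from typing import List, NamedTuple, Optional


class Track(NamedTuple):
    position: int
    title: str
    length_seconds: Optional[int]


# Assumed line shape: "<number>[.)-:] <title> [(m:ss)]", e.g. "02 - Song (3:45)".
TRACK_LINE = re.compile(
    r"""^\s*(?P<pos>\d{1,3})\s*[.)\-:]?\s+               # track number
        (?P<title>.+?)                                    # title, non-greedy
        (?:\s+\(?(?P<min>\d{1,2}):(?P<sec>[0-5]\d)\)?)?   # optional m:ss
        \s*$""",
    re.VERBOSE,
)


def parse_tracklist(text: str) -> List[Track]:
    """Return the lines of `text` that look like numbered tracks."""
    tracks: List[Track] = []
    for line in text.splitlines():
        match = TRACK_LINE.match(line)
        if match is None:
            continue  # not a track line (credits, label info, OCR noise, ...)
        length = None
        if match.group("min") is not None:
            length = int(match.group("min")) * 60 + int(match.group("sec"))
        tracks.append(
            Track(int(match.group("pos")), match.group("title").strip(), length)
        )
    return tracks
```

Real cover art is far messier than this pattern allows (multi-column layouts, side letters such as A1/B2, OCR misreads), which is where tuning Tesseract or a learned model would come in.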

Develop a GraphQL server as a MusicBrainz API alternative

Proposed mentors: bitmap
Languages/skills: Rust
Forum for discussion
Estimated Project Length: 350 hours
Difficulty: hard

MusicBrainz offers an XML/JSON API, mostly following REST design principles, but requiring a lot of ad-hoc "inc" query parameters to request supplemental data linked to an entity. Browsing entities requires a custom implementation on the server for each type of link, and many links are missing. Browse queries are also mainly implemented for links between relatable entity types; browsing by any other kind of entity (e.g., release packaging or medium format), or by several of them at once, requires using our Solr API.

Another limitation of our current API that users frequently run into is the inability to look up multiple entities of the same type in a single query.

We'd like to explore GraphQL as an alternative to our existing API, as it's widely adopted and helps solve many of the limitations above without reinventing the wheel. Your goal would be to develop a GraphQL schema and server, in Rust, for a subset of our relatable entity types. Modern server libraries for Rust include async-graphql and Juniper.

As we'd like to host this server, performance is a priority, and caching strategies should be carefully considered. We'd need a way to limit query depth and disallow links that create performance issues (or you should add new database indexes or materialized tables to improve the performance of problematic links).
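The idea of limiting query depth can be sketched as follows. This is a deliberately naive illustration in Python (the project itself would be in Rust): it measures depth by counting brace nesting in the raw query text while skipping string literals. The function names and the example query are hypothetical, and a real server would inspect the parsed query through its GraphQL library instead, since this approach also counts input-object braces in arguments and ignores fragments and comments.

```python
def query_depth(query: str) -> int:
    """Return the maximum brace nesting of a GraphQL query string.

    Naive sketch: braces inside double-quoted strings are ignored, but
    fragments, comments and input-object literals are not handled.
    """
    depth = 0
    max_depth = 0
    in_string = False
    escaped = False
    for ch in query:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth


def is_within_depth_limit(query: str, limit: int) -> bool:
    """Return True if the query nests no deeper than `limit` levels."""
    return query_depth(query) <= limit
```

For example, a hypothetical query such as `{ artist(mbid: "…") { releases { media { tracks { title } } } } }` nests five levels deep and would be rejected under a limit of four.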

It's recommended to limit your proposal to a subset of our relatable entity types and fields, since the main challenge of the project will likely be developing a robust schema and server architecture including caching and query analysis.