• 2 Posts
  • 220 Comments
Joined 8 months ago
Cake day: June 6th, 2025


  • This issue is largely manifesting through AI scraping right now, and many scrapers intentionally ignore robots.txt; at the moment, LLM scrapers are basically just bad actors on the internet. US courts have also ruled in favor of a number of AI companies when sued, so it’s unlikely anything will change. Effectively, if you don’t like the status quo, stuff like this is one of your few options.

    And that’s without even asking whether we actually want these companies to improve their models before the problems of energy consumption and the potential displacement of human workers are resolved.





    1. Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it. And such files are often poorly seeded.
    • We primarily used Spotify’s “popularity” metric to prioritize tracks. View the top 10,000 most popular songs in this HTML file (13.8MB gzipped).
    • For popularity>0, we got close to all tracks on the platform. The quality is the original OGG Vorbis at 160kbit/s. Metadata was added without reencoding the audio (and an archive of diff files is available to reconstruct the original files from Spotify, as well as a metadata file with original hashes and checksums).
    • For popularity=0, we got files representing about half the number of listens (either original or a copy with the same ISRC). The audio is reencoded to OGG Opus at 75kbit/s — sounding the same to most people, but noticeable to an expert.
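
    For illustration, the triage described in the bullets above could be sketched roughly like this (the data layout and field names are hypothetical, not from the actual project):

    ```python
    # Rough sketch of the triage policy described above (hypothetical fields).
    # Tracks with popularity > 0 keep the original 160 kbit/s OGG Vorbis stream;
    # tracks with popularity == 0 are queued for a 75 kbit/s Opus reencode.

    def plan_archive(tracks):
        """Return (keep_original, reencode_opus) lists, most popular first."""
        ranked = sorted(tracks, key=lambda t: t["popularity"], reverse=True)
        keep_original = [t for t in ranked if t["popularity"] > 0]
        reencode_opus = [t for t in ranked if t["popularity"] == 0]
        return keep_original, reencode_opus

    # Example usage with made-up track IDs and popularity scores:
    tracks = [
        {"id": "a", "popularity": 87},
        {"id": "b", "popularity": 0},
        {"id": "c", "popularity": 12},
    ]
    keep, reencode = plan_archive(tracks)
    ```

    Note that because the ranking is descending by popularity, the least popular music is archived last, which is exactly the ordering the reply below questions.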

    Perhaps I’m reading this wrong, but is this not a little backwards? Since unpopular music is poorly preserved, shouldn’t the focus be on getting the least popular music first?