Clean Up the Default RSS Feed

June 15, 2023

Drupal's default RSS feed is bizarrely difficult to tame. There are a few issues with it, but the biggest is that each item's description has a link to the author's user profile appended to it in such a way that it can't be removed by any normal means. User profile pages are not public on the vast majority of Drupal sites I build, so in effect every RSS item contains a link to a 403 error page.

Some other questionable elements:

The dc:creator is just the author's username.
The default guid is “[node ID] at [site URL]”.

All of these issues can be resolved via template_preprocess_views_view_row_rss(). The latter two can be addressed using reasonably Drupalish means, by replacing the elements before they're rendered:

function MODULE_preprocess_views_view_row_rss(&$variables) {
  $site_mail = \Drupal::config('system.site')->get('mail') ?: NULL;
  $link = (isset($variables['link']) && ($variables['link'])) ? $variables['link'] : NULL;
  if (isset($variables['item_elements'])) {
    foreach ($variables['item_elements'] as $index => $data) {
      if (isset($data['key'])) {
        if (($data['key'] == 'dc:creator') && ($site_mail)) {
          $variables['item_elements'][$index]['value'] = $site_mail;
        }
        elseif (($data['key'] == 'guid') && ($link)) {
          $variables['item_elements'][$index]['value'] = $link;
          $variables['item_elements'][$index]['attributes']['isPermaLink'] = 'true';
        }
      }
    }
  }
  …

This replaces the dc:creator with the site email address and updates the guid to the item URI. It also changes the isPermaLink attribute of the guid to “true”, since that element is now a real (and presumably permanent) URI. If the actual item author's email address is required, be aware that $variables['user'] in a preprocess function is normally the current user rather than the item's author, so the address would probably have to be looked up from the entity behind the row.
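To make the shape of the data concrete, here is a minimal standalone sketch of the same replacement logic that runs outside Drupal. The function name rewrite_item_elements() and the sample values are hypothetical; the only structure assumed is the key/value/attributes layout that the preprocess code above already relies on.

```php
<?php

/**
 * Rewrites the dc:creator and guid entries of an RSS item's element list.
 *
 * Hypothetical helper for illustration only: it mirrors the loop in the
 * preprocess function but works on a plain array.
 */
function rewrite_item_elements(array $elements, ?string $site_mail, ?string $link): array {
  foreach ($elements as $index => $data) {
    if (!isset($data['key'])) {
      continue;
    }
    if ($data['key'] === 'dc:creator' && $site_mail) {
      $elements[$index]['value'] = $site_mail;
    }
    elseif ($data['key'] === 'guid' && $link) {
      $elements[$index]['value'] = $link;
      // The guid is now a real URL, so flag it as a permalink.
      $elements[$index]['attributes']['isPermaLink'] = 'true';
    }
  }
  return $elements;
}

// Sample input shaped like the defaults described above (values are made up).
$before = [
  ['key' => 'pubDate', 'value' => 'Thu, 15 Jun 2023 12:00:00 +0000'],
  ['key' => 'dc:creator', 'value' => 'admin'],
  [
    'key' => 'guid',
    'value' => '42 at https://example.com',
    'attributes' => ['isPermaLink' => 'false'],
  ],
];

$after = rewrite_item_elements($before, 'site@example.com', 'https://example.com/node/42');
print_r($after);
```

Elements with any other key pass through untouched, and passing NULL for the email address or the link leaves the corresponding element as it was.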

The solution to the link to the author's user profile page in the item description is much less sane, and I'm grateful to @fjgarlin's comment in this issue thread for pointing me in the right direction. Note that, as described in that issue, in addition to the author link there's a useless publication date also appended to the description. This is a little less egregious than a link to a 403 page, but it's still unnecessary. The solution is to parse the entire description using PHP's DOMDocument class and remove the offending elements that way. One place where my solution differs from the one linked above: in my tests, the elements that have to be targeted don't have any identifying features (i.e., no HTML ID or class). Fortunately they're the only span tags at the top level of the description, so they can be isolated that way:

function MODULE_preprocess_views_view_row_rss(&$variables) {
  …
  if (isset($variables['description']) && ($variables['description'])) {
    $description = $variables['description'];
    $dom = new \DOMDocument();
    // Re: LIBXML_NOERROR, see
    // https://stackoverflow.com/questions/9149180/domdocumentloadhtml-error.
    $dom->loadHTML($description, LIBXML_NOERROR);
    $xpath = new \DOMXPath($dom);
    $spans = $xpath->query('//html/body/span');
    $elements_to_remove = [];
    foreach ($spans as $span) {
      $elements_to_remove[] = $span->ownerDocument->saveXML($span);
    }
    $description = str_replace($elements_to_remove, '', $description);
    $variables['description'] = $description;
  }
}

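All the top-level spans are removed, discarding the superfluous author link and publication date. Note that this relies on saveXML() reproducing each span exactly as it appears in the original string; where the serialization differs, str_replace() finds no match and the span stays in place.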
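As an alternative to matching serialized strings, the spans can be removed from the DOM itself and the remaining nodes serialized back out. The following is a minimal standalone sketch, not the approach from the issue thread. It makes a few assumptions: strip_top_level_spans() is a hypothetical name, the sample markup is made up, and the wrapper document with a UTF-8 meta tag is there so the parser does not fall back to its default encoding.

```php
<?php

/**
 * Removes <span> elements that are direct children of an HTML fragment's root.
 *
 * Hypothetical helper for illustration only.
 */
function strip_top_level_spans(string $html): string {
  // Wrap the fragment in a full document with a charset hint so that it is
  // parsed as UTF-8.
  $document = '<!DOCTYPE html><html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>';

  $dom = new DOMDocument();
  // Collect parser complaints internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $loaded = $dom->loadHTML($document);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $loaded ? $dom->getElementsByTagName('body')->item(0) : NULL;
  if ($body === NULL) {
    // Parsing failed, so leave the fragment untouched.
    return $html;
  }

  // Copy the live node list to an array first: removing nodes while iterating
  // the list itself can skip entries.
  foreach (iterator_to_array($body->childNodes) as $child) {
    if ($child instanceof DOMElement && $child->tagName === 'span') {
      $body->removeChild($child);
    }
  }

  // Serialize only what is left inside <body>.
  $output = '';
  foreach ($body->childNodes as $child) {
    $output .= $dom->saveHTML($child);
  }
  return $output;
}

// Made-up description: two top-level spans followed by the teaser.
$description = '<span><a href="/user/1">admin</a></span>'
  . '<span>Thu, 06/15/2023 - 12:00</span>'
  . '<p>Teaser text with an <span>inline span</span> that must survive.</p>';

$cleaned = strip_top_level_spans($description);
echo $cleaned, PHP_EOL;
```

Only spans that are direct children of the body are dropped, so spans nested inside the teaser markup are kept. Inside Drupal, \Drupal\Component\Utility\Html::load() and Html::serialize() provide a similar wrap-and-serialize round trip, so the wrapper string and the final loop could likely be swapped for those.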

Finally, pending resolution of this issue, there's no “Read more” link in the description. Here's the full template_preprocess_views_view_row_rss() with all of the code above, plus the addition of a “Read more” link at the end:

use Drupal\Core\Link;
use Drupal\Core\Url;

/**
 * Implements template_preprocess_views_view_row_rss().
 *
 * The dc:creator is replaced with the site email address (by default it seems
 * to only use the name part of the author's email address, i.e., the part
 * before the "@"), and the guid is updated to be an actual link to the item.
 *
 * The description has the author and date embedded in a way that they can't be
 * removed by any rational mechanism, so that's done using PHP's DOMDocument
 * class.
 *
 * Finally, a "Read more" link is appended to the description.
 */
function MODULE_preprocess_views_view_row_rss(&$variables) {
  $site_mail = \Drupal::config('system.site')->get('mail') ?: NULL;
  $link = (isset($variables['link']) && ($variables['link'])) ? $variables['link'] : NULL;
  if (isset($variables['item_elements'])) {
    foreach ($variables['item_elements'] as $index => $data) {
      if (isset($data['key'])) {
        if (($data['key'] == 'dc:creator') && ($site_mail)) {
          $variables['item_elements'][$index]['value'] = $site_mail;
        }
        elseif (($data['key'] == 'guid') && ($link)) {
          $variables['item_elements'][$index]['value'] = $link;
          $variables['item_elements'][$index]['attributes']['isPermaLink'] = 'true';
        }
      }
    }
  }
  if (isset($variables['description']) && ($variables['description'])) {
    $description = $variables['description'];
    $dom = new \DOMDocument();
    // Re: LIBXML_NOERROR, see
    // https://stackoverflow.com/questions/9149180/domdocumentloadhtml-error.
    $dom->loadHTML($description, LIBXML_NOERROR);
    $xpath = new \DOMXPath($dom);
    $spans = $xpath->query('//html/body/span');
    $elements_to_remove = [];
    foreach ($spans as $span) {
      $elements_to_remove[] = $span->ownerDocument->saveXML($span);
    }
    $description = str_replace($elements_to_remove, '', $description);
    $variables['description'] = $description;
    if ($link) {
      $url = Url::fromUri($link);
      $read_more = Link::fromTextAndUrl(t('Read more'), $url)->toRenderable();
      $variables['description'] .= \Drupal::service('renderer')->render($read_more);
    }
  }
}

A meta note: for this site I'd prefer to include the full text in the feed rather than just teasers, but I'm finding that some RSS readers don't handle the code very well, even when the RSS is valid, so including full posts breaks the feed in some contexts.

And some editorializing: RSS is a wonderful thing, and should be used much, much more than it is. Turning the web over to social media companies (“Web 2.0”) was a mistake; turning it over to crypto companies (“Web 3.0”) would have been an even bigger mistake if it hadn't been such a hilarious failure. RSS is good, the open web is good, and we should be all-in on Web 1.1.