Strip code blocks from markdown

Updated by Tom Wells on 21st November 2021

Stripping anything from markdown may seem counter-intuitive, but I recently had a client who used markdown for all their posts and wanted to add their site to Google Publisher, which requires an RSS feed (or AMP) to retrieve posts. Their site includes some programming examples and, therefore, code blocks. These had to be stripped out of the markdown because Google News doesn't have any form of syntax highlighting.

A strange requirement, I know, but that's the nature of this job, right!?

Setup

Now, let's look at how to remove the code blocks from your markdown. In your JS file:

index.js

import * as fs from 'fs'

const markdownFile = fs.readFileSync('<your-markdown-file>.md', 'utf-8')
const regexForReplacing = /(```.+?```)/gms
const markdownWithoutCodeBlocks = markdownFile.replace(regexForReplacing, '')

From here, you can do whatever you like with the markdownWithoutCodeBlocks variable. It contains the raw markdown without the code blocks (expect some extra empty lines where each block used to be), ready to be parsed as HTML. Because the replacement runs on the raw text before any parsing happens, it works with any markdown parser, so feel free to use your parser of choice.
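To see what the replacement leaves behind, here is a minimal sketch that runs the same regular expression against an in-memory string instead of a file. The sample post and the stripCodeBlocks helper name are made up for illustration, and the fence is built from single backticks purely so the sample can sit inside a code block itself.

```javascript
// Build the fence from single backticks so this sample can live inside a code block.
const fence = '`'.repeat(3)

// Hypothetical sample post containing one fenced code block.
const sampleMarkdown = [
  '# Title',
  '',
  'Intro paragraph.',
  '',
  fence + 'js',
  'console.log("hi")',
  fence,
  '',
  'Closing paragraph.'
].join('\n')

// Hypothetical helper wrapping the same pattern used in index.js.
function stripCodeBlocks(markdown) {
  return markdown.replace(/(```.+?```)/gms, '')
}

const stripped = stripCodeBlocks(sampleMarkdown)

// JSON.stringify makes the leftover newlines visible.
console.log(JSON.stringify(stripped))
// "# Title\n\nIntro paragraph.\n\n\n\nClosing paragraph."
```

The blank lines that surrounded the code block are still there, which is why a run of empty lines appears where the block used to be.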

How does the regular expression work?

The regular expression can be broken down into two sections. The first section, inside the parentheses (known as a capturing group), matches three backticks literally, followed by .+? and then three more backticks. The .+? part translates to "match any character" (the .) "between one and unlimited times, as few times as possible, expanding as needed" (the +?). In short, the group matches an opening and closing set of three backticks and everything between them.

The three characters after the closing slash (gms) are the "global pattern flags". These modifiers change the way the regular expression works, and in this case:

g = "global". Finds all matches and does not stop after the first match found.
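m = "multiline". The anchors ^ and $ match at the beginning/end of each line respectively instead of the beginning/end of the entire string. This pattern contains no anchors, so the flag has no effect here and can safely be dropped.
s = "single line" (called dotAll in JavaScript). Enables the dot (.) metacharacter to additionally match new lines. Without it, a code block spanning more than one line would not match.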

So, we're finding every occurrence of the pattern set out above and capturing the full contents of each code block for replacement. Quite a powerful single line of code!
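The effect of the lazy quantifier and the s flag is easiest to see by comparing variants of the pattern on the same input. The following is a small sketch with a made-up string containing two code blocks; the variable names are illustrative only.

```javascript
// Build the fence from single backticks so this sample can live inside a code block.
const ticks = '`'.repeat(3)

// Hypothetical input: two code blocks with a line of prose between them.
const twoBlocks = [
  'before',
  ticks,
  'a',
  ticks,
  'middle',
  ticks,
  'b',
  ticks,
  'after'
].join('\n')

// Lazy quantifier plus the s flag: each code block is matched on its own.
const lazy = twoBlocks.match(/```.+?```/gs)
console.log(lazy.length) // 2

// Greedy quantifier: a single match runs from the first fence to the last,
// swallowing the prose in between.
const greedy = twoBlocks.match(/```.+```/gs)
console.log(greedy.length, greedy[0].includes('middle')) // 1 true

// No s flag: the dot cannot cross a line break, so nothing matches.
const noDotAll = twoBlocks.match(/```.+?```/g)
console.log(noDotAll) // null
```

The greedy variant is the one to watch out for: it would delete the text between the first and last code block in a post.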

Usage with Nuxt Content

The method described above is in use on this very blog, powered by Nuxt Content. To make this work, I utilised the popular marked package (yarn add marked) to generate the RSS feed alongside the @nuxtjs/feed module. I won't lie, getting this to work was a real pain in the a**, but after some trial and error it's now working like a charm.

For this to work with Nuxt Content, I utilised their content:file:beforeInsert hook:

nuxt.config.js

export default {
  hooks: {
    'content:file:beforeInsert': (document) => {
      if (document.extension === '.md') {
        // Strip code blocks from markdown for feed
        const regex = /(```.+?```)/gms
        document.plainText = document.text.replace(regex, '')
      }
    }
  }
}

The hook is utilised to create a new plainText property which can be used when generating the RSS feed:

nuxt.config.js

// With marked v4 or later there is no default export; use: import { marked } from 'marked'
import marked from 'marked'

export default {
  ...
  feed: [
    {
      path: '/feed.xml',
      async create(feed) {
        feed.options = {
          ...
        }

        const { $content } = require('@nuxt/content')
        const posts = await $content()
          .only([
            'slug',
            'title',
            'description',
            'plainText'
          ])
          .sortBy('createdAt', 'desc')
          .limit(32)
          .fetch()

        posts.forEach((post) => {
          const content = marked(post.plainText)

          feed.addItem({
            title: post.title,
            id: `https://example.com/${post.slug}`,
            link: `https://example.com/${post.slug}`,
            description: post.description,
            content,
          })
        })
      },
      cacheTime: 1000 * 60 * 15,
      type: 'rss2'
    }
  ],
  ...
}

The above snippet shows how we use the new plainText property to generate new HTML for the feed without the code blocks. Nifty!

For more information on the @nuxtjs/feed module, check out their docs.

Final words

One thing worth mentioning: because the example code above replaces each code block with an empty string, the blank lines that surrounded the block are left behind, so you may need to remove the extra empty lines as well. Whether this matters depends on whether your parser adds an empty <p> tag for them (in general, it won't).
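If the leftover empty lines do cause trouble, one possible approach is to collapse them after stripping the code blocks. This is a minimal sketch, assuming the markdown uses \n line endings; the collapseBlankLines name is hypothetical.

```javascript
// Hypothetical helper: squash any run of three or more newlines
// down to two, leaving a single empty line between paragraphs.
function collapseBlankLines(markdown) {
  return markdown.replace(/\n{3,}/g, '\n\n')
}

const tidy = collapseBlankLines('Intro paragraph.\n\n\n\nClosing paragraph.')
console.log(JSON.stringify(tidy))
// "Intro paragraph.\n\nClosing paragraph."
```

Files saved with Windows line endings (\r\n) would need the pattern adjusted accordingly.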

Otherwise, I hope you found this useful and if you have any questions, don't hesitate to contact me on Twitter.