Regex: Grouping expressions of a kind

I was recently presented with a task: I had a CSV database dump that I had to turn back into MySQL queries of a different format. My toolset was everything that Sublime Text 3 had to offer (which is a lot of great tools) – though eventually it was narrowed down to a plugin and ST3’s macro functionality – and the awesome power of regular expressions. As I was working on this task, however, I ran into a problem which took me a few hours to resolve – but its solution is interesting, and in my opinion deserving of an article.

The task at hand was thus: I was presented with a CSV dump of a database of questions and answers to tests in an online testing platform. It looked like this:

As you can see, I have 4 columns in my CSV dump – the test name, the question text, the answer text and whether the answer is correct. At the very beginning I decided to work with this dump one question set at a time – not with all of the tests at once, which means that the first column is only there for the sake of my own understanding, and only the last three truly matter. So, I set out to copy a set of questions into a separate file and start working on it. So, to set up the algorithm, I copied the set of questions that contained the one presented above – and 9 more like it – and got to work. All the regex transformations that I will outline were used with the RegReplace plugin for Sublime Text 3, as opposed to its native regex functionality, so they look slightly different – double backslashes are used because the regexes are entered into a JSON string.

The first task at hand is cleanup, which entailed getting rid of the HTML tags and fixing any fallout from that action. This resulted in a few fairly straight forward regex replacements. The first replacement was to get rid of the tags, and meant finding <.*?> (with the dotall argument set to true for multiline tags) and replacing it with an empty string. Fixing broken newlines from the bad formatting meant finding ([^\"])\n (everything that is not a quote followed by a newline) and replacing it with \\1 (the original character followed by a space).

Performing those actions resulted in a clean and readable set of data:

The next task was to separate questions and answers – to have only one instance of the question followed by all the possible answers, and that is where I was met with the problem that took me the most work. In my approach, I decided to separate the different question-answer sets with a newline, but to achieve that I had to first find those sets. I was stumped, I had no clue how to find multiple appearances of the exact same phrase for any phrase.

After some googling and experimentation, I arrived at a surprisingly straight forward answer: it is possible to use captures in your search expression as well as the replacement one. This meant that all I had to do was to find a phrase and then find as many other occurrences as it had using \\1 in the search expression. In the end, the resulting replacement was thus: (\"[\\w\\s\/]+\",\".*?\"),(.*?)\n((\\1.+?\n)+) into \\1,\\2\n\\3\n. The first capture finds the common element in all sets, which is the test name and the question text. The second capture finds the first answer, and the third capture (outer brackets) finds as many occurrences of the common element followed by other answers as possible (inner brackets). The replacement leaves them unchanged, except it also adds a newline after the entire set to separate it from the next one.

Performing that replacement gave a very convenient format to work with:

Now I put the questions and answers on separate lines by finding \"[\\w\\s/]+\",\"(.*?)\",\"(.*?)\",\"(\\d+)\" (with dotall set to true) and replacing it with \\1\n\t\\2\t\\3, essentially getting rid of the test name, all the quotes and putting a tab in front of the answer, as well as a tab between the answer and its correctness flag. This action transformed the data into the following format:

It was getting better and better every step of the way. Now, I had to get rid of the repeated instances of the question text and leave only one of them followed by the tabbed answers. This was achieved by finding ([^\n])\n([^\t\n]+) (a non-newline character followed by a body of characters) and replacing it with \\1. The non-newline character in front was needed to not get rid of the first instance of the question text as well. This gave me the following result:

Finally I had the format I was looking for, which meant I could begin the final phase: reformatting text into MySQL queries. The first replacement entailed finding \n\n(.+) (the question body) and replacing it with \n\nINSERT INTO webtutor_questions (testid, question) VALUES (%TESTID%, \"\\1\");\nSELECT @lastid:=LAST_INSERT_ID();. The %TESTID% placeholder is meant to give me a guideline for manual replacements later on. This replacement has a drawback: since it defines the question body as text preceded by two newlines, it cannot find the first question, as it is located at the beginning of the file and is preceded with nothing. This meant that I had to insert double newlines in the beginning of the file using the Sublime Text macro functionality – namely, its move_to and insert functions.

The second step was to find \t(.+)\t(\\d+) (the answer text and its correctness flag) and replace it with INSERT INTO webtutor_answers (questionid, answer, correct) VALUES (@lastid, \"\\1\", \\2);.

These two transformations resulted in the following code:

All that was left now is to make another INSERT query to insert a new test and to replace the %TESTID% placeholders with the ID of the test that was created with that query – this was also done by adding actions to the ST macro.

In the end, I had a valid MySQL query that I could import directly. Making all this automated meant that I only had to copy and paste question sets into a file and run my macro on it. This sped up the workflow considerably from performing the regex replacements manually and especially manually reformatting the input data. I am certain there are better and more efficient ways to achieve my goal, and I can think of a few possibilities even now. However, the most important lesson that I learnt from this is that it is possible to use captures in the search string as well as the replace string. I expect to get some interesting uses out of that functionality!

Published by



Developer, linguistics enthusiast, amateur teacher. All opinions are my own.

Leave a Reply

Your email address will not be published. Required fields are marked *