Question 1
Find \s{2,}
\s{2,} - matches two or more spaces to avoid removing single spaces
Replace ,
, - replaces selected spaces with commas
Question 2
Find (\w+),\s(\w+),\s(.+)
(\w+) — matches the last name (one or more letters)
,\s — matches a comma followed by a space
(\w+) — matches the first name (one or more letters)
,\s — another comma followed by a space
(.+) — matches the institution name (one or more characters, capturing everything after the second comma)
Replace \2 \1 (\3)
\2 — the first name capture from second capture group
\1 — the last name capture from first capture group
(\3) — the institution name capture from third capture group
Question 3
Find (\d{4}.+?\.mp3)
\d{4} — matches exactly four digits (the track number)
.+?\.mp3 — matches the track name and .mp3 file extension (the +? makes the match lazy, so it captures everything after the number and space up until .mp3)
() - full expression match
Replace \1\n
\1 - the full expression capture
\n - followed by a singleline break after first capture
Question 4
Find (\d{4})(.+?\.mp3)
(\d{4}) - matches exactly four digits (the track number)
(.+?\.mp3) - with the parentheses this expression is fully selected as the track name and .mp3 file extension
Replace \2_\1
\2 - move second capture first with track name
_ - insert _ in between captures
\1 - move first capture last with four digits
Question 5
Find (\w)(\w+),(\w+),[^,]+,([^,]+)
(\w)(\w+), - matches the first character of the genus
(\w+),- matches the species
[^,]+, - matches and ignores the middle part by matching any non-comma characters in between two commas
([^,]+) - captures the last number
Replace \1_\3,\4
\1- first capture including first letter of genus
_ - includes _ in between captures
\3 - third capture including species
, - separates species from last number
\4 - last number
Question 6
Find (\w)(\w+),(.{4})\w+,[^,]+,([^,]+)
(\w)(\w+), - matches the first character of genus
(.{4})\w+,- matches the species and selects the first four letters of the species
[^,]+, - matches and ignores the middle part by matching any non-comma characters in between two commas
([^,]+) - captures the last number
Replace \1_\3,\4
\1 - first capture including first letter of genus
_ - includes _ in between captures
\3 - third capture including species
, - separates species from last number
\4 - last number
Question 7
Find (.{3})\w+,(.{3})\w+,(\d.+),(\d+)
(.{3})\w+, - captures first three letters of the genus
(.{3})\w+, - captures first three letters of the species
(\d.+), - captures the first decimal number
(\d+) - captures the last number
Replace \1\2, \4, \3
\1\2, - connects first three letters of genus and species together
\4, - relocates last number to second position
\3 - relocates decimcl number to third position
Question 8
1) For pathogen_binary column:
Although binary indicates that it should be a boolean determination, NA could indicate an issue with data sampling. It does not have to be adjusted to still be valueable data.
If I were to convert NA to 0, I would:
Find NA -- match NA within the column
Replace 0 -- replaces NA with 0
2) For bombus_spp and host_plant columns each separately:
Find [^\w\s]
[^ ]- the caret ^ inside the square brackets negates the character class
---- in this case, it matches anything that is not a word character or space
\w - matches any word character
\s: Matches any space character to keep the column formatting
Replace none.
The special characters need to be removed, so there is nothing in the replace
3) For bee_cast column:
Male bees are considered drone bees within the bee caste system. This requires an adjustment.
Find male -- match male within column
Replace drone -- replaces male with drone
There are empty whitespaces that need to be removed.
Find " " and Replace with nothing to fully remove the whitespaces.