Homework 3

Question 1

Find \s{2,}

\s{2,} - matches two or more spaces to avoid removing single spaces

Replace ,
, - replaces selected spaces with commas

Question 2

 Find (\w+),\s(\w+),\s(.+)
 
(\w+) — matches the last name (one or more letters)
,\s — matches a comma followed by a space
(\w+) — matches the first name (one or more letters)
,\s — another comma followed by a space
(.+) — matches the institution name (one or more characters, capturing everything after the second comma)

Replace \2 \1 (\3)

\2 — the first name capture from second capture group
\1 — the last name capture from first capture group
(\3) — the institution name capture from third capture group

Question 3

Find (\d{4}.+?\.mp3)

\d{4} — matches exactly four digits (the track number)
.+?\.mp3 — matches the track name and .mp3 file extension (the +? makes the match lazy, so it captures everything after the number and space up until .mp3)
() - full expression match

Replace \1\n
\1 - the full expression capture
\n - followed by a singleline break after first capture

Question 4

Find (\d{4})(.+?\.mp3)

(\d{4}) - matches exactly four digits (the track number)
(.+?\.mp3) - with the parentheses this expression is fully selected as the track name and .mp3 file extension

Replace \2_\1
\2 - move second capture first with track name
_ - insert _ in between captures
\1 - move first capture last with four digits

Question 5

Find (\w)(\w+),(\w+),[^,]+,([^,]+)

(\w)(\w+), - matches the first character of the genus 
(\w+),- matches the species
[^,]+, - matches and ignores the middle part by matching any non-comma characters in between two commas
([^,]+) - captures the last number 

Replace \1_\3,\4

\1- first capture including first letter of genus
_ - includes _ in between captures
\3 - third capture including species
, - separates species from last number
\4 - last number

Question 6

Find (\w)(\w+),(.{4})\w+,[^,]+,([^,]+)

(\w)(\w+), - matches the first character of genus
(.{4})\w+,- matches the species and selects the first four letters of the species
[^,]+, - matches and ignores the middle part by matching any non-comma characters in between two commas
([^,]+) - captures the last number

Replace \1_\3,\4

\1 - first capture including first letter of genus
_ - includes _ in between captures
\3 - third capture including species
, - separates species from last number
\4 - last number

Question 7

Find (.{3})\w+,(.{3})\w+,(\d.+),(\d+)

(.{3})\w+, - captures first three letters of the genus
(.{3})\w+, - captures first three letters of the species
(\d.+), - captures the first decimal number
(\d+) - captures the last number

Replace \1\2, \4, \3

\1\2, - connects first three letters of genus and species together
\4, - relocates last number to second position
\3 - relocates decimcl number to third position

Question 8

1) For pathogen_binary column:
Although binary indicates that it should be a boolean determination, NA could indicate an issue with data sampling. It does not have to be adjusted to still be valueable data.

If I were to convert NA to 0, I would:

Find NA -- match NA within the column

Replace 0 -- replaces NA with 0

2) For bombus_spp and host_plant columns each separately:

Find [^\w\s]

[^  ]- the caret ^ inside the square brackets negates the character class
          ---- in this case, it matches anything that is not a word character or space
\w - matches any word character 
\s: Matches any space character to keep the column formatting

Replace none. 

The special characters need to be removed, so there is nothing in the replace

3) For bee_cast column:
Male bees are considered drone bees within the bee caste system. This requires an adjustment. 

Find male -- match male within column

Replace drone -- replaces male with drone

There are empty whitespaces that need to be removed.

Find " " and Replace with nothing to fully remove the whitespaces.

Homework 3

Maddie Winer

2025-01-29

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8