Tidy data - a revisit
It would be useful to add the Distribution variable from NESP_SharkSpeciesList to QldShark_2017_Clean_v2, but the data in that column is very messy with multiple values inside the cells. Do a Text facet on the Distribution column and explore.
To create a tidy dataset, where:
- Each variable forms a column
- Each observation forms a row
- Each cell has one value
- Each type of observational unit forms a table,
multi-value cells need to be split by the value.
This task is helpful where there are multiple values in a cell that are not organised consistently, such as when survey respondents can select multiple, controlled values to answer a question, or a notes field has free text. The Distribution column is an example of this. Let’s add the column to QldShark_2017_Clean_v2 then tidy it.
Add a new column
- Go to
QldShark_2017_Clean_v2 - Select
Common namecolumnEdit Column > Add column based on this column - name the new Column
Distribution -
enter this GREL expression:
cell.cross("NESP_SharkSpeciesList","Common name")[0].cells["Distribution"].value
382 rows of the shark captures now have a new variable & value of Distribution added.
Now we need to tidy the new variable in QldShark_2017_Clean_v2, then split the column into two columns with a value in each.
Split a multi-value column
- Go to
QldShark_2017_Clean_v2 - Select column
Distribution > Facet > Text Facet - There are 6 choices. Two of the choices contain two values within the cells e.g. Australasia & Indonesia. The multiple values do not contain a common separator. This will need fixing to be able to split on.
Let’s change the blank choice to “NULL” as these records do not have any data.
- In
DistributionFacet box hover over(blank) - Select
editand typeNULL>Apply - At
Distribution column - Select
Edit cells > Replace > Find: & Replace with: ; - Select
Edit column > Split in several columns > by separator *;* > Split into *2* columns ok.
An additional Distribution column will be created with values only appearing where there had previously been two values in the original cell.
Watch these steps in this video.
Now let’s explore the column Notes on skies and wind. It is messy and contains multiple values within the cells. Ultimately we will create multiple columns using GREL and a language known as Regular Expression or Regex, which searches for patterns in strings.
Add new columns using GREL & Regex
- Go to column
Notes on skies and wind - Select
Facet > Text facetand explore the 36 choices
Notice the variations in common separators, order of values and case format. We cannot split the variable into multiple columns using the method.
- Hover over the
(blank)choice selecteditand change(blank)toNULL - Select
Edit cells > Common transforms > to lower caseto make the case consistent throughout the cell (and reduce the choices to 31) - Select
Edit column> add column based on this column - Type new column name
Overcast -
Click inside expression box, enter GREL expression:
if(value.contains("overcast"),"overcast",value).replace(/.*[^overcast].*/,"")This means… if (the value in the cell contains “overcast”, replace it with “overcast” value), then replace (anything that is not “overcast” that is found one or more times in the cell, with “” ie. a blank).
- Preview and ok
- Repeat steps above for the other Sky related value
fine(can reuse expression fromhistorytab)
Now let’s join up the two columns fine and overcast which are really values, into a new variable Skies
- Go to
finecolumnEdit column > Join columns... - Select
fineandovercastfrom tick box list - Select
Write result in new column named... - Type
Skiesandok.
We now have a new variable Skies. Perform and text facet to see the value choices.
- Hover over
(blank)choice selecteditand typeNULLandApply. Now each cell has data.
Next we can extract all the “wind” data into a new variable using the steps above.
- Select
Edit column> add column based on this column - Type new column name
strong breeze -
Click inside expression box, enter GREL expression:
if(value.contains("strong breeze"),"strong breeze",value).replace(/.*[^strong breeze].*/,"") - Preview and ok
- Repeat steps above for the other “wind” related values that can be found in the
Text facet windownear galeandlight winds(can reuse expression fromhistorytab).
Now let’s join up the 3 columns strong breeze, near gale, light winds which are really values, into a new variable Wind
- Go to
near galecolumnEdit column > Join columns... - Select
strong gale,calm winds,strong breeze,near gale,light winds,gale force,light breezefrom tick box list - Select
Write result in new column named... - Type
Windandok.
We now have a new variable Wind. Perform a text facet to see the value choices.
- Hover over
(blank)choice selecteditand typeNULLandApply. Now each cell has data. - Go to
AllcolumnEdit columns > Re-order / remove columns... - Drag and drop the following columns to remove:
fine,overcast,Notes on skies and wind,near gale,strong breeze,light winds ok.
This great GREL cheat sheet from code4lib Toronto has more details on building expressions using Regex.
See how this works in the video below.
The final step is to export specific variables from this tidy dataset to a .csv file which can be parsed (understood by) the geo.json tool used in the next lesson.
Export selected columns to .csv file
- Go to
Allcolumn>Edit Columns> Reorder Remove columns - Drag and drop the columns not needed including
Species Code,Date,Area,Location, ` Genus,Species,Distribution,Distribution1, to theremove` box. Common name,Latitude,Longitude,Length (m),Rainfall_mm,Formatted Date,Water Temp (C),Skies,Windwill remainOk- Click
Exportbutton, top right hand corner - Select Comma-separated value dataset
- save file