The team I am a part of was recently asked if there was any work that we could possibly automate to save time and effort. Leaving aside the strange question of asking the boiling frog how the boiling process might be sped up, I was left thinking about the limits of automation when what is trying to be automated is not always straight-forward and often involves contextual knowledge. My suggestion in response was not to automate the task, but possibly automate the process.
Ever since I have stepped into my new role in the technical team, I have made a conscious effort to capture everything that I have learnt and store it in a canonical space as one of the issues I had early on was that it was too difficult to find anything. This then allows me to produce repeatable processes. (In some respects I imagine that organising information in this manner actually makes it easier for an AI assistant to support?)
In addition to automating the day-to-day, I recently built out a spreadsheet to help with a particular task of cleaning up duplicate records across the database. Given a list of hundreds of records to review, I felt that saving any time and effort was going to make a difference. My thought was to create a series of declare statements to use to query the database, then copy the raw data, review it, create tasks for the relevant teams and produce a summary of the records changed and those that remained the same.
Doubling Down on the Declare
One of the issues I initially found when producing my query was updating all the variables each time. It therefore occurred to me that rather than changing the names and various IDs each time, which was always fiddly and prone to error, I could create a series of declare statements that I updated with the SQL statement each time. This allowed me to leave the rest of the query the same:
="
DECLARE @Sur1 VARCHAR(30) = '"&D2&"'
DECLARE @Sur2 VARCHAR(30) = 'SUR2'
DECLARE @Pre1 VARCHAR(30) = '"&E2&"'
DECLARE @Pre2 VARCHAR(30) = 'PRE2'
DECLARE @Login1 VARCHAR(30) = '"&H2&"'
DECLARE @Login2 VARCHAR(30) = 'CECVIDS\UPN2'
DECLARE @Login3 VARCHAR(30) = 'CECVIDS\UPN3'
DECLARE @CNUM1 VARCHAR(15) = 'CX'
DECLARE @CNUM2 VARCHAR(15) = 'CX'
DECLARE @CNUM3 VARCHAR(15) = 'CX'"
In addition to this, I stripped out any additional information that was included within a text string of possible information and had this represented on unique lines after the initial list:
--"&IFNA(JOIN(CHAR(10)&"--",ARRAYFORMULA(LEFT(QUERY(TRIM(MID(SUBSTITUTE(J2, ":", REPT(" ", LEN(J2))), LEN(J2)*ROW(INDIRECT("1:"&LEN(J2)))-LEN(J2)+1, LEN(J2))),"SELECT * WHERE Col1 LIKE 'C%'"),16))))&""
The output then looks something like this:
--"
DECLARE @Sur1 VARCHAR(30) = 'Davis'
DECLARE @Sur2 VARCHAR(30) = 'SUR2'
DECLARE @Pre1 VARCHAR(30) = 'Aaron'
DECLARE @Pre2 VARCHAR(30) = 'PRE2'
DECLARE @Login1 VARCHAR(30) = 'CECVIDS\01234567'
DECLARE @Login2 VARCHAR(30) = 'CECVIDS\UPN2'
DECLARE @Login3 VARCHAR(30) = 'CECVIDS\UPN3'
DECLARE @CNUM1 VARCHAR(15) = 'C0000001'
DECLARE @CNUM2 VARCHAR(15) = 'CX'
DECLARE @CNUM3 VARCHAR(15) = 'CX'
--CECVIDS\30000001
--C1000001"
I commented the additional information out so that I could use it if required or simply run the updated script it was. It also meant that I did not have to delete the quotations at the start and end that come across when copying from Google Sheets every time. I did tinker with how I could get the additional logins and numbers to prepopulate within the declare statement, but hit a wall as the number of additional values was not consistent.
Streamlining the Analysis and Fix
Once I had all the records in the system, I then had to organise them into those to be retained and those to be updated. The first step was to create a conditional formulas using the following custom formula that quickly highlighted possible duplicate records:
=COUNTIFS(L:L, L1, L:L, "<>BirthDate")
This allowed me to then reorder the list and review the rest of the details. Once I had done this, I then created a summary of the data.
Often I create a config tab where I keep tab where I create various formulas that I can refer to. This is particular important when using variable NAMED RANGES. As I wanted to make this spreadsheet somewhat self-contained, I decided to just use the first row of the sheet, which I then hid. This included a summary of the different community names in a single cell:
=LEFT(JOIN("/",UNIQUE(QUERY(F2:F,"SELECT F WHERE NOT F = 'NameExternal'"))),LEN(JOIN("/",UNIQUE(QUERY(F2:F,"SELECT F WHERE NOT F = 'NameExternal'"))))-1)
A summary of staff numbers in a single cell:
=IFNA(JOIN(",",UNIQUE(QUERY(G2:G,"SELECT G WHERE G LIKE 'C%' ORDER BY G ASC"))))
And a summary of login records in a cell
IFNA(JOIN(", ",UNIQUE(QUERY(H2:H,"SELECT H WHERE H LIKE 'CEC%'"))))
Each was given a Named Range that could be easily referred to in other formulas.
Initially, I created a formula which combined these different attributes, but I realised I then had to explain what sort of duplicate it was. Overall, there are four types of duplicate records:
- User with multiple staff numbers and login records
- User with multiple login records
- User with multiple staff numbers
- User with multiple community records
To accommodate these differences, I created an formula using the named ranges to combine the IF and AND formulas:
=IF( AND( IFNA(COUNTUNIQUE(QUERY(G2:G,"SELECT G WHERE G LIKE 'C%' ORDER BY G ASC")))>1, IFNA(COUNTUNIQUE(QUERY(H2:H,"SELECT H WHERE H LIKE 'CEC%' ORDER BY H")))>1), F1&" has multiple staff numbers: "&G1&" and multiple login records: "&SUBSTITUTE(H1,"IDS\",""),
IF(AND(IFNA(COUNTUNIQUE(QUERY(G2:G,"SELECT G WHERE G LIKE 'C%' ORDER BY G ASC")))=1, IFNA(COUNTUNIQUE(QUERY(H2:H,"SELECT H WHERE H LIKE 'CEC%' ORDER BY H")))>1), F1&" ( "&G1&" ) has multiple login records: "&SUBSTITUTE(H1,"IDS\",""),
IF(AND(IFNA(COUNTUNIQUE(QUERY(G2:G,"SELECT G WHERE G LIKE 'C%' ORDER BY G ASC")))=1, IFNA(COUNTUNIQUE(QUERY(H2:H,"SELECT H WHERE H LIKE 'CEC%' ORDER BY H")))=1), F1&" ( "&G1&" / logins: "&SUBSTITUTE(H1,"IDS\","")&") has duplicate community records",
IF(AND(IFNA(COUNTUNIQUE(QUERY(G2:G,"SELECT G WHERE G LIKE 'C%' ORDER BY G ASC")))>1, IFNA(COUNTUNIQUE(QUERY(H2:H,"SELECT H WHERE H LIKE 'CECV%' ORDER BY H")))=1), F1&" logins: "&SUBSTITUTE(H1,"IDS\"," ")&" has multiple staff numbers: "&G1&"",FALSE))))
Using the IF and AND combination allowed me to quickly and easily produce a summary of the issue in one sentence to raise in our incident management system.
In addition to this, I created a SWITCH formula to provide the server listener name, as opposed to the IP address provided in the data export:
=IFNA(SWITCH(B2:B,
"IP Address 1","Listener Name 1",
"IP Address 2","Listener Name 2",
"IP Address 3","Listener Name 3",
"IP Address 4","Listener Name 4",
"IP Address 5","Listener Name 5"))
Summarising the Changes
In addition to the summary of issue, I also created a summary of changes that I could easily copy and paste into the incident notes once I had cleaned up the various duplicates:
={"","","","","","";
"Record(s) updated / merged:","","","","","";
"","","","","","";
SORT(Indirect(Table1));
"","","","","","";
"Record(s) retained:","","","","","";
"","","","","","";
SORT(Indirect(Table2))}
As the number of records varied, I created a NAMED RANGE for records updated / merged and records retained and used the INDIRECT formula to refer to them.
The formula used for Table1 was:
="C2:H"&MATCH("Database Name", Indirect(CThreeC), 0) + ROW(INDIRECT(CThree)) - 2
While the formula used for Table2 was:
=ADDRESS(MATCH("Database Name", INDIRECT(CThreeC), 0) + ROW(INDIRECT(CThree))-1, COLUMN(INDIRECT(CThree)))&":H"
As I was continually moving and deleting rows around in the spreadsheet, I found that it kept on breaking these named ranges, so that is why I created two further NAMED RANGES.
This was formula used for CThree:
="$C$3"
And this was formula used for CThreeC:
="$C$3:$C"
Again, all this ‘magic‘ was hidden in the top row of the spreadsheet.
An example of the spreadsheet with the various formulas can be found here.
There has been a lot spoken at my workplace lately about what it means to be proud. Personally, fixing 250+ duplicate records is not necessarily something I am ‘proud’ of or excited by, but I am proud of my ability to improve processes and identify ways in which my actions can make it easier for others to do their job. The challenge I often have is being able to hand over such solutions to others. Most prefer their own way of working, even if that means complaining that they do not have time for this or that. I feel that the issue is finding the right balance between improving a process and creating more work.
If you enjoy what you read here, feel free to sign up for my monthly newsletter to catch up on all things learning, edtech and storytelling.
Automating Processes, Not Work by Aaron Davis is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.