How to find duplicates in a Google Sheet.
February 23, 2023 1:26 AM   Subscribe

I have a Google Sheet with two columns of data and need to check for duplicate words. But it's slightly complicated.

Column A has a list of about 150 words, one per cell. Column B has an example sentence for each of those words, one sentence per cell.

What I want to check is that each of the words is only used once, in its example sentence, but it can't be used in any other sentence. So I need to check for duplicates, but I need to check each of the words in Column A with each word in each sentence in Column B. In other words, each word in Column A should only be used once in Column B.

Any advice with this? I can do this with Excel if needed.
posted by zardoz to Computers & Internet (4 answers total)
 
easiest is to do a formula in col C that counts how many times Row # of Col A appears in all of Col B. Make them a table, then filter for Col C anything over 1 and there are your failures.

https://www.automateexcel.com/formulas/count-cells-that-contain-specific-text/
posted by dorothyisunderwood at 1:59 AM on February 23, 2023 [7 favorites]


Set Theory-ish....

Make enough duplicate columns of the A/B source columns so you have a bit to work with.
Add a column that does a B' that is B passed through a search/replace to remove it's column A word.

Make another column that checks whether the A can be found in any of the B', if it can, it's a duplicate, if it can't then it's unique to it's original B.

I don't do spreadsheets. But roughly.....

dog, what a good dog.
cat, what a good cat.
eats, the dog eats the rat.
sleeps, the dog sleeps with the cat.
Becomes:

dog, what a good .
cat, what a good .
eats, the dog  the rat.
sleeps, the dog  with the cat.
Then some function to see if A1 is anywhere in B'1-4 , and likewise for A2 anywhere in B'1-4.

If an A is found in the B' list, it's a duplicate (dog & cat), if it doesn't it's unique (eats & sleeps).

So you need search/replace across two columns in a row and a boolean search between a cell in a column and a range.

Hope your words are simple and don't require conjugation or such. This is about the easiest way I can think of to do this with spreadsheet functions and not like in Perl/Python/etc.
posted by zengargoyle at 2:10 AM on February 23, 2023


Am I understanding correctly that you want to make sure that each word in column A is not used in any cells in B other than the directly corresponding example sentence? Are you looking for the count of those wrong-cell uses, the location of those wrong-cell uses, or just a true/false on whether they exist? Assuming it's one of the latter two, I think your focus on counting the duplicates is a bit of a red herring - you probably want an actionable list of those "incorrect" uses of the word.

I personally would find it easiest to create a function looking for those uses in the wrong cell and then returning the info that you need. If you just want to know whether your word exists in any other cell, I would use something like the countif foruma in the link dorothyisunderwood posted.

Or, if you wanted the locations of the incorrect uses of your word, you could use SEARCH and do something like =JOIN(",", ARRAYFORMULA(IF(ISERROR(SEARCH(A1, B1:B150)), , ROW(B1:B150)))). This needs a little shaping to remove the location of the "correct" use of the word and to remove the blanks (both could be done easily with =FILTER), but perhaps it's a useful starting point.
posted by mosst at 8:30 AM on February 23, 2023


The formula referenced in the first comment will not work with just a plain cell reference because of the need for wildcards. So, you can string together the formula using two helper cells that contain parts of the formula and not including the equal sign:
C1: COUNTIF(B:B, "*
D1: *")

Assuming your data starts in the second row, you can make cell c2 =C$1&A2&D$1 and copy it all the way down the column.
Then you can copy column C and paste->special values over the whole column. Now replace COUNTIF with =COUNTIF and then your formulas should be correct. Any values over 1 are duplicated somewhere else in column B.
posted by soelo at 8:53 AM on February 23, 2023


« Older Is there a good way to use up sweet coffee creamer...   |   Is the 'freedom of speech' rhetoric now just... Newer »
This thread is closed to new comments.