xsharp.eu • Duplicates, again, but other sort of ;)

Duplicates, again, but other sort of ;)

Posted: Thu Nov 26, 2020 6:39 pm
by FFF
I need to write a tool to find duplicate files in a folder. We're talking about some 25k files in there ;), so "by hand" is not an option.
The filenames are arbitrary (I just found a file named "____ ___ ___ (_____, ____. ___).docx"); the extensions are arbitrary, too.

The app that controls this folder "notices" if a given file is already present and changes the filename of the newly inserted file by adding a date plus an increment. E.g.:
If there's a "myTest.prg", the copy will be "myTest-Nov-26-1.prg"; if another arrives on the same date, it will be "myTest-Nov-26-2.prg"; if it appears tomorrow, it will be named "myTest-Nov-27-1.prg".
I have no control over this.
My first thought was to get a list of FileInfos and step through it, comparing file n with file n+1: if the front parts of the filenames are identical, check for same size and same change date, delete the n+1 file (or better, move it to backup ;->), and iterate until no dups are found.
Does that make sense?
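A minimal sketch of that first thought, written in Python rather than X#/.NET purely for brevity. It assumes the suffix always looks like the examples above ("-Nov-26-1" directly before the extension, English month abbreviations); the names `SUFFIX`, `base_name` and `candidate_groups` are made up for illustration. Grouping by key avoids relying on the sort order of n and n+1.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed suffix shape, taken from the examples in the post: "-Nov-26-1".
SUFFIX = re.compile(
    r"-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{1,2}-\d+$"
)

def base_name(path):
    """Return (stem without the date/increment suffix, lower-cased extension)."""
    return SUFFIX.sub("", path.stem), path.suffix.lower()

def candidate_groups(folder):
    """Group files that share base name, size and change date."""
    groups = defaultdict(list)
    for p in Path(folder).iterdir():
        if p.is_file():
            st = p.stat()
            groups[(base_name(p), st.st_size, st.st_mtime)].append(p)
    # Shortest name first, so the un-suffixed original leads each group.
    return [
        sorted(g, key=lambda p: (len(p.name), p.name))
        for g in groups.values()
        if len(g) > 1
    ]
```

Matching name, size and change date only makes the files candidates; whether they really are identical still needs a content check.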
I'd feel better if I could check for the files being identical, but I didn't find a tool for that in .NET (probably searched wrongly).
Maybe one could send pairs to something like WinMerge?
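Instead of handing pairs to an external tool, a byte-for-byte check is only a few lines. The sketch below is Python, not .NET, and the function name `files_identical` is invented here; it uses the standard library's `filecmp.cmp` with `shallow=False`, which reads and compares the actual contents.

```python
import filecmp
from pathlib import Path

def files_identical(a, b):
    """True only if both files have the same length and the same bytes."""
    a, b = Path(a), Path(b)
    if a.stat().st_size != b.stat().st_size:
        return False  # different sizes can never be identical, so skip reading
    # shallow=False forces a content comparison instead of trusting stat() data.
    return filecmp.cmp(a, b, shallow=False)
```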

Dups usually appear rather rarely, but every now and again there are hiccups in the upstream process and I get 1000 new ones ;-(

Any ideas welcome!
EDIT: maybe I should have consulted the web prior to writing ;) - found some candidates, and found one which tells me how dumb I was - ignoring the first "marker": two identical files have to be the same size...
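The size "marker" can be sketched roughly like this (Python for illustration, all names hypothetical): bucket the files by size first, and only hash the files that share a size with at least one other file. Using SHA-256 as the content fingerprint is an assumption of this sketch, not something from the tools mentioned above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def duplicate_sets(folder):
    """Return lists of files in `folder` that have identical content."""
    by_size = defaultdict(list)
    for p in Path(folder).iterdir():
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    result = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p)
        result.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return result
```

This ignores the filenames entirely, so it also catches duplicates whose names have nothing in common.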

Posted: Thu Nov 26, 2020 8:17 pm
by Terry
Hi Karl

Don't know if this helps, but some time ago I had similar problems using C#.

I can't remember (or find) exactly what I did, but essentially it involved creating a newClass with separate fields, e.g. name, fullname, dates and so on. That class was initialised from FileInfos in the way you suggest.

I then added it all to a List<newClass>, which allowed me to jigger things about in any way I wished.

I was doing this over several directories. For roughly the same number of files you quote, the overall processing time was just a few milliseconds.
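Something along these lines, sketched in Python instead of C#; `FileRecord` and `load_records` are hypothetical stand-ins for the "newClass" and its loader, and the field list is only a guess at what is useful.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path

@dataclass(frozen=True)
class FileRecord:  # hypothetical stand-in for the "newClass"
    name: str
    full_name: str
    size: int
    modified: datetime

def load_records(*folders):
    """Collect one record per file across any number of directories."""
    records = []
    for folder in folders:
        for p in Path(folder).iterdir():
            if p.is_file():
                st = p.stat()
                records.append(
                    FileRecord(
                        name=p.name,
                        full_name=str(p.resolve()),
                        size=st.st_size,
                        modified=datetime.fromtimestamp(st.st_mtime),
                    )
                )
    return records
```

Once everything is in one list, reordering is a one-liner, e.g. `records.sort(key=lambda r: (r.size, r.name))`.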

Terry

Posted: Fri Nov 27, 2020 9:21 am
by Terry
Hi Karl

Further to my last, have just remembered a bit more.

You'll need to do a number of passes generating new arrays as you go.

Order is important. So if things get out of order, make one long string and use sort. File names etc. will need to be padded out (with spaces) to a consistent minimum length, then added together. Don't use StringBuilder; "a" + "b" is, I think, far more efficient when joining just a few pieces.

You can consider introducing some oddball Unicode characters as identifiers and so on.

One other point: you'll generate a lot of redundant strings in the process, so make sure they go out of scope ASAP or they'll fill up memory.

I hope this makes sense - you'll have absolute control over everything, no need for 3rd party tools.

Sorry it's C#.

Terry

Posted: Sat Nov 28, 2020 1:14 pm
by ArneOrtlinghaus
Hi Karl,

I have attached two files with some VO functions or their converted XS versions. I believe it isn't so difficult to do with what we are used to:

- Create an array of the files with directory2arrayex.
- Verify that with this order the first file always appears first in the list; otherwise change the order.
- Make two nested for loops to compare all files. Use the file size (second element of the inner array) for a quick comparison.
- If the file sizes are equal, use fFilesEqual to compare the contents. If they are equal, you can delete the second file.
- An alternative is to order the two-dimensional array by file size (second element) with ASortTwoDim (adir, 2, true). In this case the for loop can be improved a little, but it must be verified which file is the second one.

Arne
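The "which file is the second one" question could be handled roughly as below. This is a Python sketch, not the attached VO/XS functions, and `move_duplicates` is a made-up name. It assumes the group has already been confirmed identical by a content comparison, and that the original is the file with the shortest name, since copies get a suffix appended. Duplicates are moved to a backup folder rather than deleted.

```python
import shutil
from pathlib import Path

def move_duplicates(group, backup_dir):
    """Keep one file from a group of identical files; move the rest away."""
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)

    # Assumption: the shortest name is the original, copies carry a suffix.
    ordered = sorted((Path(p) for p in group), key=lambda p: (len(p.name), p.name))
    keeper, extras = ordered[0], ordered[1:]

    moved = []
    for p in extras:
        target = backup / p.name
        if target.exists():
            continue  # never overwrite an earlier backup; leave for manual review
        shutil.move(str(p), str(target))
        moved.append(target)
    return keeper, moved
```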

Posted: Tue Dec 15, 2020 7:53 pm
by ic2
Hello Karl,

Not sure why you need to write it yourself. This is a great & free tool for finding duplicate files, using several criteria:

www.clonespy.com

Dick