-------------------------------------------------------------------------------
Comparing files by Words (especially plain English text)

Generally comparing large collections of different files, looking for common
sequences of text.

-------------------------------------------------------------------------------
Wdiff -- diff on a word by word basis

EG: differences based on white space or non-alphanumeric separations

Essentially it breaks up the two input files by white space, into
one-word-per-line files to feed into diff.  It then runs diff on the results
and examines the output.

It is while reading the output of diff that wdiff comes into its own.  It
reads the change lines (and ignores all else from the diff output) and writes
out or skips the words AND THE WHITESPACE from the appropriate input file,
inserting special "deleted" and "inserted" tags into the merged output.

It also, uniquely, summarises the results of the diff output, listing the
total word counts and the words in common, deleted, inserted and changed.

The tags can be changed with the -w, -x, -y and -z options (start/end of
deleted text, start/end of inserted text).  This lets you color the output
for your terminal.

  wdiff -w "$(tput bold;tput setaf 1)" -x "$(tput sgr0)" \
        -y "$(tput bold;tput setaf 2)" -z "$(tput sgr0)" \
        file1 file2

-------------------------------------------------------------------------------
Perl IO::Event wdiff

I implemented a wdiff in perl which required no temporary space and only
limited buffers.  It did not re-process the output from the diff, but did
collect the statistics results just as wdiff did.

The program made use of a multi-IO event handling perl module which I wrote,
based on the "select()" system call (perl IO::Select module).  The input files
were filtered and passed to the diff command, and the output from the diff
command was collected and statistics gathered, all SIMULTANEOUSLY.

Basically I wanted to prove I could do this simultaneous processing without
temporary files.  I plan to use it as example code for the IO modules.

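As a rough illustration of the statistics gathering only, here is a minimal
Python sketch.  It is NOT the perl program: it uses Python's difflib in place
of the external diff command, does no simultaneous IO, and the name
"word_stats" and its dictionary keys are made up for this example.

```python
import difflib


def word_stats(old_text, new_text):
    """Count common, deleted, inserted and changed words between two texts.

    Hypothetical helper: words are simply runs of non-whitespace, and
    difflib stands in for the external diff command.
    """
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    stats = {
        "old_total": len(old), "new_total": len(new),
        "common": 0, "deleted": 0, "inserted": 0,
        "old_changed": 0, "new_changed": 0,
    }
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            stats["common"] += i2 - i1
        elif tag == "delete":
            stats["deleted"] += i2 - i1
        elif tag == "insert":
            stats["inserted"] += j2 - j1
        else:  # "replace": words differ on both sides
            stats["old_changed"] += i2 - i1
            stats["new_changed"] += j2 - j1
    return stats


if __name__ == "__main__":
    print(word_stats("the quick brown fox", "the slow brown fox jumps"))
```

For each file the common, deleted (or inserted) and changed counts add up to
that file's total word count, which is a handy sanity check.
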
The perl IO event handler module I wrote has not however been published on
CPAN yet.  Nor have I seen anything like it that I could re-use.  As such it
is not practical for a public release (perhaps with perl 6 :-)

The whole thing also proved to be slower than the original C wdiff, due to my
use of perl regular expressions in the input filters, rather than the
character by character method wdiff uses.

The design is not however a practical one for re-processing the diff output
into wdiff format, without some major re-writing of the input handling to
allow it to read the input streams twice.  Perhaps as a future project.

-------------------------------------------------------------------------------
Word diff with context

Problem:

When a block of text is deleted from an input file, the output of wdiff locks
on to similar words, especially common words (like: a, the, and), and this
results in a flood of changes in the output until the two input files
synchronize again, if ever.

Possible solution...

When you break the files into 1-word-per-line for comparison, include
"context" information around each word.  That is, you include the words both
before and after the specific word for that line.

For example...

    When you break
    you break the
    break the files
    the files into
    ...

That would make it much more unlikely for individual common words to be
improperly picked out of context for matching in the two files.

Then when you reconstruct the diff output, you would grab the middle word on
each line instead of the entire line (word).

This might even be tune-able -- the user could specify whether to add 1, 2, or
N extra words on each side of the middle word.  But I believe even 1 word on
each side would dramatically improve things.

                                                        Gary Fritz

My own notes..

This solution works very well for file synchronization.  It does require some
extra work to re-align word boundaries, such as when a single word is
added/deleted/modified.

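The context idea can be tried out in a few lines.  This is only a sketch under
some assumptions: it is written in Python, uses difflib rather than the real
diff command, pads the first and last words with empty strings (the notes do
not say how the edges should be handled), and the names "context_lines" and
"context_word_diff" are invented here.

```python
import difflib


def context_lines(words, n=1):
    """One entry per word: the word plus n neighbours on each side.

    Edges are padded with empty strings (an assumption of this sketch).
    """
    pad = [""] * n
    padded = pad + list(words) + pad
    return [tuple(padded[i:i + 2 * n + 1]) for i in range(len(words))]


def context_word_diff(old_words, new_words, n=1):
    """Diff the context lines, then report only the middle word of each.

    Returns (tag, word) pairs, tag being " " (common), "-" or "+".
    No re-alignment of word boundaries is attempted.
    """
    old_ctx = context_lines(old_words, n)
    new_ctx = context_lines(new_words, n)
    matcher = difflib.SequenceMatcher(None, old_ctx, new_ctx, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result += [(" ", line[n]) for line in old_ctx[i1:i2]]
        else:
            result += [("-", line[n]) for line in old_ctx[i1:i2]]
            result += [("+", line[n]) for line in new_ctx[j1:j2]]
    return result


if __name__ == "__main__":
    for line in context_lines("When you break the files into".split()):
        print(" ".join(line).strip())
    for tag, word in context_word_diff("a b c d e".split(),
                                       "a b x d e".split()):
        print(tag, word)
```

In the second example only "c" was changed to "x", yet "b" and "d" are also
reported as changed, because their context lines differ.  That is the word
boundary re-alignment work mentioned above, which this sketch leaves undone.
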
-------------------------------------------------------------------------------
sim_text

I downloaded and modified the "sim_text" program to generate a "sim_words"
version, which tokenizes the file into simple alphabetic words, ignoring all
space, punctuation and numbers.

As it lexically tokenizes the files, the comparison function (which is not
detailed) works extremely fast to compare one or two lists of files, without
requiring pre-finger-printing.  This is memory intensive, but the program uses
that memory well.

Wdiff only cleans the original files for analysis by the original 'diff'
program, which can only compare two 'pre-processed' files at a time.  The
amount of I/O traffic and command spawning is very high, which slows
comparisons of large collections down enormously.

Unfortunately "sim_words" does not accept filenames from files or streams, or
do recursive reading of files in a directory structure, which can make command
line limits a problem.  Though it does allow comparison of two separate groups
of files.

It also does not appear to 'sync' the diffs as well as "wdiff" (with a context
switch).  For example in one case "wdiff -c" of two files found 26% common,
while sim_words only found 3% common.  This may be caused by its much larger
'context' handling (-r option), when finding duplicate strings.

-------------------------------------------------------------------------------
New variation: dwdiff
  http://os.ghalkes.nl/dwdiff.html

This supposedly provides more control of 'what is a word'.

-------------------------------------------------------------------------------