sitedir: the directory the site is in

run winhttrack with full debug log
take lines beginning with "##:##:## Info: engine: transfer-status: link added:" (where # is a digit char)
remove the first 53 + urlLength chars, where urlLength is the length of the site's root URL (e.g. site.com)
    (this will leave a leading /, which is important later)
do s/ -> .*$// to remove local file references
do s/[?].*$// to remove URL parameters to PHP pages, etc.
save resulting file to goodlist.txt

cp -R sitedir sitedir2

create a mirror of the directory structure in sitedir_good (run from sitedir2/../) with:
    (cd sitedir && find . -type d -exec mkdir -p ../sitedir_good/{} \;)

move all good links out of sitedir2 to somewhere else (here, sitedir_good):
    for filename in `cat goodlist.txt`; do mv "sitedir2$filename" "sitedir_good$filename"; done; unset filename

check remaining files for dependencies on good files (e.g. server-side includes that would not show up in winhttrack logs):
    (for filename in `ls -ARp ../sican2/* | grep -vE "\>:" | grep -vE "\>/" | grep -v "^$"`; do linksto $filename; done; unset filename) > ../linkagereport.txt

problems:
- pathnames are not preserved during the linksto command, so linksto "foo.htm" returns pages with links to /foo/bar/foo.htm and /foo/foo.htm without distinction
- because of the above, broken links show up as dependencies whenever a file with the same name still exists somewhere else; these links may need to be changed (e.g. if foobar.html links to /foo.htm but foo.htm is actually in /bar/, foo.htm is still reported as a needed file)
- somehow reconcile the above two?
- partial filenames also match, e.g. logo.jpg matches cap_logo.jpg
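
a possible sketch of the log-filtering steps that produce goodlist.txt. instead of counting 53 + urlLength characters, it strips the prefix by pattern, which is a substitution for the method in the notes, not the same method. the function name extract_good_links, the whitespace tolerance after the timestamp and after "Info:", and the final sort -u are assumptions added here for illustration; the real log layout may differ and should be checked against an actual winhttrack log first.

```shell
#!/bin/sh
# Sketch: turn a winhttrack debug log (stdin) into a list of site-relative
# paths (stdout). Hypothetical helper, not part of winhttrack.
# Assumption: the site root (e.g. site.com) contains no '|' characters.
extract_good_links() {
    site_root=$1
    # Escape dots so "site.com" is matched literally inside the regexes.
    site_re=$(printf '%s' "$site_root" | sed 's/[.]/\\./g')

    # Keep only "link added" lines that point at the site itself.
    grep -E "^[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]+Info:[[:space:]]+engine: transfer-status: link added:[[:space:]]*${site_re}" |
        # 1) drop the log prefix and site root, leaving the leading /
        # 2) drop the " -> local file" reference
        # 3) drop URL parameters (everything from the first '?')
        sed -e "s|^.*link added:[[:space:]]*${site_re}||" \
            -e 's/ -> .*$//' \
            -e 's/[?].*$//' |
        # Not in the original notes: remove duplicate entries.
        sort -u
}
```

usage, assuming the debug log was saved as httrack_debug.log (a hypothetical file name):
    extract_good_links site.com < httrack_debug.log > goodlist.txt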