sitedir: the directory the site is in
run winhttrack with full debug log
take lines beginning with "##:##:## Info: engine: transfer-status: link added:" (where # is a digit char)
remove the first 53 + urlLength chars, where urlLength is the length of the site's root path (e.g. site.com); this leaves a leading /, which is important later
do s/ -> .*$// to remove local file references
do s/[?].*$// to remove URL parameters to PHP pages, etc.
save resulting file to goodlist.txt
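the extraction steps above can be sketched as one shell function. this is a rough sketch, not a tested recipe: make_goodlist is a made-up name, and instead of counting 53 + urlLength chars it strips the prefix by pattern (everything up to "link added: ", then the site root). it assumes the URL directly follows "link added: " and starts with the site root.

```shell
# make_goodlist LOGFILE ROOT  (hypothetical helper, not part of winhttrack)
# Prints one site-relative path (with its leading /) per "link added" line.
make_goodlist() {
    log=$1
    root=$2
    # Keep only timestamped "link added" lines from the debug log.
    grep -E '^[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]].*engine: transfer-status: link added:' "$log" |
    while IFS= read -r line; do
        line=${line#*link added: }   # drop the timestamp and log prefix
        line=${line#"$root"}         # drop the site root, keep the leading /
        line=${line%% -> *}          # drop the local file reference
        line=${line%%\?*}            # drop URL parameters
        printf '%s\n' "$line"
    done
}
```

usage, assuming the log was saved as hts-log.txt: make_goodlist hts-log.txt site.com > goodlist.txt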
cp -R sitedir sitedir2
from the parent directory of sitedir2, create a mirror of the directory structure with:
(cd sitedir && find . -type d -exec mkdir -p ../sitedir_good/{} \;)
move all good links out of sitedir2 to somewhere else:
while IFS= read -r filename; do
mv "sitedir2$filename" "sitedir_good$filename";
done < goodlist.txt
unset filename
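a slightly more defensive variant of the move step above, as a sketch only. move_good is a made-up name; it creates each target directory on demand, so it does not rely on the mirrored directory tree, and it reports list entries that do not name a regular file (e.g. a path ending in /, or a duplicate entry that was already moved) on stderr instead of letting mv fail.

```shell
# move_good LIST SRC DST  (hypothetical helper)
# Moves every path listed in LIST from SRC to DST, keeping subdirectories.
move_good() {
    list=$1
    src=$2
    dst=$3
    while IFS= read -r path; do
        [ -n "$path" ] || continue                # skip blank lines
        if [ -f "$src$path" ]; then
            mkdir -p "$(dirname "$dst$path")"     # create the target directory
            mv "$src$path" "$dst$path"
        else
            printf 'missing: %s\n' "$path" >&2    # not a regular file in SRC
        fi
    done < "$list"
}
```

usage: move_good goodlist.txt sitedir2 sitedir_good 2> missing.txt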
check whether any good files depend on the remaining files (e.g. server-side includes, which would not show up in winhttrack logs):
(for filename in `ls -ARp ../sican2/* | grep -vE "\>:" | grep -vE "\>/" | grep -v "^$"`; do
linksto $filename; done; unset filename ) > ../linkagereport.txt
problems:
- pathnames are not preserved during the linksto command, so linksto "foo.htm"
returns pages with links to /foo/bar/foo.htm and /foo/foo.htm without
distinction
- because of the above problem, bad links will show up too; these may need to
be changed, since the files they want are still available (e.g. if
foobar.html links to /foo.htm but foo.htm is in /bar/, foo.htm will still
show up as a needed file)
- somehow reconcile the above two?
- partial filenames, e.g. logo.jpg matches cap_logo.jpg
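one possible way around the pathname and partial-filename problems, as an untested idea rather than a replacement for linksto: search for the full site-relative path and require that the match is not glued to another filename character. links_to_path is a made-up name, grep -r is assumed to be available (GNU or BSD grep), and relative links such as ../foo.htm are not resolved, so only root-relative links are caught when a full path is given.

```shell
# links_to_path TARGET DIR  (hypothetical helper, not the linksto command)
# Lists files under DIR that mention TARGET as a whole path or filename.
links_to_path() {
    target=$1
    dir=$2
    # Escape regex metacharacters so "." in the target matches only a dot.
    escaped=$(printf '%s\n' "$target" | sed 's,[][\.*^$+?(){}|],\\&,g')
    # Require a non-filename character (or line start/end) on both sides,
    # so logo.jpg does not match cap_logo.jpg and /foo/foo.htm does not
    # match /bar/foo/foo.htm.
    grep -rlE "(^|[^[:alnum:]_.-])${escaped}([^[:alnum:]_.-]|\$)" "$dir"
}
```

usage: links_to_path /foo/foo.htm sitedir_good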