How to use Wget to download selective files from a directory
How to use wget to download selected files from a directory on a remote FTP or HTTP server:
Let's say you have found a large, valuable resource on the internet with movies, interesting documents and tutorials, and you want to save it locally on another server, in this scenario a Linux server. How do you do that without having to watch over the download the whole time? In my case the download was limited to 40 KB/s, so fetching the whole directory takes days. For my Linux server, which is always turned on, that is not an issue. Worse, the remote server sometimes does not reply and the connection is often lost. Wget is one of the best choices here because it supports resuming: within a run it retries by itself, and with the -c option it continues a partially downloaded file left by an earlier run. So if a 2 GB video has been copied to 90% and you lose your connection, wget resumes the copy from the last byte saved. It is a great tool.
Let me show you what I mean. I have a passion for TED Talks, and I found on the internet an archive with all the videos to date.
The whole directory currently holds about 64 GB of videos. It is an FTP server with anonymous access. If you look carefully, there are 2 files for each talk. I don't want both variants, only the one with the highest quality, which is the 480 variant.
Let's go to the Linux machine where we want to download all of this.
1. Obtaining a file with the videos, one per line.
First, I log in to the server with an FTP client. In my case it is yafc (yet another ftp client), which has the advantage of being command-line based. (The $ prompt marks what you must type.) If you have another way to obtain the list of videos you need from the directory, use it. All we want is the list of files to download.
$ yafc ftp://anonymous:@ftp.leg.uct.ac.za:21
Type your password. Now that you are logged in, go to the directory.
$ cd pub/stuff/video/ted/
Now we obtain the list of videos. The nlist command in yafc prints only the file names, one per line. To make sure you are in the right directory, you can type:
$ nlist
But we want a file that contains all these names, so that we can select from them later. First make sure you have a scripts directory somewhere. So:
$ nlist > /home/scripts/teds
It takes about 20 seconds to finish. Now you can quit.
Since we want only the 480 videos in our case, we select only those.
$ grep '480' /home/scripts/teds > /home/scripts/ted
OK. We now have a clean list of all the videos we want, one per line. This is important; we will see why next.
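If you prefer not to use an interactive FTP session, the same list can probably be built in one go. The following is only a sketch, assuming curl is installed; the function names list_remote and only_480 are made up for this example, and the paths are the ones used in this tutorial.

```shell
# Sketch: build the download list without an interactive FTP session.
# Assumption: curl is installed. The function names are hypothetical.

REMOTE_DIR="ftp://ftp.leg.uct.ac.za/pub/stuff/video/ted/"

# Print the remote directory as bare file names, one per line.
list_remote() {
    curl --silent --list-only "$REMOTE_DIR"
}

# Keep only the names that contain "480".
only_480() {
    grep '480'
}

# Usage (nothing is downloaded until you run this yourself):
#   list_remote | only_480 > /home/scripts/ted
#   wc -l < /home/scripts/ted     # how many videos were selected
```

Note that grep '480' keeps any name containing 480 anywhere, so it is worth looking over the resulting file once before starting the download.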
2. Building the script.
Since we want to automate the whole thing, we need a script, which can then be started from crontab from time to time to do the job.
Here is the script I built. Save it under the name wget_ted in the same /home/scripts/ directory.
#########################################################################
# Selective wget script: downloads files from a remote directory, using
# a loop variable that reads the names to fetch from a list file.
# Start it from cron, because the download sometimes stops.
# The [w] in the grep pattern keeps grep from matching its own process.
#########################################################################
ps | grep "[w]get" > /dev/null
if [ $? -ne 0 ]
then
for i in $(cat /home/scripts/ted ); do wget -c -q "ftp://ftp.leg.uct.ac.za/pub/stuff/video/ted/$i" --directory-prefix=/storage/MOVIES/TED/ ; done &
fi
I will explain what it does.
ps | grep "[w]get" > /dev/null
The first line uses ps to check for an existing wget process. If wget somehow fails to connect to the server, or the process simply dies, no wget process is found and the if condition is satisfied. I put the w in brackets in the regular expression so that grep does not match its own process when searching for wget.
The next line :
if [ $? -ne 0 ]
is an if statement that looks at the exit status of the previous command, which the shell stores in $?. If grep finds a match, $? is 0; if it finds nothing, $? is non-zero. So if there is no wget process, the test succeeds and the code after then is executed. If a wget process exists, $? is 0, the test fails, and the script does nothing.
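To see why the bracket trick and the exit status work together, here is a small sketch you can run without any download going on. It feeds two sample lines to the same grep pattern; the helper names matches_wget and show_status, and the file name talk.mp4, are invented for the illustration.

```shell
# Succeeds (exit status 0) when the given text matches the pattern [w]get.
matches_wget() {
    printf '%s\n' "$1" | grep "[w]get" > /dev/null
}

# Print the exit status that $? would hold right after the check.
show_status() {
    status=0
    matches_wget "$1" || status=$?
    echo "$status"
}

# How a running download might look in the ps output (hypothetical line).
show_status "wget -c -q ftp://ftp.leg.uct.ac.za/pub/stuff/video/ted/talk.mp4"   # prints 0

# How the grep command itself looks in the ps output.
show_status "grep [w]get"                                                       # prints 1
```

The second line contains the characters [w]get, not the word wget, so the pattern does not match it and grep reports a non-zero status.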
Now the real command. I used a for loop with a variable i. This variable takes its values from the video list file obtained in step 1: /home/scripts/ted
It takes each entry in turn and runs wget on a single video at a time. It begins with the first video, downloads it, then moves on to the second video, and so on. If the process is interrupted, the next run passes quickly over the files that are already complete and continues from the file where the interruption occurred.
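Because the script runs quietly, it can be handy to check how far the download has come. The sketch below is one possible way to do it, not part of the original script: the function name not_yet_started is made up, and it only tests whether a local file exists, so a file that is still partial counts as started.

```shell
# Print every name from the list file that has no local copy yet.
# $1 = list file (one name per line), $2 = download directory.
# Hypothetical helper: it checks existence only, not completeness.
not_yet_started() {
    while IFS= read -r name; do
        if [ -n "$name" ] && [ ! -e "$2/$name" ]; then
            printf '%s\n' "$name"
        fi
    done < "$1"
}

# Usage with the paths from this tutorial:
#   not_yet_started /home/scripts/ted /storage/MOVIES/TED | wc -l
```

The number printed by the usage line is how many videos from the list have not been touched yet.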
The options (flags) that we used with wget are:
'-c' or '--continue'
means continue getting a partially downloaded file. This is useful when you want to finish a download started by a previous instance of Wget, or by another program.
'-q' or '--quiet'
means turn off Wget's output.
'-P' or '--directory-prefix'
This is the local directory where all the downloaded videos are saved.
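As a variation on the for loop, wget can also read the URLs itself through its '-i' ('--input-file') option, where '-' means standard input. The sketch below is an alternative I am suggesting, not what the script above does; the function name to_urls is invented. A single wget process then handles the whole list, so the ps check sees it for the entire run.

```shell
# Base URL of the remote directory used in this tutorial.
BASE_URL="ftp://ftp.leg.uct.ac.za/pub/stuff/video/ted/"

# Turn bare file names on stdin into full URLs, one per line.
# Hypothetical helper; empty lines are skipped.
to_urls() {
    while IFS= read -r name; do
        if [ -n "$name" ]; then
            printf '%s%s\n' "$BASE_URL" "$name"
        fi
    done
}

# Usage (starts the real download, so run it only when you mean it):
#   to_urls < /home/scripts/ted | wget -c -q -i - --directory-prefix=/storage/MOVIES/TED/ &
```

The '-c', '-q' and '--directory-prefix' options keep the same meaning as in the list above.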
3. Crontab job.
I set it to start every 10 minutes. It puts no real load on the server, because it runs quietly and does nothing if a wget process is already running. Edit your crontab with the command 'crontab -e'. You can list your crontab with:
crontab -l
#WGET Selective Script TED
*/10 * * * * sh /home/scripts/wget_ted > /dev/null 2>&1
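If you would rather add the entry without opening an editor, something like the following should work. It is a sketch under the assumption that your crontab command accepts a new table on standard input ('crontab -'); the function name with_cron_line is made up. It adds the line only if it is not already present, so running it twice does no harm.

```shell
# The crontab entry from this tutorial.
CRON_LINE='*/10 * * * * sh /home/scripts/wget_ted > /dev/null 2>&1'

# Read an existing crontab on stdin and print it back,
# appending CRON_LINE only when it is not already there.
with_cron_line() {
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    if ! printf '%s\n' "$existing" | grep -F -x -q -- "$CRON_LINE"; then
        printf '%s\n' "$CRON_LINE"
    fi
}

# Usage (this replaces your crontab with the updated version):
#   crontab -l 2>/dev/null | with_cron_line | crontab -
```

Check the result afterwards with crontab -l.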
So, that is all. I hope I was clear enough and that this small tutorial helps you.

18. Dec, 2010 