• 2012-06-29 12:00:00

    How to build an inexpensive RAID for storage and streaming

    Overview

    I have written about this before and it is still my favorite mantra:

    There are people who take backups and there are people who have never had a hard drive fail

    This mantra is basically what should drive everyone to take regular backups of their electronic files, so as to ensure that nothing is lost in case of hardware failures.

    Let's face it: hard drives, like any magnetic media, are man-made and will fail at some point. We don't know when, so we have to be prepared for it.

    Services like Google Drive, iCloud, Dropbox, BoundlessCloud and others offer good backup services, ensuring that there is at least one safe copy of your data. But that is not enough. You should ensure that, whatever happens, the memories stored in your pictures and videos, the important email communications and the important documents are all kept in a safe place, with multiple backups. Once they are gone, they are gone for good, so backups are the only way to ensure that this does not happen.

    Background

    My current setup at home consists of a few notebooks, a Mac mini and a Shuttle computer with a 1TB hard drive, where I store all my pictures, some movies and my songs. I use Google Music Manager for my songs so that they are available at any time on my Android phone, Picasa to share my photos with my family and friends, and Google Drive to keep every folder I have in sync. I also use RocksBox to stream some of that content (especially the movies) upstairs to either of our TVs through the Roku boxes we have.

    Recently I went downstairs and noticed that the Shuttle computer (which ran Windows XP at the time) was stuck on the POST screen. I rebooted the machine but it refused to load Windows, getting stuck either on the POST screen or on the Starting Windows screen.

    Something was wrong, and my best guess was that the hard drive was failing. I then booted up in Safe Mode (which worked), set the hard drive to be checked for defects and rebooted again to let the whole process finish. After an hour or so, the system was still checking the free space and was stuck at 1%. The hard drive was making some weird noises, so I shut the system down.

    Although I had backups of all my files on the cloud through Picasa, Google Music Manager and Google Drive, I still wanted to salvage whatever I had in there just in case. I therefore booted up the system with a Linux live CD, mounted the hard drive and used FileZilla to transfer all the data from the Shuttle's hard drive to another computer. There was of course a bit of juggling going on, since I had to transfer the data to multiple hard drives due to space restrictions.

    Replacing the storage

    I had to find something very cheap and practical. I therefore went to Staples and found a very nice (for the money) computer by Lenovo. It was only $300, offering a 1TB 7200 RPM SATA hard drive, an i5 processor and 6GB of RAM.

    As soon as I got the system I booted it up, started setting everything up and before the end of the day everything was back in place, synched to the cloud.

    However the question remained: what happens if the hard drive fails again? Let's face it, I did lose a lot of time trying to set everything up again so I wasn't prepared to go through that again.

    My solution was simple:

    Purchase a cheap hardware RAID controller, a 1TB hard drive identical to the one I have, and a second hard drive to hold the OS. This way, the two 1TB hard drives can be connected to the RAID controller in a mirror setup (RAID 1), while the additional hard drive holds the operating system.

    I opted for a solid state drive from Crucial for the OS. Although it was not necessary to have that kind of speed, I thought it wouldn't hurt. It did hurt my pocket a bit but c'est la vie. For your own setup you can choose whatever you like.

    Hardware

    NOTE: For those who are not interested in having a solid state drive for the OS, one can always go with other, much cheaper drives such as this one.

    Setup

    After all the components arrived, I opened the computer and had a look at what I was facing. One thing I had not realized was the lack of space for the third hard drive (the one that would hold the OS). I was under the impression that it would fit under the DVD-ROM drive, but I did not account for the SD card reader installed in that space, so I had to be a bit creative (Picture 1).

    A couple of good measurements and two holes with the power drill created a perfect mounting point for the solid state drive. It now sits securely in front of the card reader connections, without interfering in any way.

    The second hard drive and the RAID card were really easy to install; just a couple of screws and everything was set in place.

    The second hard drive ended up in the only expansion 'bay' available for this system. This is below the existing drive, mounted on the left side of the case. The actual housing has guides that allow you to slide the drive until the screw holes are aligned and from there it is a two minute job to secure the drive in place.

    I also had a generic nVidia 460 card with 1GB of RAM, which I installed in the system as well. It was not included in the purchases for building this system, but it goes for around $45 if not less now. I have had it for over a year (it was installed in the old Shuttle computer), so I wasn't going to let it go to waste.

    With everything in place, all I had to do was boot the system up and enter the BIOS screen to ensure that the SSD drive had a higher boot priority than any other drive.

    Once that was done, I put the installation disks in the DVD-ROM drive and restored the system onto the SSD drive. Four DVDs later (around 30 minutes) the system was installed and booted up. However, it took another couple of hours until I had everything set up; the myriad of Windows updates (plus my slow Internet connection) contributed to this delay. I have to admit that the SSD drive was a very good purchase, since I had never before seen Windows boot in less than 10 seconds (from power up to the login screen).

    The Windows updates included the nVidia driver, so everything was set up (well, almost). The only driver not installed was the one for the HighPoint RaidRocket RAID controller.

    The installation disk provided that driver, along with a web-based configuration tool. After installing the driver and a quick reboot, I found the RAID configuration tool not particularly easy to understand, but I figured it out even without reading the manual.

    Entering the Disk Manager, I initialized and formatted the new drive, and from then on I started copying all my files to it.

    As a last minute change, I decided not to install RocksBox and instead go with Plex Media Server. After playing around with Plex, I found that it was a lot easier to set up than RocksBox (RocksBox requires a web server to be installed on the server machine, whereas Plex automatically discovers servers). Installing the relevant channel on my Roku boxes was really easy and everything was ready to work right out of the box, so to speak.

    Problems

    The only problem that I encountered had to do with Lenovo itself. Basically, I wanted to install the system on the SSD drive. Since the main drive is 1TB and the SSD drive 128GB, I could not use CloneZilla or Image for Windows to move the system from one drive to the other. I tried almost everything: I shrank the 1TB system partition to make it fit on the 128GB drive, turned hibernation off and rebooted a couple of times in Safe Mode to remove unmovable files; in short, it was a huge waste of time.

    Since Lenovo did not include the installation disks (only an applications DVD), I called their support line and inquired about them. I was sent from the hardware department to the software department, where a gentleman informed me that I had to pay $65 to purchase the OS disks. You can imagine my frustration at the fact that I had already paid for the OS by purchasing the computer. We went back and forth with the technician and in the end I got routed to a manager, who told me I could create the disks myself using Lenovo's proprietary software.

    The rescue disk creation process required 10 DVDs, so I started it. On DVD 7 the process halted. I repeated the process, only to see the same error on DVD 4. The following day I called Lenovo hardware support and managed to talk to a lady who was more than willing to send me the installation disks free of charge. Unfortunately, right after I gave her my address, the line dropped, so I had to call again.

    The second phone call did not go very well. I was transferred to the software department again, where I was told that I had to pay $65 for the disks. The odd thing is that the technician tried to convince me that Lenovo actually doesn't pay money to Microsoft since they get an OEM license. Knowing that this is not correct, and since the technician was getting really rude, I asked to speak to a supervisor. The supervisor was even worse, and having already spent 45 minutes on the phone, I asked to be transferred to the hardware department again. Once there, I spoke to another lady, explained the situation and how long I had been trying to get this resolved (we are at 55 minutes now), and she happily took my information and sent me the installation disks free of charge.

    Conclusion

    The setup discussed in this post is an inexpensive and relatively secure way of storing data in your own home/home network. The RAID 1 configuration offers redundancy, while the price of the system does not break the bank.

    I am very disappointed with Lenovo for trying to charge me for something I had already paid for (the operating system, that is). Luckily the ladies at the hardware department were a lot more accommodating and I got what I wanted in the end.

    I hope this guide helped you.

  • 2011-03-10 14:00:00

    Additional storage for Google Apps users

    I have been using Google Apps for a number of years now and I have gotten so used to it that I cannot fathom any other way of operating. I am sure that some of you share that sentiment. :)

    Limitations

    Up until a few months ago, Google Apps had its limitations. The actual Apps account lived in some sort of a jail shell, isolated from the rest of the Google suite of applications. For that reason you could not use your Google Apps login to enjoy a service like Google Reader, for instance. You had to be sneaky about it: you had to create a Google Account with the same username (and password, if you liked) as your Google Apps domain account, and although the two did not communicate, you could effectively have "one login" for all services.

    This limitation became more apparent with the increased usage of Android phones (where you need a Google Account on your phone) as well as Google Voice. Users had been asking about the "merge" and Google responded with significant infrastructure changes to cater for the transition. In my blog post Google Apps and Google Accounts Merge I present additional information about this, including a How-To on the transition for administrators of Google Apps. Unfortunately the process is not perfect and there are still some services that are not fully integrated with the new infrastructure (but will be in the future). For instance, in my domain, since I use Google Voice with my domain email account, I am still on the "old" system because the account could not be transitioned. It will happen in the end, it just takes time.

    Storage Needs

    The biggest issue for me related to these two separate accounts (Google Apps account vs. Google Account) involved Picasa and Google Docs.

    I have been very methodical in my filing, utilizing electronic storage as much as possible. For that reason I have been scanning documents and uploading them to Google Docs (or, if they were already available in PDF format, I would just upload them). The documents range from personal papers to utility bills and bank statements, anything that I want to store. Soon I realized that the 1GB that Google Apps offers for documents would not cut it. I therefore created a new account which I named DocsMule1 (clearly to signify its purpose). I created one folder in that account, uploaded as many documents as I could there and shared that folder with my own account as well as my wife's. Soon I hit more limitations, and I ended up with 3 mule accounts. Since there was no option for me to upgrade the storage (even if I paid for it), I had to change my strategy; managing documents from 3 or more different accounts is not an easy or convenient task.

    I downloaded all my documents back to my computer (gotta love Google's Data Liberation) and deleted them from the Google Apps mule accounts and then deleted those accounts - just to keep everything tidy. I then launched my Gmail account and signed into Google Docs. I created one folder which I shared with my Google Apps accounts (my wife's and mine) and then paid $50.00 for a whole year - which provided me with 200GB of space. You can always check how much space you are using by visiting the Manage Storage page of your Google Account.

    Once that was done, I started creating my folders (collections now) and uploaded all my documents up there. In addition to that, since my parents live in Greece, they rely on VoIP chat as well as my Picasa to stay in touch with their grandchildren. My wife and I, through the use of our mini camera as well as our Android phones, take a lot of pictures of the kids, documenting the little things that they do on a regular basis. This serves as a good archive for them when they grow up but also as a good way to stay in touch with my parents. Google's additional storage was the solution.

    Problem solved. With minimal money I had everything sorted out. It did, however, inconvenience me quite a bit in the end, since a lot of my data was now scattered: the GMail account kept Picasa and Docs, the Google Apps account my email, and my Google Account my Reader, web history etc. Not very convenient, but it works.

    Storage for Google Apps

    Around February, Google announced that they would be offering Google Apps users the option to purchase additional storage. I was really happy about that, since I could then ditch the GMail account for handling my docs and keep everything under the domain account. However, something was wrong. When Google revealed their pricing (the announcement is not there any more but the prices I am quoting are real), I quickly found out that for the storage I currently have, I would need to spend $700.00 a year instead of $50.00. That did not make sense at all. Needless to say, I stayed with my existing plan.

    Initial Pricing Plan
     Storage     Price (per year)
       5 GB         17.50 USD
      20 GB         70.00 USD
      80 GB        280.00 USD
     200 GB        700.00 USD  <==
     400 GB      1,400.00 USD
       1 TB      3,500.00 USD
       2 TB      7,000.00 USD
       4 TB     14,000.00 USD
       8 TB     28,000.00 USD
      16 TB     56,000.00 USD
    
    New Pricing Plan

    A few days ago, Google announced changes in the pricing of additional storage for Google Apps users, as well as changes to the free storage that Google offers for Picasa. Picasa Web Albums still offers 1GB of free storage, but now photos of 800x800 pixels or less, as well as videos of 15 minutes or less, do not count against that 1GB. You can read more about Picasa Web Albums storage in the relevant help page.

     Storage     Price (per year)
      20 GB         5.00 USD
      80 GB        20.00 USD
     200 GB        50.00 USD
     400 GB       100.00 USD
       1 TB       256.00 USD
       2 TB       512.00 USD
       4 TB     1,024.00 USD
       8 TB     2,048.00 USD
      16 TB     4,096.00 USD
    

    As far as the new pricing is concerned, Google brought everything in line with Google Accounts pricing (effectively scrapping the initial - expensive - pricing plan for extra storage). The help page Google Storage - How it Works offers additional information for those that want to use/upgrade their storage while using a Google Apps account. Effectively it now costs exactly the same to purchase additional space for your Google Apps account (to store documents) as it would if you were using a different Google Account. That probably means that I have to download everything to my computer and re-upload it to my Google Apps account....

    To take advantage of this feature, you will have to go to the Purchase additional storage page while logged in with the Google Apps account that you wish to purchase storage for. Note the warning that appears in red (see image), reminding you that you are using a Google Apps account. Google provides this information because your Google Apps account is controlled by the Google Apps administrator: if you purchase storage and the administrator later deletes or restricts access to your account, that storage is gone.

    Conclusion

    In my view, Google has done it again. They now offer an extremely affordable and secure way of storing your data. There are loads of people who have concerns about where their data is stored, who has ownership of the stored data, what Google does with the data, etc. A lot of these questions can easily be answered if you google (duh) the relevant terms or search in Google's Help Center. Data Liberation allows you to retrieve your data whenever you want to. If, on the other hand, you are skeptical and do not wish to store your data there, don't. It is your choice.

  • 2011-02-20 13:56:00

    Keeping your Linux users in sync in a cluster

    As websites become more and more popular, your application might not be able to cope with the demand that your users put on your server. To accommodate that, you will need to move beyond the single-server solution and create a cluster. The easiest cluster to create would be with two servers: one to handle your web requests (HTTP/PHP etc.) and one to handle your database load (MySQL). Again, this setup can only get you so far. If your site keeps growing, you will need a cluster of servers.

    Database

    The scope of this How-To does not cover database replication; I will need to dedicate a separate blog post to that. However, clustering your database is relatively easy with MySQL replication. You can set up a virtual hostname (something like mysql.mydomain.com) which is visible only within your network. You then configure your application to use that as the database host. The virtual hostname maps to the current master server, while the slave(s) only replicate.

    For instance, if you have two servers A and B, you can configure both of them to act as either master or slave in MySQL. You then set one of them as master (A) and the other as slave (B). If something happens to A, B gets promoted to master instantly. Once A comes back up, it gets demoted to a slave, and the process is repeated if/when B has a problem. This can be a very good solution, but you will need pretty evenly matched servers to keep up with the demand. Alternatively, B can be less powerful than A; when A comes back up you keep it momentarily as a slave (until everything is replicated) and then promote it back to master.
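    To make the virtual hostname idea concrete, here is a minimal sketch; the IP address is made up, and in practice you would use an internal DNS record or a floating IP managed by your failover tooling rather than editing files by hand:

    # /etc/hosts on each web server (or an internal DNS A record)
    # mysql.mydomain.com points at the current master and is updated on failover
    192.168.1.10    mysql.mydomain.com
    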

    One thing to note about replication (which I found out through trial and error): MySQL keeps binary logs to handle replication. If you are not cautious in your deployment, MySQL will never recycle those logs, and you will soon run out of space on a busy site. By default those logs live under /var/lib/mysql.

    By changing directives in my.cnf you can store the binary logs in a different folder and even set up 'garbage collection' or recycling. You can for instance set the logs to rotate every X days with the following directive in my.cnf:

    expire_logs_days = 5
    

    I set mine to 5 days, which is extremely generous. If your replication breaks, you must have the means to know about it within minutes (see Nagios for a good monitoring service). In most cases 2 days is more than enough.
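    Putting the two pieces together, a minimal my.cnf sketch could look like the following; the log path is only an example, so point it at a partition with enough space:

    [mysqld]
    # write the replication binary logs to a dedicated location (example path)
    log-bin          = /var/log/mysql-bin/mysql-bin
    # recycle the binary logs automatically after 5 days
    expire_logs_days = 5
    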

    Files

    There are numerous ways of keeping your cluster in sync. A really good tool that I have used when playing around with a cluster is csync2. Installation is really easy and all you will need is a cron task that runs every X minutes (up to you) to synchronize the new files. Imagine it as a two-way rsync. Another tool that can do this is unison, but I found it slow and difficult to implement - that's just me though.

    Assume an implementation of a website served by two (or more) servers behind a load balancer. If your users upload files, you don't know which server those files end up on. As a result, if user A uploads the file abc.txt to server A, user B might be served the content from server B and would not be able to access the file. csync2 synchronizes the file across the servers, thus providing access to the content and keeping multiple copies of it (additional backup, if you like).
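    A rough sketch of such a setup follows; the host names, key path and folders are examples, not a drop-in configuration:

    # /etc/csync2.cfg - keep the upload folder in sync between the web servers
    group webcluster {
        host serverA.mydomain.com serverB.mydomain.com;
        key  /etc/csync2.key;
        include /var/www/uploads;
        exclude *.tmp;
    }

    # crontab entry on each server: check and synchronize every 5 minutes
    */5 * * * * csync2 -x >/dev/null 2>&1
    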

    NFS

    An alternative to keeping everything synchronized is to use NFS. This approach has many advantages and some disadvantages. It is up to you whether the disadvantages are something you can live with.

    Disadvantages
    • NFS is slow - slower than the direct access to a local hard drive.
    • Most likely you will use a symlink to the NFS folder, which can slow things down even more.
    Advantages
    • The NFS does not rely on the individual web servers for content.
    • The web servers can be low to medium spec boxes without the need for really fast and large hard drives.
    • A well designed NFS setup with DRBD provides a RAID-1 over the network. Using gigabit network interface cards you can keep performance at really high levels.

    I know that my friend Floren does not agree with my approach on the NFS and would definitely have gone with the csync2 approach. Your implementation depends on your needs.
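    For illustration only, a bare-bones NFS export and mount could look like this; the host name, network range and paths are examples:

    # /etc/exports on the NFS server - share the web content with the web servers
    /srv/www    192.168.1.0/24(rw,sync,no_subtree_check)

    # on each web server - mount the share where the application expects it
    mount -t nfs nfs.mydomain.com:/srv/www /var/www/shared
    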

    Users and Groups

    Using the NFS approach, we need to keep the files and permissions properly set up for our application. Assume that we have two servers and we need to create one user to access our application and upload files.

    The user has been created on both servers and the files are stored on the NFS. Connecting to server A and looking at the files we can see something like this:

    drwxr-xr-x 17 niden  niden  4096 Feb 18 13:41 www.niden.net
    drwxr-xr-x  5 niden  niden  4096 Nov 15 22:10 www.niden.net.files
    drwxr-xr-x  7 beauty beauty 4096 Nov 21 17:42 www.beautyandthegeek.it
    

    However when connecting to server B, the same listing tells another story:

    drwxr-xr-x 17 508    510    4096 Feb 18 13:41 www.niden.net
    drwxr-xr-x  5 508    510    4096 Nov 15 22:10 www.niden.net.files
    drwxr-xr-x  7 510    511    4096 Nov 21 17:42 www.beautyandthegeek.it
    

    The problem here is the uid and gid of each user and group respectively. Somehow (and this can happen really easily) server A had one or more users added to it, thus the internal counter of user IDs was increased by one or more and is no longer identical to that of server B. So adding a new user on server A will produce the uid 510, while on server B the same process will produce a user with a uid of 508.
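    A quick way to spot such a mismatch is to compare the numeric IDs on each server, using the niden user from the listing above as an example:

    # run on every server and compare the output; the uid and gid should match everywhere
    id niden
    getent passwd niden
    getent group niden
    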

    To have all users set up the same way on all servers, we need to use two commands: groupadd and useradd (in some Linux distributions you might find them as addgroup and adduser).

    groupadd

    First of all you will need to add the groups. You can of course keep all users in one group, but my implementation was to keep one user and one group per access. To cater for that I had to first create a group for every user and then the user account itself. Like users, groups have unique IDs (gid). The purpose of the gid is:

    The numerical value of the groups ID. This value must be unique, unless the -o option is used. The value must be non-negative. The default is to use the smallest ID value greater than 999 and greater than every other group. Values between 0 and 999 are typically reserved for system accounts.

    I chose to assign each group a unique ID (you can override this behavior by using the -o switch in the command below, thus allowing a gid to be used by more than one group). The arbitrary starting number that I chose was 2000.

    As an example, I will set niden as the user/group for accessing this site and beauty as the user/group that accesses BeautyAndTheGeek.IT. Note that this is only an example.

    groupadd --gid 2000 niden
    groupadd --gid 2001 beauty
    

    Repeat the process as many times as needed for your setup. Then connect to the second server and repeat it there. Of course, if you have more than two servers, repeat the process on each of the servers that you have (and each one that accesses your NFS).

    useradd

    The next step is to add the users. Like groups, we will need to set the uid up. The purpose of the uid is:

    The numerical value of the users ID. This value must be unique, unless the -o option is used. The value must be non-negative. The default is to use the smallest ID value greater than 999 and greater than every other user. Values between 0 and 999 are typically reserved for system accounts.

    Like with the groups, I chose to assign each user a unique id starting from 2000.

    So, for the example above, the commands that I used were:

    useradd --uid 2000 -g niden --create-home niden
    useradd --uid 2001 -g beauty --create-home beauty
    

    You can also use a different syntax, utilizing the numeric gids:

    useradd --uid 2000 --gid 2000 --create-home niden
    useradd --uid 2001 --gid 2001 --create-home beauty
    

    Again, repeat the process as many times as needed for your setup and to as many servers as needed.

    In the examples above I used the --create-home switch (or -m) so that a home folder is created under /home for each user. Your setup might not need this step. Check the references at the bottom of this blog post for the manual pages for groupadd and useradd.

    I would suggest that you keep a log of which user/group has which uid/gid. It helps in the long run, plus it is a good habit to keep proper documentation on projects :)

    Passwords?

    So how about the passwords on all servers? My approach is crude but effective. I connected to the first server, and set the password for each user, writing down what the password was:

    passwd niden
    

    Once I had all the passwords set, I opened the /etc/shadow file.

    nano /etc/shadow
    

    and that revealed a long list of users and their scrambled passwords:

    niden:$$$$long_string_of_characters_goes_here$$$$:13864:0:99999:7:::
    beauty:$$$$again_long_string_of_characters_goes_here$$$$:15009:0:99999:7:::
    

    Since I know that I added niden and beauty as users, I copied these two lines. I then connected to the second server, opened /etc/shadow and located the two lines where the niden and beauty users are referenced. I deleted the existing lines and pasted the ones that I had copied from server A. I saved the file and now my passwords are synchronized on both servers.
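    If you prefer to do the copy step from the command line, something along these lines works as well; this is only a sketch, so back up /etc/shadow before touching it:

    # on server A: extract the hashed entries for the two users
    grep -E '^(niden|beauty):' /etc/shadow > /tmp/shadow-entries

    # copy /tmp/shadow-entries to server B (with scp, for instance), then on server B
    # keep every other line, append the copied entries and swap the file in place
    cp /etc/shadow /etc/shadow.bak
    grep -vE '^(niden|beauty):' /etc/shadow > /tmp/shadow.new
    cat /tmp/shadow-entries >> /tmp/shadow.new
    cp /tmp/shadow.new /etc/shadow
    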

    Conclusion

    The above might not be the best way of keeping users in sync in a cluster, but it gives you an idea of where to start. There are different implementations available (Google is your friend) and your mileage may vary. The above has worked for me for a number of years, since I never needed to add more than a handful of users on the servers each year.

    References

  • 2010-08-21 13:38:00

    Create an inexpensive hourly remote backup

    There are two kinds of people, those who back up regularly, and those who have never had a hard drive fail

    As you can tell the above is my favorite quote. It is so true and I believe everyone should evaluate how much their data (emails, documents, files) is worth to them and, based on that value, create a backup strategy that suits them. I know for sure that if I ever lost the pictures and videos of my family I would be devastated since those are irreplaceable.

    So the question is: how can I have an inexpensive backup solution? All my documents and emails are stored with Google, since my domain is on Google Apps. What happens, though, to the live/development servers that host all my work? I program on a daily basis and the code has to be backed up regularly, so that a hard drive failure does not result in a loss of time and money.

    So here is my solution. I have an old computer (an IBM ThinkCentre) which I decided to beef up a bit. I bought 4GB of RAM for it from eBay for less than $100. Although this was not necessary, since my solution would be based on Linux (Gentoo in particular), I wanted to have faster compilation times for packages.

    I bought two external drives (750GB and 500GB respectively) and one 750GB internal drive. I already had a 120GB hard drive in the computer. The two external drives are connected to the computer using USB, while the internal ones are connected using SATA.

    The external drives are formatted using NTFS while the whole computer is built using ReiserFS.

    Here is the approach:

    • I have a working Gentoo installation on the machine
    • I have an active Internet connection
    • I have installed LVM on the machine and set up the core system on the 120GB drive, while the 500GB is on LVM
    • I have 300GB active on the LVM (from the available 500GB)
    • I have generated a public SSH key, which I need to exchange with the target servers (see the sketch after this list)
    • I have mounted the internal 500GB drive to the /storage folder
    • I have mounted the external USB 750GB drive to the /backup_hourly folder
    • I have mounted the external USB 500GB drive to the /backup_daily folder
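    The key exchange mentioned in the list above can be done along these lines, assuming the backups run as root and using one of the servers from this post as an example:

    # generate a passwordless key so the cron job can run unattended
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

    # copy the public key to each source server (repeat for every server)
    ssh-copy-id root@beryllium.niden.net
    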

    Here is how my backup works:

    Every hour a script runs. The script uses rsync to synchronize files and folders from the remote servers locally. Those files and folders are kept in subfolders of the /storage folder (remember, this is my LVM), named after the relevant server. So for instance my subfolders will be /storage/beryllium.niden.net, /storage/nitrogen.niden.net, /storage/argon.niden.net etc.

    Once the rsync completes, the script continues by compressing the relevant 'server' folder, creating a compressed file with a date-time stamp in its name.

    When all the compressions are completed, and if the script ran at midnight, the backups are moved from the /storage folder to the /backup_daily folder (which has the external USB 500GB drive mounted). At any other time, the files are moved to the /backup_hourly folder (which has the external USB 750GB drive mounted).

    This way I ensure that I keep a lot of backups (daily and hourly ones). The backups are being recycled, so older ones get deleted. The amount of data that you need to archive as well as the storage space you have available dictate how far back you can go in your hourly and daily cycles.

    So let's get down to business. The script itself:

    #!/bin/bash
    DATE=`date +%Y-%m-%d-%H-%M`
    DATE2=`date +%Y-%m-%d`
    DATEBACK_HOUR=`date --date='6 days ago' +%Y-%m-%d`
    DATEBACK_DAY=`date --date='60 days ago' +%Y-%m-%d`
    FLAGS="--archive --verbose --numeric-ids --delete --rsh=ssh"
    BACKUP_DRIVE="/storage"
    DAY_USB_DRIVE="/backup_daily"
    HOUR_USB_DRIVE="/backup_hourly"
    

    These are some variables that I need for the script to work. DATE and DATE2 are used to date/time stamp the backups, while the DATEBACK_* variables are used to clear previous backups. In my case they are set to 6 days ago for the hourly backups and 60 days ago for the daily ones. They can be set to whatever you want, provided that you do not run out of space.

    The FLAGS variable keeps the rsync command options, while BACKUP_DRIVE, DAY_USB_DRIVE and HOUR_USB_DRIVE hold the locations of the rsync folders and of the daily and hourly backup storage areas.

    The script works with arrays. I have 4 arrays to do the work, and 3 of them must have exactly the same number of elements.

    # RSync Information
    rsync_info[1]="beryllium.niden.net html rsync"
    rsync_info[2]="beryllium.niden.net db rsync"
    rsync_info[3]="nitrogen.niden.net html rsync"
    rsync_info[4]="nitrogen.niden.net db rsync"
    rsync_info[5]="nitrogen.niden.net svn rsync"
    rsync_info[6]="argon.niden.net html rsync"
    

    This is the first array, which holds a description of what needs to be done as far as the source is concerned. These descriptions get appended to the log and help me identify which step I am in.

    # RSync Source Folders
    rsync_source[1]="beryllium.niden.net:/var/www/localhost/htdocs/"
    rsync_source[2]="beryllium.niden.net:/niden_backup/db/"
    rsync_source[3]="nitrogen.niden.net:/var/www/localhost/htdocs/"
    rsync_source[4]="nitrogen.niden.net:/niden_backup/db"
    rsync_source[5]="nitrogen.niden.net:/niden_backup/svn"
    rsync_source[6]="argon.niden.net:/var/www/localhost/htdocs/"
    

    This array holds the source host and folder. Remember that I have already exchanged SSH keys with each server, therefore when the script runs there is a direct connection to the source server. If you log in with a different (non-root) user, you will need to alter the contents of the rsync_source array so that it reflects that user (user@host:/folder); authentication is still handled by the exchanged SSH keys.

    # RSync Target Folders
    rsync_target[1]="beryllium.niden.net/html/"
    rsync_target[2]="beryllium.niden.net/db/"
    rsync_target[3]="nitrogen.niden.net/html/"
    rsync_target[4]="nitrogen.niden.net/db/"
    rsync_target[5]="nitrogen.niden.net/svn/"
    rsync_target[6]="argon.niden.net/html/"
    

    This array holds the target locations for the rsync. These folders exist in my case under the /storage subfolder.

    # GZip target files
    servers[1]="beryllium.niden.net"
    servers[2]="nitrogen.niden.net"
    servers[3]="argon.niden.net"
    

    This array holds the names of the folders to be archived. These are the folders directly under the /storage folder and I am also using this array for the prefix of the compressed files. The suffix of the compressed files is a date/time stamp.

    Here is how the script evolves:

    echo "BACKUP START" >> $BACKUP_DRIVE/logs/$DATE.log
    date >> $BACKUP_DRIVE/logs/$DATE.log
    
    # Loop through the RSync process
    element_count=${#rsync_info[@]}
    let "element_count = $element_count + 1"
    index=1
    while [ "$index" -lt "$element_count" ]
    do
        echo ${rsync_info[$index]} >> $BACKUP_DRIVE/logs/$DATE.log
        rsync $FLAGS ${rsync_source[$index]} $BACKUP_DRIVE/${rsync_target[$index]} >> $BACKUP_DRIVE/logs/$DATE.log
        let "index = $index + 1"
    done
    

    The snippet above loops through the rsync_info array and prints out the information in the log file. Right after that it uses the rsync_source and rsync_target arrays (as well as the FLAGS variable) to rsync the contents of the source server with the local folder. Remember that all three arrays have to be identical in size (rsync_info, rsync_source, rsync_target).

    The next thing to do is zip the data (I loop through the servers array)

    # Looping to GZip data
    element_count=${#servers[@]}
    let "element_count = $element_count + 1"
    index=1
    while [ "$index" -lt "$element_count" ]
    do
        echo "GZip ${servers[$index]}" >> $BACKUP_DRIVE/logs/$DATE.log
        tar cvfz $BACKUP_DRIVE/${servers[$index]}-$DATE.tgz $BACKUP_DRIVE/${servers[$index]} >> $BACKUP_DRIVE/logs/$DATE.log
        let "index = $index + 1"
    done
    

    The compression method I use is tar/gzip. I found it to be fast with a good compression ratio. You can choose anything you like.

    Now I need to delete the old files from the drives and copy the new files onto them. I use the servers array again.

    # Looping to copy the produced files (if applicable) to the daily drive
    element_count=${#servers[@]}
    let "element_count = $element_count + 1"
    index=1
    
    while [ "$index" -lt "$element_count" ]
    do
        # Copy the midnight files
        echo "Removing old daily midnight files" >> $BACKUP_DRIVE/logs/$DATE.log
        rm -f $DAY_USB_DRIVE/${servers[$index]}/${servers[$index]}-$DATEBACK_DAY*.* >> $BACKUP_DRIVE/logs/$DATE.log
        echo "Copying daily midnight files" >> $BACKUP_DRIVE/logs/$DATE.log
        cp -v $BACKUP_DRIVE/${servers[$index]}-$DATE2-00-*.tgz $DAY_USB_DRIVE/${servers[$index]} >> $BACKUP_DRIVE/logs/$DATE.log
        rm -f $BACKUP_DRIVE/${servers[$index]}-$DATE2-00-*.tgz >> $BACKUP_DRIVE/logs/$DATE.log

        # Now copy the files in the hourly
        echo "Removing old hourly files" >> $BACKUP_DRIVE/logs/$DATE.log
        rm -f $HOUR_USB_DRIVE/${servers[$index]}/${servers[$index]}-$DATEBACK_HOUR*.* >> $BACKUP_DRIVE/logs/$DATE.log
        echo "Copying hourly files" >> $BACKUP_DRIVE/logs/$DATE.log
        cp -v $BACKUP_DRIVE/${servers[$index]}-$DATE.tgz $HOUR_USB_DRIVE/${servers[$index]} >> $BACKUP_DRIVE/logs/$DATE.log
        rm -f $BACKUP_DRIVE/${servers[$index]}-$DATE.tgz >> $BACKUP_DRIVE/logs/$DATE.log
        let "index = $index + 1"
    done
    
    echo "BACKUP END" >> $BACKUP_DRIVE/logs/$DATE.log
    

    The last part of the script loops through the servers array and:

    • Deletes the old files (recycling of space) from the daily backup drive (/backup_daily) according to the DATEBACK_DAY variable. If the files are not found, a warning will appear in the log.
    • Copies the daily midnight file to the daily drive (if the file does not exist it will simply echo a warning in the log - I do not worry about warnings of this kind in the log file and was too lazy to use an IF EXISTS condition)
    • Removes the daily midnight file from the /storage drive.

    The reason I am using copy and then remove instead of the move (mv) command is that I have found this method to be faster.

    Finally the same thing happens with the hourly files

    • Old files are removed (DATEBACK_HOUR variable)
    • Hourly file gets copied to the /backup_hourly drive
    • Hourly file gets deleted from the /storage drive

    All I need now is to add the script to my crontab and let it run every hour.
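    For reference, the crontab entry could look like this; the script path and log location are just examples:

    # run the backup script at the top of every hour
    0 * * * * /root/scripts/hourly-backup.sh >> /storage/logs/cron.log 2>&1
    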

    NOTE: The first time you run the script you will need to do it manually (not via cron). The reason is that on the first run rsync needs to download the entire contents of the source servers/folders into the /storage drive so as to create an exact mirror. Once that lengthy step is done, the script can be added to the crontab. Subsequent runs will download only the changed/deleted files.

    This method can be very effective while not using a ton of bandwidth every hour. I have used this method for the best part of a year now and it has saved me a couple of times.

    The last thing I need to present is the backup script that I have for my databases. As you can see above, one of the rsync sources for beryllium.niden.net is its database backup folder (/niden_backup/db/). What I do is dump and compress the databases every hour on my servers. Although this is not a very efficient way of doing things, and it adds to the bandwidth consumption every hour (since the dump creates a new file every hour), I have the following script running on my database servers every hour at the 45th minute:

    #!/bin/bash
    
    DBUSER=mydbuser
    DBPASS='dbpassword'
    DBHOST=localhost
    BACKUPFOLDER="/niden_backup"
    DBNAMES="`mysql --user=$DBUSER --password=$DBPASS --host=$DBHOST --batch --skip-column-names -e "show databases"| sed 's/ /%/g'`"
    OPTIONS="--quote-names --opt --compress "
    
    # Clear the backup folder
    rm -fR $BACKUPFOLDER/db/*.*
    
    for i in $DBNAMES; do
        echo Dumping Database: $i
        mysqldump --user=$DBUSER --password=$DBPASS --host=$DBHOST $OPTIONS $i > $BACKUPFOLDER/db/$i.sql
        tar cvfz $BACKUPFOLDER/db/$i.tgz $BACKUPFOLDER/db/$i.sql
        rm -f $BACKUPFOLDER/db/$i.sql
    done
    

    That's it.

    The backup script can be found in my GitHub here.

    Update: The metric units for the drives were GB not MB. Thanks to Jani Hartikainen for pointing it out.

  • 2009-11-18 12:00:00

    Google Paid Storage

    A week or so ago I read a blog post in my Google Reader about Google now providing more storage for less money.

    To be quite frank, I did not read the whole post, but I did get the message. Google was offering 10GB for $20.00 and now they are offering 20GB for $5.00. This extra storage is mostly for Picasa Web Albums, but it can be used for other products like GMail (if you ever get above the 7.5GB that you already have there).

    Although I was really happy to see such a move, I was kind of saddened, since not more than a month ago I had decided to purchase the 10GB for $20.00, so I didn't get to take advantage of the new rate. The reason for the extra storage is that I can store pictures and videos for my family to see. Since most of my side of the family is located in Greece and my family and I are in the US, it only makes sense to take advantage of the Internet to keep in touch. Photographs of the different events that we attend are now available to them too, while we keep a good journal of events through the years.

    Logging into my Picasa web album, I was pleasantly surprised to see that my storage is not 10GB but 81GB! I could not believe my eyes and frankly I thought that Google had made a mistake. I dug up the blog post and found out what had happened. It appears that, by not reading the whole article, I had missed the part that said:

    and people who have extra storage will be automatically upgraded.

    The funny thing is that they even counted the 1GB that Picasa comes with (for free) once you sign up for their web albums.

    All I can say is that now I will probably store more and more media online, not only for my family abroad to watch but for backup reasons too.

    All we need now is a GDrive - a drive extension to connect to our online storage so that we can store everything online and never worry about anything - computer crashes and all....

  • 2009-11-03 12:00:00

    Flexible storage in MySQL

    We all need data to function, whether it is information about what our body craves at the moment - so we go to the local take-away to get it, or cook it - or electronic data that makes our tasks easier.

    Storing data in an electronic format is always a challenge. When faced with a new project, you always try to out-think the project's needs and ensure that you have covered all the possible angles. Some projects, though, are plain vanilla: say you only need to enter the customer's name, surname, address and phone. But what happens when you need to store data whose type you do not know in advance?

    This is where flexible storage comes into play. You can develop a database design that stores data efficiently (well, within reason) without knowing what the data will be.

    Say we need to build an application that will be given to a customer to store data about his contacts, without knowing what fields the customer needs. Fair enough: storing the name, surname, address, phone etc. of a contact is easy and an expected feature. But what about a customer who needs to store the operating system each contact uses on their computer? How about storing the contact's favorite food recipe, their car mileage, etc.? Information is so diverse that you can predict what is needed only up to a point; after that you just face chaos. Of course, if the application we are building is intended for one customer, then everything is simpler. What if more than one customer is our target audience? We certainly cannot fill the database with fields that will definitely be empty for certain customers.

    A simple way of storing such information is to store a type and a value. The first field (type_id) will be numeric and hold the ID of the field, while the second field (field_data) will be of the TEXT type and hold the "value". The reason for TEXT is that we don't know the size of the data that will be stored there. Indexes on both fields can help speed up the searches. If you use MySQL 4+ you can opt for the FULLTEXT indexing method rather than the one used in previous MySQL versions.

    We also need a second table to hold the list of our data types. This table will have 2 columns: an ID (an AUTO_INCREMENT INT) and a VARCHAR column to hold the description of the field.

    CREATE TABLE data_types (
        type_id MEDIUMINT( 8 ) UNSIGNED NOT NULL AUTO_INCREMENT,
        type_name VARCHAR( 50 ) NOT NULL,
        PRIMARY KEY ( type_id )
    );
    

    The table to store the data in will be as follows:

    CREATE TABLE data_store (
        cust_id MEDIUMINT( 8 ) UNSIGNED NOT NULL ,
        type_id MEDIUMINT( 8 ) UNSIGNED NOT NULL ,
        field_data TEXT NOT NULL,
        PRIMARY KEY ( cust_id, type_id )
    );
    

    And also creating another index:

    ALTER TABLE data_store ADD FULLTEXT (field_data);
    

    (Note that the FULLTEXT support is a feature of MySQL version 4+)

    So what does this design do for us? Say we need to store the information of Mr. John Doe, 123 Somestreet Drive, VA, USA, +1 (000) 12345678, who likes cats and has a Ford Mustang.

    We first add the necessary fields we need to store in our data_types table. These fields for our example are as follows:

    1 - Title

    2 - Country

    3 - Favorite animal

    4 - Car

    The numbers in front are the IDs that I got when entering the data in the table.

    Assuming that this contact has a unique ID of 1, we are off to store the data in our table. In essence we will be adding 4 records into the data_store table for every contact we have. The cust_id field holds the unique ID of each contact so that we can match the information to a single contact as a block.

    INSERT INTO data_store
        (cust_id, type_id, field_data)
        VALUES
        ('1', '1', 'Mr.'),
        ('1', '2', 'USA'),
        ('1', '3', 'Cat'),
        ('1', '4', 'Ford Mustang');
    

    That's it. Now Mr. John Doe is in our database.

    Adding a new field is as easy as adding a new record to our data_types table. With a bit of clever PHP you can then read the data_types table and display the data from the data_store table.

    We can use the above example to store customer data either as a whole or as a supplement. So, for instance, we can start by storing the customer ID, first name, surname etc. as fields in the data_store table too, each with its own data type. Alternatively, we can keep the core data in a separate table (storing the first name, surname, address etc.) and link that table with the data_store one.

    This approach, although very flexible, has its disadvantages. The first one is that each record uses a TEXT field to store data, which is huge overkill for data types that are meant to store boolean values or integers. Another big disadvantage is searching through the table: the data is TEXT, and it is also structured vertically in blocks. So if you need to search for everyone living in the USA, you first need to find the data_type representing the Country field and then match it against the field_data field of the data_store table.
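    As an illustration of that last point, a search for everyone living in the USA could look roughly like this (a sketch only, using the tables defined above):

    SELECT ds.cust_id
    FROM data_store ds
        INNER JOIN data_types dt ON dt.type_id = ds.type_id
    WHERE dt.type_name = 'Country'
        AND ds.field_data = 'USA';
    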

    There is no one right way of doing something in programming. It all depends on the circumstances and of course to the demands of the application we are developing.

    This is just another way of storing data.