This blog post is going to be a little different for two reasons.
- This is going to be a discussion of a security product that, to the best of my knowledge, is still in active use on real websites, so publicly posting the code would not be in the best interest of those clients.
- The code that I wrote was written for my employer, and so I don’t own it.
So, none of the actual code will appear in this post, and any snippets below are generic illustrations rather than the real thing. That said, I think it is still well worth talking about, and it’s a great demonstration of how a solution can evolve over time to meet new needs.
The Problem
When I was first hired in October of 2016, my employer was a small company that specialized in providing IT services to other small businesses and was just growing out of a large staff turnover. The owner was doing the majority of the development work himself, and needed someone to take as much of it off his plate as possible so he could concentrate on other aspects of the business. He ended up hiring me and giving me an opportunity to grow and to learn. At the time of my hire, I was entirely self-taught, and my only experience with PHP and SQL came from a few online tutorials I was following in order to complete a freelance project at the same time.
Because the owner was busy, I continued to be almost entirely self-taught, but because multiple clients were waiting on me, there was a lot more motivation to learn, and to learn quickly. For about a year, I would also be the only full-time developer.
One of the first things I learned in that time is that the combination of a stretched or inexperienced development staff and lean budgets from clients creates an environment where highly technical aspects, like security, are ignored, overlooked, or plain forgotten. And this is true of both sides of the relationship, clients and developers.
In short, because I was supporting a relatively large number of clients, somewhere around 10–15 on a regular basis and 100 or so on an as-needed basis, and I was still a relatively new developer, many potential security compromises slipped through the cracks, and you can bet that bad actors from the internet took advantage.
We were lucky. We had a combination of understanding clients, regular site backups, low value sites, and bad actors who were more interested in defacing websites than making a profit. And that combination allowed me to learn effectively and quickly how to deal with problems like this, and slowly craft my own automated solutions and defenses to prevent it in the future. Each event was a chance to learn, grow, and improve the service we were providing to our clients.
The Methods
I’ve already mentioned the first method: backups. It was the main method we used at first because it was the simplest and most time-effective. We maintained regular weekly, monthly, and yearly backups, and since the majority of the client websites didn’t contain any particularly sensitive information (the clients frequently didn’t set their own passwords, and their databases only contained information about the site itself, no customer or user data), restoring from a backup was enough to resolve the most obvious issue and reverse whatever damage was done.
Now, you’ll note that this did not remove whatever exploit the bad actors used to break into the website in the first place. But it did mean that the websites could return to their purpose of providing information to the customers of those businesses.
After a while, our college intern joined the company full time, and I was able to spend more of my time on server administration and infrastructure. At this point I was able to spend some time writing a horrifically simple crawler that would check the homepages of websites and try to determine if they had been defaced.
The code itself is laughable to me now, but the ideas behind it were sound. We had a database of the websites it would crawl, and it would save whatever it found to that database. It would compare that to the last known crawl tagged as “Clean”, and if there was a difference between the two it would alert me. As I continued developing it, it would also look for the most obvious sign of defacement, the term “hacked”, via a simple string search. Like I said, this was super simple code, but it was a functional prototype. It was checking for things that I normally did manually, and it could run in the background while I worked on something else.
It had a lot of downsides though. First of all, the database was huge. These days I would use hashes for storage and comparison, but back then I was storing the full HTML (so that we could analyze it after the fact, I suppose). The script was also slow. It did not work in parallel, and since it ran locally it was dependent on the network resources available to it. The server was located in a data center in another town, so the script could be a network hog; while it was running, accessing any resources over the network from the same device was noticeably slower.
That said, it also helped us recover sites before the defacement could start to affect their SEO and page rankings, so I pushed forward on the development. At first, the added checks were just for the existence of the correct Google Analytics codes and site registration meta tags, things that were frequently forgotten by rushed developers or clients who didn’t know better, so having automated oversight to ensure that clients weren’t getting shortchanged added significant value to our services.
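To make the idea concrete, here is a rough PHP sketch of what a crawler like that might look like. It is not the code I actually wrote; the table, function, and variable names (crawls, check_homepage, $gaCode) are invented for the example, and it stores content hashes rather than the full HTML, which is the improvement I mentioned above.

```php
<?php
// Illustrative sketch only -- not the original monitoring code.
// Assumes a table `crawls(site_id, content_hash, status, crawled_at)` that
// records each crawl, with known-good crawls tagged as "Clean".

function check_homepage(PDO $db, int $siteId, string $url, string $gaCode): array
{
    $html = file_get_contents($url); // naive fetch; no retries or timeout handling
    if ($html === false) {
        return ['unreachable'];
    }

    $issues = [];

    // Compare a hash of the page to the last crawl tagged "Clean"
    $hash = hash('sha256', $html);
    $stmt = $db->prepare(
        "SELECT content_hash FROM crawls
         WHERE site_id = ? AND status = 'Clean'
         ORDER BY crawled_at DESC LIMIT 1"
    );
    $stmt->execute([$siteId]);
    $lastClean = $stmt->fetchColumn();
    if ($lastClean !== false && $lastClean !== $hash) {
        $issues[] = 'content changed since last clean crawl';
    }

    // The crudest possible defacement check: a plain string search
    if (stripos($html, 'hacked') !== false) {
        $issues[] = 'suspicious term "hacked" found';
    }

    // Make sure the expected Google Analytics code is still present
    if (strpos($html, $gaCode) === false) {
        $issues[] = 'expected analytics code missing';
    }

    // Record this crawl; it gets tagged "Clean" later, after review
    $db->prepare("INSERT INTO crawls (site_id, content_hash, status, crawled_at)
                  VALUES (?, ?, 'Pending', NOW())")
       ->execute([$siteId, $hash]);

    return $issues;
}
```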
The next step was to move the script onto the server itself so it could run faster, and potentially use the extra overhead to do more complex analysis of the websites. Once that was done, the whole team could use it regularly, and multi-tasking while it was running was easier.
Now, at the same time I was developing this automated site health monitor, I was doing a lot of research into how shells worked, how they were uploaded, what they did once they were, and how they were accessed. I had a number of samples from the folks who were defacing our websites, and I was able to start reverse engineering them over time. I even wrote a few of my own to learn firsthand all the ways they could be written, and what they had in common.
One of my first solutions was all about identifying changes in the file system. It was also super simple. I just used the terminal to create a list of all files in the hosting account, and then output that list to a file that I saved outside of the hosting itself. Then, on a regular basis, I would generate a fresh list and have the terminal show me the differences between the two. This would point me toward files that I needed to check manually to ensure the code was not malicious. It was still mostly manual, and time-consuming on our larger hosting clients, but it also meant that I was finding more of the problems and had a much better idea of the kind of activity that was normal for different accounts.
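At the time this was plain terminal commands, but the same idea expressed as a small PHP sketch looks something like the following. The paths and file names are invented for the example.

```php
<?php
// Illustrative sketch, not the original tooling. Builds a list of every file
// under a hosting account and diffs it against a previously saved list kept
// outside the hosting itself.

function list_files(string $root): array
{
    $files = [];
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $file) {
        $files[] = $file->getPathname();
    }
    sort($files);
    return $files;
}

$current  = list_files('/var/www/example-client/public_html');
$previous = file('/root/file-lists/example-client.txt', FILE_IGNORE_NEW_LINES) ?: [];

// Files that appeared or disappeared since the last snapshot
$added   = array_diff($current, $previous);
$removed = array_diff($previous, $current);

foreach ($added as $path)   { echo "NEW:     $path\n"; }
foreach ($removed as $path) { echo "MISSING: $path\n"; }

// Save the fresh snapshot for next time
file_put_contents('/root/file-lists/example-client.txt', implode("\n", $current));
```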
After a few months I had enough information on the problem that I was starting to think of a solution.
One of the first WordPress plugins I was taught about was Wordfence. Their security research team does a lot of good work, and their plugin is a must-install for every WordPress site I work on. So when I was trying to design a solution for our clients not on WordPress, their code was the first place I looked. Credit to the Wordfence team: most of the interesting work is not done in the plugin itself but by their API, which is incredibly important for a security project working in an open source ecosystem, written in a language that is not precompiled. It means that bad actors can’t pick apart exactly how the code works and easily write their way around the defenses put into place.
But what you can learn is how Wordfence ensures that its code is executed before any other WordPress components. Now, I wouldn’t be surprised if there were several methods used to achieve this effect, but the one I found was a PHP setting called “auto_prepend_file”, which simply instructs the PHP process to always include the declared file before running any other code.
I played with that for a while, and then I just ran with it. I took my idea of the list of files and wrote a PHP script that could generate that list, save it to a database, and update it as needed. Then, I wrote a very small file that would be loaded via “auto_prepend_file” and compare whatever script was being requested against that database. If it wasn’t in the database, it killed the process with a generic error, effectively cutting off access to any stand-alone shells that were uploaded, which was about 80% of what we were seeing.
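A minimal sketch of that allowlist idea is below. Again, this is not the production code; the database name, credentials, and the known_files table are placeholders for the example.

```php
<?php
// prepend.php -- loaded on every request via the php.ini setting:
//   auto_prepend_file = /path/to/prepend.php
// Generic sketch of the allowlist idea, not the production code.
// Assumes a table `known_files(path)` populated by the file-list generator.

$requested = realpath($_SERVER['SCRIPT_FILENAME'] ?? '');

try {
    $db = new PDO('mysql:host=localhost;dbname=sitewatch', 'user', 'pass');
    $stmt = $db->prepare('SELECT 1 FROM known_files WHERE path = ? LIMIT 1');
    $stmt->execute([$requested]);

    if ($stmt->fetchColumn() === false) {
        // The entry point isn't in the allowlist -- most likely an uploaded shell.
        http_response_code(500);
        exit('Internal Server Error');
    }
} catch (PDOException $e) {
    // If the database is unreachable, this sketch fails open and lets the
    // request continue; how paranoid to be here is a judgment call.
}
```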
In hindsight, it was exceedingly simple, but it was also hugely effective. That first version stopped the vast majority of the events we were seeing on the server. It wasn’t perfect. It could get easily confused by the way WordPress or OpenCart would rewrite URLs, but since WordPress already had a strong solution and OpenCart is… well, OpenCart (and made up only a small fraction of our client base), it was a compromise we were willing to make. The code was simple enough that if we needed to make on-the-fly modifications to create site-specific exceptions, we could do so easily.
The other big downside of this version was that if the file was in the database, it was executed, no matter what was in the file. So I turned back to my inspiration, Wordfence. This time, it wasn’t the code of Wordfence itself, but an extended blog post they did about a strain of WordPress malware they were calling “Baba Yaga”. The malware was designed to turn infected websites into illegal traffic farms for ad networks and scam sites. In order to operate as long as possible without detection, it would actually scan the WordPress installation for other, more obvious, forms of malware and remove them automatically for the user. Looking at the code and reading through their explanations, I knew that while I didn’t have the background, or the resources, to write a full website antivirus, I could start looking for patterns and flagging files for review. I was already crawling the file system to generate my list of files, so I had half of the code already.
I approached the problem from several directions, under the belief that something that generated false positives would be a better starting point than something that missed obvious signs. The first and easiest step was saving the file properties to the database so they could be compared. We saved permissions, timestamp, name, owner, and a small hash of the file contents. Between these we could easily identify when a file changed, get a good overview of what had changed about it, and easily spot files that were not there the last time we scanned. Then I started writing out the various functions and code quirks that I had seen over and over in my malware samples; we would search for each in every file we opened, and if one of the patterns was found, we would flag the file.
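Roughly, the scan looked something like this sketch. The pattern list here is just a few well-known examples of the sort of thing shells tend to contain; the real definitions were more extensive, and the function name is invented.

```php
<?php
// Rough sketch of the metadata-plus-patterns scan, with invented names.

$suspiciousPatterns = [
    '/eval\s*\(\s*base64_decode/i',
    '/gzinflate\s*\(\s*base64_decode/i',
    '/preg_replace\s*\(.*\/e[\'"]/iU',             // the old /e modifier trick
    '/\$_(GET|POST|REQUEST)\s*\[[^\]]+\]\s*\(/i',  // variable functions fed from user input
];

function scan_file(string $path, array $patterns): array
{
    $stat = stat($path);
    $record = [
        'path'        => $path,
        'permissions' => substr(sprintf('%o', fileperms($path)), -4),
        'mtime'       => $stat['mtime'],
        'owner'       => $stat['uid'],
        'hash'        => hash_file('sha256', $path),
        'flagged'     => false,
    ];

    $contents = file_get_contents($path);
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $contents)) {
            $record['flagged'] = true; // human review required
            break;
        }
    }
    return $record;
}
```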
The best part about these changes is that they started identifying shells that had been uploaded but couldn’t be used. It’s not the most common attack, but there were a number of shells I found in normally non-executable files: images, icons, error logs, text files. The attack comes in two parts: first you upload your code under a “safe” extension, and then you either modify the server settings to make that file type executable by PHP, or you rename the file to an executable extension. These were leftovers from attacks where the files were successfully uploaded but the next stage never happened. This was good because it gave us more malware samples and identified more vectors of attack, helping us track down upload forms that weren’t secured well enough and other holes in framework or custom code.
At first I was running all of this manually, updating definitions by hand, running scans by switching flags in the code and visiting temporary URLs, but I had this other site health monitor project that was already doing automated scans, so I linked the two and wrote up a report that aggregated all of the information the scans were collecting.
Then I moved the definitions from being defined on a site-by-site basis to being centrally stored and sent out via the API. All of the communications and commands were encrypted in both directions using a secret key that was never transmitted by either party, so even if we hadn’t been running on the same server, it would have been difficult to impersonate the main API. And even if you did, the contents of each specific file were never transmitted, and the script wasn’t capable of deleting or modifying files, so no major damage could be done.
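One plausible way to do pre-shared-key messaging like that is sketched below: symmetric encryption plus an HMAC, with the secret configured on both ends and never sent over the wire. This is a sketch of the general approach, not the actual protocol the product used.

```php
<?php
// Generic sketch of agent <-> API messaging with a pre-shared secret.

const SHARED_SECRET = 'configured-on-both-ends-never-transmitted';

function seal_message(array $payload): string
{
    $plain  = json_encode($payload);
    $iv     = random_bytes(16);
    $key    = hash('sha256', SHARED_SECRET, true);

    $cipher = openssl_encrypt($plain, 'aes-256-cbc', $key, OPENSSL_RAW_DATA, $iv);
    $mac    = hash_hmac('sha256', $iv . $cipher, $key);

    return base64_encode($iv . $cipher) . '.' . $mac;
}

function open_message(string $sealed): ?array
{
    [$data, $mac] = explode('.', $sealed, 2);
    $raw    = base64_decode($data);
    $iv     = substr($raw, 0, 16);
    $cipher = substr($raw, 16);
    $key    = hash('sha256', SHARED_SECRET, true);

    // Reject anything not signed with the shared secret
    if (!hash_equals(hash_hmac('sha256', $iv . $cipher, $key), $mac)) {
        return null;
    }
    $plain = openssl_decrypt($cipher, 'aes-256-cbc', $key, OPENSSL_RAW_DATA, $iv);
    return $plain === false ? null : json_decode($plain, true);
}
```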
In addition to that, for truly problematic sites, we could even put in some basic malware scanning of uploads: if any of our patterns were detected in a file listed in the $_FILES superglobal, we could remove the offending file, and its entry in the $_FILES array, before the site’s PHP code could do anything with it.
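Since the prepended file already runs before anything else, the upload check can live in the same hook. A sketch, reusing the same kind of pattern list as above (the function name is invented):

```php
<?php
// Sketch of scanning uploads before the application sees them. This would run
// from the same auto_prepend_file hook as the allowlist check.

function quarantine_bad_uploads(array $patterns): void
{
    foreach ($_FILES as $field => $info) {
        // Handle only simple single-file fields in this sketch
        if (!is_string($info['tmp_name']) || !is_uploaded_file($info['tmp_name'])) {
            continue;
        }
        $contents = file_get_contents($info['tmp_name']);
        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $contents)) {
                unlink($info['tmp_name']);   // remove the offending file...
                unset($_FILES[$field]);      // ...and hide it from the application
                break;
            }
        }
    }
}
```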
The Result
It was a revelation. Suddenly it was not only shutting down the majority of bad actors on our server, it was identifying more complex attacks, and as I updated the patterns it was looking for, it started to flag security holes in files that might never have been found otherwise.
I know for a fact that it gave a security researcher we had doing some casual penetration tests on the server such a hard time that he completely abandoned the upload attack vector and doubled down on other approaches.
In addition to that, when we had evidence that a site was being compromised despite our systems, it gave us a granular way to save and analyze user activity in a way that our existing logging solutions did not offer. That feature helped us identify several compromised administrative users on an eCommerce site and track down exactly what information the bad actors had access to. Unfortunately, in this instance it was everything short of payment methods and social security numbers, but I was able to close the compromised accounts, patch the method of compromise, and verify that the actors were not able to regain access, while also giving the client a full report of what was compromised and my recommendations for communicating it to their customers.
Effectively, this project pushed our overall server security from little better than bumbling, to good enough that it was able to stop bad actors from taking advantage of known vulnerabilities in frameworks that clients did not have the time or budget to regularly maintain.
By no means was it perfect, but it was effective, and it was probably my largest contribution to the productivity of the business. Having to manually restore backups of sites on a regular basis is time-consuming. Dealing with clients who are understandably upset that their website is broken or obviously hacked is time-consuming. By the end of this project, I had taken a weekly, day-long, hugely error-prone process and automated it to the point where I would usually only have to spend an hour or two a month actually securing client websites; the rest of my infrastructure time could be spent on performance, ease of access, or prototyping services that could be turned into products.
In short, it was far more successful than I had anticipated when I first set out on the project, and is a huge source of my confidence in how much my skills have grown over the last few years.