Face Detector and Legend-Creator

for Group-Photos

 

Chris Gahan (4304470)

CISC499 Project

April, 2005


Overview

 

As you’ve probably found in the past, the legends at the bottom of the vast majority of group photos are not very user-friendly. They usually look something like this:

 

First Row (left to right):

Person 1, Person 2, Person 3, Mrs. Person 3, Person 3 Jr.

Middle-Half of Second Row (right-wise diagonally, except for people on railing): 

Person 4, Sir Person the 5th, Dr. Person 6, Agent Person 007, etc.

           

As I’m sure you’ve experienced, it takes a fair amount of effort to match a name to a face, or a face to a name. There is an alternative legend that you sometimes see, and it’s much easier to read. It involves creating a smaller version of the picture that contains only the outlines of the people, each of which is numbered, and then a table maps each number to the person’s name. Here’s an example:

 

Fig. 1: Improved Legend

 

Legend:

1.        Person 1

2.        Person 2

3.        Person 3

4.        Mrs. Person 3

5.        Person 3 Jr.

6.        etc..

 

 

 

These legends make it much easier to find people’s names, but take significantly more effort to create, which is why they’re only seen on group photos occasionally. Therefore, the goal of this project is to create a piece of software that aids people in creating these superior legends quickly and easily, in the hopes that future generations won’t have to suffer from the frustration and eye fatigue that results from the use of inferior legends.

 

I use a computer vision approach to solve this problem. The reason and motivations will be explained later. The major component of this project involved finding the fastest and most accurate approach to detecting people in photographs, and then learning and understanding the theory behind how to implement this solution.

 

 

Formulating a Solution

 

My primary motivation in creating this program (aside from finishing the course required to graduate) was to create a piece of software that’s useful to people, and that actually makes their lives easier. So, when designing my solution, the most important goal was usability: the program should be easy and intuitive to use. It should also be fast enough that it won’t be outperformed by a fully manual solution. It was a definite possibility that having the user click on each person in the picture would be faster. After all, people are quite good at computer vision tasks.

 

Another requirement that would encourage people to use this program is how long it takes to install. I figured that most people would have one or two group photos that they’d want to process at a time, and that few people would be motivated to install a big piece of software just to process a couple photos. So, to solve this problem I decided to make it a web application. The user could go to a webpage and upload a photo, the webserver would then automatically detect the people in the image and present the user with an interface to correct the computer’s detections if necessary. Then, when the user was satisfied that all the people in the image had been accounted for, it would let them name all of these people and, finally, produce a legend which could be saved to the hard drive or printed out.

 

A web application might seem like a bad solution for this problem at first since the user will have to do a lot of interaction with the image. However, JavaScript and dynamic-HTML can provide very rich GUIs. As you will see later, the JavaScript GUI performs just as well as a desktop GUI.

 

So, the next problem was deciding how to detect the people in the photo. After some thinking, I decided that the best basis for detecting people in an image would be their faces. From looking at group photos, I realized that using the person’s body outline (as suggested by one of the supervisors) was out of the question – as you can see in the photograph below, many people’s bodies are hidden behind other people.

 

Fig. 2: Group Photo

 
 

 


In the worst cases, all you can see is part of a person’s head. Faces, on the other hand, are generally the minimum requirement for somebody to be considered “in a picture” (it’s doubtful that a person would be identifiable even by a human if only their arm was in the picture).

 

The final problem is creating outlines of all the people and then numbering them with a legend to map the names to their numbers. This is actually a much more difficult task than you’d assume, because creating an outline the way a human would (in Fig. 1) requires you to know what every object in the scene is. In computer vision, the task of separating an image into semantically meaningful regions is called segmentation, and is very difficult to solve in general. Essentially, the computer would have to know how to identify every object in the image to segment it the way a human would. To segment a human, it would have to be able to recognize the head, the torso, and the limbs – with or without hair, with or without different kinds of clothes, etc.

 

It’s not as difficult a problem to solve if the image contains depth information. Segmenting objects when you know the depth is much easier since you can separate foreground objects from their background, which aids segmentation enormously. It’s too bad that the average camera doesn’t capture depth information, because my job would be much easier. So, the solution I came up with was to just do simple edge-detection on the image and use that as a kind of pseudo-outline that could be printed out on a black-and-white printer and stuck in the picture frame with the original photo.

 

Detecting Faces

 

So, how do you actually go about detecting faces? It’s a difficult problem because, in a photograph containing people’s faces, there are many variables which are hard to control:

 

Face Variability

There are many different kinds of faces (some faces have beards, glasses, hair, no hair, differently shaped heads, etc.)

 

Input Image Quality

The quality of the photograph or scanned image can vary. There could be scratches on it, it could be scanned at a low resolution, it could be JPEG-compressed at a lossy setting, etc.

 

Lighting

The lighting in the photograph can vary, and shadows can cause many problems. For example, half the image could be lighter than the other half if the shadow of a building is cast across the picture. Or, the angle of the sun could differ between pictures which would make people’s faces have different kinds of shadows on them. Just in terms of the pixel data, a person’s face will be different at high noon than at 4:00pm.

 

Rotation

People can be facing the camera, but their heads might be tilted sideways a bit. Or, even worse, they may not be facing the camera at all. This is called an out-of-plane rotation or a profile.

 


Size/Scale

It’s never certain how big a person’s face will be in an image. With a large group, there will be many tiny faces, whereas with a small group, there will be a few large faces.

 

Environment

There could be other objects in the scene which look like faces. Also, the background cannot be counted on to be consistent in any way, and so it can’t be used as a basis to separate the people from the rest of the scene.

 

A robust algorithm for detecting faces will have to deal with most of these to accurately detect the faces in the image.

 

The severity of the lighting problem can be reduced somewhat by using histogram normalization and other lighting-adjustment techniques, but the rest of the aforementioned variables cannot be controlled very easily.
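As a rough illustration of this kind of lighting adjustment, here is a minimal sketch of histogram equalization, which is one common form of histogram normalization. It assumes the image is a flat list of integer grey levels; the function name equalize_histogram is made up for this example and is not taken from any of the cited papers.

```python
def equalize_histogram(pixels, levels=256):
    """Spread the grey levels of a flat list of ints in [0, levels) over the full range."""
    # Count how often each grey level occurs.
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1

    # Cumulative distribution: how many pixels are at or below each level.
    cdf = []
    running = 0
    for c in counts:
        running += c
        cdf.append(running)

    cdf_min = next(v for v in cdf if v > 0)
    n = len(pixels)
    if n == cdf_min:
        # Every pixel has the same value, so there is nothing to stretch.
        return list(pixels)

    # Map each level so that the darkest used level becomes 0 and the brightest becomes levels - 1.
    scale = (levels - 1) / (n - cdf_min)
    lookup = [round((v - cdf_min) * scale) if v >= cdf_min else 0 for v in cdf]
    return [lookup[p] for p in pixels]


low_contrast = [100, 100, 110, 120]
stretched = equalize_histogram(low_contrast)
```

A low-contrast patch whose values sit between 100 and 120 ends up using the whole 0 to 255 range, which is the effect you want before handing a region to a classifier.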

 

Therefore, simple image analysis methods such as edge detection, thresholding, templates, and Fourier coding cannot solve the problem; they could be components of a larger solution, but by themselves they won’t be able to reduce enough of the variables so that a region can be accurately classified as a face or not.

 

Again, the goal of this face detection routine is to out-perform a human who’s clicking on the faces by hand. So, it has to be quite fast. The average user could click on all the faces in a relatively large group in under 20 seconds. Also, if the algorithm isn’t very accurate, the user has to correct a lot of misdetected faces (things that aren’t faces), and on top of that, has to mark the rest of the undetected faces by hand. So, accuracy and speed are of great importance when picking a face-detection algorithm.

 

Robust Face-Detection Algorithms

 

Luckily, face-detection is a large area in the field of computer vision and there are a lot of papers that describe excellent techniques. The most popular classification algorithms are: support vector machines (SVMs) [7], artificial neural networks (ANNs) [1, 2, 4], principal component analysis (PCA) [6], manifold learning [14], and detector-trees of boosted classifiers (aka. Boosting or Adaptive Boosting) [3, 10, 11, 12, 13].

 

All of the aforementioned algorithms have the same fundamental operation. There is first a training phase in which the algorithm analyzes sets of faces and non-faces to learn regularities and patterns which exist in the data. These regularities are called features, and they could be anything in the image which helps determine whether or not a region contains a face – there is generally no restriction on the specific features that the algorithm is allowed to use.

 

After a training phase, each algorithm is tested by trying to find the faces in a set of images which it’s never seen before. In this set of images, all the faces were detected by a human before-hand so that the algorithm’s performance can be compared.

 

The scoring works as follows: the hit-rate is how many of the faces the algorithm correctly finds, the miss-rate is how many faces it overlooked, and the false-detection rate is how many non-faces it mistakes for faces. Sometimes papers combine the two kinds of mistake and report a single error-rate. This testing phase is very important because it gauges whether or not the classifier is actually learning the faces properly.
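The scoring can be sketched in a few lines. This is a simplified illustration, assuming each candidate region has an identifier and that a detection matches a hand-labelled face exactly (real evaluations allow some positional tolerance). The function name detection_scores and the parameter non_face_count are made up for this example.

```python
def detection_scores(true_faces, detections, non_face_count):
    """Compare detected region ids against hand-labelled face region ids.

    true_faces, detections: sets of region identifiers.
    non_face_count: how many non-face regions the detector looked at (assumed known).
    """
    hits = len(true_faces & detections)              # faces that were found
    false_detections = len(detections - true_faces)  # non-faces reported as faces
    return {
        "hit_rate": hits / len(true_faces),
        "miss_rate": (len(true_faces) - hits) / len(true_faces),
        "false_detection_rate": false_detections / non_face_count,
    }


scores = detection_scores({"a", "b", "c", "d"}, {"a", "b", "c", "x"}, non_face_count=100)
```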

 

It’s very important to give the network testing and training data that are representative of the total possible space of inputs which it could encounter in real world situations. If the training data has been chosen poorly, it’s likely that these algorithms will rely on features that only occur frequently in the training set and which won’t be useful for detecting faces in real world applications.

 

The US military ran into a problem which illustrates this principle very well. During the cold war, they wanted to create an automated system which could determine whether a tank in an image was American or Russian. So, they decided to train a neural network to do the task. They compiled a huge set of images of American and Russian tanks, and exhaustively trained their neural network on this corpus of data until it could classify the tanks with a very high accuracy. When it was tested in the field, however, it didn’t work at all! They were quite surprised, and eventually figured out that the problem was that, in their training and test data, all of the pictures of American tanks were darker than the pictures of Russian tanks. Since this was the strongest feature between the two sets of images, the network relied heavily on it to classify the tanks.

 

This property of performing well on real-world data is commonly referred to in the machine learning field as generalization [5]. A classifier which performs well on data outside of its training set is said to “generalize” well. This ability is the most important attribute to seek when designing a classifier. The opposite of generalization is over-fitting, which is when a network becomes trained to recognize only data from its training set.

 

A problem that the early researchers into face detection techniques ran across was that it’s difficult to create a representative training set of things which are non-faces. The space is almost limitless. There is, however, a clever trick which gives you a good approximation to the set of non-faces. Once your detector has been trained on the face examples, you run the face-detector on a large set of images which are known to have no faces in them (or run it on all the non-face regions of an image that has faces). Whenever the network thinks part of this faceless image is a face, you store it. After it’s gone through all of these images you can randomly pick a selection of those false-detections and use them as non-face training data. If you do this enough times, the network will be able to determine what differentiates a face from a non-face over a wide range of images [1]. Note that the randomization step is very important because you’re scanning the images linearly. If you just use the first n false detections, they’ll probably all be from the same image.

 

So, to recap, the general algorithm for training a classifier is as follows:

 

  1. Train the classifier on a set of training-faces
  2. Run the classifier on photos which contain no faces and save the false-detections that it finds
  3. Pick a random subset of these false-detections and train the classifier by telling it that these are not faces
  4. Test the classifier on a set of test-images which contain known face locations and measure its accuracy
  5. If a desired level of accuracy has not been achieved, repeat the process
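Steps 2 and 3 of the recap can be sketched as follows. This is only an illustration under simple assumptions: the classifier is any function that returns True when it thinks a window is a face, and the windows come from images known to contain no faces. The names collect_false_detections, classify and faceless_windows are made up for this example.

```python
import random


def collect_false_detections(classify, faceless_windows, sample_size, seed=0):
    """Return a random subset of the windows that the classifier wrongly calls faces."""
    # Every positive answer on a faceless image is, by construction, a false detection.
    false_hits = [w for w in faceless_windows if classify(w)]

    # Sample randomly instead of taking the first few, because the windows were
    # scanned in order and the first few would mostly come from the same image.
    rng = random.Random(seed)
    return rng.sample(false_hits, min(sample_size, len(false_hits)))


# Toy stand-in for a trained classifier: "bright windows are faces".
def toy_classifier(window):
    return sum(window) / len(window) > 128


windows = [[200, 220], [10, 20], [130, 140], [90, 100], [255, 255]]
new_non_faces = collect_false_detections(toy_classifier, windows, sample_size=2)
```

The returned windows would then be added to the non-face training set before the next round of training.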

 

Detecting faces in the images is fairly simple. The network runs the detector on a small region of the image (20 by 20 pixels, for example), then moves that region over a couple pixels. It repeats this until it reaches the end of a row, then it moves down a few pixels and starts on the next row.

 

Scale invariance is achieved by shrinking the image by a factor of 1.2 and rescanning it. This is equivalent to scanning the original image using 24-pixel regions. The shrink-and-rescan step is repeated until the image is smaller than the detection window. (In the literature, this series of scaled photographs is referred to as the image pyramid.)
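The scanning pattern can be sketched like this, assuming a 20-pixel window, a 2-pixel step, and a shrink factor of 1.2 per pyramid level (so that the second level corresponds to 24-pixel regions in the original image). The function names are made up for this example, and the actual classifier call is left out.

```python
def pyramid_levels(width, height, window=20, factor=1.2):
    """Yield (level, scaled_width, scaled_height, effective_window_size) for each pyramid level."""
    level = 0
    scale = 1.0
    while True:
        w = int(width / scale)
        h = int(height / scale)
        if w < window or h < window:
            break  # the shrunken image no longer fits a single window
        # A fixed-size window on the shrunken image covers window * scale pixels of the original.
        yield level, w, h, window * scale
        level += 1
        scale *= factor


def window_positions(width, height, window=20, step=2):
    """Yield the top-left corner of every window, row by row."""
    for y in range(0, height - window + 1, step):
        for x in range(0, width - window + 1, step):
            yield x, y


for level, w, h, effective in pyramid_levels(100, 60):
    count = sum(1 for _ in window_positions(w, h))
    print(f"level {level}: {w}x{h} image, {count} windows, ~{effective:.0f}px faces")
```

Printing the window counts makes it clear why speed matters: even a small image produces thousands of regions to classify.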

 

This is the basic strategy used by the most effective algorithms. Before this strategy was invented, people were trying to use less general techniques for detecting faces such as deformable templates, as is mentioned in [6]. The first really effective algorithms (and the ones which are most-cited in the literature) are: Rowley, Baluja and Kanade’s [1] which is based on a clever neural network design, and Sung and Poggio’s [6] which first maps the faces to vectors in a high-dimensional space, and then uses PCA clustering techniques to classify them into 12 different clusters (each of which can be labeled as a face or non-face).

 

Fig 3: Convolutional manifold classification and pose estimation

 
A very recent and novel approach that’s described by Osadchy, Miller and LeCun [14] is to create a high-dimensional manifold, and then to apply a convolutional network to the face which maps it into this manifold’s high-dimensional space. All faces will be mapped to points close to the manifold while non-faces will be far from it (see Fig 3). The manifold is learned by optimizing a complex function and it is very good at detecting rotated faces (in and out of plane). It also has the extra advantages of being able to tell you the roll, pitch, and yaw of the face in the 3D scene, and the ability to process images at 5 frames per second.

 

The algorithm which looked most promising to me, however, was the artificial neural network described by Rowley et al. in [1]. It seemed to be the most straightforward to implement, and didn’t seem like it would be a problem. I later discovered, however, that this technique would be far too slow to meet my usability requirements. Luckily, I ran across a much faster method which uses adaptively boosted classifiers and Haar basis functions as feature detectors [9, 10, 11, 12, 13]. This method can run at speeds of 30 frames per second on a fast desktop computer. My final solution uses this technique, but first I’ll describe the technique which I initially attempted.

 

Neural Network Face Detector

 

A comprehensive treatment of the theory of neural networks and other connectionist architectures is beyond the scope of this paper. A good treatment of the subject can be found in a book that many machine learning experts refer to as “The Bible” [15].

 

The neural network classifier designed by Rowley et al. [1] (seen in Fig. 4) achieves a 90.5% detection rate and a 0.000004% false-detection rate on a large test-set of well-selected images. The high quality of this detector is dependent on a fair amount of a priori knowledge which was incorporated into the design of the network. For example, they knew that faces would most likely be somewhat oval-shaped and symmetrical, so they applied a lighting corrector to an oval-shaped region of the image before passing the image to the neural network. This lighting corrector was designed to normalize a light gradient across the image (for example, if one half of the image was darker than the other because of a shadow). The corrector works by regression-fitting a linear function to the pixel intensities in the input region and then subtracting it out, so that the brightness gradient flattens out. It also applies histogram-normalization to the image to correct the contrast.
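The idea behind the lighting corrector can be sketched as a least-squares fit of a linear brightness ramp followed by a subtraction. This is a simplified illustration and not the code from [1]: it assumes the fit is done over the whole rectangular window rather than an oval mask, which lets each slope be computed independently. The name correct_lighting is made up for this example.

```python
def correct_lighting(window):
    """Subtract the best-fit linear brightness gradient from a window.

    window: list of rows of grey values. Returns a new list of rows of floats
    with the same average brightness but no overall left-right or top-bottom ramp.
    """
    h = len(window)
    w = len(window[0])
    cx = (w - 1) / 2  # centre the coordinates so the fitted ramp averages to zero
    cy = (h - 1) / 2

    # On a full rectangular grid the centred x and y coordinates are uncorrelated,
    # so the least-squares slopes reduce to two simple ratios.
    sxx = h * sum((x - cx) ** 2 for x in range(w))
    syy = w * sum((y - cy) ** 2 for y in range(h))
    sxi = sum((x - cx) * window[y][x] for y in range(h) for x in range(w))
    syi = sum((y - cy) * window[y][x] for y in range(h) for x in range(w))
    slope_x = sxi / sxx if sxx else 0.0
    slope_y = syi / syy if syy else 0.0

    return [
        [window[y][x] - slope_x * (x - cx) - slope_y * (y - cy) for x in range(w)]
        for y in range(h)
    ]


# A window that is nothing but a brightness ramp comes out perfectly flat.
ramp = [[10 + 3 * x + 2 * y for x in range(4)] for y in range(3)]
flat = correct_lighting(ramp)
```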

Fig 4: Upright face detector architecture

 

 


Another bit of a priori knowledge which influenced the network’s architecture was to use receptive fields as the first layer after the input (see Fig. 4). A receptive field is a 2D field of neurons which become trained to detect a feature, and which are somewhat insensitive to displacement. For example, if a receptive field learns to detect an eye, it can be moved several pixels to the right of the eye and still output a value. Its output will be much weaker, however, because a receptive field’s output is strongest when the feature is centered on it. The receptive field’s output under displacement looks like a Gaussian distribution – the further you displace the feature, the faster the output strength drops off.

 

The paper also mentions that the receptive fields are replicated across the image. This means that multiple copies of a single receptive field are made which are allowed to look at different locations of the image, depending on where replicated features occur. For example, perhaps an ear appears in two different places (if the head is rotated to the left slightly).

 

The receptive fields are very helpful because they constrain how the network operates. They force it to find a small set of features (equal to the number of receptive fields in the image) and use these as input to the hidden layer of the network. If they hadn’t used receptive fields, for example, the network would have to figure out that it should use receptive fields all by itself, based on the patterns of input it was given. This is very hard to achieve, and if you want to achieve it, you have to train the network very slowly. If you train it too fast, it is likely to fly off into some random weight configuration that over-fits the data and doesn’t contain feature detectors which are displacement-invariant. The network’s learning rate must be very small so that the weights break symmetry slowly. By breaking symmetry slowly, the network can creep towards a configuration that finds displacement-invariant features. It’s a bit of a waste of time, however, when you already know that’s what you’re looking for. So, instead, we take a short-cut and hard-wire the network with the approximate number of receptive fields that we think it will need.

 

The technique described in [1] can’t detect faces rotated out-of-plane or greater than 10 degrees from upright. So, Rowley et al. created a more advanced architecture which could detect faces that were rotated at any angle (in-plane) [2]. It achieves this by first passing the region through a router network which is a neural network that’s trained for one purpose: to detect what angle a face is rotated at. If the input region isn’t a face, it assumes that it is and reports an angle anyways. This isn’t a problem because a full-blown face detector at a later stage will validate the faceness of the region. This technique is also quite robust and effective.

 

Neural Network Implementation Attempt

 

I attempted to create a neural network classifier based on this model. However, my network exhibited the Russian-tank effect because I didn’t realize how many training faces I would require, or how uniformly cropped they’d have to be. In [1], they used 1050 training faces which were painstakingly cropped and resized so that the nose, eyes, and mouth of every image were within 4 pixels of each other.

 

So, when I realized that I’d have to crop and resize over a thousand faces, and on top of that wait over a week to train the network, and that it would run slowly once it was trained, I decided I needed to find a better solution. That’s when fate intervened, and I was presented with the holy grail of face detectors.

 

A Decision-Tree Classifier using Haar-like Features

 

Up to this point, all of the algorithms I’d been looking at operated on the image pixel data directly. However, Paul Viola and Michael Jones published a paper [10] in 2001 that describes a very clever technique of detecting faces using features which can be calculated incredibly quickly, at any scale, in constant time!

 

A neural network’s feature detectors consist of receptive-fields of neurons that assign a score to every individual pixel, then sum these scores together. But what if there was another way of achieving the same result?

 

The Haar basis is a set of rectangular regions which are used in wavelet analysis to reconstruct waveforms of any complexity. Similar to the way a Fourier transform works, a wavelet transform decomposes a waveform into a set of simple components that, when summed together, (approximately) generates the original waveform. The Haar basis is an orthogonal set of elements that can be used to do this.
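To make the wavelet idea concrete, here is a minimal sketch of one level of a one-dimensional Haar decomposition: each pair of samples is replaced by its average and half its difference, and the two together are enough to rebuild the pair. This uses the unnormalised form of the transform for readability, and the function names are made up for this example.

```python
def haar_step(signal):
    """One level of an unnormalised Haar transform on an even-length list."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]  # coarse version of the signal
    details = [(a - b) / 2 for a, b in pairs]   # what the coarse version lost
    return averages, details


def haar_inverse_step(averages, details):
    """Rebuild the original samples from the averages and details."""
    signal = []
    for avg, det in zip(averages, details):
        signal.extend([avg + det, avg - det])
    return signal


averages, details = haar_step([9, 7, 3, 5])
restored = haar_inverse_step(averages, details)
```

Each detail value is a light region minus a dark region, which is the same light-versus-dark comparison that the rectangular features described next perform in two dimensions.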

 

Viola/Jones found that an over-complete set of basis functions (in other words, ones which are not orthogonal) works well as feature detectors, and a classifier can be trained to find a set of feature detectors that will accurately classify faces and non-faces. The squares shown in Fig. 5 are the feature detectors that Lienhart et al. use in their implementation of this algorithm [11, 12, 13]. The white regions represent the sum of all the pixels within an area, while the black regions represent the negative sum. Viola/Jones found that a trained classifier would use an inverted feature 2b to detect a person’s nose (a light vertical bar with darker areas on either side), and feature 1b to detect a person’s eyes (a dark bar with a light bar below it). The combination of these two features is often the first stage in a Haar-based classifier.

 

After training the first stage of a classifier on the exhaustive set of features, Viola/Jones decided to limit the feature set. The reason is that if each feature could be composed of any number of square regions of any size and shape, there would be an (almost) infinite number of possible features. The classifier would take an incredibly long time to train since it would have to search this enormous space of features to find which were the most useful for classifying the training cases. So, using a bit of a priori knowledge, they restricted the set of features to combinations of 2 or 3 adjacent squares of alternating colour (seen in Fig. 5 as features 1a, 1b, 2a, 2b, 2c, 2d). For a 24x24 region, this simple restriction produces a much more tractable set of 45,396 possible features [10].

 

Lienhart et al. expanded this set of features to include rotated features (as is seen in Fig. 5), and also discarded feature 4 (the diagonal line detector) since they found that it wasn’t very useful.

 

Now, the reason for using this new set of features is that it can be accelerated with a lookup table. They called this lookup table the integral image (it’s very similar to the summed area table used in graphics for texture mapping). Every (x, y) element in the integral image contains the sum of all of the pixels above and to the left of pixel (x, y) in the original image (Fig. 6). This table can be computed very quickly (in a single pass), and it allows you to get the sum of a large region of pixels in a single operation.

 

This is pretty neat, but the real power comes when you realize that if you subtract the sums of smaller regions from the larger region, you can find the sum of any rectangular region in the image!

 

If you wanted to find the sum of region D in Fig. 6, you could first use point 4 to find the sum of region ABCD, then use point 2 to find the sum of region AB and subtract that away, and repeat until all you have left is region D. The sum would be:

[4] - [3] - [2] + [1]

 

As a result, you can find the value of any of the Haar features in Fig. 5 with 8 array lookups and 8 additions!
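Here is a minimal sketch of the integral image and the four-corner lookup, using plain Python lists. The table gets an extra row and column of zeros so that rectangles touching the top or left edge need no special cases. The feature at the end is a hypothetical two-rectangle layout (a light bottom half minus a dark top half) chosen for illustration; it is not copied from Fig. 5, and all function names are made up for this example.

```python
def integral_image(image):
    """Build a summed-area table with one extra zero row and column, in a single pass."""
    h, w = len(image), len(image[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            # Everything above this row, plus this row up to and including x.
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y): [4] - [3] - [2] + [1]."""
    return table[y + h][x + w] - table[y + h][x] - table[y][x + w] + table[y][x]


def two_rect_feature(table, x, y, w, h):
    """Light bottom half minus dark top half of a w-by-h rectangle (h must be even)."""
    half = h // 2
    return rect_sum(table, x, y + half, w, half) - rect_sum(table, x, y, w, half)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
bottom_right = rect_sum(table, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```

The cost of rect_sum is the same whether the rectangle covers four pixels or four thousand, which is why the features can be evaluated at any scale in constant time.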

 

This technique is orders of magnitude faster than a neural network.

 

Computing rotated regions in Fig. 5 (1c, 1d, etc.) can be done with the lookup table shown in Fig. 8. How the rotated feature detectors in Fig. 5 are computed is left as an exercise to the reader.

 

Training the Haar Classifier

The actual classifier is implemented as a degenerate decision tree, or a cascade of detectors (as illustrated in Fig. 9). The image is fed into the classifier at the left side (stage 1), and at each stage is either rejected (downwards arrow), or accepted and passed on to the next stage (rightwards arrow). If the image makes it all the way to the end of the cascade, then it’s a valid face.
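The control flow of the cascade can be sketched in a few lines. This is an illustration only: each stage is assumed to be a pair of a scoring function and a threshold, and the two toy stages below are stand-ins rather than real Haar-feature stages. The name run_cascade is made up for this example.

```python
def run_cascade(stages, window):
    """Pass a window through the stages in order, stopping at the first rejection.

    stages: list of (score_function, threshold) pairs.
    Returns (is_face, number_of_stages_evaluated).
    """
    for stage_number, (score, threshold) in enumerate(stages, start=1):
        if score(window) < threshold:
            return False, stage_number  # rejected: later stages are never run
    return True, len(stages)            # survived every stage


# Toy stages: a cheap brightness test first, then a slightly pickier contrast test.
toy_stages = [
    (lambda w: sum(w) / len(w), 50),
    (lambda w: max(w) - min(w), 20),
]

print(run_cascade(toy_stages, [10, 20, 30]))
print(run_cascade(toy_stages, [40, 80, 90]))
```

Because most windows in a photo are not faces, most of them leave at the first or second stage, and only a handful ever reach the expensive stages at the end.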

 

The training procedure for the cascade is called adaptive boosting (AdaBoost). It results in a series of weak classifiers (each stage) which, when combined (into a cascade) create a strong classifier.

Each classifier in the cascade is trained to be incrementally better at detecting faces. The early stages are simple, while the later stages are complex. This increases the speed of the classifier as a whole since most of the non-faces get thrown out in the early stages.

 

The overall performance of the cascade can be measured very easily based on the hit-rate and false-alarm rate of every stage. If you have 20 stages, and each stage has a hit-rate of 90% and a false-alarm rate of 30%, then the overall hit-rate at the end of the cascade will be the hit-rate of the first stage, multiplied by the hit-rate of the second stage, …, multiplied by the hit-rate of the 20th stage. Why? Well, if stage 1 lets 90% of valid faces through, and stage 2 lets 90% of those through, then stage 3 will get 0.9*0.9 = 81% of all the valid faces given to the cascade. So, with 20 stages and a hit/alarm rate of 0.9/0.3, the overall performance of the cascade will be: 0.9^20 ≈ 12% hits, and 0.3^20 ≈ 0.000000003% false-alarms.

 

If the desired overall hit-rate of the cascade is 90%, then each stage must have at least a 99.5% hit-rate. And, if the desired overall false-alarm rate is 0.000001%, each stage must have a false-alarm rate of at most 40%.
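The arithmetic in both directions is easy to check. This small sketch assumes every stage has the same rate, as in the example numbers; the function names are made up for this example.

```python
def cascade_rate(stage_rate, stages):
    """Overall rate when every stage lets through the same fraction of what it receives."""
    return stage_rate ** stages


def required_stage_rate(overall_rate, stages):
    """Per-stage rate needed for the whole cascade to reach a given overall rate."""
    return overall_rate ** (1.0 / stages)


# Going forwards: 20 stages at 90% hits and 30% false alarms each.
print(f"overall hit-rate:       {cascade_rate(0.9, 20):.4f}")
print(f"overall false alarms:   {cascade_rate(0.3, 20):.3e}")

# Going backwards: what each stage needs for 90% hits and 0.000001% (1e-8) false alarms.
print(f"stage hit-rate needed:  {required_stage_rate(0.90, 20):.4f}")
print(f"stage false-alarm rate: {required_stage_rate(1e-8, 20):.4f}")
```

The asymmetry is the useful part: a stage can let through almost half of the non-faces and the cascade still ends up extremely selective, but it cannot afford to lose even one percent of the real faces.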

 

The 99.5%/40% numbers were chosen by empirical study to be the best combination for training the cascade [10]. When they tried to decrease the false-alarms (which makes the classifier stricter), the increased strictness also made the hit-rate decrease. The most important goal is to have the hit-rate stay above 99.5%, and the false-alarm rate isn’t significant because after 20 stages, anything below 40% becomes trivial.

 

Actually training the cascade is relatively simple. First, a very large set of training images is collected (5000 face examples, and 3000 randomly selected non-face examples for each training phase). Then, the first classifier tries all the possible features until it finds a combination that lets through 99.5% of faces and 40% of non-faces. Then, the examples that it let through which weren’t valid faces get boosted (weighted more heavily), so that when the next stage is trained, it has to be more accurate on those specific examples. Once that stage has been properly trained on the boosted set so that it lets through 99.5% of real faces and 40% of the non-faces, then another boosting is done.
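One piece of this procedure that is easy to sketch is how a stage can be tuned to meet its hit-rate target: given a score for every training example, lower the acceptance threshold until enough faces get through, then measure how many non-faces slip through with them. This is an assumption about how the tuning might be done, not a transcription of the training code in [10] or in OpenCV, and the name pick_stage_threshold is made up for this example.

```python
import math


def pick_stage_threshold(face_scores, nonface_scores, min_hit_rate=0.995):
    """Choose the strictest threshold that still accepts min_hit_rate of the faces.

    Returns (threshold, hit_rate, false_alarm_rate) measured on the given scores.
    """
    ranked = sorted(face_scores, reverse=True)
    # Number of faces that must score at or above the threshold.
    needed = max(1, math.ceil(min_hit_rate * len(ranked)))
    threshold = ranked[needed - 1]

    hit_rate = sum(s >= threshold for s in face_scores) / len(face_scores)
    false_alarm_rate = sum(s >= threshold for s in nonface_scores) / len(nonface_scores)
    return threshold, hit_rate, false_alarm_rate


# Toy scores: higher means "more face-like".
faces = [0.9, 0.8, 0.7, 0.2]
non_faces = [0.75, 0.1, 0.3, 0.05]
threshold, hits, alarms = pick_stage_threshold(faces, non_faces, min_hit_rate=0.75)
```

If the measured false-alarm rate is still above the target, the stage needs more features before the threshold is chosen again, which is one reason the later stages end up more complex than the early ones.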

 

As you get further and further down the cascade, it takes longer and longer to train these classifiers since the faces it’s focusing on get harder and harder. For this reason, the later stages need to use more feature-detectors than earlier stages to have the desired accuracy.

 

The training can take quite a long time, but the result is a detector that’s incredibly fast and accurate, with a very low false-detect rate.

 

Implementing the Cascade

 

I ended up creating a web application in Python [15], which uses CherryPy [17] as the web framework, the Python Imaging Library (PIL) [18] to do the image processing, and the Haar-classifier which comes with OpenCV [16] to do the face detection.

 

Since I only discovered that Holy Grail of a paper [10] a couple days before my project was due, I didn’t manage to implement my own Haar-like classifier from scratch. I did, however, make a valiant effort.

 

See, OpenCV uses XML files to store all of the pre-trained classifiers. So, I got an XML library and attempted to implement one using that data. I got as far as implementing the Summed Area Tables, Triangular Summed Area Tables, and the XML parser when I realized that I didn’t know how the actual decision tree nodes were stored in the XML file. I tried tracing through OpenCV’s module loader, but it’s a pretty mangled mess of code because it’s only recently been forced into parsing/using XML. There were a whole bunch of red-herring functions that weren’t being used anymore, etc. So, I ended up using Pyrex [19] to write a Python extension to OpenCV.

 


The Rest of It

 

The final problem I was faced with was creating a web interface, and a legend for the images.

 

Choice of Tools

 

Python

  • Beautiful, dynamic, object-oriented programming language
  • Rapid development
  • Generally slow for writing loops and doing lots of method calls, but it’s a very high level language and most of the high level code that your program calls is written in C and heavily optimized
  • It’s very easy to extend via C(++) or Pyrex
  • There is a wealth of free libraries for it, and OOP wrappers around popular C libraries
  • Lets you wrap gross C libraries with beautiful OOP models. This gives you the best of both worlds: C speed, and Python cleanliness.

 

CherryPy

  • Model-View-Controller web framework which provides excellent separation of domain logic and display logic
  • Comes with a built-in web server that you just execute, so it’s incredibly easy to set up without Apache
  • Very fast (can serve 450 pages per second)
  • Easy to setup and customize
  • Tiny codebase (5000 lines total!)
  • A very easy environment to work in

 

PIL (Python Imaging Library)

  • Fast, efficient, mature.
  • A great library for doing image manipulation.
  • The imaging standard in the Python world.

 

Pyrex

  • A language which looks like statically typed Python
  • The language can be translated directly into C
  • You can import Python modules, and mix untyped dynamic Python code with typed Python code in the same function
  • Easy to write interfaces to C(++) libraries
  • Encourages you to prototype in Python, and then optimize the slow parts of your program by converting them to Pyrex

 


JavaScript

  • An excellent choice for designing GUIs nowadays
  • Fast, cross-platform, and gives users a fancy GUI application without having to install any software!

 

 

Interface Design

 

To make the program intuitive and easy to learn, I decided to make a simple Wizard-style user interface.

 

When verifying the faces of people in the image, the user is given a GUI that allows little boxes to be placed and dragged around. The boxes are initially created by the face detector, and as you can see to the right, it managed to detect everybody except the guy in the back-middle whose head is just barely visible.

 

 

 

 

 

 

Naming the People

 

After all the regions are selected, the user clicks “Continue>>>” and they’re transported to the page you can see to the left. They get little cutouts of the detected faces which they can then name.

 

 


Rendering the Legend

The final click brings them to the page you see on the right: a nicely rendered legend with all of the names on the bottom, and the numbers in a human-readable order! Also, the users can resize the image with the drag-bar at the top if they’re going to be printing directly from the webpage.

 

I had to figure out a way to number the people in the image. The method I settled on is agglomerative clustering: each number starts in its own cluster, and clusters whose centroids are within a threshold of each other are gradually merged. I then did a final pass that merged clusters which were below the threshold, but within a large horizontal gap between two other clusters.
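As a rough illustration of the clustering idea, here is a minimal Python sketch. It is not the project's actual code. It assumes that each face is reduced to an (x, y) centre point, that clusters are compared by the vertical position of their centroids so that each cluster ends up being one row of people, and that rows are numbered top to bottom and left to right. The function names and the threshold value are hypothetical, and the final gap-filling pass described above is left out.

```python
def cluster_rows(centers, threshold):
    """Agglomeratively group (x, y) face centres into rows (assumed: by vertical position)."""
    clusters = [[c] for c in centers]  # every face starts in its own cluster

    def centroid_y(cluster):
        return sum(y for _, y in cluster) / len(cluster)

    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are vertically closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid_y(clusters[i]) - centroid_y(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # no remaining pair is close enough to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def number_faces(centers, threshold):
    """Return the face centres in numbering order (index 0 is person 1)."""
    rows = cluster_rows(centers, threshold)
    # Assumed ordering: top row first, then left to right within each row.
    rows.sort(key=lambda row: sum(y for _, y in row) / len(row))
    ordered = []
    for row in rows:
        ordered.extend(sorted(row))
    return ordered
```

With a threshold of 30 pixels, three faces at roughly y = 50 and two at roughly y = 120 come out as two rows, and the numbers run left to right along the top row before continuing on the second.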

 

Creating the outline was fairly simple – I just applied a Sobel edge-detection filter to the image and inverted it!
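The outline step can be sketched in a few lines of dependency-free Python. This is only an illustration under assumptions: the image is taken to be a 2-D list of grey values from 0 to 255, the gradient magnitude is clipped to 255, and border pixels are simply left white. The function name is hypothetical, and the real program would more likely apply the filter through an imaging library than loop over pixels by hand.

```python
def sobel_outline(gray):
    """Return an inverted Sobel edge map of a 2-D list of 0-255 grey values."""
    height, width = len(gray), len(gray[0])
    # Start with an all-white image; border pixels are left white.
    out = [[255] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal Sobel response (right column minus left column).
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Vertical Sobel response (bottom row minus top row).
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            magnitude = min(255, int((gx * gx + gy * gy) ** 0.5))
            out[y][x] = 255 - magnitude  # invert: dark lines on a white page
    return out
```

Flat regions come out white and strong edges come out black, which gives the pencil-sketch look that works well as a legend background.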

 

Things that could be improved

 

There are a few small details which I think would improve the program:

  • Better outlining
  • An auto-numbering bug needs to be fixed (if people’s heads are right at the top of the image, their numbers, which get drawn above their heads, will get cut off by the edge of the image)
  • Optionally render the legend to a PDF (for ease of printing)
  • More user-tuneable features for rendering the final legend (font size, number of rows/columns, ability to put the legend vertically down the right hand side instead of at the bottom)
  • There’s a bug in my Drag and Drop code which I couldn’t figure out. Basically, it doesn’t work at all in IE. Not a big loss, though, since everybody should use Firefox anyways! :)
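For the auto-numbering bug in the list above, one possible fix is to fall back to drawing the number below the face whenever there is no room above it. The sketch below is only a suggestion, not code from the project. The function name, the (left, top, right, bottom) box layout, and the margin parameter are all assumptions made for illustration.

```python
def label_position(box, label_height, margin=2):
    """Pick where to draw a face's number so it stays inside the image.

    box is an assumed (left, top, right, bottom) face rectangle in pixels.
    Returns the (x, y) of the label: x is the horizontal centre of the face,
    y is the top edge of the label text.
    """
    left, top, right, bottom = box
    x = (left + right) // 2
    y = top - margin - label_height  # preferred spot: just above the head
    if y < 0:
        # No room above the head, so draw the number below the face instead.
        y = bottom + margin
    return x, y
```

A face at (10, 40, 50, 80) with a 12-pixel label gets its number at (30, 26), above the head, while a face at (10, 5, 50, 45) gets it at (30, 47), below the chin.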

 


References

 

[1]

H. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 203-207. 1996.

[2]

H. Rowley, S. Baluja, and T. Kanade. Rotation Invariant Neural Network-Based Face Detection. In Proceedings of Computer Vision and Pattern Recognition, 1998.

[3]

Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classifiers. IEEE Trans. PAMI, 19:1300--1306, 1997.

[4]

C. Siagian and L. Itti. Biologically-Inspired Face Detection: Non-Brute-Force-Search Approach. Retrieved from: http://citeseer.ist.psu.edu/715136.html. Published in 2004.

[5]

Y. LeCun. Generalization and Network Design Strategies. In Connectionism in Perspective, 1989.

[6]

K. Sung and T. Poggio. Example-based learning for view-based human face detection. A.I. MEMO 1521, C.B.C.L Paper 112, December 1994.

[7]

E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. A.I. Memo 1602, MIT A. I. Lab., 1997.

[8]

Y. Freund, and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139, August 1997.

[9]

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, 1997.

[10]

P. Viola and M. Jones. Robust real-time object detection. Technical Report 2001/01, Compaq CRL, February 2001.

[11]

R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. In IEEE ICIP 2002, Vol. 1, pp. 900-903, Sep. 2002.

[12]

R. Lienhart, L. Liang, and A. Kuranov. A Detector Tree of Boosted Classifiers for Real-time Object Detection and Tracking. In IEEE ICME2003, Vol. 2, pp. 277-280, July 2003.

[13]

R. Lienhart, A. Kuranov, and V. Pisarevsky. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection. In DAGM'03, 25th Pattern Recognition Symposium, Magdeburg, Germany, pp. 297-304, Sep. 2003.

[14]

R. Osadchy, M. Miller, and Y. LeCun. Synergistic Face Detection and Pose Estimation with Energy-Based Model. In Advances in Neural Information Processing Systems, 2004.

[15]

D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.

[16]

The Python Programming Language, available at: http://Python.org

[17]

The OpenCV Image Processing Library, available at: http://www.intel.com/research/mrl/research/opencv/

[18]

The CherryPy Web Framework, available at: http://CherryPy.org/

[19]

The Python Imaging Library, available at: http://www.pythonware.com/products/pil/

[20]

The Pyrex Extension Language, available at: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/