Java Collections

Introduction

A topic that appears frequently in programming is to organize a collection of objects in memory. This topic is also known under the title "data structures", and also relates to various algorithms to expedite access. Before speaking specifically about Java collections I'll depart a little talking a little about the general problem of storing data in memory.

Algorithms and data structures

Every programmer learns, as soon as he starts, a couple of simple data structures. We could say that the first data structure is variable. It stores a single object, and the access time is negligible. The second data structure that it's always learnt is the array. This collection allows us to obtain a value given a position in a table. It's very fast as this structure has a strong parallel with how computer memory is internally organized. The values in an array are in order, and in that order are arranged in memory, therefore it manages access the hardware directly (because memory works as a physical alteration).

A array in Java is done well (but let's assume you know this).

String[] names = new String[10];

Many programmers stop here. It helps the fact that in some languages the array was the most complex data structure supported (for example in BASIC ... or even C!).

One can use arrays for everything. Do I want to save the list of ice cream flavours that I like? Student' class notes? Map coordinates? They can all be stored in an array ... Now ... suppose you want to save a class student's notes. If you want to use arrays it could be like this: I create a class with the student's name and note:

class Note
{
	String student;
	int note;
	
	Note(String student, int note) { this.student = student; this.note = note; }

	public int getNote() { return note; }
	public String getAlumno() { return student; }
}

Now I will store 500 students' notes:

	Note[] notes = new Note[500];
	
	notes[0] = new Note("Juan", 7);
	notes[1] = new Note("Pedro", 8);
	notes[2] = new Note("Manuel", 4);
	notes[3] = new Note("David", 6);
	
	// ...

	notes[499] = new Note("ElÃas", 9);

Now... What if we want to find out the note of Elijah? We do not know "a priori" in which array position the note is stored. The only way for us to proceed is to simply walk the array and ask whether we already got the one we wanted:

	for(Note n : notes)
		if(n.getNombre().equals("ElÃas"))
			return n.getNote();

This is very slow! Okay, "Elijah" is the worst case, because it's at the end. We'll find Elijah after having made 499 checks. But if we assume that all students are accessed with more or less the same probability then we have that the average number of checks that will be needed by student seek is 250!

Can we do better? What if we could count on the order of the notes matched with the students' names alphabetical order? In that case we could go to student 250, and ask: Is Elijah is more or less alphabetically that you? If the 250th student is Luis, then Elias is clearly before him. Then, Elijah is between 0 and 249. So let's check the student 125, who is, for example, "Carlos". Elijah is after Carlos, then we know that their position is higher than 125 (and it was less than 249). So we are limiting up to Elijah in, at most, 9 checks. This algorithm is often called "search dichotomy."

There's no need to fully understand the entire preceding paragraph. The crux of the matter is understood that, using algorithms, we've reduced the need to verify 250 times on average, to check at most nine times. And if, instead of 500 names we will had 100,000 we would need to check, at most, 17 times. The use of this algorithm depends on the fact that the data is sorted.

Besides each solution can bring secondary problems that will be more or less important according to our needs. In the example above, for example, we need to maintain a sorted array... what if we add a new student? We'll have to insert it into the correct position in the alphabetical order and displace all those who are on his right one place. In the worst case means processing the whole array!

All this means that the type of data structure used depends on what we can assume about the collection and how we expect this collection will be used (how it will be read and how it will be updated). Some questions to be asked are:

Am I interested in maintaining the original aribtrary order? Can I (as with the notes) reorder items?
Are there duplicates?
How often is it necessary to add new elements?

Collections in Java

Java has since version 1.2 a full set of classes and interfaces to store collections of objects. In it, all conceptual entities are represented by interfaces, and classes are used to provide implementations of these interfaces. A conceptual introduction should then focus first on these interfaces.

The interface tells us what we can do with an object. An object that meets certain interface is "something I can do X with". The interface is not the entire description of the object, only a minimum of functionality with which it must comply.

As befits such an object-oriented language, these classes and interfaces are structured in a hierarchy. As we descend to more specific levels specific requirements increase and there are more things that objects are required to know how to do.

Collection

The more important interface is Collection. A Collection is anything that can be traversed (or "iterated") and from which we can know the size. Many other classes extend Collection imposing more restrictions and giving more functionality. It's noteworthy that the requirement that "the size must be known" makes it inconvenient to use these classes with collections of objects from which the size is notknown "a priori" (this could be considered a limitation of this framework).

Just in case I clarify again: I can't construct a Collection object. You can not do "new" of a Collection, but all the classes that actually handle collections "are" Collection, and support its operations.

The core operations of a collection then are:

add (T): It adds an element.
Iterator(): You get a "iterator" that allows traversing the Collection, visiting each item once.
size (): Gets the number of items stored in this collection.
contains (t): Asks if the item t is already inside the collection.

If I have a Collection object, there are many things that I can't assume. I can't assume that the order in which I traverse through its contents is relevant, i.e. that if I traverse it again I'll see the elements in the same order in which I saw them the first time. Nor can I assume that there are no duplicates. I can't take performance characteristics: Asking if there's an object in the collection can take anywhere from a little to... a lot =).

A capacity of a Collection object is to be able to be traversed. As at this level there's no defined order, the only way is by providing an iterator, by the iterator() method. An iterator is a "wandering" object that will allow us to go get all the objects as we progressively invoke its next() method. Besides, if the collection is modifiable, we can remove an object during the journey by calling the method remove() on the iterator. The following example scans an Integer collection erasing all zeros:

void removeZeros(Collection<Integer> zeros)
{
	Iterator<Integer> it = zeros.iterator();
	while(it.hasNext())
	{
		int i = it.next();
		if(i == 0)
			it.remove();
	}
}

In this example I use automatic conversion between Integer and int.

From Java 6 is a simplified way to traverse a collection (which served if we do not need to delete items). Useful only if we don't need to remove elements. It's done with a new use of the for keyword:

void show(Collection<?> col)
{
	for(Object o : col)
		System.out.println(o);
}

Internally this code does nothing other than getting the iterator, but looks much more elegant and readable this way. It must be noted that not having the iterator as an explicit variable makes it imposible to use this new syntax for the previous example: you can't remove elements.

There are four interfaces extending Collection: List, Set, Map and Queue. Each one adds requirements on the data that they store and, in return, offers more functionality. Next I'll what is each of them about.

List

A List, is a Collection. The difference from a simple Collection is that the list maintains an arbitrary element order and allows access to those elements in order. We could say that in a list, most of the cases, the order is data. That is, the order is also important information that the list is storing for us.

There is no method in Collection for getting the third element. There can't be because, as stated before, at the Collection leve we are not even sure that if we traverse again the collection the elements will appear in the same order. On the other hand, a list must allow access to the third element, so the following methods are added:

get(int i): Gets element at position i.
set(int i, t T): Puts element t in the position i.

In Java there are mainly two List implementations, both useful in different cases. One implementation is nearly identical to common arrays (such as the notes one I've used as an example). This implementation is ArrayList. The advantage of ArrayList over a simple array is that it's expansible, meaning that it grows as you add elements (while the size of an array is fixed since its creation). The access time to a particular item is very small, but if we remove an element from the start, or from the middle, the class must move all that remaining ones to the previous position, to "cover up" the hole left by the removed element. This makes drawing elements from the middle or from the start to be expensive.

The other implementation is LinkedList. In it, the elements are in a series of nodes attached to each other as links in a chain. Each of these nodes points to his predecessor and the element that follows. Nothing more. There is nothing in each of these nodes that has something to do with the position in the list. To get the item number "n", this List implementation then needs to start from the beginning, from the first node, and to move forward "next" n times. Finding the 400th element then involves 400 of these little steps. The advantage is that it is possible to remove items from the beginning of the list and from the middle very efficiently. To remove an element, only its two "neighbours" need to be modified so that they "connect" with each other, ignoring the element that is being removed. As in a chain, a link is withdrew opening the links that are adjacent to the one being deleted and closing them so as to exclude it. There is no need to make any changes to the rest of the elements of the list.

In other languages dealing with linked lists can be a bit more laborious. In Java, LinkedList is used exactly like other types of List, so you don't need to know anything extra to start using it. Well, this is not quite true... we must be clear as to their particular performance. Its get(int) method is slow because, as I said, it first needs to arrive to the requested item (jumping from link to link). This makes traversing the list with a simple for( int i = 0 ; i < list.size(); i++) to be incredibly slow! The complexity goes from being linear to quadratic, i.e.: If 300 items list is traversed this way, it will take as if you have 44,850 items! A LinkedList only be walked with iterators.

An ideal use for LinkedList is to create "queues", in which the elements are added to the end, and removed from the beginning. For this use, instead of List, another interface can be used: Queue (also implemented by LinkedList) which is more specific for this task.

To be completed!

En castellano

By Nicolás Lichtmaier. Questions, comments, suggestions and corrections will be well received.
From Buenos Aires, Argentina.