In OOP, when to use a Collection of Objects with Properties vs. one Object containing a Collection for each Property?
August 11, 2004 1:23 PM   Subscribe

OOPFilter - How do you decide when to use a Collection of Objects with Properties vs. one Object containing a Collection for each Property? [mi]

That is, when you are going to have a bunch of things with some set of properties, how do you decide whether to:

1. Make an object with the properties in question as its fields, and then create an collection of instances of that object for each of the things in your dataset.


2. Make on object with a collection for each property, and have the properties of a particular item in your dataset spread across the different collections, at the same index in each collection.

Very simple example: say I have data for a scatterplot. I could either make a Point object with a field for x and y, and create an array of Point instances, or I could have a Points object with an array for x values and an array for y values.
posted by badstone to Computers & Internet (23 answers total)
Response by poster: I imagine this is a basic topic for people with a formal CS background, but mine is very informal and haphazard. So, if you can just point me at the right reference, or the right search terms that characterize this problem, I'd appreciate. Oh, and for argument's sake, I primarily work with Java.
posted by badstone at 1:26 PM on August 11, 2004

Intuitively I'd say #1 is always the correct choice; it's more in the spirit of OOP. #2 feels to me like the equivalent of reading a series of novels by reading the first word of the first book, then the first word of the second book, then the first word of the third book, and so on until you run out of books, then going back to the second word of the first book and so on.
posted by wanderingmind at 1:29 PM on August 11, 2004

Are the properties attributes of some other entity, or do they belong more fully to each other?

I agree with wandering mind that #1 is generally the correct choice, but every once in a while...
posted by weston at 1:42 PM on August 11, 2004

Definitely #1. The object is a thing with properties. If you have more than one, you put those objects in some kind of collection.

In option #2, you would have to keep around an array index just to navigate around, rather than simply having a pointer to the current object.
posted by smackfu at 1:44 PM on August 11, 2004

badstone: #2 looks like the right answer, but can you give us a more concrete example?
posted by bshort at 1:58 PM on August 11, 2004

PResumably these objects in the collection belong to some other object if you're designing a completely OOP application. So one of its properties would be a collection of the objects. So my answer is 'kind of both'.
posted by billsaysthis at 2:04 PM on August 11, 2004

Response by poster: bshort - I'm looking for general strategy regarding when to choose one or the other, rather than the answer for a specific case. Amongst the tradeoffs I imagine, is the performance cost of instantiating so many full fledged objects vs. maintainability/readability of code.

I figured that this issue is one that already has a standard answer, and it's just a matter of knowing how it is usually named or referred to in CS texts.
posted by badstone at 2:07 PM on August 11, 2004

it might depend on your programming style or the libraries you are using.

how are you going to iterate over these things?

say you want to plot each point, where is the plotting code going to go? where is the iterator code going to go? i would imagine code with an iterator over points is going to be nicer than code with an iterator of x and another iterator over y.

and what happens when you add colour to points? you don't want to add another iterator - you want to keep iterating over points, but change the code doing the plotting to plot colour. if you had a list of colours, you'd need to add an iterator over that list too.

so i'd go with 1 in this case.

[on preview - i don't think general rules exist. i would guess 1 is often more useful for the reasons given above, but i wouldn't exclude 2 without thinking about the problem]
posted by andrew cooke at 2:11 PM on August 11, 2004

ok - here's a good way to choose. one of your collections is indexed by something simple (number of point). the other is indexed by something complex (x coord, colour, etc). simply indexed things go in collections. complex indexed things tend to have separate methods. you don't have a getProperty(x) where x is "x-cooord" or "colour", you have getXcoord() and getColour().

typically. i can imagine cases where you would like this - it depends on a lot on how dynamic the language is.
posted by andrew cooke at 2:17 PM on August 11, 2004

Response by poster: OK, so this is just one of those issues that's more art than science. I can deal with that, it's how I've been treating the issue so far. Just wanted to check and see if there's something I really ought to know. again, I'm trying to fill in gaps in my CS knowledge which was constructed via random walk rather than by design.
posted by badstone at 2:17 PM on August 11, 2004

well, don't listen to me then, because i have no cs degree either ;o)
posted by andrew cooke at 2:30 PM on August 11, 2004

well, if you're using OOP then the correct choice is #1. It's the way i would write it every time. If an object "has a" particular property then it should be a field of that object.

If you're doing structural programming, then both ways are okay (#2 being the more common in C type programs, i would say).

Aside from "programming styles", you may want to make your decision based on performance (data size and execution performance).

I don't think data size is any consideration, the same amount of data is being created using either method.

Execution performance may be a consideration, but i think the performance differences will be neglible. The basic deal is that if you have more properties than objects, then #2 will be faster. If you have more objects than properties, then #1 will be faster. IE, if you have only a few objects but each one has a TON of properties, then it would be better to have #2. If you have many objects each having a few properties, #1 will be faster. If you want me to go into detail as to why this is the case, lemme know.

in summary: if you want to stick with OOP, choose #1. If you want best performance then use #1 if you have fewer properties than objects, or use #2 if you have fewer objects than properties. I personally always use #1 because i write all OOP stuff these days, but i also think it just is clearer code.
posted by escher at 2:51 PM on August 11, 2004

thinking of a counter example is hard, but i think you might make a case for 2 if the relationships were very rapidly varying. however, i must admit that i can't come up with a convincing concrete example. another possibility is interfacing to a relational database, but then you could argue that's because you're being forced into a relational pattern and should use an object store.

i was thinking more abstractly - of a group of things which can be grouped in two ways, and how do you decide which grouping takes priority. i think the answer is that it's normally obvious which grouping is "tighter" because there's an "obvious" object. in the example, a point is an obvious object, a collection of x coordinates isn't. but it's difficult to think of an example where either grouping appears "equally valid".
posted by andrew cooke at 3:42 PM on August 11, 2004

ok, think of a table. do you have rows that contain columns, or columns that contain rows? that's the kind of thing i was wondering about when i said 2 might be equally valid. but i think really you need a better approach than 1 or 2 in such a case.

i'll shut up now.
posted by andrew cooke at 3:45 PM on August 11, 2004

In the situation you gave, then #2 is the correct choice. I think it comes down to complexity, and the potential number of data items. It's going to be easier and faster to work with an object containing a million data points than it is to deal with a million objects. In the latter example, you'd end up creating some kind of interface object anyway. For that OO buzz, just think of your Array as a collection of simple 2 property objects, and the object as the thing that looks after them all.

Hoping that somebody will correct me if I'm wrong, but you may also want to look at the "flyweight" design pattern for an approach to how to deal with large collections of simple objects.

On a final note, you'll have to make sure that all access to the array is controlled through the one object. This is definitely a situation where data-hiding is mandatory. If you end up writing code to access items individually from outside your controlling object, then you're doing something wrong.

posted by seanyboy at 4:57 PM on August 11, 2004

Final point before I go to bed. It'd be nice to try the two methods to see whether there's a performance difference, but for more modern languages, I doubt that there would be. Modern Compilers are pretty adept at noticing & optimising things like this, so I'm guessing that you wouldn't have anything to worry about.
posted by seanyboy at 5:10 PM on August 11, 2004

In your scatterplot example, 1 is more appropriate for OOP, because it affords the opportunity to have methods for manipulating a point attached to the Point class (Point.Transform, Point.Scale, &c.) and methods for manipulating the plot attached to the Plot class that would have a collection of points as an attribute.

More generally, your goal in OO design to model the problem as naturally as possible. What's a scatterplot? It's a bunch of points.

Here's a real-world example: I maintain a Perl class hierarchy for handling Web forms. Now, there's a pretty good form-handling class in CPAN (the public library of Perl classes) called CGI::Formbuilder, but I don't like it. The specific way it offends me is that it has separate attributes for the list of the form's field's names, the list of their labels, the list of their values, the list of their validation rules, &c.

What's a form? It's a bunch of fields. My Form::CGI class has a single "fields" attribute that holds a list of Field objects, each of which encapsulates everything that field needs to know about itself to get its value from the CGI data, print the XHTML controls that represent it on screen, and - if it mixes in Field::DBI::Base (don't try this in Java! no multiple inheritance) - how to format its value for insertion into a SQL database.

And that's the way we do it down on the farm.
posted by nicwolff at 5:46 PM on August 11, 2004

Option (1) is the object-oriented-prgramming solution. Option (2) is the database-centric solution.

Object Oriented programming is a method for taking management structure and applying it to the development of code. Somtimes it is very successful, other times it isn't successful at all. You shold not feel pressured to choose an option simply because it is the Orthodox Methodology. Instead, you should choose the solution that best fits your problem.

The first solution can be rephrased as: An array of pointers to objects, each of which contain data relevant to that object.

The second solution can be rephrased as: a group of pointers to arrays of data, indexed by their respective object ID.

Note that if there are more objects than properties, option (1) has many more non-contiguous memory references than option (2). If you expect to access data relating to adjacent objects, option (2) will be much faster, since the system cache will tend to contain that adjacent data.

On the other hand, if you are doing a binary search, or some other algorithm that is highly non-local, option (1) may provide more direct access to the data in question, despite the numerous memory references (pointers).

Many programmers would make a toy problem that simulates the most important aspects of their primary assignment. They would then implement both solutions and study the relative benefits of both. The resulting data would help them make a confident decision.
posted by Kwantsar at 8:15 PM on August 11, 2004

#1 is the correct OOP way to do it. Consequently, #2 is faster.

I kid, I kid.

In all seriousness, this strikes me as a situation where the correct answer depends entirely upon the particular problem you're trying to solve. I'd code both, run some benchmarks, and then decide. And I'd keep my numbers and test builds around in case I was ever taken to task during a code review or boss meeting.

Full disclosure: I'm a CS autodidact, and I tend to focus more on hard numbers than theory. Your proclivities may vary.
posted by amery at 10:12 PM on August 11, 2004

I'd code both, run some benchmarks, and then decide.

Run time is much less important than programmer time. Do whichever makes the most sense, which will also (usually) be the easiest to code.
posted by kindall at 8:47 AM on August 12, 2004

Your proclivities may vary.

I can promise you that.
posted by yerfatma at 10:01 AM on August 12, 2004

#1, by far. The properties belong to the object. Unless you're going to be regularly accessing things by a specific property, then you'll want to do some other strategy.

For instance, if I want to retrieve a specific employee from a collection and print his hair color (employee.getHairColor()) then it's obviously a property that belongs to that person. If I want to get all employees with brown hair... well, then I'm either going to iterate, or have some caching. Just figure out what your basic unit is. If it's hair color, then go with option #2. If it's the employee (which it probably is), go with #1.
posted by mikeh at 1:04 PM on August 12, 2004

#1 is elegant and extensible. In your example, you could create a ColoredPoint subclass of point or a LabeledPoint or a FloatingLabeledPoint or whatever. That type of flexibility is addictive and is supposed to be waht OOP is about.

If you have to do any huge number of complex transformations on that data, #1 will run like a dog. For example, if you have to run curve fitting or Bezier approximation, you really want a big pile of X's and Y's. Under those circumstances, your best best to retain the OOP model, but pull out some speed is to allocate an array of X's and and array of Y's (or one array of (x,y) pairs), and make all the C-style assumptions you want.

It comes down to this question:
is my data fixed in its representation and its transformations so that I'm able to freely make assumptions about that data?
By the by, I've rarely seen a case where answering yes was the right thing to do.
A better question might be, "can my flexible data format be transformed into a non-flexible format which allows for efficient operation?" I've rarly seen a case where 'no' was the answer here.
posted by plinth at 1:12 PM on August 12, 2004

« Older Parsing Problem   |   Fundraising Newer »
This thread is closed to new comments.