Petabytes

Science magazine has a nice article about dark energy by Adrian Cho. But you can’t read it unless you subscribe. Except that the nice folks at UC Davis have decided that the article is nice publicity for Tony Tyson and the Large Synoptic Survey Telescope, so they’ve put the article online for free. See Mark’s post for some of the theoretical background.

The LSST is an ambitious project — a proposed giant telescope with a wide-field camera that scans the sky in real time. Every three nights it will complete a survey of the visible sky, providing unprecedented access to astrophysical phenomena in the time domain — supernovae, asteroids, variable stars, you name it. It will probe dark energy in at least two ways: using Type Ia supernovae as standard candles (which is how the acceleration of the universe was first discovered), and by measuring cosmic structure via weak gravitational lensing.

One of the great challenges of the project is the huge amount of data it will produce. We are talking about a petabyte of data per year (pdf) — about the size of the entire Internet Archive. To search such a database for some string of characters (say, using “grep”) would take several years! It’s a tremendous intellectual challenge just to design the ways that such data can be usefully arranged so that we can find what we’re looking for. As you might guess, expertise from people like Google is turning out to be very valuable. In fact, the value goes both ways. It turns out that computer companies love to play with astrophysical data, for simple reasons — it’s publicly available, and worth nothing on the open market. We like to think that the data has a loftier kind of worth.
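
As a rough sanity check on that “several years” figure, here is a minimal back-of-the-envelope sketch. The disk throughput is an assumption (one circa-2005 disk streaming at a few tens of MB/s), not a number from the article:

```python
# Back-of-the-envelope: how long would a naive, single-machine "grep"
# over a petabyte take?  The throughput below is an assumption (one
# circa-2005 disk streaming at ~30 MB/s), not a figure from the article.

PETABYTE = 10**15                  # bytes
archive_size = 1 * PETABYTE        # roughly one year of LSST data
throughput = 30e6                  # assumed sustained read rate, bytes/s

seconds = archive_size / throughput
years = seconds / (365.25 * 24 * 3600)
print(f"Serial scan: {seconds:.2e} s, about {years:.1f} years")
# At ~30 MB/s this is about a year per petabyte; slower, seek-bound
# access patterns (or a multi-petabyte archive) push it to several years.
```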

8 thoughts on “Petabytes”

  1. In principle, one could grep almost all of the Internet Archive in a day or two. The IA is composed of some 2000 independent machines, each with four large disks of 250 GB to 500 GB, so the grep could be done in parallel. On any given day a number of machines and disks are not working. Decompressing all the data would be the major bottleneck. — Bruce Baumgart, Research Engineer at the Internet Archive.

  2. One shouldn’t pretend these data volumes are unprecedented in physics experiments. A petabyte/year is the typical order of magnitude for the data rate of a modern large high-energy physics experiment. The two Tevatron experiments currently running at Fermilab, D0 and CDF, take roughly a petabyte/year combined. The LHC experiments, CMS and ATLAS, will each take about a petabyte/year when they start in a couple of years.

  3. Fermilab already has those data issues, yeah. Relational databases are getting quite a bit better too (the Sloan Digital Sky Survey data is stored in an off-the-shelf RDBMS, I believe, despite initial concerns that no OTS product would be sufficient; of course, SDSS has rather less than a petabyte of data). PostgreSQL, I think, has no absolute limit on database size, but does have a limit on table size (30 TB or so?). SQL Server 2005 is size-limited at about a thousand petabytes (what is that, an exabyte?) but has no table size limit. Of course, your OS won’t like it, and in any case, searching might take a long time, but for general search, relational DBs can be pretty good.

    Will the reduced dataset really be a petabyte, though? Searching the raw data doesn’t seem that likely to me.

  4. Sean:

    It’s stupidity beyond imagination if one talks about having to survey the whole sky and accumulate terabytes of data just to identify a few supernovae happening at the edge of the universe.

    The point is that such events are NOT rare at all. If you are talking about supernovae within no more than a few thousand light years, they may be as rare as once in a couple hundred years. But when you are talking about any supernova that can be as far as 10 or 14 billion light years away, then there are plenty of such events happening every split second in any patch of the sky you look at. So why do you have to survey the whole sky for such signals?

    The universe contains roughly 200 billion galaxies, each containing roughly 200 billion stars, and each star has a good chance of going supernova at the end of its lifetime. So counting over the whole age of the universe, there have been roughly 200 billion × 200 billion supernovae, about 4×10^22. That number is at least correct in terms of order of magnitude. The age of the universe is about 14 billion years, or about 4.4×10^17 seconds. So within any one-second period, there are roughly 10^5 newly occurring supernovae somewhere in the universe, within a distance of no more than 14 billion light years.

    If you look at just one small patch of the sky, roughly 10 square degrees, that’s 2.5×10^-4 of the whole sky. That works out to roughly 25 new supernovae per second just within the small patch of sky the LSST looks at. So it should instantaneously discover many, many remote supernovae in any patch of sky it looks at, any time it looks, without the need to accumulate terabytes of data.

    And certainly, with such a huge collection of samples at the terabyte scale, you can ALWAYS find data matching ANY conceivable strange pattern that fits ANY speculative theory you may have. But such data mining can hardly be considered convincing scientific evidence.

    Quantoken
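
For reference, here is the arithmetic in comment 4 spelled out in Python. Every input is one of the commenter’s round numbers (galaxy count, stars per galaxy, one supernova per star, a 10 square degree field) — assumptions rather than measured rates — so the output is an order-of-magnitude restatement, not a prediction:

```python
# Order-of-magnitude restatement of the rate estimate in comment 4.
# Every input is the commenter's round figure, not a measured value.

n_galaxies = 2e11                 # galaxies in the observable universe (assumed)
stars_per_galaxy = 2e11           # stars per galaxy (assumed)
age_seconds = 14e9 * 3.156e7      # ~14 billion years in seconds (~4.4e17 s)

total_supernovae = n_galaxies * stars_per_galaxy   # one per star, as assumed
rate_all_sky = total_supernovae / age_seconds      # events/second, whole sky

field_fraction = 10 / 41253       # 10 sq. deg. out of ~41253 sq. deg. of sky
rate_in_field = rate_all_sky * field_fraction

print(f"All-sky rate:           ~{rate_all_sky:.0e} per second")
print(f"In a 10 sq. deg. field: ~{rate_in_field:.0f} per second")
# Roughly 1e5 per second over the whole sky and ~20-25 per second in the
# field, the same order of magnitude as the figure quoted in the comment.
```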

  5. Sean,

    Can you explain to this simpleton why the CC is described as having been proposed by Einstein as a pressure “to counteract gravity and keep the universe from imploding,” when, actually, without it, Einstein’s theoretical universe expands? It seems to me that, given the explanation that it was intended to prevent an implosion, setting it to zero would lead to a contracting universe, not an expanding one. When Einstein found out that the universe was actually expanding, wouldn’t that have just called for a CC that exceeded gravity, rather than one that merely balanced it? In other words, wouldn’t it have required an increase in the value of the constant?

  6. Anonymous, you can think of the universe as a ball in free fall above the surface of the earth. It will either be going up or coming down, and will be stationary only for an instant. Likewise, without the cosmological constant, the universe can be expanding or contracting, but will only be static for an instant. In the ball analogy, the role of the cosmological constant is played by a tiny booster rocket on the ball that fires with an absolutely constant thrust. You can now imagine a perfect balance between the tug of gravity and the thrust of the rocket, but it’s an unstable configuration. Even with the cosmological constant, you are still more likely to see the ball either going up or coming down, and indeed we observe the universe to be expanding.

  7. Thanks Sean, but is the cc shifted from one side of the equation to the other as needed? If Einstein’s equations would have caused the theoretical universe to collapse without the cc, then how is it that removing it, or setting it to zero, causes the universe to expand? That’s what seems inconsistent to me.

  8. Removing the cosmological constant doesn’t cause the universe to expand. It can be either expanding or contracting, just like a ball in flight can be either going up or going down; at the moment it just happens to be expanding. It is not true that “Einstein’s equations would have caused the theoretical universe to collapse”; that would only be true if the universe started in a stationary state, which it clearly didn’t.
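
To make the balance described in comments 6 and 8 explicit, here is a short sketch using the standard Friedmann equations for a universe of pressureless matter with a cosmological constant — textbook relations, not anything specific to this discussion:

```latex
% Friedmann equations for pressureless matter with a cosmological constant:
\[
  \left(\frac{\dot a}{a}\right)^{2} = \frac{8\pi G}{3}\,\rho - \frac{k}{a^{2}} + \frac{\Lambda}{3},
  \qquad
  \frac{\ddot a}{a} = -\frac{4\pi G}{3}\,\rho + \frac{\Lambda}{3}.
\]
% With \Lambda = 0 the second equation gives \ddot a < 0: the scale factor can be
% growing or shrinking, but \dot a = 0 holds only for an instant (the ball at the
% top of its flight).  Einstein's static universe instead demands
% \dot a = \ddot a = 0, which forces the fine-tuned, closed (k = +1) balance
\[
  \Lambda = 4\pi G \rho, \qquad \frac{k}{a^{2}} = \Lambda .
\]
% Perturbing the density slightly tips the universe into expansion or
% contraction, which is the instability of the balanced configuration
% mentioned in comment 6.
```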
