Your questions prompted a bit of curious research on what you're pursuing.
Your post history is very rewarding to pore over, and I'm glad I did. You definitely should publish a book and set the record straight once and for all. I think it would bring you a lot of closure. And you should make more noise about the mathematical models you've developed so you get more recognition.
I really hope this search/probability venture of yours pans out, and I'm quite interested to learn more about it. In fact, I'd feel very privileged if I could hitchhike along for a while as a curious passerby - not directly involved, more just observing and learning.
I did a little digging, but I can't be certain since you're rather good at the anonymity thing - is this your current email address? Obfuscated for privacy:
put list ('73'x||'69'x||'67'x||'6d'x||'61'x||'77'x||'61'x||'69'x||
'74'x||'65'x||'40'x||'6f'x||'70'x||'74'x||'6f'x||'6e'x||
'6c'x||'69'x||'6e'x||'65'x||'2e'x||'6e'x||'65'x||'74'x);
I think the best thing that would answer your questions would be direct testing, both for the experience of trying out a bunch of different hardware, and also to work out how to get the best bang for your buck with the workload you're using. Throwing your existing code onto various different EC2 instance types would probably be your best bet to start with.
On hardware, though: Xeon processors incorporate extra instructions and features aimed at high-performance computing, whereas desktop/consumer-class CPUs get hardware and chipset support for consumer features - theft prevention, basic media acceleration, wireless projector connectivity, and the like.
In a server environment, workloads tend to need scalability more than processor power - it's generally significantly faster to run 100 tasks in parallel on slow hardware than in sequence on fast hardware. A desktop context is usually the opposite though - running a small number of applications, each of which generally needs to run as quickly as possible so the system feels snappy.
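To put rough numbers on it: 100 independent one-second tasks take 100 seconds run back to back on one fast core, but only about 2 seconds spread across 100 cores each running at half that speed, assuming the tasks don't contend for shared resources.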
It also generally boils down to manufacturing. You don't get 24-core desktop-oriented i7 chips yet, because you can't yet pack 24 4GHz+ cores into a CPU die.
Oh, and that datacenter you mentioned, with dual 10GbE - is that $30k a month, or more?
> Oh, and that datacenter you mentioned, with dual 10GbE - is that $30k a month, or more?
I don't recall the details of their pricing. The colocation facility is at the old IBM Wappingers Falls, Myers Corners lab complex. There were several nice buildings, one with a lot of well done raised floor. IIRC, they were getting revenue from having a fiber roughly down the Hudson River for the 70 miles or so to Wall Street.
I don't plan to go for a $4000+ Xeon processor soon, but I wanted to understand what the value is and see what I was missing.
Or, one can get an eight core AMD processor at 4.0 GHz for about $160. And, say, an 18 core Xeon at 3.3 GHz for $4000.
So, the AMD has price per core per GHz of

160 / ( 8 * 4.0 ) = 5

and the Xeon has

4000 / ( 18 * 3.3 ) = 67.34

for a ratio of

67.34 / 5 = 13.47

That's 1347%, and that's a biggie.
For my startup, the software for at least early production appears to be ready. Now I'm loading some initial data and generally, e.g., for my server farm, doing some planning.

Then I will do some testing, alpha, beta and possibly some revisions.

Then I will go live and go for publicity, users, ads, revenue, earnings, and growth.

The basic core applied math and software really should work as I intend. The main question is, will lots of users like the site? If so, then the project should become a big thing.
From my software and server farm architecture and software timings, one 8 core AMD processor at 4.0 GHz should be able to support on average, 24 x 7, one new user a second. Then I should be able to get, ballpark,

$207,360

a month in revenue.
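For scale: one new user a second, 24 x 7, is

60 * 60 * 24 * 30 = 2,592,000

unique users in a 30 day month, so that revenue figure works out to roughly $0.08 per unique user per month.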
In that case, growth should be fast, and I should consider Xeon processors if they have some significant advantages.

E.g., some builders offer two Xeon processors, 18 cores each, on a motherboard in a full tower case with lots of room for hard disks. So, that would be 36 cores, and four of the 8 core AMD processors would be 32 cores in four cases.

Then, sure, for more, get some standard racks and put in servers designed for racks.

I have a lot of flexibility in how many processors and motherboards I use because the software and server farm architecture is highly scalable just via simple sharding.
The architecture has just five boxes, Web server, Web session state server, SQL Server, and two specialized servers full of applied math. Each of these five can run in their own server(s), or, for a good start, all five can run in one server.
> I don't recall the details of their pricing. The colocation facility is at the old IBM Wappingers Falls, Myers Corners lab complex. There were several nice buildings, one with a lot of well done raised floor.
Cool.
> IIRC, they were getting revenue from having a fiber roughly down the Hudson River for the 70 miles or so to Wall Street.
Ooh, interesting. Very interesting. That means deliciously low latency from anything inside that building to the stock market. If they know what's good for them, they'll have slapped inflated prices on the pipes to Wall Street and set reasonable pricing for everything else. In the ideal case, that means lower pricing for anything that doesn't need those lines.
> I don't plan to go for a $4000+ Xeon processor soon, but I wanted to understand what the value is and see what I was missing.
Most definitely.
> Or, one can get an eight core AMD processor at 4.0 GHz for about $160. And, say, an 18 core Xeon at 3.3 GHz for $4000.
> ...
> That's 1347%, and that's a biggie.
oooo.
I'd be interested to know the exact models you compared there, because GHz is never, ever an effective baseline measurement. How big is/are the on-die cache(s)? What chip architecture does it use? What are the memory timings? What's the system bus speed?
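For what it's worth, the raw dollars-per-core-per-GHz number is trivial to script, which is exactly why it's so tempting. Here's a quick hypothetical sketch using just the ballpark figures you quoted (not exact part specs) - but note that it says nothing about cache, memory bandwidth, or how much work each core gets done per clock:

    # Dollars per core per GHz; prices and clocks are the ballpark
    # figures from this thread, not exact part specifications.
    chips = {
        "8-core AMD @ 4.0 GHz":   (160.0, 8, 4.0),
        "18-core Xeon @ 3.3 GHz": (4000.0, 18, 3.3),
    }

    for name, (price, cores, ghz) in chips.items():
        print(f"{name}: ${price / (cores * ghz):.2f} per core per GHz")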
Let me give you a nice example.
I used to use a 2.66Ghz Pentium 4, back in the bad old days of Firefox 3 when Gecko was dog-slow and Chrome didn't exist.
I now use a laptop based on a 1.86GHz Pentium M. The chip spends most of its time downclocked to 800MHz to conserve energy and produce less heat (this laptop could almost cook eggs).
Guess what? In practice, this laptop is noticeably faster.
It wasn't until a little while ago that I learned why: the Pentium 4 was using 100MHz SDRAM. This laptop's memory pushes 2000MB/s. I'd need to pop a cover to check the clock speed and memory type but I suspect it's 667MHz DDR2.
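For a rough sense of the gap, assuming those really are the parts involved: PC100 SDRAM peaks at about

100 MHz * 8 bytes = 800 MB/s,

while DDR2-667 is good for roughly

667 MT/s * 8 bytes = 5,300 MB/s

per channel, so even heavily downclocked, the newer chip gets fed data several times faster.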
I also eventually figured out the other main cause of my woes: the chipset that computer was based on had an issue that made all IDE accesses synchronous, where the WHOLE SYSTEM would halt while the disk was waiting for data. Remember the old days when popping in a CD would make the system freeze for a few seconds? This was like that, but at the level of individual disk requests. Whenever anything requested data, the entire system would lock up for the few milliseconds it would take for that request to complete. If something was issuing tons of requests, the system could be brought to its knees pretty easily.
In practice, this meant that requesting even just a few MB/s from the disk could make my mouse pointer laggy and move across the screen like a slideshow. I can run "updatedb" - a program that iterates over my entire disk to build a quick-access index - on this laptop and only slightly notice it running in the background, whereas on the old system I had to walk away while it ran because I couldn't even move the mouse pointer smoothly. On this laptop it completes in about 3-4 minutes at the most, for a 60GB disk; the desktop had an 80GB disk and IIRC it took upwards of 10-15 minutes.
Other people could give you much more relevant examples, but these are some of my own experiences, and they demonstrate that it's also the RAM, the motherboard chipset - all the components put together - that contribute to a system's overall effectiveness.
Granted, few motherboards have serious issues like I experienced, and since most systems aim for maximum performance the differences are reasonably minor in the grand scheme of things; enough for people to nitpick, but ultimately equivalent, especially with server boards.
> For my startup, the software for at least early production appears to be ready. Now I'm loading some initial data and generally, e.g., for my server farm, doing some planning.
> Then I will do some testing, alpha, beta and possibly some revisions.
> Then I will go live and go for publicity, users, ads, revenue, earnings, and growth.
> The basic core applied math and software really should work as I intend.
Sounds awesome...
> The main question is, will lots of users like the site? If so, then the project should become a big thing.
Please add me to your list of potential alpha testers. I'd love to see what this is, but I'm not sure if I'm squarely in your target market; you say this is a Project X for everyone, and probability applied to search sounds like a very enticing field, but unless it's something as ubiquitous as Google (applies to literally the entire Web, has a multi-exabyte cache of the entire Internet held in RAM) I'm not sure how frequently I'd use it. I love WolframAlpha, for example, and yet I've used it less than 10 times, and that just to play with.
> From my software and server farm architecture and software timings, one 8 core AMD processor at 4.0 GHz should be able to support on average, 24 x 7, one new user a second.
The way you've worded that generates a lot of curiosity. What do you mean by "one new user a second"? O.o
> Then I should be able to get, ballpark, $207,360 a month in revenue.
Okay that's definitely worth it. :D
> In that case, growth should be fast, and I should consider Xeon processors if they have some significant advantages.
They do. They definitely do, especially compared to the AMD you put next to it earlier.
> E.g., some builders offer two Xeon processors, 18 cores each, on a motherboard in a full tower case with lots of room for hard disks. So, that would be 36 cores, and four of the 8 core AMD processors would be 32 cores in four cases. Then, sure, for more, get some standard racks and put in servers designed for racks.
I would start with racks, unless you have standard cases just lying around and can afford (financially) to be a bit inefficient to begin with. Datacenters are designed explicitly for rackmount servers, not tower cases. Two immediate advantages of racked servers are far better cooling and significantly higher compute density - and that last one directly affects your bottom line: tower cases are atrocious for packing lots of computational power into a small space, so you'll use more floor space at the datacenter and likely get charged higher rent because of it.
> I have a lot of flexibility in how many processors and motherboards I use because the software and server farm architecture is highly scalable just via simple sharding.
That's good, you may need it in the future.
> The architecture has just five boxes, Web server, Web session state server, SQL Server, and two specialized servers full of applied math. Each of these five can run in their own server(s), or, for a good start, all five can run in one server.
Awesome.
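With that kind of split, the sharding logic itself can stay almost trivially small. Purely as a hypothetical sketch of the idea (made-up host names, and obviously nothing to do with your actual code):

    # Toy user-based sharding: hash a stable user key and map it
    # onto whichever servers are currently in the farm.
    import hashlib

    SHARDS = [
        "box-1.example.internal",   # hypothetical host names
        "box-2.example.internal",
    ]

    def shard_for(user_key: str) -> str:
        # The same user key always maps to the same shard.
        digest = hashlib.sha256(user_key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("user-12345"))

Growing the farm is then mostly a matter of adding entries to that list, plus a resharding story (e.g., something consistent-hashing-like) if existing users ever need to move.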
Some of the people here have mentioned starting out on AWS nodes. I have to say, that may well work out significantly cheaper (in terms of time and energy, not just money) than renting space in a datacenter.
On average, once a second a user comes to, call it, the home page of the site, and is a "new" user in the sense of the number of unique users per month.

The ad people seem to want to count mostly only the unique users. At my site, if that user likes it at all, then they stand to see several Web pages before they leave.

Then, with more assumptions, the revenue adds up to the number I gave.
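Just to make that concrete with one illustrative set of numbers (placeholders, not my real figures): with, say, 4 page views per unique user, 2 ads per page, and a $10 CPM, each unique user is worth

4 * 2 * 10 / 1000 = 0.08

dollars, and 2,592,000 unique users a month times $0.08 gives the $207,360.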
At this point this is a Ramen noodle budget project. So, no racks for now. Instead, it's mid-tower cases.

One mid-tower case, kept busy, will get the project well in the black with no further problems about costs of racks, Xeon processors, if they are worth it, etc.

Then the first mid-tower case will become my development machine or some such.
This project, if successful, should go like the guy that did Plenty of Fish: just one guy, two old Dell servers, ads just via Google, and $10 million a year in revenue. He just sold out for $575 million in cash.
My project, if my reading of humans is at all correct, should be of interest, say, on average, once a week for 2+ billion Internet users.
So, as you know, it's a case of search. I'm not trying to beat Google, Bing, Yahoo at their own game. But my guesstimate is that those keyword/phrase search engines are good for only about 1/3rd of interesting (safe for work) content on the Internet, searches people want to do, and results they want to find.
Why? In part, as the people in old information retrieval knew well long ago, keyword/phrase search needs three assumptions: (1) the user knows what content they want, e.g., a transcript of, say, Casablanca, (2) knows that that content exists, and (3) has some keywords/phrases that accurately characterize that content.

Then there's the other 2/3rds, and that's what I'm after.
My approach is wildly, radically different but, still, for users easy to use. So, there is nothing like page rank or keyword/phrases. There is nothing like what the ad targeting people use, say, Web browsing history, cookies, demographics, etc.
You mentioned probability. Right. In that subject there are random variables. So, we're supposed to do an experiment, with trials, for some positive integer n, get results x(1), x(2), ..., x(n). Then those trials are supposed to be independent and the data a simple random sample, and then those n values form a histogram and approximate a probability density. Could get all confused thinking that way!
The advanced approach is quite different. There, walk into a lab, observe a number, call it X, and that's a random variable. And that's both the first and the last we hear about random. Really, just f'get about random. Don't want it; don't need it. And those trials, there's only one, for all of this universe for all time. Sorry 'bout that.
Now we may also have random variable Y. And it may be that X and Y are independent. The best way to know is to consider the sigma algebras they generate -- that's much more powerful than what's in the elementary stuff.

And we can go on and define expectation E[X], variance E[(X - E[X])^2], covariance E[(X - E[X])(Y - E[Y])], conditional expectation E[X|Y], and convergence of sequences of random variables: in probability, in distribution, in mean-square, almost surely, etc. We can define stochastic processes, etc.
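E.g., with that machinery it's easy to see that independence is strictly stronger than zero covariance: take X standard Gaussian and Y = X^2. Then the covariance E[(X - E[X])(Y - E[Y])] = E[X^3] = 0, yet X and Y are about as far from independent as they could be -- the sigma algebra generated by Y sits inside the one generated by X, so it could be independent of it only if it were trivial, which it isn't.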
With this setup, a lot of derivations one wouldn't think of otherwise become easy.
Beyond that, there were some chuckholes in the road, but I patched up all of them. Some of those are surprising: Once I sat in the big auditorium at NIST with 2000 scientists struggling with the problem. They "were digging in the wrong place". Even L. Breiman missed this one. I got a solution.

Of course, users will only see the results, not the math!
Then I wrote the software. Here the main problem was digging through 5000+ Web pages of documentation. Otherwise, all the software was fast, fun, easy, no problems, no tricky debugging, just typed the code into my favorite text editor, just as I envisioned it.

Learning to use Visual Studio looked like much, much more work than it was worth.
I was told that I'd have to use Visual Studio at least for the Web pages. Nope: What IIS and ASP.NET do is terrific. I was told that Visual Studio would be terrific for debugging. I wouldn't know since I didn't have any significant debugging problems.

For some issues where the documentation wasn't clear, I wrote some test code. Fine.
Code repository? Not worth it. I'm just making good use of the hierarchical file system -- one of my favorite things.

Some people laughed at my using Visual Basic .NET and said that C# would be much better. Eventually I learned that the two languages are nearly the same as ways to use the .NET Framework and get to the CLR and otherwise are just different flavors of syntactic sugar, and I find the C, C++, C# flavor bitter and greatly prefer the more verbose and traditional VB.
So, it's 18,000 statements in Visual Basic .NET with ASP.NET, ADO.NET, etc., in 80,000 lines of my typing.