

Amazon SimpleDB/RDS billing question




Posted by satoo, 06-09-2011, 05:50 PM
For Amazon SimpleDB, the pricing page says they charge $0.14 per machine hour. I want the database to be available 24/7, but we're not actually going to be making complex requests 24/7. This is for a web app, so it needs to always be "on" in the sense that someone at 3 am could make a request, but we aren't going to be using an hour's worth of 1.7 GHz Xeon computing power every single hour of the day. So because it needs to be available 24/7, does that mean I'd be paying about $100 (0.14*24*30)? Or would I only be paying for the actual computing hours (which might only be a few hours' worth of a 1.7 GHz Xeon's computing ability)?

Posted by srfreeman, 06-09-2011, 08:36 PM
satoo; Amazon's SimpleDB is not an RDBMS; it is a NoSQL solution more closely related to a key-value data store. The charges are based on data in and data out. You will only be charged by the transaction. You will want to do more reading on SimpleDB; it may or may not be what you need, but it is certainly different from MySQL.

Posted by satoo, 06-09-2011, 11:21 PM
Thanks for clarifying that about the pricing for SimpleDB. I think RDS is the opposite, where it is billed like EC2, because you pay per allocated instance. We don't really need an RDBMS for this project (same one as mentioned in my other thread). All we're doing is returning results near someone's zip code. By converting zip codes to latitude and longitude on each row ahead of time and then putting that in SimpleDB, we remove one other step and reduce the amount of computing power per query. We don't need MySQL's math functions to do the haversine formula and compute distance, since we're basically just making a square by adding/subtracting latitude and longitude rather than a circle and saying "close enough." There's never a need to do a JOIN or really anything complex. It's just a very simple SELECT * WHERE ____ each time. The script that processes that 1.2 GB CSV file and reduces it in size to ~134 MB takes 14 minutes on my desktop computer. We could possibly even do this online with EC2 and just keep the instance running for one hour a day, storing today's and yesterday's results on the EBS storage, automatically starting it once a day, then terminating it once it's done. Does that seem like a sound idea to you, as long as SimpleDB's speed doesn't become a problem? Last edited by satoo; 06-09-2011 at 11:26 PM.
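
To make the "square instead of a circle" step concrete, here is a minimal Python sketch of that bounding-box arithmetic, assuming the ~69 miles per degree figure used in this thread (the function name and sample coordinates are illustrative only):

def bounding_box(lat, lon, radius_miles):
    # 1 degree of latitude/longitude is treated as ~69 miles, as described above;
    # strictly speaking, longitude degrees shrink away from the equator.
    delta = radius_miles / 69.0
    return (lat - delta, lat + delta, lon - delta, lon + delta)

# e.g. a 30-mile "square" around a user's zip code centroid
lat_min, lat_max, lon_min, lon_max = bounding_box(40.75, -73.99, 30.0)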

Posted by srfreeman, 06-10-2011, 04:28 PM
Sounds like you have a handle on it. Let us know how it works out. Just an opinion but I believe you may be better off if you compare yesterday's 1.2GB file with today's 1.2GB file > then create the file with just the changes > then run the script to do the major surgery on the smaller data set > use the tiny resulting file to perform the daily update.
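
A rough Python sketch of that compare-then-update idea (file names are placeholders; for files this size a sorted-file comparison such as comm would use far less memory than holding a set of lines):

# Keep only the lines that are new or changed in today's dump.
with open("yesterday.csv") as f:
    old_lines = set(f)
with open("today.csv") as f:
    delta = [line for line in f if line not in old_lines]
with open("delta.csv", "w") as out:
    out.writelines(delta)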

Posted by satoo, 06-10-2011, 06:45 PM
SimpleDB won't work out. 1) You have to zero-pad all integers because it compares everything lexicographically, i.e. it thinks 5 is greater than 10 because 5 is greater than the first digit of 10, so you have to make it 05. There's no schema at all; everything must be a string. 2) You can't have negative numbers unless you only ever want to use them as strings. You have to apply an offset, like adding 100000 to the negative numbers, in order for the <, > etc. logic to work. 3) There's no mass import. You have to do the equivalent of INSERT to get all your starting data on there. 4) It cost $3.18 to "INSERT" ~43,000 rows. That means just keeping the database updated on SimpleDB, even if I compute all the changes offline and don't query the database for that, will cost ~$100/mo if we have to INSERT 43,000 new rows every day (not counting anything we drop). The queries are very fast, though. For what I think this is going to cost on SimpleDB, we'd actually save money using Amazon RDS and just leaving a small (Small DB Instance: 1.7 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform, Moderate I/O Capacity) instance on 24/7. It's only $227.50/year if you pay 1 year in advance, which we'll do if it works out. Otherwise, hourly billing is $0.11 while it's on. And that lets you use mysqlimport anyway. So we're looking into that and exploring other cloud-based solutions to store the database.
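
The zero-padding and offset tricks from points 1) and 2) amount to encoding numbers so that string comparison matches numeric comparison; a small Python illustration (the pad width and offset are arbitrary choices, not SimpleDB requirements):

OFFSET = 100000   # shift negative values into the positive range
WIDTH = 7         # zero-pad so "5" doesn't sort after "10"

def encode(n):
    return str(n + OFFSET).zfill(WIDTH)

assert encode(5) < encode(10)     # "0100005" < "0100010"
assert encode(-250) < encode(3)   # holds for negatives as well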

Posted by Latic, 06-10-2011, 07:12 PM
Satoo, look at something like this simple LAMP cluster reference: http://doc.3tera.com/applogic24/RefAppLampCluster.html, but instead of having multiple web servers you'd want to have a "mysql cluster".

Posted by tchen, 06-10-2011, 07:48 PM
The $227 is the up-front fee. You still need to pay $0.046/hr for every hour used; this is a discount on the nominal $0.110/hr rate. A MySQL cluster is very different from a LAMP cluster, though it can be a component of the latter. AWS RDS is actually not a cluster per se. Each instance is akin to a standard EC2 server loaded with a protocol proxy, MySQL, and the requisite scripts to do backups and dumps. They're controllable via the API. Out of the box, an individual RDS server isn't that 'reliable'. To get into HA territory, you need a Multi-AZ deployment, which shims itself into the protocol proxy and replicates your insert and update commands to both the live server and a hot standby. To scale out horizontally, you also chain multiple RDS instances together (one as master, the others as Read Replicas).

Posted by satoo, 06-10-2011, 07:59 PM
I'm not following you, and I find their pricing extremely confusing. Are you saying that: 1) I have to pay $227.50+(0.046*x) where x is the total number of hours I end up using it within one year? or 2) I buy $227.50 worth of hours at the $0.046 rate. Anything beyond that, I pay $0.046/hour for.

Posted by tchen, 06-10-2011, 09:31 PM
The first option. $227.50+(0.046*x)
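
For a rough sense of what that means for an instance left on 24/7 for a full year, using the rates quoted above (a back-of-the-envelope sketch, not an official quote):

hours = 24 * 365                    # 8760 hours in a year
reserved = 227.50 + 0.046 * hours   # one-year reserved: about $630 total
on_demand = 0.11 * hours            # plain hourly billing: about $964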

Posted by Latic, 06-10-2011, 10:25 PM
tchen, it's not all that different, just different components. I'm trying to keep this as simplistic as possible.

Posted by srfreeman, 06-11-2011, 02:46 AM
satoo; Regardless of how or where you do it, a scalable SQL solution is going to be expensive. Yes, SimpleDB is all string data, the most common or simplest type of data to work with. The "DB" part is tough to comprehend if you come from the RDBMS world; in fact it is hard to explain without using some form of the word "relate". First you have to break every rule you learned about normalized or structured data. Keys and indexes are still important. I am going to throw in an example that will probably cause many to call me crazy, and it is simplistic, but you may find it useful. Think of creating a system to return the address of a retail store within 25 miles of a given zip code. The processing is done offline, only once, to find the zip codes that are within 25 miles of the store address. Let's say that we find there are 10 zip codes within a 25 mile radius of the store address. 10 rows will be inserted, all containing the store address, and each will contain a key field holding one of the 10 zip codes. When a zip code is entered, a simple rundown of the index on the key field will return the address if there is a match. Almost no compute power is required (no math), and the data will remain static until there is some change in the zip code structure. The mantra: storage is cheap. You must have key values that can designate what you want to find. Again, this type of thing may not match what you want to do, but in the long run it is the most scalable and cheapest.
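
That example, sketched as a Python dict standing in for the key/value store (the address and zip codes are made up for illustration):

# One item per (zip code, store) pair, precomputed offline; a lookup is then a
# plain key match with no distance math at request time.
store_address = "123 Main St, Springfield"
zips_within_25_miles = ["62701", "62702", "62704"]   # the precomputed set

items = {z: store_address for z in zips_within_25_miles}

def store_near(zip_code):
    return items.get(zip_code)   # None means no store within 25 miles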

Posted by satoo, 06-11-2011, 04:01 AM
It's a db of ~900,000 products, and the zip code is the product's location. Each product is only available to buyers within a radius from that zip code that the vendor specifies. Sometimes the radius is 30 miles, and sometimes it's 500 miles, so we can't precalculate all the matching zip codes, unfortunately. What we're doing for distance is just converting the zip codes to latitude and longitude, and then basing our queries on the fact that 1 degree of latitude/longitude is ~69 miles and making a 'square' around the user's zip code. Technically, the haversine formula would be more accurate because the radius should be a circle, not a square, but it's very close anyway, and we could still compute the circle later on. Anyway, the original CSV is a gigantic 1.2 GB, but after getting rid of unnecessary fields, we reduced it in size to ~130 MB. It's still the ~900,000 rows; we just got rid of the columns. The zip code is just one of the 17 remaining columns. But as I said in the other thread, whole rows and portions of rows change every day. The whole thing doesn't even have to be that scalable. The issue here is really just processing all this data, keeping it updated daily, and finding a way to get decent query performance with a 900,000 row database. As I said in the other thread, there will only be a few hundred people using this per day, and each only making one or two queries per day. It's not like hundreds of people using it throughout the day making queries all the time. I do want the ability to scale that to a few thousand people, but again with each just making one or two queries. Not thousands of simultaneous users or anything. Even if we process the whole thing offline and compute the daily difference between the CSVs we are given, enough rows change daily that it's impractical to use SimpleDB and just execute tens of thousands of drops/inserts when I'm getting charged for every single one. I'm open to any ideas people have to solve this...doesn't have to be a cloud-based solution.
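
For reference, the "more accurate circle" mentioned above is the haversine great-circle distance; a standard Python implementation (an Earth radius of ~3959 miles is assumed) looks like this:

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in miles.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3959 * math.asin(math.sqrt(a))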

Posted by srfreeman, 06-11-2011, 05:49 AM
satoo; Sorry for rambling, but I am flying somewhat blind (and database questions are somewhat of a hobby). From a performance standpoint, calculating the distance for each of 900,000 products every time a query runs is not viable (probably not what you are doing, but it sounds that way from here). Is there any way these products can be grouped by fulfillment center, with the customer profile designating the centers available to them? From the update standpoint, are there new products (or SKUs) added every day, or is it inventory fluctuating, where for example an SKU is out of stock today and in stock another day?

Posted by satoo, 06-11-2011, 06:56 AM
For each product in the CSV, I've converted the zip codes to latitude and longitude and stored them in two separate fields. I've also taken into account that product's buyer radius and added max/min latitude and longitude fields. So if the product had a 69 mile radius, I add 1 to its latitude and longitude for the maximums, and then subtract 1 for the minimums. All of the above I'm doing offline on my desktop at the moment. So when a user enters their zip code, I would convert that zip code to latitude and longitude, and then query the database for all products where the user's zip code is within that max/min latitude and longitude. Having never worked with a 900,000 row database before, only ever using MySQL, and not being an expert on this, I don't know if that's viable from a performance standpoint, but the coding is just a very simple SELECT statement. I haven't decided yet if I'm going to let users pick their maximum radius, which would make it even simpler. If I do let users pick the max radius, I suppose I could split the db into smaller tables, or 'domains' in SimpleDB terminology. In that case, I would just split it up by metropolitan area, or divide the US up into let's say 10 regions. Then I would just add a column to my separate zip code/latitude/longitude table with a region code or something, and I'd match the surfer's zip code with that, query the database for their region, etc. That's actually a pretty good idea...and might make the data way more manageable. The other way I could split it up would be by price. To answer your question about updates, these are all off-lease/used/refurbished/etc., so new listings are added, old listings are dropped, prices are updated, etc. Tens of thousands change per day. My interest in the cloud is really just because I thought it would be less expensive than getting another dedicated server and also a learning opportunity. The potential scalability is more like a bonus than the actual reason for considering it.
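
The query described above would be a single range comparison against those precomputed columns; a minimal sketch with the MySQLdb driver, where the connection details and the lat_min/lat_max/lon_min/lon_max column names are assumptions for illustration, not the actual schema:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="listings")
cur = conn.cursor()

user_lat, user_lon = 40.75, -73.99   # looked up from the user's zip code

cur.execute(
    """SELECT * FROM products
       WHERE %s BETWEEN lat_min AND lat_max
         AND %s BETWEEN lon_min AND lon_max""",
    (user_lat, user_lon),
)
rows = cur.fetchall()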

Posted by Latic, 06-11-2011, 10:19 AM
srfreeman, I'm not sure it'd be mega expensive; maybe $200-$300 a month depending on how many CPUs / how much memory is required.

Posted by tchen, 06-11-2011, 12:41 PM
@satoo With 900k records, you're still way within MySQL limitations. Here's a MySQL AB presentation on your topic: http://www.scribd.com/doc/2569355/Ge...rch-with-MySQL Basically, they have an 800k-row table with an optimized query running at 0.06 to 1.2 secs. An alternative to look at might be MongoDB with its geospatial indexing, although if you're already using SQL, you might as well stick with MySQL or RDS. I just personally tend to roll my own instances with MySQL because RDS is a bit opaque when it comes to profiling. Not impossible, but it requires a bit more hoop jumping. It's a tradeoff with ease of setup that you'll have to weigh. Last edited by tchen; 06-11-2011 at 12:48 PM. Reason: Read it again and srfreeman was suggesting something else different altogether :)

Posted by satoo, 06-11-2011, 05:43 PM
Thanks for the link- very handy. I'm open to using MySQL on EC2 or something, it's just that if using any Amazon service is going to end up costing more than $150-200/mo., then we'd be better off just getting another dedicated server at our current hosting company. Our current server is a P4 with 2 gigs of RAM. I don't think that will cut it.

Posted by tchen, 06-12-2011, 10:11 AM
Generally speaking, EC2 is not designed for robust single-server use. If you're not setting up dynamic horizontal scaling, then I'd normally recommend against it, as most people are better off just getting a dedicated server or VPS. You'd have to set up the same data backup/recovery systems anyway, and you'll generally have better node resilience given that most dedicated servers are set up with some form of RAID1/RAID10. The only reason to stick with EC2 at low server numbers is if you wish to build developer expertise in that system for future expansion. If it were a matter of HA, you could achieve the same using basic cluster techniques that apply just as well to dedicated servers, at a fraction of the cost.

Posted by satoo, 06-12-2011, 03:40 PM
Hmm. Well, SQL Azure is only $10/mo. for up to a 1 GB database. http://www.microsoft.com/windowsazure/sqlazure/ They bill by storage usage, not by computing hours or RAM usage, which differs from the Amazon model; the catch is that it's a multi-tenant system and Microsoft uses throttling to prevent you from abusing it. http://social.msdn.microsoft.com/For...-af7e59a4f939/ From what I've read about it, it has most of the same features as Microsoft SQL Server, and you can now use bcp to load the database from CSV. Might be the best solution here. I'd probably still pull and process the original file every day offline or on EC2, only because the script relies on the unix command line tools cut, split, sort, and cat.

Posted by srfreeman, 06-12-2011, 08:24 PM
satoo; Let me see if I really understand the query issue. My perception: 1) Each row (product listing) contains a zip code (lon, lat) and the distance from that zip within which the product is available. 2) The user provides a zip code (lon, lat). We have to visit each row and see if the user's zip code is within the product's circle or square of availability. This appears to be just the opposite of the example provided by tchen, where the speed comes from the fact that it does not have to scan the entire table (find ten and we are done). Effectively we are not trying to find 10 hotels within XX miles of us; we are checking with 900,000 hotels and asking if we are within YY miles of them. Do I have this straight, or am I all wet?

Posted by tchen, 06-12-2011, 09:22 PM
Even if points of interest have a radius, you can just extend the lat/lon of the optimization box by the maximum radius. This assumes, of course, that there's some reasonable limit; otherwise yes, it would be a different problem. If you can distinguish between regional suppliers and local suppliers, you'd be better off partitioning the table and running the query across both.

Posted by satoo, 06-12-2011, 09:36 PM
We are checking 900,000 hotels and asking if we are within YY miles of them. The thing is, the radius for the majority of them is 30 miles, and users may not be interested in products more than 30 miles away even if buyers are willing to sell to them. I could just base every query on an assumed 30 miles if performance were an issue. Or I could limit the results to 500 products. It's flexible. This isn't work I'm doing on behalf of a client. We get paid when a user inquires about a product. Essentially we're letting users input their zip code, displaying the results we want to show them, and then letting them choose which ones they want more info about. It's completely up to me how much data I display and which data. The company sending us the CSV is just a broker giving us all of their listings. We don't have to display them in any specific order, or display all of them, etc. I've looked into other solutions like https://mongohq.com/pricing and https://cloudant.com/#!/solutions/cloud . I see Xeround also has a distributed MySQL solution. Whichever solution has the best performance at the most reasonable price is what I want to go with. If someone else is handling administration, then I can easily adapt the code as necessary. We don't have to use MySQL or any form of SQL at all.

Posted by srfreeman, 06-13-2011, 12:51 AM
satoo; Really it is not a performance issue or reliant on the software used. Querying a 900,000 row database is trivial; 43,000 daily updates is trivial. Doing these things while constrained by time, compute resources, scalability issues, etc. makes the project nontrivial. It is somewhat the same as the age-old computer adage: smaller, faster, cheaper. You can have any two. Using an SQL solution will require some thought about normalizing the data and building the best table structure. A non-SQL solution will require some thought about reasonable keys. The SQL solutions will be the most expensive in regard to compute resources and scalability. Hardware will help you up to a point, distribution can help, but complexity will be an issue. Reducing accuracy is a time-honored way to increase the performance of a constrained system, with its obvious drawbacks. The data in your given example (only updated once daily) is inherently stale; if you get paid on inquiries for products that are no longer available, reduced accuracy may be in your favor. Since the daily updates present a problem to any constrained database solution and accuracy is not altogether paramount, you may want to consider not updating at all: simply replace the file each day and display information by using the file parsing capabilities of a language such as PHP. All in all, it seems that you may be running into the spreadsheet vs. database issue. Last edited by srfreeman; 06-13-2011 at 12:54 AM.
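
A bare-bones sketch of that flat-file approach, in Python rather than the PHP suggested above; the file path, column positions, and bounding-box fields are assumptions for illustration:

import csv

def matching_listings(path, user_lat, user_lon):
    # Scan today's processed CSV on each request instead of keeping a database.
    hits = []
    with open(path) as f:
        for row in csv.reader(f):
            lat_min, lat_max = float(row[3]), float(row[4])
            lon_min, lon_max = float(row[5]), float(row[6])
            if lat_min <= user_lat <= lat_max and lon_min <= user_lon <= lon_max:
                hits.append(row)
    return hits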

Posted by satoo, 06-13-2011, 02:15 AM
The lead will be rejected if it's outside the buyer's radius, and we don't get paid on dead listings. If nothing else, I still have to make sure to only pass unsold/valid listing IDs within the buyer's radius. Keeping the database updated is very easy if whatever service we use has some kind of CSV import, which is easy with MySQL, SQL Server, and, as I was just reading, Mongo. It's also very easy to adapt the very simple queries to whatever underlying engine is necessary. Really, the only difficult part of the project is getting decent performance. This is for the web: the query has to be done very quickly, and I don't have experience with databases this large. Otherwise, I would just use MySQL and be done with it.

Posted by Akisoft, 06-13-2011, 04:18 AM
+1 for Mongo here, look at MongoHQ or Mongolab, they're both pretty solid providers.

Posted by srfreeman, 06-13-2011, 10:52 AM
Ok, we want to display all of the possible listings of interest to a buyer and provide the list as fast as possible. We want to display information to the largest possible number of prospective buyers. The validity of each listing gets more questionable as time from update increases. The probable reason that you have not worked with such large tables is that database design practices call for a much different table design than your current spreadsheet. All other things being equal, the speed of the query is dependent on the length of the table/list being scanned. The fact that the availability metric is part of the product listing is a major speed issue. If multiple products are provided from a single vendor location, the vendor location information and its availability restriction can become a separate, much shorter, table to scan. Breaking the large table up by product manufacturer or product class or ..., and allowing the prospective buyer to enter a second variable, can help the speed. The logical data structure and the needed results are going to have to be decided prior to selection of the best infrastructure.

Posted by tchen, 06-13-2011, 11:27 AM
With all due respect, the speed of the query is related to the index, not the number of rows in the table. If the YY cutoff radius is 30 miles, then use the bounding-box optimization to downselect to a smaller result set, on which you can then do a final validity check for the supplier's radius and stock. All of this can be built into a single stored proc. I'm usually the first person to jump onto using NoSQL just for fun, but this isn't a hard problem for MySQL, and if you're still worried about repeated read calls for a given location, for instance request spam from an AJAX UI, then use memcache to store the results for the geohash.
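
A sketch of that index-plus-downselect idea, again with MySQLdb and assumed table/column names (coordinates indexed once, a generous 30-mile box as the indexed downselect, then the per-listing radius and stock check on the few surviving rows):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="listings")
cur = conn.cursor()

# One-time: index the coordinate columns so the box predicate is an index range
# scan rather than a full pass over all ~900k rows.
cur.execute("CREATE INDEX idx_latlon ON products (lat, lon)")

user_lat, user_lon, cutoff = 40.75, -73.99, 30.0
d = cutoff / 69.0   # ~69 miles per degree, as elsewhere in the thread

cur.execute(
    """SELECT id, lat, lon, radius_miles, in_stock FROM products
       WHERE lat BETWEEN %s AND %s AND lon BETWEEN %s AND %s""",
    (user_lat - d, user_lat + d, user_lon - d, user_lon + d),
)

# Final validity check (vendor radius and stock) on the small candidate set,
# using the same "square" approximation the thread already accepts.
valid = [r for r in cur.fetchall()
         if r[4]
         and abs(user_lat - r[1]) * 69.0 <= r[3]
         and abs(user_lon - r[2]) * 69.0 <= r[3]]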

Posted by srfreeman, 06-13-2011, 12:17 PM
Respect taken. In its current form, the index (if one can be found) includes every row in the table. Deciding that the cutoff radius for all is 30 miles reduces the accuracy; while this will increase speed, it has costs. In the current form, we still have to visit each of the 900,000 rows to make the determination. Ajax, memcache and associated toys may have their place, but part of the issue here is that we want it to perform on constrained systems (a shared P4 with 2GB RAM, a VPS with 1GB RAM, and an EC2 Small instance have been mentioned), with multiple users. Not a lot of room to play. No, it is not a hard SQL project, but (as with any SQL project) data normalization and structure based on the expected usage are the first issues to tackle. In my opinion.

Posted by satoo, 06-15-2011, 07:01 AM
I have this working on yesterday's CSV using mongolab with a geospatial index and querying with $near. It's very fast. Instead of doing a diff of today's vs. yesterday's, I just mongoimport the whole processed CSV, and then create {loc: {lat :xx, lon: xx}} for each document in the collection. So that's one battle over with. The only remaining problem is where to host the script that downloads the original CSV and processes it. It overwhelmed the EC2 micro instance and was at 100% CPU. The large instance handled it no problem. I could just run that for an hour a day. I would then need to figure out their API, and I guess a cron job to tell it to start every day, but then have the instance terminate itself when the script is done. I really do not like dealing with Amazon AWS. They have been the most frustrating part of this, and I find their approach to everything confusing. I may go with Linode for that or just do it on my desktop.
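
For reference, a rough pymongo equivalent of that setup; the connection string, collection and field names, the result limit, and the use of a legacy [lon, lat] pair (the post stores an embedded lat/lon document, which a 2d index also accepts) are all assumptions:

import pymongo

client = pymongo.MongoClient("mongodb://user:pass@example.mongolab.com:27017/listings")
db = client["listings"]

# After mongoimport, write a coordinate pair per product and index it.
for doc in db.products.find({}, {"lat": 1, "lon": 1}):
    db.products.update_one({"_id": doc["_id"]},
                           {"$set": {"loc": [doc["lon"], doc["lat"]]}})

db.products.create_index([("loc", pymongo.GEO2D)])

# $near on a 2d index returns results sorted by distance;
# $maxDistance here is in degrees (roughly 69 miles per degree).
cursor = db.products.find({"loc": {"$near": [-73.99, 40.75],
                                   "$maxDistance": 30.0 / 69.0}}).limit(500)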

Posted by WholesaleBackup, 06-15-2011, 12:48 PM
It's actually very easy to set up a cron job to do what you want...
1. Get your bash/scripting environment set up with your keys for Amazon, plus the APIs, Java, etc. (you can Google for instructions for your distribution).
2. Set up a script on your machine that looks like this:

#!/bin/bash
ec2-start-instances INSTANCEID
ssh USER@INSTANCE /PATHTODBCOMMAND/DBCOMMAND

where you would replace the items in caps with your values. It's very simple, but EC2 is very different from other stuff out there, so it has a learning curve. You can also create a command on the server itself that stops the server after processing is complete with:

ec2-stop-instances INSTANCEID

Luck!

Posted by satoo, 06-18-2011, 03:38 PM
Got this working just using Amazon's PHP SDK. Two cron jobs on our other server: one starts up a large instance, and the other gets a list of running instances 45 minutes later (though it has never taken large instances more than 10-15 minutes to download/process the CSV) and then stops any with this particular customized production AMI. That way, if I decide to run more instances for other projects, I don't have to worry about those being killed. There are breakpoints in the script running on EC2 to email me using Amazon Simple Notification Service if anything fails (e.g. the broker's FTP server is down), and SNS emails me each time it starts/stops. Time will tell how reliable the whole thing is and how Mongo performs under real-world traffic or how it scales, but with the number of queries to start with, I don't anticipate that being a problem. If it is, the script could be easily adapted to use mysqlimport with RDS or something instead of mongoimport.
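
The jobs above were built with Amazon's PHP SDK; purely as an illustration of the same launch-then-reap logic, a sketch with the Python boto library (the region, AMI ID, and instance type are placeholders, not values from the post):

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# cron job 1 (daily): launch a Large instance from the customized production AMI
conn.run_instances("ami-xxxxxxxx", instance_type="m1.large")

# cron job 2 (~45 minutes later): stop any instance of that AMI still running
reservations = conn.get_all_instances(
    filters={"image-id": "ami-xxxxxxxx", "instance-state-name": "running"})
ids = [i.id for r in reservations for i in r.instances]
if ids:
    conn.stop_instances(instance_ids=ids)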

Posted by tchen, 06-18-2011, 07:03 PM
Nice followup.

Posted by WholesaleBackup, 06-20-2011, 11:45 AM
Excellent!


