Can I charge my clients for debugging the code I have developed for them?

I charge my clients on an hourly basis. Sometimes they come back with an error or bug in the code and ask me to resolve it. It takes time, sometimes 2-3 hours. Most clients think it should not be charged, as it was my fault and I should fix it for free. Is that so? It's almost impossible to write 100% error-free code.

To me it depends. Is the product that you sold working as described in the contract? If not, you can't decently ask for more money, since you didn't do your job in the first place. You should test your software and debug it for free. It is true that no software is bug free, but that isn't the customer's fault, and as long as you didn't explicitly state that debugging had a cost, I don't think it is okay to charge for it. (Be sure not to let them add features while pretending it's a bug, though!)


Azure Web App. Free is faster than Basic and Standard?

I have a C# MVC application with a WCF service running on Azure. At first it was of course hosted on the Free tier, but once I had that running smoothly I wanted to see how it ran on either Basic or Standard, which as far as I know should be dedicated servers.
To my surprise the code ran significantly slower once it was changed from Free to either Standard or Basic. I chose the smallest instance, but still expected them to perform better than the Free option?
From my performance logging I can see that the code that runs especially slowly is started asynchronously with Task.Run. Initially it used old-school Thread.Start(), but I wondered whether that might spawn it on some lower-priority thread and therefore changed it to Task.Run. That changed nothing, so perhaps it is unrelated, but it might be relevant, so now you know.
The code that runs really slowly basically works on some XML document, through XDocument, XElement etc. It loops through, has some LINQ etc. but nothing too fancy. But still it is 5-10 times slower on Basic and Standard than on the Free version? For the exact same request the Free version takes around 1000ms, whereas Basic and Standard take 8000-10000ms?
In each test I have tried 5-10 times but without any decrease in response-times. I thought about whether I need to wait some hours before the Basic/Standard is fully functional or something like that, but each time I switch back, the Free version just outperforms it from the get-go.
Any suggestions? Is the Free version for some strange reason more powerful than Basic or Standard or do I need to configure something differently once I get up and running on Basic or Standard?
The notable difference between the Free and Basic/Standard tiers is that Free uses an undisclosed number of shared cores, whereas Basic/Standard has a defined number of CPU cores (1-4 based on how much you pay). Related to this is the fact that Free is a shared instance while Basic/Standard is a private instance.
My best guess based on this is that, since the Free servers you would be on house multiple different users and applications, they probably have pretty beefy specs. Their CPUs are probably 8-core Xeons and there might even be multiple CPUs. Most likely, Azure isn't enforcing any caps but rather relying on quotas (60 CPU minutes / day for the Free tier) and overall demand on the server to restrict CPU use. In other words, if your site is the only one that happens to be doing anything at the moment (unlikely of course, but for the sake of example), you could potentially be utilizing all 8+ cores on the box, whereas when you move over to Basic/Standard you are hard-limited to 1-4. Processing XML is actually very CPU heavy, so this seems to line up with my assumptions.
More than likely, this is a fluke. Perhaps your residency is currently on a relatively newly provisioned server that hasn't been filled up with tenants yet. Maybe you just happen to be sharing with tenants that aren't doing much. Who knows? But, if the server is ever actually under real load, I'd imagine you'd see a much worse response time on the Free tier than even Basic/Standard.

Windows Service Increasing CPU Consumption

At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contains a timer that fires every minute or so, where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (ie, they're just idling, looking through the database for something to do). When they start up, each service uses an average of (about) 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
#akmad: True, it is very difficult.
Yes, a service run in isolation will show the same symptom over time.
No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
We do; and they are. The services are doing exactly what they should be, and nothing else.
Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
#Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
#Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's an example of the behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen immediately preceding the timer callback, within the timer callback, or immediately following the timer callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse... average usage is seeming to go up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service, which would let me rule out the timer as a problem (which I absolutely intend to try), but can't help me rule out NHibernate as the issue, because if NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in the question -- that your problems would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shut down and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy Nant script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.
You mentioned that you're using NHibernate - are you closing your NHibernate sessions at appropriate points (such as the end of each iteration?)
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.
Here's where I'd start:
Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
Compare before and after shots.
A couple of thoughts on possibilities:
excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
excessive threads and associated context switches (# of threads going up)
polling (stack traces are consistently caught in a single function)
excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
virus/rootkit (OK, this is a last ditch scenario - but it is possible to construct a rootkit that hides from TaskManager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)
It's obviously pretty difficult to remotely debug your unknown application... but here are some things I'd look at:
What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that you're using up most of the RAM and causing the disk to thrash, or some similar problem.
Best of luck!
I suggest hacking the problem into pieces.
First, find a way to reproduce the problem quickly and 100% of the time. Lower the timer interval so that the services fire more frequently (for example, 10 times more often than normal). If the problem arises 10 times quicker, then it's related to the number of iterations and not to real time or to real work done by the services. You will also be able to do the next steps more quickly than once a day.
Second, comment out all the real work code and leave only the services, the timers and the synchronization mechanism. If the problem still shows up, then it will be in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.
'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is that your services might have some bugs in either the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue it turned out a 3rd party OSS library we were using stored handles to unmanaged objects in static memory. The longer the service ran the more handles the service picked up, which caused the process' CPU performance to nose-dive very quickly. The way to try and resolve this sort of issue is to ensure your services store nothing in memory in between the timer invocations, although if your 3rd party libraries use static memory you might have to do something clever like create an app domain for the timer invocation and ditch the app domain (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
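The overlapping-tick scenario above can be guarded against by skipping a tick whenever the previous run has not finished yet, instead of letting ticks queue up behind each other. What follows is only a minimal sketch in Python, not the fix that was actually used: `NonReentrantJob` and `on_tick` are hypothetical names, and the services in question are .NET, so this illustrates the idea only.

```python
import threading


class NonReentrantJob:
    """Runs `work` on a timer tick only if the previous run has finished.

    Hypothetical helper for illustration; `work` is any zero-argument callable.
    """

    def __init__(self, work):
        self._work = work
        self._lock = threading.Lock()

    def on_tick(self):
        # Non-blocking acquire: if another tick is still inside `work`,
        # skip this tick rather than waiting (and piling up) behind it.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self._work()
            return True
        finally:
            self._lock.release()
```

The timer callback would call `on_tick()` and ignore a `False` result, so at most one run of the processing code is ever active. Logging the skipped ticks is also a cheap way to confirm whether overlapping runs are happening at all.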
Hope this helps, but apologies up front if both my thoughts are red herrings.
Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
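The "one scheduling thread plus a pool" shape described above might look roughly like this. It is a sketch in Python, not .NET; `polling_loop`, `find_work`, `handle` and `max_iterations` are hypothetical names introduced here, and `max_iterations` exists only so the loop can be exercised in a test.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def polling_loop(find_work, handle, interval_s, stop, pool, max_iterations=None):
    """Single scheduling thread: look for work, hand each item to the pool, sleep.

    `find_work` returns an iterable of work items (possibly empty);
    `handle` processes one item and keeps no state between calls.
    """
    futures = []  # kept only so the demo can inspect results
    iterations = 0
    while not stop.is_set():
        for item in find_work():
            futures.append(pool.submit(handle, item))
        iterations += 1
        if max_iterations is not None and iterations >= max_iterations:
            break
        # Sleep between polls, but wake up early if shutdown is requested.
        stop.wait(interval_s)
    return futures
```

A long-running service would not keep the `futures` list around forever; it would log or discard each result as it completes. The point of the sketch is that the loop is the only place that decides when work starts, so ticks cannot stack up behind each other.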
Another area where the design is lacking appears to be that you have multiple services that are polling resources to do something. I'd suggest unifying them under a single service. They might do separate things, but they're working in unison; you're just using the filesystem, database, etc as a substitution for method calls. Also, 2003? I feel bad for you.
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log various section start and stops. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
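Logging section starts and stops does not need a profiler. Below is a minimal sketch in Python, assuming a hypothetical `SectionTimer` helper; the section names in the usage note are made up, and the real services would need the equivalent in .NET.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates elapsed time and call counts per named section."""

    def __init__(self, clock=time.perf_counter):
        # `clock` is injectable; time.process_time would measure CPU time
        # for the whole process instead of wall-clock time.
        self._clock = clock
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the section even if it raised.
            self.totals[name] += self._clock() - start
            self.counts[name] += 1
```

Wrapping each stage of one timer iteration (for example `with timer.section("query"):` and `with timer.section("flush"):`) and writing `totals` to the log after every iteration would show which stage grows over 24 hours, or whether none of them does and the time is going somewhere outside the callback.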
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.

Why is the HttpWebRequest ReadWriteTimeout set to 5 minutes?

The ReadWriteTimeout for HttpWebRequests seems to be defaulted to 5 minutes.
Is there a reason why it is that high? I was trying to set the timeout of an API call to 10 seconds, but it was spinning for over 2 minutes.
When I set this to 30 seconds, it times out in a reasonable amount of time now.
Is it dangerous to set this too low?
I can't imagine something taking longer than 20-30 seconds in my application (small 2-30kb payloads).
Sure there's a reason for a 5 minute time-out. It looks like this:
This contraption is a robotic tape retrieval system, used by the International Centre for Radio Astronomy Research. It stores 32.5 petabytes of historical data. When its server gets an HttpWebRequest, the machine sends the robot on its way to retrieve the tape with the data. This takes a while, as you might imagine.
These systems were quite common a decade ago, around the time .NET was designed. Not so much today, the unrelenting improvements in hard disk storage capacity made them close to obsolete. Although more than 5 petabyte of SAN storage still sets you back a rather major chunk of money. If speed is not essential then tape is hard to beat.
Clearly .NET cannot possibly reliably declare a timeout when it doesn't know anything about what's happening on the other end of the wire. So the default is high. If you have good reasons to believe that there's an upper limit on your particular setup then don't hesitate to lower it. Do make it an editable setting, you can't predict the future.
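Making the timeout an editable setting can be as small as reading it from configuration with a fallback. In .NET the two relevant properties are `HttpWebRequest.Timeout` and `HttpWebRequest.ReadWriteTimeout`. The sketch below shows only the "editable setting" part, in Python, and the environment variable name and the 30 second default are assumptions made up for this example.

```python
import os

# Assumed fallback for this sketch; not a .NET default.
DEFAULT_READ_TIMEOUT_S = 30.0


def read_timeout_seconds(env=os.environ, name="API_READ_TIMEOUT_S",
                         default=DEFAULT_READ_TIMEOUT_S):
    """Return the configured read timeout in seconds, or `default`.

    Missing, non-numeric, or non-positive values fall back to `default`
    so that a bad setting cannot silently disable the timeout.
    """
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

The returned value would then be passed to whatever HTTP client is in use (in Python, for instance, the `timeout` argument of `urllib.request.urlopen`), so an operator can raise it for a slow network without a rebuild.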
You can't possibly know what connection speed the users have that connect to your website. And as the creator of this framework you can't know either what the developer will host. This class already existed in .NET 1.1, so for a very long time. And back then the users had slower speed too.
Finding a good default value is very difficult. You don't want to set it too high to prevent security flaws, and you don't want to set it too low because this would result in a million (exaggerated) threads and requests about aborted requests.
I'm sorry I can't give you any official sources, but this is just reasonable.
Why 5 minutes? Why not?
JustAnotherUserYouMayKnow explained it to you pretty well.
But as usual, you have the freedom to change this default value to one that suits your case, so feel free to follow the path that Christian pointed out.
Setting a default value is not an easy task at all when we are talking about millions of users and maybe millions of billions of possible scenarios involved.
The bottom line is that it isn't that important why it's 5 minutes but rather how you can adjust it to your needs.
Well, by setting it that low you may or may not introduce a series of issues. While you may be able to reach the site within a reasonable time, others may not.
A perfect example is Verizon: they invoke a series of proxy servers which can drastically slow a connection down. The reason I bring up this example is that our application specified a one-minute timeout before it throws an exception.
Our server has no issues with large amounts of requests; it handles them quite easily. However, some of our users throughout the world receive this error: Error 10060.
The issue can stem from an incorrect proxy configuration or an invalid registry key which actually handles the timeout request.
You'd think that one minute would indeed be long enough, but it actually isn't. This customer's particular network doesn't siphon the data through quickly enough, thus causing an error.
So you asked:
Why is the HttpWebRequest ReadWrite Timeout Defaulted to five minutes?
They are attempting to account for the lowest common denominator.
Simply, each network and client may have a vast degree of traffic or delays as it moves to the desired location. If it can't get to the destination within your ports ideal socket request your user will experience an exception.
Some really important things to know about a network:
Some networks that are configured have a limited hop count / time to live.
Proxies and Firewalls which are heavy in filtering data and security, may delay your traffic.
Some areas do not have Fiber or Cable high-speed. They may rely on Satellite or DSL.
Each network protocol is different.
Those are a few variables that you have to consider. If we are talking about an internet; each client has a home network; which connects to ISP; which connects to the Internet; which connects to you. So you have several forms of traffic to be aggregated.
If we are talking about an Intranet, with most modern day technology the odds of your time being an issue are slim but still possible.
Also each individual computer can partake or cause an issue. In Windows 8 the default Timeout specified for the browser is one minute; in some cases those users may experience exceptions with your application, your site, or others. So you'd manually alter the ServerTimeOut and TimeOut key in the registry to assign a longer value.
In short:
Client Machines may pose a problem in reaching your site within your allocated time.
Network / ISP may incur a problem for some users.
Your Server may be configured incorrectly or not allocate the right amount of time.
These are all variables that need to be accounted for; as they will impact access to your application. Unfortunately you won't know for certain until it's launched and users begin to utilize your site.
Unfortunately you won't know if your time you specified will be enough; but it defaults to a higher number because there is so much variation across the world that it is trying to consider the lowest common denominator. As your goal is to reach as many people as possible.
By the way very nice question, and some great answers so far as well.

ZeroMQ subscriber fails to initialize using 1000+ publishers

I am trying to evaluate ZeroMQ for a larger monitoring and data gathering system. On a smaller scale everything works nice but stepping up the load and scale a bit seems tricky.
Right now I am using a C# wrapper (clrzmq, 3.0.0-rc1) to create both a publisher and a subscriber application. I am binding the Publisher socket (1 socket, 1 context) to 1000 endpoints (localhost + a range of ports) and let the Subscriber application's socket (again 1 socket, 1 context) bind to the publisher endpoints.
This sometimes works, and sometimes not (I guess it relates to the max number of sockets handled by the process somehow). It seems to depend on in which order I start the applications but I cannot tell for sure. The only thing I see is nasty SEHExceptions, containing no details at all. If I create simple console applications I sometimes see low level C++ Asserts like:
Assertion failed: fds.size () <= FD_SETSIZE (......\src\select.cpp:70)
Assertion failed: Permission denied (......\src\signaler.cpp:281)
Assertion failed: Connection reset by peer (......\src\signaler.cpp:124)
Not very helpful to me. In the C# wrapper, the Context creation fails. It does not even get a chance to begin connecting to or even creating sockets. I would expect low level ZeroMQ errors to be handled by throwing exceptions, maybe I just have not understood how to deal with errors yet.
The questions I have right now are:
How do I create a (somewhat) realistic test setup to simulate 1000 separate publishers on a single machine (in real world 1 publisher = 1 machine) and a couple of Subscribers on Another machine, all using C#. Is that even possible?
More importantly, how do I trap ZeroMQ errors in C# code to be able to understand what goes wrong?
Since ZeroMQ seems pretty stable and mature I have a hard time believing 1000 publishers should be a problem to handle. However, I need better error support than currently available (unless I completely missed something here) in order to use ZeroMQ over C#.
After digging into the source, I end up with a zmq_assert(...) leading to RaiseException (0x40000015, EXCEPTION_NONCONTINUABLE, 1, extra_info);. This will abruptly terminate the application after dumping the original assert statement to the console. This seems a bit harsh, but may well be the best option given that it is really unrecoverable. However, a somewhat better error message would not hurt. Not everyone knows what fds.size () <= FD_SETSIZE means. The comment in the source gives some clues; it would be nice to have that comment in the error message. Anyway, given that my application is not a console app, this just leaves me with an unhandled SEHException, which does not seem to contain even the assert statement or line/file info. I wonder how many other bugs I will create that will result in other similarly cryptic errors.
The default FD_SETSIZE is 1024 (defined in the MSVC libzmq project), so you will hit this about half-way through your test case. The other asserts tumble on from that.
Increase this in your libzmq project, to 4K or 8K, and things should work better.
As for the assert() call, it's too brutal on Windows, for sure. On Linux this gives a decent stack dump and enough information to trace the problem. Feel free to improve the assert macro so that it does something smarter, e.g. launch the debugger. In any case if you hit an assert you can't reasonably continue.
Asserting when the FD set is full, well, that could be handled better. If you know anything about C/C++, feel free to take a look at the code. We do depend on peoples' patches.
Also, if you feel 1024 is too small, feel free to raise this in the project and send us the patch.
After looking into this a bit more, it seems the default number of sockets is set to 1024. The C# wrapper has a property on the Context object that should be able to change this setting but it is not working, at least not as expected. Also, the native zmqlib does not have this setting on the context object.
Running a setup like in the description does not seem possible, at least not using the clrzmq C# ZeroMQ wrapper. I solved it by running 500 publishers on a separate machine and another 500 plus 1000 subscribers on another machine. This worked nicely without any errors.
The other topic is also a bit disappointing. When the maximum number of sockets is reached, ZeroMQ simply throws an uncatchable exception causing the application to crash abruptly. This is a fail fast approach, avoiding any further data/state corruption, but unfortunately it also leaves very few clues as to what happened that caused the application to die. Judging from other posts, it seems very hard to gather data for post-mortem when this happens. Catching the exception in the C# code seems impossible or very hard, and hooking into the stdout to capture the printed assert also seems very hard to achieve (if we are not running from a command prompt, in which case the assert message is printed just before the application dies).
All-in-all, this makes low-level troubleshooting and post-mortem analysis in a non-console C# setting very hard when ZeroMQ terminates via the zmq_assert(...) call. Hopefully this was an extreme case. Not all failure modes seem to cause termination in this abrupt way.
A quick and dirty look into this problem suggests that you're creating too many socket connections for your computer. Check out this link on the max number of sockets from MSDN. The errors you are getting look suspiciously relevant, so this is a possible source of your error.
To be honest, having 1000 separate publishers suggests you are tackling the problem a little incorrectly for zmq. Why not have 1 publisher and use 'namespaces', and have the subscribers SUBSCRIBE to what they need, to split out which messages each subscriber gets?
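The 'namespaces' idea relies on the fact that a ZeroMQ SUB socket filters incoming messages by byte prefix: a message is delivered if it starts with any of the socket's subscriptions, and an empty subscription matches everything. The sketch below only emulates that matching rule in plain Python, with no ZeroMQ involved, to show how a topic scheme could be laid out. The `host042.cpu` style topic names are hypothetical.

```python
def is_delivered(subscriptions, message):
    """Emulate SUB-side prefix filtering: True if any subscription is a
    byte prefix of the message. An empty prefix (b"") matches everything."""
    return any(message.startswith(prefix) for prefix in subscriptions)


def topic_for(host, metric):
    """Build a hypothetical topic prefix such as b'host042.cpu '.

    The trailing dot and space act as delimiters so that a subscription
    for host042 does not also match host0420.
    """
    return f"{host}.{metric} ".encode("ascii")
```

With a scheme like this, one publisher could send `topic_for("host042", "cpu") + payload` and each subscriber would subscribe to `b"host042."` (everything from one host) or to a full topic. Because the match is on raw prefixes, the choice of delimiter matters more than it first appears.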

How to provoke a timer trigger in glassfish?

We need some consistency in our functional test cases.
The best we can do currently is to wait for an estimated time before the Java EE timers in the product should have been triggered. It would be much more predictable if the test cases could trigger the timers programmatically, probably with JMX.
How can this be done? Is there a JMX interface to the Glassfish Timer facility which we can use?
It's apparently impossible, since the question has remained unanswered for almost three months.
However, I realized that for testing purposes it is enough to be notified when the triggering has actually occurred. (Triggering it actively would only buy me time at the expense of quality.)
I'm adding monitoring for event completions instead, but thanks for letting me know that it's actually impossible;)
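Waiting for a completion notification instead of sleeping for an estimated time has roughly the shape below. This is a language-neutral sketch written in Python, not Glassfish or JMX code: `run_and_wait` and the `on_complete` callback are hypothetical, and how the product actually reports that a timer has fired is left open.

```python
import threading


def run_and_wait(start, timeout_s):
    """Start something asynchronous and block until it reports completion.

    `start` is called with an `on_complete` callback; the function returns
    True if the callback fired within `timeout_s` seconds, else False.
    """
    done = threading.Event()
    start(on_complete=done.set)
    # Returns as soon as completion is signalled, so the test never waits
    # longer than needed, and the timeout bounds the wait on failure.
    return done.wait(timeout_s)
```

The test case then asserts on the return value, so a timer that never fires shows up as a clear timeout failure rather than as a flaky assertion after a fixed sleep.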