Resolve race conditions involving Stasis bridges.

This resolves two observed race conditions. First, a bit of background on what the Stasis application does: 1a Creates a stasis_app_control structure. This structure is linked into a global container and can be looked up using a channel's unique ID. 2a Puts the channel in an event loop. The event loop can exit either because the stasis_app_control structure has been marked done, or because of some other factor, such as a hangup. In the event loop, the stasis_app_control determines if any specific ARI commands need to be run on the channel and will run them from this thread. 3a Checks if the channel is bridged. If the channel is bridged, then ast_bridge_depart() is called since channels that are added to Stasis bridges are always imparted as departable. 4a Unlink the stasis_app_control from the container. When an ARI command is received by Asterisk, the following occurs 1b A thread is spawned to handle the HTTP request 2b The stasis_app_control(s) that corresponds to the channel(s) in the request is/are retrieved. If the stasis_app_control cannot be retrieved, then it is assumed that the channel in question has exited the Stasis app or perhaps was never in Stasis in the first place. 3b A command is queued onto the stasis_app_control, and the channel's event loop thread is signaled to run the command. 4b While most ARI commands do nothing further, some, such as adding or removing channels from a bridge, will block until the command they issued has been completed by the channel's event loop. The first race condition that is solved by this patch involves a crash that can occur due to faulty detection of the channel's bridged status in step 3a. What can happen is that in step 2a, the event loop may run the ast_bridge_impart() function to asynchronously place the channel into a bridge, then immediately exit the event loop because the channel has hung up. In step 3a, we would detect that the channel was not bridged and would not call ast_bridge_depart(). The reason that the channel did not appear to be bridged was that the depart_thread that is spawned by ast_bridge_impart() had not yet started. That is the thread where the channel is marked as being bridged. Since we did not call ast_bridge_depart(), the Stasis application would exit, and then the channel would be destroyed Then the depart_thread would start up and try to manipulate the destroyed channel, causing a crash. The fix for this is to switch from using ast_channel_is_bridged() to checking the NULLity of ast_channel_internal_bridge_channel() to determine if ast_bridge_depart() needs to be called. The channel's internal bridge_channel is set when ast_bridge_impart() is called and is NULLed by the call to ast_bridge_depart(). If the channel's internal bridge_channel is non-NULL, then the channel must have been imparted into the bridge and needs to be departed, even if the actual bridging operation has not yet started. By departing the channel when necessary, the thread that is running the Stasis application will block until the bridge gives the okay that the depart_thread has exited. The second race condition that is solved by this patch involves a leak of HTTP handler threads. The problem was that step 2b would successfully retrieve a stasis_app_control structure. Then step 2a would exit the channel from the event loop due to a hangup. Steps 3a and 4a would execute, and then finally steps 3b and 4b would. The problem is that at step 4b, when attempting to add a channel to a bridge, the thread would block forever since the channel would never execute the queued command since it was finished with the event loop. This meant that the HTTP handling thread would be leaked, along with any references that thread may have owned (in my case, I was seeing bridges leaked). The fix for this is to hone in better on when the channel has exited the event loop. The stasis_app_control structure has an is_done field that is now set at each point where the channel may exit the event loop. If step 2b retrieves a valid stasis_app_control structure but the control is marked as done, then the attempted operation exits immediately since there will be nothing to service the attempted command. ASTERISK-25091 #close Reported by Ilya Trikoz Change-Id: If66265b73b4c9f8f58599124d777fedc54576628
author: Mark Michelson <mmichelson@digium.com> 2015-06-18 13:16:29 -0500
committer: Richard Mudgett <rmudgett@digium.com> 2015-06-18 16:19:20 -0500
commit: d7a1e84a1ee62743f9473ecbd7de31ea7afa8def (patch)
tree: 291bd1895b047a0d8d9bd60e8c1b2d31e67c198c /res/res_stasis.c
parent: 7cdb85865105f3d52da28f4e1146d10a5e9dbec6 (diff)
1 files changed, 5 insertions, 1 deletions
diff --git a/res/res_stasis.c b/res/res_stasis.c
index 631af1e2e..f7d8299f4 100644
--- a/res/res_stasis.c
+++ b/res/res_stasis.c
@@ -1280,6 +1280,7 @@ int stasis_app_exec(struct ast_channel *chan, const char *app_name, int argc,
 
 		/* Check to see if a bridge absorbed our hangup frame */
 		if (ast_check_hangup_locked(chan)) {
+			control_mark_done(control);
 			break;
 		}
 
@@ -1303,6 +1304,7 @@ int stasis_app_exec(struct ast_channel *chan, const char *app_name, int argc,
 		if (r < 0) {
 			ast_debug(3, "%s: Poll error\n",
 				  ast_channel_uniqueid(chan));
+			control_mark_done(control);
 			break;
 		}
 
@@ -1323,6 +1325,7 @@ int stasis_app_exec(struct ast_channel *chan, const char *app_name, int argc,
 			/* Continue on in the dialplan */
 			ast_debug(3, "%s: Hangup (no more frames)\n",
 				ast_channel_uniqueid(chan));
+			control_mark_done(control);
 			break;
 		}
 
@@ -1331,13 +1334,14 @@ int stasis_app_exec(struct ast_channel *chan, const char *app_name, int argc,
 				/* Continue on in the dialplan */
 				ast_debug(3, "%s: Hangup\n",
 					ast_channel_uniqueid(chan));
+				control_mark_done(control);
 				break;
 			}
 		}
 	}
 
 	ast_channel_lock(chan);
-	needs_depart = ast_channel_is_bridged(chan);
+	needs_depart = (ast_channel_internal_bridge_channel(chan) != NULL);
 	ast_channel_unlock(chan);
 	if (needs_depart) {
 		ast_bridge_depart(chan);
author	Mark Michelson <mmichelson@digium.com>	2015-06-18 13:16:29 -0500
committer	Richard Mudgett <rmudgett@digium.com>	2015-06-18 16:19:20 -0500
commit	d7a1e84a1ee62743f9473ecbd7de31ea7afa8def (patch)
tree	291bd1895b047a0d8d9bd60e8c1b2d31e67c198c /res/res_stasis.c
parent	7cdb85865105f3d52da28f4e1146d10a5e9dbec6 (diff)