Building a Face Detecting Robot with URLSessionWebSocketTask, CoreML, SwiftUI and an Arduino

iOS 13 marks the release of long-awaited features like Dark Mode, but it also brings welcome changes to less glamorous corners of the SDK. Prior to iOS 13, creating socket connections required writing very low-level networking code, which made libraries like Starscream and Socket.io the go-to solutions for sockets in iOS. Now, with iOS 13, the new native URLSessionWebSocketTask class finally makes creating and managing socket connections easier for developers.

Some time ago I created a little side project involving an Arduino-powered servo motor that menacingly pointed at people's faces with the help of CoreML, mimicking the Team Fortress 2 Engineer's Sentry Gun. With iOS 13 out, I decided to rewrite it using the new socket APIs and SwiftUI.

Although the final project involves an Arduino and a Raspberry Pi, we'll concentrate on the iOS part of it, since Swift is what this blog is about. If at the end you want more info about how the other components are connected, feel free to contact me with questions!

Organizing the ideas

Since this project has several components, let's detail what needs to be done. To build a face-tracking murder robot, we'll need an iOS app that does the following:

* The app opens the user's camera.

* Each time the camera's frame is updated, we capture its output.

* The output is routed to Vision's VNDetectFaceRectanglesRequest.

* From the results, we draw rectangles on the screen to show where faces were detected.

* Assuming that the first found face is our target, we calculate the X/Y angles that our robot should face based on the face's bounds on the screen.

* Using sockets, we'll send these angles to a Raspberry Pi which will be running a server and handling the communication with the Arduino-connected servo motor.

* While all that happens, we display a HUD with some basic information to the user.

Using URLSessionWebSocketTask

We can start this project by creating a worker class that handles a socket connection to a generic server. Similar to how regular URLSession tasks are created, we can retrieve an instance of a socket task by calling webSocketTask(with:) on a URLSession and passing the URL of the socket server:

lazy var session = URLSession(configuration: .default,
                              delegate: self,
                              delegateQueue: OperationQueue())

lazy var webSocketTask: URLSessionWebSocketTask = {
    // This is the IP of my Raspberry Pi.
    let url = URL(string: "ws://192.168.15.251:12354")!
    return session.webSocketTask(with: url)
}()

Although receiving messages from the socket isn't necessary for this project, covering it is important: URLSessionWebSocketTask supports receiving and sending messages in both Data and String formats, and calling receive will allow the client to receive a message from the server:

func receive() {
    webSocketTask.receive { result in
        switch result {
        case .success(let message):
            switch message {
            case .data(let data):
                print(data)
            case .string(let string):
                print(string)
            }
        case .failure(let error):
            print(error)
        }
    }
}

An important aspect here is that unlike other socket solutions for iOS, receive isn't permanent -- after a message is received, you need to call the method again to get the next one. This is a strange design decision considering that the whole point of sockets is to continuously send and receive messages, but that's how the API works in iOS 13.
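If you do want a continuous stream of messages, a common workaround is to schedule the next receive from inside the completion handler. Here's a minimal sketch of that pattern, written as a method of whatever class owns the webSocketTask (the listen() name is just for illustration):

func listen() {
    webSocketTask.receive { [weak self] result in
        switch result {
        case .success(let message):
            switch message {
            case .data(let data):
                print(data)
            case .string(let string):
                print(string)
            }
            // Schedule the next receive so messages keep flowing.
            self?.listen()
        case .failure(let error):
            // A failure usually means the connection is gone,
            // so we stop re-scheduling here.
            print(error)
        }
    }
}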

In a similar fashion, sending messages can be done by calling the send method:

webSocketTask.send(.string("I love sockets!")) { error in
    if let error = error {
        print(error)
    }
}

Connecting to URLSessionWebSocketTask servers

Similar to how regular tasks are started, socket tasks are initiated by calling the task's resume() method. Once the connection is made, the process of receiving and sending messages can start, and updates to the socket's connection are posted via the URLSessionWebSocketDelegate protocol. That's a lot easier than pre-iOS 13 socket solutions!
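To make that concrete, here's a minimal sketch of kicking off the connection from our worker class. URLSessionWebSocketTask also exposes a sendPing(pongReceiveHandler:) method, which doubles as a cheap way to check that the server is actually reachable (the connect() name is just for illustration):

func connect() {
    webSocketTask.resume()
    // Optional: ping the server to confirm the connection is alive.
    webSocketTask.sendPing { error in
        if let error = error {
            print("Ping failed: \(error)")
        } else {
            print("Pong received -- the connection is alive.")
        }
    }
}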

An important and undocumented server-side aspect of URLSessionWebSocketTask is that Apple expects socket servers to conform to RFC 6455 - The WebSocket Protocol. While Starscream and Socket.io let you play with sockets in iOS against simple generic Python servers, URLSessionWebSocketTask requires a server capable of handling the protocol's handshakes and data framing. If the server doesn't conform to RFC 6455, iOS will simply never connect to it. The Python server linked in the references below is an example that works with it, which you can use to test the new socket APIs.

SocketWorker

Now that we know how to use iOS 13's new socket class, we can build a simple wrapper around it. Because we're going to build our UI with SwiftUI, my worker class will conform to ObservableObject and define a publisher that lets me route the socket's connection information back to the views. Here's the almost-final version of SocketWorker (sending messages will be added later):

final class SocketWorker: NSObject, ObservableObject {
    lazy var session = URLSession(configuration: .default,
                                  delegate: self,
                                  delegateQueue: OperationQueue())

    lazy var webSocketTask: URLSessionWebSocketTask = {
        let url = URL(string: "ws://169.254.141.251:12354")!
        return session.webSocketTask(with: url)
    }()

    var lastSentData = ""

    let objectWillChange = ObservableObjectPublisher()

    var isConnected: Bool? = nil {
        willSet {
            DispatchQueue.main.async { [weak self] in
                self?.objectWillChange.send()
            }
        }
    }

    var connectionStatus: String {
        guard let isConnected = isConnected else {
            return "Connecting..."
        }
        if isConnected {
            return "Connected"
        } else {
            return "Disconnected"
        }
    }

    func resume() {
        webSocketTask.resume()
    }

    func suspend() {
        webSocketTask.suspend()
    }
}

extension SocketWorker: URLSessionWebSocketDelegate {
    func urlSession(_ session: URLSession, webSocketTask: URLSessionWebSocketTask, didOpenWithProtocol protocol: String?) {
        isConnected = true
    }

    func urlSession(_ session: URLSession, webSocketTask: URLSessionWebSocketTask, didCloseWith closeCode: URLSessionWebSocketTask.CloseCode, reason: Data?) {
        isConnected = false
    }
}

It would be nice to display the socket's connection status and camera's targeting information to the user, so we'll build a view to do so. Thankfully SwiftUI allows us to do this quicker than ever:

struct DataView: View {

    let descriptionText: String
    let degreesText: String
    let connectionText: String
    let isConnected: Bool?

    let padding: CGFloat = 32

    var body: some View {
        VStack(alignment: .center) {
            HStack {
                Text(connectionText)
                    .font(.system(.callout))
                    .foregroundColor(isConnected == true ? .green : .red)
                Spacer()
            }
            Spacer()
            VStack(alignment: .center, spacing: 8) {
                Text(descriptionText)
                    .font(.system(.largeTitle))
                Text(degreesText)
                    .font(.system(.title))
            }
        }.padding(padding)
    }
}

TargetDataViewModel

To allow our DataView to receive the camera's targeting information, we'll create an observable TargetDataViewModel that is responsible for storing and routing the strings shown to the user as well as the actual data that will be sent to the socket.

Processing the socket data works by receiving a face's CGRect; based on the face's frame on the screen, we can calculate the X and Y angles that the Arduino's servo motor should aim at. In this case, since I only have one servo, only the X angle will be sent to the socket. You can send anything to the socket; for simplicity, I've decided to follow the X{angle}. format. This means that if the servo needs to aim at 90 degrees, the socket receives a message containing X90. -- the dot at the end serves as an "end of command" token, which makes parsing easier on the Arduino side. If there's no visible face on the screen, we'll send L to the socket (as in, target lost). We'll make this communication work through a delegate so we can send this data back to SocketWorker:

protocol TargetDataViewModelDelegate: AnyObject {
    func targetDataDidChange(_ data: String)
}

class TargetDataViewModel: ObservableObject {

    weak var delegate: TargetDataViewModelDelegate?

    let objectWillChange = ObservableObjectPublisher()

    var targetTitle = "..." {
        willSet {
            self.objectWillChange.send()
        }
    }

    var targetDescription = "..." {
        willSet {
            self.objectWillChange.send()
        }
    }

    func process(target: CGRect?) {
        guard let target = target else {
            targetTitle = "Sentry mode"
            targetDescription = "No targets"
            delegate?.targetDataDidChange("L")
            return
        }
        let oldMin: CGFloat = 0
        let offset: CGFloat = 40
        let newMin: CGFloat = 0 + offset
        let newMax: CGFloat = 180 - offset
        let newRange = newMax - newMin
        func convertToDegrees(position: CGFloat, oldMax: CGFloat) -> Int {
            let oldRange = oldMax - oldMin
            let scaledAngle = (((position - oldMin) * newRange) / oldRange) + newMin
            return Int(scaledAngle)
        }
        let bounds = UIScreen.main.bounds
        let oldMaxX = bounds.width
        let oldMaxY = bounds.height
        let xAngle = convertToDegrees(position: target.midX, oldMax: oldMaxX)
        let yAngle = convertToDegrees(position: target.midY, oldMax: oldMaxY)
        targetTitle = "Shooting"
        targetDescription = "X: \(xAngle) | Y: \(yAngle)"
        let data = "X\(xAngle)."
        delegate?.targetDataDidChange(data)
    }
}
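As a quick sanity check of that mapping, consider a hypothetical face rect centered horizontally on a device whose screen is 390 points wide (the exact numbers depend on the device, since process(target:) reads UIScreen.main.bounds). The midpoint of the screen lands exactly in the middle of the 40...140 degree range, so the servo is told to aim straight ahead:

let viewModel = TargetDataViewModel()
viewModel.delegate = socketWorker // Assuming an existing SocketWorker instance.

// A face whose midX is 195 -- the horizontal center of a 390-point-wide screen.
let centeredFace = CGRect(x: 145, y: 300, width: 100, height: 100)
viewModel.process(target: centeredFace)
// targetTitle is now "Shooting" and the delegate receives "X90.".

viewModel.process(target: nil)
// No target: targetTitle becomes "Sentry mode" and the delegate receives "L".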

With the view model in place, we can finish SocketWorker by making it send the target's information to the socket:

extension SocketWorker: TargetDataViewModelDelegate {
    func targetDataDidChange(_ data: String) {
        // Avoid sending duplicate data to the socket.
        guard data != lastSentData else {
            return
        }
        lastSentData = data
        webSocketTask.send(.string(data)) { error in
            if let error = error {
                print(error)
            }
        }
    }
}

Showing the user's camera on the screen

Getting access to the device's camera with SwiftUI has a small catch. Although we can easily do it in UIKit using AVCaptureVideoPreviewLayer, the concept of layers doesn't really exist in SwiftUI, so we can't use this layer in regular SwiftUI views. Fortunately, Apple lets you create bridges between UIKit and SwiftUI so that no functionality is lost.

In order to display a camera, we'll need to build an old-fashioned UIKit CameraViewController and bridge it to SwiftUI using the UIViewControllerRepresentable protocol. We'll also route our data view model to this view controller so it can update it based on the camera's output:

struct CameraViewWrapper: UIViewControllerRepresentable {

    let viewController: CameraViewController

    func makeUIViewController(context: Context) -> CameraViewController {
        return viewController
    }

    func updateUIViewController(_ uiViewController: CameraViewController, context: Context) {}
}

final class CameraViewController: UIViewController, ObservableObject {

    @ObservedObject var targetViewModel: TargetDataViewModel

    init(targetViewModel: TargetDataViewModel) {
        self.targetViewModel = targetViewModel
        super.init(nibName: nil, bundle: nil)
    }

    required init?(coder: NSCoder) {
        fatalError()
    }

    override func loadView() {
        let view = CameraView(delegate: self)
        self.view = view
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        (view as? CameraView)?.startCaptureSession()
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        (view as? CameraView)?.stopCaptureSession()
    }
}

extension CameraViewController: CameraViewDelegate {
    func cameraViewDidTarget(frame: CGRect) {
        targetViewModel.process(target: frame)
    }

    func cameraViewFoundNoTargets() {
        targetViewModel.process(target: nil)
    }
}

I will skip the definition of CameraView because the code is massive and there's nothing special about it -- we just add a camera layer to the screen and start recording once the view controller appears. We also define a CameraViewDelegate so that our face detection events can be routed back to our view model. You can see the code for it in the NSSentryGun repo.
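That said, a heavily trimmed sketch can give you an idea of its shape. The snippet below shows only the delegate protocol used above and the AVCaptureSession plumbing; treat it as an approximation of the real class in the repo, with the Vision-related parts left out:

import AVFoundation
import UIKit

protocol CameraViewDelegate: AnyObject {
    func cameraViewDidTarget(frame: CGRect)
    func cameraViewFoundNoTargets()
}

final class CameraView: UIView {

    unowned let delegate: CameraViewDelegate

    // Rectangles currently drawn around detected faces.
    var maskLayer = [CAShapeLayer]()

    private let captureSession = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private lazy var previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)

    init(delegate: CameraViewDelegate) {
        self.delegate = delegate
        super.init(frame: .zero)
        previewLayer.videoGravity = .resizeAspectFill
        layer.insertSublayer(previewLayer, at: 0)
    }

    required init?(coder: NSCoder) {
        fatalError()
    }

    override func layoutSubviews() {
        super.layoutSubviews()
        // Keep the camera feed filling the view.
        previewLayer.frame = bounds
    }

    func startCaptureSession() {
        if captureSession.inputs.isEmpty {
            guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                       for: .video,
                                                       position: .front),
                  let input = try? AVCaptureDeviceInput(device: camera) else {
                return
            }
            captureSession.addInput(input)
            // Route every frame to the captureOutput(_:didOutput:from:) delegate
            // method we'll implement later in the article.
            videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "camera.output"))
            captureSession.addOutput(videoOutput)
        }
        captureSession.startRunning()
    }

    func stopCaptureSession() {
        captureSession.stopRunning()
    }
}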

Now with the actual face detection being all that's left, we can wrap everything up into our app's main ContentView!

struct ContentView: View {

    @ObservedObject var socketWorker: SocketWorker
    @ObservedObject var targetViewModel: TargetDataViewModel

    let cameraViewController: CameraViewController

    init() {
        self.socketWorker = SocketWorker()
        let targetViewModel = TargetDataViewModel()
        self.targetViewModel = targetViewModel
        self.cameraViewController = CameraViewController(
            targetViewModel: targetViewModel
        )
        targetViewModel.delegate = socketWorker
    }

    var body: some View {
        ZStack {
            CameraViewWrapper(
                viewController: cameraViewController
            )
            DataView(
                descriptionText: targetViewModel.targetTitle,
                degreesText: targetViewModel.targetDescription,
                connectionText: socketWorker.connectionStatus,
                isConnected: socketWorker.isConnected
            ).expand()
        }.onAppear {
            self.socketWorker.resume()
        }.onDisappear {
            self.socketWorker.suspend()
        }.expand()
    }
}

extension View {
    func expand() -> some View {
        return frame(
            minWidth: 0,
            maxWidth: .infinity,
            minHeight: 0,
            maxHeight: .infinity,
            alignment: .topLeading
        )
    }
}

After turning on our socket server, running this app will allow us to connect to it and see the user's camera. However, no information will actually be sent yet, since we're not doing anything with the camera's output.

Detecting faces in images with CoreML

Ever since the original release of CoreML, each new iOS version has added capabilities to the built-in Vision framework. In our case, detecting faces in an image is as simple as performing a VNDetectFaceRectanglesRequest with the camera's output as input.

The result of this request is a [VNFaceObservation] array containing the bounding box of every face detected in the image. We can use this result to draw rectangles around the detected faces and feed these rectangles to our previously built TargetDataViewModel. Before defining the request itself, let's first handle the response inside CameraView:

func handleFaces(request: VNRequest, error: Error?) {
    DispatchQueue.main.async { [unowned self] in
        guard let results = request.results as? [VNFaceObservation] else {
            return
        }
        // We'll add all drawn rectangles in a `maskLayer` property
        // and remove them when we get new responses.
        for mask in self.maskLayer {
            mask.removeFromSuperlayer()
        }
        let frames: [CGRect] = results.map {
            let transform = CGAffineTransform(scaleX: 1, y: -1)
                                .translatedBy(x: 0, y: -self.frame.height)
            let translate = CGAffineTransform
                                .identity
                                .scaledBy(x: self.frame.width, y: self.frame.height)
            return $0.boundingBox
                        .applying(translate)
                        .applying(transform)
        }
        let sortedFrames = frames.sorted {
            ($0.width * $0.height) > ($1.width * $1.height)
        }
        sortedFrames.enumerated().forEach {
            self.drawFaceBox(index: $0.offset, frame: $0.element)
        }
        // The biggest face on the screen is our target.
        guard let targetRect = sortedFrames.first else {
            self.delegate.cameraViewFoundNoTargets()
            return
        }
        self.delegate.cameraViewDidTarget(frame: targetRect)
    }
}

(The boundingBox property contains values from zero to one with the y-axis going up from the bottom, so this method also needs to handle scaling and reversing the y-axis for the box to match the camera's bounds.)
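If you'd rather not compose CGAffineTransforms by hand, Vision also ships a VNImageRectForNormalizedRect function that performs the scaling for you; you still need to flip the y-axis yourself for UIKit. A rough equivalent of the conversion above (the uiKitRect name is just for illustration) could look like this:

import UIKit
import Vision

func uiKitRect(for observation: VNFaceObservation, in viewSize: CGSize) -> CGRect {
    // Scale the normalized (0...1) bounding box up to view coordinates.
    let scaled = VNImageRectForNormalizedRect(observation.boundingBox,
                                              Int(viewSize.width),
                                              Int(viewSize.height))
    // Vision's origin is at the bottom-left; UIKit's is at the top-left,
    // so flip the y-axis.
    return CGRect(x: scaled.origin.x,
                  y: viewSize.height - scaled.origin.y - scaled.height,
                  width: scaled.width,
                  height: scaled.height)
}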

After converting the faces' rectangles, we sort them by total area and draw them on the screen. We'll consider the rectangle with the biggest area as our target.

func drawFaceBox(index: Int, frame: CGRect) {
    // The first (biggest) face is the target, so it gets a red box.
    let color = index == 0 ? UIColor.red.cgColor : UIColor.yellow.cgColor
    createLayer(in: frame, color: color)
}

private func createLayer(in rect: CGRect, color: CGColor) {
    let mask = CAShapeLayer()
    mask.frame = rect
    mask.opacity = 1
    mask.borderColor = color
    mask.borderWidth = 2
    maskLayer.append(mask)
    layer.insertSublayer(mask, at: 1)
}

To perform the request, we need access to the camera's image buffer. We can get it by setting our CameraView as the sample buffer delegate of the capture session's AVCaptureVideoDataOutput, which provides a delegate method containing the camera's buffer. From there, we extract the CVPixelBuffer that Vision expects and finally perform the request:

extension CameraView: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            return
        }
        // The orientation should match how the device is positioned;
        // .up assumes the feed doesn't need rotating, so adjust it for your setup.
        let exifOrientation = CGImagePropertyOrientation.up
        var requestOptions: [VNImageOption: Any] = [:]
        let key = kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix
        if let cameraIntrinsicData = CMGetAttachment(sampleBuffer, key: key, attachmentModeOut: nil) {
            requestOptions = [.cameraIntrinsics: cameraIntrinsicData]
        }
        let imageRequestHandler = VNImageRequestHandler(
            cvPixelBuffer: pixelBuffer,
            orientation: exifOrientation,
            options: requestOptions
        )
        do {
            let request = VNDetectFaceRectanglesRequest(completionHandler: handleFaces)
            try imageRequestHandler.perform([request])
        } catch {
            print(error)
        }
    }
}

If we run the app now and point the camera at someone, we'll see rectangles being drawn around their face!

And because everything is wired up with ObservableObjects and the delegate we defined earlier, the processing that happens inside TargetDataViewModel is published back to the UI and forwarded to the SocketWorker, which finally sends our target's information to the server.

Beyond iOS: Making an Arduino receive this information

As stated at the beginning of the article, I won't go too deep into the components that aren't iOS related, but if you're wondering how sending this information to a socket results in a servo motor moving, this section covers it.

An Arduino can communicate with other devices through its serial port, which shows up as a USB device on the other end. With the Arduino connected to your server (I used a Raspberry Pi for the video in the introduction, but simply running the server script on your Mac works as well), you can use Python's serial (pyserial) library to open a connection with the Arduino:

import serial
ser = serial.Serial('/dev/cu.usbmodem1412401', 9600)

In this case, /dev/cu.usbmodem1412401 is the name of the port where my Arduino is connected, and 9600 is the baud rate expected by the Arduino.

When the server receives data from the iOS app, we can write it into the serial port:

while True:
    newData = client.recv(4096)
    msg = self.connections[fileno].recover(newData)
    if not msg: continue
    print msg
    ser.write(msg)

If the Arduino is listening to its serial port, it will be able to receive this data and parse it into something meaningful. With a servo motor connected to pin 9, here's the Arduino C code that parses the little protocol we defined when processing the data in the view model:

#include <Servo.h>

Servo myservo;

bool isScanning = false;

void setup() {
  Serial.begin(9600);
  myservo.attach(9);
  myservo.write(90);
  delay(1000);
}

void loop() {
  if (isScanning == true) {
    scan();
    return;
  }
  if (Serial.available() > 0) {
    char incomingByte = Serial.read();
    if (incomingByte == 'L') {
      isScanning = true;
    } else if (incomingByte == 'X') {
      setServoAngle();
    } else { //Unwanted data, get rid of it
      while(Serial.read() != -1) {
        incomingByte = Serial.read();
      }
    }
  }
}

void scan() {
  int angle = myservo.read();
  for (int i = angle; i<= 160; i++) {
    if (Serial.peek() != -1) {
      isScanning = false;
      return;
    }
    myservo.write(i);
    delay(15);
  }
  for (int i = 160; i>= 20; i--) {
    if (Serial.peek() != -1) {
      isScanning = false;
      return;
    }
    myservo.write(i);
    delay(15);
  }
}

void setServoAngle() {
  int integerValue = 0;
  while (1) {
    int incomingByte = Serial.read(); // Returns -1 when no data is available.
    if (incomingByte == -1) {
      continue;
    }
    if (!isdigit(incomingByte)) {
      break; // The "." terminator (or any non-digit) ends the command.
    }
    integerValue *= 10;
    integerValue += incomingByte - '0';
  }
  if (integerValue >= 0 && integerValue <= 180) {
    myservo.write(integerValue);
  }
}

In short, the Arduino enters a "scan mode" that makes the servo wiggle back and forth whenever an L is received from the server. This goes on until an X is received -- when that happens, the Arduino parses the numeric component of the command and sets it as the servo's angle. All that's left now is to attach a Nerf gun to it and make it actually shoot intruders!

Conclusion

The introduction of URLSessionWebSocketTask finally solves a major pain point in developing networking applications for iOS, and I hope this article sparks some ideas for new side projects of your own. Creating side projects is a great way to learn new technologies -- I originally built this project to learn more about Raspberry Pis, and this article was my chance to have a first interaction with SwiftUI and Combine after Xcode 11's release.

Follow me on my Twitter (@rockbruno_), and let me know of any suggestions and corrections you want to share.

References and Good reads

URLSessionWebSocketTask
RFC 6455 - The WebSocket Protocol
Python implementation of an RFC 6455 socket server