Edge nodes in a Hadoop cluster are typically the nodes that run the cluster's client-side operations. They are usually kept separate from the nodes that run Hadoop services such as HDFS, MapReduce, etc., mainly to keep computing resources separate. On smaller clusters with only a few nodes, it's common to see nodes playing a hybrid combination of roles: master services (JobTracker, NameNode, etc.), slave services (TaskTracker, DataNode, etc.), and gateway services.
Note that running master and slave Hadoop services on the same node is not an ideal setup, and it can cause scaling and resource issues depending on what's in use. This kind of configuration is typically seen in small-scale dev environments.
With that said, here are some answers to the questions you posted:
1) Does the edge node have to be part of the cluster? The edge node does not have to be part of the cluster. However, if it is outside the cluster (meaning it doesn't have any specific Hadoop service roles running on it), it will still need some basic pieces, such as the Hadoop binaries and the current Hadoop cluster config files, to submit jobs to the cluster.
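To make the "basic pieces" concrete, here's a rough Python sketch of a check you could run on a would-be edge node. It assumes the client is located through the `hadoop` command on `PATH` and the `HADOOP_CONF_DIR` environment variable, and that `core-site.xml` and `hdfs-site.xml` are the minimum config files needed. The function name `check_edge_node` is just made up for illustration, and your distribution may need more files (e.g. for YARN or MapReduce).

```python
import os
import shutil

# Assumed minimum set of client config files; extend this for your distribution.
REQUIRED_CONF_FILES = ("core-site.xml", "hdfs-site.xml")


def check_edge_node(conf_dir=None):
    """Return a list of problems that would stop this host acting as a Hadoop client."""
    problems = []

    # The Hadoop binaries must be installed and reachable from the shell.
    if shutil.which("hadoop") is None:
        problems.append("hadoop binary not found on PATH")

    # The client reads the cluster's config files from this directory.
    conf_dir = conf_dir or os.environ.get("HADOOP_CONF_DIR")
    if not conf_dir:
        problems.append("HADOOP_CONF_DIR is not set")
        return problems

    for name in REQUIRED_CONF_FILES:
        if not os.path.isfile(os.path.join(conf_dir, name)):
            problems.append(f"missing config file: {name}")

    return problems


if __name__ == "__main__":
    for problem in check_edge_node():
        print(problem)
```

An empty list means the basics are in place. It does not prove the config files are current, so they still have to be kept in sync with the cluster.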
2) What advantages do we have if it is inside the cluster? Depending on which distribution is in use, edge nodes that run within the cluster allow for centralized management of all the Hadoop configuration entries on the cluster nodes, which reduces the amount of administration needed to update the config files. Usually this is a one-to-many approach, where config entries are updated in one location and pushed out to all (many) nodes in the cluster.
However, when one of the nodes within the cluster is also used as an edge node, the client operations consume CPU and memory, which detracts from the resources available to the Hadoop services running on that node.
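As a rough illustration of the one-to-many config push mentioned for question 2, here's a small Python sketch that builds one `rsync` command per node. This is not how any particular distribution's management tool does it; it's only a hand-rolled stand-in. The function name `build_push_commands`, the `/etc/hadoop/conf` path and the hostnames are all assumptions for the example.

```python
import shlex


def build_push_commands(conf_dir, hosts, remote_dir="/etc/hadoop/conf"):
    """Build one rsync command per host that copies conf_dir's contents to remote_dir."""
    # A trailing slash tells rsync to copy the directory's contents, not the directory itself.
    src = conf_dir.rstrip("/") + "/"
    return [["rsync", "-a", src, f"{host}:{remote_dir}/"] for host in hosts]


if __name__ == "__main__":
    # Hypothetical hostnames; replace with the nodes in your cluster.
    nodes = ["node1.example.com", "node2.example.com"]
    for cmd in build_push_commands("/etc/hadoop/conf", nodes):
        # Print instead of running, so the commands can be reviewed first.
        print(shlex.join(cmd))
```

The commands are only printed here. To actually run them you could pass each list to `subprocess.run`, assuming SSH access to every node is already set up.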
3) Does it store any blocks of data in HDFS? Unless the edge node is configured with a DataNode service, blocks of data will not be stored on that node.
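If you're not sure whether a given node is running a DataNode, one quick way to check is to look at the output of the JDK's `jps` tool, which lists Java processes as `<pid> <name>`. Here's a minimal Python sketch that parses that output. The function names are made up for illustration, and it assumes the DataNode runs under the same user that runs `jps` (otherwise it won't be listed).

```python
import subprocess


def running_java_processes(jps_output):
    """Extract process names from `jps` output, where each line is '<pid> <name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:  # skip blank lines and entries without a name
            names.add(parts[1].strip())
    return names


def stores_hdfs_blocks(jps_output):
    """A node only holds HDFS blocks if a DataNode service is running on it."""
    return "DataNode" in running_java_processes(jps_output)


if __name__ == "__main__":
    # Requires a JDK on this host so that `jps` is available.
    output = subprocess.run(["jps"], capture_output=True, text=True).stdout
    print("DataNode running:", stores_hdfs_blocks(output))
```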
4) Should the edge node be outside the cluster? As mentioned above, it depends on the cluster environment and use case. One reason to configure it outside the cluster is to keep client operations and Hadoop services separated.
Keeping the edge node separate lets the cluster nodes devote their full computing resources to Hadoop processing.