Pages

Wednesday, March 26, 2014

Collections- Part - I

Table of Contents 

  • Goals
  • Vector
  • Map
  • Set
  • Pair
  • Immutable Collections
  • Collection Interfaces
  • Square Bracket Syntax
  • Literal Syntax
  • Basic PHP operators and Collections
  • PHP Builtin Support
  • Examples
  • Limitations
  • Future work on collections
The PHP language provides one primary mechanism for expressing containers of elements: the PHP array. Traditionally, the term "array" in programming languages is thought of as a collection of elements, indexed using consecutive integers (starting at 0 or 1), and possibly of a fixed size (e.g., Java). PHP arrays are significantly different than the traditional concept of arrays. A PHP array is essentially a dictionary-style collection associating keys with values that maintains insertion order. The keys can be either integers or strings; the values can be of any type. PHP arrays are dynamically-sized and grow as needed without requiring a function call to increase their size/capacity. In all Hack documentation, the term "array" will be considered to refer to a PHP array unless explicitly noted otherwise. Arrays in PHP are created using the array() language construct. Here are a few examples:
<?php
 
// Simple array with the values 1, 2, 3 and default
// indexed by integers, starting with 0
$arr1 = array(123);
 
// Another simple array with the values "a", "b", "c"
// and again default indexed by integers
$arr2 = array("a""b""c");
 
// Array with strings as both keys and values$arr3 = array("a" => "aaa""b" => "bbb""c" => "ccc");
 
// Array containing the values 1, 2, 3 but now indexed
// by a mix of integer keys and string keys
$arr4 = array("foo" => 173 => 2"bar" => 3);
 
// Array having mixed-typed values, default indexed by 
// integers.
$arr5 = array(1"hello", array(23), "goodbye");
 
// Dynamically grow arrays by just adding new values. The
// keys do not have to be sequential or of the same type.
$arr1[] = 4// The key will be 3$arr1[4] = 5;$arr2["bap"] = 6;$arr3["d"] = "ddd";$arr4[] = "blah"// The key will be 74$arr5[9] = 3;
Even with the relatively simple examples above, it is easy to see that PHP arrays are different than the "arrays" in other popular programming languages. In addition to offering dictionary-like operations, PHP arrays also offer stack-like operations. For example, values on PHP arrays can be pushed, popped or shifted.
Hack adds container types and interfaces to PHP. Building on Hack's support for generics, Hack adds first class, built-in parameterized collections such as vectors and maps. Collections are specialized for data storage and retrieval. Collections implement many of the same interfaces and are extendable to create even more specialized collections. Currently, Hack implements the following concrete collection types:
  • Vector: An ordered, index-based list collection.
  • Map: An ordered dictionary-style collection.
  • Set: A list-based collection that stores unique values.
  • Pair: An index-based collection that can hold exactly two elements.
The primary interfaces implemented by these collections (except Pair) have two incarnations: the normal, read-write interface implemented by the classes above and a read-only interface. The read-write interface is prefixed by Mutableand the read-only interface is prefixed by Const.
Here is an example of using Hack collections:
<?hh
function main_col() {

  
$vector Vector {510};

  
$vector->add(15);
  
$vector->add(20);

  
$vector[] = 25;

  
$vector->removeKey(2);

  foreach (
$vector as $item) {
    echo 
$item "\n";
  }
}
main_col();
The above example will output:
5
10
20
25

Goals 

Given the wide range of functionality offered by PHP arrays, they can be used to mimic common specialized collection types such as vectors, dictionaries, and sets. So why extend HHVM with its own collection classes and functionality? One key reason is code clarity. It's not always clear in PHP how an array is being used, or what the types of the keys and values are. This makes it harder to confidently make changes to a larger codebase without introducing subtle bugs. Another key reason is PHP array performance. Generally speaking, it is impossible to know with certainty if a PHP array is going to be used as a vector, as a map, as a set, etc., and as such the code generated by HHVM and associated data structures are more complex. Also, PHP arrays are copy-on-write. This means that programmers must use PHP references when they don't want the array to be copied, and using PHP references generally degrades code performance. Enter the Hack collection classes.
The goals of Hack collections are four-fold:
  1. Provide a unified collections framework that is simple and intuitive.
  2. Provide equal or better performance than the equivalent PHP array pattern.
  3. Provide a collection implementation that allows for optional static typing, integrating seamlessly with Hack.
  4. Provide an easy migration path to this framework by building on top of standard functionality from PHP5.
Like PHP arrays, Hack collections support foreach syntax for enumerating elements, and most collections support bracket syntax (e.g. $c[$k]) for accessing elements. Unlike PHP arrays, however, Hack collections are considered to be objects. Like all objects, Hack collections have reference-like semantics for assignment, parameter passing, and foreach. In other words, the collection is not copied when it is assigned to a variable, passed as a parameter, or returned as a return value (i.e., no copy-on-write semantics).
Hack collections were designed to work with Hack and integrate with Hack's generics. Generics allow for the parameterization of a type. Why is this useful? It allows for the creation of a collection type that is type-safe at compile time. In other words, instead of assuming one starts with a top-level object and then performs instanceof checks and casting at runtime to retrieve the appropriate type, the correct type can be determined at compile time. And while at present Hack collections do not perform type-checking on elements at run time (i.e., type information will be dropped at runtime), using generics allows Hack to catch typing errors involving collection elements in near real time. Generics are not available with PHP5.
Note that Hack generic type annotations are not strictly required for Hack collections. The Hack collection interfaces and APIs may be used in regular, untyped PHP code. Like all generics in Hack, the generics used with Hack collections will be "erased" (either fully or partially) at run time.

Vector 

Vectors are an integer-indexed (zero-based) collection with similar semantics to a C++ vector or a C#/JavaArrayList. Random access to elements happen in O(1) time. Inserts occur at O(1) when added to the end, but could hit O(n) with inserts elsewhere. Removal has similar time semantics. Iteration over the entire vector occurs at O(n) time. Inserting n elements into an empty Vector will take O(n) time in the average case (amortized). Here is the programmer accessible interface of a Hack Vector.
Here are some basic examples that show how Vector can be used:
<?hh
function main() {
  
// Create a Vector using collection literal syntax
  
$vector Vector {51015};

  
// Add elements using "$c[]=$v" syntax
  
$vector[] = 20;
  
$vector[] = 25;

  
// Access value by key using "$c[$k]" syntax; note that "$c[$k]"
  // syntax will throw if the key is out of bounds
  
echo $vector[0] . "\n";

  
// Access value by key using get(); null will be returned if the key is
  // out of bounds
  
echo $vector->get(1) . "\n\n";

  
// Set value by key using "$c[$k]=$v" syntax, overwriting the previous
  // value; note that "$c[$k]=$v" syntax for Vectors will throw if the key
  // is out of bounds
  
$vector[0] = 999;

  
// Remove an element by key
  
$vector->removeKey(2);

  
// Iterate over the values using "foreach ($c as $v)" syntax
  
foreach ($vector as $v) {
    echo 
$v "\n";
  }
  echo 
"\n";

  
// Iterate over the values using "for" and "$c[$x]" syntax
  
for ($i 0$i count($vector); ++$i) {
    echo 
$vector[$i] "\n";
  }

  
// Iterate over the keys and values using "foreach ($c as $k=>$v)"
  // syntax
  
foreach ($vector as $k => $v) {
    echo 
$k ": " $v "\n";
  }
  echo 
"\n";
}
main();
The above example will output:
999
10

999
10
20
25

999
10
20
25

0: 999
1: 10
2: 20
3: 25
<?hh
function main() {
  
$v Vector {40802060};
  
$it $v->filter(function($x) { return $x >= 50; });
  foreach (
$it as $key => $val) {    echo $key " " $val "\n";
  }
}
main();
The above example will output:
0 80

1 60

Map 

Maps are an ordered dictionary-style collection. Elements are stored as key/value pairs. Maps retain element insertion order, meaning that iterating over a Map will visit the elements in the same order that they were inserted. Insert, remove and search operations are performed in O(lg n) time or better (amortized). Note: Map only supports int and stringkeys at present. Support for other types may come in the future. Here is the programmer accessible interface for Map.
Here are some basic examples showing how Map can be used:
<?hh
 
function main() {
  
// Create a Map using collection literal syntax
  
$map Map {"A" => 1"B" => 2"C" => 3};
 
  
// Add elements using "$c[$k]=$v" syntax; note that if $k is
  // already present, "$c[$k]=$v" syntax will overwrite the previous
  // value
  
$map["D"] = 4;
  
$map["E"] = 5;
 
  
// Access value by key using "$c[$k]" syntax; note that "$c[$k]"
  // syntax will throw if the key is not present
  
echo $map["A"] . "\n";
 
  
// Access value by key via get(); null will be returned if the key is
  // not present
  
echo $map->get("B") . "\n\n";
 
  
// Remove element by key; if the key is not present the remove()
  // method will do nothing and return
  
$map->remove("B");
 
  
// Testing if a key is present
  
echo ($map->contains("A") ? "true" "false") . "\n\n";
 
  
// Iterate over the values using "foreach ($c as $v)" syntax
  
foreach ($map as $v) {
    echo 
$v "\n";
  }
 
  
// Iterate over the keys and values using "foreach ($c as $k=>$v)"
  // syntax
  
foreach ($map as $k => $v) {
    echo 
$k ": " $v "\n";
  }
}
 
main(); // REMEMBER, insertion order is maintained.
The above example will output:
1
2

true

1
3
4
5

A: 1
C: 3
D: 4
E: 5
<?hh
 
function main() {
  
$m Map {=> 'a'=> 'b'=> 'c'=> 'd'};
  
var_dump($m->filterWithKey(function($k$v) { return $k >= 3; }));
}
 
main();
The above example will output:
object(Map)#1 (2) {
  [3]=>
  string(1) "c"
  [4]=>
  string(1) "d"
}

Set 

Sets are an unordered associative collection that stores unique values. Unlike vectors and maps, sets do not have keys, and thus cannot be iterated on keys (i.e., Set does not implement KeyedIterable). Likewise, sets do not support$c[$k] or $c[$k]=$v syntax (though $c[]=$v syntax is supported for adding elements). Insert, remove, and search operations for Set are performed in O(lg N) time or better. Intersect and some other operations are available in the Sets class. Note: Set only supports int and string values at present. Support for other types may come in the future.Here is the programmer accessible interface of a Hack implemented Set.
Here is a basic example showing how Set can be used:
<?hh
 
function main() {
  
// Create a Set using collection literal syntax
  
$set Set {"A""B"};
 
  
// Add elements using "$c[]=$v" syntax
  
$set[] = "C";
 
  
// Add elements using the add() method
  
$set->add("D")->add("E");
 
  
// Remove element by value
  
$set->remove("B");
 
  
// Testing if a value is present
  
echo ($set->contains("A") ? "true" "false") . "\n\n";
 
  
// Iterate over the values using "foreach ($c as $v)" syntax
  
foreach ($set as $item) {
    echo 
$item "\n";
  }
}
 
main();
The above example will output:
true

D
A
E
C
Here is an example of taking the difference between two sets using removeAll
<?hhfunction foo(): void {
  
$s Set {234};
  
$v Set {235};
  
$s->add(6);
  
$z $s->RemoveAll($v); //difference between $v and $s
  
var_dump($s$v$z);
}
 
foo();
The above example will output:
object(Set)#1 (2) {
  int(4)
  int(6)
}
object(Set)#2 (3) {
  int(2)
  int(3)
  int(5)
}
object(Set)#1 (2) {
  int(4)
  int(6)
}

Pair 

An indexed container restricted to containing exactly two elements. Pair has integer keys; key 0 refers to the first element and key 1 refers to the second element (all other integer keys are out of bounds). The first type parameter specifies the type constraint for the first element, and the second type parameter specifies the type constraint for the second element. Pairs are immutable, meaning that the elements of a Pair cannot be assigned to or removed. Note however that Pairs may contain mutable objects. In many cases, a tuple is a better choice than a Pair. Here is theprogrammer accessible interface for Hack pairs.
Here is a simple example showing how Pair can be used:
<?hh
 
function main() {
  
$p Pair {7'a'};
  echo 
$p[0] . "\n";
  echo 
$p[1] . "\n";
  echo 
"\n";
  foreach (
$p as $val) {
    echo 
$val "\n";
  }
}
 
main();
The above example will output:
7
a

7
a

Immutable Collections 

Collections in Hack can take on two primary forms: mutable and immutable. The mutable Vector, Map, and Set all have immutable counterparts. The immutable collections are, as you would imagine, constant. No elements can be added or removed from them. Likewise, elements cannot be overridden using assignment.
Immutable collections generally implement a ConstXYZ interface (e.g., ConstVector). Here are the current concrete immutable collection classes:

Collection Interfaces 

The following diagram shows at a high level how Hack collection interfaces are structured.

Collections Implementation

Collection implementation diagram
In the design of HipHop collections, these general design patterns were followed:
  1. Used patterns that PHP developers are accustomed to within various codebases (e.g., foreach both key/value and just value iteration).
  2. Allowed use of foreach, brackets syntax (e.g., $a[$b]) and higher order functions (e.g., filter) instead of manually operating on iterators.
  3. Allowed for supporting covariance and contravariance in the future to provide maximum power and flexibility while maintaining type safety.
  4. Followed design patterns from other popular collections frameworks where appropriate (e.g, C++, C#, Java, Python, Scala) for maximum developer appeal.
Potential support for covariance and contravariance is indicated by the "out" keyword and "in" keyword respectively , though at present Hack does not support "out" and "in".

Specific Collection Interfaces 

The concrete collection classes (Vector, Map, etc.) use the collection interfaces as building blocks. Before diving into the lower layers of interfaces, it's worth covering the specific collection interfaces first.
The "const" interfaces (ConstVector, ConstMap, and ConstSet) can be used in parameter type constraints to indicate that a given function will not mutate the collection. This can be important for larger codebases since collections have "reference-like"semantics and do not have "copy-on-write" behavior like PHP arrays. The "const" interfaces also provides a good foundation for adding "immutable" versions of collections (ImmVector, ImmMap, etc.) and for potentially supporting covariance in the future. The ConstMap interface can be used in a parameter type constraint to indicate that a given function will accept both Maps and ImmMaps. The collection specific interfaces also lay the foundation for adding more types of collections or user-defined collections down the road. Each of the "const" interfaces has a "mutable" counterpart (e.g. MutableVector)

Iteration Interfaces 

The Iterable, Traversable, and associated interfaces allow for the traversal of values or keys/values coming from some source, whether it be a collection or a generator or some other kind of object.
The Traversable style interfaces provide a continuous feeding of elements with foreach. Traversable also provides the capability for code to be compatible with both arrays and collections, while allowing for the typing of keys and values. A function that takes an Indexish, for example, can be passed an array<int> or a Vector<int>.
Indexish is another interface that works with both arrays and collections, and it is intended to help ease the migration from arrays to typed collections for certain use cases. Indexish supports the array-like, square bracket (e.g. $x[$k]) syntax while also supporting the iteration via foreach (by extending the Traversable interface). The Indexishinterface does not include functionality for mutating the underlying array or collection; this includes appending, removing, changing element types and changing the values of elements. While the Indexish does offer functionality for mutation, it is still possible to mutate the underlying array or collection if the programmer uses is_array() or instanceof to cast the container to its concrete type, or if the container is passed to "untyped" code. It is important to note that the copy semantics do remain the same for arrays and collections when using Indexish. Arrays are still copy-on-write, while collections are passed by reference.
Given that Indexish supports keyed collections, Vector and Map implement Indexish. Set (which doesn't have a key) and Pair do not.
The Iterable style interfaces allow for retrieving values using foreach at chosen times through the use of iterators (e.g.. next), while also providing various helper functions that can be used during iteration.

General Collection Interfaces 

The Collection interface provides basic functionality for adding and erasing elements, as well as iterating over elements in a uniform manner.

Access Interfaces 

Depending on the type of collection type implemented, one of the following interfaces (IndexAccess, SetAccess orMapAccess) will be implemented. These interfaces provide the functionality for checking values in a collection by index/key, removing items from a collection by index/key, as well as providing [] syntax for accessing values by index/key. All three of these interface types have a "const" companion for cases where the programmer wants to enforce a "read-only" contract.

Square Bracket Syntax 

The Hack collection classes provides two primary mechanisms for accessing elements: square bracket syntax (e.g. []) and explicit get/set methods. Both are obviously completely valid for use. However, it is recommended that square bracket syntax be used. Square bracket syntax should be familiar to PHP developers via arrays, as well as developers in other languages like C#. [] is the preferred for several reasons:
  • It is currently faster to use at runtime.
  • It is more readable to people who read PHP code since [] is used with arrays.
  • Code that is required to support both collections and arrays will require [], along with the KeyedTraversableand/or Indexishinterfaces, to be interoperable.

Literal Syntax 

Hack collections has introduced a literal syntax in order to make it easy to create a new collection without the need for temporary arrays or helper functions. HHVM natively understands collection literals. Here is an example:
<?hh
$vec 
Vector {123};
The code above requires no hidden arrays or helper functions to build a Vector containing 1, 2, and 3.
Here are some basic examples showing how collection literal syntax can be used:
<?hh
 
// This example creates collections using literal syntax and
// stores these collections in normal local variables
function f() {
  
$vec Vector {123};
  
var_dump($vec);
  
$map Map {42 => 'foo'73 => 'bar'144 => 'baz'};
  
var_dump($map);
}
f();
The above example will output:
object(Vector)#1 (3) {
  [0]=>
  int(1)
  [1]=>
  int(2)
  [2]=>
  int(3)
}
object(Map)#1 (3) {
  [42]=>
  string(3) "foo"
  [73]=>
  string(3) "bar"
  [144]=>
  string(3) "baz"
}
  
<?hhclass {
  public static 
$bar Map {'a' => 1'b' => 2};
 
  
// Each instance of class C will get its own copy
  // of Vector {1, 2, 3} in property $foo
  
public $foo Vector {123};
 
  
// Each invocation of h() with no parameters will
  // return a distinct empty Vector
  
function h($x Vector {}) {
    return 
$x;
  }
 
  function 
j() {
    static 
$y Map {=> 'a'=> 'b'};
    return 
$y;
  }
}
There are several reasons why this mechanism of literal syntax was chosen over other candidates (e.g., Python-style or using parenthesis instead of curly braces):
  • "[..]" syntax has already been taken by PHP 5.4 for 'short array syntax'.
  • The current collection literal syntax is clearly distinct from function call syntax, which makes it easier for HipHop to treat it differently than a function call and deliver performance that is competitive with PHP arrays. This also means that new keywords don't need to be added or the HipHop runtime hacked to allow "=>" inside map(..)but not inside other function calls.
  • This literal syntax scales better for adding future built-in collection types. This syntax also leaves the possibility open for allowing some user-defined classes to use this syntax down the road.
  • It leaves the possibility open for specifying the extended type of a collection (e.g.,. Vector<Foo> {f(), g()}), which may come in handy for some use cases, and it may be useful down the road when generics are integrated into HipHop's type system.
  • This literal syntax is reasonably concise when compared with the classic array literal syntax (e.g., array(..)), and it doesn't preclude the possibility of having a more concise syntax in the future for vectors and maps.
  • This syntax has similarities with various initialization constructs in several popular languages that are syntactically similar to PHP (ex. C, C++, Java, C#), giving it a familiar feel for PHP developers.

Basic PHP operators and Collections 

For the most part, collections behave like regular PHP objects, though there are some places where collections behave differently than regular PHP objects:
  • list assignment with a collection on the right hand side works as you would expect.
  • Comparing collections using === returns true if and only if they are the same object (reference equality).
  • Comparing collections using == will compare collections for "structural equality". == will always return false if dealing with different kinds of collections. == takes order into account for Vector and Pair, but does not care about order for Map and Set.
  • Casting a collection to array ((array)$c) will produce an array containing the keys and values from the collection (same as $c->toArray()).
  • Casting a collection to boolean ((bool)$c) will produce true if the collection is non-empty, false otherwise (same as !$c->isEmpty()).
  • Other kinds of casts behave the same way for collections as they do for regular PHP objects.

PHP Builtin Support 

In large PHP codebases, data flow can be very complex, and so it's important to make collections work together with common PHP builtins so that migration is as easy as possible. Below is a summary of what PHP builtin functions currently work with collections:
  • apc_store(string,mixed,int)
  • array_combine()
  • array_diff()
  • array_diff_key()
  • array_filter()
  • array_intersect()
  • array_intersect_key()
  • array_keys()
  • array_map()
  • array_push()
  • array_values()
  • count()
  • debug_zval_dump()
  • implode()
  • join() (alias of implode())
  • print_r()
  • serialize()
  • sizeof() (alias for count())
  • unserialize()
  • var_dump()
  • var_export()
  • array_key_exists() - for Vector and Map only
  • idx() - for Vector and Map only
  • sort() - for Vector only
  • rsort() - for Vector only
  • usort() - for Vector only
  • arsort() - for Map only
  • uasort() - for Map only
  • ksort() - for Map only
  • krsort() - for Map only
  • uksort() - for Map only
  • natsort() - for Map only
  • natcasesort() - for Map only
At present, most other standard PHP builtins that accept arrays as parameters either (1) do not support collections or (2) silently copy the collection to an array.

No comments:

Post a Comment