13F
Before you do anything automated with these instructions, read this:
https://www.sec.gov/os/accessing-edgar-data.
It's important that you follow those rules so we all can keep working with the SEC Edgar datasets.
-----------------------------------------------------------------------------------------
Most of the time, it is difficult to uniquely identify companies in the samples we work with. Unless you have access to expensive, proprietary data sets, you have to work your way through publicly available data. CIKs are easily available from the SEC website (I love those guys, thank you for your service) and we use them to find all sorts of information across every available form; when it comes to market data, though, CUSIPs are essential, and they are not readily linked to the CIK. At least not if you're looking at the usual forms, such as the 10-K, 8-K or DEF 14A. That's where 13Fs come in handy.
In short, the 13F must be filed quarterly by institutional managers with more than US$100 million in investments (details can be found here) and can carry three kinds of information: a list of stock holdings, a notification that the manager's holdings are listed by some other manager, and whether those holdings are jointly reported, i.e., partially reported by some other manager. Although all of this could be a great source of information for research, what matters for us is the data in the form's information table: the CUSIP of each security held by the filing manager and the issuer's name! Bear in mind that you can use that name as a key to search for a CIK in the SEC Edgar system, which finally gives you a CIK-name-CUSIP set. How about that for your market-based research?
We could also use the SC 13G and SC 13D to find that information, but that would be more challenging. Maybe in another post.
Anyway, how do we do that? How do we get all those CUSIP numbers into a local database without paying an arm and a leg for it? There is no free lunch, though: you are going to spend a lot of time doing it.
First, let's take a look at one of those forms: click here. What you see is a table showing what the filing agent of this 13F reported that quarter (we could ask how much of the filing agent's portfolio that represents by looking at the NPORT form, but that is also for another post). We are going to extract that information, and we can do it either by reading the html file or by reading the xml file. For the moment, I'll show you how to do it using the xml file, simply because it's faster than the alternative.
From the URL, you can get the filer's CIK, which is (000)1815025 (CIKs are padded with leading zeroes to ten digits, so our tables look nice). We have to write a script to search SEC Edgar for 13F forms from that manager. I don't know if I've mentioned it before, but we are using Python on this blog. In the future, we'll use Julia.
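If you keep CIKs as integers, a quick way to produce that zero-padded, ten-digit form is Python's str.zfill:

```python
# CIKs are plain integers; EDGAR displays them zero-padded to 10 digits.
cik = 1815025
cik_padded = str(cik).zfill(10)
print(cik_padded)  # 0001815025
```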
Before you do anything else with your Jupyter notebook (IPython for me), this link takes you to a listing of the files available from the SEC Edgar system for a given filing. In this case, we want aci_13f.xml. Click on it and you'll see the xml file we are going to parse to get our data (more details on the 13F's xml format can be found here).
By now, you probably understand that we need more than just "to parse a file". We need to tell our script the address of the list of filings, which filing we want, and which file among the existing html and xml files; then we can read the relevant xml and parse it. Only then can we store the information for further use. And that is just for one company.
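As a rough sketch of that addressing step: EDGAR archives each filing under the filer's CIK (unpadded) and the filing's accession number, with the dashes removed in the directory name. The helper below only does the string formatting; the accession number shown is hypothetical, and the file name itself (here aci_13f.xml) still has to be read off the filing's index page.

```python
def filing_file_url(cik, accession_no, filename):
    """Build the EDGAR archive URL for one file inside a filing.

    cik: the filer's CIK, as an int or a (possibly zero-padded) string.
    accession_no: e.g. '0001815025-20-000004' (a made-up example);
                  the directory name drops the dashes.
    """
    cik_part = str(int(cik))              # EDGAR paths use the unpadded CIK
    acc_part = accession_no.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik_part}/{acc_part}/{filename}"

url = filing_file_url("0001815025", "0001815025-20-000004", "aci_13f.xml")
print(url)
```

Remember the SEC's access rules from the top of this post before you start fetching those URLs in a loop.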
Let's start by doing it backwards. Download that xml file so we can parse it (right-click, download file). We are using the etree module from lxml, so you need to import it (make sure lxml is installed first, of course) and assign the name of the file you just downloaded to a variable:
from lxml import etree
xmlFile = 'pathToYourFile.xml'
Once you've done that, just parse the file, give it to some other variable and work with the file root:
xmlRoot = etree.parse(xmlFile).getroot()
Now you have an etree element object you can work with. Go back to your internet browser and look at the xml file again. You should note a very distinct structure with a second-level element named "infoTable". Inside it, you'll find our relevant data. For the moment, we are interested only in "nameOfIssuer" and "cusip".
This file's structure is indeed built like a tree. The infoTable branches open to other branches, from which we can get the information we want. To see that, type the following in your interpreter:
xmlRoot.xpath("./*")[0].xpath("./*")
"xpath()" is just a search method, and here it brings back a list of the child elements of the first (zero-indexed, right?) infoTable item in that xml file. You should see something like this:
[<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}nameOfIssuer at 0x11c2c0248>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}titleOfClass at 0x11c26ed48>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}cusip at 0x11c26ebc8>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}value at 0x11d4685c8>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}shrsOrPrnAmt at 0x11d468b48>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}investmentDiscretion at 0x11d468f48>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}otherManager at 0x11d468948>,
<Element {http://www.sec.gov/edgar/document/thirteenf/informationtable}votingAuthority at 0x11d468108>]
Now, we are interested in name and cusip, respectively the first and third items from that list. Let's create a Pandas DataFrame with data from this xml file:
import pandas as pd
df = pd.DataFrame([(iTable.xpath("./*")[0].text, iTable.xpath("./*")[2].text) for iTable in xmlRoot.xpath("./*")], columns=['firm_name','cusip'])
This list comprehension gets the first and third items, issuer name and cusip, from each infoTable element in the file and populates the DataFrame with them, one row per holding.
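To see the whole extraction on a self-contained toy example, here is a sketch that parses a two-row information table from a string. The issuer names and CUSIPs are made up, and I'm using the standard library's xml.etree.ElementTree, whose elements support the same positional child access as lxml's (only the xpath() call is lxml-specific):

```python
import xml.etree.ElementTree as ET

# A stripped-down information table with hypothetical issuers and CUSIPs.
ns = "http://www.sec.gov/edgar/document/thirteenf/informationtable"
sample = f"""
<informationTable xmlns="{ns}">
  <infoTable>
    <nameOfIssuer>EXAMPLE CORP</nameOfIssuer>
    <titleOfClass>COM</titleOfClass>
    <cusip>000000AA1</cusip>
  </infoTable>
  <infoTable>
    <nameOfIssuer>SAMPLE INC</nameOfIssuer>
    <titleOfClass>COM</titleOfClass>
    <cusip>000000BB2</cusip>
  </infoTable>
</informationTable>
"""

root = ET.fromstring(sample)
# Iterating over the root yields each infoTable element; within it,
# child 0 is nameOfIssuer and child 2 is cusip, as in the real form.
rows = [(infoTable[0].text, infoTable[2].text) for infoTable in root]
print(rows)  # [('EXAMPLE CORP', '000000AA1'), ('SAMPLE INC', '000000BB2')]
```

Feeding that list of tuples to pd.DataFrame with columns=['firm_name','cusip'] gives the same kind of DataFrame.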
All we need to do now is loop over more companies to get more data. However, if you already have the xml data file, you can just parse it the way I've explained and voilà, you have a nice data set with names and CUSIP numbers. You can automate a search of the SEC Edgar website to look up the names and get the CIKs, or you can use the edgar module to download names and CIKs and then match them locally, without unnecessarily overloading the SEC with thousands of queries (that would be the right thing to do). Anyway, I'll do the search for other firms in another post, since it's a different topic.
Remember what we've got from this post: the CUSIP numbers for companies that other managers have invested in. Using this method, we won't have CUSIPs for companies investors are not interested in, meaning your sample could be severely biased towards firms whose securities institutions actually hold, or even towards those that perform better.
Send me an email if you have any suggestions or would like to chat about research! Thanks for reading!!