Dortmund and Christian Pulisic are Colder Than an Eskimo in an Ice Bath

Wait What?

The prolific rise of Christian Pulisic as an American soccer hero over the past year has led to two things in my life specifically:

  1. I have had my hopes and dreams crushed by our national team (#USMNT)
  2. I have started watching the Bundesliga more often, and have become an avid fan of Borussia Dortmund (the team Pulisic plays for when he is not busy losing to Caribbean Island squads with his fellow countrymen)

In the wake of the USMNT failing to qualify for the 2018 World Cup, I have placed the full burden of my fandom on Dortmund for the past few weeks. Perhaps many other Americans have done the same, because Dortmund seem to be weighed down by all these expectations. In fact, just this past week they lost their spot atop the league table to Bayern Munich.

Talk Data to Me…

Note: No match statistics from this past week (Oct 28th) are included yet

In an effort to fully process this information and make lemonade out of lemons, I decided to visualize their ice-cold streak over the last few weeks in a Tableau dashboard. (Data source: http://www.football-data.co.uk/)

Dortmund started the season hot, and have maintained the largest positive goal differential all season. The peak of their form was on September 23rd, when they beat their rivals Borussia Mönchengladbach 6-1 and cemented their position at the top of the Bundesliga.

Following this convincing victory from Dortmund, Bayern had 2 draws in a row against teams they should have easily beaten (Wolfsburg and Hertha Berlin). In fact, the probability of both games ending in a draw was 2.4% (according to Bet365 odds). Bayern fired their coach, Carlo Ancelotti, between these two games, so internal turmoil may have been to blame.

The two Bayern ties stand out as outliers in terms of probable match outcomes.
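For anyone wanting to work out a figure like this themselves: decimal betting odds can be turned into rough outcome probabilities by taking reciprocals and rescaling them so they sum to 1. The sketch below is only an illustration of that idea. The odds are made-up placeholders (not the actual prices for the Bayern matches), the function name is my own, and multiplying the two draw probabilities assumes the matches are independent.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into probabilities that sum to 1."""
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    total = sum(raw)  # usually a bit above 1 because of the bookmaker's margin
    return [p / total for p in raw]

# Placeholder odds for two hypothetical matches with a heavy home favorite.
match_a = implied_probabilities(1.20, 7.00, 13.00)
match_b = implied_probabilities(1.30, 5.50, 10.00)

# Assuming independence, the chance that both matches end in a draw
# is the product of the two individual draw probabilities.
both_draw = match_a[1] * match_b[1]
print(f"Draw A: {match_a[1]:.1%}, Draw B: {match_b[1]:.1%}, both: {both_draw:.1%}")
```

Rescaling by the total is the simplest way to strip out the bookmaker's margin; other approaches exist, and they will shift the numbers slightly.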

However, Dortmund have not won a game in the Bundesliga since defeating Augsburg 2-1 at the end of September (and they have been having a tough time in the Champions League as well). This left the door wide open for Bayern, who tied them on points on October 21st, and just surpassed them this weekend (October 28th).

The Silver Lining

All this being said, it is still early in the season and one should expect a team’s form to ebb and flow as the season goes along. In addition, Pulisic is still only 19! By the time the next World Cup rolls around he will be a weathered veteran at the ripe age of 23. Finally, I am sure there are plenty of other hidden storylines I have missed, so check out the dashboard and let me know what you think!

Better Late than Never…

Back on Christmas Day in 2016, I officially graduated from the Udacity Data Analyst Nanodegree program.

Over the course of a year I covered topics like Statistics, Data Wrangling, Exploratory Data Analysis, Machine Learning, Data Visualization, A/B Testing, MongoDB, and Python. The projects were intense, and the feedback from the coaches was invaluable. I highly recommend it to anyone else looking to break into the data analytics field!

How to Avoid Living in a Glorified Cardboard Box in NYC

The short answer: Data, context, and gainful employment… but read on for more specifics, including which neighborhood I currently believe is the best value overall.

After reading an article about how rents hit an all time high in July 2017, I decided it was not too early to get a jump on my impending move in 2018. Inspired to get a feel for the NYC rental market on my own, I set off to see if there was any way of systematically identifying high value apartments besides spending hours browsing the web.

Naturally (as any normal person does), I decided to start off by web scraping as many of the current rental listings from StreetEasy as I could for the neighborhoods I was most interested in (I may add more later). Honestly, this was the hardest part of the whole process, since StreetEasy is quite obviously worried about competitors scraping their data and using it for their own gain. However, I was able to get it working using Python, Selenium, and the Google Chrome driver. If you are interested, I have posted the code on my GitHub page. (I do plan on expanding my search and enriching the data more in the future.)

When it was all said and done, I had detailed information for ~1,200 one-bedroom apartment listings from 14 different neighborhoods spread across Manhattan and Brooklyn. Rather than only dealing with neighborhood summary statistics, I was able to get a real sense of the price, size, and amenity variations within each area. The best way to start digging into this data was to create a Tableau dashboard (which I have posted on Tableau Public).

What I learned

There is a big difference between ‘Williamsburg’ and ‘East Williamsburg’

Once the hipster capital of the world (no longer the case since they have a Dunkin' Donuts), Williamsburg has become one of the most desirable places to live in Brooklyn for all different types of people. I was surprised to see that the median rent of $3,100 made it only the 9th most expensive of the 14 neighborhoods I analyzed, but that ranking is slightly misleading. When you separate out East Williamsburg, the median rental price in Williamsburg goes up $200 (making it more expensive than SoHo). However, the silver lining is that if you still want to tell people you live in Williamsburg but are on a budget, the median price in East Williamsburg is $2,450 (cheaper than any of the other neighborhoods I analyzed). In fact, the maximum amount you should expect to pay for a 1BR in East Williamsburg is $3,300… the same amount as the median price in Williamsburg “proper”.

Be prepared to pay for that doorman

I primarily focused on 6 amenities that I personally may want in my next place: a balcony, doorman, elevator, in-unit washer/dryer, dishwasher, and rooftop access. (I purposefully left out zipline, an amenity I have not had since middle school.)

In general, listings with a doorman are about 32% more expensive than those without ($3,771 vs. $2,850). At the other end of the spectrum, a balcony generally made a listing only 7% more expensive. (Note: There are confounding variables at play here, since these amenities rarely come alone, but this at least provides a ballpark estimate of their value.)
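The size of a premium like this depends on which price is used as the baseline, so it helps to be explicit about it. Here is a minimal sketch using the two doorman prices quoted above; the function names are mine, and I am assuming those figures are the typical prices with and without a doorman.

```python
def percent_premium(price_with, price_without):
    """How much more expensive price_with is, relative to price_without."""
    return (price_with - price_without) / price_without * 100

def percent_discount(price_with, price_without):
    """How much cheaper price_without is, relative to price_with."""
    return (price_with - price_without) / price_with * 100

doorman, no_doorman = 3771, 2850  # prices quoted in the text above

print(f"Doorman premium: {percent_premium(doorman, no_doorman):.1f}%")       # 32.3%
print(f"No-doorman discount: {percent_discount(doorman, no_doorman):.1f}%")  # 24.4%
```

The same $921 gap reads as a bigger or smaller percentage depending on the denominator, which is worth keeping in mind when comparing amenities.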

In addition, Tribeca had the highest percentage of listings with a doorman (88%), while SoHo had the lowest (0%), which was interesting given their close proximity.

If you are looking to find an apartment with a balcony, I suggest you start your search in Vinegar Hill, where 40% of the apartments have one.

Square footage is consistently and conspicuously absent

In an effort to maximize value, the logical starting point was price per square foot. However, only 36% of listings show the square footage, making it difficult to trust any conclusions based on this metric alone.

Interestingly, the most expensive neighborhoods are the ones that report square footage the least often. West Village and Tribeca reported square footage on 14% and 28% of their listings respectively. Apparently, in these neighborhoods they want people measuring value by the experience and culture of the area (priceless) rather than how much space your money buys…

For those of you who are curious, the square footage information that I was able to scrape suggests that your money goes about 1.6 times as far per square foot in DUMBO as it does in the West Village ($4.60 vs. $7.28 per square foot), even though DUMBO is the most expensive neighborhood in Brooklyn. Additionally, of the neighborhoods I sampled, your money goes the absolute farthest in Carroll Gardens ($3.05 per square foot).
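Because so many listings omit their size, any price-per-square-foot number should travel with the share of listings it is based on. The sketch below shows one way to compute both at once. The listings are made up for illustration, the field names are my own, and using the median is an assumption rather than a description of how the dashboard does it.

```python
from statistics import median

# Made-up listings for illustration; sqft is None when the listing omits it.
listings = [
    {"neighborhood": "A", "rent": 3200, "sqft": 700},
    {"neighborhood": "A", "rent": 3500, "sqft": None},
    {"neighborhood": "A", "rent": 3000, "sqft": 600},
    {"neighborhood": "B", "rent": 4200, "sqft": 550},
    {"neighborhood": "B", "rent": 4500, "sqft": None},
    {"neighborhood": "B", "rent": 3900, "sqft": None},
]

def summarize(rows):
    """Per neighborhood: median rent per sqft and share of listings with a size."""
    by_hood = {}
    for row in rows:
        by_hood.setdefault(row["neighborhood"], []).append(row)
    summary = {}
    for hood, hood_rows in by_hood.items():
        sized = [r for r in hood_rows if r["sqft"]]
        summary[hood] = {
            "coverage": len(sized) / len(hood_rows),
            "median_ppsf": median(r["rent"] / r["sqft"] for r in sized) if sized else None,
        }
    return summary

summary = summarize(listings)
for hood, stats in summary.items():
    print(hood, f"coverage={stats['coverage']:.0%}", f"median $/sqft={stats['median_ppsf']}")
```

A neighborhood with low coverage (like "B" here, where a single listing sets the number) deserves more skepticism than one where most listings report a size.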

So where is the best value?

WARNING: the conclusion below is definitely debatable, and I am hoping to come up with a better way to quantify the value of amenities to make this conclusion less qualitative…

Brooklyn. More specifically, DUMBO/Vinegar Hill. Based upon the data I have seen thus far, you get the most amenities and space for the money. Feel free to play with the dashboard yourself and see if you come to a different conclusion.

One of the best ways to home in on the Price vs. Size tradeoff is with the chart seen below. The bluer a circle is, the lower its price per square foot, and the bigger the circle, the more amenities it has. You can even select a specific apartment and link to the StreetEasy listing!

Wishlist for other metadata (future state)

How good is the view? Beyond text analysis for keywords that would imply there is some sort of view, this gets fairly complicated.

How much counter space/closet space/cupboard space is there? This would likely require that there is at least some sort of floor plan available (which is not always the case).

How far is the nearest grocery store (or other important ? This would involve pulling some data from the Google Maps API.

Who “Won” Tomorrowland 2017?

Some of you may have read my last article, where I analyzed 59 sets from EDC Las Vegas 2017. This time, I put on my big boy pants and analyzed 236 sets from both weekends of Tomorrowland in Belgium. This means I had more than 14,000 tracks played by ~200 DJs over the course of 2 weekends to feed my analysis. Similar to last time, this data was web scraped from 1001tracklists.com, and I have made the code available on my GitHub page if you wish to do something similar yourself. I also created a summary dashboard on Tableau Public if you wish to explore the data in more detail.

*Note: I have added links to some songs along the way for your enjoyment, so please read on!

1. DJ Snake

DJ Snake had the most tracks played at Tomorrowland of any DJ by far (65), which was 23 more than the next highest artist (Axwell /\ Ingrosso). In addition, his songs were played by a wide variety of artists (38), far and away the broadest reach of any artist playing at the festival (the next highest was Calvin Harris with 23). The main drivers behind his popularity were his two biggest hits “Propaganda” and “Let Me Love You”, which were the #2 and #5 most played songs overall.

For those who may be interested in a new spin on these (already overplayed) hits, the most popular remixes were the “Propaganda (Nom de Strip & TJR Remix)” and “Let Me Love You (Don Diablo Remix)”.

2. Ed Sheeran

As amazing as it would be to see Ed Sheeran working the turntables and screaming for the crowd to “put their hands up in the air”, sadly he was not. However, he still ended up having his songs played (in one form or another) by 22 different artists at the festival. This ties him for third alongside Axwell /\ Ingrosso and Valentino Khan. Much like Kendrick Lamar was for EDC, Ed is the most popular artist not appearing in person to have his tracks played (given that they are such similar artists this should come as no surprise *sarcasm*).

“Shape of You” led the way as the most common song played, but if anyone is looking to impress their friends with their knowledge of the Ed Sheeran discography, I would recommend checking out the most popular remix of “Castle on the Hill” by Gareth Emery & Ashley Wallbridge.

3. Hardstyle

While trap music may be slowly taking hold of the Americas, its older cousin hardstyle is alive and well in Europe. By using K-Means clustering from Python's scikit-learn and my limited knowledge of a few hardstyle artists, I was able to decipher which other artists fell into this genre. For me personally, exploring cluster 12 on the Tableau dashboard led to some entertaining artist (“Phuture Noize”) and song (“Destination”) discoveries. Worth noting: hardstyle is very high energy and is definitely not for everyone.

Note: I plan on writing a post that goes into the clustering in more detail, drawing a few more insights from the data and explaining the methodology.
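In the meantime, here is a rough idea of the kind of input a clustering like this needs: each DJ has to become a row of numbers before any algorithm can compare them. The sketch below is not my actual feature set or the K-Means step itself, just a made-up example of turning "who played whom" into a DJ-by-artist matrix and measuring the distance between two rows. All names and data are placeholders.

```python
# Made-up data: which track artists each DJ played.
played = {
    "DJ A": ["Artist 1", "Artist 2", "Artist 2"],
    "DJ B": ["Artist 2", "Artist 3"],
    "DJ C": ["Artist 4"],
}

# One column per track artist, in a fixed (sorted) order.
columns = sorted({artist for artists in played.values() for artist in artists})

# One row per DJ: 1 if the DJ played that artist at least once, else 0.
matrix = {
    dj: [1 if col in artists else 0 for col in columns]
    for dj, artists in played.items()
}

def distance(a, b):
    """Euclidean distance between two DJs' rows."""
    return sum((x - y) ** 2 for x, y in zip(matrix[a], matrix[b])) ** 0.5

print(columns)
for dj, row in matrix.items():
    print(dj, row)
print(distance("DJ A", "DJ B"), distance("DJ A", "DJ C"))
```

DJs with overlapping artists end up closer together than DJs with nothing in common, which is the property a distance-based method such as K-Means relies on to group them.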

4. Heads Will Roll (A-Trak Remix)

While mining this data set for new and exciting remixes I ran across this track, which was tied for 2nd as the most commonly played remix at Tomorrowland 2017 (12 plays across 11 DJs). Personally, I found this incredibly amusing since it was released 8 years ago (2009, if you cannot find your calculator). Therefore, this track is a winner for its popularity and longevity at a festival well known for revealing tracks never heard before.

It is also worth noting that Don Diablo remixes were incredibly popular (as can be seen on the left).

What is Next?

As I alluded to before, I tested out whether I could use Python to cluster various DJs based upon the tracks and artists they played. I plan on providing a more detailed analysis of this output soon.

Spoiler Alert: The clusters are on the Tableau Dashboard already… if you agree/disagree, leave a comment!

A Full Analysis of Every Song Played at EDC 2017

Every year I enjoy the wealth of full live sets that get released on SoundCloud around this time. However, what I have noticed over the years is that a number of the sets sound quite similar, and there are a few songs/artists each year that dominate the airtime across a wide variety of sets.

So I set out to answer the following questions:

  1. Who were the hottest artists at EDC 2017?
  2. What were the hottest songs?
  3. How can I investigate the relationships between all the sets that were played?

Getting the Tracklist Data

Since there is no central repository or database where you can simply download structured tracklist data, I was forced to web scrape the data from 1001tracklists.com using Python. Luckily, the BeautifulSoup module helped simplify the data extraction process for the 59 sets that I scraped. Once I had all the CSVs, I used Pandas to combine them and parse out the important fields, such as: track artist, simple track title (without any of the remix information), all the featured artists, and the set it was played in. Parts of this code have been made publicly available on my GitHub page.
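To give a feel for the parsing step, here is a minimal sketch of splitting one raw tracklist entry into those fields. It is not the code from my GitHub page. It assumes entries look like "Artist ft. Guest - Title (Someone Remix)", which will not hold for every track, and the function name and the keys other than track_basic are made up for this example.

```python
import re

def parse_track(raw):
    """Split 'Artist ft. Guest - Title (Someone Remix)' into its parts."""
    artists, _, title = raw.partition(" - ")
    # Separate the lead artist from featured artists ("ft.", "feat.").
    lead, *featured = re.split(r"\s+(?:ft\.?|feat\.?)\s+", artists, flags=re.IGNORECASE)
    # Treat a trailing parenthetical such as "(Don Diablo Remix)" as remix info.
    match = re.search(r"\s*\(([^()]*)\)\s*$", title)
    remix = match.group(1) if match else None
    basic_title = title[: match.start()] if match else title
    return {
        "track_artist": lead.strip(),
        "featured": [name.strip() for part in featured for name in part.split("&")],
        "track_basic": basic_title.strip(),
        "remix": remix,
    }

parsed = parse_track("DJ Snake ft. Justin Bieber - Let Me Love You (Don Diablo Remix)")
print(parsed)
```

Keeping the remix-free title in its own field is what makes it possible to count every version of a song together and still drill down into individual remixes later.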

Initial Insights

Link to interactive dashboard on Tableau Public: HERE

I was surprised to see that Boombox Cartel got the most plays of any artist at EDC (28 plays across 11 sets, with the most popular song being “Jefe”).

Additionally, I found it amusing that the second most played artist at EDC was not even there… Kendrick Lamar got a whopping 26 plays across 15 separate sets.
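Numbers like "26 plays across 15 sets" come from two separate counts: total plays per artist, and the number of distinct sets those plays appeared in. A small sketch of that tally, using made-up rows rather than the real scraped data:

```python
from collections import Counter, defaultdict

# Made-up rows of (set name, track artist); real rows would come from the scrape.
plays = [
    ("Set 1", "Artist A"), ("Set 1", "Artist A"), ("Set 1", "Artist B"),
    ("Set 2", "Artist A"), ("Set 2", "Artist C"), ("Set 3", "Artist B"),
]

# Total number of plays per artist.
play_counts = Counter(artist for _, artist in plays)

# Distinct sets each artist was played in.
sets_by_artist = defaultdict(set)
for set_name, artist in plays:
    sets_by_artist[artist].add(set_name)

for artist, count in play_counts.most_common():
    print(f"{artist}: {count} plays across {len(sets_by_artist[artist])} sets")
```

Looking at both counts together separates an artist one DJ played over and over from an artist many different DJs reached for.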

When the track_basic field is expanded within the “Most Played Songs” chart, you can find the most commonly played remixes. I personally have found this a gold mine for new twists on old classics, especially for some of the songs which were beginning to feel a bit overplayed as originals (I’m looking at you, Propaganda).

Going One Deeper

Although the summary values were interesting, I wanted to explore the complex connections between all the sets, songs, and artists a bit more. To do this, I used Python NetworkX, which allowed me to use the power of graph analysis to explore the complex relationships between all the entities.

(Above) Details from the Noisecontrollers set

Nodes:

  • DJ (red icon): The artist playing the set at EDC (can also create songs)
  • Track Artist (blue icon): An artist not at EDC that contributed in some way to creating the song
  • Song (gray icon): The track that was played in the set at EDC

Connections (aka Edges):

  • Played (red edge): When a DJ plays a track they did not create in their set
  • Created/Played (purple edge): When a DJ plays a track they created in their own set
  • Created (blue edge): When an Artist contributes to the creation of a song that was played during an EDC set
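The node and edge types above can be sketched in a few lines. To keep the example runnable anywhere, it uses plain dictionaries instead of NetworkX, so it is a stand-in for the idea rather than the code behind the dashboard. All names are hypothetical.

```python
from collections import defaultdict

nodes = {}                 # name -> node type ("dj", "track_artist", "song")
edges = defaultdict(dict)  # name -> {neighbor: edge type}

def add_edge(a, b, kind):
    """Store an undirected, labeled connection between two nodes."""
    edges[a][b] = kind
    edges[b][a] = kind

# Hypothetical example: one DJ playing their own track and someone else's.
nodes.update({"DJ X": "dj", "Artist Y": "track_artist",
              "Song 1": "song", "Song 2": "song"})
add_edge("DJ X", "Song 1", "created/played")
add_edge("DJ X", "Song 2", "played")
add_edge("Artist Y", "Song 2", "created")

def djs_who_played(artist):
    """DJs who played any song the given artist helped create."""
    songs = [s for s, kind in edges[artist].items()
             if kind in ("created", "created/played")]
    return sorted({n for s in songs for n, kind in edges[s].items()
                   if kind in ("played", "created/played")})

print(djs_who_played("Artist Y"))
```

Searching for an artist is then just a two-step walk: from the artist to their songs, and from those songs to whoever played them.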

Link to the interactive dashboard on Tableau Public: HERE

This type of view allows me to search “Kendrick Lamar” and see all the DJs who played his songs, which songs they played, and any other artists that may have collaborated with Kendrick on the various songs. (See example below.)

I could also search the song “M.A.A.D City”, to see which sets it was played during, and which artists helped create it.

The views become increasingly complex if you search an artist (like Flosstradamus) who played a full set, but also had their songs played in many other sets (see below).

This is where some of the summary values at the top of the dashboard help. We can see Flosstradamus songs were involved in one way or another in 18 sets, and that there were 50 artists that either collaborated on Flosstradamus songs, or created songs that Flosstradamus played during their set. I find that this is illustrative of the fact that each of the sets played does not happen in a bubble, and that it really involves many members of the community to make it happen (everyone from Darude to Pitbull).

This network view is quite dynamic and there are many more interesting nuggets that are still undiscovered… check it out and let me know what you find!

Where is the South?

I recently completed the “Data Visualization with D3.js” segment of my Udacity Data Analyst Nanodegree. Of the modules I have done so far, this was my favorite, and I was able to produce a very cool D3 viz from scratch.

I am from North Carolina originally, but attended college in upstate New York, and currently work in New York City. Therefore, this article on FiveThirtyEight especially resonated with me, as it is a conversation I have often had (in jest) with friends. There is more debate about which states are actually in the South than one would think, and I wanted to create my own take on the subject.

Check out the viz here: Where is the South?

A link to the full Gist can be found here.
