Scrapes and parses play-by-play (PBP) data for NCAA basketball games.




the game_id for the specified game from


A dataframe of cleaned, parsed play-by-play data for the specified game_id.

Check which team has possession on each play. This is effectively the instantaneous "who has possession" at the moment a given stat is recorded. For example, if team A makes a shot, that shot is recorded as the possession of team A, not team B, who has possession immediately after the made shot. However, if team B forces a steal, team B is recorded as the possessing team.

Fix a very annoying edge case where a team wins a jump ball and then immediately loses it on a turnover.

When we can't guess who has possession, we fill in the gaps based on who had possession before and after the play.

These are deliberately commented out for a few reasons:

  1. As stringer data, these designators are somewhat noisy

  2. As Seth Partnow pointed out in The Midrange Theory, these designators can be biased

  3. These designators are only available in V2 of the PBP, not V1.

Group by period, because some lineups change between periods without being noted in the PBP.

Whenever a player is subbed in, they have a 1 in that row and a 0 in the row before. Whenever a player is subbed out, they have a 0 in that row and a 1 in the row before. We then cascade the 1s and 0s up and down to create a map of who is in the game at any given time.

Now we map player names to the roster df. This could be noisy in theory, but I haven't seen any issues in practice.

Some character encoding cleanup, dropping players with misspelled names in the PBP.

For V1, these columns are not recorded, so they are set to NA so they don't register as false negatives.
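
The substitution cascade described above can be illustrated with a small sketch. This is not the scraper's actual code: it is plain Python with hypothetical names (`fill_on_court`, `on_court_by_period`) and an assumed input format where each row for a single player is marked "in", "out", or None. It seeds the 1/0 markers at each substitution and the row before it, then fills forward and backward within each period. In this sketch a player with no substitution events in a period is left as None (unknown), which is an assumption about how such cases might be treated.

```python
from typing import Dict, Hashable, List, Optional


def fill_on_court(events: List[Optional[str]]) -> List[Optional[int]]:
    """Cascade on-court flags for one player within one period.

    events holds "in", "out", or None for each play-by-play row
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(events)
    state: List[Optional[int]] = [None] * n

    # Seed markers: the sub row itself, plus the opposite value one row earlier.
    for i, ev in enumerate(events):
        if ev == "in":
            state[i] = 1
            if i > 0 and state[i - 1] is None:
                state[i - 1] = 0
        elif ev == "out":
            state[i] = 0
            if i > 0 and state[i - 1] is None:
                state[i - 1] = 1

    # Cascade down: carry the last known value forward.
    last: Optional[int] = None
    for i in range(n):
        if state[i] is None:
            state[i] = last
        else:
            last = state[i]

    # Cascade up: fill rows before the first known value.
    nxt: Optional[int] = None
    for i in range(n - 1, -1, -1):
        if state[i] is None:
            state[i] = nxt
        else:
            nxt = state[i]

    return state


def on_court_by_period(
    periods: List[Hashable], events: List[Optional[str]]
) -> List[Optional[int]]:
    """Apply fill_on_court separately to each period's rows."""
    groups: Dict[Hashable, List[int]] = {}
    for idx, period in enumerate(periods):
        groups.setdefault(period, []).append(idx)

    result: List[Optional[int]] = [None] * len(events)
    for idxs in groups.values():
        filled = fill_on_court([events[i] for i in idxs])
        for i, value in zip(idxs, filled):
            result[i] = value
    return result
```

Filling per period keeps a flag from the end of one period from leaking into the start of the next, which matters because lineups can change between periods without a recorded substitution.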